MapReduce and Spark

MapReduce concept

MapReduce is one of the main components of the Hadoop ecosystem. It is a framework for accessing and processing big data stored in the Hadoop Distributed File System (HDFS): it processes data spread across a distributed computing system and gathers the results into a single output (Talend, 2021). It can handle structured, semi-structured, and unstructured data, and it runs on multiple machines simultaneously as a cluster, which makes it well suited to big data. MapReduce consists of two parts, map and reduce.

Hadoop MapReduce architecture

A MapReduce program has two phases: Map and Reduce. Data is split and mapped in Map tasks, then shuffled and reduced in Reduce tasks (Taylor, 2021).

A map function processes a key/value pair to generate a set of intermediate key/value pairs, while a reduce function merges all intermediate values associated with the same intermediate key (Dean & Ghemawat, 2008).
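The two functions can be illustrated with a minimal, single-machine sketch in plain Python. It is not Hadoop code: the names `map_fn`, `reduce_fn`, and `run`, and the sample temperature readings, are assumptions made up for this illustration.

```python
from collections import defaultdict

def map_fn(_key, line):
    """Map: turn one input record into intermediate (key, value) pairs."""
    city, temp = line.split(",")
    yield city.strip(), float(temp)

def reduce_fn(key, values):
    """Reduce: merge all values that share the same intermediate key."""
    return key, max(values)

def run(records):
    """Tiny local driver (hypothetical): group map output by key, then reduce."""
    groups = defaultdict(list)
    for offset, line in enumerate(records):
        for key, value in map_fn(offset, line):
            groups[key].append(value)
    return dict(reduce_fn(key, values) for key, values in groups.items())

# Made-up sample input: one "city, temperature" reading per record.
readings = ["Riyadh, 41.5", "Jeddah, 36.0", "Riyadh, 43.0", "Jeddah, 38.5"]
result = run(readings)
print(result)
```

Here the map function emits one (city, temperature) pair per record, and the reduce function keeps the highest temperature for each city.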

MapReduce in the Hadoop ecosystem works in four steps: split, map, shuffle, and reduce. The first two steps (split and map) belong to the Map phase; the Reduce phase then shuffles the data and reduces it. Figure 1 shows these steps, which are implemented through two files, Map and Reduce.

Figure 1: MapReduce and Hadoop process (Taylor, 2021).

The chart shows the input data, the processing steps applied to it, and the final output. After MapReduce reads the input from HDFS, the map file splits it into fixed-size pieces called input splits (Taylor, 2021). The second step maps each split by producing (key, value) pairs. Next, the reduce file takes the mapping output and shuffles it so that pairs with the same key are grouped together. Lastly, the reducer aggregates each group into a single result per key. This operation reduces the data size efficiently.
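The four steps can be traced with a small word-count sketch in plain Python. This is a single-process illustration, not Hadoop code: the function names are hypothetical, and the split size is counted in lines here only to keep the example short.

```python
from collections import defaultdict

def split_input(lines, split_size):
    """Split: cut the input into fixed-size input splits."""
    return [lines[i:i + split_size] for i in range(0, len(lines), split_size)]

def map_split(split):
    """Map: emit a (word, 1) pair for every word in one split."""
    return [(word.lower(), 1) for line in split for word in line.split()]

def shuffle(mapped_splits):
    """Shuffle: bring together the values of identical keys from all map outputs."""
    grouped = defaultdict(list)
    for pairs in mapped_splits:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped

def reduce_groups(grouped):
    """Reduce: aggregate each key's values into one result."""
    return {key: sum(values) for key, values in grouped.items()}

# Made-up input: three short lines of text.
lines = ["big data big cluster", "data node", "big node"]
splits = split_input(lines, split_size=2)
mapped = [map_split(s) for s in splits]
counts = reduce_groups(shuffle(mapped))
print(counts)
```

Each split is mapped independently, which is what allows a real cluster to run the map tasks on different machines before the shuffle brings equal keys together.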

Spark RDD concept

Spark RDD is an essential component of the Apache Spark system. RDD stands for “Resilient Distributed Dataset”; it is an immutable collection of objects that is computed on the various nodes of the cluster (Team, 2019). Apache Spark uses RDDs to perform MapReduce operations quickly and efficiently (Gaur, 2020).

To understand the concept of Spark RDD, we need to understand what each word in the name means.

Resilient: if something goes wrong with a Spark RDD, it can reconstruct itself (Mehrotra & Grade, 2019).
Distributed: like MapReduce, Spark works with clusters, so the data is distributed across multiple nodes.
Dataset: a collection of data. The data can come in many formats, such as JSON or CSV, to name a few.

According to Mehrotra and Grade (2019), every program in the Spark ecosystem is compiled to RDDs before it executes.
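The three words can be made concrete with a toy model in plain Python. `MiniRDD` is a hypothetical class written for this illustration, not Spark's API. It assumes that resilience comes from remembering how each partition was derived, so a lost partition can be recomputed from its parent.

```python
class MiniRDD:
    """Toy stand-in for an RDD (hypothetical, not the Spark API)."""

    def __init__(self, num_partitions, compute):
        self.num_partitions = num_partitions
        self._compute = compute  # recipe for building partition i
        self._cache = {}         # partitions that are currently materialized

    @classmethod
    def from_data(cls, data, num_partitions):
        # Distributed: cut the data into contiguous partitions.
        size = -(-len(data) // num_partitions)  # ceiling division
        chunks = tuple(tuple(data[i * size:(i + 1) * size])
                       for i in range(num_partitions))
        return cls(num_partitions, lambda i: chunks[i])

    def map(self, fn):
        # Immutable: a transformation returns a new dataset, the parent is untouched.
        parent = self
        return MiniRDD(self.num_partitions,
                       lambda i: tuple(fn(x) for x in parent.partition(i)))

    def partition(self, i):
        # Resilient: a missing partition is rebuilt from its recipe.
        if i not in self._cache:
            self._cache[i] = self._compute(i)
        return self._cache[i]

    def lose_partition(self, i):
        # Simulate a node failure by dropping one materialized partition.
        self._cache.pop(i, None)

    def collect(self):
        return [x for i in range(self.num_partitions) for x in self.partition(i)]

numbers = MiniRDD.from_data([1, 2, 3, 4, 5, 6], num_partitions=3)
squares = numbers.map(lambda x: x * x)
before = squares.collect()
squares.lose_partition(1)
after = squares.collect()
print(before == after)
```

After the simulated failure, the second `collect()` returns the same values because the dropped partition is recomputed from `numbers`, which itself never changes.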

Spark Modules

The Spark ecosystem consists of multiple modules that work together to carry out Spark jobs, and each module has a specific role. The following paragraphs present four Spark modules, Spark SQL, Spark Streaming, Spark MLlib, and Spark GraphX, with a brief explanation of each.

Spark SQL

Spark SQL is used to analyze, query, and operate on structured data stored in HDFS (Mehrotra & Grade, 2019). In addition, it provides a common way to access several data sources, such as Hive, Avro, Parquet, ORC, JSON, and JDBC, and it can even join data across these sources (Apache Spark, n.d.). It is a vital tool for organizations that use the Hadoop ecosystem because, according to Mehrotra and Grade (2019), most of the data in an organization is structured.

Spark Streaming

Spark Streaming is considered one of the most powerful Apache Spark tools. It is a fault-tolerant, scalable stream processing system that can handle both batch and streaming workloads (Databricks, 2021). Its primary role is to process data in real time. In the Apache Spark Quick Start Guide, Mehrotra and Grade (2019) say that Spark can analyze data in real time using Spark Streaming, and that the analysis can include data analysis, machine learning, and graph processing.
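One way to see how the same engine can serve both batch and streaming work is the micro-batch idea: the stream is cut into small batches, and ordinary batch logic runs on each one. The sketch below is plain Python rather than the Spark Streaming API. The helper names and the sample events are assumptions, and batches are cut by event count here, whereas a real streaming system would typically cut them by time interval.

```python
from collections import Counter
from itertools import islice

def micro_batches(events, batch_size):
    """Group an (in principle unbounded) event stream into small batches."""
    iterator = iter(events)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

def count_events(batch):
    """Ordinary batch logic, reused unchanged for every micro-batch."""
    return Counter(batch)

# Made-up event stream.
stream = ["login", "click", "click", "purchase", "click", "login", "logout"]

running_totals = Counter()
snapshots = []
for batch in micro_batches(stream, batch_size=3):
    running_totals.update(count_events(batch))
    snapshots.append(dict(running_totals))  # result available after each batch

print(snapshots[-1])
```

Because a fresh snapshot is produced after every small batch, results are available while the stream is still arriving rather than only at the end.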

Spark MLlib

As mentioned in the previous paragraph, Apache Spark can analyze data using machine learning. Spark MLlib is a built-in library in the Spark system, and programmers who use Apache Spark can use it alongside any other module to perform machine learning on their datasets. Furthermore, MLlib has the functions needed to carry out a variety of statistical analyses, including correlations, sampling, and hypothesis testing, to name a few (Mehrotra & Grade, 2019).

Spark GraphX

GraphX is a Spark-based graph analytics framework. It was created to replace specialized graph processing frameworks with a general-purpose distributed dataflow framework. It is fault-tolerant and makes use of in-memory computing (Mehrotra & Grade, 2019). Many organizations, for example social media platforms, banking systems, and healthcare providers, can use Spark GraphX to achieve their business goals. A GraphX graph consists of nodes and edges: each node represents a piece of data, and the edges represent the relationships between them.
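The node-and-edge model can be sketched in plain Python. This is not the GraphX API: the vertex table, the edge list, and the `in_degrees` helper are made up for illustration, with each edge stored as (source id, destination id, relationship).

```python
from collections import defaultdict

# Nodes: an id mapped to the data it represents (made-up users).
vertices = {1: "Amal", 2: "Badr", 3: "Chen", 4: "Dana"}

# Edges: (source id, destination id, relationship).
edges = [
    (1, 2, "follows"),
    (3, 2, "follows"),
    (2, 4, "follows"),
    (1, 4, "follows"),
]

def in_degrees(edge_list):
    """Count the incoming edges of every node that has at least one."""
    counts = defaultdict(int)
    for _src, dst, _relationship in edge_list:
        counts[dst] += 1
    return dict(counts)

follower_counts = {vertices[v]: n for v, n in in_degrees(edges).items()}
print(follower_counts)
```

In a social media setting, the in-degree of a node is simply that user's number of followers.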

Two real-world projects that use MapReduce

Many real-world applications use MapReduce to process their big data. In the book “Learning Big Data with Amazon Elastic MapReduce: Easily learn, build, and execute real-world Big Data solutions using Hadoop and AWS EMR” (2014), Singh and Rayapati show examples in which MapReduce was used to process huge amounts of data for several organizations. This part introduces two of them.

The first real-world project that uses MapReduce is the social media platform. Social media websites hold a massive amount of data produced by their users. One way social media companies can use MapReduce is to analyze the relationships between users. For example, MapReduce can find the common followers of two or more users on Facebook or Twitter (Singh & Rayapati, 2014).
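The common-followers problem can be phrased as a map step and a reduce step. The following is one possible formulation in plain Python, not necessarily the one used in the book or by any platform. It assumes the input is a made-up table that maps each follower to the users they follow.

```python
from collections import defaultdict
from itertools import combinations

def map_fn(follower, followed_users):
    """Map: for every pair of users this person follows, emit (pair, follower)."""
    for pair in combinations(sorted(followed_users), 2):
        yield pair, follower

def reduce_fn(pair, followers):
    """Reduce: the followers collected for a pair are their common followers."""
    return pair, sorted(followers)

def common_followers(follow_lists):
    """Tiny local driver (hypothetical): shuffle by pair, then reduce."""
    grouped = defaultdict(list)
    for follower, followed in follow_lists.items():
        for key, value in map_fn(follower, followed):
            grouped[key].append(value)
    return dict(reduce_fn(key, values) for key, values in grouped.items())

# Made-up input: follower -> users they follow.
follows = {
    "dana": ["amal", "badr"],
    "eman": ["amal", "badr", "chen"],
    "fahd": ["badr", "chen"],
}
result = common_followers(follows)
print(result[("amal", "badr")])
```

Sorting each pair before emitting it ensures that ("amal", "badr") and ("badr", "amal") land on the same key, so every follower of both users reaches the same reducer.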

The second real-world example, also from Singh and Rayapati’s book (2014), is the use of MapReduce on e-commerce websites. Amazon, for example, has a vast number of products and users. MapReduce can help Amazon show recommended items to users based on their searches, recently reviewed items, or purchase history. A recommendation system lets e-commerce companies surface products that users might otherwise not find because of the large number of items in the inventory.

Two real-world projects that use Spark

Likewise, many organizations use Spark to process their data. The article “Top 5 Apache Spark Use Cases” on the ProjectPro website (2021) describes many real-world cases in which companies use Spark to run their business. The following paragraphs present two of them.

The first case is the gaming sector. Gaming companies have a huge amount of data: they deal with different kinds of players and patterns, and they sometimes need to react to in-game events in real time. Spark can help the gaming industry by analyzing data with Spark Streaming to perform tasks such as automatic level adjustment based on player skill and player retention, to name a few (projectpro, 2021). One gaming company that uses Spark is Tencent, which has the largest number of users in mobile gaming (projectpro, 2021). Tencent uses Spark’s in-memory computing features to process data, which improves real-time performance. It also uses Spark to analyze chats between players to detect abusive language (projectpro, 2021).

The second case is the information services field. Databricks is one of the best-known platforms in the data field. It uses Spark to run machine learning applications on Amazon AWS and Microsoft Azure (projectpro, 2021). Furthermore, it has expanded its business by creating open-source projects such as Delta Lake and MLflow (projectpro, 2021).

Author: Zaid Altukhi

References

Apache Spark. (n.d.). Spark SQL & DataFrames | Apache Spark. Retrieved October 13, 2021, from https://spark.apache.org/sql/
Databricks. (2021, March 13). What is Spark Streaming? https://databricks.com/glossary/what-is-spark-streaming
Dean, J., & Ghemawat, S. (2008). MapReduce. Communications of the ACM, 51(1), 107–113. https://doi.org/10.1145/1327452.1327492
Gaur, C. (2020, April 23). A Complete Guide to RDD in Apache Spark. Xenonstack. https://www.xenonstack.com/blog/rdd-in-spark/
Mehrotra, S., & Grade, A. (2019). Apache Spark Quick Start Guide: Quickly learn the art of writing efficient big data applications with Apache Spark. Packt Publishing.
projectpro. (2021, October 12). Top 5 Apache Spark Use Cases. https://www.projectpro.io/article/top-5-apache-spark-use-cases/271#mcetoc_1fb9oit5m0
Singh, A., & Rayapati, V. (2014). Learning Big Data with Amazon Elastic MapReduce. Packt Publishing.
Talend. (2021, January 6). MapReduce 101: What It Is & How to Get Started – Talend. Talend – A Leader in Data Integration & Data Integrity. https://www.talend.com/resources/what-is-mapreduce/
Taylor, D. (2021, October 6). What is MapReduce in Hadoop? Architecture | Example. Guru99. https://www.guru99.com/introduction-to-mapreduce.html
Team, D. (2019, May 7). Spark RDD – Introduction, Features & Operations of RDD. DataFlair. https://data-flair.training/blogs/spark-rdd-tutorial/
