What is the Hadoop ecosystem? What are the Hadoop ecosystem components?

Introduction

Nowadays, there is a growing need for systems that can handle massive amounts of data. Relational Database Management Systems (RDBMS) are designed to process and store relatively small amounts of data, and NoSQL databases emerged to solve the problem of storing and processing very large volumes of data, known as Big Data. In addition, RDBMS are limited to working with structured data only, whereas NoSQL can deal with all kinds of data (structured, semi-structured, and unstructured). However, NoSQL databases on their own still face difficulties in running as a complete system, and Hadoop solves these issues by providing one ecosystem that can store, manage, and process big data. Hadoop is open-source software built to work with big data. This paper explains the difference between SQL and NoSQL databases, gives an overview of Hadoop, and then briefly discusses the Hadoop ecosystem components shown in Figure 1.

SQL vs. NoSQL Databases

There are some main contrasts between SQL databases and NoSQL databases. These differences give a clear idea of which database is the proper one to use when an organization wants to set up a database, based on the data type, the data size, and the purpose of the database. There are three main differences, as follows:

  • The first difference is the data type. SQL databases are limited to a single kind of data: they work only with relational databases, which depend on tables (entities) and the relationships between them, so they handle structured data. NoSQL databases, by contrast, can contain all kinds of data (structured, semi-structured, and unstructured).
  • The second difference is the way the databases scale. SQL databases scale vertically, which means that if an organization wants to expand its database capacity, it must add more CPUs, RAM, or storage to the same machine. Vertical scaling is expensive and is considered one of the main SQL challenges. NoSQL databases, however, scale horizontally, which allows more nodes to be added to the database easily and is not as expensive as vertical scaling.
  • The third difference is the size of data that can be stored and processed. SQL databases work perfectly well with small data sizes, but when an organization has big data, a NoSQL database is the more suitable choice.

 

Hadoop overview

Hadoop is an open-source software framework that can store and process big data effectively, and it is faster and less expensive than traditional systems (Uzunkaya et al., 2015). In addition, Hadoop was designed for big data that needs to be distributed across cluster hardware. Distributing the database allows the data to be spread over the nodes in advance, which provides high accessibility and availability. The nodes in Hadoop follow a shared-nothing architecture, which means communication between nodes is kept to a minimum. Hadoop can store and manage a massive amount of data using the Hadoop ecosystem.


Figure 1: Hadoop Eco-system concept chart

This Hadoop ecosystem concept chart was built based on several of the sources listed in the reference section; together, they help to create a clear idea of the Hadoop ecosystem architecture.

What is the Hadoop Eco-system?

The Hadoop ecosystem is a system that offers different open-source services to solve big data problems. These components work together to manage and process big data through the Hadoop framework. The Hadoop Distributed File System (HDFS), YARN, MapReduce, Spark, PIG, HIVE, and HBase are examples of Hadoop ecosystem components.

The Hadoop ecosystem can benefit many sectors that have big data, a high volume of real-time streaming data, or a need to process big data. For instance, healthcare is one of the sectors that most needs to work with different kinds of data, such as X-ray images and patient records. Machine learning sometimes needs to be applied to these data, and if they are stored in a relational database, the analysis becomes difficult.

Another example is how the financial sector can use the Hadoop ecosystem to run its business more efficiently. Because the banking sector has complicated systems with crucial big data, Hadoop components can work efficiently to achieve the sector’s goals. An example of a financial organization that has benefited from a big data system is JPMorgan Chase, a huge financial firm that operates in more than 100 countries (ProjectPro, 2020).

The ecosystem is divided into five main parts, and each one has tools that perform a specific task as follows:

  1. Data Storage: HDFS (Hadoop Distributed File System).
  2. Resource Management: YARN (Yet Another Resource Negotiator).
  3. Data Management: Flume and ZooKeeper.
  4. Data Processing: MapReduce and Apache Spark.
  5. Data Access: PIG, HIVE, and HBase.

Each component plays a specific role within its part of the Hadoop ecosystem. The following sections explain these parts in detail.

Data Storage:

The tool responsible for storing big data is Hadoop's own file system, HDFS. The Hadoop Distributed File System is the central part of the Hadoop ecosystem; it makes sure the data is stored properly across the distributed system and remains available to retrieve and process in real time if needed. HDFS makes multiple copies of the data and distributes them across the cluster (ProjectPro, 2021). The Hadoop Distributed File System has two main components:

  1. NameNode: this node contains the metadata, known as data about data (GeeksforGeeks, 2021b). The NameNode is the master node and is responsible for keeping track of the storage cluster.
  2. DataNode: operates as a worker node in the Hadoop cluster and stores the actual data blocks (ProjectPro, 2021).
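
To make the NameNode/DataNode split concrete, here is a minimal Java sketch that writes a small file to HDFS through the standard org.apache.hadoop.fs API and prints its replication factor. The NameNode address, path, and file contents are illustrative assumptions, not values taken from this paper.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; replace it with your cluster's fs.defaultFS value.
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");

        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/user/demo/hello.txt");
            // The NameNode records the file's metadata; DataNodes store the replicated blocks.
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.writeUTF("Hello HDFS");
            }
            System.out.println("Replication factor: " + fs.getFileStatus(file).getReplication());
        }
    }
}
```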

Resource Management

Because Hadoop works in a distributed computing environment, the resources in the cluster must be managed. YARN's main role is to manage the resources, nodes, and applications.

Goel (2020), in an article on Intellipaat.com, explains that there are three main components in YARN that manage Hadoop's resources:

Resource Manager:

It is the major component responsible for managing all applications and distributing the resources to each application as needed. This is done by scheduling the resources (Goel, 2020). The Resource Manager consists of two components:

    1. Scheduler: this component's job is to allocate resources, such as CPUs and memory, to the applications at a specific time according to the scheduling policy.
    2. Application Manager: this part manages the running applications in the cluster.
Node Manager:

This part has five responsibilities to make sure that the management operations work well:
    1. Report the resource usage to the Resource Manager.
    2. Track the node status while YARN is running.
    3. Manage the workflow between the users and specific nodes and handle those nodes.
    4. Update the data in Resource Manager.
    5. Obey the Resource Manager if it gives an order to kill a container.

Application Master:

An application in YARN refers to any job submitted to the system. The Application Master has four responsibilities that make sure applications run well:

  1.  Execute the application and handle its errors.
  2.  Obtain the required resources from the Resource Manager.
  3.  Monitor and execute the tasks of the application's components.
  4.  Regularly check the status of the application with the Resource Manager and update the resource demand records.

YARN has many features that make the Hadoop ecosystem valuable for working with big data in distributed computing. The first benefit is that YARN supports different kinds of processing methods, and the second is that YARN can process streaming data and work with MapReduce to run data queries.

YARN manages resources across the distributed cluster, which allows data to move smoothly between the different components and ensures that errors are handled properly.
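
As a rough illustration of how a client can talk to the Resource Manager, the sketch below uses the YarnClient API to list the running Node Managers together with their capacity and current usage. It assumes a reachable cluster whose yarn-site.xml is on the classpath; it is a minimal example, not part of any specific application.

```java
import java.util.List;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnNodesExample {
    public static void main(String[] args) throws Exception {
        // Assumes yarn-site.xml (with the Resource Manager address) is on the classpath.
        YarnConfiguration conf = new YarnConfiguration();
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(conf);
        yarnClient.start();

        // Ask the Resource Manager for all Node Managers that are currently running.
        List<NodeReport> nodes = yarnClient.getNodeReports(NodeState.RUNNING);
        for (NodeReport node : nodes) {
            System.out.println(node.getNodeId()
                    + " capacity=" + node.getCapability()
                    + " used=" + node.getUsed());
        }
        yarnClient.stop();
    }
}
```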

Data Management:

There are many different tools or software that take care of managing data in the Hadoop ecosystem. However, in this paper, two data management tools will be discussed.

  • Flume: According to the Apache Flume™ website, “Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data” (n.d.). Flume’s main component is the agent, and the agent holds three sub-components: the source, the channel, and the sink. Taylor, on guru99.com, explains how Flume works as follows:
    1. Source: receives events that come from an external source, stores them, and then sends these events to the channel.
    2. Channel: receives the events from the source and stores them until the sink takes them.
    3. Sink: the last component in the agent. It is responsible for removing events from the channel and sending them to external storage such as HDFS.

Flume has three crucial features, mentioned in the same article written by Taylor (2021a); a minimal client sketch follows the list below:

  1. It offers several levels of data reliability (Taylor, 2021a). Organizations that use Hadoop systems usually need high reliability in their work, and Flume is a proper tool for providing this kind of dependability.
  2. It has two ways to move the gathered data from source to sink: scheduled or event-driven (Taylor, 2021a).
  3. It can use HDFS and HBase as storage destinations in the sink component (Taylor, 2021a).
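
As a hedged sketch of the source-channel-sink flow from the client side, the Java program below uses Flume's RPC client to send one event to an agent's Avro source; the agent would then pass the event through its channel to a sink such as HDFS. The host name and port are assumptions and must match an actual agent configuration.

```java
import java.nio.charset.StandardCharsets;
import org.apache.flume.Event;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

public class FlumeClientExample {
    public static void main(String[] args) throws Exception {
        // Assumed host and port of an Avro source defined in the agent's configuration.
        RpcClient client = RpcClientFactory.getDefaultInstance("flume-agent.example.com", 41414);
        try {
            // The event travels from the agent's source, through its channel, to its sink.
            Event event = EventBuilder.withBody("sample log line", StandardCharsets.UTF_8);
            client.append(event);
        } finally {
            client.close();
        }
    }
}
```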
  • ZooKeeper: Apache ZooKeeper is an open-source server that reliably coordinates and manages large cluster processes (Apache ZooKeeper™, n.d.). Doddamani (2020) explains the four ZooKeeper components:
    1. Client: used to access data from the server.
    2. Server: provides the services requested by the client.
    3. Leader server: the server that performs automatic recovery if any connected server fails.
    4. Follower server: follows the leader server's instructions.

In addition, ZooKeeper has two main features listed on the Data Flair website (2019):

  1. Fast: ZooKeeper reads data quickly, especially in workloads where writes are less common than reads (Data Flair, 2019).
  2. Ordered: it keeps a record of all transactions (Data Flair, 2019).

ZooKeeper is an essential component in some Hadoop systems. It coordinates the resources in the Hadoop system so that data processing is carried out effectively across the Hadoop ecosystem components. For example, passing data from HDFS to HBase needs ZooKeeper to arrange these operations.
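
The following minimal Java sketch shows the kind of coordination primitive ZooKeeper provides: a client connects to the ensemble, creates a small configuration znode, and reads it back. The connection string, znode path, and stored value are assumptions used only for illustration.

```java
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZooKeeperExample {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        // Assumed ZooKeeper ensemble address; point this at your own cluster.
        ZooKeeper zk = new ZooKeeper("zk1.example.com:2181", 10000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown(); // the session is ready, so requests can be issued
            }
        });
        connected.await();

        // Create a znode that other services in the cluster can read to coordinate work.
        String path = "/demo-config";
        if (zk.exists(path, false) == null) {
            zk.create(path, "replication=3".getBytes(), ZooDefs.Ids.OPEN_ACL_UNSAFE,
                    CreateMode.PERSISTENT);
        }
        System.out.println("Stored value: " + new String(zk.getData(path, false, null)));
        zk.close();
    }
}
```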

Data Processing:

The Hadoop ecosystem processes data through different tools. This article focuses on two of them, MapReduce and Apache Spark, and their components.

  • MapReduce: software used to write applications that process big data stored in HDFS. It is the core data processing component of the Hadoop ecosystem (Data Flair, 2019). MapReduce is divided into two main phases (a minimal mapper and reducer sketch appears after the feature list below):
    1. Map: splits the input data into sets of intermediate key-value pairs.
    2. Reduce: takes the map function's output and combines the values that share the same key.

According to the Data Flair website (2019), MapReduce has four features:

  1. Simplicity: MapReduce jobs can be written in many well-known programming languages, such as Java, Python, and C++, and the applications are straightforward to run.
  2. Scalability: MapReduce can handle a vast amount of data.
  3. Speed: MapReduce can process in hours data that would usually take days.
  4. Fault Tolerance: it has a high ability to deal with failures.
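
The classic word-count job illustrates the Map and Reduce phases described above: the mapper emits a (word, 1) pair for every word it reads, and the reducer sums the counts for each key. This is a minimal sketch of the two phases only; the driver class that configures and submits the job is omitted.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: split each input line into words and emit a (word, 1) pair per word.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reduce phase: sum the counts emitted by the mappers for each word (key).
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
```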
  • Apache Spark: it offers an optimized engine that executes general computation graphs and provides high-level tools for processing different kinds of workloads, such as structured data, streaming, and machine learning (Data Flair, 2018b). There are six components under Apache Spark, and each component performs a specific task. The Data Flair website (2018b) lists them as follows (a small Spark SQL sketch appears after this list):
    1. Spark Core: provides in-memory computation and serves as the foundation for the other Spark components.
    2. Spark SQL: used for processing structured and semi-structured data and can provide high optimization.
    3. Spark Streaming: it is an extra tool added to Spark API to process the streaming data effectively.
    4. Spark MLlib: a machine learning library that can run high-quality ML algorithms at high speed. Big data can include, for example, semi-structured data that cannot be analyzed using traditional databases, and the MLlib library is a useful tool for performing this kind of analysis.
    5. Spark GraphX: an engine to store and analyze graph data models of networks. The graph data model is used in different sectors; for example, social media platforms can use this tool to store and analyze users' data without complicated steps.
    6. SparkR: this component works with data frames using R. Most structured datasets can be analyzed as data frames, so this component is very helpful for analyzing those kinds of data.
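
As a small illustration of Spark SQL on structured data, the Java sketch below loads a CSV file, registers it as a temporary view, and runs an aggregate query. The file path, column names, and the local master setting are assumptions for demonstration; on a real cluster, Spark would typically run on YARN.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkSqlExample {
    public static void main(String[] args) {
        // "local[*]" runs Spark inside this JVM; on a cluster, YARN would manage the resources.
        SparkSession spark = SparkSession.builder()
                .appName("SparkSqlExample")
                .master("local[*]")
                .getOrCreate();

        // Assumed input file and columns, used only for illustration.
        Dataset<Row> patients = spark.read()
                .option("header", "true")
                .csv("hdfs:///user/demo/patients.csv");

        patients.createOrReplaceTempView("patients");
        spark.sql("SELECT department, COUNT(*) AS visits FROM patients GROUP BY department")
             .show();

        spark.stop();
    }
}
```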

Data Access:

The last part of the Hadoop ecosystem is data access. There are many data access tools in the Hadoop ecosystem; however, this paper will discuss three well-known ones: PIG, HIVE, and HBase.

  • PIG: a query processing language developed by Yahoo to analyze big data effectively and efficiently (ProjectPro, 2021). It provides the ability to write complex data transformations in a simple way (Kiran, 2021). PIG consists of four main components (a short embedded-Pig sketch appears at the end of this section):
    1. Parser: checks the syntax of the script. The outcome of this check is a directed acyclic graph (DAG) that holds the PIG Latin statements and logical operators (Pedamkar, 2021).
    2. Optimizer: carries out logical optimizations such as projection and push-down (Data Flair, 2018a).
    3. Compiler: automatically compiles the optimized logical plan into a series of MapReduce jobs (Pedamkar, 2021).
    4. Execution engine: finally, the MapReduce jobs are submitted to the Hadoop system for execution in a sorted order (Pedamkar, 2021).
  • HIVE: a data warehouse software initially developed by Facebook (Amazon maintains its own fork for Amazon Elastic MapReduce). It offers an SQL-like interface for users to access data stored in the Hadoop Distributed File System (GeeksforGeeks, 2021a). In addition, HIVE can read, write, process, and query big data using SQL syntax (a minimal JDBC sketch appears at the end of this section). HIVE includes two notable components:
    1. HCatalog: helps manage and store tables for the Hadoop system and allows users of PIG or MapReduce to read and write data easily (Leverenz, 2018). HCatalog supports many formats, such as CSV, RCFile, and text files, and it presents a relational, table-based view of the data stored in HDFS (Leverenz, 2018).
    2. WebHCat: a REST API that receives HTTP requests to access different Hadoop ecosystem components, such as MapReduce, YARN, or PIG.
  • HBase: “HBase is a column-oriented non-relational database management system that runs on top of Hadoop Distributed File System (HDFS)” (IBM, n.d.). HBase stores the data in table format, depends on ZooKeeper for high-performance coordination, and works well with HIVE (IBM, n.d.). Taylor (2021b) lists four different components of HBase (a client sketch follows this list):
    1. HMaster: runs on the NameNode, which is part of HDFS, and is responsible for data schema change operations. It also assigns the regions to the region servers.
    2. HBase Region Server: serves the read and write requests received from clients for the regions assigned to it.
    3. HBase Regions: the basic components of the HBase architecture; each region comprises column families and holds part of a distributed table.
    4. ZooKeeper: it was mentioned in the data management part, but it also works with HBase to coordinate HBase operations by establishing the region servers' communication with the clients, tracking server failures, offering synchronized distribution, and maintaining the configuration information, to name a few.
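
For PIG, the sketch below embeds a few PIG Latin statements in a Java program through the PigServer API; each registered statement passes through the parser, optimizer, and compiler described above before the execution engine submits the resulting jobs. The input file, field names, and local execution mode are assumptions for illustration.

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigEmbeddedExample {
    public static void main(String[] args) throws Exception {
        // Local mode for illustration; ExecType.MAPREDUCE would run the jobs on the cluster.
        PigServer pig = new PigServer(ExecType.LOCAL);

        // Assumed tab-delimited input file with a user name and a byte count per line.
        pig.registerQuery("logs = LOAD 'access_log.txt' AS (user:chararray, bytes:long);");
        pig.registerQuery("by_user = GROUP logs BY user;");
        pig.registerQuery("totals = FOREACH by_user GENERATE group, SUM(logs.bytes);");

        // Compiles the plan into MapReduce jobs and writes the result to 'totals_out'.
        pig.store("totals", "totals_out");
    }
}
```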
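
To show HIVE's SQL-like interface in practice, this sketch connects to a HiveServer2 instance over JDBC and runs a simple aggregate query. The server address, database, table, and column names are assumptions; the Hive JDBC driver must be on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        // Assumed HiveServer2 address and default database.
        String url = "jdbc:hive2://hiveserver.example.com:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement();
             // Hive translates this SQL into jobs that run over the data stored in HDFS.
             ResultSet rs = stmt.executeQuery(
                     "SELECT department, COUNT(*) AS visits FROM patients GROUP BY department")) {
            while (rs.next()) {
                System.out.println(rs.getString("department") + ": " + rs.getLong("visits"));
            }
        }
    }
}
```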
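
Finally, the HBase client sketch below writes and reads a single cell. The client locates the HMaster and the region servers through the ZooKeeper quorum configured in hbase-site.xml; the table name and column family used here are assumptions and would have to exist in the cluster.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseClientExample {
    public static void main(String[] args) throws Exception {
        // Reads hbase-site.xml from the classpath; its ZooKeeper quorum tells the
        // client how to find the HMaster and the region servers.
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("patients"))) {

            // Write one cell: row key "patient-001", column family "info", qualifier "name".
            Put put = new Put(Bytes.toBytes("patient-001"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Jane Doe"));
            table.put(put);

            // Read the same cell back.
            Result result = table.get(new Get(Bytes.toBytes("patient-001")));
            byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println("name = " + Bytes.toString(value));
        }
    }
}
```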

 

Conclusion:

To sum up, the Hadoop ecosystem consists of many different components that make sure the goal of Hadoop is achieved. This paper divided the components into five sections: data storage, resource management, data management, data processing, and data access. Under these sections there are tools that perform specific tasks, which were briefly explained. Although many other Hadoop ecosystem tools were not mentioned, the writer believes the most important tools were included in the concept chart in Figure 1. This paper could be developed further by adding more tools and examples that clarify how large and reliable the Hadoop ecosystem is in dealing with different kinds of big data, and how suitable it is for different sectors.

 

Author: Zaid Altukhi

 

  1. Apache Flume™. (n.d.). Welcome to Apache Flume — Apache Flume. Retrieved October 7, 2021, from https://flume.apache.org
  2. Apache ZooKeeper™. (n.d.). Apache ZooKeeper. Retrieved October 7, 2021, from https://zookeeper.apache.org
  3. Bhandari, A. (2020, October 23). Introduction to the Hadoop Ecosystem for Big Data and Data Engineering. Analytics Vidhya. https://www.analyticsvidhya.com/blog/2020/10/introduction-hadoop-ecosystem/
  4. Data Flair. (2018a, April 23). Apache Pig Architecture – Learn Pig Hadoop Working. DataFlair. https://data-flair.training/blogs/pig-architecture/
  5. Data Flair. (2018b, November 16). Apache Spark Ecosystem – Complete Spark Components Guide. DataFlair. https://data-flair.training/blogs/apache-spark-ecosystem-components/
  6. Data Flair. (2019, April 26). Hadoop Ecosystem and Their Components – A Complete Tutorial. DataFlair. https://data-flair.training/blogs/hadoop-ecosystem-components/
  7. Doddamani, S. (2020, October 12). What is Apache Zookeeper? Intellipaat Blog. https://intellipaat.com/blog/what-is-apache-zookeeper/
  8. GeeksforGeeks. (2021a, July 2). Apache Hive. https://www.geeksforgeeks.org/apache-hive/
  9. GeeksforGeeks. (2021b, August 2). Hadoop Ecosystem. https://www.geeksforgeeks.org/hadoop-ecosystem/
  10. Goel, K. (2020, September 16). What is Yarn. Intellipaat Blog. https://intellipaat.com/blog/tutorial/hadoop-tutorial/what-is-yarn/#Why-is-YARN-used-in-Hadoop?
  11. IBM. (n.d.). What is HBase? | IBM. Retrieved October 8, 2021, from https://www.ibm.com/topics/hbase
  12. Kiran, R. (2021, January 29). Hadoop Components that you Need to know about. Edureka. https://www.edureka.co/blog/every-hadoop-component/
  13. Leverenz, L. (2018, December 16). HCatalog UsingHCat – Apache Hive – Apache Software Foundation. Confluence. https://cwiki.apache.org/confluence/display/Hive/HCatalog%2BUsingHCat#HCatalogUsingHCat-Overview
  14. Pedamkar, P. (2021, March 1). Pig Architecture. EDUCBA. https://www.educba.com/pig-architecture/
  15. ProjectPro. (2020, July 13). Hadoop Use Cases. https://www.projectpro.io/article/hadoop-use-cases/232
  16. ProjectPro. (2021, September 6). Hadoop Ecosystem Components and Its Architecture. https://www.projectpro.io/article/hadoop-ecosystem-components-and-its-architecture/114#toc-5
  17. Taylor, D. (2021a, October 6). Apache Flume Tutorial: What is, Architecture & Hadoop Example. Guru99. https://www.guru99.com/create-your-first-flume-program.html#2
  18. Taylor, D. (2021b, October 7). HBase Architecture: Use Cases, Components & Data Model. Guru99. https://www.guru99.com/hbase-architecture-data-flow-usecases.html
  19. Uzunkaya, C., Ensari, T., & Kavurucu, Y. (2015). Hadoop Ecosystem and Its Analysis on Tweets. Procedia – Social and Behavioral Sciences, 195, 1890–1897. https://doi.org/10.1016/j.sbspro.2015.06.429
