In this article, we will discuss the architecture of the Hadoop framework and the limitations of MapReduce in processing the big data, which act as a motivation for the development of the Apache Spark framework. After that, we will discuss the different components of Spark and its architecture and finally, we take a look at PySpark API.
Table of contents:
1. Big Data and the Emergence of Hadoop Framework.
2. Brief overview of the Hadoop Architecture.
3. Working of MapReduce and its Limitations.
4. Introduction to Apache Spark
5. Components of Apache Spark
6. Architecture of Apache Spark
7. Comparing Hadoop with Spark
8. Overview of PySpark API
This article will provide a detailed explanation of Apache Spark components and Spark architecture along with its workflow. This blog is the first one in the series of articles on Apache Spark (PySpark).
Before we discuss the concepts of PySpark and its ecosystem, we will start by revisiting the emergence of big data and big data processing using the Hadoop framework.
“Data is the new oil” is the most popular catchphrase highlighting the importance of data. The amount of data we are generating every minute of the day is truly mind-blowing. According to Forbes, we are generating 2.5 quintillion bytes of data every day at our current pace, but that pace is only increasing with the deeper penetration of smartphones and the growth of the Internet of Things (IoT).
Here are some of the data figures from Forbes:
- By 2025, the amount of data generated each day is expected to reach 463 exabytes globally.
- Google, Facebook, Microsoft, and Amazon store at least 1,200 petabytes of information.
- The world spends almost $1 million per minute on commodities on the Internet.
- Electronic Arts process roughly 50 terabytes of data every day.
- By 2025, there would be 75 billion Internet-of-Things (IoT) devices in the world
- By 2030, nine out of every ten people aged six and above would be digitally active.
Every industry is graced with more data, making sense of what all this data means, and extracting relevant information is make or break for the company.
The main challenges with big data are:
- Storing and managing the huge volumes of data.
- Processing the data efficiently and extracting valuable business insights with less turnaround time.
These challenges led to the creation of the Hadoop framework.
In this section, we will briefly discuss the different components of the Hadoop framework and how they can be used to process huge amounts of data to extract insights from the data.
Apache Hadoop is a framework that allows for distributed processing of large data sets across clusters of computers.
It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer.
There are 3 core components of Hadoop:
- Hadoop Distributed File System (HDFS) — A distributed storage file system that provides high-throughput access to application data.
- Yet Another Resource Negotiator (YARN) — Framework for job scheduling and cluster resource management.
- MapReduce — System for parallel processing of large data sets.
Hadoop works on the concept of master/slave (or worker) architecture for data storage and distributed data processing.
In this architecture, we have one master node and multiple worker (or slave) nodes. Master node assigns various functions to the worker nodes and manages the resources. The worker nodes do the actual computing and stores the real data whereas on master we have metadata. This means the master node stores where the information about the data.
Resource manager in master and Node manager(s) in slave nodes are a part of the YARN framework. The resource manager monitors the reports of resource utilization from node manager(s) for each task.
Now that we have briefly discussed the components of Hadoop and its architecture. We will focus mostly on the processing part of the Hadoop framework — MapReduce.
MapReduce is the software framework that allows us to write applications that process large amounts of data in parallel on large clusters (thousands of nodes) in a fault-tolerant manner.
Each job in MapReduce usually splits the input data into multiple discreet chunks, which are to be processed by the map tasks parallelly. The outputs of the map phase are aggregated or summarized or transformed by the reduce phase to produce the desired result. The framework takes care of scheduling tasks, monitoring them, and re-executes the failed tasks.
MapReduce is widely adopted for processing and generating large datasets with a parallel, distributed algorithm on a cluster. It allows users to write parallel computations, using a set of high-level operators, without having to worry about work distribution and fault tolerance.
Constraints in MapReduce
Some of the major drawbacks of MapReduce are listed below:
- In MapReduce after each operation, all the data is written back to the physical storage. This means it reads the data from physical storage and then it writes back to the storage when the tasks are completed.
- The frequent reading and writing to the physical storage become inefficient if we need faster data processing or real-time data processing. Since it needs to read the data from physical storage multiple times during the operation.
No Real-time Processing
- MapReduce initially designed to process data in batches. It takes a huge dataset at once, processes the data, and writes output to the physical storage.
- It doesn’t support real-time or near real-time processing like stream processing.
Apache Spark is a unified analytics engine for big data processing, with built-in modules for streaming, SQL, machine learning, and graph processing.
Spark can be considered as the enhanced version of MapReduce. Unlike MapReduce, Spark doesn’t store the intermediate data on the disks. It stores the intermediate data electronically in RAM. Since all the data processing takes place in memory, Spark offers a faster way to process huge data.
- Spark is developed in UC Berkeley’s AMPLab in 2009 and open-sourced in 2010 under a BSD license. Later it was donated to Apache Software Foundation in 2013, now Apache Spark has become a top-level big data processing engine.
- It is the most reliable and extensively used big data processing platform in the industry today.
- Tech giants like Netflix, Yahoo, and eBay have deployed Spark at a massive scale collectively processing multiple petabytes of data on clusters of over 8,000 nodes.
Features of Spark
- Spark can run an application 100x faster than Hadoop for large-scale data processing and 10 times faster when running on disk by using in-memory computation.
- This is possible because of fewer read/write operations to the disk, unlike MapReduce.
- Spark stores the intermediate data in Memory. It uses more RAM instead of network and physical storage I/O. It can be used for real-time or near real-time data processing.
Ease of Use
- Spark written in Scala and provides APIs to work with Scala, Java, Python, R, and SQL. Spark is flexible for developers to build parallel apps in their language of choice.
- Besides MapReduce operations, Spark supports SQL queries, streaming data, machine learning, and graph processing.
- Spark can read/write to any storage format that can support Hadoop Framework like HDFS, S3, HBase, Cassandra, Avro, etc…
- You can also run Spark using its standalone cluster, on EC2, Hadoop YARN, on Mesos, or on Kubernetes, Apache Hive, and other data sources.
Components of Spark
In this section, we will briefly discuss the different components of Apache Spark.
- SparkCore API — Spark Core is the main execution engine for the Spark platform that all other functionalities are built on top of it. Spark core provides fault tolerance, in-memory computation, resource management, and references to external data sources. It also provides a distributed execution framework to support a wide variety of programming languages like Python, Java, Scala for ease of development.
- Spark SQL — Spark SQL is a module for working with structured data. Spark SQL can be used to query structured data inside Spark programs. It supports Java, Python, R, and SQL. Spark SQL can be integrated with a variety of data sources including Hive, Avro, Parquet, JSON, and JDBC. Spark SQL also supports the HiveQL syntax allowing you to access existing Hive warehouses
- Spark Streaming — Spark streaming leverages the Spark Core API to perform real-time interactive data analytics while leveraging Spark’s fault tolerance semantics out of the box. Spark streaming can be integrated with a wide variety of popular data sources like HDFS, Flume, Kafka, and Twitter.
- Spark MLib — MLib is a scalable machine learning library built on top of Spark Core API that delivers high-quality algorithms. MLib can be used with any Hadoop data like HDFS, HBase, or local storage, making it easy to plug into Hadoop workflows. The library is usable in Java, Scala, and Python as part of Spark applications.
- GraphX — GraphX is Spark API for building graphs and executing graph-parallel computation. GraphX enables users to interactively build, transform graph-structured data at scale.
Spark follows master/slave (worker) architecture similar to the Hadoop framework where the master is the driver and slaves are the workers. It is a cluster system consists of a master and multiple slaves.
In the master node of a cluster, we have the driver program that schedules the execution of jobs and manages the resources with the cluster manager. Whenever we initialize Spark or perform spark-submit on a cluster, Spark Context will be created by the driver program inside the master node. SparkContext is the main entry point into the Spark functionality.
- Spark driver program splits the large input data into small chunks of data (Usual partition sizes are 64MB, 128MB, or 256MB) and distributes it across the multiple worker nodes to perform a task on that data.
- A job is split into multiple tasks by the SparkContext which are distributed over to worker nodes. Tasks are then executed by the individual worker nodes (or slave nodes) and controlled by the driver program through SparkContext on the master node.
- The job of the worker nodes is to execute these tasks and report the status of each task to the driver. The number of tasks at a time will be equal to the number of partitions.
- SparkContext works with the cluster manager to manage various jobs. The cluster manager allocates resources (like CPU, memory) to the worker nodes, where the tasks are executed. Spark cluster manager can use standalone cluster manager or Apache Mesos or Hadoop YARN or Kubernetes for allocating resources.
Spark Internal Work-Flow
- When a Spark application code is submitted, the driver program in the master node initializes SparkContext and implicitly converts the Spark code containing transformations & actions into DAG, Directed Acyclic Graph. DAG is a graph that performs a sequence of computations on the data.
- Driver program performs optimizations and converts DAG into physical execution plan with many tasks i.e… Driver program decides what tasks will be executed on each worker node and schedules the work across the worker nodes.
- Once the work allocation is done by the drive program, it then negotiates with the cluster manager for the resources for the worker nodes. The cluster manager then launches executors on the worker nodes on behalf of the driver.
- Before executors begin execution, they register themselves with the driver program so that the driver has a holistic view of all the executors. Executors report the status of each task to the driver.
Spark vs Hadoop?
- Spark competes with MapReduce rather than the complete Hadoop framework.
- Spark doesn’t have its own distributed file system but can use HDFS as its underlying storage.
- Spark can run on top of Hadoop, gets benefited from Hadoop’s cluster manager (YARN) and underlying storage (HDFS, HBase, etc.).
So it would be unfair to term Spark as a replacement for Hadoop but it can be considered as a better version of MapReduce.
PySpark is the Python API written in Python to support Spark.
Python is one of the most widely used programming languages, especially for data science and it is easier to learn compared to other programming languages. Many practitioners are already familiar with Python Pandas data transformations and manipulations for extracting insights from the data. With strong support from the open-source community, PySpark was developed using the Py4j library.
Advantages of using PySpark:
- Python is very easy to learn and implement and provides a simple and comprehensive API.
- PySpark Provides an interactive shell to analyze the data in a distributed environment.
- With Python, the readability of code, maintenance, and familiarity is far better.
- It features various options for data visualization, which is difficult using Scala or Java.
In this article, we started by discussing big data and the challenges associated with processing big data. We have discussed the architecture of the Hadoop framework to process large data and the constraints of MapReduce. After that, we discussed the different components of Spark and its architecture, and finally the advantages of PySpark API.
Feel free to reach out to me via LinkedIn or Twitter if you have any doubts about understanding the concepts. I hope this article has helped you in understanding the Spark Components and Spark Architecture.
In my next blog, we will discuss how to set up PySpark in your local computer and get started with PySpark Hands-on. So make sure you follow me on Medium to get notified as soon as it drops.
Until next time Peace 🙂
Senior Consultant Data Science|| Freelancer. Writer @ TDataScience & Hackernoon|| connect & fork @ Niranjankumar-c
Connect with Niranjan Kumar https://twitter.com/Nkumar_n?s=20