#64 Apache Spark (part 1): Fundamentals of Spark Architecture & Spark Core (RDD)

Hang Nguyen
Jun 29, 2022

Please refer to the series Get started with Spark (parts 1–3) for a brief overview of Spark.

Apache Spark architecture

Spark is an enhancement to Hadoop MapReduce. The primary difference between Spark and MapReduce is that Spark processes and retains data in memory for subsequent steps, whereas MapReduce writes intermediate results back to disk between steps. As a result, for smaller workloads, Spark's data processing can be up to 100x faster than MapReduce.
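
To make the in-memory point concrete, here is a minimal PySpark sketch (the app name, data, and squaring step are placeholder choices, and it assumes a local PySpark installation): after cache(), the second action reuses the partitions already materialized in memory instead of recomputing the lineage from the source.

```python
from pyspark.sql import SparkSession

# Placeholder app name; "local[*]" runs Spark on all local cores.
spark = SparkSession.builder.appName("cache-demo").master("local[*]").getOrCreate()
sc = spark.sparkContext

numbers = sc.parallelize(range(1_000_000))
squares = numbers.map(lambda x: x * x).cache()  # keep computed partitions in memory

print(squares.count())  # first action: computes the RDD and caches it
print(squares.sum())    # second action: served from memory, no recomputation

spark.stop()
```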

The Spark ecosystem consists of five primary modules:

  1. Spark Core: The underlying execution engine that schedules and dispatches tasks and coordinates input and output (I/O) operations (see the RDD sketch after this list).
  2. Spark SQL: Gathers schema information about structured data so that the engine can optimize structured data processing (see the Spark SQL sketch below).
  3. Spark Streaming and Structured Streaming: Both add stream-processing capabilities. Spark Streaming takes data from different streaming sources and divides it into micro-batches to form a continuous stream. Structured Streaming, built on Spark SQL, reduces latency and simplifies programming (a streaming sketch follows this list).
  4. Machine Learning Library (MLlib): A set of machine learning algorithms for scalability, plus tools for feature selection and building ML pipelines.
  5. GraphX: Spark's API for graphs and graph-parallel computation, built on top of Spark Core.
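
To illustrate item 1, here is a short Spark Core sketch (the word list and app name are made up): map and reduceByKey are lazy transformations that only build a lineage graph, and the collect action is what makes Spark Core schedule and dispatch tasks.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-demo")  # hypothetical app name

words = sc.parallelize(["spark", "core", "rdd", "spark"])
counts = (
    words.map(lambda w: (w, 1))            # transformation: (word, 1) pairs
         .reduceByKey(lambda a, b: a + b)  # transformation: sum counts per key
)
print(counts.collect())  # action: e.g. [('spark', 2), ('core', 1), ('rdd', 1)]

sc.stop()
```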
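
For item 2, a minimal Spark SQL sketch (the view name and rows are invented): registering a DataFrame as a temporary view gives the engine the schema information it needs, and the SQL query is then planned by Spark's optimizer.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-demo").master("local[*]").getOrCreate()

df = spark.createDataFrame([("alice", 34), ("bob", 29)], ["name", "age"])
df.createOrReplaceTempView("people")  # expose the DataFrame to SQL

spark.sql("SELECT name FROM people WHERE age > 30").show()
# +-----+
# | name|
# +-----+
# |alice|
# +-----+

spark.stop()
```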
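
And for item 3, a Structured Streaming sketch that uses the built-in rate source (it continuously emits timestamped rows) so no external stream is needed; Spark treats the input as an unbounded table and updates the running count incrementally as micro-batches arrive.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stream-demo").master("local[*]").getOrCreate()

# The "rate" source generates rows continuously, standing in for a real stream.
stream = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

query = (
    stream.groupBy().count()               # running count over all rows seen
          .writeStream.outputMode("complete")
          .format("console")
          .start()
)
query.awaitTermination(10)  # let the query run for ~10 seconds
query.stop()
spark.stop()
```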
