Getting started with Spark (part 2)

Hang Nguyen
4 min read · Apr 18, 2022

After covering some hardware basics in part 1, let’s move on to Hadoop.

Parallel computing

In general, parallel computing means that multiple CPUs share the same memory. In distributed computing, by contrast, each machine has its own CPU and its own memory, and the machines are connected to one another across a network.
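To make the difference concrete, here is a minimal Python sketch of the two memory models. It is only an illustration under some assumptions: Python threads stand in for CPUs, a queue stands in for the network, and the function names `parallel_sum` and `distributed_sum` are made up for this example. It shows who can see which memory, not real performance.

```python
import threading
from queue import Queue

numbers = list(range(1, 101))  # toy data set: 1..100


def parallel_sum(data, n_workers=4):
    """Shared memory: every worker reads the same list and updates the same total."""
    shared = {"total": 0}
    lock = threading.Lock()  # needed because all workers write to one place

    def work(start):
        partial = sum(data[start::n_workers])  # reads the shared list directly
        with lock:
            shared["total"] += partial

    threads = [threading.Thread(target=work, args=(i,)) for i in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared["total"]


def distributed_sum(data, n_nodes=4):
    """Simulated cluster: each node holds a private copy of its slice
    and can only report back by sending a message over the "network"."""
    network = Queue()
    for i in range(n_nodes):
        local_memory = list(data[i::n_nodes])  # private copy, invisible to other nodes
        network.put(sum(local_memory))         # send the partial result as a message
    # The coordinator combines the messages it receives.
    return sum(network.get() for _ in range(n_nodes))


print(parallel_sum(numbers), distributed_sum(numbers))  # prints "5050 5050"
```

Both versions compute the same answer. The point is that the shared-memory version needs a lock to coordinate writes, while the distributed version needs messages to move results between machines.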

Hadoop Vocabulary

  • Hadoop — an ecosystem of tools for big data storage and data analysis. Hadoop is an older system than Spark but is still used by many companies. The major difference between Spark and Hadoop is how they use memory. Hadoop writes intermediate results to disk whereas Spark tries to keep data in memory whenever possible. This makes Spark faster for many use cases.
  • Hadoop MapReduce — a system for processing and analyzing large data sets in parallel.
  • Hadoop YARN — a resource manager that schedules jobs across a cluster. The manager keeps track of what computer resources are available and then assigns those resources to specific tasks.
  • Hadoop Distributed File System (HDFS) — a big data storage system that splits data into chunks and stores the chunks across a cluster of computers.
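The way HDFS chunks and MapReduce fit together can be sketched in plain Python with a word count. This is a toy, single-machine illustration and not the Hadoop API: the helper names (`split_into_chunks`, `map_phase`, `shuffle`, `reduce_phase`) are hypothetical, and real HDFS splits files into large blocks by size rather than by line count.

```python
from collections import defaultdict


def split_into_chunks(lines, chunk_size):
    """HDFS-style idea: cut the data into chunks that could live on different machines."""
    return [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]


def shuffle(mapped_chunks):
    """Shuffle: group the emitted values by key (the word)."""
    grouped = defaultdict(list)
    for pairs in mapped_chunks:
        for word, count in pairs:
            grouped[word].append(count)
    return grouped


def reduce_phase(grouped):
    """Reduce: combine the values for each key into a single count."""
    return {word: sum(counts) for word, counts in grouped.items()}


def word_count(lines, chunk_size=2):
    chunks = split_into_chunks(lines, chunk_size)
    # On a real cluster each chunk would be mapped on a different machine;
    # here the chunks are simply processed one after another.
    mapped = [map_phase(chunk) for chunk in chunks]
    return reduce_phase(shuffle(mapped))


lines = ["spark and hadoop", "hadoop stores data", "spark keeps data in memory"]
print(word_count(lines))
```

Because each chunk is mapped independently, the map step can run in parallel across the cluster. Only the shuffle step needs to move data between machines.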

As Hadoop matured, other tools were developed to make Hadoop easier to work with. These tools included:

  • Apache Pig — a scripting platform whose language, Pig Latin, is often described as SQL-like; it runs on top of Hadoop MapReduce
  • Apache Hive — another SQL-like interface that runs on top of…

Hang Nguyen

A Data Engineer with a passion for technology, literature, and philosophy.