Getting started with Spark (part 1)

Hang Nguyen
3 min read · Apr 18, 2022

Spark is currently one of the most popular tools for big data analytics. You may have heard of Hadoop, and that Spark is generally faster than Hadoop, which has helped make it more popular over the last few years.

Before getting started with the tool, we should review some hardware basics.

CPU (Central Processing Unit)

The CPU is the “brain” of the computer. Every process on your computer is eventually handled by your CPU. This includes calculations and also instructions for the other components of the computer.

The CPU can also store small amounts of data inside itself in what are called registers. These registers hold data that the CPU is working with at the moment.

For example, say you write a program that reads in a 40 MB data file and then analyzes the file. When you execute the code, the instructions are loaded into the CPU. The CPU then instructs the computer to take the 40 MB from disk and store the data in memory (RAM). If you want to sum a column of data, then the CPU will essentially take two numbers at a time and sum them together. The accumulation of the sum needs to be stored somewhere while the CPU grabs the next number.

This cumulative sum will be stored in a register. Registers make computations more efficient because they avoid sending data unnecessarily back and forth between memory (RAM) and the CPU.
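The column sum described above can be sketched in a few lines of Python. This is only an illustration: the tiny in-memory CSV, the `amount` column, and the `sum_column` helper are all made up for the example and stand in for the 40 MB file. The `running_total` variable plays the role of the accumulated sum. Python itself does not let you choose what sits in a register, so treat the variable as an analogy for the value the CPU keeps close at hand.

```python
import csv
import io

# Hypothetical stand-in for the 40 MB data file: a tiny in-memory CSV.
SAMPLE_CSV = "id,amount\n1,10.5\n2,20.0\n3,4.5\n"


def sum_column(csv_text, column):
    """Sum one column by adding a single value at a time to a running total."""
    # The running total is the "accumulation of the sum" from the text above.
    running_total = 0.0
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        # Grab the next number and add it to what has been accumulated so far.
        running_total += float(row[column])
    return running_total


total = sum_column(SAMPLE_CSV, "amount")
print(total)
```

The loop only ever works with two numbers at once, the running total and the next value, which is the same pattern as the CPU adding two numbers at a time.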

Hang Nguyen

A Data Engineer with a passion for technology, literature, and philosophy.