#62 Big data technology (part 2): Hadoop architecture, HDFS, YARN, MapReduce, Hive & HBase

Hang Nguyen
14 min read · Jun 29, 2022

Hadoop

Hadoop is a set of open-source programs and procedures that can be used as a framework for Big Data operations. It processes massive amounts of data across distributed file systems that are linked together, and it runs applications on clusters. (A cluster is a collection of computers working together at the same time to perform tasks.) It should be noted that Hadoop is not a database but an ecosystem that can handle processes and jobs in parallel or concurrently.

Hadoop is optimized to handle massive quantities of data using relatively inexpensive computers. That data could be:

  • Structured, tabular data
  • Unstructured data, such as images and videos, or
  • Semi-structured data

The core components of Hadoop include:

  • Hadoop Common, an essential part of the Apache Hadoop framework: the collection of common utilities and libraries that support the other Hadoop modules.
  • The storage component, called the Hadoop Distributed File System, or HDFS. It stores large data sets on commodity hardware. (Commodity hardware is low-specification, industry-grade hardware, and it allows a single Hadoop cluster to scale to hundreds or even thousands of nodes.)
  • The next component is MapReduce, which is the processing unit of Hadoop and an important core… (a minimal example is sketched after this list).
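To make the MapReduce idea concrete, here is a minimal sketch of the classic word-count job written against Hadoop's Java MapReduce API (org.apache.hadoop.mapreduce). The class name WordCount and the input/output paths passed on the command line are assumptions for illustration, not part of the original article.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every word in each input line.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts emitted for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation before the shuffle
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // hypothetical HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // hypothetical output directory; must not exist yet
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Submitted to a cluster (for example with `hadoop jar wordcount.jar WordCount /input /output`, where the paths are hypothetical HDFS directories), the input blocks stored in HDFS are split across map tasks that run close to the data, the framework shuffles the intermediate (word, count) pairs by key, and the reducers produce the final totals. This division into map, shuffle, and reduce is what lets Hadoop process data in parallel across the cluster.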
