#62 Big data technology (part 2): Hadoop architecture, HDFS, YARN, MapReduce, Hive & HBase
Hadoop
Hadoop is a set of open-source programs and procedures that serves as a framework for Big Data operations. It processes massive amounts of data across distributed file systems that are linked together, and it runs applications on clusters. (A cluster is a collection of computers working together at the same time to perform tasks.) It should be noted that Hadoop is not a database but an ecosystem that can handle processes and jobs in parallel or concurrently.
Hadoop is optimized to handle massive quantities of data using relatively inexpensive computers. That data can be:
- Structured, tabular data
- Unstructured data, such as images and videos, or
- Semi-structured data
The core components of Hadoop include:
- Hadoop Common, an essential part of the Apache Hadoop framework: the collection of common utilities and libraries that support the other Hadoop modules.
- The storage component, called the Hadoop Distributed File System, or HDFS. It stores large data sets across commodity hardware; a minimal sketch of writing a file to HDFS follows this list. (Commodity hardware is low-cost, industry-standard hardware that lets a single Hadoop cluster scale to hundreds or even thousands of nodes.)
- The next component is MapReduce, the processing engine of Hadoop and a core part of the framework. MapReduce splits large amounts of data into smaller units and processes them simultaneously: a map phase turns each unit into key-value pairs, and a reduce phase aggregates the values for each key (see the word-count sketch after this list). For a while, MapReduce was the only way to access the data stored in HDFS; there are now other systems, like Hive and Pig.
- The last component is YARN, which is short for “Yet Another Resource Negotiator.” YARN is a very important component because it allocates the cluster’s RAM and CPU so that Hadoop can run batch, stream, interactive, and graph processing workloads over the data stored in HDFS.
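To make the storage layer concrete, here is a minimal sketch of writing and then listing a file through the HDFS Java client API (org.apache.hadoop.fs.FileSystem). The NameNode address and the /user/demo path are placeholder assumptions, not details from this article.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.nio.charset.StandardCharsets;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        // Placeholder NameNode URI -- replace with your cluster's address.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000");

        // FileSystem is the client-side entry point to HDFS.
        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/user/demo/hello.txt");

            // Write a small file; HDFS transparently splits larger files
            // into blocks and replicates them across DataNodes.
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.write("Hello, HDFS!".getBytes(StandardCharsets.UTF_8));
            }

            // List the directory to confirm the file was written.
            for (FileStatus status : fs.listStatus(new Path("/user/demo"))) {
                System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
            }
        }
    }
}
```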
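The map-then-reduce flow is easiest to see in the classic word-count job. The sketch below uses the standard Hadoop MapReduce Java API: the mapper emits a (word, 1) pair for every word it reads, and the reducer sums the counts for each word. The input and output paths are hypothetical command-line arguments.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class WordCount {

    // Map phase: each input line is split into words, and each word
    // is emitted as a (word, 1) pair.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce phase: all counts for the same word arrive together,
    // so summing them gives that word's total number of occurrences.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        // Input and output paths come from the command line, e.g.
        //   hadoop jar wordcount.jar WordCount /input /output
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Using the reducer as a combiner is an optional optimization: partial sums are computed locally on each map node before the shuffle, which reduces the amount of data moved across the network.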
Drawbacks of Hadoop:
Although efficient at first glance, Hadoop falls short on some seemingly simple tasks.
- Hadoop is not good for processing transactions due to its lack of random access.
- Hadoop is not good when the work cannot be done in parallel or when there are dependencies within the data. Dependencies arise…