Hadoop Introduction

The term ‘Big Data’ refers to datasets that are so large in volume, generated at such high velocity, and so varied in structure that they grow difficult to handle day by day. Conventional data management systems, such as relational databases, struggle to process data at this scale. To address these storage and processing challenges, the Apache Software Foundation introduced a framework called Hadoop.

Hadoop is an open-source framework for storing and processing large datasets in a distributed environment. It consists of two core modules: MapReduce and the Hadoop Distributed File System (HDFS).


MapReduce: A parallel programming model for processing large amounts of structured, semi-structured, and unstructured data on large clusters of commodity hardware.
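The model's two phases, with the shuffle step the framework performs between them, can be illustrated with a minimal word-count sketch in Python. This is a conceptual simulation only; the function names and the single-process execution are illustrative, not Hadoop's actual Java API:

```python
from collections import defaultdict

# Map phase: each "mapper" turns an input record (here, a line of
# text) into intermediate (key, value) pairs.
def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

# Shuffle: group intermediate values by key, as the framework does
# between the map and reduce stages.
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce phase: each "reducer" aggregates all values for one key.
def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data is big", "hadoop processes big data"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts)  # {'big': 3, 'data': 2, 'is': 1, 'hadoop': 1, 'processes': 1}
```

In a real Hadoop cluster, the map and reduce calls would run in parallel across many machines, with the framework handling the shuffle, scheduling, and failure recovery.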

HDFS: The Hadoop Distributed File System is the storage layer of the Hadoop framework, used to store large datasets. It provides a fault-tolerant file system designed to run on commodity hardware.
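HDFS achieves fault tolerance by splitting each file into fixed-size blocks and replicating every block on several machines. The following sketch shows the idea only; the tiny block size, node names, and round-robin placement are illustrative assumptions, not HDFS's real placement policy (real defaults are a 128 MB block size and a replication factor of 3):

```python
# Illustrative constants, scaled down for the example.
BLOCK_SIZE = 8            # bytes (HDFS default is 128 MB)
REPLICATION = 3           # copies of each block (HDFS default is 3)
DATANODES = ["node1", "node2", "node3", "node4"]  # hypothetical nodes

def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Split a file's bytes into fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(blocks, nodes=DATANODES, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes (round-robin)."""
    placement = {}
    for b in range(len(blocks)):
        placement[b] = [nodes[(b + r) % len(nodes)] for r in range(replication)]
    return placement

data = b"hello hadoop distributed file system"   # 36 bytes -> 5 blocks
blocks = split_into_blocks(data)
placement = place_replicas(blocks)
# Losing any single node still leaves two copies of every block.
print(len(blocks), placement[0])
```

Because every block exists on multiple nodes, the failure of one machine does not lose data; the NameNode simply re-replicates the affected blocks elsewhere.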
The Hadoop ecosystem also includes various sub-projects (tools) such as Sqoop, Pig, and Hive that complement the core Hadoop modules.

Sqoop: Used to import and export data between HDFS and relational databases (RDBMS).

Pig: A procedural scripting platform used to write scripts that are executed as MapReduce jobs.

Hive: A platform used to write SQL-like scripts that are executed as MapReduce jobs.