Hadoop is an open-source framework that allows the distributed processing of large datasets across clusters of computers using simple programming models. It has two main components: HDFS (Hadoop Distributed File System) for storage and MapReduce for processing.
Hadoop offers distributed storage (HDFS), fault tolerance through data replication, horizontal scalability on commodity hardware, and a simple programming model for distributed processing (MapReduce).
HDFS is a distributed file system where data is stored across multiple nodes. It follows a master-slave architecture with a NameNode managing metadata and DataNodes storing actual data. Data is divided into blocks and replicated to ensure reliability.
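The block-and-replica arithmetic can be sketched in a few lines. This is an illustration only, assuming Hadoop's common defaults of 128 MB blocks and a replication factor of 3 (both are configurable per cluster):

```python
# Sketch: how HDFS splits a file into blocks and replicates them.
# Assumes common defaults: 128 MB block size, replication factor 3.
import math

BLOCK_SIZE_MB = 128
REPLICATION = 3

def hdfs_footprint(file_size_mb):
    """Return the block count and raw storage used for one file."""
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    return {
        "blocks": blocks,
        "replicas_stored": blocks * REPLICATION,
        "raw_storage_mb": file_size_mb * REPLICATION,
    }

# A 500 MB file becomes 4 blocks (3 full 128 MB blocks plus one
# 116 MB block), and each block is stored on 3 different DataNodes.
print(hdfs_footprint(500))
```

The takeaway: a file's logical size and its raw storage cost differ by the replication factor, which is why capacity planning on HDFS always accounts for replicas.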
The Secondary NameNode periodically merges the NameNode's edit log into its fsimage, producing an up-to-date checkpoint of the filesystem metadata. This keeps the edit log from growing unbounded and reduces the load on (and restart time of) the primary NameNode. Despite its name, it is not a hot standby and cannot take over if the NameNode fails.
MapReduce is a programming model in Hadoop for processing large datasets in parallel. It breaks the processing into two phases: Map (filtering and transforming input records into key-value pairs) and Reduce (aggregating the values grouped by key). Between the two, the framework shuffles and sorts the intermediate pairs so that all values for a key reach the same reducer.
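The classic example is word count. The sketch below simulates the map, shuffle/sort, and reduce phases in plain Python; a real job would run these as Hadoop Streaming scripts or Java classes, with the framework handling the shuffle:

```python
# Word count in the MapReduce style, simulated locally.
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in the input line."""
    for word in line.lower().split():
        yield (word, 1)

def shuffle(pairs):
    """Shuffle/sort: group all values by key (done by the framework)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(word, counts):
    """Reduce: aggregate all counts emitted for one word."""
    return (word, sum(counts))

lines = ["big data is big", "hadoop processes big data"]
pairs = [p for line in lines for p in map_phase(line)]
result = dict(reduce_phase(w, c) for w, c in shuffle(pairs).items())
print(result)  # {'big': 3, 'data': 2, 'is': 1, 'hadoop': 1, 'processes': 1}
```

The three functions map one-to-one onto the phases described above, which is why the model scales: mappers and reducers run independently on different nodes.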
Apache Spark is an open-source, distributed computing system known for its in-memory data processing, which makes it faster than Hadoop MapReduce. It supports batch and real-time data processing.
RDD (Resilient Distributed Dataset) is the core abstraction in Spark. RDDs represent a fault-tolerant, distributed collection of elements that can be processed in parallel.
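Two defining RDD traits are lazy transformations and lineage (the recorded chain of operations used for fault recovery). The toy class below illustrates just those two ideas in plain Python; it is not PySpark, and real RDDs are additionally partitioned across the cluster:

```python
# Minimal sketch of the RDD idea: transformations are only recorded
# (lineage); nothing executes until an action like collect() is called.
class MiniRDD:
    def __init__(self, data, lineage=()):
        self._data = data
        self._lineage = lineage  # recorded transformations, not yet run

    def map(self, fn):
        return MiniRDD(self._data, self._lineage + (("map", fn),))

    def filter(self, pred):
        return MiniRDD(self._data, self._lineage + (("filter", pred),))

    def collect(self):
        """Action: replay the lineage over the source data."""
        items = iter(self._data)
        for op, fn in self._lineage:
            items = map(fn, items) if op == "map" else filter(fn, items)
        return list(items)

rdd = MiniRDD(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
# Nothing has executed yet; collect() triggers the computation.
print(rdd.collect())  # [0, 4, 16, 36, 64]
```

Because the lineage is kept, a lost partition can be recomputed from the source data instead of being restored from a replica, which is how Spark achieves fault tolerance without duplicating data in memory.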
Spark is faster than Hadoop MapReduce due to in-memory computation, supports real-time processing, and provides libraries for machine learning and graph processing. Hadoop is mainly disk-based and used for batch processing.
A Directed Acyclic Graph (DAG) in Spark represents the sequence of computations to be performed on the data: the vertices are RDDs and the edges are the transformations applied to them. Because Spark builds the full DAG before executing anything, it can optimize the plan, for example by pipelining consecutive transformations into a single stage.
Spark Streaming is a component of Spark for processing real-time streaming data from sources like Kafka, Flume, or HDFS. It divides the stream into small micro-batches and processes them with the same engine used for batch jobs.
A Big Data Administrator is responsible for the installation, configuration, and maintenance of Big Data infrastructure (like Hadoop clusters). They ensure data availability, security, backup, performance tuning, and troubleshooting.
Monitoring tools like Ambari, Cloudera Manager, and Ganglia are used to track the performance of Hadoop clusters. Administrators also check logs, manage disk space, and ensure smooth task execution.
Kerberos is a security protocol used for authentication in Hadoop. It ensures that both clients and servers authenticate themselves securely before any communication happens.
HDFS metadata is checkpointed regularly by the Secondary NameNode, which merges the edit log into the fsimage. Data blocks are replicated across multiple DataNodes (three copies by default) so that node failures do not cause data loss. Recovery mechanisms include restoring the NameNode from the latest checkpoint or backup and re-replicating any under-replicated blocks.
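Both the checkpoint frequency and the replication factor are configurable in `hdfs-site.xml`. A minimal fragment, shown with Hadoop's default values (tune these for your cluster):

```xml
<!-- hdfs-site.xml: checkpoint and replication settings
     (values shown are the Hadoop defaults). -->
<configuration>
  <property>
    <name>dfs.namenode.checkpoint.period</name>
    <value>3600</value> <!-- seconds between checkpoints -->
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value> <!-- copies kept of each block -->
  </property>
</configuration>
```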
YARN (Yet Another Resource Negotiator) is the resource management layer in Hadoop. It manages and allocates resources to different applications running in the cluster.
Hadoop can run in three modes:
Standalone (Local) Mode: The default mode, in which Hadoop runs as a single Java process on one machine and uses the local filesystem instead of HDFS. It's mostly used for debugging and development.
Pseudo-distributed Mode: Each Hadoop daemon (NameNode, DataNode, ResourceManager, etc.) runs as a separate Java process on the same machine, with HDFS in use. Good for testing and small-scale learning.
Fully Distributed Mode: Hadoop services run on a cluster of machines. It is the production environment where large datasets are processed.
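Switching to pseudo-distributed mode mainly means pointing the default filesystem at a local HDFS instance in `core-site.xml`. This fragment follows the standard single-node setup (host and port are the conventional values and may differ in your install):

```xml
<!-- core-site.xml for pseudo-distributed mode: use a local HDFS
     instance as the default filesystem. -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
```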
The Combiner is an optional mini-reducer that processes Map output before it is sent to the actual Reducer. It helps in reducing the amount of data transferred between the Map and Reduce phases, improving the efficiency of MapReduce jobs.
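The savings are easy to quantify. The sketch below (plain Python, not a real Hadoop job) counts how many key-value pairs one mapper would shuffle to the reducers with and without local pre-aggregation:

```python
# What a Combiner buys you: pre-aggregating a mapper's output locally
# means fewer (word, count) pairs cross the network to the reducers.
from collections import Counter

mapper_output = [("big", 1), ("data", 1), ("big", 1), ("big", 1), ("data", 1)]

# Without a combiner, every emitted pair is shuffled to the reducers.
shuffled_without = len(mapper_output)

# With a combiner, the mapper's pairs are summed per key first.
combined = Counter()
for word, count in mapper_output:
    combined[word] += count
shuffled_with = len(combined)  # one pair per distinct word

print(shuffled_without, shuffled_with)  # 5 2
print(dict(combined))  # {'big': 3, 'data': 2}
```

Note that a reduce function can only double as a combiner when it is associative and commutative, as summation is here; averaging, for instance, cannot be combined naively.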
Zookeeper is a centralized service used to maintain configuration information, provide distributed synchronization, and manage group services for large distributed systems like Hadoop. It helps in coordinating and managing the cluster to avoid inconsistencies.
HBase is a NoSQL, distributed, column-oriented database built on top of HDFS. It provides real-time read/write access to large datasets. Unlike HDFS, which is good for batch processing, HBase is ideal for real-time data access and storing sparse data.