Big Data refers to large volumes of structured, semi-structured, and unstructured data that cannot be efficiently processed using traditional data processing methods. It is characterized by the 4 Vs: Volume, Variety, Velocity, and Veracity.
HDFS (Hadoop Distributed File System): For storing large datasets.
YARN (Yet Another Resource Negotiator): Manages resources.
MapReduce: For processing large datasets.
Hive, Pig, Sqoop, Flume, Oozie: Additional tools for querying, moving, and managing data.
Hadoop: Primarily relies on HDFS and MapReduce for data storage and batch processing. It’s disk-based and suitable for long-running batch jobs.
Spark: An in-memory data processing engine that is faster for both batch and real-time data processing. It supports a wider range of tasks, including machine learning and streaming.
HDFS is a distributed file system that splits large files into smaller blocks (typically 128 MB or 256 MB) and stores them across multiple nodes. It has two main components:
NameNode: Manages the metadata of the files (i.e., file names, block locations).
DataNode: Stores the actual data blocks.
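As a quick illustration of how block storage surfaces to applications, here is a minimal PySpark sketch; the NameNode address and file path are placeholders. When Spark reads a file from HDFS, it typically creates roughly one partition per HDFS block.

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "HDFSBlocksExample")
    # Placeholder NameNode host/port and file path.
    logs = sc.textFile("hdfs://namenode:8020/data/access_logs.txt")
    # Roughly one partition per HDFS block of the file.
    print(logs.getNumPartitions())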
RDDs (Resilient Distributed Datasets) are Spark's primary data abstraction: a distributed collection of objects. RDDs allow parallel operations on large datasets across multiple nodes and support fault tolerance through their lineage of transformations.
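A minimal PySpark sketch of working with an RDD (the data is illustrative):

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "RDDExample")
    numbers = sc.parallelize(range(1, 11), numSlices=4)   # distributed collection in 4 partitions
    squares = numbers.map(lambda x: x * x)                # transformation (lazy)
    print(squares.reduce(lambda a, b: a + b))             # action triggers the computation -> 385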
Spark Streaming is a component of Spark that processes real-time data streams. It ingests data in mini-batches, processes the data, and then stores the results. It supports sources like Kafka, HDFS, and Flume.
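A minimal DStream word-count sketch using the Spark Streaming API; the host and port are placeholders, and newer applications often use Structured Streaming instead:

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext("local[2]", "StreamingWordCount")
    ssc = StreamingContext(sc, batchDuration=5)            # 5-second mini-batches
    lines = ssc.socketTextStream("localhost", 9999)        # placeholder socket source
    counts = (lines.flatMap(lambda line: line.split(" "))
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))
    counts.pprint()
    ssc.start()
    ssc.awaitTermination()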
MapReduce: Writes intermediate data to disk between each step, making it slower.
Spark: Uses in-memory processing, which allows it to perform tasks much faster by reducing the need for disk I/O.
YARN (Yet Another Resource Negotiator) is Hadoop's resource management layer. It assigns resources to various applications running in the cluster and schedules tasks.
Hive is a data warehousing tool in the Hadoop ecosystem that allows for SQL-like querying of large datasets stored in HDFS. It simplifies querying and analysis using HQL (Hive Query Language).
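HQL reads like standard SQL. One way to run such a query programmatically is through Spark with Hive support enabled; this sketch assumes a configured Hive metastore, and the web_logs table is hypothetical:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("HiveQuery").enableHiveSupport().getOrCreate()
    spark.sql("""
        SELECT page, COUNT(*) AS hits
        FROM web_logs
        GROUP BY page
        ORDER BY hits DESC
        LIMIT 10
    """).show()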
Apache Flume is a tool for ingesting large amounts of streaming data into Hadoop. It is often used to move log data from web servers to HDFS or HBase.
RDD: Provides a low-level API for distributed data processing.
DataFrame: Higher-level abstraction built on top of RDDs that provides better optimization using Spark's Catalyst Optimizer and is easier to work with for SQL-like operations; a short sketch contrasting the two follows.
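A minimal PySpark sketch of the same data handled through both APIs (the records are illustrative):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("RDDvsDataFrame").getOrCreate()

    # Low-level RDD API: you describe how to process each record.
    rdd = spark.sparkContext.parallelize([("alice", 34), ("bob", 29)])
    adults_rdd = rdd.filter(lambda row: row[1] > 30)

    # DataFrame API: you describe what you want; Catalyst plans the execution.
    df = spark.createDataFrame(rdd, ["name", "age"])
    df.filter(df.age > 30).show()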
Apache Kafka is a distributed streaming platform used for building real-time data pipelines. It integrates with Spark Streaming to provide real-time processing capabilities by acting as a message broker.
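A minimal sketch of consuming a Kafka topic from Spark, shown here with the Structured Streaming Kafka source rather than the older DStream integration; the broker address and topic name are placeholders, and the spark-sql-kafka connector package must be on the classpath:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("KafkaIngest").getOrCreate()

    events = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "localhost:9092")   # placeholder broker
              .option("subscribe", "events")                          # placeholder topic
              .load())

    query = (events.selectExpr("CAST(value AS STRING) AS message")
             .writeStream
             .format("console")
             .start())
    query.awaitTermination()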
In Spark, DAG represents the sequence of operations (transformations and actions) on RDDs. Spark’s DAG scheduler optimizes the execution plan by breaking the workflow into stages and tasks.
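A small PySpark sketch showing where a stage boundary appears in the DAG (the data is illustrative):

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "DAGExample")
    rdd = sc.parallelize(range(1, 1001))
    pairs = rdd.map(lambda x: (x % 10, x))          # narrow transformation: stays in the same stage
    sums = pairs.reduceByKey(lambda a, b: a + b)    # wide transformation: the shuffle creates a stage boundary
    print(sums.collect())                           # action: the DAG scheduler builds stages and schedules tasks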
Oozie is a workflow scheduling tool that helps automate and manage jobs in the Hadoop ecosystem. It allows scheduling and coordinating of tasks like MapReduce, Pig, and Hive jobs.
Partitioning refers to dividing data into smaller chunks to be processed in parallel across multiple nodes in a Hadoop cluster. Each partition is assigned to a node for efficient processing.
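The same idea appears in Spark, where an RDD's partitions are the units of parallelism. A minimal PySpark sketch (the partition counts are illustrative):

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "PartitioningExample")
    rdd = sc.parallelize(range(100), numSlices=4)   # split the data into 4 partitions
    print(rdd.getNumPartitions())                   # 4
    wider = rdd.repartition(8)                      # redistribute into 8 partitions for more parallelism
    print(wider.getNumPartitions())                 # 8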
Inner Join: Returns records with matching keys.
Left Outer Join: Returns all records from the left table, even if there are no matches in the right table.
Right Outer Join: Returns all records from the right table, even if there are no matches in the left table.
Full Outer Join: Returns all records from both tables, with NULLs filled in where there is no match on the other side (a sketch of these join types follows this list).
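A minimal PySpark DataFrame sketch of the four join types; the tables and keys are illustrative:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("JoinTypes").getOrCreate()
    customers = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])
    orders = spark.createDataFrame([(1, 250.0), (3, 80.0)], ["id", "amount"])

    customers.join(orders, on="id", how="inner").show()   # only id 1 matches
    customers.join(orders, on="id", how="left").show()    # all customers, NULL amount for bob
    customers.join(orders, on="id", how="right").show()   # all orders, NULL name for id 3
    customers.join(orders, on="id", how="outer").show()   # all rows from both sides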
The combiner is an optional component in MapReduce that performs local aggregation of data before sending it to the reducer. This helps to minimize the amount of data transferred between the map and reduce phases, optimizing performance.
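Hadoop combiners are typically written in Java, but a loose analogy can be sketched in PySpark: reduceByKey performs a map-side combine within each partition before the shuffle, much like a combiner, whereas groupByKey ships every record across the network first.

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "MapSideCombine")
    pairs = sc.parallelize([("a", 1), ("b", 1), ("a", 1), ("a", 1)], numSlices=2)

    # reduceByKey aggregates within each partition before shuffling (combiner-like behavior).
    print(pairs.reduceByKey(lambda a, b: a + b).collect())

    # groupByKey moves every (key, value) pair across the shuffle and aggregates afterwards.
    print(pairs.groupByKey().mapValues(sum).collect())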
In Spark, transformations like map() or filter() are not immediately executed. Instead, Spark builds a lineage of transformations and executes them only when an action (e.g., count(), collect()) is called. This is known as lazy evaluation, which helps in optimization.
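A minimal PySpark sketch of lazy evaluation (the data is illustrative):

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "LazyEvaluation")
    rdd = sc.parallelize(range(10))
    evens = rdd.filter(lambda x: x % 2 == 0)   # transformation: nothing runs yet, lineage is recorded
    doubled = evens.map(lambda x: x * 2)       # still nothing runs
    print(doubled.count())                     # action: Spark plans and executes the whole lineage -> 5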
Use the correct level of parallelism by tuning the number of partitions.
Minimize shuffles by limiting wide transformations (e.g., prefer reduceByKey over groupByKey).
Use cache and persist to store intermediate results in memory.
Optimize the serialization format (e.g., using Kryo for faster serialization). A combined sketch of these tuning options follows this list.
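A combined PySpark sketch of these tuning knobs; the path, partition count, and field layout are illustrative:

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName("TuningSketch")
            .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer"))  # faster serialization
    sc = SparkContext(conf=conf)

    events = sc.textFile("hdfs:///data/events", minPartitions=100)        # tune the level of parallelism
    parsed = events.map(lambda line: line.split(",")).cache()             # keep intermediate results in memory
    print(parsed.count())                                                  # first action materializes the cache
    print(parsed.filter(lambda fields: fields[0] == "click").count())      # reuses the cached data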
Catalyst is Spark SQL’s query optimizer that helps to optimize logical query plans by applying several rules. It leverages both logical and physical optimization techniques to generate efficient execution plans.
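One way to observe Catalyst at work is to ask a DataFrame for its query plans; this sketch uses an illustrative range-based DataFrame:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("CatalystExample").getOrCreate()
    df = spark.range(1000).withColumnRenamed("id", "value")
    filtered = df.filter("value > 990").select("value")
    filtered.explain(True)   # prints the parsed, analyzed, and optimized logical plans plus the physical plan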