Apache Spark adopts a distributed architecture to achieve scalability and fault tolerance. The key components of the architecture are:
Driver Program: Acts as the central coordinator. The driver hosts the SparkContext (or SparkSession), converts user code into a DAG of stages and tasks, schedules those tasks across the executors, and collects the results.
Cluster Manager: Allocates cluster resources (CPU, memory) to Spark applications. It can be Spark's built-in standalone manager, Hadoop YARN, Kubernetes, or Apache Mesos (deprecated).
Executors: Worker processes that run the tasks assigned by the driver. Each executor runs tasks in parallel threads, caches data in memory or on disk, and reports status and results back to the driver.
      +-----------------------+
      |     Driver Program    |
      |     (SparkContext)    |
      +-----------------------+
                  |
                  v
      +-----------------------+
      |    Cluster Manager    |
      |  (YARN, Kubernetes)   |
      +-----------------------+
               /     \
              /       \
             v         v
     +----------+   +----------+
     | Executor |   | Executor |
     |    #1    |   |    #2    |
     +----------+   +----------+
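As a minimal sketch of how these components fit together, an application can tell the cluster manager how many executors to launch and how large they should be when the session is created. The resource values below are illustrative placeholders, not recommendations:

from pyspark.sql import SparkSession

# The driver process hosts this SparkSession; the cluster manager uses the
# settings below to size and place the executors. All numbers are placeholders.
spark = SparkSession.builder \
    .appName("ArchitectureSketch") \
    .config("spark.executor.instances", "2") \
    .config("spark.executor.cores", "2") \
    .config("spark.executor.memory", "2g") \
    .getOrCreate()

# The driver breaks this job into tasks and schedules them on the executors
print(spark.sparkContext.parallelize(range(10)).sum())

spark.stop()

In local mode there is no separate cluster manager and everything runs inside the driver process, so these settings mainly matter on a real cluster.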
Spark can run on several cluster managers, each providing different capabilities for resource management: its built-in standalone manager, Hadoop YARN, Kubernetes, and Apache Mesos (deprecated since Spark 3.2).
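How an application attaches to a particular cluster manager is usually controlled by the master URL, set on the session builder or passed to spark-submit via --master. The sketch below lists the common URL forms; the hostnames and ports are placeholders for illustration only:

from pyspark.sql import SparkSession

# Each master URL targets a different cluster manager.
# local[*]                        -> run locally, one worker thread per core
# spark://master-host:7077        -> Spark's standalone cluster manager
# yarn                            -> Hadoop YARN (reads HADOOP_CONF_DIR)
# k8s://https://api-server:6443   -> Kubernetes
spark = SparkSession.builder \
    .appName("ClusterManagerSketch") \
    .master("local[*]") \
    .getOrCreate()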
Spark applications follow a specific data flow: transformations derive new RDDs from existing ones and are evaluated lazily, while actions trigger the actual computation and return results to the driver.
+-----------+      +-----------+      +-----------+
|   RDD 1   |  ->  |   RDD 2   |  ->  |   RDD 3   |
+-----------+      +-----------+      +-----------+
                         |                  |
                         v                  v
                 +--------------+     +-----------+
                 |Transformation|     |  Action   |
                 +--------------+     +-----------+
from pyspark.sql import SparkSession

# Initialize Spark session
spark = SparkSession.builder \
    .appName("SparkExample") \
    .getOrCreate()

# Create RDD from list
data = [1, 2, 3, 4, 5]
rdd = spark.sparkContext.parallelize(data)

# Transformation: Square each element
rdd_squared = rdd.map(lambda x: x ** 2)

# Action: Collect results
result = rdd_squared.collect()
print(result)  # Output: [1, 4, 9, 16, 25]
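One detail the diagram above glosses over: transformations are lazy. Calling map or filter only records the step in the RDD's lineage, and no work happens on the executors until an action runs. A small sketch building on rdd_squared from the example above (note that RDD.toDebugString() returns bytes in PySpark, hence the decode):

# Transformation: lazy, returns immediately without running a job
evens = rdd_squared.filter(lambda x: x % 2 == 0)

# Inspect the recorded lineage; toDebugString() returns bytes in PySpark
print(evens.toDebugString().decode("utf-8"))

# Action: only now is the computation actually executed on the executors
print(evens.count())  # Output: 2  (the even squares: 4 and 16)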
import java.util.Arrays;
import java.util.List;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;

public class SparkExample {
    public static void main(String[] args) {
        // Initialize Spark
        SparkSession spark = SparkSession.builder()
                .appName("SparkExample")
                .getOrCreate();
        JavaSparkContext sc = new JavaSparkContext(spark.sparkContext());

        // Create RDD from list
        JavaRDD<Integer> rdd = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));

        // Transformation: Square each element
        JavaRDD<Integer> squared = rdd.map(x -> x * x);

        // Action: Collect results
        List<Integer> result = squared.collect();
        System.out.println(result); // Output: [1, 4, 9, 16, 25]
    }
}
import org.apache.spark.sql.SparkSession

object SparkExample {
  def main(args: Array[String]): Unit = {
    // Initialize Spark
    val spark = SparkSession.builder
      .appName("SparkExample")
      .getOrCreate()

    // Create RDD from list
    val data = List(1, 2, 3, 4, 5)
    val rdd = spark.sparkContext.parallelize(data)

    // Transformation: Square each element
    val squared = rdd.map(x => x * x)

    // Action: Collect results
    val result = squared.collect()
    println(result.mkString(", ")) // Output: 1, 4, 9, 16, 25
  }
}
Apache Spark has revolutionized big data processing with its speed, ease of use, and versatility. Whether you're performing batch processing, real-time analytics, machine learning, or graph computations, Spark provides a unified platform that can handle diverse workloads efficiently.
In the next tutorial, we'll explore different execution modes and cluster managers in Apache Spark to help you choose the right deployment strategy for your needs.