This tutorial delves into the fundamental concepts of Apache Spark's core, including Resilient Distributed Datasets (RDDs), partitioning strategies, shuffle operations, and the Spark execution model.
RDDs are the basic abstraction in Spark. They are fault-tolerant, immutable collections of elements that can be processed in parallel across the cluster.
RDDs can be created in several ways, for example by parallelizing an existing collection or by loading data from external storage such as a text file:
From an existing collection (Python):
from pyspark.sql import SparkSession
from pyspark import StorageLevel

spark = SparkSession.builder.appName("example").getOrCreate()
data = [1, 2, 3, 4, 5]
rdd = spark.sparkContext.parallelize(data)
From an existing collection (Java):
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.storage.StorageLevel;
import java.util.Arrays;
import java.util.List;

SparkSession spark = SparkSession.builder().appName("example").getOrCreate();
// Wrap the SparkContext in a JavaSparkContext to get the Java-friendly API
JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());
List<Integer> data = Arrays.asList(1, 2, 3, 4, 5);
JavaRDD<Integer> rdd = jsc.parallelize(data);
From an existing collection (Scala):
import org.apache.spark.sql.SparkSession
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder.appName("example").getOrCreate()
val data = Seq(1, 2, 3, 4, 5)
val rdd = spark.sparkContext.parallelize(data)
From a text file (Python):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("example").getOrCreate()
rdd = spark.sparkContext.textFile("data.txt")
From a text file (Java):
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder().appName("example").getOrCreate();
JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());
JavaRDD<String> rdd = jsc.textFile("data.txt");
From a text file (Scala):
import org.apache.spark.sql.SparkSession
import org.apache.spark.rdd.RDD

val spark = SparkSession.builder.appName("example").getOrCreate()
val rdd = spark.sparkContext.textFile("data.txt")
RDD operations are divided into two types: transformations and actions.
Transformations create new RDDs from existing ones. They are lazily evaluated: nothing is computed until an action is called (a short sketch after the examples below illustrates this).
Examples:
map()
: Applies a function to each element of the RDD.
Python: rdd2 = rdd.map(lambda x: x * 2)
Java: JavaRDD<Integer> rdd2 = rdd.map(x -> x * 2);
Scala: val rdd2 = rdd.map(x => x * 2)
filter()
: Filters elements based on a condition.
Python: rdd2 = rdd.filter(lambda x: x % 2 == 0)
Java: JavaRDD<Integer> rdd2 = rdd.filter(x -> x % 2 == 0);
Scala: val rdd2 = rdd.filter(x => x % 2 == 0)
flatMap()
: Applies a function that returns a sequence of elements, and then flattens the results.
Python: rdd2 = rdd.flatMap(lambda x: [x, x * 2])
Java: JavaRDD<Integer> rdd2 = rdd.flatMap(x -> Arrays.asList(x, x * 2).iterator());  // the Java API expects an Iterator
Scala: val rdd2 = rdd.flatMap(x => Seq(x, x * 2))
reduceByKey()
: Merges the values for each key using a reducer function.
Python:
rdd2 = rdd.map(lambda x: (x % 2, x)) \
    .reduceByKey(lambda a, b: a + b)
Java:
// needs: import scala.Tuple2; import org.apache.spark.api.java.JavaPairRDD;
JavaPairRDD<Integer, Integer> rdd2 = rdd.mapToPair(x -> new Tuple2<>(x % 2, x))
    .reduceByKey((a, b) -> a + b);
Scala:
val rdd2 = rdd.map(x => (x % 2, x))
    .reduceByKey(_ + _)
sortByKey()
: Sorts a pair RDD by key.
Python:
rdd2 = rdd.map(lambda x: (x % 2, x)) \
    .sortByKey()
Java:
JavaPairRDD<Integer, Integer> rdd2 = rdd.mapToPair(x -> new Tuple2<>(x % 2, x))
    .sortByKey();
Scala:
val rdd2 = rdd.map(x => (x % 2, x))
    .sortByKey()
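Before moving on to actions, here is a minimal PySpark sketch of the lazy evaluation described above (the variable names and app name are illustrative): the transformations only record the lineage, and the job runs only when the action at the end is called.

Python:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lazy-eval").getOrCreate()

numbers = spark.sparkContext.parallelize(range(1, 1001))

# These transformations return immediately; no Spark job has started yet.
doubled = numbers.map(lambda x: x * 2)
evens = doubled.filter(lambda x: x % 4 == 0)

# Only this action triggers execution of the whole lineage.
print(evens.count())  # 500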
Actions trigger the execution of the RDD computation graph and return a value.
Examples:
collect()
: Returns all elements of the RDD to the driver. Use it with care on large RDDs, since the entire dataset must fit in driver memory.
Python: result = rdd.collect()
Java: List<Integer> result = rdd.collect();
Scala: val result = rdd.collect()
count()
: Returns the number of elements in the RDD.
Python: count = rdd.count()
Java: long count = rdd.count();
Scala: val count = rdd.count()
first()
: Returns the first element of the RDD.
Python: first_element = rdd.first()
Java: Integer firstElement = rdd.first();
Scala: val firstElement = rdd.first()
reduce()
: Reduces the elements of the RDD to a single value using a binary function.
Python: total = rdd.reduce(lambda a, b: a + b)
Java: int total = rdd.reduce((a, b) -> a + b);
Scala: val total = rdd.reduce(_ + _)
take()
: Returns the first n elements of the RDD.
Python: first_three = rdd.take(3)
Java: List<Integer> firstThree = rdd.take(3);
Scala: val firstThree = rdd.take(3)
RDDs can be persisted in memory or on disk so that they can be reused in later computations without being recomputed. Note that cache() is shorthand for persist(StorageLevel.MEMORY_ONLY), and an RDD can only be assigned one storage level, so use one of the two calls below, not both.
Python:
rdd.cache()
rdd.persist(StorageLevel.MEMORY_AND_DISK)
Java:
rdd.cache();
rdd.persist(StorageLevel.MEMORY_AND_DISK());
Scala:
rdd.cache()
rdd.persist(StorageLevel.MEMORY_AND_DISK)
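To show where persistence pays off, here is a small PySpark sketch (the variable names are illustrative, and data.txt is the same placeholder file used earlier): the split-up RDD feeds two different actions, so persisting it avoids reading and splitting the file twice.

Python:
from pyspark.sql import SparkSession
from pyspark import StorageLevel

spark = SparkSession.builder.appName("persist-example").getOrCreate()

lines = spark.sparkContext.textFile("data.txt")
words = lines.flatMap(lambda line: line.split())
words.persist(StorageLevel.MEMORY_AND_DISK)

# Both actions reuse the persisted RDD; without persist(), the file would be
# read and split again for the second action.
print(words.count())
print(words.take(5))

words.unpersist()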
RDDs can be partitioned to distribute the data across the cluster; the number of partitions can be set when the RDD is created.
Python:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("example").getOrCreate()
data = [1, 2, 3, 4, 5]
rdd = spark.sparkContext.parallelize(data, 4)  # split into 4 partitions
Java:
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;
import java.util.Arrays;
import java.util.List;

SparkSession spark = SparkSession.builder().appName("example").getOrCreate();
JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());
List<Integer> data = Arrays.asList(1, 2, 3, 4, 5);
JavaRDD<Integer> rdd = jsc.parallelize(data, 4);  // split into 4 partitions
Scala:
import org.apache.spark.sql.SparkSession
import org.apache.spark.rdd.RDD

val spark = SparkSession.builder.appName("example").getOrCreate()
val data = Seq(1, 2, 3, 4, 5)
val rdd = spark.sparkContext.parallelize(data, 4)  // split into 4 partitions
Partitioning determines how an RDD's data is split across the executors, and therefore how much parallelism is available when it is processed; the partition count can be inspected and changed, as shown below.
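A short PySpark sketch of inspecting and changing partitioning (a standalone example; the app name and data are made up for illustration):

Python:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-example").getOrCreate()
rdd = spark.sparkContext.parallelize(range(100), 4)

print(rdd.getNumPartitions())  # 4, as requested at creation time

# repartition() changes the number of partitions via a full shuffle,
# while coalesce() can reduce the count without a full shuffle.
print(rdd.repartition(8).getNumPartitions())  # 8
print(rdd.coalesce(2).getNumPartitions())     # 2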
Shuffle operations redistribute data across partitions, typically so that all records with the same key end up in the same partition; because they move data over the network, they are among the most expensive operations in Spark.
Examples (a short sketch follows this list):
groupByKey()
reduceByKey()
sortByKey()
join()
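As a minimal illustration (the word list is made up for this sketch), here is a PySpark word count in which reduceByKey() shuffles the pairs so that all counts for the same word land in the same partition:

Python:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("shuffle-example").getOrCreate()

words = spark.sparkContext.parallelize(
    ["spark", "rdd", "spark", "shuffle", "rdd", "spark"], 3)

counts = (words.map(lambda w: (w, 1))             # narrow transformation, no shuffle
               .reduceByKey(lambda a, b: a + b))  # wide transformation, triggers a shuffle

print(counts.collect())  # e.g. [('spark', 3), ('rdd', 2), ('shuffle', 1)]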
This tutorial provided an overview of the core concepts in Apache Spark: RDDs and their transformations and actions, persistence, partitioning, and the shuffle operations that redistribute data across the cluster.