This tutorial covers performance tuning techniques for Apache Spark.
Spark's memory is divided into several regions: execution memory (used for shuffles, joins, sorts, and aggregations), storage memory (used for cached data and broadcasts), user memory, and a small reserved region.
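As a hedged illustration of how these regions are tuned (the values shown are simply the defaults, not recommendations, and the app name is hypothetical), the unified-memory fractions can be set when building the session:

```
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("MemoryTuningExample")                 # hypothetical app name
    .config("spark.memory.fraction", "0.6")         # fraction of heap shared by execution and storage (default 0.6)
    .config("spark.memory.storageFraction", "0.5")  # portion of that pool protected for cached data (default 0.5)
    .getOrCreate()
)
```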
Use the `MEMORY_AND_DISK` storage level when cached data may not fit entirely in memory; partitions that overflow are spilled to disk instead of being recomputed:

```
from pyspark import StorageLevel

rdd.persist(StorageLevel.MEMORY_AND_DISK)
```
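A related housekeeping step, shown as a small sketch rather than part of the original example: release cached data once it is no longer needed so the storage region is freed for other work.

```
rdd.unpersist()  # evict the cached partitions from memory and disk
```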
Avoid unnecessary shuffles: wide operations such as joins, groupBys, and repartitioning move data across the network and are among the most expensive steps in a job. When you do need to change the partition count, set it explicitly:

```
df = df.repartition(10)
```

Note that `repartition()` itself performs a full shuffle, so use it deliberately; prefer `coalesce()` (covered below) when only reducing the number of partitions.
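One common way to shrink shuffle traffic, sketched here as an illustration (the `pairs` RDD is a hypothetical example, not from the original tutorial), is to aggregate with `reduceByKey` instead of collecting all values with `groupByKey`:

```
pairs = spark.sparkContext.parallelize([("a", 1), ("b", 2), ("a", 3)])

# groupByKey ships every (key, value) pair across the network before aggregating;
# reduceByKey combines values within each partition first, so less data is shuffled
totals = pairs.reduceByKey(lambda x, y: x + y)
print(totals.collect())  # [('a', 4), ('b', 2)] (order may vary)
```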
Java serialization is Spark's default serialization method.
Kryo serialization is faster and more compact than Java serialization.
Configure Kryo:

```
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("KryoExample")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)
```
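If jobs serialize very large objects, the Kryo buffer limit may also need to grow. A minimal sketch, assuming the default limit is too small for your records (the app name and buffer size are illustrative):

```
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("KryoBufferExample")                       # hypothetical app name
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .config("spark.kryoserializer.buffer.max", "256m")  # illustrative value; raise if Kryo reports buffer overflows
    .getOrCreate()
)
```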
Broadcast variables allow you to efficiently distribute read-only data to all executors.
```
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("BroadcastExample").getOrCreate()

# Ship the lookup table to every executor once, instead of with every task
values = {"a": 1, "b": 2, "c": 3}
broadcast_values = spark.sparkContext.broadcast(values)

rdd = spark.sparkContext.parallelize(["a", "b", "c"])
rdd2 = rdd.map(lambda x: broadcast_values.value[x])
print(rdd2.collect())  # [1, 2, 3]
```
Accumulators are variables that executors can update in parallel; only the driver can read their value.
```
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("AccumulatorExample").getOrCreate()

# Each task adds its contribution; the driver reads the final total
accumulator = spark.sparkContext.accumulator(0)

rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])
rdd.foreach(lambda x: accumulator.add(x))
print(accumulator.value)  # 15
```
`coalesce()` reduces the number of partitions by merging existing ones, avoiding the full shuffle that `repartition()` performs:

```
rdd2 = rdd.coalesce(2)
```
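A small sketch (the partition counts are illustrative) showing that `coalesce` only merges the partitions that already exist:

```
rdd = spark.sparkContext.parallelize(range(100), 8)
print(rdd.getNumPartitions())   # 8
rdd2 = rdd.coalesce(2)
print(rdd2.getNumPartitions())  # 2, merged without a full shuffle
```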
A broadcast (hash) join is suitable when one table is small enough to fit in executor memory; Spark ships the small table to every executor so the large table never needs to be shuffled:

```
from pyspark.sql.functions import broadcast

joined = df1.join(broadcast(df2), df1["key"] == df2["key"])
```
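Spark also broadcasts small tables automatically when their estimated size falls below `spark.sql.autoBroadcastJoinThreshold` (10 MB by default). A short sketch of raising that limit, with an illustrative value:

```
# Raise the automatic broadcast threshold to 50 MB (illustrative; default is 10 MB)
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 50 * 1024 * 1024)
```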
Sort-merge join is the default join strategy for large tables; it is enabled by default through the `spark.sql.join.preferSortMergeJoin` setting, and Spark falls back to it when neither side is small enough to broadcast.
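To confirm which strategy the optimizer actually picked, inspect the physical plan. A minimal sketch, assuming `df1` and `df2` from the earlier join example:

```
df1.join(df2, df1["key"] == df2["key"]).explain()
# Look for BroadcastHashJoin or SortMergeJoin in the printed physical plan
```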
- `spark.executor.cores`: the number of CPU cores assigned to each executor.
- `spark.executor.memory`: the amount of heap memory allocated to each executor (for example, `4g`).
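These properties can also be set when building the session (or passed to `spark-submit`). A sketch with placeholder values; the app name and resource sizes are illustrative and should be tuned to your cluster:

```
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("ResourceConfigExample")       # hypothetical app name
    .config("spark.executor.cores", "4")    # illustrative values; tune to your cluster
    .config("spark.executor.memory", "8g")
    .getOrCreate()
)
```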
Access the Spark web UI at http://<driver-node>:4040 while the application is running to inspect jobs, stages, storage, and executors.
The port is configured with the `spark.ui.port` property in `spark-defaults.conf` or through `--conf` on `spark-submit`.