
Execution Modes

Understanding Apache Spark Execution Environments

When you work with Apache Spark, you need to understand two important concepts that determine how your Spark applications run:

  1. Cluster Managers - These control where your application runs (the physical infrastructure)
  2. Execution Modes - These control how your application runs (the deployment strategy)

This guide explains both concepts in simple terms with practical examples to help you choose the right setup for your needs.

Cluster Managers: Where Your Application Runs

Cluster managers are like traffic controllers for your data processing jobs. They handle important tasks such as:

  • Allocating computing resources (CPU, memory) to your applications
  • Scheduling jobs to run efficiently
  • Monitoring application performance
  • Ensuring different applications don't interfere with each other

            +------------------+
            | Cluster Managers |
            +------------------+
                      |
   +-------+---------+-------+--------+
   |       |         |       |        |
 Local  Standalone  YARN    Mesos  Kubernetes

Types of Cluster Managers

1. Local

The Local cluster manager runs everything on a single computer, making it perfect for learning and testing.

Key features:

  • Runs on a single machine using multiple threads
  • Simulates a distributed environment without actual network communication
  • No setup required - comes built-in with Spark

When to use it:

  • Learning Spark for the first time
  • Developing and testing your applications
  • Processing small datasets that fit on one machine

Example configuration:

val spark = SparkSession.builder()
  .master("local[4]") // Run with 4 worker threads
  .appName("My Local App")
  .getOrCreate()

2. Standalone

Standalone is Spark's own built-in cluster manager that's easy to set up but still provides distributed processing.

Key features:

  • Simple setup with minimal configuration
  • Works well for small to medium-sized clusters
  • Provides basic resource scheduling
  • Doesn't require additional software

When to use it:

  • Small to medium production environments
  • When you don't need advanced resource management
  • When you want to avoid the complexity of other cluster managers

Architecture:


+----------------+
| Master Node    |
+----------------+
        |
+----------------+
| Worker Nodes   |
| [Executors]    |
+----------------+

Example configuration:


# Start a standalone master
./sbin/start-master.sh

# Start workers
./sbin/start-worker.sh spark://master-hostname:7077

# Submit application
spark-submit --master spark://master-hostname:7077 myapp.py

3. YARN (Yet Another Resource Negotiator)

YARN is the resource manager that comes with Hadoop and is widely used in enterprise environments.

Key features:

  • Enterprise-grade resource management
  • Tight integration with the Hadoop ecosystem
  • Advanced scheduling capabilities
  • Strong security features
  • Supports multiple applications sharing the same cluster

When to use it:

  • Enterprise production environments
  • When you're already using Hadoop
  • When you need advanced resource scheduling
  • Multi-tenant environments where many users share resources

Architecture:


+----------------+
| ResourceManager|
+----------------+
        |
+----------------+
| NodeManager(s) |
| [Containers]   |
+----------------+

Example configuration:


# Submit application to YARN
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 10 \
  --executor-memory 4g \
  myapp.py

4. Mesos

Mesos is a general-purpose cluster manager that can run many types of applications, not just Spark. Note that Spark deprecated its Mesos support in version 3.2, so new deployments should generally prefer YARN or Kubernetes.

Key features:

  • Supports multiple frameworks (not just Spark)
  • Fine-grained resource sharing
  • Custom resource scheduling policies
  • Works well in heterogeneous environments

When to use it:

  • When running multiple frameworks (not just Spark)
  • When you need fine-grained resource control
  • In diverse computing environments

Example configuration:


# Submit application to Mesos
spark-submit \
  --master mesos://mesos-master:5050 \
  --deploy-mode cluster \
  myapp.py

5. Kubernetes

Kubernetes is a modern container orchestration platform that's becoming increasingly popular for running Spark.

Key features:

  • Container-based deployment
  • Dynamic scaling capabilities
  • Self-healing when failures occur
  • Consistent environment across development and production
  • Strong isolation between applications

When to use it:

  • Cloud-native environments
  • When you're already using Kubernetes for other applications
  • When you need to scale resources dynamically
  • When you want consistent environments

Architecture:


+----------------+
| Control Plane  |
+----------------+
        |
+----------------+
| Worker Nodes   |
| [Pods]         |
+----------------+

Example configuration:


# Submit application to Kubernetes
spark-submit \
  --master k8s://kubernetes-master:443 \
  --deploy-mode cluster \
  --conf spark.kubernetes.container.image=spark:latest \
  myapp.py

Comparison of Cluster Managers

Feature              Local        Standalone      YARN                Mesos                Kubernetes
Setup Complexity     None         Low             Medium              High                 High
Resource Management  Basic        Basic           Advanced            Advanced             Advanced
Scalability          Limited      Medium          High                High                 High
Security             None         Basic           Advanced            Advanced             Advanced
Best For             Development  Small clusters  Hadoop integration  Multiple frameworks  Cloud-native

Execution Modes: How Your Application Runs

Execution modes determine how the components of your Spark application are distributed. There are three main execution modes, each with different characteristics.

1. Local Execution Mode

In Local mode, everything runs in a single Java Virtual Machine (JVM) process on your local machine.


+-----------------------------------------+
|            Single JVM Process           |
|  +-------------+    +--------------+    |
|  |   Driver    |    |  Executor    |    |
|  | (Thread 1)  |    | (Thread 2-N) |    |
|  +-------------+    +--------------+    |
+-----------------------------------------+

How it works:

  • Your entire Spark application runs on a single machine
  • The driver and executors are just different threads in the same process
  • No actual network communication happens

Configuration options:

  • local[1]: Uses a single thread, so tasks run one at a time
  • local[N]: Uses N worker threads for executing tasks (the driver shares the same JVM)
  • local[*]: Uses one worker thread per available CPU core

Example:


# Python example with PySpark
from pyspark.sql import SparkSession

# Create a session with local mode using all available cores
spark = SparkSession.builder \
    .appName("LocalModeExample") \
    .master("local[*]") \
    .getOrCreate()

# Now you can run Spark operations
df = spark.read.csv("data.csv", header=True)
result = df.groupBy("category").count()
result.show()

Best for:

  • Learning Spark
  • Development and testing
  • Processing small datasets
  • Running unit tests

2. Client Execution Mode

In Client mode, the driver runs on your client machine, while executors run on the cluster.


+-----------------+        +----------------------+
|  Client Machine |        |     Cluster Nodes    |
|  +-----------+  |        |  +----------------+  |
|  |  Driver   |  | <----> |  |   Executor 1   |  |
|  +-----------+  |        |  +----------------+  |
+-----------------+        |  |   Executor 2   |  |
                           |  +----------------+  |
                           +----------------------+

How it works:

  • The driver program runs on the machine where you submit the job
  • The driver coordinates the job and maintains the SparkContext
  • Executors run on cluster nodes and perform the actual data processing
  • Results are sent back to the driver

Important considerations:

  • Your client machine must have a network connection to the cluster
  • Your client machine must stay running for the entire job
  • If your client machine disconnects, the job will fail
  • The driver consumes resources on your client machine

Example:


# Submit a Spark job in client mode to YARN
spark-submit \
  --master yarn \
  --deploy-mode client \
  --num-executors 5 \
  --executor-memory 4g \
  --executor-cores 2 \
  my_spark_app.py

Best for:

  • Interactive data analysis
  • When you need to see results immediately
  • Debugging applications
  • Using Spark with notebooks (Jupyter, Zeppelin)
  • Ad-hoc queries

3. Cluster Execution Mode

In Cluster mode, both the driver and executors run on the cluster nodes.


+----------------------+
|     Cluster Nodes    |
|  +----------------+  |
|  |    Driver      |  |
|  +----------------+  |
|  |   Executor 1   |  |
|  +----------------+  |
|  |   Executor 2   |  |
|  +----------------+  |
+----------------------+

How it works:

  • The driver program runs on one of the worker nodes in the cluster
  • The cluster manager takes care of managing all resources
  • Your client machine only submits the job and can disconnect afterward
  • All processing happens within the cluster

Key advantages:

  • Your client machine can disconnect after submitting the job
  • Better fault tolerance (driver failures can be recovered)
  • More efficient resource utilization
  • Better for production workloads

Example:


# Submit a Spark job in cluster mode to YARN
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 10 \
  --executor-memory 8g \
  --executor-cores 4 \
  --driver-memory 4g \
  my_spark_app.py

Best for:

  • Production deployments
  • Long-running jobs
  • Scheduled batch processing
  • When client machine stability isn't guaranteed
  • Resource-intensive applications

Comparison of Execution Modes

Feature               Local Mode      Client Mode                     Cluster Mode
Driver Location       Client machine  Client machine                  Cluster node
Client Dependency     Always needed   Needed during execution         Only needed for submission
Network Requirements  None            Client must connect to cluster  Client only needs submission access
Fault Tolerance       Limited         Limited for driver              Better driver recovery
Best For              Development     Interactive analysis            Production workloads

Common Issues and Troubleshooting

Memory Issues

Symptom: Your Spark job fails with "OutOfMemoryError" or performs poorly.

Solution:

  • Increase driver memory: --driver-memory 4g
  • Increase executor memory: --executor-memory 8g
  • Adjust Spark memory fractions:
    --conf spark.memory.fraction=0.8
    --conf spark.memory.storageFraction=0.3

Slow Performance

Symptom: Your Spark job runs much slower than expected.

Solution:

  • Increase parallelism: Use more executors or cores
  • Check data skew: Ensure data is evenly distributed
  • Use proper partitioning: df.repartition(100)
  • Enable dynamic allocation:
    --conf spark.dynamicAllocation.enabled=true
    --conf spark.shuffle.service.enabled=true

Connection Issues

Symptom: Your Spark job fails to connect to the cluster.

Solution:

  • Check network connectivity between client and cluster
  • Verify cluster manager is running
  • Ensure proper authentication credentials
  • Check firewall settings
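
A quick way to test basic reachability is a plain TCP probe from the client machine. The sketch below uses only the Python standard library; the host name and port are placeholders for your own cluster manager:

```python
# Quick TCP reachability check; host/port below are placeholders.
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts
        return False


# Example: probe the default standalone master port
if can_connect("master-hostname", 7077):
    print("Cluster manager port is reachable")
else:
    print("Cannot reach the cluster manager - check DNS, VPN, and firewall")
```

This only verifies that the port accepts connections; authentication and cluster-manager health still need to be checked separately.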

Summary

Choosing the right combination of cluster manager and execution mode depends on your specific needs:

  • For learning and development: Local cluster manager with local execution mode
  • For interactive analysis: Standalone/YARN/Kubernetes with client execution mode
  • For production workloads: YARN/Kubernetes with cluster execution mode

By understanding these concepts, you can make informed decisions about how to deploy your Spark applications for optimal performance and reliability.

Related Articles

  • Introduction
  • Installation
  • Architecture
  • Execution Modes
  • Spark Submit Command
  • Spark Core: RDD
  • DataFrames and Datasets
  • Data Sources and Formats
  • Spark SQL
  • Spark Structured Streaming
  • Spark Unstructured Streaming
  • Performance Tuning
  • Machine Learning with MLlib
  • Graph Processing with GraphX
  • Advanced Spark Concepts
  • Deployment and Production
  • Real-world Applications
  • Integration with Big Data Ecosystem
  • Best Practices and Design Patterns
  • Hands-on Projects