Data Sources and Formats

This tutorial explores how to read data from and write data to various data sources and formats in Apache Spark.

Built-In Data Sources (CSV, JSON, Parquet, Avro, ORC)

Spark supports several built-in data sources, each with shorthand reader and writer methods; a generic format/load equivalent is shown after the list.

  • CSV:

```python
df = spark.read.csv("data.csv", header=True, inferSchema=True)
df.write.csv("output.csv", header=True)
```

  • JSON:

```python
df = spark.read.json("data.json")
df.write.json("output.json")
```

  • Parquet:

```python
df = spark.read.parquet("data.parquet")
df.write.parquet("output.parquet")
```

  • Avro (requires the spark-avro module on the classpath):

df = spark.read.format("avro").load("data.avro") df.write.format("avro").save("output.avro") ```

  • ORC:

```python
df = spark.read.orc("data.orc")
df.write.orc("output.orc")
```
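Each shorthand above is a thin wrapper over the generic format/load and format/save API, which takes per-source options as strings. The CSV pair, for example, is equivalent to:

```python
# Generic equivalent of the CSV shorthand above.
df = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("data.csv")

df.write.format("csv") \
    .option("header", "true") \
    .mode("overwrite") \
    .save("output.csv")
```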

JDBC/ODBC Connectivity

Spark connects to relational databases through its built-in JDBC data source; the JDBC driver for the target database (here, the MySQL connector) must be on the classpath. External BI tools can in turn query Spark over JDBC/ODBC via the Spark Thrift Server.

df = spark.read.format("jdbc") \
 .option("url", "jdbc:mysql://localhost:3306/mydb") \
 .option("dbtable", "mytable") \
 .option("user", "myuser") \
 .option("password", "mypassword") \
 .load()

df.write.format("jdbc") \
 .option("url", "jdbc:mysql://localhost:3306/mydb") \
 .option("dbtable", "mytable") \
 .option("user", "myuser") \
 .option("password", "mypassword") \
 .mode("append") \
 .save()

Reading From and Writing to Databases

Reading from and writing to relational databases uses the JDBC options shown in the previous section. For large tables, the read can be parallelized by partitioning on a numeric column, as sketched below.
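A minimal sketch of a parallel JDBC read, assuming mytable has a numeric id column with roughly known bounds; Spark splits the [lowerBound, upperBound] range into numPartitions slices and issues one query per slice:

```python
# Hypothetical bounds for the id column; adjust to the actual data range.
df = spark.read.format("jdbc") \
    .option("url", "jdbc:mysql://localhost:3306/mydb") \
    .option("dbtable", "mytable") \
    .option("user", "myuser") \
    .option("password", "mypassword") \
    .option("partitionColumn", "id") \
    .option("lowerBound", "1") \
    .option("upperBound", "1000000") \
    .option("numPartitions", "8") \
    .load()
```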

Working with Cloud Storage (S3, Azure Blob, GCS)

Spark can read data from and write data to cloud object stores, provided the matching connector and credentials are configured (see the sketch after this list).

  • S3:

df = spark.read.parquet("s3a://mybucket/data.parquet") df.write.parquet("s3a://mybucket/output.parquet") ```

  • Azure Blob:

df = spark.read.parquet("wasbs://mycontainer@myaccount.blob.core.windows.net/data.parquet") df.write.parquet("wasbs://mycontainer@myaccount.blob.core.windows.net/output.parquet") ```

  • GCS:

df = spark.read.parquet("gs://mybucket/data.parquet") df.write.parquet("gs://mybucket/output.parquet") ```

Custom Data Source Implementation

You can implement custom data sources for systems Spark does not support natively by extending its data source APIs: the DataSource V2 interfaces in Scala/Java or, since Spark 4.0, the Python Data Source API.
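A minimal read-only sketch, assuming Spark 4.0's Python Data Source API (pyspark.sql.datasource); the class and format names here are illustrative:

```python
from pyspark.sql.datasource import DataSource, DataSourceReader

class DemoDataSource(DataSource):
    """Hypothetical source that emits two fixed rows."""

    @classmethod
    def name(cls):
        # Format name used in spark.read.format(...).
        return "demo"

    def schema(self):
        # Schema of the rows this source produces, as a DDL string.
        return "id INT, value STRING"

    def reader(self, schema):
        return DemoDataSourceReader()

class DemoDataSourceReader(DataSourceReader):
    def read(self, partition):
        # Yield tuples matching the declared schema.
        yield (0, "a")
        yield (1, "b")

# Register the source, then use it like any built-in format.
spark.dataSource.register(DemoDataSource)
df = spark.read.format("demo").load()
```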
