This tutorial explores how to read data from and write data to various data sources and formats in Apache Spark.
Spark supports several built-in data sources.
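All of the format-specific shortcuts below are thin wrappers around the generic `DataFrameReader`/`DataFrameWriter` API. A minimal sketch of that pattern (the paths and options here are illustrative):

```python
# Generic pattern: spark.read.format(...).option(...).load(...) for reads,
# df.write.format(...).mode(...).save(...) for writes.
df = (spark.read.format("csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load("data.csv"))

(df.write.format("parquet")
   .mode("overwrite")   # save modes: append, overwrite, ignore, errorifexists
   .save("output.parquet"))
```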
CSV:
```python
# header/inferSchema control column naming and type inference.
df = spark.read.csv("data.csv", header=True, inferSchema=True)
# Note: the output path is created as a directory of part files.
df.write.csv("output.csv", header=True)
```
JSON:
```python
df = spark.read.json("data.json")
df.write.json("output.json")
```
Parquet:
```python
df = spark.read.parquet("data.parquet")
df.write.parquet("output.parquet")
```
Avro:
```python
# Avro support ships as the separate spark-avro module; add it to the
# classpath (e.g. via --packages) before using this format.
df = spark.read.format("avro").load("data.avro")
df.write.format("avro").save("output.avro")
```
ORC:
```python
df = spark.read.orc("data.orc")
df.write.orc("output.orc")
```
Spark can connect to relational databases using JDBC/ODBC.
```python
# Read from a JDBC source (requires the database's JDBC driver on the classpath).
df = spark.read.format("jdbc") \
    .option("url", "jdbc:mysql://localhost:3306/mydb") \
    .option("dbtable", "mytable") \
    .option("user", "myuser") \
    .option("password", "mypassword") \
    .load()

# Write back over JDBC, appending to the existing table.
df.write.format("jdbc") \
    .option("url", "jdbc:mysql://localhost:3306/mydb") \
    .option("dbtable", "mytable") \
    .option("user", "myuser") \
    .option("password", "mypassword") \
    .mode("append") \
    .save()
```
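By default a JDBC read runs as a single partition. For larger tables, Spark can parallelize the read when given partitioning bounds. A sketch, assuming a numeric `id` column (the column name, bounds, and driver class are illustrative):

```python
df = (spark.read.format("jdbc")
      .option("url", "jdbc:mysql://localhost:3306/mydb")
      .option("dbtable", "mytable")
      .option("user", "myuser")
      .option("password", "mypassword")
      .option("driver", "com.mysql.cj.jdbc.Driver")  # JDBC driver class to load
      .option("partitionColumn", "id")               # numeric/date/timestamp column to split on
      .option("lowerBound", "1")
      .option("upperBound", "1000000")
      .option("numPartitions", "8")                  # number of parallel read tasks
      .load())
```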
See JDBC/ODBC Connectivity.
Spark can read data from and write data to cloud storage.
S3:
```python
df = spark.read.parquet("s3a://mybucket/data.parquet")
df.write.parquet("s3a://mybucket/output.parquet")
```
Azure Blob:
```python
df = spark.read.parquet("wasbs://mycontainer@myaccount.blob.core.windows.net/data.parquet")
df.write.parquet("wasbs://mycontainer@myaccount.blob.core.windows.net/output.parquet")
```
GCS:
```python
df = spark.read.parquet("gs://mybucket/data.parquet")
df.write.parquet("gs://mybucket/output.parquet")
```
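These examples assume the relevant cloud connector and credentials are already configured. One way to supply them, sketched for S3A at session-build time (assuming the hadoop-aws connector is on the classpath; the credential values are placeholders):

```python
from pyspark.sql import SparkSession

# Hadoop connector settings can be passed with the "spark.hadoop." prefix.
# Inline keys are for illustration only; prefer instance profiles, environment
# variables, or a credentials provider in practice.
spark = (SparkSession.builder
         .appName("cloud-storage-example")
         .config("spark.hadoop.fs.s3a.access.key", "<ACCESS_KEY>")
         .config("spark.hadoop.fs.s3a.secret.key", "<SECRET_KEY>")
         .getOrCreate())
```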
You can implement custom data sources by extending Spark's DataSource API.
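As a minimal sketch, assuming Spark 4.0+ with the Python Data Source API: subclass `DataSource` and `DataSourceReader`, register the class, and read it by its short name. The `fake` name, schema, and rows here are purely illustrative:

```python
from pyspark.sql.datasource import DataSource, DataSourceReader

class FakeDataSource(DataSource):
    """Hypothetical source that returns a fixed set of rows."""

    @classmethod
    def name(cls):
        return "fake"                  # short name used with spark.read.format(...)

    def schema(self):
        return "id INT, value STRING"  # schema as a DDL string

    def reader(self, schema):
        return FakeDataSourceReader()

class FakeDataSourceReader(DataSourceReader):
    def read(self, partition):
        # Yield rows as tuples matching the declared schema.
        yield (1, "a")
        yield (2, "b")

# Register the source, then read from it like any other format.
spark.dataSource.register(FakeDataSource)
df = spark.read.format("fake").load()
df.show()
```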