Data Sources and Formats

This tutorial explores how to read data from and write data to various data sources and formats in Apache Spark.

Built-In Data Sources (CSV, JSON, Parquet, Avro, ORC)

Spark supports several built-in data sources, each with shorthand reader and writer methods; a generic format/load equivalent is shown after the list.

  • CSV:

```python
df = spark.read.csv("data.csv", header=True, inferSchema=True)
df.write.csv("output.csv", header=True)
```

  • JSON:

```python
df = spark.read.json("data.json")
df.write.json("output.json")
```

  • Parquet:

```python
df = spark.read.parquet("data.parquet")
df.write.parquet("output.parquet")
```

  • Avro (requires the spark-avro module on the classpath):

df = spark.read.format("avro").load("data.avro") df.write.format("avro").save("output.avro") ```

  • ORC:

```python
df = spark.read.orc("data.orc")
df.write.orc("output.orc")
```
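Each shorthand above is a thin wrapper over the generic format/load and format/save API, which takes per-source options as strings. The CSV pair, for example, is equivalent to:

```python
# Generic equivalent of the CSV shorthand above.
df = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("data.csv")

df.write.format("csv") \
    .option("header", "true") \
    .mode("overwrite") \
    .save("output.csv")
```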

JDBC/ODBC Connectivity

Spark connects to relational databases through its built-in JDBC data source; the JDBC driver for the target database (here, the MySQL connector) must be on the classpath. External BI tools can in turn query Spark over JDBC/ODBC via the Spark Thrift Server.

df = spark.read.format("jdbc") \
 .option("url", "jdbc:mysql://localhost:3306/mydb") \
 .option("dbtable", "mytable") \
 .option("user", "myuser") \
 .option("password", "mypassword") \
 .load()

df.write.format("jdbc") \
 .option("url", "jdbc:mysql://localhost:3306/mydb") \
 .option("dbtable", "mytable") \
 .option("user", "myuser") \
 .option("password", "mypassword") \
 .mode("append") \
 .save()

Reading From and Writing to Databases

Reading from and writing to relational databases uses the JDBC options shown in the previous section. For large tables, the read can be parallelized by partitioning on a numeric column, as sketched below.
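A minimal sketch of a parallel JDBC read, assuming mytable has a numeric id column with roughly known bounds; Spark splits the [lowerBound, upperBound] range into numPartitions slices and issues one query per slice:

```python
# Hypothetical bounds for the id column; adjust to the actual data range.
df = spark.read.format("jdbc") \
    .option("url", "jdbc:mysql://localhost:3306/mydb") \
    .option("dbtable", "mytable") \
    .option("user", "myuser") \
    .option("password", "mypassword") \
    .option("partitionColumn", "id") \
    .option("lowerBound", "1") \
    .option("upperBound", "1000000") \
    .option("numPartitions", "8") \
    .load()
```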

Working with Cloud Storage (S3, Azure Blob, GCS)

Spark can read data from and write data to cloud object stores, provided the matching connector and credentials are configured (see the sketch after this list).

  • S3:

df = spark.read.parquet("s3a://mybucket/data.parquet") df.write.parquet("s3a://mybucket/output.parquet") ```

  • Azure Blob:

df = spark.read.parquet("wasbs://mycontainer@myaccount.blob.core.windows.net/data.parquet") df.write.parquet("wasbs://mycontainer@myaccount.blob.core.windows.net/output.parquet") ```

  • GCS:

df = spark.read.parquet("gs://mybucket/data.parquet") df.write.parquet("gs://mybucket/output.parquet") ```

Custom Data Source Implementation

You can implement custom data sources for systems Spark does not support natively by extending its data source APIs: the DataSource V2 interfaces in Scala/Java or, since Spark 4.0, the Python Data Source API.
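A minimal read-only sketch, assuming Spark 4.0's Python Data Source API (pyspark.sql.datasource); the class and format names here are illustrative:

```python
from pyspark.sql.datasource import DataSource, DataSourceReader

class DemoDataSource(DataSource):
    """Hypothetical source that emits two fixed rows."""

    @classmethod
    def name(cls):
        # Format name used in spark.read.format(...).
        return "demo"

    def schema(self):
        # Schema of the rows this source produces, as a DDL string.
        return "id INT, value STRING"

    def reader(self, schema):
        return DemoDataSourceReader()

class DemoDataSourceReader(DataSourceReader):
    def read(self, partition):
        # Yield tuples matching the declared schema.
        yield (0, "a")
        yield (1, "b")

# Register the source, then use it like any built-in format.
spark.dataSource.register(DemoDataSource)
df = spark.read.format("demo").load()
```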
