This tutorial covers Apache Spark SQL, a module for structured data processing using SQL queries and DataFrames.
Spark SQL lets you run SQL queries directly against DataFrames. First register the DataFrame as a temporary view, then query it with `spark.sql`:

```python
# Register the DataFrame as a temporary view scoped to this SparkSession
df.createOrReplaceTempView("my_table")

# Run a SQL query against the view; the result is itself a DataFrame
result = spark.sql("SELECT * FROM my_table WHERE age > 30")
result.show()
```
Spark SQL provides a Catalog API for managing metadata.

```python
spark.catalog.listTables()
```
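For reference, a short sketch of a few other Catalog calls, assuming a SparkSession named `spark` and the `my_table` view registered above:

```python
# List databases visible to this session and the current default database
spark.catalog.listDatabases()
spark.catalog.currentDatabase()

# Inspect the columns of a registered view or table
spark.catalog.listColumns("my_table")
```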
You can define custom functions (UDFs) to extend SQL with your own logic.

```python
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def age_bucket(age):
    if age < 20:
        return "Teen"
    elif age < 60:
        return "Adult"
    else:
        return "Senior"

# Wrap the Python function as a UDF that returns a string column
age_bucket_udf = udf(age_bucket, StringType())

df = df.withColumn("age_group", age_bucket_udf(df["age"]))
df.show()
```
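To call the same function from SQL queries, it can also be registered with the session. A minimal sketch, assuming the `age_bucket` function and the `my_table` view defined earlier:

```python
from pyspark.sql.types import StringType

# Register the Python function under a SQL-callable name
spark.udf.register("age_bucket", age_bucket, StringType())

spark.sql("SELECT age, age_bucket(age) AS age_group FROM my_table").show()
```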
Window functions perform calculations across a set of rows related to the current row, without collapsing those rows the way a `groupBy` aggregation does.

```python
from pyspark.sql.window import Window
import pyspark.sql.functions as func

# Rank rows within each city, ordered by age descending
window_spec = Window.partitionBy("city").orderBy(func.col("age").desc())

df = df.withColumn("rank", func.rank().over(window_spec))
df.show()
```
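Window specifications also work with ordinary aggregate functions, attaching a per-group value to every row instead of collapsing the group. A small sketch, assuming the same `df` with `city` and `age` columns:

```python
from pyspark.sql.window import Window
import pyspark.sql.functions as func

# Average age per city, repeated on every row belonging to that city
city_window = Window.partitionBy("city")
df = df.withColumn("avg_age_in_city", func.avg("age").over(city_window))
df.show()
```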
Partitioning controls how a DataFrame's rows are distributed across the cluster:

```python
df = df.repartition(10)
```
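Beyond setting a fixed partition count, you can repartition by column or reduce partitions without a full shuffle. A short sketch, assuming the `city` column from the earlier examples:

```python
# Repartition by column so rows with the same city land in the same partition
df = df.repartition("city")

# coalesce() reduces the number of partitions without a full shuffle
df = df.coalesce(4)

# Check the current number of partitions
print(df.rdd.getNumPartitions())
```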
Caching keeps a DataFrame in memory so repeated queries against it avoid recomputation:

```python
df.cache()
```
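Caching is lazy: the data is only materialized when an action runs. A minimal usage sketch:

```python
df.cache()

# The first action materializes the cache; later actions reuse it
df.count()
df.filter(df["age"] > 30).show()

# Release the cached data when it is no longer needed
df.unpersist()
```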
Spark SQL can interact with the Hive metastore for persistent tables. The warehouse location is controlled by the `spark.sql.warehouse.dir` property, which can be set in `spark-defaults.conf` or when the SparkSession is created.
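A minimal sketch of building a Hive-enabled session, assuming your Spark distribution was built with Hive support (the application name and warehouse path below are illustrative values):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("SparkSQLTutorial") \
    .config("spark.sql.warehouse.dir", "/user/hive/warehouse") \
    .enableHiveSupport() \
    .getOrCreate()

# Persistent tables created through this session go through the Hive metastore
spark.sql("SHOW TABLES").show()
```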