This tutorial covers machine learning with MLlib, Apache Spark's scalable machine learning library. It uses the DataFrame-based API in the pyspark.ml package and walks through feature engineering, classification, regression, clustering, and dimensionality reduction.
A typical workflow starts with two feature-engineering steps: extract features from the raw data, then transform them to improve model performance.
VectorAssembler combines multiple numeric columns into the single vector column that MLlib estimators expect as input.
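The snippets in this tutorial assume an active SparkSession bound to spark and a DataFrame df with numeric feature columns and a label column. A minimal toy input could look like this (the column names and values are placeholders):

# Toy input data; assumes an existing SparkSession named `spark`
df = spark.createDataFrame(
    [(1.0, 0.5, 0.0), (2.0, 1.5, 1.0), (3.0, 2.5, 1.0)],
    ["feature1", "feature2", "label"])
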
from pyspark.ml.feature import VectorAssembler

# Merge the raw feature columns into one vector column named "features"
assembler = VectorAssembler(
    inputCols=["feature1", "feature2"],
    outputCol="features")
df = assembler.transform(df)  # a pure transformer, no fit step needed
StandardScaler scales features to zero mean and unit variance.
from pyspark.ml.feature import StandardScaler

# Standardize each feature; withMean=True centers the data, which
# produces dense vectors, so use it with care on sparse input
scaler = StandardScaler(
    inputCol="features",
    outputCol="scaledFeatures",
    withStd=True, withMean=True)
scalerModel = scaler.fit(df)  # an estimator: fit() computes the column statistics
df = scalerModel.transform(df)
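Because StandardScaler is an estimator, the fitted model keeps the statistics it computed, which is handy for a quick sanity check:

# Per-feature mean and standard deviation learned during fit()
print(scalerModel.mean)
print(scalerModel.std)
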
The Pipeline API chains multiple transformers and estimators into a single estimator, so the whole workflow can be fit and applied in one step.
from pyspark.ml import Pipeline

# Chain the assembler and scaler into one workflow
pipeline = Pipeline(stages=[assembler, scaler])
model = pipeline.fit(df)   # fits each stage in order, returns a PipelineModel
df = model.transform(df)
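A fitted PipelineModel can also be persisted and reloaded, so the exact same preprocessing is applied later at serving time. A minimal sketch; the path below is a placeholder:

from pyspark.ml import PipelineModel

# Persist the fitted pipeline (the path is hypothetical)
model.write().overwrite().save("/tmp/feature_pipeline")

# Reload it later and apply the identical transformations
reloaded = PipelineModel.load("/tmp/feature_pipeline")
df_served = reloaded.transform(df)
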
Logistic regression is a common baseline classifier. It is trained here on the scaled features produced above.

from pyspark.ml.classification import LogisticRegression

# Train a logistic regression classifier
lr = LogisticRegression(
    featuresCol="scaledFeatures",
    labelCol="label",
    maxIter=10)
lrModel = lr.fit(df)
predictions = lrModel.transform(df)  # adds rawPrediction, probability, and prediction columns
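Predictions can be scored with MLlib's built-in evaluators. A minimal sketch, assuming a binary label column; in practice, evaluate on a held-out split (e.g. df.randomSplit([0.8, 0.2])) rather than the training data:

from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Area under the ROC curve for the predictions above
evaluator = BinaryClassificationEvaluator(
    labelCol="label",
    rawPredictionCol="rawPrediction",
    metricName="areaUnderROC")
auc = evaluator.evaluate(predictions)
print(auc)
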
Decision trees handle non-linear relationships and are easy to interpret.

from pyspark.ml.classification import DecisionTreeClassifier

# Train a single decision tree classifier
dt = DecisionTreeClassifier(
    featuresCol="scaledFeatures",
    labelCol="label")
dtModel = dt.fit(df)
predictions = dtModel.transform(df)
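The fitted tree can be printed in full, which helps with interpretation and debugging:

# Human-readable dump of the learned tree structure
print(dtModel.toDebugString)
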
A random forest trains an ensemble of decision trees and combines their votes, which typically improves accuracy and reduces overfitting.

from pyspark.ml.classification import RandomForestClassifier

# Train a random forest (an ensemble of decision trees)
rf = RandomForestClassifier(
    featuresCol="scaledFeatures",
    labelCol="label")
rfModel = rf.fit(df)
predictions = rfModel.transform(df)
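Fitted tree ensembles expose per-feature importances, a quick way to see which inputs drive the model:

# Relative importance of each input feature (a sparse vector)
print(rfModel.featureImportances)
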
For continuous targets, linear regression is the analogous starting point.

from pyspark.ml.regression import LinearRegression

# Train an ordinary least squares regression model
linReg = LinearRegression(
    featuresCol="scaledFeatures",
    labelCol="label",
    maxIter=10)
linRegModel = linReg.fit(df)
predictions = linRegModel.transform(df)
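The fitted model carries its coefficients and a training summary with common fit statistics:

# Coefficients, intercept, and training-set metrics
print(linRegModel.coefficients, linRegModel.intercept)
print(linRegModel.summary.rootMeanSquaredError)
print(linRegModel.summary.r2)
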
K-means is an unsupervised algorithm that partitions the rows into k clusters.

from pyspark.ml.clustering import KMeans

# Cluster the data into k=2 groups
kmeans = KMeans(
    featuresCol="scaledFeatures",
    k=2)
kmeansModel = kmeans.fit(df)
predictions = kmeansModel.transform(df)  # adds a "prediction" column with the cluster index
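Cluster quality can be checked with the silhouette score, and the learned centroids inspected directly:

from pyspark.ml.evaluation import ClusteringEvaluator

# Silhouette score; values near 1 indicate well-separated clusters
evaluator = ClusteringEvaluator(featuresCol="scaledFeatures")
silhouette = evaluator.evaluate(predictions)
print(silhouette)

# The learned cluster centers
for center in kmeansModel.clusterCenters():
    print(center)
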
PCA reduces dimensionality by projecting the features onto the top k principal components.

from pyspark.ml.feature import PCA

# Project the scaled features onto the top 2 principal components
pca = PCA(
    inputCol="scaledFeatures",
    outputCol="pcaFeatures",  # required: PCA has no default output column
    k=2)
pcaModel = pca.fit(df)
result = pcaModel.transform(df)
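The fitted model reports how much variance each component captures:

# Proportion of variance explained by each principal component
print(pcaModel.explainedVariance)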