This tutorial will introduce you to MLlib, the built-in machine learning library of Apache Spark.
Abstract
Apache Spark is a large-scale data processing engine that comes with the following built-in libraries -
- Spark SQL (Structured Data Analysis)
- Spark Streaming (Real-time stream processing)
- Spark MLlib (Machine Learning)
- Spark GraphX (Graph Computations)
Introduction to Apache Spark MLlib
Apache Spark MLlib is a module/library for scalable, practical and easy machine learning. It provides the following functionalities, tools and utilities -
- Machine Learning Algorithms - It provides implementations of common machine learning algorithms for classification, regression, clustering and collaborative filtering.
- Featurization - Features represent the data attributes (columns or fields) that are fed to machine learning algorithms. Spark MLlib provides tools for feature extraction, transformation, dimensionality reduction and selection.
- Pipelines - Pipelines are quite useful as they enable us to chain together different steps such as dimensionality reduction, indexing features, indexing labels and model building (see the sketch after this list).
- Persistence - Spark provides tools to save and load algorithms, models and pipelines.
- Utilities - Spark also provides utilities for linear algebra, statistics, data handling, vectors etc.
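To make these ideas concrete, here is a minimal sketch that combines featurization, a pipeline and persistence. It assumes a DataFrame named trainingData with a string column "text" and a numeric "label" column; those names and the save path are illustrative, not part of any fixed schema -
import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineModel;
import org.apache.spark.ml.PipelineStage;
import org.apache.spark.ml.classification.LogisticRegression;
import org.apache.spark.ml.feature.HashingTF;
import org.apache.spark.ml.feature.Tokenizer;
// Featurization - tokenize raw text, then hash the tokens into feature vectors
Tokenizer tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words");
HashingTF hashingTF = new HashingTF().setInputCol("words").setOutputCol("features");
LogisticRegression lr = new LogisticRegression().setMaxIter(10);
// Pipeline - wire the featurization steps and the model building step together
Pipeline pipeline = new Pipeline()
    .setStages(new PipelineStage[] {tokenizer, hashingTF, lr});
// trainingData is an assumed Dataset<Row> with "text" and "label" columns
PipelineModel model = pipeline.fit(trainingData);
// Persistence - save the fitted pipeline model and load it back later
model.write().overwrite().save("/tmp/spark-lr-pipeline-model"); // illustrative path
PipelineModel sameModel = PipelineModel.load("/tmp/spark-lr-pipeline-model");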
Here are the algorithms for which Spark MLlib provides out-of-the-box implementations (a small usage sketch follows the list) -
- Logistic regression
- Decision tree classifier
- Random forest classifier
- Gradient-boosted tree classifier
- Multilayer perceptron classifier
- One-vs-Rest classifier
- Naive Bayes
- Linear regression
- Generalized linear regression
- Decision tree regression
- Random forest regression
- Survival regression
- Isotonic regression
- K-means
- Latent Dirichlet allocation (LDA)
- Bisecting k-means
- Gaussian mixture model (GMM)
- Alternating least squares (ALS)
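All of these follow the same estimator pattern: create the algorithm object, set its parameters and call fit() on a DataFrame. As an illustration, here is a minimal sketch of training a k-means model; it assumes a SparkSession named spark and uses the sample data file shipped with Spark -
import org.apache.spark.ml.clustering.KMeans;
import org.apache.spark.ml.clustering.KMeansModel;
import org.apache.spark.ml.linalg.Vector;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
// Load sample data in LIBSVM format; "spark" is an assumed SparkSession
Dataset<Row> dataset = spark.read().format("libsvm")
    .load("data/mllib/sample_kmeans_data.txt");
// Train a k-means model with 2 clusters
KMeans kmeans = new KMeans().setK(2).setSeed(1L);
KMeansModel kmModel = kmeans.fit(dataset);
// Print the resulting cluster centers
for (Vector center : kmModel.clusterCenters()) {
    System.out.println(center);
}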
Apache Spark MLlib API Details
Apache Spark MLlib comes with the following two APIs, which differ in the data structures used for data extraction and transformations -
RDD-based API
This API utilizes Spark RDDs for data extraction and transformations. All the classes of this API reside in the spark.mllib package.
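For illustration, here is a minimal sketch of loading data with the RDD-based API; it assumes an existing JavaSparkContext named sc -
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.mllib.util.MLUtils;
// In the RDD-based API, data is represented as an RDD of LabeledPoint
// (a label together with a feature vector); "sc" is an assumed JavaSparkContext
JavaRDD<LabeledPoint> data = MLUtils
    .loadLibSVMFile(sc.sc(), "data/mllib/sample_libsvm_data.txt")
    .toJavaRDD();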
This API has been in maintenance mode since Spark 2.0, which means -
- No new features will be added
- It will soon be deprecated
- It is expected to be removed in Spark 3.0
Hence, new applications should not use this API, and existing applications using it should migrate to the DataFrame-based API as soon as possible.
DataFrame-based API
This is the primary and recommended API of Spark for machine learning. It utilizes the DataFrame API of Spark SQL for data extraction and transformations, performs better than the RDD-based API and is more user-friendly. All the classes of this API reside in the spark.ml package.
Here is sample code for Logistic Regression using the DataFrame-based API -
import org.apache.spark.ml.classification.LogisticRegression;
import org.apache.spark.ml.classification.LogisticRegressionModel;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
// Create a SparkSession, the entry point for the DataFrame-based API
SparkSession spark = SparkSession.builder()
    .appName("LogisticRegressionExample")
    .getOrCreate();
// Load training data in LIBSVM format
Dataset<Row> training = spark.read().format("libsvm")
    .load("data/mllib/sample_libsvm_data.txt");
// Create an instance of the LogisticRegression algorithm
LogisticRegression lr = new LogisticRegression()
    .setMaxIter(10)           // maximum number of iterations
    .setRegParam(0.3)         // regularization parameter
    .setElasticNetParam(0.8); // ElasticNet mixing parameter (L1/L2)
// Fit the model
LogisticRegressionModel lrModel = lr.fit(training);
// Print the coefficients and intercept for logistic regression
System.out.println("Coefficients: " + lrModel.coefficients()
    + " Intercept: " + lrModel.intercept());
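Once fitted, the model can be applied to a DataFrame to generate predictions. Here is a minimal sketch; it reuses the training data purely for brevity, whereas in practice you would use a separate test set -
// Apply the fitted model; this adds "prediction" and "probability" columns
Dataset<Row> predictions = lrModel.transform(training);
predictions.select("label", "prediction", "probability").show(5);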
Thank you for reading through the tutorial. If you have any feedback, questions or concerns, please let us know in the comments and we will get back to you as soon as possible.