Recent Tutorials and Articles
    Getting Started with Apache Spark MLlib
    Published on: 25th May 2018
    Posted By: Amit Kumar

    This tutorial will introduce you to the built-in machine learning library of Apache Spark, known as MLlib.


    Apache Spark is a large-scale data processing engine and comes with the following built-in libraries -

    • Spark SQL (Structured Data Analysis)
    • Spark Streaming (Real-time stream processing)
    • Spark MLlib (Machine Learning)
    • Spark GraphX (Graph Computations)


    Introduction to Apache Spark MLlib

    Apache Spark MLlib is a module/library for scalable, practical and easy machine learning. It provides the following functionalities, tools and utilities -

    1. Machine Learning Algorithms - It provides implementations of common machine learning algorithms for tasks such as classification, clustering, regression and collaborative filtering.
    2. Featurization - Features represent data attributes (columns or fields) that are fed to Machine learning algorithms. Spark MLlib provides tools for feature extraction, transformation, dimensionality reduction and selection.
    3. Pipelines - Pipelines are quite useful as they enable us to chain together different steps such as dimensionality reduction, feature indexing, label indexing and model building.
    4. Persistence - Spark provides tools to save and load algorithms, models and pipelines.
    5. Utilities - Spark also provides utilities for linear algebra, statistics, data handling, vectors etc.
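    To make points 2 and 3 above concrete, here is a minimal sketch of a Spark ML Pipeline in Java that chains two featurization stages (a tokenizer and a hashing term-frequency transformer) with a Logistic Regression stage. The class name, input texts and labels are hypothetical, and a local SparkSession is assumed purely for illustration -

    ```java
    import java.util.Arrays;
    import java.util.List;

    import org.apache.spark.ml.Pipeline;
    import org.apache.spark.ml.PipelineModel;
    import org.apache.spark.ml.PipelineStage;
    import org.apache.spark.ml.classification.LogisticRegression;
    import org.apache.spark.ml.feature.HashingTF;
    import org.apache.spark.ml.feature.Tokenizer;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.RowFactory;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.types.DataTypes;
    import org.apache.spark.sql.types.Metadata;
    import org.apache.spark.sql.types.StructField;
    import org.apache.spark.sql.types.StructType;

    public class PipelineSketch {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("PipelineSketch").master("local[1]").getOrCreate();

            // Hypothetical training data: (id, text, label)
            StructType schema = new StructType(new StructField[]{
                new StructField("id", DataTypes.LongType, false, Metadata.empty()),
                new StructField("text", DataTypes.StringType, false, Metadata.empty()),
                new StructField("label", DataTypes.DoubleType, false, Metadata.empty())
            });
            List<Row> rows = Arrays.asList(
                RowFactory.create(0L, "spark is fast", 1.0),
                RowFactory.create(1L, "hadoop mapreduce", 0.0));
            Dataset<Row> training = spark.createDataFrame(rows, schema);

            // Featurization stages: split text into words, then hash words into feature vectors
            Tokenizer tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words");
            HashingTF hashingTF = new HashingTF()
                    .setInputCol("words").setOutputCol("features").setNumFeatures(1000);
            LogisticRegression lr = new LogisticRegression().setMaxIter(10);

            // Wire the stages into a single Pipeline and fit it end to end
            Pipeline pipeline = new Pipeline()
                    .setStages(new PipelineStage[]{tokenizer, hashingTF, lr});
            PipelineModel model =;

            model.transform(training).select("text", "prediction").show();
            spark.stop();
        }
    }
    ```

    Once fitted, the PipelineModel applies all three stages in order to any new DataFrame, so featurization and prediction stay in sync.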

    Here are the algorithms for which Spark MLlib provides out-of-the-box implementations -

    1. Logistic regression
    2. Decision tree classifier
    3. Random forest classifier
    4. Gradient-boosted tree classifier
    5. Multilayer perceptron classifier
    6. One vs Rest classifier
    7. Naive Bayes
    8. Linear regression
    9. Generalized linear regression
    10. Decision tree regression
    11. Random forest regression
    12. Survival regression
    13. Isotonic regression
    14. K-means
    15. Latent Dirichlet allocation (LDA)
    16. Bisecting k-means
    17. Gaussian mixture model (GMM)
    18. Alternating least squares (ALS)


    Apache Spark MLlib API Details

    Apache Spark MLlib comes with the following two APIs, which differ in the data structures used for data extraction and transformations -

    RDD based API

    This API utilizes Spark RDDs for data extraction and transformations. All the classes of this API are present in the spark.mllib package.

    This API has been in maintenance mode since Spark 2.0, which means -

    1. No new features will be added
    2. It will soon be deprecated
    3. It is expected to be removed in Spark 3.0

    Hence, new applications should not use this API, and existing applications using it should migrate to the DataFrame based API as soon as possible.


    DataFrame based API

    This is the primary and recommended machine learning API of Spark. It utilizes the DataFrame API of Spark SQL for data extraction and transformations, performs better than the RDD based API and is more user-friendly. All the classes of this API are present in the package.

    Here is the sample Java code for Logistic Regression (the input path below is the sample LIBSVM dataset shipped with the Spark distribution) -

    // Load training data in LIBSVM format
    Dataset<Row> training ="libsvm")
        .load("data/mllib/sample_libsvm_data.txt");
    // Create instance of LogisticRegression algorithm
    LogisticRegression lr = new LogisticRegression();
    // Fit the model on the training data
    LogisticRegressionModel lrModel =;
    // Print the coefficients and intercept for logistic regression
    System.out.println("Coefficients: " + lrModel.coefficients()
        + " Intercept: " + lrModel.intercept());
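    The persistence tools mentioned earlier apply directly to such fitted models. Below is a minimal, self-contained sketch in Java that fits a Logistic Regression model on a tiny hypothetical dataset, saves it to a throwaway temporary directory, and loads it back; the class name, data and path are all illustrative assumptions -

    ```java
    import java.nio.file.Files;
    import java.util.Arrays;

    import org.apache.spark.ml.classification.LogisticRegression;
    import org.apache.spark.ml.classification.LogisticRegressionModel;
    import org.apache.spark.ml.linalg.VectorUDT;
    import org.apache.spark.ml.linalg.Vectors;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.RowFactory;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.types.DataTypes;
    import org.apache.spark.sql.types.Metadata;
    import org.apache.spark.sql.types.StructField;
    import org.apache.spark.sql.types.StructType;

    public class PersistenceSketch {
        public static void main(String[] args) throws Exception {
            SparkSession spark = SparkSession.builder()
                    .appName("PersistenceSketch").master("local[1]").getOrCreate();

            // Hypothetical training data: (label, features)
            StructType schema = new StructType(new StructField[]{
                new StructField("label", DataTypes.DoubleType, false, Metadata.empty()),
                new StructField("features", new VectorUDT(), false, Metadata.empty())
            });
            Dataset<Row> training = spark.createDataFrame(Arrays.asList(
                RowFactory.create(1.0, Vectors.dense(0.0, 1.1)),
                RowFactory.create(0.0, Vectors.dense(2.0, 1.0))), schema);

            LogisticRegressionModel model =
                new LogisticRegression().setMaxIter(10).fit(training);

            // Save the fitted model to a temporary directory and load it back
            String path = Files.createTempDirectory("lr-model").toString() + "/model";
            LogisticRegressionModel same = LogisticRegressionModel.load(path);

            System.out.println("Reloaded intercept: " + same.intercept());
            spark.stop();
        }
    }
    ```

    The same save()/load() pattern works for whole Pipelines and PipelineModels, which makes it easy to move a trained workflow from a training job to a serving job.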



    Thank you for reading through the tutorial. In case of any feedback/questions/concerns, you can share them with us through your comments and we shall get back to you as soon as possible.

