    Occupancy Detection with Decision Tree Classifier in Spark MLlib
    Published on: 25th May 2018
    Posted By: Amit Kumar

    This tutorial will provide instructions for developing an application that detects room occupancy using the Decision Tree classifier in Apache Spark MLlib.

    Problem Statement


    We have the problem of detecting whether a room is occupied based on collected data such as Temperature, Humidity, Light, CO2, etc. Since there is no fixed rule for figuring this out, we will use a machine learning approach and train the system with a training data set.

    We will be leveraging the Occupancy Detection Data Set from the UCI Machine Learning Repository to train our machine learning model. This dataset contains 3 files - datatraining.txt, datatest.txt and datatest2.txt.

    We will be utilizing the datatraining.txt file in this tutorial. However, there is a problem with the header of this file: it is missing the first column name, "id". Here are the original and updated CSV file headers for your reference -

    Original and Problematic Header with missing "id" -

    "date","Temperature","Humidity","Light","CO2","HumidityRatio","Occupancy"

     

    Updated and Corrected Header -

    "id","date","Temperature","Humidity","Light","CO2","HumidityRatio","Occupancy"

     

    This problem will be modeled using the Decision Tree classifier, which is introduced in the next section.

     

    Introduction to Decision Tree Classifier


    The Decision Tree classifier is a machine learning technique based on decision trees. It creates a model that predicts the value of a target variable by learning simple decision rules inferred from the data features. Decision rules are essentially a set of if-then-else conditions.
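
    For intuition, here is a purely illustrative, hand-written rule of the kind a decision tree might learn for our occupancy problem. The feature choices and threshold values below are made up for illustration only; in practice they are learned from the training data -

    // Purely illustrative sketch - real split features and thresholds are learned from the training data.
    public static int predictOccupancy(double light, double co2) {
    	if (light > 400.0) {
    		return 1; // bright room, likely occupied
    	} else if (co2 > 700.0) {
    		return 1; // dark but elevated CO2, possibly occupied
    	}
    	return 0; // dark and low CO2, likely empty
    }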

    Here are the advantages of the Decision Tree classifier -

    • Simple to understand and to interpret. Trees can be visualised.
    • Requires little data preparation. Other techniques often require data normalisation, creation of dummy variables and removal of blank values. Note, however, that decision trees do not support missing values.
    • The cost of using the tree (i.e., predicting data) is logarithmic in the number of data points used to train the tree.
    • Able to handle both numerical (continuous values) and categorical data.
    • Able to handle multi-output problems.

    On the other hand, here are the disadvantages of the Decision Tree classifier -

    • Decision tree learners do not support missing values.
    • Decision-tree learners can create over-complex trees that do not generalise the data well. This is called overfitting.
    • Decision trees can be unstable because small variations in the data might result in a completely different tree being generated. This problem is mitigated by using decision trees within an ensemble (see the sketch after this list).
    • There are concepts that are hard to learn because decision trees do not express them easily, such as XOR, parity or multiplexer problems.
    • Decision tree learners create biased trees if some classes dominate. It is therefore recommended to balance the dataset prior to fitting with the decision tree.
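
    As a side note on the ensemble-based mitigation mentioned in the list above, Spark MLlib lets you swap the single tree for an ensemble with minimal code changes. Here is a minimal sketch, assuming the same indexedLabel and indexedFeatures columns that are produced in the program later in this tutorial -

    import org.apache.spark.ml.classification.RandomForestClassifier;

    // Minimal sketch: a Random Forest (an ensemble of decision trees) reduces the
    // instability of a single tree. Column names match the pipeline built below.
    final RandomForestClassifier rf = new RandomForestClassifier()
    		.setLabelCol("indexedLabel")
    		.setFeaturesCol("indexedFeatures")
    		.setNumTrees(20); // number of trees in the ensemble
    // rf can then replace the DecisionTreeClassifier stage in the Pipeline.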

     

    Pre-requisites


    We will be using a Java Maven project to develop the program for detecting room occupancy. Along these lines, here are the pre-requisites to enable you to follow the instructions effectively -

    • Basic knowledge of Spark ML - If you are new to Spark ML, you are recommended to go through the tutorial Getting Started with Apache Spark MLlib
    • JDK 7 or later
    • Eclipse IDE (or your favourite Java IDE) with Maven plugin

     

    Room Occupancy Detection Program in Spark MLlib


    The first step is to create a Java Maven project in Eclipse (or any other Java IDE) and add the following dependencies to pom.xml to include Spark SQL and MLlib -

    <dependency>
    	<groupId>org.apache.spark</groupId>
    	<artifactId>spark-sql_2.11</artifactId>
    	<version>2.0.0</version>
    </dependency>
    <dependency>
    	<groupId>org.apache.spark</groupId>
    	<artifactId>spark-mllib_2.11</artifactId>
    	<version>2.0.0</version>
    </dependency>

     

    The next step is to copy datatraining.txt, with the "id" header column added, to src/main/resources in your Maven project.

    It's time to develop a Java program to detect room occupancy. However, let's first look at the steps that we will be following in our program in order to understand it better -

    (Figure: Room Occupancy Detection Flow)

     

    Finally, here is the Java program based on Spark SQL and Spark MLlib for room occupancy detection using the Decision Tree classifier -

    package com.aksain.sparkml.basics.decisiontree;
    
    import org.apache.spark.ml.Pipeline;
    import org.apache.spark.ml.PipelineModel;
    import org.apache.spark.ml.PipelineStage;
    import org.apache.spark.ml.classification.DecisionTreeClassifier;
    import org.apache.spark.ml.feature.IndexToString;
    import org.apache.spark.ml.feature.StringIndexer;
    import org.apache.spark.ml.feature.StringIndexerModel;
    import org.apache.spark.ml.feature.VectorAssembler;
    import org.apache.spark.ml.feature.VectorIndexer;
    import org.apache.spark.ml.feature.VectorIndexerModel;
    import org.apache.spark.sql.DataFrameReader;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;
    
    /**
     * Detects whether a Room is occupied using Decision Tree classifier.
     * 
     * @author amit-kumar
     */
    public class RoomOccupancyDetector {
    
    	/**
    	 * @param args
    	 */
    	public static void main(String[] args) {
    		// Create Spark Session to create connection to Spark
    		final SparkSession sparkSession = SparkSession.builder().appName("Spark Decision Tree Classifier Demo")
    				.master("local[5]").getOrCreate();
    
    		// Get DataFrameReader using SparkSession and set header option to true
    		// to specify that first row in file contains name of columns
    		final DataFrameReader dataFrameReader = sparkSession.read().option("header", true);
    		final Dataset<Row> trainingData = dataFrameReader.csv("src/main/resources/datatraining.txt");
    
    		// Create view and execute query to convert types as, by default, all
    		// columns have string types
    		trainingData.createOrReplaceTempView("TRAINING_DATA");
    		final Dataset<Row> typedTrainingData = sparkSession
    				.sql("SELECT cast(Temperature as float) Temperature, cast(Humidity as float) Humidity, "
    						+ "cast(Light as float) Light, cast(CO2 as float) CO2, "
    						+ "cast(HumidityRatio as float) HumidityRatio, "
    						+ "cast(Occupancy as int) Occupancy FROM TRAINING_DATA");
    
    		// Combine multiple input columns to a Vector using Vector Assembler
    		// utility
    		final VectorAssembler vectorAssembler = new VectorAssembler()
    				.setInputCols(new String[] { "Temperature", "Humidity", "Light", "CO2", "HumidityRatio" })
    				.setOutputCol("features");
    		final Dataset<Row> featuresData = vectorAssembler.transform(typedTrainingData);
    		// Print Schema to see column names, types and other metadata
    		featuresData.printSchema();
    
    		// Indexing is done to improve the execution times as comparing indexes
    		// is much cheaper than comparing strings/floats
    
    		// Index labels, adding metadata to the label column (Occupancy). Fit on
    		// whole dataset to include all labels in index.
    		final StringIndexerModel labelIndexer = new StringIndexer().setInputCol("Occupancy")
    				.setOutputCol("indexedLabel").fit(featuresData);
    
    		// Index features vector
    		final VectorIndexerModel featureIndexer = new VectorIndexer().setInputCol("features")
    				.setOutputCol("indexedFeatures").fit(featuresData);
    
    		// Split the data into training and test sets (30% held out for
    		// testing).
    		Dataset<Row>[] splits = featuresData.randomSplit(new double[] { 0.7, 0.3 });
    		Dataset<Row> trainingFeaturesData = splits[0];
    		Dataset<Row> testFeaturesData = splits[1];
    
    		// Train a DecisionTree model.
    		final DecisionTreeClassifier dt = new DecisionTreeClassifier().setLabelCol("indexedLabel")
    				.setFeaturesCol("indexedFeatures");
    
    		// Convert indexed labels back to original labels.
    		final IndexToString labelConverter = new IndexToString().setInputCol("prediction")
    				.setOutputCol("predictedOccupancy").setLabels(labelIndexer.labels());
    
    		// Chain indexers and tree in a Pipeline.
    		final Pipeline pipeline = new Pipeline()
    				.setStages(new PipelineStage[] { labelIndexer, featureIndexer, dt, labelConverter });
    
    		// Train model. This also runs the indexers.
    		final PipelineModel model = pipeline.fit(trainingFeaturesData);
    
    		// Make predictions.
    		final Dataset<Row> predictions = model.transform(testFeaturesData);
    
    		// Select example rows to display.
    		System.out.println("Example records with Predicted Occupancy as 0:");
    		predictions.select("predictedOccupancy", "Occupancy", "features")
    				.where(predictions.col("predictedOccupancy").equalTo(0)).show(10);
    
    		System.out.println("Example records with Predicted Occupancy as 1:");
    		predictions.select("predictedOccupancy", "Occupancy", "features")
    				.where(predictions.col("predictedOccupancy").equalTo(1)).show(10);
    
    		System.out.println("Example records with In-correct predictions:");
    		predictions.select("predictedOccupancy", "Occupancy", "features")
    				.where(predictions.col("predictedOccupancy").notEqual(predictions.col("Occupancy"))).show(10);
    	}
    }

    Note: You can find the complete project code in my GitHub repository.

    Here is the output that you will get on your console after executing the above program like any simple Java program -

    root
     |-- Temperature: float (nullable = true)
     |-- Humidity: float (nullable = true)
     |-- Light: float (nullable = true)
     |-- CO2: float (nullable = true)
     |-- HumidityRatio: float (nullable = true)
     |-- Occupancy: integer (nullable = true)
     |-- features: vector (nullable = true)
    
    Records with Predicted Occupancy as 0:
    +------------------+---------+--------------------+
    |predictedOccupancy|Occupancy|            features|
    +------------------+---------+--------------------+
    |                 0|        0|[19.0,31.38999938...|
    |                 0|        0|[19.0,31.38999938...|
    |                 0|        0|[19.0499992370605...|
    |                 0|        0|[19.1000003814697...|
    |                 0|        0|[19.1000003814697...|
    |                 0|        0|[19.1000003814697...|
    |                 0|        0|[19.1000003814697...|
    |                 0|        0|[19.1000003814697...|
    |                 0|        0|[19.1000003814697...|
    |                 0|        0|[19.1000003814697...|
    +------------------+---------+--------------------+
    only showing top 10 rows
    
    Records with Predicted Occupancy as 1:
    +------------------+---------+--------------------+
    |predictedOccupancy|Occupancy|            features|
    +------------------+---------+--------------------+
    |                 1|        1|[19.7000007629394...|
    |                 1|        1|[19.8899993896484...|
    |                 1|        1|[19.9449996948242...|
    |                 1|        1|[19.9633331298828...|
    |                 1|        1|[20.0,29.19750022...|
    |                 1|        1|[20.0,29.44499969...|
    |                 1|        1|[20.1000003814697...|
    |                 1|        0|[20.1333332061767...|
    |                 1|        1|[20.1749992370605...|
    |                 1|        1|[20.2000007629394...|
    +------------------+---------+--------------------+
    only showing top 10 rows
    
    Records with In-correct predictions:
    +------------------+---------+--------------------+
    |predictedOccupancy|Occupancy|            features|
    +------------------+---------+--------------------+
    |                 0|        1|[19.5249996185302...|
    |                 1|        0|[20.1333332061767...|
    |                 1|        0|[20.3150005340576...|
    |                 1|        0|[20.3150005340576...|
    |                 1|        0|[21.2900009155273...|
    |                 1|        0|[21.6000003814697...|
    |                 1|        0|[21.625,19.222499...|
    |                 1|        0|[21.7224998474121...|
    |                 1|        0|[22.0,18.02499961...|
    |                 1|        0|[22.0499992370605...|
    +------------------+---------+--------------------+
    only showing top 10 rows
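
    The program above only prints sample predictions for a visual check. If you would also like a single accuracy figure, here is a minimal sketch using Spark MLlib's MulticlassClassificationEvaluator. It is not part of the program above, but it can be appended at the end of the main method, since the predictions Dataset already contains the indexedLabel and prediction columns -

    import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator;

    // Minimal sketch (not part of the program above): compute overall accuracy on the
    // test split from the predictions produced by model.transform(testFeaturesData).
    final MulticlassClassificationEvaluator evaluator = new MulticlassClassificationEvaluator()
    		.setLabelCol("indexedLabel")
    		.setPredictionCol("prediction")
    		.setMetricName("accuracy");
    final double accuracy = evaluator.evaluate(predictions);
    System.out.println("Test accuracy = " + accuracy);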

     

     

    Thank you for reading through the tutorial. If you have any feedback, questions, or concerns, please share them in the comments and we will get back to you as soon as possible.

