This tutorial provides instructions for developing an application that detects room occupancy using the Decision Tree classifier in Apache Spark MLlib.
Problem Statement
The problem is to detect whether a room is occupied, based on collected sensor data such as Temperature, Humidity, Light and CO2. Since there is no fixed rule for deciding this, we will use a machine learning approach and train the system with a training data set.
We will use the Occupancy Detection Data Set from the UCI Machine Learning Repository to train our machine learning model. This dataset contains 3 files - datatraining.txt, datatest.txt and datatest2.txt.
We will use the datatraining.txt file in this tutorial. However, the header of this file has a problem: the name of the first column, "id", is missing. Here are the original and corrected CSV headers for your reference -
Original header, with "id" missing -
"date","Temperature","Humidity","Light","CO2","HumidityRatio","Occupancy"
Corrected header -
"id","date","Temperature","Humidity","Light","CO2","HumidityRatio","Occupancy"
This problem will be modeled using the Decision Tree classifier, which is introduced in the next section.
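The header correction described above can be done by hand in a text editor, or with a few lines of Java. The following is only a minimal sketch: the class name HeaderFixer and its methods are hypothetical helpers, not part of the original project, and the sketch assumes the only problem with the file is the missing "id" column name on the first line.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of the original project) that adds the
// missing "id" column name to the header of datatraining.txt.
final class HeaderFixer {

    /** Prepends "id" to a CSV header line unless it is already there. */
    static String fixHeader(String header) {
        return header.startsWith("\"id\",") ? header : "\"id\"," + header;
    }

    /** Returns a copy of the lines in which only the first line (the header) is corrected. */
    static List<String> fixLines(List<String> lines) {
        final List<String> fixed = new ArrayList<>(lines);
        if (!fixed.isEmpty()) {
            fixed.set(0, fixHeader(fixed.get(0)));
        }
        return fixed;
    }

    public static void main(String[] args) {
        final String original = "\"date\",\"Temperature\",\"Humidity\",\"Light\",\"CO2\",\"HumidityRatio\",\"Occupancy\"";
        System.out.println(fixHeader(original));
    }
}
```

To apply this to the actual file, the lines could be read with java.nio.file.Files.readAllLines, passed through fixLines and written back with Files.write.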
Introduction to Decision Tree Classifier
The Decision Tree classifier is a machine learning technique based on decision trees. It creates a model that predicts the value of a target variable by learning simple decision rules inferred from data features. Decision rules are essentially a set of if-then-else conditions.
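To make the idea of if-then-else decision rules concrete, here is a hand-written sketch of what a very small tree for this problem might look like. The class name and both threshold values are purely hypothetical, chosen for illustration only; a trained model learns its own features and thresholds from the data.

```java
// Hand-written illustration of decision rules as nested if-then-else.
// The thresholds are hypothetical and NOT learned from the dataset.
final class OccupancyRuleSketch {

    static final double LIGHT_THRESHOLD = 365.0; // hypothetical value
    static final double CO2_THRESHOLD = 800.0;   // hypothetical value

    /** Returns 1 (occupied) or 0 (not occupied) for the given readings. */
    static int predict(double light, double co2) {
        if (light <= LIGHT_THRESHOLD) {
            // Left branch: low light, so fall back to the CO2 reading
            if (co2 <= CO2_THRESHOLD) {
                return 0;
            } else {
                return 1;
            }
        } else {
            // Right branch: bright room is treated as occupied
            return 1;
        }
    }

    public static void main(String[] args) {
        System.out.println(predict(420.0, 700.0));
        System.out.println(predict(0.0, 450.0));
    }
}
```

Each path from the first condition to a return statement corresponds to one rule; a decision tree learner builds such paths automatically from the training data.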
Here are the advantages of the Decision Tree classifier -
- Simple to understand and to interpret. Trees can be visualised.
- Requires little data preparation. Other techniques often require data normalisation, creation of dummy variables and removal of blank values. Note, however, that missing values are not supported.
- The cost of using the tree (i.e., predicting data) is logarithmic in the number of data points used to train the tree.
- Able to handle both numerical (continuous values) and categorical data.
- Able to handle multi-output problems.
On the other hand, here are the disadvantages of the Decision Tree classifier -
- Decision tree learners do not support missing values.
- Decision-tree learners can create over-complex trees that do not generalise the data well. This is called overfitting.
- Decision trees can be unstable because small variations in the data might result in a completely different tree being generated. This problem is mitigated by using decision trees within an ensemble.
- There are concepts that are hard to learn because decision trees do not express them easily, such as XOR, parity or multiplexer problems.
- Decision tree learners create biased trees if some classes dominate. It is therefore recommended to balance the dataset prior to fitting with the decision tree.
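Regarding the last point, a quick way to see whether one class dominates is to count how often each label occurs before training. The following is a minimal sketch in plain Java; the class name ClassBalance and the sample labels are made up for illustration and do not come from the dataset.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper that reports how balanced a set of labels is.
final class ClassBalance {

    /** Counts how many times each label value occurs. */
    static Map<Integer, Integer> countLabels(int[] labels) {
        final Map<Integer, Integer> counts = new TreeMap<>();
        for (int label : labels) {
            final Integer current = counts.get(label);
            counts.put(label, current == null ? 1 : current + 1);
        }
        return counts;
    }

    /** Returns the fraction of records that belong to the most frequent label. */
    static double majorityShare(int[] labels) {
        if (labels.length == 0) {
            return 0.0;
        }
        int max = 0;
        for (int count : countLabels(labels).values()) {
            max = Math.max(max, count);
        }
        return (double) max / labels.length;
    }

    public static void main(String[] args) {
        final int[] labels = { 0, 0, 0, 1 }; // made-up sample labels
        System.out.println(countLabels(labels));
        System.out.println(majorityShare(labels));
    }
}
```

On a Spark Dataset, the same counts can be obtained with groupBy("Occupancy").count(). A majority share close to 1.0 suggests that the dataset should be balanced before fitting.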
Pre-requisites
We will use a Java Maven project to develop the program for detecting room occupancy. Here are the pre-requisites that will enable you to follow the instructions effectively -
- Basic knowledge of Spark ML - If you are new to Spark ML, we recommend going through the tutorial Getting Started with Apache Spark MLlib first
- JDK 7 or later
- Eclipse IDE (or your favourite Java IDE) with Maven plugin
Room Occupancy Detection Program in Spark MLlib
The first step is to create a Java Maven project in Eclipse (or any other Java IDE) and add the following dependencies to pom.xml to include Spark SQL and MLlib -
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.11</artifactId>
    <version>2.0.0</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-mllib_2.11</artifactId>
    <version>2.0.0</version>
</dependency>
The next step is to copy datatraining.txt, with the "id" column name added to its header, to src/main/resources in your Maven project.
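It's time to develop a Java program to detect room occupancy. However, let's first look at the steps that the program follows, in order to understand it better -
- Create a SparkSession and read datatraining.txt as a CSV file with a header row
- Cast the columns from string to numeric types using a Spark SQL query
- Combine the input columns into a single features vector using VectorAssembler
- Index the label column (Occupancy) and the features vector
- Split the data into training (70%) and test (30%) sets
- Chain the indexers, the Decision Tree classifier and the label converter in a Pipeline and train it on the training set
- Make predictions on the test set and display example records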
Finally, here is the Java program, based on Spark SQL and Spark MLlib, for room occupancy detection using the Decision Tree classifier -
package com.aksain.sparkml.basics.decisiontree;

import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineModel;
import org.apache.spark.ml.PipelineStage;
import org.apache.spark.ml.classification.DecisionTreeClassifier;
import org.apache.spark.ml.feature.IndexToString;
import org.apache.spark.ml.feature.StringIndexer;
import org.apache.spark.ml.feature.StringIndexerModel;
import org.apache.spark.ml.feature.VectorAssembler;
import org.apache.spark.ml.feature.VectorIndexer;
import org.apache.spark.ml.feature.VectorIndexerModel;
import org.apache.spark.sql.DataFrameReader;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

/**
 * Detects whether a Room is occupied using Decision Tree classifier.
 *
 * @author amit-kumar
 */
public class RoomOccupancyDetector {

    /**
     * @param args command-line arguments (not used)
     */
    public static void main(String[] args) {
        // Create Spark Session to create connection to Spark
        final SparkSession sparkSession = SparkSession.builder().appName("Spark Decision Tree Classifier Demo")
                .master("local[5]").getOrCreate();

        // Get DataFrameReader using SparkSession and set header option to true
        // to specify that first row in file contains name of columns
        final DataFrameReader dataFrameReader = sparkSession.read().option("header", true);
        final Dataset<Row> trainingData = dataFrameReader.csv("src/main/resources/datatraining.txt");

        // Create view and execute query to convert types as, by default, all
        // columns have string types
        trainingData.createOrReplaceTempView("TRAINING_DATA");
        final Dataset<Row> typedTrainingData = sparkSession
                .sql("SELECT cast(Temperature as float) Temperature, cast(Humidity as float) Humidity, "
                        + "cast(Light as float) Light, cast(CO2 as float) CO2, "
                        + "cast(HumidityRatio as float) HumidityRatio, "
                        + "cast(Occupancy as int) Occupancy FROM TRAINING_DATA");

        // Combine multiple input columns to a Vector using Vector Assembler
        // utility
        final VectorAssembler vectorAssembler = new VectorAssembler()
                .setInputCols(new String[] { "Temperature", "Humidity", "Light", "CO2", "HumidityRatio" })
                .setOutputCol("features");
        final Dataset<Row> featuresData = vectorAssembler.transform(typedTrainingData);

        // Print Schema to see column names, types and other metadata
        featuresData.printSchema();

        // Index labels, adding metadata to the label column (Occupancy). Fit on
        // whole dataset to include all labels in index.
        final StringIndexerModel labelIndexer = new StringIndexer().setInputCol("Occupancy")
                .setOutputCol("indexedLabel").fit(featuresData);

        // Index features vector. VectorIndexer treats features with few
        // distinct values as categorical and indexes them; the remaining
        // features are kept as continuous values.
        final VectorIndexerModel featureIndexer = new VectorIndexer().setInputCol("features")
                .setOutputCol("indexedFeatures").fit(featuresData);

        // Split the data into training and test sets (30% held out for
        // testing).
        Dataset<Row>[] splits = featuresData.randomSplit(new double[] { 0.7, 0.3 });
        Dataset<Row> trainingFeaturesData = splits[0];
        Dataset<Row> testFeaturesData = splits[1];

        // Train a DecisionTree model.
        final DecisionTreeClassifier dt = new DecisionTreeClassifier().setLabelCol("indexedLabel")
                .setFeaturesCol("indexedFeatures");

        // Convert indexed labels back to original labels.
        final IndexToString labelConverter = new IndexToString().setInputCol("prediction")
                .setOutputCol("predictedOccupancy").setLabels(labelIndexer.labels());

        // Chain indexers and tree in a Pipeline.
        final Pipeline pipeline = new Pipeline()
                .setStages(new PipelineStage[] { labelIndexer, featureIndexer, dt, labelConverter });

        // Train model. This also runs the indexers.
        final PipelineModel model = pipeline.fit(trainingFeaturesData);

        // Make predictions.
        final Dataset<Row> predictions = model.transform(testFeaturesData);

        // Select example rows to display.
        System.out.println("Example records with Predicted Occupancy as 0:");
        predictions.select("predictedOccupancy", "Occupancy", "features")
                .where(predictions.col("predictedOccupancy").equalTo(0)).show(10);

        System.out.println("Example records with Predicted Occupancy as 1:");
        predictions.select("predictedOccupancy", "Occupancy", "features")
                .where(predictions.col("predictedOccupancy").equalTo(1)).show(10);

        System.out.println("Example records with In-correct predictions:");
        predictions.select("predictedOccupancy", "Occupancy", "features")
                .where(predictions.col("predictedOccupancy").notEqual(predictions.col("Occupancy"))).show(10);

        // Release Spark resources once all results have been displayed
        sparkSession.stop();
    }
}
Note: You can find the complete project code in my GitHub repository.
Here is the output that you will get on your console after running the above program like any simple Java program -
root
 |-- Temperature: float (nullable = true)
 |-- Humidity: float (nullable = true)
 |-- Light: float (nullable = true)
 |-- CO2: float (nullable = true)
 |-- HumidityRatio: float (nullable = true)
 |-- Occupancy: integer (nullable = true)
 |-- features: vector (nullable = true)
Records with Predicted Occupancy as 0:
+------------------+---------+--------------------+
|predictedOccupancy|Occupancy|            features|
+------------------+---------+--------------------+
|                 0|        0|[19.0,31.38999938...|
|                 0|        0|[19.0,31.38999938...|
|                 0|        0|[19.0499992370605...|
|                 0|        0|[19.1000003814697...|
|                 0|        0|[19.1000003814697...|
|                 0|        0|[19.1000003814697...|
|                 0|        0|[19.1000003814697...|
|                 0|        0|[19.1000003814697...|
|                 0|        0|[19.1000003814697...|
|                 0|        0|[19.1000003814697...|
+------------------+---------+--------------------+
only showing top 10 rows
Records with Predicted Occupancy as 1:
+------------------+---------+--------------------+
|predictedOccupancy|Occupancy|            features|
+------------------+---------+--------------------+
|                 1|        1|[19.7000007629394...|
|                 1|        1|[19.8899993896484...|
|                 1|        1|[19.9449996948242...|
|                 1|        1|[19.9633331298828...|
|                 1|        1|[20.0,29.19750022...|
|                 1|        1|[20.0,29.44499969...|
|                 1|        1|[20.1000003814697...|
|                 1|        0|[20.1333332061767...|
|                 1|        1|[20.1749992370605...|
|                 1|        1|[20.2000007629394...|
+------------------+---------+--------------------+
only showing top 10 rows
Records with In-correct predictions:
+------------------+---------+--------------------+
|predictedOccupancy|Occupancy|            features|
+------------------+---------+--------------------+
|                 0|        1|[19.5249996185302...|
|                 1|        0|[20.1333332061767...|
|                 1|        0|[20.3150005340576...|
|                 1|        0|[20.3150005340576...|
|                 1|        0|[21.2900009155273...|
|                 1|        0|[21.6000003814697...|
|                 1|        0|[21.625,19.222499...|
|                 1|        0|[21.7224998474121...|
|                 1|        0|[22.0,18.02499961...|
|                 1|        0|[22.0499992370605...|
+------------------+---------+--------------------+
only showing top 10 rows
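The tables above only show sample rows, so they do not tell us how accurate the model is overall. As a rough illustration of how accuracy is computed from predicted and actual labels, here is a minimal sketch in plain Java. The class name SampleAccuracy is hypothetical, and the arrays hold only the ten rows displayed for Predicted Occupancy as 1, so the resulting figure describes those ten rows and not the whole test set.

```java
// Hypothetical helper that computes the fraction of correct predictions.
final class SampleAccuracy {

    /** Returns the share of positions where predicted and actual labels match. */
    static double accuracy(int[] predicted, int[] actual) {
        if (predicted.length != actual.length) {
            throw new IllegalArgumentException("Arrays must have the same length");
        }
        if (predicted.length == 0) {
            return 0.0;
        }
        int correct = 0;
        for (int i = 0; i < predicted.length; i++) {
            if (predicted[i] == actual[i]) {
                correct++;
            }
        }
        return (double) correct / predicted.length;
    }

    public static void main(String[] args) {
        // The ten displayed rows with Predicted Occupancy as 1
        final int[] predicted = { 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 };
        final int[] actual = { 1, 1, 1, 1, 1, 1, 1, 0, 1, 1 };
        System.out.println(accuracy(predicted, actual)); // prints 0.9
    }
}
```

For the full test set, Spark ML ships with MulticlassClassificationEvaluator, which can compute accuracy directly from the predictions Dataset.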
Thank you for reading through the tutorial. If you have any feedback, questions or concerns, please share them in the comments and we will get back to you as soon as possible.