    Appliances Energy Prediction using Linear Regression in Spark MLlib
    Published on: 2018-10-04 03:10:16
    Posted By: Amit Kumar

    This tutorial introduces linear regression through a use case: predicting the energy usage of household appliances.

    Introduction to Linear Regression


    Linear regression is a machine learning technique suited to scenarios with -

    1. A continuous output variable (Y)
    2. An output variable (Y) that has a linear relationship with the independent variable (X)

    Mathematically, the relationship between Y and X can be written as -

    Y = BX + C

    In two dimensions, this equation describes a straight line, with Y on the vertical axis and X on the horizontal axis. B is the slope of the line and C is the intercept (the point where the line crosses the Y axis).

    In linear regression, our goal is to find the optimal values of B and C.

    Note that X in the above equation denotes the features of a problem, and there is typically more than one. Hence X really denotes a vector of features -

    X = [X1 X2 ... Xn]  where n denotes the number of features

    Likewise, B (the line slope) is also a vector, since the slope may differ for each element of X. Hence the above equation can be written as -

    Y = ∑ BiXi + C         where i = 1 to n
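
    As a minimal sketch of the equation above, the following plain Java snippet computes Y for one sample from a coefficient vector B, an intercept C and a feature vector X. The class name, method name and numbers are made up for illustration and are not part of the tutorial's project.

```java
// Hypothetical helper, not part of the tutorial's project.
final class LinearPrediction {

    /** Computes Y = sum(Bi * Xi) + C for a single sample. */
    static double predict(double[] b, double c, double[] x) {
        if (b.length != x.length) {
            throw new IllegalArgumentException("B and X must have the same number of elements");
        }
        double y = c; // start from the intercept
        for (int i = 0; i < x.length; i++) {
            y += b[i] * x[i]; // add the contribution of feature i
        }
        return y;
    }

    public static void main(String[] args) {
        // Illustrative values only: three features, so n = 3
        double[] b = { 2.0, -1.0, 0.5 };
        double c = 10.0;
        double[] x = { 3.0, 4.0, 2.0 };
        // 2*3 - 1*4 + 0.5*2 + 10 = 13.0
        System.out.println("Y = " + predict(b, c, x));
    }
}
```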

    Since there are many possible values for the vector B and for C, we look for the ones that minimize an error function, most commonly defined as -

    Error = ∑ (Yi - Y'i)^2    where Yi is the actual output and Y'i is the predicted output for sample i, and i = 1 to M (the number of samples in the training set)
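
    To make the error function concrete, here is a small plain Java sketch that computes the sum of squared differences between actual and predicted outputs, and, for the single-feature case only, finds B and C with the standard closed-form least-squares formulas. The class name and sample data are made up for illustration. Spark MLlib's own solver handles many features and is not what is shown here.

```java
// Hypothetical helper, not part of the tutorial's project.
final class SquaredErrorFit {

    /** Sum over all samples of (actual - predicted)^2. */
    static double squaredError(double[] actual, double[] predicted) {
        if (actual.length != predicted.length) {
            throw new IllegalArgumentException("arrays must have the same length");
        }
        double error = 0.0;
        for (int i = 0; i < actual.length; i++) {
            double diff = actual[i] - predicted[i];
            error += diff * diff;
        }
        return error;
    }

    /** Returns {B, C} minimizing the squared error of Y = B*X + C for one feature. */
    static double[] fit(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need at least two (X, Y) pairs");
        }
        int m = x.length;
        double meanX = 0.0, meanY = 0.0;
        for (int i = 0; i < m; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= m;
        meanY /= m;
        double sxy = 0.0, sxx = 0.0;
        for (int i = 0; i < m; i++) {
            double dx = x[i] - meanX;
            sxy += dx * (y[i] - meanY);
            sxx += dx * dx;
        }
        if (sxx == 0.0) {
            throw new IllegalArgumentException("all X values are identical");
        }
        double b = sxy / sxx;        // slope
        double c = meanY - b * meanX; // intercept
        return new double[] { b, c };
    }

    public static void main(String[] args) {
        // Made-up samples lying close to Y = 2X + 1
        double[] x = { 1, 2, 3, 4 };
        double[] y = { 3.1, 4.9, 7.2, 8.8 };
        double[] bc = fit(x, y);
        double[] predicted = new double[x.length];
        for (int i = 0; i < x.length; i++) {
            predicted[i] = bc[0] * x[i] + bc[1];
        }
        System.out.println("B = " + bc[0] + ", C = " + bc[1]);
        System.out.println("Error = " + squaredError(y, predicted));
    }
}
```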

     

    Problem Statement


    We want to predict the appliance energy consumption of a house with 9 rooms. To make the prediction, we will use each room's temperature and humidity along with weather details such as outside temperature, visibility, wind speed etc. Since there is no fixed rule for working this out, we will take a machine learning approach and train the system with a training data set.

    We will use the Appliances Energy Prediction Data Set from the UCI Machine Learning Repository to train our machine learning model. This dataset contains a single CSV file named energydata_complete.csv.

    From this dataset, we will use the following attributes as features (listed by the column aliases used in the program below) -

    "T1", "RH_1", "T2", "RH_2", "T3", "RH_3", "T4", "RH_4", "T5", "RH_5", "T6", "RH_6", "T7", "RH_7", "T8", "RH_8", "T9", "RH_9", "T_OUT", "PRESS_OUT", "RH_OUT", "WIND", "VIS"

     

    The following column is the one to be predicted, and hence will be used as the label when training the linear regression model.

    "Appliances"

     

    Pre-requisites


    We will be using a Java Maven project to develop the program for predicting appliance energy consumption. Here are the pre-requisites for following the instructions effectively -

    • Basic knowledge of Spark ML - If you are new to Spark ML, we recommend going through the tutorial - Getting Started with Apache Spark MLlib
    • JDK 7 or later
    • Eclipse IDE (or your favourite Java IDE) with Maven plugin

     

    Appliances Energy Prediction Program in Spark MLlib


    The first step is to create a Java Maven project in Eclipse (or any Java IDE) and add the following dependencies to pom.xml to include Spark SQL and MLlib -

    <dependency>
    	<groupId>org.apache.spark</groupId>
    	<artifactId>spark-sql_2.11</artifactId>
    	<version>2.0.0</version>
    </dependency>
    <dependency>
    	<groupId>org.apache.spark</groupId>
    	<artifactId>spark-mllib_2.11</artifactId>
    	<version>2.0.0</version>
    </dependency>

     

    The next step is to copy the energydata_complete.csv file to src/main/resources in your Maven project.

    It's time to develop a Java program that predicts appliance energy consumption with Linear Regression, based on Spark SQL and Spark MLlib -

    package com.aksain.sparkml.basics.regression;
    
    import java.io.IOException;
    
    import org.apache.spark.ml.Pipeline;
    import org.apache.spark.ml.PipelineModel;
    import org.apache.spark.ml.PipelineStage;
    import org.apache.spark.ml.feature.VectorAssembler;
    import org.apache.spark.ml.regression.LinearRegression;
    import org.apache.spark.sql.DataFrameReader;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;
    
    public class LinearRegessionApplicancesEnergyPrediction {
    
    	public static void main(String[] args) throws IOException {
    		// Create Spark Session to create connection to Spark
    		final SparkSession sparkSession = SparkSession.builder().appName("Spark Linear Regression Demo")
    				.master("local[5]").getOrCreate();
    
    		// Get DataFrameReader using SparkSession and set header option to true
    		// to specify that first row in file contains name of columns
    		final DataFrameReader dataFrameReader = sparkSession.read().option("header", true);
    		final Dataset<Row> trainingData = dataFrameReader.csv("src/main/resources/energydata_complete.csv");
    
    		// Create view and execute query to convert types as, by default, all
    		// columns have string types
    		trainingData.createOrReplaceTempView("TRAINING_DATA");
    		final Dataset<Row> typedTrainingData = sparkSession
    				.sql("SELECT cast(Appliances as float) Appl_Energy, cast(T1 as float) T1, cast(RH_1 as float) RH_1, cast(T2 as float) T2, cast(RH_2 as float) RH_2, cast(T3 as float) T3, cast(RH_3 as float) RH_3, "
    						+ "cast(T4 as float) T4, cast(RH_4 as float) RH_4, cast(T5 as float) T5, cast(RH_5 as float) RH_5, cast(T6 as float) T6, cast(RH_6 as float) RH_6, "
    						+ "cast(T7 as float) T7, cast(RH_7 as float) RH_7, cast(T8 as float) T8, cast(RH_8 as float) RH_8, cast(T9 as float) T9, cast(RH_9 as float) RH_9, "
    						+ "cast(T_out as float) T_OUT, cast(Press_mm_hg as float) PRESS_OUT, cast(RH_out as float) RH_OUT, cast(Windspeed as float) WIND, "
    						+ "cast(Visibility as float) VIS FROM TRAINING_DATA");
    		
    		// Combine multiple input columns to a Vector using Vector Assembler
    		// utility
    		final VectorAssembler vectorAssembler = new VectorAssembler()
    				.setInputCols(new String[] { "T1", "RH_1", "T2", "RH_2", "T3", "RH_3", "T4", "RH_4", "T5", "RH_5", "T6", "RH_6", "T7", "RH_7", "T8", "RH_8", "T9", "RH_9", 
    						"T_OUT", "PRESS_OUT", "RH_OUT", "WIND", "VIS"})
    				.setOutputCol("features");
    		final Dataset<Row> featuresData = vectorAssembler.transform(typedTrainingData);
    		// Print Schema to see column names, types and other metadata
    		featuresData.printSchema();
    
    		// Split the data into training and test sets (30% held out for
    		// testing).
    		Dataset<Row>[] splits = featuresData.randomSplit(new double[] { 0.7, 0.3 });
    		Dataset<Row> trainingFeaturesData = splits[0];
    		Dataset<Row> testFeaturesData = splits[1];
    		
    		
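    		// Load a previously saved model, if one exists
    		PipelineModel model = null;
    		try {
    			model = PipelineModel.load("src/main/resources/applianceenergyprediction");
    		} catch(Exception exception) {
    			// No saved model could be loaded (e.g. on the first run), so a new one is trained below
    		}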
    		
    		if(model == null) {
    			// Train a Linear Regression model.
    			final LinearRegression regression = new LinearRegression().setLabelCol("Appl_Energy")
    					.setFeaturesCol("features");
    
    			// Using pipeline gives you benefit of switching regression model without any other changes
    			final Pipeline pipeline = new Pipeline()
    					.setStages(new PipelineStage[] { regression });
    
    			// Train the model on the training split and save it for later runs
    			model = pipeline.fit(trainingFeaturesData);
    			model.save("src/main/resources/applianceenergyprediction");
    		}
    
    		// Make predictions.
    		final Dataset<Row> predictions = model.transform(testFeaturesData);
    		predictions.show();
    	}
    
    }
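
    The long cast query in the program above is written out by hand. As one possible alternative, the plain Java sketch below builds the same query string from a map of CSV column names to aliases. The class and method names are made up for illustration; the column names, aliases and view name are the ones used in the program above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of the tutorial's project.
final class CastQueryBuilder {

    /** Builds "SELECT cast(col as float) alias, ... FROM view" in map order. */
    static String buildCastQuery(Map<String, String> columnAliases, String viewName) {
        StringBuilder sql = new StringBuilder("SELECT ");
        boolean first = true;
        for (Map.Entry<String, String> entry : columnAliases.entrySet()) {
            if (!first) {
                sql.append(", ");
            }
            sql.append("cast(").append(entry.getKey()).append(" as float) ").append(entry.getValue());
            first = false;
        }
        return sql.append(" FROM ").append(viewName).toString();
    }

    /** CSV column name -> alias, in the same order as the hand-written query. */
    static Map<String, String> energyColumns() {
        Map<String, String> columns = new LinkedHashMap<>();
        columns.put("Appliances", "Appl_Energy");
        for (int room = 1; room <= 9; room++) {
            columns.put("T" + room, "T" + room);
            columns.put("RH_" + room, "RH_" + room);
        }
        columns.put("T_out", "T_OUT");
        columns.put("Press_mm_hg", "PRESS_OUT");
        columns.put("RH_out", "RH_OUT");
        columns.put("Windspeed", "WIND");
        columns.put("Visibility", "VIS");
        return columns;
    }

    public static void main(String[] args) {
        System.out.println(buildCastQuery(energyColumns(), "TRAINING_DATA"));
    }
}
```

    The resulting string could be passed to sparkSession.sql(...) in place of the concatenated literal.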

    Note: You can find the complete project code in my GitHub repository.

    Run the above program like any simple Java program. The console output will be similar to the following -

    root
     |-- Appl_Energy: float (nullable = true)
     |-- T1: float (nullable = true)
     |-- RH_1: float (nullable = true)
     |-- T2: float (nullable = true)
     |-- RH_2: float (nullable = true)
     |-- T3: float (nullable = true)
     |-- RH_3: float (nullable = true)
     |-- T4: float (nullable = true)
     |-- RH_4: float (nullable = true)
     |-- T5: float (nullable = true)
     |-- RH_5: float (nullable = true)
     |-- T6: float (nullable = true)
     |-- RH_6: float (nullable = true)
     |-- T7: float (nullable = true)
     |-- RH_7: float (nullable = true)
     |-- T8: float (nullable = true)
     |-- RH_8: float (nullable = true)
     |-- T9: float (nullable = true)
     |-- RH_9: float (nullable = true)
     |-- T_OUT: float (nullable = true)
     |-- PRESS_OUT: float (nullable = true)
     |-- RH_OUT: float (nullable = true)
     |-- WIND: float (nullable = true)
     |-- VIS: float (nullable = true)
     |-- features: vector (nullable = true)
    
    
    [Stage 3:>                                                          (0 + 3) / 3]
                                                                                    
    
    [Stage 4:>                                                          (0 + 3) / 3]
                                                                                    
    
    [Stage 5:>                                                          (0 + 3) / 3]
    [Stage 5:===================>                                       (1 + 2) / 3]
                                                                                    
    
    [Stage 6:>                                                          (0 + 3) / 3]
                                                                                    
    +-----------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+----------+---------+---------+---------+---------+---------+---------+---------+----------+---------+---------+---------+---------+--------------------+------------------+
    |Appl_Energy|       T1|     RH_1|       T2|     RH_2|       T3|     RH_3|       T4|     RH_4|       T5|     RH_5|        T6|     RH_6|       T7|     RH_7|       T8|     RH_8|       T9|     RH_9|     T_OUT|PRESS_OUT|   RH_OUT|     WIND|      VIS|            features|        prediction|
    +-----------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+----------+---------+---------+---------+---------+---------+---------+---------+----------+---------+---------+---------+---------+--------------------+------------------+
    |       10.0|    17.29|     43.2|    16.29|    43.29|     17.7|    41.26|     15.3|     42.5|15.555555| 48.91111|       6.4|     99.9|    15.69|    38.09|16.644444|45.944443|    15.19|42.363335|       5.9|766.73334|99.833336|      4.0|     14.0|[17.2900009155273...| 88.99296073871162|
    |       10.0|     18.7|   46.345|    17.89|     46.4|     19.7|    44.79|    17.76|     46.9|    17.29|   53.045| 3.4333334|    95.53|    17.29|    43.29|    17.39|49.610554|     16.7|     47.4|       4.2|763.36664|92.333336|      2.0|28.333334|[18.7000007629394...| 68.99121998396122|
    |       10.0|     19.1|45.826668|    18.39|    45.26|     19.7|    44.56|     18.2|    45.59|     17.2|   52.334|       6.4| 97.42333|17.957222|41.461113|   17.945|47.669445|16.963333|46.626667|       7.3|    765.2|     74.0|      2.0|     40.0|[19.1000003814697...| 85.96612359854683|
    |       10.0|     20.7| 44.72333|    19.39|    45.29|     21.6|    41.92|    19.39|45.066666|18.028572|53.057144|      10.5| 90.29667|     18.1|42.566666|20.533333|     53.0|    18.29|     47.2|      10.6|    756.6|     89.5|      8.5|     34.5|[20.7000007629394...| 73.91703054246963|
    |       20.0|    17.29|     43.2|16.356667|    43.29|     17.7|     41.2|     15.3|    42.53|15.533334|48.961113| 6.4666667|     99.9|    15.69|    38.09|16.655556| 45.97167|    15.19|42.433334|       5.9|766.86664|99.666664|      4.0|     19.0|[17.2900009155273...| 88.45790878258944|
    |       20.0|17.666666|    41.06|     16.6|41.626667|     18.0|    40.29|     15.6|     40.7|15.757222|     48.4| 4.3333335|    99.69|    15.89|36.120556|17.061111|43.874443|    15.39|     40.5|  4.233333| 762.6667|    100.0|      5.0|      8.0|[17.6666660308837...| 87.24653890315918|
    |       20.0|    17.89|    44.29|17.133333|     44.7|     19.0|    42.09|    16.79|    44.53|16.627777| 50.18222|  8.533334|82.096664|     17.0|39.857224|17.094444|    46.53|     16.2|    44.09|      8.25|    763.1|82.666664|10.833333|     40.0|[17.8899993896484...|106.30295826500716|
    |       20.0|    17.89|    44.86|     17.1|    45.09|     19.0|     42.4|    16.79|     45.0|     16.6|     50.2|       8.8|84.433334|16.920555|     40.4|17.022223|   47.075|     16.2|    44.59|  8.433333|762.56665|84.333336|     10.0|     40.0|[17.8899993896484...| 106.7702799425575|
    |       20.0|17.926666|     44.2|17.133333|    44.59|     19.0|     42.2|    16.79|     44.5|16.661112|    50.24|       8.5| 81.96667|     17.0|    39.76|     17.1|     46.4|     16.2|     43.9|       8.1|    763.4|81.666664|10.333333|     40.0|[17.9266662597656...|107.68999563905112|
    |       20.0|18.066668| 38.72333|     17.0|37.966667|    18.39|    39.23|    16.89|36.826668|     16.2|49.774445|     -3.03|90.956665|   15.885|32.203888|18.538889| 42.94278|  15.7175|   40.375|      -3.3|763.43335|91.666664|      1.0|51.666668|[18.0666675567626...|  96.1276655150146|
    |       20.0|     18.1|    39.76|     17.0|    40.23|    18.29|     39.7|    15.89|     39.5|     16.0|     48.0|      3.79|     99.3|     16.1|    34.77|     17.5|     42.7|     15.6|    39.59|       3.2|    760.3|     96.0|      6.0|     42.0|[18.1000003814697...|101.06214517084106|
    |       20.0|     18.1|     39.9|     17.0|   40.345|    18.26|39.663334|    15.89|    39.59|     16.0|     48.0|     3.845|    99.45|     16.1|     34.9|17.463333| 42.76611|     15.6|     39.7| 3.3333333|    760.4|95.666664|      6.0|48.333332|[18.1000003814697...|100.17620777198982|
    |       20.0|    18.39|    38.79|    17.39|     39.0|18.426666|    39.23|     16.2|    38.59|16.144444|47.614445| 4.6566668| 89.89333|    16.29|     33.4|17.817778|41.433334|     15.8|     39.2|       3.7|    760.7|73.666664|      6.0|21.666666|[18.3899993896484...| 101.6756491656231|
    |       20.0|    18.39|    38.79|    17.39|     39.0|     18.5|39.363335|     16.2|    38.59|     16.2|    47.59|      4.59| 88.62666|    16.29|33.296112|17.867777|41.482777|     15.8|    39.29|       3.7|    760.8|71.333336|      6.0|22.333334|[18.3899993896484...|103.52987896185846|
    |       20.0|    18.39|    38.79|     17.5|38.863335|     18.5|    39.26|    16.29|     38.5|     16.2|    47.57|     3.545|    89.69|    16.29|     33.2|     18.0|    41.59|    15.89|     39.7| 3.2333333|    761.7|70.333336|5.6666665|23.333334|[18.3899993896484...| 95.35030831625421|
    |       20.0|    18.39|    42.29|     17.7|    41.79|18.323334|     43.5|     15.8|     46.7|    15.39|     51.0| 5.7633333|     99.9|     15.6|    39.79|    18.79|51.575554|     15.3|    47.59|  5.366667|768.76666|     92.0|      4.0|     23.0|[18.3899993896484...| 80.23149818464226|
    |       20.0|     18.5|    43.29|     17.6|    43.95|    19.39|    42.56|    17.29|    43.53|    16.89|    51.09|       2.0|     98.3|     17.5|    38.79|17.583334|     45.0|     16.6|     43.2|       3.9|    764.5|     80.0|      6.0|     40.0|[18.5,43.29000091...| 82.84486722105063|
    |       20.0|     18.6|     40.5|17.426666|     40.9|     19.1|     39.9|     18.2|    38.03|16.688889| 56.68778|-4.2388887| 87.25611|    17.39| 33.41111|   18.765|42.181667|     16.6|    38.56|-2.2666667|757.43335|     80.0|      1.0|56.666668|[18.6000003814697...|  77.7176230752578|
    |       20.0|     18.7|    37.79|     17.6|     37.7|     19.1|     40.4|16.356667|     38.2|     16.6|46.927776|-2.1266668|     92.7|    16.29|     32.9|18.627777|45.991665|     16.0|    43.09|      -1.3|    763.4|     84.0|      2.0|     65.0|[18.7000007629394...| 68.46048948590804|
    |       20.0|     18.7|    42.29|18.033333|    41.53|    19.79|43.326668|    18.29|41.966667|17.566668|    63.06|      3.06|    91.33|     18.1|42.466667|     19.1|    50.59|     16.7|47.126667|       3.4|    752.1|     89.0|      5.0|     28.0|[18.7000007629394...| 79.67022791676764|
    +-----------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+----------+---------+---------+---------+---------+---------+---------+---------+----------+---------+---------+---------+---------+--------------------+------------------+
    only showing top 20 rows
    
    4 Oct, 2018 9:03:06 AM INFO: org.apache.parquet.hadoop.codec.CodecConfig: Compression: SNAPPY
    4 Oct, 2018 9:03:06 AM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: Parquet block size to 134217728
    4 Oct, 2018 9:03:06 AM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: Parquet page size to 1048576
    4 Oct, 2018 9:03:06 AM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: Parquet dictionary page size to 1048576
    4 Oct, 2018 9:03:06 AM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: Dictionary is on
    4 Oct, 2018 9:03:06 AM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: Validation is off
    4 Oct, 2018 9:03:06 AM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: Writer version is: PARQUET_1_0
    4 Oct, 2018 9:03:07 AM INFO: org.apache.parquet.hadoop.InternalParquetRecordWriter: Flushing mem columnStore to file. allocated memory: 360
    4 Oct, 2018 9:03:07 AM INFO: org.apache.parquet.hadoop.ColumnChunkPageWriteStore: written 51B for [intercept] DOUBLE: 1 values, 8B raw, 10B comp, 1 pages, encodings: [PLAIN, BIT_PACKED]
    4 Oct, 2018 9:03:07 AM INFO: org.apache.parquet.hadoop.ColumnChunkPageWriteStore: written 45B for [coefficients, type] INT32: 1 values, 10B raw, 12B comp, 1 pages, encodings: [PLAIN, RLE, BIT_PACKED]
    4 Oct, 2018 9:03:07 AM INFO: org.apache.parquet.hadoop.ColumnChunkPageWriteStore: written 30B for [coefficients, size] INT32: 1 values, 7B raw, 9B comp, 1 pages, encodings: [PLAIN, RLE, BIT_PACKED]
    4 Oct, 2018 9:03:07 AM INFO: org.apache.parquet.hadoop.ColumnChunkPageWriteStore: written 36B for [coefficients, indices, list, element] INT32: 1 values, 13B raw, 15B comp, 1 pages, encodings: [PLAIN, RLE]
    4 Oct, 2018 9:03:07 AM INFO: org.apache.parquet.hadoop.ColumnChunkPageWriteStore: written 245B for [coefficients, values, list, element] DOUBLE: 23 values, 198B raw, 202B comp, 1 pages, encodings: [PLAIN, RLE]
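
    The prediction column in the table above can be compared with Appl_Energy to judge how well the model does. The plain Java sketch below computes two common measures, root mean squared error (RMSE) and mean absolute error (MAE), for the first four rows of that table. The class name is made up for illustration. On the full test set, Spark's RegressionEvaluator from org.apache.spark.ml.evaluation would be the more natural choice.

```java
// Hypothetical helper, not part of the tutorial's project.
final class PredictionErrorMetrics {

    /** Root mean squared error between actual and predicted values. */
    static double rmse(double[] actual, double[] predicted) {
        checkSameLength(actual, predicted);
        double sum = 0.0;
        for (int i = 0; i < actual.length; i++) {
            double diff = actual[i] - predicted[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum / actual.length);
    }

    /** Mean absolute error between actual and predicted values. */
    static double mae(double[] actual, double[] predicted) {
        checkSameLength(actual, predicted);
        double sum = 0.0;
        for (int i = 0; i < actual.length; i++) {
            sum += Math.abs(actual[i] - predicted[i]);
        }
        return sum / actual.length;
    }

    private static void checkSameLength(double[] a, double[] b) {
        if (a.length != b.length || a.length == 0) {
            throw new IllegalArgumentException("arrays must be non-empty and of equal length");
        }
    }

    public static void main(String[] args) {
        // Appl_Energy and prediction values copied from the first four output rows
        double[] actual = { 10.0, 10.0, 10.0, 10.0 };
        double[] predicted = { 88.99296073871162, 68.99121998396122, 85.96612359854683, 73.91703054246963 };
        System.out.println("RMSE = " + rmse(actual, predicted));
        System.out.println("MAE  = " + mae(actual, predicted));
    }
}
```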
    

     

     

    Thank you for reading through the tutorial. If you have any feedback, questions or concerns, please share them in the comments and we shall get back to you as soon as possible.
