    Developing Java Application in Apache Spark
    Published on: 25th May 2018
    Posted By: Amit Kumar

    This tutorial provides instructions for developing Java applications in Apache Spark using Maven and the Eclipse IDE.


    In this tutorial, we will demonstrate how to develop Java applications in Apache Spark using the Eclipse IDE and Apache Maven. Since our main focus is on Apache Spark application development, we will assume that you are already familiar with these tools. Along the same lines, it is assumed that you already have the following tools installed on your machine - 

    • JDK 8 since we will be using Lambda expressions
    • Eclipse IDE with Apache Maven Plugin
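    The JDK 8 requirement comes from the lambda expressions used later in the word count program. Here is a quick standalone check (a trivial snippet of my own, not part of the Spark code) that your compiler accepts lambdas:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class LambdaCheck {
    // If this compiles, the JDK in use supports Java 8 lambda expressions
    static String upperJoin(List<String> words) {
        return words.stream()
                .map((w) -> w.toUpperCase())
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        System.out.println(upperJoin(Arrays.asList("apache", "spark")));  // prints APACHE SPARK
    }
}
```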


    Setting Up Maven Java Project

    Creating Maven Project:

    We will start by creating a Maven project in the Eclipse IDE with the following steps - 

    1. Open New Project wizard in Eclipse IDE as shown below:

      New Maven Project
    2. On the next screen, select the option Create a simple project to create a quick project, as below:

      New Maven Project Type and Location Selection
    3. Enter Group Id and Artifact Id on the next screen and finally click Finish to create the project, as below:

      Group and Artifact Id Selection

    At this point, you will start seeing your new project (in my case, it is spark-basics) in Project Explorer.

    Adding Spark Dependency:

    The next step is to add the Apache Spark libraries to our newly created project. In order to do so, we will add the following Maven dependency to our project's pom.xml file (spark-core built for Scala 2.11 is shown here; adjust the version to match your environment) - 

    <dependency>
    	<groupId>org.apache.spark</groupId>
    	<artifactId>spark-core_2.11</artifactId>
    	<version>2.3.0</version>
    </dependency>

    For completeness, here is what my pom.xml looks like after adding the above dependency (your Group Id and version will reflect what you entered in the project wizard) -

    <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    		xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    	<modelVersion>4.0.0</modelVersion>
    	<groupId>com.aksain.spark.basics</groupId>
    	<artifactId>spark-basics</artifactId>
    	<version>0.0.1-SNAPSHOT</version>
    	<dependencies>
    		<dependency>
    			<groupId>org.apache.spark</groupId>
    			<artifactId>spark-core_2.11</artifactId>
    			<version>2.3.0</version>
    		</dependency>
    	</dependencies>
    </project>

    After adding this dependency, Eclipse will automatically start downloading the libraries from the Maven repository. Please be patient, as it may take a while for Eclipse to download the jars and build your project.

    Developing Word Count Example

    Now, we will create a Java program that counts the words in an input list of sentences, as below -

    package com.aksain.spark.basics.rdds;

    import java.util.Arrays;
    import java.util.List;
    import java.util.Map;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    import scala.Tuple2;

    /**
     * @author Amit Kumar
     * 
     * Demonstrates the counting of words in input list of sentences.
     */
    public class WordCountDemo {
    	public static void main(String[] args) {
    		// Prepare the spark configuration by setting application name and master node "local" i.e. embedded mode
    		final SparkConf sparkConf = new SparkConf().setAppName("Word Count Demo").setMaster("local");

    		// Create the Java Spark Context by passing spark config
    		try(final JavaSparkContext jSC = new JavaSparkContext(sparkConf)) {
    			// Create list of sentences
    			final List<String> sentences = Arrays.asList(
    					"All Programming Tutorials",
    					"Getting Started With Apache Spark",
    					"Developing Java Applications In Apache Spark",
    					"Getting Started With RDDs In Apache Spark"
    			);

    			// Split the sentences into words, convert words to key, value with key as word and value 1, 
    			// and finally count the occurrences of a word
    			final Map<String, Integer> wordsCount = jSC.parallelize(sentences)
    									.flatMap((x) -> Arrays.asList(x.split(" ")).iterator())
    									.mapToPair((x) -> new Tuple2<String, Integer>(x, 1))
    									.reduceByKey((x, y) -> x + y)
    									.collectAsMap();

    			// Print the word counts
    			System.out.println(wordsCount);
    		}
    	}
    }
    The main highlights of the program are that we create a Spark configuration, create a Java Spark context from it, and then use the Java Spark context to count the words in the input list of sentences.
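    To see what Spark is doing in the pipeline above, here is a plain-Java equivalent of the same flatMap/map/reduce steps using the Java 8 Streams API on a single JVM (an illustrative sketch of my own; it does not use Spark, and the class and method names are made up for this example):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class WordCountStreams {
    // Same pipeline as the Spark job, on a single JVM:
    // flatMap sentences into words, then group identical words and count them
    // (the grouping/counting step plays the role of mapToPair + reduceByKey)
    static Map<String, Long> countWords(List<String> sentences) {
        return sentences.stream()
                .flatMap((x) -> Arrays.stream(x.split(" ")))
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
    }

    public static void main(String[] args) {
        System.out.println(countWords(Arrays.asList(
                "All Programming Tutorials",
                "Getting Started With Apache Spark",
                "Developing Java Applications In Apache Spark",
                "Getting Started With RDDs In Apache Spark")));
    }
}
```

    The difference is that Spark runs the same logic on partitions of an RDD, so it can scale the work across a cluster, while the stream version is confined to one machine.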


    Running Word Count Example

    Finally, we will execute our word count program. We can run our program in the following two ways - 

    1. Local mode: Since we set the master to "local" in the SparkConf object in our program, we can simply run this application from Eclipse like any other Java application. In other words, we can simply perform these operations on our program: Right Click -> Run As -> Java Application.
    2. Cluster mode: In this mode, we will need to remove setMaster("local") from the SparkConf object, as the master information will be provided at the time of submitting the code to the cluster. We will first need to create a JAR file (spark-basics.jar) of our code and use the below command to submit it to Spark running on a YARN cluster - 

      	spark-submit \
      	--class com.aksain.spark.basics.rdds.WordCountDemo \
      	--deploy-mode cluster \
      	--master yarn \
      	spark-basics.jar


    Either way, here is the output that this program will generate -

    {Apache=3, Started=2, With=2, All=1, Getting=2, Applications=1, In=2, Developing=1, Tutorials=1, Spark=3, Programming=1, Java=1, RDDs=1}
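    Note that the ordering of entries in this map is arbitrary. If you want the counts printed in a stable, readable order, a small plain-Java post-processing step can sort them alphabetically (an illustrative helper of my own, not part of the original program):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class SortedWordCounts {
    // Copy the counts into a TreeMap so keys print in alphabetical order
    static Map<String, Integer> alphabetical(Map<String, Integer> counts) {
        return new TreeMap<String, Integer>(counts);
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("Spark", 3);
        counts.put("Apache", 3);
        counts.put("All", 1);
        System.out.println(alphabetical(counts));  // prints {All=1, Apache=3, Spark=3}
    }
}
```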



    Thank you for reading through the tutorial. In case of any feedback, questions, or concerns, you can communicate the same to us through your comments, and we shall get back to you as soon as possible.

