    Setting up HDFS and YARN clusters in Hadoop 2.6.0
    Published on: 2018-05-25 16:05:37

    This tutorial provides step-by-step instructions to configure and start up a Hadoop cluster, including HDFS, YARN and the JobHistory server.

    Installing Hadoop

    We will be utilizing two virtual machines with the following configuration to set up the Apache Hadoop cluster:

    Parameter Name        Virtual Machine 1        Virtual Machine 2
    Name                  VM1                      VM2
    IP Address            192.168.111.130          192.168.111.132
    Operating System      Ubuntu-14.04.1-64bit     Ubuntu-14.04.1-64bit
    No. of CPU Cores      4                        4
    RAM                   6 GB                     6 GB

    The first step in installing Hadoop is to download its binaries on both virtual machines. In this article, we will be installing Apache Hadoop 2.6.0 to set up the cluster; the release can be downloaded from the Apache Hadoop release archive.

    Once the binaries have been downloaded on the virtual machines, extract them to the directory where you would like Hadoop to be installed. We will refer to this directory as $Hadoop_Base_Dir throughout this tutorial.
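
    Below is a minimal sketch of the download and extraction steps, assuming the Apache archive mirror and the /opt/app/big_data/hadoop install location used later in this tutorial; adjust both to your environment:

      # Run on both virtual machines
      wget https://archive.apache.org/dist/hadoop/common/hadoop-2.6.0/hadoop-2.6.0.tar.gz
      mkdir -p /opt/app/big_data/hadoop
      tar -xzf hadoop-2.6.0.tar.gz -C /opt/app/big_data/hadoop

      # $Hadoop_Base_Dir then refers to the extracted distribution
      export Hadoop_Base_Dir=/opt/app/big_data/hadoop/hadoop-2.6.0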

    Prerequisites

    Before you continue, please ensure that the following prerequisites have been fulfilled, so that you are able to follow this article without any problems (a quick check is sketched after the list):

    1. JDK 6 or higher is installed on both the virtual machines.
    2. The JAVA_HOME variable is set to the path where the JDK is installed.
    3. You have root access on both virtual machines, as all the steps should ideally be performed by the root user.
    4. The /etc/hosts file on each virtual machine has been updated with the IP address and hostname of the other virtual machine. E.g. /etc/hosts on VM1 will need an entry with the IP address of VM2 along with its hostname; in my case, this additional line in VM1's hosts file reads 192.168.111.132 VM2.
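
    Below is a minimal sketch of how to verify these prerequisites from a shell; the exact java -version output will differ depending on your JDK:

      # Run on both virtual machines
      java -version       # should report JDK 6 or higher
      echo $JAVA_HOME     # should print the JDK installation path

      # On VM1, add VM2's hosts entry (swap the IP and hostname on VM2)
      echo "192.168.111.132 VM2" >> /etc/hosts
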
    Configuring Hadoop Cluster

    After installing the Hadoop libraries, the next step is to configure them in order to set up the cluster. We will be setting up VM1 as the HDFS NameNode and YARN ResourceManager, while VM2 will be configured as an HDFS DataNode and YARN NodeManager.

    For the sake of simplicity, only the minimum mandatory configuration will be done (you may look up all the properties and their default values by following the links in the References section), in the following 4 steps for the core framework, HDFS, YARN and MapReduce respectively:

    1. Core Framework - To configure the common Apache Hadoop components, the following configuration needs to be placed in the $Hadoop_Base_Dir/etc/hadoop/core-site.xml file on all the virtual machines.
      $Hadoop_Base_Dir/etc/hadoop/core-site.xml
      <configuration>
        <property>
            <name>fs.defaultFS</name>
            <value>hdfs://192.168.111.130:50050/</value>
        </property>
        <property>
            <name>io.file.buffer.size</name>
            <value>131072</value>
        </property>
      </configuration>
      
      

      As shown in the above code, at least two properties need to be set. The first property, fs.defaultFS, contains the HDFS NameNode URI, which in our case is the IP address of VM1 along with the port (50050) chosen here for the NameNode. The other property, io.file.buffer.size, specifies the size in bytes of the read/write buffer used in SequenceFiles; 131072 bytes is 128 KB.
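
      Once the file is in place, you can sanity-check it by asking Hadoop to echo the effective value back; a minimal sketch using the stock hdfs getconf utility:

        $Hadoop_Base_Dir/bin/hdfs getconf -confKey fs.defaultFS
        # expected output: hdfs://192.168.111.130:50050/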

    2. HDFS - HDFS configuration differs between the two virtual machines in $Hadoop_Base_Dir/etc/hadoop/hdfs-site.xml. VM1 will be configured to act as the NameNode, while VM2 will be set up as a DataNode.

      To configure VM1 as the NameNode, the following property needs to be added to $Hadoop_Base_Dir/etc/hadoop/hdfs-site.xml :

      HDFS NameNode Configuration - $Hadoop_Base_Dir/etc/hadoop/hdfs-site.xml
      <configuration>
          <property>
              <name>dfs.namenode.name.dir</name>
              <value>/opt/app/big_data/hadoop/hadoop-2.6.0/hdfs/data</value>
          </property>
      </configuration>
      
      

      To configure VM2 as a DataNode, the following property needs to be added to $Hadoop_Base_Dir/etc/hadoop/hdfs-site.xml :

      HDFS DataNode Configuration - $Hadoop_Base_Dir/etc/hadoop/hdfs-site.xml
      <configuration>
          <property>
              <name>dfs.datanode.data.dir</name>
              <value>/opt/app/big_data/hadoop/hadoop-2.6.0/hdfs/data</value>
          </property>
      </configuration>
      
      

      Once the HDFS name and data directories have been set, please create the directories if they do not exist already, as sketched below.
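
      A minimal sketch of creating the directory, using the path configured above:

        # Run on VM1 (for dfs.namenode.name.dir) and on VM2 (for dfs.datanode.data.dir)
        mkdir -p /opt/app/big_data/hadoop/hadoop-2.6.0/hdfs/data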

    3. YARN - The next step is to configure the YARN cluster, wherein we will configure VM1 (192.168.111.130) as the ResourceManager and the other virtual machine as a NodeManager.

      In order to configure VM1 as the ResourceManager, update $Hadoop_Base_Dir/etc/hadoop/yarn-site.xml as follows:

      YARN ResourceManager Configuration - $Hadoop_Base_Dir/etc/hadoop/yarn-site.xml
      <configuration>
          <property>
              <name>yarn.resourcemanager.hostname</name>
              <value>192.168.111.130</value>
          </property>
      </configuration>
      
      

      Here is how the other virtual machine's $Hadoop_Base_Dir/etc/hadoop/yarn-site.xml needs to be set up in order for it to act as a NodeManager:

      YARN NodeManager Configuration - $Hadoop_Base_Dir/etc/hadoop/yarn-site.xml
      <configuration>
          <property>
              <name>yarn.resourcemanager.hostname</name>
              <value>192.168.111.130</value>
          </property>
          <property>
              <name>yarn.nodemanager.local-dirs</name>
              <value>/opt/app/big_data/hadoop/hadoop-2.6.0/yarn/data</value>
          </property>
          <property>
              <name>yarn.nodemanager.log-dirs</name>
              <value>/opt/app/big_data/hadoop/hadoop-2.6.0/yarn/logs</value>
          </property>
      </configuration>
      
      

      Once the YARN data and log directories have been set, please create the directories if they do not exist already, as sketched below.
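
      A minimal sketch of creating the NodeManager directories on VM2, using the paths configured above:

        mkdir -p /opt/app/big_data/hadoop/hadoop-2.6.0/yarn/data
        mkdir -p /opt/app/big_data/hadoop/hadoop-2.6.0/yarn/logs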

    4. MapReduce - This section covers setting the framework used to execute MapReduce jobs, via the property mapreduce.framework.name. Since we are setting up YARN, we will be using it to run MapReduce. This basically means that whenever tasks are submitted, they will be forwarded to the YARN cluster for processing. Another valid value for this property is local, which means that MapReduce jobs will be executed locally in a single JVM.

      MapReduce Configuration - $Hadoop_Base_Dir/etc/hadoop/mapred-site.xml
      <configuration>
          <property>
              <name>mapreduce.framework.name</name>
              <value>yarn</value>
          </property>
      </configuration>
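
      Note that a stock Hadoop 2.6.0 distribution ships only a template for this file, so you may need to create mapred-site.xml from it first:

        cd $Hadoop_Base_Dir/etc/hadoop
        cp mapred-site.xml.template mapred-site.xml
        # then add the mapreduce.framework.name property shown above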
      
      
    Starting Up Hadoop Cluster

    We will be starting the Hadoop cluster's HDFS, YARN and JobHistory components step by step to keep things easy and clear, as we did for the configuration part. A quick sanity check for all the daemons is sketched after these steps.

    • Starting up HDFS cluster - In order to start the HDFS cluster, the following steps need to be taken:

       

      1. Format the new distributed file system by running the following command on the NameNode (VM1):
        $Hadoop_Base_Dir/bin/hdfs namenode -format hadoop-dfs
        
        
      2. Start the HDFS NameNode by executing the following command on the NameNode (VM1):
        $Hadoop_Base_Dir/sbin/hadoop-daemon.sh start namenode
        
        
      3. Start the HDFS DataNodes by running the following command on all the DataNodes (in our case, only VM2):
        $Hadoop_Base_Dir/sbin/hadoop-daemon.sh start datanode
        
        

       

    • Starting up YARN cluster - The following steps need to be taken to get the YARN cluster started:

       

      1. Start the YARN ResourceManager by executing the following command on the ResourceManager (VM1):
        $Hadoop_Base_Dir/sbin/yarn-daemon.sh start resourcemanager
        
        
      2. Start the YARN NodeManagers by running the following command on all the NodeManagers (in our case, only VM2):
        $Hadoop_Base_Dir/sbin/yarn-daemon.sh start nodemanager
        
        

       

    • Starting up MapReduce JobHistory server - You can start the MapReduce JobHistory server by executing the following command on VM1:

       

      $Hadoop_Base_Dir/sbin/mr-jobhistory-daemon.sh start historyserver
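
    At this point, a quick sanity check is worthwhile. The sketch below uses the stock jps, hdfs dfsadmin and yarn utilities; the exact process IDs and report contents will differ on your machines:

      # On VM1: NameNode, ResourceManager and JobHistoryServer should be listed
      jps

      # On VM2: DataNode and NodeManager should be listed
      jps

      # From VM1: the HDFS report should show one live DataNode
      $Hadoop_Base_Dir/bin/hdfs dfsadmin -report

      # From VM1: the node list should show one running NodeManager
      $Hadoop_Base_Dir/bin/yarn node -list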
      
      

       

    Accessing Web Interfaces

    After all the commands to start the Hadoop cluster have been executed successfully as per the instructions in the above section, the next step is to check whether the cluster has been set up correctly by accessing its web interfaces.

     

    • You can monitor the HDFS cluster through the web application accessible at http://<NameNode(VM1)-hostname>:50070. For me, it was accessible at http://192.168.111.130:50070. Once you are able to access it, click on the Datanodes option in the top menu and verify that you can see one DataNode.
    • On the same lines, you can monitor the YARN cluster through the web application accessible at http://<ResourceManager(VM1)-hostname>:8088. For me, it was accessible at http://192.168.111.130:8088. Once you are able to access it, click on the Nodes link on the left side and verify that you can see one NodeManager.
    • Finally, Hadoop also lets you see the jobs that have finished, through the JobHistory web application accessible at http://<JobHistory(VM1)-hostname>:19888. For me, it was accessible at http://192.168.111.130:19888. A command-line check of all three interfaces is sketched below.
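
    If you prefer to verify from a shell, here is a minimal sketch using curl (assuming it is installed); an HTTP 200 response indicates the corresponding interface is up:

      curl -s -o /dev/null -w "%{http_code}\n" http://192.168.111.130:50070
      curl -s -o /dev/null -w "%{http_code}\n" http://192.168.111.130:8088
      curl -s -o /dev/null -w "%{http_code}\n" http://192.168.111.130:19888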

     

    References

    All the properties and their default values for Hadoop 2.6.0 are documented in the official Apache Hadoop documentation:

    • core-default.xml - https://hadoop.apache.org/docs/r2.6.0/hadoop-project-dist/hadoop-common/core-default.xml
    • hdfs-default.xml - https://hadoop.apache.org/docs/r2.6.0/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
    • yarn-default.xml - https://hadoop.apache.org/docs/r2.6.0/hadoop-yarn/hadoop-yarn-common/yarn-default.xml
    • mapred-default.xml - https://hadoop.apache.org/docs/r2.6.0/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml

    Thank you for reading through the tutorial. In case of any feedback, questions or concerns, you can share them with us through your comments and we shall get back to you as soon as possible.

