    Reading and Writing files in HDFS 2.6 using Java API
    Published on: 25th May 2018
    Posted By: Amit Kumar

    This tutorial provides instructions for creating, reading, and writing files in HDFS (Hadoop Distributed File System) using the Java API of Apache Hadoop 2.6.2.

    Pre-requisites


    Here are the pre-requisites required before following the instructions in this tutorial:

    1. Apache Hadoop Cluster - If you don't have a running Apache Hadoop cluster, please set up an Apache Hadoop 2.6 cluster.
    2. JDK 7 or later installed on your machine
    3. Eclipse IDE with Apache Maven Plugin

     

    Create a Maven Project


    We will start by creating a Maven project in Eclipse using the steps below:

    1. Open the New Project wizard in the Eclipse IDE as shown below:

      New Maven Project
    2. On the next screen, select the option Create a simple project to create a quick project, as shown below:

      New Maven Project Type and Location Selection
    3. Enter the Group Id and Artifact Id on the next screen and finally click Finish to create the project, as shown below:

      HDFS Basics Maven Project

    At this point, you will start seeing your new project (in my case, it is hdfs-basics) in Project Explorer.

     

    Adding Maven Dependency for HDFS Libraries


    The next step is to add the Apache Hadoop libraries to our newly created project. In order to do so, we will add the following Maven dependencies to our project's pom.xml file.

    <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-hdfs -->
    <dependency>
    	<groupId>org.apache.hadoop</groupId>
    	<artifactId>hadoop-hdfs</artifactId>
    	<version>2.6.2</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-common -->
    <dependency>
    	<groupId>org.apache.hadoop</groupId>
    	<artifactId>hadoop-common</artifactId>
    	<version>2.6.2</version>
    </dependency>

    For the sake of completeness, here is what my pom.xml looks like after adding the above dependencies:

    <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    	xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    	<modelVersion>4.0.0</modelVersion>
    	<groupId>com.aksain.hdfs.basics</groupId>
    	<artifactId>hdfs-basics</artifactId>
    	<version>0.0.1-SNAPSHOT</version>
    	<dependencies>
    		<!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-hdfs -->
    		<dependency>
    			<groupId>org.apache.hadoop</groupId>
    			<artifactId>hadoop-hdfs</artifactId>
    			<version>2.6.2</version>
    		</dependency>
    		<!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-common -->
    		<dependency>
    			<groupId>org.apache.hadoop</groupId>
    			<artifactId>hadoop-common</artifactId>
    			<version>2.6.2</version>
    		</dependency>
    
    	</dependencies>
    </project>

    After adding these dependencies, Eclipse will automatically start downloading the libraries from the Maven repository. Please be patient, as it may take a while for Eclipse to download the jars and build your project.

     

    Java Program for Creating File in HDFS


    Now we will create a Java program that creates a file named tutorials-links.txt in the directory /allprogtutorials in HDFS. We will then add tutorial links to this newly created file. Please replace 192.168.1.8 with your HDFS NameNode IP address or host name before running the program.

    package com.aksain.hdfs.basics;
    
    import java.io.PrintWriter;
    import java.net.URI;
    
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hdfs.DistributedFileSystem;
    
    /**
     * @author Amit Kumar
     * 
     * Demonstrates creating a file in HDFS and writing content into it.
     *
     */
    public class HDFSJavaAPIWriteDemo {
    	
    	public static void main(String[] args) throws Exception{
    		// Impersonates user "root" to avoid permission problems. You should replace it
    		// with the user that your HDFS cluster is running as
    		System.setProperty("HADOOP_USER_NAME", "root");
    		
    		// Path that we need to create in HDFS. Just like Unix/Linux file systems, HDFS file system starts with "/"
    		final Path path = new Path("/allprogtutorials/tutorials-links.txt");
    		
    		// Uses try with resources in order to avoid close calls on resources
    		// Creates anonymous sub class of DistributedFileSystem to allow calling initialize as DFS will not be usable otherwise
    		try(final DistributedFileSystem dFS = new DistributedFileSystem() { 
    					{
    						initialize(new URI("hdfs://192.168.1.8:50050"), new Configuration());
    					}
    				}; 
    				// Gets output stream for input path using DFS instance
    				final FSDataOutputStream streamWriter = dFS.create(path);
    				// Wraps output stream into PrintWriter to use high level and sophisticated methods
    				final PrintWriter writer = new PrintWriter(streamWriter);) {
    			// Writes tutorials information to file using print writer
    			writer.println("Getting Started with Apache Spark => http://www.allprogrammingtutorials.com/tutorials/getting-started-with-apache-spark.php");
    			writer.println("Developing Java Applications in Apache Spark => http://www.allprogrammingtutorials.com/tutorials/developing-java-applications-in-spark.php");
    			writer.println("Getting Started with RDDs in Apache Spark => http://www.allprogrammingtutorials.com/tutorials/getting-started-with-rdds-in-spark.php");
    			
    			System.out.println("File Written to HDFS successfully!");
    		}
    	}
    }
    

    You can execute this program by right-clicking on it in Eclipse and choosing Run As -> Java Application. Here is the output that you will see if the program runs successfully.

    File Written to HDFS successfully!

    You can verify that the file has been written successfully by browsing the file system in the HDFS web interface at http://192.168.1.8:50070/explorer.html#/allprogtutorials. As before, you need to replace 192.168.1.8 with the IP address/host name of your NameNode machine.
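
    If you prefer to verify from Java rather than the web UI, here is a minimal sketch that checks whether the file exists and prints its size. It uses the FileSystem.get() factory method (which returns a DistributedFileSystem instance for an hdfs:// URI, so the anonymous subclass trick above is not required) and assumes the same NameNode address, port and HDFS user as the write program; adjust these values for your cluster. The class name HDFSJavaAPIExistsDemo is arbitrary.

    package com.aksain.hdfs.basics;
    
    import java.net.URI;
    
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    
    /**
     * Minimal sketch: checks from Java that the file written above exists in HDFS.
     */
    public class HDFSJavaAPIExistsDemo {
    	
    	public static void main(String[] args) throws Exception {
    		// Assumption: same HDFS user as in the write program
    		System.setProperty("HADOOP_USER_NAME", "root");
    		
    		final Path path = new Path("/allprogtutorials/tutorials-links.txt");
    		
    		// FileSystem.get() returns a DistributedFileSystem instance for an hdfs:// URI
    		try (final FileSystem fs = FileSystem.get(new URI("hdfs://192.168.1.8:50050"), new Configuration())) {
    			if (fs.exists(path)) {
    				final FileStatus status = fs.getFileStatus(path);
    				System.out.println("File found, size in bytes: " + status.getLen());
    			} else {
    				System.out.println("File not found: " + path);
    			}
    		}
    	}
    }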

     

    Java Program for Reading File from HDFS


    Now we will create a Java program that reads the file named tutorials-links.txt from the directory /allprogtutorials in HDFS and prints its contents to the console. Please replace 192.168.1.8 with your HDFS NameNode IP address or host name before running the program.

    package com.aksain.hdfs.basics;
    
    import java.net.URI;
    import java.util.Scanner;
    
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hdfs.DistributedFileSystem;
    
    /**
     * @author Amit Kumar
     * 
     * Demonstrates reading a file from HDFS.
     *
     */
    public class HDFSJavaAPIReadDemo {
    	
    	public static void main(String[] args) throws Exception{
    		// Impersonates user "root" to avoid permission problems. You should replace it
    		// with the user that your HDFS cluster is running as
    		System.setProperty("HADOOP_USER_NAME", "root");
    		
    		// Path of the file that we will read from HDFS. Just like Unix/Linux file systems, HDFS paths start with "/"
    		final Path path = new Path("/allprogtutorials/tutorials-links.txt");
    		
    		// Uses try with resources in order to avoid close calls on resources
    		// Creates anonymous sub class of DistributedFileSystem to allow calling initialize as DFS will not be usable otherwise
    		try(final DistributedFileSystem dFS = new DistributedFileSystem() {
    					{
    						initialize(new URI("hdfs://192.168.1.8:50050"), new Configuration());
    					}
    				}; 
    				// Gets input stream for input path using DFS instance
    				final FSDataInputStream streamReader = dFS.open(path);
    				// Wraps input stream into Scanner to use high level and sophisticated methods
    				final Scanner scanner = new Scanner(streamReader);) {
    			
    			System.out.println("File Contents: ");
    			// Reads tutorials information from file using Scanner
    			while(scanner.hasNextLine()) {
    				System.out.println(scanner.nextLine());
    			}
    			
    		}
    	}
    }
    

    You can execute this program by right-clicking on it in Eclipse and choosing Run As -> Java Application. Here is the output that you will see if the program runs successfully.

    File Contents: 
    Getting Started with Apache Spark => http://www.allprogrammingtutorials.com/tutorials/getting-started-with-apache-spark.php
    Developing Java Applications in Apache Spark => http://www.allprogrammingtutorials.com/tutorials/developing-java-applications-in-spark.php
    Getting Started with RDDs in Apache Spark => http://www.allprogrammingtutorials.com/tutorials/getting-started-with-rdds-in-spark.php
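
    As a further exercise, here is a minimal sketch that lists the entries of the /allprogtutorials directory using the listStatus() method of the FileSystem API. It assumes the same NameNode address, port and HDFS user as the programs above, and the class name HDFSJavaAPIListDemo is arbitrary.

    package com.aksain.hdfs.basics;
    
    import java.net.URI;
    
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    
    /**
     * Minimal sketch: lists the entries of the /allprogtutorials directory in HDFS.
     */
    public class HDFSJavaAPIListDemo {
    	
    	public static void main(String[] args) throws Exception {
    		// Assumption: same HDFS user as in the programs above
    		System.setProperty("HADOOP_USER_NAME", "root");
    		
    		try (final FileSystem fs = FileSystem.get(new URI("hdfs://192.168.1.8:50050"), new Configuration())) {
    			// listStatus() returns one FileStatus per entry in the directory
    			for (final FileStatus status : fs.listStatus(new Path("/allprogtutorials"))) {
    				System.out.println((status.isDirectory() ? "dir  " : "file ")
    						+ status.getPath().getName() + " (" + status.getLen() + " bytes)");
    			}
    		}
    	}
    }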
    

     

    Thank you for reading through the tutorial. If you have any feedback, questions, or concerns, please share them with us in the comments and we will get back to you as soon as possible.

