This tutorial provides instructions for creating, reading, and writing files in HDFS (Hadoop Distributed File System) using the Java API of Apache Hadoop 2.6.2.
Pre-requisites
Here are the pre-requisites required before following the instructions in this tutorial:
- Apache Hadoop Cluster - If you don't have a running Apache Hadoop cluster, please set up an Apache Hadoop 2.6 cluster.
- JDK 7 or later installed on your machine
- Eclipse IDE with the Apache Maven plugin
Create a Maven Project
We will start by creating a Maven project in Eclipse using the steps below:
- Open the New Project wizard in the Eclipse IDE and select Maven Project.
- On the next screen, select the option Create a simple project to quickly create the project.
- Enter the Group Id and Artifact Id on the next screen and finally click Finish to create the project.
At this point, you will start seeing your new project (in my case, it is hdfs-basics) in Project Explorer.
Adding Maven Dependency for HDFS Libraries
The next step is to add the Apache Hadoop libraries to our newly created project. In order to do so, we will add the following Maven dependencies to our project's pom.xml file.
<!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-hdfs -->
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-hdfs</artifactId>
    <version>2.6.2</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-common -->
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>2.6.2</version>
</dependency>
For completeness, here is what my pom.xml looks like after adding the above dependencies:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.aksain.hdfs.basics</groupId>
    <artifactId>hdfs-basics</artifactId>
    <version>0.0.1-SNAPSHOT</version>

    <dependencies>
        <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-hdfs -->
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-hdfs</artifactId>
            <version>2.6.2</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-common -->
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>2.6.2</version>
        </dependency>
    </dependencies>
</project>
After adding these dependencies, Eclipse will automatically start downloading the libraries from the Maven repository. Please be patient, as it may take a while for Eclipse to download the jars and build your project.
Java Program for Creating File in HDFS
Now we will create a Java program for creating a file named tutorials-links.txt in the directory /allprogtutorials in Hadoop HDFS. We will then add tutorial links to this newly created file. Please replace 192.168.1.8 with your HDFS NameNode IP address / host name before running the program.
package com.aksain.hdfs.basics;

import java.io.PrintWriter;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

/**
 * @author Amit Kumar
 *
 * Demonstrates creating a file in distributed HDFS and writing content into it.
 */
public class HDFSJavaAPIWriteDemo {

    public static void main(String[] args) throws Exception {
        // Impersonates user "root" to avoid permission problems. You should replace it
        // with the user that your HDFS cluster is running with.
        System.setProperty("HADOOP_USER_NAME", "root");

        // Path that we need to create in HDFS. Just like Unix/Linux file systems, the HDFS file system starts with "/"
        final Path path = new Path("/allprogtutorials/tutorials-links.txt");

        // Uses try-with-resources in order to avoid explicit close calls on resources.
        // Creates an anonymous subclass of DistributedFileSystem to allow calling initialize(), as the DFS will not be usable otherwise.
        try (final DistributedFileSystem dFS = new DistributedFileSystem() {
                    {
                        initialize(new URI("hdfs://192.168.1.8:50050"), new Configuration());
                    }
                };
                // Gets an output stream for the given path using the DFS instance
                final FSDataOutputStream streamWriter = dFS.create(path);
                // Wraps the output stream into a PrintWriter to use higher-level write methods
                final PrintWriter writer = new PrintWriter(streamWriter)) {
            // Writes tutorial links to the file using the print writer
            writer.println("Getting Started with Apache Spark => http://www.allprogrammingtutorials.com/tutorials/getting-started-with-apache-spark.php");
            writer.println("Developing Java Applications in Apache Spark => http://www.allprogrammingtutorials.com/tutorials/developing-java-applications-in-spark.php");
            writer.println("Getting Started with RDDs in Apache Spark => http://www.allprogrammingtutorials.com/tutorials/getting-started-with-rdds-in-spark.php");

            System.out.println("File Written to HDFS successfully!");
        }
    }
}
You can execute this program simply by performing these operations on the program: Right Click -> Run As -> Java Application. Here is the output that you will see if the program runs successfully.
File Written to HDFS successfully!
You can browse the file system from your HDFS web user interface to verify that the file has been written successfully by visiting http://192.168.1.8:50070/explorer.html#/allprogtutorials. Obviously, you need to replace 192.168.1.8 with the IP address / host name of your NameNode machine.
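If you would rather verify the result from code instead of the web UI, here is a minimal sketch (not part of the original tutorial) that checks whether the file exists and prints its size. It uses FileSystem.get(), the standard factory method, as an alternative to the anonymous DistributedFileSystem subclass used above; the class name HDFSJavaAPIExistsDemo is just an illustrative name, and the NameNode address and user are assumed to be the same as in the write program.

package com.aksain.hdfs.basics;

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * Minimal sketch (not from the original tutorial): verifies that the file
 * written by HDFSJavaAPIWriteDemo exists and prints its size in bytes.
 */
public class HDFSJavaAPIExistsDemo {

    public static void main(String[] args) throws Exception {
        // Same user impersonation as in the write program
        System.setProperty("HADOOP_USER_NAME", "root");

        // Path of the file written by the previous program
        final Path path = new Path("/allprogtutorials/tutorials-links.txt");

        // FileSystem.get() returns the FileSystem implementation matching the URI scheme
        // (a DistributedFileSystem for hdfs:// URIs)
        try (final FileSystem fs = FileSystem.get(new URI("hdfs://192.168.1.8:50050"), new Configuration())) {
            if (fs.exists(path)) {
                final FileStatus status = fs.getFileStatus(path);
                System.out.println("File found in HDFS, size in bytes: " + status.getLen());
            } else {
                System.out.println("File not found: " + path);
            }
        }
    }
}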
Java Program for Reading File from HDFS
Now we will create a Java program for reading the file named tutorials-links.txt from the directory /allprogtutorials in Hadoop HDFS. We will then print the contents of the file to the console. Please replace 192.168.1.8 with your HDFS NameNode IP address / host name before running the program.
package com.aksain.hdfs.basics;

import java.net.URI;
import java.util.Scanner;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

/**
 * @author Amit Kumar
 *
 * Demonstrates reading a file from distributed HDFS.
 */
public class HDFSJavaAPIReadDemo {

    public static void main(String[] args) throws Exception {
        // Impersonates user "root" to avoid permission problems. You should replace it
        // with the user that your HDFS cluster is running with.
        System.setProperty("HADOOP_USER_NAME", "root");

        // Path of the file that we want to read from HDFS. Just like Unix/Linux file systems, the HDFS file system starts with "/"
        final Path path = new Path("/allprogtutorials/tutorials-links.txt");

        // Uses try-with-resources in order to avoid explicit close calls on resources.
        // Creates an anonymous subclass of DistributedFileSystem to allow calling initialize(), as the DFS will not be usable otherwise.
        try (final DistributedFileSystem dFS = new DistributedFileSystem() {
                    {
                        initialize(new URI("hdfs://192.168.1.8:50050"), new Configuration());
                    }
                };
                // Gets an input stream for the given path using the DFS instance
                final FSDataInputStream streamReader = dFS.open(path);
                // Wraps the input stream into a Scanner to use higher-level read methods
                final Scanner scanner = new Scanner(streamReader)) {
            System.out.println("File Contents: ");

            // Reads the tutorial links from the file line by line using the Scanner
            while (scanner.hasNextLine()) {
                System.out.println(scanner.nextLine());
            }
        }
    }
}
You can execute this program simply by performing these operations on the program: Right Click -> Run As -> Java Application. Here is the output that you will see if the program runs successfully.
File Contents:
Getting Started with Apache Spark => http://www.allprogrammingtutorials.com/tutorials/getting-started-with-apache-spark.php
Developing Java Applications in Apache Spark => http://www.allprogrammingtutorials.com/tutorials/developing-java-applications-in-spark.php
Getting Started with RDDs in Apache Spark => http://www.allprogrammingtutorials.com/tutorials/getting-started-with-rdds-in-spark.php
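Note that dFS.create(path) in the write program overwrites the file if it already exists. If you instead want to add more lines to the file created earlier, HDFS also supports appending. Below is a minimal sketch (not part of the original tutorial) that appends one placeholder line to the file. It assumes append is enabled on your cluster (dfs.support.append, which defaults to true in Hadoop 2.x), the same NameNode address and user as above, and uses the illustrative class name HDFSJavaAPIAppendDemo.

package com.aksain.hdfs.basics;

import java.io.PrintWriter;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * Minimal sketch (not from the original tutorial): appends a line to the
 * file created by HDFSJavaAPIWriteDemo. Requires append to be enabled on
 * the cluster (the Hadoop 2.x default) and the file to already exist.
 */
public class HDFSJavaAPIAppendDemo {

    public static void main(String[] args) throws Exception {
        // Same user impersonation as in the write and read programs
        System.setProperty("HADOOP_USER_NAME", "root");

        // Path of the existing file to append to
        final Path path = new Path("/allprogtutorials/tutorials-links.txt");

        // FileSystem.get() is the standard factory; fs.append() opens the existing file for appending
        try (final FileSystem fs = FileSystem.get(new URI("hdfs://192.168.1.8:50050"), new Configuration());
                final FSDataOutputStream streamWriter = fs.append(path);
                final PrintWriter writer = new PrintWriter(streamWriter)) {
            // Placeholder content; replace with your own line
            writer.println("One more tutorial link => (placeholder appended line)");
            System.out.println("Line appended to file in HDFS successfully!");
        }
    }
}

If you re-run HDFSJavaAPIReadDemo after this, the appended line should show up at the end of the file contents.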
Thank you for reading through the tutorial. In case of any feedback, questions, or concerns, you can share them with us through your comments, and we shall get back to you as soon as possible.