Last Updated: February 25, 2016

HDFS - create a new file

TL;DR

When you create a file using FileSystem#create, a new thread is spawned to handle the actual data transfer. Remember to call FSDataOutputStream#flush and FSDataOutputStream#sync.

Overview

Creating new files on HDFS with the hadoop-hdfs classes is quite simple. Call FileSystem#get with a proper Configuration object to obtain an instance of DistributedFileSystem. Then call its #create method to get an FSDataOutputStream instance and use it to write your data. See the snippet below:

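import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

Configuration conf = new Configuration();
// Address of the namenode; adjust host and port to your cluster.
conf.set("fs.default.name", "hdfs://localhost:54310");

FileSystem fs = FileSystem.get(conf);
FSDataOutputStream stream = fs.create(new Path("/file.txt"));
stream.write("test".getBytes());
stream.flush();
// sync() is deprecated in Hadoop 2.x; hflush() is its replacement.
stream.sync();
stream.close();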

But how does the data get from your machine to HDFS if the #write method does not block the current thread during the transfer?

In detail

Let's take a closer look at how it works. DistributedFileSystem#create first expands all relative paths into absolute ones. Then it resolves all symlinks in those paths into absolute paths.
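To make the first step more concrete, here is a rough sketch of what "expanding a relative path" means. Hadoop does this with its own Path and working directory logic; the sketch below only borrows java.net.URI to show the idea. The class name, the method name and the /user/alice working directory are made up for illustration, and symlink resolution is not modeled at all.

```java
import java.net.URI;

// Illustration only: not the code DistributedFileSystem actually runs.
class PathExpansionSketch {

    // Resolves a possibly relative path against a working directory.
    static String makeAbsolute(String workingDir, String path) {
        // The trailing slash makes URI treat workingDir as a directory.
        URI base = URI.create(workingDir.endsWith("/") ? workingDir : workingDir + "/");
        // Absolute paths come back unchanged, relative ones are appended to base.
        return base.resolve(path).getPath();
    }

    public static void main(String[] args) {
        System.out.println(makeAbsolute("/user/alice", "file.txt"));
        System.out.println(makeAbsolute("/user/alice", "/file.txt"));
    }
}
```

A relative path ends up under the working directory, while a path that is already absolute is left alone.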

After that it calls DFSClient#create, which forwards the call to DFSOutputStream#newStreamForCreate. This particular method is responsible for the magic.
It calls the namenode via RPC to create the file entry in the file system. At this moment, the created file is visible and readable for all HDFS users!

Next, still in the body of #newStreamForCreate, a new DFSOutputStream is created. Its constructor creates a new DataStreamer (a subclass of Thread), which is started immediately.

DFSClient returns the DFSOutputStream instance.
DistributedFileSystem wraps the DFSOutputStream in an HdfsDataOutputStream before returning it. By then the DataStreamer is already running as a separate thread. That's how your data is streamed to HDFS.
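The whole mechanism can be modeled with plain JDK classes. The sketch below is not Hadoop code: StreamerSketch, its queue and the END marker are names I made up, and the "transfer" is just appending to a list. It only shows the pattern described above, assuming the writer hands packets to a queue and a background thread drains it.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Simplified stand-in for the DFSOutputStream / DataStreamer pair.
class StreamerSketch {
    // Marker packet that tells the streamer thread to stop.
    private static final byte[] END = new byte[0];

    private final BlockingQueue<byte[]> dataQueue = new LinkedBlockingQueue<>();
    private final List<String> sent = new ArrayList<>(); // stands in for the datanode
    private final Thread streamer;

    StreamerSketch() {
        // The constructor only creates the background thread.
        streamer = new Thread(this::drain, "streamer-sketch");
        streamer.setDaemon(true);
    }

    void start() {
        streamer.start();
    }

    // Returns immediately: the caller only enqueues the packet.
    void write(byte[] packet) {
        dataQueue.add(packet);
    }

    // Blocks until the streamer has processed everything written so far.
    void close() {
        dataQueue.add(END);
        try {
            streamer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    // Body of the background thread: take packets until the END marker shows up.
    private void drain() {
        try {
            for (byte[] p = dataQueue.take(); p != END; p = dataQueue.take()) {
                synchronized (sent) {
                    sent.add(new String(p, StandardCharsets.UTF_8));
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    List<String> sent() {
        synchronized (sent) {
            return new ArrayList<>(sent);
        }
    }

    public static void main(String[] args) {
        StreamerSketch out = new StreamerSketch();
        out.start();
        out.write("test".getBytes(StandardCharsets.UTF_8)); // does not wait for the transfer
        out.close();                                         // waits for the streamer to finish
        System.out.println(out.sent());
    }
}
```

Note that #write never touches the "network" here. Only #close waits for the background thread, which is also why skipping it can leave data behind in the queue.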

Additional notes

If you pass a Progressable instance to DistributedFileSystem#create, its #progress method is called by DFSOutputStream in the data upload loop. However, there is no direct way to find out how much data has already been sent.
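Here is a small sketch of why the callback tells you so little. The nested Progressable interface mirrors the shape of org.apache.hadoop.util.Progressable (a single no-argument method), but the upload loop, the packet size and the class name are hypothetical. How often the real client invokes the callback is not modeled here.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Illustration only: a made-up upload loop with a Progressable-like callback.
class ProgressSketch {

    // Same shape as Hadoop's Progressable: one callback without arguments.
    interface Progressable {
        void progress();
    }

    // "Sends" the data packet by packet and reports progress after each packet.
    static void upload(byte[] data, int packetSize, Progressable progress) {
        for (int offset = 0; offset < data.length; offset += packetSize) {
            // A real client would ship the bytes of this packet here.
            progress.progress();
        }
    }

    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        // 10 bytes in packets of 4 bytes means 3 packets.
        upload(new byte[10], 4, calls::incrementAndGet);
        System.out.println("progress() calls: " + calls.get());
    }
}
```

The callback receives no arguments, so all you can do is count invocations. The number of bytes behind each call stays hidden from the caller.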

Legend

org.apache.hadoop.fs.FileSystem
org.apache.hadoop.hdfs.DistributedFileSystem
org.apache.hadoop.hdfs.DFSOutputStream
org.apache.hadoop.hdfs.DFSOutputStream.DataStreamer
org.apache.hadoop.hdfs.client.HdfsDataOutputStream
org.apache.hadoop.hdfs.DFSClient
org.apache.hadoop.fs.FSDataOutputStream
org.apache.hadoop.util.Daemon
org.apache.hadoop.util.Progressable

#bigdata

#hadoop

#hdfs