Diving into HDFS
Yesterday I wanted to start learning about how HDFS (the Hadoop Distributed File System) works internally. I knew that
- It’s distributed, so one file may be stored across many different machines
- There’s a namenode, which keeps track of where all the files are stored
- There are data nodes, which contain the actual file data
But I wasn’t quite sure how to get started! I knew how to navigate the filesystem from the command line (`hadoop fs -ls /`, and friends), but not how to figure out how it works internally.
Colin Marc pointed me to this great library called snakebite which is a Python HDFS client. In particular he pointed me to the part of the code that reads file contents from HDFS. We’re going to tear it apart a bit and see what exactly it does!
Getting started: Elastic MapReduce!
I didn’t want to set up a Hadoop cluster by hand, and I had some AWS credit that I’d gotten for free, so I set up a small Amazon Elastic MapReduce cluster. I worked on this with Pablo Torres and Sasha Laundy, and we spent much of the morning fighting with it, trying to figure out protocol versions and why it wasn’t working with Snakebite.
What ended up working was choosing AMI version “3.0.4 (hadoop 2.2.0)”. This is CDH5 and Hadoop protocol version 9. Hadoop versions are confusing. We installed that along with Snakebite version 2.4.1, and it almost worked.
Important things:
- We needed to look at `/home/hadoop/conf/core-site.xml` to find the namenode IP and port (in `fs.default.name`)
- We needed to edit snakebite/config.py to say ‘fs.default.name’ instead of ‘fs.defaultFS’. Who knows. It worked.
Once we did this, we could run `snakebite ls /` successfully! Time to move on to breaking things!
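If you’d rather poke at the namenode from Python instead of the command line, something like this should work. The IP and port here are placeholders (use whatever `fs.default.name` says on your cluster), and the `hadoop_version` argument is how I believe snakebite 2.x picks protocol version 9, so double-check it against your snakebite version.

```python
from snakebite.client import Client

# Placeholder namenode address: use the host/port from fs.default.name
# in /home/hadoop/conf/core-site.xml on your own cluster.
client = Client("10.0.0.1", 9000, hadoop_version=9)

# Roughly the same thing as `snakebite ls /`: ls() takes a list of paths
# and yields a dict of metadata for each entry it finds.
for entry in client.ls(["/"]):
    print(entry)
```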
Putting data into our cluster
I copied some Wikipedia data from one of Amazon’s public datasets like this:
```
hadoop distcp s3://datasets.elasticmapreduce/wikipediaxml/part-116.xml /wikipedia
```
This creates a file in HDFS called `/wikipedia`. You can see more datasets that are easy to copy into HDFS from Amazon at https://s3.amazonaws.com/datasets.elasticmapreduce/.
Getting a block from our file!
Now that we have a Hadoop cluster, some data in HDFS, and a tool to look at it with (snakebite), we can really get started!
Files in HDFS are split into blocks. When getting a file from HDFS, the first thing we need to do is to ask the namenode where the blocks are stored.
With the help of a lot of snakebite source diving, I wrote a small Python function to do this called `find_blocks`. You can see it in a tiny Python module I made called `hdfs_fun.py`. To get it to work, you’ll need a Hadoop cluster and snakebite.
```python
>>> cl = hdfs_fun.create_client()
>>> hdfs_fun.find_blocks(cl, '/wikipedia')
[<snakebite.protobuf.hdfs_pb2.LocatedBlockProto at 0xe33a910>,
 <snakebite.protobuf.hdfs_pb2.LocatedBlockProto at 0xe33ab40>]
```
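For reference, here’s a rough sketch of what those two helpers boil down to. This is a minimal reconstruction based on reading the snakebite source, not the exact code in `hdfs_fun.py`, and the internal names (`client.service`, the generated protobuf module) are the parts most likely to differ between snakebite versions.

```python
from snakebite.client import Client
import snakebite.protobuf.ClientNamenodeProtocol_pb2 as client_proto


def create_client():
    # Placeholder namenode host/port, same values as in core-site.xml.
    return Client("10.0.0.1", 9000, hadoop_version=9)


def find_blocks(client, path):
    # Build a getBlockLocations RPC request: "which blocks make up this
    # file, and which datanodes hold them?"
    request = client_proto.GetBlockLocationsRequestProto()
    request.src = path
    request.offset = 0
    request.length = 2 ** 40  # ask about the whole file (up to 1 TB)

    # client.service is snakebite's RPC channel to the namenode; it sends
    # the protobuf above and parses the response back into Python objects.
    response = client.service.getBlockLocations(request)

    # The response wraps a LocatedBlocksProto; its .blocks field is a list
    # of LocatedBlockProto, one per HDFS block.
    return list(response.locations.blocks)
```

The nice thing is that “where are the blocks for this file?” really is just one small protobuf request to the namenode.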
One of the first things I did was use `strace` to find out what data actually gets sent over the wire when I call this function. Here’s a snippet (the full output is a lot longer).
Part of the request: asking for the block locations for the `/wikipedia` file.
```
sendto(7, "\n\21getBlockLocations\22.org.apache.hadoop.hdfs.protocol.ClientProtocol\30\1", 69, 0, NULL, 0) = 69
sendto(7, "\n\n/wikipedia\20\0\30\337\260\240]", 19, 0, NULL, 0) = 19
```
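Those backslash escapes are just protobuf wire format, and the second message (the 19-byte one) is small enough to decode by hand. Here’s a little decoder I wrote to check; it isn’t snakebite code, just a minimal varint parser.

```python
# The second sendto() payload from the strace output above, as bytes
# (strace prints octal escapes; \20 is 0x10, \30 is 0x18, and so on).
payload = bytearray(b"\n\n/wikipedia\x10\x00\x18\xdf\xb0\xa0]")


def read_varint(buf, i):
    """Read a base-128 varint starting at buf[i]; return (value, next index)."""
    value, shift = 0, 0
    while True:
        byte = buf[i]
        value |= (byte & 0x7F) << shift
        i += 1
        if not byte & 0x80:
            return value, i
        shift += 7


i, fields = 0, {}
while i < len(payload):
    tag, i = read_varint(payload, i)
    field_number, wire_type = tag >> 3, tag & 0x07
    if wire_type == 2:    # length-delimited: strings, nested messages
        length, i = read_varint(payload, i)
        fields[field_number] = bytes(payload[i:i + length])
        i += length
    elif wire_type == 0:  # varint-encoded integers
        fields[field_number], i = read_varint(payload, i)

print(fields)
# field 1 is the path ('/wikipedia'), field 2 is the offset (0), and
# field 3 is the length: 195565663, which turns out to be the size of
# the whole file.
```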
Part of the response: (I’ve removed most of it to point out some of the important parts)
```
recvfrom(7, "....BP-1019336183-10.165.43.39-1400088409498.......................... 10.147.177.170-9200-1400088495802........................ BP-1019336183-10.165.43.39-1400088409498.............10.147.177.170-9200-1400088495802 \360G(\216G0\361G8\0\20\200\240\201\213\275\f\30\200\340\376] \200\300\202\255\274\f(\200\340\376]0\212\306\273\205\340(8\1B\r/default-rackP\0 \0*\10\n\0\22\0\32\0\"\0\30\0\"\355", 731, 0, NULL, NULL) = 731
```
Back in our Python console, we can see what some of these numbers mean:
```python
>>> blocks[0].b.poolId
u'BP-1019336183-10.165.43.39-1400088409498'
>>> blocks[0].b.numBytes
134217728L
>>> blocks[0].locs[0].id.ipAddr
u'10.147.177.170'
>>> blocks[0].locs[0].id.xferPort
9200
>>> blocks[1].b.poolId
u'BP-1019336183-10.165.43.39-1400088409498'
>>> blocks[1].b.numBytes
61347935L
```
So we have two blocks! The two `numBytes` add up to the total size of the file! Cool! They both have the same `poolId`, and it also turns out that they have the same IP address and port.
Reading a block
Let’s try to read the data from a block! (You can see the `read_block` function in `hdfs_fun.py`.)

```python
>>> block = blocks[0]
>>> gen = hdfs_fun.read_block(block)  # returns a generator
>>> load = gen.next()
```
If I look at `strace`, it starts with:
```
connect(8, {sa_family=AF_INET, sin_port=htons(9200), sin_addr=inet_addr("10.147.177.170")}, 16) = 0
sendto(8, "\nB\n5\n3\n(BP-1019336183-10.165.43.39-1400088409498\20\211\200\200\200\4\30\361\7\22\tsnakebite\20\0\30\200\200\200@", 75, 0, NULL, 0) = 75
```
Awesome. We can see easily that it’s connecting to the block’s data node (`10.147.177.170` on port `9200`) and asking for something with id `BP-1019336183-10.165.43.39-1400088409498`. Then the data node starts sending back data!!!
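One more aside of my own: the request ends with a length field. Running the `read_varint` helper from earlier over its last five bytes gives 134217728, which matches the first block’s `numBytes` exactly, so it looks like the client is asking for the whole block in one go (that’s my reading of the field, so take it with a grain of salt).

```python
# The read request ends with "\30\200\200\200@" (octal escapes): a field-3 tag
# followed by a varint. Decoding it with read_varint from the earlier snippet:
tail = bytearray(b"\x18\x80\x80\x80\x40")
tag, i = read_varint(tail, 0)         # tag 0x18 -> field 3, varint wire type
bytes_to_read, _ = read_varint(tail, i)
print(bytes_to_read)                   # 134217728, i.e. blocks[0].b.numBytes
```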
```
recvfrom(8, "ot, it's a painting. Thomas Graeme apparently lived in the mid-18th century, according to the [[Graeme Park]] article. The rationale also says that this image is "used on the biography page about him by USHistory.org of Graeme Park." I cannot quite figure out what this means, but I am guessing that it means the uploader took this image from a page hosted on USHistory.org. A painting of a man who lived in the mid-18th century is likely to be the public domain, as claimed, but we have no good source", 512, 0, NULL, NULL) = 512
```
AMAZING. We have conquered HDFS.
That’s all for this blog post! We’ll see if I do more later today.