As you may already know, I really like strace. (It has a whole category on this blog). So when the people at Big Data Montreal asked if I wanted to give a talk about stracing Hadoop, the answer was YES OBVIOUSLY.
I set up a small Hadoop cluster (1 master, 2 workers, replication set to 1) on Google Compute Engine to get this working, so that's what we'll be talking about. It has one 14GB CSV file, which contains part of this Wikipedia revision history dataset.
Let's start diving into HDFS! (If this is familiar to you, I talked about a lot of this already in Diving into HDFS. There are new things, though! At the end of this we edit the blocks on the datanode and see what happens and it's GREAT.)
Files are split into blocks
HDFS is a distributed filesystem, so a file can be split across many machines. I wrote a little module to help explore how a file is distributed. Let's take a look!
You can see the source code for all this in hdfs_fun.py.
This tells us that
wikipedia.csv is split into 113 blocks, which are
all 128MB except the last one, which is smaller. They have block IDs
1073742025 - 1073742137. Some of them are on hadoop-w-0, and some are on hadoop-w-1.
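The arithmetic behind that block list is simple enough to sketch in a few lines of Python. (The exact file size here is made up, but it's in the right ballpark for our 14GB file; the block size and IDs are the real ones from above.)

```python
BLOCK_SIZE = 128 * 1024 * 1024  # HDFS default block size: 134217728 bytes

def block_layout(file_size, first_block_id):
    """Return (block_id, offset, length) for each HDFS block of a file."""
    blocks = []
    offset, block_id = 0, first_block_id
    while offset < file_size:
        length = min(BLOCK_SIZE, file_size - offset)
        blocks.append((block_id, offset, length))
        offset += length
        block_id += 1
    return blocks

# Hypothetical size in the right range for wikipedia.csv
layout = block_layout(15_100_000_000, 1073742025)
print(len(layout))                  # 113 blocks
print(layout[-1][0])                # 1073742137, the last block ID
print(layout[-1][2] < BLOCK_SIZE)   # True: the last block is smaller
```

Every block is exactly 128MB except the last one, which is whatever is left over — exactly the pattern in the table.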
Let's see the same thing using strace!
Part 1: talk to the namenode!
We ask the namenode where /wikipedia.csv is...
... and get an answer!
recvfrom(4, "\255\202\2\n\251\202\2\10\350\223\354\2378\22\233\2\n7\n'BP-572418726-10.240.98.73-1417975119036\20\311\201\200\200\4\30\261\t \200\200\200@\20\0\32\243\1\nk\n\01610.240.146.168\22%hadoop-w-1.c.stracing-hadoop.internal\32$358043f6-051d-4030-ba9b-3cd0ec283f6b \332\206\3(\233\207\0030\344\206\0038\0\20\200\300\323\356&\30\200\300\354\372\32 \200\240\377\344\4(\200\300\354\372\0320\374\260\234\276\242)8\1B\r/default-rackP\0X\0`\0 \0*\10\n\0\22\0\32\0\"\0002\1\0008\1B'DS-3fa133e4-2b17-4ed1-adca-fed4767a6e6f\22\236\2\n7\n'BP-572418726-10.240.98.73-1417975119036\20\312\201\200\200\4\30\262\t \200\200\200@\20\200\200\200@\32\243\1\nk\n\01610.240.146.168\22%hadoop-w-1.c.stracing-hadoop.internal\32$358043f6-051d-4030-ba9b-3cd0ec283f6b \332\206\3(\233\207\0030\344\206\0038\0\20\200\300\323\356&\30\200\300\354\372\32 \200\240\377\344\4(\200\300\354\372\0320\374\260\234\276\242)8\1B\r/default-rackP\0X\0`\0 \0*\10\n\0\22\0\32\0\"\0002\1\0008\1B'DS-3fa133e4-2b17-4ed1-adca-fed4767a6e6f\22\237\2\n7\n'BP-572418726-10.240.98.73-1417975119036\20\313\201\200\200\4\30\263\t \200\200\200@\20\200\200\200\200\1\32\243\1\nk\n\01610.240.109.224\22%hadoop-w-0.c.stracing-hadoop.internal\32$bd6125d3-60ea-4c22-9634-4f6f352cfa3e \332\206\3(\233\207\0030\344\206\0038\0\20\200\300\323\356&\30\200\240\342\335\35 \200\240\211\202\2(\200\240\342\335\0350\263\257\234\276\242)8\1B\r/default-rackP\0X\0`\0 \0*\10\n\0\22\0\32\0\"\0002\1\0008\1B'DS-c5ef58ca-95c4-454d-adf4-7ceaf632c035\22\237\2\n7\n'BP-572418726-10.240.98.73-1417975119036\20\314\201\200\200\4\30\264\t \200\200\200@\20\200\200\200\300\1\32\243\1\nk\n\01610.240.146.168\22%hadoop-w-1.c.stracing-hadoop.inte"..., 33072, 0, NULL, NULL) = 32737
The hostnames in this answer totally match up with the table of where we think the blocks are!
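You don't even need to decode the protobuf to spot those hostnames — they're sitting in the byte soup as plain length-prefixed strings. Here's a quick-and-dirty sketch fishing them out with a regex, using a short hand-copied excerpt of the buffer above (the real one is ~32KB):

```python
import re

# Excerpt of the recvfrom() buffer: IP, hostname, and datanode UUID for
# two of the datanodes, length-prefixed protobuf-style.
buf = (b"\x0e10.240.146.168\x12%hadoop-w-1.c.stracing-hadoop.internal"
       b"\x1a$358043f6-051d-4030-ba9b-3cd0ec283f6b"
       b"\x0e10.240.109.224\x12%hadoop-w-0.c.stracing-hadoop.internal")

# Strings are stored verbatim, so a regex finds them without any decoding
hosts = sorted(set(re.findall(rb"[\w.-]+\.internal", buf)))
ips = sorted(set(re.findall(rb"\d+\.\d+\.\d+\.\d+", buf)))
print(hosts)  # both workers show up
print(ips)
```

This is basically what your eyes are doing when you read strace output anyway, just automated.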
Part 2: ask the datanode for data!
So the next part is that we ask
10.240.146.168 for the first block.
This sequence matches up exactly with the order of the blocks in the table up at the top! So fun. Next, we can look at the message the client is sending to the datanodes:
This is a little hard to read, but it turns out it's a Protocol Buffer and so we can parse it pretty easily. Here's what it's trying to say:
And then, of course, we get a response:
Which is just the beginning of a CSV file! How wonderful.
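If you're wondering how "it's a Protocol Buffer so we can parse it pretty easily" actually works: the wire format is just a sequence of (field tag, value) pairs, where tags and integers are varints. Here's a toy decoder — enough to get the flavor of what `protoc --decode_raw` does, not the real Hadoop message definitions:

```python
def read_varint(buf, pos):
    """Decode a protobuf varint starting at pos; return (value, new_pos)."""
    result, shift = 0, 0
    while True:
        b = buf[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if not (b & 0x80):  # high bit clear means this was the last byte
            return result, pos
        shift += 7

def decode_fields(buf):
    """Yield (field_number, wire_type, value) for a flat protobuf message."""
    pos = 0
    while pos < len(buf):
        tag, pos = read_varint(buf, pos)
        field, wire_type = tag >> 3, tag & 0x7
        if wire_type == 0:    # varint
            value, pos = read_varint(buf, pos)
        elif wire_type == 2:  # length-delimited: string, bytes, or nested message
            length, pos = read_varint(buf, pos)
            value, pos = buf[pos:pos + length], pos + length
        else:
            raise NotImplementedError(f"wire type {wire_type}")
        yield field, wire_type, value

# The canonical protobuf example: field 1 = 150, plus a string in field 2
print(list(decode_fields(b"\x08\x96\x01\x12\x05hello")))
# [(1, 0, 150), (2, 2, b'hello')]
```

The messages the client and datanode exchange are nested versions of exactly this structure.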
Part 3: finding the block on the datanode!
Seeing the datanode send us the data is nice, but what if we want to get even closer to the data? It turns out that this is really easy. I sshed into my data node and searched the disk, with the idea that maybe there was a file with 1073742025 in the name that had the block data. And there was!
It has exactly the right size (134217728 bytes), and if we look at the beginning, it contains exactly the data from the first 128MB of the CSV file. GREAT.
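If you want to reproduce the hunt, searching by block ID is the reliable move, because the datanode's data directory path varies by install. Here's a sketch — it fakes up a datanode-style directory tree (the block pool name is the real one from the strace output; the `blk_` file naming and the exact layout are my best understanding of HDFS, not gospel) so it runs anywhere:

```python
import glob
import os
import tempfile

BLOCK_ID = "1073742025"

# Stand-in for the real data dir (wherever dfs.datanode.data.dir points);
# we fake one here so the sketch is self-contained.
root = tempfile.mkdtemp()
subdir = os.path.join(root, "current",
                      "BP-572418726-10.240.98.73-1417975119036",
                      "current", "finalized")
os.makedirs(subdir)
# HDFS stores the block data plus a .meta file of checksums alongside it
for name in (f"blk_{BLOCK_ID}", f"blk_{BLOCK_ID}_1225.meta"):
    open(os.path.join(subdir, name), "w").close()

# The actual search: any file with the block ID in its name, anywhere under root
matches = glob.glob(os.path.join(root, "**", f"*{BLOCK_ID}*"), recursive=True)
for path in sorted(matches):
    print(path)
```

On the real datanode you'd point this (or a `find`) at the data directory instead of a tempdir, and the `blk_1073742025` file it turns up is the one with our 128MB of CSV in it.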
Super fun exciting part: Editing the block on the datanode
So I was giving this talk yesterday, and was doing a live demo where I was ssh'd into the data node, and we were looking at the file for the block. And suddenly I thought... WAIT WHAT IF WE EDITED IT GUYS?!
And someone commented "No, it won't work, there's metadata, the checksum will fail!". So, of course, we tried it, because toy clusters are for breaking.
And it worked! Which perhaps wasn't super surprising, because replication
was set to 1 and maybe a 128MB file is too big to checksum
every time you want to read from it, but REALLY FUN. I edited the
beginning of the file to say
AWESOME AWESOME AWESOME instead of
whatever it said before (keeping the file size the same), and then a
snakebite cat /wikipedia.csv showed the file starting with AWESOME AWESOME AWESOME.
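The edit itself is just an in-place overwrite of the first bytes of the block file, being careful not to change the length. Here's the same stunt against a scratch file standing in for the block (do NOT do this to a cluster you care about):

```python
import os
import tempfile

# Scratch stand-in for the block file on the datanode
path = os.path.join(tempfile.mkdtemp(), "blk_1073742025")
with open(path, "wb") as f:
    f.write(b"contributor,timestamp,title\n" + b"x" * 1000)
before = os.path.getsize(path)

banner = b"AWESOME AWESOME AWESOME"
with open(path, "r+b") as f:  # r+b: write in place, no truncation
    f.write(banner)           # replaces exactly len(banner) bytes at offset 0

assert os.path.getsize(path) == before  # file size unchanged
with open(path, "rb") as f:
    print(f.read(len(banner)))  # b'AWESOME AWESOME AWESOME'
```

Opening with `r+b` instead of `wb` is the whole trick: `wb` would truncate the file to zero, while `r+b` overwrites bytes in place and leaves the rest of the 128MB alone.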
So some lessons:
- I'd really like to know more about data consistency in Hadoop clusters
- live demos are GREAT
- writing a blog is great because then people ask me to give talks about fun things I write about like stracing Hadoop
That's all folks! There are slides for the talk I gave, though this post is guaranteed to be much better than the slides. And maybe video for that talk will be up at some point.