I used to think that executables were totally impenetrable. I'd compile a C program, and then that was it! I had a Magical Binary Executable that I could no longer read.
It is not so! Executable file formats are regular file formats that you can understand. I'll explain some simple tools to start! We'll be working on Linux, with ELF binaries. (Binaries are kind of the definition of platform-specific, so this is all platform-specific.) We'll be using C, but you could just as easily look at output from any compiled language.
Let's write a simple C program, hello.c, that prints "Penguin!" with printf.
Then we compile it (gcc -o hello hello.c), and we have a binary called hello. This initially seems impenetrable (how do we even binary?!), but let's see how we can investigate it! We're going to learn what symbols, sections, and segments are. At a high level:
- symbols are like function names, and are used to answer "If I call printf and it's defined somewhere else, how do I find it?"
- symbols are organized into sections: code lives in one section (.text), and data in another
- sections are organized into segments
Throughout we'll use a tool called readelf to look at these.
So, let's dive into our binary!
Step 1: open it in a text editor!
This is the most naive possible way to view a binary. If I open hello this way, I get something like this:
ELF>@@[email protected] @@@@@@��88@@@@�� ((`(`� PP`P`��P�td@,,Q�tdR�td((`(`��/lib64/ld-linux-x86-64.so.2GNUGNUϨ�n��8�w�j7*oL�h�� __gmon_start__libc.so.6puts__libc_start_mainGLIBC_2.2.5ui 1```H��k����H���5 H��fff.�H�=p UH��t�H��]�H`��]Ð�UH����@�����]Ð�����������H�l$�L�d$�H�- L�% L�l$�L�t$�L�|$�H�\$�H��8L)�A��I��H��I���s���H��[email protected]��L��D��A��H��H9�u�H�\H�l$L�d$L�l$ L�t$(L�|$0H��8��Ð�������������UH��SH�H� H���t�(`DH���H�H���u�H�Ð�H��o���H��Penguin!;,����H
There's text here, though! This was not a total failure. In particular it says "Penguin!" and "ELF". ELF is the name of the binary format. So that's something! Then there are a bunch of unprintable symbols, which isn't a huge surprise because this is a binary.
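That "ELF" at the very start is the file's magic number: an ELF file begins with the byte 0x7f followed by the letters E, L, F (the 0x7f is unprintable, which is why only the letters show up). Here is a minimal sketch of checking for it. The function name looks_like_elf is made up for this example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: true if buf starts with the 4-byte ELF magic,
 * which is 0x7f followed by 'E', 'L', 'F'. */
bool looks_like_elf(const unsigned char *buf, size_t len)
{
    static const unsigned char magic[4] = { 0x7f, 'E', 'L', 'F' };

    return len >= sizeof magic && memcmp(buf, magic, sizeof magic) == 0;
}
```

To try it on a real file, you could fread the first four bytes of hello into a buffer and pass that in.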
Step 2: use readelf to see the symbol table
Throughout we're going to use a tool called readelf to explore our binary. Let's start by running readelf --symbols on it. (Another popular tool to do this is nm.)
Here we see three symbols:
- main is the address of my main function
- puts looks like a reference to the printf function I called in it (which I guess the compiler changed to puts as an optimization?)
- _start is pretty important.
When the program starts running, you might think it starts at main. It doesn't! It actually goes to _start. This does a bunch of Very Important Things that I don't understand very well, including calling main. So I won't explain them.
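How does the system know where to begin? The ELF header at the front of the file has a field, e_entry, holding the address of the first instruction to run, and for a program linked the usual way with gcc that is normally the address of _start. Below is a rough sketch of pulling that field out of a buffer holding the start of a 64-bit ELF file. The helper name elf64_entry_point is invented here, and the code assumes the file's byte order matches the machine you run it on.

```c
#include <elf.h>      /* Elf64_Ehdr, ELFMAG, EI_CLASS (Linux header) */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: copies the entry point address out of a 64-bit ELF
 * header held in buf. Assumes the file's byte order matches the host's.
 * Returns false if buf is too short or is not a 64-bit ELF file.
 */
bool elf64_entry_point(const unsigned char *buf, size_t len, uint64_t *entry)
{
    Elf64_Ehdr hdr;

    if (len < sizeof hdr)
        return false;
    memcpy(&hdr, buf, sizeof hdr);   /* copy to avoid unaligned access */

    if (memcmp(hdr.e_ident, ELFMAG, SELFMAG) != 0)
        return false;                /* missing the ELF magic bytes */
    if (hdr.e_ident[EI_CLASS] != ELFCLASS64)
        return false;                /* not a 64-bit ELF file */

    *entry = hdr.e_entry;
    return true;
}
```

You can compare the result against the entry point address that readelf -h prints for the same file.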
So, what's a symbol?
When you write a program, you might write a function called hello. When you compile the program, the binary for that function is labelled with a symbol called hello. If I call a function (like printf) from a library, we need a way to look up the code for that function!
The process of looking up functions from libraries is called linking. It can happen either just after we compile the program ("static linking") or when we run the program ("dynamic linking").
So symbols are what allow linking to work! Let's find the symbol for printf! It'll be in libc, where all the C standard library functions live.
If I run nm on my copy of libc, it tells me "no symbols". But the internet tells me I can use objdump -tT instead! This works! objdump -tT /lib/x86_64-linux-gnu/libc-2.15.so gives me a long list of symbols. If you look at it, you'll see everything you might expect libc to have. From here we can start to imagine how dynamic linking works: we see that our binary refers to puts, and then we can look up the location of puts in libc's symbol table.
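To make "look it up in the symbol table" a bit more concrete, here is a toy version of that lookup. Each entry in an ELF symbol table (an Elf64_Sym on 64-bit Linux) stores its name as an offset into a separate string table, plus an address in st_value. The function name find_symbol is invented for this sketch, and a real dynamic linker uses hash tables rather than a linear scan, so treat this as an illustration of the idea only.

```c
#include <elf.h>      /* Elf64_Sym (Linux header) */
#include <stddef.h>
#include <string.h>

/*
 * Toy lookup (hypothetical helper): scan a symbol table for `name` and
 * return a pointer to the matching entry, or NULL if there is none.
 * st_name is an offset into strtab, where the symbol's name is stored.
 */
const Elf64_Sym *find_symbol(const Elf64_Sym *symtab, size_t count,
                             const char *strtab, const char *name)
{
    for (size_t i = 0; i < count; i++) {
        if (strcmp(strtab + symtab[i].st_name, name) == 0)
            return &symtab[i];
    }
    return NULL;
}
```

Once the entry for puts is found, its st_value field says where the function lives relative to where the library was loaded.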
Step 3: use objdump to see the binary, and learn about sections!
Opening our binary in a text editor was a bad way to open it. objdump is a better way. Here's an excerpt:
You can see that it shows us all the bytes in the file as hex on the left, and a translation into ASCII on the right.
There are a whole bunch of sections here (see this gist for the whole thing). This shows you all the bytes in your binary! Some sections we care about:
- .text is the program's actual code (the assembly). _start and main are both part of the .text section.
- .rodata is where some read-only data is stored (in this case, our string "Penguin!")
- .interp is the filename of the dynamic linker!
The major difference between sections and segments is that sections are used at link time (by ld) and segments are used at run time.
objdump shows us the contents of the sections, which is nice, but doesn't give us as much metadata about the sections as I'd like. Let's try readelf instead:
Neat! We can see .text is executable and read-only, .rodata ("read-only data") is read-only, and .data is read-write.
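Those permissions are stored as bits in each section header's sh_flags field, and readelf shows them as letters: W for writable, A for "takes up memory when the program runs", X for executable. A small sketch of that decoding follows. The function name section_flags_to_string is made up, and it only handles these three flags.

```c
#include <elf.h>      /* SHF_WRITE, SHF_ALLOC, SHF_EXECINSTR (Linux header) */
#include <stdint.h>

/*
 * Hypothetical helper: turn section flags into letters like readelf's
 * W (writable), A (allocated in memory), X (executable).
 * out must have room for 3 letters plus the terminating NUL.
 */
void section_flags_to_string(uint64_t sh_flags, char out[4])
{
    int n = 0;

    if (sh_flags & SHF_WRITE)
        out[n++] = 'W';
    if (sh_flags & SHF_ALLOC)
        out[n++] = 'A';
    if (sh_flags & SHF_EXECINSTR)
        out[n++] = 'X';
    out[n] = '\0';
}
```

With this, a section that is loaded and executable but not writable comes out as AX, a loaded read-only section as A, and a loaded writable one as WA.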
Step 4: Look at some assembly!
We mentioned briefly that .text contains assembly code. We can actually look at what it is really easily. If we were magicians, we would already be able to read and understand this:
It starts with 31ed4989. Those are bytes that our CPU interprets as code! And runs! However we are not magicians (I don't know what 31 ed means!) and so we will use a disassembler instead.
So we see that 31 ed is xoring two things. Neat! That's all the assembly we'll do for now.
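If you're curious which two things: on x86, 31 is the opcode for a 32-bit xor, and the byte after it (called ModRM) packs the two operands into bit fields. Here's a small sketch that unpacks ed. The names decode_modrm and reg32_name are made up for this example, and it only covers the register-to-register case.

```c
#include <stdint.h>

/* The three fields packed into an x86 ModRM byte. */
struct modrm {
    uint8_t mod;  /* top 2 bits: 3 means both operands are registers */
    uint8_t reg;  /* middle 3 bits: a register number */
    uint8_t rm;   /* low 3 bits: a register number (when mod == 3) */
};

/* Hypothetical helper: split a ModRM byte into its fields. */
struct modrm decode_modrm(uint8_t byte)
{
    struct modrm m;

    m.mod = (byte >> 6) & 0x3;
    m.reg = (byte >> 3) & 0x7;
    m.rm  = byte & 0x7;
    return m;
}

/* Hypothetical helper: 32-bit register names in x86 encoding order. */
const char *reg32_name(uint8_t n)
{
    static const char *names[8] = {
        "eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"
    };

    return names[n & 0x7];
}
```

For 0xed, mod is 3 and both register fields are 5, which is ebp. So 31 ed xors ebp with itself, which sets it to zero.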
Step 5: Segments!
Finally, a program is organized into segments, which are described by program headers. Let's look at the segments for our program using readelf:
Segments are used to determine how the different parts of the program get loaded into memory. The first LOAD segment is marked R E (read / execute) and the second is RW (read / write). .text is in the first segment (we want to read it but never write to it), and .data and .bss are in the second (we need to write to them, but not execute them).
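Those letters come from the p_flags field of each program header, which is a small bitmask. As a rough sketch, this is how the three-character column could be produced, with a space wherever a permission is missing. The function name segment_flags_to_string is invented for this example.

```c
#include <elf.h>      /* PF_R, PF_W, PF_X (Linux header) */
#include <stdint.h>

/*
 * Hypothetical helper: format segment permissions as three characters,
 * R, W, E in that order, with a space for each permission that is absent.
 * out must have room for 3 characters plus the terminating NUL.
 */
void segment_flags_to_string(uint32_t p_flags, char out[4])
{
    out[0] = (p_flags & PF_R) ? 'R' : ' ';
    out[1] = (p_flags & PF_W) ? 'W' : ' ';
    out[2] = (p_flags & PF_X) ? 'E' : ' ';
    out[3] = '\0';
}
```

A segment that is readable and executable comes out as "R E", and one that is readable and writable as "RW " with a trailing space.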
Executables aren't magic. ELF is a file format like any other! You can use readelf and objdump to inspect your Linux binaries. Try it out! Have fun.
- I found this introduction to ELF helpful for explaining sections and segments
- There's a wonderful graphic showing the structure of an ELF binary.
- For learning more about how linkers work, there's a wonderful 20 part series about linkers, which I wrote about here and here.
- I haven't talked much about assembly at all here! Read Dan Luu's Editing Binaries: Easier than it sounds