A workshop on strace & tcpdump
This week at work, I ran a workshop on tcpdump and strace. a couple of people on Twitter asked about it so here are some notes. This is mostly just so I can reuse them more easily next time, but maybe you will also find it interesting. The notes are a bit sparse and it’s very unclear that anybody other than me will find them legible or useful, but I decided to put them here anyway.
I basically did a bunch of live demos of how to use tcpdump & strace, and then took questions & comments as people had them. I ran it in an hour, which I think was fine for people who already had some familiarity with the tools, but really aggressive if you’re learning from scratch. Will do that differently next time.
tcpdump
Why use tcpdump? We do a lot of web development; almost all of our services talk to each other with network requests. So a tool that can spy on network requests can be a really good general-purpose debugging tool. I’ve used this quite a bit at work and it’s been great.
I didn’t really explain what TCP was, which seemed okay.
Ask everyone to install Wireshark at the beginning of the workshop. Wireshark is really easy to install on OS X now! Nobody had trouble with this.
step 1: laptop demo
- start
python -mSimpleHTTPServer 8080
- run
curl localhost:8080/404
- now that we have some simple network traffic on my laptop, we can look at it with tcpdump!
- run
sudo tcpdump
by itself. That’s a lot of output, and it’s a little difficult to read. Don’t worry! - talk about network interfaces, here we actually need to run
tcpdump -i lo0
ortcpdump -i any
to see the local traffic - talk a little bit about what the output here means, but note that it’s kind of difficult to understand
- run
tcpdump -A port 8080 -i any
.port 8080
is the same assrc port 8080 or dst port 8080
; a really great shortcut. the pcap filter syntax can be a little difficult to remember at first, all I’ve need to really know so far is how to filter by port and IP.-A
shows us the contents of the packets. We can see the GET request, the content type, and the response in tcpdump’s output! This is really cool! - point out that Python’s SimpleHTTPServer is writing out each header line in a separate packet, and that this makes no sense.
step 2: look at some QA network traffic
- ssh to a dev machine
- run tcpdump on some relevant network traffic, talk about what that network traffic actually means and how you can tell (by looking at the port, knowing what services run on which ports, looking at the hostname the packets are going to)
step 3: tcpdumping for performance
this section is about how to debug performance problems using tcpdump!
- find a service in production that makes HTTP requests
- talk about packet captures in production (it’s generally safe to do, just be careful to not accidentally fill up your disk if you’re trying to do packet capture on video streaming or something, and be aware that there may be customer data in there)
- “now we’re going to get timing information for those HTTP requests!”
- run
tcpdump port $correct_port -w output.pcap
- press ctrl+c when you feel like you have enough data
- “pcap files are like chocolate cookies – every packet analysis tool understands how to read them. So you can write the pcap file on the server and then copy it over for offline analysis”
- copy the file over to my laptop, get everyone else to copy over the file as well
- WIRESHARK TIME!!! =D =D =D
- “as you’ve seen, tcpdump output is a little difficult to interpret and search. some people are really good at that but IMO it’s easier to stick with a basic knowledge of tcpdump and do more advanced stuff in Wireshark”
- open up the pcap file in Wireshark.
Wireshark features to demo:
- searching by HTTP status (try THAT with tcpdump!)
- right clicking on a packet to get every other packet in that TCP conversation
- click statistics -> conversations in the menu and you can sort all TCP sessions by duration to find performance problems
- you can colour packets by type (TCP/UDP/whatever) to more easily visually see what’s going on
- show where the packet timings show up, and talk through how you can use this to diagnose whether the client or the server is the problem (if the client sends a packet and then the server takes 5 seconds to reply to it, that’s AWESOME EVIDENCE)
- ask other people in the room with experience for their favorite Wireshark features and tactics to get information about it. people had really great suggestions.
that’s all for tcpdump!
strace
step 1: system calls & how to read strace output
- talk about what a system call is “the API for your operating system”
- run strace on a small program like
ls
- talk through what the parts of the output mean (“this is the system call, this is a file descriptor, these are the arguments”)
- explain what happens in
strace ls
(first you haveexecve(ls)
, then the dynamic linker happens, the first part of the strace output is always the same, then you see stuff that’s more specific tols
, and at the end you haveexit
)
step 2: getting configuration files
- Find a Java program which has some configuration
- “we have no idea how it’s configured! How will we ever find out?”
- STRACE OBVIOUSLY
- run
strace -e open -o strace_output.txt the_java_program
(I used a Hadoop program).-e
means “this system call” and-o
writes the output to a file - it turns out that this actually doesn’t work if the Java program starts child processes – you usually want to run
strace -f
. - Run
strace -f
instead - grep the output file for
.xml
because practically every java program is configured with xml - we find our configuration file! we are winners
- mention looking at calls to
write
step 3: attaching to a running program, and strace a CPU-only program
run
while True:
pass
and then attach to it with sudo strace -p
. When you’re attaching, you always have to run as root. This program doesn’t have any system calls!
If you want another cool demo of stracing a running program, find
is a good example, run find /
and then attach to it with strace – it’ll show you which files find
is looking at right now!
step 4: stracing to understand a performance problem
- (secretly, beforehand) make a tiny flask server that responds to
GET /hello
with ‘hi!’ after sleeping for 2 seconds - run a small bash script that just runs curl in a loop
- the script is slow! But why is it slow?
- run
strace
on the script, and see how you can see it pause really clearly on thewait
. Talk about other system calls you’ll often see strace pause (select) - now is a really good time to mention that STRACING PRODUCTION PROGRAMS WILL SLOW THEM DOWN AND YOU NEED TO BE VERY CAREFUL. Sometimes you do it anyway if the process is a mess
- talk about what to do if you don’t know what a system call means (what’s
wait4
? You can runman 2 wait4
on any system to get the man page for that system call)
done!
- Break for questions.
- mention that I wrote a zine about strace which is an ok basic reference
yay!
I thought this went pretty well, especially given that I prepared it only 2 hours in advance. I think I’ll do it again sometime! I want to get better at doing workshops and talks at least 2-3 times because preparing good material is so hard, and I always learn so much about the talk/workshop the first time I give it.
If you want to adapt this workshop for your cool friends who you want know about strace or tcpdump, you could! Most likely this is way too sketchy for anybody else to use but me.