I can spy on my CPU cycles with perf!
Yesterday I talked about using perf
to profile assembly
instructions. Today I learned how to make flame graphs with perf
today and it is THE BEST. I found this because
Graydon Hoare pointed me to Brendan
Gregg’s excellent
page on how to use perf.
Wait up! What’s perf
? I’ve talked about strace
a lot before (in
Debug your programs like they’re closed source).
strace
lets you see which system calls a program is calling. But
what if you wanted to know
- how many CPU instructions it ran?
- How many L1 cache misses there were?
- profiling information for each assembly instruction?
strace
only does system calls, and none of those things are system
calls. So it can’t tell you any of those things!
perf
is a Linux tool that can tell you all of these things, and
more! Let’s run a quick example on the
bytesum program from yesterday.
bork@kiwi ~/w/howcomputer> perf stat ./bytesum_mmap *.mp4 Performance counter stats for './bytesum_mmap The Newsroom S01E04.mp4': 158.141639 task-clock # 0.994 CPUs utilized 22 context-switches # 0.139 K/sec 9 CPU-migrations # 0.057 K/sec 133 page-faults # 0.841 K/sec 438,662,273 cycles # 2.774 GHz [82.43%] 269,916,782 stalled-cycles-frontend # 61.53% frontend cycles idle [82.38%] 131,557,379 stalled-cycles-backend # 29.99% backend cycles idle [66.66%] 681,518,403 instructions # 1.55 insns per cycle # 0.40 stalled cycles per insn [84.88%] 130,568,804 branches # 825.645 M/sec [84.85%] 20,756 branch-misses # 0.02% of all branches [83.68%] 0.159154389 seconds time elapsed
This is super neat information, and there’s a lot more (see perf list
). But we can do even more fun things!
Flame graphs with perf
I wanted to profile my bytesum
program. But how do you even profile
C programs? Here’s a way to do it with perf
:
sudo perf record -g ./bytesum_mmap *.mp4 sudo perf script | stackcollapse-perf.pl | flamegraph.pl > flamegraph.svg
Here’s the SVG this gave me:
This is AMAZING. But what does it mean? Basically perf
periodically
interrupts the program and finds out where in the stack it is. The
width of each part of the stack in the graph above is the proportion
of samples that happened there. (so about 30% of the execution time
was spend in main
). I don’t know what the colour means here.
We can see that there are 3 big parts – there’s the mmap
call (on
the left), the main program execution (in the middle), and the
sys_exit
part on the right. Apparently stopping my program takes a
long time! Neat!
But there’s more!
Is it really L1 cache misses? We can find out!
So yesterday I made a program with really bad memory access patterns (bytesum_stride.c), and I conjectured that it was way slower because it was causing way too many L1 cache misses.
But with perf
, we can check if that’s actually true! Here are the
results (reformatted a bit to be more compact):
bork@kiwi ~/w/howcomputer> perf stat -e L1-dcache-misses,L1-dcache-loads ./bytesum_mmap *.mp4 17,175,214 L1-dcache-misses # 11.48% of all L1-dcache hits 149,568,438 L1-dcache-loads bork@kiwi ~/w/howcomputer> perf stat -e L1-dcache-misses,L1-dcache-loads ./bytesum_stride *.mp4 1000 1,031,902,483 L1-dcache-misses # 193.16% of all L1-dcache hits 534,219,219 L1-dcache-loads
So, uh, that’s really bad. We now have 60 times more L1 cache misses, and also 3 times more hits.
Other amazing things
- Go to Brendan Gregg’s perf page and read the whole thing. Also possibly everything he’s ever written. His recent post on strace is great too.
- The perf tutorial is pretty long, but I found it somewhat helpful.
- FlameGraph!
- I spent a little bit of time running cachegrind with
valgrind --tool=cachegrind ./bytesum_mmap my_file
which can give you possibly even more information about CPU caches thanperf
can. Still haven’t totally wrapped my head around this.
There are still so many things I don’t understand at all!