I can spy on my CPU cycles with perf!
Yesterday I talked about using perf to profile assembly
instructions. Today I learned how to make flame graphs with perf
today and it is THE BEST. I found this because
Graydon Hoare pointed me to Brendan
Gregg’s excellent
page on how to use perf.
Wait up! What’s perf? I’ve talked about strace a lot before (in
Debug your programs like they’re closed source).
strace lets you see which system calls a program is calling. But
what if you wanted to know
- how many CPU instructions it ran?
- How many L1 cache misses there were?
- profiling information for each assembly instruction?
strace only does system calls, and none of those things are system
calls. So it can’t tell you any of those things!
perf is a Linux tool that can tell you all of these things, and
more! Let’s run a quick example on the
bytesum program from yesterday.
bork@kiwi ~/w/howcomputer> perf stat ./bytesum_mmap *.mp4
Performance counter stats for './bytesum_mmap The Newsroom S01E04.mp4':
158.141639 task-clock # 0.994 CPUs utilized
22 context-switches # 0.139 K/sec
9 CPU-migrations # 0.057 K/sec
133 page-faults # 0.841 K/sec
438,662,273 cycles # 2.774 GHz [82.43%]
269,916,782 stalled-cycles-frontend # 61.53% frontend cycles idle [82.38%]
131,557,379 stalled-cycles-backend # 29.99% backend cycles idle [66.66%]
681,518,403 instructions # 1.55 insns per cycle
# 0.40 stalled cycles per insn [84.88%]
130,568,804 branches # 825.645 M/sec [84.85%]
20,756 branch-misses # 0.02% of all branches [83.68%]
0.159154389 seconds time elapsed
This is super neat information, and there’s a lot more (see perf list). But we can do even more fun things!
Flame graphs with perf
I wanted to profile my bytesum program. But how do you even profile
C programs? Here’s a way to do it with perf:
sudo perf record -g ./bytesum_mmap *.mp4 sudo perf script | stackcollapse-perf.pl | flamegraph.pl > flamegraph.svg
Here’s the SVG this gave me:
This is AMAZING. But what does it mean? Basically perf periodically
interrupts the program and finds out where in the stack it is. The
width of each part of the stack in the graph above is the proportion
of samples that happened there. (so about 30% of the execution time
was spend in main). I don’t know what the colour means here.
We can see that there are 3 big parts – there’s the mmap call (on
the left), the main program execution (in the middle), and the
sys_exit part on the right. Apparently stopping my program takes a
long time! Neat!
But there’s more!
Is it really L1 cache misses? We can find out!
So yesterday I made a program with really bad memory access patterns (bytesum_stride.c), and I conjectured that it was way slower because it was causing way too many L1 cache misses.
But with perf, we can check if that’s actually true! Here are the
results (reformatted a bit to be more compact):
bork@kiwi ~/w/howcomputer> perf stat -e L1-dcache-misses,L1-dcache-loads ./bytesum_mmap *.mp4
17,175,214 L1-dcache-misses # 11.48% of all L1-dcache hits
149,568,438 L1-dcache-loads
bork@kiwi ~/w/howcomputer> perf stat -e L1-dcache-misses,L1-dcache-loads ./bytesum_stride *.mp4 1000
1,031,902,483 L1-dcache-misses # 193.16% of all L1-dcache hits
534,219,219 L1-dcache-loads
So, uh, that’s really bad. We now have 60 times more L1 cache misses, and also 3 times more hits.
Other amazing things
- Go to Brendan Gregg’s perf page and read the whole thing. Also possibly everything he’s ever written. His recent post on strace is great too.
- The perf tutorial is pretty long, but I found it somewhat helpful.
- FlameGraph!
- I spent a little bit of time running cachegrind with
valgrind --tool=cachegrind ./bytesum_mmap my_filewhich can give you possibly even more information about CPU caches thanperfcan. Still haven’t totally wrapped my head around this.
There are still so many things I don’t understand at all!