perf is a profiling tool for Linux that I've written about a few times on this blog before. I was interviewed on a podcast recently where the host asked me "so, julia, tell me how perf works!" and I gave a sort of unsatisfying answer: "you know, sampling?"
So it turns out I don't really know how perf works. And I like knowing how stuff works. Last week I read some of the man page for perf_event_open, the system call that perf uses. It's 10,000 words but pretty helpful! I'm still quite confused about perf, so I'm going to tell you, fair reader, what I know, and then maybe you can help me out with my questions.
There is not a lot of documentation for perf. The best resource I know is on Brendan Gregg's site, but it does not answer all the questions I have! To answer some of these questions, we're going to read the Linux kernel source code. Because it's Saturday night.
So, let's imagine you want to know exactly how many CPU instructions happen when you run ls. It turns out that your CPU stores information about this kind of thing! And perf can tell you. Here's what the answer looks like, from
But how does that work? Well, the Wikipedia page on hardware performance counters mentions:
"One of the first processors to implement such counter and an associated instruction RDPMC to access it was the Intel Pentium, but they were not documented until Terje Mathisen wrote an article about reverse engineering them in Byte July 1994."
This is AWESOME. Maybe this is how Linux reads hardware counters and gives them back to us in perf stat!! Further grepping for uses of native_read_pmc reveals that we read hardware counters via rdpmcl in x86/kernel/cpu/perf_event.c.
This code is a little impenetrable to me, but here's a hypothesis for how this could work. Let's say we're running ls. This code might get scheduled on and off the CPU a few times.
So! Here's what I think this looks like.
One important outcome of this, if I understand correctly, is that hardware counters are exact counters, and they're low enough overhead that the kernel can just always run rdpmc every time it's done running a piece of code. There's no sampling or approximation involved.
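To make that hypothesis concrete, here's a toy model in Python. It is not kernel code: the FakeCpu and Task classes and all their method names are made up for illustration. It assumes the kernel reads the hardware counter when a task is scheduled onto the CPU and again when it's scheduled off, and adds the difference to a per-task total.

```python
class FakeCpu:
    """Hypothetical stand-in for a CPU with one free-running instruction counter."""

    def __init__(self):
        self.instructions_retired = 0

    def run(self, n_instructions):
        # Any code that runs on the CPU bumps the counter, no matter whose code it is.
        self.instructions_retired += n_instructions

    def read_counter(self):
        # Plays the role of rdpmc: just read the current counter value.
        return self.instructions_retired


class Task:
    """Hypothetical per-process bookkeeping (an assumption, not the kernel's struct)."""

    def __init__(self, name):
        self.name = name
        self.total = 0
        self._start = None

    def sched_in(self, cpu):
        # Remember where the counter was when we got the CPU.
        self._start = cpu.read_counter()

    def sched_out(self, cpu):
        # Charge this task only for what happened while it was running.
        self.total += cpu.read_counter() - self._start
        self._start = None


cpu = FakeCpu()
ls = Task("ls")
other = Task("other")

# ls runs, gets descheduled, something else runs, then ls runs again.
for task, n in [(ls, 1000), (other, 5000), (ls, 250)]:
    task.sched_in(cpu)
    cpu.run(n)
    task.sched_out(cpu)

print(ls.name, ls.total)
```

In this model the count for ls is exact (1000 + 250) even though another task ran in between, because the only work is two counter reads per time slice.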
Sampling software events
The core of perf events looks like it's in kernel/events/core.c. This file includes the definition of the perf_event_open system call, on line 8107. Files with 10,000 lines of C code are not my specialty, but I'm going to try to make something of this.
My goal: understand how perf does sampling of CPU events. For the sake of argument, let's pretend we only wanted to save the state of the CPU's registers every time we sample.
We know from the perf_event_open man page that perf writes out events to userspace ("hi! I am in julia's awesome function right now!"). It writes events to a mmap'd ring buffer. Which is some data structure in memory. Okay.
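If "ring buffer" is fuzzy, here's a minimal sketch of the general idea in Python. This is a toy, not perf's actual memory layout: the RingBuffer class, its field names, and the choice to drop records when the buffer is full are all assumptions made for illustration. The point is that one side writes at the head, the other side reads from the tail, and the storage wraps around.

```python
class RingBuffer:
    """Toy fixed-size ring buffer: one writer ("kernel"), one reader ("userspace")."""

    def __init__(self, capacity):
        self.slots = [None] * capacity
        self.head = 0  # total records written so far by the producer
        self.tail = 0  # total records consumed so far by the reader
        self.lost = 0  # records dropped because the reader fell behind

    def write(self, record):
        if self.head - self.tail == len(self.slots):
            # Buffer is full: in this sketch we drop the record rather than
            # overwrite data the reader hasn't seen yet.
            self.lost += 1
            return False
        self.slots[self.head % len(self.slots)] = record
        self.head += 1
        return True

    def read_all(self):
        out = []
        while self.tail < self.head:
            out.append(self.slots[self.tail % len(self.slots)])
            self.tail += 1
        return out


buf = RingBuffer(capacity=4)

# Five made-up samples go into a buffer with room for four.
for ip in [0x400100, 0x400104, 0x400108, 0x40010C, 0x400110]:
    buf.write({"ip": ip})

first = buf.read_all()        # reader catches up, freeing all four slots
buf.write({"ip": 0x400114})   # this write wraps around to slot 0
second = buf.read_all()

print(len(first), buf.lost, [hex(r["ip"]) for r in second])
```

Because the reader only ever moves the tail and the writer only ever moves the head, the two sides can share the memory without copying every record through a system call.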
Further inspection of this 10,000 line core.c file reveals that the code outputs data to userspace in the
So, let's find the code that copies all the x86 registers into userspace! It turns out it's not too hard to find -- it's in this file called perf_regs.c. There are like 15 registers to copy! Neat.
In this case it makes sense that we sample -- we definitely couldn't save all the registers every instruction. That would be way too much work!
So now I can see a little tiny bit of the code that perf uses to do sampling. This isn't terribly enlightening, but it does make me feel better.
- when does perf do its sampling? is it when the process gets scheduled onto the CPU? how is the sampling triggered? I am completely confused about this.
- what is the relationship between perf and kprobes? if I just want to sample the registers / address of the instruction pointer from ls's execution, does that have anything to do with kprobes? with ftrace? I think it doesn't, and that I only need kprobes if I want to instrument a kernel function (like a system call), but I'm not sure.
- are kprobes and ftrace the same kernel system? I feel like they are but I am confused.
reading kernel code: not totally impossible
I probably skimmed like 4000 lines of Linux kernel code (the perf parts!) to write this post, in 3 hours. There are definitely at least 20,000 lines of code related to perf. Maybe 100,000? I do not have the Linux source on my computer -- I used livegrep and github to look at it.
I only understood probably 10% of what I looked at, but I still learned some things about how perf works internally! This is neat.