Notes on BPF & eBPF
Today was Papers We Love, my favorite meetup! Suchakra Sharma (@tuxology on twitter/github) gave a GREAT talk about the original BPF paper and recent work in Linux on eBPF. It really made me want to go write eBPF programs!
The paper is The BSD Packet Filter: A New Architecture for User-level Packet Capture
I wanted to write some notes on the talk here because I thought it was super super good.
To start, here are the slides and a PDF. The PDF is nice because the links at the end are clickable.
what’s BPF?
Before BPF, if you wanted to do packet filtering you had to copy all the packets into userspace and then filter them there (with “tap”).
this had 2 problems:
- if you filter in userspace, you have to copy all the packets into userspace first, and copying data is expensive
- the filtering algorithms people were using were inefficient
The solution to problem #1 seems sort of obvious: move the filtering logic into the kernel somehow. Okay. (though the details of how that’s done aren’t obvious, we’ll talk about that in a second)

But why were the filtering algorithms inefficient? Well!!
If you run `tcpdump host foo`, it actually runs a relatively complicated query, which you could represent with a tree of boolean checks (there’s a picture of the tree in the paper & slides).

Evaluating this tree is kind of expensive. So the first insight is that you can represent the same filter in a simpler way: as a graph where each node is a single check, and each answer tells you which check to do next.
Then if you have `ether.type = IP` and `ip.src = foo`, you automatically know that the packet matches `host foo`; you don’t need to check anything else. So this data structure (they call it a “control flow graph” or “CFG”) is a way better representation of the program you actually want to execute to check matches than the tree we started with.
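Here’s a toy sketch in Python of that difference (my own made-up example, not the paper’s exact filter). In the tree version the “is this IP?” check shows up in both branches, so it can get evaluated twice; in the CFG version every check happens at most once and you bail out as soon as you know the answer.

```python
# Toy example: matching "host foo" on an already-parsed packet.
FOO = "10.1.2.3"   # pretend "foo" resolves to this address

def matches_tree(pkt):
    # tree model: one big boolean expression; note that the
    # ether.type check appears in both branches of the "or"
    return (pkt["ether.type"] == "IP" and pkt["ip.src"] == FOO) or \
           (pkt["ether.type"] == "IP" and pkt["ip.dst"] == FOO)

def matches_cfg(pkt):
    # CFG model: each check's answer decides which check to do next,
    # and we return as soon as we know the answer
    if pkt["ether.type"] != "IP":
        return False                 # not IP: don't even look at the addresses
    if pkt["ip.src"] == FOO:
        return True                  # src matched: no need to check dst
    return pkt["ip.dst"] == FOO

print(matches_cfg({"ether.type": "IP", "ip.src": "10.9.9.9", "ip.dst": FOO}))
```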
How BPF works in the kernel
The main important thing here is that packets are just arrays of bytes. BPF programs run on these arrays of bytes. They’re not allowed to have loops, but they can have smart stuff to figure out the length of the IP header (IPv6 & IPv4 are different lengths!) and then find the TCP port based on that length:
```
x = ip_header_length
port = *(packet_start + x + port_offset)
```
(it looks different from that but it’s basically the same). There’s a nice description of the virtual machine in the paper/slides so I won’t explain it.
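As a rough sketch of what that pointer arithmetic is doing, here’s the same idea in ordinary Python, treating the packet as a byte array. This is my own simplification: it only handles IPv4, and real BPF does this in its own tiny instruction set, not Python.

```python
import struct

def tcp_dst_port(packet: bytes):
    ETHER_HEADER_LEN = 14
    ethertype = struct.unpack_from("!H", packet, 12)[0]
    if ethertype != 0x0800:                   # not IPv4: give up in this sketch
        return None
    ihl = packet[ETHER_HEADER_LEN] & 0x0F     # low 4 bits = IP header length in 32-bit words
    ip_header_len = ihl * 4                   # usually 20, but IP options make it longer
    if packet[ETHER_HEADER_LEN + 9] != 6:     # protocol field: 6 means TCP
        return None
    tcp_start = ETHER_HEADER_LEN + ip_header_len
    return struct.unpack_from("!H", packet, tcp_start + 2)[0]  # dst port = bytes 2-3 of TCP header
```

The interesting bit is that `ip_header_len` comes from the packet itself, which is exactly the kind of indirect load the BPF virtual machine supports without needing loops.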
When you run `tcpdump host foo`, this is what happens, as far as I understand:

- convert `host foo` into an efficient DAG of the rules
- convert that DAG into a BPF program (in BPF bytecode) for the BPF virtual machine
- send the BPF bytecode to the Linux kernel, which verifies it (there’s a sketch of doing this step by hand after this list)
- compile the BPF bytecode program into native code. For example here’s the JIT code for ARM and for x86
- when packets come in, Linux runs the native code to decide if that packet should be filtered or not. It’ll often run only 100-200 CPU instructions for each packet that needs to be processed, which is super fast!
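You don’t have to go through tcpdump to hand the kernel classic BPF bytecode: any program can attach a filter to a socket with the `SO_ATTACH_FILTER` socket option. Here’s a rough sketch (my own, Linux-only, needs root) that attaches a tiny hand-written 4-instruction filter meaning “only give me IPv4 packets” to a raw socket. You can see the bytecode tcpdump generates for any filter with `tcpdump -d` (human readable) or `tcpdump -dd` (as C structs in this same format).

```python
# Sketch: attach a hand-written classic BPF filter to a raw socket.
# Linux-only, needs root. The filter is 4 instructions: load the
# ethertype (2 bytes at offset 12), accept the packet if it's 0x0800
# (IPv4), otherwise drop it.
import ctypes
import socket
import struct

SO_ATTACH_FILTER = 26          # from <linux/filter.h>; not exposed by the socket module
ETH_P_ALL = 0x0003

# struct sock_filter { u16 code; u8 jt; u8 jf; u32 k; }
instructions = [
    (0x28, 0, 0, 0x0000000c),  # ldh [12]     : load the ethertype
    (0x15, 0, 1, 0x00000800),  # jeq #0x800   : IPv4? fall through, else skip 1
    (0x06, 0, 0, 0x0000ffff),  # ret #0xffff  : accept (keep up to 0xffff bytes)
    (0x06, 0, 0, 0x00000000),  # ret #0       : drop
]
filter_bytes = b"".join(struct.pack("HBBI", *insn) for insn in instructions)
buf = ctypes.create_string_buffer(filter_bytes)

# struct sock_fprog { u16 len; struct sock_filter *filter; }
prog = struct.pack("HL", len(instructions), ctypes.addressof(buf))

s = socket.socket(socket.AF_PACKET, socket.SOCK_RAW, socket.htons(ETH_P_ALL))
s.setsockopt(socket.SOL_SOCKET, SO_ATTACH_FILTER, prog)

packet = s.recv(65535)         # only IPv4 packets make it here now
print("got an IPv4 packet,", len(packet), "bytes")
```

After the `setsockopt`, the kernel runs that filter on every packet before the packet ever reaches the socket, which is the whole point.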
the present: eBPF
But BPF has been around for a long time! Now we live in the EXCITING FUTURE, which is eBPF. I’d heard about eBPF a bunch before, but I felt like this talk helped me put the pieces together a little better. (I wrote this XDP & eBPF post back in April when I was at netdev.)
some facts about eBPF:
- eBPF programs have their own bytecode language, and are compiled from that bytecode language into native code in the kernel, just like BPF programs
- eBPF programs run in the kernel
- eBPF programs can’t access arbitrary kernel memory. Instead the kernel provides functions to get at some restricted subset of things.
- they can communicate with userspace programs through BPF maps
- there’s a `bpf` syscall as of Linux 3.18 (there’s a sketch of calling it directly after this list)
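Just to prove to myself that the syscall is really there, here’s a tiny ctypes sketch (my own; the syscall number is the x86-64 one, and you’ll usually need root) that asks the kernel to create a BPF hash map directly. Normally you’d use a library like bcc instead of doing this by hand.

```python
import ctypes

libc = ctypes.CDLL(None, use_errno=True)

NR_bpf = 321              # bpf() syscall number on x86-64; other architectures differ
BPF_MAP_CREATE = 0        # command: create a map
BPF_MAP_TYPE_HASH = 1     # a hash map

# The start of union bpf_attr for BPF_MAP_CREATE: four 32-bit fields.
class BpfMapCreateAttr(ctypes.Structure):
    _fields_ = [
        ("map_type", ctypes.c_uint32),
        ("key_size", ctypes.c_uint32),
        ("value_size", ctypes.c_uint32),
        ("max_entries", ctypes.c_uint32),
    ]

attr = BpfMapCreateAttr(BPF_MAP_TYPE_HASH, 8, 8, 1024)
fd = libc.syscall(NR_bpf, BPF_MAP_CREATE, ctypes.byref(attr), ctypes.sizeof(attr))
print("map fd:", fd)      # usually needs root; -1 means it failed
```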
kprobes & eBPF
With kprobes, you can pick a function (any function!) in the Linux kernel and execute a program that you write every time that function is called. This seems really amazing and magical.
For example! There’s this BPF program called disksnoop which tracks when you start/finish writing a block to disk. Here’s a snippet from the code:
```
BPF_HASH(start, struct request *);

void trace_start(struct pt_regs *ctx, struct request *req) {
    // stash start timestamp by request ptr
    u64 ts = bpf_ktime_get_ns();
    start.update(&req, &ts);
}
...
b.attach_kprobe(event="blk_start_request", fn_name="trace_start")
b.attach_kprobe(event="blk_mq_start_request", fn_name="trace_start")
```
This basically declares a BPF hash (which the program uses to keep track of when the request starts / finishes), a function called `trace_start` which is going to be compiled into BPF bytecode, and attaches `trace_start` to the `blk_start_request` kernel function.
This is all using the bcc framework, which lets you write Python-ish programs that generate BPF code. You can find it (it has tons of example programs) at https://github.com/iovisor/bcc
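The snippet above leaves out the userspace side, so just to show how the pieces fit together, here’s a tiny self-contained bcc program I sketched (it’s not disksnoop, and the function it probes is my choice): it counts `openat` syscalls in the kernel and then reads the count out of the BPF map from Python. It needs root and bcc installed.

```python
from bcc import BPF
import time

program = """
BPF_HASH(counts, u32, u64);        // a map shared with userspace

int trace_openat(struct pt_regs *ctx) {
    u32 key = 0;
    counts.increment(key);         // bump the counter at key 0
    return 0;
}
"""

b = BPF(text=program)
# get_syscall_fnname maps "openat" to whatever the kernel calls it (e.g. __x64_sys_openat)
b.attach_kprobe(event=b.get_syscall_fnname("openat"), fn_name="trace_openat")

time.sleep(2)
for k, v in b["counts"].items():   # read the BPF map from userspace
    print("openat calls seen in 2 seconds:", v.value)
```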
uprobes & eBPF
So I sort of knew you could attach eBPF programs to kernel functions, but I didn’t realize you could attach eBPF programs to userspace functions! That’s really exciting. Here’s an example of counting malloc calls in Python using an eBPF program.
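Here’s a rough sketch of what that looks like with bcc (my own example, not the one linked above): it attaches an eBPF program to libc’s `malloc` with a uprobe and counts calls per process, system-wide. Like the kprobe example, it needs root and bcc installed.

```python
from bcc import BPF
import time

program = """
BPF_HASH(calls, u32, u64);              // key: pid, value: number of malloc calls

int count_malloc(struct pt_regs *ctx) {
    u32 pid = bpf_get_current_pid_tgid() >> 32;   // upper 32 bits = tgid (the pid you see in ps)
    calls.increment(pid);
    return 0;
}
"""

b = BPF(text=program)
# name="c" means "the C library"; sym is the userspace symbol to probe
b.attach_uprobe(name="c", sym="malloc", fn_name="count_malloc")

time.sleep(2)
for pid, count in b["calls"].items():
    print("pid", pid.value, "called malloc", count.value, "times")
```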
things you can attach eBPF programs to
- network cards, with XDP (which I wrote about a while back)
- tc egress/ingress (in the network stack)
- kprobes (any kernel function)
- uprobes (any userspace function, apparently?? like any function in a C program with symbols)
- probes that were built for dtrace called “USDT probes” (like these mysql probes). Here’s an example program using dtrace probes
- the JVM
- tracepoints (not sure what those are yet)
- seccomp / landlock security things
- a bunch more things
this talk was super cool
There are a bunch of great links in the slides and in LINKS.md in the iovisor repository. It is late now but soon I want to actually write my first eBPF program!