Julia Evans

Why I love log files

in performance

Dan Luu wrote a fantastic blog post recently on the limitations of sampling profilers & tracing tools from the future, which you should totally read. I’m pretty into having conversations with people via blogging, so this is my much more pedestrian and less futuristic response to that post.

tracing vs profiling

One of the biggest takeaways from Dan’s post for me is:

Sampling profilers, the most common performance debugging tool, are notoriously bad at debugging problems caused by tail latency because they aggregate events into averages. But tail latency is, by definition, not average.

I learned this recently at work! I had a Thing that was slow. And I was staring at a dashboard with all kinds of graphs that looked like this:

This was maybe a graph of 95th percentile latency or something. Whatever. It wasn’t helping. None of the spikes in the graphs had anything to do with my slow thing. Then I asked Nelson for advice and he was basically like “dude, look at the logs!”.

The basic rationale for this is exactly what Dan says – if you have 1000 events that were unusually slow, they probably failed because of a relatively uncommon event. And uncommon events do not always show up in graphs. But know where they show up? LOG FILES. WHICH ARE THE BEST. As long as you’ve put in enough print statements.

log files: total instant success

So, back to debugging my Thing. I stopped trying to look at graphs, took Nelson’s advice and looked at the log files for the 10 slowest requests.

The logs were all like

00:00:01.000 do thing 1
00:00:01.012 do thing 2
00:00:01.032 do thing 3
00:00:01.045 do thing 4
00:00:01.056 do thing 5
00:00:02.158 do thing 6
00:00:02.160 do thing 7

In this log file, obviously thing 6 is the problem! It took like a second before getting to thing 7! What the hell, thing 6.

I’ve gotten a little more into debugging performance problems, and every time I do it, I find that

  • graphs are a great way to get an overall general picture of the system (oh, the error rate suddenly and permanently increased by 2x on March 13 at 3pm?)
  • graphs are also great for seeing correlations (every time we deploy code our CPU usage spikes and we have timeouts?) (but remember: CORRELATION IS NOT CAUSATION GUYS. NEVER.)
  • logs are amazing for digging into short spikes of timeouts or slow requests. Often I’ll look at the logs for a few requests, they’ll all exhibit the same behavior, and I’ll have a much clearer understanding of what happened there.

That’s how I learned that logging a bunch of events (“do thing 1”) at a few well-chosen points in your code can be pretty useful, if log files are the only way you have to trace requests.

a very small digression on Splunk

Splunk is a tool that ingests your logs. It costs money and they are definitely not paying me but it is AMAZING WIZARD MAGIC.

Today I was talking about performance with a coworker and then we used Splunk to parse the nginx server logs, extract all the request times, and draw the median and 95th percentiles every hour. That’s like totally trivial with Splunk.

magical tracing tools are like log files!

I think it’s cool that these performance tools from the future that Dan describes are basically like.. magical wizard log files. And they’re great for the same reason that log files are great. Because they are both tracing tools and tracing is important!

Zipkin is a tool that serves a similar purpose, somewhere in between a bunch of print statements and the high performance instrumentation tools Dan is describing. It’s based on Dapper, a distributed tracing framework at Google. I don’t know what other tracing tools exist, though! Let me know?

Have high expectations for your computers

I gave a talk at CUSEC (a conference for software engineering students) in January where I talked about performance. One of the points I wanted to make in that talk was – computers are fast, and you should have high expectations for how fast they can be.

When I was in undergrad I didn’t know anything about how fast a computer or a disk was. I was awesome at algorithms but I probably would have believed you if you told me it was impossible to do more than a million square roots in a second.

Now I know a little more, and I have higher expectations, and it’s been really useful to me! For instance: if you have a machine that’s crawling and unable to do more than 10 requests per second? You should probably expect more. Let me try to convince you of that, and we’ll start high.

a million QPS

What if you wanted to handle a million requests per second (aka QPS) on your computer?

First, that’s totally possible. I was originally alerted to this possibility by the amazing Kelly Sommers and her Haywire web server. As I understand it, it was originally a toy project for her to experiment with writing high-performance C code. According to its README, Haywire can process 9 million HTTP requests per second.

When I first read that I thought it was a typo. Surely nothing can do a million anythings per second? It wasn’t a typo.

But Haywire is a toy project, right? If we’re talking about doing a million requests per second, we should talk about a real project that people use in production. Great! Let’s talk about nginx.

There’s a great blog post by Datacratic, a little ad tech company here in Montreal, called 1M QPS with nginx and Ubuntu 12.04 on EC2. They say it was pretty easy to reach 500K QPS, and then they did some extra tuning to get to a million. Welp.

Java: 50,000 QPS

Okay, let’s add an extra couple of layers of difficulty. First, let’s use a Java server, which we’d expect to be maybe a little slower, and secondly, let’s have our service actually do something nontrivial. For this I visited Netty’s testimonials page

That page says:

During our lab testing, the library has displayed excellent performance characteristics, allowing us to service 50,000+ messages per second from approximately 30,000 connected clients on a commodity Intel server costing approximately $850

50,000 QPS on a single server is still not shabby. I wish I had that! I imagine you could probably serve hundreds of thousands of requests per second in Java if you did practically nothing and were really careful about your memory usage. (if nginx can do it, why not netty? is there a reason?) But I’m making that up – I’m not a Java performance wizard (yet).

Python, 1,000 QPS

Okay, Python is slow, right? Way slower than Java. I’m going to get into way sketchier territory here – we’ve resorted to a tumblr post benchmarking a toy Python server against Go

I tested this one on my laptop. You can really do 1,000 HTTP requests per second in a toy Python HTTP server! Neat.

Reasons your webserver might be slow

Okay! We’re now in charge of performance at an Important Web Startup. Let’s say you want to serve 500 requests per second on a single server. I’m talking about throughput here – I’m supposing it’s okay if it takes a long time to serve each request, you just need to finish 500 every second.

First: your CPU can slow you down

Let’s say you have 4 cores, and you want to serve 500 requests per second (throughput). Then your CPU budget per request is 8 milliseconds. That’s just CPU budget – if you’re waiting for a database, that doesn’t count.

I’m finding CPU budgets real useful to think about right now because I just found out that my service that I thought was slowed down by its network/database latency was ACTUALLY BEING SLOWED DOWN BY UNNECESSARY CPU ACTIVITY. I was mad.

So, 8 milliseconds. We already know that a millisecond is a really long time in Java. It’s a million CPU instructions! But honestly – even if you’re in Python, 8 milliseconds is kind of a long time! If you have a webserver where the requests take more than 8ms of CPU time, it might be worth understanding why! It was super worth it for me to understand this.

Second, you can have worker starvation

I hate worker starvation. It is the worst. Worker starvation is when you have, say, 100 workers that can work on requests, and they’re all busy. Then when the 101st request that second comes in, it has to wait! Even if you have extra CPU to spare! This SUCKS.

I’m not going to pretend I know the answer to how to fix this for everyone. It sucks though. Maybe you can add more workers! Maybe you can do more work asynchronously! Maybe you can use threading! This stuff isn’t easy but sometimes you can make progress.

Third, you can saturate your network connection

If you have a web service, that means you need to receive network requests and send responses! If you run out of network bandwidth (perhaps because you’re doing huge database queries?), you can’t do any more! You either need a new machine, or more bandwidth, or figure out how to make your damn requests smaller. Sorry.

Amazon EC2 has relatively inexpensive machines that do 1Gbps of network activity, according to this random blog post. I don’t have a lot to say about that, but that seems like a lot of network to me.

Fourth, your database might not be able to handle the load

You can’t fix this by adding more machines! Here you need to put your database on a bigger computer or replicate it or run less expensive queries or something.

Fifth, you only have so much disk I/O

If you can read from your disk at 200MB/s – that’s it. You can’t read more. This is like the network saturation problem. No good!

There are a lot more things that can go wrong

I stopped at 5, but there’s a lot of reasons your server can be slow! I’ve probably missed a bajillion things. There are a lot of resources that you can be stuck waiting for (like other external services!). You could have a problem with locks! You can’t know until you investigate.

understand why your webserver can’t do a million QPS

I’m kind of joking here, but kind of not! Suppose your server gets super sad when you throw more than 10 requests per second at it. Why can’t it do 10 times that?

some possible responses:

  • it’s because it’s in Python! (are you really doing more than 100ms of CPU per request? that’s a lot! why is that happening?)
  • it’s because of the database! (cool! databases are really fast though! can you get a list of the most expensive queries and understand why those are happening?)
  • I need to read a lot from disk and I have no more disk bandwidth. (do you know for sure why you’re reading so much?)
  • it’s because of my network connection! (maybe just run dstat to make sure your network activity is what you think it is? You can break it down by process!)

I’ve found that it’s really easy to be wrong about why something is slow. Taking an hour or a day to look closely and make sure my assumptions about why my server is slow has been an awesome use of time. Spending a day setting up profiling feels terrible but then is AWESOME when you make everything faster and everybody loves you.

believing in my computers is really useful to me

I have this mission to develop an absolute sense of how fast computers are. Instead of “Go is faster than Python”, I want to be able to know things like “I should be able to easily do 100 QPS in Python”.

Sometimes I tell someone about a programming problem I solved in a millisecond, and they say “I can solve it in a microsecond!”. As long as they’re not a jerk, I actually find this totally delightful – if computers can do a thing 1000 times faster than I thought, that’s amazing! I can add it to my list of magical computer facts.

Raising my expectations about how fast computers can and should be has been super useful to me. Now when I see that it takes a hour to process a million records, I get mad. If an HTTP request takes 100ms of CPU time to process, I get mad. I try to restrict my anger to cases where it actually matters – if the processing job is in my way, or if web performance is actually a problem.

But when performance becomes a problem, I find my slowly-building intuitions about performance are incredibly useful. I can feel way more confident spending time profiling and optimizing the problem if I’m sure that the program can be faster. And so far it’s worked! I’ve been able to make a couple of things much much faster.

Maybe there will be more interesting optimizations in my future! Maybe having high expectations about computers will be useful to you too! A toast to fast programs :)

TIL: clock skew exists

I learned some new things yesterday about distributed systems yesterday! Redis is a key-value store that can be distributed, and apparently it has a proposal for a locking system called Redlock.

Yesterday I read the articles where Martin Kleppman criticizes Redlock and Redlock’s author, antirez, responds.

These are very interesting to read as a distributed systems n00b – the authors of the articles are making significantly different arguments, and so it’s a useful exercise to try to reason through the arguments and figure out who you think is right.

Martin Kleppman, as far as I understand, argues that Redlock isn’t safe to use because

  • processes can pause for arbitrary amount of time (because of garbage collection or CPU scheduling)
  • network requests can be arbitrarily delayed at any time
  • The clock on a computer can be unreliable (because it goes at the wrong speed, or has jumped back in time)

Probably other reasons too, but those are the ones I understood.

Here’s a chart from his post illustrating how things can go wrong:

process pauses & network delays are totally real. but, clock skew?

Part of antirez’s response is “well, you can assume your clock is mostly correct though!”.

What they need to do is just, for example, to be able to count 5 seconds with a maximum of 10% error. So one counts actual 4.5 seconds, another 5.5 seconds, and we are fine.

Now, I totally believe in garbage collection. I’ve seen services become unresponsive because they were garbage collecting. So I definitely believe in process pauses.

I also believe in arbitrary network delays! If you told me that some of my replies saying “hey I’m finished with the lock” would be (very very occasionally) delayed for 1-2 seconds, I’d believe you. 1-2 seconds is a lot of computer time!!

As far as I can tell, either of those two things by themselves is enough to make Redlock not safe. (if you account for process pauses, you could still have another process pause after you account for them, right?)

But, we were talking about clock skew.

Clock skew is when your clock is going, say, 10% or 2x faster than it should. If uncorrected, this is a disaster for a system that depends on knowing the right time.

Here’s the thing – I haven’t seen a lot of systems, but I’ve never seen a system with a clock running too fast. My laptop’s clock mostly works, as far as I know my servers at work have approximately correct clocks (though maybe they don’t and I just don’t know!). Why is Martin Kleppman so worried about this?

I asked about this, and @handler and @aphyr both helped me out. They were basically both totally “julia systems with clocks that go at bogus speeds totally exist. 2x clock skew is real”.

Check out this 1999 paper by Nelson Minar that surveys NTP (network time protocol) servers and finds that they’re often serving incorrect times.

The trouble with timestamps by @aphyr has a great explanation about why you should care about your clock.

Here’s an attack on Tor where if you induce high CPU load on a node, the temperature is likely to cause the clock skew to increase on that node. There’s also a follow up paper

Leap seconds are real. (there was one in 2015!) This post 5 different ways to handle leap seconds has a ton of cool graphs. (thanks Nick Coghlan!)

Google pretty much knows what time it is – they invested a ton of time in a system called Spanner. Here are a couple of articles about that.

distributed systems & weird computer things

Distributed systems researchers are really really concerned with adverse, sometimes uncommon, things that can happen to your computer (like clock skew).

I think they’re concerned for three reasons:

  1. they’re trying to prove theorems, and if you’re proving a theorem you need to worry about all the edge cases
  2. they’re operating literally 20 million computers. if you have 20 million computers everything that can go wrong with a computer will go wrong.
  3. some of those seemingly uncommon things are actually quite common (like network partitions)

I don’t know at all how many computers Google has, but I bet it’s a lot. Like 20 million or something.

Reasons 1 and 2 are related – if you prove a theorem that your system is safe, then it’ll be safe even if you have 20 million computers. Theorems are the best.

So my laptop’s clock is probably mostly okay, but at Google scale I’d imagine you have computers with broken clocks all the time.

reasoning about distributed systems is interesting

I’m not a distributed systems engineer, really. (though I deal with some at work sometimes). I think, if you have plans to interact with distributed systems in the future, it’s really useful to try to reason through issues like this for yourself! There’s a ton of terminology (the first time I watched one of the Jepsen talks I was like “wat.”)

So I think it’s fun to practice sometimes. Maybe one day you learn what linearizability is! and 6 months later you’re like “oh I actually didn’t get it I was totally wrong.”

clock skew is real

People I know have experienced it! It is not just a weird theoretical thing to give researchers jobs. Huh.

thanks for Camille Fournier, Michael Handler, and Kyle Kingsbury for trying to explain distributed systems things to me. I have likely made lots of mistakes in writing this, and those mistakes are all mine :)

How CPU load averages work (and using them to triage webserver performance!)

in performance

CPU load averages have long been a little mysterious to me. I understood that low is good, and high is bad, but I thought of them as a mostly inscrutable number. I have now reached a small epiphany about them, which I would like to share with you!

I tweeted earlier today:

I understand CPU load averages now! If I have a load average of 6, and am processing 60 requests/second, then each one takes 6/60=0.1s of CPU time

and someone responded:

CPU load average is the number of processes in the runnable state. Little to nothing to do with CPU time.

I thought this was a totally reasonable response. I also still thought I was right, but I needed to do some work first, and it wouldn’t fit in a tweet.

It turns out that I was kinda wrong, but I think also kinda right! What follows will hopefully be correct. When doing calculations, I’m going to assume that your processes using CPU are all doing it for the same reason, and that reason is to serve HTTP requests.

Before I explain what load averages have to do with CPU time (spoiler: we’re going to do a tiny bit of queueing theory!), I want to tell you what a load average is, and why the formula I tweeted is awesome.

What’s a load average?

Modern operating systems (since, like, 4.2BSD in 1983) can run more than one process on a single CPU (this is called “CPU scheduling”). My computer is running 300 processes right now! The operating system keeps track of a state for every process. The man page for ps lists them:

       Here are the different values that the s, stat and state output specifiers (header "STAT" or "S") will display to describe the state
       of a process:
       D    uninterruptible sleep (usually IO)
       R    running or runnable (on run queue)
       S    interruptible sleep (waiting for an event to complete)
       T    stopped, either by a job control signal or because it is being traced.
       W    paging (not valid since the 2.6.xx kernel)
       X    dead (should never be seen)
       Z    defunct ("zombie") process, terminated but not reaped by its parent.

The load average is the average, in the last minute / 5 minutes / 15 minutes, of the number of processes in a running or runnable state. As far as I understand, ‘runnable’ means “I’d be running if you’d let me”. Processes that are asleep don’t count. Almost every process on my computer is asleep at any given time.

Given this definition, you may understand why someone would say this has “Little to nothing to do with CPU time”. It doesn’t seem like it would!

A quick note on multiple CPU cores

If there are 3 processes that want to run on a CPU at the same time, and your computer has 4 CPU cores, then you’re totally okay! They can all run. So a load average of 3 is fine is you have 4 cores, and bad if you have 1 core.

The number of cores you have doesn’t affect the formula we’re going to talk about here, though.

Why CPU load averages are awesome

The other day at work, I had a server that had a load average of 6. It was processing 60 HTTP requests per second. (the numbers here are all fake)

Both of these numbers are easy to get! The load average is in the output of top (for instance load average: 6.12, 6.01, 5.98), and I got the requests per second processed (or throughput) by counting log lines in the service’s log file.

So! According to our formula from above, each request was taking 6 / 60 = 0.1s = 100ms of time using-or-waiting-for-the-CPU. I asked my awesome coworker to double check this division to make sure that was right. 100ms is a bajillion years of CPU time, and I was super concerned. That story is for another time! But being able to calculate that number so quickly was SUPER USEFUL to me for understanding the server’s performance.

Why the formula is correct

So! I posited this formula that tells you CPU time per request = load average / request throughput (requests per second). Why does that work?

There’s this theorem called Little’s Law, that states:

The long-term average number of customers in a stable system L is equal to the long-term average effective arrival rate, λ, multiplied by the average time a customer spends in the system, W; or expressed algebraically: L = λW.

This is pretty intuitive: if 10 people per hour (W) arrive at your store, and they spend 30 minutes each there (λ), then on average there will be 5 people (L) at a time in your store.

Now, let’s imagine the CPU is your store, and that HTTP requests are people. The load average tells you how many processes at a time are in line to use the CPU (L). Since in my case I have 1 HTTP request / process, this is the same as the number of requests in line to use the CPU. Note that we care about the steady-state load average – if the load is constantly changing then it’s much harder to reason about. So we want the “average load average”. In my example system at work, the load average had been about 6 for a long time.

If your system is in a steady state (constant load), then the rate of incoming requests will on average, over a long period of time, be the same as the rate of finishing requests. That rate is W.

Lastly, λ is the amount of time each request spends on the CPU (in a running or runnable state).


  • L = load average (average # requests in a running or runnable state)
  • λ = average total time each request spends in a running or runnable state
  • W = throughput (requests per second)

So if we want to do my example from the previous section, we get:

time spent on CPU = λ = L / W = 6 / 60 = 0.1s per request.


There are quite a few assumptions built into this formula, which I’ll make explicit now. First, I told you “The load average tells you how many processes at a time are in line to use the CPU (L)”. This isn’t actually true!

The Wikipedia page on load averages remarks that:

However, Linux also includes processes in uninterruptible sleep states (usually waiting for disk activity), which can lead to markedly different results if many processes remain blocked in I/O due to a busy or stalled I/O system.

So, here are the cases when this “CPU time per request = load average / throughput” formula won’t work for you:

  • some of your processes are in uninterruptible sleep
  • your system has a highly fluctuating load average / throughput
  • you’re handling more than 1 HTTP request per thread (for instance if you’re using Node or Go or…).
  • the CPU activity on your system is caused by something other than your HTTP request processing
  • this time (time running + time waiting for the CPU) includes time spent doing context switches between processes, and time spent on-CPU inside the kernel

It’s also worth noting that the load average is an exponentially decaying average. This means that if your load average is changing over time, it’s hard to know what the non-exponentially-decaying load average is.

There’s likely another caveat I’ve missed, but I think that’s most of them.

a version for time spent on the CPU

We’ve found a formula for “time the request spends on the CPU (or waiting for it to be free)”. But what if we wanted to ignore the time it spent waiting? I have an idea that I made up just now.

If the CPU load is low (like, less than half your number of cores), I think it’s reasonable to assume that any process that wants to be scheduled gets scheduled immediately. So there’s nothing to do.

But what if your CPU is overloaded? Suppose I have 4 CPUs. Then we could instead define

  • L = average number of processes in a running state (which should be 4, since the CPU is at capacity)
  • λ = average time each request spends in a running state
  • W = throughput (requests per second)

Then we can still try to calculate our new λ, from our example from before!

λ = L / W = 4 / 60 = 0.066 s = 66ms per request on the CPU.

I think this math still holds up, but it feels a little shakier to me. I would love comments on this.

this formula = awesome

I had a good experience with this formula yesterday! Being able to quickly triage the number of milliseconds of CPU time per request was an awesome start to doing some more in-depth performance analysis! (which I won’t go into here)

I hope it will help you as well! If you think I’ve gotten this all wrong, let me know on twitter :)

Thanks to Kamal Marhubi, Darius Bacon, Dan Luu, and Sean Allen for comments

Fast integer sets with Roaring Bitmaps (and, making friends with your modern CPU)

in performance

I went to the Papers we Love Montreal meetup this week where Daniel Lemire spoke about roaring bitmaps, and it was AWESOME.

I learned about a cool new data structure, why sets of integers are super important, how modern CPUs are different from they were in the 90s, and how we should maybe think about that when designing data structures.

integer sets (TIL hash sets aren’t fast)

The talk was about building integer sets (like {1,28823,24,514}) that let you do really really fast intersections & unions & other set operations.

But why even care about integer sets? I was super confused about this at the beginning of the talk.

Well, suppose you’re building a database! Then you need to run queries like select * from people where name = 'julia' and country = 'canada'. This involves finding all the rows where name = 'julia' and where country = 'canada' and taking a set intersection! And the rows IDs are integers.

You might have millions of rows in your database that match, so both the size of the set in memory and the speed of the set intersection are important! A fast integer set library here will really help you.

One of the big takeaways for me in this talk was “whoa, hashsets are not fast. okay.” It turns out there are much much faster ways to represent large sets!


Here’s the basic story about how to make fast integer sets!

To represent the set [0,1,2,5], you don’t need 4 integers! (128 bits) Instead, you can just store the bits 111001 (set the bit 5 to 1 to indicate that 5 is in the set!). That’s a “bitmap” or “bitset”, and it only uses 6 bits. AMAZING.

Taking ‘AND’ or ‘OR’ of two bitmaps corresponds to set intersection or union, and is extremely wizard fast. Even faster than you might think! This is where the talk got really surprising to me, so we’re going to take a break and talk about CPUs for a while.

modern CPUs and parallelization (but not the kind you think)

I have this idea in my head that my CPU can do 2.7 billion instructions a second (that’s what 2.7 GHz means, right?). But I learned this is not actually really true! It does 2.7 billion cycles a second, but can potentially do way more instructions than that. Sometimes.

Imagine two different programs. In the first one, you’re doing something dead-simple like taking the AND of a bunch of bits. In the second program, you have tons of branching, and what you do with the end of the data depends on what the beginning of the data is.

The first one is definitely easier to parallelize! But usually when I hear people talk about parallelization, they mean splitting a computation across many CPU cores, or many computers. That’s not what this is about.

This is about doing more than one instruction per CPU cycle, on a single core, in a single thread. I still don’t understand very much about modern CPU architecture (Dan Luu writes cool blog posts about that, though). Let’s see a short experiment with that in action.

some CPU cycle math

In Computers are fast, we looked at summing the bytes in a file as fast as possible. What’s up with our CPU cycles, though? We can find out with perf stat:

$ perf stat ./bytesum_intrinsics The\ Newsroom\ S01E04.mp4
 Performance counter stats for './bytesum_intrinsics The Newsroom S01E04.mp4':

       783,592,910 cycles                    #    2.735 GHz                    
       472,822,242 stalled-cycles-frontend   #   60.34% frontend cycles idle   
     1,126,180,100 instructions              #    1.44  insns per cycle        
            31,039 branch-misses             #    0.02% of all branches        

       0.288103005 seconds time elapsed

My laptop’s clock speed is 2.735 GHz. That means in 0.288103 seconds it has time for about 780,000,000 cycles (do the math! it checks out.) Not-coincidentally, that’s exactly how many cycles perf reports the program using. But cycles are not instructions!! My understanding of modern CPUs is – old CPUs used to do only one instruction per cycle, and it was easy to reason about. New CPU have “instruction pipelines” and can do LOTS. I don’t actually know what the limit is.

For this program, it’s doing 1.4 instructions per cycle. I have no idea if that’s good, or why. If I look at perf stat ls, it does 0.6 instructions per cycle. That’s more than 2x less! In the talk, Lemire said that you can get up to an 8x speed increase by doing more instructions per cycle!

Here’s the disassembly of bytesum_intrinsics, and which instructions it spends the most time on (the numbers on the right are percentages). Dan Luu would probably be able to interpret this and tell me what’s going on, but I don’t know how.

Roaring Bitmaps: a library for compressed bitmaps

Okay, back to bitmaps!

If you want to represent the set {0,1,100000000} with a bitmap, then you have a problem. You don’t want to use 100000000 bits to represent 3 elements. So compression of bitmaps is a big deal. The Roaring Bitmap library gives you a data structure and a bunch of algorithms to take the intersection and union of compressed bitmaps fast. He said they can usually use 2-3 bits per integer they store?! I’m not going to explain how it works at all – read roaringbitmap.org or the github repository’s README if you want to know more.

modern CPUs & roaring bitmaps

What does all this talk of CPUs have to do with bitmaps, though?

Basically they designed the roaring bitmaps data structure so that the CPU can do less instructions (using single-instruction-multiple-data/SIMD instructions that let you operate on 128 bits at a time) and then additionally do lots of instructions per cycle (good parallelization!). I thought this was super awesome.

He showed a lot of really great benchmark results, and I felt pretty convinced that this is a good library.

The whole goal of the Papers We Love meetup is to show programmers work academics are doing that can actually help them write programs. This was a fantastic example of that.

Sendfile (a system call for web developers to know about!)

in kernel, strace,

The other day I learned about a new (to me) exciting Linux system call! (for newcomers, a system call is an operation you can ask the operating system to do). This one seems really important to know about if you’re configuring a webserver! So let’s learn about it.

Before this, I knew about basic system calls like open and read for files, and sendto and recvfrom for networking. And a few fancier things like futex and select for mutexes and waiting.

why sendfile was invented

Suppose I want to send you a big file over a network connection. Normally I’d just read the file incrementally, and then write the contents to the socket. So, at a minimum, we need to

  • use read (requires a context switch into kernel code)
  • (implicitly, copy the data from kernel memory into user memory)
  • use sendto or write (another context switch)

This means we need to copy data (bad) and use two system calls instead of one (also bad).

So the idea is – this pattern of reading a file and writing to a socket is really common! So they made a system call to just do that! Then the kernel can do all the work of reading and writing, and save you CPU time. And you don’t need to copy any data around! AMAZING.

the performance implications

I found this google code discussion on a Python FTP library. One person says that by using the sendfile system call, they could transfer 1.5GB/s instead of 800MB/s! That’s pretty awesome for a small change.

This paper from Netflix describes using sendfile on FreeBSD to go from 6Gbps to 40Gbps of network throughput. They also talk about implementing (part of?) TLS in the kernel to improve TLS performance.

the disasters

I then read “The brokenness of the sendfile() system call”. Wait?! But I thought sendfile was awesome and we should always use it? Not so!

That post describes how on OS X, sendfile wouldn’t send any data until the socket was closed, causing up to 5 second delays. That’s TERRIBLE. So sendfile isn’t some kind of universal panacea, and that’s why webservers let you turn it on and off.

some other reading on sendfile

Rob Pike (one of the creators of Go) thinks sendfile is “bizarre”. I find his argument in that post pretty difficult to follow (if the kernel provides a way to do something, and that way gives you better performance in practice, why not use it?). But I thought it was interesting.

Life of a HTTP request, as seen by my toy web server is interesting, and describes how the author uses sendfile for large files, but not for small files. You don’t need to write your own webserver to take advantage of this – you can configure apache and nginx to use sendfile!

The sendfile man page is actually quite readable, and it tells you something very important! It recommends using the TCP_CORK TCP option for better network performance. We learned about how understanding TCP is important in Why you should understand (a little) about TCP, and that’s pretty important here as well. In this case you need to decide whether to use TCP_CORK and TCP_NODELAY. One thing I read recommended using both.

You can also use sendfile to copy files quickly! (like, think about how cp is implemented!) So you want to write to a file real fast… walks through some optimizations to file copying and gets a 25% improvement by using sendfile and other tricks. I straced cp on my machine just now, and it seems like it does not use sendfile. It’s super interesting to me how much abstractions break down when you’re trying to really optimize performance.

next step: splice & tee

These days sendfile is a wrapper around the splice system call, which seems to be the same thing – copy data from one file/pipe/socket to another – but with some extra options.

There’s a neat thread on the Linux Kernel Mailing List from 2006, just after those system calls came into existence, where Linus explains what they’re for and how to think about them.

I found this paragraph helpful:

Anyway, when would you actually use a kernel buffer? Normally you’d use it it you want to copy things from one source into another, and you don’t actually want to see the data you are copying, so using a kernel buffer allows you to possibly do it more efficiently, and you can avoid allocating user VM space for it

That post also makes it clear that sendfile used to be a separate system call and is now just a wrapper around splice.

There’s also vmsplice, which I think is related and important. But right now my brain is full. Maybe we’ll learn about vmsplice later.

why this is amazing

It makes me really happy when learning a new system call helps me understand how to do something really practical. Now I know that if I’m building something that serves large files and I care about the performance, I should make sure I understand if it’s using sendfile!

Guessing Linux kernel registers

in kernel

I have a long-standing project to try to learn to use ftrace, a Linux kernel tracing tool. As usual when I want to learn more, I turned to one of Brendan Gregg’s tools – the perf-tools repo on Github. There’s a whole delightful directory of examples. It’s the best.

He has a lot of tools that are ready to use and easy to understand – for instance, you can use execsnoop to see every program that’s being executed. Those are awesome.

Then there are some that require… more work. I was interested in ftrace, and I was using his kprobe script to trace system calls and their arguments.

I started out by running sudo kernel/kprobe 'p:SyS_read' to tell me every time the read system call happened. This gives me output like

 Chrome_ChildIOT-8978  [000] d...  7402.349880: SyS_read: (SyS_read+0x0/0xa0)
 Chrome_ChildIOT-4102  [002] d...  7402.349922: SyS_read: (SyS_read+0x0/0xa0)
            Xorg-1853  [001] d...  7402.349962: SyS_read: (SyS_read+0x0/0xa0)
            Xorg-1853  [001] d...  7402.349969: SyS_read: (SyS_read+0x0/0xa0)
 Chrome_IOThread-4092  [003] d...  7402.349974: SyS_read: (SyS_read+0x0/0xa0)

But WHAT FILE DID IT READ? This is not good at all.

In the example, he says, on the open system call.

Here I guessed that the mode was in register %cx, and cast it as a 16-bit unsigned integer (“:u16”). Your platform and kernel may be different, and the mode may be in a different register. If fiddling with such registers becomes too painful or unreliable for you, consider installing kernel debuginfo and using the named variables with perf_events “perf probe”.

This was wizardry. How could he guess that the mode of a file was in the register %cx? What even are the registers? This makes no sense.

I partly figured this out and got more information about the read system calls, so I will now tell you!

what even is a register

I know that registers are what the CPU uses to store data in when calculating things. But how many even are there? How do I guess which one is right?

First, I found this page describing x86 registers. It tells me that there are

General registers

Segment registers

Index and pointers

From the description, the segment registers seem safe to ignore! Awesome. The instruction pointer and the stack pointer tell me what instruction is running right now and where the stack is. I also don’t care about that. So that leaves me with only 7 registers to worry about (eax, ebx, ecx, edx, esi, edi, and ebp). That’s way better.

printing the registers

So before we were running sudo kernel/kprobe 'p:SyS_read'. We can also print the registers for the read system call! Here goes. For some reason we need to take off the e.

sudo kernel/kprobe 'p:SyS_read ax=%ax bx=%bx cx=%cx dx=%dx si=%si di=%di' | grep chrome-4095
          chrome-4095  [001] d...  7665.279404: SyS_read: (SyS_read+0x0/0xa0) ax=0 bx=2cb4726adec0 cx=0 dx=2 si=7fff1282f70e di=9
          chrome-4095  [001] d...  7665.279562: SyS_read: (SyS_read+0x0/0xa0) ax=0 bx=2cb4726adec0 cx=0 dx=2 si=7fff1282f70e di=9
          chrome-4095  [002] d...  7665.400594: SyS_read: (SyS_read+0x0/0xa0) ax=0 bx=2cb4726adec0 cx=0 dx=2 si=7fff1282f70e di=9

Let’s compare this to the output of strace:

sudo strace -e read -p 4095
Process 4095 attached - interrupt to quit
read(9, "!", 2)                         = 1
read(9, "!", 2)                         = 1
read(9, "!", 2)                         = 1

Ok, awesome! In the output of strace. I know that 9 is the file descriptor, 2 is the length to read, and the middle value is the string. This must mean that %di is the file descriptor, and %dx is the amount of data to read!

I can label those now and be a register-guessing-wizard like Brendan Gregg!

sudo kernel/kprobe 'p:julia_smart_read SyS_read fd=%di:u16 bytes_to_read=%dx' | grep chrome-4095
          chrome-4095  [003] d...  7854.905089: julia_smart_read: (SyS_read+0x0/0xa0) fd=9 bytes_to_read=2
          chrome-4095  [003] d...  7854.945585: julia_smart_read: (SyS_read+0x0/0xa0) fd=9 bytes_to_read=2
          chrome-4095  [002] d...  7854.945852: julia_smart_read: (SyS_read+0x0/0xa0) fd=9 bytes_to_read=2

So now I know which file descriptors are being read!

The advantage of using ftrace instead of strace is that the overhead is way lower: when I strace find it makes it 20x slower, but with ftrace it’s totally okay. I’m still not sure where the string that we read is (I think it’s in %si, though!)

Now I am one step closer to being able to trace system calls with less overhead. Guessing registers is really tedious but it seems to be totally possible!

update: turns out you don’t have to guess at all! the registers used for system calls are always the same :D. here is a table with all the answers ❤

Calling C from Rust

in rust

Yesterday I asked Kamal how to call C code from Rust, for a project I’m thinking about. It turned out to be a little harder than I expected! Largely because I don’t know Rust well, and fixing compiler errors is nontrivial. 30 minutes and some number of inscrutable-to-me compiler errors later, we figured it out.

I want to do something pretty simple – copy the string “Hello, world!” and print it.

Here’s the Rust code that calls C. It doesn’t use any special libraries – just Rust.

extern {
    // Our C function definitions!
    pub fn strcpy(dest: *mut u8, src: *const u8) -> *mut u8;
    pub fn puts(s: *const u8) -> i32;

fn main() {
    let x = b"Hello, world!\0"; // our string to copy
    let mut y = [0u8; 32]; // declare some space on the stack to copy the string into
    unsafe {
      // calling C code is definitely unsafe. it could be doing ANYTHING
      strcpy(y.as_mut_ptr(), x.as_ptr()); // we need to call .as_ptr() to get a pointer for C to use

I’m mostly writing this down so that I don’t forget, but maybe it will be useful for you too!

Along the way I found out that cargo (Rust’s build tool) is super easy to get started with – all you need to do is run

$ cargo new --bin my-project
$ cd my-project
$ cargo run
   Compiling my-project v0.1.0 (file:///home/bork/work/my-project)
     Running `target/debug/my-project`
Hello, world!

and you have a new Rust project that you can run with cargo run!

Rust is way easier to install than it was in 2013, though I still find the error messages hard to understand.

There might be more of these short project-notes posts to come – hopefully some of you will find them interesting!

A few notes on my CUSEC talk

in talks

I gave a talk today! If you came, thanks for coming! Here are some notes and links if you’d like to learn more!

on debugging

on computers being fast