I hadn't written any threaded programs before yesterday. I knew sort of abstractly about some concurrency concepts (mutexes! people say compare-and-swap but I don't totally get it!), but actually understanding a Thing is hard if I've never done it. So yesterday I decided to write a program with threads! In this post, we're going to:
- Write a threaded program that gets the wrong answer because of a race condition
- Fix that race condition in C and Rust, using 2 different approaches (mutexes and atomics)
- Find out why Rust is slower than C
- Talk a little about the actual system calls and instructions that make some of this work
At first I was going to write a concurrent hashmap, but Kamal wisely pointed out that I should start with something simpler, like a counter!
So. We're going to get 20 threads to count to 20,000,000 all together. We'll have a global counter variable, and each thread will increment it with a plain counter += 1.
That seems safe enough! But here's what can go wrong: if two threads try to increment the number at the exact same time, it can get incremented only once instead of twice. This is called a race condition.
Writing a race condition
Here's what my original C program looks like, with the bug. I wrote this by knowing that people used a library called "pthreads" to do threads in C, and googling "pthreads example". I'm not going to explain it very much, but essentially it creates 20 threads and has them all run the AddThings function, which increments a global counter a million times each.
Full version: counter_race.c.
This program a) runs very fast and b) returns wildly different answers each time. We're expecting 20,000,000. I ran it 10 times and got 10 different answers, between 2,838,838 and 5,695,671.
First try: mutexes! (and learning that mutexes can be Really Slow)
A mutex (or lock) is a way to control access to a resource so that two threads don't change it in conflicting ways at the same time.
A typical pattern for using a lock is: take the lock, make your change, and release the lock.
Mutexes are often implemented on Linux systems with the futex system call. Basically it's a way of saying "hey, kernel! This lock is closed, so I'd like to stop running. Can you please wake me up when it's available again?".
I learned during these explorations that all this making system calls and going to sleep and waking up again is actually pretty expensive. But let's do performance numbers first!
So the C pthread library has a mutex implementation: pthread_mutex_t. Let's implement our counter with it! You can see the full implementation at counter_with_mutex.c. It's a pretty small change: we need to declare and initialize a mutex at the beginning, and wrap the counter += 1 in a lock/unlock pair.
If we run our new program, it calculates the correct answer every time!
Amazing! What does the performance of this look like? I'm going to do all my profiling with perf stat (perf is an amazing program that you can read more about in "I can spy on my CPU cycles with perf!").
The mutex version takes much longer to run; our original counter with the race condition took more like 0.08 seconds. This is a really big performance hit, even if it means that we have a program that works instead of a program that doesn't!
Mutexes in Rust, too! (it's even slower!)
I decided to implement the same thing in Rust because, well, Rust is fun! You can see it at rust_counter_mutex.rs.
We create a mutex with Mutex::new(0) (wrapped in an Arc so that all the threads can share it) and increment the counter with *counter.lock().unwrap() += 1.
I basically got this to work by copying the Rust mutex documentation. I'm pretty impressed by how much Rust's documentation has improved in the last year.
I ran this, and I was expecting it to perform about as well as my C code. It didn't.
My first instinct was to profile it! I used Brendan Gregg's excellent flame graph library, and ran perf record on both the C and Rust versions to generate flamegraphs.
What is even going on here?! These two graphs look exactly the same. Why does the Rust one take longer?
So, off to the races in the #rust IRC channel! Fortunately, the people in #rust are the Nicest People. You can see them helping me out in the logs =D.
After a while, someone named Sharp explains that Rust's mutexes are implemented in a Slow Way using channels. This seems to make sense, but then why couldn't I see that from the flamegraph? He explains helpfully that channels in Rust are also implemented with the futex syscall, so it's spending all of its time in the same syscalls, just doing it less efficiently.
Sharp also suggests using an atomic instead of a mutex, so that's the next step!
Making it fast with atomics in Rust
This one is at rust_counter_atomics.rs. I did this without actually understanding what an atomic even is, so I'm not going to explain anything yet.
Basically we replace our mutex with an AtomicUsize and our increment with counter.fetch_add(1, Ordering::Relaxed).
I'm not going to talk about the Relaxed right now (because I don't understand it as well as I'd like), but basically this increments our counter in a threadsafe way (so that two threads can't race).
And it works! And it's fast!
Here's the new flamegraph:
You can see from the new flamegraph that it's definitely not using mutexes at all. But we still don't know how these atomics work, which is troubling. Let's implement the same thing in C, to see if it makes it a little clearer.
Atomics in C: even faster!
To use atomics in our C program, I replaced the plain counter += 1 with something called __sync_add_and_fetch. You might have noticed that the fetch_add in Rust is suspiciously similar to __sync_add_and_fetch. This is a special GCC atomic builtin which generates assembly instructions to safely increment our counter.
That GCC documentation page is pretty readable! One interesting thing is this:
All of the routines are described in the Intel documentation to take “an optional list of variables protected by the memory barrier”. It's not clear what is meant by that; it could mean that only the following variables are protected, or it could mean that these variables should in addition be protected. At present GCC ignores this list and protects all variables which are globally accessible. If in the future we make some use of this list, an empty list will continue to mean all globally accessible variables.
It's sort of refreshing to hear the people who write GCC (who I think of as MAGICAL WIZARDS WHO KNOW EVERYTHING) say that they read some Intel documentation and it was not clear what it meant! This stuff must really not be easy.
This C program is a little faster than the Rust version, clocking in at around 0.44 seconds on my machine. I don't know why.
What actual CPU instructions are involved?
I don't really read assembly, so we'll need some help to see which are the Magical Safe Instructions. perf is the best program in the universe, and it can help us with this! perf record and perf annotate together let us see which instructions in our program are taking the most time.
We can try it with the Rust program, too.
So we can see that there's a lock instruction prefix that increments a variable in each case. Googling for "lock instruction" finds us this x86 instruction set reference:
In a multiprocessor environment, the LOCK# signal insures that the processor has exclusive use of any shared memory while the signal is asserted.
In both cases over 99% of the run time is spent in the instruction right after that lock-prefixed instruction. I'm not totally sure why that is, but it could be that the lock itself is fast, but then once it's done, the memory it updated needs to be synchronized and the next instruction needs to wait for that to happen. That's mostly made up though. If you want to explain it to me I would be delighted.
(If you've heard about compare-and-swap, that's a similar instruction that lets you update variables without creating races)
We are now slightly closer to being concurrency wizards
This was really fun! In January I was talking to a (super nice!) company that built distributed systems about interviewing there, and they sent me some questions to answer. One of the questions was something like "can you discuss the pros and cons of using a lock-free approach for implementing a thread-safe hashmap?"
My reaction at the time was WHAT ARE YOU EVEN ASKING ME HELP. But these concurrency explorations make me feel like that question is a lot more reasonable! Using atomic instructions in this case was way faster than using a mutex, and I feel like I have a slightly better sense of how all this works now.
Also, when I see a process waiting in a futex(...) system call when I strace it, I understand what's going on a little better!
Thanks are due to Kamal for having lots of wonderful suggestions, and the people of the ever-amazing #rust IRC channel. You can see all the code for this post at https://github.com/jvns/fun-with-threads/.