Async IO on Linux: select, poll, and epoll

This week I got a new book in the mail: The Linux Programming Interface. My awesome coworker Arshia recommended it to me so I bought it! It’s written by the maintainer of the Linux man-pages project, Michael Kerrisk. It talks about the Linux programming interface as of kernel 2.6.x.

Here’s the cover.

In the contributing guidelines (you can contribute to the linux man pages!! mind=blown), there’s a list of missing man pages that would be useful to contribute. It says:

You need to have a reasonably high degree of understanding of the topic, or be prepared to invest the time (e.g., reading source code, writing test programs) to gain that understanding. Writing test programs is important: quite a few kernel and glibc bugs have been uncovered while writing test programs during the preparation of man pages.

I thought this was a cool reminder of how you can learn a lot by documenting something & writing small test programs!

But today we’re going to talk about something I learned from this book: the select, poll, and epoll system calls.

Chapter 63: Alternative I/O models

This book is huge: 1400 pages. I started it at Chapter 63 (“alternative I/O models”) because I’ve been meaning to understand better what’s up with select, poll and epoll for quite some time. And writing up things I learn helps me understand them, so here’s my attempt at explaining!

This chapter is basically about how to monitor a lot of file descriptors for new input/output. Who needs to watch a lot of file descriptors at a time? Servers!

For example if you’re writing a web server in node.js on Linux, it’s actually using the epoll Linux system call under the hood. Let’s talk about why, how epoll is different from poll and select, and about how it works!

Servers need to watch a lot of file descriptors

Suppose you’re a webserver. Every time you accept a connection with the accept system call (here’s the man page), you get a new file descriptor representing that connection.

If you’re a web server, you might have thousands of connections open at the same time. You need to know when people send you new data on those connections, so you can process and respond to them.

You could have a loop that basically does:

for x in open_connections:
    if has_new_input(x):
        process_input(x)

The problem with this is that it can waste a lot of CPU time. Instead of spending all CPU time to ask “are there updates now? how about now? how about now? how about now?”, instead we’d rather just ask the Linux kernel “hey, here are 100 file descriptors. Tell me when one of them is updated!”.

The 3 system calls that let you ask Linux to monitor lots of file descriptors are poll, epoll and select. Let’s start with poll and select because that’s where the chapter started.

First way: select & poll

These 2 system calls are available on any Unix system, while epoll is Linux-specific. Here’s basically how they work:

Give them a list of file descriptors to get information about
They tell you which ones have data available to read/write to

The first surprising thing I learned from this chapter are that poll and select fundamentally use the same code.

I went to look at the definition of poll and select in the Linux kernel source to confirm this and it’s true!

here’s the definition of the select syscall and do_select
and the definition of the poll syscall and do_poll

They both call a lot of the same functions. One thing that the book mentioned in particular is that poll returns a larger set of possible results for file descriptors like POLLRDNORM | POLLRDBAND | POLLIN | POLLHUP | POLLERR while select just tells you “there’s input / there’s output / there’s an error”.

select translates from poll’s more detailed results (like POLLWRBAND) into a general “you can write”. You can see the code where it does this in Linux 4.10 here.

The next thing I learned is that poll can perform better than select if you have a sparse set of file descriptors .

To see this, you can actually just look at the signatures for poll and select!

int ppoll(struct pollfd *fds, nfds_t nfds,
          const struct timespec *tmo_p, const sigset_t
          *sigmask)`
int pselect(int nfds, fd_set *readfds, fd_set *writefds,
            fd_set *exceptfds, const struct timespec *timeout,
            const sigset_t *sigmask);

With poll, you tell it “here are the file descriptors I want to monitor: 1, 3, 8, 19, etc” (that’s the pollfd argument. With select, you tell it “I want to monitor 19 file descriptors. Here are 3 bitsets with which ones to monitor for reads / writes / exceptions.” So when it runs, it loops from 0 to 19 file descriptors, even if you were actually only interested in 4 of them.

There are a lot more specific details about how poll and select are different in the chapter but those were the 2 main things I learned!

why don’t we use poll and select?

Okay, but on Linux we said that your node.js server won’t use either poll or select, it’s going to use epoll. Why?

From the book:

On each call to select() or poll(), the kernel must check all of the specified file descriptors to see if they are ready. When monitoring a large number of file descriptors that are in a densely packed range, the timed required for this operation greatly outweighs [the rest of the stuff they have to do]

Basically: every time you call select or poll, the kernel needs to check from scratch whether your file descriptors are available for writing. The kernel doesn’t remember the list of file descriptors it’s supposed to be monitoring!

Signal-driven I/O (is this a thing people use?)

The book actually describes 2 ways to ask the kernel to remember the list of file descriptors it’s supposed to be monitoring: signal-drive I/O and epoll. Signal-driven I/O is a way to get the kernel to send you a signal when a file descriptor is updated by calling fcntl. I’ve never heard of anyone using this and the book makes it sound like epoll is just better so we’re going to ignore it for now and talk about epoll.

level-triggered vs edge-triggered

Before we talk about epoll, we need to talk about “level-triggered” vs “edge-triggered” notifications about file descriptors. I’d never heard this terminology before (I think it comes from electrical engineering maybe?). Basically there are 2 ways to get notifications

get a list of every file descriptor you’re interested in that is readable (“level-triggered”)
get notifications every time a file descriptor becomes readable (“edge-triggered”)

what’s epoll?

Okay, we’re ready to talk about epoll!! This is very exciting to because I’ve seen epoll_wait a lot when stracing programs and I often feel kind of fuzzy about what it means exactly.

The epoll group of system calls (epoll_create, epoll_ctl, epoll_wait) give the Linux kernel a list of file descriptors to track and ask for updates about activity on those file descriptors.

Here are the steps to using epoll:

Call epoll_create to tell the kernel you’re gong to be epolling! It gives you an id back
Call epoll_ctl to tell the kernel file descriptors you’re interested in updates about. Interestingly, you can give it lots of different kinds of file descriptors (pipes, FIFOs, sockets, POSIX message queues, inotify instances, devices, & more), but not regular files. I think this makes sense – pipes & sockets have a pretty simple API (one process writes to the pipe, and another process reads!), so it makes sense to say “this pipe has new data for reading”. But files are weird! You can write to the middle of a file! So it doesn’t really make sense to say “there’s new data available for reading in this file”.
Call epoll_wait to wait for updates about the list of files you’re interested in.

performance: select & poll vs epoll

In the book there’s a table comparing the performance for 100,000 monitoring operations:

# operations  |  poll  |  select   | epoll
10            |   0.61 |    0.73   | 0.41
100           |   2.9  |    3.0    | 0.42
1000          |  35    |   35      | 0.53
10000         | 990    |  930      | 0.66

So using epoll really is a lot faster once you have more than 10 or so file descriptors to monitor.

who uses epoll?

I sometimes see epoll_wait when I strace a program. Why? There is the kind of obvious but unhelpful answer “it’s monitoring some file descriptors”, but we can do better!

First – if you’re using green threads or an event loop, you’re likely using epoll to do all your networking & pipe I/O!

For example, here’s a golang program that uses epoll on Linux!

package main

import "net/http"
import "io/ioutil"

func main() {
    resp, err := http.Get("http://example.com/")
        if err != nil {
            // handle error
        }
    defer resp.Body.Close()
    _, err = ioutil.ReadAll(resp.Body)
}

Here you can see the golang run time using epoll to do a DNS lookup:

16016 connect(3, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("127.0.1.1")}, 16 <unfinished ...>
16020 socket(PF_INET, SOCK_DGRAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP
16016 epoll_create1(EPOLL_CLOEXEC <unfinished ...>
16016 epoll_ctl(5, EPOLL_CTL_ADD, 3, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=334042824, u64=139818699396808}}
16020 connect(4, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("127.0.1.1")}, 16 <unfinished ...>
16020 epoll_ctl(5, EPOLL_CTL_ADD, 4, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=334042632, u64=139818699396616}}

Basically what this is doing is connecting 2 sockets (on file descriptors 3 and 4) to make DNS queries (to 127.0.1.1:53), and then using epoll_ctl to ask epoll to give us updates about them

Then it makes 2 DNS queries for example.com (why 2? nelhage suggests one of them is querying for the A record, and one for the AAAA record!), and uses epoll_wait to wait for replies

# these are DNS queries for example.com!
16016 write(3, "\3048\1\0\0\1\0\0\0\0\0\0\7example\3com\0\0\34\0\1", 29
16020 write(4, ";\251\1\0\0\1\0\0\0\0\0\0\7example\3com\0\0\1\0\1", 29
# here it tries to read a response but I guess there's no response
# available yet
16016 read(3,  <unfinished ...>
16020 read(4,  <unfinished ...>
16016 <... read resumed> 0xc8200f4000, 512) = -1 EAGAIN (Resource temporarily unavailable)
16020 <... read resumed> 0xc8200f6000, 512) = -1 EAGAIN (Resource temporarily unavailable)
# then it uses epoll to wait for responses
16016 epoll_wait(5,  <unfinished ...>
16020 epoll_wait(5,  <unfinished ...>

So one reason your program might be using epoll “it’s in Go / node.js / Python with gevent and it’s doing networking”.

What libraries do go/node.js/Python use to use epoll?

node.js uses libuv (which was written for the node.js project)
the gevent networking library in Python uses libev/libevent
golang uses some custom code, because it’s Go. This looks like it might be the implementation of network polling with epoll in the golang runtime – it’s only about 100 lines which is interesting. You can see the general netpoll interface here – it’s implemented on BSDs with kqueue instead

Webservers also implement epoll – for example here’s the epoll code in nginx.

more select & epoll reading

I liked these 3 posts by Marek:

In particular these talk about how epoll’s support for multithreaded programs has not historically been good, though there were some improvements in Linux 4.5.

and this:

using select (2) the right way

ok that’s enough

I learned quite a few new things about select & epoll by writing this post! We’re at 1800 words now so I think that’s enough. Looking forward to reading more of this Linux programming interface book and finding out more things!

Probably there are some wrong things in this post, let me know what they are!

One small thing I like about my job is that I can expense programming books! This is cool because sometimes it causes me to buy and read books that teach me things that I might not have learned otherwise. And buying a book is way cheaper than going to a conference!