Debugging netlink requests
This week I was working on a Kubernetes networking problem. Basically our container network backend was reporting that it couldn’t delete routes, and we didn’t know why.
I started reading the code that was failing, and it was using a library called “netlink”. I’d never heard of that before this week.
what’s netlink?
Wikipedia says:
Netlink socket family is a Linux kernel interface used for inter-process communication (IPC) between both the kernel and userspace processes, and between different userspace processes, in a way similar to the Unix domain sockets.
The program I was debugging was creating and deleting routes in the route table. Netlink can do lots of things (kernel <-> userspace and userspace <-> userspace communication), but what was happening in this case was pretty simple:
- userspace program creates a netlink socket
- userspace program sends a message with that socket asking the kernel to delete a route
- kernel deletes the route (or in our case, fails and returns an error message)
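Here's a rough sketch of those three steps using only Python's standard library. It is not the code from the program I was debugging: build_route_request and send_request are hypothetical helper names, the constants are copied from the kernel headers, and the byte layout assumes a little-endian machine like the one in the strace output later in this post. Actually sending a delete needs root and really removes the route, so the sketch only builds the message and leaves the send as a function you call yourself.

```python
import socket
import struct

# Values from <linux/netlink.h> and <linux/rtnetlink.h>
RTM_NEWROUTE, RTM_DELROUTE = 24, 25
NLM_F_REQUEST, NLM_F_ACK = 0x01, 0x04
RTA_DST, RTA_OIF, RTA_GATEWAY = 1, 4, 5
RT_TABLE_MAIN, RTPROT_BOOT, RT_SCOPE_UNIVERSE, RTN_UNICAST = 254, 3, 0, 1
AF_INET = 2  # address family for IPv4 routes


def rtattr(attr_type, payload):
    # rtattr: u16 length (including this 4-byte header), u16 type, payload.
    # Every payload used here is exactly 4 bytes, so no padding is needed.
    return struct.pack("=HH", 4 + len(payload), attr_type) + payload


def build_route_request(msg_type, flags, seq, dst, prefix_len, gateway, oif):
    """Pack nlmsghdr + rtmsg + three attributes for an IPv4 route."""
    rtmsg = struct.pack("=BBBBBBBBI", AF_INET, prefix_len, 0, 0, RT_TABLE_MAIN,
                        RTPROT_BOOT, RT_SCOPE_UNIVERSE, RTN_UNICAST, 0)
    body = (rtmsg
            + rtattr(RTA_DST, socket.inet_aton(dst))
            + rtattr(RTA_GATEWAY, socket.inet_aton(gateway))
            + rtattr(RTA_OIF, struct.pack("=I", oif)))
    # nlmsghdr: total length, type, flags, sequence number, sender pid (0 here)
    return struct.pack("=IHHII", 16 + len(body), msg_type, flags, seq, 0) + body


def send_request(request):
    """Open a netlink socket, send the request, return the kernel's raw reply.

    Linux only. Changing routes needs root, so nothing calls this by default.
    """
    with socket.socket(socket.AF_NETLINK, socket.SOCK_RAW, socket.NETLINK_ROUTE) as s:
        s.bind((0, 0))                # pid 0: let the kernel assign our address
        s.sendto(request, (0, 0))     # destination pid 0 is the kernel
        return s.recv(32768)


delete = build_route_request(RTM_DELROUTE, NLM_F_REQUEST | NLM_F_ACK, 1,
                             "172.16.5.0", 24, "127.0.0.1", 1)
print(len(delete), "bytes")
```

Keeping the message builder separate from the socket code means you can inspect the exact bytes without touching the routing table.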
how to see netlink messages with strace
Let’s create some netlink messages! Luckily this is easy: if we use the ip tool to create and delete a route, it uses netlink.
ip route add 172.16.5.0/24 via 127.0.0.1 dev lo
ip route del 172.16.5.0/24 via 127.0.0.1 dev lo
Cool, let’s strace it! Here’s the command:
strace -s 100 -f -o out -x ip route add 172.16.5.0/24 via 127.0.0.1 dev lo
and the output:
socket(PF_NETLINK, SOCK_RAW|SOCK_CLOEXEC, NETLINK_ROUTE) = 3
bind(3, {sa_family=AF_NETLINK, pid=0, groups=00000000}, 12) = 0
getsockname(3, {sa_family=AF_NETLINK, pid=13058, groups=00000000},
sendmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000},
msg_iov(1)=[{"\x34\x00\x00\x00\x18\x00\x05\x06\x6e\xbc\xac\x59\x00\x00\x00\x00\x02\x18\x00\x00\xfe\x03\x00\x01\x00\x00\x00\x00\x08\x00\x01\x00\xac\x10\x05\x00\x08\x00\x05\x00\x7f\x00\x00\x01\x08\x00\x04\x00\x01\x00\x00\x00",
52}], msg_controllen=0, msg_flags=0}, 0) = 52
recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0,
msg_iov(1)=[{"\x24\x00\x00\x00\x02\x00\x00\x00\x6e\xbc\xac\x59\x02\x33\x00\x00\x00\x00\x00\x00\x34\x00\x00\x00\x18\x00\x05\x06\x6e\xbc\xac\x59\x00\x00\x00\x00",
32768}], msg_controllen=0, msg_flags=0}, 0) = 36
So we see that it:
- creates the netlink socket & binds to it
- sends a message (\x34\x00...)
- receives a response
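The response is small enough to unpack by hand. This is a minimal sketch, assuming a little-endian machine; parse_reply is a hypothetical helper name. A reply of type NLMSG_ERROR carries a signed error code right after the 16-byte header: 0 means the kernel acknowledged the request, and a negative value is an errno.

```python
import struct

NLMSG_ERROR = 2  # from <linux/netlink.h>


def parse_reply(reply):
    """Unpack an NLMSG_ERROR reply into its sequence number, pid and errno."""
    length, msg_type, flags, seq, pid = struct.unpack_from("=IHHII", reply, 0)
    if msg_type != NLMSG_ERROR:
        raise ValueError(f"not an NLMSG_ERROR message (type {msg_type})")
    # The payload starts with a signed int: 0 for an ACK, -errno on failure.
    (error,) = struct.unpack_from("=i", reply, 16)
    return {"seq": seq, "pid": pid, "errno": -error}


# The bytes returned by recvmsg in the strace output above
reply = bytes.fromhex(
    "24000000020000006ebcac5902330000"   # nlmsghdr
    "00000000"                           # error code
    "34000000180005066ebcac5900000000"   # header of the request, echoed back
)
print(parse_reply(reply))
```

For this capture the errno is 0, so the route add succeeded, and the pid is 13058, the same value getsockname reported. In a failing case like our route delete, the errno field is where the kernel's error would show up.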
Okay, but what does that message say? Here’s the message again:
\x34\x00\x00\x00\x18\x00\x05\x06\x6e\xbc\xac\x59\x00\x00\x00\x00\x02\x18\x00\x00\xfe\x03\x00\x01\x00\x00\x00\x00\x08\x00\x01\x00\xac\x10\x05\x00\x08\x00\x05\x00\x7f\x00\x00\x01\x08\x00\x04\x00\x01\x00\x00\x00
Not super understandable, right? Well, luckily there’s a Python tool that can help us understand it! We’ll save this to a file called message.
decoding netlink messages with pyroute2
I googled how to decode netlink messages and I found this great page: http://docs.pyroute2.org/debug.html.
Decoding my netlink message turned out to be pretty simple: I just had to run this:
pip install pyroute2
wget https://raw.githubusercontent.com/svinota/pyroute2/72e444714f37a313fb15bdb22734e517feefa9e9/tests/decoder/decoder.py
python decoder.py pyroute2.netlink.rtnl.rtmsg.rtmsg message
Here’s the output!
{'attrs': [('RTA_DST', '172.16.5.0'),
('RTA_GATEWAY', '127.0.0.1'),
('RTA_OIF', 1)],
'dst_len': 24,
'family': 2,
'flags': 0,
'header': {'flags': 1541,
'length': 52,
'pid': 0,
'sequence_number': 1504493250,
'type': 24},
'proto': 3,
'scope': 0,
'src_len': 0,
'table': 254,
'tos': 0,
'type': 1}
I don’t understand all of this but we’re just going to focus on this part:
{'attrs': [('RTA_DST', '172.16.5.0'),
('RTA_GATEWAY', '127.0.0.1'),
('RTA_OIF', 1)],
The dst and gateway fields are pretty easy to understand there!
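To see where those fields come from, here's a rough standard-library decoder for the same bytes. It is a sketch and not how pyroute2 does it: strace_to_bytes and decode_route_message are hypothetical helper names, only the three attribute types in this message are named, and it assumes a little-endian machine and a string written entirely as \xNN escapes.

```python
import socket
import struct

# A small subset of the RTA_* attribute types from <linux/rtnetlink.h>
RTA_DST, RTA_OIF, RTA_GATEWAY = 1, 4, 5
RTA_NAMES = {RTA_DST: "RTA_DST", RTA_OIF: "RTA_OIF", RTA_GATEWAY: "RTA_GATEWAY"}


def strace_to_bytes(escaped):
    # Turns the text "\x34\x00..." into the bytes 0x34 0x00 ...
    return bytes.fromhex(escaped.replace("\\x", ""))


def decode_route_message(data):
    # 16-byte nlmsghdr, then the 12-byte rtmsg, then the attributes
    length, msg_type, flags, seq, pid = struct.unpack_from("=IHHII", data, 0)
    (family, dst_len, src_len, tos,
     table, proto, scope, rtype, rtflags) = struct.unpack_from("=BBBBBBBBI", data, 16)
    attrs = []
    offset = 28
    while offset + 4 <= length:
        attr_len, attr_type = struct.unpack_from("=HH", data, offset)
        if attr_len < 4:
            break  # malformed attribute, stop instead of looping forever
        payload = data[offset + 4:offset + attr_len]
        if attr_type in (RTA_DST, RTA_GATEWAY) and len(payload) == 4:
            value = socket.inet_ntoa(payload)       # IPv4 address
        elif attr_type == RTA_OIF:
            value = struct.unpack("=I", payload)[0]  # interface index
        else:
            value = payload.hex()
        attrs.append((RTA_NAMES.get(attr_type, attr_type), value))
        offset += (attr_len + 3) & ~3  # attributes are aligned to 4 bytes
    return {"header": {"length": length, "type": msg_type, "flags": flags},
            "dst_len": dst_len, "table": table, "attrs": attrs}


message = strace_to_bytes(
    r"\x34\x00\x00\x00\x18\x00\x05\x06\x6e\xbc\xac\x59\x00\x00\x00\x00"
    r"\x02\x18\x00\x00\xfe\x03\x00\x01\x00\x00\x00\x00"
    r"\x08\x00\x01\x00\xac\x10\x05\x00"
    r"\x08\x00\x05\x00\x7f\x00\x00\x01"
    r"\x08\x00\x04\x00\x01\x00\x00\x00"
)
print(decode_route_message(message)["attrs"])
```

Each attribute is just a length, a type, and a payload, so RTA_OIF is the last 8 bytes of the message: length 8, type 4, value 1.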
why the program I was debugging wasn’t working
You see this RTA_OIF field? This field is a network interface id. For example, on my laptop right now I have 5 network interfaces, numbered 1 through 5. The (correct) message above has RTA_OIF set to 1, for the lo loopback interface. Here’s what ip link prints:
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: enp0s25: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000
link/ether 3c:97:0e:55:b3:7f brd ff:ff:ff:ff:ff:ff
3: wlp3s0: <BROADCAST,MULTICAST> mtu 1500 qdisc mq state DOWN mode DORMANT group default qlen 1000
link/ether 60:67:20:eb:7b:bc brd ff:ff:ff:ff:ff:ff
4: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default
link/ether 02:42:a0:c5:c1:be brd ff:ff:ff:ff:ff:ff
5: nlmon0: <NOARP,UP,LOWER_UP> mtu 3776 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1
link/netlink
But in our errant program, the RTA_OIF field was set to 0! I don’t think 0 is even a valid value for this field: no network interface has ID 0.
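A cheap guard against this kind of bug is to validate the interface index before it goes into a message. This is only an illustration, not the fix that went into the program I was debugging: check_oif and oif_for are hypothetical helper names built on the standard library's socket.if_indextoname and socket.if_nametoindex.

```python
import socket


def check_oif(index):
    """Return index if it refers to an existing interface, else raise ValueError."""
    if index <= 0:
        raise ValueError(f"invalid interface index {index}: indexes start at 1")
    try:
        socket.if_indextoname(index)  # raises OSError for unknown indexes
    except OSError as exc:
        raise ValueError(f"no interface with index {index}") from exc
    return index


def oif_for(ifname):
    """Look up the index for an interface name such as 'lo'."""
    try:
        return socket.if_nametoindex(ifname)
    except OSError as exc:
        raise ValueError(f"no interface named {ifname!r}") from exc
```

Failing loudly on index 0 would have turned a confusing kernel error into an obvious one in the calling code.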
pyroute2 is great
pyroute2 is really cool. If I wanted to write a quick script to understand what’s going on with my network interfaces & routes, I would 100% definitely try pyroute2. The project has a lot of great examples.
For example! If I want to run the equivalent of ip route add 172.16.5.0/24 via 127.0.0.1 dev lo, that’s:
from pyroute2 import IPRoute

ip = IPRoute()
# same route as the ip command above
ip.route('add',
         dst='172.16.5.0/24',
         gateway='127.0.0.1',
         oif=1)
ip.close()  # release the netlink socket
Super simple! oif=1 means the same as dev lo.
other ways to capture netlink messages: tcpdump + wireshark
You can also use tcpdump to capture netlink messages! Here’s how:
# create the network interface
sudo ip link add nlmon0 type nlmon
sudo ip link set dev nlmon0 up
sudo tcpdump -i nlmon0 -w netlink.pcap # capture your packets
wireshark netlink.pcap # look at the results with wireshark
I tried this but had trouble for a couple of reasons:
- It didn’t work for me on the server I was working on (though it works on my laptop now)
- I actually found it harder to work with than the strace method – it captured too many packets and I found it hard to filter them in Wireshark.
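One way around the filtering problem is to skip Wireshark and pick the route messages out of the capture with a short script. This is a rough sketch under several assumptions: route_messages is a hypothetical helper name, the file is a classic little-endian pcap (not pcapng) captured on an nlmon device, each packet starts with the 16-byte cooked header whose last two bytes hold the netlink family, and only the first netlink message in each packet is inspected.

```python
import struct

PCAP_MAGIC = 0xA1B2C3D4     # classic pcap with microsecond timestamps
LINKTYPE_NETLINK = 253      # link type of captures taken on an nlmon device
NETLINK_ROUTE = 0
RTM_NAMES = {24: "RTM_NEWROUTE", 25: "RTM_DELROUTE", 26: "RTM_GETROUTE"}


def route_messages(pcap_bytes):
    """Yield (packet_number, message_name) for route messages in the capture."""
    # Global header: magic, version major/minor, timezone, sigfigs, snaplen, linktype
    magic, _, _, _, _, _, linktype = struct.unpack_from("<IHHiIII", pcap_bytes, 0)
    if magic != PCAP_MAGIC or linktype != LINKTYPE_NETLINK:
        raise ValueError("expected a little-endian pcap file from an nlmon device")
    offset, number = 24, 0
    while offset + 16 <= len(pcap_bytes):
        # Per-packet header: ts_sec, ts_usec, captured length, original length
        _, _, incl_len, _ = struct.unpack_from("<IIII", pcap_bytes, offset)
        packet = pcap_bytes[offset + 16:offset + 16 + incl_len]
        offset += 16 + incl_len
        number += 1
        if len(packet) < 32:
            continue  # too short for cooked header + nlmsghdr
        (family,) = struct.unpack_from(">H", packet, 14)  # network byte order
        if family != NETLINK_ROUTE:
            continue
        _, msg_type = struct.unpack_from("<IH", packet, 16)  # nlmsghdr length, type
        if msg_type in RTM_NAMES:
            yield number, RTM_NAMES[msg_type]
```

With the capture from the commands above, usage would look like this:

```python
with open("netlink.pcap", "rb") as f:
    for number, name in route_messages(f.read()):
        print(number, name)
```

The packet numbers count from 1 in file order, which makes it easy to jump to the interesting packets in Wireshark afterwards.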
ip monitor
You can also run ip monitor and it’ll show you some netlink messages. When I run it, it prints out this stuff:
$ sudo ip monitor
fd68:29:f8f6::1 dev enp0s25 lladdr c4:6e:1f:95:d8:3e router STALE
fe80::c66e:1fff:fe95:d83e dev enp0s25 lladdr c4:6e:1f:95:d8:3e router STALE
192.168.1.144 dev enp0s25 lladdr 14:30:c6:ba:e4:6c STALE
192.168.1.144 dev enp0s25 lladdr 14:30:c6:ba:e4:6c PROBE
192.168.1.144 dev enp0s25 lladdr 14:30:c6:ba:e4:6c STALE
192.168.1.144 dev enp0s25 lladdr 14:30:c6:ba:e4:6c REACHABLE
fd68:29:f8f6::1 dev enp0s25 lladdr c4:6e:1f:95:d8:3e router PROBE
fd68:29:f8f6::1 dev enp0s25 lladdr c4:6e:1f:95:d8:3e router REACHABLE
fe80::c66e:1fff:fe95:d83e dev enp0s25 lladdr c4:6e:1f:95:d8:3e router PROBE
fe80::c66e:1fff:fe95:d83e dev enp0s25 lladdr c4:6e:1f:95:d8:3e router REACHABLE
It didn’t give me the information I wanted though.
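If you only care about route changes, you can subscribe to just that notification group yourself. This is a minimal sketch, assuming Linux and a little-endian machine: split_messages and watch_routes are hypothetical helper names, and RTMGRP_IPV4_ROUTE is the group value from <linux/rtnetlink.h>. Like ip monitor, it only sees notifications the kernel broadcasts after something changed, so a request that fails (like our route delete) would not show up here either.

```python
import socket
import struct

RTMGRP_IPV4_ROUTE = 0x40  # multicast group for IPv4 route changes
RTM_NAMES = {24: "RTM_NEWROUTE", 25: "RTM_DELROUTE"}


def split_messages(data):
    """Split one receive buffer into (message_type, payload) pairs."""
    offset = 0
    while offset + 16 <= len(data):
        length, msg_type = struct.unpack_from("=IH", data, offset)
        if length < 16:
            break  # malformed header, stop instead of looping forever
        yield msg_type, data[offset + 16:offset + length]
        offset += (length + 3) & ~3  # messages are aligned to 4 bytes


def watch_routes():
    """Print one line per route notification until interrupted (Linux only)."""
    with socket.socket(socket.AF_NETLINK, socket.SOCK_RAW, socket.NETLINK_ROUTE) as s:
        s.bind((0, RTMGRP_IPV4_ROUTE))  # address is (pid, groups)
        while True:
            for msg_type, payload in split_messages(s.recv(65535)):
                print(RTM_NAMES.get(msg_type, msg_type), len(payload), "payload bytes")
```

Call watch_routes() in one terminal and add or delete a route in another to see the notifications arrive.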
nltrace
There’s also nltrace (for instance nltrace ip route list), but in this case it didn’t give me the information I wanted. It’s not a maintained project but looks maybe useful!
that’s all!
It always makes me happy when I learn about a NEW LINUX THING during the course of my job. When I was in the middle of this I tweeted
kubernetes is cool but definitely not easy, my experience is definitely like “learn how all the networking works in excruciating detail”
which definitely feels true. It’s less like “set up networking and it works” and more like “pick a networking backend, wait a month, discover weird problems, strace it, learn things about netlink and what an RTA_OIF is, fix the bugs, eventually it works”. Maybe that isn’t everyone’s experience but that is my experience so far!