Some ways DNS can break

When I first learned about it, DNS didn’t seem like it should be THAT complicated. Like, there are DNS records, they’re stored on a server, what’s the big deal?

But with DNS, reading about how it works in a textbook doesn’t prepare you for the sheer volume of different ways DNS can break your system in practice. It’s not just caching problems!

So I asked people on Twitter for examples of DNS problems they’ve run into, especially DNS problems that didn’t initially appear to be DNS problems (hence the popular “it’s always DNS” meme).

I’m not going to discuss how to solve or avoid any of these problems in this post, but I’ve linked to webpages discussing the problem where I could find them.

problem: slow network requests

Your network requests are a little bit slower than expected, and it’s actually because your DNS resolver is slow for some reason. This might be because the resolver is under a lot of load, or it has a memory leak, or something else.

I’ve run into this before with my router’s DNS forwarder – all of my DNS requests were slow, and I restarted my router and that fixed the problem.
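One way to check whether name resolution is the slow part is to time the lookup on its own, separately from the rest of the request. Here’s a minimal sketch in Python. The `timed_lookup` helper and its `resolve` parameter are names I made up for illustration; by default it calls the standard library’s `socket.getaddrinfo`, which uses the system resolver.

```python
import socket
import time


def timed_lookup(hostname, resolve=socket.getaddrinfo):
    """Return (seconds_elapsed, result_or_exception) for one name lookup.

    `resolve` is injectable (a hypothetical convenience) so the timing
    logic can be exercised without touching the network.
    """
    start = time.perf_counter()
    try:
        result = resolve(hostname, None)
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        result = exc
    return time.perf_counter() - start, result
```

Calling `timed_lookup("example.com")` a few times in a row gives a rough idea of how long the resolver takes. Keep in mind that a cache somewhere along the way can make repeat lookups look much faster than the first one.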

problem: DNS timeouts

A couple of people mentioned network requests that were taking 2+ seconds or 30 seconds because of DNS queries that were timing out. This is sort of the same as “slow requests”, but it’s worse because queries can take several seconds to time out.

Sophie Haskins has a great blog post, “Misadventures with Kube DNS”, about DNS timeouts with Kubernetes.

problem: ndots

A few people mentioned a specific issue where Kubernetes sets ndots:5 in its /etc/resolv.conf.

Here’s an example /etc/resolv.conf, taken from the post “Kubernetes pods /etc/resolv.conf ndots:5 option and why it may negatively affect your application performances”:

nameserver 100.64.0.10
search namespace.svc.cluster.local svc.cluster.local cluster.local eu-west-1.compute.internal
options ndots:5

My understanding is that if this is your /etc/resolv.conf and you look up google.com, your application will call the C getaddrinfo function, and getaddrinfo will:

look up google.com.namespace.svc.cluster.local.
look up google.com.svc.cluster.local.
look up google.com.cluster.local.
look up google.com.eu-west-1.compute.internal.
look up google.com.

Basically it checks if google.com is actually a subdomain of everything on the search line.

So every time you look up a name like google.com (one with fewer than 5 dots), you need to wait for 4 DNS queries to fail before you get to the one real DNS query that succeeds.
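To make the search-list behaviour concrete, here’s a small Python sketch that generates the list of names tried for a lookup. It’s based on my reading of how the `search` and `ndots` options are described in the resolv.conf man page, so treat it as an approximation and not as what any particular libc actually does. The function name `candidate_names` is made up.

```python
def candidate_names(name, search, ndots=1):
    """Return the fully qualified names a stub resolver would try, in order.

    Assumed rules (a simplification of the resolv.conf documentation):
    - a name ending in "." is already absolute and is tried as-is only
    - a name with at least `ndots` dots is tried as-is first
    - otherwise every search domain is tried first, and the bare name last
    """
    if name.endswith("."):
        return [name]
    with_search = [f"{name}.{domain}." for domain in search]
    as_is = [name + "."]
    if name.count(".") >= ndots:
        return as_is + with_search
    return with_search + as_is


search = [
    "namespace.svc.cluster.local",
    "svc.cluster.local",
    "cluster.local",
    "eu-west-1.compute.internal",
]
for candidate in candidate_names("google.com", search, ndots=5):
    print(candidate)
```

With the default `ndots=1`, the same lookup would try `google.com.` first. Writing the name with a trailing dot (`google.com.`) skips the search list entirely in this model.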

problem: it’s hard to tell what DNS resolver(s) your system is using

This isn’t a bug by itself, but when you run into a problem with DNS, often it’s related in some way to your DNS resolver. I don’t know of any foolproof way to tell what DNS resolver is being used.

A few things I know:

on Linux, I think that most things use /etc/resolv.conf to choose a DNS resolver. There are definitely exceptions though, for example your browser might ignore /etc/resolv.conf and use a different DNS-over-HTTPS service instead.
if you’re using UDP DNS, you can use sudo tcpdump port 53 to see where DNS requests are being sent. This doesn’t work if you’re using DNS over HTTPS or DNS over TLS though.

I also vaguely remember it being even more confusing on macOS than on Linux, though I don’t know why.
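For the /etc/resolv.conf case, here’s a rough Python sketch that pulls the `nameserver`, `search`, and `options` lines out of the file’s text. The `parse_resolv_conf` function is a made-up helper. It only handles those three keywords, and it can’t tell you about anything that bypasses the file (like a browser using DNS over HTTPS).

```python
def parse_resolv_conf(text):
    """Extract nameservers, search domains and options from resolv.conf text."""
    config = {"nameservers": [], "search": [], "options": []}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and comment lines (which start with '#' or ';').
        if not line or line[0] in "#;":
            continue
        keyword, *values = line.split()
        if keyword == "nameserver" and values:
            config["nameservers"].append(values[0])
        elif keyword == "search":
            # Assumption: a later search line replaces an earlier one.
            config["search"] = values
        elif keyword == "options":
            config["options"].extend(values)
    return config
```

On a Linux machine you could call it as `parse_resolv_conf(open("/etc/resolv.conf").read())`. If the only nameserver listed is a local address like 127.0.0.53, there’s probably a local forwarder in between, and the real upstream resolver is configured somewhere else.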

problem: DNS servers that return NXDOMAIN instead of NOERROR

Here’s a problem that I ran into once, where nginx couldn’t resolve a domain.

I set up nginx to use a specific DNS server to resolve DNS queries
when visiting the domain, nginx made 2 queries, one for an A record, and one for an AAAA record
the DNS server returned a NXDOMAIN reply for the A query
nginx decided “ok, that domain doesn’t exist”, and gave up
the DNS server returned a successful reply for the AAAA query
nginx ignored the AAAA record because it had already given up

The problem was that the DNS server should have returned NOERROR – that domain did exist, it was just that there weren’t any A records for it. I reported the bug, they fixed it, and that fixed the problem.

I’ve implemented this bug myself too, so I understand why it happens – it’s easy to think “there aren’t any records for this query, I should return an NXDOMAIN error”.
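Here’s a tiny Python sketch of the distinction, using a made-up in-memory zone (a dict from name to record type to values). The `answer` function and the zone layout are purely illustrative. A real server also has to think about CNAMEs, wildcards, and names that exist only because they have subdomains.

```python
def answer(zone, name, rtype):
    """Return (rcode, records) for a query against a toy in-memory zone."""
    if name not in zone:
        # The name has no records of any type: this is the NXDOMAIN case.
        return "NXDOMAIN", []
    # The name exists. If it has no records of this type, the right reply
    # is NOERROR with an empty answer, not NXDOMAIN.
    return "NOERROR", zone[name].get(rtype, [])


# Hypothetical zone: the name only has an AAAA record.
zone = {"example.com": {"AAAA": ["2001:db8::1"]}}
print(answer(zone, "example.com", "A"))
print(answer(zone, "example.com", "AAAA"))
print(answer(zone, "missing.example.com", "A"))
```

The buggy version is the one that checks “are there any records matching this exact query?” and returns NXDOMAIN when the list is empty.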

problem: negative DNS caching

If you visit a domain before creating a DNS record for it, the absence of the record will be cached. This is very surprising the first time you run into it. I only learned about this last year!

The TTL for the negative cache entry comes from the domain’s SOA record (as I understand it, the smaller of the SOA record’s own TTL and its “minimum” field). For example, for jvns.ca it’s an hour.
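Here’s a minimal Python sketch of what a negative cache does, to show why the “missing” answer sticks around. The `NegativeCache` class and its method names are invented for this example, and the clock is injectable so you can see the expiry without waiting an hour. Real resolvers are a lot more involved.

```python
import time


class NegativeCache:
    """Remembers that a (name, record type) pair did not exist, for a TTL."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._expires = {}  # (name, rtype) -> time when the entry expires

    def remember_missing(self, name, rtype, ttl):
        self._expires[(name, rtype)] = self._clock() + ttl

    def is_known_missing(self, name, rtype):
        expires = self._expires.get((name, rtype))
        if expires is None:
            return False
        if self._clock() >= expires:
            # The negative answer has expired, so forget it.
            del self._expires[(name, rtype)]
            return False
        return True
```

As long as `is_known_missing` returns `True`, a resolver built like this answers “doesn’t exist” from its cache without asking the authoritative server again, even if you’ve created the record in the meantime.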

problem: nginx caching DNS records forever

If you put this in your nginx config:

location / {
    proxy_pass https://some.domain.com;
}

then nginx will resolve some.domain.com once on startup and never again. This is especially dangerous if the IP address for some.domain.com changes infrequently, because it might keep happily working for months and then suddenly break at 2am one day.

There are pretty well-known ways to fix this and this post isn’t about nginx so I won’t get into it, but it’s surprising the first time you run into it.

Here’s a blog post with a story of how this happened to someone with an AWS load balancer.

problem: Java caching DNS records forever

Same thing, but for Java: apparently, depending on how you configure Java, “the JVM default TTL [might be] set so that it will never refresh DNS entries until the JVM is restarted.”

I haven’t run into this myself but I asked a friend about it who writes more Java than me and they told me that it’s happened to them.

Of course, literally any software could have this problem of caching DNS records forever, but the main cases I’ve heard of in practice are nginx and Java.

problem: that entry in /etc/hosts you forgot about

Another variant on caching issues: entries in /etc/hosts that override your usual DNS settings!

This is extra confusing because dig ignores /etc/hosts, so everything SEEMS like it should be fine ("dig whatever.com is working!").
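Here’s a rough Python sketch of why the two can disagree. It assumes the common setup where the system’s lookup function checks the hosts file first and only then asks DNS, while a tool like dig goes straight to DNS. The `parse_hosts` and `resolve` functions and the `dns_lookup` callback are all made-up names for illustration.

```python
def parse_hosts(text):
    """Map each hostname in hosts-file text to its first listed address."""
    table = {}
    for line in text.splitlines():
        fields = line.split("#", 1)[0].split()  # drop comments
        if len(fields) < 2:
            continue
        address, *names = fields
        for name in names:
            table.setdefault(name.lower(), address)
    return table


def resolve(name, hosts_table, dns_lookup):
    """Check the hosts table first, then fall back to DNS (assumed order)."""
    override = hosts_table.get(name.lower())
    if override is not None:
        return override
    return dns_lookup(name)
```

With an old line like `10.0.0.5 whatever.com` in the hosts text, `resolve` returns 10.0.0.5 no matter what DNS says, and `dns_lookup` never even gets called.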

problem: your email isn’t being sent / is going to spam

The way email is sent and validated is through DNS (MX records, SPF records, DKIM records), so a lot of email problems are DNS problems.

problem: internationalized domain names don’t work

You can register domain names with non-ASCII characters or emoji like https://💩.la.

The way this works with DNS is that 💩.la gets translated into xn--ls8h.la with an encoding called “punycode”.
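You can reproduce that translation with Python’s built-in `punycode` codec. This is a simplified sketch: the `to_ascii` helper is a made-up name, and it skips the normalization and validation steps that the full IDNA standards require, so it only shows the encoding step.

```python
def to_ascii(domain):
    """Encode each non-ASCII label of a domain as 'xn--' plus punycode."""
    labels = []
    for label in domain.split("."):
        if label.isascii():
            labels.append(label)
        else:
            encoded = label.encode("punycode").decode("ascii")
            labels.append("xn--" + encoded)
    return ".".join(labels)


print(to_ascii("💩.la"))
```

The `xn--` prefix is what tells software that the rest of the label is punycode and not a regular ASCII name.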

But even though there’s a clear standard for how they should work with DNS, a lot of software doesn’t handle internationalized domain names well! There’s a fun story about this in Julian Squires’ great talk “The emoji that Killed Chrome!!”.

problem: TCP DNS is blocked by a firewall

A couple of people mentioned that some firewalls allow UDP port 53 but not TCP port 53. But DNS queries with large responses need to use TCP port 53, so this can cause weird intermittent problems that are hard to debug.

problem: musl doesn’t support TCP DNS

A lot of applications use libc’s getaddrinfo to make DNS queries. musl is an alternative to glibc that’s used in Alpine Linux Docker containers, and it doesn’t support TCP DNS. This can cause problems if you make DNS queries where the response would be too big to fit inside a regular DNS UDP packet (512 bytes).

I’m still a bit fuzzy on this so I might have it wrong, but my understanding of how this can break is:

musl’s getaddrinfo makes a DNS query
the DNS server notices that the response is too big to fit in a single DNS response packet
the DNS server returns an empty truncated response, expecting that the client will retry by making a TCP DNS query
musl does not support TCP so it does not retry
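The steps above can be sketched in Python. The “truncated” signal is the TC bit in the flags field of the 12-byte DNS header. The `lookup` function and its `send_udp` / `send_tcp` callbacks are hypothetical stand-ins for real network code, so this only illustrates the control flow.

```python
import struct

TC_FLAG = 0x0200  # "truncated" bit in the DNS header flags field


def is_truncated(response):
    """Return True if a raw DNS response has the TC (truncated) bit set."""
    if len(response) < 12:
        raise ValueError("a DNS header is 12 bytes; response is too short")
    _query_id, flags = struct.unpack("!HH", response[:4])
    return bool(flags & TC_FLAG)


def lookup(query, send_udp, send_tcp=None):
    """Try UDP first, and retry over TCP if the UDP answer was truncated."""
    response = send_udp(query)
    if not is_truncated(response):
        return response
    if send_tcp is None:
        # No TCP support: the truncated (possibly empty) UDP answer is all
        # the caller gets, which is the failure mode described above.
        return response
    return send_tcp(query)
```

The same control flow shows what happens when a firewall blocks TCP port 53: the retry is attempted, but `send_tcp` fails or times out.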

A blog post about this: “DNS resolution issue in Alpine Linux”

problem: round robin DNS doesn’t work with `getaddrinfo`

One way you could approach load balancing is to use “round robin DNS”. The idea is that every time you make a DNS query, you get a different IP address. Apparently this works if you use gethostbyname to make DNS queries, but it does not work if you use getaddrinfo because getaddrinfo sorts the IP responses it receives.

So you could run into an upsetting problem if you switch from gethostbyname to getaddrinfo behind the scenes without realising that this will break your DNS load balancing.

This is especially insidious because you might not realize that you’re switching from gethostbyname to getaddrinfo at all – if you’re not writing a C program, those function calls are hidden inside some library. So it could be part of a seemingly innocuous upgrade.
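Here’s a toy Python simulation of why sorting defeats round robin. The server side rotates the list of addresses on every query, and the client connects to the first address it’s given. Big caveat: the plain `sorted` call is just a stand-in for getaddrinfo’s real address-ordering rules, which are more complicated. All the names here are invented for the example.

```python
from collections import deque


def round_robin_answers(addresses, queries):
    """Yield the answer list for each query, rotated by one each time."""
    ring = deque(addresses)
    for _ in range(queries):
        yield list(ring)
        ring.rotate(-1)


def first_choice(answers, sort_first=False):
    """Pick the address a client would connect to first."""
    # sorted() is only a stand-in for getaddrinfo's ordering rules.
    ordered = sorted(answers) if sort_first else answers
    return ordered[0]


addresses = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
unsorted_picks = [first_choice(a) for a in round_robin_answers(addresses, 3)]
sorted_picks = [first_choice(a, sort_first=True)
                for a in round_robin_answers(addresses, 3)]
print(unsorted_picks)
print(sorted_picks)
```

Without sorting, the three lookups land on three different servers. With sorting, every lookup picks the same address, so one server gets all the traffic.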

problem: a race condition when starting a service

A problem someone mentioned with Kubernetes DNS: they had 2 containers which started simultaneously and immediately tried to resolve each other. But the DNS lookup failed because the Kubernetes DNS change hadn’t happened yet, and then the failure was cached so it kept failing.

that’s all!

I’ve definitely missed some important DNS problems here, so I’d love to hear what I’ve missed. I’d also love links to blog posts that write up examples of these problems – I think it’s really useful to see how the problem specifically manifests in practice and how people debugged it.