Monitoring tiny web services

Hello! I’ve started to run a few more servers recently (nginx playground, mess with dns, dns lookup), so I’ve been thinking about monitoring.

It wasn’t initially totally obvious to me how to monitor these websites, so I wanted to quickly write up how I did it.

I’m not going to talk about how to monitor Big Serious Mission Critical websites at all, only tiny unimportant websites.

goal: spend approximately 0 time on operations

I want the sites to mostly work, but I also want to spend approximately 0% of my time on the ongoing operations.

I was initially very wary of running servers at all because at my last job I was on a 24/7 oncall rotation for some critical services, and in my mind “being responsible for servers” meant “get woken up at 2am to fix the servers” and “have lots of complicated dashboards”.

So for a while I only made static websites so that I wouldn’t have to think about servers.

But eventually I realized that any server I was going to write was going to be very low stakes (if it occasionally goes down for 2 hours, it’s no big deal), and that I could just set up some very simple monitoring to help keep it running.

not having monitoring sucks

At first I didn’t set up any monitoring for my servers at all. This had the extremely predictable outcome of – sometimes the site broke, and I didn’t find out about it until somebody told me!

step 1: an uptime checker

The first step was to set up an uptime checker. There are tons of these out there; the ones I’m using right now are updown.io and uptime robot. I like updown’s user interface and pricing structure more (it’s per request instead of a monthly fee), but uptime robot has a more generous free tier.

These

check that the site is up
if it goes down, it emails me

I find that email notifications are a good level for me: I’ll find out pretty quickly if the site goes down, but it doesn’t wake me up or anything.
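To make “check that the site is up, and email me if it goes down” concrete, here’s a minimal sketch in Go of the core loop of an uptime checker. This is not how updown.io or uptime robot are implemented; `isUp`, `shouldNotify`, and the local stand-in server are all made up for illustration, and the “email” is just a print statement.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// client has a timeout so that a hung site counts as down instead of
// hanging the checker. (5 seconds is an arbitrary choice.)
var client = &http.Client{Timeout: 5 * time.Second}

// isUp reports whether a GET request to url returns 200 OK.
func isUp(url string) bool {
	resp, err := client.Get(url)
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}

// shouldNotify reports whether to send an alert: only when the site
// goes from up to down, so a long outage doesn't send one email per check.
func shouldNotify(wasUp, upNow bool) bool {
	return wasUp && !upNow
}

func main() {
	// A local stand-in for a real site, so this runs without network access.
	site := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	}))
	url := site.URL

	wasUp := true
	for i := 0; i < 2; i++ {
		if i == 1 {
			site.Close() // simulate the site going down before the second check
		}
		up := isUp(url)
		if shouldNotify(wasUp, up) {
			fmt.Println("site is down:", url) // a real checker would send an email here
		}
		wasUp = up
		// a real checker would wait between checks, e.g. time.Sleep(time.Hour)
	}
}
```

Only notifying on the up-to-down transition is a design choice in this sketch, not something the services above necessarily do.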

step 2: an end-to-end healthcheck

Next, let’s talk about what “check that the site is up” actually means.

At first I just made one of my healthcheck endpoints a function that returned 200 OK no matter what.

This is kind of useful – it told me that the server was on!

But unsurprisingly I ran into problems because it wasn’t checking that the API was actually working – sometimes the healthcheck succeeded even though the rest of the service had actually gotten into a bad state.
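Here’s a small sketch of that failure mode. It’s not the real code from any of my services: `alwaysOK`, `brokenAPI`, and `statusOf` are hypothetical names, and `brokenAPI` just stands in for an endpoint that has gotten into a bad state.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// alwaysOK is a healthcheck that only proves the process is answering requests.
func alwaysOK(w http.ResponseWriter, r *http.Request) {
	w.WriteHeader(http.StatusOK)
}

// brokenAPI stands in for a real endpoint that has gotten into a bad state.
func brokenAPI(w http.ResponseWriter, r *http.Request) {
	http.Error(w, "something is wrong", http.StatusInternalServerError)
}

// statusOf calls a handler directly and returns the status code it wrote.
func statusOf(h http.HandlerFunc, method, path string) int {
	rec := httptest.NewRecorder()
	h(rec, httptest.NewRequest(method, path, nil))
	return rec.Code
}

func main() {
	// The healthcheck says everything is fine while the API is failing.
	fmt.Println("healthcheck status:", statusOf(alwaysOK, "GET", "/health"))
	fmt.Println("real API status:", statusOf(brokenAPI, "POST", "/"))
}
```

The healthcheck and the API don’t share any code path, so the healthcheck can’t notice when the API breaks.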

So I updated it to actually make a real API request and make sure it succeeded.

All of my services do very few things (the nginx playground has just 1 endpoint), so it’s pretty easy to set up a healthcheck that actually runs through most of the actions the service is supposed to do.

Here’s what the end-to-end healthcheck handler for the nginx playground looks like. It’s very basic: it just makes another POST request (to itself) and checks if that request succeeds or fails.

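import (
	"log"
	"net/http"
	"strings"
	"time"
)

// healthHandler makes a real POST request to the service itself
// (localhost:8080) with `healthcheckJSON` (defined elsewhere) as the body.
// If that request works, it returns 200. If it doesn't, it returns 500.
func healthHandler(w http.ResponseWriter, r *http.Request) {
	// the timeout makes a hung service fail the healthcheck instead of
	// hanging it (10 seconds is an arbitrary choice)
	client := http.Client{Timeout: 10 * time.Second}
	resp, err := client.Post("http://localhost:8080/", "application/json", strings.NewReader(healthcheckJSON))
	if err != nil {
		log.Println(err)
		w.WriteHeader(http.StatusInternalServerError)
		return
	}
	// close the body so the connection isn't leaked
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		log.Println(resp.StatusCode)
		w.WriteHeader(http.StatusInternalServerError)
		return
	}
	w.WriteHeader(http.StatusOK)
}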
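To see a handler like this actually tell “working” apart from “broken”, here’s a self-contained sketch that runs the same idea against a stand-in API. The names `newHealthHandler` and `healthStatus`, the `{}` request body, and the stand-in server are all assumptions for the example, not the playground’s real code.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
	"time"
)

// newHealthHandler returns a healthcheck handler that POSTs body to target
// and reports 200 only if that request comes back 200 OK.
func newHealthHandler(target, body string) http.HandlerFunc {
	client := &http.Client{Timeout: 5 * time.Second}
	return func(w http.ResponseWriter, r *http.Request) {
		resp, err := client.Post(target, "application/json", strings.NewReader(body))
		if err != nil {
			w.WriteHeader(http.StatusInternalServerError)
			return
		}
		defer resp.Body.Close()
		if resp.StatusCode != http.StatusOK {
			w.WriteHeader(http.StatusInternalServerError)
			return
		}
		w.WriteHeader(http.StatusOK)
	}
}

// healthStatus runs the healthcheck against a stand-in API that always
// answers with apiStatus, and returns what the healthcheck reported.
func healthStatus(apiStatus int) int {
	api := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(apiStatus)
	}))
	defer api.Close()

	rec := httptest.NewRecorder()
	newHealthHandler(api.URL, `{}`)(rec, httptest.NewRequest("GET", "/health", nil))
	return rec.Code
}

func main() {
	fmt.Println("API healthy, healthcheck says:", healthStatus(http.StatusOK))
	fmt.Println("API broken, healthcheck says:", healthStatus(http.StatusInternalServerError))
}
```

Passing the target URL in as a parameter is only there to make the handler easy to point at a test server; hardcoding localhost works just as well for a tiny service.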

healthcheck frequency: hourly

Right now I’m running most of my healthchecks every hour, and some every 30 minutes.

I run them hourly because updown.io’s pricing is per healthcheck, I’m monitoring 18 different URLs, and I wanted to keep my healthcheck budget pretty minimal at $5/year.

Taking an hour to find out that one of these websites has gone down seems ok to me – if there is a problem there’s no guarantee I’ll get to fixing it all that quickly anyway.

If it were free to run them more often I’d probably run them every 5-10 minutes instead.
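Since the pricing is per request, the check interval directly sets how many requests get used up. Here’s a quick sketch of that arithmetic for 18 URLs (`requestsPerYear` is a made-up helper, and it assumes a 365-day year and the same interval for every URL).

```go
package main

import "fmt"

// requestsPerYear returns how many healthcheck requests are made in a
// 365-day year when each of urls URLs is checked every intervalMinutes.
func requestsPerYear(urls, intervalMinutes int) int {
	const minutesPerYear = 365 * 24 * 60
	return urls * (minutesPerYear / intervalMinutes)
}

func main() {
	for _, minutes := range []int{60, 30, 5} {
		fmt.Printf("every %2d minutes: %d requests/year\n", minutes, requestsPerYear(18, minutes))
	}
}
```

Hourly checks on 18 URLs come to 157,680 requests a year, and checking every 5 minutes would be 12 times that (1,892,160).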

step 3: automatically restart if the healthcheck fails

Some of my websites are on fly.io, and fly has a pretty standard feature where I can configure an HTTP healthcheck for a service and restart the service if the healthcheck starts failing.

“Restart a lot” is a very useful strategy to paper over bugs that I haven’t gotten around to fixing yet – for a while the nginx playground had a process leak where nginx processes weren’t getting terminated, so the server kept running out of RAM.

With the healthcheck, the result of this was that every day or so, this would happen:

the server ran out of RAM
the healthcheck started failing
it got restarted
everything was fine again
repeat the whole saga again some number of hours later

Eventually I got around to actually fixing the process leak, but it was nice to have a workaround in place that could keep things running while I was procrastinating fixing the bug.

The healthchecks that decide whether to restart the service run more often: every 5 minutes or so.
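The “restart when the healthcheck fails” cycle can be sketched as a tiny supervisor loop. To be clear, this is not how fly.io implements it: the `supervisor` type, the consecutive-failure threshold, and the fake leaky service in `simulate` are all assumptions made up for this example.

```go
package main

import "fmt"

// supervisor restarts a service once its healthcheck has failed
// threshold times in a row.
type supervisor struct {
	healthy   func() bool // runs one healthcheck
	restart   func()      // restarts the service
	threshold int         // consecutive failures needed before restarting
	failures  int         // current run of consecutive failures
}

// tick runs one healthcheck and restarts the service if it has now failed
// threshold times in a row. It reports whether a restart happened.
func (s *supervisor) tick() bool {
	if s.healthy() {
		s.failures = 0
		return false
	}
	s.failures++
	if s.failures < s.threshold {
		return false
	}
	s.restart()
	s.failures = 0
	return true
}

// simulate runs a fake service that "runs out of RAM" (starts failing its
// healthcheck) after leakLimit checks since its last restart, and returns
// how many times it got restarted over the given number of ticks.
func simulate(ticks, leakLimit, threshold int) int {
	age, restarts := 0, 0
	s := &supervisor{
		healthy:   func() bool { age++; return age <= leakLimit },
		restart:   func() { age = 0; restarts++ },
		threshold: threshold,
	}
	for i := 0; i < ticks; i++ {
		s.tick()
	}
	return restarts
}

func main() {
	fmt.Println("restarts:", simulate(20, 3, 2))
}
```

In a real setup `tick` would be called on a timer (every 5 minutes, say), `healthy` would make the HTTP healthcheck request, and `restart` would restart the process.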

this is not the best way to monitor Big Services

This is probably obvious and I said this already at the beginning, but “write one HTTP healthcheck” is not the best approach for monitoring a large complex service. But I won’t go into that because that’s not what this post is about.

it’s been working well so far!

I originally wrote this post 3 months ago in April, but I waited until now to publish it to make sure that the whole setup was working.

It’s made a pretty big difference – before I was having some very silly downtime problems, and now for the last few months the sites have been up 99.95% of the time!
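For a sense of scale, here’s what 99.95% uptime works out to in minutes of downtime. This is just arithmetic on that percentage, not measured data; `downtimeMinutes` is a made-up helper and the 30-day and 90-day windows are arbitrary.

```go
package main

import "fmt"

// downtimeMinutes returns how many minutes of downtime an uptime
// percentage allows over the given number of days.
func downtimeMinutes(uptimePercent float64, days int) float64 {
	return (100 - uptimePercent) / 100 * float64(days) * 24 * 60
}

func main() {
	fmt.Printf("30 days: %.1f minutes\n", downtimeMinutes(99.95, 30))
	fmt.Printf("90 days: %.1f minutes\n", downtimeMinutes(99.95, 90))
}
```

So 99.95% is roughly 22 minutes of downtime in a 30-day month, or about an hour over 3 months.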