Today I did the opening keynote at SRECon. This talk was a little less technical than my normal talks: instead of talking about tools like tcpdump (though tcpdump makes an appearance!), I wanted to talk about how to make a career where you’re constantly learning and how to be good at your job whether or not you’re the most experienced person.
Here’s the talk abstract, then the slides & a rough transcript. I’ve included links to every resource I mentioned.
I don't always feel like a wizard. I'm not the most experienced member on my team, like most people I sometimes find my work difficult, and I still have a TON TO LEARN.
But along the way, I have learned a few ways to debug tricky problems, get the information I need from my colleagues, and get my job done. We're going to talk about
At the end, we'll have a better understanding of how you can get a lot of awesome stuff done even when you're not the highest level wizard on your team.
- how asking dumb questions is actually a superpower
- how you can read the source code to programs when all other avenues fail
- debugging tools that make you FEEL like a wizard
- and how understanding what your _organization_ needs can make you amazing
So you want to be a wizard
(this transcript is nowhere near totally faithful; there’s a fair amount of “what i meant to say / what I said, kind of, I think” in here :) )
You can click on any of the slides to see a bigger version.
And there's a seemingly neverending amount of new technology to learn about. For instance we're looking at Kubernetes, and to operate a Kubernetes cluster you need to operate etcd, which means that you need to understand a bunch of distributed systems concepts to make sure you're doing it right.
In reliability engineering, what lives below us is typically "systems stuff" like operating systems & networking. Above us is stuff like business requirements & the programs we're trying to make run reliably.
This talk is mostly going to be about understanding lower-level systems, but we're also going to talk a little about humans and how to make sure you're actually building the right thing :)
First, understanding jargon is really useful. If someone says "hey, this process got killed by the OOM killer" it's useful to know what that means! (we're going to talk about what an OOM killer is later)
second, it lets you debug harder problems. When I set up a web server (Apache) for the first time, maybe 8 years ago, I didn't understand the HTTP protocol very well and I didn't understand what many of the configuration options I was using meant exactly.
So I would normally debug by Googling things and trying random fixes. This was a pretty viable strategy at the time (I got my webservers working!) but today when I configure webservers, it's important for me to actually understand what I'm doing and exactly what effect I expect it to have. And now I can fix problems much more easily.
rachelbythebay is a great collection of debugging stories, and it's clear throughout that she has a really deep understanding of the systems she works with.
The last reason is -- having a solid understanding of the systems you work with lets you innovate. I think Docker is a cool example of this. Docker was not the first thing to ever use namespaces (one of the kernel features that people call "containers"), but in order to make a tool that people loved to use, the Docker developers had to have a really good understanding of exactly what features Linux has to support isolating processes from others.
This is incredibly useful -- in networking, if you know what a packet is and how it's put together, then it really helps to tackle other more complicated concepts.
Let me tell you a quick story about how I learned what a system call was.
When I got to RC, I learned about the concept of a "system call"! (here's the blog post I wrote the day I learned that). System calls are how applications talk to the operating system. I felt kind of sad that I didn't know about them before, but the important thing was that I learned it! That's exciting!
TCP is the protocol that runs a lot of the internet that we use day to day. Often it "just works" and you don't need to think about it, but sometimes, well, we do need to think about it! So it's helpful to understand the basics.
The way I started learning about TCP was, I wrote a TCP stack in Python! This was really fun, it didn't take that long, and I learned a ton by doing it and writing up what I learned.
I think doing this kind of experiment is awesome because servers run out of memory in production, and it's cool to see what that looks like and how to reason about it in a safer environment. (hint: if you run "dmesg" and search for "oom" it will show you OOM killer activity)
I learned SO MUCH by doing this -- you can read more about it here
As I learned more and came back to his writing, I was able to understand more of it! I'm still not a distributed systems expert, but I'm happy I tried to read these posts even when I didn't understand them well.
That Linux kernel development book I mentioned is kind of similar. Its goal is to give you the tools you need to become a Linux kernel developer. I am not a Linux kernel developer (or at least not yet!). But I've learned a few interesting things by reading this book.
It's useful for me to remember that I can learn something even when I'm doing work which is sort of routine.
As a small example -- recently we had a computer that was swapping even though it had 16GB of free memory. This did not match my mental model ("computers only swap memory to disk when they're out of memory"). Obviously there was something wrong with my model. So I investigated, and I learned a couple new things about how swap works on Linux!
1. You could be actually out of RAM.
2. You could be "mostly" out of RAM. The "vm.swappiness" sysctl setting controls how likely your machine is to swap. This isn't what was happening to us, though.
3. A cgroup could be out of RAM, which was what was happening to us at the time (here's the blog post I wrote about that).
4. There's also a 4th reason I learned about afterwards: if you have no swap, and your vm.overcommit_ratio is set to 50% (which is the default), you can end up in a situation where only half your RAM can be used. That's no good! here's a post about overcommit on Linux
I'm happy I dug in a bit because now I understand how this part of Linux works better!
In 2003, when I was 15, my mom bought me a shiny new computer. I was really excited about Linux, so I installed a ton of different Linux distributions. (also, thanks to my mom!! I'm super lucky to have had a computer that I could bork repeatedly and the time & space to do tons of experiments)
Around 2009, in university, I was one of the sysadmins for a small lab of 7 Linux & Windows computers. The old sysadmin said "hey, want to help out?", gave me the root password, and we muddled our way through getting the computers to work for a bunch of math undergrads.
In 2013 I learned what a system call is and a bunch of basic things about operating systems! This was super awesome. (here's everything I learned at the Recurse Center
< And now I'm still continuing to learn.
Asking good questions is really important because people in general cannot just magically guess what I want them to tell me.
Instead, I instead try to remember to ask a less experienced person, who I think will still know the answer.
So not asking the most experienced person is actually a cool way to show trust in less experienced team members, reduce the bus factor, and spread knowledge around.
And often they have trouble remembering all the details to document them! So I like to (right after someone did something) to ask them to explain exactly what they did, or to ask if I can watch while they do it.
In my first job, I was writing plugins to make websites with Drupal, a PHP content management system. Once I remember I had a really specific question about how some Drupal thing worked. It wasn't documented, and there were no results on Google when I looked.
I asked my boss at the time if he knew and he told me "julia, you just have to go read the code and find out how it works!". I was a bit unsure about how to approach it ("there's so much code") but he pointed me to the relevant part of the Drupal codebase, and, sure enough, I could see the answer to my question there!
Since then I've looked at the code for a bunch of large open source questions to answer questions (nginx! linux!) and even if I'm not a super good C programmer, sometimes I can figure out the answer to my question.
We found out that the client would send the HTTP headers, wait 40ms, and then send the rest of the request. So the server wasn't the problem at all! But why was the client doing this? It's written in Ruby, and initially I maybe thought we should just blame Ruby, but that wasn't a really good reason (40ms is a very long time, even in Ruby).
So they were stuck in this kind of passive-aggressive-waiting situation.
I wrote a blog post about this called Why you should understand (a little) about TCP if you want to know more.
But we were processing a relatively small number of records, and it was taking 15 hours to do it, and it was NOT REASONABLE and I knew that the job was too slow. And I figured it out, and now it’s faster and everyone is happy.
So I decided I wanted to take some time to train my intuitions about how fast different computer operations should be.
The goal isn't to know exactly, but I think it's useful to be right up to an order of magnitude. So can you do it 100 times in a second? 10,000? 10 million times?
When I started out, I didn't have very good tools! But now I know about all kinds of profilers! I know about strace, and tcpdump, and way more tools for figuring out what's going on. It makes a huge difference.
But these days, when I run into a mysterious bug, I think it's kind of fun! I get to improve my understanding of the systems I work with, which is awesome!
But since then, I've learned to find them really useful! (learning to like design documents)
One thing I learned is that it's helpful at first to just share a design I'm working with a few people. Like I'll show it to a couple of other people on my team, see what they think, and then make changes! It's not always necessary to ask every single person who might have an opinion what they think.
It took me maybe 45 minutes to write up the plan (super fast!), I showed it to a manager on another team, he had a couple of things he asked me to do differently, and he was SO HAPPY I'd written down a plan so that he understood what was going on. Awesome!
The small project went super smoothly and I was really happly I wrote up a thing about it first.
This is great because it forces me to articulate why the project is important (why did we spend all that time on it), how it's going to impact other teams and what other people in the organization need to know about, and how we know that we actually met our goals for the project
The last thing is really important -- more often than I'd like to admit, I get to the end of a project and realize I'm not quite sure how we can tell whether the project is actually going to improve things or not. Planning that out at the beginning helps make sure that we put in the right metrics!
But this doesn't mean it's not worth designing at all! I like writing down my assumptions explicitly because when things do change, I can go back and see what assumptions we had are no longer true, and make sure that we update everything we need to update. Having a record of changes is useful!
I usually find it possible to stay motivated if I can remember "ok, I'm spending hours working on configuring nginx, and this is boring, but it's in service of this really cool goal!"
But if I *don't* remember the goal (or what I'm working on actually doesn't make sense), it sucks.
I'm trying to get better at saying "okay!!! this project! it has some slow and difficult pieces, so why is it so important? why are we going to feel awesome when it's done? which parts are the most important?"
But sometimes the task I'm being given is only maybe 80% thought through, and when I go to understand the exact reason for doing it, it turns out that we don't need to do it at all! (or maybe we should actually be doing something completely different!)
> I’ve yet to find the perfect job or thing to work on, but I have found a way to live a more meaningful life in tech.
> I now put people first. Regardless of the technology involved I gravitate towards helping people.
> People provide a much better feedback loop than computers or the abstract idea of a business.
> Everything I work on has a specific person or group of people in mind; this is what gives my work meaning; solving problems is not enough.