Julia Evans

How gzip uses Huffman coding

in programming

I wrote a blog post quite a while ago called gzip + poetry = awesome where I talked about how the gzip compression program uses the LZ77 algorithm to identify repetitions in a piece of text.

In case you don’t know what LZ77 is (I sure didn’t), here’s the video from that post that gives you an example of gzip identifying repetitions in a poem!

I thought this was a great demonstration, but it’s only half the story about how gzip works, and it’s taken me until now to write the rest of it down. So! Without further ado, let’s say we’re compressing this text:

a tapping, as of someone gently 
r{apping},{ rapping}
at my chamber door
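The braces above mark the pieces gzip can replace with references back to earlier text. As a toy illustration of that LZ77 half (nothing like gzip's real implementation; only the 32 KB window size is borrowed from DEFLATE), here's a Python sketch that turns repeats into (distance, length) back-references and then expands them again:

```python
# Toy LZ77: scan left to right; whenever the upcoming text repeats
# something in the 32KB window behind us, emit a (distance, length)
# back-reference instead of the literal characters.
def lz77_tokens(text, min_match=3):
    tokens, i = [], 0
    while i < len(text):
        best_len, best_dist = 0, 0
        for j in range(max(0, i - 32768), i):
            length = 0
            while i + length < len(text) and text[j + length] == text[i + length]:
                length += 1
            if length > best_len:
                best_len, best_dist = length, i - j
        if best_len >= min_match:
            tokens.append((best_dist, best_len))  # a repeat, like {apping,}
            i += best_len
        else:
            tokens.append(text[i])  # literal character
            i += 1
    return tokens

def lz77_decode(tokens):
    out = []
    for t in tokens:
        if isinstance(t, tuple):  # back-reference: copy from earlier output
            dist, length = t
            for _ in range(length):
                out.append(out[-dist])
        else:
            out.append(t)
    return "".join(out)

poem = "a tapping, as of someone gently rapping, rapping at my chamber door"
tokens = lz77_tokens(poem)
assert lz77_decode(tokens) == poem  # the tokens round-trip back to the poem
```

Real gzip then Huffman-codes the literals and the (distance, length) pairs, which is the other half of the story.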

How I learned to program in 10 years

The other day someone emailed me asking for advice on learning to program. He said he was planning to use “Learn Python the Hard Way”, which I understand is an excellent tutorial. I don’t generally like giving advice, because I think it’s way too easy to generalize your own experience into advice and get it wrong.

But then I found I had something to say! I thought a little bit about this blog, and how it might look like I sprang from the womb being excited about strace or thinking about how to build an operating system, or whatever. But in fact I spent about 10 years learning to program before going to Hacker School, mostly at fairly low speeds.

I drew a totally unscientific graph about what that felt like:

A/A testing

in machinelearning

Thursday evening I organized a meetup called Data Science: Inconvenient Truths where the amazing Diyang Tang, Clare Corthell, and Elena Grewal told great stories about things that have gone wrong when doing machine learning or data analysis.

Elena gave a talk (video of a previous version) about common mistakes you can make when running experiments. I was super excited to hear her talk because I’m working on an experiment right now at work and trying to interpret the results. The whole talk is great – it has really good examples of experiments Airbnb has worked on – and you should watch it, but I want to talk about the very small part where she mentions A/A testing.

I’ve heard of A/B testing! But I’d never heard of A/A testing before, didn’t really understand what she meant at the time, and didn’t manage to ask. (pro tip: ask questions when you have them :)). I slept on it, and I now think a) that I get it and b) it’s really simple and c) that it’s SUPER COOL AND MAYBE USEFUL TO ME.

Let’s pretend I have a widget store, and that I’m running an experiment where I have a Great Idea that I think will sell WAY MORE WIDGETS. I’ve rolled out my Great Idea to 33% of users, and I have a gorgeous dashboard that says that my Great Idea group has 2% higher sales than my other group, like this:

A 2% increase in sales is a pretty big deal! But how do I know that these results are actually significant? One great way is to do statistics – Dan McKinley built this calculator which you can see here that makes some assumptions and tells you how long you’ll need to run your experiment for to see statistical significance.

But let’s say you want to get a rough sense for whether or not your results might be significant without doing statistics.

This is where A/A testing comes in! The idea here is to compare two sets of users in the same experimental group and see how high the variation is. So instead of having a Great Idea group and a Control Group, we’ll use two control groups. And then we might see something like this:

Suddenly, the group we’re experimenting on doesn’t look so good anymore. It looks like any difference is likely to be just because of random noise. If we’d instead seen something like this, we’d be much more likely to believe that the Great Idea is actually doing well:

I like this because it seems like it can give you a rough sense for how significant your results are without having to decide on a statistical model. And it’s super intuitive! A question like “if we compare two groups of this size with the same characteristics, do we naturally see a lot of variation?” is a great smoke test.
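The smoke test is easy to run as a simulation, too. Here's a quick sketch (the group size and sales rate below are made up for illustration): give two groups the exact same underlying sales rate and measure the “lift” between them. If identical groups routinely show gaps bigger than 2%, a 2% lift in the real experiment is weak evidence.

```python
import random

# Two groups with the SAME underlying sales rate: any "lift" between
# them is pure noise. (Group size and rate are made-up numbers.)
def simulated_lift(n_users=5000, rate=0.10, seed=None):
    rng = random.Random(seed)
    a = sum(rng.random() < rate for _ in range(n_users))
    b = sum(rng.random() < rate for _ in range(n_users))
    return (b - a) / a  # relative difference between identical groups

lifts = [simulated_lift(seed=s) for s in range(200)]
print(max(abs(l) for l in lifts))  # the biggest noise-only "lift" seen
```

With these made-up numbers the noise-only lifts are regularly several percent, which is exactly the kind of rough sense A/A testing gives you.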

Once I got off a plane and looked up what A/A testing actually is, I found out the graph above has a name! It’s called A/A/B testing, and A/A testing is when you literally just run an experiment where both groups are the same :)

Why a C++ programmer might say “I could never write Python”

I once heard a C++ programmer say “oh no, Python, that’s so hard, I could never write that. I’m so used to having a compiler!”. And at the time I thought they were just being silly – Python isn’t that hard! We’re humans, we can learn and grow! Of course they could!

But I’ve been writing both Scala and Ruby lately, and I’ve had some new thoughts. There’s not much to them, and maybe they’re obvious, but:

If you work with a compiled typed language like Scala, you develop a lot of skills around working with the type system to get better correctness.

If you spend all your time working in Python, by default you can’t even detect basic typos in your code like

def foo(elephant):
    return elephnt + 2

So you need to spend all your time learning how to write correct code without a static type checker, partly by writing a much better test suite, partly by using a linter, and so on. Tests that wouldn’t tell you very much at all in Scala (tests that just run the code and don’t check the result) suddenly become incredibly useful! And it’s extra important to write testable code.
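To make that concrete: the `elephnt` typo above only surfaces when the line actually runs, so even a test that just calls the function, without checking its result, catches it. A minimal sketch:

```python
def foo(elephant):
    return elephnt + 2  # same typo as above: `elephnt` is undefined

# A "does it even run" smoke test: worth little in Scala, a lot here.
try:
    foo(1)
    print("ok")
except NameError as e:
    print("caught:", e)  # caught: name 'elephnt' is not defined
```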

So maybe the C++ programmer who says she can’t write Python is really saying “Writing safer code in a dynamic language is a skill that takes time to learn! I have not yet learned it! I would be too scared to commit to writing a reliable Python program right now!”

And maybe this is part of why Haskell programmers get so attached to Haskell – because they’ve invested so much in learning the type system, and those skills don’t transfer well to other languages like Python.

I’d be interested to know what you think. (though: I do not want to talk about whether {Python, Ruby, Javascript} are better or worse than {Scala, C++, Haskell}. There are already too many flamewars discussing that so we’re not going to talk about it. I just want to talk about skills attached to specific kinds of programming languages and how transferable they are.)

Data Day Texas 2015

in conferences, machinelearning

I went to Data Day Texas this past weekend. It was a 1-day conference, and I learned a lot!

Here are some things I learned! (any misunderstandings are mine, of course :))

Charity Majors: Upgrade your database – without losing your data, your performance, or your mind

This was by far my favorite talk (slides are here). Charity works at Parse, where they manage many thousands of MongoDB collections (as far as I understand it, at least one for each of their users). And sometimes they want to upgrade Mongo!

I understood before seeing this talk that doing database upgrades was hard, and that it’s appropriate to be incredibly skeptical, but I didn’t have any clue how to plan to reduce your uncertainty so that you can actually do the upgrade.

Some things I learned:

  1. How bad can it be if you don’t test the upgrade properly? (she saw one kind of query get 100x slower in the worst case, which would be a disaster). The examples she gave of what can go wrong in an upgrade were incredibly interesting.
  2. How much time is it appropriate to spend planning and testing a database upgrade? (they spend about a year)
  3. How do you know if the new database can handle your production workload? (snapshot it, capture a day’s worth of operations, and replay them against the new version to test it on a real production workload!)
  4. When you actually do the upgrade, how do you do it? (slowly, with lots of opportunities to roll back along the way)
  5. Does Mongo have merit? (They need to support a ton of very different workloads for their users, and it’s a great fit for that.)

There’s also a “A Very Short List Of Terrible Things Database Upgrades Have Done To Me” slide which is the best.

It gave me a little more appreciation for what it means to do ops at scale and to keep services running. I pretty rarely see talks that I feel really advance my understanding of a topic, and this was one of them.

(also, I think I have a thing or two to learn from her about writing talk titles)

Robert Munro – using humans to make your machine learning algorithms dramatically better

Let’s say that you’re writing a classifier that’s doing sentiment analysis. This is a task that’s pretty easy for humans (“is ‘had an amazing time with my friends watching terrible cult movies today’ positive?”), but hard to do with machine learning, especially if you have limited training data to use.

He talked about how judiciously incorporating human input to get a better training set can give you much, much higher accuracy than just messing with your model parameters.

My absolute favorite thing about this talk was when he talked about the human/psychological aspects of using people to help you with classifications! If you’re writing a cat classifier and every single thing you show the human is not a cat, they’ll get bored and exhausted.

It made me think a lot about how, if you’re asking people to help you with a task, you need to

  1. make the task interesting
  2. make sure the people helping you out have a lot of impact on your classification accuracy
  3. make sure that they know how high their impact is, and show them how the model is improving!
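The usual name for "judiciously incorporating human input" like this is active learning. As a hedged sketch of one common flavor, uncertainty sampling (the scoring model and example texts below are entirely made up, not from the talk), you'd route the examples the model is least sure about to your humans:

```python
# Uncertainty sampling: ask humans to label the examples whose
# predicted probability is closest to 0.5 (least certain).
def pick_for_labeling(unlabeled, predict_proba, k=2):
    return sorted(unlabeled, key=lambda x: abs(predict_proba(x) - 0.5))[:k]

# Stand-in "model": a made-up score from a tiny positive-word list.
POSITIVE = {"amazing", "great", "love", "fun"}
def toy_proba(text):
    words = [w.strip(".,!?") for w in text.lower().split()]
    hits = sum(w in POSITIVE for w in words)
    return min(1.0, 0.2 + hits / max(len(words), 1))

docs = [
    "had an amazing time with my friends watching terrible cult movies",
    "it was fine i guess",
    "this is great, i love it, so much fun",
]
print(pick_for_labeling(docs, toy_proba))
```

It also keeps the humans interested: the examples the model is unsure about tend to be genuinely ambiguous, not an endless stream of obvious not-cats.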

Ted Dunning – generating fake datasets

This was a fun talk about simulating datasets to

  • prove that you’re right about the Monty Hall problem
  • debug a database bug when your client can’t give you the data that caused it
  • do machine learning on data you don’t have

Of these, the first two made the most sense to me – I had a much harder time imagining how it would be useful to do machine learning based on a simulated data set in real life, and I think I missed some of the explanation.
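The Monty Hall one is easy to try yourself. Here's a minimal simulation sketch, just plain random draws: generate lots of fake games and count how often switching wins versus staying.

```python
import random

def play(switch, rng):
    doors = [0, 0, 0]
    doors[rng.randrange(3)] = 1  # one door hides the car
    pick = rng.randrange(3)
    # The host opens a goat door that isn't our pick
    opened = next(d for d in range(3) if d != pick and doors[d] == 0)
    if switch:
        pick = next(d for d in range(3) if d not in (pick, opened))
    return doors[pick] == 1

rng = random.Random(0)
trials = 10_000
switch_wins = sum(play(True, rng) for _ in range(trials)) / trials
stay_wins = sum(play(False, rng) for _ in range(trials)) / trials
print(switch_wins, stay_wins)  # switching wins about 2/3 of the time
```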

And he told us about a tool called log-synth that he wrote to generate fake datasets easily! I can pretty easily imagine myself using it to write ML tutorials :). It’s at https://github.com/tdunning/log-synth.

On reading the source code, not the docs

In my first programming job, I worked at a web development company and wrote modules for Drupal sites.

Every so often, I’d need to understand some specific Drupal edge case (say, how the text in the tabs on a search page was determined). And it wouldn’t be documented. And Stack Overflow didn’t have any answers about it, and my colleagues didn’t know. I like to think of this as an “oh right, I actually need to know how to program” moment – nobody’s going to tell me what to do, I just need to figure it out.

The consultancy’s cofounder was Alex Dergachev, and he taught me a really important thing! Whenever I had a question like this, he’d tell me to just read Drupal’s source code, and then I’d know the answer. This is pretty obvious in retrospect, but it wasn’t obvious to me at the time.

I originally felt like Drupal was too complicated to understand (it’s a big codebase!), but it turned out that if I tried I could usually figure out what I needed to know.

This works particularly well when I understand pretty well what the code I’m using is doing, but am just unsure about some particular detail. For example “will this button appear for all users, or only admin users?” I use this all the time at work, and most of you probably do too! There are always details about code that aren’t exhaustively documented, and using grep to find answers is incredibly helpful.

But sometimes reading the code doesn’t work.

Programming doesn’t belong to men (it belongs to me)

One thing I’ve noticed since I started writing this blog is that I’ll get comments like these on my posts: (all these people are talking about me)

He’s only using the -Ofast gcc option, I wonder what he would get with -march=native -mtune=native which allows the compiler to use more instructions.

But more interestingly, why is he studying the amalgated C file instead of, you know, the sources?

I think he talks about the acutal .db file?

How would each machine/core’s counts know when they are done? If he wants a max of a million counts wouldn’t each machine have to check the counts of each other machine before continuing?

If you change the OP’s problem to summing a vector of floating point numbers, then even the way he has coded it there will still be differences from run to run.

One thing he doesn’t mention is the fear of looking stupid.

When this happens, when people implicitly assume that a Technical Thing On The Internet must be written by a man, I find it confusing. I didn’t grow up with the idea that I was worse at math or programming than the men around me (because, well, I wasn’t!). And I didn’t grow up with the idea that it was weird for me to write programs (why would it be?). And a huge number of the programmers I know and respect are women.

So the idea that programmers are all men or that programming is for men or that an article about the Linux kernel is probably written by a man just seems… silly to me. I feel like the community belongs to me, and like I’m a part of it.

And when people who have more power push marginalized people out of tech, I think, who do you think you are? Why do you think this belongs to you, and that you have the right to say who can come and who can’t?

Programming doesn’t belong to men, or to people who went to MIT, or to white people, or to English-speakers, or to Linux users, or to C programmers, or to people with CS degrees, or to people who are self-taught, or to people who can see, or to people who had easy access to computers when they were young. If you write programs, it belongs to you.

Fear makes you a worse programmer

in feelings

Yesterday morning, I asked on Twitter:

Does anyone have good writing about fear + programming (and how being afraid to make important changes makes you a worse programmer?)


I feel like there’s this really important line between caution (a++ excellent) and fear (which holds you back from doing necessary work)

A lot of super interesting discussion ensued, and I’d like to talk about some of it.

Before I start, Ryan Kennedy linked me to this slide deck of a presentation he gave called Fear Driven Development which I absolutely loved, and I think you should look at it. I think my favorite sentence from that presentation is “Fear creates local maximums.”

I find that when I’m afraid, I become super conservative. WE CANNOT POSSIBLY MAKE THIS CHANGE WHAT IF IT BREAKS?! And this means worse software! It’s actually kind of disastrous. If you’re scared of making changes, you can’t make something dramatically better, or do that big code cleanup. Maybe you can’t even deploy the code that you already wrote and tested, because it feels too scary. You just want to stick with what’s sort-of-working, even if it’s not great.

Reproducing awesomeness

I had a conversation with my friend Tavish once about people we respect, and how to think about them, and how to be like them. (Tavish, by the way, gave a SUPER INTERESTING talk at PyCon this year about what programmers can learn from software engineering research, and about how we could measure what makes good software instead of just getting into flamewars.)

But about AMAZING PEOPLE. There are all kinds of people who I think are amazing. And Tavish suggested instead of being all “wow that person is so great they must be a genius and a wizard”, it’s better to ask “okay, what specifically do I admire about that person? Is that a way I want to be? How did they get like that? Can I do that too?”

And sometimes I can’t be like them, or I don’t want to! Like I might think that Terence Tao is an amazing mathematician, but it turns out I don’t actually want to be a mathematician. But I also like that he writes about his work in a public way, and that he tries to demystify math research. And demystifying and writing about things I find interesting in public are things I can do without trying to win the Fields medal :)

Or my friend Sumana is delightful in many ways, and one of the many things I like about her is that she’s very good at giving thoughtful compliments and criticism. And I’m learning that looking at someone’s work carefully and telling them what I think is good about it and what could be improved can be an incredibly helpful thing to do. So I’m trying to do more of that! And she’s also a better writer than I am, and she’s been writing a blog for 14 years, and probably if I write more I will also be a better writer than I am now.

So I feel these days like saying WOW THAT PERSON IS SO GREAT I COULD NOT POSSIBLY is kind of… being unfair. Because I could probably be a little more like them if I think it’s important! And it’s kind of awkward for people to be idolized like that, and it’s better (for myself, and for the not-idolized-person), to just figure out what parts of them I’d like to be more like, and try some things out, and see what works.

And perhaps tell them sometimes that I think they are doing a Very Good Job of doing what makes them awesome.

(edit: There was a pretty huge omission in the original version of this post, which has been bothering me. There’s a lot of privilege involved in saying “hey that awesome thing? I can just go ahead and do that!”. I have this huge abundance of support and free time and money and amazing people who are happy to help me, and it makes it so much easier to do things that I think are interesting or important.)

Diving into concurrency: trying out mutexes and atomics

in concurrency

I hadn’t written any threaded programs before yesterday. I knew sort of abstractly about some concurrency concepts (mutexes! people say compare-and-swap but I don’t totally get it!), but actually understanding a Thing is hard if I’ve never done it. So yesterday I decided to write a program with threads! In this post, we’re going to:

  1. Write a threaded program that gets the wrong answer because of a race condition
  2. Fix that race condition in C and Rust, using 2 different approaches (mutexes and atomics)
  3. Find out why Rust is slower than C
  4. Talk a little about the actual system calls and instructions that make some of this work
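As a taste of the mutex half (in Python rather than the post's C and Rust, just to sketch the idea): incrementing a shared counter is a read-modify-write, so unsynchronized threads can interleave and lose updates, and a mutex fixes that by serializing the increments.

```python
import threading

N, THREADS = 50_000, 4
count = 0
lock = threading.Lock()

def safe_increment():
    global count
    for _ in range(N):
        with lock:       # the mutex: only one thread in here at a time
            count += 1   # without the lock, updates can be lost

threads = [threading.Thread(target=safe_increment) for _ in range(THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(count)  # 200000 every time; a racy version may print less
```

C gets the same guarantee from `pthread_mutex_t`, and Rust from `std::sync::Mutex`.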