Julia Evans

Nancy Drew and the Case of the Slow Program

in systems

Yesterday I tweeted asking for ways to figure out why a program is slow.

I specifically wanted programming-language-independent ways to investigate questions like this, and I guess people who follow me on twitter get me, because I got SO MANY GREAT ANSWERS. I’ll give you a list of all the answers at the end, but first! We’re going to mount an investigation.

Let’s start! I wrote up 3 example mystery programs, and you can find them in this github repository.

Mystery Program #1

Let’s investigate our first mystery slow program!

You can choose who submits talks to your conference

in conferences

Sometimes I see conference organizers say “well, we didn’t have a choice about the talk proposals we got!” or “we just picked the best ones!”. I think we all know by now that that’s bullshit, but just in case – it’s bullshit! =D

We have a choice about who submits talk proposals, and also about who submits the best talk proposals. I watched somebody I know get talk proposal feedback today, and their proposal started out good and got dramatically better. Now it’s great.

If you ask someone specifically to consider speaking at your conference, they’re WAY more likely to consider submitting a talk than if you don’t. If you then actively work with some talk submitters to help them focus and improve the talk they submit, their proposals will get better! And if you choose to focus your energies to work with (for instance) non-white people more than white people, then you’ll get more and better proposals from people who aren’t white.

You can see this with PyCon! 33% of the talks at last year’s PyCon were given by women, because lots of people have done tons of individual outreach to encourage their friends to give talks and spent lots of time working with them to write good proposals. As Jessica McKellar says:

Hello from your @PyCon Diversity Outreach Chair. % PyCon talks by women: (2011: 1%), (2012: 7%), (2013: 15%), (2014: 33%). Outreach works.

This makes me really happy! It means that if I’m working on a conference (like !!Con), then I know I can help get more diverse participation by sending emails to individual people who I’d like to hear from. Telling people that I like their work and that I’d like for them to talk about it is super fun! (and true!)

If there’s something you find exciting about programming and you often find you’re part of an underrepresented group when you go to conferences, I’d love it if you submitted a talk to !!Con. Double especially if you live in NYC!

1:1 topic ideas

Danielle Sucher started this great thread on twitter asking for ideas for what to talk about in 1:1s with your manager. I’m writing some of them up here so I don’t forget.

  • What’s happening now that I would like to not be happening in a month? (@zmagg)
  • Am I having tension with any of my colleagues I want to resolve before it gets worse?
  • what are promotions for? where am I relative to that, and what should I be working on?
  • or: “This is how I would like promotions to work when they happen. How would I fit in to that if they did?”
  • turn it around: what are you thinking about right now? what’s your top priority? what’s worrying you about this team?
  • Am I happy with my current project?
  • Do I feel like I’m learning? Are there things I feel like I’m not learning that I would like to?
  • Are there things about the way the team is working together that feel bad to me?
  • periodically: where do I want to be with my career?

There’s this further list of 101 questions to try that I find really really helpful as an exhaustive grab bag of “oh no I don’t know what to talk about give me ideas please!!!”.

How the locate command works (and let’s write a faster version in one minute!)

in coding

Sometimes I want to find all the Alanis songs on my computer. There are (basically) two ways to do this.

  1. find / -name '*Alanis*'
  2. locate Alanis

I’ve known for a long time that locate is faster than find, that it has some kind of database, and that you can update the database using updatedb.

But I always somehow thought of the locate database as this Complicated Thing – until today, when I started looking at it! On my machine it’s /var/lib/mlocate/mlocate.db. You can probably find it with locate mlocate or strace -e open locate whatever. It’s about 15MB on my computer.

When I cat it, here’s what part of it looks like.

/bin^@^@bash^@^@bunzip2^@^@busybox^@^@bzcat^@^@bzcmp^@^@bzdiff^@^@bzegrep^@^@bzexe^@^@bzfgrep^@^@bzgrep^@^@bzip2^@^@bzip2recover^@^@bzless^@^@bzmore^@^@cat^@^@chacl^@^@chgrp^@^@chmod^@^@chown^@^@chvt
^@^@cp^@^@cpio^@^@dash^@^@date^@^@dbus-cleanup-sockets^@^@dbus-daemon^@^@dbus-uuidgen^@^@dd^@^@df^@^@dir^@^@dmesg^@^@dnsdomainname^@^@domainname^@^@dumpkeys^@^@echo^@^@ed
^@^@egrep^@^@false^@^@fgconsole^@^@fgrep^@^@findmnt^@^@fuser^@^@fusermount^@^@getfacl^@^@grep^@^@gunzip^@^@gzexe^@^@gzip^@^@hostname^@^@ip^@^@kbd_mode^@^@kill^@^@kmod^@^@
less^@^@lessecho^@^@lessfile^@^@lesskey^@^@lesspipe^@^@ln^@^@loadkeys^@^@login^@^@loginctl^@^@lowntfs-3g^@^@ls^@^@lsblk^@^@lsmod^@^@mkdir^@^@mknod^@^@mktemp

And here’s what’s in the /bin directory:

$ ls /bin | head
bash
bunzip2
busybox
bzcat
bzcmp
bzdiff
bzegrep
bzexe
bzfgrep
bzgrep

COINCIDENCE THAT ALL OF THESE WORDS ARE THE SAME? I THINK NOT!

It turns out that the locate database is basically just a huge recursive directory listing (ls -R /). (I think it’s actually more complicated than that; there’s a paper at the end). So a slightly less space-efficient version of this whole locate business would be to create a database with this Very Sophisticated Command:

sudo find / > database.txt

This gives us a file that looks like this:

/
/vmlinuz.old
/var
/var/mail
/var/spool
/var/spool/rsyslog
/var/spool/mail
/var/spool/cups
/var/spool/cups/tmp
/var/spool/cron

Then we can more or less reproduce locate’s functionality by just doing

grep Alanis database.txt
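
If you want to poke at the real mlocate.db yourself, here’s a rough Python sketch that just pulls the null-separated strings out of it. (This is not a faithful parser – the real format, described in man 5 mlocate.db, has a header, per-directory timestamps, and a type byte for each entry – but it’s enough to see the directory listing hiding inside.)

# Rough sketch: dump the readable strings from mlocate.db. This skips the
# real structure (header, timestamps, type bytes) and just splits on the
# NUL bytes you can see in the `cat` output above.
# (You may need to run this as root; the database often isn't world-readable.)
with open("/var/lib/mlocate/mlocate.db", "rb") as f:
    data = f.read()

names = [chunk.decode("utf-8", "replace")
         for chunk in data.split(b"\x00")
         if len(chunk) > 1]
print(names[:30])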

I got curious about the relative speed of find vs locate vs our makeshift locate using grep. I have an SSD, so a find across all files on my computer is pretty fast:

$ time find / -name blah
0.59user 0.67system 0:01.71elapsed 73%CPU
$ time locate blah
0.26user 0.00system 0:00.30elapsed 87%CPU
$ time grep blah database.txt
0.04user 0.02system 0:00.10elapsed 64%CPU

Whoa, our homegrown locate using grep is actually way faster! That surprised me. Our homegrown database takes about 3x as much space as locate’s database (45MB instead of 15MB), so the extra work locate does to decode its compressed database is probably part of why it’s slower.

Anyway, now you know how it works! This kind of makes me wonder whether our database format, which doesn’t use any clever compression tricks, might actually be a better format if you’re not worried about the extra space it takes up. But I don’t really understand yet why locate is so much slower.

My current theory is that grep is better optimized than locate and that it can do smarter stuff. But if you know the real answer, or if you get different results on your computer, please tell me!

update: omg Mark Dominus tweeted at me within seconds and said he found exactly the same thing 10 years ago. Maybe this is really a thing! Or, more likely, there’s just stuff I don’t understand yet here. Either way I’d like to know!

further update: Patrick Collison pointed out this amazingly-titled (and short! 3 pages!) paper, Finding Files Fast, from 1983, which talks about locate’s design and also claims that the source is pretty readable.

The 1983 paper specifically calls out “Why not simply build a list of static files and search with grep?”, and says that grepping a list of 20,000 files took 30 seconds (“unacceptable for an oft-used command”), and that their locate implementation gives them better performance. To compare, I have 700,000 files on my hard disk, and it takes about 0.05 seconds. It seems to me like the locate authors’ original issues are really not a problem anymore, 30 years later.
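
(To put numbers on that: 20,000 files in 30 seconds is roughly 700 files per second, while 700,000 files in 0.05 seconds is about 14 million files per second – a difference of something like 20,000x.)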

They’re also pretty worried about saving space in the locate database, which also isn’t really a concern anymore. This really makes me wonder what other standard unix programs make design assumptions that aren’t true in 2015.

How gzip uses Huffman coding

in programming

I wrote a blog post quite a while ago called gzip + poetry = awesome where I talked about how the gzip compression program uses the LZ77 algorithm to identify repetitions in a piece of text.

In case you don’t know what LZ77 is (I sure didn’t), the video in that post gives you an example of gzip identifying repetitions in a poem!



I thought this was a great demonstration, but it’s only half the story about how gzip works, and it’s taken me until now to write the rest of it down. So! Without further ado, let’s say we’re compressing this text:

a tapping, as of someone gently 
r{apping},{ rapping}
at my chamber door
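
LZ77 is the first half: the {braces} above mark text that gets replaced with a back-reference to an earlier copy. The second half is Huffman coding, where gzip assigns short bit patterns to common symbols and longer ones to rare symbols. Here’s a toy Python sketch of the core idea (gzip really builds canonical Huffman codes over LZ77 literals and back-references, so this illustrates the technique, not gzip’s actual format):

import heapq
from collections import Counter

def huffman_codes(text):
    # Build a prefix code: repeatedly merge the two least-frequent subtrees,
    # prepending a 0 bit to codes on one side and a 1 bit to the other.
    heap = [(freq, i, {ch: ""}) for i, (ch, freq) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    tiebreak = len(heap)  # keeps tuple comparison away from the dicts
    while len(heap) > 1:
        freq1, _, codes1 = heapq.heappop(heap)
        freq2, _, codes2 = heapq.heappop(heap)
        merged = {ch: "0" + code for ch, code in codes1.items()}
        merged.update({ch: "1" + code for ch, code in codes2.items()})
        heapq.heappush(heap, (freq1 + freq2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

codes = huffman_codes("a tapping, as of someone gently rapping at my chamber door")
for ch, code in sorted(codes.items(), key=lambda kv: len(kv[1])):
    print(repr(ch), code)

Frequent characters like the space and ‘a’ come out with short codes, and rare characters get long ones – that’s the whole trick.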

How I learned to program in 10 years

The other day someone emailed me asking for advice on learning to program. He said he was planning to use “Learn Python the Hard Way”, which I understand is an excellent tutorial. I don’t generally like giving advice, because I think it’s way too easy to generalize your own experience into advice and get it wrong.

But then I found I had something to say! I thought a little bit about this blog, and how it might look like I sprang from the womb being excited about strace or thinking about how to build an operating system, or whatever. But in fact I spent about 10 years learning to program before going to Hacker School, mostly at fairly low speeds.

I drew a totally unscientific graph about what that felt like.

A/A testing

in machinelearning

Thursday evening I organized a meetup called Data Science: Inconvenient Truths where the amazing Diyang Tang, Clare Corthell, and Elena Grewal told great stories about things that have gone wrong when doing machine learning or data analysis.

Elena gave a talk (video of a previous version) about common mistakes you can make when running experiments. I was super excited to hear her talk because I’m working on an experiment right now at work and trying to interpret the results. The whole talk is great – it has really good examples of experiments Airbnb has worked on – and you should watch it, but I want to talk about the very small part where she mentions A/A testing.

I’ve heard of A/B testing! But I’d never heard of A/A testing before, didn’t really understand what she meant at the time, and didn’t manage to ask. (pro tip: ask questions when you have them :)). I slept on it, and I now think a) that I get it and b) it’s really simple and c) that it’s SUPER COOL AND MAYBE USEFUL TO ME.

Let’s pretend I have a widget store, and that I’m running an experiment where I have a Great Idea that I think will sell WAY MORE WIDGETS. I’ve rolled out my Great Idea to 33% of users, and I have a gorgeous dashboard that says that my Great Idea group has 2% higher sales than my other group.

A 2% increase in sales is a pretty big deal! But how do I know that these results are actually significant? One great way is to do statistics – Dan McKinley built a calculator (you can see it here) that makes some assumptions and tells you how long you’ll need to run your experiment to see statistical significance.

But let’s say you want to get a rough sense for whether or not your results might be significant without doing statistics.

This is where A/A testing comes in! The idea here is to compare two sets of users in the same experimental group and see how high the variation is. So instead of having a Great Idea group and a Control Group, we’ll use two control groups and see how different they look.

If the two identical control groups differ from each other by about as much as our 2%, then suddenly the group we’re experimenting on doesn’t look so good anymore: any difference is likely to be just random noise. If instead the two control groups track each other closely, we’d be much more likely to believe that the Great Idea is actually doing well.

I like this because it seems like it can give you a rough sense for how significant your results are without having to decide on a statistical model. And it’s super intuitive! A question like “if we compare two groups of this size with the same characteristics, do we naturally see a lot of variation?” is a great smoke test.

Once I got off a plane and looked up what A/A testing actually is, I found out the graph above has a name! It’s called A/A/B testing, and A/A testing is when you literally just run an experiment where both groups are the same :)
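
It’s also really easy to simulate an A/A test to build intuition for how much variation to expect. Here’s a quick Python sketch (all the numbers are made up: pretend every user buys a widget with the same fixed probability):

import random

# Toy A/A test: two groups of identical simulated users. Any difference
# between the groups is pure random noise, since they're the same by design.
def aa_test(n_users=10000, buy_rate=0.05, trials=10):
    for _ in range(trials):
        group_a = sum(random.random() < buy_rate for _ in range(n_users))
        group_b = sum(random.random() < buy_rate for _ in range(n_users))
        diff = abs(group_a - group_b) / group_a
        print(f"A: {group_a} sales, A': {group_b} sales, difference: {diff:.1%}")

aa_test()

With these made-up numbers, the two identical groups routinely differ by a few percent, which is a pretty strong hint that a 2% lift on its own doesn’t mean much.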

Why a C++ programmer might say “I could never write Python”

I once heard a C++ programmer say “oh no, Python, that’s so hard, I could never write that. I’m so used to having a compiler!”. And at the time I thought they were just being silly – Python isn’t that hard! We’re humans, we can learn and grow! Of course they could!

But I’ve been writing both Scala and Ruby lately, and I’ve had some new thoughts. There’s not much to them, and maybe they’re obvious, but:

If you work with a compiled typed language like Scala, you develop a lot of skills around working with the type system to get better correctness.

If you spend all your time working in Python, by default you can’t even detect basic typos in your code like

def foo(elephant):
    return elephnt + 2

So you need to spend all your time learning how to write correct code without a static type checker – partly by writing a much better test suite, partly by using a linter, and so on. Tests that wouldn’t tell you very much at all in Scala (tests that just run the code and don’t check the result) suddenly become incredibly useful! And it’s extra important to write testable code.
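
For example, even a smoke test with no assertions at all catches the typo above, because Python only notices the undefined name when the line actually runs:

def test_foo_runs():
    # No assertion needed: calling foo() raises NameError on `elephnt`,
    # so just running the code is enough to catch the typo.
    foo(3)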

So maybe the C++ programmer who says she can’t write Python is really saying “Writing safer code in a dynamic language is a skill that takes time to learn! I have not yet learned it! I would be too scared to commit to writing a reliable Python program right now!”

And maybe this is part of why Haskell programmers get so attached to Haskell – because they’ve invested so much in learning the type system, and those skills don’t transfer well to other languages like Python.

I’d be interested to know what you think. (though: I do not want to talk about whether {Python, Ruby, Javascript} are better or worse than {Scala, C++, Haskell}. There are already too many flamewars discussing that so we’re not going to talk about it. I just want to talk about skills attached to specific kinds of programming languages and how transferable they are.)

Data Day Texas 2015

in conferences, machinelearning

I went to Data Day Texas this past weekend. It was a 1-day conference, and I learned a lot!

Here are some things I learned! (any misunderstandings are mine, of course :))

Charity Majors: Upgrade your database – without losing your data, your performance, or your mind

This was by far my favorite talk (slides are here). Charity works at Parse, where they manage many thousands of MongoDB collections (as far as I understand it, at least one for each of their users). And sometimes they want to upgrade Mongo!

I understood before seeing this talk that doing database upgrades was hard, and that it’s appropriate to be incredibly skeptical, but I didn’t have any clue how to plan to reduce your uncertainty so that you can actually do the upgrade.

Some things I learned:

  1. How bad can it be if you don’t test the upgrade properly? (she saw one kind of query get 100x slower in the worst case, which would be a disaster). The examples she gave of what can go wrong in an upgrade were incredibly interesting.
  2. How much time is it appropriate to spend planning and testing a database upgrade? (they spend about a year)
  3. How do you know if the new database can handle your production workload? (snapshot it, take a day’s worth of operations and test it out on a production workload!)
  4. When you actually do the upgrade, how do you do it? (slowly, with lots of opportunities to roll back along the way)
  5. Does Mongo have merit? (They need to support a ton of very different workloads for their users, and it’s a great fit for that.)

There’s also a slide titled “A Very Short List Of Terrible Things Database Upgrades Have Done To Me”, which is the best.

It gave me a little more appreciation for what it means to do ops at scale and to keep services running. I pretty rarely see talks that I feel really advance my understanding of a topic, and this was one of them.

(also, I think I have a thing or two to learn from her about writing talk titles)

Robert Munro – using humans to make your machine learning algorithms dramatically better

Let’s say that you’re writing a classifier that’s doing sentiment analysis. This is a task that’s pretty easy for humans (“is ‘had an amazing time with my friends watching terrible cult movies today’ positive?”), but hard to do with machine learning, especially if you have limited training data to use.

He talked about how judiciously incorporating human input to get a better training set can give you much, much higher accuracy than just messing with your model parameters.

My absolute favorite thing about this talk was when he talked about the human/psychological aspects of using people to help you with classifications! If you’re writing a cat classifier and every single thing you show the human is not a cat, they’ll get bored and exhausted.

It made me think a lot about how, if you’re asking people to help you with a task, you need to

  1. make the task interesting
  2. make sure the people helping you out have a lot of impact on your classification accuracy
  3. make sure that they know how high their impact is, and show them how the model is improving!

Ted Dunning – generating fake datasets

This was a fun talk about simulating datasets to

  • prove that you’re right about the Monty Hall problem (see the quick sketch after this list)
  • debug a database bug when your client can’t give you the data that caused it
  • do machine learning on data you don’t have
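
The Monty Hall one is easy to try at home. Here’s a quick simulation (my own sketch, not from the talk):

import random

# Monty Hall: the host opens a door that isn't your pick and doesn't hide
# the car. Switching should win about 2/3 of the time, staying about 1/3.
def monty_hall(trials=100000):
    stay_wins = switch_wins = 0
    for _ in range(trials):
        car = random.randrange(3)
        pick = random.randrange(3)
        opened = next(d for d in range(3) if d != pick and d != car)
        switched = next(d for d in range(3) if d != pick and d != opened)
        stay_wins += pick == car
        switch_wins += switched == car
    print(f"stay wins {stay_wins / trials:.2f}, switch wins {switch_wins / trials:.2f}")

monty_hall()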

Of these, the first two made the most sense to me – I had a much harder time imagining how it would be useful to do machine learning on a simulated data set in real life, and I think I missed some of the explanation.

And he told us about a tool called log-synth that he wrote to generate fake datasets easily! I can pretty easily imagine myself using it to write ML tutorials :). It’s at https://github.com/tdunning/log-synth.

On reading the source code, not the docs

In my first programming job, I worked at a web development company and wrote modules for Drupal sites.

Every so often, I’d need to understand some specific Drupal edge case (say, how the text in the tabs on a search page was determined). And it wouldn’t be documented. And Stack Overflow didn’t have any answers about it, and my colleagues didn’t know. I like to think of this as an “oh right, I actually need to know how to program” moment – nobody’s going to tell me what to do, I just need to figure it out.

The consultancy’s cofounder was Alex Dergachev, and he taught me a really important thing! Whenever I had a question like this, he’d tell me to just read Drupal’s source code, and then I’d know the answer. This is pretty obvious in retrospect, but it wasn’t obvious to me at the time.

I originally felt like Drupal was too complicated to understand (it’s a big codebase!), but it turned out that if I tried I could usually figure out what I needed to know.

This works particularly well when I understand pretty well what the code I’m using is doing, but am just unsure about some particular detail. For example “will this button appear for all users, or only admin users?” I use this all the time at work, and most of you probably do too! There are always details about code that aren’t exhaustively documented, and using grep to find answers is incredibly helpful.

But sometimes reading the code doesn’t work.