I specifically wanted programming-language-independent ways to
investigate questions like this, and I guess people who follow me on
twitter get me because I got SO MANY GREAT ANSWERS. I’ll give you a list
of all the answers at the end, but first! We’re going to mount an investigation.
Let’s start! I wrote up 3 example mystery programs, and you can find
them in this
Sometimes I see conference organizers say “well, we didn’t have a choice about
the talk proposals we got!” or “we just picked the best ones!”. I think we all
know by now that that’s bullshit, but just in case – it’s bullshit! =D
We have a choice about who submits talk proposals, and also about who submits
the best talk proposals. I watched somebody I know get talk proposal feedback
today, and their proposal started out good and got dramatically better.
If you ask someone specifically to consider speaking at your conference,
they’re WAY more likely to consider submitting a talk than if you don’t. If you
then actively work with some talk submitters to help them focus and improve the
talk they submit, their proposals will get better! And if you choose to focus
your energies to work with (for instance) non-white people more than white
people, then you’ll get more and better proposals from people who aren’t white.
You can see this with PyCon! 30% of the talks at last year’s PyCon were given by women,
because lots of people have done tons of individual outreach to encourage their
friends to give talks and spent lots of time working with them to write good
proposals. As Jessica McKellar says:
Hello from your @PyCon Diversity Outreach Chair. % PyCon talks by women:
(2011: 1%), (2012: 7%), (2013: 15%), (2014: 33%). Outreach works.
This makes me really happy! It means that if I’m working on a conference (like
!!Con), then I know I can help get more diverse
participation by sending emails to individual people who I’d like to hear from.
Telling people that I like their work and that I’d like for them to talk about
it is super fun! (and true!)
If there’s something you find exciting about programming and you often find
you’re part of an underrepresented group when you go to conferences, I’d love
it if you submitted a talk to !!Con.
Double especially if you live in NYC!
Sometimes I want to find all the Alanis songs on my computer. There are
(basically) two ways to do this:

find / -name '*Alanis*'

locate Alanis
I’ve known for a long time that locate is faster than find, that it has
some kind of database, and that you can update the database using updatedb.
But I always somehow thought of the locate database as this Complicated Thing.
Until today, when I finally started looking at it! On my machine it’s
/var/lib/mlocate/mlocate.db. You can probably find it with locate mlocate
or strace -e open locate whatever. It’s about 15MB on my computer.
You can peek at what’s in the database with strings, and compare it with an
actual directory listing:

$ strings /var/lib/mlocate/mlocate.db | head
$ ls /bin | head

COINCIDENCE THAT ALL OF THESE WORDS ARE THE SAME? I THINK NOT!
It turns out that the locate database is basically just a huge recursive
directory listing (ls -R /). (I think it’s actually more complicated than
that; there’s a paper at the end.) So a slightly less space-efficient version
of this whole locate business would be to create a database with this Very
Sophisticated Command:

find / > database.txt
Then we can more or less reproduce locate’s functionality by just doing
grep Alanis database.txt
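The same idea is easy to sketch in Python (the names build_database and search are mine, and the demo walks a throwaway temp directory instead of /, since walking the whole filesystem is slow):

```python
import os
import tempfile

def build_database(root, db_path):
    """The updatedb step: write every file path under `root` to a file, one per line."""
    with open(db_path, "w") as db:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                db.write(os.path.join(dirpath, name) + "\n")

def search(db_path, needle):
    """The grep step: return every line of the database containing `needle`."""
    with open(db_path) as db:
        return [line.rstrip("\n") for line in db if needle in line]

# Demo on a throwaway directory (point build_database at "/" for the real thing).
music = tempfile.mkdtemp()
open(os.path.join(music, "Alanis - Ironic.mp3"), "w").close()
open(os.path.join(music, "notes.txt"), "w").close()

db_path = os.path.join(tempfile.mkdtemp(), "database.txt")
build_database(music, db_path)
print(search(db_path, "Alanis"))  # the one matching path
```

The search step is a plain linear scan of the file, which is exactly what grep is doing too, just slower because it’s Python.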
I got curious about the relative speed of find vs locate vs our makeshift
locate using grep. I have an SSD, so a find across all files on my computer
is pretty fast:
$ time find / -name blah
0.59user 0.67system 0:01.71elapsed 73%CPU
$ time locate blah
0.26user 0.00system 0:00.30elapsed 87%CPU
$ time grep blah database.txt
0.04user 0.02system 0:00.10elapsed 64%CPU
Whoa, our homegrown locate using grep is actually way faster! That surprised
me. Our homegrown database takes about 3x as much space as locate’s database
(45MB instead of 15MB) because locate compresses its database, and undoing
that compression at search time is probably part of why locate is slower.
Anyway, now you know how it works! This kind of makes me wonder whether our
database format, which doesn’t use any clever compression tricks, might
actually be a better format if you’re not worried about the extra space it
takes up. But I don’t really understand yet why locate is so much slower.
My current theory is that grep is better optimized than locate and that it can
do smarter stuff. But if you know the real answer, or if you get different
results on your computer, please tell me!
update: omg Mark Dominus tweeted at me within seconds and said he found exactly the same thing 10 years ago. Maybe this is really a thing! Or, more likely, there’s just stuff I don’t understand yet here. Either way I’d like to know!
The 1983 paper specifically calls out “Why not simply build a list of static
files and search with grep?”, and says that grepping a list of 20,000 files
took 30 seconds (“unacceptable for an oft-used command”), and that their locate
implementation gives them better performance. To compare, I have 700,000 files
on my hard disk, and it takes about 0.05 seconds. It seems to me like the
locate authors’ original issues are really not a problem anymore, 30 years later.
They’re also pretty worried about saving space in the locate database, which
also isn’t really a concern anymore. This really makes me wonder what other
standard unix programs make design assumptions that aren’t true in 2015.
I wrote a blog post quite a while ago called gzip + poetry = awesome
where I talked about how the gzip compression program uses the LZ77 algorithm
to identify repetitions in a piece of text.
In case you don’t know what LZ77 is (I sure didn’t), here’s the video from that
post that gives you an example of gzip identifying repetitions in a poem!
I thought this was a great demonstration, but it’s only half the story about
how gzip works, and it’s taken me until now to write the rest of it down. So!
Without further ado, let’s say we’re compressing this text:
a tapping, as of someone gently
at my chamber door
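To make “identifying repetitions” concrete, here’s a toy LZ77-style compressor (my own sketch; real gzip adds Huffman coding, fast match-finding, and lots more). It emits literal characters, plus (distance, length) pairs meaning “copy length characters from distance characters back”. The demo uses a nearby line of the same poem, which conveniently repeats “rapping”:

```python
def compress(text, window=64, min_len=3):
    """Emit literals and (distance, length) back-references, LZ77-style."""
    i, out = 0, []
    while i < len(text):
        best_len, best_dist = 0, 0
        for j in range(max(0, i - window), i):
            k = 0
            # A match may run past position i; overlapping copies are allowed.
            while i + k < len(text) and text[j + k] == text[i + k]:
                k += 1
            if k > best_len:
                best_len, best_dist = k, i - j
        if best_len >= min_len:
            out.append((best_dist, best_len))
            i += best_len
        else:
            out.append(text[i])
            i += 1
    return out

def decompress(tokens):
    s = ""
    for tok in tokens:
        if isinstance(tok, str):
            s += tok
        else:
            dist, length = tok
            for _ in range(length):  # copy one character at a time (handles overlaps)
                s += s[-dist]
    return s

text = "as of someone gently rapping, rapping at my chamber door"
tokens = compress(text)
print(tokens)  # the second "rapping" shows up as a back-reference
assert decompress(tokens) == text
```

The back-reference is the whole trick: instead of storing the second “rapping”, we store a tiny pointer to the first one.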
The other day someone emailed me asking for advice about learning to program.
He said he was planning to use “Learn Python the Hard Way”, which I
understand is an excellent tutorial. I don’t generally like giving
advice, because I think it’s way too easy to generalize your own
experience into advice and do it wrong.
But then I found I had something to say! I thought a little bit about
this blog, and how it might look like I sprang from the womb being
excited about strace or thinking about how to build an operating system,
or whatever. But in fact I spent about 10 years learning to program
before going to Hacker School, mostly at fairly low speeds.
I drew a totally unscientific graph about what that felt like:
Thursday evening I organized a meetup called Data Science: Inconvenient
Truths where the amazing Diyang Tang,
Clare Corthell, and
Elena Grewal told great stories about
things that have gone wrong when doing machine learning or data analysis.
Elena gave a talk (video of a previous version)
about common mistakes you can make when running experiments. I was super
excited to hear her talk because I’m working on an experiment right now
at work and trying to interpret the results. The whole talk is great –
it has really good examples of experiments Airbnb has worked on – and
you should watch it, but I want to talk about the very small part where
she mentions A/A testing.
I’ve heard of A/B testing!
But I’d never heard of A/A testing before, didn’t really understand
what she meant at the time, and didn’t manage to ask. (pro tip: ask
questions when you have them :)). I slept on it, and I now think a) that
I get it and b) it’s really simple and c) that it’s SUPER COOL AND MAYBE
USEFUL TO ME.
Let’s pretend I have a widget store, and that I’m running an experiment
where I have a Great Idea that I think will sell WAY MORE WIDGETS. I’ve
rolled out my Great Idea to 33% of users, and I have a gorgeous
dashboard that says that my Great Idea group has 2% higher sales than my
other group, like this:
A 2% increase in sales is a pretty big deal! But how do I know that
these results are actually significant? One great way is to do
statistics – Dan McKinley built a calculator (you can see it here) that makes
some assumptions and tells you how long you’ll need to run your experiment to
see a significant result.
But let’s say you want to get a rough sense for whether or not your
results might be significant without doing statistics.
This is where A/A testing comes in! The idea here is to compare two
sets of users in the same experimental group and see how high the
variation is. So instead of having a Great Idea group and a Control
Group, we’ll use two control groups. And then we might see something like this:
Suddenly, the group we’re experimenting on doesn’t look so good anymore.
It looks like any difference is likely to be just because of random
noise. If we’d instead seen something like this, we’d be much more
likely to believe that the Great Idea is actually doing well:
I like this because it seems like it can give you a rough sense for how
significant your results are without having to decide on a statistical
model. And it’s super intuitive! A question like “if we compare two
groups of this size with the same characteristics, do we naturally see a
lot of variation?” is a great smoke test.
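Here’s a little simulation of that smoke test (all the numbers are invented: a 5% base rate of buying a widget, groups of 200 visitors). Both groups are drawn from the same population, so every “difference” we see is pure noise:

```python
import random

random.seed(0)  # arbitrary seed so the demo is reproducible

def group_rate(n, p):
    """Simulate n visitors who each buy a widget with probability p."""
    return sum(random.random() < p for _ in range(n)) / n

# A/A test: both groups come from the SAME population (base rate 5%),
# so any difference between them is pure noise.
diffs = [abs(group_rate(200, 0.05) - group_rate(200, 0.05))
         for _ in range(1000)]

print(f"typical noise: {sum(diffs) / len(diffs):.3f}")
print(f"worst noise:   {max(diffs):.3f}")
# If noise this size regularly reaches 2%, then a 2% "lift" in the real
# experiment doesn't mean much on its own.
```

With groups this small, differences of 2% between two identical groups happen all the time, which is exactly the intuition the A/A comparison gives you without any formal statistics.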
Once I got off a plane and looked up what A/A testing actually is, I
found out the graph above has a name! It’s called A/A/B testing, and A/A
testing is when you literally just run an experiment where both groups
are the same :)
I once heard a C++ programmer say “oh no, Python, that’s so hard, I could never
write that. I’m so used to having a compiler!”. And at the time I thought they
were just being silly – Python isn’t that hard! We’re humans, we can learn and
grow! Of course they could!
But I’ve been writing both Scala and Ruby lately, and I’ve had some new
thoughts. There’s not much to them, and maybe they’re obvious, but:
If you work with a compiled typed language like Scala, you develop a lot of
skills around working with the type system to get better correctness.
If you spend all your time working in Python, by default you can’t even detect
basic typos in your code until the line containing the typo actually runs.
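A made-up example: Python is perfectly happy to accept this function definition, and the misspelling only surfaces when the function is actually called.

```python
def greet(name):
    greeting = "Hello, " + name
    return greting  # typo: meant `greeting`

# Defining the function raised no errors. A compiler or type checker
# would have flagged `greting` before the program ever ran; Python
# only notices when this line executes.
try:
    greet("Julia")
except NameError as e:
    print("only caught at runtime:", e)
```

A test that does nothing but call greet() once would catch this, which is why “tests that just run the code” are so much more valuable in Python than in Scala.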
So you need to spend all your time learning how to write correct code without a
static type checker, partly by writing a much better test suite, by using a
linter, etc. Tests that wouldn’t tell you very much at all in Scala (that just
run the code and don’t check the result) suddenly become incredibly useful! And
it’s extra important to write testable code.
So maybe the C++ programmer who says she can’t write Python is really saying
“Writing safer code in a dynamic language is a skill that takes time to learn!
I have not yet learned it! I would be too scared to commit to writing a
reliable Python program right now!”
And maybe this is part of why Haskell programmers get so attached to Haskell –
because they’ve invested so much in learning the type system, and those skills
don’t transfer well to other languages like Python.
I’d be interested to know what you think. (though: I do not want to talk about
Haskell. There are already too many flamewars discussing that, so we’re not
going to talk about it. I just want to talk about skills attached to specific
kinds of programming languages and how transferable they are.)
I went to Data Day Texas this past weekend. It was a 1-day conference,
and I learned a lot!
Here are some things I learned! (any misunderstandings are mine, of course!)
Charity Majors: Upgrade your database – without losing your data, your performance, or your mind
This was by far my favorite talk (slides are here).
Charity works at Parse, where they manage many
thousands of MongoDB collections (as far as I understand it, at least
one for each of their users). And sometimes they want to upgrade Mongo!
I understood before seeing this talk that doing database upgrades was
hard, and that it’s appropriate to be incredibly skeptical, but I didn’t
have any clue how to plan to reduce your uncertainty so that you can
actually do the upgrade.
Some things I learned:
How bad can it be if you don’t test the upgrade properly? (she saw
one kind of query get 100x slower in the worst case, which would be a
disaster). The examples of what can go wrong in an upgrade that she gave
were incredibly interesting.
How much time is it appropriate to spend planning and testing a
database upgrade? (they spend about a year)
How do you know if the new database can handle your production
workload? (snapshot your data, take a day’s worth of operations, and
replay them to test the new version against a real workload!)
When you actually do the upgrade, how do you do it? (slowly, with
lots of opportunities to roll back along the way)
Does Mongo have merit? (They need to support a ton of very different
workloads for their users, and it’s a great fit for that.)
There’s also an “A Very Short List Of Terrible Things Database Upgrades
Have Done To Me” slide, which is the best.
It gave me a little more appreciation for what it means to do ops at
scale and to keep services running. I pretty rarely see talks that I
feel really advance my understanding of a topic, and this was one of them.
(also, I think I have a thing or two to learn from her about writing talks.)
Robert Munro – using humans to make your machine learning algorithms dramatically better
Let’s say that you’re writing a classifier that’s doing sentiment
analysis. This is a task that’s pretty easy for humans (“is ‘had an
amazing time with my friends watching terrible cult movies today’
positive?”), but hard to do with machine learning, especially if you have
limited training data to use.
He talked about how judiciously incorporating human input to get a
better training set can give you much, much higher accuracy than just
messing with your model parameters.
My absolute favorite thing about this talk was when he talked about the
human/psychological aspects of using people to help you with
classifications! If you’re writing a cat classifier and every single
thing you show the human is not a cat, they’ll get bored and exhausted.
It made me think a lot about how, if you’re asking people to help you
with a task, you need to:
make the task interesting
make sure the people helping you out have a lot of impact on your results
make sure that they know how high their impact is, and show them
how the model is improving!
Ted Dunning – generating fake datasets
This was a fun talk about simulating datasets to
prove that you’re right about the Monty Hall problem
debug a database bug when your client can’t give you the data that caused it
do machine learning on data you don’t have
Of these, the first two made the most sense to me – I had a much harder time
imagining how it would be useful to do machine learning based on a simulated
data set in real life, and I think I missed some of the explanation.
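That first use is easy to try at home: here’s a tiny simulated dataset for the Monty Hall problem (my own sketch, not code from the talk), which settles the switch-vs-stay argument in a few lines:

```python
import random

random.seed(42)  # arbitrary seed so the demo is reproducible

def monty_hall_trial(switch):
    """One simulated game. The contestant always starts by picking door 0."""
    prize = random.randrange(3)
    pick = 0
    # The host opens a door that is neither the contestant's pick nor the prize.
    opened = next(d for d in (1, 2) if d != prize)
    if switch:
        # Switch to the one remaining closed door.
        pick = next(d for d in (1, 2) if d != opened)
    return pick == prize

trials = 10_000
switch_wins = sum(monty_hall_trial(True) for _ in range(trials)) / trials
stay_wins = sum(monty_hall_trial(False) for _ in range(trials)) / trials
print(f"switch: {switch_wins:.2f}, stay: {stay_wins:.2f}")  # roughly 2/3 vs 1/3
```

If someone doesn’t believe the analytic argument, ten thousand simulated games are hard to argue with.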
And he told us about a tool called log-synth that he wrote to generate fake
datasets easily! I can pretty easily imagine myself using it to write ML
tutorials :). It’s at
Every so often, I’d need to understand some specific Drupal edge case
(say, how the text in the tabs on a search page was determined). And it
wouldn’t be documented. And Stack Overflow didn’t have any answers about
it, and my colleagues didn’t know. I like to think of this as an
“oh right, I actually need to know how to program” moment – nobody’s
going to tell me what to do, I just need to figure it out.
The consultancy’s cofounder was Alex Dergachev,
and he taught me a really important thing! Whenever I had a question
like this, he’d tell me to just read Drupal’s source code, and then I’d
know the answer. This is pretty obvious in retrospect, but it wasn’t
obvious to me at the time.
I originally felt like Drupal was too complicated to understand (it’s a
big codebase!), but it turned out that if I tried I could usually figure
out what I needed to know.
This works particularly well when I understand pretty well what the code
I’m using is doing, but am just unsure about some particular detail.
For example “will this button appear for all users, or only admin
users?” I use this all the time at work, and most of you probably do
too! There are always details about code that aren’t exhaustively
documented, and using grep to find answers is incredibly helpful.