I went to Data Day Texas this past weekend. It was a 1-day conference,
and I learned a lot!
Here are some things I learned! (any misunderstandings are mine, of course)
Charity Majors: Upgrade your database – without losing your data, your performance, or your mind
This was by far my favorite talk (slides are here).
Charity works at Parse, where they manage many
thousands of MongoDB collections (as far as I understand it, at least
one for each of their users). And sometimes they want to upgrade Mongo!
I understood before seeing this talk that doing database upgrades was
hard, and that it’s appropriate to be incredibly skeptical, but I didn’t
have any clue how to plan to reduce your uncertainty so that you can
actually do the upgrade.
Some things I learned:
- How bad can it be if you don’t test the upgrade properly? (she saw
one kind of query get 100x slower in the worst case, which would be a
disaster). The examples she gave of what can go wrong in an upgrade
were incredibly interesting.
- How much time is it appropriate to spend planning and testing a
database upgrade? (they spend about a year)
- How do you know if the new database can handle your production
workload? (snapshot it, capture a day’s worth of real operations, and
run them against the new version!)
- When you actually do the upgrade, how do you do it? (slowly, with
lots of opportunities to roll back along the way)
- Does Mongo have merit? (They need to support a ton of very different
workloads for their users, and it’s a great fit for that.)
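To make the "one kind of query got 100x slower" point concrete, here's a minimal sketch of how you might compare timings after running the same operations against the old and new versions. This is not Parse's tooling: the function name, the query-shape labels, and the latency numbers are all made up for illustration.

```python
from statistics import median

def find_regressions(old_ms, new_ms, threshold=2.0):
    """Return {query_shape: slowdown_ratio} for shapes that got slower.

    old_ms and new_ms map a query-shape label (hypothetical) to a list of
    latencies in milliseconds, recorded while running the same operations
    against the old and the new database version.
    """
    regressions = {}
    for shape, old_samples in old_ms.items():
        new_samples = new_ms.get(shape)
        if not old_samples or not new_samples:
            continue  # shape missing on one side, nothing to compare
        old_median = median(old_samples)
        if old_median == 0:
            continue  # avoid dividing by zero on degenerate timings
        ratio = median(new_samples) / old_median
        if ratio >= threshold:
            regressions[shape] = ratio
    return regressions

# Made-up timings: one query shape is fine, the other is much slower.
old = {"find_by_id": [1.0, 1.2, 0.9], "scan_by_tag": [10.0, 12.0, 11.0]}
new = {"find_by_id": [1.1, 1.0, 1.2], "scan_by_tag": [1000.0, 1200.0, 1100.0]}
print(find_regressions(old, new))
```

Comparing per query shape (instead of one overall average) matters here, because a single terrible query type can hide behind a healthy-looking global number.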
There’s also a slide called “A Very Short List Of Terrible Things
Database Upgrades Have Done To Me”, which is the best.
It gave me a little more appreciation for what it means to do ops at
scale and to keep services running. I pretty rarely see talks that I
feel really advance my understanding of a topic, and this was one of them.
(also, I think I have a thing or two to learn from her about writing
Robert Munro – using humans to make your machine learning algorithms dramatically better
Let’s say that you’re writing a classifier that’s doing sentiment
analysis. This is a task that’s pretty easy for humans (“is ‘had an
amazing time with my friends watching terrible cult movies today’
positive?”), but hard to do with machine learning, especially if you have
limited training data to use.
He talked about how judiciously incorporating human input to get a
better training set can give you much, much higher accuracy than just
messing with your model parameters.
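One common way to spend human effort "judiciously" is to only ask people about the examples the model is least sure of (often called uncertainty sampling). I'm not certain this is exactly what he described, so treat this as a sketch under that assumption; the function name and the scores below are invented for illustration.

```python
def pick_for_human_review(scored_items, budget):
    """Pick the `budget` items whose predicted probability is closest to 0.5.

    scored_items: list of (text, probability_positive) pairs coming from
    some existing classifier (hypothetical; any model that outputs a
    probability would do).
    """
    # Smallest distance from 0.5 means the model is least sure.
    by_uncertainty = sorted(scored_items, key=lambda item: abs(item[1] - 0.5))
    return [text for text, _ in by_uncertainty[:budget]]

# Made-up scores, purely for illustration.
scored = [
    ("had an amazing time with my friends", 0.97),
    ("watching terrible cult movies today", 0.52),
    ("worst service ever", 0.03),
    ("well, that was something", 0.45),
]
print(pick_for_human_review(scored, budget=2))
```

The confident predictions (0.97 and 0.03) never reach a human, so the labeling budget goes to the two ambiguous sentences where a human answer teaches the model the most.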
My absolute favorite thing about this talk was when he talked about the
human/psychological aspects of using people to help you with
classifications! If you’re writing a cat classifier and every single
thing you show the human is not a cat, they’ll get bored and exhausted.
It made me think a lot about what you need to do if you’re asking people
to help you with a task:
- make the task interesting
- make sure the people helping you out have a lot of impact on your model
- make sure that they know how high their impact is, and show them
how the model is improving!
Ted Dunning – generating fake datasets
This was a fun talk about simulating datasets to
- prove that you’re right about the Monty Hall problem
- debug a database bug when your client can’t give you the data that caused it
- do machine learning on data you don’t have
Of these, the first two made the most sense to me – I had a much harder time
imagining how it would be useful to do machine learning based on a simulated
dataset in real life, and I think I missed some of the explanation.
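The Monty Hall item is easy to try yourself. Here's a small simulation sketch (my own, not from the talk, and it doesn't use log-synth) that plays the game many times and compares always staying with always switching. The function name, trial count, and seed are arbitrary choices.

```python
import random

def monty_hall_win_rate(switch, trials=100_000, seed=0):
    """Estimate the chance of winning the car when staying or switching."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    wins = 0
    for _ in range(trials):
        car = rng.randrange(3)   # door hiding the car
        pick = rng.randrange(3)  # contestant's first choice
        # The host opens a door that is neither the pick nor the car.
        opened = rng.choice([d for d in range(3) if d != pick and d != car])
        if switch:
            # Move to the one door that is neither picked nor opened.
            pick = next(d for d in range(3) if d != pick and d != opened)
        wins += (pick == car)
    return wins / trials

stay_rate = monty_hall_win_rate(switch=False)
switch_rate = monty_hall_win_rate(switch=True)
print("stay:  ", stay_rate)
print("switch:", switch_rate)
```

Staying should come out close to 1/3 and switching close to 2/3, which is the classic (and famously counterintuitive) answer.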
And he told us about a tool called log-synth that he wrote to generate fake
datasets easily! I can pretty easily imagine myself using it to write ML
tutorials :). It’s at