Data Day Texas 2015
I went to Data Day Texas this past weekend. It was a 1-day conference, and I learned a lot!
Here are some of the things I learned (any misunderstandings are mine, of course :))
Charity Majors: Upgrade your database – without losing your data, your performance, or your mind
This was by far my favorite talk (slides are here). Charity works at Parse, where they manage many thousands of MongoDB collections (as far as I understand it, at least one for each of their users). And sometimes they want to upgrade Mongo!
I understood before seeing this talk that doing database upgrades is hard, and that it’s appropriate to be incredibly skeptical of them, but I didn’t have any clue how you’d actually plan an upgrade to reduce the uncertainty enough to go through with it.
Some things I learned:
- How bad can it be if you don’t test the upgrade properly? (she saw one kind of query get 100x slower in the worst case, which would be a disaster). The examples she gave of what can go wrong in an upgrade were incredibly interesting.
- How much time is it appropriate to spend planning and testing a database upgrade? (they spend about a year)
- How do you know if the new database can handle your production workload? (snapshot your production data, capture a day’s worth of operations, and replay them against the new version! There’s a rough sketch of that idea after this list.)
- When you actually do the upgrade, how do you do it? (slowly, with lots of opportunities to roll back along the way)
- Does Mongo have merit? (They need to support a ton of very different workloads for their users, and it’s a great fit for that.)
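The “capture a day’s worth of operations and replay them” idea stuck with me, so here’s a rough sketch of what I imagine the simplest version looks like: replay captured read queries against the old and new versions and compare latencies. This is my own sketch, not anything from the talk; the hostnames, database name, capture file format, and the choice to only replay reads are all made up for illustration.

```python
# Rough sketch (mine, not from the talk): replay a day's worth of captured
# read queries against both the old and new MongoDB versions and flag
# queries that got dramatically slower. The hosts, database name, and
# captured_ops.jsonl format are all made up for illustration.
import json
import time
from pymongo import MongoClient

old = MongoClient("mongodb://old-mongo:27017")["app"]  # current production version
new = MongoClient("mongodb://new-mongo:27017")["app"]  # candidate version, restored from a snapshot

def timed_find(db, collection, query):
    """Run a find() and return how long it took, in seconds."""
    start = time.perf_counter()
    list(db[collection].find(query))  # exhaust the cursor so the whole query is measured
    return time.perf_counter() - start

regressions = []
with open("captured_ops.jsonl") as f:  # one captured operation per line, e.g.
    for line in f:                     # {"collection": "users", "query": {"status": "active"}}
        op = json.loads(line)
        t_old = timed_find(old, op["collection"], op["query"])
        t_new = timed_find(new, op["collection"], op["query"])
        if t_new > 10 * t_old:         # flag anything that got much slower on the new version
            regressions.append((op, t_old, t_new))

for op, t_old, t_new in regressions:
    print(f"{op['collection']} {op['query']}: {t_old:.4f}s -> {t_new:.4f}s")
```

Obviously the real thing at Parse’s scale also involves writes, a lot more metrics, and the slow rolling upgrade with rollback points that she described, but even a toy version like this would catch a “this query got 100x slower” surprise before it hits production.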
There’s also a slide called “A Very Short List Of Terrible Things Database Upgrades Have Done To Me”, which is the best.
It gave me a little more appreciation for what it means to do ops at scale and to keep services running. I pretty rarely see talks that I feel really advance my understanding of a topic, and this was one of them.
(also, I think I have a thing or two to learn from her about writing talk titles)
Robert Munro – using humans to make your machine learning algorithms dramatically better
Let’s say that you’re writing a classifier that’s doing sentiment analysis. This is a task that’s pretty easy for humans (“is ‘had an amazing time with my friends watching terrible cult movies today’ positive?”), but hard to do with machine learning, especially if you have limited training data to use.
He talked about how judiciously incorporating human input to get a better training set can give you much, much higher accuracy than just messing with your model parameters.
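As far as I can tell, the core technique is what’s usually called active learning (specifically uncertainty sampling): train on the labels you have, ask a human to label only the examples the model is least sure about, and retrain. Here’s a minimal sketch of that loop with scikit-learn; it’s my reading of the idea rather than anything Robert showed, and the example sentences and ask_a_human function are made up.

```python
# Minimal active-learning sketch (my reading of the idea, not Robert's code):
# train on a tiny labeled set, ask a human to label only the examples the
# model is least confident about, and retrain. The data and ask_a_human()
# are made up for illustration.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["had an amazing time with my friends",
         "this movie was a complete waste of an evening"]
labels = [1, 0]  # 1 = positive, 0 = negative
unlabeled = ["terrible cult movies are the best kind of movies",
             "so bored, nothing good was on",
             "what a fantastic day"]

def ask_a_human(text):
    # Stand-in for a real labeling UI or crowdsourcing task.
    return int(input(f"Is this positive? (1/0) {text!r} ") or 0)

vectorizer = CountVectorizer()
for _ in range(3):  # a few rounds of "train, ask, retrain"
    X = vectorizer.fit_transform(texts + unlabeled)
    model = LogisticRegression().fit(X[:len(texts)], labels)
    probs = model.predict_proba(X[len(texts):])[:, 1]   # P(positive) for each unlabeled example
    idx = int(np.argmin(np.abs(probs - 0.5)))           # the example the model is least sure about
    text = unlabeled.pop(idx)
    texts.append(text)
    labels.append(ask_a_human(text))
```

The point is that a handful of well-chosen human labels can move the model a lot further than the same number of randomly chosen ones.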
My absolute favorite thing about this talk was when he talked about the human/psychological aspects of using people to help you with classifications! If you’re writing a cat classifier and every single thing you show the human is not a cat, they’ll get bored and exhausted.
It made me think a lot about how, if you’re asking people to help you with a task, you need to
- make the task interesting
- make sure the people helping you out have a lot of impact on your classification accuracy
- make sure that they know how high their impact is, and show them how the model is improving!
Ted Dunning – generating fake datasets
This was a fun talk about simulating datasets to
- prove that you’re right about the Monty Hall problem
- debug a database bug when your client can’t give you the data that caused it
- do machine learning on data you don’t have
Of these, the first two made the most sense to me (there’s a quick Monty Hall simulation below) – I had a much harder time imagining how it would be useful to do machine learning on a simulated dataset in real life, and I think I missed some of the explanation.
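The Monty Hall one really is the kind of question where a tiny simulation is more convincing than any amount of arguing about probabilities. Here’s roughly what I mean (my sketch, not anything from the talk or from log-synth):

```python
# Tiny Monty Hall simulation: switching wins about 2/3 of the time,
# staying wins only about 1/3.
import random

def play(switch, trials=100_000):
    wins = 0
    for _ in range(trials):
        doors = [0, 1, 2]
        car = random.choice(doors)
        pick = random.choice(doors)
        # The host opens a door that's neither your pick nor the car.
        opened = random.choice([d for d in doors if d != pick and d != car])
        if switch:
            # Switch to the one remaining closed door.
            pick = next(d for d in doors if d != pick and d != opened)
        wins += pick == car
    return wins / trials

print("stay:  ", play(switch=False))   # ~0.33
print("switch:", play(switch=True))    # ~0.67
```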
And he told us about a tool called log-synth that he wrote to generate fake datasets easily! I can imagine myself using it to write ML tutorials :). It’s at https://github.com/tdunning/log-synth.