Julia Evans

Why a C++ programmer might say “I could never write Python”

I once heard a C++ programmer say “oh no, Python, that’s so hard, I could never write that. I’m so used to having a compiler!”. And at the time I thought they were just being silly – Python isn’t that hard! We’re humans, we can learn and grow! Of course they could!

But I’ve been writing both Scala and Ruby lately, and I’ve had some new thoughts. There’s not much to them, and maybe they’re obvious, but:

If you work with a compiled typed language like Scala, you develop a lot of skills around working with the type system to get better correctness.

If you spend all your time working in Python, by default you can’t even detect basic typos in your code like

def foo(elephant):
    return elephnt + 2

So you need to spend all your time learning how to write correct code without a static type checker: partly by writing a much better test suite, partly by using a linter, and so on. Tests that wouldn’t tell you very much at all in Scala (ones that just run the code and don’t check the result) suddenly become incredibly useful! And it’s extra important to write testable code.
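For instance, even a test that never inspects the return value catches the typo above, because Python raises a NameError the moment the function actually runs:

```python
def foo(elephant):
    return elephnt + 2  # same typo as above: 'elephnt' is not defined

# A "smoke test" that doesn't even check the answer still catches the typo,
# because merely calling foo blows up:
try:
    foo(1)
    smoke_test_passed = True
except NameError as err:
    smoke_test_passed = False
    print("caught:", err)
```

A type checker would flag this before the code ever ran, which is exactly the safety net you're giving up.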

So maybe the C++ programmer who says she can’t write Python is really saying “Writing safer code in a dynamic language is a skill that takes time to learn! I have not yet learned it! I would be too scared to commit to writing a reliable Python program right now!”

And maybe this is part of why Haskell programmers get so attached to Haskell – because they’ve invested so much in learning the type system, and those skills don’t transfer well to other languages like Python.

I’d be interested to know what you think. (though: I do not want to talk about whether {Python, Ruby, Javascript} are better or worse than {Scala, C++, Haskell}. There are already too many flamewars discussing that so we’re not going to talk about it. I just want to talk about skills attached to specific kinds of programming languages and how transferable they are.)

Data Day Texas 2015

in conferences, machinelearning

I went to Data Day Texas this past weekend. It was a 1-day conference, and I learned a lot!

Here are some things I learned! (any misunderstandings are mine, of course :))

Charity Majors: Upgrade your database – without losing your data, your performance, or your mind

This was by far my favorite talk (slides are here). Charity works at Parse, where they manage many thousands of MongoDB collections (as far as I understand it, at least one for each of their users). And sometimes they want to upgrade Mongo!

I understood before seeing this talk that doing database upgrades was hard, and that it’s appropriate to be incredibly skeptical, but I didn’t have any clue how to plan to reduce your uncertainty so that you can actually do the upgrade.

Some things I learned:

  1. How bad can it be if you don’t test the upgrade properly? (she saw one kind of query get 100x slower in the worst case, which would be a disaster). The examples she gave of what can go wrong in an upgrade were incredibly interesting.
  2. How much time is it appropriate to spend planning and testing a database upgrade? (they spend about a year)
  3. How do you know if the new database can handle your production workload? (snapshot it, capture a day’s worth of operations, and replay them against the new version!)
  4. When you actually do the upgrade, how do you do it? (slowly, with lots of opportunities to roll back along the way)
  5. Does Mongo have merit? (They need to support a ton of very different workloads for their users, and it’s a great fit for that.)

There’s also a slide called “A Very Short List Of Terrible Things Database Upgrades Have Done To Me”, which is the best.

It gave me a little more appreciation for what it means to do ops at scale and to keep services running. I pretty rarely see talks that I feel really advance my understanding of a topic, and this was one of them.

(also, I think I have a thing or two to learn from her about writing talk titles)

Robert Munro – using humans to make your machine learning algorithms dramatically better

Let’s say that you’re writing a classifier that’s doing sentiment analysis. This is a task that’s pretty easy for humans (“is ‘had an amazing time with my friends watching terrible cult movies today’ positive?”), but hard to do with machine learning, especially if you have limited training data to use.

He talked about how judiciously incorporating human input to get a better training set can give you much, much higher accuracy than just messing with your model parameters.
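One common version of this idea is uncertainty sampling: have the model score the unlabeled examples, and send the ones it's least sure about to a human first. Here's a toy sketch (my illustration, not his actual system – the keyword "model" is a stand-in for a real classifier):

```python
# Toy uncertainty sampling: route the examples the model is least
# confident about to human annotators first.

def score_positive(text):
    # A crude stand-in "model": probability-of-positive in [0, 1]
    # based on keyword counts.
    positive = {"amazing", "great", "love", "fun"}
    negative = {"terrible", "awful", "hate", "boring"}
    words = set(text.lower().split())
    pos, neg = len(words & positive), len(words & negative)
    if pos + neg == 0:
        return 0.5  # no signal at all: maximally uncertain
    return pos / (pos + neg)

unlabeled = [
    "had an amazing time with my friends watching terrible cult movies today",
    "i love this",
    "what a boring awful movie",
    "the meeting is at noon",
]

# Most uncertain = score closest to 0.5; queue those for humans first.
queue = sorted(unlabeled, key=lambda t: abs(score_positive(t) - 0.5))
print(queue[0])
```

Notice that the mixed-signal sentence from above ("amazing" + "terrible") lands at the front of the queue: it's exactly the kind of example where a human label buys you the most.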

My absolute favorite thing about this talk was when he talked about the human/psychological aspects of using people to help you with classifications! If you’re writing a cat classifier and every single thing you show the human is not a cat, they’ll get bored and exhausted.

It made me think a lot about how, if you’re asking people to help you with a task, you need to

  1. make the task interesting
  2. make sure the people helping you out have a lot of impact on your classification accuracy
  3. make sure that they know how high their impact is, and show them how the model is improving!

Ted Dunning – generating fake datasets

This was a fun talk about simulating datasets to

  • prove that you’re right about the Monty Hall problem
  • debug a database bug when your client can’t give you the data that caused it
  • do machine learning on data you don’t have

Of these, the first two made the most sense to me – I had a much harder time imagining how it would be useful to do machine learning based on a simulated data set in real life, and I think I missed some of the explanation.
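The Monty Hall one is easy to try at home – a quick simulation (my sketch, not his code) shows that switching doors wins about 2/3 of the time:

```python
import random

def monty_hall_trial(switch, rng):
    car = rng.randrange(3)     # car hidden behind one of three doors
    choice = rng.randrange(3)  # contestant's initial pick
    # Host opens a door that is neither the pick nor the car (always a goat).
    opened = next(d for d in range(3) if d != choice and d != car)
    if switch:
        # Switch to the one remaining unopened door.
        choice = next(d for d in range(3) if d != choice and d != opened)
    return choice == car

rng = random.Random(0)
trials = 10_000
switch_wins = sum(monty_hall_trial(True, rng) for _ in range(trials))
stay_wins = sum(monty_hall_trial(False, rng) for _ in range(trials))
print(switch_wins / trials, stay_wins / trials)  # roughly 0.67 vs 0.33
```

Switching wins exactly when your first pick was wrong, which happens 2/3 of the time – and the simulation settles the argument without any probability theory.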

And he told us about a tool called log-synth that he wrote to generate fake datasets easily! I can pretty easily imagine myself using it to write ML tutorials :). It’s at https://github.com/tdunning/log-synth.

On reading the source code, not the docs

In my first programming job, I worked at a web development company and wrote modules for Drupal sites.

Every so often, I’d need to understand some specific Drupal edge case (say, how the text in the tabs on a search page was determined). And it wouldn’t be documented. And Stack Overflow didn’t have any answers about it, and my colleagues didn’t know. I like to think of this as an “oh right I actually need to know how to program” moment – nobody’s going to tell me what to do, I just need to figure it out.

The consultancy’s cofounder was Alex Dergachev, and he taught me a really important thing! Whenever I had a question like this, he’d tell me to just read Drupal’s source code, and then I’d know the answer. This is pretty obvious in retrospect, but it wasn’t obvious to me at the time.

I originally felt like Drupal was too complicated to understand (it’s a big codebase!), but it turned out that if I tried I could usually figure out what I needed to know.

This works particularly well when I understand pretty well what the code I’m using is doing, but am just unsure about some particular detail. For example “will this button appear for all users, or only admin users?” I use this all the time at work, and most of you probably do too! There are always details about code that aren’t exhaustively documented, and using grep to find answers is incredibly helpful.

But sometimes reading the code doesn’t work.

Programming doesn’t belong to men (it belongs to me)

One thing I’ve noticed since I started writing this blog is that I’ll get comments like these on my posts: (all these people are talking about me)

He’s only using the -Ofast gcc option, I wonder what he would get with -march=native -mtune=native which allows the compiler to use more instructions.

But more interestingly, why is he studying the amalgated C file instead of, you know, the sources?

I think he talks about the acutal .db file?

How would each machine/core’s counts know when they are done? If he wants a max of a million counts wouldn’t each machine have to check the counts of each other machine before continuing?

If you change the OP’s problem to summing a vector of floating point numbers, then even the way he has coded it there will still be differences from run to run.

One thing he doesn’t mention is the fear of looking stupid.

When this happens, when people implicitly assume that a Technical Thing On The Internet must be written by a man, I find it confusing. I didn’t grow up with the idea that I was worse at math or programming than the men around me (because, well, I wasn’t!). And I didn’t grow up with the idea that it was weird for me to write programs (why would it be?). And a huge number of the programmers I know and respect are women.

So the idea that programmers are all men or that programming is for men or that an article about the Linux kernel is probably written by a man just seems… silly to me. I feel like the community belongs to me, and like I’m a part of it.

And when people who have more power push marginalized people out of tech, I think, who do you think you are? Why do you think this belongs to you, and that you have the right to say who can come and who can’t?

Programming doesn’t belong to men, or to people who went to MIT, or to white people, or to English-speakers, or to Linux users, or to C programmers, or to people with CS degrees, or to people who are self-taught, or to people who can see, or to people who had easy access to computers when they were young. If you write programs, it belongs to you.

Fear makes you a worse programmer

in feelings

Yesterday morning, I asked on Twitter:

Does anyone have good writing about fear + programming (and how being afraid to make important changes makes you a worse programmer?)

and

I feel like there’s this really important line between caution (a++ excellent) and fear (which holds you back from doing necessary work)

A lot of super interesting discussion ensued, and I’d like to talk about some of it.

Before I start, Ryan Kennedy linked me to this slide deck of a presentation he gave called Fear Driven Development which I absolutely loved, and I think you should look at it. I think my favorite sentence from that presentation is “Fear creates local maximums.”

I find that when I’m afraid, I become super conservative. WE CANNOT POSSIBLY MAKE THIS CHANGE WHAT IF IT BREAKS?! And this means worse software! It’s actually kind of disastrous. If you’re scared of making changes, you can’t make something dramatically better, or do that big code cleanup. Maybe you can’t even deploy the code that you already wrote and tested, because it feels too scary. You just want to stick with what’s sort-of-working, even if it’s not great.

Reproducing awesomeness

I had a conversation with my friend Tavish once about people we respect, and how to think about them, and how to be like them. (Tavish, by the way, gave a SUPER INTERESTING talk at PyCon this year about what programmers can learn from software engineering research, and about how to measure what makes good software instead of just getting into flamewars.)

But about AMAZING PEOPLE. There are all kinds of people who I think are amazing. And Tavish suggested instead of being all “wow that person is so great they must be a genius and a wizard”, it’s better to ask “okay, what specifically do I admire about that person? Is that a way I want to be? How did they get like that? Can I do that too?”

And sometimes I can’t be like them, or I don’t want to! Like I might think that Terence Tao is an amazing mathematician, but it turns out I don’t actually want to be a mathematician. But I also like that he writes about his work in a public way, and that he tries to demystify math research. And demystifying and writing about things I find interesting in public are things I can do without trying to win the Fields medal :)

Or my friend Sumana is delightful in many ways, and one of the many things I like about her is that she’s very good at giving thoughtful compliments and criticism. And I’m learning that looking at someone’s work carefully and telling them what I think is good about it and what could be improved can be an incredibly helpful thing to do. So I’m trying to do more of that! And she’s also a better writer than I am, and she’s been writing a blog for 14 years, and probably if I write more I will also be a better writer than I am now.

So I feel these days like saying WOW THAT PERSON IS SO GREAT I COULD NOT POSSIBLY is kind of… being unfair. Because I could probably be a little more like them if I think it’s important! And it’s kind of awkward for people to be idolized like that, and it’s better (for myself, and for the not-idolized-person), to just figure out what parts of them I’d like to be more like, and try some things out, and see what works.

And perhaps tell them sometimes that I think they are doing a Very Good Job of doing what makes them awesome.

(edit: There was a pretty huge omission in the original version of this post, which has been bothering me. There’s a lot of privilege involved in saying “hey that awesome thing? I can just go ahead and do that!”. I have this huge abundance of support and free time and money and amazing people who are happy to help me, and it makes it so much easier to do things that I think are interesting or important.)

Diving into concurrency: trying out mutexes and atomics

in concurrency

I hadn’t written any threaded programs before yesterday. I knew sort of abstractly about some concurrency concepts (mutexes! people say compare-and-swap but I don’t totally get it!), but actually understanding a Thing is hard if I’ve never done it. So yesterday I decided to write a program with threads! In this post, we’re going to:

  1. Write a threaded program that gets the wrong answer because of a race condition
  2. Fix that race condition in C and Rust, using 2 different approaches (mutexes and atomics)
  3. Find out why Rust is slower than C
  4. Talk a little about the actual system calls and instructions that make some of this work
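As a taste of step 2, here's the mutex version sketched in Python rather than the C and Rust the post uses (my illustration only): without the lock, `counter += 1` is a read-modify-write that threads can interleave, losing updates; with it, the total always comes out right.

```python
import threading

counter = 0
lock = threading.Lock()
INCREMENTS = 25_000

def work():
    global counter
    for _ in range(INCREMENTS):
        with lock:       # hold the mutex for each read-modify-write
            counter += 1

# Four threads each add 25,000 to the shared counter.
threads = [threading.Thread(target=work) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # always 100000 with the lock held
```

Delete the `with lock:` line (and dedent the increment) and the final count can come out below 100000 – that's the race condition in step 1.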

Spying on Hadoop with strace

in hadoop, strace

As you may already know, I really like strace. (It has a whole category on this blog). So when the people at Big Data Montreal asked if I wanted to give a talk about stracing Hadoop, the answer was YES OBVIOUSLY.

I set up a small Hadoop cluster (1 master, 2 workers, replication set to 1) on Google Compute Engine to get this working, so that’s what we’ll be talking about. It has one 14GB CSV file, which contains part of this Wikipedia revision history dataset.

Let’s start diving into HDFS! (If this is familiar to you, I talked about a lot of this already in Diving into HDFS. There are new things, though! At the end of this we edit the blocks on the data node and see what happens and it’s GREAT.)

LD_PRELOAD is super fun. And easy!

On Monday I went to Hacker School, and as always it was the most fun time. I hung out with Chase and we had fun with dynamic linkers!

I’d been hearing for a while that you can override arbitrary function calls in a program using an environment variable called LD_PRELOAD. But I didn’t realize how easy it was! Chase and I started using it and we got it working in, like, 5 minutes!

I googled “LD_PRELOAD hacks”, clicked on Dynamic linker tricks: Using LD_PRELOAD to cheat, inject features and investigate programs, and we were up and running.

The first example on that page has you write a new random function that always returns 42.

int rand(){
    return 42; //the most random number in the universe
}

That is LITERALLY ALL THE CODE YOU HAVE TO WRITE. Then you

gcc -shared -fPIC unrandom.c -o unrandom.so
export LD_PRELOAD=unrandom.so

and now every program you run will always return 42 for rand()!

PyData NYC (I gave a machine learning talk! yay!)

in conferences, machinelearning, talks

This past weekend I went to PyData NYC. It was super fun! I got to meet some people who do machine learning in different fields than me (THE BEST) and a guy at the bar told me about his classmates’ adventures in hacking login systems in Novell Netware. (Also the best. It turns out that one of the nice things about having a blog is that people occasionally come up to me at conferences and tell me GREAT THINGS).

And I gave my first-ever machine learning talk! I often feel like the main (only?) thing I’ve learned about machine learning so far is how important it is to evaluate your models and make sure they’re actually getting better. So I gave a talk called “Recalling with precision” about that (a terrible pun courtesy of @avibryant).

The basic structure:

  1. You’re building models, and you do lots of experiments
  2. You want to remember all of the results forever, so that you can tell if you’re improving
  3. Some arguments for why it’s important to evaluate whether your models actually work (for an example of what can go wrong, see The Value Added Teacher Model Sucks).
  4. “Just remember everything forever” is much easier said than done
  5. So let’s describe a lightweight system to actually remember some things forever! (basically: store training results in S3 forever; make a webapp to draw lots of graphs using those results)
  6. It exists! It’s open source. http://github.com/stripe/topmodel

(It’s still pretty new. If you try it and have thoughts, let me know?)
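The “store results forever” idea in step 5 fits in a few lines – here's a toy illustration (my sketch, not how topmodel actually works): compute precision and recall for a run, then append the record to an append-only log you never delete.

```python
import json
import os
import tempfile
import time

def precision_recall(y_true, y_pred):
    # Standard definitions for binary labels (1 = positive).
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def log_run(path, model_name, y_true, y_pred):
    p, r = precision_recall(y_true, y_pred)
    record = {"model": model_name, "precision": p, "recall": r,
              "timestamp": time.time()}
    with open(path, "a") as f:  # append-only: never lose old results
        f.write(json.dumps(record) + "\n")
    return record

log_path = os.path.join(tempfile.mkdtemp(), "experiments.jsonl")
rec = log_run(log_path, "model-v1", [1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
print(rec["precision"], rec["recall"])
```

With every run appended to one file, "am I actually improving?" becomes a question you can answer by graphing the log, which is the whole point of the talk.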

People asked lots of questions, so I think it may have been useful. It’s hard to tell, with talks :)

Some things I learned at PyData: (links largely taken from Michael Becker’s great wrap-up post)