This post is basically a list of books & other resources that teach statistics using programming. But first I wanted to say why I think that is important! You can skip further down if you’re already sold.
Statistics can sometimes seem boring and difficult to understand. When I start reading statistics textbooks, I read about
- the normal distribution
- chi-squared tests
- the central limit theorem
I know what some of those things are (I have a math degree, after all, and so I have some idea of what the central limit theorem says). But I often don’t find them that useful for solving my day to day statistics problems.
Like, here are some questions I sometimes have about data
- I measured some performance numbers, made a change, and then measured some new numbers. Did my change really make a difference? (“hypothesis testing”)
- I have a bunch of numbers and I want to know what the average is, and I want to know how seriously I should take that average (is it 6 +/- 1? or 6 +/- 0.01?) (“confidence intervals”)
One of the biggest problems with classical tests like the chi-squared test is that they make a lot of assumptions about how your data was generated. Many of them assume that your data (or something computed from it) is normally distributed. Not everything follows a normal distribution!
So – can I figure out if my change really made my code faster or not without having to make a bunch of assumptions?
It turns out the answer is “yes”, and that there’s a whole subfield of statistics devoted to making fewer assumptions (“nonparametric statistics”). And even better – that subfield is actually easier to use than regular statistics.
Some of the methods:
- bootstrapping (which lets you calculate a mean + error bars for that mean!)
- shuffling your data (which lets you tell if 2 groups of numbers “really” have a different mean or not)
These methods are often really computational – like instead of using a bunch of formulas, you’ll write a program. And you’ll get statistically valid answers back! I like this because even though I know a lot of math, I often find programs more intuitive than formulas.
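To make the bootstrapping idea concrete, here’s a minimal sketch in Python using only the standard library. The function name `bootstrap_mean_ci`, the number of resamples, and the timing numbers are all made up for illustration, and it uses the simple “percentile” flavour of the bootstrap (there are fancier variants). The idea: resample your data with replacement lots of times, take the mean each time, and look at how much those means wiggle.

```python
import random
import statistics

def bootstrap_mean_ci(data, n_resamples=10_000, confidence=0.95, seed=0):
    """Percentile bootstrap interval for the mean of `data` (illustrative sketch)."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    n = len(data)
    # Resample with replacement, same size as the original data,
    # and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(data, k=n)) for _ in range(n_resamples)
    )
    # Chop off the tails: for 95% confidence, drop the lowest and highest 2.5%.
    tail = (1 - confidence) / 2
    lo = means[int(tail * n_resamples)]
    hi = means[int((1 - tail) * n_resamples) - 1]
    return statistics.fmean(data), lo, hi

# Made-up timing measurements in milliseconds (not real data).
timings = [5.1, 6.3, 5.8, 7.2, 6.0, 5.5, 6.9, 6.1, 5.7, 6.4]
mean, lo, hi = bootstrap_mean_ci(timings)
print(f"mean = {mean:.2f}, 95% interval = [{lo:.2f}, {hi:.2f}]")
```

The width of that interval is the “error bars” part: a narrow interval means you can take the average pretty seriously, a wide one means you probably need more data.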
why program instead of using formulas?
A lot of the formulas you can use to do statistics make a lot of assumptions, and in exchange they let you quickly calculate the statistical thing you want (like a chi-squared test or whatever). This was necessary when people had really limited computational resources (like, they had a book with tables in it and a pen and paper).
But today we have computers! So we can use really dramatically different statistical methods than people used in the 19th century. And often you can make fewer assumptions, which can be really good!
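For example, the “did my change really make a difference?” question can be answered by shuffling instead of by looking something up in a table. Here’s a rough sketch of a shuffle (permutation) test in Python, standard library only. The name `shuffle_test` and the benchmark numbers are invented for illustration. The logic: if the change did nothing, the “before” and “after” labels are meaningless, so shuffling them shouldn’t matter. We shuffle many times and count how often a random relabelling gives a difference in means at least as big as the one we actually saw.

```python
import random
import statistics

def shuffle_test(before, after, n_shuffles=10_000, seed=0):
    """Two-sided permutation test for a difference in means (illustrative sketch)."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    observed = statistics.fmean(after) - statistics.fmean(before)
    pooled = list(before) + list(after)
    n_before = len(before)
    extreme = 0
    for _ in range(n_shuffles):
        # Shuffle the pooled data, then split it into two fake groups
        # with the same sizes as the real ones.
        rng.shuffle(pooled)
        diff = statistics.fmean(pooled[n_before:]) - statistics.fmean(pooled[:n_before])
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add one to both counts so the estimated p-value is never exactly 0.
    p_value = (extreme + 1) / (n_shuffles + 1)
    return observed, p_value

# Made-up benchmark times in milliseconds (not real data).
before = [102, 98, 105, 110, 99, 101, 104, 107]
after = [95, 92, 99, 97, 90, 94, 96, 98]
observed, p_value = shuffle_test(before, after)
print(f"difference in means = {observed:.3f} ms, p = {p_value:.4f}")
```

A small p-value says that random relabellings almost never produce a gap this large, which is evidence that the change really did something. Notice that nothing here assumed the timings were normally distributed.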
Anyway, I asked for recommendations for “nonparametric statistics for programmers” resources on Twitter and I got a lot of good recommendations back. Here they are. I haven’t
some good “statistics for programmers” resources
statistics without the agonizing pain
This is a really nice 10 minute talk about how to do statistics using programming.
statistics for hackers, by jake vanderplas
This talk from PyCon 2016 is exactly the kind of intro to nonparametric methods I’m talking about!! It has a slide deck which is good to read by itself. It introduces shuffling & bootstrapping which I think are two of the most important statistics methods to know.
nonparametric stats with R
This is the best thing I found so far that actually explains these nonparametric methods in an introductory way with programming.
It’s an online textbook that teaches basic nonparametric statistics with R: An Introduction to Statistical and Data Sciences via R.
the two chapters i found most useful to look at were
- hypothesis testing
- how to calculate confidence intervals using bootstrapping which is a great thing to be able to do.
all of nonparametric statistics by Larry Wasserman
Now we’re going to veer away from nonparametric stats and into statistics books for programmers generally.
Allen Downey’s work
Allen Downey wrote this great textbook manifesto and his work looks really approachable. All of his books are available online for free which is a really lovely thing.
not so much nonparametric statistics but I hear really good things
Someone also mentioned:
Allen is a great teacher, so it’s worth watching (or, even better, attending) his tutorials as well as reading the books.
I remember watching a statistics talk Allen gave at PyCon a few years back and being really impressed.
introduction to probability by Peter Norvig
probabilistic programming & bayesian methods for hackers
Probabilistic Programming & Bayesian Methods for Hackers by Cam Davidson-Pilon is a cool introduction to bayesian methods with a lot of calculations.
even more links
- a paper someone said was good (by Efron): Bootstrap Methods: another look at the jackknife
- this book by 5 people named lock
- this blog post has an overview of different nonparametric tests
- this podcast with Philip Guo and John DeNero where they talk about teaching stats to programmers
- nonparametric statistical methods
- openintro has some free statistics books
tell me if you have more cool recommendations
If there is an amazing book that teaches statistics with programming that I left out I would like to know about it! I’m on twitter at @b0rk.