Hi friends! For the first time today I'm going to tell you about my DAY AT WORK (machine learning at Stripe) yesterday. =D. This is a collaboration with Kamal Marhubi who did this profiling with me after work because I was so mad about the performance.
I used to think a millisecond was fast. At work, I have code that runs some VERYLARGENUMBER of times. It's distributed and split up into tasks, and an individual task runs the code more than 6 million times.
I wrote a benchmark for the Slow Code and found it could process ~1000 records/s. This meant that processing 6 million things would take 1.5 hours, which is Slow. The code is kind of complicated, so originally we all thought this was a reasonable amount of time. But my heart was sad.
Yesterday Avi (who is the best) and I
looked at why it was so damn slow (~1 millisecond/record) in some more
depth. This code is open source so I can show it to you! We profiled using
VisualVM and, after doing some optimizations, found out that it was
spending all its time in
DenseHLL$x$6. This is mystery Scala speak for
this code block from Twitter's Algebird library that estimates the size of a HyperLogLog:
This is a little inscrutable and I'm not going to explain what this code
arrSize in my case is 4096. So basically, we have
something like 10,000 floating point operations, and it takes about 1ms
to do. I am still new to performance optimizations, but I discussed it
with Kamal and we decided it was outrageous. Since this loop is hardly
doing anything omg, the obvious target is
-mj), because that looks like the hardest thing. (note: Java is pretty fast. if you are doing normal operations like adding and multiplying numbers it should go REALLY FAST. because computers are fast)
(note: Latency Numbers Every Programmer Should Know is great and useful in cases like this! Many CPU instructions take a nanosecond or something. so 10K of them should be on the order of 10 microseconds or so. Definitely not a millisecond.)
Kamal and I tried two things: replacing
Math.pow(2, -mj) with
1.0 / (1 << mj), and writing a lookup table (since
mj is a byte and has
256 possible values, we can just calculate 2^(-mj) for every possible
value up front).
The final performance numbers on the benchmark we picked were:
So we can literally make this code 100 times faster by just changing one line. Avi simultaneously came to the same conclusions and made this pull request Speed up HLL presentation by 100x. Hooray!
I'm learning intuitions for when code is slower than it should be and it is THE BEST. Being able to say "this code should not take 10s to process 10,000 records" is amazing. It is even more amazing when you can actually fix it.
If you're interested in the rest of my day at work for some reason, I
- worked with someone on understanding which of our machine learning models are doing the most work for us
- wrote 2 SQL queries to help someone on the Risk team find accounts with suspicious activity
- wrangled Scala performance (this) so that we can generate training sets for our machine learning models without tearing our hair out