Challenge: find Twitter memes with suffix arrays

This challenge is a mix of data analysis and using fun algorithms! It’s the second challenge in a a short series of programming challenge I’m writing with Julian. (the first one was to write a tiny fun window manager)

Twitter has a lot of memes. For example, if you search Twitter for Flight attendant: is there a doctor on this flight?, you’ll find a bunch of tweets making jokes like this:

Flight Attendant: is there a doctor on board?
Parent: *nudging* That should've been you
Me: Not now, this is serious
Parent: Not asking for a hacker to help, are they?
Me: AAAAAAAA\x00\xd0X?\xfc\x7fBBBBj\x0bX\x99Rfh-p\x89\xe1Rjhh/bash/bin\x89\xe3RQS\x89\xe1\xcd\x80
Parent:~#

or if you search as a kpop fan there are thousands of these:

me as a kpop fan 

- kpop fan age: 10 years
- first group ever stan: super junior
- current ult groups: iKON, X1, Day6
- number of albums: >20 
- concerts attended: 6
- lightsticks owned: 2

So! Suppose you have a million tweets from the last 2 days. How do you find the jokes / quizzes / memes people are playing with on Twitter?

Challenge: find the twitter memes in 1 million tweets

This is a pretty open ended challenge and you can do it any way you want. Here’s a SQLite database with 1.2 million tweets, collected from the twitter streaming api over 2 days. It’s 250MB (70MB compressed), it only has English tweets. It excludes retweets and many tweets that are generated by bots.

The challenge: find at least 5 Twitter memes using that dataset.

memes as common substrings

The idea here is that memes are substrings like “me as a kpop fan” that many different people are using. The tricky thing is that you don’t really know how long those substrings will be, and maybe you’re interested in phrases of different lengths.

You can probably do this challenge without using anything fancy (with a hashmap of phrases or something) but I think it’s a nice opportunity to play with a fun data structure: suffix arrays! So let’s talk about what those are.

suffix arrays: sort all suffixes

Suffix arrays sort all suffixes of a string. For example, here’s the suffix array for “plantain” which has the suffixes plantain, lantain, antain, ntain, tain, ain, in, n.

ain
antain
in
lantain
n
ntain
plantain
tain

Representing this as a list of strings would be very inefficient (quadratic space), so instead we replace each suffix with the index of its first character in the original string – [5,2,6,1,7,3,0,4].

5 (ain)
2 (antain)
6 (in)
1 (lantain)
7 (n)
3 (ntain)
0 (plantain)
4 (tain)

Here’s a real example of what a suffix array of 1 million tweets concatenated looks like. This is an excerpt from the middle of the suffix array, with some of the suffixes that start with A little.

...
 A little distracted for a bit ...what do i do w my life hon.........
 A little exercise I did this afternoon.  #comics #art #clip.........
 A little extra Christmas Cash on me! Good Luck to everyone!.........
 A little girl in Savannah, Ga., appears to be the 38th huma.........
 A little heavy on the smut t… https://t.co/nvoxE7SNjTI wa.........
 A little in state battle tonight. #nova vs #penn. two very .........
 A little kiss...” one more time I’m going to vomit. #TT.........
 A little late catching up on last nights @GoodDoctorABC. On.........
 A little less bling never hurt anyone! Next project...🎄 .........
 A little more intensity to augment their talent and a coupl.........
 A little more time, because I have never lived really  - Os.........
 A little mor… https://t.co/kcq3zf9jgeWe love MX ❤️<F0><9F><A7>.........
 A little over 50k! Can We Guess How Much Is In Your Account.........
 A little ray of joy &amp; light in the midst of these very .........
 A little refreshment… https://t.co/HgX8PmYwPIThank you @L.........
 A little respect goes a long way. .........
 A little salt in d country's troubled legal system“Grant................
 A little snow &amp; people lose all common senseromantic st...............
 A little sun for the soul @realfreewebcams https://t.co/3CB...............
 A little sunkissed moment for y’all. ...............
 ....

Again, this is actually represented by a bunch of integer indexes into a concatenated string of all the tweets, like [18238223, 1921812, ...] so it’s a LOT more memory efficient than actually repeating all those strings.

suffix arrays let you find common substrings!

So what does this have to do with Twitter memes? Well, we can basically

concatenate all tweets into a big string
make a suffix array of that string
iterate through the suffix array and notice when you see a lot of repeated substrings, like here:

me as a kpop fan ✨kpop fan age: 15 y/o ✨first group ever stan: blackpink ✨current ult groups: btxt ✨number of albu… https://t.co/24diHX9sLm
me as a kpop fan ⭐k-pop fan age: 12 y/o ⭐first group ever stan: bts ⭐current ult gps: bts and txt ⭐number of albu… https://t.co/8R95roQXoE
me as a kpop fan ⭐k-pop fan age: 14 y/o ⭐first group ever stan: girls generation ⭐current ult gp: txt ⭐number of a… https://t.co/010hLuJscF
me as a kpop fan ⭐k-pop fan age: 14-16 y/o ⭐first group ever stan: bts ⭐current ult gps: bts txt ⭐number of albums… https://t.co/0fDcxZGRrh
me as a kpop fan ⭐k-pop fan age: 15 y/o ⭐first group ever stan: blackpink ⭐current ult gps: txt ⭐number of albums… https://t.co/d8zZL83TvV
me as a kpop fan 🌸 k-pop fan age: 12 years old 🌸 first group ever stan: bts 🌸 current ult gps: bts &amp; wanna one 🌸 n… https://t.co/22R1nJpwNX
me as a kpop fan 🌸k-pop fan age: 10 🌸first group ever stan: 2pm 🌸current ult gps: skz,got7,itzy,twice, 🌸number of… https://t.co/mAluaP2yxH
me as a kpop fan 🌸k-pop fan age: 11 yo 🌸first group ever stan: beast 🌸current ult gps: ateez 🌸number of albums:  1… https://t.co/qxtFHG9HDg
me as a kpop fan 🌸k-pop fan age: 11 🌸first group ever stan: bts 🌸current ult gps: bts and ateez 🌸number of albums:… https://t.co/mKXlkrBBtC
me as a kpop fan 🌸k-pop fan age: 13 (now im 19) 🌸first group ever stan: snsd 🌸current ult gps: nct day6 aoa mamam… https://t.co/8XyQ3r5hwz
me as a kpop fan 🌸k-pop fan age: 13 years 🌸first group ever stan: 2pm,suju,bigbang 🌸current ult gps: bts,tbz,ateez… https://t.co/Zs1nQQz6Lt
me as a kpop fan 🌸k-pop fan age: 14 (2005) 🌸first group ever stan: super junior 🌸current ult gps: exo, gfriend, rv… https://t.co/vgmhe2vFMY
me as a kpop fan 🌸k-pop fan age: 14 y/o 🌸first group ever stan: nct dream 🌸current ult gps: svt and,,*insert stan… https://t.co/I38Ui69PvL
me as a kpop fan 🌸k-pop fan age: 15 y/o 🌸first group ever stan: 5sos 🌸current ult gps: bts and 5sos also some ggs… https://t.co/61ZmRkzmdl
me as a kpop fan 🌸k-pop fan age: 15 y/o 🌸first group ever stan: bts 🌸current ult gps: SVT, GOT7, Day6 🌸number of… https://t.co/16SWb3mSPg
me as a kpop fan 🌸k-pop fan age: 18 🌸first group ever stan: suju &amp; soshi 🌸current ult gps: snsd &amp; izone 🌸number of… https://t.co/SmSBFqJnGk
me as a kpop fan 🌸k-pop fan age: 19 y/o marupok 🌸first group ever stan: APINK 🌸current ult gps: SEVENTEEN 🌸number… https://t.co/StYjxr6uq9
me as a kpop fan 🌸k-pop fan age: 19 🌸first group ever stan: SuJu 🌸current ult gps: SuJu, SF9, SKZ, VIXX, ONEUS, NO… https://t.co/2o2DulCY5b

suffix arrays also enable really fast search

As an aside, the reason I got interested in suffix arrays in the first place was actually not for finding Twitter memes at all but for search.

I’ve spent a lot of time using Nelson Elhage’s livegrep at work to search code. It creates a suffix array using the divsufsort library. He has a blog post Regular Expression Search with Suffix Arrays where he talks about some of the implementation details.

The reason suffix arrays work for fast search is basically that if you’re looking for the string A little, you can do a binary search over the suffix array to find every instance of A little in your dataset. Binary searches are extremely fast so every search is guaranteed to run very quickly (in less than a microsecond I believe). What livegrep does is more complicated than that because it does a regular expression search, but that’s the idea to start.

There’s another blog post How to use suffix arrays to combat common limitations of full-text search applying suffix arrays to searching through a patent database. In that example, like with code search, the patent officers want to search patents for exact strings.

How do you make a suffix array?

You can use an existing suffix array library, for example index/suffixarray in Go, which is what I used, or divsufsort. There are Python bindings for divsufsort.

If you’re more excited about the data structures/algorithms aspect of suffix arrays you can also implement a suffix array-building algorithm yourself! I did not do this but you can see an implementation of qsufsort here in Go. That implementation links to a paper. There are lots of algorithms for constructing suffix arrays –sais and divsufsort are a couple of others.

5 or so hours, 100 lines of Go

As always with these challenges, I did this one to make sure that it’s both doable in a reasonable amount of time and fun for at least one person (me).

I did this one in about 5 hours and 100 lines of Go using the suffixarray implementation in the Go standard library, with a bit of bash shell scripting to postprocess the results. This is a messy data analysis challenge – as an example of a messy thing, Spotify released their end-of-2019 results while I was building the dataset and so there are a lot of tweets generated by the Spotify app.

My results ended up looking something like this:

5  an Aries and that’s why I gotta 
5  an Aries and that’s why I am so 
5  an Aquarius and that’s why I 
5  AM SO PROUD OF YOU 
5  am not a fan of 
5  am I the only one who 
5  am going to have to

Then I sifted through them pretty manually to find the Twitter memes.

suffix arrays are used in bioinformatics

This “find twitter memes using suffix arrays” approach is a silly thing but it does have some relationship to reality – DNA sequences are basically really long strings, and biologists need to find patterns in them, and they sometimes use suffix arrays to do it.

I looked up packages in Debian that use libdivsufsort and I found infernal:

Infernal (“INFERence of RNA ALignment”) is for searching DNA sequence databases for RNA structure and sequence similarities. It is an implementation of a special case of profile stochastic context-free grammars called covariance models (CMs). A CM is like a sequence profile, but it scores a combination of sequence consensus and RNA secondary structure consensus, so in many cases, it is more capable of identifying RNA homologs that conserve their secondary structure more than their primary sequence.

that’s all!

Thanks to Julian for discussing suffix arrays and suffix trees and trigram indexes with me at length, and to Kamal who had the idea of using suffix arrays to find Twitter memes.