I love fun programming tutorials, and I love the Jupyter notebook for showing how to do cool Python stuff. So I was really happy this morning when I saw Allison Parrish (who makes a lot of delightful computer-generated language art) post these tutorials she’s written (which mostly use the Jupyter notebook) about how to parse and generate English text!
First, some links to cool stuff Allison has done:
- her awesome website with a billion cool links
- Her !!Con talk “lossy text compression, for some reason?!?!” (which is basically about using JPEG compression to compress text, with weird and wonderful results. It’s 10 minutes, watch it, really)
- The Ephemerides is a lovely Twitter bot that posts computer-generated poems and pictures from space
- everyword tweeted every word in the English language
- An awesome transcript of “Exploring (Semantic) Space With (Literal) Robots”, her talk about computer-generated poetry.
- A game called rewordable that I want to buy
And now the tutorials! To start, there’s a basic intro to working with CSV files in Python (which is extremely useful, but I know that).
Here are the links to the 4 tutorials I was really excited about if you just want the links and don’t care what I have to say about them :)
First! Suppose you want to generate random text, like “I’m a banana, not a cucumber”. You could do this by writing something like
"I'm a %s, not a %s" % ("banana", "cucumber"), but you’ll run into problems fast because it’s “I’m an apple”, not “I’m a apple”.
It turns out that there’s a cool library called Tracery to help you with text generation. Allison has 2 cool tutorials about Tracery:
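To get a feel for the idea, here is a toy grammar expander in plain Python. It is not the Tracery library and not code from Allison's tutorials: the `#symbol.modifier#` tag syntax mirrors the one Tracery uses, but `a_or_an`, `MODIFIERS`, `expand` and the example rules are all names I made up for this sketch, and the article check is a rough first-letter guess.

```python
import random
import re


def a_or_an(word):
    """Naive article picker: "an" before a vowel letter, otherwise "a"."""
    # Rough assumption: this gets words like "hour" or "unicorn" wrong.
    return ("an " if word[0].lower() in "aeiou" else "a ") + word


# Toy modifiers, keyed by the name written after the dot in "#symbol.modifier#".
MODIFIERS = {"a": a_or_an}


def expand(rules, text, rng=random):
    """Replace every #symbol# or #symbol.modifier# tag with a random expansion."""
    def replace(match):
        symbol, _, modifier = match.group(1).partition(".")
        # Pick one option for the symbol, then expand any tags inside it.
        result = expand(rules, rng.choice(rules[symbol]), rng)
        return MODIFIERS[modifier](result) if modifier else result

    return re.sub(r"#([^#]+)#", replace, text)


rules = {
    "origin": ["I'm #fruit.a#, not #vegetable.a#"],
    "fruit": ["banana", "apple", "orange"],
    "vegetable": ["cucumber", "onion", "turnip"],
}

print(expand(rules, "#origin#"))
```

Each run prints a random sentence, and the article is chosen after the word is picked, so “apple” gets “an” and “banana” gets “a”. This toy version only handles a single modifier per tag.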
Parsing text with spaCy
The next tutorial is NLP concepts with spaCy. Basically you can take a sentence or paragraph and parse it to figure out what it means! Some examples of stuff she explains how to figure out:
- Where the sentences are
- Whether a word is a verb or a noun or what
- More complicated grammar constructs like “prepositional phrases” (‘with reason and conscience’, ‘towards one another’)
She linked to some examples of how to use spaCy. I ran the “what they’re doing” example on Pride and Prejudice and it wrote out:
Hurst is returning
Bingley is blaming
Collins is coming
Darcy is viewing
Bingley is providing
Wickham is caring
Darcy is viewing
Lady is remaining
Hill is coming
So it seems to have done a good job of identifying the characters in Pride and Prejudice! Neat!
Previously the NLP library I’d heard about was NLTK, and she has this very useful note in the tutorial:
(Traditionally, most NLP work in Python was done with a library called NLTK. NLTK is a fantastic library, but it’s also a writhing behemoth: large and slippery and difficult to understand. Also, much of the code in NLTK is decades out of date with contemporary practices in NLP.)
Understanding word vectors
Ok, the next tutorial is Understanding word vectors.
The cool thing I learned from this is that you can programmatically “average” words like ‘day’ and ‘night’ to end up with ‘evening’! You can also figure out which animals are similar and all kinds of really cool stuff. I didn’t know that you could do this! If you want to know more, you should read the excellent tutorial.
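Here's a tiny sketch of what “averaging” words could look like in plain Python. The 3-number vectors are completely made up so that the example works out (real word vectors, like the ones the tutorial uses, are learned from text and have many more dimensions), and `average`, `cosine` and `closest` are hypothetical helper names, not functions from any library.

```python
import math

# Made-up 3-dimensional vectors, purely for illustration.
toy_vectors = {
    "day": [1.0, 0.0, 0.2],
    "night": [0.0, 1.0, 0.2],
    "evening": [0.5, 0.6, 0.2],
    "banana": [0.1, 0.0, 1.0],
}


def average(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(parts) / len(vectors) for parts in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity: close to 1.0 when two vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def closest(target, vectors, exclude=()):
    """Return the word whose vector is most similar to the target vector."""
    candidates = [word for word in vectors if word not in exclude]
    return max(candidates, key=lambda word: cosine(target, vectors[word]))


midpoint = average([toy_vectors["day"], toy_vectors["night"]])
print(closest(midpoint, toy_vectors, exclude=("day", "night")))
```

With these toy numbers it prints `evening`, because I picked that vector to sit between ‘day’ and ‘night’. The interesting part with real vectors is that nobody has to pick them by hand.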
Fun building blocks for doing text experiments!
I think these 3 things (Tracery for generating sentences, spaCy for parsing text, and spaCy (again) for seeing which words are similar to each other) seem like a super awesome way to get started with playing with text!