The Nteresting Nnovation of Google Ngrams

If you're unfamiliar with the term or concept of ngrams in general or Google Ngram Viewer in particular, a look at it in action is the best explanation:



This shows how often the words "overrated" and "underrated" appear in Google Books from 1800 to 2008 -- sort of. There are a few caveats, which Google is upfront about (although I wish they'd post a précis of the database's shortcomings, and of the main erroneous conclusions that can be drawn from them, on the main page of the Ngram Viewer). I'll get into the unique problems of computerized curation of a dataset so huge it comprises 6% of all the books in existence (so they claim; it depends how you count them, but it's a defensible number).

So as the title says, what the heck is an ngram? Well, what you see above are 1grams. If I look up phrases, they become 2grams (or bigrams), 3grams (trigrams), 4grams, 5grams (not to be confused with pentagrams). Some fascinating things can be revealed by searching for multiword units; we'll look at them in later blog posts.
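
To make the idea concrete, here's a minimal sketch of pulling ngrams out of a sentence in Python. The function name `ngrams` and the lowercase, split-on-whitespace tokenizing are my own simplifications for illustration; I'm not claiming this is how Google builds its dataset.

```python
def ngrams(text, n):
    """Return the n-grams in text as a list of tuples of words."""
    # Simplifying assumption: lowercase everything and split on whitespace only.
    words = text.lower().split()
    # Slide a window of n words across the sentence, one word at a time.
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


sentence = "the cat jumped onto the table"
print(ngrams(sentence, 1))  # 1grams: single words
print(ngrams(sentence, 2))  # 2grams (bigrams): adjacent word pairs
```

A sentence of six words gives six 1grams but only five 2grams, since each 2gram needs a neighbor to its right.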

You have to be careful what conclusions you draw: from the above graph, could you say people were more pessimistic in 1850? No, we haven't run the proper controls: for instance, are there synonyms for "overrated" that took over in 1900? Are there certain kinds of books overrepresented in the database that are more likely to use these terms? Google published a paper with some interesting results (such as the effects of Nazi censorship), but they had the resources to run verifiable control experiments.

Still, it's an interesting database, and one I find myself turning to a lot. Just as there are those who pore through Google Street View to find oddities like people wearing horse head costumes, I do the same with Google Ngram Viewer. I don't like Google's presentation, though, so I wrote a script to automatically import results into Python and create prettier graphs (that use per million instead of per cent so you don't have all those leading zeroes, for one):


That's a dramatic rise for "onto the". What could it possibly mean? Well, I'll tell you... later.
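
In case the per-million rescaling mentioned earlier isn't obvious, here is a rough sketch of the arithmetic. The function name and the sample values are made up for illustration; they are not real Ngram Viewer numbers and this is not my actual script.

```python
def percent_to_per_million(percent):
    """Convert a frequency given in per cent to occurrences per million."""
    # 1% means 1 in 100, which is 10,000 in 1,000,000.
    return percent * 10_000


# Hypothetical percentages, invented for this example only.
sample_percents = [0.00012, 0.00045, 0.0010]

for p in sample_percents:
    # Rounding just keeps floating-point noise out of the printout.
    print(f"{p}% is about {round(percent_to_per_million(p), 2)} per million")
```

The point is only that a value with four leading zeroes in per cent becomes a small, readable number once it's expressed per million.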

3 comments

Cool stuff! Can the Ngrams be embedded directly to your site instead of the images? I wonder if they offer an API...


I'm sorry the comments aren't working in my new redesign. I'm working on it!


And now they're working! It was rather useless of me to write the above comment before I fixed the comments. Ah well.

