Culturomics meets Darth Vader


A few days ago, Google introduced a site they called Culturomics in which users can enter two or more terms and get a graph of their frequency of occurrence within the Google Books archives. Depending on your word selection, you can get some interesting results. For instance, the verb form ain't has been attested since 1840, with the now "correct" isn't not appearing widely until 1840.

I can tell this service has captured the popular imagination, since some sci-fi fans have put in some genre-themed word pairs. Apparently Darth Vader IS more popular than Luke Skywalker. However, there are some issues to consider.

One critique comes from Mark Davies of the Corpus of Historical American English (COHA). One feature of the COHA interface that the Google n-gram tool lacks is a list of frequently co-occurring words (or "collocates"). For instance, in 1900, the word gay may frequently occur alongside words such as "happiness", "light" and "carefree". Today it is much more likely to co-occur with words such as "rights" or "marriage" (especially in news articles).
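To make the idea of collocates a little more concrete, here is a minimal Python sketch of how one might count the words that appear near a target word. This is only an illustration and not how COHA itself works: the function name collocates, the window size of two words, and the four sample sentences are all my own invented assumptions, not real corpus data.

```python
from collections import Counter


def collocates(sentences, target, window=2):
    """Count words found within `window` tokens of `target` (hypothetical helper)."""
    counts = Counter()
    for sentence in sentences:
        # Naive tokenization: lowercase and split on whitespace.
        tokens = sentence.lower().split()
        for i, token in enumerate(tokens):
            if token != target:
                continue
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    counts[tokens[j]] += 1
    return counts


# Invented toy sentences for illustration only, NOT taken from any corpus.
older = ["a gay and carefree party", "her gay light laughter"]
newer = ["the gay rights movement", "a vote on gay marriage"]

print(collocates(older, "gay").most_common(3))
print(collocates(newer, "gay").most_common(3))
```

A real corpus tool would also filter out function words like "a" and "the" and would rank collocates with a statistical measure rather than raw counts, but the basic idea of looking at a window of neighboring words is the same.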

Davies also notes that the Google tool doesn't yet distinguish between parts of speech or differences in usage/meaning, and this can be very important. For instance, a chart of twitter shows a peak circa 1900, but at that time it referred to sounds a bird might make (or perhaps the sound of gossipy chit chat). Today it generally refers to the Twitter service, but the tool gives no way to distinguish the two uses.

Similarly, the tool also doesn't allow you to view the types of passages in which a word occurs. For instance, the word ain't continues to be found in written text, but it may be that after 1840, the context for a lot of the uses is in writing guides saying to AVOID "ain't". That makes a difference in how to analyze usage of "ain't" over time.

That's not to say that there is no use to the Culturomics tool, especially if the terms are very specific and unambiguous (e.g. Darth Vader), but I do have to agree with Davies that it doesn't let you track the subtleties very well. Fortunately, there are other tools out there that linguists can use, although I have to admit that their interfaces are more complex.

Postscript: Jan 4, 2011

There have been several experiments online, especially on Language Log (e.g. northeaster vs nor'easter), working to see how the Google engine works, and Google is responding. One feature I initially missed is that you can narrow your corpus a bit. For instance, I ran the isn't/ain't pair again but restricted it to the "English fiction" corpus. This should rule out pesky grammar books (although it would not distinguish quotes written in "dialect"). It's still interesting to note that "isn't" is not a clear winner until sometime after 1900.

Postscript 2: Jan 7, 2011

One issue that could also be problematic is homonyms/heteronyms as well as multiple usages. For instance, a chart of dove may not distinguish the irregular past tense of dive from the bird of peace.