Word Cloud of Vladimir Nabokov's Lolita



There was certainly no dearth of images to choose from for this book; a tasteful cover for a book about ephebophilia is a challenge that many designers are fascinated by. There is a web site with 185 published book covers, a recent book of 80 commissioned conceptual book cover designs, and plenty of fan cover designs around the Internet.

In the end, I had to go with the iconic heart-shaped sunglasses from the Kubrick film poster (reproduced on many book covers afterwards), even though the poster notoriously presents Lolita as a seductress rather than Nabokov's pragmatic, desperate, abused girl. Perhaps we are seeing her through Humbert Humbert's flawed perception.

1962 movie poster

By far the most common word is the title character's name: she appears as Dolores (her given name) 65 times, Dolly 100 times, Lolita 240 times and Lo 273 times, for a total of 678 mentions -- in a 110,000 word book, that's an extremely high rate of 0.6% of all words used. Her name is even repeated eight times in a row by a rapturous narrator in Chapter 26 -- a chapter so short, Lolita's name makes up over 12% of the total words.
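The total and the rate quoted above are straightforward to check. A minimal Python sketch, using only the counts given in this post (the variable names are mine, for illustration):

```python
# Counts of each form of the name, as quoted in the post.
name_counts = {"Dolores": 65, "Dolly": 100, "Lolita": 240, "Lo": 273}
total_words = 110_000  # approximate length of the novel, per the post

total_mentions = sum(name_counts.values())
share = total_mentions / total_words

print(total_mentions)  # 678
print(f"{share:.1%}")  # 0.6%
```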

The narrator's peculiar reduplicated name, "Humbert Humbert", appears in full 19 times; "Humbert" appears on its own another 87. Among the other strongly represented words are "Haze" (Lolita's and her mother's surname, an inspiration for many jeux de mots by HH), and of course "young", "child", and HH's neologism "nymphet". Another word which does not appear as often, but is overrepresented in comparison to the English corpus, is "old" -- a contrast, of course, to "young".

The longest oft-repeated or nearly repeated phrase (six times in one form or another) is the lyric of the half-remembered song "Oh Carmen, ... the stars and the cars and the barmen", which first appears in the infamous Chapter 13, in which HH steals Lolita's apple (the symbolism is obvious) and maneuvers her onto his lap, where the physical contact makes him near-delirious. Another is "ladies and gentlemen of the jury", said in one form or another ten times, a reminder that HH fully expects to be judged by the reader.

The novel has about 110,000 words, 14,000 unique words, 10,000 unique word stems (e.g. counting "walk", "walking" and "walks" together), and 4,000 word stems used only once -- this is a high variety of words, typical for a master linguist like Nabokov. Many of these singletons are rare French and Latin words like "ensellure", "frétillement" and "quidquam", cultured words like "Chimène" (an opera) and "callypygean" (a classical reference to the buttocks), and plays on words like "honeymonsoon" and "dolorous" (referring to Lolita's given name).
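Counts like these (total words, unique words, words used only once) can be sketched in a few lines of Python. This is an illustration under my own assumptions, not the concordance software used for this post: the tokenizer is a hypothetical one that treats any run of letters as a word, and the sample sentence is made up.

```python
import re
from collections import Counter

def vocabulary_stats(text):
    """Return (total words, unique words, words used exactly once)."""
    # Hypothetical tokenizer: lowercase runs of letters, allowing inner apostrophes.
    words = re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)*", text.lower())
    counts = Counter(words)
    singletons = sum(1 for n in counts.values() if n == 1)
    return len(words), len(counts), singletons

sample = "the cat saw the dog and the dog saw a bird"  # made-up text
print(vocabulary_stats(sample))  # (11, 7, 4)
```

A different tokenizer (for example, one that splits hyphenated words) would give somewhat different totals, which is one reason counts like these are best read as approximate.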

"Incest" and "nubile" appear twice each, "tumescent" once, and "pedophilia" and "molest" do not appear at all.

Wikipedia article about the book.

Word cloud created using Tagxedo.

Word Cloud of Animal Farm and Nineteen Eighty-Four by George Orwell

I decided to post two word clouds by the same author this week. Warning: the Nineteen Eighty-Four word cloud is an animated GIF; if it hurts your eyes or your brain, scroll further down the page and you'll see a non-animated version.


I made up my own design instead of using an existing book cover, because I found none that suggested themselves to me for a word cloud. The texture and colors are supposed to be reminiscent of a communist flag, and hopefully it's obvious that the cloud is in the form of a pig's head.

Once the trivial words are removed, the list contains mostly character names (Napoleon drives away Snowball, and thus his name is mentioned more often), types of animals (pigs, hens) and farm features (windmill, barn). The words that remain are very evocative of the character of the book, such as comrade, orders, rebellion and commandments.

This is a short book, at about 30,000 words. There are 3,900 individual words, but once the Porter algorithm is used to identify word stems, the number drops to 2,900 stems. (The algorithm counts "thinking", "thinks" and "think" together, but not "thought"; it is imperfect, but it produces about as many false positives as false negatives, and is more than reasonably accurate.)
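To illustrate the idea of grouping words by stem, here is a deliberately crude suffix-stripping sketch in Python. It is NOT the Porter algorithm, which applies a much longer series of rules; the two-suffix rule set and the function name are my own assumptions, chosen only to reproduce the "think" example.

```python
from collections import defaultdict

SUFFIXES = ("ing", "s")  # deliberately tiny rule set, for illustration only

def crude_stem(word):
    """Strip one known suffix, keeping at least three letters of stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

groups = defaultdict(list)
for w in ["think", "thinks", "thinking", "thought"]:
    groups[crude_stem(w)].append(w)

print(dict(groups))
# {'think': ['think', 'thinks', 'thinking'], 'thought': ['thought']}
```

Like any rule-based stemmer, this one leaves the irregular form "thought" in its own group, because no suffix rule can connect it to "think".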

The title occurs 43 times, and the book has an oft-recurring six-word phrase: "four legs good, two legs bad" appears 13 times. For 0.25% of all words in a book to be the same six-word phrase is highly unusual, especially without any of the words being as common as "the" or "and".
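The share of the book taken up by the slogan is simple arithmetic: 13 occurrences of six words is 78 words out of about 30,000, or roughly a quarter of a percent. A minimal Python sketch of counting a phrase and computing that share follows; the helper name and the sample text are hypothetical, and punctuation and case are ignored when matching.

```python
import re

def count_phrase(text, phrase):
    """Count occurrences of a phrase, ignoring case and punctuation."""
    words = re.findall(r"[a-z']+", text.lower())
    target = re.findall(r"[a-z']+", phrase.lower())
    n = len(target)
    # Slide a window of n words across the text and compare.
    return sum(words[i:i + n] == target for i in range(len(words) - n + 1))

sample = "Four legs good, two legs bad! Four legs good, two legs bad!"
print(count_phrase(sample, "four legs good, two legs bad"))  # 2

# Figures from the post: a six-word phrase, 13 times, in about 30,000 words.
share = 6 * 13 / 30_000
print(f"{share:.2%}")  # 0.26%
```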

The words used only once are similar in character to the most common words: the personal (Simmonds, Caesar), pastoral (matchwood, piebald, stockbreeder, windowsill) and atmospheric (tunefully, conciliatory, bloodshed).

Animal Farm on Wikipedia

IF YOU DON'T LIKE THE ANIMATED GIF, SCROLL FURTHER DOWN.


Non-animated version:


Again, I could not find a book cover or movie poster that spoke to me for a word cloud. I was fooling around with some typography and came up with the following:



I wasn't sure it was clear that the background of the word cloud was supposed to be static, so I animated it.


The protagonist's name, Winston Smith, features heavily (526 times), as does that of his antagonist, O'Brien (205). Julia's name appears less than half as often as O'Brien's (100), but to be fair, Winston doesn't learn it until partway through the book.

Nineteen Eighty-Four is famous for the introduction of Newspeak, but those words appear relatively little: doublethink is used 30 times, compared to telescreen at 92 and Oceania at 60. Speakwrite and thoughtcrime are used 13 and 11 times, respectively.

As in Animal Farm, there is a long, repeated phrase: "Oranges and lemons, say the bells of St. Clement's" appears eight times -- exactly as often as the name of the organization of which Julia is a member, the "Junior Anti-Sex League".

Recurring two-word phrases are much more common, such as the culturally iconic "Big Brother" (78 times). The word "party" appears 70 times in recurring 2-grams such as "party member" and "inner party".
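Two-word phrases (2-grams) can be counted by pairing each word with the one that follows it. A minimal Python sketch, with a made-up sample sentence and a hypothetical tokenizer; note that this simple version also pairs words across sentence boundaries.

```python
import re
from collections import Counter

def bigram_counts(text):
    """Count every pair of adjacent words, ignoring case and punctuation."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(zip(words, words[1:]))

# Made-up sample text, not a quotation from the novel.
sample = ("Big Brother is watching. The Party member feared "
          "the Inner Party and Big Brother.")
counts = bigram_counts(sample)

print(counts[("big", "brother")])   # 2
print(counts[("party", "member")])  # 1
```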

"Two and two make five" appears three times.

Among the words used only once are bastards, homosexuality, tribunal, silk, monopoly, romantic and intercourse.

The book is 100,000 words long; there are 8,500 individual words and 5,700 unique word stems determined by the Porter algorithm.

Nineteen Eighty-Four on Wikipedia

Word cloud created using Tagxedo.

Word Cloud of A Clockwork Orange by Anthony Burgess



Practice makes perfect; this is a replacement for the word cloud of this book I made two weeks ago, and I think it's much better looking. It is still based on the original, iconic Penguin cover:

A Clockwork Orange makes for an interesting word cloud, because so many of the words are in Nadsat; six of the top ten non-trivial words are in Anthony Burgess's invented Russian/Cockney youth dialect, including veck ("guy", from the Russian chelovek), viddy ("to see") and horrorshow ("good", an Anglicization of the Russian khorosho).

The most common non-trivial word is "brother(s)"; it occurs 259 times, 86 of them in the phrase "O my brothers," which is how the sociopathic narrator, Alex, addresses the reader. Two memorable phrases from the book and the Kubrick movie are "ultra-violence" and "Ludwig van", which occur relatively few times: 15 and 12, respectively.

The title (which is never uttered in the movie) appears nine times, since it's the title of a book within the book; in chapter two Alex mockingly reads it aloud before he and his gang rape the author's wife to death.

Alex uses the misspelling "heighth" 21 times, but gets it right twice, all in similar contexts.

About half of the unique words in the book are used only once, an exceptionally high proportion. Many of these are Nadsat, which is an interesting challenge, since the reader has to work out the meaning from context. Some are easy derivatives of English words ("chickiwick", "clopclopclop"); others are more challenging ("choodessny", "oobivat").

There are about 59,000 words in A Clockwork Orange, of which about 14,000 are non-trivial (that is, not "the", "in", etc.). There are about 5,500 unique words.

For more info:

A Clockwork Orange on Wikipedia

Word Cloud of The Catcher in the Rye





This word cloud is based not on a published book cover, but on an artist’s homage that is, in my opinion, far more striking and evocative than any of the “official” versions. The artist, M. S. Corley, was kind enough to give me permission to adapt his work; please look at the rest of his oeuvre at his website, The Art of M.S. Corley.

Three of the top five words in The Catcher in the Rye – “goddam”, “hell” and “damn” – are curse words, which of course is one of the reasons this is one of the most banned books in school libraries. In fact, continuing down the list with “chrissake”, “bastard”, “crap”, “sonuvabitch”, etc., my rough calculation is that just over 5% of the book consists of words that would not be acceptable at a 1951 supper table.
But they and many of the top words, like “lousy” and “terrific”, are essential to the voice Salinger gives his protagonist, Holden Caulfield. One of the most thematic and memorable words in the book is “phony”, but Salinger uses it relatively sparingly and strategically: 35 times, ranking it #46, behind Holden’s iconic “hat”, featured on this cover.
Holden calls everyone “old”, as in “old Phoebe” and “old Stradlater”, but the word is missing from the cloud: the concordance software omits the 1,000 most common words in the English language so that the word clouds aren’t full of words like “the” and “I”. “Old” appears 397 times (more than “or”!), which would have put it in the #1 spot, ahead of “goddam” with 245 appearances. Interestingly, when compared to the word frequencies in the Brown corpus of written English, “old” appears no more often in The Catcher in the Rye than it does elsewhere (it is 31st when ranked by frequency and also 31st by “keyness” compared to Brown).
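The comparison against a reference corpus can be sketched as a ratio of relative frequencies: how often a word occurs per word of the book, divided by how often it occurs per word of the reference. This is only a simplified stand-in, since keyness is usually computed with a statistical measure such as log-likelihood. The function names are hypothetical, and the counts below are made-up toy numbers, not figures from the novel or from the Brown corpus.

```python
from collections import Counter

def relative_frequency(counts, word):
    """Occurrences of `word` as a fraction of all words counted."""
    total = sum(counts.values())
    return counts[word] / total if total else 0.0

def frequency_ratio(book, reference, word):
    """How many times more frequent `word` is in `book` than in `reference`."""
    ref = relative_frequency(reference, word)
    return relative_frequency(book, word) / ref if ref else float("inf")

# Made-up toy counts, NOT figures from the novel or the Brown corpus.
book = Counter({"old": 4, "goddam": 3, "the": 43})       # 50 words
reference = Counter({"old": 8, "goddam": 0, "the": 92})  # 100 words

print(frequency_ratio(book, reference, "old"))     # 1.0
print(frequency_ratio(book, reference, "goddam"))  # inf
```

A ratio near 1 means the word is about as common in the book as in the reference; a very large ratio marks a word that is distinctive of the book.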
“Phoebe” is the seventh-most-used word in the book; Holden mentions his sister’s name 115 times. “Rye” appears only seven times, most often as “If a body catch a body comin’ through the rye.” The book’s title, and the word “catcher”, appear just once, during Holden's conversation with Phoebe in Chapter 22.
Among the words that appear only once are “bassackwards”, “oversexed”, the brand name “Tattersall”, the curiously spelled “wutchamacallit” (the more common “whatchamacallit” does not appear) and several phonetically caricatured French words like “voolay voo” (when Holden imitates the speech of Janine, a singer in the Wicker Bar). There are approximately* 75,000 words, and 3,500 unique words.
After I finished making this word cloud, I found out there is a plan to publish a posthumous sequel to The Catcher in the Rye in 2015. I have absolutely no comment to make about this.
For more information, please see the Wikipedia article about the book, the Wikipedia article about the author, or this analysis of the themes in the book, including the hat. You can also see the method I used to determine the words and their frequencies. Last week, I posted a boring, traditional word cloud of this book; I’ve removed it, but you can see it here if you want.
* See the method file for an explanation of why these counts are approximate.

Word cloud created using Tagxedo.