[Word Cloud] Comic book superhero names

Word Cloud

It's been a while since I had this idea, but I struggled to find a good corpus of names to work with. Comicvine has a nice list of characters in comics, but it would have taken a lot of manual processing to make sure the end result was not full of "McDuck".

I stumbled across superheronames.net, a fan-curated list of favorite superhero names, and this seemed a decent compromise. I extracted all the names that fans had given four or five stars, separated them into morphemes (so Batman becomes "bat" and "man"), compiled a frequency list, made a shaped word cloud with a comic-style font at Tagxedo, did a little photoshopping, and voilà.
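
For the curious, the morpheme-counting step could be sketched in Python roughly as follows. This is only an illustration, not necessarily how the list above was built: the tiny MORPHEMES inventory, the greedy split_name helper and the sample names are all assumptions made up for the example, and a real list would need a much larger inventory plus some manual cleanup.

```python
from collections import Counter

# Hypothetical morpheme inventory; a real one would be far larger.
MORPHEMES = {"bat", "man", "super", "girl", "captain",
             "america", "cat", "woman", "hawk"}

def split_name(name, morphemes=MORPHEMES):
    """Greedily split a name into known morphemes, longest match first."""
    parts = []
    for word in name.lower().split():
        i = 0
        while i < len(word):
            # Try the longest candidate first, so a longer morpheme wins
            # over any shorter morpheme that is a prefix of it.
            for j in range(len(word), i, -1):
                if word[i:j] in morphemes:
                    parts.append(word[i:j])
                    i = j
                    break
            else:
                # No known morpheme starts here: keep the rest whole.
                parts.append(word[i:])
                break
    return parts

def morpheme_frequencies(names):
    """Count how often each morpheme occurs across all names."""
    counts = Counter()
    for name in names:
        counts.update(split_name(name))
    return counts

# Sample input, invented for the example.
names = ["Batman", "Superman", "Batgirl", "Supergirl",
         "Captain America", "Catwoman", "Hawkman"]
freq = morpheme_frequencies(names)
print(freq.most_common(3))
```

The resulting counts are the kind of word-and-weight list a word cloud generator can take as input.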

No surprises that Man, Captain and Girl are most highly represented (and you can draw your own conclusions about Girl being more common than Woman). A co-worker I showed this to pointed out there are some interestingly serendipitous names that can be made from the way the algorithm put the morphemes together on this graphic: "Super fire she lad", "America devil ice", "Princess cat bird hawk". I would totally buy those comics.

I've posted the names I used in this Google doc. It's rather imperfect; if anyone has any better suggestions for a corpus, I would be very interested to hear them.

Word cloud created using Tagxedo.

Word Cloud of A Clockwork Orange by Anthony Burgess



Practice makes perfect; this is a replacement for the word cloud of this book I made two weeks ago, and I think it's much better looking. It is still based on the original, iconic Penguin cover.

A Clockwork Orange makes for an interesting word cloud, because so many of the words are in Nadsat; six of the top ten non-trivial words are in Anthony Burgess's invented Russian/Cockney youth dialect, including veck, "guy" (from Russian chelovek), viddy (to see) and horrorshow (an Anglicization of the Russian khorosho, "good").

The most common non-trivial word is "brother(s)"; it occurs 259 times, 86 of them as the phrase "O my brothers," which is how the sociopathic narrator, Alex, addresses the reader. Two memorable phrases from the book and the Kubrick movie are "ultra-violence" and "Ludwig van", which occur relatively few times: 15 and 12, respectively.

The title (which is never uttered in the movie) appears nine times, since it's the title of a book within the book; in chapter two Alex mockingly reads it aloud before he and his gang rape the author's wife to death.

Alex uses the misspelling "heighth" 21 times, but gets it right twice, all in similar contexts.

About half of the unique words in the book are used only once, an exceptionally high proportion. Many of these are Nadsat, which poses an interesting challenge, since the reader has to work out the meaning from context. Some are easy derivatives of English words ("chickiwick", "clopclopclop"); others are more challenging ("choodessny", "oobivat").

There are about 59,000 words in A Clockwork Orange, of which about 14,000 are non-trivial (that is, not words like "the", "in", etc.). There are about 5,500 unique words.
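
Counts like these (total words, non-trivial words, unique words and words used only once) can be reproduced with a few lines of Python. The sketch below is a rough illustration under stated assumptions: the STOPWORDS set is a tiny stand-in for a real stop list, the tokenize rule is one of many possible ways to split text into words, and the sample sentence is invented, so the numbers it gives will not match the figures above.

```python
import re
from collections import Counter

# Tiny illustrative stop list; a real one would be much longer.
STOPWORDS = {"the", "in", "a", "of", "and", "to", "i", "my", "o"}

def tokenize(text):
    """Lowercase the text and extract runs of letters,
    keeping inner apostrophes and hyphens ("ultra-violence")."""
    return re.findall(r"[a-z]+(?:['-][a-z]+)*", text.lower())

def corpus_stats(text, stopwords=STOPWORDS):
    """Return total, non-trivial, unique and once-only word counts."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    once_only = [w for w, n in counts.items() if n == 1]
    return {
        "total": len(tokens),
        "non_trivial": sum(1 for t in tokens if t not in stopwords),
        "unique": len(counts),
        "used_once": len(once_only),
    }

# Invented sample text, just to exercise the functions.
sample = ("O my brothers, the old ultra-violence. "
          "O my brothers, viddy the horrorshow veck.")
stats = corpus_stats(sample)
print(stats)
```

Dividing "used_once" by "unique" gives the share of the vocabulary that appears only once.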

For more info:

A Clockwork Orange on Wikipedia

Word Cloud of The Catcher in the Rye





This word cloud is based not on a published book cover, but on an artist’s homage that is, in my opinion, far more striking and evocative than any of the “official” versions. The artist, M. S. Corley, was kind enough to give me permission to adapt his work; please look at the rest of his oeuvre at his website, The Art of M.S. Corley.

Three of the top five words in The Catcher in the Rye – “goddam”, “hell” and “damn” – are curse words, which of course is one of the reasons this is one of the most banned books in school libraries. In fact, continuing down the list with “chrissake”, “bastard”, “crap”, “sonuvabitch”, etc., my rough calculation is that just over 5% of the book consists of words that would not be acceptable at a 1951 supper table.
But they and many of the top words, like “lousy” and “terrific”, are essential to the voice Salinger gives his protagonist, Holden Caulfield. One of the most thematic and memorable words in the book is “phony”, but Salinger uses it relatively sparingly and strategically: 35 times, ranking it #46, behind Holden’s iconic “hat”, featured on this cover.
Holden calls everyone “old”, such as “old Phoebe” and “old Stradlater”, but the word does not appear in the cloud: the concordance software omits the 1,000 most common words in the English language so that the word clouds aren’t full of words like “the” and “I”. “Old” appears 397 times (more than “or”!), which would have put it in the #1 spot, ahead of “goddam” with 245 appearances. Interestingly, when compared to the word frequencies in the Brown corpus of written English, “old” appears no more often in The Catcher in the Rye than it does elsewhere (it is 31st when ranked by frequency and also 31st by “keyness” compared to Brown).
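
As an aside on “keyness”: the statistic the software uses is not stated here, but one common choice is the log-likelihood score, which compares a word's count in the book with its count in a reference corpus, given the size of each. The Python sketch below assumes that measure; the function names and the toy counts for "alpha" and "beta" are invented for illustration and are not real figures from the book or from Brown.

```python
import math

def log_likelihood(freq_study, size_study, freq_ref, size_ref):
    """Log-likelihood keyness of one word: study text vs. reference corpus."""
    total = freq_study + freq_ref
    expected_study = size_study * total / (size_study + size_ref)
    expected_ref = size_ref * total / (size_study + size_ref)
    score = 0.0
    # A zero count contributes nothing (x * ln(x) tends to 0 as x -> 0).
    if freq_study > 0:
        score += freq_study * math.log(freq_study / expected_study)
    if freq_ref > 0:
        score += freq_ref * math.log(freq_ref / expected_ref)
    return 2 * score

def rank_by_keyness(study_counts, size_study, ref_counts, size_ref):
    """Sort the study text's words from most to least key."""
    scores = {
        word: log_likelihood(n, size_study, ref_counts.get(word, 0), size_ref)
        for word, n in study_counts.items()
    }
    return sorted(scores, key=scores.get, reverse=True)

# Toy numbers, invented for the example.
study_counts = {"alpha": 50, "beta": 50}   # counts in a 1,000-word text
ref_counts = {"alpha": 5, "beta": 500}     # counts in a 10,000-word reference
ranking = rank_by_keyness(study_counts, 1_000, ref_counts, 10_000)
print(ranking)
```

In the toy data, "beta" is frequent in the text but occurs at the same rate in the reference, so its score is zero; "alpha" is just as frequent in the text but rare in the reference, so it ranks as far more key. Note that this score is also high for words that are unusually rare in the study text, so a fuller version would check the direction of the difference as well.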
“Phoebe” is the seventh-most-used word in the book; Holden mentions his sister’s name 115 times. “Rye” appears only seven times, most often as “If a body catch a body comin’ through the rye.” The book’s title, and the word “catcher”, appear just once, during Holden's conversation with Phoebe in Chapter 22.
Among the words that appear only once are “bassackwards”, “oversexed”, the brand name “Tattersall”, the curiously spelled “wutchamacallit” (the more common “whatchamacallit” does not appear) and several phonetically caricatured French words like “voolay voo” (when Holden imitates the speech of Janine, a singer in the Wicker Bar). There are approximately* 75,000 words, and 3,500 unique words.
After I finished making this word cloud, I found out there is a plan to publish a posthumous sequel to The Catcher in the Rye in 2015. I have absolutely no comment to make about this.
For more information, please see the Wikipedia article about the book, the Wikipedia article about the author, or this analysis of the themes in the book, including the hat. You can also see the method I used to determine the words and their frequencies. Last week, I posted a boring, traditional word cloud of this book; I’ve removed it, but you can see it here if you want.
* See the method file for an explanation of why these counts are approximate.

Word cloud created using Tagxedo.