Text Mining: Word Clouds

I’ve been exploring text mining in greater depth lately. As an experiment, I decided to create a word cloud based on Virgils Aeneid, one of the great works of Roman literature. Mostly because it can’t all be business cases and twitter analyses. The translation I used was by J.W. Mackail and you can download it here.

word cloud

Aeneas (the protagonist) and Turnus (the main antagonist) feature prominently. “Father” also makes a prominent appearance, as part of the epic is about Aeneas’ relationship with his elderly father. However, neither of Aeneas’ wives or his lover, Dido, appear in the word cloud. “Death”, “gods”, “blood”, “sword”, “arms” and “battle” all feature. That sums the epic up: it’s a rollicking adventure about the fall of Troy, the founding of Rome and a trip to the underworld as well.

The choice to downplay the role of romantic love in the story had particular political implications for the epic as a piece of propaganda. You can read more about it here and here. I found it interesting that the word cloud echoed this.

What I learnt from this experiment was that stop words matter. The cloud was put together from an early 20th century translation of a 2000 year old text using 21st century methods and stop words. Due to the archaic English used in the translation, I added a few stop words of my own: things like thee, thou, thine. This resulted in a much more informative cloud.

I did create a word cloud using the Latin text, but without a set of Latin stop words easily available it yields a cloud helpfully describing the text with prominent features like “but”, “so”, “and”, “until”.

The moral of the story is: the stop words we use matter. Choosing the right set describes the text accurately.

If you’re interested in creating your own clouds, I found these resources particularly helpful:

  • Julia Silge’s analysis of Jane Austen inspired me to think about data mining in relation to Roman texts, you can see it here, it’s great!
  • The Gutenbergr package for accessing texts by Ropenscilabs available on GitHub.
  • This tutorial on data mining from RDatamining.
  • Preparing literary data for text mining by Jeff Rydberg-Cox.
  • A great word cloud tutorial you can view here on STHDA.

There were a number of other tutorials and fixes that were helpful, I noted these in the Rscript. The script is up on github: if you want to try it yourself, you can find it here.