Social Networks: The Aeneid Again

Applying social network analysis techniques to the Aeneid provides an opportunity to visualise literary concepts that Virgil envisaged for the text. It occurred to me that this was a great idea when I saw this social network analysis of Game of Thrones. If there is a group of literary figures more blood thirsty, charming and messed around by cruel fate than the denizens of Westeros, it would be those in the golden age of Roman literature.

Aeneid social network

This is a representation of the network of characters in the Aeneid. Aeneas and Turnus, both prominent figures in the wordcloud I created for the Aeneid are also prominent in the network. Connected to Aeneas is his wife Lavinia, his father Anchises, the king of the Latins (Latinus) and Pallas, the young man placed into Aeneas’ care.

Turnus is connected to Aeneas directly along with his sister Juturna, Evander (father of Pallas. Cliff notes version: the babysitting did not go well) and Allecto, a divine figure of rage.

Between Aeneas and Turnus is the “Trojan contingent”. Virgil deliberately created parallels between the stories surrounding the fall of Troy and Aeneas’ story. Achilles, the tragic hero, is connected to Turnus directly, while Aeneas is connected to Priam (king of Troy) and Hector (the great defender of Troy). Andromache is Hector’s widow whom Aeneas meets early in the epic.

Also of note is the divine grouping: major players in directing the action of the epic. Jupiter, king of the gods and Apollo the sun god are directly connected to our hero. Venus, Neptune, Minerva and Cupid are all present. In a slightly different grouping, Juno, Queen of the Gods and Aeneas’ enemy is connected to Dido, Aeneas’ lover. Suffice it to say, the relationship was not a “happily ever after”.

I used this list of the characters in the Aeneid as a starting point and later removed all characters who were peripheral to the social network. If you’re interested in trying this yourself, I posted the program I used here. Once again, the text used is the translation by J.W. Mackail and you can download it from Project Gutenberg here.

There were a number of resources I found useful for this project:

  • This tutorial from R DataMining provided a substantive amount of the code required for the social network analysis
  • While this tutorial from the same place was very helpful for creating a text document matrix. I’ve used it previously a number of times.
  • This article from R Bloggers on using igraph was also very useful
  • There were a number of other useful links and I’ve documented those in the R script.

Whilst text mining is typically applied to modern issues, the opportunity to visualise an ancient text is an interesting one. I was interested in how the technique grouped the characters together. These groupings were by and large consistent not only with the surface interpretation of the text, but also deeper levels of political and moral meaning within the epic.

Modelling Early Grade Education in Papua New Guinea

For several years, I worked for the World Bank analysing the early grade education outcomes in a number of different Pacific countries including Laos, Tonga and Papua New Guinea, amongst others. Recently, our earlier work in Papua New Guinea was published for the first time.

One of the more challenging things I did was model a difficult set of survey outcomes: reading amongst young children. You can see the reports here. Two of the most interesting relationships we observed were the importance of language for young children learning to read (Papua New Guinea has over 850 of them so this matters) and the role that both household and school environments play in literacy development.

At some point I will write a post about the choice between standard ordinary least squares regressions used in the field and the tobit models I (generally) prefer for this data. Understanding the theoretical difference between censored, truncated and continuous data isn’t the most difficult thing in the world, but understanding the practical difference between them can have a big impact on modelling outcomes.

Elasticity and Marginal Effects: Two Key Concepts

One of the critical parts of building a great model is using your understanding of the problem and context. Choosing an appropriate model type and deciding on appropriate features/variables to explore based on this information is critical.

The two key concepts of elasticity and marginal effects are fundamental to an economic understanding of model building. This is something that can be overlooked for practitioners not coming from that background. Neither concept is difficult or particularly obtuse.

This infographic came about because I had a group of talented economics students at the masters’ level who had no econometric background, by and large. In a crowded course, I don’t have much time to expand on my favourite things. This was my take on explaining the concepts quickly and simply.

Elasticity infographic

For those very new to the concept, this explanation here is simple. Alternatively, if you’re interested in non-constant marginal effects and ways they can be used, check out this discussion.


More word clouds: Auspol

Whilst I love text mining a classic work of western literature, this time I decided to stay in the present century with the twitter feeds of the leaders of the three major parties heading into a downunder federal election.

Turnbull word cloud

Malcolm Turnbull is the prime minister and leads the liberal party, he’s in blue. Bill Shorten is the opposition leader and head of the labor party, he’s in red. Richard di Natale is the leader of the Greens party, he’s in green because I had no choice there.

The outcomes are pretty interesting: apparently AMP was A.Big.Deal. lately. “Jobs” and “growth” were the other words playing on repeat for the two major parties.
Shorten Word Cloud

Each one has a distinct pattern, however. Turnbull is talking about AMP, plans, jobs, future, growth. Shorten is talking about labor, AMP, medicare, budget, schools and jobs. Di Natale is talking about the Greens, AMP (again), electricity, and auspol itself. Unlike the others, Di Natale was also particularly interested in farmers, warming and science.Word cloud Di Natale

The programming and associated sources are pretty much the same as for the Aeneid word cloud, except I used the excellent twitteR package you can find out about here. This tutorial on R Data mining was the basis of the project. The size of the corpi (that would be the plural of corpus, if you speak Latin) presented a problem and these resources here and here were particularly helpful.

For reference, I pulled the tweets from the leaders’ timelines on the evening of the 23/05/16. The same code gave me 83 tweets from Bill Shorten, 59 from Malcolm Turnbull and 33 from Richard Di Natale: all leaders are furious tweeters, so if anyone has any thoughts on why twitteR responded like that, I’d be grateful to hear.

The minimum frequencies for entering the word clouds were 3 per word for Shorten with a greater number of tweets picked up, but only 2 for Di Natale and Turnbull, due to the smaller number of available words.

I’ll try this again later in the campaign and see what turns up.

Text Mining: Word Clouds

I’ve been exploring text mining in greater depth lately. As an experiment, I decided to create a word cloud based on Virgils Aeneid, one of the great works of Roman literature. Mostly because it can’t all be business cases and twitter analyses. The translation I used was by J.W. Mackail and you can download it here.

word cloud

Aeneas (the protagonist) and Turnus (the main antagonist) feature prominently. “Father” also makes a prominent appearance, as part of the epic is about Aeneas’ relationship with his elderly father. However, neither of Aeneas’ wives or his lover, Dido, appear in the word cloud. “Death”, “gods”, “blood”, “sword”, “arms” and “battle” all feature. That sums the epic up: it’s a rollicking adventure about the fall of Troy, the founding of Rome and a trip to the underworld as well.

The choice to downplay the role of romantic love in the story had particular political implications for the epic as a piece of propaganda. You can read more about it here and here. I found it interesting that the word cloud echoed this.

What I learnt from this experiment was that stop words matter. The cloud was put together from an early 20th century translation of a 2000 year old text using 21st century methods and stop words. Due to the archaic English used in the translation, I added a few stop words of my own: things like thee, thou, thine. This resulted in a much more informative cloud.

I did create a word cloud using the Latin text, but without a set of Latin stop words easily available it yields a cloud helpfully describing the text with prominent features like “but”, “so”, “and”, “until”.

The moral of the story is: the stop words we use matter. Choosing the right set describes the text accurately.

If you’re interested in creating your own clouds, I found these resources particularly helpful:

  • Julia Silge’s analysis of Jane Austen inspired me to think about data mining in relation to Roman texts, you can see it here, it’s great!
  • The Gutenbergr package for accessing texts by Ropenscilabs available on GitHub.
  • This tutorial on data mining from RDatamining.
  • Preparing literary data for text mining by Jeff Rydberg-Cox.
  • A great word cloud tutorial you can view here on STHDA.

There were a number of other tutorials and fixes that were helpful, I noted these in the Rscript. The script is up on github: if you want to try it yourself, you can find it here.

Data Analysis: Enough with the Questions Already

We’ve talked a lot about data analysis lately. First we asked questions. Then we asked more. Hopefully when you’re doing your own analyses you have your own questions to ask. But sooner or later, you need to stop asking questions and start answering them.

Ideally, you’d really like to write something that doesn’t leave the reader with a keyboard imprint across their forehead due to analysis-induced narcolepsy. That’s not always easy, but here are some thoughts.

Know your story.

Writing up data analysis shouldn’t be about listing means, standard deviations and some dodgy histograms. Yes, sometimes you need that stuff- but mostly what you need is a compelling narrative. What is the data saying to support your claims?

It doesn’t all need to be there. 

You worked out that tricky bit of code and did that really awesome piece of analysis that led you to ask questions and… sorry, no one cares. If it’s not a direct part of your story, it probably needs to be consigned to telling your nerd friends on twitter- at least they’ll understand what you’re talking about. But keep it out of the write up!

How is it relevant?

Data analysis is rarely the end in and of itself. How does your analysis support the rest of your project? Does it offer insight for modelling or forecasting? Does it offer insight for decision making? Make sure your reader knows why it’s worth reading.

Do you have an internal structure?

Data analysis is about translating complex numerical information into text. A clear and concise structure for your analysis makes life much easier for the reader.

If you’re staring at the keyboard wondering if checking every social media account you ever had since high school is a valid procrastination option: try starting with “three important things”. Then maybe add three more. Now you have a few things to say and can build from there.

Who are you writing for?

Academia, business, government, your culture, someone else’s, fellow geeks, students… all of these have different expectations around communication.  All of them are interested in different things. Try not to have a single approach for communicating analysis to different groups. Remember what’s important to you may not be important to your reader.

Those are just a few tips for writing up your analyses. As we’ve said before: it’s not a one-size-fits-all approach. But hopefully you won’t feel compelled to give a list of means, a correlation matrix and four dodgy histograms that fit in the space of a credit card. We can do better than that!

Data Analysis: More Questions

In our last post on data analysis, we asked a lot of questions. Data analysis isn’t a series of generic questions we can apply to every dataset we encounter, but it can be a helpful way to frame the beginning of your analysis. This post is, simply, some more questions to ask yourself if you’re having trouble getting started.

The terminology I use below (tall, dense and wide) is due to Francis Diebold. You can find his original post here and it’s well worth a read.

Remember, these generic questions aren’t a replacement for a thoughtful, strategic analysis. But maybe they will help you generate your own questions to ask your data.

Data analysis infographic

Data Analysis: Questions to Ask the First Time

Data analysis is one of the most under rated, but most important parts of data science/econometrics/statistics/whatever it is you do with data.

It’s not impressive when it’s done right because it’s like being impressed by a door handle: it is something that is both ubiquitous and obvious. But when you’re missing the doorhandles, you can’t open the door.

There are lots of guides to data analysis but fundamentally there is no one-size-fits-most approach that can be guaranteed to work for every data set. Data analysis is a series of open-ended questions to ask yourself.

If you’re new or coming to data science from a background that did not emphasise statistics or econometrics (or story telling with data in general), it can be hard to know which questions to ask.

I put together this guide to offer some insight into the kinds of questions I ask myself when examining my data for the first time. It’s not complete: work through this guide and you won’t have even started the analysis proper. This is just the first time you open your data, after all.

But by uncovering the answers to these questions, you’ll have a more efficient analysis process. You’ll also (hopefully) think of more questions to ask yourself.

Remember, this isn’t all the information you need to uncover: this is just a start! But hopefully it offers you a framework to think about your data the first time you open it. I’ll be back with some ideas for the second time you open your data later.

career timeline-2.


Congratulations to the Melbourne Data Science Group!

Last week, I attended the Melbourne Data Science Initiative and it was definitely the highlight of my data science calendar this year! The event was superbly organised by Phil Brierley and his team. Events included tutorials on Machine Learning, Deep Learning, Business Analytics and talks on feature engineering, big data and the need to invest in analytic talent amongst others.

The speakers were knowledgable and interesting with everything covered from the hilarious building of a rhinoceros catapult (thanks to Eugene from Presciient, it’s possible I’ll never forget that one) to the paramount importance of the “higher  purpose” in business analytics as discussed by Evan Stubbs from SAS Australia and New Zealand.

If you’re in or around Melbourne and into Data Science at all, check out the group who put on this event out here.