Democracy Sausage Redux

One last time. I wanted to see if there was any interesting election day behaviour by following the hashtag for democracy sausage. As it turns out, there was. There was a peak of early-morning democratic enthusiasm with a bunch of sleepless auspol and sausage tragic posting furiously. It tapered off dramatically during the day as we were forced to contend with the reality of democracy.

For a change, I also calculated a basic sentiment score for each tweet and tracked that too. There was a large degree of variability on 30/06, but posting was very low that day. A late afternoon disappointment dip as people realised that we’d all packed up the BBQs and gone home before they got there was also evident. Julia Silge’s post on the subject was extremely helpful.

I’m teaching again this week and to start students off they’re doing basic charts in Excel. So here’s mine!

Line graph showing frequency and sentiment of hashtag


Using Natural Language Processing for Survey Analysis

Surveys have a specific set of analysis tools that are used for analysing the quantitative part of the data you collect (stata is my particular poison of choice in this context). However, often the interesting parts of the survey are the unscripted, “tell us what you really think” comments.

Certainly this has been true in my own experience. I once worked on a survey deployed to teachers in Laos regarding resources for schools and teachers. All our quantitative information came back and was analysed, but one comment (translated for me into English by a brilliant colleague) stood out. It read something to the effect of “this is very nice, but the hole in the floor of the second story is my biggest concern as a teacher”. It’s not something that would ever have been included outright in the survey, but a simple sentence told us a lot about the resources this school had access to.

Careful attention to detailed comments in small surveys is possible. But if you have thousands upon thousands of responses, this is far more difficult. Enter natural language processing.

There are a number of tools which can be useful in this context. This is a short overview of some that I think are particularly useful.

  • Word Clouds. These are easy to prepare and very simple, but can be a powerful way to communicate information. Like all data visualisation, there are the good and the bad. This is an example of a very simple word cloud, while this post by Fells Stats illustrates some more sophisticated methods of using the tool.

One possibility to extend on the simple “bag of words” concept is to divide your sample by groups and compare clouds. Or create your own specific dictionary of words and concepts you’re interested in and only cloud those.

Remember that stemming the corpus is critical. For example, “work”, “worked”, “working”, “works” all belong to the same stem. They should be treated as one or else they are likely to swamp other themes if they are particularly common.

Note that no word cloud should be constructed without removing “stop words” like the, and, a, I etc. Dictionaries vary- they can (and should) be tailored to the problem at hand.

  • Network Analysis. If you have a series of topics you want to visualise relationships for, you could try a network-type analysis similar to this. The concept may be particularly useful if you manually decide topics of interest and then examine relationships between them. In this case, the outcome is very much user-dependent/chosen, but may be useful as a visualisation.
  • Word Frequencies. Alone, simple tables of word frequencies are not always particularly useful. In a corpus of documents pertaining to education, noting that “learning” is a common term isn’t something of particular note. However, how do these frequencies change by group? Do teachers speak more about “reading” than principals? Do people in one geographical area or salary bracket have a particular set of high frequency words compared to another? This is a basic exercise in feature/variable engineering. In this case, the usual data analysis tool kit applies (see here, here and here). Remember you don’t need to stop at high frequency words: what about high frequency phrases?
  •  TF-IDF (term frequency-inverse document frequency) matrix. This may provide useful information and is a basis of many more complex analyses. The TF-IDF downweights terms appearing in all documents/comments (“the”, “i”, “and” etc.) while upweighting rare words that may be of interest. See here for an introduction.
  • Are the comments clustered across some lower dimensional space? K-means algorithm may provide some data-driven guidance there. This would be an example of “unsupervised machine learning” vis a vis “this is an alogrithm everyone has been using for 25 years but we need to call it something cool”. This may not generate anything obvious at first- but who is in those clusters and why are they there?
  • Sentiment analysis will be useful, possibly both applied to the entire corpus and to subsets. For example, among those who discussed “work life balance” (and derivative terms) is the sentiment positive or negative? Is this consistent across all work/salary brackets? Are truck drivers more upbeat than bus drivers? Again, basic feature/variable engineering applies here. If you’re interested in this area, you could do a lot worse than learning from Julia Silge who writes interesting and informative tutorials in R on the subject.
  • Latent Dirichlet Algorithm (LDA) and more complex topic analyses. Finally, latent dirichlet algorithm or other more complex topic analyses may be able to directly generate topics directly from the corpus: I think this would take a great deal of time for a new user and may have limited outcomes, particularly if an early analysis would suggest you have a clear idea of which topics are worth investigating already. It is however particularly useful when dealing with enormous corpi. This is a really basic rundown of the concept. This is a little more complex, but has useful information.

So that’s a brief run down of some basic techniques you could try: there are plenty more out there- this is just the start. Enjoy!

What if policies have networks like people?

It’s been policy-light this election season, but some policy areas are up for debate. Others are being carefully avoided by mutual agreement, much like at Christmas lunch when we all tacitly agree we aren’t bringing up What Aunty Betty Did Last Year After Twelve Sherries. It’s all too painful, we’ll never come to any kind of agreement and we should just pretend like it’s not important.

However, policy doesn’t happen in a vacuum and I wondered if it was possible that using a social network-type analysis might illustrate something about the policy debate that is occurring during this election season.

To test the theory, I used the transcripts of the campaign launch speeches of Malcolm Turnbull and Bill Shorten. These are interesting documents to examine, because they are at one and the same time an affirmation of each parties’ policy aspirations for the campaign as well as a rejection of the other’s. I used a simple social network analysis, similar to that used in the Aeneid study. If you want to try it yourself, you can find the R script here.

Deciding on the topics to examine was some trial and error, but the list was eventually narrowed down to 19 topics that have been the themes of the election year: jobs, growth, housing, childcare, superannuation, health, education, borders, immigration, tax, medicare, climate change,marriage equality, offshore processing, environment, boats, asylum, business and bulk billing. These aren’t the topics that the parties necessarily want to talk about, but they are nonetheless being talked about.

It took some manoeuvring to get a network that was readable, but one layout (Kamada Kawaii for the interested) stood out. I think it describes the policy state quite well, visually speaking.

topic network 160627

We have the inner circle of high disagreement: borders, environment, superannuation, boats and immigration. There is a middle circle doing the job of containment: jobs and growth, housing, childcare, education, medicare, business and tax- all standard election fodder.

Then we have the outer arc of topics neither the labor or liberal parties really wants to engage with: offshore processing, asylum (as opposed to immigration, boats and borders), climate change (much more difficult to manage than mere environment), bulk billing (the crux of medicare) and marriage equality (have a plebiscite, have a free parliamentary vote, have something, except responsibility). I found it interesting that the two leaders’ speeches when visualised contain one part of a policy debate around immigration: boats and borders. But they conspicuously avoided discussing the unpleasant details: offshore processing.

Much like Aunty Betty and her unfortunate incident with the cold ham, both parties are in tacit agreement to ignore the difficult parts of a policy debate.

Australia Votes: Only Six Days to Go

It’s been painful, frankly pretty lame on the policy front and we’re over it. We all go to the national quadrennial BBQ election next week. While we’re standing in line clutching our sausage sandwiches and/or delightful local baked goods, it’d be nice to have an idea of what the people we’re voting for have had to say.

So another word cloud it is, because neither side has dared offer a policy that might stray from the narrative that “we’re all good blokes, really”.

This time, I requested up to 20 tweets from Turnbull and Shorten to see what’s been going on in the last couple of weeks. I got 18 back from both. Shorten (in red, below) has been talking about voting (surprise!), been screaming about medicare and apparently has an intense interest in trades with mentions of “brick” and “nails”. I hope that’s real tradies he’s talking about. Standard pollie speak “government”, “people”, “liberals”, “Turnbull” made it into the word cloud. Marriage equality also figured in the discussion.

Screen Shot 2016-06-25 at 10.18.46 PM

Turnbull (below, blue) was making a point about his relationship with the Australian muslim community, mentioning the Kirribilli house iftar and multifaith Australia. Standard coalition topics such as “investment”, “stable leaders”, “plan”, “economic”, “jobs” were all present. The AMP issue I touched on briefly last time. He appears to be trying to avoid the subject of marriage equality as much as possible.

Screen Shot 2016-06-25 at 10.19.02 PM

So there we have it: jobs and growth, the promise of stability, an Iftar in Kirribilli, marriage equality and a fascination with how we define a real or a fake tradie. If we all keep smiling fixedly, maybe we can forget about Brexit.

Tutorials and Guides: A curated list

This post is a curated list of my favourite tutorials and guides because “that one where Hadley Wickham was talking about cupcakes” isn’t the most effective search term. You can find my list of cheat sheets here. There are a lot of great resources on data science (I’ve included my top picks), so I don’t intend to reinvent the wheel here. This is just a list of my favourites all laid out so I can find them again or point other people in their direction when it comes up in conversation. I’ve also added a number of the “how to” type posts I’ve written on this blog as I often answer an enquiry in that format.

Data Science

Tutorials and videos: General

Puppets teach data science too

  • Render a 3D object in R. I have no idea where I would ever use this information in my practice, but it’s presented BY A PUPPET. Great fun.
  • DIY your data science. Another offering from the puppet circle on the data science venn diagram.



Work Flow

  • Guide to modern statistical workflow. Really great organisation of background material.
  • Tidy data, tidy models. Honestly, if there was one thing that had been around 10 years ago, I wish this was it. The amount of time and accuracy to be saved using this method is phenomenal.
  • Extracting data from the web. You found the data, now what to do? Look here.

Linear Algebra



Machine learning

Data visualisation

Natural Language Processing

I’ll continue to update this list as I find things I think are useful or interesting.

Edit: actually, “that one where Hadley Wickham was talking about cupcakes” is surprisingly accurate as a search term.

Screen Shot 2016-06-23 at 9.05.37 PM

It got wet

NSW got wet this weekend. In our own particular case we lost a large amount of our driveway and several paddocks spontaneously attained lake status. So there was nothing else to do but to poke around and see what I could turn up in the historic record (find yours here).

Some locals recorded up to 250mm in 24 hours this weekend. I thought that was an extraordinary amount until I checked the data (only available up until April this year so far, alas).

It turns out that sometime in the late sixties the local rainfall station recorded an extraordinary 392mm in 24 hours. Now that’s an outlier…!

I’ll invest in a new pair of gumboots just in case.

Smooth scatter plot rainfall

If you’re into this sort of thing, the plot was done using the “smoothScatter” function in R. It’s a change from the usual time series line chart. I think I’m a convert.