Stargate SG1: Sentiment is not Enough for the Techno Bugs

One of the really great things as your kids get older is that you can share with them stuff you thought was cool when you were young, had ideals and spare time. One of those things my siblings and I spent a lot of that spare time doing was watching Stargate SG1. Now I’ve got kids and they love the show too.

When I sat down to watch the series the first time I was a history student in my first year of university, so Daniel’s fascination with languages and cultures was my interest too. Ironically, at the time I was also avoiding taking my first compulsory econometrics course because I was going to hate it SO MUCH.

Approximately one million years, a Ph.D. in econometrics and possibly an alternate reality later, I’m a completely different person. With Julia Silge’s fabulous Austen analyses fresh in my mind (for a start, see here and then keep exploring) I rewatched the series. I wondered: how might sentiment work for transcripts, rather than print-only media like a novel?

In my view, this is something like an instrumental variables problem. A transcript of a TV show is only part of the medium’s signal: imagery and sound round out the full product. So a sentiment analysis on a transcript is only an analysis of part of the presented work. But because dialogue is such an intrinsic and important part of the medium, might it give a good representation?

What is sentiment analysis?

If you’re not a data scientist, or you’re new to natural language processing, you may not know what sentiment analysis is. Basically, sentiment analysis compares a list of words (like you may find in a transcript, a speech or a novel) to a dictionary that measures the emotions the words convey. In its simplest form, we talk about positive and negative sentiment.

Here’s an example of a piece of text with a positive sentiment:

“I would like to take this opportunity to express my admiration for your cause. It is both honourable and brave.” – Teal’c to Garshaw, Season Two: The Tokra Part II.

Now here’s an example of a piece of dialogue with a negative sentiment:

“I mean, one wrong move, one false step, and a whole fragile world gets wiped out?” – Daniel, Season Two: One False Step

This is an example of a fairly neutral piece of text:

“Gentlemen, these planets designated P3A-575 and P3A-577 have been submitted by Captain Carter’s team as possible destinations for your next mission.”- General Hammond, Season Two: The Enemy Within.

It’s important to understand that sentiment analysis in its simplest form doesn’t really worry about how the words are put together. Picking up sarcasm, for example, isn’t really possible just by deciding which words are negative and which are positive.
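To make the word-by-word idea concrete, here is a minimal sketch in Python. It is not the code behind the charts in this post (that was written in R): the tiny `LEXICON`, the function names and the sarcastic line are all made up for illustration, and a real sentiment dictionary holds thousands of words.

```python
import re

# Toy lexicon, invented for this example: +1 for positive words, -1 for negative.
LEXICON = {
    "admiration": 1, "honourable": 1, "brave": 1,
    "wrong": -1, "false": -1, "fragile": -1,
}

def tokenize(text):
    """Lower-case the text and pull out runs of letters as words."""
    return re.findall(r"[a-z]+", text.lower())

def sentiment_score(text, lexicon=LEXICON):
    """Add up the lexicon value of each word; words not in the lexicon count as zero."""
    return sum(lexicon.get(word, 0) for word in tokenize(text))

positive = ("I would like to take this opportunity to express my admiration "
            "for your cause. It is both honourable and brave.")
negative = ("I mean, one wrong move, one false step, and a whole fragile "
            "world gets wiped out?")
sarcastic = "Oh, brave. Very brave."  # hypothetical line, not from the show

print(sentiment_score(positive))   # 3
print(sentiment_score(negative))   # -3
print(sentiment_score(sarcastic))  # 2
```

The sarcastic line scores as cheerfully positive, because the score only knows which words appeared and not how they were meant.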

Sentiment analyses like this can’t measure the value of a text: they are abbreviations of a text. In the same way we use statistics like a mean or a standard deviation to describe a dataset, a sentiment analysis can be used to succinctly describe a text.

If you’d like to find out more about how sentiment analysis works, check out Julia Silge’s blog post here which provided a lot of the detailed code structure and inspiration for this analysis.

What does sentiment analysis show for SG1?

I analysed the show on both a by-episode and by-series basis. With over 200 episodes and 10 series, the show covered a lot of ground with its four main characters. I found a couple of things that were interesting.

The sentiment arc for most shows is fairly consistent.

Most shows open with a highly variable sentiment as the dilemma is explained, sarcastic, wry humour is applied and our intrepid heroes set out on whatever journey/quest/mission they’re tasked with. Season One’s Within the Serpent’s Grasp is a pretty good example. Daniel finds himself in an alternate reality where everything is, to put it mildly, stuffed.

Within the Serpent’s Grasp, Season 1. Daniel crosses over to an alternate reality where earth is invaded by evil parasitic aliens with an OTT dress sense.

According to these charts however, about three quarters of the way through it all gets a bit “meh”.

Below is the sentiment chart for Season Two’s In the Line of Duty, where Sam Carter has an alien parasite in control of her body. If that’s not enough for any astrophysicist to deal with, an alien assassin is also trying to kill Sam.

If we take the sentiment chart below literally, nobody really cares very much at all about the impending murder of a major character. Except, that’s clearly not what’s happening in the show: it’s building to the climax.


In the Line of Duty, Season 2. Sam Carter gets a parasite in her head and if that’s not enough, another alien is trying to kill her.

So why doesn’t sentiment analysis pick up on these moments of high drama?

I think the answer here is that this is a scifi/adventure show: tension and action aren’t usually achieved through dialogue. They’re usually achieved by blowing stuff up in exciting and interesting ways.

The Season Three cliffhanger introduced “the replicators” for precisely this purpose. SG1 always billed itself as a family-friendly show. Except for an egregious full frontal nude scene in the pilot, everyone kept their clothes on. Things got blown up and people got thrown around the place by Really Bad Guys, but limbs and heads stayed on and the violence wasn’t that bad. SG1 was out to make the galaxy a better place with family values, a drive for freedom and a liberal use of sarcasm.

But scifi/adventure shows thrive on two things: blowing stuff up and really big guns. So the writers introduced the replicators, a “race” of self-generating techno lego that scuttled around in bug form for the most part.

In response to this new galactic terror, SG1 pulled out the shotguns and the grenades and had a delightful several seasons blasting them blood-free. The show mostly maintained its PG rating.

The chart below shows the sentiment chart for the replicators’ introductory episode, Nemesis. The bugs are heading to earth to consume all technology in their path. The Asgard, a race of super-advanced Roswell Greys, have got nothing and SG1 has to be called in to save the day. With pump-action shotguns, obviously.

The replicator bugs don’t speak. The sound of them crawling around inside a space ship and dropping down on people is pretty damn creepy: but not something to be picked up by using a transcript as an instrument for the full product.

Nemesis, Season 3 cliffhanger: the rise of the techno bugs.

Season Four’s opener, Small Victories, solved the initial techno bug crisis, but not before a good half hour of two of our characters flailing around inside a Russian submarine with said bugs. Again, the sentiment analysis found it all a little “whatever” towards the end.

Small Victories, Season 4 series opener. Techno bugs are temporarily defeated.

Is sentiment analysis useless for TV transcripts then?

Actually, no. It’s just that in those parts of the show where dialogue is of secondary importance to the other elements of the work, the usefulness of the transcript as an instrument is obscured. For the transcript to be a useful instrument, we need to do what we’d ideally do in many instrumental variables cases: look at a bigger sample size.

Let’s take a look at the sentiment chart for the entire sixth season. This is the one where Daniel Jackson is dead, but is leading a surprisingly active and fulfilling life for a dead man. We can see the overall structure of the story arc for the season below. The season starts with something of a bang as new nerd Jonas is introduced just in time for old nerd Daniel to receive a life-ending dose of explosive radiation. The tension goes up and down throughout. It’s most negative at about the middle of the season where there’s usually a double-episode cliffhanger and smooths out towards the end of the series until tension increases with the final cliffhanger.


Series Six: Daniel is dead and new guy Jonas has to pick up the vacant nerd space.

Season Eight, in which the anti-establishment Jack O’Neill has become the establishment, follows a broadly similar pattern. (Jack copes with the greatness thrust upon him as a newly-starred general by being more himself than ever before.)

Note the end-of-series low levels of sentiment. This is caused by a couple of things: as with the episodes, moments of high emotion get big scores and this obscures the rest of the distribution. I considered normalising it all between 0 and 1. This would be a good move for comparing between episodes and seasons, but didn’t seem necessary in this case.
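For the curious, normalising between 0 and 1 is a small calculation. This is a minimal Python sketch of min-max scaling, not something I ran for the charts here; the function name and the scores are made up.

```python
def min_max_normalise(scores):
    """Rescale scores so the smallest becomes 0 and the largest becomes 1."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # Every score is identical, so there is no spread to rescale.
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

episode_scores = [-4, 0, 2, 6]  # made-up sentiment scores for one episode
normalised = min_max_normalise(episode_scores)
print(normalised)
```

One catch is that a single extreme moment still sets the scale: it becomes the 0 or the 1 and everything else is squeezed in relative to it.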

The other issue going on here is the narrative structure of the overall arc. In these cases, I think the season is slowing down a little in preparation for the big finale.

Both of these issues were also apparent in the by-episode charts.

Your turn now

For the fun of it, I built a Shiny app which will allow you to explore the sentiment of each episode and series on your own. I’ve also added simple word clouds for each episode and series. It’s an interesting look at the relative importance each character has and how that changed over the length of the show.

The early series were intensely focussed on Jack, but as the show progressed the other characters got more and more nuanced development.

Richard Dean Anderson made the occasional guest appearance after Season Eight, but was no longer a regular on the show once Season Nine started. The introduction of new characters Vala Mal Doran and Cameron Mitchell took the show in an entirely new direction. The word clouds show those changes.

You can play around with the app below, or find a full screen version here. Bear in mind it’s a little slow to load at first: the corpus of SG1 transcripts comes with the first load, and that’s a lot of episodes. Give it a few minutes, it’ll happen!


The details.

The IMSDB transcript database provided the transcripts for this analysis, but not for all episodes: the database only had 182 of the more than 200 episodes that were filmed. I have no transcripts for any episode in Season 10 and only half of Season 9! If anyone knows where to find more or if the spinoff Stargate Atlantis transcripts are online somewhere, I’d love to know.

A large portion of this analysis and build used Julia Silge and David Robinson’s Tidy Text package in R. They also have a book coming out shortly, which I have on preorder. If you’re considering learning about Natural Language Processing, this is the book to have, in my opinion.

You can find the code I wrote for the project on Github here.


Democracy Sausage Redux

One last time. I wanted to see if there was any interesting election day behaviour by following the hashtag for democracy sausage. As it turns out, there was. There was a peak of early-morning democratic enthusiasm with a bunch of sleepless auspol and sausage tragics posting furiously. It tapered off dramatically during the day as we were forced to contend with the reality of democracy.

For a change, I also calculated a basic sentiment score for each tweet and tracked that too. There was a large degree of variability on 30/06, but posting was very low that day. A late afternoon disappointment dip as people realised that we’d all packed up the BBQs and gone home before they got there was also evident. Julia Silge’s post on the subject was extremely helpful.

I’m teaching again this week and to start students off they’re doing basic charts in Excel. So here’s mine!

Line graph showing frequency and sentiment of hashtag


Using Natural Language Processing for Survey Analysis

Surveys have a specific set of analysis tools that are used for analysing the quantitative part of the data you collect (Stata is my particular poison of choice in this context). However, often the interesting parts of the survey are the unscripted, “tell us what you really think” comments.

Certainly this has been true in my own experience. I once worked on a survey deployed to teachers in Laos regarding resources for schools and teachers. All our quantitative information came back and was analysed, but one comment (translated for me into English by a brilliant colleague) stood out. It read something to the effect of “this is very nice, but the hole in the floor of the second story is my biggest concern as a teacher”. It’s not something that would ever have been included outright in the survey, but a simple sentence told us a lot about the resources this school had access to.

Careful attention to detailed comments in small surveys is possible. But if you have thousands upon thousands of responses, this is far more difficult. Enter natural language processing.

There are a number of tools which can be useful in this context. This is a short overview of some that I think are particularly useful.

  • Word Clouds. These are easy to prepare and very simple, but can be a powerful way to communicate information. Like all data visualisation, there are the good and the bad. This is an example of a very simple word cloud, while this post by Fells Stats illustrates some more sophisticated methods of using the tool.

One possibility to extend on the simple “bag of words” concept is to divide your sample by groups and compare clouds. Or create your own specific dictionary of words and concepts you’re interested in and only cloud those.

Remember that stemming the corpus is critical. For example, “work”, “worked”, “working”, “works” all belong to the same stem. They should be treated as one or else they are likely to swamp other themes if they are particularly common.

Note that no word cloud should be constructed without removing “stop words” like the, and, a, I etc. Dictionaries vary- they can (and should) be tailored to the problem at hand.
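The stop word, stemming and compare-by-group ideas above can be sketched in a few lines of Python. Everything here is an assumption for illustration: the stop word list is tiny, `crude_stem` is a deliberately naive suffix stripper rather than a proper stemmer such as Porter’s, and the survey comments are invented.

```python
import re
from collections import Counter

# A tiny stop word list; in practice, tailor this to the problem at hand.
STOP_WORDS = {"the", "and", "a", "i", "is", "of", "to", "in", "my", "we", "on"}

# Suffixes tried in order; the first match is stripped.
SUFFIXES = ("ing", "ed", "s")

def crude_stem(word):
    """Strip one common suffix, leaving at least three letters behind."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def term_counts(comments, stop_words=STOP_WORDS):
    """Count stemmed terms across a list of comments, skipping stop words."""
    counts = Counter()
    for comment in comments:
        for word in re.findall(r"[a-z]+", comment.lower()):
            if word not in stop_words:
                counts[crude_stem(word)] += 1
    return counts

# Made-up survey comments, split by respondent group.
comments_by_group = {
    "teachers": ["I worked late and the reading program works",
                 "Working on reading"],
    "principals": ["The budget is my biggest concern", "Budget work"],
}

counts_by_group = {group: term_counts(comments)
                   for group, comments in comments_by_group.items()}

for group, counts in counts_by_group.items():
    print(group, counts.most_common(3))
```

The per-group counts are what you would feed to a word cloud for each group, or put side by side in a table.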

  • Network Analysis. If you have a series of topics you want to visualise relationships for, you could try a network-type analysis similar to this. The concept may be particularly useful if you manually decide topics of interest and then examine relationships between them. In this case, the outcome is very much user-dependent/chosen, but may be useful as a visualisation.
  • Word Frequencies. Alone, simple tables of word frequencies are not always particularly useful. In a corpus of documents pertaining to education, noting that “learning” is a common term isn’t something of particular note. However, how do these frequencies change by group? Do teachers speak more about “reading” than principals? Do people in one geographical area or salary bracket have a particular set of high frequency words compared to another? This is a basic exercise in feature/variable engineering. In this case, the usual data analysis tool kit applies (see here, here and here). Remember you don’t need to stop at high frequency words: what about high frequency phrases?
  • TF-IDF (term frequency-inverse document frequency) matrix. This may provide useful information and is a basis of many more complex analyses. The TF-IDF downweights terms appearing in all documents/comments (“the”, “i”, “and” etc.) while upweighting rare words that may be of interest. See here for an introduction.
  • Are the comments clustered across some lower dimensional space? The k-means algorithm may provide some data-driven guidance there. This would be an example of “unsupervised machine learning” vis a vis “this is an algorithm everyone has been using for 25 years but we need to call it something cool”. This may not generate anything obvious at first- but who is in those clusters and why are they there?
  • Sentiment analysis will be useful, possibly both applied to the entire corpus and to subsets. For example, among those who discussed “work life balance” (and derivative terms) is the sentiment positive or negative? Is this consistent across all work/salary brackets? Are truck drivers more upbeat than bus drivers? Again, basic feature/variable engineering applies here. If you’re interested in this area, you could do a lot worse than learning from Julia Silge who writes interesting and informative tutorials in R on the subject.
  • Latent Dirichlet Allocation (LDA) and more complex topic analyses. Finally, latent Dirichlet allocation or other more complex topic analyses may be able to generate topics directly from the corpus: I think this would take a great deal of time for a new user and may have limited outcomes, particularly if an early analysis would suggest you have a clear idea of which topics are worth investigating already. It is however particularly useful when dealing with enormous corpora. This is a really basic rundown of the concept. This is a little more complex, but has useful information.
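As a small illustration of the TF-IDF item in the list above, here is a minimal Python sketch. It uses one common variant of the weighting (term frequency as a share of the comment, times the log of total documents over documents containing the term); libraries differ in the exact formula, so treat the numbers as indicative only. The comments are made up.

```python
import math
from collections import Counter

def tf_idf(documents):
    """documents: a list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(documents)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Made-up survey comments, already lower-cased and split into words.
comments = [
    "the hole in the floor".split(),
    "the books are old".split(),
    "the roof leaks".split(),
]

scores = tf_idf(comments)
print(scores[0])
```

Here “the” appears in every comment, so its weight is zero, while “hole” and “floor” appear in only one comment and get a positive weight.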

So that’s a brief run down of some basic techniques you could try: there are plenty more out there- this is just the start. Enjoy!

Late Night Democracy Sausage Surge

It’s hard-hitting electoral coverage over here at Rex. Democracy sausage is apparently more of a late night event leading up to the election. Late night tweeting was driving the hashtag up until the close of 1 July. By the end of the day twitter had changed the #ausvotes emoji to a sausage sandwich. My personal prediction is another overnight lull and then a daytime surge on 02/07 petering off by 4pm on the day.

Time series graph of #democracysausage


And just for fun, who was the top Twitter advocate for the hashtag over the last three days? A user (bot?) called SausageSizzles. Some serious tweeting going on there. A steady focus on message and brand.

Bar chart

Meanwhile, as I write Antony Green on the ABC is teaching the country about sample size and variance of estimators at the early stage of counting.

The same as yesterday, check out this discussion on R Bloggers which provides a good amount of the code for doing this analysis.

Tracking Democracy Sausage

It’s a fine tradition here in Australia where every few years communities manfully attempt to make up funding gaps by selling #democracysausage to the captive audience of compulsory voters.

For fun, I decided to see if we could track the interest in the hashtag on twitter over time. I’ve exported the frequencies out to excel for this graph making exercise, because I’ll be teaching a class on stats entirely in excel in a few weeks and this will make for some fun discussion.

Democracy sausage line graph

As we can see, as of last night (2 more sleeps until #democracysausage day), interest on twitter was increasing. I’ll bring you a democracy sausage update tomorrow.

Technical notes: the API I’m using would only pull a maximum of 350 tweets featuring the hashtag on any given day: I suspect we may be missing some interest in sausages. I’ll look into other ways of doing this.
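The counting step itself is simple once the tweets are in hand. The analysis here was done in R with the frequencies exported to Excel; the Python sketch below shows the same idea under assumptions of my own: the timestamps are made up, the timestamp format is a guess at what an API export might look like, and the function names are hypothetical.

```python
import csv
import io
from collections import Counter
from datetime import datetime

# Made-up timestamps standing in for tweets that used the hashtag.
timestamps = [
    "2016-06-29 21:15:00",
    "2016-06-30 08:02:00",
    "2016-06-30 23:40:00",
    "2016-07-01 07:55:00",
    "2016-07-01 22:10:00",
    "2016-07-01 23:59:00",
]

def tweets_per_day(stamps):
    """Count tweets per calendar day and return (date, count) pairs in date order."""
    days = Counter(
        datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S").date().isoformat()
        for stamp in stamps
    )
    return sorted(days.items())

def to_csv(rows):
    """Render the daily counts as CSV text that a spreadsheet can open."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["date", "tweets"])
    writer.writerows(rows)
    return buffer.getvalue()

daily = tweets_per_day(timestamps)
print(to_csv(daily))
```

If the API caps the number of tweets returned per day, as it did for me, the daily counts flatten out at the cap and understate the busy days.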

One very useful resource formed the bulk of the programming required: this blog post on R bloggers takes you through the basics required to do the same to any hashtag you may be interested in exploring.

Happy democracy sausage day!

What if policies have networks like people?

It’s been policy-light this election season, but some policy areas are up for debate. Others are being carefully avoided by mutual agreement, much like at Christmas lunch when we all tacitly agree we aren’t bringing up What Aunty Betty Did Last Year After Twelve Sherries. It’s all too painful, we’ll never come to any kind of agreement and we should just pretend like it’s not important.

However, policy doesn’t happen in a vacuum and I wondered if it was possible that using a social network-type analysis might illustrate something about the policy debate that is occurring during this election season.

To test the theory, I used the transcripts of the campaign launch speeches of Malcolm Turnbull and Bill Shorten. These are interesting documents to examine, because they are at one and the same time an affirmation of each party’s policy aspirations for the campaign as well as a rejection of the other’s. I used a simple social network analysis, similar to that used in the Aeneid study. If you want to try it yourself, you can find the R script here.

Deciding on the topics to examine was some trial and error, but the list was eventually narrowed down to 19 topics that have been the themes of the election year: jobs, growth, housing, childcare, superannuation, health, education, borders, immigration, tax, medicare, climate change, marriage equality, offshore processing, environment, boats, asylum, business and bulk billing. These aren’t the topics that the parties necessarily want to talk about, but they are nonetheless being talked about.

It took some manoeuvring to get a network that was readable, but one layout (Kamada-Kawai for the interested) stood out. I think it describes the policy state quite well, visually speaking.
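For anyone wondering what sits underneath a network like this, one simple way to build it is to count how often two topics turn up in the same passage of a speech. The Python sketch below shows that idea only; it is an assumption about the general approach, not a port of the R script, and the shortened topic list and the passages are made up rather than taken from the speeches. Drawing and laying out the graph is a separate step.

```python
from collections import Counter
from itertools import combinations

# A shortened topic list for the example.
TOPICS = ["jobs", "growth", "tax", "medicare", "bulk billing", "boats", "borders"]

def topic_edges(passages, topics=TOPICS):
    """Count how many passages mention each pair of topics together."""
    edges = Counter()
    for passage in passages:
        text = passage.lower()
        # Plain substring matching: simple, but "tax" would also match "taxation".
        present = sorted(topic for topic in topics if topic in text)
        for pair in combinations(present, 2):
            edges[pair] += 1
    return edges

# Made-up passages, written for this example.
passages = [
    "Our plan delivers jobs and growth through lower tax.",
    "We will protect Medicare and bulk billing.",
    "Jobs and growth depend on a fair tax system.",
    "Strong borders stop the boats.",
]

edges = topic_edges(passages)
for (topic_a, topic_b), weight in edges.most_common():
    print(topic_a, "-", topic_b, weight)
```

Each pair with a count becomes an edge in the network and the count becomes its weight; topics that never share a passage with anything else end up out on their own.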

topic network 160627

We have the inner circle of high disagreement: borders, environment, superannuation, boats and immigration. There is a middle circle doing the job of containment: jobs and growth, housing, childcare, education, medicare, business and tax- all standard election fodder.

Then we have the outer arc of topics neither the Labor nor Liberal parties really want to engage with: offshore processing, asylum (as opposed to immigration, boats and borders), climate change (much more difficult to manage than mere environment), bulk billing (the crux of medicare) and marriage equality (have a plebiscite, have a free parliamentary vote, have something, except responsibility). I found it interesting that the two leaders’ speeches when visualised contain one part of a policy debate around immigration: boats and borders. But they conspicuously avoided discussing the unpleasant details: offshore processing.

Much like Aunty Betty and her unfortunate incident with the cold ham, both parties are in tacit agreement to ignore the difficult parts of a policy debate.

Australia Votes: Only Six Days to Go

It’s been painful, frankly pretty lame on the policy front and we’re over it. We all go to the national quadrennial BBQ election next week. While we’re standing in line clutching our sausage sandwiches and/or delightful local baked goods, it’d be nice to have an idea of what the people we’re voting for have had to say.

So another word cloud it is, because neither side has dared offer a policy that might stray from the narrative that “we’re all good blokes, really”.

This time, I requested up to 20 tweets from Turnbull and Shorten to see what’s been going on in the last couple of weeks. I got 18 back from both. Shorten (in red, below) has been talking about voting (surprise!), been screaming about medicare and apparently has an intense interest in trades with mentions of “brick” and “nails”. I hope that’s real tradies he’s talking about. Standard pollie speak “government”, “people”, “liberals”, “Turnbull” made it into the word cloud. Marriage equality also figured in the discussion.

Word cloud of Bill Shorten’s recent tweets

Turnbull (below, blue) was making a point about his relationship with the Australian muslim community, mentioning the Kirribilli house iftar and multifaith Australia. Standard coalition topics such as “investment”, “stable leaders”, “plan”, “economic”, “jobs” were all present. The AMP issue I touched on briefly last time. He appears to be trying to avoid the subject of marriage equality as much as possible.

Word cloud of Malcolm Turnbull’s recent tweets

So there we have it: jobs and growth, the promise of stability, an Iftar in Kirribilli, marriage equality and a fascination with how we define a real or a fake tradie. If we all keep smiling fixedly, maybe we can forget about Brexit.

Tutorials and Guides: A curated list

This post is a curated list of my favourite tutorials and guides because “that one where Hadley Wickham was talking about cupcakes” isn’t the most effective search term. You can find my list of cheat sheets here. There are a lot of great resources on data science (I’ve included my top picks), so I don’t intend to reinvent the wheel here. This is just a list of my favourites all laid out so I can find them again or point other people in their direction when it comes up in conversation. I’ve also added a number of the “how to” type posts I’ve written on this blog as I often answer an enquiry in that format.

Data Science

Tutorials and videos: General

Puppets teach data science too

  • Render a 3D object in R. I have no idea where I would ever use this information in my practice, but it’s presented BY A PUPPET. Great fun.
  • DIY your data science. Another offering from the puppet circle on the data science venn diagram.



Work Flow

  • Guide to modern statistical workflow. Really great organisation of background material.
  • Tidy data, tidy models. Honestly, if there was one thing that had been around 10 years ago, I wish this was it. The amount of time and accuracy to be saved using this method is phenomenal.
  • Extracting data from the web. You found the data, now what to do? Look here.

Linear Algebra



Machine learning

Data visualisation

Natural Language Processing

I’ll continue to update this list as I find things I think are useful or interesting.

Edit: actually, “that one where Hadley Wickham was talking about cupcakes” is surprisingly accurate as a search term.

Screen Shot 2016-06-23 at 9.05.37 PM

Q&A vs the Leaders’ Debate: is everyone singing from the same song sheet?

The election campaign is in full swing here in Australia and earlier this week the leaders of the two main parties, Malcolm Turnbull and Bill Shorten, faced off in a heavily scripted debate in which few questions were answered and the talking points were well practiced. The encounter was described as “diabolical” and “boring”, and fewer Australians tuned in compared to recent years. Possibly this was because they expected to hear what they had already heard before.

Since the song sheet was well rehearsed, this seemed like the perfect opportunity for another auspol word cloud. The transcript of the debate was made available on Malcolm Turnbull’s website and it was an easy enough matter of poking around and seeing what could be found. Chris Uhlmann, who moderated, was added to the stop words list as he was a prominent feature in earlier versions of the cloud.

debate word cloud

The song sheet was mild: the future tense “will” was in the middle with Shorten, labor, plan, people and Turnbull. Also featured were tax, economic, growth, change and other economic nouns like billion, (per)cent, economy, budget, superannuation. There was mention of climate, (people) smugglers, fair and action, but these were relatively isolated as topics.

In summary, this word cloud is not that different to that generated from the carefully strategised twitter feeds of Turnbull and Shorten I looked at last week.

The ABC’s program Q and A could be a better opportunity for politicians to depart from the song sheet and offer less scripted insight: why not see what the word cloud throws up?

This week’s program aired the day after the leaders’ debate and featured Steve Ciobo (Liberal: minister for trade), Terri Butler (Labor: shadow parliamentary secretary for child safety and prevention of family violence), Richard di Natale (Greens, leader, his twitter word cloud is here), Nick Xenophon (independent senator) and Jacqui Lambie (independent senator). Tony Jones hosted the program and suffered the same fate as Chris Uhlmann.

QandA word cloud

The word cloud picked up on the discursive format of the show: names of panellists feature prominently. Interestingly, Richard di Natale appears in the centre. Also prominent are election related words such as Australia, government, country, question, debate.

Looking at other topics thrown up by the word cloud, there is a broad range: penalty rates, coal, senate, economy, businesses, greens, policy, money, Queensland, medicare, politician, commission.

Two different formats, two different panels and two different sets of topics. Personally, I prefer it when the song sheet has a few more pages.

Social Networks: The Aeneid Again

Applying social network analysis techniques to the Aeneid provides an opportunity to visualise literary concepts that Virgil envisaged for the text. It occurred to me that this was a great idea when I saw this social network analysis of Game of Thrones. If there is a group of literary figures more blood thirsty, charming and messed around by cruel fate than the denizens of Westeros, it would be those in the golden age of Roman literature.

Aeneid social network

This is a representation of the network of characters in the Aeneid. Aeneas and Turnus, both prominent figures in the wordcloud I created for the Aeneid are also prominent in the network. Connected to Aeneas is his wife Lavinia, his father Anchises, the king of the Latins (Latinus) and Pallas, the young man placed into Aeneas’ care.

Turnus is connected to Aeneas directly along with his sister Juturna, Evander (father of Pallas. Cliff notes version: the babysitting did not go well) and Allecto, a divine figure of rage.

Between Aeneas and Turnus is the “Trojan contingent”. Virgil deliberately created parallels between the stories surrounding the fall of Troy and Aeneas’ story. Achilles, the tragic hero, is connected to Turnus directly, while Aeneas is connected to Priam (king of Troy) and Hector (the great defender of Troy). Andromache is Hector’s widow whom Aeneas meets early in the epic.

Also of note is the divine grouping: major players in directing the action of the epic. Jupiter, king of the gods and Apollo the sun god are directly connected to our hero. Venus, Neptune, Minerva and Cupid are all present. In a slightly different grouping, Juno, Queen of the Gods and Aeneas’ enemy is connected to Dido, Aeneas’ lover. Suffice it to say, the relationship was not a “happily ever after”.

I used this list of the characters in the Aeneid as a starting point and later removed all characters who were peripheral to the social network. If you’re interested in trying this yourself, I posted the program I used here. Once again, the text used is the translation by J.W. Mackail and you can download it from Project Gutenberg here.

There were a number of resources I found useful for this project:

  • This tutorial from R DataMining provided a substantive amount of the code required for the social network analysis
  • This tutorial from the same place was very helpful for creating a term-document matrix. I’ve used it previously a number of times.
  • This article from R Bloggers on using igraph was also very useful
  • There were a number of other useful links and I’ve documented those in the R script.

Whilst text mining is typically applied to modern issues, the opportunity to visualise an ancient text is an interesting one. I was interested in how the technique grouped the characters together. These groupings were by and large consistent not only with the surface interpretation of the text, but also deeper levels of political and moral meaning within the epic.