Australia’s same sex marriage survey

It was a farcical display of an absence of leadership. And the data it provides is not remotely as good as a properly executed survey.

Nonetheless, it had our national attention for months and it’s over.

Here’s a Shiny app because my Facebook discussions got a little detailed. Now everyone can have a look at the data on a by-electorate basis.

Some hot takes for you:

  • When thinking about outcomes in ‘electorates with a high proportion of migrants’, also think about the massively different treatment effects at play: there was little to no outreach from the yes campaign to non-English-speaking communities, while some others targeted these communities with misinformation regarding the impact of gay marriage on schools. (That’s not a diss on the yes campaign: limited resources and all of that. They were in it to win a nation, not single electorates.)
  • Remember that socioeconomic advantage is a huge confound in just about everything.
  • The survey asked about changing a status quo. That’s not entirely the same thing as being actively homophobic: but I’ll agree in this case that’s a fine line to draw.
  • Why didn’t areas with high migrant populations in other cities follow the same patterns?
  • Did Sydney diocesan involvement, both in terms of investment and pulpit rhetoric, create a different treatment effect compared to other cities?

And one thing I think we should all be constantly aware of, even as we nerds are enjoying our dissection:

  • This data was generated on the backs of the suffering of many GBLTIQ+ Australians and their families.

Bring on equality.

Code here.

Data here.

App in full screen here.

Stargate SG1: Sentiment is not Enough for the Techno Bugs

One of the really great things as your kids get older is that you can share with them the stuff you thought was cool when you were young and had ideals and spare time. One of the things my siblings and I spent a lot of that spare time doing was watching Stargate SG1. Now I’ve got kids and they love the show too.

When I sat down to watch the series the first time I was a history student in my first year of university, so Daniel’s fascination with languages and cultures was my interest too. Ironically, at the time I was also avoiding taking my first compulsory econometrics course because I was going to hate it SO MUCH.

Approximately one million years, a Ph.D. in econometrics and possibly an alternate reality later, I’m a completely different person. With Julia Silge’s fabulous Austen analyses fresh in my mind (for a start, see here and then keep exploring) I rewatched the series. I wondered: how might sentiment work for transcripts, rather than print-only media like a novel?

In my view, this is something like an instrumental variables problem. A transcript of a TV show is only part of the medium’s signal: imagery and sound round out the full product. So a sentiment analysis on a transcript is only an analysis of part of the presented work. But because dialogue is such an intrinsic and important part of the medium, might it give a good representation?

What is sentiment analysis?

If you’re not a data scientist, or you’re new to natural language processing, you may not know what sentiment analysis is. Basically, sentiment analysis compares a list of words (like you may find in a transcript, a speech or a novel) to a dictionary that measures the emotions the words convey. In its simplest form, we talk about positive and negative sentiment.

Here’s an example of a piece of text with a positive sentiment:

“I would like to take this opportunity to express my admiration for your cause. It is both honourable and brave.” – Teal’c to Garshaw, Season Two: The Tok’ra Part II.

Now here’s an example of a piece of dialogue with a negative sentiment:

“I mean, one wrong move, one false step, and a whole fragile world gets wiped out?” – Daniel, Season Two: One False Step

This is an example of a fairly neutral piece of text:

“Gentlemen, these planets designated P3A-575 and P3A-577 have been submitted by Captain Carter’s team as possible destinations for your next mission.” – General Hammond, Season One: The Enemy Within.

It’s important to understand that sentiment analysis in its simplest form doesn’t really worry about how the words are put together. Picking up sarcasm, for example, isn’t really possible by just deciding which words are negative and which are positive.

Sentiment analyses like this can’t measure the value of a text: they are abbreviations of a text. In the same way we use statistics like a mean or a standard deviation to describe a dataset, a sentiment analysis can be used to succinctly describe a text.
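To make the dictionary idea concrete, here is a minimal sketch in Python (the analysis itself was done in R). The tiny lexicon below is made up purely for illustration and is not one of the published sentiment dictionaries; the function names are my own placeholders too.

```python
import re

# Hypothetical toy lexicon for illustration only: +1 for positive words,
# -1 for negative words. Real analyses use published dictionaries.
TOY_LEXICON = {
    "admiration": 1, "honourable": 1, "brave": 1,
    "wrong": -1, "false": -1, "fragile": -1,
}

def tokenize(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def sentiment_score(text, lexicon=TOY_LEXICON):
    """Sum the lexicon score of every word; unknown words count as zero."""
    return sum(lexicon.get(word, 0) for word in tokenize(text))

positive = ("I would like to take this opportunity to express my "
            "admiration for your cause. It is both honourable and brave.")
negative = ("I mean, one wrong move, one false step, and a whole "
            "fragile world gets wiped out?")

print(sentiment_score(positive))             # Teal'c's line
print(sentiment_score(negative))             # Daniel's line
print(sentiment_score("this is not brave"))  # word order is ignored
```

With this toy lexicon, Teal’c’s line scores 3 and Daniel’s scores -3. The last call shows the limitation described above: “not brave” still scores +1, because the method only looks up individual words.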

If you’d like to find out more about how sentiment analysis works, check out Julia Silge’s blog post here which provided a lot of the detailed code structure and inspiration for this analysis.

What does sentiment analysis show for SG1?

I analysed the show on both a by-episode and by-series basis. With over 200 episodes and 10 series, the show covered a lot of ground with its four main characters. I found a couple of things that were interesting.

The sentiment arc for most episodes is fairly consistent.

Most episodes open with a highly variable sentiment as the dilemma is explained, sarcastic, wry humour is applied and our intrepid heroes set out on whatever journey/quest/mission they’re tasked with. Season One’s Within the Serpent’s Grasp is a pretty good example. Daniel finds himself in an alternate reality where everything is, to put it mildly, stuffed.

Within the Serpent’s Grasp, Season 1. Daniel crosses over to an alternate reality where earth is invaded by evil parasitic aliens with an OTT dress sense.

According to these charts, however, about three quarters of the way through it all gets a bit “meh”.

Below is the sentiment chart for Season Two’s In the Line of Duty, where Sam Carter has an alien parasite in control of her body. If that’s not enough for any astrophysicist to deal with, an alien assassin is also trying to kill Sam.

If we take the sentiment chart below literally, nobody really cares very much at all about the impending murder of a major character. Except, that’s clearly not what’s happening in the show: it’s building to the climax.

 

In the Line of Duty, Season 2. Sam Carter gets a parasite in her head and if that’s not enough, another alien is trying to kill her.

So why doesn’t sentiment analysis pick up on these moments of high drama?

I think the answer here is that this is a scifi/adventure show: tension and action aren’t usually achieved through dialogue. They’re achieved by blowing stuff up in exciting and interesting ways.

The Season Three cliffhanger introduced “the replicators” for precisely this purpose. SG1 always billed itself as a family-friendly show. Except for an egregious full frontal nude scene in the pilot, everyone kept their clothes on. Things got blown up and people got thrown around the place by Really Bad Guys, but limbs and heads stayed on and the violence wasn’t that bad. SG1 was out to make the galaxy a better place with family values, a drive for freedom and a liberal use of sarcasm.

But scifi/adventure shows thrive on two things: blowing stuff up and really big guns. So the writers introduced the replicators, a “race” of self-generating techno lego that scuttled around in bug form for the most part.

In response to this new galactic terror, SG1 pulled out the shotguns and the grenades and had a delightful several seasons blasting them blood-free. The show mostly maintained its PG rating.

The chart below shows the sentiment for the replicators’ introductory episode, Nemesis. The bugs are heading to Earth to consume all technology in their path. The Asgard, a race of super-advanced Roswell Greys, have got nothing and SG1 has to be called in to save the day. With pump action shotguns, obviously.

The replicator bugs don’t speak. The sound of them crawling around inside a space ship and dropping down on people is pretty damn creepy: but not something to be picked up by using a transcript as an instrument for the full product.

Nemesis, Season 3 cliffhanger: the rise of the techno bugs.

Season Four’s opener, Small Victories, solved the initial techno bug crisis, but not before a good half hour of two of our characters flailing around inside a Russian submarine with said bugs. Again, the sentiment analysis found it all a little “whatever” towards the end.

Small Victories, Season 4 series opener. Techno bugs are temporarily defeated.

Is sentiment analysis useless for TV transcripts then?

Actually, no. It’s just that in those parts of the show where dialogue is only of secondary importance to the other elements of the work, the usefulness of the transcript as an instrument is obscured. In order for the transcript to be a useful instrument, we need to do what we’d ideally do in many instrumental variables cases: look at a bigger sample size.

Let’s take a look at the sentiment chart for the entire sixth season. This is the one where Daniel Jackson is dead, but is leading a surprisingly active and fulfilling life for a dead man. We can see the overall structure of the story arc for the season below. The season starts with something of a bang as new nerd Jonas is introduced just in time for old nerd Daniel to receive a life-ending dose of explosive radiation. The tension goes up and down throughout. It’s most negative at about the middle of the season where there’s usually a double-episode cliffhanger and smooths out towards the end of the series until tension increases with the final cliffhanger.

 

Series Six: Daniel is dead and new guy Jonas has to pick up the vacant nerd space.

Season Eight, in which the anti-establishment Jack O’Neill has become the establishment, follows a broadly similar pattern. (Jack copes with the greatness thrust upon him as a newly-starred general by being more himself than ever before.)

Note the end-of-series low levels of sentiment. This is caused by a couple of things: as with the episodes, moments of high emotion get big scores and this obscures the rest of the distribution. I considered normalising it all between 0 and 1. This would be a good move for comparing between episodes and seasons, but didn’t seem necessary in this case.
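For the curious, the normalisation I decided against could be sketched like this. It is a plain min-max rescaling in Python; the function name and the sample scores are made up for illustration and are not taken from the actual analysis.

```python
def min_max_normalise(scores):
    """Rescale a list of sentiment scores so they lie between 0 and 1."""
    lowest, highest = min(scores), max(scores)
    if highest == lowest:
        # Every score is identical, so there is no spread to rescale.
        return [0.0 for _ in scores]
    return [(s - lowest) / (highest - lowest) for s in scores]

# Hypothetical chunk-level scores for one episode.
episode_scores = [-4, 0, 4]
print(min_max_normalise(episode_scores))
```

After rescaling, the most negative chunk sits at 0 and the most positive at 1 in every episode, which is what would make episodes and seasons comparable. It does nothing about a single extreme chunk squashing everything else, though.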

The other issue going on here is the narrative structure of the overall arc. In these cases, I think the season is slowing down a little in preparation for the big finale.

Both of these issues were also apparent in the by-episode charts.

Your turn now

For the fun of it, I built a Shiny app which will allow you to explore the sentiment of each episode and series on your own. I’ve also added simple word clouds for each episode and series. It’s an interesting look at the relative importance each character has and how that changed over the length of the show.

The early series were intensely focussed on Jack, but as the show progressed the other characters got more and more nuanced development.

Richard Dean Anderson made the occasional guest appearance after Season Eight, but was no longer a regular on the show once Season Nine started. The introduction of new characters Vala Mal Doran and Cameron Mitchell took the show in an entirely new direction. The word clouds show those changes.

You can play around with the app below, or find a full screen version here. Bear in mind it’s a little slow to load at first: the corpus of SG1 transcripts comes with the first load, and that’s a lot of episodes. Give it a few minutes, it’ll happen!

 

The details.

The IMSDB transcript database provided the transcripts for this analysis, but not for all episodes: the database had only 182 of the more than 200 episodes that were filmed. I have no transcripts for any episode in Season 10 and only half of Season 9! If anyone knows where to find more or if the spinoff Stargate Atlantis transcripts are online somewhere, I’d love to know.

A large portion of this analysis and build used Julia Silge and David Robinson’s tidytext package in R. They also have a book coming out shortly, which I have on preorder. If you’re considering learning about Natural Language Processing, this is the book to have, in my opinion.

You can find the code I wrote for the project on Github here.

 

Yield to Maturity: A Basic Interactive

The yield to maturity concept describes the approximate rate of return a bond generates if it’s held until redemption date. It’s dependent on a few things including the coupon rate (nominal interest rate), face value of the bond, price of the bond and the time until maturity.

It can get a little confusing with the mathematics behind it, so I’ve created a simple Shiny App that allows you to manipulate the inputs to observe what happens. Bear in mind this is not a financial calculator: it’s an interactive for educational purposes. It also shows the approximate, not exact, yield to maturity of a bond, which is fine for our purposes.

I’ve mapped the yield up to 30-year redemption and assumed a face value of $100. Coupon rate varies between 0% and 25%. Current price of the bond can vary between $50 and $150. Mostly, the yield curve is very flat in this simplified approximation, but observe what happens when there is only a short time to maturity (0-5 years) and rates or price are extreme. You can find the interactive directly here.

 

 

Remember, this is just an approximation. For a more accurate calculation, see here.
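If you’d like to see the arithmetic, here is a minimal Python sketch of one common textbook approximation: the annual coupon plus the straight-line gain (or loss) to face value, divided by the average of face value and price. I’m assuming this form for illustration; it may not match the interactive line for line, and the function and argument names are placeholders.

```python
def approx_ytm(face_value, coupon_rate, price, years):
    """Approximate yield to maturity using the common textbook formula.

    Assumes annual coupons and a whole bond held to redemption.
    This is an approximation, not an exact yield calculation.
    """
    annual_coupon = coupon_rate * face_value
    # Capital gain (or loss) to redemption, spread evenly over the years.
    annual_gain = (face_value - price) / years
    # Average of what you pay now and what you are repaid at maturity.
    average_value = (face_value + price) / 2
    return (annual_coupon + annual_gain) / average_value

# A bond priced at face value yields its coupon rate.
print(approx_ytm(face_value=100, coupon_rate=0.05, price=100, years=10))

# Extreme inputs with a short time to maturity versus a long one.
print(approx_ytm(face_value=100, coupon_rate=0.25, price=50, years=1))
print(approx_ytm(face_value=100, coupon_rate=0.25, price=50, years=30))
```

At a price of $50 and a 25% coupon, the approximation gives 100% with one year to run but only about 36% with thirty years to run: the discount to face value is spread over many more years, which is why the curve flattens out at longer maturities.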

Describing simple statistics

I’m a huge believer in the usefulness of learning by doing. That makes me a huge believer in Shiny, which allows me to create and deploy simple apps that allow students to do just that.

This latest app is a simple one that allows you to manipulate either the mean or the variance of a normal distribution and see how that changes the shape of the distribution.
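The same idea can be sketched without the app. This is a small Python illustration of the normal density rather than the Shiny code itself, and the function name is my own.

```python
import math

def normal_pdf(x, mean=0.0, variance=1.0):
    """Density of a normal distribution with the given mean and variance."""
    exponent = -((x - mean) ** 2) / (2 * variance)
    return math.exp(exponent) / math.sqrt(2 * math.pi * variance)

# Changing the mean slides the curve along the axis: the peak height is unchanged.
print(normal_pdf(0, mean=0), normal_pdf(2, mean=2))

# Increasing the variance flattens and widens the curve.
print(normal_pdf(0, variance=1), normal_pdf(0, variance=4))
```

Quadrupling the variance halves the height of the peak, because the peak height is proportional to one over the standard deviation.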

If you want to try out making Shiny apps, but need a place to start, check out Oliver Keyes’ excellent start up guide.

application view 1

application view 2

Exploring Correlation and the Simple Linear Regression Model

I’ve been wanting to learn Shiny for quite some time, since it seems to me that it’s a fantastic tool for communicating data science concepts. So I created a very simple app which allows you to manipulate a data generation process from weak through to strong correlation and then interprets the associated regression slope coefficient for you.

Here it is!

I made it because, whilst we often teach simple linear regression and correlation as two intermeshed ideas, students at this level rarely have the opportunity to manipulate the concepts to see how they interact. This is easily fixable with a simple app in Shiny. If you want to start working in Shiny, then I highly recommend Oliver Keyes’ excellent start up guide, which was extremely easy to follow for this project.
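Here is a minimal Python sketch of the kind of experiment the app lets you run. The data generation process below is an assumption for illustration (standard normal x and noise, mixed to hit a target correlation) and is not necessarily what the app does; all names are placeholders.

```python
import math
import random

def simulate(rho, n=1000, seed=1):
    """Draw n (x, y) pairs whose population correlation is rho."""
    rng = random.Random(seed)
    xs = [rng.gauss(0, 1) for _ in range(n)]
    ys = [rho * x + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1) for x in xs]
    return xs, ys

def correlation_and_slope(xs, ys):
    """Return sample correlation, OLS slope of y on x, and sd(y) / sd(x)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    r = sxy / math.sqrt(sxx * syy)
    slope = sxy / sxx
    return r, slope, math.sqrt(syy / sxx)

for rho in (0.1, 0.5, 0.9):
    r, slope, sd_ratio = correlation_and_slope(*simulate(rho))
    # The slope is the correlation rescaled by the ratio of standard deviations.
    print(rho, round(r, 3), round(slope, 3), round(r * sd_ratio, 3))
```

The last two columns always agree: in simple linear regression the slope is the correlation multiplied by sd(y)/sd(x), which is the link between the two ideas.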

app view