Australia’s same-sex marriage survey

It was a farcical display of an absence of leadership. And the data it provides is not remotely as good as a properly executed survey.

Nonetheless, it held our national attention for months, and now it’s over.

Here’s a Shiny app because my Facebook discussions got a little detailed. Now everyone can have a look at the data on a by-electorate basis.

Some hot takes for you:

  • When thinking about outcomes in ‘electorates with a high proportion of migrants’, also think about the massively different treatment effects caused by the fact that there was little to no outreach from the yes campaign to non-English-speaking communities, while other campaigns targeted these communities with misinformation regarding the impact of gay marriage on schools. (That’s not a diss on the yes campaign: limited resources and all of that. They were in it to win a nation, not single electorates.)
  • Remember that socioeconomic advantage is a huge confound in just about everything.
  • The survey asked about changing a status quo. That’s not entirely the same thing as being actively homophobic, though I’ll agree that in this case it’s a fine line to draw.
  • Why didn’t areas with high migrant populations in other cities follow the same patterns?
  • Did Sydney diocesan involvement, both in terms of investment and pulpit rhetoric, create a different treatment effect compared with other cities?

And one thing I think we should all be constantly aware of, even as we nerds are enjoying our dissection:

  • This data was generated on the backs of the suffering of many LGBTIQ+ Australians and their families.

Bring on equality.

Code here.

Data here.

App in full screen here.

Some notes

I’m teaching a course in financial statistics this semester – something I’ve been doing on and off for, well, a stretch best measured in decades now.

To make my life easier, I started compiling my notes on various subjects: here they are in rough draft form, at least for the moment. I’ll add to them as I go.

Chapter 1 The World of Data Around Us

Chapter 2 Data Visualisation Matters

Data Analysis

An Introduction to Algebra

Introduction to Sigma Notation

Failure Is An Option

Failure is not an option available to most of us, most of the time. With people depending on us, it’s a luxury few can afford. As a consequence, we shoot for a minimum viable product, minimise risks and, often, minimise creativity. This is a crying shame. If we want to be better programmers, better modellers or better analysts, we need to have space to fail occasionally. The opportunity to “try it and see” is an incredible luxury not offered to many people.

This isn’t a defence of mediocrity or incompetence. Let me be very clear: highly capable, brilliant people fail at things they try. They fail because they take risks. They push and push and push. They find out a whole bunch of things that don’t work and, if they’re lucky, find the piece of gold that does.

In our general workplaces, we often can’t afford this, unless we are very privileged. I spent last week at the ROpensci Oz Unconference and failure wasn’t just an option: it was encouraged.

This was an incredibly freeing approach to programming and one that generated a wealth of creativity and collaboration. Participants were users of all skill levels from just-starting-out to decades-long-veterans. They came from diverse fields like ecology, psychology, academia and business. The underlying ethos of the conference was “try it and see”.

Try it we did.

We had two days of learning, trying, succeeding, failing occasionally and solving those problems until we succeeded. Thanks to ROpensci and to our sponsors: having the space and support to fail made exploration, learning and creativity a priority. It’s a rare luxury and one I’d recommend to everyone if you can!

If you’re just starting out: remember that it’s OK to fail occasionally. You’ll be a better programmer, better analyst or better modeller for it.

Size Matters

It would be too easy for someone like me to declare that A/B testing is simple. If you’re doing website testing you have all the power in the world. Sample sizes literally not dreamt of when I was an undergraduate. A large sample is the same thing as power, right? And of course, power is all that matters.

This is completely wrong. While statistical A/B testing uses (parts of) the same toolkit I was using in RCTs for Papua New Guinea and Tonga, it isn’t looking at the same problem and it isn’t looking at the same effect sizes. In the web context, A/B testing is used to detect incremental changes. By contrast, in my previous professional life we were looking for the biggest possible bang we could generate for the very limited development dollar.

As always, context matters. Power gets more expensive as you approach the asymptote, and more expensive again if the effect size is also shrinking. How expensive? How big does that sample have to be? This is one of the things I’m looking at in this experiment.
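To put a rough number on it, base R’s power.t.test will solve for the sample size. A minimal sketch: the effect size and standard deviation below are assumptions for illustration, not numbers from any real experiment.

    # Sample size per group needed to detect a tiny effect
    # (delta = 0.01, sd = 1) at the 5% level with 80% power.
    # These values are assumed purely for illustration.
    power.t.test(delta = 0.01, sd = 1, sig.level = 0.05, power = 0.8)
    # n comes back at roughly 157 000 per group.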

However, size isn’t something I see discussed often. The size of a test is its actual Type I error rate: the probability of rejecting the null hypothesis when it’s true. Together with power, it describes the tradeoff you’ve made between Type I and Type II errors. We fix our nominal size a priori and then we move on and forget about it. But is our fixed, set size the same as the actual size of our test?

We know, for example, that Fisher’s exact test is often undersized in practice, meaning the test is too conservative: it rejects a true null less often than its nominal level says it should. On the flip side, a test that is too profligate (oversized) rejects the null hypothesis when it’s true far too often.

In this post, I’m going to look at a very basic test with some very robust assumptions and see what happens to size and power as we vary those assumptions and sample sizes. The purpose here is the old decision vs default adage: know what’s happening and own the decision you make.

tl;dr

While power matters in A/B testing, so does the size of the test. We spend a lot of time worrying about power in this field but not enough (in my view) ensuring our expectations of size are appropriate. Simple departures from the unicorns-and-fairy-dust normal distribution can cause problems for size and power.

Testing tests

The test I’m looking at is the plainest of plain vanilla statistical tests: the null hypothesis is that the mean of the generating distribution is zero, against the alternative that it isn’t. The statistic is the classic z-statistic, and the assumptions underlying the test can go one of two ways:

  1. In the finite sample, if the underlying distribution is normal, the z-statistic behaves well at any sample size (strictly, with an estimated standard deviation its exact distribution is Student’s t, which is close to normal once n isn’t tiny).
  2. Asymptotically, as long as the underlying distribution meets some conditions – independence and enough finite moments (the fat-tails condition: more on this later) – the z-statistic will have a normal limiting distribution as the sample size gets large.
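For concreteness, the statistic in question is the standardised sample mean, z = sqrt(n) * xbar / s, where xbar is the sample mean and s is the sample standard deviation; for a two-sided test at the 5% level, it’s compared against ±1.96.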

I’m going to test this simple test in four scenarios over a variety of sample sizes:

  1. Normally generated errors. Everything is super, nothing to see here. This is the statistician’s promised land flowing with free beer and coffee.
  2. t(4) errors. This is a fat-tailed distribution, still centred on zero and symmetric. Moments of order below four are finite, but the fourth and above are not: we’re right at the edge of the moment conditions. Is fat tails alone an issue?
  3. Centred and standardised chi-squared(2) errors: fat-tailed and asymmetric, the errors generated from this distribution have mean zero and a standard deviation of unity. Does symmetry matter that much?
  4. Cauchy errors. This is the armageddon scenario: the central limit theorem doesn’t even apply here. There is no theoretical underpinning for getting this test to work under this scenario – no moments of order one or above are finite (there are some fractional ones, though). Can a big sample size get you over this in practice?

In the experiments, we’ll look at sample sizes between 10 and 100 000. Note that the charts below are on a log10 scale.

The null hypothesis is mu = 0; the rejection rate in this scenario gives us the size of our test. We can measure power under a variety of alternatives, and here I’ve looked at mu = 0.01, 0.1, 1 and 10. I also looked at micro effect sizes like mu = 0.0001, but it was a sad story. [1]

The test has been set with significance level 0.05 and each experiment was performed with 5000 replications. Want to try it yourself/pull it apart? Code is here.
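If you want the gist without pulling the full code apart, here’s a minimal sketch of one cell of the experiment. It mirrors the setup described above, but it’s an illustrative reconstruction, not the linked code itself.

    # One cell of the size/power experiment: 5000 replications of a
    # two-sided z-test of H0: mu = 0 at the 5% level. Swap the error
    # draw for rt(n, 4), (rchisq(n, 2) - 2) / 2 or rcauchy(n) to get
    # the other scenarios.
    set.seed(42)
    reps <- 5000
    n    <- 1000
    mu   <- 0       # mu = 0 gives size; 0.01, 0.1, 1 or 10 give power

    reject <- replicate(reps, {
      x <- mu + rnorm(n)                # errors plus the true mean
      z <- sqrt(n) * mean(x) / sd(x)    # the classic z-statistic
      abs(z) > qnorm(0.975)             # reject at the 5% level?
    })
    mean(reject)    # rejection rate: size under H0, power otherwise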

Normal errors: everything is super

[Chart: size and power by sample size, normal errors]

With really small samples, the test is oversized: comparing against normal critical values with an estimated standard deviation is optimistic when n is tiny (the exact finite-sample distribution is Student’s t). n = 10 is a real Hail Mary at the best of times, whatever your assumptions. But from about n = 30 and above, the size is consistently in the range of 0.05.

It comes as a surprise to no one to hear that power is very much dependent on distance from the null hypothesis. For alternates where mu = 10 or mu = 1, power is near unity for smallish sample sizes. In other words, we reject the null hypothesis close to 100% of the time when it is false. It’s easy for the test to distinguish these alternatives because they’re so different to the null.

But what about smaller effect sizes? If mu = 0.1, then you need a sample of at least a thousand to get close to power of unity. If mu = 0.01, then a sample of 100 000 might be required.

Smaller still? The chart below shows another run of the experiment with effect sizes of mu = 0.001 and mu = 0.0001 – minuscule. The test has almost no power even at sample sizes of 100 000. If you need to detect effects this small, you need samples in the millions (at least).

[Chart: power for micro effect sizes, normal errors]

Fat shaming distributions

Leptokurtic (fat-tailed) distributions catch a lot of flak for messing up tests and models, and that’s fairly well deserved. However, the degree of fatness is an issue. The t(4) distribution still generates a statistic that works asymptotically in this context: but it sits right at the edge of the moment conditions, with no wriggle room at all.

Size is more variable than in the normal scenario: it’s more profligate at small sample sizes, has a tendency to be more conservative at larger ones, but stays reasonably close to 0.05 once samples are large.

[Chart: size and power by sample size, t(4) errors]

Power is costly for the smaller effect sizes. For the same sample size (say n = 1000) with mu = 0.1, there is substantially less power than in the normal case. Similar behaviour is evident for mu = 0.01. Tiny effect sizes are similarly punished (see below).

[Chart: power for micro effect sizes, t(4) errors]

Fat, Skewed and Nearly Dead

The chi-squared(2) distribution put this simple test through its paces. For anything under n = 1000, the size of the test is not in the same ballpark as its nominal level, rendering the test (in my view) a liability. By the time n = 100 000, the size of the test is reasonable.

Power shows similar outcomes, but in my view it’s not a saving grace here: with so little control over size, there’s no sound basis on which to interpret your power.

[Chart: size and power by sample size, chi-squared(2) errors]

Here be dragons

I included a scenario with the Cauchy distribution, despite the fact that it’s grossly unfair to this simple little test. With Cauchy errors the Central Limit Theorem does not apply: the test will not work in theory (or, indeed, in practice).

I thought it was a useful exercise, however, to show what that looks like. Too often, we assume that “as n gets big, the CLT is going to work its magic”, and that’s just not true. To wit: one hot mess.

[Chart: size and power by sample size, Cauchy errors]

Neither size nor power improves as the sample size increases: the CLT simply isn’t operational in this scenario. The test is undersized and underpowered for all but the largest of effect sizes (and really, at that effect size you could tell the difference from a chart anyway).

A/B reality is rarely this simple

A/B testing reality is rarely as simple as the test I’ve illustrated above. More typically, we’re testing groups of means or proportions, interaction effects, possibly dynamic relationships and a long list besides.

My purpose here is to show that even a simple, robust test with minimal assumptions can be thoroughly useless if those assumptions are not met. More complex tests and testing regimes that build on these simple results may be impacted more severely and more completely.

Power is not the only concern: size matters.

End notes

[1] Yes, I need to get a grip on LaTeX in WordPress sooner or later, but it’s less interesting than the actual experimenting.

Violence Against Women: Standing Up with Data

Today, I spent the day at a workshop tasked with exploring the ways we can use data to contribute to ending violence against women. I was invited along by The Minerva Collective, who have been working on the project for some time.

Like all good workshops, it generated approximately 1001 good ideas. The facilitation was great: the future plan got narrowed down to a manageable handful.

One thing I particularly liked was that while the usual NGO and charitable contributors were present (and essential), the team from Minerva had managed to bring in a number of industry contributors from telecommunications and finance who were able to make substantial contributions. This is quite a different approach to what I’ve seen before and I’m interested to see how we can work together.

I’m looking forward to the next stage: there’s a huge capacity to make a difference here. While there are no simple answers or magic bullets, data science could definitely do some good.

Hannan-Quinn Information Criterion

This is a short post for all of you out there who use information criteria to inform model decision making. The usual suspects are the Akaike (AIC) and Schwarz-Bayesian (BIC) criteria.

Especially if you’re working with big data, try adding the Hannan-Quinn (1979) criterion into the mix. It’s not often used in practice, for some reason: possibly a leftover from our small-sample days. Its penalty term grows at log(log(n)) – the slowest rate that still delivers consistent model selection. As a result, it’s often more conservative than the AIC in the number of parameters or size of the model it suggests, e.g. it can be a good foil against overfitting.
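For a model with maximised log-likelihood logL, k estimated parameters and n observations, the criterion is HQC = -2*logL + 2*k*log(log(n)). R doesn’t ship a built-in for it the way it does AIC() and BIC(), but it’s a one-liner to add. A minimal sketch – the lm() example is purely illustrative:

    # Hannan-Quinn criterion: -2 * logLik + 2 * k * log(log(n))
    hqc <- function(model) {
      ll <- logLik(model)
      k  <- attr(ll, "df")   # number of estimated parameters
      n  <- nobs(model)      # sample size
      -2 * as.numeric(ll) + 2 * k * log(log(n))
    }

    fit <- lm(mpg ~ wt + hp, data = mtcars)
    c(AIC = AIC(fit), BIC = BIC(fit), HQC = hqc(fit))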

It’s not the whole answer, and for your purposes it may offer no different insight. But it adds nothing to your run time, it’s fabulously practical, and it’s one of my all-time favourites that no one has ever heard of.

Things I wish I’d noticed in grad school

Back in the day, I tended to get a little hyper-focussed on things. I’m sure someone, sometime, somewhere pointed this stuff out to me. But at the time it went over my head and I learned these things the hard way. Maybe my list of things I wish I’d noticed helps someone else.

  • Your professional contacts matter and it’s OK to ask for help. You’re not researching in a vacuum: the people around you want to help.
  • You need to look outside your department and university. There’s a bigger, wider world out there, and while what’s going on inside your little world seems important, you need to be aware of what’s outside too.
  • Being methodologically/theoretically robust matters, yes. But learning when to let it go is going to be harder than learning the theory/methodology. No easy answers here, all you can do is make your decision and own it.
  • It doesn’t matter how much you read, you’re not going to be an expert across your whole field. Just be aware of the field and be an expert in what you’re doing right now. That’s OK.
  • Get a life. Really.

Stargate SG1: Sentiment is not Enough for the Techno Bugs

One of the really great things as your kids get older is that you can share with them stuff you thought was cool when you were young, had ideals and spare time. One of those things my siblings and I spent a lot of that spare time doing was watching Stargate SG1. Now I’ve got kids and they love the show too.

When I sat down to watch the series the first time I was a history student in my first year of university, so Daniel’s fascination with languages and cultures was my interest too. Ironically, at the time I was also avoiding taking my first compulsory econometrics course because I was going to hate it SO MUCH.

Approximately one million years, a Ph.D. in econometrics and possibly an alternate reality later, I’m a completely different person. With Julia Silge’s fabulous Austen analyses fresh in my mind (for a start, see here and then keep exploring) I rewatched the series. I wondered: how might sentiment work for transcripts, rather than print-only media like a novel?

In my view, this is something like an instrumental variables problem. A transcript of a TV show is only part of the medium’s signal: imagery and sound round out the full product. So a sentiment analysis on a transcript is only an analysis of part of the presented work. But because dialogue is such an intrinsic and important part of the medium, might it give a good representation?

What is sentiment analysis?

If you’re not a data scientist, or you’re new to natural language processing, you may not know what sentiment analysis is. Basically, sentiment analysis compares a list of words (like you may find in a transcript, a speech or a novel) to a dictionary that measures the emotions the words convey. In its simplest form, we talk about positive and negative sentiment.

Here’s an example of a piece of text with a positive sentiment:

“I would like to take this opportunity to express my admiration for your cause. It is both honourable and brave.” – Teal’c to Garshaw, Season Two: The Tok’ra Part II.

Now here’s an example of a piece of dialogue with a negative sentiment:

“I mean, one wrong move, one false step, and a whole fragile world gets wiped out?” – Daniel, Season Two: One False Step

This is an example of a fairly neutral piece of text:

“Gentlemen, these planets designated P3A-575 and P3A-577 have been submitted by Captain Carter’s team as possible destinations for your next mission.”- General Hammond, Season Two: The Enemy Within.

It’s important to understand that sentiment analysis in its simplest form doesn’t really worry about how the words are put together. Picking up sarcasm, for example, isn’t really possible just by deciding which words are negative and which are positive.

Sentiment analyses like this can’t measure the value of a text: they are abbreviations of a text. In the same way we use statistics like a mean or a standard deviation to describe a dataset, a sentiment analysis can be used to succinctly describe a text.

If you’d like to find out more about how sentiment analysis works, check out Julia Silge’s blog post here which provided a lot of the detailed code structure and inspiration for this analysis.
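If you’re curious about the mechanics, here’s a minimal sketch of the kind of pipeline involved, in the spirit of Julia Silge’s approach. The transcripts data frame – with a text column of dialogue and an episode column – is an assumed structure for illustration, not the actual project code.

    library(dplyr)
    library(tidyr)
    library(tidytext)

    # Net sentiment in chunks of ten lines of dialogue per episode,
    # scored against the Bing positive/negative lexicon.
    transcripts %>%
      group_by(episode) %>%
      mutate(line = row_number()) %>%
      unnest_tokens(word, text) %>%
      inner_join(get_sentiments("bing"), by = "word") %>%
      count(episode, index = line %/% 10, sentiment) %>%
      spread(sentiment, n, fill = 0) %>%
      mutate(sentiment = positive - negative)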

What does sentiment analysis show for SG1?

I analysed the show on both a by-episode and by-series basis. With over 200 episodes and 10 series, the show covered a lot of ground with its four main characters. I found a couple of things that were interesting.

The sentiment arc for most episodes is fairly consistent.

Most episodes open with highly variable sentiment as the dilemma is explained, sarcastic, wry humour is applied, and our intrepid heroes set out on whatever journey/quest/mission they’re tasked with. Season One’s Within the Serpent’s Grasp is a pretty good example. Daniel finds himself in an alternate reality where everything is, to put it mildly, stuffed.

Within the Serpent’s Grasp, Season 1. Daniel crosses over to an alternate reality where earth is invaded by evil parasitic aliens with an OTT dress sense.

According to these charts, however, about three-quarters of the way through it all gets a bit “meh”.

Below is the sentiment chart for Season Two’s In the Line of Duty, where Sam Carter has an alien parasite in control of her body. If that’s not enough for any astrophysicist to deal with, an alien assassin is also trying to kill Sam.

If we take the sentiment chart below literally, nobody really cares very much at all about the impending murder of a major character. Except, that’s clearly not what’s happening in the show: it’s building to the climax.

 

In the Line of Duty, Season 2. Sam Carter gets a parasite in her head and if that’s not enough, another alien is trying to kill her.

So why doesn’t sentiment analysis pick up on these moments of high drama?

I think the answer here is that this is a scifi/adventure show: tension and action aren’t usually achieved through dialogue. They’re usually achieved by blowing stuff up in exciting and interesting ways.

The Season Three cliffhanger introduced “the replicators” for precisely this purpose. SG1 always billed itself as a family-friendly show. Except for an egregious full frontal nude scene in the pilot, everyone kept their clothes on. Things got blown up and people got thrown around the place by Really Bad Guys, but limbs and heads stayed on and the violence wasn’t that bad. SG1 was out to make the galaxy a better place with family values, a drive for freedom and a liberal use of sarcasm.

But scifi/adventure shows thrive on two things: blowing stuff up and really big guns. So the writers introduced the replicators, a “race” of self-generating techno lego that scuttled around in bug form for the most part.

In response to this new galactic terror, SG1 pulled out the shot guns, the grenades and had a delightful several seasons blasting them blood-free. The show mostly maintained its PG rating.

The chart below shows the sentiment chart for the replicators’ introductory episode, Nemesis. The bugs are heading to earth to consume all technology in their path. The Asgard, a race of super-advanced Roswell Greys have got nothing and SG1 has to be called in to save the day. With pump action shotguns, obviously.

The replicator bugs don’t speak. The sound of them crawling around inside a space ship and dropping down on people is pretty damn creepy: but not something to be picked up by using a transcript as an instrument for the full product.

Nemesis, Season 3 cliffhanger: the rise of the techno bugs.

Season Four’s opener, Small Victories, solved the initial techno bug crisis, but not before a good half hour of two of our characters flailing around inside a Russian submarine with said bugs. Again, the sentiment analysis found it all a little “whatever” towards the end.

Small Victories, Season 4 series opener. Techno bugs are temporarily defeated.

Is sentiment analysis useless for TV transcripts then?

Actually, no. It’s just that in those parts of the show where dialogue is only of secondary importance, the other elements of the work obscure the usefulness of the transcript as an instrument. For the transcript to be a useful instrument, we need to do what we’d ideally do in many instrumental variables cases: look at a bigger sample size.

Let’s take a look at the sentiment chart for the entire sixth season. This is the one where Daniel Jackson is dead, but leading a surprisingly active and fulfilling life for a dead man. We can see the overall structure of the story arc for the season below. The season starts with something of a bang as new nerd Jonas is introduced just in time for old nerd Daniel to receive a life-ending dose of explosive radiation. The tension goes up and down throughout: it’s most negative at about the middle of the season, where there’s usually a double-episode cliffhanger, then smooths out until tension rises again with the final cliffhanger.

 

Series Six: Daniel is dead and new guy Jonas has to pick up the vacant nerd space.

Season Eight, in which the anti-establishment Jack O’Neill has become the establishment, follows a broadly similar pattern. (Jack copes with the greatness thrust upon him as a newly-starred general by being more himself than ever before.)

Note the end-of-season low levels of sentiment. This is caused by a couple of things. As with the episodes, moments of high emotion get big scores and this obscures the rest of the distribution. I considered normalising it all between 0 and 1 – a good move for comparing between episodes and seasons, but it didn’t seem necessary in this case.
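For the record, that normalisation is a one-liner should you ever want it – a minimal sketch on a made-up vector of scores:

    # Min-max rescale a vector of sentiment scores onto [0, 1]
    rescale01 <- function(x) (x - min(x)) / (max(x) - min(x))
    rescale01(c(-3, 0, 2, 7))   # returns 0.0 0.3 0.5 1.0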

The other issue going on here is the narrative structure of the overall arc. In these cases, I think the season is slowing down a little in preparation for the big finale.

Both of these issues were apparent in the by-episode charts as well.

Your turn now

For the fun of it, I built a Shiny app that lets you explore the sentiment of each episode and series on your own. I’ve also added simple word clouds for each episode and series. It’s an interesting look at the relative importance of each character and how that changed over the length of the show.
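The word clouds use the same tidy approach. A minimal sketch, again assuming the hypothetical transcripts frame from earlier rather than the app’s actual code:

    library(dplyr)
    library(tidytext)
    library(wordcloud)

    # Most frequent words across the corpus, minus common stop words
    transcripts %>%
      unnest_tokens(word, text) %>%
      anti_join(stop_words, by = "word") %>%
      count(word) %>%
      with(wordcloud(word, n, max.words = 100))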

The early series were intensely focussed on Jack, but as the show progressed the other characters got more and more nuanced development.

Richard Dean Anderson made the occasional guest appearance after Season Eight, but no longer had a regular role on the show from Season Nine onwards. The introduction of new characters Vala Mal Doran and Cameron Mitchell took the show in an entirely new direction. The word clouds show those changes.

You can play around with the app below, or find a full screen version here. Bear in mind it’s a little slow to load at first: the corpus of SG1 transcripts comes with the first load, and that’s a lot of episodes. Give it a few minutes, it’ll happen!

 

The details.

The IMSDB transcript database provided the transcripts for this analysis, but not for all episodes: it had only 182 of the more than 200 episodes filmed on file. I have no transcripts for any episode in Season 10 and only half of Season 9! If anyone knows where to find more, or if the spinoff Stargate Atlantis transcripts are online somewhere, I’d love to know.

A large portion of this analysis and build used Julia Silge and David Robinson’s tidytext package in R. They also have a book coming out shortly, which I have on preorder. If you’re considering learning about natural language processing, this is the book to have, in my opinion.

You can find the code I wrote for the project on Github here.

 

Expertise vs Awareness for the Data Scientist

We’ve all seen them: articles with headlines like “17 things you MUST know to be a data scientist” and “Great data scientists know these 198 algorithms no one else does.” While the content can be a useful read, the titles are clickbait and imposter syndrome is a common outcome.

You can’t be an expert in every skill on the crazy data science Venn Diagram. It’s not physically possible and if you try you’ll spend all your time attempting to become a “real” data scientist with no time left to be one. In any case, most of those diagrams actually describe an entire industry or a large and diverse team: not the individual.

Data scientists need expertise, but you only need expertise in the areas you’re working with right now. For the rest, you need awareness.

Awareness of the broad church that is data science tells you when you need more knowledge, more skill or more information than you currently have. Awareness of areas outside your expertise means you don’t default to the familiar, you make your decisions based on a broad understanding of what’s possible.

Expertise still matters, but the exact area you’re expert in is less important. Expertise gives you the skills you need to go out and learn new things when and as you need them. Expertise in Python gives you the skills to pick up R or C++ next time you need them. Expertise in econometrics gives you the skills to pick up machine learning. Heck, expertise in languages (human ones, not computer ones) is also a useful skill set for data scientists, in my view.

You need expertise because that gives you the core skills to pick up new things. You need awareness because that will let you know when you need the new things and what they could be. They’re not the same thing: so keep doing what you do well and keep one eye on what other people do well.

Tiny Coders

I’ve mentioned it before, but I run the local code club out here in rural Australia. We are using the Code Club curriculum, designed for kids aged 9-12. Due to our particular circumstances with transport and distance, our code club needs to offer fun and learning for the age range 5-8 as well. Some of our littles are finding the materials too challenging to be fun, so as of this week we are running two streams:

  • The “Senior Dev Team”: in time-honoured managerial tradition, I told them they could be senior devs with a badge, if they helped the littles. That’s right, more responsibility and nothing but a badge to show for it. The senior dev team is going to keep going with the regular code club projects and they are smashing them out. Seriously, all I need to do is get these kids a black t-shirt each and they’re regular programmers already.
  • The “red team”: these are our kids that are struggling with the projects we have been doing and not having fun because of it. We’ll be doing multistage projects with lots of optional end points for kids to stop and go play: these are really young kids sitting down to code after six hours of school, so for some of them 20 minutes is more than enough. For them, it’s enough that they learn that computers and code are fun and interesting. For the older/more capable kids in this group we’ll still be learning about loops and conditional statements and all the good stuff, but our projects will be pared back and more basic so they aren’t overwhelming.

Our first red team project is here: Flying Cat Instructions and on Github here.

Of course, none of this would be possible without an amazing team of dedicated parent and teacher volunteers, many of whom had very few computer skills before we started and NO coding skills. They’re as amazing as the kids.