No man should be an island

Insularity may be the catchword in the aftershock analysis of Brexit, but it’s not confined to elites. We are a society clustered into islands of opinion. Our insular communities are separated from each other by opportunity and circumstance. They develop their own novel views of the world we live in. As the British referendum last week showed, these can be fundamentally different world views not easily bridged by arguments of economic rationality.

Much has been made of the insularity of elites and their failure to heed the rage of the disenfranchised, which resulted in the Brexit decision. There are parallels with Trumpism in the U.S., and there are similarities to be drawn here in Australia too.

Focus has been on the elites and their system, but little attention has yet been paid to the scaffolding by which a system fails to engage with its constituents to its own detriment. The phenomenon of insularity reaches much further and goes much deeper down the chain.

We have developed insular communities within our own cities. Communities where we think similar thoughts, have similar incomes and even speak in similar ways. The radial income distributions of our major capital cities (higher in the centre, decreasing as you move away from the centre) are striking. These patterns of income mean that the people at the school gate are more likely to come from similar incomes to you than not. Given our understanding of the relationships between education and income, there is also a good chance that the person on the treadmill next to you at the gym is also from a similar socioeconomic background.

The problem is not just insular elites ignoring the constituents that scaffold their power. The problem is that housing affordability has reduced our opportunity to interact on a daily basis with friends who are not like us. Conversation with friends is very different to professionally interacting with clients, our barista or our child’s school teacher: all of whom may live at a distance from us.

We form our world views with our friends as a social barometer. Social influence has a noted relationship with behaviour. If our social circle narrows to those only living a similar life to our own, our exposure to differing opinion may do the same.  We may find ourselves living in an echo chamber of our own views, unable to understand the passionate and different opinion of our fellow citizens without recourse to simplistically ascribing their beliefs to a lack of understanding.

The mocking commentary of regretful and fearful Brexiteers illustrates our inability to understand what compelled them to make their decision in the first place. Our eye-rolling supersedes our desire to engage. The memes were eye-wateringly funny, but when our examination of the phenomenon stops with the retweet button, we are further insulating ourselves from the uncomfortable reality of disagreement.

In Australia, our election campaign has been seven long weeks of avoiding uncomfortable disagreement. The government has not been willing to push for a sweeping policy agenda. It’s a move designed to keep the focus off the possibility of disagreement and firmly on the reassuring mantra of “jobs and growth”. Avoiding electoral discomfort only serves to further isolate communities already unheard and unremarked upon by party influencers. We are literally being asked to “stick with the current mob for awhile”.

Why do we congregate into islands of opinion? Neil Gaiman suggests that we are fearful of the consequences of disagreement or, simply, being wrong. This is an age when every opinion is recorded forever and those disagreed with may be mercilessly abused, with consequences for harassers rare enough to be newsworthy. Threats of violence are as commonplace as they are unacceptable. Opinions are also laudable if static, but not if they change. Elite changes in opinion are categorised as backflips and turnarounds, suggesting a gymnastic talent not otherwise known amongst the blue tie wearing classes. Mere ordinary citizens must make do with online mockery in all its forms.

Caution in expressing opinion or inviting disagreement is a reasonable response. We retreat geographically in the face of economic pressures and we retreat socially, online and otherwise, in the face of object lessons meted out daily in our social media channel of choice.

What should we, as a society, do? Withdrawing into our islands of opinion, we risk failing to understand each other. Brexit is an example of the ultimate consequence of this. Trumpism may be another. In Australia, the proliferation of far right microparties, each certain of saving Australia from some peril, is one manifestation.

Neil Gaiman suggests that it was in part a deep rage that allowed his friend and co-author Terry Pratchett to create as prolifically and effectively as he did. As individuals we need to risk even the deep discomfort of rage in the pursuit of understanding our neighbouring islands of opinion. Shaping society is a creative pursuit, after all.

What if policies have networks like people?

It’s been policy-light this election season, but some policy areas are up for debate. Others are being carefully avoided by mutual agreement, much like at Christmas lunch when we all tacitly agree we aren’t bringing up What Aunty Betty Did Last Year After Twelve Sherries. It’s all too painful, we’ll never come to any kind of agreement and we should just pretend like it’s not important.

However, policy doesn’t happen in a vacuum and I wondered if it was possible that using a social network-type analysis might illustrate something about the policy debate that is occurring during this election season.

To test the theory, I used the transcripts of the campaign launch speeches of Malcolm Turnbull and Bill Shorten. These are interesting documents to examine, because they are at one and the same time an affirmation of each party’s policy aspirations for the campaign as well as a rejection of the other’s. I used a simple social network analysis, similar to that used in the Aeneid study. If you want to try it yourself, you can find the R script here.

Deciding on the topics to examine was some trial and error, but the list was eventually narrowed down to 19 topics that have been the themes of the election year: jobs, growth, housing, childcare, superannuation, health, education, borders, immigration, tax, medicare, climate change, marriage equality, offshore processing, environment, boats, asylum, business and bulk billing. These aren’t the topics that the parties necessarily want to talk about, but they are nonetheless being talked about.
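For anyone who would rather see the idea than read the R script, here is a minimal Python sketch of one way to build a topic network from a speech. It is an assumption about the general approach, not a copy of the script: it treats two topics as linked whenever they turn up in the same sentence and uses the number of shared sentences as the edge weight. The function names, the shortened topic list and the sample text are all made up for illustration.

```python
import re
from collections import Counter
from itertools import combinations

# A shortened version of the topic list, to keep the example small.
TOPICS = ["jobs", "growth", "borders", "boats",
          "offshore processing", "medicare", "bulk billing"]

def split_sentences(text):
    """Very rough sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip().lower() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def topics_in(sentence, topics=TOPICS):
    """Return (sorted) the topics that appear as whole words or phrases."""
    return sorted(t for t in topics
                  if re.search(r"\b" + re.escape(t) + r"\b", sentence))

def cooccurrence_edges(text, topics=TOPICS):
    """Count how many sentences each pair of topics shares."""
    edges = Counter()
    for sentence in split_sentences(text):
        for a, b in combinations(topics_in(sentence, topics), 2):
            edges[(a, b)] += 1
    return edges

# Made-up text for illustration only: NOT a quote from either speech.
sample = (
    "We will deliver jobs and growth. "
    "Strong borders mean no boats. "
    "Jobs depend on growth. "
    "Medicare and bulk billing matter."
)

edges = cooccurrence_edges(sample)
for (a, b), weight in sorted(edges.items()):
    print(f"{a} -- {b}: {weight}")
```

A topic that is in the list but never shares a sentence with another one (here, offshore processing) ends up with no edges at all, which is roughly what an "outer arc" looks like in edge-list form. The weighted edge list can then be handed to whatever graph library you like for layout.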

It took some manoeuvring to get a network that was readable, but one layout (Kamada-Kawai for the interested) stood out. I think it describes the policy state quite well, visually speaking.

topic network 160627

We have the inner circle of high disagreement: borders, environment, superannuation, boats and immigration. There is a middle circle doing the job of containment: jobs and growth, housing, childcare, education, medicare, business and tax, all standard election fodder.

Then we have the outer arc of topics that neither the Labor nor the Liberal party really wants to engage with: offshore processing, asylum (as opposed to immigration, boats and borders), climate change (much more difficult to manage than mere environment), bulk billing (the crux of medicare) and marriage equality (have a plebiscite, have a free parliamentary vote, have something, except responsibility). I found it interesting that the two leaders’ speeches, when visualised, contain one part of a policy debate around immigration: boats and borders. But they conspicuously avoided discussing the unpleasant details: offshore processing.

Much like Aunty Betty and her unfortunate incident with the cold ham, both parties are in tacit agreement to ignore the difficult parts of a policy debate.

Australia Votes: Only Six Days to Go

It’s been painful, frankly pretty lame on the policy front and we’re over it. We all go to the national triennial BBQ election next week. While we’re standing in line clutching our sausage sandwiches and/or delightful local baked goods, it’d be nice to have an idea of what the people we’re voting for have had to say.

So another word cloud it is, because neither side has dared offer a policy that might stray from the narrative that “we’re all good blokes, really”.

This time, I requested up to 20 tweets each from Turnbull and Shorten to see what’s been going on in the last couple of weeks. I got 18 back from each. Shorten (in red, below) has been talking about voting (surprise!), been screaming about Medicare and apparently has an intense interest in trades, with mentions of “brick” and “nails”. I hope that’s real tradies he’s talking about. Standard pollie speak such as “government”, “people”, “liberals” and “Turnbull” made it into the word cloud. Marriage equality also figured in the discussion.

Screen Shot 2016-06-25 at 10.18.46 PM

Turnbull (below, blue) was making a point about his relationship with the Australian Muslim community, mentioning the Kirribilli House iftar and multifaith Australia. Standard Coalition topics such as “investment”, “stable leaders”, “plan”, “economic” and “jobs” were all present. The AMP issue I touched on briefly last time. He appears to be trying to avoid the subject of marriage equality as much as possible.

Screen Shot 2016-06-25 at 10.19.02 PM
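A word cloud is only a picture of word counts, so the step underneath the two images above can be sketched in a few lines of Python. This is an illustration rather than the code I ran: the stop-word list is a tiny hypothetical one, the function name is made up, and the strings are placeholders rather than real tweets.

```python
import re
from collections import Counter

# A tiny, hypothetical stop-word list; a real analysis would use a fuller one.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "for", "is", "on", "our", "we"}

def word_frequencies(texts, stop_words=STOP_WORDS):
    """Count words across texts, ignoring links, stop words and very short words."""
    counts = Counter()
    for text in texts:
        text = re.sub(r"https?://\S+", " ", text.lower())  # drop links
        words = re.findall(r"[a-z']+", text)
        counts.update(w for w in words if w not in stop_words and len(w) > 2)
    return counts

# Placeholder strings for illustration only: these are NOT real tweets.
tweets = [
    "Our plan for jobs and growth",
    "A plan for stable government https://example.com/x",
    "Jobs, jobs, jobs",
]

freq = word_frequencies(tweets)
print(freq.most_common(3))
```

The counts are what set the size of each word in the cloud; everything else is layout.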

So there we have it: jobs and growth, the promise of stability, an Iftar in Kirribilli, marriage equality and a fascination with how we define a real or a fake tradie. If we all keep smiling fixedly, maybe we can forget about Brexit.

Tutorials and Guides: A curated list

This post is a curated list of my favourite tutorials and guides because “that one where Hadley Wickham was talking about cupcakes” isn’t the most effective search term. You can find my list of cheat sheets here. There are a lot of great resources on data science (I’ve included my top picks), so I don’t intend to reinvent the wheel here. This is just a list of my favourites all laid out so I can find them again or point other people in their direction when it comes up in conversation. I’ve also added a number of the “how to” type posts I’ve written on this blog as I often answer an enquiry in that format.

Data Science

Tutorials and videos: General

Puppets teach data science too

  • Render a 3D object in R. I have no idea where I would ever use this information in my practice, but it’s presented BY A PUPPET. Great fun.
  • DIY your data science. Another offering from the puppet circle on the data science venn diagram.



Work Flow

  • Guide to modern statistical workflow. Really great organisation of background material.
  • Tidy data, tidy models. Honestly, if there was one thing that had been around 10 years ago, I wish this was it. The amount of time and accuracy to be saved using this method is phenomenal.
  • Extracting data from the web. You found the data, now what to do? Look here.

Linear Algebra



Machine learning

Data visualisation

Natural Language Processing

I’ll continue to update this list as I find things I think are useful or interesting.

Edit: actually, “that one where Hadley Wickham was talking about cupcakes” is surprisingly accurate as a search term.

Screen Shot 2016-06-23 at 9.05.37 PM

Cheat Sheets: The New Programmer’s Friend

Cheat sheets are brilliant: whether you’re learning to program for the first time or you’re picking up a new language. Most data scientists are probably programming regularly in multiple languages at any given time: cheat sheets are a handy reference guide that saves you from googling how to “do that thing you know I did it in python yesterday but how does it go in stata?”

This post is an ongoing curation of cheat sheets in the languages I use. In other words, it’s a cheat sheet for cheat sheets. Because a blog post is more efficient than googling “that cheatsheet, with the orange bit and the boxes.” You can find my list of the tutorials and how-to guides I enjoyed here.

R cheat sheets + tutorials

Python cheat sheets

Stata cheat sheets

  • There is a whole list of them here, organised by category.
  • Stata cheat sheet, I could have used this five years ago. Also very useful when it’s been a while since you last played in the stata sandpit.
  • This isn’t a cheat sheet, but it’s an exhaustive list of commands that makes it easy to find what you want to do, as long as you already have a good idea.

SPSS cheat sheets

  • “For Dummies” has one for SPSS too.
  • This isn’t so much a cheat sheet but a very basic click-by-click guide to trying out SPSS for the first time. If you’re new to this, it’s a good start. Since SPSS is often the gateway program for many people, it’s a useful resource.

General cheat sheets + discussions

  • Comparisons between R, Stata, SPSS, SAS.
  • This post from KDnuggets has lots of cheat sheets for R, Python, SQL and a bunch of others.

I’ll add to this list as I find things.

Law of Large Numbers vs the Central Limit Theorem: in GIFs

I’ve spoken about these two fundamentals of asymptotics previously here and here. But sometimes, you need a .gif to really drive the point home. I feel this is one of those times.

Firstly, I simulated a population of 100 000 observations from the random uniform distribution. This population looks nothing like a normal distribution and you can see that below.

histogram of uniform distribution

Next, I took 500 samples from the data at each of several sample sizes: n=5, 10, 20, 50, 100 and 500. I calculated the sample mean (x-bar) and the z score for each sample and plotted their kernel densities using ggplot in R.
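If R isn’t your thing, here is a rough Python sketch of the same experiment using only the standard library. It makes a couple of assumptions that may differ from my script: samples are drawn without replacement, and the z score is studentised with the sample standard deviation. The names are made up for the example, and it prints summary numbers instead of drawing densities.

```python
import math
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable

# Simulated population: 100 000 draws from Uniform(0, 1).
POP_SIZE = 100_000
population = [random.random() for _ in range(POP_SIZE)]
mu = statistics.fmean(population)

def simulate(n, reps=500):
    """Draw `reps` samples of size n; return their sample means and z scores."""
    means, z_scores = [], []
    for _ in range(reps):
        sample = random.sample(population, n)   # assumption: without replacement
        xbar = statistics.fmean(sample)
        s = statistics.stdev(sample)            # assumption: sample sd (studentised)
        means.append(xbar)
        z_scores.append((xbar - mu) / (s / math.sqrt(n)))
    return means, z_scores

results = {}
for n in (5, 10, 20, 50, 100, 500):
    means, z = simulate(n)
    # Spread of the sample means versus spread of the z scores at this n.
    results[n] = (statistics.stdev(means), statistics.stdev(z))
    print(f"n={n:>3}  sd(sample mean)={results[n][0]:.4f}  sd(z)={results[n][1]:.3f}")
```

The spread of the sample means shrinks as n grows, while the spread of the z scores settles near one. Those are the two behaviours the .gifs show.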

Here’s a .gif of what happens to the z score as the sample size increases: we can see that the distribution is pretty normal looking, even when the sample size is quite low. Notice that the distribution is centred on zero.

z score gif

Here’s a .gif of what happens to the sample mean as n increases: we can see that the distribution collapses on the population mean (in this case µ=0.5).

sample mean gif

For scale, here is a .gif of both densities as n gets large, sitting on the same set of axes: the behaviour is quite different.

Sample mean vs z score

 If you want to try this yourself, the script is here. Feel free to play around with different distributions and sample sizes, see what turns up.

The Law of Large Numbers: It’s Not the Central Limit Theorem

I’ve spoken about asymptotics before. It’s the lego of the modelling world, in my view. Interesting, hard and you can lose years of your life looking for just the right piece that fits into the model you’re trying to build.

The Law of Large Numbers (LLN) is another simple theorem that’s widely misunderstood. Most often it’s conflated with the central limit theorem (CLT), which deals with the studentised sample mean or z-score. The LLN pertains to the sample mean itself.

Like the CLT, the LLN is actually a collection of theorems, strong and weak. I’ll confine myself to the simplest version here, Khinchine’s weak law of large numbers. It states that for a random, independent and identically distributed sample of n observations from any distribution with a finite mean (µ) and variance, the sample mean has a probability limit equal to the population mean, µ. That is, the sample mean is a consistent estimator of the population mean under these conditions.

Put simply, as n gets very big, the sample mean gets arbitrarily close to the population mean, with probability approaching one.

Notice there is nothing about normal distributions as n gets large. That’s the key difference between the LLN and the CLT. One deals with the sample mean alone, the other with the studentised version. On its own, the distribution of the sample mean collapses onto a single point as n gets large: µ. This is the implication of the LLN.

Appropriately centred and scaled at the correct rate, the studentised sample mean has a normal distribution in the limit as n gets large: that’s the CLT.
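For those who prefer symbols, the two results can be put side by side. This is a sketch under the conditions above: X_1, …, X_n are independent and identically distributed with mean µ and finite variance σ² > 0. The symbols σ and S_n (the sample standard deviation) are notation introduced here, not something used elsewhere in the post.

```latex
\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i

% LLN: the sample mean converges in probability to the population mean
\bar{X}_n \xrightarrow{\;p\;} \mu
\quad\text{as } n \to \infty

% CLT: centred at \mu and scaled at rate \sqrt{n}, the limit is standard normal
\frac{\sqrt{n}\,\bigl(\bar{X}_n - \mu\bigr)}{\sigma} \xrightarrow{\;d\;} N(0, 1)
\quad\text{as } n \to \infty

% Studentised version: replacing \sigma with S_n gives the same limit
\frac{\sqrt{n}\,\bigl(\bar{X}_n - \mu\bigr)}{S_n} \xrightarrow{\;d\;} N(0, 1)
\quad\text{as } n \to \infty
```

The only difference between the first and second results is the centring and the √n scaling: without them the distribution collapses to a point, with them it settles into a bell curve.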

As usual, here’s an infographic to go: put side by side, the two theorems have different results but are dealing with something quite similar.

CLT vs LLN infographic

Three things every new data scientist should know

Anyone who has spent any time in the online data science community knows that this kind of post is a genre all on its own. “N things you should know/do/be/learn/never do” is something that pops up in my twitter feed several times a day. These posts range from useful ways to improve your own practice to clickbait listing reams of accomplishments that make Miss Bingley’s “accomplished young ladies” speech in Pride and Prejudice appear positively unambitious.

Miss Bingley’s pronouncement could easily be applied to data scientists everywhere:

“Oh! certainly,” cried his faithful assistant, “no [woman] can be really esteemed accomplished who does not greatly surpass what is usually met with. A woman must have a thorough knowledge of music, singing, drawing, dancing, and the modern languages, to deserve the word; and besides all this, she must possess a certain something in her air and manner of walking, the tone of her voice, her address and expressions, or the word will be but half-deserved.”

Swap out the references to women with “data scientist”, throw in a different skill set and there we have it:

“Oh! certainly,” cried his faithful assistant, “no data scientist can be really esteemed accomplished who does not greatly surpass what is usually met with. A data scientist must have a thorough knowledge of programming in every conceivable language that was, is or shall be, linear algebra, business acumen, obscure models only ever applied in obscure places, and whatever is “hot” this year, to deserve the title; and besides all this, she must possess a certain something in her air and manner of tweeting, the tone of her blogging, her linkedin profile and be a snappy dresser, or the title will be but half-deserved.”

Put like that, you’d be forgiven for not allowing the Miss Bingleys of the world to define you.

If I had a list of things to say to new data scientists, they wouldn’t have much to do with data science at all:

  1. You define yourself and your own practice. Not twitter, not an online community, not blogs from people who may or may not know your work. Data science is an incredibly broad array of people, ideas and tools. Maybe you’re in the middle of it, maybe you’re on the edge. That’s OK, it’s all valuable.
  2. You’re more than a bot. This is an industry that is increasing automation every day. You add value to your organisation in ways that a bot never can. What is the value you add? Cultivate and grow it.
  3. The online community is a wonderful place full of people who want to help you grow your practice and potential. Dive in and explore: but remember that the advice and pronouncements are just that. They don’t always apply to you all the time. Take what’s useful today and put the rest aside until it’s useful later.

It’s a short list!

It got wet

NSW got wet this weekend. In our own particular case we lost a large amount of our driveway and several paddocks spontaneously attained lake status. So there was nothing else to do but to poke around and see what I could turn up in the historic record (find yours here).

Some locals recorded up to 250mm in 24 hours this weekend. I thought that was an extraordinary amount until I checked the data (only available up until April this year so far, alas).

It turns out that sometime in the late sixties the local rainfall station recorded an extraordinary 392mm in 24 hours. Now that’s an outlier…!

I’ll invest in a new pair of gumboots just in case.

Smooth scatter plot rainfall

If you’re into this sort of thing, the plot was done using the “smoothScatter” function in R. It’s a change from the usual time series line chart. I think I’m a convert.