Violence Against Women: Standing Up with Data

Today, I spent the day at a workshop tasked with exploring the ways we can use data to contribute to ending violence against women. I was invited along by The Minerva Collective, who have been working on the project for some time.

Like all good workshops, there were approximately 1001 good ideas. The facilitation was great: the future plan was narrowed down to a manageable handful.

One thing I particularly liked was that while the usual NGO and charitable contributors were present (and essential), the team from Minerva had managed to bring in a number of industry contributors from telecommunications and finance who were able to make substantial contributions. This is quite a different approach from what I’ve seen before, and I’m interested to see how we can work together.

I’m looking forward to the next stage: there’s a huge capacity to make a difference. While there are no simple answers or magic bullets, data science could definitely do some good here.

A Primer on Basic Probability

… and by basic, I mean basic. I sometimes find people come to me with questions and no one has ever taken the time to give them the most basic underpinnings in probability that would make their lives a lot easier. A friend of mine is having this problem and is on a limited time frame for solving it, so this is quick and dirty and contains both wild ad-lib on my part and swearing. When I get some more time, I’ll try and expand and improve, but for now it’s better than nothing.

YouTube explainer: done without a microphone, sorry, time limit again.

Slides I used:


I mentioned two links in the screencast. One was Allen Downey’s walkthrough with Python; you don’t need to know anything about Python to explore this one, and it’s well worth it. The other is Victor Powell’s visualisation of conditional probability. Again, worth a few minutes’ exploration.
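If you’d rather poke at conditional probability in code than on paper, here’s a tiny simulation of my own (not from the screencast) that checks the definition P(A|B) = P(A and B) / P(B) by brute force, using two dice:

```python
import random

random.seed(42)

# Event B: the first die shows a 6. Event A: the two dice sum to 10 or more.
# Estimate P(A | B) by simulation and compare with the exact answer.
n = 100_000
b_count = 0    # times B happened
ab_count = 0   # times A and B both happened
for _ in range(n):
    d1, d2 = random.randint(1, 6), random.randint(1, 6)
    if d1 == 6:
        b_count += 1
        if d1 + d2 >= 10:
            ab_count += 1

estimate = ab_count / b_count
exact = 3 / 6  # given a 6, the second die must show 4, 5 or 6
print(round(estimate, 3), exact)
```

Conditioning on B just means throwing away every trial where B didn’t happen and looking at the survivors.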

Good luck! Hit me up in the comments section if you’ve got any questions; this was a super quick run-through, so it’s a summary at best.

Machine Learning: Beware Enthusiasts Bearing Algorithms

Machine learning is not the emperor with no clothes. It’s a serious, important discipline that has a lot to offer many industries. I’m not anti-machine learning. What I think is that machine learning is a discipline with a lot of hype surrounding it at the moment. Eventually this hype will die away and what will be left are the serious practitioners developing useful, robust analyses with real implications. In the meantime, those working with data scientists or with data science would do well to beware enthusiasts bearing gifts.

There are a lot of parallels between the enthusiasm for machine learning right now and the enthusiasm for Bayesian methods about ten years ago. Then, as now, there were a large number of enthusiasts, a moderate number of people producing serious, useful analysis, and a misguided belief in some quarters that Bayesian methods were the solution to just about everything. Sound familiar?

Then as now, Bayesian methods weren’t the solution to everything, but they offered great solutions to many problems. Machine learning is the same.

If you’re not a data scientist or not familiar with machine learning methods, beware the enthusiast who believes machine learning solves just about everything. It’s one tool in a whole suite of options. A good data scientist understands it; a great data scientist uses the whole toolbox.

If your enthusiast can’t tell you what’s in the black box or how their algorithm works, then be cautious and keep asking questions. Sometimes the initial confusion arises because the data scientist and the businessperson are actually speaking two different languages. Try not to be put off by that; often your friendly nerd is doing an internal parallel translation between geek speak and regular language. It doesn’t mean they don’t know what they’re doing. When the statistician and the machine learning expert have to check in with each other regularly about terminology, this is definitely a “thing”!

Keep asking questions, keep listening to the answers: you’ll get a pretty good idea if this technique is being used by someone who knows how it works under the hood.

Update: Kids Are Going to Code

I’m a big believer that kids should be given access to learning about code and programming. I don’t believe there is any particular right, wrong or best method or language for kids: I think by the time they go pro, we’ll have a whole new suite of languages anyway. The most important thing is to teach them that they can solve interesting, relevant problems, and to give them that skill set generally.

It turns out there are a lot of kids and parents who feel strongly about learning too, but in rural Australia where we’re situated, resources are thin on the ground. So Rex Analytics is about to enter a whole new world of education and training: one with very small people in it. That’s right, I’m running the local kids’ code club this year!

Our plan is to use the resources provided by Code Club Australia. We’ll start out with Scratch, move on to some basic CSS/HTML and then go on to Python, partly because the kids around here like snakes.

Code Club is designed for kids ages 9-12, but in our tiny town, we need to be able to provide activities for younger kids to ensure that all siblings can attend. So we’ll be flying by the seat of our pants with the younger group, ages 5-8, and really just letting them guide us. The plan is lots of iPad-based learning, which I’ve talked about before.

The school and the P&C are big supporters, and I’ve got another parent lined up to help out (thanks Dave!); hopefully more will be interested as time goes on.

Wish me luck. I’ve always had a healthy respect for my kids’ teachers, but I think that’s about to increase exponentially…

Statistical model selection with “Big Data”: Doornik & Hendry’s New Paper

The claim that causation has been ‘knocked off its pedestal’ is fine if we are making predictions in a stable environment but not if the world is changing … or if we ourselves hope to change it. – Harford, 2014

Ten or fifteen years ago, big data sounded like the best thing ever in econometrics. When you spend your undergraduate career learning that (almost!) everything in classical statistics can be solved with more data, it sounds great. But big data comes with its own issues. No free lunch, too good to be true, and your mileage really does vary.

In a big data set, statistical power isn’t the issue: you have power enough for just about everything. But that comes with problems of its own. The probability of a Type I error may be very high. In this context, that’s the possibility of falsely concluding that a parameter estimate is significant when in fact it is not. Spurious relationships are likely, and working out which relationships are genuine and which are spurious is difficult. Model selection in the big data context is complex!
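To make that concrete, here’s a small simulation of my own (not from the paper): 200 candidate predictors of pure noise, 100 observations, and a naive 5% significance test on each correlation still “finds” a handful of relationships that aren’t there.

```python
import math
import random

random.seed(1)
n_obs, n_vars = 100, 200   # "fat" data: more candidate variables than observations
y = [random.gauss(0, 1) for _ in range(n_obs)]  # outcome is pure noise

def corr(a, b):
    """Pearson correlation of two equal-length lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    num = sum((x - ma) * (z - mb) for x, z in zip(a, b))
    den = math.sqrt(sum((x - ma) ** 2 for x in a) * sum((z - mb) ** 2 for z in b))
    return num / den

# Under the null, |r| > 1.96 / sqrt(n) is "significant" at roughly the 5% level.
cutoff = 1.96 / math.sqrt(n_obs)
false_hits = sum(
    1 for _ in range(n_vars)
    if abs(corr([random.gauss(0, 1) for _ in range(n_obs)], y)) > cutoff
)
print(false_hits)  # expect somewhere near 5% of 200, purely by chance
```

Every one of those “hits” is spurious by construction; the data contain no real relationships at all.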

David Hendry is one of the powerhouses of modern econometrics, and the fact that he is weighing in on the big data model selection problem is really exciting. This week he published a paper with Jurgen Doornik in Cogent Economics & Finance. You can find it here.

Doornik and Hendry propose a methodology for big data model selection. Big data comes in many varieties, and in this paper they consider only cross-sectional and time series data with a “fat” structure: that is, more variables than observations. Their results generalise to other structures, but not always. Doornik and Hendry describe four key issues for big data model selection in their paper:

  • Spurious relationships
  • Mistaking correlations for causes
  • Ignoring sampling bias
  • Overestimating significance of results.

So, what are Doornik and Hendry’s suggestions for model selection in a big data context? Their approach has several pillars to the overall concept:

  1. They calculate the probabilities of false positives in advance. It has long been possible in statistics to set the significance level so as to control multiple simultaneous tests: this approach is taken both in ANOVA, to control the overall significance level when testing multiple interactions, and in some panel data methods, when testing multiple cross-sections individually. The Bonferroni inequality is the simplest of this family of techniques, though Doornik and Hendry suggest a far more sophisticated approach.
  2. Test “causation” by evaluating super exogeneity. In many economic problems especially, a randomised controlled trial is infeasible. Super exogeneity adds a layer of sophistication to the correlation/causation spectrum, of which Granger causality was an early addition.
  3. Deal with hidden dependence in cross-sectional data. This is not always easy to manage: unlike time series dependence, cross-sectional dependence usually has no natural or obvious ordering, but controlling for it is critical.
  4. Correct for selection biases. Often, big data arrives not from a careful sampling design but on a “whoever turned up to the website” basis. Recognising, controlling for and correcting this is critical to good model selection.
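As a toy illustration of point 1 (and only a toy: Doornik and Hendry’s approach is far more sophisticated than this), here’s a sketch of a Bonferroni-style adjustment, tightening the per-test significance level from α to α/m so that the family-wise false positive rate stays controlled:

```python
import math
import random
from statistics import NormalDist

random.seed(2)
n_obs, n_vars, alpha = 100, 200, 0.05
y = [random.gauss(0, 1) for _ in range(n_obs)]  # outcome is pure noise

def corr(a, b):
    """Pearson correlation of two equal-length lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    num = sum((x - ma) * (z - mb) for x, z in zip(a, b))
    den = math.sqrt(sum((x - ma) ** 2 for x in a) * sum((z - mb) ** 2 for z in b))
    return num / den

# Approximate |r| cutoffs: a correlation is "significant" when |r| > z / sqrt(n),
# where z is the two-sided normal quantile at the chosen level.
naive_cut = NormalDist().inv_cdf(1 - alpha / 2) / math.sqrt(n_obs)
bonf_cut = NormalDist().inv_cdf(1 - alpha / (2 * n_vars)) / math.sqrt(n_obs)

rs = [abs(corr([random.gauss(0, 1) for _ in range(n_obs)], y))
      for _ in range(n_vars)]
naive_hits = sum(r > naive_cut for r in rs)   # per-test 5% level
bonf_hits = sum(r > bonf_cut for r in rs)     # family-wise 5% level
print(naive_hits, bonf_hits)
```

With 200 noise variables, the naive rule reliably flags several spurious relationships, while the Bonferroni-adjusted rule almost always flags none.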

Doornik and Hendry advocate the use of Autometrics in the presence of big data, without abandoning statistical rigour. Failing to understand the statistical consequences of our modelling techniques makes poor use of data assets that otherwise have immense value. Doornik and Hendry propose a robust and achievable methodology. Go read their paper!

What if policies have networks like people?

It’s been policy-light this election season, but some policy areas are up for debate. Others are being carefully avoided by mutual agreement, much like at Christmas lunch when we all tacitly agree we aren’t bringing up What Aunty Betty Did Last Year After Twelve Sherries. It’s all too painful, we’ll never come to any kind of agreement and we should just pretend like it’s not important.

However, policy doesn’t happen in a vacuum and I wondered if it was possible that using a social network-type analysis might illustrate something about the policy debate that is occurring during this election season.

To test the theory, I used the transcripts of the campaign launch speeches of Malcolm Turnbull and Bill Shorten. These are interesting documents to examine because they are at one and the same time an affirmation of each party’s policy aspirations for the campaign and a rejection of the other’s. I used a simple social network analysis, similar to that used in the Aeneid study. If you want to try it yourself, you can find the R script here.

Deciding on the topics to examine took some trial and error, but the list was eventually narrowed down to 19 topics that have been the themes of the election year: jobs, growth, housing, childcare, superannuation, health, education, borders, immigration, tax, medicare, climate change, marriage equality, offshore processing, environment, boats, asylum, business and bulk billing. These aren’t the topics the parties necessarily want to talk about, but they are nonetheless being talked about.

It took some manoeuvring to get a network that was readable, but one layout (Kamada-Kawai, for the interested) stood out. I think it describes the policy state quite well, visually speaking.
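The original analysis was done in R (the script is linked above). For the curious, here’s a hypothetical Python sketch of the core idea, using made-up snippets in place of the real transcripts: count how often pairs of topics appear in the same sentence to build a weighted co-occurrence edge list, which a graph library can then lay out (for example, with a Kamada-Kawai layout):

```python
from itertools import combinations

# Hypothetical stand-ins for the two campaign launch transcripts.
speeches = {
    "turnbull": "We will deliver jobs and growth. Strong borders protect us. "
                "Tax relief supports business and jobs.",
    "shorten":  "We will protect medicare for everyone. "
                "Education and childcare matter. Climate change needs action.",
}
topics = {"jobs", "growth", "borders", "tax", "business",
          "medicare", "education", "childcare", "climate"}

# Two topics are linked each time they appear in the same sentence;
# the edge weight counts co-occurrences across both speeches.
edges = {}
for text in speeches.values():
    for sentence in text.lower().split("."):
        present = sorted(t for t in topics if t in sentence)
        for pair in combinations(present, 2):
            edges[pair] = edges.get(pair, 0) + 1

print(edges)
```

Feeding `edges` into igraph or networkx and asking for a Kamada-Kawai layout reproduces the kind of picture discussed here; topics that co-occur often end up close together.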

[Figure: topic network visualisation]

We have the inner circle of high disagreement: borders, environment, superannuation, boats and immigration. Then there is a middle circle doing the job of containment: jobs and growth, housing, childcare, education, medicare, business and tax, all standard election fodder.

Then we have the outer arc of topics neither the Labor nor the Liberal party really wants to engage with: offshore processing, asylum (as opposed to immigration, boats and borders), climate change (much more difficult to manage than mere environment), bulk billing (the crux of medicare) and marriage equality (have a plebiscite, have a free parliamentary vote, have something, except responsibility). I found it interesting that the two leaders’ speeches, when visualised, contain one part of a policy debate around immigration: boats and borders. But they conspicuously avoided discussing the unpleasant details: offshore processing.

Much like Aunty Betty and her unfortunate incident with the cold ham, both parties are in tacit agreement to ignore the difficult parts of a policy debate.

Tutorials and Guides: A curated list

This post is a curated list of my favourite tutorials and guides because “that one where Hadley Wickham was talking about cupcakes” isn’t the most effective search term. You can find my list of cheat sheets here. There are a lot of great resources on data science (I’ve included my top picks), so I don’t intend to reinvent the wheel here. This is just a list of my favourites all laid out so I can find them again or point other people in their direction when it comes up in conversation. I’ve also added a number of the “how to” type posts I’ve written on this blog as I often answer an enquiry in that format.

Data Science

Tutorials and videos: General

Puppets teach data science too

  • Render a 3D object in R. I have no idea where I would ever use this information in my practice, but it’s presented BY A PUPPET. Great fun.
  • DIY your data science. Another offering from the puppet circle on the data science venn diagram.



Work Flow

  • Guide to modern statistical workflow. Really great organisation of background material.
  • Tidy data, tidy models. Honestly, if I could have had one thing around ten years ago, this would have been it. The amount of time and accuracy to be saved using this method is phenomenal.
  • Extracting data from the web. You found the data, now what to do? Look here.

Linear Algebra



Machine learning

Data visualisation

Natural Language Processing

I’ll continue to update this list as I find things I think are useful or interesting.

Edit: actually, “that one where Hadley Wickham was talking about cupcakes” is surprisingly accurate as a search term.


Cheat Sheets: The New Programmer’s Friend

Cheat sheets are brilliant, whether you’re learning to program for the first time or you’re picking up a new language. Most data scientists are probably programming regularly in multiple languages at any given time: cheat sheets are a handy reference guide that saves you from googling how to “do that thing, you know, I did it in Python yesterday, but how does it go in Stata?”

This post is an ongoing curation of cheat sheets in the languages I use. In other words, it’s a cheat sheet for cheat sheets. Because a blog post is more efficient than googling “that cheatsheet, with the orange bit and the boxes.” You can find my list of the tutorials and how-to guides I enjoyed here.

R cheat sheets + tutorials

Python cheat sheets

Stata cheat sheets

  • There is a whole list of them here, organised by category.
  • Stata cheat sheet: I could have used this five years ago. Also very useful when it’s been a while since you last played in the Stata sandpit.
  • This isn’t a cheat sheet, but it’s an exhaustive list of commands that makes it easy to find what you want to do, as long as you already have a good idea of what that is.

SPSS cheat sheets

  • “For Dummies” has one for SPSS too.
  • This isn’t so much a cheat sheet as a very basic click-by-click guide to trying out SPSS for the first time. If you’re new to this, it’s a good start. Since SPSS is often the gateway program for many people, it’s a useful resource.

General cheat sheets + discussions

  • Comparisons between R, Stata, SPSS, SAS.
  • This post from KD Nuggets has lots of cheat sheets for R, Python, SQL and a bunch of others.

I’ll add to this list as I find things.