Correlation vs Causation

I find this issue quite simple from a technical point of view, but widely misunderstood. Statistical significance does not imply causation. Correlation suggests there may be a direct or indirect relationship, but does not imply causation. In fact, very few things imply causation. My simple version of the differences is below.

If you want to know why this is far more than a stoush to be had in an academic tea room, check out Tyler Vigen’s collection of spurious correlations. If the age of Miss America can be significantly and strongly correlated with murders by steam, hot vapours and hot objects, then in any practical analysis there is plenty of room for other, less obvious spurious correlations. In a big data context, knowing the difference could be worth millions of dollars.
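To get a feel for how easily this happens, here is a minimal sketch in Python. It is not Tyler Vigen’s method, just an illustration under my own assumptions: 50 completely independent random walks of 12 points each (roughly a decade or so of annual figures), with the seed, sizes and function names all made up for the example. It checks every pair and reports the strongest correlation it can find.

```python
import random
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def random_walk(rng, steps):
    """A series that drifts by independent Gaussian steps."""
    level, path = 0.0, []
    for _ in range(steps):
        level += rng.gauss(0, 1)
        path.append(level)
    return path


def strongest_spurious_pair(n_series=50, steps=12, seed=1):
    """Generate unrelated series and return the most correlated pair."""
    rng = random.Random(seed)  # illustrative seed, not meaningful
    series = [random_walk(rng, steps) for _ in range(n_series)]
    best = max(
        combinations(range(n_series), 2),
        key=lambda p: abs(pearson(series[p[0]], series[p[1]])),
    )
    return best, pearson(series[best[0]], series[best[1]])


pair, r = strongest_spurious_pair()
print(f"series {pair[0]} and {pair[1]}: r = {r:+.3f}")
```

None of these series has anything to do with any other, yet with 1,225 pairs to choose from, the best match usually looks very convincing. Go looking through enough variables and you will find a strong correlation somewhere.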

Occasionally, people opine that causation vs correlation doesn’t matter (especially in a big data, and sometimes a machine learning, context). I’d argue this is completely the wrong view to take: having all the power that matters doesn’t mean we should ignore these issues just because a randomised controlled trial is impractical in a lot of ways. It just means deciding when, how and why you’re going to set them aside, in full knowledge of what you’re doing. Spurious correlations are common, hard to detect and difficult to deal with. It’s a bear hunt worth setting out on.
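One way an “indirect relationship” shows up in observational data is a hidden common cause. The sketch below is an illustration under my own assumptions (simulated data, invented variable names, an arbitrary seed): a variable z drives both x and y, while x and y have no effect on each other. The raw correlation between x and y is strong, but it largely disappears once z is accounted for by correlating the residuals of each on z.

```python
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def residuals(v, z):
    """What is left of v after a least-squares straight-line fit on z."""
    n = len(v)
    mean_v = sum(v) / n
    mean_z = sum(z) / n
    slope = sum((a - mean_z) * (b - mean_v) for a, b in zip(z, v)) / sum(
        (a - mean_z) ** 2 for a in z
    )
    return [b - mean_v - slope * (a - mean_z) for a, b in zip(z, v)]


def simulate(n=2000, seed=7):
    """x and y both depend on z, but not on each other."""
    rng = random.Random(seed)  # illustrative seed, not meaningful
    z = [rng.gauss(0, 1) for _ in range(n)]  # hidden common cause
    x = [zi + rng.gauss(0, 0.5) for zi in z]
    y = [zi + rng.gauss(0, 0.5) for zi in z]
    return x, y, z


x, y, z = simulate()
raw = pearson(x, y)
partial = pearson(residuals(x, z), residuals(y, z))
print(f"raw correlation of x and y:     {raw:+.3f}")
print(f"after accounting for z:         {partial:+.3f}")
```

In real data you rarely get handed z, which is exactly why this is hard: the check only works for the confounders you thought to measure.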

[Image: Causation vs correlation]


Tutorials and Guides: A curated list

This post is a curated list of my favourite tutorials and guides because “that one where Hadley Wickham was talking about cupcakes” isn’t the most effective search term. You can find my list of cheat sheets here. There are a lot of great resources on data science (I’ve included my top picks), so I don’t intend to reinvent the wheel here. This is just a list of my favourites all laid out so I can find them again or point other people in their direction when it comes up in conversation. I’ve also added a number of the “how to” type posts I’ve written on this blog as I often answer an enquiry in that format.

Data Science

Tutorials and videos: General

Puppets teach data science too

  • Render a 3D object in R. I have no idea where I would ever use this information in my practice, but it’s presented BY A PUPPET. Great fun.
  • DIY your data science. Another offering from the puppet circle on the data science Venn diagram.



Work Flow

  • Guide to modern statistical workflow. Really great organisation of background material.
  • Tidy data, tidy models. Honestly, if I could wish for one thing to have been around 10 years ago, this would be it. The time saved and accuracy gained using this method are phenomenal.
  • Extracting data from the web. You found the data, now what to do? Look here.

Linear Algebra



Machine learning

Data visualisation

Natural Language Processing

I’ll continue to update this list as I find things I think are useful or interesting.

Edit: actually, “that one where Hadley Wickham was talking about cupcakes” is surprisingly accurate as a search term.
