Describing simple statistics

I’m a huge believer in the usefulness of learning by doing. That makes me a huge believer in Shiny, which lets me create and deploy simple apps so that students can do just that.

This latest app is a simple one that allows you to manipulate either the mean or the variance of a normal distribution and see how that changes the shape of the distribution.
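The app itself is written in Shiny, but the quantity it visualises is just the normal density. As a minimal sketch (in Python, not the app’s actual R code), here is that density, showing how the mean shifts the curve while the variance flattens and widens it:

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# Changing the mean shifts the whole curve without changing its shape:
peak_at_0 = normal_pdf(0, mu=0)
peak_at_2 = normal_pdf(2, mu=2)   # same height, centred at 2 instead of 0

# Increasing the variance flattens and widens the curve:
tall = normal_pdf(0, sigma=1)     # ~0.399 at the peak
flat = normal_pdf(0, sigma=2)     # ~0.199 at the peak
```

Sliding the app’s mean control corresponds to changing `mu`; sliding the variance control corresponds to changing `sigma`.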

If you want to try out making Shiny apps, but need a place to start, check out Oliver Keyes’ excellent start-up guide.

Application view 1

Application view 2

Exploring Correlation and the Simple Linear Regression Model

I’ve been wanting to learn Shiny for quite some time, since it seems to me that it’s a fantastic tool for communicating data science concepts. So I created a very simple app which allows you to manipulate a data generation process from weak through to strong correlation and then interprets the associated regression slope coefficient for you.

Here it is!

The reason I made it is that whilst we often teach simple linear regression and correlation as two intermeshed ideas, students at this level rarely have the opportunity to manipulate the concepts to see how they interact. This is easily fixable with a simple app in Shiny. If you want to start working in Shiny, then I highly recommend Oliver Keyes’ excellent start-up guide, which was extremely easy to follow for this project.
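For readers who want the idea without the app: a minimal sketch (in Python, not the app’s R code) of the same kind of data generation process. The noise level controls the strength of the correlation, while the fitted regression slope stays centred on the true coefficient either way:

```python
import math
import random

def simulate(n=500, beta=2.0, noise_sd=1.0, seed=42):
    """Draw y = beta * x + noise; return the sample correlation and OLS slope."""
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]
    y = [beta * xi + rng.gauss(0, noise_sd) for xi in x]
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    r = sxy / math.sqrt(sxx * syy)   # Pearson correlation
    slope = sxy / sxx                # least-squares slope estimate
    return r, slope

# Low noise: strong correlation. High noise: weak correlation.
# In both cases the slope estimate sits near the true beta of 2.
r_strong, slope_strong = simulate(noise_sd=0.2)
r_weak, slope_weak = simulate(noise_sd=5.0)
```

This is exactly the manipulation the app exposes: turning the noise dial changes `r` dramatically while the slope keeps estimating the same underlying coefficient.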

app view

Correlation vs Causation

Correlation vs causation. I find this is an issue that is quite simple, from a technical point of view, but widely misunderstood. Statistical significance does not imply causation. Correlation implies there may be a direct or indirect relationship, but does not imply causation. In fact, very few things imply causation. My simple version of the differences is below.

If you want to know why this is far more than a stoush to be had in an academic tea room, check out Tyler Vigen’s collection. If the age of Miss America can be significantly and strongly correlated with murders by steam, hot vapours and hot objects, then in any practical analysis there are many options for other less obvious spurious correlations. In a big data context, knowing the difference could be millions of dollars.

Occasionally, people opine that causation vs correlation doesn’t matter (especially in a big data and sometimes a machine learning context). I’d argue this is completely the wrong view to take: having all the statistical power you could want doesn’t mean you can ignore these issues just because a randomised controlled trial is impractical. It means deciding when, how and why you’re going to rely on observational relationships, in full knowledge of what you’re doing. Spurious correlations are common, hard to detect and difficult to deal with. It’s a bear hunt worth setting out on.
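One way to see how easily spurious correlations arise: two completely independent random walks will, on average, look far more “correlated” than two independent noise series. A small simulation sketch (my own illustration, not one of Tyler Vigen’s examples):

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def random_walk(n, rng):
    """Cumulative sum of Gaussian steps: a trending, non-stationary series."""
    level, out = 0.0, []
    for _ in range(n):
        level += rng.gauss(0, 1)
        out.append(level)
    return out

def white_noise(n, rng):
    """Independent Gaussian draws: no trend at all."""
    return [rng.gauss(0, 1) for _ in range(n)]

def mean_abs_corr(gen, reps=200, n=100, seed=1):
    """Average |correlation| between pairs of independently generated series."""
    rng = random.Random(seed)
    return sum(abs(pearson(gen(n, rng), gen(n, rng))) for _ in range(reps)) / reps
```

With these settings, pairs of independent random walks typically average an absolute correlation several times higher than pairs of independent noise, despite sharing no causal link whatsoever. Trends alone are enough to manufacture “relationships”.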

Causation vs correlation


Democracy Sausage Redux

One last time. I wanted to see if there was any interesting election day behaviour by following the hashtag for democracy sausage. As it turns out, there was: a peak of early-morning democratic enthusiasm, with a bunch of sleepless auspol and sausage tragics posting furiously. It tapered off dramatically during the day as we were forced to contend with the reality of democracy.

For a change, I also calculated a basic sentiment score for each tweet and tracked that too. There was a large degree of variability on 30/06, but posting was very low that day. A late afternoon disappointment dip as people realised that we’d all packed up the BBQs and gone home before they got there was also evident. Julia Silge’s post on the subject was extremely helpful.
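The actual analysis followed Julia Silge’s R approach; purely as an illustration of the idea, here is a toy sketch of lexicon-based sentiment scoring with per-hour averaging. The lexicon below is tiny and made up for this example (real analyses use curated lexicons such as AFINN, which score thousands of words):

```python
from collections import defaultdict

# Tiny made-up lexicon, for illustration only.
LEXICON = {"great": 2, "love": 2, "good": 1, "sad": -1, "bad": -1, "terrible": -2}

def sentiment(text):
    """Sum the lexicon scores of a tweet's words (unknown words score 0)."""
    return sum(LEXICON.get(w, 0) for w in text.lower().split())

def hourly_sentiment(tweets):
    """tweets: list of (hour, text) pairs. Returns the mean score per hour."""
    totals, counts = defaultdict(float), defaultdict(int)
    for hour, text in tweets:
        totals[hour] += sentiment(text)
        counts[hour] += 1
    return {h: totals[h] / counts[h] for h in totals}
```

Tracking the per-hour mean alongside the per-hour tweet count is what lets you spot things like that late afternoon disappointment dip.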

I’m teaching again this week and to start students off they’re doing basic charts in Excel. So here’s mine!

Line graph showing frequency and sentiment of hashtag


Yes, you can: learn data science

Douglas Adams had it right in Dirk Gently’s Holistic Detective Agency. Discussing the mathematical complexity of the natural world, he writes:

… the mind is capable of understanding these matters in all their complexity and in all their simplicity. A ball flying through the air is responding to the force and direction with which it was thrown, the action of gravity, the friction of the air which it must expend its energy on overcoming, the turbulence of the air around its surface, and the rate and direction of the ball’s spin. And yet, someone who might have difficulty consciously trying to work out what 3 x 4 x 5 comes to would have no trouble in doing differential calculus and a whole host of related calculations so astoundingly fast that they can actually catch a flying ball.

If you can catch a ball, you are performing complex calculus instinctively. All we are doing in formal mathematics and data science is putting symbols and a syntax around the same processes you use to catch that ball.

Maybe you’ve spent a lot of your life believing you “can’t” or are “not good at” mathematics, statistics or whatever bugbear of the computational arts is getting to you. These are ideas we internalise at a very early age and often carry through our lives.

The good news is yes you can. If you can catch that ball (occasionally at least!) then there is a way for you to learn data science and all the things that go with it. It’s just a matter of finding the one that works for you.

Yes you can.

Statistical model selection with “Big Data”: Doornik & Hendry’s New Paper

The claim that causation has been ‘knocked off its pedestal’ is fine if we are making predictions in a stable environment but not if the world is changing … or if we ourselves hope to change it. – Harford, 2014

Ten or fifteen years ago, big data sounded like the best thing ever in econometrics. When you spend your undergraduate career learning that (almost!) everything can be solved in classical statistics with more data, it sounds great. But big data comes with its own issues. No free lunch, too good to be true and your mileage really does vary.

In a big data set, statistical power isn’t the issue: you have power enough for just about everything. But that comes with problems of its own. The probability of a Type I error may be very high. In this context, that’s the possibility of falsely interpreting a parameter estimate as significant when in fact it is not. Spurious relationships are likely to exist, and working out which relationships are real and which are spurious is difficult. Model selection in the big data context is complex!
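To make the false-positive problem concrete, here’s a small simulation sketch (my own illustration, not from the paper discussed below). We screen many predictors that have nothing whatsoever to do with the outcome, and roughly 5% of them still look “significant” at the conventional level:

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def count_false_positives(n_obs=200, n_predictors=1000, seed=7):
    """Screen predictors that are pure noise against a pure-noise outcome.
    Under the null, |r| exceeds roughly 1.96 / sqrt(n) about 5% of the time,
    so on the order of 50 of 1000 irrelevant predictors look 'significant'."""
    rng = random.Random(seed)
    y = [rng.gauss(0, 1) for _ in range(n_obs)]
    threshold = 1.96 / math.sqrt(n_obs)
    return sum(
        1
        for _ in range(n_predictors)
        if abs(pearson([rng.gauss(0, 1) for _ in range(n_obs)], y)) > threshold
    )
```

With thousands of candidate variables, dozens of entirely spurious “findings” are the expected outcome, not a rarity. That is the selection problem in miniature.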

David Hendry is one of the powerhouses of modern econometrics and the fact that he is weighing into the big data model selection problem is a really exciting proposition. This week a paper co-authored with Jurgen Doornik was published in Cogent Economics & Finance. You can find it here.

Doornik and Hendry propose a methodology for big data model selection. Big data comes in many varieties and in this paper, they consider only cross-sectional and time series data of “fat” structure: that is, more variables than observations. Their results generalise to other structures, but not always. Doornik and Hendry describe four key issues for big data model selection in their paper:

  • Spurious relationships
  • Mistaking correlations for causes
  • Ignoring sampling bias
  • Overestimating significance of results.

So, what are Doornik and Hendry’s suggestions for model selection in a big data context? Their approach has several pillars to the overall concept:

  1. They calculate the probabilities of false positives in advance. It’s long been possible in statistics to set the significance level to control multiple simultaneous tests. This is an approach taken in both ANOVA testing for controlling the overall significance level when testing multiple interactions and in some panel data approaches when testing multiple cross-sections individually. The Bonferroni inequality is the simplest of this family of techniques, though Doornik and Hendry are suggesting a far more sophisticated approach.
  2. Test “causation” by evaluating super exogeneity. In many economic problems especially, a randomised controlled trial is infeasible. Super exogeneity adds a layer of sophistication to the correlation/causation spectrum, of which Granger causality was an early addition.
  3. Deal with hidden dependence in cross-section data. Not always an easy prospect to manage, cross-sectional dependence usually has no natural or obvious ordering as in time series dependence: but controlling for this is critical.
  4. Correct for selection biases. Often, big data arrives not out of a careful sampling design, but on a “whoever turned up to the website” basis. Recognising, controlling and correcting for this is critical to good model selection.
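As a concrete sketch of the simplest member of the family mentioned in point 1, here is the Bonferroni correction (illustrative only; Doornik and Hendry’s actual procedure is considerably more sophisticated):

```python
def bonferroni_reject(p_values, overall_alpha=0.05):
    """Bonferroni: test each of k hypotheses at alpha / k, so that the chance
    of *any* false positive across the whole family stays below alpha."""
    cutoff = overall_alpha / len(p_values)
    return [p <= cutoff for p in p_values]

# Three simultaneous tests: each must now clear 0.05 / 3 ~ 0.0167,
# so a p-value of 0.04 that looks fine in isolation no longer survives.
decisions = bonferroni_reject([0.001, 0.010, 0.040])
```

The cost of this simplicity is conservatism: with thousands of big-data tests the per-test cutoff becomes tiny, which is part of why more refined approaches are needed.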

Doornik and Hendry advocate the use of Autometrics in the presence of big data, without abandoning statistical rigour. Failing to understand the statistical consequences of our modelling techniques makes poor use of data assets that otherwise have immense value. Doornik and Hendry propose a robust and achievable methodology. Go read their paper!

Teaching kids to code

Kids coding is a topical issue, particularly given the future of employment. The jobs our children will do are different from the ones our parents did and from our own. Programming skills are one of the few things the experts agree are important.

There are lots of great online resources already in place to help children learn the computer skills they will need in the future. You can start early, you can make it fun and it doesn’t have to cost you a fortune.

Let me be clear: this isn’t a parenting blog. I do have kids. I do program. I do have a kid that wants to learn to program (mostly I think because he thinks I’ll give him a free pass on other human-necessary skills such as creativity, interpersonal relationships and trying on sports day).

My personal parenting philosophy (if anyone cares) is that kids learn very well when you give them interesting tools to explore the world with. That might include programming, but for some kids it won’t. That’s OK. It doesn’t mean they’re never going to get a job: it just means they may prefer to climb trees because they’re kids. There’s a lot of learning to be had up a tree.

But part of providing interesting resources with which to explore the world is knowing where to find them. Here’s a rundown of some resources, broken down by age group. Yes, kids can start as early as preschool!

Preschool Age (4 +)

The best resources for kids this age are fun interactive apps. If it’s not fun, they won’t engage and frankly nobody wants to stand over a small child making them do something when they could be learning autonomously through undirected play. Here are my favourites:

  • Lightbot. This is a fun interactive app available on Android and Apple that teaches kids the basics of programming using icons rather than language-based code. It comes in both junior coding (4-8 years) and programming puzzles (9+) and my kids have had the apps for six months and enjoyed them.
  • Cargo-bot was recommended to me by a fellow programming-parent and I love the interface and the puzzles. My friends have had the app for a few months and young I. enjoys it a lot.
  • Flow isn’t a coding app. It’s an app that encourages visual motor planning development. Anyone who’s done any coding at all will know that visual motor planning is a critical skill for programming. First this, then that. If I put this here, then that needs to go there. Flow is a great game that helps kids develop this kind of planning. And that’s helpful not only for programming, but for everything else too.

School Age Kids (9 +)

Once kids are comfortable reading and manipulating English as a language, they can move on to a language-based program. There are a few different ones available, some specifically designed for kids, like Tynker and Scratch. For the kid I have in this age bracket (taking into account his interests and temperament) I’m just going to go straight to Python or R for him. As with everything parenting: your mileage may vary and that’s OK.

Some resources for learning python with kids include:

  • This great post from Geekwire. Really simple ideas to engage with your kid.
  • Python Tutorials for kids 13+ is a companion site to the For Dummies book Python for kids I’ve mentioned previously. We got the book from the library a month or so back and I’m thinking of shelling out the $$ to buy it and keep it here permanently.
  • The Invent with Python blog has some great discussion of the issue generally.

R doesn’t seem to have as many kid-friendly resources, but the turtle graphics package looks like it might be worth a try.

General Resources for Teaching Kids to Code

Advocates for programming have been beating this drum for a long time. I came across a number of useful posts while writing this one, so here they are for your reference:

Good luck and enjoy coding with your kid. And if your kid doesn’t want to learn code, enjoy climbing that tree instead!

Using Natural Language Processing for Survey Analysis

Surveys have a specific set of analysis tools that are used for analysing the quantitative part of the data you collect (Stata is my particular poison of choice in this context). However, often the interesting parts of the survey are the unscripted, “tell us what you really think” comments.

Certainly this has been true in my own experience. I once worked on a survey deployed to teachers in Laos regarding resources for schools and teachers. All our quantitative information came back and was analysed, but one comment (translated for me into English by a brilliant colleague) stood out. It read something to the effect of “this is very nice, but the hole in the floor of the second storey is my biggest concern as a teacher”. It’s not something that would ever have been included outright in the survey, but a simple sentence told us a lot about the resources this school had access to.

Careful attention to detailed comments in small surveys is possible. But if you have thousands upon thousands of responses, this is far more difficult. Enter natural language processing.

There are a number of tools which can be useful in this context. This is a short overview of some that I think are particularly useful.

  • Word Clouds. These are easy to prepare and very simple, but can be a powerful way to communicate information. Like all data visualisation, there are the good and the bad. This is an example of a very simple word cloud, while this post by Fells Stats illustrates some more sophisticated methods of using the tool.

One possibility to extend on the simple “bag of words” concept is to divide your sample by groups and compare clouds. Or create your own specific dictionary of words and concepts you’re interested in and only cloud those.

Remember that stemming the corpus is critical. For example, “work”, “worked”, “working”, “works” all belong to the same stem. They should be treated as one or else they are likely to swamp other themes if they are particularly common.

Note that no word cloud should be constructed without removing “stop words” like “the”, “and”, “a”, “I” etc. Dictionaries vary: they can (and should) be tailored to the problem at hand.
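A toy sketch of stop-word removal plus stemming (illustrative only: real analyses use a proper stemmer such as Porter’s and a curated, problem-specific stop-word list, and the crude suffix stripping below will mangle plenty of real words):

```python
# Toy stop-word list and suffix set, purely for illustration.
STOP_WORDS = {"the", "and", "a", "i", "of", "to", "is"}
SUFFIXES = ("ing", "ed", "s")       # crude suffix stripping stands in for stemming

def stem(word):
    """Strip the first matching suffix, keeping at least a 3-letter stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def clean_tokens(text):
    """Lowercase, drop stop words, then stem what remains."""
    return [stem(w) for w in text.lower().split() if w not in STOP_WORDS]

# "worked", "working" and "works" all collapse to the single stem "work".
tokens = clean_tokens("The teacher worked and working works")
```

This is the preprocessing that stops a common word family from swamping the cloud while its variants pretend to be separate themes.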

  • Network Analysis. If you have a series of topics you want to visualise relationships for, you could try a network-type analysis similar to this. The concept may be particularly useful if you manually decide topics of interest and then examine relationships between them. In this case, the outcome is very much user-dependent/chosen, but may be useful as a visualisation.
  • Word Frequencies. Alone, simple tables of word frequencies are not always particularly useful. In a corpus of documents pertaining to education, noting that “learning” is a common term isn’t something of particular note. However, how do these frequencies change by group? Do teachers speak more about “reading” than principals? Do people in one geographical area or salary bracket have a particular set of high frequency words compared to another? This is a basic exercise in feature/variable engineering. In this case, the usual data analysis tool kit applies (see here, here and here). Remember you don’t need to stop at high frequency words: what about high frequency phrases?
  • TF-IDF (term frequency-inverse document frequency) matrix. This may provide useful information and is a basis of many more complex analyses. The TF-IDF downweights terms appearing in all documents/comments (“the”, “i”, “and” etc.) while upweighting rare words that may be of interest. See here for an introduction.
  • Are the comments clustered across some lower dimensional space? The k-means algorithm may provide some data-driven guidance there. This would be an example of “unsupervised machine learning” vis a vis “this is an algorithm everyone has been using for 25 years but we need to call it something cool”. This may not generate anything obvious at first, but who is in those clusters and why are they there?
  • Sentiment analysis will be useful, possibly both applied to the entire corpus and to subsets. For example, among those who discussed “work life balance” (and derivative terms) is the sentiment positive or negative? Is this consistent across all work/salary brackets? Are truck drivers more upbeat than bus drivers? Again, basic feature/variable engineering applies here. If you’re interested in this area, you could do a lot worse than learning from Julia Silge who writes interesting and informative tutorials in R on the subject.
  • Latent Dirichlet Allocation (LDA) and more complex topic analyses. Finally, Latent Dirichlet Allocation or other more complex topic analyses may be able to generate topics directly from the corpus: I think this would take a great deal of time for a new user and may have limited outcomes, particularly if an early analysis suggests you already have a clear idea of which topics are worth investigating. It is, however, particularly useful when dealing with enormous corpora. This is a really basic rundown of the concept. This is a little more complex, but has useful information.
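To make the TF-IDF idea above concrete, here is a minimal sketch built straight from the definition (illustrative only: libraries such as scikit-learn provide production implementations, and conventions for smoothing the idf term vary):

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document.
    A term in every document gets idf = log(1) = 0, i.e. fully downweighted."""
    n = len(docs)
    df = Counter()                      # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n / df[term])
            for term, count in tf.items()
        })
    return weights

docs = [["the", "hole", "floor"], ["the", "teacher"], ["the", "resources"]]
weights = tf_idf(docs)
# "the" appears in every document, so its weight is 0 everywhere,
# while rarer terms like "hole" keep a positive weight.
```

This is exactly the downweighting/upweighting behaviour described in the bullet: ubiquitous terms vanish, distinctive ones surface.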

So that’s a brief rundown of some basic techniques you could try: there are plenty more out there, and this is just the start. Enjoy!

Open datasets for analysis

So you’re a new data scientist and you’re exploring everything the internet has to offer (a lot). But having explored, you’re ready to try something on your own. Here is a (short) list of data sources you can tackle:

I’ll keep adding to the list as I come across interesting things.

Late Night Democracy Sausage Surge

It’s hard-hitting electoral coverage over here at Rex. Democracy sausage is apparently more of a late night event in the lead-up to the election. Late night tweeting was driving the hashtag up until the close of 1 July. By the end of the day Twitter had changed the #ausvotes emoji to a sausage sandwich. My personal prediction is another overnight lull and then a daytime surge on 02/07, petering out by 4pm on the day.

Time series graph of #democracysausage


And just for fun, who was the top Twitter advocate for the hashtag over the last three days? A user (bot?) called SausageSizzles. Some serious tweeting going on there. A steady focus on message and brand.

Bar chart

Meanwhile, as I write Antony Green on the ABC is teaching the country about sample size and variance of estimators at the early stage of counting.
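What that lesson boils down to, in one formula: the standard error of an estimated vote share shrinks with the square root of the number of ballots counted, which is why early-count swings mean very little. A small sketch:

```python
import math

def vote_share_se(p, n):
    """Standard error of an estimated vote share p after n ballots are counted."""
    return math.sqrt(p * (1 - p) / n)

# A 50/50 race: early-count estimates swing far more than late-count ones.
early = vote_share_se(0.5, 100)     # 0.05  -> +/- 10 point swings are unremarkable
late = vote_share_se(0.5, 10000)    # 0.005 -> the estimate has largely settled
```

A hundredfold increase in ballots counted only tightens the estimate tenfold, which is exactly the square-root behaviour driving those early evening caveats.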

As with yesterday’s post, check out this discussion on R Bloggers, which provides a good amount of the code for doing this analysis.