Yes, you can: learn data science

Douglas Adams had it right in Dirk Gently’s Holistic Detective Agency. Discussing the mathematical complexity of the natural world, he writes:

… the mind is capable of understanding these matters in all their complexity and in all their simplicity. A ball flying through the air is responding to the force and direction with which it was thrown, the action of gravity, the friction of the air which it must expend its energy on overcoming, the turbulence of the air around its surface, and the rate and direction of the ball’s spin. And yet, someone who might have difficulty consciously trying to work out what 3 x 4 x 5 comes to would have no trouble in doing differential calculus and a whole host of related calculations so astoundingly fast that they can actually catch a flying ball.

If you can catch a ball, you are performing complex calculus instinctively. All we are doing in formal mathematics and data science is putting symbols and a syntax around the same processes you use to catch that ball.

Maybe you’ve spent a lot of your life believing you “can’t” or are “not good at” mathematics, statistics or whatever bugbear of the computational arts is getting to you. These are beliefs we begin to internalise at a very early age and often carry through our lives.

The good news is: yes, you can. If you can catch that ball (occasionally at least!) then there is a way for you to learn data science and all the things that go with it. It’s just a matter of finding the way that works for you.

Yes you can.

Statistical model selection with “Big Data”: Doornik & Hendry’s New Paper

The claim that causation has been ‘knocked off its pedestal’ is fine if we are making predictions in a stable environment but not if the world is changing …. or if we ourselves hope to change it. – Harford, 2014

Ten or fifteen years ago, big data sounded like the best thing ever in econometrics. When you spend your undergraduate career learning that (almost!) everything can be solved in classical statistics with more data, it sounds great. But big data comes with its own issues. No free lunch, too good to be true and your mileage really does vary.

In a big data set, statistical power isn’t the issue. You have power enough for just about everything. But that comes with problems of its own. The probability of a Type I error may be very high. In this context, it’s the possibility of concluding that a parameter estimate is significant when in fact it is not. Spurious relationships are likely to exist, and working out which relationships are truly significant and which are spurious is difficult. Model selection in the big data context is complex!
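To see how easily false positives pile up, here is a minimal sketch in Python. It is not from Doornik and Hendry’s paper: the sample sizes, the seed and the use of simple pairwise correlation tests are all assumptions made for illustration. Every candidate variable is pure noise, so any “significant” result is spurious by construction.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed, for reproducibility only

# "Fat" data: far more candidate variables than observations (assumed sizes)
n_obs, n_vars = 100, 1000
y = rng.standard_normal(n_obs)
X = rng.standard_normal((n_obs, n_vars))  # generated independently of y

# Pearson correlation between y and each candidate variable
y_std = (y - y.mean()) / y.std()
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
r = X_std.T @ y_std / n_obs

# t statistic for the null hypothesis of zero correlation
t_stat = r * np.sqrt((n_obs - 2) / (1 - r**2))

# Two-sided 5% critical value for 98 degrees of freedom (approximately 1.984)
critical = 1.984
false_positives = int(np.sum(np.abs(t_stat) > critical))

print(f"{false_positives} of {n_vars} noise variables look 'significant'")
```

At a 5% significance level we should expect roughly 50 of the 1000 noise variables to pass the test, none of which means anything.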

David Hendry is one of the powerhouses of modern econometrics and the fact that he is wading into the big data model selection problem is really exciting. This week his paper with Jurgen Doornik was published in Cogent Economics & Finance. You can find it here.

Doornik and Hendry propose a methodology for big data model selection. Big data comes in many varieties and in this paper, they consider only cross-sectional and time series data of “fat” structure: that is, more variables than observations. Their results generalise to other structures, but not always. Doornik and Hendry describe four key issues for big data model selection in their paper:

  • Spurious relationships
  • Mistaking correlations for causes
  • Ignoring sampling bias
  • Overestimating significance of results.

So, what are Doornik and Hendry’s suggestions for model selection in a big data context? Their approach rests on four pillars:

  1. They calculate the probabilities of false positives in advance. It’s long been possible in statistics to set the significance level to control multiple simultaneous tests. This is an approach taken in both ANOVA testing for controlling the overall significance level when testing multiple interactions and in some panel data approaches when testing multiple cross-sections individually. The Bonferroni inequality is the simplest of this family of techniques, though Doornik and Hendry are suggesting a far more sophisticated approach.
  2. Test “causation” by evaluating super exogeneity. In many economic problems especially, running a randomised controlled trial is infeasible. Super exogeneity is an added layer of sophistication on the correlation/causation spectrum, of which Granger causation was an early addition.
  3. Deal with hidden dependence in cross-section data. Not always an easy prospect to manage, cross-sectional dependence usually has no natural or obvious ordering as in time series dependence: but controlling for this is critical.
  4. Correct for selection biases. Often, big data arrives not out of a careful sampling design, but on a “whoever turned up to the website” basis. Recognising, controlling and correcting for this is critical to good model selection.
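The first point mentions the Bonferroni inequality as the simplest way to control the overall significance level. A minimal sketch of that simple correction follows. To be clear, this is plain Bonferroni and not Doornik and Hendry’s more sophisticated approach; the number of tests and the independence assumption behind the family-wise calculation are mine.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance level that keeps the overall level at or below alpha."""
    return alpha / n_tests


def family_wise_error_rate(per_test_alpha: float, n_tests: int) -> float:
    """Chance of at least one false positive, assuming independent tests."""
    return 1 - (1 - per_test_alpha) ** n_tests


alpha, n_tests = 0.05, 1000  # assumed values for illustration

per_test = bonferroni_threshold(alpha, n_tests)
fwer_uncorrected = family_wise_error_rate(alpha, n_tests)
fwer_bonferroni = family_wise_error_rate(per_test, n_tests)

print(f"per-test level after correction: {per_test:.5f}")
print(f"family-wise error rate, uncorrected: {fwer_uncorrected:.4f}")
print(f"family-wise error rate, Bonferroni:  {fwer_bonferroni:.4f}")
```

Without a correction, at least one false positive is all but guaranteed across 1000 tests. With it, the overall rate stays under 5%, at the cost of a very strict per-test threshold.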

Doornik and Hendry advocate the use of Autometrics, an automatic model selection algorithm, in the presence of big data, without abandoning statistical rigour. Failing to understand the statistical consequences of our modelling techniques makes poor use of data assets that otherwise have immense value. Doornik and Hendry propose a robust and achievable methodology. Go read their paper!

Tutorials and Guides: A curated list

This post is a curated list of my favourite tutorials and guides because “that one where Hadley Wickham was talking about cupcakes” isn’t the most effective search term. You can find my list of cheat sheets here. There are a lot of great resources on data science (I’ve included my top picks), so I don’t intend to reinvent the wheel here. This is just a list of my favourites all laid out so I can find them again or point other people in their direction when it comes up in conversation. I’ve also added a number of the “how to” type posts I’ve written on this blog as I often answer an enquiry in that format.

Data Science

Tutorials and videos: General

Puppets teach data science too

  • Render a 3D object in R. I have no idea where I would ever use this information in my practice, but it’s presented BY A PUPPET. Great fun.
  • DIY your data science. Another offering from the puppet circle on the data science venn diagram.

Econometrics

Statistics

Work Flow

  • Guide to modern statistical workflow. Really great organisation of background material.
  • Tidy data, tidy models. Honestly, if there was one thing that had been around 10 years ago, I wish this was it. The amount of time and accuracy to be saved using this method is phenomenal.
  • Extracting data from the web. You found the data, now what to do? Look here.

Linear Algebra

Asymptotics

Bayes

Machine learning

Data visualisation

Natural Language Processing

I’ll continue to update this list as I find things I think are useful or interesting.

Edit: actually, “that one where Hadley Wickham was talking about cupcakes” is surprisingly accurate as a search term.


Cheat Sheets: The New Programmer’s Friend

Cheat sheets are brilliant: whether you’re learning to program for the first time or you’re picking up a new language. Most data scientists are probably programming regularly in multiple languages at any given time: cheat sheets are a handy reference guide that saves you from googling how to “do that thing you know I did it in python yesterday but how does it go in stata?”

This post is an ongoing curation of cheat sheets in the languages I use. In other words, it’s a cheat sheet for cheat sheets. Because a blog post is more efficient than googling “that cheatsheet, with the orange bit and the boxes.” You can find my list of the tutorials and how-to guides I enjoyed here.

R cheat sheets + tutorials

Python cheat sheets

Stata cheat sheets

  • There is a whole list of them here, organised by category.
  • Stata cheat sheet, I could have used this five years ago. Also very useful when it’s been a while since you last played in the Stata sandpit.
  • This isn’t a cheat sheet, but it’s an exhaustive list of commands that makes it easy to find what you want to do, as long as you already have a good idea.

SPSS cheat sheets

  • “For Dummies” has one for SPSS too.
  • This isn’t so much a cheat sheet as a very basic click-by-click guide to trying out SPSS for the first time. If you’re new to this, it’s a good start. Since SPSS is the gateway program for many people, it’s a useful resource.

General cheat sheets + discussions

  • Comparisons between R, Stata, SPSS, SAS.
  • This post from KDnuggets has lots of cheat sheets for R, Python, SQL and a bunch of others.

I’ll add to this list as I find things.

Law of Large Numbers vs the Central Limit Theorem: in GIFs

I’ve spoken about these two fundamentals of asymptotics previously here and here. But sometimes, you need a .gif to really drive the point home. I feel this is one of those times.

Firstly, I simulated a population of 100 000 observations from a uniform distribution. This population looks nothing like a normal distribution, as you can see below.

histogram of uniform distribution

Next, I took 500 samples from the data with varying sample sizes. I used n=5, 10, 20, 50, 100 and 500. I calculated the sample mean (x-bar) and the z score for each and I plotted their kernel densities using ggplot in R.

Here’s a .gif of what happens to the z score as the sample size increases: we can see that the distribution is pretty normal looking, even when the sample size is quite low. Notice that the distribution is centred on zero.

z score gif

Here’s a .gif of what happens to the sample mean as n increases: we can see that the distribution collapses on the population mean (in this case µ=0.5).

sample mean gif

For scale, here is a .gif of both densities sitting on the same set of axes as n gets large: their behaviour is quite different.

Sample mean vs z score

If you want to try this yourself, the script is here. Feel free to play around with different distributions and sample sizes and see what turns up.
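For readers who prefer Python, here is a rough sketch of the same experiment. It is not the original R script: the uniform(0, 1) population, sampling with replacement and the seed are assumptions, and it prints summary numbers rather than drawing densities.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed

# Population of 100 000 draws from an assumed uniform(0, 1) distribution
population = rng.uniform(0.0, 1.0, size=100_000)
mu, sigma = population.mean(), population.std()

n_samples = 500
summary = {}
for n in (5, 10, 20, 50, 100, 500):
    # 500 samples of size n, drawn with replacement from the population
    samples = rng.choice(population, size=(n_samples, n), replace=True)
    x_bar = samples.mean(axis=1)
    z = np.sqrt(n) * (x_bar - mu) / sigma  # z score of each sample mean
    summary[n] = (x_bar.std(), z.std())
    print(f"n={n:>3}  sd of sample mean={x_bar.std():.4f}  sd of z score={z.std():.4f}")
```

The spread of the sample mean shrinks towards zero as n grows, while the spread of the z score stays close to one: the same contrast the .gifs show.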

The Law of Large Numbers: It’s Not the Central Limit Theorem

I’ve spoken about asymptotics before. It’s the lego of the modelling world, in my view. Interesting, hard and you can lose years of your life looking for just the right piece that fits into the model you’re trying to build.

The Law of Large Numbers (LLN) is another simple theorem that’s widely misunderstood. Most often it’s conflated with the central limit theorem (CLT), which deals with the studentised sample mean or z-score. The LLN pertains to the sample mean itself.

Like the CLT, the LLN is actually a collection of theorems, strong and weak. I’ll confine myself to the simplest version here, Khinchine’s weak law of large numbers. It states that for a random sample of n independent and identically distributed observations from any distribution with a finite mean (µ), the sample mean has a probability limit equal to the population mean, µ. (A finite variance makes the result easier to prove, but Khinchine’s version does not need it.) That is, the sample mean is a consistent estimator of the population mean under these conditions.
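In symbols, using the standard textbook notation (the subscript n on the sample mean and the tolerance ε are notation introduced here, not taken from the post), the weak law says:

```latex
\operatorname*{plim}_{n \to \infty} \bar{x}_n = \mu
\quad\Longleftrightarrow\quad
\lim_{n \to \infty} P\left( \left| \bar{x}_n - \mu \right| > \varepsilon \right) = 0
\quad \text{for every } \varepsilon > 0 .
```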

Put simply, as n gets very big, the sample mean gets arbitrarily close to the population mean.

Notice there is nothing about normal distributions as n gets large. That’s the key difference between the LLN and the CLT. One deals with the sample mean alone, the other with the studentised version. On its own, the distribution of the sample mean collapses onto a single point as n gets large: µ. This is the implication of the LLN.

Appropriately scaled, centred and at the correct rate, the studentised sample mean has a normal distribution in the limit as n gets large: that’s the CLT.

As usual, here’s an infographic to go with it. Put side by side, the two theorems have different results but are dealing with something quite similar.

CLT vs LLN infographic

Continuous, Censored and Truncated Data: what are the differences and do you need to care?

Whenever I work with someone whose statistical or econometric experience has been more practical than theoretical, two things happen. The first is that the poor person inexplicably develops a twitch whenever I launch into an enthusiastic tangent that requires a sheet of graph paper and extensive hand waving.

The other thing that inevitably happens is that the digression comes to an end and the question is asked “but does that matter in practice?”

When it comes to model selection, the difference between data types really does matter. You may make choices one way or another, but understanding the differences (both obvious and subtle) lets you make those choices knowing that you do have them.

This post is a cliff-notes version of the issue. Maybe you’ve heard of these differences in data types and just need a memory jog. Maybe you’ve not heard of them at all and want somewhere simple to start.

Continuous data is pretty simple: it’s data that can take any value on the real line, from very large negative numbers to very large positive numbers. Data drawn from a normal distribution is an example of continuous data.

Truncated data, on the other hand, is data which is continuous but has the added complication of only being observed above or below a certain point. The classic example suggested by Greene is income [1]. One example would be if we only surveyed the income of those earning above the tax-free threshold: then we would have truncated data.

Censored data is similar, but the issue is not which observations make it into the sample: it is how their values are recorded. Some parts of the distribution are obscured, but not ignored. The survey may, for example, interview people at all income levels, but record a dollar figure only for those above the tax-free threshold and describe the rest as “under the tax threshold”. In this case all parts of the distribution are reported on, but the level of information differs above and below the threshold.
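A small simulation makes the distinction concrete. This is only a sketch: the lognormal income distribution, its parameters and the threshold value are hypothetical numbers chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed

threshold = 18_200.0  # hypothetical tax-free threshold
income = rng.lognormal(mean=10.5, sigma=0.8, size=10_000)  # hypothetical incomes

# Truncated: people at or below the threshold never enter the data set
truncated = income[income > threshold]

# Censored: everyone is in the data set, but incomes at or below the
# threshold are recorded only as "at the threshold"
censored = np.maximum(income, threshold)
share_censored = float(np.mean(income <= threshold))

print(f"full sample:  n={income.size}, mean={income.mean():,.0f}")
print(f"truncated:    n={truncated.size}, mean={truncated.mean():,.0f}")
print(f"censored:     n={censored.size}, mean={censored.mean():,.0f}")
print(f"share of observations at the censoring point: {share_censored:.1%}")
```

The truncated sample is smaller than the full one, while the censored sample keeps every observation but piles a chunk of them up at the threshold. Both push the naive mean above the true one.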

Most people are aware of the issues with modelling categorical data using techniques designed for continuous data. However, censored and truncated data also need special treatment. A lot of the data we deal with has a natural truncation point: distance isn’t negative, prices are not (well, hardly ever) negative. Recognising that you may be dealing with truncated or censored data is an important part of initial data analysis. For a thorough discussion, see W. H. Greene’s chapter on the subject here.

In practice, continuous data methodologies may work quite well for these types of data as long as there isn’t a large amount of data sitting at or near the truncation or censoring point (which is often zero).

Test scores are something I’ve worked with a lot. In my experience, once the proportion of zeros among the test scores approaches around 20%, I need to switch over to models designed for the issue. In the 10%-20% range I will often try a few different models to see which is most appropriate. That’s just a general rule of thumb: your mileage may vary.
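To get a feel for why the share of observations at the censoring point matters, here is a simulation sketch. Everything in it is an assumption made for illustration (normal regressor and noise, a true slope of 1, censoring at zero, the three intercepts); it is not the test score data and it uses plain least squares rather than a tobit model.

```python
import numpy as np

rng = np.random.default_rng(2)  # arbitrary seed


def ols_slope(x: np.ndarray, y: np.ndarray) -> float:
    """Slope from a simple least squares regression of y on x."""
    x_centred = x - x.mean()
    return float(x_centred @ (y - y.mean()) / (x_centred @ x_centred))


n = 20_000
x = rng.standard_normal(n)
noise = rng.standard_normal(n)
true_slope = 1.0

results = {}
for intercept in (3.0, 1.5, 0.5):  # lower intercept means more zeros
    latent = intercept + true_slope * x + noise
    observed = np.maximum(latent, 0.0)  # scores cannot fall below zero
    share_zero = float(np.mean(observed == 0.0))
    results[intercept] = (share_zero, ols_slope(x, observed))
    print(f"share of zeros={share_zero:.1%}  OLS slope={results[intercept][1]:.3f}")
```

With only a few zeros the least squares slope sits close to the true value of 1. As the share of zeros climbs, the estimate is pulled further towards zero, which is when a model built for censoring starts to earn its keep.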

Hand waving and furious graph-paper drawing aside: yes, in this case, knowing the differences does matter in practice.

Notes:

[1] W. H. Greene, Econometric Analysis, is a classic text and here I’m looking at p. 756 in the fifth edition. There are three copies of this book living in my house. Definitely worth the investment if you are looking for either a classic text covering everything econometrics or a useful TV stand. What can I say? We were young and poor and a matched set of texts made up for deficits in our furniture budget. I’ve owned this book for nearly twenty years and I still use it, even long after we could afford furniture.

The Central Limit Theorem: Misunderstood

Asymptotics are the building blocks of many models. They’re basically lego: sturdy, functional and capable of allowing the user to exercise great creativity. They also hurt like hell when you don’t know where they are and you step on them accidentally. I’m pushing it on the last, I’ll admit. But I have gotten very sweary over recalcitrant limiting distributions in the past (though I may be in a small group there).

One of the fundamentals of the asymptotic toolkit is the Central Limit Theorem, or CLT for short. If you didn’t study eight semesters of econometrics or statistics, then it’s something you (might have) sat through a single lecture on and walked away with the hot take “more data is better”.

The CLT is actually a collection of theorems, but the basic entry-level version is the Lindeberg-Lévy CLT. It states that for any sample of n independent and identically distributed observations drawn from any distribution with finite mean (μ) and standard deviation (σ), if we calculate the sample mean x-bar then,

central limit theorem
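The result is shown as an image in the original post. In standard textbook notation, which is an assumption about what the image contains, the theorem reads:

```latex
\sqrt{n}\,\frac{\bar{x} - \mu}{\sigma} \;\xrightarrow{d}\; N(0, 1)
\quad \text{as } n \to \infty .
```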

In my time both in industry and in teaching, I’ve come across a number of interpretations of this result: many of them very wrong from very smart people. I’ve found it useful to clarify what this result does and does not mean, as well as when it matters.

Not all distributions become normal as n gets large. In fact, most things don’t “tend to normality” as n gets large. Often, they just get really big or really small. Some distributions are asymptotically equivalent to normality, but most “things” (estimators and distributions alike) are not.

The sample mean by itself does not become normal as n gets large. What would happen if you added up a huge series of numbers? You’d get a big number. What would happen if you divided your big number by your huge number? Go on, whack some experimental numbers into your calculator!

Whatever you put into your calculator, it’s not a “normal distribution” you get when you’re done. The sample mean alone does not tend to a normal distribution as n gets large.

The studentised sample mean has a distribution which is normal in the limit. There are some adjustments we need to make before the sample mean has a stable limiting distribution – this is the quantity often known as the z-score. It’s this quantity that tends to normality as n gets large.

How large does n need to be? This theorem works for any distribution with a finite mean and standard deviation, i.e. as long as x comes from a distribution with these features. Generally, statistics texts quote the figure of n=30 as a “rule of thumb”. This works reasonably well for simple estimators and models like the sample mean in a lot of situations.

This isn’t to say, however, that if you have “big data” your problems are gone. You just got a whole different set, I’m sorry. That’s a different post, though.

So that’s a brief run down on the simplest of central limit theorems: it’s not a complex or difficult concept, but it is a subtle one. It’s the building block upon which models such as regression, logistic regression and their known properties have been based.

The infographic below is the same information, but for some reason my students find information in that format easier to digest. When it comes to asymptotic theory, I am disinclined to argue with them: I just try to communicate in whatever way works. On that note, if this post was too complex or boring, here is the CLT presented with bunnies and dragons.** What’s not to love?

CLT infographic

** I can’t help myself: the distribution of average bunny weights gets narrower as the sample size gets larger because the sample mean is tending towards the true population mean. For a discussion of this behaviour vs the CLT see here.

It’s my only criticism of what was otherwise a delightful video. Said video being in every way superior to my own version, done late one night for a class with my dog assisting and my kid’s drawing book. No bunnies or dragons, but it’s here.

Modelling Early Grade Education in Papua New Guinea

For several years, I worked for the World Bank analysing early grade education outcomes in a number of countries in the East Asia and Pacific region, including Laos, Tonga and Papua New Guinea. Recently, our earlier work in Papua New Guinea was published for the first time.

One of the more challenging things I did was model a difficult set of survey outcomes: reading amongst young children. You can see the reports here. Two of the most interesting relationships we observed were the importance of language for young children learning to read (Papua New Guinea has over 850 of them so this matters) and the role that both household and school environments play in literacy development.

At some point I will write a post about the choice between standard ordinary least squares regressions used in the field and the tobit models I (generally) prefer for this data. Understanding the theoretical difference between censored, truncated and continuous data isn’t the most difficult thing in the world, but understanding the practical difference between them can have a big impact on modelling outcomes.

Elasticity and Marginal Effects: Two Key Concepts

One of the critical parts of building a great model is using your understanding of the problem and context. Choosing an appropriate model type and deciding on appropriate features/variables to explore based on this information is critical.

The two key concepts of elasticity and marginal effects are fundamental to an economic understanding of model building. This is something that can be overlooked by practitioners not coming from that background. Neither concept is difficult or particularly abstruse.
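As a quick numerical sketch of the two ideas: the marginal effect is the change in y for a one-unit change in x (the derivative dy/dx), and the elasticity is the percentage change in y for a one percent change in x, which works out to (dy/dx) × (x/y). The two example models and their coefficients below are made up for illustration.

```python
import math


def marginal_effect(f, x: float, h: float = 1e-6) -> float:
    """Central-difference approximation to dy/dx at the point x."""
    return (f(x + h) - f(x - h)) / (2 * h)


def elasticity(f, x: float, h: float = 1e-6) -> float:
    """Point elasticity at x: (dy/dx) * (x / y)."""
    return marginal_effect(f, x, h) * x / f(x)


def linear(x: float) -> float:
    """Hypothetical linear model: y = 2 + 0.5x."""
    return 2.0 + 0.5 * x


def log_log(x: float) -> float:
    """Hypothetical log-log model: ln(y) = 1 + 0.8 ln(x)."""
    return math.exp(1.0 + 0.8 * math.log(x))


for x in (1.0, 10.0):
    print(
        f"x={x:>4}: linear ME={marginal_effect(linear, x):.3f}, "
        f"linear elasticity={elasticity(linear, x):.3f}, "
        f"log-log ME={marginal_effect(log_log, x):.3f}, "
        f"log-log elasticity={elasticity(log_log, x):.3f}"
    )
```

In the linear model the marginal effect is constant and the elasticity changes with x. In the log-log model it is the other way around: the elasticity is constant (equal to the coefficient on ln x) and the marginal effect varies.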

This infographic came about because I had a group of talented economics students at the master’s level who, by and large, had no econometric background. In a crowded course, I don’t have much time to expand on my favourite things. This was my take on explaining the concepts quickly and simply.

Elasticity infographic

For those very new to the concept, this explanation here is simple. Alternatively, if you’re interested in non-constant marginal effects and ways they can be used, check out this discussion.