Machine Learning is Basically the Reversing Camera on Your Car

I’ve been spending a bit of time on machine learning lately, and when it comes to classification or regression, here’s my conclusion: it’s basically the reversing camera on your car.

Let me elaborate: machine learning, like a reversing camera, is awesome. Both let you do things you already could do, but faster and more often. Both give you insights into the world around you that you might not have had without them. However, both can give a narrower view of the world than some other techniques (in this case, a broader set of statistical/econometric methodologies, and/or your mirrors and checking your blind spots).

As long as everything around you remains perfectly still and doesn’t change, the reversing camera will let you get into a tight parking spot backwards and give you some insights into where the gutter and other objects are that you didn’t have before. Machine learning does great prediction when the inputs are not changing.

But if you have to go a long way in reverse (like reversing down your driveway; mine is 400 m long), or things are moving around you (other cars, pet geese, STUPID big black dogs that think running under your wheels is a great idea; he’s bloody fine, the stupid mutt), then the reversing camera alone is not all the information you need.

In the same way, if you need to explain relationships (because your space is changing and prediction alone is not enough), then it’s very useful to expand your machine learning toolbox with statistical/econometric techniques like hypothesis testing, information criteria and solid model-building methodologies, as opposed to relying solely on lasso or ridge methods. Likewise, causality and endogeneity matter a lot.

So, in summary: machine learning and reversing cameras are awesome, but they aren’t the whole picture in many cases. Make your decision about what works best in your situation: don’t just default to what you’re used to.

(Also, I’m not convinced this metaphor extends in the forwards direction. Data analysis? You only reverse maybe 5% of the time you’re driving, but you’re driving forward the rest of the time, and data analysis is 95% of my workflow. Yours?)

The Seven Stages of Being a Data Scientist

Becoming a data scientist is a fraught process as you frantically try to mark off all the bits on the ridiculous Venn diagram that will allow you to enter the high priesthood of data and be a “real” data scientist. In that vein, I offer you the seven stages on the road to becoming a “real” data scientist.

You should take these stages just as seriously as you take the Venn diagrams (the best and most accurate is here).

(1) You find out that this data and code at the same time thing makes your brain hurt.

(2) OK, you’re getting it now! [insert popular methodology du jour] is the most amazing thing ever! It’s so cool! You want to learn all about it!

(3) Why the hell won’t your matrix invert? You need to know how to code in how many damn languages?

(4) While spending three increasingly frustrated hours looking for a comma, bracket or other infinitesimal piece of code in the wrong place, realise most of your wardrobe is now some variation on jeans and a local t-shirt, or whatever your local equivalent is. Realise you’ve crossed some sort of psychological divide. Wonder what the meaning of life is and remember it’s 42. Try to remember the last time you ate something that wasn’t instant coffee straight off the spoon. Ponder the pretty blinking cursor for a bit. Find your damn comma and return from the hell of debugging. Repeat stage (4) many times. (Pro tip: print statements are your friend.)

(5) Revise position on (2) to “it does a good enough job in the right place.”

(6) Revise position on (5) to “… that’s what the client wants and I need to be a better negotiator to talk them out of it because it’s wrong for this project.” All of a sudden your communication skills matter more than your code or your stats geek stuff.

(7) By this stage, you don’t really care what language or which method someone uses as long as they can get the job done right and explain it to the client so they understand it. The data and code at the same time thing still makes your brain hurt, though.

Things I’m glad got beaten into me in grad school

There are a few things that were (painstakingly and with great patience) inserted into my skull during grad school by my Ph.D. supervisor. A great supervisor is the best thing that can happen to you during a Ph.D. So, in no particular order, here are the things I’m glad he taught me (as of tonight, the list changes regularly):

  • You might think it’s all about the numbers, but you need to know how to write if you want anyone to care about the numbers.
  • Do it PROPERLY. No hacks, no bodge fixes. It will save you time and the occasional preventable heart attack in the long run.
  • It doesn’t really matter what programming language you use, but learn to code and learn to document that code thoroughly.
  • Even if you’re going into applied work, learn the theory: the hard stuff especially. Once you know the theory, you know you have options. You don’t have to default to what you’re familiar with; you have the skills to go and explore the unfamiliar.
  • Likewise, even if you’re going into theoretical work, learn how good applied work happens. Don’t be cavalier about it: in many cases, the applied work is the purpose of the theory. It doesn’t exist in a vacuum.
  • Reverse parking. Yes, he taught me to reverse park too.

Thanks for everything Andy, it was the best x

Decision vs Default: Abdicating to the Algorithm

Decision vs default is something I’ve been thinking a lot about lately in data science practice. It’s easy to stick with what we know and do well, rather than push our own boundaries. That’s pretty typical for life in general, but in data science it means defaulting to our norms instead of making a decision, and that leads to less-than-optimal outcomes.

One way this can manifest is to default to the models and methods we know best. If you’re a machine learning aficionado, then you tend to use machine learning tools to solve your problems. Likewise, if you’re an econometrician by training, you may default to explicit model build and testing regimes. Playing to our strengths isn’t a bad thing.

When it comes to model construction, both methods have their good points. But the best outcome is when you make the decision to use one or the other in the knowledge that you have options, not because you defaulted to the familiar.

Explicit model build and testing is a useful methodology if explanation of your model matters: if your stakeholder needs to know why, not just what. These models are built with a priori assumptions about causation, relationships and functional forms. They require a reasonable level of domain knowledge, and the model may be built out of many iterations of testing and experimenting. Typically, I use the Campos (2006) general-to-specific method, but only after extensive data analysis that informs my views on interactions, polynomial terms and other transformed inputs. In this regime, predictive power comes from a combination of domain knowledge, statistical methodology and a canny understanding of your data.
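
To make that concrete, here’s a toy sketch of that kind of build-and-test cycle (not the full general-to-specific treatment, and the data are simulated purely for illustration): start general, look at the hypothesis tests, impose restrictions and check them with an F-test and information criteria.

```r
# A toy build-and-test cycle in base R. The data are simulated purely for
# illustration; in practice `dat` would be your own data and domain knowledge
# would drive which terms go into the general model.
set.seed(42)
n   <- 500
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 - 0.5 * dat$x2 + rnorm(n)   # x3 plays no real role

full <- lm(y ~ x1 + x2 + x3 + x1:x2, data = dat)    # start general
summary(full)                                        # inspect the t-tests

reduced <- update(full, . ~ . - x1:x2 - x3)          # impose the restrictions
anova(reduced, full)                                 # F-test of the restrictions
AIC(full, reduced)                                   # do the information criteria agree?
```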

Machine learning methodologies, on the other hand, are useful if you need lots of models, in real time, and you want them more for prediction than for explanation. Methodologies that use techniques like lasso or ridge regression let the algorithm guide feature selection to a greater degree than the explicit model build methods do. Domain knowledge still matters: interactions, decisions about polynomial inputs and so on still have to be explicitly constructed in many cases, and causation may not be obvious.
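
And for contrast, a minimal sketch of letting the algorithm do the shrinking, assuming the glmnet package (again with simulated data just for illustration):

```r
# Lasso via glmnet (assumed installed): the penalty shrinks most coefficients
# to exactly zero, so the algorithm does much of the feature selection.
library(glmnet)

set.seed(42)
n <- 500
p <- 20
X <- matrix(rnorm(n * p), n, p)
y <- 1 + 2 * X[, 1] - 0.5 * X[, 2] + rnorm(n)        # only two real signals

cv  <- cv.glmnet(X, y, alpha = 1)                    # alpha = 1 is lasso; alpha = 0 is ridge
fit <- glmnet(X, y, alpha = 1, lambda = cv$lambda.1se)
coef(fit)                                            # sparse: most entries are zero
```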

Neither machine learning nor statistical modelling is better in all scenarios. Either may perform substantially better in some, depending on what your performance metric is. But make your decision knowing you have options. Don’t default and abdicate to an algorithm.

Does it matter in practice? Normal vs t distribution

One of the perennial discussions is normal vs t distributions: which do you use, when, why and so on. This is one of those cases where for most sample sizes in a business analytics/data science context it probably makes very little practical difference. Since that’s such a rare thing for me to say, I thought it was worth explaining.

Now I’m all for statistical rigour: you should use the right one at the right time for the right purpose, in my view. However, this can be one of those cases where if the sample size is large enough, it’s just not that big a deal.
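
One quick way to see why: compare the two-sided 5% critical values. The t critical value collapses onto the normal one as the degrees of freedom grow.

```r
# Two-sided 5% critical values: the t value approaches the normal value of 1.96
# as the degrees of freedom grow.
qnorm(0.975)                                         # 1.960
sapply(c(5, 30, 100, 1000), function(d) qt(0.975, df = d))
# roughly 2.571, 2.042, 1.984, 1.962
```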

The actual simulations I ran are very simple: just 10,000 draws each from the normal and t distributions, with the t varying across different degrees of freedom. Then I plotted the density for each on the same graph using ggplot in R. If you’d like to have a play around with the code, leave a comment to let me know and I’ll post it to GitHub.
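
A sketch of that kind of simulation (not the original code) looks something like this:

```r
# 10,000 draws from a standard normal and from t-distributions at a few
# degrees of freedom, with the densities overlaid using ggplot2.
library(ggplot2)

set.seed(2016)
n     <- 10000
dfs   <- c(3, 10, 30)
draws <- rbind(
  data.frame(value = rnorm(n), dist = "normal"),
  do.call(rbind, lapply(dfs, function(d)
    data.frame(value = rt(n, df = d), dist = paste0("t, df = ", d))))
)

ggplot(draws, aes(x = value, colour = dist)) +
  geom_density() +
  xlim(-5, 5)
```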

Describing simple statistics

I’m a huge believer in the usefulness of learning by doing. That makes me a huge believer in Shiny, which lets me create and deploy simple apps so that students can do just that.

This latest app is a simple one that allows you to manipulate either the mean or the variance of a normal distribution and see how that changes the shape of the curve.
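
If you’re curious what the guts of an app like this look like, here’s a bare-bones sketch of the same idea (not the deployed app itself):

```r
# A bare-bones sketch: sliders for the mean and variance of a normal
# distribution, with the density curve redrawn whenever a slider moves.
library(shiny)

ui <- fluidPage(
  sliderInput("mu",     "Mean",     min = -5,  max = 5,  value = 0, step = 0.5),
  sliderInput("sigma2", "Variance", min = 0.1, max = 10, value = 1, step = 0.1),
  plotOutput("densityPlot")
)

server <- function(input, output) {
  output$densityPlot <- renderPlot({
    x <- seq(-10, 10, length.out = 500)
    plot(x, dnorm(x, mean = input$mu, sd = sqrt(input$sigma2)),
         type = "l", xlab = "x", ylab = "density")
  })
}

shinyApp(ui, server)
```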

If you want to try out making Shiny apps, but need a place to start, check out Oliver Keyes’ excellent start up guide.

[Screenshots: application views 1 and 2]

Exploring Correlation and the Simple Linear Regression Model

I’ve been wanting to learn Shiny for quite some time, since it seems to me that it’s a fantastic tool for communicating data science concepts. So I created a very simple app which allows you to manipulate a data generation process from weak through to strong correlation and then interprets the associated regression slope coefficient for you.

Here it is!

I made it because, whilst we often teach simple linear regression and correlation as two intermeshed ideas, students at this level rarely have the opportunity to manipulate the concepts to see how they interact. This is easily fixable with a simple app in Shiny. If you want to start working in Shiny, then I highly recommend Oliver Keyes’ excellent start up guide, which was extremely easy to follow for this project.
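
Stripped right back, the idea behind the app (a sketch, not the app’s actual code) is just this: generate data with a chosen correlation and watch how the fitted slope tracks it.

```r
# With standardised x and y, the population regression slope equals the
# correlation, so the estimated slope strengthens as the correlation does.
set.seed(1)
simulate_slope <- function(rho, n = 200) {
  x <- rnorm(n)
  y <- rho * x + sqrt(1 - rho^2) * rnorm(n)   # cor(x, y) is approximately rho
  unname(coef(lm(y ~ x))["x"])
}

sapply(c(0.1, 0.5, 0.9), simulate_slope)      # slope strengthens with correlation
```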

[Screenshot: app view]

Correlation vs Causation

Correlation vs causation. I find this is an issue that is quite simple, from a technical point of view, but widely misunderstood. Statistical significance does not imply causation. Correlation implies there may be a direct or indirect relationship, but does not imply causation. In fact, very few things imply causation. My simple version of the differences is below.

If you want to know why this is far more than a stoush to be had in an academic tea room, check out Tyler Vigen’s collection. If the age of Miss America can be significantly and strongly correlated with murders by steam, hot vapours and hot objects, then in any practical analysis there are plenty of opportunities for other, less obvious spurious correlations. In a big data context, knowing the difference could be worth millions of dollars.

Occasionally, people opine that causation vs correlation doesn’t matter (especially in a big data and sometimes a machine learning context). I’d argue this is completely the wrong view to take: having all the statistical power in the world doesn’t mean we get to ignore these issues just because a randomised controlled trial is impractical in a lot of cases. It just means deciding when, how and why you’re going to set them aside, in full knowledge of what you’re doing. Spurious correlations are common, hard to detect and difficult to deal with. It’s a bear hunt worth setting out on.
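
If you want to see how easy spurious correlation is to manufacture, two completely independent random walks will usually do the trick:

```r
# Two series with no relationship at all will often correlate strongly and
# produce a "significant" regression: trending data are a classic trap.
set.seed(123)
n <- 100
a <- cumsum(rnorm(n))   # random walk one
b <- cumsum(rnorm(n))   # random walk two, generated completely independently
cor(a, b)               # frequently large in absolute value
summary(lm(a ~ b))      # often highly "significant", and entirely meaningless
```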

[Figure: causation vs correlation, a simple summary of the differences]

Yes, you can: learn data science

Douglas Adams had it right in Dirk Gently’s Holistic Detective Agency. Discussing the mathematical complexity of the natural world, he writes:

… the mind is capable of understanding these matters in all their complexity and in all their simplicity. A ball flying through the air is responding to the force and direction with which it was thrown, the action of gravity, the friction of the air which it must expend its energy on overcoming, the turbulence of the air around its surface, and the rate and direction of the ball’s spin. And yet, someone who might have difficulty consciously trying to work out what 3 x 4 x 5 comes to would have no trouble in doing differential calculus and a whole host of related calculations so astoundingly fast that they can actually catch a flying ball.

If you can catch a ball, you are performing complex calculus instinctively. All we are doing in formal mathematics and data science is putting symbols and a syntax around the same processes you use to catch that ball.

Maybe you’ve spent a lot of your life believing you “can’t” or are “not good at” mathematics, statistics or whatever bugbear of the computational arts is getting to you. These are beliefs we begin to internalise at a very early age and often carry through our lives.

The good news is yes you can. If you can catch that ball (occasionally at least!) then there is a way for you to learn data science and all the things that go with it. It’s just a matter of finding the one that works for you.

Yes you can.

Tracking Democracy Sausage

It’s a fine tradition here in Australia: every few years, communities manfully attempt to make up funding gaps through the selling and eating of #democracysausage to the captive audience of compulsory voters.

For fun, I decided to see if we could track interest in the hashtag on Twitter over time. I’ve exported the frequencies out to Excel for this graph-making exercise, because I’ll be teaching a stats class entirely in Excel in a few weeks and this will make for some fun discussion.

[Figure: democracy sausage line graph, showing tweet frequency over time]

As we can see, as of last night (two more sleeps until #democracysausage day), interest on Twitter was increasing. I’ll bring you a democracy sausage update tomorrow.

Technical notes: the API I’m using would only pull a maximum of 350 tweets featuring the hashtag on any given day, so I suspect we may be missing some interest in sausages. I’ll look into other ways of doing this.

One very useful resource formed the bulk of the programming required: this blog post on R-bloggers takes you through the basics of doing the same for any hashtag you may be interested in exploring.
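
For flavour, the skeleton of that approach looks roughly like this, assuming the twitteR package and a Twitter app you’ve already registered (the credentials below are placeholders, and the exact code in that post differs):

```r
# A rough skeleton: pull recent #democracysausage tweets with the twitteR
# package and count them by day. The credentials are placeholders.
library(twitteR)

setup_twitter_oauth("CONSUMER_KEY", "CONSUMER_SECRET",
                    "ACCESS_TOKEN", "ACCESS_SECRET")

tweets <- searchTwitter("#democracysausage", n = 350)  # the API caps what comes back
df     <- twListToDF(tweets)
daily  <- table(as.Date(df$created))                   # tweets per day

plot(as.Date(names(daily)), as.numeric(daily), type = "l",
     xlab = "date", ylab = "tweets")
```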

Happy democracy sausage day!