The Seven Stages of Being a Data Scientist

Becoming a data scientist is a fraught process as you frantically try to mark off all the bits on the ridiculous Venn diagram that will allow you to enter the high priesthood of data and be a “real” data scientist. In that vein, I offer you the seven stages on the road to becoming a “real” data scientist.

Like the Venn diagrams (the best and most accurate is here), you should take these stages just as seriously.

(1) You find out that this data and code at the same time thing makes myyourbrain hurt.

(2) OK, you’re getting it now! [insert popular methodology du jour] is the most amazing thing ever! It’s so cool! You want to learn all about it!

(3) Why the hell won’t your matrix invert? You need to know how to code how many damn languages?

(4) While spending three increasingly frustrated hours looking for a comma, bracket or other infinitesimal piece of code in the wrong place, realise most of your wardrobe is now some variation on jeans and a local t-shirt, or whatever your local equivalent is. Realise you’ve crossed some sort of psychological divide. Wonder what the meaning of life is and remember it’s 42. Try to remember the last time you ate something that wasn’t instant coffee straight off the spoon. Ponder the pretty blinking cursor for a bit. Find your damn comma and return from the hell of debugging. Repeat stage (4) many times. (Pro tip: print statements are your friend.)

(5) Revise position on (2) to “it does a good enough job in the right place.”

(6) Revise position on (5) to “… that’s what the client wants and I need to be a better negotiator to talk them out of it because it’s wrong for this project.” All of a sudden your communication skills matter more than your code or your stats geek stuff.

(7) By this stage, you don’t really care what language or which method someone uses as long as they can get the job done right and explain it to the client so they understand it. The data and code at the same time thing still makes your brain hurt, though.

Things I’m glad got beaten into me in grad school

There are a few things that were – painstakingly and with great patience- inserted into my skull during grad school by my Ph.D. supervisor. A great supervisor is the best thing that can happen to you during a Ph.D. So, in no particular order, here are the things I’m glad he taught me (as of tonight, the list changes regularly):

  • You might think it’s all about the numbers, but you need to know how to write if you want anyone to care about the numbers.
  • Do it PROPERLY. No hacks, no bodge fixes. It will save you time and the occasional preventable heart attack in the long run.
  • It doesn’t really matter what programming language you use, but learn to code and learn to document that code thoroughly.
  • Even if you’re going into applied work, learn the theory: the hard stuff especially. Once you know the theory, you know you have options. You don’t have to default to what you’re familiar with, you have the skills to go and explore the unfamiliar.
  • Likewise, even if you’re going into theoretic work, learn how good applied work happens. Don’t be cavalier about applied work: in many cases the applied is the purpose for the theoretic. It doesn’t exist in a vacuum.
  • Reverse parking. Yes, he taught me to reverse park too.

Thanks for everything Andy, it was the best x

Decision vs Default: Abdicating to the Algorithm

Decision vs default is something I’ve been thinking a lot about lately in data science practise. It’s easy to stick with what we know and do well, rather than push our own boundaries. That’s pretty typical for life in general, but in data science it comes at the expense of defaulting to our norms instead of making a decision. That leads to less-than-optimal outcomes.

One way this can manifest is to default to the models and methods we know best. If you’re a machine learning aficionado, then you tend to use machine learning tools to solve your problems. Likewise, if you’re an econometrician by training, you may default to explicit model build and testing regimes. Playing to our strengths isn’t a bad thing.

When it comes to model construction, both methods have their good points. But the best outcome is when you make the decision to use one or the other in the knowledge that you have options, not because you defaulted to the familiar.

Explicit model build and testing is a useful methodology if explanation of your model matters. If your stakeholder needs to know why, not just what. These models are built with a priori assumptions about causation, relationships and functional forms. They require a reasonable level of domain knowledge and the model may be built out of many iterations of testing and experimenting. Typically, I use the Campos (2006) general to specific method: but not after extensive data analysis that informs my views on interactions, polynomial and other transformative inputs and so on. In this regime, predictive power comes from a combination of domain knowledge, statistical methodologies and a canny understanding of your data.

Machine learning methodologies on the other hand are useful if you need lots of models, in real time and you want them more for prediction, not explanation. Machine learning methodologies that use techniques like Lasso or Ridge regression let the algorithm guide feature selection to a greater degree than in the explicit model build methods. Domain knowledge still matters: interactions, the decisions regarding polynomial inputs and so on still have to be explicitly constructed in many cases. Causation may not be obvious.

Neither machine learning or statistical modelling is better in all scenarios. Either may perform substantially better in some, depending on what your performance metric is. But make your decision knowing you have options. Don’t default and abdicate to an algorithm.

Does it matter in practice? Normal vs t distribution

One of the perennial discussions is normal vs t distributions: which do you use, when, why and so on. This is one of those cases where for most sample sizes in a business analytics/data science context it probably makes very little practical difference. Since that’s such a rare thing for me to say, I thought it was worth explaining.

Now I’m all for statistical rigour: you should use the right one at the right time for the right purpose, in my view. However, this can be one of those cases where if the sample size is large enough, it’s just not that big a deal.

The actual simulations I ran are very simple, just 10 000 draws from normal and t-distributions with the t varying at different degrees of freedom. Then I just plotted the density for each on the same graph using ggplot in R. If you’d like to have a play around with the code, leave a comment to let me know and I’ll post it to github.

Describing simple statistics

I’m a huge believer in the usefulness of learning by doing. That makes me a huge believer in Shiny, which allows me to create and deploy simple apps that allow students to do just that.

This latest app is a simple one that allows you to manipulate either the mean or the variance of a normal distribution and see how that changes the shape of the distribution.

If you want to try out making Shiny apps, but need a place to start, check out Oliver Keyes’ excellent start up guide.

application view1

application view 2

Exploring Correlation and the Simple Linear Regression Model

I’ve been wanting to learn Shiny for quite some time, since it seems to me that it’s a fantastic tool for communicating data science concepts. So I created a very simple app which allows you to manipulate a data generation process from weak through to strong correlation and then interprets the associated regression slope coefficient for you.

Here it is!

The reason I made it is because whilst we often teach simple linear regression and correlation as two intermeshed ideas, students at this level rarely have the opportunity to manipulate the concepts to see how they interact. This is easily fixable with a simple app in shiny. If you want to start working in Shiny, then I highly recommend Oliver Keyes’ excellent start up guide which was extremely easy to follow for this project.

app view

Correlation vs Causation

Correlation vs causation. I find this is an issue that is quite simple, from a technical point of view, but widely misunderstood. Statistical significance does not imply causation. Correlation implies there may be a direct or indirect relationship, but does not imply causation. In fact, very few things imply causation. My simple version of the differences is below.

If you want to know why this is far more than a stoush to be had in an academic tea room, check out Tyler Vigen’s collection. If the age of Miss America can be significantly and strongly correlated with murders by steam, hot vapours and objects; then in any practical analysis there are many options for other less obvious spurious correlations. In a big data context, knowing the difference could be millions of dollars.

Occasionally, people opine that causation vs correlation doesn’t matter (especially in a big data and sometimes a machine learning context). I’d argue this is completely the wrong view to take: just because you have all the power that matters doesn’t mean we should ignore these issues because a randomised control trial is impractical in a lot of ways. It just means deciding when, how and why you’re going to do so in the knowledge of what you’re doing. Spurious correlations are common, hard to detect and difficult to deal with. It’s a bear hunt worth setting out on.

 Causation vs correlation

 

Yes, you can: learn data science

Douglas Adams had it right in Dirk Gently’s Holistic Detective Agency. Discussing the mathematical complexity of the natural world, he writes:

… the mind is capable of understanding these matters in all their complexity and in all their simplicity. A ball flying through the air is responding to the force and direction with which it was thrown, the action of gravity, the friction of the air which it must expend its energy on overcoming, the turbulence of the air around its surface, and the rate and direction of the ball’s spin. And yet, someone who might have difficulty consciously trying to work out what 3 x 4 x 5 comes to would have no trouble in doing differential calculus and a whole host of related calculations so astoundingly fast that they can actually catch a flying ball.

If you can catch a ball, you are performing complex calculus instinctively. All we are doing in formal mathematics and data science is putting symbols and a syntax around the same processes you use to catch that ball.

Maybe you’ve spent a lot of your life believing you “can’t” or are “not good at” mathematics, statistics or whatever bugbear of the computational arts is getting to you. These are concepts we begin to internalise at a very early age and often carry them through our lives.

The good news is yes you can. If you can catch that ball (occasionally at least!) then there is a way for you to learn data science and all the things that go with it. It’s just a matter of finding the one that works for you.

Yes you can.

Tracking Democracy Sausage

It’s a fine tradition here in Australia where every few years communities manfully attempt to make up funding gaps in the selling and eating of #democracysausage to the captured audience of compulsory voters.

For fun, I decided to see if we could track the interest in the hashtag on twitter over time. I’ve exported the frequencies out to excel for this graph making exercise, because I’ll be teaching a class on stats entirely in excel in a few weeks and this will make for some fun discussion.

Democracy sausage line graph

As we can see, as of last night (2 more sleeps until #democracysausage day), interest on twitter was increasing. I’ll bring you a democracy sausage update tomorrow.

Technical notes: the API I’m using would only pull a maximum of 350 tweets featuring the hashtag on any given day: I suspect we may be missing some interest in sausages. I’ll look into other ways of doing this.

One very useful resource formed the bulk of the programming required: this blog post on R bloggers takes you through the basics required to do the same to any hashtag you may be interested in exploring.

Happy democracy sausage day!

Tutorials and Guides: A curated list

This post is a curated list of my favourite tutorials and guides because “that one where Hadley Wickham was talking about cupcakes” isn’t the most effective search term. You can find my list of cheat sheets here. There are a lot of great resources on data science (I’ve included my top picks), so I don’t intend to reinvent the wheel here. This is just a list of my favourites all laid out so I can find them again or point other people in their direction when it comes up in conversation. I’ve also added a number of the “how to” type posts I’ve written on this blog as I often answer an enquiry in that format.

Data Science

Tutorials and videos: General

Puppets teach data science too

  • Render a 3D object in R. I have no idea where I would ever use this information in my practice, but it’s presented BY A PUPPET. Great fun.
  • DIY your data science. Another offering from the puppet circle on the data science venn diagram.

Econometrics

Statistics

Work Flow

  • Guide to modern statistical workflow. Really great organisation of background material.
  • Tidy data, tidy models. Honestly, if there was one thing that had been around 10 years ago, I wish this was it. The amount of time and accuracy to be saved using this method is phenomenal.
  • Extracting data from the web. You found the data, now what to do? Look here.

Linear Algebra

Asymptotics

Bayes

Machine learning

Data visualisation

Natural Language Processing

I’ll continue to update this list as I find things I think are useful or interesting.

Edit: actually, “that one where Hadley Wickham was talking about cupcakes” is surprisingly accurate as a search term.

Screen Shot 2016-06-23 at 9.05.37 PM