Expertise vs Awareness for the Data Scientist

We’ve all seen them: articles with headlines like “17 things you MUST know to be a data scientist” and “Great data scientists know these 198 algorithms no one else does.” While the content can be a useful read, the titles are clickbait and imposter syndrome is a common outcome.

You can’t be an expert in every skill on the crazy data science Venn diagram. It’s not physically possible, and if you try, you’ll spend all your time attempting to become a “real” data scientist with no time left to actually be one. In any case, most of those diagrams describe an entire industry or a large and diverse team, not the individual.

Data scientists need expertise, but you only need expertise in the areas you’re working with right now. For the rest, you need awareness.

Awareness of the broad church that is data science tells you when you need more knowledge, more skill or more information than you currently have. Awareness of areas outside your expertise means you don’t default to the familiar: you make your decisions based on a broad understanding of what’s possible.

Expertise still matters, but the exact area you’re expert in is less important. Expertise gives you the skills you need to go out and learn new things when and as you need them. Expertise in Python gives you the skills to pick up R or C++ next time you need them. Expertise in econometrics gives you the skills to pick up machine learning. Heck, expertise in languages (human ones, not computer ones) is also a useful skill set for data scientists, in my view.

You need expertise because that gives you the core skills to pick up new things. You need awareness because that will let you know when you need the new things and what they could be. They’re not the same thing: so keep doing what you do well and keep one eye on what other people do well.

Machine Learning vs Econometric Modelling: Which One?

Renee from Becoming a Data Scientist asked Twitter which basic issues were hard to understand in data science. It generated a great thread with lots of interesting perspectives you can find here.

My opinion is that the most difficult concept to understand has nothing to do with the technical aspects of data science.

The choice of when to use machine learning, when to use econometric methods and when it matters is rarely discussed, because the answers are neither simple nor finite.

Firstly, the difference between econometrics/statistics and machine learning is mostly cultural. Many econometric models not commonly seen in machine learning (the tobit and conditional logit are two that come to mind) could easily be estimated using machine learning techniques. Likewise, machine learning mainstays like clustering or decision trees could benefit from an econometric/statistical approach to model building and evaluation. The main differences between the two groups are different values about what makes a model “good” and slightly different (and very complementary) skill sets.

Secondly, I think the differences between a model, an estimation method and an algorithm are not always well understood. Identifying differences helps you understand what your choices are in any given situation. Once you know your choices you can make a decision rather than defaulting to the familiar. See here for details.
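One way to make the model/estimator/algorithm distinction concrete is a minimal sketch (in Python, on synthetic data; the numbers here are made up for illustration): the model is a linear relationship, the estimator is least squares, and two quite different algorithms compute the same estimate.

```python
import numpy as np

rng = np.random.default_rng(42)

# The *model*: y = X @ beta + noise (an assumed linear relationship).
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([2.0, -1.5])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

# The *estimator*: least squares (minimise the sum of squared residuals).
# Algorithm 1: solve the normal equations directly (closed form).
beta_closed = np.linalg.solve(X.T @ X, X.T @ y)

# Algorithm 2: gradient descent on the same objective.
beta_gd = np.zeros(2)
lr = 0.01
for _ in range(5000):
    grad = -2 * X.T @ (y - X @ beta_gd) / n
    beta_gd -= lr * grad

# Same model, same estimator, two algorithms: the answers agree.
print(beta_closed, beta_gd)
```

Knowing which of the three you're actually choosing between is half the battle: swapping the algorithm changes speed and scalability, swapping the estimator or the model changes what your answer means.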


So how do I make decisions about algorithms, estimators and models?

Like data analysis (see here, here and here), I think of modelling as an interrogation of my data, my problem and my brief. If you’re new to modelling, here are some thoughts to get you started.

Why am I here? Who is it for?

It’s a strange one to start off with, but what’s your purpose for sitting down with this data? Where will your work end? Who is it for? All these questions matter.

If you are developing a model that customises a website for a user, then prediction may matter more than explanation. If you need to take your model up to the C-suite then explanation may be paramount.

What’s the life expectancy of your model? This is another question about purpose: are you trying to uncover complex and nuanced relationships that will be consistent in a dynamic and changing space? Or are you just trying to get the right document in front of the right user in a corpus that is static and finite?

Here’s the one question I ask myself for every model: what do I think the causal relationships are here?

What do I need it to do?

The key outcome you need from your model will probably have the most weight on your decisions.

For example, if you need to decide which content to place in front of a user with a certain search query, that may not be a problem you can efficiently solve with classic econometric techniques: the machine learning toolkit may be the best and only viable choice.

On the other hand, if you are trying to decide what the determinants of reading skill among young children in Papua New Guinea are, there may be a number of options on the table. Possibilities might include classic econometric techniques like the tobit model, estimated by maximum likelihood. But what about clustering techniques or decision trees? How do you decide between them?

Next question.

How long do I have?

There are two ways of thinking about this problem: how long does my model have to estimate once deployed? And how long do I have to develop it?


If you have a reasonable length of time, then considering the full suite of statistical solutions and an open-ended analysis will mean a tight, efficient and nuanced model in deployment. If you have until the end of the day, then simpler options may be the only sensible choice. That applies whether you consider yourself to be doing machine learning OR statistics.

Econometrics and machine learning have two different value sets about what makes a good model, but it’s important to remember that this isn’t a case where you have to pick a team and stick with it. Each value set developed in response to solving different problems with a different skill set. There’s plenty of crossover and plenty to learn on each side.

If you have the time, then a thorough interrogation of your data is never a bad idea. Statistics has a lot to offer there. Even if your final product is classic machine learning, statistics/econometrics will help you develop a better model.

This is also a situation where the decision to use techniques like lasso and ridge regression may come into play. If your development time is short, then lasso and/or ridge regularisation may be a reasonable response to very wide data (i.e. data with a lot of variables). However, don’t make the mistake of believing that defaulting to these techniques is always the best or most reasonable option: a general-to-specific methodology is worth considering if you have the time available. The two approaches were developed for two different situations: one size does not fit all.

If you are on a tight deadline (and that does happen, regularly) then be strategic: don’t default to the familiar, make your decision about what is going to offer most value for your project.


Back to our website example: if your model has 15 microseconds to evaluate every time a new user hits the website, then run time becomes the paramount constraint. In a big data context, machine learning models with highly efficient algorithms may be the best option.

If you have a few minutes (or more) then your options are much wider: you can consider whether classic models like multinomial or conditional logit may offer a better outcome for your particular needs than, say, machine learning models like decision trees. Marginal effects and elasticities can be used in both machine learning and econometric contexts. They may offer you two things: a strong way to explain what’s going on to end-users and a structured way to approach your problem.
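To illustrate the marginal effects idea, here's a minimal Python sketch for a binary logit with hypothetical, made-up coefficients: in a logit the effect of a variable on the probability is not constant across observations, and the average marginal effect is one common summary to report to end-users.

```python
import numpy as np

rng = np.random.default_rng(1)

# A hypothetical logit: P(y=1 | x) = 1 / (1 + exp(-(b0 + b1 * x))).
b0, b1 = -0.5, 0.8
x = rng.normal(size=10_000)
p = 1 / (1 + np.exp(-(b0 + b1 * x)))

# The effect of x on the probability varies with where you evaluate it:
# dp/dx = b1 * p * (1 - p).
marginal_effects = b1 * p * (1 - p)

# The average marginal effect (AME) is one common summary:
# "on average, a one-unit increase in x changes the probability by this much."
ame = marginal_effects.mean()
print(round(ame, 3))
```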

It’s not the case that machine learning = fast, econometrics = slow. It’s very dependent on the models, the resultant likelihoods/optimisation requirements and so on. If you’ve read this far, you’ve also probably seen that the space between the two fields is narrow and blurring rapidly.

This is where domain knowledge, solid analysis and testing in the development stage will inform your choices regarding your model for deployment. Detailed econometric models may be too slow for deployment in some contexts, but experimenting with them at the development stage can inform a final, streamlined deployment model.

Is your model static (you present one set of results, once) or dynamic (it generates results multiple times over its lifecycle)? These are also questions you need to consider.

What are the resources I have?

How much data do you have? Do you need all of it? Do you want all of it? How much computing power have you got under your belt? These questions will help you decide what methodologies you need to estimate your model.

I once did a contract where the highly classified data could not be removed from the company’s computers. That’s reasonable! What wasn’t reasonable was the fact that the computer they gave me couldn’t run email and R at the same time. It made the choices we had available rather limited, let me tell you. But that was the constraint we were working under: confidentiality mattered most.

It may not be possible to use the simple closed form ordinary least squares solution for regression if your data is big, wide and streaming continuously. You may not have the computing power you need to estimate highly structured and nuanced econometric models in the time available. In those cases, the models developed for these situations in machine learning are clearly the superior choice (because they come up with answers).
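That said, "big" alone doesn't always rule the closed form out. For data that's tall (many rows, few columns) and streaming, you can accumulate the sufficient statistics X'X and X'y chunk by chunk and still recover the exact least squares solution without ever holding the full dataset in memory. A minimal Python sketch on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(7)
p = 3
beta_true = np.array([1.0, -2.0, 0.5])

# Accumulate X'X and X'y chunk by chunk, so only one chunk of the
# stream is ever held in memory at a time.
XtX = np.zeros((p, p))
Xty = np.zeros(p)
for _ in range(100):                      # 100 chunks of 1,000 rows each
    X = rng.normal(size=(1_000, p))
    y = X @ beta_true + rng.normal(size=1_000)
    XtX += X.T @ X
    Xty += X.T @ y

# Exactly the OLS solution you'd get from the full 100,000-row matrix.
beta_hat = np.linalg.solve(XtX, Xty)
print(beta_hat)
```

It's wide data (the p x p matrix no longer being manageable) rather than sheer row count that really breaks this trick.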

On the other hand, assuming that machine learning is the solution to everything is limiting and naive: you may be missing an opportunity to generate robust insights if you don’t look beyond what’s common in machine learning.

How big is too big for classic econometrics? Like all these questions, the answer is “it depends”. My best advice here is: during your analysis stage, try it and see.

Now go forth and model stuff

This is just a brief, very general rundown of how I think about modelling and how I make my decisions between machine learning and econometrics. One thing I want to make abundantly clear, however, is that this is not a binary choice.

You’re not doing machine learning OR econometrics: you’re modelling.

That means being aware of your options and aware that the differences between them can be extremely subtle (or even non-existent at times). There are times when those differences won’t matter for your purpose, others where they will.

What are you modelling and how are you doing it? It’d be great to get one non-spam comment this week.

Machine Learning is Basically the Reversing Camera on Your Car

I’ve been spending a bit of time on machine learning lately. But when it comes to classification or regression: it’s basically the reversing camera on your car.

Let me elaborate: machine learning, like a reversing camera, is awesome. Both things let you do stuff you already could do, but faster and more often. Both give you insights into the world around you that you may not have had without them. However, both can give a narrower view of the world than some other techniques (in this case, broader statistical/econometric methodologies and/or your mirrors and checking your blind spots).

As long as everything around you remains perfectly still and doesn’t change, the reversing camera will let you get into a tight parking spot backwards and give you some insights into where the gutter and other objects are that you didn’t have before. Machine learning does great prediction when the inputs are not changing.

But if you have to go a long way in reverse (like reversing down your driveway- mine is 400m long), or things are moving around you (other cars, pet geese, STUPID big black dogs that think running under your wheels is a great idea. He’s bloody fine, stupid mutt): then the reversing camera alone is not all the information you need.

In the same way, if you need to explain relationships (because your space is changing and prediction is not enough), then it’s very useful to expand your machine learning toolbox with statistical/econometric techniques like hypothesis testing, information criteria and solid model building methodologies (as opposed to relying solely on lasso or ridge methods). Likewise, causality and endogeneity matter a lot.
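Information criteria are one of the cheaper statistical habits to pick up. A minimal Python sketch on synthetic data (my own toy example), using AIC computed from the residual sum of squares: the fit improvement from a genuinely relevant variable easily beats its complexity penalty, while an irrelevant one usually doesn't.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 + rng.normal(size=n)   # x2 is irrelevant by construction

def aic(X, y):
    """AIC for an OLS fit (Gaussian likelihood, up to a constant)."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    rss = np.sum((y - X @ beta) ** 2)
    return len(y) * np.log(rss / len(y)) + 2 * X.shape[1]

ones = np.ones(n)
aic_null = aic(ones[:, None], y)                      # intercept only
aic_x1 = aic(np.column_stack([ones, x1]), y)          # adds the real variable
aic_both = aic(np.column_stack([ones, x1, x2]), y)    # adds the noise variable

# Including x1 improves fit by far more than its penalty of 2;
# adding the irrelevant x2 usually doesn't.
print(aic_null, aic_x1, aic_both)
```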

So, in summary machine learning and reversing cameras are awesome, but aren’t the whole picture in many cases. Make your decision about what works best in your situation: don’t just default to what you’re used to.

(Also, I’m not convinced this metaphor extends in the forwards direction. Data analysis? You only reverse, maybe 5% of the time you’re driving. But you’re driving forward the rest of the time: data analysis is 95% of my workflow. Yours?)

The Seven Stages of Being a Data Scientist

Becoming a data scientist is a fraught process as you frantically try to mark off all the bits on the ridiculous Venn diagram that will allow you to enter the high priesthood of data and be a “real” data scientist. In that vein, I offer you the seven stages on the road to becoming a “real” data scientist.

Like the Venn diagrams (the best and most accurate is here), you should take these stages just as seriously.

(1) You find out that this data and code at the same time thing makes your brain hurt.

(2) OK, you’re getting it now! [insert popular methodology du jour] is the most amazing thing ever! It’s so cool! You want to learn all about it!

(3) Why the hell won’t your matrix invert? You need to know how to code how many damn languages?

(4) While spending three increasingly frustrated hours looking for a comma, bracket or other infinitesimal piece of code in the wrong place, realise most of your wardrobe is now some variation on jeans and a local t-shirt, or whatever your local equivalent is. Realise you’ve crossed some sort of psychological divide. Wonder what the meaning of life is and remember it’s 42. Try to remember the last time you ate something that wasn’t instant coffee straight off the spoon. Ponder the pretty blinking cursor for a bit. Find your damn comma and return from the hell of debugging. Repeat stage (4) many times. (Pro tip: print statements are your friend.)

(5) Revise position on (2) to “it does a good enough job in the right place.”

(6) Revise position on (5) to “… that’s what the client wants and I need to be a better negotiator to talk them out of it because it’s wrong for this project.” All of a sudden your communication skills matter more than your code or your stats geek stuff.

(7) By this stage, you don’t really care what language or which method someone uses as long as they can get the job done right and explain it to the client so they understand it. The data and code at the same time thing still makes your brain hurt, though.

Decision vs Default: Abdicating to the Algorithm

Decision vs default is something I’ve been thinking a lot about lately in data science practice. It’s easy to stick with what we know and do well rather than push our own boundaries. That’s pretty typical for life in general, but in data science it means we default to our norms instead of making a decision. That leads to less-than-optimal outcomes.

One way this can manifest is to default to the models and methods we know best. If you’re a machine learning aficionado, then you tend to use machine learning tools to solve your problems. Likewise, if you’re an econometrician by training, you may default to explicit model build and testing regimes. Playing to our strengths isn’t a bad thing.

When it comes to model construction, both methods have their good points. But the best outcome is when you make the decision to use one or the other in the knowledge that you have options, not because you defaulted to the familiar.

Explicit model building and testing is a useful methodology if explanation of your model matters: if your stakeholder needs to know why, not just what. These models are built with a priori assumptions about causation, relationships and functional forms. They require a reasonable level of domain knowledge, and the model may be built out of many iterations of testing and experimenting. Typically, I use the Campos (2006) general-to-specific method: but only after extensive data analysis that informs my views on interactions, polynomial and other transformed inputs and so on. In this regime, predictive power comes from a combination of domain knowledge, statistical methodologies and a canny understanding of your data.
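The core of a general-to-specific pass can be sketched very simply: start with every candidate regressor and repeatedly drop the least significant one until everything remaining clears a threshold. Here's a minimal Python illustration on synthetic data, using |t| > 1.96 as a rough 5% rule; a real application involves far more judgement and diagnostic testing than this.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 300, 6
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, -1.0, 0.0, 0.0, 0.0, 0.0])  # two real effects
y = 0.5 + X @ beta_true + rng.normal(size=n)

def t_stats(X, y):
    """OLS t-statistics with conventional standard errors."""
    beta = np.linalg.solve(X.T @ X, X.T @ y)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
    return beta / se

# General-to-specific: start with everything, drop the least significant
# regressor each round (never the intercept) until every survivor
# clears |t| > 1.96 (roughly the 5% level).
keep = list(range(p))
while keep:
    Xk = np.column_stack([np.ones(n), X[:, keep]])
    t = t_stats(Xk, y)[1:]              # skip the intercept's t-stat
    worst = np.argmin(np.abs(t))
    if abs(t[worst]) > 1.96:
        break
    keep.pop(worst)

print(keep)   # indices of the surviving regressors
```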

Machine learning methodologies, on the other hand, are useful if you need lots of models, in real time, and you want them for prediction rather than explanation. Techniques like lasso or ridge regression let the algorithm guide feature selection to a greater degree than explicit model building does. Domain knowledge still matters: interactions, decisions regarding polynomial inputs and so on still have to be explicitly constructed in many cases. Causation may not be obvious.

Neither machine learning nor statistical modelling is better in all scenarios. Either may perform substantially better in some, depending on your performance metric. But make your decision knowing you have options: don’t default and abdicate to an algorithm.

Does it matter in practice? Normal vs t distribution

One of the perennial discussions is normal vs t distributions: which do you use, when, why and so on. This is one of those cases where for most sample sizes in a business analytics/data science context it probably makes very little practical difference. Since that’s such a rare thing for me to say, I thought it was worth explaining.

Now I’m all for statistical rigour: you should use the right one at the right time for the right purpose, in my view. However, this can be one of those cases where if the sample size is large enough, it’s just not that big a deal.

The actual simulations I ran are very simple, just 10 000 draws from normal and t-distributions with the t varying at different degrees of freedom. Then I just plotted the density for each on the same graph using ggplot in R. If you’d like to have a play around with the code, leave a comment to let me know and I’ll post it to github.
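My code was in R with ggplot, but the same idea in a minimal Python sketch (with an arbitrary seed): draw from a standard normal and from t-distributions at a few degrees of freedom, and compare how much mass sits in the tails.

```python
import numpy as np

rng = np.random.default_rng(11)
n = 10_000

# 10,000 draws from a standard normal and from t-distributions
# with increasing degrees of freedom.
normal = rng.standard_normal(n)
tails = {df: np.mean(np.abs(rng.standard_t(df, size=n)) > 2)
         for df in (3, 10, 30)}
normal_tail = np.mean(np.abs(normal) > 2)

# The t's fatter tails are obvious at df = 3 and have all but
# vanished by df = 30.
print("normal:", normal_tail)
for df, tail in tails.items():
    print(f"t({df}):", tail)
```

At the sample sizes common in business analytics, the degrees of freedom are large enough that the two curves are practically indistinguishable, which is the point of the post.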

Tutorials and Guides: A curated list

This post is a curated list of my favourite tutorials and guides because “that one where Hadley Wickham was talking about cupcakes” isn’t the most effective search term. You can find my list of cheat sheets here. There are a lot of great resources on data science (I’ve included my top picks), so I don’t intend to reinvent the wheel here. This is just a list of my favourites all laid out so I can find them again or point other people in their direction when it comes up in conversation. I’ve also added a number of the “how to” type posts I’ve written on this blog as I often answer an enquiry in that format.

Data Science

Tutorials and videos: General

Puppets teach data science too

  • Render a 3D object in R. I have no idea where I would ever use this information in my practice, but it’s presented BY A PUPPET. Great fun.
  • DIY your data science. Another offering from the puppet circle on the data science venn diagram.



Work Flow

  • Guide to modern statistical workflow. Really great organisation of background material.
  • Tidy data, tidy models. Honestly, if there was one thing I wish had been around 10 years ago, this would be it. The amount of time and accuracy to be saved using this method is phenomenal.
  • Extracting data from the web. You found the data, now what to do? Look here.

Linear Algebra



Machine learning

Data visualisation

Natural Language Processing

I’ll continue to update this list as I find things I think are useful or interesting.

Edit: actually, “that one where Hadley Wickham was talking about cupcakes” is surprisingly accurate as a search term.

Q&A vs the Leaders’ Debate: is everyone singing from the same song sheet?

The election campaign is in full swing here in Australia, and earlier this week the leaders of the two main parties, Malcolm Turnbull and Bill Shorten, faced off in a heavily scripted debate in which few questions were answered and the talking points were well practised. The encounter was described as “diabolical” and “boring”, and fewer Australians tuned in than in recent years. Possibly this was because they expected to hear what they had already heard before.

Since the song sheet was well rehearsed, this seemed like the perfect opportunity for another auspol word cloud. The transcript of the debate was made available on Malcolm Turnbull’s website, and it was an easy enough matter of poking around and seeing what could be found. Chris Uhlmann, who moderated, was added to the stop words list as he was a prominent feature in earlier versions of the cloud.

debate word cloud

The song sheet was mild: the future tense “will” was in the middle with Shorten, labor, plan, people and Turnbull. Also featured were tax, economic, growth, change and other economic nouns like billion, (per)cent, economy, budget, superannuation. There was mention of climate, (people) smugglers, fair and action, but these were relatively isolated as topics.

In summary, this word cloud is not that different to that generated from the carefully strategised twitter feeds of Turnbull and Shorten I looked at last week.

The ABC’s program Q and A could be a better opportunity for politicians to depart from the song sheet and offer less scripted insight: why not see what the word cloud throws up?

This week’s program aired the day after the leaders’ debate and featured Steve Ciobo (Liberal: minister for trade), Terri Butler (Labor: shadow parliamentary secretary for child safety and prevention of family violence), Richard di Natale (Greens: leader; his twitter word cloud is here), Nick Xenophon (independent senator) and Jacqui Lambie (independent senator). Tony Jones hosted the program and suffered the same fate as Chris Uhlmann.

QandA word cloud

The word cloud picked up on the discursive format of the show: names of panellists feature prominently. Interestingly, Richard di Natale appears in the centre. Also prominent are election-related words such as Australia, government, country, question and debate.

Looking at other topics thrown up by the word cloud, there is a broad range: penalty rates, coal, senate, economy, businesses, greens, policy, money, Queensland, medicare, politician, commission.

Two different formats, two different panels and two different sets of topics. Personally, I prefer it when the song sheet has a few more pages.

Congratulations to the Melbourne Data Science Group!

Last week, I attended the Melbourne Data Science Initiative and it was definitely the highlight of my data science calendar this year! The event was superbly organised by Phil Brierley and his team. Events included tutorials on Machine Learning, Deep Learning, Business Analytics and talks on feature engineering, big data and the need to invest in analytic talent amongst others.

The speakers were knowledgeable and interesting, with everything covered from the hilarious building of a rhinoceros catapult (thanks to Eugene from Presciient; it’s possible I’ll never forget that one) to the paramount importance of the “higher purpose” in business analytics, as discussed by Evan Stubbs from SAS Australia and New Zealand.

If you’re in or around Melbourne and into data science at all, check out the group who put on this event here.