Renee from Becoming a Data Scientist asked on Twitter which basic data science concepts people find hard to understand. It generated a great thread with lots of interesting perspectives, which you can find here.
My opinion is that the most difficult to understand concept has nothing to do with the technical aspects of data science.
The choice of when to use machine learning, when to use econometric methods and when it matters is rarely discussed. The reason for that is that the answers are neither simple nor finite.
Firstly, the difference between econometrics/statistics and machine learning is mostly cultural. Many econometric models that are rarely seen in machine learning (the tobit and the conditional logit are two that come to mind) could easily be estimated with machine learning techniques. Likewise, machine learning mainstays like clustering or decision trees could benefit from an econometric/statistical approach to model building and evaluation. The main differences between the two groups are different values about what makes a model “good” and slightly different (and very complementary) skill sets.
Secondly, I think the differences between a model, an estimation method and an algorithm are not always well understood. Identifying differences helps you understand what your choices are in any given situation. Once you know your choices you can make a decision rather than defaulting to the familiar. See here for details.
So how do I make decisions about algorithms, estimators and models?
Like data analysis (see here, here and here), I think of modelling as an interrogation of my data, my problem and my brief. If you’re new to modelling, here are some thoughts to get you started.
Why am I here? Who is it for?
It’s a strange one to start off with, but what’s your purpose for sitting down with this data? Where will your work end? Who is it for? All these questions matter.
If you are developing a model that customises a website for a user, then prediction may matter more than explanation. If you need to take your model up to the C-suite then explanation may be paramount.
What’s the life expectancy of your model? This is another question about purpose: are you trying to uncover complex and nuanced relationships that will be consistent in a dynamic and changing space? Or are you just trying to get the right document in front of the right user in a corpus that is static and finite?
Here’s the one question I ask myself for every model: what do I think the causal relationships are here?
What do I need it to do?
The key outcome you need from your model will probably carry the most weight in your decisions.
For example, if you need to decide which content to place in front of a user with a certain search query, that may not be a problem you can efficiently solve with classic econometric techniques: the machine learning toolkit may be the best and only viable choice.
On the other hand, if you are trying to identify the determinants of reading skill among young children in Papua New Guinea, there may be a number of options on the table. Possibilities include classic econometric techniques like the tobit model, estimated by maximum likelihood. But what about clustering techniques or decision trees? How do you decide between them?
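If the tobit route looks worth exploring, here is a minimal sketch in R using the AER package. The variables and simulated data are assumptions made purely for illustration, not part of the original example; the point is simply that a censored outcome (a test score that can’t go below zero) can be fitted by maximum likelihood in a couple of lines.

```r
# Minimal tobit sketch on simulated data (illustrative only).
library(AER)

set.seed(42)
n <- 500
age <- runif(n, 5, 10)
books_at_home <- rpois(n, 8)
latent_score <- -10 + 2 * age + 0.5 * books_at_home + rnorm(n, sd = 3)
score <- pmax(latent_score, 0)  # the observed score is censored at zero

# tobit() estimates the censored regression by maximum likelihood
fit <- tobit(score ~ age + books_at_home, left = 0)
summary(fit)
```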
How long do I have?
There are two ways of thinking about this problem: how long does my model have to produce an estimate once it’s running? And how long do I have to develop it?
If you have a reasonable length of time, then considering the full suite of statistical solutions and an open-ended analysis will mean a tight, efficient and nuanced model in deployment. If you have until the end of the day, then simpler options may be the only sensible choice. That applies whether you consider yourself to be doing machine learning OR statistics.
Econometrics and machine learning have two different value sets about what makes a good model, but it’s important to remember that this isn’t a case where you have to pick a team and stick with it. Each of those value sets developed out of a response to solving different problems with a different skill set. There’s plenty of crossover and plenty to learn on each side.
If you have the time, then a thorough interrogation of your data is never a bad idea. Statistics has a lot to offer there. Even if your final product is classic machine learning, statistics/econometrics will help you develop a better model.
This is also a situation where the decision to use techniques like lasso and ridge regression may come into play. If development time is short, then lasso and/or ridge regularisation may be a reasonable response to very wide data (that is, data with a lot of variables). However, don’t make the mistake of believing that defaulting to these techniques is always the best or most reasonable option. A general-to-specific methodology is something to consider if you have the time available. Regularisation and general-to-specific modelling were developed for different situations, however: one size does not fit all.
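As a rough sketch of what the quick option looks like in practice, here is a hedged R example using glmnet on simulated wide data; the variable names and tuning choices are assumptions for illustration only.

```r
# Lasso and ridge on simulated wide data (illustrative only).
library(glmnet)

set.seed(1)
n <- 200
p <- 500                                  # far more columns than you would hand-pick
x <- matrix(rnorm(n * p), n, p)
y <- x[, 1:5] %*% rnorm(5) + rnorm(n)     # only a handful of columns actually matter

lasso_fit <- cv.glmnet(x, y, alpha = 1)   # alpha = 1 is the lasso
ridge_fit <- cv.glmnet(x, y, alpha = 0)   # alpha = 0 is ridge

# The lasso shrinks many coefficients exactly to zero, which is one quick
# way to cut a wide dataset down to size when the deadline is close.
coef(lasso_fit, s = "lambda.min")
```

A general-to-specific approach, by contrast, starts from a deliberately rich specification and tests down, which takes longer but leaves you with a model you can defend variable by variable.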
If you are on a tight deadline (and that does happen, regularly) then be strategic: don’t default to the familiar, make your decision about what is going to offer most value for your project.
Back to our website example: if your model has 15 microseconds to evaluate every time a new user hits the website, then run time becomes the paramount concern. In a big data context, machine learning models with highly efficient algorithms may be the best option.
If you have a few minutes (or more) then your options are much wider: you can consider whether classic models like multinomial or conditional logit may offer a better outcome for your particular needs than, say, machine learning models like decision trees. Marginal effects and elasticities can be used in both machine learning and econometric contexts. They may offer you two things: a strong way to explain what’s going on to end-users and a structured way to approach your problem.
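If you do reach for something like a multinomial logit, a rough sketch in R might look like the following. The content categories, predictors and simulated data are assumptions for illustration only (the data are random, so the estimates themselves are uninteresting); the finite difference at the end is one simple way to get a marginal-effect-style answer you can explain to end users.

```r
# Multinomial logit sketch on made-up data (illustrative only).
library(nnet)

set.seed(7)
n <- 1000
query_length   <- rpois(n, 4)
returning_user <- rbinom(n, 1, 0.4)
content        <- factor(sample(c("article", "video", "product"), n, replace = TRUE))
d <- data.frame(content, query_length, returning_user)

fit <- multinom(content ~ query_length + returning_user, data = d, trace = FALSE)

# A crude finite-difference "marginal effect": how do the predicted
# probabilities shift when query_length goes from 4 to 5 for a returning user?
base <- data.frame(query_length = 4, returning_user = 1)
bump <- data.frame(query_length = 5, returning_user = 1)
predict(fit, bump, type = "probs") - predict(fit, base, type = "probs")
```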
It’s not the case that machine learning = fast, econometrics = slow. It’s very dependent on the models, the resultant likelihoods/optimisation requirements and so on. If you’ve read this far, you’ve also probably seen that the space between the two fields is narrow and blurring rapidly.
This is where domain knowledge, solid analysis and testing in the development stage will inform your choices regarding your model for deployment. Detailed econometric models may be too slow for deployment in some contexts, but experimenting with them at the development stage can inform a final, streamlined deployment model.
Is your model static: do you present one set of results, once? Or is it dynamic: does it generate results multiple times over its lifecycle? These are also questions you need to consider.
What are the resources I have?
How much data do you have? Do you need all of it? Do you want all of it? How much computing power have you got under your belt? These questions will help you decide what methodologies you need to estimate your model.
I once did a contract where the highly classified data could not be removed from the company’s computers. That’s reasonable! What wasn’t reasonable was the fact that the computer they gave me couldn’t run email and R at the same time. It made the choices we had available rather limited, let me tell you. But that was the constraint we were working under: confidentiality mattered most.
It may not be possible to use the simple closed-form ordinary least squares solution for regression if your data are big, wide and streaming continuously. You may not have the computing power to estimate highly structured and nuanced econometric models in the time available. In those cases, the models machine learning has developed for these situations are clearly the superior choice (because they actually come up with answers).
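To make the closed-form point concrete, here is a tiny R illustration of the textbook OLS solution, beta_hat = (X'X)^(-1) X'y, on made-up data. The dimensions are assumptions for illustration; the relevant part is that X'X is a p-by-p matrix built from the entire dataset, which is exactly what becomes awkward when the data are very wide or never stop streaming.

```r
# Closed-form OLS on simulated data (illustrative only).
set.seed(3)
n <- 1000
p <- 10
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))  # include an intercept column
beta <- rnorm(p)
y <- X %*% beta + rnorm(n)

# beta_hat = (X'X)^(-1) X'y, solved without forming the inverse explicitly
beta_hat <- solve(crossprod(X), crossprod(X, y))
cbind(truth = beta, estimate = as.numeric(beta_hat))
```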
On the other hand, assuming that machine learning is the solution to everything is limiting and naive: you may be missing an opportunity to generate robust insights if you don’t look beyond what’s common in machine learning.
How big is too big for classic econometrics? Like all these questions, the answer is “it depends”. My best advice here is: during your analysis stage, try it and see.
Now go forth and model stuff
This is just a brief, very general rundown of how I think about modelling and how I make my decisions between machine learning and econometrics. One thing I want to make abundantly clear, however, is that this is not a binary choice.
You’re not doing machine learning OR econometrics: you’re modelling.
That means being aware of your options, and aware that the differences between them can be extremely subtle (or even non-existent at times). There are times when those differences won’t matter for your purpose, and others where they will.
What are you modelling and how are you doing it? It’d be great to get one non-spam comment this week.