Decision vs default is something I’ve been thinking a lot about lately in data science practise. It’s easy to stick with what we know and do well, rather than push our own boundaries. That’s pretty typical for life in general, but in data science it comes at the expense of defaulting to our norms instead of making a decision. That leads to less-than-optimal outcomes.
One way this can manifest is to default to the models and methods we know best. If you’re a machine learning aficionado, then you tend to use machine learning tools to solve your problems. Likewise, if you’re an econometrician by training, you may default to explicit model build and testing regimes. Playing to our strengths isn’t a bad thing.
When it comes to model construction, both methods have their good points. But the best outcome is when you make the decision to use one or the other in the knowledge that you have options, not because you defaulted to the familiar.
Explicit model build and testing is a useful methodology if explanation of your model matters. If your stakeholder needs to know why, not just what. These models are built with a priori assumptions about causation, relationships and functional forms. They require a reasonable level of domain knowledge and the model may be built out of many iterations of testing and experimenting. Typically, I use the Campos (2006) general to specific method: but not after extensive data analysis that informs my views on interactions, polynomial and other transformative inputs and so on. In this regime, predictive power comes from a combination of domain knowledge, statistical methodologies and a canny understanding of your data.
Machine learning methodologies on the other hand are useful if you need lots of models, in real time and you want them more for prediction, not explanation. Machine learning methodologies that use techniques like Lasso or Ridge regression let the algorithm guide feature selection to a greater degree than in the explicit model build methods. Domain knowledge still matters: interactions, the decisions regarding polynomial inputs and so on still have to be explicitly constructed in many cases. Causation may not be obvious.
Neither machine learning or statistical modelling is better in all scenarios. Either may perform substantially better in some, depending on what your performance metric is. But make your decision knowing you have options. Don’t default and abdicate to an algorithm.