Machine Learning vs Econometric Modelling: Which One?

Renee from Becoming a Data Scientist asked Twitter which basic issues were hard to understand in data science. It generated a great thread with lots of interesting perspectives you can find here.

My opinion is that the most difficult to understand concept has nothing to do with the technical aspects of data science.

The choice of when to use machine learning, when to use econometric methods and when it matters is rarely discussed. The reason for that is that the answers are neither simple nor finite.

Firstly, the difference between econometrics/statistics and machine learning is mostly cultural. Many econometric models not commonly seen in machine learning (tobit and conditional logit are two that come to mind) could easily be estimated using those techniques. Likewise, machine learning mainstays like clustering or decision trees could benefit from an econometric/statistical approach to model building and evaluation. The main differences between the two groups are different values about what makes a model “good” and slightly different (and very complementary) skill sets.

Secondly, I think the differences between a model, an estimation method and an algorithm are not always well understood. Identifying differences helps you understand what your choices are in any given situation. Once you know your choices you can make a decision rather than defaulting to the familiar. See here for details.

 

So how do I make decisions about algorithms, estimators and models?

Like data analysis (see here, here and here), I think of modelling as an interrogation over my data, my problem and my brief. If you’re new to modelling, here are some thoughts to get you started.

Why am I here? Who is it for?

It’s a strange one to start off with, but what’s your purpose for sitting down with this data? Where will your work end? Who is it for? All these questions matter.

If you are developing a model that customises a website for a user, then prediction may matter more than explanation. If you need to take your model up to the C-suite then explanation may be paramount.

What’s the life expectancy of your model? This is another question about purpose: are you trying to uncover complex and nuanced relationships that will be consistent in a dynamic and changing space? Or are you just trying to get the right document in front of the right user in a corpus that is static and finite?

Here’s the one question I ask myself for every model: what do I think the causal relationships are here?

What do I need it to do?

The key outcome you need from your model will probably have the most weight on your decisions.

For example, if you need to decide which content to place in front of a user with a certain search query, that may not be a problem you can efficiently solve with classic econometric techniques: the machine learning toolkit may be the best and only viable choice.

On the other hand, if you are trying to decide what the determinants of reading skill among young children in Papua New Guinea are, there may be a number of options on the table. Possibilities might include classic econometric techniques like the tobit model, estimated by maximum likelihood. But what about clustering techniques or decision trees? How do you decide between them?

Next question.

How long do I have?

In this case there are two ways of thinking about this problem: how long does my model have to estimate? How long do I have to develop it?

Development

If you have a reasonable length of time, then considering the full suite of statistical solutions and an open-ended analysis will mean a tight, efficient and nuanced model in deployment. If you have until the end of the day, then simpler options may be the only sensible choice. That applies whether you consider yourself to be doing machine learning OR statistics.

Econometrics and machine learning have two different value sets about what makes a good model, but it’s important to remember that this isn’t a case where you have to pick a team and stick with it. Each of those value sets developed out of a response to solving different problems with a different skill set. There’s plenty of crossover and plenty to learn on each side.

If you have the time, then a thorough interrogation of your data is never a bad idea. Statistics has a lot to offer there. Even if your final product is classic machine learning, statistics/econometrics will help you develop a better model.

This is also a situation where the decision to use techniques like lasso and ridge regression may come into play. If your development time is lacking, then lasso and/or ridge regularisation may be a reasonable response to very wide data (i.e. data with a lot of variables). However, don’t make the mistake of believing that defaulting to these techniques is always the best or most reasonable option. Utilising a general-to-specific methodology is something to consider if you have the time available. Regularisation and general-to-specific modelling were developed for two different situations, however: one size does not fit all.

If you are on a tight deadline (and that does happen, regularly) then be strategic: don’t default to the familiar, make your decision about what is going to offer most value for your project.

Deployment

Back to our website example: if your model has 15 microseconds to evaluate every time a new user hits the website, then run time becomes paramount. In a big data context, machine learning models with highly efficient algorithms may be the best option.

If you have a few minutes (or more) then your options are much wider: you can consider whether classic models like multinomial or conditional logit may offer a better outcome for your particular needs than, say, machine learning models like decision trees. Marginal effects and elasticities can be used in both machine learning and econometric contexts. They may offer you two things: a strong way to explain what’s going on to end-users and a structured way to approach your problem.

It’s not the case that machine learning = fast, econometrics = slow. It’s very dependent on the models, the resultant likelihoods/optimisation requirements and so on. If you’ve read this far, you’ve also probably seen that the space between the two fields is narrow and blurring rapidly.

This is where domain knowledge, solid analysis and testing in the development stage will inform your choices regarding your model for deployment. Detailed econometric models may be too slow for deployment in some contexts, but experimenting with them at the development stage can inform a final, streamlined deployment model.

Is your model static: do you present one set of results, once? Or is it dynamic: does this model generate results multiple times over its lifecycle? These are also decisions you need to consider.

What are the resources I have?

How much data do you have? Do you need all of it? Do you want all of it? How much computing power have you got under your belt? These questions will help you decide what methodologies you need to estimate your model.

I once did a contract where the highly classified data could not be removed from the company’s computers. That’s reasonable! What wasn’t reasonable was the fact that the computer they gave me couldn’t run email and R at the same time. It made the choices we had available rather limited, let me tell you. But that was the constraint we were working under: confidentiality mattered most.

It may not be possible to use the simple closed form ordinary least squares solution for regression if your data is big, wide and streaming continuously. You may not have the computing power you need to estimate highly structured and nuanced econometric models in the time available. In those cases, the models developed for these situations in machine learning are clearly the superior choice (because they come up with answers).
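As a minimal sketch of what “coming up with answers” can look like when data arrives as a stream: the closed form solution needs the whole dataset at once, whereas stochastic gradient descent updates the estimates one observation at a time. The function names, the learning rate and the toy data below are all my own illustrative assumptions, not a recommendation for any particular problem.

```python
import random


def sgd_linear_regression(stream, n_features, learning_rate=0.1):
    """Estimate y = intercept + sum(coef * x) one observation at a time.

    `stream` is any iterable of (x, y) pairs, where x is a list of floats.
    Only the current estimates are kept in memory, never the full dataset.
    """
    intercept = 0.0
    coefs = [0.0] * n_features
    for x, y in stream:
        prediction = intercept + sum(c * xi for c, xi in zip(coefs, x))
        error = prediction - y
        # Gradient step on the squared error of this single observation.
        intercept -= learning_rate * error
        coefs = [c - learning_rate * error * xi for c, xi in zip(coefs, x)]
    return intercept, coefs


def toy_stream(n_obs, seed=0):
    """Hypothetical noise-free data: y = 1 + 2*x, with x drawn from [-1, 1]."""
    rng = random.Random(seed)
    for _ in range(n_obs):
        x = rng.uniform(-1.0, 1.0)
        yield [x], 1.0 + 2.0 * x


intercept, coefs = sgd_linear_regression(toy_stream(5000), n_features=1)
print(round(intercept, 3), round(coefs[0], 3))
```

On this toy stream the estimates should settle very close to the true intercept of 1 and slope of 2. With noisy real data the learning rate usually needs to shrink over time, which is one of the tuning decisions this sketch leaves out.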

On the other hand, assuming that machine learning is the solution to everything is limiting and naive: you may be missing an opportunity to generate robust insights if you don’t look beyond what’s common in machine learning.

How big is too big for classic econometrics? Like all these questions, it’s answered with it depends. My best advice here is: during your analysis stage, try it and see.

Now go forth and model stuff

This is just a brief, very general rundown of how I think about modelling and how I make my decisions between machine learning and econometrics. One thing I want to make abundantly clear, however, is that this is not a binary choice.

You’re not doing machine learning OR econometrics: you’re modelling.

That means being aware of your options and that the differences between them can be extremely subtle (or even non-existent at times). There are times when those differences won’t matter for your purpose, others where they will.

What are you modelling and how are you doing it? It’d be great to get one non-spam comment this week.

Machine Learning is Basically the Reversing Camera on Your Car

I’ve been spending a bit of time on machine learning lately. But when it comes to classification or regression: it’s basically the reversing camera on your car.

Let me elaborate: machine learning, like a reversing camera, is awesome. Both things let you do stuff you already could do, but faster and more often. Both give you insights into the world around you that you may not have had without them. However, both can give a more narrow view of the world than some other techniques (in this case, expanded statistical/econometric methodologies and/or your mirrors and checking your blindspots).

As long as everything around you remains perfectly still and doesn’t change, the reversing camera will let you get into a tight parking spot backwards and give you some insights into where the gutter and other objects are that you didn’t have before. Machine learning does great prediction when the inputs are not changing.

But if you have to go a long way in reverse (like reversing down your driveway- mine is 400m long), or things are moving around you (other cars, pet geese, STUPID big black dogs that think running under your wheels is a great idea. He’s bloody fine, stupid mutt): then the reversing camera alone is not all the information you need.

In the same way, if you need to explain relationships- because your space is changing and prediction is not enough- then it’s a very useful thing to expand your machine learning toolbox with statistical/econometric techniques like hypothesis testing, information criteria and solid model building methodologies (as opposed to relying solely on lasso or ridge methods). Likewise, causality and endogeneity matter a lot.

So, in summary machine learning and reversing cameras are awesome, but aren’t the whole picture in many cases. Make your decision about what works best in your situation: don’t just default to what you’re used to.

(Also, I’m not convinced this metaphor extends in the forwards direction. Data analysis? You only reverse, maybe 5% of the time you’re driving. But you’re driving forward the rest of the time: data analysis is 95% of my workflow. Yours?)

Decision vs Default: Abdicating to the Algorithm

Decision vs default is something I’ve been thinking a lot about lately in data science practice. It’s easy to stick with what we know and do well, rather than push our own boundaries. That’s pretty typical for life in general, but in data science it means defaulting to our norms instead of making a decision. That leads to less-than-optimal outcomes.

One way this can manifest is to default to the models and methods we know best. If you’re a machine learning aficionado, then you tend to use machine learning tools to solve your problems. Likewise, if you’re an econometrician by training, you may default to explicit model build and testing regimes. Playing to our strengths isn’t a bad thing.

When it comes to model construction, both methods have their good points. But the best outcome is when you make the decision to use one or the other in the knowledge that you have options, not because you defaulted to the familiar.

Explicit model build and testing is a useful methodology if explanation of your model matters: if your stakeholder needs to know why, not just what. These models are built with a priori assumptions about causation, relationships and functional forms. They require a reasonable level of domain knowledge and the model may be built out of many iterations of testing and experimenting. Typically, I use the Campos (2006) general to specific method: but only after extensive data analysis that informs my views on interactions, polynomial and other transformative inputs and so on. In this regime, predictive power comes from a combination of domain knowledge, statistical methodologies and a canny understanding of your data.

Machine learning methodologies on the other hand are useful if you need lots of models, in real time and you want them more for prediction, not explanation. Machine learning methodologies that use techniques like Lasso or Ridge regression let the algorithm guide feature selection to a greater degree than in the explicit model build methods. Domain knowledge still matters: interactions, the decisions regarding polynomial inputs and so on still have to be explicitly constructed in many cases. Causation may not be obvious.
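To make “letting the algorithm guide feature selection” a little more concrete, here is a minimal sketch under one simplifying assumption: an orthonormal design, where both penalised estimates can be written directly in terms of the ordinary least squares coefficient. With the objectives written as half the sum of squared errors plus `penalty * sum(|b|)` for lasso and plus `(penalty / 2) * sum(b**2)` for ridge, lasso soft-thresholds the coefficient and ridge rescales it. The coefficient names and values are made up.

```python
def ridge_shrink(ols_coef, penalty):
    """Ridge estimate for one coefficient under an orthonormal design."""
    return ols_coef / (1.0 + penalty)


def lasso_shrink(ols_coef, penalty):
    """Lasso estimate under an orthonormal design (soft thresholding)."""
    magnitude = abs(ols_coef) - penalty
    if magnitude <= 0.0:
        return 0.0  # small coefficients are dropped from the model entirely
    return magnitude if ols_coef > 0 else -magnitude


# Hypothetical OLS coefficients for four inputs.
ols_coefs = {"age": 2.0, "income": -0.8, "noise_1": 0.3, "noise_2": -0.1}
penalty = 0.5

for name, coef in ols_coefs.items():
    print(name, round(ridge_shrink(coef, penalty), 3), round(lasso_shrink(coef, penalty), 3))
```

Ridge keeps every input but shrinks it; lasso sets the two small coefficients to exactly zero. Neither step knows anything about causation or domain knowledge: the penalty alone decides what stays.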

Neither machine learning nor statistical modelling is better in all scenarios. Either may perform substantially better in some, depending on what your performance metric is. But make your decision knowing you have options. Don’t default and abdicate to an algorithm.

Interpreting Models: Coefficients, Marginal Effects or Elasticities?

I’ve spoken about interpreting models before. I think that communicating results is the most important part of our work. However, it’s one that’s often overlooked when discussing the how-to of data science. For this purpose, I find marginal effects and elasticities better than coefficients alone.

Model build, selection and testing is complex and nuanced. Communicating the model is sometimes harder, because a lot of the time your audience has no technical background whatsoever. Your stakeholders can’t go up the chain with, “We’ve got a model. And it must be a good model because we don’t understand any of it.”

Our stakeholders also have a limited attention span so the explanation process is twofold: explain the model and do it fast.

For these reasons, I usually interpret models for my stakeholders with marginal effects and elasticities, not coefficients or log-odds. Coefficient interpretation is very different for regressions depending on functional form and if you have interactions or polynomials built into your model, then the coefficient is only part of the story. If you have a more complex model like a tobit, conditional logit or other option, then interpretation of coefficients is different for each one.

I don’t know about your stakeholders and reporting chains: mine can’t handle that level of complexity.

Marginal effects and elasticities are also different for each of these models but they are by and large interpreted in the same way. I can explain the concept of a marginal effect once and move on. I don’t even call it a “marginal effect”: I say “if we increase this input by a single unit, I expect [insert thing here]” and move on.

Marginal effects and elasticities are often variable over the range of your sample: they may be different at the mean than at the minimum or maximum, for example. If you have interactions and polynomials, they will also depend on covarying inputs. Some people see this as added layers of complexity.
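Here is a minimal sketch of that variation for a one-input logit model, where the marginal effect is the slope times p(1 - p) and the elasticity rescales it by x / p. The coefficients and the three evaluation points are hypothetical; in practice they would come from your fitted model and your sample.

```python
import math


def logit_probability(x, intercept, slope):
    """P(y = 1 | x) for a one-input logit model."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * x)))


def marginal_effect(x, intercept, slope):
    """Change in probability for a small increase in x: slope * p * (1 - p)."""
    p = logit_probability(x, intercept, slope)
    return slope * p * (1.0 - p)


def elasticity(x, intercept, slope):
    """Percentage change in probability for a one percent change in x."""
    p = logit_probability(x, intercept, slope)
    return marginal_effect(x, intercept, slope) * x / p


# Hypothetical fitted coefficients; the sample range is also made up.
intercept, slope = -2.0, 0.5
for x in (1.0, 4.0, 8.0):  # e.g. minimum, mean and maximum of the input
    print(x, round(marginal_effect(x, intercept, slope), 4), round(elasticity(x, intercept, slope), 4))
```

The single coefficient of 0.5 turns into three different marginal effects, largest in the middle of the range where the predicted probability is one half. That is exactly the kind of relationship that is easier to show in a chart than in a table of coefficients.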

In the age of data visualisation, I see it as an opportunity to chart these relationships and visualise how your model works for your stakeholders.

We all know they like charts!

Statistical model selection with “Big Data”: Doornik & Hendry’s New Paper

The claim that causation has been ‘knocked off its pedestal’ is fine if we are making predictions in a stable environment but not if the world is changing …. or if we ourselves hope to change it. – Harford, 2014

Ten or fifteen years ago, big data sounded like the best thing ever in econometrics. When you spend your undergraduate career learning that (almost!) everything can be solved in classical statistics with more data, it sounds great. But big data comes with its own issues. No free lunch, too good to be true and your mileage really does vary.

In a big data set, statistical power isn’t the issue. You have power enough for just about everything. But that comes with problems of its own. The probability of a Type I error may be very high. In this context, it’s the possibility of falsely interpreting that a parameter estimate is significant when in fact it is not. Spurious relationships are likely. Working out which relationships are truly significant and which are spurious is difficult. Model selection in the big data context is complex!
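A quick sketch of “power enough for just about everything”: the same tiny correlation goes from unremarkable to overwhelmingly significant as the sample grows. The p-value below uses a normal approximation to the usual t-test for a correlation, which is my simplification and only reasonable because n is large; the correlation of 0.01 is made up.

```python
import math


def correlation_p_value(r, n):
    """Two-sided p-value for H0: correlation = 0, using a normal approximation.

    The approximation is reasonable for the large n considered here.
    """
    t_stat = r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)
    return math.erfc(abs(t_stat) / math.sqrt(2.0))


tiny_correlation = 0.01  # explains 0.01% of the variance
for n in (1_000, 100_000, 10_000_000):
    print(n, correlation_p_value(tiny_correlation, n))
```

Statistical significance at ten million observations says very little about whether a relationship is large enough, or real enough, to belong in the model.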

David Hendry is one of the powerhouses of modern econometrics and the fact that he is weighing into the big data model selection problem is a really exciting proposition. This week he published a paper with Jurgen Doornik in the Cogent Journal of Economics and Finance. You can find it here.

Doornik and Hendry propose a methodology for big data model selection. Big data comes in many varieties and in this paper, they consider only cross-sectional and time series data of “fat” structure: that is, more variables than observations. Their results generalise to other structures, but not always. Doornik and Hendry describe four key issues for big data model selection in their paper:

  • Spurious relationships
  • Mistaking correlations for causes
  • Ignoring sampling bias
  • Overestimating significance of results.

So, what are Doornik and Hendry’s suggestions for model selection in a big data context? Their approach has several pillars to the overall concept:

  1. They calculate the probabilities of false positives in advance. It’s long been possible in statistics to set the significance level to control multiple simultaneous tests. This is an approach taken in both ANOVA testing for controlling the overall significance level when testing multiple interactions and in some panel data approaches when testing multiple cross-sections individually. The Bonferroni inequality is the simplest of this family of techniques, though Doornik and Hendry are suggesting a far more sophisticated approach.
  2. Test “causation” by evaluating super exogeneity. In many economic problems especially, a randomised controlled trial is infeasible. Super exogeneity is an added layer of sophistication on the correlation/causation spectrum of which Granger causation was an early addition.
  3. Deal with hidden dependence in cross-section data. Not always an easy prospect to manage, cross-sectional dependence usually has no natural or obvious ordering as in time series dependence: but controlling for this is critical.
  4. Correct for selection biases. Often, big data arrives not out of a careful sampling design, but on a “whoever turned up to the website” basis. Recognising, controlling and correcting for this is critical to good model selection.
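The first pillar can be illustrated with the simplest member of that family, the Bonferroni adjustment. This is a sketch of the general idea only, not Doornik and Hendry’s procedure, and the error-rate formula assumes the tests are independent.

```python
def family_wise_error_rate(alpha, n_tests):
    """Chance of at least one false positive across independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level that keeps the overall rate at or below alpha."""
    return alpha / n_tests


overall_alpha = 0.05
for n_tests in (1, 20, 1000):
    unadjusted = family_wise_error_rate(overall_alpha, n_tests)
    per_test = bonferroni_threshold(overall_alpha, n_tests)
    adjusted = family_wise_error_rate(per_test, n_tests)
    print(n_tests, round(unadjusted, 3), per_test, round(adjusted, 3))
```

Testing 20 candidate variables at the usual 5% level already gives a better than even chance of at least one false positive. Tightening the per-test level brings the overall rate back under 5%, at the cost of making real effects harder to detect.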

Doornik and Hendry advocate the use of Autometrics in the presence of big data, without abandoning statistical rigour. Failing to understand the statistical consequences of our modelling techniques makes poor use of data assets that otherwise have immense value. Doornik and Hendry propose a robust and achievable methodology. Go read their paper!

What if policies have networks like people?

It’s been policy-light this election season, but some policy areas are up for debate. Others are being carefully avoided by mutual agreement, much like at Christmas lunch when we all tacitly agree we aren’t bringing up What Aunty Betty Did Last Year After Twelve Sherries. It’s all too painful, we’ll never come to any kind of agreement and we should just pretend like it’s not important.

However, policy doesn’t happen in a vacuum and I wondered if it was possible that using a social network-type analysis might illustrate something about the policy debate that is occurring during this election season.

To test the theory, I used the transcripts of the campaign launch speeches of Malcolm Turnbull and Bill Shorten. These are interesting documents to examine, because they are at one and the same time an affirmation of each party’s policy aspirations for the campaign as well as a rejection of the other’s. I used a simple social network analysis, similar to that used in the Aeneid study. If you want to try it yourself, you can find the R script here.

Deciding on the topics to examine was some trial and error, but the list was eventually narrowed down to 19 topics that have been the themes of the election year: jobs, growth, housing, childcare, superannuation, health, education, borders, immigration, tax, medicare, climate change, marriage equality, offshore processing, environment, boats, asylum, business and bulk billing. These aren’t the topics that the parties necessarily want to talk about, but they are nonetheless being talked about.
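For anyone who wants the flavour of the approach without the R script, here is a rough Python sketch of one way to turn a transcript into a topic network: count how often two topics turn up in the same sentence and treat that count as the edge weight. Defining edges by sentence-level co-occurrence is my assumption for illustration, the matching is naive substring matching, and the text is made up rather than taken from either speech.

```python
import itertools
import re
from collections import Counter


def topic_cooccurrence(text, topics):
    """Count how often each pair of topics appears in the same sentence."""
    edges = Counter()
    for sentence in re.split(r"[.!?]+", text.lower()):
        # Naive substring match: good enough for a sketch, not for production.
        present = sorted(t for t in topics if t in sentence)
        for pair in itertools.combinations(present, 2):
            edges[pair] += 1
    return edges


# Made-up text standing in for a speech transcript.
speech = (
    "Our plan for jobs and growth starts with tax reform. "
    "We will protect medicare and bulk billing. "
    "Jobs depend on education, and growth depends on jobs."
)
topics = ["jobs", "growth", "tax", "medicare", "bulk billing", "education"]

for (a, b), weight in sorted(topic_cooccurrence(speech, topics).items()):
    print(a, "-", b, weight)
```

The resulting weighted edge list is the kind of input a network layout algorithm needs.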

It took some manoeuvring to get a network that was readable, but one layout (Kamada-Kawai for the interested) stood out. I think it describes the policy state quite well, visually speaking.

[Figure: topic network]

We have the inner circle of high disagreement: borders, environment, superannuation, boats and immigration. There is a middle circle doing the job of containment: jobs and growth, housing, childcare, education, medicare, business and tax- all standard election fodder.

Then we have the outer arc of topics neither the Labor nor Liberal parties really want to engage with: offshore processing, asylum (as opposed to immigration, boats and borders), climate change (much more difficult to manage than mere environment), bulk billing (the crux of medicare) and marriage equality (have a plebiscite, have a free parliamentary vote, have something, except responsibility). I found it interesting that the two leaders’ speeches when visualised contain one part of a policy debate around immigration: boats and borders. But they conspicuously avoided discussing the unpleasant details: offshore processing.

Much like Aunty Betty and her unfortunate incident with the cold ham, both parties are in tacit agreement to ignore the difficult parts of a policy debate.

Tutorials and Guides: A curated list

This post is a curated list of my favourite tutorials and guides because “that one where Hadley Wickham was talking about cupcakes” isn’t the most effective search term. You can find my list of cheat sheets here. There are a lot of great resources on data science (I’ve included my top picks), so I don’t intend to reinvent the wheel here. This is just a list of my favourites all laid out so I can find them again or point other people in their direction when it comes up in conversation. I’ve also added a number of the “how to” type posts I’ve written on this blog as I often answer an enquiry in that format.

Data Science

Tutorials and videos: General

Puppets teach data science too

  • Render a 3D object in R. I have no idea where I would ever use this information in my practice, but it’s presented BY A PUPPET. Great fun.
  • DIY your data science. Another offering from the puppet circle on the data science venn diagram.

Econometrics

Statistics

Work Flow

  • Guide to modern statistical workflow. Really great organisation of background material.
  • Tidy data, tidy models. Honestly, if there was one thing that had been around 10 years ago, I wish this was it. The amount of time and accuracy to be saved using this method is phenomenal.
  • Extracting data from the web. You found the data, now what to do? Look here.

Linear Algebra

Asymptotics

Bayes

Machine learning

Data visualisation

Natural Language Processing

I’ll continue to update this list as I find things I think are useful or interesting.

Edit: actually, “that one where Hadley Wickham was talking about cupcakes” is surprisingly accurate as a search term.


Continuous, Censored and Truncated Data: what are the differences and do you need to care?

Whenever I work with someone whose statistical or econometric experience has been more practical than theoretical, two things happen. The first is that the poor person inexplicably develops a twitch whenever I launch into an enthusiastic tangent that requires a sheet of graph paper and extensive hand waving.

The other thing that inevitably happens is that the digression comes to an end and the question is asked “but does that matter in practice?”

When it comes to model selection, the difference between data types really does matter. You may make choices one way or another, but understanding the differences (both obvious and subtle) lets you make those choices understanding that you do have them.

This post is a cliff-notes version of the issue. Maybe you’ve heard of these differences in data types and just need a memory jog. Maybe you’ve not heard of them at all and want somewhere simple to start.

Continuous data is pretty simple: it’s data that can lie anywhere on the real line with a positive probability. That is, it can be anywhere from very large negative numbers to very large positive numbers. Data drawn from a normal distribution is an example of continuous data.

Truncated data, on the other hand, is data which is continuous but has the added complication of only being observed above or below a certain point. The classic example suggested by Greene is income [1]. One example would be if we only surveyed the income of those earning above the tax-free threshold: then we would have truncated data.

Censored data is similar. It’s an issue not of who is sampled but of how much is recorded about them. Some parts of the distribution are obscured, but not ignored. The survey may, for example, interview all income levels, but only record the incomes of those above the tax free threshold and describe the rest as “under the tax threshold” rather than giving the income in dollar terms. In this case all parts of the distribution are reported on, but the level of information differs above or below a threshold.
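If it helps to see the distinction rather than read it, here is a small sketch using the income example. The incomes and the threshold are hypothetical, and recording censored values at the threshold itself is just one common convention.

```python
def truncate(values, threshold):
    """Truncation: observations below the threshold never enter the sample."""
    return [v for v in values if v >= threshold]


def censor(values, threshold):
    """Censoring: every observation is kept, but values below the threshold
    are recorded only as the threshold itself."""
    return [v if v >= threshold else threshold for v in values]


# Hypothetical incomes and a hypothetical tax-free threshold.
incomes = [5_000, 12_000, 25_000, 40_000, 95_000]
threshold = 18_200

print(truncate(incomes, threshold))  # three observations survive
print(censor(incomes, threshold))    # five observations, two recorded at the threshold
```

The truncated sample has lost two people entirely; the censored sample still knows they exist, just not what they earn.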

Most people are aware of issues modelling categorical data using techniques designed for continuous data. However, censored and truncated data also need special treatment. A lot of the data we deal with has a natural truncation point: distance isn’t negative, prices are not (well, hardly ever) negative. Recognising that you may be dealing with truncated or censored data is an important part of initial data analysis. For a thorough discussion, see W.H. Greene’s chapter on the subject here.

In practice, continuous data methodologies may work quite well for these types of data as long as there isn’t a large amount of data sitting at or near the truncation or censoring point (which is often zero).

Test scores are something I’ve worked a lot with. In my experience, once the proportion of test scores began to approach around 20% zeros, I needed to switch over to models designed for the issue. In the 10%-20% range I will often try a few different models to see which is most appropriate. That’s just a general rule of thumb- your mileage may vary.
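That rule of thumb is easy to check during initial data analysis. A minimal sketch, with made-up scores and with the 10% and 20% cut-offs taken straight from my rough guide above rather than from any formal test:

```python
def share_at_limit(values, limit=0.0):
    """Proportion of observations sitting exactly at the censoring point."""
    return sum(1 for v in values if v == limit) / len(values)


def suggest_approach(values, limit=0.0, lower=0.10, upper=0.20):
    """Apply the rough 10%/20% rule of thumb described above."""
    share = share_at_limit(values, limit)
    if share >= upper:
        return "use a model designed for censored or truncated data"
    if share >= lower:
        return "compare a few candidate models"
    return "continuous-data methods may be adequate"


# Made-up test scores: 3 zeros out of 20 observations (15%).
scores = [0, 0, 0, 12, 15, 18, 22, 25, 31, 34, 38, 41, 47, 52, 55, 60, 63, 71, 78, 90]
print(share_at_limit(scores), suggest_approach(scores))
```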

Hand waving and furious graph-paper drawing aside: yes in this case knowing the differences does matter in practice.

Notes:

[1] W. H. Greene, Econometric Analysis, is a classic text and here I’m looking at p. 756 in the fifth edition. There are three copies of this book living in my house. Definitely worth the investment if you are looking for either a classic text covering everything econometrics or a useful TV stand. What can I say? We were young and poor and a matched set of texts made up for deficits in our furniture budget. I’ve owned this book for nearly twenty years and I still use it- even long after we can afford furniture.