I think the differences between a model, an estimation method and an algorithm are not always well understood. Identifying differences helps you understand what your choices are in any given situation. Once you know your choices you can make a decision rather than defaulting to the familiar.
An algorithm is a set of predefined steps. Making a cup of coffee can be defined as an algorithm, for example. Algorithms can be nested within each other to create complex and useful pieces of analysis. Gradient descent is an algorithm for finding the minima of a function computationally. Newton-Raphson does the same thing but slower, stochastic gradient descent does it faster.
An estimation method is the manner in which your model is estimated (often with an algorithm). To take a simple linear regression model, there are a number of ways you can estimate it:
- You can estimate using the ordinary least squares closed form solution (it’s just an algebraic identity). After that’s done, there’s a whole suite of econometric techniques to evaluate and improve your model.
- You can estimate it using maximum likelihood: you calculate the negative likelihood and then you use a computational algorithm like gradient descent to find the minima. The econometric techniques are pretty similar to the closed form solution, though there are some differences.
- You can estimate a regression model using machine learning techniques: divide your sample into training, test and validation sets; estimate by whichever algorithm you like best. Note that in this case, this is essentially a utilisation of maximum likelihood. However, machine learning has a slightly different value system to econometrics with a different set of cultural beliefs on what makes “a good model.” That means the evaluation techniques used are often different (but with plenty of crossover).
The model is the thing you’re estimating using your algorithms and your estimation methods. It’s the decisions you make when you decide if Y has a linear relationship with X, or which variables (features) to include and what functional form your model has.
Anyone who does any machine learning at all has heard of the gradient descent and stochastic gradient descent algorithms. What interests me about machine learning is understanding how some of these algorithms (and as a consequence the parameters they are estimating) behave in different scenarios.
The following are single observations from a random data generating process:
where e is a random variable distributed in one of a variety of ways:
- e is distributed N(0,1). This is a fairly standard experimental set up and should be considered “best case scenario” for most estimators.
- e is distributed t(4). This is a data generation process that’s very fat tailed (leptokurtotic). It’s a fairly common feature in fields like finance. All the required moments for the ordinary least squares estimator to have all its usual properties exist, but only just. (If you don’t have a stats background, think of this as the edge of reasonable cases to test this estimator against.)
- e is distributed as centred and standardised chi squared (2). In this case, the necessary moments exist and we have ensured e has a zero mean. But the noise isn’t just fat-tailed, it’s also skewed. Again, not a great scenario for most estimators, but a useful one to test against.
The independent variable x is intended to be deterministic, but in this case is generated as N(1,1). To get all the details of what I did, the code is here. Put simply, I estimated the ordinary least squares estimators for the intercept and coefficient of the process (e.g. the true intercept here is zero and the true coefficient is 1). I did this using stochastic gradient descent and gradient descent to find the minima of the likelihood and generate the estimates.
Limits: the following are just single examples of what the algorithms do in each case. It’s not a full test of algorithm performance: we would repeat each of these thousands of times and then take overall performance measures to really know how these algorithms perform under these conditions. I just thought it would be interesting to see what happens. If I get some more time I’d develop this into a full simulation experiment if I find something worth exploring in detail.
Here’s what I found:
N is pretty large (10 000). Everything is super (for SGD)!
Realistically, unless dimensionality is huge, closed form ordinary least squares (OLS) estimation methods would do just fine here, I suspect. However, the real point here is to test differences between SGD and GD.
I used a random initialisation for the algorithm and you can see how quickly it got to the true parameter. In situations like this where the algorithm uses only a few iterations, if you are averaging SGD estimators then allowing for a burn in and letting it run beyond the given tolerance level may be very wise. In this case, averaging the parameters over the entire set of iterations would be a worse result than just taking the last estimator.
Under each scenario, the stochastic gradient descent algorithm performed brilliantly and much faster than the gradient descent equivalent which struggled. The different error distributions mattered very little here.
For reference, it took around 37 000 iterations for the gradient descent algorithm to reach its end point:
For the other experiments, I’ll just stick to a maximum of 10 000 iterations.
N is largish (N = 1000). Stochastic gradient descent doesn’t do too badly.
Now, to be fair, unless I was trying to estimate something with huge dimensionality (and this hasn’t been tested here anyway), I’d just use standard OLS estimating procedures for estimating this model in real life.
SGD takes marginally longer to get where it’s going, but the generating process made no material difference, although our example for chi squared was out slightly- this is one example.
N is smallish (N = 100): SGD gets brittle, but it gets somewhere close. GD has no clue.
Again, in real life, this is a moment for simpler methods but the interesting point here is that SGD takes a lot longer and fails to reach the exact true value, especially for the constant term (again, one example in each case, not a thorough investigation).
Here, GD has no clue and it’s taking SGD thousands more iterations to reach a less palatable conclusion. In practical machine learning contexts, it’s highly unlikely this is a sensible estimation method at these sample sizes: just use the standard OLS techniques.
I’m not destructive in general, but sometimes I like to break things.
I failed to achieve SGD armageddon by either weird error distributions (that were within the realms of reasonable for the estimator) or smallish sample sizes. So I decided to give it the doomsday scenario: smallish N and an error distribution that does not fulfil the requirements that the simple linear regression model needs to work. I tried a t(1) distribution for the error- this thing doesn’t even have a finite variance.
SGD and GD are utterly messed up (and I suspect the closed form solution may not be a whole lot better here either). Algorithmic armageddon, achieved. In practical terms, you can probably safely file this under “never going to happen”.
The choice of when to use machine learning, when to use econometric methods and when it matters is rarely discussed. The reason for that is that the answers are neither simple nor finite.
Firstly, the difference between econometrics/statistics and machine learning is mostly cultural. Many econometric models not commonly seen in machine learning (tobit, conditional logit are two that come to mind) could easily be estimated using those techniques. Likewise, machine learning mainstays like clustering or decision trees could benefit from an econometric/statistical approach to model building and evaluation. The main differences between the two groups are different values about what makes a model “good” and slightly different (and very complimentary) skill sets.
Secondly, I think the differences between a model, an estimation method and an algorithm are not always well understood. Identifying differences helps you understand what your choices are in any given situation. Once you know your choices you can make a decision rather than defaulting to the familiar. See here for details.
So how do I make decisions about algorithms, estimators and models?
Why am I here? Who is it for?
It’s a strange one to start off with, but what’s your purpose for sitting down with this data? Where will your work end? Who is it for? All these questions matter.
If you are developing a model that customises a website for a user, then prediction may matter more than explanation. If you need to take your model up to the C-suite then explanation may be paramount.
What’s the life expectancy of your model? This is another question about purpose: are you trying to uncover complex and nuanced relationships that will be consistent in a dynamic and changing space? Or are you just trying to get the right document in front of the right user in a corpus that is static and finite?
Here’s the one question I ask myself for every model: what do I think the causal relationships are here?
What do I need it to do?
The key outcome you need from your model will probably have the most weight on your decisions.
For example, if you need to decide which content to place in front of a user with a certain search query, that may not be a problem you can efficiently solve with classic econometric techniques: the machine learning toolkit may be the best and only viable choice.
On the other hand, if you are trying to decide what the determinants of reading skill among young children in Papua New Guinea are, there may be a number of options on the table. Possibilities might include classic econometric techniques like the tobit model, estimated by maximum likelihood. But what about clustering techniques or decision trees? How do you decide between them?
How long do I have?
In this case there are two ways of thinking about this problem: how long does my model have to estimate? How long do I have to develop it?
If you have a reasonable length of time, then considering the full suite of statistical solutions and an open-ended analysis will mean a tight, efficient and nuanced model in deployment. If you have until the end of the day, then simpler options may be the only sensible choice. That applies whether you consider yourself to be doing machine learning OR statistics.
Econometrics and machine learning have two different value sets about what makes a good model, but it’s important to remember that this isn’t a case where you have to pick a team and stick with it. Each of those value sets developed out of a response to solving different problems with a different skill set. There’s plenty of crossover and plenty to learn on each side.
If you have the time, then a thorough interrogation your data is never a bad idea. Statistics has a lot to offer there. Even if your final product is classic machine learning, statistics/econometrics will help you develop a better model.
This is also a situation where the decision to use techniques like lasso and ridge regression may come into play. If your development time is lacking, then lasso and/or ridge regularisation may be a reasonable response to very wide data (e.g. data with a lot of variables). However, don’t make the mistake of believing that defaulting to these techniques is always the best or most reasonable option. Utilising a general-to-specific methodology is something to consider if you have the time available. The two techniques were developed for two different situations, however: one size does not fit all.
If you are on a tight deadline (and that does happen, regularly) then be strategic: don’t default to the familiar, make your decision about what is going to offer most value for your project.
Back to our website example, if your model has 15 microseconds to evaluate every time a new user hits the website, then the critical component of run time becomes paramount. In a big data context, machine learning models with highly efficient algorithms may be the best option.
If you have a few minutes (or more) then your options are much wider: you can consider whether classic models like multinomial or conditional logit may offer a better outcome for your particular needs than, say, machine learning models like decision trees. Marginal effects and elasticities can be used in both machine learning and econometric contexts. They may offer you two things: a strong way to explain what’s going on to end-users and a structured way to approach your problem.
It’s not the case that machine learning = fast, econometrics = slow. It’s very dependent on the models, the resultant likelihoods/optimisation requirements and so on. If you’ve read this far, you’ve also probably seen that the space between the two fields is narrow and blurring rapidly.
This is where domain knowledge, solid analysis and testing in the development stage will inform your choices regarding your model for deployment. Detailed econometric models may be too slow for deployment in some contexts, but experimenting with them at the development stage can inform a final, streamlined deployment model.
Is your model static- do you present one set of results, once? Or is it dynamic: does this model generate multiple times over its lifecycle? These are also decisions you need to consider.
What are the resources I have?
How much data do you have? Do you need all of it? Do you want all of it? How much computing power have you got under your belt? These questions will help you decide what methodologies you need to estimate your model.
I once did a contract where the highly classified data could not be removed from the company’s computers. That’s reasonable! What wasn’t reasonable was the fact that the computer they gave me couldn’t run email and R at the same time. It made the choices we had available rather limited, let me tell you. But that was the constraint we were working under: confidentiality mattered most.
It may not be possible to use the simple closed form ordinary least squares solution for regression if your data is big, wide and streaming continuously. You may not have the computing power you need to estimate highly structured and nuanced econometric models in the time available. In those cases, the models developed for these situations in machine learning are clearly a very superior choice (because they come up with answers).
On the other hand, assuming that machine learning is the solution to everything is limiting and naive: you may be missing an opportunity to generate robust insights if you don’t look beyond what’s common in machine learning.
How big is too big for classic econometrics? Like all these questions, it’s answered with it depends. My best advice here is: during your analysis stage, try it and see.
Now go forth and model stuff
This is just a brief, very general rundown of how I think about modelling and how I make my decisions between machine learning and econometrics. One thing I want to make abundantly clear, however, is that this is not a binary choice.
You’re not doing machine learning OR econometrics: you’re modelling.
That means being aware of your options and that the differences between them can be extremely subtle (or even non existent at times). There are times when those differences won’t matter for your purpose, others where they will.
What are you modelling and how are you doing it? It’d be great to get one non-spam comment this week.
I’ve been spending a bit of time on machine learning lately. But when it comes to classification or regression: it’s basically the reversing camera on your car.
Let me elaborate: machine learning, like a reversing camera, is awesome. Both things let you do stuff you already could do, but faster and more often. Both give you insights into the world around you that you may not have had without them. However, both can give a more narrow view of the world than some other techniques (in this case, expanded statistical/econometric methodologies and/or your mirrors and checking your blindspots).
As long as everything around you remains perfectly still and doesn’t change, the reversing camera will let you get into a tight parking spot backwards and give you some insights into where the gutter and other objects are that you didn’t have before. Machine learning does great prediction when the inputs are not changing.
But if you have to go a long way in reverse (like reversing down your driveway- mine is 400m long), or things are moving around you (other cars, pet geese, STUPID big black dogs that think running under your wheels is a great idea. He’s bloody fine, stupid mutt): then the reversing camera alone is not all the information you need.
In the same way, if you need to explain relationships- because your space is changing and prediction is not enough- then it’s a very useful thing to expand your machine learning toolbox with statistical/econometric techniques like hypothesis testing, information criteria and solid model building methodologies (as opposed to relying solely on lasso or ridge methods). Likewise, causality and endogeneity matters a lot.
So, in summary machine learning and reversing cameras are awesome, but aren’t the whole picture in many cases. Make your decision about what works best in your situation: don’t just default to what you’re used to.
(Also, I’m not convinced this metaphor extends in the forwards direction. Data analysis? You only reverse, maybe 5% of the time you’re driving. But you’re driving forward the rest of the time: data analysis is 95% of my workflow. Yours?)
Decision vs default is something I’ve been thinking a lot about lately in data science practise. It’s easy to stick with what we know and do well, rather than push our own boundaries. That’s pretty typical for life in general, but in data science it comes at the expense of defaulting to our norms instead of making a decision. That leads to less-than-optimal outcomes.
One way this can manifest is to default to the models and methods we know best. If you’re a machine learning aficionado, then you tend to use machine learning tools to solve your problems. Likewise, if you’re an econometrician by training, you may default to explicit model build and testing regimes. Playing to our strengths isn’t a bad thing.
When it comes to model construction, both methods have their good points. But the best outcome is when you make the decision to use one or the other in the knowledge that you have options, not because you defaulted to the familiar.
Explicit model build and testing is a useful methodology if explanation of your model matters. If your stakeholder needs to know why, not just what. These models are built with a priori assumptions about causation, relationships and functional forms. They require a reasonable level of domain knowledge and the model may be built out of many iterations of testing and experimenting. Typically, I use the Campos (2006) general to specific method: but not after extensive data analysis that informs my views on interactions, polynomial and other transformative inputs and so on. In this regime, predictive power comes from a combination of domain knowledge, statistical methodologies and a canny understanding of your data.
Machine learning methodologies on the other hand are useful if you need lots of models, in real time and you want them more for prediction, not explanation. Machine learning methodologies that use techniques like Lasso or Ridge regression let the algorithm guide feature selection to a greater degree than in the explicit model build methods. Domain knowledge still matters: interactions, the decisions regarding polynomial inputs and so on still have to be explicitly constructed in many cases. Causation may not be obvious.
Neither machine learning or statistical modelling is better in all scenarios. Either may perform substantially better in some, depending on what your performance metric is. But make your decision knowing you have options. Don’t default and abdicate to an algorithm.
I’ve spoken about interpreting models before. I think that this is the most important part of our work, communicating results. However, it’s one that’s often overlooked when discussing the how-to of data science. That’s why marginal effects and elasticities are better for this purpose than coefficients alone.
Model build, selection and testing is complex and nuanced. Communicating the model is sometimes harder, because a lot of the time your audience has no technical background whatsoever. Your stakeholders can’t go up the chain with, “We’ve got a model. And it must be a good model because we don’t understand any of it.”
Our stakeholders also have a limited attention span so the explanation process is two fold: explain the model and do it fast.
For these reasons, I usually interpret models for my stakeholders with marginal effects and elasticities, not coefficients or log-odds. Coefficient interpretation is very different for regressions depending on functional form and if you have interactions or polynomials built into your model, then the coefficient is only part of the story. If you have a more complex model like a tobit, conditional logit or other option, then interpretation of coefficients is different for each one.
I don’t know about your stakeholders and reporting chains: mine can’t handle that level of complexity.
Marginal effects and elasticities are also different for each of these models but they are by and large interpreted in the same way. I can explain the concept of a marginal effect once and move on. I don’t even call it a “marginal effect”: I say “if we increase this input by a single unit, I expect [insert thing here]” and move on.
Marginal effects and elasticities are often variable over the range of your sample: they may be different at the mean than at the minimum or maximum, for example. If you have interactions and polynomials, they will also depend on covarying inputs. Some people see this as added layers of complexity.
In the age of data visualisation, I see it as an opportunity to chart these relationships and visualise how your model works for your stakeholders.
We all know they like charts!
I’m a huge believer in the usefulness of learning by doing. That makes me a huge believer in Shiny, which allows me to create and deploy simple apps that allow students to do just that.
This latest app is a simple one that allows you to manipulate either the mean or the variance of a normal distribution and see how that changes the shape of the distribution.
If you want to try out making Shiny apps, but need a place to start, check out Oliver Keyes’ excellent start up guide.
This post is a curated list of my favourite tutorials and guides because “that one where Hadley Wickham was talking about cupcakes” isn’t the most effective search term. You can find my list of cheat sheets here. There are a lot of great resources on data science (I’ve included my top picks), so I don’t intend to reinvent the wheel here. This is just a list of my favourites all laid out so I can find them again or point other people in their direction when it comes up in conversation. I’ve also added a number of the “how to” type posts I’ve written on this blog as I often answer an enquiry in that format.
- Not so standard deviations podcast: very funny, very accessible.
- Huge wealth of information on data science at the becoming a data scientist website here.
- Where to start with data mining and data science.
- What I’d say to a new data scientist: read Jane Austen.
- Yes you can: learn data science
Tutorials and videos: General
- Hadley Wickham making everyone a better programmer with cupcakes. (This is a man who knows his audience.)
- Sharing data amongst nerds: how to guide by Jeff Leek (not a puppet).
Puppets teach data science too
- Render a 3D object in R. I have no idea where I would ever use this information in my practice, but it’s presented BY A PUPPET. Great fun.
- DIY your data science. Another offering from the puppet circle on the data science venn diagram.
- Censored, truncated, categorical data: the differences.
- Why you should care about autocorrelation, a.k.a. the horrible things it can do to your modelling.
- Not sure what a confound is? Not your problem? You’ll get it after reading this.
- Elasticity and marginal effects: super simple explanation.
- What is a probability distribution?
- Probability cheat sheet: really useful if you’re new to the concepts of thinking probabilistically.
- Conditional probability: visual guide.
- Ten simple rules for effective statistical practice from the ASA: very useful reminder.
- Explaining histograms. Fantastic .gif here, even better find the code here.
- Visualise correlations: weak to strong.
- Statistical significance as explained by the Economist.
- Starting a data analysis: questions to ask the first time.
- More questions to ask: data analysis.
- Data analysis: enough with the questions already.
- P-values: very thorough guide. Don’t throw that baby out with the bathwater. I’m a committed worshipper at the altar of Neumann-Pearson: but know what you’re doing with them.
- And its opposite: the p-hacker app. Have a good laugh.
- Correlation vs causation. It matters.
- New to reading data analysis? Check out this chart.
- Guide to modern statistical workflow. Really great organisation of background material.
- Tidy data, tidy models. Honestly, if there was one thing that had been around 10 years ago, I wish this was it. The amount of time and accuracy to be saved using this method is phenomenal.
- Extracting data from the web. You found the data, now what to do? Look here.
- What on God’s green earth are eigen values and vectors? Not instruments of torture. Look here.
- Shiny app for estimating principal components. The author is on my list of “favourite nerds” just for this.
- Law of large numbers vs central limit theorem: in gif form.
- Law of large numbers.
- Visualise the central limit theorem.
- Understanding asymptotics in practice: at what sample size do correlations stabilise?
- The central limit theorem explained.
- Bayesian statistics for beginners. Includes a great discussion of frequentism vs bayesianism and no one got hurt.
- Visualising Bayesian updating– this is a fantastic way of making sense of the concept.
- Cross validation methodology review.
- Variable selection in big data.
- An introduction in 15 hours of lectures.
- Random Forests in R.
- Five tips for coding for data visualisation: particularly useful I thought.
- Geographical data visualisation: comparisons.
- Walk through for tree mapping.
- Revisualisation: a really interesting and thoughtful discussion.
- Voronoi layers for maps.
- Helpful checklist for visualisation– remember it’s a start, not a recipe book.
Natural Language Processing
- Extending the word cloud.
- Social network analysis how to guide.
- Social networks in the golden age of Latin literature.
- Creating a text document matrix, retrieving text from twitter, great tutorial here.
- This post has a number of resources I found useful when learning about simple NLP and word clouds.
- Term frequency and TF-IDF in R.
- Dimensionality reduction in text data. A very clear step-by-step.
I’ll continue to update this list as I find things I think are useful or interesting.
Edit: actually, “that one where Hadley Wickham was talking about cupcakes” is surprisingly accurate as a search term.
One of the critical parts of building a great model is using your understanding of the problem and context. Choosing an appropriate model type and deciding on appropriate features/variables to explore based on this information is critical.
The two key concepts of elasticity and marginal effects are fundamental to an economic understanding of model building. This is something that can be overlooked for practitioners not coming from that background. Neither concept is difficult or particularly obtuse.
This infographic came about because I had a group of talented economics students at the masters’ level who had no econometric background, by and large. In a crowded course, I don’t have much time to expand on my favourite things. This was my take on explaining the concepts quickly and simply.