Mapping analytics objects

A lot of important work has been done around data science workflows, most notably by Jenny Bryan. If you’re new to thinking about workflows, start with the incredible STAT545 resources and Happy Git and GitHub for the useR. Jenny’s work got me thinking about my broader workflow.

As a consultant, I work with a ream of developing documents, datasets, requests, outputs and analyses – a collection of analytical ephemera I refer to as analytics objects. When looking at a big project, I’ve found it helpful to start by mapping out how these objects interact, where they come from and how they work together.

Here’s a general concept map: individual projects vary a lot, but it’s a starting point.

A concept map with analytics objects.

Client request objects

My workflow tends to start with client requests and communications – everything from the initial “are you available, we have an idea” email to briefings, notes I’ve taken during meetings, documents I’ve been given.

At the start of the project this can be a lot of documents and it’s not always easy to know where they should sit or how they should be managed.

A sensible solution tends to develop over time, but this is the stage where it’s easy to lose or forget important things if everything stays in your inbox. One thing I often do at the start of a project is basic document curation in a simple Excel sheet, so I know what I’ve got, where it came from and what’s in it.

I don’t usually bother curating every email or set of meeting notes, but anything that looks like it may be important or could be forgotten about goes in the list.

a picture of a spreadsheet

Data objects

The next thing that happens is people give me data, I go and find data or some third party sends data my way.

There’s a lot of data flying about – sometimes it’s different versions of the same thing. Sometimes it’s supposed to be the same thing and it’s not.

It often comes attached with metadata (what’s in it, where did it come from, who collected it, why) and documents that support that (survey instruments, sampling instructions etc.).

If I could go back and tell my early-career self one thing, it would be this: every time someone gives you data, don’t rely on their documentation – make your own.

It may be short, it may be brief, it may simply contain references to someone else’s documentation. But take the time to go through it and make sure you know what you have and what you don’t.

For a more detailed discussion of how I handle this in a low-tech environment or team, see here. Version control systems and Rmarkdown are my strong preference these days, if you’re working with a team that has the capacity to manage them: Rmarkdown is brilliant for building data dictionaries, metadata collections and other provenance information. But even if you’re not, and you need to rely on Excel files for notes, don’t skip this step.
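
A minimal sketch of what that can look like, assuming a hypothetical survey_data data frame standing in for whatever you were given: a short chunk inside an Rmarkdown document can build a basic data dictionary straight from the data itself.

    # A hypothetical data frame standing in for whatever the client sent
    survey_data <- data.frame(id = 1:3, score = c(1.2, NA, 3.4), region = c("a", "b", "a"))

    # One row per variable: what it is, what type it is, how much is missing
    data_dictionary <- data.frame(
      variable  = names(survey_data),
      type      = vapply(survey_data, function(x) class(x)[1], character(1)),
      n_missing = vapply(survey_data, function(x) sum(is.na(x)), integer(1)),
      notes     = ""   # fill in provenance by hand: source, collection date, caveats
    )

    knitr::kable(data_dictionary)   # renders as a table when the document is knitted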

Next come the analysis and communications objects, which you’re probably familiar with.

Analysis and communications objects

(Warning: shameless R plug here)

The great thing about R is that it maps most of my analysis and communications objects for me. Using an R project as the basis for analysis means that the provenance of all transformed data, analyses and visualisations is baked in. Version control with GitHub means I’m not messing around with 17 Excel files all called some variation of final_analysis.xlsx.

Using Rmarkdown and Shiny for as much communication with the client as possible means that I’ve directly linked my reporting, client-bound visualisations and returned data to my analysis objects.

That said, R can’t manage everything (but they’re working on it). Sometimes you need functionality R can’t provide, and R can’t tell you where your data came from if you don’t tell it first. R can’t tell you if you’re scoping a project sensibly.

Collaboration around an Rmarkdown document is difficult when most of your clients are not R users at all. One workaround for me has been to:

  • Export the Rmarkdown document as a Word document (a minimal sketch of this step is below)
  • Have non-technical collaborators make changes and updates via tracked changes
  • Depending on the stage of the project, input those changes back into R by hand or go forward with the Word document.
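
A minimal sketch of the export step: the YAML header of the Rmarkdown document just needs a Word output format (the title and template name here are hypothetical).

    ---
    title: "Findings for the client"
    output:
      word_document:
        reference_docx: client_template.docx   # optional: a Word file carrying the client's styles
    ---

Knitting the document (the Knit button in RStudio, or rmarkdown::render("report.Rmd") at the console) then produces a .docx the client can mark up with tracked changes; the file names here are hypothetical.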

It’s not a perfect system by any means, but it’s the best I’ve got right now. (If you’ve got something better, I’d love to hear about it.)

Objects inform other objects

In an ongoing engagement, your communications objects inform the client’s next requests, and so on. Not all of these objects are in use at any given time, but as they get updated, or if projects run long term, important things can get lost or confused. Thinking about how all these objects work together helped my workflow tremendously.

The lightbulb moment for me was that I started thinking about all my analytics objects as strategically as Jenny Bryan proposes we think about our statistics workflow. When I do that, the project is better organised and managed from the start.

A consultant’s workflow

I’ve been thinking a lot about workflow lately and how it differs from project to project. There are a few common stages I move through with each project, however. I wanted to talk a bit more about how failure fits into that workflow. As I’ve mentioned before, I quite like failure: it’s a useful tool for a data scientist, and in my view it has an important place in a data scientist’s workflow.

Here’s my basic workflow. Note the strong similarity in parts to Hadley Wickham’s data science workflow, which I think is an excellent discussion of the process. In this case, however, I wanted to talk more about an interactive workflow with a client and how failure fits into it.

a flow chart describing text below

An interactive workflow

As a consultant, a lot of what I do is interactive with the client. This creates opportunities for better analysis. It also creates opportunities for failure. Let me be clear: some failures are not acceptable and are not in any way beneficial. Those are the failures that happen after ‘analysis complete’. All the failures that happen before that are an opportunity to improve, grow and cement a working relationship with a client. (Some, however, are hideously embarrassing; you want to avoid those in general.)

This workflow is specific to me: your mileage will almost certainly vary, and on how much it varies I have no opinion, I’m afraid. As usual, you should take what’s useful to you and jettison the rest.

My workflow starts with a client request. This request is often nebulous, vague and unformed. If that’s the case, then there’s a lot of work around getting the client’s needs and wants into a shape that will be (a) beneficial to the client and (b) achievable. That’s a whole other workflow to discuss on another day.

This is the stage at which I recommend documenting client requests in the form of some kind of work order, so everyone’s on the same page. Some clients have better clarity around what they require than others. Having a document you can refer back to at handover saves a lot of time and difficulty when the client is working outside their domain knowledge. It also helps a lot at final-stage validation with the client: it’s easy to check off that you did what you set out to do.

Once the client request is in workable shape, it’s time to identify and select data sources. These may be client-provided, external or both. Pro tip: document everything. Every Excel worksheet, .csv, database – everything. Where did it come from, who gave it to you, how and when? I talk about how I do part of that here.

Next I validate the data: does it make sense, is it what I expected, what’s in there? Once that’s done, I need to check what I’ve found against the client’s expectations. Here’s where a failure is good: you want to pick up EARLY if the data is a load of crap. You can then manage the client’s expectations around outcomes – and lay the groundwork for future projects.

If it’s a failure – back to data sourcing and validation. If it’s a pass – on to cleaning and transformation, another sub-workflow in itself.

Analyse, Model, Visualise

This part of my workflow is very close to the iterative model proposed by Hadley Wickham that I linked to above. It’s fundamentally iterative: try, catch problems and insights, repeat. I also like to note the difference between visualisation for finding insight and visualisation for communicating insight. These can be the same, but they’re often different.

Sometimes I find an insight in the statistics and use the visualisation to communicate it. Sometimes I find the insight in the visualisation, then validate it with statistics and communicate it to the client with a chart. Pro tip: the more complex an idea, the easier it is to present it initially with a chart. Don’t diss the humble bar chart: it’s sometimes 90% of the value I add as a consultant, fancy multi-equation models notwithstanding.

This process is full of failure. Code that breaks, models that don’t work, statistics that are not useful. Insights that seem amazing at first, but aren’t valid. Often the client likes regular updates at this point and that’s a reasonable accommodation. However! Be wary about communicating your latest excitement without internal validation. It can set up expectations for your client that aren’t in line with your final findings.

You know you’re ready to move out of this cycle when you run out of failures or time or both.

Communicate and validate

This penultimate stage often takes the longest: communication is hard. Writing some code and throwing a bunch of data into a model is relatively easy. It’s also the most important stage – it’s vital that we communicate in such a way that our client or domain experts can validate what we’re finding. Avoid at all costs the temptation to lapse into tech-speak. The client must be able to engage with what you’re saying.

If that all checks out, then great – analysis complete. If it doesn’t, we’re bounced all the way back to data validation. It’s a big failure – but that’s OK. It’s far more palatable than the failures that come after ‘analysis complete’.

On moving the box, not the whiskers

A lot of institutions and data scientists are keen to take the best of the best and make them better. They are working with the elite: elite in intellectual ability, in the socio-economic lottery that takes intelligence and operationalises it, and in all the other bits and pieces that go into that.

These institutions and data scientists are working with the whiskers of the population box plot. They’re taking the people on the upper edge of the distribution and they’re keen to push them further out: to achieve more, do more, create more. Bravo! This is important and should continue.

Box plot

However, there’s another group of people who I think need to learn data science skills in general and how to code in particular. Australia has no national workforce plan, but we know and acknowledge that data is at the heart of our economy going forward.

In order to make the most of our future, we need a large number of people in the box to learn the skills that will give them access to a digital, data-driven economy. These people are not elites. They often do not believe they have a strong mathematics skill set and they don’t have PhDs. But we need them.

Data science in general and coding in particular are useful, important skill sets. There will always be space and need for elite data scientists, trained by elite institutions and mentors. But we also need to start thinking about how we’re going to move the box, not just the whiskers.

If you think about it, the productivity gains from moving a small proportion of the box upwards are enormous compared to moving just the whiskers.

R for Excel users

Moving over to R (or any other programming language) from Excel can feel very daunting. One of the big stumbling blocks, in my view, is building a mental model of how data is stored in structures in R. You can view your data structures in R, but unlike Excel, where everything is right in front of your face, it’s not always intuitive to someone just starting out.

There’s lots of great information out there on the hows, whys and wherefores. Here’s a basic rundown of some of the common ways we structure data in R and how they compare to what you’re already familiar with: Excel.

Homogeneous data structures


basic data structures infographic


Homogeneous in this case just means that all the ‘bits’ inside these structures need to be of the same type. There are many types of data in R, but the basic ones you need to know when you’re just starting out are:

  • Numbers. These come in two varieties:
    • Doubles – where you want and use decimal points, for example 1.23 and 4.56.
    • Integers – where you don’t, for example 1, 2, 3.
  • Strings. This is basically just text data, made up of characters – for example, dog, cat, bird. (R calls this type ‘character’.)
  • Booleans. These take two values: TRUE and FALSE. (R calls this type ‘logical’.) There’s a small sketch of all of these after this list.
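
Here’s a minimal sketch of those types in action (the object names are just illustrations):

    my_double  <- 4.56          # a double: has a decimal point
    my_integer <- 2L            # an integer: the L suffix asks R for a whole number
    my_string  <- "dog"         # a string (R calls this type "character")
    my_boolean <- TRUE          # a boolean (R calls this type "logical")

    typeof(my_double)   # "double"
    typeof(my_integer)  # "integer"
    typeof(my_string)   # "character"
    typeof(my_boolean)  # "logical"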

The homogeneous data structures are vectors, matrices and arrays. All the contents of these structures have to be of the same type: all numbers OR all text OR all booleans (or another single type) – no mixing.

Let’s go through them one-by-one:

  • Vectors. You can think of a vector like a column in a spreadsheet – there’s an arbitrary number of slots, with data in each one. There’s a catch – the data types all have to be the same: all numbers, all strings, all booleans or some other single type. Base R has a good selection of options for working with this structure.
  • Matrices. Think of this one as the whole spreadsheet – a series of columns in a two-dimensional arrangement. But! This arrangement is homogeneous – all types the same. Base R has you covered here!
  • Arrays. This is the n-dimensional equivalent of the matrix – a bundle of worksheets in the workbook, if you will. Again, it’s homogeneous. The abind package is really useful for manipulating arrays. If you’re just starting out, you probably don’t need this yet! (See the sketch after this list for all three.)
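
A minimal sketch of all three, with arbitrary values:

    # A vector: one "column" of values, all the same type
    scores <- c(1.5, 2.3, 4.1)

    # A matrix: a two-dimensional, homogeneous arrangement (here 2 rows x 3 columns)
    score_matrix <- matrix(c(1.5, 2.3, 4.1, 0.2, 3.3, 5.0), nrow = 2)

    # An array: the n-dimensional version (here 2 x 3 x 2 -- like two worksheets)
    score_array <- array(1:12, dim = c(2, 3, 2))

    # Mixing types silently converts (coerces) everything to the most general type
    c(1.5, "dog", TRUE)   # all become strings: "1.5" "dog" "TRUE"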

The advantage of homogeneous structures is that they can be faster to process – but you have to be using serious amounts of processing power for this to matter a lot. So don’t worry too much about that for now. The disadvantage is that they can be restrictive compared to some other structures we’ll talk about next.


Heterogeneous data structures

Basic data structures heterogeneous


Heterogeneous data structures just mean that the content can be of a variety of types. This is a really useful property and makes these structures very powerful. There are two main forms: lists and data frames.

  • Lists. Like a vector, you can think of a list as a column from a spreadsheet. But unlike a vector, the contents of a list can be of any type – and the types can be mixed.
  • Data frames. A data frame is really a list of columns, all the same length. Generally the content of each column is of a single type (like you’d expect in a spreadsheet), but that’s not necessarily the case. Data frames can have named columns (so can other structures) and you can access data using those names. (See the sketch after this list.)
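
A minimal sketch of both, with hypothetical values:

    # A list: like a column, but the contents can be any mix of types
    client_notes <- list("Project Falcon", 42L, TRUE, c(1.5, 2.3))

    # A data frame: named columns of equal length, each column holding one type
    pets <- data.frame(
      name   = c("dog", "cat", "bird"),
      weight = c(20.1, 4.3, 0.2),
      indoor = c(FALSE, TRUE, TRUE)
    )

    pets$weight      # access a column by name
    pets[2, "name"]  # access by row and column: "cat"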

Data frames can be extended to quite complex structures – they don’t have to be ‘flat’. Because a column can itself be a list, you can have data frames where one or more columns hold a list in each slot; these are called nested data frames.

This and other properties make data frames extremely powerful for manipulating data. There’s a whole series of operations and functions in R dedicated to manipulating data frames. Matrices and vectors can be converted into data frames; one way is the function as.data.frame(my_matrix).
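
A minimal sketch of both ideas, with arbitrary values:

    # Converting a matrix into a data frame
    my_matrix <- matrix(1:6, nrow = 2)
    as.data.frame(my_matrix)

    # A nested data frame: a list-column where each slot holds a whole vector
    nested <- data.frame(group = c("a", "b"))
    nested$values <- list(c(1, 2, 3), c(4, 5))
    nested$values[[1]]   # the vector stored in the first slot: 1 2 3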

The disadvantage of this structure is it can be slower to process – but if you’re at the stage of coding where you’re not sure if this matters to you, it probably doesn’t just now! R is set up to do a bunch of really useful things using data frames. This is the data structure probably most similar to an Excel sheet.

How do you know what structure you’re working with? If you have an object in R and you’re not sure whether it’s a matrix, a vector, a list or a data frame, call str(object). It will tell you what you’re working with.
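
For example, calling it on the hypothetical pets data frame from the sketch above prints something like:

    str(pets)
    #> 'data.frame':   3 obs. of  3 variables:
    #>  $ name  : chr  "dog" "cat" "bird"
    #>  $ weight: num  20.1 4.3 0.2
    #>  $ indoor: logi  FALSE TRUE TRUE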

So that’s a really simple take on some basic data structures in R: quite manageable, because you already understand lots of these concepts from your work in Excel. It’s just a matter of translating them into a different environment.


Acknowledgement: Did you like the whole homogeneous/heterogeneous structure idea? That isn’t my idea – Hadley Wickham in Advanced R talks about it in much more detail.

Algorithmic trading: evolutionary tourney?

The use of artificial intelligence (AI), in the form of machine learning techniques and algorithmic trading, has become widespread in the finance industry in recent years. A preference for quantitative techniques is nothing new in finance: a strong respect for quantitative methods has long been in place. The massive uptake of AI through algorithmic decision-making tools for high-frequency trading, however, is a relatively recent development.

Locally, these developments have contributed to productivity growth in the finance industry at a rate considerably greater than in other Australian industries. They have also resulted in increased profits for investment vehicles. However, they are not without pitfalls and cautions. The inability of machine learning models to predict “fat tail” events or regime change in the data generating process is well known. The dependence of these models on a solution space that is consistent over time is another weakness that should be acknowledged.

Failure to understand critical features of these models and algorithms may lead to substantive losses through naïve application. A subculture within the industry that breeds similarity between models may make this a systemic failure at some point in the future.

While these models offer excellent improvements in wealth generation and productivity in this industry (Qiao & Beling, 2016), they are not a substitute for human decision making, only an assistance to it. Failure to acknowledge this distinction may lead to adverse outcomes on a very wide scale.

Background

Algorithmic trading and the use of decision analytics in financial market transactions are now very widespread, with many investment houses reducing the number of “stock pickers” they employ in favour of purely quantitative, algorithmic approaches to trading (The Economist Intelligence Unit, 2017). The use of these algorithmic decision-making tools is common in markets such as foreign exchange (Chaboud et al., 2014) and commodities, as well as listed stock exchanges.

As part of an overall move towards digitisation, these AI tools have resulted in increasing profits for investment houses in recent years (The Economist Intelligence Unit, 2017). Indeed, partly because of these AI methodologies, productivity growth in the Australian finance industry has outpaced aggregate productivity growth in the economy by a considerable margin over the last fifteen years, as shown in Figure 1 (Australian Bureau of Statistics, 2016).


Productivity chart

Figure 1: Productivity, Finance Industry and 12 Industry Aggregate Productivity. Data: ABS (2016), chart: Author


Algorithmic trading as an evolutionary tourney?


The methodology used in algorithmic trading is a combination of predictive modelling techniques and prescriptive decision frameworks (Qiao & Beling, 2016). The decision frameworks implemented by the automated algorithms consist of optimisation and simulation tools (Qiao & Beling, 2016).

Predictive methods vary, but include forecasting using machine learning methods such as neural nets and evolutionary algorithms, as well as more traditional econometric modelling such as the autoregressive conditional heteroskedasticity (ARCH) family of models. One example is Laird-Smith et al.’s use of regression techniques to estimate a systemic error capital asset pricing model (Laird-Smith et al., 2016). Other techniques include using sentiment analysis on unstructured corpora generated by journalists to estimate the effects of feedback into the market and the interaction between market and sentiment (Yin et al., 2016). Yin et al. (2016) note the strong correlation between news sentiment and market returns, which can be exploited to improve investment outcomes.

The move towards algorithmic trading has been highly successful. Improved outcomes are observable both systemically, in the form of increased liquidity and price efficiency (Chaboud et al., 2014), and at the firm level, with improved profit margins (The Economist Intelligence Unit, 2017).

However, the complexity of the machine learning models underpinning these systems makes them difficult to compare and critically analyse, requiring novel techniques such as those of Parnes (2015). Many of these algorithms are proprietary and closely guarded: it is not possible for outsiders to analyse them, except by observing trading behaviours ex post.

There are also some negative outcomes which bear consideration. Chaboud et al. (2014) note that the behaviour of algorithmic traders is highly correlated. While this correlation does not appear to cause a degradation in market quality on average, these AI algorithms have in the past resulted in unexpected and costly mistakes, such as the automated selling program that initiated the ‘flash crash’ of 2010 (Kirilenko et al., 2017).

In broader terms, it is now well known that machine learning algorithms are biased in unexpected ways when the training data from which they are generated is biased. For instance, an algorithm that decides on offers of bail to people awaiting trial in the U.S.A. was shown to disadvantage bail seekers of colour compared to those who were white, despite the fact that race was not a feature selected for use in the algorithm (Angwin et al., 2016). Algorithmically generated decision making can also lead to unforeseen and unwanted outcomes. Facebook was recently revealed to be selling advertisement space targeted at users categorised under anti-Semitic topics (Angwin et al., 2017). These algorithmically generated targeting categories were unknown to the company until they were discovered by journalists.

The economic implications of employing algorithmic decision-making methods are enormous, not only in private financial circles but in government as well (Dilmegani et al., 2014). It is clear that this form of AI will continue to be employed well beyond the finance industry, whether or not bias and negative unexpected outcomes are eradicable. However, being aware of and actively testing for these possibilities is critical going forward.

In the finance industry, each algorithmic trader has a similar fitness function by which it is judged, fine-tuned and updated: the ability to trade profitably in the market. In this sense, in a market dominated by algorithmic trading, AI is effectively a tourney in which the fittest agents survive and the weakest are abandoned and replaced with better models. However, the resultant similarity after many trading generations, already noted in the popular press (Maley, 2017), may expose the system to crisis during an unpredictable event if a majority of algorithmic traders react similarly in unforeseen ways that have negative consequences.

The high speed at which these high-frequency trading algorithms execute trades could lead to far greater damage in the future than the ‘flash crash’ documented by Kirilenko et al. (2017). While a well-designed algorithmic trading agent will likely include a ‘dead man’s switch’ – a command which halts further trading when a predetermined trigger (such as a crash) occurs – the efficacy of these measures has not been tested systemically during a financial crisis.


Conclusion

Algorithmic trading is a branch of artificial intelligence that has contributed to the generation of wealth and increased productivity in the finance industry. However, algorithmic decision making should be seen as an aid to human decision making, not a replacement for it.

While the gains to be made from the technology are substantial and ongoing, in fields beyond finance as well, the possibility of a lack of variability among algorithmic traders, together with the proprietary, hidden and difficult-to-interpret nature of these models, may lead to adverse consequences if they are applied naively.


References

Angwin, J., Larson, J., Mattu, S. & Kirchner, L., 2016. Machine Bias. [Online]
Available at: www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing
[Accessed 6 September 2017].

Angwin, J., Varner, M. & Tobin, A., 2017. Facebook’s Anti-Semitic Ad Categories Persisted after Promised Hate Speech Crackdown. [Online]
Available at: https://www.scientificamerican.com/article/facebook-rsquo-s-anti-semitic-ad-categories-persisted-after-promised-hate-speech-crackdown/
[Accessed 23 September 2017].

Australian Bureau of Statistics, 2016. 5260.0.55.002 Estimates of Industry Multifactor Productivity, Australia, Canberra: Australian Bureau of Statistics.

Chaboud, A. P., Chiquoine, B., Hjalmarsson, E. & Vega, C., 2014. Rise of the Machines: Algorithmic Trading in the Foreign Exchange Market. The Journal of Finance, Volume 69, pp. 2045-2084.

Dilmegani, C., Korkmaz, B. & Lunqvist, M., 2014. Public Sector Digitization: The Trillion-Dollar Challenge. [Online]
Available at: www.mckinsey.com/business-functions/digital-mckinsey/our-insights/public-sector-digitization-the-trillion-dollar-challenge
[Accessed 6 September 2017].

Kirilenko, A., Kyle, A., Samadi, M. & Tuzun, T., 2017. The Flash Crash: High-Frequency Trading in an Electronic Market. The Journal of Finance, Volume 72, pp. 967-988.

Laird-Smith, J., Meyer, K. & Rajaratnam, K., 2016. A study of total beta specification through symmetric regression: the case of the Johannesburg Stock Exchange. Environment Systems and Decisions, 36(2), pp. 114-125.

Maley, K., 2017. Are algorithms underestimating the risk of a Pyongyang panic? [Online]
Available at: http://bit.ly/2vLlY2r
[Accessed 6 September 2017].

Parnes, D., 2015. Performance Measurements for Machine-Learning Trading Systems. Journal of Trading, 10(4), pp. 5-16.

Qiao, Q. & Beling, P. A., 2016. Decision analytics and machine learning in economic and financial systems. Environment Systems & Decisions, 36(2), pp. 109-113.

The Economist Intelligence Unit, 2017. Unshackled algorithms: machine-learning in finance. The Economist, 27 May.

Yin, S., Mo, K., Liu, A. & Yang, S. Y., 2016. News sentiment to market impact and its feedback effect. Environment Systems and Decisions, 36(2), pp. 158-166.


Expertise vs Awareness for the Data Scientist

We’ve all seen them: articles with headlines like “17 things you MUST know to be a data scientist” and “Great data scientists know these 198 algorithms no one else does.” While the content can be a useful read, the titles are clickbait and imposter syndrome is a common outcome.

You can’t be an expert in every skill on the crazy data science Venn Diagram. It’s not physically possible and if you try you’ll spend all your time attempting to become a “real” data scientist with no time left to be one. In any case, most of those diagrams actually describe an entire industry or a large and diverse team: not the individual.

Data scientists need expertise, but you only need expertise in the areas you’re working with right now. For the rest, you need awareness.

Awareness of the broad church that is data science tells you when you need more knowledge, more skill or more information than you currently have. Awareness of areas outside your expertise means you don’t default to the familiar: you make your decisions based on a broad understanding of what’s possible.

Expertise still matters, but the exact area you’re expert in is less important. Expertise gives you the skills you need to go out and learn new things when and as you need them. Expertise in Python gives you the skills to pick up R or C++ next time you need them. Expertise in econometrics gives you the skills to pick up machine learning. Heck, expertise in languages (human ones, not computer ones) is also a useful skill set for data scientists, in my view.

You need expertise because that gives you the core skills to pick up new things. You need awareness because that will let you know when you need the new things and what they could be. They’re not the same thing: so keep doing what you do well and keep one eye on what other people do well.

Models, Estimators and Algorithms

I think the differences between a model, an estimation method and an algorithm are not always well understood. Identifying differences helps you understand what your choices are in any given situation. Once you know your choices you can make a decision rather than defaulting to the familiar.

An algorithm is a set of predefined steps. Making a cup of coffee can be defined as an algorithm, for example. Algorithms can be nested within each other to create complex and useful pieces of analysis. Gradient descent is an algorithm for finding the minimum of a function computationally. Newton-Raphson does the same job with more work per step; stochastic gradient descent does it faster on large datasets.
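
As a minimal sketch (the function, starting point and step size are arbitrary choices), here is gradient descent finding the minimum of f(x) = (x - 3)^2, whose gradient is 2(x - 3):

    gradient <- function(x) 2 * (x - 3)   # the gradient of f(x) = (x - 3)^2

    x <- 0             # starting guess
    step_size <- 0.1   # how far to move on each step (the "learning rate")

    for (i in 1:100) {
      x <- x - step_size * gradient(x)   # take a small step downhill
    }

    x   # close to 3, the true minimum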

An estimation method is the manner in which your model is estimated (often with an algorithm). To take a simple linear regression model, there are a number of ways you can estimate it (a sketch of the first two follows this list):

  • You can estimate it using the ordinary least squares closed-form solution (it’s just an algebraic identity). After that’s done, there’s a whole suite of econometric techniques to evaluate and improve your model.
  • You can estimate it using maximum likelihood: you calculate the negative log-likelihood and then use a computational algorithm like gradient descent to find the minimum. The econometric techniques are pretty similar to those for the closed-form solution, though there are some differences.
  • You can estimate a regression model using machine learning techniques: divide your sample into training, test and validation sets, then estimate with whichever algorithm you like best. Note that in this case you are still essentially using maximum likelihood under the hood. However, machine learning has a slightly different value system to econometrics, with a different set of cultural beliefs about what makes “a good model”. That means the evaluation techniques used are often different (but with plenty of crossover).
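
A minimal sketch of the first two approaches on simulated data (the data and starting values are arbitrary):

    set.seed(42)
    x <- rnorm(100)
    y <- 2 + 3 * x + rnorm(100)

    # 1. Ordinary least squares via the closed-form solution: beta = (X'X)^(-1) X'y
    X <- cbind(1, x)
    solve(t(X) %*% X) %*% t(X) %*% y

    # 2. Maximum likelihood: minimise the negative log-likelihood numerically
    neg_loglik <- function(par) {
      -sum(dnorm(y, mean = par[1] + par[2] * x, sd = exp(par[3]), log = TRUE))
    }
    optim(c(0, 0, 0), neg_loglik)$par   # intercept, slope and log(sd)

    # Both should closely match R's built-in estimator
    coef(lm(y ~ x))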

The model is the thing you’re estimating using your algorithms and your estimation methods. It’s the set of decisions you make about whether Y has a linear relationship with X, which variables (features) to include and what functional form your model takes.

Machine Learning vs Econometric Modelling: Which One?

Renee from Becoming a Data Scientist asked Twitter which basic issues were hard to understand in data science. It generated a great thread with lots of interesting perspectives you can find here.

My opinion is that the most difficult concept to understand has nothing to do with the technical aspects of data science.

The choice of when to use machine learning, when to use econometric methods and when it matters is rarely discussed. The reason for that is that the answers are neither simple nor finite.

Firstly, the difference between econometrics/statistics and machine learning is mostly cultural. Many econometric models not commonly seen in machine learning (tobit and conditional logit are two that come to mind) could easily be estimated using those techniques. Likewise, machine learning mainstays like clustering or decision trees could benefit from an econometric/statistical approach to model building and evaluation. The main differences between the two groups are different values about what makes a model “good” and slightly different (and very complementary) skill sets.

Secondly, I think the differences between a model, an estimation method and an algorithm are not always well understood. Identifying differences helps you understand what your choices are in any given situation. Once you know your choices you can make a decision rather than defaulting to the familiar. See here for details.


So how do I make decisions about algorithms, estimators and models?

Like data analysis (see here, here and here), I think of modelling as an interrogation of my data, my problem and my brief. If you’re new to modelling, here are some thoughts to get you started.

Why am I here? Who is it for?

It’s a strange one to start off with, but what’s your purpose for sitting down with this data? Where will your work end? Who is it for? All these questions matter.

If you are developing a model that customises a website for a user, then prediction may matter more than explanation. If you need to take your model up to the C-suite then explanation may be paramount.

What’s the life expectancy of your model? This is another question about purpose: are you trying to uncover complex and nuanced relationships that will be consistent in a dynamic and changing space? Or are you just trying to get the right document in front of the right user in a corpus that is static and finite?

Here’s the one question I ask myself for every model: what do I think the causal relationships are here?

What do I need it to do?

The key outcome you need from your model will probably have the most weight on your decisions.

For example, if you need to decide which content to place in front of a user with a certain search query, that may not be a problem you can efficiently solve with classic econometric techniques: the machine learning toolkit may be the best and only viable choice.

On the other hand, if you are trying to decide what the determinants of reading skill among young children in Papua New Guinea are, there may be a number of options on the table. Possibilities might include classic econometric techniques like the tobit model, estimated by maximum likelihood. But what about clustering techniques or decision trees? How do you decide between them?

Next question.

How long do I have?

In this case there are two ways of thinking about this problem: how long does my model have to estimate? How long do I have to develop it?

Development

If you have a reasonable length of time, then considering the full suite of statistical solutions and an open-ended analysis will mean a tight, efficient and nuanced model in deployment. If you have until the end of the day, then simpler options may be the only sensible choice. That applies whether you consider yourself to be doing machine learning OR statistics.

Econometrics and machine learning have two different value sets about what makes a good model, but it’s important to remember that this isn’t a case where you have to pick a team and stick with it. Each of those value sets developed out of a response to solving different problems with a different skill set. There’s plenty of crossover and plenty to learn on each side.

If you have the time, then a thorough interrogation of your data is never a bad idea. Statistics has a lot to offer there. Even if your final product is classic machine learning, statistics/econometrics will help you develop a better model.

This is also a situation where the decision to use techniques like lasso and ridge regression may come into play. If your development time is lacking, then lasso and/or ridge regularisation may be a reasonable response to very wide data (i.e. data with a lot of variables); a minimal sketch is below. However, don’t make the mistake of believing that defaulting to these techniques is always the best or most reasonable option. Utilising a general-to-specific methodology is something to consider if you have the time available. The two approaches were developed for different situations: one size does not fit all.
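
A minimal sketch, using simulated wide data and the glmnet package (the dimensions are arbitrary):

    library(glmnet)

    set.seed(1)
    n <- 100; p <- 200                      # wide data: more variables than observations
    x <- matrix(rnorm(n * p), nrow = n)
    y <- x[, 1] - 2 * x[, 2] + rnorm(n)     # only the first two variables matter

    lasso <- cv.glmnet(x, y, alpha = 1)     # alpha = 1 is the lasso penalty
    ridge <- cv.glmnet(x, y, alpha = 0)     # alpha = 0 is the ridge penalty

    coef(lasso, s = "lambda.min")[1:5, ]    # lasso shrinks most coefficients to exactly zero
    coef(ridge, s = "lambda.min")[1:5, ]    # ridge shrinks them towards (but not to) zero

The alpha argument switches between the two penalties, and cross-validation (cv.glmnet) picks the amount of shrinkage.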

If you are on a tight deadline (and that does happen, regularly) then be strategic: don’t default to the familiar, make your decision about what is going to offer most value for your project.

Deployment

Back to our website example: if your model has 15 microseconds to evaluate every time a new user hits the website, then run time becomes the critical component. In a big data context, machine learning models with highly efficient algorithms may be the best option.

If you have a few minutes (or more) then your options are much wider: you can consider whether classic models like multinomial or conditional logit may offer a better outcome for your particular needs than, say, machine learning models like decision trees. Marginal effects and elasticities can be used in both machine learning and econometric contexts. They may offer you two things: a strong way to explain what’s going on to end-users and a structured way to approach your problem.

It’s not the case that machine learning = fast, econometrics = slow. It’s very dependent on the models, the resultant likelihoods/optimisation requirements and so on. If you’ve read this far, you’ve also probably seen that the space between the two fields is narrow and blurring rapidly.

This is where domain knowledge, solid analysis and testing in the development stage will inform your choices regarding your model for deployment. Detailed econometric models may be too slow for deployment in some contexts, but experimenting with them at the development stage can inform a final, streamlined deployment model.

Is your model static – do you present one set of results, once? Or is it dynamic – does the model generate results multiple times over its lifecycle? These are also decisions you need to consider.

What are the resources I have?

How much data do you have? Do you need all of it? Do you want all of it? How much computing power have you got under your belt? These questions will help you decide what methodologies you need to estimate your model.

I once did a contract where the highly classified data could not be removed from the company’s computers. That’s reasonable! What wasn’t reasonable was the fact that the computer they gave me couldn’t run email and R at the same time. It made the choices we had available rather limited, let me tell you. But that was the constraint we were working under: confidentiality mattered most.

It may not be possible to use the simple closed-form ordinary least squares solution for regression if your data is big, wide and streaming continuously. You may not have the computing power you need to estimate highly structured and nuanced econometric models in the time available. In those cases, the models developed for these situations in machine learning are clearly the superior choice (because they come up with answers).

On the other hand, assuming that machine learning is the solution to everything is limiting and naive: you may be missing an opportunity to generate robust insights if you don’t look beyond what’s common in machine learning.

How big is too big for classic econometrics? Like all of these questions, the answer is: it depends. My best advice here is: during your analysis stage, try it and see.

Now go forth and model stuff

This is just a brief, very general rundown of how I think about modelling and how I make my decisions between machine learning and econometrics. One thing I want to make abundantly clear, however, is that this is not a binary choice.

You’re not doing machine learning OR econometrics: you’re modelling.

That means being aware of your options and aware that the differences between them can be extremely subtle (or even non-existent at times). There are times when those differences won’t matter for your purpose, and others where they will.

What are you modelling and how are you doing it? It’d be great to get one non-spam comment this week.

Machine Learning is Basically the Reversing Camera on Your Car

I’ve been spending a bit of time on machine learning lately. But when it comes to classification or regression, it’s basically the reversing camera on your car.

Let me elaborate: machine learning, like a reversing camera, is awesome. Both things let you do stuff you already could do, but faster and more often. Both give you insights into the world around you that you may not have had without them. However, both can give a narrower view of the world than some other techniques (in this case, expanded statistical/econometric methodologies and/or your mirrors and checking your blind spots).

As long as everything around you remains perfectly still and doesn’t change, the reversing camera will let you get into a tight parking spot backwards and give you some insights into where the gutter and other objects are that you didn’t have before. Machine learning does great prediction when the inputs are not changing.

But if you have to go a long way in reverse (like reversing down your driveway – mine is 400m long), or things are moving around you (other cars, pet geese, STUPID big black dogs that think running under your wheels is a great idea – he’s bloody fine, the stupid mutt), then the reversing camera alone is not all the information you need.

In the same way, if you need to explain relationships – because your space is changing and prediction is not enough – then it’s a very useful thing to expand your machine learning toolbox with statistical/econometric techniques like hypothesis testing, information criteria and solid model-building methodologies (as opposed to relying solely on lasso or ridge methods). Likewise, causality and endogeneity matter a lot.

So, in summary: machine learning and reversing cameras are awesome, but they aren’t the whole picture in many cases. Make your decision about what works best in your situation: don’t just default to what you’re used to.

(Also, I’m not convinced this metaphor extends in the forwards direction. Data analysis? You only reverse, maybe 5% of the time you’re driving. But you’re driving forward the rest of the time: data analysis is 95% of my workflow. Yours?)