Algorithmic trading: evolutionary tourney?

The use of artificial intelligence (AI), in the form of machine learning techniques and algorithmic trading, has become widespread in the finance industry in recent years. The preference for quantitative techniques is not new in finance: a strong respect for quantitative methods has long been in place. However, the massive recent uptake of AI through algorithmic decision making tools for high frequency trading is a relatively new development.

Locally, these developments have contributed to industry productivity growth at a rate considerably greater than that of other industries in Australia. They have also increased profits for investment vehicles. However, they are not without pitfalls. The inability of machine learning models to predict "fat tail" events or regime change in the data generating process is well known. The dependence of these models on a solution space that stays consistent over time is another weakness that should be acknowledged.

Failure to understand critical features of these models and algorithms may lead to substantial losses through naïve application. A subculture within the industry that breeds similarity between models may eventually make such failure systemic.

While these models offer excellent improvements in wealth generation and productivity in this industry (Qiao & Beling, 2016), they are not a substitute for human decision making, only an aid to it. Failure to acknowledge this distinction may lead to adverse outcomes on a very wide scale.

Background

Algorithmic trading and the use of decision analytics in financial market transactions are now very widespread, with many investment houses reducing the number of "stock pickers" they employ in favour of purely quantitative algorithmic approaches to trading (The Economist Intelligence Unit, 2017). These algorithmic decision making tools are common in markets such as foreign exchange (Chaboud, et al., 2014) and commodities, as well as on listed stock exchanges.

As part of an overall move towards digitisation, these AI tools have increased profits for investment houses in recent years (The Economist Intelligence Unit, 2017). Indeed, due in part to these AI methodologies, productivity growth in the Australian finance industry has outpaced aggregate productivity growth in the economy by a considerable margin over the last fifteen years, as shown in Figure 1 (Australian Bureau of Statistics, 2016).


Figure 1: Productivity, Finance Industry and 12 Industry Aggregate Productivity. Data: ABS (2016), chart: Author


Algorithmic trading as an evolutionary tourney?


The methodology used to employ algorithmic trading is a combination of predictive modelling techniques and prescriptive decision frameworks (Qiao & Beling, 2016). The decision tools implemented by the automated algorithms consist of optimisation and simulation tools (Qiao & Beling, 2016).

Predictive methods vary, but include forecasting with machine learning methods such as neural nets and evolutionary algorithms, as well as more traditional econometric modelling such as the AutoRegressive Conditional Heteroskedastic (ARCH) framework. One example is Laird-Smith et al.'s use of symmetric regression techniques in capital asset pricing (Laird-Smith, et al., 2016). Other techniques include sentiment analysis on unstructured corpora generated by journalists to estimate feedback effects and the interaction between market and sentiment (Yin, et al., 2016). Yin et al. (2016) note a strong correlation between news sentiment and market returns which can be exploited to improve investment outcomes.

The move towards algorithmic trading has been highly successful. Improved outcomes are observable both systemically, in the form of increased liquidity and price efficiency (Chaboud, et al., 2014), and at the firm level, with improved profit margins (The Economist Intelligence Unit, 2017).

However, the complexity of the machine learning models underpinning these systems makes them difficult to compare and critically analyse, requiring novel techniques such as those of Parnes (2015). Many of these algorithms are proprietary and closely guarded: it is not possible for outsiders to analyse them, except by observing trading behaviour ex post.

There are also negative outcomes which bear consideration. Chaboud et al. (2014) note that the behaviour of algorithmic traders is highly correlated. While this correlation does not appear to degrade market quality on average, these AI algorithms have in the past produced unexpected and costly mistakes, such as the automated selling program that initiated the 'flash crash' of 2010 (Kirilenko, et al., 2017).

In broader terms, it is now well known that machine learning algorithms become biased in unexpected ways when the training data from which they are generated is biased. For instance, an algorithm that decides on offers of bail to people awaiting trial in the U.S.A. was shown to disadvantage bail seekers of colour compared to those who were white, despite the fact that race was not a feature used by the algorithm (Angwin, et al., 2016). Algorithmically generated decision making can also lead to unforeseen and unwanted outcomes. Facebook was recently revealed to be selling advertisement space targeted at users categorised under anti-Semitic topics (Angwin, et al., 2017). These algorithmically generated targeting categories were unknown to the company until they were discovered by journalists.

The economic implications of employing algorithmic decision making methods are enormous, not only in private financial circles but in government as well (Dilmegani, et al., 2014). It is clear that this form of AI will continue to be employed well beyond the finance industry, whether or not bias and negative unexpected outcomes are eradicable. However, being aware of these possibilities and actively testing for them is critical going forward.

In the finance industry, each algorithmic trader is judged, fine-tuned and updated against a similar fitness function: the ability to trade profitably in the market. In this sense, a market dominated by algorithmic trading is effectively a tourney in which the fittest agents survive and the weakest are abandoned and replaced with better models. However, the resulting similarity after many trading generations, already noted in the popular press (Maley, 2017), may expose the system to crisis during an unpredictable event if a majority of algorithmic traders react similarly in unforeseen ways that have negative consequences.

The high speed at which these high frequency trading algorithms execute trades could lead to far greater damage in the future than the 'flash crash' documented by Kirilenko et al. (2017). While a well-designed algorithmic trading agent will likely include a 'dead man's switch' (a command that halts further trading in the event of a predetermined trigger, such as a crash), the efficacy of these measures has not been tested systemically during a financial crisis.
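
As a toy illustration only (the threshold, price series and halting rule here are hypothetical, not a description of any real trading system), such a switch can be sketched as a guard inside the trading loop:

```python
def run_trading_loop(prices, crash_threshold=0.10):
    """Toy dead man's switch: halt all further trading if the price
    falls more than crash_threshold below the session peak."""
    peak = prices[0]
    trades = []
    for price in prices:
        peak = max(peak, price)
        if price < peak * (1 - crash_threshold):
            return trades, "halted"  # the switch trips; no more trades
        trades.append(price)         # stand-in for executing a trade
    return trades, "completed"
```

Whether a guard of this kind actually works during a systemic event is exactly the untested question raised above.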


Conclusion

Algorithmic trading is a branch of artificial intelligence that has contributed to wealth generation and increased productivity in the finance industry. Nevertheless, algorithmic decision making should be seen as an aid to human decision making, not a replacement for it.

While the gains to be made from the technology are substantial and ongoing in fields beyond finance, a lack of variability among algorithmic traders, combined with the proprietary, hidden nature of models that are difficult to explain and interpret, may lead to adverse consequences if the technology is applied naively.


References

Angwin, J., Larson, J., Mattu, S. & Kirchner, L., 2016. Machine Bias. [Online]
Available at: www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing
[Accessed 6 September 2017].

Angwin, J., Varner, M. & Tobin, A., 2017. Facebook’s Anti-Semitic Ad Categories Persisted after Promised Hate Speech Crackdown. [Online]
Available at: https://www.scientificamerican.com/article/facebook-rsquo-s-anti-semitic-ad-categories-persisted-after-promised-hate-speech-crackdown/
[Accessed 23 September 2017].

Australian Bureau of Statistics, 2016. 5260.0.55.002 Estimates of Industry Multifactor Productivity, Australia, Canberra: Australian Bureau of Statistics.

Chaboud, A. P., Chiquoine, B., Hjalmarsson, E. & Vega, C., 2014. Rise of the Machines: Algorithmic Trading in the Foreign Exchange Market. The Journal of Finance, Volume 69, pp. 2045-2084.

Dilmegani, C., Korkmaz, B. & Lundqvist, M., 2014. Public Sector Digitization: The Trillion-Dollar Challenge. [Online]
Available at: www.mckinsey.com/business-functions/digital-mckinsey/our-insights/public-sector-digitization-the-trillion-dollar-challenge
[Accessed 6 September 2017].

Kirilenko, A., Kyle, A., Samadi, M. & Tuzun, T., 2017. The Flash Crash: High-Frequency Trading in an Electronic Market. The Journal of Finance, Volume 72, pp. 967-988.

Laird-Smith, J., Meyer, K. & Rajaratnam, K., 2016. A study of total beta specification through symmetric regression: the case of the Johannesburg Stock Exchange. Environment Systems and Decisions, 36(2), pp. 114-125.

Maley, K., 2017. Are algorithms underestimating the risk of a Pyongyang panic?. [Online]
Available at: http://bit.ly/2vLlY2r
[Accessed 6 September 2017].

Parnes, D., 2015. Performance Measurements for Machine-Learning Trading Systems. Journal of Trading, 10(4), pp. 5-16.

Qiao, Q. & Beling, P. A., 2016. Decision analytics and machine learning in economic and financial systems. Environment Systems & Decisions, 36(2), pp. 109-113.

The Economist Intelligence Unit, 2017. Unshackled algorithms: Machine-learning in finance. The Economist, 27 May.

Yin, S., Mo, K., Liu, A. & Yang, S. Y., 2016. News sentiment to market impact and its feedback effect. Environment Systems and Decisions, 36(2), pp. 158-166.


Models, Estimators and Algorithms

I think the differences between a model, an estimation method and an algorithm are not always well understood. Identifying differences helps you understand what your choices are in any given situation. Once you know your choices you can make a decision rather than defaulting to the familiar.

An algorithm is a set of predefined steps. Making a cup of coffee can be defined as an algorithm, for example. Algorithms can be nested within each other to create complex and useful pieces of analysis. Gradient descent is an algorithm for computationally finding the minimum of a function. Newton-Raphson does the same job using second-derivative information, which makes each iteration more expensive; stochastic gradient descent makes each iteration cheaper by using a noisy estimate of the gradient.
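
As a minimal sketch of the idea (the function, step size and iteration count here are my own choices for illustration), gradient descent just repeats "step downhill along the gradient":

```python
def gradient_descent(grad, x0, lr=0.1, steps=100):
    """Minimise a function by repeatedly stepping against its gradient."""
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

# Minimise f(x) = (x - 3)^2, whose gradient is 2 * (x - 3):
# the minimiser is x = 3.
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
```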

An estimation method is the manner in which your model is estimated (often with an algorithm). To take a simple linear regression model, there are a number of ways you can estimate it:

  • You can estimate using the ordinary least squares closed form solution (it’s just an algebraic identity). After that’s done, there’s a whole suite of econometric techniques to evaluate and improve your model.
  • You can estimate it using maximum likelihood: you calculate the negative log-likelihood and then use a computational algorithm like gradient descent to find its minimum. The econometric evaluation techniques are pretty similar to those for the closed form solution, though there are some differences.
  • You can estimate a regression model using machine learning techniques: divide your sample into training, test and validation sets; estimate by whichever algorithm you like best. Note that in this case, this is essentially a utilisation of maximum likelihood. However, machine learning has a slightly different value system to econometrics with a different set of cultural beliefs on what makes “a good model.” That means the evaluation techniques used are often different (but with plenty of crossover).
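
To make the "same model, different estimation method" point concrete, here's a sketch (simulated data; the coefficients, learning rate and iteration count are my own choices) in which the closed-form OLS identity and gradient descent on the least squares loss, which is equivalent to minimising the negative Gaussian log-likelihood, arrive at essentially the same estimates:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
x = rng.normal(size=n)
y = 2.0 + 3.0 * x + rng.normal(size=n)      # true intercept 2, slope 3
X = np.column_stack([np.ones(n), x])        # design matrix with a constant

# Estimation method 1: the closed-form OLS identity (X'X)^(-1) X'y.
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Estimation method 2: gradient descent on the least squares loss.
beta_gd = np.zeros(2)
for _ in range(5000):
    grad = X.T @ (X @ beta_gd - y) / n      # gradient of half the mean squared error
    beta_gd -= 0.1 * grad

# Same model, two estimation methods, (numerically) the same answer.
```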

The model is the thing you’re estimating using your algorithms and your estimation methods. It’s the decisions you make when you decide if Y has a linear relationship with X, or which variables (features) to include and what functional form your model has.

Gradient Descent vs Stochastic Gradient Descent: Some Observations of Behaviour

Anyone who does any machine learning at all has heard of the gradient descent and stochastic gradient descent algorithms. What interests me about machine learning is understanding how some of these algorithms (and as a consequence the parameters they are estimating) behave in different scenarios.

The following are single observations from a random data generating process:

y(i) = x(i) + e(i)

where e is a random variable distributed in one of a variety of ways:

  • e is distributed N(0,1). This is a fairly standard experimental set up and should be considered “best case scenario” for most estimators.
  • e is distributed t(4). This is a data generating process that's very fat tailed (leptokurtic). Fat tails are a fairly common feature in fields like finance. All the moments required for the ordinary least squares estimator to have its usual properties exist, but only just. (If you don't have a stats background, think of this as the edge of the reasonable cases to test this estimator against.)
  • e is distributed as centred and standardised chi squared (2). In this case, the necessary moments exist and we have ensured e has a zero mean. But the noise isn’t just fat-tailed, it’s also skewed. Again, not a great scenario for most estimators, but a useful one to test against.

The independent variable x is intended to be deterministic, but in this case is generated as N(1,1). To get all the details of what I did, the code is here. Put simply, I estimated the ordinary least squares estimators for the intercept and coefficient of the process (so the true intercept here is zero and the true coefficient is 1). I did this using stochastic gradient descent and gradient descent to find the minimum of the negative log-likelihood and generate the estimates.
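
The linked code isn't reproduced here, so the following is my own minimal reconstruction of the set-up (the seed, learning rates and iteration counts are arbitrary choices, and the SGD variant shown updates on one observation at a time):

```python
import numpy as np

rng = np.random.default_rng(42)

def make_data(n, noise):
    # x generated as N(1, 1); true intercept 0, true coefficient 1.
    x = rng.normal(1, 1, n)
    if noise == "normal":
        e = rng.normal(0, 1, n)
    elif noise == "t4":
        e = rng.standard_t(4, n)           # fat tails
    else:
        e = (rng.chisquare(2, n) - 2) / 2  # centred, standardised chi-squared(2)
    return x, x + e

def gd(x, y, lr=0.1, iters=5000):
    # Full-batch gradient descent on the least squares loss.
    a, b = 0.0, 0.0
    for _ in range(iters):
        r = a + b * x - y
        a -= lr * r.mean()
        b -= lr * (r * x).mean()
    return a, b

def sgd(x, y, lr=0.01, epochs=2):
    # Stochastic gradient descent: one observation per update.
    a, b = 0.0, 0.0
    for _ in range(epochs):
        for i in rng.permutation(len(x)):
            r = a + b * x[i] - y[i]
            a -= lr * r
            b -= lr * r * x[i]
    return a, b

for noise in ("normal", "t4", "chi2"):
    x, y = make_data(10_000, noise)
    print(noise, "GD:", gd(x, y), "SGD:", sgd(x, y))
```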

Limits: the following are just single examples of what the algorithms do in each case. It's not a full test of algorithm performance: we would need to repeat each of these thousands of times and then take overall performance measures to really know how these algorithms behave under these conditions. I just thought it would be interesting to see what happens. If I find something worth exploring in detail, I'll develop this into a full simulation experiment.

Here’s what I found:

N is pretty large (10 000). Everything is super (for SGD)!

Realistically, unless dimensionality is huge, closed form ordinary least squares (OLS) estimation methods would do just fine here, I suspect. However, the real point here is to test differences between SGD and GD.

I used a random initialisation for the algorithm and you can see how quickly it got to the true parameter. In situations like this where the algorithm uses only a few iterations, if you are averaging SGD estimators then allowing for a burn in and letting it run beyond the given tolerance level may be very wise. In this case, averaging the parameters over the entire set of iterations would be a worse result than just taking the last estimator.

Under each scenario, the stochastic gradient descent algorithm performed brilliantly, and much faster than the gradient descent equivalent, which struggled. The different error distributions mattered very little here.

For reference, it took around 37 000 iterations for the gradient descent algorithm to reach its end point.

For the other experiments, I’ll just stick to a maximum of 10 000 iterations.

N is largish (N = 1000). Stochastic gradient descent doesn’t do too badly.

Now, to be fair, unless I was trying to estimate something with huge dimensionality (and this hasn’t been tested here anyway), I’d just use standard OLS estimating procedures for estimating this model in real life.

SGD takes marginally longer to get where it's going, but the generating process made no material difference, although our chi-squared example was out slightly (again, this is one example).

N is smallish (N = 100): SGD gets brittle, but it gets somewhere close. GD has no clue.

Again, in real life this is a moment for simpler methods, but the interesting point here is that SGD takes a lot longer and fails to reach the exact true value, especially for the constant term (again, one example in each case, not a thorough investigation).

Here, GD has no clue and it’s taking SGD thousands more iterations to reach a less palatable conclusion. In practical machine learning contexts, it’s highly unlikely this is a sensible estimation method at these sample sizes: just use the standard OLS techniques.

I’m not destructive in general, but sometimes I like to break things.

I failed to achieve SGD armageddon with either weird error distributions (that were within the realms of reasonable for the estimator) or smallish sample sizes. So I decided to give it the doomsday scenario: smallish N and an error distribution that does not fulfil the requirements the simple linear regression model needs to work. I tried a t(1) distribution for the error: this thing doesn't even have a finite mean, let alone a finite variance.

SGD and GD are utterly messed up (and I suspect the closed form solution may not be a whole lot better here either). Algorithmic armageddon, achieved. In practical terms, you can probably safely file this under “never going to happen”.

Machine Learning vs Econometric Modelling: Which One?

Renee from Becoming a Data Scientist asked Twitter which basic issues were hard to understand in data science. It generated a great thread with lots of interesting perspectives you can find here.

My opinion is that the most difficult to understand concept has nothing to do with the technical aspects of data science.

The choice of when to use machine learning, when to use econometric methods and when it matters is rarely discussed. The reason for that is that the answers are neither simple nor finite.

Firstly, the difference between econometrics/statistics and machine learning is mostly cultural. Many econometric models not commonly seen in machine learning (the tobit and conditional logit are two that come to mind) could easily be estimated using machine learning techniques. Likewise, machine learning mainstays like clustering or decision trees could benefit from an econometric/statistical approach to model building and evaluation. The main differences between the two groups are different values about what makes a model "good" and slightly different (and very complementary) skill sets.

Secondly, I think the differences between a model, an estimation method and an algorithm are not always well understood. Identifying differences helps you understand what your choices are in any given situation. Once you know your choices you can make a decision rather than defaulting to the familiar. See here for details.


So how do I make decisions about algorithms, estimators and models?

Like data analysis (see here, here and here), I think of modelling as an interrogation of my data, my problem and my brief. If you're new to modelling, here are some thoughts to get you started.

Why am I here? Who is it for?

It’s a strange one to start off with, but what’s your purpose for sitting down with this data? Where will your work end? Who is it for? All these questions matter.

If you are developing a model that customises a website for a user, then prediction may matter more than explanation. If you need to take your model up to the C-suite then explanation may be paramount.

What’s the life expectancy of your model? This is another question about purpose: are you trying to uncover complex and nuanced relationships that will be consistent in a dynamic and changing space? Or are you just trying to get the right document in front of the right user in a corpus that is static and finite?

Here’s the one question I ask myself for every model: what do I think the causal relationships are here?

What do I need it to do?

The key outcome you need from your model will probably carry the most weight in your decisions.

For example, if you need to decide which content to place in front of a user with a certain search query, that may not be a problem you can efficiently solve with classic econometric techniques: the machine learning toolkit may be the best and only viable choice.

On the other hand, if you are trying to decide what the determinants of reading skill among young children in Papua New Guinea are, there may be a number of options on the table. Possibilities might include classic econometric techniques like the tobit model, estimated by maximum likelihood. But what about clustering techniques or decision trees? How do you decide between them?

Next question.

How long do I have?

In this case there are two ways of thinking about this problem: how long does my model have to produce its estimates? And how long do I have to develop it?

Development

If you have a reasonable length of time, then considering the full suite of statistical solutions and an open-ended analysis will mean a tight, efficient and nuanced model in deployment. If you have until the end of the day, then simpler options may be the only sensible choice. That applies whether you consider yourself to be doing machine learning OR statistics.

Econometrics and machine learning have two different value sets about what makes a good model, but it’s important to remember that this isn’t a case where you have to pick a team and stick with it. Each of those value sets developed out of a response to solving different problems with a different skill set. There’s plenty of crossover and plenty to learn on each side.

If you have the time, then a thorough interrogation of your data is never a bad idea. Statistics has a lot to offer there. Even if your final product is classic machine learning, statistics/econometrics will help you develop a better model.

This is also a situation where the decision to use techniques like lasso and ridge regression may come into play. If your development time is lacking, then lasso and/or ridge regularisation may be a reasonable response to very wide data (e.g. data with a lot of variables). However, don’t make the mistake of believing that defaulting to these techniques is always the best or most reasonable option. Utilising a general-to-specific methodology is something to consider if you have the time available. The two techniques were developed for two different situations, however: one size does not fit all.
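
As a minimal sketch of why regularisation is a response to very wide data (all the numbers here are my own toy choices): with more variables than observations, X'X is singular and the OLS closed form has no unique solution, while the ridge penalty restores an invertible system:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 200                      # wide data: far more variables than observations
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:5] = 1.0                 # only the first five variables actually matter
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# OLS would need (X'X)^(-1), but X'X is rank-deficient here (rank <= 50).
# Ridge regression adds lam * I, which makes the system solvable:
lam = 1.0
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
```

The lasso behaves differently again (it zeroes some coefficients out entirely), which is part of the "two different situations" point above.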

If you are on a tight deadline (and that does happen, regularly) then be strategic: don’t default to the familiar, make your decision about what is going to offer most value for your project.

Deployment

Back to our website example, if your model has 15 microseconds to evaluate every time a new user hits the website, then the critical component of run time becomes paramount. In a big data context, machine learning models with highly efficient algorithms may be the best option.

If you have a few minutes (or more) then your options are much wider: you can consider whether classic models like multinomial or conditional logit may offer a better outcome for your particular needs than, say, machine learning models like decision trees. Marginal effects and elasticities can be used in both machine learning and econometric contexts. They may offer you two things: a strong way to explain what’s going on to end-users and a structured way to approach your problem.

It’s not the case that machine learning = fast, econometrics = slow. It’s very dependent on the models, the resultant likelihoods/optimisation requirements and so on. If you’ve read this far, you’ve also probably seen that the space between the two fields is narrow and blurring rapidly.

This is where domain knowledge, solid analysis and testing in the development stage will inform your choices regarding your model for deployment. Detailed econometric models may be too slow for deployment in some contexts, but experimenting with them at the development stage can inform a final, streamlined deployment model.

Is your model static (you present one set of results, once) or dynamic (it generates results multiple times over its lifecycle)? These are also factors you need to consider.

What are the resources I have?

How much data do you have? Do you need all of it? Do you want all of it? How much computing power have you got under your belt? These questions will help you decide what methodologies you need to estimate your model.

I once did a contract where the highly classified data could not be removed from the company’s computers. That’s reasonable! What wasn’t reasonable was the fact that the computer they gave me couldn’t run email and R at the same time. It made the choices we had available rather limited, let me tell you. But that was the constraint we were working under: confidentiality mattered most.

It may not be possible to use the simple closed form ordinary least squares solution for regression if your data is big, wide and streaming continuously. You may not have the computing power you need to estimate highly structured and nuanced econometric models in the time available. In those cases, the models developed for these situations in machine learning are clearly a very superior choice (because they come up with answers).
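
One sketch of handling data that streams continuously (my own toy illustration, not a prescription): the closed-form regression solution only actually needs the running totals X'X and X'y, which can be accumulated batch by batch without ever holding the full data set in memory:

```python
import numpy as np

class OnlineOLS:
    """Accumulate X'X and X'y batch by batch; the raw data is never stored."""
    def __init__(self, p):
        self.XtX = np.zeros((p, p))
        self.Xty = np.zeros(p)

    def update(self, X_batch, y_batch):
        self.XtX += X_batch.T @ X_batch
        self.Xty += X_batch.T @ y_batch

    def coef(self):
        return np.linalg.solve(self.XtX, self.Xty)

# Stream simulated data through in batches and compare with the
# all-at-once least squares answer on the same data.
rng = np.random.default_rng(3)
model = OnlineOLS(p=3)
X_all, y_all = [], []
for _ in range(10):
    X = rng.normal(size=(100, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=100)
    model.update(X, y)
    X_all.append(X)
    y_all.append(y)

beta_stream = model.coef()
beta_batch = np.linalg.lstsq(np.vstack(X_all), np.concatenate(y_all), rcond=None)[0]
```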

On the other hand, assuming that machine learning is the solution to everything is limiting and naive: you may be missing an opportunity to generate robust insights if you don’t look beyond what’s common in machine learning.

How big is too big for classic econometrics? Like all these questions, the answer is "it depends". My best advice here is: during your analysis stage, try it and see.

Now go forth and model stuff

This is just a brief, very general rundown of how I think about modelling and how I make my decisions between machine learning and econometrics. One thing I want to make abundantly clear, however, is that this is not a binary choice.

You’re not doing machine learning OR econometrics: you’re modelling.

That means being aware of your options and knowing that the differences between them can be extremely subtle (or even non-existent at times). There are times when those differences won't matter for your purpose, and others where they will.

What are you modelling and how are you doing it? It’d be great to get one non-spam comment this week.

Machine Learning is Basically the Reversing Camera on Your Car

I’ve been spending a bit of time on machine learning lately. When it comes to classification or regression, it’s basically the reversing camera on your car.

Let me elaborate: machine learning, like a reversing camera, is awesome. Both things let you do stuff you already could do, but faster and more often. Both give you insights into the world around you that you may not have had without them. However, both can give a more narrow view of the world than some other techniques (in this case, expanded statistical/econometric methodologies and/or your mirrors and checking your blindspots).

As long as everything around you remains perfectly still and doesn’t change, the reversing camera will let you get into a tight parking spot backwards and give you some insights into where the gutter and other objects are that you didn’t have before. Machine learning does great prediction when the inputs are not changing.

But if you have to go a long way in reverse (like reversing down your driveway: mine is 400m long), or things are moving around you (other cars, pet geese, STUPID big black dogs that think running under your wheels is a great idea. He’s bloody fine, stupid mutt), then the reversing camera alone is not all the information you need.

In the same way, if you need to explain relationships, because your space is changing and prediction is not enough, then it’s a very useful thing to expand your machine learning toolbox with statistical/econometric techniques like hypothesis testing, information criteria and solid model building methodologies (as opposed to relying solely on lasso or ridge methods). Likewise, causality and endogeneity matter a lot.

So, in summary: machine learning and reversing cameras are awesome, but they aren’t the whole picture in many cases. Make your decision about what works best in your situation: don’t just default to what you’re used to.

(Also, I’m not convinced this metaphor extends in the forwards direction. Data analysis? You only reverse, maybe 5% of the time you’re driving. But you’re driving forward the rest of the time: data analysis is 95% of my workflow. Yours?)

Decision vs Default: Abdicating to the Algorithm

Decision vs default is something I’ve been thinking a lot about lately in data science practice. It’s easy to stick with what we know and do well rather than push our own boundaries. That’s pretty typical for life in general, but in data science it means defaulting to our norms instead of making a decision. That leads to less-than-optimal outcomes.

One way this can manifest is to default to the models and methods we know best. If you’re a machine learning aficionado, then you tend to use machine learning tools to solve your problems. Likewise, if you’re an econometrician by training, you may default to explicit model build and testing regimes. Playing to our strengths isn’t a bad thing.

When it comes to model construction, both methods have their good points. But the best outcome is when you make the decision to use one or the other in the knowledge that you have options, not because you defaulted to the familiar.

Explicit model build and testing is a useful methodology if explanation of your model matters. If your stakeholder needs to know why, not just what. These models are built with a priori assumptions about causation, relationships and functional forms. They require a reasonable level of domain knowledge, and the model may be built out of many iterations of testing and experimenting. Typically, I use the Campos (2006) general-to-specific method, but only after extensive data analysis that informs my views on interactions, polynomial and other transformed inputs and so on. In this regime, predictive power comes from a combination of domain knowledge, statistical methodologies and a canny understanding of your data.

Machine learning methodologies, on the other hand, are useful if you need lots of models in real time, and you want them for prediction rather than explanation. Machine learning methodologies that use techniques like lasso or ridge regression let the algorithm guide feature selection to a greater degree than the explicit model build methods do. Domain knowledge still matters: interactions, decisions about polynomial inputs and so on still have to be explicitly constructed in many cases. Causation may not be obvious.

Neither machine learning nor statistical modelling is better in all scenarios. Either may perform substantially better in some, depending on what your performance metric is. But make your decision knowing you have options. Don’t default and abdicate to an algorithm.

Using Natural Language Processing for Survey Analysis

Surveys have a specific set of analysis tools for the quantitative part of the data you collect (Stata is my particular poison of choice in this context). However, often the most interesting parts of the survey are the unscripted, “tell us what you really think” comments.

Certainly this has been true in my own experience. I once worked on a survey deployed to teachers in Laos regarding resources for schools and teachers. All our quantitative information came back and was analysed, but one comment (translated for me into English by a brilliant colleague) stood out. It read something to the effect of “this is very nice, but the hole in the floor of the second story is my biggest concern as a teacher”. It’s not something that would ever have been included outright in the survey, but a simple sentence told us a lot about the resources this school had access to.

Careful attention to detailed comments in small surveys is possible. But if you have thousands upon thousands of responses, this is far more difficult. Enter natural language processing.

There are a number of tools which can be useful in this context. This is a short overview of some that I think are particularly useful.

  • Word Clouds. These are easy to prepare and very simple, but can be a powerful way to communicate information. Like all data visualisation, there are good examples and bad ones. This is an example of a very simple word cloud, while this post by Fells Stats illustrates some more sophisticated uses of the tool.

One way to extend the simple “bag of words” concept is to divide your sample into groups and compare clouds. Or create your own specific dictionary of words and concepts you’re interested in, and cloud only those.

Remember that stemming the corpus is critical. For example, “work”, “worked”, “working” and “works” all share a stem and should be treated as one term; otherwise common variants are likely to swamp other themes.

Note that no word cloud should be constructed without removing “stop words” such as the, and, a, I and so on. Stop-word dictionaries vary: they can (and should) be tailored to the problem at hand.

  • Network Analysis. If you have a series of topics you want to visualise relationships for, you could try a network-type analysis similar to this. The concept may be particularly useful if you manually decide topics of interest and then examine relationships between them. In this case, the outcome is very much user-chosen, but it may be useful as a visualisation.
  • Word Frequencies. Alone, simple tables of word frequencies are not always particularly useful. In a corpus of documents about education, noting that “learning” is a common term isn’t particularly informative. However, how do these frequencies change by group? Do teachers speak more about “reading” than principals? Do people in one geographical area or salary bracket have a particular set of high-frequency words compared to another? This is a basic exercise in feature/variable engineering. In this case, the usual data analysis tool kit applies (see here, here and here). Remember you don’t need to stop at high-frequency words: what about high-frequency phrases?
  • TF-IDF (term frequency–inverse document frequency) matrix. This may provide useful information and is the basis of many more complex analyses. TF-IDF down-weights terms that appear across all documents/comments (“the”, “i”, “and” etc.) while up-weighting rare words that may be of interest. See here for an introduction.
  • Clustering. Are the comments clustered across some lower-dimensional space? The k-means algorithm may provide some data-driven guidance there. This would be an example of “unsupervised machine learning”, vis a vis “this is an algorithm everyone has been using for 25 years but we need to call it something cool”. It may not generate anything obvious at first, but who is in those clusters and why are they there?
  • Sentiment analysis. This can be useful applied both to the entire corpus and to subsets. For example, among those who discussed “work life balance” (and derivative terms), is the sentiment positive or negative? Is this consistent across all work/salary brackets? Are truck drivers more upbeat than bus drivers? Again, basic feature/variable engineering applies here. If you’re interested in this area, you could do a lot worse than learning from Julia Silge, who writes interesting and informative tutorials in R on the subject.
  • Latent Dirichlet Allocation (LDA) and more complex topic analyses. Finally, latent Dirichlet allocation or other more complex topic models may be able to generate topics directly from the corpus. I think this would take a great deal of time for a new user and may have limited payoff, particularly if early analysis suggests you already have a clear idea of which topics are worth investigating. It is, however, particularly useful when dealing with enormous corpora. This is a really basic rundown of the concept. This is a little more complex, but has useful information.
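Several of the steps above (stop-word removal, stemming, TF-IDF weighting) can be sketched in a few lines. This is a toy pipeline, not production code: the stop-word list, the crude suffix-stripping “stemmer” and the comments are all invented for illustration, and real work should use a proper stemmer (such as Porter’s) and a curated stop-word dictionary.

```python
import math
from collections import Counter

# Toy stop-word list: real analyses should tailor this to the problem.
STOP_WORDS = {"the", "a", "i", "and", "is", "my", "of", "for", "to", "in"}

def tokens(comment):
    """Lowercase, strip punctuation, drop stop words."""
    words = [w.strip(".,!?\"'").lower() for w in comment.split()]
    return [w for w in words if w and w not in STOP_WORDS]

def stem(word):
    """A crude suffix-stripper so 'teaching'/'teach' count as one term."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def tf_idf(docs):
    """Return one {term: score} dict per document: term frequency
    within the document times log inverse document frequency."""
    bags = [Counter(stem(w) for w in tokens(d)) for d in docs]
    n = len(docs)
    df = Counter(term for bag in bags for term in bag)
    scores = []
    for bag in bags:
        total = sum(bag.values())
        scores.append({t: (c / total) * math.log(n / df[t])
                       for t, c in bag.items()})
    return scores

comments = [
    "The training was useful for my teaching",
    "More teaching resources are needed",
    "The hole in the floor is my biggest concern",
]
scores = tf_idf(comments)
```

Terms shared across documents (“teach” here) end up with lower weights than terms unique to one comment (“hole”), which is exactly the behaviour you want when hunting for distinctive responses.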
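The clustering idea can also be sketched compactly. Below is a minimal k-means (Lloyd’s algorithm) on 2-D points standing in for comment vectors (for example, TF-IDF scores reduced to two dimensions); the data and the choice of k are purely illustrative.

```python
# Minimal k-means: alternate between assigning points to their nearest
# centre and moving each centre to the mean of its assigned points.

def kmeans(points, k, iters=20):
    # Naive initialisation: first k points as centres. Real use should
    # randomise (or use k-means++) to avoid poor starting positions.
    centres = list(points[:k])
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: each point joins its nearest centre.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: (p[0] - centres[i][0]) ** 2
                                        + (p[1] - centres[i][1]) ** 2)
            clusters[nearest].append(p)
        # Update step: move each centre to the mean of its cluster.
        for i, cluster in enumerate(clusters):
            if cluster:
                centres[i] = (sum(p[0] for p in cluster) / len(cluster),
                              sum(p[1] for p in cluster) / len(cluster))
    return centres, clusters

points = [(0.1, 0.2), (0.0, 0.1), (0.2, 0.0),   # one tight group
          (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]   # another, far away
centres, clusters = kmeans(points, k=2)
```

The interesting analytical work then begins: who wrote the comments in each cluster, and why do they group together?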
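At its simplest, sentiment analysis is just lexicon matching: count words a dictionary tags as positive, subtract words tagged as negative. The word lists below are invented for illustration; real analyses use curated lexicons (AFINN and Bing are common choices, and are what the tidytext tooling mentioned above builds on).

```python
# Toy lexicon-based sentiment score: positive count minus negative count.
POSITIVE = {"good", "great", "useful", "helpful", "excellent"}
NEGATIVE = {"bad", "poor", "concern", "broken", "hole"}

def sentiment(comment):
    """Positive minus negative word count; > 0 leans positive."""
    words = [w.strip(".,!?").lower() for w in comment.split()]
    return (sum(w in POSITIVE for w in words)
            - sum(w in NEGATIVE for w in words))

print(sentiment("The training was great and very useful"))   # 2
print(sentiment("The hole in the floor is a real concern"))  # -2
```

Scoring by subgroup (salary bracket, region, role) is then ordinary feature engineering on top of these per-comment scores.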

So that’s a brief rundown of some basic techniques you could try: there are plenty more out there, and this is just the start. Enjoy!

Tutorials and Guides: A curated list

This post is a curated list of my favourite tutorials and guides because “that one where Hadley Wickham was talking about cupcakes” isn’t the most effective search term. You can find my list of cheat sheets here. There are a lot of great resources on data science (I’ve included my top picks), so I don’t intend to reinvent the wheel here. This is just a list of my favourites all laid out so I can find them again or point other people in their direction when it comes up in conversation. I’ve also added a number of the “how to” type posts I’ve written on this blog as I often answer an enquiry in that format.

Data Science

Tutorials and videos: General

Puppets teach data science too

  • Render a 3D object in R. I have no idea where I would ever use this information in my practice, but it’s presented BY A PUPPET. Great fun.
  • DIY your data science. Another offering from the puppet circle on the data science venn diagram.

Econometrics

Statistics

Work Flow

  • Guide to modern statistical workflow. Really great organisation of background material.
  • Tidy data, tidy models. Honestly, if there’s one thing I wish had been around 10 years ago, this is it. The amount of time and accuracy to be saved using this method is phenomenal.
  • Extracting data from the web. You found the data, now what to do? Look here.

Linear Algebra

Asymptotics

Bayes

Machine learning

Data visualisation

Natural Language Processing

I’ll continue to update this list as I find things I think are useful or interesting.

Edit: actually, “that one where Hadley Wickham was talking about cupcakes” is surprisingly accurate as a search term.
