Algorithmic trading: evolutionary tourney?

The use of artificial intelligence (AI), in the form of machine learning techniques and algorithmic trading, has become widespread in the finance industry in recent years. A preference for quantitative techniques is nothing new in finance: a strong respect for quantitative methods has long been in place. What is new is the massive uptake of AI through these algorithmic decision making tools for high frequency trading.

Locally, these developments have contributed to industry productivity growth at a rate considerably greater than that of other Australian industries. They have also increased profits for investment vehicles. However, they are not without pitfalls. The inability of machine learning models to predict “fat tail” events or regime change in the data generating process is well known. The dependence of these models on a solution space that remains consistent over time is another weakness that should be acknowledged.

Failure to understand critical features of these models and algorithms may lead to substantial losses through naïve application. A subculture within the industry that breeds similarity between models may make this a systemic failure at some point in the future.

While these models offer excellent improvements in wealth generation and productivity in this industry (Qiao & Beling, 2016), they are not a substitute for human decision making, only an aid to it. Failure to acknowledge this distinction may lead to adverse outcomes on a very wide scale.

Background

Algorithmic trading and the use of decision analytics in financial market transactions are now very widespread, with many investment houses reducing the number of “stock pickers” they employ in favour of purely quantitative, algorithmic approaches to trading (The Economist Intelligence Unit, 2017). The use of these algorithmic decision making tools is common in markets such as foreign exchange (Chaboud, et al., 2014) and commodity markets, as well as listed stock exchanges.

As part of an overall move towards digitisation, these AI tools have resulted in increasing profits for investment houses in recent years (The Economist Intelligence Unit, 2017). Indeed, aided in part by these AI methodologies, productivity growth in the Australian finance industry has considerably outpaced aggregate productivity growth across the economy over the last fifteen years, as shown in Figure 1 (Australian Bureau of Statistics, 2016).

 


Figure 1: Productivity, Finance Industry and 12 Industry Aggregate Productivity. Data: ABS (2016), chart: Author

 

Algorithmic trading as an evolutionary tourney?

 

The methodology used in algorithmic trading is a combination of predictive modelling techniques and prescriptive decision frameworks (Qiao & Beling, 2016). The decision frameworks implemented by the automated algorithms consist of optimisation and simulation tools (Qiao & Beling, 2016).

Predictive methods vary, but include forecasting using machine learning methods such as neural nets and evolutionary algorithms, as well as more traditional econometric approaches such as the AutoRegressive Conditional Heteroskedastic (ARCH) family of models. One example is Laird-Smith et al.’s use of regression techniques to estimate a systemic error capital asset pricing model (Laird-Smith, et al., 2016). Other techniques include using sentiment analysis on unstructured corpora of text generated by journalists to estimate the effects of feedback into the market and the interaction between market and sentiment (Yin, et al., 2016). Yin et al. (2016) note the strong correlation between news sentiment and market returns, which can be exploited to improve investment outcomes.

The move towards algorithmic trading has been highly successful. Improved outcomes are observable both systemically, in the form of increased liquidity and price efficiency (Chaboud, et al., 2014) and at the firm level with improved profit margins (The Economist Intelligence Unit, 2017).

However, the complexity of the machine learning models underpinning these systems makes them difficult to compare and analyse critically, requiring novel techniques such as those of Parnes (2015). Many of these algorithms are proprietary and closely guarded: it is not possible for outsiders to analyse them, except by observing trading behaviours ex post.

There are also some negative outcomes which bear consideration. Chaboud et al. (2014) note that the behaviour of algorithmic traders is highly correlated. While this correlation does not appear to degrade market quality on average, these AI algorithms have in the past resulted in unexpected and costly mistakes, such as the automated selling program that initiated the ‘flash crash’ of 2010 (Kirilenko, et al., 2017).

In broader terms, it is now well known that machine learning algorithms are biased in unexpected ways when the training data from which they are generated is itself biased. For instance, an algorithm used to inform bail decisions for people awaiting trial in the U.S.A. was shown to disadvantage bail seekers of colour compared to white applicants, despite the fact that race was not a feature selected for use in the algorithm (Angwin, et al., 2016). Algorithmically generated decision making can also lead to unforeseen and unwanted outcomes. Facebook was recently revealed to be selling advertisement space targeted at users categorised under anti-Semitic topics (Angwin, et al., 2017). These algorithmically generated targeting categories were unknown to the company until they were discovered by journalists.

The economic implications of employing algorithmic decision making methods are enormous, not only in private financial circles but in government as well (Dilmegani, et al., 2014). It is clear that this form of AI will continue to be employed well beyond the finance industry, whether or not bias and negative unexpected outcomes can be eradicated. However, being aware of, and actively testing for, these possibilities is critical going forward.

In the finance industry, each algorithmic trader is judged, fine-tuned and updated against a similar fitness function: the ability to trade profitably in the market. In this sense, a market dominated by algorithmic trading AI is effectively a tourney in which the fittest agents survive and the weakest are abandoned and replaced with better models. However, the resulting similarity after many trading generations, already noted in the popular press (Maley, 2017), may expose the system to crisis during an unpredictable event if a majority of algorithmic traders react similarly, in unforeseen ways, with negative consequences.

The high speed at which these high frequency trading algorithms execute trades could lead to far greater damage in the future than the ‘flash crash’ documented by Kirilenko et al. (2017). While a well designed algorithmic trading agent will likely include a ‘dead man’s switch’, a command which halts further trading when a predetermined event (such as a crash) occurs, the efficacy of these measures has not been tested systemically during a financial crisis.

 

Conclusion

Algorithmic trading is a branch of artificial intelligence that has contributed to the generation of wealth and increased productivity in the finance industry. Nonetheless, algorithmic decision making should be seen as an aid to human decision making and not a replacement for it.

While the gains to be made from the technology are substantial and ongoing, in fields beyond finance as well, a lack of variability among algorithmic traders, combined with the proprietary and hidden nature of models that are difficult to explain and interpret, may lead to adverse consequences if the technology is applied naively.

 

References

Angwin, J., Larson, J., Mattu, S. & Kirchner, L., 2016. Machine Bias. [Online]
Available at: www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing
[Accessed 6 September 2017].

Angwin, J., Varner, M. & Tobin, A., 2017. Facebook’s Anti-Semitic Ad Categories Persisted after Promised Hate Speech Crackdown. [Online]
Available at: https://www.scientificamerican.com/article/facebook-rsquo-s-anti-semitic-ad-categories-persisted-after-promised-hate-speech-crackdown/
[Accessed 23 September 2017].

Australian Bureau of Statistics, 2016. 5260.0.55.002 Estimates of Industry Multifactor Productivity, Australia, Canberra: Australian Bureau of Statistics.

Chaboud, A. P., Chiquoine, B., Hjalmarsson, E. & Vega, C., 2014. Rise of the Machines: Algorithmic Trading in the Foreign Exchange Market. The Journal of Finance, Volume 69, pp. 2045-2084.

Dilmegani, C., Korkmaz, B. & Lunqvist, M., 2014. Public Sector Digitization: The Trillion-Dollar Challenge. [Online]
Available at: www.mckinsey.com/business-functions/digital-mckinsey/our-insights/public-sector-digitization-the-trillion-dollar-challenge
[Accessed 6 September 2017].

Kirilenko, A., Kyle, A., Samadi, M. & Tuzun, T., 2017. The Flash Crash: High-Frequency Trading in an Electronic Market. The Journal of Finance, Volume 72, pp. 967-988.

Laird-Smith, J., Meyer, K. & Rajaratnam, K., 2016. A study of total beta specification through symmetric regression: the case of the Johannesburg Stock Exchange. Environment Systems and Decisions, 36(2), pp. 114-125.

Maley, K., 2017. Are algorithms under estimating the risk of a Pyongyang panic?. [Online]
Available at: http://bit.ly/2vLlY2r
[Accessed 6 September 2017].

Parnes, D., 2015. Performance Measurements for Machine-Learning Trading Systems. Journal of Trading, 10(4), pp. 5-16.

Qiao, Q. & Beling, P. A., 2016. Decision analytics and machine learning in economic and financial systems. Environment Systems & Decisions, 36(2), pp. 109-113.

The Economist Intelligence Unit, 2017. Unshackled algorithms: Machine-learning in finance. The Economist, 27 May.

Yin, S., Mo, K., Liu, A. & Yang, S. Y., 2016. News sentiment to market impact and its feedback effect. Environment Systems and Decisions, 36(2), pp. 158-166.

 

Closures in R

Put briefly, closures are functions that make other functions. Are you repeating a lot of code, but there’s no simple way to use the apply family or purrr to streamline the process? Maybe you could write your own closure. Closures enclose the environment in which they were created – so you can nest functions within other functions.

What does that mean exactly? Put simply, the function you create can see all the variables defined in the function that created it. That’s a useful property!
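Here’s a minimal sketch (the names here are mine, purely for illustration): a counter whose inner function remembers and updates a variable that lives in the environment of the function that created it.

make_counter <- function() {
  count <- 0               # lives in the enclosing (creating) environment
  function() {
    count <<- count + 1    # the inner function reaches back up and updates it
    count
  }
}

tick <- make_counter()
tick()   # 1
tick()   # 2

The value of count survives between calls because tick() carries its creating environment around with it.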

What’s the use case for a closure?

Are you repeating code, but it’s the function that changes each time rather than the variables or data? That’s a good case for a closure. Especially as your code gets more complex, closures are a good way of modularising and containing concepts.

What are the steps for creating a closure?

Building a closure happens in several steps:

  1. Create the output function you’re aiming for. This function’s input is the same as what you’ll give it when you call it. It will return the final output. Let’s call this the enclosed function.
  2. Enclose that function within a constructor function. This constructor’s input will be the parameters by which you’re varying the enclosed function in Step 1. It will output the enclosed function. Let’s call this the enclosing function.
  3. Realise you’ve got it all wrong, go back to step 1. Repeat this multiple times. (Ask me how I know..)
  4. Next you need to create the enclosed function(s) (the ones from Step 1) by calling the enclosing function (the one from Step 2).
  5. Lastly, call and use your enclosed functions.

An example

Say I want to calculate the mean, SD and median of a data set. I could write:
x <- c(1, 2, 3)
mean(x)
median(x)
sd(x)

That would definitely be the most efficient way of going about it. But imagine that your real use case is hundreds of statistics or calculations on many, many variables. This will get old, fast.

I’m calling those three functions each in the same way, but the functions are changing rather than the data I’m using. I could write a closure instead:

stat <- function(stat_name){
  function(x){
    stat_name(x)
  }
}

This is made up of two parts: function(x){} which is the enclosed function and stat() which is the enclosing function.

Then I can call my closure to build my enclosed functions:

mean_of_x <- stat(mean)
sd_of_x <- stat(sd)
median_of_x <- stat(median)

Lastly I can call the created functions (probably many times in practice):

mean_of_x(x)
sd_of_x(x)
median_of_x(x)

I can repeat this for all the statistics/outcomes I care about. This example is too trivial to be realistic – it takes about double the lines of code and is grossly inefficient! But it’s a simple example of how closures work.
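As a small extension of the example (a sketch of my own), the enclosing function also plays nicely with the apply family once the number of statistics grows:

# Build a named list of enclosed functions in one go, then apply them all to x
stat_fns <- lapply(list(mean = mean, sd = sd, median = median), stat)
sapply(stat_fns, function(f) f(x))   # named vector with one result per statistic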

More on closures

If you’re producing more complex structures, closures are very useful. See Jason’s post from Left Censored for a realistic bootstrap example – closures can streamline complex pieces of code reducing mistakes and improving the process you’re trying to build. It takes modularity to the next step.

For more information in R see Hadley Wickham’s section on closures in Advanced R.

 

Australia’s same sex marriage survey

It was a farcical display of an absence of leadership. And the data it provides is not remotely as good as a properly executed survey.

Nonetheless, it had our national attention for months and it’s over.

Here’s a Shiny app because my Facebook discussions got a little detailed. Now everyone can have a look at the data on a by-electorate basis.

Some hot takes for you:

  • When thinking about outcomes in ‘electorates with a high proportion of migrants’, also think about the very different treatment effects caused by the fact that there was little to no outreach from the yes campaign to non-English-speaking communities, while others targeted those communities with misinformation about the impact of gay marriage on schools. (That’s not a diss on the yes campaign: limited resources and all of that. They were in it to win a nation, not single electorates.)
  • Remember that socioeconomic advantage is a huge confound in just about everything.
  • The survey asked about changing a status quo. That’s not entirely the same thing as being actively homophobic: but I’ll agree in this case that’s a fine line to draw.
  • Why didn’t areas with high migrant populations in other cities follow the same patterns?
  • Did Sydney diocesan involvement, both in terms of investment and pulpit rhetoric create a different treatment effect compared to different cities?

And one thing I think we should all be constantly aware of, even as we nerds are enjoying our dissection:

  • This data was generated on the backs of the suffering of many LGBTIQ+ Australians and their families.

Bring on equality.

Code here.

Data here.

App in full screen here.

Some notes

I’m teaching a course in financial statistics this semester – it’s something I’ve been doing on and off for, well, it’s best measured in decades now.

To make my life easier, I started compiling my notes on various subjects: here they are in rough draft form, at least for the moment. I’ll add to them as I go.

Chapter 1 The World of Data Around Us

Chapter 2 Data Visualisation Matters

Data Analysis

An Introduction to Algebra

Introduction to Sigma Notation

Failure Is An Option

Failure is not an option available to most of us, most of the time. With people depending on us, it’s a luxury few can afford. As a consequence, we shoot for a minimum viable product, minimise risks and, often, minimise creativity. This is a crying shame. If we want to be better programmers, better modellers or better analysts, we need to have space to fail occasionally. The opportunity to “try it and see” is an incredible luxury not offered to many people.

This isn’t a defence of mediocrity or incompetence. Let me be very clear: highly capable, brilliant people fail at things they try. They fail because they take risks. They push and push and push. They find out a whole bunch of things that don’t work and- if they’re lucky- find the piece of gold that does.

In our general workplaces, we often can’t afford this, unless we are very privileged. I spent last week at the rOpenSci Oz Unconference and failure wasn’t just an option: it was encouraged.

This was an incredibly freeing approach to programming and one that generated a wealth of creativity and collaboration. Participants were users of all skill levels from just-starting-out to decades-long-veterans. They came from diverse fields like ecology, psychology, academia and business. The underlying ethos of the conference was “try it and see”.

Try it we did.

We had two days of learning, trying, succeeding, failing occasionally and solving those problems until we succeeded. Thanks to rOpenSci and to our sponsors: having the space and support to fail made exploration, learning and creativity a priority. It’s a rare luxury and one I’d recommend to everyone if you can get it!

If you’re just starting out: remember that it’s OK to fail occasionally. You’ll be a better programmer, better analyst or better modeller for it.

Size Matters

It would be too easy for someone like me to declare that A/B testing is simple. If you’re doing website testing you have all the power in the world. Sample sizes literally not dreamt of when I was an undergraduate. A large sample is the same thing as power, right? And of course, power is all that matters.

This is completely wrong. While statistical A/B testing uses (parts of) the same toolkit I was using in RCTs in Papua New Guinea and Tonga, it isn’t looking at the same problem and it isn’t looking at the same effect sizes. In this context, A/B testing is used for incremental changes. By contrast, in my previous professional life we were looking for the biggest possible bang we could generate for the very limited development dollar.

As always, context matters. Power gets more expensive as it approaches its asymptote, and more expensive again when the effect size is also shrinking. How expensive? How big does that sample have to be? This is one of the things I’m looking at in this experiment.

However, size isn’t something I see discussed often. The size of the test is about control: it tells you what your tradeoff has been between Type I and Type II errors. We fix our size a priori and then we move on and forget about it. But is our fixed, set size the same as the actual size of our test?

We know that Fisher’s exact test is often actually undersized in practice, for example.  In practice, this means the test is too conservative. On the flip side, a test that is too profligate (over sized) is rejecting the null hypothesis when it’s true far too often.

In this post, I’m going to look at a very basic test with some very robust assumptions and see what happens to size and power as we vary those assumptions and sample sizes. The purpose here is the old decision vs default adage: know what’s happening and own the decision you make.

tl;dr

While power matters in A/B testing, so does the size of the test. We spend a lot of time worrying about power in this field but not enough (in my view) ensuring our expectations of size are appropriate. Simple departures from the unicorns-and-fairy-dust normal distribution can cause problems for size and power.

Testing tests

The test I’m looking at is the plainest of plain vanilla statistical tests. The null hypothesis is that the mean of a generating distribution is zero against an alternate. The statistic is the classic z-statistic and the assumptions underlying the test can go one of two ways:

  1. In the finite sample, if the underlying distribution is normal (and the variance is known), then the z-statistic is exactly normal, no matter the sample size.
  2. Asymptotically, as long as the underlying distribution meets some conditions, like enough finite moments (more on this below: the fat tails condition) and independence, the z-statistic will have a normal limiting distribution as the sample size gets large.

I’m going to test this simple test in one of four scenarios over a variety of sample sizes:

  1. Normally generated errors. Everything is super, nothing to see here. This is the statistician’s promised land flowing with free beer and coffee.
  2. t(4) errors. This is a fat tailed distribution, still centred on zero and symmetric. Its moments exist only up to (but not including) the fourth: the variance is finite, so the asymptotics still hold, but there is no wriggle room. Are fat tails alone an issue?
  3. Centred and standardised chi-squared(2) errors: fat tailed and asymmetric, the errors generated from this distribution have mean zero and a standard deviation of unity. Does symmetry matter that much?
  4. Cauchy errors. This is the armageddon scenario: the central limit theorem doesn’t apply here at all. There is no theoretical underpinning for getting this test to work under this scenario, since no moments of order one or above are finite (there are some fractional ones, though). Can a big sample size get you over this in practice?

In the experiments, we’ll look at sample sizes between 10 and 100 000. Note that the charts below are on a log10 scale.

The null hypothesis is mu = 0, and the rejection rate under this scenario gives us the size of our test. We can measure power under a variety of alternatives, and here I’ve looked at mu = 0.01, 0.1, 1 and 10. I also looked at micro effect sizes like 0.0001, but it was a sad story. [1]

The test has been set with significance level 0.05 and each experiment was performed with 5000 replications. Want to try it yourself/pull it apart? Code is here.
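If you just want the flavour without opening the repository, here’s a minimal sketch of the idea (the function and argument names are mine, and this is not the linked code):

# Rejection rate of the z-test of H0: mu = 0 for a given sample size and error distribution
reject_rate <- function(n, reps = 5000, rgen = rnorm, mu = 0) {
  rejections <- replicate(reps, {
    x <- mu + rgen(n)                  # draw errors and shift by the true mean
    z <- sqrt(n) * mean(x) / sd(x)     # the classic z-statistic
    abs(z) > qnorm(0.975)              # two-sided rejection at the 5% level
  })
  mean(rejections)                     # size if mu = 0, power otherwise
}

reject_rate(1000)                                                   # size, normal errors
reject_rate(1000, mu = 0.1)                                         # power at mu = 0.1
reject_rate(1000, rgen = function(n) rt(n, df = 4))                 # fat tails: t(4)
reject_rate(1000, rgen = function(n) (rchisq(n, df = 2) - 2) / 2)   # skewed: centred, scaled chi-squared(2)
reject_rate(1000, rgen = rcauchy)                                   # armageddon: Cauchy

Looping calls like these over the grid of sample sizes and alternatives produces the charts below.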

Normal errors: everything is super

[Chart: size and power by sample size, normal errors]

With really small samples, the test is oversized. N = 10 is a real Hail Mary at the best of times; it doesn’t matter what your assumptions are. But by about n = 30 and above, the size is consistently in the range of 0.05.

It comes as a surprise to no one to hear that power is very much dependent on distance from the null hypothesis. For alternatives where mu = 10 or mu = 1, power is near unity for smallish sample sizes. In other words, we reject the null hypothesis close to 100% of the time when it is false. It’s easy for the test to distinguish these alternatives because they’re so different from the null.

But what about smaller effect sizes? If mu = 0.1, then you need a sample of at least a thousand to get close to power of unity. If mu = 0.01, then a sample of 100 000 might be required.

Smaller still? The chart below shows another run of the experiment with effect sizes of mu = 0.001 and 0.0001 – minuscule. The test has almost no power even at sample sizes of 100 000. If you need to detect effects this small, you need samples in the millions (at least).

[Chart: size and power at effect sizes 0.001 and 0.0001, normal errors]
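For a rough sense of the scale involved, a back-of-the-envelope power calculation in base R (assuming a standard deviation of one) tells the same story:

# Sample size needed for 80% power at the 5% level, one-sample test of a tiny effect
power.t.test(delta = 0.001, sd = 1, sig.level = 0.05, power = 0.8,
             type = "one.sample")
# n comes out close to eight million; at delta = 0.0001 it is closer to 800 million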

Fat shaming distributions

Leptokurtic (fat tailed) distributions catch a lot of flak for messing up tests and models, and that’s fairly well deserved. However, the degree of fatness matters. The t(4) distribution still generates a statistic that works asymptotically in this context: it has just enough finite moments, with no wriggle room at all.

Size is more variable than in the normal scenario: it is more profligate at small sample sizes and tends towards conservatism at larger ones, though it stays reasonably close to 0.05 once the sample is large.

[Chart: size and power by sample size, t(4) errors]

Power is costly for the smaller effect sizes. For the same sample size (say n=1000) with mu = 0.1, there is substantially less power than in the normal case. A similar behaviour is evident for mu=0.01. Tiny effect sizes are similarly punished (see below).

[Chart: power at tiny effect sizes, t(4) errors]

Fat, Skewed and Nearly Dead

The chi-squared(2) distribution puts this simple test through its paces. The size of the test for anything under n = 1000 is not in the same ballpark as its nominal level, rendering the test (in my view) a liability. By the time n = 100 000, the size of the test is reasonable.

Power, while showing similar patterns, is not a saving grace here: with so little control over size, there is no solid basis on which to interpret your power.

[Chart: size and power by sample size, chi-squared(2) errors]

Here be dragons

I included a scenario with the Cauchy distribution, despite the fact it’s grossly unfair to this simple little test. The Cauchy distribution ensures that the Central Limit Theorem does not apply here: the test will not work in theory (or, indeed, in practice).

I thought it was a useful exercise, however, to show what that looks like. Too often, we assume that “as n gets big the CLT is going to work its magic”, and that’s just not true. To wit: one hot mess.

[Chart: size and power by sample size, Cauchy errors]

Neither size nor power improves as the sample size increases: that’s because the CLT isn’t operational in this scenario. The test is undersized and underpowered for all but the largest of effect sizes (and really, at that effect size you could tell the difference from a chart anyway).

A/B reality is rarely this simple

A/B testing reality is rarely as simple as the test I’ve illustrated above. More typically, we’re testing groups of means or proportions, interaction effects, possibly dynamic relationships, and a long list besides.

My purpose here is to show that even a simple, robust test with minimal assumptions can be thoroughly useless if those assumptions are not met. More complex tests and testing regimes that build on these simple results may be impacted more severely and more completely.

Power is not the only concern: size matters.

End notes

[1] Yes, I need to get a grip on LaTeX in WordPress sooner or later, but it’s less interesting than the actual experimenting.

Violence Against Women: Standing Up with Data

Today, I spent the day at a workshop tasked with exploring the ways we can use data to contribute to ending violence against women. I was invited along by The Minerva Collective, who have been working on the project for some time.

Like all good workshops there were approximately 1001 good ideas. Facilitation was great: the future plan got narrowed down to a manageable handful.

One thing I particularly liked was that, while the usual NGO and charitable organisations were present (and essential), the team from Minerva had managed to bring in a number of industry contributors from telecommunications and finance who were able to make substantial contributions. This is quite a different approach to what I’ve seen before and I’m interested to see how we can work together.

I’m looking forward to the next stage, there’s a huge capacity to make a difference. While there are no simple answers or magic bullets, data science could definitely do some good here.

Hannan Quinn Information Criteria

This is a short post for all of you out there who use information criteria to inform model decision making. The usual suspects are the Akaike (AIC) and the Schwarz Bayesian (BIC) criteria.

Especially, if you’re working with big data, try adding the Hannan Quinn (1972) into the mix. It’s not often used in practice for some reason. Possibly this is a leftover from our small sample days. It has a slower rate of convergence than the other two – it’s log(n) convergent. As a result, it’s often more conservative in the number of parameters or size of the model it suggests- e.g. it can be a good foil against overfitting.

It’s not the whole answer and for your purposes may offer no different insight. But it adds nothing to your run time, is fabulously practical and one of my all time favourites that no one has ever heard of.