Help Please – a package for new R users

Starting out in R can be quite overwhelming. There are lots of resources and people around who want to help, but navigating those can be hard, and some of the error messages and explanations within R itself can seem like a foreign language.

helpPlease is there to bridge the gap. The gap closes by itself over time, but let’s build a bridge and make life easier.

The package is at the proof-of-concept stage and you can see it on GitHub. If you have an error message that would benefit from a plain-language explanation, a term used in R that could use one, encouragement for new users or a troubleshooting tip, we’d love to hear from you.

A consultant’s workflow

I’ve been thinking a lot about workflow lately and how it differs from project to project. There are a few common states I move through with each project, however. I wanted to talk a bit more about how failure fits into that workflow. As I’ve mentioned before, I quite like failure. It’s a useful tool for a data scientist, and in my view it has an important place in a data scientist’s workflow.

Here’s my basic workflow. Note the strong similarity in parts to Hadley Wickham’s data science workflow, which I think is an excellent discussion of the process. In this case, I wanted to talk more about an interactive workflow with a client, however, and how failure fits into that.

[Figure: flow chart of the workflow described below]

An interactive workflow

As a consultant, a lot of what I do is interactive with the client. This creates opportunities for better analysis. It also creates opportunities for failure. Let me be clear: some failures are not acceptable and are not in any way beneficial. Those are the failures that happen after ‘analysis complete’. All the failures that happen before that are an opportunity to improve, grow and cement a working relationship with a client. (Some, however, are hideously embarrassing; you want to avoid those in general.)

This workflow is specific to me: your mileage will almost certainly vary. How much it should vary from mine, I have no opinion on, I’m afraid. As usual, you should take what’s useful to you and jettison the rest.

My workflow starts with a client request. This request is often nebulous, vague and unformed. If that’s the case, then there’s a lot of work around getting the client’s needs and wants into a shape that will be (a) beneficial to the client and (b) achievable. That’s a whole other workflow to discuss on another day.

This is the stage at which I recommend documenting client requests in the form of some kind of work order, so everyone’s on the same page. Some clients have better clarity around what they require than others. Having a document you can refer back to at handover saves a lot of time and difficulty when the client is working outside their domain knowledge. It also helps a lot at final-stage validation with the client: it’s easy to then check off that you did what you set out to do.

Once the client request is in workable shape, it’s time to identify and select data sources. The data may be client-provided, external, or both. Pro tip: document everything. Every Excel worksheet, .csv, database – everything. Where did it come from, who gave it to you, how and when? I talk about how I do that in part here.

Next I validate the data: does it make sense, is it what I expected, what’s in there? Once that’s all done I need to validate my findings against the client’s expectations. Here’s where a failure is good. You want to pick up if the data is a load of crap EARLY. You can then manage the client’s expectations around outcomes – and lay the groundwork for future projects.

If it’s a failure – back to data sourcing and validation. If it’s a pass, on to cleaning and transformation, another sub-workflow in itself.

Analyse, Model, Visualise

This part of my workflow is very close to the iterative model proposed by Hadley Wickham that I linked to above. It’s fundamentally repetitive: try, catch problems and insights, repeat. I also like to note the difference between visualisation for finding insight and visualisation for communicating insight. These can be the same, but they’re often different.

Sometimes I find an insight in the statistics and use the visualisation to communicate it. Sometimes I find the insight in the visualisation, then validate with statistics and communicate the insight to the client with a chart. Pro tip: the more complex an idea, the easier it is to present it initially with a chart. Don’t diss the humble bar chart: it’s sometimes 90% of the value I add as a consultant, fancy multi-equation models notwithstanding.

This process is full of failure. Code that breaks, models that don’t work, statistics that are not useful. Insights that seem amazing at first, but aren’t valid. Often the client likes regular updates at this point and that’s a reasonable accommodation. However! Be wary about communicating your latest excitement without internal validation. It can set up expectations for your client that aren’t in line with your final findings.

You know you’re ready to move out of this cycle when you run out of failures or time or both.

Communicate and validate

This penultimate stage often takes the longest: communication is hard. Writing some code and throwing a bunch of data into a model is relatively easy. It’s also the most important stage – it’s vital that we communicate in such a way that our client or domain experts can validate what we’re finding. Avoid at all costs the temptation to lapse into tech-speak. The client must be able to engage with what you’re saying.

If that all checks out, then great – analysis complete. If it doesn’t, we’re bounced all the way back to data validation. It’s a big failure – but that’s OK. It’s far more palatable than the failures that come after ‘analysis complete’.

I turned off commenting on new posts…

… not because I don’t like hearing from you all. It’s more that being offered amazing generic pharmaceutical deals 1500 times gets old.

If you’d like to throw some criticism or comment my way about anything on here, feel free to ping me on Twitter @StephdeSilva or use the contact form to hit my inbox.

And if you want a great deal on generic pharmaceuticals, I can probably help you out there too.

On moving the box, not the whiskers

A lot of institutions and data scientists are keen to take the best of the best and make them better. They are working with elites in terms of intellectual ability, the socio-economic lottery that takes intelligence and operationalises it, and all the other bits and pieces that go into that.

These institutions and data scientists are working with the whiskers of the population box plot. They’re taking the people on the upper edge of the distribution and they’re keen to push them further out: to achieve more, do more, create more. Bravo! This is important and should continue.

[Figure: box plot]

However, there’s another group of people who I think need to learn data science skills in general and how to code in particular. Australia has no national workforce plan, but we know and acknowledge that data is at the heart of our economy going forward.

In order to make the most of our future, we need a large number of people in the box to learn the skills that will give them access to a digital, data-driven economy. These people are not elites. They often do not believe they have a strong mathematics skill set and they don’t have PhDs. But we need them.

Data science in general and coding in particular is a useful, important skill set. There will always be space and need for elite data scientists, trained by elite institutions and mentors. But we also need to start thinking about how we’re going to move the box, not just the whiskers.

If you think about it, the productivity gains from moving a small proportion of the box upwards are enormous compared to moving just the whiskers.

Where do things live in R? R for Excel Users

One of the more difficult things about learning a new tool is the investment you make while you’re learning things you already know in your current tool. That can feel like time wasted – it’s not, but it’s a very frustrating experience. One of the ways to speed up this part is to ‘translate’ concepts you know in your current tool into concepts for your new one.

In that spirit, here’s a brief introduction to where things live in R compared to Excel. Excel is a very visual medium – you can see and manipulate your objects all the time. You can do the same in R; it’s just that things are arranged in slightly different ways.

[Figure: ‘Where does it live?’ infographic]

Data

Data is the most important part. In Excel, it lives in the spreadsheet. In R it lives in a data structure – commonly a data frame. In Excel you can always see your data.

[Figure: an Excel spreadsheet]

In R you can too – go to the environment window and click on the spreadsheet-looking icon; it will show your data in the viewer window if it’s an object that can be displayed like that (if you don’t have this option, your object may be a list, not a data frame). You can’t manipulate the data like this, however – you need code for that. You can also use commands like head(myData) to see the first few lines, tail(myData) to see the last few and print(myData) to see the whole object.
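If you want to try those commands, here’s a minimal sketch using mtcars, a small data set that ships with R (myData above is just a placeholder name – any data frame of your own will behave the same way):

head(mtcars)    # the first six rows
tail(mtcars)    # the last six rows
print(mtcars)   # the whole object in the console
View(mtcars)    # opens the spreadsheet-style viewer in RStudio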

[Figures: the R environment window and the data viewer]

Code

Excel uses code to make calculations and create statistics – but it often ‘lives’ behind the object it produces. Sometimes it can make your calculation look like the original data and create confusion for your stakeholders (and for you!).

[Figure: an Excel formula]

In R, code is used in a similar way to Excel, but it lives in a script – a .R file. This makes it easier to reuse and understand, and more powerful to manipulate. Using code in a script saves a lot of time and effort.
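As a rough sketch of what that looks like (the file and object names below are made up for illustration), a short script can hold a whole calculation in one place, here using the built-in mtcars data:

# my_analysis.R – a script you can save, rerun and share
average_mpg <- mean(mtcars$mpg)   # the 'formula' lives here, not hidden behind a cell
average_mpg                       # print the result to the console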

[Figure: an R script]

Results and calculations

In Excel, results and calculations live in a worksheet in a workbook. They can be easy to confuse with the original data, it’s hard to check if things are correct, and re-running analyses (you often re-run them!) is time consuming.

In R, if you give your result or analysis a name, it will be in the Environment, waiting for you – you can print it, copy it, change it, chart it, write it out to Excel for a coworker and recreate it any time you need with your script.
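Here’s a minimal sketch of that idea (the object and file names are placeholders I’ve made up), again using the built-in mtcars data:

# a named result sits in the Environment until you need it again
mpg_by_cyl <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
mpg_by_cyl                                # print it
plot(mpg_by_cyl)                          # chart it
write.csv(mpg_by_cyl, "mpg_by_cyl.csv",   # write it out as a .csv for a coworker to open in Excel
          row.names = FALSE)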

[Figure: a result in R]

That’s just a simple run down – there’s a lot more to R! But it helps a lot to know where everything ‘lives’ as you’re getting started. Good luck!

R for Excel users

Moving over to R (or any other programming language) from Excel can feel very daunting. One of the big stumbling blocks, in my view, is having a mental understanding of how we store data in structures in R. You can view your data structures in R, but unlike Excel where it’s in front of your face, it’s not always intuitive to the user just starting out.

There’s lots of great information on the hows, whys and wherefores: here’s a basic rundown of some of the common ways we structure our data in R and how that compares to what you’re already familiar with: Excel.

Homogeneous data structures


[Figure: homogeneous data structures infographic]

Homogeneous in this case just means all the ‘bits’ inside these structures need to be of the same type. There are many types of data in R, but the basic ones you need to know when you’re just starting out are:

  • Numbers. These come in two varieties:
    • Doubles – where you want and use decimal points, for example 1.23 and 4.56.
    • Integers – where you don’t, for example 1, 2, 3.
  • Strings. This is basically just text data – made up of characters. For example, dog, cat, bird.
  • Booleans. These take two forms: TRUE and FALSE. (R calls this type ‘logical’.) There’s a quick sketch of each type just after this list.
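Here’s that quick sketch – typeof() is a built-in way of asking R what it thinks a value is:

typeof(1.23)    # "double" – a number with a decimal point
typeof(2L)      # "integer" – the L tells R you mean a whole number
typeof("dog")   # "character" – what we've been calling a string
typeof(TRUE)    # "logical" – R's name for a boolean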

Homogeneous data structures are vectors, matrices and arrays. All the contents of these structures have to have the same type. They need to be numbers OR text OR booleans or other types – but no mixing.

Let’s go through them one-by-one:

  • Vectors. You can think of a vector like a column in a spreadsheet – there’s an arbitrary number of slots with data in each one. There’s a catch – the data types all have to be the same: all numbers, all strings, all booleans or some other type. Base R has a good selection of options for working with this structure.
  • Matrices. Think of this one as the whole spreadsheet – a series of columns in a two-dimensional arrangement. But! This arrangement is homogeneous – all types the same. Base R has you covered here!
  • Arrays. This is the n-dimensional equivalent of the matrix – a bundle of worksheets in the workbook, if you will. Again, it’s homogeneous. The abind package is really useful for manipulating arrays. If you’re just starting out, you probably don’t need this yet! (There’s a short sketch of all three just after this list.)
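Here’s that short sketch of all three (the object names are invented for illustration):

my_vector <- c(1.5, 2.3, 4.0)               # one 'column' of doubles
my_matrix <- matrix(1:6, nrow = 2)          # a 2 x 3 'spreadsheet' of integers
my_array  <- array(1:12, dim = c(2, 3, 2))  # two 2 x 3 'worksheets' stacked together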

The advantage of homogeneous structures is that they can be faster to process – but you have to be using serious amounts of processing power for this to matter a lot. So don’t worry too much about that for now. The disadvantage is that they can be restrictive compared to some other structures we’ll talk about next.

 

Heterogeneous structures

[Figure: heterogeneous data structures infographic]

Heterogeneous data structures just mean that the content can be of a variety of types. This is a really useful property and makes these structures very powerful. There are two main forms, lists and data frames.

  • Lists. Like a vector, we can think about a list like a column from a spreadsheet. But unlike a vector, the content of a list can be of any type.
  • Data frames. A data frame is really a list of equal-length columns – under the hood it is a list, and each column is usually a vector but can itself be a list. Generally the content of each column is of a single type (like you’d expect in a spreadsheet), but that’s not necessarily the case. Data frames can have named columns (so can other structures) and you can access data using those names – there’s a sketch of both structures just after this list.
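Here’s the sketch of both, with made-up names:

my_list <- list(42, "dog", TRUE)                # mixed types are fine in a list
my_data <- data.frame(name  = c("dog", "cat"),  # a data frame with named columns
                      legs  = c(4, 4),
                      flies = c(FALSE, FALSE))
my_data$name                                    # access a column by its name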

Data frames can be extended into quite complex structures. Data frames don’t have to be ‘flat’: because you can make lists of lists, you can have data frames where one or more of the columns holds a list in each slot. These are called nested data frames.
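You won’t need nested data frames when you’re just starting out, but as a hedged sketch of what one looks like in base R (the names are invented for illustration):

nested <- data.frame(id = c(1, 2))
nested$scores <- list(c(90, 85, 70), c(60, 75))  # each slot in this column holds a whole vector
str(nested)                                      # a data frame with a list column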

This and other properties make the data frame an extremely powerful type for manipulating data. There’s a whole series of operations and functions in R dedicated to manipulating data frames. Matrices and vectors can be converted into data frames; one way is the function as.data.frame(my_matrix).

The disadvantage of this structure is that it can be slower to process – but if you’re at the stage of coding where you’re not sure whether this matters to you, it probably doesn’t just now! R is set up to do a bunch of really useful things using data frames. This is the data structure probably most similar to an Excel sheet.

How do you know what structure you’re working with? If you have an object in R and you’re not sure whether it’s a matrix, a vector, a list or a data frame, call str(object). It will tell you what you’re working with.
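For example, a quick sketch:

my_matrix <- matrix(1:6, nrow = 2)   # a small homogeneous structure
str(my_matrix)                       # int [1:2, 1:3] – an integer matrix
my_df <- as.data.frame(my_matrix)    # convert it to a data frame
str(my_df)                           # now a data frame with columns V1, V2 and V3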

So that’s a really simple take on some simple data structures in R: quite manageable, because you already understand lots of these concepts from your work in Excel. It’s just a matter of translating them into a different environment.

 

Acknowledgement: Did you like the whole homogeneous/heterogeneous structure idea? That isn’t my idea – Hadley Wickham in Advanced R talks about it in much more detail.

Algorithmic trading: evolutionary tourney?

The use of artificial intelligence (AI) in the form of machine learning techniques and algorithmic trading in the finance industry has in recent years become widespread. The preference for quantitative techniques is not new in the finance industry: a strong respect for quantitative methods has long been in place. However, the massive uptake of AI through algorithmic decision-making tools for high frequency trading is a relatively new development.

Locally, these developments have contributed to industry productivity improving at a rate considerably greater than that of other industries in Australia. They have also resulted in increased profits for investment vehicles. However, they are not without pitfalls and cautions. The inability of machine learning models to predict “fat tail” events or regime change in the data generating process is well known. The dependence of these models on a consistent solution space over time is another weakness that should be acknowledged.

Failure to understand critical features of these models and algorithms may lead to substantive losses through naïve application. A subculture within the industry that breeds similarity between models may possibly make this a systemic failure at some point in the future.

While these models offer excellent improvements in wealth generation and productivity in this industry (Qiao & Beling, 2016), they are not a substitute for human decision making, only an assistance to it. Failure to acknowledge this distinction may lead to adverse outcomes on a very wide scale.

Background

Algorithmic trading and use of decision analytics in financial market transactions is now very widespread, with many investment houses reducing the number of “stock pickers” they employ in favour of purely quantitative algorithmic approaches to trading (The Economist Intelligence Unit, 2017). The use of these algorithmic decision making tools is common in markets such as foreign exchange (Chaboud, et al., 2014) and commodity markets, as well as listed stock exchanges.

As part of an overall move towards digitisation, these AI tools have resulted in increasing profits for investment houses in recent years (The Economist Intelligence Unit, 2017). Indeed, contributed to in part by these AI methodologies, productivity growth in the Australian finance industry has outpaced aggregate productivity growth in the economy by a very considerable amount over the last fifteen years, as shown in Figure 1 (Australian Bureau of Statistics, 2016).

 

[Figure 1: Productivity, Finance Industry and 12 Industry Aggregate Productivity. Data: ABS (2016), chart: Author]

 

Algorithmic trading as an evolutionary tourney?

 

The methodology used to employ algorithmic trading is a combination of predictive modelling techniques and prescriptive decision frameworks (Qiao & Beling, 2016). The decision tools implemented by the automated algorithms consist of optimisation and simulation tools (Qiao & Beling, 2016).

Predictive methods vary, but include forecasting using machine learning methods such as neural nets and evolutionary algorithms, as well as more traditional econometric modelling such as the AutoRegressive Conditional Heteroskedastic (ARCH) framework. One example is Laird-Smith et al.’s use of regression techniques to estimate a systemic error capital asset pricing model (Laird-Smith, et al., 2016). Other techniques include using sentiment analysis on unstructured corpora generated by journalists to estimate the effects of feedback into the market and the interaction between market and sentiment (Yin, et al., 2016). Yin et al. (2016) note the strong correlation between news sentiment and market returns, which can be exploited to improve investment outcomes.

The move towards algorithmic trading has been highly successful. Improved outcomes are observable both systemically, in the form of increased liquidity and price efficiency (Chaboud, et al., 2014), and at the firm level, with improved profit margins (The Economist Intelligence Unit, 2017).

However, the complexity of the machine learning models underpinning these systems makes them difficult to compare and critically analyse, requiring novel techniques such as those of Parnes (2015). Many of these algorithms are proprietary and closely guarded: it is not possible for outsiders to analyse them, except by observing trading behaviours ex post.

There are also some negative outcomes which bear consideration. Chaboud et al. (2014) note that the behaviour of algorithmic traders is highly correlated. While this correlation does not appear to cause a degradation in market quality on average, these AI algorithms have in the past resulted in unexpected and costly mistakes, such as the automated selling program that initiated the ‘flash crash’ of 2010 (Kirilenko, et al., 2017).

In broader terms, it is now well known that machine learning algorithms are biased in unexpected ways when the training data from which they are generated is biased. For instance, an algorithm that decides on offers of bail to people awaiting trial in the U.S.A. was shown to disadvantage bail seekers of colour compared to those who were white, despite the fact that race was not a feature selected for use in the algorithm (Angwin, et al., 2016). Algorithmically generated decision making can also lead to unforeseen and unwanted outcomes. Facebook was recently revealed to be selling advertisement space targeted at users categorised under anti-Semitic topics (Angwin, et al., 2017). These algorithmically generated targeting categories were unknown to the company until they were discovered by journalists.

The economic implications of employing algorithmic decision-making methods are enormous, not only in private financial circles but in government as well (Dilmegani, et al., 2014). It is clear that this form of AI will continue to be employed well beyond the finance industry, whether or not bias and negative unexpected outcomes are eradicable. However, being aware of and actively testing for these possibilities is critical going forward.

In the finance industry, each algorithmic trader has a similar fitness function by which it is judged, fine-tuned and updated: the ability to trade profitably in the market. In this sense, a market dominated by algorithmic trading is effectively an evolutionary tourney in which the fittest agents survive and the weakest are abandoned and replaced with better models. However, the resultant similarity after many trading generations, already noted in the popular press (Maley, 2017), may expose the system to crisis during an unpredictable event if a majority of algorithmic traders react similarly in unforeseen ways that have negative consequences.

The high speed at which these high frequency trading algorithms execute trades could lead to far greater damage in the future than the ‘flash crash’ documented by Kirilenko et al. (2017). While a well-designed algorithmic trading agent will likely include a ‘dead man’s switch’ – a command which halts further trading in the event of a predetermined trigger (such as a crash) – the efficacy of these measures has not been tested systemically during a financial crisis.

 

Conclusion

Algorithmic trading is a branch of artificial intelligence that has contributed to the generation of wealth and increased productivity in the finance industry. However, algorithmic decision making should be seen as an aid to human decision making and not a replacement for it.

While the gains to be made from the technology are substantial and ongoing, in fields beyond finance as well, a lack of variability among algorithmic traders, together with the proprietary and hidden nature of these models – which are difficult to explain and interpret – may lead to adverse consequences if they are applied naively.

 

References

Angwin, J., Larson, J., Mattu, S. & Kirchener, L., 2016. Machine Bias. [Online]
Available at: www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing
[Accessed 6 September 2017].

Angwin, J., Varner, M. & Tobin, A., 2017. Facebook’s Anti-Semitic Ad Categories Persisted after Promised Hate Speech Crackdown. [Online]
Available at: https://www.scientificamerican.com/article/facebook-rsquo-s-anti-semitic-ad-categories-persisted-after-promised-hate-speech-crackdown/
[Accessed 23 September 2017].

Australian Bureau of Statistics, 2016. 5260.0.55.002 Estimates of Industry Multifactor Productivity, Australia, Canberra: Australian Bureau of Statistics.

Chaboud, A. P., Chiquoine, B., Hjalmarsson, E. & Vega, C., 2014. Rise of the Machines: Algorithmic Trading in the Foreign Exchange Market. The Journal of Finance, Volume 69, pp. 2045-2084.

Dilmegani, C., Korkmaz, B. & Lunqvist, M., 2014. Public Sector Digitization: The Trillion-Dollar Challenge. [Online]
Available at: www.mckinsey.com/business-functions/digital-mckinsey/our-insights/public-sector-digitization-the-trillion-dollar-challenge
[Accessed 6 September 2017].

Kirilenko, A., Kyle, A., Samadi, M. & Tuzun, T., 2017. The Flash Crash: High-Frequency Trading in an Electronic Market. The Journal of Finance, Volume 72, pp. 967-988.

Laird-Smith, J., Meyer, K. & Rajaratnam, K., 2016. A study of total beta specification through symmetric regression: the case of the Johannesburg Stock Exchange. Environment Systems and Decisions, 36(2), pp. 114-125.

Maley, K., 2017. Are algorithms under estimating the risk of a Pyongyang panic?. [Online]
Available at: http://bit.ly/2vLlY2r
[Accessed 6 September 2017].

Parnes, D., 2015. Performance Measurements for Machine-Learning Trading Systems. Journal of Trading, 10(4), pp. 5-16.

Qiao, Q. & Beling, P. A., 2016. Decision analytics and machine learning in economic and financial systems. Environment Systems & Decisions, 36(2), pp. 109-113.

The Economist Intelligence Unit, 2017. Unshackled algorithms: Machine-learning in finance. The Economist, 27 May.

Yin, S., Mo, K., Liu, A. & Yang, S. Y., 2016. News sentiment to market impact and its feedback effect. Environment Systems and Decisions, 36(2), pp. 158-166.

 

Closures in R

Put briefly, closures are functions that make other functions. Are you repeating a lot of code, but there’s no simple way to use the apply family or purrr to streamline the process? Maybe you could write your own closure. Closures enclose access to the environment in which they were created – so you can nest functions within other functions.

What does that mean exactly? Put simply, the function that gets made can see all the variables in the environment of the function that created it. That’s a useful property!

What’s the use case for a closure?

Are you repeating code, but instead of the variables or data changing, it’s the function that changes? That’s a good case for a closure. Especially as your code gets more complex, closures are a good way of modularising and containing concepts.

What are the steps for creating a closure?

Building a closure happens in several steps:

  1. Create the output function you’re aiming for. This function’s input is the same as what you’ll give it when you call it. It will return the final output. Let’s call this the enclosed function.
  2. Enclose that function within a constructor function. This constructor’s input will be the parameters by which you’re varying the enclosed function in Step 1. It will output the enclosed function. Let’s call this the enclosing function.
  3. Realise you’ve got it all wrong, go back to step 1. Repeat this multiple times. (Ask me how I know..)
  4. Next you need to create the enclosed function(s) (the ones from Step 1) by calling the enclosing function (the one from Step 2).
  5. Lastly, call and use your enclosed functions.

An example

Say I want to calculate the mean, SD and median of a data set. I could write:
x <- c(1, 2, 3)
mean(x)
median(x)
sd(x)

That would definitely be the most efficient way of going about it. But imagine that your real use case is hundreds of statistics or calculations on many, many variables. This will get old, fast.

I’m calling those three functions each in the same way, but the functions are changing rather than the data I’m using. I could write a closure instead:

stat <- function(stat_name){
  # stat_name is itself a function (e.g. mean); the function
  # returned below remembers it via its enclosing environment
  function(x){
    stat_name(x)
  }
}

This is made up of two parts: function(x){ }, which is the enclosed function, and stat(), which is the enclosing function.

Then I can call my closure to build my enclosed functions:

mean_of_x <- stat(mean)
sd_of_x <- stat(sd)
median_of_x <- stat(median)

Lastly I can call the created functions (probably many times in practice):

mean_of_x(x)
sd_of_x(x)
median_of_x(x)

I can repeat this for all the statistics/outcomes I care about. This example is too trivial to be realistic – it takes about double the lines of code and is grossly inefficient! But it’s a simple example of how closures work.
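For something a touch less trivial, a closure can also capture a parameter rather than a function. Here’s a hedged sketch with invented names, building functions that calculate trimmed means:

trimmed_mean_maker <- function(trim){
  # 'trim' is remembered by the function returned below
  function(x){
    mean(x, trim = trim)
  }
}

trim_10 <- trimmed_mean_maker(0.1)    # drops 10% from each tail
trim_25 <- trimmed_mean_maker(0.25)   # drops 25% from each tail

trim_10(c(1, 2, 3, 100))
trim_25(c(1, 2, 3, 100))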

More on closures

If you’re producing more complex structures, closures are very useful. See Jason’s post from Left Censored for a realistic bootstrap example – closures can streamline complex pieces of code, reducing mistakes and improving the process you’re trying to build. It takes modularity to the next step.

For more information in R see Hadley Wickham’s section on closures in Advanced R.