The annual release of the Australian national productivity statistics is only days away - this is an area I’ve worked in for quite some time. I’m interested to see what’s been happening over the last year. As it happens, I’ve also been reading through Hyndman and Athanasopolous brushing up on a few things I haven’t looked at in a decade or more (!) so I thought this was the perfect time for a a basic forecast to tide me over until release. The code comes straight from their book.

Here, I’ll be making extensive use of the forecast package by Hyndman and coauthors. It’s an incredibly useful piece of work that provides a helpful structure if time series nuances are not your thing, as well as a lot of flexibility if they are.

The data I’ll be using is the 2015-16 of Estimates of Industry Multifactor Productivity, Australia (series 5260.0.55.002) which was released on December 5, 2016. It’s December 4, 2017 at the time of writing: let’s see if we can come up with a forecast for Multifactor productivity for this past year.

Step 1: Load the data, set up a useful working environment for a basic forecast.

The ABS has a tendency to provide its data in all sorts of weird formats - or at least it seems weird to people who don’t use Excel alot.

It’s not messy, but it’s not tidy

It’s not messy, but it’s not “tidy”

In order to get started I needed to do a fair bit of cleaning up. In brief, there were a few steps:

* Load the data, turn it sideways so the variables are in columns
* Replace the column names with something more workable
* Get rid of empty/not useful variables
* Replace the ABS na error code with NA for R compatibility
* Reindex to 2013-14 financial year, more on that in a second
* Convert to a ts object

The last two steps are worth talking about: productivity is measured as an index, so there has to be a base year somewhere in the series that all other years are compared to. I tend to use a base year of 2013-14 as that’s what my coauthors use.

I also converted the dataframe to a ts object. This is a really useful datatype that comes with all sorts of built in tools via the forecast package.

## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag():    dplyr, stats

productivity<- read_xls("data/52600550022016.xls", sheet="Table 4", 
                     col_names = FALSE, skip=10, n_max = 17)

# the data is sideways and has difficult variable names - let's fix that!
productivity<-t(productivity) %>%
                      "Mutlifactor.Productivity.Null" ,

# Getting rid of empty space and unnecessary bits
productivity <- productivity[-1,]
productivity <- productivity[,c(1:3,5:6)]

# A few things going on here:
# 1. Replacing the ABS "na" with NA for R compatibility
# 2. Read the data in as numeric
# 3. Convert to data frame
# 4. Reindex to 2013-14 year
# 5. Convert to a time series object
productivity <- productivity %>% 
                    map(function(x) replace(x, x == "na", NA)) %>%
                    map(function(x) x = as.numeric(as.character(x))) %>%
                           simplify=TRUE) %>%
                    ts(start = 1973)
# Reindex to 2003-2004 which is row 31 in data frame

That was a fair bit of data wrangling, let’s take a look at what we’re given with the autoplot() function from forecast. This function is a quick way to visualise your data when you’re working through it interactively. It works with all the usual ggplot2 ‘bits’ and I’ve added theme_light() here because I’m not into grey rectangles.

Here I’m just looking at three productivity series: labour productivity (hours worked), capital productivity and multifactor productivity (hours worked), which is a combination of the two. The series dates back to 1973-4 and ends at 2015-16.