Hannan Quinn Information Criteria

This is a short post for all of you out there who use information criteria to inform model decision making. The usual suspects are the Akaike and the Schwarz (Bayesian) criteria.

Especially if you’re working with big data, try adding the Hannan-Quinn criterion (1979) into the mix. It’s not often used in practice for some reason; possibly this is a leftover from our small-sample days. Its penalty term grows at the rate log(log(n)), the slowest rate that still gives a consistent criterion. As a result, it’s often more conservative than the AIC in the number of parameters or the size of the model it suggests, which makes it a good foil against overfitting.
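
For concreteness, the criterion is just -2·log-likelihood + 2k·log(log(n)), so it costs essentially nothing to compute alongside the others. Here’s a minimal sketch in R; the helper name hqc is mine, not a built-in function.

    # Hannan-Quinn criterion for any fitted model with a logLik() method
    # (lm, glm, arima, ...). HQC = -2*logLik + 2*k*log(log(n)).
    hqc <- function(model) {
      ll <- logLik(model)
      k  <- attr(ll, "df")    # number of estimated parameters
      n  <- attr(ll, "nobs")  # sample size
      -2 * as.numeric(ll) + 2 * k * log(log(n))
    }

    fit <- lm(dist ~ speed, data = cars)
    c(AIC = AIC(fit), BIC = BIC(fit), HQC = hqc(fit))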

It’s not the whole answer, and for your purposes it may offer no different insight. But it adds nothing to your run time, it’s fabulously practical, and it’s one of my all-time favourites that no one has ever heard of.

Interpreting Models: Coefficients, Marginal Effects or Elasticities?

I’ve spoken about interpreting models before. Communicating results is, I think, the most important part of our work, yet it’s often overlooked when we discuss the how-to of data science. For that job, marginal effects and elasticities serve better than coefficients alone.

Model building, selection and testing are complex and nuanced. Communicating the model is sometimes harder, because a lot of the time your audience has no technical background whatsoever. Your stakeholders can’t go up the chain with, “We’ve got a model. And it must be a good model because we don’t understand any of it.”

Our stakeholders also have a limited attention span, so the explanation process is twofold: explain the model, and do it fast.

For these reasons, I usually interpret models for my stakeholders with marginal effects and elasticities, not coefficients or log-odds. Coefficient interpretation differs with the regression’s functional form, and if you have interactions or polynomials built into your model, a single coefficient is only part of the story. If you have a more complex model like a tobit, a conditional logit or something else again, the interpretation of coefficients is different for each one.

I don’t know about your stakeholders and reporting chains: mine can’t handle that level of complexity.

Marginal effects and elasticities are also different for each of these models but they are by and large interpreted in the same way. I can explain the concept of a marginal effect once and move on. I don’t even call it a “marginal effect”: I say “if we increase this input by a single unit, I expect [insert thing here]” and move on.
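
As a toy illustration of where that single-unit statement comes from (my own example on the mtcars data, not anything from the post), here’s the average marginal effect of one predictor in a simple logistic regression, computed by hand:

    # Logit model: does a car have a manual transmission, given its weight?
    fit <- glm(am ~ wt, data = mtcars, family = binomial)

    p    <- fitted(fit)                # predicted probabilities
    b_wt <- coef(fit)["wt"]            # log-odds coefficient (hard to explain as-is)
    ame  <- mean(b_wt * p * (1 - p))   # logit marginal effect beta * p * (1 - p), averaged
    ame  # "an extra 1000 lbs of weight changes P(manual) by about this much, on average"

The coefficient is a change in log-odds; the averaged marginal effect is a change in probability, which is far easier to say out loud.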

Marginal effects and elasticities are often variable over the range of your sample: they may be different at the mean than at the minimum or maximum, for example. If you have interactions and polynomials, they will also depend on covarying inputs. Some people see this as added layers of complexity.

In the age of data visualisation, I see it as an opportunity to chart these relationships and visualise how your model works for your stakeholders.
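
To make that concrete, here is the same toy logit again (still my own illustration), this time charting the observation-level marginal effects so the audience can see the effect changing across the sample:

    # Marginal effect of weight for every car in the sample: it is not one number
    fit   <- glm(am ~ wt, data = mtcars, family = binomial)
    p     <- fitted(fit)
    me_wt <- coef(fit)["wt"] * p * (1 - p)

    plot(mtcars$wt, me_wt,
         xlab = "Car weight (1000 lbs)",
         ylab = "Marginal effect on P(manual transmission)")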

We all know they like charts!

Bonds: Prices, Yields and Confusion - a Visual Guide

Bonds have been the talk of the financial world lately. One minute it’s a thirty-year bull market, the next it’s a bondcano. Prices are up, yields are down and that’s bad. But then in the last couple of months, prices are down and yields are up and that’s bad too, apparently. I’m going to take some of the confusion out of these relationships and give you a visual guide to what’s been going on in the bond world.

The mathematical relationship between bond prices and yields can be a little complicated, and I know very few people who think their lives would be improved by adding more algebra to them. So for our purposes, the fundamental relationship is that bond prices and yields move in opposite directions. If one is going up, the other is going down. But it’s not a simple 1:1 relationship and there are a few other factors at play.

There are several different types of bond yields that can be calculated:

  • Yield to maturity: the yield you would get if you hold the bond until it matures.
  • Yield to call: the yield you would get if you hold the bond until its call date.
  • Yield to worst: the worst outcome on a bond, whether it is called or held to maturity.
  • Running yield: the annual coupon income as a proportion of the bond’s current price – roughly the yield you would get from holding the bond for a year.

We are going to focus on yield to maturity here, but a good overview of yields generally can be found at FIIG. Another good overview is here.


To explain all this (without algebra), I’ve created two simulations. These show the approximate yield to maturity against the time to maturity, coupon rate and the price paid for the bond. For the purposes of this exercise, I’m assuming that our example bonds have a face value of $100 and a single annual payment.
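
A rough sketch of the textbook approximation I’m assuming sits behind these animations (my own reconstruction, not the original simulation code): approximate yield to maturity for a bond with a $100 face value and a single annual coupon.

    # Approximate YTM: (coupon + (face - price) / years) / average of face and price
    approx_ytm <- function(price, coupon_rate, years, face = 100) {
      coupon <- coupon_rate * face
      (coupon + (face - price) / years) / ((face + price) / 2)
    }

    # A 5% coupon bond bought below face value at $90 with 10 years left to run
    approx_ytm(price = 90, coupon_rate = 0.05, years = 10)   # about 6.3%, above the coupon rate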

The first visual shows what happens as we change the price we pay for the bond. When we buy a bond below face value (at, say, $50 when its face value is $100), the yield is higher. If we buy that same bond at $150, the yield is much lower. As price increases, yield decreases.

The time the bond has until maturity matters a lot here, though. If there is only a short time to maturity, the differences between buying below and above face value can be very large. If there are decades to maturity, these differences tend to be much smaller. The shading of the blue dots represents the coupon rate that might be attached to a bond like this: the darkest colours have the highest coupon rates and the lightest colours the lowest. Again, the differences matter more when there is less time for the bond to mature.

[Animation: approximate yield to maturity against time to maturity as the purchase price changes]

The second animation shows what happens as we change the coupon rate (i.e. the interest rate the debtor is paying to the bond holder). The lines of dots represent differences in the price paid for the bond. The lighter colours represent a cheaper purchase below face value (better yields - great!). The darker colours represent an expensive purchase above face value (lower yields - not so great).

If we buy a bond cheaply, the yield may be higher than the coupon rate. If we buy it above face value, the yield may be lower than the coupon rate. The longer the bond has until maturity, the smaller that difference; when the bond is very close to maturity, the differences can be quite large.

[Animation: approximate yield to maturity against time to maturity as the coupon rate changes]

When discussing bonds, we often mention something called the yield curve, which plots the yields of a bond (or group of bonds) against the time left until maturity.

If you’d like to have a go at changing the coupon rate and the price to see how an approximate yield curve responds, you can check out the interactive I built here.

Remember that all of these interactives and animations are approximate. If you want to calculate yield to maturity exactly, you can use an online calculator like the one here.

So how does this match the real data that gets reported on daily? Our last chart shows data on US Treasury securities quoted on the 25th of November 2016. The black observations are those maturing within a year; the blue are those with longer to run. Here I’ve charted the “asked yield”, which is the yield a buyer would receive if the seller sold the bond at their asking price. Sometimes, however, the bond is bought at a lower bid, so the actual yield would be a little higher. I’ve plotted this against the time until each bond matures. The yield curve this produces is pretty similar to our example charts.

This was the yield curve from one day. The shape of the yield curve changes from day to day depending on the prevailing market conditions (i.e. prices). It will also change more slowly over time as the US Treasury issues bonds with higher or lower coupon rates, depending on economic conditions.

[Chart: asked yield against time to maturity for US Treasury securities, 25 November 2016]

Data: Wall Street Journal.

Bond yields and pricing can be confusing, but hopefully they’ll be a little less so the next time you’re reading the financial pages.

A huge thanks to my colleague, Dr Henry Leung at the University of Sydney for making some fantastic suggestions on this piece.


Describing simple statistics

I’m a huge believer in the usefulness of learning by doing. That makes me a huge believer in Shiny, which lets me create and deploy simple apps so that students can do just that.

This latest app is a simple one that allows you to manipulate either the mean or the variance of a normal distribution and see how that changes the shape of the distribution.
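
Here’s a minimal sketch of an app along those lines (my own reconstruction, not the deployed app; I’ve used a standard deviation slider rather than variance):

    library(shiny)

    ui <- fluidPage(
      sliderInput("mu",    "Mean",               min = -5,  max = 5, value = 0, step = 0.1),
      sliderInput("sigma", "Standard deviation", min = 0.2, max = 3, value = 1, step = 0.1),
      plotOutput("density")
    )

    server <- function(input, output) {
      output$density <- renderPlot({
        # redraw the normal density every time a slider moves
        x <- seq(-10, 10, length.out = 500)
        plot(x, dnorm(x, mean = input$mu, sd = input$sigma),
             type = "l", xlab = "x", ylab = "Density")
      })
    }

    shinyApp(ui, server)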

If you want to try out making Shiny apps, but need a place to start, check out Oliver Keyes’ excellent start up guide.

[Screenshots: two views of the app]

Exploring Correlation and the Simple Linear Regression Model

I’ve been wanting to learn Shiny for quite some time, since it seems to me that it’s a fantastic tool for communicating data science concepts. So I created a very simple app which allows you to manipulate a data generation process from weak through to strong correlation and then interprets the associated regression slope coefficient for you.

Here it is!

The reason I made it is that while we often teach simple linear regression and correlation as two intermeshed ideas, students at this level rarely have the opportunity to manipulate the concepts and see how they interact. This is easily fixed with a simple app in Shiny. If you want to start working in Shiny, I highly recommend Oliver Keyes’ excellent start-up guide, which was extremely easy to follow for this project.
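
The data generation idea underneath is simple enough to sketch outside Shiny (my own illustration, not the app’s actual code):

    # Generate bivariate data with a chosen correlation, then inspect the fitted slope
    set.seed(42)
    simulate_fit <- function(n = 200, rho = 0.7) {
      x <- rnorm(n)
      y <- rho * x + sqrt(1 - rho^2) * rnorm(n)   # cor(x, y) is approximately rho
      fit <- lm(y ~ x)
      c(sample_correlation = cor(x, y), slope = unname(coef(fit)["x"]))
    }

    simulate_fit(rho = 0.2)   # weak relationship: slope near 0.2, noisy scatter
    simulate_fit(rho = 0.9)   # strong relationship: slope near 0.9, tight scatter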

[Screenshot: app view]

Using Natural Language Processing for Survey Analysis

Surveys have a specific set of analysis tools that are used for analysing the quantitative part of the data you collect (Stata is my particular poison of choice in this context). However, often the most interesting parts of the survey are the unscripted, “tell us what you really think” comments.

Certainly this has been true in my own experience. I once worked on a survey deployed to teachers in Laos regarding resources for schools and teachers. All our quantitative information came back and was analysed, but one comment (translated for me into English by a brilliant colleague) stood out. It read something to the effect of “this is very nice, but the hole in the floor of the second storey is my biggest concern as a teacher”. It’s not something that would ever have been included outright in the survey, but a simple sentence told us a lot about the resources this school had access to.

Careful attention to detailed comments in small surveys is possible. But if you have thousands upon thousands of responses, this is far more difficult. Enter natural language processing.

There are a number of tools which can be useful in this context. This is a short overview of some that I think are particularly useful.

  • Word Clouds. These are easy to prepare and very simple, but they can be a powerful way to communicate information. Like all data visualisation, there are good examples and bad ones. This is an example of a very simple word cloud, while this post by Fells Stats illustrates some more sophisticated ways of using the tool.

One way to extend the simple “bag of words” concept is to divide your sample into groups and compare clouds. Another is to create your own dictionary of words and concepts you’re interested in and cloud only those.

Remember that stemming the corpus is critical. For example, “work”, “worked”, “working” and “works” all belong to the same stem. They should be treated as one; otherwise, if the concept is particularly common, its variants are likely to swamp other themes.

Note that no word cloud should be constructed without removing “stop words” such as the, and, a, I and so on. Stop word dictionaries vary - they can (and should) be tailored to the problem at hand.

  • Network Analysis. If you have a series of topics you want to visualise relationships for, you could try a network-type analysis similar to this. The approach may be particularly useful if you manually decide the topics of interest and then examine the relationships between them. In that case the outcome is very much user-chosen, but it can still be useful as a visualisation.
  • Word Frequencies. On their own, simple tables of word frequencies are not always particularly useful. In a corpus of documents about education, noting that “learning” is a common term tells you very little. But how do those frequencies change by group? Do teachers speak more about “reading” than principals? Do people in one geographical area or salary bracket have a particular set of high-frequency words compared with another? This is a basic exercise in feature/variable engineering, and the usual data analysis tool kit applies (see here, here and here). Remember you don’t need to stop at high-frequency words: what about high-frequency phrases?
  • TF-IDF (term frequency-inverse document frequency) matrix. This may provide useful information and is the basis of many more complex analyses. TF-IDF downweights terms appearing in all (or nearly all) documents or comments (“the”, “I”, “and” etc.) while upweighting rare words that may be of interest. See here for an introduction; a short sketch pulling several of these preprocessing steps together appears after this list.
  • Are the comments clustered across some lower-dimensional space? The k-means algorithm may provide some data-driven guidance there. This would be an example of “unsupervised machine learning”, vis-à-vis “an algorithm everyone has been using for 25 years but we need to call it something cool”. It may not generate anything obvious at first - but who is in those clusters, and why are they there?
  • Sentiment analysis will be useful, possibly both applied to the entire corpus and to subsets. For example, among those who discussed “work life balance” (and derivative terms) is the sentiment positive or negative? Is this consistent across all work/salary brackets? Are truck drivers more upbeat than bus drivers? Again, basic feature/variable engineering applies here. If you’re interested in this area, you could do a lot worse than learning from Julia Silge who writes interesting and informative tutorials in R on the subject.
  • Latent Dirichlet Allocation (LDA) and more complex topic analyses. Finally, latent Dirichlet allocation or other more complex topic models may be able to generate topics directly from the corpus. I think this would take a great deal of time for a new user and may have limited returns, particularly if early analysis suggests you already have a clear idea of which topics are worth investigating. It is, however, particularly useful when dealing with enormous corpora. This is a really basic rundown of the concept. This is a little more complex, but has useful information.
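
Here is the sketch promised above, pulling several of the preprocessing steps together on a few made-up survey comments (my own toy example, using the tidytext workflow that Julia Silge’s tutorials cover): tokenise, drop stop words, stem, then build word frequencies and a TF-IDF weighting.

    library(dplyr)
    library(tidytext)
    library(SnowballC)

    comments <- tibble::tibble(
      respondent = c("teacher_1", "teacher_2", "principal_1"),
      text = c("The reading resources are working well",
               "We need more reading books and better classrooms",
               "Teachers are working hard but the building needs repair")
    )

    tokens <- comments %>%
      unnest_tokens(word, text) %>%            # one row per word, lower-cased
      anti_join(stop_words, by = "word") %>%   # drop "the", "and", "but", ...
      mutate(stem = wordStem(word))            # "working" -> "work", "books" -> "book"

    # word frequencies by respondent, then TF-IDF to downweight ubiquitous terms
    counts <- tokens %>% count(respondent, stem, sort = TRUE)
    counts %>% bind_tf_idf(stem, respondent, n)

    # a simple cloud of the stemmed terms could then be drawn with
    # wordcloud::wordcloud(counts$stem, counts$n)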

So that’s a brief rundown of some basic techniques you could try; there are plenty more out there - this is just the start. Enjoy!

Tutorials and Guides: A curated list

This post is a curated list of my favourite tutorials and guides because “that one where Hadley Wickham was talking about cupcakes” isn’t the most effective search term. You can find my list of cheat sheets here. There are a lot of great resources on data science (I’ve included my top picks), so I don’t intend to reinvent the wheel here. This is just a list of my favourites all laid out so I can find them again or point other people in their direction when it comes up in conversation. I’ve also added a number of the “how to” type posts I’ve written on this blog as I often answer an enquiry in that format.

Data Science

Tutorials and videos: General

Puppets teach data science too

  • Render a 3D object in R. I have no idea where I would ever use this information in my practice, but it’s presented BY A PUPPET. Great fun.
  • DIY your data science. Another offering from the puppet circle on the data science venn diagram.

Econometrics

Statistics

Work Flow

  • Guide to modern statistical workflow. Really great organisation of background material.
  • Tidy data, tidy models. Honestly, if there is one thing on this list I wish had been around 10 years ago, this is it. The amount of time and accuracy to be saved using this method is phenomenal.
  • Extracting data from the web. You found the data, now what to do? Look here.

Linear Algebra

Asymptotics

Bayes

Machine learning

Data visualisation

Natural Language Processing

I’ll continue to update this list as I find things I think are useful or interesting.

Edit: actually, “that one where Hadley Wickham was talking about cupcakes” is surprisingly accurate as a search term.

[Screenshot: search results]

It got wet

NSW got wet this weekend. In our own particular case we lost a large amount of our driveway and several paddocks spontaneously attained lake status. So there was nothing else to do but to poke around and see what I could turn up in the historic record (find yours here).

Some locals recorded up to 250mm in 24 hours this weekend. I thought that was an extraordinary amount until I checked the data (only available up until April this year so far, alas).

It turns out that sometime in the late sixties the local rainfall station recorded an extraordinary 392mm in 24 hours. Now that’s an outlier…!

I’ll invest in a new pair of gumboots just in case.

[Chart: smoothed scatter plot of daily rainfall against time]

If you’re into this sort of thing, the plot was done using the “smoothScatter” function in R. It’s a change from the usual time series line chart. I think I’m a convert.
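
For anyone wanting to try it, the call is about as minimal as base R graphics get. A tiny sketch with made-up data (not the actual station record):

    # smoothScatter() shades point density rather than overplotting thousands of dots
    set.seed(1)
    years    <- seq(1960, 2016, by = 1 / 365)                    # daily observations
    rainfall <- rgamma(length(years), shape = 0.15, scale = 30)  # skewed, mostly-dry days (mm)

    smoothScatter(years, rainfall, xlab = "Year", ylab = "Daily rainfall (mm)")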

Continuous, Censored and Truncated Data: what are the differences and do you need to care?

Whenever I work with someone whose statistical or econometric experience has been more practical than theoretical, two things happen. The first is that the poor person inexplicably develops a twitch whenever I launch into an enthusiastic tangent that requires a sheet of graph paper and extensive hand waving.

The other thing that inevitably happens is that the digression comes to an end and the question is asked “but does that matter in practice?”

When it comes to model selection, the difference between data types really does matter. You may make your choices one way or another, but understanding the differences (both obvious and subtle) lets you make those choices knowing that you have them.

This post is a cliff-notes version of the issue. Maybe you’ve heard of these differences in data types and just need a memory jog. Maybe you’ve not heard of them at all and want somewhere simple to start.

Continuous data is pretty simple: it’s data that can lie anywhere on the real line, from very large negative numbers to very large positive numbers. The normal distribution is the classic example of a continuous distribution.

Truncated data, on the other hand, is data which is continuous but has the added complication of only being observed above or below a certain point. The classic example suggested by Greene is income [1]. One example would be if we only surveyed the income of those earning above the tax-free threshold: then we would have truncated data.

Censored data is similar, but here the issue is not which units make it into the sample; it’s how much information is recorded about them. Some parts of the distribution are obscured, but not ignored. The survey may, for example, interview people at all income levels, but only record incomes above the tax-free threshold in dollar terms and describe the rest as “under the tax threshold”. In this case all parts of the distribution are represented, but the level of information differs above and below the threshold.
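
A quick simulated sketch of the difference (my own illustration; the threshold is a made-up stand-in for a tax-free threshold, not a claim about any actual tax system):

    set.seed(123)
    threshold <- 18200                                        # hypothetical tax-free threshold
    income    <- rlnorm(10000, meanlog = 10.8, sdlog = 0.6)   # latent incomes

    truncated <- income[income > threshold]   # below-threshold people never enter the sample
    censored  <- pmax(income, threshold)      # everyone is sampled, but values below the
                                              # threshold are only recorded "at the threshold"

    c(full = mean(income), truncated = mean(truncated), censored = mean(censored))
    # both the truncated and the censored means overstate the population mean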

Most people are aware of the issues with modelling categorical data using techniques designed for continuous data. However, censored and truncated data also need special treatment. A lot of the data we deal with has a natural truncation point: distance isn’t negative, and prices are not (well, hardly ever) negative. Recognising that you may be dealing with truncated or censored data is an important part of initial data analysis. For a thorough discussion, see W. H. Greene’s chapter on the subject here.

In practice, continuous data methodologies may work quite well for these types of data as long as there isn’t a large amount of data sitting at or near the truncation or censoring point (which is often zero).

Test scores are something I’ve worked with a lot. In my experience, once the proportion of zero scores began to approach around 20%, I needed to switch over to models designed for the issue. In the 10%-20% range I will often try a few different models to see which is most appropriate. That’s just a general rule of thumb - your mileage may vary.

Hand waving and furious graph-paper drawing aside: yes, in this case knowing the differences does matter in practice.

Notes:

[1] W. H. Greene, Econometric Analysis, is a classic text; here I’m looking at p. 756 in the fifth edition. There are three copies of this book living in my house. Definitely worth the investment if you are looking for either a classic text covering everything econometrics, or a useful TV stand. What can I say? We were young and poor, and a matched set of texts made up for deficits in our furniture budget. I’ve owned this book for nearly twenty years and I still use it, even long after we can afford furniture.