Australia’s same sex marriage survey

It was a farcical display of an absence of leadership. And the data it provides is not remotely as good as a properly executed survey.

Nonetheless, it had our national attention for months and it’s over.

Here’s a Shiny app because my Facebook discussions got a little detailed. Now everyone can have a look at the data on a by-electorate basis.

Some hot takes for you:

  • When thinking about outcomes in ‘electorates with a high proportion of migrants’, also think about the massively different treatment effects caused by the fact there was little to no outreach from the yes campaign to non English speaking communities, while some others targeted these communities with misinformation regarding the impact of gay marriage on schools. (That’s not a diss on the yes campaign: limited resources and all of that. They were in it to win a nation, not single electorates.)
  • Remember that socioeconomic advantage is a huge confound in just about everything.
  • The survey asked about changing a status quo. That’s not entirely the same thing as being actively homophobic: but I’ll agree in this case that’s a fine line to draw.
  • Why didn’t areas with high migrant populations in other cities follow the same patterns?
  • Did Sydney diocesan involvement, both in terms of investment and pulpit rhetoric create a different treatment effect compared to different cities?

And one thing I think we should all be constantly aware of, even as we nerds are enjoying our dissection:

  • This data was generated on the backs of the suffering of many GBLTIQ+ Australians and their families.

Bring on equality.

Code here.

Data here.

App in full screen here.

Data Visualisation: Hex Codes, Pantone Colours and Accessibility

One of the things I find hardest about data visualisation is colouring. I’m not a natural artist, much preferring everything in gentle shades of monochrome. Possibly beige. Obviously for any kind of data visualisation, this limited .Quite frankly this is the kind of comfort zone that needs setting on fire.

I’ve found this site really helpful: it’s a listing of the Pantone colours with both Hex and RGB codes for inserting straight into your visualisations. It’s a really useful correspondence if I’m working with someone (they can give me the Pantone colour numbers of their website or report palette- I just search the page).

One thing I’ve found, however, is that a surprising (to me) number of people have some kind of colour-based visual impairment. A palette that looks great to me may be largely meaningless to someone I’m working with. I found this out in one of those forehead slapping moments when I couldn’t understand why a team member wasn’t seeing the implications of my charts. That’s because, to him, those charts were worse than useless. They were a complete waste of his time.

Some resources I’ve found helpful in making my visualisations more accessible are the colourblind-friendly palettes discussed here and this discussion on R-Bloggers. The latter made me realise that up until now I’ve been building visualisations that were obscuring vital information for many users.

The things I think are important for building an accessible visualisation are:

  • Yes, compared to more subtle palettes, colour-blind friendly palettes look like particularly lurid unicorn vomit. They don’t have to look bad if you’re careful about combinations, but I’m of the opinion that prioritising accessibility for my users is more important than “pretty”.
  • Redundant encoding (discussed in the R-bloggers link above) is a great way ensuring users can make out the information you’re trying to get across. To make sure this is apparent in your scale, use a combination of scale_colour_manual() and scale_linetype_manual(). The latter works the same as scale_colour_manual() but is not as well covered in the literature.
  • Consider reducing the information you’re putting into each chart, or using a combination of facets and multiple panels. The less there is to differentiate, the easier it can be on your users. This is a good general point and not limited to those with colourblindness.

Interpreting Models: Coefficients, Marginal Effects or Elasticities?

I’ve spoken about interpreting models before. I think that this is the most important part of our work, communicating results. However, it’s one that’s often overlooked when discussing the how-to of data science. That’s why marginal effects and elasticities are better for this purpose than coefficients alone.

Model build, selection and testing is complex and nuanced. Communicating the model is sometimes harder, because a lot of the time your audience has no technical background whatsoever. Your stakeholders can’t go up the chain with, “We’ve got a model. And it must be a good model because we don’t understand any of it.”

Our stakeholders also have a limited attention span so the explanation process is two fold: explain the model and do it fast.

For these reasons, I usually interpret models for my stakeholders with marginal effects and elasticities, not coefficients or log-odds. Coefficient interpretation is very different for regressions depending on functional form and if you have interactions or polynomials built into your model, then the coefficient is only part of the story. If you have a more complex model like a tobit, conditional logit or other option, then interpretation of coefficients is different for each one.

I don’t know about your stakeholders and reporting chains: mine can’t handle that level of complexity.

Marginal effects and elasticities are also different for each of these models but they are by and large interpreted in the same way. I can explain the concept of a marginal effect once and move on. I don’t even call it a “marginal effect”: I say “if we increase this input by a single unit, I expect [insert thing here]” and move on.

Marginal effects and elasticities are often variable over the range of your sample: they may be different at the mean than at the minimum or maximum, for example. If you have interactions and polynomials, they will also depend on covarying inputs. Some people see this as added layers of complexity.

In the age of data visualisation, I see it as an opportunity to chart these relationships and visualise how your model works for your stakeholders.

We all know they like charts!

Bonds: Prices, Yields and Confusion- a Visual Guide

Bonds have been the talk of the financial world lately. One minute it’s a thirty-year bull market, the next it’s a bondcano. Prices are up, yields are down and that’s bad. But then in the last couple of months, prices are down and yields are up and that’s bad too, apparently. I’m going to take some of the confusion out of these relationships and give you a visual guide to what’s been going on in the bond world.

The mathematical relationship between bond prices and yields can be a little complicated and I know very few people who think their lives would be improved by more algebra in it. So for our purposes, the fundamental relationship is that bond prices and yields move in opposite directions. If one is going up, the other is going down. But it’s not a simple 1:1 relationship and there are a few other factors at play.

There are several different types of bond yields that can be calculated:

  • Yield to maturity: the yield you would get if you hold the bond until it matures.
  • Yield to call: the yield you would get if you hold the bond until its call date.
  • Yield to worst: the worst outcome on a bond, whether it is called or held to maturity.
  • Running yield: this is roughly the yield you would get from holding the bond for a year.

We are going to focus on yield to maturity here, but a good overview of yields generally can be found at FIIG. Another good overview is here.


To explain all this (without algebra), I’ve created two simulations. These show the approximate yield to maturity against the time to maturity, coupon rate and the price paid for the bond. For the purposes of this exercise, I’m assuming that our example bonds have a face value of $100 and a single annual payment.

The first visual shows what happens as we change the price we pay for the bond. When we buy a bond below face value (at, say $50 when its face value is $100), yield is higher. But if we buy that bond at $150, then yield is much lower. As price increases, yield decreases.

The time the bond has until maturity matters a lot here, though. If there is only a short time to maturity then the differences between below/above face value can be very large. If there are decades to maturity, then these differences tend to be much smaller. The shading of the blue dots represent the coupon rate that might be attached to a bond like this- the darkest colours will have the highest coupon rate and the lighter colour will have the lowest coupon rates. Again, the differences matter more when there is less time for a bond to mature.

Prices gif

The second animation is a representation of what happens as we change the coupon rate (e.g. the interest rate the debtor is paying to the bond holder). The lines of dots represent differences in the price paid for the bond. The lighter colours represent a cheaper purchase below face value (better yields- great!). The darker colours represent an expensive purchase above face value (lower yields-not so great).

If we buy a bond cheaply, then the yield may be higher than the coupon rate. If we buy it over the face value, then the yield may be lower than the coupon rate. The difference between them is less the longer the bond has to mature. When the bond is very close to maturity those differences can be quite large.

Coupon Gif

When discussing bonds, we often mention something called the yield curve and this describes the yield a bond (or group of bonds) will generate over their life time.

If you’d like to have a go at manipulating the coupon rate and the price to manipulate an approximate yield curve, you can check out this interactive I built here.

Remember that all of these interactives and animations are approximate, if you want to calculate yield to maturity exactly, you can use an online calculator like the one here.

So how does this match the real data that gets reported on daily? Our last chart shows the data from the US Treasury 10-year bills that were sold on the 25th of November 2016. The black observations are bonds maturing within a year, the blue are those that have longer to run.  Here I’ve charted the “Asked Yield”, which is the yield a buyer would receive if the seller sold their bond at the price they were asking. Sometimes, however, the bond is bought at a lower bid, so the actual yield would be a little higher. I’ve plotted this against the time until the bond matures. We can see that the actual yield curve produced is pretty similar to our example charts.

This was the yield curve from one day. The shape of the yield curve will change on a day-to-day basis depending on the prevailing market conditions (e.g. prices). It will also change more slowly over time as the Federal Reserve issues bonds with higher or lower coupon rates, depending on economic conditions.

yield curve

Data: Wall Street Journal.

Bond yields and pricing can be confusing, but hopefully as you’re reading the financial pages they’re a lot less so now.

A huge thanks to my colleague, Dr Henry Leung at the University of Sydney for making some fantastic suggestions on this piece.


Political Donations 2015/16

Yesterday, the ABC released a dataset detailing donations made to political parties in Australia during the 2015-16 period. You can find their analysis and the data here. The data itself isn’t a particularly good representation of what was happening during the period: there isn’t a single donation to the One Nation Party among the lot of them, for example. This data isn’t a complete picture of what’s going on.

While the ABC made a pretty valiant effort to categorise where the donations were coming from, “uncategorised” was the last resort for many of the donors.

Who gets the money?

In total, there were 49 unique groups who received the money. Many of these were state branches of national parties, for example the Liberal Party of Australia – ACT Division, Liberal Party of Australia (S.A. Division) and so on. I’ve grouped these and others like it together under their national party. Other groups included small narrowly-focussed parties like the Shooters, Fishers and Farmers Party and the Australian Sex Party. Small micro parties like the Jacqui Lambie Network, Katter’s Australian Party and so on were grouped together. Parties with a conservative focus (Australian Christians, Family First, Democratic Labor Party) were grouped and those with a progressive focus (Australian Equality Party, Socialist Alliance) were also grouped together. Parties focused on immigration were combined.

The following chart shows the value of the donation declared and the recipient group that received it.

Scatter plot

Only one individual donation exceeded $500 000 and that was to the Liberal Party. It’s obscuring the rest of the distribution, so I’ve removed it in the next chart. Both the major parties receive more donations than the other parties, which comes as no surprise to anyone. However, the Greens have a proportion of very generous givers ($100 000+) which is quite substantial. The interesting question is not so much as who received it, but who gave the money.

Scatter plot with outlier removed


Who gave the money?

This is probably the more interesting point. The following charts use the ABC’s categories to see if we can break down where the (declared) money trail lies (for donations $500 000 and under). Again, the data confirmed what everyone already knew: unions give to the Labor party. Finance and insurance gave heavily to the Liberal Party (among others). Several clusters stand out, though: uncategorised donors give substantially to minor parties and the Greens have two major clusters of donors: individuals and a smaller one in the agriculture category.

Donor categories and value scatter plot

Breaking this down further, if we just look at where the money came from and who it went to, we can see that the immigration-focused parties are powered almost entirely by individual donations with some from uncategorised donors. Minor parties are powered by family trusts, unions and uncategorised donors. Greens by individuals, uncategorised and agriculture with some input from unions. What’s particularly interesting is the differences in Labor and Liberal donors. Compared to Liberal, Labor does not have donors in the tobacco industry, but also has less input by number of donations in agriculture, alcohol, advocacy/lobby groups, sports and water management. They also have fewer donations from uncategorised donors and more from unions.

Donors and Recipients Scatterplot

What did we learn?

Some of what we learned here was common knowledge: Labor doesn’t take donations from tobacco, but it does from unions. The unions don’t donate to Liberal, but advocacy and lobby groups do. The more interesting observations are focussed on the smaller parties: the cluster of agricultural donations for the Greens Party – normally LNP heartland; and the individual donations powering the parties focussed on immigration. The latter may have something to say for the money powering the far right.


Productivity: In the Long Run, It’s Nearly Everything.

“Productivity … isn’t everything, but in the long run it’s nearly everything.” Paul Krugman, The Age of Diminished Expectations (1994)

So in the very long run, what’s the Australian experience? I recently did some work with the Department of Communications and the Arts on digital techniques and developments. Specifically, we were looking at the impacts advances in fields like machine learning, artificial intelligence and blockchain may have on productivity in Australia. I worked with a great team at the department led by the Chief Economist Paul Paterson and we’re looking forward to our report being published.

In the meantime, here’s the very long run on productivity downunder.

Australian Productivity Chart

Does it matter in practice? Normal vs t distribution

One of the perennial discussions is normal vs t distributions: which do you use, when, why and so on. This is one of those cases where for most sample sizes in a business analytics/data science context it probably makes very little practical difference. Since that’s such a rare thing for me to say, I thought it was worth explaining.

Now I’m all for statistical rigour: you should use the right one at the right time for the right purpose, in my view. However, this can be one of those cases where if the sample size is large enough, it’s just not that big a deal.

The actual simulations I ran are very simple, just 10 000 draws from normal and t-distributions with the t varying at different degrees of freedom. Then I just plotted the density for each on the same graph using ggplot in R. If you’d like to have a play around with the code, leave a comment to let me know and I’ll post it to github.

Describing simple statistics

I’m a huge believer in the usefulness of learning by doing. That makes me a huge believer in Shiny, which allows me to create and deploy simple apps that allow students to do just that.

This latest app is a simple one that allows you to manipulate either the mean or the variance of a normal distribution and see how that changes the shape of the distribution.

If you want to try out making Shiny apps, but need a place to start, check out Oliver Keyes’ excellent start up guide.

application view1

application view 2