Note: there is a point at which it will be more efficient to migrate the blog over to blogdown rather than continue with workarounds like this. Yes… I can see it’s very close too.
It was a farcical display of an absence of leadership. And the data it provides is not remotely as good as a properly executed survey.
Nonetheless, it had our national attention for months and it’s over.
Here’s a Shiny app because my Facebook discussions got a little detailed. Now everyone can have a look at the data on a by-electorate basis.
Some hot takes for you:
- When thinking about outcomes in ‘electorates with a high proportion of migrants’, also think about the massively different treatment effects caused by the fact there was little to no outreach from the yes campaign to non-English-speaking communities, while some others targeted these communities with misinformation regarding the impact of gay marriage on schools. (That’s not a diss on the yes campaign: limited resources and all of that. They were in it to win a nation, not single electorates.)
- Remember that socioeconomic advantage is a huge confound in just about everything.
- The survey asked about changing a status quo. That’s not entirely the same thing as being actively homophobic: but I’ll agree in this case that’s a fine line to draw.
- Why didn’t areas with high migrant populations in other cities follow the same patterns?
- Did Sydney diocesan involvement, both in terms of investment and pulpit rhetoric, create a different treatment effect compared to other cities?
And one thing I think we should all be constantly aware of, even as we nerds are enjoying our dissection:
- This data was generated on the backs of the suffering of many GBLTIQ+ Australians and their families.
Bring on equality.
I think the differences between a model, an estimation method and an algorithm are not always well understood. Identifying differences helps you understand what your choices are in any given situation. Once you know your choices you can make a decision rather than defaulting to the familiar.
An algorithm is a set of predefined steps. Making a cup of coffee can be defined as an algorithm, for example. Algorithms can be nested within each other to create complex and useful pieces of analysis. Gradient descent is an algorithm for finding the minima of a function computationally. Newton-Raphson does the same job using second derivatives: it typically needs fewer iterations, but each one costs more to compute. Stochastic gradient descent goes the other way, making each step cheaper by working from a random sample of the data.
An estimation method is the manner in which your model is estimated (often with an algorithm). To take a simple linear regression model, there are a number of ways you can estimate it:
- You can estimate using the ordinary least squares closed form solution (it’s just an algebraic identity). After that’s done, there’s a whole suite of econometric techniques to evaluate and improve your model.
- You can estimate it using maximum likelihood: you calculate the negative log-likelihood and then you use a computational algorithm like gradient descent to find the minimum. The econometric techniques are pretty similar to the closed form solution, though there are some differences.
- You can estimate a regression model using machine learning techniques: divide your sample into training, test and validation sets; estimate by whichever algorithm you like best. Note that in this case, this is essentially a utilisation of maximum likelihood. However, machine learning has a slightly different value system to econometrics with a different set of cultural beliefs on what makes “a good model.” That means the evaluation techniques used are often different (but with plenty of crossover).
The model is the thing you’re estimating using your algorithms and your estimation methods. It’s the decisions you make when you decide if Y has a linear relationship with X, or which variables (features) to include and what functional form your model has.
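To make the distinction concrete, here is a minimal sketch in Python (my choice for a dependency-free example, not something the post prescribes). The model stays fixed as a straight line; only the estimation method changes. The data, the learning rate and the function names are all made up for illustration. Under normally distributed errors, minimising squared error is the same as maximising the likelihood, so the gradient descent version stands in for the maximum likelihood route.

```python
import random

random.seed(1)

# Hypothetical data from the model y = 2 + 3x + noise.
x = [random.uniform(0, 1) for _ in range(200)]
y = [2.0 + 3.0 * xi + random.gauss(0, 0.5) for xi in x]


def ols_closed_form(x, y):
    """Estimation method 1: the algebraic least squares solution."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def gradient_descent(x, y, learning_rate=0.1, steps=5000):
    """Estimation method 2: minimise mean squared error with an algorithm."""
    n = len(x)
    intercept, slope = 0.0, 0.0
    for _ in range(steps):
        residuals = [intercept + slope * xi - yi for xi, yi in zip(x, y)]
        grad_intercept = 2 * sum(residuals) / n
        grad_slope = 2 * sum(r * xi for r, xi in zip(residuals, x)) / n
        intercept -= learning_rate * grad_intercept
        slope -= learning_rate * grad_slope
    return intercept, slope


closed = ols_closed_form(x, y)
descended = gradient_descent(x, y)
print("closed form:     ", closed)
print("gradient descent:", descended)
```

Both routes land on the same intercept and slope (up to numerical precision), which is the point: same model, different estimation method, different algorithm.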
One of the things I find hardest about data visualisation is colouring. I’m not a natural artist, much preferring everything in gentle shades of monochrome. Possibly beige. Obviously, for any kind of data visualisation, this is limiting. Quite frankly, this is the kind of comfort zone that needs setting on fire.
I’ve found this site really helpful: it’s a listing of the Pantone colours with both Hex and RGB codes for inserting straight into your visualisations. It’s a really useful correspondence if I’m working with someone: they can give me the Pantone colour numbers of their website or report palette and I just search the page.
One thing I’ve found, however, is that a surprising (to me) number of people have some kind of colour-based visual impairment. A palette that looks great to me may be largely meaningless to someone I’m working with. I found this out in one of those forehead slapping moments when I couldn’t understand why a team member wasn’t seeing the implications of my charts. That’s because, to him, those charts were worse than useless. They were a complete waste of his time.
Some resources I’ve found helpful in making my visualisations more accessible are the colourblind-friendly palettes discussed here and this discussion on R-Bloggers. The latter made me realise that up until now I’ve been building visualisations that were obscuring vital information for many users.
The things I think are important for building an accessible visualisation are:
- Yes, compared to more subtle palettes, colour-blind friendly palettes look like particularly lurid unicorn vomit. They don’t have to look bad if you’re careful about combinations, but I’m of the opinion that prioritising accessibility for my users is more important than “pretty”.
- Redundant encoding (discussed in the R-bloggers link above) is a great way of ensuring users can make out the information you’re trying to get across. To make sure this is apparent in your scale, use a combination of scale_colour_manual() and scale_linetype_manual(). The latter works the same as scale_colour_manual() but is not as well covered in the literature.
- Consider reducing the information you’re putting into each chart, or using a combination of facets and multiple panels. The less there is to differentiate, the easier it can be on your users. This is a good general point and not limited to those with colourblindness.
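As a rough sketch of redundant encoding outside ggplot2, the Python snippet below pairs every series with both a colour and a line style, which is the same idea as combining scale_colour_manual() with scale_linetype_manual(). The hex codes are the widely used Okabe-Ito colour-blind-friendly palette; that choice is my assumption and may not match the palettes linked above. The function names are hypothetical, and the plotting helper assumes matplotlib is installed.

```python
from itertools import cycle

# Okabe-Ito colour-blind-friendly palette (an assumed choice, see above).
OKABE_ITO = [
    "#E69F00",  # orange
    "#56B4E9",  # sky blue
    "#009E73",  # bluish green
    "#F0E442",  # yellow
    "#0072B2",  # blue
    "#D55E00",  # vermillion
    "#CC79A7",  # reddish purple
    "#000000",  # black
]
LINE_STYLES = ["-", "--", "-.", ":"]


def redundant_styles(series_names):
    """Give each series a colour AND a line style, so neither carries the
    information alone. Combinations are unique for up to 8 series."""
    styles = {}
    for name, colour, linestyle in zip(
        series_names, cycle(OKABE_ITO), cycle(LINE_STYLES)
    ):
        styles[name] = {"color": colour, "linestyle": linestyle}
    return styles


def plot_series(data, styles):
    """Plot each series with its paired colour and line style.
    matplotlib is a third-party package, imported only when plotting."""
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    for name, values in data.items():
        ax.plot(range(len(values)), values, label=name, **styles[name])
    ax.legend()
    return fig


styles = redundant_styles(["control", "treatment", "placebo"])
print(styles)
```

If a reader can’t separate two colours, the line styles still tell the series apart.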
Yesterday, the ABC released a dataset detailing donations made to political parties in Australia during the 2015-16 period. You can find their analysis and the data here. The data itself isn’t a particularly good representation of what was happening during the period: there isn’t a single donation to the One Nation Party among the lot of them, for example. This data isn’t a complete picture of what’s going on.
While the ABC made a pretty valiant effort to categorise where the donations were coming from, “uncategorised” was the last resort for many of the donors.
Who gets the money?
In total, there were 49 unique groups who received the money. Many of these were state branches of national parties, for example the Liberal Party of Australia – ACT Division, Liberal Party of Australia (S.A. Division) and so on. I’ve grouped these and others like them together under their national party. Other groups included small narrowly-focussed parties like the Shooters, Fishers and Farmers Party and the Australian Sex Party. Micro parties like the Jacqui Lambie Network, Katter’s Australian Party and so on were grouped together. Parties with a conservative focus (Australian Christians, Family First, Democratic Labor Party) were grouped and those with a progressive focus (Australian Equality Party, Socialist Alliance) were also grouped together. Parties focused on immigration were combined.
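The grouping step can be sketched roughly as below, in Python. This is an illustration of the idea, not the code behind the charts: the regular expressions only cover the two branch naming patterns quoted above, the group labels are hypothetical, and the lookup table lists only parties named in this post.

```python
import re

# Hypothetical group labels; only parties named in the post are listed.
GROUP_OF = {
    "Jacqui Lambie Network": "Minor parties",
    "Australian Christians": "Conservative parties",
    "Family First": "Conservative parties",
    "Democratic Labor Party": "Conservative parties",
    "Australian Equality Party": "Progressive parties",
    "Socialist Alliance": "Progressive parties",
}


def national_party(recipient):
    """Strip a state or territory division suffix from a recipient name."""
    # Handles names like "... – ACT Division".
    name = re.sub(r"\s*[–-]\s*[^–-]*Division$", "", recipient)
    # Handles names like "... (S.A. Division)".
    name = re.sub(r"\s*\([^)]*Division\)$", "", name)
    return name.strip()


def recipient_group(recipient):
    """Map a declared recipient to the group used for charting."""
    name = national_party(recipient)
    return GROUP_OF.get(name, name)


for recipient in [
    "Liberal Party of Australia – ACT Division",
    "Liberal Party of Australia (S.A. Division)",
    "Family First",
]:
    print(recipient, "->", recipient_group(recipient))
```

Anything the lookup doesn’t recognise falls through under its own (cleaned) name, so nothing is silently dropped.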
The following chart shows the value of the donation declared and the recipient group that received it.
Only one individual donation exceeded $500 000 and that was to the Liberal Party. It’s obscuring the rest of the distribution, so I’ve removed it in the next chart. Both major parties receive more donations than the other parties, which comes as no surprise to anyone. However, the Greens have a substantial proportion of very generous givers ($100 000+). The interesting question is not so much who received the money as who gave it.
Who gave the money?
This is probably the more interesting point. The following charts use the ABC’s categories to see if we can break down where the (declared) money trail lies (for donations $500 000 and under). Again, the data confirmed what everyone already knew: unions give to the Labor party. Finance and insurance gave heavily to the Liberal Party (among others). Several clusters stand out, though: uncategorised donors give substantially to minor parties and the Greens have two major clusters of donors: individuals and a smaller one in the agriculture category.
Breaking this down further, if we just look at where the money came from and who it went to, we can see that the immigration-focused parties are powered almost entirely by individual donations with some from uncategorised donors. Minor parties are powered by family trusts, unions and uncategorised donors. The Greens are powered by individuals, uncategorised donors and agriculture, with some input from unions. What’s particularly interesting is the difference between Labor and Liberal donors. Compared to Liberal, Labor has no donors in the tobacco industry and has fewer donations, by number, from agriculture, alcohol, advocacy/lobby groups, sports and water management. Labor also has fewer donations from uncategorised donors and more from unions.
What did we learn?
Some of what we learned here was common knowledge: Labor doesn’t take donations from tobacco, but it does from unions. The unions don’t donate to Liberal, but advocacy and lobby groups do. The more interesting observations are focussed on the smaller parties: the cluster of agricultural donations for the Greens (agriculture is normally LNP heartland) and the individual donations powering the parties focussed on immigration. The latter may have something to say about the money powering the far right.
“Productivity … isn’t everything, but in the long run it’s nearly everything.” Paul Krugman, The Age of Diminished Expectations (1994).
So in the very long run, what’s the Australian experience? I recently did some work with the Department of Communications and the Arts on digital techniques and developments. Specifically, we were looking at the impacts advances in fields like machine learning, artificial intelligence and blockchain may have on productivity in Australia. I worked with a great team at the department led by the Chief Economist Paul Paterson and we’re looking forward to our report being published.
In the meantime, here’s the very long run on productivity downunder.
The yield to maturity concept describes the approximate rate of return a bond generates if it’s held until redemption date. It’s dependent on a few things including the coupon rate (nominal interest rate), face value of the bond, price of the bond and the time until maturity.
It can get a little confusing with the mathematics behind it, so I’ve created a simple Shiny App that allows you to manipulate the inputs to observe what happens. Bear in mind this is not a financial calculator: it’s an interactive for educational purposes. It also shows the approximate, not exact, yield to maturity of a bond, which is fine for our purposes.
I’ve mapped the yield up to a 30-year redemption and assumed a face value of $100. Coupon rate varies between 0% and 25%. Current price of the bond can vary between $50 and $150. Mostly, the yield curve is very flat in this simplified approximation, but observe what happens when there is only a short time to maturity (0-5 years) and rates or price are extreme. You can find the interactive directly here.
Remember, this is just an approximation. For a more accurate calculation, see here.
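For readers who would rather see the arithmetic, here is a small Python sketch. It assumes the common textbook approximation (annual coupon plus the annualised gain or loss to redemption, divided by the average of face value and price); the app may use a different formula. For comparison it also solves for the exact yield by bisection, assuming annual coupons and a whole number of years. All function names are mine.

```python
def approx_ytm(face, price, coupon_rate, years):
    """Textbook approximation to yield to maturity (assumed formula)."""
    coupon = coupon_rate * face
    return (coupon + (face - price) / years) / ((face + price) / 2)


def bond_price(face, coupon_rate, years, yield_rate):
    """Present value of annual coupons plus redemption at face value."""
    coupon = coupon_rate * face
    pv_coupons = sum(
        coupon / (1 + yield_rate) ** t for t in range(1, years + 1)
    )
    return pv_coupons + face / (1 + yield_rate) ** years


def exact_ytm(face, price, coupon_rate, years, tol=1e-10):
    """Find the yield that reproduces the price, by bisection.
    Price falls as yield rises, so the search interval can be halved."""
    low, high = -0.9, 10.0
    while high - low > tol:
        mid = (low + high) / 2
        if bond_price(face, coupon_rate, years, mid) > price:
            low = mid  # implied price too high, so the yield guess is too low
        else:
            high = mid
    return (low + high) / 2


# A $100 bond with a 5% coupon, at a few prices and maturities.
for price, years in [(100, 10), (90, 10), (60, 2)]:
    print(
        price,
        years,
        round(approx_ytm(100, price, 0.05, years), 4),
        round(exact_ytm(100, price, 0.05, years), 4),
    )
```

At par the two agree; the gap opens up when the price is far from face value and the time to maturity is short.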
One of the perennial discussions is normal vs t distributions: which do you use, when, why and so on. This is one of those cases where for most sample sizes in a business analytics/data science context it probably makes very little practical difference. Since that’s such a rare thing for me to say, I thought it was worth explaining.
Now I’m all for statistical rigour: you should use the right one at the right time for the right purpose, in my view. However, this can be one of those cases where if the sample size is large enough, it’s just not that big a deal.
The actual simulations I ran are very simple: just 10 000 draws each from a normal distribution and from t-distributions at different degrees of freedom. Then I just plotted the density for each on the same graph using ggplot in R. If you’d like to have a play around with the code, leave a comment to let me know and I’ll post it to GitHub.
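In the meantime, a comparable simulation can be sketched in a few lines of Python. This is not my original R code: instead of plotting densities it compares how often each distribution lands beyond ±1.96, and the degrees of freedom (2, 5, 30, 100) and the seed are arbitrary choices. The t draws are built from the definition, a standard normal divided by the square root of a chi-squared over its degrees of freedom.

```python
import math
import random

random.seed(42)
N_DRAWS = 10_000


def draw_t(df):
    """One draw from a t-distribution with df degrees of freedom."""
    z = random.gauss(0, 1)
    chi2 = random.gammavariate(df / 2, 2)  # chi-squared(df) is Gamma(df/2, scale 2)
    return z / math.sqrt(chi2 / df)


def tail_share(draws, cutoff=1.96):
    """Share of draws further from zero than the cutoff."""
    return sum(abs(d) > cutoff for d in draws) / len(draws)


normal_tail = tail_share([random.gauss(0, 1) for _ in range(N_DRAWS)])
t_tails = {
    df: tail_share([draw_t(df) for _ in range(N_DRAWS)])
    for df in (2, 5, 30, 100)
}

print("normal:", normal_tail)
for df, share in t_tails.items():
    print(f"t, df={df}:", share)
```

The tails are noticeably fatter at very low degrees of freedom and close to the normal once the degrees of freedom are large.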
I’ve been wanting to learn Shiny for quite some time, since it seems to me that it’s a fantastic tool for communicating data science concepts. So I created a very simple app which allows you to manipulate a data generation process from weak through to strong correlation and then interprets the associated regression slope coefficient for you.
I made it because, whilst we often teach simple linear regression and correlation as two intermeshed ideas, students at this level rarely have the opportunity to manipulate the concepts to see how they interact. This is easily fixable with a simple app in Shiny. If you want to start working in Shiny, then I highly recommend Oliver Keyes’ excellent start up guide, which was extremely easy to follow for this project.
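The link between the two ideas can also be sketched without an app. The Python snippet below is an illustration, not the app’s code: the data generating process, the sample size and the scale of y are all assumptions. It simulates data at a chosen correlation, then shows that the regression slope is the correlation rescaled by the ratio of the standard deviations of y and x.

```python
import math
import random


def simulate(rho, n=1000, sd_y=2.0, seed=0):
    """Simulate (x, y) pairs whose population correlation is rho."""
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]
    y = [
        sd_y * (rho * xi + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1))
        for xi in x
    ]
    return x, y


def summarise(x, y):
    """Return the sample correlation, the regression slope of y on x,
    and the ratio of the standard deviations sd(y) / sd(x)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    r = sxy / math.sqrt(sxx * syy)
    slope = sxy / sxx
    return r, slope, math.sqrt(syy / sxx)


results = {rho: summarise(*simulate(rho)) for rho in (0.1, 0.5, 0.9)}
for rho, (r, slope, sd_ratio) in results.items():
    print(rho, round(r, 3), round(slope, 3), round(r * sd_ratio, 3))
```

The last two printed columns match: a stronger correlation means a steeper slope only after accounting for how spread out y is relative to x.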