Mapping analytics objects

A lot of important work has been done on data science workflows, most notably by Jenny Bryan. If you’re new to thinking about workflows, start with the excellent STAT545 resources and Happy Git and GitHub for the useR. Jenny’s work got me thinking about my broader workflow.

As a consultant, I work with a ream of developing documents, datasets, requests, outputs and analyses: a collection of analytical ephemera I refer to as analytics objects. When looking at a big project, I’ve found it helpful to start mapping out how these objects interact, where they come from and how they work together.

Here’s a general concept map: individual projects vary a lot, but it’s a starting point.

A concept map with analytics objects.

Client request objects

My workflow tends to start with client requests and communications – everything from the initial “are you available, we have an idea” email to briefings, notes I’ve taken during meetings and documents I’ve been given.

At the start of a project this can be a lot of documents, and it’s not always easy to know where they should sit or how they should be managed.

A sensible solution tends to develop over time, but this is a stage where it’s easy to lose or forget about important things if it all stays in your inbox. One thing I often do at the start of a project is some basic document curation in a simple Excel sheet, so I know what I’ve got, where it came from and what’s in it.

I don’t usually bother curating every email or set of meeting notes, but anything that looks like it may be important or could be forgotten about goes in the list.
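For illustration, here’s a minimal sketch of the kind of curation log I mean. The documents, dates and columns are all made up; in practice mine usually lives in a plain Excel sheet rather than R.

```r
# A minimal sketch of a document curation log, with invented documents and
# columns - in practice this often lives in a simple Excel sheet instead.
library(tibble)

document_log <- tribble(
  ~document,            ~source,           ~received,    ~contents,
  "initial_brief.docx", "client email",    "2016-02-01", "project scope and timelines",
  "kickoff_notes.docx", "kickoff meeting", "2016-02-08", "agreed deliverables, key contacts",
  "survey_draft.xlsx",  "third party",     "2016-02-15", "draft survey instrument"
)

document_log
```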

a picture of a spreadsheet

Data objects

The next thing that happens is that people give me data, I go and find data, or some third party sends data my way.

There’s a lot of data flying about – sometimes it’s different versions of the same thing. Sometimes it’s supposed to be the same thing and it’s not.

It often comes with metadata attached (what’s in it, where it came from, who collected it and why) and documents that support it (survey instruments, sampling instructions etc.).

If I could go back and tell my early-career self one thing, it would be this: every time someone gives you data, don’t rely on their documentation – make your own.

It may be short, it may simply contain references to someone else’s documentation. But take the time to go through it and make sure you know what you have and what you don’t.

For a more detailed discussion of how I handle this in a low-tech environment/team, see here. Version control systems and Rmarkdown are my strong preference these days, if you’re working with a team that has the capacity to manage them: Rmarkdown is brilliant for building data dictionaries, metadata collections and other provenance information. But even if you’re not, and you need to rely on Excel files for notes, don’t skip this step.
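As a rough illustration, here’s the kind of chunk I might drop into an Rmarkdown document to build a simple data dictionary. The survey_data data frame and the descriptions are invented for the example.

```r
# A sketch of a data dictionary chunk for an Rmarkdown document. The
# survey_data data frame and descriptions are invented for illustration;
# knitr::kable() renders the dictionary as a table in the knitted output.
library(knitr)

survey_data <- data.frame(
  respondent_id = 1:3,
  region        = c("North", "South", "North"),
  score         = c(4.2, 3.8, 4.9)
)

data_dictionary <- data.frame(
  variable    = names(survey_data),
  class       = vapply(survey_data, function(x) class(x)[1], character(1)),
  description = c("Unique respondent identifier",
                  "Region where the survey was run",
                  "Satisfaction score on a 1-5 scale")
)

kable(data_dictionary, caption = "Data dictionary for survey_data")
```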

Next come the analysis and communications objects, which you’re probably familiar with.

Analysis and communications objects

(Warning: shameless R plug here)

The great thing about R is that it maps most of my analysis and communications objects for me. Using an R project as the basis for an analysis means that the provenance of all transformed data, analyses and visualisations is baked in. Version control with GitHub means I’m not messing around with 17 Excel files all called some variation of final_analysis.xlsx.

Using Rmarkdown and Shiny for as much communication with the client as possible means that I’ve directly linked my reporting, client-bound visualisations and returned data to my analysis objects.

That said, R can’t manage everything (but they’re working on it). Sometimes you need functionality R can’t provide, and R can’t tell you where your data came from if you don’t tell it first. Nor can it tell you whether you’re scoping a project sensibly.

Collaboration around an Rmarkdown document is difficult when most of your clients are not R users at all. One workaround for me has been to:

  • Export the Rmarkdown document as a Word document (a minimal sketch of this step follows the list)
  • Have non-technical collaborators make changes and updates via tracked changes
  • Depending on the stage of the project, input those changes back into R by hand or go forward with the Word document.
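The export step itself is one line. This is a minimal sketch, assuming a hypothetical report called report.Rmd.

```r
# A minimal sketch of the export step, assuming a hypothetical report.Rmd.
# rmarkdown::render() knits the document straight to a Word file that
# non-R collaborators can edit with tracked changes.
library(rmarkdown)

render("report.Rmd", output_format = "word_document")
```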

It’s not a perfect system by any means, but it’s the best I’ve got right now. (If you’ve got a better one, I’d love to hear about it.)

Objects inform other objects

In an ongoing engagement, your communications objects inform the client’s, and so on. Not all of these are in use at any given time, but as they get updated, or if projects run long term, important things can get lost or confused. Thinking about how all these objects work together has helped my workflow tremendously.

The lightbulb moment for me came when I started thinking about all my analytics objects as strategically as Jenny Bryan proposes we think about our statistical workflow. When I do that, the project is better organised and managed from the start.

Hannan-Quinn Information Criterion

This is a short post for all of you out there who use information criteria to inform model decision making. The usual suspects are the Akaike and Schwarz (Bayesian) criteria.

Especially if you’re working with big data, try adding the Hannan-Quinn (1979) criterion into the mix. It’s not often used in practice, for some reason; possibly this is a leftover from our small-sample days. Its penalty term grows with the sample size at a log(log(n)) rate – more slowly than the BIC’s log(n), but unlike the AIC’s it still grows. As a result, it’s often more conservative than the AIC in the number of parameters or the size of the model it suggests, e.g. it can be a good foil against overfitting.
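For reference, the criterion is HQC = -2 log L + 2k log(log n), where k is the number of estimated parameters and n the sample size. Here’s a minimal sketch comparing it with the AIC and BIC on a simulated linear model; the data and model are invented for illustration.

```r
# A minimal sketch comparing AIC, BIC and Hannan-Quinn on a simulated
# linear model. HQC = -2*logLik + 2*k*log(log(n)), so stats::AIC() can
# compute it if we swap in the Hannan-Quinn penalty.
set.seed(42)
n <- 10000
x <- rnorm(n)
y <- 2 + 0.5 * x + rnorm(n)
fit <- lm(y ~ x)

c(AIC = AIC(fit),                        # penalty 2 per parameter
  BIC = BIC(fit),                        # penalty log(n) per parameter
  HQC = AIC(fit, k = 2 * log(log(n))))   # penalty 2*log(log(n)) per parameter
```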

It’s not the whole answer, and for your purposes it may offer no different insight. But it adds nothing to your run time, is fabulously practical and is one of my all-time favourites that no one has ever heard of.

Political Donations 2015/16

Yesterday, the ABC released a dataset detailing donations made to political parties in Australia during the 2015-16 period. You can find their analysis and the data here. The data itself isn’t a particularly good representation of what was happening during the period: there isn’t a single donation to the One Nation Party among the lot of them, for example. It’s not a complete picture of what’s going on.

While the ABC made a pretty valiant effort to categorise where the donations were coming from, “uncategorised” was the last resort for many of the donors.

Who gets the money?

In total, there were 49 unique groups who received the money. Many of these were state branches of national parties, for example the Liberal Party of Australia – ACT Division, the Liberal Party of Australia (S.A. Division) and so on. I’ve grouped these and others like them together under their national party. Other groups included small, narrowly focussed parties like the Shooters, Fishers and Farmers Party and the Australian Sex Party. Micro parties like the Jacqui Lambie Network, Katter’s Australian Party and so on were grouped together. Parties with a conservative focus (Australian Christians, Family First, Democratic Labor Party) were grouped, as were those with a progressive focus (Australian Equality Party, Socialist Alliance). Parties focused on immigration were combined.
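A rough sketch of that grouping step is below, run on a toy data frame. The column names, recipient strings and values are invented for illustration; the real dataset uses the ABC’s naming.

```r
# A rough sketch of the grouping step on a toy data frame. The column
# names, recipient strings and values are invented for illustration.
library(dplyr)
library(tibble)

donations <- tribble(
  ~recipient,                                     ~value,
  "Liberal Party of Australia - ACT Division",     10000,
  "Australian Labor Party (State of Queensland)",   5000,
  "The Greens NSW",                                 2000,
  "Katter's Australian Party",                      1500
)

donations <- donations %>%
  mutate(recipient_group = case_when(
    grepl("Liberal", recipient)       ~ "Liberal Party",
    grepl("Labor", recipient)         ~ "Labor Party",
    grepl("Greens", recipient)        ~ "The Greens",
    grepl("Lambie|Katter", recipient) ~ "Micro parties",
    TRUE                              ~ "Other"
  ))

count(donations, recipient_group, wt = value)  # total declared value by group
```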

The following chart shows the value of the donation declared and the recipient group that received it.

Scatter plot

Only one individual donation exceeded $500 000, and that was to the Liberal Party. It obscures the rest of the distribution, so I’ve removed it in the next chart. Both major parties receive more donations than the other parties, which comes as no surprise to anyone. However, the Greens have a substantial proportion of very generous givers ($100 000+). The interesting question is not so much who received the money, but who gave it.

Scatter plot with outlier removed


Who gave the money?

This is probably the more interesting point. The following charts use the ABC’s categories to see if we can break down where the (declared) money trail lies (for donations of $500 000 and under). Again, the data confirmed what everyone already knew: unions give to the Labor Party, and finance and insurance gave heavily to the Liberal Party (among others). Several clusters stand out, though: uncategorised donors give substantially to minor parties, and the Greens have two major clusters of donors, individuals and a smaller group in the agriculture category.

Donor categories and value scatter plot

Breaking this down further, if we just look at where the money came from and who it went to, we can see that the immigration-focused parties are powered almost entirely by individual donations, with some from uncategorised donors. Minor parties are powered by family trusts, unions and uncategorised donors. The Greens are powered by individuals, uncategorised donors and agriculture, with some input from unions. What’s particularly interesting is the difference between Labor and Liberal donors. Compared to the Liberals, Labor has no donors in the tobacco industry, and it also has less input (by number of donations) from agriculture, alcohol, advocacy/lobby groups, sports and water management. It also has fewer donations from uncategorised donors and more from unions.

Donors and Recipients Scatterplot

What did we learn?

Some of what we learned here was common knowledge: Labor doesn’t take donations from tobacco, but it does from unions. The unions don’t donate to the Liberals, but advocacy and lobby groups do. The more interesting observations concern the smaller parties: the cluster of agricultural donations to the Greens – normally LNP heartland – and the individual donations powering the parties focussed on immigration. The latter may have something to say about the money powering the far right.


Describing simple statistics

I’m a huge believer in the usefulness of learning by doing. That makes me a huge fan of Shiny, which allows me to create and deploy simple apps that let students do just that.

This latest app is a simple one that allows you to manipulate either the mean or the variance of a normal distribution and see how that changes its shape.
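Here’s a minimal sketch of the core pattern: sliders for the mean and variance, with the density curve redrawn as they move. The deployed app differs, but the idea is the same.

```r
# A minimal sketch of the idea behind the app: sliders for the mean and
# variance of a normal distribution, with the density redrawn as they move.
# The deployed app differs; this is just the core pattern.
library(shiny)

ui <- fluidPage(
  titlePanel("The normal distribution"),
  sidebarLayout(
    sidebarPanel(
      sliderInput("mu",     "Mean",     min = -5,   max = 5,  value = 0),
      sliderInput("sigma2", "Variance", min = 0.25, max = 25, value = 1)
    ),
    mainPanel(plotOutput("density_plot"))
  )
)

server <- function(input, output) {
  output$density_plot <- renderPlot({
    x <- seq(-20, 20, length.out = 500)
    plot(x, dnorm(x, mean = input$mu, sd = sqrt(input$sigma2)),
         type = "l", xlab = "x", ylab = "Density")
  })
}

shinyApp(ui = ui, server = server)
```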

If you want to try out making Shiny apps, but need a place to start, check out Oliver Keyes’ excellent start-up guide.

application view1

application view 2

Using Natural Language Processing for Survey Analysis

Surveys have a specific set of analysis tools that are used for analysing the quantitative part of the data you collect (Stata is my particular poison of choice in this context). However, the interesting parts of a survey are often the unscripted, “tell us what you really think” comments.

Certainly this has been true in my own experience. I once worked on a survey deployed to teachers in Laos regarding resources for schools and teachers. All our quantitative information came back and was analysed, but one comment (translated for me into English by a brilliant colleague) stood out. It read something to the effect of “this is very nice, but the hole in the floor of the second storey is my biggest concern as a teacher”. It’s not something that would ever have been included outright in the survey, but a simple sentence told us a lot about the resources this school had access to.

Careful attention to detailed comments in small surveys is possible. But if you have thousands upon thousands of responses, this is far more difficult. Enter natural language processing.

There are a number of tools that can help in this context. This is a short overview of some I think are particularly useful.

  • Word Clouds. These are easy to prepare and very simple, but they can be a powerful way to communicate information. Like all data visualisation, there are good and bad examples. This is an example of a very simple word cloud, while this post by Fells Stats illustrates some more sophisticated ways of using the tool.

One way to extend the simple “bag of words” concept is to divide your sample into groups and compare clouds. Or create your own dictionary of words and concepts you’re interested in and only cloud those.

Remember that stemming the corpus is critical. For example, “work”, “worked”, “working” and “works” all belong to the same stem. They should be treated as one, or else particularly common stems are likely to swamp other themes (there’s a rough sketch of this step after the list).

Note that no word cloud should be constructed without removing “stop words” like the, and, a, I etc. Stop word dictionaries vary; they can (and should) be tailored to the problem at hand.

  • Network Analysis. If you have a series of topics you want to visualise relationships for, you could try a network-type analysis similar to this. The concept may be particularly useful if you manually decide topics of interest and then examine relationships between them. In this case, the outcome is very much user-dependent/chosen, but may be useful as a visualisation.
  • Word Frequencies. Alone, simple tables of word frequencies are not always particularly useful. In a corpus of documents pertaining to education, noting that “learning” is a common term isn’t something of particular note. However, how do these frequencies change by group? Do teachers speak more about “reading” than principals? Do people in one geographical area or salary bracket have a particular set of high frequency words compared to another? This is a basic exercise in feature/variable engineering. In this case, the usual data analysis tool kit applies (see here, here and here). Remember you don’t need to stop at high frequency words: what about high frequency phrases?
  •  TF-IDF (term frequency-inverse document frequency) matrix. This may provide useful information and is a basis of many more complex analyses. The TF-IDF downweights terms appearing in all documents/comments (“the”, “i”, “and” etc.) while upweighting rare words that may be of interest. See here for an introduction.
  • Are the comments clustered across some lower-dimensional space? The k-means algorithm may provide some data-driven guidance there. This would be an example of “unsupervised machine learning”, a.k.a. “an algorithm everyone has been using for 25 years but we need to call it something cool”. This may not generate anything obvious at first, but who is in those clusters and why are they there?
  • Sentiment analysis will be useful, possibly applied both to the entire corpus and to subsets. For example, among those who discussed “work-life balance” (and derivative terms), is the sentiment positive or negative? Is this consistent across all work/salary brackets? Are truck drivers more upbeat than bus drivers? Again, basic feature/variable engineering applies here. If you’re interested in this area, you could do a lot worse than learning from Julia Silge, who writes interesting and informative tutorials in R on the subject.
  • Latent Dirichlet Allocation (LDA) and more complex topic analyses. Finally, LDA or other topic models may be able to generate topics directly from the corpus. I think this would take a great deal of time for a new user and may have limited outcomes, particularly if an early analysis suggests you already have a clear idea of which topics are worth investigating. It is, however, particularly useful when dealing with enormous corpora. This is a really basic rundown of the concept. This is a little more complex, but has useful information.
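As a concrete starting point for the word cloud bullet above, here’s a rough sketch of the preprocessing: stop word removal and stemming with the tm package (stemming needs SnowballC installed), then a quick plot with wordcloud. The comments vector is invented for illustration.

```r
# A rough sketch of word cloud preprocessing: lower-casing, punctuation and
# stop word removal, then stemming (so "work", "working", "worked" collapse
# to one stem). The comments are invented; stemming requires SnowballC.
library(tm)
library(wordcloud)

comments <- c("The workload is fine but the working conditions are poor",
              "I work well with my colleagues and I enjoy the work",
              "Working hours are too long and the pay is too low")

corpus <- VCorpus(VectorSource(comments))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))  # drop "the", "and", "i" etc.
corpus <- tm_map(corpus, stemDocument)

tdm  <- TermDocumentMatrix(corpus)
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)

wordcloud(names(freq), freq, min.freq = 1)
```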

So that’s a brief rundown of some basic techniques you could try: there are plenty more out there; this is just the start. Enjoy!

Tutorials and Guides: A curated list

This post is a curated list of my favourite tutorials and guides because “that one where Hadley Wickham was talking about cupcakes” isn’t the most effective search term. You can find my list of cheat sheets here. There are a lot of great resources on data science (I’ve included my top picks), so I don’t intend to reinvent the wheel here. This is just a list of my favourites all laid out so I can find them again or point other people in their direction when it comes up in conversation. I’ve also added a number of the “how to” type posts I’ve written on this blog as I often answer an enquiry in that format.

Data Science

Tutorials and videos: General

Puppets teach data science too

  • Render a 3D object in R. I have no idea where I would ever use this information in my practice, but it’s presented BY A PUPPET. Great fun.
  • DIY your data science. Another offering from the puppet circle on the data science venn diagram.

Econometrics

Statistics

Work Flow

  • Guide to modern statistical workflow. Really great organisation of background material.
  • Tidy data, tidy models. Honestly, if there was one thing I wish had been around 10 years ago, this would be it. The amount of time and accuracy to be saved using this method is phenomenal.
  • Extracting data from the web. You found the data, now what do you do? Look here.

Linear Algebra

Asymptotics

Bayes

Machine learning

Data visualisation

Natural Language Processing

I’ll continue to update this list as I find things I think are useful or interesting.

Edit: actually, “that one where Hadley Wickham was talking about cupcakes” is surprisingly accurate as a search term.


Data Analysis: Enough with the Questions Already

We’ve talked a lot about data analysis lately. First we asked questions. Then we asked more. Hopefully when you’re doing your own analyses you have your own questions to ask. But sooner or later, you need to stop asking questions and start answering them.

Ideally, you’d really like to write something that doesn’t leave the reader with a keyboard imprint across their forehead due to analysis-induced narcolepsy. That’s not always easy, but here are some thoughts.

Know your story.

Writing up data analysis shouldn’t be about listing means, standard deviations and some dodgy histograms. Yes, sometimes you need that stuff, but mostly what you need is a compelling narrative. What is the data saying to support your claims?

It doesn’t all need to be there. 

You worked out that tricky bit of code and did that really awesome piece of analysis that led you to ask questions and… sorry, no one cares. If it’s not a direct part of your story, it probably needs to be consigned to telling your nerd friends on Twitter – at least they’ll understand what you’re talking about. But keep it out of the write-up!

How is it relevant?

Data analysis is rarely an end in itself. How does your analysis support the rest of your project? Does it offer insight for modelling or forecasting? Does it offer insight for decision making? Make sure your reader knows why it’s worth reading.

Do you have an internal structure?

Data analysis is about translating complex numerical information into text. A clear and concise structure for your analysis makes life much easier for the reader.

If you’re staring at the keyboard wondering whether checking every social media account you’ve had since high school is a valid procrastination option, try starting with “three important things”. Then maybe add three more. Now you have a few things to say and can build from there.

Who are you writing for?

Academia, business, government, your culture, someone else’s, fellow geeks, students… all of these have different expectations around communication, and all of them are interested in different things. Don’t take a single approach to communicating analysis with every group. Remember that what’s important to you may not be important to your reader.

Those are just a few tips for writing up your analyses. As we’ve said before: it’s not a one-size-fits-all approach. But hopefully you won’t feel compelled to give a list of means, a correlation matrix and four dodgy histograms that fit in the space of a credit card. We can do better than that!

Data Analysis: More Questions

In our last post on data analysis, we asked a lot of questions. Data analysis isn’t a series of generic questions we can apply to every dataset we encounter, but asking questions can be a helpful way to frame the beginning of your analysis. This post is, simply, some more questions to ask yourself if you’re having trouble getting started.

The terminology I use below (tall, dense and wide) is due to Francis Diebold. You can find his original post here and it’s well worth a read.

Remember, these generic questions aren’t a replacement for a thoughtful, strategic analysis. But maybe they will help you generate your own questions to ask your data.

Data analysis infographic