Three things every new data scientist should know

Anyone who has spent any time in the online data science community knows that this kind of post is a genre all on its own. “N things you should know/do/be/learn/never do” is something that pops up in my twitter feed several times a day. These posts range from useful ways to improve your own practice to clickbait listing reams of accomplishments that make Miss Bingley’s “accomplished young ladies” speech in Pride and Prejudice appear positively unambitious.

Miss Bingley’s pronouncement could easily be applied to data scientists everywhere:

“Oh! certainly,” cried his faithful assistant, “no [woman] can be really esteemed accomplished who does not greatly surpass what is usually met with. A woman must have a thorough knowledge of music, singing, drawing, dancing, and the modern languages, to deserve the word; and besides all this, she must possess a certain something in her air and manner of walking, the tone of her voice, her address and expressions, or the word will be but half-deserved.”

Swap out the references to women with “data scientist”, throw in a different skill set and there we have it:

“Oh! certainly,” cried his faithful assistant, “no data scientist can be really esteemed accomplished who does not greatly surpass what is usually met with. A data scientist must have a thorough knowledge of programming in every conceivable language that was, is or shall be, linear algebra, business acumen, obscure models only ever applied in obscure places, and whatever is “hot” this year, to deserve the title; and besides all this, she must possess a certain something in her air and manner of tweeting, the tone of her blogging, her linkedin profile and be a snappy dresser, or the title will be but half-deserved.”

Put like that, you’d be forgiven for not allowing the Miss Bingleys of the world to define you.

If I had a list of things to say to new data scientists, they wouldn’t have much to do with data science at all:

  1. You define yourself and your own practice. Not twitter, not an online community, not blogs from people who may or may not know your work. Data science encompasses an incredibly broad array of people, ideas and tools. Maybe you’re in the middle of it, maybe you’re on the edge. That’s OK: it’s all valuable.
  2. You’re more than a bot. This is an industry that is increasing automation every day. You add value to your organisation in ways that a bot never can. What is the value you add? Cultivate and grow it.
  3. The online community is a wonderful place full of people who want to help you grow your practice and potential. Dive in and explore: but remember that the advice and pronouncements are just that. They won’t all apply to you all the time. Take what’s useful today and put the rest aside until it’s useful later.

It’s a short list!

It got wet

NSW got wet this weekend. In our own particular case we lost a large amount of our driveway and several paddocks spontaneously attained lake status. So there was nothing else to do but to poke around and see what I could turn up in the historic record (find yours here).

Some locals recorded up to 250mm in 24 hours this weekend. I thought that was an extraordinary amount until I checked the data (only available up until April this year so far, alas).

It turns out that sometime in the late sixties the local rainfall station recorded an extraordinary 392mm in 24 hours. Now that’s an outlier…!

I’ll invest in a new pair of gumboots just in case.

Smooth scatter plot rainfall

If you’re into this sort of thing, the plot was done using the “smoothScatter” function in R. It’s a change from the usual time series line chart. I think I’m a convert.
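
For anyone who wants to try it, the sketch below shows the general idea; the data frame, column names and numbers are invented for illustration rather than being the rainfall record I actually used.

```r
# A sketch of the idea, not the original code: daily rainfall totals against
# time, with smoothScatter() drawing a smoothed density rather than points.
# The data frame and its columns are invented for illustration.
set.seed(42)
rain <- data.frame(
  date = seq(as.Date("1960-01-01"), as.Date("2016-04-30"), by = "day")
)
rain$mm <- rgamma(nrow(rain), shape = 0.3, scale = 15)  # fake daily totals

# smoothScatter() is in the base graphics package; dates are converted to
# numeric for the 2D kernel density estimate underneath.
smoothScatter(as.numeric(rain$date), rain$mm,
              xlab = "Days since 1970-01-01", ylab = "Daily rainfall (mm)",
              main = "Daily rainfall, smoothed scatter")
```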

Continuous, Censored and Truncated Data: what are the differences and do you need to care?

Whenever I work with someone whose statistical or econometric experience has been more practical than theoretical, two things happen. The first is that the poor person inexplicably develops a twitch whenever I launch into an enthusiastic tangent that requires a sheet of graph paper and extensive hand waving.

The other thing that inevitably happens is that the digression comes to an end and the question is asked “but does that matter in practice?”

When it comes to model selection, the difference between data types really does matter. You may make your choices one way or another, but understanding the differences (both obvious and subtle) lets you make those choices knowing that you have them.

This post is a cliff-notes version of the issue. Maybe you’ve heard of these differences in data types and just need a memory jog. Maybe you’ve not heard of them at all and want somewhere simple to start.

Continuous data is pretty simple: it’s data that can take any value on the real line. That is, it can be anywhere from very large negative numbers to very large positive numbers. The normal distribution is a classic example of a continuous distribution.

Truncated data, on the other hand, is data which is continuous but has the added complication of only being observed above or below a certain point. The classic example suggested by Greene is income [1]. One example would be if we only surveyed the income of those earning above the tax-free threshold: then we would have truncated data.

Censored data is similar. Here the issue is not in the way the data is sampled but in the way it is recorded: some parts of the distribution are obscured, but not ignored. The survey may, for example, interview people at all income levels, but only record incomes above the tax-free threshold in dollar terms and describe the rest as “under the tax threshold”. In this case all parts of the distribution are reported on, but the level of information differs above and below the threshold.
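
If it helps to see the distinction in code, here is a small simulated illustration; the income distribution and the threshold value are made up for the example.

```r
# A quick simulated illustration of the difference between truncation and
# censoring. The income distribution and tax-free threshold are invented.
set.seed(1)
income    <- rlnorm(10000, meanlog = 10, sdlog = 0.6)  # hypothetical incomes
threshold <- 18200

# Truncated sample: observations below the threshold never enter the data.
truncated <- income[income > threshold]

# Censored sample: everyone is surveyed, but incomes below the threshold are
# all recorded as the threshold value ("under the tax threshold").
censored <- pmax(income, threshold)

length(truncated)            # fewer observations than we started with
mean(censored == threshold)  # a point mass sitting at the censoring point
```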

Most people are aware of the issues with modelling categorical data using techniques designed for continuous data. However, censored and truncated data also need special treatment. A lot of the data we deal with has a natural truncation point: distance isn’t negative, prices are not (well, hardly ever) negative. Recognising that you may be dealing with truncated or censored data is an important part of initial data analysis. For a thorough discussion, see W.H. Greene’s chapter on the subject here.

In practice, continuous data methodologies may work quite well for these types of data as long as there isn’t a large amount of data sitting at or near the truncation or censoring point (which is often zero).

Test scores are something I’ve worked with a lot. In my experience, once the proportion of zero scores began to approach around 20%, I needed to switch over to models designed for the issue. In the 10%-20% range I will often try a few different models to see which is most appropriate. That’s just a general rule of thumb: your mileage may vary.
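
As a rough illustration of why the switch matters, here is a minimal sketch on simulated data, using the tobit() function from the AER package (one option among several for censored regression):

```r
# A minimal sketch on simulated data with a pile-up at zero, comparing OLS
# with a censored regression. tobit() comes from the AER package; any
# censored-regression tool will do.
library(AER)

set.seed(2)
n      <- 1000
x      <- rnorm(n)
y_star <- 1 + 2 * x + rnorm(n)   # latent outcome
y      <- pmax(y_star, 0)        # observed outcome, censored at zero

mean(y == 0)                     # proportion sitting at the censoring point

ols <- lm(y ~ x)                 # treats the zeros as genuine values
tob <- tobit(y ~ x, left = 0)    # models the censoring explicitly

coef(ols)                        # slope pulled towards zero by the censoring
coef(tob)                        # much closer to the true slope of 2
```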

Hand waving and furious graph-paper drawing aside: yes in this case knowing the differences does matter in practice.

Notes:

[1] W. H. Greene, Econometric Analysis, is a classic text and here I’m looking at p. 756 in the fifth edition. There are three copies of this book living in my house. Definitely worth the investment if you are looking for either a classic text covering just about everything in econometrics or a useful TV stand. What can I say? We were young and poor and a matched set of texts made up for deficits in our furniture budget. I’ve owned this book for nearly twenty years and I still use it, even long after we could afford furniture.

The Central Limit Theorem: Misunderstood

Asymptotics are the building blocks of many models. They’re basically lego: sturdy, functional and capable of allowing the user to exercise great creativity. They also hurt like hell when you don’t know where they are and you step on them accidentally. I’m pushing it on the last, I’ll admit. But I have gotten very sweary over recalcitrant limiting distributions in the past (though I may be in a small group there).

One of the fundamentals of the asymptotic toolkit is the Central Limit Theorem, or CLT for short. If you didn’t study eight semesters of econometrics or statistics, then it’s something you (might have) sat through a single lecture on and walked away with the hot take “more data is better”.

The CLT is actually a collection of theorems, but the basic entry-level version is the Lindeberg-Lévy CLT. It states that for any sample of n random, independent observations drawn from any distribution with finite mean (μ) and standard deviation (σ), if we calculate the sample mean x-bar, then

$$\sqrt{n}\,\frac{\bar{x} - \mu}{\sigma} \;\xrightarrow{d}\; N(0,\,1) \quad \text{as } n \to \infty.$$

In my time both in industry and in teaching, I’ve come across a number of interpretations of this result: many of them very wrong, and many from very smart people. I’ve found it useful to clarify what this result does and does not mean, as well as when it matters.

Not all distributions become normal as n gets large. In fact, most things don’t “tend to normality” as n gets large. Often, they just get really big or really small. Some distributions are asymptotically equivalent to a normal, but most “things”, estimators and distributions alike, are not.

The sample mean by itself does not become normal as n gets large. What would happen if you added up a huge series of numbers? You’d get a big number. What would happen if you divided your big number by your huge number? Go on, whack some experimental numbers into your calculator!

Whatever you put into your calculator, it’s a single number you end up with, not a “normal distribution”. The sample mean alone does not tend to a normal distribution as n gets large: it settles down towards the population mean, which is the law of large numbers at work rather than the CLT.

The studentised sample mean has a distribution which is normal in the limit. We need to centre the sample mean at μ and scale it by σ/√n before it has a stable limiting distribution; the resulting quantity is the one often known as the z-score. It’s this quantity that tends to normality as n gets large.

How large does n need to be? This theorem works for any distribution with a finite mean and standard deviation, that is, as long as x comes from a distribution with those features. Generally, statistics texts quote the figure of n = 30 as a rule of thumb. This works reasonably well for simple estimators and models like the sample mean in a lot of situations.
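
If you want to see the theorem at work, a small simulation like the sketch below (exponential draws, studentised sample means, the usual n = 30) is an easy way to convince yourself:

```r
# A small simulation: studentised sample means from a skewed (exponential)
# distribution, with the usual rule-of-thumb sample size of n = 30.
set.seed(3)
n    <- 30
mu   <- 1   # mean of an exponential distribution with rate 1
sdev <- 1   # its standard deviation

z <- replicate(10000, {
  x <- rexp(n, rate = 1)
  sqrt(n) * (mean(x) - mu) / sdev   # the studentised sample mean
})

hist(z, breaks = 50, freq = FALSE,
     main = "Studentised sample means, n = 30", xlab = "z")
curve(dnorm(x), add = TRUE, lwd = 2)  # overlay the standard normal density
```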

This isn’t to say, however, that if you have “big data” your problems are gone. You just got a whole different set, I’m sorry. That’s a different post, though.

So that’s a brief run down on the simplest of the central limit theorems: it’s not a complex or difficult concept, but it is a subtle one. It’s the building block on which the known properties of models such as regression and logistic regression are based.

The infographic below presents the same information, but for some reason my students find that format easier to digest. When it comes to asymptotic theory, I am disinclined to argue with them: I just try to communicate in whatever way works. On that note, if this post was too complex or boring, here is the CLT presented with bunnies and dragons.** What’s not to love?

CLT infographic

** I can’t help myself: the reason the distribution of average bunny weights gets narrower as the sample size gets larger is that this is the sample mean tending towards the true population mean. For a discussion of this behaviour vs the CLT see here.

It’s my only criticism of what was otherwise a delightful video. Said video being in every way superior to my own version done late one night for a class with my dog assisting and my kid’s drawing book. No bunnies or dragons, but it’s here.

Q&A vs the Leaders’ Debate: is everyone singing from the same song sheet?

The election campaign is in full swing here in Australia and earlier this week the leaders of the two main parties, Malcolm Turnbull and Bill Shorten, faced off in a heavily scripted debate in which few questions were answered and the talking points were well practised. With the encounter described as “diabolical” and “boring”, fewer Australians tuned in than in recent years. Possibly this was because they expected to hear what they had already heard before.

Since the song sheet was well rehearsed, this seemed like the perfect opportunity for another auspol word cloud. The transcript of the debate was made available on Malcolm Turnbull’s website and it was an easy enough matter of poking around and seeing what could be found. Chris Uhlmann, who moderated the debate, was added to the stop words list as his name was a prominent feature in earlier versions of the cloud.

debate word cloud

The song sheet was mild: the future tense “will” was in the middle with Shorten, labor, plan, people and Turnbull. Also featured were tax, economic, growth and change, along with other economic terms like billion, (per)cent, economy, budget, superannuation. There was mention of climate, (people) smugglers, fair and action, but these were relatively isolated as topics.

In summary, this word cloud is not that different from the one generated from the carefully strategised twitter feeds of Turnbull and Shorten that I looked at last week.

The ABC’s program Q&A could be a better opportunity for politicians to depart from the song sheet and offer less scripted insight, so why not see what the word cloud throws up?

This week’s program aired the day after the leaders’ debate and featured Steve Ciobo (Liberal: minister for trade), Terri Butler (Labor: shadow parliamentary secretary for child safety and prevention of family violence), Richard di Natale (Greens leader; his twitter word cloud is here), Nick Xenophon (independent senator) and Jacqui Lambie (independent senator). Tony Jones hosted the program and suffered the same fate as Chris Uhlmann.

QandA word cloud

The word cloud picked up on the discursive format of the show: names of panellists feature prominently. Interestingly, Richard di Natale appears in the centre. Also prominent are election-related words such as Australia, government, country, question, debate.

Looking at other topics thrown up by the word cloud, there is a broad range: penalty rates, coal, senate, economy, businesses, greens, policy, money, Queensland, medicare, politician, commission.

Two different formats, two different panels and two different sets of topics. Personally, I prefer it when the song sheet has a few more pages.

Social Networks: The Aeneid Again

Applying social network analysis techniques to the Aeneid provides an opportunity to visualise the relationships Virgil built into the text. It occurred to me that this was a great idea when I saw this social network analysis of Game of Thrones. If there is a group of literary figures more bloodthirsty, charming and messed around by cruel fate than the denizens of Westeros, it would be those of the golden age of Roman literature.

Aeneid social network

This is a representation of the network of characters in the Aeneid. Aeneas and Turnus, both prominent figures in the wordcloud I created for the Aeneid, are also prominent in the network. Connected to Aeneas are his wife Lavinia, his father Anchises, the king of the Latins (Latinus) and Pallas, the young man placed into Aeneas’ care.

Turnus is connected directly to Aeneas, along with his sister Juturna, Evander (father of Pallas; cliff-notes version: the babysitting did not go well) and Allecto, a divine figure of rage.

Between Aeneas and Turnus is the “Trojan contingent”. Virgil deliberately created parallels between the stories surrounding the fall of Troy and Aeneas’ story. Achilles, the tragic hero, is connected to Turnus directly, while Aeneas is connected to Priam (king of Troy) and Hector (the great defender of Troy). Andromache is Hector’s widow whom Aeneas meets early in the epic.

Also of note is the divine grouping: major players in directing the action of the epic. Jupiter, king of the gods, and Apollo, the sun god, are directly connected to our hero. Venus, Neptune, Minerva and Cupid are all present. In a slightly different grouping, Juno, queen of the gods and Aeneas’ enemy, is connected to Dido, Aeneas’ lover. Suffice it to say, the relationship was not a “happily ever after”.

I used this list of the characters in the Aeneid as a starting point and later removed all characters who were peripheral to the social network. If you’re interested in trying this yourself, I posted the program I used here. Once again, the text used is the translation by J.W. Mackail and you can download it from Project Gutenberg here.
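
For a flavour of the general approach, here is a stripped-down sketch: count how often pairs of character names appear together, then plot the co-occurrence network with igraph. The “paragraphs” and character list below are placeholders rather than the real text or the script I used.

```r
# A stripped-down sketch of the approach: count how often pairs of character
# names appear in the same chunk of text, then plot the co-occurrence network.
# The "paragraphs" and character list below are placeholders, not the real text.
library(tm)
library(igraph)

paragraphs <- c("Aeneas spoke with Anchises and Pallas",
                "Turnus raged while Juturna watched",
                "Aeneas faced Turnus at the last")
characters <- c("aeneas", "turnus", "anchises", "pallas", "juturna")

corpus <- VCorpus(VectorSource(paragraphs))
corpus <- tm_map(corpus, content_transformer(tolower))
tdm    <- TermDocumentMatrix(corpus, control = list(dictionary = characters))

m   <- as.matrix(tdm)
adj <- m %*% t(m)   # character-by-character co-occurrence counts
diag(adj) <- 0      # drop self-links

g <- graph_from_adjacency_matrix(adj, mode = "undirected", weighted = TRUE)
plot(g, vertex.label = V(g)$name)
```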

There were a number of resources I found useful for this project:

  • This tutorial from R DataMining provided a substantial amount of the code required for the social network analysis
  • This tutorial from the same place was very helpful for creating a term-document matrix. I’ve used it previously a number of times.
  • This article from R Bloggers on using igraph was also very useful
  • There were a number of other useful links and I’ve documented those in the R script.

Whilst text mining is typically applied to modern issues, the opportunity to visualise an ancient text is an interesting one. I was interested in how the technique grouped the characters together. These groupings were by and large consistent not only with the surface interpretation of the text, but also with deeper levels of political and moral meaning within the epic.

Modelling Early Grade Education in Papua New Guinea

For several years, I worked for the World Bank analysing early grade education outcomes in a number of Pacific countries, including Laos, Tonga and Papua New Guinea. Recently, our earlier work in Papua New Guinea was published for the first time.

One of the more challenging things I did was model a difficult set of survey outcomes: reading amongst young children. You can see the reports here. Two of the most interesting relationships we observed were the importance of language for young children learning to read (Papua New Guinea has over 850 of them so this matters) and the role that both household and school environments play in literacy development.

At some point I will write a post about the choice between standard ordinary least squares regressions used in the field and the tobit models I (generally) prefer for this data. Understanding the theoretical difference between censored, truncated and continuous data isn’t the most difficult thing in the world, but understanding the practical difference between them can have a big impact on modelling outcomes.

Elasticity and Marginal Effects: Two Key Concepts

One of the critical parts of building a great model is using your understanding of the problem and its context. Choosing an appropriate model type and deciding which features or variables to explore both depend on that understanding.

The two key concepts of elasticity and marginal effects are fundamental to an economic understanding of model building. This is something that can be overlooked by practitioners not coming from that background. Neither concept is difficult or particularly obscure.

This infographic came about because I had a group of talented economics students at the master’s level who, by and large, had no econometric background. In a crowded course, I don’t have much time to expand on my favourite things. This was my take on explaining the concepts quickly and simply.

Elasticity infographic

For those very new to the concept, this explanation here is simple. Alternatively, if you’re interested in non-constant marginal effects and ways they can be used, check out this discussion.
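
If a tiny worked example helps, here is a sketch on simulated price and quantity data; the numbers are invented purely for illustration.

```r
# A quick illustration with simulated price and quantity data (the numbers
# are invented). In a log-log model the slope is a constant elasticity; in a
# model in levels the slope is a constant marginal effect.
set.seed(4)
price    <- runif(500, 1, 20)
quantity <- exp(5 - 1.5 * log(price) + rnorm(500, sd = 0.2))

log_log <- lm(log(quantity) ~ log(price))
linear  <- lm(quantity ~ price)

coef(log_log)["log(price)"]  # elasticity: % change in quantity per 1% change in price
coef(linear)["price"]        # marginal effect: change in quantity per unit change in price
```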


More word clouds: Auspol

Whilst I love text mining a classic work of western literature, this time I decided to stay in the present century with the twitter feeds of the leaders of the three major parties heading into a downunder federal election.

Turnbull word cloud

Malcolm Turnbull is the prime minister and leads the Liberal party; he’s in blue. Bill Shorten is the opposition leader and head of the Labor party; he’s in red. Richard di Natale is the leader of the Greens; he’s in green because I had no choice there.

The outcomes are pretty interesting: apparently AMP was A.Big.Deal. lately. “Jobs” and “growth” were the other words playing on repeat for the two major parties.
Shorten Word Cloud

Each one has a distinct pattern, however. Turnbull is talking about AMP, plans, jobs, future, growth. Shorten is talking about labor, AMP, medicare, budget, schools and jobs. Di Natale is talking about the Greens, AMP (again), electricity, and auspol itself. Unlike the others, Di Natale was also particularly interested in farmers, warming and science.

Word cloud Di Natale

The programming and associated sources are pretty much the same as for the Aeneid word cloud, except I used the excellent twitteR package, which you can find out about here. This tutorial on R Data mining was the basis of the project. The size of the corpora (that being the plural of corpus, if you speak Latin) presented a problem and these resources here and here were particularly helpful.

For reference, I pulled the tweets from the leaders’ timelines on the evening of 23/05/16. The same code gave me 83 tweets from Bill Shorten, 59 from Malcolm Turnbull and 33 from Richard Di Natale. All three leaders are furious tweeters, so if anyone has any thoughts on why twitteR responded like that, I’d be grateful to hear them.

The minimum frequency for entering the word cloud was 3 per word for Shorten, who had the greater number of tweets picked up, but only 2 for Di Natale and Turnbull due to the smaller number of available words.
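
For the curious, the rough shape of the process looks something like the sketch below; the credentials are placeholders, and the handle, tweet count and cleaning steps are illustrative rather than the exact script I ran.

```r
# A rough sketch of the process. The credentials are placeholders, and the
# handle, tweet count and cleaning steps are illustrative rather than the
# exact script used for the clouds above.
library(twitteR)
library(tm)
library(wordcloud)

setup_twitter_oauth("CONSUMER_KEY", "CONSUMER_SECRET",
                    "ACCESS_TOKEN", "ACCESS_SECRET")

tweets <- userTimeline("TurnbullMalcolm", n = 200)  # one leader's timeline
text   <- twListToDF(tweets)$text

corpus <- VCorpus(VectorSource(text))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))

wordcloud(corpus, min.freq = 2, random.order = FALSE)
```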

I’ll try this again later in the campaign and see what turns up.

Text Mining: Word Clouds

I’ve been exploring text mining in greater depth lately. As an experiment, I decided to create a word cloud based on Virgil’s Aeneid, one of the great works of Roman literature. Mostly because it can’t all be business cases and twitter analyses. The translation I used was by J.W. Mackail and you can download it here.

word cloud

Aeneas (the protagonist) and Turnus (the main antagonist) feature prominently. “Father” also makes a prominent appearance, as part of the epic is about Aeneas’ relationship with his elderly father. However, neither of Aeneas’ wives nor his lover, Dido, appears in the word cloud. “Death”, “gods”, “blood”, “sword”, “arms” and “battle” all feature. That sums the epic up: it’s a rollicking adventure about the fall of Troy, the founding of Rome and a trip to the underworld as well.

The choice to downplay the role of romantic love in the story had particular political implications for the epic as a piece of propaganda. You can read more about it here and here. I found it interesting that the word cloud echoed this.

What I learnt from this experiment was that stop words matter. The cloud was put together from an early 20th-century translation of a 2000-year-old text using 21st-century methods and stop words. Due to the archaic English used in the translation, I added a few stop words of my own: things like thee, thou, thine. This resulted in a much more informative cloud.

I did create a word cloud using the Latin text, but without a set of Latin stop words easily available it yielded a cloud helpfully describing the text with prominent features like “but”, “so”, “and”, “until”.

The moral of the story is that the stop words we use matter. Choosing the right set is what lets the cloud describe the text accurately.
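
If you want to see what that looks like in practice, here is a minimal sketch of the stop-word step; the file name is a placeholder and the extra stop words are just the archaic ones mentioned above (plus “thy” for good measure).

```r
# A minimal sketch of the stop-word step. The file name is a placeholder and
# the extra stop words are the archaic ones mentioned above.
library(tm)
library(wordcloud)

text   <- readLines("aeneid_mackail.txt", warn = FALSE)  # placeholder path
corpus <- VCorpus(VectorSource(paste(text, collapse = " ")))

corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords,
                 c(stopwords("english"), "thee", "thou", "thine", "thy"))

wordcloud(corpus, max.words = 100, random.order = FALSE)
```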

If you’re interested in creating your own clouds, I found these resources particularly helpful:

  • Julia Silge’s analysis of Jane Austen inspired me to think about data mining in relation to Roman texts. You can see it here; it’s great!
  • The Gutenbergr package from Ropenscilabs for accessing texts, available on GitHub.
  • This tutorial on data mining from RDatamining.
  • Preparing literary data for text mining by Jeff Rydberg-Cox.
  • A great word cloud tutorial you can view here on STHDA.

There were a number of other tutorials and fixes that were helpful; I noted these in the R script. The script is up on GitHub: if you want to try it yourself, you can find it here.