Continuous, Censored and Truncated Data: what are the differences and do you need to care?

Whenever I work with someone whose statistical or econometric experience has been more practical than theoretical, two things happen. The first is that the poor person inexplicably develops a twitch whenever I launch into an enthusiastic tangent that requires a sheet of graph paper and extensive hand waving.

The other thing that inevitably happens is that the digression comes to an end and the question is asked “but does that matter in practice?”

When it comes to model selection, the difference between data types really does matter. You may make choices one way or another, but understanding the differences (both obvious and subtle) lets you make those choices knowing that you have them.

This post is a cliff-notes version of the issue. Maybe you’ve heard of these differences in data types and just need a memory jog. Maybe you’ve not heard of them at all and want somewhere simple to start.

Continuous data is pretty simple: it’s data that can lie anywhere on the real line with a positive probability density. That is, it can be anywhere from very large negative numbers to very large positive numbers. Data drawn from a normal distribution is an example of continuous data.

Truncated data, on the other hand, is data which is continuous but has the added complication of only being observed above or below a certain point. The classic example suggested by Greene is income [1]: if we only surveyed the income of those earning above the tax-free threshold, then we would have truncated data.

Censored data is similar, but the issue is not who gets sampled: it’s how much of each observation gets recorded. Some parts of the distribution are obscured, but not ignored. The survey may, for example, interview people at all income levels, but only record the dollar amount for those above the tax-free threshold and describe the rest as “under the tax threshold”. In this case all parts of the distribution are reported on, but the level of information differs above and below the threshold.
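A minimal sketch of the difference, using simulated incomes. The lognormal shape, the threshold value and all variable names are assumptions made purely for illustration; they are not taken from Greene or from any real survey.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical incomes in dollars: the distribution and the threshold
# are assumptions chosen only to illustrate the two data types.
income = rng.lognormal(mean=10.5, sigma=0.6, size=10_000)
threshold = 18_200.0

# Truncated: people at or below the threshold never make it into the sample.
truncated = income[income > threshold]

# Censored: everyone is in the sample, but incomes below the threshold
# are recorded as the threshold itself (a stand-in for "under the tax
# threshold").
censored = np.maximum(income, threshold)

print(f"full sample: n={income.size}, mean={income.mean():,.0f}")
print(f"truncated:   n={truncated.size}, mean={truncated.mean():,.0f}")
print(f"censored:    n={censored.size}, mean={censored.mean():,.0f}")
print(f"share at or below the threshold: {(income <= threshold).mean():.1%}")
```

The truncated sample is smaller than the original, while the censored sample keeps every respondent but piles some of them up at the threshold. Both push the naive sample mean above the mean of the full data.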

Most people are aware of the issues with modelling categorical data using techniques designed for continuous data. However, censored and truncated data also need special treatment. A lot of the data we deal with has a natural truncation point: distance isn’t negative, prices are not (well, hardly ever) negative. Recognising that you may be dealing with truncated or censored data is an important part of initial data analysis. For a thorough discussion, see W. H. Greene’s chapter on the subject here.

In practice, continuous data methodologies may work quite well for these types of data as long as there isn’t a large amount of data sitting at or near the truncation or censoring point (which is often zero).
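To get a feel for this, here is a rough simulation sketch. It assumes a simple linear relationship with normal noise, censors the outcome at zero, and fits an ordinary straight line to the result. The set-up, the function name and the intercept values are all hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000
x = rng.normal(size=n)
noise = rng.normal(size=n)
true_slope = 1.0

def ols_slope_with_censoring(intercept):
    """Fit a straight line to data censored at zero; return (share at zero, slope)."""
    latent = intercept + true_slope * x + noise  # what we would like to observe
    observed = np.maximum(latent, 0.0)           # values below zero are recorded as zero
    slope = np.polyfit(x, observed, deg=1)[0]    # ordinary least squares slope
    return (observed == 0.0).mean(), slope

# Lowering the intercept pushes more of the data onto the censoring point.
results = [ols_slope_with_censoring(c) for c in (4.0, 2.0, 1.0, 0.0)]
for share, slope in results:
    print(f"share at zero: {share:5.1%}   OLS slope: {slope:.3f}   (true slope: {true_slope})")
```

In this set-up the fitted slope stays close to the true value while only a sliver of the data sits at zero, and drifts further from it as that share grows.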

Test scores are something I’ve worked a lot with. In my experience, once the proportion of zeros in the test scores began to approach around 20%, I needed to switch over to models designed for the issue. In the 10%-20% range I will often try a few different models to see which is most appropriate. That’s just a general rule of thumb: your mileage may vary.

Hand waving and furious graph-paper drawing aside: yes, in this case, knowing the differences does matter in practice.

Notes:

[1] W. H. Greene, Econometric Analysis, is a classic text and here I’m looking at p. 756 in the fifth edition. There are three copies of this book living in my house. Definitely worth the investment if you are looking for either a classic text covering everything econometrics or a useful TV stand. What can I say? We were young and poor and a matched set of texts made up for deficits in our furniture budget. I’ve owned this book for nearly twenty years and I still use it, even long after we could afford furniture.

The Central Limit Theorem: Misunderstood

Asymptotics are the building blocks of many models. They’re basically lego: sturdy, functional and capable of allowing the user to exercise great creativity. They also hurt like hell when you don’t know where they are and you step on them accidentally. I’m pushing it on the last, I’ll admit. But I have gotten very sweary over recalcitrant limiting distributions in the past (though I may be in a small group there).

One of the fundamentals of the asymptotic toolkit is the Central Limit Theorem, or CLT for short. If you didn’t study eight semesters of econometrics or statistics, then it’s something you (might have) sat through a single lecture on and walked away with the hot take “more data is better”.

The CLT is actually a collection of theorems, but the basic entry-level version is the Lindeberg-Lévy CLT. It states that for a sample of n independent observations drawn at random from the same distribution, with finite mean (μ) and finite standard deviation (σ), if we calculate the sample mean x-bar then,

$$\frac{\sqrt{n}\,(\bar{x} - \mu)}{\sigma} \xrightarrow{d} N(0, 1) \quad \text{as } n \to \infty$$

In my time both in industry and in teaching, I’ve come across a number of interpretations of this result: many of them very wrong from very smart people. I’ve found it useful to clarify what this result does and does not mean, as well as when it matters.

Not all distributions become normal as n gets large. In fact, most things don’t “tend to normality” as n gets large. Often, they just get really big or really small. Some distributions are asymptotically equivalent to normality, but most “things” (estimators and distributions alike) are not.

The sample mean by itself does not become normal as n gets large. What would happen if you added up a huge series of numbers? You’d get a big number. What would happen if you divided your big number by your huge number? Go on, whack some experimental numbers into your calculator!

Whatever you put into your calculator, it’s not a “normal distribution” you get when you’re done. The sample mean alone does not tend to a normal distribution as n gets large.

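The standardised (or, with σ estimated from the sample, studentised) sample mean has a distribution which is normal in the limit. There are some adjustments we need to make before the sample mean has a stable limiting distribution: subtract the population mean, then divide by the standard deviation of the sample mean (σ/√n). The result is the quantity often known as the z-score. It’s this quantity that tends to normality as n gets large.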
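A small simulation can make the distinction concrete. This is only a sketch: the exponential distribution, the number of replications and the helper name are assumptions picked for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma = 1.0, 1.0  # mean and standard deviation of an exponential(1) distribution
replications = 2_000

def simulate(n):
    """Draw many samples of size n; return the sample means and their z-scores."""
    samples = rng.exponential(scale=1.0, size=(replications, n))
    means = samples.mean(axis=1)
    z = np.sqrt(n) * (means - mu) / sigma
    return means, z

spread = {}
for n in (10, 100, 1_000):
    means, z = simulate(n)
    spread[n] = (means.std(), z.std())
    print(f"n={n:5d}   sd of sample mean: {means.std():.3f}   sd of z-score: {z.std():.3f}")
```

The spread of the raw sample mean keeps shrinking as n grows (it is collapsing onto μ), while the spread of the z-score stays close to one. Only the second has a stable distribution left to talk about in the limit.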

How large does n need to be? The theorem holds for any distribution with a finite mean and standard deviation, i.e. as long as x comes from a distribution with these features. How quickly the normal approximation becomes a good one depends on that distribution. Generally, statistics texts quote the figure of n=30 as a “rule of thumb”. This works reasonably well for simple estimators and models like the sample mean in a lot of situations.
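One way to poke at the rule of thumb is to measure how lopsided the z-score still is for a skewed starting distribution. The sketch below assumes exponential data (for which the skewness of the z-score is 2/√n in theory); the sample sizes and the `skewness` helper are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)
replications = 20_000

def skewness(values):
    """Sample skewness: the third standardised moment (zero for a normal distribution)."""
    centred = values - values.mean()
    return (centred ** 3).mean() / values.std() ** 3

skew_by_n = {}
for n in (5, 30, 200):
    # exponential(1) has mean 1 and standard deviation 1
    means = rng.exponential(scale=1.0, size=(replications, n)).mean(axis=1)
    z = np.sqrt(n) * (means - 1.0) / 1.0
    skew_by_n[n] = skewness(z)
    print(f"n={n:4d}   skewness of z-score: {skew_by_n[n]:.2f}")
```

The skewness falls towards zero as n grows but has not vanished at n=30, which is why the figure is a rule of thumb rather than a guarantee.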

This isn’t to say, however, that if you have “big data” your problems are gone. You just got a whole different set, I’m sorry. That’s a different post, though.

So that’s a brief run down on the simplest of central limit theorems: it’s not a complex or difficult concept, but it is a subtle one. It’s the building block upon which models such as regression, logistic regression and their known properties have been based.

The infographic below is the same information, but for some reason my students find information in that format easier to digest. When it comes to asymptotic theory, I am disinclined to argue with them: I just try to communicate in whatever way works. On that note, if this post was too complex or boring, here is the CLT presented with bunnies and dragons.** What’s not to love?

CLT infographic

** I can’t help myself: the reason the distribution of average bunny weights gets narrower as the sample size gets larger is that the sample mean is tending towards the true population mean. For a discussion of this behaviour vs the CLT see here.

It’s my only criticism of what was otherwise a delightful video. Said video being in every way superior to my own version, done late one night for a class with my dog assisting and my kid’s drawing book. No bunnies or dragons, but it’s here.

Q&A vs the Leaders’ Debate: is everyone singing from the same song sheet?

The election campaign is in full swing here in Australia and earlier this week the leaders of the two main parties, Malcolm Turnbull and Bill Shorten, faced off in a heavily scripted debate in which few questions were answered and the talking points were well practised. The encounter was described as “diabolical” and “boring”, and fewer Australians tuned in compared to recent years. Possibly this was because they expected to hear what they had already heard before.

Since the song sheet was well rehearsed, this seemed like the perfect opportunity for another auspol word cloud. The transcript of the debate was made available on Malcolm Turnbull’s website and it was an easy enough matter of poking around and seeing what could be found. Chris Uhlmann, who moderated, was added to the stop words list as he was a prominent feature in earlier versions of the cloud.
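For anyone curious about the mechanics, a word cloud is driven by word counts taken after the stop words have been removed. The sketch below shows that counting step only. It is not the code used for the clouds in this post: the transcript snippet, the speaker names and the stop word list are all made up for illustration.

```python
import re
from collections import Counter

# Hypothetical snippet standing in for a debate transcript.
transcript = (
    "ALEX SMITH: Will your plan deliver growth? "
    "SPEAKER ONE: Our economic plan will deliver jobs and growth. "
    "ALEX SMITH: And your response? "
    "SPEAKER TWO: We will put people first. Our plan is a fair plan."
)

# A small, illustrative stop word list. The (made-up) moderator's name and
# the speaker labels are added so that they do not dominate the counts.
stop_words = {
    "a", "and", "is", "our", "the", "we", "your",
    "alex", "smith", "speaker", "one", "two",
}

def word_frequencies(text, stop_words):
    """Lower-case the text, split it into words and count those not in stop_words."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(word for word in words if word not in stop_words)

frequencies = word_frequencies(transcript, stop_words)
for word, count in frequencies.most_common(5):
    print(f"{word:10s} {count}")
```

A word cloud then draws each remaining word at a size that reflects its count, which is why a moderator who speaks between every answer takes over the picture unless their name is filtered out.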

debate word cloud

The song sheet was mild: the future tense “will” was in the middle with Shorten, labor, plan, people and Turnbull. Also featured were tax, economic, growth, change and other economic nouns like billion, (per)cent, economy, budget, superannuation. There was mention of climate, (people) smugglers, fair and action, but these were relatively isolated as topics.

In summary, this word cloud is not that different to that generated from the carefully strategised twitter feeds of Turnbull and Shorten I looked at last week.

The ABC’s program Q and A could be a better opportunity for politicians to depart from the song sheet and offer less scripted insight: why not see what the word cloud throws up?

This week’s program aired the day after the leaders’ debate and featured Steve Ciobo (Liberal: minister for trade), Terri Butler (Labor: shadow parliamentary secretary for child safety and prevention of family violence), Richard di Natale (Greens: leader; his Twitter word cloud is here), Nick Xenophon (independent senator) and Jacqui Lambie (independent senator). Tony Jones hosted the program and suffered the same fate as Chris Uhlmann.

QandA word cloud

The word cloud picked up on the discursive format of the show: names of panellists feature prominently. Interestingly, Richard di Natale appears in the centre. Also prominent are election related words such as Australia, government, country, question, debate.

Looking at other topics thrown up by the word cloud, there is a broad range: penalty rates, coal, senate, economy, businesses, greens, policy, money, Queensland, medicare, politician, commission.

Two different formats, two different panels and two different sets of topics. Personally, I prefer it when the song sheet has a few more pages.