## Size Matters

It would be too easy for someone like me to declare that A/B testing is simple. If you’re doing website testing you have all the power in the world. Sample sizes literally not dreamt of when I was an undergraduate. A large sample is the same thing as power, right? And of course, power is all that matters.

This is completely wrong. While statistical A/B testing is using (parts of) the same toolkit as I was using in RCTs for Papua New Guinea and Tonga: it isn’t looking at the same problem and it isn’t looking at the same effect sizes. A/B testing is used for incremental changes in this context. On the contrary, in my previous professional life we were looking for the biggest possible bang we could generate for the very limited development dollar.

As always, context matters. Power gets more expensive as you approach the asymptote if effect size is also shrinking. How expensive? How big does that sample have to be? This is one of the things I’m looking at in this experiment.

However, size isn’t something I see discussed often. The size of the test is about control: it tells you what your tradeoff has been between Type I and Type II errors. We fix our size a priori and then we move on and forget about it. But is our fixed, set size the same as the actual size of our test?

We know that Fisher’s exact test is often actually undersized in practice, for example.  In practice, this means the test is too conservative. On the flip side, a test that is too profligate (over sized) is rejecting the null hypothesis when it’s true far too often.

In this post, I’m going to look at a very basic test with some very robust assumptions and see what happens to size and power as we vary those assumptions and sample sizes. The purpose here is the old decision vs default adage: know what’s happening and own the decision you make.

# tl;dr

While power matters in A/B testing, so does the size of the test. We spend a lot of time worrying about power in this field but not enough (in my view) ensuring our expectations of size are appropriate. Simple departures from the unicorns-and-fairy-dust normal distribution can cause problems for size and power.

# Testing tests

The test I’m looking at is the plainest of plain vanilla statistical tests. The null hypothesis is that the mean of a generating distribution is zero against an alternate. The statistic is the classic z-statistic and the assumptions underlying the test can go one of two ways:

1. In the finite sample, if the underlying distribution is normal, then the z-statistic is normal, no matter the sample size.
2. Asymptotically, as long as the underlying distribution meets some conditions like finite fourth moments (more on this later- fat tails condition) and independence, as the sample size gets large, the z-statistic will have a normal limiting distribution.

I’m going to test this simple test in one of four scenarios over a variety of sample sizes:

1. Normally generated errors. Everything is super, nothing to see here. This is the statistician’s promised land flowing with free beer and coffee.
2. t(4) errors. This is a fat tailed distribution still centred on zero and symmetric. The fourth moment is finite, but no larger moments are. Is fat-tails alone an issue?
3. Centred and standardised chi squared (2) errors: fat tailed and asymmetric, the errors generated from this distribution have mean zero and a standard deviation of unity. Does symmetry matter that much?
4. Cauchy errors. This is the armageddon scenario: the central limit theorem doesn’t even apply here. There is no theoretical underpinning for getting this test to work under this scenario- there are no finite moments at or equal to the first or above (there are some fractional ones though). Can a big sample size get you over this in practice?

In the experiments, we’ll look at sample sizes between 10 and 100 000. Note that the charts below are on a log10 scale.

The null hypothesis is mu =0, the rejection rate in this scenario gives us the size of our test. We can measure power under a variety of alternatives and here I’ve looked at mu = 0.01, 0.1, 1 and 10. I also looked at micro effect sizes like 0.0001, but it was a sad story. [1]

The test has been set with significance level 0.05 and each experiment was performed with 5000 replications. Want to try it yourself/pull it apart? Code is here.

# Normal errors: everything is super

With really small samples, the test is oversized. N=10 is a real Hail Mary at the best of times, it doesn’t matter what your assumptions are. But by about n=30 and above it’s consistently in the range of 0.05.

It comes as a surprise to no one to hear that power is very much dependent on distance from the null hypothesis. For alternates where mu =10 or mu =1, power is near unity for smallish sample sizes. In other words, we reject the null hypothesis close to 100% of the time when it is false. It’s easy for the test to distinguish these alternatives because they’re so different to the null.

But what about smaller effect sizes? If mu =0.1, then you need at least a sample of a thousand to get close to power of unity. If mu =0.001 then a sample of 100 000 might be required.

Smaller still? The chart below shows another run of the experiment with effect sizes of mu =0.001 and 0.0001 – miniscule. The test has almost no power even at sample sizes of 100 000. If you need to detect effects this small, you need samples in the millions (at least).

# Fat shaming distributions

Leptokurtotic distributions (fat tails) catch a lot of schtick for messing up tests and models. And that’s fairly well deserved. However, degree of fatness is an issue. The t(4) distribution still generates a statistic that works asymptotically in this context: but it has only exactly the number of moments we need and no wriggle room at all.

Size is more variable than in the normal scenario: still comfortably around 0.05, it’s more profligate at small sample sizes and at larger ones has a tendency to be more conservative. It’s still reasonably close to 0.05 at larger sample sizes, however.

Power is costly for the smaller effect sizes. For the same sample size (say n=1000) with mu = 0.1, there is substantially less power than in the normal case. A similar behaviour is evident for mu=0.01. Tiny effect sizes are similarly punished (see below).

# Fat, Skewed and Nearly Dead

The chi-squared(2) distribution put this simple test through its paces. The size of the test for anything under n=1000 is not in the same ballpark as its nominal, rendering the test (in my view) a liability. By the time n=100 000, the size of the test is reasonable.

Power, in my view, while showing similar outcomes is not a saving grace here: there’s very little control in this test under which you can interpret your power.

# Here be dragons

I included a scenario with the Cauchy distribution, despite the fact it’s grossly unfair to this simple little test. The Cauchy distribution ensures that the Central Limit Theorem does not apply here: the test will not work in theory (or, indeed, in practice).

I thought it was a useful exercise, however, to show what that looks like. Too often, we assume “as n gets big CLT is going to work its magic” and that’s just not true. To whit: one hot mess.

Neither size nor power is improved with sample size increasing: that’s because the CLT isn’t operational in this scenario. The test is under sized, under powered for all but the largest of effect sizes (and really, at that effect size you could tell a difference from a chart anyway).

# A/B reality is rarely this simple

A/B testing reality is rarely as simple as the test I’ve illustrated above. More typically, we’re testing groups of means or  proportions and interaction effects, possibly dynamic relationships and a long etc.

My purpose here is to show that even a simple, robust test with minimal assumptions can be thoroughly useless if those assumptions are not met. More complex tests and testing regimes that build on these simple results may be impacted more severely and more completely.

Power is not the only concern: size matters.

# End notes

[1] Yes, I need to get a grip on latex in WordPress sooner or later, but it’s less interesting than the actual experimenting.

## Does it matter in practice? Normal vs t distribution

One of the perennial discussions is normal vs t distributions: which do you use, when, why and so on. This is one of those cases where for most sample sizes in a business analytics/data science context it probably makes very little practical difference. Since that’s such a rare thing for me to say, I thought it was worth explaining.

Now I’m all for statistical rigour: you should use the right one at the right time for the right purpose, in my view. However, this can be one of those cases where if the sample size is large enough, it’s just not that big a deal.

The actual simulations I ran are very simple, just 10 000 draws from normal and t-distributions with the t varying at different degrees of freedom. Then I just plotted the density for each on the same graph using ggplot in R. If you’d like to have a play around with the code, leave a comment to let me know and I’ll post it to github.

## Tutorials and Guides: A curated list

This post is a curated list of my favourite tutorials and guides because “that one where Hadley Wickham was talking about cupcakes” isn’t the most effective search term. You can find my list of cheat sheets here. There are a lot of great resources on data science (I’ve included my top picks), so I don’t intend to reinvent the wheel here. This is just a list of my favourites all laid out so I can find them again or point other people in their direction when it comes up in conversation. I’ve also added a number of the “how to” type posts I’ve written on this blog as I often answer an enquiry in that format.

Data Science

Tutorials and videos: General

Puppets teach data science too

• Render a 3D object in R. I have no idea where I would ever use this information in my practice, but it’s presented BY A PUPPET. Great fun.
• DIY your data science. Another offering from the puppet circle on the data science venn diagram.

Econometrics

Statistics

Work Flow

• Guide to modern statistical workflow. Really great organisation of background material.
• Tidy data, tidy models. Honestly, if there was one thing that had been around 10 years ago, I wish this was it. The amount of time and accuracy to be saved using this method is phenomenal.
• Extracting data from the web. You found the data, now what to do? Look here.

Linear Algebra

Asymptotics

Bayes

Machine learning

Data visualisation

Natural Language Processing

I’ll continue to update this list as I find things I think are useful or interesting.

Edit: actually, “that one where Hadley Wickham was talking about cupcakes” is surprisingly accurate as a search term.

## Law of Large Numbers vs the Central Limit Theorem: in GIFs

I’ve spoken about these two fundamentals of asymptotics previously here and here. But sometimes, you need a .gif to really drive the point home. I feel this is one of those times.

Firstly, I simulated a population of 100 000 observations from the random uniform distribution. This population looks nothing like a normal distribution and you can see that below.

Next, I took 500 samples from the data with varying sample sizes. I used n=5, 10, 20, 50, 100 and 500. I calculated the sample mean (x-bar) and the z score for each and I plotted their kernel densities using ggplot in R.

Here’s a .gif of what happens to the z score as the sample size increases: we can see that the distribution is pretty normal looking, even when the sample size is quite low. Notice that the distribution is centred on zero.

Here’s a .gif of what happens to the sample mean as n increases: we can see that the distribution collapses on the population mean (in this case µ=0.5).

For scale, here is a .gif of both frequencies as n gets large sitting on the same set of axes: the activity is quite different.

If you want to try this yourself, the script is here. Feel free to play around with different distributions and sample sizes, see what turns up.

## The Law of Large Numbers: It’s Not the Central Limit Theorem

I’ve spoken about asymptotics before. It’s the lego of the modelling world, in my view. Interesting, hard and you can lose years of your life looking for just the right piece that fits into the model you’re trying to build.

The Law of Large Numbers (LLN) is another simple theorem that’s widely misunderstood. Most often it’s conflated with the central limit theorem (CLT), which deals with the studentised sample mean or z-score. The LLN pertains to the sample mean itself.

Like the CLT, the LLN is actually a collection of theorems, strong and weak. I’ll confine myself to the simplest version here, Khinchine’s weak law of large numbers. It states that for a random, independent and identically distributed sample of n observations from any distribution with a finite mean (µ) and variance: then the sample mean has a probability limit equal to the population mean, µ. That is, the sample mean is a consistent estimator of the population mean under these conditions.

Put simply, as n gets very big, the sample mean is equal to the population mean.

Notice there is nothing about normal distributions as n gets large. That’s the key difference between the LLN and the CLT. One deals with the sample mean alone, the other with the studentised version. On its own, the distribution of the sample mean collapses onto a single point as n gets large: µ. This is the implication of the LLN.

Appropriately scaled, centred and at the correct rate, the studentised sample mean has a normal distribution in the limit as N gets large: that’s the CLT.

As usual, here’s an infographic to go: put side by side the two theorems have different results but are dealing with something quite similar.

## The Central Limit Theorem: Misunderstood

Asymptotics are the building blocks of many models. They’re basically lego: sturdy, functional and capable of allowing the user to exercise great creativity. They also hurt like hell when you don’t know where they are and you step on them accidentally. I’m pushing it on the last, I’ll admit. But I have gotten very sweary over recalcitrant limiting distributions in the past (though I may be in a small group there).

One of the fundamentals of the asymptotic toolkit is the Central Limit Theorem, or CLT for short. If you didn’t study eight semesters of econometrics or statistics, then it’s something you (might have) sat through a single lecture on and walked away with the hot take “more data is better”.

The CLT is actually a collection of theorems, but the basic entry-level version is the Lindberg-Levy CLT. It states that for any sample of n random, independent observations drawn from any distribution with finite mean (μ) and standard deviation (σ), if we calculate the sample mean x-bar then,

In my time both in industry and in teaching, I’ve come across a number of interpretations of this result: many of them very wrong from very smart people. I’ve found it useful to clarify what this result does and does not mean, as well as when it matters.

Not all distributions become normal as n gets large. In fact, most things don’t “tend to normality” as N gets large. Often, they just get really big or really small. Some distributions are asymptotically equivalent to normality, but most “things”- estimators and distributions alike- are not.

The sample mean by itself does not become normal as n gets large. What would happen if you added up a huge series of numbers? You’d get a big number. What would happen if you divided your big number by your huge number? Go on, whack some experimental numbers into your calculator!

Whatever you put into your calculator, it’s not a “normal distribution” you get when you’re done. The sample mean alone does not tend to a normal distribution as N gets large.

The studentised sample mean has a distribution which is normal in the limit. There are some adjustments we need to make before the sample mean has a stable limiting distribution – this is the quantity often known as the z-score. It’s this quantity that tends to normality as n gets large.

How large does n need to be? This theorem works for any distribution with a finite mean and standard deviation, e.g. as long as x comes from a distribution with these features. Generally, statistics texts quote the figure of n=30 as a “rule of thumb”. This works reasonably well for simple estimators and models like the sample mean in a lot of situations.

This isn’t to say, however, that if you have “big data” your problems are gone. You just got a whole different set, I’m sorry. That’s a different post, though.

So that’s a brief run down on the simplest of central limit theorems: it’s not a complex or difficult concept, but it is a subtle one. It’s the building block upon which models such as regression, logistic regression and their known properties have been based.

The infographic below is the same information, but for some reason my students find information in that format easier to digest. When it comes to asymptotic theory, I am disinclined to argue with them: I just try to communicate in whatever way works. On that note, if this post was too complex or boring, here is the CLT presented with bunnies and dragons.** What’s not to love?

** I can’t help myself: The reason why the average bunny weights distribution gets narrower as the sample size gets larger is because this is the sample mean tending towards the true population mean. For a discussion of this behaviour vs the CLT see here.

It’s my only criticism of what was an otherwise a delightful video. Said video being in every way superior to my own version done late one night for a class with my dog assisting and my kid’s drawing book. No bunnies or dragons, but it’s here.