## Size Matters

It would be too easy for someone like me to declare that A/B testing is simple. If you’re doing website testing you have all the power in the world. Sample sizes literally not dreamt of when I was an undergraduate. A large sample is the same thing as power, right? And of course, power is all that matters.

This is completely wrong. While statistical A/B testing is using (parts of) the same toolkit as I was using in RCTs for Papua New Guinea and Tonga: it isn’t looking at the same problem and it isn’t looking at the same effect sizes. A/B testing is used for incremental changes in this context. On the contrary, in my previous professional life we were looking for the biggest possible bang we could generate for the very limited development dollar.

As always, context matters. Power gets more expensive as you approach the asymptote if effect size is also shrinking. How expensive? How big does that sample have to be? This is one of the things I’m looking at in this experiment.

However, size isn’t something I see discussed often. The size of the test is about control: it tells you what your tradeoff has been between Type I and Type II errors. We fix our size a priori and then we move on and forget about it. But is our fixed, set size the same as the actual size of our test?

We know that Fisher’s exact test is often actually undersized in practice, for example.  In practice, this means the test is too conservative. On the flip side, a test that is too profligate (over sized) is rejecting the null hypothesis when it’s true far too often.

In this post, I’m going to look at a very basic test with some very robust assumptions and see what happens to size and power as we vary those assumptions and sample sizes. The purpose here is the old decision vs default adage: know what’s happening and own the decision you make.

# tl;dr

While power matters in A/B testing, so does the size of the test. We spend a lot of time worrying about power in this field but not enough (in my view) ensuring our expectations of size are appropriate. Simple departures from the unicorns-and-fairy-dust normal distribution can cause problems for size and power.

# Testing tests

The test I’m looking at is the plainest of plain vanilla statistical tests. The null hypothesis is that the mean of a generating distribution is zero against an alternate. The statistic is the classic z-statistic and the assumptions underlying the test can go one of two ways:

1. In the finite sample, if the underlying distribution is normal, then the z-statistic is normal, no matter the sample size.
2. Asymptotically, as long as the underlying distribution meets some conditions like finite fourth moments (more on this later- fat tails condition) and independence, as the sample size gets large, the z-statistic will have a normal limiting distribution.

I’m going to test this simple test in one of four scenarios over a variety of sample sizes:

1. Normally generated errors. Everything is super, nothing to see here. This is the statistician’s promised land flowing with free beer and coffee.
2. t(4) errors. This is a fat tailed distribution still centred on zero and symmetric. The fourth moment is finite, but no larger moments are. Is fat-tails alone an issue?
3. Centred and standardised chi squared (2) errors: fat tailed and asymmetric, the errors generated from this distribution have mean zero and a standard deviation of unity. Does symmetry matter that much?
4. Cauchy errors. This is the armageddon scenario: the central limit theorem doesn’t even apply here. There is no theoretical underpinning for getting this test to work under this scenario- there are no finite moments at or equal to the first or above (there are some fractional ones though). Can a big sample size get you over this in practice?

In the experiments, we’ll look at sample sizes between 10 and 100 000. Note that the charts below are on a log10 scale.

The null hypothesis is mu =0, the rejection rate in this scenario gives us the size of our test. We can measure power under a variety of alternatives and here I’ve looked at mu = 0.01, 0.1, 1 and 10. I also looked at micro effect sizes like 0.0001, but it was a sad story. [1]

The test has been set with significance level 0.05 and each experiment was performed with 5000 replications. Want to try it yourself/pull it apart? Code is here.

# Normal errors: everything is super

With really small samples, the test is oversized. N=10 is a real Hail Mary at the best of times, it doesn’t matter what your assumptions are. But by about n=30 and above it’s consistently in the range of 0.05.

It comes as a surprise to no one to hear that power is very much dependent on distance from the null hypothesis. For alternates where mu =10 or mu =1, power is near unity for smallish sample sizes. In other words, we reject the null hypothesis close to 100% of the time when it is false. It’s easy for the test to distinguish these alternatives because they’re so different to the null.

But what about smaller effect sizes? If mu =0.1, then you need at least a sample of a thousand to get close to power of unity. If mu =0.001 then a sample of 100 000 might be required.

Smaller still? The chart below shows another run of the experiment with effect sizes of mu =0.001 and 0.0001 – miniscule. The test has almost no power even at sample sizes of 100 000. If you need to detect effects this small, you need samples in the millions (at least).

# Fat shaming distributions

Leptokurtotic distributions (fat tails) catch a lot of schtick for messing up tests and models. And that’s fairly well deserved. However, degree of fatness is an issue. The t(4) distribution still generates a statistic that works asymptotically in this context: but it has only exactly the number of moments we need and no wriggle room at all.

Size is more variable than in the normal scenario: still comfortably around 0.05, it’s more profligate at small sample sizes and at larger ones has a tendency to be more conservative. It’s still reasonably close to 0.05 at larger sample sizes, however.

Power is costly for the smaller effect sizes. For the same sample size (say n=1000) with mu = 0.1, there is substantially less power than in the normal case. A similar behaviour is evident for mu=0.01. Tiny effect sizes are similarly punished (see below).

# Fat, Skewed and Nearly Dead

The chi-squared(2) distribution put this simple test through its paces. The size of the test for anything under n=1000 is not in the same ballpark as its nominal, rendering the test (in my view) a liability. By the time n=100 000, the size of the test is reasonable.

Power, in my view, while showing similar outcomes is not a saving grace here: there’s very little control in this test under which you can interpret your power.

# Here be dragons

I included a scenario with the Cauchy distribution, despite the fact it’s grossly unfair to this simple little test. The Cauchy distribution ensures that the Central Limit Theorem does not apply here: the test will not work in theory (or, indeed, in practice).

I thought it was a useful exercise, however, to show what that looks like. Too often, we assume “as n gets big CLT is going to work its magic” and that’s just not true. To whit: one hot mess.

Neither size nor power is improved with sample size increasing: that’s because the CLT isn’t operational in this scenario. The test is under sized, under powered for all but the largest of effect sizes (and really, at that effect size you could tell a difference from a chart anyway).

# A/B reality is rarely this simple

A/B testing reality is rarely as simple as the test I’ve illustrated above. More typically, we’re testing groups of means or  proportions and interaction effects, possibly dynamic relationships and a long etc.

My purpose here is to show that even a simple, robust test with minimal assumptions can be thoroughly useless if those assumptions are not met. More complex tests and testing regimes that build on these simple results may be impacted more severely and more completely.

Power is not the only concern: size matters.

# End notes

[1] Yes, I need to get a grip on latex in WordPress sooner or later, but it’s less interesting than the actual experimenting.

## A Primer on Basic Probability

… and by basic, I mean basic. I sometimes find people come to me with questions and no one has ever taken the time to give them the most basic underpinnings in probability that would make their lives a lot easier. A friend of mine is having this problem and is on a limited time frame for solving it, so this is quick and dirty and contains both wild ad-lib on my part and swearing. When I get some more time, I’ll try and expand and improve, but for now it’s better than nothing.

Youtube explainer: done without microphone, sorry- time limit again.

Slides I used:

Probability

I mentioned two links in the screencast. One was Allen Downey’s walkthrough with python, you don’t need to know anything about Python to explore this one: well worth it. The other is Victor Powell’s visualisation of conditional probability. Again, worth a few minutes exploration.

Good luck! Hit me up in the comments section if you’ve got any questions, this was a super quick run through so it’s a summary at best.

## Machine Learning is Basically the Reversing Camera on Your Car

I’ve been spending a bit of time on machine learning lately. But when it comes to classification or regression: it’s basically the reversing camera on your car.

Let me elaborate: machine learning, like a reversing camera, is awesome. Both things let you do stuff you already could do, but faster and more often. Both give you insights into the world around you that you may not have had without them. However, both can give a more narrow view of the world than some other techniques (in this case, expanded statistical/econometric methodologies and/or your mirrors and checking your blindspots).

As long as everything around you remains perfectly still and doesn’t change, the reversing camera will let you get into a tight parking spot backwards and give you some insights into where the gutter and other objects are that you didn’t have before. Machine learning does great prediction when the inputs are not changing.

But if you have to go a long way in reverse (like reversing down your driveway- mine is 400m long), or things are moving around you (other cars, pet geese, STUPID big black dogs that think running under your wheels is a great idea. He’s bloody fine, stupid mutt): then the reversing camera alone is not all the information you need.

In the same way, if you need to explain relationships- because your space is changing and prediction is not enough- then it’s a very useful thing to expand your machine learning toolbox with statistical/econometric techniques like hypothesis testing, information criteria and solid model building methodologies (as opposed to relying solely on lasso or ridge methods). Likewise, causality and endogeneity matters a lot.

So, in summary machine learning and reversing cameras are awesome, but aren’t the whole picture in many cases. Make your decision about what works best in your situation: don’t just default to what you’re used to.

(Also, I’m not convinced this metaphor extends in the forwards direction. Data analysis? You only reverse, maybe 5% of the time you’re driving. But you’re driving forward the rest of the time: data analysis is 95% of my workflow. Yours?)

## Does it matter in practice? Normal vs t distribution

One of the perennial discussions is normal vs t distributions: which do you use, when, why and so on. This is one of those cases where for most sample sizes in a business analytics/data science context it probably makes very little practical difference. Since that’s such a rare thing for me to say, I thought it was worth explaining.

Now I’m all for statistical rigour: you should use the right one at the right time for the right purpose, in my view. However, this can be one of those cases where if the sample size is large enough, it’s just not that big a deal.

The actual simulations I ran are very simple, just 10 000 draws from normal and t-distributions with the t varying at different degrees of freedom. Then I just plotted the density for each on the same graph using ggplot in R. If you’d like to have a play around with the code, leave a comment to let me know and I’ll post it to github.

## Describing simple statistics

I’m a huge believer in the usefulness of learning by doing. That makes me a huge believer in Shiny, which allows me to create and deploy simple apps that allow students to do just that.

This latest app is a simple one that allows you to manipulate either the mean or the variance of a normal distribution and see how that changes the shape of the distribution.

If you want to try out making Shiny apps, but need a place to start, check out Oliver Keyes’ excellent start up guide.

## Exploring Correlation and the Simple Linear Regression Model

I’ve been wanting to learn Shiny for quite some time, since it seems to me that it’s a fantastic tool for communicating data science concepts. So I created a very simple app which allows you to manipulate a data generation process from weak through to strong correlation and then interprets the associated regression slope coefficient for you.

The reason I made it is because whilst we often teach simple linear regression and correlation as two intermeshed ideas, students at this level rarely have the opportunity to manipulate the concepts to see how they interact. This is easily fixable with a simple app in shiny. If you want to start working in Shiny, then I highly recommend Oliver Keyes’ excellent start up guide which was extremely easy to follow for this project.

## Correlation vs Causation

Correlation vs causation. I find this is an issue that is quite simple, from a technical point of view, but widely misunderstood. Statistical significance does not imply causation. Correlation implies there may be a direct or indirect relationship, but does not imply causation. In fact, very few things imply causation. My simple version of the differences is below.

If you want to know why this is far more than a stoush to be had in an academic tea room, check out Tyler Vigen’s collection. If the age of Miss America can be significantly and strongly correlated with murders by steam, hot vapours and objects; then in any practical analysis there are many options for other less obvious spurious correlations. In a big data context, knowing the difference could be millions of dollars.

Occasionally, people opine that causation vs correlation doesn’t matter (especially in a big data and sometimes a machine learning context). I’d argue this is completely the wrong view to take: just because you have all the power that matters doesn’t mean we should ignore these issues because a randomised control trial is impractical in a lot of ways. It just means deciding when, how and why you’re going to do so in the knowledge of what you’re doing. Spurious correlations are common, hard to detect and difficult to deal with. It’s a bear hunt worth setting out on.

## Yes, you can: learn data science

Douglas Adams had it right in Dirk Gently’s Holistic Detective Agency. Discussing the mathematical complexity of the natural world, he writes:

… the mind is capable of understanding these matters in all their complexity and in all their simplicity. A ball flying through the air is responding to the force and direction with which it was thrown, the action of gravity, the friction of the air which it must expend its energy on overcoming, the turbulence of the air around its surface, and the rate and direction of the ball’s spin. And yet, someone who might have difficulty consciously trying to work out what 3 x 4 x 5 comes to would have no trouble in doing differential calculus and a whole host of related calculations so astoundingly fast that they can actually catch a flying ball.

If you can catch a ball, you are performing complex calculus instinctively. All we are doing in formal mathematics and data science is putting symbols and a syntax around the same processes you use to catch that ball.

Maybe you’ve spent a lot of your life believing you “can’t” or are “not good at” mathematics, statistics or whatever bugbear of the computational arts is getting to you. These are concepts we begin to internalise at a very early age and often carry them through our lives.

The good news is yes you can. If you can catch that ball (occasionally at least!) then there is a way for you to learn data science and all the things that go with it. It’s just a matter of finding the one that works for you.

Yes you can.

## Tutorials and Guides: A curated list

This post is a curated list of my favourite tutorials and guides because “that one where Hadley Wickham was talking about cupcakes” isn’t the most effective search term. You can find my list of cheat sheets here. There are a lot of great resources on data science (I’ve included my top picks), so I don’t intend to reinvent the wheel here. This is just a list of my favourites all laid out so I can find them again or point other people in their direction when it comes up in conversation. I’ve also added a number of the “how to” type posts I’ve written on this blog as I often answer an enquiry in that format.

Data Science

Tutorials and videos: General

Puppets teach data science too

• Render a 3D object in R. I have no idea where I would ever use this information in my practice, but it’s presented BY A PUPPET. Great fun.
• DIY your data science. Another offering from the puppet circle on the data science venn diagram.

Econometrics

Statistics

Work Flow

• Guide to modern statistical workflow. Really great organisation of background material.
• Tidy data, tidy models. Honestly, if there was one thing that had been around 10 years ago, I wish this was it. The amount of time and accuracy to be saved using this method is phenomenal.
• Extracting data from the web. You found the data, now what to do? Look here.

Linear Algebra

Asymptotics

Bayes

Machine learning

Data visualisation

Natural Language Processing

I’ll continue to update this list as I find things I think are useful or interesting.

Edit: actually, “that one where Hadley Wickham was talking about cupcakes” is surprisingly accurate as a search term.

## Visualising Correlation

It’s a very simple concept, but I often find that people don’t actually know what “strong” or “weak” correlation looks like.

I made this .gif to illustrate the basic idea – it’s located here.

If you want to try this yourself, you can find the r script here.