## Size Matters

It would be too easy for someone like me to declare that A/B testing is simple. If you’re doing website testing you have all the power in the world. Sample sizes literally not dreamt of when I was an undergraduate. A large sample is the same thing as power, right? And of course, power is all that matters.

This is completely wrong. Statistical A/B testing uses (parts of) the same toolkit I was using in RCTs for Papua New Guinea and Tonga, but it isn’t looking at the same problem and it isn’t looking at the same effect sizes. In this context, A/B testing is used for incremental changes. By contrast, in my previous professional life we were looking for the biggest possible bang we could generate for the very limited development dollar.

As always, context matters. Power gets more expensive as you approach the asymptote if effect size is also shrinking. How expensive? How big does that sample have to be? This is one of the things I’m looking at in this experiment.

However, size isn’t something I see discussed often. The size of a test is the rate at which it rejects the null hypothesis when the null is true, so it is about control: it tells you what your tradeoff has been between Type I and Type II errors. We fix our size a priori and then we move on and forget about it. But is our fixed, set size the same as the actual size of our test?

We know that Fisher’s exact test is often actually undersized in practice, for example. This means the test is too conservative. On the flip side, a test that is too profligate (oversized) rejects the null hypothesis when it’s true far too often.

In this post, I’m going to look at a very basic test with some very robust assumptions and see what happens to size and power as we vary those assumptions and sample sizes. The purpose here is the old decision vs default adage: know what’s happening and own the decision you make.

# tl;dr

While power matters in A/B testing, so does the size of the test. We spend a lot of time worrying about power in this field but not enough (in my view) ensuring our expectations of size are appropriate. Simple departures from the unicorns-and-fairy-dust normal distribution can cause problems for size and power.

# Testing tests

The test I’m looking at is the plainest of plain vanilla statistical tests. The null hypothesis is that the mean of a generating distribution is zero against an alternate. The statistic is the classic z-statistic and the assumptions underlying the test can go one of two ways:

1. In the finite sample, if the underlying distribution is normal, then the z-statistic is normal, no matter the sample size.
2. Asymptotically, as long as the underlying distribution meets some conditions like finite variance (a condition on fat tails; more on this later) and independence, the z-statistic will have a normal limiting distribution as the sample size gets large.

I’m going to test this simple test in one of four scenarios over a variety of sample sizes:

1. Normally generated errors. Everything is super, nothing to see here. This is the statistician’s promised land flowing with free beer and coffee.
2. t(4) errors. This is a fat tailed distribution still centred on zero and symmetric. The variance is finite, but the fourth moment and anything higher is not. Is fat-tails alone an issue?
3. Centred and standardised chi squared (2) errors: fat tailed and asymmetric, the errors generated from this distribution have mean zero and a standard deviation of unity. Does symmetry matter that much?
4. Cauchy errors. This is the armageddon scenario: the central limit theorem doesn’t even apply here. There is no theoretical underpinning for getting this test to work under this scenario: there are no finite moments of order one or above (there are some fractional ones though). Can a big sample size get you over this in practice?
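For anyone who wants to poke at the four scenarios, here is a minimal sketch of the error generators in Python, using only the standard library. It is an illustration, not the code linked in this post; the function names and the `scenarios` dictionary are made up for this example. It leans on two standard facts: a t(4) draw is a standard normal divided by the square root of an independent chi-squared(4) over 4, and a chi-squared(2) is an exponential with mean 2 and standard deviation 2.

```python
import math
import random


def normal_error(rng):
    """Standard normal error."""
    return rng.gauss(0.0, 1.0)


def t4_error(rng):
    """Student t error with 4 degrees of freedom: Z / sqrt(chi2_4 / 4)."""
    z = rng.gauss(0.0, 1.0)
    chi2_4 = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(4))
    return z / math.sqrt(chi2_4 / 4)


def chi2_2_error(rng):
    """Chi-squared(2) error, centred and standardised to mean 0 and sd 1."""
    x = rng.expovariate(0.5)  # exponential with mean 2, i.e. chi-squared(2)
    return (x - 2.0) / 2.0


def cauchy_error(rng):
    """Standard Cauchy error via the inverse CDF."""
    return math.tan(math.pi * (rng.random() - 0.5))


# Hypothetical lookup table so an experiment loop can iterate over scenarios.
scenarios = {
    "normal": normal_error,
    "t(4)": t4_error,
    "chi-squared(2)": chi2_2_error,
    "cauchy": cauchy_error,
}

rng = random.Random(2024)
for name, draw in scenarios.items():
    print(name, [round(draw(rng), 2) for _ in range(5)])
```

Passing the random number generator in explicitly keeps each scenario reproducible from a seed.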

In the experiments, we’ll look at sample sizes between 10 and 100 000. Note that the charts below are on a log10 scale.

The null hypothesis is mu = 0, and the rejection rate in this scenario gives us the size of our test. We can measure power under a variety of alternatives, and here I’ve looked at mu = 0.01, 0.1, 1 and 10. I also looked at micro effect sizes like 0.0001, but it was a sad story. [1]

The test has been set with significance level 0.05 and each experiment was performed with 5000 replications. Want to try it yourself/pull it apart? Code is here.
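If you’d rather see the shape of the experiment before opening the code, here is a rough Python sketch of one cell of it (normal errors only). It is not the linked code. I’m assuming the standard deviation is estimated from the sample and compared against a normal critical value, and `z_statistic` and `rejection_rate` are names invented for this illustration.

```python
import math
import random
from statistics import NormalDist


def z_statistic(sample):
    """z-statistic for H0: mu = 0, with the sd estimated from the sample."""
    n = len(sample)
    mean = math.fsum(sample) / n
    variance = math.fsum((x - mean) ** 2 for x in sample) / (n - 1)
    return mean / math.sqrt(variance / n)


def rejection_rate(mu, n, reps=5000, alpha=0.05, seed=1):
    """Share of replications rejecting H0: size if mu = 0, power otherwise."""
    rng = random.Random(seed)
    critical = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided normal cutoff
    rejections = 0
    for _ in range(reps):
        # Normal errors; swap in another generator for the other scenarios.
        sample = [mu + rng.gauss(0.0, 1.0) for _ in range(n)]
        if abs(z_statistic(sample)) > critical:
            rejections += 1
    return rejections / reps


size_n10 = rejection_rate(mu=0.0, n=10)
size_n100 = rejection_rate(mu=0.0, n=100)
power_n100 = rejection_rate(mu=1.0, n=100)
print(f"size at n=10:   {size_n10:.3f}")
print(f"size at n=100:  {size_n100:.3f}")
print(f"power at n=100, mu=1: {power_n100:.3f}")
```

With the null true, the rejection rate is the empirical size; with mu away from zero, the same number is the empirical power.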

# Normal errors: everything is super

With really small samples, the test is oversized. N=10 is a real Hail Mary at the best of times, whatever your assumptions are. But by about n=30 and above it’s consistently in the range of 0.05.

It comes as a surprise to no one to hear that power is very much dependent on distance from the null hypothesis. For alternates where mu =10 or mu =1, power is near unity for smallish sample sizes. In other words, we reject the null hypothesis close to 100% of the time when it is false. It’s easy for the test to distinguish these alternatives because they’re so different to the null.

But what about smaller effect sizes? If mu = 0.1, then you need at least a sample of a thousand to get close to power of unity. If mu = 0.01 then a sample of 100 000 might be required.
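For the well-behaved normal case you can sanity-check these numbers without simulating anything. The sketch below uses the textbook large-sample power formula for a two-sided z-test, assuming a known standard deviation of one (my simplification, not something taken from the experiment); `approx_power` is a made-up name.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(mu, n, sigma=1.0, alpha=0.05):
    """Approximate power of a two-sided z-test of H0: mu = 0."""
    std_normal = NormalDist()
    critical = std_normal.inv_cdf(1 - alpha / 2)
    shift = mu * sqrt(n) / sigma  # how far the z-statistic is pushed from zero
    upper_tail = std_normal.cdf(-critical + shift)
    lower_tail = std_normal.cdf(-critical - shift)
    return upper_tail + lower_tail


for mu, n in [(0.1, 100), (0.1, 1000), (0.01, 1000), (0.01, 100_000)]:
    print(f"mu={mu}, n={n}: power is about {approx_power(mu, n):.3f}")
```

Power depends on mu and n only through mu times the square root of n, so an effect ten times smaller needs a sample a hundred times larger for the same power.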

Smaller still? The chart below shows another run of the experiment with minuscule effect sizes: mu = 0.001 and 0.0001. The test has almost no power even at sample sizes of 100 000. If you need to detect effects this small, you need samples in the millions (at least).
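How many millions? A back-of-the-envelope sketch, again assuming normal errors with a known standard deviation of one and a target power of 80% (both are my assumptions, not settings from the experiment). `required_n` is a name invented here; the formula is the usual approximation that ignores the negligible far tail.

```python
from math import ceil
from statistics import NormalDist


def required_n(mu, power=0.8, sigma=1.0, alpha=0.05):
    """Approximate sample size for a two-sided z-test to detect a mean of mu."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)
    z_power = std_normal.inv_cdf(power)
    return ceil(((z_alpha + z_power) * sigma / mu) ** 2)


for mu in (0.1, 0.01, 0.001, 0.0001):
    print(f"mu={mu}: n of roughly {required_n(mu):,}")
```

Under these assumptions, mu = 0.001 needs a sample in the high millions and mu = 0.0001 needs hundreds of millions.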

# Fat shaming distributions

Leptokurtic distributions (fat tails) catch a lot of flak for messing up tests and models. And that’s fairly well deserved. However, degree of fatness is an issue. The t(4) distribution still generates a statistic that works asymptotically in this context, but only just: its variance is finite while its fourth moment is not, so there is very little wriggle room.

Size is more variable than in the normal scenario: still comfortably around 0.05, it’s more profligate at small sample sizes and at larger ones has a tendency to be more conservative. It’s still reasonably close to 0.05 at larger sample sizes, however.

Power is costly for the smaller effect sizes. For the same sample size (say n=1000) with mu = 0.1, there is substantially less power than in the normal case. A similar behaviour is evident for mu=0.01. Tiny effect sizes are similarly punished (see below).

# Fat, Skewed and Nearly Dead

The chi-squared(2) distribution puts this simple test through its paces. The size of the test for anything under n=1000 is not in the same ballpark as its nominal level, rendering the test (in my view) a liability. By the time n=100 000, the size of the test is reasonable.

Power shows similar outcomes to the earlier scenarios but, in my view, it is not a saving grace here: with so little control over size, there’s not much basis on which to interpret your power.

# Here be dragons

I included a scenario with the Cauchy distribution, despite the fact it’s grossly unfair to this simple little test. The Cauchy distribution ensures that the Central Limit Theorem does not apply here: the test will not work in theory (or, indeed, in practice).

I thought it was a useful exercise, however, to show what that looks like. Too often, we assume “as n gets big CLT is going to work its magic” and that’s just not true. To wit: one hot mess.

Neither size nor power is improved with sample size increasing: that’s because the CLT isn’t operational in this scenario. The test is undersized and underpowered for all but the largest of effect sizes (and really, at that effect size you could tell a difference from a chart anyway).
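One way to see why more data doesn’t help: the mean of n standard Cauchy draws is itself standard Cauchy, so the sample mean never tightens up. The sketch below is an illustration in Python (standard library only, invented function names) that compares the interquartile range of simulated sample means for normal and Cauchy errors at two sample sizes.

```python
import math
import random
from statistics import quantiles


def normal_draw(rng):
    return rng.gauss(0.0, 1.0)


def cauchy_draw(rng):
    # Standard Cauchy via the inverse CDF.
    return math.tan(math.pi * (rng.random() - 0.5))


def iqr_of_sample_means(draw, n, reps=1000, seed=1):
    """Interquartile range of the sample mean across many replications."""
    rng = random.Random(seed)
    means = [math.fsum(draw(rng) for _ in range(n)) / n for _ in range(reps)]
    q1, _median, q3 = quantiles(means, n=4)
    return q3 - q1


spread = {
    (name, n): iqr_of_sample_means(draw, n)
    for name, draw in (("normal", normal_draw), ("cauchy", cauchy_draw))
    for n in (10, 1000)
}
for (name, n), iqr in spread.items():
    print(f"{name:>6}, n={n:>4}: IQR of sample mean is about {iqr:.3f}")
```

The normal spread should shrink by roughly a factor of ten between n=10 and n=1000, while the Cauchy spread should stay put at around 2.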

# A/B reality is rarely this simple

A/B testing reality is rarely as simple as the test I’ve illustrated above. More typically, we’re testing groups of means or proportions and interaction effects, possibly dynamic relationships, and so on.

My purpose here is to show that even a simple, robust test with minimal assumptions can be thoroughly useless if those assumptions are not met. More complex tests and testing regimes that build on these simple results may be impacted more severely and more completely.

Power is not the only concern: size matters.

# End notes

[1] Yes, I need to get a grip on LaTeX in WordPress sooner or later, but it’s less interesting than the actual experimenting.

## A Primer on Basic Probability

… and by basic, I mean basic. I sometimes find people come to me with questions and no one has ever taken the time to give them the most basic underpinnings in probability that would make their lives a lot easier. A friend of mine is having this problem and is on a limited time frame for solving it, so this is quick and dirty and contains both wild ad-lib on my part and swearing. When I get some more time, I’ll try and expand and improve, but for now it’s better than nothing.

YouTube explainer: done without a microphone, sorry. Time limit again.

Slides I used:

Probability

I mentioned two links in the screencast. One was Allen Downey’s walkthrough with Python. You don’t need to know anything about Python to explore this one, and it’s well worth it. The other is Victor Powell’s visualisation of conditional probability. Again, worth a few minutes’ exploration.

Good luck! Hit me up in the comments section if you’ve got any questions, this was a super quick run through so it’s a summary at best.

## Does it matter in practice? Normal vs t distribution

One of the perennial discussions is normal vs t distributions: which do you use, when, why and so on. This is one of those cases where for most sample sizes in a business analytics/data science context it probably makes very little practical difference. Since that’s such a rare thing for me to say, I thought it was worth explaining.

Now I’m all for statistical rigour: you should use the right one at the right time for the right purpose, in my view. However, this can be one of those cases where if the sample size is large enough, it’s just not that big a deal.

The actual simulations I ran are very simple: just 10 000 draws from normal and t-distributions, with the t varying at different degrees of freedom. Then I plotted the density for each on the same graph using ggplot in R. If you’d like to have a play around with the code, leave a comment to let me know and I’ll post it to GitHub.
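In the meantime, here is a rough stand-in in Python rather than R, and without the plots: 10 000 draws from a normal and from t-distributions at a few degrees of freedom, compared on their empirical 97.5th percentile. The degrees of freedom, the seed and the function names are my choices for illustration. The t draws use the standard construction of a normal divided by the square root of a chi-squared over its degrees of freedom.

```python
import math
import random


def t_draw(rng, df):
    """One Student t draw: standard normal over sqrt(chi-squared(df) / df)."""
    chi2 = rng.gammavariate(df / 2, 2.0)  # chi-squared(df) = Gamma(df/2, scale 2)
    return rng.gauss(0.0, 1.0) / math.sqrt(chi2 / df)


def empirical_quantile(draws, p):
    """Simple order-statistic estimate of the p-th quantile."""
    ordered = sorted(draws)
    return ordered[int(p * (len(ordered) - 1))]


rng = random.Random(42)
n_draws = 10_000

normal_q = empirical_quantile([rng.gauss(0.0, 1.0) for _ in range(n_draws)], 0.975)
t_q = {
    df: empirical_quantile([t_draw(rng, df) for _ in range(n_draws)], 0.975)
    for df in (2, 5, 30, 100)
}

print(f"normal: {normal_q:.2f}")
for df, q in t_q.items():
    print(f"t({df}): {q:.2f}")
```

At low degrees of freedom the t cutoff sits well above the normal one; by 30 or so the gap is small enough that it rarely changes a decision.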

## Describing simple statistics

I’m a huge believer in the usefulness of learning by doing. That makes me a huge believer in Shiny, which allows me to create and deploy simple apps that allow students to do just that.

This latest app is a simple one that allows you to manipulate either the mean or the variance of a normal distribution and see how that changes the shape of the distribution.

If you want to try out making Shiny apps, but need a place to start, check out Oliver Keyes’ excellent start up guide.

## Yes, you can: learn data science

Douglas Adams had it right in Dirk Gently’s Holistic Detective Agency. Discussing the mathematical complexity of the natural world, he writes:

… the mind is capable of understanding these matters in all their complexity and in all their simplicity. A ball flying through the air is responding to the force and direction with which it was thrown, the action of gravity, the friction of the air which it must expend its energy on overcoming, the turbulence of the air around its surface, and the rate and direction of the ball’s spin. And yet, someone who might have difficulty consciously trying to work out what 3 x 4 x 5 comes to would have no trouble in doing differential calculus and a whole host of related calculations so astoundingly fast that they can actually catch a flying ball.

If you can catch a ball, you are performing complex calculus instinctively. All we are doing in formal mathematics and data science is putting symbols and a syntax around the same processes you use to catch that ball.

Maybe you’ve spent a lot of your life believing you “can’t” or are “not good at” mathematics, statistics or whatever bugbear of the computational arts is getting to you. These are concepts we begin to internalise at a very early age and often carry them through our lives.

The good news is yes you can. If you can catch that ball (occasionally at least!) then there is a way for you to learn data science and all the things that go with it. It’s just a matter of finding the one that works for you.

Yes you can.

## Tutorials and Guides: A curated list

This post is a curated list of my favourite tutorials and guides because “that one where Hadley Wickham was talking about cupcakes” isn’t the most effective search term. You can find my list of cheat sheets here. There are a lot of great resources on data science (I’ve included my top picks), so I don’t intend to reinvent the wheel here. This is just a list of my favourites all laid out so I can find them again or point other people in their direction when it comes up in conversation. I’ve also added a number of the “how to” type posts I’ve written on this blog as I often answer an enquiry in that format.

Data Science

Tutorials and videos: General

Puppets teach data science too

• Render a 3D object in R. I have no idea where I would ever use this information in my practice, but it’s presented BY A PUPPET. Great fun.
• DIY your data science. Another offering from the puppet circle on the data science venn diagram.

Econometrics

Statistics

Work Flow

• Guide to modern statistical workflow. Really great organisation of background material.
• Tidy data, tidy models. Honestly, if there was one thing that had been around 10 years ago, I wish this was it. The amount of time and accuracy to be saved using this method is phenomenal.
• Extracting data from the web. You found the data, now what to do? Look here.

Linear Algebra

Asymptotics

Bayes

Machine learning

Data visualisation

Natural Language Processing

I’ll continue to update this list as I find things I think are useful or interesting.

Edit: actually, “that one where Hadley Wickham was talking about cupcakes” is surprisingly accurate as a search term.

## Visualising Correlation

It’s a very simple concept, but I often find that people don’t actually know what “strong” or “weak” correlation looks like.

I made this .gif to illustrate the basic idea – it’s located here.

If you want to try this yourself, you can find the R script here.
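If you don’t have R to hand, the same idea can be sketched in Python with the standard library. This is not a port of the script above, just an illustration with invented names: build y as rho times x plus scaled independent noise, so the pair has correlation rho by construction, then check the sample correlation.

```python
import math
import random


def correlated_pairs(rho, n, rng):
    """Draw n (x, y) pairs of standard normals with correlation rho."""
    pairs = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        noise = rng.gauss(0.0, 1.0)
        y = rho * x + math.sqrt(1 - rho ** 2) * noise
        pairs.append((x, y))
    return pairs


def pearson_r(pairs):
    """Sample Pearson correlation coefficient."""
    n = len(pairs)
    mean_x = math.fsum(x for x, _ in pairs) / n
    mean_y = math.fsum(y for _, y in pairs) / n
    cov = math.fsum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = math.fsum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = math.fsum((y - mean_y) ** 2 for _, y in pairs)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(7)
sample_r = {rho: pearson_r(correlated_pairs(rho, 5000, rng)) for rho in (0.1, 0.5, 0.9)}
for rho, r in sample_r.items():
    print(f"target {rho}: sample correlation is about {r:.2f}")
```

Scatter-plotting the pairs for each rho gives the “weak” versus “strong” pictures.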

## Data Analysis: Enough with the Questions Already

Ideally, you’d really like to write something that doesn’t leave the reader with a keyboard imprint across their forehead due to analysis-induced narcolepsy. That’s not always easy, but here are some thoughts.

Writing up data analysis shouldn’t be about listing means, standard deviations and some dodgy histograms. Yes, sometimes you need that stuff- but mostly what you need is a compelling narrative. What is the data saying to support your claims?

It doesn’t all need to be there.

You worked out that tricky bit of code and did that really awesome piece of analysis that led you to ask questions and… sorry, no one cares. If it’s not a direct part of your story, it probably needs to be consigned to telling your nerd friends on twitter- at least they’ll understand what you’re talking about. But keep it out of the write up!

How is it relevant?

Data analysis is rarely the end in and of itself. How does your analysis support the rest of your project? Does it offer insight for modelling or forecasting? Does it offer insight for decision making? Make sure your reader knows why it’s worth reading.

Do you have an internal structure?

Data analysis is about translating complex numerical information into text. A clear and concise structure for your analysis makes life much easier for the reader.

If you’re staring at the keyboard wondering if checking every social media account you ever had since high school is a valid procrastination option: try starting with “three important things”. Then maybe add three more. Now you have a few things to say and can build from there.

Who are you writing for?

Academia, business, government, your culture, someone else’s, fellow geeks, students… all of these have different expectations around communication. All of them are interested in different things. Try not to have a single approach for communicating analysis to different groups. Remember what’s important to you may not be important to your reader.

Those are just a few tips for writing up your analyses. As we’ve said before: it’s not a one-size-fits-all approach. But hopefully you won’t feel compelled to give a list of means, a correlation matrix and four dodgy histograms that fit in the space of a credit card. We can do better than that!

## Data Analysis: More Questions

In our last post on data analysis, we asked a lot of questions. Data analysis isn’t a series of generic questions we can apply to every dataset we encounter, but it can be a helpful way to frame the beginning of your analysis. This post is, simply, some more questions to ask yourself if you’re having trouble getting started.

The terminology I use below (tall, dense and wide) is due to Francis Diebold. You can find his original post here and it’s well worth a read.

## Data Analysis: Questions to Ask the First Time

Data analysis is one of the most underrated, but most important, parts of data science/econometrics/statistics/whatever it is you do with data.

It’s not impressive when it’s done right because it’s like being impressed by a door handle: it is something that is both ubiquitous and obvious. But when you’re missing the door handles, you can’t open the door.

There are lots of guides to data analysis but fundamentally there is no one-size-fits-most approach that can be guaranteed to work for every data set. Data analysis is a series of open-ended questions to ask yourself.

If you’re new or coming to data science from a background that did not emphasise statistics or econometrics (or story telling with data in general), it can be hard to know which questions to ask.

I put together this guide to offer some insight into the kinds of questions I ask myself when examining my data for the first time. It’s not complete: work through this guide and you won’t have even started the analysis proper. This is just the first time you open your data, after all.

But by uncovering the answers to these questions, you’ll have a more efficient analysis process. You’ll also (hopefully) think of more questions to ask yourself.

Remember, this isn’t all the information you need to uncover: this is just a start! But hopefully it offers you a framework to think about your data the first time you open it. I’ll be back with some ideas for the second time you open your data later.
