Experimentation

The math behind A/B tests for people who hate math


(you can still read it if you like math)

by Mojan Benham

2 months ago


Two-line summary

23 minute read

This article breaks down the fundamental components of an experiment (p-values, confidence intervals, sample size) into basic explanations for any level of math background - no degree required. It is the perfect reference piece if you are in a role that requires you to interpret A/B tests and would benefit from a deeper understanding of the underlying math (i.e. you know that a p-value of less than 0.05 is good but not why, or where that number comes from).


Introduction

Reader prerequisites

The only true prerequisite is that you have a vague familiarity with the premise of an A/B test (to the extent of the primer section below). Aside from that, the entire purpose of this post is to convey the subject matter in a way that is comprehensible to readers with no prior knowledge. Specifically, what was taught in your high school math class should suffice.

If you feel confident in the following three concepts, you're well-equipped to continue on: 

  • Taking the average of a set of numbers (ex. the average of 2,5 & 8 is (2+5+8)/3 = 5)
  • Reading a bar chart
  • Working with percentages (ex. 10% of 50 is 5)

A very brief primer on A/B tests

An A/B test (sometimes known as a controlled experiment or hypothesis test) is a statistical tool that is used to measure the impact of making a change. The goal is to understand whether the change made a difference, and by how much.

Let's break this down!

The first step is to determine what we'd like to test; this is called the hypothesis and is basically a statement of cause and effect. E.g. I will change some thing and I hypothesize that it will impact some other thing. For example, I will start offering my customers a coupon after their first purchase (the cause, or treatment) and it will increase the number of people who place a subsequent order (the effect).

The term hypothesis test is apt; we will test this hypothesis by observing whether the data collected during the experiment provides evidence in favour or against its claim.

To carry out an A/B test - as the name suggests - we create two groups: group A where nothing is changed (called the control group) and group B where the treatment is applied (called the treatment group). We then decide on a measure of success, known as the test statistic, which allows us to quantitatively compare the two groups. In the example above, only those in the treatment group would be offered a coupon, and we compare the proportion of returning customers between the two groups.
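To make this concrete, here is a minimal sketch in Python of computing that test statistic from raw data; the records and field names (group, returned) are made up purely for illustration.

```python
# Sketch: compute the test statistic (the proportion of returning customers)
# for each group. The data and field names are hypothetical.
customers = [
    {"group": "control",   "returned": True},
    {"group": "control",   "returned": False},
    {"group": "treatment", "returned": True},
    {"group": "treatment", "returned": True},
]

def return_rate(rows, group):
    """Share of customers in `group` who placed a subsequent order."""
    in_group = [r for r in rows if r["group"] == group]
    return sum(r["returned"] for r in in_group) / len(in_group)

print("control:  ", return_rate(customers, "control"))    # 0.5
print("treatment:", return_rate(customers, "treatment"))  # 1.0
```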

If at the end of the experiment there is sufficient evidence to validate our hypothesis, it is declared statistically significant. Herein lies the crux of the matter; in fact, all of the math is tied up in that one statement! The remainder of this article focuses on explaining what data is collected in the process of an experiment, and how to conduct the calculations to arrive at a conclusion regarding the hypothesis.

One last thing to briefly touch on before we move forth: you may wonder why hypothesis testing necessitates grouping; meaning, why split into two groups, why not apply a change to everyone and monitor the metric before and after the change point? The long version begs a blog post of its own, but in short it can be difficult to isolate the impact of a change among existing trends, random variation and seasonality. Following the above example, customers will return whether or not coupons are offered; what we're interested in understanding is whether the coupon has incremental impact on the recurring purchase metric.

What's in scope

It's pertinent to mention that there are two schools of thought in the world of experimentation: Bayesian and frequentist. This blog post only covers the frequentist approach. Without getting into the weeds, the frequentist view holds that there is a true, fixed answer in the universe, while the Bayesian view assigns probability to the hypothesis, updating its opinion in the face of new evidence. Both are valid and the choice ultimately comes down to a matter of personal philosophy; however, frequentist is chosen here since the math is simpler to convey to a non-technical audience. I draw your attention to this matter so that you may confirm the technique employed by your workplace's experimentation tool before proceeding.

As a second note, I'd emphasize that what is covered below is considered the very bare bones of A/B testing knowledge. It will equip you with the ability to interpret results, but know that entire textbooks are dedicated to the design and validity of controlled experiments. Give strong consideration to the reading list in the Extensions section for a more thorough landscape of this topic.

Core material

Demystifying the basics: statistical significance & p-values

Consider this: if you flip two coins fifty times, there is no guarantee that they will return the same number of heads despite being identical. Random chance creates variation in the results. The same is true in experimentation; the treatment and control group will differ even if the treatment had zero impact. For this reason, it is not enough to simply observe a difference - there needs to be a big enough difference that we can, beyond a reasonable doubt, claim that the change cannot be explained by expected variation.

But if coin one returns 25 heads, what magnitude of difference would convince you that the second coin is better? 30 heads? 40 heads? Enter: p-values. A p-value is a score, calculated using statistics, that quantifies the likelihood that the results of an experiment could have occurred randomly (the exact calculation is detailed in the next section). If the p-value is lower than a specific threshold - usually 0.05, aka 5% - we describe the change as too large to be attributed to random chance: it is statistically significant.

Here's the framework in more detail:

Recall that the first step in designing an experiment is to state the hypothesis. We construct this by forming two mutually exclusive claims:

  • the null hypothesis: denoted by H0, typically states that there is no effect and is assumed to be true unless there is sufficient evidence to prove otherwise.
  • the alternate hypothesis: denoted by H1, contradicts the null hypothesis and states that the treatment did in fact have some effect.

We begin with a theory that the alternate hypothesis is possible, and in running an experiment we are seeking evidence that will help us reject the null hypothesis.

To illustrate, let's return to the two coins example. I've painted the tail side of coin one with a heavy coating (the treatment) and hypothesize that this will cause it to land more frequently on heads. I flip each coin ten times and record the number of heads returned by each - this is called a trial. I do five trials and end up with the following data:

  Trial   Coin 1 (heads)   Coin 2 (heads)
  1       4                5
  2       6                3
  3       5                5
  4       5                4
  5       4                4

Then the average of coin 1 is (4+6+5+5+4)/5 = 4.8 (let's call this X̄1) and the average of coin 2 is (5+3+5+4+4)/5 = 4.2 (X̄2). My experiment setup would be:

  • H0: μ1 = μ2 (the coins are the same, they are equally likely to return heads)
  • H1: μ1 > μ2 (coin one is more likely to return heads)

A comment on notation: X̄ is used to denote the sample mean (since experiment data is just a sample of the entire population), while μ is used to denote the population mean. We use X̄ when speaking about averages derived from our experiment data, and μ when referring to the hypotheses, since the hypotheses are statements about the population.

Now obviously, 4.8 is a bigger number than 4.2, but there's a possibility that random chance caused the numbers to differ. The reason we don't make a literal interpretation of the equal sign is that those two numbers are calculated from the sample, whereas the equal sign in the null hypothesis refers to the population. To be sure of the alternate hypothesis, we need to use the p-value in order to generalize from sample to population. For now, ignore how the p-value is calculated; we'll get to that in the next section.

The p-value is the probability that results as or more extreme than the ones achieved in the experiment could have occurred if the null hypothesis is true. Meaning, what are the chances that we'd see a difference of 0.6 (4.8 - 4.2) if the coins are the same? If the value is less than 5%, it means that there is a less than 5% chance that identical coins could have produced these results, so we can assume that they must be different. We reject the null hypothesis in favour of the alternate hypothesis because the difference is statistically significant.

Conversely if the p-value is high, we can't rule out random variation as there's too high of a likelihood that we would have seen results like this even if the coins were identical.

P.S. You can of course raise the bar and demand a less than 1% threshold if you need stronger evidence of the alternate hypothesis; 5% is chosen as a somewhat arbitrary industry standard.

To close out this section, I present two common misconceptions to ruminate on. Both of these statements were quite puzzling to me when I was first introduced to this subject so take the time to read them thoroughly as they are quite important:

  • The p-value is not the probability that the alternate hypothesis is true. If your p-value is calculated to be 4%, it does not mean there is a 4% chance the null hypothesis is true and a 96% chance the alternate hypothesis is true. It means that assuming the null hypothesis is 100% true, results like the ones achieved by your experiment occur 4% of the time. This will be illustrated when we introduce probability plots further in the post.
  • We do not accept the null hypothesis, we fail to reject it. While this may sound like nitpicky semantics, it is a vital distinction. The null hypothesis is presumed to be true unless there is sufficient evidence to prove otherwise, and it is in the absence of evidence that we fail to reject it. However, the absence of evidence does not prove a negative. If you are familiar with the American justice system, it's analogous to the terms guilty and innocent. When a defendant is not convicted of a crime, they are not found innocent, they are pronounced not guilty.

Hopefully by this point the idea of p-values and statistical significance has been made clear. Understanding that variation is a natural phenomenon, these are the statistical tools that we use to quantify the role that random chance plays in every experiment. Equipped with the conceptual understanding, we now proceed to the underlying math - how are p-values calculated?

Probability plots

In this section, the goal is to evolve our understanding from what a p-value is to how it is derived. We do this by studying probability plots, and it may surprise you to learn that it's as simple as creating a histogram (i.e. the numerical version of a bar chart).

On the left of the illustration below, we have math test results for a small classroom of eight high school students and to the right, the corresponding histogram. Both show the same data, one in tabular form and one visual.

The histogram is just a tally of the number of students that fall within each grade range on the x-axis. For example, students 6 & 8 fall within the 41-60% range, therefore that bar is drawn to a height of two students on the y-axis. Pretty easy, right?
In order to create a probability plot, simply convert the y-axis to show percentages instead of the aggregate value. The y-axis is often labeled as density since it represents the concentration of the population that falls within each range on the x-axis. And there you have it, you've just created your first probability plot! 
In practical contexts such as a statistics textbook or your workplace's experimentation platform, you may notice that the data is represented by a single line (like the green curve on the graph below) rather than bars.
Let's dissect the reasoning with another example. The following histogram shows the grades for the final exam of a 600-person Stats 101 university course. Instead of having ranges on the x-axis, you decide to plot one bar for every possible percentage grade. The process to create the probability plot is the same as the high school example, but with many more bars.
Then, the professor tells you it's actually possible to get half-points such that grades have precision to one decimal point, which gives you even more bars to plot. As data becomes more and more granular, it may be approximated by continuous data, hence the use of the curve.
The probability plot of a continuous dataset is called the distribution or probability density function.
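If it helps to see the conversion in code, here is a minimal sketch assuming Python with numpy and matplotlib installed; the grades are randomly generated stand-ins for the exam example.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical exam grades (percent) for a 600-person course.
rng = np.random.default_rng(42)
grades = rng.normal(loc=72, scale=12, size=600).clip(0, 100)

fig, (ax_counts, ax_density) = plt.subplots(1, 2, figsize=(10, 4))

# Left: ordinary histogram -- the y-axis is a count of students per grade bin.
ax_counts.hist(grades, bins=20)
ax_counts.set_title("Histogram (counts)")
ax_counts.set_xlabel("Grade (%)")
ax_counts.set_ylabel("Number of students")

# Right: same data with density=True -- the y-axis is rescaled so that the
# total area under the bars equals 1, i.e. a probability plot.
ax_density.hist(grades, bins=20, density=True)
ax_density.set_title("Probability plot (density)")
ax_density.set_xlabel("Grade (%)")
ax_density.set_ylabel("Density")

plt.tight_layout()
plt.show()
```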
All this information culminates in an explanation of how the p-value is derived.
When you run an experiment with a control and treatment group, you end up with the average value for the metric in each group (remember, we denoted these by X̄1 and X̄2). The averages are plugged into the following formula for a value called the t-statistic:
\[ t = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{SE} \]
SE stands for standard error and is also a simple calculation. The specifics are not necessary at this juncture, but it essentially represents the noisiness of the data. This leaves us with μ1 and μ2, which are the most interesting terms in the equation. Recall that we said the p-value is the probability of achieving results as extreme or more extreme than what is observed in the experiment, assuming the null hypothesis is true. We stated above that H0: μ1 = μ2, therefore μ1 - μ2 = 0!
The beauty of statistics is that you don't need to reinvent the wheel. Years of hard work from statisticians that precede you means that once the t-statistic is calculated, you can consult the nearest math textbook for the probability plot corresponding to your experiment, called the T-distribution.
The T-distribution is a probability plot, so it is showing you the probability (y-axis) of each possible t-statistic (x-axis). It will also give you a number called the critical value. The area under the curve to the right of your t-statistic is the probability of achieving results that extreme or more extreme - the p-value! The critical value is the point beyond which that area is exactly 5%, so if your t-statistic exceeds the critical value, the result is statistically significant: there is less than a 5% probability of achieving results beyond that threshold. In this case, the larger your t-statistic, the lower your p-value.
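To tie the pieces together, here is a minimal sketch of the full calculation for the coin data above, assuming Python with scipy installed. It uses one common formula for the standard error and a simplified degrees-of-freedom choice (your experimentation tool may use slightly different variants), and a one-sided test to match H1: μ1 > μ2.

```python
import numpy as np
from scipy import stats

# Number of heads per ten-flip trial, from the example above.
coin1 = np.array([4, 6, 5, 5, 4])   # treated coin, sample mean 4.8
coin2 = np.array([5, 3, 5, 4, 4])   # untreated coin, sample mean 4.2

# Standard error of the difference in means (one common formula).
se = np.sqrt(coin1.var(ddof=1) / len(coin1) + coin2.var(ddof=1) / len(coin2))

# t = ((X̄1 - X̄2) - (μ1 - μ2)) / SE, where under H0 we set μ1 - μ2 = 0.
t_stat = (coin1.mean() - coin2.mean() - 0) / se

# Critical value and p-value from the T-distribution.
df = len(coin1) + len(coin2) - 2          # simplified degrees of freedom
critical_value = stats.t.ppf(0.95, df)    # one-sided, 5% significance level
p_value = 1 - stats.t.cdf(t_stat, df)     # area to the right of our t-statistic

print(f"t = {t_stat:.2f}, critical value = {critical_value:.2f}, p = {p_value:.3f}")
# With this tiny sample the t-statistic falls short of the critical value,
# so we fail to reject the null hypothesis.
```

In practice you would rarely assemble this by hand; a single call such as scipy's ttest_ind (with alternative="greater" in recent versions) performs the equivalent calculation for you.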
And that, is how a p-value is calculated!

The central limit theorem

Notice that the T-distribution is bell-shaped, much like the famous bell curve (the normal distribution), which it closely resembles. It would be natural for your next thought to be, "but what if my experiment data doesn't form a bell curve? What if it comes out as some other shape?"

To that point there is a theorem - called the central limit theorem - which states that as long as you have a large enough sample size (number of people in your experiment), you can assume that the sample mean is normally distributed even if the population follows some other distribution.

How large 'large enough' needs to be depends on a number of factors discussed in the Sample Size section below.

If you work with data scientists, this should help you understand why they are so fixated on ensuring the experiment has sufficient sample size; it's because it is absolutely necessary to fulfill the normality assumption in order to conduct the hypothesis test.

For visual learners, an easy way to prove this theorem is to search for Galton Board on YouTube. Another way is to consider throwing six-sided dice:

If I throw one single die a thousand times, I should expect to get about the same frequency of each number. Thus, my probability plot would have six bars, all of which are approximately the same height (this is called a uniform distribution and is obviously not a bell curve).

Then, I will throw two dice together a thousand times and instead, I will record the average of the dice each time. So if I throw a 4 & a 6 the first time, I will record that as a 5 = (4+6)/2. Now, my distribution will look a little different from the uniform distribution achieved with the single die. This is because a 3 & a 3 average out to 3, but so do a 1 & a 5. I will still have values across the whole range, but the distribution of my data will start to cluster around the middle (the true mean of a fair die is 3.5).

I can continue this with three dice, then four dice, then five. As the number of dice I use increases, the data clusters more tightly around the mean thereby forming a normal distribution. This demonstrates how the population distribution was not normal (it was uniform) but the distribution of the sample mean is. I hope you find this as satisfying and mind-blowing as I did.
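If you'd rather see it than imagine it, here is a minimal simulation sketch (assuming Python with numpy and matplotlib) of the dice experiment: one die produces a roughly flat distribution, while the average of several dice bunches up around the middle into a bell shape.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n_trials = 1000

fig, axes = plt.subplots(1, 3, figsize=(12, 3), sharey=True)

for ax, n_dice in zip(axes, [1, 2, 5]):
    # Throw `n_dice` dice together, n_trials times, and record the average each time.
    rolls = rng.integers(1, 7, size=(n_trials, n_dice))
    sample_means = rolls.mean(axis=1)
    ax.hist(sample_means, bins=np.arange(0.875, 6.25, 0.25), density=True)
    ax.set_title(f"Average of {n_dice} dice")
    ax.set_xlabel("Sample mean")

axes[0].set_ylabel("Density")
plt.tight_layout()
plt.show()
```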

Confidence intervals

The discussion thus far has been focused on whether the control group metric is significantly different from the treatment group, but we have not discussed precision. An average is a great measure of location but it does not tell us anything about the spread of the data. Furthermore, two datasets can have the same average but be vastly different in the range of values they contain.

If I take three courses and score 70% in each one, my overall average is 70%. I could also fail one class with 10% and still achieve a 70% average if I score perfectly in the other two courses. Precision matters! The same lesson can be applied to experimentation.

In addition to a p-value, A/B testing platforms typically provide a confidence interval for each group - a range of values that the population mean is likely to fall into with a certain level of confidence. An experiment with a significance level of 5% has a confidence level of 1 minus the significance level (so 95% in this case). Here are some sample results:

The control group has a sample average of 10, and with 95% confidence the population average falls between 7 (the lower bound) and 14 (the upper bound). This does not mean that 95% of the population will fall between 7 & 14. It means that if you were to repeat the sampling 100 times and compute an interval each time, roughly 95 of those intervals would contain the true population average.
Intervals are provided as a way of quantifying the error that is introduced by random variation when you take a sample that is intended to make some inference about the larger population. It is important to consider these ranges to help navigate your tolerance for variability.
Important note: it is possible to achieve statistical significance with confidence intervals that overlap; use the intervals in conjunction with the p-value to make conclusions about the data.
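For the curious, here is a minimal sketch of how a 95% confidence interval for one group's average could be computed, assuming Python with scipy installed; the metric values are made up.

```python
import numpy as np
from scipy import stats

# Hypothetical metric values for one experiment group.
group = np.array([9, 12, 8, 11, 10, 13, 7, 10, 9, 11])

mean = group.mean()
se = group.std(ddof=1) / np.sqrt(len(group))   # standard error of the mean

# 95% confidence interval: mean ± t_critical * SE,
# using the T-distribution with n - 1 degrees of freedom.
t_crit = stats.t.ppf(0.975, df=len(group) - 1)
lower, upper = mean - t_crit * se, mean + t_crit * se

print(f"sample mean = {mean:.1f}, 95% CI = ({lower:.1f}, {upper:.1f})")
```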

Designing a great experiment

Sample size

Earlier, we discussed the central limit theorem and the requirement for a sufficiently large sample size, which is a term that refers to the number of units (often, people) in each variation group.

The required sample size is not a single number prescribed across all experiments, i.e. it's not 1,000 or 10,000. It is a function of three components: the expected lift or difference between the metric in the control and treatment groups, the significance level, and the power level. A power analysis is an exercise conducted during the planning phase of an experiment to derive an estimate for the required sample size. The math requires lengthy, complex explanations but there are many papers that discuss the approach and online calculators that can do it for you.
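As an illustration, here is a minimal power-analysis sketch assuming Python with the statsmodels package installed; the baseline return rate and the hoped-for lift are made-up numbers for the coupon example.

```python
# Sketch of a power analysis for the coupon example (hypothetical rates).
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline_rate = 0.20   # assumed share of customers who currently reorder
expected_rate = 0.22   # rate we hope the coupon produces (the expected lift)

effect_size = proportion_effectsize(expected_rate, baseline_rate)

n_per_group = NormalIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,    # significance level (tolerance for false positives)
    power=0.80,    # 1 - tolerance for false negatives
)
print(f"required sample size per group: {n_per_group:,.0f}")
```

Online calculators do the same arithmetic behind a form; the point is simply that the smaller the lift you want to detect, the more people you need.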

Relevant to this discussion is the topic of exposure. Consider that the purpose of a hypothesis test is to learn something and apply that learning to your product going forward. If you treat a large portion of your clients to achieve a sufficiently powered experiment and do not reach statistical significance, you do not want to be in a position where you have no one left to test iterations on or, worse, where you have exposed too many people to a harmful treatment.

False positives and false negatives

Why is the significance level presumed to be 5%? When conducting a hypothesis test, there are two types of error that could occur: false positives (aka type I error) and false negatives (type II error). The threshold that is chosen depends on your level of willingness to accept risk in order to achieve the desired experiment results.

A false positive is concluding that there are significant results when the null hypothesis is true. It is the risk of saying something happened when it didn't.

A false negative is failing to detect significant results when the alternative hypothesis is true. It is the risk of not saying something happened when it did.

Type I and II errors are typically set to 5% and 20% respectively. There tends to be a higher tolerance for false negatives since in many instances it is less risky to miss a signal than it is to falsely claim one exists.

The reason that data scientists don't automatically opt for an error threshold lower than the arbitrary 5% is the direct tradeoff with sample size. As you lower your tolerance for error, the sample size requirement increases, as illustrated below. This should be intuitive: more evidence is required when you raise the evidentiary standard.
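As a quick illustration of that tradeoff, re-running the hypothetical power analysis from the Sample Size section with a stricter significance level inflates the required sample size:

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Same hypothetical lift as in the Sample Size sketch.
effect_size = proportion_effectsize(0.22, 0.20)

for alpha in (0.05, 0.01):
    n = NormalIndPower().solve_power(effect_size=effect_size, alpha=alpha, power=0.80)
    print(f"alpha = {alpha:.2f} -> required sample size per group: {n:,.0f}")
```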

Practical significance

You may notice after conducting a few power analyses that by using an arbitrarily large sample size, any difference between the control and treatment group - however small - can achieve statistical significance.

It is of utmost importance to recognize that business decisions must not rely on statistics alone; we introduce practical significance as a layer of common sense to A/B testing. The idea is that prior to running the experiment, stakeholders decide on the minimum magnitude of lift that would need to be observed to be considered meaningful. In doing so, we eliminate the possibility of an overpowered experiment where the results have no discernible impact despite the positive outcome of the test.
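A sketch of how that extra check might look in code, with made-up numbers and a hypothetical minimum meaningful lift agreed on before the test:

```python
# Hypothetical experiment readout.
p_value = 0.03                  # statistically significant at the 5% level
observed_lift = 0.004           # +0.4 percentage points in the return rate
minimum_meaningful_lift = 0.01  # agreed with stakeholders before the test

statistically_significant = p_value < 0.05
practically_significant = observed_lift >= minimum_meaningful_lift

if statistically_significant and practically_significant:
    print("The lift is both real and large enough to matter.")
elif statistically_significant:
    print("Statistically significant, but too small to act on.")
else:
    print("No reliable evidence of an effect.")
```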

Extensions

When not to run an A/B test

Hypothesis testing is not always a practical solution. In industry settings, new products and features are often too heavily marketed to effectively create a control group that is not aware of the treatment. It could also be that your product has a network effect (like a social media app) where people in the control group will interact with the treatment group and pollute the results. Some mediums make creating a control group nearly impossible in and of itself (how would you test two different subway ads if you don't know who is going to visit which station?).

Ethics also plays a considerable role, especially in medical trials. Scientists may be interested in the impact of cigarettes on fetal development, but you cannot legally force a group of women to smoke during their pregnancy to prove a point.

Quasi-experiments

When an A/B test is not appropriate, practical or possible, there is a branch of statistics called quasi-experimentation that may serve as a satisfactory alternative for causal inference. Meaning, techniques exist that are able to make conclusions about the population from a sample just as a hypothesis test does, but often without the need for a control group and through the use of more complex statistical methods.

Examples include quantile regression, regression discontinuity, difference-in-differences and propensity score matching. I would caution that these methods should be approached only with a strong background in statistics and a healthy dose of mathematical critical thinking.

More to discover

As I said previously, this article just barely covers the basics of experimentation. There are dozens of concepts beyond its scope required for running trustworthy tests that we cannot begin to cover in a single blog post.

For example, we could explore sample ratio mismatch, which is a test that indicates whether some hidden bias exists in the way the treatment group is assigned. Many popular commercial experimentation tools have migrated to using Bayesian statistics instead of the frequentist methods described here. There is also a discussion to be had about the multiple comparisons problem, which is the increased likelihood of false positives when testing more than one metric per experiment and requires a computation to achieve p-value correction. That covers just a handful of what's left to discover in the broad spectrum of experimentation.

If you are considering designing your own A/B tests, the topics discussed here are not sufficient. I would recommend consulting a data scientist or at minimum committing to the recommended reading below. 

Here are a few resources to begin your journey in the pursuit of experimentation knowledge. For more on frequentist hypothesis tests:

  • Probability and Statistics for Engineering and the Sciences, by Jay L. Devore
  • Hypothesis Testing: An Intuitive Guide for Making Data Driven Decisions, by Jim Frost
  • Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing, by Ron Kohavi, Diane Tang & Ya Xu
  • Design and Analysis of Experiments, by Douglas C. Montgomery
  • Statistics for Experimenters: Design, Innovation, and Discovery, by George E. P. Box, J. Stuart Hunter & William G. Hunter

For Bayesian or quasi-experimentation techniques, please consult my ultimate data science reading list.

As always, would love to hear your thoughts and feedback in the comments below.
