
The importance of A/B tests (and why you may fail without them)


A thorough, non-technical guide for identifying use cases where running an A/B test is the best measurement strategy.

by Mojan Benham




17 minute read

If you've ever iterated on a product feature or launched a new ad campaign and wondered "how did it do?" then this article is for you. The following is a thorough guide for identifying use cases where running an A/B test is the best measurement strategy - no technical or mathematical background required.

A very brief introduction to A/B testing

This section is by no means a comprehensive definition; rather, it gives you the foundation required to understand the rest of the article. If you'd like to go deeper on this subject afterwards, try Trustworthy Online Controlled Experiments or Causal Inference in Statistics.

An A/B test, sometimes called a controlled experiment, is the process of randomly assigning users to different experiences in order to measure the benefit of one variant over another. The existing experience is the control group and acts as the baseline against which new experiences, or treatments, are compared. For example, I may run an A/B test to see if increasing the length of a free trial on my website will boost signup rate. The control group keeps the current trial length and the treatment group gets the extended trial.
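If you're wondering what "randomly assigning users" looks like in practice, a common approach is to hash each user's ID into a stable bucket. Here's a minimal sketch of that idea (a hypothetical helper, not any specific platform's API):

```python
import hashlib

def assign_variant(user_id: str, experiment: str = "free-trial-length") -> str:
    """Deterministically assign a user to 'control' or 'treatment' (50/50 split)."""
    digest = hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100  # stable value between 0 and 99
    return "treatment" if bucket < 50 else "control"

print(assign_variant("user-42"))  # the same user always lands in the same group
```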

You may have heard terms surrounding this topic like statistical significance or sample size, but to understand the premise of how it works we can actually just boil it down to simple subtraction!

Think of it this way: any metric that you observe over time is subject to factors like seasonality or natural fluctuation. These factors will be present in both the control and treatment groups, the only difference between them being the change or intervention that we're trying to measure. So, when you compare your treatment group against the baseline, you cancel out the effect of any external factors and are left with the isolated effect of your change.


Figure 1: As with a simple subtraction of terms, the effects shared between experiment groups cancel out, isolating the experiment intervention so that it can be measured.
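To see the cancellation in Figure 1 as literal subtraction, here's a toy calculation with made-up numbers where both groups experience the same seasonal bump:

```python
# Made-up signup rates: both groups share the same +0.5 point seasonal bump,
# so subtracting the control rate cancels it and leaves only the effect of
# the longer free trial.
control_signup_rate   = 0.040 + 0.005   # baseline + seasonal bump
treatment_signup_rate = 0.046 + 0.005   # baseline + true effect + seasonal bump

lift = treatment_signup_rate - control_signup_rate
print(f"Estimated lift: {lift:.3f}")    # 0.006 -- the shared bump cancels out
```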

Reason 1: Establishing causality and the pitfalls of pre-post analysis

We'll use the following use case as an illustrative example: suppose you update the look and feel of your website's homepage in hopes of improving session to customer rate.

A common strategy would be to make the change and then observe the response metric before and after it. If the metric is relatively stable over time, isn't in a seasonally affected period, and the change doesn't coincide with the efforts of other teams, then this approach is reasonable. The issue is that these assumptions very rarely hold true; in fact, the impact of the change is typically compounded or confounded by several other factors.



Figure 2: [top] A decent use case for pre-post analysis: the metric is flat in the pre-period and sees a significant increase at the point of intervention (assuming it doesn't overlap with other changes in the business). [bottom] A flawed use case for pre-post analysis: natural growth in the metric before and after the change cannot be separated from the effect of the intervention.

 

Let's first discuss compounding factors, which are factors that affect the response metric at the same time as the experiment. These typically have an inflating or diluting contribution to the observed effect of the treatment.

The most common pitfall in pre-post analysis is measuring business metrics that grow over time, which is likely, since growing them is the company's mission. If session to customer rate is already trending upward when the homepage is revamped, the lift measured in the post-period would at best be inflated, because the effect of the change is compounded with the pre-existing growth. In fact, it's possible that the new homepage had no effect at all and that the analysis is attributing credit for an outcome that would have occurred with or without the change. Ask yourself: if customers were already growing before the change, how do I know how many of them joined as a direct result of my intervention?
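To make this concrete, here is a toy illustration with made-up numbers in which the conversion rate grows on its own and the homepage change has no true effect at all:

```python
# Made-up weekly conversion rates (%): the metric grows ~0.1 points per week
# on its own, and the homepage change has zero true effect.
pre_weeks  = [3.0, 3.1, 3.2, 3.3]   # before the new homepage
post_weeks = [3.4, 3.5, 3.6, 3.7]   # after the new homepage

pre_avg  = sum(pre_weeks) / len(pre_weeks)
post_avg = sum(post_weeks) / len(post_weeks)
print(f"Pre-post 'lift': {post_avg - pre_avg:.2f} points")
# Prints 0.40 points, all of which is pre-existing growth wrongly credited
# to the homepage revamp.
```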

Seasonality is another leading cause of misattribution in pre-post analysis. I have seen many instances in practice where a team will report on the incredible impact of a project they shipped just in time for the holidays or Black Friday. The reality is that the business would have seen a seasonal uplift regardless of whether the project existed, and if the project did have a positive impact, it cannot be isolated from the seasonal effect with this type of analysis.

There are also confounding factors. Confounding occurs when your change is correlated with movement in the success metric, but that movement is actually being caused by an external factor.

For example, consider a scenario where the homepage rebrand goes live around the same time that the paid advertising team launches a new ad campaign. A positive uptick is observed in the volume of customers, but it's unclear whether it is due to (a) the new homepage alone, (b) the ad campaign, or (c) a combination of both efforts that cannot be separated from one another.

To understand this further, we can reference a dilemma given in Judea Pearl's The Book of Why. In it, he details a study revealing that when there are extreme heat warnings, cities see a rise in violent crime. Ice cream sales also go up. It should be apparent that we cannot reasonably deduce that selling ice cream causes violent crime, because we know that correlation does not imply causation. However, this discernment becomes much less reliable when the levers that affect a metric are unknown. We hypothesize that changing the homepage will increase customers, but we don't know for sure, or by how much. Thus, a rise in customers could be correlated with the homepage (just as ice cream and crime are correlated) but actually caused by the ad campaign (just as ice cream sales are caused by heat).

In fact, there are entire websites dedicated to demonstrating the ridiculousness of spurious correlations. Don't be fooled by the comedic effect of these examples; confounders often cause misplaced credit in simple business metrics as well.


Figure 3: [Source: tylervigen.com] An example of two variables that are correlated without one causing the other. It demonstrates how tweaking a lever and observing the movement of another variable can lead to drawing false causal relationships.

 

Keep in mind that although we have so far demonstrated how metrics can be inflated, the reverse can also happen. Perhaps you see a decrease in your primary metric, but it would have decreased a lot more if not for your efforts. Just as the treatment cannot be isolated from pre-existing trends, it cannot be isolated from a number of external levers, including:

  • company growth
  • natural fluctuations and random variance
  • seasonality 
  • market-driven changes
  • the efforts of other team members (including other experiments)

All of these examples serve to illustrate the downfall of pre-post analysis, which is that it cannot be used to extract the isolated or incremental effect of a change in the highly likely case that it is influenced by other factors.

Ideally, to understand the impact of a change we would be able to witness the world both with and without it (kind of like the movie Back to the Future). "What would have happened" is called the counterfactual, and the closest we can come to simulating a counterfactual is by withholding a control group. Just as we saw in Figure 1, the comparison between the control group (the counterfactual) and the treatment (the observed) leaves us with the incremental effect of the treatment. This serves as a superior measurement alternative to the pre-post analysis, free of compounding and confounding agents.

Reason 2: Comparing averages causes false positives

Let's take it a step further: rather than rolling out the change completely, you could show the new homepage to one half of the users and withhold it from the other half, comparing the session to customer rate between the two groups. This resembles an A/B test, but is missing a key component: statistics.

We can build an understanding for this using a simple example:

You and I each flip a coin 100 times. In theory we should both get heads 50% of the time but in reality, luck plays a role in the outcome. It would be reasonable for you and me to get, say, 54 and 48 heads respectively. At face value, 54 > 48, but we know that the difference is due to random chance and not because your coin is better.

However, if you had gotten 85 heads and I had gotten 12, we may start to suspect that the coins are not 50/50 and your coin is in fact better at getting heads than mine.

This example highlights a particularly important intuition when comparing two numbers: it is not enough to simply declare the bigger number the winner. You also have to account for the metric's variance, i.e. the spread of values it can take by chance. A number is only truly better if it exceeds our expectation of what can happen by chance beyond a reasonable doubt.
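To put rough numbers on that intuition, here is a quick sketch (assuming scipy 1.7+, which provides binomtest) that asks how surprising each coin-flip outcome would be if the coin were actually fair:

```python
from scipy.stats import binomtest

# How surprising is each outcome if the coin is actually fair (p = 0.5)?
print(binomtest(54, n=100, p=0.5).pvalue)  # ~0.48: easily explained by chance
print(binomtest(85, n=100, p=0.5).pvalue)  # ~1e-12: "beyond a reasonable doubt"
```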

As obvious as it sounds, our intuition about 50/50 coins doesn't always extend to real-world use cases. Session to customer rate, for example, doesn't have a fixed variance and can differ based on the day of week, time of year, tenure of the business, market demand, and a number of other elements. If we know that session to customer rate varies by +/- 5% on any given day, then the experiment results would need to be well above that range to consider the new homepage a win. Otherwise, we could be crediting ourselves for a rise caused by natural fluctuation.

You may be wondering, "well, what qualifies as beyond a reasonable doubt?" This is where the concept of statistical significance, or confidence, is introduced. One popular method is to require a p-value less than or equal to 0.05, meaning there is at most a 5% probability of observing results at least this extreme under the presumption that the variants are actually the same. The exact math is beyond the scope of this post, but essentially there are industry thresholds for whether an observed change is significant enough to presume that one variant is truly greater than the other.
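As a hedged sketch of what such a check might look like for the homepage test (the counts below are invented for illustration), a two-proportion z-test from statsmodels could be used:

```python
from statsmodels.stats.proportion import proportions_ztest

customers = [520, 480]        # converted sessions: treatment vs control
sessions  = [10_000, 10_000]  # total sessions per group

z_stat, p_value = proportions_ztest(count=customers, nobs=sessions)
print(f"p-value = {p_value:.3f}")
# Only if p_value <= 0.05 would we call the new homepage a statistically
# significant win; otherwise the observed gap is within what chance allows.
```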

When working with A/B testing software or a framework laid out by data scientists or statisticians, variance, confidence and a number of other statistics contribute to determining the final outcome. The power behind these results is that they are reported with a quantified level of confidence (e.g. 95%), thereby mitigating the risk of drawing conclusions based on a false positive.

Ronald Coase put it best: "if you torture the data long enough, it will confess." People are biased toward their own hypotheses and have a tendency to look for confirming evidence. We rely on statistical reasoning to act as the gatekeeper of integrity, particularly when it comes to metrics that are nuanced with influences beyond human intuition.

Reason 3: Intuition is not a substitute for measurement

Confirmation bias is a perfect segue into the premise of this section: intuition will fail you. In fact, a large majority of experiments run at tech companies do not validate their hypotheses. Take Slack's then-Director of Product, Fareed Mosavat, who tweeted that 70% of their experiment hypotheses are proven false, or Booking.com, which reports that 9 out of 10 of its experiments are disproven.

 

First, it's pertinent to note that a disproven experiment is not a failed one. Learning that an idea doesn't work can be just as valuable as a good idea. More importantly, it should teach us to exercise a healthy level of skepticism with new ideas, even if they seem painfully obvious.

In my years of experience with A/B testing at large tech companies, I've noticed a widespread presumption that people are good at assessing the value of their own ideas, when in fact their ideas often have the opposite of the intended effect. For example, a test of mobile push notifications revealed that instead of increasing engagement, the intrusive nature of the notifications caused users to uninstall the app altogether.

If we can learn one thing from companies like Facebook and Airbnb that have achieved experimentation at scale, it's that trusting A/B tests to determine the validity of ideas allows us to align the product with the user's reality rather than our own biases. 

Reason 4: Quantifying impact

This one is short and sweet. If you've ever reported a success to leadership, you were likely asked "but how much better is it?" or "what was the impact on the bottom line?" There are times when the benefit to the user is confirmed directionally but the exact lift is unknown.

A/B tests are not only able to quantify the impact, but also attach a level of confidence to the result. For example: "we've concluded, with 95% confidence, that the new homepage increased session to customer rate by 3%."

Reason 5: Assessing risk and sensitivity analysis

Thus far we've discussed the power of A/B tests in either confirming or disproving a hypothesis, but they are also capable of optimizing a solution beyond a simple yes or no. 

Suppose you are offering a monetary incentive to users who refer a friend and you'd like to know the lowest amount of money required to get three referrals per person. Instead of having a control group and a single treatment, we can construct multiple treatment groups with varying referral bonuses. This type of analysis where a lever (referral amount) is measured at increasing intervals to observe the change in outcome is called sensitivity analysis and is a powerful optimization tool offered by A/B tests. 

If we observe similar results across many tiers, we can choose the lower amount knowing confidently that we are not overpaying. Alternatively, if the groups perform differently, we gain a quantified understanding for the tradeoff between monetary loss and user acquisition.


Figure 4: Results from a sensitivity analysis conducted by measuring the average number of referrals made by a user conditional on the referral bonus offered. In this case, each tier on the x-axis would be its own treatment group and we would conclude that the maximum benefit is seen at $20. 
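A minimal sketch of how such a readout could be produced (the data below is simulated, and the average referrals per tier are assumptions chosen for illustration):

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)

# Simulated referral counts: a control group with no bonus, plus one treatment
# group per bonus tier (the per-tier averages are invented).
control = rng.poisson(lam=1.0, size=500)
tiers = {10: 1.6, 20: 2.9, 30: 3.0}

for bonus, avg_referrals in tiers.items():
    group = rng.poisson(lam=avg_referrals, size=500)
    _, p_value = ttest_ind(group, control)
    print(f"${bonus} bonus: mean referrals = {group.mean():.2f}, p vs control = {p_value:.3g}")

# If the $20 and $30 tiers perform alike, the cheaper $20 bonus captures most
# of the benefit and we avoid overpaying.
```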

 

Assessing tradeoffs also extends to quantifying risk. Some decisions are expected to hurt the user experience in favour of regulatory compliance, or at least present a compromise that requires measurement. Examples include:

  • A new data privacy policy requires websites to ask users to accept terms of service before they may proceed to the landing page. We'd like to learn whether this added friction is hurting acquisition.
  • A necessary feature is ready to ship but it is expected to slow page load times by a few milliseconds. Historically, even subtle changes to site speed have hurt conversion rates so we need to test the impact on conversion.
  • The marketing team would like to add an email subscribe button to the checkout page of an online store but some team members believe that this will distract users from completing their order. An A/B test is constructed to weigh the value of gaining subscribers against potential orders lost.

Even in experiments that are predicted to have no risk, guardrail metrics are monitored to ensure that implicit tripwires aren't crossed. New features can cause latency, increased marketing can lead to unsubscribes, and more CTAs can distract users from core product functionality. Especially as the product grows, ideas are not 100% good or 100% bad; rather, they lie somewhere on the spectrum and require an A/B test to assess vital tradeoffs.

Reason 6: Detecting subtle or complicated effects

This category largely pertains to more mature businesses that are either well-optimized or have introduced complex mechanisms such as recommender systems or machine learning models.

Particularly with machine learning, it's difficult to deconstruct and communicate the inner workings of an algorithm to leadership. If your stakeholders understand A/B tests, they can trust the results of a treatment designed around the new model without needing to understand the model itself. This creates a common ground for proof of concepts among technical and non-technical team members.

On a separate note, more nuanced treatments such as copy revisions, updates to CTA messaging and colour changes generate subtle effects that cannot be confirmed without the use of an A/B test. It should be noted that these kinds of tests are typically reserved for mature, highly optimized products that have the sample size required to detect subtleties. At companies like Amazon, small changes that create even a 1% top-of-funnel increase can result in significant boosts to the bottom line, but they are much less impactful in smaller organizations.

Cases against A/B tests

All things considered, there are several valid cases where running a controlled test is either not possible or not desired, including:

  1. No experiment platform: your organization may not have the infrastructure or framework that supports the randomization for A/B tests, or hasn't instrumented the tracking required to measure the response metrics.
  2. Small sample size: a common limiting factor for new products that don't have enough users to reach significant results. The minimum size required depends on various levers, including the baseline rate of the chosen metric, but typically user volume in the hundreds or thousands is required (a rough calculation is sketched after this list).
  3. Randomization is not possible: there are some domains where users cannot be randomized. This includes SEO efforts where the intervention lives on the search engine results page. In these cases you cannot randomize because the treatment occurs on an externally owned property.
  4. Immediate decision required: the change needs to be shipped immediately and cannot wait for experiment results, which typically take one to two weeks at minimum.
  5. No decision point to be informed by results: the purpose of an A/B test is to either prove/disprove a hypothesis, or to quantify the impact so that a data-informed decision can be made. If you find yourself in a position where you'd make the same decision regardless of the experiment's outcome, re-evaluate what purpose the experiment serves.
  6. Inadequate experiment length: if your hypothesis relies on a lagging indicator, you run the risk of reporting a false negative. Meaning, it's possible that the metric is affected, but is being reported as unaffected because its feedback loop is longer than the length of the experiment. For example, it may take months for users to churn as a result of increased subscription prices, but you claim the increase had no negative impact based on just a few days of data. In these cases, you either have to choose a leading indicator, increase the experiment length, or not run the experiment at all.
  7. Ethical boundaries: there are many instances where it is morally incorrect to withhold a control group. For example, if I have identified fraudulent users on a buy-and-sell marketplace that I own, it would be unfair to construct an experiment where I only remove suspicious activity from one group of people in order to prove the value of anti-fraud efforts.
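On point 2, here is a rough, hedged sketch of how a minimum sample size might be estimated using statsmodels' power utilities (the baseline rate and target lift below are hypothetical):

```python
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

baseline = 0.10   # hypothetical current conversion rate
target   = 0.12   # smallest lift we care about detecting

effect = proportion_effectsize(target, baseline)
n_per_group = NormalIndPower().solve_power(effect_size=effect, alpha=0.05, power=0.8)
print(f"Roughly {n_per_group:,.0f} users needed per group")
# Smaller baseline rates or subtler lifts push this number up quickly.
```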

In such cases, there are measurement alternatives to A/B tests, though it should be noted that these methods only work if a set of statistically nuanced assumptions is fulfilled. This typically requires a data scientist to conduct a manual, relatively high-effort analysis.

One option is to proceed with pre-post analysis while attempting to control for compounding and confounding agents. If it isn't detrimental to the user, it is recommended that the intervention be implemented and withdrawn several times to establish multiple proof points (where the metric increases when the change is introduced and decreases when it is rolled back).

Quasi-experimentation is also an acceptable albeit complex option. Methods such as difference-in-differences, geo experiments, propensity score matching and Google's CausalImpact library use various approaches to either estimate the counterfactual or validate a non-random control group. These methods should be leveraged with the strict guidance of a data scientist as they require a deep understanding of the statistical assumptions that need to be fulfilled to ensure a valid analysis.
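As a flavour of what these approaches estimate, here is a bare-bones illustration of the difference-in-differences idea with invented numbers; the real method requires validating assumptions (such as parallel pre-trends) under a data scientist's guidance:

```python
# Conversion rate (%) in a region that received the change vs a comparable
# region that did not (numbers are invented for illustration).
treated_pre,  treated_post = 3.0, 3.8
control_pre,  control_post = 3.1, 3.4

# The control region's change (+0.3) estimates the counterfactual trend;
# subtracting it isolates the incremental effect of the intervention.
did = (treated_post - treated_pre) - (control_post - control_pre)
print(f"Estimated incremental effect: {did:.1f} points")  # 0.5 points
```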

All that being said, A/B tests are the gold standard for establishing causality and for measuring the incremental effect of treatments. In my experience, experimentation has been the cornerstone of identifying causal relationships between the product and user behaviour, and an invaluable communication tool for relaying impact to leadership. I often revisit the reasons outlined above when introducing the topic to new stakeholders so I hope that this serves as a useful reference for those of you who plan to have these discussions in the future. 

As always, I'd love to hear your thoughts and feedback in the comments below.
