A/B Testing, from scratch

Share On Facebook
Share On Twitter
Share On Google Plus
Share On Linkedin
Share On Reddit
Contact us

If you work in a diligent web development business you probably know what an A/B test is. However, its fascinating statistical theory is usually left behind. Understanding the basics can help you avoid common pitfalls, better design your experiments, and ultimately do a better job in improving the effectiveness of your website.

Please hold tight, and enjoy a pleasant statistical journey with the help of R and some maths. You will not be disappointed.

This blog post is also published on my company blog which inspired those studies.

Introduction

An A/B test is a randomized, controlled experiment in which the performance of two product variants are compared. Those variants are usually called A and B 1. From a business perspective we want to know if the performance of a certain variant outperforms the other.

As an example let's take a typical checkout page where we want to assess if a green checkout button is more effective than an orange one.

Experiment example
Figure 1: Experiment example

After 1 week we may collect the following numbers:

Converted Total
Variant A 100 10000
Variant B 120 10000

If at this point you are willing to conclude that the B variant outperforms A be aware you are taking a very naïve approach. Results can vary on a weekly basis because of the intrinsic randomness of the experiment. Put simply, you may be plain wrong. A more thorough approach would be to estimate the likelihood of B being better than A given the number we measured, and statistics is the best tool around for this kind of job.

Statistical modeling

Statisticians love jars and, guess what, our problem can be modeled as an extraction from two different jars. Both jars have a certain ratio of red and green balls. Red balls are customers who end up paying while green balls are customers who leave.

Does the jar belonging to variant B have a greater proportion of red balls compared to the other one? We can estimate this ratio by doing an extraction with replacement 2 from the jar. We extract a finite number of balls from each jar and we measure the proportions. In this type of experiment each time we pick a ball we also put it back to the jar to keep the proportions intact.

Figure 2.a: Variant A population
Figure 2.b: Variant B population

Now the good news is that the Binomial distribution 3 models exactly this type of experiment. It will tell you the expected number of successes in a sequence of n independent yes/no experiments, each of which yields success with probability p.

In other words, each time we extract n balls from our jar we are conducting an experiment, and a yes corresponds to a red ball while a no corresponds to a green ball. The binomial distribution 4 computes the probability of picking exactly k red balls after n extractions as follows:

(a) Pr(X = k) = \frac{n!}{k!(n-k)!} p^k (1-p)^{n-k}

which can be checked using the dbinom() function in R. Assuming we have a jar with 30% of red balls, what is the probability that we extract exactly 10 red balls out of a 100? This is equivalent to Pr(X = 10) which is:

Very small. Fair enough, the jar has definitely a bigger proportions of red and I was the unlucky guy here. Now, let's plot these values for x (number of successes) ranging between 0 and 100:

Makes sense, doesn't it? The chances of having exactly k successes cumulates around the value of 30, which is the true proportion of red/green balls in our jar.

Naïve experiment assessment

Now that we know how statisticians model our problem let's go back to our conversions table.

One way of assessing if B is better than A is to plot their expected distributions. Assuming that A follows a Binomial distribution with p=0.01 (we had 100 conversions over 10000 trials) and that B follows a Binomial distribution with p=0.012 (we had 120 conversions over 10000 trials) this is how they relate to each other:

So if the true mean of the two distribution is p_a = 0.01 and p_b = 0.012 respectively we can conclude that B is better than A. If we repeat the experiment several times (always with 10000 participants) for A we will get most of the time values between 70 and 129 while for B we will get values between 87 and 152. You can check these boundaries on the plot, or you can compute them manually with the 3 times standard deviation rule of thumb 5.

But hold on one minute. How do we know that p_a = 0.01 and p_b = 0.012 are indeed the true means of our distributions? In the end we simply did one extraction from our jars. If these numbers are wrong our distributions will look different and the previous analysis will be flawed. Can we do better?

More rigorous experiment assessment

In order to estimate what is the true mean of our variants statisticians rely on the Central Limit Theorem (CLT) 6 which states that the distribution of the mean of a large number of independent, identically distributed variables will be approximately normal, regardless of the underlying distribution.

In our case we are trying to estimate the distribution of the mean of the proportions p_a and p_b for our variants. Suppose you run your A/B test experiment N=100 times and that each time you collect n=10000 samples you will end up having for variant A:

p_a^1, p_a^2 ... p_a^N

and the CLT tells us that these are normally distributed with parameters:

\mu_p = p, \sigma_p = \frac{\sigma}{sqrt(n)} * \frac{N-n}{N-1} \approx \sqrt{\frac{p * (1-p)}{n}}

where \sigma = \sqrt{p * (1-p)} is the standard deviation of the binomial distribution. I know, this is hard to believe and proving these numbers goes definitely beyond the scope of this blog post so you will find some maths-heavy material in the footnotes 7 8.

Back to our question, what is the true mean of p_a and p_b? Well, we don't know really, but they are distributed normally like this:

As you can see we are dealing with a risky business here. There is a good chance that the estimation of the true values of p_a and p_b are not correct since they can span anywhere between all the values plotted above. For a good number of them p_a may actually outperform p_b violating the conclusion we did in the previous section. There is no magic bullet that will solve this problem, this is the intrinsic nature of the probabilistic world in which we live. However, we can do our best to quantify the risk and take a conscious decision.

Quantitative evaluation

In the previous section we have seen that it is likely that Variant B is better than Variant A, but how can we quantify this statement? There are different ways in which we can look at this problem, but the ones that statisticians use is Hypothesis testing.

In a series of papers in the early 20th century, J. Neyman and E. S. Pearson developed a decision-theoretic approach to hypothesis-testing 9. The theory was later extended and generalised by Wald 10. For a full account of the theory, see the book of Lehmann and Romano 11.

In this framework we state a hypothesis (also called null-hypothesis) and by looking at the number we will try to reject it. In our example we hypothesize that the true conversion of our visitors is p_a and that the proportion p_b we collected in the B variant is simply due to chance. In other words we assume that our real world visitors behave like in variant A and we quantify the probability of seeing variant B's proportions under this hypothesis.

So, what is the probability of having 120 or more conversions if the true mean of the binomial distribution is p_a = 0.01? We simply have to sum the probability of all possible events:

P(X_a >= 120) = P(X_a=120) +P(X_a=121) + ... + P(X_a = 10000)

To actually compute the number you can use the probability mass function in Formula (a) or you can use R:

which gives us the following:

I deliberately specified alternative = "greater" in the function call to compute the chance of getting more than 120. But there are other ways 12 to approach the problem. The value we are looking for is the p-value 13 0.0276 which is exactly the probability of getting more than 120 successes, i.e. P(X_a >= 120). Visually it corresponds to the area under the right end tail of the distribution of A:

We are now ready to quantify to what degree our null-hypothesis is true or false according to the numbers we collected in our experiment.

Type I and Type II errors

It is well known that statisticians do not have the same talent in the art of giving names as computer scientists do.  This is well proved by the definition of Type I and Type II errors which are equivalent to the Machine learning definitions of False positive and False negative.

Type I error is the probability of rejecting the null-hypothesis when the null-hypothesis is true. In our example this happens when we conclude there is an effect in our AB test variant, while in reality there is none.

Type II error is the probability of failing to reject the null-hypothesis when the null-hypothesis is false. In our example this happens when we conclude the AB test variant has no effect, while in reality it actually did.

In the previous section we quantified the probability of Type error, which is equivalent to the p-value returned by the binom.test function. To quantify the Type II error we need to know for what pvalue = \alpha we are willing to reject the null-hypothesis. It is common practice to use \alpha = 0.05 so I'll just go with that.

Now, what is the critical number of conversions such that we reject the null-hypothesis at \alpha = 0.05? This is called the percentile 14 and it is simply the x_{\alpha} such that:

P(X <= x_{\alpha}) = 1 - \alpha

which can also be computed easily in R with the following:

Now we know that starting from 117 + 1 observations we will reject the null-hypothesis.

To compute the Type II error we need to assume that the null-hypothesis is false (p_b = 0.012) and quantify how likely it is to measure a value of 117 or less since this is exactly what will lead us to a mistake!

P(X_b <= 117) = P(X_b = 0) + P(X_b = 1) ... + P(X_b = 117)

which is equivalent to:

which means that we have a ~= 40% chance to conclude our experiment did not have any effect, while in reality there was some.

That's seems harsh to me. What can we do? Apply the first law of the modern age of statistics which is, get more data. Assuming we can double our data and get 20.000 visitors instead of 10000 and that we measure the same proportions, our Type II error will go down drammatically:

now we have ~= 14% chance of concluding our experiment did not have any effect. We can even go one step further and see for every possible observation count what's the expected Type II error by simply brute forcing the computation as follows:


download (1)

which seems a fair result, starting from 30.000 visitors we can safely assume that our probability or Type II error will be low.

The analysis as I just presented it is flawed by a fundamental assumption. To estimate the Type I error we assumed that p_b is exactly 0.012 while to estimate the Type II error we assumed that p_a is exactly 0.01. We know from the CLT that this is not exactly true and the distributions of those values span across a certain range (see here). So let's see what happens if I take several points across these distributions, for example the 1%, 25%, 50%, 75%, 99% percentiles and check what happens to our hypothesis testing errors.

For Type II errors I first collect the possible values of p_a for my parametric analysis:

and then I estimate the error exactly like I did previously:

which produces the following:

where the green line is the same as the one plotted here. It is quite interesting to observe the pink line at p_a = 0.0123. This is the worst case scenario for us since we picked a value that is greater than p_b = 0.012 and because of that our Type II error actually increases with the number of observations! However it is also worth mentioning that this value is very unlikely because the more data I collect the more the uncertainty around the values of p_a decrease. You may also notice that the lines in the graph are quite thick, and this is because discrete tests like the binomial one 15 have quite large oscillations 16.

The same type of parametric analysis can be performed on the Type I error. Previously we saw how to compute it with the binom.test() function, but we can do the computation manually as follows when p_a=0.01 and p_b=0.012:

where 119 is the value starting from which we fail to reject the null-hypothesis. The parametric analysis is a generalisation of this code like we did for the Type II error:

which plots the following:

as in the Type II error we notice that for p_b = 0.009 the Type I error actually increase with the amount of data but this value become more and more unlikely as the data grows.

It also very interesting to notice how the value of the two type of errors goes down at a completely different rate. Overall, with this design, we are more likely to stick with the "button colour makes no difference" conclusion. When the reality is that button colour makes no difference, the tests will stick to reality most of the times (Type I error goes down quickly). When the reality is that button colour does make a difference, the test does take the risk of saying there is actually no difference between the two (Type II error goes down slowly).

Estimate the sample size

Before wrapping up let's make a step back and position ourself back in time before we started the experiment. How we can estimate for how long we should run our experiment in order to be confident that our results are statistically significant? To answer this question you simply need to use the same tools we just saw, from a different perspective.

First, you need to make an estimate around your current baseline conversion. If you use google analytics it should be straightforward to know what is the conversion of your checkout page with the green button.

Second, you need to make a guess on what type of effect size you are willing to detect (minimum detectable effect size). In our example we would have chosen an effect size of 20%. In various disciplines effect sizes have been standardised to make different experiments comparable. Most famous effect size measure is Cohen's d 17 18.

Third, you need to assert what risk you are willing to take on the Type I error, or equivalently, what \alpha level you are willing to choose for your p-value. This is the value you will refer to at the end of the experiment to make your conclusion.

Finally, you need to assert what risk you are willing to take on the Type II error, or equivalently, how likely you are willing to say your orange button did not perfomed better when in reality there was an effect (i.e. orange button is better). This is equivalent to the power, where you assert how likely you are going to be correct when you conclude that the orange button is better, when it is actually real.

Mathematically speaking computing the required sample size seems to be a difficult problem in general and I would like to point you to the footnotes for a deeper discussion 19 20 21. Here I will show the approach taken from the Engineering statistics handbook 22.

First, we are look at the statistics:

\delta = |p_a - p_b|

and we would like to compute the sample size such that:

(1) P(reject H0 | H0 is true) = \alpha
(2) P(reject H0 | H0 is false) = power = 1 - \beta

Now, you have to use some faith and intuition. What is the smallest value of \delta, say \delta_{min}, that we care about? Our minimum detectable effect size! You can imagine \delta_{min} is a function of both (1) and (2). The smallest value in (1) for which we start rejecting is:

z_{1-\alpha} * \sqrt{\frac{p_a * (1-p_a)}{N}}

while the smallest value in (2) for which we start rejecting is:

z_{1-\beta} * \sqrt{\frac{p_b * (1-p_b)}{N}}

where power = 1 - \beta putting it all together we have:

\delta_{min} = z_{1-\alpha} * \sqrt{\frac{p_a * (1-p_a)}{N}} +z_{1-\beta} * \sqrt{\frac{p_b * (1-p_b)}{N}}

Take the N out of this equation and you get:

N = (\frac{ z_{1-\alpha} * \sqrt{p_a * (1-p_a)} + z_{1-\beta} * \sqrt{p_b * (1-p_b)}}{\delta_{min}})^2

So, using our example we state that: (i) our baseline conversion for the green button is 1% or p_a = 0.01; (ii) our estimate of effect size is 20% which leads to the orange button converting at 1.2% or p_b = 0.012; (iii) we accept a Type I error probability of 5% or \alpha = 0.05; (iv) we want a Power of 80% to make sure we detect the effect when it is there.  This is translated in R as follows:

which seems to be consistent with the simulations we did previously. On the other hand it is 30% lower than the one reported by a fairly famous online tool 23 24. If you know why please let me know in the comments.

Wrapping up

In this blog post we took a simple conversion table and we did a step by step statistical analysis of our results. Our experiment is designed like an extraction with replacement from a jar which is modelled with a binomial distribution.

We concluded that given the observed data the checkout page with the orange button is 20% more likely ((0.012 - 0.01) / 0.01) to convert than the one with the green button. If the green button is equivalent to the orange button (no effect), we came to the wrong conclusion with a probability of 2.76% (P(type I error), also called p-value). If the orange button is better than the green button, we came to the right conclusion with a probability of 60% (1 - P(type II error), also called the power). In this same scenario (orange button is better) if we double the number of visitors in our experiment our probability of being correct (power) will go up to 86%. Finally, we reviewed how to estimate the required sample size for our experiment in advance with a handy formula.

It is worth mentioning that in practice you rarely use a binomial test but I belive the simplicity of the distribution makes it adequate for this type of presentation. A better resource on choosing the right test can be found in the references 25.

If you enjoyed this blog post you can also follow me on twitter. Let me know your comments!

References


Also published on Medium.

  1. Wikipedia page on A/B testing
  2. Extraction with replacement
  3. Binomial distribution
  4. Probability mass function
  5. 3 times standard deviation rule of thumb
  6. Central Limit Theorem
  7. Sampling Distributions
  8. The Central Limit Theorem around 1935
  9. Neyman and Pearson, Biometrika, 1928
  10. Wald, The Annals of Mathematical Statistics, 1939
  11. Lehmann and Romano, Springer Texts in Statistics, 2005
  12. One-tailed vs two-tailed tests
  13. p-value on wikipedia
  14. Percentile on Wikipedia
  15. Binomial test on Wikipedia
  16. Binomial test error oscillations on CrossValidated
  17. Thresholds for interpreting effect sizes
  18. Using Effect Size—or Why the P Value Is Not Enough
  19. Sample size determination for hypothesis testing of means
  20. Statistical Methods for Rates and Proportions, Third Edition
  21. Sample Size Calculation For
    Comparing Proportions, Wiley Encyclopedia of Clinical Trials
  22. Engineering statistics handbook - sample sizes required
  23. Evan Awesome AB Tools
  24. Evan Miller source code for sample size
  25. An Overview: choosing the correct statistical test

7 thoughts on “A/B Testing, from scratch

  1. This is great and astonishing that so many statisticians and scientists completely overlook effect size and power. What are your thoughts on a Bayesian approach that actually calculates a posterior? We have powerful computers now that can explicitly compute these probabilities.

    1. Hi Barry, thank you for your comment.
      In the Bayesian approach I think the main advantage is that the results are much easier to interpret compared to the hypothesis testing framework. In terms of the quantitative result it should be equivalent on the long run. Writing an article on the bayesian approach is definitely on my todo list, so I'll get into the details 🙂

Leave a Reply

Your email address will not be published. Required fields are marked *

Answer the question * Time limit is exhausted. Please reload CAPTCHA.