
Shirbi Ish-Shalom | Using R to Up Your Experimentation Game | RStudio
Have you ever cut an A/B test short? Maybe because of traffic constraints, your antsy boss, or early successful results. In reality, cutting your test short can be catastrophic, making your business decision no better than a coin flip. Learn some R-driven tips & tricks to get meaningful results quickly with a statistically rigorous methodology called sequential testing, an A/B testing enhancement my team employs at Intuit. Key Takeaways: 1) What sequential testing is and how to use it. 2) How to learn (and fail!) quickly by taking big metric swings. 3) How I used R to share my learnings & make them useful for anyone (even non-data scientists!) at my company. About Shirbi: Shirbi Ish-Shalom is a human person
Transcript
This transcript was generated automatically and may contain errors.
Have you ever heard the phrase, "The results were significant early, I guess I can call it now," or "My boss is getting really antsy and I think she wants the test results soon. They seem directionally positive," or "We really need to balance statistical rigor with the needs of the business"?
The same flurry of phrases would always overwhelm the data scientists in my meetings, bad fluorescent overhead lighting illuminating the battleground between statistical rigor and the omnipresent needs of the business. As a former data scientist turned product manager and self-declared data nerd, I see both sides of the story. I wanted to help bridge the gap between my business and data peers by understanding experimental methods that allow our organization to run tests more quickly and maintain statistical rigor.
Today, I'll walk you through three easy-to-follow steps, motivated by my own experiences at Intuit, so that you can apply them to your own use cases and run experiments more quickly with statistical rigor. First, we'll talk about how to take big swings so you can learn and fail quickly. Second, we'll talk about what sequential testing is and how to use it so that you can end tests early confidently. And third, we'll talk about how to use R so that you can share your learnings with your organization, even non-data scientists.
The danger of reading tests early
In reality, businesses do need to move quickly to stay innovative. However, the results of reading a test early could be catastrophic, leading to your business decision being no better than a coin flip.
Here is data for a simulated AA test, a test where we have two cohorts but they have exactly the same experience. We know that since these two experiences are exactly the same, there should be no statistically significant difference between the two. However, as you can see over a 30-day window, the p-value actually crosses the significance threshold a number of times, all due to random chance.
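This effect is easy to reproduce. The sketch below is not the speaker's actual simulation; it is a minimal R version of the same idea, with an assumed daily traffic of 500 samples per cohort and an assumed true conversion rate of 7% for both cohorts.

```r
# Simulate an A/A test: two cohorts drawn from the SAME conversion rate,
# with the p-value recomputed each day as samples accumulate.
set.seed(42)
days    <- 30
daily_n <- 500     # assumed daily traffic per cohort
p_true  <- 0.07    # same true rate in both cohorts (no real difference)

a <- rbinom(days, daily_n, p_true)
b <- rbinom(days, daily_n, p_true)

p_values <- sapply(seq_len(days), function(d) {
  prop.test(c(sum(a[1:d]), sum(b[1:d])),
            c(d * daily_n, d * daily_n))$p.value
})

# Even with identical cohorts, the running p-value can dip below 0.05
# on some days purely by chance -- the noise an early call would read.
p_values < 0.05
```

Plotting `p_values` against day typically shows the curve wandering across the 0.05 line, which is exactly the pattern in the chart above.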
If you were to call your test early, you could be reading this same noise. And instead of making an informed, data-backed business decision, you could simply be rendering your business decision no better than if you had flipped a coin.
Taking big swings: lessons from a first experiment
The first big experiment I ever ran, I fell prey to this exact fallacy. We were testing a new version of a recommendation model, and we wanted to test our recommendation model out in the wild against an older iteration. This was the perfect candidate for an A-B test. However, to move our baseline metric by 3%, we needed almost a million samples, and the test would take us three months to run. Guess what? No surprise, we cheated.
For the recommendation experiment, we used three-day connection rate as a baseline metric. This suffered from being a bottom-of-the-funnel metric, because not only did the users have to see the recommendation and click on it, but then they had to connect and also hold on to that connection for at least three days for us to register that as a win. That led to our metric being only 2.88%, which is fairly low. To make matters worse, we were taking only a tiny swing, only aiming to move our baseline metric by 3%. With such a low minimum detectable effect of 3%, we needed 12 weeks or three months to be able to fully bake our test.
If you think about it, we were looking to move a baseline metric of 2.88% by only 3%, which would have led to an absolute lift of less than a tenth of a percent. If we had taken a bigger swing of 10%, 7.5%, or even just 5% as our minimum detectable effect, we could have reduced our test time from the 12 weeks we needed to just two weeks.
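The sample size math behind this can be checked directly with base R's `power.prop.test`. This is a sketch of the calculation, not the speaker's exact one; it assumes the 2.88% baseline, 80% power, and a 5% significance level, and shows how sample size collapses as the relative MDE grows.

```r
# Sample size per cohort for a 2.88% baseline at several relative MDEs,
# at 80% power and a 5% significance level.
baseline <- 0.0288
mdes     <- c(0.03, 0.05, 0.075, 0.10)   # 3%, 5%, 7.5%, 10% relative lifts

n_per_mde <- sapply(mdes, function(mde) {
  ceiling(power.prop.test(p1 = baseline,
                          p2 = baseline * (1 + mde),
                          sig.level = 0.05,
                          power = 0.80)$n)
})
names(n_per_mde) <- paste0(mdes * 100, "%")
n_per_mde   # sample size shrinks roughly with 1 / MDE^2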
Applying big swings to a second experiment
A year later, we ran a second experiment where we wanted to learn from the missteps of our first. In this second experiment, we wanted to test automating sales tax: we wanted to show users the sales tax rate specific to their region in their first-time-use experience. This time, we wanted to apply the principle of taking big swings to our test. We used a metric much closer to the user action, engagement clicks, with a relatively high baseline of about 7%. More importantly, we declared that we were going to take a big swing and said our minimum detectable effect had to be 100%, meaning that our baseline conversion rate of 7% had to double to 14%. This time, our projected timeline came down to only two weeks.
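Running the same `power.prop.test` calculation on these inputs (a sketch, assuming exactly a 7% baseline doubling to 14%) lands in the same ballpark as the figures quoted later in the talk:

```r
# Sample size for the sales tax test: 7% baseline, 100% relative MDE
# (doubling to 14%), 80% power, 5% significance level.
plan <- power.prop.test(p1 = 0.07, p2 = 0.14,
                        sig.level = 0.05, power = 0.80)
n_per_cohort <- ceiling(plan$n)   # roughly 300 samples per cohort
n_per_cohort
```

Around 300 samples per cohort versus the near-million for the recommendation test — the big swing alone is what makes a two-week timeline possible.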
The key idea, taking big swings to learn and fail quickly, is one of the biggest levers you can pull to speed up your experiments. But we wanted to do even better. On top of our experimental philosophy of taking big swings, we were able to layer on an A-B testing enhancement, sequential testing, which allowed us to take our two-week test time and actually cut it in half to just one week.
What is sequential testing?
But what is sequential testing? Sequential testing is kind of like doing your taxes. It allows you to not overspend and waste time waiting for results when you could have already called it. But it also ensures that you don't underspend time and call a test that's not statistically rigorous. Sequential testing allows you to spend exactly the right amount of time, such that when you call the test, you neither overspent nor underspent time to be able to call a winner.
The way that it does that is by building in a checkpoint, or multiple checkpoints, where you can check the results of your test and peek earlier than a standard fixed-horizon test. Sequential testing is also ideal for software situations, because software gives us the unique ability to monitor results in real time, which means that at any moment we peek, we can go ahead and analyze the results in that moment.
Additionally, our experimental philosophy of taking big swings really lends itself to sequential testing, because sequential testing has the unique ability to call winners early. For our sales tax test, we applied two checkpoints: one at the 50% mark, where half of the total samples had come in, and one at the 100% mark, where the full sample set had come in at approximately two weeks. At each checkpoint, you can see that we have a z-score acceptance criterion necessary to be able to successfully determine a winner. The acceptance criteria are much more stringent at the earlier checkpoints to account for the smaller sample size you're using to determine whether there's a winner.
Note that the z-score acceptance criterion is greater than the 1.96 critical value we normally see during fixed-horizon testing. It's there because we're paying a small price to be able to peek earlier. Namely, the price is that our experiment is slightly less powerful in a statistical sense, so we may miss the detection of a true effect or win. In this case, we're using something called the O'Brien-Fleming boundary, which we've chosen because it minimizes that payment at the end of the test: we get as close to the 1.96 critical value as we can with that last peek. As an aside, if you want to dig deeper, there are many other boundaries you can use which may suit your needs better.
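For intuition, the O'Brien-Fleming boundaries for equally spaced looks can be approximated as z_k = C · sqrt(K / k). This sketch is not how production software derives the bounds — packages such as gsDesign solve for the constant numerically — and the value 1.977 for K = 2 looks at overall two-sided alpha = 0.05 is an assumed constant taken from standard group-sequential tables.

```r
# Approximate O'Brien-Fleming z-score boundaries for K equally spaced
# looks: z_k = C * sqrt(K / k), with C chosen so the overall two-sided
# type I error stays at 5%.
K <- 2
C <- 1.977                       # assumed constant from standard OBF tables
z_bounds <- C * sqrt(K / seq_len(K))
round(z_bounds, 2)               # stricter bound at the 50% look,
                                 # near-1.96 bound at the final look
```

Note how the final-look boundary sits just above 1.96, which is exactly the "small price" described above.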
Building an RStudio dashboard to share the methodology
I wanted to make building a sequential test plan as simple as possible to abstract away the statistical complexities so that anyone at my company could use it. To do that, I designed and built an RStudio Flex dashboard to make creating a test plan as simple as filling out just a few data points.
Here you can see my flex dashboard for my sales tax test, pre-filled. We included the baseline conversion rate of our primary metric, 7.03%; the power we wanted, 80% here, which is an industry standard; and a statistical significance level of 5%, also an industry standard. And here we included our big-swing requirement of 100%. Finally, we included the optional value of our weekly traffic to help translate our sample size requirement into the amount of time we would need.
Up to this point, these are all values you would include in a normal A-B test plan sample size calculation. The only additional value we needed to include was the number of peeks we wanted to take during our test. Since our normal A-B test already had an estimated time of two weeks, we only wanted to add one checkpoint, to account for normal weekly behavioral fluctuations. Therefore, here we set the number of peeks to just two.
At the top, we have three key numbers available: the number of peeks, or two in this case, that we've set; the total number of samples we would need; and finally, the total amount of time it would take given the number of samples and our average weekly traffic. The latter two numbers are based on a normal A-B testing sample size calculation. Here, our table shows us the breakdown of samples needed, number of days, and the lower and upper z-score thresholds for each checkpoint, given all of our inputted values. Finally, our threshold graph illustrates how our z-score threshold changes with each checkpoint.
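Under the hood, a back-end for a dashboard like this can be sketched in a few lines of base R. This is an illustrative reconstruction, not the dashboard's actual code: the column names, the 1.977 O'Brien-Fleming constant, and the weekly traffic figure of 300 samples per cohort are all assumptions.

```r
# Sketch of a sequential test-plan calculation from the dashboard inputs.
baseline       <- 0.0703   # baseline conversion rate of the primary metric
mde            <- 1.00     # 100% relative minimum detectable effect
power          <- 0.80
alpha          <- 0.05
weekly_traffic <- 300      # assumed weekly samples per cohort
peeks          <- 2

# Total samples per cohort from a standard fixed-horizon calculation.
n_total <- ceiling(power.prop.test(p1 = baseline,
                                   p2 = baseline * (1 + mde),
                                   sig.level = alpha, power = power)$n)

# One row per checkpoint: cumulative samples, days, and z thresholds
# from the approximate O'Brien-Fleming scaling C * sqrt(K / k).
k <- seq_len(peeks)
plan <- data.frame(
  checkpoint = k,
  samples    = ceiling(n_total * k / peeks),
  days       = ceiling(7 * n_total * k / (peeks * weekly_traffic)),
  upper_z    = round(1.977 * sqrt(peeks / k), 2),
  lower_z    = round(-1.977 * sqrt(peeks / k), 2)
)
plan
```

Wiring inputs like these into a flexdashboard (or Shiny) form is what lets a non-data scientist generate the same table without touching the statistics.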
Ultimately, by using this dashboard and sharing out my test plan, I was able to go from an earlier test needing one million samples to one needing only 294 samples, taking half the time of a normal fixed-horizon A-B test. Instead of taking three months to bake, we only needed to wait one week. Instead of needing the full bake time, we had multiple checkpoints to spend exactly the right amount of time to guarantee statistical rigor without wasting any. And finally, most important of all, we removed the temptation to cut corners, and could instead guarantee statistically rigorous results at shorter time points.
Three steps to run better experiments
So how did we get there? As a reminder, there were three simple steps you needed to take to be able to achieve these goals for yourself. One, remember to take big swings. Don't take a minimum detectable effect that's too small. That's a huge lever to be able to make your experiment shorter right off the bat. Two, use sequential testing. Now that you know how, it can be simple to just implement an overlay on top of your own A-B tests. And three, use an R Shiny Flex dashboard or your own tool of choice to be able to demonstrate how to do this to everyone in your organization and share the goodness that is sequential testing so that everyone, even non-data scientists, can run shorter, more statistically rigorous tests.
Now, with just those three tips and tricks in your tool belt, you can handle early significant results, an antsy boss, and business constraints, because you have all the tools you need to run your tests faster and reliably, so that everyone in your organization can feel confident in your results.
