Resources

How Data Scientists Broke A/B Testing (And How We Can Fix It) - posit::conf(2023)

Presented by Carl Vogel

As data scientists, we care about making valid statistical inferences from experiments. And we've adapted well-established and well-understood statistical methods to help us do so in our A/B tests. Our stakeholders, though, care about making good product decisions efficiently. I'll describe how the way we design A/B tests can put these goals in tension and why that often causes misalignment between how A/B tests are intended to be used and how they are actually used. I'll also talk about how I've used R to implement alternative experimental approaches that have helped bridge the gap between data scientists and stakeholders.

Presented at Posit Conference, September 19-20, 2023. Learn more at posit.co/conference.

Talk Track: Bridging the gap between data scientists and decision makers.
Session Code: TALK-1076

Transcript

This transcript was generated automatically and may contain errors.

Hi, sounds like my mic is working, okay, great. So yeah, my name is Carl Vogel, I'm a Principal Data Scientist at a company called BabyList, and I'm here, nominally, to talk about A/B testing.

But I'd like to start with a story. In this story, you are a data scientist, which is hopefully not super hard to imagine, and you work at a company that sells goods or services online. One day a product manager comes up to you asking about a test they want to run for some feature they want to launch. And this is a substantial feature: we're not moving a button on the page somewhere, we're not changing some copy. Designers were involved, engineers were involved; this is part of a broader user experience strategy.

And you get that perennial data question: how much data do I need for this? Or in this case, how long should I be running this experiment? Now, you are a competent and diligent data scientist, hopefully that's also not hard to imagine, and you ask some thoughtful questions: what are you trying to measure? What do success and failure look like for this feature? And, importantly, how big of a conversion lift are we looking for to make this worthwhile?

The product manager struggles with these questions a little, but you get answers, and you run off and do the thing you were trained to do: some sample-size and power calculations. You figure out an expected test length, you get back to them, and you say, well, you're going to need six weeks to test this feature. And they go: that's great, thank you for this, we have appetite for two weeks.
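(For the curious, that six-week answer comes from a power calculation roughly like this one. This is a sketch with invented inputs, not numbers from the talk: a 5% baseline conversion rate, a hoped-for 5% relative lift, and about 45,000 eligible sessions per week.)

```r
# Back-of-the-envelope test length. All inputs here are invented assumptions.
pt <- power.prop.test(p1 = 0.05,            # baseline conversion rate
                      p2 = 0.05 * 1.05,     # 5% relative lift
                      sig.level = 0.05, power = 0.80)
n_per_arm <- ceiling(pt$n)                  # ~122,000 visitors per arm
sessions_per_week <- 45000                  # assumed eligible traffic
ceiling(2 * n_per_arm / sessions_per_week)  # ~6 weeks to fill both arms
```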

And so you take a deep breath, and you wish them well, and you warn them that they may not be able to detect the kind of effects they're interested in in two weeks. They seem oddly okay with that warning: okay, cool. Two weeks pass, you go take a look at the data, and lo and behold, there is a conversion lift for this new feature, and if it were real, it would mean a meaningful amount of money for the company.

But given the test length and sample size, it's not statistically significant. You go back and report this to them, and they respond with three words. Now, you have maybe studied and applied experimental methods for many years, and you have never heard these three words before: launch on neutral. Or maybe they said launch on flat. Either way, it basically means: hey, I heard you, it's not statistically significant, but it's positive, and I'm going to launch the feature anyway.

But you are introspective and curious, and when you are done raging about the lack of respect for Type I error rates or whatever, you start asking stakeholders: hey, what's the deal? Why do we do this? Why do we run an underpowered test and then launch on an insignificant result?

If you're like me, you might get answers like this: you'll hear about faith in a broader strategy, or wanting to launch something and then learn and iterate on it. And this will start to make you think about A/B testing in your organization a little differently, as something with attributes that make it distinct from the raw application of null hypothesis significance testing, right?

Some of these attributes are that the features we test are not just coming off a conveyor belt, randomly drawn from a population of ideas that might make us money or might lose us money, who knows. They are carefully planned. They are roadmapped. They have a lot of path dependencies between each other.

Second, we always struggle to have these conversations about sample size and test design, because you need an effect size as an input, and stakeholders always struggle to give you one. And that's because by the time we've gotten to the test, we have already sunk all the cost of building the feature; deploying it is basically pushing a button at that point. So any lift is good.

And we're talking to them in terms of Type I and Type II error rates, and that just doesn't correspond to how these decision makers think about the risk in this decision. They basically want to make more money than they lose on average, and never lose too much money at a time. It's really hard to map that onto false positive and false negative rates conditioned on a null hypothesis, right?

Essentially, they are asking us: how do I make a good decision given the effect size I see? And we are handing them tools that say: well, here are some statistical guarantees on an inference you might want to make. So there's a bit of a mismatch, and they end up misusing the tools or ignoring the tools, and that is what this talk is about.

When we see this happening, the instinct is to correct their use of the tool, but I want to argue that in the A/B testing context, we should think about handing them slightly different tools.

And so for the rest of my time, I'm going to talk about two approaches to thinking about an A/B test that have helped me have more productive conversations with decision makers.

Non-inferiority test designs

So the first approach is non-inferiority test designs, which are not new and not esoteric, but I think they're slept on in the A/B testing context more than they ought to be. You'll notice the picture here is a guardrail, and that's the metaphor to keep in mind. The main idea is that instead of testing whether the new version of the site is better than the current one, we just test whether it's not worse by some margin. That margin is the delta in the red box there, and we call it the inferiority margin.
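To make that concrete, here's a minimal sketch of a two-proportion non-inferiority z-test in R. The helper function and the margin value are illustrative assumptions, not something from the talk:

```r
# One-sided non-inferiority z-test for conversion rates. The margin is the
# largest drop we're willing to tolerate (an invented value below).
noninferiority_test <- function(x_new, n_new, x_old, n_old, margin) {
  p_new <- x_new / n_new
  p_old <- x_old / n_old
  se <- sqrt(p_new * (1 - p_new) / n_new + p_old * (1 - p_old) / n_old)
  # H0: p_new - p_old <= -margin  (new version is unacceptably worse)
  # H1: p_new - p_old >  -margin  (new version is not worse by the margin)
  z <- (p_new - p_old + margin) / se
  list(diff = p_new - p_old, z = z, p_value = pnorm(z, lower.tail = FALSE))
}

# Example: 4,950 vs. 5,000 conversions out of 100,000 sessions per arm,
# with a margin of half a percentage point of absolute conversion.
noninferiority_test(4950, 1e5, 5000, 1e5, margin = 0.005)
```

Rejecting H0 here says we can rule out a drop bigger than the margin, even though we haven't shown the new version is strictly better.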

When you have a conversation about what these margins ought to be, you are forcing conversations about exactly the things that motivate these launch-on-neutral decisions: how much do you want to risk to launch this thing? How much do you believe in it? How quickly are you going to iterate on it and improve it after it's launched? And stakeholders can start to give you meaningful answers to these questions, instead of coming up with a fake effect size that they want to find.

And you can start to power against this "any positive effect is good" scenario. You can have a conversation like: look, if you run a test for three weeks and you want good power against any feature that isn't losing us money, then you may have to accept some small risk that it actually loses us money, say a 1.5% conversion drop. Maybe that's an acceptable risk, maybe it's not. But it's an assessment of risk that they can usually reason about a little better, and you end up with a more productive conversation.
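Powering against that scenario is straightforward, too. Here's a rough per-arm sample-size sketch, assuming the true lift is zero and we just want to rule out drops bigger than the margin; the inputs are again invented:

```r
# Rough per-arm sample size for a non-inferiority test, assuming the true
# lift is zero and conversion sits near a baseline p. Invented inputs.
n_noninferior <- function(p = 0.05, margin = 0.005,
                          alpha = 0.05, power = 0.80) {
  z <- qnorm(1 - alpha) + qnorm(power)
  ceiling(2 * p * (1 - p) * (z / margin)^2)
}
n_noninferior()  # visitors per arm needed under these assumptions
```

Under these made-up inputs that's around 23,500 visitors per arm, far less than the superiority test needs to detect a small lift, which is exactly why the margin conversation can buy you shorter tests.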

Value of information approach

But there's another method I like that I want to talk to you about, and this one directly attacks the core problem, the core question in these test-design conversations. And that is: what's the hurry? Why do we not have the patience to run an adequately powered A/B test in this organization?

And we sort of know the answer: running a test for a long time is costly. There's an opportunity cost of time. This gets back to the nature of roadmapping and the path dependence among features. If we're waiting a long time on a test to learn whether feature 1 won, that holds up feature 2, which depends on whether feature 1 launches and how it performs.

There's also the opportunity cost of sampling and randomization in tests: by construction of an A/B test, a bunch of users are not getting the best version of your site. If you've ever worked with bandits, you've seen them trying to attack exactly this problem. And then lastly, there's the day-to-day maintenance cost of tests: having a bunch of tests running on the site at once means engineering effort, code complexity, data storage, and so on.

So the question is: if we know about all these costs, and we know they affect how decision makers want to run tests, why aren't we incorporating them into test designs? This is where value-of-information experimental designs can help.

So we know the time we spend running a test longer is costly. We know the extra data we get from running a test longer is valuable. If we can quantify the cost and we can quantify the value, that should be telling us how long we should be running a test. If the value of more data exceeds the cost of more data, you should keep getting data. And if the cost of more data exceeds the value of more data, then you should stop getting data.

The longer our tests run, the more data we get, but there are diminishing returns: extra data is worth less when you already have a lot than when you have a little. Meanwhile, the costs keep accruing as you wait for that data. As long as it's more valuable to get the data than it is costly, you should get the data.

How do we think about the value of data, though? Well, before we run an experiment, before we have any data, we know very little about what the conversion lift of a new feature might be. It could be very negative, it could be very positive. If we make a decision based on our best information now, we could end up launching an awful feature or failing to launch a really good one. As we collect data, we get a better idea of what that conversion lift might be; the range of values it might take narrows. We may still make an incorrect guess, but our guess is likely to be wrong by less.

And it turns out that you can put a value on being probably less wrong. And again, if that value exceeds the cost of the time it takes to get that data, you should be getting it.

So how do we actually compute that value, the value of being potentially less wrong? It turns out it has a name: the expected value of sample information. And I'm going to show you a really simplified way you might estimate it.

We start with a prior over what we think the conversion lift might be: a relatively wide range, because we don't know very much. It could be negative, it could be pretty positive. We are going to draw a bunch of potential lifts out of that prior.

For each of those lifts we draw, we simulate an experiment. Say we're interested in what happens if we get two more weeks of data: we simulate a two-week experiment, control and treatment, with the lift we drew from the prior. That simulated data, combined with the prior, generates a posterior, a new opinion about what the lift might be.

Each of those posteriors may not change my mind at all relative to what I was going to do under the prior, or it may change my mind a lot. If an experiment is likely to generate data that changes my mind a lot, that was a valuable experiment to run. If the data never had any hope of changing my mind, there was no point in doing it.

And so we run all these simulations, we get all the posteriors, we compute the values, and we average them out. That's an estimate of the expected value of getting this extra data. Even better, this is an inherently sequential process, it's just posterior updating, so you can do it over and over again. After you get some data, you're really just asking: what's the value of some more data?
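As a concrete sketch of that whole loop, here's a simplified simulation-based EVSI estimate in R. Every number here (the prior, the traffic, the dollar scaling, the cost of waiting) is an invented assumption, and the posterior update uses a normal-normal approximation for simplicity rather than anything specific from the talk:

```r
set.seed(1)
n_sims <- 5000
n_arm  <- 20000        # assumed extra visitors per arm over two more weeks
p_ctrl <- 0.05         # assumed baseline conversion rate
usd_per_lift <- 2e6    # assumed $ value of one unit of absolute lift
                       # (so a +0.01, i.e. 1pp, lift is worth $20,000)

# Normal prior over the absolute conversion lift: wide, can be negative.
mu0 <- 0; sd0 <- 0.01

# Value of the best decision with no more data: launch iff E[lift] > 0.
v_prior <- max(0, mu0) * usd_per_lift

v_post <- replicate(n_sims, {
  lift <- rnorm(1, mu0, sd0)                   # draw a "true" lift
  x_c  <- rbinom(1, n_arm, p_ctrl)             # simulate the control arm
  x_t  <- rbinom(1, n_arm, min(max(p_ctrl + lift, 0), 1))  # treatment arm
  obs  <- x_t / n_arm - x_c / n_arm            # observed lift
  se2  <- 2 * p_ctrl * (1 - p_ctrl) / n_arm    # approx. sampling variance
  # Conjugate normal-normal update: prior + data -> posterior mean lift.
  mu1 <- (mu0 / sd0^2 + obs / se2) / (1 / sd0^2 + 1 / se2)
  max(0, mu1) * usd_per_lift                   # value of post-data decision
})

evsi <- mean(v_post) - v_prior                 # value of two more weeks
cost <- 5000                                   # assumed cost of waiting
if (evsi > cost) "keep collecting data" else "stop and decide now"
```

The `max(0, ...)` terms are the decision itself: launch if the expected lift is positive, do nothing otherwise. The EVSI is how much the simulated data improves that decision on average, and the last line is the stop-or-continue rule described above.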

And this changes the core decision in an A/B test from "is B better than A?", which is a genuinely hard question, to "should I stop getting data, or should I keep getting data?" That's a good fit for A/B tests, because we don't have to recruit subjects for an A/B test. We just have to wait.

And then, once more data isn't worth it, you just launch the best observed variant. The inference problem, the statistical-significance question, is irrelevant at that point: this is the best information we have, and it's not worth getting more.

And I find this a really compelling way to think about A/B tests with decision makers. It directly gets at the core concepts they think about when they want to make a decision: cost, benefit, time, risk. Everything is in dollars. The outputs are in dollars, not error rates.

Now, it's more complicated than traditional testing, but it's tractable for a pretty broad range of the kinds of A/B tests I've run in my experience. I've built whole analytics engines on it with R and Shiny, and the product managers I've worked with have found it gels really well with how they want to make decisions. It kind of liberates them: oh, I can figure out how a test should work from dollar outputs.

Lessons and takeaways

So those are the two methods. And this is the part of the talk where I'll reveal that I haven't quite paid off the clickbait title. But hopefully there are some useful lessons.

So the first one is: I'm not trying to sell you on these two specific methods. I don't think there's a one-size-fits-all approach to A/B testing. Your organization makes decisions in its own way, and you're going to need to figure out what kinds of measurements you need to support those decisions. These methods have pros and cons. There's no silver bullet.

But when you observe stakeholders misusing the tools you have provided them for analysis, it should really cause you to rethink: what is this tool I've handed them? Does it align with how they make decisions? Does it align with their concerns about risk and cost and time and value and all that important stuff? Am I giving them outputs that map to how they think about the problem?

And when I go back and rethink the tools I'm handing them, what I really want to get at is: am I solving the core problem, or am I just treating the symptoms of the misuse I'm observing? Launching on neutral, stopping a test as soon as it's significant, peeking: all of this is a symptom of the problem that the A/B test frameworks we often work with don't deal with the cost of time.

And there are lots of advanced techniques out there, like covariate adjustment and sequential p-values, that will help a test go faster. They're great and you should use them when you can. But they don't answer the question of why the test needs to go so fast, and so they're really just treating the symptom of impatience.

And this isn't just about A/B testing, right? Data scientists sometimes love a tool and apply it not especially discriminately to problems, so we end up with lots of places where the tools we've handed stakeholders aren't quite the right fit for how they think. A/B testing is a really interesting case because it feels like a solved statistical problem, like it should be really straightforward, and then you go try to use it in practice and it gets messy really fast.

But this, I think, is the cool stuff that we get to do. This is a vaguely weird time for data scientists: it feels like a lot of the problems we used to work on are getting automated or outsourced or standardized. But these kinds of misalignments between decision making in an organization and the data science tools used to support those decisions happen all over the place, all the time.

Identifying those misalignments and addressing them, by going back to the first principles of the problem, translating the decision-making problem into quantitative methods, and quantifying its core concepts, is where we can add value. And don't let SaaS vendors and ChatGPT convince you that these are all solved problems and there's nothing left to do. I think there's a lot of work like this still out there, and I think that's what we're here for.

And that's all I have. Thanks for coming, everybody. I hope you enjoy the rest of the conference.

Q&A

Is there a risk of compounding poorly tested changes into real deterioration of the product? So, I've got a yes, but can you talk about that a little bit?

Yeah, this is asking about the non-inferiority stuff, right? If you're willing to accept a tiny loss on each test, it starts to add up. Yes, that absolutely can happen. The way I think about this is to have a kind of aggregate inferiority-margin budget over a bunch of tests: you can put a margin on this one and this one and this one, but there's a total loss we can accept over a long sequence of tests, or for a year, or over some unit of time. So you have to budget out that risk. You should have a risk budget for all these decisions; you don't want to think about them in isolation.
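(A toy version of that budgeting idea, with invented numbers: if you'd accept at most a two-percentage-point total conversion loss across a year of testing, and you plan eight tests, each one gets a quarter-point margin.)

```r
# Toy aggregate inferiority-margin budget; every number here is invented.
annual_loss_budget <- 0.02   # total absolute conversion loss we'd accept
tests_this_year    <- 8
annual_loss_budget / tests_this_year  # 0.0025 margin available per test
```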