Resources

Emily Robinson | Building an A/B testing analytics system with R and Shiny | RStudio (2019)

Online experimentation, or A/B testing, is the gold standard for measuring the effectiveness of changes to a website. While A/B testing is used at tens of thousands of companies, it can seem difficult to get started with without resorting to expensive end-to-end commercial options. Using DataCamp's system as an example, I'll illustrate how R is actually a great language for building powerful analytical and visualization A/B testing tools. We'll first dive into our open-source funneljoin package, which allows you to quickly analyze sequential actions using different types of behavioral funnels. We'll then cover the importance of setting up health checks for every experiment. Finally, we'll see how Shiny dashboards can help people monitor and quickly analyze multiple A/B tests each week.

Materials: http://bit.ly/rstudio19

About the Author

Emily Robinson: I work at DataCamp as a Data Scientist on the growth team. Previously, I was a Data Analyst at Etsy, working with their search team to design, implement, and analyze experiments on the ranking algorithm, UI changes, and new features. In summer 2016, I completed Metis's three-month, full-time Data Science Bootcamp, where I did several data science projects, ranging from using random forests to predict successful projects on DonorsChoose.org to building an application in R Shiny that helps data science freelancers find their best-fit jobs. Before Metis, I graduated from INSEAD with a Master's degree in Management (specialization in Organizational Behavior). I also earned my bachelor's degree from Rice University in Decision Sciences, an interdisciplinary major I designed that focused on understanding how people behave and make decisions.


Transcript

This transcript was generated automatically and may contain errors.

Hi everyone, thank you so much for joining me today. I know it's hard choosing between talks; at least I've had a really hard time. I actually had circled Solving Puzzles with R before I realized I should probably show up for my own talk to give it. So thank you all so much for coming.

My name is Emily Robinson, and I'll be talking about building an A/B testing analytics system with R and Shiny. All my slides are available at the link up here, bit.ly/rstudio19. And if you're a tweeter, you can follow me on Twitter at @robinson_es. I usually live tweet a lot of talks, although I will not be live tweeting my own, sadly.

All right, so a little bit about me. This is my part-time dog, Abby. I call her my part-time dog because I steal her from my parents occasionally. I'm a data scientist at DataCamp, where I work on the growth team on experiments; that's a lot of what you'll be hearing about today. I've been an R user for about seven years. I first learned R as an undergraduate at Rice University. I walked into class one day and had this really good statistics professor, and I texted my brother, who was studying computational biology: oh, I have this great professor, his name is Hadley Wickham. And my brother is David Robinson, who will be giving the keynote, and he went, you're being taught by Hadley Wickham! Back then Hadley was not as well-known as he is now, so I was really lucky to get to learn R first from him.

And some things I enjoy talking about include building and finding data science community. One of the reasons I'm so excited to be at RStudio Conference, this is my third time here, is that the R community is just so friendly. I'm sure many of you already know this. You saw how Hadley started off his talk today, asking people to please be extra welcoming and follow the Pac-Man rule. So I just really love it. I also care about diversity in STEM: I'm part of the R-Ladies group. If you were at the opening reception, you might have seen their poster. They're a group promoting gender diversity in R, with chapters all around the world. I really suggest checking them out.

What is A/B testing?

So just so this talk will make a little more sense: if you're not familiar with DataCamp, it's an online data science learning platform. There are a little over 200 courses now in R, Python, Shell, SQL, all the big languages. This is one of our R courses; it's a way to learn interactively in the browser. So you have short video lessons, and then you code.

So let's say we have a user, Linda, and Linda logs on to DataCamp one day, and she's interested in learning deep learning. So Linda could maybe see this page; that's one of our course pages. There's about a 50% chance of this if we're running an A/B test on it. But she also could have seen this page. You might notice there are a couple of differences here, right? The color, but a big one is that registration form that's now on the page. So what A/B testing is is randomly assigning visitors to your website to one of two or more experiences. Sometimes it involves changes on the page, like here. Sometimes it's things like changes to a search algorithm or making the page faster. But the idea is that by randomly assigning people and running these conditions at the same time, you're controlling for all the other factors. So this lets you know that if you see a difference between the control and the treatment group and you've run everything properly, any behavioral difference should be because of your change. And then you can be really confident it's had the impact you hoped it would.

Experimentation at DataCamp

Before DataCamp, I was a data analyst at Etsy; that's where I got started on experimentation. I worked on over 60 experiments with our search team. Etsy had been doing experimentation for over eight years at that point and was running over 500 experiments per year. To support all of this, Etsy had a big team of data engineers: more than five of them working full time on the experimentation data platform, which Etsy had custom built. Over 1,000 metrics were computed for each experiment. There was a really nice UI that visualized all of this data, so we could see things like confidence intervals and power calculations for detecting a 1% change, and you could even customize that.

And I came to DataCamp in April, and I knew that part of my job was going to be helping start experimentation at DataCamp, but I still felt a little bit like this in those first couple of weeks. Now, to be very clear, especially for everyone at DataCamp who might be watching, I did not lie on my resume, but here it was very different. There was no system for planning, analyzing, or presenting experiment results. I think they'd only run about one or two tests before I joined. And there were no data engineers to build it. Again, that's partly why I was brought in as a data scientist, but this part would be a bit new.

And I want to share four lessons I learned along the way.

Lesson 1: Build tools to save yourself time

The first is: build tools to save yourself time. Who here has ever had a "first this, then that" question? Something like: who tried X, then did Y? What percent of people who did X then did Y? What was the last thing people did before doing Y? Or what were all the things people did after doing X? So they can be structured differently; maybe you're interested in all the behavior afterwards, or just the first time. But really, you have two sets of behaviors, and you want to follow one user and understand what happened first this, then that.

And these are the types of questions I would ask in A/B testing. Say, what percentage of the people in the treatment versus control registered? And this would be registering after starting the experiment, so I'd have experiment starts and registrations. What were the ad clicks that had a course start within two days? My first attempt at doing this involved a lot of lengthy, repetitive code. There was a ton of copying and pasting, and it was very hard to switch between different types of funnels. By which I mean switching between something like: I want the first time they did this, and then the first time they did that; versus, I want the first time they did this, and everything they did after that. It wasn't very easy to go between those types of funnels.

And when you're doing repetitive tasks, as a lot of you here may know, it's time to write a function. And in my case, a package. Unfortunately, I faced a bit of an issue, which was: this was me and writing packages. I sort of knew it was something I should do. I had even written a blog post a couple of years ago called "Writing My First R Package, Part One," where I talk about how I'm going to write this R package. And there was never a part two. There actually still is not.

So I wasn't really sure what to do. But fortunately, I had David Robinson at DataCamp. Dave is our chief data scientist at DataCamp and, as I said, also my brother, and he is very experienced in writing packages. So one day he set up a hackathon for him, me, and one of the other growth team members at DataCamp to write the package funneljoin. If you'd like to try it out yourself, it's available on GitHub. And funneljoin's goal is to make it easy to analyze behavioral funnels.

So I'm going to walk through how funneljoin works. The structure is: you start with two tables. The first table is your first set of behaviors, and the second table is your second set, the one that you want to happen afterward. You tell it what the user columns are, because you want to be able to connect the user across tables. You tell it what the time columns are; the time in the table on the left-hand side always has to be less than the time on the right-hand side, because it has to happen before. You give it the type of after join, which I'm going to talk about a little more. And finally, the kind of join: inner, left, right, all the dplyr joins.

So one example is a "first-any" question: what are all the courses people started after visiting the home page for the first time? Here I would write my funneljoin: an after inner join, specifying "first-any" in the type argument, and then my time columns and my user columns. And I get out this table, and you'll see, just as we would hope, the course start time is always greater than the viewed-at time. So now I could say, oh, person three viewed the home page on June 6, and we have two entries for them, even for the same home page view, because they started two different courses. And that's because I did a first-any join.

But if I wanted to switch it up and ask what percent of people saw the pricing page for the first time and then subscribed, I just change the type of funnel. So it's a very similar structure: change the type to "first-firstafter". And now, because it's a left join, we also have NAs in the subscribed-at column. What if I want to say, well, that's great, but I really want a tight funnel; I want it to happen within four days? I can add a max_gap argument, giving it a difftime. And now if I run the same code, we see a few more NAs in the subscribed-at column, because even though those people subscribed, they didn't do it within four days.
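The two funnels above can be sketched in a few lines. This is a minimal example, not DataCamp's actual data: the tables, column names, and dates are invented, and the calls follow the funneljoin API as described in the talk (after_inner_join / after_left_join with by_user, by_time, type, and max_gap).

```r
library(dplyr)
library(funneljoin)

# Toy event tables; user IDs and dates are made up for illustration.
page_views <- tibble::tribble(
  ~user_id, ~timestamp,
  1, "2018-06-01",
  2, "2018-06-02",
  3, "2018-06-06"
) %>% mutate(timestamp = as.Date(timestamp))

course_starts <- tibble::tribble(
  ~user_id, ~timestamp,
  1, "2018-06-03",
  3, "2018-06-07",
  3, "2018-06-10"
) %>% mutate(timestamp = as.Date(timestamp))

# "first-any": first page view, then every course start after it.
# Person 3 appears twice because they started two courses after viewing.
first_any <- page_views %>%
  after_inner_join(course_starts,
                   by_user = "user_id",
                   by_time = "timestamp",
                   type = "first-any")

# "first-firstafter" with a left join keeps users who never converted
# (NA in the right-hand time column); max_gap tightens the funnel so
# only conversions within four days count.
first_firstafter <- page_views %>%
  after_left_join(course_starts,
                  by_user = "user_id",
                  by_time = "timestamp",
                  type = "first-firstafter",
                  max_gap = as.difftime(4, units = "days"))

first_any
first_firstafter
```

With same-named time columns, the output distinguishes them with the usual dplyr suffixes (timestamp.x for the left table, timestamp.y for the right), and timestamp.y is always later than timestamp.x, as in the slide.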

As I said, there are many funnel types supported: lastbefore-firstafter, any-any, first-any, and more, with all types of dplyr joins. It works on remote tables, because it uses dplyr and so can take advantage of dbplyr. And bug fixes, pull requests, and feature requests are welcome. Please try it yourself and let me know how it goes.

Lesson 2: Everything that can go wrong will go wrong

So my second lesson is that everything that can go wrong will go wrong. A non-exhaustive list of things that have happened: People were put in both control and treatment; that should not happen, you should get one experience. People in the experiment had no page views; they were mysterious users with only back-end events. People had multiple experiment starts in the same group. There weren't the same number of people in the control versus the treatment. Experiment starts didn't have cookies, so we couldn't even track the user or link that start to any other event.

We realized we really needed to check our assumptions. So we wrote up a doc of things we needed to check, because, again, these were all things we didn't initially think to check; if the system is working correctly, they shouldn't happen. When we started getting weird results, we dug in and realized, oh, all these things are going wrong.
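A few of these checks are straightforward to express with dplyr. This is a sketch with an invented assignments table, not DataCamp's actual schema; the column names are hypothetical.

```r
library(dplyr)

# Hypothetical experiment-start log: one row per experiment start event.
# User 2 is in both variants; user 4 has two starts in the same group.
assignments <- tibble::tibble(
  user_id = c(1, 2, 2, 3, 4, 4),
  variant = c("control", "control", "treatment",
              "treatment", "control", "control")
)

# Check 1: no user should ever appear in more than one variant.
in_both <- assignments %>%
  distinct(user_id, variant) %>%
  count(user_id) %>%
  filter(n > 1)

# Check 2: repeated experiment starts in the same group.
repeat_starts <- assignments %>%
  count(user_id, variant) %>%
  filter(n > 1)

# Check 3: the control/treatment split should be roughly 50/50;
# a binomial test flags a sample-ratio mismatch.
per_variant <- assignments %>%
  distinct(user_id, variant) %>%
  count(variant)
split_test <- binom.test(per_variant$n[1], sum(per_variant$n), p = 0.5)

in_both        # flags user 2
repeat_starts  # flags user 4's duplicate control starts
```

Queries like these are exactly what later became the automated, color-coded health checks in the dashboard.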

My initial solution was, again, lengthy, repetitive code that I'd copy between experiment files: OK, this check, all these things. But as a famous data scientist once said: when you've written the same code three times, write a function; when you've given the same in-person advice three times, write a blog post. And I'll add to that: when you've run the same process three times, make a dashboard.


Lesson 3: Build tools that empower others

So I could have come up with a solution here like making another package; that's what we saw for the first problem, where a package was the solution to all the copy-pasting. But in this case, a dashboard is really what's going to be helpful. And what it also did was build a tool that empowers others. We've heard a lot today about how useful Shiny can be. If you saw the presentation at 11 a.m. this morning from Jacqueline and Heather Nolis on their work at T-Mobile, they talked about how putting a Shiny app in front of the stakeholders, so they could play with the machine learning model themselves, is what got them the resources to build out a whole team.

So I built a Shiny app for showing our experiment data. I started with the health metrics. Now, for every experiment, these are automatically calculated, and we can know if something's going wrong. It's color-coded: we see that duplicate cookies and cookies in multiple variations look good, they're green. Cookies with no page views is not looking as good; it's yellow and red. And that can suggest what we might need to look into.

We have a by-metric view. So if you want to look at all the experiments that are running, or even all experiments ever, and ask how all of those affected chapter starts, we can see the control versus the treatment and the percent change. Again, there's this color coding, which is a small thing, but it's nice because at a glance I know what to take from this: three experiments didn't have an impact, and the fourth one had a negative impact. And that will only show in red when the p-value is less than 0.05. We also have nice confidence intervals, for some more visualization and help understanding these.
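Each cell in a view like this boils down to a two-proportion comparison, which base R handles directly. A sketch with invented counts, using prop.test, which returns both the p-value that drives the red/green coloring and a confidence interval for the difference in rates:

```r
# Invented numbers: 480 of 5,000 control users and 560 of 5,000
# treatment users started a chapter.
control_starts   <- 480
treatment_starts <- 560
n_per_group      <- 5000

res <- prop.test(x = c(control_starts, treatment_starts),
                 n = c(n_per_group, n_per_group))

# Relative percent change, as shown in the by-metric view.
pct_change <- (treatment_starts - control_starts) / control_starts

significant <- res$p.value < 0.05  # would the cell be colored?
res$conf.int                       # CI for the difference in rates
```

With these made-up counts the lift is about 17% and the difference is statistically significant, so this row would show in color on the dashboard.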

We have an individual experiment view, which shows the description and the hypothesis. As the experiment metrics grew, as we started tracking more, we went from about three metrics to 11, and it became a bit cumbersome to look just by metric when you're only interested in one experiment. So all the information was already there; the individual experiment view just makes it easier for people to see at a glance.

And to level up a little bit: I would get a common request of, what percent increase can we detect in a two-week test? And this is a really important question. Because otherwise you may end up running tests where, say you have a great idea: I want to know whether we're going to increase registrations by changing the color of a button. But if you run a power calculation, it might tell you: all right, in two weeks, with the number of people who are visiting and the number of people who, say, register already, you would need a 50% increase in registrations to detect a change. So that probably means you don't want to run that test. And I was really glad people were asking this question; I encouraged them. But could I make a tool so that people could answer this themselves, without code?

So I wanted to go from delivering information to helping people discover information. And in the famous words of Barney Stinson: challenge accepted. We got the impact calculator. Now anyone at DataCamp can enter URLs here and say they're interested in running an experiment on our homepage. They can say, all right, maybe I'm only interested in signed-out users, maybe only mobile. And this table you see will recalculate and say: first, this is how many unique users there are over a time period, and this is what percent of them start a chapter, register, et cetera. So this is really great, because sometimes even just from this you'll go, oh wow, I didn't realize these four URLs get so little traffic; it probably doesn't make sense to run an experiment. But the second part is the power calculation table underneath it, which tells you the answer to this question: in, say, 18 days, you'll be able to detect an increase of 4% for chapter starts, 6.5% for registrations. It's always a bit of a guessing game what impact your test will have, but usually you'll have some sort of idea of what's reasonable. And so they could look at this and say, all right, maybe if I include both desktop and mobile, is that going to get me down to 3%? What if I make it 25 days instead? And they could iterate really quickly without ever having to come to me.
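The kind of number that power calculation table reports can be sketched with base R's power.prop.test. The traffic figures and the 10% baseline rate below are invented for illustration; the idea is to solve for the smallest treatment rate distinguishable from baseline given the sample size.

```r
# Suppose (made-up numbers) an 18-day test sends 20,000 unique users
# to each variant, and the baseline chapter-start rate is 10%.
baseline    <- 0.10
n_per_group <- 20000

# Leaving p2 unspecified makes power.prop.test solve for it: the
# smallest treatment rate detectable at 80% power and alpha = 0.05.
res <- power.prop.test(n = n_per_group, p1 = baseline,
                       power = 0.8, sig.level = 0.05)

# Expressed as the relative increase a dashboard table would show.
detectable_lift <- (res$p2 - baseline) / baseline
round(100 * detectable_lift, 1)  # a relative lift of a few percent
```

Flipping the inputs around, say more days of traffic and therefore a larger n, shrinks the detectable lift, which is exactly the iteration the impact calculator lets stakeholders do without code.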

Lesson 4: Make it easy to do the right thing

And my final lesson was to make it easy to do the right thing. One best practice for A/B tests is to have a single key metric per experiment. It helps clarify decision making, because if you have multiple metrics and one goes up and one goes down, you could be left not really knowing what to do in that situation. It also allows you to run that power analysis: for a power analysis, you need to have the metric and know its rate, and that can tell you how long to run the experiment for.

So that's great, and I really am doing a lot to educate people at DataCamp about A/B testing best practices. But even more than that, we build it into the process and make it easy for people. This is our Airtable, where we put the information about our experiments, and you can see at the top here there's a success metric. Each experiment is required to have one, and only one. We can note other things we care about, but this is the key metric, and that forces people to make a decision. We can match it to something on the dashboard, and it can help make decision making at the end a lot easier.

The second guideline is to run the experiment for the length you've planned on. This is a graph showing that if you simulate thousands of null experiments, check every day whether the p-value is less than .05, and stop your experiment at that point, you'd get an over 20% false positive rate, a huge increase. You really want to do that power analysis, set the length of your experiment, and stick to it. That's why we added a start date and an end date for each experiment to the health checks view. It could be really tempting for a stakeholder to message me: hey, I checked the dashboard on this individual experiment, this metric went up, why don't we just go ahead and launch it? But now they say: okay, I see the end date is not for another couple of days; I know I'm supposed to follow this best practice, so I will.
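That peeking problem is easy to reproduce in a few lines of R: simulate null experiments where both variants convert at the same rate, run a two-proportion z-test after each day, and "stop" at the first p < .05. The traffic numbers below are invented; the inflated false positive rate is the point.

```r
set.seed(42)

n_sims  <- 2000  # null experiments to simulate
days    <- 14    # one peek per day
per_day <- 500   # users per variant per day
rate    <- 0.10  # true conversion rate in BOTH variants (no real effect)

false_positive <- replicate(n_sims, {
  n      <- per_day * seq_len(days)                  # cumulative sample size
  conv_c <- cumsum(rbinom(days, per_day, rate))      # cumulative conversions
  conv_t <- cumsum(rbinom(days, per_day, rate))
  p_c    <- conv_c / n
  p_t    <- conv_t / n
  pooled <- (conv_c + conv_t) / (2 * n)
  z      <- (p_t - p_c) / sqrt(2 * pooled * (1 - pooled) / n)
  # Did the daily peek ever cross p < .05, triggering an early stop?
  any(2 * pnorm(-abs(z)) < 0.05)
})

mean(false_positive)  # far above the nominal 5%
```

Even with only 14 daily peeks, the realized false positive rate lands well above the 5% you thought you were paying, which is why the dashboard's fixed end date matters.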

Conclusion

So in conclusion, to recap: build tools to save yourself time. That's where funneljoin came from. If you're finding yourself doing something over and over again, it's really worthwhile to invest some time in making it easier. The other thing we found is that a lot of analyses we wouldn't even have started before, because they'd be a bit cumbersome, we now do, because we have funneljoin in our back pocket and know we can do them really quickly; it enables more work. Everything that can go wrong will go wrong: you have to check your assumptions and build systems to do that. Build tools that empower others, especially, I found, effective dashboards. And finally, make it easy to do the right thing. It's great to put in work educating others, maybe some rules to follow, but the easier you can make it for them to follow the best practice, the more likely you'll get the result you want.


Many thanks to the growth and data science teams at DataCamp, especially David Robinson and Anthony Baker, who are co-authors of funneljoin, and to the analytics and data engineering team at Etsy, where I was first a data analyst. And with that, thank you very much, and I'm happy to take questions.

Q&A

Q: Thanks for putting this together. Any words of caution before we rely on your package in our own work?

A: Yeah, so it's funny. When I gave this talk for the first time back in October or November, there was a big caution sign that said: still in beta. The only thing I would say is that there are some issues: it hasn't been tested on non-Postgres remote tables. I don't think it would give you the wrong answer so much as not work. There are some other behaviors we know about, like a bug that sometimes occurs when you do multiple after joins remotely, but now we don't enable you to do that; we basically throw an error. So we've really tried, where we can, to throw errors. That being said, we use it very extensively at DataCamp, it's tested with Travis CI, and there's a bunch of tests. So definitely try it out, and if something looks a little wacky, always double check a bit, but we've found it generally works pretty well.