Resources

Emily Robinson | Building an A/B testing analytics system with R and Shiny | RStudio (2019)

Online experimentation, or A/B testing, is the gold standard for measuring the effectiveness of changes to a website. While A/B testing is used at tens of thousands of companies, it can seem difficult to get started with without resorting to expensive end-to-end commercial options. Using DataCamp's system as an example, I'll illustrate how R is actually a great language for building powerful analytical and visualization A/B testing tools. We'll first dive into our open-source funneljoin package, which allows you to quickly analyze sequential actions using different types of behavioral funnels. We'll then cover the importance of setting up health checks for every experiment. Finally, we'll see how Shiny dashboards can help people monitor and quickly analyze multiple A/B tests each week.

Materials: http://bit.ly/rstudio19

About the Author

Emily Robinson: I work at DataCamp as a Data Scientist on the growth team. Previously, I was a Data Analyst at Etsy, working with their search team to design, implement, and analyze experiments on the ranking algorithm, UI changes, and new features. In summer 2016, I completed Metis's three-month, full-time Data Science Bootcamp, where I did several data science projects, ranging from using random forests to predict successful projects on DonorsChoose.org to building an application in R Shiny that helps data science freelancers find their best-fit jobs. Before Metis, I graduated from INSEAD with a Master's degree in Management (specialization in Organizational Behavior). I also earned my bachelor's degree from Rice University in Decision Sciences, an interdisciplinary major I designed that focused on understanding how people behave and make decisions.


Transcript

This transcript was generated automatically and may contain errors.

Hi everyone, thank you so much for joining me today. I know it's hard choosing between talks; at least I've had a really hard time. I actually had circled Solving Puzzles with R before I realized I should probably show up for my own talk to give it. So thank you all so much for coming.

My name is Emily Robinson, and I'll be talking about building an A/B testing analytics system with R and Shiny. All my slides are available at the link up here, bit.ly/rstudio19. And if you're a tweeter, you can follow me on Twitter at @robinson_es. I usually live tweet a lot of talks, although I will not be live tweeting my own, sadly.

All right, so a little bit about me. This is my part-time dog, Abby. I call her my part-time dog because I steal her from my parents occasionally. I'm a data scientist at DataCamp, where I work on the growth team on experiments; that's a lot of what you'll be hearing about today. I've been an R user for about seven years. I first learned R as an undergraduate at Rice University. I walked into class one day and had this really good statistics professor, and I texted my brother, who was studying computational biology: oh, I have this great professor, his name is Hadley Wickham. And my brother is David Robinson, who will be giving the keynote, and he went, you're being taught by Hadley Wickham! Back then Hadley was not as well-known as he is now, so I was really lucky to get to learn R first from him.

And some things I enjoy talking about include building and finding data science community. One of the reasons I'm so excited to be at RStudio Conference, this is my third time here, is that the R community is just so friendly. I'm sure many of you already know this. You saw how Hadley started off his talk today, asking people to please be extra welcoming and follow the Pac-Man rule. So I just really love it. I also care about diversity in STEM: I'm part of the R-Ladies group. If you were at the opening reception, you might have seen their poster. They're a group promoting gender diversity in R, with chapters all around the world. I really suggest checking them out.

What is A/B testing?

So just so this talk will make a little more sense: if you're not familiar with DataCamp, it's an online data science learning platform. There are a little over 200 courses now in R, Python, Shell, SQL, all the big languages. This is one of our R courses; it's a way to learn interactively in the browser. So you have short video lessons, and then you code.

So let's say we have a user, Linda, and Linda logs on to DataCamp one day, and she's interested in learning deep learning. So Linda could maybe see this page; that's one of our course pages. There's about a 50% chance of this if we're running an A/B test on it. But she also could have seen this page. You might notice there are a couple of differences here, right? The color, but a big one is that registration form that's now on the page. So what A/B testing is is randomly assigning visitors to your website to one of two or more experiences. Sometimes it involves changes on the page, like here. Sometimes it's things like changes to a search algorithm or making the page faster. But the idea is that by randomly assigning people and running these conditions at the same time, you're controlling for all the other factors. So this lets you know that if you see a difference between the control and the treatment group and you've run everything properly, any behavioral difference should be because of your change. And then you can be really confident it's had the impact you hoped it would.

Experimentation at DataCamp

Before DataCamp, I was a data analyst at Etsy; that's where I got started on experimentation. I worked on over 60 experiments with our search team. Etsy had been doing experimentation for over eight years at that point and was running over 500 experiments per year. To support all of this, Etsy had a big team of data engineers: more than five of them working full time on the experimentation data platform, which Etsy had custom built. Over 1,000 metrics were computed for each experiment. There was a really nice UI that visualized all of this data, so we could see things like confidence intervals and power calculations for detecting a 1% change, and you could even customize that.

And I came to DataCamp in April, and I knew that part of my job was going to be helping start experimentation at DataCamp, but I still felt a little bit like this in those first couple of weeks. Now, to be very clear, especially for everyone at DataCamp who might be watching, I did not lie on my resume, but here it was very different. There was no system for planning, analyzing, or presenting experiment results. I think they'd only run about one or two tests before I joined. And there were no data engineers to build it. Again, that's partly why I was brought in as a data scientist, but this part would be a bit new.

And I want to share four lessons I learned along the way.

Lesson 1: Build tools to save yourself time

The first is: build tools to save yourself time. Who here has ever had a "first this, then that" question? Something like: who tried X, then did Y? What percent of people who did X then did Y? What was the last thing people did before doing Y? Or what were all the things people did after doing X? So they can be structured differently; maybe you're interested in all the behavior afterwards, or just the first time. But really, you have two sets of behaviors, and you want to follow one user and understand what happened first this, then that.

And these are the types of questions I would ask in A/B testing. Say, what percentage of the people in the treatment versus control registered? And this would be registering after starting the experiment, so I'd have experiment starts and registrations. What were the ad clicks that had a course start within two days? My first attempt at doing this involved a lot of lengthy, repetitive code. There was a ton of copying and pasting, and it was very hard to switch between different types of funnels. By which I mean switching between something like: I want the first time they did this, and then the first time they did that; versus, I want the first time they did this, and everything they did after that. It wasn't very easy to go between those types of funnels.

And when you're doing repetitive tasks, as a lot of you here may know, it's time to write a function. And in my case, a package. Unfortunately, I faced a bit of an issue, which was: this was me and writing packages. I sort of knew it was something I should do. I had even written a blog post a couple of years ago called "Writing My First R Package, Part One," where I talk about how I'm going to write this R package. And there was never a part two. There actually still is not.

So I wasn't really sure what to do. But fortunately, I had David Robinson at DataCamp. Dave is our chief data scientist at DataCamp and, as I said, also my brother, and he is very experienced in writing packages. So one day he set up a hackathon for him, me, and one of the other growth team members at DataCamp to write the package funneljoin. If you'd like to try it out yourself, it's available on GitHub. And funneljoin's goal is to make it easy to analyze behavioral funnels.

So I'm going to walk through how funneljoin works. The structure is: you start with two tables. The first table is your first set of behaviors, and the second table is your second set, the one that you want to happen afterward. You tell it what the user columns are, because you want to be able to connect the user across tables. You tell it what the time columns are; the time in the table on the left-hand side always has to be less than the time on the right-hand side, because it has to happen before. You give it the type of after join, which I'm going to talk about a little more. And finally, the kind of join: inner, left, right, all the dplyr joins.

So one example is a "first-any" question: what are all the courses people started after visiting the home page for the first time? Here I would write my funneljoin: an after inner join, specifying "first-any" in the type argument, and then my time columns and my user columns. And I get out this table, and you'll see, just as we would hope, the course start time is always greater than the viewed-at time. So now I could say, oh, person three viewed the home page on June 6, and we have two entries for them, even for the same home page view, because they started two different courses. And that's because I did a first-any join.

But if I wanted to switch it up and ask what percent of people saw the pricing page for the first time and then subscribed, I just change the type of funnel. So it's a very similar structure: change the type to "first-firstafter". And now, because it's a left join, we also have NAs in the subscribed-at column. What if I want to say, well, that's great, but I really want a tight funnel; I want it to happen within four days? I can add a max_gap argument, giving it a difftime. And now if I run the same code, we see a few more NAs in the subscribed-at column, because even though those people subscribed, they didn't do it within four days.
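The two funnels above can be sketched in a few lines. This is a minimal example, not DataCamp's actual data: the tables, column names, and dates are invented, and the calls follow the funneljoin API as described in the talk (after_inner_join / after_left_join with by_user, by_time, type, and max_gap).

```r
library(dplyr)
library(funneljoin)

# Toy event tables; user IDs and dates are made up for illustration.
page_views <- tibble::tribble(
  ~user_id, ~timestamp,
  1, "2018-06-01",
  2, "2018-06-02",
  3, "2018-06-06"
) %>% mutate(timestamp = as.Date(timestamp))

course_starts <- tibble::tribble(
  ~user_id, ~timestamp,
  1, "2018-06-03",
  3, "2018-06-07",
  3, "2018-06-10"
) %>% mutate(timestamp = as.Date(timestamp))

# "first-any": first page view, then every course start after it.
# Person 3 appears twice because they started two courses after viewing.
first_any <- page_views %>%
  after_inner_join(course_starts,
                   by_user = "user_id",
                   by_time = "timestamp",
                   type = "first-any")

# "first-firstafter" with a left join keeps users who never converted
# (NA in the right-hand time column); max_gap tightens the funnel so
# only conversions within four days count.
first_firstafter <- page_views %>%
  after_left_join(course_starts,
                  by_user = "user_id",
                  by_time = "timestamp",
                  type = "first-firstafter",
                  max_gap = as.difftime(4, units = "days"))

first_any
first_firstafter
```

With same-named time columns, the output distinguishes them with the usual dplyr suffixes (timestamp.x for the left table, timestamp.y for the right), and timestamp.y is always later than timestamp.x, as in the slide.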

As I said, there are many funnel types supported: lastbefore-firstafter, any-any, first-any, and more, with all types of dplyr joins. It works on remote tables, because it uses dplyr and so can take advantage of dbplyr. And bug fixes, pull requests, and feature requests are welcome. Please try it yourself and let me know how it goes.

Lesson 2: Everything that can go wrong will go wrong

So my second lesson is that everything that can go wrong will go wrong. A non-exhaustive list of things that have happened: People were put in both control and treatment; that should not happen, you should get one experience. People in the experiment had no page views; they were mysterious users with only back-end events. People had multiple experiment starts in the same group. There weren't the same number of people in the control versus the treatment. Experiment starts didn't have cookies, so we couldn't even track the user or link that start to any other event.

We realized we really needed to check our assumptions. So we wrote up a doc of things we needed to check, because, again, these were all things we didn't initially think to check; if the system is working correctly, they shouldn't happen. When we started getting weird results, we dug in and realized, oh, all these things are going wrong.
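A few of these checks are straightforward to express with dplyr. This is a sketch with an invented assignments table, not DataCamp's actual schema; the column names are hypothetical.

```r
library(dplyr)

# Hypothetical experiment-start log: one row per experiment start event.
# User 2 is in both variants; user 4 has two starts in the same group.
assignments <- tibble::tibble(
  user_id = c(1, 2, 2, 3, 4, 4),
  variant = c("control", "control", "treatment",
              "treatment", "control", "control")
)

# Check 1: no user should ever appear in more than one variant.
in_both <- assignments %>%
  distinct(user_id, variant) %>%
  count(user_id) %>%
  filter(n > 1)

# Check 2: repeated experiment starts in the same group.
repeat_starts <- assignments %>%
  count(user_id, variant) %>%
  filter(n > 1)

# Check 3: the control/treatment split should be roughly 50/50;
# a binomial test flags a sample-ratio mismatch.
per_variant <- assignments %>%
  distinct(user_id, variant) %>%
  count(variant)
split_test <- binom.test(per_variant$n[1], sum(per_variant$n), p = 0.5)

in_both        # flags user 2
repeat_starts  # flags user 4's duplicate control starts
```

Queries like these are exactly what later became the automated, color-coded health checks in the dashboard.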

My initial solution was, again, lengthy, repetitive code that I'd copy between experiment files: OK, this check, all these things. But as a famous data scientist once said: when you've written the same code three times, write a function; when you've given the same in-person advice three times, write a blog post. And I'll add to that: when you've run the same process three times, make a dashboard.


Lesson 3: Build tools that empower others

So I could have come up with a solution here like making another package; that's what we saw for the first problem, where a package was the solution to all the copy-pasting. But in this case, a dashboard is really what's going to be helpful. And what it also did was build a tool that empowers others. We've heard a lot today about how useful Shiny can be. If you saw the presentation at 11 a.m. this morning from Jacqueline and Heather Nolis on their work at T-Mobile, they talked about how putting a Shiny app in front of the stakeholders, so they could play with the machine learning model themselves, is what got them the resources to build out a whole team.

So I built a Shiny app for showing our experiment data. I started with the health metrics. Now, for every experiment, these are automatically calculated, and we can know if something's going wrong. It's color-coded: we see that duplicate cookies and cookies in multiple variations look good, they're green. Cookies with no page views is not looking as good; it's yellow and red. And that can suggest what we might need to look into.

We have a by-metric view. So if you want to look at all the experiments that are running, or even all experiments ever, and ask how all of those affected chapter starts, we can see the control versus the treatment and the percent change. Again, there's this color coding, which is a small thing, but it's nice because at a glance I know what to take from this: three experiments didn't have an impact, and the fourth one had a negative impact. And that will only show in red when the p-value is less than 0.05. We also have nice confidence intervals, for some more visualization and help understanding these.
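Each cell in a view like this boils down to a two-proportion comparison, which base R handles directly. A sketch with invented counts, using prop.test, which returns both the p-value that drives the red/green coloring and a confidence interval for the difference in rates:

```r
# Invented numbers: 480 of 5,000 control users and 560 of 5,000
# treatment users started a chapter.
control_starts   <- 480
treatment_starts <- 560
n_per_group      <- 5000

res <- prop.test(x = c(control_starts, treatment_starts),
                 n = c(n_per_group, n_per_group))

# Relative percent change, as shown in the by-metric view.
pct_change <- (treatment_starts - control_starts) / control_starts

significant <- res$p.value < 0.05  # would the cell be colored?
res$conf.int                       # CI for the difference in rates
```

With these made-up counts the lift is about 17% and the difference is statistically significant, so this row would show in color on the dashboard.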

We have an individual experiment view, which shows the description and the hypothesis. As the experiment metrics grew, as we started tracking more, we went from about three metrics to 11, and it became a bit cumbersome to look just by metric when you're only interested in one experiment. So all the information was already there; the individual experiment view just makes it easier for people to see at a glance.

And to level up a little bit: I would get a common request of, what percent increase can we detect in a two-week test? And this is a really important question. Because otherwise you may end up running tests where, say you have a great idea: I want to know whether we're going to increase registrations by changing the color of a button. But if you run a power calculation, it might tell you: all right, in two weeks, with the number of people who are visiting and the number of people who, say, register already, you would need a 50% increase in registrations to detect a change. So that probably means you don't want to run that test. And I was really glad people were asking this question; I encouraged them. But could I make a tool so that people could answer this themselves, without code?

So I wanted to go from delivering information to helping people discover information. And in the famous words of Barney Stinson: challenge accepted. We got the impact calculator. Now anyone at DataCamp can enter URLs here and say they're interested in running an experiment on our homepage. They can say, all right, maybe I'm only interested in signed-out users, maybe only mobile. And this table you see will recalculate and say: first, this is how many unique users there are over a time period, and this is what percent of them start a chapter, register, et cetera. So this is really great, because sometimes even just from this you'll go, oh wow, I didn't realize these four URLs get so little traffic; it probably doesn't make sense to run an experiment. But the second part is the power calculation table underneath it, which tells you the answer to this question: in, say, 18 days, you'll be able to detect an increase of 4% for chapter starts, 6.5% for registrations. It's always a bit of a guessing game what impact your test will have, but usually you'll have some sort of idea of what's reasonable. And so they could look at this and say, all right, maybe if I include both desktop and mobile, is that going to get me down to 3%? What if I make it 25 days instead? And they could iterate really quickly without ever having to come to me.
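The kind of number that power calculation table reports can be sketched with base R's power.prop.test. The traffic figures and the 10% baseline rate below are invented for illustration; the idea is to solve for the smallest treatment rate distinguishable from baseline given the sample size.

```r
# Suppose (made-up numbers) an 18-day test sends 20,000 unique users
# to each variant, and the baseline chapter-start rate is 10%.
baseline    <- 0.10
n_per_group <- 20000

# Leaving p2 unspecified makes power.prop.test solve for it: the
# smallest treatment rate detectable at 80% power and alpha = 0.05.
res <- power.prop.test(n = n_per_group, p1 = baseline,
                       power = 0.8, sig.level = 0.05)

# Expressed as the relative increase a dashboard table would show.
detectable_lift <- (res$p2 - baseline) / baseline
round(100 * detectable_lift, 1)  # a relative lift of a few percent
```

Flipping the inputs around, say more days of traffic and therefore a larger n, shrinks the detectable lift, which is exactly the iteration the impact calculator lets stakeholders do without code.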

Lesson 4: Make it easy to do the right thing

And my final lesson was to make it easy to do the right thing. One best practice for A/B tests is to have a single key metric per experiment. It helps clarify decision making, because if you have multiple metrics and one goes up and one goes down, you could be left not really knowing what to do in that situation. It also allows you to run that power analysis: for a power analysis, you need to have the metric and know its rate, and that can tell you how long to run the experiment for.

So that's great, and I really am doing a lot to educate people at DataCamp about A/B testing best practices. But even more than that, we build it into the process and make it easy for people. This is our Airtable, where we put the information about our experiments, and you can see at the top here there's a success metric. Each experiment is required to have one, and only one. We can note other things we care about, but this is the key metric, and that forces people to make a decision. We can match it to something on the dashboard, and it can help make decision making at the end a lot easier.

The second guideline is to run the experiment for the length you've planned on. This is a graph showing that if you simulate thousands of null experiments, check every day whether the p-value is less than .05, and stop your experiment at that point, you'd get an over 20% false positive rate, a huge increase. You really want to do that power analysis, set the length of your experiment, and stick to it. That's why we added a start date and an end date for each experiment to the health checks view. It could be really tempting for a stakeholder to message me: hey, I checked the dashboard on this individual experiment, this metric went up, why don't we just go ahead and launch it? But now they say: okay, I see the end date is not for another couple of days; I know I'm supposed to follow this best practice, so I will.
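That peeking problem is easy to reproduce in a few lines of R: simulate null experiments where both variants convert at the same rate, run a two-proportion z-test after each day, and "stop" at the first p < .05. The traffic numbers below are invented; the inflated false positive rate is the point.

```r
set.seed(42)

n_sims  <- 2000  # null experiments to simulate
days    <- 14    # one peek per day
per_day <- 500   # users per variant per day
rate    <- 0.10  # true conversion rate in BOTH variants (no real effect)

false_positive <- replicate(n_sims, {
  n      <- per_day * seq_len(days)                  # cumulative sample size
  conv_c <- cumsum(rbinom(days, per_day, rate))      # cumulative conversions
  conv_t <- cumsum(rbinom(days, per_day, rate))
  p_c    <- conv_c / n
  p_t    <- conv_t / n
  pooled <- (conv_c + conv_t) / (2 * n)
  z      <- (p_t - p_c) / sqrt(2 * pooled * (1 - pooled) / n)
  # Did the daily peek ever cross p < .05, triggering an early stop?
  any(2 * pnorm(-abs(z)) < 0.05)
})

mean(false_positive)  # far above the nominal 5%
```

Even with only 14 daily peeks, the realized false positive rate lands well above the 5% you thought you were paying, which is why the dashboard's fixed end date matters.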

Conclusion

So in conclusion, to recap: build tools to save yourself time. That's where funneljoin came from. If you're finding yourself doing something over and over again, it's really worthwhile to invest some time in making it easier. The other thing we found is that a lot of analyses we wouldn't even have started before, because they'd be a bit cumbersome, we now do, because we have funneljoin in our back pocket and know we can do them really quickly; it enables more work. Everything that can go wrong will go wrong: you have to check your assumptions and build systems to do that. Build tools that empower others, especially, I found, effective dashboards. And finally, make it easy to do the right thing. It's great to put in work educating others, maybe some rules to follow, but the easier you can make it for them to follow the best practice, the more likely you'll get the result you want.


Many thanks to the growth and data science teams at DataCamp, especially David Robinson and Anthony Baker, who are co-authors of funneljoin, and to the analytics and data engineering team at Etsy, where I was first a data analyst. And with that, thank you very much, and I'm happy to take questions.

Q&A

Q: Thanks for putting this together. Any words of caution before we rely on your package in our own work?

A: Yeah, so it's funny. When I gave this talk for the first time back in October or November, there was a big caution sign that said: still in beta. The only thing I would say is that there are some issues: it hasn't been tested on non-Postgres remote tables. I don't think it would give you the wrong answer so much as not work. There are some other behaviors we know about, like a bug that sometimes occurs when you do multiple after joins remotely, but now we don't enable you to do that; we basically throw an error. So we've really tried, where we can, to throw errors. That being said, we use it very extensively at DataCamp, it's tested with Travis CI, and there's a bunch of tests. So definitely try it out, and if something looks a little wacky, always double check a bit, but we've found it generally works pretty well.