tidymodels: Adventures in Rewriting a Modeling Pipeline - posit::conf(2023)

Transcript

This transcript was generated automatically and may contain errors.

My name is George Stagg. I'm a software engineer at Posit, and today I'm going to be chairing this session, which is named Tidy Up Your Models. So the first person we have today to speak to you is Ryan Timpe, and I'm going to pass straight on to him to get started. Thank you.

Thank you. So last night at the reception, someone ran up to me, super excited to meet me. I was like, yes, I made it, I'm nerd famous. But they thought I was Hadley Wickham. So just pointing that out there. I'm not Hadley Wickham. My name is Ryan Timpe. I am a lead data scientist at the Lego Group. My team and I build models to help our business understand how all of our marketing, our trade promotions, our product launches, all that impacts our sales and our brand health. I've been at the Lego Group for almost five years, and over that time, the field of data science has really changed. As you might have noticed, RStudio is now Posit. My managers are always talking about production, and if you checked out the title of my talk before you sat down, our models are now Tidy. And this change can be challenging, especially when it impacts our work, but it can also be kind of fun and rewarding to rethink how you do your day-to-day job.

So this talk is about that. Some of the recent fun and challenges I had while adapting to the changing world of data science at the Lego Group. So let's start with that first adventure. And I'm going to call my changes adventures. It gets more fun. So when I joined the Lego Group, my first job was to build an R package to do our modeling better. And I did that. I built a package. It was simple, but it worked, and it answered some important questions for our business.

For no special reason at all, I'm going to represent that model on the screen as a red Lego brick. No reason. And then over the four years, we had to do more with this package. We had to add ways to add more data to the package and the model. We had to add more pre-processing. We had to add more post-processing. And then we had to do more things to add business value to make sure our stakeholders were actually using the models and driving change in the organization. Then we kept doing this over and over and over, more data, more pre-processing functions, more post-processing functions. And eventually, our pretty simple model and package looked a lot more like this. And this happened because we grew our package organically. It did everything we needed it to do. It ran some pretty awesome models, but it was getting very difficult for my team and I to navigate this and have this package grow with us.

So as we needed to make more changes and add more things, we needed to find all those connections and work out where to put each new step. And we came to the point where we weren't effective data scientists and we needed to make a change. And so four years later, we had matured, we knew exactly what we needed out of our package, and it was time to start over. So to do that, we identified tidymodels. tidymodels is a framework that adds organization to your models. So it takes that big messy pile of bricks and it stacks them. I had never used tidymodels before, and I needed to use it on my job. So I used this book by Max and Julia, who you will see on the stage in a few minutes, to help me.

I know everything about Lego, and I know how Lego can use these models. So I got to spend all this extra time I earned from using tidymodels to focus on what I could add to these models to really help the Lego Group.

So for example, if you look at preprocessing again, recipes is essentially a bunch of these step functions that you can pipe together to transform your data. And these are step functions that are used by data scientists everywhere. For example, step_normalize() will take your numerical data and give it a mean of zero and a standard deviation of one. And this freed up my time to focus on custom step functions, because in my domain, I have some very specific transformations I have to use. I'm in marketing: we have to add carryover effects to our data, we have to add diminishing returns, and these are more specific to my domain. So I used all my free time, well, not free time, but extra time that I earned from using the open source tools, to make my custom step functions. And this really added the Lego twist to my new modeling package.
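To make the preprocessing idea concrete, here is a minimal sketch in R. It is not the package described in the talk: the data are made-up toy numbers, and `adstock()` and `saturate()` are hypothetical helpers standing in for the carryover and diminishing-returns transformations. Instead of a full custom step function, the sketch applies them through the built-in `step_mutate()` and then normalizes with `step_normalize()`.

```r
library(recipes)

# Hypothetical carryover transform (geometric adstock): each period keeps
# a fraction `rate` of the previous period's adstocked value.
adstock <- function(x, rate = 0.5) {
  as.numeric(stats::filter(x, filter = rate, method = "recursive"))
}

# Hypothetical diminishing-returns transform: a simple saturation curve
# that reaches 0.5 when x equals `half_sat`.
saturate <- function(x, half_sat) {
  x / (x + half_sat)
}

# Toy data, invented purely for illustration.
spend <- data.frame(
  sales = c(120, 135, 150, 128, 140, 160),
  tv    = c(100, 0, 0, 50, 0, 80),
  price = c(9.9, 9.9, 8.9, 9.9, 9.4, 8.9)
)

rec <- recipe(sales ~ tv + price, data = spend) |>
  # Domain-specific transforms first: carryover, then diminishing returns.
  step_mutate(tv = saturate(adstock(tv, rate = 0.5), half_sat = 100)) |>
  # Generic step: center and scale every numeric predictor.
  step_normalize(all_numeric_predictors())

# Estimate the step parameters on the training data, then apply them.
baked <- rec |> prep() |> bake(new_data = NULL)
```

Wrapping the two helpers in a proper custom step (with its own `prep()` and `bake()` methods) would make them reusable like any other recipes step; `step_mutate()` is used here only to keep the sketch short.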

And I found opportunities to add this Lego value add everywhere across this modeling pipeline. For example, my data: our team has thousands of possible data series we could put in our models. You can't put that much data in your models. So we had to spend a lot of time figuring out which data sources belong in our models, and I was able to write tools to help us do that better. The modeling process: I could spend more time making sure my model was better specified, making sure our residuals were behaving better. Post-processing is huge for us. We have to take all of our raw model output and make it make sense for the organization, like calculating media uplifts and calculating return on investment coming out of the models. We have a lot of functions here. I got to spend a lot of time on this very important part of the modeling process. Business value: I had more time to work with our stakeholders by leveraging the open source tools. And then on the flip side, when you leverage these open source tools, you get their help with all the bugs. You outsource your bug squashing, because you're going to hit fewer errors in the process. And when you do hit errors, there's a lot more documentation out there on how to solve those errors. By contrast, if you're writing your own functions and your own manual modeling functions, it's a lot harder to solve some of these errors. So by leveraging this open source community and tools whenever I could, I just became a much better Lego data scientist.

Speed gains

Another big benefit I saw was speed. So no one else is going to stand on the stage and tell you how fast tidymodels is. Julia warned me about that. It adds overhead. But by making the swap, we actually unlocked a lot of speed in a few different ways, and I want to talk about those. First, our process. Since I wasn't running my own dplyr code and my tidyr code over and over for my data processing, my functions just ran a lot faster. And that was very noticeable to us. And it was a huge win when we discovered that. Then, our algorithm. Originally, we used Bayesian models, and we had manually written Stan code to do that. That was really awesome, because we got to customize every piece of that code or that model. But rstanarm works a lot better with parsnip in tidymodels. And that's a bunch of Stan models prewritten and precompiled by experts out there. So we leveraged those models, that algorithm, instead this time around, and that really sped us up.
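As a rough illustration of that swap, a precompiled rstanarm model can be requested through parsnip by setting the engine to `"stan"`. This is a sketch under stated assumptions, not the team's actual model: the `mtcars` data, the formula, the prior, and the sampler settings are all placeholders chosen for the example.

```r
library(parsnip)

# A Bayesian linear regression specification. With engine "stan", parsnip
# calls rstanarm::stan_glm(), which uses precompiled Stan models.
# The prior and sampler settings below are illustrative assumptions.
bayes_spec <- linear_reg() |>
  set_engine(
    "stan",
    prior  = rstanarm::normal(location = 0, scale = 2.5),
    chains = 2,
    iter   = 1000,
    seed   = 2023
  )

# Fit on a built-in dataset; no Stan code is written or compiled here.
bayes_fit <- fit(bayes_spec, mpg ~ wt + hp, data = mtcars)

# Predictions come back as a tibble with a .pred column.
preds <- predict(bayes_fit, new_data = head(mtcars))
```

Engine arguments such as `prior` are passed through to `rstanarm::stan_glm()`, so the available customization is whatever that function exposes rather than arbitrary Stan code.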

And then just our ways of working were a lot faster. So with the swap to tidymodels and the workflows process, we were now able to keep every single part of the modeling process and development process on the same platform. We can basically do our entire development in one R notebook. And that really just saved our data scientists tons of time iterating over models and developing models, because we could move a lot more quickly, without switching environments and everything.
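The idea of keeping preprocessing and modeling together can be sketched with the workflows package, which bundles a recipe and a model specification into one object. The dataset, formula, and the plain `"lm"` engine below are assumptions made to keep the example small; they are not taken from the talk.

```r
library(recipes)
library(parsnip)
library(workflows)

# Preprocessing: normalize all numeric predictors.
rec <- recipe(mpg ~ wt + hp, data = mtcars) |>
  step_normalize(all_numeric_predictors())

# Model: ordinary linear regression, used here only as a lightweight stand-in.
spec <- linear_reg() |>
  set_engine("lm")

# Bundle both pieces so they are always fitted and applied together.
wf <- workflow() |>
  add_recipe(rec) |>
  add_model(spec)

wf_fit <- fit(wf, data = mtcars)

# New data passes through the trained recipe before reaching the model.
preds <- predict(wf_fit, new_data = head(mtcars))
```

Because the fitted workflow carries its own preprocessing, swapping the recipe or the model specification and refitting is a small change in one place.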

So I can quantify this change. We had a batch of models for one project that used to take about 30 minutes to run from start to finish. Now they take just 60 seconds. That's a 30-fold speedup for us and a really measurable improvement that helps our team work better.

If we add that into the earlier example of the Lego value add, we have a pretty awesome list of benefits. As you know, though, sometimes benefits come with costs. This was no exception. We did have a few new hurdles with the tidymodels swap. And so we had to kind of rethink our ways of working. We were used to solving some specific problems one way for four years, and we had to come up with some new ways. One example is that algorithm. When we swapped from our manually written Stan code to rstanarm, we gave up a lot of control. With Stan, we could customize really every single prior distribution type based on a variable type. And when we swapped to rstanarm, we were a lot more limited in what we could do. But whenever we came across these types of challenges, we were able to kind of step back and ask: can we solve this problem that we're used to solving one way, a new way? And every time we came across these problems, the answer was yes, we could work with this. And by far, everywhere, the benefits of the tidymodels swap really outran the cost. So no regrets there.