Resources

Simon Couch - From hours to minutes: Accelerating your tidymodels code

Abstract: This talk demonstrates a 145-fold speedup in training time for a machine learning pipeline with tidymodels through four small changes. By adapting a grid search on a canonical model to use a more performant modeling engine, hooking into a parallel computing framework, transitioning to an optimized search strategy, and defining the grid to search over carefully, users can drastically cut down on the time to develop machine learning models with tidymodels without sacrificing predictive performance.

Resources mentioned in the talk:

- Presentation slides: https://simonpcouch.github.io/rpharma-24/#/
- GitHub repository for the talk: https://github.com/simonpcouch/rpharma-24
- Efficient Machine Learning with R: Low-Compute Predictive Modeling with tidymodels: https://emlwr.org
- Optimizing model parameters faster with tidymodels: https://www.simonpcouch.com/blog/2023-08-04-parallel-racing/

Presented at the 2024 R/Pharma Conference

Nov 25, 2024
18 min


Transcript

This transcript was generated automatically and may contain errors.

Good morning, y'all. My name is Simon. I write open source R packages at Posit, and I want to show you how to speed up your tidymodels code.

So to start off, we'll work through an example predictive modeling problem in this talk. We're working with a binary outcome, so we're deciding between the values yes or no for a given variable, and we're working with a data set that is 100,000 rows and 18 columns. So that first column is the binary outcome that we're trying to predict, and the remaining 17 we can use to try to predict the value of that outcome. We're working with 100,000 rows, so we're not necessarily in big data territory yet, depending on your definition, but not a trivial amount of data either. And we have a mix of numeric and categorical variables as predictors. So my question is, how long will it take to tune a boosted tree model using tidymodels on my laptop?

If you're not familiar with a boosted tree, a boosted tree is a model type that's analogous to something like a logistic regression or a decision tree, but it's particularly computationally intensive to train. And when I say tune that model, we're going to try out multiple potential argument values to the training function and see which ones stick.

So we're going to choose between two different modeling approaches here. If I just use all of the default values in tidymodels, I use the default engine, the underlying package that will actually carry out the model fit, which is called XGBoost. That takes three hours and 40 minutes. If I'm working with 100,000 rows and just modeling a binary outcome on tabular data, my expectation is that we should be able to iterate a lot faster than three hours and 40 minutes. The performance metric we're using in this example is the area under the ROC curve, which takes values between zero and one.

Today, I'm going to show you that you can cut down on the time required to tune this model by over 99% while sacrificing only 0.03% of your predictive performance. And you can do that by just making four high-level changes that with tidymodels only require seven lines of code. So when I outline these four changes, this is going to go pretty fast. Don't feel like you need to take notes all the way through. I'm going to share resources at the end of this talk. But if you're feeling eager, you can also just hop to the link shown at the bottom of this slide, github.com/simonpcouch/rpharma-24.


How tidymodels fits a model

Okay, so quickly some background on what tidymodels is actually doing when you go to fit a model. So at the top of this diagram, I have this code reading fit, boost tree. We're modeling the outcome as a function of all of the predictors, and we're using this data set called d. What actually happens in this situation is that the tidymodels team doesn't actually implement any of the model training or prediction code that we're running. There are a lot of our users who are intimately familiar with the details of different statistical models, and we trust those practitioners to be the ones to implement the modeling code. What we do as the tidymodels team is put a wrapper on top of that training and prediction code that unifies the interface.

So while this code fits a boosted tree model, the code to fit a logistic regression or a decision tree or all sorts of other model types, and the code to do so with all sorts of different modeling packages, is exactly the same apart from the model specification. So the first step is that we translate that inputted code into code that the modeling engine will understand. So the default modeling engine for a boosted tree in tidymodels is actually XGBoost. That code will take some amount of time to evaluate, and it will return its result back to tidymodels. Tidymodels situates it in a common user interface, and the fit function returns its value.

So this code takes some amount of time to run. I've shown the portions of that evaluation time that the tidymodels team is responsible for in green, and I've shown the portions that the modeling engine is responsible for that the tidymodels team can't necessarily control in orange. In reality, the green portions here are on the order of like a couple milliseconds, and they don't tend to scale with the size of the data. The orange segment here is, again, the implementer of the modeling engine is responsible for how long that code takes to run, and that tends to scale with the size of the data. So larger data sets take longer to fit.

So if I'm tuning a boosted tree, that analogous diagram in that situation is that we're fitting many, many models, and we're doing the translation step on either side, but the tuning process requires trying out a bunch of different arguments to the boosted tree model and fitting it to a bunch of different subsets. So in this example, we're trying out 10 possible boosted tree argument settings, and for each of those 10 argument settings, we're fitting that model to 10 different subsets of the data. That's called resampling. So there's 100 model fits total happening here. This tuning process that we're visualizing right now is the one that took three hours and 41 minutes. So how can we cut down on the time to tune that model by over 99%?
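The baseline tuning process described above might look roughly like the sketch below. The data frame name d comes from the talk; the outcome column name class and the tuned parameters are assumptions for illustration.

```r
# Baseline grid search: 10 candidate parameter sets x 10 resamples
# = 100 model fits. Assumes a data frame `d` with a binary factor
# outcome `class` and 17 predictor columns (names are illustrative).
library(tidymodels)

set.seed(1)
folds <- vfold_cv(d, v = 10)              # 10 resampled subsets of the data

spec <- boost_tree(trees = tune(), learn_rate = tune()) |>
  set_mode("classification")              # default engine: xgboost

res <- tune_grid(
  spec,
  class ~ .,                              # outcome as a function of all predictors
  resamples = folds,
  grid = 10,                              # 10 candidate argument settings
  metrics = metric_set(roc_auc)
)
```

This is the configuration that took three hours and 41 minutes; the four changes that follow each modify one piece of it.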

Four changes to speed up tuning

The first thing that we can do is distribute our computations across CPU cores. Modern laptops are pretty incredible; many have a lot of cores, and the most recent MacBooks are looking at double digits. I'm working on a laptop right now that's four years old, and I can use four CPU cores at a single time. So instead of fitting 100 models in a row one at a time, we can fit four at a time, and best case scenario, cut down on the amount of time to tune this boosted tree by a factor of four. In tidymodels, all you need to make this happen is one line of code. If you're familiar with the future framework, tidymodels integrates with future, and so this is just future code as it is, and this one line of code will tell tidymodels all it needs to know about how to distribute computations on your laptop.
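The one line in question is a standard future plan declaration; the worker count of four matches the laptop described in the talk and should be adjusted to your machine.

```r
# Register a future backend so tidymodels distributes model fits
# across CPU cores during tuning.
library(future)
plan(multisession, workers = 4)   # adjust `workers` to your hardware
```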

Next, I'm going to pick a modeling engine that is particularly well suited to the modeling problem that I'm working on. So by default, the modeling engine in our boosted tree case is XGBoost. I happen to know, based on the way that this data set that we're using as an example looks, that a modeling engine called LightGBM might fit faster on that example. So we've cut down on the width of this diagram by a pretty significant factor already, both by distributing our computations across four CPU cores and choosing a model that fits faster than the default model. It's not always the case that a given modeling engine will fit faster than another, regardless of the data context that you're working with. There are situations where XGBoost is more performant, but this is a matter of building your intuition through practice, and I'll show you at the end of this talk through some resources that I'll share with you.

In tidymodels, there's no special syntax you need to learn to translate your fitting code from one modeling engine to another. All you need to do is set the engine argument to a different value.
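Concretely, the engine swap is one changed argument. Note that LightGBM support for tidymodels comes from the bonsai extension package; the tuned parameters here are carried over from the earlier sketch and remain illustrative.

```r
# Same model specification, different engine: one argument change.
library(bonsai)   # provides the "lightgbm" engine for boost_tree()

spec <- boost_tree(trees = tune(), learn_rate = tune()) |>
  set_mode("classification") |>
  set_engine("lightgbm")          # was "xgboost" by default
```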

Next, we can use something called the sub-model trick. This is the third modification that I'll talk about. The sub-model trick is a bit of a rabbit hole, so we're not going to go too deep down that rabbit hole. But the idea here is that we can evaluate many more models than we actually fit. The sub-model trick enables you to fit a single model and predict across several of them, and tidymodels has some tricky stuff going on in the back end to make sure that we enable that as often as we can. The way that you can explicitly opt into that sub-model trick as a user is by using a function called grid_regular(), and using a regular grid will enable the sub-model trick. In cases where you're just tuning a single model parameter, this just works if it's available, so there's nothing you need to do as a user. But this is four added lines of code here.
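A regular grid that opts into the sub-model trick might be defined as follows; the specific parameter ranges and level counts below are illustrative choices, not values from the talk.

```r
# A regular grid over the tuned parameters enables the sub-model
# trick: one fit with many trees can be used to predict for every
# smaller number of trees in the grid.
library(tidymodels)

grid <- grid_regular(
  trees(range = c(100, 2000)),
  learn_rate(range = c(-3, -1)),   # dials uses a log10 scale here
  levels = c(10, 3)                # 10 x 3 = 30 candidates
)
```

Passing this grid to the tuning function via its grid argument replaces the default space-filling design.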

And the fourth technique that we'll use is something called racing. So, I said earlier that we're training or we're choosing between 10 different possible values, and we're doing so by fitting each of those 10 different possible argument values to 10 different subsets of the data. What happens in practice is that we can fit each of those 10 possible models to 3 of those subsets, or even 2, and that will give us enough information to realize that some of those models are much more likely to be the most performant, the best possible model we have access to, than others. And so, instead of continuing to train all of those models that we're pretty sure are not going to outperform the most performant, and we can bound that pretty sure by a significance level, we can just go ahead and give up on those and not fit them to the remaining 8 or 7 or 6 resamples, and that will cut down on our time to tune even further.

In tidymodels, to make this happen, all you need to do is change one line of code. That default grid search can be transitioned to racing with a repeated measures ANOVA model. The function is called tune_race_anova(), and that can cut down on your time to tune as well.
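That one-line change swaps the tuning function itself; tune_race_anova() lives in the finetune package. As before, the objects spec, folds, and grid and the outcome name class refer to the illustrative setup sketched earlier.

```r
# Racing: drop candidates that are statistically unlikely to win
# after the first few resamples, rather than fitting them to all 10.
library(finetune)

res <- tune_race_anova(     # was tune_grid()
  spec,
  class ~ .,
  resamples = folds,
  grid = grid,
  metrics = metric_set(roc_auc)
)
```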

And so, altogether, we started with this tuning process that fitted every possible model in sequence, and each of those model fits took a good while, and that resulted in a training process that was 3 hours and 40 minutes. That will really hinder your ability to iterate as you develop models interactively. We can make four small changes that require changes to seven lines of code, give or take, depending on how liberally you add spaces, to transition that code to run in 90 seconds, and you don't have to sacrifice predictive performance on your way to doing so.


Resources for learning more

If you're like, what the heck just happened, and you'd like to learn a little bit more about the basics of tidymodels and interactive predictive modeling, I would recommend you check out the book Tidy Modeling with R, that's at tmwr.org. It's a great place to get started to learn how to use the tidymodels packages. If you're interested specifically in the contents of this talk today, I'm excited to share with you that I'm working on a book called Efficient Machine Learning with R. The draft is currently hosted at emlwr.org, and the first chapter of that book runs through the contents of this talk in detail. I talk through specifically those four changes that I've made in this talk, and then each of the four chapters that follow that introduction, if that wasn't enough detail for you, delve even further into the details of each of those four optimizations.

This book is pretty early on in its writing. There are only a few chapters that are very readable at this point, but in the remaining ones I try to link you out to other blog posts that I've put together over the years. I'm really excited to share this new work today, and I hope you'll follow along with me as I work on drafting the rest of the book. You can follow me on socials at simonpcouch on various platforms, and there I'll post when I draft new portions of different chapters of the book. If you'd like to see the source code for these slides, or access any of the resources that I've called out during this talk, you can check out the GitHub repository at github.com/simonpcouch/rpharma-24. Thank you very much for being here. I'm very grateful for the opportunity.

Q&A

Okay, I see a question from Skyler Gray that says, what if we're tuning multiple parameters? So, in the example that I walked through in the talk, we are actually tuning multiple parameters, and that grid_regular() function will take care of tuning multiple parameters at the same time, while also enabling the sub-model trick. The reason that I call out just tuning a single parameter is because that will happen automatically if you're tuning a single parameter without you having to make any changes to code.

We have another question that reads, how should we maximize the computing resources to speed up the boosted tree modeling process if we use Posit Workbench? Okay, so this is an interesting one. I encourage you to check out the parallelism chapter in the current draft of the book. That's one of the more fleshed out chapters right now, and specifically the bottom where I talk about computing in the cloud and using multiple cores in the cloud. So, the gist of it is that the remaining three tricks that I talk about in this talk will be specifically helpful for you in terms of cutting down on the total number of model fits and reducing the time for a given model fit to take place.

With parallelism in the cloud, there's higher communication overhead between CPU cores than you would find when you're working locally on a laptop. And so, what you end up seeing there is that the payoff that you get by using a few cores on something like Posit Workbench or any other cloud-hosted R session won't be as significant as it would be if you're using local laptop CPU cores.

Another question from Mark Burton says, is tune_bayes() also helpful for speed or just for getting really close to optimal hyperparameters? Yeah. So, that's something we would call an iterative search, where the search function can take the information that it has from earlier searches and use that to inform the values that it should look for next. I would say that a combination that might be particularly helpful for you is starting out your resampling by using racing and maybe trying out 10 or 50 or 100 different possible hyperparameter values, and then, using that initial set of values, something like tune_bayes() or another iterative search function can look near the areas of the most performant values and try to optimize that hyperparameter right on the dot.
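The racing-then-Bayesian combination described in this answer might be sketched as below. This is an assumption-laden sketch: spec, folds, and the outcome name class are the illustrative objects from earlier, and passing racing results directly as the initial set may depend on your package versions; a tune_grid() result works the same way.

```r
# Stage 1: racing over a moderate grid to cheaply screen candidates.
library(tidymodels)
library(finetune)

race_res <- tune_race_anova(spec, class ~ ., resamples = folds, grid = 20)

# Stage 2: Bayesian search warm-started from those results, exploring
# near the most performant candidates found so far.
bayes_res <- tune_bayes(
  spec,
  class ~ .,
  resamples = folds,
  initial = race_res,   # seed the search with the screening results
  iter = 15
)
```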

All right. It looks like we're at 10:20. I appreciate folks for coming through. Thank you for the questions, and I hope you find that book draft helpful as you work on speeding up your tidymodels code. Thanks.