Resources

Simon Couch - From hours to minutes: Accelerating your tidymodels code

Abstract: This talk demonstrates a 145-fold speedup in training time for a machine learning pipeline with tidymodels through four small changes. By adapting a grid search on a canonical model to use a more performant modeling engine, hooking into a parallel computing framework, transitioning to an optimized search strategy, and defining the grid to search over carefully, users can drastically cut down on the time to develop machine learning models with tidymodels without sacrificing predictive performance.

Resources mentioned in the talk:

- Presentation slides: https://simonpcouch.github.io/rpharma-24/#/
- GitHub repository for the talk: https://github.com/simonpcouch/rpharma-24
- Efficient Machine Learning with R: Low-Compute Predictive Modeling with tidymodels: https://emlwr.org
- Optimizing model parameters faster with tidymodels: https://www.simonpcouch.com/blog/2023-08-04-parallel-racing/

Presented at the 2024 R/Pharma Conference

Nov 25, 2024
18 min


Transcript

This transcript was generated automatically and may contain errors.

Good morning, y'all. My name is Simon. I write open source R packages at Posit, and I want to show you how to speed up your tidymodels code.

So to start off, we'll work through an example predictive modeling problem in this talk. We're working with a binary outcome, so we're deciding between the values yes or no for a given variable, and we're working with a data set that is 100,000 rows and 18 columns. So that first column is the binary outcome that we're trying to predict, and the remaining 17 we can use to try to predict the value of that outcome. We're working with 100,000 rows, so we're not necessarily in big data territory yet, depending on your definition, but not a trivial amount of data either. And we have a mix of numeric and categorical variables as predictors. So my question is, how long will it take to tune a boosted tree model using tidymodels on my laptop?

If you're not familiar with a boosted tree, a boosted tree is a model type that's analogous to something like a logistic regression or a decision tree, but it's particularly computationally intensive to train. And when I say tune that model, we're going to try out multiple potential argument values to the training function and see which ones stick.

So we're going to choose between two different modeling approaches here. If I just use all of the default values in tidymodels, I use the default engine, the underlying package that will actually carry out the model fit, which is called XGBoost. That takes three hours and 40 minutes. If I'm working with 100,000 rows and just modeling a binary outcome on tabular data, my expectation is that we should be able to iterate a lot faster than three hours and 40 minutes. The performance metric we're using in this example is the area under the ROC curve, which takes values between zero and one.

Today, I'm going to show you that you can cut down on the time required to tune this model by over 99% while sacrificing only 0.03% of your predictive performance. And you can do that by just making four high-level changes that with tidymodels only require seven lines of code. So when I outline these four changes, this is going to go pretty fast. Don't feel like you need to take notes all the way through. I'm going to share resources at the end of this talk. But if you're feeling eager, you can also just hop to the link shown at the bottom of this slide, github.com/simonpcouch/rpharma-24.


How tidymodels fits a model

Okay, so quickly some background on what tidymodels is actually doing when you go to fit a model. So at the top of this diagram, I have this code reading fit, boost tree. We're modeling the outcome as a function of all of the predictors, and we're using this data set called d. What actually happens in this situation is that the tidymodels team doesn't actually implement any of the model training or prediction code that we're running. There are a lot of our users who are intimately familiar with the details of different statistical models, and we trust those practitioners to be the ones to implement the modeling code. What we do as the tidymodels team is put a wrapper on top of that training and prediction code that unifies the interface.

So while this code fits a boosted tree model, the code to fit a logistic regression or a decision tree or all sorts of other model types, and the code to do so with all sorts of different modeling packages, is exactly the same apart from the model specification. So the first step is that we translate that inputted code into code that the modeling engine will understand. So the default modeling engine for a boosted tree in tidymodels is actually XGBoost. That code will take some amount of time to evaluate, and it will return its result back to tidymodels. Tidymodels situates it in a common user interface, and the fit function returns its value.

So this code takes some amount of time to run. I've shown the portions of that evaluation time that the tidymodels team is responsible for in green, and I've shown the portions that the modeling engine is responsible for that the tidymodels team can't necessarily control in orange. In reality, the green portions here are on the order of like a couple milliseconds, and they don't tend to scale with the size of the data. The orange segment here is, again, the implementer of the modeling engine is responsible for how long that code takes to run, and that tends to scale with the size of the data. So larger data sets take longer to fit.

So if I'm tuning a boosted tree, that analogous diagram in that situation is that we're fitting many, many models, and we're doing the translation step on either side, but the tuning process requires trying out a bunch of different arguments to the boosted tree model and fitting it to a bunch of different subsets. So in this example, we're trying out 10 possible boosted tree argument settings, and for each of those 10 argument settings, we're fitting that model to 10 different subsets of the data. That's called resampling. So there's 100 model fits total happening here. This tuning process that we're visualizing right now is the one that took three hours and 41 minutes. So how can we cut down on the time to tune that model by over 99%?
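The baseline tuning process described above might look roughly like the sketch below. The data frame name d comes from the talk; the outcome column name class and the tuned parameters are assumptions for illustration.

```r
# Baseline grid search: 10 candidate parameter sets x 10 resamples
# = 100 model fits. Assumes a data frame `d` with a binary factor
# outcome `class` and 17 predictor columns (names are illustrative).
library(tidymodels)

set.seed(1)
folds <- vfold_cv(d, v = 10)              # 10 resampled subsets of the data

spec <- boost_tree(trees = tune(), learn_rate = tune()) |>
  set_mode("classification")              # default engine: xgboost

res <- tune_grid(
  spec,
  class ~ .,                              # outcome as a function of all predictors
  resamples = folds,
  grid = 10,                              # 10 candidate argument settings
  metrics = metric_set(roc_auc)
)
```

This is the configuration that took three hours and 41 minutes; the four changes that follow each modify one piece of it.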

Four changes to speed up tuning

The first thing that we can do is distribute our computations across CPU cores. Modern laptops are pretty incredible; many have a lot of cores, and the most recent MacBooks are looking at double digits. I'm working on a laptop right now that's four years old, and I can use four CPU cores at a single time. So instead of fitting 100 models in a row one at a time, we can fit four at a time, and best case scenario, cut down on the amount of time to tune this boosted tree by a factor of four. In tidymodels, all you need to make this happen is one line of code. If you're familiar with the future framework, tidymodels integrates with future, and so this is just future code as it is, and this one line of code will tell tidymodels all it needs to know about how to distribute computations on your laptop.
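The one line in question is a standard future plan declaration; the worker count of four matches the laptop described in the talk and should be adjusted to your machine.

```r
# Register a future backend so tidymodels distributes model fits
# across CPU cores during tuning.
library(future)
plan(multisession, workers = 4)   # adjust `workers` to your hardware
```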

Next, I'm going to pick a modeling engine that is particularly well suited to the modeling problem that I'm working on. So by default, the modeling engine in our boosted tree case is XGBoost. I happen to know, based on the way that this data set that we're using as an example looks, that a modeling engine called LightGBM might fit faster on that example. So we've cut down on the width of this diagram by a pretty significant factor already, both by distributing our computations across four CPU cores and choosing a model that fits faster than the default model. It's not always the case that a given modeling engine will fit faster than another, regardless of the data context that you're working with. There are situations where XGBoost is more performant, but this is a matter of building your intuition through practice, and I'll show you at the end of this talk through some resources that I'll share with you.

In tidymodels, there's no special syntax you need to learn to translate your fitting code from one modeling engine to another. All you need to do is set the engine argument to a different value.
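Concretely, the engine swap is one changed argument. Note that LightGBM support for tidymodels comes from the bonsai extension package; the tuned parameters here are carried over from the earlier sketch and remain illustrative.

```r
# Same model specification, different engine: one argument change.
library(bonsai)   # provides the "lightgbm" engine for boost_tree()

spec <- boost_tree(trees = tune(), learn_rate = tune()) |>
  set_mode("classification") |>
  set_engine("lightgbm")          # was "xgboost" by default
```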

Next, we can use something called the sub-model trick. This is the third modification that I'll talk about. The sub-model trick is a bit of a rabbit hole, so we're not going to go too deep down that rabbit hole. But the idea here is that we can evaluate many more models than we actually fit. The sub-model trick enables you to fit a single model and predict across several of them, and tidymodels has some tricky stuff going on in the back end to make sure that we enable that as often as we can. The way that you can explicitly opt into that sub-model trick as a user is by using a function called grid_regular(), and using a regular grid will enable the sub-model trick. In cases where you're just tuning a single model parameter, this just works if it's available, so there's nothing you need to do as a user. But this is four added lines of code here.
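A regular grid that opts into the sub-model trick might be defined as follows; the specific parameter ranges and level counts below are illustrative choices, not values from the talk.

```r
# A regular grid over the tuned parameters enables the sub-model
# trick: one fit with many trees can be used to predict for every
# smaller number of trees in the grid.
library(tidymodels)

grid <- grid_regular(
  trees(range = c(100, 2000)),
  learn_rate(range = c(-3, -1)),   # dials uses a log10 scale here
  levels = c(10, 3)                # 10 x 3 = 30 candidates
)
```

Passing this grid to the tuning function via its grid argument replaces the default space-filling design.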

And the fourth technique that we'll use is something called racing. So, I said earlier that we're training or we're choosing between 10 different possible values, and we're doing so by fitting each of those 10 different possible argument values to 10 different subsets of the data. What happens in practice is that we can fit each of those 10 possible models to 3 of those subsets, or even 2, and that will give us enough information to realize that some of those models are much more likely to be the most performant, the best possible model we have access to, than others. And so, instead of continuing to train all of those models that we're pretty sure are not going to outperform the most performant, and we can bound that pretty sure by a significance level, we can just go ahead and give up on those and not fit them to the remaining 8 or 7 or 6 resamples, and that will cut down on our time to tune even further.

In tidymodels, to make this happen, all you need to do is change one line of code. That default grid search can be transitioned to racing with a repeated measures ANOVA model. The function is called tune_race_anova(), and that can cut down on your time to tune as well.
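That one-line change swaps the tuning function itself; tune_race_anova() lives in the finetune package. As before, the objects spec, folds, and grid and the outcome name class refer to the illustrative setup sketched earlier.

```r
# Racing: drop candidates that are statistically unlikely to win
# after the first few resamples, rather than fitting them to all 10.
library(finetune)

res <- tune_race_anova(     # was tune_grid()
  spec,
  class ~ .,
  resamples = folds,
  grid = grid,
  metrics = metric_set(roc_auc)
)
```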

And so, altogether, we started with this tuning process that fitted every possible model in sequence, and each of those model fits took a good while, and that resulted in a training process that was 3 hours and 40 minutes. That will really hinder your ability to iterate as you develop models interactively. We can make four small changes that require changes to seven lines of code, give or take, depending on how liberally you add spaces, to transition that code to run in 90 seconds, and you don't have to sacrifice predictive performance on your way to doing so.


Resources for learning more

If you're like, what the heck just happened, and you'd like to learn a little bit more about the basics of tidymodels and interactive predictive modeling, I would recommend you check out the book Tidy Modeling with R, that's at tmwr.org. It's a great place to get started to learn how to use the tidymodels packages. If you're interested specifically in the contents of this talk today, I'm excited to share with you that I'm working on a book called Efficient Machine Learning with R. The draft is currently hosted at emlwr.org, and the first chapter of that book runs through the contents of this talk in detail. I talk through specifically those four changes that I've made in this talk, and then each of the four chapters that follow that introduction, if that wasn't enough detail for you, delve even further into the details of each of those four optimizations.

This book is pretty early on in its writing. There are only a few chapters that are very readable at this point, but in the remaining ones I try to link you out to other blog posts that I've put together over the years. I'm really excited to share this new work today, and I hope you'll follow along with me as I work on drafting the rest of the book. You can follow me on socials at simonpcouch on various platforms, and there I'll post when I draft new portions of different chapters of the book. If you'd like to see the source code for these slides, or access any of the resources that I've called out during this talk, you can check out the GitHub repository at github.com/simonpcouch/rpharma-24. Thank you very much for being here. I'm very grateful for the opportunity.

Q&A

Okay, I see a question from Skyler Gray that says, what if we're tuning multiple parameters? So, in the example that I walked through in the talk, we are actually tuning multiple parameters, and that grid_regular() function will take care of tuning multiple parameters at the same time, while also enabling the sub-model trick. The reason that I call out just tuning a single parameter is because that will happen automatically if you're tuning a single parameter without you having to make any changes to code.

We have another question that reads, how should we maximize the computing resources to speed up the boosted tree modeling process if we use Posit Workbench? Okay, so this is an interesting one. I encourage you to check out the parallelism chapter in the current draft of the book. That's one of the more fleshed out chapters right now, and specifically the bottom where I talk about computing in the cloud and using multiple cores in the cloud. So, the gist of it is that the remaining three tricks that I talk about in this talk will be specifically helpful for you in terms of cutting down on the total number of model fits and reducing the time for a given model fit to take place.

With parallelism in the cloud, there's higher communication overhead between CPU cores than you would find when you're working locally on a laptop. And so, what you end up seeing there is that the payoff that you get by using a few cores on something like Posit Workbench or any other cloud-hosted R session won't be as significant as it would be if you're using local laptop CPU cores.

Another question from Mark Burton says, is tune_bayes() also helpful for speed or just for getting really close to optimal hyperparameters? Yeah. So, that's something we would call an iterative search, where the search function can take the information that it has from earlier searches and use that to inform the values that it should look for next. I would say that a combination that might be particularly helpful for you is starting out your resampling by using racing and maybe trying out 10 or 50 or 100 different possible hyperparameter values, and then, using that initial set of values, something like tune_bayes() or another iterative search function can look near the areas of the most performant values and try to optimize that hyperparameter right on the dot.
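The racing-then-Bayesian combination described in this answer might be sketched as below. This is an assumption-laden sketch: spec, folds, and the outcome name class are the illustrative objects from earlier, and passing racing results directly as the initial set may depend on your package versions; a tune_grid() result works the same way.

```r
# Stage 1: racing over a moderate grid to cheaply screen candidates.
library(tidymodels)
library(finetune)

race_res <- tune_race_anova(spec, class ~ ., resamples = folds, grid = 20)

# Stage 2: Bayesian search warm-started from those results, exploring
# near the most performant candidates found so far.
bayes_res <- tune_bayes(
  spec,
  class ~ .,
  resamples = folds,
  initial = race_res,   # seed the search with the screening results
  iter = 15
)
```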

All right. It looks like we're at 10:20. I appreciate folks for coming through. Thank you for the questions, and I hope you find that book draft helpful as you work on speeding up your tidymodels code. Thanks.