
Max Kuhn | Total Tidy Tuning Techniques | RStudio (2020)
Many models have structural parameters that cannot be directly estimated from the data. These tuning parameters can have a significant effect on model performance and require some mechanism for finding reasonable values. The tune and workflow packages enable tidymodels users to optimize these parameters using a variety of efficient grid search methods as well as iterative search techniques (such as Bayesian optimization).
Transcript
This transcript was generated automatically and may contain errors.
Hi, thanks for coming. So, probably to nobody's surprise, I'm going to talk about model tuning again. My job within the tidyverse group is to work on all types of modeling and how we can do that in a more tidy way.
The place to start for us would normally be with more predictive models, but the tidymodels infrastructure is not really constrained to just that. We want to work on inferential models and descriptive models and all sorts of other things. Some of these models need what we call parameter tuning, and that's what I'm going to talk about today, in this new package we have called tune.
In the tidyverse, we don't tend to write one huge package that does all the things; we tend to write packages that are interconnected with each other. So before I can show you what tune does, I have to show you a few other packages inside tidymodels, how they work, and how they feed into tune.
Introduction to the Ames housing data and recipes
So if you've never seen tidymodels before, we're going to start with what I like to call the new iris data, which is the Ames housing data. We have sale prices on houses in Ames, Iowa; that's what's shown here. We have data in different neighborhoods, and then we have all these features of houses, like their general living area, what type of building it is, and that kind of thing. So the goal here would be to predict the sale price of a house based on its other characteristics.
So let's say we were going to preprocess our data using recipes. Recipes, if you've never seen it, is a package that is really good at doing feature engineering and preprocessing your data before you put it in the model. It follows the analogy of baking a cake or making something to eat. So in a recipe, the first thing you start with is: here are the ingredients for the recipe.
So if we load the tidymodels package, that loads most of our individual packages; it's like a meta-package. Then we can load the Ames housing data. The initial recipe here really just says: the ingredients are a column called sale price, which is our outcome, and then a couple of predictors I picked from this data set. The recipe doesn't really do anything at this point; it's just the beginning of a specification of the things we want to do.
And the way recipes works is kind of like dplyr or ggplot2, where you keep layering in the things you want to do to the data, in the order you want to do them. Sale price is very right-skewed, along with a couple of other variables, so you can add a step to your recipe that logs the data. You can use basic dplyr selectors, including things like ends_with(), and you can tell it what base to use.
The neighborhood variable that I highlighted in that map has some interesting characteristics: there are a few neighborhoods with only a handful of observations in the data set. When a variable is that unbalanced in its frequencies, if you resample the data or make dummy variables, you very often end up with dummy-variable columns that are all zeros. So we have a step called step_other. Based on a frequency cutoff, for example the one here, you can say: anything with less than 5% of the training set in that neighborhood gets dumped into another category called "other", and the factor levels for that variable are redefined to include "other" instead of the original levels. You can also make dummy variables; you see there's a recipe selector here that will select all the character or factor variables.
And then based on that, you can add a few more things. We know from this data set that there's a linear effect between the general living area of a house and its sale price, both on the log scale, but we've noticed over time that different building types have different slopes. That's an interaction effect, so you can specify interactions. We can also include splines: if you plot sale price versus longitude or latitude, you see really wavy relationships between those variables.
Instead of modeling these predictors in a linear way, you could use a natural spline, step_ns, to replace the original predictor, say latitude, with a basis expansion: basically a couple of nonlinear terms that represent that data. That allows the variable to be modeled in a nonlinear way. If you've ever used geom_smooth, chances are, depending on the size of your data, you've used a spline to figure out what that nonlinear relationship is.
What you would do with this recipe is apply these steps, in a separate function call, to your training set. It would do all the operations in sequence, and that would give you a ready-made data set you could then put into a model. Oh, and one other thing: I chose, just based on looking at the data, to use five degrees of freedom for the spline. Lower degrees of freedom mean it's less smooth, more linear; higher degrees of freedom allow it to be potentially more wiggly.
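As a rough sketch, the recipe described above might look something like this in code. The column names follow the version of the Ames data in the modeldata package, and the exact selectors and thresholds are illustrative, not necessarily the ones used in the talk:

```r
library(tidymodels)

data(ames, package = "modeldata")

# Ingredients first, then layered steps in the order they should run.
ames_rec <-
  recipe(Sale_Price ~ Gr_Liv_Area + Bldg_Type + Neighborhood +
           Longitude + Latitude, data = ames) %>%
  step_log(Sale_Price, Gr_Liv_Area, base = 10) %>%    # fix right skew
  step_other(Neighborhood, threshold = 0.05) %>%      # pool rare levels
  step_dummy(all_nominal()) %>%                       # indicator columns
  step_interact(~ Gr_Liv_Area:starts_with("Bldg_Type_")) %>%
  step_ns(Longitude, Latitude, deg_free = 5)          # natural splines

# prep() estimates anything the steps need from the training set;
# juice()/bake() return the processed data, ready for a model.
ames_train <- juice(prep(ames_rec))
```

The separate prep/bake call is the "apply these steps to your training set" part mentioned above.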
Specifying models with parsnip
All right, so you might think about using the parsnip package to do the modeling. If you've never seen it before, it's very similar to recipes in that you make a specification of what you want to do before you actually do it. Let's say we just want to fit some type of linear regression model: numeric outcome, and we know we're going to have slopes and intercepts; that's the structure of our model. So in parsnip, we have all these functions that delineate the type of structure you want, like k-nearest neighbors, or linear regression, or a neural network.
So first you state what type of model you want, which is not terribly descriptive, as you can see when we print it out. Then you state the engine. The engine is basically which implementation you want to use. You might use an implementation that does a Bayesian analysis to estimate your parameters, or a robust regression, and so on; you have a lot of options. I'm going to choose glmnet here, which is a regularized regression model, but you could have used lm, or Stan, or keras, or any of the other things we have implemented. So now you've been a little more specific: we want to fit a model with this structure, but use this engine to estimate the parameters.
I mentioned glmnet does regularized regression, so there are some other parameters we should probably specify for it. For example, the amount of regularization is this value called penalty, and this is a fairly high amount of regularization. And glmnet can fit two different types of regularization: the lasso, which is L1 regularization, or ridge regression, which is L2 regularization. The cool thing is that you can choose a mixture of the two. So if you're not sure whether you should be doing the lasso or ridge regression, you can try a mixture to see if that works better for you than purely one or the other.
So in this call, I picked a fairly high amount of regularization and about a 50-50 mixture of lasso and ridge regression, and you can see when I print it out, it's there. But how do I know? I have all these specific numbers, right? Five degrees of freedom, a penalty value of 0.1, and so on. You might say, oh, you must have done a lot of analysis for this, but honestly, I can't look at those numbers and say that's a great model.
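A minimal sketch of that parsnip specification, with the values from the talk (penalty 0.1, a 50-50 mixture); these numbers are the placeholder guesses being criticized here, not tuned values:

```r
library(tidymodels)

# Structure first (linear regression), then the engine (glmnet),
# with the regularization amount and lasso/ridge mixture set by hand.
lm_mod <-
  linear_reg(penalty = 0.1, mixture = 0.5) %>%
  set_engine("glmnet")

lm_mod  # printing shows the spec; nothing has been fit yet
```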
So the thing about this is: we need to tune these values. We want to optimize our model's performance, but we don't know what these values should be, and there's no analytical function where you plug in your x's and y's and it spits out a value for the penalty. So we have to do something to find reasonable values of these parameters when we can't directly estimate them, and that's called model tuning. I'm going to show you the tune package in just a second.
Tagging parameters with tune()
Basically, there are at least two ways of doing that model tuning, and the first thing we need is a way of tagging which parameters we want to optimize. So let's say I only want to optimize mixture and keep everything else the same. How would we tell the tune package which parameters to optimize? We're going to do that in a second.
So I've been at RStudio for three years, but I've been thinking about this particular problem for 20 years. And so when I started working on parsnip and recipes, I put a lot of thought into, okay, how is this going to feed into packages like tune that come later in the pipeline?
So what we want to do is say which of these parameters we want to optimize, in some way, to find good performance. We're going to substitute those actual values with a call to a function called tune(). This function doesn't actually do anything; it just returns an expression that says tune.
The point is, if you've ever used recipes or parsnip, when you create your recipe or your parsnip model object, it doesn't automatically go ahead and fit those things. We capture all these expressions and values, so we can put a function call in there, because we haven't actually evaluated any of the arguments. So if we want to say which parameters we're going to tune, instead of putting a five in here, we put tune() in, and we'll do it for this one and this one. We can keep other values constant if we like them, but when we feed this into the machinery of tune, it understands which ones to look at.
And we've interlaced everything between recipes and parsnip and another package called dials, where we use a common nomenclature for everything. As you'll see in a little bit, there are a lot of really good defaults. If you don't know what range of penalty you should use, or what you should be doing, we have a lot of fairly safe choices that automate the selection of the grid, the values, and the ranges. These things all work together: if you want to specify as much as possible, you can do that, but if you just want to let the machine figure it out, or you feel like you don't know what you should do, our reasonable defaults are there to do it for you.
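To give a flavor of what dials provides, here is a small sketch. The function names are from the dials package as I understand it; the printed ranges are its defaults, and the grid call is just one of several grid constructors:

```r
library(dials)

# Parameter objects carry names, ranges, and transformations.
penalty()   # regularization amount; default range is on the log10 scale
mixture()   # proportion of lasso penalty, in [0, 1]

# A space-filling design over both parameters:
grid_latin_hypercube(penalty(), mixture(), size = 10)
```

These defaults are what tune falls back on when you don't supply your own grid or ranges.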
One thing about this is that we're giving longitude and latitude the same amount of smoothness, or wiggliness, in the model, and we might want to separate those out. Maybe longitude wants a little bit of smoothing and latitude wants a lot. So we might create two different tune() calls and give them separate identifiers.
But the problem is that when tune parses this, it sees two parameters called degrees of freedom being tuned, and it doesn't have a good way to distinguish one from the other. So one thing you can do, whether you have this problem or not, is annotate the names of these parameters. If you want to recognize them later, or type in something your boss will recognize when you show them plots, you can add ID fields here. The one argument to tune() is an optional character string: that's how the parameter will be identified in plots and such, and it also lets you differentiate this degrees-of-freedom parameter from that one. Sometimes you need to do this, and other times you might do it just because you want a name that makes sense to you.
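A sketch of the tagging described above, assuming the Ames data from earlier; the id strings "long df" and "lat df" are illustrative labels:

```r
library(tidymodels)
data(ames, package = "modeldata")

# Model: tune both glmnet parameters instead of hard-coding them.
lm_mod <-
  linear_reg(penalty = tune(), mixture = tune()) %>%
  set_engine("glmnet")

# Recipe: two separate spline steps, with id strings so the two
# deg_free parameters can be told apart in results and plots.
spline_rec <-
  recipe(Sale_Price ~ Longitude + Latitude, data = ames) %>%
  step_ns(Longitude, deg_free = tune("long df")) %>%
  step_ns(Latitude,  deg_free = tune("lat df"))
```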
Grid search with tune_grid()
So now that we've defined what we want to do, there are a couple of different ways we could optimize these parameters. The most standard, old-school way of doing it, which I think works really well, is grid search. The ingredients for grid search are: some specification of what you want to do, so your model and your recipe, and then some object that dictates how you want to resample the data. What we don't want to do is take our data set, build a model with these parameters, and just re-predict the same data set. So whether it's a separate validation set, or bootstrapping, or tenfold cross-validation, a required ingredient, one we deliberately don't have a default for, is some resampling specification.
Optionally, you can define a pre-existing grid of candidate values that you are particularly interested in investigating, but if you don't, we'll make one for you. Another thing you can specify is which metric you want to optimize, say, median absolute error or root mean squared error. We have defaults for these too if you don't want to go there.
So we're just really going to specify those two things. For resampling, we're going to use the rsample package. There's enough data here that tenfold cross-validation will be fine; the defaults give you tenfold, and that's in this object.
The tuning functions in this package start with tune_, so tune_grid() does grid search. If you look at the arguments, here are our recipe and our model; again, these are the versions with the tune() placeholders in them. Here's the resampling object that we need to specify. And then, I wouldn't have needed to do this, but I said: give me ten tuning parameter combinations of those four parameters to try, based on what you know about these parameters and the ranges that seem applicable. Again, you can define those things yourself, but we've set most of it up for you.
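A sketch of that call, assuming a recipe and model spec with tune() placeholders like the earlier examples (here called spline_rec and lm_mod). The argument layout has shifted across tune releases, so this follows the workflow-based interface rather than whatever was shown on the slides:

```r
library(tidymodels)
data(ames, package = "modeldata")

set.seed(2453)
cv_splits <- vfold_cv(ames, v = 10)   # tenfold cross-validation

# Bundle the preprocessing and model, then resample ten candidate
# parameter combinations drawn from the dials defaults.
wflow <-
  workflow() %>%
  add_recipe(spline_rec) %>%
  add_model(lm_mod)

grid_results <- tune_grid(wflow, resamples = cv_splits, grid = 10)
```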
And when you run that, you get this kind of intimidating object looking back at you. The first two columns here were already in the tenfold object; that's what you get when you create an rsample resampling object. Depending on the options you choose, you get at least two more columns: one called .metrics and one called .notes. You can see they're list columns with a bunch of tibbles embedded in them. This has all the RMSE and R-squared values across the ten models we wanted to evaluate, and that's for fold one; here's what the results would be for fold three, and so on.
And so if we want to get at that data to plot and investigate it, we'd have to do some dplyr and tidyr unnesting to make that happen. Rather than putting you through that and having to write it every time, we've written a bunch of high-level accessor functions, so you can just say what you want and not have to worry about looking up how tidyr works. One of them is collect_metrics(): take all these nested tibbles and put them back into a format I'd like to work with.
Let's look at the first four rows of that. It gives you the tuning parameter values that you looked at; for example, the first candidate set used these four values. Its RMSE, in log10 units, is 0.0796, and the R-squared is almost 80%. Then we have the second combination of tuning parameters, and so on.
You might want to take these and sort them or plot them. Sometimes people are in a hurry, or they're trying out a bunch of different things, and they just say: give me the numerically best one, say, the one with the smallest root mean squared error. So we have functions like show_best(), which says, for this object, show me the candidates with the best RMSE. There's another one called select_best(), which just returns the parameter values. You can also select using things like the one-standard-error rule or other methods.
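The accessors mentioned above, sketched against a tune_grid result (here assumed to be named grid_results):

```r
# Unnest all per-fold metrics into one summarized tibble.
collect_metrics(grid_results)

# Top candidates ranked by RMSE.
show_best(grid_results, metric = "rmse", n = 5)

# Just the winning parameter values, ready to splice back in.
select_best(grid_results, metric = "rmse")

# Or apply the one-standard-error rule along a parameter:
select_by_one_std_err(grid_results, penalty, metric = "rmse")
```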
There's an autoplot() method that I'm fairly dissatisfied with, so this can only get better. Right now it shows the marginal relationships. If I plot the latitude degrees of freedom against, say, R-squared, you can see it really increases and then tanks out here. The reason it tanks is not that the model didn't want that much smoothing; it's that the penalty value associated with that particular combination was really high, and that model didn't work so well. What we need is a better visualization tool that can represent what's going on in a way people can actually look at.
If you've ever seen the shinystan package, I've always wanted to create something like that. So if anybody wants to do that on the plane ride home or something, I'm not going to reject it. The idea would be that you take your grid object, run something like a shiny tune function on it, and it opens a Shiny app that lets you do some interactive investigation of these things. I'm hoping that's where we go. It's not very hard to do; it's just one more thing on the list of things we want to do here.
Finalizing and refitting the model
So let's say you want to use the numerically best results: just give me the one with the lowest root mean squared error. How would I weave that back into the original model objects and retrain? You would call select_best(), and that returns this table of results. We'd like a very small amount of regularization, about a 50-50 mixture of lasso and ridge regression, and a lot of nonlinearity in the latitude and longitude.
And so if you're fine with that, you can splice those values back into your recipe and your model object. We have these finalize functions: if you give them the recipe that has the placeholders and the best parameters, they splice the values back in as if you had specified them to begin with. If you do that with the model and print it out, remember before it said 0.1 and 0.05; now here are the actual values, to whatever precision we have, that it thinks are best. Then you can just do the regular tidymodels stuff, refit the model, and get to the place you want to get to.
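A sketch of the finalize-and-refit step, assuming the tuning result, recipe, and model spec from the earlier sketches (grid_results, spline_rec, lm_mod):

```r
best_parms <- select_best(grid_results, metric = "rmse")

# Splice the chosen values back in, as if specified from the start.
final_rec <- finalize_recipe(spline_rec, best_parms)
final_mod <- finalize_model(lm_mod, best_parms)

# Then the regular tidymodels fit on the training data:
final_fit <-
  workflow() %>%
  add_recipe(final_rec) %>%
  add_model(final_mod) %>%
  fit(data = ames)
```

If you kept everything in a workflow, finalize_workflow() does both splices in one call.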
So just some notes about that. There are these things called space-filling designs in statistics, and they work really, really well, so it's easy for us to come up with good, pretty effective default grids for you using that type of design. There might be errors and warnings. In caret and a lot of other modeling functions, across resampling you might fit 200 models, and maybe one of those models gives five warnings. Then when caret finishes running, it says: you had 2,000 warnings, type warnings() to see the first 50, and you're like, huh? There's no way to trace that back to, oh, that was model 7 on resample number 3.
So what we do is catch everything as we go. If a model or recipe fails, it doesn't stop anything else from computing. We capture all the warnings and errors as they happen, and we can associate them back to specific things, so you can go back and trace whether, oops, I just made a mistake, or there's some weird thing in the data. That .notes column is a table that has all that information in it.
There's a verbose mode, which I don't have the space to show, but it uses cli, and I think it's kind of cool. There are a bunch of options: if you want to save the out-of-sample predictions, or save the fitted model or recipe for each fold, you can do that. There are options for all sorts of grid settings and performance metrics. There's a lot I've just glossed over.
Bryan Lewis gave a great talk yesterday about foreach and how to use it for parallel processing, and we use that here again. So whether you're doing MPI, or multicore on Linux, or doParallel if you're on Windows, we have many, many options for parallel processing, which gives you a lot of speedup in what you're doing.
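Since tune picks up whatever foreach backend is registered, a typical setup looks like this sketch (doParallel works on Windows, macOS, and Linux; the worker count here is just an example):

```r
library(doParallel)

# Register a PSOCK cluster; subsequent tune_grid()/tune_bayes() calls
# will distribute the resample x candidate fits across the workers.
cl <- makePSOCKcluster(parallel::detectCores(logical = FALSE) - 1)
registerDoParallel(cl)

# ... run tune_grid() or tune_bayes() here ...

stopCluster(cl)   # release the workers when done
```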
Bayesian optimization
So that was grid search. One other thing we implemented, and I'm not sure it's the best choice for the kinds of non-deep-learning models we fit, is iterative search, especially Bayesian optimization. It's a big thing, and it's not really that hard to implement. You start off with a small number of parameter values, and I can feed in the ones I just generated. Basically, Bayesian optimization takes those initial 20 points and predicts a candidate set of tuning parameter values it thinks would be best to try next; you can see, after the first 10, it predicts you should try these. It resamples them; the performance is slightly worse than the original, and it just keeps going until either you run out of iterations or it doesn't see improvement for a while. This works pretty well.
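A sketch of that, seeding the search with the earlier grid results (names wflow, cv_splits, and grid_results are assumed from the previous sketches; the iteration count is illustrative):

```r
set.seed(12)
bayes_results <- tune_bayes(
  wflow,                  # workflow with tune() placeholders
  resamples = cv_splits,
  initial   = grid_results,  # start from the grid-search candidates
  iter      = 15             # maximum number of search iterations
)

autoplot(bayes_results)   # performance over the search iterations
show_best(bayes_results, metric = "rmse")
```

Passing `initial` a previous tuning result means the Gaussian-process model behind the search starts with real performance data instead of a cold start.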
Its autoplot() method doesn't stink. Here's the initial set; one of them did really poorly, and one of them failed, so you see a little gap here. You can see, over iterations, it's staying around the current optimum that we had. Maybe if it went further I might get a big boost, but those are the results.
It's almost on CRAN. It's been rejected, like, four times for little things, so I'm hoping the fifth time's the charm. If you want to learn more, there's a pkgdown site, there are the training notes from this week, and we're working on a tidymodels book, so the beginning of that will probably be in our repo in the next few weeks. Keep an eye out for it. Thanks.
Q&A
Does tidymodels mark the end of caret, or are there any plans for, e.g., caret 2? I'm going to get this tattooed on my arm: no, caret's not going anywhere. I swear it's not. When people ask me this question, I'm like, it's not, really. I mean, it's not feature-complete, but I'm keeping up with it and making sure everything still works. I'm putting most of our development effort into tidymodels. caret's 15 years old, and it's hard to make a lot of big feature changes to something that's 15 years old. So it's not going away. I wouldn't expect huge changes or new features in it, but it's doing pretty well as it is. So no worries there.
What about proper support for time series forecasting in tidymodels? That's a great question. We tend to prioritize what we implement by things we either see a huge increase in interest in or know a lot about, and I'm not really a time series guy. We want to do this, but it's a bit lower priority, because the stuff Rob Hyndman is doing is so good that you're not left with nothing right now if you want to use tidyverse-type tools with time series. I think we'd like to use some of what he's doing there, but you already have something. So for the time being, unless somebody wants to implement it and put in a PR, we're going to work on things that just don't exist yet before we get to that. Thanks so much.

