
Max Kuhn | Total Tidy Tuning Techniques | RStudio (2020)
Many models have structural parameters that cannot be directly estimated from the data. These tuning parameters can have a significant effect on model performance and require some mechanism for finding reasonable values. The tune and workflow packages enable tidymodels users to optimize these parameters using a variety of efficient grid search methods as well as iterative search techniques (such as Bayesian optimization).
Transcript
This transcript was generated automatically and may contain errors.
Hi, thanks for coming. So, probably to nobody's surprise, I'm going to talk about model tuning again. My job within the tidyverse group is to work on all types of modeling and how we can do that in a more tidy way.
The place to start for us would normally be with more predictive models, but the tidymodels infrastructure is not really constrained to just that. We want to work on inferential models and descriptive models and all sorts of other things. Some of these models need what we call parameter tuning, and that's what I'm going to talk about today, in this new package we have called tune.
In the tidyverse, we don't tend to write one huge package that does all the things; we tend to write packages that are interconnected with each other. So before I can show you what tune does, I have to show you a few other packages inside tidymodels, how they work, and how they feed into tune.
Introduction to the Ames housing data and recipes
So if you've never seen tidymodels before, we're going to start with what I like to call the new iris data, which is the Ames housing data. We have sale prices on houses in Ames, Iowa; that's what's shown here. We have data in different neighborhoods, and then we have all these features of houses, like their general living area, what type of building it is, and that kind of thing. So the goal here would be to predict the sale price of a house based on its other characteristics.
So let's say we were going to preprocess our data using recipes. Recipes, if you've never seen it, is a package that is really good at doing feature engineering and preprocessing your data before you put it in the model. It follows the analogy of baking a cake or making something to eat. So in a recipe, the first thing you start with is: here are the ingredients for the recipe.
So if we load the tidymodels package, that loads most of our individual packages; it's like a meta-package. Then we can load the Ames housing data. The initial recipe here really just says: the ingredients are a column called sale price, which is our outcome, and then a couple of predictors I picked from this data set. The recipe doesn't really do anything at this point; it's just the beginning of a specification of the things we want to do.
And the way recipes works is kind of like dplyr or ggplot2, where you keep layering in the things you want to do to the data, in the order you want to do them. Sale price is very right-skewed, along with a couple of other variables, so you can add a step to your recipe that logs the data. You can use basic dplyr selectors, including things like ends_with(), and you can tell it what base to use.
The neighborhood variable that I highlighted in that map has some interesting characteristics: there are a few neighborhoods with only a handful of observations in the data set. When a variable is that unbalanced in its frequencies, if you resample the data or make dummy variables, you very often end up with dummy-variable columns that are all zeros. So we have a step called step_other. Based on a frequency cutoff, for example the one here, you can say: anything with less than 5% of the training set in that neighborhood gets dumped into another category called "other", and the factor levels for that variable are redefined to include "other" instead of the original levels. You can also make dummy variables; you see there's a recipe selector here that will select all the character or factor variables.
And then based on that, you can add a few more things. We know from this data set that there's a linear effect between the general living area of a house and its sale price, both on the log scale, but we've noticed over time that different building types have different slopes. That's an interaction effect, so you can specify interactions. We can also include splines: if you plot sale price versus longitude or latitude, you see really wavy relationships between those variables.
Instead of modeling these predictors in a linear way, you could use a natural spline, step_ns, to replace the original predictor, say latitude, with a basis expansion: basically a couple of nonlinear terms that represent that data. That allows the variable to be modeled in a nonlinear way. If you've ever used geom_smooth, chances are, depending on the size of your data, you've used a spline to figure out what that nonlinear relationship is.
What you would do with this recipe is apply these steps, in a separate function call, to your training set. It would do all the operations in sequence, and that would give you a ready-made data set you could then put into a model. Oh, and one other thing: I chose, just based on looking at the data, to use five degrees of freedom for the spline. Lower degrees of freedom mean it's less smooth, more linear; higher degrees of freedom allow it to be potentially more wiggly.
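As a rough sketch, the recipe described above might look something like this in code. The column names follow the version of the Ames data in the modeldata package, and the exact selectors and thresholds are illustrative, not necessarily the ones used in the talk:

```r
library(tidymodels)

data(ames, package = "modeldata")

# Ingredients first, then layered steps in the order they should run.
ames_rec <-
  recipe(Sale_Price ~ Gr_Liv_Area + Bldg_Type + Neighborhood +
           Longitude + Latitude, data = ames) %>%
  step_log(Sale_Price, Gr_Liv_Area, base = 10) %>%    # fix right skew
  step_other(Neighborhood, threshold = 0.05) %>%      # pool rare levels
  step_dummy(all_nominal()) %>%                       # indicator columns
  step_interact(~ Gr_Liv_Area:starts_with("Bldg_Type_")) %>%
  step_ns(Longitude, Latitude, deg_free = 5)          # natural splines

# prep() estimates anything the steps need from the training set;
# juice()/bake() return the processed data, ready for a model.
ames_train <- juice(prep(ames_rec))
```

The separate prep/bake call is the "apply these steps to your training set" part mentioned above.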
Specifying models with parsnip
All right, so you might think about using the parsnip package to do the modeling. If you've never seen it before, it's very similar to recipes in that you make a specification of what you want to do before you actually do it. Let's say we just want to fit some type of linear regression model: numeric outcome, and we know we're going to have slopes and intercepts; that's the structure of our model. So in parsnip, we have all these functions that delineate the type of structure you want, like k-nearest neighbors, or linear regression, or a neural network.
So first you state what type of model you want, which is not terribly descriptive, as you can see when we print it out. Then you state the engine. The engine is basically which implementation you want to use. You might use an implementation that does a Bayesian analysis to estimate your parameters, or a robust regression, and so on; you have a lot of options. I'm going to choose glmnet here, which is a regularized regression model, but you could have used lm, or Stan, or keras, or any of the other things we have implemented. So now you've been a little more specific: we want to fit a model with this structure, but use this engine to estimate the parameters.
I mentioned glmnet does regularized regression, so there are some other parameters we should probably specify for it. For example, the amount of regularization is this value called penalty, and this is a fairly high amount of regularization. And glmnet can fit two different types of regularization: the lasso, which is L1 regularization, or ridge regression, which is L2 regularization. The cool thing is that you can choose a mixture of the two. So if you're not sure whether you should be doing the lasso or ridge regression, you can try a mixture to see if that works better for you than purely one or the other.
So in this call, I picked a fairly high amount of regularization and about a 50-50 mixture of lasso and ridge regression, and you can see when I print it out, it's there. But how do I know? I have all these specific numbers, right? Five degrees of freedom, a penalty value of 0.1, and so on. You might say, oh, you must have done a lot of analysis for this, but honestly, I can't look at those numbers and say that's a great model.
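A minimal sketch of that parsnip specification, with the values from the talk (penalty 0.1, a 50-50 mixture); these numbers are the placeholder guesses being criticized here, not tuned values:

```r
library(tidymodels)

# Structure first (linear regression), then the engine (glmnet),
# with the regularization amount and lasso/ridge mixture set by hand.
lm_mod <-
  linear_reg(penalty = 0.1, mixture = 0.5) %>%
  set_engine("glmnet")

lm_mod  # printing shows the spec; nothing has been fit yet
```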
So the thing about this is: we need to tune these values. We want to optimize our model's performance, but we don't know what these values should be, and there's no analytical function where you plug in your x's and y's and it spits out a value for the penalty. So we have to do something to find reasonable values of these parameters when we can't directly estimate them, and that's called model tuning. I'm going to show you the tune package in just a second.
Tagging parameters with tune()
Basically, there are at least two ways of doing that model tuning, and the first thing we need is a way of tagging which parameters we want to optimize. So let's say I only want to optimize mixture and keep everything else the same. How would we tell the tune package which parameters to optimize? We're going to do that in a second.
So I've been at RStudio for three years, but I've been thinking about this particular problem for 20 years. And so when I started working on parsnip and recipes, I put a lot of thought into, okay, how is this going to feed into packages like tune that come later in the pipeline?
So what we want to do is say which of these parameters we want to optimize, in some way, to find good performance. We're going to substitute those actual values with a call to a function called tune(). This function doesn't actually do anything; it just returns an expression that says tune.
The point is, if you've ever used recipes or parsnip, when you create your recipe or your parsnip model object, it doesn't automatically go ahead and fit those things. We capture all these expressions and values, so we can put a function call in there, because we haven't actually evaluated any of the arguments. So if we want to say which parameters we're going to tune, instead of putting a five in here, we put tune() in, and we'll do it for this one and this one. We can keep other values constant if we like them, but when we feed this into the machinery of tune, it understands which ones to look at.
And we've interlaced everything between recipes and parsnip and another package called dials, where we use a common nomenclature for everything. As you'll see in a little bit, there are a lot of really good defaults. If you don't know what range of penalty you should use, or what you should be doing, we have a lot of fairly safe choices that automate the selection of the grid, the values, and the ranges. These things all work together: if you want to specify as much as possible, you can do that, but if you just want to let the machine figure it out, or you feel like you don't know what you should do, our reasonable defaults are there to do it for you.
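To give a flavor of what dials provides, here is a small sketch. The function names are from the dials package as I understand it; the printed ranges are its defaults, and the grid call is just one of several grid constructors:

```r
library(dials)

# Parameter objects carry names, ranges, and transformations.
penalty()   # regularization amount; default range is on the log10 scale
mixture()   # proportion of lasso penalty, in [0, 1]

# A space-filling design over both parameters:
grid_latin_hypercube(penalty(), mixture(), size = 10)
```

These defaults are what tune falls back on when you don't supply your own grid or ranges.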
One thing about this is that we're giving longitude and latitude the same amount of smoothness, or wiggliness, in the model, and we might want to separate those out. Maybe longitude wants a little bit of smoothing and latitude wants a lot. So we might create two different tune() calls and give them separate identifiers.
But the problem is that when tune parses this, it sees two parameters called degrees of freedom being tuned, and it doesn't have a good way to distinguish one from the other. So one thing you can do, whether you have this problem or not, is annotate the names of these parameters. If you want to recognize them later, or type in something your boss will recognize when you show them plots, you can add ID fields here. The one argument to tune() is an optional character string: that's how the parameter will be identified in plots and such, and it also lets you differentiate this degrees-of-freedom parameter from that one. Sometimes you need to do this, and other times you might do it just because you want a name that makes sense to you.
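A sketch of the tagging described above, assuming the Ames data from earlier; the id strings "long df" and "lat df" are illustrative labels:

```r
library(tidymodels)
data(ames, package = "modeldata")

# Model: tune both glmnet parameters instead of hard-coding them.
lm_mod <-
  linear_reg(penalty = tune(), mixture = tune()) %>%
  set_engine("glmnet")

# Recipe: two separate spline steps, with id strings so the two
# deg_free parameters can be told apart in results and plots.
spline_rec <-
  recipe(Sale_Price ~ Longitude + Latitude, data = ames) %>%
  step_ns(Longitude, deg_free = tune("long df")) %>%
  step_ns(Latitude,  deg_free = tune("lat df"))
```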
Grid search with tune_grid()
So now that we've defined what we want to do, there are a couple of different ways we could optimize these parameters. The most standard, old-school way of doing it, which I think works really well, is grid search. The ingredients for grid search are: some specification of what you want to do, so your model and your recipe, and then some object that dictates how you want to resample the data. What we don't want to do is take our data set, build a model with these parameters, and just re-predict the same data set. So whether it's a separate validation set, or bootstrapping, or tenfold cross-validation, a required ingredient, one we deliberately don't have a default for, is some resampling specification.
Optionally, you can define a pre-existing grid of candidate values that you are particularly interested in investigating, but if you don't, we'll make one for you. Another thing you can specify is which metric you want to optimize, say, median absolute error or root mean squared error. We have defaults for these too if you don't want to go there.
So we're just really going to specify those two things. For resampling, we're going to use the rsample package. There's enough data here that tenfold cross-validation will be fine; the defaults give you tenfold, and that's in this object.
The tuning functions in this package start with tune_, so tune_grid() does grid search. If you look at the arguments, here are our recipe and our model; again, these are the versions with the tune() placeholders in them. Here's the resampling object that we need to specify. And then, I wouldn't have needed to do this, but I said: give me ten tuning parameter combinations of those four parameters to try, based on what you know about these parameters and the ranges that seem applicable. Again, you can define those things yourself, but we've set most of it up for you.
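A sketch of that call, assuming a recipe and model spec with tune() placeholders like the earlier examples (here called spline_rec and lm_mod). The argument layout has shifted across tune releases, so this follows the workflow-based interface rather than whatever was shown on the slides:

```r
library(tidymodels)
data(ames, package = "modeldata")

set.seed(2453)
cv_splits <- vfold_cv(ames, v = 10)   # tenfold cross-validation

# Bundle the preprocessing and model, then resample ten candidate
# parameter combinations drawn from the dials defaults.
wflow <-
  workflow() %>%
  add_recipe(spline_rec) %>%
  add_model(lm_mod)

grid_results <- tune_grid(wflow, resamples = cv_splits, grid = 10)
```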
And when you run that, you get this kind of intimidating object looking back at you. The first two columns here were already in the tenfold object; that's what you get when you create an rsample resampling object. Depending on the options you choose, you get at least two more columns: one called .metrics and one called .notes. You can see they're list columns with a bunch of tibbles embedded in them. This has all the RMSE and R-squared values across the ten models we wanted to evaluate, and that's for fold one; here's what the results would be for fold three, and so on.
And so if we want to get at that data to plot and investigate it, we'd have to do some dplyr and tidyr unnesting to make that happen. Rather than putting you through that and having to write it every time, we've written a bunch of high-level accessor functions, so you can just say what you want and not have to worry about looking up how tidyr works. One of them is collect_metrics(): take all these nested tibbles and put them back into a format I'd like to work with.
Let's look at the first four rows of that. It gives you the tuning parameter values that you looked at; for example, the first candidate set used these four values. Its RMSE, in log10 units, is 0.0796, and the R-squared is almost 80%. Then we have the second combination of tuning parameters, and so on.
You might want to take these and sort them or plot them. Sometimes people are in a hurry, or they're trying out a bunch of different things, and they just say: give me the numerically best one, say, the one with the smallest root mean squared error. So we have functions like show_best(), which says, for this object, show me the candidates with the best RMSE. There's another one called select_best(), which just returns the parameter values. You can also select using things like the one-standard-error rule or other methods.
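The accessors mentioned above, sketched against a tune_grid result (here assumed to be named grid_results):

```r
# Unnest all per-fold metrics into one summarized tibble.
collect_metrics(grid_results)

# Top candidates ranked by RMSE.
show_best(grid_results, metric = "rmse", n = 5)

# Just the winning parameter values, ready to splice back in.
select_best(grid_results, metric = "rmse")

# Or apply the one-standard-error rule along a parameter:
select_by_one_std_err(grid_results, penalty, metric = "rmse")
```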
There's an autoplot() method that I'm fairly dissatisfied with, so this can only get better. Right now it shows the marginal relationships. If I plot the latitude degrees of freedom against, say, R-squared, you can see it really increases and then tanks out here. The reason it tanks is not that the model didn't want that much smoothing; it's that the penalty value associated with that particular combination was really high, and that model didn't work so well. What we need is a better visualization tool that can represent what's going on in a way people can actually look at.
If you've ever seen the shinystan package, I've always wanted to create something like that. So if anybody wants to do that on the plane ride home or something, I'm not going to reject it. The idea would be that you take your grid object, run something like a shiny tune function on it, and it opens a Shiny app that lets you do some interactive investigation of these things. I'm hoping that's where we go. It's not very hard to do; it's just one more thing on the list of things we want to do here.
Finalizing and refitting the model
So let's say you want to use the numerically best results: just give me the one with the lowest root mean squared error. How would I weave that back into the original model objects and retrain? You would call select_best(), and that returns this table of results. We'd like a very small amount of regularization, about a 50-50 mixture of lasso and ridge regression, and a lot of nonlinearity in the latitude and longitude.
And so if you're fine with that, you can splice those values back into your recipe and your model object. We have these finalize functions: if you give them the recipe that has the placeholders and the best parameters, they splice the values back in as if you had specified them to begin with. If you do that with the model and print it out, remember before it said 0.1 and 0.05; now here are the actual values, to whatever precision we have, that it thinks are best. Then you can just do the regular tidymodels stuff, refit the model, and get to the place you want to get to.
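A sketch of the finalize-and-refit step, assuming the tuning result, recipe, and model spec from the earlier sketches (grid_results, spline_rec, lm_mod):

```r
best_parms <- select_best(grid_results, metric = "rmse")

# Splice the chosen values back in, as if specified from the start.
final_rec <- finalize_recipe(spline_rec, best_parms)
final_mod <- finalize_model(lm_mod, best_parms)

# Then the regular tidymodels fit on the training data:
final_fit <-
  workflow() %>%
  add_recipe(final_rec) %>%
  add_model(final_mod) %>%
  fit(data = ames)
```

If you kept everything in a workflow, finalize_workflow() does both splices in one call.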
So just some notes about that. There are these things called space-filling designs in statistics, and they work really, really well, so it's easy for us to come up with good, pretty effective default grids for you using that type of design. There might be errors and warnings. In caret and a lot of other modeling functions, across resampling you might fit 200 models, and maybe one of those models gives five warnings. Then when caret finishes running, it says: you had 2,000 warnings, type warnings() to see the first 50, and you're like, huh? There's no way to trace that back to, oh, that was model 7 on resample number 3.
So what we do is catch everything as we go. If a model or recipe fails, it doesn't stop anything else from computing. We capture all the warnings and errors as they happen, and we can associate them back to specific things, so you can go back and trace whether, oops, I just made a mistake, or there's some weird thing in the data. That .notes column is a table that has all that information in it.
There's a verbose mode, which I don't have the space to show, but it uses cli, and I think it's kind of cool. There are a bunch of options: if you want to save the out-of-sample predictions, or save the fitted model or recipe for each fold, you can do that. There are options for all sorts of grid settings and performance metrics. There's a lot I've just glossed over.
Bryan Lewis gave a great talk yesterday about foreach and how to use it for parallel processing, and we use that here again. So whether you're doing MPI, or multicore on Linux, or doParallel if you're on Windows, we have many, many options for parallel processing, which gives you a lot of speedup in what you're doing.
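Since tune picks up whatever foreach backend is registered, a typical setup looks like this sketch (doParallel works on Windows, macOS, and Linux; the worker count here is just an example):

```r
library(doParallel)

# Register a PSOCK cluster; subsequent tune_grid()/tune_bayes() calls
# will distribute the resample x candidate fits across the workers.
cl <- makePSOCKcluster(parallel::detectCores(logical = FALSE) - 1)
registerDoParallel(cl)

# ... run tune_grid() or tune_bayes() here ...

stopCluster(cl)   # release the workers when done
```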
Bayesian optimization
So that was grid search. One other thing we implemented, and I'm not sure it's the best choice for the kinds of non-deep-learning models we fit, is iterative search, especially Bayesian optimization. It's a big thing, and it's not really that hard to implement. You start off with a small number of parameter values, and I can feed in the ones I just generated. Basically, Bayesian optimization takes those initial 20 points and predicts a candidate set of tuning parameter values it thinks would be best to try next; you can see, after the first 10, it predicts you should try these. It resamples them; the performance is slightly worse than the original, and it just keeps going until either you run out of iterations or it doesn't see improvement for a while. This works pretty well.
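A sketch of that, seeding the search with the earlier grid results (names wflow, cv_splits, and grid_results are assumed from the previous sketches; the iteration count is illustrative):

```r
set.seed(12)
bayes_results <- tune_bayes(
  wflow,                  # workflow with tune() placeholders
  resamples = cv_splits,
  initial   = grid_results,  # start from the grid-search candidates
  iter      = 15             # maximum number of search iterations
)

autoplot(bayes_results)   # performance over the search iterations
show_best(bayes_results, metric = "rmse")
```

Passing `initial` a previous tuning result means the Gaussian-process model behind the search starts with real performance data instead of a cold start.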
Its autoplot() method doesn't stink. Here's the initial set; one of them did really poorly, and one of them failed, so you see a little gap here. You can see, over iterations, it's staying around the current optimum that we had. Maybe if it went further I might get a big boost, but those are the results.
It's almost on CRAN. It's been rejected, like, four times for little things, so I'm hoping the fifth time's the charm. If you want to learn more, there's a pkgdown site, there are the training notes from this week, and we're working on a tidymodels book, so the beginning of that will probably be in our repo in the next few weeks. Keep an eye out for it. Thanks.
Q&A
Does tidymodels mark the end of caret, or are there any plans for, e.g., caret 2? I'm going to get this tattooed on my arm: no, caret's not going anywhere. I swear it's not. When people ask me this question, I'm like, it's not, really. I mean, it's not feature-complete, but I'm keeping up with it and making sure everything still works. I'm putting most of our development effort into tidymodels. caret's 15 years old, and it's hard to make a lot of big feature changes to something that's 15 years old. So it's not going away. I wouldn't expect huge changes or new features in it, but it's doing pretty well as it is. So no worries there.
What about proper support for time series forecasting in tidymodels? That's a great question. We tend to prioritize what we implement by things we either see a huge increase in interest in or know a lot about, and I'm not really a time series guy. We want to do this, but it's a bit lower priority, because the stuff Rob Hyndman is doing is so good that you're not left with nothing right now if you want to use tidyverse-type tools with time series. I think we'd like to use some of what he's doing there, but you already have something. So for the time being, unless somebody wants to implement it and put in a PR, we're going to work on things that just don't exist yet before we get to that. Thanks so much.

