Resources

Max Kuhn | What's new in tidymodels? | RStudio

tidymodels is a collection of packages for modeling using a tidy interface. In the last year there have been numerous improvements and extensions. This talk gives an overview of additional tuning methods, new extension packages for models and recipes, and other features. About Max: Max Kuhn is a software engineer at RStudio. He is currently working on improving R's modeling capabilities. He was previously a Director of Nonclinical Statistics at Pfizer Global R&D in Connecticut, where he applied models in the pharmaceutical and diagnostic industries for over 18 years. Max has a Ph.D. in Biostatistics. He is the author of numerous R packages for techniques in machine learning and reproducible research and is an Associate Editor for the Journal of Statistical Software. He and Kjell Johnson wrote the book Applied Predictive Modeling, which won the Ziegel award from the American Statistical Association, recognizing the best book reviewed in Technometrics in 2015. Their latest book, Feature Engineering and Selection, was published in 2019.


Transcript

This transcript was generated automatically and may contain errors.

Hi, I'm Max Kuhn. I'm a statistician and work as a software engineer at RStudio. I mostly focus on open source tools for building models. I run the tidymodels group, and tidymodels is a collection of R packages that have very similar syntax and design principles to the tidyverse. We've made a lot of improvements to tidymodels in the last six months. Some of those improvements have been around user interfaces, so we have better ways of specifying which predictors are added to a model. We've made other improvements around computational efficiency: we now have limited support for sparse matrices and expanded support for parallel processing.

Now, what I'd like to talk about more in this video is a specific package that we just released called finetune. finetune is an extension of our tune package and gives two additional ways to optimize hyperparameters, or tuning parameters, in models. Those are the types of parameters that can't be directly estimated from the data, such as the number of neighbors in a K-nearest neighbors model. One variation we'll look at in this presentation is a new method of doing grid search, and the other is an iterative method called simulated annealing.

The cell segmentation example

To illustrate the methods that we'll talk about today, we'll use the cell segmentation data from the modeldata package. It's a two-class classification problem with about 58 numeric predictors. We use a support vector machine, and for the support vector machine there are several tuning parameters, such as the cost value and the radial basis function kernel parameter called sigma. This talk will be about two different ways of finding optimal values of these parameters. These tuning parameter values are important: they have a big influence on the model fit and performance. The problem is we can't directly estimate them from the data, so we need good ways of coming up with values for those two parameters. To measure performance, we'll use 10-fold cross-validation, and we'll measure the area under the ROC curve. The model definitions and the complete code that you can use to reproduce these analyses are shown at the link here.
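The setup described here can be sketched in R along these lines. The data and column names come from the modeldata package; the object names (`svm_wflow`, `cell_folds`) are my own and just illustrative.

```r
# Sketch of the setup: cells data from modeldata, with `class` as the
# two-level outcome and `case` as a train/test indicator we don't need.
library(tidymodels)

data(cells, package = "modeldata")
cells <- subset(cells, select = -case)

# Radial basis function SVM with cost and sigma (rbf_sigma) marked for tuning.
svm_spec <- svm_rbf(cost = tune(), rbf_sigma = tune()) %>%
  set_engine("kernlab") %>%
  set_mode("classification")

svm_wflow <- workflow() %>%
  add_model(svm_spec) %>%
  add_formula(class ~ .)

# 10-fold cross-validation, measuring the area under the ROC curve.
set.seed(33)
cell_folds <- vfold_cv(cells, v = 10)
roc_res <- metric_set(roc_auc)
```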

To give a visual demonstration of what we're trying to do, I took these two tuning parameters, generated a very large grid, and resampled them. That gives a sense of what the truly best result would be from our search. You can see this contour plot here on the left-hand side. Blue values indicate poor performance for the area under the ROC curve, and white values are more optimal. The lower diagonal of the search space is pretty flat and not very good. As you go to the upper diagonal, there's a little mountainous ridge that has good performance values. The best performance value on the left-hand plot is shown by the black dot, and really any point along the ridge running from that black dot down towards the southeast will give us pretty good performance. The problem is that if we go over that ridge, the dark spot there, which you can see a little better on the right-hand surface plot, gives really, really poor performance. So we'll be trying to find the best values of these tuning parameters, which are indicated in white on the plots.

Racing methods

Now, the first method we'll talk about is something called racing. It's an alteration of grid search that's been around for quite a while. In ordinary grid search, you come up with a candidate set of parameter values that you want to try, measure their performance, usually using resampling, and then pick the one that seems to be best or has the characteristics that you like. The problem with this is that you have to evaluate a potentially large number of models to get performance estimates before you can make a decision. Let's say we try 20 values of the support vector machine tuning parameters and use 10-fold cross-validation. That would be 200 total model fits, and we can't make a decision on which tuning parameters are good or bad until we do all that work.

What racing methods do is evaluate the resamples incrementally. For the first couple of resamples, you proceed as normal and measure performance. After each subsequent resample, you do an interim analysis to figure out whether there are tuning parameters that have a low probability of being the best result. One way to do that is to fit an ANOVA model to those results, where the outcome is, let's say, the area under the ROC curve, or whatever performance metric you want to use, and the predictor is the qualitative tuning parameter combination (model 1, model 12, and so on). We can then see which combinations are statistically different from the current best, and if they are, we can discard them. There's a variety of ways to do racing methods. We have a second one implemented in finetune that uses win/loss statistics and something called the Bradley-Terry model, treating these data as if they were a sports competition. If you want to learn more about these methods, and particularly how we implemented them here, there's a paper from 2014 that has the details. For our example, we'll use a grid of 20 support vector machine candidate values generated with a space-filling design, and we'll show how racing can evaluate them efficiently compared to ordinary grid search.
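The ANOVA idea above can be illustrated with plain base R. This is a deliberately simplified sketch, not finetune's actual implementation: three made-up candidates, three resamples, and a linear model whose coefficients estimate each candidate's loss relative to the first (best) candidate. Confidence intervals that sit entirely below zero flag candidates that are statistically worse and can be dropped.

```r
# A simplified illustration of the interim ANOVA analysis (not finetune's
# exact code). `results` holds one ROC AUC per resample for each of three
# hypothetical candidates; the numbers are invented for the example.
results <- data.frame(
  model    = factor(rep(c("model_01", "model_02", "model_03"), each = 3)),
  resample = rep(c("fold_1", "fold_2", "fold_3"), times = 3),
  roc_auc  = c(0.86, 0.88, 0.85,   # model_01: the current best
               0.87, 0.85, 0.86,   # model_02: close behind
               0.70, 0.72, 0.69)   # model_03: clearly worse
)

# Model the metric as a function of the candidate, with a resample effect.
fit <- lm(roc_auc ~ model + resample, data = results)

# Intervals on the loss relative to model_01: model_02's interval contains
# zero (keep racing it), while model_03's is entirely below zero (drop it).
confint(fit, level = 0.95)
```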

So let's illustrate the race. We take our 10 folds and arrange them in a random order. We take the first fold and do what we'd normally do: compute our models, make predictions, and measure, let's say, the area under the ROC curve. In the plot, the y-axis is just the model number, which is arbitrarily assigned; there are 20 of them. On the x-axis is the measure of performance. The blue dot for model 12 indicates that it had the best performance across all the tuning parameter values, and the rest of the points show the loss of performance compared to the current best. You can see that model 12 was the current best, but model 17 was close behind, and at the bottom there, model 16 did not perform very well. This is after we've evaluated the first resample.

In the second resample, we do the same thing: compute the performance metrics for all 20 tuning parameter values. In this plot, instead of individual ROC values, the points represent the average of the two ROC AUC values for each model. The leader switches from model 12 to model 17, but you can still see that there are a variety of models that do not perform very well compared to those.

In iteration 3, we start doing the interim analysis. In this case, we'll use the ANOVA method, which lets us put a confidence interval on the loss of performance relative to the leader for each tuning parameter value. For the ones in black, there's no evidence that they're statistically different from the best value, shown in blue. The ones in gray are statistically different from it, so it's a good idea to discard them from further consideration. Basically, we use the ANOVA statistics to decide which tuning parameter values to keep resampling and which ones we can get rid of. In this particular case, at the third iteration, we eliminate 17 different tuning parameter values.

On iteration 4, the confidence intervals tighten up considerably, and model 1 falls out, leaving us with two models. As we keep resampling, we get closer and closer to having one of those fall out too. On iteration 9 of 10, model 12 is finally discarded because it is statistically different from the best.

That was on the ninth fold out of 10. Whenever there's only one candidate left over, we keep resampling it until it has the full precision that we'd like for measuring performance. So after nine iterations, we've filtered out all but one tuning parameter combination, and then we just resample model 17 once more. The result is that instead of fitting the full set of 200 models, we actually fit 74. It's a pretty substantial reduction in both time and model evaluations.


And since our interim analysis method is fast, we're able to cull these values pretty quickly. The nice thing about racing is that you can use a very large grid if you'd like and evaluate it pretty efficiently, because chances are you will not be resampling the full grid over all of the possible resamples.

The code for racing is shown here. It's very similar to the tune_grid() function in the tune package. You can see that we first set the seed to make sure we get reproducible results, and then run the tune_race_anova() function. Its arguments are very similar to tune_grid(). The relevant one shown here is the grid argument. That can either be an integer saying how many tuning parameter combinations you want to evaluate, letting the function figure out which ones to use, or a data frame with the exact combinations you're interested in. There's a control argument that lets you specify numerical aspects, such as the size of the confidence interval used for elimination. There's also a verbose_elim option that logs the progress as you proceed; you can see there that it shows we eliminated 17 tuning parameter values at the first interim analysis, and so on. If you'd like to try the other racing method, based on win/loss statistics, that function is called tune_race_win_loss(), and its syntax is identical to what we see here for tune_race_anova().
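A call along the lines of what the slide shows might look like this. The workflow and resample objects (`svm_wflow`, `cell_folds`) are assumed to exist from the earlier setup, and the seed is arbitrary.

```r
# Racing with finetune; assumes a workflow `svm_wflow` and 10-fold CV
# resamples `cell_folds` have already been created.
library(tidymodels)
library(finetune)

set.seed(99)
svm_race <- tune_race_anova(
  svm_wflow,
  resamples = cell_folds,
  grid      = 20,   # 20 candidates chosen by a space-filling design
  metrics   = metric_set(roc_auc),
  control   = control_race(verbose_elim = TRUE)  # log eliminations
)

# The candidates that survived the race, best first:
show_best(svm_race, metric = "roc_auc")
```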

Simulated annealing

As I've mentioned, racing is really a grid search method that's done very efficiently, and in grid search you predefine all the values you want to look at. The next method we'll talk about is an iterative search method based on simulated annealing, where each new value is determined from the current one, so you don't predefine the tuning parameter values.

Simulated annealing is a pretty old search routine. It conducts a biased random walk around our tuning parameter space. The bias comes in because once it creates a new tuning parameter candidate and evaluates it, it may discard that candidate from the search and pretend it never existed. Or, if the results are not too far from optimal, or are a new best, it accepts the candidate and proceeds from there.

We'll show that in detail in a little bit with the example. One thing we should talk about is how simulated annealing comes up with its next value to evaluate. On the right-hand side, there's a plot where the black line shows a search that's gone through six iterations and is slowly moving down towards the bottom right. Once we're at that last point, we need to generate values within a local neighborhood of it. So what simulated annealing does is find values that are not too different from the one you're currently evaluating.

What we've done is generate a variety of random combinations, shown in red, that are within some radius around the current point, and then we select one at random as the next point to evaluate. The biased random walk part comes in because, if the selected point's performance is not as good as the current value, we have two choices: the algorithm can either accept it and proceed from it going forward, or discard it and, again, pretend it was never generated.
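The neighborhood step can be sketched like this in plain R. This is only an illustration of the idea, not finetune's internals, and the parameter names and radius are made up.

```r
# Illustrative sketch of the neighborhood step (not finetune's internals):
# generate candidate points within a fixed radius of the current point,
# then pick one at random to evaluate next.
set.seed(7)
current <- c(cost = 0.50, sigma = -0.25)  # current point, in scaled units
radius  <- 0.10

# Random directions on a circle of the given radius around the current point.
angles     <- runif(20, 0, 2 * pi)
candidates <- data.frame(
  cost  = current[["cost"]]  + radius * cos(angles),
  sigma = current[["sigma"]] + radius * sin(angles)
)

# The next point to resample is one of these, chosen at random.
next_point <- candidates[sample(nrow(candidates), 1), ]
```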

The decision of whether we discard or accept a result is based on two things. One is how early you are in the search process: you're much more likely to accept suboptimal values early in the search; that's how simulated annealing works. As you get to later iterations, you're much less likely to accept poor values. The second factor is how bad the current result is. If it's only slightly worse than the current best, it's more likely to be accepted; if it's a really awful point, probabilistically it's less likely to become the current point that your next point is based on. We'll show this in some detail with our current example.
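Those two factors can be captured in a generic acceptance rule: the probability of accepting a worse result decays with both the size of the performance loss and the iteration number. This is a sketch of the general idea, not finetune's exact formula, and the scaling coefficient is invented.

```r
# Generic simulated annealing acceptance rule (a sketch, not finetune's
# exact formula): the acceptance probability shrinks both with how much
# worse the candidate is and with how far along the search has progressed.
accept_prob <- function(loss, iter, coef = 0.02) {
  # loss: how much worse the candidate is than the current best (>= 0)
  # iter: the current iteration number
  # coef: made-up scaling constant controlling how fast acceptance decays
  exp(-loss * iter / coef)
}

accept_prob(loss = 0.005, iter = 2)   # early in the search: likely accepted
accept_prob(loss = 0.005, iter = 25)  # same loss late on: rarely accepted
```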

Here's the search space we looked at earlier. I've generated a grid of four different tuning parameter values and evaluated them using resampling. The current best from that grid is the green dot, so simulated annealing starts from there. On the first iteration, we define this neighborhood around the green dot and randomly generate a point. It turns out to be a new best, moving towards the better results, so we keep it and base the iteration 2 results on this value. In iteration 2, we go in a poor direction and get fairly bad results. They're bad enough that we discard them as suboptimal, so when we go to iteration 3, it's based on the results from iteration 1, as you'll see in just a second. Discarding the point means we do not use it to generate the next point in the search.

In iteration 3, you see we get another new best, and then we can keep going around the search by finding things in a local neighborhood. One notable thing about simulated annealing is that it's a global search method, which means it can go back and re-evaluate areas it has already visited. In this particular run, on iteration 8, it goes back in a different direction and we accept a suboptimal result. As you can see, we're heading into an area that's not very good, and we get a series of poor results.

Simulated annealing also has a restart feature. If you don't get good results within a certain number of iterations, you can trigger the search to start again from its last known best point and proceed from there. That's what happens here: we have another suboptimal result, and then the search restarts from the current best.

We proceed along, and it happens again: we go off in a bad direction and then restart. You can see we're close to the end of the search; I used 30 iterations here, and it would just continue to look for better and better values.

So that's a little graphical representation of how simulated annealing works. If you want to use this in code, the code is very similar to the tune_bayes() function in the tune package; the arguments are almost all the same. You can specify the number of iterations to run. Again, there's a control argument that determines things like the size of the local neighborhood and other numerical aspects of simulated annealing that you can tweak. It also has a verbose option: if you set that to TRUE, you get some nice logging as the search continues. What's nice is that if you're running interactively and the search has been going for a while and you feel like it's done as well as it can, you can manually interrupt the execution of the tune_sim_anneal() function and it will return all the results it has achieved so far. By manually interrupting it, you don't lose the results you've generated; it gives you the ones completed to date.
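A call along those lines might look like this, again assuming the `svm_wflow` workflow and `cell_folds` resamples from the earlier setup; the seed and iteration count are illustrative.

```r
# Sketch of the simulated annealing call; assumes `svm_wflow` and
# `cell_folds` have already been created.
library(tidymodels)
library(finetune)

set.seed(1234)
svm_sa <- tune_sim_anneal(
  svm_wflow,
  resamples = cell_folds,
  iter      = 30,                                # number of search iterations
  metrics   = metric_set(roc_auc),
  control   = control_sim_anneal(verbose = TRUE) # log each iteration
)

# The best settings found, along with their resampled performance:
show_best(svm_sa, metric = "roc_auc")
```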

Learning more

So those are the functions that we wanted to show for model tuning in the finetune package. If you want to learn more, look at tidymodels.org. Especially if you're new to tidymodels, it has a lot of getting-started articles, with a nice sequence of examples of how to use recipes, parsnip, and a variety of other things. There are also more advanced articles for people who have used it, showing different aspects of tidymodels. Julia Silge and I are also writing a book on tidymodels; it's about halfway done at the end of 2020. Chapter 13 is about grid search and has more details about racing, and chapter 14 is about iterative search, including simulated annealing. You can also look at the finetune webpage, which has a nicely formatted reference section, as well as the blog post that we used to launch the package. Thanks for watching, and I hope you enjoyed it.