Resources

Max Kuhn - The Post-Modeling Model to Fix the Model

The Post-Modeling Model to Fix the Model by Max Kuhn

Visit https://rstats.ai/nyr to learn more.

Abstract: It's possible to get a model that has good numerical performance but has predictions that are not really consistent with the data. Model calibration is a tool that can fix this. We'll show some examples of poor predictions and how different calibration tools can re-align them to the data.

Bio: Max Kuhn is a software engineer at RStudio. He is currently working on improving R's modeling capabilities. He was previously a Senior Director of Nonclinical Statistics at Pfizer Global R&D in Connecticut, where he applied models in the pharmaceutical and diagnostic industries for over 18 years. Max has a Ph.D. in Biostatistics and is the author of numerous R packages for techniques in machine learning and reproducible research. He and Kjell Johnson wrote the book Applied Predictive Modeling, which won the Ziegel award from the American Statistical Association, recognizing the best book reviewed in Technometrics in 2015. Their second book, Feature Engineering and Selection, was published in 2019.

Twitter: https://twitter.com/topepos

Presented at the 2023 New York R Conference (July 14, 2023)

Aug 4, 2023
18 min

image: thumbnail.jpg

Transcript

This transcript was generated automatically and may contain errors.

Thanks for waiting around for the second-to-last talk here at the conference; I appreciate that. Nobody will be surprised that I'm going to be talking about modeling. I think I said that last year too.

So the idea is this, something I've been working on lately. I should say up front that a person I work with at Posit named Edgar Ruiz did most of the programming on this; he and I worked together on it for about six months. The idea is the model behind the model to fix the first model, and hopefully that'll come across well here.

So I was trying to think of how I could start this. The idea is that you have a data set, you do all this work, and sometimes the data are very difficult and you find a small handful of models that, if you tune them just right, actually get pretty good performance, which is not really how it usually is. This particular data set had about 1,500 data points and 56 predictors, and a model that seemed to work pretty well was something called a naive Bayes model. It had an area under the ROC curve of 0.86, and there it is right there, so you think, oh yeah, this is really awesome, it's great. But a college professor of mine told me that the only way to be comfortable with your data is never to look at it, so then you start plotting things.

And what I did, as you'll see in a minute, is some resampling, and these are the out-of-sample predictions. That ggplot on the right basically has things faceted by the true class of the data. On the top you see that there's a big bar at 1 and a smaller bar at 0; it's basically bimodal, and the converse happens for the no-event data on the bottom. But it's kind of weird, it's a strange distribution you wouldn't expect to see, and in fact in a lot of cases when the model predicts incorrectly, it's really confidently incorrect.

So this is not good. The model is separating the classes well, but not in any realistic way once you start looking at the probabilities it generates. The way to think about this is that the model is not very well calibrated. Calibration essentially means that if you had an individual data point and its prediction is 80%, then, if you had a Marvel MCU way to alternate-reality this thing a thousand times, you would expect the actual event to occur about 80% of the time.

Visualizing calibration

All right, so before I continue, let me show some tidymodels code. I'm not going to go through all of it; the only things I want to highlight are that we did cross-validation using vfold_cv(), and at the bottom here is the code where we actually fit, or resampled, the model. You'll see this Bayes results object used; it is basically the results of cross-validation, so it contains all of our holdout predictions from each round of the 10-fold cross-validation. Just a little technical note.

So a while back, I wrote this package called probably, and probably was mostly a placeholder that had a few things in it: some tools for equivocal zones, and some tools to help you estimate appropriate probability thresholds for classification models. It sat for a while because we didn't really have time to do much with it, but we've done a lot with it over the last year. As I mentioned, Edgar and I worked on the calibration tools that are in probably, and that's what we're going to talk about. I also more recently added some conformal inference things to get prediction intervals, which is a hot research topic right now.

So we're going to look at calibration, and all this stuff is in the probably package, and the URL for this is at the bottom, so if you want to reproduce all this and see how it works, you can get the code.

All right, so let's look back at the histogram I had earlier. As I mentioned, you'd need an alternate reality to see if any one individual data point is well calibrated, but you can look at the collection of data points and find ways to figure out whether they are well calibrated. This is a really old technique, so it's nothing new. Those vertical dotted lines mark the data points with estimated probabilities of less than 10%. If we order the data by their probability of the event and, I hate binning, but if we bin them into 10 groups based on their estimated probabilities, then for the bin that runs from zero to 10%, the midpoint is 5%, and if the model is well calibrated, you expect about 5% of the actual labeled data in that bin to be the event of interest.
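The binning bookkeeping can be sketched in a few lines of base R; the simulated data here are made up just to show the mechanics, not the data from the talk.

```r
# Simulate predicted probabilities and outcomes that happen to be
# well calibrated, then compare each bin's midpoint to its event rate.
set.seed(1)
n <- 1500
prob_event <- runif(n)                  # hypothetical predicted probabilities
obs_event  <- rbinom(n, 1, prob_event)  # simulated 0/1 outcomes

# Ten equal-width probability bins: [0, 0.1], (0.1, 0.2], ...
bins <- cut(prob_event, breaks = seq(0, 1, by = 0.1), include.lowest = TRUE)
midpoints   <- tapply(prob_event, bins, mean)  # average prediction per bin
event_rates <- tapply(obs_event, bins, mean)   # observed event rate per bin

# For a well-calibrated model, these two columns track each other closely.
round(cbind(midpoint = midpoints, observed = event_rates), 2)
```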

And so what you can do is repeat this for 10 different bins and see if each lines up with what you would expect. This gives the typical calibration plot; you see the code on the left. The code is really boring. I wanted to have some really cool-looking pipe things, but we made the code pretty simple and optimized the function names for tab completion, so all these things start with cal_ for calibration. We have a bunch of plotting functions, and you can see on the right that the first dot on the left is the result of that bin I showed. If the model were well calibrated, that data point should sit on the diagonal, but it's not, so the model is really overestimating the probability. Then it meanders in the middle and gets a little better calibrated towards the end, but this is about as bad as it gets, honestly. And no shocker: I chose a naive Bayes model for this particular reason. With naive Bayes, if you have too many predictors and those predictors are highly correlated, which they are in this data set, you're likely to get this sort of bathtub-shaped or U-shaped distribution of your probabilities, so that wasn't an accident on my part.

So one issue with this plot is: well, that's 10 bins; should I have used 20, should I have used five? Another way to do this is to take 10% bins and use a moving window. We have that too, which I think is a little better than the other plot, and you can see that the bad trend is still there regardless of how we bin. And another way is to take the original data and build a logistic regression model: if the original model is well calibrated, the predicted probabilities from that logistic model should fall on the diagonal line. We have another function that does that using splines, so it's basically a generalized additive model, and you can see the same trend here with some confidence bounds on it.
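For reference, the three plotting helpers just described live in the probably package. This sketch assumes a data frame preds of held-out predictions with a factor column class (the truth) and a probability column .pred_event; those column names are illustrative, not from the talk, so adjust them to your own data.

```r
library(probably)

# Binned calibration plot (the "should I have used 10 or 20 bins?" version)
cal_plot_breaks(preds, truth = class, estimate = .pred_event, num_breaks = 10)

# Moving-window version, which avoids committing to one fixed binning
cal_plot_windowed(preds, truth = class, estimate = .pred_event)

# Model-based version: a spline/GAM logistic fit with confidence bounds
cal_plot_logistic(preds, truth = class, estimate = .pred_event, smooth = TRUE)
```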

So in other words, this model's performance is good in terms of separation, but in terms of probabilities, it's not great at all.

Calibration methods

So what do we do about that? Well, this wouldn't be my first choice; we were discussing this in the workshop the other day. My first choice would be to ask whether there's a model that can give me performance close to this but with good calibration properties. But if you can't find one, or if there's some reason this model is more efficient, or you just like it, you can build another model on top of it that will basically fix the problem. What all these tools do is try to estimate these trends. For example, you could take this logistic regression, and when you get new samples, if the curve shows that at some point you're overestimating the probability of the event by 50%, you can pull the prediction down by roughly that amount. You can coerce these probabilities to fall along that diagonal line.

And so one of the classic ways of doing this is logistic calibration, where you fit this model, and when you get new probabilities, you basically invert the relationship so the result actually complies with the event rate you think should be there. Another thing that's pretty useful is isotonic regression, which is a way to enforce monotonicity: this trend should be increasing, right? It shouldn't be wiggling around. Isotonic regression is a regression model that ensures a completely increasing relationship, or a completely decreasing one. One downside of isotonic regression is that if you have, say, 1,000 unique probabilities from your model, isotonic regression may convert that to something like 50 in the end, so it can discretize your model output.

So one thing that we did that seemed to help is just to bootstrap that, say, 50 times. If you use isotonic regression with a bootstrap, you can reduce the problem of it only giving you a few predicted values. And another tool: if any of you know much Bayesian analysis, the beta distribution is frequently used for data between zero and one, as the conjugate prior for the binomial, and there are tools that estimate certain quantities of beta distributions to fix this calibration problem. Even though I won't go into its details, we'll mostly look at the results of beta calibration.
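As a rough base-R illustration of these ideas (this is not the probably package's implementation, and the simulated data are made up): logistic calibration can be fit with glm(), isotonic regression with the base isoreg() function, and the bootstrap just averages isotonic fits over resamples.

```r
set.seed(2)
n <- 1000
raw  <- plogis(rnorm(n))                        # hypothetical raw probabilities
true <- rbinom(n, 1, plogis(2 * qlogis(raw)))   # events more extreme than raw says

# Logistic (Platt-style) calibration: regress the outcome on the raw
# probabilities on the logit scale, then use the fitted curve to rescale.
logit_cal <- glm(true ~ qlogis(raw), family = binomial)
calibrated_logit <- predict(logit_cal, type = "response")

# Isotonic regression: isoreg() fits a monotone increasing step function,
# which pools observations and so discretizes the output.
ord <- order(raw)
iso <- isoreg(raw[ord], true[ord])
calibrated_iso <- iso$yf[order(ord)]   # map fitted values back to input order

# Bootstrapping the isotonic fit ~50 times smooths out that discreteness.
boot_iso <- replicate(50, {
  idx <- sample(n, replace = TRUE)
  o   <- order(raw[idx])
  fit <- isoreg(raw[idx][o], true[idx][o])
  # evaluate each bootstrap fit at the original raw values
  approx(fit$x, fit$yf, xout = raw, rule = 2, ties = mean)$y
})
calibrated_boot <- rowMeans(boot_iso)
```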

And so one other question, as I was thinking about this, is: wait, what data are we using? Because we're always saying, don't re-predict your training set and start doing things with that. What we'll do here is use the holdout sets from cross-validation. Those predictions should be pretty accurate, and we'll use them to estimate the trends that we want to remove from the predictions. Ideally you'd like to pre-plan for this and, if you have enough data, hold some of it back just for calibration.

The first job I had was at a molecular diagnostics company. I worked on assay development, but I was also one of the people who built what they called the algorithms for the instruments, and "algorithms" may sound a little more sophisticated than what we actually did most of the time. What we would do is run clinical trials to collect all the data that was the substrate for our algorithms, and while we did that work, they would continually accrue samples that we would use for calibration, because when we went to the FDA we had to show them that the output was well calibrated with the actual event. So if you do have the chance to set aside data for calibration, that's probably the best thing you can do.

Measuring calibration with the Brier score

So then the other question is: how do we measure calibration? Clearly the ROC curve didn't do that. One thing you can do is use something called the Brier score, which has been around for quite a long time. If you look at its equation, it's almost like a sum of squared errors for binary data, and when I first saw it, I thought: what? Why would you do that? That doesn't seem like a good idea. But it turns out that there's a lot of theory behind it, and the nature of this statistical measure of performance is actually quite good. In fact, it's better than a lot of other things.

And so I've come around to the Brier score maybe being something we should use by default to measure how well classification models do. It's almost like an error estimate for classification. The best value you can get is zero, so you want to minimize it, and where the "oh God, this is awful" line sits depends on how many classes you have. For two classes, if you get a Brier score of about 0.25, you're not doing well. That's the situation we're in here. So that's sort of the line of disaster.
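For two classes, the Brier score is just the mean squared difference between the 0/1 outcome and the predicted probability of the event; a minimal base-R sketch (the example numbers are made up):

```r
# Brier score for binary outcomes: mean squared error between the
# observed 0/1 outcome and the predicted probability of the event.
brier_score <- function(truth, prob_event) {
  mean((truth - prob_event)^2)
}

truth <- c(1, 0, 1, 1, 0)

# An uninformative model that always predicts 0.5 scores exactly 0.25,
# which is why ~0.25 is the two-class "line of disaster".
brier_score(truth, rep(0.5, 5))                  # 0.25

# Confident, mostly-correct probabilities score much closer to zero.
brier_score(truth, c(0.9, 0.1, 0.8, 0.7, 0.2))   # 0.038
```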

Applying calibration in tidymodels

So all right, a little more tidymodels code. We can look at how these methods work, and we can actually do the beta calibration. But what we didn't want is for people to do the calibration and just start applying it to their data set, because we need a second way to evaluate whether the calibration works or not. We don't want them continually re-predicting the test set and things like that.

So what we did is we have these cal_validate functions for the different methods of calibration. You give one your original data in that first argument, and then optionally some metrics to measure. What it does is, in this case, since we did tenfold cross-validation, we have all those predictions, so it uses 90% of those predictions to estimate the beta calibration, applies that to the other 10%, computes the performance metrics, and then goes round robin. It's basically cross-validation for the calibration.
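That round-robin idea can be sketched in base R. Here plain logistic calibration stands in for the beta calibration in the talk, and the simulated predictions are made up; the point is just the fit-on-nine-folds, score-on-the-tenth bookkeeping.

```r
set.seed(3)
n <- 1500
raw  <- plogis(rnorm(n))                        # hypothetical held-out CV predictions
true <- rbinom(n, 1, plogis(2 * qlogis(raw)))   # simulated miscalibrated outcomes
fold <- sample(rep(1:10, length.out = n))       # assign each prediction to a fold

brier <- function(y, p) mean((y - p)^2)

# For each fold: fit the calibration on the other nine folds,
# apply it to this fold, and compute the metric before and after.
scores <- sapply(1:10, function(k) {
  train <- fold != k
  fit   <- glm(true[train] ~ qlogis(raw[train]), family = binomial)
  new_p <- plogis(coef(fit)[1] + coef(fit)[2] * qlogis(raw[!train]))
  c(uncal = brier(true[!train], raw[!train]),
    cal   = brier(true[!train], new_p))
})

rowMeans(scores)  # cross-validated Brier score, before vs. after calibration
```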

And when you use the collect_metrics() function, it gives you the results for both the Brier score and the ROC curve. You can see the Brier score goes from about 0.21 before calibration to about 0.145 after. Numerically, that's a nice improvement, roughly 30% better. You might notice the ROC curve doesn't really change at all, even though the standard error is a little different. Most of these calibration methods don't really change the rank order of the predictions; they change the probability values, but usually in a fairly monotone way.

So it doesn't surprise me that things like accuracy, or maybe not accuracy, but the log-likelihood or the area under the ROC curve, aren't drastically affected, or affected at all, by calibration. The thing we care about, the calibration, looks like it has improved, and we haven't hurt the separation aspect of the model by doing this.

And so we have these validation functions for all the different methods. Typically, although I didn't do it here, you would go round robin and see which one you like, or which one gives you the best performance using resampling, and that's probably the best way to choose. In this case, beta calibration did the best. What we do in the last line there, line 12, is a final estimate: it takes the entire set of held-out training predictions, computes all the beta calibration statistics from that data, and that's what you use going forward whenever you process the test set or new samples.

So looking at the test set, this is sort of the manual way to do it in tidymodels. You fit your naive Bayes model on the entire training set, and then augment() makes the predictions and attaches them to the test set, so we have a data frame there on the left with the raw predictions from the naive Bayes model. You can see our statistics are very consistent with what the resampling gave us: the Brier score is about 0.22, which is verging on disaster, and the ROC curve is actually pretty good.

When we want to apply the calibration, we have a function called cal_apply(): you give it any of the calibration objects along with your predictions, and you get the fixed version of those probabilities. When we compute the same statistics now, the Brier score goes down to 0.15; we estimated it from resampling to be about 0.14, and again the ROC curve looks pretty good. These are the test set results. And if we look at the plots, it looks really well calibrated at this point. It's a little off in a few places, but it's head and shoulders above what we had earlier.
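Put together, the estimate-then-apply steps look roughly like this with the probably package. The cv_preds and test_preds data frames and their column names are assumptions for illustration: cv_preds is the held-out cross-validation predictions, test_preds is the augmented test set, and both are assumed to contain a factor column class and the usual tidymodels .pred_ probability columns. brier_class() is the yardstick metric.

```r
library(probably)
library(yardstick)

# Fit the final beta calibration on all of the held-out CV predictions
beta_cal <- cal_estimate_beta(cv_preds, truth = class)

# Apply that calibration to the raw test-set predictions
test_calibrated <- cal_apply(test_preds, beta_cal)

# Compare the Brier score before and after calibration
brier_class(test_preds, truth = class, .pred_event)
brier_class(test_calibrated, truth = class, .pred_event)
```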

And so basically, that's the gist of how you do this in tidymodels. Somebody in the workshop asked me if I've ever done this, and I said, well, no, I've never done this for other models, but then I've never really had the tools to do it either. So now you have tools to do this when it works well.

We also have tools for multiclass models, with three or more classes, and also for regression models. So if you have a regression model where you plot the observed versus predicted values and they sort of fan out at the end, and you can't find a way to fix that, then, as maybe not a last resort, you could use a similar approach to unbend that curve using calibration.

What's next: post-processing in workflows

So now you have this calibration object and your original model fit. The code I showed is how you would apply the calibration right now, and that's a little more work than we want you to have to do.

So what we're doing next involves these objects in tidymodels called workflows. You can think of them as being like scikit-learn pipelines, but Hadley was like, no, don't use the word pipeline. What workflows let you do is bind together a preprocessor, like a recipe, Emily talked about recipes a little earlier, and whatever model you're trying to fit, and combine them into one object.

And so next on our list is to add post-processing objects to the workflow. Whether it's calibration, optimizing your probability threshold for a two-class system, or a variety of other things you might do to your predictions after you fit your model, you'll be able to attach something like a calibration object. Then when you use the fit or predict functions at the very end, you don't have to do the extra step of holding on to a separate object to transform your probabilities. It'll be built into the workflow, and that'll be a nice little usability feature for people.

We're in a bit of minutiae here. We try to convince everybody to resample everything, right? That's our mantra: if you want good performance statistics, you have to resample everything. So when you use a workflow and you resample it, if the workflow does any interesting estimation, like feature hashing or some complicated preprocessor, we want to make sure that gets executed freshly each time during resampling, just the same way the model does.

And for the post-processor, that's mostly true, but we're not exactly sure yet how to make that happen with calibration. When we cross-validated the calibration earlier, we used the 90% of the predictions that were held out to fit the calibration, then applied it to the 10% most recently held out and computed performance. Normal cross-validation for models and preprocessors is a little bit the opposite: during regular cross-validation, you only have 10% held out at any time, so you don't have the other data to fit the calibration and then measure it. It might take a little work for us to get this going with resampling, but I think calibration in workflow objects should land, hopefully, before the end of the year.

So that's what's on deck. Thanks for sticking around. And just to remind you, Edgar Ruiz did most of the grunt work on this, and so we appreciate all the help that he gave.