Resources

Max Kuhn - Evaluating Time-to-Event Models is Hard

Censoring can frequently occur when we have time-to-event data. For example, if we order a pizza that has not yet arrived after 5 minutes, it is censored; we don't know the final delivery time, but we know it is at least 5 minutes. Censored values can appear in clinical trials, customer churn analysis, pet adoption statistics, or anywhere a duration of time is used. I'll describe different ways to assess models for censored data and focus on metrics requiring an evaluation time (i.e., how well does the model work at 5 minutes?). I'll also describe how you can use tidymodels' expanded features for these data to tell if your model fits the data well. This talk is designed to be paired with the other tidymodels talk by Hannah Frick.

Talk by Max Kuhn
Slides: https://topepo.github.io/2024-posit-conf/
GitHub Repo: https://github.com/topepo/2024-posit-conf

Oct 31, 2024
20 min

image: thumbnail.jpg

Transcript

This transcript was generated automatically and may contain errors.

Hello, as Hannah said, I'm here to talk about how we measure the effectiveness of these models.

If you're watching this video sometime later, maybe take a look at Hannah's video beforehand, the one we just saw live. There's a fair amount of overlap: we use similar data sets, and she showed you some things that I'll also show you in a different context.

So the title of this, saying that doing this is hard, is a bit of a reflection of the development process. There are a lot of different ways, and a lot of R packages, that you can use to quantify how well your event-time model works. This is probably why it took us as long as it did to get everything fully implemented in tidymodels: a lot of the time went into research and figuring out what we should be doing, as opposed to just what's out there.

Survival model predictions and evaluation times

So Hannah showed you something similar. A lot of times with survival models, we're more focused on probability predictions, especially the probability of survival, as opposed to a point prediction like the median survival time. Here's an example of three data points like she just showed you. Two of these data points had event times of 17: the red one was an event and the green one was censored. Then we have another one that went a lot further, to 100, and you can see its probability curve is a lot higher than the other two.

And a lot of times when we look at these models, we want to look not so much at when the event happens, but along some range of time: what's the probability that it happens at any given time, or the probability that it hasn't happened up until now? So if we're really focused on probabilities, we might want to make our performance metrics dynamic, in the sense that we look at specific time points and see how the model is doing at each. This is an example where, every 50 days say, we look at how the model performs: does it do really well here at this slice of probabilities versus way out in the tail, where there are fewer events late in the time period? So we'll be looking at time-dependent, or dynamic, performance metrics for these models.

Converting censored data to binary indicators

So Hannah kind of went into this. What people tend to do in these situations is take the actual event-time data, your outcome data, and, given a certain evaluation time (the time at which we'll look at model performance), probably convert it to a binary indicator at that time; I'll show you why that "probably" is there in a second. The way I think about this is a package getting delivered to my house. I order something, say from Amazon, and it's been a day and the package hasn't arrived yet, but I know it's been at least a day. So my event time so far is one day, but right now it's censored at one day.

So what we can do is look at specific time points and convert that data point, censored at one day, to either an event, a non-event, or something else, as we'll see in a minute. Once we've done that, people tend to use standard binary classification metrics to evaluate the models. The two that have mostly stuck in the literature are the Brier score and the good old well-known area under the ROC curve. The Brier score, which I'll talk about a bit if you've never seen it, is really a measure of calibration: how accurate are my probabilities compared to their true values? The ROC curve is more about separation of events versus non-events, which doesn't require very accurate probabilities; they just need to separate the events from the non-events.

We'll talk about both of these, and there are more details at tidymodels.org.

All right, so say we have an evaluation time of tau. I'm looking at packages being delivered, and I have a delivery that actually happened at day two; let's say tau is day three. At day three, when you're evaluating your model, you have to think about three different cases. If the actual time I have is an event that happened before tau, say at day two, we count it as an event, because it's already happened. If my time, whether it's censored or an actual event time, happens after tau, after three days, I know it's not an event, because nothing has happened either way. But say I'm evaluating my model at day three and I have a censored value at day two. I don't know what the real event time is: it could have been 2.1, it could have been five. So we don't know what to do with that. It's really ambiguous.

And so what these metrics do is basically just discard those data points. The interesting thing here is that we're actually throwing away data because we don't know what the answer is. And the amount of data we have at any given evaluation time is going to change as we evaluate these models. That, as you'll see, has a bit of an effect on how we evaluate the models and the trends that we see.

So this is a nice little figure demonstrating the same thing. Say we have a package delivered at day two and then a censored value at day four. If I'm evaluating my model at day one, nothing has happened to the left of that line, so they're both non-events. If I'm evaluating at day three, I did have an event here at day two; the other one happened after three, so at time three it's not an event. And then here's the ambiguous case at day five: this one's an actual event, so it did happen, but the censored one could have happened anywhere between these two lines, or it could have happened much later. So we can't really do anything with that data point.
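The three cases above are simple enough to sketch in code. The talk's implementation lives in tidymodels (R); this is just an illustrative Python version with a hypothetical `status_at` helper, and the convention that an event landing exactly at tau counts as an event is my assumption:

```python
def status_at(time, is_event, tau):
    """Classify one observation at evaluation time tau.

    Returns "event", "non-event", or None. None marks the ambiguous case
    (censored before tau), which is discarded at this evaluation time.
    """
    if is_event and time <= tau:
        return "event"       # the event already happened by tau
    if time > tau:
        return "non-event"   # still event-free at tau, censored or not
    return None              # censored before tau: true status unknown

# The package-delivery example, evaluating at tau = 3 days:
# delivered day 2 -> event; censored day 4 -> non-event;
# censored day 2 -> ambiguous, dropped.
```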

Inverse probability of censoring weights

Now, one thing we know from statistics is that, generally speaking, we can't just throw away data and assume everything's fine. We have to think about the consequences for measuring performance when data is eliminated, in this case for good reason, because we don't know what to do with it. And typically we know that analyzing those data as if nothing ever happened generally leads to bias; we've known that for a long time. There are a variety of ways to handle it, but what people tend to do is take a causal-inference approach and build what they call a propensity model to predict the probability of a data point being missing, then use that score for each data point to weight their metrics and so on. This is a pretty well-known technique; people do it quite a bit, and you've probably seen it if you were in the causal inference workshop.

Now in our context, we're going to do basically that: estimate the probability of something being censored at a specific time tau. There's detail at the end if we have time or need it. We take a very simple approach that assumes non-informative censoring; we don't think there are any predictors influencing the censoring. We can change that later on in tidymodels. But right now we do something that, honestly, I had never heard of, though apparently it's fairly common, called a reverse Kaplan-Meier curve. It's a Kaplan-Meier curve, which we know, but with the event indicators switched, and it gives us basically the probability of remaining uncensored up to time tau.

And so then we use what are called inverse probability of censoring weights: we get that probability for every data point, whether it's an event or a non-event, take its reciprocal, and use that as a case weight when we compute our metrics. We'll see that in just a minute. The best paper on this is Graf et al., here. I've read it like five times just to get my mind around it, and it's probably the best brief reference we have. It's mostly concerned with Brier scores, which is what we'll talk about next.
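A minimal sketch of the reverse Kaplan-Meier idea and the reciprocal weights, in Python rather than the R used in the talk. It ignores tied times and the detail of exactly where G is evaluated for events versus survivors, so treat it as an illustration, not the tidymodels computation:

```python
def reverse_km(times, events):
    """Reverse Kaplan-Meier sketch: Kaplan-Meier with the event indicator
    flipped, so censorings are treated as the 'events'. The result
    estimates G(t), the probability of remaining uncensored past t.
    (Assumes no tied times, for simplicity.)"""
    order = sorted(range(len(times)), key=lambda i: times[i])
    n_at_risk = len(times)
    g = 1.0
    curve = []  # step function as (time, G(t)) pairs, in time order
    for i in order:
        if not events[i]:  # a censoring: the flipped 'event'
            g *= (n_at_risk - 1) / n_at_risk
        curve.append((times[i], g))
        n_at_risk -= 1
    return curve

def g_at(curve, t):
    """Evaluate the step function G at time t."""
    g = 1.0
    for time, value in curve:
        if time <= t:
            g = value
        else:
            break
    return g

def ipcw(curve, t):
    """Inverse probability of censoring weight: the reciprocal of G(t)."""
    return 1.0 / g_at(curve, t)
```

Note how the reciprocal blows up when G(t) gets close to zero; that is exactly the "explosion" of case weights discussed below for late evaluation times.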

So here's an example with a churn data set. I have a 1,000-point validation set, and I'm going to look at evaluation times going from about 5 to a little over 200. The coloring here shows how many data points can be included in the analysis because they're not ambiguous, and the line shows the sum of the weights at every evaluation time. It may be hard to tell from the colors, but the number of data points I have for analysis drops from 1,000 to maybe about 250 once you get out around day 200. That's not uncommon in the data sets I've looked at, although this particular trend isn't really reproducible from data set to data set. But it's interesting: things are pretty stable up until about day 100, and then you have a much smaller number of data points, meaning fewer people, each with larger weights, and then an explosion in the weights. What happens with propensity scores sometimes is that if you invert a probability close to zero, you get case weights of 90 or 100 or something pretty large, and that's what's happening here.

Breyer score

All right, so the Brier score, as I mentioned, is a measure of calibration. You can think of it as mean squared error, but for probabilities. In basic classification, you take these Y's here, the 0/1 indicator. So if I'm looking at class one, it would be a one here minus the probability of being in class one. If your model were perfect, the probability would be one, you'd subtract one from one, and get zero. So the Brier score is best at zero; that means you have a perfect model. With two classes, the danger zone, where my model is really not all that great, is about 0.25 and above. So that's the basic version for classification.

In our situation, with events and non-events, we're looking at a two-class problem, and we do the same calculation. Y equal to one here means an event, and the probability here is one minus the probability of surviving; that's what happens when you have an event. If it's not an event, there's a zero here, and it's just the survival probability. And then these are the case weights I mentioned. It's the same calculation, but instead of dividing by n, we divide by the sum of the weights at time tau. And that sum of weights, as you just saw in the last figure, is going to vary over time.
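Putting those pieces together, the weighted Brier score at a single evaluation time can be sketched like this. Python for illustration only; the actual computation in the talk is yardstick's `brier_survival`, and the function name here is made up:

```python
def brier_at(tau, times, events, surv_prob, weights):
    """IPCW Brier score at evaluation time tau.

    surv_prob[i] is the model's predicted probability of surviving past
    tau. Ambiguous observations (censored before tau) are dropped, and
    the weighted squared errors are divided by the sum of the weights
    rather than by n.
    """
    num, wsum = 0.0, 0.0
    for t, ev, s, w in zip(times, events, surv_prob, weights):
        if ev and t <= tau:        # event by tau: y = 1, survival should be 0
            num += w * s ** 2
            wsum += w
        elif t > tau:              # event-free at tau: y = 0, survival should be 1
            num += w * (1.0 - s) ** 2
            wsum += w
        # else: censored before tau -> discarded
    return num / wsum
```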

Computing metrics in tidymodels

So how do we do this in tidymodels? Hannah showed you a little of this, so it's maybe not new, or at least not more than 10 minutes new to you. Instead of using predict, we can use augment on our model fit. I give it my validation set, and I'm going to use a fair number of evaluation times here, to be honest, just to have nice plots. The difference between what augment gives you and what predict gives you is that augment appends two extra columns: the probability of being censored and the case weight. It's another list column, like the one she showed you before that had 24 rows; in this case it has, I think, 225 rows, because I used a lot of evaluation times.

And then yardstick has a bunch of metrics with names ending in survival. We have a Brier score for classification, so if you want to do that for survival models, you say brier_survival, give it the data set, tell it where the predictions are, and it gives you the standard yardstick format with a metric name. There's now an evaluation time field here that says, at 30 days, my Brier score is this, which is really low. And I've just filtered to show you a few rows.

Now, you can look at these at the time points most interesting to you and figure out: is my model good enough? Where is it good? Where is it not? If you don't really want to think about the dynamic nature of this, the other thing you can do is brier_survival_integrated, which basically takes the area under the curve and normalizes it. And honestly, the value here is not bad for a Brier score: it's 0.07, well below 0.25, which is good. What does it look like over time in this particular case? Early on, the model does really well; the score is virtually zero. Then it gets a little worse, peaks around 0.25 at time 200, and then comes back down and gets a little better. The early-on part seems like a pretty reproducible trend across data sets, but where your model works and doesn't work is, again, going to be data set dependent.
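One hedge on "takes the area under the curve and normalizes it": I'm assuming a trapezoid rule over the evaluation-time grid, normalized by the width of the window, which is the usual definition of an integrated Brier score. A sketch under that assumption:

```python
def integrated_brier(eval_times, brier_scores):
    """Time-averaged Brier score: trapezoid-rule area under the
    Brier-vs-time curve, normalized by the evaluation window width.
    eval_times must be sorted and contain at least two points."""
    area = 0.0
    for i in range(1, len(eval_times)):
        dt = eval_times[i] - eval_times[i - 1]
        area += dt * (brier_scores[i] + brier_scores[i - 1]) / 2.0
    return area / (eval_times[-1] - eval_times[0])
```

The normalization is what lets you read the result on the same scale as a single-time Brier score (hence comparing 0.07 against the 0.25 danger zone).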

Calibration plots

So I was wondering how to convey why this is the way it is, and what I decided to do is take a few different evaluation times and make what we'd call a calibration plot. With a regular classification model, the typical thing people do is take their predicted probabilities, order the data by those probabilities, and form, say, 10 bins. So the highest bin takes samples with probabilities from 0.9 to 1. You group them and compute their actual event rate, the mean of the zeros and ones for those data points. If the model is well calibrated, that rate should be about the midpoint of the bin: for the 0.9-to-1 bin, the midpoint is 0.95, so if the model is doing well, the actual event rate should be about 0.95.
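That binning procedure can be sketched in a few lines; the midpoint-versus-observed-rate comparison is the whole idea. A Python illustration (the talk's plots were made in R, and `calibration_bins` is a made-up name):

```python
def calibration_bins(event_probs, outcomes, n_bins=10):
    """Group predictions into equal-width probability bins and compare
    each bin's midpoint to the observed event rate; a well-calibrated
    model has observed rates near the midpoints."""
    bins = []
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        in_bin = [y for p, y in zip(event_probs, outcomes)
                  if lo <= p < hi or (b == n_bins - 1 and p == 1.0)]
        if in_bin:
            midpoint = lo + 0.5 / n_bins
            bins.append((midpoint, sum(in_bin) / len(in_bin), len(in_bin)))
    return bins  # (bin midpoint, observed event rate, count) per non-empty bin
```

The count in each tuple matters for the plots that follow: a badly calibrated bin with few points barely moves the Brier score.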

So you get this curve, and I did it over a couple of different evaluation times. Looking at an early time point, most of the data has a very, very high probability of being an event; the size of the points conveys how many data points are there. Remember, this had a very low Brier score. Even though some points are well off the curve, they don't have anywhere near as much sample size as this one. So basically, the curve does really well early on because almost everything is an event, and those are being easily predicted as events; the few non-events don't have enough sample size to really affect things much.

And as time went on, we saw the Brier score get a little worse. You can see we're filling out more of the probability space here, and the points aren't really on that line. They're still up here a fair amount, but you're slowly getting more data points down here that aren't along the line, so it's steadily getting worse. Which, I don't know, doesn't seem that bad to me. But you can see it's just broken at 200, right? Some points are all over the place, and it's particularly poorly calibrated at that point. Now, this doesn't necessarily mean it's a poor model. If you need accurate probabilities, maybe this isn't the model you want. But if your goal is to differentiate who's an event and who's not at a specific time, we'll see what that looks like in the ROC curve. Calibration and discrimination don't always have to match.

ROC AUC for survival models

All right. So we're probably familiar with this: the area under the ROC curve. The calculations are pretty straightforward; it's pretty much what we would have done before, except for incorporating these case weights. When we calculate the sensitivity and the specificity at a particular probability threshold, we don't have counts in the numerator and denominator. For the sensitivity, we have the total sum of the weights in the denominator and the sum of the weights for the events in the numerator. And conversely, you can imagine what we do for specificity.
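Equivalently, a case-weighted AUC is the weighted probability that a randomly chosen event outscores a randomly chosen non-event. A brute-force O(n^2) Python sketch of that view (an illustration of the weighting idea, not yardstick's threshold-sweeping implementation):

```python
def weighted_auc(event_probs, labels, weights):
    """Case-weighted ROC AUC: the weighted fraction of (event, non-event)
    pairs that the model orders correctly, with ties counting one half.
    labels are 1 for events, 0 for non-events."""
    num, den = 0.0, 0.0
    for pi, yi, wi in zip(event_probs, labels, weights):
        if not yi:
            continue
        for pj, yj, wj in zip(event_probs, labels, weights):
            if yj:
                continue
            w = wi * wj            # each pair counts by the product of weights
            den += w
            if pi > pj:
                num += w
            elif pi == pj:
                num += 0.5 * w
    return num / den
```

With all weights equal to one, this reduces to the ordinary AUC.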

The reference here is this one. There are a lot of references that do this inverse probability weighting, and they all do it slightly differently; we feel the Graf paper is about as good as it gets. There are additional things we do in some of the details that maybe these folks didn't, but we apply basically the same process as we did for the Brier score; we don't use a different weighting scheme for the ROC curve. And again, it measures separation between the events and non-events.

As you might imagine, the code is pretty boring: it's roc_auc_survival, same format. And if I look at that over time, similarly, it's nearly perfect at the start, then degrades until about 200, where it reaches its nadir, and then increases. But the bottom of the ROC AUC curve here is about 0.75, and I know people who would kill to get an ROC AUC of 0.75. So maybe this isn't bad for discrimination, even if it's not great at calibration. So there are our results for that metric.

Conclusions

So basically, the conclusion is there's good news and bad news. To me, the bad news is that evaluating these models is actually pretty complicated. We're fitting a separate censoring model just so we can quantify how well our model does, and when we eventually add informative censoring, it's going to get even more complicated. So we have this significant weighting scheme, and we have to adapt our metrics to accommodate it. But really, the good news is, I think, like most of tidymodels: we can take something that's fairly complicated or computationally intensive and give you an API that's very, very simple. If you want to do something different, or want to look at the details, like those calibration plots I made, you have all the data there in the augment method. So it's not like we're obfuscating anything or preventing you from doing anything on your own. What we want to do in tidymodels is take what we think is really the right thing to do with the data, even if it's complicated, and give you a syntax that's very simple, so you don't have to worry about that too. So that's pretty much it. Thanks to the group, Hannah, Mio, and Simon, who you will see more of in a minute. And thank you.

Q&A

Thank you, Max. We have just a little bit of time for questions. The first question is around the packaging of some of the time-to-event functions: how does the team decide whether functionality makes it into one of the core packages? For example, proportional hazards is in parsnip, while other engines are in the censored package, and the Brier score for survival models is in yardstick. Yeah, some of that's easy to describe. A lot of the time we're in the process of figuring out the best way to do it, and honestly, a lot of times it's driven by groupings of models. So it makes sense to have the model engines for event-time data, for the most part, in censored.

I think bonsai has some too. But basically, a lot of times, to be honest with you, it's governed by package dependencies. parsnip already has a lot of dependencies as is; we don't want to make that a hundred. So sometimes we modularize our packages so that they're easier for you to install and easier for us to manage. We try to keep a pretty rational grouping of what goes in which packages, but opinions may vary on that.

Questions are coming in fast and jumping around on me, so I'm struggling to keep up, but: is it ever a good idea to post-process time-to-event predictions to improve calibration? Yeah, well, probably. That's a joke, because when it does happen, it'll be in the probably package. We see a lot of that in the literature. We've implemented some calibration routines in probably for regression and classification, and we have every intention, I'm not sure when, of extending that to survival models. Because where I see calibration mattering the most is within these types of models.

As for how we package it: this guy you're going to see in a minute, Simon, and I are working on the whole post-processing aspect of tidymodels. Well, mostly him. The idea is that you can potentially attach a post-processor, like a calibrator, to a model workflow and just have that happen in the course of building and tuning your models. That's the intent. We'll probably have it sooner for classification and regression than for survival models, but our intent is absolutely to do that. Very good. Well, thank you, Max. Thanks.