Resources

Max Kuhn - Evaluating Time-to-Event Models is Hard

Censoring can frequently occur when we have time-to-event data. For example, if we order a pizza that has not yet arrived after 5 minutes, it is censored; we don't know the final delivery time, but we know it is at least 5 minutes. Censored values can appear in clinical trials, customer churn analysis, pet adoption statistics, or anywhere a duration of time is used. I'll describe different ways to assess models for censored data and focus on metrics requiring an evaluation time (i.e., how well does the model work at 5 minutes?). I'll also describe how you can use tidymodels' expanded features for these data to tell if your model fits the data well. This talk is designed to be paired with the other tidymodels talk by Hannah Frick.

Talk by Max Kuhn
Slides: https://topepo.github.io/2024-posit-conf/
GitHub Repo: https://github.com/topepo/2024-posit-conf

Oct 31, 2024
20 min

image: thumbnail.jpg

Transcript

This transcript was generated automatically and may contain errors.

Hello, as Hannah said, I'm here to talk about how we measure the effectiveness of these models.

If you're watching this video sometime later, maybe take a look at Hannah's video beforehand, the one we just saw live. There's a fair amount of overlap: we use similar data sets, and she showed you some things that I'll also show you in a different context.

So the title of this, saying that doing this is hard, is a bit of a reflection of the development process. There are a lot of different ways, and a lot of R packages, that you can use to quantify how well your event-time model works. This is probably why it took us as long as it did to get everything fully implemented in tidymodels: a lot of the time went into research and figuring out what we should be doing, as opposed to just what's out there.

Survival model predictions and evaluation times

So Hannah showed you something similar. A lot of times with survival models, we're more focused on probability predictions, especially the probability of survival, as opposed to a point prediction like the median survival time. Here's an example of three data points like she just showed you. Two of these data points had event times of 17: the red one was an event and the green one was censored. Then we have another one that went a lot further, to 100, and you can see its probability curve is a lot higher than the other two.

And a lot of times when we look at these models, we want to look not so much at when the event happens, but along some range of time: what's the probability that it happens at any given time, or the probability that it hasn't happened up until now? So if we're really focused on probabilities, we might want to make our performance metrics dynamic, in the sense that we look at specific time points and see how the model is doing at each. This is an example where, every 50 days say, we look at how the model performs: does it do really well here at this slice of probabilities versus way out in the tail, where there are fewer events late in the time period? So we'll be looking at time-dependent, or dynamic, performance metrics for these models.

Converting censored data to binary indicators

So Hannah kind of went into this. What people tend to do in these situations is take the actual event-time data, your outcome data, and, given a certain evaluation time (the time at which we'll look at model performance), probably convert it to a binary indicator at that time; I'll show you why that "probably" is there in a second. The way I think about this is a package getting delivered to my house. I order something, say from Amazon, and it's been a day and the package hasn't arrived yet, but I know it's been at least a day. So my event time so far is one day, but right now it's censored at one day.

So what we can do is look at specific time points and convert that data point, censored at one day, to either an event, a non-event, or something else, as we'll see in a minute. Once we've done that, people tend to use standard binary classification metrics to evaluate the models. The two that have mostly stuck in the literature are the Brier score and the good old well-known area under the ROC curve. The Brier score, which I'll talk about a bit if you've never seen it, is really a measure of calibration: how accurate are my probabilities compared to their true values? The ROC curve is more about separation of events versus non-events, which doesn't require very accurate probabilities; they just need to separate the events from the non-events.

We'll talk about both of these, and there are more details at tidymodels.org.

All right, so say we have an evaluation time of tau. I'm looking at packages being delivered, and I have a delivery that actually happened at day two; let's say tau is day three. At day three, when you're evaluating your model, you have to think about three different cases. If the actual time I have is an event that happened before tau, say at day two, we count it as an event, because it's already happened. If my time, whether it's censored or an actual event time, happens after tau, after three days, I know it's not an event, because nothing has happened either way. But say I'm evaluating my model at day three and I have a censored value at day two. I don't know what the real event time is: it could have been 2.1, it could have been five. So we don't know what to do with that. It's really ambiguous.

And so what these metrics do is basically just discard those data points. The interesting thing here is that we're actually throwing away data because we don't know what the answer is. And the amount of data we have at any given evaluation time is going to change as we evaluate these models. That, as you'll see, has a bit of an effect on how we evaluate the models and the trends that we see.

So this is a nice little figure demonstrating the same thing. Say we have a package delivered at day two and then a censored value at day four. If I'm evaluating my model at day one, nothing has happened to the left of that line, so they're both non-events. If I'm evaluating at day three, I did have an event here at day two; the other one happened after three, so at time three it's not an event. And then here's the ambiguous case at day five: this one's an actual event, so it did happen, but the censored one could have happened anywhere between these two lines, or it could have happened much later. So we can't really do anything with that data point.
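The three cases above are simple enough to sketch in code. The talk's implementation lives in tidymodels (R); this is just an illustrative Python version with a hypothetical `status_at` helper, and the convention that an event landing exactly at tau counts as an event is my assumption:

```python
def status_at(time, is_event, tau):
    """Classify one observation at evaluation time tau.

    Returns "event", "non-event", or None. None marks the ambiguous case
    (censored before tau), which is discarded at this evaluation time.
    """
    if is_event and time <= tau:
        return "event"       # the event already happened by tau
    if time > tau:
        return "non-event"   # still event-free at tau, censored or not
    return None              # censored before tau: true status unknown

# The package-delivery example, evaluating at tau = 3 days:
# delivered day 2 -> event; censored day 4 -> non-event;
# censored day 2 -> ambiguous, dropped.
```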

Inverse probability of censoring weights

Now, one thing we know from statistics is that, generally speaking, we can't just throw away data and assume everything's fine. We have to think about the consequences for measuring performance when data is eliminated, in this case for good reason, because we don't know what to do with it. And typically we know that analyzing those data as if nothing ever happened generally leads to bias; we've known that for a long time. There are a variety of ways to handle it, but what people tend to do is take a causal-inference approach and build what they call a propensity model to predict the probability of a data point being missing, then use that score for each data point to weight their metrics and so on. This is a pretty well-known technique; people do it quite a bit, and you've probably seen it if you were in the causal inference workshop.

Now in our context, we're going to do basically that: estimate the probability of something being censored at a specific time tau. There's detail at the end if we have time or need it. We take a very simple approach that assumes non-informative censoring; we don't think there are any predictors influencing the censoring. We can change that later on in tidymodels. But right now we do something that, honestly, I had never heard of, though apparently it's fairly common, called a reverse Kaplan-Meier curve. It's a Kaplan-Meier curve, which we know, but with the event indicators switched, and it gives us basically the probability of remaining uncensored up to time tau.

And so then we use what are called inverse probability of censoring weights: we get that probability for every data point, whether it's an event or a non-event, take its reciprocal, and use that as a case weight when we compute our metrics. We'll see that in just a minute. The best paper on this is Graf et al., here. I've read it like five times just to get my mind around it, and it's probably the best brief reference we have. It's mostly concerned with Brier scores, which is what we'll talk about next.
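A minimal sketch of the reverse Kaplan-Meier idea and the reciprocal weights, in Python rather than the R used in the talk. It ignores tied times and the detail of exactly where G is evaluated for events versus survivors, so treat it as an illustration, not the tidymodels computation:

```python
def reverse_km(times, events):
    """Reverse Kaplan-Meier sketch: Kaplan-Meier with the event indicator
    flipped, so censorings are treated as the 'events'. The result
    estimates G(t), the probability of remaining uncensored past t.
    (Assumes no tied times, for simplicity.)"""
    order = sorted(range(len(times)), key=lambda i: times[i])
    n_at_risk = len(times)
    g = 1.0
    curve = []  # step function as (time, G(t)) pairs, in time order
    for i in order:
        if not events[i]:  # a censoring: the flipped 'event'
            g *= (n_at_risk - 1) / n_at_risk
        curve.append((times[i], g))
        n_at_risk -= 1
    return curve

def g_at(curve, t):
    """Evaluate the step function G at time t."""
    g = 1.0
    for time, value in curve:
        if time <= t:
            g = value
        else:
            break
    return g

def ipcw(curve, t):
    """Inverse probability of censoring weight: the reciprocal of G(t)."""
    return 1.0 / g_at(curve, t)
```

Note how the reciprocal blows up when G(t) gets close to zero; that is exactly the "explosion" of case weights discussed below for late evaluation times.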

So here's an example with a churn data set. I have a 1,000-point validation set, and I'm going to look at evaluation times going from about 5 to a little over 200. The coloring here shows how many data points can be included in the analysis because they're not ambiguous, and the line shows the sum of the weights at every evaluation time. It may be hard to tell from the colors, but the number of data points I have for analysis drops from 1,000 to maybe about 250 once you get out around day 200. That's not uncommon in the data sets I've looked at, although this particular trend isn't really reproducible from data set to data set. But it's interesting: things are pretty stable up until about day 100, and then you have a much smaller number of data points, meaning fewer people, each with larger weights, and then an explosion in the weights. What happens with propensity scores sometimes is that if you invert a probability close to zero, you get case weights of 90 or 100 or something pretty large, and that's what's happening here.

Breyer score

All right, so the Brier score, as I mentioned, is a measure of calibration. You can think of it as mean squared error, but for probabilities. In basic classification, you take these Y's here, the 0/1 indicator. So if I'm looking at class one, it would be a one here minus the probability of being in class one. If your model were perfect, the probability would be one, you'd subtract one from one, and get zero. So the Brier score is best at zero; that means you have a perfect model. With two classes, the danger zone, where my model is really not all that great, is about 0.25 and above. So that's the basic version for classification.

In our situation, with events and non-events, we're looking at a two-class problem, and we do the same calculation. Y equal to one here means an event, and the probability here is one minus the probability of surviving; that's what happens when you have an event. If it's not an event, there's a zero here, and it's just the survival probability. And then these are the case weights I mentioned. It's the same calculation, but instead of dividing by n, we divide by the sum of the weights at time tau. And that sum of weights, as you just saw in the last figure, is going to vary over time.
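Putting those pieces together, the weighted Brier score at a single evaluation time can be sketched like this. Python for illustration only; the actual computation in the talk is yardstick's `brier_survival`, and the function name here is made up:

```python
def brier_at(tau, times, events, surv_prob, weights):
    """IPCW Brier score at evaluation time tau.

    surv_prob[i] is the model's predicted probability of surviving past
    tau. Ambiguous observations (censored before tau) are dropped, and
    the weighted squared errors are divided by the sum of the weights
    rather than by n.
    """
    num, wsum = 0.0, 0.0
    for t, ev, s, w in zip(times, events, surv_prob, weights):
        if ev and t <= tau:        # event by tau: y = 1, survival should be 0
            num += w * s ** 2
            wsum += w
        elif t > tau:              # event-free at tau: y = 0, survival should be 1
            num += w * (1.0 - s) ** 2
            wsum += w
        # else: censored before tau -> discarded
    return num / wsum
```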

Computing metrics in tidymodels

So how do we do this in tidymodels? Hannah showed you a little of this, so it's maybe not new, or at least not more than 10 minutes new to you. Instead of using predict, we can use augment on our model fit. I give it my validation set, and I'm going to use a fair number of evaluation times here, to be honest, just to have nice plots. The difference between what augment gives you and what predict gives you is that augment appends two extra columns: the probability of being censored and the case weight. It's another list column, like the one she showed you before that had 24 rows; in this case it has, I think, 225 rows, because I used a lot of evaluation times.

And then yardstick has a bunch of metrics with names ending in survival. We have a Brier score for classification, so if you want to do that for survival models, you say brier_survival, give it the data set, tell it where the predictions are, and it gives you the standard yardstick format with a metric name. There's now an evaluation time field here that says, at 30 days, my Brier score is this, which is really low. And I've just filtered to show you a few rows.

Now, you can look at these at the time points most interesting to you and figure out: is my model good enough? Where is it good? Where is it not? If you don't really want to think about the dynamic nature of this, the other thing you can do is brier_survival_integrated, which basically takes the area under the curve and normalizes it. And honestly, the value here is not bad for a Brier score: it's 0.07, well below 0.25, which is good. What does it look like over time in this particular case? Early on, the model does really well; the score is virtually zero. Then it gets a little worse, peaks around 0.25 at time 200, and then comes back down and gets a little better. The early-on part seems like a pretty reproducible trend across data sets, but where your model works and doesn't work is, again, going to be data set dependent.
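One hedge on "takes the area under the curve and normalizes it": I'm assuming a trapezoid rule over the evaluation-time grid, normalized by the width of the window, which is the usual definition of an integrated Brier score. A sketch under that assumption:

```python
def integrated_brier(eval_times, brier_scores):
    """Time-averaged Brier score: trapezoid-rule area under the
    Brier-vs-time curve, normalized by the evaluation window width.
    eval_times must be sorted and contain at least two points."""
    area = 0.0
    for i in range(1, len(eval_times)):
        dt = eval_times[i] - eval_times[i - 1]
        area += dt * (brier_scores[i] + brier_scores[i - 1]) / 2.0
    return area / (eval_times[-1] - eval_times[0])
```

The normalization is what lets you read the result on the same scale as a single-time Brier score (hence comparing 0.07 against the 0.25 danger zone).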

Calibration plots

So I was wondering how to convey why this is the way it is, and what I decided to do is take a few different evaluation times and make what we'd call a calibration plot. With a regular classification model, the typical thing people do is take their predicted probabilities, order the data by those probabilities, and form, say, 10 bins. So the highest bin takes samples with probabilities from 0.9 to 1. You group them and compute their actual event rate, the mean of the zeros and ones for those data points. If the model is well calibrated, that rate should be about the midpoint of the bin: for the 0.9-to-1 bin, the midpoint is 0.95, so if the model is doing well, the actual event rate should be about 0.95.
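That binning procedure can be sketched in a few lines; the midpoint-versus-observed-rate comparison is the whole idea. A Python illustration (the talk's plots were made in R, and `calibration_bins` is a made-up name):

```python
def calibration_bins(event_probs, outcomes, n_bins=10):
    """Group predictions into equal-width probability bins and compare
    each bin's midpoint to the observed event rate; a well-calibrated
    model has observed rates near the midpoints."""
    bins = []
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        in_bin = [y for p, y in zip(event_probs, outcomes)
                  if lo <= p < hi or (b == n_bins - 1 and p == 1.0)]
        if in_bin:
            midpoint = lo + 0.5 / n_bins
            bins.append((midpoint, sum(in_bin) / len(in_bin), len(in_bin)))
    return bins  # (bin midpoint, observed event rate, count) per non-empty bin
```

The count in each tuple matters for the plots that follow: a badly calibrated bin with few points barely moves the Brier score.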

So you get this curve, and I did it over a couple of different evaluation times. Looking at an early time point, most of the data has a very, very high probability of being an event; the size of the points conveys how many data points are there. Remember, this had a very low Brier score. Even though some points are well off the curve, they don't have anywhere near as much sample size as this one. So basically, the curve does really well early on because almost everything is an event, and those are being easily predicted as events; the few non-events don't have enough sample size to really affect things much.

And as time went on, we saw the Brier score get a little worse. You can see we're filling out more of the probability space here, and the points aren't really on that line. They're still up here a fair amount, but you're slowly getting more data points down here that aren't along the line, so it's steadily getting worse. Which, I don't know, doesn't seem that bad to me. But you can see it's just broken at 200, right? Some points are all over the place, and it's particularly poorly calibrated at that point. Now, this doesn't necessarily mean it's a poor model. If you need accurate probabilities, maybe this isn't the model you want. But if your goal is to differentiate who's an event and who's not at a specific time, we'll see what that looks like in the ROC curve. Calibration and discrimination don't always have to match.

ROC AUC for survival models

All right. So we're probably familiar with this: the area under the ROC curve. The calculations are pretty straightforward; it's pretty much what we would have done before, except for incorporating these case weights. When we calculate the sensitivity and the specificity at a particular probability threshold, we don't have counts in the numerator and denominator. For the sensitivity, we have the total sum of the weights in the denominator and the sum of the weights for the events in the numerator. And conversely, you can imagine what we do for specificity.
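Equivalently, a case-weighted AUC is the weighted probability that a randomly chosen event outscores a randomly chosen non-event. A brute-force O(n^2) Python sketch of that view (an illustration of the weighting idea, not yardstick's threshold-sweeping implementation):

```python
def weighted_auc(event_probs, labels, weights):
    """Case-weighted ROC AUC: the weighted fraction of (event, non-event)
    pairs that the model orders correctly, with ties counting one half.
    labels are 1 for events, 0 for non-events."""
    num, den = 0.0, 0.0
    for pi, yi, wi in zip(event_probs, labels, weights):
        if not yi:
            continue
        for pj, yj, wj in zip(event_probs, labels, weights):
            if yj:
                continue
            w = wi * wj            # each pair counts by the product of weights
            den += w
            if pi > pj:
                num += w
            elif pi == pj:
                num += 0.5 * w
    return num / den
```

With all weights equal to one, this reduces to the ordinary AUC.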

The reference here is this one. There are a lot of references that do this inverse probability weighting, and they all do it slightly differently; we feel the Graf paper is about as good as it gets. There are additional things we do in some of the details that maybe these folks didn't, but we apply basically the same process as we did for the Brier score; we don't use a different weighting scheme for the ROC curve. And again, it measures separation between the events and non-events.

As you might imagine, the code is pretty boring: it's roc_auc_survival, same format. And if I look at that over time, similarly, it's nearly perfect at the start, then degrades until about 200, where it reaches its nadir, and then increases. But the bottom of the ROC AUC curve here is about 0.75, and I know people who would kill to get an ROC AUC of 0.75. So maybe this isn't bad for discrimination, even if it's not great at calibration. So there are our results for that metric.

Conclusions

So basically, the conclusion is there's good news and bad news. To me, the bad news is that evaluating these models is actually pretty complicated. We're fitting a separate censoring model just so we can quantify how well our model does, and when we eventually add informative censoring, it's going to get even more complicated. So we have this significant weighting scheme, and we have to adapt our metrics to accommodate it. But really, the good news is, I think, like most of tidymodels: we can take something that's fairly complicated or computationally intensive and give you an API that's very, very simple. If you want to do something different, or want to look at the details, like those calibration plots I made, you have all the data there in the augment method. So it's not like we're obfuscating anything or preventing you from doing anything on your own. What we want to do in tidymodels is take what we think is really the right thing to do with the data, even if it's complicated, and give you a syntax that's very simple, so you don't have to worry about that too. So that's pretty much it. Thanks to the group, Hannah, Mio, and Simon, who you will see more of in a minute. And thank you.

Q&A

Thank you, Max. We have just a little bit of time for questions. The first question is around the packaging of some of the time-to-event functions: how does the team decide whether functionality makes it into one of the core packages? For example, proportional hazards is in parsnip, while other engines are in the censored package, and the Brier score for survival models is in yardstick. Yeah, some of that's easy to describe. A lot of the time we're in the process of figuring out the best way to do it, and honestly, a lot of times it's driven by groupings of models. So it makes sense to have the model engines for event-time data, for the most part, in censored.

I think bonsai has some too. But basically, a lot of times, to be honest with you, it's governed by package dependencies. parsnip already has a lot of dependencies as is; we don't want to make that a hundred. So sometimes we modularize our packages so that they're easier for you to install and easier for us to manage. We try to keep a pretty rational grouping of what goes in which packages, but opinions may vary on that.

Questions are coming in fast and jumping around on me, so I'm struggling to keep up, but: is it ever a good idea to post-process time-to-event predictions to improve calibration? Yeah, well, probably. That's a joke, because when it does happen, it'll be in the probably package. We see a lot of that in the literature. We've implemented some calibration routines in probably for regression and classification, and we have every intention, I'm not sure when, of extending that to survival models. Because where I see calibration mattering the most is within these types of models.

As for how we package it: this guy you're going to see in a minute, Simon, and I are working on the whole post-processing aspect of tidymodels. Well, mostly him. The idea is that you can potentially attach a post-processor, like a calibrator, to a model workflow and just have that happen in the course of building and tuning your models. That's the intent. We'll probably have it sooner for classification and regression than for survival models, but our intent is absolutely to do that. Very good. Well, thank you, Max. Thanks.