Resources

Hannah Frick - tidymodels for time-to-event data

Time-to-event data can show up in a broad variety of contexts: the event may be a customer churning, a machine needing repairs or replacement, a pet being adopted, or a complaint being dealt with. Survival analysis is a methodology that allows you to model both aspects, the time and the event status, at the same time. tidymodels now provides support for this kind of data across the framework.

Talk by Hannah Frick
Slides: https://hfrick.github.io/2024-posit-conf/
GitHub Repo: https://github.com/hfrick/2024-posit-conf

Oct 31, 2024
19 min

image: thumbnail.jpg

Transcript

This transcript was generated automatically and may contain errors.

So, yes, I work on the tidymodels team at Posit. In a previous role I did data science consultancy, and one of the clients we worked with was a subscription-based business. They were just starting out, building up their data team, and several of the questions they wanted to answer with data centered around the customer journey.

So one of them was who is likely to cancel their subscription any time soon, and another thing they were interested in was how long somebody would stay as a customer overall. And they did not use survival analysis for this, even though you can answer both of these questions with one model.

And the reason that they didn't was that they didn't think it applied. So if you were just suddenly a little bit surprised about why the hell is she talking about survival analysis, I get it.

And that's why I'm here with a little bit of a push for a slight rebrand, because I would like to talk about this through the lens of the data. And the data is time-to-event data, and that event can really be a lot of different things. It could be a customer churning. It could also be your employees leaving, or a machine needing repairs or replacement. It could be pets getting adopted. It could be complaints being processed, or, yes, death, hence the maybe overly successful name for the methodology here.

But I'm most definitely not here to point fingers at that data science team, because we very famously know all models are wrong, and some are useful, and they were definitely getting some use out of the models that they used. I'm just here to offer you methodology tailor-made for this problem, sort of forged in the fires of situations where it is really about life and death.

And I want to build some intuition with you here for that kind of data, what kind of challenges come with it, and how that methodology addresses it. And of course, I'm also here to show you how to do that with tidymodels. We've worked on integrating that in the framework, so that it just walks and talks and feels like other elements of tidymodels, essentially giving you another tool in your toolbox for situations when you encounter this type of data.

Why standard tools fall short

So let's look at these questions that they had, and why maybe some standard tools don't quite give you what you're actually looking for. The first one of these was how long somebody is going to stay as a customer. We might have several observations over a length of time. People show up, have a subscription, eventually cancel it and leave. Some stay longer, some stay shorter. That is a numeric value, so you might be thinking, well, I'll do a regression on this. Yes, you can.

But you should keep in mind that that time window is the observation time. It is not time to event, the time until somebody cancels. There's no notion of an event if you're just looking at this, right? So let's incorporate that in my cute little drawing. We might have situations where somebody comes and then eventually leaves. That's the black dot; that's where we observe the event that we're interested in. And in that middle row, that person showed up and is still here when we're at "now". So they're still a customer. That's pretty good.

This type of observation is what we call censored, because we haven't observed the full time until the event. If you take the observed time here, run your model, and interpret it as time to event, you're making an implicit assumption that everything is an event, including the customer that is still here. So you're underestimating the length of time they stay, just because you're saying, oh, it's now, it's a cutoff, I'm saying you left.

So maybe that's not ideal, and that's not what you want to be doing. An alternative approach that possibly comes to mind here is to say, okay, well, then I'm just leaving out those observations. But that information is there, and you're not randomly leaving out data. You're leaving out very specific observations. So you're biasing your results in that regard, too.

As a side note, you could also say, okay, fine, I'll wait until I see all of my observations, all of my events. Especially in that example, that is also maybe not what you want to be doing, waiting until all of your customers have left, right?

So you can do things, but I hope I'm getting the point across that they don't quite fit. When we look at the other question, who is likely to stop being a customer, we can say, yeah, that is a binary event. We know how to model binary things: give me a classification model. But you're ignoring the time aspect here, and that matters, because then the question actually becomes, who is likely to stop being a customer while we observe them? And that observation window can differ wildly.

To gather up the aspects from both of these examples: we're met with the challenge that our outcome has two aspects, time and event status, and our outcome might be censored. Then we have an incomplete observation, but it's not a missing observation. That is a difference, and we would like to be able to account for it. Classification and regression, the standard tools, are great for many things, but they're not directly equipped to deal with either of these two challenges.

Building intuition for survival analysis

I've sort of given it away already, but survival analysis is the methodology that is made for these situations. It can simultaneously consider if events happen and when they happen, so it captures both of these aspects. Now, I'm not going to derive survival analysis from first principles in this talk. But I said I do want to build some intuition.

Our last attempt at this was modeling the event outcome, but we had these different time windows. So, okay, let's look at these time windows and try to get closer to a good model. So far I've just been sliding these around; let's start them at the same time. And then I'm going to put a time window over this, and we can try to model who is leaving in that window. That observation on top is clearly leaving. The other two are there at the start and are still there at the end, so they're not leaving. It's a non-event for this time window.

That could be one model. And we could slide the window over and do another one. Now we encounter a censored observation. For that one, we don't know if they would still be around at the end of the window. We just don't know; we don't have the information. So maybe we're not including it in this one. But mind you, we were able to include it in the first one. So that's at least a little bit better than before.

And looking at this in different time windows gives us different estimates of the probability of having an event, or, consequently, the probability of not having an event. Or, in classical methodology speak, the probability of survival: you haven't had your event yet. So we're starting to build up the probability for one time window after another. And we could combine these and get a view over all of them. Having this step function, this probability of survival over time, now captures both the time aspect and the event aspect.

Now, we are not going to build a series of models like this. That's walking in the right direction, but it's not quite it. It does highlight two of the central ideas of survival analysis, though. For one, the thing that you're modeling is that survival curve, or derivatives of it, because it captures both components of the outcome that you're interested in. And for another, the censored observations are partially included rather than discarded outright. So that's how survival analysis, as a methodology for time-to-event data, addresses these unique challenges inherent in how the data just exists.
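That step-function probability of survival over time is, in classical terms, the Kaplan-Meier estimator. As a quick illustration (not code from the talk; the toy times and statuses below are made up), base R's survival package can compute it:

```r
library(survival)

# Toy time-to-event data: follow-up time in months, and event status
# (1 = event observed, e.g. the customer cancelled; 0 = censored)
time   <- c(2, 5, 5, 8, 12, 12, 15, 20)
status <- c(1, 1, 0, 1,  0,  1,  1,  0)

# Kaplan-Meier estimate: a step function for the probability of
# survival over time that partially uses the censored observations
km <- survfit(Surv(time, status) ~ 1)
summary(km)

# plot(km) draws the step function with a confidence band
```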

Fitting a model with tidymodels

And you didn't come for a stats lecture, I know. So I do have some code for you. I have a churn data set here. The first column is tenure, how long somebody's been a customer at this company. The second one is churn, the binary event, yes or no. And the first thing we're going to do with this has nothing to do with tidymodels, but very much with this type of data: we're taking these two columns and combining them into what is called a Surv object. That is the standard data structure for this type of data, and it will be our response.
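That first step might look like the following sketch; the column names and values here are made up for illustration, not the actual churn data set from the talk:

```r
library(survival)

# Illustrative churn data: tenure in months plus a yes/no churn flag
churn_df <- data.frame(
  tenure = c(1, 34, 2, 45, 8),
  churn  = c("yes", "no", "yes", "no", "yes")
)

# Surv() bundles time and event status into a single response object;
# the second argument is TRUE when the event (churn) was observed
churn_df$surv_outcome <- Surv(churn_df$tenure, churn_df$churn == "yes")

churn_df$surv_outcome
# censored observations print with a trailing "+", e.g. 34+
```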

And then, yes, we are doing tidymodels here. We can start with the very classic initial split into training and testing data, and then we are going to fit one of these models. I'm sprinkling in some preprocessing with recipes: here we are removing all of the predictors with zero variance, because they don't tell us anything about our problem.

And then we have a little bit that's different from classification and regression. We have a different type of model: a proportional hazards model. That is one of the most common ones, so I sort of think of it as the cousin of the linear model, the workhorse that you can drag out; it's a good starting point. What is also different is the mode we set here. We're not setting the mode to classification, and not to regression, because that's what I've been talking about for a good 10 minutes. It is censored regression, to indicate that our outcome here is a little bit different; we're trying something a little bit different.

But then we're back in very familiar tidymodels territory. We take our recipe and our model spec and put them into a workflow, and then we can fit the thing on the training data.

Now we have a fitted model object, and I sort of teased already that we can answer both of the questions we had with the one model. So how long is somebody going to stay as a customer here? We can predict that with the prediction type time, shorthand for survival time. It gives you a numeric value for how long somebody is likely to stay. And the other question was, who's likely to stop being a customer? We're going to flip that around and look at who's likely to stay. That is survival, or the probability of survival. And that needs to be calculated at a particular time point. That's why we had the curve.

Unfortunately, over all of the time points, at the very end, the probability of survival is zero for all of us. That is the serious note that you need to put into any talk about survival analysis. Here we're calculating that probability of survival at different time points, monthly from month one up to 24, for a two-year time frame. And then we get this thing back that has a bunch of tables in a list column. That is because for one observation, we're getting 24 prediction points back.

If we take one of these little tables and inspect it a little more closely, we can see that one column is the evaluation time, and the other one is the predicted probability of survival. And then we can take these and draw the curves if we want to look at them. So here are some example ones based on that prediction. And you can see that the patterns are different. People don't all leave after the same amount of time, and therefore the curves also look different.
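Putting the whole pipeline together might look roughly like the sketch below. This is not the code from the slides: the toy data, column names, and some details are my own, so treat it as illustrative. It uses the censored package, which provides the survival models for parsnip:

```r
library(tidymodels)
library(censored)   # survival models for parsnip
library(survival)   # for Surv()

# Toy stand-in for the churn data (names and values are made up)
set.seed(403)
n <- 200
churn_toy <- data.frame(
  contract = sample(c("monthly", "annual"), n, replace = TRUE),
  charges  = round(runif(n, 20, 100), 2),
  constant = 1,                      # zero variance, for step_zv() to drop
  tenure   = round(rexp(n, 1 / 12)) + 1,
  churned  = rbinom(n, 1, 0.6)
)
churn_toy$surv_outcome <- Surv(churn_toy$tenure, churn_toy$churned)
churn_toy$tenure <- churn_toy$churned <- NULL

split <- initial_split(churn_toy)    # classic train/test split

rec <- recipe(surv_outcome ~ ., data = training(split)) |>
  step_zv(all_predictors())          # drop zero-variance predictors

spec <- proportional_hazards() |>    # the Cox "workhorse" model
  set_engine("survival") |>
  set_mode("censored regression")    # the mode for this kind of outcome

wflow_fit <- workflow() |>
  add_recipe(rec) |>
  add_model(spec) |>
  fit(data = training(split))

# "How long will they stay?" -> one predicted survival time per row
predict(wflow_fit, new_data = testing(split), type = "time")

# "How likely are they to still be here?" -> probability of survival,
# evaluated monthly over two years; 24 predictions per observation,
# returned as one small table per row in a list column
preds <- predict(wflow_fit, new_data = testing(split),
                 type = "survival", eval_time = 1:24)
preds$.pred[[1]]   # columns: .eval_time, .pred_survival
```

The two predict() calls answer the two business questions from the start of the talk with the same fitted model.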

tidymodels for time-to-event data

So tidymodels for time-to-event data means there are a few specialized parts. For one, the models: your outcome is different, so you need a different model to estimate how this outcome is connected to your predictors. In tidymodels, you can use parametric models, semi-parametric models, and tree-based models for this. We also have different prediction types for this slightly different modeling problem. You've seen the two that we have for all of the models, which are the survival time and the probability of survival. And then, depending on the model or the engine, you might get some other ones which are related to these.

And if you have an outcome that has a special form, and predictions that have a special form, and you want to assess how well those models do, you also need special types of metrics. And that is what Max is going to be talking about after me.

I get to finish on a slightly different note, which is that tidymodels for time-to-event data is not just different types of models in censored and parsnip, and different types of metrics in yardstick. We also did a little bit of work, mostly behind the scenes, on workflows and tune, so that you can combine these models with recipes and resampling and orchestrate everything in tune. The full chain of tidymodels is available.

And because it is a framework and things should work with each other and click together, you also have the possibility to combine your time-to-event data analysis with some of the other specialized tools. So if you have spatial data, you can do spatial resampling; if you have text data, you can generate predictors from it, in combination with your time-to-event model.

So when I say it's tidymodels for time-to-event data, I do mean tidymodels for time-to-event data. Thank you.

Q&A

Thank you, Hannah. And we do have questions being taken in Slido for both live and virtual attendees. The link there is on the screen. We have at least a couple of questions since I've last checked. I love this first question. How does survival analysis perform in a data set with meaningful seasonality, such as January versus November?

I don't think that that is necessarily a question of survival analysis; the seasonality is sort of an effect of your predictors. So a step from recipes to turn your dates into appropriate indicators is something that you can use and combine here. That could be a good first approach to including the seasonality.
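As a sketch of that suggestion (the data and names here are illustrative, not from the talk), step_date() from recipes derives calendar features such as the month from a date column, which can then carry seasonal effects into the model as ordinary predictors:

```r
library(recipes)

# Toy data with a signup date (illustrative)
df <- data.frame(
  signup_date = as.Date(c("2023-01-15", "2023-06-01", "2023-11-20")),
  outcome     = c(10, 5, 8)
)

# Derive a month feature from the date; dummy variables could follow
rec <- recipe(outcome ~ signup_date, data = df) |>
  step_date(signup_date, features = "month") |>
  prep()

bake(rec, new_data = NULL)
# gains a signup_date_month factor column (Jan, Jun, Nov)
```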

And once you you have the predicted probability of survival over time for each customer, in your churn example, how do you use that to draw inferences about how long a given customer will stay? Is there a probability threshold or some other sort of technique?

So for one, typically the way people boil this down to one number is to say, well, where's the 50% threshold? What time is that? That's your median or mean survival time, which is the numeric prediction that you also saw. I think the other interesting use of this is to say, right, we know that person has already been here for so long. What is the probability that they'll leave tomorrow? And if that is high, do I want to do something about it? Do I want to contact them? Do I leave them alone? Do I not spend resources on pestering people who I think are very likely to stay? That kind of decision you can make, and you have a probability to go with it, rather than, oh, they're going to be here for 10 months and, like, we're getting close, do we need to do something?