Resources

Tidymodels: Now Also for Time-to-Event Data! - Hannah Frick

The tidymodels framework is a collection of packages for modeling and machine learning using tidyverse principles. In addition to regression and classification, it now also supports censored regression for time-to-event data. This type of data, with its potential censoring, requires dedicated models and performance metrics from the field of survival analysis. While the censored package has made survival models available for a while, the recent addition of survival metrics to the yardstick package has enabled us to support this type of analysis across the entire framework. The same ease of use and vast functionality, from resampling and feature engineering to tuning, is now available for this additional modeling problem. Hannah Frick is a software engineer on the tidymodels team at Posit. She holds a PhD in statistics and has worked in interdisciplinary research and data science consultancy. She is a co-founder of R-Ladies Global.

Oct 21, 2024
22 min

image: thumbnail.jpg

Transcript

This transcript was generated automatically and may contain errors.

Okay, thank you for joining us for this session. So it's my pleasure to welcome Hannah Frick, who is a senior software engineer at Posit. She's on the tidymodels team, holds a PhD in statistics, and has worked in interdisciplinary research and data science consultancy. She's also a co-founder of R-Ladies Global. The floor is yours.

Thank you. Hello. Yes, so I work on the tidymodels team at Posit, so it is probably no surprise that I'm also going to be talking about tidymodels. But we are going to be talking about time-to-event data a little bit before that, because that makes sense, right?

Introduction to time-to-event data

So what is that? I saw a bunch of people coming in just now. If you were here for the previous talk, you're gonna recognize elements from Mirko's nice talk here, but I'm gonna be zooming out a little bit and taking you along on a journey. So time-to-event. I brought a data set with customer churn, where the event is your customer churns, and I am NOT an English native speaker. I had to learn that that means the customer leaves. So that's the event. It could be pretty broad, though. It is not always death, is the key message, and if you hear survival analysis, you think medical research for most of the cases, but we're taking this more generally. Our latest example data set is pet adoption. So time-to-pet-adoption, slightly different in speed than time-to-death. But here we're gonna use this data set for customer churn. You have the tenure, which covers the time, and the churn variable, which is a factor indicating whether they are still a customer or not.
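As a rough sketch of the kind of data being described, here is the Telco customer churn data from the modeldata package. This is an assumption on my part; the transcript doesn't name the exact data set, only the `tenure` and `churn` columns.

```r
# Illustrative only: modeldata's Telco churn data has the two columns
# the talk describes (an assumption; the talk's data set isn't named).
library(modeldata)

data(wa_churn)

# The two aspects of time-to-event data:
# - tenure: how long each customer has been observed (the time)
# - churn:  whether they have left (the event status)
head(wa_churn[, c("tenure", "churn")])
```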

So coming from the situation of you have that data set in front of you, what might you want to do with it? Likely some predictions in my setting, but what exactly? For once, you might want to predict how long somebody is going to stay as a customer, but you might also be interested in knowing who is likely to churn, who's likely to stop being a customer. So we're gonna be looking at both of these questions.

Let's take the first one, and how long? That's a duration, that's a numeric thing. You might walk up to this and be like, yeah, I'll do a regression on it. So why don't we just use that time? If we want to predict the time, let's use the time and use that as our response. The slight catch here is that this is the time you observe. This isn't necessarily time-to-event, right? There's no notion of event if you just take the time. Because what we actually have is not just this, but we do have a customer that stays a while and then eventually leaves. That's the event, that's like the bottom line. But you also have two other customers that came on a little later, and they're still customers now when you want to make your analysis. So if you just take that time and model that and say it is time-to-event, then you implicitly also say everything is an event, right? And the customers that are still with you at this point, you pretend they just left. So you're sort of underestimating their tenure, right? You get something, but not quite what you want is the key takeaway.

You could argue like, okay, well if these are not complete, I'm gonna remove them. But you're also gonna bias your analysis if you do that. You might be tempted to say, okay, then I'll wait until I've got everything. Also not always practical. Sometimes you need answers now, and in this example you hopefully will never be in the situation that all your customers have finally left before you start modeling, right? So that part is not quite where you want to use regression analysis.

The other question has a similar not quite right feel to it. So if you look at it and you're like, okay, well I have a binary event. Why don't I model it as that? Yes, you can, but the time aspect doesn't just disappear. If you just use this, you're leaving that out, and you're essentially modeling who is likely to stop being a customer while we observe them. That's not really what you want to be doing, right? Maybe close, but maybe not quite, because that time frame of how long you observed them can be quite different, and if somebody showed up and signed up to your business a day ago, that's different than if they've been around for 10 years. And that doesn't capture it.

So to put these things together, the quite obvious thing that is still easy to ignore is that time-to-event data inherently has two aspects: the time and the event status. And you might have what we call censoring. That was the case where the event hasn't happened yet. So that's incomplete data, but it is not missing data, and that's different, and you would like to use it for as long as you can, for as much as you can. And regression and classification are great, but they model one of these aspects separately, and they don't really account properly for the other aspect that you have.

I am most definitely not here to scold you if you have done this, because we famously know all models are wrong. I'm just here to make the invitation that there's something that gets you a little closer, and that thing is survival analysis. So despite the very medical-sounding name, this is applicable in a lot of cases. And the interesting part about it is that it can simultaneously consider if an event happens, and when that event happens. So that's my invitation for you to look at that methodology, even if you're looking at something that is very much non-medical.

The tidymodels framework

The other part I'm talking about is tidymodels, unsurprisingly. It is a framework for modeling and machine learning using tidyverse principles. So what does that mean? It means it has infrastructure covering the predictive modeling workflow: so we have something for resampling, there's code for pre-processing in a very elegant way, there are the models, there are the performance metrics, and there's something to orchestrate all of that into tuning models. That's the core of it, but that's not all. It is extensible. We have done that a couple of times, and our community has done that in parts. So there is specialized stuff for spatial data, thanks to our then intern Mike Mahoney. There are some more specialized methods for tuning, thanks to Max. For recipes, there are extensions for text data, and unbalanced data, and factors with a whole lot of levels, and then the bulk of extensions sits around the models, where you can get specialized models for different tasks. The one I'm going to point out here is the censored package, because that holds the parts for survival analysis and censored regression.

So that was a whole bunch of lovely, very beautiful, colorful hex stickers, but the point is that if you do predictive modeling, there's a whole lot of things you have to do that aren't actually focused on your modeling question. All of the resampling, and then doing your preprocessing, and your model fitting on the right resamples, and then predicting, and evaluating it, and calculating the performance metrics, and aggregating all of that. You do that every time, but that is not what really interests you in this case. So a framework takes that off your plate. It provides the infrastructure, and you can use the parts, and you can focus on what's really interesting. And tidymodels is built with tidyverse design principles, so if you're comfortable with the tidyverse, you should feel at home with tidymodels as well. It follows the same care in API design and consistency in function naming, and gives you a unified front, so you don't necessarily have to learn the individual packages. That helps if you're in a place where you would like to give different models a go to find out what fits your data, rather than having very, very specialized concerns and modeling requirements. So that is what we try to make easier for you, so you can focus your attention on what does actually fit well here, what can I get by making clever features, and so on.

Survival analysis in tidymodels

So that was general tidymodels. Survival analysis now works across all of this, and it pops up its head in specific places. That is, for one, the models, rather obviously. It is in the prediction types, and because of that, consequently, also in the metrics. So I've listed some of them here. These look very familiar from the previous talk. Most of them come up in classification as well, but these are the survival-specific versions. But that is just the specificity here. The offer is to use it together with the rest. So tidymodels for survival analysis is tidymodels, essentially. And you can go combine the different elements of the framework, so you could be using text pre-processing to make predictors out of that for your survival analysis model, and sort of put the building blocks together.
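The survival-specific metrics mentioned here live in yardstick. A minimal sketch of bundling them, assuming the metric names from recent yardstick releases (the dynamic metrics take an evaluation time; concordance is a static metric on predicted survival times):

```r
library(tidymodels)

# Survival-specific performance metrics from yardstick:
# - brier_survival and roc_auc_survival are evaluated at specific time points
# - concordance_survival is computed from the predicted survival times
survival_metrics <- metric_set(
  brier_survival,
  roc_auc_survival,
  concordance_survival
)
```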

Code walkthrough

But it wouldn't be useR! if there wasn't code, right? So let's do a minimal version of this. Welcome back to the customer churn data set. We're gonna load tidymodels, we're gonna load the extension package for it, and the first thing we're doing is not tidymodels at all: we're gonna make our response variable, the outcome, a Surv object. That is what captures both of the aspects: the time, here in the tenure variable, and the event status, which is here in the churn variable. And then we're gonna dip into our familiar workflow. We're gonna make a split into training and testing data. We are gonna make a very minimal recipe here. This one just removes predictors that have zero variance, because they don't tell us anything, and I'm keeping it short to keep it on one slide. Where it does get survival-specific is the model. So this is a parametric survival model. There are other options. They all have a mode "censored regression". It could be a Cox proportional hazards model. It could be a tree-based model, like a decision tree or random forest. Those work for other things as well, so that's why you have to declare that it's censored regression. And we're gonna use the survival package to actually estimate this.
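The steps described here can be sketched roughly as below. This is my reconstruction, not the speaker's slide: the data frame `customers` and its column names are illustrative, following the tenure/churn description above.

```r
library(tidymodels)
library(censored)   # extension package with survival models
library(survival)   # for Surv()

# Assuming a data frame `customers` with a numeric `tenure` column and a
# `churn` factor, as described in the talk; names are illustrative.
customers <- customers |>
  mutate(event_time = Surv(tenure, churn == "Yes")) |>
  select(-tenure, -churn)

set.seed(403)
churn_split <- initial_split(customers)
churn_train <- training(churn_split)
churn_test  <- testing(churn_split)

# Minimal recipe: drop zero-variance predictors
churn_rec <- recipe(event_time ~ ., data = churn_train) |>
  step_zv(all_predictors())

# Parametric survival model, estimated via the survival package
churn_spec <- survival_reg() |>
  set_engine("survival") |>
  set_mode("censored regression")
```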

And that was the part where it's different to, quote-unquote, before. And now we're sort of back with the general workflow. We're taking these two things, combining them into one thing that we call a workflow, just to make sure they get estimated together. And in our case, we're just gonna fit one model on the training data, because I would like to show you that you can answer both of your initial questions with that one model. You don't have to build one regression model to answer the first question, and a classification model to answer the second question, and try to work out what works best for each of these. You can focus on just making one. And you can get predictions for how long somebody is going to stay as a customer. That is the prediction type "time", which refers to the survival time. And for the second question, who is likely to stop being a customer? Or the inverse, who's likely to stick around? That is the prediction type "survival", the survival probability. That you need to evaluate at certain times, because the survival probability over all of time is, unfortunately, eventually zero. So not very interesting to model on its own. But you can answer questions like: who's likely to still be a customer after 12 months or 24 months? And it gives you this lovely little table, because you get two prediction values per customer. This is a list column, and it has little tibbles in it. If you pull one of these out, you can see it has the evaluation time, and then the probability of survival that corresponds to it.
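In code, the fit-and-predict part might look like the sketch below, assuming a recipe `churn_rec`, a model specification `churn_spec`, and training/testing splits `churn_train`/`churn_test` as described in the walkthrough (names are illustrative):

```r
# Combine recipe and model spec so they get estimated together
churn_wflow <- workflow() |>
  add_recipe(churn_rec) |>
  add_model(churn_spec)

churn_fit <- fit(churn_wflow, data = churn_train)

# Question 1: how long is somebody going to stay? (survival time)
predict(churn_fit, new_data = churn_test, type = "time")

# Question 2: who is likely to still be a customer after 12 or 24 months?
preds <- predict(churn_fit, new_data = churn_test,
                 type = "survival", eval_time = c(12, 24))

# .pred is a list column: one little tibble per customer, with a row
# per evaluation time and the corresponding survival probability
preds$.pred[[1]]
```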

Key takeaways

So that is sort of the quick introduction to showing you that part. I'm going to forego tuning the whole thing, evaluating it, and so on. Because really, what I want you to take away right now from this is that censored regression lets you use all the information you have, together, at once. So you're doing one model: it includes the time, it includes the event status, it includes your events, and it also includes your censored observations, and you make use of them for as long as you can. And the other part is that tidymodels lets you do this within a well-designed framework for predictive modeling. So there is always more, and we have a bunch of articles on tidymodels.org/learn that have a survival analysis tag, including a case study that really takes you through all of this, more than the taster that I've put up here. So that includes fitting a model, evaluating it, moving on, trying different models, tuning those models, making your final choice, and then being able to predict with that one. And the other thing I want to point out here for learning more is that we did a tutorial two days ago on survival analysis with tidymodels, so if you would like to see that, you're also welcome to check that out. And that is me. Thank you.

Q&A

Thanks for the talk. It's more of a remark, and I would like to hear your opinion on it, because tidymodels seems like a very comprehensive framework, with a lot of development effort going into it. And when I look at one particular aspect, parsnip I think it's called, the library that implements the modeling, then I wonder, wouldn't it make sense to kind of factor out that part, the model implementation, the number crunching that needs to happen in every model, and in every language? So if the community could agree to implement these particular aspects of the whole pipeline, for example the likelihood function, in a generic API that could be called from different languages, couldn't this reduce development efforts considerably? I'm not sure whether I could state the idea clearly, but maybe you could comment on it.

Okay, do you mean like something generic, like an optimizer, like optim? No, just the model implementation. For example, if you have a linear model, and you follow the likelihood paradigm, you have to specify a likelihood function, and this function gets called over and over again in an optimization routine. It can be either Bayesian or maximum likelihood estimation. And from my perspective, it would make sense to write this function, for example in C++, and then it could be used in tidymodels, but it could also be used in Python.

I see your point. I mean, my impression would be that that doesn't generalize all that well, because the likelihood captures the specificities of a model, and it is hard to generalize that. What we follow at the moment is that we trust researchers, and in part statisticians, for loads of them to get that implementation right, because they are the experts on those models, and they know them in and out, and they know how to like get that part right. And then we do the part where we let you move between options. Where we try to help out and make offers for people to implement stuff in a more general way is the hardhat package, which sort of does some of the shared elements. It doesn't do specifically the likelihood, because that is often so model-specific, but it gives you like a scaffolding where you can put that in, and then there's infrastructure for, say, having a formula interface to your optimization problem, or having a matrix interface, and that kind of thing, and giving you some sort of like hopefully cozy places to make your predictions and take some of that infrastructure problem away for people who might not know R in and out, or sort of starting out and have a focus on the statistical elements and stuff, so that we get good parts from both, basically.

Thank you, Hannah, for the very nice presentation. I have two questions, maybe very quick ones. The first one is, how does it scale the model? Like, can I go in with a really big data set with millions of rows, and will it work? And second one would be, do you also have some tidymodels module for explainability?

Okay, so first question, how fast does it go? That has a two-part answer. The core is how fast is the actual model implementation, and that depends on the people who develop the package, and how much they pay attention to this, or what allows them to do it or not. And the other part is, can you parallelize it and throw some computing power at it? And tidymodels uses foreach as the way to register a backend, and is sort of moving to future as the backend. So you can parallelize it that way. It is also designed so that you only have to do the pre-processing once and then fit all the models; you don't always have to redo pre-processing and modeling together. You try to cut out the repeated pre-processing. So it does take care of this where it can, but obviously we can't make the innermost part, the model fit, faster.
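One common way to register such a backend, via the foreach ecosystem with doParallel, might look like this sketch (the cluster size is an illustrative choice; tuning functions like tune_grid() then pick up the registered workers automatically):

```r
library(doParallel)

# Set up a PSOCK cluster, leaving one core free, and register it
# as the foreach backend that tidymodels' tuning functions use
cl <- makePSOCKcluster(parallel::detectCores() - 1)
registerDoParallel(cl)

# ... run tune_grid() / fit_resamples() here ...

# Release the workers when done
stopCluster(cl)
```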

And the second was explainability. There are explainers for tidymodels: DALEX, from the wider community, does a lot and also works with tidymodels. So it's not us who maintains that, but it works on tidymodels. And we have recently put work into implementing fairness metrics, which is not exactly explainability, but sort of goes in that direction, so I thought I'd mention that. There's more on the tidymodels, sorry, tidyverse blog on fairness if you're interested in that.