Resources

Hannah Frick | Censored - Survival Analysis in Tidymodels | Posit (2022)

tidymodels is extending support for survival analysis and censored is a new parsnip extension package for survival models. It offers various types of models: parametric models, semi-parametric models like the Cox model, and tree- based models like decision trees, boosted trees, and random forests. They all come with the consistent parsnip interface so that you can focus on the modelling instead of details of the syntax. Happy modelling! Talk materials are available at https://hfrick.github.io/rstudio-conf-2022 Session: Updates from the tidymodels team

Oct 24, 2022
17 min

image: thumbnail.jpg

Transcript#

This transcript was generated automatically and may contain errors.

Welcome everyone. Thanks for being here. I get to talk to you about a package called Censored. It's a new tidymodels package for survival models. And when people hear survival analysis, they often think of situations and data like this. And I mean, it's called survival analysis for a reason.

But even though most examples are indeed like category time to death, that is not the only context in which you might encounter censored data and hence a need for survival models. And I came across my favorite example for censored data. Before I joined RStudio, I was working at a data science consultancy. And one of my colleagues chatted to me about his latest project. And that involved a newspaper company, which wanted to know how many copies they should deliver to various outlets. And he was sort of musing that if there are any copies left, at the end of the day, they wouldn't know how many had been taken and what the demand for the paper was.

If there were none left, they would also still know how many were taken, but they wouldn't really know what the demand was and how many more could have would have been taken if they hadn't run out. So in that case, their observation of the newspaper demand is actually censored. It's capped at the numbers of papers that they provided. So this stats nerd in me grinned and was like, I suggest you look at survival analysis. And now I get to tell you basically the same thing too.

So if you have data of this format, where you're observing something until an event, be that time to death or newspapers taken until demand situation, and you may or may not actually observe that event, please use survival analysis. That is designed to take both of these aspects into account. So if you take that data and shove it into a regular regression model and model the time or the demand without taking that event indicator into account, you're treating the observations as complete, when in fact they may just be incomplete. And if you're dropping all of your censored observations and model the rest, you're also throwing away information and you're treating these observations as missing, while they are actually just incomplete.

And if you're dropping all of your censored observations and model the rest, you're also throwing away information and you're treating these observations as missing, while they are actually just incomplete.

Extending tidymodels with the censored package

So survival analysis is an important field of modeling and we're extending support for it in tidymodels. The first milestone in this are the models themselves, to be able to specify them, fit them and predict with them. And you may well be aware that R has an abundance of riches in different models contributed by different people, and it also has nearly the same abundance of riches in interfaces to these models. And as you just heard, tidymodels wants to make it effective, safe and ergonomic for you, and part of that is to provide you with a consistent interface. And the aspect in tidymodels that takes care of a consistent interface for models themselves is parsnip.

So it's designed to give you consistency both in how you specify and fit the model and how you predict and what you get back. So you don't have to remember function names, argument names across different packages, whether or not you get a vector back or a matrix or a data frame, and what that type of prediction was called in this particular package you're using. And the thing I'm really here to talk about is censored, which is a parsnip extension package. So for survival models, so I'll be talking about how these design ideas play out and marry with survival analysis.

Specifying and fitting models

Let's get to that first part, to specify and fit models. In parsnip, you need three elements to specify your model. So for one, the model type. In the example here, it's a random forest, and you need a computational engine. That's what you use to fit the model. My example here, that's the Ranger package. Often that's various R packages. It can also be tools outside of R, like Stan, if you have a Bayesian model you want to fit. And the third element to a parsnip model specification is the mode. A random forest can be used for regression, for classification, so you need to specify that here.

And these three elements are sort of very consistent in parsnip, so they will also show up in censored. So what is actually new with censored? Well, we have to make a few additions. For one, we have a new model type, called proportional hazards, to let you fit everyone's favorite Cox model. Second of all, we have a new mode to distinguish censored regression from regression. And thirdly, you probably guessed it, several new engines to existing models. So with that, censored covers parametric models, semi-parametric models, and tree-based models for survival analysis. And the last point that I want to mention here is that we have a formula interface for all of these models.

Most of the engines do that already, the glmnet package sort of famously does not, but we let you interact with that through a formula interface, and that includes a way to specify stratification.

A data example: dolphins and whales

And that's all sort of abstract, and I wanted to have a data example. Sadly, I do not have access to that newspaper data, but I bring you dolphins and whales. So these are dolphins and whales living in captivity in the U.S., whether they've been captured or rescued or were born in captivity. It's originally a tidy Tuesday data set, all the way from the pudding. And we have the response age in the first column, how old that dolphin is or was. If event is zero, the dolphin or whale is still alive. And we know the species, the sex, and how many times this animal was transferred between different facilities in the U.S.

So if we want to go fit that brand new model, the proportional hazards model and tidymodels, we need censored, and we need the proportional hazards function, and then we can add an engine. I'm going to go with the very classic survival package here as the way to fit this model. I'm going to add the mode. That's the new censored regression mode. And with that, I have like the specification. This is not fit yet. If I want to apply it to data, I have a fit function, and that's where the formula interface comes into place. Your response is typically a surv object on the left-hand side of the formula. And you want to add stratification. You can do that like you would do in the survival package and add a strata term into the formula.

You're looking at this and saying, well, that's plenty of lines of code for a model. That is indeed true. So if you work with the survival package directly, that's all a little bit condensed. Just the modeling function, the formula, and the data, that sort of does specification and fit all roll into one. And that's great.

Where sort of censored comes more into play is if you want to use something else in addition to try it out and move between different models. So if we want to go from sort of the standard proportional hazards model to a regularized version of that model, I'm going to move all of the code to the left and start doing that with the glmnet package on the right. That one has no formula interface, as I mentioned. So you need to make your response matrix and the predictor matrix. I'm doing that with model matrix here. And if you want to include stratification, you need to stratify your response. And then you're ready to go. Put it into glmnet. Remember to set the family to Cox. Otherwise, you're fitting a different model. And give it a penalty parameter lambda.

So of course, this is doable. But the code looks quite different, and it's quite a lot of work to go from one to another. And we want to make that aspect easier. So obviously, the next code is a little boring. You can take the bit that we already had for censored and copy it over. And then the main thing you change is what engine you set. One is the survival package, one is glmnet. And the other thing you want to change is the penalty parameter, because it's glmnet. But the rest is the same. So we don't have to fiddle with the data structure here. We can still use the formula in this. And if you want to switch to yet another type, the tagline for that is kind of more models, same syntax. So you might swap out your model type for a boosted tree and set the engine to xgboost, and you're ready to go.

And that applies to the rest. We have in censored for that new mode, censored regression, the parametric models in the survival reg function, proportional hazards, or semi-parametric, and then the tree-based models are decision trees, random forests, bag trees, and boosted trees you just saw.

Predicting with censored models

And that's the section on specifying and fitting. And the next one is obviously predicting.

And in the tidymodels framework, we give you a little guarantee that your predictions will always be inside a table. And the column names and types are unsurprising and predictable. And the number of rows in new data and the output are the same. So what's new then?

Answer is new prediction types. One of them for survival analysis is the survival time. So I'm just taking the first three rows here of my data set to predict on them, adding in an NA in the first row just to mix things up a little, and you can get a prediction of the survival time by using the predict function and setting type to time. And you get a table back, the column will always be dot pred underscore time, and it will have three rows because you put three rows in. So if you have a missing value, it will have an NA rather than miraculously disappear on you.

And the other big new type for survival predictions is the survival probability. That is always calculated at a point in time. So here in this example, I'm setting type survival, and I want the survival probability at age 10, 15, 20, and 40 years. Dolphin or whale that I'm looking at. And what you get back is still a table that has three rows for the three observations, but we've asked for four time points. So this is actually a list column and it has little tables stashed inside of it. If we want to look at one of them, we can pull it out and then we see the column that gives us the information about the time point that we're predicting on and the probability, except for in this case it's an NA, but if you look at the other one, you do get actual probabilities.

And this does not return you like a classical surv object, surv fit object. If you do want to access the full survival curve, you can pretty easily just dial up the number of time points you ask for. So here I'm asking for survival probability predicted at age one all the way through 80, and then I can unnest that thing to sort of stack all the little tables underneath each other and I'm squeezing in the mutate to be still able to tell which rows belong to which dolphin. And that gives you a data set that is ready for ggplot and you can easily do your visualization of the survival curve with a geom step.

Summary

So to recap, the prediction types that are new are time or survival time and survival for survival probability. These are available for all of the models that are in censored and depending on the engine you might also be able to get the hazard, the quantile of the event time distribution, or the linear predictor if your model has one. You do not have to do any of like the data prep for a matrix interface, you don't have to pad your results at the right place if there's a missing value in there, and you also don't have to do the leg work to go from a survival curve to predicted probability at like any given time point.

And the summary of this all is basically censored is a consistent interface for survival models. It gives you consistency in how you specify fit and consistency in how you predict and what you get back. So please try it out.

And the summary of this all is basically censored is a consistent interface for survival models. It gives you consistency in how you specify fit and consistency in how you predict and what you get back.

Q&A

Thank you so much, Hannah. We have time for a few questions, so I'll take them from Slido. Again, if you have questions to ask Hannah at Slido, the event ID is inspire, I-N-S-P-I-R-E.

So we have a few here. Do you have any resources or recommendations for folks who want to learn about censored regression in general? So for one, Emily, I forgot the last name. Sebor, yes, thank you. Zebrametrics is I think the Twitter handle. She has an introduction that is really good, I like, as a blog post. If you are more on a type that you do want to dive deep and you want a textbook, risk regression modeling is probably good. I'll tweet those out afterwards. Do we have some tidymodels resources for the censored regression stuff? Not about sort of learning survival analysis per se. There's examples out on the censored homepage. The URL is up there. How you can use all of the models, but it doesn't cover the theory.

Are there plans to support competing risk regressions? Down the line, probably yes. So that is a thing that has been mentioned several times. It is definitely on our radar. Are there plans to support random forest back ends like Ranger? So Ranger is not yet in there, but like one of the two that are sort of going to come next of the engines.

So there was a question about is censored able to work with workflow objects? I would expand that further. Is censored able to work with tidymodels as an ecosystem in general? As all aspects of the ecosystem, not yet. I am still having some work to do. You can use it with workflow objects. You just have to not push the formula to it, but rather the variables and specify the formula as a sort of engine argument. I will put a vignette on that online.

Maybe one more question. What is the difference between stratifying on a variable in the survival model and using that variable as a predictor? So stratification in a Cox model gives you the idea that the baseline hazard is different between groups, and that is sort of the non-parametric part, the baseline, and your predictors are the parametric part that modulate it. And if you think that this baseline hazard is not proportional between groups but rather fundamentally different, then you would put that into stratification. All right. Let's give Hannah another round of applause.