Resources

Using R to develop production modeling workflows at Mayo Clinic - posit::conf(2023)

Presented by Brendan Broderick.

Developing workflows that help train models and also help deploy them can be a difficult task. In this talk I will share some tools and workflow tips that I use to build production model pipelines using R. I will use a project of predicting patients who need specialized respiratory care after leaving the ICU as an example. I will show how to use the targets package to create a reproducible and easy-to-manage modeling and prediction pipeline, how to use the renv package to ensure a consistent environment for development and deployment, and how to use plumber, vetiver, and Shiny applications to make the model accessible to care providers.

Presented at Posit Conference, September 19-20, 2023. Learn more at posit.co/conference.

Talk Track: Leave it to the robots: automating your work. Session Code: TALK-1149


Transcript#

This transcript was generated automatically and may contain errors.

Hi everyone, my name is Brendan Broderick and I'm a senior data science analyst at Mayo Clinic in the Robert D. and Patricia E. Kern Center for the Science of Health Care Delivery. I know, try to fit that on a sweater. In the Kern Center, in the data science pillar specifically, our main goal is to support Mayo Clinic's medical practice through the use of statistical and machine learning models that are well backed by healthcare delivery models. And a healthcare delivery model, to me, essentially means that the information your model provides should be in use and should change practice in some way. It should be integrated into others' workflows, not just your own.

So we have a mantra within our group that a model with good predictions is useless unless there is a well-developed healthcare delivery model behind it. Or, as David Robinson put it at this very conference in 2019, anything that's still just on your computer is, to a first approximation, useless. Good models should be well utilized by people who can use the information they provide to effect change. And to us, that means that our models need to be in production.

a model with good predictions is useless unless there is a well-developed healthcare delivery model behind it.

So we need to develop models in a principled way. We also need to develop data and prediction pipelines in a principled way to help us reach production. And finally, we need to deliver those predictions to someone who can actually act on them. If there's one thing I want you to take away from this talk, it's that none of what I'm talking about is overly complicated. You don't have to be a DevOps expert or a software engineer to put your models into production. I am neither of those things. You also don't need a team of analysts to accomplish this goal. You just need to utilize the right tools and workflows to help you out.

So where we're going, or where I hope to take you, is through the entire process. What we're working towards is supporting clinical applications such as this one that I've worked on in Shiny, where we're displaying real predictions in real time on patients in Mayo Clinic's hospital. Throughout the course of this talk, I'm going to introduce a real project that I work on, the Respiratory Care Unit project, and I'm going to take you from the modeling portion of this project all the way to taking it into production, utilizing targets, which we just learned a lot about. We'll talk about how to make these targets pipelines more reproducible and more portable, utilizing tools like renv and Git. And finally, I'll talk about accessibility and enablement using the Posit Connect server, Shiny, and plumber.

The Respiratory Care Unit project

Okay, so the Respiratory Care Unit project is a project that we currently have working in production that essentially tries to solve this problem. Mayo Clinic has a hospital, it has an intensive care unit, and the typical pathway for a patient is that they are stabilized in the ICU, they then move to a general care floor as a step-down unit, and they discharge from there. Clinicians noticed, however, that there was a subset of patients with respiratory issues who would be stabilized in the ICU and go to the general care floor, but their conditions would be exacerbated, which would require either a readmission to the ICU, which is bad, or they would end up in a more specialized respiratory care unit. And probably, all along, they should have gone to that respiratory care unit when they discharged from the ICU. So the problem we are trying to model is essentially: we want to classify who is likely to end up in the RCU after they discharge from the ICU.

So as with a lot of healthcare projects, you can sketch out four general steps that we need to accomplish in order to develop such a model. We need to define a cohort of patients, that being a historical set of ICU patients that we observe throughout their encounter. We then need to pull their covariate data from the EHR, the electronic health record, on a number of topics that might be predictive, like labs, notes, demographics, medications, et cetera. Then finally, we need to combine all that together and train and validate a model. And if we're successful, we get to put that model into production and start utilizing it.

From source scripts to targets

And you can take those four general steps and start sketching out what you need to write in terms of R scripts. You might start with an 01-cohort script that pulls historical data. Then you have a number of scripts to pull and clean EHR data, and then some scripts to pull that together and train a model. This workflow requires a kind of eighth meta-script where you source each one of the scripts you defined previously and run them sequentially. But if we were to draw the dependency graph for this project, you might realize that parts of it are inefficient. You'll notice that demo, labs, and notes are all independent; we can pull and process those independently.

And where this manifests itself is, say we want to add a new lab feature to our dataset. A clinician that we're working with says, you should really consider this lab; it might be indicative of underlying respiratory issues. So we do that. We add it to our lab script. But really, the only things we should have to rerun are the labs data file, pulling together the whole dataset, and training a model. In that source-scripting way of doing things, though, you're rerunning everything again, no matter what you change in your pipeline. So that's inefficient.

On top of that, all the inputs and outputs of these various scripts need to be managed by you. And not only do they need to be managed by you, but they're also hidden away in this sort of workflow. We can't tell transparently what cohort passes on to labs, what labs passes on to your model training, et cetera. And when it comes down to real projects such as the RCU project, this is the real graph that the RCU project produced. If you had to rerun all of these nodes yourself every single time, or if you had to keep track of all these inputs and outputs yourself, it really becomes unwieldy and error-prone. And when we're talking about delivery, the point of this pipeline is hopefully to deliver better patient outcomes. And when we're talking about real patient outcomes, we need to be correct. We can't do things in such an error-prone manner.

And that's where the targets package really comes in and helps out in all of our processes. targets allows us to iterate through multiple ideas very quickly when working on such projects. targets is a pipeline automation and management tool that essentially handles for you all of the things I said were bad about that source-scripting workflow. It's very simple to define a targets workflow. You simply need a targets script that defines a targets list whose elements are the nodes in the graph that I showed earlier. Each element is a target with a name-function pair: a name for the data object you're creating, plus a reusable function that defines that object. So, very simplistically, on the right-hand side you can read what a targets workflow is doing: cohort gets define_cohort(), demo gets get_demographics(), et cetera.

And all you need to do to run the pipeline is call a single function, tar_make(), and targets builds all of those objects for you.
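A minimal `_targets.R` in the style described above might look like the following sketch. The helper function names (`define_cohort()`, `get_labs()`, and so on) are hypothetical stand-ins for the project's real functions:

```r
# _targets.R -- a sketch of a minimal pipeline in the style described above.
# The helper functions are illustrative; in a real project they would live
# in the R/ directory and be loaded with tar_source().
library(targets)
tar_source()  # sources every function definition under R/

list(
  tar_target(cohort,  define_cohort()),           # historical ICU encounters
  tar_target(demo,    get_demographics(cohort)),  # independent covariate pulls
  tar_target(labs,    get_labs(cohort)),
  tar_target(notes,   get_notes(cohort)),
  tar_target(dataset, combine_features(demo, labs, notes)),
  tar_target(model,   train_model(dataset))
)
```

Running `targets::tar_make()` then builds each target in dependency order and caches the results, so a later change to `get_labs()` invalidates only `labs`, `dataset`, and `model`.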

And so in that example where I changed the lab feature dataset: if I did that with targets, targets knows what needs to be rerun in my pipeline, and it's going to skip unnecessary rebuilding of any object. So in the case of labs, it rebuilds labs, but it skips over anything that wasn't actually dependent on that object.

Reusable functions and production pipelines

And the very nice thing about defining workflows in this way is that targets requires you to define reusable functions, and it also encourages you to stick all your functions into a directory named R. If you humor me and squint your eyes at those two requirements, you might see the makings of an R package, or what you need to do to define a package. And the very nice consequence of this is that you get to reuse all the code that you wrote during the modeling part of your project and apply it directly when you need to make prospective predictions in production.

So in the modeling pipeline, you're pulling a retrospective set of data, pulling all the covariate information you need to make predictions, and then you fit a model. Then in the production pipeline, you simply need to load the model you created in that first modeling pipeline, start pulling patients who are currently in the ICU, and you get all of that work for free: the way you pulled and cleaned your information is consistent when you make your predictions. The benefit of this is that you spend a lot less time putting your models into prod, because you're reusing all that code you developed; the code base is very reusable. There's also the nice benefit that it assures your prospective data, the data you're pulling currently in production, is cleaned in the same way as your retrospective data. Because the rules that you define to pull and clean data are every bit as important to the modeling workflow as the algorithm you're using to fit a model, at the end of the day, to make valid predictions prospectively.
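That reuse can be sketched like this; every function name here is a hypothetical stand-in, but the shape is the point: one covariate function, written once, serves both pipelines.

```r
# R/get_labs.R -- one covariate-pulling function shared by both pipelines.
# pull_lab_results(), clean_lab_units(), and pivot_to_features() are
# illustrative helpers, not real package functions.
get_labs <- function(cohort) {
  pull_lab_results(cohort$patient_id) |>
    clean_lab_units() |>       # identical cleaning rules either way
    pivot_to_features()
}

# In the modeling pipeline, cohort is the historical set of ICU encounters:
#   tar_target(labs, get_labs(cohort))
# In the production pipeline, the same function runs over current ICU patients:
#   tar_target(labs, get_labs(current_icu_patients))
```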

Making pipelines stable and production-ready

So now that I've talked about building all the pipelines we need, let's talk about making those pipelines more stable, more ready for production use. We have an opinionated, Git-flow-like set of branching logic and requirements for each one of the projects that we're working on in the Kern Center. I'll distribute the slides afterward, so I'm not going to cover all these points. But to summarize this opinionated workflow that we've come up with: for every project, use targets, use renv, and use Git.

renv solves the problem of giving each project its own set of isolated packages that you manage at the project level. So you're confident that every time you rerun this pipeline, you know the set of packages that go into it. The problem this solves is having a package updated under your nose. And maybe everyone else already realizes that this is not a smart way to do things, but we used to share packages across multiple projects when we were doing research-only workflows. So this was a real thing that happened: dplyr would get updated under your nose, and you'd be struggling to figure out exactly what changed in your code, and it wouldn't have been you.
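The per-project renv workflow comes down to three real renv functions:

```r
# Typical renv workflow at the project level.
renv::init()      # create a private, per-project package library
# ... install packages and develop as usual ...
renv::snapshot()  # record the exact package versions in renv.lock
renv::restore()   # later, or on another machine: reinstall exactly
                  #   the versions recorded in renv.lock
```

Committing `renv.lock` alongside the code is what makes "I know the set of packages that go into it" true on every rerun.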

And finally, each project should be in a code repository, and you should commit all the necessary files to capture the state of your project at any time. Git usually comes with code-hosting platforms like ADO or GitHub, so you can use those tools to manage code changes and deployments of those code changes into production. And what Git is helping us with here is saving us from keeping multiple versions of a final model around.

A very nice consequence of using all these tools together is extreme portability. I might not be running this project where I'm currently running it today, but I can be confident that if I have to change where the compute happens, if I need more compute power, et cetera, I can very easily move these projects and maintain my reproducibility. Essentially, all I need to do is clone the Git repository, run renv::restore() to get all the packages I need, and then finally run tar_make() to restore all the objects that I created.
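Those three steps can be sketched as a tiny restore recipe:

```r
# After cloning the repository (e.g. git clone <repo-url>), two calls
# reproduce the whole project on a new machine:
renv::restore()      # reinstall the locked package versions from renv.lock
targets::tar_make()  # rebuild (or verify) every object in the pipeline
```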

Enabling predictions with Posit Connect and plumber

And so finally, I want to talk about project enablement. We've covered developing pipelines and making them more stable and prod-ready. But back to that point I made earlier: it's useless, right, if it's just sitting on my computer running predictions. How we enable our work is by relying a lot on the Posit Connect server. What the Posit Connect server does for us is allow us to make web applications that display our predictions in real time for clinicians. Shiny helps us accomplish that very easily; we're able to prototype dashboards really quickly. As well, we can write plumber APIs to allow other software, and the groups running it, to interact with our work.
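A prediction dashboard in this style can be prototyped in a few lines of Shiny. This is a minimal sketch, not the production app; `read_current_predictions()` is a hypothetical helper that would query the production pipeline's latest output.

```r
# app.R -- a minimal sketch of a Shiny dashboard for current predictions.
# read_current_predictions() is a hypothetical helper, assumed to return
# a data frame of patients and their RCU risk scores.
library(shiny)

ui <- fluidPage(
  titlePanel("RCU risk: current ICU patients"),
  tableOutput("preds")
)

server <- function(input, output, session) {
  output$preds <- renderTable({
    invalidateLater(5 * 60 * 1000)  # re-pull predictions every five minutes
    read_current_predictions()
  })
}

shinyApp(ui, server)
```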

And how I like to illustrate that Connect is actually very powerful for us is to show what this workflow might have looked like five years ago, when I didn't have Connect. That probably meant I was emailing these predictions to a clinician working in the hospital. Super smart, right? I would probably have to do that as much as they would stomach it, and eventually they would probably ignore it. I also have to assume that nothing is going to change between the time they actually view my file and when that patient actually discharges. In a world with Connect, we're able to display all that information we're trying to convey to a clinician in real time in a web application. We're also able to store the decisions they're making so that we can evaluate a model's performance over time. And this is great. It allows us to really find out if there's efficacy to the models that we're putting into production. Are we actually affecting patient outcomes?

So there might be a scenario where your model is so good that the clinicians want it to become a part of a patient's medical history. They want your model result to be in the medical record, essentially. So the problem we need to solve here is that we need to put R in the medical record. Five-years-ago me might have been naive enough to go to the people at Epic, the people who run the EHR, and say, well, R is open source, I'll find you the link to download it, you can put it on your servers, no problem. Right? That's just not how things work. A very elegant way that we get to solve this problem is the use of a plumber API. Going back to that notion of reusable functions, we get to define functions in R so that other groups trying to utilize our models can interact with all that machinery, all those pipelines we created for a project, very easily. And so that's all I have. Thank you, everybody.
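A plumber endpoint in this spirit might look like the following sketch; `predict_rcu_risk()` is a hypothetical wrapper around the same functions the pipelines use.

```r
# plumber.R -- a sketch of exposing the model over HTTP so the EHR team
# never has to install R. predict_rcu_risk() is an illustrative helper
# that would load the trained model and score one patient.

#* Return the RCU risk score for one patient
#* @param patient_id The patient identifier to score
#* @get /predict
function(patient_id) {
  predict_rcu_risk(patient_id)
}
```

Once deployed to Posit Connect, the EHR side only needs to make an HTTP GET request to `/predict`; R stays entirely on the server.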

Q&A

Thank you so much, Brendan. Great job. So we do have some questions. And thank you, everyone, for actually submitting questions in advance. This is great. So one question here is, how do you avoid feedback loops where your predictions of who will be readmitted contributes to that occurring and biases future model retrainings?

Sorry, can you repeat that? Yeah. So how do you avoid feedback loops where your predictions of who will be readmitted, right, to the ICU contribute to that occurring and bias future model retraining? So the same data bias in the future retraining? Sure. So I think with any modeling framework, or with any model, there are some assumptions that you need to live with. To the feedback point, it's a valid concern, and it is something that we looked into. But really, we're not adjusting or modifying what we're doing for those readmits.

Makes sense. Another question here is, how might you change this workflow if you were to pull live data, for example, like vitals? Like if this was actual live data? Oh, if it was actual live data? So I think the best way to do that, and I understand it is hard, is to go to the source that you're going to be pulling from in a live scenario. If you can, do that upfront in your function calls, so that when you go to productionize it, you're utilizing those same resources. How we've combated that is, we have a SQL server that we pull historic data from, and then we combine that with real-time data using FHIR APIs. We try to do that from the start of a project so that it's more seamless to go from modeling to production. Well, thank you again, Brendan. Appreciate it.