
Trust, but Verify: Lessons from Deploying LLMs in a Large Health System (Timothy Keyes)
Speaker(s): Timothy Keyes Abstract: Large language models (LLMs) are transforming how data practitioners work with unstructured text data. However, in high-stakes domains like medicine, we need to ensure that “hallucinated” clinical details don’t mislead clinicians. This talk will present a framework for evaluating and monitoring LLM systems, drawing from a real-world deployment at Stanford Health Care. We will describe how we built and assessed an LLM-powered system for real-time, automated chart abstraction within patients’ electronic health records, focusing on methods for measuring accuracy, consistency, and safety. Additionally, we will discuss how open-source tools like Chatlas and Quarto powered the work across our team’s combined Python- and R-based workflows. posit::conf(2025)
Transcript
This transcript was generated automatically and may contain errors.
Hi everyone, thank you so much for being here. My name is Timothy Keyes and I'm a data scientist at Stanford Health Care. Because I work at a hospital, I actually want to get started by asking you all to remember the last time you went to visit your doctor. During your appointment, your doctor probably listened to your heart as a part of their comprehensive medical exam. They probably also prescribed you a medication or gave you a vaccine. At some point during your visit, they may have ordered a laboratory test, like a blood test, or an imaging study, like an X-ray. And probably at some point, someone asked you to update your insurance or billing information for financial purposes. What you may not know is that at the end of your visit, your physician actually wrote up a text document summarizing everything that you did together that day and uploaded it into a data system called your Electronic Health Record, or EHR.
Your personal EHR is really important. It's basically a story about your health, starting on the day you were born and leading up to today, that is shared among all of the clinicians who are involved in your care. This is useful because it means that the next time you go to see one of your doctors, they can reference any changes in your medical concerns, so that they can offer you the highest quality, most efficient care possible when they see you for what brings you in that day.
The problem with the Electronic Health Record, though, is that over the course of your entire life, it tends to look less like this and more like this. There are thousands of pieces of information about you in your EHR. Every time you've ever had your blood pressure taken, every time you've ever filled a prescription, every time anyone has interacted with your insurance company, it's so much stuff. So it's really hard for your clinician to find any information about you, even when they're looking really hard. This is made even more difficult by something I was kind of shocked to learn when I was in medical school: most of this information is stored as unstructured text, making it all the more difficult to search through and parse.
Introducing Chat EHR (Chatter)
This is why our team at Stanford Healthcare created an application called Chat EHR or Chatter. Chatter provides a chatbot interface for clinicians to ask natural language queries against a patient's chart and to receive a text answer from an LLM that is hooked into the EHR. Chatter's UI is served by an infrastructure layer that handles all the technical things that you need when you're working with LLMs, and this infrastructure layer is also exposed on the back end as a suite of REST APIs so that data scientists like me can build various task-specific automations to guide our patients efficiently through the healthcare system. Some of the things that we tend to do with these task-specific automations are screen patients for eligibility for certain kinds of appointments, schedule those appointments, and then write draft documentation for our clinicians once those appointments are completed.
However, this talk is actually not about Chatter at all. It is not about how it was built, and it is not about how it works. Instead, it is about a single question that faced our team once we put this system into production and over 4,000 of our clinicians started using it out in the real world for real patient care, and that question is this. How do we know that an LLM-powered system is behaving as expected over time, or more specifically, how do we know that the text that the LLM is generating is reliable, useful, and safe for our patients and the clinicians who serve them?
I can't tell you that our team or really anybody else has definitively answered this question. It's a really, really hard question, but what I can do is share with you three lessons that our organization has learned in our journey trying to do so.
Lesson one: the view from space
Sound good? Awesome. This brings us then to lesson number one, the view from space. I will claim that LLMs really aren't so different from other statistical models in terms of what you really care about organizationally once they have been deployed. What I mean by this is that our organization has an MLOps framework that we use for all of our statistical model deployments, and we were really worried when we started working with LLMs that they were just too different of a technology, and they would basically smash our framework into smithereens and we would have to start over from scratch. Luckily, this didn't happen, because we found in practice that regardless of the type of model that you're serving, whether it's logistic regression, XGBoost, an LLM, or an agent, there are always certain things that you have to do to support a model in production, regardless of the underlying technology.
Let me illustrate this clearly by walking you through Stanford Health Care's MLOps framework. Your organization probably has something pretty similar, but if you don't, Julia Silge actually gave a really beautiful talk about MLOps at posit::conf a couple of years ago, and you should go watch that. But at Stanford Health Care, we break down MLOps into three pillars.
The first we call system integrity. This is about monitoring a model's ability to find its input data and to post its output to the intended destination with low latency and high availability. This has to do with the IT infrastructure supporting your model, so regardless of what that model is, it's always on the table and it's something you have to keep track of. The second pillar we call performance. This is about monitoring the quality and accuracy of the output of the model. So how often is the model correct? How often is it incorrect about the things that it does or says? Again, regardless of the model, this is something you have to care about if you want to be a good data scientist. And then lastly, our third pillar is impact. Because we're a hospital, it's really not enough for our models to be technically excellent. They also have to have a traceable way of serving our number one goal as an organization, which is providing the best patient care possible. So we typically also monitor process measures and KPIs that are associated with any of our deployments to make sure that they are serving their intended purpose towards this mission.
So we found that these three principles, kind of luckily for us, are sufficient for monitoring any model after it's been deployed, including LLMs. But that all being said, we did find it most difficult to tackle this second pillar. How do you define and then monitor the accuracy of an LLM? It's a really sophisticated technology, and it can do so many things.
Lesson two: key differences in monitoring LLM performance
This brings us to our second lesson. So we are now no longer in space. We have re-entered the atmosphere. And as we get closer to the Earth, we can appreciate that despite their apparent similarities at a high level, there are some key differences between LLMs and other kinds of statistical models that affect how we define and monitor their performance, like accuracy. There are a bunch of differences that we could pull apart here between LLMs and other types of models, but I will talk about three key differences that we are most opinionated about at Stanford.
The first is this. Humans simply interact with text differently than they do with the outputs from other kinds of models, like probabilities or discrete labels. In a clinical environment, machine learning models are typically used to make recommendations to clinicians. So they might recommend prescribing a particular medication or ordering a certain laboratory test. But in practice, what we found is that unless these models are highly, highly accurate, which most of them simply are not, because medicine is very hard, clinicians learn not to trust them, and so they ignore them. So they end up being not so useful. With LLMs, on the other hand, we find that a kind of magical human-computer interaction dynamic emerges where clinicians are actually really eager to read the free-text recommendation of an LLM and to understand its justification for making that recommendation. They see what the LLM is suggesting to them, run it through their own clinical reasoning process, and then decide if they agree or disagree with the model.
So what we've learned is that monitoring the so-called performance of our LLM deployments is really about capturing this human-in-the-loop adjudication process, so that we can track over time how much our expert clinicians are agreeing with the things that the LLM says.
I think this is well illustrated by an example, so I'll walk you through one of our LLM deployments to give you a flavor for how this looks in practice. The task here is we're asking the LLM to identify patients who might benefit from an appointment with a specialist, just a certain kind of physician, basically. So our starting point is a prompt, and that prompt contains the criteria for the specialist appointment. It usually contains some information about the types of illnesses that the specialist can treat. We send this to the LLM along with clinical notes describing the health of the patient and ask the LLM to output two things. The first is a targeted recommendation for whether the patient should get the specialist appointment, and the second is a justification with cited evidence from the patient's chart of why it made that recommendation. Both of these are then sent to one of our expert clinicians who are always in the loop, and that clinician then decides if they agree or disagree with the model and if they want to place the referral for the specialist appointment.
Our monitoring approach then really hinges on capturing this clinician's decision within our electronic health record, and then we track it over time so that if we see basically that the disagreement rate of our clinicians is increasing for a deployment after some period of time, this basically suggests that something is going wrong, maybe documentation practices have changed, maybe we need to adjust the prompt, and that our data science team then needs to remediate that.
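The drift-detection idea above can be sketched in a few lines of Python. This is a minimal illustration, not the team's production pipeline: the `adjudications` data shape and the 10% alert threshold are hypothetical stand-ins for whatever the real monitoring uses.

```python
from collections import defaultdict

def disagreement_rates(adjudications):
    """Compute the clinician disagreement rate per month.

    `adjudications` is a list of (month, agreed) pairs, where `agreed`
    is True when the clinician accepted the LLM's recommendation.
    """
    totals = defaultdict(int)
    disagreements = defaultdict(int)
    for month, agreed in adjudications:
        totals[month] += 1
        if not agreed:
            disagreements[month] += 1
    return {m: disagreements[m] / totals[m] for m in totals}

def months_needing_review(rates, threshold=0.10):
    """Flag months whose disagreement rate exceeds a (hypothetical) threshold,
    suggesting documentation drift or a prompt that needs remediation."""
    return [m for m, r in sorted(rates.items()) if r > threshold]

# Toy data: disagreement creeps up in March, suggesting something changed.
log = ([("2025-01", True)] * 19 + [("2025-01", False)] * 1
       + [("2025-02", True)] * 18 + [("2025-02", False)] * 2
       + [("2025-03", True)] * 15 + [("2025-03", False)] * 5)

rates = disagreement_rates(log)
print(months_needing_review(rates))  # → ['2025-03']
```

In a real deployment the aggregation would run over adjudications captured in the EHR, but the alerting logic is just this kind of thresholded rate over time.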
Okay, so moving on to our second key difference between LLMs and other statistical models in terms of monitoring performance. Whereas the first one was more conceptual, this one is more technical. It has to do with some engineering realities, and it's based on the observation that frontier LLMs like ChatGPT, Claude, and Gemini are externally developed and versioned by large third-party companies. So this means that they actually update and deprecate over time without your team necessarily being in control. This is in contrast to something like an XGBoost model that you may have trained, hosted, and served yourself, which will only ever be updated, retrained, or deprecated, turned off, when you choose.
You may not think about this as being related to monitoring a model post-deployment in the traditional sense, but my point here is that when you're working with proprietary LLMs, you don't just have to worry about the data distributions of your deployed population shifting over time, you don't just have to worry about the user behavior changing over time, you actually also have to worry about the model itself changing over time as it's continually updated by the third party that has developed it. So what we do about this is basically host a suite of internally developed benchmarks that help us to monitor task performance for important tasks at Stanford Healthcare whenever a new model is released. So for those of you who might be familiar, GPT-5 was released a couple of weeks ago, and we were curious like, should we switch to this model? We're using a couple of legacy models that won't be supported in a couple of months, and so this is our approach for dealing with that scenario.
For those of you keeping me honest, I have used some jargon here. So benchmark is a term that is used commonly in the LLM world, and the LLM world really likes jargon, but to keep it short and sweet, a benchmark is essentially a human-labeled gold-standard test data set for a specific task. So if you remember our specialty referral example from before, the way that you would generate a benchmark for that is basically taking your prompts and your clinical notes about a patient, showing them directly to a clinician, and then asking, would you place the referral to the specialist for this patient, yes or no? If you do that for a hundred patients, you've now generated a data set that the model never saw during training, because it lives within the walled garden of your organization's databases, and you can then use that benchmark to evaluate many different LLMs on that task, whether they already exist or are released in the future.
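Once you have such a benchmark, comparing candidate models reduces to scoring each one's yes/no referral decisions against the clinician labels. Here is a minimal sketch under toy data; the model names and labels are invented for illustration, and real benchmarks would track more than raw accuracy.

```python
def benchmark_accuracy(gold, predictions):
    """Fraction of benchmark cases where the model's yes/no referral
    decision matches the clinician's gold-standard label."""
    assert len(gold) == len(predictions)
    correct = sum(g == p for g, p in zip(gold, predictions))
    return correct / len(gold)

# Gold labels from clinicians: should this patient get the referral?
gold = [True, False, True, True, False]

# Outputs from two (hypothetical) candidate models on the same cases,
# e.g. a legacy model versus a newly released one you're considering.
model_outputs = {
    "model-a": [True, False, True, False, False],
    "model-b": [True, True, True, True, True],
}

for name, preds in model_outputs.items():
    print(name, benchmark_accuracy(gold, preds))
# model-a scores 0.8, model-b scores 0.6
```

Re-running this same harness whenever a vendor releases or deprecates a model is what turns "should we switch to GPT-5?" from a guess into a measurement.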
So what we've done in our organization is to do this for many different tasks that we have identified as high-value, and then bundle them all together in a custom Python evaluation harness called MedHELM, or Medical Holistic Evaluation of Language Models, that is actually quite similar under the hood to some of the other evaluation frameworks that have been discussed in this session. And in fact, we may end up porting it over to something like Inspect in the future.
All right, finally, this brings us to our third key difference between monitoring LLM performance and monitoring the performance of other kinds of statistical models, and this one is the hardest. At the end of the day, no matter how big or expansive your suite of benchmarks is, you just can't evaluate an LLM on every single task that you could potentially apply it to within your organization. It's not possible, and even if it were, it'd be prohibitively expensive to do. This is because LLM outputs are just so much more flexible and expressive than other kinds of models that are trained for one or a small number of specific purposes. They can take pretty much anything as input, and they can produce pretty much anything as output.
So what do you do? Our approach is to identify important and known failure modes across many of your different LLM deployments and try to focus on those. So you basically identify things that you don't want the LLM to do, and you monitor those specifically. This is helpful not just for monitoring purposes, but it also is kind of a step into a process called guardrailing your model, which basically means trying to intercept those failure modes before they occur.
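As a concrete sketch of what "guardrailing" can look like, here is a tiny wrapper that intercepts one invented failure mode before the text reaches a user. The dosage-pattern detector and the withheld-message behavior are hypothetical examples, not Stanford's actual guardrails.

```python
import re

def contains_dosage(text):
    """A (hypothetical) failure-mode detector: flag outputs that assert
    specific medication dosages, which a team might decide should never
    reach a user without clinician review."""
    return bool(re.search(r"\b\d+\s?(mg|mcg|ml)\b", text, re.IGNORECASE))

def guarded_respond(generate, prompt):
    """Wrap an LLM call so a known failure mode is intercepted before
    the generated text is returned. Intercepts can also be logged,
    which doubles as monitoring for that failure mode."""
    text = generate(prompt)
    if contains_dosage(text):
        return "[withheld for clinician review: contains specific dosing]"
    return text

# Stand-in for a real LLM call.
fake_llm = lambda prompt: "Start metformin 500 mg twice daily."
print(guarded_respond(fake_llm, "What is the plan for this patient?"))
# → [withheld for clinician review: contains specific dosing]
```

The same detector serves both purposes the talk names: count its hits over time for monitoring, and gate on it at serving time for guardrailing.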
Monitoring hallucinations with fact decomposition
So for example, at our organization, something that we're really worried about is this issue of LLM hallucinations. Probably many of you who work with LLMs are concerned about these as well. In our context, an LLM hallucination just means that the model makes a claim about a patient that is not substantiated or supported by any of the information present in that patient's chart. So what we do to monitor this is run automated Quarto reports that conduct a Python analysis using an algorithm called fact decomposition, and this estimates the hallucination rate across our Chatter deployments for us.
So how does fact decomposition work? Bear with me, this part is a bit algorithmic. You start with the output text that an LLM generates, and then you use either natural language processing or another LLM to break it into the individual claims that it contains. So for example, if the output from the LLM says this patient came to the neurology clinic to have their headache evaluated, the individual claims would be that the patient has a headache and that they went to the neurology clinic, and maybe also a third one, that their evaluation at the neurology clinic was about their headache. Once you have these individual claims, you can then use a powerful reasoning model or a smaller fine-tuned language model to cross-reference all of the individual claims with all of the clinical notes within the relevant time period of that patient's chart. This allows you to flag individual claims that are not supported by any of the evidence present in the patient's record, and then, using simple division, estimate what proportion of claims are supported versus not supported.
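The structure of the algorithm can be sketched like this. Note the heavy hedging: in the real pipeline both steps are done by language models, whereas here sentence splitting and a naive substring check are crude placeholders so the skeleton is runnable.

```python
def decompose(output_text):
    """Split LLM output into individual claims.

    Real systems use NLP or a second LLM for this step; splitting on
    sentence boundaries is only a crude stand-in.
    """
    return [c.strip() for c in output_text.split(".") if c.strip()]

def is_supported(claim, notes):
    """Check whether a claim is backed by the patient's chart.

    In practice this is a reasoning model or a fine-tuned verifier
    cross-referencing the claim against the notes; a substring check
    is only a placeholder for that judgment.
    """
    return any(claim.lower() in note.lower() for note in notes)

def hallucination_rate(output_text, notes):
    """Proportion of claims in the output with no support in the notes."""
    claims = decompose(output_text)
    unsupported = [c for c in claims if not is_supported(c, notes)]
    return len(unsupported) / len(claims)

notes = [
    "Patient seen in neurology clinic. The patient has a headache.",
    "Headache evaluated; no focal deficits.",
]
output = "The patient has a headache. The patient was prescribed insulin"
print(hallucination_rate(output, notes))  # 1 of 2 claims unsupported → 0.5
```

Aggregating this rate per deployment and per model over a month is exactly the quantity the automated Quarto reports chart.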
So as an example, this is a plot that I made in plotnine and pulled from our most recent Quarto report, which ran for the month of August. Basically, what we're doing is estimating the hallucination rates across our different deployments among three of the different models that we route to within the Chatter infrastructure. What you can see is that one of the models that is becoming a bit dated at this point, GPT-4o mini, was hallucinating at a much, much higher rate than the other models that we were using, so this resulted in the recommendation to deprecate that model and switch to one that was more reliable.
Lesson three: you already have the tools
Okay, so I would now like to conclude with our third and final lesson, the kind of boots-on-the-ground view. When I talk to other data scientists about the fact that I mostly work with LLMs now, a common worry that I hear is, oh, I'm really intimidated by LLMs. I haven't worked with LLMs before and I'm not sure where to get started. I feel like the craze has kind of swept over everybody else, but it forgot me. So if there's one thing I would like for you to take away from this talk, it's just that as a data scientist, regardless of whether you've worked with LLMs in the past, you already have all the tools you need to work with LLMs if you would like to.
For example, during this talk we've met several components of monitoring a deployed LLM system that may have seemed kind of new, but that really connect quite deeply to things that you already understand as someone who works with statistical systems. For example, we talked about benchmarking, which is basically just another word for creating a test set for a model that you didn't train. You know how to create a test set that's free from bias, that is representative of your deployment population, that the model has not seen before during training; none of that is actually new. Secondly, we talked about identifying failure modes for a deployment so that you can either monitor them specifically or set up guardrails around them. And how do you identify these failure modes? Well, it's simple. You spend some time with the logs for the tool that you've deployed, do some exploratory data analysis on them, and then you work with your governance partners to identify the problems you really want to avoid. EDA and governance are foundational in every single data science workflow, so that's not new either.
And then finally there's this question kind of underlying this entire conversation, which is: what are the tools that I should use to build with LLMs? Unfortunately I can't give you a single answer for that. Here are a bunch of the frameworks that I have used either in the past or currently. There are many of them, but if you allow me to give a single recommendation for someone who is completely new, I would suggest getting started with the packages that Posit has developed for building with LLMs. If you work in Python, this is a package called chatlas. If you work in R, this is a package called ellmer. I think what sets these packages apart is just that their docs are so excellent. They will teach you everything that you need to know along the way, assuming basically no prior knowledge, which is really helpful when you're a beginner. And if even that is intimidating, which it very well might be, it certainly was for me when I got started, I'll also just leave you with the reminder that there was a time when the frameworks that we all use every day were new. dplyr was once new. ggplot2 was once new. pandas and Polars were once new to us, but if you believe in code-first data science, these are skills that you can pick up to help you solve problems. So there's no reason to be intimidated. With that, I will thank you all for being here and invite any questions.
Q&A
Thank you, Timothy. We have time for one question and then lunch. And then lunch, yeah. Human-in-the-loop accuracy measurement is expensive. How do you do this at scale?
Yeah, human-in-the-loop is expensive. I mean, you're kind of preaching to the choir here because our experts are also clinicians. Like, can you imagine the hourly rate for hiring a clinician as a consultant to do labeling? Like, whatever number you're thinking of, it's too low. But I think the cool thing for us as a health system is we're kind of uniquely positioned because all of our users are clinicians, and so if we're giving them the tools, we can kind of require them to give us feedback in return. How to do this at scale in the general sense, like how to do scale AI for doctors or something, I'm not sure. But at least in the medical context, I know that, like, we always imagine our deployments as having a human-in-the-loop. We're not doing things fully automated, so it's kind of baked into some of the workflows of the tools we're developing. Sorry for the non-answer, but it's a really hard question. Thank you. Let's thank Timothy and all our speakers again.

