
Trust, but Verify: Lessons from Deploying LLMs in a Large Health System (Timothy Keyes)
Speaker(s): Timothy Keyes Abstract: Large language models (LLMs) are transforming how data practitioners work with unstructured text data. However, in high-stakes domains like medicine, we need to ensure that “hallucinated” clinical details don’t mislead clinicians. This talk will present a framework for evaluating and monitoring LLM systems, drawing from a real-world deployment at Stanford Health Care. We will describe how we built and assessed an LLM-powered system for real-time, automated chart abstraction within patients’ electronic health records, focusing on methods for measuring accuracy, consistency, and safety. Additionally, we will discuss how open-source tools like Chatlas and Quarto powered the work across our team’s combined Python- and R-based workflows. posit::conf(2025)
Transcript
This transcript was generated automatically and may contain errors.
Hi everyone, thank you so much for being here. My name is Timothy Keyes and I'm a data scientist at Stanford Health Care. Because I work at a hospital, I actually want to get started by asking you all to remember the last time you went to visit your doctor. During your appointment, your doctor probably listened to your heart as a part of their comprehensive medical exam. They probably also prescribed you a medication or gave you a vaccine. At some point during your visit, they may have ordered a laboratory test, like a blood test, or an imaging study, like an X-ray. And probably at some point, someone asked you to update your insurance or billing information for financial purposes. What you may not know is that at the end of your visit, your physician actually wrote up a text document summarizing everything that you did together that day and uploaded it into a data system called your Electronic Health Record, or EHR.
Your personal EHR is really important. It's basically a story about your health, starting on the day you were born and leading up to today, that is shared among all of the clinicians who are involved in your care. This is useful because it means that the next time you go to see one of your doctors, they can reference any changes in your medical concerns, so that they can offer you the highest quality, most efficient care possible when they see you for what brings you in that day.
The problem with the Electronic Health Record, though, is that over the course of your entire life, it tends to look less like this and more like this. There are thousands of pieces of information about you in your EHR. Every time you've ever had your blood pressure taken, every time you've ever filled a prescription, every time anyone has interacted with your insurance company, it's so much stuff. So it's really hard for your clinician to find any information about you, even when they're looking really hard. This is made even more difficult by something I was kind of shocked to learn when I was in medical school: most of this information is stored as unstructured text, making it all the more difficult to search through and parse.
Introducing Chat EHR (Chatter)
This is why our team at Stanford Healthcare created an application called Chat EHR or Chatter. Chatter provides a chatbot interface for clinicians to ask natural language queries against a patient's chart and to receive a text answer from an LLM that is hooked into the EHR. Chatter's UI is served by an infrastructure layer that handles all the technical things that you need when you're working with LLMs, and this infrastructure layer is also exposed on the back end as a suite of REST APIs so that data scientists like me can build various task-specific automations to guide our patients efficiently through the healthcare system. Some of the things that we tend to do with these task-specific automations are screen patients for eligibility for certain kinds of appointments, schedule those appointments, and then write draft documentation for our clinicians once those appointments are completed.
However, this talk is actually not about Chatter at all. It is not about how it was built, and it is not about how it works. Instead, it is about a single question that faced our team once we put this system into production and over 4,000 of our clinicians started using it out in the real world for real patient care, and that question is this. How do we know that an LLM-powered system is behaving as expected over time, or more specifically, how do we know that the text that the LLM is generating is reliable, useful, and safe for our patients and the clinicians who serve them?
I can't tell you that our team or really anybody else has definitively answered this question. It's a really, really hard question, but what I can do is share with you three lessons that our organization has learned in our journey trying to do so.
Lesson one: the view from space
Sound good? Awesome. This brings us then to lesson number one, the view from space. I will claim that LLMs really aren't so different from other statistical models in terms of what you really care about organizationally once they have been deployed. What I mean by this is that our organization has an MLOps framework that we use for all of our statistical model deployments, and we were really worried when we started working with LLMs that they were just too different of a technology, and they would basically smash our framework into smithereens and we would have to start over from scratch. Luckily, this didn't happen, because we found in practice that regardless of the type of model that you're serving, whether it's logistic regression, XGBoost, an LLM, or an agent, there are always certain things that you have to do to support a model in production, regardless of the underlying technology.
Let me illustrate this clearly by walking you through Stanford Health Care's MLOps framework. Your organization probably has something pretty similar, but if you don't, Julia Silge actually gave a really beautiful talk about MLOps at posit::conf a couple of years ago, and you should go watch that. But at Stanford Health Care, we break down MLOps into three pillars.
The first we call system integrity. This is about monitoring a model's ability to find its input data and to post its output to the intended destination with low latency and high availability. This has to do with the IT infrastructure supporting your model, so regardless of what that model is, it's always on the table and it's something you have to keep track of. The second pillar we call performance. This is about monitoring the quality and accuracy of the output of the model. So how often is the model correct? How often is it incorrect about the things that it does or says? Again, regardless of the model, this is something you have to care about if you want to be a good data scientist. And then lastly, our third pillar is impact. Because we're a hospital, it's really not enough for our models to be technically excellent. They also have to have a traceable way of serving our number one goal as an organization, which is providing the best patient care possible. So we typically also monitor process measures and KPIs that are associated with any of our deployments to make sure that they are serving their intended purpose towards this mission.
So we found that these three principles, kind of luckily for us, are sufficient for monitoring any model after it's been deployed, including LLMs. But that all being said, we did find it most difficult to tackle this second pillar. How do you define and then monitor the accuracy of an LLM? It's a really sophisticated technology, and it can do so many things.
Lesson two: key differences in monitoring LLM performance
This brings us to our second lesson. So we are now no longer in space. We have re-entered the atmosphere. And as we get closer to the Earth, we can appreciate that despite their apparent similarities at a high level, there are some key differences between LLMs and other kinds of statistical models that affect how we define and monitor their performance, like accuracy. There are a bunch of differences that we could pull apart here between LLMs and other types of models, but I will talk about three key differences that we are most opinionated about at Stanford.
The first is this. Humans simply interact with text differently than they do with the outputs from other kinds of models, like probabilities or discrete labels. In a clinical environment, machine learning models are typically used to make recommendations to clinicians. So they might recommend prescribing a particular medication or ordering a certain laboratory test. But in practice, what we found is that unless these models are highly, highly accurate, which most of them simply are not, because medicine is very hard, clinicians learn not to trust them, and so they ignore them. So they end up being not so useful. With LLMs, on the other hand, we find that a kind of magical human-computer interaction dynamic emerges where clinicians are actually really eager to read the free-text recommendation of an LLM and to understand its justification for making that recommendation. They see what the LLM is suggesting to them, run it through their own clinical reasoning process, and then decide if they agree or disagree with the model.
So what we've learned is that monitoring the so-called performance of our LLM deployments is really about capturing this human-in-the-loop adjudication process, so that we can track over time how much our expert clinicians are agreeing with the things that the LLM says.
I think this is well illustrated by an example, so I'll walk you through one of our LLM deployments to give you a flavor for how this looks in practice. The task here is we're asking the LLM to identify patients who might benefit from an appointment with a specialist, just a certain kind of physician, basically. So our starting point is a prompt, and that prompt contains the criteria for the specialist appointment. It usually contains some information about the types of illnesses that the specialist can treat. We send this to the LLM along with clinical notes describing the health of the patient and ask the LLM to output two things. The first is a targeted recommendation for whether the patient should get the specialist appointment, and the second is a justification with cited evidence from the patient's chart of why it made that recommendation. Both of these are then sent to one of our expert clinicians who are always in the loop, and that clinician then decides if they agree or disagree with the model and if they want to place the referral for the specialist appointment.
Our monitoring approach then really hinges on capturing this clinician's decision within our electronic health record, and then we track it over time so that if we see basically that the disagreement rate of our clinicians is increasing for a deployment after some period of time, this basically suggests that something is going wrong, maybe documentation practices have changed, maybe we need to adjust the prompt, and that our data science team then needs to remediate that.
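The drift-detection idea above can be sketched in a few lines of Python. This is a minimal illustration, not the team's production pipeline: the `adjudications` data shape and the 10% alert threshold are hypothetical stand-ins for whatever the real monitoring uses.

```python
from collections import defaultdict

def disagreement_rates(adjudications):
    """Compute the clinician disagreement rate per month.

    `adjudications` is a list of (month, agreed) pairs, where `agreed`
    is True when the clinician accepted the LLM's recommendation.
    """
    totals = defaultdict(int)
    disagreements = defaultdict(int)
    for month, agreed in adjudications:
        totals[month] += 1
        if not agreed:
            disagreements[month] += 1
    return {m: disagreements[m] / totals[m] for m in totals}

def months_needing_review(rates, threshold=0.10):
    """Flag months whose disagreement rate exceeds a (hypothetical) threshold,
    suggesting documentation drift or a prompt that needs remediation."""
    return [m for m, r in sorted(rates.items()) if r > threshold]

# Toy data: disagreement creeps up in March, suggesting something changed.
log = ([("2025-01", True)] * 19 + [("2025-01", False)] * 1
       + [("2025-02", True)] * 18 + [("2025-02", False)] * 2
       + [("2025-03", True)] * 15 + [("2025-03", False)] * 5)

rates = disagreement_rates(log)
print(months_needing_review(rates))  # → ['2025-03']
```

In a real deployment the aggregation would run over adjudications captured in the EHR, but the alerting logic is just this kind of thresholded rate over time.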
Okay, so moving on to our second key difference between LLMs and other statistical models in terms of monitoring performance. Whereas the first one was more conceptual, this one is more technical. It has to do with some engineering realities, and it's based on the observation that frontier LLMs like ChatGPT, Claude, and Gemini are externally developed and versioned by large third-party companies. So this means that they actually update and deprecate over time without your team necessarily being in control. This is in contrast to something like an XGBoost model that you may have trained, hosted, and served yourself, which will only ever be updated, retrained, or deprecated, turned off, when you choose.
You may not think about this as being related to monitoring a model post-deployment in the traditional sense, but my point here is that when you're working with proprietary LLMs, you don't just have to worry about the data distributions of your deployed population shifting over time, you don't just have to worry about the user behavior changing over time, you actually also have to worry about the model itself changing over time as it's continually updated by the third party that has developed it. So what we do about this is basically host a suite of internally developed benchmarks that help us to monitor task performance for important tasks at Stanford Healthcare whenever a new model is released. So for those of you who might be familiar, GPT-5 was released a couple of weeks ago, and we were curious like, should we switch to this model? We're using a couple of legacy models that won't be supported in a couple of months, and so this is our approach for dealing with that scenario.
For those of you keeping me honest, I have used some jargon here. So benchmark is a term that is used commonly in the LLM world, and the LLM world really likes jargon, but to keep it short and sweet, a benchmark is essentially a human-labeled gold-standard test data set for a specific task. So if you remember our specialty referral example from before, the way that you would generate a benchmark for that is basically taking your prompts and your clinical notes about a patient, showing them directly to a clinician, and then asking, would you place the referral to the specialist for this patient, yes or no? If you do that for a hundred patients, you've now generated a data set that the model never saw during training, because it lives within the walled garden of your organization's databases, and you can then use that benchmark to evaluate many different LLMs on that task, whether they already exist or are released in the future.
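Once you have such a benchmark, comparing candidate models reduces to scoring each one's yes/no referral decisions against the clinician labels. Here is a minimal sketch under toy data; the model names and labels are invented for illustration, and real benchmarks would track more than raw accuracy.

```python
def benchmark_accuracy(gold, predictions):
    """Fraction of benchmark cases where the model's yes/no referral
    decision matches the clinician's gold-standard label."""
    assert len(gold) == len(predictions)
    correct = sum(g == p for g, p in zip(gold, predictions))
    return correct / len(gold)

# Gold labels from clinicians: should this patient get the referral?
gold = [True, False, True, True, False]

# Outputs from two (hypothetical) candidate models on the same cases,
# e.g. a legacy model versus a newly released one you're considering.
model_outputs = {
    "model-a": [True, False, True, False, False],
    "model-b": [True, True, True, True, True],
}

for name, preds in model_outputs.items():
    print(name, benchmark_accuracy(gold, preds))
# model-a scores 0.8, model-b scores 0.6
```

Re-running this same harness whenever a vendor releases or deprecates a model is what turns "should we switch to GPT-5?" from a guess into a measurement.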
So what we've done in our organization is to do this for many different tasks that we have identified as high-value, and then bundle them all together in a custom Python evaluation harness called MedHELM, or Medical Holistic Evaluation of Language Models, that is actually quite similar under the hood to some of the other evaluation frameworks that have been discussed in this session. And in fact, we may end up porting it over to something like Inspect in the future.
All right, finally, this brings us to our third key difference between monitoring LLM performance and monitoring the performance of other kinds of statistical models, and this one is the hardest. At the end of the day, no matter how big or expansive your suite of benchmarks is, you just can't evaluate an LLM on every single task that you could potentially apply it to within your organization. It's not possible, and even if it were, it'd be prohibitively expensive to do. This is because LLM outputs are just so much more flexible and expressive than other kinds of models that are trained for one or a small number of specific purposes. They can take pretty much anything as input, and they can produce pretty much anything as output.
So what do you do? Our approach is to identify important and known failure modes across many of your different LLM deployments and try to focus on those. So you basically identify things that you don't want the LLM to do, and you monitor those specifically. This is helpful not just for monitoring purposes, but it also is kind of a step into a process called guardrailing your model, which basically means trying to intercept those failure modes before they occur.
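As a concrete sketch of what "guardrailing" can look like, here is a tiny wrapper that intercepts one invented failure mode before the text reaches a user. The dosage-pattern detector and the withheld-message behavior are hypothetical examples, not Stanford's actual guardrails.

```python
import re

def contains_dosage(text):
    """A (hypothetical) failure-mode detector: flag outputs that assert
    specific medication dosages, which a team might decide should never
    reach a user without clinician review."""
    return bool(re.search(r"\b\d+\s?(mg|mcg|ml)\b", text, re.IGNORECASE))

def guarded_respond(generate, prompt):
    """Wrap an LLM call so a known failure mode is intercepted before
    the generated text is returned. Intercepts can also be logged,
    which doubles as monitoring for that failure mode."""
    text = generate(prompt)
    if contains_dosage(text):
        return "[withheld for clinician review: contains specific dosing]"
    return text

# Stand-in for a real LLM call.
fake_llm = lambda prompt: "Start metformin 500 mg twice daily."
print(guarded_respond(fake_llm, "What is the plan for this patient?"))
# → [withheld for clinician review: contains specific dosing]
```

The same detector serves both purposes the talk names: count its hits over time for monitoring, and gate on it at serving time for guardrailing.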
Monitoring hallucinations with fact decomposition
So for example, at our organization, something that we're really worried about is this issue of LLM hallucinations. Probably many of you who work with LLMs are concerned about these as well. In our context, an LLM hallucination just means that the model makes a claim about a patient that is not substantiated or supported by any of the information present in that patient's chart. So what we do to monitor this is run automated Quarto reports that conduct a Python analysis using an algorithm called fact decomposition, and this estimates the hallucination rate across our Chatter deployments for us.
So how does fact decomposition work? Bear with me, this part is a bit algorithmic. You start with the output text that an LLM generates, and then you use either natural language processing or another LLM to break it into the individual claims that it contains. So for example, if the output from the LLM says this patient came to the neurology clinic to have their headache evaluated, the individual claims would be that the patient has a headache and that they went to the neurology clinic, and maybe also a third one, that their evaluation at the neurology clinic was about their headache. Once you have these individual claims, you can then use a powerful reasoning model or a smaller fine-tuned language model to cross-reference all of the individual claims with all of the clinical notes within the relevant time period of that patient's chart. This allows you to flag individual claims that are not supported by any of the evidence present in the patient's record, and then, using simple division, estimate what proportion of claims are supported versus not supported.
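The structure of the algorithm can be sketched like this. Note the heavy hedging: in the real pipeline both steps are done by language models, whereas here sentence splitting and a naive substring check are crude placeholders so the skeleton is runnable.

```python
def decompose(output_text):
    """Split LLM output into individual claims.

    Real systems use NLP or a second LLM for this step; splitting on
    sentence boundaries is only a crude stand-in.
    """
    return [c.strip() for c in output_text.split(".") if c.strip()]

def is_supported(claim, notes):
    """Check whether a claim is backed by the patient's chart.

    In practice this is a reasoning model or a fine-tuned verifier
    cross-referencing the claim against the notes; a substring check
    is only a placeholder for that judgment.
    """
    return any(claim.lower() in note.lower() for note in notes)

def hallucination_rate(output_text, notes):
    """Proportion of claims in the output with no support in the notes."""
    claims = decompose(output_text)
    unsupported = [c for c in claims if not is_supported(c, notes)]
    return len(unsupported) / len(claims)

notes = [
    "Patient seen in neurology clinic. The patient has a headache.",
    "Headache evaluated; no focal deficits.",
]
output = "The patient has a headache. The patient was prescribed insulin"
print(hallucination_rate(output, notes))  # 1 of 2 claims unsupported → 0.5
```

Aggregating this rate per deployment and per model over a month is exactly the quantity the automated Quarto reports chart.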
So as an example, this is a plot that I made in plotnine and pulled from our most recent Quarto report, which ran for the month of August. Basically, what we're doing is estimating the hallucination rates across our different deployments among three of the different models that we route to within the Chatter infrastructure. What you can see is that one of the models that is becoming a bit dated at this point, GPT-4o mini, was hallucinating at a much, much higher rate than the other models that we were using, so this resulted in the recommendation to deprecate that model and switch to one that was more reliable.
Lesson three: you already have the tools
Okay, so I would now like to conclude with our third and final lesson, the kind of boots-on-the-ground view. When I talk to other data scientists about the fact that I mostly work with LLMs now, a common worry that I hear is, oh, I'm really intimidated by LLMs. I haven't worked with LLMs before and I'm not sure where to get started. I feel like the craze has kind of swept over everybody else, but it forgot me. So if there's one thing I would like for you to take away from this talk, it's just that as a data scientist, regardless of whether you've worked with LLMs in the past, you already have all the tools you need to work with LLMs if you would like to.
For example, during this talk we've met several components of monitoring a deployed LLM system that may have seemed kind of new, but that really connect quite deeply to things that you already understand as someone who works with statistical systems. For example, we talked about benchmarking, which is basically just another word for creating a test set for a model that you didn't train. You know how to create a test set that's free from bias, that is representative of your deployment population, that the model has not seen before during training; none of that is actually new. Secondly, we talked about identifying failure modes for a deployment so that you can either monitor them specifically or set up guardrails around them. And how do you identify these failure modes? Well, it's simple. You spend some time with the logs for the tool that you've deployed, do some exploratory data analysis on them, and then you work with your governance partners to identify the problems you really want to avoid. EDA and governance are foundational in every single data science workflow, so that's not new either.
And then finally there's this question kind of underlying this entire conversation, which is: what are the tools that I should use to build with LLMs? Unfortunately I can't give you a single answer for that. Here are a bunch of the frameworks that I have used either in the past or currently. There are many of them, but if you allow me to give a single recommendation for someone who is completely new, I would suggest getting started with the packages that Posit has developed for building with LLMs. If you work in Python, this is a package called chatlas. If you work in R, this is a package called ellmer. I think what sets these packages apart is just that their docs are so excellent. They will teach you everything that you need to know along the way, assuming basically no prior knowledge, which is really helpful when you're a beginner. And if even that is intimidating, which it very well might be, it certainly was for me when I got started, I'll also just leave you with the reminder that there was a time when the frameworks that we all use every day were new. dplyr was once new. ggplot2 was once new. pandas and Polars were once new to us, but if you believe in code-first data science, these are skills that you can pick up to help you solve problems. So there's no reason to be intimidated. With that, I will thank you all for being here and invite any questions.
Q&A
Thank you, Timothy. We have time for one question and then lunch. And then lunch, yeah. Human-in-the-loop accuracy measurement is expensive. How do you do this at scale?
Yeah, human-in-the-loop is expensive. I mean, you're kind of preaching to the choir here because our experts are also clinicians. Like, can you imagine the hourly rate for hiring a clinician as a consultant to do labeling? Like, whatever number you're thinking of, it's too low. But I think the cool thing for us as a health system is we're kind of uniquely positioned because all of our users are clinicians, and so if we're giving them the tools, we can kind of require them to give us feedback in return. How to do this at scale in the general sense, like how to do scale AI for doctors or something, I'm not sure. But at least in the medical context, I know that, like, we always imagine our deployments as having a human-in-the-loop. We're not doing things fully automated, so it's kind of baked into some of the workflows of the tools we're developing. Sorry for the non-answer, but it's a really hard question. Thank you. Let's thank Timothy and all our speakers again.

