
Reliable Maintenance of Machine Learning Models - posit::conf(2023)
Presented by Julia Silge

Maintaining machine learning models in production can be quite different from maintaining general software projects, because of the unique statistical characteristics of ML models. In this talk, learn about model drift, the different ways the word "performance" is used with models, what you can monitor about a model, how feedback loops impact models, and how you can use vetiver to set yourself up for success with model maintenance. This talk will help practitioners who are already deploying models, but it is also useful knowledge for practitioners earlier in their MLOps journey; decisions made along the way can make the difference between resilient models that are easier to maintain and disappointing or misleading models.

Materials: https://github.com/juliasilge/ml-maintenance-2023

Presented at posit::conf(2023), September 19-20, 2023. Learn more at posit.co/conference.

Talk Track: Tidy up your models. Session Code: TALK-1083
Transcript
This transcript was generated automatically and may contain errors.
All right, I am here to tell you about the Reliable Maintenance of Machine Learning Models. My name is Julia Silge, and I work at Posit on open source tooling for MLOps tasks, things like versioning, deploying, and monitoring your model. And I work on the vetiver framework, which provides both R and Python ways to approach these kinds of tasks.
Since this talk is about maintenance, we're at the end of the process: you've already deployed a model, so what do you do at that point?
But before we really get started, I think there's something you should know about me. I love finishing projects. I love checking things off checklists. I'm very motivated; if a project is almost done, I'll do whatever it takes to get it there. I love shipping stuff. This is how I'm wired up, and of course it plays out in my work life, but it also plays out in my non-work, my real life.
And just as an example, if I'm coming home from a trip and arrive at my house with my suitcase, I unpack it basically right away. Certainly before I go to bed that night, probably within an hour of getting home; honestly, it is sometimes the first thing that I do when I get home. Otherwise that suitcase is just sitting there, needing to be dealt with, hanging over my head, and I can't relax until it's unpacked, because I know I need to take care of it.
I'm sad to tell you that maintaining machine learning models is nothing like the process of unpacking a suitcase. It doesn't have like a beginning and an end, and instead, the work, the job of maintaining machine learning models is like the process of doing laundry. We do laundry all the time, we're never done doing laundry. You can catch up with laundry on one day, but guess what, there's going to be more laundry to do the next day. And maintaining machine learning models is like this, in that it is this never-ending process that has to be done forever.
Much like we have to decide when it is appropriate to start a new load of laundry, we have to decide when, how, and in what circumstances we are going to retrain a model.
Software vs. statistical performance
Now it's true that all software products have to be maintained; at least I have never been able to write a piece of software that does not require some kind of maintenance. But machine learning models are unique in that they have both software characteristics or properties and statistical characteristics or properties. So let's think about something that does not have any statistical properties, that's just software, just code; an example of this would be a fairly straightforward Shiny app.
Unlike something that's only software, with a model you could entirely retrain it with new data and totally change its statistical properties without changing its software properties at all. Machine learning models are complex in that they are both software products and statistical products.
Whether we're talking about the software or the statistical properties, what we're really talking about is performance, and one of the most basic prerequisites for model maintenance is monitoring that performance. When we measure model performance, it lets us say, "my model is performing well." But it turns out that if you hear someone say that, people can mean different things by it.
So when this person says this, what she means is: my model returns predictions quickly, it doesn't use too much memory or processing power, and it doesn't have outages. These are statements about the model's software characteristics, and the way you would measure this kind of performance is with metrics like latency or uptime.
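As a concrete sketch, here is how you might compute two of these software metrics from raw monitoring data. The latency numbers and health-check counts are made up for illustration, and a real system would pull them from logs or an observability tool.

```python
# Hypothetical per-request latencies (ms) and up/down health checks;
# these numbers are invented for illustration only.
latencies_ms = [12, 15, 11, 210, 14, 13, 16, 12, 18, 15]
health_checks = [True] * 998 + [False] * 2  # True = service responded

# p95 latency: the value that 95% of requests were at or below
p95 = sorted(latencies_ms)[int(0.95 * len(latencies_ms)) - 1]

# Uptime as the fraction of successful health checks
uptime = sum(health_checks) / len(health_checks)

print(f"p95 latency: {p95} ms")   # the one slow 210 ms request doesn't dominate
print(f"uptime: {uptime:.1%}")
```

Percentiles rather than averages are the usual choice for latency, because a single slow outlier can hide in a mean but shows up clearly in the tail.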
Another person, when they say my model is performing well, they may mean something different. So this person, when she is saying this, she means that the model returns predictions that are close to the true values for whatever it is that we're trying to predict here. So I am, I consider myself by profession largely a data practitioner. And I bet in this audience, that is true of most of you as well. And as data practitioners, we focus on monitoring this kind of statistical performance.
I'm not saying you never have to learn about or work on the software characteristics of your models. You almost certainly will. But we are uniquely responsible for the statistical characteristics of our machine learning models, because if we don't do it, nobody will. There's no one else who cares about this and who will make sure that it is in good shape.
And it's important to think about that, because failures in statistical performance have this very unique tendency toward silence. Let's say your model has a problem with latency: it's returning predictions slowly. Typically that will be noticed. Typically that will cause a problem in your system such that you know to go in and change it.
If your model has great latency, returning predictions really fast, but the predictions are actually nonsense, it is very possible for you not to know unless you are actively looking for it. With statistical performance, it is easier to have problems that you don't know about.
Data drift and concept drift
There's some vocabulary around these kind of problems. You may have heard this term model drift. I love this term. It's very evocative to me. Model is drifting off, away from where I need it to be. So that's a generic or general term. And there are also more specific terms, and we're going to talk about two. The first one is data drift, and the second one is concept drift.
So let's say that I run a laundry service in my city, and I'm going to build a model to predict whether someone's going to be my customer. I use inputs, features, predictors to the model. Maybe I use things like how many loads of laundry a household needs to do a month. Maybe income. Maybe whether someone lives in a house or an apartment. I use these inputs to train a model and get a predicted probability that someone will be my customer.
Data drift is about shift, drift, change in those inputs between when I developed the model and when I use it. So how do you measure this? How do you know you have it? Using tools you are probably very familiar with, things like summarization and visualization. Let's look at a possible example.
Let's say on the X axis here is the number of loads of laundry that a household needs to do every month. I built my model, I'm using it in my business, and as I go along I keep track of the distribution that I see. To know if I'm experiencing drift, I need to compare it to what I had when I trained the model. If it looks like this, I might think I'm a-okay: those look like they're drawn from the same distribution. I'm talking about comparing this visually, but you can also use any of the statistical tests appropriate to the kind of data you have. If by contrast I see something like this, then I know something has changed: the distribution of this feature has changed. If the feature is important in the model, that means the outputs from my model will change as well; I'm going to get a different distribution of predictions than I would have expected. So to understand whether you have data drift, you need to monitor your inputs.
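One classic statistical test for this kind of comparison is the two-sample Kolmogorov-Smirnov test, whose statistic is the largest gap between the two empirical CDFs. Here is a small self-contained sketch of just the statistic (a real analysis would also use the test's p-value, e.g. via `scipy.stats.ks_2samp`); the laundry-load samples are made up for illustration.

```python
def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum gap
    between the empirical CDFs of the two samples."""
    values = sorted(set(sample_a) | set(sample_b))
    max_gap = 0.0
    for v in values:
        cdf_a = sum(x <= v for x in sample_a) / len(sample_a)
        cdf_b = sum(x <= v for x in sample_b) / len(sample_b)
        max_gap = max(max_gap, abs(cdf_a - cdf_b))
    return max_gap

# Loads of laundry per month: training data vs. two production windows
train = [4, 5, 5, 6, 6, 6, 7, 7, 8, 9]
prod_ok = [5, 5, 6, 6, 6, 7, 7, 7, 8, 9]               # similar distribution
prod_drifted = [9, 10, 10, 11, 12, 12, 13, 14, 14, 15]  # shifted upward

print(ks_statistic(train, prod_ok))       # small gap: little evidence of drift
print(ks_statistic(train, prod_drifted))  # large gap: worth investigating
```

The same idea works for any numeric feature; for categorical features you would instead compare category proportions, for example with a chi-squared test.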
Concept drift is about the relationship between the inputs and the output, and concept drift means that relationship is changing over time. So let's say in the city where I have my laundry service, a bunch of new apartments are being built, and all the new apartments have laundry units inside. If house versus apartment is a feature I use in my model, the fact that the nature of apartments is changing means I will have concept drift: the relationship between this feature and my output is changing over time. So with concept drift, it's not enough to monitor only your inputs; you also need to monitor your outputs.
This is a place where vetiver comes in, because we're now talking about specifically statistical performance. In either R or Python, there are functions that take your monitoring data, say from the laundry service, and compute the metrics measuring the statistical performance of your model at a given time aggregation, with the metrics that matter to your business. vetiver also has functions to help you get started with these kinds of basic monitoring plots, and because everything you have access to is code, you can extend and add on to them.
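Conceptually, those metric-computing functions group your monitoring data by a time period and compute each chosen metric per period. A minimal pure-Python sketch of that idea (this is not vetiver's actual API, and the monitoring records are invented):

```python
from collections import defaultdict

# Each record: (week, true_label, predicted_label) for the hypothetical
# laundry-service customer model. Data is made up for illustration.
records = [
    ("2023-W01", 1, 1), ("2023-W01", 0, 0), ("2023-W01", 1, 0),
    ("2023-W02", 1, 1), ("2023-W02", 0, 1), ("2023-W02", 0, 1),
]

# Group by week, then compute accuracy per period -- the same shape of
# result a metrics-over-time monitoring function gives you.
by_week = defaultdict(list)
for week, truth, estimate in records:
    by_week[week].append(truth == estimate)

accuracy_by_week = {week: sum(hits) / len(hits) for week, hits in by_week.items()}
print(accuracy_by_week)
```

Plotting a table like this over time is exactly how a falling line reveals concept drift: the inputs may look the same, but the metric degrades.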
Feedback loops
Now one thing that's important to know as we start talking about model monitoring is that the very fact that you put a model into a system can cause the things we were just talking about, data drift and/or concept drift. We call this a feedback loop.
A feedback loop can happen when a user takes some action as a result of a prediction, when users correct predictions, or when a system produces feedback automatically. To be a little more concrete, let's walk through a couple of examples.
Many of us, I expect, are customers of a streaming service like Netflix or Hulu. In those apps, you will see recommended pieces of content, driven by machine learning models. If you click on one of those recommendations, that's a signal to the system that the prediction was good; that data is collected and then used to improve the model. This is an example of a good feedback loop.
Sorry to report that not all feedback loops are good. There's a really interesting case study from Stripe, where they had a dataset of credit card transactions and trained a model to predict which ones were fraudulent. They then took that model, which looked great, and put it into their system in such a way that it blocked the predicted-fraudulent transactions; those never became credit card transactions at all. Some of you may see where this is going. As happens with models in production, it gradually started doing a little bit worse. But when they went to retrain their model, they found they had not captured the information needed to label those blocked transactions, so they actually could not retrain. They had to go back, do new data engineering to capture information in a different way, and put those datasets together to be able to retrain their model.
This is an example of a feedback loop that's bad for a single company, but we also have evidence of feedback loops that can be bad for society as a whole. If you take data from, say, arrests, train a model to predict which areas of a city are going to have more crime, and then take some intervention based on that prediction, you can end up with a feedback loop, depending on the nature of the intervention, that makes certain areas actually worse off. You can make certain neighborhoods have a worse experience than if you hadn't used a model at all.
Stages of model monitoring maturity
Okay, so I've got one more topic I want to talk about as I come to the end of my talk, and that is the stages of model monitoring maturity. The first stage is just that you do it at all, and honestly, that's a huge step. But often we start out doing things manually. So picture yourself: you deployed a model last week, and then a product manager sends you a Slack message asking, how is that model doing? You run a query, you make a plot, you do a statistical test, and then you share that in Slack with the product manager. That is a first step of model monitoring maturity.
The next step is to make that reproducible. So you take that code that you wrote, you arrange it in a script, you check it into version control, and then next week, when a different person asks you the same question, you can give them a reproducible answer.
The next stage of model monitoring maturity is to move to automated model monitoring. Here you take that code you checked into version control, you rearrange it into a report or a dashboard, and you publish it somewhere so that it updates automatically. For example, you can use a cron job for this, and Posit Connect, of course, has features for this kind of automated updating. Now, any time anyone wants to know how the model is doing, there is one place to look that is updated automatically on the appropriate schedule for your particular use case.
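For the cron route, the entry might look something like the following; the schedule, the file path, and the report name are all assumptions for illustration, and on Posit Connect you would configure the schedule in the UI instead.

```shell
# Hypothetical crontab entry: re-render a monitoring report every
# Monday at 06:00. Adjust the path and schedule to your own setup.
0 6 * * 1 quarto render /home/analyst/monitoring/model-monitor.qmd
```

The five fields are minute, hour, day of month, month, and day of week, so `0 6 * * 1` means 06:00 on Mondays.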
Vetiver, again, comes in here with tooling, wherever you are in these stages, to help you move to the next stage. As one example, the R package includes a flexdashboard template for getting started with model monitoring, to make a nice dashboard you can share with your coworkers. It has code showing how to make a plot measuring statistical performance, how to make a plot monitoring inputs, and how the model works, so you can share visual documentation that helps people understand what the inputs and outputs of your model are.
Okay. So what we've talked about is understanding what performance means when it comes to models and how it's important to understand both the software and the statistical characteristics of a model. We talked about measuring both inputs and outputs to really get a handle on these different kinds of drift.
I want to say that this is the knowledge to have, and also that we need really nice tools to work with, in order to have a model that's easier to maintain, a model that's resilient, and a model that is successful not just at the moment we deploy it, but in the long term.
You may have noticed there's a URL at the bottom of these slides, and these slides are posted there. You can go and visit all the links that I included, and I especially invite you if you're interested in getting started with model monitoring or even some of those earlier stages like deploying your model to check out vetiver, see if it's a good fit for you, whether you work in R or in Python, and with that, I will say thank you so much and see if there's any time for questions.
Q&A
We do indeed have a few minutes for questions, so if you haven't already loaded up the Slido page for this room, please do so and submit any questions you have. I do have one on the screen. The question is: does model monitoring tie in with model fairness? Oh, that's a great question.
Much like during model development, when you might want to measure model fairness for different versions of a model, or even tune a model's hyperparameters so it has the fairness characteristics that you want, in the same way you can use fairness metrics when you're monitoring. Over time, you can measure and keep track of how these fairness metrics evolve. When people say fairness, they're usually talking about metrics: things you measure about a model, usually disaggregated by some protected characteristic. The other related thing would be explainability. Sometimes people say fairness and are a little vague, but fairness metrics fit perfectly into monitoring. Explainability, by contrast, you would probably not monitor over time, but rather make sure you can return an explanation with any given prediction.
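The basic shape of a disaggregated fairness metric is simple: compute the same metric separately per group and compare. A small sketch, with an invented group column and made-up labels:

```python
# Each row: (group, true_label, predicted_label). The group labels and
# data here are hypothetical, purely for illustration.
rows = [
    ("A", 1, 1), ("A", 0, 0), ("A", 1, 1), ("A", 0, 1),
    ("B", 1, 0), ("B", 0, 0), ("B", 1, 0), ("B", 0, 0),
]

groups = {}
for group, truth, estimate in rows:
    groups.setdefault(group, []).append(truth == estimate)

# Accuracy disaggregated by group; a persistent gap between groups
# is the kind of signal you would track over time while monitoring.
accuracy_by_group = {g: sum(v) / len(v) for g, v in groups.items()}
print(accuracy_by_group)
```

The same pattern applies to any metric (false positive rate, recall, and so on) disaggregated by any protected characteristic.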
Thank you. One more question. Okay, just enough time. Is there a way to put confidence intervals on the monitoring metrics? Ah, yes. Depending on the time aggregation that you use for model monitoring, you have a certain number of observations in each period. When you have the sort of feedback loop where you get the real answer for whatever the prediction should be, say a thousand labeled examples in a week, you can use those thousand examples to estimate a confidence interval, or at least an interval, on the metric that you're measuring.
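For a metric like accuracy, each labeled prediction can be treated as a Bernoulli trial, so a standard interval for a proportion applies. A sketch using the normal approximation (the counts are invented; a Wilson interval would behave better for small samples or extreme proportions):

```python
import math

def accuracy_ci(correct, total, z=1.96):
    """Normal-approximation 95% confidence interval for accuracy,
    treating each labeled prediction as a Bernoulli trial."""
    p = correct / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return max(0.0, p - half_width), min(1.0, p + half_width)

# Say this week's feedback gave us 1000 labeled predictions, 870 correct
low, high = accuracy_ci(870, 1000)
print(f"accuracy {870/1000:.3f}, 95% CI [{low:.3f}, {high:.3f}]")
```

The interval narrows as the aggregation window collects more labeled examples, which is one reason the choice of time aggregation matters for monitoring.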

