Resources

Building an MLOps strategy from the ground up - Isabel Zimmerman, RStudio PBC | Crunch 2022

This talk was recorded at Crunch Conference 2022. Isabel from RStudio PBC spoke about building an MLOps strategy from the ground up. "By the end of this talk, people will understand what the term MLOps entails, different options for deployment, and when different methods work best." The event was organized by Crafthub. You can watch the rest of the conference talks on our channel. If you are interested in more speakers, tickets and details of the conference, check out our website: https://crunchconf.com/ If you are interested in more events from our company: https://crafthub.events/

Aug 18, 2023
50 min

image: thumbnail.jpg

Transcript

This transcript was generated automatically and may contain errors.

And now, our first speaker of the day. Isabel Zimmerman is a software engineer on the open source team at RStudio, where she works on building open source MLOps frameworks. The title of her talk is Building an MLOps Strategy from the Ground Up. Let's hear it for her, everybody. Please welcome Isabel Zimmerman.

So, fun fact, my slides are running code behind the scenes, so I'm going to quickly render them. So, I'm running Python scripts right now, and then we get to start.

Hello, everyone. Today, I will be talking to you all about building an MLOps strategy from the ground up. So, if you want to follow along, my slides are at this link. You can, you know, put it in a tab somewhere and forget about it for a few weeks and rediscover it. Also, these slides are clickable and they're running code, so you can kind of interact with them as we go through them. All you have to do is open up this link. They will auto-advance as I advance these slides. It is the magic of this software that I'm using called Quarto.

But, of course, we're not here to talk about slides. We're talking about MLOps today. So, from the ground up, we're going to start with what is MLOps anyway, and how can I start? We also want to keep in mind what should we be considering when we are looking at using tools for MLOps.

But I'll introduce myself first. So, hello, everyone. I am Isabel. I started my career as a software engineer slash data scientist, and I was deploying models using Kubernetes. And if you've ever used Kubernetes, it's not always, like, the most fun thing to do, so I needed a little bit of a de-stressor, and this was coming from my dog, Toast. This is a real-life image here of him.

And I started training him tricks, and my first trick that I trained him was to sit. And if you've ever had a dog, when you start training your dog to sit, you kind of have them come up, and you hold a treat above their head, and you kind of push their bum down, and you say, sit, and you give them a treat. And all is well. And so I was going to show this off to my neighbor, and I was super excited. So, you know, I have Toast in hand, and we're walking. He's right here. And I say, hi, neighbor. And, Toast, sit. And he doesn't sit.

And, of course, the data science in me is like, I overfit my model. Something is wrong here. And I realized that when I had taught him to sit, I had only ever taught him from the front, but when we're walking, you know, he's on my side. So, Toast taught me an important lesson here.

When I was training him in my cozy living room, I only had the information that I knew where I wanted him to sit. But when we were out in the real world, the circumstances changed. He was on my side, and he didn't realize that this was kind of the same thing. He had to sit in either direction.

But I learned a hard lesson. You know, even if you're training for the right outcome, the real world brings a new set of challenges, and things behave differently.

But this is not a dog training conference. So I'll tell you this instead. If you develop models, you can operationalize them. In fact, I'll go so far as to say if you develop models, you should operationalize them. Well, most of them. Some of them.

Every now and then, there's this statistic that goes around that 85% of models don't make it to production. And people use this as a statistic to say we need more MLOps tooling. And I would say, I mean, from the data science perspective, you train a lot of models, and not all of them are production-ready. So I think it's actually a good thing that 85% of models don't make it into production for both quality and cost-saving reasons.

So I gave my dog information. And at home, there was a great response and a reproducible response. But out in the wilderness of my suburban neighborhood, when I gave my dog instructions, it didn't look the same. Things behave differently in different environments. Models work the same way as Little Toasty. When you're training and tuning a model, data scientists, they often live in their Jupyter Notebook. You have your installations of packages that you've had for years, maybe. Hopefully not. But it happens. And when you productionalize your model, sometimes things look different. Maybe the data that's coming in isn't exactly what your model is used to.

This is really important to think about, because the business value of models often comes from when they're in production. And how that model works in the wild. But we're getting a little ahead of ourselves right now. We'll start with, what is MLOps?

What is MLOps?

MLOps is the combination of machine learning and development operations, which I find is not the most useful frame of mind when thinking about, like, what does this word really mean? So I think a better definition is that MLOps is a set of practices to deploy and maintain machine learning models in production reliably and efficiently. And these practices are hard.

When I started deploying models with Kubernetes, I struggled with a lot of the tools that were out there. I felt that as a data scientist, these tools weren't really made for me. They were made for a different persona. So I actually ended up leaving that job, and I started working on tools for data scientists to help them kind of build the skills to develop and deploy models.

So currently I work on a package called Vetiver, and it's helping data scientists deploy models in both Python and R. So this means you can install the package Vetiver in RStudio. You can also install the package Vetiver from PyPI.

The data science workflow

So let's look at this data science workflow right now. So you start by collecting data. You're getting your data from the cloud or the ground or wherever people are getting data these days. And then there are great tools out there to help you understand and clean your data, things like the tidyverse or data.table in R, or pandas or NumPy or siuba in Python. Once your data has been cleaned and you have a good understanding of what's in there, you can train and evaluate your model. Once again, there are great tools out there, like tidymodels and caret in R, and scikit-learn and PyTorch in Python.

And let's look at the code. I am going to be showing you guys a lot of Python code today, but it's okay. We're in it together. We're going to go step by step. So we're going to do a little bit of, like, data cleaning. This is really just loading data and selecting columns. And this is looking at a dataset that is predicting how many likes an advertisement will get on YouTube, depending on whether the ad is funny, or if it shows the product, or if it's a patriotic ad, or there's a celebrity, or danger, or animals. So we load in this data, and it looks like this. We have our like count, and everything else is true or false.
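The slides' code isn't in the transcript, so here is a minimal sketch of that loading-and-selecting step on a tiny hand-made stand-in for the YouTube ads dataset. The column names and values here are illustrative assumptions, not the real dataset's schema.

```python
import pandas as pd

# Tiny stand-in for the YouTube advertisement dataset described in the talk:
# a like count plus true/false features describing each ad. Column names and
# values are made up for illustration.
ads = pd.DataFrame({
    "like_count": [1200, 340, 560, 78, 2300, 150],
    "funny":      [True, False, True, False, True, False],
    "patriotic":  [False, False, True, False, False, True],
    "celebrity":  [True, False, False, True, True, False],
    "danger":     [False, True, False, False, True, False],
    "animals":    [True, False, True, False, False, True],
})

# "Data cleaning" here is just selecting columns: the outcome we want to
# predict, and the boolean features describing each ad.
y = ads["like_count"]
X = ads.drop(columns="like_count")
```

From here, `X` holds only true/false columns and `y` holds the like counts, matching the shape of the data described in the talk.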

Once we have our data loaded, we're going to, you know, train and tune our model. So we're splitting it up into a test and training set. So this is to make sure that we're not giving our model the answers when we're trying to teach it what it's doing. We have a little bit of preprocessing. So you saw that those columns were true and false. Models don't speak English. So we need to translate that into some numbers. And that ordinal encoder is doing that for us. From there, we're going to use a random forest regressor. So we're going to put those numbers in, train the model. We're ready to go.

We're actually going to put these together in a pipeline. And this is important because you want to deploy your feature engineering and your model together if you are fitting them together. If the feature engineering is learned from data, it's all going to be deployed at once.
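The steps just described can be sketched like this: a train/test split, an ordinal encoder to turn true/false into numbers, and a random forest regressor, all bundled into one scikit-learn pipeline. The toy data below is made up for illustration.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

# Made-up ads data: boolean features and a like count.
X = pd.DataFrame({
    "funny":   [True, False, True, False, True, False, True, False],
    "animals": [True, True, False, False, True, False, True, False],
})
y = [1200, 340, 560, 78, 2300, 150, 990, 60]

# Hold out a test set so we never hand the model the answers we grade it on.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# OrdinalEncoder translates True/False into numbers the model can consume;
# the pipeline bundles that preprocessing with the random forest so the two
# are always fitted, and later deployed, together.
pipe = make_pipeline(OrdinalEncoder(), RandomForestRegressor(random_state=0))
pipe.fit(X_train, y_train)
preds = pipe.predict(X_test)
```

Because the encoder lives inside the pipeline, deploying `pipe` deploys the feature engineering and the model as one unit, which is the point made above.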

So we saw the beginning of this cycle. You know, we have really clear tools of what to use for data scientists. But it gets a little bit hazy of where should we go next after we have built our model. And these MLOps practices that we're looking at, Vetiver focuses on three of them. It focuses on versioning, deploying, and monitoring a model.

Versioning models

So MLOps is versioning. So we are amazing data scientists. We have built our model. And we have named it model, like, things happen sometimes. And once we have trained our model, we're ready to deploy it. So we have our model final. And then we realize maybe we need to do some edits. We need to tune some more parameters. So we have our model final version 2. And then actually there's been some data drift or something, and there's still more to be done with this model. And now we have model final version 2 actually this time. We're ready to deploy.

Okay. So this lacks context. And it's not scalable: this is for one model, so what happens when you have hundreds of models? This is not going to work. So when you're versioning a model, you're going to package up a few different things together. You're going to include the model object, the model itself. You're going to have a hash, for a much more robust version than this. And then you're going to have some metadata to give this model context.

Managing change in models is what versioning is all about. It's important to think about versioning across time, because if you have a model version 1, version 2, version 3, you can imagine how the entire team of data scientists can look back at what changed. It's also important if you have different implementations, say a staging and a production server. Really, any time you have multiple models, some sort of versioning is going to help you organize this. It's going to make it easier for yourself, and it's going to make it easier for you to communicate and share these models with the rest of your team.

So in Vetiver, we have a secret weapon called pins. This is another package developed by my team, and it's all about versioning. So for pins, you kind of have to think about it as a board, and you're going to take these models and pin them on the board. So we have our model board set up. It's a temporary board. We have to explicitly allow pickled models to be read, because unpickling can run arbitrary code. If that last sentence was like, why are you talking about pickles right now, don't worry about it. So our model board is ready. And now we're going to make a Vetiver model.

And this is because when you're training a model, you actually have a lot of context that you can put together into one object. So we're putting our pipeline in here. We're going to give it a name called ads. And we're going to give it an input data prototype, which is just, like, the X-train data. And this input data prototype is important because, like we saw from Toast, sometimes the real-world data looks different than what we trained with. And we want this deployable Vetiver model object to be aware of what the original data is supposed to look like, so you can get better errors when things go wrong.

So once this Vetiver model has been created, you're going to pin write. You have your model board. You're going to write this pin onto it. And I promised that this works in R as well. And I think it's important to show you guys kind of how similar this code looks. So in R, you have your libraries of Vetiver and pins. You have your temporary board, your Vetiver model, and your Vetiver pin writing this model to the board. We worked really hard to make sure that each of these packages feels good in the native language. But that it's also easy for teams that are multilingual. If data scientists need to hop between languages, you don't have to learn an entirely new framework.

So I promised you some metadata. Here's some metadata. From just these few lines of code, we have gathered a lot of information here. You can see there's a title. The model's name is ads. It's a pipeline object. It's from scikit-learn. We can see the size of the model. We can see a hash. The very specific version. And we can see that user ptype right there for what the data should look like. We can also see what the required packages are to run this model.

So we were talking about, you know, collaborating with your teammates. And I just showed you a board that is temporary. And actually when you run it at the end of the lines of code, the board will be deleted. So where else can you put these boards? We started with our temporary board. But you could also have a local board. Maybe you just wanted it kind of on a local machine. Or if you wanted to put this on prem, maybe a local board is the right move. But it gets more interesting. So you can also just switch out board to S3. And you can host this on Amazon Web Services. You can host it on Google Cloud Storage. You can pin models to Microsoft Azure Blob Storage as well. You might have to add a few more arguments to, like, give the right credentials. But everything looks and feels the same.

However, when these models are up in Azure or S3 or wherever, you still have to pull them down, load the model, get your data, like put it in the model, make the prediction. And it feels clunky. So we need a better way than just saving models, although this is an important place to start. So we want to deploy the models as well.

Deploying models

And deploying just means putting a model in production, which is kind of a loaded sentence because what is production? It looks very different for different people. But in general, we're going to say that deploying a model is anywhere that is not on your local laptop. This could be inside of an application. This can be just kind of living somewhere or anywhere in between.

So the idea of data science workflows is directly opposite of software engineering best practices. Software engineers are very interested in reusable, testable, very strongly put together, super robust deployments. They want to be certain that things are going to act the way that they expect them to. And data scientists are a little more experimental. We want to move fast. We want to make lots of different models. We want to be agile when data changes. So there's a little bit of tension there.

When we're thinking of making an application, if you told your software engineers that actually whenever I update my model, we're going to copy and paste the code from my Jupyter Notebook into the application, and it's going to go great, they might kind of look at you funky. We don't want to be copying and pasting code between applications, and we don't want to be emailing model files to each other. So we're going to use something called a REST API. A REST API is just a simple way to have machines communicate with each other. It's standardized, and they're useful because these endpoints can be hosted in computational environments that are completely independent of an application.

So this makes software engineers happy because your endpoint will stay the same even as you update your model, and so the application code won't change, and you can kind of live in your world of your own API if you are a data scientist. And these APIs are also testable, and they are pretty easy to put together.

So we have our Vetiver model from before, still called v. And to set this model up locally in an API, we'll have VetiverAPI(v).run(). This is built off of FastAPI, which is a very popular framework, similar to Flask, but a little bit more lightweight. So when you're running your application or your API endpoint, you'll get a browser window that looks kind of like this. You can see at the top it's 127.0.0.1. It's on my local machine. This is all a self-documenting API. So it has the name of the model here. It has the pipeline class. We can see where the server is hosted. But if you scroll down, you'll see this predict endpoint. And this predict endpoint is kind of fun. You can interact with it to make sure the model in your API endpoint is behaving as expected. We can see what happens if the ad is funny and has animals. We can also look at the response headers and curl to interact with this model, which is useful if you need to pass these off to somebody else on your team.

But this is still local. 127.0.0.1, that's my local machine. So we want to deploy it somewhere else. There is a one-liner for RStudio's pro product: it is deploy rsconnect. You give it the Connect server, you give it the board, the pin name, and the version you are deploying. And these deployments are strongly versioned. So why versioning is so important: if we want to update a model or roll back, we just have to change that version there. So with this one-liner, our model can be on Connect.

So this is what it looked like before on my local machine. You can see at the top, 127.0.0.1 is the URL. And here's what it looks like on Connect. And it's kind of funny, because I can click back and forth and it looks pretty much exactly the same. But this is an important piece. Things break very quickly when your development environment looks nothing like your production environment. So this is really a nod to the reproducibility and stability of APIs. If I test it locally and deploy it, it should look and feel the same, even if it's hosted in a different location.

But I understand that not everyone is a Connect user. So I had to build different ways for people to deploy this to other locations. So if you do vetiver write app and give it the board and the pin name, and you can also add the version, it will write an app.py file that has all the information to create this API. And a lot of cloud providers will take this app.py file and run it for you, so you can bring this to different cloud environments. And for some clouds, this is enough. But others want a Dockerfile.

So if you've written your app.py file, you can also write a Dockerfile that's linked to this app.py file, and you'll have a Dockerfile generated that essentially works out of the box. This is really nice because it makes that deployment struggle a little easier; data scientists have these tools in hand. You can bring your Dockerfile and send it off. A lot of cloud providers have a bring-your-own-container platform. And we're also working on specialized Docker containers for different products inside cloud environments, such as AWS Lambda. So in the future, if you look for something like a write Docker for Lambda, that might be something we'll be able to have for you.

Monitoring models

So your model is deployed. But a data scientist's work is not done yet. You have to do continuous monitoring to make sure that this model continues to perform the way you expect it to. And that's because data drifts. People's preferences change. If we had a model deployed that was trained on the yellow circles, our model works pretty well. But if preferences change and now we're looking at these purple triangles, our model is not doing very well anymore.

And this is important to track, because if your model's performance starts decaying, there might not be an error. This is another one of these things where we have to think about software engineers versus data scientists. When software engineers are making applications, most of the time, when things fail, they fail very loudly. You get big red errors. You get calls at 3 a.m., like, why is everything broken? But models fail silently. They can still run forever and ever with no error, even if your accuracy is 0%, and they keep giving you predictions even if they're horrible. So if you're not monitoring your model in some way, you are oblivious to the decay that may be occurring.

So we'll set the scene. Our YouTube-likes predictor has been in production for years now. We are really happy with its performance, but we want to keep monitoring and making sure that this is working the way we expect it to. So we're going to start with a few things. We have new data, a data frame that has been collected somehow, maybe through data engineering pipelines. And it has a few columns that we have to give to the Vetiver compute metrics function.

And the first column name that we have to give is the name of the date column. So this is a very specific flavor of date. A lot of times, models will use time as an input to the model. However, this is actually not the input data that you are predicting on; this is the date that the prediction was made. We also need to know the true value of the like count as well as the predicted value, so we can compare these two to see if our model is still performing well. There are a few other things, like the time delta. This is how we are aggregating over time, so this is looking at, every one week, how is this model performing? And we're going to give it a metric set. Here we have mean absolute error and mean squared error from the scikit-learn metrics. These could just as well be custom metrics, as long as they take a y-true and a y-predicted value.

So: new data, our date column, our time delta, and our metric set. And then we're going to plot these values. And this has a super simple pipeline: the output of the Vetiver compute metrics function can go directly into plotting. But it is just a Plotly graph. The beauty of open source is that it can be very, very flexible. So we can update the y-axis, we can change the colors, or do whatever else we need to make sure that this graph looks to our liking. So when we show this graph, it'll look something like this. And we can kind of hover over things, see how it's been performing in 2014, 2015. I wouldn't place any bets on how well this model is working. But here's our monitoring, so that we know we can't place any bets, that it's not working.

On the R side, there are also helper functions to get you a template of a dashboard. So this has three tabs that we're going through. The first one is that same line graph that we saw earlier. The second one is a placeholder for a custom plot, so whatever you want to be looking at when you're monitoring your model. And you can just continue to add these tabs, because this dashboard is actually just R code; it's as flexible as you can program your R script. The last tab is the API visual documentation, the same thing that we saw a few slides ago.

So this is just to kind of ground people. You know what model you're using, you know where it's living, and you can interact with it in this dashboard.

So what happens when things go wrong? We were monitoring for when things go wrong, but like, oh no, it's happened now. What do we do? The first step is almost always to retrain your model. If your data has drifted and you've chosen kind of the appropriate data or the appropriate model, the data structure kind of looks the same. Retraining, most of the time, works very well and you can continue on with your life. However, sometimes you need to try a new model type. So this is important if things have changed, retraining isn't working, here's another plan of action. However, you always have to remember to version. Because these are strongly versioned APIs, if we want to roll back or move forward in case, you know, performance goes bad again, we need to go back to a different model. This is really important for that stability and those good software engineering practices that we want to integrate into our data science workflow.

Model cards and ethics

So MLOps is thinking about making good models. Now, of course, we want to have statistically good models. That's hopefully a given. But we also want to have models that are giving a nod to ethics. There's no surefire way to avoid bias. But thinking about it is a good place to start. This is a space that my team has been actively investing in and trying to figure out how we can make tools that really give data scientists best practices in not only a statistical sense, but an ethical one.

So we saw this earlier. So vetiver pin write, we're writing this pin to our board. But what I actually had hidden is the fact that this pops up. It says model cards are a framework for transparent, responsible reporting. And it says to use this function if you want to use this Quarto template. So model cards were created by a team at Google. I think they've since left Google. And they realized that there is a more holistic view of models that oftentimes isn't captured.

So if you run this vetiver model card function, you get a partially automated Quarto template. Quarto is the same thing my slides are made with; it's a technical documentation platform. It's pretty easy to install. It kind of feels like a Jupyter notebook, but it's multilingual in a different way. And it can make slides, which is fun.

So we're going to make this model card. And it's partially automated in the places where we were able to automate things. But it also gives a lot of space for data scientists to jot down their thoughts. So you can see I have... It's generated that we have a scikit-learn pipeline model. It's got four features. We can see the time the version was created. And you can put in your email and stuff like that.

If you scroll down further, you can see that there's stuff on training data and evaluation data. We have a printout of what the p-type looks like. This is for a different model. But you can see kind of the title, the default, the type. And we can also look at quantitative analyses. So we'll look at the model performance. We'll look at the disaggregated model performance. So how does this model perform on different groups? We can also visualize the model performance, which is out of the box. There's also a space for you to put custom plots. Once again, this is just code. So you're able to edit this and change things up as much as you need to.

If you scroll all the way down to the bottom, there's a very important piece. And it says ethical considerations and caveats and recommendations. So our model for predicting likes on YouTube, we might not think that it has any ethical challenges. But I'll give two pieces of advice to that thought. One is you should probably check with the people that your model is impacting, especially marginalized groups where this might be impacting differently than the general population. And even if you have done all of the surveys, you've done your due diligence, and you find that there are really no ethical considerations, I would recommend that you don't delete this section.

Even if you have imprecise or imperfect information, just writing down your thought process or what you've thought about for the ethical considerations and recommendations, it's important. So the processes of documenting, they're limited by the data scientist, so this is really only as useful as you want to make it. But I think, so my dad has a quote. He's been telling me this all my life, and I'm going to share it with you guys because I think it's important here. If you haven't written it down, you haven't thought it out. Model cards are a great way to collect your thoughts, to think about your model in a really holistic way that I think is really important.

If you haven't written it down, you haven't thought it out.

Choosing MLOps tools

So from the ground up, what is MLOps? I hope by now you realize that it's a set of practices, and we've gone over a few practices here. So what do we have to keep in mind when looking at tools?

So we're going to fast-forward a week from the conference. We're also going to be looking at my speaker notes right here. So we're going to need a few tools, and we're all scientists, and we're going to do our good research. And our good research is going to include a Google search for what does the MLOps landscape look like. And you're going to get something like this. And you're going to realize this landscape is crazy, and there's no one-size-fits-all tool for what your team needs. I've shown you Vetiver, but there are lots of other options.

So I make data science tools, but I've also thought a lot about what do other tools do and what's important to keep in mind. The first thing is that you and your team are unique. So I've talked to a lot of different data scientists and customers, and for some teams, just having a pinned model kind of up in a cloud somewhere that helps people collaborate with each other is amazing. And that's the ML practices that that team needed. For other people, it's deploying a model, and it's monitoring that model, and that works for them. And there's a great span of things between these two. So think about what your team needs.

You also might want to think about looking for tools that are composable. So you want to have tools that are working in different environments. I showed you Vetiver will work in a local environment. It'll work in RStudio Connect. It'll work in a variety of public clouds. And we also want this to work well with other tools. So if we have maybe different open-source tools that are out there, we want this to be composable to put these pieces together. I think a good example of this is Pandas and Scikit-learn do this amazingly. Pandas and Scikit-learn both use this data structure called a data frame. And it kind of just makes sense. You think about it, and you're like, oh, yeah, like I just put my data frame in Scikit-learn, and my model is trained. But think for just a moment how insane the Python data science ecosystem would look like if those two packages did not agree on a common data structure.

In the same way, you can use Vetiver alone to do the tasks of versioning, deploying, and monitoring, but you can add in other tools that are out there that help you build the system that you need. It's also important to have tools that are composable with themselves. We looked at a pretty straightforward MLOps use case today, but you can actually layer and add in different pieces if you want an API with multiple endpoints, if you want an API that does post-processing. You can put Vetiver together with itself, kind of like Lego blocks, to create a very complex system.

Reproducibility is important. And I can point to about, well, any textbook about machine learning, and they'll tell you that it's important that your machine learning model is reproducible, and the same goes for MLOps. But I do want to highlight the reproducibility of open source. So we saw the landscape slide a few moments ago, and there's a lot of options out in the MLOps space today. From a business perspective, open source is important because open source is forever. It will outlast all of us as long as the Internet is here, also because it's free and companies love that. But this is important that the lifetime of an open source project is so long because even if the company that supports it changes, you can still use the same code.

From a human perspective, open source is for everyone. Once again, it's free. And second, this is a great equalizer for people who need it. As long as you have Internet connection and the courage to open an issue or connect with people, this is an amazing space for people to build skills in high-tech areas such as machine learning operations.

And finally, you should be looking at tools that are ergonomic. Your need here is to implement some MLOps practices, and your tool is going to help you do that through some engineering magic. But we are humans with finite capabilities, and we need a tool that feels natural for the task. The tool needs to fit the person, and people need a tool that feels good to use.

Vetiver is a tool that was made with data scientists in mind. It should feel like a natural extension of the workflow you're already using in Python or R. But not all tools are built for data scientists, and I think that's a good thing. It's important to look for a tool that works well for the persona you bring, whether that's a software engineer, a data scientist, or a data engineer. And of course, this goes for any tools, not just machine learning operations tools. You also want a tool that can do the tasks you need. MLOps is a broad space that covers a lot of ground, and different tools highlight different sections of it. Some MLOps tools focus just on the pipelines that move data from place to place. Other tools focus on experiment tracking, logging every single hyperparameter as you tune. So it's important to have tools that do the things you want to do.

So we understand what MLOps is, and we know what to keep in mind for tooling. As a quick recap: MLOps is a set of practices such as versioning, deploying, and monitoring, aimed at keeping models in production reliably and efficiently and lowering the cognitive load across the full data science life cycle. Vetiver can help you with this in both R and Python. If you're interested in learning more, you can check out the documentation at vetiver.rstudio.com; you can click the link in the slides. There is a recent screencast by my colleague Julia Silge, where she deploys a model with Docker. There are also end-to-end demos from our solutions engineering teams in both R and Python; the links go to the GitHub repos, so you can download them, run them yourselves, and learn what this space looks like. You can also revisit these slides at this link. I hope you all learned something and can walk away feeling like you have the vocabulary and knowledge to start these conversations about MLOps. Thank you.

Q&A

Thank you so much, Isabel. Great way to start Crunch day two. And Slido has been blowing up with questions for you. Let's take the most upvoted one: how do you see Vetiver versus MLflow?

I love this question. MLflow is another machine learning operations framework, and its focus is a little bit different. MLflow has invested a lot in experiment tracking, and they do such a great job; it's super fun to play around with. When you're training your model, it will log every single hyperparameter. The difference is that MLflow has put a lot of emphasis on that part of the data science life cycle, whereas Vetiver is interested in helping with the deployment side: it's only a few lines of code to get from your model to an API endpoint. So Vetiver is focused more on deploying, where MLflow has really put its emphasis on experiment tracking.

Great, thank you. Our next question: what do you think about versioning Jupyter Notebooks? How do you do it right when you're developing a new model interactively but still want the history?

So Vetiver itself is not really interested in versioning the whole Jupyter Notebook, and I'm not sure why you'd want to version the whole thing rather than pieces of it. You want to version the data if you're training with data, and you want to version your model, but the notebook itself is just a means to an end. So to do this right, take out the pieces that need to be versioned, such as data or models, and put them somewhere that's not inside a Jupyter Notebook. That's my answer; feel free to reach out if that's not the question you were asking.

All right. Our next one. It says, no question. This is just an awesome talk. Thank you.

And now the next one. What's the next trick for Toast? So I'll nerd out for a moment here. Toast knows a lot of tricks. He's actually certified as a trick dog now. Right now we're working on having him clean up after himself. So I'll get him, and he can go pick up toys. And now I want to have him go pick them up, bring them somewhere else, and drop them