Resources

MLOps with vetiver in Python and R | Led by Julia Silge & Isabel Zimmerman

Many data scientists understand what goes into training a machine learning model, but creating a strategy to deploy and maintain that model can be daunting. In this meetup, learn what MLOps is, what principles can be used to create a practical MLOps strategy, and what kinds of tasks and components are involved. See how to get started with vetiver, a framework for MLOps tasks in R and Python that provides fluent tooling to version, deploy, and monitor your models.

Blog post with Q&A: https://www.rstudio.com/blog/vetiver-answering-your-questions/

For folks interested in seeing what data artifacts look like on Connect, we have these for R:
⬢ Versioned model object: https://colorado.rstudio.com/rsc/seattle-housing-pin/
⬢ Deployed API: https://colorado.rstudio.com/rsc/seattle-housing/
⬢ Monitoring dashboard: https://colorado.rstudio.com/rsc/seattle-housing-dashboard/
⬢ Create a custom yardstick metric: https://juliasilge.com/blog/nyc-airbnb/
⬢ Endpoint used in the demo: https://colorado.rstudio.com/rsc/scooby

Our team's reading list (mentioned in the meetup)

Books:
⬢ Designing Machine Learning Systems by Chip Huyen: https://www.oreilly.com/library/view/designing-machine-learning/9781098107956/

Articles:
⬢ “Machine Learning Operations (MLOps): Overview, Definition, and Architecture” by Kreuzberger et al.: https://arxiv.org/abs/2205.02302
⬢ “From Concept Drift to Model Degradation: An Overview on Performance-Aware Drift Detectors” by Bayram et al.: https://arxiv.org/abs/2203.11070
⬢ “Towards Observability for Production Machine Learning Pipelines” by Shankar et al.: https://arxiv.org/pdf/2108.13557.pdf
⬢ “The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction” by Breck et al.: https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/aad9f93b86b7addfea4c419b9100c6cdd26cacea.pdf

Web content:
⬢ _How ML Breaks: A Decade of Outages for One Large ML Pipeline_ by Papasian and Underwood: https://www.youtube.com/watch?v=hBMHohkRgAA
⬢ _MLOps Principles_ by INNOQ: https://ml-ops.org/content/mlops-principles
⬢ _Google’s Practitioners Guide to MLOps_ by Salama et al.: https://services.google.com/fh/files/misc/practitioners_guide_to_mlops_whitepaper.pdf
⬢ _Gently Down the Stream_ by Mitch Seymour: https://www.gentlydownthe.stream/

Speaker bios:

Julia Silge is a software engineer at RStudio focusing on open source MLOps tools, as well as an author and international keynote speaker. Julia loves making beautiful charts, Jane Austen, and her two cats.

Isabel Zimmerman is also a software engineer on the open source team at RStudio, where she works on building MLOps frameworks. When she's not geeking out over new data science techniques, she can be found hanging out with her dog or watching Marvel movies.

Sep 20, 2022
1h 23min

image: thumbnail.jpg

Transcript#

This transcript was generated automatically and may contain errors.

Hi friends, so nice to see you back for today's meetup. If we haven't had a chance to meet yet and this is your first meetup, I'm Rachel. I'm calling in from Boston today, actually at the RStudio office in Seaport. It's so nice to meet you and to see so many familiar names joining as well. Feel free to introduce yourselves through the chat window and say hello as well. I love getting to see where people are calling in from all over the world and also to be able to see people sharing helpful resources with each other over there in the chat as well.

I host these meetups every Tuesday at noon Eastern time, and they are all recorded and shared to the RStudio YouTube channel as well, if you want to go check out past sessions. The recording of this one will be up immediately at the same exact YouTube live link. This is a friendly meetup environment for teams to share different use cases with each other and lessons learned.

Together we're all dedicated to making this an inclusive and open environment for everyone, no matter your experience, industry, or background. You can add the whole calendar or individual events to your own calendar with a link that I will show on the screen here right now.

For a heads up about next week, we'll be back here, same place, for a talk on beautiful reports and presentations with Quarto, and the following week a talk on model monitoring and alerting at scale with RStudio Connect. Today we are so lucky to be joined by both Julia Silge and Isabel Zimmerman presenting on MLOps with Vetiver in Python and R. During the event you can ask questions on YouTube Live or LinkedIn, wherever you're watching from, and you can also ask anonymous questions through the short link that I will put up here on the screen.

And so we will try to answer as many questions as possible from there, but with that I would love to pull Julia and Isabel up here on our virtual stage with me.

Awesome. Well, hello everyone. We are so excited to see you all here for the MLOps with Vetiver in Python and R meetup today, and talking about beautiful Quarto presentations is next week. This is a Quarto presentation. I love plugging that. It's my favorite thing.

And so who are we? I am Isabel Zimmerman. I am an open source MLOps software engineer here at RStudio, and I'm joined here today with Julia Silge, who is also an open source software engineer writing great MLOps tools in R, and I write them in Python. But another important question is who are you? So who's joining here today? We have a Slido poll question for you all. What language does your team use for machine learning?

So Julia and I have worked really hard to make sure that Vetiver feels great for bilingual data science teams. So that's data science teams that are writing in both R and Python, but it also feels like a great native experience if you are just an R user or just a Python user. So if you're developing a model, you can operationalize that model, and in fact we believe that you likely should operationalize that model, which gives us a really big looming question of what is operationalization? What is MLOps?

And if you're here, you're probably on some sort of MLOps journey, whether you're just curious about what it is, or maybe digging into tools. And the MLOps landscape looks a little bit like this. There are a lot of pieces going around. Is it infrastructure? Is it analytics? Is it open source? Are we looking at APIs? And it makes us feel sometimes a little bit like this.

But there is kind of a one-liner. Everyone defines MLOps a little differently, but the way that we see MLOps is as a set of practices to deploy and maintain machine learning models in production reliably and efficiently. All of those tools are helping you do this in some way. And MLOps is not something you can pass off to your IT department. We believe that data scientists should be owning part of this process.

So there are great tools out there to help you through that data science life cycle. We have another quick Slido poll on which packages you're using the most for machine learning right now. We know there are great ways to collect data, to understand and clean data. On the R side, that's tools like the tidyverse or data.table. On the Python side, it's maybe pandas or NumPy. Once your data is cleaned and understood, you are ready to train and evaluate a model. Once again, on the R side, you're looking at tools like caret or tidymodels, or in Python, things like PyTorch or scikit-learn.

But then you enter kind of the wild west of MLOps. And it's sometimes hard to find open source MLOps tools that fit your needs. Vetiver is well scoped to helping data scientists version, deploy, and monitor their models. And we'll take a closer look at what we mean when we use those three verbs.

What MLOps means: version, deploy, monitor

So, MLOps is versioning. This is managing change in models. It's going to help you avoid the pitfalls of model-final, and then model-final-1, and then model-final-it's-actually-this-one. It helps you scale your models, but it also helps you organize and share your models by giving them context, more robust versioning, and a little bit of metadata, so you can manage, say, your production versus your staging environments and make sure your models are well organized.
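To make the versioning idea concrete, here is a minimal, standard-library-only Python sketch of what a pins-style versioned model board does under the hood: every write gets an automatic version identifier plus a little metadata sitting next to the model artifact. The function names and on-disk layout are hypothetical illustrations, not the actual pins or vetiver API.

```python
import json
import pickle
import time
from pathlib import Path


def pin_write(board: Path, name: str, model, metadata: dict) -> str:
    """Save a model under an automatically generated version, with metadata."""
    version = time.strftime("%Y%m%dT%H%M%SZ", time.gmtime())
    dest = board / name / version
    dest.mkdir(parents=True, exist_ok=True)
    # Store the model artifact and its context side by side.
    (dest / "model.pickle").write_bytes(pickle.dumps(model))
    (dest / "meta.json").write_text(json.dumps({"name": name, **metadata}))
    return version


def pin_versions(board: Path, name: str) -> list[str]:
    """List all stored versions of a model, oldest first."""
    return sorted(p.name for p in (board / name).iterdir())
```

A real board could point at a local folder, S3, Azure, or Connect; the pattern of "write once, get a version back, list versions later" is the part that replaces `model_final_v2.pickle`.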

MLOps is deploying. And we have a loaded Slido question for you: have you ever deployed a model? Deploying can mean different things to different people, but it might be something you've done before without even realizing it. Whenever you take a model out of the environment it was trained in and put it in some other computational environment, you can say that is deployment. So, things like putting a model in a Shiny app count as deploying.

But we really believe that stable models are models served as REST APIs. This is an individual computational environment that just hosts your model at an API endpoint, independent of any larger application. It makes it faster, easier, and a little more robust to host the model by itself rather than inside an application.
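As a sketch of what "a computational environment that just hosts your model at an endpoint" means, here is a tiny prediction API using only the Python standard library. A real vetiver API is built on FastAPI in Python and plumber in R; the stand-in model and its field names below are invented purely for illustration.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict(row: dict) -> str:
    """Stand-in model: call the monster fake in later, well-rated episodes."""
    return "fake" if row["year_aired"] >= 2000 and row["imdb"] >= 7 else "real"


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/predict":
            self.send_error(404)
            return
        # Read the JSON request body: a list of feature dicts.
        body = self.rfile.read(int(self.headers["Content-Length"]))
        rows = json.loads(body)
        # Respond with one prediction per row, as JSON.
        out = json.dumps([predict(r) for r in rows]).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(out)


def serve(port: int = 8080) -> None:
    """Host the model by itself, independent of any larger application."""
    HTTPServer(("", port), PredictHandler).serve_forever()
```

The point is the shape: the model lives alone behind one `/predict` route, so a Shiny app, a dashboard, or another service can all call the same endpoint without bundling the model inside themselves.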

And MLOps is monitoring. This is tracking model performance. If you are not monitoring your model, you might be blind to whether your model is performing poorly, because models can fail silently. A model can be at 0% accuracy and still be spitting out predictions left and right. So, you really have to be tracking model performance so you can tell when things are going wrong.

Vetiver demo in Python

So, we can see our data science life cycle over here. Once again, we know data scientists have effective tools for this half of the life cycle, but maybe need some more support on the other half. So, we're going to start by building a model. We're going to load in some environment variables, and we're going to make this model predict which Scooby-Doo episodes have a real monster and which ones have a fake monster. We're going to load in some data in the Arrow format, do our preprocessing, and fit our model. And we're going to put these two things together in a pipeline and deploy them as a whole.

And the way we're going to deploy these models is using a deployable model object called a Vetiver model. Notice this is the first place that Vetiver comes into the life cycle, so you get to use those same tools you're very familiar with; this is just an extension of your current workflow. This Vetiver model will take in the pipeline. I'll give my model a name, and then I'll feed it a little bit of my training data to prime the API endpoint. That way, when new data comes in, my API endpoint can check that it looks the way the model expects it to.
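The "prime the endpoint with training data" idea can be sketched in plain Python: record a prototype of the expected fields and their types from one training row, then check each incoming request against it before predicting. This is only an illustration of the concept, not vetiver's actual validation code, and the field names are hypothetical.

```python
def make_prototype(row: dict) -> dict:
    """Record each input field's expected type from one row of training data."""
    return {field: type(value) for field, value in row.items()}


def check_input(prototype: dict, row: dict) -> None:
    """Raise if a prediction request is missing fields or has wrong types.

    Note this is strict: an int where a float is expected is rejected.
    """
    missing = prototype.keys() - row.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    for field, expected in prototype.items():
        if not isinstance(row[field], expected):
            raise TypeError(f"{field}: expected {expected.__name__}")
```

Failing fast at the endpoint like this turns "the model silently predicted garbage" into "the API returned a clear 4xx error about a malformed request".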

So, next, we're going to version our model. Vetiver actually has a secret weapon called pins. Pins helps you create boards that you can write data, and also models, to, and it will automatically version your model in a really robust way. We can see our version right here. We've pinned this to a board on RStudio Connect, but it could just as easily be an S3 board, an Azure board, or even just a local folder.

So, when we write our model, we get a little prompt that says model cards provide a framework for responsible reporting. Model cards are something there has been a lot of research on; they give a really holistic view of a deployed model. Inside of Vetiver, there is a template that'll get you about 80% of the way to a model card. You can see inside this card, you get to document things like metrics and evaluation data, but also things like ethical considerations or caveats and recommendations. So, you can really capture all the knowledge and information you've generated while making your model.

So, we can generate our model card template here. And now, we will deploy our model. It is versioned. It is ready to be deployed. But before it's deployed somewhere else, we need to make sure it's working locally. So, here we can use VetiverAPI to set up our API and click run.

We will see in the browser that all of this is automatically generated by Vetiver to give you visual documentation at your API endpoint. We can see the server is running locally. We get a little bit of information about what kind of model it is, the name of the model, a health endpoint, and then we get to the predict endpoint. So, we can try it here to make sure that this API is working as we expect it to. Let's not look at year zero, but maybe the year 2000, and this Scooby-Doo episode has an IMDb rating of eight. And it's probably a fake monster in this episode.

But of course, this is not our end goal. Our end goal is not to have our model running on a local machine, but somewhere else. So, this is looking at deploying this to RStudio Connect. So, Vetiver has a kind of one-liner here to deploy to Connect. We're feeding it in a Connect server, the board that we've specified earlier, and the pin name that we just created. We can also specify a version if we want this API linked to a single version to kind of quickly iterate on that version that exists.

So, I've already deployed this model. Let's just make a prediction from this that's living up in our demo server cloud. We'll create a Vetiver endpoint. And let's generate some new episodes. And we'll predict with this model by feeding in the new episodes and the Connect endpoint.

So, this can really feel like a model that is inside your computational environment, even though I'm connecting to something that's out on a cloud somewhere else, on a different server. And this is doing the DataFrame-to-JSON-to-API-server-to-JSON-back-to-DataFrame round trip for you, so you don't have to do any of the JSON gymnastics. And we get these predictions back in DataFrame format. It feels really natural, even though this is not a model that's even on this machine.
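That round trip can be sketched in a few lines of standard-library Python. Here the `post` argument stands in for the actual HTTP call, so the serialization shape is visible and testable; `predict_endpoint` and the column names are hypothetical, not the vetiver client API.

```python
import json


def predict_endpoint(rows: list[dict], post) -> list[dict]:
    """Serialize rows to JSON, send them to an API, and parse predictions back.

    `post` is any callable that takes a JSON string and returns the
    endpoint's JSON reply (for example, an HTTP POST to /predict).
    """
    payload = json.dumps(rows)      # DataFrame-like records -> JSON
    reply = post(payload)           # -> API server -> JSON back
    preds = json.loads(reply)
    # Attach each prediction to its input row, restoring a tabular shape.
    return [dict(row, predicted=p) for row, p in zip(rows, preds)]
```

From the caller's point of view this behaves like predicting with a model in memory, which is exactly the "no JSON gymnastics" experience being described.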

Once this model is deployed, you're going to be continuing to update some metrics, and you're going to be collecting the dates you're making these predictions. So, we're going to read in some validation data that has the date the model was making these predictions, and we're going to feed this into Vetiver's compute metrics function. We'll give it the data, the date variable, and a period, and it'll do a kind of time series computation of the metrics so you can track them over time. This is for the F1 score, but it could just as easily be accuracy or any metric set that is available.
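The time-sliced metric computation can be sketched in plain Python: group labeled predictions by a date-derived key and score each group. This only illustrates the idea behind the compute-metrics step, using accuracy as the example metric; the record layout and function name are made up for the sketch.

```python
from collections import defaultdict


def metrics_by_period(records: list[dict], period_key: str) -> dict:
    """Group labeled predictions by a time period and score each group.

    Each record needs the period key (e.g. a year), the true label
    under "truth", and the model's prediction under "estimate".
    """
    groups = defaultdict(list)
    for r in records:
        groups[r[period_key]].append(r["truth"] == r["estimate"])
    return {
        period: {"n": len(hits), "accuracy": sum(hits) / len(hits)}
        for period, hits in sorted(groups.items())
    }
```

Pinning a table like this after each monitoring run is what lets you plot performance over time and catch the silent 0%-accuracy failure mode described earlier.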

Then we will take these Scooby metrics we have just created and use the built-in plot function. It looks a little small since we only have these three data points, or three years of data, and we're aggregating year over year. But you can make this as extensible and as complex as you want, adding, you know, loads of data and multiple different metrics to track.

And I think with that, we have gone through the entire data science life cycle in just about five, ten minutes. And this is what Vetiver looks like in Python. So, I'm going to pass this over to Julia to show you what it looks like in R.

Vetiver demo in R

Fantastic. Thank you so much for that, Isabel. That's such a great introduction to what we're talking about. And just to put this up here again, just to reiterate where Vetiver sits in what we'll call the MLOps cycle, and what tasks it does. An underlying assumption, or driving design decision, for Vetiver is that for these things on the right-hand side, you have great open-source tools that you like to use. It is over here on the left-hand side that Vetiver comes in, connecting what you've already done through to the MLOps tasks: versioning your model, deploying your model, and then monitoring your model.

So, let's see what that looks like in R. I want to highlight the name a little bit because I think it's a helpful metaphor. So, Vetiver, if you're like really into perfume or scented candles, you may have seen or heard the word Vetiver as an ingredient because it's a stabilizing ingredient in perfumery. It takes the more volatile fragrances, stabilizes them so that your perfume can last a long time. So, in this analogy, your models are these like more volatile, really valuable fragrances, and what Vetiver does is stabilize it so that you can deploy with confidence.

So, let's look at the same data on Scooby-Doo episodes. This Scooby-Doo dataset is originally from Tidy Tuesday, but we transformed it a little bit and saved it as an Arrow file so that on both the Python and the R side, we can read it in the same way. It has the thing we're going to predict, whether the monster is fake or real in a certain episode, the year that it aired, what the IMDb rating was, and then we've got the title in there as well so we can keep track of it.

Now, let's move on to training and evaluating a model. Here, I'm going to use a support vector machine model, so let's set up that support vector machine specification. And now I'm going to make a feature engineering recipe, because support vector machines, it turns out, require some data preprocessing. Then I'm going to combine the feature preprocessing together with the model into a workflow and fit it to the data that I have.

What I want you to notice here is that I'm using good statistical practice, treating my preprocessing together with my modeling, and I can take that whole thing and move on to my MLOps tasks. So, again, as with everything I've shown you so far, we want you to keep using the tools that you know and love, that you think are a good fit for your particular use case.

Now, what I'm going to do is I'm going to start using Vetiver. So, just like on the Python side, you know, we read data in, maybe did some EDA, trained a model, and now it's time for us to start going into the MLOps tasks. So, the first thing that we'll do here is that we'll use Vetiver to create a deployable model object.

I want you to notice what's printed out here, because it answers the question: why do I need to make a deployable model object? It's because at the time of training your model, you have a lot of information about it. You know, for example, whether it was a classification model or a regression model. You know what kind of computational engine you used to train it. You know how many features there were and what the names and data types were of those features that went into your model.

And the reason why you want to use Vetiver to create this kind of deployable model object is that it is designed to collect and store, at training time, all the information you need to make predictions in a new computational environment. Because you have all that information, let's store it in a nicely organized package bundle that you can then take somewhere else. Just like Isabel was saying, a good way to think about whether you have deployed a model is: have you taken a model that was in one computational environment and put it somewhere else so that it can integrate with your IT infrastructure?

All right. So, we have our ready-to-go, deployable model object. And now I'm going to version slash publish slash share my model. I'm going to use pins to do this. You can see from my screen here that I've connected to our demo Connect server, and I am going to write my Vetiver model there.

Pins has support for many different kinds of infrastructure upon which to version, share, and publish models. Just like on the Python side, on the R side you can write to Connect. You can write to AWS S3 buckets. You can write to Azure blob storage. You can write to a network drive if that's the way your particular organization works. The important thing is that the pinning works the same.

And it gives you a nice bit of usability: I need to get a version, and I need to store my model in a versioned way somewhere that is the appropriate place for my particular infrastructure.

So, I want to highlight here, again, that I got the little once-per-session prompt reminding me to create a model card. So, I am going to show you in R how you might go about doing that. If I go to make a new R Markdown file, I've got a couple of things here that are Vetiver templates. One of them is this model card, and what this does, just like Isabel said, is get you about 80% of the way there to documenting your model with a model card. So, if you haven't heard or read about model cards, I encourage you to look at this paper.

And what this does is it goes through, for example, things that can be automated, like, hey, we know what kind of model this is. We know what version we're using. We can automate those parts. And then it provides you an outline to the parts that take human domain knowledge to be able to answer. So, like, what are the uses of this model? Why did you choose the metric that you did? So, it allows you to walk through and do a good job of documenting your model in an appropriate and specific way.
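The split between automatable fields and human-judgment sections can be sketched as a small function: it fills in what is known from model metadata and leaves prompts for the rest. The section names and the `model_card` function below are illustrative, not the exact vetiver template.

```python
def model_card(meta: dict) -> str:
    """Render a model-card skeleton: automated fields filled in,
    human-judgment sections left as prompts to answer."""
    return "\n".join([
        f"# Model card: {meta['name']}",
        f"- Model type: {meta['model_type']}",       # automatable
        f"- Version: {meta['version']}",             # automatable
        "## Intended use",
        "<!-- Who should use this model, and for what? -->",
        "## Metrics",
        "<!-- Why were these metrics chosen? -->",
        "## Ethical considerations",
        "<!-- Caveats and recommendations. -->",
    ])
```

The value of the template is exactly this structure: the boring facts are never stale because they come from the model object, while the prompts make it hard to forget the parts only a human can write.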

Great. Okay. So, we've versioned our model, and we have thought about how we're going to document our model. Now it's time for me to deploy the model as a REST API. Here I am going to run this locally: look, I can create a REST API that runs locally on my machine right here, in this R session. Notice that my R session is kind of busy right now because it's serving this API to me. It takes me three lines of code to locally debug the model and see what it is doing.

So, let's say it's the year 1990 and something has an IMDb rating of 8.3. I can literally interact with my model locally here and understand what it needs and what it looks like when I interact with it. Let's say I'll do batch prediction. And it tells me, okay, both of those are fake. Both of those are fake monsters.

As I'm interacting with this, it also shows me the curl call. So, think of this interactive documentation as a tool for you as the model developer to understand, debug, and figure out how your model is working, and also as a tool for collaborating with other people in your organization: data engineers, IT infrastructure people, SREs. If anyone needs to know how to interact with your model, you can show them directly here.

So, I am going to stop this little local API that was running in my R session. That's good for local debugging, but how do I get it from my computer into a new computational environment? Which is exactly what we want to do here. If you want to deploy your models to Connect, we have a one-liner function that takes where you have stored your model and which model it is, and with that one function, deploys it to Connect.

We also have support for deploying to other kinds of targets. Say you want to deploy to SageMaker, you want to deploy on a different AWS way of serving things. You want to deploy on Azure or some other target. Then what you're going to want to do is you're going to want to deploy using Docker. So, to make a Docker container for Vetiver, you are going to want to first generate a plumber file.

So, this is what the generated plumber files look like. We run this line right here and it makes a file that looks like this. This file is editable by you if you have some more complex use case, but the default serves most people's use cases pretty well.

So, this gets generated here, and then what I'll do is write my Dockerfile. Let me delete the old one so I can show you what it makes. Okay, it's gone, and let me come over here and write my Dockerfile. What's happening right now is it's looking at the deployable model object v that I have and asking: okay, what packages do I need to install into the Docker container to be able to make predictions?

Just to give you a little insight into what that means, the packages that you might need: if you tuned your model, you don't need the tuning infrastructure when you deploy. You only need the packages that are required to make a prediction. So that's what was happening: when it generated the Dockerfile for me, it updated this lock file and then it made the Dockerfile.

So, this is what I just made by running that function. It looked at exactly the version of R that I'm using right now. It's going to use the public package manager to install binaries, which are fast, which is good. It knows exactly which system dependencies it needs to install in there, and then it will use this renv.lock file that I made.

This is a standardized format for saying what packages you have, exactly what version of each you have, where you installed it from, and so forth. So it is going to use that lock file to install all those packages, copy the plumber file over, and then this last part is what happens when the Docker container is run.

So, if we go over here, I'm not going to run this right now because it takes a little while. This is probably the slowest part of anything we're showing you here. The reason building a Docker container takes a little while is that it's setting up a whole computational environment, like a whole little mini computer, and so it installs everything. The key to making your Docker containers faster to build is to be very careful about what you're installing in there, and only install what you need, to keep them from getting too bloated.

Docker is not a tool you use from R or Python, but rather a tool you use in the terminal. So, I'm going to go over here to the terminal. I pre-built this, so I'll use exactly this here. If you're looking at this bit, the reason I'm using it is that the machine I'm using right now is a Mac M1, which is a different kind of chip than an Intel chip, and most places you want to take a Docker container don't use ARM chips; they use Intel chips.

So, if you're going to build your Docker container locally, and you're on a newer Mac with one of the M1 or M2 chips, then you want to use this so that the Docker container you build can go somewhere else and be deployed on that infrastructure. It also means we can use the binaries built by the package manager, which are built for Intel chips. So I did this ahead of time; it took, I don't know, five minutes or so to build.

But now what I can do is run it. So, I'm going to run this here, and let's talk about a couple of things. I'm passing in an environment file, a file of environment variables. For whatever Docker container you have, for this to work successfully, it has to be able to authenticate to wherever the binary model object is kept. If you look at the plumber file, it's reading from wherever that model is; it needs to get to that model. So, a good way to do that is via environment variables.

I'm also connecting the port on the inside to the port on the outside, and then this is the name of the container that I'm running. So, let's start this off. First it's telling me, hey, you're actually on ARM, and this was built for a different architecture, so it's being emulated and may not be as fast as it could be otherwise, which is very true. And now it's running.

So, let me talk about what can happen next. I can visit this in a browser, but I also can interact with it from R. Right now the Docker container is running, so on the little local port here on my computer, I can interact with it directly from R. This is what it would look like if you were interacting with the model from Python or R when it is somewhere else, not here locally. So, I create something called a vetiver endpoint.

Notice that this right now is just pointing at my own little local computer, but this could be an endpoint anywhere. That's kind of the point of Docker: I build this thing, it is self-contained, and I can take it where it needs to go. This could be on a cloud platform; this could be on a server that I have somewhere. I've taken that Docker container and moved it somewhere that is useful to me.

So, now let's say I have some new episodes. Let's see what those look like. I'm just generating some pretend Scooby-Doo episodes, and I can predict from my endpoint whether those episodes are real or fake. I pass the data, and it looks like I'm just treating that endpoint like it's a model in memory: in either Python or R, this is set up so that you just call predict, a predict method or a predict function, the way that you're used to. But what this is actually doing is taking your data, which is a data frame, converting it to JSON, making an HTTP call, getting JSON back, and converting it back into the nice data frame format that you have.

Monitoring models over time

Okay, so far we've talked about versioning the model and deploying the model, and now let's talk about monitoring the model. Vetiver has functions to help you monitor the statistical properties of your model over time. To do that, you need new data that has labels. So, let's say we're able to look into the future, and there are some new Scooby-Doo episodes coming out starting in 2023, and I somehow happen to magically have that data now.

Notice, to monitor the statistical properties of how your model is doing, you have to have the answers, the labels. That often means you have some kind of feedback loop that may be very short, or maybe a little bit longer, but in many situations we do have that feedback loop, and it is what we need in order to monitor. That's sort of the gold standard for how we're going to tell if our model is performing the way we want it to.

So, what we're going to do is compute some metrics over a certain time aggregation. We have about 100 future Scooby-Doo episodes here, and we're going to aggregate at the year level. What I want to highlight is that models can use time in different ways. This particular model actually uses the year as one of the features, and that happens sometimes; sometimes your model uses some kind of date-time quantity as a feature.

Monitoring, though, always involves some kind of date-time quantity. Here we've got the year from a date_aired column. That quantity does not necessarily have to be a feature in the model, but it is the dimension along which we are monitoring. So monitoring always involves collecting, over time, new validation data (or monitoring data, you might call it) that has the labels, so that we can compute and track how our model is doing over time.

So, let's come over here. First, I'm going to show you what it looks like if I use augment. I can augment using a vetiver model, and what this does is predict and bind the columns together. We've got all the data we had before, plus a predicted class from this vetiver model that I have in memory. Once we have that — the variables we're interested in monitoring alongside the predictions — we can use the vetiver_compute_metrics() function.

It supports sliding windows — quite complex windows along which to aggregate — and it sets that up for us to compute the metrics we're interested in. So I say: this is the variable I'm monitoring over, this is the aggregation unit (I want to aggregate at the year level), and here are the variables to use for computing the metrics.

So here I have now computed my metrics. For each year in this data set — 2023, 2024, 2025 — I have n, how many episodes there are (roughly 20 to 40 per year), and then a metric. That's computed using the support vetiver has for the common use cases of monitoring the statistical performance of your model over time.
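The core idea — group labeled monitoring data by a time unit and compute a metric per group — can be sketched in plain Python. This is an illustration of what vetiver_compute_metrics() does conceptually (minus the sliding-window machinery and yardstick/scikit-learn metrics); the rows and values below are made up.

```python
from collections import defaultdict

# Toy monitoring data: each row has the year it aired, the true label,
# and the model's prediction. Values are invented for illustration.
monitoring = [
    {"year": 2023, "truth": "real", "pred": "real"},
    {"year": 2023, "truth": "fake", "pred": "real"},
    {"year": 2024, "truth": "real", "pred": "real"},
    {"year": 2024, "truth": "fake", "pred": "fake"},
]

def accuracy_by_period(rows, period="year"):
    """Aggregate accuracy over a time unit: n and accuracy per period."""
    hits, totals = defaultdict(int), defaultdict(int)
    for r in rows:
        totals[r[period]] += 1
        hits[r[period]] += (r["truth"] == r["pred"])
    return {p: {"n": totals[p], "accuracy": hits[p] / totals[p]} for p in totals}

print(accuracy_by_period(monitoring))
# {2023: {'n': 2, 'accuracy': 0.5}, 2024: {'n': 2, 'accuracy': 1.0}}
```

The output mirrors the table from the demo: one row per year, with a count and a metric value.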

On both the R and Python sides, we have support for plotting metrics and for pinning metrics — we recommend pins as a place to store the aggregated metrics over time. We provide a default way of making a plot that will probably work for you, but it is just data. Here I show making a custom visualization of how the metrics are doing over time.

A design consideration for vetiver is that we provide functions that account for the most common use cases, but all of these components are extensible for more custom needs and more advanced use cases. So if I say, I don't want that default plot approach, I want to make my own plots — I can do that, because this is all just data. And here, in this case, the accuracy looks pretty good, and kappa — which estimates how well the results you are getting agree with what you would expect, taking class imbalance into account — looks reasonable too.

In 2023 and 2024, the model agrees about as much as we'd expect; in 2025, it's agreeing a little better than we might have expected.

This is a monitoring example with just three monitoring points, but if you're clicking through, I encourage you to see the more realistic monitoring example. I'll show you what that looks like. I'm going to make a new R Markdown file again, go to the templates, and pick this dashboard.

What this dashboard is, again, is a template to get you 80 percent of the way there for more complex model monitoring. A vetiver dashboard builds on top of flexdashboard on the R side, and it lets you use parameters to say where my model is and what it is called. You can style it, too, because it builds on top of all the beautiful dashboarding that already exists in R.

This dashboard provides an opinionated set of advice for how to set up a monitoring dashboard that keeps track of the statistical performance of your model over time and plots the metrics you might expect, using functions we provide. And because it is generated code, you can extend it in whatever way is appropriate to your use case: add different plots, add tables, or even drop in the API's visual documentation, so you can share the dashboard as an artifact people can understand — not only the data practitioners in your organization, but also the software engineering and IT folks, who can see how the model works and how to interact with it.

Wrapping up

All right, I'm going to go back to the slides. In these demos, we started with Python, then went to R, and walked through these different kinds of MLOps tasks. I hope this showed you that MLOps doesn't have to be daunting: there are many tools out there, but at its core, MLOps is a set of practices you can learn and adopt to deploy and maintain the models that you build in production.

Remember, we said production means a different computational environment, where the model interacts with different parts of your infrastructure. The set of practices here is meant to set you up so that the models you have are reliable, and so that you can keep these systems up and running in a reliable and efficient way.

As a wrap-up: there are a lot of tools out there that do a lot of different kinds of things. What vetiver does is let you bring your existing modeling preferences — how you like to build models — and it provides support for versioning your model in a robust way, so that you can identify which versions of models are in use at any given time, do ad hoc analyses, and get back to previous versions without rolling back whole deployment environments.

Speaking of which, vetiver provides support for deploying your model, whether with push-button publishing or single functions to deploy to Connect, or with Docker to go to cloud platforms or servers of your own. And vetiver provides support for monitoring your models: understanding, recording, summarizing, and visualizing the statistical properties of your model over time.

And vetiver is built from the ground up to work for both R and Python models. That means it's a really good fit for bilingual data science teams, where some people use R, some use Python, and some use both. Even if you, as an individual, sometimes reach for R and sometimes reach for Python when building statistical machine learning models, vetiver is a good fit, because you use the same approach for both and you can support your team.

So we've got our last poll on Slido. If you're on Slido and you click over to the polls tab, you'll see this pop up: we're really interested in what kinds of models you often use. This overlaps somewhat with the question about software, but what we're interested in here is more like: do you use a lot of deep learning models? A lot of tree-based models? Are you currently spending a lot of time using GAMs or something like that? What kinds of models do you often use?

vetiver right now supports on the order of a dozen different kinds of models, including all of scikit-learn and all of tidymodels — a lot of the big ones — but we're certainly interested in what common use cases you have, or want to pursue more, so we can make sure we're supporting people's most common use cases.

So, yes, I said this already: vetiver works on RStudio's pro products like Connect, and it also works in a public or private cloud or on a server using Docker.

We're super excited to have shared this with you today, and if you want to get started, we encourage you to learn more. The first place we recommend you look is our documentation site at vetiver.rstudio.com. It's a bilingual documentation site where you can learn about R and Python together. I also encourage you to check out Isabel's talk from rstudio::conf this past summer, the one called Demystifying MLOps.

This one is fun: instead of Scooby-Doo, it's all about cookies, using cookies and baking as an illustration to understand what MLOps is. If you want to see more about how to use Docker, especially if you're newer to it, I did a recent screencast on how to deploy a model with Docker. And there are some really great end-to-end demos from the solutions engineering team at RStudio — one for R and one for Python. They use the same kind of data on bike shares, predicting how many bikes are used in different situations. What's great about those is that they're really end-to-end: they're fairly extensive and show you how to retrain a model on a schedule and how to integrate a model into an interactive Shiny app. They're a great way to see the whole picture.

And with that, I will stop sharing, and we'll see if we can use the rest of our time to answer a few questions. Awesome — thank you so much, Julia and Isabel. I see a lot of clapping and "great job" in the chat. Awesome work; a lot of people are really excited about diving into this. There are also a ton of questions on both Slido and YouTube, so if you don't mind me throwing a lot of these at you, I'd love to jump in. Let's do this.

Q&A

Awesome. I see Stefan — hi, Stefan — asked a question on YouTube: can you make vetiver refuse to make predictions that are outside your applicability domain?

This would be an example of an advanced use case like the ones we alluded to. The auto-generated plumber files have the predict endpoint, but you can write functions in there and add other endpoints that adjust or change what is served at an endpoint, or what may be served at another endpoint. Some examples of things we know people use this for: an explainability score, say — serve a prediction and serve an explainability score along with it. Another would be something like an applicability score. You can either serve the applicability score, or you can write a function to change what is served, so that you check it and then do or do not serve the prediction.

The way you'd go about that is realizing, okay, this is a more advanced use case: I'm going to need to go into that plumber file — or, on the Python side, into that app.py — and write a function that will effectively be a handler. But we have support for all of that, including testing that it works.
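As a sketch of the kind of handler described above, here is a hypothetical pre-prediction guard you could wire into a generated app.py. The training ranges, column names, and the `model.predict` interface are all assumptions for illustration — this is not code vetiver generates.

```python
# Hypothetical applicability domain: per-feature ranges seen during training.
TRAINING_RANGES = {"imdb": (1.0, 10.0), "year_aired": (1969, 2021)}

def guarded_predict(model, row):
    """Refuse to predict, with an informative error, for out-of-domain input."""
    for name, (lo, hi) in TRAINING_RANGES.items():
        if not (lo <= row[name] <= hi):
            # Refuse rather than silently extrapolate outside the domain.
            return {"error": f"{name}={row[name]} outside training range [{lo}, {hi}]"}
    return {"prediction": model.predict(row)}

class DummyModel:
    """Stand-in for a real fitted model."""
    def predict(self, row):
        return "real"

print(guarded_predict(DummyModel(), {"imdb": 7.5, "year_aired": 2000}))
print(guarded_predict(DummyModel(), {"imdb": 7.5, "year_aired": 2030}))
```

In a real deployment, the guard would run inside the endpoint handler, and you might serve the applicability score alongside the prediction instead of refusing outright.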

Awesome, thank you. I see this is one of the most upvoted questions on Slido: can you deploy models using Docker and cloud environments such as AWS? Can you pass custom Docker images that fit your organization's security architecture?

Isabel, do you want to answer that one? Yeah, I can grab that one. The short answer is yes. The Dockerfiles that vetiver generates are just code, so any customization you need to fit your organization's security — if they have custom Docker images — you can make in the Dockerfile itself. And you can take Dockerfiles to cloud environments such as AWS, GCP, or Azure.

Yeah, as an example, one thing that is currently in a PR on the R side of vetiver is specific support for deploying as an AWS Lambda function. The reason I bring this up is that it's a special kind of Docker container: you have to add a runtime and so on, and instead of starting with a base image like a Python base image, there's a special base image for Lambda functions. Because we take the approach of generating code, accounting for the common use case, and then letting it be extensible, we can even support some of these more specialized targets directly in the software.

Great, thank you. I'm going to go back and forth between YouTube and Slido here. One question from Tony was: this may be more of a plumber question, but how do you make predictions directly in the browser? Is there a particular syntax to pass arguments via the URL?

If this is a question about the interactive documentation that we showed — where we're directly interacting with the models in a browser, passing things in with a POST request and seeing what comes back — that's because of the OpenAPI specification. vetiver, on both the Python side and the R side, creates model-aware, model-appropriate OpenAPI specifications, so that when your API is created (under the hood, via FastAPI on the Python side or plumber on the R side), you get something that is aware of the domain you're working in. Making predictions directly in the browser, as we showed, works because of the customization of the OpenAPI specification that vetiver provides on both sides.
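To illustrate what "model-aware" means here, this is a stripped-down sketch of an OpenAPI document whose request schema is derived from the model's input prototype. The field names, the `/predict` path, and the structure are assumptions for illustration, not vetiver's exact output.

```python
# Prototype of one input row, as captured at deploy time (illustrative).
ptype = {"year_aired": "integer", "imdb": "number"}

# Build a minimal OpenAPI 3.0 document whose /predict request body schema
# is generated from the prototype, so the browser docs know the fields.
spec = {
    "openapi": "3.0.0",
    "paths": {
        "/predict": {
            "post": {
                "requestBody": {
                    "content": {
                        "application/json": {
                            "schema": {
                                "type": "array",
                                "items": {
                                    "type": "object",
                                    "properties": {
                                        name: {"type": t} for name, t in ptype.items()
                                    },
                                },
                            }
                        }
                    }
                }
            }
        }
    },
}

props = spec["paths"]["/predict"]["post"]["requestBody"]["content"][
    "application/json"]["schema"]["items"]["properties"]
print(sorted(props))  # ['imdb', 'year_aired']
```

Because the schema names each input field and its type, the interactive documentation can render a ready-to-edit example request for exactly your model.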

Thank you. Another question on Slido: is there an ebook for vetiver, similar to the one we have for tidymodels? Isabel, do you want to answer that?

There is not. There are a lot of great resources out there for MLOps in general — we just had an internal book club and went through Chip Huyen's Designing Machine Learning Systems. So there are great books out there for MLOps, but not anything vetiver-specific. I do think that the documentation we have is getting pretty robust, and it's a priority for us this fall and moving forward to produce even more documentation, especially for advanced use cases. Although it's probably not a whole book's worth of content, I'm really proud of what we have at vetiver.rstudio.com; I think it gets you pretty far down the path.

Let's see. Another question over on YouTube is from Marlene: when initially deploying your ML model, how does vetiver handle data that does not meet the specs of the training data?

I can start: if an error occurs, there is really nice error handling that specifically points you to exactly what is causing the error and what it should be instead, and that leans on the initial ptype. So if you're passing in a date that looks more like a string, vetiver will point to it and say, hey, this is a string, but from your ptype, we think it should be a date.

Yeah, that's exactly right. And on the R side — do you want to quickly let me share my screen again? Notice that if I predict with good data, I get results back. But what if I predict with bad data? Say I only have a year_aired and I'm missing the other field. It is going to try to predict and fail — oh, I should have set that up with debug = TRUE. Okay, so that's actually a really good thing to talk about briefly: it will fail to predict, and there's an option, when you set up your model, to allow debugging errors to come through. It's turned off by default, because it can expose information about your training data when it returns errors, and in some cases that's a privacy issue. But in some cases you're fine with it and you'd rather have the better errors. So you have a choice between "I have really tight privacy constraints around my training data, and I need to make sure none of that information leaks" and "I would like more informative errors that help me see what's going on."
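The flavor of ptype-style checking described above can be sketched as follows. The field names come from the demo, but the checking logic is an illustration of the idea, not vetiver's implementation.

```python
# Hypothetical prototype captured at deploy time: field name -> expected type.
PTYPE = {"year_aired": int, "imdb": float}

def check_input(row):
    """Compare an incoming row to the prototype; report exactly what is wrong."""
    problems = []
    for name, expected in PTYPE.items():
        if name not in row:
            problems.append(f"missing field '{name}' (expected {expected.__name__})")
        elif not isinstance(row[name], expected):
            problems.append(
                f"'{name}' is {type(row[name]).__name__}, "
                f"but the ptype says it should be {expected.__name__}"
            )
    return problems

print(check_input({"year_aired": 2023, "imdb": 8.1}))  # [] -- valid input
print(check_input({"year_aired": "2023"}))             # wrong type + missing field
```

The value of checking against a stored prototype is that the error message can name the exact field and the expected type, instead of a generic failure deep inside the model.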

Thank you. Another Slido question: plumber also has logging filters. Is there any work being done in vetiver to publish a default logger alongside newly deployed models, for usage monitoring?

This is something I think will be interesting for us to look into in more detail. We don't have a default logger currently; you can extend it the way you extend other plumber functionality. Part of this is because we've been focused on statistical monitoring. I'm assuming this means the kind of logging filters that log: here is what the call was, here is the result, somebody made a call with this POST data and here is what we sent back. Right now, you can add those to any plumber file you have, but we don't have a default one that's model-aware. That might be good for us to look into.

I see James asked a question over on Slido: JSON and APIs can be foreign to those new to model deployment or those with no web development experience. Do you have any advice or resources to share?

I think there are some really good articles out there. At its core, JSON is just a data structure. People get quite scared of it, but it kind of looks like a data frame if you find a simple JSON file, and it just expands from there. It also helps to contextualize an API as a gateway: it hangs out independently of everything else, but different applications talk to it. We're talking to our API that has a model, and we're getting back our prediction; on the other side, maybe an application has a standardized way to talk to that same API to get predictions or give them to customers in a different scenario. There are a lot of great resources online as well that we can share links to.
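For readers new to JSON, the "it kind of looks like a data frame" point can be shown in a few lines: a JSON array of objects is structurally the same information as a small table. The titles and scores below are made up.

```python
import json

# Each JSON object is one row; each key is one column -- the same
# information as a two-row data frame.
body = '[{"title": "A", "imdb": 8.1}, {"title": "B", "imdb": 5.2}]'
rows = json.loads(body)

# Pivot the row-oriented records into column-oriented lists.
columns = {key: [r[key] for r in rows] for key in rows[0]}
print(columns)  # {'title': ['A', 'B'], 'imdb': [8.1, 5.2]}
```

Row-oriented JSON records like these are what the model API sends and receives; converting between that and a column-oriented data frame is routine plumbing that libraries handle for you.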

Okay, a question on YouTube: hello, does it integrate somehow with MLflow?

If you're thinking about whether to choose MLflow or vetiver, one main difference is that vetiver is more flexible about what you bring to the deployment process, while MLflow — because it has more support for experiment tracking — wants you to enter its ecosystem earlier. However, you can use these pieces somewhat composably: you can take pieces from different ecosystems and get them to work together in the way that's most appropriate for you.

Isabel, do you want to add anything to that? Yeah. I think my favorite part about this question is that open source is composable. You can take the experiment tracking that MLflow has invested a lot of time and effort into, and pair it with vetiver's deployment if you prefer that — maybe it's easier to make predictions from an endpoint, or faster to go from model to API. It pieces together quite nicely, and you can choose the pieces that work best for you and your organization and use them in conjunction with each other.

Something we're really excited about with vetiver is the model-to-API time: it's two lines of code to get a really basic API running. That's something we've spent a lot of time on to make sure it's a great experience, and it's a bit different from MLflow.
