Resources

R-Ladies Rome (English) - Extending the data science workflow: {vetiver} and {pins} - Isabel Zimmerman

In this video, Isabel Zimmerman goes through the fundamental aspects of machine learning operations (MLOps), bridging the gap between data analysis and model deployment. While data practitioners excel at data analysis and model development, there is often a significant gap in understanding tasks beyond the conventional data science workflow. You'll explore crucial MLOps concepts, such as deploying models as API endpoints and monitoring model decay, while leveraging the powerful capabilities of the vetiver and pins packages.

Material:
- Presentation: https://www.isabelizimm.me/talk-extending-ds-workflow-rladies/
- RStudioConf2022 talk: https://www.isabelizimm.me/talks/rstudioconf2022/
- Vetiver website: https://vetiver.rstudio.com/

Timestamps:
- 0:00 Welcome & R-Ladies Rome Chapter Introduction
- 0:04:45 Slido Polls
- 0:10:15 Talk Intro
- 0:10:56 Isabel's Talk
- 0:47:53 Hands-on session
- 1:02:20 Q&A

Have a look at our website for more insights about our events: https://rladiesrome.quarto.pub/website/talks/

image: thumbnail.jpg

Transcript

This transcript was generated automatically and may contain errors.

Welcome everybody to this R-Ladies Rome event. We are thrilled to have Isabel Zimmerman tonight with us to talk to us about extending the data science workflow with vetiver and pins.

A bit of a disclaimer first: this talk is recorded and will be posted on our YouTube channel, so feel free to turn off your cameras if you do not want to be recorded. And a nice thing to keep in mind is that we prioritize creating a safe and inclusive space free from any form of harassment, fostering a respectful environment for everyone to learn and connect. So we would like you to keep in mind that all attendees are expected to adhere to our code of conduct. You can find it on our website, rladies.org.

So what can you expect from this talk tonight? Basically, whether you're a beginner or an experienced data practitioner, tonight you can expect to learn about actionable strategies and tips for enhancing your data science capabilities within MLOps.

And yes, basically, we would like to welcome everyone. This event is hosted by R-Ladies Rome. I am Silvana Costa, one of the chapter's organizers. I'm from Uruguay. I work as a data scientist in tech. And I have a background as a PhD in econometrics. And together with Federica Gazzeloni and Rafael Arribeiro-Lucas, we are thrilled to have you all here tonight.

Hello, everyone. My name is Federica Gazzeloni. I am one of the chapter organizers. I'm Italian. I'm from Rome. Background is statistics and archaeological science. So excited to be part of this team.

Hello, everyone. My name is Rafael. I'm from Brazil. I'm a researcher in cardiovascular disease. And I'm developing as a data scientist.

Well, and of course, I mean, we are thrilled to have Isabel Zimmerman talking to us today.

And well, I mean, maybe a little bit about what is R-Ladies. R-Ladies is a global organization with the mission of promoting the R language and empowering women at all user levels by building a collaborative global network. It is a gender-diversity-friendly community founded in 2012 by Gabriela de Queiroz in San Francisco. R-Ladies is now a worldwide organization with more than 200 chapters in more than 60 countries, more than 4,000 events, and more than 90,000 members globally.

And R-Ladies Rome is a local chapter of R-Ladies Global. And our monthly meetings provide a platform to discuss current trends and hot topics in R. And we encourage active participation, of course, and engagement from all of our attendees. So we would like to hear about your suggestions and comments.

Some of our past events have been, in January, an introduction to Quarto with Torin White. In February, we had two events: Building a Chatbot with Shiny and R with James Wade, and the last one we had was Debugging in R with Shannon Pileggi. So today, in March, we have Isabel with us. Our next upcoming event is on the 16th of April, and this will be on Geospatial Data Science and Public Health Surveillance with Paula Moraga. Then in May, we will have an evening with Hadley Wickham. And for June, we are expecting a topic in our open slot with Giannina Bellini.

Introduction and welcome

Thank you, Federica and Silvana and Rafael for having me here.

So thanks, everyone, for joining me on this Friday afternoon or evening or morning, depending on where you're at in the world. I hope you're as excited to learn about vetiver and pins as I am to teach it today.

And vetiver and pins and this whole idea of extending the data science workflow is something that's super near and dear to my heart. While you guys are listening: there is a small code-along at the end. It's only going to be like 30 lines of R code, and a lot of those are empty lines. So if you're interested in joining this code-along at the end, these are the packages that you should have installed.

I do have a link to the slides. So if you want to open up your RStudio IDE, or wherever you are writing R code, and have these packages installed, you should be ready to go.

About Isabel and the motivation for MLOps

All right. So who am I? This is me. And I actually recently graduated with my master's degree. I did the whole full-time work and full-time school thing while I was doing my master's degree. As I was working at Posit, I was writing this vetiver framework and finishing my degree. And part of what made me so excited about MLOps was because of what I had learned in my degree.

And when I was learning about data science, my degree was in data science, half computer science, half data science, one of those weird in-between things. But I learned about a data science workflow that looked like this. You start with some data. You collect it, whether that's from an API or from CSVs or people are just handing you data. You get this data. And then you get to use some tools to understand the data. And that's things like the tidyverse or data table.

And so you have this data. You've cleaned it. You've understood it. And then in this data science workflow, you train and evaluate some sort of model. That's using things like caret or tidymodels in R, or Keras, scikit-learn, or PyTorch in Python.

And as I was learning all of these different data science things, I was learning about best practices in data science. And I will give you guys a warning. There is Python code on my slides. This is not to scare anyone off. I personally am a Python person primarily. The end will be in R. But I just want to show you guys that these concepts are the same in both languages. And these frameworks actually exist in both languages.

So some best practices in data science are things like setting your seed for reproducibility. If you've worked in different data science worlds, this is the first thing you learn to do to make sure when you load data, it's always the same. Just good things for this reproducible data science workflow. And you also learn about best practices like splitting your data into test and training data sets. So you're not giving your model the answers while you're training it.
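As a rough sketch, those two practices might look like this in R, using the rsample functions from tidymodels; `my_data` here is a placeholder for whatever data frame you're modeling:

```r
library(tidymodels)  # rsample provides initial_split()

# Reproducibility: fix the seed before any random operation
set.seed(123)

# Hold out a test set so the model never sees the answers while training
split <- initial_split(my_data, prop = 0.75)
train_data <- training(split)
test_data  <- testing(split)
```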

But in my first job as a data scientist, before I was building software, the problems I encountered were different. I knew all of these best practices, but I realized the issues I was running into were things like: how do you share data with your teammates? Are you emailing everybody CSVs or Excel files? How are you sharing models with each other? Is it on Git? Is it, again, emailing things around? And how do you get people who don't code as much to be able to engage with the things that you have created?

Oftentimes, at the end, you have a Shiny application that your model is running in. And how do you put that model into a Shiny application? There's a lot of different problems when you're putting this whole cycle of data science together that are just kind of beyond the best practices that I had learned in school.

What is MLOps?

So a lot of these practices maybe weren't enough. And I also learned that if you develop models, you can operationalize them. And so this is making sure that your models are in some sort of larger ecosystem. Maybe that's on Git. Maybe that's shared with your teammates in some way, making sure that the model isn't just on your computer.

So I learned that you want to bring your model outside your computer. And you want to learn kind of the best practices of how to do that as well. And so those best practices for machine learning operations to operationalize your model are called MLOps.

And so there's a lot of hype around this word that I find online. It's like, I don't even know what MLOps is. You can read all of these different startups. And you read their documentation. And you don't really know what they're doing at the end of it. It's kind of stressful to get started. But here is a kind of one-liner. MLOps is a set of practices to deploy and maintain machine learning models in production reliably and efficiently.


So where we had those best practices like setting a seed or splitting your training and testing data to build a model, MLOps is a different set of practices: a set of practices to make sure that the model, when it leaves your computer, can still be reproducible and shared in a safe way with others.

In my first job, when I was starting to learn about MLOps and learn how to deploy models, I was struggling because there was a set of tools that were really based on Kubernetes. And it felt like I almost had to be like a cloud expert and then also like a DevOps expert and maybe kind of a systems architect and also a data scientist. And that felt like way too much.

These practices are hard. It gets models off your computer into some sort of larger system. And sometimes tools don't necessarily feel ergonomic for data scientists. It felt like these tools were not really made for people with the skill set that I had coming from a purely data science background.

Introducing vetiver

And that is kind of how vetiver was created. I ended up changing career paths from doing data science to building data science tools and making these MLOps tools to feel like an extension of what people are already doing.

So I do work for Posit, the company that was formerly known as RStudio. Vetiver is funded by Posit because I'm an affiliate for them. So it's no surprise that while I'm showing you Python code and while I primarily write in Python, there is also an R version of vetiver as well. The code actually looks almost identical. And that was on purpose, especially realizing that people sometimes might have to work in different languages or collaborate with people who are using different languages. We wanted to make tools that feel super easy to go back and forth so it doesn't feel like you have to start from zero if you're moving between languages.

So part of the idea of vetiver is that there's this life cycle where you collect and understand and evaluate your model. But it wants to fill in the gap of how to get this model out into the real world, making predictions from real world data in a really safe, reproducible way.

So some of the tasks that vetiver helps with are to version your models, to deploy your models, and to monitor your models. Those are the three main tasks vetiver focuses on. There are many, many other best practices that MLOps encompasses, but we really wanted to narrow down (when I say we, it's me and the vetiver team) what we thought was important for people who are just starting to deploy models, who are just trying to figure out what this world is and what it means for them.

Versioning models with pins

So the first step of this is versioning. And when people think of versioning, it's normally in the context of Git, where you have some sort of central repository of files that people can read and write to, like push and pull to. But people actually version a lot of things. And they normally version things pretty badly, actually.

I think of how I was building models in my past. And I would version. I would start with a model, and I'd save it, probably as an RDS file or as a joblib file. So I save my model and name it model, or maybe something a little better. Then maybe you update some of your features when you get new data, so you re-save it, and it's model_final. And then maybe you realize it needs a little more tweaking, and so you have model_final_final. And you guys see where this goes. You end up with all of these versions of one object, where maybe you need information from multiple of these files, but it's not really scalable.

Versioning really is the foundation for success in machine learning deployments. And also maybe not just for machine learning deployments, but a lot of reproducibility problems. A well-versioned system, a well-versioned file is normally going to be easier to share. It's going to be easier for you to share with yourself six months from now as well.

So when we think about our ideal versioning system, it would be nice if all of these files lived in a central location, so you don't have to search between different directories; if they were discoverable by a team, so something that's not just on your computer but can be hosted in a cloud environment between people; and it's also really nice if things can be loaded right into memory. The problem I sometimes have with Git and data is that you have to go download the data, load it onto your local computer, open up RStudio, and load the data in there, rather than reading it straight into memory.

So something that helps vetiver out is actually another package called pins. And we're going to do, like, a very small pins side quest. You don't actually have to install pins to use vetiver, but they play together really nicely and they're kind of built alongside each other.

So pins is a package that publishes data, models, and other R objects, and it makes it easy to share them between projects and with your colleagues. Some things that might be considered good uses for pins are, say, an ETL pipeline that will store a model or a summarized data set maybe once a day.

Pins, though, is not meant for multiple writers going back and forth. One thing that we see people try to do, and that I just want to get ahead of, is: don't try to make something like a Google Form with a pin. These are just files, so if multiple people try to update the same file at once, it might get corrupted. But if one person, or some sort of system, is reading and writing this pin one at a time, or you want to share something for multiple people to ingest, pins is good. Please don't make a Google Form out of it. I can tell you right now, I've had some horror stories from that.

But what works really well is people have like GitHub Actions running where it will update a notebook or something like that, and pins is kind of a perfect solution for them. And pins is built to be like super easy to use.

So it's good for pipelines where things are read and/or updated. Pins is not really meant to have multiple people writing data to it at once. You can have as many people as you want reading from it, but you shouldn't have multiple simultaneous writers. That is kind of a downside of pins.

So this is the temporary board, board_temp(). If you want to change it to an S3 board, it would just be board_s3(), or board_azure(), whatever. So it's super easy to switch between boards.
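A minimal sketch of what that looks like with the pins package; mtcars is just a built-in data frame used as a stand-in:

```r
library(pins)

# A temporary board: pins go to a temp directory that disappears when
# the R session ends -- handy for trying things out
board <- board_temp()

# Swapping storage back ends is just a different board_*() constructor:
# board <- board_folder("~/my-pins")   # a local directory
# board <- board_s3("my-bucket")       # AWS S3
# board <- board_azure(container)      # Azure blob storage

pin_write(board, mtcars, name = "mtcars-data")
pin_read(board, "mtcars-data")
```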

And this is where vetiver comes in. So let's say we are in our data science workflow. We have created our model. It's called RFPipe. This is like a random forest pipeline that I created. And we're ready to deploy it. So we want to create this deployable model object called a vetiver model. And our inputs into this vetiver model are going to be the model itself that we've created. And then we're going to give it a name as well.

And the things that are useful about using this vetiver model object is that it holds onto a lot of information for you. Inside this vetiver model, you're going to have things like an input data prototype. So when you created your model, you, of course, fed it some input data. And vetiver actually can like slurp that up on its own. And it knows the data that it should expect as an input. So if you have five columns and you accidentally put four when you're making a prediction, vetiver will tell you, like, no, that's wrong. Your data should be five columns, or if it has like a wrong date format or something like that. This becomes super useful when your model is deployed somewhere that's not on your computer.

It also will say things like what packages are needed to recreate this model, and a little bit of metadata about the model itself. And then, with your model board and your vetiver model created, you can call a function that's just vetiver_pin_write, with your model board and your vetiver model, and it will actually version your model automatically.
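In R, those steps can be sketched like this; `rf_fit` and the pin name are placeholders for whatever model you trained:

```r
library(vetiver)
library(pins)

# Wrap the trained model in a deployable vetiver_model object; it captures
# the input data prototype, required packages, and metadata for you
v <- vetiver_model(rf_fit, model_name = "rf-pipeline")

# Write it to a versioned board: each write becomes a new version
model_board <- board_temp(versioned = TRUE)
vetiver_pin_write(model_board, v)

# Inspect the metadata stored alongside the model (title, hash, packages)
pin_meta(model_board, "rf-pipeline")
```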

If you wanted to see the metadata that it saves, you can see it has a title, a description, when the model was created, and a hash, a little more robust versioning system than a bunch of underscore-finals, as well as the required packages; this one used vetiver and scikit-learn.

Model cards and documentation

MLOps is not just about putting models in different locations. It's a lot about documenting, and making sure you have good models that are well-documented, that are reproducible, and that really show what you've learned while creating them. We believe that the moment you've created the model is when all of your research is freshest, and there's no better person to write information about this model than the person who created it.

So there are a few templates that you can pull up in RStudio. One of them is a model card. Actually, when you write the model pin itself, you'll get a little feedback that the model card is a framework for transparent, responsible reporting, and to use this Quarto framework as a place to start. And when you run this code in Python, or use the template in RStudio, it gives you something that you can point at your pin, and it will automatically document what it is able to document automatically. Then it has fill-in-the-blank responses for the things a machine can't do: the knowledge that only the model developer would have.

At the bottom, there are also places for you to put ethical considerations and caveats and recommendations. We believe that even if you don't think your model has any ethical considerations, you should at least put "none that I know of," and not just delete the section. If you don't have complete information, it's better to leave something brief to show that you've thought about it than to delete it altogether.

Deploying models as API endpoints

So deploying your model, you're going to be moving your model here into someplace that is not your computer. It's super useful because others can use this model without having to load it. The way that we're going to create this is by using a REST API endpoint.

An API, an application programming interface, is essentially a place where people can ingest your model's information. The really nice thing about this is that it works with JSON, not necessarily R or Python, to send requests to and from your model. So it is pretty language agnostic, as long as the input data information is the same.

So in Python and in R, it is pretty much one-ish lines of code to create a local API endpoint. In Python, it's putting this vetiver model v into a vetiver API and then calling run on it. In R, it is also creating a vetiver API that you're putting a vetiver model into and running it using Plumber, if you're familiar with Plumber, to make API endpoints.
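The R version of that one-ish-liner can be sketched like this, assuming `v` is the vetiver model created earlier:

```r
library(vetiver)
library(plumber)

# Serve the vetiver model `v` as a local REST API; vetiver adds the
# /predict endpoint and auto-generated documentation for you
pr() |>
  vetiver_api(v) |>
  pr_run(port = 8080)
```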

So this is going to give you a local API running. But of course, what we had wanted was a model that was not running on our computer. We want it to be shared somewhere else. So we have a one-liner for Posit Connect, which is one of Posit's pro products, where if you give it the board and the pin name and the version, it'll automatically deploy it as an API endpoint. But there's also ways to do this for Dockerfiles. So if you're working for a company that maybe is using a different cloud platform, most of them have a bring your own container ideology.

So as long as you have some sort of Dockerfile, it will kind of automatically spin it up; it knows how to handle these Dockerfiles and Docker containers. So this prepare-Docker call, for Python or R, will create three files for you, and these three files are what you'll need to deploy your Docker image locally or on another cloud: the application file itself, so a plumber.R file; the Dockerfile, to give the system information on how to build the Docker image; and a requirements.txt in Python, or an renv lock file in R. That way the computer you're running this new API on will know all the information it needs to download the right packages at the correct versions, to ensure reproducibility.
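In R, that preparation step is a single call; the board and pin name here are the assumed ones from the earlier sketch:

```r
library(vetiver)

# Writes three files into the current directory: plumber.R (the app),
# a Dockerfile, and vetiver_renv.lock (pinned package versions for the image)
vetiver_prepare_docker(model_board, "rf-pipeline")
```

From there you can build and run the image with your usual Docker tooling, locally or on whichever cloud platform accepts a container.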

And with this Docker image, this API endpoint, up and running, you're able to just call predict with the data your model expects. So you can actually make predictions from this model locally, even though the model itself is not running on your laptop.
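Calling predict against a running endpoint can be sketched like this; the URL is the local address from the previous step, and `new_data` is a placeholder data frame matching the model's input prototype:

```r
library(vetiver)

# Point at the running API; on a remote server this would be that
# server's URL instead of the local address
endpoint <- vetiver_endpoint("http://127.0.0.1:8080/predict")

# new_data must have the same columns and types as the training data
predict(endpoint, new_data)
```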

So that's what deploying a model looks like. I think maybe the most famous deployed model right now would be something like ChatGPT, where you interact with it locally on your laptop, but the model, of course, is running somewhere else with all this information. You don't have to be someone who knows how to run ChatGPT on your computer; all you have to do is interact with it. Which I think really showcases the beauty of MLOps and what a deployed model should look and feel like.

Monitoring models

So when you've created a model, you have your fresh data that looks great, and it's performing well. But that is at the exact point in time when the model is created. Oftentimes, data will change over time. I think of my Spotify, or whatever music listening platform I use: my music taste is different now than it was five or ten years ago, so those models have to continually update and retrain. If I got the same suggestions as back then, I would probably not be on that platform anymore. The same concept applies here: models will often degrade over time.

So what monitoring looks like in the vetiver framework for MLOps is tracking the statistical output of the model, things like the accuracy or R-squared of a model. A systems admin or a DevOps person, when they think they're monitoring a model, might be thinking about the CPU usage of a model or its runtime. So if you're collaborating with other teams, these are things to keep in mind: be clear in your expectations of what these kind of loaded terms, like monitoring, might mean.

In the vetiver framework, there are really three main functions to help people on their monitoring journey. The first is to compute the metrics, where you give it the data; the column that has the date; the time period you want to aggregate over (so whether you want to look at how your model has performed every week or every year); the metrics you're trying to compute, things like mean absolute error or R-squared; and then the true value and the model output.

You can also do things like pin your metrics, if you want to accumulate this information over time or write a small script to check it as new information comes in, and then plot the metrics themselves. There's also, just like the model card, another template in RStudio to build a monitoring dashboard, if your company or your team needs to monitor these models over time and wants a dashboard deployed to help look at that.
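Those three functions can be sketched like this; `monitor_data`, its column names, and the board are assumptions for the sake of the example:

```r
library(vetiver)

# monitor_data is assumed to hold a date column, the true outcome,
# and the model's predictions (.pred) collected after deployment
metrics <- vetiver_compute_metrics(
  monitor_data,
  date_var   = date,
  period     = "week",  # aggregate performance week by week
  truth      = outcome,
  estimate   = .pred,
  metric_set = yardstick::metric_set(yardstick::rmse, yardstick::rsq)
)

# Pin the metrics to keep the history, then visualize the trend
vetiver_pin_metrics(board, metrics, metrics_pin_name = "model-metrics")
vetiver_plot_metrics(metrics)
```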

So monitoring is extra important because models don't fail loudly. If you're thinking about building a Shiny app: when things go wrong in Shiny, you get that, what is it, the skull or the sad loading screen. Things are super broken, things are not going well, and you get this loud error. Models don't work like that. Models fail silently, and they can still run even if there's no error. Even if the accuracy is 0%, your model will still give you an output. This is something we also see with ChatGPT, where it will hallucinate: you can ask it things, and it will tell you the wrong thing very confidently. And that's where monitoring comes in, because if you aren't tracking how your model is performing, you are oblivious to whether your model is decaying.


So it's really important, whether it's in some sort of large system or something you're doing ad hoc on your own, to do some sort of model monitoring.

And that will complete our cycle. You have your model versioned as you're creating different experiments, as you're updating your model. You can deploy your model. You can monitor your model. And we can think about this kind of over a long span of time. Let's say last year, I created a model, and I deployed it. I have my version locally. And maybe over the last year, I've been monitoring it, and I realize now I need to update my model. And retrain it with the new data we've collected. Then I can create a new version of that model. I get to redeploy it and continue monitoring it so this cycle can go on and on.

So things that are interesting about vetiver that I think is something that we should be excited about is it's very composable. So internally with vetiver API and vetiver model, vetiver APIs are just plumber APIs. So if you're someone who is a plumber expert and wants to add in new endpoints, who wants to add in whatever other infrastructure you want around this plumber endpoint, you can add that into your vetiver API just as you'd expect. It's also something that is built to be kind of an extension of a data science workflow. So you don't have to change the types of models you're creating or anything like that. It is truly just something to add on, not to try to change what you're already doing.

Live code demo

So like I said, I primarily write Python. But today, we're going to be doing R together. So I have my RStudio pulled up. I'm not a super fast typer. So hopefully, we can all do this together at the same time. But if you're someone who would rather copy and paste and run code that way, I'll show you the file that I will be working off of.

So this is the file that we're going to be working off of. I'm going to be typing it slash copying and pasting it. The first part is we're all going to need to use this URL. So you will have to copy and paste this URL.

So we're going to put on our data scientist hat. Everyone here is going to do a full data science workflow, kind of end to end. And that always starts with, of course, the tidyverse. So we'll start by loading the tidyverse and loading in this data.

So this data is about the advertisements of the Super Bowl, which is a very American thing. It's American football; it's like their World Cup. This is looking at all of the advertisements run during the Super Bowl and seeing how many likes they got on YouTube as of, I think, whenever this data set was created, so 2021, and it covers earlier Super Bowls. It has things like: was the ad supposed to be funny? Did it show the product? Does it have a celebrity in it? An element of danger? Animals? So we're going to make a really small model based off that.

And we're going to be using the tidymodels framework. Now, tidymodels is a little bit newer, but what I have really enjoyed about this framework is that it makes it really easy to swap different types of models in and out. So if we want to start with a random forest model, we use the exact same words, rand_forest, and we can put it into regression mode. So we're going to do just a little regression on whether or not these are good ads, I suppose, or predicting for a new advertisement how many likes it would get on YouTube.

So now we've created our model, almost. We're actually going to put this into a little workflow. So this is going to fit our model from using this random forest and this formula. So we're going to run this. And we're going to also run this. And now you have a model that's fit. And this is kind of the beginning stages of that data science workflow.
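A sketch of those demo steps in R; the data frame name `superbowl_ads` and the column names are assumptions based on the talk's description of the dataset:

```r
library(tidymodels)

# A random forest specification in regression mode (default ranger engine)
rf_spec <- rand_forest(mode = "regression")

# Bundle the formula and the model spec into a workflow, then fit it
rf_fit <- workflow(
  preprocessor = like_count ~ funny + show_product_quickly +
    patriotic + celebrity + danger + animals,
  spec = rf_spec
) |>
  fit(data = superbowl_ads)
```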

This is the point where we think data scientists have gotten to. And there's such great tools and great support for loading your data, doing, I can't really call this a lot of feature engineering or exploration, but learning about your data, selecting the columns you want. There's really great tools for making models. But this is where we think maybe people at this point want a little bit of help. Or maybe there's less tools available.

So next, we're going to load vetiver. And so the first thing that we are going to do, just like we had seen in the slides, is use this vetiver model object. We're going to put in our trained model. So that's that trained random forest model. And we're going to give it a name.

And we can see here that this model has all of these blueprints of the predictors that the model expects. We can see like, oh, this model later on should have these features, show these predictors, all this information. So we've created a vetiver model.

And we're first going to pin it. So I'm going to do something that is just a local board for pins. But if you guys wanted to, you could look at the docs maybe after and move this to GitHub. I think there's even Kaggle, if people use Kaggle still. And OK, so I'll list my folder. So you can move this into a variety of locations.

And we're going to put versioned equals true. There are a few different ways to have boards: if you actually only want one model at any given point in time, you don't have to version it. But we do for this one, so I'm going to explicitly say that I want a versioned board.
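In code, the versioned local board and the model pin look something like this; the folder name is arbitrary and `v` is the vetiver model created just above:

```r
library(pins)
library(vetiver)

# A local folder board with versioning turned on explicitly
b <- board_folder("pins-board", versioned = TRUE)

# Write the vetiver model as a pin; it takes its name from the model
vetiver_pin_write(b, v)
```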

Then we're going to write our pin. For that, we're going to give it the board, which is b, and the vetiver model, which is v. And just like we had seen in the slides, we get this little pop-up saying we should create a model card. So if we want to create a model card, you can go to your R Markdown templates, and you can see there's the model card, and here's also the model dashboard.

So if you want some templates to get started, you can do that. I'm not going to run this, but it is a parameterized report. So you can see here, if you gave it your pins board folder, with the pin we just wrote and the name of the pin, you could update this model card to be for the model we've just created.

If you want to read the pin back in, you'll use the board that it's saved on and then the name of the pin. So the name of the pin is our Super Bowl random forest. And we can see here, we have a random forest regression model workflow with six features. If we save it as v2, we can see v and v2 are going to be the same exact thing. So we've just saved it somewhere else and loaded it back in.
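Reading the pinned model back is one call; the pin name here is an assumption standing in for the demo's "Super Bowl random forest" pin:

```r
library(vetiver)

# Read the pinned vetiver model back from the board created earlier
v2 <- vetiver_pin_read(b, "superbowl_rf")
```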

So this is the more exciting pieces where we're going to have things popping up. So like I said, vetiver is built off of Plumber. So when we load this Plumber library and start a Plumber router, then for our last bit, we're going to create a vetiver API of holding V2. And we'll also run it.

You can see, if you're working off that file, there's a port. You don't need to specify all of those things. It's mostly if you want to get different debug information. And so you'll run it. And it might take a second. OK, there we go. This is an API that is running locally on your machine. This API endpoint comes with this automatically documented information.

And we get all of these different endpoints. So if you were to deploy this somewhere else: right now it's, of course, at http://127.0.0.1, but you can imagine it's on some other server. Without even downloading the model, you can see the metadata for the model. So this is the version; these are the required packages you need to run the model locally. You can also get the input data prototype, so if you want to see what the input data is, whether this model needs things that are funny, all these different columns.

But kind of for the final piece, you can actually make predictions right from the model. So if we wanted to try this out, we could do if there was something that was not a funny ad that doesn't show the product, it's not patriotic, there's no celebrity, there's no danger, and there's no animals, that it'll maybe get 1,800 likes on YouTube. We can, of course, change these. So if this is true, you can see how that changes. Oh, if it's dangerous, now it has 9,000 likes on YouTube. So you can interact with this model here.

But if you are running these in other locations, you can just call predict. Of course, I've shut this down now, but you could run predict and put your data in here. So if I did predict with df, it would spit the predictions back out, just like when I was running it locally.