Resources

How to train, evaluate, and deploy a machine learning workflow with tidymodels & Posit Team

Helpful resources:
- GitHub: https://github.com/simonpcouch/mutagen
- Follow-up Q&A session: https://youtube.com/live/vwBVOBQfc_U
- Book a call with our team to chat more about Posit products: pos.it/chat-with-us
- Don't want to meet, but curious who else on your team is using Posit? pos.it/connect-us
- Blog post on tidymodels + Posit Connect: https://posit.co/blog/pharmaceutical-machine-learning-with-tidymodels-and-posit-connect/
- Tidy Modeling with R book: https://www.tmwr.org/

Timestamps:
- 1:44 - Three steps for developing a machine learning model
- 3:35 - What is a machine learning model?
- 7:02 - Overview of machine learning with Posit Team
- 7:36 - Step 1: Understand and clean data
- 11:05 - Step 2: Train and evaluate models (why you might be interested in using tidymodels)
- 23:02 - Step 3: Deploying a machine learning model from Posit Workbench to Posit Connect
- 30:14 - Summary
- 31:21 - Helpful resources

Machine learning models are all around us, from Netflix movie recommendations to Zillow property value estimates to email spam filters. As these models play an increasingly large role in our personal and professional lives, understanding and embracing them has never been more important; machine learning helps us make better, data-driven decisions.

The tidymodels framework is a powerful set of tools for building, and getting value out of, machine learning models with R. Data scientists use tidymodels to:

1. Gain access to a wide variety of machine learning methods
2. Guard against common mistakes
3. Easily deploy models through tidymodels' integration with vetiver

Join Simon Couch from the tidyverse team on Wednesday, October 25th at 11am ET as he walks through an end-to-end machine learning workflow with Posit Team. No registration is required to attend; simply add it to your calendar using this link: pos.it/team-demo

Oct 25, 2023
33 min

image: thumbnail.jpg

Transcript#

This transcript was generated automatically and may contain errors.

Hey y'all, and welcome to the 7th Enterprise Community Meetup, where we talk about end-to-end data science workflows with Posit Team. My name is Simon Couch, and I work on free and open source packages for machine learning in R at Posit.

In today's session, I'm going to walk through an example of training and deploying a machine learning model. I'll be leaning heavily on tidymodels, a framework for machine learning in R that I work on, to develop my model. Along the way, I'll show you how Posit Team eases some of the pain points of model development that I encounter.

During today's session, I'll lay out three high-level steps for building a machine learning model, and then spend some time with each of those steps individually before concluding with resources where you can learn more about tidymodels and about Posit Team. After I present, we'll be hanging out for a live Q&A, so please feel free to ask questions as we go; I'm looking forward to spending time with you. But with all that out of the way, let's go ahead and get started.

To talk about those first and second steps, I'll make use of some slides. And then for step three, I'll drop into a live demo using Posit Connect and Posit Workbench. If you're interested in spending more time with these slides or checking out the source code behind this presentation, you can go to the link in the footer of these slides. That's github.com slash simonpcouch slash mutagen.

Three steps for developing a machine learning model

I've put together a diagram of sort of the high-level process for developing a machine learning model. And the three steps that I'm talking about are kind of called out with these three hex stickers in the corners. In the top right corner, that first step is to collect data and to spend time with it. Cleaning data, visualizing it, understanding it are all important first steps to building a machine learning model. When I do each of those steps, I end up spending a lot of time with the tidyverse packages, which some of you may have heard of.

Once I move on to training and evaluating models, I use the tidymodels packages, which I work on day-to-day. And a common issue that people run into when they're building models with tidymodels or any model development software is running out of computational resources. If you're working in an environment with lots of data or you're trying to evaluate really complex models, you can run out of computational resources really quickly. And for me, this is a place where Posit Workbench is really key. Recently, I've started using it when I run into this situation. And it allows me to drop into a computational environment where I have more resources and can complete that model training more quickly.

Once a model is trained, it's time to get it into the hands of people that will use it. And that's where the Vetiver package comes in. Vetiver is free and open source, and it provides tools to version models, deploy models, and monitor them to ensure that they continue to perform as you expect them to once they're out into the world. Vetiver is tightly integrated with Posit Connect, which offers all sorts of tooling for getting the outputs of analysis into the right hands.

What is a machine learning model?

So first, let's talk about what I even mean when I say machine learning or a machine learning model. One example is Netflix movie recommendations. If you've ever gone to somebody else's house and looked at the movies that are recommended to them, you might notice that they're quite different from what you'd see at your own house. That's because the movies that are recommended to you, and the thumbnails that are part of those recommendations, are outputs of machine learning models. Based on your viewing history and the things that you've said you're interested in, Netflix will recommend different movies to you.

Another example: on Zillow, if you're looking at houses or apartments, there are estimates of the value of the property, or the rent that they would expect you to pay if you were to live there. So that's another example of a machine learning model. Or email spam filters. Whenever we get an email, there's a filter that looks at the subject line, whether you've received an email from that person before, and whether there are any typos inside of the email. Based on information like that, a model will try to predict whether an email is spam or not, and if it does predict that the email is spam, it will go to a spam folder.

So the thing that all of these share is that they are guessing the value of an outcome using predictors. For example, in the email spam filters, the outcome is whether that email is spam or not, and the predictors are those things I mentioned, like the subject line and the number of typos inside. For a Zillow property value estimate, the predictors might be something like the square footage of the house or the apartment, how recently it was renovated, and what area it's in. And for Netflix movie recommendations, that's everything from your stated interests to how long you spent watching movies in different subcategories.

Another example of a machine learning model is one that this presentation is based on. Day to day, my responsibility is just to develop tools to build models. And so I often don't spend a lot of time building models myself, but doing so helps us better understand the perspectives of our users. And so we try to make time to really delve in on a modeling problem as often as we can. So recently, my colleague Max Kuhn and I spent some time to develop a model and work through the process to deploy it using Posit Workbench and Posit Connect and wrote about that on the Posit blog.

In this blog post, we talked through a process that's really similar to the one I'll talk through with you today, where we explore data, we situate ourselves within our modeling problem, we develop a model to best understand that data, and then we deploy it using Connect.

Understanding and cleaning data

So the process of machine learning with Posit Team, again, I'll just underscore those three different sections before we dive in on them. First, we'll spend some time to understand and clean data with the tidyverse. And then we'll train and evaluate models with tidymodels and Workbench. For both of these sections, I'll just work through some slides. And if you're interested in the details, you can look at the source code of the analysis and the source code for these slides at github.com slash simonpcouch slash mutagen.

Once we've done so, we need to get that model into the hands of users. And that's where Vetiver and Connect come in. So let's start off with understanding and cleaning data.

In pharmaceuticals, mutagenicity refers to a drug's tendency to increase the rate of mutations due to damage caused to DNA. Mutagenicity is a key indicator that a drug may be a carcinogen, or cancer-causing. Mutagenicity can be evaluated with a lab test: a person can test whether a drug is a mutagen using a standard wet lab procedure. But those tests are costly and time-intensive.

So a group of scientists in our example problem converges to ask the question, What if a machine learning model could predict mutagenicity using known drug information? That is, if scientists are spending time with a new drug, using information that they already have, they can predict whether a drug is a mutagen or not, which will allow them to iterate more quickly when they're developing new drugs.

The source data that we'll be working with is the mutagen data table. It's a table with over 4,000 rows and, notably, over 1,500 predictors. In the first column, we have whether the drug is a mutagen or whether it's a non-mutagen. And for each of those compounds, we have a whole lot of predictors giving us other information we can use to predict whether a drug is a mutagen or not.

So you're probably thinking to yourself, if I have 1,500 little pieces of information about this drug, it surely must be pretty straightforward to predict that. It turns out it's not quite that simple. On this plot, I've plotted two of the predictors, and each of those points is colored according to whether the drug is actually a mutagen or whether it's a non-mutagen. Points in blue are mutagens, points in yellow are non-mutagens. On the x-axis, we have the molecular weight, and on the y-axis is the partition coefficient. It doesn't matter whether you understand either of these two predictors. I really don't understand them. But the point here is that it's really hard, only knowing the value of these two predictors, to separate the blue points from the yellow points.

I can choose another two of them. Right now I'm showing the number of heavy atoms and the log of the average valence connectivity. Whatever those two predictors mean, again we're seeing that it's really hard to draw a line here that would easily separate the blue points and the yellow points. And this is the goal that our machine learning model will try to solve. So if you're like me, you're feeling a little bit intimidated about tackling this. This is where we move on to the step of training and evaluating models.

Why use tidymodels?

So for this portion of the presentation, instead of walking through the notebook that does the training and the evaluating, I'll leave it to folks to look through the notebooks that are on that GitHub page after the presentation. And instead, I'll try to answer the question of why you might be interested in using tidymodels over other modeling frameworks that are out there.

The first reason why you should be interested in using tidymodels is that tidymodels is consistent in a way that other modeling frameworks and packages are not. As an example, we can start by comparing the code to fit a linear model using only the lm function with the code to fit the same model through tidymodels. On the left hand side, there's just one line of code. We're using the example mtcars dataset here, if you're familiar with it. We're saying we want to use all of the predictors in the mtcars data to predict the outcome, which is the miles per gallon, or mpg, and to do so, we'll use the mtcars data.

On the right hand side, the code does the same thing, except the tidymodels wrapper has been dropped on top of that function. The code reads, we'll use a linear regression using the lm function as our computational engine and we'll fit that model using the same dataset and the same preprocessing formula, mpg, as a function of everything. When you see this, you might think it seems a whole lot easier to just go ahead and use lm, it's only one line of code, and I hear you. Where we really see the value of tidymodels, though, is when we start to introduce complexity into the modeling pipeline.
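The side-by-side comparison from the slides can be sketched roughly like this, using the built-in mtcars data:

```r
library(parsnip)

# Base R: one line, formula interface
fit_lm <- lm(mpg ~ ., data = mtcars)

# tidymodels: declare a model specification, then fit it
# with the same formula and data
spec <- linear_reg() |> set_engine("lm")
fit_tidy <- fit(spec, mpg ~ ., data = mtcars)
```

The tidymodels version is a few lines longer for this simple case, but the specification step is exactly what makes swapping engines later a one-word change.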

So let's say you're using the lm function and it's not doing everything you need it to do, and you want to add, for example, a regularization term to your linear regression. If you're not familiar with that, it doesn't matter for what I'm about to demonstrate here. I want you to look at line 3 in the tidymodels portion of the code, where I've set the engine to lm. If I want to transition that code to instead use glmnet, all that I need to do is change the portion of the code that said lm to instead say glmnet. And you can see on the left hand side, much more than just one line of code needed to change. Here, glmnet, for its first argument, takes all of the predictors, and they have to be formatted as a matrix. Thankfully, in mtcars, all of the predictors are numbers anyway, so they're one homogeneous type and we can easily convert them with as.matrix. And then, instead of the name of the outcome variable like it was in lm, the outcome here is specified as the actual vector of values in the mtcars dataset: that's mtcars dollar sign mpg.
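A rough sketch of the contrast just described (the penalty value here is illustrative; the glmnet engine expects one):

```r
# Base glmnet: the interface changes entirely —
# predictors must be a numeric matrix, the outcome a vector
library(glmnet)
x <- as.matrix(mtcars[, -1])  # all columns except the outcome, mpg
y <- mtcars$mpg
fit_glmnet <- glmnet(x, y)

# tidymodels: only the engine (plus its penalty argument) changes;
# the formula and data stay exactly the same as with lm
library(parsnip)
spec <- linear_reg(penalty = 0.1) |> set_engine("glmnet")
fit_tidy <- fit(spec, mpg ~ ., data = mtcars)
```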

To transition that code, I needed to know not only that glmnet existed and was able to perform regularization, but I needed to look at the documentation to see how that function wanted the predictors to be formatted, how it wanted the outcome to be formatted, whether it needed any additional arguments, which of the functions in the package I needed to use to do so. And again, in tidymodels, all I needed to do was change the value of that engine argument. And the story goes the same way with any other engine you can imagine.

Let's say instead of glmnet, I wanted to use the h2o package, which gives me access to all sorts of machine learning techniques implemented in Java so it's super fast. With tidymodels, all I need to do is change the value of engine to h2o. But with h2o, I need to initialize an h2o server. I need to convert that mtcars dataset to a dataset that's compatible with h2o using as.h2o. And I assign a label to that mtcars dataset, the string mtcars. And then I pass that string as the last argument to h2o's linear model function, h2o.glm. Also, the h2o.glm function doesn't take the values of the predictors, instead it takes the names. So I take the names of the mtcars dataset, the portion of it that has the predictors in it. And the same goes for the outcome. It takes the name of the outcome in the mtcars dataset as a string.

So all of this change in the code, again, I would need to spend time to read through the documentation, see how that interface changes. And only once I've done so can I make use of the functionality within h2o. And with tidymodels, I just get that for free by changing the engine argument.
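For the h2o case, a minimal sketch, assuming the agua extension package (which registers the "h2o" engine for parsnip and handles the server startup and data conversion behind the scenes):

```r
# Assumes the agua package is installed; it provides h2o support
# for tidymodels so the code below mirrors the lm and glmnet versions
library(agua)
library(parsnip)

h2o_start()  # spins up an h2o server in the background

spec <- linear_reg() |> set_engine("h2o")
fit_h2o <- fit(spec, mpg ~ ., data = mtcars)
```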


Another reason to use tidymodels is safety. tidymodels is designed to safeguard you from common modeling pitfalls. One of the things that tidymodels is designed to protect you from is data leakage. A 2023 review found data leakage to be a widespread failure mode in machine learning-based science. Data leakage is when information from the portion of the data that you're going to use to evaluate your model somehow leaks into the portion that you're using to train it. If a model is evaluated on data that it's seen already, it will perform better than it would on data it hasn't seen before. So your model will seem to be performing really well, but it actually isn't, and you'll only learn that the model isn't performing well once it's deployed.
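The rsample package encodes the guard against leakage described above: you split once, up front, and the test set is never touched while training. A minimal sketch:

```r
library(rsample)

# Split once, before any preprocessing or model fitting;
# the test set stays untouched until final evaluation
set.seed(123)
split <- initial_split(mtcars, prop = 0.8)
train <- training(split)
test  <- testing(split)

# All preprocessing is estimated on `train` only, so no information
# from `test` can leak into the fitted model
```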

So in our example, in the mutagen example, if our model seems to be able to classify drugs as either mutagens or non-mutagens really, really well, if we go ahead and deploy that, thinking that the model is performing perfectly, then scientists will mistakenly interpret the results from that model and take them at face value.

Another common failure mode is overfitting. Overfitting leads to analysts believing models are more performant than they actually are. So this is really similar to that first failure mode, where in the case of an overfit model, in our mutagen example, the model is going to perform really, really well on the portion of the data that was used to train the model. And if we don't have tools to then evaluate the model on data that it hasn't seen before, then that model might just learn every little intricacy of the training data and won't generalize well to data that it hasn't seen before.

A last component of safety that tidymodels protects its users from is irreproducibility. Many popular implementations of the same machine learning algorithm give differing results, making modeling results irreproducible. Each of the citations listed at the bottom here reports, in order, on data leakage, overfitting, and irreproducibility, and tidymodels is designed to safeguard you from all of them. We developed this software so that the things that are wrong but that an analyst might commonly be tempted to do are really hard, and the things that are usually difficult but pay off in the long run are really easy.

Another reason why you might want to use tidymodels is that tidymodels is designed to be communicable. Most tidymodels objects have visualization methods, called autoplot, associated with them. So if I make a receiver operating characteristic (ROC) curve with tidymodels, which is a tool used to evaluate models, I can pass that object to autoplot and see what that curve looks like in 2D space. I can also pass, for example, the results of tuning a random forest algorithm; again, I just pass that result to the autoplot function, and we have built-in visualization procedures for that model object.

The same story goes for iterative search results. Things like Bayesian optimization, simulated annealing, these are advanced machine learning techniques that are built into tidymodels. When I use tidymodels to generate these results, I can understand them by quickly plotting them.
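The autoplot pattern just described looks like this in practice, using the two_class_example data that ships with the yardstick package (a column of true classes plus predicted class probabilities from a fitted model):

```r
library(yardstick)
library(ggplot2)

# Compute an ROC curve from true classes and predicted probabilities
roc <- roc_curve(two_class_example, truth, Class1)

# One call gives a ready-made visualization of the curve
autoplot(roc)
```

The same one-liner works on tuning results, iterative search results, and many other tidymodels objects.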

Another reason to use tidymodels is completeness. There's built-in support for 99 machine learning models once you load the tidymodels packages, and even more if you load extension packages contributed by the community. tidymodels is also complete with respect to data preprocessing techniques. The three main preprocessing packages loaded with tidymodels provide 102 different ways to take data that comes in raw and turn it into something that a model knows what to do with.
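Those preprocessing steps live in the recipes package. As a small sketch of how a few of them chain together (the particular steps and num_comp value are illustrative):

```r
library(recipes)

# A recipe declares preprocessing steps; nothing is computed until prep()
rec <- recipe(mpg ~ ., data = mtcars) |>
  step_normalize(all_numeric_predictors()) |>
  step_pca(all_numeric_predictors(), num_comp = 5)

prepped <- prep(rec)                       # estimate the steps from the data
baked   <- bake(prepped, new_data = NULL)  # apply them to the training data
```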

I've hinted at this already, but again, tidymodels is extensible. If you can't find the machine learning technique you need, which is either the machine learning model or the preprocessing technique, you can add it yourself. We have APIs that are designed to make it really easy to contribute your own extensions that are designed with your problem context in mind.

The last reason I'll list here is deployability, and this will bring us into that final step three that we'll drop into the live demo for. tidymodels is tightly integrated with Vetiver, which is a free and open-source package for model deployment. And Vetiver is configured to quickly and easily and securely ship a model to Posit Connect, which is part of the suite of products in Posit Team.

Live demo: deploying a model with Vetiver and Posit Connect

So with that said, let's go ahead and drop into a live demo. I'll spin up a workbench session, and we'll use that session to train a model, and then send it off to Posit Connect. Once I've logged in, this is what my Posit Workbench instance looks like. I'm going to start out by spinning up a new session, and this is what I was talking about where I can get access to resources beyond what my laptop is capable of in terms of computing power. So I have access to different types of clusters, and then for that cluster, different options in terms of the number of CPUs, and the amount of RAM. So I'm going to go ahead and start this session as an RStudio Pro project.

This will just take a couple of seconds to spin up, and once it's wrapped up, it will look a lot like an instance of the RStudio IDE if you've ever worked with one of those locally on your laptop.

So from here, I'm at the point where I have some code that I want to train. I have some code that I want to use to train a model, and this code is hosted on this GitHub repository that I have here, simonpcouch slash mutagen. This repository has all the source code for training the model, as well as the source code for the slides if you're interested in it. I'm going to go ahead and copy the link to this GitHub repository, and then I will start a new project, and the new project will be from version control. I'll start it up from Git, and paste in the URL there, and I'll call this project mutagen. This will also just take a couple seconds to spin up, and once it's done so, all of the data and the source files that were inside of that GitHub repository are going to be also available to me inside of Posit Workbench.

So for instance, there's a readme in this repository giving some high-level information on what's in there. Here's all the source code for the slides in this slide folder. What I'm going to open up is a source here, which has a Quarto document that has all of the source code that I used to pre-process my data and start understanding it in step one, and then moving into that step two, which is developing and training the model, and ultimately deciding on the final fit. So you can see up here in my environment, I went ahead and loaded in the output of final fit from file. So normally in my workflow, I would go ahead and train the model here on Workbench, and since I've done so already, I already have the output of that from here, and instead we're going to move right on to that last step, step three, which is getting that trained model into the hands of the people that are going to use it.

This is actually really simple. There are only a few lines of code you need, once you have a fitted model with tidymodels, to deploy that model to Posit Connect. The first step is to pass that trained model to vetiver_model(), a function from vetiver that puts some wrappers around it and makes sure it's ready to deploy. I'll also call board_connect() here and assign that output to board. If you're unfamiliar with this function, it's from the pins package, and if you want to learn more about pins, I think it was the last Posit Team demo where they covered that functionality.

We're running into an error here, and what this is telling me is that we don't have a Posit Connect server that we're connected to already, so we can actually go ahead and do that, and it's pretty simple. If I go to Tools, Global Options, Publishing, these are the accounts from which I can publish from. I'm going to go ahead and just click Connect here to Posit Connect, and this is pre-filled for me because I'm working on my Workbench instance inside of a Posit Team account. That's pre-filled in there. I'll click Next. It's going to wait for authentication, and it will authenticate to my Posit account, and now that it's authenticated, I can click Connect Account and Apply, and so that's all I needed to do to link my Workbench session up to Posit Connect.

So I can try running that board_connect() code again, and it was successful. Nice. So I'm on the Connect instance that I have access to through Posit. I'll then pass the vetiver object wrapping that final fit to vetiver_pin_write() to write it to the board I've just created, and now that final fit vetiver object is pinned to the board on Posit Connect. Once that pin is on Connect, I'll go ahead and deploy the model that I have pinned under my username, Simon Couch slash mutagen, and you'll see down in the console that vetiver is doing all sorts of work to bundle up everything that the model needs to create predictions from new data into an object that can be passed off to Connect.
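The full deployment described here really is just a few lines. A sketch, assuming final_fit is the trained tidymodels workflow from the notebook (the pin name and username are illustrative):

```r
library(vetiver)
library(pins)

# Wrap the fitted tidymodels workflow so vetiver knows how to serve it
v <- vetiver_model(final_fit, "mutagen")

# Connect to Posit Connect as a pins board; credentials come from the
# account linked in the IDE, as shown in the demo
board <- board_connect()

# Version the model as a pin, then deploy it as an API on Connect
vetiver_pin_write(board, v)
vetiver_deploy_rsconnect(board, "user.name/mutagen")  # pin name is illustrative
```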

So that happened pretty quickly. It just brought us over to a Connect window where vetiver is showing the deployment. So on the left side of the screen is what a user would see if they pulled up this model API, and so there's all sorts of metadata about the model. There's examples from which analysts can ping, put together code to ping the model for new predictions. And then on the right side, and this is a really nice benefit for a lot of people, there's all sorts of access to security features and metadata. So by default, it's only available to specific users or groups, and you can add those in, but you can also make it available to all users within your organization. And then there's all sorts of other metadata that you have access to as well.

So I hadn't deployed a model using vetiver until I was working on the blog post that I mentioned earlier, and I was really surprised at how straightforwardly it came together. And so I would encourage you if you are working in an instance of Workbench in your organization to go ahead and try to do that same thing, deploy the model that is publicly available on my GitHub and see what happens.


Wrapping up and resources

So with the third step of deploying our model wrapped up, we've really touched on all portions of this diagram at this point. At first, we took a look at our data, we situated ourselves with the modeling problem, we visualized some of the variables and got a sense of the challenge that we were up for. We then moved on to training and evaluating a model using tidymodels. In this demo, we really just talked about why using tidymodels would be advantageous to your modeling workflow. But if you're more interested in how we actually went about doing that, I would encourage you to pull down that script I had up during the demo, run some of the code and see how it feels for you. And I'll show you some resources after this where you can learn more about how to use tidymodels. After that, we went on to deploy the model to Posit Connect using Vetiver. And as you saw, that really only took a couple lines of code. So we very quickly went from a final trained model to a model that's in the hands of users at our organization.

If you're interested in learning more about the tidyverse, and how to understand, clean, and visualize data, the R for Data Science book at r4ds.hadley.nz is definitely a great place to start. And if you feel comfortable with those foundations and want to move on to the modeling portion, the Tidy Modeling with R book at tmwr.org is a great resource and can bring you from a beginner to a strong modeler very quickly. If you want to learn more about Posit Team, I would head to posit.co slash team. There's all sorts of information on the products that are included as part of Posit Team and links for places to learn more. And lastly, if you want to see these slides again, or poke around with the source code that's underneath them, or with the source code to train a model using tidymodels, again, that code is at github.com slash simonpcouch slash mutagen. I really appreciate the invitation to drop into the Posit Team demo. Thank you for having me, and I'm looking forward to questions.

Thanks so much, Simon. Really appreciate you joining us here today for the demo and for sticking around for the Q&A. Just wanted to let everybody know that this YouTube will push you over to the live Q&A in just a second here. If you want to ask questions, you can post in the YouTube chat over there, but you can also use the Slido link that we have up on the screen, pos.it slash demo dash questions, to ask questions anonymously.

But one thing that I wanted to add is that we have these monthly end-to-end workflow demos on the last Wednesday of every month. So if you joined us here today for your first one, we'd love to have you join us again. And so you can add those to your calendar with just the short link pos.it slash team dash demo. And that add to calendar link will point you to all the recordings as well. But we'll see you over there in the Q&A in just a second.