Resources

How to train, evaluate, and deploy a machine learning workflow with tidymodels & Posit Team

Helpful resources:
- GitHub: https://github.com/simonpcouch/mutagen
- Follow-up Q&A session: https://youtube.com/live/vwBVOBQfc_U
- Book a call with our team to chat more about Posit products: pos.it/chat-with-us
- Don't want to meet, but curious who else on your team is using Posit? pos.it/connect-us
- Blog post on tidymodels + Posit Connect: https://posit.co/blog/pharmaceutical-machine-learning-with-tidymodels-and-posit-connect/
- Tidy Modeling with R book: https://www.tmwr.org/

Timestamps:
- 1:44 - Three steps for developing a machine learning model
- 3:35 - What is a machine learning model?
- 7:02 - Overview of machine learning with Posit Team
- 7:36 - Step 1: Understand and clean data
- 11:05 - Step 2: Train and evaluate models (why you might be interested in using tidymodels)
- 23:02 - Step 3: Deploying a machine learning model from Posit Workbench to Posit Connect
- 30:14 - Summary
- 31:21 - Helpful resources

Machine learning models are all around us, from Netflix movie recommendations to Zillow property value estimates to email spam filters. As these models play an increasingly large role in our personal and professional lives, understanding and embracing them has never been more important; machine learning helps us make better, data-driven decisions.

The tidymodels framework is a powerful set of tools for building, and getting value out of, machine learning models with R. Data scientists use tidymodels to:

1. Gain access to a wide variety of machine learning methods
2. Guard against common mistakes
3. Easily deploy models through tidymodels' integration with vetiver

Join Simon Couch from the tidyverse team on Wednesday, October 25th at 11am ET as he walks through an end-to-end machine learning workflow with Posit Team. No registration is required to attend; simply add it to your calendar using this link: pos.it/team-demo

Oct 25, 2023
33 min

image: thumbnail.jpg

Transcript#

This transcript was generated automatically and may contain errors.

Hey y'all, and welcome to the 7th Enterprise Community Meetup, where we talk about end-to-end data science workflows with Posit Team. My name is Simon Couch, and I work on free and open source packages for machine learning in R at Posit.

In today's session, I'm going to walk through an example of training and deploying a machine learning model. I'll be leaning heavily on tidymodels, a framework for machine learning in R that I work on, to develop my model. Along the way, I'll show you how Posit Team eases some of the pain points of model development that I encounter.

During today's session, I'll lay out three high-level steps for building a machine learning model, and then spend some time with each of those steps individually before concluding with resources where you can learn more about tidymodels and about Posit Team. After I present, we'll be hanging out for a live Q&A, so please feel free to ask questions as we go; I'm looking forward to spending time with you. But with all that out of the way, let's go ahead and get started.

To talk about those first and second steps, I'll make use of some slides. And then for step three, I'll drop into a live demo using Posit Connect and Posit Workbench. If you're interested in spending more time with these slides or checking out the source code behind this presentation, you can go to the link in the footer of these slides. That's github.com slash simonpcouch slash mutagen.

Three steps for developing a machine learning model

I've put together a diagram of sort of the high-level process for developing a machine learning model. And the three steps that I'm talking about are kind of called out with these three hex stickers in the corners. In the top right corner, that first step is to collect data and to spend time with it. Cleaning data, visualizing it, understanding it are all important first steps to building a machine learning model. When I do each of those steps, I end up spending a lot of time with the tidyverse packages, which some of you may have heard of.

Once I move on to training and evaluating models, I use the tidymodels packages, which I work on day-to-day. And a common issue that people run into when they're building models with tidymodels or any model development software is running out of computational resources. If you're working in an environment with lots of data or you're trying to evaluate really complex models, you can run out of computational resources really quickly. And for me, this is a place where Posit Workbench is really key. Recently, I've started using it when I run into this situation. And it allows me to drop into a computational environment where I have more resources and can complete that model training more quickly.

Once a model is trained, it's time to get it into the hands of people that will use it. And that's where the Vetiver package comes in. Vetiver is free and open source, and it provides tools to version models, deploy models, and monitor them to ensure that they continue to perform as you expect them to once they're out into the world. Vetiver is tightly integrated with Posit Connect, which offers all sorts of tooling for getting the outputs of analysis into the right hands.

What is a machine learning model?

So first, let's talk about what I even mean when I say machine learning or a machine learning model. One example is Netflix movie recommendations. If you've ever gone to somebody else's house and looked at the movies that are recommended to them, you might notice that they're quite different from what you'd see at your own house. That's because the movies that are recommended to you, and the thumbnails that are part of those recommendations, are outputs of machine learning models. Based on your viewing history and the things that you've said you're interested in, Netflix will recommend different movies to you.

Another example: on Zillow, if you're looking at houses or apartments, there are estimates of the value of the property, or the rent that they would expect you to pay if you were to live there. So that's another example of a machine learning model. Or email spam filters. Whenever we get an email, there's a filter that looks at the subject line, whether you've received an email from that person before, and whether there are any typos inside of the email. Based on information like that, a model will try to predict whether an email is spam or not, and if it does predict that the email is spam, it will go to a spam folder.

So the thing that all of these share is that they are guessing the value of an outcome using predictors. For example, in the email spam filters, the outcome is whether that email is spam or not, and the predictors are those things I mentioned, like the subject line and the number of typos inside. For a Zillow property value estimate, the predictors might be something like the square footage of the house or the apartment, how recently it was renovated, and what area it's in. And for Netflix movie recommendations, that's everything from your stated interests to how long you spent watching movies in different subcategories.

Another example of a machine learning model is one that this presentation is based on. Day to day, my responsibility is just to develop tools to build models. And so I often don't spend a lot of time building models myself, but doing so helps us better understand the perspectives of our users. And so we try to make time to really delve in on a modeling problem as often as we can. So recently, my colleague Max Kuhn and I spent some time to develop a model and work through the process to deploy it using Posit Workbench and Posit Connect and wrote about that on the Posit blog.

In this blog post, we talked through a process that's really similar to the one I'll talk through with you today, where we explore data, we situate ourselves within our modeling problem, we develop a model to best understand that data, and then we deploy it using Connect.

Understanding and cleaning data

So the process of machine learning with Posit Team, again, I'll just underscore those three different sections before we dive in on them. First, we'll spend some time to understand and clean data with the tidyverse. And then we'll train and evaluate models with tidymodels and Workbench. For both of these sections, I'll just work through some slides. And if you're interested in the details, you can look at the source code of the analysis and the source code for these slides at github.com slash simonpcouch slash mutagen.

Once we've done so, we need to get that model into the hands of users. And that's where Vetiver and Connect come in. So let's start off with understanding and cleaning data.

In pharmaceuticals, mutagenicity refers to a drug's tendency to increase the rate of mutations due to damage caused to DNA. Mutagenicity is a key indicator that a drug may be a carcinogen, or cancer-causing. Mutagenicity can be evaluated with a lab test: a person can test whether a drug is a mutagen using a standard wet lab procedure. But those tests are costly and time-intensive.

So a group of scientists in our example problem converges to ask the question, What if a machine learning model could predict mutagenicity using known drug information? That is, if scientists are spending time with a new drug, using information that they already have, they can predict whether a drug is a mutagen or not, which will allow them to iterate more quickly when they're developing new drugs.

The source data that we'll be working with is the mutagen data table. It's a table with over 4,000 rows and, notably, over 1,500 predictors. In the first column, we have whether the drug is a mutagen or whether it's a non-mutagen. And for each of those compounds, we have a whole lot of predictors giving us other information we can use to predict whether a drug is a mutagen or not.

So you're probably thinking to yourself, if I have 1,500 little pieces of information about this drug, it surely must be pretty straightforward to predict that. It turns out it's not quite that simple. On this plot, I've plotted two of the predictors, and each of those points is colored according to whether the drug is actually a mutagen or whether it's a non-mutagen. Points in blue are mutagens, points in yellow are non-mutagens. On the x-axis, we have the molecular weight, and on the y-axis is the partition coefficient. It doesn't matter whether you understand either of these two predictors. I really don't understand them. But the point here is that it's really hard, only knowing the value of these two predictors, to separate the blue points from the yellow points.

I can choose another two of them. Right now I'm showing the number of heavy atoms and the log of the average valence connectivity. Whatever those two predictors mean, again we're seeing that it's really hard to draw a line here that would easily separate the blue points and the yellow points. And this is the goal that our machine learning model will try to solve. So if you're like me, you're feeling a little bit intimidated about tackling this. This is where we move on to the step of training and evaluating models.

Why use tidymodels?

So for this portion of the presentation, instead of walking through the notebook that does the training and the evaluating, I'll leave it to folks to look through the notebooks that are on that GitHub page after the presentation. And instead, I'll try to answer the question of why you might be interested in using tidymodels over other modeling frameworks that are out there.

The first reason why you should be interested in using tidymodels is that tidymodels is consistent in a way that other modeling frameworks and packages are not. As an example, we can start by comparing the code to fit a linear model using only the lm function with the code to fit the same model through tidymodels. On the left hand side, there's just one line of code. We're using the example mtcars dataset here, if you're familiar with it. We're saying we want to use all of the predictors in the mtcars data to predict the outcome, which is the miles per gallon, or mpg, and to do so, we'll use the mtcars data.

On the right hand side, the code does the same thing, except the tidymodels wrapper has been dropped on top of that function. The code reads, we'll use a linear regression using the lm function as our computational engine and we'll fit that model using the same dataset and the same preprocessing formula, mpg, as a function of everything. When you see this, you might think it seems a whole lot easier to just go ahead and use lm, it's only one line of code, and I hear you. Where we really see the value of tidymodels, though, is when we start to introduce complexity into the modeling pipeline.
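The side-by-side comparison from the slides can be sketched roughly like this, using the built-in mtcars data:

```r
library(parsnip)

# Base R: one line, formula interface
fit_lm <- lm(mpg ~ ., data = mtcars)

# tidymodels: declare a model specification, then fit it
# with the same formula and data
spec <- linear_reg() |> set_engine("lm")
fit_tidy <- fit(spec, mpg ~ ., data = mtcars)
```

The tidymodels version is a few lines longer for this simple case, but the specification step is exactly what makes swapping engines later a one-word change.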

So let's say you're using the lm function and it's not doing everything you need it to do, and you want to add, for example, a regularization term to your linear regression. If you're not familiar with that, it doesn't matter for what I'm about to demonstrate here. I want you to look at line 3 in the tidymodels portion of the code, where I've set the engine to lm. If I want to transition that code to instead use glmnet, all that I need to do is change the portion of the code that said lm to instead say glmnet. And you can see on the left hand side, much more than just one line of code needed to change. Here, glmnet, for its first argument, takes all of the predictors, and they have to be formatted as a matrix. Thankfully, in mtcars, all of the predictors are numbers anyway, so they're one homogeneous type and we can easily convert them with as.matrix. And then, instead of the name of the outcome variable like it was in lm, the outcome here is specified as the actual vector of values in the mtcars dataset: that's mtcars dollar sign mpg.
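A rough sketch of the contrast just described (the penalty value here is illustrative; the glmnet engine expects one):

```r
# Base glmnet: the interface changes entirely —
# predictors must be a numeric matrix, the outcome a vector
library(glmnet)
x <- as.matrix(mtcars[, -1])  # all columns except the outcome, mpg
y <- mtcars$mpg
fit_glmnet <- glmnet(x, y)

# tidymodels: only the engine (plus its penalty argument) changes;
# the formula and data stay exactly the same as with lm
library(parsnip)
spec <- linear_reg(penalty = 0.1) |> set_engine("glmnet")
fit_tidy <- fit(spec, mpg ~ ., data = mtcars)
```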

To transition that code, I needed to know not only that glmnet existed and was able to perform regularization, but I needed to look at the documentation to see how that function wanted the predictors to be formatted, how it wanted the outcome to be formatted, whether it needed any additional arguments, which of the functions in the package I needed to use to do so. And again, in tidymodels, all I needed to do was change the value of that engine argument. And the story goes the same way with any other engine you can imagine.

Let's say instead of glmnet, I wanted to use the h2o package, which gives me access to all sorts of machine learning techniques implemented in Java so it's super fast. With tidymodels, all I need to do is change the value of engine to h2o. But with h2o, I need to initialize an h2o server. I need to convert that mtcars dataset to a dataset that's compatible with h2o using as.h2o. And I assign a label to that mtcars dataset, the string mtcars. And then I pass that string as the last argument to h2o's linear model function, h2o.glm. Also, the h2o.glm function doesn't take the values of the predictors, instead it takes the names. So I take the names of the mtcars dataset, the portion of it that has the predictors in it. And the same goes for the outcome. It takes the name of the outcome in the mtcars dataset as a string.

So all of this change in the code, again, I would need to spend time to read through the documentation, see how that interface changes. And only once I've done so can I make use of the functionality within h2o. And with tidymodels, I just get that for free by changing the engine argument.
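For the h2o case, a minimal sketch, assuming the agua extension package (which registers the "h2o" engine for parsnip and handles the server startup and data conversion behind the scenes):

```r
# Assumes the agua package is installed; it provides h2o support
# for tidymodels so the code below mirrors the lm and glmnet versions
library(agua)
library(parsnip)

h2o_start()  # spins up an h2o server in the background

spec <- linear_reg() |> set_engine("h2o")
fit_h2o <- fit(spec, mpg ~ ., data = mtcars)
```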


Another reason to use tidymodels is safety. tidymodels is designed to safeguard you from common modeling pitfalls. One of the things that tidymodels is designed to protect you from is data leakage. A 2023 review found data leakage to be a widespread failure mode in machine learning-based science. Data leakage is when information from the portion of the data that you're going to use to evaluate your model somehow leaks into the portion that you're using to train it. If a model is evaluated on data that it's seen already, it will perform better than it would on data it hasn't seen before. So your model will seem to be performing really well, but it actually isn't, and you'll only learn that the model isn't performing well once it's deployed.
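The rsample package encodes the guard against leakage described above: you split once, up front, and the test set is never touched while training. A minimal sketch:

```r
library(rsample)

# Split once, before any preprocessing or model fitting;
# the test set stays untouched until final evaluation
set.seed(123)
split <- initial_split(mtcars, prop = 0.8)
train <- training(split)
test  <- testing(split)

# All preprocessing is estimated on `train` only, so no information
# from `test` can leak into the fitted model
```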

So in our example, in the mutagen example, if our model seems to be able to classify drugs as either mutagens or non-mutagens really, really well, if we go ahead and deploy that, thinking that the model is performing perfectly, then scientists will mistakenly interpret the results from that model and take them at face value.

Another common failure mode is overfitting. Overfitting leads to analysts believing models are more performant than they actually are. So this is really similar to that first failure mode, where in the case of an overfit model, in our mutagen example, the model is going to perform really, really well on the portion of the data that was used to train the model. And if we don't have tools to then evaluate the model on data that it hasn't seen before, then that model might just learn every little intricacy of the training data and won't generalize well to data that it hasn't seen before.

A last component of safety that tidymodels protects its users from is irreproducibility. Many popular implementations of the same machine learning algorithm give differing results, making modeling results irreproducible. Each of the citations listed at the bottom here reports, in order, on data leakage, overfitting, and irreproducibility, and tidymodels is designed to safeguard you from all of them. We developed this software so that the things that are wrong but that an analyst might commonly be tempted to do are really hard, and the things that are usually difficult but pay off in the long run are really easy.

Another reason why you might want to use tidymodels is that tidymodels is designed to be communicable. Most tidymodels objects have visualization methods, called autoplot, associated with them. So if I make a receiver operating characteristic (ROC) curve with tidymodels, which is a tool used to evaluate models, I can pass that object to autoplot and see what that curve looks like in 2D space. I can also pass, for example, the results of tuning a random forest algorithm; again, I just pass that result to the autoplot function, and we have built-in visualization procedures for that model object.

The same story goes for iterative search results. Things like Bayesian optimization, simulated annealing, these are advanced machine learning techniques that are built into tidymodels. When I use tidymodels to generate these results, I can understand them by quickly plotting them.
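The autoplot pattern just described looks like this in practice, using the two_class_example data that ships with the yardstick package (a column of true classes plus predicted class probabilities from a fitted model):

```r
library(yardstick)
library(ggplot2)

# Compute an ROC curve from true classes and predicted probabilities
roc <- roc_curve(two_class_example, truth, Class1)

# One call gives a ready-made visualization of the curve
autoplot(roc)
```

The same one-liner works on tuning results, iterative search results, and many other tidymodels objects.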

Another reason to use tidymodels is completeness. There's built-in support for 99 machine learning models once you load the tidymodels packages, and even more if you load extension packages contributed by the community. tidymodels is also complete with respect to data preprocessing techniques. The three main preprocessing packages loaded with tidymodels provide 102 different ways to take data that comes in raw and turn it into something that a model knows what to do with.
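Those preprocessing steps live in the recipes package. As a small sketch of how a few of them chain together (the particular steps and num_comp value are illustrative):

```r
library(recipes)

# A recipe declares preprocessing steps; nothing is computed until prep()
rec <- recipe(mpg ~ ., data = mtcars) |>
  step_normalize(all_numeric_predictors()) |>
  step_pca(all_numeric_predictors(), num_comp = 5)

prepped <- prep(rec)                       # estimate the steps from the data
baked   <- bake(prepped, new_data = NULL)  # apply them to the training data
```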

I've hinted at this already, but again, tidymodels is extensible. If you can't find the machine learning technique you need, which is either the machine learning model or the preprocessing technique, you can add it yourself. We have APIs that are designed to make it really easy to contribute your own extensions that are designed with your problem context in mind.

The last reason I'll list here is deployability, and this will bring us into that final step three that we'll drop into the live demo for. tidymodels is tightly integrated with Vetiver, which is a free and open-source package for model deployment. And Vetiver is configured to quickly and easily and securely ship a model to Posit Connect, which is part of the suite of products in Posit Team.

Live demo: deploying a model with Vetiver and Posit Connect

So with that said, let's go ahead and drop into a live demo. I'll spin up a workbench session, and we'll use that session to train a model, and then send it off to Posit Connect. Once I've logged in, this is what my Posit Workbench instance looks like. I'm going to start out by spinning up a new session, and this is what I was talking about where I can get access to resources beyond what my laptop is capable of in terms of computing power. So I have access to different types of clusters, and then for that cluster, different options in terms of the number of CPUs, and the amount of RAM. So I'm going to go ahead and start this session as an RStudio Pro project.

This will just take a couple of seconds to spin up, and once it's wrapped up, it will look a lot like an instance of the RStudio IDE if you've ever worked with one of those locally on your laptop.

So from here, I'm at the point where I have some code that I want to train. I have some code that I want to use to train a model, and this code is hosted on this GitHub repository that I have here, simonpcouch slash mutagen. This repository has all the source code for training the model, as well as the source code for the slides if you're interested in it. I'm going to go ahead and copy the link to this GitHub repository, and then I will start a new project, and the new project will be from version control. I'll start it up from Git, and paste in the URL there, and I'll call this project mutagen. This will also just take a couple seconds to spin up, and once it's done so, all of the data and the source files that were inside of that GitHub repository are going to be also available to me inside of Posit Workbench.

So for instance, there's a readme in this repository giving some high-level information on what's in there. Here's all the source code for the slides in this slide folder. What I'm going to open up is a source here, which has a Quarto document that has all of the source code that I used to pre-process my data and start understanding it in step one, and then moving into that step two, which is developing and training the model, and ultimately deciding on the final fit. So you can see up here in my environment, I went ahead and loaded in the output of final fit from file. So normally in my workflow, I would go ahead and train the model here on Workbench, and since I've done so already, I already have the output of that from here, and instead we're going to move right on to that last step, step three, which is getting that trained model into the hands of the people that are going to use it.

This is actually really simple. There are only a few lines of code you need, once you have a fitted model with tidymodels, to deploy that model to Posit Connect. The first step is to pass that trained model to vetiver_model(), a function from vetiver that puts some wrappers around it and makes sure it's ready to deploy. I'll also call board_connect() here and assign that output to board. If you're unfamiliar with this function, it's from the pins package, and if you want to learn more about pins, I think it was the last Posit Team demo where they covered that functionality.

We're running into an error here, and what this is telling me is that we don't have a Posit Connect server that we're connected to already, so we can actually go ahead and do that, and it's pretty simple. If I go to Tools, Global Options, Publishing, these are the accounts from which I can publish from. I'm going to go ahead and just click Connect here to Posit Connect, and this is pre-filled for me because I'm working on my Workbench instance inside of a Posit Team account. That's pre-filled in there. I'll click Next. It's going to wait for authentication, and it will authenticate to my Posit account, and now that it's authenticated, I can click Connect Account and Apply, and so that's all I needed to do to link my Workbench session up to Posit Connect.

So I can try running that board_connect() code again, and it was successful. Nice. So I'm on the Connect instance that I have access to through Posit. I'll then pass the vetiver object wrapping that final fit to vetiver_pin_write() to write it to the board I've just created, and now that final fit vetiver object is pinned to the board on Posit Connect. Once that pin is on Connect, I'll go ahead and deploy the model that I have pinned under my username, Simon Couch slash mutagen, and you'll see down in the console that vetiver is doing all sorts of work to bundle up everything that the model needs to create predictions from new data into an object that can be passed off to Connect.
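The full deployment described here really is just a few lines. A sketch, assuming final_fit is the trained tidymodels workflow from the notebook (the pin name and username are illustrative):

```r
library(vetiver)
library(pins)

# Wrap the fitted tidymodels workflow so vetiver knows how to serve it
v <- vetiver_model(final_fit, "mutagen")

# Connect to Posit Connect as a pins board; credentials come from the
# account linked in the IDE, as shown in the demo
board <- board_connect()

# Version the model as a pin, then deploy it as an API on Connect
vetiver_pin_write(board, v)
vetiver_deploy_rsconnect(board, "user.name/mutagen")  # pin name is illustrative
```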

So that happened pretty quickly. It just brought us over to a Connect window where vetiver is showing the deployment. So on the left side of the screen is what a user would see if they pulled up this model API, and so there's all sorts of metadata about the model. There's examples from which analysts can ping, put together code to ping the model for new predictions. And then on the right side, and this is a really nice benefit for a lot of people, there's all sorts of access to security features and metadata. So by default, it's only available to specific users or groups, and you can add those in, but you can also make it available to all users within your organization. And then there's all sorts of other metadata that you have access to as well.

So I hadn't deployed a model using vetiver until I was working on the blog post that I mentioned earlier, and I was really surprised at how straightforwardly it came together. And so I would encourage you if you are working in an instance of Workbench in your organization to go ahead and try to do that same thing, deploy the model that is publicly available on my GitHub and see what happens.


Wrapping up and resources

So with the third step of deploying our model wrapped up, we've really touched on all portions of this diagram at this point. At first, we took a look at our data, we situated ourselves with the modeling problem, we visualized some of the variables and got a sense of the challenge that we were up for. We then moved on to training and evaluating a model using tidymodels. In this demo, we really just talked about why using tidymodels would be advantageous to your modeling workflow. But if you're more interested in how we actually went about doing that, I would encourage you to pull down that script I had up during the demo, run some of the code and see how it feels for you. And I'll show you some resources after this where you can learn more about how to use tidymodels. After that, we went on to deploy the model to Posit Connect using Vetiver. And as you saw, that really only took a couple lines of code. So we very quickly went from a final trained model to a model that's in the hands of users at our organization.

If you're interested in learning more about the tidyverse, and how to understand, clean, and visualize data, the R for Data Science book at r4ds.hadley.nz is definitely a great place to start. And if you feel comfortable with those foundations and want to move on to the modeling portion, the Tidy Modeling with R book at tmwr.org is a great resource and can bring you from a beginner to a strong modeler very quickly. If you want to learn more about Posit Team, I would head to posit.co slash team. There's all sorts of information on the products that are included as part of Posit Team and links for places to learn more. And lastly, if you want to see these slides again, or poke around with the source code that's underneath them, or with the source code to train a model using tidymodels, again, that code is at github.com slash simonpcouch slash mutagen. I really appreciate the invitation to drop into the Posit Team demo. Thank you for having me, and I'm looking forward to questions.

Thanks so much, Simon. Really appreciate you joining us here today for the demo and for sticking around for the Q&A. Just wanted to let everybody know that this YouTube will push you over to the live Q&A in just a second here. If you want to ask questions, you can post in the YouTube chat over there, but you can also use the Slido link that we have up on the screen, pos.it slash demo dash questions, to ask questions anonymously.

But one thing that I wanted to add is that we have these monthly end-to-end workflow demos on the last Wednesday of every month. So if you joined us here today for your first one, we'd love to have you join us again. And so you can add those to your calendar with just the short link pos.it slash team dash demo. And that add to calendar link will point you to all the recordings as well. But we'll see you over there in the Q&A in just a second.