
Julia Silge & Max Kuhn | Good Practices for Applied Machine Learning | Posit (2022)
The tidymodels framework is a collection of R packages for modeling and machine learning using tidyverse principles. Whether you are just starting out today or have years of experience with ML, tidymodels offers a consistent, flexible framework for your work. In this talk, learn how tidymodels has been designed to promote ergonomic, effective, and safe modeling practice. We will discuss how to think about the steps of building a model from beginning to end, how to fluently use different modeling and feature engineering approaches, how to avoid common pitfalls of modeling like overfitting and data leakage, and how to version and deploy reliable models trained in R. https://www.tidymodels.org/ Session: Keynote
Transcript
This transcript was generated automatically and may contain errors.
Thank you so much for that introduction, Hadley. It is such a pleasure to be here. Max and I are really excited to be talking to you today about machine learning and software for machine learning. The topic of this talk is not so much the math of statistical modeling or new methods for machine learning. What we want to talk to you about here today is applied machine learning: how you as a modeling practitioner get on that road from model development through to model deployment.
As we're here together talking about this road that we are on, we think this trip down the road is going to be better if we all go together. So we would encourage you to have Slido up on your phone with the event code R keynote. And we're going to be using some polls during this talk for us in this room and people who are watching from afar to be able to share their own experiences, opinions, and perspectives.
Introducing tidymodels
So what we're going to be talking about today is the software that we build for machine learning, and that is tidymodels. So you can think of tidymodels as a meta package that helps you install and manage other packages that are used for modeling and machine learning in R. These packages work together, but each have a specific focus. It works much the same way that the tidyverse does. So if you have ever typed library tidyverse and then used ggplot2 for data visualization, dplyr for data manipulation, you can think about tidymodels as working in the same way.
You type library(tidymodels), and then you have access to these different packages that each have a focused purpose. So yardstick, for example, is a package that is used to measure model performance. tune, perhaps unsurprisingly, is a package for hyperparameter tuning. So this is a great way to think about tidymodels: it's an R package that gives you access to all these kinds of functions that work together.
It's also right to be thinking about tidymodels as a unified framework for modeling in R. The ecosystem for modeling in R is one of our big strengths; it is a reason why people choose to use R. At the same time, it really presents a challenge, because the interfaces to all these different kinds of models are very heterogeneous. If you have set up a model analysis using one kind of model, moving over to a different type of model often involves starting over from scratch with your whole approach. tidymodels provides a unified interface to a vast array of modeling approaches and feature engineering approaches, and encourages good statistical practice via its design.
Now, some of you are sitting out there, and you're thinking, wait a minute. R already has several of these. In fact, that person standing up there next to Julia talking right now is one of the people who has built one already. So, Max, what's going on? Why yet another modeling framework for R?
Why would Max do this, right? So I wrote a package a while back called caret, and it's another similar modeling framework. And it was in 2005, which, as many of you know if you developed packages back then, was a different time. There were no namespaces in R, for example, and it was lacking a lot of the things we have now. I was also kind of doing it in my spare time, and it wasn't really engineered in a way that we could port it over to new and better interfaces and extend it really well. So, you know, learning more as I've gone on about software engineering, I think if I'd known those things back then, I would have done some things differently under the hood.
So it's not really something that is extendable at this point. But maybe more importantly, it's not a bad interface, but it's a very 2005 interface, where you have one function that does just about everything and has like 30 options to it. If we were thinking about doing more and more with caret, it would be really difficult, because, you know, if we add more types of resampling or more preprocessors or this or that, then you have a function that has maybe 100 or so options, which is not a good design. And so since 2005, I think we've not only learned a lot more about how to write a good R package, but we've learned a lot more about how people interact with our software.
And then, you know, as time goes on, you see things that are happening in ggplot2, even just plyr and then dplyr and things like that. So I think we've learned, as a community, a lot more about what are good ways of writing interfaces to packages. And as I started to think about that in the context of modeling, it seemed like it would really make sense to have more of a tidy, modern interface to modeling. And so that's really the reason to build a whole new stack from the bottom up: what was old wasn't engineered well enough to extend, and its whole view of the world was not really very modern.
The Spotify dataset and the reality of modeling scripts
So a lot of times when we want to introduce software and talk about, let's say, our philosophy and why we do things, we want to start off with a data set that we can use to illustrate things. So here's a data set some of you have seen. It's Spotify data. Each row in this data frame is a track or a song. We might want to do some modeling on this. This was in a SLICED competition. The outcome is the popularity column right there. That's numeric. So we want to try to predict that with the other columns that are in this data set.
There are other interesting features. There's an artist feature, and as you might imagine, there are quite a large number of artists in the Spotify catalog, so we might think about ways we could deal with that when we go to do modeling. There's a date field, and we might not want to put the date directly into a model. We might be better off deriving features from it, like an indicator for month or year or any seasonality or any sort of date-based features, instead of using the original date value. And then we have this monster of a column over here, this character column. It's really a multiple-choice type field in a delimited format. And since we think genres are probably important for predicting popularity, we'd have to deal with this data somehow.
And so there's this idea that we have of what our modeling script is going to look like. And, you know, when we read books and we see websites and blogs and things like that, I think we're always led to believe that there's this perfect data set that we're handed by our associates. And there's this single call to a modeling function. It's really clean and nice. And then you're going to make predictions, and it's just a single line, and you're fine. It's no big deal. But the reality is more like this. It's like a Franken-script.
And a lot of this is about the data pre-processing. So, you know, at the top there, we might use something like a technique called effect encoding to handle the large number of artists. That requires some statistical estimation of the effect of each artist. You might do things like convert your dates; all of us have lubridate commands, right? That use wday() and all that to make date features that we then put in our model. There's code there to parse the multiple choice field. And then at some point we have way too many features, so we might think about doing some feature selection. And so there's this big script that we have to run before we get to that actual one line of modeling.
So this is the reality of things. And it's not great. I mean, if you do this for a living, it's something you have to deal with every time you get a new data set. Now, we'll come back to that slide, and we'll also come back to this slide a couple times. The point we want to make with this slide is that even before we get to the model, we're doing a lot of estimation and a lot of quasi-modeling work. There are things here that we're estimating from a statistical standpoint. And also, when we go to predict new data, we have to apply all those same transformations to data we might get six months from now to make a prediction.
Ergonomic modeling with recipes
I really love this visualization of what you think your script is going to look like versus how it actually ends up turning out. It aligns so much with my practical experience. And that's why we want to talk to you here about why tidymodels is designed the way it is and how it can support more ergonomic approaches to modeling. By ergonomic, what we mean here is really about syntax: about how expressive you can be in the code that you write, how fluent you are, and how the syntax that you write helps you get your work done.
We also want to talk about how tidymodels will make you more effective. What we mean by effective here is that it gives you, as a practical modeling practitioner, access to a wide swath of different approaches, that it enables you to do many different kinds of things. And last, we want to talk to you about how tidymodels keeps you safe. The process of machine learning has many pitfalls, or potholes in the road on the way, if you will, and tidymodels is designed to help you avoid those pitfalls, those potholes.
So a lot of times I think about physical ergonomics. When I worked in diagnostics, clinicians would come in and interact with the diagnostic machines we had, and we'd watch them as they did their work to see what things they did over and over again or might get frustrated with. And there's a lot of that in the way we design the syntax for our functions. But in machine learning in particular, there's an extra layer of things that we have to think about, because there's a high amount of cognitive load for machine learning.
So when you go to do a model for the first time, or if you're, let's say, taking a workshop to learn about machine learning, before you can even get to your first model fit, there's all this stuff that we have to throw at you. We have to talk about overfitting, and data splitting, and training and test sets. And there's a lot of jargon and a lot of things. It's like, you do this, and then this, and then this, and then a day or two later, maybe you're fitting your first model.
And so what we want to do is we want to simplify this process, maybe in ways you don't even see in terms of the syntax. So we want to take this complexity and reduce it down a little bit. We don't want to stop you from doing really sophisticated things and having access to the functions and the nitty gritty bits. But we do want to give you really good APIs that, for like 80% of the work, the syntax should be pretty high level and consumable by just about anybody.
So one aspect of machine learning, and really of complicated or sophisticated statistical models, is that we have to think about what our data are and how we're going to use them. So we're going to do our first Slido poll. Go ahead to slido.com if you haven't already; the event code is R keynote. And what we want to talk about is the types of roles variables have when you're building models. This is multiple choice. A and B are the things that we probably think about normally, predictors and outcomes. But you know, there are more sophisticated methods you might use, like with case weights, like frequency weights or importance weights or survey weights.
So let's look at recipes for a minute. We'll use that as an example. Recipes are kind of a combination of the R formula method that you see in modeling functions, combined with a dplyr sort of approach, where you're basically piping in operations that you want to apply to your data before you give it to the model. So in the first line there, we load the tidymodels package, which again loads the core tidymodels packages plus some tidyverse packages. And when you start modeling, one of the things we typically do is an initial split of our data into testing and training sets.
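That first step might look like the following sketch; the `spotify` data frame name is an assumption for illustration:

```r
library(tidymodels)

# Hold out a test set before any modeling; prop = 0.75 is the default 3:1 split
set.seed(123)
spotify_split <- initial_split(spotify, prop = 0.75)
spotify_train <- training(spotify_split)
spotify_test  <- testing(spotify_split)
```

Setting the seed first makes the split reproducible, and the test set stays untouched until the very end of the analysis.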
So in that first line, what we're doing is we're starting a recipe, and there's a formula method there. It says popularity ~ . The left-hand side says popularity is an outcome; that's its role in the analysis. And the dot means everything else in that data frame besides popularity should, by default, be considered a predictor. Now, you can have a lot of different roles in recipes; you can make up roles for whatever you want. But right now, at this point, all the recipe does is catalog the data: what's in the data frame, is it numeric, is it a date, is it categorical, that sort of thing.
So what we can do then, as we want to do operations on our data to prepare it for modeling, is pipe in new steps, new functions that represent data processing steps in our recipe. So let's say we're going to use a model like a neural network or a nearest neighbor model, where we have to make sure that our predictors are all in the same units. One common way of doing that is centering and scaling your data, and the step function we have for that is called step_normalize(). And after step_normalize(), you can use any sort of dplyr selectors, like you would in the dplyr select() function, to say which variables should be affected by this.
But we know from the modeling standpoint, when we make choices about these things, we're thinking about what type of role the variable has and what its type is. You wouldn't go and try to center and scale categorical data. So what we've done is we've given recipes a lot of extra dplyr-style selectors to let you choose variables in the model-related context in which you want to use them. So you see that all_numeric_predictors() there, and it's doing what you think: it's going to select anything at that point in the recipe that is numeric and has a role of predictor.
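Put together, the recipe described so far might look like this; the training data frame name `spotify_train` is an assumption:

```r
library(tidymodels)

spotify_rec <-
  recipe(popularity ~ ., data = spotify_train) %>%  # popularity is the outcome; everything else defaults to predictor
  step_normalize(all_numeric_predictors())          # center and scale only numeric predictors
```

The selector means the step never touches categorical columns or the outcome, no matter how the data frame changes.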
So when we go back to our, what I keep calling the Franken-script, you know, we have all these things that we wanted to do to our data before we give it to the model. And it's, like I said, a big mess of scripts. Parts of these scripts have probably never been unit tested and things like that. So we can convert that into a recipe. In that yellow box, you see a series of steps, and they do all the things that we talked about doing. You might use step_date() to take the date field that's in there and convert it to new features like month and year and things like that.
So all these things you can put into a single object. You can save that object, you can carry it around. It's not in a bunch of scripts; it's been unit tested and has a lot of features in it. So we want to simplify that part of the process. Then in the last three lines of the yellow block, we create what's called a workflow object. That's for the situation where you want to take a complicated preprocessor like a recipe and bind it together with whatever type of model you want to use. In this case, we're going to use a machine learning model called Cubist. It's a very sophisticated rule-based model. So in that workflow call, I'm binding together my data processing recipe, and I'm saying we're going to use this with Cubist. And then there's a simple fit function we use after that, and that does all the preprocessing, all the estimation that happens during preprocessing for this recipe, hands the data off to Cubist, which fits the model, and it's all sitting in one object that you can save or deploy, spoiler alert, and do things with.
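A sketch of that workflow pattern, assuming a recipe `spotify_rec` and data frames `spotify_train` and `spotify_test` along the lines discussed (the rules extension package supplies the Cubist model for parsnip):

```r
library(tidymodels)
library(rules)  # extension package providing cubist_rules() for parsnip

spotify_wflow <-
  workflow() %>%
  add_recipe(spotify_rec) %>%                         # the preprocessing recipe
  add_model(cubist_rules() %>% set_engine("Cubist"))  # the rule-based Cubist model

# fit() estimates the recipe on the training data, then fits Cubist;
# the result is one object you can save or deploy
spotify_fit <- fit(spotify_wflow, data = spotify_train)
predict(spotify_fit, new_data = spotify_test)
```

Because the recipe travels with the model, predict() automatically applies the same preprocessing to new data.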
Model deployment with vetiver
In my experience, and in talking with folks about their modeling, one of the parts of the whole modeling analysis that has been the least ergonomic, that has involved the most pain, is the process of getting your model off your machine. You know, you would spend time training, tuning, developing a beautiful, accurate, appropriate model, and then when it's time to deploy that model, to put that model into production, there's been a real lack of tools for people who are excellent model developers to take that model and get it to where it needs to go.
Maybe some of you are sitting here and you're like, what does that even mean, the word production? That's a great question, because there isn't some industry standard definition for what production means. I like to use the definition or the idea that putting a model into production is taking your trained model, getting it out of the computational environment where you trained it, and putting it somewhere where it is useful to a wider group of people. So by that definition, taking a model and putting it in a shiny app and letting people interact to get predictions, we can say that's production.
I've been working on a new framework for model deployment and other MLOps tasks called vetiver. So if you are into perfumery or fancy candles like I am, you may have seen this word, this word vetiver. So vetiver is a stabilizing ingredient in perfume. It takes the more volatile fragrances and it stabilizes them. So in this metaphor, your model is this more volatile thing. You have used different hyperparameters, you have used different data, and vetiver helps you stabilize it so you can version it, deploy it, and monitor it.
Let's walk through that briefly. Any modeling process starts with you collecting data, having access to some kind of data. The first thing you need to do is understand and clean it; you need to engage in the process of exploratory data analysis. There are great tools in Python and R for you to approach those tasks. Next, you want to train your model. You want to train, tune, evaluate, and again, there are amazing open source tools, like what we are talking about, to be able to approach those kinds of tasks. So far, so good. But that becomes less true as we move around this cycle.
Once your model is trained, you need to have a reliable scheme to version that model so that you can know which version of the model was used over which time with which data. Once your model is versioned, you need to deploy it. And after it's deployed, your job is not done. You need to also monitor your model. You need to regularly measure the statistical characteristics of how your model is performing so that you can make adjustments, decide to retrain, head down that path. So the tasks on the right side of this diagram are ones that there are great open source tools for. The tasks on the left side are ones where that has not been true, and this is where vetiver sits. vetiver is a framework for versioning, deploying, and monitoring your model.
Now, you can probably guess from this slide that vetiver is not just for tidymodels. It supports many kinds of models in R, and vetiver is not just for R. vetiver was designed from the beginning to have feature parity between R and Python. You can use the tools that you want to use to understand and clean your data, to train your model, and then vetiver can support these MLOps tasks for both Python and R.
While you're here, though, I'll just show you what this looks like in R a little bit. So, let's say I trained that model that Max talked about, and now it's time to deploy it. The first step is to create a deployable model object. This collects all the information that we need to be able to move this model from where I trained it over to a new computational environment where I can make predictions. This information includes the types and number and names of the predictors, the original predictors. This includes also the packages and package versions, the specific software that is needed to be installed over in that other place to be able to make predictions.
Once we have that deployable model object, the process of spinning up an API is quite ergonomic. This was our goal here. So, on the R side, we used the plumber package, and once you have a plumber router initialized, it is literally just one line of code to spin up your model-aware API that's ready to serve predictions at an endpoint. So, this is what we mean by ergonomic. We want to let you be expressive, to let you be fluent, to let you get your job done in this way.
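In code, those two steps might look something like this; `spotify_fit` (a fitted workflow) and the model name are assumptions for illustration:

```r
library(vetiver)
library(plumber)

# Create the deployable model object: it bundles the trained model with its
# input prototype (predictor names/types) and required package versions
v <- vetiver_model(spotify_fit, model_name = "spotify-popularity")

# With a plumber router initialized, one line turns it into a
# model-aware API that serves predictions at an endpoint
pr() %>%
  vetiver_api(v) %>%
  pr_run(port = 8080)
```

Visiting the running API's documentation page shows the prediction endpoint that vetiver set up.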
Being effective: tuning and advanced methods
Now, let's move on to our second point, which is how tidymodels makes you more effective. Max, when we think about being effective, what are some of the things that come to mind for you? So, you know, back when I used to do modeling for a living, I feel like I was most effective when somebody brought some data to me, like de novo, like a new project, and I was able to give them back a predictive model that really met or exceeded what they were interested in or what they needed for the problem, and not take, let's say, months to be able to do that.
A big part of building a machine learning model is model tuning, which we mentioned earlier, where you have these hyperparameters that you can't directly estimate from the data and you need to find a way to tune them. So, another poll here, have you ever tuned a model? So, for example, if you have a neural network and you want to figure out how many hidden units or hidden layers you should have, you need to, you know, do some sort of methodology to figure out what good values are for those tuning parameters.
So, just as an example of some techniques to talk about effectiveness, you know, there's a lot of things that aren't, like, bleeding edge, but they're relatively cutting edge and might be coming from different parts of literature. Some things we include in tidymodels are from the deep learning literature. There's actually some interesting things we're doing from the computational chemistry literature that we're using for other types of data or any type of data. Some examples of that would be feature embedding methods, and these are just, like, really fancy machine learning methods for dimensionality reduction. So, UMAP and ISOMAP, like multi-dimensional scaling, are pretty powerful and interesting methods, and we'd like you to be able to use them for any data and use them with any model.
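As one example, the embed extension package makes an embedding method like UMAP available as an ordinary recipe step, usable with any model; `spotify_train` is an assumed data frame here:

```r
library(tidymodels)
library(embed)  # extension package with embedding steps such as step_umap()

umap_rec <-
  recipe(popularity ~ ., data = spotify_train) %>%
  step_normalize(all_numeric_predictors()) %>%       # UMAP benefits from scaled inputs
  step_umap(all_numeric_predictors(), num_comp = 5)  # project predictors to 5 UMAP components
```

The downstream model never needs to know the features came from an embedding; it just sees five numeric predictors.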
Bayesian optimization is a tool that's, you know, pretty common in the deep learning or neural network side, but you can use that for anything. There are a few packages in R that do that, but they're not really well integrated overall with all the types of models. We provide you tuning methods for Bayesian optimization that you could use for your boosted tree or for a support vector machine or whatever you're going to do. And then, racing is an example we'll look at in a little bit more detail. It's a way to do, like, really, really efficient grid search.
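A sketch of what that integrated Bayesian optimization looks like, assuming a recipe `spotify_rec` and training data `spotify_train` like those discussed:

```r
library(tidymodels)

# A support vector machine with two hyperparameters marked for tuning
svm_spec <- svm_rbf(cost = tune(), rbf_sigma = tune()) %>%
  set_mode("regression")

svm_wflow <- workflow() %>%
  add_recipe(spotify_rec) %>%
  add_model(svm_spec)

# Bayesian optimization proposes new candidate values based on results so far
set.seed(234)
svm_res <- tune_bayes(
  svm_wflow,
  resamples = vfold_cv(spotify_train, v = 10),
  iter = 25
)
show_best(svm_res, metric = "rmse")
```

Swapping svm_rbf() for boost_tree() or any other parsnip model leaves the rest of this code unchanged, which is the point of the unified interface.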
So, in grid search, you might come up with a number of different candidate values of your tuning parameters. You might say, let's try seven hidden units and 12 hidden units and two and so on. But you predefine them. And the problem with grid search sometimes is you don't know whether some of those choices you made about the candidate parameters are any good until you're done with all the computations. What racing does is make that dynamic: as you start to do the model tuning, it looks at the results as they happen, looks at some tuning parameter combinations, and says, oh, those are never going to be the best.
So, in this little animation we have here, the y-axis is 50 different model configurations for a machine learning model, and the x-axis is a measurement of performance. The blue dot, if you can see it there near the top, is the current winner during each resample that we're doing for our grid search. And as you can see, a lot of these candidate values got grayed out and were eliminated on the next step. So, every time it does a resample, it's doing some analysis to figure out what's good and what's bad, and it gets rid of the bad stuff. Really quickly, after eight or ten resamples, you're down to just a handful of models out of 50. In this particular case, of all the models you could have fit, you actually only ended up fitting about 7% of them. And if you're working with parallel processing, that's almost a three-fold increase in efficiency. So it took you about a third of the time it would have taken with regular grid search.
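The racing setup might be sketched like this; `spotify_rec` and `spotify_train` are assumed objects, and the finetune extension package supplies the racing methods:

```r
library(tidymodels)
library(finetune)  # extension package that provides racing methods

# A tunable boosted tree, combined with the recipe from the discussion
bt_spec <- boost_tree(trees = tune(), learn_rate = tune()) %>%
  set_mode("regression")

bt_wflow <- workflow() %>%
  add_recipe(spotify_rec) %>%
  add_model(bt_spec)

set.seed(345)
race_res <- tune_race_anova(
  bt_wflow,
  resamples = vfold_cv(spotify_train, v = 10),
  grid = 50                    # 50 candidate configurations, as in the animation
)

show_best(race_res, metric = "rmse")
plot_race(race_res)            # shows which candidates survived each stage
```

tune_race_anova() drops candidates that an interim analysis shows are unlikely to ever be best, so only the promising ones are fit on all ten resamples.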
Extensibility: the tidymodels ecosystem
This racing example is such a great one because it really highlights how tidymodels is extensible. The package that provides that infrastructure is not a core tidymodels package, but an extension package. So, the metaphor that we can think about here together is one of Legos. If you have one beautiful Lego block here, we can admire it, look how well it's designed. But the reason why Legos are fun to play with is that you can put them together, and you can put them together in the way that meets your needs or fulfills your vision or builds what it is that you want to build. And that is how tidymodels works.
But if we ask the question, what else is tidymodels? There's another whole group of packages that are for more specialized tasks. I'm just going to run through the ones on the top there briefly. textrecipes is a feature engineering extension package for text preprocessing, text feature engineering. This is one of my favorite things to talk about, and tidymodels gives you great tools to make you effective in building text models. The next one there is the censored package, which is for survival analysis. Our coworker Hannah is going to be giving a talk on that in the next session, if you particularly work with survival analysis. The next one, stacks, is interesting because it builds on top of the whole tidymodels framework to take multiple different kinds of models and stack them together, or ensemble them, so that you can squeeze that last bit of performance out of the data that you have.
The last one up there is modeltime. And I want to highlight this one because it's a package that's not built by us who work on the tidymodels team. It's actually built by a member of our community, Matt Dancho, who is here somewhere today. And it highlights how tidymodels is extensible, not just by us, but also by you. This is true whether you're interested in building some open source software for a new kind of model or a new kind of resampling. It's also true within your own organization: you can build, for example, a custom metric based on your company's KPIs and use that to optimize the models that you are training.
Now, when we start talking about all these packages, one of the things that we start to hear from people is a bit of discomfort with the fact that there are so many packages. How do I know which one does what? How do I find them? And we want to acknowledge that, yeah, there is this little bit of a learning process. And it's important to have tools in your tool belt to be able to identify what functions come from what packages. However, we really want to emphasize that this modularity makes all of our lives better. It makes our lives better as maintainers of the packages, and it makes your lives better as users of the package. Smaller, more modular packages can be released more frequently with smaller changes. We can more quickly fix bugs.
And this is, I would say, most highlighted when we are talking about model deployment. Let's say we trained a model using that racing method that Max shared with us. We don't need any of that racing infrastructure when we go to deploy our model. In fact, we don't need the tuning infrastructure at all. We need only the subset of the modeling software that is required to make a prediction. Our packages being modular allows you to make smaller Docker images, have faster installation times, and have more scalable models in production.
Practicing safe machine learning
So, so far, we've talked about how we can have a more ergonomic modeling process. We have talked about how we can make you more effective in the models that you build. And last, we want to talk about being safe, practicing safe machine learning. So, Max, how does tidymodels keep us safe? So, the whole idea of safety in modeling, you might be like, well, what do you mean by safety or being safe? It turns out with, like, you know, complex machine learning models and especially complex data, it's possible for you to do something horribly wrong and not really know it until a really inopportune time.
And I've made this confession several times at different conferences. But, you know, in my first job, I had a project that was, like, three quarters of the total R&D budget. We were doing a bunch of sophisticated machine learning with a large number of predictors, and we thought we'd gotten it to a good point. And my boss came by and he's like, how does it look? And I'm like, accuracy's, like, 90%. He's like, great. And then we got more samples in, and we missed them all. And we did a lot of soul searching in the days after that and figured out some methodology mistakes that we'd made, mistakes that others make quite often, to be honest with you. And once we figured that out and fixed it, we had more genuine estimates of performance.
So, there are times in modeling, especially if you're doing something complex, when you might accidentally fool yourself into believing you have something that's really good, but you have something that's actually not as good, and you won't figure that out until you get new data. And here's one of the reasons. Going back to this part of our previous slides: we have all the modeling parts, and it's pretty well understood how to validate and assess those. But very often, before we get to the model, we might do something, either simple or really complex, to the data that is not a deterministic data cleaning operation. We might be doing some statistical estimation, right?
And so, it's really, really important, in some cases, that we handle that in the right way to make sure that our statistics that tell us how well our model is doing take into account the good and bad parts that might happen in that part of the system. So, let's think about, like, our Spotify script. We have all these genres. Maybe we got rid of the redundancy, but maybe we want to refine that feature set a little bit more before we give it to our model. And let's say we want to do some, like, fancy supervised feature selection routine. So, we might want to pick, like, the top 10 variables to give to the model and pick those 10 because they're the most influential for the outcome, like, for popularity, in this case.
So, let's say, in our Spotify data, we're going to do a tenfold cross-validation to validate our model. And we're going to use feature selection. So, actually, how many times do you think we're going to run that feature selection? Are we going to run it once? Are we going to run it 10 times? We could run it 11 times. Or we could run it some fractional number of times because it was taking so long that you just, like, mashed the escape key until it stopped after three iterations.
So, you know, here's the situation we're in: we have some data. We have a list of predictors. We want to enact some sort of feature selection routine to filter out some of those predictors before we give them to our model. Let's just say, for kicks, we're doing ordinary least squares with lm(). And so, that produces some sort of fitted model. And that red box there is, basically, the box that says, what am I estimating? What should I validate here?
And the way we have the diagram here treats the feature selection bit as if it were sort of outside the model. So, if I would ask most people, like, where are we modeling in this diagram, they would probably do what we show and just circle the actual technical model part of the process. But in actuality, what you have to do is you have to do this. And in this previous slide, this is what I did when I made that horrible mistake in my job, is I just selected the features once, put them into the model, and there's a big sort of, like, circular argument methodology-wise in doing that. And it gave me really bad results.
What we want to do is this: for every one of those ten cross-validation folds, you do the feature selection over and over again. And that may seem excessive, or it may take a long time. But at the same time, you can't measure the effect of something that's not within your validation system. If it's outside of it, it might give you bad answers. So, the answer is probably B, which is 10. But let's say this model was really, really good, and you liked it, and compared to all the other things that you did, this is the one you want to take to production. You want to put it in vetiver and deploy it. Well, actually, what you would end up doing is running it an 11th time. Because to build the final model, you have to do the feature selection on the entire dataset, hand that data over to lm(), and run lm() on the entire dataset.
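That counting can be sketched in code, again using `mtcars` as an illustrative stand-in for the Spotify data and a plain recipe in place of the feature selection step:

```r
# Sketch of the 10-versus-11 counting: resampling re-runs the whole
# preprocessing + model pipeline once per fold, and the final fit on the
# full training set is one more run on top of that.
library(tidymodels)

split <- initial_split(mtcars)
folds <- vfold_cv(training(split), v = 10)

wflow <- workflow(recipe(mpg ~ ., data = training(split)), linear_reg())

# fit_resamples() runs the pipeline 10 times, once per fold...
res <- fit_resamples(wflow, resamples = folds)

# ...and last_fit() runs it an 11th time, on the entire training set,
# which is the version of the model you would actually deploy.
final <- last_fit(wflow, split)

collect_metrics(res)
```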
So, really, the answer there is either B or C. But it's highlighting that there are some pitfalls and gotchas, and it's really easy for somebody like me, who studied this in graduate school and did it for a living for a while, to make these mistakes. And unfortunately, it happens quite a bit. There are quite a few papers that talk about this and measure it, but there's one that just came out. We're bringing it up because their findings were, let's say, not great. So, leakage in this context is data leakage. It's where you're using the wrong data, maybe at the wrong time, to do some sort of calculation, right?
And so, in this paper, what they do is they look at a lot of different publications. And you can tell, based on their method sections, what they were doing. There's a lot of situations where they're doing, like, the bad methodology. And in tidymodels, what we want to do is, whether people know it or not, is we sort of want to, like, silently sort of give you guardrails. So, if you follow the process of tidymodels and use a tidymodels syntax, it's very, very, very difficult to do the wrong thing.
So, just to give you an example of that, here's another recipe. We're going to use a recipe extension package written by somebody in the community called recipeselectors. And that has a recipe step in it that will compute, let's say, a random forest variable importance score across your predictors. And then you can select the top 10 or 15 or 30, however many you think you should, of the most important predictors and give that to, let's say, your linear regression model. But we don't really know how many we should pick. So, if you look at that line there, it has an argument called top_p: how many predictors should I retain? We give that a value of tune(). And that value of tune() marks it in tidymodels as being something that we want to optimize.
And so, then we can take our new recipe that built on our previous Spotify recipe, put that in a workflow with a modeling function that says we're going to use just plain old linear regression. And we could use the tune_grid() function to do a grid search to find a good value for how many of those predictors we should use. And so, this is how you would do the feature selection in tidymodels. Now, the thing is, with the way we've set up resampling and data splitting, and the way we're combining our preprocessor with our model, it's nearly impossible to accidentally do the wrong thing here and inappropriately validate or estimate things from your data. So, without really saying it, we're sort of enforcing that methodology for users.
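The recipe and workflow described above might look roughly like this. This is a hedged sketch: it assumes the community recipeselectors package and its `step_select_vip()` step, and `spotify_train` and the `popularity` outcome are illustrative names, not code from the talk:

```r
# Sketch of a tuned, supervised feature-selection recipe, assuming the
# community `recipeselectors` package; data/column names are illustrative.
library(tidymodels)
library(recipeselectors)

rec <- recipe(popularity ~ ., data = spotify_train) |>
  step_select_vip(
    all_predictors(),
    outcome = "popularity",
    # Rank predictors by random forest variable importance...
    model = rand_forest(mode = "regression") |>
      set_engine("ranger", importance = "permutation"),
    # ...and keep the top_p of them; tune() marks this for optimization.
    top_p = tune()
  )

wflow <- workflow(rec, linear_reg())

# Grid search over top_p. Because the selection step lives inside the
# workflow, it is redone within every fold, so the resampled metrics
# honestly reflect the whole pipeline.
folds <- vfold_cv(spotify_train, v = 10)
res <- tune_grid(wflow, resamples = folds, grid = 10)
show_best(res, metric = "rmse")
```

The key design choice is that the selection step is part of the preprocessor, not something done once up front, so it cannot accidentally leak information into the resampling estimates.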
That idea of making it hard to do the wrong thing leads us to one more metaphor while we're here talking about being on the road, being on this journey. This may not be 100% resonant to those of you visiting from outside the United States, who maybe live somewhere with excellent public transportation. But I have always lived somewhere where I rely on a car. And the idea of locking my keys in my car is something that is kind of terrifying, because you're like, oh, no, what do I do now? I'm stuck somewhere.
So, here together, let's answer another poll. So, when was the last time that you locked your keys in your car? Was it pretty recently? Was it a long time ago? Have you never done this? Max, when was the last time you locked your keys in your car? I would pick D, about 10 years ago. Yeah. I would pick D, too. It was, like, 10 or 15 years ago was the last time that I locked my keys in my car. And I wouldn't be surprised if a bunch of people here choose that they have never locked their keys in their car. And I don't think it's because they're that much more responsible or good at handling keys than we are, but rather that you're quite a bit younger than Max and I are. Because the thing is, the car that I have now, it's really hard to lock my keys in my car now. The car itself is built in such a way that it protects against common failure modes.
So, this is how tidymodels is built. It is built to protect you against these common pitfalls, potholes, that are on your road to developing and deploying your model. Something like a flat tire is an immediately painful problem. And some problems that you run into in machine learning are like that. One I might suggest here would be model deployment: you're like, oh, I can't get it to work, it fails every time I try to push it to wherever I'm going. There are other kinds of problems, though, that are not obvious until later. The metaphor we might use here would be not filling your car up with gas and just driving along. Because you didn't make that choice at the last exit that had gas, you end up running out of gas later. tidymodels also protects you against problems like that, where you make some decision during your modeling process, and it may not come back to bite you until you're predicting on new data.
So, as we start to wrap up here, we want to share a thought from my friend and co-author, Dave Robinson, who started using tidymodels about a year ago and wrote a blog post reflecting on his experience. He said that he, like us, sometimes hears some resistance to the idea of how tidymodels works, that it makes things too easy, that you're not really thinking about what it is that you're doing. But, like Dave, we think this is entirely backwards. We want to protect you as a modeling practitioner from making silent but bad choices so that you don't have to worry and stress about those things, and you can focus on the scientific and statistical questions that are the reason why you're training a model in the first place.
tidymodels 1.0 and learning resources
Well, I think it means that we're at the point with tidymodels where it's mature enough. The syntax is pretty much solid. There are no super major changes coming anytime soon that would affect users. You may have noticed a flurry of CRAN submissions where we've revved everything to 1.0. So, I think the message we want to get across to you is: it's ready for your day-to-day work. If I were working in my previous job, I would be using this to finish projects and do things on time. It's at that point where we really feel confident that you can use it and depend on it in your day-to-day work.
So, at the very beginning of the pandemic, what we did was spend a lot of time writing really good documentation. tidymodels.org is a website that has some really good content on it. You can see there's a section there for getting started: a great sequence of articles that will introduce you to tidymodels and show you how to use it. There are also a lot of articles for when you want an example of, say, Bayesian optimization, or bootstrapping to compute confidence intervals, and things like that. That's all on tidymodels.org. There's quite a lot of content there. In addition to that, you may have seen the other night that Julia and I finished writing a book on tidymodels for O'Reilly. They should be printing them now. Hopefully, we'll get copies in the next week or two. The website's down there if you want to look at the online version. So, check that out. It's really good, longer-form documentation for learning about tidymodels.
So, we think it's not only ready for you, but we have lots and lots of good materials publicly available for you to learn it and interact with it. So, the last things we want to say are thank you. Thank you for being here. Thank you if you're in the room. Thank you if you're remote. We appreciate you coming. We also want to thank our teams. We want to thank the rest of the tidymodels team: Simon, Hannah, Emil, and Davis. The vetiver team is excellent: Michael and Isabel. It's great having people from a Python background. It's really enriching our perspectives on how we do things in R and in modeling and deployment in general.
Q&A
Thanks, Max. Thanks, Julia. We've now got time for some questions. And remember, go to Slido if you want to ask your questions. I do not see any questions there yet. So, I will make a little chitchat with Julia and Max until some... Okay, we're getting some questions rolling and I'll still make a little chitchat. One thing I forgot to mention in my intro is that Max and Julia have a book, Tidy Modeling with R, coming out very soon with O'Reilly. Unfortunately, not in time to make it for this conference, but it should be out in the very near future. Yes. And, of course, you can read it online for free.
Okay. So, first question. Are you planning to integrate tidymodels with other data structures like time series or arrays for images, text, or video? I'll answer first and then you can go. So, one characteristic that is especially applicable to text data is that it is really sparse. And one thing that we have already integrated is the ability to internally, within that loop that we talked about, pass data as a sparse matrix to models that can handle it, and thus train and tune more efficiently. So, this is something that we definitely think about and are interested in doing more of.
Yeah. So, for time series, we've mentioned Matt Dancho's work. He's here and has a whole bunch of packages that handle that. A lot of times what we work on is mostly governed by what we know about. And, honestly, I don't think any of us are, like, expert at time series. So, Matt's stuff covers all that, I think, really, really well. We're doing more with survival data. It'll be more widely integrated with tune and things like that, hopefully by the end of the year. We're always on the lookout, especially for recipes, in terms of specialized things that we can do.
Julia, could you comment on how vetiver compares and contrasts with MLflow? Sure. I would say that MLflow more highly prioritizes model development in having opinionated requirements on how you develop your model. It then does some similar things in that it provides a way for you to deploy your model. One thing that is different about vetiver is that it is more flexible in how you train your model, how you tune your model. If you come from a more statistical background, you're going to be more comfortable with using your solid statistical approaches to training your model, and then what vetiver does is it comes in at that point, once your model is trained. So, I'd highlight that as a main difference between the approach of MLflow and the approach of vetiver.
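That "comes in once your model is trained" point can be sketched with a few lines of vetiver code. This is a hedged, illustrative sketch: the pin board, model name, and the `mtcars` workflow are stand-ins, not anything from the talk:

```r
# Sketch of vetiver picking up a finished model; names are illustrative.
library(tidymodels)
library(vetiver)
library(pins)

# Train with whatever tidymodels approach you prefer...
fitted_wf <- workflow(recipe(mpg ~ ., data = mtcars), linear_reg()) |>
  fit(data = mtcars)

# ...then vetiver takes over at that point, for versioning and deployment.
v <- vetiver_model(fitted_wf, "cars-mpg")
board <- board_temp()
vetiver_pin_write(board, v)            # version the trained model on a board
# vetiver_api(v) |> plumber::pr_run()  # serve predictions as a REST API
```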
Max, how would a newcomer get started with tidymodels, or do you think this is something that's only suitable for seasoned data scientists? No, it's definitely, I think, relevant for anybody. You know, it's a little bit more verbose than things like caret. You might end up writing a few more lines of code, but its complexity, in a way, is a lot lower than, let's say, caret's, because for the one thing you want to do, you don't have, like, 75 options to sort through. So, I think it's really good for newcomers. We've shown, like, UMAP and racing and things like that, but we don't want to make it seem like you can't just fit your lm model to whatever mtcars-equivalent data you have that's not terribly complex.
For both of you, any plans to integrate tidymodels with causal inference? With what? Causal inference. Causal inference. Yeah, yeah. Just casual. Just casual. This is actually something, you know, of all the workshops that we just had the two days, that's the one I would have liked to have gone to the most if I wasn't teaching one. So, I don't know. Maybe we'll learn about it, and then start. I don't know. What do you think? So, I'd like to. I was saying the same thing at the book signing. I'd, you know, oh, what workshop do you go to? And they say, oh, ca


