
Julia Silge & Max Kuhn | Good Practices for Applied Machine Learning | Posit (2022)
The tidymodels framework is a collection of R packages for modeling and machine learning using tidyverse principles. Whether you are just starting out today or have years of experience with ML, tidymodels offers a consistent, flexible framework for your work. In this talk, learn how tidymodels has been designed to promote ergonomic, effective, and safe modeling practice. We will discuss how to think about the steps of building a model from beginning to end, how to fluently use different modeling and feature engineering approaches, how to avoid common pitfalls of modeling like overfitting and data leakage, and how to version and deploy reliable models trained in R. https://www.tidymodels.org/ Session: Keynote
Transcript
This transcript was generated automatically and may contain errors.
Thank you so much for that introduction, Hadley. It is such a pleasure to be here. Max and I are really excited to be talking to you today about machine learning and software for machine learning. The topic of this talk is not so much the math of statistical modeling or new methods for machine learning. What we want to talk to you about here today is applied machine learning: how you as a modeling practitioner get on that road from model development through to model deployment.
As we're here together talking about this road that we are on, we think this trip down the road is going to be better if we all go together. So we would encourage you to have Slido up on your phone with the event code R keynote. And we're going to be using some polls during this talk for us in this room and people who are watching from afar to be able to share their own experiences, opinions, and perspectives.
Introducing tidymodels
So what we're going to be talking about today is the software that we build for machine learning, and that is tidymodels. So you can think of tidymodels as a meta package that helps you install and manage other packages that are used for modeling and machine learning in R. These packages work together, but each have a specific focus. It works much the same way that the tidyverse does. So if you have ever typed library tidyverse and then used ggplot2 for data visualization, dplyr for data manipulation, you can think about tidymodels as working in the same way.
You type library(tidymodels), and then you have access to these different packages that each have a focused purpose. So yardstick, for example, is a package that is used to measure model performance. tune, perhaps unsurprisingly, is a package for hyperparameter tuning. So this is a great way to think about tidymodels: it's an R package that gives you access to all these kinds of functions that work together.
It's also right to be thinking about tidymodels as a unified framework for modeling in R. The ecosystem for modeling in R is one of our big strengths; it is a reason why people choose to use R. At the same time, it really presents a challenge, because the interfaces to all these different kinds of models are very heterogeneous. If you have set up a model analysis using one kind of model, moving over to a different type of model often involves starting over from scratch with your whole approach. tidymodels provides a unified interface to a vast array of modeling approaches and feature engineering approaches, and encourages good statistical practice via its design.
Now, some of you are sitting out there, and you're thinking, wait a minute. R already has several of these. In fact, that person standing up there next to Julia talking right now is one of the people who has built one already. So, Max, what's going on? Why yet another modeling framework for R?
Why would Max do this, right? So I wrote a package a while back called caret, and it's another similar modeling framework. And it was in 2005, which, as many of you know if you developed packages back then, was a different time. There were no namespaces in R, for example, and it was lacking a lot of the things we have now. I was also kind of doing it in my spare time, and it wasn't really engineered in a way that we could port it over to new and better interfaces and extend it really well. So, you know, learning more as I've gone on about software engineering, I think if I'd known those things back then, I would have done some things differently under the hood.
So it's not really something that is extendable at this point. But maybe more importantly, it's not a bad interface, but it's a very 2005 interface, where you have one function that does just about everything and has like 30 options to it. If we were thinking about doing more and more with caret, it would be really difficult, because, you know, if we add more types of resampling or more preprocessors or this or that, then you have a function that has maybe 100 or so options, which is not a good design. And so since 2005, I think we've not only learned a lot more about how to write a good R package, but we've learned a lot more about how people interact with our software.
And then, you know, as time goes on, you see things that are happening in ggplot2, even just plyr and then dplyr and things like that. So I think we've learned, as a community, a lot more about what are good ways of writing interfaces to packages. And as I started to think about that in the context of modeling, it seemed like it would really make sense to have more of a tidy, modern interface to modeling. And so that's really the reason to build a whole new stack from the bottom up: what was old wasn't engineered well enough to extend, and its whole view of the world was not really very modern.
The Spotify dataset and the reality of modeling scripts
So a lot of times when we want to introduce software and talk about, let's say, our philosophy and why we do things, we want to start off with a data set that we can use to illustrate things. So here's a data set some of you have seen. It's Spotify data. Each row in this data frame is a track or a song. We might want to do some modeling on this. This was in a SLICED competition. The outcome is the popularity column right there. That's numeric. So we want to try to predict that with the other columns that are in this data set.
There are other interesting features. There's an artist feature, and as you might imagine, there are quite a large number of artists in the Spotify catalog, so we might think about ways we could deal with that when we go to do modeling. There's a date field, and we might not want to put the date directly into a model. We might be better off deriving features from it, like an indicator for month or year or any seasonality or any sort of date-based features, instead of using the original date value. And then we have this monster of a column over here, this character column. It's really a multiple-choice type field in a delimited format. And since we think genres are probably important for predicting popularity, we'd have to deal with this data somehow.
And so there's this idea that we have of what our modeling script is going to look like. And, you know, when we read books and we see websites and blogs and things like that, I think we're always led to believe that there's this perfect data set that we're handed by our associates. And there's this single call to a modeling function. It's really clean and nice. And then you're going to make predictions, and it's just a single line, and you're fine. It's no big deal. But the reality is more like this. It's like a Franken-script.
And a lot of this is about the data pre-processing. So, you know, at the top there, we might use something like a technique called effect encoding to handle the large number of artists. That requires some statistical estimation of the effect of each artist. You might do things like convert your dates; all of us have lubridate commands, right? That use wday() and all that to make date features that we then put in our model. There's code there to parse the multiple choice field. And then at some point we have way too many features, so we might think about doing some feature selection. And so there's this big script that we have to run before we get to that actual one line of modeling.
So this is the reality of things. And it's not great. I mean, if you do this for a living, it's something you have to deal with every time you get a new data set. Now, we'll come back to that slide, and we'll also come back to this slide a couple times. The point we want to make with this slide is that even before we get to the model, we're doing a lot of estimation and a lot of quasi-modeling work. There are things here that we're estimating from a statistical standpoint. And also, when we go to predict new data, we have to apply all those same transformations to data we might get six months from now to make a prediction.
Ergonomic modeling with recipes
I really love this visualization of what you think your script is going to look like versus how it actually ends up turning out. It aligns so much with my practical experience. And that's why we want to talk to you here about why tidymodels is designed the way it is and how it can support more ergonomic approaches to modeling. By ergonomic, what we mean here is really about syntax: about how expressive you can be in the code that you write, how fluent you are, and how the syntax that you write helps you get your work done.
We also want to talk about how tidymodels will make you more effective. What we mean by effective here is that it gives you, as a practical modeling practitioner, access to a wide swath of different approaches, that it enables you to do many different kinds of things. And last, we want to talk to you about how tidymodels keeps you safe. The process of machine learning has many pitfalls, or potholes in the road on the way, if you will, and tidymodels is designed to help you avoid those pitfalls, those potholes.
So a lot of times I think about physical ergonomics. When I worked in diagnostics, clinicians would come in and interact with the diagnostic machines we had, and we'd watch them as they did their work to see what things they did over and over again or might get frustrated with. And there's a lot of that in the way we design the syntax for our functions. But in machine learning in particular, there's an extra layer of things that we have to think about, because there's a high amount of cognitive load for machine learning.
So when you go to do a model for the first time, or if you're, let's say, taking a workshop to learn about machine learning, before you can even get to your first model fit, there's all this stuff that we have to throw at you. We have to talk about overfitting, and data splitting, and training and test sets. And there's a lot of jargon and a lot of things. It's like, you do this, and then this, and then this, and then a day or two later, maybe you're fitting your first model.
And so what we want to do is we want to simplify this process, maybe in ways you don't even see in terms of the syntax. So we want to take this complexity and reduce it down a little bit. We don't want to stop you from doing really sophisticated things and having access to the functions and the nitty gritty bits. But we do want to give you really good APIs that, for like 80% of the work, the syntax should be pretty high level and consumable by just about anybody.
So one aspect of machine learning, and really of complicated or sophisticated statistical models, is that we have to think about what our data are and how we're going to use them. So we're going to do our first Slido poll. Go ahead to slido.com if you haven't already; the event code is R keynote. And what we want to talk about is the types of roles variables have when you're building models. This is multiple choice. A and B are the things that we probably think about normally, predictors and outcomes. But you know, there are more sophisticated methods you might use, like with case weights, like frequency weights or importance weights or survey weights.
So let's look at recipes for a minute. We'll use that as an example. Recipes are kind of a combination of the R formula method that you see in modeling functions, combined with a dplyr sort of approach, where you're basically piping in operations that you want to apply to your data before you give it to the model. So in the first line there, we load the tidymodels package, which again loads the core tidymodels packages plus some tidyverse packages. And when you start modeling, one of the things we typically do is an initial split of our data into testing and training sets.
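That first step might look like the following sketch; the `spotify` data frame name is an assumption for illustration:

```r
library(tidymodels)

# Hold out a test set before any modeling; prop = 0.75 is the default 3:1 split
set.seed(123)
spotify_split <- initial_split(spotify, prop = 0.75)
spotify_train <- training(spotify_split)
spotify_test  <- testing(spotify_split)
```

Setting the seed first makes the split reproducible, and the test set stays untouched until the very end of the analysis.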
So in that first line, what we're doing is we're starting a recipe, and there's a formula method there. It says popularity ~ . The left-hand side says popularity is an outcome; that's its role in the analysis. And the dot means everything else in that data frame besides popularity should, by default, be considered a predictor. Now, you can have a lot of different roles in recipes; you can make up roles for whatever you want. But right now, at this point, all the recipe does is catalog the data: what's in the data frame, is it numeric, is it a date, is it categorical, that sort of thing.
So what we can do then, as we want to do operations on our data to prepare it for modeling, is pipe in new steps, new functions that represent data processing steps in our recipe. So let's say we're going to use a model like a neural network or a nearest neighbor model, where we have to make sure that our predictors are all in the same units. One common way of doing that is centering and scaling your data, and the step function we have for that is called step_normalize(). And after step_normalize(), you can use any sort of dplyr selectors, like you would in the dplyr select() function, to say which variables should be affected by this.
But we know from the modeling standpoint, when we make choices about these things, we're thinking about what type of role the variable has and what its type is. You wouldn't go and try to center and scale categorical data. So what we've done is we've given recipes a lot of extra dplyr-style selectors to let you choose variables in the model-related context in which you want to use them. So you see that all_numeric_predictors() there, and it's doing what you think: it's going to select anything at that point in the recipe that is numeric and has a role of predictor.
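Put together, the recipe described so far might look like this; the training data frame name `spotify_train` is an assumption:

```r
library(tidymodels)

spotify_rec <-
  recipe(popularity ~ ., data = spotify_train) %>%  # popularity is the outcome; everything else defaults to predictor
  step_normalize(all_numeric_predictors())          # center and scale only numeric predictors
```

The selector means the step never touches categorical columns or the outcome, no matter how the data frame changes.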
So when we go back to our, what I keep calling the Franken-script, you know, we have all these things that we wanted to do to our data before we give it to the model. And it's, like I said, a big mess of scripts. Parts of these scripts have probably never been unit tested and things like that. So we can convert that into a recipe. In that yellow box, you see a series of steps, and they do all the things that we talked about doing. You might use step_date() to take the date field that's in there and convert it to new features like month and year and things like that.
So all these things you can put into a single object. You can save that object, you can carry it around. It's not in a bunch of scripts; it's been unit tested and has a lot of features in it. So we want to simplify that part of the process. Then in the last three lines of the yellow block, we create what's called a workflow object. That's for the situation where you want to take a complicated preprocessor like a recipe and bind it together with whatever type of model you want to use. In this case, we're going to use a machine learning model called Cubist. It's a very sophisticated rule-based model. So in that workflow call, I'm binding together my data processing recipe, and I'm saying we're going to use this with Cubist. And then there's a simple fit function we use after that, and that does all the preprocessing, all the estimation that happens during preprocessing for this recipe, hands the data off to Cubist, which fits the model, and it's all sitting in one object that you can save or deploy, spoiler alert, and do things with.
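A sketch of that workflow pattern, assuming a recipe `spotify_rec` and data frames `spotify_train` and `spotify_test` along the lines discussed (the rules extension package supplies the Cubist model for parsnip):

```r
library(tidymodels)
library(rules)  # extension package providing cubist_rules() for parsnip

spotify_wflow <-
  workflow() %>%
  add_recipe(spotify_rec) %>%                         # the preprocessing recipe
  add_model(cubist_rules() %>% set_engine("Cubist"))  # the rule-based Cubist model

# fit() estimates the recipe on the training data, then fits Cubist;
# the result is one object you can save or deploy
spotify_fit <- fit(spotify_wflow, data = spotify_train)
predict(spotify_fit, new_data = spotify_test)
```

Because the recipe travels with the model, predict() automatically applies the same preprocessing to new data.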
Model deployment with vetiver
In my experience, and in talking with folks about their modeling, one of the parts of the whole modeling analysis that has been the least ergonomic, that has involved the most pain, is the process of getting your model off your machine. You know, you would spend time training, tuning, developing a beautiful, accurate, appropriate model, and then when it's time to deploy that model, to put that model into production, there's been a real lack of tools for people who are excellent model developers to take that model and get it to where it needs to go.
Maybe some of you are sitting here and you're like, what does that even mean, the word production? That's a great question, because there isn't some industry standard definition for what production means. I like to use the definition or the idea that putting a model into production is taking your trained model, getting it out of the computational environment where you trained it, and putting it somewhere where it is useful to a wider group of people. So by that definition, taking a model and putting it in a shiny app and letting people interact to get predictions, we can say that's production.
I've been working on a new framework for model deployment and other MLOps tasks called vetiver. So if you are into perfumery or fancy candles like I am, you may have seen this word, this word vetiver. So vetiver is a stabilizing ingredient in perfume. It takes the more volatile fragrances and it stabilizes them. So in this metaphor, your model is this more volatile thing. You have used different hyperparameters, you have used different data, and vetiver helps you stabilize it so you can version it, deploy it, and monitor it.
Let's walk through that briefly. Any modeling process starts with you collecting data, having access to some kind of data. The first thing you need to do is understand and clean it; you need to engage in the process of exploratory data analysis. There are great tools in Python and R for you to approach those tasks. Next, you want to train your model. You want to train, tune, evaluate, and again, there are amazing open source tools, like what we are talking about, to be able to approach those kinds of tasks. So far, so good. But that becomes less true as we move around this cycle.
Once your model is trained, you need to have a reliable scheme to version that model so that you can know which version of the model was used over which time with which data. Once your model is versioned, you need to deploy it. And after it's deployed, your job is not done. You need to also monitor your model. You need to regularly measure the statistical characteristics of how your model is performing so that you can make adjustments, decide to retrain, head down that path. So the tasks on the right side of this diagram are ones that there are great open source tools for. The tasks on the left side are ones where that has not been true, and this is where vetiver sits. vetiver is a framework for versioning, deploying, and monitoring your model.
Now, you can probably guess from this slide that vetiver is not just for tidymodels. It supports many kinds of models in R, and vetiver is not just for R. vetiver was designed from the beginning to have feature parity between R and Python. You can use the tools that you want to use to understand and clean your data, to train your model, and then vetiver can support these MLOps tasks for both Python and R.
While you're here, though, I'll just show you what this looks like in R a little bit. So, let's say I trained that model that Max talked about, and now it's time to deploy it. The first step is to create a deployable model object. This collects all the information that we need to be able to move this model from where I trained it over to a new computational environment where I can make predictions. This information includes the types and number and names of the predictors, the original predictors. This includes also the packages and package versions, the specific software that is needed to be installed over in that other place to be able to make predictions.
Once we have that deployable model object, the process of spinning up an API is quite ergonomic. This was our goal here. So, on the R side, we used the plumber package, and once you have a plumber router initialized, it is literally just one line of code to spin up your model-aware API that's ready to serve predictions at an endpoint. So, this is what we mean by ergonomic. We want to let you be expressive, to let you be fluent, to let you get your job done in this way.
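In code, those two steps might look something like this; `spotify_fit` (a fitted workflow) and the model name are assumptions for illustration:

```r
library(vetiver)
library(plumber)

# Create the deployable model object: it bundles the trained model with its
# input prototype (predictor names/types) and required package versions
v <- vetiver_model(spotify_fit, model_name = "spotify-popularity")

# With a plumber router initialized, one line turns it into a
# model-aware API that serves predictions at an endpoint
pr() %>%
  vetiver_api(v) %>%
  pr_run(port = 8080)
```

Visiting the running API's documentation page shows the prediction endpoint that vetiver set up.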
Being effective: tuning and advanced methods
Now, let's move on to our second point, which is how tidymodels makes you more effective. Max, when we think about being effective, what are some of the things that come to mind for you? So, you know, back when I used to do modeling for a living, I feel like I was most effective when somebody brought some data to me, like de novo, like a new project, and I was able to give them back a predictive model that really met or exceeded what they were interested in or what they needed for the problem, and not take, let's say, months to be able to do that.
A big part of building a machine learning model is model tuning, which we mentioned earlier, where you have these hyperparameters that you can't directly estimate from the data and you need to find a way to tune them. So, another poll here, have you ever tuned a model? So, for example, if you have a neural network and you want to figure out how many hidden units or hidden layers you should have, you need to, you know, do some sort of methodology to figure out what good values are for those tuning parameters.
So, just as an example of some techniques to talk about effectiveness, you know, there's a lot of things that aren't, like, bleeding edge, but they're relatively cutting edge and might be coming from different parts of literature. Some things we include in tidymodels are from the deep learning literature. There's actually some interesting things we're doing from the computational chemistry literature that we're using for other types of data or any type of data. Some examples of that would be feature embedding methods, and these are just, like, really fancy machine learning methods for dimensionality reduction. So, UMAP and ISOMAP, like multi-dimensional scaling, are pretty powerful and interesting methods, and we'd like you to be able to use them for any data and use them with any model.
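As one example, the embed extension package makes an embedding method like UMAP available as an ordinary recipe step, usable with any model; `spotify_train` is an assumed data frame here:

```r
library(tidymodels)
library(embed)  # extension package with embedding steps such as step_umap()

umap_rec <-
  recipe(popularity ~ ., data = spotify_train) %>%
  step_normalize(all_numeric_predictors()) %>%       # UMAP benefits from scaled inputs
  step_umap(all_numeric_predictors(), num_comp = 5)  # project predictors to 5 UMAP components
```

The downstream model never needs to know the features came from an embedding; it just sees five numeric predictors.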
Bayesian optimization is a tool that's, you know, pretty common in the deep learning or neural network side, but you can use that for anything. There are a few packages in R that do that, but they're not really well integrated overall with all the types of models. We provide you tuning methods for Bayesian optimization that you could use for your boosted tree or for a support vector machine or whatever you're going to do. And then, racing is an example we'll look at in a little bit more detail. It's a way to do, like, really, really efficient grid search.
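A sketch of what that integrated Bayesian optimization looks like, assuming a recipe `spotify_rec` and training data `spotify_train` like those discussed:

```r
library(tidymodels)

# A support vector machine with two hyperparameters marked for tuning
svm_spec <- svm_rbf(cost = tune(), rbf_sigma = tune()) %>%
  set_mode("regression")

svm_wflow <- workflow() %>%
  add_recipe(spotify_rec) %>%
  add_model(svm_spec)

# Bayesian optimization proposes new candidate values based on results so far
set.seed(234)
svm_res <- tune_bayes(
  svm_wflow,
  resamples = vfold_cv(spotify_train, v = 10),
  iter = 25
)
show_best(svm_res, metric = "rmse")
```

Swapping svm_rbf() for boost_tree() or any other parsnip model leaves the rest of this code unchanged, which is the point of the unified interface.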
So, in grid search, you might come up with a number of different candidate values of your tuning parameters. You might say, let's try seven hidden units and 12 hidden units and two and so on. But you predefine them. And the problem with grid search sometimes is you don't know whether some of those choices you made about the candidate parameters are any good until you're done with all the computations. What racing does is make that dynamic: as you start to do the model tuning, it looks at the results as they happen, looks at some tuning parameter combinations, and says, oh, those are never going to be the best.
So, in this little animation we have here, the y-axis is 50 different model configurations for a machine learning model, and the x-axis is a measurement of performance. The blue dot, if you can see it there near the top, is the current winner during each resample that we're doing for our grid search. And as you can see, a lot of these candidate values got grayed out and were eliminated on the next step. So, every time it does a resample, it's doing some analysis to figure out what's good and what's bad, and it gets rid of the bad stuff. Really quickly, after eight or ten resamples, you're down to just a handful of models out of 50. In this particular case, of all the models you could have fit, you actually only ended up fitting about 7% of them. And if you're working with parallel processing, that's almost a three-fold increase in efficiency. So it took you about a third of the time it would have taken with regular grid search.
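The racing setup might be sketched like this; `spotify_rec` and `spotify_train` are assumed objects, and the finetune extension package supplies the racing methods:

```r
library(tidymodels)
library(finetune)  # extension package that provides racing methods

# A tunable boosted tree, combined with the recipe from the discussion
bt_spec <- boost_tree(trees = tune(), learn_rate = tune()) %>%
  set_mode("regression")

bt_wflow <- workflow() %>%
  add_recipe(spotify_rec) %>%
  add_model(bt_spec)

set.seed(345)
race_res <- tune_race_anova(
  bt_wflow,
  resamples = vfold_cv(spotify_train, v = 10),
  grid = 50                    # 50 candidate configurations, as in the animation
)

show_best(race_res, metric = "rmse")
plot_race(race_res)            # shows which candidates survived each stage
```

tune_race_anova() drops candidates that an interim analysis shows are unlikely to ever be best, so only the promising ones are fit on all ten resamples.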
Extensibility: the tidymodels ecosystem
This racing example is such a great one because it really highlights how tidymodels is extensible. The package that provides that infrastructure is not a core tidymodels package, but an extension package. So, the metaphor that we can think about here together is one of Legos. If you have one beautiful Lego block here, we can admire it, look how well it's designed. But the reason why Legos are fun to play with is that you can put them together, and you can put them together in the way that meets your needs or fulfills your vision or builds what it is that you want to build. And that is how tidymodels works.
But if we ask the question, what else is tidymodels? There's another whole group of packages that are for more specialized tasks. I'm just going to run through the ones on the top there briefly. textrecipes is a feature engineering extension package for text preprocessing, text feature engineering. This is one of my favorite things to talk about, and tidymodels gives you great tools to make you effective in building text models. The next one there is the censored package, which is for survival analysis. Our coworker Hannah is going to be giving a talk on that in the next session, if you particularly work with survival analysis. The next one, stacks, is interesting because it builds on top of the whole tidymodels framework to take multiple different kinds of models and stack them together, or ensemble them, so that you can squeeze that last bit of performance out of the data that you have.
The last one up there is modeltime. And I want to highlight this one because it's a package that's not built by us who work on the tidymodels team. It's actually built by a member of our community, Matt Dancho, who is here somewhere today. And it highlights how tidymodels is extensible, not just by us, but also by you. This is true whether you're interested in building some open source software for a new kind of model or a new kind of resampling. It's also true within your own organization: you can build, for example, a custom metric based on your company's KPIs and use that to optimize the models that you are training.
Now, when we start talking about all these packages, one of the things that we start to hear from people is a bit of discomfort with the fact that there are so many packages. How do I know which one does what? How do I find them? And we want to acknowledge that, yeah, there is this little bit of a learning process. And it's important to have tools in your tool belt to be able to identify what functions come from what packages. However, we really want to emphasize that this modularity makes all of our lives better. It makes our lives better as maintainers of the packages, and it makes your lives better as users of the package. Smaller, more modular packages can be released more frequently with smaller changes. We can more quickly fix bugs.
And this is, I would say, most highlighted when we are talking about model deployment. Let's say we trained a model using that racing method that Max shared with us. We don't need any of that racing infrastructure when we go to deploy our model. In fact, we don't need the tuning infrastructure at all. We need only the subset of the modeling software that is required to make a prediction. Our packages being modular allows you to make smaller Docker images, have faster installation times, and have more scalable models in production.
Practicing safe machine learning
So, so far, we've talked about how we can have a more ergonomic modeling process. We have talked about how we can make you more effective in the models that you build. And last, we want to talk about being safe, practicing safe machine learning. So, Max, how does tidymodels keep us safe? So, the whole idea of safety in modeling, you might be like, well, what do you mean by safety or being safe? It turns out with, like, you know, complex machine learning models and especially complex data, it's possible for you to do something horribly wrong and not really know it until a really inopportune time.
And I've made this confession several times at different conferences. But, you know, in my first job, I had a project that was, like, three quarters of the total R&D budget. We were doing a bunch of sophisticated machine learning with a large number of predictors, and we thought we'd gotten it to a good point. And my boss came by and he's like, how does it look? And I'm like, accuracy's, like, 90%. He's like, great. And then we got more samples in, and we missed them all. And we did a lot of soul searching in the days after that and figured out some methodology mistakes that we'd made, mistakes that others make quite often, to be honest with you. And once we figured that out and fixed it, we had more genuine estimates of performance.
So, there are times in modeling, especially if you're doing something complex, when you might accidentally fool yourself into believing you have something that's really good, but you have something that's actually not as good, and you won't figure that out until you get new data. And here's one of the reasons. Going back to this part of our previous slides: we have all the modeling parts, and it's pretty well understood how to validate and assess those. But very often, before we get to the model, we might do something, either simple or really complex, to the data that is not a deterministic data cleaning operation. We might be doing some statistical estimation, right?
And so, it's really, really important, in some cases, that we handle that in the right way to make sure that our statistics that tell us how well our model is doing take into account the good and bad parts that might happen in that part of the system. So, let's think about, like, our Spotify script. We have all these genres. Maybe we got rid of the redundancy, but maybe we want to refine that feature set a little bit more before we give it to our model. And let's say we want to do some, like, fancy supervised feature selection routine. So, we might want to pick, like, the top 10 variables to give to the model and pick those 10 because they're the most influential for the outcome, like, for popularity, in this case.
So, let's say, in our Spotify data, we're going to do a tenfold cross-validation to validate our model. And we're going to use feature selection. So, actually, how many times do you think we're going to run that feature selection? Are we going to run it once? Are we going to run it 10 times? We could run it 11 times. Or we could run it some fractional number of times because it was taking so long that you just, like, mashed the escape key until it stopped after three iterations.
So, you know, here's the situation we're in: we have some data. We have a list of predictors. We want to enact some sort of feature selection routine to filter out some of those predictors before we give them to our model. Let's just say, for kicks, we're doing ordinary least squares with lm(). And so, that produces some sort of fitted model. And that red box there is, basically, the box that says, what am I estimating? What should I validate here?
And the way we have the diagram here treats the feature selection bit as if it were sort of outside the model. So, if I would ask most people, like, where are we modeling in this diagram, they would probably do what we show and just circle the actual technical model part of the process. But in actuality, what you have to do is you have to do this. And in this previous slide, this is what I did when I made that horrible mistake in my job, is I just selected the features once, put them into the model, and there's a big sort of, like, circular argument methodology-wise in doing that. And it gave me really bad results.
What we want to do is this: for every one of those ten cross-validation folds, you do the feature selection over and over again. And that may seem excessive, or it may take a long time. But at the same time, you can't measure the effect of something that's not within your validation system. If it's outside of it, it might give you bad answers. So, the answer is probably B, which is 10. But let's say this model was really, really good, and you liked it, and compared to all the other things that you did, this is the one you want to take to production. You want to put it in vetiver and deploy it. Well, actually, what you would end up doing is running it an 11th time. Because to build the final model, you have to do the feature selection on the entire dataset, hand that data over to lm(), and run lm() on the entire dataset.
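That counting can be sketched in code, again using `mtcars` as an illustrative stand-in for the Spotify data and a plain recipe in place of the feature selection step:

```r
# Sketch of the 10-versus-11 counting: resampling re-runs the whole
# preprocessing + model pipeline once per fold, and the final fit on the
# full training set is one more run on top of that.
library(tidymodels)

split <- initial_split(mtcars)
folds <- vfold_cv(training(split), v = 10)

wflow <- workflow(recipe(mpg ~ ., data = training(split)), linear_reg())

# fit_resamples() runs the pipeline 10 times, once per fold...
res <- fit_resamples(wflow, resamples = folds)

# ...and last_fit() runs it an 11th time, on the entire training set,
# which is the version of the model you would actually deploy.
final <- last_fit(wflow, split)

collect_metrics(res)
```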
So, really, the answer there is either B or C. But it's highlighting that there are some pitfalls and gotchas, and it's really easy for somebody like me, who studied this in graduate school and did it for a living for a while, to make these mistakes. And unfortunately, it happens quite a bit. There are quite a few papers that talk about this and measure it, but there's one that just came out. We're bringing it up because their findings were, let's say, not great. So, leakage in this context is data leakage. It's where you're using the wrong data, maybe at the wrong time, to do some sort of calculation, right?
And so, in this paper, what they do is they look at a lot of different publications. And you can tell, based on their method sections, what they were doing. There's a lot of situations where they're doing, like, the bad methodology. And in tidymodels, what we want to do is, whether people know it or not, is we sort of want to, like, silently sort of give you guardrails. So, if you follow the process of tidymodels and use a tidymodels syntax, it's very, very, very difficult to do the wrong thing.
So, just to give you an example of that, here's another recipe. We're going to use a recipe extension package written by somebody in the community called recipeselectors. And that has a recipe step in it that will compute, let's say, a random forest variable importance score across your predictors. And then you can select the top 10 or 15 or 30, however many you think you should, of the most important predictors and give that to, let's say, your linear regression model. But we don't really know how many we should pick. So, if you look at that line there, it has an argument called top_p: how many predictors should I retain? We give that a value of tune(). And that value of tune() marks it in tidymodels as being something that we want to optimize.
And so, then we can take our new recipe that built on our previous Spotify recipe, put that in a workflow with a modeling function that says we're going to use just plain old linear regression. And we could use the tune_grid() function to do a grid search to find a good value for how many of those predictors we should use. And so, this is how you would do the feature selection in tidymodels. Now, the thing is, with the way we've set up resampling and data splitting, and the way we're combining our preprocessor with our model, it's nearly impossible to accidentally do the wrong thing here and inappropriately validate or estimate things from your data. So, without really saying it, we're sort of enforcing that methodology for users.
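The recipe and workflow described above might look roughly like this. This is a hedged sketch: it assumes the community recipeselectors package and its `step_select_vip()` step, and `spotify_train` and the `popularity` outcome are illustrative names, not code from the talk:

```r
# Sketch of a tuned, supervised feature-selection recipe, assuming the
# community `recipeselectors` package; data/column names are illustrative.
library(tidymodels)
library(recipeselectors)

rec <- recipe(popularity ~ ., data = spotify_train) |>
  step_select_vip(
    all_predictors(),
    outcome = "popularity",
    # Rank predictors by random forest variable importance...
    model = rand_forest(mode = "regression") |>
      set_engine("ranger", importance = "permutation"),
    # ...and keep the top_p of them; tune() marks this for optimization.
    top_p = tune()
  )

wflow <- workflow(rec, linear_reg())

# Grid search over top_p. Because the selection step lives inside the
# workflow, it is redone within every fold, so the resampled metrics
# honestly reflect the whole pipeline.
folds <- vfold_cv(spotify_train, v = 10)
res <- tune_grid(wflow, resamples = folds, grid = 10)
show_best(res, metric = "rmse")
```

The key design choice is that the selection step is part of the preprocessor, not something done once up front, so it cannot accidentally leak information into the resampling estimates.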
That idea of making it hard to do the wrong thing leads us to one more metaphor while we're here talking about being on the road, being on this journey. This may not be 100% resonant to those of you visiting from outside the United States, who maybe live somewhere with excellent public transportation. But I have always lived somewhere where I rely on a car. And the idea of locking my keys in my car is something that is kind of terrifying, because you're like, oh, no, what do I do now? I'm stuck somewhere.
So, here together, let's answer another poll. So, when was the last time that you locked your keys in your car? Was it pretty recently? Was it a long time ago? Have you never done this? Max, when was the last time you locked your keys in your car? I would pick D, about 10 years ago. Yeah. I would pick D, too. It was, like, 10 or 15 years ago was the last time that I locked my keys in my car. And I wouldn't be surprised if a bunch of people here choose that they have never locked their keys in their car. And I don't think it's because they're that much more responsible or good at handling keys than we are, but rather that you're quite a bit younger than Max and I are. Because the thing is, the car that I have now, it's really hard to lock my keys in my car now. The car itself is built in such a way that it protects against common failure modes.
So, this is how tidymodels is built. It is built to protect you against these common pitfalls, potholes, that are on your road to developing and deploying your model. Something like a flat tire is an immediately painful problem. And some problems that you run into in machine learning are like that. One I might suggest here would be model deployment: you're like, oh, I can't get it to work, it fails every time I try to push it to wherever I'm going. There are other kinds of problems, though, that are not obvious until later. The metaphor we might use here would be not filling your car up with gas and just driving along. Because you didn't make that choice at the last exit that had gas, you end up running out of gas later. tidymodels also protects you against problems like that, where you make some decision during your modeling process, and it may not come back to bite you until you're predicting on new data.
So, as we start to wrap up here, we want to share a thought from my friend and co-author, Dave Robinson, who started using tidymodels about a year ago and wrote a blog post reflecting on his experience. He said that he, like us, sometimes hears some resistance to the idea of how tidymodels works, that it makes things too easy, that you're not really thinking about what it is that you're doing. But, like Dave, we think this is entirely backwards. We want to protect you as a modeling practitioner from making silent but bad choices so that you don't have to worry and stress about those things, and you can focus on the scientific and statistical questions that are the reason why you're training a model in the first place.
tidymodels 1.0 and learning resources
Well, I think it means that we're at the point with tidymodels where it's mature enough. The syntax is pretty much solid. There are no super major changes coming anytime soon that would affect users. You may have noticed a flurry of CRAN submissions where we've revved everything to 1.0. So, I think the message we want to get across to you is: it's ready for your day-to-day work. If I were working in my previous job, I would be using this to finish projects and do things on time. It's at that point where we really feel confident that you can use it and depend on it in your day-to-day work.
So, at the very beginning of the pandemic, what we did was spend a lot of time writing really good documentation. tidymodels.org is a website that has some really good content on it. You can see there's a section there for getting started: a great sequence of articles that will introduce you to tidymodels and show you how to use it. There are also a lot of articles for when you want an example of, say, Bayesian optimization, or bootstrapping to compute confidence intervals, and things like that. That's all on tidymodels.org. There's quite a lot of content there. In addition to that, you may have seen the other night that Julia and I finished writing a book on tidymodels for O'Reilly. They should be printing them now. Hopefully, we'll get copies in the next week or two. The website's down there if you want to look at the online version. So, check that out. It's really good, longer-form documentation for learning about tidymodels.
So, we think it's not only ready for you, but we have lots and lots of good materials publicly available for you to learn it and interact with it. So, the last things we want to say are thank you. Thank you for being here. Thank you if you're in the room. Thank you if you're remote. We appreciate you coming. We also want to thank our teams. We want to thank the rest of the tidymodels team: Simon, Hannah, Emil, and Davis. The vetiver team is excellent: Michael and Isabel. It's great having people from a Python background. It's really enriching our perspectives on how we do things in R and in modeling and deployment in general.
Q&A
Thanks, Max. Thanks, Julia. We've now got time for some questions. And remember, go to Slido if you want to ask your questions. I do not see any questions there yet. So, I will make a little chitchat with Julia and Max until some... Okay, we're getting some questions rolling and I'll still make a little chitchat. One thing I forgot to mention in my intro is that Max and Julia have a book, Tidy Modeling with R, coming out very soon with O'Reilly. Unfortunately, not in time to make it for this conference, but it should be out in the very near future. Yes. And, of course, you can read it online for free.
Okay. So, first question. Are you planning to integrate tidymodels with other data structures like time series or arrays for images, text, or video? I'll answer first and then you can go. So, one characteristic that is especially applicable to text data is that it is really sparse. And one thing that we have already integrated is the ability to internally, within that loop that we talked about, pass data as a sparse matrix to models that can handle it, and thus train and tune more efficiently. So, this is something that we definitely think about and are interested in doing more of.
Yeah. So, for time series, we've mentioned Matt Dancho's work. He's here and has a whole bunch of packages that handle that. A lot of times what we work on is mostly governed by what we know about. And, honestly, I don't think any of us are, like, expert at time series. So, Matt's stuff covers all that, I think, really, really well. We're doing more with survival data. It'll be more widely integrated with tune and things like that, hopefully by the end of the year. We're always on the lookout, especially for recipes, in terms of specialized things that we can do.
Julia, could you comment on how vetiver compares and contrasts with MLflow? Sure. I would say that MLflow more highly prioritizes model development in having opinionated requirements on how you develop your model. It then does some similar things in that it provides a way for you to deploy your model. One thing that is different about vetiver is that it is more flexible in how you train your model, how you tune your model. If you come from a more statistical background, you're going to be more comfortable with using your solid statistical approaches to training your model, and then what vetiver does is it comes in at that point, once your model is trained. So, I'd highlight that as a main difference between the approach of MLflow and the approach of vetiver.
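That "comes in once your model is trained" point can be sketched with a few lines of vetiver code. This is a hedged, illustrative sketch: the pin board, model name, and the `mtcars` workflow are stand-ins, not anything from the talk:

```r
# Sketch of vetiver picking up a finished model; names are illustrative.
library(tidymodels)
library(vetiver)
library(pins)

# Train with whatever tidymodels approach you prefer...
fitted_wf <- workflow(recipe(mpg ~ ., data = mtcars), linear_reg()) |>
  fit(data = mtcars)

# ...then vetiver takes over at that point, for versioning and deployment.
v <- vetiver_model(fitted_wf, "cars-mpg")
board <- board_temp()
vetiver_pin_write(board, v)            # version the trained model on a board
# vetiver_api(v) |> plumber::pr_run()  # serve predictions as a REST API
```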
Max, how would a newcomer get started with tidymodels, or do you think this is something that's only suitable for seasoned data scientists? No, it's definitely, I think, relevant for anybody. You know, it's a little bit more verbose than things like caret. You might end up writing a few more lines of code, but its complexity, in a way, is a lot lower than, let's say, caret's, because for the one thing you want to do, you don't have, like, 75 options to sort through. So, I think it's really good for newcomers. We've shown, like, UMAP and racing and things like that, but we don't want to make it seem like you can't just fit your lm model to whatever mtcars-equivalent data you have that's not terribly complex.
For both of you, any plans to integrate tidymodels with causal inference? With what? Causal inference. Causal inference. Yeah, yeah. Just casual. Just casual. This is actually something, you know, of all the workshops that we just had the two days, that's the one I would have liked to have gone to the most if I wasn't teaching one. So, I don't know. Maybe we'll learn about it, and then start. I don't know. What do you think? So, I'd like to. I was saying the same thing at the book signing. I'd, you know, oh, what workshop do you go to? And they say, oh, ca


