
Transcript
This transcript was generated automatically and may contain errors.
And now I'm going to switch to English, because our next speaker speaks English. The next keynote will be Max Kuhn. He works as a software engineer at Posit, where he leads the tidymodels team.
In 2020, when we were originally organizing the event here, we decided to invite him, but then the pandemic happened and we had to postpone the invitation. At the moment we decided to invite Max, we looked at our list of cool people who do awesome things, and he was there, so: let's invite Max.
In 2023, when we were absolutely sure that we were going to hold the event here, we invited him again: come to Montevideo. And he didn't answer our email, so we moved him from the cool people list to the blacklist for a couple of weeks, until we discovered that he had actually answered our email but it went to the spam folder. We solved the issue, and now Max is here to talk about the tidymodels universe. So thank you, Max, for coming.
Can you hear me okay? Excellent. Thanks for sticking around to the end of the conference, I appreciate it. Now, you may have noticed that we've been using Google Translate during the sessions, and I've realized that if you're going to do that and translate to Spanish or Portuguese, I should try to speak slowly and clearly.
I'm saying this because, let's see, my favorites so far have been: I wrote a package called caret, and, in the US, if somebody calls you a Karen, that's kind of a really mean thing to say. Has anybody ever heard that before? And it translated it as Max Kuhn's Karen package.
So that was that. And then ggplot2 got translated to, was it jpplop2? Yeah, so that's what we're going to call it from now on. We're going to send it to CRAN and break all your dependencies; it's a new name.
So I'm here to talk about tidymodels. We did a tutorial on this, so hopefully this won't be too repetitive if you were there, and hopefully it will be good for the people who weren't. If you want the slides, the link is at the bottom of the slides, in a slightly bigger font here if you want to see it: my GitHub handle is topepo, so it's 2023 Latin R.
Why tidymodels exists
So first I'm going to tell you why tidymodels, and actually caret too, exist. R is very, very good at many types of modeling. It's always been stellar at things like mixed models and hierarchical models, survival analysis, and there are just so many things it does in terms of modeling.
It has quite a long history of really good predictive modeling and machine learning models, so there's a lot of good tooling if you're doing modeling in R. The formula interface, if you've ever used that, like y ~ x, has been around since, I think, 1982, believe it or not, in the S language. All the things they did back then in S have carried through to R, so we've always benefited from having really expressive ways to do modeling.
It's also pretty much always had cutting-edge models. The first real implementation of random forest that people could use that wasn't Fortran was built in R, so a lot of people who do modeling use R as their first prototyping language. Outside of deep learning and a few other areas, especially for tree-based models and regularized models, I think you see things in R before you see them anywhere else.
And if they're not there: the last really good thing pointed out here is that R, and the S language originally, was built as an interface, a set of wrappers around Fortran, if you can believe that. So it's been in the DNA of R and S for quite a while that it's really easy to take things R doesn't do well and do them much faster or better in C or Java or Python or whatever. So there's quite a lot there to love and use, and you can be very effective with it.
But there is a downside, and it's mostly about consistency. The examples I'll go into in a little more detail come down to the fact that package maintainers, and I'll include my earlier self here, tend to write packages for ourselves. So the scope of a package, the things it can do and the things it doesn't do, is largely informed by the things that are important to us at the moment.
Some examples of the consistency problem. Let's say you're fitting a classification model with two classes, like yes or no, and you want the class probabilities back. You might start off by fitting an old-school linear discriminant model, and the standard way of doing that is to take the model fit and put it into the predict() function. With that particular model, to get the probabilities you don't really have to do anything extra: you just call predict() and give it your object. But for a lot of other models you have to say something like type =, and this slide is usually a lot longer with all the options you have: there's type = "response" for logistic regression, sometimes it's "prob", which is the one I like, sometimes it's "posterior", sometimes it's "probability". When I was doing this for a living, before 2000, I'm old, this was very, very frustrating, because you have to remember all the minutiae of every particular package you'd like to try, and that was kind of a pain.
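A small base-R illustration of this inconsistency, using two models that ship with R and built-in data sets (this is my sketch, not a slide from the talk):

```r
# Two classifiers, two different conventions for asking for class probabilities.
library(MASS)  # for lda(); a recommended package that ships with R

# lda(): predict() hands back probabilities by default, in a $posterior matrix
fit_lda <- lda(Species ~ ., data = iris)
p_lda   <- predict(fit_lda, iris)$posterior

# glm(): you have to remember type = "response";
# the default returns values on the link (log-odds) scale instead
fit_glm <- glm(am ~ wt + hp, data = mtcars, family = binomial)
p_glm   <- predict(fit_glm, mtcars, type = "response")
```

Same task, but one model wants nothing and stores the result in `$posterior`, while the other needs `type = "response"`; other packages want `"prob"` or `"probability"`.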
The other thing is that R generally has two interfaces for models. There's the formula method I just discussed, and then there's what we call the non-formula method: you have your predictors in some matrix or data frame x, your outcome in some vector or matrix y, and you call the function with x and y.
Sometimes you want one or the other, or both, depending on how your data come in. So here are five fairly popular machine learning or modeling packages. Let's pick glmnet: between the formula method and the non-formula method, raise your hand if you think it has the formula method.
Okay, thank you. Raise your hand if you think it has the non-formula method. Okay, wow, one versus three, thank you. Anyhow, here's how this all shakes out. If you're using glmnet, it actually only takes matrices, which is kind of a pain, because if you have any categorical predictors it's your job to turn them into indicator variables; if you want to transform a variable, that's your problem. The formula method does that automatically for you. And then when you make new predictions you have to give it a matrix, so it's very limiting.
ranger is a really good random forest package; it's pretty fast and does a lot of things, and it has both the formula method and the non-formula method: you can use a formula or use x and y. The problem is it's all in one function, so they have a single function with arguments formula, data, x, and y, which is kind of weird, but we can get there.
rpart is a decision tree package, very old, and it has just the formula method. I long ago filed an issue to get the non-formula interface, and the maintainer sent me the sources and said, "you should maintain it." I am not maintaining it. The survival package is also formula-only. And then XGBoost, which a lot of people love: not only does it not have a formula method, you have to convert the data to a special sparse matrix format, you append your outcome to that, and if it's a classification model you have to convert your outcome to zero-based integers. So this gives you a sense that, for packages that are all used quite a bit, sometimes you're doing a lot of work on their behalf.
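For the matrix-only interfaces, the manual work looks something like this in base R; `model.matrix()` expands factors into the indicator columns that the formula method would otherwise build for you (a sketch on built-in data, not the talk's actual delivery example):

```r
# glmnet-style interfaces want a numeric matrix x and a vector y.
# Converting the factor to indicator columns is your job:
x <- model.matrix(~ cyl + factor(gear), data = mtcars)[, -1]  # drop intercept
y <- mtcars$mpg

colnames(x)  # "cyl", "factor(gear)4", "factor(gear)5"
# fit <- glmnet::glmnet(x, y)   # what a matrix-only interface expects (not run)
```

And you have to repeat the same conversion for every new data set you predict on, which is exactly the kind of bookkeeping tidymodels takes over.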
So this isn't really something that we like people to deal with. I used to work with engineers a lot, and some of the sharpest, most effective engineers I worked with were called systems engineers. Think about your phone: there's a whole engineering crew at Apple, or at the companies that make Android phones, who are battery engineers, and there are engineers who work on the screen to optimize that, and the antenna, all these groups of engineers. What a systems engineer does is make sure all that stuff works together, that one thing isn't taking so much power that another thing can't work, and so on.
They make it all happen, not at the end but along the way. And I realized maybe a week or two ago that my professional life has mostly been spent as what I'd think of as a systems statistician: we have a lot of packages that do really interesting things, and we want to use them, but they're not necessarily made to be consumed somewhere else, or in a different context than the original author intended. That's a lot of what we do, a lot of what I've been doing, and it's absolutely the goal of tidymodels.
So our job is really to make R less frustrating. But really, it's not that bad: it's very functional, and with a few extra tools you can get very far and build very effective models. And I'll say here: we usually talk about predictive models and machine learning, but tidymodels can do pretty much everything. We have packages for hierarchical models now, survival models, and so on, so it's not just a machine learning library.
Design goals
So you've probably heard of the tidyverse, I know that's kind of a ridiculous thing to say. The tidyverse is built to have packages that work really well together, with a common design goal: the naming is very consistent, and how you interface with the software is more thought out than in anything else I've ever seen.
When Hadley first started working on that, he had the goals listed here: don't make a new data structure that somebody has to learn, try to stick with data frames if you can. Probably the most important one here is "design for humans." Raise your hand if you know the function name for inverting a matrix. Does anybody know what it is? solve(). There's no inverse() function. That's not designing for humans.
It's those kinds of things: we want to make things obvious for you. We don't want you to get a result back and be surprised: wait, why is it in a list and not in a data frame? So when we think about designing for humans in terms of modeling, it's a little different from what they do in the tidyverse, but at the end of the day we want you to enjoy programming and not be constantly frustrated by it.
Some of the things we do: we try to name things and optimize them for autocomplete. If you're doing model tuning, all of our function names for that start with tune_, and then in RStudio or VS Code you can just hit Tab and get the whole list of things, whether they're grids or tune objects or things you can extract. We try to make function names as consistent as possible.
And when I don't, they fix it. The other idea is the notion of tidy data: every variable is a column, every observation forms a row. That's a little weird for us sometimes, because we can have records with multiple sub-records. If you're a patient in a clinical trial, or you're being followed longitudinally, the patient is really the observation, the independent experimental unit, and they might have multiple rows in the data set.
So some of this we have to deal with a little, but the idea is that we want to create modeling packages that work well together, that do a lot of things, and where, when you start using them, even if you haven't used the other parts, you'll already know what to expect.
I've been thinking about modeling for a long time, and as I mentioned in the tutorial, in the process of having a career doing modeling before coming to Posit, mistakes were made. You learn a lot along the way, and a lot of that informs how we design our software.
The first bullet point there is type = "prob": just remember that one, right? When you use our code, we'll make the translation for you, whether the package wants "probability" or "posterior" or whatever. If you want to use the formula method, or not the formula method, we'll handle that for you too, so you don't have to worry about it.
Another thing that's been important to me over the years is resampling and empirical validation: when you have a model and you want to get something out of it, you should be able to quantify its performance, hopefully using something like cross-validation or the bootstrap. I've sat in a lot of presentations where people present logistic regression results and how important it is that the p-value for the interaction is less than 0.05, and then you ask what the accuracy of the model is, and it's like 60%, and you're like, okay.
So there's a lot more to quantifying and validating your model beyond just doing predictive modeling. Another thing we do in tidymodels is design with some hidden guardrails. At least in machine learning, a lot of the problems that occur, the ones you don't necessarily know happened until you get new data, are the result of people using the wrong data at the wrong time for the wrong thing, or using it for multiple things. You sort of have to go out of your way in tidymodels to use the data inappropriately by accident.
We also want people to be able to use the infrastructure to make their own packages pretty easily, and we want to encourage people to do things they couldn't do before, because I've heard a lot of people say: yeah, that's probably a better way to do this, but I don't know how to use that software, or I don't have a license for it, or it hasn't been written and I'm probably not going to write it.
The next slide has the phrase "leveling up." Does everyone know what that means? It means: how can I improve myself or my tool set to get more out of what I'm doing? Just some examples: a lot of times we have predictors in our data, and we use a model that can't handle them as character data, because they're categories, so we have to convert them to indicator variables.
There are a lot of different ways you can do that, and there are methodologies out there that people just didn't have available to them, or the package wasn't easy to use, and so on. Just as examples: there are effect encoding methods that can be very effective in certain situations, and feature hashing is an excellent tool if you have categories with maybe thousands or millions of levels.
Also, multiple-choice predictors. I've had code lying around in scripts for years for this: if I had a model and a predictor was something like "what languages do you speak," one person might have one, somebody else might have four, so having a way to deal with multiple categories per observation is really helpful. We're trying to institutionalize as much of that as we can in tidymodels, so you can just use it and not have to carry around the little bits you've been maintaining yourself.
I had these slides written before I saw Di's presentation, but also some modern dimension reduction methods like UMAP, which he showed, and these fancy-named things like manifold-based multidimensional scaling. There are things we see in papers but don't necessarily have the tools to fit, and we want you to be able to try them. Maybe they're not effective, maybe they are, but having the ability to even try is important to us.
There are other things too, like the many different imputation tools that are out there; we just want you to be able to use this stuff. So if there's something where you think, "I use this all the time but I can't find it anywhere," just come to our repos and file an issue. We'd like to hear from you.
Resources and getting started
Alright, if you want to learn more beyond this talk, there are three main resources. There's tidymodels.org. We spent a lot of time on that at the very beginning of the pandemic, optimizing it, putting in a lot of articles, a searchable list of functions, and a really nice sequential set of Get Started articles.
Julia Silge and I wrote a book on tidymodels. You can get it from O'Reilly, or just go to this link and read it; it's very helpful if you want longer-form documentation. Also, as some of you may know from the other day, we keep all of our training materials in the same place. We keep all the old versions, so if we change data sets you can always get to the version you used. Go to workshops.tidymodels.org and there's a lot of training material there, and all of it is licensed in a way that you can use it in your classes or for whatever you need.
Code walkthrough
Okay, so let's look at some code. Like the tidyverse, we have a meta-package: if you load tidymodels, you see this big block of output. We load things like dplyr and ggplot2, which are not part of tidymodels, but you're probably going to want them. And, also like the tidyverse, we list the conflicts: dplyr has a filter() function, but the stats package also has a filter() function, and if you just type filter, the one you get depends on which package was loaded more recently.
One thing we added here is the function at the very bottom called tidymodels_prefer(). It uses the package Hadley wrote called conflicted, which manages these conflicts. After running this function, assuming you want to use dplyr with tidymodels, filter() will default to the dplyr version no matter how things get loaded. You don't have to use it, but if you're tired of getting the wrong function from the wrong package, it can be a big help in resolving that.
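The same preference can be set by hand with conflicted; this is a configuration sketch, assuming the conflicted and dplyr packages are installed:

```r
library(conflicted)  # Hadley's conflict-resolution package

# Prefer dplyr's filter() over stats::filter(), regardless of load order
conflict_prefer("filter", "dplyr")

# tidymodels_prefer() does this in bulk for the common
# tidymodels/dplyr clashes:
#   library(tidymodels)
#   tidymodels_prefer()
```

`tidymodels_prefer()` is just the bulk version of these `conflict_prefer()` calls.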
Alright, I have some example data: a little more than 10,000 data points about restaurant food deliveries. Basically, we have the time period from when the order was placed to when it was delivered. There are a few variables. The order time is the hour, 0 to 24. There's the day of the week, a factor: Monday, Tuesday, Wednesday, that kind of thing. There's distance. And there are 27 pre-made indicator-type columns, the result of a multiple-choice question: counts of the things that were ordered, so item 3 might be, say, somebody ordered an iced tea.
So we want to take this information and try to predict how long it would take you or me to get our food at home, or wherever it's being delivered. It's a fairly decent-sized data set, 10,000 data points, and there aren't that many variables.
The first thing we'll do is look at the outcome. I should say, if anybody here does survival analysis, you're probably wondering if any of this data is censored. This is all complete data. Censored means: if I put an order in and it's five minutes later, I don't know when it's going to get here, so I don't know the actual delivery time, but I know it's at least five minutes. These are all complete values where the delivery actually happened and the time is known. And you can see the outcome is right-skewed, which is pretty common with time data.
One thing we might think about is whether to do anything special here, for example logging the outcome and modeling that. I'm just going to keep the outcome the way it is. The only other special thing I'm going to do is split the data into a training set, a test set, and a validation set. The validation set is sort of like a test set, but you can use it along the way to figure out how well your model is working. The test set is for the very, very end, when you want to validate your results.
When I do that split, what I don't want is for all of these outlying, extreme data points to end up accidentally concentrated in the test or the training set. The function in tidymodels called initial_validation_split() will put 60% in the training set, 20% in validation, and the remaining 20% in testing. It does that three-way split by chopping this distribution into quartiles, doing the split within each quartile, and then recombining them, so the distribution of the outcome in all three data sets should be very, very similar.
In the code here, we get about 6,000 data points for model fitting, about 2,000 for validation, and about 2,000 for testing. We try to make function names as obvious as possible: if you have the object with the splitting information and you want the training set, the function is named training() for you, and validation() is the function that brings back the validation set.
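The mechanics of that stratified three-way split can be sketched in base R, using simulated right-skewed "delivery times" as a stand-in for the real data (in tidymodels, `initial_validation_split()` with a `strata` argument does all of this for you):

```r
set.seed(123)
time_to_deliver <- rexp(1000, rate = 1 / 30)  # right-skewed stand-in outcome

# Bin the outcome into quartiles...
bins <- cut(time_to_deliver,
            quantile(time_to_deliver, probs = seq(0, 1, 0.25)),
            include.lowest = TRUE)

# ...then split 60/20/20 *within* each bin, so all three sets
# end up with very similar outcome distributions.
split_bin <- function(idx) {
  idx  <- sample(idx)
  n_tr <- floor(0.6 * length(idx))
  n_va <- floor(0.2 * length(idx))
  list(train      = idx[seq_len(n_tr)],
       validation = idx[n_tr + seq_len(n_va)],
       test       = idx[(n_tr + n_va + 1):length(idx)])
}
parts <- lapply(split(seq_along(time_to_deliver), bins), split_bin)

train_idx <- unlist(lapply(parts, `[[`, "train"), use.names = FALSE)
```

Each quartile contributes proportionally to every set, which is what keeps the extreme values from piling up in one of them.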
I think one of my college professors said that the only way to be comfortable with your data is to never look at it. When you look at this data a little, you look at the time of the order, and clearly the delivery time increases with it, and maybe not linearly. So if we're going to fit something like a linear regression here, we need to account for a non-linear effect.
I used geom_smooth(), and what geom_smooth() does under the hood is use something called a spline. A spline takes the original time variable and expands it into these kind of weird extra columns, and those extra columns are what's actually used in the linear regression. So the smooth line you see here is the prediction from a linear model that uses maybe five or ten of these extra non-linear terms based on the order time.
But if you look a little more closely, it's actually a lot more complicated, because if you estimate that same trend separately for each day of the week the order was made, it's quite different: Friday and Saturday have a very steep non-linear curve, whereas something like Monday is a lot more blunted. So if we just put the single curve in the model, we're going to underfit, because the effect depends on the day. That's what a statistical interaction is: two predictors whose relationships with the outcome depend on one another.
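The spline-plus-interaction idea can be sketched in base R with the splines package (which ships with R). The data here are simulated, a stand-in for the delivery example: a "minutes" outcome that bends upward with the hour on some days but not on Monday.

```r
library(splines)  # ns(): natural spline basis expansion

set.seed(1)
dat <- data.frame(
  hour = runif(300, 11, 23),
  day  = factor(sample(c("Mon", "Fri", "Sat"), 300, replace = TRUE))
)
# Steep non-linear trend on Fri/Sat, blunted on Mon (simulated)
dat$minutes <- 20 + 2 * (dat$hour - 11)^1.5 * (dat$day != "Mon") + rnorm(300)

# One shared curve for every day (this underfits):
fit_main <- lm(minutes ~ day + ns(hour, df = 5), data = dat)

# A separate curve per day: the day x spline interaction
fit_int  <- lm(minutes ~ day * ns(hour, df = 5), data = dat)

anova(fit_main, fit_int)  # the interaction terms capture the day-specific shape
```

`ns(hour, df = 5)` is the "expand one column into several" step; crossing it with the day factor is exactly the interaction the plot suggests.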
Recipes and feature engineering
Alright, so we have this amazing package called recipes, and what it does is prepare your data for modeling. You start by giving it a data set, we'll use the training set here, and a formula. All the formula does here is say: things on the left side of the tilde are outcomes, and everything else, represented by the dot, is a predictor. At this stage, the recipe just records each column's role, predictor or outcome, and looks at the actual columns to record whether they're factors, strings, ordered, numeric, or integer.
The way recipes works is a little like the formula method or model.matrix() in R, but with a dplyr flavor. In dplyr, you keep piping in things you want to do to the data until you have it the way you want it; a recipe works the same way. We pipe in declarations like: hey, I want you to log this column now, and I want you to do this other thing to the data.
The first thing we'll do is create indicator variables for the day of the week with a function called step_dummy(); those indicator variables are sometimes called dummy variables. We're not mad at the variable, that's just the name. All the things we pipe in are called step functions, and they all start with step_. The first argument in all of them is which variables should be affected, and you can use any dplyr selector here, like you would with select().
Beyond that, recipes has a bunch of extra selectors you can use here. You can choose variables not only by their type, character or factor and so on, but also by their role. So if you want to do something to all the predictors, whatever might be in there, you can just say all_predictors() instead of listing them all out.
Alright, what's next? We have a lot of those count columns, I think I called them item. There might be an item in the training set that just happened to almost never get ordered, so the training set has all zeros for it. That's not a big deal for some models, but other models don't like a column of all zeros. We call that a zero-variance predictor. What this step does, when you use the recipe, is look at your training set, search the predictors I selected with all_predictors(), find any column with a single unique value, and mark it for removal.
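What a zero-variance check looks for can be shown in a few lines of base R (toy data of my own; in recipes this is `step_zv()`):

```r
# A toy "training set": item_02 was never ordered, so it is all zeros
train <- data.frame(
  item_01 = c(1, 0, 2, 0, 1),
  item_02 = c(0, 0, 0, 0, 0),
  dist_km = c(2.1, 5.4, 1.0, 3.3, 7.8)
)

# Zero-variance predictor = a column with a single unique value
zero_var <- vapply(train, function(col) length(unique(col)) == 1, logical(1))
names(train)[zero_var]  # "item_02" -- the column step_zv() would drop
```

The key detail is that the check is done on the training set and then the same removal is applied to any future data.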
Then we're going to do splines, exactly what geom_smooth() does. We take our original column of the hour the order came in and generate as many columns as we want to make the fit non-linear. If we add 100 columns, it's going to be a really wiggly line; if we add very few, like 2 or 3, it's going to be a relatively smooth line. We don't know how many to add; the typical parameter for that is called the degrees of freedom of the spline. I'm just going to add 5 columns to represent the non-linearity. Why 5? I just picked it out of the air.
We have a variety of other basis expansions: there are a lot of different ways to do splines, there are polynomial expansions, so there are quite a lot of ways to represent a column non-linearly for the model you're going to use. But you might be saying: I don't know about 5, I'm kind of stuck on that. You don't have to actually pick it; you'll see later that you can tune these options. I can give that argument a value of this function called tune(), and all that does is mark it: hey, you're not going to be able to prepare this recipe until you know what this is, but we'll search over different values of the parameter to figure out what gives the best performance.
You can mark pretty much anything in here as tunable and then optimize it however you want. So, here we are: remember we have dummy variables for the order day, one per factor level. When we make spline terms, they're going to be named order_time_1, order_time_2, all the way up to _5. Now, if you want to add interactions, I checked whether you can do this in base R, and you kind of can: you can make interactions between a factor and a spline expansion, but only for one variable, because base R gives the splines the same names if you do it with a second variable.
So base R can do it for one variable; we can do it for pretty much everything. The way we do that is something called step_interact(), and you can give it whatever interactions you want, and those interactions can use starts_with() or any dplyr selector. We know all our indicators start with order_day, and we know our spline terms start with order_time, and that will let our spline terms, those relationships
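Putting the walkthrough together, the recipe described in this talk would look roughly like the sketch below. The column names (`time_to_deliver`, `order_time`, the `delivery_train` data frame) are my stand-ins for the delivery data, and it assumes the recipes package is loaded:

```r
library(recipes)

delivery_rec <-
  recipe(time_to_deliver ~ ., data = delivery_train) |>
  # day of week -> indicator (dummy) variables
  step_dummy(all_nominal_predictors()) |>
  # drop any item column that is constant in the training set
  step_zv(all_predictors()) |>
  # spline basis for the hour of the order (5 picked out of the air,
  # or deg_free = tune() to optimize it later)
  step_ns(order_time, deg_free = 5) |>
  # day-of-week x spline interactions, selected by prefix
  step_interact(~ starts_with("order_day_"):starts_with("order_time_"))
```

Each step is only a declaration; nothing is computed until the recipe is estimated on the training set, which is what keeps the test and validation data out of the preprocessing.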

