
Transcript
This transcript was generated automatically and may contain errors.
And now I'm going to switch to English, because our next speaker speaks English. The next keynote will be Max Kuhn. He works as a software engineer at Posit, where he leads the tidymodels team.
In 2020, when we were originally organizing the event here, we decided to invite him, but then the pandemic happened and we had to postpone the invitation. At the moment we decided to invite Max, we looked at our list of cool people who do awesome things, and he was there, so: let's invite Max.
In 2023, when we were absolutely sure that we were going to hold the event here, we invited him again: come to Montevideo. And he didn't answer our email, so we moved him from the cool people list to the blacklist for a couple of weeks, until we discovered that he had actually answered our email but it went to the spam folder. We solved the issue, and now Max is here to talk about the tidymodels universe. So thank you, Max, for coming.
Can you hear me okay? Excellent. Thanks for sticking around to the end of the conference, I appreciate it. Now, you may have noticed that we've been using Google Translate during the sessions, and I've realized that if you're going to do that and translate to Spanish or Portuguese, I should try to speak slowly and clearly.
I'm saying this because, let's see, my favorites so far have been: I wrote a package called caret, and, in the US, if somebody calls you a Karen, that's kind of a really mean thing to say. Has anybody ever heard that before? And it translated it as Max Kuhn's Karen package.
So that was that. And then ggplot2 got translated to, was it jpplop2? Yeah, so that's what we're going to call it from now on. We're going to send it to CRAN and break all your dependencies; it's a new name.
So I'm here to talk about tidymodels. We did a tutorial on this, so hopefully this won't be too repetitive if you were there, and hopefully it will be good for the people who weren't. If you want the slides, the link is at the bottom of the slides, in a slightly bigger font here if you want to see it: my GitHub handle is topepo, so it's 2023 Latin R.
Why tidymodels exists
So first I'm going to tell you why tidymodels, and actually caret too, exist. R is very, very good at many types of modeling. It's always been stellar at things like mixed models and hierarchical models, survival analysis, and there are just so many things it does in terms of modeling.
It has quite a long history of really good predictive modeling and machine learning models, so there's a lot of good tooling if you're doing modeling in R. The formula interface, if you've ever used that, like y ~ x, has been around since, I think, 1982, believe it or not, in the S language. All the things they did back then in S have carried through to R, so we've always benefited from having really expressive ways to do modeling.
It's also pretty much always had cutting-edge models. The first real implementation of random forest that people could use that wasn't Fortran was built in R, so a lot of people who do modeling use R as their first prototyping language. Outside of deep learning and a few other areas, especially for tree-based models and regularized models, I think you see things in R before you see them anywhere else.
And if they're not there: the last really good thing pointed out here is that R, and the S language originally, was built as an interface, a set of wrappers around Fortran, if you can believe that. So it's been in the DNA of R and S for quite a while that it's really easy to take things R doesn't do well and do them much faster or better in C or Java or Python or whatever. So there's quite a lot there to love and use, and you can be very effective with it.
But there is a downside, and it's mostly about consistency. The examples I'll go into in a little more detail come down to the fact that package maintainers, and I'll include my earlier self here, tend to write packages for ourselves. So the scope of a package, the things it can do and the things it doesn't do, is largely informed by the things that are important to us at the moment.
Some examples of the consistency problem. Let's say you're fitting a classification model with two classes, like yes or no, and you want the class probabilities back. You might start off by fitting an old-school linear discriminant model, and the standard way of doing that is to take the model fit and put it into the predict() function. With that particular model, to get the probabilities you don't really have to do anything extra: you just call predict() and give it your object. But for a lot of other models you have to say something like type =, and this slide is usually a lot longer with all the options you have: there's type = "response" for logistic regression, sometimes it's "prob", which is the one I like, sometimes it's "posterior", sometimes it's "probability". When I was doing this for a living, before 2000, I'm old, this was very, very frustrating, because you have to remember all the minutiae of every particular package you'd like to try, and that was kind of a pain.
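A small base-R illustration of this inconsistency, using two models that ship with R and built-in data sets (this is my sketch, not a slide from the talk):

```r
# Two classifiers, two different conventions for asking for class probabilities.
library(MASS)  # for lda(); a recommended package that ships with R

# lda(): predict() hands back probabilities by default, in a $posterior matrix
fit_lda <- lda(Species ~ ., data = iris)
p_lda   <- predict(fit_lda, iris)$posterior

# glm(): you have to remember type = "response";
# the default returns values on the link (log-odds) scale instead
fit_glm <- glm(am ~ wt + hp, data = mtcars, family = binomial)
p_glm   <- predict(fit_glm, mtcars, type = "response")
```

Same task, but one model wants nothing and stores the result in `$posterior`, while the other needs `type = "response"`; other packages want `"prob"` or `"probability"`.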
The other thing is that R generally has two interfaces for models. There's the formula method I just discussed, and then there's what we call the non-formula method: you have your predictors in some matrix or data frame x, your outcome in some vector or matrix y, and you call the function with x and y.
Sometimes you want one or the other, or both, depending on how your data come in. So here are five fairly popular machine learning or modeling packages. Let's pick glmnet: between the formula method and the non-formula method, raise your hand if you think it has the formula method.
Okay, thank you. Raise your hand if you think it has the non-formula method. Okay, wow, one versus three, thank you. Anyhow, here's how this all shakes out. If you're using glmnet, it actually only takes matrices, which is kind of a pain, because if you have any categorical predictors it's your job to turn them into indicator variables; if you want to transform a variable, that's your problem. The formula method does that automatically for you. And then when you make new predictions you have to give it a matrix, so it's very limiting.
ranger is a really good random forest package; it's pretty fast and does a lot of things, and it has both the formula method and the non-formula method: you can use a formula or use x and y. The problem is it's all in one function, so they have a single function with arguments formula, data, x, and y, which is kind of weird, but we can get there.
rpart is a decision tree package, very old, and it has just the formula method. I long ago filed an issue to get the non-formula interface, and the maintainer sent me the sources and said, "you should maintain it." I am not maintaining it. The survival package is also formula-only. And then XGBoost, which a lot of people love: not only does it not have a formula method, you have to convert the data to a special sparse matrix format, you append your outcome to that, and if it's a classification model you have to convert your outcome to zero-based integers. So this gives you a sense that, for packages that are all used quite a bit, sometimes you're doing a lot of work on their behalf.
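For the matrix-only interfaces, the manual work looks something like this in base R; `model.matrix()` expands factors into the indicator columns that the formula method would otherwise build for you (a sketch on built-in data, not the talk's actual delivery example):

```r
# glmnet-style interfaces want a numeric matrix x and a vector y.
# Converting the factor to indicator columns is your job:
x <- model.matrix(~ cyl + factor(gear), data = mtcars)[, -1]  # drop intercept
y <- mtcars$mpg

colnames(x)  # "cyl", "factor(gear)4", "factor(gear)5"
# fit <- glmnet::glmnet(x, y)   # what a matrix-only interface expects (not run)
```

And you have to repeat the same conversion for every new data set you predict on, which is exactly the kind of bookkeeping tidymodels takes over.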
So this isn't really something that we like people to deal with. I used to work with engineers a lot, and some of the sharpest, most effective engineers I worked with were called systems engineers. Think about your phone: there's a whole engineering crew at Apple, or at the companies that make Android phones, who are battery engineers, and there are engineers who work on the screen to optimize that, and the antenna, all these groups of engineers. What a systems engineer does is make sure all that stuff works together, that one thing isn't taking so much power that another thing can't work, and so on.
They make it all happen, not at the end but along the way. And I realized maybe a week or two ago that my professional life has mostly been spent as what I'd think of as a systems statistician: we have a lot of packages that do really interesting things, and we want to use them, but they're not necessarily made to be consumed somewhere else, or in a different context than the original author intended. That's a lot of what we do, a lot of what I've been doing, and it's absolutely the goal of tidymodels.
So our job is really to make R less frustrating. But really, it's not that bad: it's very functional, and with a few extra tools you can get very far and build very effective models. And I'll say here: we usually talk about predictive models and machine learning, but tidymodels can do pretty much everything. We have packages for hierarchical models now, survival models, and so on, so it's not just a machine learning library.
Design goals
So you've probably heard of the tidyverse, I know that's kind of a ridiculous thing to say. The tidyverse is built to have packages that work really well together, with a common design goal: the naming is very consistent, and how you interface with the software is more thought out than in anything else I've ever seen.
When Hadley first started working on that, he had the goals listed here: don't make a new data structure that somebody has to learn, try to stick with data frames if you can. Probably the most important one here is "design for humans." Raise your hand if you know the function name for inverting a matrix. Does anybody know what it is? solve(). There's no inverse() function. That's not designing for humans.
It's those kinds of things: we want to make things obvious for you. We don't want you to get a result back and be surprised: wait, why is it in a list and not in a data frame? So when we think about designing for humans in terms of modeling, it's a little different from what they do in the tidyverse, but at the end of the day we want you to enjoy programming and not be constantly frustrated by it.
Some of the things we do: we try to name things and optimize them for autocomplete. If you're doing model tuning, all of our function names for that start with tune_, and then in RStudio or VS Code you can just hit Tab and get the whole list of things, whether they're grids or tune objects or things you can extract. We try to make function names as consistent as possible.
And when I don't, they fix it. The other idea is the notion of tidy data: every variable is a column, every observation forms a row. That's a little weird for us sometimes, because we can have records with multiple sub-records. If you're a patient in a clinical trial, or you're being followed longitudinally, the patient is really the observation, the independent experimental unit, and they might have multiple rows in the data set.
So some of this we have to deal with a little, but the idea is that we want to create modeling packages that work well together, that do a lot of things, and where, when you start using them, even if you haven't used the other parts, you'll already know what to expect.
I've been thinking about modeling for a long time, and as I mentioned in the tutorial, in the process of having a career doing modeling before coming to Posit, mistakes were made. You learn a lot along the way, and a lot of that informs how we design our software.
The first bullet point there is type = "prob": just remember that one, right? When you use our code, we'll make the translation for you, whether the package wants "probability" or "posterior" or whatever. If you want to use the formula method, or not the formula method, we'll handle that for you too, so you don't have to worry about it.
Another thing that's been important to me over the years is resampling and empirical validation: when you have a model and you want to get something out of it, you should be able to quantify its performance, hopefully using something like cross-validation or the bootstrap. I've sat in a lot of presentations where people present logistic regression results and how important it is that the p-value for the interaction is less than 0.05, and then you ask what the accuracy of the model is, and it's like 60%, and you're like, okay.
So there's a lot more to quantifying and validating your model beyond just doing predictive modeling. Another thing we do in tidymodels is design with some hidden guardrails. At least in machine learning, a lot of the problems that occur, the ones you don't necessarily know happened until you get new data, are the result of people using the wrong data at the wrong time for the wrong thing, or using it for multiple things. You sort of have to go out of your way in tidymodels to use the data inappropriately by accident.
We also want people to be able to use the infrastructure to make their own packages pretty easily, and we want to encourage people to do things they couldn't do before, because I've heard a lot of people say: yeah, that's probably a better way to do this, but I don't know how to use that software, or I don't have a license for it, or it hasn't been written and I'm probably not going to write it.
The next slide has the phrase "leveling up." Does everyone know what that means? It means: how can I improve myself or my tool set to get more out of what I'm doing? Just some examples: a lot of times we have predictors in our data, and we use a model that can't handle them as character data, because they're categories, so we have to convert them to indicator variables.
There are a lot of different ways you can do that, and there are methodologies out there that people just didn't have available to them, or the package wasn't easy to use, and so on. Just as examples: there are effect encoding methods that can be very effective in certain situations, and feature hashing is an excellent tool if you have categories with maybe thousands or millions of levels.
Also, multiple-choice predictors. I've had code lying around in scripts for years for this: if I had a model and a predictor was something like "what languages do you speak," one person might have one, somebody else might have four, so having a way to deal with multiple categories per observation is really helpful. We're trying to institutionalize as much of that as we can in tidymodels, so you can just use it and not have to carry around the little bits you've been maintaining yourself.
I had these slides written before I saw Di's presentation, but also some modern dimension reduction methods like UMAP, which he showed, and these fancy-named things like manifold-based multidimensional scaling. There are things we see in papers but don't necessarily have the tools to fit, and we want you to be able to try them. Maybe they're not effective, maybe they are, but having the ability to even try is important to us.
There are other things too, like the many different imputation tools that are out there; we just want you to be able to use this stuff. So if there's something where you think, "I use this all the time but I can't find it anywhere," just come to our repos and file an issue. We'd like to hear from you.
Resources and getting started
Alright, if you want to learn more beyond this talk, there are three main resources. There's tidymodels.org. We spent a lot of time on that at the very beginning of the pandemic, optimizing it, putting in a lot of articles, a searchable list of functions, and a really nice sequential set of Get Started articles.
Julia Silge and I wrote a book on tidymodels. You can get it from O'Reilly, or just go to this link and read it; it's very helpful if you want longer-form documentation. Also, as some of you may know from the other day, we keep all of our training materials in the same place. We keep all the old versions, so if we change data sets you can always get to the version you used. Go to workshops.tidymodels.org and there's a lot of training material there, and all of it is licensed in a way that you can use it in your classes or for whatever you need.
Code walkthrough
Okay, so let's look at some code. Like the tidyverse, we have a meta-package: if you load tidymodels, you see this big block of output. We load things like dplyr and ggplot2, which are not part of tidymodels, but you're probably going to want them. And, also like the tidyverse, we list the conflicts: dplyr has a filter() function, but the stats package also has a filter() function, and if you just type filter, the one you get depends on which package was loaded more recently.
One thing we added here is the function at the very bottom called tidymodels_prefer(). It uses the package Hadley wrote called conflicted, which manages these conflicts. After running this function, assuming you want to use dplyr with tidymodels, filter() will default to the dplyr version no matter how things get loaded. You don't have to use it, but if you're tired of getting the wrong function from the wrong package, it can be a big help in resolving that.
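The same preference can be set by hand with conflicted; this is a configuration sketch, assuming the conflicted and dplyr packages are installed:

```r
library(conflicted)  # Hadley's conflict-resolution package

# Prefer dplyr's filter() over stats::filter(), regardless of load order
conflict_prefer("filter", "dplyr")

# tidymodels_prefer() does this in bulk for the common
# tidymodels/dplyr clashes:
#   library(tidymodels)
#   tidymodels_prefer()
```

`tidymodels_prefer()` is just the bulk version of these `conflict_prefer()` calls.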
Alright, I have some example data: a little more than 10,000 data points about restaurant food deliveries. Basically, we have the time period from when the order was placed to when it was delivered. There are a few variables. The order time is the hour, 0 to 24. There's the day of the week, a factor: Monday, Tuesday, Wednesday, that kind of thing. There's distance. And there are 27 pre-made indicator-type columns, the result of a multiple-choice question: counts of the things that were ordered, so item 3 might be, say, somebody ordered an iced tea.
So we want to take this information and try to predict how long it would take you or me to get our food at home, or wherever it's being delivered. It's a fairly decent-sized data set, 10,000 data points, and there aren't that many variables.
The first thing we'll do is look at the outcome. I should say, if anybody here does survival analysis, you're probably wondering if any of this data is censored. This is all complete data. Censored means: if I put an order in and it's five minutes later, I don't know when it's going to get here, so I don't know the actual delivery time, but I know it's at least five minutes. These are all complete values where the delivery actually happened and the time is known. And you can see the outcome is right-skewed, which is pretty common with time data.
One thing we might think about is whether to do anything special here, for example logging the outcome and modeling that. I'm just going to keep the outcome the way it is. The only other special thing I'm going to do is split the data into a training set, a test set, and a validation set. The validation set is sort of like a test set, but you can use it along the way to figure out how well your model is working. The test set is for the very, very end, when you want to validate your results.
When I do that split, what I don't want is for all of these outlying, extreme data points to end up accidentally concentrated in the test or the training set. The function in tidymodels called initial_validation_split() will put 60% in the training set, 20% in validation, and the remaining 20% in testing. It does that three-way split by chopping this distribution into quartiles, doing the split within each quartile, and then recombining them, so the distribution of the outcome in all three data sets should be very, very similar.
In the code here, we get about 6,000 data points for model fitting, about 2,000 for validation, and about 2,000 for testing. We try to make function names as obvious as possible: if you have the object with the splitting information and you want the training set, the function is named training() for you, and validation() is the function that brings back the validation set.
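The mechanics of that stratified three-way split can be sketched in base R, using simulated right-skewed "delivery times" as a stand-in for the real data (in tidymodels, `initial_validation_split()` with a `strata` argument does all of this for you):

```r
set.seed(123)
time_to_deliver <- rexp(1000, rate = 1 / 30)  # right-skewed stand-in outcome

# Bin the outcome into quartiles...
bins <- cut(time_to_deliver,
            quantile(time_to_deliver, probs = seq(0, 1, 0.25)),
            include.lowest = TRUE)

# ...then split 60/20/20 *within* each bin, so all three sets
# end up with very similar outcome distributions.
split_bin <- function(idx) {
  idx  <- sample(idx)
  n_tr <- floor(0.6 * length(idx))
  n_va <- floor(0.2 * length(idx))
  list(train      = idx[seq_len(n_tr)],
       validation = idx[n_tr + seq_len(n_va)],
       test       = idx[(n_tr + n_va + 1):length(idx)])
}
parts <- lapply(split(seq_along(time_to_deliver), bins), split_bin)

train_idx <- unlist(lapply(parts, `[[`, "train"), use.names = FALSE)
```

Each quartile contributes proportionally to every set, which is what keeps the extreme values from piling up in one of them.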
I think one of my college professors said that the only way to be comfortable with your data is to never look at it. When you look at this data a little, you look at the time of the order, and clearly the delivery time increases with it, and maybe not linearly. So if we're going to fit something like a linear regression here, we need to account for a non-linear effect.
I used geom_smooth(), and what geom_smooth() does under the hood is use something called a spline. A spline takes the original time variable and expands it into these kind of weird extra columns, and those extra columns are what's actually used in the linear regression. So the smooth line you see here is the prediction from a linear model that uses maybe five or ten of these extra non-linear terms based on the order time.
But if you look a little more closely, it's actually a lot more complicated, because if you estimate that same trend separately for each day of the week the order was made, it's quite different: Friday and Saturday have a very steep non-linear curve, whereas something like Monday is a lot more blunted. So if we just put the single curve in the model, we're going to underfit, because the effect depends on the day. That's what a statistical interaction is: two predictors whose relationships with the outcome depend on one another.
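The spline-plus-interaction idea can be sketched in base R with the splines package (which ships with R). The data here are simulated, a stand-in for the delivery example: a "minutes" outcome that bends upward with the hour on some days but not on Monday.

```r
library(splines)  # ns(): natural spline basis expansion

set.seed(1)
dat <- data.frame(
  hour = runif(300, 11, 23),
  day  = factor(sample(c("Mon", "Fri", "Sat"), 300, replace = TRUE))
)
# Steep non-linear trend on Fri/Sat, blunted on Mon (simulated)
dat$minutes <- 20 + 2 * (dat$hour - 11)^1.5 * (dat$day != "Mon") + rnorm(300)

# One shared curve for every day (this underfits):
fit_main <- lm(minutes ~ day + ns(hour, df = 5), data = dat)

# A separate curve per day: the day x spline interaction
fit_int  <- lm(minutes ~ day * ns(hour, df = 5), data = dat)

anova(fit_main, fit_int)  # the interaction terms capture the day-specific shape
```

`ns(hour, df = 5)` is the "expand one column into several" step; crossing it with the day factor is exactly the interaction the plot suggests.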
Recipes and feature engineering
Alright, so we have this amazing package called recipes, and what it does is prepare your data for modeling. You start by giving it a data set, we'll use the training set here, and a formula. All the formula does here is say: things on the left side of the tilde are outcomes, and everything else, represented by the dot, is a predictor. At this stage, the recipe just records each column's role, predictor or outcome, and looks at the actual columns to record whether they're factors, strings, ordered, numeric, or integer.
The way recipes works is a little like the formula method or model.matrix() in R, but with a dplyr flavor. In dplyr, you keep piping in things you want to do to the data until you have it the way you want it; a recipe works the same way. We pipe in declarations like: hey, I want you to log this column now, and I want you to do this other thing to the data.
The first thing we'll do is create indicator variables for the day of the week with a function called step_dummy(); those indicator variables are sometimes called dummy variables. We're not mad at the variable, that's just the name. All the things we pipe in are called step functions, and they all start with step_. The first argument in all of them is which variables should be affected, and you can use any dplyr selector here, like you would with select().
Beyond that, recipes has a bunch of extra selectors you can use here. You can choose variables not only by their type, character or factor and so on, but also by their role. So if you want to do something to all the predictors, whatever might be in there, you can just say all_predictors() instead of listing them all out.
Alright, what's next? We have a lot of those count columns, I think I called them item. There might be an item in the training set that just happened to almost never get ordered, so the training set has all zeros for it. That's not a big deal for some models, but other models don't like a column of all zeros. We call that a zero-variance predictor. What this step does, when you use the recipe, is look at your training set, search the predictors I selected with all_predictors(), find any column with a single unique value, and mark it for removal.
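What a zero-variance check looks for can be shown in a few lines of base R (toy data of my own; in recipes this is `step_zv()`):

```r
# A toy "training set": item_02 was never ordered, so it is all zeros
train <- data.frame(
  item_01 = c(1, 0, 2, 0, 1),
  item_02 = c(0, 0, 0, 0, 0),
  dist_km = c(2.1, 5.4, 1.0, 3.3, 7.8)
)

# Zero-variance predictor = a column with a single unique value
zero_var <- vapply(train, function(col) length(unique(col)) == 1, logical(1))
names(train)[zero_var]  # "item_02" -- the column step_zv() would drop
```

The key detail is that the check is done on the training set and then the same removal is applied to any future data.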
Then we're going to do splines, exactly what geom_smooth() does. We take our original column of the hour the order came in and generate as many columns as we want to make the fit non-linear. If we add 100 columns, it's going to be a really wiggly line; if we add very few, like 2 or 3, it's going to be a relatively smooth line. We don't know how many to add; the typical parameter for that is called the degrees of freedom of the spline. I'm just going to add 5 columns to represent the non-linearity. Why 5? I just picked it out of the air.
We have a variety of other basis expansions: there are a lot of different ways to do splines, there are polynomial expansions, so there are quite a lot of ways to represent a column non-linearly for the model you're going to use. But you might be saying: I don't know about 5, I'm kind of stuck on that. You don't have to actually pick it; you'll see later that you can tune these options. I can give that argument a value of this function called tune(), and all that does is mark it: hey, you're not going to be able to prepare this recipe until you know what this is, but we'll search over different values of the parameter to figure out what gives the best performance.
You can mark pretty much anything in here as tunable and then optimize it however you want. So, here we are: remember we have dummy variables for the order day, one per factor level. When we make spline terms, they're going to be named order_time_1, order_time_2, all the way up to _5. Now, if you want to add interactions, I checked whether you can do this in base R, and you kind of can: you can make interactions between a factor and a spline expansion, but only for one variable, because base R gives the splines the same names if you do it with a second variable.
So base R can do it for one variable; we can do it for pretty much everything. The way we do that is something called step_interact(), and you can give it whatever interactions you want, and those interactions can use starts_with() or any dplyr selector. We know all our indicators start with order_day, and we know our spline terms start with order_time, and that will let our spline terms, those relationships
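Putting the walkthrough together, the recipe described in this talk would look roughly like the sketch below. The column names (`time_to_deliver`, `order_time`, the `delivery_train` data frame) are my stand-ins for the delivery data, and it assumes the recipes package is loaded:

```r
library(recipes)

delivery_rec <-
  recipe(time_to_deliver ~ ., data = delivery_train) |>
  # day of week -> indicator (dummy) variables
  step_dummy(all_nominal_predictors()) |>
  # drop any item column that is constant in the training set
  step_zv(all_predictors()) |>
  # spline basis for the hour of the order (5 picked out of the air,
  # or deg_free = tune() to optimize it later)
  step_ns(order_time, deg_free = 5) |>
  # day-of-week x spline interactions, selected by prefix
  step_interact(~ starts_with("order_day_"):starts_with("order_time_"))
```

Each step is only a declaration; nothing is computed until the recipe is estimated on the training set, which is what keeps the test and validation data out of the preprocessing.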

