
Max Kuhn | parsnip A tidy model interface | RStudio (2019)
parsnip is a new tidymodels package that generalizes model interfaces across packages. The idea is to have a single function interface for specific types of models (e.g., logistic regression) that lets the user choose the computational engine for training. For example, logistic regression could be fit with several R packages, Spark, Stan, and TensorFlow. parsnip also standardizes the return objects and sets up some new features for some upcoming packages. VIEW MATERIALS https://github.com/rstudio/rstudio-conf/tree/master/2019/Parsnip--Max_Kuhn About the Author Max Kuhn Dr. Max Kuhn is a Software Engineer at RStudio. He is the author or maintainer of several R packages for predictive modeling including caret, Cubist, C50 and others. He routinely teaches classes in predictive modeling at rstudio::conf, Predictive Analytics World, and useR! and his publications include work on neuroscience biomarkers, drug discovery, molecular diagnostics, and response surface methodology. He and Kjell Johnson wrote the award-winning book Applied Predictive Modeling in 2013.
image: thumbnail.jpg
Transcript#
This transcript was generated automatically and may contain errors.
So yeah, I'm here to talk to you about a package called parsnip that's been quite a long time in the making. Before I get started, why is this called parsnip? So I have this other package called caret, C-A-R-E-T. And I was joking with some people in my group that the first RStudio package I should make when I joined the company should be carrot, like the vegetable. So then people would be like, I have a problem with caret, and I'd be like, well, is it caret? Or is it carrot? And then somebody, it might have been Hallie, was like, oh, parsnips are white carrots. So we kind of felt like that's a good code name for the project, and then we just kept calling it parsnip.
So Alex stole one of my slides. That's cool. But I've been kind of talking about this forever. What for me, at least, is frustrating with R packages for modeling is that it's almost never the numerics, usually. It's about the user interface, right? And the problem that Alex talked about is that struggle is real, in that you start using an R package, or you see an R package, and you're like, oh, I really want to try that out, and then when you start using it, you're like, oh, no.
And so sometimes people inadvertently, maybe because they don't know anything about R, make the package very difficult to use. So for example, I was looking at one the other day, which I tweeted about and got really angry about, that expected your predictor data not to come in as a data frame, but as a matrix, and that by itself is not a deal breaker, but instead of having factors or dummy variables for your qualitative predictors, they wanted them converted to zero-based integers, right? So that's the most un-R way of doing things. And so I was like, no, I'm not going to do that.
We do have some sort of loose conventions in R about what your modeling package should look like, but you don't have to follow them, and they're not, to be honest with you, all that specific. So as another example, we usually have the formula method for models, and then we have the non-formula method, where you use x and y as your arguments. And so you can never really know when you start using a new package whether you have either of those or both, right? So that can be kind of frustrating.
One really, really frustrating thing is sometimes you get a prediction back, and I love the ranger package, but the predictions that you get out of the predict method aren't actually data frames. They're like a specialized ranger object, which you then have to extract the piece out of that you actually want. And then I'll talk about glmnet a little bit more, but that's another good example where, if you're fitting classification models, you can get, for a prediction, a vector, a matrix, or a multidimensional array back, depending on the data. And that's very frustrating to program with.
So the point is that if you're going from model to model and trying different things, you could get really frustrated. And then here's the same thing that Alex showed, where just the variation in the type argument to predict is substantial. So we want to solve that. You know, I don't want to have to worry about remembering all this stuff for all the special cases when I go to do modeling.
And so I tried solving that, or I did kind of solve that, with caret previously. So caret is like a unified interface to models, and that was written in like 2005, I think. But that's like 2005 code. And that works pretty well, but it's definitely not tidy, right? It's like the most untidy package. And so what I wanted to do is reimplement this sort of model interface, knowing all the things that I know now after implementing it for like 250 models. So parsnip is sort of that part of caret where we're looking at a unified interface that's really consistent with the tidyverse and does some things differently that I've learned to appreciate over time.
How parsnip organizes models
So one thing we do in parsnip is, you know, it's this unified interface, but we decided to organize the models a little bit differently than before. So what we do is we say, you know, what kind of model, generally speaking, are you trying to fit? So are you trying to fit like a K-nearest neighbors model or random forest or logistic regression or, let's say, linear regression, right? So let's define what the type of model is as opposed to using lm or glmnet or what have you. So once we have a specification for the model, what we can do is we can then generalize how to fit them, right?
So if I say, and this is on the next slide, I think, if I say I'm going to fit a linear regression, that usually means like slopes and intercepts, right? And there are a variety of different ways you can estimate that. So in parsnip, what we do is we sort of organize all these models and their interfaces by what you're trying to do as opposed to the way you would try to do it. It has a tidy interface, so it's really consistent with the pipe and all the other tidymodels packages that we have.
It also, and this is a really big deal, very much like broom, we spend a lot of time defining what we think a predictable interface would be. Predictable in the sense that like, okay, if I do predictions on an object, do I know what I'm going to get before I get it? In many packages, you don't know that. So we spend a lot of time trying to come up with like a convention or a guideline publicly. So we publish all this stuff and want feedback as to like what those return values should look like if you're going to get them.
So if you want to, you can follow, I'm not going to click on it, but you can follow this modeling package guideline, which, you know, we started writing, and some of it's specific to tidyverse stuff and other parts of it are specific to just general modeling ideas. And you can see what our decisions were there, but, you know, it's not like completely written in stone. So if you do have opinions, we'd love to hear them, so you can file a GitHub issue and we can discuss them.
There's one post about parsnip on the tidyverse blog. We have another one sort of queued up after the conference that's more about the inner workings of parsnip and how it does what it does the way it does it.
Deferred evaluation and model specification
So you know, one thing about ggplot2 and recipes is they kind of defer evaluation of things. So for example, if you create your ggplot code and then you don't assign that ggplot to an object and you hit enter, what you're doing is you're invoking, whether you know it or not, the print function on that object. So ggplot actually doesn't do anything really until you explicitly print it, whether you use the print command or you save it to an object and print that. That's when all the stuff happens with ggplot, so it's drawing stuff and doing stuff, right? And the same thing with recipes: you define a recipe, but you don't really do anything until you need to prepare or use the recipe, right? So once we start deferring the evaluation of what we want to do, it opens up a lot of doors to make maybe the workflow a little more sensible.
So let's say you have some data, let's say you have, I don't know, some data on cars and maybe their miles per gallon, and maybe there's, I don't know, let's just say 32 data points in the data set. So you know, as a hypothetical example, we have a meta-package called tidymodels, and you can load that and you'll get dplyr and ggplot2 and parsnip and some other stuff. So you know, what we could do is, let's say we want to fit a penalized regression model, like ridge regression. You know, if you're used to neural nets, this is like a weight decay model, but we use an L2 penalty, for the statisticians in the audience. And what we'll say is we want to fit a linear regression, and then we have a little bit of detail about the specifics of that, and let's say we kind of know what the penalty should be, like a really fairly low value of like 0.01.
So what we can do is we can define a specification for that model using this linear_reg function and say the penalty should be this, and if we print it out, you know, it doesn't really have much detail because we haven't specified much detail, and a lot of times the detail comes in terms of how we are going to fit this regression. So at least in parsnip, as we speak, we could fit that using lm or glmnet or Spark or Stan or Keras, right, and that's just the ones we've implemented. So what we did is we kind of decoupled, you know, the estimation procedure and the package that you use to accomplish that from the actual specification of the model, all right?
So you'll see, for example, if you want to use glmnet, we can pipe in what we call the computational engine, and again, the computational engine is sort of a mashing together of the type of estimator, like is it least squares or is it using a Bayesian method or is it using, you know, Keras, with the model package that we would probably use, okay? So, you know, the engine might be like lm or glmnet or keras or stan and that kind of thing. And also one other thing about the engines is they don't have to be in R, right? I mean, with reticulate and all this other stuff, you know, we know how to, and R's always been really good at this, farm out the computations to a different language or platform or something like that.
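The specify-then-pick-an-engine idea described here can be sketched in a few lines of R. This is a minimal sketch, assuming the tidymodels meta-package and the parsnip functions linear_reg() and set_engine(); the penalty value is just the illustrative 0.01 from the talk, and mixture = 0 is my addition to make it a pure ridge (L2) penalty for glmnet.

```r
library(tidymodels)  # loads parsnip, dplyr, ggplot2, and friends

# Specify *what* we want: a linear regression with an L2 penalty of 0.01.
# Note that no data set and no fitting package have been mentioned yet.
ridge_spec <- linear_reg(penalty = 0.01, mixture = 0)

# Now say *how* to estimate it: pick glmnet as the computational engine.
# Swapping "glmnet" for "lm", "stan", "spark", or "keras" changes only
# the engine, not the model specification itself.
ridge_spec <- ridge_spec %>% set_engine("glmnet")
```

Nothing is fit at this point; printing ridge_spec just shows the deferred specification.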
So again, if we're not doing immediate execution of these things, it allows us to set up all the things we need to do to get more general results back. So let's say we start with that regression model specification and we say we want to fit glmnet. You don't really ever need to use this function, but I wrote this function called translate, and what that does is say, well, okay, you said you wanted to fit this kind of model, and now that you're saying you want to use it with glmnet, how would that actually work? And what translate does is it prints out a template or a shell of what that code would look like.
So if you look down here, you can see, okay, we're using the glmnet function in the glmnet package, we don't know what x and y are, and I should also say that glmnet only has x and y. So if your data has dummy variables or, you know, you start off with a data frame, then you've got to do the work of creating your indicator variables, converting it to a matrix, and all that stuff to get glmnet to work. So the underlying code would use that x/y interface for glmnet, lambda is the penalty argument for that particular function, and then, you know, since we know we're doing linear regression, it automatically sets the family for regression, right? So this is like the template of what it's going to use for code when it translates the model specification to the underlying engine code that's going to be used.
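As a rough sketch of the step being described, translate() prints the engine-level call as a template. The output comment below is approximate, from memory of parsnip's behavior rather than an exact capture, so treat it as illustrative only.

```r
library(tidymodels)

linear_reg(penalty = 0.01) %>%
  set_engine("glmnet") %>%
  translate()
# Prints a model fit template along these lines (approximate):
#   glmnet::glmnet(x = missing_arg(), y = missing_arg(),
#                  weights = missing_arg(), lambda = 0.01,
#                  family = "gaussian")
```

The missing_arg() placeholders are filled in later, once fit() sees actual data.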
And, you know, especially for glmnet, we don't usually need the data to make that specification, right? So up to now, I haven't used any data. This could be on mtcars, it could be on, let's see, there's only two data sets, there's mtcars and Boston housing. Right. So, you know, at this point, I haven't said anything about the data, and I'll show you a counterexample of that in a minute, but once we have our data, we can actually fit a model to this specification. And so you can use the fit function: you give it a formula and the data set and it goes out and fits your glmnet model. If you do want to use the formula method because that's more convenient for you, even though glmnet doesn't have one, parsnip will do the same thing that caret and other people do: it will do the work to generate all the dummy variables, track all that stuff that we need to, keep those preprocessing objects, and then fit the actual glmnet model, with everything it needs encapsulated into that parsnip object to make predictions and so on in the future. Right. So you can use fit for the formula method and then fit_xy if you just want to give it x and y.
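Both fitting interfaces can be sketched like this, using mtcars as in the talk. fit() and fit_xy() are real parsnip functions; the choice of mpg as the outcome is mine, for illustration.

```r
library(tidymodels)

spec <- linear_reg(penalty = 0.01) %>% set_engine("glmnet")

# Formula interface: parsnip builds the dummy variables / model matrix
# that glmnet's x/y-only interface needs behind the scenes.
fit_form <- fit(spec, mpg ~ ., data = mtcars)

# x/y interface, if your data are already a predictor table plus an
# outcome vector.
fit_xy_obj <- fit_xy(spec,
                     x = mtcars[, setdiff(names(mtcars), "mpg")],
                     y = mtcars$mpg)
```

Either way, the returned parsnip object carries what it needs to predict on new data later.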
Consistent prediction outputs
So when we do prediction, for example, I shouldn't knock glmnet too much, but, I mean, it is very frustrating that you get very different data types under different situations. So if you're just working with one data set, you might not notice this, but if you do any programming with glmnet, there's like a ton of if-then statements, like, you know, if the number of levels of my outcome is three, then I have to do this; versus one, I do that.
And so what we have is this, you know, this idea that you ought to get a very formally defined output back when you make predictions. And so for regression, at least, our first approach to that is to kind of follow what broom did: you're going to get a table back, and that table is always going to have the same number of rows as the input data set. And I'll show you an example of why that matters in a minute. But in that table for regression, the column name you're always going to get is going to be called .pred, right? No matter what model you use or how the model works, you're always going to get a table for regression with one column of .pred.
So in this case, I'm fitting the model to the first 29 rows and predicting the last three. And I'm going to get three rows back. So why would I make a big deal out of this? Well, what a lot of common R prediction functions do is use na.omit, either explicitly or silently. And so, you know, if I have 100 data points and three of those rows have missing values and I make a prediction, I get 97 rows back. And then I'm like, whoa, now I've got to figure out where the missing three rows are. If I want to merge that into a data set, you know, I've got to do some extra work to get there. And so, you know, we became very, very frustrated with that, hence this idea that you return the number of rows that you started out with.
If I induce a missing value in the first data point and fit the same model and do the prediction, I always get an NA back. Okay. So you can always just like bind columns to your data frame and not have to worry about is it going to match up or not.
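The row-count guarantee being demonstrated can be sketched as follows. The 29/3 split and the induced NA mirror the talk's demo; the exact column chosen for the missing value (disp here) is my assumption.

```r
library(tidymodels)

spec <- linear_reg(penalty = 0.01) %>% set_engine("glmnet")

training <- mtcars[1:29, ]
holdout  <- mtcars[30:32, ]
holdout$disp[1] <- NA  # induce a missing value in the first holdout row

fitted <- fit(spec, mpg ~ ., data = training)

# Always a tibble, always one row per input row, always a .pred column;
# the row with missing data comes back as NA instead of being dropped.
preds <- predict(fitted, new_data = holdout)

# Because the row counts match, binding columns is always safe:
dplyr::bind_cols(holdout, preds)
```

That last line is the payoff: no bookkeeping about which rows survived.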
Now one other thing about glmnet, which is really awesome, is glmnet has this penalty parameter, and I specified it here, and that's kind of an unusual thing to do with glmnet. One really cool aspect of this particular model is that with one model fit, it can get the entire path or entire spectrum of lambda values. So all the lambda values for that model are kind of encapsulated into that glmnet object. So when I get predictions for glmnet, I can say, well, give me this lambda or that lambda or this other one, over and over again. But the smarter thing would be to not specify lambda and basically get the glmnet object that can predict any lambda at once.
Okay. And that's a really cool aspect of that model. The problem with doing that with this package, though, is it gives you like a bunch of labeled columns and you have to kind of trust that the lambda values, the penalty values that you're predicting at correspond to those. And so, you know, the first time I looked at this, it doesn't say anywhere that they're going to go in increasing order or decreasing order. So it's a little bit scary to use if you first start using it.
So since glmnet can return multiple predictions at different lambdas for the same row, what do we do with this? Like, you know, I start with an input of three rows and I want to come out with an output of three rows. How are we going to take care of that? So what we do is we basically produce a list column. So in this case, there are 80 possible values of lambda that it fit. And when I make the prediction, instead of using the predict method, because not all models have this feature, we use another function called multi_predict. And there's a silent penalty argument here for glmnet, and by default it uses all the lambdas. And so in this particular instance, there are 80 possible lambdas, so what you get is a table back for each row, and that table has two columns and 80 rows.
And so if I look at the first row of that data set, that table, and remember the holdout object here had a missing data point in its first value, what it does is it basically gives you a table with 80 missing values for the .pred and then all the corresponding lambdas that you would have gotten predictions for if you hadn't had missing data. Maybe more informative is to look at the second one, and then you can see, if we look at the first five rows of that particular table, you know, we have our predicted values across all the lambdas. And so you might say, well, jeez, you know, I can't ggplot that. So the good news is that unnest and tidyr will simply just make that happen for you. So that's a nice little feature.
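The multi_predict() step can be sketched like this. multi_predict() is a real parsnip function; my recollection is that leaving penalty unset uses the fitted glmnet path, and the unnest() call assumes tidyr 1.0-style syntax, which postdates this talk, so treat the details as an assumption.

```r
library(tidymodels)

# Fit without fixing the penalty, so glmnet keeps its whole lambda path.
spec   <- linear_reg() %>% set_engine("glmnet")
fitted <- fit(spec, mpg ~ ., data = mtcars[1:29, ])

# One row per holdout row; .pred is a list column where each element is
# a small tibble of penalty values and their predictions.
path_preds <- multi_predict(fitted, new_data = mtcars[30:32, ])

# tidyr flattens it to one row per (observation, penalty) pair,
# ready for ggplot2.
tidyr::unnest(path_preds, cols = .pred)
```

The list-column design is what lets the outer table keep its one-row-per-input guarantee even when each input has many predictions.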
So what we want to do is we want to have these, like, defined standards. So another good example is if you're doing, like, quantile regression, you might want to get predictions on, like, I don't know, like 10 or 15 or 100 different quantiles. So rather than you having to program your way around that, now you get back a table with those values in it, and you don't have to worry about doing special things for quantile regression versus something else.
Data descriptors
One last thing I'll talk about is this idea of data descriptors. So if you think about, like, random forest, which a lot of people have seen, random forest, the main tuning parameter is something called mtry. And when random forest goes to build a tree, mtry is the number of randomly selected predictors it will choose as candidates for that split. So if you say mtry equals 3, and you have 100 predictors, it will choose a random 3 out of that 100 as candidate variables to split on. And then it gets the next split and chooses another 3. And there's good reason, it sounds like a weird thing to do, but there's a very good reason that they do that, and it actually has a big effect upon performance.
So when I said we usually don't need the data to specify a model, that's really not true for a random forest, because mtry is related to the number of columns in the data set. So we were starting to think about, like, geez, how would you write a specification that does involve the data, where you need something that uses the characteristics of the data? So we have this thing called data descriptors. You can see them listed here: these little functions can capture different aspects of the data, if not the data themselves. So if I want to know how many predictors I have, we have a little data descriptor function that will calculate that, before and after any dummy variables are created.
So what you can do is you can use these little functions in the model specification. So let's say I want to fit a random forest model, and I want to use 75% of the predictors, whether I have 10 predictors or 1,000 predictors or whatnot. What you can do is you can start the specification for random forest, tell it, you know, let's say you want to tell it how many trees to use, and then you can say, well, give me the number of predictors before dummy variables, you know, find 75% of those, and make sure I get an integer by using floor here. Now when you run this, like when you run this model specification, of course there's no data here, right? So what it does is it just saves that expression, and whether you're fitting this model on the entire training set or you're resampling with, let's say, cross-validation would be about 90% of your training set, it does this calculation every time, all right?
So if we use translate here, what you see is, substituted in instead of trees, the thing you would have had to remember for the randomForest package, which is ntree, and then you still see this expression here that has not been evaluated yet, okay? So when I go to evaluate that using my data, you get a value of 7, so it runs only at that point and substitutes the value in there for the data that you're actually using when you invoke the fit command, and you can see, you know, we have one column for the number of outcomes, and then you can see that we actually get the right value here.
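The data-descriptor idea can be sketched as follows. .preds() is a real parsnip data descriptor (the number of predictors before dummy variables); the choice of the ranger engine and 1,000 trees here is mine, for illustration.

```r
library(tidymodels)

# Use 75% of the predictors at each split, however many predictors the
# data turn out to have. .preds() is not evaluated here: the expression
# is saved and only resolved when fit() sees the data, so resampling
# recomputes it for every fold.
rf_spec <- rand_forest(mode  = "regression",
                       mtry  = floor(.preds() * 0.75),
                       trees = 1000) %>%
  set_engine("ranger")

# With mtcars (10 predictors for mpg), .preds() resolves to 10 at fit
# time, so mtry becomes floor(10 * 0.75) = 7, matching the talk.
rf_fit <- fit(rf_spec, mpg ~ ., data = mtcars)
```

Because the expression rather than a number is stored, the same specification works unchanged whether the data have 10 predictors or 1,000.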
What's next for parsnip
So there's a ton more to talk about with parsnip, but, you know, I'm going to get the stink eye from Davis if I keep talking, so, you know, just a quick look at the things we're thinking about. We want to think about how we're talking about models, right? So sometimes how we talk about models is related to the way the data is structured. So you know, I used to do a lot of work where I had repeated measures on a particular experimental unit. So like if you have a database with your customers in them, you might have multiple rows over time for your customers, or if you're in a clinical trial, you might follow patients over time.
So if you have a really simple repeated measures experiment where you have a single sort of clustering effect or random effect, like patient or customer, you might want to fit some sort of model for that. You might fit it using a random intercept model, or you might use a hierarchical Bayesian model, or you might fit it using a correlated error model, like GEE. So what we could do is have the different engines reflect something about the model and the experimental design for that model. And so you can try all these different things, and in some ways the models they estimate are either functionally identical or extremely similar in spirit. So try different things. It doesn't have to be like random forest models; sometimes we can, you know, confound that with the type of data that we're using, okay? So thanks a lot. Appreciate you coming.
Q&A
Think we have time for one question.
Thank you for this, you know, really, really exciting and very useful development. It's really great to see it. I'd really like to use it in production. Now, in November, we got 0.0.1 of parsnip on CRAN. Would you recommend that we actually go towards the customer with models built on there, or how long do we have to wait?
That's kind of a tough question to answer. Three, four years tops. Five. No more than five. No, I mean, okay, yeah, this is a good question. I mean, it's the first version. We want to see what people think and what things people encounter. It's been out for a few months. We've had almost no GitHub issues, which hopefully means people are using it and they're not finding any.
But the main thing I would say about that is parsnip and rsample and a lot of other things are pieces of a wider puzzle, and, you know, two or three of those pieces don't exist yet. So for example, like I have here, we want to integrate parsnip with recipes and things like that. So we're going to have, I don't know if it's going to be called a pipeline, but we're going to have like a pipeline object that you can then use fit on. So I wouldn't say that parsnip's not ready for prime time. I'd just say that, well, you can use it, but you'd still be lacking a lot of things that you would get otherwise. So I'm hoping, by this time next year, let's say, that if you were to ask that question, I'd say, yeah, we're good. So in my brain, that's what I anticipate the timeline would be, but that's what I think.

