
Max Kuhn | parsnip A tidy model interface | RStudio (2019)
parsnip is a new tidymodels package that generalizes model interfaces across packages. The idea is to have a single function interface for specific types of models (e.g., logistic regression) that lets the user choose the computational engine for training. For example, logistic regression could be fit with several R packages, Spark, Stan, and TensorFlow. parsnip also standardizes the return objects and sets up some new features for some upcoming packages. VIEW MATERIALS https://github.com/rstudio/rstudio-conf/tree/master/2019/Parsnip--Max_Kuhn About the Author Max Kuhn Dr. Max Kuhn is a Software Engineer at RStudio. He is the author or maintainer of several R packages for predictive modeling including caret, Cubist, C50 and others. He routinely teaches classes in predictive modeling at rstudio::conf, Predictive Analytics World, and useR! and his publications include work on neuroscience biomarkers, drug discovery, molecular diagnostics, and response surface methodology. He and Kjell Johnson wrote the award-winning book Applied Predictive Modeling in 2013.
image: thumbnail.jpg
Transcript#
This transcript was generated automatically and may contain errors.
So yeah, I'm here to talk to you about a package called parsnip that's been quite a long time in the making. Before I get started, why is this called parsnip? So I have this other package called caret, C-A-R-E-T. And I was joking with some people in my group that the first RStudio package I should make when I joined the company should be carrot, like the vegetable. So then people would be like, I have a problem with caret, and I'd be like, well, is it caret? Or is it carrot? And then somebody, it might have been Hallie, was like, oh, parsnips are white carrots. So we kind of felt like that's a good code name for the project, and then we just kept calling it parsnip.
So Alex stole one of my slides. That's cool. But I've been kind of talking about this forever. What for me, at least, is frustrating with R packages for modeling is that it's almost never the numerics, usually. It's about the user interface, right? And the problem that Alex talked about is that struggle is real, in that you start using an R package, or you see an R package, and you're like, oh, I really want to try that out, and then when you start using it, you're like, oh, no.
And so sometimes people inadvertently, maybe because they don't know anything about R, make the package very difficult to use. So for example, I was looking at one the other day, which I tweeted about and got really angry about, that expected your predictor data not to come in as a data frame, but as a matrix, and that by itself is not a deal breaker, but instead of having factors or dummy variables for your qualitative predictors, they wanted them converted to zero-based integers, right? So that's the most un-R way of doing things. And so I was like, no, I'm not going to do that.
We do have some sort of loose conventions in R about what your modeling package should look like, but you don't have to follow them, and they're not, to be honest with you, all that specific. So as another example, we usually have the formula method for models, and then we have the non-formula method, where you use x and y as your arguments. And so you can never really know when you start using a new package whether you have either of those or both, right? So that can be kind of frustrating.
One really, really frustrating thing is sometimes you get a prediction back, and I love the ranger package, but the predictions that you get out of the predict method aren't actually data frames. They're like a specialized ranger object, which you then have to extract the piece out of that you actually want. And then I'll talk about glmnet a little bit more, but that's another good example where, if you're fitting classification models, you can get, for a prediction, a vector, a matrix, or a multidimensional array back, depending on the data. And that's very frustrating to program with.
So the point is that if you're going from model to model and trying different things, you could get really frustrated. And then here's the same thing that Alex showed, where just the variation in the type argument to predict is substantial. So we want to solve that. You know, I don't want to have to worry about remembering all this stuff for all the special cases when I go to do modeling.
And so I tried solving that, or I did kind of solve that, with caret previously. So caret is like a unified interface to models, and that was written in like 2005, I think. But that's like 2005 code. And that works pretty well, but it's definitely not tidy, right? It's like the most untidy package. And so what I wanted to do is reimplement this sort of model interface, knowing all the things that I know now after implementing it for like 250 models. So parsnip is sort of that part of caret where we're looking at a unified interface that's really consistent with the tidyverse and does some things differently that I've learned to appreciate over time.
How parsnip organizes models
So one thing we do in parsnip is, you know, it's this unified interface, but we decided to organize the models a little bit differently than before. So what we do is we say, you know, what kind of model, generally speaking, are you trying to fit? So are you trying to fit like a K-nearest neighbors model or random forest or logistic regression or, let's say, linear regression, right? So let's define what the type of model is as opposed to using lm or glmnet or what have you. So once we have a specification for the model, what we can do is we can then generalize how to fit them, right?
So if I say, and this is on the next slide, I think, if I say I'm going to fit a linear regression, that usually means like slopes and intercepts, right? And there are a variety of different ways you can estimate that. So in parsnip, what we do is we sort of organize all these models and their interfaces by what you're trying to do as opposed to the way you would try to do it. It has a tidy interface, so it's really consistent with the pipe and all the other tidymodels packages that we have.
It also, and this is a really big deal, very much like broom, we spend a lot of time defining what we think a predictable interface would be. Predictable in the sense that like, okay, if I do predictions on an object, do I know what I'm going to get before I get it? In many packages, you don't know that. So we spend a lot of time trying to come up with like a convention or a guideline publicly. So we publish all this stuff and want feedback as to like what those return values should look like if you're going to get them.
So if you want to, you can follow, I'm not going to click on it, but you can follow this modeling package guideline, which, you know, we started writing, and some of it's specific to tidyverse stuff and other parts of it are specific to just general modeling ideas. And you can see what our decisions were there, but, you know, it's not like completely written in stone. So if you do have opinions, we'd love to hear them, so you can file a GitHub issue and we can discuss them.
There's one post about parsnip on the tidyverse blog. We have another one sort of queued up after the conference that's more about the inner workings of parsnip and how it does what it does the way it does it.
Deferred evaluation and model specification
So you know, one thing about ggplot2 and recipes is they kind of defer evaluation of things. So for example, if you create your ggplot code and then you don't assign that ggplot to an object and you hit enter, what you're doing is you're invoking, whether you know it or not, the print function on that object. So ggplot actually doesn't do anything really until you explicitly print it, whether you use the print command or you save it to an object and print that. That's when all the stuff happens with ggplot, so it's drawing stuff and doing stuff, right? And the same thing with recipes: you define a recipe, but you don't really do anything until you need to prepare or use the recipe, right? So once we start deferring the evaluation of what we want to do, it opens up a lot of doors to make maybe the workflow a little more sensible.
So let's say you have some data, let's say you have, I don't know, some data on cars and maybe their miles per gallon, and maybe there's, I don't know, let's just say 32 data points in the data set. So you know, as a hypothetical example, we have a meta-package called tidymodels, and you can load that and you'll get dplyr and ggplot2 and parsnip and some other stuff. So you know, what we could do is, let's say we want to fit a penalized regression model, like ridge regression. You know, if you're used to neural nets, this is like a weight decay model, but we use an L2 penalty, for the statisticians in the audience. And what we'll say is we want to fit a linear regression, and then we have a little bit of detail about the specifics of that, and let's say we kind of know what the penalty should be, like a really fairly low value of like 0.01.
So what we can do is we can define a specification for that model using this linear_reg function and say the penalty should be this, and if we print it out, you know, it doesn't really have much detail because we haven't specified much detail, and a lot of times the detail comes in terms of how we are going to fit this regression. So at least in parsnip, as we speak, we could fit that using lm or glmnet or Spark or Stan or Keras, right, and that's just the ones we've implemented. So what we did is we kind of decoupled, you know, the estimation procedure and the package that you use to accomplish that from the actual specification of the model, all right?
So you'll see, for example, if you want to use glmnet, we can pipe in what we call the computational engine, and again, the computational engine is sort of a mashing together of the type of estimator, like is it least squares or is it using a Bayesian method or is it using, you know, Keras, with the model package that we would probably use, okay? So, you know, the engine might be like lm or glmnet or keras or stan and that kind of thing. And also one other thing about the engines is they don't have to be in R, right? I mean, with reticulate and all this other stuff, you know, we know how to, and R's always been really good at this, farm out the computations to a different language or platform or something like that.
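The specify-then-pick-an-engine idea described here can be sketched in a few lines of R. This is a minimal sketch, assuming the tidymodels meta-package and the parsnip functions linear_reg() and set_engine(); the penalty value is just the illustrative 0.01 from the talk, and mixture = 0 is my addition to make it a pure ridge (L2) penalty for glmnet.

```r
library(tidymodels)  # loads parsnip, dplyr, ggplot2, and friends

# Specify *what* we want: a linear regression with an L2 penalty of 0.01.
# Note that no data set and no fitting package have been mentioned yet.
ridge_spec <- linear_reg(penalty = 0.01, mixture = 0)

# Now say *how* to estimate it: pick glmnet as the computational engine.
# Swapping "glmnet" for "lm", "stan", "spark", or "keras" changes only
# the engine, not the model specification itself.
ridge_spec <- ridge_spec %>% set_engine("glmnet")
```

Nothing is fit at this point; printing ridge_spec just shows the deferred specification.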
So again, if we're not doing immediate execution of these things, it allows us to set up all the things we need to do to get more general results back. So let's say we start with that regression model specification and we say we want to fit glmnet. You don't really ever need to use this function, but I wrote this function called translate, and what that does is say, well, okay, you said you wanted to fit this kind of model, and now that you're saying you want to use it with glmnet, how would that actually work? And what translate does is it prints out a template or a shell of what that code would look like.
So if you look down here, you can see, okay, we're using the glmnet function in the glmnet package, we don't know what x and y are, and I should also say that glmnet only has x and y. So if your data has dummy variables or, you know, you start off with a data frame, then you've got to do the work of creating your indicator variables, converting it to a matrix, and all that stuff to get glmnet to work. So the underlying code would use that x/y interface for glmnet, lambda is the penalty argument for that particular function, and then, you know, since we know we're doing linear regression, it automatically sets the family for regression, right? So this is like the template of what it's going to use for code when it translates the model specification to the underlying engine code that's going to be used.
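As a rough sketch of the step being described, translate() prints the engine-level call as a template. The output comment below is approximate, from memory of parsnip's behavior rather than an exact capture, so treat it as illustrative only.

```r
library(tidymodels)

linear_reg(penalty = 0.01) %>%
  set_engine("glmnet") %>%
  translate()
# Prints a model fit template along these lines (approximate):
#   glmnet::glmnet(x = missing_arg(), y = missing_arg(),
#                  weights = missing_arg(), lambda = 0.01,
#                  family = "gaussian")
```

The missing_arg() placeholders are filled in later, once fit() sees actual data.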
And, you know, especially for glmnet, we don't usually need the data to make that specification, right? So up to now, I haven't used any data. This could be on mtcars, it could be on, let's see, there's only two data sets, there's mtcars and Boston housing. Right. So, you know, at this point, I haven't said anything about the data, and I'll show you a counterexample of that in a minute, but once we have our data, we can actually fit a model to this specification. And so you can use the fit function: you give it a formula and the data set and it goes out and fits your glmnet model. If you do want to use the formula method because that's more convenient for you, even though glmnet doesn't have one, parsnip will do the same thing that caret and other people do: it will do the work to generate all the dummy variables, track all that stuff that we need to, keep those preprocessing objects, and then fit the actual glmnet model, with everything it needs encapsulated into that parsnip object to make predictions and so on in the future. Right. So you can use fit for the formula method and then fit_xy if you just want to give it x and y.
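Both fitting interfaces can be sketched like this, using mtcars as in the talk. fit() and fit_xy() are real parsnip functions; the choice of mpg as the outcome is mine, for illustration.

```r
library(tidymodels)

spec <- linear_reg(penalty = 0.01) %>% set_engine("glmnet")

# Formula interface: parsnip builds the dummy variables / model matrix
# that glmnet's x/y-only interface needs behind the scenes.
fit_form <- fit(spec, mpg ~ ., data = mtcars)

# x/y interface, if your data are already a predictor table plus an
# outcome vector.
fit_xy_obj <- fit_xy(spec,
                     x = mtcars[, setdiff(names(mtcars), "mpg")],
                     y = mtcars$mpg)
```

Either way, the returned parsnip object carries what it needs to predict on new data later.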
Consistent prediction outputs
So when we do prediction, for example, I shouldn't knock glmnet too much, but, I mean, it is very frustrating that you get very different data types under different situations. So if you're just working with one data set, you might not notice this, but if you do any programming with glmnet, there's like a ton of if-then statements, like, you know, if the number of levels of my outcome is three, then I have to do this; versus one, I do that.
And so what we have is this, you know, this idea that you ought to get a very formally defined output back when you make predictions. And so for regression, at least, our first approach to that is to kind of follow what broom did: you're going to get a table back, and that table is always going to have the same number of rows as the input data set. And I'll show you an example of why that matters in a minute. But in that table for regression, the column name you're always going to get is going to be called .pred, right? No matter what model you use or how the model works, you're always going to get a table for regression with one column of .pred.
So in this case, I'm fitting the model to the first 29 rows and predicting the last three. And I'm going to get three rows back. So why would I make a big deal out of this? Well, what a lot of common R prediction functions do is use na.omit, either explicitly or silently. And so, you know, if I have 100 data points and three of those rows have missing values and I make a prediction, I get 97 rows back. And then I'm like, whoa, now I've got to figure out where the missing three rows are. If I want to merge that into a data set, you know, I've got to do some extra work to get there. And so, you know, we became very, very frustrated with that, hence this idea that you return the number of rows that you started out with.
If I induce a missing value in the first data point and fit the same model and do the prediction, I always get an NA back. Okay. So you can always just like bind columns to your data frame and not have to worry about is it going to match up or not.
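The row-count guarantee being demonstrated can be sketched as follows. The 29/3 split and the induced NA mirror the talk's demo; the exact column chosen for the missing value (disp here) is my assumption.

```r
library(tidymodels)

spec <- linear_reg(penalty = 0.01) %>% set_engine("glmnet")

training <- mtcars[1:29, ]
holdout  <- mtcars[30:32, ]
holdout$disp[1] <- NA  # induce a missing value in the first holdout row

fitted <- fit(spec, mpg ~ ., data = training)

# Always a tibble, always one row per input row, always a .pred column;
# the row with missing data comes back as NA instead of being dropped.
preds <- predict(fitted, new_data = holdout)

# Because the row counts match, binding columns is always safe:
dplyr::bind_cols(holdout, preds)
```

That last line is the payoff: no bookkeeping about which rows survived.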
Now one other thing about glmnet, which is really awesome, is glmnet has this penalty parameter, and I specified it here, and that's kind of an unusual thing to do with glmnet. One really cool aspect of this particular model is that with one model fit, it can get the entire path or entire spectrum of lambda values. So all the lambda values for that model are kind of encapsulated into that glmnet object. So when I get predictions for glmnet, I can say, well, give me this lambda or that lambda or this other one, over and over again. But the smarter thing would be to not specify lambda and basically get the glmnet object that can predict any lambda at once.
Okay. And that's a really cool aspect of that model. The problem with doing that with this package, though, is it gives you like a bunch of labeled columns and you have to kind of trust that the lambda values, the penalty values that you're predicting at correspond to those. And so, you know, the first time I looked at this, it doesn't say anywhere that they're going to go in increasing order or decreasing order. So it's a little bit scary to use if you first start using it.
So since glmnet can return multiple predictions at different lambdas for the same row, what do we do with this? Like, you know, I start with an input of three rows and I want to come out with an output of three rows. How are we going to take care of that? So what we do is we basically produce a list column. So in this case, there are 80 possible values of lambda that it fit. And when I make the prediction, instead of using the predict method, because not all models have this feature, we use another function called multi_predict. And there's a silent penalty argument here for glmnet, and by default it uses all the lambdas. And so in this particular instance, there are 80 possible lambdas, so what you get is a table back for each row, and that table has two columns and 80 rows.
And so if I look at the first row of that data set, that table, and remember the holdout object here had a missing data point in its first value, what it does is it basically gives you a table with 80 missing values for the .pred and then all the corresponding lambdas that you would have gotten predictions for if you hadn't had missing data. Maybe more informative is to look at the second one, and then you can see, if we look at the first five rows of that particular table, you know, we have our predicted values across all the lambdas. And so you might say, well, jeez, you know, I can't ggplot that. So the good news is that unnest and tidyr will simply just make that happen for you. So that's a nice little feature.
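The multi_predict() step can be sketched like this. multi_predict() is a real parsnip function; my recollection is that leaving penalty unset uses the fitted glmnet path, and the unnest() call assumes tidyr 1.0-style syntax, which postdates this talk, so treat the details as an assumption.

```r
library(tidymodels)

# Fit without fixing the penalty, so glmnet keeps its whole lambda path.
spec   <- linear_reg() %>% set_engine("glmnet")
fitted <- fit(spec, mpg ~ ., data = mtcars[1:29, ])

# One row per holdout row; .pred is a list column where each element is
# a small tibble of penalty values and their predictions.
path_preds <- multi_predict(fitted, new_data = mtcars[30:32, ])

# tidyr flattens it to one row per (observation, penalty) pair,
# ready for ggplot2.
tidyr::unnest(path_preds, cols = .pred)
```

The list-column design is what lets the outer table keep its one-row-per-input guarantee even when each input has many predictions.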
So what we want to do is we want to have these, like, defined standards. So another good example is if you're doing, like, quantile regression, you might want to get predictions on, like, I don't know, like 10 or 15 or 100 different quantiles. So rather than you having to program your way around that, now you get back a table with those values in it, and you don't have to worry about doing special things for quantile regression versus something else.
Data descriptors
One last thing I'll talk about is this idea of data descriptors. So if you think about, like, random forest, which a lot of people have seen, random forest, the main tuning parameter is something called mtry. And when random forest goes to build a tree, mtry is the number of randomly selected predictors it will choose as candidates for that split. So if you say mtry equals 3, and you have 100 predictors, it will choose a random 3 out of that 100 as candidate variables to split on. And then it gets the next split and chooses another 3. And there's good reason, it sounds like a weird thing to do, but there's a very good reason that they do that, and it actually has a big effect upon performance.
So when I said we usually don't need the data to specify a model, that's really not true for a random forest, because mtry is related to the number of columns in the data set. So we were starting to think about, like, geez, how would you write a specification that does involve the data, where you need something that uses the characteristics of the data? So we have this thing called data descriptors. You can see them listed here: these little functions can capture different aspects of the data, if not the data themselves. So if I want to know how many predictors I have, we have a little data descriptor function that will calculate that, before and after any dummy variables are created.
So what you can do is you can use these little functions in the model specification. So let's say I want to fit a random forest model, and I want to use 75% of the predictors, whether I have 10 predictors or 1,000 predictors or whatnot. What you can do is you can start the specification for random forest, tell it, you know, let's say you want to tell it how many trees to use, and then you can say, well, give me the number of predictors before dummy variables, you know, find 75% of those, and make sure I get an integer by using floor here. Now when you run this, like when you run this model specification, of course there's no data here, right? So what it does is it just saves that expression, and whether you're fitting this model on the entire training set or you're resampling with, let's say, cross-validation would be about 90% of your training set, it does this calculation every time, all right?
So if we use translate here, what you see is, substituted in instead of trees, the thing you would have had to remember for the randomForest package, which is ntree, and then you still see this expression here that has not been evaluated yet, okay? So when I go to evaluate that using my data, you get a value of 7, so it runs only at that point and substitutes the value in there for the data that you're actually using when you invoke the fit command, and you can see, you know, we have one column for the number of outcomes, and then you can see that we actually get the right value here.
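The data-descriptor idea can be sketched as follows. .preds() is a real parsnip data descriptor (the number of predictors before dummy variables); the choice of the ranger engine and 1,000 trees here is mine, for illustration.

```r
library(tidymodels)

# Use 75% of the predictors at each split, however many predictors the
# data turn out to have. .preds() is not evaluated here: the expression
# is saved and only resolved when fit() sees the data, so resampling
# recomputes it for every fold.
rf_spec <- rand_forest(mode  = "regression",
                       mtry  = floor(.preds() * 0.75),
                       trees = 1000) %>%
  set_engine("ranger")

# With mtcars (10 predictors for mpg), .preds() resolves to 10 at fit
# time, so mtry becomes floor(10 * 0.75) = 7, matching the talk.
rf_fit <- fit(rf_spec, mpg ~ ., data = mtcars)
```

Because the expression rather than a number is stored, the same specification works unchanged whether the data have 10 predictors or 1,000.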
What's next for parsnip
So there's a ton more to talk about with parsnip, but, you know, I'm going to get the stink eye from Davis if I keep talking, so, you know, just a quick look at the things we're thinking about. We want to think about how we're talking about models, right? So sometimes how we talk about models is related to the way the data is structured. So you know, I used to do a lot of work where I had repeated measures on a particular experimental unit. So like if you have a database with your customers in them, you might have multiple rows over time for your customers, or if you're in a clinical trial, you might follow patients over time.
So if you have a really simple repeated measures experiment where you have a single sort of clustering effect or random effect, like patient or customer, you might want to fit some sort of model for that. You might fit it using a random intercept model, or you might use a hierarchical Bayesian model, or you might fit it using a correlated error model, like GEE. So what we could do is have the different engines reflect something about the model and the experimental design for that model. And so you can try all these different things, and in some ways the models they estimate are either functionally identical or extremely similar in spirit. So try different things. It doesn't have to be like random forest models; sometimes we can, you know, confound that with the type of data that we're using, okay? So thanks a lot. Appreciate you coming.
Q&A
Think we have time for one question.
Thank you for this, you know, really, really exciting and very useful development. It's really great to see it. I'd really like to use it in production. Now, in November, we got 0.0.1 of parsnip on CRAN. Would you recommend that we actually go towards the customer with models built on there, or how long do we have to wait?
That's kind of a tough question to answer. Three, four years tops. Five. No more than five. No, I mean, okay, yeah, this is a good question. I mean, it's the first version. We want to see what people think and what things people encounter. It's been out for a few months. We've had almost no GitHub issues, which hopefully means people are using it and they're not finding any.
But the main thing I would say about that is parsnip and rsample and a lot of other things are pieces of a wider puzzle, and, you know, two or three of those pieces don't exist yet. So for example, like I have here, we want to integrate parsnip with recipes and things like that. So we're going to have, I don't know if it's going to be called a pipeline, but we're going to have like a pipeline object that you can then use fit on. So I wouldn't say that parsnip's not ready for prime time. I'd just say that, well, you can use it, but you'd still be lacking a lot of things that you would get otherwise. So I'm hoping, by this time next year, let's say, that if you were to ask that question, I'd say, yeah, we're good. So in my brain, that's what I anticipate the timeline would be, but that's what I think.

