Resources

Alex Hayes | Solving the model representation problem with broom | RStudio (2019)

The R objects used to represent model fits are notoriously inconsistent, making data analysis inconvenient and frustrating. The broom package resolves this issue by defining a consistent way to represent model fits. By summarizing essential information about fits in tidy tibbles, broom makes it easy to programmatically work with model objects. Combining broom with list-columns results in an especially powerful way to work with many model fits at once. This talk features several case studies demonstrating how broom resolves common problems in data analysis.

View materials: https://buff.ly/2FGKFkj

About the author: Alex Hayes is interested in how statistics can help people make better decisions. He's active in the R and data science communities, and particularly interested in improving interfaces to modeling software. In his free time, he tries to get outside to climb and bike.


Transcript

This transcript was generated automatically and may contain errors.

Who am I and why am I talking to you about broom? I interned at RStudio over the last summer, worked on the broom package, and have since taken over maintaining it. So if there are any bugs, send them to me and I will try to fix them. And the person who originally wrote broom, David Robinson, is at RStudio conference, so give him a big shout out if you use broom at some point. I'm going to talk to you about what I'm calling the model representation problem and how broom alleviates it to some degree. If you want to follow along, the slides are available online and you can download them.

The model representation problem

Okay. So what is the model representation problem? If you have spent some time in a stats class, or maybe even reading some tutorials online, you have probably seen things like this somewhere. Here I've said, okay, there's this y_i, and I'm using this little tilde, which means that this y is a random variable with a normal distribution. So you can write this down, and if you've seen it before, you probably know what it means. Similarly, you can pick out one particular normal distribution; we have standard notation for that. And we have a way to represent vectors of parameters, and you can write down estimates.

So in my mind, there are three key objects: we have models, we have individual fits of models, and we have estimators, which I don't want to get into too much. But you can write all of these things down in ways that other people understand. The takeaway is that when you think of these things in terms of math, we have shared notation and community standards for how we write them down.

But in code, that's not the case quite as much as one would hope. Here's a quick example. One thing you might want to do, if you have some classification model, is get class probabilities out of that model. But that's kind of hard to do sometimes. Most of the time you can get class probabilities, but the question is how. What I have here is an example that I have stolen from Max: you have these different objects or models that you'd like to predict from, and if you want class probabilities, there are all of these different interfaces, right?

So it's not a question of whether you can do what you'd like to do. It's really a question of cognitive load. I don't want to have to remember all the different type arguments when I want to get class probabilities. Really, I'd just like to sit down, remember one thing, and always get what I want. So what's happening is there's this extra friction that takes time away from your end goal, which is really just to get the class probabilities and get on with your job, right?
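As a sketch of the kind of inconsistency being described (these particular models are illustrative, not necessarily the ones on the slide), consider how three common packages hand back class probabilities:

```r
# Each modeling package exposes class probabilities differently
# (illustrative examples; not necessarily the exact ones from the slide).

# stats::glm -- ask predict() for type = "response"
glm_fit <- glm(am ~ mpg, data = mtcars, family = binomial)
p_glm <- predict(glm_fit, type = "response")

# MASS::lda -- probabilities live in the $posterior element of the prediction
lda_fit <- MASS::lda(Species ~ ., data = iris)
p_lda <- predict(lda_fit)$posterior

# randomForest -- ask predict() for type = "prob" (package not loaded here)
# rf_fit <- randomForest::randomForest(Species ~ ., data = iris)
# p_rf   <- predict(rf_fit, type = "prob")
```

Three models, three different incantations for the same conceptual operation — exactly the cognitive load described above.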


Okay. So this is what I'm calling the model representation problem: we don't have a standard way to represent these things in code.

How broom helps

Okay. And broom is here to help. Broom is a package written by David Robinson, and there are three key ideas. We're going to work with fits; by "fit" I mean you have some model, you train it, and you get a trained model object back. In the broom philosophy there are three key components of a fit, and all of these components are going to be tidy tibbles. The reason they're tidy tibbles is that we already know how to manipulate tibbles, so we can use tools we already know and there isn't any additional cognitive load in working with them.

So the first type of information about a fit is the components, or the parameters. If you have a linear regression model, think of the regression coefficients: we're going to have a table with all the information about the regression coefficients. There's an S3 generic, tidy, that provides that, and I'll show some examples of how this works in just a moment.

Okay. The second generic is glance. Glance always returns a tibble with one row, and the columns are different goodness-of-fit measures. For example, this could be things like R squared or log likelihood or BIC, or whatever fancy version of information criteria a model returns; it's a one-row summary of the entire model. And then finally there's a third generic, augment. What augment says is: you have this training data set, and each row in it influenced the final model in some way, so any information about how each row influenced the final model shows up in augment. Or, if you have new data, you can also use it to get predictions on new data in a nice way.

Working through examples

Okay. So we'll work through a simple example first. Suppose we have some data simulated from a normal distribution, and we get a fit, where this normal_fit object represents the maximum likelihood estimator of the parameters of this model. The question is: what does this thing look like? If we look at this normal_fit object, what we get is a list, and I've cut off some of it, but we have these things in here: the maximum likelihood estimates themselves, their standard deviations, the covariance, and the log likelihood of this model. So this is actually a pretty nice object, and you can work with it if you choose to, but the big issue is that if you go to a different package, it could look really different.
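The slides themselves aren't reproduced in the transcript, but a minimal stand-in for this step, assuming MASS::fitdistr as the maximum likelihood fitter, looks something like:

```r
library(MASS)  # for fitdistr()

set.seed(27)
x <- rnorm(100, mean = 5, sd = 2)  # simulate from a normal distribution

# Maximum likelihood fit of a normal distribution; a stand-in for the
# fit object on the slides, which may have used a different helper
normal_fit <- fitdistr(x, "normal")

normal_fit$estimate  # the maximum likelihood estimates of mean and sd
normal_fit$sd        # standard errors of those estimates
normal_fit$loglik    # log likelihood of the fitted model
```

The point stands either way: a perfectly reasonable list, but its shape is specific to this one function.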

So as an example of something that looks really different — actually, we'll get to that in just a second; we're going to talk about linear models first. Here's what happens when we want to tidy this thing up. We'd like a table that represents these estimates. So there's this normal_fit, and if we call tidy on it, broom cleans it up for us. What we get is this table with three columns, and these column names are pretty much always going to be the same, which is really nice. There's a term column, which tells us which parameter we're working with; an estimate column, which gives the actual value we've calculated and think we should be using; and a standard error, which addresses our uncertainty in that parameter.

Okay. I told you there were two other generics we can apply to these objects. One is glance. For a simple normal model we only get some simple diagnostics: how many data points we fit, the total log likelihood, and then these other criteria. There isn't actually an augment method defined in broom at the moment for these univariate distributions, but you could imagine one if you wanted — you could probably define something like a per-observation log likelihood contribution.
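Under the same assumption that the fit comes from MASS::fitdistr (for which broom ships tidiers), the tidy and glance calls described here would be:

```r
library(broom)
library(MASS)

set.seed(27)
x <- rnorm(100, mean = 5, sd = 2)
normal_fit <- fitdistr(x, "normal")  # stand-in for the fit on the slides

tidy(normal_fit)    # one row per parameter: term, estimate, std.error
glance(normal_fit)  # one-row summary: logLik, AIC, BIC, ...
```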

Okay. So let's think about a slightly more involved model, where it might become clearer how this is useful in practice. If you want to follow along on your computer, you can; you only need these two lines. We're going to create this ols_fit object, which holds the OLS estimates for a linear model. I can't actually print this object out on a slide because it's really big and complicated, but it's interesting to look into yourself if you want to — or if you just want to be intimidated by how crazy the internals are. I still have no clue how it all works.
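The two lines themselves aren't in the transcript; since augment later reveals the hp, mpg and cyl columns of mtcars, the fit was presumably something like:

```r
library(broom)

# Presumed reconstruction of the two lines from the slides: an OLS fit
# using the hp, mpg and cyl columns that augment() reveals later on
ols_fit <- lm(mpg ~ hp + cyl, data = mtcars)

# The raw object is a big, complicated list -- peek at it with:
str(ols_fit, max.level = 1)
```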

But then when you tidy it, you get something that's a lot easier to work with: you just get this table. Here we have the term again — each one of our linear regression coefficients — and the estimate, but now there's some additional information about linear models that we didn't have before: statistics from t-tests and the associated p-values. If you wanted, you could also get confidence intervals; there's an argument to tidy that will give you those. When you glance at this, you also get a lot of information. Among the metrics you get are R squared and adjusted R squared; sigma is the estimated residual standard deviation, which gives you some measure of how uncertain predictions from this model will be; and so on and so forth. Exactly what you get out depends on the particular model you put in, but the column names are consistent. So if you want to get this information from one model or another, you don't have to remember what the internals of those two models look like — you can just tidy or glance them and always access the one column you care about.
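With that presumed fit, the calls described here would be:

```r
library(broom)

ols_fit <- lm(mpg ~ hp + cyl, data = mtcars)  # presumed fit from the slides

tidy(ols_fit)
# term, estimate, std.error, statistic (the t-statistic), p.value

tidy(ols_fit, conf.int = TRUE)  # the argument that adds conf.low / conf.high

glance(ols_fit)
# r.squared, adj.r.squared, sigma, logLik, AIC, BIC, ... in consistent columns
```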

Okay. Finally, I told you there was this third generic, augment. You can pass data into augment, but if you don't pass any additional data, it's going to root around in the model object and find the training data for your model. In this case, it's found that I used three columns from the original mtcars data set — hp, mpg and cyl — and it puts those into a data frame and then adds additional information about each one of these observations. Here each row is an observation. You can see there's a .fitted column, which is the actual prediction from the model; a .se.fit, which is the standard error of that prediction; and so on and so forth. For linear regression there's a bunch of additional measures added in here, but these are all training-data-level summaries.
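Again assuming the same fit, augment recovers the training data and appends observation-level columns. (The exact column set varies a little across broom versions; .se.fit in particular needs se_fit = TRUE in recent releases.)

```r
library(broom)

ols_fit <- lm(mpg ~ hp + cyl, data = mtcars)  # presumed fit from the slides

au <- augment(ols_fit)  # no new data passed, so it recovers the training data
names(au)
# the three original columns plus .fitted, .resid, .hat, .sigma, .cooksd,
# .std.resid (and .se.fit when requested with se_fit = TRUE)
head(au)
```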

Using broom in practice

Okay. So that's the big picture of what broom does, and now I'd like to show you how I end up using broom in practice. The first thing I do is use it a lot as a quick and dirty reporting tool. If you're familiar with the knitr kable function, it's fantastic: it takes a data frame and, no matter what you're outputting to, gives you a nice table. And if you can get your model into a table — which is exactly what tidy does most of the time — you can report it really fast. This is my super quick and dirty way to avoid just barfing raw output into a homework document or something like that, so I'll use this combination.
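The quick-and-dirty combination, as a sketch (the model here is illustrative):

```r
library(broom)
library(knitr)

fit <- lm(mpg ~ hp + cyl, data = mtcars)  # any model with a tidy() method

# tidy() turns the fit into a data frame; kable() renders it as a clean
# table in whatever format the R Markdown document is knitting to
kable(tidy(fit), digits = 3)
```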

The next thing you can do is use broom to compare different models by applying the glance generic to a list of models all at once. What I've done here is put three different models in a list — fit one, fit two and fit three — and in this case I'm adding more predictors at each step. Then I use a tool from the purrr package called map_df, which applies the glance generic to each one of these models and puts the results into a single data frame, which I arrange by AIC. If I look at the result, I now have a summary for each of my models, and in this case it looks like fit three is the best model in terms of AIC. You could also plot this immediately for a really nice visual model comparison.
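A sketch of this comparison, with illustrative predictors (the slide's exact formulas aren't in the transcript):

```r
library(broom)
library(dplyr)
library(purrr)

# Three nested models, adding a predictor at each step (illustrative choices)
fit1 <- lm(mpg ~ hp,            data = mtcars)
fit2 <- lm(mpg ~ hp + cyl,      data = mtcars)
fit3 <- lm(mpg ~ hp + cyl + wt, data = mtcars)

# map_df() applies glance() to each model and row-binds the one-row summaries
list(fit1 = fit1, fit2 = fit2, fit3 = fit3) %>%
  map_df(glance, .id = "model") %>%
  arrange(AIC)  # best model (lowest AIC) floats to the top
```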

Okay. By far my favorite use case for broom is checking multiple linear regression. I told you that if you use augment, you get the residuals. One thing you might want to do when you're doing some sort of regression is take the residuals, compare them to each of your predictors, and look at these partial residual plots; any systematic deviations there might indicate some sort of model misfit. So if I want a visual check for model misfit, I have about ten lines of code here, and it produces the following plot. That's it. Now I can look at these residuals and see if there are any patterns. If I eyeball this one, I wouldn't be too concerned, but it's a nice way to do quick and easy diagnostics.
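The roughly ten lines in question might look like this (the model and plotting choices here are assumptions, not the exact slide code):

```r
library(broom)
library(dplyr)
library(ggplot2)
library(tidyr)

fit <- lm(mpg ~ hp + cyl + wt, data = mtcars)  # illustrative regression

# Residuals from augment(), reshaped so each predictor gets its own facet
p <- augment(fit) %>%
  pivot_longer(c(hp, cyl, wt),
               names_to = "predictor", values_to = "value") %>%
  ggplot(aes(value, .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  facet_wrap(~ predictor, scales = "free_x")
p
```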

Working with multiple models at once

Okay. So finally I'm going to work through a more extended example. I don't actually know how much time I have, so if someone could let me know how much is left, that would be great. This more extended example is where I think the real power of broom comes in: when you want to work with multiple models at once. Suppose we have this data — it's the mtcars data again; I'm sorry if you've seen it a hundred times. If we want to think about miles per gallon versus weight, we could say, well, let's use a model here, but not a linear model: it really looks like there's some sort of inverse relationship, so let's do some sort of nonlinear least squares. The particular model here isn't especially important, but let's say there's some inverse relationship with weight, and maybe we'd like to fit this a bajillion times on bootstrap data sets to see what the uncertainty in the parameters k and b is.

Okay. To do this, the first thing we're going to use is the rsample package to create bootstrap data sets — in this case, a hundred of them. What we get is a tibble with a bunch of nested objects. When you start looking at this rsample stuff, you might wonder what's going on. We have this split object, and each split contains two data sets: the bootstrap data set and the held-out data set. Then there's an id column. The idea is that we're going to use mutate together with purrr's map to add new list-columns of models and work with those list-columns.

A general strategy that works really well in these circumstances is to figure out how to do something on one resampled data set first. There's a function in R that will fit a nonlinear least squares model, called nls. nls needs three arguments. I have to define the formula I want to use — in this case, exactly the formula from a couple of slides back. I have to tell it the data to use — here that's one split object from my bootstraps data frame, and I use the analysis function to get the training set out of it. And finally, because it's a different kind of model than usual, I have to give it some initial parameters, which don't really matter here. So this is the helper function that solves the problem for me on one data set.
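The helper, as a sketch: mpg ~ k / wt + b is the inverse-relationship model from a few slides back, and the start values are arbitrary, as noted. The function name is an assumption.

```r
library(rsample)

set.seed(27)
boots <- bootstraps(mtcars, times = 100)  # a hundred bootstrap resamples

# Fit nonlinear least squares on a single resample: a formula, the data
# (the training half of the split, via analysis()), and start values
fit_nls_on_bootstrap <- function(split) {
  nls(mpg ~ k / wt + b,
      data  = analysis(split),
      start = list(k = 1, b = 0))
}

fit_nls_on_bootstrap(boots$splits[[1]])  # works on one split; map() does the rest
```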

Now I'd like to generalize this, which I can do with map. I take my bootstrap data sets and mutate: first I map the helper over the splits, fitting a model on each resampled data set; then, now that I have each of these fit models, I tidy each one. So in the first row I have a data set; an id that says which data set it is; an S3 object — my actual model object, living in this column; and a nested table with my tidied coefficients. At first glance this structure can seem really intimidating, because you have all of these nested hierarchical structures. But this is where you want to end up, and then there's a magical function from tidyr that flattens it out into something we can work with in an intuitive way: the unnest function.

So in this case, if I unnest on coefficient info — going back one slide, coefficient info is this list of tables — I get a flat data frame I can work with, where for each bootstrap data set I have all the information I need about my coefficients. Now if I want to plot this to visualize uncertainty, it's just a one-liner or two-liner in ggplot2. Okay, a five-liner. But you throw it into ggplot2 and you can see it right away. I haven't done a whole bunch of bootstraps here, but hopefully the idea comes across: the workflow is very convenient.
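Putting the pieces together — map the helper over the splits, tidy each fit, unnest, and plot. The function and column names here follow the rsample/broom bootstrapping vignette; the slide code may differ slightly.

```r
library(broom)
library(dplyr)
library(ggplot2)
library(purrr)
library(rsample)
library(tidyr)

set.seed(27)
boots <- bootstraps(mtcars, times = 100)

fit_nls_on_bootstrap <- function(split) {
  nls(mpg ~ k / wt + b, data = analysis(split), start = list(k = 1, b = 0))
}

boot_coefs <- boots %>%
  mutate(fit       = map(splits, fit_nls_on_bootstrap),  # one model per resample
         coef_info = map(fit, tidy)) %>%                 # one tidy table per model
  unnest(coef_info)                                      # flatten to one big table

# The bootstrap sampling distribution of k and b
p <- ggplot(boot_coefs, aes(estimate)) +
  geom_histogram(bins = 25) +
  facet_wrap(~ term, scales = "free")
p
```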

Okay, there's one last use case here. Suppose we want to do the same thing, but visualize the uncertainty in our predictions. In that case, we want to work with the augmented data sets. We apply the same strategy: we already have a table of fit models, we map augment over those fit models, and what we get is again a nested table — this time with a bunch of predictions — and then we can unnest those nested predictions and ggplot them right away. And that looks like this: a bunch of nonlinear fits visualized all at once.
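The augment version of the same pipeline, under the same naming assumptions:

```r
library(broom)
library(dplyr)
library(ggplot2)
library(purrr)
library(rsample)
library(tidyr)

set.seed(27)
boots <- bootstraps(mtcars, times = 100)

fit_nls_on_bootstrap <- function(split) {
  nls(mpg ~ k / wt + b, data = analysis(split), start = list(k = 1, b = 0))
}

boot_aug <- boots %>%
  mutate(fit       = map(splits, fit_nls_on_bootstrap),
         augmented = map(fit, augment)) %>%  # per-observation .fitted values
  unnest(augmented)

# All one hundred fitted curves at once, visualizing prediction uncertainty
p <- ggplot(boot_aug, aes(wt, mpg)) +
  geom_point() +
  geom_line(aes(y = .fitted, group = id), alpha = 0.1)
p
```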

Documentation and resources

Okay, so that's what I personally use broom for most. There's a bunch of documentation on how to use broom. We just released broom 0.5.0 toward the end of the summer, so there's information on the new release and the new tidiers supported there. You can also learn how to implement these methods yourself; we have standards you can follow so that your broom functions — or tidiers, as we call them — behave like other broom functions, which is what you really want in order to decrease cognitive load, which is the goal in the end, right? There's a website with all this documentation, and I'm also happy to answer questions or help with pull requests at any time. My contact information is up here as well. Thank you.

Q&A

So thanks, Alex. We have time for a couple questions.

Maybe stand up and wave at me if you do have one. Yeah, okay, back here.

Can you hear me? Can you hear me, anybody? Okay, here we are. So since you're bootstrapping with a tibble, is that efficient, time- and space-wise?

Okay, so there are two things you have to think about in terms of efficiency here. The first is the data: are you going to create a bunch of copies of the data? The answer is no — Max has implemented this in a way so you're not going to create excessive copies. You do create a little bit of overhead; I think the example is that if you have 50 bootstrap data sets, you end up with maybe three times as much storage as the original. What you're really going to get hit with in terms of inefficiency is the models themselves. Most modeling functions in R save the data they were fit on, and that gets big really fast. So it really depends on how clean the modeling package is, and that's normally not within your control. If you're running into space issues, one standard thing you can do is define a helper function that fits the model, figures out what you actually need from it, and throws away as much junk as possible, keeping a pared-down object. I think we have time for one more question.
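A minimal sketch of that pared-down-helper idea (the model and the kept components are illustrative):

```r
# Fit, keep only what you need, and let the bulky model object be
# garbage-collected (illustrative model and components)
fit_lean <- function(data) {
  fit <- lm(mpg ~ wt, data = data)
  list(coefs = coef(fit),            # the estimates you actually need
       sigma = summary(fit)$sigma)   # plus one goodness-of-fit number
}

str(fit_lean(mtcars))
```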

Hi. Great talk. Can you just in a few sentences say what it takes to implement new tidiers?

Yeah, normally it's about five lines of tidyverse code in terms of effort. The biggest thing is making sure you meet the conventions — if you break with the conventions, it doesn't do anybody any good. There's a lot of documentation; I need to update it to some extent, but the link in the slides here should walk you through everything you need to do. I'd say maybe 15 minutes to read through that, and hopefully 30 minutes or less to create a custom tidier of your own. I think we can do one more quick question.

I have a love-hate relationship with the boot package. But one thing that I like about it is that it has support for computing different types of confidence intervals with better properties. So I was wondering if these tools have anything like that implemented, or if it's just percentile confidence intervals and that sort of thing.

So I didn't actually show a confidence interval example; what I visualized up on the board was the entire sampling distribution. If you can, that's my recommendation: look at the whole sampling distribution. But rsample is kind of low-level infrastructure, so if you're going to do calculations across all of those coefficients at once, you'll need to implement that method yourself. Max has a comment as well.

Oh yeah, there's fancy stuff coming — like the 632 method. So there's fancy stuff coming. That will be really cool.