
III RqueR Plenary: Hannah Frick
image: thumbnail.jpg
Transcript
This transcript was generated automatically and may contain errors.
Well, thank you very much for the introduction and thanks to the whole organizing team for having me. I'm delighted to be here.
Yes, I do work on the tidymodels team and this is what I'm going to be talking about. I am a statistician by training. Now my title is software engineer and I'm going to be walking down the line between math and machine learning and software engineering in this talk as well.
And once you leave academia, you get the joy of presenting work that you haven't done yourself. I've done some of the work you'll see, but this is work by the tidymodels team, and these are my colleagues. A lot of the work has been done by these three alongside me.
This is Emil. This is Max. This is Simon. You will see their names scattered throughout the talk.
What is tidymodels
Before we jump into what is new with tidymodels, I'm going to give us at least one slide on what tidymodels is. It is a framework for statistical modeling and machine learning in R.
And it is a framework that follows tidyverse principles. The same design ideas that have made the tidyverse popular, like how functions should work and how you should be able to interact with them, are the ones we try to apply to what I consider the heart of data analysis: modeling and statistics.
Essentially, if you feel at home with the tidyverse, you should feel at home with tidymodels as well.
What is new and released
When I got asked to talk about what is new, the usual interpretation is what is new and has been released. I am going to talk about that, but more briefly than the second part, which is what is new and in progress. So you're getting a little bit of work in progress.
Part of the reason I've kept the new-and-released part more contained is that most of the big releases of the year happened in the first half, and we've since talked about them in various places, among them posit::conf in August. The talk videos from posit::conf went online last week, so they're all on YouTube, including ours. We had a session of tidymodels talks alongside plenty of others, and I do believe the quality of the talks at posit::conf was really good. If you're curious, I encourage you to go check out those videos.
The thing that probably took the longest out of these releases is adding support for time-to-event data to the entire framework. Time-to-event data, as the name suggests, has two aspects: the time and the event itself. And that event may or may not have happened yet at the point when you want to analyze your data. If it hasn't happened, that's a censored observation. It's not a missing observation; it's an incomplete one. But you still want to make use of that data.
For the stats people in the room, it's pretty obvious that the methodology for an outcome with these two aspects, a time and a censoring indicator, is survival analysis. For people not so familiar with it, the name survival analysis usually suggests something confined to medical research, and therefore not applicable to them. So I've been trying to tell people this year in particular that this is time-to-event data, and survival analysis is the methodology for it. And that event can be a lot of things. Yes, it can be death in a medical context, but it can also be the time until your pet gets adopted from the shelter. So: pretty broad applications.
My colleague Simon worked on a very different topic: fair machine learning. tidymodels now includes several metrics for fair machine learning. There isn't one silver bullet, one metric that answers all your questions, because such a metric simply doesn't exist, so we did not try to implement one. Instead, Simon and other colleagues at Posit spent almost a year catching up on the literature and on other implementations before deciding what to implement for tidymodels.
They chose metrics that support you while you're thinking through a problem, rather than presenting a solution that is too easy to be real. And Simon does a really nice job of situating that research and those questions in his talk on fair machine learning.
And last but not least is Emil's work on making predictions happen in databases. If you're working in a context where you build a model and you have loads of data in a database, it's nice not to have to pull that data into R just to make predictions and write them back, but instead to compute the predictions in the database directly.
That is what the tidypredict package was already doing, but only for a parsnip model on its own. In the easy case of a linear model, you know your coefficients, so you can write a SQL (or other) expression that multiplies the coefficient vector with your predictor values, and you get your prediction. That can happen in the database. But it didn't include any of the preprocessing. Emil has gone through the work of making this happen for tidymodels workflows, so the translated object can include both the preprocessing and the fitted model. The package is called orbital if you want to check it out.
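To make the idea concrete, here is a rough sketch of what using orbital could look like. This is an illustration based on my reading of the package, not a definitive usage pattern; in particular, the set of supported models and the exact helper for generating SQL may differ from what's shown.

```r
# Sketch: translate a fitted tidymodels workflow (preprocessing + model)
# into column expressions that can be evaluated inside a database.
library(tidymodels)
library(orbital)

fit_wf <- workflow(
  recipe(mpg ~ disp + wt, data = mtcars) %>%
    step_normalize(all_numeric_predictors()),
  linear_reg()
) %>%
  fit(data = mtcars)

# Convert the whole workflow into derived-column computations
orb <- orbital(fit_wf)
orb  # prints the expressions, ending with the prediction

# From here, the expressions can be rendered as SQL for a DBI
# connection `con` (helper name is an assumption from the docs):
# orbital_sql(orb, con)
```

The point is that both the normalization step and the linear model's coefficients end up as plain arithmetic expressions, which is what lets the database do the work.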
Work in progress: post-processing
So that is what we've been working on earlier in the year. If any of this catches your fancy, go check it out. And now I'm going to very smoothly segue into the work in progress, where things get a lot less polished. Let's put it that way.
The first thing I want to talk about that we're actively working on is post-processing. I've decided to slap a subtitle on it: it's what you do after you've finished fitting your model, and you do it to your model predictions, not to your data. The reason is usually that you're trying to squeeze a little more performance out of the model, trying to make the predictions better after you've figured out how to make them.
But let's look at an example. I brought some delivery data: how long it takes to get something delivered. Since I talked about censored data earlier, note that this is all uncensored; it's the actual time until the event, which keeps things easy in this example.
So time to delivery is our response, and we've got a couple of predictors. We do the usual tidymodels thing where we split the data into training and testing sets, and we also prepare 10-fold cross-validation. Then we pick a model, and I'm going to deliberately make the model really bad, to show you why you might want to post-process your predictions. It's a boosted tree with a glorious total of three trees. You're not going to get something very performant, but you get a model.
We resample that model on the 10-fold cross-validation, get some metrics out of it, and save the predictions.
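The setup described above can be sketched roughly like this. I'm using the deliveries data from the modeldata package as a stand-in for the delivery data in the talk; the column name time_to_delivery comes from that package's documentation, and the rest follows standard tidymodels functions.

```r
# Sketch: split the data, set up 10-fold CV, and resample a deliberately
# weak boosted tree (only three trees), saving predictions.
library(tidymodels)

data(deliveries, package = "modeldata")

set.seed(1)
split <- initial_split(deliveries)
train <- training(split)
folds <- vfold_cv(train, v = 10)

spec <- boost_tree(trees = 3) %>%
  set_mode("regression")

wf <- workflow(time_to_delivery ~ ., spec)

res <- fit_resamples(
  wf, folds,
  control = control_resamples(save_pred = TRUE)
)
collect_metrics(res)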
We have metrics, so we might as well look at them. We get an R-squared of 0.8-something, which is not too terrible. But as always, you've got to remember what your metric actually measures. R-squared measures the correlation between predictions and observed values; it doesn't tell you how close a prediction is to its observed value. That is a different property: calibration.
So why not look at it as a plot? This one is made with the probably package. You can see that for higher observed values, the model quite severely under-predicts. This is not a great model. Deliberately so, here, but if you're trying your hardest to get a good model and that's your result, things are challenging.
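A calibration plot like the one described can be made with probably's cal_plot_regression(). Here I'm assuming the resampling results from the steps above live in an object called res, with predictions saved, and that the outcome column is named time_to_delivery:

```r
# Sketch: observed vs. predicted values from the resampling results.
# Points below the diagonal are under-predictions.
library(tidymodels)
library(probably)

collect_predictions(res) %>%
  cal_plot_regression(truth = time_to_delivery, estimate = .pred)
```

The diagonal line marks perfect calibration, which is what makes under-prediction at high observed values easy to spot.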
One of the things you can do about this is calibration, and that's what we're going to do. More generally, why would you want to do any post-processing? Usually the motivation is to get better predictions by adding a little extra on top; calibration, for example, is another model fitted on top of yours. Or sometimes you simply want to satisfy distributional constraints: you might have a range that your predictions aren't allowed to fall outside of, something like that.
You can already do this today, but it sits somewhat apart from the rest of the framework. You can do straight-up column manipulations, say if you're thresholding predicted probabilities into hard classes at something other than 0.5. Or you can use the probably package, which has calibration methods in it.
That is all okay, but what we are doing now is integrating it into the workflow object. A workflow in tidymodels is the thing that currently contains your pre-processing instructions and your model instructions. The idea is that putting them together makes it easy to cross-validate them together, without creating data leakage from pre-processing. Now we are adding post-processing to that object, so that you can fit the whole thing.
Introducing the tailor package
We have a new package on the block for this kind of thing. Meet tailor. The tailor package is what connects the methods in probably with workflows.
If you think of recipes for pre-processing, think of tailor for post-processing. That's the mental model that we're borrowing from. If you have one for recipes, you're a good bit of the way there for tailor.
There are similarities and differences in that mental model. Recipes work on the training data, because we're doing pre-processing; tailors work on the model predictions, because we're doing post-processing. But for both recipes and tailors, you initialize an object and then add more specifications to it. In recipes, those are steps; in tailor, they are adjustments.
Then you have a part where you estimate something, and a part where you apply it, often called predict. For tailor, those are fit() and predict().
And let's use it, in a workflow. We call library(tailor); nothing happens with that yet. Then we make a tailor object: we initialize it with a call to tailor(), like we would with recipe(), and then add the instructions for what to do with our model predictions. In this case, I'm adding an adjustment that does a numeric calibration, so those off-the-wall predictions are going to get calibrated.
Now we have a tailor object, analogous to a recipe object. And just as we would add a recipe to a workflow with the add_recipe() function, there is now also an add_tailor() function. So hopefully all the intuitions you've built up with tidymodels already carry over, and you can map that knowledge onto this new part.
So we add the tailor, and now we have a workflow with instructions for pre-processing, for the model fit, and for post-processing. Then we can send that through fit_resamples() like we did before; the only thing that's changed is the object going in. We're still saving the predictions so that I can redo the earlier plot. And you can see: still not perfect, but we're getting closer to that diagonal line.
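Putting those pieces together might look like the sketch below. tailor is still work in progress, so treat the function names, in particular adjust_numeric_calibration() and add_tailor(), as a snapshot of the development versions rather than a stable API; folds is the 10-fold cross-validation object from earlier in the example.

```r
# Sketch: a workflow with pre-processing (a formula), a model, and
# post-processing (a tailor that calibrates numeric predictions).
library(tidymodels)
library(tailor)

adjustments <- tailor() %>%
  adjust_numeric_calibration(method = "linear")

wf_post <- workflow() %>%
  add_formula(time_to_delivery ~ .) %>%
  add_model(boost_tree(trees = 3) %>% set_mode("regression")) %>%
  add_tailor(adjustments)

# Resampling works exactly as before; only the workflow has changed.
res_post <- fit_resamples(
  wf_post, folds,
  control = control_resamples(save_pred = TRUE)
)
```

Because the tailor lives inside the workflow, the calibration model is re-estimated within each resample, just like recipe steps are.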
So: success. This is work in progress, so let's talk about what it can do and what it can't do yet. What is implemented in tailor as post-processing options is what we expect to be the most useful and most commonly requested adjustments.
If you need something that's not on this list, let us know; I'll show you where to leave feedback in a moment. But we hope this already covers a lot of bases. You can calibrate probabilities, and you can either pick a different threshold or use equivocal zones when turning probabilities into hard class predictions. And for numeric outcomes, you can calibrate them or restrict their range.
We have implemented tailor itself, there's support for it in workflows so that you can call add_tailor(), and, as you've seen, fit_resamples() and the tune package already work with this.
And then there's this unassuming line about support in rsample. That concerns calibration. Calibration essentially fits another model, with your model predictions as the input and your true outcome as the outcome. And whenever you fit yet another model, you want fresh, unseen data: you don't want to risk overfitting by stacking model on top of model, all tuned on the same data.
So what happens under the hood is that we hold back some of your training data for calibration, if you've told us in your specification that that's what you want. How we do that is implemented in rsample; we have it for cross-validation and bootstrap resamples, and there are more to come, including time-based resampling schemes.
The other thing we still need to wire up is tuning with all of tune: not just fit_resamples(), but tuning a workflow that has both a pre-processing specification and post-processing instructions. If you care about this topic and want to give us feedback, that is very welcome. The best place is via issues on GitHub, on the tailor repository.
Work in progress: sparse data support
The other topic we're actively working on across tidymodels is better support for sparsity. Sparse data in the wild is something you encounter if you have categorical predictors with a lot of categories that you turn into indicator variables, or if you tokenize text, or if you have graph data. You can easily end up with a lot of variables and a lot of zeros in them.
The information in your data is then pretty sparse. And going from one column for a categorical variable to many, many indicator columns full of zeros can be a challenge in two ways. One is memory allocation: your data set is already big, and then you turn one column into many. The other is speed.
The way this is typically addressed is with a different data representation. Different from what is a fair question. The default is called a dense representation: we have a vector like this, and we store all 25 of its values.
I don't need you to count the zeros, but you've probably spotted the non-zero values, because there are only two. So if we don't want to store all 25 values, because those two are what really interests us, we can switch to a sparse representation. For that we only need five numbers: one to remember how long the vector is (the 25), two for the locations of the non-zero values, and two for the non-zero values themselves. So that's 5 versus 25. Whether that's a good trade-off depends on how sparse your data is; there is no "this is always better". I think we all know there's no free lunch, right?
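The bookkeeping from that example can be written out in a few lines of base R. The positions and values here are made up for illustration; the counting is the point.

```r
# Sketch: dense vs. sparse storage for a length-25 vector with
# two non-zero entries.
x <- numeric(25)
x[c(7, 19)] <- c(3, 5)   # illustrative positions and values

# Dense representation: all 25 values are stored.
length(x)                # 25 numbers

# Sparse representation: length + positions + values.
sparse <- list(
  length    = length(x),      # 1 number
  positions = which(x != 0),  # 2 numbers
  values    = x[x != 0]       # 2 numbers
)
sparse                        # 5 numbers in total, versus 25
```

With two non-zeros out of 25 entries the sparse form wins easily; with, say, 15 non-zeros, the dense form would be cheaper, which is exactly the trade-off described above.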
There is a bit of a trade-off, but if you're in a situation where you encounter a lot of zeros, it's interesting. And in R, there is a package that implements this: the Matrix package, with a capital M, which provides sparse matrices and sparse vectors along with efficient matrix operations on them. So it addresses both the memory concern and the speed concern.
But a matrix is pretty late in the game. Usually when you have data, you have it in a data frame for a good reason: you often have different types of columns.
So there's a bit of a hiccup in using the Matrix package directly within the tidymodels framework. If you give us such a matrix to fit a model, we can take it and pass it on to the modeling function without crunching it. That part is there already.
But we have a slight problem trying to stick sparsity into tibbles. In situations like this, I feel it's always good to remember what you are trying to do, what you are trying not to do, and why. And the best place to start is the why.
So let's pull back for a moment and talk about what we would really like to have here. First: if something is in a sparse data format, we would like to preserve that throughout the framework. If you run into something that can't handle sparse data, it should materialize and fall back to the dense representation.
We want to preserve sparsity, and preserve it across the framework. The thing that carries data across tidymodels is a tibble; that's where my interest in putting a sparse vector into a tibble comes from.
The other thing we would like is to be able to make use of sparsity where it makes sense. If you're turning a categorical variable into a lot of indicator variables, you may want to make those sparse indicator variables. That's where we would like to be able to add sparsity.
The last one sounds casual: we want to make things easy. That sounds obvious, but it really is a fundamental idea of all of tidymodels. Not easy for easy's sake, but rather: if you're dealing with complex problems, we don't want you to spend all your energy on the technical bits of handling sparsity. We want you to be able to rely on tidymodels for that and keep your focus on the actual modeling.
As a small example, we don't want you sitting there thinking: I don't actually need all the rows here, but will filtering break sparsity, even though it shouldn't change anything? You just want to be able to call dplyr::mutate() or dplyr::filter(). You probably do. We certainly do.
That is a long-winded way of explaining why you would really want sparse vectors in a tibble. Not all columns have to be sparse, but we want some of them to be able to be.
And as you saw, unfortunately we can't just straight-up stick sparse vectors from the Matrix package, capital M, into a tibble, for technical reasons I'm not going to dive into. So instead: please welcome the sparsevctrs package.
That is another moment of: what are we doing, and what are we not doing? sparsevctrs is not intended as a replacement for the Matrix package. It is our way of having sparse vectors that we can stick into tibbles, and we are not aiming to implement any of the efficient matrix operations that the Matrix package has. If that's what you're interested in, go to the Matrix package for matrix operations.
That said, sparsevctrs is its own package, and if it sounds useful, you can go use it. We are using it in tidymodels, but it's its own thing, and you can use it in other contexts if it fits. It lets us do the thing I was after, putting sparse vectors into tibbles, and because we want to be able to talk with the rest of the world, it also has fast conversion methods to go from a tibble with sparse vectors in it to the formats of the capital-M Matrix package, so that you can then hand that to a model fitting function, for example.
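Here is a small sketch of that idea, based on my reading of the sparsevctrs documentation; the constructor and the coercion helper names should be treated as assumptions and checked against the package.

```r
# Sketch: a sparse vector living inside an ordinary tibble, next to
# dense columns. Only the non-zero values and their positions are stored.
library(tibble)
library(sparsevctrs)

tbl <- tibble(
  id = 1:25,
  x  = sparse_double(values = c(3, 5), positions = c(7, 19), length = 25)
)
tbl  # prints like a normal tibble; x stays sparse underneath

# Fast conversion to the Matrix package's format for model fitting
# (helper name per the sparsevctrs docs, an assumption here):
# m <- coerce_to_sparse_matrix(tbl)
```

The key property is that dplyr-style operations on the tibble don't force the sparse column to materialize into a dense vector.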
Okay, I said work in progress; some things are implemented and some are not, so let's get to that part. Already implemented in recipes: the core of everything, the recipe() function itself, knows what to do with sparse vectors and doesn't just smash them into dense form and materialize them. Sparsity passes through, and it also passes through prep() and bake().
parsnip's fit_xy() function also does not destroy sparsity, and neither does predict(). And in workflows, fit() and predict() leave sparsity alone too. Which is really nice.
But obviously, what's not happening yet? The area where the work is at right now is the recipe steps. When a step touches a sparse predictor, the result may or may not be able to stay sparse. For some steps, densifying is inherent to what they do: if you stick a vector into a PCA, you're not getting something sparse out. But for steps like creating indicator variables, or others where keeping sparsity makes sense, the goal is to preserve it. Some of them work already and some don't.
And then there is one thing that some people want to do, and some people are skeptical about, which is why I'm putting a big caveat on it. If you use a formula interface, which is a lovely thing about R, and in tidymodels that means calling fit() on a parsnip model, or using a formula as the preprocessor in your workflow, then you're on code paths that eventually lead to model.matrix() from base R. And that one produces dense data.
Unless we write something that replaces model.matrix(), that will remain the case. Maybe we will have something for special cases, but I'm pretty sure it will not be an "it always works" kind of situation, because of the nature of things.
But I don't want to leave you with the impression that it's not going to happen; this is very much a work in progress. And we are pretty excited about the possibilities this opens up for sparsity in preprocessing. So far, sparsity has mainly been relevant for model fitting, and now more can happen on the preprocessing side. We don't quite know yet where that will lead us, but we're pretty excited to find out.
This is the feedback slide. If you have feedback on sparsity in general, or on sparsevctrs in particular, please open an issue on the sparsevctrs repository. Note that it lives under the r-lib organization, not under tidymodels.
Better error messages
So now we have post-processing and sparsity in tibbles. And I brought you a third topic, because three is fun, right? That third one is better errors. Not quite as big and flashy as the other topics, but more like the bread and butter of things.
Usually when you run into an error, it's not the best moment of your day. And recipes used to be pretty tricky here. You write your beautiful recipe, you try to fit your workflow, and you get an error. Okay, fine, you know your debugging strategies: you take the recipe out of the workflow to remove that layer of complexity, you try to prep() it, and you still get an error.
Hopefully the error tells you something useful. Here it says that a column already contains the new level, so you can go look for the step that creates new levels. And I know people have devised the debugging strategy of commenting steps in and out to narrow it down. That is a good strategy, but wouldn't it

