
Sparsity support in tidymodels, faster and less memory hungry models - Emil Hvitfeldt

Sparse data, data with a lot of 0s, appears quite often in modeling contexts. However, existing data structures such as data.frames or matrices don't have a good way of handling it: you were forced to represent all data as either sparse or dense (non-sparse). This means that many modeling workflows use a non-optimal data structure, which at best slows down computation and at worst isn't computationally feasible. This talk will cover how we overcame these issues in tidymodels, starting with the creation of a sparse vector format that fits in tibbles, followed by the wiring needed to make it work in our packages. The best part is that most users don't need to change anything in their code to benefit from these speed improvements.

Oct 30, 2025
17 min


Transcript

This transcript was generated automatically and may contain errors.

So, I'll be talking about something that I've been working on, on and off, for five years. It started before I joined Posit, and then last year I was able to really put some thought into solving this issue. I'll be talking about how we now much more fully support sparse data in the tidymodels ecosystem.

Motivating example: flights data

So as a motivating example, what we have here is we're loading the flights data set. This is all the flights coming out of New York in the year 2013, so it's around 300,000 flights, a little bit more than that. And we're building, it's not a great model, what we're trying to do is predict how much the delay will be, putting it into an XGBoost model and seeing what happens. We're also doing some pre-processing, and most of it is just to handle some missing data and factor levels that haven't been seen before.

But the important part here really is step_dummy. We're creating dummy variables of all the categorical predictors. Right before anything happens, there are 18 predictors. Once we add the dummies, there are a little over 4,000. One of these categorical variables is the tail number, so we'll have a dummy variable for each plane that ever flew out of New York City in the year 2013.

So this means that this data set is quite sparse, meaning there's a lot of zeros. It has over 99% zeros, which are mainly concentrated in these dummy variables.

So if you tried to fit this model a year ago, it would take around eight minutes to run. And this has nothing to do with XGBoost; this is only something that happened because of what we were doing in tidymodels. Now we're actually seeing some improvement in the time it takes to fit. And you might think that if we got it down to 100 seconds, that would be really, really nice. But that's underselling it, because now it runs in four seconds.


And this has nothing to do with XGBoost being slow before and fast now. This is simply us being able to fully leverage the sparse data support that XGBoost already has. To tease a little bit of why this happened: we're now allocating a lot less memory. The rest of the talk will showcase how we made this happen.

ALTREP: alternative representation in R

So first, we'll talk about ALTREP, a little side thing that happens in base R. ALTREP is short for alternative representation. It's something that happens in R, at the C level, that you shouldn't have to know happens; it always happens under the hood. If anything, it's really hard to actually see whether something is ALTREP or not. It has a lot of different use cases I sadly don't have time to talk about, but I'll talk about the one most people have run into, which is the compact integer sequence.

We've all been making them: if you have a for loop or anything else, it's a sequence from one integer value to another integer value. And this uses ALTREP under the hood. So if we look at how much memory a vector takes up in R, I'm using the lobstr package because it's easy. The numeric() function just produces a bunch of zeros, and we can see that as we make the vector 10 times longer, it also takes up 10 times more memory.

But if we now make sequences of the same lengths as before, they always take up the same amount of memory, regardless of how long we make them. So you could make one sequence that's so large it actually wouldn't fit in memory on your computer, and it would still only take 680 bytes.

So what's happening is, here we're just imagining a small sequence, but it shows everything. How would we describe everything about this vector in as few values as possible? To be able to do that, we need to know how long it is, how many values it has; what the first value is, so here it starts at 1; and then we have this step. Here it's incremented by 1, but you could imagine incrementing by 2, 4, or minus 1. So we can describe an integer sequence of any length using three integer values. And this is what ALTREP does for us.

And with that, whenever some operation happens, say you want the length of an ALTREP vector: instead of pulling the vector up and counting how many values it has, R intercepts it, saying, hey, you're trying to call length on this ALTREP integer sequence; let's pull out the stored length and return that. So it's constant in time. And likewise with the sum: there's a little bit of arithmetic to do, and you can now calculate the sum of any integer sequence in constant time, regardless of its length.
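The idea above can be sketched in a few lines. This is an illustrative, language-neutral sketch (not R's actual ALTREP C code): a sequence stored as three integers, with length, sum, and sortedness answered in O(1); all names here are made up for the example.

```python
from dataclasses import dataclass

@dataclass
class CompactSeq:
    n: int      # number of values in the sequence
    start: int  # first value
    step: int   # increment between consecutive values

    def __len__(self):
        # O(1): the length is stored, no vector to scan
        return self.n

    def sum(self):
        # arithmetic series: n*start + step*(0 + 1 + ... + (n-1))
        return self.n * self.start + self.step * self.n * (self.n - 1) // 2

    def is_sorted(self):
        # a non-negative increment means the sequence is sorted
        return self.step >= 0

# behaves like 1:1000000 in R, but stores only three integers
s = CompactSeq(n=1_000_000, start=1, step=1)
print(len(s), s.sum())  # 1000000 500000500000
```

The real ALTREP machinery works the same way in spirit: each method (length, sum, sorted) is intercepted and answered from the three stored integers instead of a materialized vector.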

So there's a bunch of different methods we need to implement to have ALTREP working. Here we saw the sum, and we can also ask, is something sorted? It's really easy to know whether an integer sequence is sorted: if the increment is positive, it's sorted. There's nothing else we need to check.

The last one I want to talk about is the serialized state. But the main thing to know about ALTREP is that it's very, very fragile. If you try to do anything to an ALTREP vector that it doesn't support, it will materialize the normal vector and then use that. So if you add one to an integer sequence, it's no longer an ALTREP vector; it's just a normal vector now.

Sparse vectors as ALTREP

Now we can get back to actually talking about what I did. So there's this idea of sparsity. Here we have a sparse vector that has 14 values or so, and two of them are non-zero. But the principle is the same if it had a million values. So what's the minimal sufficient amount of information to be able to recreate this vector? Again, we need the length. And then we want the positions, the integer positions of where the non-zero values are, and what the values are. So this vector can now be represented as five integer values. And the values don't have to be integers: you can have a sparse double vector, and then you have three integers and two doubles.

So there's a bit of a trade-off from what we had before: now how much memory a sparse vector takes up is fully dependent on the number of non-zero values. If half of them were non-zero, it wouldn't be worth it anymore, because we'd actually be spending more memory, and some of the operations we'll be doing later would be slower. It isn't always worth it. So there's a threshold, but it's not a set threshold; it's very fluid.

But this is the main idea behind what we're doing. And likewise, we can do these calculations faster than before. If we want the sum of a sparse vector, we just sum up all the non-zero values, which, when the vector is actually sparse, is a lot shorter than the full thing. And if we want to find the i-th element, we just check: is i in the positions? If it is, we find the matching value; if not, it's zero.
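Those two operations can be sketched like this. This is a toy illustration of the representation described above (length + positions + values); the class and field names are invented for the example, and the real implementation lives in C inside the sparsevctrs package.

```python
class SparseVec:
    """A vector stored as its length plus the non-default positions/values."""

    def __init__(self, positions, values, length, default=0):
        self.positions = positions  # 1-based indices of non-default entries
        self.values = values        # the corresponding non-default values
        self.length = length        # total length of the full vector
        self.default = default      # the fill value, typically 0

    def sum(self):
        # only touch the non-zero values; the zeros contribute nothing
        return sum(self.values) + self.default * (self.length - len(self.values))

    def __getitem__(self, i):
        # i is 1-based, like R indexing
        if i in self.positions:
            return self.values[self.positions.index(i)]
        return self.default

# a length-14 vector with a 7 in position 4 and a 2 in position 11
x = SparseVec(positions=[4, 11], values=[7, 2], length=14)
print(x.sum(), x[4], x[5])  # 9 7 0
```

With two non-zero entries, `sum` only loops over two values instead of fourteen; the gap widens as the vector grows while the non-zero count stays small.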

Of course, this all has to happen in C. So this is the C implementation of the same function from before. And because there are a lot of methods expected to be implemented, and we need a separate one for the integer version, the double version, the character version, and everything else, in this process I created the sparsevctrs package.

As I said before, everything happens in C: it all plugs into the ALTREP C interface that R has. And it needs to match; everything has to match. So there's handling for Inf, NaN, NA, any values. Everything has to match up completely with what happened before. That's the promise you're given when you're using ALTREP. And these vectors are very fragile: if anything unsupported happens, the vector materializes into a normal thing and becomes non-sparse again. So we also want to avoid ever doing something to these sparse vectors that isn't supported, because then we lose the benefit.

Additionally, because we need them, the package has converters. So if you have a sparse matrix, we can turn it into a data frame with sparse columns in it, so a bunch of sparse vectors as columns, and back and forth. And this is something you shouldn't need to worry about; this is mainly for us, because we're using it across all the packages. Hopefully you shouldn't really know that this happens. But this is where all the work really is.
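The converter idea can be sketched as follows: a column-compressed (CSC) sparse matrix already stores each column's non-zero row indices and values contiguously, so turning it into one sparse vector per data-frame column is just slicing. This is a rough sketch under that assumption; the function name and layout are invented for the example, not the sparsevctrs API.

```python
def csc_to_sparse_columns(indptr, indices, data, ncol):
    """Split a CSC matrix into (positions, values) pairs, one per column."""
    cols = []
    for j in range(ncol):
        lo, hi = indptr[j], indptr[j + 1]   # this column's slice of the arrays
        # row indices are stored 0-based in CSC; shift to 1-based positions
        positions = [i + 1 for i in indices[lo:hi]]
        values = data[lo:hi]
        cols.append((positions, values))
    return cols

# 3x2 matrix: column 1 has a 5 in row 1; column 2 has 3 in row 2 and 4 in row 3
cols = csc_to_sparse_columns(indptr=[0, 1, 3], indices=[0, 1, 2],
                             data=[5, 3, 4], ncol=2)
print(cols)  # [([1], [5]), ([2, 3], [3, 4])]
```

Going the other direction (sparse columns back to a CSC matrix) is the same bookkeeping in reverse, which is why a round trip loses nothing.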

Why sparse vectors in tibbles?

So why are we doing this? I spent a lot of time talking about different things, but the reason is that sparse matrices and data frames or tibbles don't co-mingle at all. Additionally, a sparse matrix, by being a matrix, forces all the values to be the same type: you can't mix factors and integers and dates. So in theory, you can't do classification with a sparse matrix, because you can't have the outcome in it; you need to keep it separate. And secondly, tidymodels is built on tibbles. If we wanted to do sparse data support with sparse matrices, we would have to duplicate all the code to have two paths, which is not worth it.

Making it work in tidymodels

So to make it all happen in tidymodels, we had to do three different things. One of them was making sure that sparse data can pass through all the internal functions from end to end. So if you pass it into fit(), it can go all the way through without accidentally materializing along the way. Then, as we saw in the beginning, we had step_dummy; it's one of the steps that can produce really sparse data. So we needed to upgrade all the steps that produce sparse data to actually produce it sparsely. And lastly, we have this idea of the automatic toggle, which I'll get into in four or five slides.

Having things go through is the least interesting part of this. It's just writing hundreds of tests where we put in sparse data, see if it stops being sparse at some point, figure out which function did it, work around it, and continue until everything was able to flow back and forth.

Secondly, we have a couple dozen steps that are able to work with sparse data. You can imagine taking a sparse vector and multiplying it by two: it stays sparse, because that doesn't remove the zeros. So there are a couple of steps that can modify sparse data, or use it, or produce it. All of that works now. And the steps that generate sparse data have received a new argument called sparse, which can take three values: yes, no, or auto. Auto is the default.

Automatic sparsity detection

And the automatic toggle: we wanted this experience to be as seamless as possible for the users. You only want to use sparse data when it's beneficial for you. And when we say beneficial, it's mainly computational speed, because that more or less goes hand in hand with the memory allocations that happen. So something can benefit from sparse data if the model supports it; not all the engines support it, but xgboost, glmnet, and LiblineaR do, and lightgbm will get full support by the end of the month. But you also need enough sparsity in the data for it to be worth it. We don't want to encode everything as sparse vectors if there are no zeros, because then we're doubling the data and everything becomes really slow. So there is some threshold, and it's a little bit hard to pin down. But if the answer to both questions is yes, then we automatically turn on the sparse representation for you, and it should just happen.

The way we do the sparsity estimation: we need to know how much sparsity there will be when the data goes into the model, but we need to know it before we fit the recipe. So it matters for the model, not for the recipe. And if we had to fit the recipe and try both ways, we'd have lost all the time savings anyway.

We have a very rough estimate of what the sparsity will be, and for some steps it's fairly easy. If you have a categorical variable coming in, we know precisely how much sparsity the dummies will add to the data: if it has 10 levels, it produces 10 variables that each have 90% sparsity. So we have a rough estimate of what happens per step, we do that math, and then we have an estimate of how much sparsity is in the data.
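The arithmetic behind that estimate can be sketched in a few lines. This is a back-of-the-envelope illustration only (the function name and the assumption that numeric columns are dense are mine, not tidymodels' actual estimator): a factor with k levels dummies out to k columns where each row has k-1 zeros.

```python
def estimate_sparsity(n_numeric, level_counts):
    """Estimate the share of zero cells after dummy-encoding the factors.

    n_numeric:    number of numeric predictors (assumed fully dense here)
    level_counts: number of levels for each categorical predictor
    """
    # total columns after dummying: numerics stay, each k-level factor -> k columns
    total_cols = n_numeric + sum(level_counts)
    # per row, a k-level factor contributes k-1 zeros (one dummy is 1, rest are 0)
    zero_cols_per_row = sum(k - 1 for k in level_counts)
    return zero_cols_per_row / total_cols

# e.g. 8 numeric predictors plus factors with 10 and 4000 levels,
# roughly the shape of the flights example
print(round(estimate_sparsity(8, [10, 4000]), 4))  # 0.9975
```

A single high-cardinality factor like tail number dominates the estimate, which is exactly why the flights data ends up over 99% zeros.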

I then set up a simulation study that takes around a couple hundred hours to run. It fits models that explicitly state that they support sparsity, across different numbers of rows, numbers of columns, and amounts of sparsity, and measures which representation is faster. We then encoded that as a simple model that's embedded into workflows. So when you're fitting a model, it will estimate whether this model is going to benefit from being sparse or not; if it is, it will turn all the autos into yes, and if not, it will turn everything into no. It also means that if your engine doesn't support sparsity, it doesn't even do the estimation; it just says no right away.
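The resolution logic described above can be sketched like this. The engine list and the single fixed threshold are placeholders of mine (the real decision in workflows comes from the simulation-study model, not one hard-coded cutoff), but the shape of the decision is the same: check engine support first, then the sparsity estimate.

```python
# illustrative subset of engines with sparse support, per the talk
SPARSE_ENGINES = {"xgboost", "glmnet", "LiblineaR"}

def resolve_auto(engine, estimated_sparsity, threshold=0.9):
    """Turn a sparse = "auto" setting into "yes" or "no"."""
    if engine not in SPARSE_ENGINES:
        # engine can't use sparse data: skip the estimate entirely
        return "no"
    return "yes" if estimated_sparsity >= threshold else "no"

print(resolve_auto("xgboost", 0.99))  # yes: supported engine, very sparse data
print(resolve_auto("lm", 0.99))       # no: engine has no sparse support
print(resolve_auto("glmnet", 0.10))   # no: data not sparse enough to pay off
```

Ordering the checks this way means unsupported engines never pay for the sparsity estimation at all, matching the "it just says no right away" behavior.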

What it means for you, the user, is that we technically also have sparse data support as input. So if you have a sparse matrix, you can fit and predict on it without going through something else; it's natively supported by turning it into a tibble with sparse columns right as it comes in. And because we're not perfect, the automatic toggle can be turned off or on manually by setting the sparse argument in the different steps to yes or no.

So it basically means, with the code from before, the only thing we'd need to change is setting this argument to yes, in case our little prediction model got it wrong. But as far as we know, no one is seeing extreme slowdowns from this, and the potential speed improvement is really high.

And that's what I have. There's a link to the slides, there are a number of articles on the tidymodels.org website that show how to do this, there's a blog post, and we have a link to all the currently supported model engines and recipe steps. Thank you.