Resources

duckplyr: Analyze large data with full dplyr compatibility (Kirill Müller, cynkra)

duckplyr: Analyze large data with full dplyr compatibility

Speaker(s): Kirill Müller

Abstract: The duckplyr package is now stable; version 1.0.0 has been published on CRAN. Learn how to use this package to speed up your existing dplyr code with little to no changes, and how to work with larger-than-memory data using a syntax that not only feels like dplyr for data frames, but behaves exactly like it.

Materials: https://github.com/cynkra/posit-conf-2025

posit::conf(2025)
Subscribe to posit::conf updates: https://posit.co/about/subscription-management/


Transcript

This transcript was generated automatically and may contain errors.

Good afternoon. My talk today is about the duckplyr R package. I've been working on this for the past three years together with Posit, DuckDB, and my colleagues at cynkra, so thank you for this.

Why DuckDB?

DuckDB is an embedded analytical database: SQLite for analytics, if you will. Because of that, DuckDB can be packaged in a variety of ways: as a command-line interface, as a Python, Rust, or WebAssembly package, and also as an R package.

I see three main reasons why you may wish to use DuckDB. If your data is large, DuckDB can read straight from Parquet and from other formats. If your data analysis or data pipeline is just a little too slow, DuckDB will make it faster. And then the third reason, DuckDB comes with a lot of exciting functions and data types, and there are extensions that add to that.

All these are valid reasons to use DuckDB as a command-line tool or through the DuckDB R package. Now, duckplyr, to me, is the interface that is more native to R, and I've tried to make it more accessible so that it blends in with all the other tools in the R ecosystem.

Accessing DuckDB from R

Now, looking at ways to access DuckDB from R: you could use the command-line interface, shown at the bottom right. The more traditional way is through DBI, a generic package to connect to a variety of databases, including DuckDB. You would write SQL to interact through DBI or through the command line. If you enjoy dplyr-like interfaces, as I do, you would use dbplyr, which connects through DBI.

So dbplyr translates your dplyr pipelines into SQL, runs them on the database, and collects the results, but only if you tell it to. dbplyr works with lazy tables. And I forgot to mention: all of this will be available after the talk. You can run the code, and the slides are on GitHub.

So dbplyr translates to SQL and gets the data back as a lazy table. You can collect, and indeed you need to collect to get the result as a data frame. duckplyr is similar, but different in a few respects. First, there is no SQL involved: we use a lower-level interface so that we don't have to translate to text and back from binary formats. And second, you don't need to collect.

So the data comes back as a true and proper data frame, and it will be materialized on first access.

Reading Parquet files

This is what it looks like to access a Parquet file in SQL through the command line. You give, in the FROM clause, the path to the Parquet file, just as you would give a table name to your bog-standard database, and DuckDB will read the data from the Parquet file and perform the operations. In this case it is a simple row count, but this can grow much more complex.

The more traditional ways of accessing DuckDB through R look like this. Typically, you would establish a DBI connection first. For dbplyr, at the top right, you call a function that points to the Parquet file and returns a lazy table, which then prints when you access it. With DBI, you write SQL the same way you would with the command line. With duckplyr, this is, I believe, a little simpler, and you get back a proper data frame.
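To make the comparison concrete, here is a minimal sketch of the three access paths. The file name `data.parquet` is a placeholder, and the exact verbs may vary slightly between package versions:

```r
library(DBI)
library(dplyr)

con <- dbConnect(duckdb::duckdb())

# 1. DBI: write SQL yourself, get an ordinary data frame back
dbGetQuery(con, "SELECT count(*) AS n FROM 'data.parquet'")

# 2. dbplyr: dplyr verbs are translated to SQL; the result is a
#    lazy table, so an explicit collect() is needed for a data frame
tbl(con, sql("SELECT * FROM 'data.parquet'")) |>
  count() |>
  collect()

# 3. duckplyr: no SQL, no collect(); the result is a data frame
#    that materializes on first access
duckplyr::read_parquet_duckdb("data.parquet") |>
  count()
```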

Demo: remote data from Hugging Face

Now, this is another example. I use a data set from Hugging Face. Instead of a file, I point to a URL with the hf:// protocol specifier for the location of the data set. The same way as with a local Parquet file, I can run my operations, and here the explain verb will show me, before I run the operation, what is going to happen when it runs. Starting with reading the Parquet file, you see we only need two columns, so we only read those two columns. That's what the Parquet file format allows you to do.
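A sketch of what this looks like in code; the dataset path and column name below are hypothetical placeholders, not the actual demo:

```r
library(duckplyr)
library(dplyr)

# hf:// points DuckDB at a dataset hosted on Hugging Face
# (hypothetical path, for illustration only)
persons <- read_parquet_duckdb(
  "hf://datasets/example-org/example-dataset/data.parquet"
)

persons |>
  count(age_group) |>  # hypothetical column name
  explain()            # print the DuckDB plan before running anything
```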

I have prepared a demo, just to show you how fast this is. This is over the internet, accessing the data directly from Hugging Face.

What it does is create this lazy data frame, and that took no time. It prints, and you can use glimpse() to get an idea of what's in the data: 100,000 synthetic persons with demographics and textual information on their preferences.

The explain for this example is a bit different. For the entire Parquet file, DuckDB knows the metadata, but, again, as you run queries on it, you don't need to access all the columns. You don't need to load the data up front, which is important, I think, if you restart your R session very often.

This was the aggregation. This is the computation, again all over the internet, very fast. And what I like best: you can offload this to ggplot, pipe it directly into ggplot, and ggplot will know how to handle this data, because there is nothing to know. This is a normal data frame for all practical purposes.

Now, this was on the pre-aggregated data with geom_col(). If we try to do the same on the raw data: bam, this is what happens. No, it's not going to work, the data is too large. This is not a bug; it is a protection mechanism that I call, for some reason, "prudence", and it avoids frying your computer.

So try to access a column, try to do a row slice: no, it's not going to work. You can do this as soon as the data becomes small. Accessing one of the columns from the results works, because the data is small, and this is how ggplot got access to the data.
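A hedged sketch of the prudence guard in action; the file path and column names are placeholders:

```r
library(duckplyr)
library(dplyr)

raw <- read_parquet_duckdb("big.parquet")  # placeholder path

# Materializing the raw data trips the prudence guard:
# raw$some_column   # error: result would be too large
# raw[1:5, ]        # error as well

# After aggregation the result is small, so access just works:
counts <- raw |> count(category)  # hypothetical column
counts$n                          # fine: materializes the small result
```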

Recommendations for large data

So for large data, my recommendation is: use remote access to peek into the data. This feature will avoid pulling data that is too large onto your local system. And I forgot to mention: everything runs on the local system. You don't need a big server; you run everything on your workstation, and it's fast.

If you decide to take a deeper look into the data, by all means get it locally, because this will then be faster by an order of magnitude or two, if the data fits onto your disk. As soon as the code is ready for production, change one line of code, the way you access the data, from local to remote, and you're done. It's going to work the same way.
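The one-line switch might look like this; paths and the column name are hypothetical:

```r
library(duckplyr)

# Development: work on a local copy, an order of magnitude or two faster
source_path <- "data/local-copy.parquet"
# Production: flip only this one line back to the remote source
# source_path <- "hf://datasets/example-org/example-dataset/data.parquet"

result <- read_parquet_duckdb(source_path) |>
  dplyr::count(category)  # rest of the pipeline stays unchanged
```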

Speeding up existing pipelines

Of course, analyzing large data is slow. The same can be true for smaller data sets. This is a completely synthetic example: 1 million rows, doing a simple summary over 40,000 groups, took five seconds on my machine. Imagine multiplying this by ten.
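A synthetic baseline along those lines, as a sketch (the exact shape of the talk's benchmark is assumed, and timings will differ per machine):

```r
library(dplyr)

set.seed(42)
n <- 1e6
df <- tibble(
  g = sample.int(40000, n, replace = TRUE),  # ~40,000 groups
  x = runif(n)
)

# plain dplyr baseline: the kind of summary duckplyr can speed up
system.time(
  df |> summarise(m = mean(x), .by = g)
)
```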

Now, if you have an existing pipeline, and this is one of the strengths and one of the original purposes duckplyr was intended for, if you just want to make it a little bit faster, you have basically three strategies. You could go all in and try to do everything end-to-end with DuckDB; I think this amounts to a rewrite of the entire code. The best-effort mode is the default mode. And my recommendation would be a very targeted optimization of just the slow parts.

Now, going all in means you set an environment variable, and this will not even get to the analysis, because already the generation of the data tells us: look, I don't know what to do with this rnorm(). There is no translation for it. So even though we don't do SQL, we still need to translate, and we need to provide a translation for each and every R function used in the verbs. This hasn't been done for rnorm() yet; I'll show later how to work around it.

The best-effort mode is the standard mode of operation: when you attach the duckplyr package, in the beginning it will tell you, look, I am changing the way this R session works. Everything you do with dplyr will go through duckplyr first. So we try with duckplyr; if it works, fine. If not, we offload to dplyr, which works the same.
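A minimal sketch of best-effort mode, assuming that simply attaching duckplyr overwrites the dplyr methods as described:

```r
library(duckplyr)  # announces that dplyr methods are now overwritten

df <- data.frame(g = rep(1:3, each = 4), x = runif(12))

# Plain dplyr code: duckplyr tries DuckDB first and silently
# falls back to dplyr where a translation is missing
out <- df |> dplyr::summarise(m = mean(x), .by = g)
class(out)  # an ordinary data frame
```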

The generation works, and we see that the "analysis" now runs almost an order of magnitude faster: 600 milliseconds elapsed, with four seconds of user time, which means all the cores of my computer have been busy computing these results.

The targeted mode requires a little more work, but you get better control of what happens. You don't let duckplyr take over: you undo what attaching the package does with methods_restore(). This means that each operation will go through dplyr until you tell it otherwise, and the way to tell it to go through duckplyr is the as_duckdb_tibble() verb.
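A sketch of the targeted mode, assuming methods_restore() and as_duckdb_tibble() behave as described in the talk:

```r
library(duckplyr)
methods_restore()  # undo the takeover: dplyr verbs run in dplyr again

df <- data.frame(g = rep(1:3, each = 4), x = runif(12))

# Opt in explicitly, only for the slow step:
out <- df |>
  as_duckdb_tibble() |>
  dplyr::summarise(m = mean(x), .by = g)
```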

And a large data pipeline for one of our customers takes 30 seconds and is used semi-interactively, so this hurts. Adding this in three crucial places in the code cuts it down to ten seconds, and I'm looking forward to getting it down to three, or maybe less. Find the slow-running parts, add two lines of code, and you're winning.

The best-effort mode, well, let me put it like this: there is a non-zero chance that an operation will produce different results with duckplyr than with original dplyr, and your pipeline may implicitly rely on that. I'm talking about window functions that depend on the ordering; this is a weak spot and could bring trouble. So best-effort mode means: please check whether the results are still the same. Of course, this also applies to the targeted mode, but with the targeted mode it is much easier to isolate.

The dd:: syntax and DuckDB extensions

Now, we didn't have rnorm(), but this new dd:: syntax gives us access to all the functions in DuckDB and in the extensions that are available and loaded. Just by writing dd::random(), we tell the system: look, there is no R function like this; look for a DuckDB function of this name and run it for me. So this will be passed through directly to DuckDB and just work. We see the result as an almost uniform histogram.
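A sketch of the pass-through, using the dd:: spelling as presented in the talk (check your duckplyr version for the exact spelling of the dd prefix):

```r
library(duckplyr)
library(dplyr)

# dd::random() is not an R function; duckplyr passes the call
# through to DuckDB's built-in random()
duckdb_tibble(i = 1:10000) |>
  mutate(u = dd::random()) |>   # uniform values computed by DuckDB
  summarise(mean_u = mean(u))   # should be close to 0.5
```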

Now, the original code had rnorm(). This is not available in DuckDB, but there is a community extension called stochastic. I found it by searching through the list of community extensions, and when I found stochastic, it was clear that this is probably where such a function would live. Bingo: a normal-sample function is available as soon as I load the stochastic extension, and I can run it to get normally distributed samples. And again, piping directly into ggplot gives the plot without any extra effort.

This dd:: is just an artificial syntax, so if you run this code in dplyr proper, it will not work today. But there is a package on GitHub that at least gives you the help pages for the functions it knows about, so there is help for random. And two years ago at posit::conf, I was very passionate about nested data. I still am, and there are three functions that let you nest and unnest data with duckplyr.

And if you're curious, this is the link to the material that gives you all the code and much more to experiment with on your own. Install from CRAN, run library(duckplyr) to attach the package, and maybe consider methods_restore(), at least for new code, to get operations for large data, fast, with a lot of goodies built in. Thank you.

Q&A

First question. I think somebody wants a TL;DR: why build duckplyr instead of using dbplyr?

So again, I think duckplyr is the interface that is more native to R. dbplyr doesn't return data frames; it returns a proxy object that you just cannot pass into ggplot unless ggplot takes measures to support it, which I'm not sure it does today. And even if ggplot does, many other packages will not know how to handle a lazy table from dbplyr, but they will know how to handle a data frame.

How does duckplyr interact with arrow? I have no idea. We'll save an answer for later.

Can I control when prudence kicks in? Yes. In all the functions that generate a lazy duckplyr data frame, there is a prudence argument, ranging from "I don't care how big the data is" all the way down to "always give me an error if it cannot be run with DuckDB". The middle setting is the default when you read from a file: it will just break if the data exceeds a million cells.
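A sketch of the prudence argument; the setting names below are as I understand them from the duckplyr documentation, so verify them against your installed version:

```r
library(duckplyr)

# prudence settings (verify names for your duckplyr version):
#   "lavish"  - materialize freely, regardless of size
#   "thrifty" - default for file ingestion: materialize up to ~1 million cells
#   "stingy"  - never materialize implicitly; error if DuckDB can't run it
big <- read_parquet_duckdb("data.parquet", prudence = "stingy")
```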

How does duckplyr avoid the user needing to call collect()? Once the data is small enough, does it call collect() for you? No: the object is a proper data frame. It has a class, data.frame proper with a few subclasses, and it is essentially a list of R vectors of the same length, pretty much the same as a regular data frame in memory. But each of those vectors is an ALTREP vector, a lazy-vector technique built into R's core that allows us to trigger computation just when the data is accessed. That's why you don't need to compute or collect.