Resources

duckplyr: Analyze large data with full dplyr compatibility (Kirill Müller, cynkra)

duckplyr: Analyze large data with full dplyr compatibility

Speaker(s): Kirill Müller

Abstract: The duckplyr package is now stable; version 1.0.0 has been published on CRAN. Learn how to use this package to speed up your existing dplyr code with little to no changes, and how to work with larger-than-memory data using a syntax that not only feels like dplyr for data frames, but behaves exactly like it.

Materials: https://github.com/cynkra/posit-conf-2025

posit::conf(2025)
Subscribe to posit::conf updates: https://posit.co/about/subscription-management/


Transcript

This transcript was generated automatically and may contain errors.

Good afternoon. My talk today is about the duckplyr R package. I've been working on this for the past three years together with Posit, DuckDB, and my colleagues at cynkra, so thank you for this.

Why DuckDB?

DuckDB is an embedded analytical database: SQLite for analytics, if you will. Because of that, DuckDB can be packaged in a variety of ways: as a command-line interface, as a Python, Rust, or WebAssembly package, and also as an R package.

I see three main reasons why you may wish to use DuckDB. If your data is large, DuckDB can read straight from Parquet and from other formats. If your data analysis or data pipeline is just a little too slow, DuckDB will make it faster. And then the third reason, DuckDB comes with a lot of exciting functions and data types, and there are extensions that add to that.

All these are valid reasons to use DuckDB as a command-line tool or through the DuckDB R package. Now, duckplyr, to me, is the interface that is more native to R, and I've tried to make it more accessible so that it blends in with all the other tools in the R ecosystem.

Accessing DuckDB from R

Now, looking at ways to access DuckDB from R: you could use the command-line interface, shown at the bottom right. The more traditional way is through DBI, a generic package to connect to a variety of databases, including DuckDB. You would write SQL to interact through DBI or through the command line. If you enjoy dplyr-like interfaces, as I do, you would use dbplyr, which connects through DBI.

So dbplyr translates your dplyr pipelines into SQL, runs them on the database, and collects the results, but only if you tell it to. dbplyr works with lazy tables. And I forgot to mention: all of this will be available after the talk. You can run the code, and the slides are on GitHub.

So dbplyr translates to SQL and gets the data back as a lazy table. You can collect, and indeed you need to collect to get the result as a data frame. duckplyr is similar, but different in a few respects. First, there is no SQL involved: we use a lower-level interface so that we don't have to translate to text and back from binary formats. And second, you don't need to collect.

So the data comes back as a true and proper data frame, and it will be materialized on first access.

Reading Parquet files

This is what it looks like to access a Parquet file in SQL through the command line. You give, in the FROM clause, the path to the Parquet file, just as you would give a table name to your bog-standard database, and DuckDB will read the data from the Parquet file and perform the operations. In this case it is a simple row count, but this can grow much more complex.

The more traditional ways of accessing DuckDB through R look like this. Typically, you would establish a DBI connection first. For dbplyr, at the top right, you call a function that points to the Parquet file and returns a lazy table, which then prints when you access it. With DBI, you write SQL the same way you would with the command line. With duckplyr, this is, I believe, a little simpler, and you get back a proper data frame.
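To make the comparison concrete, here is a minimal sketch of the three access paths. The file name `data.parquet` is a placeholder, and the exact verbs may vary slightly between package versions:

```r
library(DBI)
library(dplyr)

con <- dbConnect(duckdb::duckdb())

# 1. DBI: write SQL yourself, get an ordinary data frame back
dbGetQuery(con, "SELECT count(*) AS n FROM 'data.parquet'")

# 2. dbplyr: dplyr verbs are translated to SQL; the result is a
#    lazy table, so an explicit collect() is needed for a data frame
tbl(con, sql("SELECT * FROM 'data.parquet'")) |>
  count() |>
  collect()

# 3. duckplyr: no SQL, no collect(); the result is a data frame
#    that materializes on first access
duckplyr::read_parquet_duckdb("data.parquet") |>
  count()
```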

Demo: remote data from Hugging Face

Now, this is another example. I use a data set from Hugging Face. Instead of a file, I point to a URL with the hf:// protocol specifier for the location of the data set. The same way as with a local Parquet file, I can run my operations, and here the explain verb will show me, before I run the operation, what is going to happen when it runs. Starting with reading the Parquet file, you see we only need two columns, so we only read those two columns. That's what the Parquet file format allows you to do.
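A sketch of what this looks like in code; the dataset path and column name below are hypothetical placeholders, not the actual demo:

```r
library(duckplyr)
library(dplyr)

# hf:// points DuckDB at a dataset hosted on Hugging Face
# (hypothetical path, for illustration only)
persons <- read_parquet_duckdb(
  "hf://datasets/example-org/example-dataset/data.parquet"
)

persons |>
  count(age_group) |>  # hypothetical column name
  explain()            # print the DuckDB plan before running anything
```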

I have prepared a demo, just to show you how fast this is. This is over the internet, accessing the data directly from Hugging Face.

What it does is create this lazy data frame, and that took no time. It prints, and you can use glimpse() to get an idea of what's in the data: 100,000 synthetic persons with demographics and textual information on their preferences.

The explain for this example is a bit different. For the entire Parquet file, DuckDB knows the metadata, but, again, as you run queries on it, you don't need to access all the columns. You don't need to load the data up front, which is important, I think, if you restart your R session very often.

This was the aggregation. This is the computation, again all over the internet, very fast. And what I like best: you can offload this to ggplot, pipe it directly into ggplot, and ggplot will know how to handle this data, because there is nothing to know. This is a normal data frame for all practical purposes.

Now, this was on the pre-aggregated data with geom_col(). If we try to do the same on the raw data: bam, this is what happens. No, it's not going to work, the data is too large. This is not a bug; it is a protection mechanism that I call, for some reason, "prudence", and it avoids frying your computer.

So try to access a column, try to do a row slice: no, it's not going to work. You can do this as soon as the data becomes small. Accessing one of the columns from the results works, because the data is small, and this is how ggplot got access to the data.
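A hedged sketch of the prudence guard in action; the file path and column names are placeholders:

```r
library(duckplyr)
library(dplyr)

raw <- read_parquet_duckdb("big.parquet")  # placeholder path

# Materializing the raw data trips the prudence guard:
# raw$some_column   # error: result would be too large
# raw[1:5, ]        # error as well

# After aggregation the result is small, so access just works:
counts <- raw |> count(category)  # hypothetical column
counts$n                          # fine: materializes the small result
```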

Recommendations for large data

So for large data, my recommendation is: use remote access to peek into the data. This feature will avoid pulling data that is too large onto your local system. And I forgot to mention: everything runs on the local system. You don't need a big server; you run everything on your workstation, and it's fast.

If you decide to take a deeper look into the data, by all means get it locally, because this will then be faster by an order of magnitude or two, if the data fits onto your disk. As soon as the code is ready for production, change one line of code, the way you access the data, from local to remote, and you're done. It's going to work the same way.
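The one-line switch might look like this; paths and the column name are hypothetical:

```r
library(duckplyr)

# Development: work on a local copy, an order of magnitude or two faster
source_path <- "data/local-copy.parquet"
# Production: flip only this one line back to the remote source
# source_path <- "hf://datasets/example-org/example-dataset/data.parquet"

result <- read_parquet_duckdb(source_path) |>
  dplyr::count(category)  # rest of the pipeline stays unchanged
```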

Speeding up existing pipelines

Of course, analyzing large data is slow. The same can be true for smaller data sets. This is a completely synthetic example: 1 million rows, doing a simple summary over 40,000 groups, took five seconds on my machine. Imagine multiplying this by ten.
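A synthetic baseline along those lines, as a sketch (the exact shape of the talk's benchmark is assumed, and timings will differ per machine):

```r
library(dplyr)

set.seed(42)
n <- 1e6
df <- tibble(
  g = sample.int(40000, n, replace = TRUE),  # ~40,000 groups
  x = runif(n)
)

# plain dplyr baseline: the kind of summary duckplyr can speed up
system.time(
  df |> summarise(m = mean(x), .by = g)
)
```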

Now, if you have an existing pipeline, and this is one of the strengths and one of the original purposes duckplyr was intended for, if you just want to make it a little bit faster, you have basically three strategies. You could go all in and try to do everything end-to-end with DuckDB; I think this amounts to a rewrite of the entire code. The best-effort mode is the default mode. And my recommendation would be a very targeted optimization of just the slow parts.

Now, going all in means you set an environment variable, and this will not even get to the analysis, because already the generation of the data tells us: look, I don't know what to do with this rnorm(). There is no translation for it. So even though we don't do SQL, we still need to translate, and we need to provide a translation for each and every R function used in the verbs. This hasn't been done for rnorm() yet; I'll show later how to work around it.

The best-effort mode is the standard mode of operation: when you attach the duckplyr package, in the beginning it will tell you, look, I am changing the way this R session works. Everything you do with dplyr will go through duckplyr first. So we try with duckplyr; if it works, fine. If not, we offload to dplyr, which works the same.
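A minimal sketch of best-effort mode, assuming that simply attaching duckplyr overwrites the dplyr methods as described:

```r
library(duckplyr)  # announces that dplyr methods are now overwritten

df <- data.frame(g = rep(1:3, each = 4), x = runif(12))

# Plain dplyr code: duckplyr tries DuckDB first and silently
# falls back to dplyr where a translation is missing
out <- df |> dplyr::summarise(m = mean(x), .by = g)
class(out)  # an ordinary data frame
```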

The generation works, and we see that the "analysis" now runs almost an order of magnitude faster: 600 milliseconds elapsed, with four seconds of user time, which means all the cores of my computer have been busy computing these results.

The targeted mode requires a little more work, but you get better control of what happens. You don't let duckplyr take over: you undo what attaching the package does with methods_restore(). This means that each operation will go through dplyr until you tell it otherwise, and the way to tell it to go through duckplyr is the as_duckdb_tibble() verb.
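A sketch of the targeted mode, assuming methods_restore() and as_duckdb_tibble() behave as described in the talk:

```r
library(duckplyr)
methods_restore()  # undo the takeover: dplyr verbs run in dplyr again

df <- data.frame(g = rep(1:3, each = 4), x = runif(12))

# Opt in explicitly, only for the slow step:
out <- df |>
  as_duckdb_tibble() |>
  dplyr::summarise(m = mean(x), .by = g)
```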

And a large data pipeline for one of our customers takes 30 seconds and is used semi-interactively, so this hurts. Adding this in three crucial places in the code cuts it down to ten seconds, and I'm looking forward to getting it down to three, or maybe less. Find the slow-running parts, add two lines of code, and you're winning.

The best-effort mode, well, let me put it like this: there is a non-zero chance that an operation will produce different results with duckplyr than with original dplyr, and your pipeline may implicitly rely on that. I'm talking about window functions that depend on the ordering; this is a weak spot and could bring trouble. So best-effort mode means: please check whether the results are still the same. Of course, this also applies to the targeted mode, but with the targeted mode it is much easier to isolate.

The dd:: syntax and DuckDB extensions

Now, we didn't have rnorm(), but this new dd:: syntax gives us access to all the functions in DuckDB and in the extensions that are available and loaded. Just by writing dd::random(), we tell the system: look, there is no R function like this; look for a DuckDB function of this name and run it for me. So this will be passed through directly to DuckDB and just work. We see the result as an almost uniform histogram.
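A sketch of the pass-through, using the dd:: spelling as presented in the talk (check your duckplyr version for the exact spelling of the dd prefix):

```r
library(duckplyr)
library(dplyr)

# dd::random() is not an R function; duckplyr passes the call
# through to DuckDB's built-in random()
duckdb_tibble(i = 1:10000) |>
  mutate(u = dd::random()) |>   # uniform values computed by DuckDB
  summarise(mean_u = mean(u))   # should be close to 0.5
```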

Now, the original code had rnorm(). This is not available in DuckDB, but there is a community extension called stochastic. I found it by searching through the list of community extensions, and when I found stochastic, it was clear that this is probably where such a function would live. Bingo: a normal-sample function is available as soon as I load the stochastic extension, and I can run it to get normally distributed samples. And again, piping directly into ggplot gives the plot without any extra effort.

This dd:: is just an artificial syntax, so if you run this code in dplyr proper, it will not work today. But there is a package on GitHub that at least gives you the help pages for the functions it knows about, so there is help for random. And two years ago at posit::conf, I was very passionate about nested data. I still am, and there are three functions that let you nest and unnest data with duckplyr.

And if you're curious, this is the link to the material that gives you all the code and much more to experiment with on your own. Install from CRAN, run library(duckplyr) to attach the package, and maybe consider methods_restore(), at least for new code, to get operations for large data, fast, with a lot of goodies built in. Thank you.

Q&A

First question. I think somebody wants a TL;DR: why build duckplyr instead of using dbplyr?

So again, I think duckplyr is the interface that is more native to R. dbplyr doesn't return data frames; it returns a proxy object that you just cannot pass into ggplot unless ggplot takes measures to support it, which I'm not sure it does today. And even if ggplot does, many other packages will not know how to handle a lazy table from dbplyr, but they will know how to handle a data frame.

How does duckplyr interact with arrow? I have no idea. We'll save an answer for later.

Can I control when prudence kicks in? Yes. In all the functions that generate a lazy duckplyr data frame, there is a prudence argument, ranging from "I don't care how big the data is" all the way down to "always give me an error if it cannot be run with DuckDB". The middle setting is the default when you read from a file: it will just break if the data exceeds a million cells.
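A sketch of the prudence argument; the setting names below are as I understand them from the duckplyr documentation, so verify them against your installed version:

```r
library(duckplyr)

# prudence settings (verify names for your duckplyr version):
#   "lavish"  - materialize freely, regardless of size
#   "thrifty" - default for file ingestion: materialize up to ~1 million cells
#   "stingy"  - never materialize implicitly; error if DuckDB can't run it
big <- read_parquet_duckdb("data.parquet", prudence = "stingy")
```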

How does duckplyr avoid the user needing to call collect()? Once the data is small enough, does it call collect() for you? No: the object is a proper data frame. It has a class, data.frame proper with a few subclasses, and it is essentially a list of R vectors of the same length, pretty much the same as a regular data frame in memory. But each of those vectors is an ALTREP vector, a lazy-vector technique built into R's core that allows us to trigger computation just when the data is accessed. That's why you don't need to compute or collect.