Resources

duckplyr: Tight Integration of duckdb with R and the tidyverse - posit::conf(2023)

Presented by Kirill Müller.

The duckplyr R package combines the convenience of dplyr with the performance of DuckDB. Better than dbplyr: data frame in, data frame out, fully compatible with dplyr.

duckdb is the new high-performance analytical database system that works great with R, Python, and other host systems. dplyr is the grammar of data manipulation in the tidyverse, tightly integrated with R, but it works best for small or medium-sized data. The former has been designed with large or big data in mind, but currently you need to formulate your queries in SQL. The new duckplyr package offers the best of both worlds: it transforms a dplyr pipe into a query object that duckdb can execute, using an optimized query plan. It is better than dbplyr because the interface is "data frames in, data frames out", and no intermediate SQL code is generated. The talk first presents our results, a bit of the mechanics, and an outlook for this ambitious project.

Materials: https://github.com/duckdblabs/duckplyr/

Presented at Posit Conference, September 19-20, 2023. Learn more at posit.co/conference.

Talk Track: Databases for data science with duckdb and dbt. Session Code: TALK-1100


Transcript

This transcript was generated automatically and may contain errors.

My name is Kirill. Today I present the first results from my collaboration with DuckDB Labs and Posit. I wouldn't be here today if not for Hannes and Tom, and Davis, Hadley, Andrew, and Kevin. And thank you for joining me for this presentation.

So duckdb, as we have heard, in case you missed the previous talk, is the new high-performance database that works in R, in Python, in Java, JavaScript, Rust, Swift, Scala, I don't know, everywhere. And it integrates with Arrow.

dplyr: this is example code using the taxi data that Hannes just showed. It computes the median tip percentage by passenger count for cab rides in New York City in 2019. That is 1.7 gigabytes compressed, more if you uncompress it. You can load this into memory and use dplyr to compute this, but have you actually tried to do things with such a large dataset in dplyr? I feel uneasy. I'm not very patient. This takes eight seconds on my machine. And spoiler alert: with duckplyr it is going to be much faster.
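The pipeline described here might look roughly like this. This is a sketch, not the talk's exact code: the column names (passenger_count, tip_amount, total_amount) follow the common NYC taxi schema and are my assumptions, and taxi_data is assumed to be already loaded as a data frame.

```r
library(dplyr)

# Sketch of the query from the talk: median tip percentage by passenger
# count. Column names follow the usual NYC taxi schema (an assumption).
taxi_summary <- taxi_data |>
  filter(total_amount > 0) |>
  mutate(tip_pct = 100 * tip_amount / total_amount) |>
  summarise(
    median_tip_pct = median(tip_pct),
    n = n(),
    .by = passenger_count
  ) |>
  arrange(passenger_count)
```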

Why dplyr over SQL

But before we get to the how, let me start with why. I get a lot of joy out of writing and reading dplyr code. Even if you're not familiar with the syntax, you can read from top to bottom, you can see what each individual step is doing, and you can execute parts of the pipeline to inspect intermediate results.

Now this is a SQL version of the same query. The order of operations is mixed up. And to run a simplified version of this thing for, I don't know, debugging, you would have to carefully edit it. So I find that for incremental building and debugging, dplyr is easier; it is better.

Now a feature matrix could look like this. We have dplyr, which is nice, but slower. We have duckdb, which uses SQL, but is super fast. Hannes spent about 20 minutes explaining why: multi-threading, native code, and it sees the queries and optimizes them as a whole, end to end, choosing the best way to compute the desired result. Maybe the most important thing: Hannes and his team are talented database specialists with decades of experience, and this is their day job, this is what they do day in, day out, making DuckDB faster and faster and ever faster.

Introducing duckplyr

So can we have the icing and the cake? Can we have fast dplyr? Yes, this has worked before. With dbplyr, we can take dplyr code, convert it into SQL, run it on the database, get it back as a data frame, and continue operating in R. And there are other such systems, such as data.table and dtplyr. There is arrow, there is collapse, and there may be more that implement the dplyr syntax.

What is wrong here? Well, we inherit too many features from the underlying database. DuckDB can read Parquet, this is great, so dbplyr can also read Parquet; duckdb works with datasets larger than memory, perfect, so dbplyr does too. But when we get the data from dbplyr, it's not a data frame; we have to collect() it, transform it into a data frame. We don't support all the R functions, we don't support all the R data types, and for some functions even the implementation or the semantics are a little different. Bottom line: chances are your dplyr code, if you take it and try to run it with dbplyr, will not work out of the box.
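To illustrate that friction, here is a minimal dbplyr-style sketch. The file name is illustrative, and the point is the last step: the query result is a lazy database table, not a data frame, until you call collect().

```r
library(DBI)
library(duckdb)
library(dplyr)

con <- dbConnect(duckdb::duckdb())

# A lazy tbl backed by DuckDB; 'taxi-2019.parquet' is an illustrative name.
tips <- tbl(con, sql("SELECT * FROM read_parquet('taxi-2019.parquet')")) |>
  count(passenger_count)

# Not a data frame yet: an explicit collect() step is needed.
tips_df <- collect(tips)

dbDisconnect(con, shutdown = TRUE)
```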

Now unlike dbplyr, duckplyr is designed as a full drop-in replacement for dplyr: data frame in, data frame out. Support for all data types, all R functions. I mean, yes, we try to do things in duckdb; if we can't, we just offload to dplyr and handle it there without breaking your code.

My goal is to make all of your dplyr code work with duckplyr and make more and more of your code work faster so that you can focus on what matters to you and achieve better results in less time, development time, and computation time.

And now this is me working on duckplyr, so tests are green, everything passes, we need a new feature, maybe we go to Hannes and ask them to build something, sometimes it's the other way around, duckdb changes behavior, it challenges my expectations, but it's a lot of fun iterating between correctness, performance, and completeness.

So working on duckplyr is test-first development, except the tests already exist, we don't need to write them, we have 3,500 test cases in the dplyr package, and we can adapt them to run against duckplyr. And we have 4,000 packages on CRAN that use dplyr directly or indirectly. We can create a modified version of dplyr that uses duckplyr and see what happens with those packages. So the combination here helps us find the occasional bug and also guide our development efforts.

Design decisions and limitations

Now if you've been in Davis's talk, .by is the new kid on the block for grouped operations. group_by() we do not support. This is laziness, and it is because we want you to use the new syntax, and frankly fewer things can go wrong this way.
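For readers who missed that talk: the per-operation grouping syntax from dplyr 1.1 replaces a group_by()/ungroup() pair with a .by argument on the verb itself. A small self-contained example on mtcars:

```r
library(dplyr)

# Old style: persistent grouping that you have to remove again.
by_cyl_old <- mtcars |>
  group_by(cyl) |>
  summarise(mean_mpg = mean(mpg)) |>
  ungroup()

# New style (dplyr >= 1.1): grouping is local to this one summarise().
by_cyl_new <- mtcars |>
  summarise(mean_mpg = mean(mpg), .by = cyl)

by_cyl_new  # one row per distinct cyl value
```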

Now what happens if you compare an integer with a string? What happens when you compare an integer with a factor? On what day of the week does the week start? R and duckdb have different opinions here and on many, many other questions. This is why we support only a small subset of the functions right now; this is more opt-in than opt-out. We know how to handle those; implementing the rest will simply take us time.

Now one more difference is row order. Think about a data frame: you sort it by a specific column, and if there are ties, multiple rows with the same value, the original order of those rows is preserved with dplyr. duckdb just doesn't care. Because duckdb cares about performance, it might shuffle your rows around. This is why we have two modes of operation: a slower mode that maintains the original row order, and a fast mode that does whatever duckdb does here. And we'll see which one will be most useful in the end.

Performance demo

Now, speaking about performance, I did not dare to make a live demo, but I'm showing you what this looks like in the RStudio IDE. This starts in a clean R session. I'm attaching the dplyr and duckplyr packages, and I'm using conflicted here to reduce the noise and to be very specific about the functions I use. If you're not familiar with it, I highly recommend this for your scripts.
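The setup described here might look like this; conflict_prefer_all() is one way to resolve the clashes in a single call, an assumption about the exact calls used in the demo.

```r
# Session setup as described: conflicted turns silent masking of
# function names into explicit, per-session choices.
library(conflicted)
library(dplyr)
library(duckplyr)

# Prefer duckplyr's implementation wherever both packages export a verb.
conflict_prefer_all("duckplyr", quiet = TRUE)
```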

This is the same query I had in the beginning, the taxi data ingestion, preparation, and summary, and I'm running this in three statements. These take no time. Push, push, push. They're done immediately, and this is because nothing has actually been computed here. You see at the bottom that the results just don't know their number of rows yet. These are lazy data frames; they are proxies, and they need to be computed in order to be available as full data frames. Yet they have a class, data.frame plus a subclass; they have column names, they have column types, just not the data yet.

I can use DuckDB internals to see what will happen when the query is actually executed. This is a lot of output, and it reads from bottom to top. We start at the bottom by ingesting the Parquet file, reading only the columns that we need and only the rows that we need. Further along the way there is the transformation, the computation of percentages; there is the grouping operation; and there is the ordering of the output. And all of this, from start to finish, runs as a single DuckDB operation.

Did I mention it's fast? We'll see. So running this, touching a column, querying, getting the value of a column, actually triggers the computation. Again, this output is huge. In the middle you see the vector, 987 and so forth; that's the regular R vector that comes back from this operation. At the top we see the recipe, that's what DuckDB is using to compute the result.

Because I used system.time() in the code, I see it took 1.8 seconds to run, and in the user column 5 seconds of computation time, meaning more than one core, more than one processor, was involved: multi-threading. Now, doing this a second time, or accessing the full data frame, is instantaneous, because once we have computed it we do not need to do it again, of course.
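The timing pattern might look like this; taxi_summary stands in for the lazy result built in the demo, and the numbers in the comment are the ones quoted in the talk, not reproducible output.

```r
# elapsed is wall-clock time; user can exceed elapsed when DuckDB
# spreads the work across several threads.
timing <- system.time(
  result <- as.data.frame(taxi_summary)  # forces the lazy computation
)
timing
# In the talk: elapsed ~1.8 s, user ~5 s, i.e. multiple cores at work.
```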

If we compare this with a more traditional dplyr workflow, loading the data and then doing the computation: materializing the taxi data variable is the first step. I get the number of rows, which also triggers the computation. This is now all the columns, and it takes four seconds, whereas doing the whole analysis with duckplyr took two. So, well, I think that's already an improvement.

The refresh arrow in the environment pane is now actually useful: clicking on it shows, for the datasets that we have touched, how many rows there are. And we see that the taxi data prep, the preparation step, has not even been materialized yet. We don't need it, because the result has been computed end to end with duckdb.

Now this has been recorded with a very recent, nightly build of the RStudio IDE. I'm not sure if this is in the release yet; maybe you need a manual refresh in your version of the IDE or of Posit Workbench.

So we ran this with a few other queries. The query I showed was the fastest one in dplyr; all the variants with more groups, or that include a join with a dimension table, take about 20 seconds or more there. In duckplyr, all of this runs in less than two seconds. An order of magnitude, just like that. I really like this particular result.

Completeness and custom functions

So, completeness. One thing we do not support yet is custom R functions. If we had code like this, and this might take a second to digest: instead of computing the percentage inline, we would implement a compute_pct function that computes the percentage. This will not be handled by DuckDB, but it will still work. What's going to happen here is that duckplyr's mutate() sees compute_pct, doesn't know what that is, and sends it to dplyr, and dplyr will just do it. It will call compute_pct and query the value of the column, and by querying the column it will actually start the computation of the filter, which will still run in DuckDB. So this will work, a bit slower, but you don't have to change your code at all, I hope.
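A self-contained sketch of this fallback; compute_pct is a hypothetical helper name, since the talk does not show the exact code.

```r
library(dplyr)
library(duckplyr)

# Hypothetical helper: DuckDB has no translation for this R function,
# so duckplyr hands the mutate() step back to plain dplyr.
compute_pct <- function(x, total) 100 * x / total

out <- data.frame(tip = c(1, 2), fare = c(10, 40)) |>
  as_duckplyr_df() |>
  mutate(pct = compute_pct(tip, fare))

out$pct  # same answer, just computed by dplyr instead of DuckDB
```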

Now there is work to do. We would like to support more data frames; spatial data frames in particular might be useful. Hannes mentioned nested and packed columns; they actually do work in DuckDB, they just don't work in duckplyr yet. More R functions, more data types, ideally all R functions and all data types; we'll see if we can get there.

Getting started

We have a good idea why we're doing this, we have a good idea how to do this, now we only need to know what, and for this we will need your help. You can install this today; it is on CRAN, and the binaries have been built. All this will do is get you an updated version of dplyr and DuckDB in case you need it; it won't otherwise change your system.

And there are two basic ways to use this. First, as_duckplyr_df(): you write this as the first step in your pipes, and it will run the dplyr steps in duckplyr, but you have to do this with every pipe you have. If you're courageous, methods_overwrite() will send everything to duckplyr; this resets when you restart the R session, and there's also a function to turn it off.
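In code, the two ways look like this; as_duckplyr_df(), methods_overwrite(), and methods_restore() are the functions I understand the talk to refer to.

```r
library(dplyr)
library(duckplyr)

# Way 1: opt in per pipeline, one pipe at a time.
res <- mtcars |>
  as_duckplyr_df() |>
  summarise(mean_mpg = mean(mpg), .by = cyl)

# Way 2: route all dplyr verbs through duckplyr for this session.
methods_overwrite()
# ... ordinary dplyr code now runs via DuckDB where possible ...
methods_restore()  # turn it off again; restarting R also resets it
```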

This is the GitHub, I would love to hear if this works for your code, file an issue, drop me a line, we have a presence in the lobby and I'd be happy to take questions here. Thank you.

Q&A

We have five minutes until the next talk, so we have plenty of time for questions; please do submit them on Slido if you have them. I have a question: is the goal to one day support 100% compatibility with dplyr, including things like group_by()?

Including things like group_by(). I wouldn't place bets on group_by(), but for the other parts of the question, yes. If you find that something that works in dplyr does not work in duckplyr, that is a bug, because we really strive for near-full compatibility. I know this is elusive, because the slope gets very, very steep at the end, but I don't want to put a figure on it. We will look at everything that you send our way.

Do list columns in data frames work with duckplyr? Come back next month.

Is there an advantage using duckplyr alone versus starting with a tbl, converting that tbl to an arrow table and piping it to arrow to duckdb? You can do that. One thing I forgot to mention, indeed, that you do not have to start with a parquet file. Hannes said this too, you can start with a data frame, I just didn't have the data in memory just yet in my demo, which is why I got this straight from the parquet, and this also gave a nice speed up. You can start with data frames, you can start with arrow, you can start with csv, with parquet, with anything that duckdb can ingest, and we have a function that allows you to read those files.

What is on the roadmap for duckplyr that you are most excited about? Definitely nested and packed columns, and this is the low-hanging fruit, I think.

I think that's all the questions I have, so let's give a hand again to Kirill for his talk.