Resources

In-Process Analytical Data Management with DuckDB - posit::conf(2023)

Presented by Hannes Mühleisen

This talk introduces DuckDB, an in-process analytical data management system that is deeply integrated into the R ecosystem. DuckDB supports complex SQL queries and has no external dependencies. For example, DuckDB can run SQL queries directly on R data frames without any data transfer. DuckDB uses state-of-the-art query processing techniques like vectorised execution and automatic parallelism, and it is out-of-core capable, meaning that it can process datasets far bigger than main memory. DuckDB is free and open source software under the MIT license. In this talk, we describe the value DuckDB brings to users and how it can improve their day-to-day lives through automatic parallelisation, efficient operators, and out-of-core operations.

Materials:
- https://duckdb.org
- https://duckdb.org/docs/api/r.html
- https://github.com/duckdb/duckdb-r

Presented at Posit Conference, Sept 19-20 2023. Learn more at posit.co/conference.

Talk Track: Databases for data science with duckdb and dbt. Session Code: TALK-1099

image: thumbnail.jpg

Transcript#

This transcript was generated automatically and may contain errors.

Hello, good afternoon, I think. Thanks so much for coming. It's exciting for me as I've never been to posit.conf before, so it's a really new experience for me.

My name is indeed Hannes Mühleisen. Once my clicker works, then I will be happier. There we go. And I'm here to talk about In-Process Analytical Data Management with DuckDB.

Just a quick poll. Who here has heard of DuckDB? Oh, wow. And who has not heard of DuckDB? Okay. So I'm really happy about that because, you know, it's good to see that people have heard of us.

Why DuckDB exists

I want to start with a bit of motivation why DuckDB exists, and it actually is interesting how that is very deeply connected to the R community. So in the tradition of putting Hadley on a slide, I was working in a database architectures research group back in Amsterdam a couple of years ago, and we realized that data scientists and specifically our users really hated data management systems, and it hurt our feelings a little bit.

So here's a quote from Hadley: if your data fits in memory, there is no advantage to putting it into a database; it will only be slower and more frustrating.


And we thought very much about that, and we realized that he was right. Databases can be very frustrating. So here we have sort of the good and the bad bandits here, and they can be very bad. Setting up databases is very difficult. Even for somebody who has a PhD in data management systems, setting up Postgres is a daunting task sometimes.

And I have seriously, I've set up so many systems over the years because, you know, you need to run experiments and so on and so forth. And my favorite one, a little anecdote: you literally cannot install IBM DB2 and Oracle on the same machine. It's just not possible. But of course, it took me a week to realize this. Installation. Maintenance. Yeah. So then once you have installed this thing, you need to somehow make sure it runs. You have to deal with user accounts, all that stuff, you have to update, all not very pretty.

And something that really came out when we worked with data people is the data transfer. In fact, we wrote a paper about how slow transfer back and forth from databases is. You would not believe how slow it is until you start running experiments on it. Even something like Spark, you know, uses just ancient protocols; it's not a great user experience. So that's bad.

There are also good things in databases. For example, in the, I don't know, 40 or so years, no, actually at this point it's 50 years that relational databases have existed, we have spent a lot of time on optimizing queries so that the user doesn't have to become a database engine in their head and start reshuffling things in order to make them faster. It should be automatic.

Persistence, right? The original motivation of making data management systems was to get rid of these file zoos that people write like, you know, custom programs to operate on to make some changes to set files and instead have sort of a defined transactional persistence model with updates and consistency. That's quite useful.

And of course, most notably, especially in my community of analytical data management systems, we have spent a lot of time working on efficiency and parallelism. But the problem was that the frustrating bits were kind of hiding the good stuff. And the good way of looking at this was that people have been sort of ignoring the sandwich principle as I like to call it.

People have spent a lot of time optimizing like the patty, whether it's vegetarian or not is up to you. Optimizing that till the end of the world, like there are hundreds of papers on how to optimize a join, but we literally wrote the first paper on optimizing client protocols, right? Nobody had ever looked at the end-to-end user experience. And I think that's why people perceive these things as frustrating and would actually rather invent their own than touch that stuff.

Introducing DuckDB

So in comes DuckDB. So we wrote a new data management system from scratch, together with my former PhD student Mark Raasveldt, that basically tries to fix this end-to-end user experience. So it's in-process. I will explain to you in a bit what that means, just so that you can have a better user experience. It is still, despite focusing on the end-to-end, a state-of-the-art data execution engine that can, you know, really go through data in a very, very quick way. And the best part, it's free and open source. So you can just use it. It's MIT-licensed. I always say, you know, go build a company on it. I'm going to be happy about it.

First question, why is it called DuckDB? It's because I used to have a duck. Okay. It's true.

Design philosophy

So let me go a bit into sort of the design philosophy of DuckDB. The first is it should be simple, right? Like you shouldn't need a PhD in databases to set this thing up. And it also shouldn't be, you know, like hard to use. So one thing we noticed is that a database that a lot of people are quite comfortable with is SQLite. And why is SQLite so easy to use? Well, because it's in process, which means that the database system actually runs directly inside whichever process you link it into. So in R, you have Kirill's RSQLite package that you can just load, and then you can just run SQLite without any external setup. Isn't that great? So we copied that idea for DuckDB.

DuckDB runs fully in process, which also has some really great benefits for data transfer, one of the other frustrations that I will talk to you in a bit about. DuckDB doesn't have dependencies. So we just have a giant blob of C++ code that, well, yeah, that you can just install with install.packages, right? It doesn't have external dependencies besides DBI and R. In Python, it works, I said the P word, it works the same way. You can just install DuckDB as a package and it will just run.
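As a minimal sketch of what that setup looks like in practice (assuming the CRAN duckdb and DBI packages are installed; the query itself is just an illustration):

```r
# Hedged sketch: in-process DuckDB needs no server, only the R package.
# install.packages("duckdb")  # one-time install; pulls in DBI as a dependency
library(DBI)

con <- dbConnect(duckdb::duckdb())             # in-memory, in-process database
res <- dbGetQuery(con, "SELECT 42 AS answer")  # plain SQL; result is a data.frame
dbDisconnect(con, shutdown = TRUE)             # shut the embedded engine down
```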

But again, you shouldn't be fooled by it being simple into taking it as simplistic or sort of, you know, limited in terms of what the engine can do. So as I mentioned, we have transactions, we have persistence, we have extensive relational features like, you know, complex joins, aggregates, window functions. DuckDB can read and write Parquet files out of the box, CSV files, JSON files, all in parallel. It's really quite fully featured.
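For instance, file reading and writing happens straight from SQL. A small self-contained sketch (the file name is purely illustrative; the file is created on the spot):

```r
library(DBI)

con <- dbConnect(duckdb::duckdb())
# Write a small Parquet file from SQL, then query it back by path
dbExecute(con, "COPY (SELECT * FROM range(100) AS t(i)) TO 'demo.parquet' (FORMAT PARQUET)")
cnt <- dbGetQuery(con, "SELECT COUNT(*) AS n FROM 'demo.parquet'")
dbDisconnect(con, shutdown = TRUE)
```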

DuckDB is very fast, too, and, you know, I know everybody says this, but this is not just like, oh, we've run a benchmark and it was fast. But the people always ask me, why is it fast? What do you do? And then it's sometimes very difficult to explain. So I asked the team what I should say and I said, just say it's magic. Of course, any sufficiently advanced technology is indistinguishable from magic, as you know.

So what is the magic here? The magic is that it's the culmination of decades of research in analytical data management systems. Literally, tens of PhDs went into the basic concepts that we basically implemented at DuckDB. Some of it in our own group in the database research group in Amsterdam, some from other groups. But it really is this culmination.

Well, one of the things that makes DuckDB really fast is that it can automatically parallelize very complex SQL queries over all the available CPU cores. But, you know, Spark can also do this. The problem is, in order to get good utilization, it's not enough to just blindly parallelize. You also have to get good efficiency on a per-core level, and DuckDB is also really good at that, with highly efficient C++ code.

DuckDB can also use the disk if there's not enough RAM available to complete a query, so we can go out of core on all the operators. And this all happens completely automatically, transparently to the user. So you just type your query and off it goes. One thing that you might notice when you're running DuckDB is that your laptop can get a bit warm, right? That's a bit of a downside sometimes if it's hot outside, I don't know.
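Both behaviours can be steered explicitly if you want to; a sketch using DuckDB's standard settings (the specific limits chosen here are arbitrary):

```r
library(DBI)

con <- dbConnect(duckdb::duckdb())
dbExecute(con, "PRAGMA threads=4")            # cap automatic parallelism
dbExecute(con, "PRAGMA memory_limit='2GB'")   # spill to disk beyond this budget
cfg <- dbGetQuery(con, "SELECT current_setting('threads') AS threads")
dbDisconnect(con, shutdown = TRUE)
```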

DuckDB and R

A bit about R, since everybody here loves R; DuckDB loves R as well. In fact, the whole idea of creating a new database management system came out of interactions with the R community, and I want to mention Thomas Lumley and Anthony Damico, maybe some of you know them, who really pushed us, because we were working with them and kept asking, why do you hate us so much? And then, you know, we came up with solutions and they said, yeah, it is better, but still not good enough. And, you know, we went back to our little lab and continued and continued. And this was really productive. It went on for many years.

And in the end, it was really clear what needed to happen and also how to deeply integrate that into R. We also worked a lot with R core on, for example, the entire ALTREP framework, so we could do more clever things with DuckDB in connection with R.

So one thing that's really clever that came out of this is that DuckDB runs in process. In R, it runs in the same process as, like, the R interpreter itself, okay? That sounds a bit technical, I realize that. But one big advantage is that because we run in the same process, we can actually look at the data frames that exist in the R process itself. So if you want to just run a query on a data frame, we don't have to go through, like, some serialization, write it out to a file, read it again, I don't know. We can just look directly at the memory and say, hey, this looks like some bits from a data frame. Maybe we can just make that look like a table, and then you can run queries on it.

And similarly, the query results in DuckDB can directly become R data frames in the same process without having to go through IPC or anything like that, right? So we can basically say, you run a query, it has a million rows, results, which normally would be the end of the world in sort of traditional setup with client server. But with DuckDB, it's just like, whoop, okay, there you go. We even have some sort of tricks where sometimes we don't even have to transform the data, we can directly use the R structures.


Live demo

Again, saying the P word, of course, the same trick also works in Pandas. So if you have a Pandas data frame sitting there, that works as well.

But now I'm going to sort of do the unimaginable and try a live demo. Wish me luck. How am I on time? Eight minutes, that's perfect.

So for our people in the back, I'm going to switch to mirroring here because otherwise I cannot see what on earth is going on. Okay. Can you read this in the back? Okay.

So R. So I'm just going to copy paste some code because nobody wants to see me typing while I'm a bit nervous. What? No, not install packages. See, I already made a mistake. I don't want to install packages because the Wi-Fi is wonky.

So in order to start up DuckDB, you just basically load up DBI, okay? And then you can do a dbConnect, like the really traditional DBI thing. And for this demo, I have two connections. I have a con_pg and a con_dd. The PG one obviously talks to a Postgres server that I have running locally on my laptop, as a comparison point for some of the things I've been talking about.
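The setup from the demo would look roughly like this. The Postgres side is commented out since it needs a running server, and RPostgres as the driver is my assumption; the talk doesn't name one:

```r
library(DBI)

# Comparison point: a local Postgres server (hypothetical DSN, needs RPostgres)
# con_pg <- dbConnect(RPostgres::Postgres(), dbname = "demo")

# DuckDB: a persistent on-disk database file, still fully in-process
con_dd <- dbConnect(duckdb::duckdb(), dbdir = "demo.duckdb")
ok <- dbIsValid(con_dd)
dbDisconnect(con_dd, shutdown = TRUE)
```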

And let's generate some data. So I'm going to generate some data here. So this is just data. Oh, yeah, it's a wonderful name, I know. This is just a three-column double data set, nothing really too dramatic; DuckDB works with arbitrary column types and data frames, this is just for the demo. And this data frame has 10 million rows, like not crazy. But of course, that's carefully chosen so some of my comparisons will actually finish in the time I have left in this talk.
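The generated data would look something like this (the column names are my guess; the talk only says three double columns and 10 million rows):

```r
set.seed(42)   # reproducibility; not part of the original demo
n <- 1e7       # 10 million rows, as in the talk
data <- data.frame(a = rnorm(n), b = rnorm(n), c = rnorm(n))
```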

So now, I'm going to just import that into Postgres. And I might maybe take some questions in the meantime. So the client server protocol really kills you there because all the data, well, all the 10 million rows have to now go through a socket even though it's on the same machine. And yeah, okay, here we go, 10 seconds. So that is something that will just take a long time to go to the external database server.

And if I do the same thing in DuckDB, I would not be able to take questions, because as it is in process, it will just go directly to the engine running in R. One thing I should mention here: this is not an in-memory kind of game. DuckDB is running in an on-disk mode, which means that we have actually written this data to the disk, flushed it, called fsync, the whole thing. The transactionality, everything happened, the same sort of thing that you would expect from a database server.
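The import step is the same plain DBI call in both cases; only the connection differs. A scaled-down sketch (table name is illustrative, and the data is much smaller here than in the demo):

```r
library(DBI)

con_dd <- dbConnect(duckdb::duckdb(), dbdir = "demo.duckdb")
data <- data.frame(a = rnorm(1e5), b = rnorm(1e5), c = rnorm(1e5))
t_dd <- system.time(dbWriteTable(con_dd, "tbl", data, overwrite = TRUE))
# t_pg <- system.time(dbWriteTable(con_pg, "tbl", data))  # same call; ~10 s in the talk
dbDisconnect(con_dd, shutdown = TRUE)
```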

Okay. Now, we can try the same thing in reverse. We'll just call dbReadTable on the thing we've just written. Oh, yeah, five seconds, okay. So that's now the inverse; we have to go through the same sort of socket problem again. So let's do the same in DuckDB again. And now this is where it really shines, because we have read this entire dataset from the table data that was stored on disk back into an R data frame in like 0.1 seconds. I'm happy about that.
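The reverse direction is symmetric; sketched here with a small table so the example is self-contained:

```r
library(DBI)

con <- dbConnect(duckdb::duckdb())
dbWriteTable(con, "tbl", data.frame(x = 1:1000))
df <- dbReadTable(con, "tbl")   # table contents come back as an R data.frame
dbDisconnect(con, shutdown = TRUE)
```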

And why this is kind of funny is I have this really basic model fitting here, and please don't laugh at my, you know, linear model code, because I have no idea what I'm doing. But transferring the data out of Postgres takes five seconds, and fitting, thank you, a linear model to it takes 1.6 seconds. You would say that one is a bit more difficult than the other, right? One is just like copy; the other is like, okay, we have to do the dumb thing with the residuals and so on. I think 0.1 seconds, though, is the right sort of relationship to that number.
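The comparison, sketched; the model formula is a stand-in of mine, since the talk doesn't show the actual code, and the data is scaled down:

```r
set.seed(1)
df <- data.frame(a = rnorm(1e6), b = rnorm(1e6), c = rnorm(1e6))
# Timing the actual statistical work, to compare against the transfer time
t_fit <- system.time(fit <- lm(a ~ b + c, data = df))
coef(fit)  # intercept plus two slopes
```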

But I mentioned that we can directly look at data frames that are in the R memory space. So here, let me show you something that is not in DBI. Maybe I should, you know, arm-wrestle Kirill to add it. In DuckDB, we can say duckdb_register and give it a connection and a table name, in this case it's the wonderful mtcars, and then give it a data frame and say, look, I want you to treat this data frame as a table. But no actual copying is going on, right? We're not doing the import, we're not doing anything.

We just say, look, DuckDB, here's a pointer to a data frame, please treat it as a table. And then, once we have done that, we can just run a query. So I can say, hey, select star from mtcars, blah, blah, blah, and it will just, you know, lo and behold, run this query for you. Or I can also do some slightly more complicated things.
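Put together, the register-and-query pattern looks like this (the aggregate query is just an example of mine):

```r
library(DBI)

con <- dbConnect(duckdb::duckdb())
duckdb::duckdb_register(con, "mtcars", mtcars)  # zero-copy: a view onto the data frame
res <- dbGetQuery(con, "SELECT cyl, COUNT(*) AS n FROM mtcars GROUP BY cyl ORDER BY cyl")
dbDisconnect(con, shutdown = TRUE)
```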

So here I'm running a distinct on this particular column in mtcars, and I'm doing a filter here with a where clause on a column, and I'm asking DuckDB to show me what the query plan is going to be. Remember, I told you about the whole, you know, automatic end-to-end optimization, so we have these highly efficient operators that we string together, like a group by, a filter. And down here, we have the data frame scan, which is the thing that basically just looks at the data frame and treats it as a table.
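Asking for the plan is just EXPLAIN in front of the query; a sketch (the exact column filtered on in the demo isn't visible, so the where clause on am is a stand-in):

```r
library(DBI)

con <- dbConnect(duckdb::duckdb())
duckdb::duckdb_register(con, "mtcars", mtcars)
# EXPLAIN returns the physical plan as text instead of running the query;
# a data-frame scan operator should appear at the bottom of the plan
plan <- dbGetQuery(con, "EXPLAIN SELECT DISTINCT cyl FROM mtcars WHERE am = 1")
dbDisconnect(con, shutdown = TRUE)
```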

And this also works with some more complicated examples. So here I'm taking some taxi data. This table is a bit bigger, it's 2.5 gigabytes of Parquet, and it has a lot of rows. Let me actually, you know, why don't I just find out how many rows there are here. Okay, so this is 84 million rows, it's a bit more sizable.

And now I can basically take a fairly heavy query here that computes the day and hour and the tip amount as a median, grouped by day and hour, and this is a fairly complex query. It will be one of these queries that will actually make your, you know, your computer warmer. And even that quite heavy query, hang on, completes in a very, very short time. Okay, so now I'm going back to my slides. That concludes the demo. It went well, I think.
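The shape of that query, sketched against a tiny synthetic stand-in. The NYC-taxi column names tpep_pickup_datetime and tip_amount are assumptions about the demo file, and the synthetic table replaces the 2.5 GB of Parquet:

```r
library(DBI)

con <- dbConnect(duckdb::duckdb())
# Tiny stand-in for the taxi data: 1000 consecutive hours of fake trips
dbExecute(con, "CREATE TABLE trips AS
  SELECT TIMESTAMP '2023-01-01 00:00:00' + INTERVAL (i) HOUR AS tpep_pickup_datetime,
         (i % 10) * 1.0 AS tip_amount
  FROM range(1000) AS t(i)")
# Median tip grouped by day-of-week and hour, as in the demo query
res <- dbGetQuery(con, "
  SELECT date_part('dow',  tpep_pickup_datetime) AS day,
         date_part('hour', tpep_pickup_datetime) AS hour,
         MEDIAN(tip_amount) AS median_tip
  FROM trips
  GROUP BY day, hour
  ORDER BY day, hour")
dbDisconnect(con, shutdown = TRUE)
```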

Conclusion

To conclude: DuckDB, I was talking about DuckDB, DuckDB is an in-process analytical data management system. I tried to explain what that means. It's simple, it's feature-rich, it's fast and free, and you don't have to choose, you get all four. And it is really deeply integrated with R. In fact, I'm quite happy with how, you know, R core has been responsive to our requests for changing things in R itself so we can make database integration with R quicker. I don't think that happens with Oracle, let's just say. And now I'm really happy to take some questions.

So I have a quick question before the next speaker. You showed an example of DuckDB processing a data frame very, very quickly. Can other structures such as nested list data be used in the same way? That's a great question. So we are actually working on nested structures in data frames. We also support scanning Arrow objects in R. But for raw lists, I think you have to add the data.frame class to them before you can scan them.