
DuckDB and the future of databases | Hannes Mühleisen | Data Science Hangout
To join future data science hangouts, add it to your calendar here: https://pos.it/dsh - All are welcome! We'd love to see you!

We were recently joined by Hannes Mühleisen, Co-founder and CEO at DuckDB Labs, to chat about the evolution of database systems, the unique architecture of DuckDB, and how to find opportunities in areas others might consider "boring." He also showed us a picture of the real-life duck, Wilbur, that inspired the name DuckDB! 🦆

In this Hangout, we explore how DuckDB is changing the traditional concept of a database by making it accessible to a wide range of users and use cases. We also discuss its origins, how it was inspired by the challenges of data analysis, and the importance of focusing on real user problems when building tools.

What is a database and where does it live? DuckDB is challenging database assumptions by asking: maybe your expensive laptop can hold a database instead of just pinging one? Maybe it can do more than just run web browsers! Hannes wants you to know that "database" doesn't have to be a bad word, and that you can actually have a great experience with one. If the data community's affinity and enthusiasm for DuckDB is any indication, we think he's right.

We hope you listen in to hear Hannes talk about: a simple explanation of what DuckDB is, DuckDB's in-process architecture and why it's so cool, different DuckDB use cases (should you be using DuckDB?), DuckDB's compatibility with cloud data systems, and more.
Resources mentioned in the video and zoom chat:
DuckDB Website → https://duckdb.org/
DuckDB Labs Website → https://duckdblabs.com/
Posit Conf talk by Hannes → https://www.youtube.com/watch?v=GELhdezYmP0
Sarah Alman's webinar on using DuckDB and Shiny → https://www.youtube.com/watch?v=6AGroJb4zPM
Forbes Article on Imposter Syndrome → https://www.forbes.com/sites/marycrossan/2024/12/18/how-to-combat-imposter-syndrome-by-developing-character/
MotherDuck → https://motherduck.com/
MotherDuck Blog Post "Big Data is Dead" → https://motherduck.com/blog/big-data-is-dead/
Posit Blog post on Shiny with Databases → https://posit.co/blog/shiny-with-databases/
DSH episode with Marco Gorelli → https://www.youtube.com/watch?v=lhAc51QtTHk
Emil Hvitfeldt's talk at posit::conf about orbital → https://youtu.be/Qnm1y0KPxVM
Marcos Huerta's Blog post on DuckDB and BERT → https://marcoshuerta.com/posts/duckdb-and-bert/

If you didn't join live, one great discussion you missed from the zoom chat was about whether industry teams who use and benefit from open source tools could devote 5-10% of their time to giving back to the open source projects with code contributions or other support. Let us know below if you'd like to hear more about this topic!

► Subscribe to Our Channel Here: https://bit.ly/2TzgcOu

Follow Us Here:
Website: https://www.posit.co
Hangout: https://pos.it/dsh
LinkedIn: https://www.linkedin.com/company/posit-software
Bluesky: https://bsky.app/profile/posit.co

Thanks for hanging out with us!
Transcript
This transcript was generated automatically and may contain errors.
Welcome back to the data science hangout, everybody, and welcome to the last data science hangout of 2024. If we haven't had a chance to meet before, I'm Rachel; I lead customer marketing at Posit, and I'm based in the Boston area, so if you ever want to grab coffee or get together, I'm in Boston.
Posit builds enterprise solutions and open source tools for people who do data science with both R and Python. We are also the company formerly known as RStudio, just in case anybody needs to hear that. I'm joined by my lovely co-host today, Libby. Libby, do you want to introduce yourself?
Hello everybody, I'm Libby. I'm a community manager who works with Posit to help our beautiful hangout community thrive, and I'm also a Posit Academy mentor. So speaking of R and Python, I mentor both for Posit Academy, helping working professionals learn them and use them in their data work. We're so happy to have you joining us today.
If this is your first hangout, let me tell you a little bit about it. The hangout is our open space to hear what's going on in the world of data across all different industries, to chat with others about data science leadership, and to connect with other data community members who are facing similar things as you. We get together here every Thursday at the same time, same place (of course not if it's a holiday week, so we will be on a little holiday break and coming back January 9th). But if you are watching this as a recording on YouTube and you want to join us live in the future, there will be details to add it to your own calendar below.
Thank you so much to those who have helped make this the friendly and welcoming space that it is; we're all dedicated to keeping it that way. So if you ever have feedback for me about your experience, anything you want to share anonymously, or suggestions for topics, I'll share a Google form in the chat with you right now. Feel free to use that, and you can always reach out to me directly on LinkedIn too.
And here at the hangout we love hearing from you. It doesn't matter what your years of experience are, what your title is, what industry you're in, or what languages you use or don't use: you are welcome here, and this space is for you. I encourage you to connect with each other in the chat, so introduce yourself, say hi, say where you are and what you do. Feel free to share your LinkedIn profile and also a link to your website so that people can find you once this Zoom chat goes away.
Okay, so this is a community-driven discussion that we're going to be having with Hannes, and that means that you ask the questions. There are three ways to jump in: you can raise your hand on Zoom and we'll call on you to ask your question; you can put a question in the Zoom chat, and if you can't ask it yourself (maybe your mic isn't working or you're somewhere loud), put an asterisk at the beginning or the end and we will ask it for you; and we also have a Slido link where you can ask questions anonymously.
Introducing Hannes and DuckDB
Thank you so much for being here. Libby and I are so excited to be joined by our other co-host today, Hannes Mühleisen, co-founder and CEO of DuckDB Labs. Hannes, I would love to have you introduce yourself first and share a little bit of what you do at work but also something you like to do for fun outside of work.
All right, well, thanks for having me. I'm really happy to see so many people; on the Thursday before Christmas I was a bit concerned. My name is Hannes. Yes, I work at DuckDB Labs, but I'm also an academic: I'm a professor at a university, where I teach data engineering. I'm also active in the open source space, where we make a database system called DuckDB. It's an open source project, and it takes most of my time, so at some point we spun out of the research world into a company, DuckDB Labs, where I'm now sitting.
I'm still at the office, here in Amsterdam, where we're in the longest stretch without sun in 30 years. So yeah, I'm leading the company and I'm still hacking around on DuckDB. Today was a glorious day because I got to just sit here and hack around; it's kind of quieting down for the holidays, and it's really amazing.
But anyways, what do I do outside work? I like boats, so I live on a houseboat with my family in Amsterdam. It can actually drive around, so sometimes I do that, and when we're not driving around there are usually some maintenance issues that have to be addressed, which I also really love. I'm a practical person; I like fixing things, and I also like making software. Somehow they're connected, making software and fixing things, I don't know why.
What is DuckDB?
My second question is going to be: can you explain DuckDB to me like I'm five? For anybody here who hasn't used DuckDB, who doesn't know what it's about or why anyone would ever use it.
Right, so it's kind of funny, because DuckDB is a data management system, okay? It speaks SQL, it can, you know, do transactions, it can store data in tables, these kinds of things. Quite traditional in that sense, but it's quite new in how it's deployed: it can run anywhere. It can run in your Python shell, it can run in your R shell, it can run in the browser, it can run on your phone. It takes a technology that used to be sort of locked up in one place; we used to have data warehouse servers, and they were really in one place.
And DuckDB really opens that up in the deployment sense: you can put it anywhere. Now, it's interesting because it was really built with Anthony Damico; maybe some people know him. He made asdfree.com, which is just about analyzing survey data with R. And at some point we started talking about the American Community Survey, which was creating issues because there were too many Americans or something. There were, you know, too many rows, what can I say.
And so he started exploring database technology to make that work. And from these discussions over many years (it's been almost 10 years now since we had the initial discussions), we started designing a database system from scratch that can do that. So yeah, DuckDB is open source. It's the fastest growing database system, I think, ever; there are some metrics that we're tracking, and it's kind of interesting to see that. It's free, anybody can use it, and it's used all over the place.
The community is huge and the company is small. And we have a soft spot for data science, because the initial discussions that eventually led to DuckDB came out of interactions with, you know, the R community specifically. So that's why we've been working with people; we've been working with Posit to make that work, but also just generally with the R community. We've been spending quite a lot of time there because it's really great as inspiration to see what people's data problems are, right?
Like, we are database systems people. We're like, okay, what are the actual problems? And you find interesting problems like "I need to import CSV files" or something like that. These are things that can be very unwieldy, and we've spent a lot of time on them because it's such a relevant thing. So yeah, DuckDB, it's a piece of software.
People generally say, right, Excel stops working at, you know, a hundred thousand rows. R maybe stops working at, I don't know, 10 million or something like that; that's base R, dplyr, that kind of stuff. And so DuckDB is kind of for this mid range, where you can go up to, let's say, a couple of billion rows or something like that. That's really where people used to go for these big data solutions in the past.
And now you have an engine that can get you much further on your local laptop, right? Because your laptop is already paid for; it's sitting right there, and it can do surprisingly a lot of computation. But indeed, if your biggest file is 15 megabytes, then there is probably no need for it.
Finding opportunities in "boring" problems
So Hannes, one of the takeaways for me from Hangouts this year, and I was thinking about our conversation a few weeks back, is around finding the things that people might not think are cool and focusing on them. I was thinking back to when Marco Gorelli was also talking about getting excited about things that people might think are boring, and I was wondering if you could talk to us a little bit about that: working on things that are boring.

Yeah, I don't know. I don't think there are boring things.
So I think in general there's really no shortage of ideas on what to work on. Personally, I really like looking at Stack Overflow and Reddit and seeing what kind of problems people have, right? And you really have to squint, because one specific person's issue is not really interesting; there are so many of them. But if you see the same exact class of problems coming up, you know, 10,000 times, then maybe there is something there.
So one of the things that we did was teach these courses with Spark. Maybe some people here have worked with Spark. So I was teaching a course with Spark, and it wasn't a great success, because people generally didn't manage to work with Spark. We were like, okay, here's how you Spark, here's how you operate this thing; obviously we had some concept of how to do it. But then nobody managed. And these were not, you know, toddlers; these were people who had been around data for ages, and they just couldn't manage.
And after a couple of years of giving this course, we started thinking, yeah, maybe there's something there. Maybe it doesn't have to be so difficult to, I don't know, write a Parquet file, right? And I really encourage all my colleagues in research and in the open source space: people tend to want to set their own agenda and then just basically execute it, but I feel like, for me, I really want to address an issue.
And for that, there's so much information, so many resources. You can just download the Reddit dump or whatever it's called, start analyzing it, and you will see what the things are that people struggle with. "I can't get this to work": just search for that, and then you have some idea. And that's, I think, also part of the secret of the tidyverse: it really addresses a need.
Why is it called DuckDB? Meet Wilbur
So why duck? Why a duck? Yeah. So there used to be a duck called Wilbur, and he was my pet duck. He lived with me on the boat, and he was a very cute, very cuddly duck. Who could have imagined that ducks can be cuddly? But yeah, he eventually flew away, and we like to think he started a ducky family somewhere. In his memory, DuckDB is called DuckDB.
So here, this is little Wilbur. You can see him there; this is me on the other side, obviously. And he's just standing there hanging out, you know, being happy, being very loud. Ducks are loud, man. Ducks are loud.
Corporate use cases and DuckDB in the wild
My question is: I've used DuckDB for some of my personal projects. It's great at, like, CSV to Parquet if the files are big. I've used the vector data support, which was really helpful, combined with some other simple small language model stuff. But at work, my data's all up in the cloud; my data's all in Snowflake. So do you see DuckDB as mainly a tool for researchers working on their laptops, kind of in academic environments, or do you see a corporate use case where our data's mostly in the cloud?
Yeah, that's a great question. So we do see actually a ton of corporate use; this is the fastest growing area of DuckDB use. There is a lot of the "I'm here on my laptop and I just have to deal with this data" sort of use, and that's also where DuckDB came from, as I've explained. But there is a ton of use in corporate settings, and it's less as a complete system; people use it a lot as a building block.
For example, if you look at the modern BI tools out there, almost all of them use DuckDB under the hood. If you look at the big data transformation companies like Fivetran, they are also using DuckDB under the hood. So you might not actually realize you're using it. It's not like, I don't know, Snowflake, where you know you're talking to Snowflake because there's a little blue logo up in the corner.
What we see a lot with DuckDB is that people put it in somewhere, and you can kind of figure out, if you really want to, that it's running DuckDB under the hood, because the SQL dialect will have some things that give it away. But very often it's also just pre-baked queries, pre-baked schemas: just run this thing and people are happy. And that's really fine with us.
It's open source software; people can do whatever they want with it. But that's, I think, where we see a lot of growth, in these sorts of use cases. And for people that do want an end-to-end solution, there is MotherDuck, right? That's a company that is building a DuckDB SaaS, which has a look and feel, I would say, similar to Snowflake or BigQuery or whatever else there is.
And some use cases are kind of wild. You know, people put it in cars, and then I'm like, are you sure? But yeah, I think there is a lot of use in big tech; it's just sometimes not visible.
DuckDB with Shiny and R integration
I believe I read somewhere that DuckDB can be used with Shiny. Generally, how does it integrate with R, Python and data visualization tools?
So the integration with R is really interesting, and it's not like anything you've ever seen, I think, in terms of data management systems, maybe except SQLite, because of this in-process architecture where the whole thing runs inside whichever process invokes it. So with R, you just have an R package: install.packages("duckdb"). And that is actually not a client or a driver, like maybe for other systems; it actually contains the whole thing. When you start it, it runs inside the R process that you're sitting in front of.
And yeah, that has some very nice benefits. For example, you can read stuff from memory: if you have a data frame in memory, we can read that data frame. When you have a query result, you can directly turn it into a data frame or Arrow structures. The same thing happens in Python, where you can just pip install duckdb, load it up inside your Python environment, and look at Pandas data frames or Polars structures or Arrow structures, because it's just right there in the process.
And it doesn't have all this client-server stuff. You don't have to set a password, because there's no point in having a password: it's right there in your process. It's yours, right? You can do whatever you want with it.
Rethinking what a database is
So I've been reading more about DuckDB and using it in the creative-architecture bucket of use cases, doing serverless apps and things. And I realized that there are a lot of things I think I know that I don't really know, like: what is a database? In the past I've thought of a database as something that has to be in the cloud, because lots of things are happening to it in different places, and therefore I have to query that instead of just having my data files locally. But files can also be in the cloud, and DuckDB can query files. So I need to update my mental models of a lot of things. Can you weigh in on how data scientists should think about what a database is, and is DuckDB actually changing the concept of a database?
I think it's definitely changing this conception people have of what a database is and, as you alluded to, where it lives. Okay, so what is a database? Especially in stats, there is a lot of confusion with the word, because there is the database, which is just a collection of tables, and then there's the database management system, which is of course something like DuckDB: the system that you use. It's confusing terminology, but in principle a database is just a collection of tables, right?
And then you have the database management system, which provides management of these tables: storing them, changing them, and most prominently querying them. And in the past, these things were, you know, in the cloud, or data warehouses, as they're also termed: a Teradata somewhere, or maybe a Greenplum, or, I don't know, some Oracle server somewhere. And that was your database.
And it's true that in the past these things were kind of locked into specific pieces of hardware, often because they just wouldn't run on anything else. But what we are seeing, and I think that's kind of the point of DuckDB, is that you can pull a lot of this stuff into places that you haven't considered yet, like your laptop. I find it always funny when people sit in front of a $3,000 MacBook that runs a web browser and nothing else, right? And then they use that to spend more money on Snowflake in order to query a bunch of rows, when they have an actually pretty powerful machine sitting right there.
And what you said, I really liked: the creative architecture. It does really change people's perception, I think, and people's perceptions maybe also should change in terms of what "database" actually means. Part of our mission is to reduce the negative connotation of the word, because I think data scientists in particular have been, how should I say, a bit fearful of or hostile to database tech in general, because it worked differently from the other stuff. That's why we're trying really hard to make it easy to use, friendly, these kinds of things.
Small data and relational integrity
So you mentioned that for files of maybe 50 megabytes or less, DuckDB maybe isn't the right solution. But is it less that it's the wrong solution and more that your existing stack also works fine?

I mean, there's no lower limit, if that's what you want to know.

Yeah, because I was kind of curious. For example, I work in clinical trials, and most of our data is actually small, but we do often have a lot of relational tables. Is this, I assume, a good use case for DuckDB?
Yeah, for sure. I think what's really interesting, and what people sometimes forget, is that traditionally there's more to database systems than just a store of tables. What's interesting for small, maybe smallish data is all this integrity sort of dance, right? Where you say, okay, I'll actually define some relationships here: foreign keys, primary keys, constraints, these kinds of things. That will actually allow me to keep my data consistent, and with a really good transactional system like DuckDB, it will also make sure that any changes you make to the data will not break any of these constraints.
And that's actually something that I think data frame library users in general have just no concept of, because it's never been something they had, but it's pretty standard in relational technology. So if you say, I don't have huge tables, but they are connected in a complex way, and I need to do all these funky recombinations, then there's definitely a great use case, because you are able to go from one consistent state to another consistent state.
And, let's say, your value per byte is extremely high, right? It's not like you collected some log files somewhere that are probably garbage, but maybe something interesting is in there. It's more like: no, no, no, every row was a whole patient trajectory, and that cost, you know, a hundred thousand dollars. So it's very high value data, and maybe we want to treat high value data better than sticking it in a CSV file. Maybe we do want to have a schema. Maybe we do want to make sure that we have the right types. Maybe you want to have the constraints, and maybe you want to have the integrity relationships on the tables.
That's also something that was in my talk at Posit Conf this year, where I tried to make this point: there's more to table technology than just running queries. There are also all these other guarantees that you kind of get for free that could be interesting for data practitioners.
DuckDB culture and open source governance
So the culture here at the company is a bit like: we took what we knew from a research group and tried to improve it where we thought it was going wrong. It's a very progress-oriented team here; we're all very good at database engineering, and we just enjoy each other's company. And generally we have a rule that we only hire friendly people, and that has worked really well so far.
And the project itself, DuckDB, is a bit special because we don't have open governance. It is open source, MIT licensed; you can do whatever you want with it. But there is no democracy (that sounds a bit terrible, I have to admit) in terms of the direction of the project. And that's intentional, because we've seen projects, specifically Apache projects, go down the drain because they couldn't agree on a vision. In the end they just did whatever everyone wanted, and they became unmaintainable messes.
Because we have seen this so many times, we've just decided that we're going to keep control over what goes in. Mark and I, the two original creators of DuckDB, do that, just to keep the vision. We do have rules on behavior; not strict, I would say, but we are very conscious of keeping a productive and friendly tone in places like GitHub issues and comments, and we are trying to really be friendly to everyone. I think over the years we've banned two people from our GitHub repo, for really unacceptable behavior, but we are really trying to have positive interactions on our Discord too and all the other places we interact with the community.
And I think it's worked out really well. The tone is incredibly friendly; people want to help each other. So yeah, it's also just leading by example. There are other open source projects, you know, like Linux, where the head guy in charge is famously grumpy all the time, and that will trickle down through the community. We're not doing that; we try to be consistently friendly, and I think it shows in the community. It's also something that I think Hadley is doing extremely well: just this incredible civility that he brought to the previously grumpy R community.
Machine learning and DuckDB's focus
Now, thinking about things like orbital, and how that integrates nicely with tidymodels and produces queries which you could pass into SQL for machine learning: is there any plan for some support for machine learning as we go into the future with DuckDB?
Um, yeah, I think we're quite happy where we're at in terms of API, or our general focus on relational transformation of tables. I mean, we are sometimes expanding; we recently added the PIVOT statement, for example, which doesn't exist in the SQL standard. So we are expanding there, but we try to keep it focused on, let's say, table-y operations. There are some operations to do simple regressions and stuff like that in DuckDB, but it's really not our focus.
There are other toolkits that are great at machine learning tasks, and we try to integrate. That's one of the strengths that I think DuckDB has by being in process: we can, for example, export to, I don't know, scikit-learn fairly efficiently, and then it can do its thing and we can pull back the results and do something else. It's not like traditional databases, for example Oracle, that have to push everything into the database system itself to be efficient. We can just say: you want to pull this out and run something on it? Great, just do it. It's not going to take forever.
So yeah, we try to focus. We're a small team of 20 people; we need to focus and only do the things we can actually do, and tables plus SQL is already, let's say, a challenging target.
Roadmap: connectors and interoperability
Is there anything DuckDB can do today with ADBC, the ODBC competitor?

Yeah, so we do have ADBC support in DuckDB. For those who don't know, this is something that came out of Voltron Data and the Arrow project, where the idea was to update the standard database protocols a bit. We do support ADBC, and we also generally support Arrow ingestion: if your database supports Arrow, then you can ingest that into DuckDB quite efficiently.
We do have specialized connectors for things like SQLite, PostgreSQL, and MySQL. And what we do have on the roadmap is a generic ODBC connector, so you can also just pull stuff out of your SQL Server or Oracle or DB2 or, you know, these ancient things, into DuckDB. We're also seeing a lot of interesting developments with things like Iceberg and Delta Lake, where the big players are starting to make things accessible through these interfaces. That already works okay with DuckDB, but we're also working to improve it further.
So we do see a lot of use cases where, basically, an analyst will just pull some stuff from the production database into a local DuckDB and then just run with it and do analysis on it. We see that a lot, and we want to support it more.
Security considerations
I think this is an excellent question. I mean, we do get some improvement in the general situation from not having client-server, because there's no server to run. There's no port that can be accidentally left open or something like that, right? This whole traditional security problem of "you can attack this protocol over the network with some crafted packets or something": we don't have that.
But there are things: we do have a page in our documentation about securing DuckDB, because there are aspects that can be surprising. For example, if you're running it somewhere and you just open up the SQL prompt to the world, expecting nothing bad will happen, well, bad things can happen, because the SQL prompt is extremely powerful. It would be like opening up an R shell to the world. You can do it, but you have to be a bit careful, right?
The biggest security concern we have is around distribution of binaries. We build the binaries in our CI and distribute them, and CRAN builds binaries too, so we have to be very careful that nobody manages to implant malicious code into the stuff that we distribute. That is the thing I'm most concerned about, and it comes down to nailing down your CI, having signatures on things, and having hashes checked regularly. There's a lot of work in the background that isn't visible in the product, but we are definitely aware of it.
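The hash checking he mentions boils down to comparing a digest of the downloaded artifact against the one published with the release. A minimal, stdlib-only sketch (the file and its "published" hash here are simulated; a real check compares against the checksum from the release page):

```python
import hashlib

def sha256_of(path: str) -> str:
    """Hex SHA-256 of a file, read in chunks so large binaries fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

# Simulate a downloaded release artifact (made-up content).
payload = b"pretend this is a release binary"
with open("artifact.bin", "wb") as f:
    f.write(payload)

# In reality this string comes from the checksums file published
# alongside the release, fetched over a separate trusted channel.
published = hashlib.sha256(payload).hexdigest()

assert sha256_of("artifact.bin") == published, "checksum mismatch -- do not install"
```

Signatures go a step further than plain hashes, since they also prove who produced the artifact, not just that it arrived intact.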
Getting acceptance and overcoming skepticism
When you first came up with DuckDB, did you struggle to get acceptance from the data industry? And if you did, how did you get over it?
So we were initially ridiculed for not having a distributed system. It was our strong belief, and a hill we were willing to die on, that you don't actually need distributed systems to query significant amounts of data. And it didn't really worry me so much, because I thought, fine, if it takes ten years for the world to acknowledge this, fine. It actually didn't take all that long.
I was ready to be ridiculed for much longer than we actually were. And I think the crazy growth we see with DuckDB, and blog posts like "Big Data is Dead", have really changed the data industry, in my opinion. So we are no longer ridiculed; I'm happy to report we're even working with some of the really big data companies.
And of course we are from Europe, and the data industry tends to ignore things that are not made in the Bay Area. That was also something I was a bit concerned about in the beginning: we can be as good as we want, but is it actually going to arrive in the US? That has also worked out, so it's another thing I was maybe a little bit concerned about that turned out fine.
Data heroes and closing thoughts
Earlier this week, we had a customer meeting with Wes and Hadley, and somebody asked them both who their data heroes are. Wes McKinney pointed to you. So I wanted to ask you that question too: who are your data heroes?
Well, okay. I think Hadley is definitely one of my heroes. One of the reasons I was extremely nervous about talking at Posit Conf this year was that, you know, I have talked in front of a hundred thousand people, but somehow talking in front of Hadley made me extremely nervous. That was interesting to see about myself. And I think it has to do with, you know, people who have managed single-handedly to turn an industry around.
Who are my heroes? Yeah, I have to name Hadley, and of course Wes, but I cannot say him, that would be circular. In my field, data management systems, if you can show that people use your thing, then you have kind of won the argument, and I think both Hadley and Wes have shown this in a great way. And for me, someone like Mike Stonebraker, the Turing Award winner and one of the people behind Postgres, is somebody I look up to, just because of the immense amount of impact he's had.
First of all, I'm sorry I couldn't answer all the questions. Feel free to email me, it's hannes@duckdb.org, it's very easy. If you have any direct questions for me, I can't promise I will answer immediately. As for career advice, I think going for impact is what matters; that's what I mentioned earlier about looking at Reddit. If you look at what will have the biggest impact for the most people, that's something I think is worthwhile.
Thank you so much. Well, Hannes, I know we are the last call keeping you from your holiday vacation right now, but thank you so much for joining us to close out a year of amazing Data Science Hangouts. We really appreciate you taking the time, and thank you to all of you who are here, make this space what it is, and join us week after week. We will be on a bit of a holiday break until January 9th, when we'll be joined by Michael Chow and Rich Iannone at Posit, who work on the gt and Great Tables packages. I wish you all a wonderful holiday season with your friends and family, and a happy new year. Thank you all. Thanks everyone. Bye everybody. Bye.

