
DuckDB and the future of databases | Hannes Mühleisen | Data Science Hangout
To join future data science hangouts, add it to your calendar here: https://pos.it/dsh - All are welcome! We'd love to see you!

We were recently joined by Hannes Mühleisen, Co-founder and CEO at DuckDB Labs, to chat about the evolution of database systems, the unique architecture of DuckDB, and how to find opportunities in areas others might consider "boring." He also showed us a picture of the real-life duck, Wilbur, that inspired the name DuckDB! 🦆

In this Hangout, we explore how DuckDB is changing the traditional concept of a database by making it accessible to a wide range of users and use cases. We also discuss its origins, how it was inspired by the challenges of data analysis, and the importance of focusing on real user problems when building tools.

What is a database and where does it live? DuckDB is challenging database assumptions by asking: maybe your expensive laptop can hold a database instead of just pinging one? Maybe it can do more than just run web browsers! Hannes wants you to know that "database" doesn't have to be a bad word, and that you can actually have a great experience with one. If the data community's affinity and enthusiasm for DuckDB is any indication, we think he's right.

We hope you listen in to hear Hannes talk about: a simple explanation of what DuckDB is, DuckDB's in-process architecture and why it's so cool, different DuckDB use cases (should you be using DuckDB?), DuckDB's compatibility with cloud data systems, and more.
Resources mentioned in the video and zoom chat:
DuckDB Website → https://duckdb.org/
DuckDB Labs Website → https://duckdblabs.com/
Posit Conf talk by Hannes → https://www.youtube.com/watch?v=GELhdezYmP0
Sarah Alman's webinar on using DuckDB and Shiny → https://www.youtube.com/watch?v=6AGroJb4zPM
Forbes Article on Imposter Syndrome → https://www.forbes.com/sites/marycrossan/2024/12/18/how-to-combat-imposter-syndrome-by-developing-character/
MotherDuck → https://motherduck.com/
MotherDuck Blog Post "Big Data is Dead" → https://motherduck.com/blog/big-data-is-dead/
Posit Blog post on Shiny with Databases → https://posit.co/blog/shiny-with-databases/
DSH episode with Marco Gorelli → https://www.youtube.com/watch?v=lhAc51QtTHk
Emil Hvitfeldt's talk at posit::conf about orbital → https://youtu.be/Qnm1y0KPxVM
Marcos Huerta's Blog post on DuckDB and BERT → https://marcoshuerta.com/posts/duckdb-and-bert/

If you didn't join live, one great discussion you missed from the zoom chat was about whether industry teams who use and benefit from open source tools could devote 5-10% of their time to giving back to the open source projects with code contributions or other support. Let us know below if you'd like to hear more about this topic!

► Subscribe to Our Channel Here: https://bit.ly/2TzgcOu

Follow Us Here:
Website: https://www.posit.co
Hangout: https://pos.it/dsh
LinkedIn: https://www.linkedin.com/company/posit-software
Bluesky: https://bsky.app/profile/posit.co

Thanks for hanging out with us!
Transcript
This transcript was generated automatically and may contain errors.
Welcome back to the data science hangout, everybody, and welcome to the last data science hangout of 2024. If we haven't had a chance to meet before, I'm Rachel; I lead customer marketing at Posit, and I'm based in the Boston area, so if you ever want to grab coffee or get together, I'm in Boston.
Posit builds enterprise solutions and open source tools for people who do data science with both R and Python. We are also the company formerly known as RStudio, just in case anybody needs to hear that. I'm joined by my lovely co-host today, Libby. Libby, do you want to introduce yourself?
Hello everybody, I'm Libby. I'm a community manager who works with Posit to help our beautiful hangout community thrive, and I'm also a Posit Academy mentor. So speaking of R and Python, I mentor both for Posit Academy, helping working professionals learn them and use them in their data work. We're so happy to have you joining us today.
If this is your first hangout, let me tell you a little bit about it. The hangout is our open space to hear what's going on in the world of data across all different industries, to chat with others about data science leadership, and to connect with other data community members who are facing similar things as you. We get together here every Thursday at the same time, same place (of course not if it's a holiday week, so we will be on a little holiday break and coming back January 9th). But if you are watching this as a recording on YouTube and you want to join us live in the future, there will be details to add it to your own calendar below.
Thank you so much to those who have helped make this the friendly and welcoming space that it is; we're all dedicated to keeping it that way. So if you ever have feedback for me about your experience, anything you want to share anonymously, or suggestions for topics, I'll share a Google form in the chat with you right now. Feel free to use that, and you can always reach out to me directly on LinkedIn too.
And here at the hangout we love hearing from you. It doesn't matter what your years of experience are, what your title is, what industry you're in, or what languages you use or don't use: you are welcome here, and this space is for you. I encourage you to connect with each other in the chat, so introduce yourself, say hi, say where you are and what you do. Feel free to share your LinkedIn profile and also a link to your website so that people can find you once this Zoom chat goes away.
Okay, so this is a community-driven discussion that we're going to be having with Hannes, and that means that you ask the questions. There are three ways to jump in: you can raise your hand on Zoom and we'll call on you to ask your question; you can put a question in the Zoom chat, and if you can't ask it yourself (maybe your mic isn't working or you're somewhere loud), put an asterisk at the beginning or the end and we will ask it for you; and we also have a Slido link where you can ask questions anonymously.
Introducing Hannes and DuckDB
Thank you so much for being here. Libby and I are so excited to be joined by our other co-host today, Hannes Mühleisen, co-founder and CEO of DuckDB Labs. Hannes, I would love to have you introduce yourself first and share a little bit of what you do at work but also something you like to do for fun outside of work.
All right, well, thanks for having me. I'm really happy to see so many people; on the Thursday before Christmas I was a bit concerned. My name is Hannes. Yes, I work at DuckDB Labs, but I'm also an academic: I'm a professor at a university, where I teach data engineering. I'm also active in the open source space, where we make a database system called DuckDB. It's an open source project, and it takes most of my time, so at some point we spun out of the research world into a company, DuckDB Labs, where I'm now sitting.
I'm still at the office, here in Amsterdam, where we're in the longest stretch without sun in 30 years. So yeah, I'm leading the company and I'm still hacking around on DuckDB. Today was a glorious day because I got to just sit here and hack around; it's kind of quieting down for the holidays, and it's really amazing.
But anyways, what do I do outside work? I like boats, so I live on a houseboat with my family in Amsterdam. It can actually drive around, so sometimes I do that, and when we're not driving around there are usually some maintenance issues that have to be addressed, which I also really love. I'm a practical person; I like fixing things, and I also like making software. Somehow they're connected, making software and fixing things, I don't know why.
What is DuckDB?
My second question is going to be: can you explain DuckDB to me like I'm five? For anybody here who hasn't used DuckDB, who doesn't know what it's about or why anyone would ever use it.
Right, so it's kind of funny, because DuckDB is a data management system, okay? It speaks SQL, it can, you know, do transactions, it can store data in tables, these kinds of things. Quite traditional in that sense, but it's quite new in how it's deployed: it can run anywhere. It can run in your Python shell, it can run in your R shell, it can run in the browser, it can run on your phone. It takes a technology that used to be sort of locked up in one place; we used to have data warehouse servers, and they were really in one place.
And DuckDB really opens that up in the deployment sense: you can put it anywhere. Now, it's interesting because it was really built with Anthony Damico; maybe some people know him. He made asdfree.com, which is just about analyzing survey data with R. And at some point we started talking about the American Community Survey, which was creating issues because there were too many Americans or something. There were, you know, too many rows, what can I say.
And so he started exploring database technology to make that work. And from these discussions over many years (it's been almost 10 years now since we had the initial discussions), we started designing a database system from scratch that can do that. So yeah, DuckDB is open source. It's the fastest growing database system, I think, ever; there are some metrics that we're tracking, and it's kind of interesting to see that. It's free, anybody can use it, and it's used all over the place.
The community is huge and the company is small. And we have a soft spot for data science, because the initial discussions that eventually led to DuckDB came out of interactions with, you know, the R community specifically. So that's why we've been working with people; we've been working with Posit to make that work, but also just generally with the R community. We've been spending quite a lot of time there because it's really great as inspiration to see what people's data problems are, right?
Like, we are database systems people. We're like, okay, what are the actual problems? And you find interesting problems like "I need to import CSV files" or something like that. These are things that can be very unwieldy, and we've spent a lot of time on them because it's such a relevant thing. So yeah, DuckDB, it's a piece of software.
People generally say, right, Excel stops working at, you know, a hundred thousand rows. R maybe stops working at, I don't know, 10 million or something like that; that's base R, dplyr, that kind of stuff. And so DuckDB is kind of for this mid range, where you can go up to, let's say, a couple of billion rows or something like that. That's really where people used to go for these big data solutions in the past.
And now you have an engine that can get you much further on your local laptop, right? Because your laptop is already paid for; it's sitting right there, and it can do surprisingly a lot of computation. But indeed, if your biggest file is 15 megabytes, then there is probably no need for it.
Finding opportunities in "boring" problems
So Hannes, one of the takeaways for me from Hangouts this year, and I was thinking about our conversation a few weeks back, is around finding the things that people might not think are cool and focusing on them. I was thinking back to when Marco Gorelli was also talking about getting excited about things that people might think are boring, and I was wondering if you could talk to us a little bit about that: working on things that are boring.

Yeah, I don't know. I don't think there are boring things.
So I think in general there's really no shortage of ideas on what to work on. Personally, I really like looking at Stack Overflow and Reddit and seeing what kind of problems people have, right? And you really have to squint, because one specific person's issue is not really interesting; there are so many of them. But if you see the same exact class of problems coming up, you know, 10,000 times, then maybe there is something there.
So one of the things that we did was teach these courses with Spark. Maybe some people here have worked with Spark. So I was teaching a course with Spark, and it wasn't a great success, because people generally didn't manage to work with Spark. We were like, okay, here's how you Spark, here's how you operate this thing; obviously we had some concept of how to do it. But then nobody managed. And these were not, you know, toddlers; these were people who had been around data for ages, and they just couldn't manage.
And after a couple of years of giving this course, we started thinking, yeah, maybe there's something there. Maybe it doesn't have to be so difficult to, I don't know, write a Parquet file, right? And I really encourage all my colleagues in research and in the open source space: people tend to want to set their own agenda and then just basically execute it, but I feel like, for me, I really want to address an issue.
And for that, there's so much information, so many resources. You can just download the Reddit dump or whatever it's called, start analyzing it, and you will see what the things are that people struggle with. "I can't get this to work": just search for that, and then you have some idea. And that's, I think, also part of the secret of the tidyverse: it really addresses a need.
Why is it called DuckDB? Meet Wilbur
So why duck? Why a duck? Yeah. So there used to be a duck called Wilbur, and he was my pet duck. He lived with me on the boat, and he was a very cute, very cuddly duck. Who could have imagined that ducks can be cuddly? But yeah, he eventually flew away, and we like to think he started a ducky family somewhere. In his memory, DuckDB is called DuckDB.
So here, this is little Wilbur. You can see him there; this is me on the other side, obviously. And he's just standing there hanging out, you know, being happy, being very loud. Ducks are loud, man. Ducks are loud.
Corporate use cases and DuckDB in the wild
My question is: I've used DuckDB for some of my personal projects. It's great at, like, CSV to Parquet if the files are big. I've used the vector data support, which was really helpful, combined with some other simple small language model stuff. But at work, my data's all up in the cloud; my data's all in Snowflake. So do you see DuckDB as mainly a tool for researchers working on their laptops, kind of in academic environments, or do you see a corporate use case where our data's mostly in the cloud?
Yeah, that's a great question. So we do see actually a ton of corporate use; this is the fastest growing area of DuckDB use. There is a lot of the "I'm here on my laptop and I just have to deal with this data" sort of use, and that's also where DuckDB came from, as I've explained. But there is a ton of use in corporate settings, and it's less as a complete system; people use it a lot as a building block.
For example, if you look at the modern BI tools out there, almost all of them use DuckDB under the hood. If you look at the big data transformation companies like Fivetran, they are also using DuckDB under the hood. So you might not actually realize you're using it. It's not like, I don't know, Snowflake, where you know you're talking to Snowflake because there's a little blue logo up in the corner.
What we see a lot with DuckDB is that people put it in somewhere, and you can kind of figure out, if you really want to, that it's running DuckDB under the hood, because the SQL dialect will have some things that give it away. But very often it's also just pre-baked queries, pre-baked schemas: just run this thing and people are happy. And that's really fine with us.
It's open source software; people can do whatever they want with it. But that's, I think, where we see a lot of growth, in these sorts of use cases. And for people that do want an end-to-end solution, there is MotherDuck, right? That's a company that is building a DuckDB SaaS, which has a look and feel, I would say, similar to Snowflake or BigQuery or whatever else there is.
And some use cases are kind of wild. You know, people put it in cars, and then I'm like, are you sure? But yeah, I think there is a lot of use in big tech; it's just sometimes not visible.
DuckDB with Shiny and R integration
I believe I read somewhere that DuckDB can be used with Shiny. Generally, how does it integrate with R, Python and data visualization tools?
So the integration with R is really interesting, and it's not like anything you've ever seen, I think, in terms of data management systems, maybe except SQLite, because of this in-process architecture where the whole thing runs inside whichever process invokes it. So with R, you just have an R package: install.packages("duckdb"). And that is actually not a client or a driver, like maybe for other systems; it actually contains the whole thing. When you start it, it runs inside the R process that you're sitting in front of.
And yeah, that has some very nice benefits. For example, you can read stuff from memory: if you have a data frame in memory, we can read that data frame. When you have a query result, you can directly turn it into a data frame or Arrow structures. The same thing happens in Python, where you can just pip install duckdb, load it up inside your Python environment, and look at Pandas data frames or Polars structures or Arrow structures, because it's just right there in the process.
And it doesn't have all this client-server stuff. You don't have to set a password, because there's no point in having a password: it's right there in your process. It's yours, right? You can do whatever you want with it.
Rethinking what a database is
So I've been reading more about DuckDB and using it in the creative-architecture bucket of use cases, doing serverless apps and things. And I realized that there are a lot of things I think I know that I don't really know, like: what is a database? In the past I've thought of a database as something that has to be in the cloud, because lots of things are happening to it in different places, and therefore I have to query that instead of just having my data files locally. But files can also be in the cloud, and DuckDB can query files. So I need to update my mental models of a lot of things. Can you weigh in on how data scientists should think about what a database is, and is DuckDB actually changing the concept of a database?
I think it's definitely changing this conception people have of what a database is and, as you alluded to, where it lives. Okay, so what is a database? Especially in stats, there is a lot of confusion with the word, because there is the database, which is just a collection of tables, and then there's the database management system, which is of course something like DuckDB: the system that you use. It's confusing terminology, but in principle a database is just a collection of tables, right?
And then you have the database management system, which provides management of these tables: storing them, changing them, and most prominently querying them. And in the past, these things were, you know, in the cloud, or data warehouses, as they're also termed: a Teradata somewhere, or maybe a Greenplum, or, I don't know, some Oracle server somewhere. And that was your database.
And it's true that in the past these things were kind of locked into specific pieces of hardware, often because they just wouldn't run on anything else. But what we are seeing, and I think that's kind of the point of DuckDB, is that you can pull a lot of this stuff into places that you haven't considered yet, like your laptop. I find it always funny when people sit in front of a $3,000 MacBook that runs a web browser and nothing else, right? And then they use that to spend more money on Snowflake in order to query a bunch of rows, when they have an actually pretty powerful machine sitting right there.
And what you said, I really liked: the creative architecture. It does really change people's perception, I think, and people's perceptions maybe also should change in terms of what "database" actually means. Part of our mission is to reduce the negative connotation of the word, because I think data scientists in particular have been, how should I say, a bit fearful of or hostile to database tech in general, because it worked differently from the other stuff. That's why we're trying really hard to make it easy to use, friendly, these kinds of things.
Small data and relational integrity
So you mentioned that for files of maybe 50 megabytes or less, DuckDB maybe isn't the right solution. But is it less that it's the wrong solution and more that your existing stack also works fine?

I mean, there's no lower limit, if that's what you want to know.

Yeah, because I was kind of curious. For example, I work in clinical trials, and most of our data is actually small, but we do often have a lot of relational tables. Is this, I assume, a good use case for DuckDB?
Yeah, for sure. I think what's really interesting, and what people sometimes forget, is that traditionally there's more to database systems than just a store of tables. What's interesting for small, maybe smallish data is all this integrity sort of dance, right? Where you say, okay, I'll actually define some relationships here: foreign keys, primary keys, constraints, these kinds of things. That will actually allow me to keep my data consistent, and with a really good transactional system like DuckDB, it will also make sure that any changes you make to the data will not break any of these constraints.
And that's actually something that I think data frame library users in general have just no concept of, because it's never been something they had, but it's pretty standard in relational technology. So if you say, I don't have huge tables, but they are connected in a complex way, and I need to do all these funky recombinations, then there's definitely a great use case, because you are able to go from one consistent state to another consistent state.
And, let's say, your value per byte is extremely high, right? It's not like you collected some log files somewhere that are probably garbage, but maybe something interesting is in there. It's more like: no, no, no, every row was a whole patient trajectory, and that cost, you know, a hundred thousand dollars. So it's very high value data, and maybe we want to treat high value data better than sticking it in a CSV file. Maybe we do want to have a schema. Maybe we do want to make sure that we have the right types. Maybe you want to have the constraints, and maybe you want to have the integrity relationships on the tables.
That's also something that was in my talk at Posit Conf this year, where I tried to make this point: there's more to table technology than just running queries. There are also all these other guarantees that you kind of get for free that could be interesting for data practitioners.
DuckDB culture and open source governance
So the culture here at the company is a bit like: we took what we knew from a research group and tried to improve it where we thought it was going wrong. It's a very progress-oriented team here; we're all very good at database engineering, and we just enjoy each other's company. And generally we have a rule that we only hire friendly people, and that has worked really well so far.
And the project itself, DuckDB, is a bit special because we don't have open governance. It is open source, MIT licensed; you can do whatever you want with it. But there is no democracy (that sounds a bit terrible, I have to admit) in terms of the direction of the project. And that's intentional, because we've seen projects, specifically Apache projects, go down the drain because they couldn't agree on a vision. In the end they just did whatever everyone wanted, and they became unmaintainable messes.
Because we have seen this so many times, we've just decided that we're going to keep control over what goes in. Mark and I, the two original creators of DuckDB, do that, just to keep the vision. We do have rules on behavior; not strict, I would say, but we are very conscious of keeping a productive and friendly tone in places like GitHub issues and comments, and we are trying to really be friendly to everyone. I think over the years we've banned two people from our GitHub repo, for really unacceptable behavior, but we are really trying to have positive interactions on our Discord too and all the other places we interact with the community.
And I think it's worked out really well. The tone is incredibly friendly; people want to help each other. So yeah, it's also just leading by example. There are other open source projects, you know, like Linux, where the head guy in charge is famously grumpy all the time, and that will trickle down through the community. We're not doing that; we try to be consistently friendly, and I think it shows in the community. It's also something that I think Hadley is doing extremely well: just this incredible civility that he brought to the previously grumpy R community.
Machine learning and DuckDB's focus
Now, thinking about things like orbital, and how that integrates nicely with tidymodels and produces queries which you could pass into SQL for machine learning: is there any plan for some support for machine learning as we go into the future with DuckDB?
Um, yeah, I think we're quite happy where we're at in terms of API, or our general focus on relational transformation of tables. I mean, we are sometimes expanding; we recently added the PIVOT statement, for example, which doesn't exist in the SQL standard. So we are expanding there, but we try to keep it focused on, let's say, table-y operations. There are some operations to do simple regressions and stuff like that in DuckDB, but it's really not our focus.
There are other toolkits that are great at machine learning tasks, and we try to integrate. That's one of the strengths that I think DuckDB has by being in process: we can, for example, export to, I don't know, scikit-learn fairly efficiently, and then it can do its thing and we can pull back the results and do something else. It's not like traditional databases, for example Oracle, that have to push everything into the database system itself to be efficient. We can just say: you want to pull this out and run something on it? Great, just do it. It's not going to take forever.
So yeah, we try to focus. We're a small team of 20 people; we need to focus and only do the things we can actually do, and tables plus SQL is already, let's say, a challenging target.
Roadmap: connectors and interoperability
Is there anything DuckDB can do today with ADBC, the ODBC competitor?

Yeah, so we do have ADBC support in DuckDB. For those who don't know, this is something that came out of Voltron Data and the Arrow project, where the idea was to update the standard database protocols a bit. We do support ADBC, and we also generally support Arrow ingestion: if your database supports Arrow, then you can ingest that into DuckDB quite efficiently.
We do have specialized connectors for things like SQLite, PostgreSQL, and MySQL. And what we do have on the roadmap is a generic ODBC connector, so you can also just pull stuff out of your SQL Server or Oracle or DB2 or, you know, these ancient things, into DuckDB. We're also seeing a lot of interesting developments with things like Iceberg and Delta Lake, where the big players are starting to make things accessible through these interfaces. That already works okay with DuckDB, but we're also working to improve it further.
So we do see a lot of use cases where, basically, an analyst will just pull some stuff from the production database into a local DuckDB and then just run with it and do analysis on it. We see that a lot, and we want to support it more.
Security considerations
I think this is an excellent question. I mean, we do get some improvement in the general situation from not having client-server, because there's no server to run. There's no port that can be accidentally left open or something like that, right? This whole traditional security problem of "you can attack this protocol over the network with some crafted packets or something": we don't have that.
But there are things: we do have a page in our documentation about securing DuckDB, because there are aspects that can be surprising. For example, if you're running it somewhere and you just open up the SQL prompt to the world, expecting nothing bad will happen, well, bad things can happen, because the SQL prompt is extremely powerful. It would be like opening up an R shell to the world. You can do it, but you have to be a bit careful, right?
The biggest security concern we have is around distribution of binaries. We build the binaries in our CI and distribute them, and CRAN builds binaries too, so we have to be very careful that nobody manages to implant malicious code into the stuff that we distribute. That is the thing I'm most concerned about, and it comes down to nailing down your CI, having signatures on things, and having hashes checked regularly. There's a lot of work in the background that isn't visible in the product, but we are definitely aware of it.
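The hash checking he mentions boils down to comparing a digest of the downloaded artifact against the one published with the release. A minimal, stdlib-only sketch (the file and its "published" hash here are simulated; a real check compares against the checksum from the release page):

```python
import hashlib

def sha256_of(path: str) -> str:
    """Hex SHA-256 of a file, read in chunks so large binaries fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

# Simulate a downloaded release artifact (made-up content).
payload = b"pretend this is a release binary"
with open("artifact.bin", "wb") as f:
    f.write(payload)

# In reality this string comes from the checksums file published
# alongside the release, fetched over a separate trusted channel.
published = hashlib.sha256(payload).hexdigest()

assert sha256_of("artifact.bin") == published, "checksum mismatch -- do not install"
```

Signatures go a step further than plain hashes, since they also prove who produced the artifact, not just that it arrived intact.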
Getting acceptance and overcoming skepticism
When you first came up with DuckDB, did you struggle to get acceptance from the data industry? And if you did, how did you get over it?
So we were initially ridiculed for not having a distributed system. It was our strong belief, and a hill we were willing to die on, that you don't actually need distributed systems to query significant amounts of data. And it didn't really worry me so much, because I thought, fine, if it takes ten years for the world to acknowledge this, fine. It actually didn't take all that long.
I was ready to be ridiculed for much longer than we actually were. And I think the crazy growth we see with DuckDB, and blog posts like "Big Data is Dead", have really changed the data industry, in my opinion. So we are no longer ridiculed; I'm happy to report we're even working with some of the really big data companies.
And of course we are from Europe, and the data industry tends to ignore things that are not made in the Bay Area. That was also something I was a bit concerned about in the beginning: we can be as good as we want, but is it actually going to arrive in the US? That has also worked out, so it's another thing I was maybe a little bit concerned about that turned out fine.
Data heroes and closing thoughts
Earlier this week, we had a customer meeting with Wes and Hadley, and somebody asked them both who their data heroes are. Wes McKinney pointed to you. So I wanted to ask you that question too: who are your data heroes?
Well, okay. I think Hadley is definitely one of my heroes. One of the reasons I was extremely nervous about talking at Posit Conf this year was that, you know, I have talked in front of a hundred thousand people, but somehow talking in front of Hadley made me extremely nervous. That was interesting to see about myself. And I think it has to do with, you know, people who have managed single-handedly to turn an industry around.
Who are my heroes? Yeah, I have to name Hadley, and of course Wes, but I cannot say him, that would be circular. In my field, data management systems, if you can show that people use your thing, then you have kind of won the argument, and I think both Hadley and Wes have shown this in a great way. And for me, someone like Mike Stonebraker, the Turing Award winner and one of the people behind Postgres, is somebody I look up to, just because of the immense amount of impact he's had.
First of all, I'm sorry I couldn't answer all the questions. Feel free to email me, it's hannes@duckdb.org, it's very easy. If you have any direct questions for me, I can't promise I will answer immediately. As for career advice, I think going for impact is what matters; that's what I mentioned earlier about looking at Reddit. If you look at what will have the biggest impact for the most people, that's something I think is worthwhile.
Thank you so much. Well, Hannes, I know we are the last call keeping you from your holiday vacation right now, but thank you so much for joining us to close out a year of amazing Data Science Hangouts. We really appreciate you taking the time, and thank you to all of you who are here, make this space what it is, and join us week after week. We will be on a bit of a holiday break until January 9th, when we'll be joined by Michael Chow and Rich Iannone at Posit, who work on the gt and Great Tables packages. I wish you all a wonderful holiday season with your friends and family, and a happy new year. Thank you all. Thanks everyone. Bye everybody. Bye.

