Resources

Siuba and duckdb: Analyzing Everything Everywhere All at Once - posit::conf(2023)

Presented by Michael Chow

Every data analysis in Python starts with a big fork in the road: which DataFrame library should I use? The DataFrame decision locks you into different methods, with subtly different behavior:

- different table methods (e.g. polars `.with_columns()` vs pandas `.assign()`)
- different column methods (e.g. polars `.map_dict()` vs pandas `.map()`)

In this talk, I'll discuss how siuba (a dplyr port to Python) combines with duckdb (a crazy powerful SQL engine) to provide a unified, dplyr-like interface for analyzing a wide range of data sources, whether pandas or polars DataFrames, parquet files in a cloud bucket, or pins on Posit Connect. Finally, I'll discuss recent experiments to more tightly integrate siuba and duckdb.

Presented at posit::conf(2023), September 19-20, 2023. Learn more at posit.co/conference.

Talk Track: Databases for data science with duckdb and dbt. Session Code: TALK-1101


Transcript

This transcript was generated automatically and may contain errors.

Howdy, thanks for having me. I'm super excited to talk on siuba and DuckDB, analyzing everything everywhere all at once. But I also have to say I'm so honored to be in a power session with so many DuckDB enthusiasts. It's cool to see all the variations.

Just a little background. My name is Michael Chow. I did a PhD in cognitive psychology. I do open source data tools at Posit, focused on Python tools. And I have two beautiful cats, Bandit and Moondog. So those aren't related at all to the topic.

The DataFrame decision problem in Python

What I want you to think about is when you start a data analysis in Python, if you're like me, do you think about the question, what data frame do I use? I think in R you usually kind of have it figured out, but in Python you have quite a few options. So you could import pandas, or you could import polars, or you could also bring in DuckDB and work that into your analysis.

And I find the options really exciting. I really like to have a lot of different options for analysis. But I also find parts of it really overwhelming. And I think the reason is that depending on the data frame you use, everything changes. So you have to go to a new documentation site. All the operations are different, sometimes subtly different.

So you might maybe start with pandas and go there and learn its methods. And then you might switch to polars. I've really been enjoying polars a lot. And you'll notice that it looks eerily similar to the pandas site, except everything's ever so slightly different.

And let's just say you keep going and you decide to go with pandas. And you just want to concatenate a couple columns. So that means you want to take the strings in X and you just want to concatenate them with Y. You could use `df.x.str.cat`. Don't ask me to say that again. And pass it `df.y`. And that will concatenate it. So you'll get two rows, "ac" and "bd". I had to check that, but that's exactly what you'll get.
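As a quick sketch of that pandas behavior (the frame and column names here just mirror the example above):

```python
import pandas as pd

df = pd.DataFrame({"x": ["a", "b"], "y": ["c", "d"]})

# str.cat concatenates elementwise with another series of strings
result = df.x.str.cat(df.y)
# result is a Series containing "ac" and "bd"
```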

But then let's say you decide to use polars. And you decide to do something a little bit similar. And so you notice there's a concat method. So there, let's say you try the concat method. As it turns out, you'll get an error. And the reason is concat in polars actually aggregates. So it sort of pastes together all the letters in your series. And actually, the argument you pass makes no sense to polars, and it's mad at you.

I find this really, I actually love all the options they give me, but I find this switching, this sort of split-brain nature of working with data, really challenging in Python. And leaning into the movie, everything everywhere all at once, I sort of, I want to try coining a pattern for this. I think this is the everything bagel pattern.

It's the pattern where you have a data tool, but you also want to kind of stuff in everything people might want to do with it. So you take every possible method, and you add it on. And I actually, again, I love options. I love that I can do a lot with it. But I find it really kind of overwhelming when I'm switching off between things, that I'm switching off kind of the whole world.


Using siuba and DuckDB together

Oh, so just some, yeah, you can't really use pandas methods on polars data. You can, but there's this interchange, and that's really vexing to me. So let's keep going, though. I've been using a lot of DuckDB, and I think the really neat thing is you can run DuckDB on either pandas or polars. You can just use DuckDB functions. DuckDB has its own docs, so it's not totally innocent from this, like, docs thing. But I really like using DuckDB.

I find the one challenge is that I see people write a lot of SQL in Python strings. And so this is kind of a weird tradeoff if you're doing it in Python.

So one really neat thing, I think, about siuba and DuckDB is you can use dplyr syntax to generate SQL for DuckDB, and you can still harness the power of the Duck. So in this case, you would do a little bit of setup. Right now, we're using SQLAlchemy to sort of create a connection to DuckDB. And then, dplyr style, you would create a table. So this is similar to how dplyr connects to database tables. And then you can feed it into something like mutate. And it might actually give you back that exact same sort of SQL, that concat statement.

So I really like this approach because I get the syntax and the grammar of dplyr, but the power of DuckDB. And I think especially in Python, that's really helpful to have that escape hatch.

Introducing DuckOps

You might notice, though, there's one weird thing that's going to be a big topic here, which is this quirk: siuba actually is translating pandas methods. Going back to that docs situation, I don't want to add new methods at all. Like, you already have half a dozen docs sites to worry about.

In my mind, siuba was originally built on pandas to manipulate pandas data. And so it just used the pandas methods. I didn't want to create a whole new docs site for people. But it's a little bit weird when you're working with DuckDB. If you think about it, if you're a person working with DuckDB, it's weird to be in the DuckDB docs and then have to actually switch to pandas methods that get translated back to DuckDB. I feel like it's the same sort of split-brained problem as the pandas, polars, everything-bagel methods situation.

So what I want to talk about today is an approach I've been working on called DuckOps, which is the idea of: what if you could just import the DuckDB functions into Python and just use those? Now it's sort of all DuckDB, and you don't have to worry about this weird translation-docs thing. You could hopefully stick mostly with the DuckDB docs and still get dplyr syntax in Python.

And I've been really enjoying this approach. So just to recap, I think you've heard a lot about DuckDB. DuckDB is really neat. It's fast. It can read all kinds of things. It's got really cool, cutting-edge SQL syntax that it's added. The idea with DuckOps is that you should be able to just read the DuckDB functions in and use them in Python. And there are a lot of really neat advantages to that. And then siuba is the dplyr part of being able to mutate and filter and select.

Functions vs. methods

But I want to really focus on this thing, this move from methods to functions. I think as an R user, if there are a lot of R users in this room, you probably already appreciate functions, because functions are most of what you use when you're analyzing data. But what if I told you the thing on the left, methods, are probably the dominant way of working with data in Python? So I want to focus on why this is a really big deal. And it goes back to the everything bagel.

So I'm going to talk about how functions are sort of alternative realities. And then I want to focus a little bit more on methods and some of their limits. And then some more advanced features that you can do with DuckDB.

So first, functions are alternative realities. In the movie Everything Everywhere All at Once, we follow a character, Evelyn, as she experiences many different universes. In one, she has hot dogs for fingers. In another, she's a cat person. And in yet another, she's a rock on a mountain. And a big challenge is Evelyn finds these alternative realities overwhelming. When you move through these different realities, you might feel a bit fractured and overwhelmed. Because there's a whole universe of alternatives out there.

I think about it like this. I think that there are so many things you can do with data. And it's worth diving into every single one. Whether it's like strings or dates or nested things, which is a crazy topic. I think that there are a lot of talks at posit::conf on these specific areas. We've heard talks about working with dates that are really in-depth. And there's a library, stringr, just dedicated to string stuff. So there's all these functions. There's all these sort of lives you can live in data analysis.

And I'm sure you know a datetime person who's gone weirdly deep into datetimes. And they're disheveled when they come out. They're ranting about, I don't know, epochs and different things. There are whole worlds out there.

In DuckOps, each of these is just a submodule you can import from. And I think this is really helpful. It's dumb, but it's really helpful to explore and find the different things you might want to use.

So I think a big advantage of all these functions is they're really easy to tinker with. You can just import concat, and you can see what happens. So concat of "a", "b", "c" is "abc". That's great. I love that. Concat with a polars series is a polars series. That's pretty handy to figure out what's going to happen. One interesting thing is with the siuba underscore (`_`), it returns a lazy expression. And you can use that inside these sort of, like, lazy queries in interesting ways.

Another nice thing about functions is you can just look up docs often. So from Python, I can just figure out what concat does. When concat's tucked inside a class, it's really hard to figure out. You kind of have to go find other ways. Functions are really discoverable.

The other thing is functions work great programmatically. So if you want to, say, change how concat works, so this is actually concat underscore WS, which is how DuckDB says "with separator". You can modify the behavior of a function. So in Python, this is a partial. This is saying, like, the separator's always a dash. And now we have a dash concat, and that's all it does. It's dumb, but it works.
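As a plain-Python sketch of that idea (this `concat_ws` is a stand-in for DuckDB's function of the same name, not an import from DuckOps):

```python
from functools import partial

# stand-in for DuckDB's concat_ws(separator, ...): join parts with a separator
def concat_ws(sep, *parts):
    return sep.join(parts)

# partial fixes the first argument, so the separator is always a dash
dash_concat = partial(concat_ws, "-")
dash_concat("a", "b", "c")  # "a-b-c"
```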

We'll get into methods where it doesn't work in a little bit. So the thing is you can swap in alternatives. So here, if we don't want concat, we can just import a new concat from a new package, and suddenly we have another version.

So just to recap, you can really easily tinker across things. It's easy to discover functions, to program them, and to swap in alternatives.

In contrast, methods, the everything bagel pattern, are a bit tricky to work with. So we've put everything on the data, and it's nice. You can look it up via a dot, but there are some downsides. One is that you end up with a lot of docs. People tend to reproduce methods. The other is that if you wanted to put a concat, say, on a pandas Series, you would actually have to do something really weird and kind of violent to it. You would have to mutate it. So you would just take the Series class, and you would actually just kind of shove concat in there.
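What that "shoving" looks like in practice; this is a sketch of monkey-patching, shown to make the point, not something to actually do:

```python
import pandas as pd

# adding a "method" to a class you don't own means mutating the class itself
def concat(self, other):
    return self.str.cat(other)

pd.Series.concat = concat  # monkey-patching: it works, but it's fragile

s = pd.Series(["a", "b"])
result = s.concat(pd.Series(["c", "d"]))  # "ac", "bd"
```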

I don't think anyone but the owner of a class can really put methods on it in a normal way. The other thing is you can't change the first argument. A method is like a function that loves its first argument. So with a pandas Series, isn't it weird? With a function, you can concat the string "abc" to something, but with a method on a pandas Series, you actually can't do that, because the Series is always the first thing. So it's surprisingly hard.

The last is you can't really program. So we saw you could partial the concat function. This isn't really easy or fun or desirable with methods. You just sort of can't do it.

A lot of people will say, method apologists will say, what about method chains? They're super convenient. You can take this Dagwood sandwich on the left, this deep nested set of function calls, and you can turn it into this nice chain. I agree. I think that's really important. siuba tries to work around this with function chains, just this pipe syntax. So you can pipe functions into each other. It's tough, because method chaining is a part of Python, so you get it in the language. But this is an attempt to kind of get the same thing with functions.
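The pipe idea can be sketched in a few lines of plain Python; this helper is illustrative, not siuba's actual implementation:

```python
from functools import reduce

# thread a value through a sequence of functions, left to right,
# approximating the readability of a method chain
def pipe(value, *funcs):
    return reduce(lambda acc, f: f(acc), funcs, value)

pipe("  Hello  ", str.strip, str.lower)  # "hello"
```

Compare `pipe(x, f, g, h)` with the nested `h(g(f(x)))`: same calls, but the pipe reads in the order the steps happen.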

Just to recap, methods are a little bit hard to discover. Everybody has their own docs for methods. A function just has one doc string. Alternatives, you need to actually mutate classes to add new methods onto them. And the last is method chains are a really handy thing. Hopefully pipe chains, this concept can get you pretty close.

Advanced DuckDB features and DuckOps

Just to go a bit deeper, and I'm going to go through these very fast: I think DuckDB, to summarize, has some really cool syntax and some really advanced SQL it can do. So here, that dash-arrow (`->`) is actually a lambda. There are all these interesting constructs they have. I think in DuckOps, a big challenge is going to be reproducing that. So here, we can run this list filter function, and we do have syntax for lambdas. And this is also used in the dplyr-style across. So a lot of the same stuff is reused.

Secondly, would you rather have stringr functions? That's actually pretty doable. With just functions, you could create alternative function sets. You could create a stringr, hopefully not called stringr-py. Or we could do it as a team.

So just to recap, DuckDB is incredible. It's fast. It can read all kinds of files. It has cutting-edge SQL syntax. DuckOps is this experiment in trying to pull the DuckDB functions out. And I really want to make the point, hopefully, that these functions are really useful. And as functions, there's a lot you can do with them. And last, siuba aims to be a dplyr port to Python, and hopefully can let you build up your DuckDB query and use it programmatically from Python.

So I want to thank everybody that helped: Krista Adams, who designed the hex sticker, which we'll make into a proper hex and not a squiggly hex, I swear it; Chris Gardillo; Anthony Baker; and tons of folks at Posit who have helped and supported this kind of work. So thanks, and go functions. Functions forever.

Q&A

So the Zen of Python says there should only be one way to do it. Would you say that the Python data ecosystem fails to live up to that philosophy? Yeah. I think that that's an interesting mantra. I mean, that's the funny thing about methods is that your first attempt at a method lives forever because it lives on your data. So I would almost flip it and say with methods, you only get one attempt.

But actually, if you look at the history of the tidyverse, packages like stringr and lubridate and all the failed versions of those things, it's really clear that R innovated through alternatives and decoupling of the functions that can act on data from the data itself. But I agree. It would be nice to have one way to do things, but sometimes you've got to like search around, you know?


Does siuba support polars and pandas? I mean, we're basically barnacling on DuckDB at this point. I think I would love to support polars expressions, actually. I think similar to extracting out DuckDB functions, it would be cool to just work directly on polars expressions, but I think it would involve a little bit of Rust scuba diving.

Is there a data world that's not currently supported by DuckOps that you'd like to see in there in the future? Oh, I mean, there's all the like geo functions and I mean, it seems like there's a ton of stuff in DuckDB. So this was really a first pass to see what it would look like and how it would feel to run. I almost feel like I'm really sold on the feel. I like the functions. And so now I'm really interested in exploring how do we really flesh that out and see what it looks like to just try to cover as much as possible if it's useful.

Thank you very much. Let's thank Michael and all of our other speakers again.