Resources

Siuba and duckdb: Analyzing Everything Everywhere All at Once - posit::conf(2023)

Presented by Michael Chow

Every data analysis in Python starts with a big fork in the road: which DataFrame library should I use? The DataFrame decision locks you into different methods, with subtly different behavior:

- different table methods (e.g. polars `.with_columns()` vs pandas `.assign()`)
- different column methods (e.g. polars `.map_dict()` vs pandas `.map()`)

In this talk, I'll discuss how siuba (a dplyr port to Python) combines with duckdb (a crazy powerful SQL engine) to provide a unified, dplyr-like interface for analyzing a wide range of data sources, whether pandas or polars DataFrames, parquet files in a cloud bucket, or pins on Posit Connect. Finally, I'll discuss recent experiments to more tightly integrate siuba and duckdb.

Presented at posit::conf(2023), September 19-20, 2023. Learn more at posit.co/conference.

Talk Track: Databases for data science with duckdb and dbt. Session Code: TALK-1101


Transcript

This transcript was generated automatically and may contain errors.

Howdy, thanks for having me. I'm super excited to talk on siuba and DuckDB, analyzing everything everywhere all at once. But I also have to say I'm so honored to be in a power session with so many DuckDB enthusiasts. It's cool to see all the variations.

Just a little background. My name is Michael Chow. I did a PhD in cognitive psychology. I do open source data tools at Posit, focused on Python tools. And I have two beautiful cats, Bandit and Moondog. So those aren't related at all to the topic.

The DataFrame decision problem in Python

What I want you to think about is when you start a data analysis in Python, if you're like me, do you think about the question, what data frame do I use? I think in R you usually kind of have it figured out, but in Python you have quite a few options. So you could import pandas, or you could import polars, or you could also bring in DuckDB and work that into your analysis.

And I find the options really exciting. I really like to have a lot of different options for analysis. But I also find parts of it really overwhelming. And I think the reason is that depending on the data frame you use, everything changes. So you have to go to a new documentation site. All the operations are different, sometimes subtly different.

So you might maybe start with pandas and go there and learn its methods. And then you might switch to polars. I've really been enjoying polars a lot. And you'll notice that it looks eerily similar to the pandas site, except everything's ever so slightly different.

And let's just say you keep going and you decide to go with pandas. And you just want to concatenate a couple columns. So that means you want to take the strings in X and you just want to concatenate them with Y. You could use `df.x.str.cat`. Don't ask me to say that again. And pass it `df.y`. And that will concatenate it. So you'll get two rows, "ac" and "bd". I had to check that, but that's exactly what you'll get.
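As a quick sketch of that pandas behavior (the frame and column names here just mirror the example above):

```python
import pandas as pd

df = pd.DataFrame({"x": ["a", "b"], "y": ["c", "d"]})

# str.cat concatenates elementwise with another series of strings
result = df.x.str.cat(df.y)
# result is a Series containing "ac" and "bd"
```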

But then let's say you decide to use polars. And you decide to do something a little bit similar. And so you notice there's a concat method. So there, let's say you try the concat method. As it turns out, you'll get an error. And the reason is concat in polars actually aggregates. So it sort of pastes together all the letters in your series. And actually, the argument you pass makes no sense to polars, and it's mad at you.

I find this really, I actually love all the options they give me, but I find this switching, this sort of split-brain nature of working with data, really challenging in Python. And leaning into the movie, everything everywhere all at once, I sort of, I want to try coining a pattern for this. I think this is the everything bagel pattern.

It's the pattern where you have a data tool, but you also want to kind of stuff in everything people might want to do with it. So you take every possible method, and you add it on. And I actually, again, I love options. I love that I can do a lot with it. But I find it really kind of overwhelming when I'm switching off between things, that I'm switching off kind of the whole world.


Using siuba and DuckDB together

Oh, so just some, yeah, you can't really use pandas methods on polars data. You can, but there's this interchange, and that's really vexing to me. So let's keep going, though. I've been using a lot of DuckDB, and I think the really neat thing is you can run DuckDB on either pandas or polars. You can just use DuckDB functions. DuckDB has its own docs, so it's not totally innocent from this, like, docs thing. But I really like using DuckDB.

I find the one challenge is that I see people write a lot of SQL in Python strings. And so this is kind of a weird tradeoff if you're doing it in Python.

So one really neat thing, I think, about siuba and DuckDB is you can use dplyr syntax to generate SQL for DuckDB, and you can still harness the power of the Duck. So in this case, you would do a little bit of setup. Right now, we're using SQLAlchemy to sort of create a connection to DuckDB. And then, dplyr style, you would create a table. So this is similar to how dplyr connects to database tables. And then you can feed it into something like mutate. And it might actually give you back that exact same sort of SQL, that concat statement.

So I really like this approach because I get the syntax and the grammar of dplyr, but the power of DuckDB. And I think especially in Python, that's really helpful to have that escape hatch.

Introducing DuckOps

You might notice, though, there's one weird thing that's going to be a big topic here, which is this quirk: siuba actually is translating pandas methods. Going back to that docs situation, I don't want to add new methods at all. Like, you already have half a dozen docs sites to worry about.

In my mind, siuba was originally built on pandas to manipulate pandas data. And so it just used the pandas methods. I didn't want to create a whole new docs site for people. But it's a little bit weird when you're working with DuckDB. If you think about it, if you're a person working with DuckDB, it's weird to be in the DuckDB docs and then have to actually switch to pandas methods that get translated back to DuckDB. I feel like it's the same sort of split-brained problem as the pandas, polars, everything-bagel methods situation.

So what I want to talk about today is an approach I've been working on called DuckOps, which is the idea of: what if you could just import the DuckDB functions into Python and just use those? Now it's sort of all DuckDB, and you don't have to worry about this weird translation-docs thing. You could hopefully stick mostly with the DuckDB docs and still get dplyr syntax in Python.

And I've been really enjoying this approach. So just to recap, I think you've heard a lot about DuckDB. DuckDB is really neat. It's fast. It can read all kinds of things. It's got really cool, cutting-edge SQL syntax that it's added. The idea with DuckOps is that you should be able to just read the DuckDB functions in and use them in Python. And there are a lot of really neat advantages to that. And then siuba is the dplyr part of being able to mutate and filter and select.

Functions vs. methods

But I want to really focus on this thing, this move from methods to functions. I think as an R user, if there are a lot of R users in this room, you probably already appreciate functions, because functions are most of what you use when you're analyzing data. But what if I told you the thing on the left, methods, are probably the dominant way of working with data in Python? So I want to focus on why this is a really big deal. And it goes back to the everything bagel.

So I'm going to talk about how functions are sort of alternative realities. And then I want to focus a little bit more on methods and some of their limits. And then some more advanced features that you can do with DuckDB.

So first, functions are alternative realities. In the movie Everything Everywhere All at Once, we follow a character, Evelyn, as she experiences many different universes. In one, she has hot dogs for fingers. In another, she's a cat person. And in yet another, she's a rock on a mountain. And a big challenge is Evelyn finds these alternative realities overwhelming. When you move through these different realities, you might feel a bit fractured and overwhelmed. Because there's a whole universe of alternatives out there.

I think about it like this. I think that there are so many things you can do with data. And it's worth diving into every single one. Whether it's like strings or dates or nested things, which is a crazy topic. I think that there are a lot of talks at posit::conf on these specific areas. We've heard talks about working with dates that are really in-depth. And there's a library, stringr, just dedicated to string stuff. So there's all these functions. There's all these sort of lives you can live in data analysis.

And I'm sure you know a datetime person who's gone weirdly deep into datetimes. And they're disheveled when they come out. They're ranting about, I don't know, epochs and different things. There are whole worlds out there.

In DuckOps, each of these is just a submodule you can import from. And I think this is really helpful. It's dumb, but it's really helpful to explore and find the different things you might want to use.

So I think a big advantage of all these functions is they're really easy to tinker with. You can just import concat, and you can see what happens. So concat of "a", "b", "c" is "abc". That's great. I love that. Concat with a polars series is a polars series. That's pretty handy to figure out what's going to happen. One interesting thing is with the siuba underscore (`_`), it returns a lazy expression. And you can use that inside these sort of, like, lazy queries in interesting ways.

Another nice thing about functions is you can just look up docs often. So from Python, I can just figure out what concat does. When concat's tucked inside a class, it's really hard to figure out. You kind of have to go find other ways. Functions are really discoverable.

The other thing is functions work great programmatically. So if you want to, say, change how concat works, so this is actually concat underscore WS, which is how DuckDB says "with separator". You can modify the behavior of a function. So in Python, this is a partial. This is saying, like, the separator's always a dash. And now we have a dash concat, and that's all it does. It's dumb, but it works.
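As a plain-Python sketch of that idea (this `concat_ws` is a stand-in for DuckDB's function of the same name, not an import from DuckOps):

```python
from functools import partial

# stand-in for DuckDB's concat_ws(separator, ...): join parts with a separator
def concat_ws(sep, *parts):
    return sep.join(parts)

# partial fixes the first argument, so the separator is always a dash
dash_concat = partial(concat_ws, "-")
dash_concat("a", "b", "c")  # "a-b-c"
```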

We'll get into methods where it doesn't work in a little bit. So the thing is you can swap in alternatives. So here, if we don't want concat, we can just import a new concat from a new package, and suddenly we have another version.

So just to recap, you can really easily tinker across things. It's easy to discover functions, to program them, and to swap in alternatives.

In contrast, methods, the everything bagel pattern, are a bit tricky to work with. So we've put everything on the data, and it's nice. You can look it up via a dot, but there are some downsides. One is that you end up with a lot of docs. People tend to reproduce methods. The other is that if you wanted to put a concat, say, on a pandas Series, you would actually have to do something really weird and kind of violent to it. You would have to mutate it. So you would just take the Series class, and you would actually just kind of shove concat in there.
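What that "shoving" looks like in practice; this is a sketch of monkey-patching, shown to make the point, not something to actually do:

```python
import pandas as pd

# adding a "method" to a class you don't own means mutating the class itself
def concat(self, other):
    return self.str.cat(other)

pd.Series.concat = concat  # monkey-patching: it works, but it's fragile

s = pd.Series(["a", "b"])
result = s.concat(pd.Series(["c", "d"]))  # "ac", "bd"
```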

I don't think anyone but the owner of a class can really put methods on it in a normal way. The other thing is you can't change the first argument. A method is like a function that loves its first argument. So with a pandas Series, isn't it weird? With a function, you can concat the string "abc" to something, but with a method on a pandas Series, you actually can't do that, because the Series is always the first thing. So it's surprisingly hard.

The last is you can't really program. So we saw you could partial the concat function. This isn't really easy or fun or desirable with methods. You just sort of can't do it.

A lot of people will say, method apologists will say, what about method chains? They're super convenient. You can take this Dagwood sandwich on the left, this deep nested set of function calls, and you can turn it into this nice chain. I agree. I think that's really important. siuba tries to work around this with function chains, just this pipe syntax. So you can pipe functions into each other. It's tough, because method chaining is a part of Python, so you get it in the language. But this is an attempt to kind of get the same thing with functions.
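The pipe idea can be sketched in a few lines of plain Python; this helper is illustrative, not siuba's actual implementation:

```python
from functools import reduce

# thread a value through a sequence of functions, left to right,
# approximating the readability of a method chain
def pipe(value, *funcs):
    return reduce(lambda acc, f: f(acc), funcs, value)

pipe("  Hello  ", str.strip, str.lower)  # "hello"
```

Compare `pipe(x, f, g, h)` with the nested `h(g(f(x)))`: same calls, but the pipe reads in the order the steps happen.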

Just to recap, methods are a little bit hard to discover. Everybody has their own docs for methods. A function just has one doc string. Alternatives, you need to actually mutate classes to add new methods onto them. And the last is method chains are a really handy thing. Hopefully pipe chains, this concept can get you pretty close.

Advanced DuckDB features and DuckOps

Just to go a bit deeper, and I'm going to go through these very fast: I think DuckDB, to summarize, has some really cool syntax and some really advanced SQL it can do. So here, that dash-arrow (`->`) is actually a lambda. There are all these interesting constructs they have. I think in DuckOps, a big challenge is going to be reproducing that. So here, we can run this list filter function, and we do have syntax for lambdas. And this is also used in the dplyr-style across. So a lot of the same stuff is reused.

Secondly, would you rather have stringr functions? That's actually pretty doable. With just functions, you could create alternative function sets. You could create a stringr, hopefully not called stringr-py. Or we could do it as a team.

So just to recap, DuckDB is incredible. It's fast. It can read all kinds of files. It has cutting-edge SQL syntax. DuckOps is this experiment in trying to pull the DuckDB functions out. And I really want to make the point, hopefully, that these functions are really useful. And as functions, there's a lot you can do with them. And last, siuba aims to be a dplyr port to Python, and hopefully can let you build up your DuckDB query and use it programmatically from Python.

So I want to thank everybody that helped: Krista Adams, who designed the hex sticker, which we'll make into a proper hex and not a squiggly hex, I swear it; Chris Gardillo; Anthony Baker; and tons of folks at Posit who have helped and supported this kind of work. So thanks, and go functions. Functions forever.

Q&A

So the Zen of Python says there should only be one way to do it. Would you say that the Python data ecosystem fails to live up to that philosophy? Yeah. I think that that's an interesting mantra. I mean, that's the funny thing about methods is that your first attempt at a method lives forever because it lives on your data. So I would almost flip it and say with methods, you only get one attempt.

But actually, if you look at the history of the tidyverse, packages like stringr and lubridate and all the failed versions of those things, it's really clear that R innovated through alternatives and decoupling of the functions that can act on data from the data itself. But I agree. It would be nice to have one way to do things, but sometimes you've got to like search around, you know?


Does siuba support polars and pandas? I mean, we're basically barnacling on DuckDB at this point. I think I would love to support polars expressions, actually. I think similar to extracting out DuckDB functions, it would be cool to just work directly on polars expressions, but I think it would involve a little bit of Rust scuba diving.

Is there a data world that's not currently supported by DuckOps that you'd like to see in there in the future? Oh, I mean, there's all the like geo functions and I mean, it seems like there's a ton of stuff in DuckDB. So this was really a first pass to see what it would look like and how it would feel to run. I almost feel like I'm really sold on the feel. I like the functions. And so now I'm really interested in exploring how do we really flesh that out and see what it looks like to just try to cover as much as possible if it's useful.

Thank you very much. Let's thank Michael and all of our other speakers again.