Resources

Michael Chow | Bringing the Tidyverse to Python with Siuba | RStudio

Last January I left my job to spend a year developing siuba, a Python port of dplyr. At its core, this decision was driven by a decade of watching Python and R users produce similar analyses, but in very different ways. In this talk, I'll discuss 3 ways siuba enables R users to transfer their hard-earned programming knowledge to Python: (1) leveraging the power of dplyr syntax, (2) options to generate SQL code, and (3) working with the plotnine plotting library. Looking back, I'll consider two critical pieces that have helped me develop siuba: using it to livecode TidyTuesday analyses, and building an interactive tutorial for absolute beginners. About Michael: Michael Chow is a data scientist and learning researcher. He serves as a co-director at Code for Philly. In past lives, he worked on adaptive assessment tools in ed tech, and received a PhD in cognitive psychology from Princeton University.


Transcript

This transcript was generated automatically and may contain errors.

Hi, my name is Michael Chow, and thanks for watching this RStudioConf talk. I'll be going through Siuba, which is a port of dplyr to Python. Since you're watching RStudioConf, I'm guessing that you're on board with R and dplyr for data analysis. And I've got to say I love R, but I also love Python.

And depending on the project or task, I've often found myself having to switch between R and Python, or sort of juggle both at the same time. And I'm guessing this isn't a super unusual experience, since Python's just incredibly popular. If you look at Stack Overflow posts, the predominant tool in Python for data analysis, pandas, accounts for about 3% of posts a month. That's just an incredible amount.

And so a lot of my interactions have been with people who use pandas for data analysis. But as time has gone on, I've found myself reaching more and more for dplyr, and I've really thought a lot about that and tried to figure out what's going on there.

The pandas vs dplyr analogy

And to kind of set the stage, I thought I'd just use a dumb analogy, that pandas is a lot maybe like a double-decker bus. So this is in Hong Kong, and this thing's really built for carrying capacity. If you have to ship like 80 humans somewhere, this is your tool. You stack them on top of each other, you're golden. The challenge is that it's kind of cumbersome. It can't go everywhere. It's probably scary to back up.

And in contrast, in Hong Kong, there's another vehicle, the minibus, or siu ba, that's the opposite approach. So tiny capacity, holds like 16 people, and it's just a terror on wheels. So the BBC describes it as Hong Kong's wildest ride. And these things are just fast, they can go anywhere, and they can sort of get you there. They're super flexible.

So that really reminds me of dplyr: dplyr is incredible for exploratory analysis. It can just get you to where you want to go quickly. Pandas is great at computation, but can be a little bit cumbersome, sort of on the fly.

So siuba aims to be small but mighty by leveraging dplyr-like syntax in Python, but doing computation behind the scenes in pandas or SQL. And I've tried to live code with siuba just to sort of battle test it and make sure it's ready for the big time.

What makes dplyr powerful

I think a big question is what dplyr is really doing that's so useful. And to really understand, we have to go back to Hadley's 2014 talk where he introduced dplyr. And he mentions that analysts sort of have two bottlenecks, a cognitive bottleneck and a computational bottleneck. The computational one's the one we think of most often, which is as the data becomes bigger, it takes longer to run the code.

But intriguingly, Hadley mentions he thinks a lot about the cognitive bottleneck, which is how we should think about the data and describe what we want to do to the data in code. And dplyr aims at this cognitive bottleneck to help people sort of focus their thoughts and to give people strategies for data analysis. It does this by kind of slimming down, taking this big space and slimming down the options that people have. So there are five simple verbs. And all of these critically can combine with an operator, group_by. All of these take a data frame and return a data frame. And he mentions that overall, it's a very constrained design, especially compared to his previous tool, plyr.


So Siuba aims to sort of let people capture those dplyr-like thoughts and keep them while writing dplyr syntax in Python and then doing the computation in Pandas. So the gist is Siuba lets you transfer your thoughts from R to Python.

Three roadblocks when moving from dplyr to pandas

All right, so I'm going to try to show three motivating cases where it can be tricky to transfer dplyr thoughts to pandas. Three roadblocks that people hit. And I know it's tempting to see dplyr and pandas as pretty similar. So in this example, they're doing roughly the same thing. And they're using basically the same code. So it's a group summarize. We're calculating the average of a column. And we're renaming the result avg_hp. And we get roughly the same output.
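As a sketch of that side-by-side group summarize (the small data frame here is a hypothetical stand-in for the mtcars data the talk's examples appear to use; the dplyr version is shown in a comment):

```python
import pandas as pd

# A small stand-in for the mtcars data from the talk's example.
mtcars = pd.DataFrame({
    "cyl": [4, 4, 6, 6, 8],
    "hp":  [93, 62, 110, 105, 245],
})

# In dplyr this would be:
#   mtcars %>% group_by(cyl) %>% summarize(avg_hp = mean(hp))
# The pandas equivalent: group, aggregate, and name the result avg_hp.
avg_hp = (
    mtcars
    .groupby("cyl")
    .agg(avg_hp=("hp", "mean"))
    .reset_index()
)
print(avg_hp)
```

Both versions produce one row per cylinder group with the mean horsepower.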

But it's also worth noting, if you take this somewhere like Twitter or look around, you'll see some pretty scathing reviews. So dplyr has forever ruined Pandas for me. Or Chelsea notes, every picture shows Pandas in chaos. And I think it's worth noting these people, I think they like Pandas. So this isn't a straight critique of Pandas. But it can be really frustrating when you try to go from one tool and the way it structures thinking to another.

And so I would say these three cases try to bring that to the surface, what's going on. The first roadblock is select. So select is really a very basic action. You can choose columns. You can drop columns. You might match certain columns or rename a column. And the trick is that Pandas uses different names for all these actions. So there are four different methods, one for each action.

The other thing is that sometimes they have kind of funky arguments you need to pass. So you might have to say axis equals one to say, oh, I want to do this to the columns. Or you might have to say columns equals. Those are kind of different ways of saying the same thing. The last is that sometimes you have to use sort of intense programming constructs, like a lambda function, if you want to do things like match on a string. In contrast, dplyr uses the same verb for all these actions, select. And it's really easy to change between what you're doing, to go from keeping a column to dropping a column, or even to combine these actions together.

And so this is a case, I think, where changing what you do requires changing gears, shifting between these different functions and methods in pandas. The second example is groupby. And grouping is really a critical activity in data analysis in dplyr.

And the pandas docs describe some translations from R to pandas. And it looks pretty straightforward, at least for these cases. But actually, I would say groupby in dplyr and pandas use radically different underlying grammars. So in this example, on the left, I'm showing a filter in pandas or a mutate. And note there are multiple ways of doing it. So you can use query, that's the top left, or eval on the top right. An alternative way to filter on the bottom left is square brackets, or to mutate, assign on the bottom right.

Critically, we can ask, what happens when we try to groupby before these operations? And the answer is that none of them work. And the trick is that grouped data in pandas, a DataFrameGroupBy, actually doesn't have any of these methods on it to perform these operations, except in one case where the operation just sort of fails. And this is probably really surprising, I think, for dplyr users. The trick ends up being you have to shift the groupby from outside of the verb to inside it. So you basically have to move groupby into your column operations. And again, this isn't necessarily bad, and it means pandas can still do the same computations as dplyr. It's just, I think, surprising for dplyr users, and for me challenging to do when I'm trying to analyze data quickly.
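A sketch of what moving the groupby inside the column operations looks like in pandas, using the same hypothetical data; .transform() broadcasts each group's statistic back to the original rows:

```python
import pandas as pd

df = pd.DataFrame({"cyl": [4, 4, 6, 6], "hp": [93, 62, 110, 105]})

# df.groupby("cyl") returns a DataFrameGroupBy, which has no .assign()
# or .query() — so the grouping moves *inside* the column operations:

# Grouped mutate: demean hp within each cyl group.
mutated = df.assign(
    demeaned=df["hp"] - df.groupby("cyl")["hp"].transform("mean")
)

# Grouped filter: keep rows above their group's mean hp.
filtered = df[df["hp"] > df.groupby("cyl")["hp"].transform("mean")]
```

The computation is the same as a grouped mutate or grouped filter in dplyr; it's just expressed inside the expressions rather than as a wrapper around the verb.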

The third roadblock, I'd say, is SQL. So dplyr users, if you're like me, working in industry, I lean on dplyr and dbplyr being able to generate SQL heavily. So you can just swap out your data source to SQL, and dplyr will generate a SQL query for you. In contrast, if you're using Pandas, you just need to switch to writing SQL. And that's what I see happen quite a bit in industry is once it hits the database, people just switch and start coding SQL. So your computation changes, and how you code ends up changing as a result.

So to recap, dplyr is really powerful. It uses one verb, select, for all the cases for grabbing columns. For groupby, you have this sort of really nice expressive syntax where you can modify the full table action. So you can have a grouped mutate or a grouped filter or group summarize. In contrast, Pandas is computationally powerful, and it's packed with options. But that means that you might have to do a little bit more work up front during the analysis. And this makes it great for engineering, but maybe a big challenge, I think, for exploratory data analysis or quick analyses.

How Siuba addresses these roadblocks

All right, so we looked at some core challenges in translating dplyr thoughts about data analysis to pandas. Now I'm going to switch gears and look at how Siuba can help you basically preserve those thoughts and code them in Python. And it aims to do this as faithfully to dplyr as possible. So I'm going to show you an example of going from dplyr code to Siuba.

I apologize. I know I start with these parentheses and put the pipe at the beginning of the line, and some people hate that. Sorry, my bad. So let's go ahead and just start the switch. First, we'll change our imports. So rather than using library, we'll do our Python imports. Then we'll change the pipe to be greater than, greater than (>>). Next, we'll put an underscore dot in front of the variable names. And this sort of has to happen in order for it to be Python syntax. And then the last thing is a little bit tricky. We're going to take this mean function call, and we're going to change it to a method so it's more similar to how pandas expresses operations.

All right, and so in a sense, with these few simple changes, we've gone from dplyr in R to Siuba in Python. So in looking side by side, hopefully my goal is that you can kind of like squint your eyes and they just look like the same code, right, and you can just figure out how to swap between them.

So now I want to go through the three examples that I showed before and just give you a sense for what they look like in Siuba. So the first roadblock was select, and now looking back at this example with Siuba on the left, it should be basically the same thing. Now you just have underscore dot before the variable names. It's worth noting, too, the very bottom one where you select certain columns is really easy to do, and it matches up with the pandas code. So in pandas, you can do .str.endswith(), and it will return true whenever a column matches that, and Siuba just lines up with that way of doing things, so it corresponds to the pandas way.

The second example is groupby. So we showed a few different ways of doing filter and mutate. So the filter and mutate now for Pandas are on the right, and let's show how Siuba does it. So here's the Siuba filter and mutate on the left. Notice that they're eerily similar, hopefully, to dplyr, and we can just tuck in the groupby above the filter and mutate to make it a grouped filter or a grouped mutate. So it's meant to be easy to just swap in and out.

The last example was SQL. So I think this is kind of the biggest, most useful part of Siuba: you can swap out your data source from a pandas DataFrame to, essentially, a SQLAlchemy connection, and Siuba will generate the SQL query, or run the query and return a table of data, just like dplyr. And right now it's mostly Postgres and Redshift that I've worked on supporting, but it can be extended to more backends. So there's early support for MySQL and SQLite, and I'm hoping to build those out further.

I think one of the incredible things is that you get ggplot for free, and that's not by any of my work, but by the work of Hassan Kibirige, who built a package called plotnine, which is an incredibly faithful port of ggplot. So ideally you can just do the full data transformation, data visualization, and you can carry over your hard-earned skills from R to Python.

So just to recap, Siuba tries to use a very similar syntax to dplyr. There are some cases where it needed to be tweaked a little bit to be Pythonic or to just be Python syntax. And incredibly, it aims to bring you SQL and ggplot. And I found that these things really helped me get back up to speed in data analysis, and have been fun just to carry into Python and to discuss with people.

Why Siuba is worth it long term

All right, so we talked about three roadblocks you might face when translating dplyr code and thoughts to pandas, and how Siuba lets you sort of roll past those by essentially copying and pasting dplyr into Python. Now I want to zoom out a little bit and ask, why is Siuba worth it in the long term? Why should you try out or adopt Siuba?

And the first thing I want to harken back to is Hadley's point in 2014 about cognition and computation, that dplyr as a cognitive tool has, if you've used it over the long term, probably really helped you build skills to ask important questions of your data. And those skills aren't even maybe that related to programming. So why not just bring those skills with you into Python?


The next point is Siuba uses dplyr's architecture, and this lets it very flexibly add new backends. So whether you run against SQL or pandas, Siuba can support it. And I'm hoping to extend support to Python-specific tools in the future, like Dask, and tools like Spark, as well as fleshing out MySQL support.

The next thing is that Siuba runs just an enormous glut of continuous integration tests. So it's incredibly thoroughly tested. I would say it's like paranoid about ensuring that you get the same result back, whether you're running on SQL or pandas. And every time I push code, it runs thousands of tests.

The last thing is developer docs. So I've tried to leave a nice trail of breadcrumbs. So if you're curious about the internal workings of Siuba or looking to patch or extend it, there are just enormous resources to do that. I'd suggest the programming guide in the Siuba docs, which goes through all of Siuba's parts. Or I have something called architectural decision records in GitHub that document key decisions I made, why they were made, and contain sketches of those decisions.

Getting started with Siuba

The last thing is if you're interested in learning Siuba, there are some nice alleys you can go down. So the first thing I'd recommend is Dave Robinson's TidyTuesday screencasts. These are actually in R, so maybe not as related, but they were an incredible resource when developing Siuba and actually a big motivation to work on it. I think being able to see a person move quickly through data analyses in a holistic setting is really important. And TidyTuesday is a project that releases new data every week, lets you sort of see things in action.

The other thing is there's an interactive tutorial for Siuba on learn.siuba.org. So if you're curious to just get started, even if you've never coded before, the tutorial's made to make it easy to get started. It's something I've tested on my family and friends, and I'm really excited for the opportunity for Siuba to make it really easy for learners to take their first steps into coding and data analysis.

The last thing is that I've tried to put up live analyses on YouTube of analyzing data for an hour, whether it's translating Dave's analyses from R into Siuba in Python, or doing TidyTuesday data analyses for an hour. So I highly recommend watching those if you want to see Siuba in action, and then trying Siuba out on TidyTuesday. Just take it for a spin and see what it looks like and how it compares, say, to using R or another Python tool.

So just to recap, you can find Siuba on GitHub at machow/siuba. You can pip install it. I can't recommend TidyTuesday highly enough. It'll let you get a feel for what data analysis with Siuba looks like. And there's learn.siuba.org. If you've never coded, I've tried to design this for you, and I'd love to see people take their first steps into data analysis through this course.

So thanks to everybody who helped contribute to Siuba. Thanks to RStudio for putting this together. And thanks to the army of people who gave feedback on this talk and made it much, much better than its first version. So thanks for watching. I hope you'll try Siuba out. And if you have any questions, please feel free to reach out to me on GitHub or on Twitter at chowthedoc. Thanks.