Resources

Michael Chow | Bringing the Tidyverse to Python with Siuba | RStudio

Last January I left my job to spend a year developing siuba, a Python port of dplyr. At its core, this decision was driven by a decade of watching Python and R users produce similar analyses, but in very different ways. In this talk, I'll discuss 3 ways siuba enables R users to transfer their hard-earned programming knowledge to Python: (1) leveraging the power of dplyr syntax, (2) options to generate SQL code, and (3) working with the plotnine plotting library. Looking back, I'll consider two critical pieces that have helped me develop siuba: using it to livecode TidyTuesday analyses, and building an interactive tutorial for absolute beginners. About Michael: Michael Chow is a data scientist and learning researcher. He serves as a co-director at Code for Philly. In past lives, he worked on adaptive assessment tools in ed tech, and received a PhD in cognitive psychology from Princeton University.


Transcript

This transcript was generated automatically and may contain errors.

Hi, my name is Michael Chow, and thanks for watching this RStudioConf talk. I'll be going through Siuba, which is a port of dplyr to Python. Since you're watching RStudioConf, I'm guessing that you're on board with R and dplyr for data analysis. And I've got to say I love R, but I also love Python.

And depending on the project or task, I've often found myself having to switch between R and Python, or sort of juggle both at the same time. And I'm guessing this isn't a super unusual experience, since Python's just incredibly popular. If you look at Stack Overflow posts, the predominant tool in Python for data analysis, pandas, accounts for about 3% of posts a month. That's just an incredible amount.

And so a lot of my interactions have been with people who use pandas for data analysis. But as time has gone on, I've found myself reaching more and more for dplyr, and I've really thought a lot about that and tried to figure out what's going on there.

The pandas vs dplyr analogy

And to kind of set the stage, I thought I'd just use a dumb analogy, that pandas is a lot maybe like a double-decker bus. So this is in Hong Kong, and this thing's really built for carrying capacity. If you have to ship like 80 humans somewhere, this is your tool. You stack them on top of each other, you're golden. The challenge is that it's kind of cumbersome. It can't go everywhere. It's probably scary to back up.

And in contrast, in Hong Kong, there's another vehicle, the minibus, or siu ba, that's the opposite approach. So tiny capacity, holds like 16 people, and it's just a terror on wheels. So the BBC describes it as Hong Kong's wildest ride. And these things are just fast, they can go anywhere, and they can sort of get you there. They're super flexible.

So that really reminds me of dplyr: dplyr is incredible for exploratory analysis. It can just get you to where you want to go quickly. Pandas is great at computation, but can be a little bit cumbersome, sort of on the fly.

So siuba aims to be small but mighty by leveraging dplyr-like syntax in Python, but doing computation behind the scenes in pandas or SQL. And I've tried to live code with siuba just to sort of battle test it and make sure it's ready for the big time.

What makes dplyr powerful

I think a big question is what dplyr is really doing that's so useful. And to really understand, we have to go back to Hadley's 2014 talk where he introduced dplyr. And he mentions that analysts sort of have two bottlenecks, a cognitive bottleneck and a computational bottleneck. The computational one's the one we think of most often, which is as the data becomes bigger, it takes longer to run the code.

But intriguingly, Hadley mentions he thinks a lot about the cognitive bottleneck, which is how we should think about the data and describe what we want to do to the data in code. And dplyr aims at this cognitive bottleneck to help people sort of focus their thoughts and to give people strategies for data analysis. It does this by kind of slimming down, taking this big space and slimming down the options that people have. So there are five simple verbs. And all of these critically can combine with an operator, group_by. All of these take a data frame and return a data frame. And he mentions that overall, it's a very constrained design, especially compared to his previous tool, plyr.


So Siuba aims to sort of let people capture those dplyr-like thoughts and keep them while writing dplyr syntax in Python and then doing the computation in Pandas. So the gist is Siuba lets you transfer your thoughts from R to Python.

Three roadblocks when moving from dplyr to pandas

All right, so I'm going to try to show three motivating cases where it can be tricky to transfer dplyr thoughts to pandas. Three roadblocks that people hit. And I know it's tempting to see dplyr and pandas as pretty similar. So in this example, they're doing roughly the same thing. And they're using basically the same code. So it's a group summarize. We're calculating the average of a column. And we're renaming the result avg_hp. And we get roughly the same output.
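As a sketch of that side-by-side group summarize (the small data frame here is a hypothetical stand-in for the mtcars data the talk's examples appear to use; the dplyr version is shown in a comment):

```python
import pandas as pd

# A small stand-in for the mtcars data from the talk's example.
mtcars = pd.DataFrame({
    "cyl": [4, 4, 6, 6, 8],
    "hp":  [93, 62, 110, 105, 245],
})

# In dplyr this would be:
#   mtcars %>% group_by(cyl) %>% summarize(avg_hp = mean(hp))
# The pandas equivalent: group, aggregate, and name the result avg_hp.
avg_hp = (
    mtcars
    .groupby("cyl")
    .agg(avg_hp=("hp", "mean"))
    .reset_index()
)
print(avg_hp)
```

Both versions produce one row per cylinder group with the mean horsepower.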

But it's also worth noting, if you take this somewhere like Twitter or look around, you'll see some pretty scathing reviews. So dplyr has forever ruined Pandas for me. Or Chelsea notes, every picture shows Pandas in chaos. And I think it's worth noting these people, I think they like Pandas. So this isn't a straight critique of Pandas. But it can be really frustrating when you try to go from one tool and the way it structures thinking to another.

And so I would say these three cases try to bring that to the surface, what's going on. The first roadblock is select. So select is really a very basic action. You can choose columns. You can drop columns. You might match certain columns or rename a column. And the trick is that Pandas uses different names for all these actions. So there are four different methods, one for each action.

The other thing is that sometimes they have kind of funky arguments you need to pass. So you might have to say axis equals one to say, oh, I want to do this to the columns. Or you might have to say columns equals. Those are kind of different ways of saying the same thing. The last is that sometimes you have to use sort of intense programming constructs, like a lambda function, if you want to do things like match on a string. In contrast, dplyr uses the same verb for all these actions, select. And it's really easy to change between what you're doing, to go from keeping a column to dropping a column, or even to combine these actions together.

And so this is a case, I think, where changing what you do requires changing gears, shifting between these different functions and methods in pandas. The second example is groupby. And grouping is really a critical activity in data analysis in dplyr.

And the pandas docs describe some translations from R to pandas. And it looks pretty straightforward, at least for these cases. But actually, I would say groupby in dplyr and pandas use radically different underlying grammars. So in this example, on the left, I'm showing a filter in pandas or a mutate. And note there are multiple ways of doing it. So you can use query, that's the top left, or eval on the top right. An alternative way to filter on the bottom left is square brackets, or to mutate, assign on the bottom right.

Critically, we can ask, what happens when we try to groupby before these operations? And the answer is that none of them work. And the trick is that grouped data in pandas, a DataFrameGroupBy, actually doesn't have any of these methods on it to perform these operations, except in one case where the operation just sort of fails. And this is probably really surprising, I think, for dplyr users. The trick ends up being you have to shift the groupby from outside of the verb to inside it. So you basically have to move groupby into your column operations. And again, this isn't necessarily bad, and it means pandas can still do the same computations as dplyr. It's just, I think, surprising for dplyr users, and for me challenging to do when I'm trying to analyze data quickly.
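A sketch of what moving the groupby inside the column operations looks like in pandas, using the same hypothetical data; .transform() broadcasts each group's statistic back to the original rows:

```python
import pandas as pd

df = pd.DataFrame({"cyl": [4, 4, 6, 6], "hp": [93, 62, 110, 105]})

# df.groupby("cyl") returns a DataFrameGroupBy, which has no .assign()
# or .query() — so the grouping moves *inside* the column operations:

# Grouped mutate: demean hp within each cyl group.
mutated = df.assign(
    demeaned=df["hp"] - df.groupby("cyl")["hp"].transform("mean")
)

# Grouped filter: keep rows above their group's mean hp.
filtered = df[df["hp"] > df.groupby("cyl")["hp"].transform("mean")]
```

The computation is the same as a grouped mutate or grouped filter in dplyr; it's just expressed inside the expressions rather than as a wrapper around the verb.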

The third roadblock, I'd say, is SQL. So dplyr users, if you're like me, working in industry, I lean on dplyr and dbplyr being able to generate SQL heavily. So you can just swap out your data source to SQL, and dplyr will generate a SQL query for you. In contrast, if you're using Pandas, you just need to switch to writing SQL. And that's what I see happen quite a bit in industry is once it hits the database, people just switch and start coding SQL. So your computation changes, and how you code ends up changing as a result.

So to recap, dplyr is really powerful. It uses one verb, select, for all the cases for grabbing columns. For groupby, you have this sort of really nice expressive syntax where you can modify the full table action. So you can have a grouped mutate or a grouped filter or group summarize. In contrast, Pandas is computationally powerful, and it's packed with options. But that means that you might have to do a little bit more work up front during the analysis. And this makes it great for engineering, but maybe a big challenge, I think, for exploratory data analysis or quick analyses.

How Siuba addresses these roadblocks

All right, so we looked at some core challenges in translating dplyr thoughts about data analysis to pandas. Now I'm going to switch gears and look at how Siuba can help you basically preserve those thoughts and code them in Python. And it aims to do this as faithfully to dplyr as possible. So I'm going to show you an example of going from dplyr code to Siuba.

I apologize. I know I start with these parentheses and put the pipe at the beginning of the line, and some people hate that. Sorry, my bad. So let's go ahead and just start the switch. First, we'll change our imports. So rather than using library, we'll do our Python imports. Then we'll change the pipe to be greater than, greater than (>>). Next, we'll put an underscore dot in front of the variable names. And this sort of has to happen in order for it to be Python syntax. And then the last thing is a little bit tricky. We're going to take this mean function call, and we're going to change it to a method so it's more similar to how pandas expresses operations.

All right, and so in a sense, with these few simple changes, we've gone from dplyr in R to Siuba in Python. So in looking side by side, hopefully my goal is that you can kind of like squint your eyes and they just look like the same code, right, and you can just figure out how to swap between them.

So now I want to go through the three examples that I showed before and just give you a sense for what they look like in Siuba. So the first roadblock was select, and now looking back at this example with Siuba on the left, it should be basically the same thing. Now you just have underscore dot before the variable names. It's worth noting, too, the very bottom one where you select certain columns is really easy to do, and it matches up with the pandas code. So in pandas, you can do .str.endswith(), and it will return true whenever a column matches that, and Siuba just lines up with that way of doing things, so it corresponds to the pandas way.

The second example is groupby. So we showed a few different ways of doing filter and mutate. So the filter and mutate now for Pandas are on the right, and let's show how Siuba does it. So here's the Siuba filter and mutate on the left. Notice that they're eerily similar, hopefully, to dplyr, and we can just tuck in the groupby above the filter and mutate to make it a grouped filter or a grouped mutate. So it's meant to be easy to just swap in and out.

The last example was SQL. So I think this is kind of the biggest, most useful part of Siuba: you can swap out your data source from a pandas DataFrame to, essentially, a SQLAlchemy connection, and Siuba will generate the SQL query, or run the query and return a table of data, just like dplyr. And right now it's mostly Postgres and Redshift that I've worked on supporting, but it can be extended to more backends. So there's early support for MySQL and SQLite, and I'm hoping to build those out further.

I think one of the incredible things is that you get ggplot for free, and that's not by any of my work, but by the work of Hassan Kibirige, who built a package called plotnine, which is an incredibly faithful port of ggplot. So ideally you can just do the full data transformation, data visualization, and you can carry over your hard-earned skills from R to Python.

So just to recap, Siuba tries to use a very similar syntax to dplyr. There are some cases where it needed to be tweaked a little bit to be Pythonic or to just be Python syntax. And incredibly, it aims to bring you SQL and ggplot. And I found that these things really helped me get back up to speed in data analysis, and have been fun just to carry into Python and to discuss with people.

Why Siuba is worth it long term

All right, so we talked about three roadblocks you might face when translating dplyr code and thoughts to pandas, and how Siuba lets you sort of roll past those by essentially copying and pasting dplyr into Python. Now I want to zoom out a little bit and ask, why is Siuba worth it in the long term? Why should you try out or adopt Siuba?

And the first thing I want to harken back to is Hadley's point in 2014 about cognition and computation, that dplyr as a cognitive tool has, if you've used it over the long term, probably really helped you build skills to ask important questions of your data. And those skills aren't even maybe that related to programming. So why not just bring those skills with you into Python?


The next point is Siuba uses dplyr's architecture, and this lets it very flexibly add new backends. So whether you run against SQL or pandas, Siuba can support it. And I'm hoping to extend support to Python-specific tools in the future, like Dask, and tools like Spark, as well as fleshing out MySQL support.

The next thing is that Siuba runs just an enormous glut of continuous integration tests. So it's incredibly thoroughly tested. I would say it's like paranoid about ensuring that you get the same result back, whether you're running on SQL or pandas. And every time I push code, it runs thousands of tests.

The last thing is developer docs. So I've tried to leave a nice trail of breadcrumbs. So if you're curious about the internal workings of Siuba or looking to patch or extend it, there are just enormous resources to do that. I'd suggest the programming guide in the Siuba docs, which goes through all of Siuba's parts. Or I have something called architectural decision records in GitHub that document key decisions I made, why they were made, and contain sketches of those decisions.

Getting started with Siuba

The last thing is if you're interested in learning Siuba, there are some nice alleys you can go down. So the first thing I'd recommend is Dave Robinson's TidyTuesday screencasts. These are actually in R, so maybe not as related, but they were an incredible resource when developing Siuba and actually a big motivation to work on it. I think being able to see a person move quickly through data analyses in a holistic setting is really important. And TidyTuesday is a project that releases new data every week, lets you sort of see things in action.

The other thing is there's an interactive tutorial for Siuba on learn.siuba.org. So if you're curious to just get started, even if you've never coded before, the tutorial's made to make it easy to get started. It's something I've tested on my family and friends, and I'm really excited for the opportunity for Siuba to make it really easy for learners to take their first steps into coding and data analysis.

The last thing is that I've tried to put up live analyses on YouTube of analyzing data for an hour, whether it's translating Dave's analyses from R into Siuba in Python, or doing TidyTuesday data analyses for an hour. So I highly recommend watching those if you want to see Siuba in action, and then trying Siuba out on TidyTuesday. Just take it for a spin and see what it looks like and how it compares, say, to using R or another Python tool.

So just to recap, you can find Siuba on GitHub at machow/siuba. You can pip install it. I can't recommend TidyTuesday highly enough. It'll let you get a feel for what data analysis with Siuba looks like. And there's learn.siuba.org. If you've never coded, I've tried to design this for you, and I'd love to see people take their first steps into data analysis through this course.

So thanks to everybody who helped contribute to Siuba. Thanks to RStudio for putting this together. And thanks to the army of people who gave feedback on this talk and made it much, much better than its first version. So thanks for watching. I hope you'll try Siuba out. And if you have any questions, please feel free to reach out to me on GitHub or on Twitter at chowthedoc. Thanks.