Resources

Emily Riederer - Python Rgonomics

Data science languages are increasingly interoperable with advances like Arrow, Quarto, and Posit Connect. But data scientists are not. Learning the basic syntax of a new language is easy, but relearning the ergonomics that help us be hyperproductive is hard. In this talk, I will explore the influential ergonomics of R's tidyverse. Next, I will recommend a curated stack that mirrors these ergonomics while also being genuinely Pythonic. In particular, we will explore packages (polars, seaborn objects, great tables), frameworks (Shiny, Quarto), dev tools (pyenv, ruff, and pdm), and IDEs (VS Code extensions). The audience should leave feeling inspired to try Python while benefiting from their current knowledge and expertise. Talk by Emily Riederer

Oct 31, 2024
18 min

image: thumbnail.jpg

Transcript

This transcript was generated automatically and may contain errors.

Hi, my name is Emily Riederer, and for years I've been coming to different data science conferences and telling stories about working in R. And inevitably at the end of every single session, someone would raise their hand and ask the standard question: that's great, but why didn't you use Python? And this happened to me so many times with so much regularity, I felt like I had some pretty good answers by the end. I'd talk about how the tidyverse is unparalleled for data wrangling, how R Markdown was just perfect for data science communication, and how R just had easier Shiny apps for non-developers.

But despite all those specifics I tried to spell out, the honest answer was I just loved that magical flow state feeling of working in R, where I knew exactly what I wanted to do, and it just worked. And I didn't feel like I could ever get that with what I thought was this stack that I had to use in Python of pandas, matplotlib, and Jupyter notebooks. It just didn't feel as organic to me. I felt like I had to make this choice between the workflow and the ergonomics that I loved in one language, and the tools that constituted a successful Python programmer in another.

But instead of seeing these two things as in conflict with each other, what I've realized in the past couple of years as the stack has evolved, is that actually, these are just the criteria for picking good Python tools, and two sides of the same coin. We can find tools that both have the similar workflow we love in R, with functional programming, and a really comfortable level of abstraction, and we can use tools that are well adopted by the Python community, have that same level of support and engagement and interoperability, with plenty of questions and support on Stack Overflow and GitHub, that make us competent Python programmers without giving up everything we know and love and have learned about R.

So, today, I wanted to take you through a tour of some of the tools in my ergonomic stack of working with Python, spanning aspects of that wrangling, communication, and developer tool cycle that I talked about and always used to claim was impossible. And to do this, in honor of this conference being hosted in Seattle, we'll use some daily weather and temperature data from Seattle.

Data wrangling with Polars

So, first, let's talk about data exploration, which is one area where many of us R users coming from a tidyverse background have really been spoiled with dplyr and ggplot2. While many think of pandas as the go-to data wrangling library, I've actually found that Polars checks many more of the boxes for me of an ergonomic tool. We have that similar workflow with a functional paradigm and this really expressive and expansive rich API, and it is also, on its own, a very impressive and successful standalone project. It's seen great adoption in the Python community, so much so that it's natively supported in Seaborn and scikit-learn, and there are those signals of the community really latching on to it.

And it's no wonder why. It has zero dependencies and is written in ultra fast, highly performant Rust code. Even at a surface level, I think it's easy to see some of the superficial similarities between dplyr and Polars, just in the way that it provides data frame in, data frame out methods with similar kinds of function names. And beyond those naming conventions, the feature parity is also very similar, not only to dplyr, but also to the broader tidyverse. So, we have nice functions to support things like string and date wrangling, restructuring and reshaping our data, and working iteratively.

But a lot of different tools can have feature parity without having that same feel, without having the ergonomics. So, what I want to do is go a little bit deeper. We can look at the actual official tidyverse design principles and see how well those stack up in Polars. We can think about how our dplyr tools are composable. We have these small atomic functions that allow us to express really complex tasks. They're also consistent. So, the API that we learn in one package and one function easily transports and expands our knowledge across other tools. And they're human centered. They always kind of know what we're going for and kind of help us do that super easily.

So, let's see how well these hold up in Polars. In dplyr, of course, we have our piping and those data frame in and data frame out methods that make things very composable. Really across Python with method chaining, you can achieve that similar sort of narrative flow where we can read our code top down in the exact order of execution. Now, in R, there's no reason that we couldn't keep that piping paradigm going within each function. But our code can get kind of long and lengthy, and we don't tend to do that. Instead, we tend to go back to that traditional nested function paradigm. But with Polars, because we have a lot of nice methods to operate on individual columns, we can actually expand that sense of composability and make our code just as readable from left to right as it is from top to bottom.

Similarly, in dplyr, I always value the consistency both for the purposes of reading and writing. All of the functions look sort of similar whether we're adding new columns or subsetting rows or columns. So, it's just easy both to read and to write. I never found the same to be as true about pandas where to do those same three operations, you might have to use three very different mindsets. So, using an anonymous function to alter a column. Then articulating how you want to subset your data with a string. And then just using bracket notation. It just didn't feel organic. It didn't feel like a language.

But again, Polars really brings back the same kind of consistency to the language as dplyr with a lot of just very readable code. And finally, the human centeredness. dplyr spoils us with so much syntactic sugar and I feel like Polars is very the same in this regard as well. Just one example I'll point to here is Polars implementing select helpers which lets us easily, much like in R, identify variables by a part of a variable name or a variable type and apply the same transformation to all of those variables at once instead of duplicating our code.

Similarly, Polars also knows that as data scientists, we work with data of very different sizes and in very different places. So, whether we're reading data in batch or streaming, in memory or out of memory, we can run roughly the same Python code and let Polars handle the details of optimizing that workflow for us.

Visualization with plotnine and Seaborn objects

But of course, data wrangling is only one half of exploration. We also have visualization, and here we actually now have two really great options. First, there's plotnine, which is a very high fidelity code clone of ggplot2 that is well supported by Posit. Sometimes I worry a little bit about packages that try to clone existing packages, that they may not be well supported, they may be hobby projects, but having that institutional support is huge. On the other hand, we have Seaborn objects, which lives inside Seaborn, one of the most popular Python packages for data visualization. This provides a novel, experimental object-oriented programming interface with clear ggplot2 inspirations, giving us an alternative way to bring the grammar of graphics into creating the underlying matplotlib figure specifications.

So, again, if we think about what makes ggplot2 special, we have that sense of a grammar mapping different components in our data to different aesthetic elements and doing that incrementally in layers, where we can control each layer as much as we want. It shouldn't be hard to convince you that we can achieve the same thing in plotnine, because this plotnine code, believe it or not, is actually Python code, even though it looks almost identical to the ggplot2 code that we know and love. But if we want, we can get a little more creative and give it a little bit of its own Pythonic flavor. Using Python's in-place addition operator (+=), we can still use all the same ggplot2 functions we know, but break our code up a little bit more without the method chaining, to make it easier to debug and experiment with.

Now, if we switch over to Seaborn objects, we will have slightly different syntax, but you can see largely the same design philosophy. We're still creating that plot object and incrementally adding to it and building from it, adding in geometries, altering our scales and our labels, all in that same sense of a grammar. And of course, packages that aren't clones can also be fun because they can bring new and different inspirations as well. One thing that can be hard in ggplot2, if you've ever tried, is to make facets that represent different variables in different facets. To do this, you have to kind of like manually reshape your data and almost try to trick ggplot2 into doing what you want. But Seaborn objects sort of meet this need by letting us specify multiple Y or X variables that we can put in different facets, which is a kind of fun feature.

And with any of these tools, you always worry: what if I reach the limit of this tool? Will I get kind of stuck and stranded if I'm using a tool that isn't the language standard? But both of these tools make it really easy to get at the underlying matplotlib figure. So, if we need to customize further, we can always roll up our sleeves and do so. We'll never get stranded.

Reproducible reporting with Quarto and great tables

So, now let's talk about reproducible reporting. In R, we've always been really spoiled with having R Markdown as a way to do it. And I used to always tell people, if Python ever had anything as good as R Markdown, I'd switch over. And don't say that if you don't mean it. Because then Posit goes and creates something like Quarto. And we've already seen today all of the great last-mile things Quarto can do and all the different formats it can produce. But I'm also interested in the first mile of Quarto and the way that it can really replace Jupyter notebooks.

Jupyter notebooks can be a really powerful tool for using Python, but they come with certain downsides. Their kind of rich text format can be harder to version control and peer review, because there's a lot underlying the actual code and its representation. And because of that, you can also run into issues with managing the kernels behind them and the state of the different code chunks between executions. Quarto, I feel, actually hits a really nice balance here by giving us a simple plain text file that we can work with and edit and save, while delivering a similarly interactive experience, not through the file itself, but through its integrations and extensions with various IDEs, be that Positron, VS Code, or RStudio.
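As a sketch, a minimal Quarto document is just plain text with YAML front matter and fenced code chunks, so it diffs cleanly in version control (the title and chunk contents here are made up for illustration):

````markdown
---
title: "Seattle Weather Report"
format: html
---

Prose is plain markdown, so this file version-controls like any text file.

```{python}
# code chunks execute when the document is rendered with `quarto render`
import datetime
print("Rendered on", datetime.date.today())
```
````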

So that's only the means of communication, but what about the substance of that communication? And for that, we have the great tables package, which you may have seen in the table session yesterday. And I would compare that to other Python equivalents, but as far as I can tell, there barely are any. Great tables is kind of unparalleled. It's built by the same team that built the gt package in R, but curiously, they didn't even attempt to clone the syntax, but rather to deliver the best experience for either language independently. And you can see why; I won't go into this code in great detail. We have that same sense of grammar, incrementally building up that table object piece by piece, just as we do on the plotting side.

Developer experience

So, if any of this has interested you at all, you're probably wondering, how do I get started? And that's where we get to talk about the developer experience. In R, we've always been really spoiled by tools like RStudio for everyday code writing, and usethis, which has helped us bridge the gap from user to R package developer in a very batteries-included way. But then when we move to Python, there are a lot more small little places that can trip us up or give us paper cuts in our work. Whether that be installing and managing many different versions of Python, then moving to installing and managing many different packages and making sure we have the right package version in the right place, then picking a dependency manager, of which there are 20,000 options. And finally, picking the development environment where we want to run all that code.

And I think some of the ergonomics that Python's really taught us to value in terms of those IDEs are being able to do things programmatically instead of navigating through complicated UIs. Being helpful and slightly opinionated to nudge us towards good practices without being overbearing. And just being unsurprising and not having weird and unexpected side effects. So, here are some tools I've learned to love in this regard for Python.

First up, pyenv, which provides a great streamlined way to install Python on your system. This gives us a structured way to install all of our different Python versions in a shared location on our computer, where both we and, more importantly, our computer can find them again. And if you want to switch what version of Python you're actively using, at either the global or the project-specific level, that can also be easily done on the command line.
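As a sketch of that command line workflow (the version numbers are just examples; substitute whatever versions you actually need):

```shell
# Install an interpreter into pyenv's shared location (~/.pyenv/versions)
pyenv install 3.12

# List every version pyenv knows about, marking the active one
pyenv versions

# Switch the default interpreter machine-wide...
pyenv global 3.12

# ...or just for the current project (writes a .python-version file)
pyenv local 3.11
```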

Next up, we need to manage the different versions of different packages that we're installing for use in different projects. While environment management is a great practice in any tool, it's especially important in Python, even more so than R, because PyPI, unlike CRAN, isn't forcing us to do a lot of reverse dependency checks when new versions of packages are released. So, we're more likely to have package versioning conflicts bubble up and immediately cause breaking changes in our code.

And there are many, many different tools for creating these virtual environments. But one that I've particularly come to love is PDM, because it does exactly what I want it to do, very simply and very easily, and truly absolutely nothing more. It creates a virtual environment at the project level and always ensures that we're installing the packages we want into that virtual environment instead of somewhere else. And when it does that, it logs both the dependencies that we explicitly added and the dependencies of those dependencies separately. This may seem like a kind of esoteric point, but it's actually really nice if you decide you didn't need a certain dependency, because you can very cleanly uninstall both that dependency and all the extra packages it added to your environment.

And I tend to try to avoid tools that are too far off the beaten path, but one reason that I get comfortable using PDM is that it's made itself very interoperable. If we want, in a single command, we can create that more common Python requirements.txt file to go ahead and reinstall and recreate that same environment with any range of other tools that your coworkers or other developers might want to use.
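A hedged sketch of that PDM workflow on the command line (`polars` is just an example dependency):

```shell
# Initialize a project: creates pyproject.toml and a project-level venv
pdm init

# Add a dependency: installed into the project venv, recorded as a direct
# dependency in pyproject.toml, with its transitive deps pinned in pdm.lock
pdm add polars

# Removing it also cleanly removes the transitive deps it pulled in
pdm remove polars

# Interoperability: emit a standard requirements.txt for other tools
pdm export -o requirements.txt --without-hashes
```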

IDEs and extensions

So, finally, let's talk about where we run code. VS Code is one kind of very common tool for developing in Python, which I'd really started to learn to love when Positron came along. And I think Positron is an exciting addition, largely because it comes with a little bit more batteries included and smooths the gap of getting started and getting all those languages to talk to one another. And it brings back a lot of the features we've learned we really love from RStudio, like that plot viewer and variable explorer, for a very data-native environment.

But the thing that's most rich about either of these tools is the ecosystem of extensions and command line tools that play nicely with them. And I think it can be easy to assume that some of these tools are kind of, oh, they're more for power users. But I truly think that's the wrong mentality. When we're starting a new language where we don't have that inherent sense of flow and intuition, some of these tools can help outsource what the back of our mind isn't giving us automatically and automate that to bring those intuitions to us.

So, for example, we can use cookiecutter to create project templates and learn from the many experienced Python developers who have thought systemically about how to set up a good project. Similarly, we can avoid the typos that newbies tend to make with tools like autodoc and pyindent, to automatically create the documentation strings we need and make sure we have the right white space in our code. And then some of my favorites are Error Lens and Ruff, which provide linters and stylers to help us identify both syntactic and stylistic issues in our code and elevate them right to the front of our attention so we know to fix them.

This is just one example of those packages at work. Again, as experienced developers, we might immediately know that some of these things are code smells or look a little funny, like importing a package halfway through our code, or importing a package with a typo in it. But when you're new to a language and you're juggling a lot of competing interests, you may not have the back of your head immediately flagging to you that those are issues. So, it's really nice to have your IDE doing it for you.

So, in conclusion, I'll never be that person sitting in the front of the room asking you why you didn't use Python. R was my first love. And R is also a lot easier to make puns with. Like ergonomics. But hopefully at the end of the day, if I've done my job here correctly, you may ask why don't I also check out some of these Python tools? So, thank you very much for coming. That really concludes my talk. I'll note I do have a few related blog posts on the issue. And if you have other Python tools that you find work particularly well in your workflows, I would love to hear about them. Thank you.
