Resources

Polars, pandas, and Narwhals, oh my! | Marco Gorelli | Data Science Hangout

Welcome back to the Data Science Hangout! This week, we're joined by Marco Gorelli, a senior software engineer at Quansight Labs, a core developer of pandas and Polars, and the author of Narwhals. Marco shares his insights on the evolving landscape of data frame libraries in Python, focusing on Polars, its advantages over pandas, and the role of Narwhals in creating data-frame-agnostic tools. He discusses his journey into open source, highlighting the importance of finding your niche and contributing to areas that others might find "boring" but that are essential for a project's success. He also emphasizes the significance of fostering welcoming and inclusive open source communities.

This episode explores several key topics:

Polars: We learn about Polars' key features like lazy execution, its performance benefits, and the power of its expressions API. Marco explains how lazy execution can lead to significant performance gains and answers questions about its relevance even when working with smaller datasets.

Narwhals: We discover how Narwhals enables developers to write tools that work seamlessly with various data frame libraries, promoting interoperability and simplifying development workflows.

Open source challenges: Marco addresses the challenges of maintaining work-life balance while being deeply involved in open source projects. He offers practical advice on prioritization and managing the constant influx of tasks and requests.
Resources mentioned in the chat:

Pandas: https://pandas.pydata.org/
Polars: https://pola.rs/
Narwhals: https://narwhals-dev.github.io/narwhals/
Awesome Polars: https://github.com/ddotta/awesome-polars
Ibis: https://ibis-project.org/
Polars Plugins Tutorial: https://marcogorelli.github.io/polars-plugins-tutorial/
Understanding Polars Expressions: https://www.youtube.com/watch?v=E7cHgN9rd9c
Great Tables: https://posit-dev.github.io/great-tables/articles/intro.html
Great Tables Blog - Polars Styling: https://posit-dev.github.io/great-tables/blog/polars-styling/
Great Tables Blog - BYODF: https://posit-dev.github.io/great-tables/blog/bring-your-own-df/
Ruff: https://astral.sh/ruff
UV: https://docs.astral.sh/uv/
R/Pharma 2024: https://rinpharma.com/
PyladiesCon: https://conference.pyladies.com/
Beeminder: https://www.beeminder.com/
Todoist: https://todoist.com/
Nyctography: https://en.wikipedia.org/wiki/Nyctography

This episode offers valuable insights for anyone working with data in Python, particularly those interested in exploring the benefits of Polars and the power of Narwhals for building data-frame-agnostic tools.

► Subscribe to Our Channel Here: https://bit.ly/2TzgcOu

Follow Us Here:
Website: https://www.posit.co
LinkedIn: https://www.linkedin.com/company/posit-software

To join future data science hangouts, add to your calendar here: https://pos.it/dsh (All are welcome! We'd love to see you!)

Thanks for hanging out with us!

Nov 12, 2024
58 min

image: thumbnail.jpg

Transcript

This transcript was generated automatically and may contain errors.

Welcome back to the Data Science Hangout, everybody. If we haven't had the chance to meet before, and if this is your first Data Science Hangout, I'm Rachel. I lead Customer Marketing at Posit. Posit builds enterprise solutions and open source tools for people who do data science with R and Python. We're also the company formerly called RStudio.

Hi, everybody. I'm Libby. I help Rachel out with the Hangout community, and I'm also a Posit Academy mentor. So I mentor cohort-based groups who are learning R and Python to apply those to their everyday data jobs.

The Hangout is our open space to hear what's going on in the world of data across all different industries, to get to chat about data science leadership, and connect with others who are facing similar things as you. And we get together here every Thursday at the same time, same place.

But thank you so much to those who have helped make this the friendly and welcoming space that it is today. We're all dedicated to keeping it that way. If you ever have feedback about your experience that you'd like to share with me anonymously, good or bad, or maybe suggestions for topics to dive deeper on, I'm going to share a Google form in the chat with you right now, but you can always reach out to me directly on LinkedIn as well.

With all that, thanks again for joining us. I'm so excited to be joined by our co-host today, Marco Gorelli, Senior Software Engineer at Quansight Labs. Marco is a core dev of pandas and Polars and the author of Narwhals. And Marco, I'd love to have you just kick us off by introducing yourself and sharing a little bit about the work that you do, but also something you like to do outside of work too.

Sure thing. Hello, everybody. Thanks for inviting me. Also, I'm glad that we just surpassed 128 participants. Yes, that's right. We celebrate powers of two, not meaningless numbers like 100. Who cares about those numbers? I work as a software engineer at a company called Quansight Labs, which does a bit of a mixture of open source and consulting and training: a mixture of things that deliver value to the open source community without necessarily making a profit, and things that are meant to deliver a profit so that we can fund cool open source work.

Outside of work, I'm lately really passionate about playing Irish music, not because I'm Irish, but because where I live, there's an Irish pub, which every week hosts a jam session where anyone who plays an instrument can go along and play some Irish music with the hosts of the jam session. So if you have an instrument, you can just go along. They're very friendly. They show you the chords, show you the songs, and that's a major source of entertainment for me outside of work these days.

Introducing pandas, Polars, and Narwhals

So pandas and polars both expose objects known as data frames. A data frame is a two-dimensional object with which you can store data, typically in columns. So you've got a collection of columns. Each column has to have the same number of elements, and within each column, all the elements need to have the same data type. People in the chat are saying that they are two cute bears, and this is true. It does seem to be a bit of a law that in data science you need to name your tools after animals. I tried to follow that tradition with narwhals.
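As a minimal illustration of that definition (shown here with pandas, though the same shape applies to Polars; the column names and values are just made up for the example):

```python
import pandas as pd

# A data frame: a collection of named columns, each with the same number
# of elements, and each with a single data type.
df = pd.DataFrame({
    "animal": ["panda", "polar bear", "narwhal"],   # string column
    "weight_kg": [100.0, 450.0, 940.0],             # float column
})
print(df.dtypes)  # "animal" is object (strings), "weight_kg" is float64
```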

But pandas actually derives from panel data analysis. A Panel was a structure that used to be in the library, but it's no longer around. And Polars is written in a programming language called Rust, where the file extension is usually .rs. So that's why Polars is named like this: it ends with "rs" and also makes a reference to the pandas library.

Narwhals, on the other hand, tries to bring them together. So traditionally, pandas has carried a lot of the weight of the world of data science on its back for the best part of a decade. And most of the time, when people write tools intended for data scientists, they write them with pandas in mind. So maybe people can pass in arrays, maybe people can pass in pandas data frames, and quite often that's about it. But things are starting to change. A lot of users are demanding that their tools natively support Polars data frames, because it's newer, it's trendier, and it's got a bunch of improvements over pandas: it's a lot stricter, it has some parallelization by default, it's got a lazy API, and it's generally speaking noticeably faster. It helps you avoid a lot of bugs that you would otherwise make quite easily using pandas.

So there's a lot of demand now on people making data tools to not just support pandas, but also to support Polars data frames. So how do people do that? There are a few different strategies, so let's talk about them.

So if you want to make a tool that supports both pandas and Polars, what are your strategies? One strategy is to just write your logic using pandas, and then if your input isn't pandas, you convert to pandas and continue like that. This kind of works, but it's a missed opportunity; it could be a lot better to keep things native to the library that people are starting with. Another strategy could be to duplicate your logic. So you've got your logic for pandas data frames and your logic for Polars data frames. But then what happens next year when somebody comes along with Belugas data frames? You're going to need yet another set of logic for those.

So what we're trying to address with Narwhals is a third way of doing this: express your data frame logic once, using a nice little unified API that's extensive enough to be useful, but not so extensive that it becomes unmaintainable. And like this, it can just dispatch to whatever input your user provides, and you enable your users to achieve the concept of BYODF, which is not a System of a Down song. It stands for bring your own data frame.
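The dispatch idea can be sketched in plain Python. This is a toy illustration, not Narwhals' actual API: two stand-in "libraries" (`EagerLib` and `OtherLib` are invented names) with different native interfaces, and a thin compatibility layer so tool logic is written only once.

```python
# Toy sketch of the "write your logic once, dispatch to the user's
# library" idea behind Narwhals. NOT the real Narwhals API.

class EagerLib:
    """Stand-in for a pandas-like library."""
    def __init__(self, data):
        self.columns = data          # dict: column name -> list of values
    def get(self, name):
        return self.columns[name]

class OtherLib:
    """Stand-in for a Polars-like library with a different native API."""
    def __init__(self, data):
        self._cols = data
    def column(self, name):
        return self._cols[name]

def from_native(df):
    """Wrap either native object behind one tiny unified interface."""
    if isinstance(df, EagerLib):
        return lambda name: df.get(name)
    if isinstance(df, OtherLib):
        return lambda name: df.column(name)
    raise TypeError(f"unsupported data frame type: {type(df).__name__}")

def mean_of(df, name):
    # The tool author writes this logic once, against the unified
    # interface, instead of duplicating it per data frame library.
    col = from_native(df)(name)
    return sum(col) / len(col)

print(mean_of(EagerLib({"x": [1.0, 2.0, 3.0]}), "x"))  # 2.0
print(mean_of(OtherLib({"x": [10.0, 20.0]}), "x"))     # 15.0
```

Users can then "bring their own data frame": the tool never needs a new code path when a third library comes along, only the compatibility layer does.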

Lazy vs eager execution

So with pandas, everything happens eagerly: you give the library some instructions, and it evaluates your instructions the moment you give them. Whereas Polars has an option whereby you can give it some instructions, it can wait a bit, and then it can do them all together in the most optimal way that it can detect. If I were to tell you to cook me a recipe, and I gave you the steps one at a time, and for each step you had to go out and buy the ingredients for that step, it would be a pretty inefficient way of making a cake. But if I can give you the recipe beforehand, and you can buy all the ingredients together and maybe do two steps at the same time, it's going to be a much nicer cooking experience.
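The recipe analogy can be sketched in a few lines of plain Python. This is a toy model of lazy execution, not Polars itself: operations are recorded in a plan and only run at `collect()` time, which is where a real engine gets the chance to look at the whole plan and optimize it.

```python
# Toy sketch of lazy execution: record the steps, run them all at once.

class LazyFrame:
    def __init__(self, rows):
        self.rows = rows
        self.plan = []  # recorded, not-yet-executed operations

    def filter(self, predicate):
        self.plan.append(("filter", predicate))
        return self  # nothing is computed yet

    def select(self, *names):
        self.plan.append(("select", names))
        return self  # still nothing computed

    def collect(self):
        # Only now do we execute. A real engine (like Polars) would also
        # reorder and fuse steps here, e.g. pushing filters down so less
        # data is read in the first place.
        rows = self.rows
        for op, arg in self.plan:
            if op == "filter":
                rows = [r for r in rows if arg(r)]
            elif op == "select":
                rows = [{k: r[k] for k in arg} for r in rows]
        return rows

lf = LazyFrame([{"a": 1, "b": 10}, {"a": 2, "b": 20}, {"a": 3, "b": 30}])
result = lf.filter(lambda r: r["a"] >= 2).select("b").collect()
print(result)  # [{'b': 20}, {'b': 30}]
```

Eager execution would run each step the moment it's written; deferring to `collect()` is what gives the engine room to be clever.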

On the question about, I don't deal with big datasets, why should I care about lazy execution? I would say that performance doesn't matter until it does, and that when you've got something that's categorically faster, it changes the kinds of questions you can ask. So at a previous company I was working at, I remember there was some workflow where we were making weekly forecasts or something. It was a very complicated process; the forecasting took hours. But the moment we modernized the toolchain and used some better algorithms, and we got the time from hours down to minutes, then all of a sudden the questions that product management started asking were, well, maybe we can run daily or bi-daily forecasts. So yeah, categorically better performance really changes the kinds of questions you ask. And then at some point, even if your data isn't big now, maybe it will be later. So it's a good thing to start with scalability in mind.

Performance doesn't matter until it does. And that when you've got something that's categorically faster, it changes the kinds of questions you can ask.

PyArrow and dependency management

So as of pandas 1.5, you can use PyArrow-backed data types. And for some data types, that makes a big difference in pandas. For strings, it makes your operations a lot more efficient. And for integers, it means that you can store missing values properly, whereas with the classical integer data types in pandas, the moment you've got a missing value, your column becomes a float column.
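The integer problem is easy to demonstrate. The sketch below uses pandas' nullable "Int64" dtype to show the contrast without requiring pyarrow to be installed; the PyArrow-backed dtypes Marco describes (e.g. "int64[pyarrow]") behave similarly with respect to missing values.

```python
import pandas as pd

# Classical NumPy-backed integers: one missing value silently turns
# the whole column into floats.
s_classic = pd.Series([1, 2, None])
print(s_classic.dtype)  # float64

# Nullable integer dtypes keep the column as integers with a proper
# missing-value marker. ("int64[pyarrow]" does the same, but needs
# pyarrow installed.)
s_nullable = pd.Series([1, 2, None], dtype="Int64")
print(s_nullable.dtype)          # Int64
print(int(s_nullable.isna().sum()))  # 1
```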

And some people suggest this as a solution to data frame interoperability: why can't we just write our tools using PyArrow? Then why do we even need Narwhals? Everyone can just convert to PyArrow. So yeah, let's address this. And it goes back to Libby's question about lazy execution. Arrow is a nice memory format, and PyArrow specifically is a Python library which implements this memory format along with some compute functions. But it's all eager execution. So if you have the possibility of keeping your workflow lazy, then being forced to convert to PyArrow breaks the lazy execution. It's a missed opportunity.

Furthermore, PyArrow is a fairly heavy dependency. Let's have a show of hands: has anyone here run into dependency hell when working with Python projects? Yeah, I remember at one company I worked at, I was not able to upgrade the dependencies of a project because two different libraries had pinned tqdm to different versions. tqdm is just a progress bar. We weren't able to update our dependencies because of a progress bar, which I didn't even want to see. So since then, I've become rather averse to non-essential required dependencies. And I think it's a pity if a dependency as heavy and as difficult to install as PyArrow becomes required in too many places. By all means, use it if it brings an advantage. But I really hope that the solution to data frame interoperability does not become PyArrow. I think we can do better.

Great Tables and Polars integration

So, Marco, I learned from you that Great Tables, which our team works on, plays really nicely with Polars because they make Polars expressions part of the API. It was probably on some social post somewhere. It might actually have been on the Polars Discord, where one of the Great Tables devs, maybe Richard, maybe Michael, had posted about it saying, oh, we realized that Polars expressions are really nice. We started with supporting pandas, and then when supporting Polars, we decided that we can do better than just trying to mimic what we're doing for pandas. We can go further: we can allow users to pass expressions, and like this have a really nice, readable, Polars-idiomatic experience.

And yeah, I really liked it, and then I got a bit angry when I saw a bunch of people online saying, oh, I like Polars, but it doesn't have any way of displaying tables. I was like, okay, well, there's Great Tables, which does this really well, so maybe we should hook that into Polars. So as of Polars version something one, there's a .style namespace with which you can access some Great Tables functions and get your data frames beautifully displayed. I was really shocked by how beautiful these displays could be with just a few lines of code.

Open source accessibility and community

All right, I think I've got the answer. Yeah, if you consider the topic of women in open source, there are typically relatively few in most packages. But in Narwhals, I think we've got a far higher than average number of female contributors. And I think part of the reason is that when we approve pull requests, we give people cute animal GIFs. Now, I'm not saying that this is the solution to diversity problems in tech, but it is a differentiator, and it helps make the project more fun.

And we've got a zero-tolerance policy towards unpleasant people. So my recommendation would be: if you're relying on volunteer labor, then try to make it fun. Try to encourage people. I think there's a fair bit of evidence that being nice and encouraging works a lot better than being critical and responding to people with minus one the moment they make a suggestion. If you're relying on volunteers, they're not very likely to make other suggestions if their first experience with your library is minus one, this is a terrible idea.

If you're relying on volunteer labor, then try to make it fun. Try to encourage people. There's a fair bit of evidence that being nice and encouraging works a lot better than being critical and responding to people with minus one the moment they make a suggestion.

So yeah, be extra nice to everyone and give people cat GIFs. I mean, another big problem that open source has is that of funding. And if we did have more funding in open source, I think it would be good to prioritize mentoring people from underrepresented groups.

Getting into open source: finding your niche

Yeah, I think when we chatted, Libby, I'd quoted to you the phrase: if there's something which you find interesting, but which other people find boring, then that's your competitive advantage. And when I started using Polars, I noticed that a lot of the time zone code just hadn't really been done properly. There was, in theory, some support for time zones, but it was typically just convert to UTC, apply an operation, and then convert back, which isn't typically correct. It works in some cases, but, you know, if you need to report to your boss how many sales you've made per day, then the definition of a day had better respect the time zone that you're selling in. If it respects the day boundaries of UTC, then it's not a very useful analysis.

So yeah, I've got a bit of a niche interest in time zones. Maybe it's because I live in the UK and my parents live in Italy, so I'm always thinking about difficulties with scheduling and always having to remember in which direction to add or subtract hours. I had a bit of a niche interest in time zones, but I didn't know Rust. So how was I to contribute to Polars? Well, just by doing the time zone things that nobody else wanted to do, and in the process learning Rust.

Narwhals vs Ibis

So perhaps for people who don't know, Ibis is a project which describes itself as being a portable data frame library. So the idea is that they've got this Python API, which then gets translated to their intermediate representation, and they can then translate it either to SQL engines or to Polars. And like this, with just one API, you can target different engines.

And if you're an end user, if you need to do some complicated analysis, if you're an ML engineer or whatever, then I'm not sure that Narwhals is necessarily the best tool for that task. So with Narwhals, we're trying really to target library maintainers. And to do that, we're making decisions based on what library maintainers need. And I don't think that our API is extensive enough or that it will be extensive enough for the kinds of really customized, really complicated analyses that a lot of analysts and ML engineers are doing. So a SQL front end like Ibis might be a better choice for that.

Conversely, for the kinds of tools that people are building using Narwhals, I don't think that Ibis is a very good choice. And a few reasons for that are that a lot of tools that use Narwhals really do a lot with eager execution, whereas with Ibis, everything is lazy. And with Narwhals, we've got both the lazy and the eager execution, so you can choose. But at some point, you may well need to do things eagerly, and we allow that.

Another is that with Ibis, every back end has pandas and PyArrow as required dependencies, whereas with Narwhals, we've been very strict from the beginning by saying no back end should have any extra dependencies. And another is that with Ibis, there's no support for categorical data.

What's next for Narwhals

So the top priority for now is Plotly. One of the main contributors to Narwhals, Francesco, has opened a pull request to Plotly to Narwhalify the Plotly Express part of the code base. The Plotly folks have been very encouraging about that; they've marked it as a priority one in their repo as well. So it's a top priority for a lot of people, and it's really at the top of my mind at the moment. And then I'd like to see where else we can go with Narwhals. There are some other projects that have shown interest: there's the Prophet library, there's scikit-learn. It would be unreal if we were actually able to make it into scikit-learn, so I'm not holding my breath on that one. But I think at some point I'll just open a pull request and get ready for it to be rejected. If I start with low expectations, then I won't be too disappointed.

Productivity and prioritization

Ruthless prioritization. I think you need to just say no to things. I think this one might come from Warren Buffett. The story is that he asked someone, what are your top 30 goals? And how many of these do you plan to address in the next year? And this person says, oh yeah, I might do the top five, maybe the top 10 or top 15 if I can. And Warren's reply is: no, no, you've got it all wrong. You need to circle the top five, and everything else you consciously do not work on, because it's going to distract you from the top five. So I think you just need to be ruthless in saying no to things. Not everything can be fixed; not everything should be addressed. Don't chase rabbits when you're out hunting elephants.

For tasks, I use Todoist, where you can set labels and priority levels on things. So yeah, I'm pretty ruthless about checking that, both on my phone and on my laptop. There's a book called Getting Things Done, one of these self-help books about productivity, and I think most of it is not worth reading. But there is one lesson in there, which is about having a capture-everything device. So for me, that's having the Todoist app both on my phone and on my laptop: wherever I am, when I think, oh, I need to do this, I just add it to my app. And one insight that I got from that book is that open loops are a major source of stress in your life. That is to say, when there's something that needs doing and it's not captured somewhere. If you're just about to go to sleep and you think, oh, I really need to remember to do this tomorrow, I really need to remember to do that, it's going to stress you out. But if you know that it's in your app, and the next day you're going to look at your app because that's part of your routine, then that stress goes away.

Open source as an addiction

Although first I need to address the comment in the chat, which says I'm addicted to code. No, I hate code. I want as little code as possible. I'm addicted to open source: tools that solve problems for people. If there were a way to do it without code, that would probably be better.

Yeah, I think when you start making contributions, it can be very rewarding to fix things and to add features which are going to be used by millions of people. On the flip side, you start feeling a bit of responsibility for things, and you become really attached to the things you've done. And it can be a bit dangerous: if you try to address all of the issues and all of the pull requests that are coming in, there's just not enough time in the day. Furthermore, it's possible to keep yourself endlessly busy with relatively unimportant tasks which maybe affect 0.001% of users. So in terms of maintaining a healthy work-life balance, I really suggest more prioritization.

Choosing between pandas and Polars for new projects

Yeah, if you're starting a new project, I'd really recommend this as the best time to try out Polars. If you've got an existing project, my recommendation is: if it's not broken, don't fix it. But if you're starting a new project, there's not much reason not to, unless you need something which only pandas does, like maybe complex number support or some geospatial functions. But even then, DuckDB has got good support for geospatial, so you can always just get around it with that.

Yeah, I think the previous person who asked a question commented that we become attached to the code we write. And it's been a bit painful for me to maintain pandas for years and then come to realize that it's probably just not the best thing going forwards for new projects. But it's just something we've got to contend with. Different tools are going to come along, and we've just got to be ready for that. And if you install Narwhals and then do import narwhals.this, there's a little Easter egg there, and one of the phrases is: our code is not irreplaceable. I just want to remind people, don't get too attached to the code you've written.

Yeah, I love that: the index can be powerful if you use it correctly, but nobody does, so for most people it's just an annoyance. I would say the main difference when coming to Polars from pandas is that the way you interact with data frames isn't with square brackets and series, but with expressions. And we don't have that long left, so if you want a more detailed answer, then I'm going to plug my own talk from PyData Amsterdam, Understanding Polars Expressions.

Map elements and fun open source moments

So map_elements is the way that you can apply user-defined functions row by row. And it's really a last-resort function, something you should only use if you absolutely have to. Usually it's just more efficient to use the built-in expressions API, or to write a plugin if you really need to. But regarding map_elements, one of the most fun experiences I've had in open source was when I got so angry seeing some blog posts where people were saying, ah, Polars, it's not really faster than pandas, because look at this function where I used map_elements and it's not very fast. I was like, okay, there needs to be a way to tell people not to use map_elements. So my first suggestion was to rename it to map_elements_slowly, but this got rejected.

I'm not really sure why. I think it was a brilliant name, but no, it didn't happen. So what we then did was take the lambda function which people pass in, decompile the bytecode, and reverse-engineer it to figure out what the user had passed in, so we can emit a warning telling the user what to do instead. And yeah, that was wild. One of the most fun open source experiences ever. We can't do it for everything; if we could, then we could just do it for them instead of emitting a warning. And I really don't want to do it for the user in cases where we can detect it, because then people don't learn. So we emit the warning instead of doing it for you, to teach you a lesson.

So, well, in-person talks, I'm out-talked, out-conferenced for this year. I've done too much. But there will be PyLadiesCon towards the end of the year, and one of our regular contributors, Magdalena, will give a talk there, and might even lead a sprint. So if anyone's interested in contributing to Narwhals, I'd really recommend attending that.

Well, thank you so much, Marco, for taking the time to join us today and sharing your experience with us all. This has been great getting to learn from you. Thank you so much for inviting me. This was a lot of fun.

It's always amazing how quickly an hour goes by, but Libby and I will write up some episode notes and put that with the recording. But I do just want to say, if this was your first Hangout today and you want to join us again, they are every Thursday from 12 to 1 Eastern Time. So next week, Steph Locke at Microsoft is going to be joining us. Steph was a recent keynote speaker at Earl and came highly recommended from Hadley Wickham. We'll be talking about business as usual and how it can slow down the development cycle and what we can all do about that. Thank you all so much. Nice hanging out with you all, and I hope to see you next week.