Resources

Polars, pandas, and Narwhals, oh my! | Marco Gorelli | Data Science Hangout

Welcome back to the Data Science Hangout! This week, we're joined by Marco Gorelli, a senior software engineer at Quansight Labs, a core developer of pandas and Polars, and the author of Narwhals. Marco shares his insights on the evolving landscape of data frame libraries in Python, focusing on Polars, its advantages over pandas, and the role of Narwhals in creating data-frame-agnostic tools. He discusses his journey into open source, highlighting the importance of finding your niche and contributing to areas that others might find "boring" but that are essential for a project's success. He also emphasizes the significance of fostering welcoming and inclusive open source communities.

This episode explores several key topics:

Polars: We learn about Polars' key features like lazy execution, its performance benefits, and the power of its expressions API. Marco explains how lazy execution can lead to significant performance gains and answers questions about its relevance even when working with smaller datasets.

Narwhals: We discover how Narwhals enables developers to write tools that work seamlessly with various data frame libraries, promoting interoperability and simplifying development workflows.

Open source challenges: Marco addresses the challenges of maintaining work-life balance while being deeply involved in open source projects. He offers practical advice on prioritization and managing the constant influx of tasks and requests.
Resources mentioned in the chat:

Pandas: https://pandas.pydata.org/
Polars: https://pola.rs/
Narwhals: https://narwhals-dev.github.io/narwhals/
Awesome Polars: https://github.com/ddotta/awesome-polars
Ibis: https://ibis-project.org/
Polars Plugins Tutorial: https://marcogorelli.github.io/polars-plugins-tutorial/
Understanding Polars Expressions: https://www.youtube.com/watch?v=E7cHgN9rd9c
Great Tables: https://posit-dev.github.io/great-tables/articles/intro.html
Great Tables Blog - Polars Styling: https://posit-dev.github.io/great-tables/blog/polars-styling/
Great Tables Blog - BYODF: https://posit-dev.github.io/great-tables/blog/bring-your-own-df/
Ruff: https://astral.sh/ruff
UV: https://docs.astral.sh/uv/
R/Pharma 2024: https://rinpharma.com/
PyladiesCon: https://conference.pyladies.com/
Beeminder: https://www.beeminder.com/
Todoist: https://todoist.com/
Nyctography: https://en.wikipedia.org/wiki/Nyctography

This episode offers valuable insights for anyone working with data in Python, particularly those interested in exploring the benefits of Polars and the power of Narwhals for building data-frame-agnostic tools.

► Subscribe to Our Channel Here: https://bit.ly/2TzgcOu

Follow Us Here:
Website: https://www.posit.co
LinkedIn: https://www.linkedin.com/company/posit-software

To join future data science hangouts, add to your calendar here: https://pos.it/dsh (All are welcome! We'd love to see you!)

Thanks for hanging out with us!

Nov 12, 2024
58 min

image: thumbnail.jpg

Transcript

This transcript was generated automatically and may contain errors.

Welcome back to the Data Science Hangout, everybody. If we haven't had the chance to meet before, and if this is your first Data Science Hangout, I'm Rachel. I lead Customer Marketing at Posit. Posit builds enterprise solutions and open source tools for people who do data science with R and Python. We're also the company formerly called RStudio.

Hi, everybody. I'm Libby. I help Rachel out with the Hangout community, and I'm also a Posit Academy mentor. So I mentor cohort-based groups who are learning R and Python to apply those to their everyday data jobs.

The Hangout is our open space to hear what's going on in the world of data across all different industries, to get to chat about data science leadership, and connect with others who are facing similar things as you. And we get together here every Thursday at the same time, same place.

But thank you so much to those who have helped make this the friendly and welcoming space that it is today. We're all dedicated to keeping it that way. If you ever have feedback about your experience that you'd like to share with me anonymously, good or bad, or maybe suggestions for topics to dive deeper on, I'm going to share a Google form in the chat with you right now, but you can always reach out to me directly on LinkedIn as well.

With all that, thanks again for joining us. I'm so excited to be joined by our co-host today, Marco Gorelli, Senior Software Engineer at Quansight Labs. Marco is a core dev of pandas and Polars and the author of Narwhals. And Marco, I'd love to have you just kick us off by introducing yourself and sharing a little bit about the work that you do, but also something you like to do outside of work too.

Sure thing. Hello, everybody. Thanks for inviting me. Also, I'm glad that we just surpassed 128 participants. Yes, that's right. We celebrate powers of two, not meaningless numbers like 100. Who cares about those numbers? I work as a software engineer at a company called Quansight Labs, which does a bit of a mixture of open source and consulting and training: a mixture of things that deliver value to the open source community without necessarily making a profit, and things that are meant to deliver a profit so that we can fund cool open source work.

Outside of work, I'm lately really passionate about playing Irish music, not because I'm Irish, but because where I live, there's an Irish pub, which every week hosts a jam session where anyone who plays an instrument can go along and play some Irish music with the hosts of the jam session. So if you have an instrument, you can just go along. They're very friendly. They show you the chords, show you the songs, and that's a major source of entertainment for me outside of work these days.

Introducing pandas, Polars, and Narwhals

So pandas and polars both expose objects known as data frames. A data frame is a two-dimensional object with which you can store data, typically in columns. So you've got a collection of columns. Each column has to have the same number of elements, and within each column, all the elements need to have the same data type. People in the chat are saying that they are two cute bears, and this is true. It does seem to be a bit of a law that in data science you need to name your tools after animals. I tried to follow that tradition with narwhals.
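As a minimal illustration of that definition (shown here with pandas, though the same shape applies to Polars; the column names and values are just made up for the example):

```python
import pandas as pd

# A data frame: a collection of named columns, each with the same number
# of elements, and each with a single data type.
df = pd.DataFrame({
    "animal": ["panda", "polar bear", "narwhal"],   # string column
    "weight_kg": [100.0, 450.0, 940.0],             # float column
})
print(df.dtypes)  # "animal" is object (strings), "weight_kg" is float64
```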

But pandas actually derives from panel data analysis. A Panel was a structure that used to be in the library, but it's no longer around. And Polars is written in a programming language called Rust, where the file extension is usually .rs. So that's why Polars is named like this: it ends with "rs" and also makes a reference to the pandas library.

Narwhals, on the other hand, tries to bring them together. So traditionally, pandas has carried a lot of the weight of the world of data science on its back for the best part of a decade. And most of the time, when people write tools intended for data scientists, they write them with pandas in mind. So maybe people can pass in arrays, maybe people can pass in pandas data frames, and quite often that's about it. But things are starting to change. A lot of users are demanding that their tools natively support Polars data frames, because it's newer, it's trendier, and it's got a bunch of improvements over pandas: it's a lot stricter, it has some parallelization by default, it's got a lazy API, and it's generally speaking noticeably faster. It helps you avoid a lot of bugs that you would otherwise make quite easily using pandas.

So there's a lot of demand now on people making data tools to not just support pandas, but also to support Polars data frames. So how do people do that? There are a few different strategies, so let's talk about them.

So if you want to make a tool that supports both pandas and Polars, what are your strategies? One strategy is to just write your logic using pandas, and then if your input isn't pandas, you convert to pandas and continue like that. This kind of works, but it's a missed opportunity; it could be a lot better to keep things native to the library that people are starting with. Another strategy could be to duplicate your logic. So you've got your logic for pandas data frames and your logic for Polars data frames. But then what happens next year when somebody comes along with Belugas data frames? You're going to need yet another set of logic for those.

So what we're trying to address with Narwhals is a third way of doing this: express your data frame logic once, using a nice little unified API that's extensive enough to be useful, but not so extensive that it becomes unmaintainable. And like this, it can just dispatch to whatever input your user provides, and you enable your users to achieve the concept of BYODF, which is not a System of a Down song. It stands for bring your own data frame.
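The dispatch idea can be sketched in plain Python. This is a toy illustration, not Narwhals' actual API: two stand-in "libraries" (`EagerLib` and `OtherLib` are invented names) with different native interfaces, and a thin compatibility layer so tool logic is written only once.

```python
# Toy sketch of the "write your logic once, dispatch to the user's
# library" idea behind Narwhals. NOT the real Narwhals API.

class EagerLib:
    """Stand-in for a pandas-like library."""
    def __init__(self, data):
        self.columns = data          # dict: column name -> list of values
    def get(self, name):
        return self.columns[name]

class OtherLib:
    """Stand-in for a Polars-like library with a different native API."""
    def __init__(self, data):
        self._cols = data
    def column(self, name):
        return self._cols[name]

def from_native(df):
    """Wrap either native object behind one tiny unified interface."""
    if isinstance(df, EagerLib):
        return lambda name: df.get(name)
    if isinstance(df, OtherLib):
        return lambda name: df.column(name)
    raise TypeError(f"unsupported data frame type: {type(df).__name__}")

def mean_of(df, name):
    # The tool author writes this logic once, against the unified
    # interface, instead of duplicating it per data frame library.
    col = from_native(df)(name)
    return sum(col) / len(col)

print(mean_of(EagerLib({"x": [1.0, 2.0, 3.0]}), "x"))  # 2.0
print(mean_of(OtherLib({"x": [10.0, 20.0]}), "x"))     # 15.0
```

Users can then "bring their own data frame": the tool never needs a new code path when a third library comes along, only the compatibility layer does.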

Lazy vs eager execution

So with pandas, everything happens eagerly: you give the library some instructions, and it evaluates your instructions the moment you give them. Whereas Polars has an option whereby you can give it some instructions, it can wait a bit, and then it can do them all together in the most optimal way that it can detect. If I were to tell you to cook me a recipe, and I gave you the steps one at a time, and for each step you had to go out and buy the ingredients for that step, it would be a pretty inefficient way of making a cake. But if I can give you the recipe beforehand, and you can buy all the ingredients together and maybe do two steps at the same time, it's going to be a much nicer cooking experience.
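The recipe analogy can be sketched in a few lines of plain Python. This is a toy model of lazy execution, not Polars itself: operations are recorded in a plan and only run at `collect()` time, which is where a real engine gets the chance to look at the whole plan and optimize it.

```python
# Toy sketch of lazy execution: record the steps, run them all at once.

class LazyFrame:
    def __init__(self, rows):
        self.rows = rows
        self.plan = []  # recorded, not-yet-executed operations

    def filter(self, predicate):
        self.plan.append(("filter", predicate))
        return self  # nothing is computed yet

    def select(self, *names):
        self.plan.append(("select", names))
        return self  # still nothing computed

    def collect(self):
        # Only now do we execute. A real engine (like Polars) would also
        # reorder and fuse steps here, e.g. pushing filters down so less
        # data is read in the first place.
        rows = self.rows
        for op, arg in self.plan:
            if op == "filter":
                rows = [r for r in rows if arg(r)]
            elif op == "select":
                rows = [{k: r[k] for k in arg} for r in rows]
        return rows

lf = LazyFrame([{"a": 1, "b": 10}, {"a": 2, "b": 20}, {"a": 3, "b": 30}])
result = lf.filter(lambda r: r["a"] >= 2).select("b").collect()
print(result)  # [{'b': 20}, {'b': 30}]
```

Eager execution would run each step the moment it's written; deferring to `collect()` is what gives the engine room to be clever.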

On the question about, I don't deal with big datasets, why should I care about lazy execution? I would say that performance doesn't matter until it does, and that when you've got something that's categorically faster, it changes the kinds of questions you can ask. So at a previous company I was working at, I remember there was some workflow where we were making weekly forecasts or something. It was a very complicated process; the forecasting took hours. But the moment we modernized the toolchain and used some better algorithms, and we got the time from hours down to minutes, then all of a sudden the questions that product management started asking were, well, maybe we can run daily or bi-daily forecasts. So yeah, categorically better performance really changes the kinds of questions you ask. And then at some point, even if your data isn't big now, maybe it will be later. So it's a good thing to start with scalability in mind.

Performance doesn't matter until it does. And that when you've got something that's categorically faster, it changes the kinds of questions you can ask.

PyArrow and dependency management

So as of pandas 1.5, you can use PyArrow-backed data types. And for some data types, that makes a big difference in pandas. For strings, it makes your operations a lot more efficient. And for integers, it means that you can store missing values properly, whereas with the classical integer data types in pandas, the moment you've got a missing value, your column becomes a float column.
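The integer problem is easy to demonstrate. The sketch below uses pandas' nullable "Int64" dtype to show the contrast without requiring pyarrow to be installed; the PyArrow-backed dtypes Marco describes (e.g. "int64[pyarrow]") behave similarly with respect to missing values.

```python
import pandas as pd

# Classical NumPy-backed integers: one missing value silently turns
# the whole column into floats.
s_classic = pd.Series([1, 2, None])
print(s_classic.dtype)  # float64

# Nullable integer dtypes keep the column as integers with a proper
# missing-value marker. ("int64[pyarrow]" does the same, but needs
# pyarrow installed.)
s_nullable = pd.Series([1, 2, None], dtype="Int64")
print(s_nullable.dtype)          # Int64
print(int(s_nullable.isna().sum()))  # 1
```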

And some people suggest this as a solution to data frame interoperability: why can't we just write our tools using PyArrow? Then why do we even need Narwhals? Everyone can just convert to PyArrow. So yeah, let's address this. And it goes back to Libby's question about lazy execution. Arrow is a nice memory format, and PyArrow specifically is a Python library which implements this memory format along with some compute functions. But it's all eager execution. So if you have the possibility of keeping your workflow lazy, then being forced to convert to PyArrow breaks the lazy execution. It's a missed opportunity.

Furthermore, PyArrow is a fairly heavy dependency. Let's have a show of hands: has anyone here run into dependency hell when working with Python projects? Yeah, I remember at one company I worked at, I was not able to upgrade the dependencies of a project because two different libraries had pinned tqdm to different versions. tqdm is just a progress bar. We weren't able to update our dependencies because of a progress bar, which I didn't even want to see. So since then, I've become rather averse to non-essential required dependencies. And I think it's a pity if a dependency as heavy and as difficult to install as PyArrow becomes required in too many places. By all means, use it if it brings an advantage. But I really hope that the solution to data frame interoperability does not become PyArrow. I think we can do better.

Great Tables and Polars integration

So, Marco, I learned from you that Great Tables, which our team works on, plays really nicely with Polars because they make Polars expressions part of the API. It was probably on some social post somewhere. It might actually have been on the Polars Discord, where one of the Great Tables devs, maybe Richard, maybe Michael, had posted about it saying, oh, we realized that Polars expressions are really nice. We started with supporting pandas, and then when supporting Polars, we decided that we can do better than just trying to mimic what we're doing for pandas. We can go further: we can allow users to pass expressions, and like this have a really nice, readable, Polars-idiomatic experience.

And yeah, I really liked it, and then I got a bit angry when I saw a bunch of people online saying, oh, I like Polars, but it doesn't have any way of displaying tables. I was like, okay, well, there's Great Tables, which does this really well, so maybe we should hook that into Polars. So as of Polars version something one, there's a .style namespace with which you can access some Great Tables functions and get your data frames beautifully displayed. I was really shocked by how beautiful these displays could be with just a few lines of code.

Open source accessibility and community

All right, I think I've got the answer. Yeah, if you consider the topic of women in open source, there are typically relatively few in most packages. But in Narwhals, I think we've got a far higher than average number of female contributors. And I think part of the reason is that when we approve pull requests, we give people cute animal GIFs. Now, I'm not saying that this is the solution to diversity problems in tech, but it is a differentiator, and it helps make the project more fun.

And we've got a zero-tolerance policy towards unpleasant people. So my recommendation would be: if you're relying on volunteer labor, then try to make it fun. Try to encourage people. I think there's a fair bit of evidence that being nice and encouraging works a lot better than being critical and responding to people with minus one the moment they make a suggestion. If you're relying on volunteers, they're not very likely to make other suggestions if their first experience with your library is minus one, this is a terrible idea.

If you're relying on volunteer labor, then try to make it fun. Try to encourage people. There's a fair bit of evidence that being nice and encouraging works a lot better than being critical and responding to people with minus one the moment they make a suggestion.

So yeah, be extra nice to everyone and give people cat GIFs. I mean, another big problem that open source has is that of funding. And if we did have more funding in open source, I think it would be good to prioritize mentoring people from underrepresented groups.

Getting into open source: finding your niche

Yeah, I think when we chatted, Libby, I'd quoted to you the phrase: if there's something which you find interesting, but which other people find boring, then that's your competitive advantage. And when I started using Polars, I noticed that a lot of the time zone code just hadn't really been done properly. There was, in theory, some support for time zones, but it was typically just convert to UTC, apply an operation, and then convert back, which isn't typically correct. It works in some cases, but, you know, if you need to report to your boss how many sales you've made per day, then the definition of a day had better respect the time zone that you're selling in. If it respects the day boundaries of UTC, then it's not a very useful analysis.

So yeah, I've got a bit of a niche interest in time zones. Maybe it's because I live in the UK and my parents live in Italy, so I'm always thinking about difficulties with scheduling and always having to remember in which direction to add or subtract hours. I had a bit of a niche interest in time zones, but I didn't know Rust. So how was I to contribute to Polars? Well, just by doing the time zone things that nobody else wanted to do, and in the process learning Rust.

Narwhals vs Ibis

So perhaps for people who don't know, Ibis is a project which describes itself as being a portable data frame library. So the idea is that they've got this Python API, which then gets translated to their intermediate representation, and they can then translate it either to SQL engines or to Polars. And like this, with just one API, you can target different engines.

And if you're an end user, if you need to do some complicated analysis, if you're an ML engineer or whatever, then I'm not sure that Narwhals is necessarily the best tool for that task. So with Narwhals, we're trying really to target library maintainers. And to do that, we're making decisions based on what library maintainers need. And I don't think that our API is extensive enough or that it will be extensive enough for the kinds of really customized, really complicated analyses that a lot of analysts and ML engineers are doing. So a SQL front end like Ibis might be a better choice for that.

Conversely, for the kinds of tools that people are building using Narwhals, I don't think that Ibis is a very good choice. And a few reasons for that are that a lot of tools that use Narwhals really do a lot with eager execution, whereas with Ibis, everything is lazy. And with Narwhals, we've got both the lazy and the eager execution, so you can choose. But at some point, you may well need to do things eagerly, and we allow that.

Another is that with Ibis, every back end has pandas and PyArrow as required dependencies, whereas with Narwhals, we've been very strict from the beginning by saying no back end should have any extra dependencies. And another is that with Ibis, there's no support for categorical data.

What's next for Narwhals

So the top priority for now is Plotly. One of the main contributors to Narwhals, Francesco, has opened a pull request to Plotly to Narwhalify the Plotly Express part of the code base. The Plotly folks have been very encouraging about that; they've marked it as a priority one in their repo as well. So it's a top priority for a lot of people, and it's really at the top of my mind at the moment. And then I'd like to see where else we can go with Narwhals. There are some other projects that have shown interest: there's the Prophet library, there's scikit-learn. It would be unreal if we were actually able to make it into scikit-learn, so I'm not holding my breath on that one. But I think at some point I'll just open a pull request and get ready for it to be rejected. If I start with low expectations, then I won't be too disappointed.

Productivity and prioritization

Ruthless prioritization. I think you need to just say no to things. I think this one might come from Warren Buffett. The story is that he asked someone, what are your top 30 goals? And how many of these do you plan to address in the next year? And this person says, oh yeah, I might do the top five, maybe the top 10 or top 15 if I can. And Warren's reply is: no, no, you've got it all wrong. You need to circle the top five, and everything else you consciously do not work on, because it's going to distract you from the top five. So I think you just need to be ruthless in saying no to things. Not everything can be fixed; not everything should be addressed. Don't chase rabbits when you're out hunting elephants.

For tasks, I use Todoist, where you can set labels and priority levels on things. So yeah, I'm pretty ruthless about checking that, both on my phone and on my laptop. There's a book called Getting Things Done, one of these self-help books about productivity, and I think most of it is not worth reading. But there is one lesson in there, which is about having a capture-everything device. So for me, that's having the Todoist app both on my phone and on my laptop: wherever I am, when I think, oh, I need to do this, I just add it to my app. And one insight that I got from that book is that open loops are a major source of stress in your life. That is to say, when there's something that needs doing and it's not captured somewhere. If you're just about to go to sleep and you think, oh, I really need to remember to do this tomorrow, I really need to remember to do that, it's going to stress you out. But if you know that it's in your app, and the next day you're going to look at your app because that's part of your routine, then that stress goes away.

Open source as an addiction

Although first I need to address the comment in the chat, which says I'm addicted to code. No, I hate code. I want as little code as possible. I'm addicted to open source: tools that solve problems for people. If there were a way to do it without code, that would probably be better.

Yeah, I think when you start making contributions, it can be very rewarding to fix things and to add features which are going to be used by millions of people. On the flip side, you start feeling a bit of responsibility for things, and you become really attached to the things you've done. And it can be a bit dangerous: if you try to address all of the issues and all of the pull requests that are coming in, there's just not enough time in the day. Furthermore, it's possible to keep yourself endlessly busy with relatively unimportant tasks which maybe affect 0.001% of users. So in terms of maintaining a healthy work-life balance, I really suggest more prioritization.

Choosing between pandas and Polars for new projects

Yeah, if you're starting a new project, I'd really recommend this as the best time to try out Polars. If you've got an existing project, my recommendation is: if it's not broken, don't fix it. But if you're starting a new project, there's not much reason not to, unless you need something which only pandas does, like maybe complex number support or some geospatial functions. But even then, DuckDB has got good support for geospatial, so you can always just get around it with that.

Yeah, I think the previous person who asked a question commented that we become attached to the code we write. And it's been a bit painful for me to maintain pandas for years and then come to realize that it's probably just not the best thing going forwards for new projects. But it's just something we've got to contend with. Different tools are going to come along, and we've just got to be ready for that. And if you install Narwhals and then do import narwhals.this, there's a little Easter egg there, and one of the phrases is: our code is not irreplaceable. I just want to remind people, don't get too attached to the code you've written.

Yeah, I love that: the index can be powerful if you use it correctly, but nobody does, so for most people it's just an annoyance. I would say the main difference when coming to Polars from pandas is that the way you interact with data frames isn't with square brackets and series, but with expressions. And we don't have that long left, so if you want a more detailed answer, then I'm going to plug my own talk from PyData Amsterdam, Understanding Polars Expressions.

Map elements and fun open source moments

So map_elements is the way that you can apply user-defined functions row by row. And it's really a last-resort function, something you should only use if you absolutely have to. Usually it's just more efficient to use the built-in expressions API, or to write a plugin if you really need to. But regarding map_elements, one of the most fun experiences I've had in open source was when I got so angry seeing some blog posts where people were saying, ah, Polars, it's not really faster than pandas, because look at this function where I used map_elements and it's not very fast. I was like, okay, there needs to be a way to tell people not to use map_elements. So my first suggestion was to rename it to map_elements_slowly, but this got rejected.

I'm not really sure why. I think it was a brilliant name, but no, it didn't happen. So what we then did was take the lambda function which people pass in, decompile the bytecode, and reverse-engineer it to figure out what the user had passed in, so we can emit a warning telling the user what to do instead. And yeah, that was wild. One of the most fun open source experiences ever. We can't do it for everything; if we could, then we could just do it for them instead of emitting a warning. And I really don't want to do it for the user in cases where we can detect it, because then people don't learn. So we emit the warning instead of doing it for you, to teach you a lesson.

So, well, in-person talks, I'm out-talked, out-conferenced for this year. I've done too much. But there will be PyLadiesCon towards the end of the year, and one of our regular contributors, Magdalena, will give a talk there, and might even lead a sprint. So if anyone's interested in contributing to Narwhals, I'd really recommend attending that.

Well, thank you so much, Marco, for taking the time to join us today and sharing your experience with us all. This has been great getting to learn from you. Thank you so much for inviting me. This was a lot of fun.

It's always amazing how quickly an hour goes by, but Libby and I will write up some episode notes and put that with the recording. But I do just want to say, if this was your first Hangout today and you want to join us again, they are every Thursday from 12 to 1 Eastern Time. So next week, Steph Locke at Microsoft is going to be joining us. Steph was a recent keynote speaker at Earl and came highly recommended from Hadley Wickham. We'll be talking about business as usual and how it can slow down the development cycle and what we can all do about that. Thank you all so much. Nice hanging out with you all, and I hope to see you next week.