
885: Python Polars: The Definitive Guide — with Jeroen Janssens and Thijs Nieuwdorp
#Python #Polars #Pandas

Jeroen Janssens and Thijs Nieuwdorp are data frame library Polars' greatest advocates. In this episode with @JonKrohnLearns, they discuss their book, Python Polars: The Definitive Guide, best practices for using Polars, why Pandas users are switching to Polars for data frame operations in Python, and how the library reduces memory usage and compute time by up to 10x compared to Pandas. Listen to the episode to be a part of an O'Reilly giveaway!

This episode is brought to you by:
• Trainium2, the latest AI chip from AWS: https://aws.amazon.com/ai/machine-learning/trainium/
• Adverity, the conversational analytics platform: https://eu1.hubs.ly/H0jxK210
• Dell AI Factory with NVIDIA: https://www.dell.com/superdatascience

Interested in sponsoring a SuperDataScience Podcast episode? Email natalie@superdatascience.com for sponsorship information.

In this episode you will learn:
• (00:00:00) Introduction
• (00:04:46) Why Jeroen and Thijs wrote Python Polars: The Definitive Guide
• (00:18:18) Best practices in Polars
• (00:25:08) Why Polars has so many users
• (00:32:37) The benefits of the Great Tables package
• (00:50:05) Jeroen and Thijs' partnership with NVIDIA and Dell for Python Polars: The Definitive Guide

Additional materials: https://www.superdatascience.com/885
image: thumbnail.jpg
Transcript
This transcript was generated automatically and may contain errors.
It's if you think of an expression as a recipe, then the operations would be the steps, and the functions and methods would be the cooks. So how does this metaphor shape your philosophy about best practices with data transformation design in order to deliver clean, readable pipelines, especially in large collaborative projects?
The short answer is no more brackets. With Polars, you take a different approach. You're writing a paragraph of things that you want to do. Yeah, and I think one of the things I've always liked most about Polars is the very declarative approach of what you're writing down. With Polars, you declare what you want as end result, and you just leave the specific processing and optimization to the engine, and that makes it way easier to read.
Jeroen and Thijs, welcome to the Super Data Science Podcast. You guys are together. I don't think... Have I ever had this situation before? I can't think, off the top of my head, of ever having two guests who are co-located.
Where are you two co-located? Well, thanks, Jon. It's great to be here again. Good to see you again. And we're calling in from Rotterdam, the Netherlands.
And that voice for the listeners out there, for the people not watching the YouTube version, that was Jeroen. That's his voice. And for people watching the YouTube version, he's the one in the pink shirt. Oh, also his mouth was just moving, and sound was coming out of his face. But whatever is easier for you to track.
And then, in a charming and matching, or rather complementary, forest green shirt, we have Thijs Nieuwdorp. Thijs, what does your voice sound like? Thanks so much for having me, Jon. This is what my voice sounds like.
You both have... It would be helpful if one of you didn't have a Dutch accent, but fine, we'll have to just work with this. We have accent? What do you mean, we have accent?
Background and the book's origin
So, Jeroen, you've been on the show before. You were on episode 531 back in December 2021. That was the very end of my first year hosting this show. We had a great time on that podcast. It's a great episode that people can listen to. But Thijs, am I correct in understanding this is your first podcast ever? It is. I've never podcasted before. It's just since the book has taken off, we're finally getting into that marketing, and we're kicking off with the best. So I have no clue what to do after this.
I hate to let you know, it's going terribly so far. Yeah, well, I'll just start with my biggest fear and just break it down from there, right? Your biggest fear. Yeah, exactly. We'll start with your biggest fear and then we're going to move on to your biggest beer. If you could grab one of those, I think it might smooth things along.
So you guys, until recently, until February 2025, you were co-workers together at Xomnia, which is the leading data and AI consulting company in the Netherlands and maker of the open source data frame library Polars. And then Jeroen, you recently took a DevRel job at Posit. It seems like a lot of people are moving over to Posit. A lot of big names are there now, the makers of RStudio and lots of other great open source tools.
But what I'd like to speak about most in this episode is your new book, brand new, released actually the same day that we were recording this, which is April 1st. So now that this episode is live in May, hopefully it'll actually be available again, because right now, in a lot of locations around the world at least, if you tried to buy Python Polars: The Definitive Guide by our guests today, Jeroen and Thijs, you wouldn't be able to get it, because as of April 1st it is sold out. But O'Reilly are very good. They do on-demand printing. And so they should be able to resolve that pretty quickly.
We've had Polars on the podcast before. We've had Ritchie Vink, its creator. We've had another key contributor to the Polars project, Marco Gorelli. But the Polars library has grown a lot since then. It's about to pass Pandas in popularity, if we measure that in number of GitHub stars, if that's a measure of popularity. And yeah, and now it has this great O'Reilly book, thanks to the two of you.
So what spurred you guys on to write the book? What was the experience? Like, oh, and I've got to tell the listeners about this, that you're doing a book giveaway. So I think we'll give them until Sunday. What do you think? Up until Sunday? Sounds good to me.
So there's a URL. You can say what the URL is and what the free book giveaway is. We do free book giveaways on the show, lots, physical books, but there's something special about your book giveaway that we've never done before. So I'll let you guys fill the audience in. Yeah. For your listeners, we wanted to give away hard copies that are signed by the both of us. So in order to be eligible for a copy, you go to polarsguide.com/SDS and you fill in your name and email address, and then you enter the raffle. And then by Sunday, we'll, yeah, we'll let you know. Even if you don't win, you'll still get the first chapter for free.
We'll give away three copies. Nice. And people can be anywhere in the world? Anywhere. We'll take care of the shipping.
You have until Sunday to submit yourself into the raffle and get a signed copy of Python Polars: The Definitive Guide from both of its authors, our guests today, Jeroen and Thijs. Super, super cool.
So now with that out of the way, tell us what caused you to write this book and what the process was like. Yeah. It started with you, right? Exactly. So let's start with the origin story here. I joined Xomnia in January 2022.
So when I started, I was just getting to know everybody working in the office. And there was this one guy, really focused, working behind his laptop. Everybody was going to lunch, but he would just stay working. Turned out that was Ritchie Vink, the creator of Polars, I learned later. And I didn't know anything about Polars, but I didn't have an assignment yet. I had some time to explore a dataset, and I decided, let's try out Polars. And I was immediately hooked, of course. I immediately figured, okay, this is so cool. This deserves a book. This is going to be a big thing.
But I already knew, having written Data Science at the Command Line before, twice, that I never wanted to write a book by myself again. So I needed another victim, right? Someone to share the pain with. And very shortly after that, I got assigned to a client with a large code base, and Thijs was also working in that same team. So we were not only colleagues, we were also working for the same client in the same team. And I felt like, hey, Thijs seems to be good at this. He likes to write. Why don't I ask him? To which his answer was, obviously, yes.
And so I had a meeting with O'Reilly anyway about whether I could do anything else for them, maybe a video course or something related to Data Science at the Command Line. But that's when I asked them, like, hey, have you heard of this thing called Polars? They said, yeah, we've had four proposals so far, but we've rejected them all. I was like, oh, wow, four proposals already. And so that's when I knew that we had to write a serious proposal. So we wrote one, over 15 pages. We brought in all the stats that we could, and by then, O'Reilly hesitantly said yes. But after a few months, they realized, like, oh, wait a minute. This is actually going to be a big thing. We want to have this book now.
That's really scary, because writing a book is, it is torture. When I wrote Deep Learning Illustrated, it was the worst experience of my life. The only thing that came close was writing a PhD dissertation. But with a PhD dissertation, there's not that much pressure, because like two people are going to read it.
But writing a book like Deep Learning Illustrated, the idea was hopefully more than two people would read it. And I don't know if you feel differently about this, Jeroen, having now written several books before. So maybe you feel like, you know what? I can actually write a bestseller. I know the process. I know what to do. But at least for me, I've only released that one book so far. And the whole time I was writing it, I was filled with this deep concern that it would come out and everyone would realize that I was a fraud, that I had no idea what I was talking about.
Yeah, I recognize that, especially with the first edition of Data Science at the Command Line, which I wrote right after I finished my PhD thesis, I was in this groove. But I really felt like an imposter during that entire time, especially since everybody and their dog seems to have an opinion about Linux and Unix and which tools to use. And so a lot of opinionated people there, which made it all the worse.
But by the second edition, I kind of, I realized like, hey, you said bestseller. Well, I'm not sure that our book is going to be a bestseller. I'm pretty sure it's not, but. The Polars book? I mean, it might not be a New York Times bestseller, but I bet in some Amazon categories it will be.
But what I have learned is that Thijs and I can definitely write a book, right? You don't have to know everything. That's what a lot of people think, that you have to be an expert in the topic. That's not true. Maybe you think that you're an expert, but as you start writing, you'll realize that you have a lot of gaps in your knowledge. And that's when you start learning more and more about the topic. So by the time we started writing Python Polars, I was pretty confident, as long as we would stay one step ahead, right? And what definitely helped is that we were able to implement the things that we learned at our client.
And I guess the biggest takeaway here is that you don't have to know everything when you start writing a book. You'll figure things out along the way. Yeah, it turns out that the imposter syndrome is like, it's a natural part of the writing process.
Writing process and collaboration
And Thijs, what is it like working with a tyrant like Jeroen? In my opinion, it went very naturally. I think quite early on, we already noticed that we have relatively complementary writing styles, in that I just start putting words on paper, then start restructuring, moving things around, and refining. Something that stems from the time I was writing my thesis, when I couldn't get anything on paper because I was judging everything I put down, like, nah, that's not quite it. And you kind of get stuck in that, right? So I learned to just get stuff out on paper. It may not be in the proper format and the right semantics, not exactly the nuance you want to catch, but ultimately it gets you to where you want to be. The first 80% needs to come first, and that's not perfect yet. And one of Jeroen's qualities is that he can use his perfectionism in such a way that he's very good at the refining phase. So when I've put some meat in the chapters already, he comes and moves stuff around: oh, have you thought about this, or shouldn't you word it like this? And that really dots the i's. So in that sense, not necessarily a tyrant, just a very effective perfectionist.
Yeah, there is a fine line and I am very well aware that there is such a thing as preparing too much, as overthinking things. And it really helps when there is already something on the page. So for example, that could be text written by Thijs or by myself. What I sometimes do as a trick, whenever I feel I'm stuck, because this is a book that involves a lot of code, I'll first write all the code cells, all the code chunks, so that I can then fill in the gaps with text along the way. That's one of a couple of tricks that I could apply here.
Polars in production at Alliander
So there's a lot of foundation around your book and around Polars and why people are using it more and more for data frame operations today. Earlier in the episode, you mentioned a real-world implementation of Polars, and maybe, as you said, the first ever production instance of Polars. And so am I right in understanding that's Alliander? I'm probably butchering the pronunciation of that. Yeah. Alliander. It's a power grid provider in the Netherlands. So they provide the infrastructure for both electricity and gas in a third to a half of the Netherlands, I believe.
The origin story here is that Thijs and I, we were both very excited about Polars. We were writing a book about it. And then, all of a sudden, it became clear that at Alliander, we needed to speed up the pipeline. Right? We needed to lower costs. We needed to process much more data. And in the current state, that just wasn't possible. It was a combination of not only Python and Pandas, but also R code. So it was very inefficient. To give you an idea, we were running this on a single AWS instance that had over 700 gigs of RAM. 700 gigs of RAM.
But we were very excited, and we were like, hey, let's try this out. Let's do this. At first, the team was very hesitant, right? We had two people, or three actually, we had another colleague, three people promoting Polars, which is being developed at Xomnia, right? So they were very skeptical, understandably. So what we did in order to convince them was to just take on a very small piece of code, some low-hanging fruit, re-implement the Pandas code in Polars, benchmark it, and then just show the numbers. And by then, they were immediately convinced: right, this is indeed way faster, uses way less memory. Let's try this out.
Let's take on this huge code base, piece by piece, by translating, not one-to-one, because you can't do that. You really have to reason about the inputs and the outputs, and then do it in an idiomatic way, right? You cannot just translate Pandas to Polars. I think it took us, well, what, six months, a year? I don't even remember. I left that client at some point, but eventually there was a moment like, okay, we can now get rid of R and Pandas as dependencies of this project, and it's been running smoothly ever since.
Yeah, I think ultimately the size of the jobs at the beginning was about 500 gigabytes for just that task, one calculation, and we shrunk it down, both as a consequence of implementing Polars and of rehashing some of the code structure we were using in the project as we went. We took it all the way down from 500 to 40 gigabytes, which makes it a lot more doable to run calculations.
So the second part of your question was, okay, how did the book writing and putting it into production influence each other? And this was a perfect match, because when you actually need to put it into production, when you have a real problem to solve, that's also when you start to notice the limits, right? Or maybe inconsistencies or missing functionality. For example, there was this random sampling with weights, right? That's something that you can do in Pandas. You just give it another column that indicates the weights for the sampling. That's something, maybe even up until this point, that Polars doesn't have.
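For reference, this is roughly what that weighted sampling looks like in Pandas, with made-up data; the `weights` argument can name a column, and `random_state` just pins the draw for reproducibility.

```python
import pandas as pd

df = pd.DataFrame({"item": ["a", "b", "c"], "weight": [0.1, 0.1, 0.8]})

# 'weights' may name a column: rows with a larger weight are drawn more often
sampled = df.sample(n=2, weights="weight", random_state=42)
```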
Also, when you write, you start to look at things from a little bit of a higher level. So sometimes we noticed inconsistencies in naming or missing methods, like, hey, why is there no inline operator for the XOR operation, right? That's something that nobody ever thinks about, but when you need to put in a table in your book and you need to fill in all the pieces, that's when you start noticing these kind of things. So we were able to also submit some issues, maybe even a few pull requests to Polars itself along the way.
Why people are switching from Pandas to Polars
When we were talking with Ritchie, over the many times we talked with him over the course of writing the book, one of the main experiences that shaped how he wanted Polars to work was the frustration he had when running his pipeline: only 20 minutes in, you run into some trouble and it crashes. And that's not something you could have seen up front. So this is one of the experiences that helped him shape what Polars ultimately became.
And there's also a lot of good that he saw in how Pandas works that he wanted to take and put in Polars. Generally, Pandas became a big inspiration, both good and bad, for Polars and also for other libraries. With Spark especially, you can see that its syntax is a lot like how Polars turned out. And there are other elements, for example from the Rust language, that Ritchie took and implemented in Polars because it just made it work so nicely.
Maybe this is a good moment also to clarify that we are very appreciative of Pandas, right? We're not here to bash Pandas at all, right? Without Pandas, there wouldn't be Polars, right? So we are very much appreciative of what Wes McKinney and his team have done. Absolutely. Standing on the shoulders of giants, right?
The grammar of Polars expressions
So like R's tidyverse, which they make at Posit, where you now work, Jeroen, with Polars there's also a grammar and a naming convention that is encouraged to preserve semantic clarity, which means that not only can you understand your code better when you come back to it later, but other people that you're working with can understand it more easily as well. In the book, you two compare expressions in Polars to recipes. So specifically, I'm going to read a little snippet of your book here: if you think of an expression as a recipe, then the operations would be the steps, and the functions and methods would be the cooks.
So how does this metaphor shape your philosophy about best practices with data transformation design in order to deliver clean, readable pipelines, especially in large collaborative projects? Big chunk. The short answer is no more brackets. When you read Pandas code, there are many brackets in there, and in a lot of cases, it's very difficult to reason about what the code is actually doing. And so with Polars, you take a different approach, not only with those expressions, which are indeed the building blocks, those small recipes, but also in the place where you use the expression, namely the entire query. So it's almost like you're writing a paragraph, right? To come back to book writing, you're writing a paragraph of things that you want to do. It's a logical element in your entire pipeline. And that's much easier to reason about.
Yeah. And I think one of the things I've always liked most about Polars is the very declarative approach of what you're writing down. So in Pandas, it can be the case that you're very focused on specific operations on parts of your data frame, which can make it hard to follow what exactly is going on under the hood. But with Polars, you declare what you want as the end result, and you just leave the specific processing and optimization to the engine. And that makes it way easier to read.
Great Tables for data visualization
In your book, you introduce a way to style tables with a package called Great Tables. Great Tables. But yeah, if you're typing it out, the package is great underscore tables. And in a talk, Jeroen, you actually mentioned that tables are underappreciated in visualization. So could you elaborate on why this Great Tables package was created? What does it do? And maybe, what is its advantage over existing approaches out there?
In hindsight, now that this package exists, right, Great Tables, it's strange that there wasn't already a package for this, right? Because tables are everywhere. Especially when people are working with Excel, a lot of people really like to add styling to their tables in order to make them presentable to stakeholders. Add some color, use currencies, what have you. Maybe some mini graphs in there, right? So now that it's there, it's so obvious that there should be a package for this.
So Rich, I'm actually not sure how to pronounce his last name. The creator of Great Tables, Rich Iannone, I'm butchering that. But I do know the co-creator. Well, they're both my colleagues, I should know, but I just call them Rich, and Michael Chow. Great folks, you should have them on the show as well. They created the Great Tables package.
And just a few days ago, I saw a post by someone about Polars advocating, or actually recommending, that it's useful to add in the dollar sign when you're presenting currency. But what he was doing, he was actually changing the underlying data. I'm like, wait a minute, that's not the way to do it. You want to change how it's represented, right? You want to put this layer on top of it. That's what you need to do. And that's what Great Tables can provide. So you're not changing those floats or integers to strings in order to format them. That's not the way to do it. There should be another layer. And Python has a myriad of data visualization packages. But when it comes to producing tables, well, I only know of one, and that's Great Tables.
So with Polars, you can indeed style data frames using the Great Tables package created by Rich Iannone and Michael Chow. So you use the df.style accessor, and that will then use the great_tables package under the hood.
Yeah, I wasn't really hinting at performance issues there. It just doesn't feel right to me that you're changing the actual data. You want to keep the data the data, because you never know what you want to do after that, right? Maybe you want to have a subsequent calculation going on, and then you have to strip the dollar sign again. Maybe you want to have round numbers. So there are many, many instances where you just want to change the representation and keep the underlying data intact.
UV package manager
So we have been talking in this episode, obviously, about Polars a lot, which is a popular Python package. But another Python package that is really taking off recently is UV. And so it's a Python package and project manager that climbed from zero GitHub stars to, yeah, I actually don't have the number in front of me right now, but a very large number. Lots of people are talking about UV. And it has exceeded Poetry, another longtime favorite for package management.
So Thijs, in a blog post, you mentioned ditching Poetry for UV. You talk about increased speed, reliability, and ease of use as the reasons for that. Do you want to tell us more about UV, Poetry, and whether people should be calling it quits with Poetry? Gladly. Yeah. So this is also one of the things that we decided to do for the book. We started out with Poetry and did all the management of Python versions with pyenv and other tools around it. But ultimately, when we were prepping the repo that can be used by the readers of the book, which contains all the different notebooks that come with the chapters, so you can follow along and execute the code yourself, play around with it, obviously you need to easily set up an environment that can work on the many different systems your readers might have.
So in the beginning we were thinking maybe to go for Docker because that generally is the easiest way to make something run on different kinds of configs. But as we were writing the book, UV became bigger. And at one point I just started experimenting a little bit with UV to see how easy it is to set up. And it boiled down to installing UV and then running UV sync, and it sets up everything. It sets up the right Python version, it just finds the right dependencies for your system. Everything just clicks. So that's ultimately what we went for as the final solution for that repo, to allow people to just install UV and just make it work.
And one of the reasons I started playing around with UV was that it goes with the trend of Rust-based tooling, which shows that the performance of tooling is a feature in itself. It's one of the things that Polars showcases, and UV has the same kind of thing going for it: Rust-based tooling, which is leagues faster, it can be more than 10 times faster. That, combined with the single-command setup, made it a very quick and easy win.
Maybe you can say a few things about the regression that you found in Polars, where UV came in handy. Yeah, so at some point... UV is so fast that you can set up an environment on the fly, an ephemeral environment that exists for just that one command and is then torn down again. Wow. Yeah, so I was playing around with that to benchmark the different versions of Polars, to see what the speed is on different queries, different kinds of setups, iterating over the versions and bumping it every time to see what happened. At one point I found, I think in version 1.2-point-something, that there was suddenly a regression. The queries started taking 10% longer to run the full benchmark, and it didn't really go down again. Drilling down, we were able to pinpoint the two specific queries of the benchmark that spiked up on a certain version. And because UV sets things up so quickly, a script for git bisect, which lets you pinpoint the exact commit in the Polars repo where the regression started occurring, allowed us to find which specific commit caused it.
And funny enough, when I communicated it to the Polars guys in that week, they hit the same code. For some reason, they couldn't quite figure out quickly what exactly caused it, but they hit the same code and refactored it and it resolved itself. So ultimately, all was good. But it was interesting to finally have a package manager that was able to be used so quickly that you can start using it for complete new use cases that you couldn't have thought of before.
The command line
And so your whole previous episode on this show, 531, was all about data science at the command line. Obviously, you've written two editions of the book, as discussed. You've also written R packages that make the command line more interactive and playful. I don't know if I can pronounce them properly. There's raylibr, R-A-Y-L-I-B-R. Raylibr. Well, raylibr is a wrapper around raylib, which has nothing to do with the command line; it's a C library to create video games.
But nevertheless, in that course, or at least in the course information, you say that the command line is as powerful as it is intimidating. So for our listeners out there who maybe haven't crossed that emotional barrier, maybe they do program, they use Python, maybe they use R, or whatever programming languages they use, but they haven't crossed that threshold, that emotional barrier, to start using the command line, what do you recommend to students to kind of get past that emotional barrier and see the command line shell as a great creative space for data science and software development?
Yeah, it's unfortunate that when you first see this window, this terminal, this blinking cursor with a prompt waiting for your commands, that it's so intimidating. But of course, when Unix was first created in the late 60s and 70s, they didn't even have screens at that time, so things weren't all flashy. So there is indeed a hurdle for you to take, for you to embrace the command line. And there are certain tricks that you can apply, certain changes that you can make, in order to make the command line a more pleasant, more forgiving environment.
So things that I always like to do are, let me try to come up with a couple of them. First of all, use colors that you like, use a font that you like, and add aliases so that you don't have to remember these long commands, these long incantations, by heart. You make the experience more ergonomic. It also helps to work in an isolated environment, so that you know you won't be able to break anything; Docker can be used for this. And if you do these kinds of things, experiment with the command line every day for a little bit, and don't try to do everything all at once. I just use it here and there, right, as a complementary set of tools in addition to all the other data science tools that you want to use. And then you'll gradually build up more and more appreciation of the command line, you'll be able to embrace it more and more, make it your own.
NVIDIA and Dell collaboration
And so, first, what I wanted to talk about: you guys may or may not be aware of this, but in 2025, two of the biggest sponsors of this podcast, to whom we're very grateful because they allow us to keep the lights on and make this show for everyone, are Dell and NVIDIA. And it sounds like for the appendix of your book, you had some kind of partnership with Dell and NVIDIA that allowed you to do more. Explain how they were involved with your book.
Yeah. So at a certain point, I got a LinkedIn message from NVIDIA. It was something about being an influencer, and at first I didn't think much of it. After a week or two, I decided to reply like, all right, I'm interested, let's chat. And it turned out that they actually wanted to collaborate with us. They were quite eager to send us some hardware so that we could benchmark Polars on the GPU. And we're like, great. Only thing is, we don't have anything to put that video card in. So that's when they brought in their partner, Dell, and Dell was able to supply the rest of the hardware.
So that was a fantastic collaboration. And the way we did this is, Thijs can say more about the software side of things, but in terms of hardware, it was all in the States. So Dell had this laboratory where they had, you know, a beefy machine, and they were able to swap out different NVIDIA video cards. So we did the RTX 6000, no? The Ada generation. The Ada generation. So these were all professional video cards, not the consumer ones. Yeah. So the workstation variants.
So, yeah, we were, it was very important for us that we were able to benchmark things ourselves, that we wouldn't just copy numbers from some, you know, some leaflet, right? Some promotional material. We wanted to produce these numbers ourselves if we were going to put them in our book. And you know, that was all fine. NVIDIA and Dell thought it was a great idea. And so eventually we were able to try out five different video cards for a number of different settings and packages. And that's all reported now in the appendix of the book.
So to start off with a little more context: NVIDIA has a team called RAPIDS, which works on creating all kinds of general-purpose computing packages that can run on the CUDA platform. And CUDA is the compute platform that NVIDIA opens up, so you can run effectively any kind of calculation on the GPU. And the difference between a normal CPU and a GPU is that a GPU has many relatively dumb, simple processors, but just very many of them. So if you're able to bend a calculation problem into something that the GPU can run, it oftentimes accelerates by a lot, by a factor of up to 10.
So they also did this for packages like Pandas. They have cuDF, which is what their package is called. It's a data frame library, but it runs on the GPU. And they wanted to collaborate with Polars as well. But since Polars has this layered architecture, where a query runs through an optimizer first and only then gets sent to an engine, it would be a waste to just put the Polars API on cuDF and translate it to normal cuDF functions, because a lot of the performance enhancement in Polars comes from this optimization. So instead, RAPIDS worked together with Polars and designed a GPU engine that gets its input from that optimization layer.
Because they recently finished the open beta for this new package, they got in contact with us to ask, hey, you guys are working on the book of Polars, do you want to collaborate? And we said, well, with the terms that we could test stuff ourselves and benchmark ourselves, we definitely said yes, because it turned out to be a lovely collaboration as well.
So we already noticed that, in the beginning, the promotional material was a bit careful about what size of data set would benefit from this. And it turned out from the tests that we were doing that the benefit kicks in quite quickly. Because data needs to be transferred to the GPU, you get a small overhead, so you only start seeing the difference when the data set size grows. But it's already from one gigabyte and up, which is relatively quick, because most data that you would work with in a professional setting usually tends to grow a lot. And we also noticed that even the relatively smaller GPU cards with fewer processors already benefit a lot from this, already get a big speedup from just using the GPU engine.
Marco Gorelli and the rewritten chapter
And then one final story that I want to get in here. So I mentioned already how Marco Gorelli, so Marco Gorelli was our first ever Polars episode on this podcast. That was episode 815. And then he introduced me to Ritchie, the creator of Polars, who came on not long after that, a couple of months later, in episode 827.
So I understand that there's an amusing story involving Marco somehow sabotaging your book and forcing you to rewrite an entire chapter. Yeah, it's amusing now. So we have to go back a little bit further. It was at a Christmas party organized by Xomnia, where Ritchie was also present. And Ritchie was like, yeah, Polars is going to have data visualization capabilities. I'm like, what? Python doesn't need another package to do data visualization. There are already two dozen out there. So at first I was like, man, they keep expanding the library. We just want to finish this book.
After a while, I started to realize, okay, maybe it's not so bad. I mean, if the book has a chapter about data visualization, maybe it'll sell better if it has pretty pictures. So I started writing. I was quite happy to find out that Polars itself doesn't do any data visualization. It has the df.plot namespace, but every method in that namespace calls out to another package, hvPlot. And I wasn't familiar with hvPlot yet. It's this meta package which can target Matplotlib and Plotly and another one, Bokeh.
And so, okay, I really had to get into hvPlot, but I didn't just want to write about hvPlot. I also wanted to include Great Tables, right, I was a big fan of that. And you could argue that presenting a table is also a form of data visualization. I'm also a big fan of plotnine. So it was going to be a huge chapter. So I'd written all this, and then all of a sudden I see on GitHub this pull request by Marco Gorelli. And he was like, okay, I'm going to swap out hvPlot for Altair. I'm like, what? Now I need to rewrite the entire chapter, or at least a big portion of it. Like, Marco, what are you doing?
Now I know that Altair is a very good choice for this, especially when you are working in a browser and you want to create interactive data visualizations, right? That is something that plotnine, for example, doesn't support. So Altair definitely has its use cases. And I should have known better as well: hvPlot at the time, or the whole plotting functionality in Polars, was marked unstable. So I should have known better. I was just too happy to get it out there. And you know what, Marco and I, we get along really well. We now collaborate on getting Narwhals, right, his project, into plotnine, so that plotnine supports Polars as well as Pandas.
But yeah, that's the story of how I had to rewrite nearly everything inside Chapter 16, Visualizing Data. Nice. Great story. And it is funny to imagine Marco sabotaging your book, because he's an extremely nice guy. We actually, the episode with Marco, I don't know if you know this or not, but I recorded it with him in person in London. He took a train from Cardiff to London, which is like three hours or something, to come and record the episode. And then we went for dinner afterward as well. And you really get this impression of a man who is exceedingly kind.
I hate it. I hate it. I wish he wasn't like that. No, he's a very generous and kind person. Definitely a pleasure to work with. And I definitely love his dry sense of British humor. It's perfect. Every time he speaks at a conference, he tries to incorporate an expression from that country. So when he was presenting at PyData Amsterdam, he used the expression helaas pindakaas, which translates to "too bad, peanut butter." Doesn't make any sense if you're not Dutch.
So people want some more of that humor. And he was also extremely technical. I mean, the depth, you know, we haven't gone into that in this episode, and that wasn't really the point of this episode. But in that one, as well as Ritchie's episode, so in 815 with Marco Gorelli or 827 with Ritchie Vink, in different ways, you get into the nitty-gritty of why Polars is so fast under the hood. And so if people want

