
Polars Meetup #1 - Migrating a large codebase to Polars by Jeroen Janssens and Thijs Nieuwdorp
In this community talk, Jeroen Janssens and Thijs Nieuwdorp share their experiences and best practices for migrating a large pandas codebase to Polars at one of the largest utility companies in the Netherlands. By implementing Polars, they achieved a 98% cost reduction. Watch the video to learn how you can start migrating your own codebase. Check our Meetup page to see when the next event is planned: www.meetup.com/polars-meetup/ #polars #dataframe #python #dataengineering #datascience
image: thumbnail.jpg
Transcript
This transcript was generated automatically and may contain errors.
Hey everybody, welcome back. So let's switch things up a bit. Let's go from the technical to the practical.
Last year, we had a big project where we converted a large codebase, say about 20,000 lines, from mostly pandas to Polars. And by doing so, we achieved a staggering 98% reduction in cost. Isn't that amazing?
And just as important, we got happier as a team because now, all of a sudden, we had a codebase that was easier to work with, easier to maintain. And in the next 30 minutes, we're going to explain how we did this, and we're going to share with you the lessons that we've learned along the way.
But I first need to give you a couple of disclaimers here. Because even though the title says that we converted from pandas to Polars, we cannot blame just pandas here because the codebase also consisted of a lot of R code, right? But that doesn't make for such a good title.
So I am all for a polyglot approach, right? Use the tool that gets the job done. But in this case, it was really hurting our performance and also the maintainability. It was half R and half Python, with Python calling out to R. Oh, my. I'm getting flashbacks all of a sudden again.
So you should take the numbers that we present, especially the relative numbers, with a grain of salt. Also, I should say that this is not a tutorial about how to translate pandas code to Polars code, right? If you're interested in that, you can read Chapter 3 of our book. What book? This book. So it came out two days ago, right?
So yeah, for nearly two years, Thijs and I, we have put our hearts into this. So it's quite unreal to see this book on the shelves. And funny story, we don't even have our own copies just yet because apparently the book was sold out. And tonight, you and also the people joining us online have a chance to win a hard copy, a signed hard copy even. Once we have it and we can actually sign it, we'll ship it to you wherever you are.
So in order to be eligible for a copy, go to polarsguide.com/meetup or scan the QR code, fill in your name and email address, and then by the end of this talk, we'll draw a name.
About the speakers and Polars
So a few words about ourselves so that you know where we are coming from. So my name is Jeroen Janssens. Here in front is my co-author, former colleague, current friend, Thijs Nieuwdorp. And so Thijs still works at Xomnia. I used to work there up until two months ago.
And Xomnia, I should say, is a boutique data consultancy here in Amsterdam where, fun fact, Polars actually originates from, because Ritchie Vink, who's now sitting over there, used to be our colleague. That's where he first started developing Polars, right?
These days, or two months ago, I joined Posit as a developer relations engineer.
There are really only two things that you need to know about Polars. One is that it is blazingly fast. Here you can see our results from a benchmark that we ran on a couple of queries, next to pandas. Note that the x-axis is on a log scale, right?
And the second thing that you need to know is that Polars is very popular. Just look at this growth. And this graph is a couple of months old; it's currently at around 32K GitHub stars. You know what? I think that's actually quite an achievement.
No, I mean, it's incredible. My prediction is that in about three or four years, it will have surpassed pandas.
But popularity is, of course, not a thing to strive for in itself, right? You shouldn't just use the thing that is most popular. But it does say something about the project. When you're choosing a new technology, you want to be sure that it's going to be around for a while. You want to be sure that when there is a problem, that there is a community that can help you, or maybe even other developers that can answer questions, or that whenever there is an issue, it gets resolved. And Thijs and I, we can safely say from our own experience that whenever we encountered something, that those issues are very often resolved very quickly.
The client and the problem
So a few words about the client, right? Because we work at this consultancy, Xomnia, and Thijs and I were both placed at the same client, in the same team, at Alliander. So let me share a few things about our project and about Alliander. Alliander provides electricity. It's the largest utilities company in the Netherlands, right? Also here in Amsterdam.
And yeah, these numbers should give you an idea of the size, right? And what matters is that number, 35,000 networks. That will come back in a moment. Because that influences the amount of computations we have to do, or that our code has to do.
And Alliander has a problem, just like all the other utilities companies in the Netherlands. Red, red is bad. Red means that there is no more capacity to add things like heat pumps, or charging stations, or solar panels. For us consumers, it's not so much a problem just yet, but on the commercial side of things, for commercial connections, Alliander very often just has to deny them. Sorry, our network cannot handle whatever it is you want. So they're being waitlisted.
And so Alliander has this huge task that they need to reinforce their entire network. Yeah, that means digging up cables, replacing them, installing additional transformers. You can imagine that this is a very costly and time-consuming endeavor. And what they want to do is they want to go from a reactive approach to a proactive approach, so that before problems actually start to arise, based on our predictions, they can say, you know what, let's take on this particular network in this particular neighborhood. Let's reinforce that network so that we can accept more connections, more heat pumps, solar panels, and what have you. That's what we were developing, this tool called Delphi.
The before situation
So the before situation, what were we dealing with? Processing just a single network. Remember, Alliander has over 35,000 networks. Processing just one of them takes over five hours. And part of this calculation is a simulation where we need to produce samples. Ideally, we want to produce 50 of them, but the codebase at that time only allowed for 25. So we were already limited right there. This was being run on a beefy machine on AWS. The process used over 500 gigs of RAM, for just a single ETL process.
So then, all of a sudden, upper management came down, and they said to us, guys, we have this great idea. Instead of running this calculation once per year, let's do it once per month. And instead of computing it for a handful of networks, let's do our entire grid, 35,000 networks. And you know what? You have 5,000 bucks a month to do this. We're like, wow, yeah, that's a great idea. But how are we going to do this?
You know, Thijs and I, of course, we immediately knew what the answer was going to be. Can anyone guess? Polars, that's right. It's just that we were part of a bigger team, and we still had to convince the rest.
And yeah, imagine this, right? It's a team of, what, six, seven people, and three of those people came from Xomnia. And here we come and say: you know what we should use? We should use this product called Polars that we developed in-house. We should rewrite our entire codebase with it. So you can imagine that the others, the ones actually working for Alliander full-time, were a little bit skeptical, and not so eager, nor lazy.
So we had to convince them. It also didn't help that Polars wasn't well-known back then. We're talking almost two years ago, yeah? And yeah, this was pre-version 1.0.
Show, don't tell
So what did we do? We showed them the numbers, right? We needed to convince them with actual results. So we took out a small piece of pandas code and translated that to Polars. Not one-to-one. You cannot translate pandas to Polars line by line, right? It's the same as translating English to Dutch word for word.
So this is a matter of examining the inputs, examining the outputs, and then reasoning, you know, how can we accomplish this using Polars, right? It requires a different way of thinking. No more brackets. No more indices.
What matters is the result. We went from a thing that took 30 seconds to a new thing that did the same job in only about a second, and that really convinced them. So that's our very first lesson of today: show, don't tell.
And now I'll give the floor to my co-author, former colleague, and current friend, to tell you about, you know, some other lessons that we've learned along the way.
Benchmark all the time
So the first lesson that I can share with everyone: benchmark all the time. Literally after every step, run a benchmark and see what the improvements are, because this allows you to be guided toward what actually makes an impact, right?
One of the things that we ultimately didn't do well enough was keeping track, through all these benchmarks, of which changes were making the difference. So we had to dig into some of the CloudWatch logs, dig down into the commit history, and start rerunning things from way back before we started implementing anything. That way we were able to work out all the differences and put numbers to all the improvements we made overall.
So this is one of the things that we did in the beginning: just a very small network that we were able to run locally, or I think on AppStream machines, just a smaller machine. It had a peak usage of about 37 gigabytes of RAM, and that's just one sample, just a very small example, just to see what kind of difference we made. And after all the changes we made, we were able to reduce that to about a third, for just this small example.
And this already allows us to go to management and show: hey, this is the difference we're making. It's hard to convince management with "the code is going to do exactly the same thing, there are no new features, please let us do it." That never sits well. They want features, not the same thing. But if you start putting numbers to it, that's already a lot more convincing than having nothing to back it up.
On a bigger example, this is one of the jobs we needed the 500-gigabyte instances for. So this is something you see building up over time, which also had to do with the type of architecture we were using in the code. Before we came in and before we started using AWS, there was this massive on-premise server that you could just claim for yourself and crunch the numbers over the weekend. So it was never that big of a deal. But as we moved to AWS, where you pay for what you use, it obviously became a bigger deal.
So the before picture: this is one of the jobs we found, at 465 gigabytes. And afterwards, you can see that, in a more iterative manner, we designed the code differently with the help of Polars, and now it's only using 75 gigabytes of RAM. For exactly the same type of job, that's a big improvement. And one of the things you can't quite see in here is that we're able to do twice the number of samples, which was not possible before.
So with the benchmarks, these are the questions you can answer: you can drill down into which exact changes make the difference, and you can put numbers to the business impact, to share with your stakeholders what you're up to.
And one of the things I didn't talk about yet is how it can prevent regressions. While writing the book, I was benchmarking different versions of Polars with the PDS-H benchmark, and we noticed that somewhere along version 1.2, a regression crept in for a very specific set of queries. By benchmarking often, you can find these kinds of regressions. Especially if you're translating code to different ways of doing things, from pandas to Polars, or from R to Python in the first place, there's always a risk that you tweak something the wrong way, that you hit something you didn't intend to. By benchmarking often, you can catch these things quickly, prevent them from happening, and prevent building on top of them.
Work lazy, not hard
That brings me to the second lesson: work lazy, not hard. I bet in a room full of programmers, that's something that will resonate. So in the beginning, as we were getting to understand Polars better, we started working in the more eager way, but lazy makes a big difference. As Orsan already explained, this is one of the differences you can see. It's the same type of benchmark, again with the log scale, and you can see that Polars, even compared to itself, makes a massive jump in performance with the lazy engine compared to the eager one, just because a lot of work doesn't have to be done. So work lazy, not hard.
Another thing that we learned, with this lesson of putting everything into lazy mode: we tried to make the whole query in our program lazy, entirely. But that's also not the optimal way to do things. There's a golden middle road, where some stages need to be cached, or persisted, depending on which framework you're working with.
For example, Polars itself already does some caching here. So if you have two subqueries running that both use the same frame resulting from another part of the query, it will be cached, and both of these branches in the tree can read from that cache. But this gets kind of hairy the moment you start running gigantic queries in programs with a massive pipeline. The problem is that it has to keep all kinds of things in RAM; it starts blowing up, it gets bloated, it just grinds to a halt.
So one of the tricks is captured in one of the simplest diagrams we've added to the book: the conversion between a DataFrame and a LazyFrame can be done with just two methods. So for parts of the query that we ran, we just collected them on the spot. Save it to RAM for now, and from there you can reuse that frame, which you keep in RAM until it's not needed anymore, while you build more lazy queries on top of it.
So one example of that, and you're going to need a little bit of imagination to keep it possible to show on a slide: you take a LazyFrame, and say you have some heavy computation that you need to apply to it. In this case, with every collect, you run the lazy query all the way from scratch. That means this heavy computation is being run twice, which is a waste. So instead, you can cache it: you run the heavy computation once, you cache it to RAM, and from there the two calculations that follow just take that sub-result, that intermediate stage, from RAM.
Consistency and fewer surprises
One of the things we learned to love... we came for all the promised performance improvements of Polars, of course, but one of the things we came to love is the consistency that Polars offers. I'm not sure if you have worked with strings in pandas. Luckily, the Arrow backend improved a lot of things, but back when it was still NumPy-based, there were all these different types that you could use for strings. In the beginning, it was just a Python object stored in the frame. Later, they came with an extension type, the string dtype, and there's also the Arrow-backed string dtype, which ties into this. In Polars, it's just a string, as you would expect.
On top of that, Polars also doesn't use an index, which is something that's almost... I'm kind of scared to call out the index in a room full of Polars users, but an index, or even worse, a multi-index, brings all kinds of unintended behavior with it. You can get all kinds of unexpected NaNs or incorrect joins that you hadn't expected, and it reduces your flexibility by a lot. Especially with reshaping or filtering, you often have to reset or re-index your frame, which is something you don't really want to think about if you're just working on your data.
For example, because pandas aligns on the index, sorting a column and assigning it back gives you the same result, so it's effectively a no-op. Yeah, that's not what you expect it to be. If you say, hey, give me this column and sort the values, you want that column but sorted, and instead it just doesn't do that. All kinds of stuff like this, Polars just doesn't do.
On top of that, I'm not sure if you have worked with missing values in pandas. So I made this table with an overview of different data types, and it shows the preferred way of representing a missing value for each of those types. Recently, pandas.NA got introduced, which already made it a bit better. But still, for numerics it's NaN, even though, according to the standard, that means "not a number": the operation you did was on numerical data but resulted in something that can't be represented as a number. That's not the same as missing, so it doesn't quite match. Same thing with datetimes, where you get NaT, not a time.
Ultimately, in Polars, it's just a null value if it's missing. So all missing data in Polars can be represented with a null, and every data type supports it. Even under the hood, the engine is optimized to deal with these null values. Missing data will never convert the data type to something else.
So ultimately, it boils down to this: NaN is not a missing value but the result of an invalid operation, and Polars just treats it that way, as it should.
Taking baby steps in migration
So back to converting the codebase. If you have a 20,000-line codebase, you're not going to get that done in one sprint of two weeks. So the way we tried to do this is by taking baby steps. We just took the parts of the program that we knew were taking the longest time to run in pandas, and that's where we started.
This requires that you're able to work on only small parts of your program, and that's relatively easy. If you want to go from pandas to Polars, you just throw your pandas DataFrame into the Polars DataFrame constructor, and you can start working your Polars magic on it. And at the end of the part you want to work on, you just go back by calling to_pandas, and you get a pandas frame back. This way, you can quite easily jump between pandas and Polars and start working on the parts that you want to translate to Polars first.
Join the community
One of the things that Jeroen already touched upon: join the community. As you can see here, the community is big, it's growing, and it's extremely helpful. I remember talking with Ritchie one time about people in the community, about how, even though Ritchie wrote Polars, there are still people in the community that sometimes know how to use it and take certain steps better than he does. And I have the same thing. I wrote a book on it, but there are still people in the community chats that can help you out with problems so quickly, and they know exactly the right methods to grab. The community is very experienced, very open, and very active on Discord, and I can definitely recommend hopping in there to get to know the people.
The after situation and results
So, having learned all these lessons and all these magical things during our journey at Alliander, ultimately it boiled down to a result, of course. We promised a 98% reduction, so we're going to have to come up with something. So, the after situation. One net, one of those 35,000 nets, now took four hours, which is a reduction of 20%. And we did that with double the amount of work: instead of 25 samples, we finally hit the target that the stakeholders set for us, which is 50 samples.
Instead of the 500 gigabytes of RAM, we can now do it in 40, which is a 92% reduction. And, of course, it's now fully in Python, powered by Polars. This mostly helped the team in maintaining the code. If you have to maintain R scripts that are connected to Python through an rpy2 bridge, that's just going to get hard to maintain at some point. At this point, it was fully Python, also running in a Docker image that only needs Python, instead of all the R packages as well.
So, with this new setup, we were finally able to run a calculation on the entire low-voltage grid, which is those 35,000 nets. And, luckily, we did this at a lower cost than expected. Where the target spend was $5,000, we were able to run it for $3,500, which is 70% of our target. So, all of this was done even under the target spend.
So, that brings us to the big calculation: if we had used the code that we had two years before, it would have cost us 140K. And, sorry for AWS, but now we're able to do it for 3.5K.
So, ultimately, the big takeaway is that we definitely came for the speed. We came looking for the performance, and that's definitely one of the things that pulled us in, that helped us convince the team that this is a good idea. But, as we started building with it, we just came to love it, and we ultimately stayed for the API.
Summary
So, to summarize: start with show, don't tell. Let the numbers speak louder than all the cool stories you can tell about Polars. "Trust me, bro" is not going to do it for management, so you need to back it up with some facts. You can get those by benchmarking.
Start early, and start practically. First thing, if you're going to work on this: okay, what's the status quo, what's the baseline? And just keep benchmarking. Every once in a while, after you add new features, run it again and see what's up.
Definitely start using the lazy API. The eager API is brilliant for working iteratively and seeing how you need to tweak your query to get it to do what you want. But ultimately, when you're putting it in your applications, you have to make it lazy. That makes it way faster.
But going all-in on the lazy thing isn't optimal either. If you do heavy calculations, think about caching at some midpoints in between to prevent double work.
As you're doing that, you'll notice there are fewer surprises in Polars, which is one of the reasons we came to love it. And you can easily start small. You can swap between pandas and Polars relatively easily with just a couple of methods, so you can take out the parts where you think it can make the most impact.
Join the community. We would love to hear the stories; we would love to hear you share your successes. And if you need help, there's a community that's glad to help you out.

