
Polars: The Blazing Fast Python Framework for Modern Clinical Trial Data Exploration
Polars: The Blazing Fast Python Framework for Modern Clinical Trial Data Exploration - Michael Chow, Jeroen Janssens

Abstract: Clinical trials generate complex, standards-driven datasets that can slow down traditional data processing tools. This workshop introduces Polars, a cutting-edge Python DataFrame library engineered with a high-performance backend and the Apache Arrow columnar format for blazingly fast data manipulation. Attendees will learn how Polars lays the foundation for pharmaverse-py, streamlining the clinical data workflow from database querying and complex data wrangling to the potential task of prepping data for regulatory Tables, Figures, and Listings (TFLs). Discover the 'delightful' Polars API and how its speed dramatically accelerates both exploratory and rigid data tasks in pharmaceutical drug development. The workshop is led by Michael Chow, a Python developer at Posit who is a key contributor to open-source data tools, notably helping to launch the data presentation library Great Tables, and focusing on bringing efficient data analysis patterns to Python.

Resources mentioned in the workshop:
* Polars documentation: https://docs.pola.rs/
* Plotnine documentation: https://plotnine.org/
* pyreadstat: https://github.com/Roche/pyreadstat
* Examples of Great Tables and Pharma TFLs: https://github.com/machow/examples-great-tables-pharma
* UV Python package manager: https://docs.astral.sh/uv
image: thumbnail.jpg
Transcript
This transcript was generated automatically and may contain errors.
Yeah, thanks everyone for joining us. We're so excited to talk about Polars, and I actually forgot the long name of this workshop, but how it relates to clinical reporting and TFLs. I'm here with Jeroen Janssens, who, I would say, is the Polars big dog. So it's a crime that I spoke first, because I feel like Jeroen is the authority on Polars. And so I feel really lucky to have him here.
Jeroen, do you want to take it away? Yeah, yeah, yeah. No, Michael, I can't believe you spoke first. This is not at all how we rehearsed. But I'll forgive you. I'll forgive you because look at our shirts. Look at our shirts. They are DataFrame themed and that makes it up. So yeah, and actually, you know, the full title, I have it here in front of me. I'll share my slides in a second, but it's Polars, the Blazing Fast Python Framework for Modern Clinical Trial Data Exploration. That's quite a subtitle there. And so we have some, you know, we have our work cut out for us and we're indeed very excited that you're joining us, that you're, you know, investing your time in learning about this, this awesome package. So yeah, my name is Jeroen.
I work at Posit, just like Michael. I do developer relations. Michael does open source. And so for the next two hours, we're going to be talking all about Polars. And if you have any questions, you can use the chat. I might not see them when I'm talking, but Michael, shall we say that whenever you detect a question in the chat, that you just give me a heads up, that you vocalize it? Yeah, I can do a little nudge when it seems like good timing, if there's a question that... Yeah, yeah. Or you can just, you know, convert the text to speech yourself.
And one thing that could be nice too, while you get this cooking: I'd be curious what drew people to this workshop, since it's called R/Pharma. I'd love to hear about your experience with Polars and what drew you to Polars and Python. So maybe drop it in the chat. Good question to start with. Yeah. How you got here? I don't know. Because the conference is R/Pharma, but yeah, this is going to be all about Python: all Polars, Polars and Python.
Polars in a nutshell
So this is what we're up to for the next two hours. I'm going to start with laying the groundwork, giving you the context: what is Polars, in a nutshell. You may already be familiar with what Polars is about, but since we have quite a few people joining us today, I thought it'd be good to first give you the basics.
Then ETL: extract, transform, and load. That's basically what it comes down to, right? So we're going to have a look at how Polars, with its vast API, maps onto that. Then we're going to look at some common operations that you'll use probably 80% of the time. And then, depending on how we're feeling, we could do a deep dive into expressions. This is kind of like a buffer, depending on how fast or how slow we're going. It's all good; we can go as deep or as shallow as we like. I want to make this tutorial interactive, to the extent that that's possible using Zoom chat. But then I'm going to give the floor to Michael, because he has prepared a demo where he applies Polars, and also does some great tabling, with clinical trial data. So Q&A is listed here at the very end, but again, as we said, questions are welcome throughout the tutorial.
So Polars in a nutshell. Polars is a DataFrame library. It's written in Rust, but it has bindings in a couple of different languages like Python, R, JavaScript. And the one for Python, that's actually the most popular one and the most developed one. So that's why we're going to focus on that one today. If you're like, I really like Polars, but I feel more comfortable with R, there are alternatives for that as well. So there are various ways in which you can use Polars within R. Although I have to admit, I don't have that much experience with that, but I just wanted to point out there are options.
It is blazingly fast. So here's a benchmark to back that up. What we see here are timings for different benchmark queries that we have performed using Polars and four other popular DataFrame libraries. On the x-axis, we have time; notice that it's on a log scale. And so this gives you an idea of how fast Polars actually is. DuckDB is also really fast, and for a couple of queries, DuckDB is actually faster. This changes over time; there's a very healthy bit of competition going on between those two. But it is fair to say that both Polars and DuckDB are way faster than the alternatives here: pandas, PySpark, and Dask.
Now, the second thing you should know, the first one being that Polars is blazingly fast, is that Polars is really popular. These are GitHub stars over time. Now, popularity by itself is, of course, not the only thing that matters. But when you're choosing something new, it really helps to use a technology that has a community around it, that has good support. And as we can see here, that is definitely the case for Polars. This graph is a little bit outdated; I checked an hour ago and it's now at about 36,000 GitHub stars. All to say that in this very short time span, Polars has gotten a lot of attention.
I guess that's about it, right? Oh yeah, the API. It has a very expressive API. So what you often hear about Polars is: come for the speed, stay for the API.
Come for the speed, stay for the API.
And I have a slide about that as well, but it's a little bit further down the road. Now, the API, the actual code that you're writing, is very different, I often hear from people who are used to pandas, which is another very popular DataFrame library in Python that has been around for, I would say, 15 years, give or take. But what I've also been hearing from people trying to get used to Polars is that it reminds them a lot of the tidyverse. And I would say that's a great compliment. Now, it's not exactly the same and it never will be, because R and Python are still two very different languages. But yeah, there are some similarities there.
Hey, so there's this book that I wrote together with Thijs Nieuwdorp. And if you're interested, we are giving away three digital copies. So if you go to polarsguide.com slash rpharma, or if you scan the QR code, you can enter the raffle by filling in your name and email address. Everybody who does this gets the first chapter for free. So this could also help you assess whether it's worthwhile for you to invest more time into learning Polars. But yeah, you could be one of the three lucky people to win a free digital copy.
Yes. Okay. Yeah, here it is. I mean, even Yoda agrees with this: come for the speed, stay for the API. And we're going to be focusing mostly on the API. The speed is something you have to see for yourself. I have shown you the benchmark, but it is not until you try it out yourself that you can truly witness how different this is. Once you experience this speedup, it actually opens up possibilities. It's a strange thing, but that's what time can do for you.
Expressions
So regarding this API, the actual code that we need to write in order to get stuff done: about 50% of that comes down to expressions. Expressions are a very important concept within Polars. And this is also the thing that is perhaps most difficult to grasp, the thing you'll have to spend the most time on. I'm sure that, as R users, many of you have experience with ggplot2 for visualizing data. I love ggplot2, but do you still remember when you first needed to understand the ideas, the concepts, the grammar of graphics behind it? It took a while. There's definitely a learning curve, but once you get it, something clicks and you're like, yes, I get it now. And from that moment on, you feel liberated. You don't have to consult the documentation for every little thing that you do, because you understand the underlying ideas. That kind of holds for expressions as well. There is a learning curve associated with them, but once you grasp it, it comes very naturally to you. And you'll notice how easy it is to work with these building blocks. That's a good thing, because expressions pop up in various places, and I'll show you those places in a moment.
First, I just want to drive home the point of how big a thing expressions are within Polars. So when Thijs and I set out to write the book for O'Reilly, we were like, yeah, we'll devote a chapter to expressions, of course. But as it turned out, we were incredibly naive in that respect. It turned out we needed not one, not two, but three chapters to cover everything that expressions have to offer. So within the book itself, they take up a lot of pages. And I also counted, and you should take this with a grain of salt, because it's not always straightforward to get the number of functions or methods that a package has to offer, but there are 404 methods associated with expressions. That's when you sum the top two numbers. So there are the top-level functions that Polars offers, for reading in data, for example; writing your data, no, that's a method of a DataFrame. A DataFrame also has lots of methods. And then you see that expressions make up about half. So that's why I wanted to devote some time to these things.
ETL with Polars
Now, ETL: extract, transform, and load. That's often what it comes down to when you work with data, regardless of the type of data that you work with. At least in my experience; if you have other ideas here, I would love to hear them. By the way, Michael, if there is something coming up in the chat, let me know, right? I don't see any messages at this moment, but if there is a question that's relevant, be sure to let me know. Do you have the link to the slides, by chance?
So ETL. Extract data: read data from a particular source. It doesn't matter where it comes from, just read some data. Then a whole bunch of transformations, right? And I enlarged this T just to visualize that, yes, that is where we spend at least 80% of our time. And then it's storing that data somewhere. Maybe you want to make some visualizations, model your data, maybe create a great-looking table. Those are also very important steps, but when it comes to actually working with the data, this is what it comes down to.
So here I have a complete example, and I'm not expecting you to read this code or understand it. I just want to point out the ratio of E, T, and L. So, what I've colored red here: there are a few lines at the beginning, and then one more towards the bottom on the left side. That's where we read in some data. In this example, and this doesn't really matter, it's Citi Bike data: bikes that you can rent in New York City. We read in some CSV data at the start, and later on we combine that with some GeoJSON. We create a beautiful visualization that I'll show you in a moment. And at the very end, there's this line which I've made blue, and that's where we write the results to disk, in Parquet format. And the rest, in white, those are all the transformations: that's where we select columns, where we filter, where we do aggregations and joins and so forth. And I think this is very typical. There are always exceptions, but for most data pipelines, if you will, this is what it comes down to.
By the way, when it comes to reading data, Polars can read lots of different formats from lots of different sources. It can read from lots of different databases, and it can read Excel spreadsheets, JSON, and so on. I actually don't know what kinds of formats are used most often in your field; I'm curious to hear about that. But in any case, I'm pretty sure there's a way to get that data into Polars.
Yeah. And Jeroen, someone asked about read_excel, which maybe is an interesting kind of detail between Python and R: the R package doesn't have read_excel. Oh, really? I didn't know that. I think it's just because the Python version uses another Python package for it, so there wasn't a clean workaround. It's kind of just one of those funny in-between things. Oh yeah. I bet there are some discrepancies between the different language bindings. So in that case, yeah, the Python API relies on some other package to do the reading of the spreadsheets. Okay. So I wasn't aware of this, but I'm pretty sure that in R, what I would then do is first read it into a regular data frame and then pass that on to Polars.
Okay. Oh, we're recording this, right? Never mind. Hey, look at this. Wow. This is a visualization that was created using plotnine, which is actually a Python port of ggplot2. I'm not going to show you the code for this plot, but that wall of code that I showed earlier was actually needed in order to produce this visualization. Just wanted to throw it out there. And if you are curious, this is discussed in the first chapter of the book. Anyway, just wanted to plug plotnine as well here, I guess.
Common operations
Okay. So transformations, right? We've seen this wall of text, lots of code, and I've said expressions are important. Okay, let's get down to business. What are some common operations? These are not all of them, but I would say that in the majority of cases, you want to either select columns that you already have in your DataFrame, maybe a subset, or create a new column. Maybe you want to filter rows. Maybe you want to do some aggregations: some grouping, creating aggregate statistics. Or, last one, sorting rows. I would say these are the most common ones. Perhaps joining; I could have added joining here as well. But let's start with these. Now, the reason I want to cover them is to show you where expressions play a role and how they are used by the rest of the API.
Yeah. And don't worry, Michael and I are not expecting you to be an expert by the end of these two hours. What we hope you get out of this is that you're like, yeah, this Polars thing sounds interesting, I'm going to give it a go. Or maybe you decide it's not your cup of tea. That's fine too. Either way, I think, Michael, right? We will have succeeded.
Okay. Okay. So not clinical trial data yet. Let's keep things simple for now. Let's work with some fruits. Fruit is good for you too, right? So we have here 10 fruits and various properties related to those fruits. Yeah. So their weight, color, whether they are round or not, and their continent of origin.
So let's see. Those five common operations that I just listed, let's have a look at how they can be performed using Polars on this small DataFrame. So we're using the select method here, on the fruit DataFrame, and we're passing in four different arguments to this method. Now, we see pl.col a couple of times. This is the most common way to start an expression. Yeah, I haven't even explained what an expression is; don't worry about it. First, I want you to get a feeling for what this is, to build up some intuition. We'll get to the definition later. The first pl.col, with "name" in quotes, refers to the column called name, and we're not doing anything else with it. So what we're basically saying is: select the name column. That's all. Now, the last argument here is round, and you see that we just use the name itself. That's possible too, but then we're not creating an expression, and that means that we cannot do anything special with this column; we can only select the column as is. For example, what we do here with the weight column is divide it by a thousand. This is metric, so we go from grams to kilograms by dividing by a thousand. But in order to be able to apply an operation to a column, it needs to be an expression. That's what we're doing here with pl.col, col obviously being short for column. Now, that second argument over there, that is a regular expression. So keep in mind: whenever I say the word expression, I am referring to a Polars expression, and not the thing that you may know as a regular expression or a regex. Those are two different things. But what we're doing there is selecting all the columns that have the text "or" in them: color and origin.
Second common operation: creating new columns. This is done using the with_columns method. And I always like to tease the creators of Polars that it was a mistake to call this with_columns. Why? Because all the other operations are verbs, just as in the tidyverse. But for some reason, and I'm guessing this has to do with Polars being inspired by Spark, they went with with_columns. Anyway, let's not dwell on this for too long, Jeroen. Let's focus on the good stuff, and that is that you can create new columns using expressions here. Okay. First argument: pl.lit, a literal value. This is another way of starting an expression. The other one, pl.col, uses an existing column as a starting point; pl.lit is for when you start with a Python value. Maybe you have some other data lying around somewhere. It could be a constant value like True, or maybe a number, a list of values, or a NumPy array. So pl.lit(True) starts a new expression, and we're giving it a name, because it doesn't have a name by itself; it's entirely new. We give it the name is_fruit, and this, at the very end, becomes a new column called is_fruit. The second argument is again a new column, called is_berry. And this is based on the name column; you can see that because we use pl.col("name") and then apply some operation to it, namely the ends_with method, which is part of the str namespace. So whenever the name of the fruit ends with "berry", the value for is_berry becomes true. A little bit contrived, I know, but I just want to demonstrate a couple of options here. Now, for this second argument, we're giving the name is_berry in a different way. So there are two different ways of giving a name to a column: one is using the alias method, and the other is using a keyword argument.
There are some restrictions with keyword arguments, of course, because the name has to be a valid Python identifier, meaning it cannot start with a number, for example. If you do need a name like that, you'll have to use the alias method. Right, so that's it for creating new columns.
Now, filtering rows. This is actually a single expression that I have put onto two different lines, an expression that's composed of multiple expressions. So what we do here is keep all the fruits that weigh more than a thousand grams and that are round.
So we have two rows, two different fruits that, yeah, for which this expression yields true. These two expressions, by the way, they're combined into a single expression using the and operator here. You can combine expressions like this as well.
I guess, yeah, and there are other ways to combine expressions. You can use arithmetic and comparison operators. Actually, there is a comparison operator here with the weight, right? We say weight has to be greater than a thousand. Yeah. And the Boolean operators. All right. I'm not hearing any questions, Michael, or are you secretly answering them via Zoom? I have answered a couple. There is one.
Q&A: column labels in Polars
There is one. Here's a question for you. All right, all right, let's hear it. Riddle us this, Jeroen. Is it possible to have labels attached to column names?
To have labels attached to column names? Oh, wait. That is something I've seen when I had to use some SPSS data. Is that where this comes from? So a column has both a name and a label? Some extra metadata?
I'm not sure. Yeah. It does sound like it's SPSS related because Albert in the chat's mentioning that there's a way to do it. Is there a way in Polars to do this?
Oh, it sounds like SAS as well. People are weighing in. This is more expressing our naivety about labels, but the chat's really picking up, so it's clear labels are load-bearing. I love that. I love to see the interactivity.
I do see very tiny pop-ups, but I can't read them. Okay. So labels: I think that's not possible in Polars. I think this is a feature that you may have gotten used to when using SAS or SPSS, two applications I have very little experience with, I have to admit. And now I'm wondering, and maybe this gets a little bit philosophical, what the use of a label is; that's not something that's really important to me at this moment. But I think the short answer is that a column can only have a name. If there is some additional metadata associated with a column that you want to use at a later stage, maybe when you are creating a table of some sort, what I would do, and it may sound very hacky because you're used to this very nice feature of labels in SPSS and SAS, is create a separate DataFrame with two columns: one is the name of the column, and the other is the metadata, the label. And then later on, when you have transformed all your data and you're ready to produce a table, perhaps using the Great Tables package, that is the time when you refer back to that second DataFrame you created.
Now, again, this is a workaround, it is hacky, but I also have to say that I haven't seen this feature anywhere else. It's also not something, as far as I know, that most relational databases have.
Right. Like what you described, putting the metadata as a separate table is a very SQL database thing. Like, if anyone's ever looked up like information about the columns, it's always like a table where each row is a column being described. So it's a very nice, it's a pretty classic pattern. But I do get that if you're really used to SAS and SPSS, that this thing might be a really...
Yeah, that's a bummer. Who knows, maybe at some point there could be an extension for Polars for this. I don't know. I guess it also depends on how many SAS and SPSS users want to move over to Polars, right? If there's really a need, I know some fellas in the Polars team. I don't know, maybe we can work something out. Just saying. But yeah, thanks for that question.
I will say, Albert Lee pointed out that Roche has a package called pyreadstat that does something like what you mentioned: it reads out two pieces, the data and the metadata.
All right. Yeah, yeah, I've seen pyreadstat, right, which can be used to read in SAS data as a Polars data frame. And I guess it supports some other formats as well, might also support SPSS. But if that's how they do it, then now I'm feeling a whole lot better about my suggestion. It's not hacky at all. Just have a second data frame.
Yeah, yeah, that was neat. Okay, I was gonna ask for more clarification on the question sent to you, but that was a neat, that was a fun one.
Yeah, yeah, no, but please, please. I mean, more questions is better, right? If we, plenty of time. All right. So just keep those questions coming. If Michael thinks, you know, he can answer them in the chat, then that's fine too. If we want to turn this into a bit more of a discussion, I'm very happy to do that as well.
Group by and sorting examples
All right, then. Let's go back to this example right here, where we use the group_by method on the DataFrame in order to compute some aggregate statistics. So this is an interesting one; it may take some time to read. We're using expressions here in two different places. The first place is the first line: we are using an expression to determine which groups we're going to create. Actually, we're defining what a group is. And that is, wait, I have to look at it again. Yes, it's the last part: the continent. And if it's South America or North America, the first word is stripped off, because we're using the str.split method to turn it into a list and then getting the last element from that list. So it's a few steps. But I guess if you look at the outcome, the resulting DataFrame, it's clear what this expression is doing.
So that's what we're actually grouping on: a derivative of the origin column. Now, the second place where we're using expressions is within the agg, or aggregate, method. First, we want to have a column called len, the number of rows in each group. And second, the average weight. It's important to note here that a method like mean is one of a handful of methods that summarize data: they turn multiple values into one.
All right, I think this is the last one. Yeah, it's the last one: sorting rows, very common. What are we doing here? Again, a little bit contrived, but here the fruits are sorted by the length of their name. I couldn't think of anything better; weight would be an obvious one, but I wanted it to be a little bit more complicated. So, the length of the name, in descending order. And you get that by setting the descending keyword to True.
I'm curious to hear, actually, who here has experience with pandas? Let us know in the chat. Michael, I'm counting on you to give me some aggregate statistics there. We've got to let it incubate, you know. Yeah, let it incubate for a little bit. All right, we got some pandas coming in. We got a little bit of pandas in the house. That's okay. I owe a lot to pandas, I have to say. On the internet there are sometimes some annoying or less elegant comparisons between Polars and pandas, and I say, let's ignore those. Without pandas, there would never have been a Polars. So we owe a lot to them, or yeah, I say we.
I'm mentioning pandas because of the way you change the sorting order. In pandas, if you want to sort your rows in descending order, what you have to do is set the ascending keyword to False. Now, I'm not saying which one is better. I just want to point out that if you do have experience with pandas, you may run into a couple of these small differences, and that will require you to do some unlearning. Something to keep in mind. If you are interested, chapter three is devoted entirely to this.
Anyway, expressions. We've seen a couple of examples now, and we've built some intuition for what an expression is and what it can do. Let's see, we've got some time left, right, Michael? Let's look a little more closely at what it is. I have a question. I'm so sorry. No, no, no. Don't apologize. I love questions. Let's hear it. Does it break ties alphabetically by name or by original sort order?
Oh, so whether the original... wow, it's a very tricky question. Is it stable? I would guess, I would hope yes, but that's something I would have to look up. You know what, it might even differ: does it have just one sorting algorithm or multiple? This is something we can definitely look up. I don't want to do that right now, but maybe I can do it while you're talking. Yeah, I just found it: there's a maintain_order argument. There you have it. I knew it. Isn't that fantastic? It's False by default. So I guess it's not stable by default, but you can make it stable. And then I guess it'll be a little bit slower, not that you'll notice it, but that's what I'm guessing; that's why they made it False by default. Great question. Keep them coming.
What is an expression?
What is an expression? Okay. So if you look at the official Polars documentation, they define an expression as a lazy representation of a data transformation. All right, let's keep that in mind. If you ask Marco Gorelli, who is a core contributor to Polars and the creator of Narwhals, another fantastic Python package that abstracts over the different DataFrame implementations, he defines an expression as a function from a DataFrame to a series. And, you know, I'm not saying that these two definitions are wrong, but I think that we can do better than this. I think that we can do better.
I guess that's also what happens when you start writing a book: you spend almost two years with the material and you develop this need to define things. I don't know if that's a form of yak shaving, because it would be better to just be writing, but here's what Thijs and I came up with: an expression is a tree of operations that describes how to construct one or more series.
An expression is a tree of operations that describes how to construct one or more series.
Now that's quite a mouthful. Um, so, so let's, um, let's break this down because I do think it is important that you, uh, have a mental model of what an expression is. Um, so, so that you know what you're dealing with and maybe, maybe you won't get it now. That's, that's okay. But when you do begin this Polars journey, I do want to advise you to, to grasp this concept of an expression, because that is something that will come back, uh, uh, the most, otherwise it'll haunt you.
Okay, so let's break this down. Let's start at the very end: series. A series is a sequence of values; that's how I see it. Very often a series is also a column, but that's not always the case. Let me go back to this example right here. Yeah, I guess this will do. I'm sorting on a series that does not exist in this data frame, right? I'm creating a new series based on an existing column, but this new series does not become a column by itself. We don't have the lengths of all the names stored anywhere; it is never materialized. It's only used to sort. Yeah.
Maybe I'm already giving away too much here. So: to construct one or more series. Okay. The second part I want to highlight is that it is a tree of operations. Here's an example of how you can think of this. I have an expression that's not based on any existing data frame. We're adding one and five, we're adding three and five, and we're dividing one sum by the other. So there are in total four different values we're dealing with to start, and they're at the very bottom of the tree, as you can see here. Five and one are combined using arithmetic, the plus operator (or the add method, if you will), and those two combined with that operator create a new expression. The same is done for the numbers five and three on the right side. And then, as a final step, these two expressions are combined into a single expression using the division operator. You can visualize any expression like this using the meta.tree_format() method.
And this is a very simple example, but it shows that we are dealing with a tree of operations here. I hope this helps your mental model. Right, the "describes" part: an expression is a tree of operations that describes how to construct one or more series. This "describes" is actually related to what the official Polars documentation calls lazy. The expression by itself doesn't do anything; this is perhaps the most interesting aspect of an expression. They are lazy. You could call them a recipe of what to do, and it is the method, combined with a data frame, that determines what the result will be.
And I think I have an example about this in a second. Okay, constructing. Why did I put this in here? Yes: a series is created, but it doesn't always become a new column. I think that's why I put this in here, and I should read the book again. One or more series: this is an interesting one, because it is not in the definition that Marco Gorelli gave. Remember that expression I had with the regular expression, which yielded two columns, color and origin? Those are two series, and in turn two columns, that were created. So a single expression can lead to more than one series. Now the developers will say, yeah, but under the hood it's expanded into multiple expressions. All right, that may be, but from a user perspective it is still only one expression.
Here's another example where we use a single expression on all the existing columns. We start out with two columns, A and B, and we're multiplying them by 10, yielding another two columns. So in total four columns from just a single expression.
Let's see now. All right, Michael, do we have any more questions we want to discuss? No, I was just putting in a note that pl.all() is kind of similar to using across(). Oh yeah, yeah. You could also do pl.col with an asterisk, pl.col("*"). Or across() in R, you mean? In R, yeah, just for the R folks. Of course, very good of you, you were relating it back to some tidyverse functions. Thank you.
You're welcome. Yeah. Although, if you're used to tidyselect, then... yes, nothing compares.
Expression properties
So, properties. Because of all this, expressions of course also have some properties. In my story I have mentioned these in passing, but let's have a look at which ones deserve extra attention here. Again, they're lazy, right? They're like recipes, and the methods, .filter, .sort, .group_by, they are the cooks that follow the recipe. The result depends both on the function (.filter, .sort, .select) and on the data to which it is applied.
So here's an interesting example to demonstrate this. I create an expression called is_orange. An expression is just a Python object, which you can assign to a variable. Now I can reuse this expression. Here in this first example, I am creating a new column called is_orange. Notice that the name of the new column and the name of the variable are both is_orange, but that doesn't have to be the case; I could have also assigned this to a variable called e, for expression, not a very good name, but what matters is the name of the expression, which we create here using .alias. Here I apply it to the fruit data frame, which we already know, and I'm going to apply it to the fruit data frame a couple more times using different methods. We can use it to filter: now we only get the rows of the fruits that are orange. The exact same expression, used in a different place. We can use it to create groups: how many orange fruits do we have? It so happens that in English, orange is also a fruit, but I'm talking about the color here, of course. So six that are not orange, or that are.
Okay. Now finally I get to demonstrate that these can also be used on other data frames. So remember we have this is_orange expression, and here I'm applying it to an entirely new data frame that I'm making up right here. These are three different flowers, and two of them are orange. Fantastic.
Expressive, right? I could go on about this for a very long time, but I do hope that, because of the method chaining you can do, because you can combine multiple expressions into a single one, and because you can reuse them, they are indeed very expressive. Efficient? Yes. The Polars execution engine will execute these expressions in parallel. That's awesome. It means that in many, many cases your Polars query will be a lot faster than what you're used to.
I don't know if speed is a thing for you. Let us know in the chat if speed is something you run into, meaning that your code is too slow and it really becomes a burden. It may not be the case, right? Maybe you're just dealing with a couple of thousand rows and all is good. That's fine too.
All right. Expressions are also idiomatic. Last thing, and then I think I'm going to hand it over to Michael. So here's some non-idiomatic Polars code. If you're used to pandas, this may look familiar, with the square brackets and all. And it is indeed the case that this code produces the same result as the earlier filter example I had. But it is suboptimal, because these two things are not expressions, so they're not executed in parallel; the optimizer that Polars has cannot make any optimizations, and they're executed in a serial fashion.
Also, it is very possible, very likely even, that you will run into an error. Let me see here; there's a lot going on on this page. So we got some fruit, and I'm doing a filter and creating a new column, and then I'm trying to do the same without expressions. The top one is idiomatic and just produces a beautiful data frame. The second one is not idiomatic, and the reason it fails is that the filter method reduces the number of rows from ten to two. So by the time we are at the with_columns step, Polars is dealing with a data frame that has two rows, whereas the code is_berry = fruit["name"]..., with the square brackets, refers to the original fruit data frame, which still has ten rows. That's a mismatch, right? The shape error pretty much gives it away: unable to add a column of length 10 to a data frame of height 2.
Yeah. So just as a warning, or maybe as an encouragement: you should always use expressions in Polars. Right, that was a lot of talking. Now is the time if you have any questions about anything that I've just said, or things that I have not said about Polars. If you want to wait a little bit longer, that's fine too. Otherwise I'm going to give the floor to Michael.
What do you think, Michael? Do we have anything in the chat we should discuss now? No one answered "do we have a need for speed" in the chat, so it's a toss-up, you know? That's cool. I mean, there's more to life than speed. True. Should I... let me crack open this demo. All right, Jared in the chat feels a need for speed. Thanks, Jared. All right, I'm going to give the floor to you.
Setting up the demo environment
I don't know if I have to stop sharing now. That's great. I'm going to, I'm just going to stop sharing. Uh, Phil's blurb, just so that people, uh, in case anyone wants to follow along with, um, on Posit Workshop, I'm just going to, I'm going to crack open and use Posit Workbench. Um, since it's, it seems like a good way to just kind of fire it up and, um, all right. So let me, let me share my screen. Hey, Michael, and shall we do the same now for you? Whenever there's a question in the chat, I'll just, uh, give you a heads up. It's a deal. I'll, um, on my second monitor, I'll also open the, put the chat just to just in case.
All right. So, whoa, I don't know. Don't look at my desktop. Cause I actually literally never look at it and didn't know it even existed. So I think before my Mac, I had, I had a setting where desktop didn't exist, basically like the files aren't there. Nothing's there, but I think it update changed it and brought the desktop back, which was frightening because it's, it looks just littered with files.
So if you see a desktop- Excuses, excuses, Michael. We know what you need to do after this workshop. You need to clean up your desktop. But let's focus on the good stuff, on this clean code.
So I'm going to start from scratch again, just so I can show how I got into it. The one piece I can't show is I don't think it'll have any log in. So I just clicked Posit Workbench at workshop.posit.team. And I think if this is your first time, you'll be prompted to sign in. So usually I click through and then there's a GitHub option. So I chose, actually, let me, I'm just going to do this. So Posit Workbench, and then I clicked sign in with open ID. And then I used GitHub to sign in.
And let me see what file I might have. I think for speed, if you click on register, it's a little bit faster, because you can plug and chug an email. Oh, nice, cool. That gets you in pretty quickly. If you don't mind logging in through GitHub or Posit Cloud, that works fine too. Either way is totally fine. And for the email here, you can totally make up an email address; it works fine. The only thing I usually do is keep the email and the username the same, even if I use a fake one, so in case I ever want to log in, I can remember how to get back to it.
Remember, yeah. But just for, since I'm on here, we don't keep these up forever. These are just to explore and play with the environments. The Posit Cloud environments do stay up, but unfortunately they don't have Positron with Python yet. And so that's why we're using this separate environment. So this is just, these will be up for a couple days if you want to play with them. Yeah, nice. Okay, so it's a good place to visit, but don't try to live there.
All right, so I'm going to click into Workbench. This is the session I started before; Workbench will list out all of your sessions. I have a Positron session open, and I'm going to create a new one here. So I click Positron Pro, then Launch. Okay, so now we've got a new session cooking. All right, and it just opened.
And then I'm going to look at my files. It says you don't have anything. Let me put this in the chat: examples-great-tables-pharma. Okay, so I'm going to clone this. I went to Code, HTTPS, and clicked this copy button. And then I'm going to git clone here.
Oh, you know what? I forgot that Workbench you share, you have like a shared folder. So I had two Workbench sessions open. And so I'm going to delete my previous work really quick. And you can also clone just from the top left-hand side. There's three little lines and there's a, oh, it's right there. Clone repository, that works fine too. I saw that, it required me to sign into. Oh, it doesn't? Oh, oh, okay. But I could have gotten it wrong. Let's see. So I'm going to clone this.
All right, and then I'm going to cd in. This uses uv, so I'm going to do uv sync. Let me make this a little bigger for folks; sorry this bar is so long right here. So, uv sync. What this will do is create a virtual environment in the folder .venv inside this folder, and basically this will hold everything we need to run Python and our packages. And I think Positron usually picks up on it.
For some reason, it didn't here. Maybe because I didn't have Python selected. Oh, you know what we have to do? I'm so sorry. Now that I ran uv, I'm going to click Open Folder, and I'm going to open the examples folder that we cloned. So: I cloned the folder, I went into it, and I ran uv, but the one thing I forgot to do was open the examples folder. All right, here we go, and now I bet it will. So notice now Positron knows that we're using uv, and it's able to set everything up from the beginning. Opening the folder is kind of nice that way.
Introducing the Great Tables clinical trial example
All right, so I'm going to just to show you what we're going to do. I'm going to run the full example. And I'm going to produce the table first before going through the code. So here's the, I'm going to make this a little smaller. All right, so here's the table we're going to produce.
So I got this from Rich Iannone, who is the author of GT and maintains Great Tables with me. This is from some clinical trial. I have to admit, I'm a little less familiar with this type of work, but this was a type of table he assured me is representative of what people might do.
And I think it's interesting because, so this is using Great Tables to recreate this table. And let me just flag some of the things we're going to do. So I actually included an image with some things annotated. So I'm just going to show you. So here we have a count of our sort of like sample size. This is up in the inside the column labels. So we're going to want to do that. We've also formatted a lot of the values. So notice pvalue used to have a lot more decimals. And we've cleaned it up a little bit.
And then this, another interesting kind of mechanic is that we have this extra information, the percent. So here, this is reporting, I think this is number of people in these different age brackets. But then we also have the percentage for the condition. So in the placebo condition, 16% of people are less than 65 years old. And so these are kind of three interesting pieces of the table. The pulling out these overall sample sizes, formatting the values in different ways, and kind of combining information together by putting the percents in parentheses.
All right. And if you have any questions, feel free to put them in the chat. And you're definitely, feel free to interrupt me if I miss anything.
Loading and exploring the data
All right. So to start, I'm just going to load this data and show you a bit what the sample data looks like. So, okay, let's do... table. Let's look at the data. Oh, it prints out here. Okay.
All right. So notice that the table data is funny. It actually matches pretty closely the table on the right. So you have like age here. And then you have these different labels, like N, mean, SD, median, min. So our table is pretty close to the final format. We just need to do a little bit of cleanup.
So the first thing I'm going to show is pulling out the... Sorry, I'm going to make this full screen. Pulling out the overall values here. So what we'll do is we're going to run a simple filter. So let me just show you. So we're going to filter where category is this age Y. And where the label is N. We're just going to pull out basically the first row of data.
All right, so here we go. We use filter with pl.col("category") equal to this age category, and pl.col("label") equal to "n", to get the first row. Then we're going to select these four columns where we want to pull out this n value. So I'm going to show you that piece.
Notice that in Python, one funny thing is that to run this, notice that in R, you might be able to just highlight part of your pipe and run it. In Python, what I do is I usually highlight up to where I want to run. And then I close the parentheses at the end. You could also do it by commenting out lines. But I find that this is a quick way to just run little bits of code. All right. So notice we've selected now the four columns we care about.
And then I needed to wrap up with a little more advanced Polars, where we cast to an integer just to get rid of these decimal places. And then .row is a funny method that says: give me this exact row of data; named means as a dictionary. So, just to show you: it gave us back a dictionary with these values. And this is probably the funkiest part of the whole table, because we need to pull these out and hoist them into the column names.
All right. And Michael, this reminds me of our discussion about labels. Yeah, it is kind of label-y. Yeah, you create a separate object. In this case, it's a dictionary, which, you know, does the job as well. Earlier, we talked about a second data frame. But as we'll see later on, you're referring back to this dictionary and then use them to create the column names. Yeah, totally.
Using selectors and formatting values
So that's the n overall. The next piece to show off is selectors. What we're going to do is clean up the code a little bit. Jeroen showed how you can pull an expression out: you can assign an expression to a variable and then reuse it. We're doing something similar here with these selectors. So I'm saying: select these four columns by name. And then this selector says: choose all columns that end with _pct. And if I run this on tbl, tbl.select with this percentage selector, I get just these columns back. Sometimes I find it's nice to pull these out to flag some of the structure that we're going to get into.
All right. So we pulled out these overall n values. Next, we're going to clean up the table a little bit. So specifically, right. Notice that we might want to limit, I think this table actually cut short some of the values, but we might want to limit how many decimal places or significant figures people see. The other one is these percent columns. We might want to write this out in a more clean way. So for example, we might want to write it like this with a parentheses around it and a percentage sign and then the p values we might want to shorten.
So I'm going to do that here. Basically, the trick is we're going to use the with_columns method, which is a lot like the dplyr mutate, and we're going to use it to change our columns. These all come from Great Tables; they're formatters that clean up your numbers and turn them into strings. So basically we're formatting numbers to three significant figures, and you can test these out directly in your console. You could put a number in with n_sigfig=3, and you can see that it formatted it to three significant figures.
So that's the interesting thing about these functions is that you can actually is they can work either inside with a Polars expression, or you can try them on numbers directly to see, just to like kind of experiment and get a feel for them. All right. So I'm going to run these.
All right. Okay. So the key is now these conditions, the placebo, these two in total have all been shortened to three significant figures. The percentage columns are now, if you look, formatted so they're inside parentheses with a percentage sign. And then the p values have been shortened to four decimals. And notice that they're all strings now. So they're sort of like clean formatted numbers that can go into the table.
Okay. So the last thing we need to do is add the percentage columns to the value columns. We're basically going to add them together: placebo plus placebo percent. The one trick is we need to fill in null values, because a null plus a string is a null; this is called null value propagation. So we need to make sure they're empty strings, so that anywhere there isn't a percent, we still keep that left-hand value. It's kind of a funny dance, but it's just because of how nulls work. So I'm going to run this and look at cleaned.
Okay. So notice that now in the value columns, the percentage sign, we have it fully formatted for our table. So with the value and the percentage. Yeah, that's right. So someone asked, how do you align decimals? So I'll show you in the key is Great Tables. So now that we have our sort of raw Polars table ready for display, we can use Great Tables to sort of style it.
Styling the table with Great Tables
So here's what it looks like. If you call .style on a Polars data frame, that gives you a Great Tables table. This is the same as using the GT object that Great Tables has, which starts everything; these do the exact same thing. So, to align decimals: Great Tables has, I think, a number of ways it does that. I'm not exactly sure for these values with parentheses; it might be a little tricky, but I think by default Great Tables tries to align decimals in a number of ways. Let me try producing the table and see.
I think it can differ depending on how you need them aligned. Actually, let me do from great_tables import exibble. I think this will actually show the alignment. So... not here. Not here. You know what, I have to admit, I'm not totally sure. I thought there were some ways to align the decimals, but I can't quite remember now. Let me try fmt_number; I'm just going to look at the help. If you put a question mark at the end, you can see the help. It's not formatted in my favorite way, but there might be an align option. I failed you. I think there's some sort of alignment strategy somewhere in Great Tables, but I can't remember exactly where. Okay, there's auto-align, so that suggests there are ways to do alignment. I know it's possible in Great Tables; I have to admit I don't exactly know how. So I think we need Rich and the Great Tables big brain. But if you know how to align in GT, definitely let us know in the chat; I think it's probably similar in Great Tables. That's a great question, and a big one for table styling.
All right. So I'm going to keep going, and maybe someone knows how to do the alignment. To start, I'm going to create this header. What tab_header does is it lets us add a title and subtitle, so here we're able to give the table a name and a brief description. And then I'm going to use sub_missing to get rid of all these Nones. So I'll show you that now. Notice we got rid of all the null values.
One funny thing is these percentage columns: we moved them in, so placebo percent, for example, we don't really need anymore. So we're going to use cols_hide to hide all of these percentage columns. All right. Nice, we got rid of the _pct columns. And then we're just going to clean up our labels. Notice our column names are still these lowercase names with underscores; we're going to clean them up a little bit using cols_label.
So a key here is that we're now using this n overall placebo piece with a Python f-string. Basically, this lets us insert the value from the dictionary; the n overall placebo value is just the number 86. So in these curly braces we can write a little bit of Python, and it'll insert the result. And we're also using this thing called md, for markdown, to be able to format the result a little. So let me show you that. All right. Now notice our labels are a lot cleaner, and they have these sample size values in them. That's a lot nicer to read.
And then the very last thing is we're going to put a footnote in with the date and a note about the source. So let's do this. All right, that just added a source note at the very bottom, and I think that creates the full table. So we did the title and subtitle; we set the column labels, giving them nice names with cols_label; we had those extra percent columns that we hid with cols_hide; we got rid of None values with sub_missing; and last, we added a little source note at the bottom about when we executed this program.
So that's the gist of using Polars to go from not totally raw, somewhat pre-processed data to a table that's ready for publication. If you have any questions, I'm happy to walk through that table. I do also have a note about a little piece I can show: this table came pretty pre-processed. If you look at tbl, it's already in the format of the final table, so obviously there's a lot of work people would be doing to get this thing ready.
Exploring data wrangling with Polars
And for example, one thing I could show is: I added a dataset, I'm just going to call this age, to show how you can do a little bit of data analysis. This is what the age section of the table might look like as data: the counts of all the people in the different age groups. So this is a wider format, with the conditions as the columns. But if you were analyzing it, you would probably see it in a tidy, long format, where condition is its own column, condition and age group are crossed, and you have n for each of these.
So if that's the case, you can do things like group by, just to show some of the stuff Jeroen mentioned. You can do group_by and agg, and take the sum of n; there's so much help going on in this IDE. Okay, so you can always calculate these totals. The more interesting thing is: how would you get these percentages? Notice that these percentages are the percent within the condition; within placebo, 16% of people are under 65. And the key here is a with_columns, and Polars is very SQL-y. Basically, what you can do is pl.col("n") divided by pl.col("n").sum().over("condition"). This is like using group by with mutate in dplyr, or the .by argument.
With over, we're saying: sum n within each condition (sorry, my little Zoom bar is in the way) and return that sum for each of the rows in that condition.
So basically, actually, let me show you. So I'm going to say, condition N, just so you can see what it looks like.
So notice that, I'm going to sort by condition. It's another neat thing. So, all right, notice that condition N is the same for all of the placebo group. And the same for all of this condition. And, okay, unfortunately, there's the same N, but the same for this one too.
So basically, the key is that you can do pl.col("n") divided by this to come up with the percentage. I'm going to say percent equals, and I'm going to round it. All right, so that gets us these percentages calculated.
So if you were doing something like this from raw data, you might do a little bit of this prep work. So that's, yeah, that's the gist of using polars to create a table. And do a little bit of the prep work.
And hopefully it shows off some of the power of Polars: with_columns, group_by, agg, and sort. So this is most of what I've prepped. Super happy to answer any questions you have, either about this or Polars more generally.
Q&A and Great Tables discussion
Yeah, thanks, Michael. This was a great walkthrough, where you not only demonstrated how you can apply Polars to a more complicated dataset that is grounded in the real world, especially for this audience, but also how you can create beautiful tables. And there's actually a lot more you can do with this package, things that you haven't shown; that's not the focus of this tutorial, of course. But for those of you who are interested in creating great tables: Michael and Rich have some great YouTube videos where they give a very good overview of what Great Tables has to offer.
Now, there was one question, Michael: whether it's possible to include the name of the statistical test used to generate the p-value, either in the column name or in a footnote? Yeah, that's a great question. So the easy one to do is the column name. I think this one here was intended to be added as a footnote, like we could add it here, I think.
Let me just double check that this does what I expect. Oh, no, I did a bad thing. Let me see. What did I do? Why did I make you so mad? Well, you started with a keyword argument and then you added another positional one. Oh, interesting. Another regular argument, yeah.
So I'll just do another .tab_source_note, and another .tab_source_note. Let's see. So I think this one, actually, you could add it here. I don't know what test you run the most, but I think I saw something about a chi-square somewhere, or a Kruskal-Wallis. I don't know where I saw these mentions, but that's one way: you could put it in the footnote. I'm really revealing myself by simply writing the sentence, "this was a chi-square test."
The other option is footnote marks, where you put little glyphs in the table and map them to the source note. We're still adding that to Great Tables, but I think I saw Rich was actively working on it, so I'm hopeful that we'll have it in the next couple of months.
Yeah, it's a good question. I'm seeing another one: we generate many similar types of tables in Excel for some of our clients, given that's their preferred way to consume the table data. How do you export the table to Excel? Do you mean the Great Tables output or the Polars table? Probably the Great Tables one. I think, yeah, it's a good question. I don't think there's an easy way to do it. One thing that might work okay is, let me open up a new, I'm going to open up a sheet.
Just for completeness, you can write a Polars data frame to an Excel sheet. Great, this is incredible. I just copied the HTML output. So I think the key is that, and I learned something today, when I copied it, Excel turned out to be pretty good at this kind of thing. So I don't know, it might not get you 100% of the way, but I'm pretty surprised at how far it got. Yeah, freaky. So yeah, try it out. Including footnotes?
Say what? Footnotes? Did I? Maybe I just didn't, I don't know. It just cut that part out. Can't have it all. It could be that I just didn't highlight powerfully enough, you know what I'm saying? But is there a way to just copy the whole thing? It could be that if you copy the whole thing, maybe that'll do it. Try it out. It's tricky. It's tricky.
That's an interesting one, though. I'm surprised. I mean, I understand that. Delighted. Yeah. Maybe I'm playing the devil's advocate a little bit here, but one question that you could also ask is: is it really necessary to export this to Excel? Maybe the client would be happy with some other format as well, but Excel is just what they're used to, and they're not aware that there are other possibilities. Maybe an HTML page, so that this becomes part of a document, would suffice or even be better, or a PDF. That's also something you can ask.
Yeah. I do know Excel is kind of nice, though, because you can do a lot of tweaking. You can widen columns and merge cells and do a lot of manual styling. But it's hard to reproduce, that's the one tricky thing. Yeah. There are different approaches. Yeah. Yep. Nice. Any other questions about Polars?
Polars in the wild and ecosystem
I'd be super curious what kinds of things people are looking to use Polars for, or what you're using pandas for, even. I'm definitely a bit less familiar with Python in pharma, so it's an interesting space. One area where it comes up quite a bit is on the larger-data side, so a little bit more on the research side than some of the late-stage clinical trials. But there, people often use other backends like Snowflake and Databricks to interact with data, and a lot of Parquet. And so it's just interesting to see where Polars fits and how people are using it.
It does. Sorry, that reminded me of Narwhals. Narwhals is nice if you work on a team that uses a lot of pandas: you can use Narwhals to code in a Polars style but have it run on pandas. And some of what you said, Phil, reminded me that Narwhals has some support for, let me just go into one of these. I think they're working on support for SQL with Ibis, another Python library, and also on integrating with DuckDB. So I think generating SQL is maybe the big one here. You can use Narwhals with a tool called Ibis to code in a Polars style but generate SQL code, which you could use to hit Spark or Snowflake or any of those types of warehouses. Yeah. And they also have one called SQLFrame. So it seems like they've been busy wiring up everything.
Yeah. Narwhals is an interesting one. Any other interesting Polars stuff? I think one fun thing with PyShiny is that ShinyLive can run Polars in the browser. So you can do import polars as pl. Let's see, I'm just going to do a render data frame that returns a pl.DataFrame. This may be over the top just to show that they can run it, but I think this works. Yeah. Yep. So they've got Polars running in the browser, which is pretty cool. This slider obviously doesn't do anything anymore, but yeah. So a lot of neat stuff cooking.
Yeah. Tim mentioned Ibis is cool on its own, so that's another one to check out. It's a DataFrame library with a similar API to Polars, but it's made to run on different SQL backends. Confusingly, Ibis can run on Polars, and Narwhals can run Polars code on Ibis. Everything's wired to everything else right now, but it's a fun one. Yeah. So you have ShinyLive. Any other cool Polars stuff?
I think those are the big ones, probably. Let's see, what are the newest Polars releases? Do you know what Polars is up to? No, I have no clue. I'm going to pull it up. This team, the Polars team, is cranking out features at a very high speed.
Wow. Yeah. So it feels like a year or two ago they bumped to version one, but... Yep. There's a lot going on, but I feel it is very ready to use in production. In fact, before I joined Posit, I was working for a client where we converted a very large code base from pandas to Polars, and that's been a big success, and it's still running. So I feel very... Michael is looking up the video.
Yeah. I'll go to the middle and freak you out, and then I'll... What are you doing, Michael? What is this? I was just trying to get to a point in time where you are, but here we go. Who's this guy? You know? It's also cool. You didn't give out three free copies of Python Polars to the PyData crowd, so I feel like this workshop is lucky, you know?
Of course. Of course. Yeah. If you want to enter the raffle, that's still possible. The link is in the slides that I shared earlier in the chat. Maybe I can just share the signup URL.
Yeah. Once more, right? You don't have to, only if you're interested. Yeah. Nice. Yeah. Elon pointed out this Polars DS extension. The way extensions work is they often... let's see, does this one... oh, this one you just import, and it provides new functions, basically, that you can run on stuff. I think they used to use Great Tables, but I don't know if they still do. Honestly, I haven't looked at this package in a long time, but it seems like it has a lot of nice stuff. So yeah, it seems like a cool one to check out.
Yeah, I think... Oh, hey, Curtis. Nice to see you, Curtis. No, it's a different Curtis. No. Oh, you know this Curtis. My bad. Yes, I do. We've known each other for quite a while. Okay, I'll share the link to the slides again. Yeah. Nice. I can't believe I just presumed that you might not know a Curtis who showed up, you know? There is another Curtis that we both know. That is true.
Yeah. But there are a lot of Curtises out there. Yeah. Nice. The other thing: I think Polars and geospatial is coming along, but I don't know exactly where it's at yet. When I checked, probably six months ago, it was coming along pretty well.
Wrapping up
Yeah. So I think that's the world of Polars. Yeah, that's about it. So I hope that we have been able to give you a good overview of what Polars has to offer. We definitely didn't discuss everything that Polars has to offer. For example, we didn't go into the whole lazy versus eager mode in Polars, which is a big deal, but it's not something I would start with. That's why we didn't include it. But just to say that there is more to explore if you're interested. So give it a shot. There are various ways in which you can reach out to Michael or myself; I don't know what the best way is. We're both on LinkedIn, of course. And if you go to the polarsguide.com website, you'll be able to find me easily. We're always happy to answer any questions that you have. So thank you so much for joining us for nearly two hours. We appreciate it, and we wish you all the best. Enjoy the rest of R/Pharma.
Yeah, thanks so much for coming. So excited to see so much interest in Polars in the pharma community, and excited to hear what people start using it for. So thanks for coming. Phil, do you have any last words? No last words. I'm just going to put the form in the chat one more time in case you need it to get the badge. For attending today, you'll get a completion badge; you can fill out the form to get that. We've got an amazing Python workshop on Friday with Yilong and the team to talk about clinical reporting with Python, so stay tuned for that if you're interested in helping to lead the charge for Python in that community. A big thanks to the team today for taking us through Polars; excited to see the advancements that come out of this space. We've got the slides in the chat box as well. With that, if you're interested, we've also got Pointblank coming up next in seven minutes. I'm going to jump over there with Rich, who also does a lot of Python work. I also think the package works for both R and Python, so you can technically think of it as a continuation of our Python session for today. So with that, thank you to the two of you so much for helping. If you have any questions, feel free to reach out to the speakers. We'll be posting the video on our YouTube channel in a couple of weeks.
And I think we're good to go. So thanks a lot. One last thing: Pointblank is built with Great Tables. So if you're there, just feast your eyes on that, because Rich, I don't know how, but all of that is Great Tables in ways that I'll never comprehend. Awesome. I just put in the chat that the recordings will go on our YouTube channel in, I want to say a couple of months, but probably a little sooner than that; it takes a little bit of time to get everything up and out. So all right. That's great. Thank you so much. I'll go ahead and end the presentation for today. All right. See y'all. Yep. See you later. Bye-bye.