
Polars: The Blazing Fast Python Framework for Modern Clinical Trial Data Exploration
Polars: The Blazing Fast Python Framework for Modern Clinical Trial Data Exploration - Michael Chow, Jeroen Janssens

Abstract: Clinical trials generate complex, standards-driven datasets that can slow down traditional data processing tools. This workshop introduces Polars, a cutting-edge Python DataFrame library engineered with a high-performance backend and the Apache Arrow columnar format for blazingly fast data manipulation. Attendees will learn how Polars lays the foundation for pharmaverse-py, streamlining the clinical data workflow from database querying and complex data wrangling to the potential task of prepping data for regulatory Tables, Figures, and Listings (TFLs). Discover the 'delightful' Polars API and how its speed dramatically accelerates both exploratory and rigid data tasks in pharmaceutical drug development. The workshop is led by Michael Chow, a Python developer at Posit who is a key contributor to open-source data tools, notably helping to launch the data presentation library Great Tables, and focusing on bringing efficient data analysis patterns to Python.

Resources mentioned in the workshop:
* Polars documentation: https://docs.pola.rs/
* Plotnine documentation: https://plotnine.org/
* pyreadstat: https://github.com/Roche/pyreadstat
* Examples of Great Tables and Pharma TFLs: https://github.com/machow/examples-great-tables-pharma
* UV Python package manager: https://docs.astral.sh/uv
image: thumbnail.jpg
Transcript
This transcript was generated automatically and may contain errors.
Yeah, thanks everyone for joining us. We're so excited to talk about Polars, and I actually forgot the long name of this workshop, but how it relates to clinical reporting and TFLs. I'm here with Jeroen Janssens, who, I would say, is the Polars big dog. So it's a crime that I spoke first, because I feel like Jeroen is the authority on Polars. And so I feel really lucky to have him here.
Jeroen, do you want to take it away? Yeah, yeah, yeah. No, Michael, I can't believe you spoke first. This is not at all how we rehearsed. But I'll forgive you. I'll forgive you because look at our shirts. Look at our shirts. They are DataFrame themed and that makes it up. So yeah, and actually, you know, the full title, I have it here in front of me. I'll share my slides in a second, but it's Polars, the Blazing Fast Python Framework for Modern Clinical Trial Data Exploration. That's quite a subtitle there. And so we have some, you know, we have our work cut out for us and we're indeed very excited that you're joining us, that you're, you know, investing your time in learning about this, this awesome package. So yeah, my name is Jeroen.
I work at Posit, just like Michael. I do developer relations. Michael does open source. And so for the next two hours, we're going to be talking all about Polars. And if you have any questions, you can use the chat. I might not see them when I'm talking, but Michael, shall we say that whenever you detect a question in the chat, that you just give me a heads up, that you vocalize it? Yeah, I can do a little nudge when it seems like good timing, if there's a question that... Yeah, yeah. Or you can just, you know, convert the text to speech yourself.
And one thing that could be nice too, while you get this cooking: I'd be curious what drew people to this workshop, since it's called R/Pharma. I'd love to hear about your experience with Polars and what drew you to Polars and Python. So maybe drop it in the chat. Good question to start with. Yeah. How you got here? I don't know. Because the conference is R/Pharma, but yeah, this is going to be all about Python: all Polars, Polars and Python.
Polars in a nutshell
So this is what we're up to for the next two hours. I'm going to start with laying the groundwork, giving you the context: what is Polars, in a nutshell. You may already be familiar with what Polars is about, but since we have quite a few people joining us today, I thought it'd be good to first give you the basics.
Then ETL: extract, transform, and load. That's basically what it comes down to, right? So we're going to have a look at how Polars, with its vast API, maps onto that. Then we're going to look at some common operations that you'll use probably 80% of the time. And then, depending on how we're feeling, we could do a deep dive into expressions. This is kind of like a buffer, depending on how fast or how slow we're going. It's all good; we can go as deep or as shallow as we like. I want to make this tutorial interactive, to the extent that that's possible using Zoom chat. But then I'm going to give the floor to Michael, because he has prepared a demo where he applies Polars, and also does some great tabling, with clinical trial data. So Q&A is listed here at the very end, but again, as we said, questions are welcome throughout the tutorial.
So Polars in a nutshell. Polars is a DataFrame library. It's written in Rust, but it has bindings in a couple of different languages like Python, R, JavaScript. And the one for Python, that's actually the most popular one and the most developed one. So that's why we're going to focus on that one today. If you're like, I really like Polars, but I feel more comfortable with R, there are alternatives for that as well. So there are various ways in which you can use Polars within R. Although I have to admit, I don't have that much experience with that, but I just wanted to point out there are options.
It is blazingly fast. So here's a benchmark to back that up. What we see here are timings for different benchmark queries that we have performed using Polars and four other popular DataFrame libraries. On the x-axis, we have time; notice that it's on a log scale. And so this gives you an idea of how fast Polars actually is. DuckDB is also really fast, and for a couple of queries, DuckDB is actually faster. This changes over time; there's a very healthy bit of competition going on between those two. But it is fair to say that both Polars and DuckDB are way faster than the alternatives here: pandas, PySpark, and Dask.
Now, the second thing you should know, the first one being that Polars is blazingly fast, is that Polars is really popular. These are GitHub stars over time. Now, popularity by itself is, of course, not the only thing that matters. But when you're choosing something new, it really helps to use a technology that has a community around it, that has good support. And as we can see here, that is definitely the case for Polars. This graph is a little bit outdated; I checked an hour ago and it's now at about 36,000 GitHub stars. All to say that in this very short time span, Polars has gotten a lot of attention.
I guess that's about it, right? Oh yeah, the API. It has a very expressive API. So what you often hear about Polars is: come for the speed, stay for the API.
Come for the speed, stay for the API.
And I have a slide about that as well, but it's a little bit further down the road. Now, the API, the actual code that you're writing, is very different, I often hear from people who are used to pandas, which is another very popular DataFrame library in Python that has been around for, I would say, 15 years, give or take. But what I've also been hearing from people trying to get used to Polars is that it reminds them a lot of the tidyverse. And I would say that's a great compliment. Now, it's not exactly the same and it never will be, because R and Python are still two very different languages. But yeah, there are some similarities there.
Hey, so there's this book that I wrote together with Thijs Nieuwdorp. And if you're interested, we are giving away three digital copies. So if you go to polarsguide.com slash rpharma, or if you scan the QR code, you can enter the raffle by filling in your name and email address. Everybody who does this gets the first chapter for free. So this could also help you assess whether it's worthwhile for you to invest more time into learning Polars. But yeah, you could be one of the three lucky people to win a free digital copy.
Yes. Okay. Yeah, here it is. I mean, even Yoda agrees with this: come for the speed, stay for the API. And we're going to be focusing mostly on the API. The speed is something you have to see for yourself. I have shown you the benchmark, but it is not until you try it out yourself that you can truly witness how different this is. Once you experience this speedup, it actually opens up possibilities. It's a strange thing, but that's what time can do for you.
Expressions
So regarding this API, the actual code that we need to write in order to get stuff done: about 50% of that comes down to expressions. Expressions are a very important concept within Polars. And this is also the thing that is perhaps most difficult to grasp, the thing you'll have to spend the most time on. I'm sure that, as R users, many of you have experience with ggplot2 for visualizing data. I love ggplot2, but do you still remember when you first needed to understand the ideas, the concepts, the grammar of graphics behind it? It took a while. There's definitely a learning curve, but once you get it, something clicks and you're like, yes, I get it now. And from that moment on, you feel liberated. You don't have to consult the documentation for every little thing that you do, because you understand the underlying ideas. That kind of holds for expressions as well. There is a learning curve associated with them, but once you grasp it, it comes very naturally to you. And you'll notice how easy it is to work with these building blocks. That's a good thing, because expressions pop up in various places, and I'll show you those places in a moment.
First, I just want to drive home the point of how big a thing expressions are within Polars. So when Thijs and I set out to write the book for O'Reilly, we were like, yeah, we'll devote a chapter to expressions, of course. But as it turned out, we were incredibly naive in that respect. It turned out we needed not one, not two, but three chapters to cover everything that expressions have to offer. So within the book itself, they take up a lot of pages. And I also counted, and you should take this with a grain of salt, because it's not always straightforward to get the number of functions or methods that a package has to offer, but there are 404 methods associated with expressions. That's when you sum the top two numbers. So there are the top-level functions that Polars offers, for reading in data, for example; writing your data, no, that's a method of a DataFrame. A DataFrame also has lots of methods. And then you see that expressions make up about half. So that's why I wanted to devote some time to these things.
ETL with Polars
Now, ETL: extract, transform, and load. That's often what it comes down to when you work with data, regardless of the type of data that you work with. At least in my experience; if you have other ideas here, I would love to hear them. By the way, Michael, if there is something coming up in the chat, let me know, right? I don't see any messages at this moment, but if there is a question that's relevant, be sure to let me know. Do you have the link to the slides, by chance?
So ETL. Extract data: read data from a particular source. It doesn't matter where it comes from, just read some data. Then a whole bunch of transformations, right? And I enlarged this T just to visualize that, yes, that is where we spend at least 80% of our time. And then it's storing that data somewhere. Maybe you want to make some visualizations, model your data, maybe create a great-looking table. Those are also very important steps, but when it comes to actually working with the data, this is what it comes down to.
So here I have a complete example, and I'm not expecting you to read this code or understand it. I just want to point out the ratio of E, T, and L. So, what I've colored red here: there are a few lines at the beginning, and then one more towards the bottom on the left side. That's where we read in some data. In this example, and this doesn't really matter, it's Citi Bike data: bikes that you can rent in New York City. We read in some CSV data at the start, and later on we combine that with some GeoJSON. We create a beautiful visualization that I'll show you in a moment. And at the very end, there's this line which I've made blue, and that's where we write the results to disk, in Parquet format. And the rest, in white, those are all the transformations: that's where we select columns, where we filter, where we do aggregations and joins and so forth. And I think this is very typical. There are always exceptions, but for most data pipelines, if you will, this is what it comes down to.
By the way, when it comes to reading data, Polars can read lots of different formats from lots of different sources. It can read from lots of different databases, and it can read Excel spreadsheets, JSON, and so on. I actually don't know what kinds of formats are used most often in your field; I'm curious to hear about that. But in any case, I'm pretty sure there's a way to get that data into Polars.
Yeah. And Jeroen, someone asked about read_excel, which maybe is an interesting kind of detail between Python and R: the R package doesn't have read_excel. Oh, really? I didn't know that. I think it's just because the Python version uses another Python package for it, so there wasn't a clean workaround. It's kind of just one of those funny in-between things. Oh yeah. I bet there are some discrepancies between the different language bindings. So in that case, yeah, the Python API relies on some other package to do the reading of the spreadsheets. Okay. So I wasn't aware of this, but I'm pretty sure that in R, what I would then do is first read it into a regular data frame and then pass that on to Polars.
Okay. Oh, we're recording this, right? Never mind. Hey, look at this. Wow. This is a visualization that was created using plotnine, which is actually a Python port of ggplot2. I'm not going to show you the code for this plot, but that wall of code that I showed earlier was actually needed in order to produce this visualization. Just wanted to throw it out there. And if you are curious, this is discussed in the first chapter of the book. Anyway, just wanted to plug plotnine as well here, I guess.
Common operations
Okay. So transformations, right? We've seen this wall of text, lots of code, and I've said expressions are important. Okay, let's get down to business. What are some common operations? These are not all of them, but I would say that in the majority of cases, you want to either select columns that you already have in your DataFrame, maybe a subset, or create a new column. Maybe you want to filter rows. Maybe you want to do some aggregations: some grouping, creating aggregate statistics. Or, last one, sorting rows. I would say these are the most common ones. Perhaps joining; I could have added joining here as well. But let's start with these. Now, the reason I want to cover them is to show you where expressions play a role and how they are used by the rest of the API.
Yeah. And don't worry, Michael and I are not expecting you to be an expert by the end of these two hours. What we hope you get out of this is that you're like, yeah, this Polars thing sounds interesting, I'm going to give it a go. Or maybe you decide it's not your cup of tea. That's fine too. Either way, I think, Michael, right? We will have succeeded.
Okay. Okay. So not clinical trial data yet. Let's keep things simple for now. Let's work with some fruits. Fruit is good for you too, right? So we have here 10 fruits and various properties related to those fruits. Yeah. So their weight, color, whether they are round or not, and their continent of origin.
So let's see. Those five common operations that I just listed, let's have a look at how they can be performed using Polars on this small DataFrame. So we're using the select method here, on the fruit DataFrame, and we're passing in four different arguments to this method. Now, we see pl.col a couple of times. This is the most common way to start an expression. Yeah, I haven't even explained what an expression is; don't worry about it. First, I want you to get a feeling for what this is, to build up some intuition. We'll get to the definition later. The first pl.col, with "name" in quotes, refers to the column called name, and we're not doing anything else with it. So what we're basically saying is: select the name column. That's all. Now, the last argument here is round, and you see that we just use the name itself. That's possible too, but then we're not creating an expression, and that means that we cannot do anything special with this column; we can only select the column as is. For example, what we do here with the weight column is divide it by a thousand. This is metric, so we go from grams to kilograms by dividing by a thousand. But in order to be able to apply an operation to a column, it needs to be an expression. That's what we're doing here with pl.col, col obviously being short for column. Now, that second argument over there, that is a regular expression. So keep in mind: whenever I say the word expression, I am referring to a Polars expression, and not the thing that you may know as a regular expression or a regex. Those are two different things. But what we're doing there is selecting all the columns that have the text "or" in them: color and origin.
Second common operation: creating new columns. This is done using the with_columns method. And I always like to tease the creators of Polars that it was a mistake to call this with_columns. Why? Because all the other operations are verbs, just as in the tidyverse. But for some reason, and I'm guessing this has to do with Polars being inspired by Spark, they went with with_columns. Anyway, let's not dwell on this for too long, Jeroen. Let's focus on the good stuff, and that is that you can create new columns using expressions here. Okay. First argument: pl.lit, a literal value. This is another way of starting an expression. The other one, pl.col, uses an existing column as a starting point; pl.lit is for when you start with a Python value. Maybe you have some other data lying around somewhere. It could be a constant value like True, or maybe a number, a list of values, or a NumPy array. So pl.lit(True) starts a new expression, and we're giving it a name, because it doesn't have a name by itself; it's entirely new. We give it the name is_fruit, and this, at the very end, becomes a new column called is_fruit. The second argument is again a new column, called is_berry. And this is based on the name column; you can see that because we use pl.col("name") and then apply some operation to it, namely the ends_with method, which is part of the str namespace. So whenever the name of the fruit ends with "berry", the value for is_berry becomes true. A little bit contrived, I know, but I just want to demonstrate a couple of options here. Now, for this second argument, we're giving the name is_berry in a different way. So there are two different ways of giving a name to a column: one is using the alias method, and the other is using a keyword argument.
There are some restrictions with keyword arguments, of course, because the name has to be a valid Python identifier, meaning it cannot start with a number, for example. If you do need a name like that, you'll have to use the alias method. Right, so that's it for creating new columns.
Now, filtering rows. This is actually a single expression that I have put onto two different lines, an expression that's composed of multiple expressions. So what we do here is keep all the fruits that weigh more than a thousand grams and that are round.
So we have two rows, two different fruits that, yeah, for which this expression yields true. These two expressions, by the way, they're combined into a single expression using the and operator here. You can combine expressions like this as well.
I guess, yeah, and there are other ways to combine expressions. You can use arithmetic and comparison operators. Actually, there is a comparison operator here with the weight, right? We say weight has to be greater than a thousand. Yeah. And the Boolean operators. All right. I'm not hearing any questions, Michael, or are you secretly answering them via Zoom? I have answered a couple. There is one.
Q&A: column labels in Polars
There is one. Here's a question for you. All right, all right, let's hear it. Riddle us this, Jeroen. Is it possible to have labels attached to column names?
To have labels attached to column names? Oh, wait. That is something I've seen when I had to use some SPSS data. Is that where this comes from? So a column has both a name and a label? Some extra metadata?
I'm not sure. Yeah. It does sound like it's SPSS related because Albert in the chat's mentioning that there's a way to do it. Is there a way in Polars to do this?
Oh, it sounds like SAS as well. People are weighing in. This is more expressing our naivety about labels, but the chat's really picking up, so it's clear labels are load-bearing. I love that. I love to see the interactivity.
I do see very tiny pop-ups, but I can't read them. Okay. So labels: I think that's not possible in Polars. I think this is a feature that you may have gotten used to when using SAS or SPSS, two applications I have very little experience with, I have to admit. And now I'm wondering, and maybe this gets a little bit philosophical, what the use of a label is; that's not something that's really important to me at this moment. But I think the short answer is that a column can only have a name. If there is some additional metadata associated with a column that you want to use at a later stage, maybe when you are creating a table of some sort, what I would do, and it may sound very hacky because you're used to this very nice feature of labels in SPSS and SAS, is create a separate DataFrame with two columns: one is the name of the column, and the other is the metadata, the label. And then later on, when you have transformed all your data and you're ready to produce a table, perhaps using the Great Tables package, that is the time when you refer back to that second DataFrame you created.
Now, again, this is a workaround, it is hacky, but I also have to say that I haven't seen this feature anywhere else. It's also not something, as far as I know, that most relational databases have.
Right. Like what you described, putting the metadata as a separate table is a very SQL database thing. Like, if anyone's ever looked up like information about the columns, it's always like a table where each row is a column being described. So it's a very nice, it's a pretty classic pattern. But I do get that if you're really used to SAS and SPSS, that this thing might be a really...
Yeah, that's a bummer. Who knows, maybe at some point there could be an extension for Polars for this. I don't know. I guess it also depends on how many SAS and SPSS users want to move over to Polars, right? If there's really a need, I know some fellas in the Polars team. I don't know, maybe we can work something out. Just saying. But yeah, thanks for that question.
I will say, Albert Lee pointed out that Roche has a package called pyreadstat that does something like what you mentioned: it reads out two pieces, the data and the metadata.
All right. Yeah, yeah, I've seen pyreadstat, right, which can be used to read in SAS data as a Polars data frame. And I guess it supports some other formats as well, might also support SPSS. But if that's how they do it, then now I'm feeling a whole lot better about my suggestion. It's not hacky at all. Just have a second data frame.
Yeah, yeah, that was neat. Okay, I was gonna ask for more clarification on the question sent to you, but that was a neat, that was a fun one.
Yeah, yeah, no, but please, please. I mean, more questions is better, right? If we, plenty of time. All right. So just keep those questions coming. If Michael thinks, you know, he can answer them in the chat, then that's fine too. If we want to turn this into a bit more of a discussion, I'm very happy to do that as well.
Group by and sorting examples
All right, then. Let's go back to this example right here, where we use the group_by method on the DataFrame in order to compute some aggregate statistics. So this is an interesting one; it may take some time to read. We're using expressions here in two different places. The first place is the first line: we are using an expression to determine which groups we're going to create. Actually, we're defining what a group is. And that is, wait, I have to look at it again. Yes, it's the last part: the continent. And if it's South America or North America, the first word is stripped off, because we're using the str.split method to turn it into a list and then getting the last element from that list. So it's a few steps. But I guess if you look at the outcome, the resulting DataFrame, it's clear what this expression is doing.
So that's what we're actually grouping on: a derivative of the origin column. Now, the second place where we're using expressions is within the agg, or aggregate, method. First, we want to have a column called len, the number of rows in each group. And second, the average weight. It's important to note here that a method like mean is one of a handful of methods that summarize data: they turn multiple values into one.
All right, I think this is the last one. Yeah, it's the last one: sorting rows, very common. What are we doing here? Again, a little bit contrived, but here the fruits are sorted by the length of their name. I couldn't think of anything better; weight would be an obvious one, but I wanted it to be a little bit more complicated. So, the length of the name, in descending order. And you get that by setting the descending keyword to True.
I'm curious to hear, actually, who here has experience with pandas? Let us know in the chat. Michael, I'm counting on you to give me some aggregate statistics there. We've got to let it incubate, you know. Yeah, let it incubate for a little bit. All right, we got some pandas coming in. We got a little bit of pandas in the house. That's okay. I owe a lot to pandas, I have to say. On the internet there are sometimes some annoying or less elegant comparisons between Polars and pandas, and I say, let's ignore those. Without pandas, there would never have been a Polars. So we owe a lot to them, or yeah, I say we.
I'm mentioning pandas because of the way you change the sorting order. In pandas, if you want to sort your rows in descending order, what you have to do is set the ascending keyword to False. Now, I'm not saying which one is better. I just want to point out that if you do have experience with pandas, you may run into a couple of these small differences, and that will require you to do some unlearning. Something to keep in mind. If you are interested, chapter three is devoted entirely to this.
Anyway, expressions. We've seen a couple of examples now, and we've built some intuition for what an expression is and what it can do. Let's see, we've got some time left, right, Michael? Let's look a little more closely at what it is. I have a question. I'm so sorry. No, no, no. Don't apologize. I love questions. Let's hear it. Does it break ties alphabetically by name or by original sort order?
Oh, so whether the original... wow, it's a very tricky question. Is it stable? I would guess, I would hope yes, but that's something I would have to look up. You know what, it might even differ: does it have just one sorting algorithm or multiple? This is something we can definitely look up. I don't want to do that right now, but maybe I can do it while you're talking. Yeah, I just found it: there's a maintain_order argument. There you have it. I knew it. Isn't that fantastic? It's False by default. So I guess it's not stable by default, but you can make it stable. And then I guess it'll be a little bit slower, not that you'll notice it, but that's what I'm guessing; that's why they made it False by default. Great question. Keep them coming.
What is an expression?
What is an expression? Okay. So if you look at the official Polars documentation, they define an expression as a lazy representation of a data transformation. All right, let's keep that in mind. If you ask Marco Gorelli, who is a core contributor to Polars and the creator of Narwhals, another fantastic Python package that abstracts over the different DataFrame implementations, he defines an expression as a function from a DataFrame to a series. And, you know, I'm not saying that these two definitions are wrong, but I think that we can do better than this. I think that we can do better.
I guess that's also what happens when you start writing a book: you spend almost two years with the material and you develop this need to define things. I don't know if that's a form of yak shaving, because it would be better to just be writing, but here's what Thijs and I came up with: an expression is a tree of operations that describes how to construct one or more series.
An expression is a tree of operations that describes how to construct one or more series.
Now that's quite a mouthful. Um, so, so let's, um, let's break this down because I do think it is important that you, uh, have a mental model of what an expression is. Um, so, so that you know what you're dealing with and maybe, maybe you won't get it now. That's, that's okay. But when you do begin this Polars journey, I do want to advise you to, to grasp this concept of an expression, because that is something that will come back, uh, uh, the most, otherwise it'll haunt you.
Okay, so let's break this down. Let's start at the very end: series. A series is a sequence of values; that's how I see it. Very often a series is also a column, but that's not always the case. Let me go back to this example right here. Yeah, I guess this will do. I'm sorting on a series that does not exist in this data frame, right? I'm creating a new series based on an existing column, but this new series does not become a column by itself. We don't have the lengths of all the names stored anywhere; it is never materialized. It's only used to sort. Yeah.
Maybe I'm already giving away too much here. So: to construct one or more series. Okay. The second part I want to highlight is that it is a tree of operations. Here's an example of how you can think of this. I have an expression that's not based on any existing data frame. We're adding one and five, we're adding three and five, and we're dividing one sum by the other. So there are in total four different values we're dealing with to start, and they're at the very bottom of the tree, as you can see here. Five and one are combined using arithmetic, the plus operator (or the add method, if you will), and those two combined with that operator create a new expression. The same is done for the numbers five and three on the right side. And then, as a final step, these two expressions are combined into a single expression using the division operator. You can visualize any expression like this using the meta.tree_format() method.
And this is a very simple example, but it shows that we are dealing with a tree of operations here. I hope this helps your mental model. Right, the "describes" part: an expression is a tree of operations that describes how to construct one or more series. This "describes" is actually related to what the official Polars documentation calls lazy. The expression by itself doesn't do anything; this is perhaps the most interesting aspect of an expression. They are lazy. You could call them a recipe of what to do, and it is the method, combined with a data frame, that determines what the result will be.
And I think I have an example about this in a second. Okay, constructing. Why did I put this in here? Yes: a series is created, but it doesn't always become a new column. I think that's why I put this in here, and I should read the book again. One or more series: this is an interesting one, because it is not in the definition that Marco Gorelli gave. Remember that expression I had with the regular expression, which yielded two columns, color and origin? Those are two series, and in turn two columns, that were created. So a single expression can lead to more than one series. Now the developers will say, yeah, but under the hood it's expanded into multiple expressions. All right, that may be, but from a user perspective it is still only one expression.
Here's another example where we use a single expression on all the existing columns. We start out with two columns, A and B, and we're multiplying them by 10, yielding another two columns. So in total four columns from just a single expression.
Let's see now. All right, Michael, do we have any more questions we want to discuss? No, I was just putting in a note that pl.all() is kind of similar to using across(). Oh yeah, yeah. You could also do pl.col with an asterisk, pl.col("*"). Or across() in R, you mean? In R, yeah, just for the R folks. Of course, very good of you, you were relating it back to some tidyverse functions. Thank you.
You're welcome. Yeah. Although, if you're used to tidyselect, then... yes, nothing compares.
Expression properties
So, properties. Because of all this, expressions of course also have some properties. In my story I have mentioned these in passing, but let's have a look at which ones deserve extra attention here. Again, they're lazy, right? They're like recipes, and the methods, .filter, .sort, .group_by, they are the cooks that follow the recipe. The result depends both on the function (.filter, .sort, .select) and on the data to which it is applied.
So here's an interesting example to demonstrate this. I create an expression called is_orange. An expression is just a Python object, which you can assign to a variable. Now I can reuse this expression. Here in this first example, I am creating a new column called is_orange. Notice that the name of the new column and the name of the variable are both is_orange, but that doesn't have to be the case; I could have also assigned this to a variable called e, for expression, not a very good name, but what matters is the name of the expression, which we create here using .alias. Here I apply it to the fruit data frame, which we already know, and I'm going to apply it to the fruit data frame a couple more times using different methods. We can use it to filter: now we only get the rows of the fruits that are orange. The exact same expression, used in a different place. We can use it to create groups: how many orange fruits do we have? It so happens that in English, orange is also a fruit, but I'm talking about the color here, of course. So six that are not orange, or that are.
Okay. Now finally I get to demonstrate that these can also be used on other data frames. So remember we have this is_orange expression, and here I'm applying it to an entirely new data frame that I'm making up right here. These are three different flowers, and two of them are orange. Fantastic.
Expressive, right? I could go on about this for a very long time, but I do hope that, because of the method chaining you can do, because you can combine multiple expressions into a single one, and because you can reuse them, they are indeed very expressive. Efficient? Yes. The Polars execution engine will execute these expressions in parallel. That's awesome. It means that in many, many cases your Polars query will be a lot faster than what you're used to.
I don't know if speed is a thing for you. Let us know in the chat if speed is something you run into, meaning that your code is too slow and it really becomes a burden. It may not be the case, right? Maybe you're just dealing with a couple of thousand rows and all is good. That's fine too.
All right. Expressions are also idiomatic. Last thing, and then I think I'm going to hand it over to Michael. So here's some non-idiomatic Polars code. If you're used to pandas, this may look familiar, with the square brackets and all. And it is indeed the case that this code produces the same result as the earlier filter example I had. But it is suboptimal, because these two things are not expressions, so they're not executed in parallel; the optimizer that Polars has cannot make any optimizations, and they're executed in a serial fashion.
Also, it is very possible, very likely even, that you will run into an error. Let me see here; there's a lot going on on this page. So we got some fruit, and I'm doing a filter and creating a new column, and then I'm trying to do the same without expressions. The top one is idiomatic and just produces a beautiful data frame. The second one is not idiomatic, and the reason it fails is that the filter method reduces the number of rows from ten to two. So by the time we are at the with_columns step, Polars is dealing with a data frame that has two rows, whereas the code is_berry = fruit["name"]..., with the square brackets, refers to the original fruit data frame, which still has ten rows. That's a mismatch, right? The shape error pretty much gives it away: unable to add a column of length 10 to a data frame of height 2.
Yeah. So just as a warning, or maybe as an encouragement: you should always use expressions in Polars. Right, that was a lot of talking. Now is the time if you have any questions about anything that I've just said, or things that I have not said about Polars. If you want to wait a little bit longer, that's fine too. Otherwise I'm going to give the floor to Michael.
What do you think, Michael? Do we have anything in the chat we should discuss now? No one answered "do we have a need for speed" in the chat, so it's a toss-up, you know? That's cool. I mean, there's more to life than speed. True. Should I... let me crack open this demo. All right, Jared in the chat feels a need for speed. Thanks, Jared. All right, I'm going to give the floor to you.
Setting up the demo environment
I don't know if I have to stop sharing now. That's great. I'm going to, I'm just going to stop sharing. Uh, Phil's blurb, just so that people, uh, in case anyone wants to follow along with, um, on Posit Workshop, I'm just going to, I'm going to crack open and use Posit Workbench. Um, since it's, it seems like a good way to just kind of fire it up and, um, all right. So let me, let me share my screen. Hey, Michael, and shall we do the same now for you? Whenever there's a question in the chat, I'll just, uh, give you a heads up. It's a deal. I'll, um, on my second monitor, I'll also open the, put the chat just to just in case.
All right. So, whoa, I don't know. Don't look at my desktop. Cause I actually literally never look at it and didn't know it even existed. So I think before my Mac, I had, I had a setting where desktop didn't exist, basically like the files aren't there. Nothing's there, but I think it update changed it and brought the desktop back, which was frightening because it's, it looks just littered with files.
So if you see a desktop- Excuses, excuses, Michael. We know what you need to do after this workshop. You need to clean up your desktop. But let's focus on the good stuff, on this clean code.
So I'm going to start from scratch again, just so I can show how I got into it. The one piece I can't show is I don't think it'll have any log in. So I just clicked Posit Workbench at workshop.posit.team. And I think if this is your first time, you'll be prompted to sign in. So usually I click through and then there's a GitHub option. So I chose, actually, let me, I'm just going to do this. So Posit Workbench, and then I clicked sign in with open ID. And then I used GitHub to sign in.
And let me see what file I might have. I think for speed, if you click on register, it's a little bit faster, because you can plug and chug an email. Oh, nice, cool. That gets you in pretty quickly. If you don't mind logging in through GitHub or Posit Cloud, that works fine too. Either way is totally fine. And for the email here, you can totally make up an email address; it works fine. The only thing I usually do is keep the email and the username the same, even if I use a fake one, so in case I ever want to log in, I can remember how to get back to it.
Remember, yeah. But just for, since I'm on here, we don't keep these up forever. These are just to explore and play with the environments. The Posit Cloud environments do stay up, but unfortunately they don't have Positron with Python yet. And so that's why we're using this separate environment. So this is just, these will be up for a couple days if you want to play with them. Yeah, nice. Okay, so it's a good place to visit, but don't try to live there.
All right, so I'm going to click into Workbench. This is the session I started before; Workbench will list out all of your sessions. I have a Positron session open, and I'm going to create a new one here. So I click Positron Pro, then Launch. Okay, so now we've got a new session cooking. All right, and it just opened.
And then I'm going to look at my files. It says you don't have anything. Let me put this in the chat: examples-great-tables-pharma. Okay, so I'm going to clone this. I went to Code, HTTPS, and clicked this copy button. And then I'm going to git clone here.
Oh, you know what? I forgot that Workbench you share, you have like a shared folder. So I had two Workbench sessions open. And so I'm going to delete my previous work really quick. And you can also clone just from the top left-hand side. There's three little lines and there's a, oh, it's right there. Clone repository, that works fine too. I saw that, it required me to sign into. Oh, it doesn't? Oh, oh, okay. But I could have gotten it wrong. Let's see. So I'm going to clone this.
All right, and then I'm going to cd in. This uses uv, so I'm going to do uv sync. Let me make this a little bigger for folks; sorry this bar is so long right here. So, uv sync. What this will do is create a virtual environment in the folder .venv inside this folder, and basically this will hold everything we need to run Python and our packages. And I think Positron usually picks up on it.
For some reason, it didn't here. Maybe because I didn't have Python selected. Oh, you know what we have to do? I'm so sorry. Now that I ran uv, I'm going to click Open Folder, and I'm going to open the examples folder that we cloned. So: I cloned the folder, I went into it, and I ran uv, but the one thing I forgot to do was open the examples folder. All right, here we go, and now I bet it will. So notice now Positron knows that we're using uv, and it's able to set everything up from the beginning. Opening the folder is kind of nice that way.
Introducing the Great Tables clinical trial example
All right, so I'm going to just to show you what we're going to do. I'm going to run the full example. And I'm going to produce the table first before going through the code. So here's the, I'm going to make this a little smaller. All right, so here's the table we're going to produce.
So I got this from Rich Iannone, who is the author of GT and maintains Great Tables with me. This is from some clinical trial. I have to admit, I'm a little less familiar with this type of work, but this was a type of table he assured me is representative of what people might do.
And I think it's interesting because, so this is using Great Tables to recreate this table. And let me just flag some of the things we're going to do. So I actually included an image with some things annotated. So I'm just going to show you. So here we have a count of our sort of like sample size. This is up in the inside the column labels. So we're going to want to do that. We've also formatted a lot of the values. So notice pvalue used to have a lot more decimals. And we've cleaned it up a little bit.
And then this, another interesting kind of mechanic is that we have this extra information, the percent. So here, this is reporting, I think this is number of people in these different age brackets. But then we also have the percentage for the condition. So in the placebo condition, 16% of people are less than 65 years old. And so these are kind of three interesting pieces of the table. The pulling out these overall sample sizes, formatting the values in different ways, and kind of combining information together by putting the percents in parentheses.
All right. And if you have any questions, feel free to put them in the chat. And you're definitely, feel free to interrupt me if I miss anything.
Loading and exploring the data
All right. So to start, I'm just going to load this data and show you a bit what the sample data looks like. So, okay, let's do... table. Let's look at the data. Oh, it prints out here. Okay.
All right. So notice that the table data is funny. It actually matches pretty closely the table on the right. So you have like age here. And then you have these different labels, like N, mean, SD, median, min. So our table is pretty close to the final format. We just need to do a little bit of cleanup.
So the first thing I'm going to show is pulling out the... Sorry, I'm going to make this full screen. Pulling out the overall values here. So what we'll do is we're going to run a simple filter. So let me just show you. So we're going to filter where category is this age Y. And where the label is N. We're just going to pull out basically the first row of data.
All right, so here we go. We use filter with pl.col("category") equal to this age category, and pl.col("label") equal to "n", to get the first row. Then we're going to select these four columns where we want to pull out this n value. So I'm going to show you that piece.
Notice that in Python, one funny thing is that to run this, notice that in R, you might be able to just highlight part of your pipe and run it. In Python, what I do is I usually highlight up to where I want to run. And then I close the parentheses at the end. You could also do it by commenting out lines. But I find that this is a quick way to just run little bits of code. All right. So notice we've selected now the four columns we care about.
And then I needed to wrap up with a little more advanced Polars, where we cast to an integer just to get rid of these decimal places. And then .row is a funny method that says: give me this exact row of data; named means as a dictionary. So, just to show you: it gave us back a dictionary with these values. And this is probably the funkiest part of the whole table, because we need to pull these out and hoist them into the column names.
All right. And Michael, this reminds me of our discussion about labels. Yeah, it is kind of label-y. Yeah, you create a separate object. In this case, it's a dictionary, which, you know, does the job as well. Earlier, we talked about a second data frame. But as we'll see later on, you're referring back to this dictionary and then use them to create the column names. Yeah, totally.
Using selectors and formatting values
So that's the n overall. The next piece to show off is selectors. What we're going to do is clean up the code a little bit. Jeroen showed how you can pull an expression out: you can assign an expression to a variable and then reuse it. We're doing something similar here with these selectors. So I'm saying: select these four columns by name. And then this selector says: choose all columns that end with _pct. And if I run this on tbl, tbl.select with this percentage selector, I get just these columns back. Sometimes I find it's nice to pull these out to flag some of the structure that we're going to get into.
All right. So we pulled out these overall n values. Next, we're going to clean up the table a little bit. So specifically, right. Notice that we might want to limit, I think this table actually cut short some of the values, but we might want to limit how many decimal places or significant figures people see. The other one is these percent columns. We might want to write this out in a more clean way. So for example, we might want to write it like this with a parentheses around it and a percentage sign and then the p values we might want to shorten.
So I'm going to do that here. Basically, the trick is we're going to use the with_columns method, which is a lot like the dplyr mutate, and we're going to use it to change our columns. These all come from Great Tables; they're formatters that clean up your numbers and turn them into strings. So basically we're formatting numbers to three significant figures, and you can test these out directly in your console. You could put a number in with n_sigfig=3, and you can see that it formatted it to three significant figures.
So that's the interesting thing about these functions is that you can actually is they can work either inside with a Polars expression, or you can try them on numbers directly to see, just to like kind of experiment and get a feel for them. All right. So I'm going to run these.
All right. Okay. So the key is now these conditions, the placebo, these two in total have all been shortened to three significant figures. The percentage columns are now, if you look, formatted so they're inside parentheses with a percentage sign. And then the p values have been shortened to four decimals. And notice that they're all strings now. So they're sort of like clean formatted numbers that can go into the table.
Okay. So the last thing we need to do is add the percentage columns to the value columns. We're basically going to add them together: placebo plus placebo percent. The one trick is we need to fill in null values, because a null plus a string is a null; this is called null value propagation. So we need to make sure they're empty strings, so that anywhere there isn't a percent, we still keep that left-hand value. It's kind of a funny dance, but it's just because of how nulls work. So I'm going to run this and look at cleaned.
Okay. So notice that now in the value columns, the percentage sign, we have it fully formatted for our table. So with the value and the percentage. Yeah, that's right. So someone asked, how do you align decimals? So I'll show you in the key is Great Tables. So now that we have our sort of raw Polars table ready for display, we can use Great Tables to sort of style it.
Styling the table with Great Tables
So here's what it looks like. If you call .style on a Polars data frame, that gives you a Great Tables table. This is the same as using the GT object that Great Tables has, which starts everything; these do the exact same thing. So, to align decimals: Great Tables has, I think, a number of ways it does that. I'm not exactly sure for these values with parentheses; it might be a little tricky, but I think by default Great Tables tries to align decimals in a number of ways. Let me try producing the table and see.
I think it can differ depending on how you need them aligned. Actually, let me do from great_tables import exibble. I think this will actually show the alignment. So... not here. Not here. You know what, I have to admit, I'm not totally sure. I thought there were some ways to align the decimals, but I can't quite remember now. Let me try fmt_number; I'm just going to look at the help. If you put a question mark at the end, you can see the help. It's not formatted in my favorite way, but there might be an align option. I failed you. I think there's some sort of alignment strategy somewhere in Great Tables, but I can't remember exactly where. Okay, there's auto-align, so that suggests there are ways to do alignment. I know it's possible in Great Tables; I have to admit I don't exactly know how. So I think we need Rich and the Great Tables big brain. But if you know how to align in GT, definitely let us know in the chat; I think it's probably similar in Great Tables. That's a great question, and a big one for table styling.
All right. So I'm going to keep going, and maybe someone knows how to do the alignment. To start, I'm going to create this header. What tab_header does is it lets us add a title and subtitle, so here we're able to give the table a name and a brief description. And then I'm going to use sub_missing to get rid of all these Nones. So I'll show you that now. Notice we got rid of all the null values.
One funny thing is these percentage columns: we moved them in, so placebo percent, for example, we don't really need anymore. So we're going to use cols_hide to hide all of these percentage columns. All right. Nice, we got rid of the _pct columns. And then we're just going to clean up our labels. Notice our column names are still these lowercase names with underscores; we're going to clean them up a little bit using cols_label.
So a key here is that we're now using this n overall placebo piece with a Python f-string. Basically, this lets us insert the value from the dictionary; the n overall placebo value is just the number 86. So in these curly braces we can write a little bit of Python, and it'll insert the result. And we're also using this thing called md, for markdown, to be able to format the result a little. So let me show you that. All right. Now notice our labels are a lot cleaner, and they have these sample size values in them. That's a lot nicer to read.
And then the very last thing is we're going to put a footnote in with the date and a note about the source. So let's do this. All right, that just added a source note at the very bottom, and I think that creates the full table. So we did the title and subtitle; we set the column labels, giving them nice names with cols_label; we had those extra percent columns that we hid with cols_hide; we got rid of None values with sub_missing; and last, we added a little source note at the bottom about when we executed this program.
So that's the gist of using Polars to go from not totally raw, somewhat pre-processed data to a table that's ready for publication. If you have any questions, I'm happy to walk through that table. I do also have a note about a little piece I can show: this table came pretty pre-processed. If you look at tbl, it's already in the format of the final table, so obviously there's a lot of work people would be doing to get this thing ready.
Exploring data wrangling with Polars
And for example, one thing I could show is: I added a dataset, I'm just going to call this age, to show how you can do a little bit of data analysis. This is what the age section of the table might look like as data: the counts of all the people in the different age groups. So this is a wider format, with the conditions as the columns. But if you were analyzing it, you would probably see it in a tidy, long format, where condition is its own column, condition and age group are crossed, and you have n for each of these.
So if that's the case, you can do things like group by, just to show some of the stuff Jeroen mentioned. You can do group_by and agg, and take the sum of n; there's so much help going on in this IDE. Okay, so you can always calculate these totals. The more interesting thing is: how would you get these percentages? Notice that these percentages are the percent within the condition; within placebo, 16% of people are under 65. And the key here is a with_columns, and Polars is very SQL-y. Basically, what you can do is pl.col("n") divided by pl.col("n").sum().over("condition"). This is like using group by with mutate in dplyr, or the .by argument.
With over, we're saying: sum n within each condition (sorry, my little Zoom bar is in the way) and return that sum for each of the rows in that condition.
So basically, actually, let me show you. So I'm going to say, condition N, just so you can see what it looks like.
So notice that, I'm going to sort by condition. It's another neat thing. So, all right, notice that condition N is the same for all of the placebo group. And the same for all of this condition. And, okay, unfortunately, there's the same N, but the same for this one too.
So basically, the key is that you can do pl.col("n") divided by this to come up with the percentage. I'm going to say percent equals, and I'm going to round it. All right, so that gets us these percentages calculated.
So if you were doing something like this from raw data, you might do a little bit of this prep work. So that's, yeah, that's the gist of using polars to create a table. And do a little bit of the prep work.
And hopefully it shows off some of the power of Polars: with_columns, group_by, agg, and sort. So this is most of what I've prepped. Super happy to answer any questions you have, either about this or Polars more generally.
Q&A and Great Tables discussion
Yeah, thanks, Michael. This was a great walkthrough, where you not only demonstrated how you can apply Polars to a more complicated dataset that is grounded in the real world, especially for this audience, but also how you can create beautiful tables. And there's actually a lot more you can do with this package, things that you haven't shown; that's not the focus of this tutorial, of course. But for those of you who are interested in creating great tables: Michael and Rich have some great YouTube videos where they give a very good overview of what Great Tables has to offer.
Now, there was one question, Michael: whether it's possible to include the name of the statistical test used to generate the p-value, either in the column name or in a footnote? Yeah, that's a great question. So the easy one to do is the column name. I think this one here was intended to be added as a footnote, like we could add it here, I think.
Let me just double check that this does what I expect. Oh, no, I did a bad thing. Let me see. What did I do? Why did I make you so mad? Well, you started with a keyword argument and then you added another positional one. Oh, interesting. Another regular argument, yeah.
So I'll just do another .tab_source_note, and another .tab_source_note. Let's see. So I think this one, actually, you could add it here. I don't know what test you run the most, but I think I saw something about a chi-square somewhere, or a Kruskal-Wallis. I don't know where I saw these mentions, but that's one way: you could put it in the footnote. I'm really revealing myself by simply writing the sentence, "this was a chi-square test."
The other option is footnote marks, where you put little glyphs in the table and map them to the source note. We're still adding that to Great Tables, but I think I saw Rich was actively working on it, so I'm hopeful that we'll have it in the next couple of months.
Yeah, it's a good question. I'm seeing another one: we generate many similar types of tables in Excel for some of our clients, given that's their preferred way to consume the table data. How do you export the table to Excel? Do you mean the Great Tables output or the Polars table? Probably the Great Tables one. I think, yeah, it's a good question. I don't think there's an easy way to do it. One thing that might work okay is, let me open up a new, I'm going to open up a sheet.
Just for completeness, you can write a Polars data frame to an Excel sheet. Great, this is incredible. I just copied the HTML output. So I think the key is that, and I learned something today, when I copied it, Excel turned out to be pretty good at this kind of thing. So I don't know, it might not get you 100% of the way, but I'm pretty surprised at how far it got. Yeah, freaky. So yeah, try it out. Including footnotes?
Say what? Footnotes? Did I? Maybe I just didn't, I don't know. It just cut that part out. Can't have it all. It could be that I just didn't highlight powerfully enough, you know what I'm saying? But is there a way to just copy the whole thing? It could be that if you copy the whole thing, maybe that'll do it. Try it out. It's tricky. It's tricky.
That's an interesting one, though. I'm surprised. I mean, I understand that. Delighted. Yeah. Maybe I'm playing the devil's advocate a little bit here, but one question that you could also ask is: is it really necessary to export this to Excel? Maybe the client would be happy with some other format as well, but Excel is just what they're used to, and they're not aware that there are other possibilities. Maybe an HTML page, so that this becomes part of a document, would suffice or even be better, or a PDF. That's also something you can ask.
Yeah. I do know Excel is kind of nice, though, because you can do a lot of tweaking. You can widen columns and merge cells and do a lot of manual styling. But it's hard to reproduce, that's the one tricky thing. Yeah. There are different approaches. Yeah. Yep. Nice. Any other questions about Polars?
Polars in the wild and ecosystem
I'd be super curious what kinds of things people are looking to use Polars for, or what you're using pandas for, even. I'm definitely a bit less familiar with Python in pharma, so it's an interesting space. One area where it comes up quite a bit is on the larger-data side, so a little bit more on the research side than some of the late-stage clinical trials. But there, people often use other backends like Snowflake and Databricks to interact with data, and a lot of Parquet. And so it's just interesting to see where Polars fits and how people are using it.
It does. Sorry, that reminded me of Narwhals. Narwhals is nice if you work on a team that uses a lot of pandas: you can use Narwhals to code in a Polars style but have it run on pandas. And some of what you said, Phil, reminded me that Narwhals has some support for, let me just go into one of these. I think they're working on support for SQL with Ibis, another Python library, and also on integrating with DuckDB. So I think generating SQL is maybe the big one here. You can use Narwhals with a tool called Ibis to code in a Polars style but generate SQL code, which you could use to hit Spark or Snowflake or any of those types of warehouses. Yeah. And they also have one called SQLFrame. So it seems like they've been busy wiring up everything.
Yeah. Narwhals is an interesting one. Any other interesting Polars stuff? I think one fun thing with PyShiny is that ShinyLive can run Polars in the browser. So you can do import polars as pl. Let's see, I'm just going to do a render data frame that returns a pl.DataFrame. This may be over the top just to show that they can run it, but I think this works. Yeah. Yep. So they've got Polars running in the browser, which is pretty cool. This slider obviously doesn't do anything anymore, but yeah. So a lot of neat stuff cooking.
Yeah. Tim mentioned Ibis is cool on its own, so that's another one to check out. It's a DataFrame library with a similar API to Polars, but it's made to run on different SQL backends. Confusingly, Ibis can run on Polars, and Narwhals can run Polars code on Ibis. Everything's wired to everything else right now, but it's a fun one. Yeah. So you have ShinyLive. Any other cool Polars stuff?
I think those are the big ones, probably. Let's see, what are the newest Polars releases? Do you know what Polars is up to? No, I have no clue. I'm going to pull it up. This team, the Polars team, is cranking out features at a very high speed.
Wow. Yeah. So it feels like a year or two ago they bumped to version one, but... Yep. There's a lot going on, but I feel it is very ready to use in production. In fact, before I joined Posit, I was working for a client where we converted a very large code base from pandas to Polars, and that's been a big success, and it's still running. So I feel very... Michael is looking up the video.
Yeah. I'll go to the middle and freak you out, and then I'll... What are you doing, Michael? What is this? I was just trying to get to a point in time where you are, but here we go. Who's this guy? You know? It's also cool. You didn't give out three free copies of Python Polars to the PyData crowd, so I feel like this workshop is lucky, you know?
Of course. Of course. Yeah. If you want to enter the raffle, that's still possible. The link is in the slides that I shared earlier in the chat. Maybe I can just share the signup URL.
Yeah. Once more, right? You don't have to, only if you're interested. Yeah. Nice. Yeah. Elon pointed out this Polars DS extension. The way extensions work is they often... let's see, does this one... oh, this one you just import, and it provides new functions, basically, that you can run on stuff. I think they used to use Great Tables, but I don't know if they still do. Honestly, I haven't looked at this package in a long time, but it seems like it has a lot of nice stuff. So yeah, it seems like a cool one to check out.
Yeah, I think... Oh, hey, Curtis. Nice to see you, Curtis. No, it's a different Curtis. No. Oh, you know this Curtis. My bad. Yes, I do. We've known each other for quite a while. Okay, I'll share the link to the slides again. Yeah. Nice. I can't believe I just presumed that you might not know a Curtis who showed up, you know? There is another Curtis that we both know. That is true.
Yeah. But there are a lot of Curtises out there. Yeah. Nice. The other thing: I think Polars and geospatial is coming along, but I don't know exactly where it's at yet. When I checked, probably six months ago, it was coming along pretty well.
Wrapping up
Yeah. So I think that's the world of Polars. Yeah, that's about it. So I hope that we have been able to give you a good overview of what Polars has to offer. We definitely didn't discuss everything that Polars has to offer. For example, we didn't go into the whole lazy versus eager mode in Polars, which is a big deal, but it's not something I would start with. That's why we didn't include it. But just to say that there is more to explore if you're interested. So give it a shot. There are various ways in which you can reach out to Michael or myself; I don't know what the best way is. We're both on LinkedIn, of course. And if you go to the polarsguide.com website, you'll be able to find me easily. We're always happy to answer any questions that you have. So thank you so much for joining us for nearly two hours. We appreciate it, and we wish you all the best. Enjoy the rest of R/Pharma.
Yeah, thanks so much for coming. So excited to see so much interest in Polars in the pharma community, and excited to hear what people start using it for. So thanks for coming. Phil, do you have any last words? No last words. I'm just going to put the form in the chat one more time in case you need it to get the badge. For attending today, you'll get a completion badge; you can fill out the form to get that. We've got an amazing Python workshop on Friday with Yilong and the team to talk about clinical reporting with Python, so stay tuned for that if you're interested in helping to lead the charge for Python in that community. A big thanks to the team today for taking us through Polars; excited to see the advancements that come out of this space. We've got the slides in the chat box as well. With that, if you're interested, we've also got Pointblank coming up next in seven minutes. I'm going to jump over there with Rich, who also does a lot of Python work. I also think the package works for both R and Python, so you can technically think of it as a continuation of our Python session for today. So with that, thank you to the two of you so much for helping. If you have any questions, feel free to reach out to the speakers. We'll be posting the video on our YouTube channel in a couple of weeks.
And I think we're good to go. So thanks a lot. One last thing: Pointblank is built with Great Tables. So if you're there, just feast your eyes on that, because Rich, I don't know how, but all of that is Great Tables in ways that I'll never comprehend. Awesome. I just put in the chat that the recordings will go on our YouTube channel in, I want to say a couple of months, but probably a little sooner than that; it takes a little bit of time to get everything up and out. So all right. That's great. Thank you so much. I'll go ahead and end the presentation for today. All right. See y'all. Yep. See you later. Bye-bye.