Grammar of Graphics in Python with Plotnine - posit::conf(2023)

Presented by Hassan Kibirige

{plotnine} brings the elegance of {ggplot2} to the Python programming language. Learn about The Grammar of Graphics and get a feel of why it is an effective way to create Statistical Graphics.

ggplot2 is one of the most loved visualisation libraries. It implements a Grammar of Graphics system, which requires one to think about data in terms of columns of variables and how to transform them into geometric objects. It is elegant and powerful. This is a talk about plotnine, which brings the elegance of ggplot2 to the Python programming language. It is an invitation to learn about the Grammar of Graphics system and to appreciate it. It will include some tips on how to avoid common frustrations as you learn the system.

Materials:
- Website: https://plotnine.org
- Source Code: https://github.com/has2k1/plotnine
- Slides for this talk: https://github.com/has2k1/my-talks

Presented at Posit Conference, between Sept 19-20 2023. Learn more at posit.co/conference.

Talk Track: Data science with Python. Session Code: TALK-1137


Transcript

This transcript was generated automatically and may contain errors.

We welcome you to join us in the discussion, and let's get started.

My name is Hassan. I am the author and maintainer of plotnine, a grammar of graphics visualization system in Python.

Now, in 2011, we had an election in Uganda. I was not satisfied with the breadth and depth of the analysis that showed up in the media, and I wanted to know more. So I learned Python and dove into the data. When it got to the visualizations, I felt the experience could be better. I looked around and discovered ggplot2, an implementation of the grammar of graphics in the R programming language.

Now, this was the most elegant idea I had come to know of, and I wanted to use it in Python. I did not think I could implement it, but as it was a great idea, and great ideas spread, I expected someone more capable than me would implement it in Python. So I waited.

After two years, something came up in Python, but I was dispirited when I used it. It was a surface-level imitation of a grammar of graphics. It was like slapping the body of a super car onto a regular car. Nonetheless, I contributed to that effort to make it better, and in doing so, I gained the confidence that I could implement the grammar of graphics in Python. And I also took ownership of the project. That is how plotnine came to be.

What is the grammar of graphics?

Now, I have stressed the phrase a bunch of times. What is one word that comes to your mind when you hear grammar of graphics? Take five seconds.

If you have struggled to nail down what it evokes, that is okay, even if you are familiar with the concept. But how about a definition? What is the grammar of graphics? Well, it is just a language and rules to make statistical graphics. That probably doesn't add much understanding, because it is possible to define an idea yet not understand it, and vice versa, to understand an idea yet be unable to define it.

But to me, that definition is adjacent to the idea that the words grammar of graphics evoke in my mind, likely because I am quite familiar with the concept. When I hear language and rules, my mind asks, how do I put things together? How do I compose stuff in that system? And grammar of graphics brings to mind composing visualizations.

So we are going to explore the power and elegance of a system that is built to be used through composition of its components. The components are the data, the aesthetics, the scales, the statistics, the positions, the geometric objects, the facets, the coordinate system, and the theming.

But we have already acquired a suspicion for premature definitions. So we shall just go along and experience the components directly.

Exploring the Palmer Penguins data set

But first, let's introduce our data set. We are going to look at a subset of the Palmer Penguins data set. This is a data set of size measurements of three penguin species observed on three islands in Antarctica. Ours is stripped down to three variables: the species, the bill length, and the sex. We are going to take a narrow look at the bill length.

Often, we want a graphical summary of a continuous variable. A box plot is an example of such a summary. So let us start with that. First, we import all the components for easy exploration, then declare a grammar of graphics plot. Using the penguins data set, in the aesthetics we map the y aesthetic to the bill length, and we add a box plot geometry. We get one box plot for all the penguins.

We can add a box plot per species. To do that, we map the x aesthetic to the species. But before we continue, let us trim the x and y keywords. We can do that because they are the first two parameters. If you are coming from ggplot2 and R, the most obvious difference you will experience with plotnine is that the aesthetics are strings wrapped in quotation marks. So let's continue. That gives us three box plots, one per species.

Let's give each species a different color. So we map the species to the color aesthetic and we get a different color for each species. Now, coloring by the species does not add any information. We shall change that later.

For more information, we can add a mean. Comparing means and medians gives us a simple way to detect skew. To do that, we compute a summary statistic using the mean function, visualize it as a column geometry, and fill it with color. And we get boxes, two types of boxes, the column box and the box plots.

But we still have a lot more information trapped within the individual points, all the information about the variability of the bill length is still trapped within the points. Let us visualize them. We add a point geometry for that.

The points are overplotted. A common solution to that is to change the opacity to something less than one. We do that and change it to half opacity, 0.5. But that doesn't change much. Maybe add some jitter to it. That is another common solution to overplotting. So we set the position to jitter and we get something like a strip plot overlaid on top of the box plots.

But the distribution is still largely hidden. How about we swap the box plot for a violin? We replace the box plot geometry with the violin geometry and we get violins. With that plot, we have lost the precision of the five-number summary that the box plot gives us, but we have gained a better sense of the distribution. Depending on your context, this may be the right decision.

But something isn't right with the points. They are wayward. Imagine if we could squish them into the violins. If there are enough of them, they could by themselves convey a sense of the distribution. Looking at that, we do not really need the violin. In fact, we have danced along the motivation for a fairly recent type of plot called a sina plot. It came around in 2017 and there is a geometry for that: geom_sina.

And that gives us information about the density distribution, the number of points, their variance and outliers amongst them.

Refining the plot

This plot still isn't perfect. Let us refine it. We have a rectangular column because it was easier to use at the time, but it is displaying a point statistic. So a line segment is a better choice. So we introduce a line segment. We use the same statistic and function as the column. We set the color to black so that we have black lines for each of the species. Stretch it out to match the width of the column, which is 90% of the available horizontal space. And the y value is constant. It is the mean. So the segment aligns with the top of the column. That is some cross-checking.

So with that, we are done with the need for the bulky column, and we shall remove it. We have our sina plot with black lines for the means. But we also note that the x position and the color convey the same information, the species. So the color is redundant. We can change that. We shall let the color represent the sex of the penguins. We change that, and that separates the points for each of the species by sex.

And that also introduces another problem. Clearly, for each species, the different sexes have different means. Let us add segments for those means. Now, this is a mechanical process. We start off with the segment we already have. We copy it. We paste it. We remove the fixed color to let it inherit the global color for the sex. And the result is, you can barely tell the difference between the two segments. But it gives us means for each of the sexes for every species.

But there's something wrong with that. The colored segments overlap their territories. They need the same type of positioning as the sina geometry and the box plot before it. The default position for the sina geometry is to dodge, that is, not to overlap along the x-axis in the same layer. We tell the dodging system that the lines occupy 90% of the available width, and it does the right thing. And that gives us our nice segments.

Now, using the text geometry, we can add counts to each of the groups, by sex for each of the species. We set the label to the number of points in the group, format it as n equals, and, still with the summary statistic, we find the lowest point and set the y aesthetic to a value below the minimum calculated. We are doing some math where we subtract a value of one, and that should give us text that does not overlap the points. What we get is counts for each group, placed at the same precise distance below the lowest point in each group.

And let's make the colors more vibrant. Now, those colors are generated by a default color scale. We can vary the parameters slightly to give them a pop, and we get something like that.

But our problems have not ended. The colored line segments and the points are the same color. Let us differentiate them, though the difference has to be small because they represent the same group. We want the colors to be close enough in space, so we have to do color math just like we have done position math to have the text for the counts come in below the lowest point. And that color math is going to be to darken the colors. We use this utility function which takes a list of colors and some factor, and it darkens them by that factor.

For example, on the right, you can see the top. We transform the list of colors at the top by various factors, and the results are what you see. At the bottom, a factor of one turns any color black. Using this function for the desired line segments, we darken them by a factor of 0.25, and we have our final plot.
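The utility function itself is not shown in the transcript. A hypothetical version, consistent with the description that a factor of one turns any color black, could scale each RGB channel by `1 - factor`. The name `darken` and its signature are assumptions, and this sketch only handles `#rrggbb` hex strings.

```python
def darken(colors, factor):
    """Darken a list of '#rrggbb' colors by `factor` (0 = unchanged, 1 = black).

    Hypothetical helper written for illustration; not the function from the talk.
    """
    if not 0 <= factor <= 1:
        raise ValueError("factor must be between 0 and 1")
    darkened = []
    for color in colors:
        digits = color.lstrip("#")
        # Split the hex string into its red, green and blue channels.
        rgb = [int(digits[i:i + 2], 16) for i in (0, 2, 4)]
        # Scale every channel towards zero (black).
        scaled = [round(channel * (1 - factor)) for channel in rgb]
        darkened.append("#{:02x}{:02x}{:02x}".format(*scaled))
    return darkened


slightly_darker = darken(["#ff0000", "#00ff00"], 0.25)  # ['#bf0000', '#00bf00']
```

The darkened list could then be supplied wherever the plot accepts explicit colors, for example a manual color scale for the segment layer.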

And there we are. We have easily stacked point geometries and adjusted their positions and colors. In the process, we created a visualization that we can at best call a sina plot, because that is the most complex of its component geometries. Otherwise, it has no name. It was not prepackaged. The grammar of graphics made it easy to compose. Specifically, the layered grammar of graphics made it easy to compose.

A real-world example: guitar scale trainer

Our first task has been to get to grips with the concept, as you have seen it from my perspective. The next question is a little different. If you know something about the grammar of graphics, that question could go like this: can the grammar of graphics help me with my problem?

The next example is inspired by a GitHub user, Jonathan Vitale, who asked for help creating graphics for a guitar scale trainer using plotnine. Again, we start with the data to be visualized. It will look like that: for each string, a fret to hold down while plucking the string. But this is the user-friendly format. We have a function that turns it into a format that is friendly for plotting.

Since we are going to be looking at a guitar fretboard, let us define our plotting function. We give it data, and it plots something like that. That is our starting point. We can edit the theme to stretch it out, give it a funky color, and remove the grid marks, and it starts to take shape. We add some markings within the canvas and the standard dots, then the frets, then the strings. We add diameters to the strings, giving them an accurate gauge. If you are familiar with guitars, you may know what that is.

For convenience, we bundle all this into a list. We get a template. Everything up to that point has been familiar. We did that in the first part, which is composing. But what makes this interesting is to get an accurate representation of a guitar. We have to create a custom scale. That means we have to use some math, the same math that guitar makers use to create guitars.

The key requirements for a custom scale are the domain of the data (our guitar scale will have 22 frets), a transformation function, an inverse function, and a way to calculate the breaks. Using the mizani package, which is a key dependency of plotnine, we create a trans object, which is like filling in a template.
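The trans object from the slides is not in the transcript. As a sketch of the ingredients only, the standard equal-temperament fret formula places fret n at a distance L(1 - 2^(-n/12)) from the nut, where L is the scale length. Assuming that is the math the talk refers to, the transform, its inverse, the domain, and the breaks could be written as plain functions. All names here are hypothetical, and wiring them into a mizani trans object is left out.

```python
import math

SCALE_LENGTH = 1.0  # normalised nut-to-bridge length (assumption for this sketch)
N_FRETS = 22        # domain of the data: frets 0 (the nut) to 22
DOMAIN = (0, N_FRETS)


def fret_to_distance(fret):
    """Transform: distance of a fret from the nut under equal temperament."""
    return SCALE_LENGTH * (1 - 2 ** (-fret / 12))


def distance_to_fret(distance):
    """Inverse: fret number (possibly fractional) for a distance from the nut."""
    return -12 * math.log2(1 - distance / SCALE_LENGTH)


def fret_breaks():
    """Breaks: one per fret, including the nut at 0."""
    return list(range(N_FRETS + 1))


# The 12th fret sits exactly halfway along the string.
halfway = fret_to_distance(12)  # 0.5
```

Because the spacing between consecutive frets shrinks as the fret number grows, breaks placed at whole fret numbers appear unevenly distributed once transformed.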

Sorry, that trans object did not appear, but we use it to create a scale, and we get the frets unevenly distributed to show the non-linearity of a guitar scale. And with that, we get to visualize the final chord that we are going to be looking at, which is the C major chord. And those are the geoms that visualize the chord.

Featured software