
Kara Woo | Always look on the bright side of plots | RStudio
Everyone who creates visualizations in R is bound to make mistakes that prevent their plots from looking as they should. Sometimes, these mistakes create beautiful "accidental aRt", though other times they're just plain frustrating. Either way, however, there's something to be learned. This talk will draw on years of watching both the ggplot2 issue tracker and the @accidental__aRt Twitter account to highlight some common plot foibles and explain what they can teach us about how ggplot2 works. About Kara: Kara Woo is a research scientist in data curation at Sage Bionetworks, where she builds tools to help researchers document and share their data. Kara is a core developer of ggplot2 and collects data visualizations gone beautifully wrong on a blog called accidental aRt.
Transcript
This transcript was generated automatically and may contain errors.
Welcome to RStudio Global. Thank you so much for tuning in. I'm going to talk today about the things that we can learn about ggplot2 from our data visualization mishaps. So if you have spent much time visualizing data, you have probably ended up with some plots that did not go to plan. Maybe you had lots of overlapping blue hexagons, or text that took up the entire area of the plot, or giant hairballs. My name is Kara Woo, and I'm one of the maintainers of the Accidental Art Twitter account, and I'm a contributor to ggplot2 also. So I've seen a lot of plots that have gone totally wrong.
The Accidental Art account exists to showcase these plots, to showcase the plots gone beautifully wrong, the things that didn't work out but looked cool nonetheless. And I've also seen plots that just straight-up flopped in sometimes really subtle or hard-to-diagnose ways. And so there's this whole spectrum of messed-up plots that I've seen. And I know that these can be really frustrating, but they can also be an opportunity for us to create better plots and to deepen our understanding of the tools that we use to create those plots.
So if you're like me, then when you create a plot like the ones that I've shown, you likely go to a place like Stack Overflow to find someone with a similar problem, find a promising solution, and bring that over into your code to hopefully fix the problem that you've had. And that's great, and that's totally legitimate. I'm going to argue though that when you get that working solution, it's worth taking the time to think about how that solution fits into the conceptual model of ggplot2. My feeling is that by examining these messed-up plots that our mistakes generate and how we fix them, we can deepen our understanding of ggplot2's philosophy to create better visualizations faster and more reliably.
ggplot2's philosophy and the grammar of graphics
So what do I mean by ggplot2's philosophy? ggplot2 is a data visualization package that embodies a whole theory of visualization called the layered grammar of graphics, which is based on Leland Wilkinson's grammar of graphics, but it has some specific things that are, you know, tailored to its implementation in R. Lots of visualization software gives you a bunch of different charts to choose from. So you have your pie charts and your scatter plots and your line charts and your bar charts, and you take the end result that you want to achieve and sort of work backwards to fit your data into it. ggplot2 is not like this. ggplot2 is a more data-first approach to building visualizations, starting with, you know, principles of essentially a grammar for how to convey data in visual form.
So when we speak or write, we don't normally start with a sentence structure that we want to use and then fit our message into that structure. Instead, we start with some idea and we use our knowledge of grammar and syntax and vocabulary to craft the message we want to express that idea. And that is very analogous to how ggplot2 works. ggplot2 lets us express our data in visual forms that might be totally novel and not fit into common, you know, bar charts and line charts and common types of visualizations. This gives us a really flexible tool and a really powerful tool that we can use to create the intentional art like you see on the slide, which was built with Danielle Navarro's flametree package, which itself uses ggplot2. So having this flexibility gives us a lot of power and control. And it also opens up room for, you know, mistakes and for things to not always work out the way we had in mind the first time around.
Mapping mishaps
There are certain foibles and mistakes that come up a lot with ggplot2 that demonstrate pretty common misunderstandings of how ggplot2, the package, works. And that's what I'm going to talk about today. The first one that we're going to talk about are mapping mishaps. These don't have to do with geographical maps. What we're talking about here is mapping visual properties in the plot to variables in our data. In the parlance of ggplot2, these mappings are called aesthetics, and they're really at the heart of ggplot2's grammar of graphics.
So let's take this plot. This is a plot of flipper and bill length for different species of penguins in Antarctica. And in this plot, we have several different aesthetics. We have position along the x-axis, which corresponds to flipper length. We have position along the y-axis, which corresponds to bill length. And then we have both color and shape, which correspond to the penguin species. So one thing that we might want to do with a plot like this is label these different clusters. We have sort of well-isolated clusters of these data points, and maybe instead of the legend at the bottom, we want to label those clusters with which species they correspond to.
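A plot like the one described here could be sketched roughly as follows. This is not the code from the slides: it assumes the data come from the palmerpenguins package and uses that package's column names (flipper_length_mm, bill_length_mm, species).

```r
library(ggplot2)
library(palmerpenguins)  # assumption: stand-in for the talk's penguin data

# Four aesthetics: x, y, color and shape. Color and shape both map to species.
p <- ggplot(
  penguins,
  aes(
    x = flipper_length_mm,
    y = bill_length_mm,
    color = species,
    shape = species
  )
) +
  geom_point(na.rm = TRUE) +          # silently drop rows with missing measurements
  theme(legend.position = "bottom")   # legend at the bottom, as described in the talk

p
```

Every argument inside aes() declares a mapping from a column in the data to a visual property, which is why the same species column can drive both color and shape.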
And what someone will often do to achieve this is add a layer to the plot where they map the text that they want to appear to the x and y positions where they want to place it. So here we have x equals 195 and y equals 55. We're placing the label chinstrap. And you can see that that does work in that it places text on the plot. If we look closely, though, it doesn't actually look that good. So it's pretty subtle in this example, but the quality of the text is a little bit pixelated. It doesn't look as good as the points and lines that are around it. And if we were running this code live, it would actually be a little bit slow to plot.
Now, like I said, it's a pretty subtle issue, but we actually get a lot of ggplot2 issues opened because of this specific thing, where people find with their own plots that the plot is really slow and the text looks really bad. And they come to us to say, is this a bug in ggplot2? Why does it look so bad? We can illustrate what's going on here another way by using the ggrepel package, which will ensure that text on the plot does not overlap. And when we do that, we can see that we actually haven't just put a single label on the plot, we've put this whole giant starburst of the same text repeated hundreds of times.
What's going on here is that we've actually written our label the same number of times as there are rows in our data set. In this case, 344 times. By putting the text label that we want inside this aes() function, where aes stands for aesthetics, we've declared a mapping between our data and the label, causing that label to repeat as many times as there are data points. So that's not great. And we can avoid this by using a different function called annotate, which is designed to put labels on plots just once. So it's designed to do exactly what we want to do: just stick that label on there once, not map it to the underlying data. So we tell annotate that we want to place a text annotation and, again, give it the position and the label. And it places that text on the plot for us. And when we zoom in, it looks much better, and it would be much faster to plot as well.
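The difference between the two approaches can be checked by counting how many labels each layer actually draws. The sketch below is an illustration rather than the slide code; it assumes the palmerpenguins data and reuses the position and label mentioned in the talk.

```r
library(ggplot2)
library(palmerpenguins)  # assumption: stand-in for the talk's penguin data

base <- ggplot(penguins, aes(flipper_length_mm, bill_length_mm)) +
  geom_point(na.rm = TRUE)

# Mishap: constants inside aes() are treated as a mapping, so the single
# value is recycled once for every row of the data.
mapped <- base + geom_text(aes(x = 195, y = 55, label = "Chinstrap"))

# Fix: annotate() is not tied to the plot data and draws the label once.
annotated <- base + annotate("text", x = 195, y = 55, label = "Chinstrap")

# layer_data() returns the data a layer will draw; layer 2 is the text layer.
nrow(layer_data(mapped, 2))     # one label per row of penguins
nrow(layer_data(annotated, 2))  # a single label
```

Recent versions of ggplot2 may also warn when every aesthetic in a layer has length 1 but the data has many rows, and suggest annotate() instead.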
Scale snafus
So that was mapping mishaps, where we misapplied these aesthetic mappings between visual elements of the plot and our data. The next common mix-up is what I'll call scale snafus. In ggplot2, scales convert values from the data space to the aesthetic space. So with the aesthetics that we just talked about, we declared a mapping between visual elements of the plot and values in the data, you know, shape to species, for instance. The scale controls how those data values get translated into what we see on the plot.
We're going to talk about coordinate systems too, which are what draw the axes and panels in the plot and generally kind of set up the 2d environment of the plot. So most plots, including the one that we're using for our example here, use a simple Cartesian coordinate system. But other plots can use, for example, like a geographical coordinate system or polar coordinate system in the case of pie charts.
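As a small illustration of a coordinate system being a separate, swappable component, the sketch below draws the same stacked bar twice, once in the default Cartesian system and once in polar coordinates, which turns it into a pie chart. It is not from the talk's slides and assumes the palmerpenguins data.

```r
library(ggplot2)
library(palmerpenguins)  # assumption: stand-in for the talk's penguin data

# A single stacked bar of species counts in the default Cartesian system.
bar <- ggplot(penguins, aes(x = "", fill = species)) +
  geom_bar(width = 1)

# Same data, same geom, same scales: only the coordinate system changes.
pie <- bar + coord_polar(theta = "y")

pie
```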
So let's return to our flipper and bill length visualization. And this time, in addition to the data points, I've added the smoothed conditional means for each species. Now let's say that we want to zoom in on this plot to highlight one species and sort of focus our attention in on just one of these areas of the plot. There are two ways that we can accomplish this. One is by setting limits on the scale. Here we're setting the x minimum to 200 and the x maximum we just let be whatever it normally was. And then the other way that we can zoom in on this plot is by setting limits on the coordinate system. Again, setting the x minimum to 200 and letting the x max be whatever it is.
When we limit the scale, this is the plot that we end up with. But when we limit the coordinate system, it actually looks quite different. The data looks the same, the individual points are the same, but the statistical summary looks drastically different. And that is because when we set the scale limits, any data that's outside of those limits will by default be converted to NAs. And this scale transformation happens before any statistical summaries. So before we calculate what those smooth lines and error bands look like, that data has already been converted to NAs. The coordinate system, on the other hand, when you set limits there, it really just zooms in the plot. So all the other data that was originally present is still getting used, even though you can't see it on the plot.
So in the first plot where we set the limits on the scale, statistical summaries look different because they're based on just the few data points that are visible for the species in red and purple. In the second plot though, you can imagine that all the rest of the data is still there and it's still getting used and those smooth lines are extending out towards where the rest of the data is. So this is clearly a case where the different ways of setting these limits have major implications for the statistical summaries that we're actually displaying on the plot. And it's important to know this. So ggplot2 does alert us to this fact and if we return to the code that generated these plots, you can actually see that the former version will warn you about what's going on. It says removed 192 rows containing non-finite values. So we've gotten this warning message, but it is at times easy to gloss over these messages or ignore them and so it's useful to know how scales and coordinates handle data that's out of bounds.
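The two ways of zooming can be compared side by side. The sketch below is an approximation of the plots described, not the slide code: it assumes the palmerpenguins data and spells out the loess smoother explicitly, which is what geom_smooth() would pick by default for a data set of this size.

```r
library(ggplot2)
library(palmerpenguins)  # assumption: stand-in for the talk's penguin data

p <- ggplot(penguins, aes(flipper_length_mm, bill_length_mm, color = species)) +
  geom_point() +
  geom_smooth(method = "loess", formula = y ~ x)

# Scale limits: flipper lengths below 200 become NA before the smooth is fitted,
# so the fit only sees the remaining points (expect a "removed rows" warning).
p_scale <- p + scale_x_continuous(limits = c(200, NA))

# Coordinate limits: the smooth is fitted to all the data, then the view is cropped.
p_coord <- p + coord_cartesian(xlim = c(200, NA))

# Layer 2 is the smooth. Its x range shows which data went into the fit.
range(layer_data(p_scale, 2)$x)  # starts at or above 200
range(layer_data(p_coord, 2)$x)  # extends below 200, outside the visible area
```

If the goal is only to change what is visible, coord_cartesian() is usually the safer choice; scale limits are appropriate when the out-of-range data really should be excluded from the summaries.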
Theme threats
The last type of ggplot2 mishap that I'm going to talk about are issues with themes, theme threats. Now ggplot2's theme system is not strictly part of the grammar of graphics because it really controls all of the non-data aspects of the plot, whereas the grammar mostly deals with how we translate our data into visual form. But the theme system is really important because it is how you customize a plot to look exactly how you want it, all the way up from the text font and the background color down to small details like the length of the ticks on the axes, for example. There are dozens and dozens of things that you can customize about plots in ggplot2, but it would be really cumbersome to have to specify every single one of those every single time. So ggplot2 comes with a set of default themes that are built in that you can use and a system for theme elements to inherit their values and how they look from other aspects of the theme.
So let's look at an example. In our theme hierarchy of different elements that you can customize, text is one of the ones that's way up at the top. In theory, this applies to all the text elements of the plot. So if we look again at our penguin plot, this plot has a number of different text elements. It has title, it has axis titles and axis text, it has legend which has some text on it as well. So all of these are within the umbrella of text. If you wanted to, for example, set the font of all the text in this plot, you would want to do it up at that text level rather than having to say I want the title to be this font and the axis labels to be this font and so on.
At the next level of the hierarchy, in this example, is axis text. So we went from text down to just the text that labels the axes. That's everything that is outlined in blue on this slide, all the axis text elements. And in this plot, we have axes on the left, right, and bottom. At the next level of the theme hierarchy, there's axis text X and axis text Y. So this allows us to, for example, customize the Y axis while leaving the X axis alone and vice versa. And there's actually one more level: axis text X bottom, axis text X top, axis text Y left and axis text Y right. Try saying that 10 times fast. Here we can customize axes on one side of the plot but not the other.
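The hierarchy just described can be inspected from R. As a minimal sketch, the loop below walks upward from the most specific element to the top of the hierarchy using ggplot2's get_element_tree(); in code, the spoken names "axis text Y right" and so on are written with dots, as in axis.text.y.right. The variable names are illustrative only.

```r
library(ggplot2)

# get_element_tree() returns the definition of every theme element,
# including the name of the parent it inherits from.
tree <- get_element_tree()

# Walk upward from the most specific element until there is no parent left.
element <- "axis.text.y.right"
chain <- element
while (!is.null(tree[[element]]$inherit)) {
  element <- tree[[element]]$inherit
  chain <- c(chain, element)
}

chain  # axis.text.y.right, then axis.text.y, then axis.text, then text
```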
So let's return to our plot. So again, we're looking at these penguin measurements. And this time I've added a second axis on the right-hand side of the plot that just duplicates the axis on the left-hand side. So they're exactly the same. And now let's say that we wanted to customize the labels on this axis, on our two Y axes, so that they're maybe rotated 90 degrees and right justified. So we might write something like this to customize both of these axes, where we're setting axis text Y to angle equals 90, horizontal justification equals 1. And since we want to apply this to both of our Y axes, this seems like the logical way to do it. But when we look at the plot, something is not quite right. The rotation has taken effect, but the justification has not. So only on the left-hand side are our axis labels justified. On the right-hand side, they don't match.
So why would one of these changes take effect, but not the other one? And the reason for this has to do with the base theme that I'm using. So I built the theme for these plots on the default ggplot theme, theme gray. And if we printed out the whole theme, it would be like super long, so we're not going to look at all of it. But I've just put on the slide the parts of it that have to do with axis text. You can see that theme gray sets the margins and the justification of axis text Y. And then it actually adjusts those details for axis text Y right. And in particular, theme gray is setting the justification of text in axis text Y right to 0, corresponding to a left-justified text.
This brings us to the crucial piece of the theme system, which is that when you're customizing theme elements, the most specific element wins. If the theme that you're using sets a specific element, like the justification of axis text Y right, then if you want to change it, you need to edit that very specific element and not any of the ones above it. So in our case, theme gray sets the justification, but not the angle of axis text Y right. So when we change axis text Y, the justification that we provided gets overridden by the theme that we're using. This is something that has thrown off so many people. It's thrown off me, it's thrown off other contributors to ggplot, it comes up all of the time, because it is a subtle and slightly confusing aspect of how this theme inheritance works. But when you think about the fact that it's working by specificity, and not by the order in which you provide, you know, customizations to the theme, it does make sense.
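The "most specific element wins" rule can be observed without drawing anything, by asking ggplot2 what a given element resolves to after inheritance. The following is a sketch of that check using calc_element(); it mirrors the customization described in the talk but is not the slide code.

```r
library(ggplot2)

# theme_gray() sets hjust on the most specific element and leaves angle unset.
theme_gray()$axis.text.y.right$hjust  # 0, i.e. left-justified

# Customize only the parent element, as in the first attempt.
th <- theme_gray() + theme(axis.text.y = element_text(angle = 90, hjust = 1))

# Resolve what the right-hand axis text actually ends up with.
right <- calc_element("axis.text.y.right", th)
right$angle  # 90: inherited from axis.text.y because the child leaves it unset
right$hjust  # 0: the child's own value wins over the parent's hjust = 1
```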
So to get a working plot, we would need to edit axis text Y right like this. So instead of setting our customizations to the axis text Y element, where they'll be overridden by the theme that we're using, we customize axis text Y left and axis text Y right. And that makes sure that our customizations get applied the way we want them to. And now everything looks good on the plot. These two axes are looking exactly the same on the left and the right.
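A version of the fix could look like the sketch below. It assumes the palmerpenguins data and uses dup_axis() to duplicate the y-axis on the right; the object names are illustrative, not taken from the slides.

```r
library(ggplot2)
library(palmerpenguins)  # assumption: stand-in for the talk's penguin data

p <- ggplot(penguins, aes(flipper_length_mm, bill_length_mm, color = species)) +
  geom_point(na.rm = TRUE) +
  scale_y_continuous(sec.axis = dup_axis())  # duplicate the y-axis on the right

# Set the most specific elements directly so the base theme cannot override them.
y_text <- element_text(angle = 90, hjust = 1)
y_axis_fix <- theme(axis.text.y.left = y_text, axis.text.y.right = y_text)

fixed <- p + y_axis_fix
fixed
```

Setting both the left and right elements keeps the two axes consistent even if a different base theme pins other values on one of them.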
Recap and closing thoughts
So let's recap the ggplot2 mishaps we've talked about. We have mapping mishaps where we accidentally map the data to a single visual element over and over and over again. We have scale snafus, which is where we have two different ways of setting the boundaries of a plot where one of them discards data, one of them does not. And this can affect the statistical summaries that are shown on the plot. And we have theme threats where we try to customize the plot at a higher level of the theme hierarchy that then gets overridden by a more specific theme customization.
We haven't gotten into all of the components of the grammar. We haven't talked about stats, geoms, or position adjustments in that much detail. But if you really want to get deep into the understanding of how ggplot2 works, and all of the parts of the grammar that we haven't covered, you should definitely check out the ggplot2 book. There's a chapter in there specifically about mastering the grammar, and it will really help you build the solid foundation for understanding all of these different components of ggplot2.
So I want to leave you with one last piece of accidental art. It's one of my favorite patterns that comes up over and over again on the accidental art Twitter page, this sort of weird geometric pattern that you often see in maps. So here we're taking data on states, trying to plot it on a map, but it's gotten totally messed up. So if you don't know what's causing this error, see if you can find a fix for it, and think about whether that fix tells you anything about how ggplot2 works. For the purposes of this slide, I've not included the full code on the display, but if you want to see the full code for this plot, as well as all of the other visualizations that were included in this talk, have a look at this URL. And thank you so much for tuning in. Good luck with your future ggplots, and I hope that you have enjoyed the talk.
