Claus Wilke | Spruce up your ggplot2 visualizations with formatted text | RStudio (2020)

The ggtext package provides various functions to add formatted text to ggplot2 figures, both in the form of plot or axis labels and in the form of text labels or text boxes inside the plot panel. Text formatting can be achieved through a small subset of markdown, HTML, and CSS directives. Features currently supported include italics, bold, super- and sub-script, as well as changing font size, font family, and color. Basic support for adding images to formatted text is also available.

Transcript

This transcript was generated automatically and may contain errors.

Yeah, hello. It's great to be here. My name is Claus Wilke. I'm with the University of Texas at Austin, and today I'm talking about work that has been sponsored by the R Consortium. That's why they're prominently featured at the bottom, and it's about improving text rendering in ggplot2.

So a common problem that probably all of you at some point have run into is you may want to mix italics and normal text in ggplot2. So in my day work, I'm a biologist, so it's a very common problem for biologists. You have species names that are supposed to be in italics, but then you have other text surrounding it that shouldn't be in italics. So how can we make that happen?

And so just to give you an example of how that may arise, here's a Stack Overflow question that's asking exactly that: "For my plot figure, I want to label categories on a bar plot with the first word being italicized while the following words are not italicized. I want the category labels to look as follows." And so there's a typical biology example. There's a bacterial name. I'm not trying to pronounce it, and then OTU1, operational taxonomic unit. That's a biology term, but in any case, so this should be italics. This should not be italics, right?

So let's look a little more into the example code and data that comes with this issue. So this is a made-up data set that the person provided. We have in the first column the name of the bacterial species. Then we have these OTUs. Then we have a name column that combines the bacterial species and the OTUs, and then we have some numeric value, and so the person was turning this into a bar plot with just three lines of ggplot2, right? You take the data, pipe it into ggplot2, put name on x and value on y, but then we flip the coordinates so we get a horizontal plot, and of course, you see the problem in this plot is that the bacterial names are not in italics. So how can we solve that problem?
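A minimal sketch of the plot described above. The data values and the column names bactname and OTUname are assumptions for illustration; the talk only describes the layout of the table.

```r
library(ggplot2)

# Made-up data in the layout described in the talk (values are illustrative)
data <- data.frame(
  bactname = c("Staphylococcaceae", "Moraxella", "Streptococcus", "Acinetobacter"),
  OTUname  = c("OTU 1", "OTU 2", "OTU 3", "OTU 4"),
  value    = c(-0.5, 0.5, 2, 3),
  stringsAsFactors = FALSE
)

# The name column combines the species and the OTU
data$name <- paste(data$bactname, data$OTUname)

# Name on x, value on y, then flip the coordinates for horizontal bars
p <- ggplot(data, aes(name, value)) +
  geom_col() +
  coord_flip()
```

Printing p draws the bar plot; the species names appear in the same upright font as the OTU labels.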

The common solution that you will find for that is some sort of plotmath. So this is the accepted solution on Stack Overflow. You can make a vector of expressions and apply it to the labels argument in scale_x_discrete, and so the code that makes the magic happen is this code here, quote italic parenthesis single quote double quote comma x bracket one. I cannot write that code. I'm sure there are people in this room who can write that code, and I salute you. I cannot, and I don't want to.
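For comparison, here is one way a plotmath version could look. This is a sketch under assumptions, not the accepted Stack Overflow answer: it builds the expressions with bquote(), and the data and column names are the same illustrative ones used for the bar plot.

```r
library(ggplot2)

# Illustrative data; values and column names are assumptions
data <- data.frame(
  bactname = c("Staphylococcaceae", "Moraxella", "Streptococcus", "Acinetobacter"),
  OTUname  = c("OTU 1", "OTU 2", "OTU 3", "OTU 4"),
  value    = c(-0.5, 0.5, 2, 3),
  stringsAsFactors = FALSE
)
data$name <- paste(data$bactname, data$OTUname)

# Fix the bar order so the labels can be supplied in the same order
data$name <- factor(data$name, levels = data$name)

# One plotmath call per bar: italic species name, then the plain OTU
label_calls <- mapply(
  function(bact, otu) bquote(italic(.(bact)) ~ .(otu)),
  data$bactname, data$OTUname,
  SIMPLIFY = FALSE, USE.NAMES = FALSE
)
label_expr <- as.expression(label_calls)

p <- ggplot(data, aes(name, value)) +
  geom_col() +
  coord_flip() +
  scale_x_discrete(labels = label_expr)
```

The labels have to be kept in step with the order of the factor levels by hand, which is part of what makes this approach awkward.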

I would rather write my own text rendering software in C++, and so that's what I've done.

Introducing ggtext

Okay, so that leads us to the question, so how should we specify formatted text? So let's first make a slightly simpler example. Instead of the axis tick label, let's just look at the title. Okay, I want a title, "Species names are in italics," and the word "italics" should be in italics, and maybe I've already given it away, so I put "italics" in stars. Anybody that has used Markdown knows this, right? If you're using R Markdown, if you want something in italics, you put stars around it, and so that's what this package ggtext facilitates.

So as of right now, it's still in development. You have to install it from GitHub, but it's essentially ready to go. It's just waiting for the next release of ggplot2, so sometime within the next two to three months, this should be on CRAN. So I've now loaded the package. Nothing has happened yet because I haven't changed anything in the code. It wouldn't be good if just randomly stars were interpreted as Markdown because maybe sometimes you actually want to have stars in your text. So we have to tell ggplot2 that now it should interpret this as Markdown, and so I've hooked this up with the theme system. If you've ever used themes in ggplot2, usually plot.title tells us how the title of the plot is styled, and normally you would write element_text, and then you can use that to change the color and so on. And now I've written this element_markdown, and so now you see this has been interpreted as Markdown, and so now this "italics" is in italics.
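A minimal sketch of the title example, assuming ggtext is installed. The data are illustrative; the relevant part is that plot.title is styled with element_markdown() instead of element_text().

```r
library(ggplot2)
library(ggtext)

# Illustrative data; values and column names are assumptions
data <- data.frame(
  name  = c("Staphylococcaceae OTU 1", "Moraxella OTU 2",
            "Streptococcus OTU 3", "Acinetobacter OTU 4"),
  value = c(-0.5, 0.5, 2, 3),
  stringsAsFactors = FALSE
)

# The stars mark the word that should be rendered in italics
plot_title <- "Species names are in *italics*"

p <- ggplot(data, aes(name, value)) +
  geom_col() +
  coord_flip() +
  labs(title = plot_title) +
  # Without this line the stars are drawn literally
  theme(plot.title = element_markdown())
```

Other theme elements keep their default element_text() rendering, so stars elsewhere in the plot are still shown as plain stars.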

Applying markdown to axis labels

Okay, so now let's apply this to the original problem. So the additional issue here now is that we have these different species names, and we don't want to manually put them into stars. We want to automate that, right?

And so there, a really, really good companion package is the glue package. If you're not routinely using glue, I encourage you to look into it. Glue allows you to very quickly paste pieces from your data frame into strings of text. So I'm taking the data here, pipe into mutate, and then I create a new name column where I call glue, and then here I have this star, and then bactname, which is the column that has the bacterial name, in curly braces, and that tells glue, don't write "bactname" into the string, but actually pull it from the data frame. And then here, the same with OTUname, and so that now creates this table where now you see I have nice stars around the bacterial names. And so now I pipe this into ggplot2 with the element_markdown. So now it's, of course, the axis.text.y that I want to change, and now you see this is in italics. This is not.
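The steps above can be sketched as follows. The data values and the column names bactname and OTUname are assumptions; the glue result is converted with as.character() here simply to keep the column a plain character vector.

```r
library(dplyr)
library(glue)
library(ggplot2)
library(ggtext)

# Illustrative data; values and column names are assumptions
data <- data.frame(
  bactname = c("Staphylococcaceae", "Moraxella", "Streptococcus", "Acinetobacter"),
  OTUname  = c("OTU 1", "OTU 2", "OTU 3", "OTU 4"),
  value    = c(-0.5, 0.5, 2, 3),
  stringsAsFactors = FALSE
)

# Wrap each species name in stars; glue pulls the values from the columns
labelled <- data %>%
  mutate(name = as.character(glue("*{bactname}* {OTUname}")))

p <- labelled %>%
  ggplot(aes(name, value)) +
  geom_col() +
  coord_flip() +
  # After coord_flip() the names sit on the vertical axis
  theme(axis.text.y = element_markdown())
```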

Adding color with HTML and CSS

Okay, so now you might say, okay, your plots are boring. Don't you ever use color? I do. So let's do something with color. I'm doing a little bit of a trick, so if Hadley's not in the room, right, he would say, you're circumventing the scale system. Yes, in this case, that's the right way to do it. So I'm adding a color vector, so these are just colors that I chose, and now I'm still using glue, but now you may see here I'm actually writing a little bit of HTML. So this is the italics tag now, and I'm using inline CSS styling to get the color in there.

And with that, I can actually have colors now. So the trick that I'm using here is scale_fill_identity. I'm mapping color onto fill, and then scale_fill_identity, instead of trying to set up a color mapping, just takes the actual colors that are in the column. Now one other thing that you may notice, it's a topic that's very dear to my heart. You may notice that visually these colors and these colors look kind of the same, right? They match very well, but I'm actually setting the transparency to 0.5 here. So the bars are actually at 50% transparency, and then they look the same as the text, which is completely opaque. And that's just how our visual system works. When you color large areas, you need the color to be lighter. It's a very common problem that I see when people color text, that the text becomes too light. The text always needs to use a darker color to look the same as a large area that you colored the same. So this worked out well. So if you make the color 50% transparent, it's going to be about right.
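A sketch of the colored version, under the same assumptions about the data. The specific hex colors are placeholders, not the ones used in the talk.

```r
library(dplyr)
library(glue)
library(ggplot2)
library(ggtext)

# Illustrative data; values, column names, and colors are assumptions
data <- data.frame(
  bactname = c("Staphylococcaceae", "Moraxella", "Streptococcus", "Acinetobacter"),
  OTUname  = c("OTU 1", "OTU 2", "OTU 3", "OTU 4"),
  value    = c(-0.5, 0.5, 2, 3),
  color    = c("#009E73", "#D55E00", "#0072B2", "#000000"),
  stringsAsFactors = FALSE
)

# An <i> tag with inline CSS sets both italics and color for the species name
labelled <- data %>%
  mutate(name = as.character(
    glue("<i style='color:{color}'>{bactname}</i> ({OTUname})")
  ))

p <- labelled %>%
  ggplot(aes(name, value, fill = color)) +
  geom_col(alpha = 0.5) +      # bars at 50% transparency
  scale_fill_identity() +      # use the colors in the column as they are
  coord_flip() +
  theme(axis.text.y = element_markdown())
```

The text is drawn fully opaque while the bars use alpha = 0.5, which is the pairing the talk recommends for making text and filled areas look alike.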

Images in axis labels

So this is another issue from Stack Overflow. It's another biological problem. So it's made-up data. The key thing here is there are these phases. So these are phases of cell division, again, a biological example. And so this person had these different images. And so this image corresponds to this phase, this image corresponds to this phase, and so on. And so the question was, can we somehow get the images into the figure? And we can.

So now we're using the image tag. We're loading or we're setting the images. So in this particular case, it's in a folder on my hard drive, but you can pull them from a web page if you want. That works just fine. We put a break tag, and then the word, and we stick it into ggplot2. The only other thing is line height. The default line height, for some reason, is just too low. It's 0.9. I don't know why. 1.2 is a better line height if you actually want to have things in two lines. And there we go. We have the images in the axis and the label underneath.
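A sketch of how such labels could be assembled. The phase names, values, and image paths are hypothetical; the plot can only be rendered if the image files actually exist at those paths.

```r
library(glue)
library(ggplot2)
library(ggtext)

# Hypothetical phases and values
phases <- data.frame(
  phase = c("Interphase", "Prophase", "Metaphase"),
  value = c(7.3, 8.4, 9.7),
  stringsAsFactors = FALSE
)

# Each label: an <img> tag, a <br> line break, then the phase name.
# The img/ folder and file names are hypothetical.
labels <- setNames(
  as.character(glue(
    "<img src='img/{tolower(phases$phase)}.jpg' width='60' /><br>{phases$phase}"
  )),
  phases$phase
)

p <- ggplot(phases, aes(phase, value)) +
  geom_col() +
  scale_x_discrete(labels = labels) +
  # A larger line height leaves room for the two-line label
  theme(axis.text.x = element_markdown(lineheight = 1.2))
```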

Text boxes and geom_richtext

There's a few other things that it can do. So there's actually the option to have text boxes with some amount of text wrapping. You can color the background. If you go to the website, which is in the bottom right corner, you can see the code that makes this particular figure. So everything I've shown so far is theme elements. Does this only work for theme elements? No. I also have written a geom that you can use to use the same text rendering in a plot.
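As a rough sketch of a text box used as a plot title: this assumes the element_textbox_simple() theme element from ggtext, which may differ from what was on the slide. The title text and the mtcars data are placeholders.

```r
library(ggplot2)
library(ggtext)

# Placeholder title, long enough that it needs to wrap
long_title <- paste(
  "**Fuel economy versus engine displacement.**",
  "This title is deliberately long so that it has to be wrapped",
  "onto several lines inside the text box."
)

p <- ggplot(mtcars, aes(disp, mpg)) +
  geom_point() +
  labs(title = long_title) +
  theme(
    plot.title.position = "plot",
    plot.title = element_textbox_simple(
      size = 13,
      lineheight = 1,
      padding = margin(5.5, 5.5, 5.5, 5.5),  # space inside the box
      margin = margin(0, 0, 5.5, 0),         # space below the box
      fill = "cornsilk"                      # background color
    )
  )
```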

So let's do another common situation. This is the iris data set, which I assume everybody knows. So we have three species of iris flowers. We're plotting the sepal width against the sepal length for the three species. And we are putting regression lines in here. And so oftentimes, when you have a plot like that, you may want to annotate this with, say, the R squared of the regression line. And so it's, again, the problem, OK, how do I get the R squared into some plot math expression that then I can add to the plot? And so we'll, instead of doing that, use the same technique here.

So let's just go through the code quickly. So first, we calculate the correlation. So we take the iris data set, group by species, summarize, calculate the R squared. So then we end up with a table that has the species names and the R squared, right? And now, we turn this into our labels. So this sepal length and sepal width, 8 and 4.5, those are just the coordinates where the label will appear in the plot. And then, again, I use the glue function. So I have R in stars. So it's going to be italics, superscript 2, because it's going to be R squared. And here, I'm pasting in the R squared value, and I round it to two digits. So in the glue package, inside the curly braces, you can just write regular R code, and it gets executed and replaced by the resulting string. And then, this goes into ggplot2, into a geom_richtext, which takes the label and interprets it as HTML. And so now, you have here this label, R squared equals whatever. So by default, it puts a white background and a box around it. But all of that, you can change. You can take the box away if you want. You can put a different background and so on.
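The walkthrough above can be sketched like this. The faceting by species and the justification settings are assumptions about the slide; the label coordinates 8 and 4.5 come from the talk.

```r
library(dplyr)
library(glue)
library(ggplot2)
library(ggtext)

iris_cor <- iris %>%
  group_by(Species) %>%
  # Squared correlation between sepal length and width, per species
  summarize(r_square = cor(Sepal.Length, Sepal.Width)^2) %>%
  mutate(
    # Position of each label in data coordinates
    Sepal.Length = 8, Sepal.Width = 4.5,
    # Italic R, superscript 2, and the rounded value computed inside the braces
    label = as.character(glue("*R*<sup>2</sup> = {round(r_square, 2)}"))
  )

p <- ggplot(iris, aes(Sepal.Length, Sepal.Width)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x) +
  facet_wrap(~Species) +
  geom_richtext(
    data = iris_cor,
    aes(label = label),
    hjust = 1, vjust = 1   # anchor the label at its top-right corner
  )
```

To the best of my knowledge, passing fill = NA and label.color = NA to geom_richtext() removes the background and the outline.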

The gridtext package

So that's really all I can do. However, there are actually two packages that I've written. So this package was the ggtext package, which provides the interface to ggplot2. But it's really just a wrapper around an underlying package, the package that makes it all possible. And that already is on CRAN. And that's the gridtext package. And so that is a great package if you're actually writing grid graphics code. So for example, if you're writing your own ggplot2 extensions. And I've made it so that if you're used to writing grid code, it should be very easy to use the gridtext package. So if you've never seen grid code, this may look scary. But if you have seen grid code, this should be very familiar.

So this is, so far, just regular grid. I have some text here. Then I set up grid graphics parameters, so gpar, graphics parameters. I'm using a silly font with some font size and a color. And then I create a text grob, grob stands for graphical object, with the text, the coordinates x and y, horizontal justification, and the graphics parameters. And then I call the grid.draw function to draw it. And I get this outcome. And so now, in addition, I draw the same string, but using the rich text drawing function from gridtext. And so it looks like this. So now I have richtext_grob, text, coordinates, hjust, gp. So it looks exactly the same.
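A sketch of the side-by-side comparison, assuming gridtext is installed. The text, font size, color, and coordinates are placeholders, and the font family from the slide is left out so the example does not depend on an installed font.

```r
library(grid)
library(gridtext)

text <- "Some text **in bold.**"

# Shared graphics parameters (placeholder size and color)
gp <- gpar(fontsize = 20, col = "blue")

# Regular grid: the stars are drawn literally
plain <- textGrob(
  text,
  x = unit(0.1, "npc"), y = unit(0.7, "npc"),
  hjust = 0, gp = gp
)

# gridtext: same arguments, but the text is interpreted as Markdown
rich <- richtext_grob(
  text,
  x = unit(0.1, "npc"), y = unit(0.4, "npc"),
  hjust = 0, gp = gp
)

grid.newpage()
grid.draw(plain)
grid.draw(rich)
```

The two calls take the same arguments, which is the point made in the talk: switching from textGrob() to richtext_grob() should need little more than a change of function name.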

A word of caution

I want to end with one word of caution. So you may now think, oh, great, I can do anything. That is not true. You can pretty much do exactly what I showed you and not more.

So the issue is that I'm not handing off the HTML to a web browser that renders it or some professional text rendering library. For technical reasons, there's just no good way right now to achieve that. So I had to write my own rendering engine from scratch. And so my approach was, OK, what are the five things that I can implement in a reasonable amount of time that really, really make a difference? And the number one is italics and bold. The number two is colors. Number three is being able to change fonts. So you can change a different font if you want to. Number four is superscripts and subscripts. And then number five is images. So that's where I currently am. You can do a lot with that.

Just this morning, I got this bug report that I was just waiting for, which listed 100 markdown features that don't currently work. Yes, I know. They don't currently work. The truth is, my vision is to increase the number of features that we support. At this point, we have a good user interface that I think will remain stable. And so I can now keep adding features and just eventually, at some point, certain things may work that right now don't work. But anything that works now should continue to work. And the R code, sorry, the R code should not change going forward. So with that, make plots with formatted text. Thank you.

Q&A

We have a couple minutes for questions here. So let me go down the list. There were a lot of questions in here that you just covered at the end, like what else is available. So I will scroll through those. Do you have any thoughts on adding shorthand markdown-ish syntactic sugar around the HTML tags for this work?

I'm not sure I understand the question. So the way that this whole thing works is it actually, it takes the markdown, it sends it through the regular markdown package that is part of R, which turns it into HTML, and then the HTML gets rendered. So anything that you're used to doing in markdown works, in theory, as long as the resulting HTML is actually supported, right? So going forward, what I need to do is I need to support more HTML features, and actually the biggest issue right now is I need to support more CSS, because that's the next big stumbling block for a lot of obvious features that you would want to have. And then the markdown should just come by itself through the existing markdown package.

I was really happy to see some glue in there, because I love that package, but we have some folks in the audience who wanted to know, what is the advantage of using glue over paste or paste0?

So the advantage of glue over paste is just that it may be slightly easier to write, in particular if you have a lot of things, right? I mean, paste would work just the same, right? If you like paste, use paste. But I find this very easy to read. These curly braces are very low overhead, right? So it's very easy to just look at the piece of text that comes out and say, yes, this looks correct, versus with paste, it gets a bit more complicated to figure out what the final output actually will be.
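To illustrate the comparison, here is the same label built both ways. The variable names and values are placeholders.

```r
library(glue)

bactname <- "Moraxella"
OTUname  <- "OTU 2"

# glue: the template reads like the final string
with_glue <- glue("*{bactname}* {OTUname}")

# paste0: the same string assembled from separate pieces
with_paste <- paste0("*", bactname, "* ", OTUname)

# Both approaches produce the same text
same <- as.character(with_glue) == with_paste
```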

Here's one about fonts: do you have any tips for incorporating nonstandard fonts into plots that might not be available on your machine?

OK, so first of all, this package doesn't solve that. This package will work with any font that works regularly in your R setup, right? So the problem of having other fonts, Thomas is working on that. The long-term goal is that any font that you have installed on your machine should just work. And actually, on Macs, that works reasonably well, honestly. Any font that the operating system knows I can use. I installed, actually, Font Awesome from the web, and then I could put little icons into plots. I didn't demonstrate that today, but that worked just fine. So as long as R supports the font, it will work in this case. But how to make that happen, that's really for Thomas.

And lastly, of course, where are your slides available? Where can you be found for the rest of the conference?

Oh, I completely forgot about that. So I actually tweeted it out earlier this morning. So if you go to @ClausWilke, that's my Twitter handle. And the last tweet I made, or the second to last tweet I made, is the slides.