Dewey Dunnington | Best practices for programming with ggplot2 | RStudio (2020)

The ggplot2 package is widely acknowledged as a powerful, dynamic, and easy-to-learn graphics framework when used in an interactive environment

Transcript

This transcript was generated automatically and may contain errors.

Our first speaker for this afternoon will be Dewey Dunnington.

Hi, this is a lot of people, this is great. So I'm here today to talk to you about programming with ggplot2. I imagine that all of you, or many of you, use ggplot2. It's a fantastic package for making graphics, and there are times you need to do that in a function, or in a Shiny app, or in a parameterized report, or in a package. Those are all great uses of ggplot2, and there are some different paradigms around how to do that, and I just wanted to talk about that.

So my story starts, as many great books and movies have started, with the reverse dependency check. Reverse dependency checks are something that the ggplot2 team does when we're preparing a release of ggplot2. There are over 1,500 packages that directly import or depend on ggplot2, and over 2,500 that use it in the Suggests field of a package. So there's a lot of ggplot2 code on CRAN.

And that's great for us because it means that there are a lot of use cases being considered that we didn't necessarily know about. So we run R CMD check on all of those packages with the old version of ggplot2, and again with the new version, and see which packages broke. Sometimes that's our fault, and sometimes it's something in the other package, but for the most part, as a fresh ggplot2 intern over the summer, it was my job to go through all of the packages that we broke.

And through doing that, I was able to read a lot of ggplot2 code and realize that there were a lot of problems that people were trying to solve in packages that we hadn't ever documented, and also in some cases that we hadn't made possible. The result of that was this vignette, Using ggplot2 in packages. It's available on the development version of the ggplot2 pkgdown site, and once the next version of ggplot2 is released (it's coming, I'm told), it will be on the main ggplot2 website. This talk is largely derived from that vignette, so I'd encourage you to read it if you'd like to know more.

Using ggplot2 in a function

So I'm gonna start with something much like the ggplot that is in R for Data Science. This is one of the first plots that many people make in R, if not the first R code that they ever type. It uses the mpg dataset, which is a bunch of cars. My 2008 Ford Ranger was a four-wheel drive car, and it had a huge engine and terrible gas mileage, so it's down here. My Crosstrek that I currently drive is up here somewhere, smaller engine, better gas mileage, and the 80cc scooter that I sometimes drive is off the chart in both directions. Excellent gas mileage, very small engine, sounds like a lawnmower.

So that's what we're looking at here, and this is a good plot to illustrate some of the problems that I noticed people were trying to solve. For a variety of reasons, you might start out with something like this. Maybe it's a parameterized report, so you've got something in R Markdown, and you've set it up so that you can change what the color variable might be. Maybe you don't want it to be the class of the car, maybe you want it to be something else. And you might also want to change how the points are grouped in facets.

And so you have a facet variable. In this case we're using the drv variable, whether it's four-wheel drive, front-wheel drive, that kind of thing. But we have this mapping and facet specification that uses tidy evaluation. ggplot2 knows to look for class, for example, in the mpg dataset, because that's the plot data, and it knows to look for the drv variable in mpg for the same reason. But if you put a string in there, it won't do what you expect it to do.

So the best way, I think, to solve this problem is to use the .data pronoun. Pronoun is a fancy word for the fact that ggplot2 knows that .data is a stand-in for mpg, so it's a stand-in for the layer data, for the plot data, and you can use it much like you would a data frame. You can subset it with the double brackets, you can use the dollar sign, which I'll show a bit later, and you can do that both in a facet specification and in a mapping.

So this is really powerful, and this lets us use syntax that we sort of expect in a mapping and in a facet specification.
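As a minimal sketch of the .data pronoun described here (the slides are not part of this transcript, so the variable names colour_var and facet_var are assumptions for illustration):

```r
library(ggplot2)

# Hypothetical inputs, e.g. report parameters or Shiny inputs held as strings
colour_var <- "class"
facet_var <- "drv"

p <- ggplot(mpg) +
  # .data stands in for the layer data; [[ ]] looks a column up by its name
  geom_point(aes(displ, hwy, colour = .data[[colour_var]])) +
  # the same pronoun works in a facet specification via vars()
  facet_wrap(vars(.data[[facet_var]]))
```

Changing either string changes the plot without touching the ggplot2 code itself.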

So then you can make a function around it. This isn't artificial, by the way, this is actually the process that I go through when I make a function for ggplot2, so the order is kind of helpful, I think. You can put it in a function, pass the color variable and the facet variable in as arguments, and then call that function and get a plot. That's a useful way to simplify making possibly a bunch of plots, or making plots in an R Markdown report.

So we've got this function now, it's got some arguments, those arguments are what we expect them to be. But it doesn't quite work because we actually can't remove the color mapping. So if we go back to this plot, what if we don't actually want color in there? And there's lots of reasons that you might want to do that, you've got a checkbox in a Shiny app or in a parameterized report, you want it to not make things colored. So currently that doesn't work.

So we have to go back to the code and change some things. So the best way I think to solve this problem is to take this mapping here, so this is what we're doing with the mapping, and just move that up. So we can't modify it if it's not on its own, and it's a bit simpler to bring it up top so we don't have to modify it within the geom point function, it's a bit cleaner to read. And then we can test for if the color variable is null, and then we can modify the mapping based on that.

And so this is a pretty common problem that I've had personally with programming with ggplot2. And so I think it's a good pattern where you put your mapping up top, and you can modify that as you go depending on what you're trying to do.
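The pattern of building the mapping up top and then modifying it might look roughly like this. The function name plot_mpg and the argument colour_var are assumptions, not taken from the slides:

```r
library(ggplot2)

# Hypothetical function; colour_var is a column name as a string, or NULL
plot_mpg <- function(colour_var = NULL) {
  # Build the mapping first so it can be changed before any layer uses it.
  # aes() only captures the expression, so a NULL colour_var is safe here.
  mapping <- aes(displ, hwy, colour = .data[[colour_var]])

  # Drop the colour aesthetic when no colour variable was requested.
  # aes() stores the aesthetic under the name "colour", even if written "color".
  if (is.null(colour_var)) {
    mapping$colour <- NULL
  }

  ggplot(mpg) +
    geom_point(mapping)
}

plot_mpg()         # no colour mapping
plot_mpg("class")  # points coloured by class
```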

So that's all well and good, but we still can't remove the facet. So we've got three panels, and maybe we don't actually want any panels. Maybe we want to give a user of a Shiny app an option to say, okay, don't actually group this by anything, maybe we just would like to have just one plot.

So again, back to our code. The facet was previously at the bottom, and I think that the best way to solve this problem is just to move that up a little bit. Again, we can work with it a bit more easily that way, and we can keep the structure of the plot at the bottom of the function so that it stays clear exactly what the plot is made up of. And then we can test whether the facet variable is null.

And we can use this trick with ggplot2, which I think is highly underused, where you can add NULL to a ggplot, and it changes nothing. You might think that that's not actually that useful, but it's really useful in a situation like this, because without it, you would have to define two ggplots. You wouldn't be able to just say, okay, I don't want to add anything to the plot in a certain case. So I think this conditional adding of possibly nothing to a plot is really powerful, and it really helps clean up your actual ggplot specification.
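A sketch of conditionally adding nothing, under the assumption of a hypothetical function plot_mpg_facet() with a facet_var argument:

```r
library(ggplot2)

# Hypothetical function; facet_var is a column name as a string, or NULL
plot_mpg_facet <- function(facet_var = NULL) {
  # Decide on the facet up top: either a real facet specification or NULL
  facet <- if (is.null(facet_var)) {
    NULL
  } else {
    facet_wrap(vars(.data[[facet_var]]))
  }

  # The plot structure stays in one place; adding NULL leaves the plot unchanged
  ggplot(mpg) +
    geom_point(aes(displ, hwy)) +
    facet
}

plot_mpg_facet()       # a single panel
plot_mpg_facet("drv")  # one panel per drive type
```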

So for interns like me reading your code, it's a bit easier to read. But mostly it's for you.

So now we successfully have a slightly less interesting plot. But the point is that we can now modify a couple of different options about it. We could add the color back in if we wanted to, and it still wouldn't give us any panels.

So I think that the big take-homes here for using ggplot2 in a function are these. First, you can use .data much like a data frame within aes(). You can use it as a stand-in for the layer data that you want to map. It follows all the regular rules within the double brackets, so you can put variable names in there that might be inputs in a Shiny app, or parameterized report variables, or just arguments of a function that you made for yourself because you want to make a whole lot of plots. That works both in aes() and vars().

The second thing that I think is really helpful for using ggplot2 in a function is the fact that adding NULL to a ggplot does nothing. That's really helpful if you might want to add something and you might not. It helps keep your actual specification of a ggplot pretty clean. And something that I didn't show an example of, but that's similarly useful, is that you can add a list of things to a ggplot and it will add all of them in the list sequentially. It's a bit easier to programmatically modify a list than to programmatically modify a chain of things added together. So that's another nice trick that you might be able to use in specifying some of your plots.
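The talk gives no example of adding a list, so the following is only an illustration. The show_smooth flag is an assumed input:

```r
library(ggplot2)

# Hypothetical option, e.g. a checkbox in a Shiny app
show_smooth <- TRUE

# Start with the components that are always present
components <- list(geom_point())

# Append more components programmatically
if (show_smooth) {
  components <- c(components, list(geom_smooth(method = "lm")))
}

# Adding the list adds each element in order
p <- ggplot(mpg, aes(displ, hwy)) + components
```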

Using ggplot2 in a package

So if you are just making a function, then that's as far as you need to go. But if you want to use ggplot2 in a package, there are some other steps that you need to take. There's lots of reasons that you might want to make a little package. Maybe you'd like to put it on GitHub and let your friends install it. Or maybe you want to submit it to CRAN. So there's a couple more constraints that come with that.

So this is the plot mpg package. I'll show you a link to the slides at the end of the talk and you can see this little tiny package. It has one function, the plot mpg function. And it lets you specify a color mapping and it lets you specify a facet variable. So that's what it looks like.

Very important digression. This slide isn't in there because of any of your functions. Your functions are fantastic. This slide is in there because recently when I was making this talk, I was faced with about 400 lines of conditional plotting code that I put in a package for ggplot2. Lots of if statements. I was trying to make the perfect plot for plotting this very particular type of data about mud, because that's what I do.

And what I realized when I was going through this code is that what it really boiled down to was four lines of a ggplot. There were really four lines of ggplot2 code, and it took hundreds of lines to handle all of the edge cases about the various types of data that could end up there. So my solution was to make a vignette first, and also to put examples in a really, really good data generating function. What that ended up with is really good documentation, and I don't think that anybody ended up making worse plots because of it. I think that it actually resulted in a far better package, and it was a lot easier for me to test, because I had this really great data generating function. So I think it's worth considering that.

And even if you still would like a plot function, and you probably would still like a plot function, if you do both of those things, then you can make something that's easier to test.

So this is the function that we started with. It's a great function, but it won't necessarily work in a package. When you refer to anything in a package, you have to be really explicit about where it comes from. I know that many of you probably know this, it's general package development, but you can use the double colon (::) to refer to something in a particular package. You can also use importFrom with ggplot2, which imports those functions so you can use them within your package without having to write ggplot2:: every time. You do still have to do that with datasets. You always have to preface a dataset with the package name.

But this is a really great trick that you can use within a package to keep your code looking clean. If you use ggplot2 a lot, this will keep your ggplot2 code looking really clean inside the package and a lot easier to read. Again, I was an intern. I read a lot of code. And in the 400 lines of code that I just removed, I did a search for ggplot2:: before I removed it. It showed up, like, 157 times. So this is a way to keep it a little bit cleaner. Then when you run the documentation, it will write that into your NAMESPACE, and you can keep your code very similar to what it was before.
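As a rough sketch of the two styles, assuming roxygen2 is used to generate the NAMESPACE file (the function names are hypothetical):

```r
# Attached here only so this sketch runs as a plain script; inside a package,
# the @importFrom tag below generates the NAMESPACE entries instead.
library(ggplot2)

# Style 1: every ggplot2 object is prefixed with ggplot2::
plot_mpg_qualified <- function() {
  ggplot2::ggplot(ggplot2::mpg) +
    ggplot2::geom_point(ggplot2::aes(displ, hwy))
}

# Style 2: import the functions once, then use them unprefixed
#' @importFrom ggplot2 ggplot aes geom_point
plot_mpg_imported <- function() {
  # Datasets are not imported, so mpg keeps its ggplot2:: prefix
  ggplot(ggplot2::mpg) +
    geom_point(aes(displ, hwy))
}
```

Both functions produce the same plot; the second is simply easier to read once a package contains many ggplot2 calls.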

But we're gonna get a problem. I don't know if you guys have experienced this, but when I get a little X at the bottom of an R CMD check, my heart races a little bit, and I start to question my existence. In this case, we're getting three notes that don't actually affect whether the function works or not. There's something that R doesn't know about the function that we have to tell it explicitly. We've got three things that R thinks aren't defined. If you're following along, there's the displ (displacement) variable, the hwy (highway) variable, and the .data variable. These are things that R has very correctly not detected in any assignment statement before that.

And ggplot2 is smart enough to know that these things are in your layer data, but R doesn't know that. So we can explicitly tell it that using the .data pronoun again. This is the new and improved code, where we're using .data as something like a data frame, and we use the dollar sign to extract a variable out of it. That will remove the R CMD check note. You'll notice that I also had to importFrom rlang .data. With .data, we can't do the double-colon thing, because it's special within the mapping. But if we do this, that little check note will disappear, and your heart can fill with joy when you run R CMD check.
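A sketch of what the check-clean version might look like, again assuming roxygen2 tags and a hypothetical function name:

```r
# Attached here only so this sketch runs as a plain script; a package would
# rely on the roxygen tags below to populate its NAMESPACE.
library(ggplot2)

#' @importFrom ggplot2 ggplot aes geom_point
#' @importFrom rlang .data
plot_mpg_checked <- function() {
  ggplot(ggplot2::mpg) +
    # .data$displ and .data$hwy tell R where the variables come from,
    # so R CMD check no longer reports displ and hwy as undefined
    geom_point(aes(.data$displ, .data$hwy))
}
```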

Testing graphical output

So a final thing that I want to talk about is testing. If you followed my advice earlier and made a really, really good data generating function, so a function that returns a data frame that you can pass directly into ggplot2, then you can test that reasonably well. And if you saw Jenny Bryan's talk this morning, you had a little testthat introduction. You can use something called vdiffr, which is a package by Lionel Henry, part of the tidyverse team, to actually test the visual output.

So you can't always test something programmatically, and a lot of times with a plot, you shouldn't. You shouldn't reach into a plot object and try to test something. It's much better, and honestly much more realistic to the end point, to test the image that got generated. In ggplot2, we use this, and I think there are hundreds of plots that we have to check. If we had to do that manually, we might not do it, or we might not do it as well, or we might miss something. So we use the vdiffr package to generate a version of record for every single one of those figures. Then when we run the tests, it checks that every single one of those figures is identical.

So you can use it much like any other expectation. We have this expect_doppelganger(), where we're expecting this plot mpg function, as it's called, to generate the same thing as some version we have on file. You can make the version on file by running this little function in the vdiffr package. That will run all of your tests, collect all the doppelgangers, and spit out something that looks like this, where it asks you to validate any of your new cases. So you say, yes, this is the version that I want on file. And if any of them are mismatched, it will also show you the two versions, so you can see what changed.
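A test file using this expectation might look something like the sketch below. The plot_mpg() definition is a hypothetical stand-in for the package's own function, and the test is meant to live under tests/testthat/ in a package rather than be run on its own:

```r
library(testthat)
library(ggplot2)

# Hypothetical stand-in for the package's plotting function
plot_mpg <- function() {
  ggplot(mpg) +
    geom_point(aes(displ, hwy))
}

test_that("plot_mpg() matches the version on file", {
  # The first argument is a title that identifies the stored reference figure;
  # the second is the plot to compare against it
  vdiffr::expect_doppelganger("plot_mpg default", plot_mpg())
})
```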

So whether it's a change in your code that induced a change in a figure, or a change in our code in ggplot2 that induced a change in your figure, you can see what the change was. If it's meaningful to you, you can tell us or change your code. If it's so small that it doesn't matter to you, you can generate a new version of record. So I think that that is a really powerful tool, and I've started moving every package that I work on to use it. It doesn't have to be ggplot2, it can do lots of other formats, and it's a fantastic package.

So, using ggplot2 in a package. We can use the double colon that you might be familiar with to refer to objects in ggplot2, whether they're functions or datasets. We can also use the importFrom syntax to import specific functions from ggplot2, so that your first line doesn't have to be ggplot2::ggplot(ggplot2::aes(. It gets very dirty very quickly. The second thing is that we can use .data within a mapping or a facet spec to avoid R CMD check problems with a variable not being defined. And finally, you can test. Everyone should be testing. We love testing. And you can test graphical output using the vdiffr package.

And so this is a bit of a toy example that I did. It's the mpg dataset. There's a reason that there's not a plot mpg package, because people don't need help plotting that dataset. But you might need help plotting all of the Canadian election results starting in 1867. This is something I did right after the Canadian election. We do have elections, even though we're ruled by the queen, very technically. This is every election since 1867, and as you might imagine, it's pretty useful to have a plot function for that. It's also useful to have a data generating function. And there are a number of other problems that this package solves that I didn't have time to go into in the talk. So if you're interested, I'd encourage you to check it out. It's a fairly minimal package.

So that is all I had. I'd encourage you to check out the vignette if this is something that you think you're going to do a lot. And I'm also very happy to answer any questions now or later on GitHub or on Twitter or any one of the things. So thank you.

Q&A

So we have maybe two minutes for a couple of quick questions, if you're amenable.

So it looks like a number of people in the audience were interested in your use of variable quoting and maybe why not or what about tidy eval.

Yeah. So what I talked about is tidy eval. The .data pronoun is tidy eval. But if you're talking about the double braces or the double exclamation point, those are things that you can use if you want your functions to use the same tidy evaluation semantics. A lot of times you actually don't want that. You don't actually want your function to do something nonstandard, because a lot of people don't expect that. So I think that there are a lot of things that are good about that, and there's a lot of really great material by Hadley and others. But .data is one of the ways that you can make this work if you don't want to do that.
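For contrast, a function that uses the double braces takes an unquoted column name, the same way aes() itself does. This is an illustrative sketch with a hypothetical function name, not code from the talk:

```r
library(ggplot2)

# Hypothetical function; colour_var is passed unquoted, like in aes()
plot_mpg_tidy <- function(colour_var) {
  ggplot(mpg) +
    # {{ }} forwards the caller's expression into the mapping
    geom_point(aes(displ, hwy, colour = {{ colour_var }}))
}

plot_mpg_tidy(class)  # note: class, not "class"
```

The .data[[ ]] form takes an ordinary string instead, which is often what a Shiny input or a report parameter already provides.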

Another question here. Could you explain the difference between using NULL and geom_blank()?

Sure. So you can use geom_blank(), but geom_blank() actually does do something. It will still affect the limits of the axes. And so that's why I was adding NULL. geom_blank() is a geometry, and it might be a perfectly good way to do what you want to do. But if you have a scale that you might want to add or might not, then there is no blank scale. So NULL is a stand-in for anything blank. It's just like that piece of the plot never happened.
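The optional-scale case mentioned in this answer could be sketched like this. The function name and the log_x argument are assumptions for illustration:

```r
library(ggplot2)

# Hypothetical function with an optional log-transformed x axis
plot_mpg_scale <- function(log_x = FALSE) {
  # There is no "blank" scale, so NULL stands in when no scale is wanted
  x_scale <- if (log_x) scale_x_log10() else NULL

  ggplot(mpg) +
    geom_point(aes(displ, hwy)) +
    x_scale
}

plot_mpg_scale()              # default continuous x scale
plot_mpg_scale(log_x = TRUE)  # log10 x scale
```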

I have maybe one more here, and I'm not sure if you'll understand or if you'll need more context, but how do you address the test brittleness when using vdiffr?

Yeah, so there's a lot of great material on vdiffr as a monitoring tool and not necessarily a testing tool. It's sort of designed to be brittle. It's designed to fail very easily when there's any little tiny change, because it really does check the SVG that was printed out. When I was rewriting the axes, I ran into a problem where I drew the tick marks in the opposite direction, and it detected that. That is a good thing from our perspective. By default, it doesn't run those tests on CRAN, so you don't have to worry about your package failing on CRAN because of that. It does run them on Travis by default, if you're using Travis, and that is actually really helpful to make sure that your figures aren't different on some other operating system. So it is brittle, but I think it's brittle by design, and I think that that is a good thing. There are some ways to skip it on certain platforms if you think that that's a problem. So I think it's still worth doing.

Awesome. Thank you so much, Dewey. Please join me in thanking Dewey one more time.
