Resources

Kara Woo | Boxplots: a case study in debugging and perseverance | RStudio (2019)

Come on a journey through pull request #2196. What started as a seemingly simple fix for a bug in ggplot2's box plots developed into an entirely new placement algorithm for ggplot2 geoms. This talk will cover tips and techniques for debugging, testing, and not smashing your computer when dealing with tricky bugs.

Materials: https://github.com/karawoo/2019-01-17-rstudioconf

About the author: Kara Woo is a research scientist in data curation at Sage Bionetworks, where she helps other researchers document and share their data. She has previously worked as an information manager at Washington State University and at the National Center for Ecological Analysis and Synthesis (NCEAS), where she combined data management with fieldwork at a remote Siberian lake. Kara is an enthusiastic R programmer, and collects data visualizations gone beautifully wrong on a blog called accidental aRt.

Transcript

This transcript was generated automatically and may contain errors.

Alright, thank you everyone for being here, and bless you for choosing to come to a talk about boxplots.

And so I'd like to tell you a story about a bug in ggplot2's boxplots, but this talk is not really about boxplots, because it's really about how to go about understanding and fixing bugs when you're faced with a problem that you don't know how to solve. And so I'm hoping that the strategy that I'll share that helped me solve this particular bug will be one that you can take and maybe use in your own coding work, whether that's working on R packages, or data analysis code, or any other situation that you might find yourself in.

So a while back, someone opened this issue on the GitHub repo for ggplot2. They said, when I try to produce boxplots with colors depending on a categorical variable, these appear overlapping if varwidth is set to TRUE. So if you're not familiar with boxplots in ggplot, varwidth is an option that lets the boxes vary in width depending on the number of points they represent, which can be really useful if you have very imbalanced classes in your data and you want to show that some of these boxes represent fewer points than others.

So I was working as a ggplot2 intern at the time, and I was mainly supposed to be focusing on other aspects of the code base, but I saw this issue come in, and I thought, this should be straightforward to fix. Like, I'll just knock this one out really quick.

But we've got about 18 minutes of this talk slot left, so you can probably guess that that isn't how things actually turned out. In fact, it was a way more complicated bug than I had really anticipated, and I ended up having to write an entirely new algorithm to place boxes in boxplots, and working on this was probably the deepest that I had ever gone into trying to fix just one seemingly simple bug, and I learned a lot about how to approach this process in doing so.

A framework for debugging

So when I'm trying to fix a bug in some code, I like to let myself be guided by three questions. How do I know what the bug is first? And then how do I fix it? And then how do I know when I'm done fixing it? And by answering these questions, I find I can keep myself a little bit on track and that there are some specific strategies that I can use in answering each.

So before you can go about fixing a bug, you have to know what the bug is, and the best way to figure that out is to start by isolating the problem. So at this point, you're not necessarily even trying to think about what the cause of the problem is, but rather just trying to pin down exactly when it happens. And the best way to convey this is through a minimal reproducible example or a reprex. A reprex should be the minimal code and data necessary to produce the problem that you're seeing.

And the minimal part is really important here, because if you have thousands of rows of data and lots of custom plot themes, there's a lot of distractions and it's hard to tell what's really relevant to the problem at hand. So paring it down can really help isolate the problem, and often that process itself helps you figure out what is causing it. Then the reproducible part is important so that you or someone else can know if you fixed it. If you're not sure when the problem occurs and you can't make it happen reliably, it's hard to know if it's really been solved.

So the reprex package by Jenny Bryan makes it really easy to produce these nice standalone copy-pasteable reprexes and it can even do things like embed figures into the output. And fortunately, the person who reported this issue provided a reprex of their problem, so they did this step for me of isolating the problem. So without any extraneous detail, they identified exactly when the problem occurs. When boxes have variable widths, they overlap. If they don't have variable widths, they don't. And I really like this reprex because every piece of code that is in it is necessary to producing this problem. They also gave an example of the case where the problem doesn't occur. When we have these uniform width boxes, you can see that they get moved side to side and they don't overlap. So that's what we're aiming for.
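A reprex in this spirit might look like the sketch below. It is not the code from the original issue: the data frame, its column names, and the group sizes are made up for illustration, and only ggplot2's documented geom_boxplot(varwidth = TRUE) argument is used. Boxplots in current ggplot2 releases default to the newer dodging described later in the talk, so the overlap is not expected to appear with a recent version.

```r
library(ggplot2)

# Hypothetical data: two groups per x category with very different sizes,
# so that varwidth = TRUE produces boxes of clearly different widths.
set.seed(1)
df <- data.frame(
  x = rep(c("a", "b"), each = 60),
  g = rep(c("g1", "g2", "g1", "g2"), times = c(50, 10, 10, 50)),
  y = rnorm(120)
)

# The case that works: uniform-width boxes, dodged side by side.
p_uniform <- ggplot(df, aes(x, y, fill = g)) +
  geom_boxplot()

# The case from the bug report: box width varies with the number of points.
p_varwidth <- ggplot(df, aes(x, y, fill = g)) +
  geom_boxplot(varwidth = TRUE)

# Wrapping the code above in reprex::reprex({ ... }) would render it,
# together with its output, as a snippet that can be pasted into an issue.
```

Every line contributes to the comparison: the same data and aesthetics, with the varwidth argument as the only difference between the two plots.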

Tracing the bug in ggplot2

So once you've isolated when the problem occurs, it's time to start digging into the cause. And to do that, you need to kind of follow the data through its journey to the point where the problem occurs. With plotting code in particular, this can be sort of challenging because a lot is going on within ggplot2 to turn a data frame that you provide into a plot. And at the time that I worked on this, I was not familiar with all of the steps that were happening, so I wasn't even sure really where to start looking.

But this was sort of my thought process. So looking at the code that was reported, I can see that in the buggy version of the plot, we get this warning message: position_dodge requires non-overlapping x intervals. And this is a little bit confusing to me because, like, yeah, the whole point is that these are overlapping and we don't want them to. Like, we're not trying to have overlapping x intervals here. But I can latch on to this position_dodge and sort of try to figure out what is happening around there that is probably contributing to this problem, since that's in the warning message.

So I'm going to try to find position_dodge and see if I can figure out what it's meant to do. And to do this, I need to search the ggplot2 code base, which I have a copy of locally. I could search within the folder of code and find it. But honestly, a lot of the time, I'm searching on GitHub, and so I'll type in position_dodge and find where it's located and see where I go from there.

So then I go through the function, reading it line by line, and seeing when the function that I'm on may be calling some other function that I don't know, and sort of repeating that process of just reading the code, reading the comments, hoping that I can make some sense of it. Hopefully the comments in code are clear enough that it's possible to do this. And in this case, I was able to figure out just by reading that there's sort of this pathway that the data goes through from position_dodge to capital-P PositionDodge to a function called collide and then pos_dodge. And it appeared that those last two were the ones kind of controlling the most about how these boxes were being placed.

So at a high level, what collide does is that it gets information about the boxes before they've been moved side to side. So when you have this overlap, it gets these overlapping boxes, looks for the overlap, and then sends all of the overlapping boxes to pos_dodge, which will shrink them down and move them side by side. And so the boxes sort of look like this figure when they arrive at collide, and then in groups, they get sent to pos_dodge. So one will go, and then the next will go. And pos_dodge is the one that moves them side to side.

Using debug to inspect the data

So that's helpful. But now I want to see, OK, what's wrong with the variable width boxes that's making this not happen? I want to explore what these functions are actually doing and look at what data they see so I can try to get more insight into this.

So I'm very regretful that I wrote R code for many years without using debug or trace or a lot of the really helpful tools for debugging that are built into R. So debug is one that I'll give an example of right now. And I use it all the time now because it lets you enter into the function that you're interested in and step through it and see all of the data that it's passed and see all of the transformations that it does as if you were in a regular R console.

So by calling debug on the collide function, I can enter into collide. And that's what this message here is saying, debugging in collide. And I can then view the data that's being passed to the data argument of collide. So this is roughly what the data looks like for a standard boxplot when all the boxes are the same size. I removed some columns from the data to simplify it. But there's basically one row per box that gives information about what x position it's centered at, and the start and end points of the boxes horizontally, the xmin and xmax. And then in the data that I removed, there's also the y position, the color and so forth. So if you had just an empty plot, this would be all the information that you would need to draw boxes on a boxplot.
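The mechanics of debug can be sketched on a small stand-in function. Here collide_toy and the boxes data frame are hypothetical; they only mimic the shape of data described above and are not taken from ggplot2.

```r
# Toy stand-in for an internal function we want to inspect (hypothetical).
collide_toy <- function(data) {
  data$width <- data$xmax - data$xmin
  data
}

# One row per box: where it is centered and where it starts and ends.
boxes <- data.frame(
  x    = c(1, 1),
  xmin = c(0.625, 0.625),
  xmax = c(1.375, 1.375)
)

debug(collide_toy)                  # flag the function for step-through debugging
flagged <- isdebugged(collide_toy)  # TRUE while the flag is set

# In an interactive session, calling collide_toy(boxes) would now open the
# browser: typing data prints the argument, n runs the next line,
# c continues to the end, and Q quits.

undebug(collide_toy)                # remove the flag when finished
result <- collide_toy(boxes)

# For a non-exported package function the equivalent would presumably be
# debug(ggplot2:::collide) followed by printing the plot. That name is an
# assumption about package internals, which can change between versions.
```

debugonce() is a convenient variant that clears the flag automatically after the first call.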

Now we go back to my first point of isolating the problem. Because we know that this bug happens when the boxes have variable widths, but not when they have uniform widths. So, using this debugging tool, I look at the data for both cases. And I can see that for the uniform width boxes, the xmin and xmax for two boxes at the same position are the same. Which makes sense. They're the same size. So they're going to extend the same amount around the place where they're centered. And for the variable width boxes, they're different. Which also makes sense, because, again, they're different sizes. So they're going to extend farther or less far around where they're centered.

So by comparing these data sets and viewing the code, I'm able to find the source of the bug. Because one of the things that collide does is, again, send these groups of boxes to pos_dodge to be dodged. And it is doing this on the basis of their xmin value. So it treats all boxes that have the same xmin as being at the same position. And so then they get sent to pos_dodge, and they get moved apart from one another. But these variable width boxes don't have the same xmin, so collide doesn't realize that they occupy the same position.
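The grouping problem can be made concrete with a minimal base R sketch. The numbers and the helper groups_by_xmin are hypothetical illustrations of the behaviour described in the talk, not ggplot2's actual code.

```r
# Two boxes at each of two x positions, all the same width.
uniform <- data.frame(
  x    = c(1, 1, 2, 2),
  xmin = c(0.625, 0.625, 1.625, 1.625),
  xmax = c(1.375, 1.375, 2.375, 2.375)
)

# The same positions, but each box has its own width around its center.
variable <- data.frame(
  x    = c(1, 1, 2, 2),
  xmin = c(0.625, 0.8, 1.8, 1.625),
  xmax = c(1.375, 1.2, 2.2, 2.375)
)

# Hypothetical helper: group row indices by xmin, mimicking the idea that
# boxes sharing an xmin are treated as occupying the same position.
groups_by_xmin <- function(boxes) {
  split(seq_len(nrow(boxes)), boxes$xmin)
}

n_uniform  <- length(groups_by_xmin(uniform))   # 2 groups of two boxes each
n_variable <- length(groups_by_xmin(variable))  # 4 groups of one box each
```

With uniform widths, each pair of boxes lands in one group and can be dodged together. With variable widths, every box ends up alone in its group, so nothing is moved aside.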

Attempting a fix

So now we know what the problem is. But fixing it is a whole other kettle of fish. And it may or may not be obvious, once you've found the source of the bug, what the fix ought to be. So depending on what it is, you may just have to get in there and start changing things. I made a lot of minor changes to the code and saved them with commit messages like, this doesn't fix position dodge, but it might be in the right direction, before I ever got to anything that was even a plausible solution. These are real commit messages from my git reflog that I dug up to put on this slide. And so I would go in there, make changes, run the reprex again, and see if anything was different or if somehow I had fixed the problem.

Since the main source of the problem was that collide was treating these boxes with different xmin values as not being at the same position, my first thinking was, well, could it use their x position instead? Because we have x as a column in the data, so maybe we can just swap that out and everything will be great. So that's roughly what I did here. And at first glance, it sort of works, but there's something wrong with this plot that maybe some of you with sharp eyes are spotting, and that's that the boxes are not in the right order. We have pink, then blue, then blue, then pink, then blue, then pink. So that's weird and not great.

And then there are some other problems that are not visible in that plot, but came up in other investigations. This doesn't work for continuous x axes, it only works for categorical ones for some reason. And then there's this problem where the approach to scaling the boxes is not accurate. So when the boxes get moved side to side, they need to be shrunk down in order to fit. And pos_dodge is getting them in groups and scaling them by however many are in the group. So if it gets three boxes, it'll shrink them down so that the three can fit side by side. If it gets one, then it may not need to really scale it down. And so you could end up in a situation where these variable widths that are supposed to convey meaning in the plot are not really accurate at all.

So this is a major problem. And one lesson here is that sometimes in trying to fix a bug, you'll end up introducing way worse bugs in the process. So that's why it's important to test many different scenarios. Isolating the problem helped me figure out the cause, but a solution needs to take into account the real world complexity of how the code is used in lots of different scenarios, lots of different types of data. Fortunately, ggplot2 has a lot of automated tests that can alert me if something that I'm changing causes unexpected changes elsewhere. And as I worked on different possible solutions, I also made sure to add more tests to make sure that things that I fixed didn't then get unfixed accidentally.

Rewriting the dodging algorithm

So I then from here kind of rewrote this dodging algorithm so that it could see all the data at once instead of getting these groups of boxes separately. That way, it could scale all the boxes by the same amount and be consistent in that way. And I made some other tweaks that allowed it to work with continuous x-axes. And with these new changes, things are looking pretty good. The original issue was fixed. We have box plots that work with both continuous and categorical x-axes.
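The difference between scaling each group on its own and scaling every box by one shared factor can be sketched in base R. The widths, the group_id column, and the simple divide-by-count rule are assumptions chosen to illustrate the idea; the real algorithm is more involved.

```r
# Three boxes: two share a position (group 1), one stands alone (group 2).
# Boxes 1 and 3 start with the same width, so they should end up equal too.
boxes <- data.frame(
  group_id = c(1, 1, 2),
  width    = c(0.6, 0.3, 0.6)
)

# Number of boxes in each box's group: 2, 2, 1.
n_in_group <- ave(boxes$width, boxes$group_id, FUN = length)

# Per-group scaling: each group shrinks by its own size, so the lone box
# keeps its full width while its equal-sized counterpart is halved.
boxes$per_group <- boxes$width / n_in_group

# Shared scaling: every box shrinks by the same factor, taken here from the
# largest group, so relative widths stay meaningful.
boxes$shared <- boxes$width / max(n_in_group)
```

Under per-group scaling, boxes 1 and 3 come out at 0.3 and 0.6 despite starting equal. With the shared factor, both come out at 0.3.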

And then once you have something that's working for this original example, it's worth seeing if there are ways that you can generalize it to other situations. In reviewing my work on this box plot algorithm, Hadley pointed out that with a few minor tweaks, we could generalize the algorithm to handle not just boxes, but also bars and arbitrary rectangles. And I realized that if we did this, we could solve or at least improve three other existing open issues. So we could really kind of kill a bunch of birds with one stone with this, or feed a bunch of birds with one scone.

So we did this. The main thing that needed to be changed in order for the dodging algorithm to work with arbitrary rectangles is that this approach where we're looking for the x position is not really going to work anymore. Boxes and bars would be centered around the same x, but for rectangles, you could have overlapping rectangles like this one in the bottom left plot here that aren't centered at the same position, but still overlap and would need to be moved from one another. So rather than basing my determination of what boxes overlap on their x position, I wrote a for loop that compares the starting point of one box or rectangle with the ending point of the previous one so that it can check and see if there's overlap in that way.
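The for loop idea can be sketched as follows. This is a simplified illustration written for this transcript rather than the code that was merged into ggplot2; the function name find_overlaps_toy is hypothetical, and it assumes the rectangles are already sorted by their starting point.

```r
# Assign a group number to each rectangle: a rectangle joins the previous
# rectangle's group unless it starts at or after the point where the
# previous one ends.
find_overlaps_toy <- function(xmin, xmax) {
  n <- length(xmin)
  if (n == 0) return(integer(0))

  group <- integer(n)
  group[1] <- 1L
  for (i in seq_len(n)[-1]) {
    if (xmin[i] >= xmax[i - 1]) {
      group[i] <- group[i - 1] + 1L  # no overlap: start a new group
    } else {
      group[i] <- group[i - 1]       # overlap: same group as the previous one
    }
  }
  group
}

# Rectangles that are not centered at the same x but still overlap.
xmin <- c(0, 0.5, 2, 2.2)
xmax <- c(1, 1.5, 3, 2.8)
groups <- find_overlaps_toy(xmin, xmax)
```

Here the first two rectangles share a group because the second starts before the first ends, and likewise for the last two, even though no two of them share a center or a starting point.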

And I should say that Hadley had suggested this for loop approach from the very beginning of my work on this pull request, and for some reason, I repeatedly ignored his suggestion, which in retrospect was extremely foolish. So another takeaway of this for me has been that if Hadley tells me I should write a for loop, I probably should write a for loop.

Knowing when to stop

But soon we had an algorithm that was working well for boxes, bars, arbitrary rectangles, and from here, it was sort of unclear, like, when do we stop? Some bugs require only simple fixes, but for more complicated ones, it can be hard to know when you're really done. This new dodging function seemed to be performing well, so again, we want to be testing a representative variety of different cases to ensure that it is, in fact, working before we stop. That's kind of at a minimum. But once it works, there still might be more that one could do to improve it, so you need to weigh the work that it would take to make it way, way, way better or a little bit better against how important it really is to make the thing that much better.

With this dodging algorithm, I had started to wonder whether it could, in fact, completely replace the original position_dodge. I was really hoping that it would. It would have been really personally satisfying to kind of wrap everything up with a nice neat little bow and be able to just completely replace the old position algorithm with one that could handle a few more situations. But alas, when I tried this out, I found that it still didn't work for violin plots and dot plots, which also sometimes get dodged in ggplot. So violins, which look like this on the left, sorry, Hillary, just become slanty lines with position_dodge2.

So after weeks of work, we decided to just wrap this up and call it good enough. We solved the original problem and many others besides, and while it would have felt satisfying to come up with one comprehensive dodging algorithm that could do everything, we figured that we could always come back and improve things later and it just wasn't worth the additional effort that it would take to make it completely, completely perfect. So we ended with a second dodging algorithm, position_dodge2, which is similar but slightly different to position_dodge but can do some things that position_dodge can't.

Now in telling the story of this pull request, I really smoothed over a lot of the finer details and the pain and suffering that went into getting something that actually worked in the end, but if you want to see more of the finer details of discussion, you're welcome to read this. I don't know why you would want to, but hopefully you just get some nice plots at the end.

And though I found this bug pretty challenging at the time, it's really helped me understand and fix other bugs that I've encountered by taking the strategy of isolating the problem with simple, clear examples, following the trails of the data through the code to find the problem, messing around, experimenting, breaking things and seeing what happens, then testing any fix in many different scenarios, and then when it makes sense for the problem to generalize those fixes to handle a variety of different possible cases, but always keeping in mind when the marginal rate of return is too small to be worth it.

So I'd like to thank GitHub user mcall who originally reported this bug, which I don't know if there's any chance this person is here at the conference, but if so, I would love to meet you, and Hadley for reviewing the pull request and folks for giving me feedback on this talk. Thank you.

We have time for one question while we get Amelia up here.

Don't ask me why the violin plots have become slanty lines, because I don't really know.

The question was, how long did this take from start to finish?

It was, I think, about three weeks. I worked on a few other things at the time, but it was a major presence in my life for about three weeks, I think. I don't know. I could look at the pull request to be sure, but that sounds about right.