Kara Woo | Boxplots: a case study in debugging and perseverance

Transcript

This transcript was generated automatically and may contain errors.

Alright, thank you everyone for being here, and bless you for choosing to come to a talk about boxplots.

And so I'd like to tell you a story about a bug in ggplot2's boxplots, but this talk is not really about boxplots, because it's really about how to go about understanding and fixing bugs when you're faced with a problem that you don't know how to solve. And so I'm hoping that the strategy that I'll share that helped me solve this particular bug will be one that you can take and maybe use in your own coding work, whether that's working on R packages, or data analysis code, or any other situation that you might find yourself in.

So a while back, someone opened this issue on the GitHub repo for ggplot2. They said, when I try to produce boxplots with colors depending on a categorical variable, these appear overlapping if varwidth is set to TRUE. So if you're not familiar with boxplots in ggplot, varwidth is an option that lets the boxes vary in width depending on the number of points they represent, which can be really useful if you have very imbalanced classes in your data and you want to show that some of these boxes are representing fewer points than others.
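To make the option concrete: ggplot2's documentation describes this setting as drawing boxes with widths proportional to the square roots of the number of observations in each group. The following is a minimal sketch of that rule in Python, not ggplot2's R code; the function name `relative_box_widths` and the `max_width` parameter are made up for illustration.

```python
import math


def relative_box_widths(counts, max_width=0.9):
    """Return one box width per group, proportional to sqrt(n).

    The largest group gets `max_width`; smaller groups get narrower boxes.
    `max_width` is an illustrative default, not a ggplot2 setting.
    """
    roots = [math.sqrt(n) for n in counts]
    largest = max(roots)
    return [max_width * r / largest for r in roots]


# Three groups with very imbalanced sizes.
widths = relative_box_widths([100, 25, 4])
```

Because the rule uses square roots, a group with a quarter of the points gets a box half as wide, not a quarter as wide.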

So I was working as a ggplot2 intern at the time, and I was mainly supposed to be focusing on other aspects of the code base, but I saw this issue come in, and I thought, this should be straightforward to fix. Like, I'll just knock this one out really quick.

But we've got about 18 minutes of this talk slot left, so you can probably guess that that isn't how things actually turned out. In fact, it was a way more complicated bug than I had really anticipated, and I ended up having to write an entirely new algorithm to place boxes in boxplots, and working on this was probably the deepest that I had ever gone into trying to fix just one seemingly simple bug, and I learned a lot about how to approach this process in doing so.

And one lesson here is that sometimes in trying to fix a bug, you'll end up introducing way worse bugs in the process.

Rewriting the dodging algorithm

So from here I kind of rewrote this dodging algorithm so that it could see all the data at once instead of getting these groups of boxes separately. That way, it could scale all the boxes by the same amount and be consistent in that way. And I made some other tweaks that allowed it to work with continuous x-axes. And with these new changes, things were looking pretty good. The original issue was fixed. We had boxplots that work with both continuous and categorical x-axes.
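The transcript does not spell out the rewritten algorithm at this point, so the following is only a sketch of the general idea, written in Python rather than ggplot2's R code. It assumes that "scaling by the same amount" means dividing every box's width by the largest number of boxes that share any one position. That reading, and the names `scale_per_group` and `scale_all_at_once`, are assumptions for illustration.

```python
def scale_per_group(groups):
    """Shrink each group on its own, using only that group's size."""
    return [[w / len(group) for w in group] for group in groups]


def scale_all_at_once(groups):
    """Shrink every box by one shared factor, set by the most crowded group."""
    most_crowded = max(len(group) for group in groups)
    return [[w / most_crowded for w in group] for group in groups]


# Three x positions holding 3, 1, and 2 boxes, each 0.9 wide before dodging.
groups = [[0.9, 0.9, 0.9], [0.9], [0.9, 0.9]]
per_group = scale_per_group(groups)      # the lone box keeps its full width
all_at_once = scale_all_at_once(groups)  # every box is shrunk by the same factor
```

With per-group scaling, boxes that started out the same width end up with different widths depending on how many neighbours they have. A shared factor keeps their relative widths intact, which matters when the widths themselves carry information.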

And then once you have something that's working for this original example, it's worth seeing if there are ways that you can generalize it to other situations. In reviewing my work on this box plot algorithm, Hadley pointed out that with a few minor tweaks, we could generalize the algorithm to handle not just boxes, but also bars and arbitrary rectangles. And I realized that if we did this, we could solve or at least improve three other existing open issues. So we could really kind of kill a bunch of birds with one stone with this, or feed a bunch of birds with one scone.

So we did this. The main thing that needed to be changed in order for the dodging algorithm to work with arbitrary rectangles is that this approach where we're looking for the x position is not really going to work anymore. Boxes and bars would be centered around the same x, but for rectangles, you could have overlapping rectangles like this one in the bottom left plot here that aren't centered at the same position, but still overlap and would need to be moved from one another. So rather than basing my determination of what boxes overlap on their x position, I wrote a for loop that compares the starting point of one box or rectangle with the ending point of the previous one so that it can check and see if there's overlap in that way.
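The loop described above can be sketched as follows. This is a Python illustration of the idea, not the R code in ggplot2. The function name `find_overlap_groups`, the `(xmin, xmax)` tuple representation, the sort by starting point, and the choice to treat rectangles that merely touch as non-overlapping are all assumptions made for the example.

```python
def find_overlap_groups(rects):
    """Assign a group id to each (xmin, xmax) rectangle.

    Rectangles are visited in order of xmin. A rectangle joins the current
    group if it starts before the previous rectangle ends; otherwise it
    starts a new group.
    """
    order = sorted(range(len(rects)), key=lambda i: rects[i][0])
    ids = [0] * len(rects)
    group = 0
    for pos, i in enumerate(order):
        if pos > 0:
            prev = order[pos - 1]
            # Compare this start with the previous end; no shared x needed.
            if rects[i][0] >= rects[prev][1]:
                group += 1
        ids[i] = group
    return ids


# The first two rectangles overlap without sharing a centre; the last two
# are centred at the same x, like boxes or bars.
rects = [(0.0, 1.0), (0.5, 1.5), (2.0, 3.0), (2.0, 3.0)]
group_ids = find_overlap_groups(rects)  # [0, 0, 1, 1]
```

Rectangles that end up with the same group id are the ones that would need to be moved apart from one another.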

And I should say that Hadley had suggested this for loop approach from the very beginning of my work on this pull request, and for some reason, I repeatedly ignored his suggestion, which in retrospect was extremely foolish. So another takeaway of this for me has been that if Hadley tells me I should write a for loop, I probably should write a for loop.

Knowing when to stop

But soon we had an algorithm that was working well for boxes, bars, arbitrary rectangles, and from here, it was sort of unclear, like, when do we stop? Some bugs require only simple fixes, but for more complicated ones, it can be hard to know when you're really done. This new dodging function seemed to be performing well, so again, we want to be testing a representative variety of different cases to ensure that it is, in fact, working before we stop. That's kind of at a minimum. But once it works, there still might be more that one could do to improve it, so you need to weigh the work that it would take to make it way, way, way better or a little bit better against how important it really is to make the thing that much better.

With this dodging algorithm, I had started to wonder whether it could, in fact, completely replace the original position dodge. I was really hoping that it would. It would have been really personally satisfying to kind of wrap everything up with a nice neat little bow and be able to just completely replace the old position algorithm with one that could handle a few more situations. But alas, when I tried this out, I found that it still didn't work for violin plots and dot plots, which also sometimes get dodged in ggplot. So violins, which look like this on the left, sorry, Hillary, just become slanty lines with position dodge 2.

So after weeks of work, we decided to just wrap this up and call it good enough. We solved the original problem and many others besides, and while it would have felt satisfying to come up with one comprehensive dodging algorithm that could do everything, we figured that we could always come back and improve things later and it just wasn't worth the additional effort that it would take to make it completely, completely perfect. So we ended with a second dodging algorithm, position dodge 2, which is similar but slightly different to position dodge but can do some things that position dodge can't.

Now in telling the story of this pull request, I really smoothed over a lot of the finer details and the pain and suffering that went into getting something that actually worked in the end, but if you want to see more of the finer details of discussion, you're welcome to read this. I don't know why you would want to, but hopefully you just get some nice plots at the end.

And though I found this bug pretty challenging at the time, it's really helped me understand and fix other bugs that I've encountered by taking the strategy of isolating the problem with simple, clear examples, following the trails of the data through the code to find the problem, messing around, experimenting, breaking things and seeing what happens, then testing any fix in many different scenarios, and then, when it makes sense for the problem, generalizing those fixes to handle a variety of different possible cases, but always keeping in mind when the marginal rate of return is too small to be worth it.

So I'd like to thank GitHub user mcall who originally reported this bug, which I don't know if there's any chance this person is here at the conference, but if so, I would love to meet you, and Hadley for reviewing the pull request and folks for giving me feedback on this talk. Thank you.

We have time for one question while we get Amelia up here.

Don't ask me why the violin plots have become slanty lines, because I don't really know.

The question was, how long did this take from start to finish?

It was, I think, about three weeks, and I worked on a few other things at the time, but it was a major presence in my life for about three weeks, I think. I don't know. I could look at the pull request to be sure, but that sounds about right.

Kara Woo | Boxplots: a case study in debugging and perseverance | RStudio (2019)

Featured software

ggplot2

rstudio