Thomas Lin Pedersen - API-first package design — and learning patchwork in the process

Transcript

This transcript was generated automatically and may contain errors.

Hi everyone, thanks for joining me. As said, my name is Thomas Lin Pedersen. I'm a software engineer at Posit and I'm super happy to talk to you today about patchwork. Now, patchwork is not a particularly new package, it's been around for some years, but it's a very important part of the whole ggplot2 equation as it allows you to do very intuitive graphics composition in R.

Now, what do I mean by graphics composition? Let's just jump straight into it. So, code on the second slide. We're going fast. This is not super important, it's just to say that we have some ggplot2 code, it creates a plot, and that plot is saved in a variable called p1. Likewise, nothing surprising happens here, we have a second plot saved in a variable called p2.

Now, when we were starting to prepare for these talks, Posit provided us with some guidance from another company on how to make effective and good talks, and one of the key points when it came to showing code was that code should be short when shown on a slide, it should be big, and it should be very easy to understand. And I think this certainly fits the bill, but the next slide, which shows patchwork in action, I think does an even better job at living up to these guidelines.

So, here we go. So, this is patchwork code. I think it's pretty big, pretty easy to understand. And what patchwork is doing is that it is taking this plus operator that we all know from ggplot2, where it's used to gradually build up a ggplot2 plot, and it augments it to work together with multiple plots, so that it allows you to compose plots together. And that's all there is to it.
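The slide code itself is not part of the transcript, so the following is only a minimal sketch of what is being described. The two plots are hypothetical stand-ins built from the built-in mtcars data (an assumption, not the speaker's data); only the final `p1 + p2` line mirrors what the talk refers to.

```r
library(ggplot2)
library(patchwork)

# Hypothetical stand-ins for the two plots on the slides (data and aesthetics are assumptions)
p1 <- ggplot(mtcars, aes(x = disp, y = mpg)) +
  geom_point() +
  ggtitle("Plot 1")

p2 <- ggplot(mtcars, aes(x = factor(gear), y = disp)) +
  geom_boxplot() +
  ggtitle("Plot 2")

# With patchwork loaded, `+` between two complete plots composes them
# side by side instead of adding a layer to the first plot
combined <- p1 + p2
combined
```

Printing `combined` draws both plots in one graphic, with their panels and titles aligned.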

Now, what does it do when we run it? This is the result. As you can see, patchwork doesn't just squash the two plots together, it's doing a bit more work underneath. So, patchwork is very concerned with aligning everything, because one of its main uses is to create a cohesive single visualization containing multiple plots. So, it needs to make sure that everything is aligned and seems like it's fitting together. So, you can see in the bottom area of the plot area, everything is aligned, and in the top of the plot area, it's aligned, the two titles are aligned, and so on and so forth.

So, that's pretty simple, right? And that's kind of a problem, because it's not so funny to stand up here talking about a plus operator for the next 20 minutes. So, let's start over.

My name is Thomas Lin Pedersen. I'm here to talk to you about API-first package design. I know this is a dataviz track, and I'm not going to completely sidetrack you from that. So, as a kind of silver lining, I'm going to teach you a bit of patchwork in the process.

What API-first package design means

So, what do I mean by API-first package design? APIs are everywhere. This is how you interact with software as a user and also as a package developer. API-first, in my mind, means that there is a coherent idea behind the API in the package, and that coherent idea drives all the decisions during development and during maintenance of the package.

The reason why I'm deciding to talk to you about this instead of just making a regular patchwork talk is that it's very prevalent in how I think about patchwork, and even more so, I think it's a subject that is extremely important when it comes to designing visualization software. The reason why I think that is that visualization and plotting is a bit of a specific or a special subject within programming, because it deals with what I would call an unbounded feature complexity, because plotting and data visualization are inherently creative. It's a way of taking all of your artistic ideas and putting them down on paper or on the screen, and this means that more or less everything that you can dream up is fair game, right? You might not be able to do it through a plotting library, but there is no upper bound on what you might want to do when it comes to data visualization, and this kind of unbounded feature complexity is something that is best tamed through a good sense of API design.

And I think that we have seen with ggplot2, for instance, that this can be a fantastic way to gain success. So the rest of the talk, rather than just talking about patchwork and how it works, I will focus on three small nuggets of wisdom, perhaps, that I have picked up during work on both ggplot2, but certainly also patchwork, and all of these have definitely applied to when I was thinking about patchwork and developing patchwork and maintaining patchwork.

Having a mission

So the first thing I'm going to talk about is having a mission. Now, I think everyone who sits down and develops a package has a mission, but when I say having a mission, I mean having a mission for how things work, how the user interfaces with the package, rather than just saying, I need to solve this and to hell with the rest. The reason why it's so important to have a mission is that coherence just doesn't come organically. You can't expect yourself to sit down, begin to code, and out of it just grows this perfect little nugget of code that is internally beautifully composed. No, you need to think about this, and you better think about it up front, rather than think about it as you're coding and having to go back and kind of get everything to work together.

And I think in our team, in the tidyverse team, we have been pretty good at that. Like, we have certainly made decisions that we have backtracked on, we have certainly broken things, because we have learned from our mistakes, but in general, I think we are pretty good at having a mission and having an idea about how we want the user to work with the code and integrate with our code. And I really think this is not just superficial. I think, and I think history kind of agrees with me, that the best user experience comes from this. It comes from having a theoretical backbone behind what you're doing. You can see this in ggplot2, which is based on the grammar of graphics, which is a whole book that defines the theory for how to think about data visualization. You can certainly also see it in dplyr, even though there's no fancy book that supports it.

But everything that these packages do is based on a very well-defined theoretical backbone. And what it does is that it allows the user to not have to learn everything there is to learn. What they just need to do is understand this theoretical backbone, and they can usually use that to extrapolate what they know into things that they want to do, without sitting down and learning everything from scratch. And the thing with having a mission, why it's so good, is also that missions are contagious. I've lost count of how many people have tried to take what ggplot2 does and apply it to some other programming language. And the same with dplyr. And I'm not saying this to point fingers and say, oh, you're stealing from us. This is our idea. Not at all.

Actually, this just goes to show that once you have a great idea, suddenly it becomes impossible to think about solving this problem without thinking about this mission that someone else laid out. So missions are contagious. And we see this with ggplot2. We also see this with patchwork, actually. Because not, I think, a year or two after patchwork came out, suddenly matplotlib allowed you to use a plus operator. And, again, this is not to point fingers. It's just that a good idea is contagious. And you actually want this because suddenly you are in the center of defining how to work with this problem space that you're deeply interested in.

In patchwork, in the readme of patchwork, this is the first paragraph. And this is the paragraph I wrote almost as the first thing when I began developing it. The goal of patchwork is to make it ridiculously simple to combine separate ggplots into the same graphic. As such, it tries to solve the same problem as gridExtra::grid.arrange() and cowplot::plot_grid, but using an API that incites exploration and iteration, and scales to arbitrarily complex layouts. So I wrote this, and I have basically kept to that promise ever since during the last six years of development and maintenance. And there's a surprising amount of this mission statement that's about the how, the API. Not what it does, but how it does it.

And quite soon within development, the first part of development of patchwork was simply to say, can I add two ggplots together? And I figured out, hey, I could. And soon after that, I had to think about, well, what if I want to do something more? Because this plus operator is so simple, how can I augment it to allow for more complex layouts without really removing the simplicity of this first idea that I had?

So the second thing that I began working on is something that is not the plus operator, but is very, very akin to it: other mathematical and programming operators that very quickly allowed me to scale what the API could do without really doing anything to make it more complex. Now, what this does, the first time you look at it, you might not really see it, but very soon you'll begin to appreciate that this is actually a visual representation of the composition that I'm trying to achieve. What it says is, take the plot stored as p1, put it on top of p3, and then take these two and put them side by side with p2.
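The description above corresponds to patchwork's `/` (stack) and `|` (place beside) operators. As a rough sketch, with three hypothetical plots built from mtcars standing in for the ones on the slide (an assumption), the composition could look like this:

```r
library(ggplot2)
library(patchwork)

# Hypothetical plots; the data and aesthetics are assumptions for illustration
p1 <- ggplot(mtcars, aes(x = disp, y = mpg)) + geom_point()
p2 <- ggplot(mtcars, aes(x = factor(gear), y = disp)) + geom_boxplot()
p3 <- ggplot(mtcars, aes(x = hp, y = qsec)) + geom_point()

# `/` stacks p1 on top of p3; `|` then places that stack beside p2.
# The parentheses make the nesting explicit.
nested <- (p1 / p3) | p2
nested
```

The shape of the expression reads like the layout it produces, which is the visual representation the talk refers to.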

So once you've understood that this is a visual representation and how it works, it's extremely simple to actually begin to build a very, very complex plot. And running this code, this is what you get. Again, it's not overly complex, this layout, but you can see how it can scale very, very quickly as you begin to nest layouts within each other. And you can see that patchwork, even though you have different levels of nesting in the layout, is able to keep its promise of aligning everything together and making everything feel like a complete whole. And this is without adding any additional complexity to the API.

Embracing the no

The second thing, and I think this is one of the main points that I want to drive through, when building packages with an API-first idea, is this thing of embracing the no. This seems like a very, very negative outlook, right? Just say no to everything. Why do I think this is so important? Well, I think it's important because as you begin to develop a package, maintenance will be a constant battle against API impurity. Unless you release a package and just immediately say, this is a feature freeze, nothing new will happen. Maybe I'll squash a bug or two. Then from now on, you will begin to battle against deterioration of the API. And this is simply, I would say, a rule of the universe. This is tied into, like, entropy will continue to increase unless you add in additional energy.

And the best way, one of the best ways, at least one of the easiest ways to spend this energy is just by saying no. Now, this seems like a simple idea. Saying no is not that hard, right? It is surprisingly hard. The reason why it's so hard is that once you've created something of, I would say, beauty, hopefully you feel like you're creating stuff of beauty, you put it out in the world, people will begin to use it. I mean, that's the whole idea of it. And you will have many users, you will have great users, clever users, smart people, and they will have ideas. And the thing with it, most of these ideas are great. And it's super hard to say no to great ideas. But still, your users have their own problems to solve. And their ideas might be great. But their idea and their main point of focus is not the purity of your API. So it's up to you to say, no, I can't do that. This is a great idea. I'm not going to solve it for you.

And I've certainly done that a bunch of times. This is perhaps the situation that I've been confronted with the most. This is just one of many, many issues that have been raised around this particular design issue. This user is creating a plot or a plot composition. And one of the plots has extremely long axis titles or axis text. And what he rightfully observed is that this does not look good. There's a bunch of dead space in the upper plot. And that y title is just dangling out there because it needs to be aligned with the other. Because patchwork will align everything. So it's an obvious, good idea to do something about it. He asked, well, could we not make it so that the upper plot is not aligned all the way with the other plot? Let it stretch out, take up the full space so that it eliminates all of the dead space. And what did I say? I said no.

I said no repeatedly to this issue, to many other issues. And it's not that I don't empathize with what he's coming with. The reason why I said no was that at that point in time, at least, I could not think of a way to solve this that did not add additional, like, cruft to the API. And I really did not want that. The other reason why I said no was that in my heart, I felt that this was kind of a red herring. Because I don't think that what he was asking for, like a stretch of the original plot, was really a good solution. I don't think it resulted in a beautiful plot composition. What I really felt was that he needed to reframe how he wanted to combine these two plots in a way that made it look good. So I felt certain that I didn't want to, like, imperil my API by providing a solution that I didn't really believe in, right? So saying no, saying no often, is a good idea. Now, this is not the end of the story. Because in the end, I said yes.

Why did I say yes? Does this really negate everything I've told you? No, not at all. So the reason why I eventually pivoted around to a solution was that I figured out how to solve it without, like, destroying the purity of my API. But that's not the only thing. The other thing was that I came to realize that while I still feel like this particular example would not result in a great composition if he had used the solution I came up with, I saw that there were other situations where a solution might be needed.

So let's return to the first plot we created. Because this is kind of the same situation, right? We have a bunch of dead space. The first plot has a legend in the bottom, and the alignment forces the second plot to be raised up. There's a bunch of dead space underneath it. But also between the title of the first plot and the plotting area because of the strips in the second plot. Now, remember the complex code we needed to create this plot, p1 + p2. And what I kind of figured out, and it might seem very obvious now when you see the solution, is to just have a small utility function that says p1 should not think about alignment at all. And you can see that, well, it does complicate the code a bit, but it doesn't change how you use patchwork when you're not thinking about freeing alignment. It's still p1 + p2, right? And what does it do? Well, as you can see, arguably a better plot for this particular situation.
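The transcript does not name the utility function; in recent patchwork releases the function that frees a plot from alignment is `free()`. The sketch below assumes such a release and uses hypothetical mtcars plots that recreate the situation described (a bottom legend on the first plot, facet strips on the second), not the plots from the slides.

```r
library(ggplot2)
library(patchwork)

# Hypothetical plots that recreate the dead-space situation (assumed data)
p1 <- ggplot(mtcars, aes(x = disp, y = mpg, colour = factor(cyl))) +
  geom_point() +
  ggtitle("Plot 1") +
  theme(legend.position = "bottom")

p2 <- ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point() +
  facet_wrap(vars(gear)) +
  ggtitle("Plot 2")

aligned <- p1 + p2      # default: panels, titles and axes are aligned
freed <- free(p1) + p2  # p1 opts out of alignment and fills its own area
freed
```

Code that never calls `free()` is unaffected, which is the point made above: the basic `p1 + p2` usage stays exactly as it was.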

So say no, say no often. But I think my main point is actually that the reason why you should say no is that you can never backtrack on a feature, or at least you might be able to backtrack, but it's a bunch of work and it's really annoying. Because when you have a feature, people will begin to use it, and taking that away from them is certainly not going to win you any popularity contest. But it's extremely easy to backtrack on a no. All it takes is saying yes. So it's better to say no and keep to that until you convince yourself that you actually have a good solution. Then you're good to say yes.

Powerful simplicity

I'm going to end with what I call powerful simplicity. The reason why I'm going to just briefly touch on this is that it may seem after all this talk that maintaining the purity of an API is essentially just keeping features out of the way and dumbing down what the package can do. And I don't agree that this is true. And I think if you think about it for two seconds, you will also come to the same realization. Like ggplot2, while maybe not having a completely simple API, still has a very, very strong theoretically based API and is certainly not simple in what it can do. The same goes for dplyr and the same actually goes for the whole tidyverse because it shares a lot of ideas across packages. So a well-designed API can allow growth both in terms of what it can do and which advanced features it can put in. And one of the ways that you can do this is through composability, which is something that you know both from ggplot2 but also from dplyr.

Just to quickly show how this comes into play in patchwork, we have two new plots, actually some of the same plots, but I've just added some colors to them. And as you can see, what it does is actually exactly what patchwork should do. It composes them together. But this result does not look good. It feels like two separate plots. And this is because we have some repeated elements and such. So we want to get rid of these. We could do this by adjusting the ggplot2 code. For sure, we can remove the legend in the middle and so on. But why not just let patchwork do that by adding this additional composable setting function, plot_layout(). And we can just say guides, collect all of those, axis titles, collect all of those. And what it does is very magical, actually. So it takes all guides and compares them and says, well, these two look exactly the same. We only show one of them. And it takes all of the titles. These are redundant. Remove them.
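As a hedged sketch of the call being described, assuming a recent patchwork release in which `plot_layout()` accepts both `guides` and `axis_titles`: the two plots here are hypothetical mtcars examples that share a colour legend and an x-axis title, not the plots from the slides.

```r
library(ggplot2)
library(patchwork)

# Hypothetical plots sharing a colour legend and an x-axis title (assumed data)
p1 <- ggplot(mtcars, aes(x = disp, y = mpg, colour = factor(cyl))) +
  geom_point() +
  labs(colour = "Cylinders")

p2 <- ggplot(mtcars, aes(x = disp, y = hp, colour = factor(cyl))) +
  geom_point() +
  labs(colour = "Cylinders")

# Collect identical legends into one and drop the duplicated axis title
combined <- p1 + p2 +
  plot_layout(guides = "collect", axis_titles = "collect")
combined
```

The layout settings are added with `+` like any other component, so the composition itself is still just `p1 + p2`.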

And I think my time is running out. So I hope I have inspired you to think about APIs. I hope I have given you a slew of ideas for using patchwork. And thank you for listening.