Data Visualization with Seaborn - posit::conf(2023)

Transcript

This transcript was generated automatically and may contain errors.

Hi, I'm Michael Waskom. I'm an ML engineer at Flatiron Health, which is a company in New York City that uses technology to accelerate cancer research and improve cancer care. But I'm here today to talk about Seaborn, a Python library that I'm the creator and maintainer of.

So if you're not familiar, Seaborn is a Python data visualization library that's based on the Matplotlib framework. So Matplotlib, very general library, but relatively low-level. Seaborn offers a higher-level interface that is specifically geared towards creating statistical graphics.

This talk is going to be about what I've learned over the past decade or so of Seaborn development and how that's informed the current and future directions of the library.

Origins and early development of Seaborn

Seaborn started out with pretty humble ambitions. It was essentially a toolbox or an extension to Matplotlib that offered a few plotting functions with statistical capabilities, doing things like fitting a kernel density estimate to a histogram or a linear regression line to a scatter plot.

And these functions behaved a lot like the rest of Matplotlib. They primarily operated on NumPy arrays, because that was the main data structure at the time.

But this was around when pandas was starting to get popular and bringing the idea of a labeled columnar data structure to Python. And so some of the early Seaborn functions built on that and borrowed some ideas from the R community about tidy data, using data frames at first really to make faceted or paired small multiple plots with features like conditional color mapping, where the color mapping was really thought of as a faceting operation on the third dimension of the plot.

And as Seaborn developed, I was primarily concerned with filling gaps in Matplotlib's feature set, especially features that were important for emerging data science applications. So Matplotlib's roots are in the physical sciences, and it hasn't always served analytics and machine learning use cases as well as it could.

So for instance, one of the first major distinguishing features of Seaborn was its support for categorical data and the kinds of plots that are optimized for categorical data.

And eventually, this landed us in a kind of odd situation where Seaborn had support for some fairly complex graphics, things like regression or violin plots, but not simple ones, like just basic line or scatter plots. And remember, the initial intention or thought was add on to what Matplotlib doesn't have, and Matplotlib already had those things.

But it didn't have this tidy API, an API where you could just pass in a data frame and name some columns and have those automatically mapped to components of the plot. So almost, I think, five years or so after the first release of Seaborn, it finally gained such basics as a line plot and a scatter plot function.

But not just in their simplest forms, in forms that had support for mapping now up to three additional dimensions based on the columns in a data frame, including color and size and the style of the elements in the plot.
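As a rough sketch of the kind of call described here (not the code from the talk; the data and column names are made up for illustration), the classic `scatterplot` function takes a data frame and maps up to three extra columns through its `hue`, `size`, and `style` parameters:

```python
import pandas as pd
import seaborn as sns
from matplotlib.figure import Figure

# Made-up data; the column names are illustrative, not from the talk.
df = pd.DataFrame({
    "total_bill": [10.3, 21.0, 16.5, 24.6, 8.8, 30.1],
    "tip": [1.7, 3.5, 2.5, 4.0, 1.5, 5.0],
    "day": ["Thu", "Fri", "Thu", "Fri", "Thu", "Fri"],
    "party_size": [2, 3, 2, 4, 1, 4],
    "smoker": ["no", "no", "yes", "yes", "no", "yes"],
})

# Draw onto an explicit Axes so the sketch does not depend on a GUI backend.
ax = Figure().subplots()

# x and y plus three further dimensions, each named by column.
sns.scatterplot(
    data=df, x="total_bill", y="tip",
    hue="day",          # mapped to color
    size="party_size",  # mapped to marker size
    style="smoker",     # mapped to marker shape
    ax=ax,
)
```

The same `hue`, `size`, and `style` parameters are accepted by `lineplot`.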

And so Seaborn's objective remained "help people make plots faster so that they can understand their data better," but its focus and how it did that shifted from "do their statistics for them" to "give them an intuitive interface that lets them rapidly iterate and think about their data, not the mechanics of specifying a graphic."

And so that prompted me to revisit even some of the oldest functions in the library and to try to impose some API consistency onto these tools that had grown rather organically and at the start without much intentional thought about API design.

The classic Seaborn API: strengths and limitations

So despite this somewhat haphazard and organic development, the result, the kind of status quo is a library that can be used to produce a relatively wide range of informative statistical graphics with a reasonably coherent API.

And the API encourages a mental model where you might say, I want to make a line plot, and then there's a function called lineplot that you can use to do that. Or there's a function called barplot, and you can probably guess what that does. So one objective of Seaborn is that each of these functions should be able to produce a usable plot with a bare minimum set of arguments, really just enough to specify the data.

They make some inferences about what you want based on the data types, and they try to use sensible defaults, and they produce complete graphics. They add labels and legends, so you can very quickly get from an idea or a question you have to a plot that you could at least show to one of your colleagues, and they'd understand what it is.
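A minimal sketch of that "just enough to specify the data" idea, using a small made-up data frame (the column names are hypothetical):

```python
import pandas as pd
import seaborn as sns
from matplotlib.figure import Figure

# Made-up data with two observations per category.
df = pd.DataFrame({
    "day": ["Thu", "Thu", "Fri", "Fri"],
    "tip": [2.0, 3.0, 3.5, 4.5],
})

ax = Figure().subplots()

# Only the data and two columns are specified; the function aggregates the
# observations per category and labels the axes from the column names.
sns.barplot(data=df, x="day", y="tip", ax=ax)

print(ax.get_xlabel(), ax.get_ylabel())
```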

But customization is important. Good customization requires rapid iteration. And so each function exposes a number of additional parameters that afford varying kinds of control. So these range from specifying the mappings from your dataset to the visual properties in the plot, to setting hyperparameters of whatever statistical operations that function's going to perform, to tweaking the plot's final appearance.

Now, it's not quite true that you can customize everything just through the parameters of the Seaborn functions. Recall that it started with the idea of, you know, like Matplotlib, but with some statistics, and so the functions were meant to fit into the standard Matplotlib workflow, which means that while Seaborn will add default labels to the plot, if you want to customize them, you'd use Matplotlib's API for that.

Still, you can get most of the way there by interacting just with the Seaborn API, and that's helped, you know, a lot of people make a lot of useful plots. But I've also come to appreciate, you know, over the years, some of the limitations that this design imposes.

So one obvious consequence is that each of these functions has a lot of parameters. And any one function's parameters do a lot of different kinds of things. So on the one hand, you know, it's straightforward to explain how Seaborn works. You're just specifying parameters. You find a function that you think is going to do what you want, and you look at its list of parameters to see what options it has.

But with so many parameters, and just in one, you know, linear list, it can be hard to know, you know, what bits and pieces you're going to need to solve any specific problem. And then because Seaborn is a higher level interface to Matplotlib, some of the parameters that you need to set might actually be Matplotlib parameters. And even just within Seaborn, some functions are at a higher level than others. And so the API relies heavily on keyword argument passing, which means to understand what parameters a function takes, you might need to walk through several different pages of documentation across multiple project websites.
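A small sketch of the keyword passing described here, assuming a recent Seaborn version and made-up data. `scatterplot` documents its extra keyword arguments as being passed to `matplotlib.axes.Axes.scatter`, so options such as `zorder` are described in Matplotlib's documentation rather than Seaborn's:

```python
import pandas as pd
import seaborn as sns
from matplotlib.figure import Figure

# Made-up data; the column names are illustrative only.
df = pd.DataFrame({
    "total_bill": [10.3, 21.0, 16.5, 24.6],
    "tip": [1.7, 3.5, 2.5, 4.0],
})

ax = Figure().subplots()

# alpha and zorder are passed along to Matplotlib's Axes.scatter.
sns.scatterplot(data=df, x="total_bill", y="tip", alpha=0.5, zorder=3, ax=ax)

# The resulting Matplotlib artist reflects the forwarded settings.
points = ax.collections[0]
print(points.get_alpha(), points.get_zorder())
```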

And the, you know, organic development of the library, with its scope expanding over time, has led to some other opportunities for friction or confusion. So take, you know, as a case study, the stripplot and the scatterplot functions. What they create are both fundamentally scatterplots. But as I mentioned earlier, the function called stripplot actually came first, because it's a scatterplot that's optimized for categorical data. It's meant as an alternative or a complement to a box or violin plot. So it has capabilities like dodging and jittering that make sense in that context.

And then the scatterplot function came later, producing plots that are in one sense more basic, but on the other hand have more options for mapping properties based on data. They allow you to control the size of the dots and the shape of the markers, whereas stripplot only controls the color. So you know, you might find yourself starting to make a plot using one of these functions, realizing that it can't do everything you want, switching to the other function, and then realizing you've lost some of the features that you were relying on.
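The trade-off can be sketched side by side. This is an illustration with made-up data, not an example from the talk:

```python
import pandas as pd
import seaborn as sns
from matplotlib.figure import Figure

# Made-up data; the column names are illustrative only.
df = pd.DataFrame({
    "total_bill": [10.3, 21.0, 16.5, 24.6, 8.8, 30.1],
    "tip": [1.7, 3.5, 2.5, 4.0, 1.5, 5.0],
    "day": ["Thu", "Fri", "Thu", "Fri", "Thu", "Fri"],
    "party_size": [2, 3, 2, 4, 1, 4],
    "smoker": ["no", "no", "yes", "yes", "no", "yes"],
})

fig = Figure(figsize=(8, 3))
ax_strip, ax_scatter = fig.subplots(1, 2)

# stripplot: categorical x axis with jittering and dodging by hue,
# but color (hue) is the only data-driven visual property.
sns.stripplot(
    data=df, x="day", y="tip", hue="smoker",
    dodge=True, jitter=0.2, ax=ax_strip,
)

# scatterplot: color, size, and marker shape can all be mapped from columns,
# but there is no dodge or jitter parameter for categorical positions.
sns.scatterplot(
    data=df, x="total_bill", y="tip",
    hue="smoker", size="party_size", style="day", ax=ax_scatter,
)
```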

Now you can always add features, but the design makes this hard too. And paradoxically, it can feel hardest to add small features, small little nice-to-haves that just add a little polish or satisfy a few less common use cases.

So as an example, take the concept of adding a gap or a little spacing between the boxes in a box plot when you dodge them. It's a totally reasonable thing, you know, I like the way this looks. But implementing it would mean adding a parameter to box plot to control whether there's a gap, how big the gap is. And so that already long list of parameters is going to grow a little bit, and that's going to impose an ongoing cost to usability. Everyone in the future who comes and looks at the list of parameters is going to have one more that they need to sift through to understand what they can do.

And here's the thing, despite the way it's articulated on GitHub here, this isn't really just a request about the box plot function. Adding this would also mean changing the bar plot function and the violin plot function. All of the other functions that have some sort of dodging operation, you either have to change all of them or you have to add more inconsistency or accept more inconsistency in the library. And so beyond implementing that operation in the code itself, each function also has tests and documentation that would need to be updated. So it just increases the costs of adding even small things.

When Seaborn's functions don't support a feature, one option is to drop down to the underlying matplotlib layer and accomplish what you want there. Matplotlib's still a Python library, lots of people know how to use it, and this was part of the original design, right? Seaborn was just adding some things that you couldn't do in matplotlib, so if there was something it couldn't do, no harm, you just carry on as if Seaborn wasn't there.

But as Seaborn has developed more capabilities in its own style of API, the gap between the two libraries has grown, and so that's no longer a valid assumption. And it's also no longer valid to assume that every Seaborn user is comfortable using matplotlib. Lots of places have started teaching data science using Seaborn because it's a little easier, a little bit faster to get started out of the box.

Introducing the Objects Interface

So after nearly a decade, I decided to take a step back and think, if I were designing this library from scratch and had from the start a focus on a flexible but consistent interface that was easy to use and didn't constrain what you could do, what would that look like?

So the result is what I'm calling the Objects Interface, or seaborn.objects. It's a totally new approach, basically a ground-up rewrite that takes inspiration both from things that worked well in classic Seaborn and from libraries that are more explicitly based on the formal grammar of graphics, like ggplot, also Vega, parts of D3, other libraries that are out there that are built on some of these ideas.

So while people have sometimes referred to classic Seaborn as ggplot in Python, I've always thought that's wrong, or at least overselling it. Seaborn has certainly taken ideas and inspiration from ggplot since the beginning, but it's also always had a narrower ambition. It's not intended to be a fully flexible system for producing graphics.

The Objects Interface is much more explicitly based on the underlying formalism and also has somewhat broader ambitions to be more general. It's still not ggplot in Python, though. It's not going to be a direct port of the syntax that you might be used to if you're a ggplot user. It should feel familiar, I hope. I think the learning curve should be pretty rapid, but it's a different library.

I always say that if what you want is just to take your ggplot code and have it run in Python, you should check out the plotnine library, and today you're in luck because you just have to wait a few minutes and you'll hear about that. But for the time being, let me walk you through how I've approached this in Seaborn.

So everything in the Objects Interface is built around an object called Plot. Plot has a number of methods, and you call these methods to specify how various aspects of the graph should appear. So that includes data specification, adding layers, parameterizing the scales, customizing the decorations, and the theme of the plot. Every time you call one of its methods, you're gradually adding on to the spec.

And it's called the Objects Interface because beyond plot, there's a number of other objects that you pass to plot's methods to define the graph. So there's four basic types of objects, mark, stat, move, and scale. A mark is a visual element like a dot or a line, a stat defines a statistical operation, a move implements a positional adjustment, and a scale maps from data to visual properties.

How the Objects Interface works

So you call the Plot constructor with a data source and some assignments from variables in the data set to roles in the plot. Now this looks a lot like a function call in the classic interface, but if we compiled this spec, we would see just an empty graph with some labels on the axes. You won't actually get a data visualization until you add layers to the plot.

And you do that, hopefully intuitively, with a method called Plot.add. So every layer is defined by at least a mark object. Here I'm using a dot mark, so it's producing a scatter plot.

And you'll notice that when I define the layer, I'm setting the point size of the dot to a specific value, and then also associating the marker, the dot marker, with another column from the data frame.

So there's three ways to define visual properties here. There's properties that are in the Plot constructor that are mapped in all layers. There's properties defined in the mark constructor that are set directly to the value you give, independent of any of the data. And then properties defined in the call to Plot.add, which are mapped, but in a layer-specific way.

For this plot, there's no real difference between whether you specify color in the plot constructor or when you're calling add, but if we added another layer, such as a line representing a regression fit, you'd see that only the colors of the line are mapped.

So as in classic Seaborn, the default scales are chosen for each mapping variable based on the data types and other inferences about the plot spec. But there's a method called Plot.scale that allows you to explicitly override and supply the mapping you want to use. So for example, here I'm choosing a slightly more intuitive mapping from true-false to a circle and an X. For full customization, you can pass an instance of a scale object, like a nominal scale or a continuous scale. There's also a number of shorthands, like just passing a dictionary or a list, that imply a specific scale type.

The notion of small multiple plots remains very important to Seaborn. Any plot can be distributed across a grid of subplots by calling Plot.facet and supplying one or two additional columns from your data frame that define the rows and the columns in the facet grid.

And there's also a method called Plot.pair, which also produces subplots that look a lot like a facet plot, but much like the classic Seaborn PairGrid class or pairplot function, here we're taking multiple columns from the initial data frame and assigning them to the different subplot X or Y axes. So this is like faceting, and you could basically reshape your data and also accomplish this through a facet operation, but sometimes it's more convenient to skip the reshaping step.

So unlike in the classic Seaborn interface, you use the same term to either map a property based on data or set it directly. What matters is where you do it. And you can map a much wider range of visual properties. So that includes multiple color attributes, the alpha or the opacity of the mark, and whether the mark is filled.

Where appropriate, you can control style properties like the marker shape or the dash pattern that's used for a line or the edge of a patch. There's a number of size-based properties like the point size of a dot mark, the width of the lines or edges of a patch, and the strokes that define line art markers. If you have a text mark in your plot, you can also control the font size, alignment, and offset from the anchor point. Usually you'd set these directly, but it's possible to map them too. There's no distinction between mappable and unmappable parameters.

So I think the ergonomics here are a lot better. There's no more keyword redirection across multiple functions. If you're in a notebook or an IDE, all the properties supported by Plot or by a specific mark are just going to tab complete out. The documentation is going to be right there.

If you think back to the discussion about adding a gap parameter to all the functions that support dodging, that's no longer relevant here. There's just a single object. It's called Dodge. You add it to any plot spec, and the marks in that layer will get dodged. And the gap parameter can just be defined there. There's only three parameters in this class. It's very easy to understand what it can do, and then it composes with any mark. So we're not in the situation with stripplot and scatterplot where you have to choose between categorical adjustments and mapping additional properties.

The objects interface lets you combine all of these operations into one plot spec. Now with that said, this particular example is maybe a little overstuffed with information. The classic interface is opinionated, which is a nice way of saying it limits what you can do. And so usually that takes you towards doing things that make sense, but sometimes it prevents you from doing things that make sense in a specific context.

So the objects interface will let you make plots that maybe don't make so much sense. And that's because there will be some data sets, some contexts where this specific set of operations, when you throw them all together, does produce an informative graph. And so when I was developing the new interface, one objective I had in mind was make it possible to produce graphs that really don't make any sense so that you avoid preventing people from making plots that do.

And I think I'd call this successful. So do you want a jittered bar plot with unfilled bars where the edge width represents an ordinal variable? I kind of hope you don't. But if you did, you can make it with seaborn.objects.

Status and closing remarks

Okay, so final slide: what's the status of this work? It's released, but in a kind of experimental state. A few things are still being sorted out, but it's usable, and I use it every day. If you use the classic interface and you like it, and this has you concerned that it's going away, it's not. It's not being deprecated, and I'm going to continue to support it. And with that, thank you.

I'm going to ask you one really quick question. In which language is the backend code for seaborn.objects written?

It's all Python, it's all still talking to Matplotlib. I have some experiments where you can actually compile it down to JavaScript, and we'll see what comes of those. But yeah, at the moment it's still all Python.

Thank you very much. Let's thank our speaker again.