Resources

Megan Beckett | Aesthetically automated figure production | RStudio

Automation, reproducibility, data-driven. These are not normally concepts one would associate with the traditional publishing industry, where designers manually produce every artefact in proprietary software. And when you have thousands of figures to produce and update for a single textbook, this becomes an insurmountable task, meaning our textbooks quickly become outdated, especially in our rapidly advancing world. With R and the tidyverse in our back pocket, we rose to the challenge to revolutionize this workflow. I will explain how we collaborated with a publishing group to develop a system to aesthetically automate the production of figures for a textbook, including translations into several languages. I think you'll find this talk interesting as it shows how we applied tools that are familiar to us, but in an unconventional way, to fundamentally transform a conventional process.

About Megan: Megan Beckett is a Data Scientist at Exegetic Analytics, where she consults on, develops, and leads several analytical projects across a wide range of fields and industries. *"Scientifically creative; creatively scientific."* This aptly describes her philosophy and approach in her work and life. Megan helped co-found and organises the Cape Town R-Ladies chapter and is a co-organiser of the satRday events in South Africa. She loves to paint, with her most recent work exploring the [biodiversity of southern Africa](https://mappamundi.netlify.app/portfolio/), and running is her passion, whether on the road or the trail.


Transcript

This transcript was generated automatically and may contain errors.

Imagine you're an economics student and it's the start of the academic year. You're sitting in one of these chairs in the lecture venue and your lecturer is talking about changes in unemployment rates and productivity and brings up this graph from your economics textbook, showing the trends in change over time.

Well, having just lived through 2020, you likely know someone, or know someone who knows someone, who lost their job due to the COVID pandemic, and your eyes are immediately drawn to this part of the graph. What's going on here? What would this graph look like if we could include more recent data? Wouldn't it be great if your lecturer could draw on a relevant recent experience to make the teaching more engaging and also have the textbook resources to support this?

The problem is this figure from an economics textbook was manually produced in proprietary software, where each element was individually placed by a designer according to their style sheet. It is very time intensive and expensive. Updating and reproducing figures such as this is one of the key contributing factors to the really slow revision cycles of such a textbook. Our world is rapidly advancing and changing, and although we may have the data available, our educational resources quickly become outdated and likely stay so for several years.

This is not a problem unique to the educational publishing industry. How can we invigorate such a traditional process to overcome this?

My name is Megan Beckett. I'm a data scientist from Cape Town, South Africa, and I'd like to tell you about how we at Exegetic Analytics partnered with a publishing company in order to prototype a pipeline to automatically produce figures for an economics textbook. There are thousands of these images, and each has been meticulously manually handmade by the designers according to their style sheet. The first question the publishing house came to me with was, can you automate this so that we can easily update them? Well, with R and the tidyverse, of course we can.

Building the automated pipeline

The first step in the process was therefore to figure out an automated pipeline. We started out with the data in Excel files and various other sources, which were ingested into Google Sheets as a central repository. The reason for using Google Sheets instead of a database was that we still wanted it to be easily accessible to non-technical users, so they could also access the data and update it as needed.

Each figure is then governed by a central R script where the data is read in and the figure produced in ggplot, and the main output format was SVG. However, with such an automated workflow, we can also output multiple formats, useful for their various publishing platforms, whether EPUB, print, or the web. Furthermore, by taking advantage of the Plotly library and adding a single line of code to our scripts, we can also create interactive versions of each figure for the web, further enhancing the teaching and learning experience.
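The per-figure script described above can be sketched roughly as follows. The sheet name, column names, and file paths here are hypothetical placeholders, not the actual project's; the point is the shape of the pipeline: one data source, one ggplot, several outputs.

```r
library(googlesheets4)  # read from the central Google Sheets repository
library(ggplot2)
library(plotly)

# Ingest the data for this figure (hypothetical sheet and columns)
data <- read_sheet("unemployment-data")

# Build the figure once in ggplot
p <- ggplot(data, aes(x = year, y = rate)) +
  geom_line()

# Emit multiple static formats for the different publishing platforms
ggsave("unemployment.svg", p, width = 7, height = 4)   # main output: SVG
ggsave("unemployment.png", p, width = 7, height = 4, dpi = 300)

# One extra line gives an interactive version for the web
ggplotly(p)
```

The same ggplot object `p` feeds every output, so updating the data in the sheet and re-running the script regenerates every format at once.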

This entire process is wrapped up within an R package with several helper functions to help automate this pipeline. When we get to integrating with the publishing house's processes, this will likely become an API.

Matching the aesthetics

So, we ticked the box of figuring out how to automate this pipeline so that the graphs can be easily updated. But the next question they came with was, well, can you make them look as good as our originals? As publishers, it was very important to them that the figures looked good for teaching, that they complied with their style guidelines, and that they were virtually indistinguishable from the originals. This second part of the project therefore focused on the aesthetics, and this is where ggplot really came to the party. And it's also a part of the project that I thoroughly enjoyed.

On the left-hand side here, you can see an example piece from the style sheet that the designers worked with in order to produce the figures. This needed to be translated into code. The vast majority of the aesthetics of the figures was controlled with a custom ggplot theme. Here I've shown a simplified version on the right-hand side to demonstrate what I'm talking about. We started out with the base classic theme and then built upon that, where each specification in the visual style sheet was translated into code as a specification to the theme. For example, the font families, the point sizes, colors and lines, and even the spacing around all of the elements. It really is quite incredible how much you can control the output of your ggplot figures.
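A simplified version of such a custom theme might look like the sketch below. The specific fonts, sizes, colours, and margins are illustrative stand-ins, not the publisher's actual style sheet; what matters is the pattern of starting from `theme_classic()` and overriding one specification per line of the visual style guide.

```r
library(ggplot2)

# A custom textbook theme built on top of theme_classic().
# Every value below is illustrative, not the real style sheet.
theme_textbook <- function() {
  theme_classic(base_size = 11) +
    theme(
      text        = element_text(family = "Helvetica"),      # font family
      axis.title  = element_text(size = 9, colour = "grey30"), # point sizes
      axis.line   = element_line(linewidth = 0.3, colour = "grey50"),
      plot.margin = margin(t = 10, r = 15, b = 10, l = 10)   # spacing
    )
}

# Applied like any other theme:
# ggplot(data, aes(year, rate)) + geom_line() + theme_textbook()
```

Packaging the theme as a function means every figure script picks up style changes from a single place.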

I'd like to show you some of the outputs of what we achieved. On the left-hand side, you can see the original, meticulously manual plot. The purpose here is not to read and understand these graphs, but rather to make a comparison between this and the aesthetically automated. This here is our example from the beginning, from the economics lecture, with an additional plot below. If you do spot any differences, please do tweet me, as I do really love doing this.

Here is a second example, and what I want to point out here are some deliberate differences. You might have noticed this in the previous graph as well, but there are some dotted lines, blue lines, that show the trends for different time periods. When I calculated these trends from the data and plotted them on the graph, I noticed a stark difference, particularly for the 18th century, as indicated here. What I realized then is that the designer had visually looked at the data and placed the trend line by hand, according to what they thought was best. This is very error-prone.

So, not only is our plot on the right aesthetically automated, but it is also more accurate than the original, as it is derived directly from the data.


This third example was really quite a fun challenge to solve, specifically around the layout and how we placed these dotted grey lines, extending from points on the top graph to corresponding points on the bottom graph. As I mentioned, about 80% of the aesthetics of the plots was achieved with the ggplot theme. However, here I also used the gridExtra and ggtext packages, as well as gtable, which really powers the layout engine of ggplot. I was able to get into the underlying grobs and grid in order to produce such a layout. I really was quite proud of this image.
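The kind of grob-level layout described above can be sketched like this. The two plots and the line coordinates are placeholders; the technique is converting each ggplot to a gtable, binding them so the panels align, and then drawing connecting lines with grid in the shared page coordinates.

```r
library(ggplot2)
library(gtable)
library(grid)

# Two hypothetical plots to stack (mtcars stands in for the real data)
top    <- ggplotGrob(ggplot(mtcars, aes(wt, mpg)) + geom_point())
bottom <- ggplotGrob(ggplot(mtcars, aes(wt, hp)) + geom_point())

# rbind on gtables stacks the plots with their panel widths aligned
combined <- rbind(top, bottom, size = "first")

grid.newpage()
grid.draw(combined)

# A dotted grey line from a point on the top panel to one on the
# bottom, drawn in page-level ("npc") coordinates (illustrative values)
grid.lines(x = unit(c(0.3, 0.3), "npc"),
           y = unit(c(0.7, 0.3), "npc"),
           gp = gpar(col = "grey60", lty = "dotted"))
```

Dropping down to grid like this is what makes layouts possible that ggplot alone does not expose.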

This last example here shows some trend lines for different cities. What I want you to take note of, though, are the icons used in the legend, called key glyphs. These are not the standard key glyphs that come with ggplot. Typically, each geom has an associated function that will draw the key glyphs in the legend, if required. For example, geom_point has a key glyph function called draw_key_point that will then add points to the legend as icons, if needed.

What we did was look at the ggplot code for how these key glyph functions work and create some of our own custom functions to manually draw these icons. For example, this one is a key path with a key symbol placed on the right-hand side of that path to obtain this icon. These key glyph functions are then passed to the geom layer within the ggplot code so that the legend keys match how the original looks.
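A custom key glyph along these lines might look like the sketch below, modelled on ggplot2's built-in `draw_key_*` functions. The exact glyph the project used isn't shown in the transcript, so this draws one plausible variant: a line with a point at its right-hand end.

```r
library(ggplot2)
library(grid)

# A hypothetical custom key glyph: a short path with a symbol
# placed at its right-hand end, following the draw_key_* signature.
draw_key_path_point <- function(data, params, size) {
  grobTree(
    linesGrob(x = c(0.1, 0.9), y = 0.5,
              gp = gpar(col = data$colour, lwd = 1.5)),
    pointsGrob(x = 0.9, y = 0.5, pch = 16,
               size = unit(1.5, "mm"),
               gp = gpar(col = data$colour))
  )
}

# Passed to the geom layer via the key_glyph argument:
# ggplot(city_data, aes(year, value, colour = city)) +
#   geom_line(key_glyph = draw_key_path_point)
```

Because `key_glyph` is an argument on every geom layer, the custom icon slots in without touching the rest of the figure code.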

Handling translation

Having done this, the publishing house was now quite impressed with our skills and starting to believe that we could actually do this. The next question then was, OK, so can you do this in all languages? This economics textbook has been translated into several languages, and so we needed to figure out how to handle the automatic translation of these figures in our pipeline.

This third step therefore focused on translation, and the solution that we came up with was to capture the translated strings in a YAML file, which is then read into the R script for the figure. YAML, if you're not familiar with it, used to stand for Yet Another Markup Language, and it has since become the recursive acronym YAML Ain't Markup Language, specifically to distinguish its use as being data-oriented rather than a traditional markup for documents.

It's often used in configuration files, and you might have seen it or noticed it at the top of R Markdown documents. That's YAML. It's very structured and, importantly, human-readable. This was important for us because we needed the translators, who are non-technical, to be able to easily work with the format in order to capture the translated strings. But at the same time, it has a clear, data-oriented structure to work with in R.

On the left-hand side here, you can see an example from one of our YAML files. Each of the languages is denoted by a two-letter code, for example, en for English. Each of the labels that requires translation is assigned a variable name that's common across all of the languages, for example, the x-axis label. The value for this label is then the translated string, for example, year, año, or vuosi.

In R, we then have the yaml package to work with these files. The strings are read into R using the read_yaml function as a list of lists, which I then convert into a data frame. Within the ggplot code, each of the labels that requires translation is substituted with that variable name. The list of languages that needs to be generated is obtained from the first level in the hierarchy of the YAML file. I then iterate over those languages and replace the variable names with each translated string as we generate the output for that translation.
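Put together, the YAML file and the R side of the process look roughly like this. The file name, label names, and translations are illustrative; the structure (language code at the first level, shared variable names beneath it) follows the description above.

```r
library(yaml)

# labels.yml (illustrative) might contain:
#
# en:
#   x_axis: "Year"
# es:
#   x_axis: "Año"
# fi:
#   x_axis: "Vuosi"

labels <- read_yaml("labels.yml")   # a list of lists

# The languages to generate come from the first level of the hierarchy
languages <- names(labels)          # e.g. "en" "es" "fi"

for (lang in languages) {
  x_label <- labels[[lang]]$x_axis
  # ... build the ggplot with labs(x = x_label)
  # ... and save one output file per language
}
```

Adding a new language then means adding one new top-level block to the YAML file; no figure code changes.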

A further benefit of this process was that it allowed us to give the translators some more custom control over the labels. What I mean by this is, for example, in some cases we needed part of the string to be in italics, for example, the unit. With just English, this is fairly straightforward to do within the R script. However, if we take multiple languages into account, how do we know which part of the string needs to be in italics?

This is where I used the ggtext package. This is a line from the theme function which I showed earlier, where we replaced the element_text function with element_markdown. What this enables is that the translators can use simple Markdown in their strings, which is then converted to display the correct output in the final image. For example, they could enclose a unit in asterisks and this is then converted into italics in the final output.

This worked well for labels that were axis titles. However, we also have many labels that require translation that are placed on the plot with geom_label or geom_text. To cater for this, I again used the ggtext package, which provides functions that correspond with the original ggplot ones to override them. For example, instead of geom_text, we use geom_richtext, which again enables translators to put simple Markdown into their strings. That's then catered for in the final output.
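Both mechanisms can be sketched together as below. The dataset, label text, and annotation position are placeholders; the two ggtext pieces, `element_markdown()` in the theme and a rich-text geom on the plot, are as described above.

```r
library(ggplot2)
library(ggtext)

# Markdown in the axis title: asterisks become italics.
# The label string would come from the translated YAML file.
p <- ggplot(mtcars, aes(wt, mpg)) +
  geom_point() +
  labs(x = "Weight (*1000 lbs*)") +
  theme(axis.title.x = element_markdown())

# For labels placed on the plot itself, a "richtext" annotation
# (ggtext's replacement for geom_text) interprets the same Markdown.
p + annotate("richtext", x = 4, y = 30,
             label = "Heavier cars (*approx.*)")
```

Translators never touch R; they just write asterisks (or other simple Markdown) inside the translated strings.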

Handling different numbering systems

One of the really interesting challenges to solve with the translation was the different numbering systems for different languages and how do we automatically cater for this in a single R script. What do I mean by this? Well, here you can see three graphs in three different languages. What I want you to do is take note of the y-axis and the format of each of the numbers.

On the left we have English, and you can see that the thousands separator is a comma. We then have Spanish, which has a space separator, but only once the number is more than four digits long. Then we have Finnish, where the thousands separator is always a space. So how do we cater for this in the R script?

The solution I came up with was to create a number formatting function per language, which uses the number_format function from the scales package to specify what the big mark should be for the thousands separator and whether the decimal mark should be a comma or a point. The key thing here is that each function for each language has the same name as the two-letter code in the YAML file. This here shows English compared to Finnish, which is denoted by fi.

The Spanish formatter was a slightly more involved function. What I did here was iterate over the list of numbers supplied as the labels and first check whether the number is more than four digits long. If so, it gets a space; otherwise there's no space for the thousands separator.

Within the R script, the production of each of the languages is handled within a for loop and, as mentioned previously, the list of languages is obtained from the first level of the YAML file, for example, en, es, fi. In the ggplot code, I then use the very handy match.fun function from base R, which takes the language input as a string and returns the corresponding number formatting function, in order to format the numbers for the y-axis according to what is needed for that language. This here shows the y-axis, but it can just as easily be done for the x-axis.
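The whole per-language numbering scheme can be sketched as follows. The talk uses `scales::number_format`; for a self-contained base-R illustration this sketch uses `formatC` instead, which takes the same `big.mark` idea. The function names deliberately match the two-letter language codes so `match.fun` can look them up.

```r
# Per-language number formatters, named after the two-letter
# language codes from the YAML file (base-R sketch via formatC;
# the talk's version uses scales::number_format).
en <- function(x) formatC(x, big.mark = ",", format = "d")
fi <- function(x) formatC(x, big.mark = " ", format = "d")

# Spanish: a space separator only once the number has five or
# more digits; shorter numbers get no separator at all.
es <- function(x) {
  sapply(x, function(n) {
    if (abs(n) >= 10000) formatC(n, big.mark = " ", format = "d")
    else                 formatC(n, format = "d")
  })
}

# Inside the loop over languages, match.fun turns the language
# string into the matching formatter for the axis labels:
# for (lang in c("en", "es", "fi")) {
#   p + scale_y_continuous(labels = match.fun(lang))
# }
```

For example, 56789 formats as "56,789" in English, "56 789" in Spanish and Finnish, while 1234 stays "1234" in Spanish but becomes "1,234" in English.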

Closing thoughts

Having figured out this last piece of the puzzle, how to automatically translate images within our pipeline, I then went back to the publishers and heard a resounding "let's do this" from them. So we are now at a stage in the process where we are figuring out how to integrate with their publishing pipeline in order to automatically and beautifully create thousands of images for their economics textbook.

Peeking into our workshop from the perspective of the publishing industry, what we have done may seem incomprehensible, but I hope what I have demonstrated here is that with a few familiar tools applied in unconventional ways, you can completely transform a conventional process. There are so many industries and processes out there that are still very manual and time-intensive and battling to keep up with our ever-changing world. They don't know it yet, but they could really benefit from your skills.


And from my experience, they won't necessarily come knocking at your door. So sit up and take note of opportunities where you can really add value. For me, I'm thinking of those economics students sitting in their lecture venue at the start of the academic year. Not only would they have lived through the previous year, however bizarre and unusual it might have been, and come out different for it, but now their textbook can too.

Thank you. I do look forward to your thoughts and questions, and please do get in touch.