Resources

Megan Beckett | Aesthetically automated figure production | RStudio

Automation, reproducibility, data-driven. These are not normally concepts one would associate with the traditional publishing industry, where designers manually produce every artefact in proprietary software. And when you have thousands of figures to produce and update for a single textbook, this becomes an insurmountable task, meaning our textbooks quickly become outdated, especially in our rapidly advancing world. With R and the tidyverse in our back pocket, we rose to the challenge to revolutionize this workflow. I will explain how we collaborated with a publishing group to develop a system to aesthetically automate the production of figures for a textbook, including translations into several languages. I think you'll find this talk interesting as it shows how we applied tools that are familiar to us, but in an unconventional way, to fundamentally transform a conventional process.

About Megan: Megan Beckett is a Data Scientist at Exegetic Analytics, where she consults on, develops, and leads several analytical projects across a wide range of fields and industries. *"Scientifically creative; creatively scientific."* This aptly describes her philosophy and approach in her work and life. Megan helped co-found and organises the Cape Town R-Ladies chapter and is a co-organiser of the satRday events in South Africa. She loves to paint, with her most recent work exploring the [biodiversity of southern Africa](https://mappamundi.netlify.app/portfolio/), and running is her passion, whether on the road or the trail.


Transcript

This transcript was generated automatically and may contain errors.

Imagine you're an economics student and it's the start of the academic year. You're sitting in one of these chairs in the lecture venue and your lecturer is talking about changes in unemployment rates and productivity and brings up this graph from your economics textbook, showing the trends in change over time.

Well, having just lived through 2020, you likely know someone, or know someone who knows someone, who lost their job due to the COVID pandemic, and your eyes are immediately drawn to this part of the graph. What's going on here? What would this graph look like if we could include more recent data? Wouldn't it be great if your lecturer could draw on a relevant recent experience to make the teaching more engaging and also have the textbook resources to support this?

The problem is this figure from an economics textbook was manually produced in proprietary software, where each element was individually placed by a designer according to their style sheet. It is very time intensive and expensive. Updating and reproducing figures such as this is one of the key contributing factors to the really slow revision cycles of such a textbook. Our world is rapidly advancing and changing, and although we may have the data available, our educational resources quickly become outdated and likely stay so for several years.

This is not a problem unique to the educational publishing industry. How can we invigorate such a traditional process to overcome this?

My name is Megan Beckett. I'm a data scientist from Cape Town, South Africa, and I'd like to tell you about how we at Exegetic Analytics partnered with a publishing company in order to prototype a pipeline to automatically produce figures for an economics textbook. There are thousands of these images, and each has been meticulously manually handmade by the designers according to their style sheet. The first question the publishing house came to me with was, can you automate this so that we can easily update them? Well, with R and the tidyverse, of course we can.

Building the automated pipeline

The first step in the process was therefore to figure out an automated pipeline. We started out with the data in Excel files and various other sources, which were ingested into Google Sheets as a central repository. The reason for using Google Sheets instead of a database was that we still wanted it to be easily accessible to non-technical users, so they could also access the data and update it as needed.

Each figure is then governed by a central R script where the data is read in and the figure produced in ggplot, and the main output format was SVG. However, with such an automated workflow, we can also output multiple formats, useful for their various publishing platforms, whether EPUB, print, or the web. Furthermore, by taking advantage of the Plotly library and adding a single line of code to our scripts, we can also create interactive versions of each figure for the web, further enhancing the teaching and learning experience.
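The per-figure script described above can be sketched roughly as follows. The sheet name, column names, and file paths here are hypothetical placeholders, not the actual project's; the point is the shape of the pipeline: one data source, one ggplot, several outputs.

```r
library(googlesheets4)  # read from the central Google Sheets repository
library(ggplot2)
library(plotly)

# Ingest the data for this figure (hypothetical sheet and columns)
data <- read_sheet("unemployment-data")

# Build the figure once in ggplot
p <- ggplot(data, aes(x = year, y = rate)) +
  geom_line()

# Emit multiple static formats for the different publishing platforms
ggsave("unemployment.svg", p, width = 7, height = 4)   # main output: SVG
ggsave("unemployment.png", p, width = 7, height = 4, dpi = 300)

# One extra line gives an interactive version for the web
ggplotly(p)
```

The same ggplot object `p` feeds every output, so updating the data in the sheet and re-running the script regenerates every format at once.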

This entire process is wrapped up within an R package with several helper functions to help automate this pipeline. When we get to integrating with the publishing house's processes, this will likely become an API.

Matching the aesthetics

So, we ticked the box of figuring out how to automate this pipeline so that the graphs can be easily updated. But the next question they came with was, well, can you make them look as good as our originals? As publishers, it was very important to them that the figures looked good for teaching, that they complied with their style guidelines, and that they were virtually indistinguishable from the originals. This second part of the project therefore focused on the aesthetics, and this is where ggplot really came to the party. And it's also a part of the project that I thoroughly enjoyed.

On the left-hand side here, you can see an example piece from the style sheet that the designers worked with in order to produce the figures. This needed to be translated into code. The vast majority of the aesthetics of the figures was controlled with a custom ggplot theme. Here I've shown a simplified version on the right-hand side to demonstrate what I'm talking about. We started out with the base classic theme and then built upon that, where each specification in the visual style sheet was translated into code as a specification to the theme. For example, the font families, the point sizes, colors and lines, and even the spacing around all of the elements. It really is quite incredible how much you can control the output of your ggplot figures.
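A simplified version of such a custom theme might look like the sketch below. The specific fonts, sizes, colours, and margins are illustrative stand-ins, not the publisher's actual style sheet; what matters is the pattern of starting from `theme_classic()` and overriding one specification per line of the visual style guide.

```r
library(ggplot2)

# A custom textbook theme built on top of theme_classic().
# Every value below is illustrative, not the real style sheet.
theme_textbook <- function() {
  theme_classic(base_size = 11) +
    theme(
      text        = element_text(family = "Helvetica"),      # font family
      axis.title  = element_text(size = 9, colour = "grey30"), # point sizes
      axis.line   = element_line(linewidth = 0.3, colour = "grey50"),
      plot.margin = margin(t = 10, r = 15, b = 10, l = 10)   # spacing
    )
}

# Applied like any other theme:
# ggplot(data, aes(year, rate)) + geom_line() + theme_textbook()
```

Packaging the theme as a function means every figure script picks up style changes from a single place.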

I'd like to show you some of the outputs of what we achieved. On the left-hand side, you can see the original, meticulously manual plot. The purpose here is not to read and understand these graphs, but rather to make a comparison between this and the aesthetically automated. This here is our example from the beginning, from the economics lecture, with an additional plot below. If you do spot any differences, please do tweet me, as I do really love doing this.

Here is a second example, and what I want to point out here are some deliberate differences. You might have noticed this in the previous graph as well, but there are some dotted lines, blue lines, that show the trends for different time periods. When I calculated these trends from the data and plotted them on the graph, I noticed a stark difference, particularly for the 18th century, as indicated here. What I realized then is that the designer had visually looked at the data and placed the trend line by hand, according to what they thought was best. This is very error-prone.

So, not only is our plot on the right aesthetically automated, but it is also more accurate than the original, as it is derived directly from the data.


This third example was really quite a fun challenge to solve, specifically around the layout and how we placed these dotted grey lines, extending from points on the top graph to corresponding points on the bottom graph. As I mentioned, about 80% of the aesthetics of the plots was achieved with the ggplot theme. However, here I also used the gridExtra and ggtext packages, as well as gtable, which really powers the layout engine of ggplot. I was able to get into the underlying grobs and grid in order to produce such a layout. I really was quite proud of this image.
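The kind of grob-level layout described above can be sketched like this. The two plots and the line coordinates are placeholders; the technique is converting each ggplot to a gtable, binding them so the panels align, and then drawing connecting lines with grid in the shared page coordinates.

```r
library(ggplot2)
library(gtable)
library(grid)

# Two hypothetical plots to stack (mtcars stands in for the real data)
top    <- ggplotGrob(ggplot(mtcars, aes(wt, mpg)) + geom_point())
bottom <- ggplotGrob(ggplot(mtcars, aes(wt, hp)) + geom_point())

# rbind on gtables stacks the plots with their panel widths aligned
combined <- rbind(top, bottom, size = "first")

grid.newpage()
grid.draw(combined)

# A dotted grey line from a point on the top panel to one on the
# bottom, drawn in page-level ("npc") coordinates (illustrative values)
grid.lines(x = unit(c(0.3, 0.3), "npc"),
           y = unit(c(0.7, 0.3), "npc"),
           gp = gpar(col = "grey60", lty = "dotted"))
```

Dropping down to grid like this is what makes layouts possible that ggplot alone does not expose.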

This last example here shows some trend lines for different cities. What I want you to take note of, though, are the icons used in the legend, called key glyphs. These are not the standard key glyphs that come with ggplot. Typically, each geom has an associated function that will draw the key glyphs in the legend, if required. For example, geom_point has a key glyph function called draw_key_point that will then add points to the legend as icons, if needed.

What we did was look at the ggplot code for how these key glyph functions work and create some of our own custom functions to manually draw these icons. For example, this one is a key path with a key symbol placed on the right-hand side of that path to obtain this icon. These key glyph functions are then passed to the geom layer within the ggplot code so that the legend keys match how the original looks.
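A custom key glyph along these lines might look like the sketch below, modelled on ggplot2's built-in `draw_key_*` functions. The exact glyph the project used isn't shown in the transcript, so this draws one plausible variant: a line with a point at its right-hand end.

```r
library(ggplot2)
library(grid)

# A hypothetical custom key glyph: a short path with a symbol
# placed at its right-hand end, following the draw_key_* signature.
draw_key_path_point <- function(data, params, size) {
  grobTree(
    linesGrob(x = c(0.1, 0.9), y = 0.5,
              gp = gpar(col = data$colour, lwd = 1.5)),
    pointsGrob(x = 0.9, y = 0.5, pch = 16,
               size = unit(1.5, "mm"),
               gp = gpar(col = data$colour))
  )
}

# Passed to the geom layer via the key_glyph argument:
# ggplot(city_data, aes(year, value, colour = city)) +
#   geom_line(key_glyph = draw_key_path_point)
```

Because `key_glyph` is an argument on every geom layer, the custom icon slots in without touching the rest of the figure code.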

Handling translation

Having done this, the publishing house was now quite impressed with our skills and starting to believe that we could actually do this. The next question then was, OK, so can you do this in all languages? This economics textbook has been translated into several languages, and so we needed to figure out how to handle the automatic translation of these figures in our pipeline.

This third step therefore focused on translation, and the solution that we came up with was to capture the translated strings in a YAML file, which is then read into the R script for the figure. YAML, if you're not familiar with it, used to stand for Yet Another Markup Language, and it has since become the recursive acronym YAML Ain't Markup Language, specifically to distinguish its use as being data-oriented rather than a traditional markup for documents.

It's often used in configuration files, and you might have seen it or noticed it at the top of R Markdown documents. That's YAML. It's very structured and, importantly, human-readable. This was important for us because we needed the translators, who are non-technical, to be able to easily work with the format in order to capture the translated strings. But at the same time, it has a clear, data-oriented structure to work with in R.

On the left-hand side here, you can see an example from one of our YAML files. Each of the languages is denoted by a two-letter code, for example, en for English. Each of the labels that requires translation is assigned a variable name that's common across all of the languages, for example, the x-axis label. The value for this label is then the translated string, for example, year, año, or vuosi.

In R, we then have the yaml package to work with these files. The strings are read into R using the read_yaml function as a list of lists, which I then convert into a data frame. Within the ggplot code, each of the labels that requires translation is substituted with that variable name. The list of languages that needs to be generated is obtained from the first level in the hierarchy of the YAML file. I then iterate over those languages and replace the variable names with each translated string as we generate the output for that translation.
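Put together, the YAML file and the R side of the process look roughly like this. The file name, label names, and translations are illustrative; the structure (language code at the first level, shared variable names beneath it) follows the description above.

```r
library(yaml)

# labels.yml (illustrative) might contain:
#
# en:
#   x_axis: "Year"
# es:
#   x_axis: "Año"
# fi:
#   x_axis: "Vuosi"

labels <- read_yaml("labels.yml")   # a list of lists

# The languages to generate come from the first level of the hierarchy
languages <- names(labels)          # e.g. "en" "es" "fi"

for (lang in languages) {
  x_label <- labels[[lang]]$x_axis
  # ... build the ggplot with labs(x = x_label)
  # ... and save one output file per language
}
```

Adding a new language then means adding one new top-level block to the YAML file; no figure code changes.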

A further benefit of this process was that it allowed us to give the translators some more custom control over the labels. What I mean by this is, for example, in some cases we needed part of the string to be in italics, for example, the unit. With just English, this is fairly straightforward to do within the R script. However, if we take multiple languages into account, how do we know which part of the string needs to be in italics?

This is where I used the ggtext package. This is a line from the theme function which I showed earlier, where we replaced the element_text function with element_markdown. What this enables is that the translators can use simple Markdown in their strings, which is then converted to display the correct output in the final image. For example, they could enclose a unit in asterisks and this is then converted into italics in the final output.

This worked well for labels that were axis titles. However, we also have many labels that require translation that are placed on the plot with geom_label or geom_text. To cater for this, I again used the ggtext package, which provides functions that correspond with the original ggplot ones to override them. For example, instead of geom_text, we use geom_richtext, which again enables translators to put simple Markdown into their strings. That's then catered for in the final output.
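Both mechanisms can be sketched together as below. The dataset, label text, and annotation position are placeholders; the two ggtext pieces, `element_markdown()` in the theme and a rich-text geom on the plot, are as described above.

```r
library(ggplot2)
library(ggtext)

# Markdown in the axis title: asterisks become italics.
# The label string would come from the translated YAML file.
p <- ggplot(mtcars, aes(wt, mpg)) +
  geom_point() +
  labs(x = "Weight (*1000 lbs*)") +
  theme(axis.title.x = element_markdown())

# For labels placed on the plot itself, a "richtext" annotation
# (ggtext's replacement for geom_text) interprets the same Markdown.
p + annotate("richtext", x = 4, y = 30,
             label = "Heavier cars (*approx.*)")
```

Translators never touch R; they just write asterisks (or other simple Markdown) inside the translated strings.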

Handling different numbering systems

One of the really interesting challenges to solve with the translation was the different numbering systems for different languages and how do we automatically cater for this in a single R script. What do I mean by this? Well, here you can see three graphs in three different languages. What I want you to do is take note of the y-axis and the format of each of the numbers.

On the left we have English, and you can see that the thousands separator is a comma. We then have Spanish, which has a space separator, but only once the number is more than four digits long. Then we have Finnish, where the thousands separator is always a space. So how do we cater for this in the R script?

The solution I came up with was to create a number formatting function per language, which uses the number_format function from the scales package to specify what the big mark should be for the thousands separator and whether the decimal mark should be a comma or a point. The key thing here is that each function for each language has the same name as the two-letter code in the YAML file. This here shows English compared to Finnish, which is denoted by fi.

The Spanish formatter was a slightly more involved function. What I did here was iterate over the list of numbers supplied as the labels and first check whether the number is more than four digits long. If so, it gets a space; otherwise there's no space for the thousands separator.

Within the R script, the production of each of the languages is handled within a for loop and, as mentioned previously, the list of languages is obtained from the first level of the YAML file, for example, en, es, fi. In the ggplot code, I then use the very handy match.fun function from base R, which takes the language input as a string and returns the corresponding number formatting function, in order to format the numbers for the y-axis according to what is needed for that language. This here shows the y-axis, but it can just as easily be done for the x-axis.
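The whole per-language numbering scheme can be sketched as follows. The talk uses `scales::number_format`; for a self-contained base-R illustration this sketch uses `formatC` instead, which takes the same `big.mark` idea. The function names deliberately match the two-letter language codes so `match.fun` can look them up.

```r
# Per-language number formatters, named after the two-letter
# language codes from the YAML file (base-R sketch via formatC;
# the talk's version uses scales::number_format).
en <- function(x) formatC(x, big.mark = ",", format = "d")
fi <- function(x) formatC(x, big.mark = " ", format = "d")

# Spanish: a space separator only once the number has five or
# more digits; shorter numbers get no separator at all.
es <- function(x) {
  sapply(x, function(n) {
    if (abs(n) >= 10000) formatC(n, big.mark = " ", format = "d")
    else                 formatC(n, format = "d")
  })
}

# Inside the loop over languages, match.fun turns the language
# string into the matching formatter for the axis labels:
# for (lang in c("en", "es", "fi")) {
#   p + scale_y_continuous(labels = match.fun(lang))
# }
```

For example, 56789 formats as "56,789" in English, "56 789" in Spanish and Finnish, while 1234 stays "1234" in Spanish but becomes "1,234" in English.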

Closing thoughts

Having figured out this last piece of the puzzle, how to automatically translate images within our pipeline, I then went back to the publishers and heard a resounding "let's do this" from them. So we are now at a stage in the process where we are figuring out how to integrate with their publishing pipeline in order to automatically and beautifully create thousands of images for their economics textbook.

Peeking into our workshop from the perspective of the publishing industry, what we have done may seem incomprehensible, but I hope what I have demonstrated here is that with a few familiar tools applied in unconventional ways, you can completely transform a conventional process. There are so many industries and processes out there that are still very manual and time-intensive and battling to keep up with our ever-changing world. They don't know it yet, but they could really benefit from your skills.


And from my experience, they won't necessarily come knocking at your door. So sit up and take note of opportunities where you can really add value. For me, I'm thinking of those economics students sitting in their lecture venue at the start of the academic year. Not only would they have lived through the previous year, however bizarre and unusual it might have been, and come out different for it, but now their textbook can too.

Thank you. I do look forward to your thoughts and questions, and please do get in touch.