
Nicole Kramer | A New Paradigm for Multifigure Coordinate-Based Plotting in R | RStudio
R is unparalleled in its ability to transform raw data into a wide array of beautiful graphics, all within the same environment. However, when it comes to complex, multi-paneled plots, users rely on third-party graphic design software to arrange plots. Here I present the new world of programmatic, coordinate-based multi-figure plotting in R. Employing grid graphics and drawing from the paradigms of base plotting and ggplot2, I am developing a package that will revolutionize the way plots are laid out in R. Not only will individual plots be aesthetically customizable and tailored for speed, users will also be offered exquisite control over all aspects of page layout, plot placement, and arrangements. Come join me in changing how we plot in R! About Nicole: Nicole Kramer is a third year [Bioinformatics and Computational Biology](https://bcb.unc.edu/) graduate student at the University of North Carolina at Chapel Hill. She works in the [lab of Dr. Doug Phanstiel](http://phanstiel-lab.med.unc.edu/), where she and her colleagues use experimental and computational techniques to study human genomics. Prior to grad school, Nicole received her B.S. in Biological Engineering from MIT in 2018. When not doing science, you can find Nicole petting dogs, admiring giraffes, or knitting tiny animals!
Transcript
This transcript was generated automatically and may contain errors.
Hi, my name is Nicole Kramer and I'm a third year bioinformatics and computational biology graduate student at the University of North Carolina at Chapel Hill. Today I'm really honored to be talking to you about a new paradigm for multi-figure coordinate-based plotting in R that I've begun to make possible with my package, BentoBox.
The inspiration for this package and this functionality within the R plotting environment came from some of my own stressful experiences as a grad student making figures and plots. There was one time when my advisor asked me to make a multi-paneled figure that looked something like this. There were two heatmap style plots in a specific genomic region. There were two tracks of genes below that. There were six different tracks of bin signal data below those. And there was a bar graph of some statistical analyses done in R.
And since I work in genomics, all of this data was huge. One heatmap came from a file that was 55 gigabytes and the other came from a file that was 14 gigabytes. One set of those bin data came from three different files that totaled 1.2 gigabytes and the other came from three other files that totaled 1.7 gigabytes. And on top of this data being huge, the process to make this combined figure was extremely tedious and time-consuming with all of its elements coming from different places. There were different screenshots of a couple of genomic browsers. There was a plot made from data analyzed in R. Everything was cropped and arranged in Adobe Illustrator. And all of the fine-tuning and nice labels were also made in Adobe Illustrator.
So when my advisor asked me to change the genomic coordinates I was looking at in this figure, it wasn't a simple fix. I went to my overcrowded laptop screen with a bunch of genomic browsers open, my Dropbox open with all the files I was working with, my RStudio window, my Adobe Illustrator window, and the paper I was taking inspiration from, and I became completely overwhelmed. I thought that there had to be an easier way to make figures like this, beyond existing browsers and existing programmatic libraries: something that didn't require graphic design software and that would let me make and arrange all my plots in one place. Something that was entirely reproducible by being completely programmatic, yet entirely customizable, and efficient for handling large data specifically.
And so my team and I developed a package called BentoBox that allows for coordinate-based plotting in R, where plots can be programmatically made and arranged on a user-defined page layout with common units of measurement. Here is an example. I'm showing you two tiny page markings with inches and centimeters.
Making figures entirely programmatic
So let's talk about how BentoBox makes figures entirely programmatic. If we go back to that figure I made by combining screenshots in Adobe Illustrator, I can now make a precisely tailored version of that figure entirely in R. The code on the left is every command needed to make the figure on the right. I've squished it down for you here, and it's not important that you look at the code I've written, but just know that if I gave you this file and the data files, you could make the exact same figure that I've made on the right. So here I can give my advisor the file, the data, and they can make that figure just as they want it. I can define all my files and data in just one place, and here with one line of code I can easily change the coordinates that I'm looking at while keeping the rest of the figure structure and aesthetics the same. So if my advisor wants to change the region but have everything else look the same, I now only have to do that with one line of code.
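The one-line change described above can be sketched with plain grid, the graphics system the abstract says the package builds on. This is not BentoBox's own API: the `region` list, the `track_viewport` helper, and the coordinates are hypothetical, and the sketch only illustrates defining the region once so that every track inherits it.

```r
library(grid)

# Hypothetical genomic region, defined once. Changing this single line
# changes the x-axis range of every track built below.
region <- list(chrom = "chr1", start = 1000000, end = 2000000)

# Hypothetical helper: a track viewport whose x scale is the region.
# `x` and `top` give the top-left corner in inches, measured from the
# bottom-left corner of the parent viewport (grid's own convention).
track_viewport <- function(region, x, top, width, height, yscale = c(0, 1)) {
  viewport(
    x = unit(x, "in"), y = unit(top, "in"),
    width = unit(width, "in"), height = unit(height, "in"),
    just = c("left", "top"),
    xscale = c(region$start, region$end),
    yscale = yscale
  )
}

# Two stacked tracks with the same left edge, width, and x scale
signal_vp <- track_viewport(region, x = 0.5, top = 3.0, width = 4, height = 1,
                            yscale = c(0, 50))
genes_vp  <- track_viewport(region, x = 0.5, top = 1.5, width = 4, height = 0.5)
```

Because both viewports read their x scale from `region`, data drawn in native units lines up across tracks, and the layout and aesthetics stay untouched when the region changes.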
Customization and coordinate-based placement
BentoBox also makes plots entirely customizable, just as you could for other plots in R. I now have the power to really customize a bunch of things about my plots. I can really easily change their color palettes, font sizes, font types, dimensions, etc. etc. So let's say my advisor didn't like those heatmaps being red, and he wanted those bin signal tracks to be slightly taller. We can now easily make both of our heatmaps with a blue color palette and quickly adjust the dimensions of all those tracks, all while maintaining picture clarity and not having to squish any of our data.
One of the really special customizations in BentoBox is its ability to perform coordinate-based plotting on a page with precise placements and dimensions. So here we can define a BentoBox page that is exactly 9 inches wide and 5 inches tall, and now we can make and arrange all of our plots within this defined landscape. So if we want the top corner of a plot to be 2 inches down from the top of the page and 1 inch from the left, and we want the plot to be 3 inches wide and 1 inch tall, we can make our plot with exactly these parameters, and it'll fit right in that box we defined. This exact sizing and placement is particularly important and useful for standardized and accurate data comparisons.
This applies along the y-axis: if we're looking at two plots side-by-side and they have the same scale, we want to make sure we're comparing that data correctly. And it's also important for comparing data along the same coordinates of the x-axis. So here in the case of genomics, we're looking at a specific region along the genome, and we want to compare the data that's specifically in that region.
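The placement described above (a 9 by 5 inch page, with a 3 by 1 inch plot whose top-left corner sits 1 inch from the left and 2 inches down) can be sketched with plain grid viewports. This is a minimal sketch and not BentoBox's API; the variable names and the filled rectangle standing in for a plot are assumptions.

```r
library(grid)

# Page and plot geometry from the talk, all in inches
page_w <- 9; page_h <- 5
plot_x <- 1; plot_y <- 2   # top-left corner, measured from the page's top-left
plot_w <- 3; plot_h <- 1

# A viewport the exact size of the page
page_vp <- viewport(width = unit(page_w, "in"), height = unit(page_h, "in"),
                    name = "page")

# grid measures y from the bottom, so convert the top-down offset
plot_vp <- viewport(
  x = unit(plot_x, "in"),
  y = unit(page_h - plot_y, "in"),
  width = unit(plot_w, "in"),
  height = unit(plot_h, "in"),
  just = c("left", "top"),
  name = "plot"
)

# Export at exactly the page size
out_file <- tempfile(fileext = ".pdf")
pdf(out_file, width = page_w, height = page_h)
grid.newpage()
pushViewport(page_vp)
grid.rect(gp = gpar(col = "grey60", fill = NA))     # page outline
pushViewport(plot_vp)
grid.rect(gp = gpar(fill = "steelblue", col = NA))  # stand-in for a plot
popViewport(2)
dev.off()
```

Absolute units are the design choice that matters here: the box is always 3 inches by 1 inch, whatever the size of the device it is drawn on.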
Comparing BentoBox to patchwork and cowplot
And while there are other tools like patchwork and cowplot for arranging plots in R, none are as precise and specifically tailored for a multi-plot environment as BentoBox. So here I've made two example ggplot2 plots using the mtcars dataset, and I'm going to show you how they work differently in patchwork, cowplot, and BentoBox. patchwork is specialized for the easy arrangement of plots in grids. So here we can just define plot1 plus plot2, and patchwork will place them side-by-side for us. And cowplot expands on these grid layouts and gives you slightly more control of the relative dimensions and plot placements. So here I can define that I want the left plot to be slightly smaller and shifted relative to the plot on the right.
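The two relative layouts described above can be sketched as follows. The aesthetics chosen for the two mtcars plots and the specific widths and offsets are assumptions, since the talk does not show them.

```r
library(ggplot2)
library(patchwork)
library(cowplot)

# Two example plots from the built-in mtcars dataset
plot1 <- ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()
plot2 <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) + geom_boxplot()

# patchwork: `+` places the plots side by side in a grid
patchwork_layout <- plot1 + plot2

# cowplot: plot_grid() with relative widths makes the left plot narrower
cowplot_grid <- plot_grid(plot1, plot2, rel_widths = c(1, 2))

# cowplot: ggdraw() + draw_plot() positions each plot in relative
# (0 to 1) canvas coordinates, so the left plot can be smaller and shifted
cowplot_free <- ggdraw() +
  draw_plot(plot1, x = 0.05, y = 0.1, width = 0.35, height = 0.8) +
  draw_plot(plot2, x = 0.45, y = 0, width = 0.55, height = 1)
```

In both packages the sizes are fractions of the drawing area, so the rendered dimensions depend on the size of the device or export.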
But these options both have issues of placement and size accuracy, which can become a problem both in terms of aesthetics and data interpretation. So as you can see here, the relative placements and sizes of the plots mean that they can easily get squished or stretched, especially when we're moving around that R graphics device window or if we're exporting it in different sizes. This does not happen with BentoBox plots because everything is aligned and spaced exactly how we want it to be. So here if we're placing those same plots on a page that is defined to be 8 inches wide and 4 inches tall, we can see as we move the R graphics device window around, it adjusts to maintain those dimensions.
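The fixed-size behavior can be approximated with plain grid by drawing each ggplot2 plot inside a viewport sized in inches. This is a sketch, not BentoBox code: `place_plot` is a hypothetical helper, the plot positions are assumptions, and the PDF device is deliberately larger than the 8 by 4 inch page to show that the plots keep their size.

```r
library(grid)
library(ggplot2)

plot1 <- ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()
plot2 <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) + geom_boxplot()

# Hypothetical helper: draw a ggplot in a box given in inches, with (x, y)
# the top-left corner measured from the page's top-left corner
place_plot <- function(plot, x, y, width, height, page_height) {
  vp <- viewport(
    x = unit(x, "in"), y = unit(page_height - y, "in"),
    width = unit(width, "in"), height = unit(height, "in"),
    just = c("left", "top")
  )
  pushViewport(vp)
  grid.draw(ggplotGrob(plot))
  popViewport()
  invisible(vp)
}

page_w <- 8; page_h <- 4
out_file <- tempfile(fileext = ".pdf")
pdf(out_file, width = 10, height = 6)  # device is larger than the page
grid.newpage()
# The page viewport is centered in the device and stays 8 x 4 inches
pushViewport(viewport(width = unit(page_w, "in"), height = unit(page_h, "in")))
left_vp  <- place_plot(plot1, x = 0.5, y = 0.5, width = 3, height = 3,
                       page_height = page_h)
right_vp <- place_plot(plot2, x = 4.5, y = 0.5, width = 3, height = 3,
                       page_height = page_h)
popViewport()
dev.off()
```

In an interactive session the same drawing calls can target the default graphics window; resizing the window redraws the plots at the same 3 by 3 inch size.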
Efficiency with large datasets
And despite plotting and customizing extremely large datasets, BentoBox functions are extremely efficient because they were optimized to plot large datasets quickly. So if we go back to our previous figure that we made with BentoBox, we can see how quick it is to generate each of the main elements of our plot. So our two heatmaps here took only 0.6 seconds to plot. Our genomic labeling and our gene tracks took 0.06 seconds to plot. Each of our signal tracks took less than one second to plot. And adding our ggplot2 plot took 0.4 seconds. In the end, it only took my laptop about three seconds to generate this whole figure coming from many large data files.
And if our advisor wants us to visually scan through our large datasets, we can now programmatically and easily make aesthetic, informative figures in bulk. So here I've made 20 files of the same style of figure but looking through different regions of a file. And they're all plotted in the exact way I want them and I can easily scan through all of this data.
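Bulk generation of this kind amounts to a loop over regions that calls the same figure code each time. The sketch below uses plain grid with a placeholder figure; the regions, file names, and the `draw_region_figure` function are hypothetical and stand in for the real plotting code.

```r
library(grid)

# Hypothetical regions to scan through: 20 consecutive 500 kb windows
regions <- data.frame(
  chrom = "chr1",
  start = seq(1e6, by = 5e5, length.out = 20),
  stringsAsFactors = FALSE
)
regions$end <- regions$start + 5e5

out_dir <- file.path(tempdir(), "region_figures")
dir.create(out_dir, showWarnings = FALSE)

# Hypothetical figure function: same layout every time, only the region changes
draw_region_figure <- function(chrom, start, end) {
  grid.newpage()
  pushViewport(viewport(width = unit(3, "in"), height = unit(1, "in"),
                        xscale = c(start, end)))
  grid.rect()
  grid.text(sprintf("%s:%.0f-%.0f", chrom, start, end))
  popViewport()
}

# One PDF per region, all with identical page size and layout
files <- vapply(seq_len(nrow(regions)), function(i) {
  r <- regions[i, ]
  out_file <- file.path(out_dir,
                        sprintf("%s_%.0f_%.0f.pdf", r$chrom, r$start, r$end))
  pdf(out_file, width = 4, height = 2)
  draw_region_figure(r$chrom, r$start, r$end)
  dev.off()
  out_file
}, character(1))
```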
Walkthrough of a BentoBox figure
So now that we've gone through some of BentoBox's key features, I want to take you through a walkthrough of a plot that we made with some genomic data. So first, we can initialize our BentoBox page with the exact dimensions in our unit of choice. As an American, I have chosen inches here. This function will by default add ticks and guides to aid you in your figure building process to make sure you're placing things and arranging them in the way you want.
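A page with guides like the one described above can be approximated in plain grid. The `create_page` helper below is hypothetical and is not the BentoBox function; the 9 by 5 inch size is borrowed from the earlier example, and drawing one guide per whole unit is an assumption about how the guides look.

```r
library(grid)

# Hypothetical helper (not a BentoBox function): start a page of exact size
# and draw a light guide line at every whole unit
create_page <- function(width, height, units = "in", show_guides = TRUE) {
  grid.newpage()
  pushViewport(viewport(width = unit(width, units),
                        height = unit(height, units), name = "page"))
  xs <- seq(0, width, by = 1)
  ys <- seq(0, height, by = 1)
  if (show_guides) {
    grid.rect(gp = gpar(col = "black", fill = NA))
    guide_gp <- gpar(col = "grey70", lty = "dashed")
    # Vertical guides at each whole unit along x
    grid.segments(x0 = unit(xs, units), x1 = unit(xs, units),
                  y0 = unit(0, "npc"), y1 = unit(1, "npc"), gp = guide_gp)
    # Horizontal guides at each whole unit along y
    grid.segments(x0 = unit(0, "npc"), x1 = unit(1, "npc"),
                  y0 = unit(ys, units), y1 = unit(ys, units), gp = guide_gp)
  }
  # The page viewport stays pushed so plots can be placed on it
  invisible(list(x = xs, y = ys))
}

pdf(NULL)  # no file is written; use a real device in practice
guides <- create_page(width = 9, height = 5, units = "in")
popViewport()
dev.off()
```

Passing `units = "cm"` gives the same page logic in centimeters, since grid units carry their own measurement system.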
So first here, we can add one half of a heat map from a file and its associated color scale at one corner. We can add the bottom half of this heat map from a different file in a different color palette with its associated color scale. We can replace those two heat maps with one square heat map that's exactly three inches wide and three inches tall and easily change its color palette. We can mark up the dark pixels on that heat map that in the genomic field we denote as DNA loops and we can do this with a variety of options. So here we can do this with circles or with arrows or with squares and in this case we can just highlight one specific loop region and annotate that with text if that's what we want to look at.
We can next line up and add a different representation for these loop regions as two boxes connected by line segments and this track is exactly the same width as our heat map and in the height that we want it to be. Or we can change this data into a different representation by marking them up as connected arches. Below this and lined up to our heat map and these arch representations, we can now add different bin signal tracks from different files and in different colors. For each of these we can change the y ranges and the heights of all these different plots.
We can then line up the track of genes that fall in this exact region of data that we're looking at. So we can show both strands of the DNA, show the structure of the genes, and label them. We can easily change the colors of every element of this gene track and make it all gray or all a different color or we can just highlight a specific gene in a color of our choosing. And finally below this entire collection of data we can label the x-axis of genomic coordinates that we're looking at and really line it up to everything we've built up so far.
But beyond looking at these lined up tracks of different data that you might already see in a genomic browser, we can also add various plot elements to our page in different locations, sizes, and orientations, really with no limitations. So first on this right side we can now add a picture representation of a chromosome at a specified width. Below that we can zoom in and highlight a specific region of the chromosome and add two different heat maps of the same size and region right next to each other to compare those two different data types. We can change those square heat maps into triangle ones if that suits our fancy.
We can add two more bin signal tracks below our other heat maps, this time making them the same width as our smaller heat map since we're comparing along that axis. We can zoom in on a specific region of that bin data to compare a specific peak of data. And we can also add the sequencing read pileups that represent that bin signal data, and we can change the color for each of the two strands of DNA. And finally again we can add our labeled genomic coordinates below each of these zoomed regions of data.
And when we're done building up and tweaking our plot to our liking, we can remove all of those guides from our page and export our publication quality figure in the size we made it at the beginning so we can use it in research papers or presentations just like this one. So this entire plot is exactly what I've built up in BentoBox.
BentoBox vs. existing libraries
And so if we compare BentoBox to existing programmatic libraries, we can see that both are programmatic and reproducible, both are fully customizable, and both integrate a wide variety of data types. However, BentoBox is specialized to handle large data sets like genomic data, and only BentoBox gives users the precise control of plot placements and dimensions.
BentoBox currently has functions for genomic data, but I hope its paradigms will extend the R plotting environment for all kinds of data visualizations. If you're interested in trying out BentoBox and exploring how you can use it with your data, you can get BentoBox from my lab's GitHub page, or feel free to tweet at me directly. Thank you so much for your time.
