
Alicia Schep | Auto-magic Package Development | RStudio (2020)
Vega-lite is a high-level grammar of interactive graphics implemented in Javascript; it renders interactive visualizations in the browser based on a JSON specification. In Python and Javascript, the Altair and vega-lite-api packages have demonstrated how the development of APIs to build Vega-Lite graphics can be partially automated based on the Vega-Lite JSON schema, which describes the required format for a Vega-Lite JSON specification. This talk will describe the development of the ‘vlbuildr’ package for building Vega-Lite specifications in R and the ‘vlmetabuildr’ package for building the ‘vlbuildr’ package. The ‘vlbuildr’ package seeks to provide a pipe-friendly, “R-like” functional interface for building up simple to complex specifications for Vega-Lite graphics, which can in turn be rendered as an HtmlWidget by the ‘vegawidget’ R package. Building such an API in a fully automated way from the Vega-Lite schema presents considerable challenges, so the approach taken here was to rely on partial automation. Human judgement dictates the basic contours of the API, such as what groups of functions to include and how various types of building blocks will go together. The part that is automated is filling in many details such as the different variants of a group of functions, the exact parameters needed for each function, and the documentation of those parameters -- the parts that would be extremely tedious to port over!
Transcript
This transcript was generated automatically and may contain errors.
Hi, my name is Alicia Schep, and I'll be talking to you today about auto-magic package development, or an effort to build a VegaLite interface in R that I call vlbuildr. If you're interested in the package, the GitHub page is under the VegaWidget organization, and in the README of the package there is a link to the slides for this presentation.
An alternate title I could have come up with here if I had a more literary bent would be a tale of two PRs because a lot of this work was motivated by a PR several years ago and recently vindicated by another one and so I'll be sharing about both of those in this talk.
Why VegaLite and Altair
I want to start off, though, a little off topic by talking about ggplot2, because ggplot2 is what originally sold me on R. I did programming in Python before I learned R, and I thought R was kind of awkward, kind of weird, but ggplot2 was amazing.
I could do all of these things once I learned those basic building blocks, learned that underlying grammar, I could all of a sudden make all of these different kinds of plots and it felt really powerful and really special and so over the years even as I oscillated between using Python and R, I would always come back to R for the visualization part because ggplot2 was just so good and in my opinion nothing in Python really quite measured up, but that recently has changed and Python I feel like can now compete in the visualization domain and that's because of a package called Altair.
If you use Python at all, I highly recommend checking it out. It's really amazing. It was developed by Jake VanderPlas and Brian Granger and a whole team of folks, and it is still under active development and really well maintained.
And it is like ggplot2 in many ways. It's built on top of a grammar of graphics so it has some of those same properties that once you learn the basic patterns, you can all of a sudden make all sorts of different kinds of visualizations. But it has in some ways even more powers than ggplot2 because it enables you to make interactive graphs.
This is an example of a graph made by Altair just taken literally from the Altair example gallery where you're plotting in one graph like temperature over the year in Seattle, but then when you pan and brush, you can see different types of precipitation status over those time ranges.
And this all looks very complicated, but Altair actually makes this very easy, because it has those basic building blocks which you can easily tie together to make something like that. If you were trying to make this in something like D3, it would probably take, well, it would at least take me like a year and many thousands of lines of code. In Altair, this is just a little bit of Python code.
And the reason it's so powerful is that Altair is built on top of VegaLite, which implements this interactive grammar of graphics. VegaLite itself is a higher level abstraction built on top of something else, Vega. And all this was developed by the Interactive Data Lab at the University of Washington.
And just as a quick aside, since I'll be using some terms in this talk, here are some quick ggplot2-to-VegaLite terminology translations, because many more of you are familiar with ggplot2 than VegaLite. Where in ggplot2 you'd talk about a geom, in VegaLite that's a mark; an aesthetic is an encoding; and a stat is a transform. But the two have a lot of the same underlying concepts.
VegaLite is in JavaScript, and it uses a JSON specification to tell the plot what to draw and how: what the mark should be, what the encoding should be, and so on. So if you actually try to make a plot this way, it can get a little bit clunky. Building sort of nested lists isn't the most fun way to program.
It can be a little fiddly, you know, where exactly do you have to put all your curly braces and so on. And this is where a package like Altair is really powerful, because it builds an API on top of that so you don't have to worry too much about that complicated nested list. You can just use some ordinary Python methods to build up that plot.
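To make the "nested list" point concrete, here is a minimal sketch of a hand-written Vega-Lite specification, expressed as nested Python dicts and serialized to JSON. It is an illustration only, not something shown in the talk: the data values and the field names `month` and `rain` are made up, and the `$schema` URL assumes Vega-Lite version 4.

```python
import json

# A minimal Vega-Lite specification written out by hand as nested dicts/lists.
# The data values and field names ("month", "rain") are made up for illustration.
spec = {
    "$schema": "https://vega.github.io/schema/vega-lite/v4.json",
    "data": {
        "values": [
            {"month": "Jan", "rain": 5.6},
            {"month": "Feb", "rain": 3.5},
            {"month": "Mar", "rain": 3.7},
        ]
    },
    # The mark plays the role of a ggplot2 geom.
    "mark": "bar",
    # Each encoding channel plays the role of a ggplot2 aesthetic.
    "encoding": {
        "x": {"field": "month", "type": "nominal"},
        "y": {"field": "rain", "type": "quantitative"},
    },
}

# Serialize to the JSON text that a Vega-Lite renderer would consume.
spec_json = json.dumps(spec, indent=2)
print(spec_json)
```

Even for a three-bar chart, the author has to remember which key nests under which, which is the bookkeeping a higher-level API is meant to take over.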
There's also now in JavaScript a nice API to build up those plots called VegaLite API. And in a number of different languages there are really great packages. So what I want is an R API for making VegaLite charts that hopefully is of a similar caliber to some of those other ones that I just mentioned.
The tale of two PRs
And now some of you might be thinking, well, hasn't there been an R package that's even called VegaLite for many years? And you're absolutely right. It was developed by Bob Rudis, and it is in many ways a really great package. It provides a really nice API where, you know, it uses the pipe that people are so familiar with, where you can iteratively build up components of this plot.
The thing is it's not up to date with some of the latest and greatest VegaLite features including some of those really killer features like the interactivity and the view composition where you can put different subplots together. And the reason it's not necessarily up to date is because it's actually, you know, when you build this sort of interface to this other thing, it can be kind of hard to update over time because it's, VegaLite is pretty big at this point. And so trying to keep it up to date is a pretty big endeavor.
And I know that because I actually tried to update this package some years ago. So when I told you this would be a tale of two PRs, this is the first one I want to talk about.
And so, you know, the package was at VegaLite 1.0, and I wanted to update it to VegaLite 2.0. So I spent quite a lot of time going through it, and this was a massive PR. So just a pro tip: if you're a first-time contributor to a package, this is not what you should do. This is exactly the opposite of what you should do. I changed 88 files and like 48,000 or something lines of code.
You know, it took a while for it to actually get accepted. It did eventually make its way in. But, you know, one of the things I learned was that this took a lot of work. And the thing is, the VegaLite developers are really productive. They're continuing to make improvements. And so this method of trying to keep this R API in line with the changing VegaLite definitions was just going to be a Herculean task.
So I stepped back. I was like, okay, maybe I'll focus on other things. But then out of the blue one day, I got an email from Ian Lyttle. And he was embarking on a new effort, which was to actually make a port to VegaLite through Altair using the reticulate package. So in some ways it's a little funny using Python to get to JavaScript, but it actually works really well.
And he was working with Haley Jeppson and Heike Hofmann from Iowa. And this package is now on CRAN. It's also available on GitHub under the VegaWidget organization as altair. And so I was really excited by this effort, and I got involved, and I started to chip in to that package.
The VegaWidget ecosystem
I think my biggest contribution was actually suggesting to Ian that maybe he could split off one part of the package into something else, which is that he was using Altair to use some code to make that specification for what the chart should be. But then another big part of the code was just taking that specification and rendering it in R so you could actually see the plot. And I said, well, maybe we should make that a separate package to just do the render. And so that became the VegaWidget package.
So VegaWidget just takes a JSON spec and then uses the HTML widgets package in R to turn it into a plot. And this was primarily developed by Ian, who actually at this very minute is giving a talk in another room, which is kind of unfortunate timing.
But he's a really smart, very nice guy. You should go find him and talk to him if you're interested in any of this stuff. And myself and Stuart Lee helped with some of the JavaScript functionality in this package.
And so what this package also does is it provides interfaces from that HTML widget to other parts of the R ecosystem, in particular Shiny. So it has functions to help you listen to events, to connect to what in Vega world is called a signal, which is basically a dynamic data value, and also to let you update the data that's in the plot without completely re-rendering it.
What VegaWidget doesn't do is it doesn't help you make the spec. So it basically expects some JSON to be given to it, which in R form is going to be basically a big, complicated, nested list. So this isn't a package that you're usually going to want to use directly. You're not going to want to go and make this kind of list because it's a little bit clunky and you'd have to store in your head exactly how to make this kind of thing.
But this package is designed to be a helper to other packages, where those other packages would help you make the spec. And then they can rely on the shared interface to actually render it. So one of those packages is an R port to the Altair package, which uses Altair via reticulate to actually make that spec and then gets rendered.
Ian and Haley and others are now working on ggvega, which is a package to take a ggplot object, turn it into one of those specs, and then have it render. And then I've been working on vlbuildr, which is basically, in some ways, like a spiritual successor to that VegaLite package, which tries to make a native R API to make these specs, which then can be rendered via VegaWidget.
Goals of vlbuildr
So the goals of vlbuildr were these. I wanted a pipe-friendly approach for building up VegaLite specs. I wanted something that felt sort of natural to an R user who might be familiar with the tidyverse. I wanted to stay pretty close to the VegaLite spec, so not inventing new abstractions, not inventing new terminology, but just kind of taking what is in VegaLite and making it accessible in R. But maybe with a little syntactic sugar here and there, for when you have common things to do that might be a little clunky, to try to make those a little easier.
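As a rough illustration of what "pipe-friendly" means here, the sketch below builds a spec step by step, where each step takes a spec and returns a new one. It is written in Python with a small `pipe` helper standing in for R's pipe operator; the helper names `chart`, `mark`, `encode`, and `pipe` are hypothetical and are not the vlbuildr API.

```python
import copy
from functools import reduce


def chart(data_values):
    """Start a spec from inline data (hypothetical helper)."""
    return {"data": {"values": list(data_values)}}


def mark(kind, **props):
    """Return a step that sets the mark, e.g. mark("point")."""
    def step(spec):
        new = copy.deepcopy(spec)  # never mutate the incoming spec
        new["mark"] = {"type": kind, **props}
        return new
    return step


def encode(channel, field, type_):
    """Return a step that adds one encoding channel, e.g. encode("x", "a", "quantitative")."""
    def step(spec):
        new = copy.deepcopy(spec)
        new.setdefault("encoding", {})[channel] = {"field": field, "type": type_}
        return new
    return step


def pipe(value, *steps):
    """Apply steps left to right, playing the role of a pipe operator."""
    return reduce(lambda acc, step: step(acc), steps, value)


base = chart([{"a": 1, "b": 2}, {"a": 3, "b": 5}])
spec = pipe(
    base,
    mark("point"),
    encode("x", "a", "quantitative"),
    encode("y", "b", "quantitative"),
)
print(spec)
```

Because every step returns a fresh spec, partial charts such as `base` can be reused and extended in different directions, and the vocabulary (mark, encoding) stays the same as in the underlying specification.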
Another key thing I wanted was to have autocomplete and documentation help, so that, you know, if you're trying to add an encoding, you can see what the different parameters are that you can give to it. And that's something that actually is a little bit missing from the Altair package, because it's going through reticulate. To really use it effectively, you often end up having to go to the Python docs, and they're not quite as easy to access for an R user in RStudio.
And the most important thing, well, in my mind, developing this, was that it had to be fun to develop and easy to update as the VegaLite schema evolved, because I learned from that other PR that if you just try to build something up manually that tries to match VegaLite, it's going to take a lot of work when VegaLite evolves. And I didn't want to get caught in that same bind again.
Auto-generating the API from the schema
So here I took inspiration from Altair and the VegaLite API packages, because both of those approaches, in Python and JavaScript respectively, actually auto-generate big parts of their API using VegaLite itself. So VegaLite has a schema which describes what a VegaLite specification should look like. You can use that to build up an API. So I thought, why don't I try to do this in R too?
So just to go a little bit more into what a schema is and what I mean by that: the VegaLite schema is basically telling you, for that kind of JSON, that list object where you're specifying the different parts of the plot, what an acceptable format is for each of those parts. So for example, you might say here an aggregate can be one of these other three types, and then this type can be one of these 12 different values. Helpfully, it also includes descriptions for what different fields mean. So in terms of trying to build up the documentation, a lot of what you'd want as documentation is actually embedded in this schema document.
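The sketch below shows the kind of structure being described, using a heavily simplified, hand-written fragment in the style of the Vega-Lite JSON schema. The definition names follow the pattern of the real schema, but the enum list is shortened and the descriptions are paraphrased, so treat the contents as assumptions for illustration. The helper functions `resolve`, `enum_values`, and `property_docs` are hypothetical.

```python
# A simplified stand-in for part of the Vega-Lite JSON schema (not verbatim).
schema = {
    "definitions": {
        "Aggregate": {
            "anyOf": [
                {"$ref": "#/definitions/NonArgAggregateOp"},
                {"$ref": "#/definitions/ArgmaxDef"},
                {"$ref": "#/definitions/ArgminDef"},
            ]
        },
        "NonArgAggregateOp": {
            "type": "string",
            "enum": ["count", "mean", "median", "min", "max", "sum"],
        },
        "ArgmaxDef": {
            "type": "object",
            "properties": {
                "argmax": {"type": "string", "description": "Field to maximize."}
            },
        },
        "ArgminDef": {
            "type": "object",
            "properties": {
                "argmin": {"type": "string", "description": "Field to minimize."}
            },
        },
    }
}


def resolve(schema, ref):
    """Follow a local reference such as '#/definitions/Aggregate'."""
    if not ref.startswith("#/"):
        raise ValueError("only local references are supported: " + ref)
    node = schema
    for key in ref[2:].split("/"):
        node = node[key]
    return node


def enum_values(schema, node):
    """Collect every enum value reachable from a node via $ref and anyOf."""
    if "$ref" in node:
        return enum_values(schema, resolve(schema, node["$ref"]))
    if "enum" in node:
        return list(node["enum"])
    values = []
    for option in node.get("anyOf", []):
        values.extend(enum_values(schema, option))
    return values


def property_docs(schema, definition_name):
    """Map each property of a definition to its description text."""
    props = schema["definitions"][definition_name].get("properties", {})
    return {name: prop.get("description", "") for name, prop in props.items()}


aggregate = schema["definitions"]["Aggregate"]
print(len(aggregate["anyOf"]), "alternative types")
print(enum_values(schema, aggregate))
print(property_docs(schema, "ArgmaxDef"))
```

The same two pieces of information, the allowed shapes and the description strings, are what a generator needs in order to produce function parameters and their documentation.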
And so the approach I took for building an R API from a JSON schema was one of semi-automation. So early on, I was like, oh, maybe I'll try to automate it as much as possible. But at some point, like, making something very generalized and totally automated becomes a very hard problem. And so I try to strike the right balance between using some human curation and some automation.
So the human curation part is sort of identifying what families of functions I want. So I know that, you know, an encoding is going to be a very important part of making a VegaLite spec. I'm going to want a bunch of different functions that let you add different kinds of encoding. I'm also going to want that for a mark. So the mark is sort of like the geom in ggplot2. So I want, like, a vl_mark_point, vl_mark_bar, vl_mark_line, et cetera. And so I'm going to want all these different kinds of families of functions, and they're going to have this kind of format.
But in terms of figuring out exactly which ones I need and what's going to go inside them, what are the acceptable parameters, there I'm going to use automation. I'm going to pull those things out of the schema and plop them into my function rather than doing that manually by copying and pasting. I'm going to write functions to do that.
And so all of these functions to build functions live in a package within my package. If you look in vlbuildr, there's a build directory, and inside of that there's a package I call vlmetabuildr, which is the package that builds vlbuildr. The main function is one that's called just create API, and it takes the schema as an argument. So as the schema updates, as there's a new version of that schema, I can give it that new version, and it'll create a new API.
And so as an example of how to make a family of functions, that create API function calls something called create encoding functions, which will make all those encoding functions. It first pulls all the possible encodings: it takes a function that says, okay, this is where in the schema I'm going to look for them, and has a function to then pull out all the different options that would be available. It then iterates through all of them and makes the appropriate functions.
So here, actually, for each encoding, I want both a function that will add one to a spec and one that just makes that object on its own. You can see there's some deprecation, so I even have functions to generate the deprecated versions of my functions as well.
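The pattern described above, looking up the available encodings in the schema and then generating two functions for each, can be sketched roughly as follows. This is a Python illustration under stated assumptions, not the vlmetabuildr code: the schema fragment is made up, and the names `create_encoding_functions`, `make_adder_fn`, and `make_object_fn` are hypothetical.

```python
import copy

# Made-up stand-in for the part of a schema that lists encoding channels.
schema = {
    "definitions": {
        "Encoding": {
            "properties": {
                "x": {"description": "X coordinates of the marks."},
                "y": {"description": "Y coordinates of the marks."},
                "color": {"description": "Color of the marks."},
            }
        }
    }
}


def make_object_fn(channel):
    """Build a function that creates a standalone encoding object."""
    def make(**props):
        return dict(props)
    make.__name__ = channel + "_def"
    return make


def make_adder_fn(channel):
    """Build a function that adds the encoding for `channel` to a spec."""
    def add(spec, **props):
        new = copy.deepcopy(spec)  # leave the caller's spec untouched
        new.setdefault("encoding", {})[channel] = dict(props)
        return new
    add.__name__ = "encode_" + channel
    return add


def create_encoding_functions(schema):
    """Generate both function variants for every channel found in the schema."""
    channels = schema["definitions"]["Encoding"]["properties"]
    api = {}
    for channel in channels:
        api["encode_" + channel] = make_adder_fn(channel)
        api[channel + "_def"] = make_object_fn(channel)
    return api


api = create_encoding_functions(schema)
print(sorted(api))

spec = api["encode_x"]({"mark": "point"}, field="a", type="quantitative")
print(spec)
```

The human decision is the shape of the family (an "add to spec" variant and a "make object" variant); which channels exist is read from the schema rather than typed in by hand.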
And so, in terms of what a function to make a function looks like, here I have my create encoder function, which calls another function, because a lot of these actually share a lot of properties. So I have a generalized make function that takes in, most importantly, where to look in the schema. This is a path to a particular place in that specification for what can go in a particular part of a Vega-Lite spec, and it looks up what all the available options are.
And then there are a couple of extra things that I specify. As you can tell, one thing that gets used a lot here is glue. The glue package is really handy for this kind of effort, because you're basically trying to make a bunch of strings where you're putting different pieces together.
And so then it ends up making a function that also has documentation, because it actually pulls all those little pieces of documentation out of that schema object. And so I don't have to write any of it. I can just use what's already been created, but make it accessible in R in a way that the R user doesn't have to go to some website. They can just type in help and get it right there in their RStudio pane.
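The talk does this with the glue package in R. As a loose analogue, the following Python sketch uses `string.Template` to assemble the source text of a function, including documentation taken from description strings, and then evaluates it. The property names and descriptions are made-up stand-ins for what would be pulled from a schema, and `generate_mark_function` is a hypothetical helper.

```python
from string import Template

# Template for the source text of one generated function. The $placeholders are
# filled in per mark type, similar in spirit to glue-style string interpolation.
FUNCTION_TEMPLATE = Template('''\
def mark_$name(spec, $args):
    """Set the mark of a spec to '$name'.

$param_docs
    """
    props = {k: v for k, v in {$pairs}.items() if v is not None}
    return {**spec, "mark": {"type": "$name", **props}}
''')


def generate_mark_function(name, properties):
    """Return Python source for a mark function, documented from `properties`."""
    return FUNCTION_TEMPLATE.substitute(
        name=name,
        args=", ".join(f"{prop}=None" for prop in properties),
        param_docs="\n".join(f"    {prop}: {doc}" for prop, doc in properties.items()),
        pairs=", ".join(f'"{prop}": {prop}' for prop in properties),
    )


# Made-up stand-in for property descriptions pulled out of a schema.
point_properties = {
    "opacity": "Overall opacity of the mark, between 0 and 1.",
    "color": "Default color of the mark.",
}

source = generate_mark_function("point", point_properties)
print(source)

# Evaluate the generated source to obtain a real, documented function.
namespace = {}
exec(source, namespace)
mark_point = namespace["mark_point"]
print(mark_point({"data": {"values": []}}, opacity=0.5))
```

Generating source text, rather than only building closures at run time, means the output can be written to a file and picked up by ordinary documentation and autocomplete tooling, which matches the goal of having help available in the editor.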
Putting it to the test: VegaLite 4
This is just one of hundreds of functions generated by vlmetabuildr. Now, this package and the approach I took were recently put to the test, because in December Vega-Lite 4 came out, which added new features and new transforms. And so it was like, okay, well, previously I'd been using Vega-Lite 3 to generate my API. I want to make it compatible with version 4. Hopefully my automation will just take care of it, and I can just sit back and relax. And that actually did come to pass.
So it was another large PR, this time to my own repository, which helps a little bit. But behind the numbers, you know, 190 files changed and tens of thousands of lines of code, there was actually very little work on my part. This took literally like an hour or two, one day. When I first put in the new schema, one thing broke, I fixed that, and then all of a sudden it kind of just worked. And having gone through that earlier experience, that felt really great.
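To illustrate why a schema upgrade can be cheap under this approach, the sketch below derives the set of function names a generator would emit from two schema versions and compares them. The two fragments are made-up stand-ins, not the actual contents of the Vega-Lite 3 or 4 schemas, and `function_names` is a hypothetical helper.

```python
def function_names(schema):
    """Names of the transform functions a generator would emit for this schema."""
    options = schema["definitions"]["Transform"]["anyOf"]
    # Each option is a reference such as "#/definitions/FilterTransform".
    kinds = [option["$ref"].split("/")[-1] for option in options]
    return {"transform_" + kind[: -len("Transform")].lower() for kind in kinds}


# Illustrative "old" schema fragment with two kinds of transform.
schema_old = {
    "definitions": {
        "Transform": {
            "anyOf": [
                {"$ref": "#/definitions/AggregateTransform"},
                {"$ref": "#/definitions/FilterTransform"},
            ]
        }
    }
}

# Illustrative "new" schema fragment that adds one more kind of transform.
schema_new = {
    "definitions": {
        "Transform": {
            "anyOf": [
                {"$ref": "#/definitions/AggregateTransform"},
                {"$ref": "#/definitions/FilterTransform"},
                {"$ref": "#/definitions/RegressionTransform"},
            ]
        }
    }
}

added = function_names(schema_new) - function_names(schema_old)
removed = function_names(schema_old) - function_names(schema_new)
print("added:", sorted(added))
print("removed:", sorted(removed))
```

Nothing in the generator changes between the two runs; only the schema passed in differs, so new features show up as new generated functions.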
And you can kind of see that in some of the more detailed descriptions that GitHub will give you. So, in terms of all those lines of code changing, a lot of them were actually just in the schema file that gets included in the repository itself. But then the other thing is I had this file called autogen API, which is everything that gets auto-generated by vlmetabuildr, and lots of stuff changed there, but in most of my other files, very little changed.
And then in terms of which files were changing, actually a lot of the files changing were the documentation files, which all, pretty much all, are being, like, sort of auto-generated through auto-generated Roxygen documentation.
So, you know, if you're ever finding yourself writing a bunch of repetitive functions and documentation, which can often happen when you're writing a wrapper package, I'd recommend writing another package to write your package. It can actually end up being less work for you down the road.
And, of course, I'm not the only one that's had this idea. Other folks have, you know, put in place efforts to auto-generate code in R. So, Miles just recently asked a question about whether people were doing this, and there was a great thread of different people who were doing this, and he put some of those suggestions into place in a package he's developing. There's also a scaffolder package being developed that I think is actually trying to be a much more general solution to this, and I'm really excited about that package and other efforts in this vein.
Closing thoughts
And so, finally, I just want to leave you with a little bit of a reminder that this whole VegaWidget ecosystem exists in R. If you're interested in making interactive graphics using VegaLite in R, I'd recommend checking it out. At this point, the Altair and VegaWidget packages are fairly mature. They're both on CRAN. The other packages are still a little more experimental, and there's a great community of people involved in these efforts.
There's also, I think, opportunity for new APIs. If you're itching to make a higher-level API for interactive graphics, I think the VegaWidget package makes that effort easier, because if you can just make that VegaLite spec, it can then do the whole rendering part. And so I'm really excited to potentially see more experimentation in this space, because I think that, yeah, the VegaWidget package makes that kind of experimentation possible.
And so, yeah, that is my talk. And if you're interested, again, I'll point you to GitHub, where the package lives; all of those packages live in the VegaWidget organization on GitHub. And the link to the slides is in the README for vlbuildr. Thank you.
