Resources

image: thumbnail.jpg

Transcript

This transcript was generated automatically and may contain errors.

All right. Thank you all for coming to this session. This will be the tidyverse session, and I have the pleasure of introducing Karthik Ram for our first speaker. He'll be going over a guide to reproducible analysis.

All right. Thank you, Sean. Today I'm going to give you a talk about how to make your data science notebook or just about any R project that you might have a little bit more reproducible than it already is.

And I just want to point out that at the end of my talk I will give you a link to a GitHub repo that contains all of the links, the readme, other demos and tutorials and so on. So feel free to listen and not take notes just yet.

So this is a situation that I frequently find myself in. I find a cool project on GitHub and I try to clone it. I can see that they've got a few packages listed and a few datasets mentioned. But even if I'm able to install those packages, the code doesn't quite work correctly, I start debugging, things don't quite work, and I don't have a beautiful R Markdown document at the end. I just have a very, very sad GitHub repo with lots of errors, and then I just move on to something else.

The research compendium

And so the main thing I want to talk to you about today is this idea of a research compendium. This is an idea that was put forth back in 2004 by Robert Gentleman and Duncan Temple Lang. The general idea was that you can create a container for your data, your code and your text, but also take advantage of software engineering principles to make sure that this package or compendium that you have can easily be maintained, updated and then shared with everyone else.

And so what I want to encourage you to do after this talk is to go back and set up a compendium for your project as a way to make it easily shareable and useful to other people. Even though setting up a compendium was very complex and painful back in 2004, thanks to the tidyverse plus devtools and friends, it's a lot easier to do now.

And so Ben Marwick, Carl Boettiger and Lincoln Mullen wrote a very nice paper last year, a practical guide to setting up a research compendium, and I just want to talk through a few of the ideas that came out of that paper. The first one is that you should just organize your project in a way that makes sense to you. There are no hard and fast rules about how to set up a compendium.

Many of you attended Jim and Jenny's workshop over the last couple of days on how to set up your R projects. All the advice you got from there is excellent for a compendium. And so just do it in a way that makes sense for people that you might collaborate with. Once you've got that going for you, make sure you have all of your code and your data and all of the artifacts separate from the outputs that you might want to generate.

And then the last bit that is not necessarily about Docker itself, although I'll spend most of my time talking about Docker, is figure out some way to describe your computing environment as well as you can so that someone in the future can actually run your analysis or then extend it one way or another.

And so to make your compendium shareable and useful, you have to have four separate things. One is make sure there's a license that tells people how to use it. Make sure it's under some kind of version control, preferably Git. Make sure there's enough metadata as part of your compendium. And then the last thing, which is often ignored everywhere, is that you need to have a long-term archive. So GitHub is not a long-term archive, but Zenodo.org, for example, is a long-term place where you can deposit your compendium so that it continues to persist even after GitHub disappears or any other services disappear.


Using the R package structure

So it turns out that the R package structure is already a very fantastic way to organize a compendium, and there's not really very much that you need to do. Many of you in the room are package developers, and everyone in this room is a package user. And so if you've ever looked at the contents of an R package, you might see a DESCRIPTION file that looks like this. This is a standard Debian control format that contains a lot of generally useful metadata. It has a title, an author, a version number, a date, and then somewhere near the bottom it contains useful bits like the dependencies for this project.

But since everybody is binge-watching Marie Kondo on Netflix, I can make a Marie Kondo reference and say: Marie Kondo the shit out of this file, and just make it a very, very minimal DESCRIPTION file. You just need to say that it's a compendium, have a title, possibly a version number, and just your dependencies that are either on CRAN or CRAN plus GitHub.
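A minimal DESCRIPTION along those lines might look like this (the package name, title, author and dependencies are placeholders, so substitute your own):

```
Package: mycompendium
Title: Analysis of Example Field Observations
Version: 0.0.1
Authors@R: person("Jane", "Doe", role = c("aut", "cre"))
Description: A research compendium, not a package intended for CRAN.
License: MIT + file LICENSE
Imports:
    dplyr,
    ggplot2
Remotes:
    ropensci/piggyback
```

The `Imports` field lists CRAN dependencies, and the `Remotes` field (a devtools/remotes convention) covers packages that only live on GitHub.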

Simply adding this file to a project and calling it DESCRIPTION makes it a compendium, and you already have a research compendium now. And once you turn something into a compendium, you can take advantage of devtools and friends. So devtools can help you make sure that your package or your compendium can be installed. You can take advantage of all the tools that help you generate documentation and websites, and also link everything up to continuous integration services to make sure that your compendium is still working over time.

So how do you set up a compendium? It depends on the scale of your project. If you have a very simple analysis that is just a simple R Markdown notebook, you can have this very minimal structure: just a DESCRIPTION, a readme file, a license, and then the key bits here that are different from a package are that you have a folder containing your R Markdown file and then a data folder containing, say, a small amount of data.
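A minimal compendium of that kind might look like this (file names are illustrative):

```
mycompendium/
├── DESCRIPTION
├── README.md
├── LICENSE
├── analysis/
│   └── report.Rmd
└── data/
    └── raw_data.csv
```

That's the whole thing: metadata at the top level, the notebook in its own folder, and the data kept separate from any outputs the notebook generates.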

And this would work for a small project or a small report, but a lot of you might end up with a slightly more complex compendium. The only thing that's different here from the previous version is that you have decided to write a few custom functions for this project. This is a situation where these functions are not broadly useful outside the scope of this project, which is why you haven't written a package. So these exist within the compendium, and now you have a NAMESPACE file that captures some of these dependencies.

But for a lot of you, your actual real-world long-term research project is going to be far more complex. So in this case, you have taken all of your functions and moved them out into a separate package, except for a few that are critical only to this project. The new elements that you'll start to see are there because you have trouble keeping track of all the outputs you have to generate and all the scripts you have to deal with, so you need some sort of workflow. In this case, that is a Makefile. And then because your project's dependencies can change over time, packages can change over time, and the same package can have API changes over time, you need to capture your computing environment. In this case, that is a Dockerfile.

Managing data

So you will have to decide what level of complexity you want for your compendium. For the rest of my talk, I want to focus on three separate components of a compendium, how to think about them, and which tools to use for each purpose. The first is data. If you're from industry, I don't have any good answers or solutions for you; you've already got a data engineering team that can help you. The advice that I have is mostly for small to medium-sized projects. The second is how to isolate your computing environment. And the third is simple and complex workflows.

So I'll start with data. How do you manage data in the context of a small to medium-sized project? Well, a very simple answer is: if the data are small enough, just tabular data, you can stick them inside of your compendium itself. Because GitHub lets you have files up to 100 megabytes each, and a compendium is not a package that is going to go on CRAN, you can just have arbitrary data files. And if you want to keep things very useful, you can even have a data-raw folder that contains the raw data and then some scripts that turn it into useful data.

If you write a package alongside your compendium and you want it to go on CRAN, you can include data in there, but your overall package size cannot exceed 5 megabytes. Nick Tierney and I started poking around all the data that exists inside of CRAN packages, and right now about 37% of them have a ton of data in there, and some projects have hundreds of tabular data files that are shipped inside of an R package. So that is one simple, easy solution that is often overlooked.

Another one that I want to talk about is this fun little package by Carl Boettiger called piggyback. piggyback allows you to attach arbitrary files, any kind of files, up to 2 gigabytes each, and as many files as you would like, to a GitHub release. So what is a GitHub release? If you have written software before and you release a new version of your software, you can tag a release on GitHub. When you do that, you can attach binaries, and piggyback just allows you to attach more data files to that release. The functionality for this package is very, very simple: there are only five functions.

You can create a new release for your repo, tag it with a version number, and then just upload whatever files you would like. When you do that and go to the releases section of your GitHub repo, which is just your repo URL slash releases, you can see these files show up there. So you can set up your compendium in such a way that when you're running the code, all these files are downloaded straight back from GitHub. And because GitHub is a very fast CDN, you can get access to these files pretty easily, no matter where you're running your compendium.
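The workflow described above can be sketched in a few lines of R. This is a sketch only: the repo name, tag and file paths are placeholders, and you should check the current piggyback documentation for exact argument names.

```r
library(piggyback)

# Tag a new release on the GitHub repo (repo name is a placeholder)
pb_new_release("jane-doe/mycompendium", "v0.0.1")

# Attach a data file (up to 2 GB each) to that release
pb_upload("data/observations.csv",
          repo = "jane-doe/mycompendium",
          tag  = "v0.0.1")

# Later, from inside the compendium, pull the file back down
pb_download("data/observations.csv",
            repo = "jane-doe/mycompendium",
            tag  = "v0.0.1")
```

Putting the `pb_download()` call at the top of an analysis script is what makes the compendium self-rehydrating wherever it is cloned.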

If you've got a medium data situation where you have a large number of highly compressed text files that contain data locally, you can use this package called arkdb, which is built upon DBI, and it allows you to chunk the data, within your memory limits, in and out of any database backend that DBI supports. So this is a nice way to either get lots of data back out of a database, or, if you've already acquired data through some other means, a way to work within the limitations of your setup to get that data into a database.
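As a rough sketch of that pattern (file paths, the SQLite backend and the chunk size are all illustrative choices, not part of the talk):

```r
library(arkdb)
library(DBI)

# Any DBI backend works; SQLite is used here purely for illustration
con <- dbConnect(RSQLite::SQLite(), "local.sqlite")

# Load compressed text files into the database in memory-sized chunks
unark(list.files("data", pattern = "\\.tsv\\.bz2$", full.names = TRUE),
      con, lines = 50000)

# ...and archive database tables back out as compressed text files
ark(con, dir = "data", lines = 50000)

dbDisconnect(con)
```

The `lines` argument is the knob you tune to your memory limits: smaller chunks fit on smaller machines at the cost of more round trips.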

Isolating your computing environment

The next bit I want to talk about is isolating your computing environment, and this is really important because even I can never actually run any of my old analysis, especially things that involve ggplot2 from last year or the year before or the year before that. I have all the packages. The code looks the same, but nothing really runs.

This is a paper in Nature Biotechnology that talks about how having just different versions of R can produce entirely different results from your same analysis or your same code. So those of you that are familiar with Docker, Docker is a very nice way to just containerize your existing environment so that even if you change machines or upgrade machines or move organizations, you should still be able to run the same code with the same set of dependencies and then get the same output back.

And adding a Dockerfile to your existing R project is quite easy these days. There's an R package called containerit. It's not on CRAN yet, it's only on GitHub, but it can take any arbitrary folder or Git repo and generate a nice Dockerfile for you. The Jupyter project has a more general solution called repo2docker that will do this for a Python project, an R project, or projects that contain both.

And adding a Dockerfile to your project has many advantages. One very simple, straight-up one is that you can then Dockerize that particular project, launch a container, run your analysis, and then get back out. But I want to spend a few minutes talking about this fantastic open source project called Binder, or I should rather say BinderHub.

BinderHub is an open source project that allows you to launch a live notebook from any GitHub repo on demand and let someone else run the analysis, poke through your code, make some changes, and see the results for themselves. When they close the browser window, that particular instance is immediately killed off.

And because it's easier to show you how this thing works, here is an example real project from my collaborator Carl. I'm not on this particular collaboration. There's a button called Launch Binder, which once I click will either build a Docker image for the first time or, if it's already found an image from before, pull it out of the cache, then go on to install any additional R packages you might need that are not already captured in the Dockerfile, and then drop you into an RStudio server.

And as is par for the course with a live demo, it doesn't actually work as quickly as you expect it to. But in about 30 seconds, or maybe a few minutes, this will actually drop you into an RStudio server. I'm just going to let that run while I go back to my talk.

And so you end up in an RStudio server now, and here are the two beautiful things that are happening. All of the R packages that you need for this analysis have already been installed, and all of your scripts, data, and code have already been copied in. And there's nothing more to do other than to open up a notebook and then start running through it.

And the way this works is that JupyterHub does 80% of the work, and BinderHub just does a little bit of coordination. It reads in a repo and checks whether there's a Dockerfile; if there is, it checks whether there's already a Docker image. If not, it builds one, then hands it off to JupyterHub to allocate some resources and launch a server on demand. This service is free, and mybinder.org is an example of a BinderHub.

And so how do you set this up for your R project? Well, there are many different ways to do this, and in fact, I'm not going to go through all of them because that is going to be part of my readme. A simple way is just to add a text file called runtime.txt containing r- followed by a particular date that corresponds to the version of R you want to capture. This also means that the Rocker project will pull all the CRAN packages that correspond to this exact date. So even as your project moves along or your system moves along, everything will be locked in time.

And then you have another text file called install.R that just contains a list of package install commands. This is the simplest, most basic way to set this up. Then you can go to mybinder.org, paste the URL for your GitHub repo, click launch, add that Binder button, and then you should be good to go.
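Concretely, the pair of files might look like this (the date and the packages are placeholders, so substitute your own). First, runtime.txt pins R and the CRAN snapshot to a date:

```
r-2019-01-15
```

And install.R lists the packages to install:

```r
install.packages(c("dplyr", "ggplot2"))
```

repo2docker runs install.R once while building the image, so the packages are baked in rather than installed every time someone launches the notebook.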

So this is the basic approach: adding an install.R and a runtime.txt, and then if you need any Linux dependencies, you can add an apt.txt. The only downside is that it's very slow, and every time you change anything, it will take hours to rebuild your Docker image. But there's a faster way to do this: adding a Dockerfile using one of those two packages, containerit or repo2docker. And because we're going to use a Docker image that already contains the tidyverse, things will move along very quickly.

And then the pro option is, because now you're all experts on research compendia and you already have a DESCRIPTION file, you just need to add a very, very short Dockerfile. It could even be just four lines that say: start at this base image, move all my files into my container, and then install the few additional packages from my DESCRIPTION. For those of you that don't know about the Rocker project, it provides versioned Docker images of R, including ones that contain RStudio and specialized ones for geospatial data and so on.
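The four-line Dockerfile described above might look something like this. The image tag and paths are illustrative, and the install command is one reasonable way to pull the dependencies listed in DESCRIPTION, not necessarily the exact one from the talk:

```
# Start from a versioned Rocker image that already includes the tidyverse
FROM rocker/tidyverse:3.5.2

# Copy the compendium into the image
COPY . /home/rstudio/mycompendium

# Install the compendium itself, pulling dependencies from DESCRIPTION
RUN R -e "devtools::install('/home/rstudio/mycompendium', dependencies = TRUE)"
```

Because the base image is pinned to an R version, the CRAN packages it ships are frozen too, which is what keeps the analysis reproducible over time.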

Workflows with Drake

The last bit I want to talk about is workflows. I imagine a lot of people in this room are quite familiar with Makefiles, and Karl Broman, who's here in this room, has a fantastic tutorial on Makefiles. Having a Makefile somewhere inside of your compendium allows you to keep track of all of your scripts and all of your inputs and outputs. But those of you that have used Makefiles probably know that they can get cumbersome very, very quickly.

The wildcard syntax is not intuitive to remember. Like I have to Google this thing every damn time. And then it generates so much output that it just clutters your compendium every single time you run something.

So I want to give a shout out to a fantastic project called Drake, which is also part of rOpenSci. The author of this project, Will Landau, is in the room somewhere, and you should go talk to him if you're interested in Drake. If you want to think of Drake as a piece of software that mimics Make but is very R-centric, that's what Drake is. Drake stands for Data Frames in R for Make.

And of course, it is impossible for me to do a demo in the amount of time that I have left. But I want to mention that Drake allows you to generate these very, very complex Makefile-style workflows without writing them out. It does beautiful wildcard expansions and takes advantage of numerous parallel backends, including things like future. It gives you a nice dependency graph with all of your inputs expanded, and it will also estimate runtimes. And because it uses a package called storr (S-T-O-R-R) as the backend, it does not clutter your entire environment and allows you to quickly read outputs back as and when you need them.
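A tiny Drake workflow along those lines might look like this. This is a sketch, not code from the talk: the target names and file paths are placeholders, and it reflects the `drake_plan()`/`make()` API as of the time of the talk.

```r
library(drake)

# Declare targets; drake infers the dependency graph from the code itself
plan <- drake_plan(
  raw     = readr::read_csv(file_in("data/observations.csv")),
  cleaned = dplyr::filter(raw, !is.na(value)),
  model   = lm(value ~ group, data = cleaned),
  report  = rmarkdown::render(knitr_in("analysis/report.Rmd"))
)

make(plan)    # builds only the targets that are out of date
readd(model)  # read one result back from drake's storr cache
```

`make()` is the Drake analogue of running `make`: rerun it after editing the cleaning step and only `cleaned`, `model` and `report` rebuild, while `raw` comes straight from the cache.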

And this is an example of the beautiful dependency graph that it generates. It's an interactive graph; you can actually see runtimes, missing objects, and all the relationships in your workflow. So having Drake as part of your compendium instead of a Makefile can be a nice way for you to scale to very large projects. The author actually developed this for a real-world use case, a very complex Bayesian analysis project.

Take-home messages

So I have two take-home messages for you. One is that turning your project into a research compendium is as simple as throwing a DESCRIPTION file in there, and it can be a very, very minimal DESCRIPTION file: just tiny bits of metadata and your dependencies. And then you can use modern tools like Binder to launch notebooks that make your analysis more accessible to people trying to understand it.

But at the same time, all these solutions like Binder and piggyback are designed to help you, and I cannot guarantee that GitHub will remain around forever or that Binder will remain around forever. Having your data exported in some flat file format and deposited in Zenodo would be a nice long-term archive; for the short term, piggyback will help you in your analysis pretty quickly. Same thing with Binder: Binder will just look for the Dockerfile and launch a beautiful RStudio server, and if Binder ever goes away, you still have a Dockerfile, and I think Docker will still be around for another five years.

And then Drake also conveniently exports the entire workflow into just basic R, so if at some point you have a collaborator who hates Drake, you can just pull the whole thing out. So that is my talk. This is where I would like to get you to: everything beautifully works, and the last bit is a Dockerfile in there. I also have a very detailed readme here that includes a lot of links and other material. So thank you.