Resources

image: thumbnail.jpg

Transcript

This transcript was generated automatically and may contain errors.

All right. Thank you all for coming to this session. This will be the tidyverse session, and I have the pleasure of introducing Karthik Ram for our first speaker. He'll be going over a guide to reproducible analysis.

All right. Thank you, Sean. Today I'm going to give you a talk about how to make your data science notebook or just about any R project that you might have a little bit more reproducible than it already is.

And I just want to point out that at the end of my talk I will give you a link to a GitHub repo that contains all of the links, the readme, other demos and tutorials and so on. So feel free to listen and not take notes just yet.

So this is a situation that I frequently find myself in. I find a cool project on GitHub and I try to clone it. I can see that they've got a few packages listed and a few datasets mentioned. But even if I'm able to install those packages, the code doesn't quite work correctly, I start debugging, things don't quite work, and I don't have a beautiful R Markdown document at the end. I just have a very, very sad GitHub repo with lots of errors, and then I just move on to something else.

The research compendium

And so the main thing I want to talk to you about today is this idea of a research compendium. This is an idea that was put forth back in 2004 by Robert Gentleman and Duncan Temple Lang. The general idea was that you can create a container for your data, your code and your text, but also take advantage of software engineering principles to make sure that this package or compendium that you have can easily be maintained, updated and then shared with everyone else.

And so what I want to encourage you to do after this talk is to go back and set up a compendium for your project as a way to make it easily shareable and useful to other people. Even though setting up a compendium was very complex and painful back in 2004, thanks to the tidyverse plus devtools and friends, it's a lot easier to do now.

And so Ben Marwick, Carl Boettiger and Lincoln Mullen wrote a very nice paper last year, a practical guide to setting up a research compendium, and I just want to talk through a few of the ideas that came out of that paper. The first one is that you should just organize your project in a way that makes sense to you. There are no hard and fast rules about how to set up a compendium.

Many of you attended Jim and Jenny's workshop over the last couple of days on how to set up your R projects. All the advice you got from there is excellent for a compendium. And so just do it in a way that makes sense for people that you might collaborate with. Once you've got that going for you, make sure you have all of your code and your data and all of the artifacts separate from the outputs that you might want to generate.

And then the last bit that is not necessarily about Docker itself, although I'll spend most of my time talking about Docker, is figure out some way to describe your computing environment as well as you can so that someone in the future can actually run your analysis or then extend it one way or another.

And so to make your compendium shareable and useful, you have to have four separate things. One is make sure there's a license that tells people how to use it. Make sure it's under some kind of version control, preferably Git. Make sure there's enough metadata as part of your compendium. And then the last thing, which is often ignored everywhere, is that you need to have a long-term archive. So GitHub is not a long-term archive, but Zenodo.org, for example, is a long-term place where you can deposit your compendium so that it continues to persist even after GitHub disappears or any other services disappear.


Using the R package structure

So it turns out that the R package structure is already a very fantastic way to organize a compendium, and there's not really very much that you need to do. Many of you in the room are package developers, and everyone in this room is a package user. And so if you've ever looked at the contents of an R package, you might see a DESCRIPTION file that looks like this. This is a standard Debian control format that contains a lot of generally useful metadata. It has a title, an author, a version number, a date, and then somewhere near the bottom it contains useful bits like the dependencies for this project.

But since everybody is binge-watching Marie Kondo on Netflix, I can make a Marie Kondo reference and say: Marie Kondo the shit out of this file, and just make it a very, very minimal DESCRIPTION file. You just need to say that it's a compendium, have a title, possibly a version number, and just your dependencies that are either on CRAN or CRAN plus GitHub.
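A minimal DESCRIPTION along those lines might look like this (the package name, title, author and dependencies are placeholders, so substitute your own):

```
Package: mycompendium
Title: Analysis of Example Field Observations
Version: 0.0.1
Authors@R: person("Jane", "Doe", role = c("aut", "cre"))
Description: A research compendium, not a package intended for CRAN.
License: MIT + file LICENSE
Imports:
    dplyr,
    ggplot2
Remotes:
    ropensci/piggyback
```

The `Imports` field lists CRAN dependencies, and the `Remotes` field (a devtools/remotes convention) covers packages that only live on GitHub.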

Simply adding this file to a project and calling it DESCRIPTION makes it a compendium, and you already have a research compendium now. And once you turn something into a compendium, you can take advantage of devtools and friends. So devtools can help you make sure that your package or your compendium can be installed. You can take advantage of all the tools that help you generate documentation and websites, and also link everything up to continuous integration services to make sure that your compendium is still working over time.

So how do you set up a compendium? It depends on the scale of your project. If you have a very simple analysis that is just a simple R Markdown notebook, you can have this very minimal structure: just a DESCRIPTION, a readme file, a license, and then the key bits here that are different from a package are that you have a folder containing your R Markdown file and then a data folder containing, say, a small amount of data.
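A minimal compendium of that kind might look like this (file names are illustrative):

```
mycompendium/
├── DESCRIPTION
├── README.md
├── LICENSE
├── analysis/
│   └── report.Rmd
└── data/
    └── raw_data.csv
```

That's the whole thing: metadata at the top level, the notebook in its own folder, and the data kept separate from any outputs the notebook generates.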

And this would work for a small project or a small report, but a lot of you might end up with a slightly more complex compendium. The only thing that's different here from the previous version is that you have decided to write a few custom functions for this project. This is a situation where these functions are not broadly useful outside the scope of this project, which is why you haven't written a package. So these exist within the compendium, and now you have a NAMESPACE file that captures some of these dependencies.

But for a lot of you, your actual real-world long-term research project is going to be far more complex. So in this case, you have taken all of your functions and moved them out into a separate package, except for a few that are critical only to this project. The new elements that you'll start to see are there because you have trouble keeping track of all the outputs you have to generate and all the scripts you have to deal with, so you need some sort of workflow. In this case, that is a Makefile. And then because your project's dependencies can change over time, packages can change over time, and the same package can have API changes over time, you need to capture your computing environment. In this case, that is a Dockerfile.

Managing data

So you will have to decide what level of complexity you want for your compendium. For the rest of my talk, I want to focus on three separate components of a compendium, how to think about them, and which tools to use for each purpose. The first is data. If you're from industry, I don't have any good answers or solutions for you; you've already got a data engineering team that can help you. The advice that I have is mostly for small to medium-sized projects. The second is how to isolate your computing environment. And the third is simple and complex workflows.

So I'll start with data. How do you manage data in the context of a small to medium-sized project? Well, a very simple answer is: if the data are small enough, just tabular data, you can stick them inside of your compendium itself. Because GitHub lets you have files up to 100 megabytes each, and a compendium is not a package that is going to go on CRAN, you can just have arbitrary data files. And if you want to keep things very useful, you can even have a data-raw folder that contains the raw data and then some scripts that turn it into useful data.

If you write a package alongside your compendium and you want it to go on CRAN, you can include data in there, but your overall package size cannot exceed 5 megabytes. Nick Tierney and I started poking around all the data that exists inside of CRAN packages, and right now about 37% of them have a ton of data in there, and some projects have hundreds of tabular data files that are shipped inside of an R package. So that is one simple, easy solution that is often overlooked.

Another one that I want to talk about is this fun little package by Carl Boettiger called piggyback. piggyback allows you to attach arbitrary files, any kind of files, up to 2 gigabytes each, and as many files as you would like, to a GitHub release. So what is a GitHub release? If you have written software before and you release a new version of your software, you can tag a release on GitHub. When you do that, you can attach binaries, and piggyback just allows you to attach more data files to that release. The functionality for this package is very, very simple: there are only five functions.

You can create a new release for your repo, tag it with a version number, and then just upload whatever files you would like. When you do that and go to the releases section of your GitHub repo, which is just your repo URL slash releases, you can see these files show up there. So you can set up your compendium in such a way that when you're running the code, all these files are downloaded straight back from GitHub. And because GitHub is a very fast CDN, you can get access to these files pretty easily, no matter where you're running your compendium.
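The workflow described above can be sketched in a few lines of R. This is a sketch only: the repo name, tag and file paths are placeholders, and you should check the current piggyback documentation for exact argument names.

```r
library(piggyback)

# Tag a new release on the GitHub repo (repo name is a placeholder)
pb_new_release("jane-doe/mycompendium", "v0.0.1")

# Attach a data file (up to 2 GB each) to that release
pb_upload("data/observations.csv",
          repo = "jane-doe/mycompendium",
          tag  = "v0.0.1")

# Later, from inside the compendium, pull the file back down
pb_download("data/observations.csv",
            repo = "jane-doe/mycompendium",
            tag  = "v0.0.1")
```

Putting the `pb_download()` call at the top of an analysis script is what makes the compendium self-rehydrating wherever it is cloned.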

If you've got a medium data situation where you have a large number of highly compressed text files that contain data locally, you can use this package called arkdb, which is built upon DBI, and it allows you to chunk the data, within your memory limits, in and out of any database backend that DBI supports. So this is a nice way to either get lots of data back out of a database, or, if you've already acquired data through some other means, a way to work within the limitations of your setup to get that data into a database.
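As a rough sketch of that pattern (file paths, the SQLite backend and the chunk size are all illustrative choices, not part of the talk):

```r
library(arkdb)
library(DBI)

# Any DBI backend works; SQLite is used here purely for illustration
con <- dbConnect(RSQLite::SQLite(), "local.sqlite")

# Load compressed text files into the database in memory-sized chunks
unark(list.files("data", pattern = "\\.tsv\\.bz2$", full.names = TRUE),
      con, lines = 50000)

# ...and archive database tables back out as compressed text files
ark(con, dir = "data", lines = 50000)

dbDisconnect(con)
```

The `lines` argument is the knob you tune to your memory limits: smaller chunks fit on smaller machines at the cost of more round trips.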

Isolating your computing environment

The next bit I want to talk about is isolating your computing environment, and this is really important because even I can never actually run any of my old analysis, especially things that involve ggplot2 from last year or the year before or the year before that. I have all the packages. The code looks the same, but nothing really runs.

This is a paper in Nature Biotechnology that talks about how having just different versions of R can produce entirely different results from your same analysis or your same code. So those of you that are familiar with Docker, Docker is a very nice way to just containerize your existing environment so that even if you change machines or upgrade machines or move organizations, you should still be able to run the same code with the same set of dependencies and then get the same output back.

And adding a Dockerfile to your existing R project is quite easy these days. There's an R package called containerit. It's not on CRAN yet, it's only on GitHub, but it can take any arbitrary folder or Git repo and generate a nice Dockerfile for you. The Jupyter project has a more general solution called repo2docker that will do this for a Python project, an R project, or projects that contain both.

And adding a Dockerfile to your project has many advantages. One very simple, straight-up one is that you can then Dockerize that particular project, launch a container, run your analysis, and then get back out. But I want to spend a few minutes talking about this fantastic open source project called Binder, or I should rather say BinderHub.

BinderHub is an open source project that allows you to launch a live notebook from any GitHub repo on demand and let someone else run the analysis, poke through your code, make some changes, and see the results for themselves. When they close the browser window, that particular instance is immediately killed off.

And because it's easier to show you how this thing works, here is an example real project from my collaborator Carl. I'm not on this particular collaboration. There's a button called Launch Binder, which once I click will either build a Docker image for the first time or, if it's already found an image from before, pull it out of the cache, then go on to install any additional R packages you might need that are not already captured in the Dockerfile, and then drop you into an RStudio server.

And as is par for the course with a live demo, it doesn't actually work as quickly as you expect it to. But in about 30 seconds, or maybe a few minutes, this will actually drop you into an RStudio server. I'm just going to let that run while I go back to my talk.

And so you end up in an RStudio server now, and here are the two beautiful things that are happening. All of the R packages that you need for this analysis have already been installed, and all of your scripts, data, and code have already been copied in. And there's nothing more to do other than to open up a notebook and then start running through it.

And the way this works is that JupyterHub does 80% of the work, and BinderHub just does a little bit of coordination. It reads in a repo and checks whether there's a Dockerfile; if there is, it checks whether there's already a Docker image. If not, it builds one, then hands it off to JupyterHub to allocate some resources and launch a server on demand. This service is free, and mybinder.org is an example of a BinderHub.

And so how do you set this up for your R project? Well, there are many different ways to do this, and in fact, I'm not going to go through all of them because that is going to be part of my readme. A simple way is just to add a text file called runtime.txt containing r- followed by a particular date that corresponds to the version of R you want to capture. This also means that the Rocker project will pull all the CRAN packages that correspond to this exact date. So even as your project moves along or your system moves along, everything will be locked in time.

And then you have another text file called install.R that just contains a list of package install commands. This is the simplest, most basic way to set this up. Then you can go to mybinder.org, paste the URL for your GitHub repo, click launch, add that Binder button, and then you should be good to go.
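Concretely, the pair of files might look like this (the date and the packages are placeholders, so substitute your own). First, runtime.txt pins R and the CRAN snapshot to a date:

```
r-2019-01-15
```

And install.R lists the packages to install:

```r
install.packages(c("dplyr", "ggplot2"))
```

repo2docker runs install.R once while building the image, so the packages are baked in rather than installed every time someone launches the notebook.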

So this is the basic approach: adding an install.R and a runtime.txt, and then if you need any Linux dependencies, you can add an apt.txt. The only downside is that it's very slow, and every time you change anything, it will take hours to rebuild your Docker image. But there's a faster way to do this: adding a Dockerfile using one of those two packages, containerit or repo2docker. And because we're going to use a Docker image that already contains the tidyverse, things will move along very quickly.

And then the pro option is, because now you're all experts on research compendia and you already have a DESCRIPTION file, you just need to add a very, very short Dockerfile. It could even be just four lines that say: start at this base image, move all my files into my container, and then install the few additional packages from my DESCRIPTION. For those of you that don't know about the Rocker project, it provides versioned Docker images of R, including ones that contain RStudio and specialized ones for geospatial data and so on.
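The four-line Dockerfile described above might look something like this. The image tag and paths are illustrative, and the install command is one reasonable way to pull the dependencies listed in DESCRIPTION, not necessarily the exact one from the talk:

```
# Start from a versioned Rocker image that already includes the tidyverse
FROM rocker/tidyverse:3.5.2

# Copy the compendium into the image
COPY . /home/rstudio/mycompendium

# Install the compendium itself, pulling dependencies from DESCRIPTION
RUN R -e "devtools::install('/home/rstudio/mycompendium', dependencies = TRUE)"
```

Because the base image is pinned to an R version, the CRAN packages it ships are frozen too, which is what keeps the analysis reproducible over time.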

Workflows with Drake

The last bit I want to talk about is workflows. I imagine a lot of people in this room are quite familiar with Makefiles, and Karl Broman, who's here in this room, has a fantastic tutorial on Makefiles. Having a Makefile somewhere inside of your compendium allows you to keep track of all of your scripts and all of your inputs and outputs. But those of you that have used Makefiles probably know that they can get cumbersome very, very quickly.

The wildcard syntax is not intuitive to remember. Like I have to Google this thing every damn time. And then it generates so much output that it just clutters your compendium every single time you run something.

So I want to give a shout out to a fantastic project called Drake, which is also part of rOpenSci. The author of this project, Will Landau, is in the room somewhere, and you should go talk to him if you're interested in Drake. If you want to think of Drake as a piece of software that mimics Make but is very R-centric, that's what Drake is. Drake stands for Data Frames in R for Make.

And of course, it is impossible for me to do a demo in the amount of time that I have left. But I want to mention that Drake allows you to generate these very, very complex Makefile-style workflows without writing them out. It does beautiful wildcard expansions and takes advantage of numerous parallel backends, including things like future. It gives you a nice dependency graph with all of your inputs expanded, and it will also estimate runtimes. And because it uses a package called storr (S-T-O-R-R) as the backend, it does not clutter your entire environment and allows you to quickly read outputs back as and when you need them.
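A tiny Drake workflow along those lines might look like this. This is a sketch, not code from the talk: the target names and file paths are placeholders, and it reflects the `drake_plan()`/`make()` API as of the time of the talk.

```r
library(drake)

# Declare targets; drake infers the dependency graph from the code itself
plan <- drake_plan(
  raw     = readr::read_csv(file_in("data/observations.csv")),
  cleaned = dplyr::filter(raw, !is.na(value)),
  model   = lm(value ~ group, data = cleaned),
  report  = rmarkdown::render(knitr_in("analysis/report.Rmd"))
)

make(plan)    # builds only the targets that are out of date
readd(model)  # read one result back from drake's storr cache
```

`make()` is the Drake analogue of running `make`: rerun it after editing the cleaning step and only `cleaned`, `model` and `report` rebuild, while `raw` comes straight from the cache.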

And this is an example of the beautiful dependency graph that it generates. It's an interactive graph; you can actually see runtimes, missing objects, and all the relationships in your workflow. So having Drake as part of your compendium instead of a Makefile can be a nice way for you to scale to very large projects. The author actually developed this for a real-world use case, a very complex Bayesian analysis project.

Take-home messages

So I have two take-home messages for you. One is that turning your project into a research compendium is as simple as throwing a DESCRIPTION file in there, and it can be a very, very minimal DESCRIPTION file: just tiny bits of metadata and your dependencies. And then you can use modern tools like Binder to launch notebooks that make your analysis more accessible to people trying to understand it.

But at the same time, all these solutions like Binder and piggyback are designed to help you, and I cannot guarantee that GitHub will remain around forever or that Binder will remain around forever. Having your data exported in some flat file format and deposited in Zenodo would be a nice long-term archive; for the short term, piggyback will help you in your analysis pretty quickly. Same thing with Binder: Binder will just look for the Dockerfile and launch a beautiful RStudio server, and if Binder ever goes away, you still have a Dockerfile, and I think Docker will still be around for another five years.

And then Drake also conveniently exports the entire workflow into just basic R, so if at some point you have a collaborator who hates Drake, you can just pull the whole thing out. So that is my talk. This is where I would like to get you to: everything beautifully works, and the last bit is a Dockerfile in there. I also have a very detailed readme here that includes a lot of links and other material. So thank you.