Kevin Ushey | renv: Project Environments for R | RStudio (2020)

Transcript

This transcript was generated automatically and may contain errors.

Our next speaker comes to us from RStudio, here to talk to us about renv. Please join me in welcoming Kevin Ushey.

Thank you for coming to my talk. My name is Kevin Ushey, and today I'm here to talk about renv.

So before I talk about renv, first I want to talk about the motivation and the problem that we're trying to solve.

So have you ever finished a project, come back, say, a year later, and asked, why is my dplyr pipeline suddenly throwing an error? I swear it worked before. What happened to my ggplot2 plots? Why are the bars upside down? I swear that worked before. Okay, I reran my analysis, but now nlme is complaining about model convergence? I swear that worked before.

And so my goal with renv is to make sure that this doesn't happen again.

Packrat and its limitations

Now before we talk about renv, I want to talk about Packrat. So first, can I get a quick show of hands who has heard of or used Packrat?

So what is Packrat? And I'm going to borrow a quote from Douglas Adams: "The story so far: in the beginning, Packrat was created. This has made a lot of people very angry and been widely regarded as a bad move."

This is kind of tongue-in-cheek in that, you know, you can have success with Packrat, but it is not a pit of success. It works, but for the average user, and especially if you're a new R user, if you tried using it, you probably ran into a roadblock, it probably fell over on its face. And even if you're an advanced R user, when things go south, it's very challenging to figure out how to get yourself out of that hole.

So renv's goal, ultimately, is to be a better Packrat.

By using project local libraries, you can rest assured that upgrading packages in one project will not risk breaking your other projects.

The renv workflow

And so if you've seen Packrat, this will feel familiar. Once again, the API is quite similar, but I'll walk through it. The first step in activating renv for a project is just calling renv::init(). And one thing that it does that is different from Packrat is it actually forks the state of your current world of R libraries into a project-local library.

So the idea is you're working on a project, you have it working. Say you haven't adopted renv as your default workflow or you don't manage your library paths in a specific way, but you're saying, okay, the state of the world is good right now, I want to capture that in this project. So what you want to do is call renv::init(), and it takes the packages you have installed in that library and moves them into the project library. So it's like forking that state into your project.

The other thing, which is infrastructure-related, is we create a project-local .Rprofile. And that basically makes sure that every time you start R in that session, renv automatically makes sure you use that project library. So you don't have to worry about managing your library paths. renv does it for you automatically when you launch R.

And this mirrors what I said before around making sure that after you call renv::init(), you can work exactly the same as you were before, with the only difference being that now you have your own project-local library.

There are two main differences you'll see after you call renv::init() in a project. The first, you'll see a nice little banner printed saying, like, OK, renv has activated this project for you, and this is the version of renv that it's using. And the second thing is, as I've said, the library paths will now be changed.
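As a rough sketch of that first step, assuming the renv package is already installed: renv::init() and .libPaths() are real functions, while the helper name activate_project is made up here for illustration. The project .Rprofile that renv writes typically contains a single line, source("renv/activate.R").

```r
# A minimal sketch, not the only way to do this.
# Assumes renv is installed, e.g. via install.packages("renv").
# `activate_project` is a hypothetical helper name.
activate_project <- function(project = getwd()) {
  # Set up the project-local library, the lock file, and the project
  # .Rprofile. restart = FALSE asks renv not to restart the R session.
  renv::init(project = project, restart = FALSE)

  # After activation, the first library path should point inside the
  # project rather than at the shared user library.
  .libPaths()
}
```

Calling activate_project() from the project directory and comparing the result with .libPaths() from a fresh R session elsewhere is one way to see the library paths change.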

Snapshot and restore

So that is the first goal, making it easy to use project local libraries. The second goal is to make it easy to save and load the state of those libraries. Or in the parlance of renv, snapshot to save the state of a library and restore to reinstall those packages into a library.

You can capture the state of your project library using snapshot. And so when you call snapshot, it'll give you a little log of what's changing, of what dependencies it's capturing when you call snapshot. So in this case, I'm just capturing three packages for a project, let's say, and it's capturing from CRAN the markdown package, the rmarkdown package, and the yaml package.

And this star is just a placeholder to say, we didn't know about this before. Now we know that we're using version 1.1 of markdown, 2.1 of the rmarkdown package, and 2.2.0 of the yaml package. And it asks, do you want to proceed? Yes, please write those. And it writes it to this thing we call the lock file.

So we're serializing the state of your project library to a file called a lock file, which we'll name on the file system as renv.lock. So this thing is just a text file. It's a shopping list. It's the packages that you had installed in your library, their versions, and where they came from.
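To make the shopping-list idea concrete, here is a sketch that models the three packages from the snapshot example as a plain R list. The real renv.lock is a JSON text file written by renv, and it records more than is shown here (for example the R version and the repositories in use). The object names lock and shopping_list are made up for illustration.

```r
# Illustrative model of a lock file's package entries; the real file is
# JSON on disk and is written by renv, not by hand.
lock <- list(
  Packages = list(
    markdown = list(Package = "markdown", Version = "1.1",
                    Source = "Repository", Repository = "CRAN"),
    rmarkdown = list(Package = "rmarkdown", Version = "2.1",
                     Source = "Repository", Repository = "CRAN"),
    yaml = list(Package = "yaml", Version = "2.2.0",
                Source = "Repository", Repository = "CRAN")
  )
)

# Flatten the entries into one row per package: name, version, origin.
shopping_list <- do.call(rbind, lapply(lock$Packages, function(p) {
  data.frame(package = p$Package, version = p$Version,
             source = p$Repository, stringsAsFactors = FALSE)
}))

print(shopping_list, row.names = FALSE)
```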

So why do we want these lock files? There's three cases that I outlined specifically here. One is for time capsules. This is the case where, OK, I'm done with a project. I want to make sure that this project still runs a year from now. To do that, I want to make sure I know what packages I have, and I want to make sure those packages exist over time so I can revisit later.

For collaborative workflows, if you're working with a set of collaborators, you probably want to make sure everyone is working with the same environment. If people have different machines, one way to make sure you synchronize the environment is to share the lock file, have everyone restore from that lock file, and that way you're sure everyone's using the same set of R packages.

And also for deployment. So if you're working on something locally and it works on your machine, you might use renv to give yourself some extra guarantees that it works on the machine you're deploying on by replicating your library on that other machine.

So given a lock file, you create your lock file with snapshot, and you restore the state of your library later with restore. And so restore will give you a very similar kind of output. It'll tell you what packages are going to be installed, how they differ from the package that was already installed, if any, and what's the version.
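The save-and-load cycle might be sketched like this, assuming renv is installed and the project has already been initialized. renv::snapshot() and renv::restore() are the real functions; the wrapper names save_state and restore_state are hypothetical.

```r
# A minimal sketch of the snapshot/restore cycle.
# `save_state` and `restore_state` are hypothetical helper names.
save_state <- function(project = getwd()) {
  # Record the packages in the project library, with their versions and
  # sources, in the project's renv.lock file.
  renv::snapshot(project = project)
}

restore_state <- function(project = getwd()) {
  # Reinstall the packages listed in renv.lock into the project library.
  renv::restore(project = project)
}
```

In the collaborative case described above, one person would run save_state() and share renv.lock, and everyone else would run restore_state() in their copy of the project.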

And fortunately, renv is able to restore from a number of sources. If you've used remotes or devtools, it basically understands all those same sources: CRAN, Bioconductor, GitHub, GitLab, Bitbucket. If you have another remote source that you'd like renv to support, you can let me know on the GitHub issue tracker, and we can see if that can become a possibility.
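The talk only describes these sources in the context of restore. One way to try them directly is renv::install(), which is not mentioned in the talk but accepts remotes-style package specifications. The sketch below assumes renv is installed and network access is available; the names example_specs and install_examples are made up for illustration.

```r
# Remotes-style specifications, each naming a different source.
# `example_specs` and `install_examples` are hypothetical names.
example_specs <- c(
  cran_version = "markdown@1.1",   # a specific version from CRAN
  github       = "rstudio/renv",   # a GitHub repository (user/repo)
  bioconductor = "bioc::Biobase"   # a Bioconductor package
)

install_examples <- function(specs = example_specs) {
  # Install each specification into the active library, one at a time.
  for (spec in specs) {
    renv::install(spec)
  }
  invisible(specs)
}
```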

Another big thing is I've made an effort to ensure that you can authenticate with private repositories. So if you have your own, say, CRAN-like repository hidden behind some kind of authentication mechanism, renv has some tools to make it easier to authenticate.

So these are the three main features of renv, and I think these are the three things you need to know if you want to get started. The rest of the talk will be kind of extra on top of that. Init to initialize your project, give yourself a project local library, snapshot to save the state of your library once you are ready to save the state, and restore if you need to restore your project library from a previously generated lock file.

The global cache

And so one major issue with project-local libraries is the duplication of identical packages across projects. So if you've used Packrat before, you've probably seen this issue before. So if you had ten projects that used dplyr 0.9.2, then you have ten project libraries that also have dplyr 0.9.2 installed. And this can be costly, both in terms of disk space to use and installation time, especially if, say, you're on a Linux machine, you're installing these things from sources.

So imagine having to install the tidyverse every time you start a new project. And so this is what I got from running it on my Linux VM, and it took just over six or seven minutes, which is too long when you want to get started on something. You know, when you're starting a new project, you want to just get going. You don't want to spend ten minutes just to get up and running.

So the way renv solves this problem is with a global cache. And what this basically is is a giant bucket of all the packages you have installed. And in each of your project libraries, rather than having the actual installation of that package, it's a link into that cache. So the idea is that rather than having ten installations of dplyr 0.9.2, you install that package once, you put it in the cache, and then renv knows how to pull that out whenever it needs to.
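The linking idea can be imitated with base R alone. This is a toy model in a temporary directory, not renv's actual cache layout; the directory names are invented, and it assumes a platform where symbolic links can be created (typically Linux or macOS).

```r
# Toy model of a shared package cache; all paths here are made up.
root  <- tempfile("cache-demo-")
cache <- file.path(root, "cache", "dplyr")
dir.create(cache, recursive = TRUE)

# Stand-in for an installed package: a single file in the cache.
writeLines("Package: dplyr", file.path(cache, "DESCRIPTION"))

# Two project libraries that each "contain" dplyr only as a link.
projects <- c("project-a", "project-b")
links <- file.path(root, projects, "library", "dplyr")
for (link in links) {
  dir.create(dirname(link), recursive = TRUE)
  file.symlink(cache, link)  # the link at `link` points to the cache
}

# Both library entries resolve to the one cached copy on disk.
resolved <- normalizePath(links)
print(unique(resolved))
```

The package is written once, and each project pays only the cost of a link, which is why starting a new project that reuses cached packages can be fast.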

renv vs Packrat demo

So I want to just demonstrate quickly why I'm so excited about renv versus Packrat.

So I have this simple project that uses the tidyverse. We're going to call it Packrat project, and we're going to call packrat::init(). And so this is the kind of experience you might have had initially using Packrat, and already it's complaining about rstudioapi because I have a development version that it doesn't know how to get from CRAN. Seems to be kind of sitting and waiting. We don't quite know what's going on.

Just like imagine you're excited to get started on a project, and oh, wait, so now I have to go do all this stuff.

So Packrat is busy. Let's go to our renv project, which has the same library(tidyverse) call in a file, and let's initialize it, and let's see what happens. Initializing project. Discovering packages. Giant list of packages. We've written your lock file. Okay. You're done. You're ready to go. Now you can start working with renv.

I'm sorry that I put in more slides than can fit in this talk, but I will put this online. If you want to see these later, you can go to the link here.
