Resources

Kevin Ushey | renv: Project Environments for R | RStudio (2020)

The renv package helps you create reproducible environments for your R projects. With renv, you can make your R projects more:

- Isolated: Installing a new or updated package for one project won’t break your other projects, and vice versa.
- Portable: Easily transport your projects from one computer to another, even across different platforms. renv makes it easy to install the packages your project depends on.
- Reproducible: renv records the exact package versions you depend on, and ensures those exact versions are the ones that get installed wherever you go.

In this presentation, I'll introduce renv and some of its main workflows.

Transcript

This transcript was generated automatically and may contain errors.

Our next speaker comes to us from RStudio, here to talk to us about renv. Please join me in welcoming Kevin Ushey.

Thank you for coming to my talk. My name is Kevin Ushey, and today I'm here to talk about renv.

So before I talk about renv, first I want to talk about the motivation and the problem that we're trying to solve.

So have you ever finished a project, come back, say, a year later, and asked: Why is my dplyr pipeline suddenly throwing an error? I swear it worked before. What happened to my ggplot2 plots? Why are the bars upside down? I swear that worked before. Okay, I reran my analysis, but now nlme is complaining about model convergence? I swear that worked before.

And so my goal with renv is: how can we make sure that this doesn't happen again?

PackRat and its limitations

Now before we talk about renv, I want to talk about PackRat. So first, can I get a quick show of hands who has heard of or used PackRat?

So what is PackRat? I'm going to borrow a quote from Douglas Adams: "The story so far: In the beginning, PackRat was created. This has made a lot of people very angry and been widely regarded as a bad move."

This is kind of tongue-in-cheek in that, you know, you can have success with PackRat, but it is not a pit of success. It works, but for the average user, and especially if you're a new R user, if you tried using it, you probably ran into a roadblock, it probably fell over on its face. And even if you're an advanced R user, when things go south, it's very challenging to figure out how to get yourself out of that hole.

So renv's goal, ultimately, is to be a better PackRat.

What is renv?

So first, what is renv? And here is the sort of sales pitch of what renv is. It's a toolkit used to manage project local libraries of R packages.

So you can use renv to make your projects more isolated: every project gets its own library of R packages. Because each project has its own library, you don't have to worry that updating packages in that library might break your other projects.

It'll help you make your projects a bit more reproducible or portable, because it captures the state of your R library, and the packages installed in that library, into a lock file. That lock file is kind of like a shopping list of the packages used in your project, and renv knows how to use it to recreate your library in the future.

And that ties into the reproducibility story. You use snapshot to create that lock file, and later you use restore to restore your library.

Now, renv tries to prescribe a default workflow that just works for the average user, but remains flexible enough that alternate workflows can still be built on top of renv. One of the hardest things for me in building renv is that everyone has a different idea of how projects should be managed and how dependencies should be managed. Ultimately, renv has to take an opinionated approach to this problem, one that should at least work for the average user.

Understanding library paths

And so before we talk about renv, I just want to give a lay of the land of what it looks like when you're working on a project without renv.

So you've started a new project. You're ready to use dplyr to analyze some data, and so you're going to load and use it in your project. You're calling library(dplyr). And so what happens when we actually run this code? R is going to search your library paths for a package called dplyr that you've installed, and then it's going to load it.

So the natural questions that arise from that are what is a library path? What are the active library paths? And how does R search those library paths when loading a package?

So first, a raise of hands. Who thinks they can tell me what a library is? You don't have to tell me, but just if you think you know what it is.

I'm happy to say that the definition is quite simple here. It is simply a directory. It's a directory in which packages get installed. Nothing more, nothing less. Let any remaining mystique be dispelled. It's just a directory. It's a folder on your system. It's where packages go to live and get loaded.

Now where a bit of the extra complexity comes in is that each R session can be configured with multiple library paths. That is, a bunch of folders where R will search for your installed packages. You might have seen the .libPaths() function, which is used to find those library paths. So for example, on my system, I've got two of these.

And just to give some terminology for these libraries, the first one you see is most often called the user library, because it's your library. It's where the packages that you download and install go. The last one you have in that list is the system library. And that's most typically where the packages that come with R are installed.

And if you're on Linux, you might see another set of libraries, which are often called site libraries. You can think of these as the library paths that your administrator might take care of for you. So if you're in some organization, your R administrator might want to install some set of packages in the site libraries that would then be available to all users on the system.
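To see these libraries on your own machine, a minimal sketch using only base R might look like the following. The paths printed will differ from system to system, and the list of site libraries is often empty.

```r
# All library paths active in this R session, in the order R searches them.
paths <- .libPaths()
print(paths)

# The system library, where the packages that ship with R are installed.
# It normally appears last in the list above.
print(.Library)

# Site libraries, if an administrator has configured any (often empty).
print(.Library.site)
```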

And then next, so when R wants to load a package, how does it search these things to find it?

And so what happens is that the first installation of the package that is discovered is the one that gets loaded. So if you've ever, say, run into a situation where you have multiple copies of a package installed, say one in your user library and one in your system library, the first one that R finds, the one in your user library, is the one that gets loaded.

But if you've ever thought, wait, where is this package actually coming from? You can use find.package() to ask R: where does this package that I want to use actually live? And so for me, with dplyr, I've made sure to install it in my user library.
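As a small sketch of that lookup, the example below uses the stats package, which ships with R, so it should run on any installation; swap in "dplyr" if you have it installed. The helper where_installed() is a hypothetical function written for this example, not part of R.

```r
# Ask R where a package lives. With the default arguments, packages that are
# already loaded are reported first, then the library paths are searched.
pkg_dir <- find.package("stats")
print(pkg_dir)

# The library that contains it is the parent directory.
print(dirname(pkg_dir))

# Hypothetical helper: every library path that holds a copy of a package,
# in search order. The first entry is the copy that library() would pick up.
where_installed <- function(pkg, libs = .libPaths()) {
  libs[file.exists(file.path(libs, pkg, "DESCRIPTION"))]
}
print(where_installed("stats"))
```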

Now, I should say, if you're on macOS, you can actually install CRAN packages directly into your system library, which I would argue is not a great idea because you want to have this separation of packages that belong to you that you've installed versus packages that belong to your R installation. So mixing them up can cause some surprises.

The challenge of shared library paths

All right, so what is the challenge here? So by default, each R session is going to use the same set of library paths. So you've got this library, it's really multiple libraries, a set of library paths, and all the projects on your system, unless you're already managing your library paths in some way, are going to go and use those same library paths.

It means that, say, if you had dplyr 0.8.2 in that library, that's the package that you're using in every one of your projects.

But of course, different projects can have different package dependencies. For example, suppose project one, maybe something you were working on a while ago, that uses dplyr 0.7.8. And that's, say, an older version. You got everything tidy, wrapped up, maybe it's associated with a paper that you published, and you're thinking, okay, I'm done with this, it works, I don't want to change it.

Project two, on the other hand, you're using some more modern version of the package, you've upgraded to dplyr 0.8.2. And say, if you were a developer, maybe you're even using the development version of dplyr in another project. Unfortunately, in the current state of the world, these projects all share the same library path. So if you were to install one of these packages, you're changing the version of dplyr that's used in all three of those projects.

So in this world, if you were trying to switch between projects, you'd have to go back and say, okay, wait, I need to go back and install dplyr 0.7.8 for project one. Oh, wait, but now I need to reinstall 0.8.2 for project two. And as you can imagine, this could spell disaster.

renv's solution: project local libraries

So the solution, at least the solution posed by renv, is to ensure that each project gets its own unique library of packages. By using project local libraries, you can rest assured that upgrading packages in one project will not risk breaking your other projects.

And so it's from this idea, the use of project local libraries, that the renv package is born. We give each project its project local library, we make it simple and straightforward for R sessions to use that local library, we provide tools for managing the R packages installed in those project local libraries, and we try to make that experience as seamless as possible.

The renv workflow

And so if you've seen PackRat, this will feel familiar. Once again, the API is quite similar, but I'll walk through it. The first step in activating renv for a project is just calling renv::init(). And one thing that it does that is different from PackRat is that it actually forks the state of your current world of R libraries into a project local library.

So the idea is you're working on a project, you have it working. Say you haven't adopted renv as your default workflow or you don't manage your library paths in a specific way, but you're saying, okay, the state of the world is good right now, I want to capture that in this project. So you call renv::init(), and it takes the packages you have installed in that library and copies them into the project library. So it's like forking that state into your project.

The other thing, which is infrastructure-related, is that we create a project local .Rprofile. And that basically makes sure that every time you start R in that project, renv automatically makes sure you use that project library. So you don't have to worry about managing your library paths. renv does it for you automatically when you launch R.
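The effect of pointing a session at a project library can be approximated with base R alone. The sketch below is a simplified illustration of the idea, not renv's actual activation code, and the directory name demo-project is made up for the example.

```r
# A throwaway "project" with its own library directory.
project_lib <- file.path(tempdir(), "demo-project", "library")
dir.create(project_lib, recursive = TRUE, showWarnings = FALSE)

# Remember the current library paths so they can be put back afterwards.
old_paths <- .libPaths()

# Put the project library first. R still appends the site and system
# libraries, so the packages that ship with R remain available.
.libPaths(project_lib)
new_paths <- .libPaths()
print(new_paths)

# Restore the original library paths.
.libPaths(old_paths)
```

Any package installed while the project library is first in the list would land in that directory rather than in the shared user library.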

And this mirrors what I said before around making sure that after you call renv::init(), you can work exactly the same as you were before, with the only difference being that now you have your own project local library.

There are two main differences you'll see after you call renv::init() in a project. First, you'll see a nice little banner printed saying, OK, renv has activated this project for you, and this is the version of renv that it's using. And second, as I've said, the library paths will now be changed.

Snapshot and restore

So that is the first goal, making it easy to use project local libraries. The second goal is to make it easy to save and load the state of those libraries. Or in the parlance of renv, snapshot to save the state of a library and restore to reinstall those packages into a library.

You can capture the state of your project library using snapshot. When you call snapshot, it'll give you a little log of what's changing and what dependencies it's capturing. So in this case, I'm just capturing three packages for a project, let's say, and it's capturing from CRAN the markdown package, the rmarkdown package, and the yaml package.

And this star is just a placeholder to say, we didn't know about this before. Now we know that we're using version 1.1 of markdown, 2.1 of the rmarkdown package, and 2.2.0 of the yaml package. And it asks, do you want to proceed? Yes, please write those. And it writes them to this thing we call the lock file.

So we're serializing the state of your project library to a file called a lock file, which we'll name on the file system as renv.lock. So this thing is just a text file. It's a shopping list. It's the packages that you had installed in your library, their versions, and where they came from.
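To make the shopping-list idea concrete, here is a rough sketch of the kind of information a lock file records, written as an R list. The real renv.lock is a JSON file and holds more detail than this. The field names below are an illustration and should be checked against a lock file generated by your own version of renv. The package versions are the ones from the snapshot example in the talk.

```r
# Illustrative only: a hand-written stand-in for the contents of renv.lock.
lockfile_sketch <- list(
  R = list(Version = as.character(getRversion())),
  Packages = list(
    markdown  = list(Package = "markdown",  Version = "1.1",
                     Source = "Repository", Repository = "CRAN"),
    rmarkdown = list(Package = "rmarkdown", Version = "2.1",
                     Source = "Repository", Repository = "CRAN"),
    yaml      = list(Package = "yaml",      Version = "2.2.0",
                     Source = "Repository", Repository = "CRAN")
  )
)

# Flatten the package records into a table: what, which version, from where.
shopping_list <- do.call(
  rbind,
  lapply(lockfile_sketch$Packages, as.data.frame, stringsAsFactors = FALSE)
)
print(shopping_list)
```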

So why do we want these lock files? There are three cases that I've outlined specifically here. One is for time capsules. This is the case where you say, OK, I'm done with a project. I want to make sure that this project still runs a year from now. To do that, I want to make sure I know what packages I have, and I want to make sure those packages exist over time so I can revisit the project later.

For collaborative workflows, if you're working with a set of collaborators, you probably want to make sure everyone is working with the same environment. If people have different machines, one way to synchronize the environment is to share the lock file and have everyone restore from that lock file. That way you're sure everyone's using the same set of R packages.

And also for deployment. So if you're working on something locally and it works on your machine, you might use renv to give yourself some extra guarantees that it works on the machine you're deploying on by replicating your library on that other machine.

So you create your lock file with snapshot, and given that lock file, you restore the state of your library later with restore. Restore will give you a very similar kind of output. It'll tell you what packages are going to be installed, how they differ from the package that was already installed, if any, and what the version is.

And fortunately, renv is able to restore from a number of sources. If you've used remotes or devtools, it basically understands all those same sources: CRAN, Bioconductor, GitHub, GitLab, Bitbucket. If you have another remote source that you'd like renv to support, you can let me know on the GitHub issue tracker, and we can see if that can become a possibility.

Another big thing is I've made an effort to ensure that you can authenticate with private repositories. So if you have your own, say, CRAN-like repository hidden behind some kind of authentication mechanism, renv has some tools to make it easier to authenticate.

So these are the three main features of renv, and I think these are the three things you need to know if you want to get started. The rest of the talk will be kind of extra on top of that. Init to initialize your project, give yourself a project local library, snapshot to save the state of your library once you are ready to save the state, and restore if you need to restore your project library from a previously generated lock file.
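As a rough sketch, the three calls look like this, each meant to be run against the project's root directory. They are wrapped in small helper functions (the helper names are made up for this example) so that pasting the block does not change anything until you call one of them. It assumes the renv package is installed.

```r
# Hypothetical wrappers around the three core renv functions.

# 1. Give the project its own library and write the initial lock file.
start_project <- function(project = getwd()) {
  renv::init(project = project)
}

# 2. Record the current state of the project library in renv.lock.
save_library <- function(project = getwd()) {
  renv::snapshot(project = project)
}

# 3. Reinstall the packages listed in renv.lock into the project library.
rebuild_library <- function(project = getwd()) {
  renv::restore(project = project)
}
```

In day-to-day use you would more likely type renv::init(), renv::snapshot(), or renv::restore() directly at the console from inside the project.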

The global cache

And so one major issue with project local libraries is the duplication of identical packages across projects. If you've used PackRat before, you've probably seen this issue. So if you had ten projects that used dplyr 0.9.2, then you have ten project libraries that also have dplyr 0.9.2 installed. And this can be costly, both in terms of disk space and installation time, especially if, say, you're on a Linux machine and you're installing these things from source.

So imagine having to install the tidyverse every time you start a new project. And so this is what I got from running it on my Linux VM, and it took just over six or seven minutes, which is too long when you want to get started on something. You know, when you're starting a new project, you want to just get going. You don't want to spend ten minutes just to get up and running.

So the way renv solves this problem is with a global cache. What this basically is, is a giant bucket of all the packages you have installed. And in each of your project libraries, rather than having the actual installation of that package, there's a link into that cache. So the idea is that rather than having ten installations of dplyr 0.9.2, you install that package once, you put it in the cache, and then renv knows how to pull that out whenever it needs to.
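The linking idea can be sketched with base R on a Unix-like system, where symbolic links are available. This is a toy model, not renv's actual cache layout: the directory names are made up, and the "package" is just a folder with a one-line DESCRIPTION file.

```r
# Start from a clean scratch directory.
root <- file.path(tempdir(), "cache-demo")
unlink(root, recursive = TRUE)

# A fake "installed package" living in a shared cache directory.
cached_pkg <- file.path(root, "cache", "dplyr")
dir.create(cached_pkg, recursive = TRUE)
writeLines("Package: dplyr", file.path(cached_pkg, "DESCRIPTION"))

# Two project libraries, each holding a link to the cached copy
# instead of a full installation of its own.
projects <- c("project-one", "project-two")
for (p in projects) {
  lib <- file.path(root, p, "library")
  dir.create(lib, recursive = TRUE)
  file.symlink(cached_pkg, file.path(lib, "dplyr"))
}

# Both links resolve to the same single copy on disk.
links <- file.path(root, projects, "library", "dplyr")
print(Sys.readlink(links))
```

Because each project library only holds a link, adding another project costs almost no extra disk space or installation time for a package that is already in the cache.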

renv vs PackRat demo

So I want to just demonstrate quickly why I'm so excited about renv versus PackRat.

So I have this simple project that uses the tidyverse. We're going to call it PackRat project, and we're going to call packrat::init(). And so this is the kind of experience you might have had initially using PackRat, and already it's complaining about rstudioapi because I have a development version that it doesn't know how to get from CRAN. It seems to be kind of sitting and waiting. We don't quite know what's going on.

Just like imagine you're excited to get started on a project, and oh, wait, so now I have to go do all this stuff.

So PackRat is busy. Let's go to our renv project, which has the same library(tidyverse) call in a file, and let's initialize it, and let's see what happens. Initializing project. Discovering packages. Giant list of packages. We've written your lock file. Okay. You're done. You're ready to go. Now you can start working with renv.

I'm sorry that I put in more slides than can fit in this talk, but I will put them online. If you want to see these later, you can go to the link here.