Resources

Gabor Csardi | pkgman A fresh approach to package installation | RStudio (2019)

The main goal of pkgman is to make package installation fast and more reliable. This allows new, simpler and safer workflows, such as separate package libraries for projects. In this talk, we will show the features that make pkgman fast, convenient and reliable.

Features that make pkgman fast:
* Concurrency: pkgman performs all downloads, package builds and installations concurrently by default.
* Metadata and package cache: pkgman caches all metadata and all downloaded and locally built packages in its cache.
* Laziness: pkgman only downloads and installs packages if needed.

Features that make pkgman convenient:
* BioC and GitHub packages are supported seamlessly.
* Informative UI. pkgman can lay out the installation/update plan, which the user needs to confirm. It returns data about downloads, builds, installations, etc.

Features that make pkgman reliable:
* Dependency solver. pkgman makes sure that you end up in a consistent, working state of dependencies.
* Private library: pkgman's own dependencies do not affect your regular package library, and vice versa. pkgman does not load any packages from your regular library.

About the Author

Gábor Csárdi

Gábor is a software engineer at RStudio, working in Hadley’s team on R infrastructure packages.

Transcript

This transcript was generated automatically and may contain errors.

Great, so thanks for coming to this talk. I'm Gabor, I work at RStudio, I'm a software engineer and I will show you today an R package that is actually not called pkgman, there is a name change here, but it's called just package, or pkg, which is, you know, interesting if you talk about the package package, so I would just probably call it pkg, that maybe sounds better. So pkg is a package for package installation and I will, you know, tell you briefly why we are writing this package and what are the goals that we want to achieve and what features the package has, you know, to achieve these goals and a little bit about, you know, project-based workflows, which is kind of the long-term goal that we want to have, and about ongoing work.

Goals: cheap and reliable package installation

Okay, so goal number one, so there is already a way to install packages in R, right? In base R there is a function you can use, you know, to install packages and it works actually pretty well. It installs packages from CRAN or any kind of CRAN-like repository, right? And then there are other tools, some of them developed at RStudio, like the devtools package and the remotes package to install packages from other sources. So there are a lot of tools already, you know, that exist basically to install packages, so why are we writing, you know, another one?

So basically, to summarize it very, you know, simply, we want to have cheap and reliable package installation and I will tell you, you know, what I mean by cheap and reliable. So cheap is easy, so it should be, you know, minimal effort to install the right packages for your projects in the right library using some simple, you know, short function call. And it should also be fast, you know, this is actually very important because if it's fast, then it is okay to do it, you know, kind of frequently. And that in turn is important because the kind of workflows we are shifting towards require, you know, more frequent package installation. But if, you know, this is hard and slow, then people would just not do it. So we would just not, you know, get those workflows.

So reliable means these are not very surprising things, right? So it should leave the packages in a working state. That doesn't actually always happen, you know, with the existing tools, so we are trying to make, you know, more of an effort that this happens and your package is installed in a working state and it's not in a broken state because some error happened, you know, halfway through the installation. We want to leave the project in a working state so, you know, all the dependencies are properly satisfied in your library for your project. We want to put all the dependencies of a project in a single library because this will help us with the workflows we want to, you know, move towards. Related to this, if you have multiple projects, you know, the dependencies and the packages, you know, that you need for these multiple projects, they should not interfere. You know, if you have a project that needs a certain version of dplyr, you know, and then you have another project that needs a different version, you know, of dplyr because you just started that project, you know, one year later and then the dplyr syntax maybe changed a little bit or the semantics changed a little bit, then, you know, you should be able to work with these, you know, two projects without having any kind of interference. And, of course, pkg is an R package itself and so it has its own dependencies and those should not interfere, obviously, with your regular, you know, project dependencies either.

Making pkg fast

So how to make it fast? We put in a lot of effort, you know, to make pkg fast. The first one is to make all the HTTP requests, you know, concurrent. So if you are installing CRAN packages only, then this is kind of, you know, simple to achieve. If you have packages from other sources like packages from GitHub or packages from your own repositories, you know, then it's a little bit harder. Typically, if you're installing a package from GitHub, you have to make several requests to the GitHub API and then, you know, that package that you're installing from GitHub might depend on other packages on GitHub and that means you need to make even more requests to the GitHub API. So if you don't do this concurrently, it will be just very slow. The package builds and the installations are also concurrent. Pkg is quite lazy, so by default it doesn't actually download or do anything if it doesn't, you know, have to. So if you don't, you know, explicitly tell it that you want to update, you know, to the newest version or a newer version, then it will not update, you know. If the current version is good enough for your project, then it will not do anything. It will not update just for the sake of updating.

And probably the most important, you know, two features for speed are the cache, the cached metadata and the cached package files. So with the current tools, probably all of them, you know, the tools managing packages and package installation, you know, if you install a package, you know, you need to make a download. You need to download the package file. If you do the same thing five minutes later, you will probably download it again, which is not how, you know, an ideal package manager should work. There should be some cache and that makes it really, really fast.

Safety and reliability features

So there are some safety features because we want, you know, a reliable way to install packages. The first one seems like a joke, right? But it's actually not. So some of the tools actually, they don't fail if some error happens and they will just, you know, leave you in a broken state with a broken installation. So it's nice to fail if there's an error, you know. So if you can't install something, you should fail. I mentioned this already, all the dependencies go in the same library. What pkg does is it solves the dependency structure of your installation and if there's a problem with that, it reports it up front. So it doesn't actually start, you know, installing packages until it knows that, okay, this will be okay. You will get a consistent state of the library and it reports all the conflicts up front and then you can, you know, see what you can do about those conflicts.

All the dependencies of pkg itself, they're in a private library so they don't, you know, interfere with your other normal, you know, packages, package dependencies. And this works, you know, this happens seamlessly so you don't really need to, you know, worry about this. Pkg just puts them there and just uses them. It also uses a subprocess to perform the installation because in R and the package system, you know, it's often a problem if you want to install a package that is loaded in a session. Okay, so usually when, you know, your package installation fails then, you know, the older R users will tell you to, you know, quit all the R sessions that you have and then start with an empty session to install because of this problem. So to work around this, you know, pkg uses a subprocess so its own dependencies are only loaded in the subprocess and they're loaded from the private library so you don't have this interference there. Yeah, and we also try to, you know, detect if you have a package loaded from the library that you want to overwrite, that you want to install to, just to, you know, detect these possible problems.

Convenience features

Okay, so it has some convenience features. It supports CRAN and Bioconductor packages out of the box, GitHub packages really easily. So there is something called the Remotes field in the DESCRIPTION file. If you haven't seen this before, that allows you to depend on non-standard packages, you know, in your own package and make the installation just work, you know, seamlessly. So you can put, there's an example at the bottom, so you can put this entry in your DESCRIPTION file and the first, actually both lines, both of those dependencies refer to GitHub. So it's just, you know, username slash repository and then the second example, it has an @dev, that's a branch on GitHub. It just means that, you know, for this project or for this package, you need to install, you know, the glue package from GitHub and you actually want to have the dev branch so we support all this. This is not really new, you know, the devtools package has supported this for a while and the remotes package as well. It supports local package files and package trees so you can install obviously from, you know, files on your own disk. It's a nice convenience feature that it will show you the download sizes, you know, before you actually do the installation and also in the progress bar, so if you have like a big installation and your Internet is, you know, not the best, you will see, you know, how long it will take to install all that.
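
As an illustration of the Remotes field described above, a DESCRIPTION file might contain entries like the following. This is only a sketch: the package name, the version and the first repository are hypothetical placeholders that are not taken from the slide, and the `tidyverse/glue` repository path is an assumption (the talk only mentions the glue package and its dev branch).

```
Package: myproject
Version: 0.0.1
Imports:
    cli,
    glue
Remotes:
    r-lib/cli,
    tidyverse/glue@dev
```

Each Remotes entry tells the installer where to fetch a package that is already listed as a dependency; the `@dev` suffix selects a branch instead of the default one.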

Demo

So I'll just show you a little demo, you know, how this works in practice. So this is just a very basic R session that I have here. Is this big enough or should I make it bigger? This should be good, right? Okay, so the main function that you use is called pkg_install and then, you know, obviously CRAN and Bioconductor packages will just work seamlessly. I will give a separate library here not to, you know, mess up my own libraries, so it will just install there. So if you do this, you know, then pkg will solve your installation, it will find all your dependencies and this looks like this is a CRAN package, you know, CRAN is in a consistent state almost always, so the installation is possible and it will also tell you that it will install, you know, these 18 packages and it will download them from CRAN. I just cleared my cache, so that's why they are not cached. So if I want to continue, I just say yes and then it will, Internet is really fast here, so it will download them very, very quickly and then, you know, install them very quickly.

So this is not surprising, right? If you, obviously, if I run this again, you know, then it will take a look at the installed packages, you know, and then see that actually, it sees that actually there's nothing to do, so it does nothing, right? Now, the more interesting case is that, you know, when you install, when you download these package files, it also adds the package files to the cache. So if I were to completely remove this library like this and I do this again, and in a more realistic situation, you know, what happens is that you just give your project to somebody else, to your colleague, and your project, you know, dependencies are described in your description file and your colleague will, you know, install all the dependencies for your project. Now, if your colleague has all these packages in the cache, like we do now, so it tells you that all 18 packages are cached. So if these are commonly used packages, you know, you don't actually need to download them, you know, it will, even if the Internet was slow, though it is not slow here, you know, this is really fast because everything is coming from the cache.
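
The steps of this part of the demo can be sketched roughly as follows. This is a hedged sketch, not the code typed during the talk: it assumes the package as it was later released under the name pak, with its `pkg_install()` function and `lib` argument, and it uses dplyr purely as an illustrative package (the transcript does not say which package was installed). The installation calls are guarded because they need network access and an interactive session to confirm the plan.

```r
# Install into a throwaway library so the regular library is left untouched.
lib <- file.path(tempdir(), "demo-lib")
dir.create(lib, showWarnings = FALSE, recursive = TRUE)

# Guarded: only runs interactively and if the pak package is available.
if (interactive() && requireNamespace("pak", quietly = TRUE)) {
  # First run: solve dependencies, show the plan, download and install.
  pak::pkg_install("dplyr", lib = lib)

  # Second run: everything is already installed, so there is nothing to do.
  pak::pkg_install("dplyr", lib = lib)

  # Remove the library and install again: the package files should now
  # come from the local cache instead of being downloaded again.
  unlink(lib, recursive = TRUE)
  dir.create(lib, showWarnings = FALSE, recursive = TRUE)
  pak::pkg_install("dplyr", lib = lib)
}
```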

Okay. So GitHub packages work really easily. Bioconductor packages as well. You don't need to do anything, you know, special for a Bioconductor package. This is a Bioconductor package. It's just automatically included. The only glitch with the Bioconductor packages is that we don't know their sizes up front, so it will tell you, you know, that it will download 16 packages from CRAN, roughly 20 megabytes, and, you know, 12 more packages, but the size of those is not known. And it will obviously add all these to the cache, so if you do it again, you know, next time, you don't need to download them.

So for packages from GitHub, let's say, I can install pkg itself, maybe. That's a good example. And it will also tell you that, you know, there is one package with unknown size and that's the r-lib/pkg package because that's coming from GitHub and we don't know the sizes of those packages, actually. And there are a bunch of packages in this library that are already there, so you don't actually even need to install them. Yeah, it needs to build pkg because that's not a built package. It's just a repository. So that's how it works, the simple parts at least.
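
The different package sources shown in the demo can be mixed in a single call. The sketch below again assumes the later pak release of the package and its `pkg_install()` function; the three package references are illustrative choices, not the ones used on stage.

```r
# Package references of different kinds (all illustrative):
refs <- c(
  "dplyr",         # CRAN package, referred to by its plain name
  "BiocGenerics",  # Bioconductor package, also a plain name
  "r-lib/cli"      # GitHub package, in username/repository form
)

# A separate library, as in the demo, so nothing else is affected.
lib <- file.path(tempdir(), "demo-lib-sources")
dir.create(lib, showWarnings = FALSE, recursive = TRUE)

# Guarded: needs network access, the pak package and a confirmation prompt.
if (interactive() && requireNamespace("pak", quietly = TRUE)) {
  pak::pkg_install(refs, lib = lib)
}
```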

Let me go back to the, yeah, so I showed you GitHub packages. And the other thing I want to show you is what happens if you have a conflict. So I artificially created a conflict here in this branch, the conflict branch. So in this branch, the pkg package depends on a version of some other package that is not available. So if you try to do that, you know, pkg tries to solve the dependency tree, so this will come up immediately. It will tell you that it can't, you know, it can't install pkgcache. And then you can look, you know, into your dependencies and see why that is the case. Another one, another conflict that I artificially created is the conflict branch of this package. And if I try to install this together with the cli package, but not the master branch, but say this branch, then what happens is that this branch of the pkgcache package, you know, depends on one version of cli. So that's what you can see there, cannot install dependency r-lib/cli@master. It wants the master, you know, branch. But I also, you know, upfront said that I wanted to install this other branch. And then, you know, R libraries are just such that you can't install two different versions of the same package in the same library. So this is a conflict. There is not, you know, anything that we can do. So pkg tells you upfront.

Project-based workflows

Okay. So the main goal actually, or the, you know, longer term goal and the reason why we want to make, you know, package installation cheap and reliable is that we want to have safe and convenient workflows. And these would be workflows that are centered on projects. So I will show you one workflow that is already possible with the pkg package. There is some more work to be done to make this super convenient, but if you organize, you know, your project or your package this way, this already works. And it's actually relatively easy to do it. So you would have a new project or package, you know, some files and directories that you can, sorry, that you can see here. So the R directory is the usual, you know, place for your code. If this is a package, you know, this structure is not really surprising. The package usually has an R directory, has a DESCRIPTION file, right? If it's a project, you don't necessarily need to create this, but you can, obviously. And if you do, then pkg will help you, you know, with the installation and the separation of your project libraries and so on. So the DESCRIPTION file will be used to, you know, describe the dependencies. We need to create an R profile, a .Rprofile file here, which will set up, you know, your project library. And it's very simple. So there's a short example in the bottom. Basically, it will source the main R profile, you know, it will create your directory for your library, which will be the R-packages directory, and then it makes sure that it is added to your library path. So that's the bottom line. And then in your DESCRIPTION file, you can, sorry, you can describe your dependencies the usual way. You know, either you can just use a Depends or Imports field if this is a package, but potentially with versions. And you can also use the Remotes field if you want, you know, a package from a non, you know, canonical source.
And then to do the installation, you know, after you have this setup, you can use, you know, one of those commands in the bottom. So I guess the most common one would be local_install_deps or local_install_dev_deps. So that will install, you know, all your dependencies into the R-packages, you know, library, which is already set up in your profile.
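
The project setup described above could look roughly like this. It is a sketch reconstructed from the spoken description, not the file shown on the slide: the directory name follows the transcript's "R-packages", the helper `install_project_deps()` is a hypothetical name introduced here for illustration, and the installation call assumes the later pak release of the package with its `local_install_dev_deps()` function.

```r
# Project-level .Rprofile (sketch).

# 1. Source the user's main R profile, if there is one, so that
#    personal settings still apply inside the project.
if (file.exists("~/.Rprofile")) source("~/.Rprofile")

# 2. Create the project library directory if it does not exist yet.
project_lib <- "R-packages"
dir.create(project_lib, showWarnings = FALSE, recursive = TRUE)

# 3. Put the project library first on the library path, so packages are
#    installed into it and loaded from it.
.libPaths(c(project_lib, .libPaths()))

# Hypothetical helper, to be called once from the console: installs the
# dependencies listed in the project's DESCRIPTION file (including
# development dependencies) into the project library.
install_project_deps <- function() {
  pak::local_install_dev_deps(root = ".", lib = .libPaths()[1])
}
```

With this in place, starting R in the project directory picks up the project library automatically, and running `install_project_deps()` once should fill it with the declared dependencies.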

Ongoing work

Okay. So that was, so this is a relatively simple project-based, you know, workflow. It's not super convenient because it's not fully automated, right? So you need to actually, you know, add, say, the Remotes fields and so on. So there's some work that you need to do, you know, to make this work. You need to create this .Rprofile file, you know, manually and so on. So some of the ongoing work is just to make this better, you know, have better tools to, you know, declare the dependencies of a project, say snapshot those dependencies, you know, to know their exact version so that when you deploy your project, you know, you can recreate those exact versions. So this thing doesn't exist yet in pkg and we are working on that. And whether it will be in the pkg package or not, you know, we don't know that yet, but it is definitely something that we want to have. It would be nice to have, you know, better failure diagnostics, so we are working on that too, you know, so that if an installation fails, then it will tell you that it failed because you don't have some library, you know, present on the system. And I mentioned the snapshotting already. It's possible to make it even faster and we will add more remote types. So I will just leave you with this. So there are the links in the bottom. The package is not on CRAN yet, but it should be on CRAN, you know, any time basically. It's in the queue, so it can get there any time. Until it gets there, you know, an easy way to install it from GitHub is to use this install-github.me service. And you can take a look at it at the repository as well. And that's all I have.

Q&A

Thank you very much, Gabor, for that introduction. We may have time for one very quick question. Down near the front, Davis, just catch the microphone. There you go.

So, honestly, I think the most annoying part of installing R packages is when you have to build packages that have compiled code like C++ from source. So are you planning to add more support in your package to manage the ~/.R/Makevars file and, like, the system path and so on?

Yes, so we definitely want to help with that. So one thing is that, you know, if you manage to install from source, then we add the package to the cache so you don't need to do it again. That's one thing. And the other thing is that we want to, you know, look at the system requirements of your package to try to figure out why you couldn't install it and then give you some, you know, helpful message, you know, what software or libraries you need to install on the system, you know, to get the installation done.