
Alex Gold | Managing Packages for Open Source Data Science | RStudio
What You'll Learn: With over 15,000 R packages on CRAN, over 230,000 on PyPI, and more arriving every day, the task of managing a package environment for data science in R and Python can be daunting. In this webinar, you will learn about the most current strategies and tooling for creating and maintaining a reproducible package environment. Whether you’re an individual data scientist or the administrator of an entire RStudio Team cluster, you’ll better understand how you can enhance your ability to work easily and safely with open-source data science packages. About Alex: Alex is a Solutions Engineer at RStudio, where he helps organizations succeed using R and RStudio products. Before coming to RStudio, Alex was a data scientist and worked on economic policy research, political campaigns, and federal consulting.
image: thumbnail.jpg
Transcript
This transcript was generated automatically and may contain errors.
Thanks so much for tuning in today to talk about managing packages for open source data science. My name is Alex, and I'm a solutions engineering manager here at RStudio. Just a little about me so that you sort of know who I am and why I'm talking about this. I'm a former data scientist and a data science team lead. I worked on sort of health care, politics, economic policy kinds of issues when I was a full-time data scientist. And since coming to RStudio, I've been a solutions engineer here for about two years.
So let's dive right in and talk about managing packages. So, you know, data science is a lot like making this delicious sandwich. This sandwich looks incredible, by the way. Maybe it's an app that you're making, maybe it's an API or a report or a scheduled job of some sort. And it takes a little bit of planning, a little preparation, and a little skill to make something really, really great.
Today, we're going to talk about one important aspect of open source data science, which is package management. And so we'll start by talking about how package management can screw up your plans to make beautiful, delicious sandwiches or data science products, then some background on how it all works, and finally some thoughts on how it can be better and how it doesn't always have to be so painful.
The pain points of package management
So first, you know, sort of the pain. So one reason this can be painful is that folks are blocked, right? You're blocked from installing packages. Either you don't have permissions to install on your machine, or you're blocked from public package sources like CRAN, Bioconductor, or PyPI. And that's a big problem. Or you have no way to share private packages.
And sort of that leads into the situation of it feeling like IT and data science are at odds. And this isn't so surprising, right? Like, data scientists want the software they want, and they want it when they want it. And that's not necessarily always IT's concern, right? IT or admins have a lot of concern about platform stability and security. And those things don't have to be at odds, but they can be.
There's the issue of fragility. You know, when you're sharing your project with somebody else, are you confident that they'll be able to open it up and actually use it? Or, you know, might they open it up and find themselves stuck because they don't have the right package set?
And then lastly, sort of like, when you come back to a project in the future, right? Like, this is just a form of sharing with future you, but how confident are you that, like, your package set hasn't changed in the meantime, and you can sort of get back to the same state?
And so, to be completely clear, like, you can't fix all these problems. And particularly if you're just a lone data scientist, it may not be possible to fix all these. You may need some help from others at your organization, but you can set yourself up for success here.
Mise en place for data science
I love to cook. And when you're talking about cooking, there's this concept called mise en place, which means something like "putting in place" in French. At least that's my understanding. I don't speak French. But what it's really about is that you cut, measure, and prep all of your ingredients before you ever turn on your stove, right? So, by the time you're turning on the gas, by the time you're starting to cook, everything is in place and ready to go.
And so, you know, the idea is that you can't guarantee outcomes, but you can guarantee your process. That's what the mise en place step is meant to do when you're cooking. And you can do something very similar, right? Package management is very much like the mise en place of doing data science: it's getting all your ingredients, all your things in place before you go.
How repositories and libraries work
So, let's take a step back and talk a little bit about, like, how do the ingredients get there in the first place? And how do you make sure they're ready to go when you're doing data science? We're going to stick with this metaphor. So, your food starts at the grocery store, right? It's, like, packaged up. It's on the shelf. It's ready to go. And you go and get it. And then when you want to cook something, you go and buy the right ingredients, and you bring them home to your pantry, right?
And it's really the same for packages. For packages, you know, your grocery store is your repository. So, this is sort of an external place where you have all of your packages, and they're stored, sort of ready for you to go in an inert way. And then there's your library, and that is sort of your pantry where you keep your packages that are just for you and the things you're going to make.
And so, for many teams, the repository that they use is public, right? It might be public CRAN, public Bioconductor, or RStudio's public Package Manager. Other teams choose to maintain private repositories. But the libraries are always private. And the libraries are, you know, anywhere there's a kitchen, anywhere there's an actual R or Python process running, in the IDE, in a notebook, on RStudio Connect, in a Docker container, right, there's a library attached to that. And the shopping process is the install, right?
So, just to be a little more, like, precise about this, you know, I have these two sort of side by side, but really the way this works is that, like the grocery store and pantries, many libraries are served by one repository, right? One repository provides packages for many libraries.
So, a repository is just a file server. It has a bunch of directories indexed in a particular way. And so, just to be precise, this is the link. You'll note it's a URL, right? It's a real link to an internet site for a particular package on CRAN. Now, CRAN is the main public R package repository. And so, what you'll see here is that this is for Windows, it's using R version 4.1, which was the development version at the time, and it's for the A3 package, which I chose because it is alphabetically the first package on CRAN, at version 1.0.0.
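To make that concrete, here's a hypothetical example of what such a URL looks like, following the pattern described above (the exact path is illustrative, not the one shown in the webinar):

```
# <repository>/bin/<os>/contrib/<R version>/<package>_<version>.<extension>
https://cran.r-project.org/bin/windows/contrib/4.1/A3_1.0.0.zip
```

The path encodes everything the installer needs: the operating system, the R version, and the package name and version.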
One of the other important aspects of the repository is that the repository is not local. It is almost never the case that your repository is on the same machine as where you're actually doing your work. Very often, the repository is going to be, like, actually in the cloud. It's actually going to be a website, like that URL I just showed you. Or sometimes it's going to be an internal networked server.
That's very much in contrast to the library. The library is just a directory, but it's on the same machine as the running R process. And so, you'll see, like, this is a Linux path inside my home directory. It's the x86_64 Linux library. And you'll see this is specific to R version 4.1 and the A3 package.
And so, you know, one of the big differences here is that libraries are specific to the version of R, but not to the package version. And this is because R needs a way to resolve definitively to a particular package. So this is one of the big distinctions between the repository and the library: the library has at most one version of any package, while the repository has many versions of each package, and even many versions for different operating systems.
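As an illustration of the library side, here's a sketch of what this might look like on Linux (the paths are hypothetical, not taken from the recording):

```r
# Illustrative only -- actual paths depend on your system and R version.
.libPaths()
#> [1] "/home/alex/R/x86_64-pc-linux-gnu-library/4.1"   # user library
#> [2] "/opt/R/4.1.0/lib/R/library"                     # system library

# Each installed package is one directory, so this library can hold
# at most one version of A3 at a time:
# /home/alex/R/x86_64-pc-linux-gnu-library/4.1/A3/
```

Note that the path encodes the R version (4.1) but not a package version, which is exactly why only one version of a package can live there.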
How package installs work in R and Python
So, let's get a little into the nitty gritty here of, like, how these package installs actually work in R and Python. So, let's say I go to install a package. In R, I do that with install.packages. In Python, I do that with pip install package. And you'll notice, like, here is one of the differences, which is that in R, I do all of my actual package management and installation from inside R, right? I type install.packages at a running R interpreter. That's different in Python. In Python, you use a separate binary to manage your package installs. Most often, that's pip.
But in terms of what these tools do, they're conceptually identical. In either case, what they're going to do is look at your settings. In particular, in R, it's going to look at the repos option. In Python, it's going to look at your pip config. And from there, it's going to find the URL of a repository, which is going to be a real URL, either on the internet or on a networked server. And from there, it's going to move that package into your particular library.
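A minimal sketch of that lookup in R (the repository URL shown is an assumption for illustration, not from the webinar):

```r
# 1. R reads the repository URL from the repos option:
getOption("repos")
#> Something like: c(CRAN = "https://cran.rstudio.com/")

# 2. install.packages() downloads the package from that URL and
#    unpacks it into the first library on .libPaths():
install.packages("A3")
```

The Python equivalent would be checking your configured index with `pip config list` and then running `pip install <package>`.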
Now, when I go to actually load a package, I do that in R with the library() command. In Python, I do it with import. And so, what R or Python are going to do is look at, in R, .libPaths(), or in Python, sys.path. Hopefully, you've never had to look at these before, because they just resolve automatically. From that, R or Python will know which libraries to look in, will bring the package back into your running R or Python environment, and then you can use the functions from those packages.
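And the loading side, sketched in R (with the Python analogue noted in comments):

```r
# R consults .libPaths() to find the libraries to search;
# Python consults sys.path for the same purpose.
.libPaths()

# library() walks those directories, finds the installed package,
# and attaches it to the running session -- like `import` in Python:
library(A3)
```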
So, you know, this was a lot of nitty-gritty, in-depth stuff. And, you know, if this all felt like this was really exciting and interesting to you, congratulations. You're a huge nerd and you're in great company. But, you know, if this sort of went over your head a little bit or didn't feel that important, that's great. That's totally fine. You should never need to manage this directly. But I share all this mainly to make the point that you can see why this is a little bit frustrating, why there's no sort of like single silver bullet to make this all work. Successfully managing a package environment requires, you know, at least three different components, right? It requires the repository, the library, and the settings that tie it all together, right? All three of those things need to work.
The package management prime directive
That said, there is a happy path. It is possible to make this delicious sandwich, whatever it is. And it's not even that hard. But it's, you know, you're going to want to do some planning up front. You're going to want to do your mise en place. And so, we're going to talk for the rest of this talk about two ways that you can make this better. And the first is to develop a package management plan with your IT department. And the second is to have project specific libraries that you can manage as an individual data scientist.
And so, from here, we're going to get to the package management prime directive, which is a little grandiose. And so, the first is that, you know, repositories should be as broad as possible, right? These should be managed centrally, if they're being managed at all. And they should try and serve as many users as possible.
That is the opposite from libraries, which are specific. One other thing I should mention about libraries is that they are pull only, right? You have to pull packages to your libraries. And so, you know, that means that the data scientists should have control of what they pull into their own libraries. This is sometimes a point of friction between IT teams and data scientists. And it's worth sort of, you know, having a conversation with your IT team about how you can safely manage libraries as the data science team.
So, on your repository, like, you don't necessarily need to manage a repository. A lot of organizations choose not to manage a repository. If you are going to manage one, that's generally on IT or admins. And, you know, sometimes you have data scientists serving in this role of sort of the data science platform admin. And that's fine. But it's good to distinguish what you're doing as a data science platform admin versus as a data scientist. And so, most often, we see IT admins managing a repository. Because the job here is, you know, to stand up a server, to network it correctly, to make sure that it's available to users, to make sure that it stays up, has the appropriate amount of stability and security.
On the library side, that really falls to the data scientist to manage either a user level library or increasingly, and as I'm going to suggest to do here, a project level library, right? An individual library that corresponds to a single project is the safest path. Because then you know that even if you do other projects, you won't have to worry about breaking your project library.
Tools for managing repositories and libraries
So, on the repository level, you know, obviously, we recommend using RStudio Package Manager as the package repository for CRAN, Bioconductor, and PyPI repositories. There are also open source products like miniCRAN that you can use if that's what your organization or you prefer. On the library side, there's renv and virtualenv. In R, the two options that most people use to do this are either Packrat or renv. renv is a full replacement for Packrat and in fact was written by the same person. So, if you're using Packrat, upgrade to renv. If you're not managing project libraries, start doing it. Use renv.
On the Python side, there are a ton of options for how to do this. There's virtualenv, there's Conda, there's pyenv, there's all these different things. And if your organization has a standard, that's fine. Use what you want to use. But if you aren't using anything yet, I would encourage you to use virtualenv. It's relatively simple, and I've seen lots of organizations use virtualenv successfully.
And, you know, in terms of the settings, those can be configured inside the RStudio IDE if you're using the RStudio IDE. Or you can use config files like .Rprofile, Rprofile.site, .Renviron, or Renviron.site to set those settings.
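For example, a hypothetical .Rprofile entry that sets a default repository (the URL is illustrative):

```r
# Hypothetical .Rprofile / Rprofile.site entry -- point installs at a
# specific repository by default for every new R session:
options(repos = c(CRAN = "https://packagemanager.rstudio.com/cran/latest"))
```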
Why organizations manage private repositories
So, let's talk a little bit about why organizations want to manage private repositories. And the main reason here, right, is that when IT needs to control the set of packages that's available to people, this is how you do it. And I want to emphasize this: if you are in an organization where you cannot install R packages on your machine, you're in a broken state, and you need to figure out a way to convince your IT team to let you install things. It's really, really hard to work in that environment.
One answer is to get a centralized server. But the other answer is to allow installs only from a private repository, right? So, that's one reason we see people standing up private repositories: they're in offline or air-gapped situations. That is, there's no access to the internet from where you're installing packages. Now, that's common in high-security environments, and that's totally fine. The solution is to have a repository behind your firewall, allow people free and open access to that repository, and block them from using any repository outside the firewall.
Similarly, if you have to validate package sets in your environment, right, if that's why you can't install packages, again, the way to solve this problem is to have the repository be the gateway. The repository determines the set of packages people are allowed to install, for safety or security reasons. This can also work really well if you have projects that require package validation: once those packages have been validated, you can only use that set.
Demo: using renv for reproducibility
So, first, let's open up this very ominously named "package fails" project here. And, you know, what I want to point out is that this is some code in an R Markdown document, and it last ran in June 2018. I'm not going to get into really what this code does. Actually, what I'll say up front is that what this code does is pretty silly, so please don't judge me based on that. What I wanted was an example where upgrading the package versions would cause something to fail in a tidyverse package.
So, here's code that ran in 2018. And I'm going to try to run it today, in 2021. And what we're going to find is that this code no longer runs, because of an error that the argument "data" is missing. Now, I do know what's up here, but this is the kind of thing that can take forever to debug, where you just get stuck. The eagle-eyed among you will notice that the problem is actually that this argument was renamed to .data in a later version of tidyr.
And so, if I look here, what version of tidyr am I running? I am running tidyr 1.0.2. Now, let's look again at when this last ran: June 2018. And I happen to know that as of June 2018, the version of tidyr back then was 0.8.1. So, what's happened is tidyr has gotten upgraded, because I did something else on my system that bumped the version of tidyr. And now my code is broken. It's just broken. And that's really frustrating, right?
So, let's go to the much more friendly named renvWorking example here. And what you'll see is that this is exactly the same code, identical code. And this is going to run. rmarkdown::render. Boom. Done. And now I've got a version that works today, right? This ran just now, February 17, 2021.
And so, what's the difference, right? If you've been paying attention, I'm sure you realize the difference is that I'm using renv here. So, inside the working version, the version of tidyr is 0.8.1, right? Whereas in the non-working version, it had gotten upgraded in a way that I didn't really mean for it to.
And to be transparent about how this works, let's check out those .libPaths() values, which I mentioned is how R looks up where the library is. And so, you'll see the libraries in the broken version: this is my user-level library, and this is the system-level library. So, at some point since 2018, I've updated the version of tidyr in my user library, which is not very hard to imagine. On the other hand, inside the version that works, the first .libPaths() entry is inside my home directory, inside the renvWorking project. Inside this project, there's an renv library that is going to keep this package set safe for me for later.
So, that's great. renv has done its job. I am keeping myself safe. So, that solves one problem, right? My package versions will stay consistent as long as I keep this renv library with this project.
So, what happens now if I want to install another package? Let's say I want to install ggplot2. Oh, great. So, it installed ggplot2 2.2.1. And that's great, because ggplot2 2.2.1, I happen to know, is the version from June 2018. Currently, ggplot2 is on 3.3.3, which is not the version I want, right? That's too new. It might break things. So, how did I get the old ggplot2? If I look at my options for repos, that's the answer: my repo here is pointed at public RStudio Package Manager, and I happen to know that this URL format means this is a dated repo.
So, let's look at what that means. Here's RStudio Package Manager, the public one, packagemanager.rstudio.com. You can go there and use it as you wish. It's free and available to use. And what I'll see is that if I go here to the setup pane and go back to June 2018, there's that snapshot ID, 72, right? That corresponds to that date. So, what that means is that every time I install packages from this repository, they will work together as of June 2018. And so, I never have to worry about installing a new version that's going to mess things up, because they will always play nicely together.
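Assuming the URL format works the way the demo describes, pinning a session to that dated snapshot might look like this (the snapshot ID 72 echoes the one shown in the demo, but the exact URL format here is an assumption):

```r
# Point the repos option at a dated Package Manager snapshot, so every
# install resolves to the package set as of June 2018:
options(repos = c(CRAN = "https://packagemanager.rstudio.com/cran/72"))
install.packages("ggplot2")  # resolves to the 2018-era version
```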
The other thing I will say, and this is, again, a thing that's different between R and Python, is that in R, you need to specify that you want binary repositories for your operating system. They're specific to the operating system, and you can change that here to choose the right repository. This is really important, just because using binary repositories can really, really speed up install times by, like, several orders of magnitude. And so, it's worth it to choose a binary repository. That's not necessary for Python. Python, by default, will always give you the binaries for your system.
The renv lock file
Now we know why my tidyr works, but how is this being maintained? What renv does is save what's called a lock file. This lock file contains my version of R; it says it's R version 3.6.2. It has that repository in it, right, the repository that I had set when I started this project. And then I have a list of all of my packages, what version they are, and what repository they came from. If I have GitHub packages, it will record that as well.
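Here's an abridged sketch of what a renv.lock file like this might contain (the field names follow renv's JSON lock file format; the values echo the demo, and the repository URL is an assumption):

```json
{
  "R": {
    "Version": "3.6.2",
    "Repositories": [
      { "Name": "CRAN", "URL": "https://packagemanager.rstudio.com/cran/72" }
    ]
  },
  "Packages": {
    "tidyr": {
      "Package": "tidyr",
      "Version": "0.8.1",
      "Source": "Repository",
      "Repository": "CRAN"
    }
  }
}
```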
And so, this gets recorded in my lock file. And if I want to share this with my colleague, right, all I need to do is share the renv lock file with them. So, here's what I'm going to do. I'm going to take that renv lock file, and I'm going to copy it to the broken version.
So, all that I've done, right, this is the version of this project that does not work. tidyr is wrong here, right? This is, again, that wrong version of tidyr, the new version. And so, now what I'm going to do is run renv::init, which will stand up my renv environment, add it to this project, and restore it from the lock file. The other thing I just want you to notice is that restoring from the lock file was almost instantaneous. And the reason is that renv maintains a user-level cache, so if I've ever installed these packages before, they automatically get restored, and it is super fast.
And so, now if I check the package version: tidyr, 0.8.1. Awesome. Now I've got the right version. Now I can run this code, and it just works, right? So, this is really awesome, and it makes for an easy way to share a project and its environment with somebody else, or with future you.
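On the receiving side, the restore workflow just described boils down to a couple of calls (a sketch, assuming renv is available to install):

```r
# In the project directory that now contains the copied renv.lock:
install.packages("renv")  # once, if renv isn't installed yet
renv::init()              # sets up the project library and restores it
                          # from renv.lock (near-instant if the packages
                          # are already in renv's user-level cache)
renv::restore()           # re-syncs the library to the lock file later
```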
Starting a new project with renv
Great. Okay. So, one more question that comes up all the time is, like, okay, great. How do I do this for a new project, right? Like, how do I start this afresh? And so, it's really quite easy. When I go to new project here in the RStudio IDE, new directory, I'm going to start a new project. I'm going to call this my new project. And now, all I do is I'm going to click here, use renv with this project. And now, renv is going to be set up with this project as it starts here on RStudio server. It's going to set up a new project for me with my sort of folder there with my Rproj file, the project folder, and an renv library that is attached to the project by default, which is really sort of a nice thing to do here.
Now, what this means is that whenever I go to install packages, it will install, again linking in from that cache, super quick. And so, that's part of the value of using renv. Now, what if I want to use a snapshot of the repository? There's a little bit of judgment here around what date to choose. You want to choose a date that works for your project. My general recommendation is to choose a date around the time your project is pretty stable. So, you probably don't want it to be the day you start the project, but maybe when the bulk of the coding is done and you're thinking, okay, it's time to save this for later.
You can do renv::modify and go in here. Here's my renv lock file, and I can edit it. Let's get a repository that's more up to date. Let's say I want to use yesterday's date. I can get the right repository URL for that date and just put it in the lock file here. And now, when I start a new session, it's going to automatically set my repository to be the right repository.
I'm going to take just a second here to reload. But after it reloads, what you'll see is that it's going to be that updated version of the Package Manager repository. One other thing to be aware of is how to capture dependencies. I'm going to save an R file here, and now, when I do renv::snapshot, it's going to introspect on those files. And it's realized that I need ggplot2 in here, and all of its dependencies. So, this makes it really easy to use. You can also set up auto-snapshots so that it will snapshot any time you run install.packages inside this project folder.
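The dependency-capture step just described can be sketched like this:

```r
# renv scans the project's .R / .Rmd files for library() and :: usage,
# then records the versions currently in use into renv.lock:
renv::snapshot()

# Optional: preview which files pull in which packages before snapshotting.
renv::dependencies()
```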
Wrapping up
Great. So, that's basically what I wanted to share in terms of using renv. And so, we're going to wrap up now. And so, just a couple of things to review here.
So, there are two things to encourage you to do. Do a package management plan with your group, right? How are you managing repositories? What are the settings supposed to be, the default repos? And how are you managing libraries? And before you start your individual data science work, do your mise en place. Use renv. And, you know, I just showed off how to use renv for R. The process of using virtualenv with Python is really, really similar.
If you want more information about this, we have a website, environments.rstudio.com, where you can feel free to check out more information. And if you're interested in RStudio Package Manager, if that seems like something your team could use, feel free to send us an email at sales@rstudio.com, and we can talk more with you there.
Q&A
Also interested in RStudio's Docker images, but they are advertised as experimental. Any idea of when they will be out of experimental? No. I don't know that we have any plans to make them, you know, more strongly supported than experimental now, for the time being. At this point, you know, if you want to use them, you should feel free to use them. But no imminent plans to stabilize them as far as I know.
Oh, this is a great question. Somebody asks: interested in the advice that users should manage their own libraries. To that end, what's your advice on the role of a centrally managed site library? This is a great question. The short answer here is that I do not believe it's necessary. I strongly recommend against trying to disallow users from being able to pull their own packages, right? Data scientists should be able to manage their own specific packages for themselves. And the reason for that is that, for example, let's say I have just a centrally managed set of packages. Then somebody has to do that management, right? Then IT has to manage that library, which can be very difficult: managing not only which packages, but which versions, what's installed, what date it was installed, and how that corresponds to each project. This gets really complicated really, really fast. And so, I strongly recommend against having all the packages in a centrally managed site library.
So, there's another version of this that's a little bit softer, which is, like, an admin stands up a sort of starter set of packages that are centrally installed, and then users can install their own packages on top of that, but, like, they can use the starter set if they want to so that, you know, that works more easily. There's nothing wrong with doing that. It's totally fine. In most cases, it is unnecessary, we tend to find.
Here's another great question: does what renv saves include all of the packages that are installed, or only those actually used inside a project? Yeah, this is a great question. By default, if you do a snapshot, it snapshots only those packages that are in use at the current time. And what that means is that it can find them somewhere in a library() call or a :: call, right? If it can find either of those, it will capture those packages as dependencies.
This does sometimes get people into trouble. If people are putting install.packages calls inside their scripts, those will not get picked up by renv, and it's also a bad practice, right? If you're doing this, don't. Use renv instead. It works way better, it captures versions and not just packages, and it doesn't mess things up, right? A stray install.packages call like that could mess up somebody's entire package environment. There is a way in renv to turn on auto-snapshotting, which means any package that gets installed will get snapshotted. And so, that can end up being more than the packages you actually need. And then there's a prune command, basically, that allows you to prune the renv library back to only the things that are actually being used.
Do you have a free version of package manager? Yes, it's the public one. RStudio package manager is public package manager. If you want something behind your firewall, an internal version, that is where our pro product comes in.
In Package Manager, we can control the list of R packages it caches; when will it do the same for Python packages? No guarantee on when. It is under active development by the Package Manager team and will be out shortly. For those of you who aren't quite aware, Package Manager, the private version, supports Python repos, PyPI repos. Currently, though, it only supports full mirrors. It does not support subsets and that sort of thing. This will be forthcoming in future versions of Package Manager.
What is your recommendation for using a mix of validated and public packages in one environment, in a highly regulated environment? Yeah, this is a great question. So, there are a bunch of different ways to do this. The most common way we see this done, and let me go back to our internal demo version of Package Manager, is that you can create a validated repository here inside RStudio Package Manager, and that can sit alongside your other repos, right? You can have multiple repos, and so you can do this yourself.
You can create a validated repository that has only those validated packages in it, right, with the versions and packages specified, and you can have that sitting beside a full CRAN mirror. So, what you can do then is say, okay, for the validated projects, use the validated repository. And again, just go to the setup tab here, make sure you're using the appropriate binary repository, and choose a date if you want to. And if you're doing a more public project that doesn't need that high degree of validation, just use the all-in public repository.
I know we did not have time to get to all the questions here. Unfortunately, we're, I think, out of time. I will be putting together some more materials on this topic and sort of emailing them out, so I really appreciate all the questions. I will take a look at all these questions, and we'll probably include a little FAQ and some of the future documents we send out, and we'll try and address as many of these as possible. Again, really, really appreciate everybody's time today, and for sticking with us through all the technical difficulties today. I've never quite had a webinar go like this, but I really appreciate everybody sticking with us, and hope you got something out of the time today. So, thanks so much for joining, and look forward to hearing from you more soon.
