Sean Lopp | Announcing Posit Package Manager | RStudio (2019)

Posit Package Manager is the newest professional product that helps teams, departments, and entire enterprises organize and centralize package management. If you've ever struggled with IT to get access to a new (any?) R package, reproduce an old result, or share your code with others, Posit Package Manager can help! We'll introduce the new product, discuss how R repositories can be used to solve problems, and take a sneak peek at what is coming in 2019.

View materials: https://github.com/slopp/rspm-rstudioconf

About the author: Sean Lopp has a degree in mathematics and statistics and worked as an analyst at the National Renewable Energy Lab before making the switch to customer success at RStudio. In his spare time he skis and mountain bikes, and he is a proud Colorado native.

*Posit Package Manager was formerly known as RStudio Package Manager.

Transcript

This transcript was generated automatically and may contain errors.

Well, thank you so much for taking some time today. I'm excited to introduce you to the newest commercial product from RStudio. Maybe not the sexiest product, but probably our most important to date, if you ask me: RStudio Package Manager.

So here's what we're going to talk about today. We're going to talk about how it's important to organize packages, and one way to do that is with a central repository, which is what Package Manager helps you create. We're going to talk about automatic versioning, or how you can make sure that your results are reproducible and that the package environment those results depend on is also reproducible. We'll look at how you can use a tool like Package Manager anywhere you're running R, whether that's within the open source IDE, in a batch script somewhere, or in RStudio Connect.

Then we're going to spend most of the time looking at these different ways you might want to expose packages to users, whether it's all of CRAN, a part of CRAN, maybe some local packages, or even pulling packages from Git. And then finally, we'll briefly discuss how you can track all that's going on through some usage metrics.

The love-hate relationship with R packages

Before we do that, though, I want to quickly lament an alternative talk that I don't have the chance to give, which is 10 things that I hate about R packages, and if you've seen this excellent movie from the 80s, you might recognize the killer quote at the end, and I think it does a good job summarizing how I feel about packages. I love them, but I also hate them a little bit, but I hate the most the way I love them.

And I think we can all relate to that. As R users, R packages are what make the language so powerful. And I think we love them, that's why we're here, but they don't come without their set of challenges. And so I want to quickly discuss a few of those challenges that maybe you've experienced.

So the first one is compiler errors. If you've ever tried to install a package, especially on a Linux system, and gotten garbage like this, you've had a moment of hate. Especially for the XML package. Man, it's tricky.

Still, at least with a compilation error, when you go to install a package, it'll fail, and it'll fail in an ugly way, but at least it fails. There are other errors that are a bit more subtle: you go to install a package, and the install succeeds, but actually using the package fails.

And so this is a discussion pulled from the rlang package. Basically, what this poor user had done is they upgraded rlang, but they did not upgrade dplyr. Because dplyr depends on rlang, they got into a situation where both packages were technically installed, but neither was usable. Lionel, the very helpful package maintainer that he is, walked the user through this situation. His first comment up there is what I really want you to focus in on: unfortunately, in R there's not a great mechanism for preventing this type of problem, so it's a problem that you can find yourself in quite frequently.
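One way to spot a half-upgraded library like this is to compare what is installed against what the repository currently offers. The following is a minimal sketch with made-up version numbers (they are illustrative, not the versions from the rlang discussion), and the variable names are assumptions. Base R's `old.packages()` performs a similar comparison against a live repository.

```r
# Illustrative data only: versions installed locally vs. versions a
# repository offers. Neither vector comes from the talk.
installed <- c(rlang = "0.3.0", dplyr = "0.7.4")
available <- c(rlang = "0.3.0", dplyr = "0.7.8")

# package_version() gives version objects that compare element by element.
is_behind <- package_version(installed) <
  package_version(available[names(installed)])

# Packages that were left behind when something else was upgraded.
behind <- names(installed)[is_behind]
behind
```

A check like this only reports that a package lags the repository. It cannot tell whether the lag actually breaks anything, which is the gap the maintainer's comment points at.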

Package management as a supply chain problem

This was made even worse by the fact that the user then tried to install the latest version of dplyr onto their Windows machine, and CRAN was serving an out-of-date binary, so the problems got worse and worse. So these are kind of the tip of the iceberg.

Those are two things that maybe you've encountered yourself, but I want to do a bit of fear mongering here, which is that even if you're a single data scientist, there's a lot more that can go wrong, and the problem is it can go wrong further down the road. So I want to discuss a few problems where, if you have the problem, it's really too late. It's kind of like a smoke detector and a fire. If you're burning in the fire, you probably wish you had invested in the smoke detector, and there's a similar issue around package management. It's critical that you think of your strategy for managing packages before you're in the fire.

So what does that fire look like? Well, there are three things we've seen time and time again, unfortunately, with some of our customers. The first one here is you make a critical decision based on an analysis, and then you're not able to rerun the analysis, because when you click go on your code, you get a package error. That's a really tough place to be in, and it can be quite hard to try to figure out what you need to back out or reinstall or change to get the analysis to run again.

The second one here is maybe a bit more mundane. If you're an R user today who does all their work on a desktop and you can install packages from wherever you want, congratulations, you're in a happy spot. You won't be there for long. As your analysis leaves the organization or permeates the organization, you'll likely be asked to move your analysis into other places, other servers that might not have as easy access to the internet. At that point, if you haven't thought at all about what packages you're using, you're going to have a hard time working with IT to get those packages into a happy spot.

And then finally, there's something as simple as sharing your code with a colleague. If you don't have a plan for package management now, it can be really hard, and you'll end up in a situation where maybe the new junior R user you've brought on board is spending their first day looking at your session info output, trying to reinstall packages. So none of these things are where we want to be, but the good news is that we can come up with strategies that will prevent these problems from happening in the first place.

And so that's what we're going to talk about today: the role that Package Manager can play in those types of strategies. Instead of having every R client install packages from the various places they live, you know, in GitHub or on CRAN or maybe in your own Git enterprise server, you can have a central place, your own repository, that organizes, collects, and governs what packages are available, and then have your actual R clients install from there. There are many ways to create these repositories. There are open source packages like miniCRAN, and we believe that Package Manager provides a great end-to-end solution for organizations that are interested in solving this type of problem.

To maybe help you wrap your head around this, I'm going to try an analogy with bad clip art. It's worked for me before, and I hope it works again. I love to cook, but I wouldn't consider myself a professional. I'm kind of a hobbyist. So typically, you know, whatever I have in my pantry, I throw into a pot and hope for the best. I use my home kitchen appliances, and I look for great deals at the grocery store. I think that's maybe similar to when I started in R. I had the laptop that was available to me, I had the tools that were available to me, and I installed whatever I needed to try to get the job done. But a professional chef would never take the approach I take to cooking.

If they make their day-to-day business, you know, by selling delicious things to eat, then they're probably going to upgrade each of these things. They're going to have a detailed recipe to make sure they get the same results each time. They're going to have an industrial kitchen instead of, you know, my home kitchen. And maybe most importantly, they're going to focus on and own the supply chain for all the ingredients that they need. We think it's important that R users who are using R day-to-day think about the same upgrades.

So there are a lot of excellent tools that have been built, and will continue to be built, for managing the project to ensure that you have reproducible results, whether that's something like packrat or, as a great start, just using an R Markdown document where you're forced to knit it from end to end every once in a while. We also work with a lot of organizations who spend quite a bit of time focusing on that equipment upgrade. They spend time thinking about how many CPUs they'll need, how much RAM they'll need, getting a new cluster or the latest compute technology. But what separates those organizations from really exceptional organizations is that the exceptional ones also consider the supply chain for data science, and here that's critically package management.

So I'm not going to claim that RStudio Package Manager will solve all of the problems that I discussed at the beginning, but it will help a lot with the supply chain and is a powerful tool that will put you in a good place to have a strategy that meets all of the needs. And then finally, before we hop into the demo, I just want to mention we've heard this in pretty much every talk today, that this type of supply chain management is not something you can do on your own. So you're going to have to work with a number of different groups, IT ops, information security, business users, and we've worked really hard to make sure that Package Manager makes life easy for all those different groups so that you can go to your IT operations person who doesn't know how to spell R and walk them through the process of how they're going to set up an environment that will make you both successful.

Demo: RStudio Package Manager

So let's actually look at the tool. A bit of context, RStudio Package Manager is an on-premise commercial product. So it's something that you would install and have for your own corporate use. I'm going to show you a demo server that we've set up, and you're more than welcome to follow along or poke at this throughout the conference. It's demo.rstudiopm.com. And I'm going to start by looking at maybe the simplest case, which is I just need access to CRAN packages.

So here we're looking at ggplot2. And one of the things that's really exciting about Package Manager is that it knows a lot about the ggplot2 package. It knows all the dependencies that the package has. It also knows critically about all of the reverse dependencies, so packages that in turn rely on ggplot2. We're looking at the main, kind of the current version, but Package Manager also gives access to all the previous versions of the package as well. So that's kind of that first critical bit: to ensure reproducibility, you have to have a way to get all of the older versions of things.
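The same two questions, what a package depends on and what depends on it, can be asked of any repository index from R. Here is a minimal sketch using `tools::package_dependencies()` against a toy index; the package names (`plotlib`, `scalelib`, `corelib`, `dashapp`) and their dependency data are invented for illustration, and this is not a description of how Package Manager computes its own view.

```r
# Toy repository index using the column layout that available.packages()
# returns. All package names and dependencies here are hypothetical.
db <- rbind(
  c(Package = "plotlib",  Depends = NA,        Imports = "scalelib, corelib", LinkingTo = NA),
  c(Package = "scalelib", Depends = NA,        Imports = "corelib",           LinkingTo = NA),
  c(Package = "corelib",  Depends = NA,        Imports = NA,                  LinkingTo = NA),
  c(Package = "dashapp",  Depends = "plotlib", Imports = NA,                  LinkingTo = NA)
)

dep_fields <- c("Depends", "Imports", "LinkingTo")

# Forward dependencies: everything plotlib needs, directly or indirectly.
forward <- tools::package_dependencies(
  "plotlib", db = db, which = dep_fields, recursive = TRUE
)

# Reverse dependencies: packages that in turn rely on plotlib.
reverse <- tools::package_dependencies(
  "plotlib", db = db, which = dep_fields, reverse = TRUE
)

forward[["plotlib"]]
reverse[["plotlib"]]
```

Passing the result of `available.packages()` as `db` instead of the toy matrix would run the same queries against a real repository.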

And so Package Manager makes those available so that you can see them and install them. That's the first level of versioning that happens at the package level, but we also version everything that happens on the repository as well. So here we're looking at a log. Like any organization, we need to take in updates from CRAN over time to get access to new packages or updates to existing packages. But any time we do that, we record a transactional version ID, and that ID allows us to either install packages from this moving tip, the latest and greatest, or to install packages from an older point in time.

So how do we do that? How do we actually install packages using this tool? Well, we've worked really hard to make sure that our users don't have to change their behaviors or the tools they're familiar with. So all you have to do is grab a URL. Again, you can pick if you want the latest and greatest, this default URL. And in RStudio 1.2, we've added a new user interface for specifying that repo option. So you just throw the URL in there, and then you can use install.packages as you normally would. If you're running R on a server, your admin can set this for all users so that right out of the gates, they get the right behavior.
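In plain R, without the IDE, the same setting is the `repos` option. A minimal sketch follows; the URL is a placeholder assumption, not the address of the demo server, so substitute whatever URL your own server displays.

```r
# Hypothetical URL: replace it with the repository URL shown by your server.
repo_url <- "https://packagemanager.example.com/cran/latest"

# Point this R session at the central repository.
options(repos = c(CRAN = repo_url))

# From here on, install.packages() resolves packages from repo_url, e.g.:
# install.packages("ggplot2")
getOption("repos")
```

On a shared server, an admin could place the same `options()` call in a site-wide startup file such as `Rprofile.site` so that every session starts with it.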

But I mentioned those versioned IDs. So you can also, as of the latest version of Package Manager, pick from an older point in time. So it's January. I have an analysis that I created back in July, and I need to recreate the environment for that analysis. Well, I can kind of time travel back to the repository as it existed back in July and install any packages from that point in time.

And I'd like to kind of briefly go on a tangent here. So I'll come off of my pedestal in just a second. But people ask us all the time how Docker relates to reproducibility. And Docker is an amazing tool. It allows you to programmatically specify what your environment should look like. But time and time again, and in fact, I'm sure you'll see this in other conference talks, you'll see a Dockerfile that looks like this, where somewhere there's a line that says RUN install.packages. And it's great that you're specifying what packages should be in your environment. But the problem is any time you recompile this Docker image, you're subject to whatever install.packages gives you at that moment, so you're not actually going to get the same reproducible environment any time you recompile the Dockerfile.

Luckily, there's an easy fix here, which is to use one of those techniques for freezing the versions of your packages. And so here we're using that frozen URL. And now we're guaranteed that any time we build this image, we're going to get the same exact set of packages and the same versions of those packages.
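The Dockerfile from the slide is not captured in the transcript, so what follows is only a sketch, in R, of the install step such an image build might run. The base URL, the snapshot identifier, and the helper name `install_pinned` are all assumptions; the real frozen URL should be copied from the server rather than assembled by hand.

```r
# Hypothetical values: the actual frozen URL comes from the server's UI,
# and its layout may differ from what is pasted together here.
repo_base <- "https://packagemanager.example.com/cran"
snapshot_id <- "123"
frozen_url <- paste(repo_base, snapshot_id, sep = "/")

# Packages the analysis needs (illustrative list).
pkgs <- c("ggplot2", "dplyr")

# Wrapped in a function so nothing is downloaded until it is called,
# for example from the install step of an image build.
install_pinned <- function(pkgs, repo) {
  install.packages(pkgs, repos = repo)
}

frozen_url
# install_pinned(pkgs, frozen_url)
```

Because the repository URL is fixed to one point in time, running the install step again later should resolve to the same package versions instead of whatever is newest that day.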

Repository flavors: subsets, internal packages, and Git

So that's kind of how you might share all of CRAN. There are other options as well. So a lot of our customers work in really specific validated environments. In those cases, you might not want all of CRAN accessible. You might want only a part of CRAN that's specifically designed and locked down for a project. And so we allow you to easily create these subsets of CRAN inside Package Manager.

So here, if we look at this repository called validated, it basically has two packages at the top level, shiny and then xgboost. So we have a record of each of those packages being added to the repository. And along for the ride, all of their dependencies were added as well. If you wanted to add another package, so say I now want to extend this project to include plumber, you can use the Package Manager CLI. So what we're looking at here is how admins would interact with this server. And Package Manager uses that knowledge of the dependency tree for a package and the repository that it lives in to figure out what other things would need to change for plumber to be added to this repository.

So here you can see some of the dependencies of Plumber are already present. Others would need to be added. It uses that same intelligence to give you a preview of what would happen if you wanted to update one of the packages. So we've worked really hard to make sure that if you move forward in time, you're doing so in a consistent way. And so you never end up in that crummy situation where all the packages will install but some of them won't work.
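The preview described above boils down to a set difference between what the new package needs and what the subset already holds. Here is a rough sketch of that idea in R; it is not the Package Manager CLI, and the package names (`apiserver`, `httplib`, `jsonlib`, `corelib`) and their dependencies are invented.

```r
# Toy source index in the column layout of available.packages().
# Package names and dependency data are hypothetical.
source_db <- rbind(
  c(Package = "apiserver", Depends = NA, Imports = "httplib, jsonlib", LinkingTo = NA),
  c(Package = "httplib",   Depends = NA, Imports = "corelib",          LinkingTo = NA),
  c(Package = "jsonlib",   Depends = NA, Imports = NA,                 LinkingTo = NA),
  c(Package = "corelib",   Depends = NA, Imports = NA,                 LinkingTo = NA)
)

# Packages already present in the curated subset (also invented).
already_in_subset <- c("jsonlib", "corelib")

# Everything the new package needs, directly or indirectly.
needed <- tools::package_dependencies(
  "apiserver", db = source_db,
  which = c("Depends", "Imports", "LinkingTo"), recursive = TRUE
)[["apiserver"]]

# Dry run: what would have to be added, and what is already there.
to_add <- setdiff(c("apiserver", needed), already_in_subset)
already_present <- intersect(needed, already_in_subset)

to_add
already_present
```

Computing the full closure before changing anything is what lets an update be applied as one consistent step rather than package by package.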

So those are the packages from CRAN that are available in this repository. But if we look back at the log here, there is another layer, which is that we have an internal package, so a package that's not available on CRAN that we also want to make available to this project. That's called RStudio internal. And to do that, we have this package in Git on our Git enterprise server. And all we have to do is tell Package Manager about it. We can ask it to give us the commits related to this package or specific tagged releases. And Package Manager will turn that Git repository into an R package that users can then install with install.packages. So you don't have to worry about trying to get every R user in your organization access to your Git infrastructure. It can all go through this central place. So you have a lot of visibility there because it's central.

And so whether you're using, you know, these packages from Git, this subset that's locked down for a project, or you're just accessing all or part of CRAN, because everything's going through this one place, you can see and get visibility into what's going on. So here we can look at package downloads over the last 90 days. We can see what packages are popular. And we can answer that pesky legal question, which is what's our risk exposure to maybe the GPL license. So it's a great benefit of going through a central place is that you have much better visibility into what's occurring.

So to do a quick summary of what we just looked at, we organized packages in a central repository. We saw the two different levels of versioning that occur, both at the package level and at the repository as a whole. We saw how you might integrate Package Manager with R and actually use this thing by just calling install.packages. We looked at the different flavors of a repository that you can create, and we briefly saw those usage metrics.

What's coming next

I want to quickly talk about what's coming up next. If you think back to that very first thing I hated about R packages, it's that they can be hard to install. And so we're gathering a lot of information about what underlying system requirements are necessary for packages to compile, and we're going to make those available to users through Package Manager. We also hope later this year to make Linux binaries available as well. So if you're operating on a Linux server, you don't have to incur the cost of compiling a package every time you want to use it.

Then finally, if you're using Bioconductor, I'd love to talk to you more in the professional lounge. There are a lot of ways to use Bioconductor with Package Manager today. We'd also like to hear from you about how we can make it more of a first-class citizen in the tool.

All the details are going to be available online, but the better way to learn more is to swing by the professional lounge. We can talk about things and drill into specific parts of the demo. And then finally, before I end, you know, you might be sold on Package Manager or you might not be. And either way is fine with us. But I think it's critical for all of you to have a conversation about where packages are coming from in the projects you're working on.

So to encourage that type of discussion, we're announcing a challenge here. If you go to this URL at the top (grab out your phone and take a picture of the slide), you'll be able to install a package called count deps. It's a really simple package. It's just going to tell you how many packages your project uses. And you might be surprised at the number. A simple shiny app I created had more than 112 package dependencies by the time you counted all of the other things that were brought in.
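The challenge package's interface is not shown in the transcript, so the following does not use it. It is a sketch of the same idea with base R tooling, counting dependencies against whatever is installed locally; the `project_pkgs` vector is a stand-in assumption for the packages a real project loads.

```r
# Stand-in for the packages a project loads. These two ship with R, so the
# sketch runs on any installation; a real project would list its own.
project_pkgs <- c("stats", "tools")

# Use the locally installed packages as the dependency index.
local_db <- installed.packages()

# Direct and indirect dependencies of each project package.
deps <- tools::package_dependencies(
  project_pkgs, db = local_db,
  which = c("Depends", "Imports", "LinkingTo"), recursive = TRUE
)

# Number of distinct packages pulled in across the whole project.
n_deps <- length(unique(unlist(deps)))
n_deps
```

Swapping `project_pkgs` for something like `"shiny"` (if it is installed) gives a sense of how quickly the count grows once indirect dependencies are included.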

So, the challenge part: tweet us at @rstudio with that number. Get your colleagues to do it, too. And you'll be entered into basically a raffle to win a free year of Package Manager. We'll give you a license for a year. But either way, we think it's critically important for everyone to have a conversation about, you know, the 114 packages that you're using: where are they coming from, and how are you going to ensure that you always get them so that, you know, the recipe you're creating is delicious each and every time. So with that, swing by the Pro Lounge, and I think we might have time for a few questions. Thank you.

Q&A

Thank you, Sean. We do have time for a couple of questions. It looks like we've got some hands up. So we'll pass the mic.

Hi. Do you have any strategy for dealing with system-level dependencies that come from outside R, from the system that it's living within?

Yeah, definitely. So one of the challenges, as I mentioned, with R packages is that an R package can depend on other R packages, obviously, but it also can depend on system dependencies. So the xml2 package requires that you have the libxml2 system dependency. And so the first challenge there is that R packages are not required to state what those dependencies are. So there is no database yet with all of that information. So the first thing we're doing is creating that database of information, and then we'll surface it to users through Package Manager. So when you go to install a package, it'll tell you what other system dependencies you'll need.
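For context on why such a database is needed: a package's DESCRIPTION file may carry an optional, free-text `SystemRequirements` field, which is easy to read but not structured enough to act on automatically. The sketch below reads that field from a made-up DESCRIPTION; the package name `xmltool` and the field's wording are invented.

```r
# A made-up DESCRIPTION file. Only the field names are real; the package
# and the wording of its requirement are hypothetical.
description_text <- c(
  "Package: xmltool",
  "Version: 0.1.0",
  "SystemRequirements: libxml2 (development headers)"
)

# read.dcf() parses the "Field: value" format used by DESCRIPTION files.
con <- textConnection(description_text)
desc <- read.dcf(con)
close(con)

# The field is free text, so a program cannot reliably map it to an
# operating-system package name without extra curated information.
sys_reqs <- unname(desc[1, "SystemRequirements"])
sys_reqs
```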

Hi. So this whole program looks really, really great with the package management. But for those of us who, say, aren't working in industry, but maybe working in academia, and can't be spending $1,000 a year on a package manager, what recommendations do you have? Will there be a reduced free version of this at some juncture, or would you put your money behind miniCRAN?

Yeah, that's a great question. Because I don't know if I would. So all of our commercial products are available for free to teaching institutions. There are some terms and conditions that apply, of course, but our sales team would be happy to walk through those. And they're available at significantly discounted rates if you're not teaching, but working at an academic university for the purposes of research.

I have two questions. For the Linux binaries, what platforms do you intend to support? Will it be Ubuntu, Red Hat, or others?

Yes. So we would love to hear from you about what platforms you're using, but our initial support will be for Ubuntu and Red Hat.

And second question: when you install the packages via the package manager, is there any kind of, I don't know, additional security around it? Is there package vetting above and beyond what R Core does for CRAN?

Yes, that's an excellent question that we get quite frequently: how do I guarantee the security of an R package? So that's not a problem that Package Manager is tackling. To be honest, we would need to be a different company to vet all 14,000 packages securely. One thing we can say is that if you don't have a central place that you're installing packages from, there's no way you'll be able to address any security concerns that your IT group might have. So one of the recommendations, for instance, would be to use the fact that you have a central repository as a place for IT to incorporate whatever other security processes they want to put in place, such as, as Mark mentioned at the end of his talk, scanning the binaries for viruses.
