Joe Roberts - Saving time (and pain) with Posit Public Package Manager

CRAN, Bioconductor, and PyPI are incredible resources for packages that make performing data science in R and Python better. But there’s also a better way to obtain those packages! Companies like Databricks are leveraging Posit Public Package Manager to make their users’ package installation faster and more reproducible. Learn why, and how anyone – anywhere – can easily get started using Public Package Manager. Talk by Joe Roberts

Oct 31, 2024
15 min

Transcript

This transcript was generated automatically and may contain errors.

So hi, I'm Joe Roberts. I am the product manager here at Posit, and I want to talk today about how to save time and pain with Posit Public Package Manager. So what we're going to talk about today is, obviously, what Public Package Manager is, if you haven't heard of it before. Then we're going to talk about some of the common package pain points that you have dealing with packages, in R particularly, but we'll talk about others as well. And then how you can move to Public Package Manager for your repository, if you want to.

But first we need to start with CRAN. I know you all love CRAN, whether you know it or not. CRAN is the official home for over 20,000 user-contributed R packages, like many that we've seen here. And what's great is that installing packages just works with R's install.packages(). Whether you know it or not, when you type install.packages() or do it from the IDE, you're generally going out to CRAN to get those packages, download them, and bring them back in and install them. And my hot take today is that the biggest reason that R has become so popular for data science is exactly because of that. Because it's so easy to go to CRAN and get new packages that extend the use of R to so many great, useful things in data science.

CRAN does this through dozens of CRAN mirrors that are provided all across the globe. And to that, we have added our own way of dealing with CRAN, which we call Posit Public Package Manager. We'll refer to it a lot as P3M, those three P's and the M, just to make it short and simple. What Public Package Manager is, is a free hosted service based on our professional Posit Package Manager product, which has a lot more features. But Public Package Manager provides everything in those complete CRAN mirrors that you would normally use by default, plus many additional features.

Public Package Manager has been used widely in the R community since 2020, when we first introduced it. And we get over 40 million package installs from Public Package Manager every month. And really, the reason we designed it and put it out there is to address the common pain points of using public mirrors like CRAN.

Common package pain points

So, let's talk first about what some of those pain points are. Let's see if anyone can relate to any of these. How about the first one? My code used to work, but now it doesn't. Maybe you say, I wrote this code two years ago, but I got a new laptop. I saved my code, but I can't get the same results on my new laptop because something's changed. Or maybe I updated one package to get some new functionality, and everything broke, and now my whole project doesn't run.

Why do these things happen? Well, there could be several reasons. One, new versions of those packages may have changes that break old code. Most CRAN package authors are pretty good at keeping backwards compatibility, so when you update to a new version, old code should still work, but it's not perfect, and sometimes they intentionally break things to enable better functionality down the road. Similarly, dependent packages may be out of date. So, even though you updated one package, it may expect functions that are in newer versions of other packages that you don't have installed. That's because CRAN itself only guarantees compatibility between packages at their latest versions at any given point in time.

So, if you install all these packages at the exact same time, from the same date, they should all work together, and they're tested and validated that way. But if you have one installed last week and one installed two months ago, and you're mixing and matching that way, you can get out of sync in ways that just break. And third, what we see even more frequently on CRAN, sometimes CRAN packages just disappear. They get removed for whatever reason, and maybe you can't recreate that in the new environment.

So, how do we help with P3M? What we do at Posit is take daily snapshots of CRAN, and we also do it for PyPI and Bioconductor. You can then configure R on your side to install packages from a specific date, even if that date is not today. Instead of just getting the latest, you can go back and say, I want all the packages from February 18th, 2023, because that's what I had at the time. And then, if you reinstall all those packages from that snapshot in your repository, at least all those packages will be the same, and at least compatible with each other.

So, you can use those, for example, to recreate the package environment for an old project from your old laptop that crashed. As long as you know roughly when the project was written, you can usually get pretty close by doing it that way. But more importantly, you can also future-proof your work by tying it to today's snapshot and saying, I want to always use this snapshot, and I'll manually choose when I want to upgrade later. That way, if you're sharing with someone else, you can say: here's my code, install all the packages from this snapshot so you have the same things that I built it on when you try to reproduce it yourself.
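As a rough illustration of pinning to a snapshot, here is a minimal Python sketch that builds a date-pinned repository URL and renders the R statement that would point at it. It assumes dated snapshots are addressed as `https://p3m.dev/cran/<YYYY-MM-DD>`, by analogy with the `p3m.dev/cran/latest` form mentioned later in the talk; the setup page is the authority on the exact URL. The function names are hypothetical helpers, not part of any Posit tool.

```python
from datetime import date

P3M_BASE = "https://p3m.dev"  # short hostname mentioned in the talk


def cran_snapshot_url(snapshot=None):
    """Return a CRAN repository URL pinned to a date, or 'latest' if none is given.

    Assumption: dated snapshots live at <base>/cran/<YYYY-MM-DD>.
    """
    label = snapshot.isoformat() if snapshot is not None else "latest"
    return f"{P3M_BASE}/cran/{label}"


def r_repos_line(url):
    """Render the R statement that makes `url` the CRAN repository."""
    return f'options(repos = c(CRAN = "{url}"))'


# Example: the February 18th, 2023 snapshot used in the talk.
print(r_repos_line(cran_snapshot_url(date(2023, 2, 18))))
```

Sharing the printed line alongside the project code is one way to let a collaborator install from the same snapshot.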

Slow package installation

Next pain point: installing packages is really slow. And we see this in certain environments. You may recognize this if you are using Posit Workbench or some other Linux-based environment, where it just tends to be slower for reasons we'll talk about. Maybe you need to use an older R version. Maybe for whatever reason you can't use R 4.4, and you still need to use R 3.6, or 4.0, whether it's compatibility, policy issues, or whatever reasons you have in your environment. Or maybe you're using Docker containers and deploying in there, and your containers take forever to build every time you try to build them.

So, there are several reasons why this happens. One, installing packages from source is slow. Source packages are how most packages are natively available on CRAN, and they may contain C, C++, or Fortran code that has to be compiled, which takes some time. Some packages require additional libraries or build tools in order to do that, which makes it hard. You have to install all this extra software just to build the package. Still, there's the concept of binary packages, which basically pre-compile all of that work for you. Then you can just download your pre-built package that has everything ready to go, unzip it, and go. And CRAN, unfortunately, only provides binary packages for Windows and Mac, and only for the current and previous version of R.

So, if you're on Windows or Mac, you probably don't run into this specific problem very often, if you're on a relatively recent version of R. But in different environments and different combinations, this can still be an issue. So, again, how does Package Manager help? We provide pre-built binaries for all of CRAN on not just the current and last release, but the current and five old releases. So, we actually provide binary packages all the way back to R 3.6 right now. We do it for Windows and Mac and on 12 different Linux distributions. Wherever you're running Workbench, and since most Docker containers are built on any number of these Linux distributions, that can really speed things up. And it's also compelling for a lot of the cloud compute environments, like Snowflake and Databricks, because those tend to be Linux-based containerized environments where you're installing packages repetitively and you're tired of it taking so long.
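To make the Linux case concrete, the sketch below builds a repository URL that selects pre-built binaries for one distribution. It is written under the assumption that binaries are selected by a `__linux__/<codename>` path segment (for example `jammy` for Ubuntu 22.04); the setup page on p3m.dev generates the authoritative URL for a given environment. The helper name is hypothetical.

```python
def linux_binary_repo_url(distro_codename, snapshot="latest", base="https://p3m.dev"):
    """Build a CRAN repository URL that serves Linux binaries for one distribution.

    Assumption: the distribution is chosen with a "__linux__/<codename>" path
    segment placed before the snapshot label.
    """
    if not distro_codename:
        raise ValueError("a distribution codename is required, e.g. 'jammy'")
    return f"{base}/cran/__linux__/{distro_codename}/{snapshot}"


# Example: latest packages for an Ubuntu 22.04 ("jammy") container.
print(linux_binary_repo_url("jammy"))
```

A Dockerfile or container startup script could set this URL as the R repository so that installs download binaries instead of compiling from source.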

So, let's talk about why this is so important. Let's take a snapshot. This is a snapshot of CRAN from today, let's say, and these are the new packages that were added to CRAN today. So every day, we look at CRAN, and then we also need to look at all the reverse dependencies of those packages, because in order to make sure that the binaries are compatible, we need to make sure that any package that depends on one of those packages is also rebuilt at the same time. Most packages have a few reverse dependencies, but then you take something like dplyr, which has 3,500 reverse dependencies. So that's a really big number of packages that we have to start rebuilding when one package changes.

Then, like I said, we take those and do it on every single R version that we support. So we have to repeat that six times to cover all of them. And then we do that same combination for every distribution we support, and now we're up to, you know, sometimes 10,000 packages we have to rebuild in a day just to keep these snapshots and binaries compatible. So it's a lot that we do, which we then make available through Public Package Manager so that you can save all that time on your end and not have to build them yourself.
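The fan-out described here can be sketched in a few lines of Python. This is only an illustration, not Posit's build pipeline: the dependency graph and package names are made up, the traversal assumes reverse dependencies are followed transitively, and the platform count of 14 simply adds Windows and Mac to the 12 Linux distributions mentioned in the talk.

```python
from collections import deque


def packages_to_rebuild(changed, reverse_deps):
    """Return the changed packages plus everything that transitively depends on them."""
    seen = set(changed)
    queue = deque(changed)
    while queue:
        pkg = queue.popleft()
        for dependent in reverse_deps.get(pkg, ()):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return seen


def total_builds(n_packages, n_r_versions, n_platforms):
    """Each package is built once per R version per platform."""
    return n_packages * n_r_versions * n_platforms


# Hypothetical graph: key -> packages that depend directly on it.
toy_reverse_deps = {
    "pkg_a": ["pkg_b", "pkg_c"],
    "pkg_b": ["pkg_d"],
    "pkg_c": [],
    "pkg_d": [],
    "pkg_e": [],  # unrelated package, not rebuilt
}

rebuild = packages_to_rebuild(["pkg_a"], toy_reverse_deps)
print(sorted(rebuild))
print(total_builds(len(rebuild), n_r_versions=6, n_platforms=14))
```

Even in this toy case, one changed package turns into hundreds of builds, which is the multiplication the talk is describing.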

Speed comparison: CRAN vs P3M

Installation-wise, then, let's look at them side by side. One of these is going to CRAN, and one of these is going to Public Package Manager. They both, obviously, download packages. I just picked TensorFlow because it has a handful of dependencies that take a while to build. As you see on the CRAN side, it's doing all this compiling, with lots of stuff on the screen. I'm doing this on Ubuntu Linux because that's where we actually see a major difference, and there's a lot of build toolchain involved. Public Package Manager is downloading all these packages, and then, oh, look, all it has to do is install them. It's just unzip, install binary package, done, done, done, done, done.

And in 40 seconds, we've got Public Package Manager done. I don't have enough time to let this finish on the CRAN side, because they'd give me the hook off the stage, but it takes several minutes to get the equivalent thing done. And imagine if we had a whole lot more packages; it could be even worse.

Getting started with P3M

So, moving to Public Package Manager. We gave this a simple name as well, more recently, to make it easy to find. No QR code, because you can just remember p3m.dev as the way to get to our Public Package Manager offering. And when you get there, you'll land on our Public Package Manager page. It's relatively straightforward, we hope. We redesigned it to make it that way. There's one big setup button and three easy steps, and the setup button will take you to our setup page.

This walks you through, very simply, a few questions about your environment so we know the right way to serve you the packages that you need: your operating system and, if you're on Linux, what distribution you're on. This is where you can select the snapshot that you want, if you want to use a specific date-based snapshot. If you say no to this, you'll get the latest, just like you do on the public CRAN, which is also a good way to start if you just want the latest and greatest. And then we'll ask you a question about what environment you're running it in. That's just to give you the right information, with some simple instructions for how to configure your environment to update the repository URL to where you want it to be.

This repository URL is exactly what you're going to want to copy and paste. And then you'll put it into, for example, your RStudio IDE or Positron. You can set the repos option in whatever R environment you're in. In RStudio, there's the Packages section of Global Options, where you can just change the primary CRAN repository to Public Package Manager. And again, we provide instructions for various different ways of doing this. And for most of whatever you're using, if it's not in our instructions, their instructions will tell you how to update your package repository, or how to change the CRAN mirror you want to use.

Short and simple: if you just want the apples-to-apples equivalent of what the public mirror defaults are, p3m.dev/cran/latest gives you the exact same thing you see there. And importantly, we also have full mirrors and snapshots of PyPI. And we'll be releasing a preview of snapshots for Bioconductor later this week. So we'll have those snapshots available on everything there.
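For the Python side, a minimal sketch of pointing pip at a PyPI mirror or snapshot is shown below. It assumes the index is exposed at `https://p3m.dev/pypi/<snapshot>/simple`, following the same `latest` or dated pattern as the CRAN URL; that path is an assumption to confirm on the setup page. `--index-url` is a standard pip option, while the helper names are hypothetical.

```python
def pypi_index_url(snapshot="latest", base="https://p3m.dev"):
    """Build the package index URL for the latest mirror or a dated snapshot.

    Assumption: the index lives at <base>/pypi/<snapshot>/simple.
    """
    return f"{base}/pypi/{snapshot}/simple"


def pip_install_args(packages, snapshot="latest"):
    """Return the argument list for a pip install that uses the chosen index."""
    return ["pip", "install", "--index-url", pypi_index_url(snapshot), *packages]


# Example: install pandas as it existed in a (hypothetical) dated snapshot.
print(" ".join(pip_install_args(["pandas"], snapshot="2023-02-18")))
```

The argument list could be passed to `subprocess.run` or copied into a shell script, depending on how the environment is provisioned.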

Also, of course, feel free to explore around the Public Package Manager website. You can search for packages and see a lot of information in a lot cleaner, and we think easier, way than browsing the CRAN website via Google. That includes things like security vulnerabilities, which is especially useful on the Python side, to get a quick rundown of whether there are any issues with any versions of the packages that you have. So I encourage you all, if you haven't already, to try using Posit Public Package Manager today and see if it makes your life any easier. And there are a lot more options in our commercial product that can make life easier for teams and organizations. So feel free to stop by the product station up by the lounge, and we'd love to chat more. Thank you so much.

Q&A

Thank you, Joe. And I will say, I do love p3m.dev. I can never remember Public Package Manager. It's like too many characters. But, yeah, just a few questions. So our first question is: why use Package Manager over something like renv, or can they be used together? They absolutely can be used together. In fact, if you're using a Linux environment and are restoring with renv, and you're using Public Package Manager, renv will generally try to automatically find the Linux binaries for those packages as well. But yes, certainly, this is a companion to package management tools like pak or renv or various others.

OK, great. Thank you. Does P3M provide pre-built binaries for different architectures, like Intel versus Apple Silicon Macs? In certain cases, yes, but we're trying to expand that in some areas. OK, and ARM on Linux, that's the one you mentioned? Sorry, Linux, yes. We're still working on Linux ARM, but we're expanding as we go, so definitely reach out and give us more feedback on what you want to see more of.

OK, having set up P3M multiple times recently, I can attest the website makes it trivially easy to do. Thank you. Is this a gratis thing or a purchasable product? It looks like what we're already paying for as a corporate Posit product. What's the difference, if any? So everything in Public Package Manager is available to you already if you've purchased our commercial Package Manager product at any level. Public Package Manager is just a scaled-down version that includes the public mirrors as well as the binary packages and snapshots. The commercial offerings add things like support for bringing your own internally built packages, additional package security features, and a lot more administrative governance features on top of that.

OK, and this question is not on the list, but it just came to me: I was wondering about PyPI support. PyPI, we have support as well. PyPI and Bioconductor now, and as I said, snapshots for Bioconductor are coming later this week in an update. For PyPI, we have snapshots just the same way we have for CRAN. We don't build our own binaries for PyPI, though, mainly because binary wheels are already available on PyPI for most things, so there's not a lot of added value. Plus, there are like 400,000 projects on PyPI, and that's a lot of work to build a lot of packages.

OK, well, thank you, Joe Roberts.