
How I Learned to Stop Worrying and Love Public Packages - posit::conf(2023)
Presented by Joe Roberts
The popularity of R and Python for data science is in no small part attributable to the vast collection of extension packages available for everything from common tasks like data cleaning to highly specialized, domain-specific functions. However, with that ease of sharing packages comes a larger target for bad actors trying to exploit them. We'll explore these security risks along with approaches you can take to mitigate them using Posit Package Manager.
Presented at Posit Conference, Sept 19-20, 2023. Learn more at posit.co/conference.
Talk Track: Managing packages. Session Code: TALK-1079
Transcript
This transcript was generated automatically and may contain errors.
Hi, I'm Joe Roberts. I'm a product manager here at Posit, and I manage the Posit Package Manager product, appropriately enough. One of the biggest questions I get asked in my role, particularly by IT administrators or infosec teams, is: how can I be sure that the public packages my data scientists want to use are actually safe to use?
It's a very good question, and one that we'll dig into a bit today. But I realize that a lot of you aren't IT administrators. There are a lot of data scientists in the room who already know and love public packages, so you don't need to be convinced of this. So we're going to change the topic for you. For you, imagine this is how to convince your infosec team to let you use the public packages that you need. Hopefully I'll give you some tools to actually have that conversation when they come and say, hey, you can't use these anymore because they're a security risk.
So here's what I want to go over today. We'll talk a bit about the public package repositories, because it's important to understand how they work and how they differ in order to have a good sense of what the security risks are. We'll then talk about those types of security risks, at least some of the more common ones. And then, of course, how to mitigate those risks using tools like Package Manager. But even if you're not using Package Manager, there are some tips and processes you can use with your open source tools or whatever you're using for package management, and we'll see how that all works.
Public package repositories
So first let's talk about public package repositories. The big ones that you're most likely to deal with as a data scientist are, of course, CRAN, the main R package repository. If you're a life scientist, you've got Bioconductor with more packages for R. And then on the Python side, you, of course, have the Python Package Index or PyPI.
First, I'll talk about CRAN and Bioconductor together, because they take a very similar approach to public repositories. If any of you have ever tried to submit a package to CRAN or Bioconductor, you know just how difficult that can be sometimes, and for good reason. In order to be accepted, a package has to pass a whole lot of checks, and it has to have good documentation, or at least complete documentation for all the functions in your package, all of these various things. Ultimately, if you think of it, it's kind of like that exclusive club with the bouncer at the door who's basically saying, no, you cannot get in here unless you look the part.
To be sure, they're not curating your content. They don't care if you can dance, but they do want to make sure you look the part to go in the club. PyPI, on the other hand, takes the opposite approach. It is your Wikipedia of packages. Anyone can submit anything to PyPI. You can go on right now, while we're sitting in this talk, create an account on PyPI, and upload a Python package that will instantly become available to pretty much the entire world. That has its advantages, because it makes for a much bigger community: there's a far greater number of packages available on PyPI. But that also introduces, of course, some different security risks that are more common there as well.
Security risks: package quality
Let's jump into security risks, because that's the interesting part. I've grouped some security risks together. One of the first things that a lot of people care about with packages is, how do I assess package quality? Or really, does this package that I'm seeing on the public repository do what it's supposed to do, do it accurately, and give me the right results? It's a hard question to answer, because we didn't write the package. Often we're trying to use packages in areas where we may not be domain experts ourselves. That's why we're getting a package that someone else has written to do it.
But there are some ways that we can actually get a sense of this. Things like the number of downloads. A more widely used package is probably that way for a reason, because people trust it. Author reputation. Did Hadley write the package? It may be OK, compared to someone who has just one package in the entirety of CRAN, where you don't really know who they are or what credentials they have. Not to say those are bad, but it's just one factor we factor in. Documentation. Good packages have good documentation. You know what the package does. You can actually read it and see what's going on there. A more interesting one is revision history. If a package has frequent updates and has been updated recently, that can give you an indication that someone actually cares about it and is working to fix the bugs and maintain it as we go. And of course, tests are always nice if they're in there, too, because they can also give you an indication that, hey, there are tests that show this is doing what it's supposed to do.
So how do we mitigate these? Ultimately, there is no substitute in package quality for a human review. Someone actually has to take a look at this and decide, is this package good enough to use? Often that falls on you as the individual data scientist who actually wants to use it, but there are some more systematic ways of doing it, and there are assessment tools to help with that. Our friends over at the R Validation Hub have created the riskmetric package for R, for example, which essentially gathers a lot of the metrics we just talked about, like number of downloads and various other attributes, and a whole lot more. You can then weight those however you want to decide whether a package is worth actually using.
Once you have done that, you can use tools like Posit Package Manager in your organization to create a curated repository that includes just the packages that you or your organization has deemed safe to use. Then your data scientists can just know that if it's in this repository, it's been vetted by our organization or by someone we trust to decide these things. That's what we call a bottom-up approach to security, where we start with nothing and include only the approved packages we want to use.
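The weighting idea described above can be sketched in a few lines of Python. This is only an illustration of combining quality signals into one score; it is not the riskmetric API, and the signal names, the 0.0 to 1.0 normalization, and the weights are all assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass
class PackageSignals:
    """Hypothetical quality signals, each already normalized to 0.0-1.0."""
    downloads: float
    author_reputation: float
    documentation: float
    recent_updates: float
    has_tests: float


# Illustrative weights only; an organization would choose its own.
WEIGHTS = {
    "downloads": 0.25,
    "author_reputation": 0.20,
    "documentation": 0.20,
    "recent_updates": 0.20,
    "has_tests": 0.15,
}


def quality_score(signals: PackageSignals) -> float:
    """Return the weighted average of the signals; higher is better."""
    total_weight = sum(WEIGHTS.values())
    weighted = sum(getattr(signals, name) * w for name, w in WEIGHTS.items())
    return weighted / total_weight


# Two made-up packages: one widely used and maintained, one obscure.
well_known = PackageSignals(0.9, 0.9, 0.8, 0.9, 1.0)
obscure = PackageSignals(0.1, 0.2, 0.4, 0.1, 0.0)
```

A score like this only ranks candidates for the human review mentioned above; it does not replace it.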
Security risks: vulnerabilities
Most talks on security risks jump straight to vulnerabilities, which are, of course, the more conventional security risks that you want to worry about. The nice thing about public package repositories is that a package published in one of them really is a known quantity. So you typically don't have to do what you would do on, say, a new development project, where you need to do conventional scanning of it. These public packages are known quantities that other people have already scanned. People are already looking at these, so if there are issues, they often come up very quickly and get reported.
These vulnerabilities could be in the source code, they could be in compiled code, or they could be in external libraries that the package links to. Sometimes they're malicious, but often they just show up due to poorly written code or plain mistakes. Because of all that scanning, there are vulnerability databases that have already looked at these packages. As long as you look up this package, this version, in those databases, you can get a good sense of whether there are any known vulnerabilities.
We here at Posit just started working with the R Consortium on a CRAN and Bioconductor Advisory Database. Again, due to that bouncer at the door of CRAN and Bioconductor, there aren't that many known vulnerabilities out there. But a few have come up, usually in external packages they link to. It's patterned very closely after the much larger Python Packaging Advisory Database for PyPI, which is maintained by the Python Packaging Authority. Both of those feed into the Open Source Vulnerabilities database, osv.dev, that Google started, which is a great place to start looking for any data science package, at least in R and Python. And once you have that, you can leverage a feature in Package Manager called Package Blocking, which lets you take a top-down approach to security. Instead of just approving the packages we want to use, we start by allowing anything in the public repository, but we block all of the versions of packages that have known vulnerabilities in the database, and we continue to update that as new vulnerabilities come out.
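Looking up "this package, this version" in a vulnerability database can be sketched as follows. This is a minimal example, assuming the public OSV query endpoint at api.osv.dev and its JSON request and response shape; the function names are invented for the sketch, and it is not how Package Manager implements Package Blocking.

```python
import json
import urllib.request

OSV_QUERY_URL = "https://api.osv.dev/v1/query"


def build_osv_query(name: str, version: str, ecosystem: str = "PyPI") -> dict:
    """Build the request body for one package version."""
    return {"version": version, "package": {"name": name, "ecosystem": ecosystem}}


def extract_ids(response_body: dict) -> list:
    """Pull advisory IDs out of a response; no "vulns" key means none known."""
    return [vuln["id"] for vuln in response_body.get("vulns", [])]


def known_vulnerability_ids(name: str, version: str, ecosystem: str = "PyPI",
                            timeout: float = 10.0) -> list:
    """Ask OSV which advisories affect this exact version (needs network)."""
    payload = json.dumps(build_osv_query(name, version, ecosystem)).encode("utf-8")
    request = urllib.request.Request(
        OSV_QUERY_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return extract_ids(json.load(response))


# Example call (performs a network request, so it is left commented out):
# ids = known_vulnerability_ids("jinja2", "2.4.1")
```

A version with a non-empty result is a candidate for a block list, which is the top-down approach described above.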
Package confusion: typosquatting
So those are two broad areas of security risk in the packages themselves. We're going to jump over now to a couple of areas that are more about the infrastructure used to deliver packages, which have seen, particularly on the Python side, some real-world exploits in the last year or more. I group these under package confusion, because they're really all about trying to deceive the user into installing the wrong package, not the one they're actually looking for. The two that we're going to talk about today are typosquatting and dependency confusion.
Let's start with typosquatting, and we have a little example. We're going to walk through the life of a package installation here. I'm going to talk about these in terms of Python and PyPI and using pip to install packages. That's not to say these can't happen on the CRAN side or the Bioconductor side, but again, due to the curation, it's much less likely there. And these are all ripped from the headlines, real-world examples that have happened in the Python community.
So we're a data scientist, starting with an empty environment. We have PyPI. We're going to do some machine learning, so we want to get the TensorFlow package, which is sitting over in PyPI. Great. We do pip install tensorflow, and pip goes out: oh, I'm going to look up TensorFlow on PyPI and grab it over to my local machine. Everything's perfect. That's what we want to happen.
Now, let's start again here. We've still got our PyPI, but now we're going to make a typo. Why do you call it typosquatting? Well, we have a typo. We all make mistakes once in a while, and I accidentally typed two Ns in TensorFlow when I did my pip install. Normally, what would happen? Pip goes out and says, PyPI, I want TensorFlow with two Ns. PyPI goes, well, I don't got one of those, and so pip returns an error. And actually, this is a good thing, because this is what you want to happen in those situations. You go back, you see your typo, you move on.
Now let's get a little more nefarious here. We're back with our PyPI. This time I'm wearing my black hat: I'm a malicious hacker trying to get into your system. I go and look at PyPI, and I go, hmm, which packages are very popular out here? Oh, TensorFlow. TensorFlow, to be fair, has 600,000 downloads a day from PyPI. So as a malicious user, I'm going to use that permissiveness of PyPI to upload a whole bunch of different variants of the name TensorFlow, with a whole bunch of typos that I could totally conceive of. When there are 600,000 downloads of this a day, I don't need to get too many of these right, because some of them are going to hit. Now, as a user, I come along with pip install TensorFlow with two Ns. Pip goes out and goes, oh, I actually see a package by that name, so I'm going to download that, and you've been exploited.
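One defensive check that follows from this scenario is to compare a requested name against names you already trust before installing. The sketch below uses Python's standard difflib for the comparison; the short list of popular packages and the 0.85 similarity cutoff are assumptions for illustration, and a real check would need a much larger list.

```python
import difflib
from typing import Optional

# Hypothetical allowlist of well-known names; deliberately tiny for the example.
POPULAR_PACKAGES = ["tensorflow", "pandas", "requests", "numpy"]


def possible_typo_of(requested: str, cutoff: float = 0.85) -> Optional[str]:
    """Return the popular name that `requested` closely resembles, if any.

    An exact match is fine and returns None. A near miss returns the
    popular name, which is a hint that the request may be a typo.
    """
    name = requested.lower()
    if name in POPULAR_PACKAGES:
        return None
    matches = difflib.get_close_matches(name, POPULAR_PACKAGES, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# "tennsorflow" (two Ns) is flagged as a likely typo of "tensorflow".
suspect = possible_typo_of("tennsorflow")
```

A near-miss result does not prove a package is malicious; it is only a prompt to stop and check the spelling before pip fetches anything.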
So how do we protect against this actually happening in the real world? Well, this is a tough one to mitigate, but one of the things to be aware of, as I said before, is that this is a higher risk on PyPI than on CRAN, so don't go out being terrified of everything. This is not about being scared. The big advantage here is that, because of the recent exploits, many in the Python community are watching for these malicious packages. So they do not stay on PyPI very long, and that's actually a good thing. Most of these are found within a matter of days, if not less.
So one mitigation is just to avoid the latest packages and updates from public repositories, which sounds counterintuitive, but latest isn't always safest from a security standpoint. One of the ways you can do this in Package Manager is with repository snapshots. Pin your installations to a slightly older snapshot, and combine that with things like the package blocking we talked about before. That gives a malicious package time to be found and blocked, and you're only sacrificing a little bit of the latest and greatest updates. You can always make an exception if there is some really important update to a package you trust that you want to go download. But also, just get it right and avoid your typos.
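The "latest isn't always safest" idea can be sketched as a cooldown rule: only consider releases that have been public for some minimum number of days. This is a simplified illustration of the reasoning, not how Package Manager snapshots work; the version strings, dates, and the 14-day default are made up for the example.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def newest_release_older_than(releases: dict, now: datetime,
                              min_age_days: int = 14) -> Optional[str]:
    """Pick the most recently uploaded version that is at least
    `min_age_days` old, or None if no release is old enough.

    `releases` maps a version string to its upload time.
    """
    cutoff = now - timedelta(days=min_age_days)
    eligible = {version: uploaded for version, uploaded in releases.items()
                if uploaded <= cutoff}
    if not eligible:
        return None
    return max(eligible, key=eligible.get)


# Hypothetical release history for one package.
UTC = timezone.utc
RELEASES = {
    "1.0.0": datetime(2023, 6, 1, tzinfo=UTC),
    "1.1.0": datetime(2023, 8, 15, tzinfo=UTC),
    "1.2.0": datetime(2023, 9, 19, tzinfo=UTC),  # uploaded yesterday
}
TODAY = datetime(2023, 9, 20, tzinfo=UTC)

chosen = newest_release_older_than(RELEASES, TODAY)
```

With these dates the day-old release is skipped in favor of the previous one, which leaves time for a malicious upload to be reported and blocked.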
Package confusion: dependency confusion
The last one I want to talk about is dependency confusion, which is probably one of the more sophisticated of these attacks. You'll notice it's very similar to typosquatting, but let's start with our same empty environment. This time we've got our own internal packages. Say my company does a lot of development of our own packages. We're here at Posit, so we created this Posit tools package that we don't intend to publish, because it's just got internal code for connecting to our internal databases and things like that. So we have this internal package. It could just be installed from somewhere, or we could be sophisticated enough to have our own internal PyPI-like repository that we install from. Along with that, we want to get some packages from PyPI, of course, because we need to use pandas and requests or something with it.
We do a pip install. In this case, more commonly, we probably have an extra index URL rather than our own PyPI mirror. So we say, in addition to the full PyPI, I want to use our internal server for additional packages, and I want to install Posit tools. Great. We get Posit tools. Pip doesn't find it on the public repository. It goes, oh, let me look at the extra index. Oh, we found it there, so we install it from there. Perfect. This is a happy way to work, an encouraged way to work with multiple package sources. And then I want pandas, for example, as well. Pip does the same thing. I found it on PyPI. I'm not limited to looking in one place, so we're fine.
Let's go back again. This time, with my nefarious hat on again, I'm going to guess that a company like Posit has a Posit tools package. You could be GSK, and I'd assume someone has a GSK tools package. I don't know, it could be anyone. Or heaven forbid you have a package called internal tools or something like that. So I put another evil package up on PyPI, under a name I guess you might use for an internal package. I give it a very high version number, so it looks newer than everything else. And then, same thing, I want to upgrade my Posit tools package, or I could be installing it for the first time. Pip goes out and goes, oh, you want to upgrade this one? I see a copy of this on PyPI. It's got a newer version than the one you have. Before I even look for your internal one, I'm going to take this one that I found, this nefarious one off PyPI, and you've been exploited again.
There's a solution to this one that I love, because it's very simple if you have Posit Package Manager: use a unified local and public repository. In Package Manager, you can take your internal packages, combine them with PyPI packages in a single repository, and give preference to the internal packages. Then when I call pip install for Posit tools, using the internal PyPI mirror that I have in Package Manager, pip asks Package Manager, which will always give back the internal package, no matter whether anyone later adds a package by the same name to PyPI. So this completely insulates you from this kind of dependency confusion, which is great. And likewise, if I'm installing pandas, Package Manager knows it's not in the internal source, so it looks in the public source and finds what I need.
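The attack can be reproduced with a toy model of version selection. The sketch assumes a simplified installer that pools candidates from every configured index and takes the highest version, which is the behavior the high version number is meant to exploit. The index names, the package name posit-tools, and the version tuples are all hypothetical.

```python
from typing import NamedTuple, Optional


class Candidate(NamedTuple):
    index: str      # which index offered this release
    version: tuple  # version as a tuple of ints, e.g. (1, 4, 0)


def pick_highest_version(name: str, indexes: dict) -> Optional[Candidate]:
    """Pool every index's copy of `name` and return the highest version.

    `indexes` maps an index name to a {package name: version tuple} dict.
    Nothing here prefers one index over another.
    """
    candidates = [
        Candidate(index_name, packages[name])
        for index_name, packages in indexes.items()
        if name in packages
    ]
    return max(candidates, key=lambda c: c.version, default=None)


INTERNAL = {"posit-tools": (1, 4, 0)}
PUBLIC_BEFORE_ATTACK = {"pandas": (2, 1, 0)}
# The attacker uploads the same name with a deliberately huge version.
PUBLIC_AFTER_ATTACK = {"pandas": (2, 1, 0), "posit-tools": (99, 0, 0)}

before = pick_highest_version(
    "posit-tools", {"public": PUBLIC_BEFORE_ATTACK, "internal": INTERNAL})
after = pick_highest_version(
    "posit-tools", {"public": PUBLIC_AFTER_ATTACK, "internal": INTERNAL})
```

Before the upload the internal copy wins because it is the only one; afterwards the public copy wins purely on version number.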
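The internal-first preference can be modeled in a few lines. This is a sketch of the idea as described in the talk, not Package Manager's actual implementation; the source names, the package name posit-tools, and the version tuples are hypothetical.

```python
from typing import Optional, Tuple


def resolve_internal_first(name: str, internal: dict,
                           public: dict) -> Optional[Tuple[str, tuple]]:
    """Serve `name` from the internal source whenever it exists there.

    The public source is consulted only for names the internal source
    does not have, so a public package with the same name and a higher
    version is never selected.
    """
    if name in internal:
        return ("internal", internal[name])
    if name in public:
        return ("public", public[name])
    return None


INTERNAL = {"posit-tools": (1, 4, 0)}
# Public source after an attacker has uploaded a same-named package.
PUBLIC = {"pandas": (2, 1, 0), "posit-tools": (99, 0, 0)}

tools = resolve_internal_first("posit-tools", INTERNAL, PUBLIC)
pandas_pkg = resolve_internal_first("pandas", INTERNAL, PUBLIC)
```

The design choice is that the name, not the version number, decides which source answers, which is what removes the attacker's lever.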
Takeaways
All that being said, I know that some of this sounds a little scary, especially if you imagine being on the infosec side and thinking, this is terrifying, that some simple mistake and a fairly unsophisticated attack like this could cause problems within our internal network. The takeaway is that public packages do present risks, but you don't need to be scared of them. Understanding these risks gives you the knowledge to manage them. There are strategies and tools like Package Manager to help you reduce those risks, and even just by understanding how this works, you can think of other ways to solve some of these same problems.
And so you can have those conversations with your security teams and say, hey, don't just say no. Let's at least turn that no into a let's talk about what we can do to make this better. And that's hopefully what you can take away from this. Thank you so much.
Q&A
Thank you, Joe. We do have a couple questions. So are there any scanners that you know of that do exist for R code itself that look for any malicious intent?
No, very, very few. Besides some common linters for R, which are just checking source code, there isn't much out there, commercial solutions included. As far as checking packages goes, most of what you see listed as so-called vulnerability scanners for R are doing little more than exactly what we're talking about here. They're just cataloging which package names and versions you're using and looking them up in the vulnerability databases. And frankly, for the most part, that's what the Python scanners are doing as well. There's a little bit more out there for Python scanning.
I had one that I was thinking of before: do we at Posit do any active curation on the Python side, looking for any kind of malicious packages out there and pulling them out of Package Manager?
Not currently. But between you and me, we're working on stuff.
Any other questions? Let's see if there's any more in the Slido. So there's nothing in the Slido. We do have one more minute if anyone has any questions.
It does not currently use the same CVE-type scoring. Most of the stuff that's gone in there can be linked to CVEs that do have a score, but there's not actually a scoring implemented in that OSV database.
