How I Learned to Stop Worrying and Love Public Packages - posit::conf(2023)

Transcript

This transcript was generated automatically and may contain errors.

Hi, I'm Joe Roberts. I'm a product manager here at Posit, and, appropriately enough, I manage the Posit Package Manager product. One of the biggest questions I get asked in my role, particularly by IT administrators or infosec teams, is: how can I be sure that the public packages my data scientists want to use are actually safe to use?

It's a very good question, and one that we'll dig into a bit today. But I realize that a lot of you aren't IT administrators. There are a lot of data scientists in the room who already know and love public packages, so you don't need to be convinced of this. So we're going to change the topic for you: imagine this is how to convince your infosec team to let you use the public packages that you need. Hopefully I'll give you some tools to actually have that conversation when they come and say, hey, you can't use these anymore because they're a security risk.

So what I want to go over today: we'll talk a bit about the public package repositories, because it's important to understand how they work and how they differ in order to have a good sense of what the security risks are. We'll then talk about those types of security risks, at least some of the more common ones. And then, of course, how to mitigate those risks using tools like Package Manager. But even if you're not using Package Manager, there are some tips and processes you can use with your open source tools or whatever you're using for package management, and we'll see how that all works.

TensorFlow, to be fair, has 600,000 downloads a day from PyPI. When there are 600,000 downloads of this a day, I don't need to guess too many of these typos right, because some of those downloads are actually going to hit them.
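To make the near miss concrete, here is a minimal sketch of a check that flags a requested name that is not on a known list but looks almost identical to one that is. The package list, the function name `likely_typo_of`, and the 0.8 similarity cutoff are all assumptions for illustration, not part of pip, PyPI, or Package Manager.

```python
import difflib

# Hypothetical list of package names a team actually intends to use.
KNOWN_PACKAGES = ["tensorflow", "pandas", "requests"]

def likely_typo_of(name, known=KNOWN_PACKAGES, cutoff=0.8):
    """Return the known package that `name` closely resembles, or None.

    A name that is already on the known list is not treated as a typo.
    """
    if name in known:
        return None
    # get_close_matches ranks candidates by similarity ratio (0.0 to 1.0).
    matches = difflib.get_close_matches(name, known, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(likely_typo_of("tensorflwo"))  # tensorflow
print(likely_typo_of("tensorflow"))  # None
```

A check like this only catches slips against names you already know about, so it complements rather than replaces the other mitigations.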

And so how do we protect against this actually happening in the real world? Well, this is a tough one to mitigate, but one of the things to be aware of, as I said before, is that this is a higher risk on PyPI than on CRAN, so don't go out being terrified of everything. This is not about being scared. The big advantage here is that, because of the recent exploits, many in the Python community are watching for these malicious packages. So they do not stay on PyPI very long, and that's actually a good thing. Most of these are found within a matter of days, if not less.

And so one mitigation is just to avoid the latest packages and updates from public repositories. That sounds counterintuitive, but latest isn't always safest from a security standpoint. One of the ways you can leverage this in Package Manager, as we've talked about before, is repository snapshots. Pin your installations to a slightly older snapshot, and combine that with things like the package blocking that we talked about before. That gives a malicious package time to be found and blocked, and you're only sacrificing a little bit of the latest and greatest updates. You can always make an exception if there is some really important update to a package you trust that you want to go download. But also, just get it right and avoid your typos.
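The "slightly older" idea can be sketched as a cool-down rule: only consider releases that have been public for at least some number of days. This is an illustration of the principle, not how Package Manager snapshots are implemented. The release history, the function names, and the seven-day window are assumptions.

```python
from datetime import date, timedelta

# Hypothetical release history for one package: (version, upload date).
RELEASES = [
    ("1.4.0", date(2023, 8, 1)),
    ("1.4.1", date(2023, 9, 10)),
    ("1.5.0", date(2023, 9, 18)),
]

def version_key(version):
    """Turn '1.4.1' into (1, 4, 1) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))

def newest_settled_release(releases, today, cooldown_days=7):
    """Return the newest version uploaded at least `cooldown_days` ago, or None."""
    cutoff = today - timedelta(days=cooldown_days)
    settled = [version for version, uploaded in releases if uploaded <= cutoff]
    return max(settled, key=version_key, default=None)

# On 2023-09-20, version 1.5.0 is only two days old, so it is skipped.
print(newest_settled_release(RELEASES, date(2023, 9, 20)))  # 1.4.1
```

Setting `cooldown_days=0` for a single trusted package is one way to express the "make an exception" case.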

Package confusion: dependency confusion

And the last one I want to talk about is dependency confusion, which is probably one of the more sophisticated ways of doing this. You'll notice it's very similar to typosquatting, but let's start with our same empty environment here. This time we've got our own internal packages. Say my company does a lot of development of our own packages. We're here at Posit, so we created this Posit tools package that we don't intend to publish, because it's just got internal code for connecting to our internal databases and things like that. So we have this internal package. It could just be installed from somewhere, or we could be sophisticated enough to have our own internal, PyPI-like repository that we install from. Along with that, we want to get some packages from PyPI, of course, because we need to use pandas and requests or something with it.

We do a pip install. In this case, more commonly, we probably have an extra index URL rather than our own PyPI mirror. So we say: in addition to the full PyPI, I want to use our internal server for additional packages, and I want to install Posit tools. Great, we get Posit tools. Pip doesn't find it on the public repository, so it goes, oh, look at the extra index. We found it there, so we install it from there. Perfect. This is a happy way to work, an encouraged way to work with multiple package sources. And then I want pandas, for example, as well. Pip does the same thing: I found it on PyPI. I'm not looking in one place only, so we're fine.

Let's go back again. This time, with my nefarious hat on again, I'm going to guess that a company like Posit has a Posit tools package. You could be GSK, and I'd assume someone has a GSK tools package. I don't know, it could be anyone. Or, heaven forbid, you have a package called internal tools or something like that. So I put up another evil package, under the name I guess you might use for an internal package. I give it a very high version number, so it looks newer than everything else. And then, same thing: I want to upgrade my Posit tools package, or I could be installing it for the first time. Pip goes out and says, oh, you want to upgrade this one? I see a copy of this on PyPI. It's got a newer version than the one you have. Before I even look for your internal one, I'm going to take this nefarious one that I found on PyPI, and you've been exploited again.
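A simplified model of why the high version number wins: when an extra index is configured, pip gathers candidates from every index and has no notion of one index taking priority, so the highest version is chosen wherever it lives. The sketch below imitates that with plain dictionaries. The package name `posit-tools`, the version numbers, and the function names are hypothetical.

```python
# Hypothetical index contents: package name -> versions available.
PUBLIC_INDEX = {
    "pandas": ["2.0.3", "2.1.0"],
    "posit-tools": ["99.0.0"],  # the attacker's upload
}
INTERNAL_INDEX = {"posit-tools": ["1.2.0", "1.3.0"]}

def version_key(version):
    """Turn '1.3.0' into (1, 3, 0) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))

def resolve_highest_version(name, indexes):
    """Pool candidates from every index and return (version, index_label).

    No index is preferred; the highest version wins. Returns None if the
    package is not found anywhere.
    """
    candidates = [
        (version, label)
        for label, index in indexes.items()
        for version in index.get(name, [])
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda candidate: version_key(candidate[0]))

indexes = {"public": PUBLIC_INDEX, "internal": INTERNAL_INDEX}
print(resolve_highest_version("posit-tools", indexes))  # ('99.0.0', 'public')
```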

So there are solutions to this, and this one I love, because it's a very simple solution if you have Posit Package Manager: use a unified local and public repository. In Package Manager, you can take your internal packages, combine them with PyPI packages in a single repository, and put the preference on internal packages first. Then when I call pip install Posit tools using my internal PyPI mirror in Package Manager, pip asks Package Manager, which will always give back the internal package, no matter if anyone adds a new package by the same name on PyPI later on. So this completely insulates you from dependency confusion, which is great. And same thing if I'm installing pandas: Package Manager knows it's not in the internal source, so it looks in the public source and finds what I need.
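The internal-first preference can be sketched as a lookup that only falls through to the public source when the internal source does not have the name at all. This is a toy model of the idea as described, not Package Manager's actual implementation. The sources, the package name `posit-tools`, and the function names are assumptions.

```python
# Hypothetical sources: package name -> versions available.
PUBLIC_SOURCE = {
    "pandas": ["2.0.3", "2.1.0"],
    "posit-tools": ["99.0.0"],  # same-named package uploaded by an attacker
}
INTERNAL_SOURCE = {"posit-tools": ["1.2.0", "1.3.0"]}

def version_key(version):
    """Turn '1.3.0' into (1, 3, 0) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))

def resolve_internal_first(name, internal, public):
    """Return (version, source_label), checking the internal source first.

    If the internal source has the name, public versions are never
    considered, however high their version numbers are.
    """
    for label, source in (("internal", internal), ("public", public)):
        versions = source.get(name)
        if versions:
            return max(versions, key=version_key), label
    return None

print(resolve_internal_first("posit-tools", INTERNAL_SOURCE, PUBLIC_SOURCE))
# ('1.3.0', 'internal')
print(resolve_internal_first("pandas", INTERNAL_SOURCE, PUBLIC_SOURCE))
# ('2.1.0', 'public')
```

The design choice is that source order is decided before versions are compared, so a version number alone can never pull a package in from the public side.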

Takeaways

All that being said, I know that some of this sounds a little scary. Imagine you were on the infosec side, thinking: this is terrifying, that some simple mistake and a fairly unsophisticated attack like this could cause problems within our internal network. The takeaway is that public packages do present risks, but you don't need to be scared of them. Understanding these risks gives you the knowledge to manage them, and there are strategies and tools, like Package Manager or others, to help you reduce those risks. Even just understanding how this works, you can think of other ways to solve some of these problems.

And so you can have those conversations with your security teams and say, hey, don't just say no. Let's at least turn that no into a let's talk about what we can do to make this better. And that's hopefully what you can take away from this. Thank you so much.
