Resources

Can I trust that package? (Colin S Gillespie, Jumping Rivers) | posit::conf(2025)

Speaker(s): Colin S Gillespie

Abstract: We often forget, surrounded by hex stickers and bad R package puns, that not everyone is as trustworthy as us. This, I suppose, means that when IT asks, “Is this package valid, secure, and trustworthy?” it’s not that unreasonable a question. But this throws up multiple issues. There are thousands of R packages on CRAN, and that doesn’t include R-universe, Bioconductor, and GitHub. Packages are updated all the time, so how do we keep up? More to the point, what does valid, secure, and trustworthy even mean? In this talk, I’ll discuss the litmusverse: a suite of packages for assessing package risk. Importantly, it’s not one size fits all. Instead, it’s about defining your risk appetite and acting accordingly.


Transcript

This transcript was generated automatically and may contain errors.

I'm someone with a clicker that doesn't work, so we're off to a good start. I'm someone with a laptop that doesn't work, so it's even better. Right, about me: almost an ex-academic, so down to one day a week, which is just enough time for the university to forget that you exist, which is nice.

I'm the co-founder of Jumping Rivers. Hopefully it's the last day of the conference. You must have seen our absolutely massive booth, with banners and stickers and chocolate everywhere. So if you don't know who we are, please grab me, or go to the booth.

I wrote this book with Robin about a decade ago. Don't buy it, because it's now completely out of date. It's got a chapter on this new thing called dplyr, which you might have come across recently. But I did write that a long time ago.

Jumping Rivers, in case you're wondering, which of course you are, even though the name tells you everything. We do lots of Posit maintenance and support: Kubernetes, high availability, GxP, all the fun stuff that keeps you up at night. Lots of validated, assured R packages, with all the fun that comes with that. Data science consultancy: R, Python, anything in those spaces we do. So if you've got a Shiny app you need help with, or accessibility, because you're following guidelines and trying to build things properly, again, we can help you. And we do training on basically everything: we've got lots of courses in R, Python, Git, Stan, you name it, we've done it. So that's the summary.

Measuring package trust

So how do we measure package trust, right? If I gave you an R package and I said, do you think you would recommend it to your friends and family? You'd go, what friends and family do you think I have that want an R package? But ignoring that small thing, do you think you would recommend it to me, right? I'm sure we're friends. So what would you do?

Well, you'd look at it and you'd go, oh, it's made by Posit. So Posit are okay. I'll give them a bit of trust. So you'd go up there. So that seems good. But there's no recent commits. So you'd sort of go down a bit, right? Don't think it's too controversial, right? Hadley's not going to come and kill me. It's not on CRAN. So that's possibly, oh, well, okay. It's made by Posit, but it's not really been looked after and there's no recent commits.

That's tricky. It has been written by someone who can spell color correctly. So to me, that's all it needs: not only can you write R packages, you can spell color. So I'm good with that. It's got lots of stars. So we're in a good space here, right?

And then you find out it's ggvis. So for people that don't know what ggvis is, if you go and ask Hadley, he'll more than happily tell you all about ggvis. He made a promise in about 2016 that he was building this package with interactive graphics, and it never happened, right? So on the GitHub page, it does say, do not use this package, right?

The word rollercoaster is somewhat of an extreme set of emotions for this talk, but it's a rollercoaster of emotions, shall we say: Posit good, not good. And so you've got all these bits of information. How do you combine them into some sort of coherent score?

So package choice comes up all the time. We do lots of training, right? We've been doing training for a decade, and I've been teaching even longer. And we always get asked during training, can we trust this package? How do you know?

And so this is confession time, right? I'm going to confess, because clearly this isn't going on YouTube. For the last few years, our standard practice, and this has worked well, is you sort of wave your hands and you go, well, this is CRAN, and CRAN's really good. Grumpy, but they're really good, and they keep track of packages. So if the package is on CRAN, it's all fine. And then you go, so, what do you do for a living? And you change the conversation and move swiftly on. That does work as a strategy. People might walk out thinking they've been done, but, well, too late, right? Not the best approach, I would suggest.

Risk appetite across organizations

So, can we trust this R package? It depends on your risk appetite, okay? So on the left, and I've no idea what side of the arrow left means in this context, but on the ones about to appear on the left, we've got students who, let's face it, just don't care, right? They'll just install any old rubbish.

And slightly above students, we've got academics who really don't care either, right? They just install whatever's there and hope for the best. Again, they'll assess if it's correct, but, you know, that's it. And then we've got startups who care a bit more, but they're also absolutely skint, so they only care so much; you have to pull things apart.

We've got governments who want to care, but again are somewhat limited by funds, so they're trying to work through the practical things of: we need to make sure things are safe and secure, but we've got budget responsibility, and it's hard. And then at the other end, we've got finance and pharmaceutical companies, who take it to the extreme, right? Pharmaceutical companies: validated environments, GxP. This matters to them. They really have strict standards, okay?

So you've got this whole spectrum. And then you've got, I suppose, validation on demand. By validation on demand, it might be: you ask me, is dplyr okay? And I go, yes, it's fine. That's validation on demand, right? You ask someone, hopefully someone who knows what they're talking about, not just a random person, and they give you some sort of idea that this package is good. And then you might have validated environments. So me saying that dplyr is okay is not good enough. It has to be run in your system, to make sure it works in your system and it's been installed properly. So that's how I sort of see risk appetite.

And if I was to ask you, where are you on this graph? Anyone on the more students-and-startups side, who's going to be brave enough to put their hand up? There's more and more as you look around. Who's at the other end, where you actually need to care quite a lot about this?

And there's a whole bunch of people who have wandered into the wrong room, because they've not put their hand up. I would say that most organizations are on both sides. That's my firm belief. So even in a pharmaceutical company, you're doing GxP, validated runs, you really care about the FDA, but you also get the intern coming in for three months who wants to build a Shiny app. And those two people aren't the same.

You want a set of packages that this intern can use. They should still be safe and secure, but it's not FDA, right? You don't have to jump through every hoop. And you've probably got someone in the middle, who's not the intern, who's not validated, but who wants to do maybe research. So you want the statistics packages to at least give the right answer most of the time. Yeah, so you're on both sides.

How to assess a package

So, package risk. How would you assess a package? You could look at author fame, code coverage. You could look at the number of vignettes, source control, issue closure rate. These are all sensible things. If you're an organization, you might care about the license; again, if you're an academic: license, schmicense, you'll use anything. News, security: yeah, you might have a passing interest in making sure you're not going to get hacked. Downloads: some idea of popularity. Dependencies: does anyone else use it? So you've got all these things.

And I could say, well, what would your top three be? So have a two-second look at this list and get your top three in your head. I'm not going to ask for shouting out, in case nobody shouts out. We've found the top three for most organizations looks like this. License: it's got nothing to do with whether the package is any good, it's purely regulatory; they don't want to be using the wrong thing. Security: they really do care that you're not going to get hacked, and someone has to, I'm going to say tick a box, but not just in the bureaucracy way; they've actually thought about it and made sure it's correct. And code coverage, which is a big thing: you've got a package, do you have any unit tests? How sensible.

Introducing Litmus

So how do we go about this? We've been working on something called litmus, and essentially, it's based on riskmetric. If you know about the riskmetric package, litmus is sort of based on that idea and built on it. So we've got assessments that everybody in this room can agree on, right? You're never going to get a bunch of engineers and statisticians to agree on anything except this. We all agree that R CMD check is important, I hope. We all agree that security is important. We all agree that issues are important.

We all fundamentally disagree on how important they are, and we could all quibble about the fourth decimal place late into the night over lots of pints. And so then we've got a score, and everybody can have a different score, where we can put all those quibbles. So we're separating out the information from how we assess it, so we've got flexibility. Spoiler: even though everybody's really opinionated, tons of it doesn't really matter in the big picture. You change the overall score of the package, but it's more or less the same.


So that gives you an overall score, and then we also tend to pull out high-risk issues. Even if a package is wonderful, if it's got a security vulnerability, then it should still be flagged. Even if a package is absolutely great, if the license is not compatible, then that should be flagged. So you're trying to capture both. This is what we call litmus.
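The shape of this, separate per-metric scores combined by weights you can argue about, plus hard flags that surface regardless of the score, can be sketched as below. This is an illustrative sketch only: litmus itself is an R package suite, and the metric names, weights, and flag rules here are all invented for the example, not its real API.

```python
# Hypothetical sketch of litmus-style scoring: shared assessments,
# configurable weights, and high-risk flags that bypass the average.

def package_score(metrics, weights):
    """Combine per-metric scores (each in [0, 1]) into a weighted average."""
    total = sum(weights.values())
    return sum(metrics[name] * w for name, w in weights.items()) / total

def high_risk_flags(package):
    """Flags that surface no matter how good the overall score is."""
    flags = []
    if package["known_vulnerabilities"]:
        flags.append("security vulnerability")
    if package["license"] not in {"MIT", "GPL-2", "GPL-3", "Apache-2.0"}:
        flags.append("license not on the approved list")
    return flags

pkg = {
    "known_vulnerabilities": True,   # flagged even though the score is decent
    "license": "MIT",
    "metrics": {"r_cmd_check": 1.0, "security": 0.2, "issues": 0.8},
}
# Two teams can share the same assessments but disagree on the weights.
weights = {"r_cmd_check": 0.5, "security": 0.3, "issues": 0.2}

print(round(package_score(pkg["metrics"], weights), 2))  # 0.72
print(high_risk_flags(pkg))  # ['security vulnerability']
```

Changing the weights moves the score a little, but the vulnerability flag is reported either way, which is the "capture both" point above.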

Scoring packages

So how do you score a package? Hopefully you've got an idea: we get some metrics, we download some stuff, how many downloads, we calculate a score, we add them all together in some weighted average, we come up with a number. How would you go about scoring?

So, for example, let's make it a bit more concrete: vignettes. We use a binary score: does a package have one or not? And there are many hours of my life that I won't get back, spent thinking, well, but a package might have two vignettes or three vignettes, or it might have a vignette that's only one page, or four pages, or five pages. And it really just doesn't matter. Out of all the packages, most don't have vignettes at all, and those are the ones you care about. The rest are just, okay, two vignettes is better than one vignette, but you know. Again, you could change it, but that's getting into the mindset of what you're trying to capture.

And I told you we would disagree, you know. People will come up to me afterwards and say, oh, no, no, no, you should do that. Dependencies: how big is an imports list? How many dependencies does a package have, right? We could argue about that. There's a podcast that I listen to, it's called More or Less, great podcast, and it always asks the question: is that a big number? So if I said a package has got 10 dependencies, is that a big number? Is 10 big? It might be big. Is 20 big? What about two? I don't know.

So essentially, we just cheated and analyzed all CRAN packages. What we did is rank each package compared to everything else. So now we can figure out: is 10 a big number? If we go, well, 10 is a big number, but every single package on CRAN has more than 10, it's a bit of a waste of time, if we're honest. Again, we're trying to get some detail.

The first quartile is only one dependency. We stripped off R and the base packages, right? But 25% of packages don't have a single dependency once you get rid of the obvious ones. 50% have three dependencies or fewer, and the maximum is 49, which is quite a lot of dependencies.

And there's a little bit of survivorship bias, because CRAN can sometimes be strict, and essentially, if a package fails, it gets kicked off. So I'm guessing that any package with lots of dependencies from a long time ago has probably been removed if it's not updating. So that's why you've got that long tail. But essentially, we just map the ECDF, the empirical CDF, so we convert that to a value between zero and one. We do the same with code base size. Across all of CRAN, the median is 600 lines of code. The maximum is 100,000, which seems like a big number to me, going back to that question. And again, we just map it to the empirical distribution. I thought far too long about log transformations and everything else, whereas if you just rank it, things just sort of come out, and that's what you're trying to capture.
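The ECDF trick above is simple enough to sketch: a raw metric, here a dependency count, becomes the fraction of reference packages at or below that value, so it always lands in [0, 1] with no log transformations needed. The reference counts below are made up for illustration; litmus ranks against all of CRAN.

```python
# Sketch: map a raw metric to [0, 1] via the empirical CDF of that
# metric across a reference set of packages.
from bisect import bisect_right

def ecdf_score(value, reference):
    """Fraction of reference packages with a value <= this one."""
    ordered = sorted(reference)
    return bisect_right(ordered, value) / len(ordered)

# Hypothetical CRAN-wide dependency counts (R and base packages stripped off).
cran_dep_counts = [0, 0, 0, 1, 2, 3, 3, 5, 8, 49]

# "Is 10 a big number?" Here 10 sits at or above 90% of the reference set.
print(ecdf_score(10, cran_dep_counts))  # 0.9
print(ecdf_score(0, cran_dep_counts))   # 0.3
```

The same function works unchanged for code base size or downloads; only the reference vector differs.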

The Litmus dashboard

So, moving on. We work with lots of companies, and one of the problems is they set up R, they set up Connect, whatever, and, people in this audience will be stunned to realize this, then they don't touch it for the next four years. You've got this setup, and it just sits there. And that's similar with package validation: someone sat and thought about it once, but we've gone to a conference, and we're hearing about lots of cool things. You need to try and build in procedures for how to get new things into your organization, okay?

And so there's a URL, and I'll put the URL, or a QR code, at the end. There's a dashboard, essentially, just showing all the packages, and you can look at all these scores and figure it out. Just now it's static, with 1,000 packages, but in the next four to six weeks we're planning on all of CRAN, with a CI/CD job pushing things up to date, so it's all open.

So there's a dashboard overview of all the packages, okay? Here's 1,000. We've got a histogram: in this collection of 1,000 packages, we've got an overall score, and it ranges from 0.5 to one. It turns out that, whilst I've been a bit flippant about what packages we use, intuitively we do the right thing. We might not follow a process, but even an academic like me at times at least tries to make sure that the package is working and it's sensible. We don't just choose things at random, but we need to show a process.

We've got a score breakdown, okay? We split this up into documentation, code, maintenance, popularity, and then you can weight it however you'd like. In this analysis, code is weighted a bit more than documentation, because we think code's more important, but you can change it, okay? We've got those red lines: the high-risk flags. Of these 1,000 packages, there's one that's been flagged with a security vulnerability, okay? So then you might think: a security vulnerability, what do you want to do about it? And it's not that you can't use it; it's about someone going in and saying, I understand the risk, and I'm okay with that risk, or I'm not okay with that risk. So it's not 'we can't use it', it's 'should I use it', right?

You can then drill down into a package. You get all the scores and all the metrics that we've talked about. You can see what's good, what's bad. You can do a comparison with other packages: how does whisker compare? I have no idea what whisker does; it was on my laptop, so I must have installed it and run it at some point. And we can record decisions.

That's important if you're in an organization where you need to show what you're doing. So: I have chosen to use this package. There's a security vulnerability, but I'm okay with the risk. And you might think, you couldn't possibly do that, but that happens all the time. You've assessed the risk, and you think that risk is okay. But in many organizations, you have to make sure you can record that, to say, yes, Colin came along, he looked at it, and he said it's okay, because of these reasons, right? The reasons could be: it's a vulnerability, but the package is only used internally, it can't access anything external. Fine, that seems okay.

So, summary: a lot of thought goes a long way. The dashboard, the QR code, will take you to, the idea is, in the next maybe four to six weeks, all the packages on CRAN. So if you want to see a package, you can have a look, and help win that argument with IT, okay? We're starting to do sort of rapid GxP-type environments, so if you're in that space, please come and talk to me. We also do another service called diffify.com: go to that URL, type in any package you want, and it will show you what functions have changed between any two versions. You can mess about with that; it's just out there. So thank you very much, and happy to take questions.

Q&A

Thank you so much, Colin. It was very interesting to hear about all these different metrics. From the questions from Slido, we have, when a measure becomes a target, it ceases to be a good measure. As we move towards objective measures of packages, do you anticipate packages evolving to be ranked higher without necessarily improvements to overall quality?

So, yes. In some cases, so one of the metrics is: does it have a source URL? GitHub, GitLab, Atlassian, whatever you want. And if an author thought, oh, I'm getting a zero here because it's not in source control, I think we could all agree that's a good push, right? They're now hitting a target, but I think we can all agree it's a good target. And you go, oh, but they might then write tests that aren't very good. They could. But who could be bothered with that? In the grand scheme of things, could you really be bothered to go to the effort of getting 100% test coverage when anyone with more than two seconds' worth of looking at your code would go, that's rubbish, right? So it could theoretically happen, but I just don't fundamentally believe they will. And if it did, it would be spotted. Because, again, we don't just install packages at random. Whereas I was making flippant jokes, we do think: that package solves this problem, and it's been written by someone I respect, and I'm installing it and using it. We're doing it intuitively, and this is just formalizing that.

All right, lovely. This is a short question that I'll rephrase a little bit. Can we treat tidyverse as a single dependency?

So the way we do it is if some, so when we're working with pharmaceutical companies, if they say, we only want tidyverse, we go, that's fine. We work the dependencies of the dependencies. So that will end up being 100, 200 packages. And then it's, again, it's up to, you know, we would then say, you should think about all those packages, because they've all been used. But, you know, you can do what you want. Typically for a small analysis, you know, so a bit of data cleaning, a bit of ggplotting, a little bit of stats, you're ending up with about 500 packages from the start. You know, that's a sort of baseline and sort of hand-wavy things. All right, lovely. Colin, thank you very much. Thank you.