Resources

Can I trust that package? (Colin S Gillespie, Jumping Rivers) | posit::conf(2025)

Speaker(s): Colin S Gillespie

Abstract: We often forget, surrounded by hex stickers and bad R package puns, that not everyone is as trustworthy as us. This, I suppose, means that when IT asks, “Is this package valid, secure, and trustworthy?” it’s not that unreasonable a question. But this throws up multiple issues. There are thousands of R packages on CRAN, and that doesn’t include R-universe, Bioconductor, and GitHub. Packages are updated all the time, so how do we keep up? More to the point, what does valid, secure, and trustworthy even mean? In this talk, I’ll discuss the litmusverse: a suite of packages for assessing package risk. Importantly, it’s not one size fits all. Instead, it’s about defining your risk appetite and acting accordingly.


Transcript

This transcript was generated automatically and may contain errors.

I'm someone with a clicker that doesn't work, so we're off to a good start. I'm someone with a laptop that doesn't work, so it's even better. Right, about me: almost an ex-academic, so down to one day a week, which is just enough time for the university to forget that you exist, which is nice.

I'm the co-founder of Jumping Rivers. Hopefully it's the last day of the conference. You must have seen our absolutely massive booth, with banners and stickers and chocolate everywhere. So if you don't know who we are, please grab me, or go to the booth.

I wrote this book with Robin about a decade ago. Don't buy it, because it's now completely out of date. It's got a chapter on this new thing called dplyr, which you might have come across recently. But I did write that a long time ago.

Jumping Rivers, in case you're wondering, which of course you are, even though the name tells you everything. We do lots of Posit maintenance and support: Kubernetes, high availability, GxP, all the fun stuff that keeps you up at night. Lots of validated, assured R packages, with all the fun that comes with that. Data science consultancy: R, Python, anything in those spaces we do. So if you've got a Shiny app you need help with, or accessibility, because you're following guidelines and trying to build things properly, again, we can help you. And we do training on basically everything: we've got lots of courses in R, Python, Git, Stan, you name it, we've done it. So that's the summary.

Measuring package trust

So how do we measure package trust, right? If I gave you an R package and I said, do you think you would recommend it to your friends and family? You'd go, what friends and family do you think I have that want an R package? But ignoring that small thing, do you think you would recommend it to me, right? I'm sure we're friends. So what would you do?

Well, you'd look at it and you'd go, oh, it's made by Posit. So Posit are okay. I'll give them a bit of trust. So you'd go up there. So that seems good. But there's no recent commits. So you'd sort of go down a bit, right? Don't think it's too controversial, right? Hadley's not going to come and kill me. It's not on CRAN. So that's possibly, oh, well, okay. It's made by Posit, but it's not really been looked after and there's no recent commits.

That's tricky. It has been written by someone who can spell color correctly. So to me, that's all it needs: not only can you write R packages, you can spell color. So I'm good with that. It's got lots of stars. So we're in a good space here, right?

And then you find out it's ggvis. So for people that don't know what ggvis is, if you go and ask Hadley, he'll more than happily tell you all about ggvis. He made a promise in about 2016 that he was building this package with interactive graphics, and it never happened, right? So on the GitHub page, it does say, do not use this package, right?

The word rollercoaster is somewhat of an extreme set of emotions for this talk, but it's a rollercoaster of emotions, shall we say: Posit good, not good. And so you've got all these bits of information. How do you combine them into some sort of coherent score?

So package choice comes up all the time. We do lots of training, right? We've been doing training for a decade, and I've been teaching even longer. And we always get asked during training, can we trust this package? How do you know?

And so this is confession time, right? I'm going to confess, because clearly this isn't going on YouTube. For the last few years, our standard practice, and this has worked well, is you sort of wave your hands and you go, well, this is CRAN, and CRAN's really good. Grumpy, but they're really good, and they keep track of packages. So if the package is on CRAN, it's all fine. And then you go, so, what do you do for a living? And you change the conversation and move swiftly on. That does work as a strategy. People might walk out thinking they've been done, but, well, too late, right? Not the best approach, I would suggest.

Risk appetite across organizations

So, can we trust this R package? It depends on your risk appetite, okay? So on the left, and I've no idea what side of the arrow left means in this context, but on the ones about to appear on the left, we've got students who, let's face it, just don't care, right? They'll just install any old rubbish.

And slightly above students, we've got academics who really don't care either, right? They just install whatever's there and hope for the best. Again, they'll assess if it's correct, but, you know, that's it. And then we've got startups who care a bit more, but they're also absolutely skint, so they only care so much; you have to pull things apart.

We've got governments who want to care, but again are somewhat limited by funds, so they're trying to work through the practical things of: we need to make sure things are safe and secure, but we've got budget responsibility, and it's hard. And then at the other end, we've got finance and pharmaceutical companies, who take it to the extreme, right? Pharmaceutical companies: validated environments, GxP. This matters to them. They really have strict standards, okay?

So you've got this whole spectrum. And then you've got, I suppose, validation on demand. By validation on demand, it might be: you ask me, is dplyr okay? And I go, yes, it's fine. That's validation on demand, right? You ask someone, hopefully someone who knows what they're talking about, not just a random person, and they give you some sort of idea that this package is good. And then you might have validated environments. So me saying that dplyr is okay is not good enough. It has to be run in your system, to make sure it works in your system and it's been installed properly. So that's how I sort of see risk appetite.

And if I was to ask you, where are you on this graph? Anyone on the more students-and-startups side, who's going to be brave enough to put their hand up? There's more and more as you look around. Who's at the other end, where you actually need to care quite a lot about this?

And there's a whole bunch of people who have wandered into the wrong room, because they've not put their hand up. I would say that most organizations are on both sides. That's my firm belief. So even in a pharmaceutical company, you're doing GxP, validated runs, you really care about the FDA, but you also get the intern coming in for three months who wants to build a Shiny app. And those two people aren't the same.

You want a set of packages that this intern can use. They should still be safe and secure, but it's not FDA, right? You don't have to jump through every hoop. And you've probably got someone in the middle, who's not the intern, who's not validated, but who wants to do maybe research. So you want the statistics packages to at least give the right answer most of the time. Yeah, so you're on both sides.

How to assess a package

So, package risk. How would you assess a package? You could look at author fame, code coverage. You could look at the number of vignettes, source control, issue closure rate. These are all sensible things. If you're an organization, you might care about the license; again, if you're an academic: license, schmicense, you'll use anything. News, security: yeah, you might have a passing interest in making sure you're not going to get hacked. Downloads: some idea of popularity. Dependencies: does anyone else use it? So you've got all these things.

And I could say, well, what would your top three be? So have a two-second look at this list and get your top three in your head. I'm not going to ask for shouting out, in case nobody shouts out. We've found the top three for most organizations looks like this. License: it's got nothing to do with whether the package is any good, it's purely regulatory; they don't want to be using the wrong thing. Security: they really do care that you're not going to get hacked, and someone has to, I'm going to say tick a box, but not just in the bureaucracy way; they've actually thought about it and made sure it's correct. And code coverage, which is a big thing: you've got a package, do you have any unit tests? How sensible.

Introducing Litmus

So how do we go about this? We've been working on something called litmus, and essentially, it's based on riskmetric. If you know about the riskmetric package, litmus is sort of based on that idea and built on it. So we've got assessments that everybody in this room can agree on, right? You're never going to get a bunch of engineers and statisticians to agree on anything except this. We all agree that R CMD check is important, I hope. We all agree that security is important. We all agree that issues are important.

We all fundamentally disagree on how important they are, and we could all quibble about the fourth decimal place late into the night over lots of pints. And so then we've got a score, and everybody can have a different score, where we can put all those quibbles. So we're separating out the information from how we assess it, so we've got flexibility. Spoiler: even though everybody's really opinionated, tons of it doesn't really matter in the big picture. You change the overall score of the package, but it's more or less the same.


So that gives you an overall score, and then we also tend to pull out high-risk issues. Even if a package is wonderful, if it's got a security vulnerability, then it should still be flagged. Even if a package is absolutely great, if the license is not compatible, then that should be flagged. So you're trying to capture both. This is what we call litmus.
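The shape of this, separate per-metric scores combined by weights you can argue about, plus hard flags that surface regardless of the score, can be sketched as below. This is an illustrative sketch only: litmus itself is an R package suite, and the metric names, weights, and flag rules here are all invented for the example, not its real API.

```python
# Hypothetical sketch of litmus-style scoring: shared assessments,
# configurable weights, and high-risk flags that bypass the average.

def package_score(metrics, weights):
    """Combine per-metric scores (each in [0, 1]) into a weighted average."""
    total = sum(weights.values())
    return sum(metrics[name] * w for name, w in weights.items()) / total

def high_risk_flags(package):
    """Flags that surface no matter how good the overall score is."""
    flags = []
    if package["known_vulnerabilities"]:
        flags.append("security vulnerability")
    if package["license"] not in {"MIT", "GPL-2", "GPL-3", "Apache-2.0"}:
        flags.append("license not on the approved list")
    return flags

pkg = {
    "known_vulnerabilities": True,   # flagged even though the score is decent
    "license": "MIT",
    "metrics": {"r_cmd_check": 1.0, "security": 0.2, "issues": 0.8},
}
# Two teams can share the same assessments but disagree on the weights.
weights = {"r_cmd_check": 0.5, "security": 0.3, "issues": 0.2}

print(round(package_score(pkg["metrics"], weights), 2))  # 0.72
print(high_risk_flags(pkg))  # ['security vulnerability']
```

Changing the weights moves the score a little, but the vulnerability flag is reported either way, which is the "capture both" point above.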

Scoring packages

So how do you score a package? Hopefully you've got an idea: we get some metrics, we download some stuff, how many downloads, we calculate a score, we add them all together in some weighted average, we come up with a number. How would you go about scoring?

So, for example, let's make it a bit more concrete: vignettes. We use a binary score: does a package have one or not? And there are many hours of my life that I won't get back, spent thinking, well, but a package might have two vignettes or three vignettes, or it might have a vignette that's only one page, or four pages, or five pages. And it really just doesn't matter. Out of all the packages, most don't have vignettes at all, and those are the ones you care about. The rest are just, okay, two vignettes is better than one vignette, but you know. Again, you could change it, but that's getting into the mindset of what you're trying to capture.

And I told you we would disagree, you know. People will come up to me afterwards and say, oh, no, no, no, you should do that. Dependencies: how big is an imports list? How many dependencies does a package have, right? We could argue about that. There's a podcast that I listen to, it's called More or Less, great podcast, and it always asks the question: is that a big number? So if I said a package has got 10 dependencies, is that a big number? Is 10 big? It might be big. Is 20 big? What about two? I don't know.

So essentially, we just cheated and analyzed all CRAN packages. What we did is rank each package compared to everything else. So now we can figure out: is 10 a big number? If we go, well, 10 is a big number, but every single package on CRAN has more than 10, it's a bit of a waste of time, if we're honest. Again, we're trying to get some detail.

The first quartile is only one dependency. We stripped off R and the base packages, right? But 25% of packages don't have a single dependency once you get rid of the obvious ones. 50% have three dependencies or fewer, and the maximum is 49, which is quite a lot of dependencies.

And there's a little bit of survivorship bias, because CRAN can sometimes be strict, and essentially, if a package fails, it gets kicked off. So I'm guessing that any package with lots of dependencies from a long time ago has probably been removed if it's not updating. So that's why you've got that long tail. But essentially, we just map the ECDF, the empirical CDF, so we convert that to a value between zero and one. We do the same with code base size. Across all of CRAN, the median is 600 lines of code. The maximum is 100,000, which seems like a big number to me, going back to that question. And again, we just map it to the empirical distribution. I thought far too long about log transformations and everything else, whereas if you just rank it, things just sort of come out, and that's what you're trying to capture.
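The ECDF trick above is simple enough to sketch: a raw metric, here a dependency count, becomes the fraction of reference packages at or below that value, so it always lands in [0, 1] with no log transformations needed. The reference counts below are made up for illustration; litmus ranks against all of CRAN.

```python
# Sketch: map a raw metric to [0, 1] via the empirical CDF of that
# metric across a reference set of packages.
from bisect import bisect_right

def ecdf_score(value, reference):
    """Fraction of reference packages with a value <= this one."""
    ordered = sorted(reference)
    return bisect_right(ordered, value) / len(ordered)

# Hypothetical CRAN-wide dependency counts (R and base packages stripped off).
cran_dep_counts = [0, 0, 0, 1, 2, 3, 3, 5, 8, 49]

# "Is 10 a big number?" Here 10 sits at or above 90% of the reference set.
print(ecdf_score(10, cran_dep_counts))  # 0.9
print(ecdf_score(0, cran_dep_counts))   # 0.3
```

The same function works unchanged for code base size or downloads; only the reference vector differs.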

The Litmus dashboard

So, moving on. We work with lots of companies, and one of the problems is they set up R, they set up Connect, whatever, and, people in this audience will be stunned to realize this, then they don't touch it for the next four years. You've got this setup, and it just sits there. And that's similar with package validation: someone sat and thought about it once, but we've gone to a conference, and we're hearing about lots of cool things. You need to try and build in procedures for how to get new things into your organization, okay?

And so there's a URL, and I'll put the URL, or a QR code, at the end. There's a dashboard, essentially, just showing all the packages, and you can look at all these scores and figure it out. Just now it's static, with 1,000 packages, but in the next four to six weeks we're planning on all of CRAN, with a CI/CD job pushing things up to date, so it's all open.

So there's a dashboard overview of all the packages, okay? Here's 1,000. We've got a histogram: in this collection of 1,000 packages, we've got an overall score, and it ranges from 0.5 to one. It turns out that, whilst I've been a bit flippant about what packages we use, intuitively we do the right thing. We might not follow a process, but even an academic like me at times at least tries to make sure that the package is working and it's sensible. We don't just choose things at random, but we need to show a process.

We've got a score breakdown, okay? We split this up into documentation, code, maintenance, popularity, and then you can weight it however you'd like. In this analysis, code is weighted a bit more than documentation, because we think code's more important, but you can change it, okay? We've got those red lines: the high-risk flags. Of these 1,000 packages, there's one that's been flagged with a security vulnerability, okay? So then you might think: a security vulnerability, what do you want to do about it? And it's not that you can't use it; it's about someone going in and saying, I understand the risk, and I'm okay with that risk, or I'm not okay with that risk. So it's not 'we can't use it', it's 'should I use it', right?

You can then drill down into a package. You get all the scores and all the metrics that we've talked about. You can see what's good, what's bad. You can do a comparison with other packages: how does whisker compare? I have no idea what whisker does; it was on my laptop, so I must have installed it and run it at some point. And we can record decisions.

That's important if you're in an organization where you need to show what you're doing. So: I have chosen to use this package. There's a security vulnerability, but I'm okay with the risk. And you might think, you couldn't possibly do that, but that happens all the time. You've assessed the risk, and you think that risk is okay. But in many organizations, you have to make sure you can record that, to say, yes, Colin came along, he looked at it, and he said it's okay, because of these reasons, right? The reasons could be: it's a vulnerability, but the package is only used internally, it can't access anything external. Fine, that seems okay.

So, summary: a lot of thought goes a long way. The dashboard, the QR code, will take you to, the idea is, in the next maybe four to six weeks, all the packages on CRAN. So if you want to see a package, you can have a look, and help win that argument with IT, okay? We're starting to do sort of rapid GxP-type environments, so if you're in that space, please come and talk to me. We also do another service called diffify.com: go to that URL, type in any package you want, and it will show you what functions have changed between any two versions. You can mess about with that; it's just out there. So thank you very much, and happy to take questions.

Q&A

Thank you so much, Colin. It was very interesting to hear about all these different metrics. From the questions from Slido, we have, when a measure becomes a target, it ceases to be a good measure. As we move towards objective measures of packages, do you anticipate packages evolving to be ranked higher without necessarily improvements to overall quality?

So, yes. In some cases, so one of the metrics is: does it have a source URL? GitHub, GitLab, Atlassian, whatever you want. And if an author thought, oh, I'm getting a zero here because it's not in source control, I think we could all agree that's a good push, right? They're now hitting a target, but I think we can all agree it's a good target. And you go, oh, but they might then write tests that aren't very good. They could. But who could be bothered with that? In the grand scheme of things, could you really be bothered to go to the effort of getting 100% test coverage when anyone with more than two seconds' worth of looking at your code would go, that's rubbish, right? So it could theoretically happen, but I just don't fundamentally believe they will. And if it did, it would be spotted. Because, again, we don't just install packages at random. Whereas I was making flippant jokes, we do think: that package solves this problem, and it's been written by someone I respect, and I'm installing it and using it. We're doing it intuitively, and this is just formalizing that.

All right, lovely. This is a short question that I'll rephrase a little bit. Can we treat tidyverse as a single dependency?

So the way we do it is if some, so when we're working with pharmaceutical companies, if they say, we only want tidyverse, we go, that's fine. We work the dependencies of the dependencies. So that will end up being 100, 200 packages. And then it's, again, it's up to, you know, we would then say, you should think about all those packages, because they've all been used. But, you know, you can do what you want. Typically for a small analysis, you know, so a bit of data cleaning, a bit of ggplotting, a little bit of stats, you're ending up with about 500 packages from the start. You know, that's a sort of baseline and sort of hand-wavy things. All right, lovely. Colin, thank you very much. Thank you.