Jeroen Ooms | Monitoring health and impact of open-source projects

Transcript

This transcript was generated automatically and may contain errors.

Good morning. Let me start with a question. What do you think is the most important aspect of using R successfully? Is it being able to write high-performance code, or making beautiful graphics, or perhaps learning how to write R packages? Here's my take on it. I think the most important aspect of using R successfully is being able to find and choose the right tools for the job that you're working on. I think a lot of data science projects get stuck, or are not as successful as they could be, because they were built on the wrong tools. Once you've gone with a certain tool, it can be very hard to go back and fix that.

My name is Jeroen Ooms. I've written quite a few R packages, some of which you may have used. I'm also the maintainer of the compilers and the build infrastructure for R on Windows. I'm a staff member of the rOpenSci group at UC Berkeley.

Within rOpenSci, we're working on a big ambitious new project called R-universe. In essence, R-universe is an open platform based on Git for managing personal R package repositories. That means that once a package is registered, the system will automatically build binaries and documentation and so on directly from Git every time the package author pushes an update. These can be R packages that are also on CRAN or Bioconductor or just your personal R packages. That really doesn't matter as long as the source code is available from Git.

However, publishing packages is actually not the goal of this project. More importantly, it is a starting point for us to experiment with calculating metrics and showing other background information about a package that may tell us something about the health and the role and the impact of a particular project. Towards the end of this talk, I will briefly explain what R-universe does and where this is coming from. But mostly today, I want to talk about why we believe it is important to have better tooling to actively monitor the status and the health of research software. I think there are some important differences between using open source software and commercial software that are not always immediately obvious, in particular when it comes to expectations about the quality and the lifecycle of open source projects and the participation in them.

So, I hope to make you think a little bit about things you may want to consider when you're going to use open source software and some indicators that you may want to look for in a healthy project.

Choosing the right R packages

So, how do we discover and choose the best R packages? Because R is amazing and you can do so many cool things with it and there's very few rules or limitations. But this also makes our ecosystem sometimes feel a bit like the Wild West of research software, where it can be very hard to find the good stuff or to judge if something is trustworthy. Suppose you're a data scientist and you need to do a particular analysis or you need to read a particular file format or you need an HTTP client to interface to some open data web servers. How do you know what's out there and how do you know if it's any good? Maybe you find an article or a blog post that mentions an interesting package. So, you read the description and you maybe run an example and it seems to work. So, problem solved, right?

Well, as you probably know, anyone can create an R package. You just put your code into the right format and then you put it up on GitHub or you submit it to CRAN. Now, CRAN will actually check if the package can be installed on Windows and Mac and Linux and if it does not violate any policies. But that's about it. There's no gatekeeping or judgment on the contents of the R package. And this is what makes open source software so great because anyone can participate. But obviously, the quality of these contributions can vary.

And what is perhaps less obvious at first is that different package authors have very different ideas on what you can expect from them in terms of maintenance and support and participation. And these things become more relevant as you start using R more seriously. Now, of course, when you're just toying around for a homework assignment or a one-off problem, you just do whatever works and move on.

But if you start to really depend on a package, I think the dynamic changes a bit. Suppose you're really going to build on an R package in your dissertation research or your company's data pipeline. At some point, you should probably ask yourself some questions like who made this package? What's their background? Is this an expert? What other things have they made? And how can you trust the results are correct? Has there been any peer review or scientific publication to accompany this package? Is it reliable? Does it get tested a lot? Because if it works for one example, that doesn't always mean it's going to work for any data. And what if you have a question about the software or you find a bug? Is the maintainer going to be available to help you? Is the project still actively developed? And what are the maintenance expectations? Is the package even going to be around three years from now when you plan to finish your dissertation? And who else is using the software? Is this an established project? Is there a large user community where you can maybe ask questions?

So what you do not want is that you become fully dependent on a piece of software in your research or in your business. And then at some point, a problem arises and you run into a really bad bug or the thing is crashing or it's depending on some other thing that has been retired or it just doesn't work anymore on the latest version of Windows or Mac. And then it turns out there is no issue tracker. And then you contact the maintainer, but you get no response. Or worse, you get a response like, well, this package is part of a research project that no longer exists and I've since then transferred to another university and I don't have time to work on this anymore.

And this happens a lot. I've certainly been there. And it's good to remember that the package author really doesn't owe you anything. Most maintainers are very nice and very friendly and they're very interested in helping you while also using your feedback to improve their package. That is what open source software is all about. But sometimes you find out that some software that you were using may actually not be in very good shape.

The lifecycle of open-source software

I think a lot of this also comes down to the unique lifecycle of research software. All software has a lifecycle, but for open source projects, this is often somewhat unclear and unpredictable. Most projects start out as an experiment, and then a small fraction of those projects, if they work well and the maintainer has enough time to put into them, may gradually develop into mature, established pieces of software. But eventually one day all software becomes obsolete and it gets retired or replaced with something better. And in commercial projects, there's usually a license that states that this product is officially supported until at least 20 blah, blah, blah. But in open source projects, there is no such thing. Things are supported basically to the extent and for the duration that the author has an interest in maintaining that thing.

For example, a lot of research software only exists as a proof of concept to accompany a scientific paper. And once the thing has been published, the author has really no intention of ever touching the software again. So if you want to actually build on that software, then you're basically on your own. But many R packages these days are the exact opposite: they are not published for their scientific merit, but they do something very useful. And it is in the interest of the package author to get as many people as possible to use the thing to maximize the impact of the work. So when we're using a piece of open source software, it can be helpful to think for a second about where in the lifecycle this project is. Is this an experiment or a mature project? Is it still actively being developed or is it something of the past?

Indicators of a healthy project

So how should we handle this? What are things you could look for when you're shopping for R packages? What are indicators that reveal something about the health of a project or that may give you a sense of what to expect? And what are red flags to avoid common pitfalls? I think we can roughly distinguish three categories of indicators. The first category of indicators are technical indicators. And those are things that we can measure relatively objectively. So for example, we can look at the dependency network and the CRAN homepage of a package already shows you the reverse dependencies of a package. And so those are other packages that depend on that package. And so that gives you a sense of if the package is trusted by other developers.

But just counting the number of reverse dependencies doesn't always tell you the whole story because some of these reverse dependencies are way more important than others. Or sometimes you find a package that is used by 10 other packages all by the same author, just like with scientific citations. And so an alternative metric that you could look at is to count the recursive reverse dependencies. Or you could weigh the reverse dependencies by their own relative importance. So you get something of a PageRank statistic.
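A minimal Python sketch of the two alternative metrics mentioned here, assuming a small hypothetical dependency graph. The package names, the damping factor of 0.85, and the rule for packages without dependencies are illustrative assumptions, not real CRAN data or the method used by any particular platform.

```python
# Hypothetical toy data: package -> packages it depends on.
# Every dependency must also appear as a key.
DEPENDS_ON = {
    "webclient": [],
    "framework": ["webclient"],
    "app_a": ["framework"],
    "app_b": ["framework"],
    "legacy": [],
    "tool_x": ["legacy"],
    "tool_y": ["legacy"],
}


def reverse_index(depends_on):
    """Map each package to the set of packages that depend on it directly."""
    rev = {pkg: set() for pkg in depends_on}
    for pkg, deps in depends_on.items():
        for dep in deps:
            rev.setdefault(dep, set()).add(pkg)
    return rev


def recursive_reverse_deps(pkg, rev):
    """All packages that depend on pkg directly or through other packages."""
    seen, stack = set(), [pkg]
    while stack:
        for parent in rev.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def pagerank(depends_on, damping=0.85, iterations=50):
    """Power iteration: each package passes its score to its dependencies."""
    pkgs = list(depends_on)
    n = len(pkgs)
    score = {p: 1.0 / n for p in pkgs}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in pkgs}
        for p in pkgs:
            # A package without dependencies spreads its score evenly.
            targets = depends_on[p] or pkgs
            share = damping * score[p] / len(targets)
            for t in targets:
                new[t] += share
        score = new
    return score


rev = reverse_index(DEPENDS_ON)
scores = pagerank(DEPENDS_ON)
for name in ("webclient", "legacy"):
    print(name,
          "direct:", len(rev[name]),
          "recursive:", len(recursive_reverse_deps(name, rev)),
          "pagerank:", round(scores[name], 3))
```

In this toy graph, "legacy" has more direct reverse dependencies than "webclient", but "webclient" comes out ahead on both the recursive count and the weighted score because a widely used framework is built on it.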

And besides dependency relations, other technical indicators that you could look at are download statistics or project activity in terms of releases or commit activity. And I think it's especially interesting to look at how these numbers change over time so that you get a sense of the lifecycle of a project to see if the project is gaining traction, or if it's on its way out, or if it has found a stable role within the ecosystem.
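One simple way to turn "how these numbers change over time" into something measurable is to fit a straight line through a series of counts and look at its slope relative to the average level. The sketch below assumes that approach; the function names, the 2% tolerance, and the yearly counts are all hypothetical and chosen only for illustration.

```python
def trend_slope(counts):
    """Least-squares slope of counts against their index (0, 1, 2, ...)."""
    n = len(counts)
    if n < 2:
        raise ValueError("need at least two observations")
    mean_x = (n - 1) / 2
    mean_y = sum(counts) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(counts))
    var = sum((x - mean_x) ** 2 for x in range(n))
    return cov / var


def classify(counts, tolerance=0.02):
    """Label a series by its slope relative to its mean level."""
    mean_y = sum(counts) / len(counts)
    if mean_y == 0:
        return "stable"
    relative = trend_slope(counts) / mean_y
    if relative > tolerance:
        return "gaining traction"
    if relative < -tolerance:
        return "on its way out"
    return "stable"


# Hypothetical yearly download counts, not real data.
rising = [100, 150, 220, 300, 410]
flat = [200, 205, 198, 202, 200]
falling = [400, 330, 250, 180, 90]

for series in (rising, flat, falling):
    print(series, "->", classify(series))
```

A linear fit is a crude summary; it ignores seasonality and sudden jumps, so it is best read next to the raw numbers rather than instead of them.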

Let's actually look at an example that's close to my heart. And this is the curl package. So if you need a web client in R, you probably need something with curl bindings. But there's actually two CRAN packages that can do this. One is called RCurl and the other one is just called curl. So which one should you use? Well, if you look at the CRAN homepage, you find a very similar description and even a similar number of reverse dependencies. And you can confirm this in R as well. But if we start looking at the recursive reverse dependencies, we can see a much bigger difference. And the reason for this is that the curl package is used as the foundation for some important web frameworks in R, such as the httr package.

And also, if we look at the reverse dependencies over time, we can see that the use of curl has been steadily rising over the past few years, whereas the number of packages that use RCurl has been pretty stale. And we can see a similar trend if we look at the download statistics for these packages over the years. And the story here is that the RCurl package is very old. It has been on CRAN since 2004 and has a paper about it from 2006. And it was some very pioneering work at the time. However, a lot has changed since then. And the package has some very fundamental problems. And the maintainer of RCurl is not very active anymore. So we wrote the curl package specifically as a replacement for RCurl with a modern design to be simpler and more robust. And it works very well. And most users have switched over to this new package since then.

But the old RCurl package is still around. However, it's not in very good shape. And if you use it today, you quickly start running into some of these problems. So I think this is one example where metrics can help you make a more informed decision and potentially save you a lot of problems down the line.

But of course, these technical metrics don't always work. For example, many R packages are very niche and they're useful only to a very particular specialized group of people. So the package may not have many downloads or reverse dependencies. But if this is something in your field, it may be exactly what you need.

And the second category of indicators I want to talk about are social indicators. I consider these to be all the things that have to do something with the people behind the project and the way in which the development and the participation gets organized. And in my experience, these are by far the most important things to look for in a healthy open source project. But these are not easy to quantify. A lot of it just comes down to getting to know the community a little bit.

So open source projects are, for better or worse, most of the time maintained by one or two people. And there's this fantastic book by Nadia Eghbal from earlier this year that I really like that talks about the reality behind open source development and why this is the case and the consequences that this has. And this is again one of those big differences with commercial software. A lot of people are used to thinking of software as written by anonymous engineers from a big company. And that is just not how it works for us. Open source software is written by people like you and me. And you should think of R packages more like a scientific publication or a piece of work by a local artist or a musician. It really matters who is making that thing.

So the most important social indicator is basically who is the author, what organizations they are part of, what people they collaborate with a lot, if they're still active or not, and what they're currently working on. And based on the type of package, you may care about their formal qualifications and if they have an online presence like a blog or social media. Another social indicator is basically how the project is managed. Is there a public place for reporting bugs and do these typically get resolved? Is the project open to contributions from external people, and what does that process look like?

To me, this is very important. I believe every R package should have a public place where everybody can post their issues so you can see what problems other users are running into and people can help each other out. But let me be clear that you should not expect every R package author to be a full-time tech support person. Answering all these questions and reviewing suggestions can be very exhausting and sometimes you just don't have time to work on a project for a few days or a few weeks or a few months. But for me, the important thing is transparency. If I have no way of seeing what are the common problems or questions with a package, that is a red flag. There may be some exceptions if the package is relatively small and maintained by a professor that's really the expert in the field. It may be okay. But generally speaking, I think a functional issue tracker is really a minimal requirement for a healthy open source project.

And finally, other social indicators you could think of are available resources for learning about the package. For example, if there's good documentation either by the package author themselves or by users writing blog posts. But for example, also if there's many questions and answers on Stack Overflow, that may be a great source of information and a sign that other people are using this package.

And then there's a third category of indicators and these are perhaps more specific to R because we have a lot of research software. And these are scientific indicators. So for packages that are specifically implementing some scientific method, you are probably interested in the validity of those results. And for example, if the package is fitting some model, you want to be sure that the estimates are accurate and that the package is handling all edge cases properly. And this is of course very difficult to judge, but there's some things we can look at. For example, some packages have gone through a peer review process, for example in the Journal of Statistical Software or at rOpenSci. That's a good sign. It's not a guarantee that the package will be in good standing forever, but at least the author has gone through an effort at some point to have a colleague have a critical look at the code.

And of course, we can look at citations of packages in scientific publications. This is very tricky because many researchers don't cite software and if they cite, they probably just cite R or tidyverse and not the individual packages. But what is interesting is that many journals these days are starting to require researchers to include the analysis code with their publication for reproducibility purposes. And so you can imagine if a lot of code becomes available this way, we can start to analyze some of this code to see which R packages are commonly used in scientific research. So I think the scientific quality of R packages is by far the most difficult to judge, but it's very important.
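The idea of analyzing published analysis code to see which packages it uses can be sketched roughly as follows. This is a simplified illustration, assuming the R scripts are already available as strings; the regular expressions, function names, and sample scripts are hypothetical, and the comment stripping is deliberately naive (it would also cut at a `#` inside a string literal).

```python
import re
from collections import Counter

# Matches library(pkg), require(pkg) and requireNamespace("pkg").
ATTACH = re.compile(
    r"""\b(?:library|require|requireNamespace)\(\s*["']?([A-Za-z][A-Za-z0-9.]*)"""
)
# Matches namespaced calls such as pkg::fun or pkg:::fun.
NAMESPACE = re.compile(r"\b([A-Za-z][A-Za-z0-9.]*):::?[A-Za-z._]")


def packages_used(r_source):
    """Return the set of package names referenced in one R script."""
    # Drop everything after '#' on each line (naive comment stripping).
    code = "\n".join(line.split("#", 1)[0] for line in r_source.splitlines())
    return set(ATTACH.findall(code)) | set(NAMESPACE.findall(code))


def usage_counts(scripts):
    """Count in how many scripts each package appears."""
    counts = Counter()
    for source in scripts:
        counts.update(packages_used(source))
    return counts


# Hypothetical example scripts, not taken from real publications.
scripts = [
    'library(httr)\nx <- curl::curl_fetch_memory("https://example.org")\n',
    'require("curl")\n# library(RCurl) is commented out\n',
    "library(httr)\n",
]

for pkg, n in usage_counts(scripts).most_common():
    print(pkg, n)
```

Counting scripts rather than individual calls keeps one heavy user of a package from dominating the numbers, which is a design choice in this sketch rather than an established convention.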