
eRum 2018 - May 16 - Hannah Frick
Hannah Frick: Navigating the Wealth of R Packages
Transcript
This transcript was generated automatically and may contain errors.
Hello, I'm a data scientist at Mango Solutions, and as you all know, we have quite a lot of packages on CRAN: over 12,000 at the moment. It is brilliant to have that choice, but it can also be a little overwhelming, especially for people who are new to R. So, when you try to find the right package for your task, two questions are likely to come up: "I want to do X, which package or packages can I use?" and, if you find several, "which one do I pick?"
And I'm not here because I have the answers; I don't think there are ultimate answers to this. But I want to share a workflow I've seen several people use, and some tools for along the way.
So first, you obviously need to find packages with the right functionality, and I'll talk about that a bit. Then you usually have a selection of packages that you try to narrow down with some high-level comparisons, and finally you have to dive in and explore them in more depth before you can make your choice.
Finding packages
One way to keep up with all the things that are out there, though I wouldn't necessarily advocate it as the most efficient way, but certainly a fun one, are all the different R blogs. Both R Weekly and R-bloggers are aggregator sites, and you can contribute your own blog by providing them with your RSS feed.
If you'd rather have search functionality, Rseek might be your thing. It is a Google custom search that returns results from R-related sites: the R project site itself, GitHub, R-bloggers, Stack Overflow, and friends.
Crantastic is a community site for R packages where you can search, review, and tag CRAN packages. And on CRAN itself there are the task views. They are in the menu on the left-hand side, but I've met quite a few people who are not aware of them, so I thought I'd point them out.
They provide you with information about which packages are available for a given topic, like Bayesian inference, natural language processing, high-performance computing, web interfaces, psychometrics, or whatever floats your boat. These are curated lists that usually group packages by sub-topic.
Curated lists are hard to keep up to date, so if you spot something that you think should be on there, or want to contribute in any other way, I think the best way is to email the maintainer of the task view. Their name and email address should always be at the top, so you can contact them.
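Task views can also be used programmatically through the ctv package on CRAN. A minimal sketch, where the view name "Bayesian" is just one example topic:

```r
# The ctv package provides programmatic access to CRAN task views
install.packages("ctv")

# List all available task views along with their maintainers
ctv::available.views()

# Install every package listed in a given task view, e.g. "Bayesian"
ctv::install.views("Bayesian")
```

This is handy when you want everything on a topic at once, though for a first exploration reading the annotated list on CRAN is usually more informative.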
Another site is Metacran, which aggregates information from CRAN and GitHub. It also has search functionality and sections like featured packages, trending packages, most downloaded, and most depended upon: measures that try to capture package popularity.
Comparing and evaluating packages
That may be useful for discovering what's out there, and also when you try to narrow down your selection. However, Metacran doesn't give you direct functionality to look at these numbers for your own custom selection of packages.
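You can, however, query the same download logs yourself with the cranlogs package, which exposes the RStudio CRAN-mirror download counts that Metacran displays. A sketch for a custom selection (the package names here are just examples):

```r
# cranlogs queries the RStudio CRAN mirror download logs via a web API
install.packages("cranlogs")
library(cranlogs)

# Daily download counts for a custom set of packages over the last month
dl <- cran_downloads(packages = c("forecast", "prophet"),
                     when = "last-month")

# Total downloads per package, as a rough popularity measure
aggregate(count ~ package, data = dl, FUN = sum)
```

Download counts only reflect one mirror and say nothing about quality, so treat them as one signal among several.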
But let's give that high-level comparison another thought anyway: what other aspects might you pay attention to? Is it stable? Is it actively maintained? When was the last update? What's going on in the issue tracker? Is it depended upon? What about tests? Does it use continuous integration? What about the documentation? Does the package have a vignette? These are all things that we could try to capture in metrics.
I think many of us will probably also pay attention to who wrote the package. That doesn't make for a good metric, though. Metrics are a tool; use them where appropriate.
I'm aware of at least two packages out there that collect package metrics. One is creatively named packagemetrics; that's a project I worked on with these lovely people, and Colin also contributed some code. It gives you information from CRAN and GitHub: for example, when the last commit was made, when the last issue was closed, and how many contributors there are, these kinds of things.
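A minimal sketch of using packagemetrics, assuming the development version from the ropenscilabs GitHub organisation and the `package_list_metrics()` / `metrics_table()` functions described in its README (names may change, since the package lives on GitHub rather than CRAN):

```r
# packagemetrics is on GitHub (ropenscilabs/packagemetrics), not CRAN
# install.packages("devtools")
devtools::install_github("ropenscilabs/packagemetrics")
library(packagemetrics)

# Collect CRAN and GitHub metrics (last update, contributors,
# reverse dependencies, ...) for a custom selection of packages
metrics <- package_list_metrics(c("dplyr", "data.table"))

# Render the comparison as a formatted table
metrics_table(metrics)
```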
At Mango, we also have a product where we check packages in depth, and as part of that we have a package to collect package metrics, which we've made available on GitHub. It gives you similar CRAN and GitHub stats along with cyclomatic complexity and a few others. But these metrics are obviously not a substitute for an actual in-depth exploration of a package.
So, at maybe two ends of a formality spectrum: you could go and try it out yourself, play with a vignette, that kind of thing, or (yes, one minute!) you can rely on a formal review process, for example through academic journals like the Journal of Statistical Software or the R Journal, or through the rOpenSci software review. Part of that review process at rOpenSci is the goodpractice package, which is also a Mango package that I've recently released to CRAN. It lets you run a set of 230 checks on good practices for building R packages, to give you a quicker sense of how a package was written than if you opened up and read the source code yourself.
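A minimal sketch of running goodpractice against a package; the path below is a placeholder for the source directory of whatever package you want to assess:

```r
# goodpractice runs static checks (R CMD check results, lintr,
# cyclomatic complexity, test coverage, ...) on a package source tree
install.packages("goodpractice")
library(goodpractice)

# "path/to/yourpackage" is a placeholder: point gp() at a package source
g <- gp("path/to/yourpackage")

# Printing the result gives advice on practices the package doesn't follow
g
```

Because it runs R CMD check and the test suite under the hood, `gp()` can take a few minutes on larger packages.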
And with that, I have a bunch of links for all the things I've talked about. I just want to say: many roads lead to Rome here. All of these are tools; pick the ones that are right for you and your task.

