Resources

Marie Vendettuoli | Lessons learned developing a library of validated packages | RStudio

Full title: Towards an integrated {verse}: lessons learned developing a library of validated packages

Developing R packages as a unified {verse} – a set of packages that work well together but with each focusing on individual tasks – is an efficient strategy to structure support for complex workflows. The ongoing challenge becomes managing the growth of related packages in a holistic manner. This is especially problematic in industries with a heavy emphasis on stability, for example if packages need to be validated prior to use in production. In this talk, I will discuss a paradigm for developing and maintaining validated R packages, emphasizing the following areas:

1. Strategies for organizing packages to prevent excessive re-work
2. Facilitating responsive, iterative development
3. Empathy for developer and user experiences

About Marie: Marie Vendettuoli is a Senior Statistical Programmer at the Statistical Center for HIV/AIDS Research and Prevention (SCHARP - https://www.fredhutch.org/en/research/divisions/vaccine-infectious-disease-division/research/biostatistics-bioinformatics-and-epidemiology/statistical-center-for-hiv-aids-research-and-prevention.html) at Fred Hutch. She holds a PhD from Iowa State University in Human Computer Interaction and started developing R packages for use within regulatory frameworks while working as a Data Scientist at the USDA Center for Veterinary Biologics (https://www.aphis.usda.gov/aphis/ourfocus/animalhealth/veterinary-biologics/sa_about_vb/ct_vb_about). Before discovering R, Marie worked in a CBER (https://www.fda.gov/about-fda/fda-organization/center-biologics-evaluation-and-research-cber)-regulated laboratory. Her main interest is developing analytical infrastructure to facilitate scientific analysis for fellow data scientists working in a regulatory environment.

image: thumbnail.jpg

Transcript

This transcript was generated automatically and may contain errors.

Hi, I'm Marie Vendettuoli. I'm a Senior Statistical Programmer at Fred Hutch in the Statistical Center for HIV/AIDS Research and Prevention. At rstudio::conf 2020, my colleague Ellis Hughes shared technical details demonstrating how our organization is tackling the challenge of software validation for R packages. It's now a year later. We have three packages released and are tackling the extended challenge of ensuring that our entire verse is fit for purpose as an integrated unit. Today, I'll be sharing my thoughts regarding multi-package development in the pharma environment.

Setting the team up for success

My first recommendation is to set the team up for success by getting infrastructure in order. Moving away from historical validation practices means updating or generating new SOPs. Another area for infrastructure support is to establish common development guidelines that get shared across all packages. This is especially important if packages are implementing legacy code that was previously siloed. The ultimate benefit, however, is code portability within the overall project.

Addressing development from an empathy perspective

My next recommendation is the need to address development from an empathy perspective. At SCHARP, we observe four distinct roles. One, users rapidly processing data. These would be primarily statistical programmers. Two, data analysts reviewing code for context. These would be individuals who need to be able to rapidly assess context from function names alone. Three, stakeholders interacting with functions solely from the package validation report. So, this could be someone involved in writing specifications and test cases but who isn't participating in coding activity. Four, package developers and maintainers who seek to eliminate duplication or re-implementation with package updates.

Now, one common need that transcends all of these user groups is to have a vocabulary which allows us to place functions within the larger context of assay-specific data processing tasks. That is, we want to survey what has been implemented, filter either by assay or task, and call up help page documentation on demand. So, this is actually a reference to Shneiderman's information-seeking mantra, which is typically cited in visualization literature. In this context, we are treating the function names as the identifying handle and leveraging RStudio's autocomplete as the filtering mechanism as we navigate to a particular help page or call the function. For example, one step of data processing that occurs for all assays is the transformation to a CDISC ADaM basic data structure (BDS). Because of the function naming convention we have chosen, a list of all available BDS transformation functions is displayed and we can select for the assay of interest.
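The talk does not show the convention itself, but a task-first naming scheme of the kind described can be sketched as follows. Function and assay names here are hypothetical illustrations, not SCHARP's actual API:

```r
# Hypothetical convention: <task>_<assay>(). Because the task comes
# first, typing "bds_" in RStudio's autocomplete lists every BDS
# transformation across the verse, and "bds_nab" narrows to one assay.

bds_nab <- function(data) {
  # Transform neutralizing-antibody assay results to a BDS layout
  # (placeholder body for illustration only).
  data
}

bds_elisa <- function(data) {
  # Transform ELISA assay results to a BDS layout.
  data
}

# Survey what exists for a given task without opening any files:
ls("package:base")            # the same idea applied to a package:
# ls("package:myassayverse", pattern = "^bds_")
```

The payoff is exactly the overview / filter / details-on-demand loop: the shared prefix gives the overview, autocomplete does the filtering, and `?bds_nab` supplies details on demand.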

Package validation and systematic mapping

Focusing on the package validation elements, we also need a systematic approach to how we map package requirements, test cases, and test code files. Validation is a form of user acceptance testing, where we have to show that for every software requirement there was an appropriate test successfully executed. The modular approach we use at SCHARP is incredibly flexible, which has the development benefits I will discuss later, but it does require some organization to track. In addition to enforcing a file naming convention, we also scrape these files to generate a tabular display in the validation report, which links requirements and tests. Between these two implementations, both developers and validation report stakeholders are able to verify at a glance that complete coverage exists.
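One way to scrape such files into a coverage table can be sketched in a few lines of R. The directory layout and `req_`/`tc_` naming scheme below are hypothetical stand-ins for whatever convention a team enforces:

```r
# Minimal sketch: pair requirement files with test-code files by a
# shared numeric ID embedded in the file name, e.g.
#   validation/requirements/req_001_bds.md
#   validation/test_code/tc_001_bds.R
# An NA in either column of the result flags a coverage gap at a glance.

map_validation_files <- function(dir) {
  reqs  <- list.files(file.path(dir, "requirements"), pattern = "^req_\\d+")
  tests <- list.files(file.path(dir, "test_code"),    pattern = "^tc_\\d+")

  # Extract the shared numeric ID from each file name.
  id <- function(x) sub("^[a-z]+_(\\d+).*$", "\\1", x)

  merge(
    data.frame(id = id(reqs),  requirement = reqs),
    data.frame(id = id(tests), test_code   = tests),
    by = "id", all = TRUE
  )
}
```

Rendering the resulting data frame as a table in the validation report gives both developers and stakeholders the at-a-glance traceability the talk describes.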


Managing technical debt across packages

The last couple recommendations I have address the topic of technical debt. You may be wondering, why are we splitting assays into separate packages? A couple reasons. First, FDA guidance asks us to revalidate the entire system when one element changes. Now, change is expected, input data structures get updated, analysis needs drift. However, an update to one assay type should not introduce the appearance of change affecting other assays. Secondly, from the user perspective, we need to put some conceptual structures around processes. Sure, every data set will be converted to BDS, but the definition of that basic data structure will vary from assay to assay, and we need to capture that distinction in a manner that is easy to digest.

What we also want to avoid is excessive re-implementation. As we have expanded to multiple assay packages, we have identified utility functions that can be shared and moved into their own foundation package. We can enforce package code dependencies through routine use of the DESCRIPTION file, but the power of modular validation means that we can move the associated specifications, test cases, and test code files as well. Validation of the lighter assay package with the new utility package is simply compiling the R Markdown document without any need to rewrite content. Likewise, when we need to add additional specifications, test cases, or test code to an existing package, an updated validation report will add the new child documents to existing sources. This has the added benefit of being able to use a simple diff to compare validation reports across package versions.
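The child-document mechanism described above can be sketched as a parent R Markdown report that discovers its children on disk; the paths and file layout here are hypothetical, not SCHARP's actual structure:

```r
# validation_report.Rmd (parent) -- a minimal sketch.
# Because the parent knits whatever child files exist, moving a
# specification or test file into a foundation package, or adding a
# new one, requires no edits to the report source itself.

children <- list.files(
  c("validation/specifications",
    "validation/test_cases",
    "validation/test_code"),
  full.names = TRUE
)

# Inside the Rmd, a single chunk such as
#   ```{r, child = children}
#   ```
# knits each file, in order, into the compiled validation report.
# Recompiling after a change is then just:
# rmarkdown::render("vignettes/validation_report.Rmd")
```

Since each child contributes a stable, self-contained block of the report, two compiled reports from different package versions differ only where the underlying specifications or tests actually changed, which is what makes the simple-diff comparison work.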


So, I wanted to take a moment to thank those who made this work possible, especially the preceding efforts by Ellis Hughes and the combined contribution of the SCHARP data standardization team. Future work includes formalizing our validation practice through a PHUSE collaboration with deliverables expected later in 2021.