
Ellis Hughes | R Package Validation Framework | Posit
From rstudio::global(2021) Pharma X-Sessions, sponsored by ProCogia: in this talk I discuss the process developed for validating internally generated R packages at SCHARP (Statistical Center for HIV/AIDS Research and Prevention) - the R Package Validation Framework. I cover the elements of the framework and the basics of applying it, with some examples. By using tools native to the R package building infrastructure, validation can become an integrated part of your package development, improving the quality of both the package and the validation. About Ellis Hughes: I am a statistical programmer at Fred Hutch Cancer Research Center, where I work on a team that evaluates potential HIV vaccine candidates. Having graduated from Washington State University with a degree in Bioengineering, I found a passion for programming in R. I now organize the Seattle useR group and enjoy building packages to automate my workflows. Learn more about the rstudio::global(2021) X-Sessions: https://blog.rstudio.com/2021/01/11/x-sessions-at-rstudio-global/
image: thumbnail.jpg
Transcript
This transcript was generated automatically and may contain errors.
Today, I'll be talking about the R Package Validation Framework, a project I've been working on for a little over a year now. I'm really excited to share with you all what we've been working on and the updates that have come out of it.
My name is Ellis Hughes. I'm a statistical programmer with a background in statistical genetics. Currently, I work at Fred Hutch under SCHARP, working on HIV vaccine research. I'm pretty heavily involved in the R community: I'm one of the Seattle useR organizers, I'm also one of the organizers of Cascadia R Conf, and I run a screencast called TidyX, where we go through and explain how R code works, usually from Tidy Tuesday submissions. You can find me on Twitter at @ellis_hughes.
What is validation?
A common question whenever any of us tries to introduce new software into the pharma world is: is it validated? That's a question we're always going to get. But what is validation? Yes, there's an official definition, but it can mean different things to different people.
As I understand it, the definition of validation is establishing documented evidence that our software performs some process, procedure, or activity in compliance with our specifications, with a high degree of assurance. In layman's terms, we're creating documents to prove that our software does what we say it does.
But why do we even care about validation? Why is that question even asked? A lot of people point to the fact that we have to validate our software for FDA submission. But there are a lot of unspoken benefits of validation that don't always get talked about, such as improved quality and safety of our code: because we've gone through and checked it, we're confident in the quality of the code and that it will return proper results. It also results in faster processing: because the code is already set up, we can use it across multiple projects and trust that it will work. And it promotes trust: because we've performed this validation and vetted the code to the best of our abilities, we're confident that when we give it an input, it will give us a consistent output, and throw an error when things go out of bounds.
Validation practice can be a really high bar, and a lot of documents go into it. First, you fill out a form for the specifications, planned uses, and the environments you're going to use the software in. Then you write your code based on those specifications and record the function authorship in some external file, potentially an Excel file. Then there's another form to document your test cases, your test environment, and how you plan on doing your testing. Then maybe a last form to show that your testing plan is comprehensive, proving that all your specifications are met by it. A third party comes through, manually evaluates your tests, and takes screenshots of the results. And then you review your documentation and combine it into a final validation packet for release. But information is shared across all of these documents, so any time one of them updates, you need to go back and update the others. And if you manually evaluated all your tests and have to make an update, you're going to have to rerun everything. It's incredibly inefficient, and at any point it can feel like game over, because you're just redoing your work over and over again.
R and validation as best friends
But today, I want to tell you that validation and R can be best friends. They can work together to provide a validation framework that achieves all the goals of validation without as much of the additional stress and work that can be involved. Using a combination of R Markdown, testthat, and roxygen2, we can essentially make validation push-button and generate a document that looks like this: we capture a place for signatures from everyone involved with the project; we record information about the environment in which we performed our validation and who wrote which pieces of the project (the specs, the functions, the test cases, the test code); we share the coverage between our specs and our test cases; and finally, we record all of our content, including our test results, all at the click of a button, without having to redo as much work.
And this is the R Package Validation Framework. There are five key elements to the framework: recording your specs, writing your code, recording your test cases, writing your test code, and finally generating the documentation. The advantage of using this framework is that it's integrated into the R package development process: it takes you from the ideation of "I want to create a package" all the way through to a package you've now validated. It's native to R programmers: the framework lives within the R package itself, so you don't have to leave your R package environment to perform your validation or keep the documents there. It allows for iterative development: because the docs don't live across multiple locations, we can update pieces without having to make sure five other documents also got that piece of information. It's reusable because it relies on code: we can just rerun things and confirm they work. And it's extensible: if you want to move pieces out of your package into a more utility-focused package, you can do that and just copy out the test cases, specs, and so on into the new package, without putting in a whole new round of effort.
Specifications
So we're going to start at the beginning: specifications. Specifications define the expectations of a package; you can think of them as the blueprints for your package, defining what its goal is. A good specification has several elements: what will the thing be doing? What are the expected inputs? What are the expected outputs? What warnings or errors should be triggered, letting the user know when things are going out of bounds? But critically, a specification will not rely on external knowledge. You can assume basic knowledge on the programmer's side, but do not assume they have contextual knowledge of the statistics or the process you're trying to build into your R package.
So if I were writing a specification for my presentation today, I'd say: my presentation will cover my team's approach to validation, it will be roughly 15 to 20 minutes long, and it will be entertaining. Now, when I convert that into a specification file to save into my package, it needs to have several properties. First, it needs to be machine readable, because we'll need to read it back later on. It needs to be independent: we don't want multiple specification files that link back to one another, because that creates codependencies and makes it difficult to separate pieces out. And it needs close proximity to the task: we want it saved within our R package, so it lives where the programmers are doing their work. When documenting our specifications, beyond the actual specifications themselves, we really want to capture who wrote them and when they were written.
So if I were taking the specification from earlier and converting it into what I'd use for the R Package Validation Framework, I would write something like this. It's written in Markdown, with a header at the very top saying that I wrote this spec and I wrote it on this date. My specifications are: my presentation will clearly explain the validation procedure; it will be 15 to 20 minutes long; and it will be entertaining, which I'll measure by causing at least three people to laugh. And optionally, spec 1.4: fame, glamour, and maybe starting a branded accessory chain. We'll see.
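As a sketch, a specification file in this style might look like the following Markdown (the exact header labels and numbering scheme here are illustrative, not a format prescribed by the framework):

```markdown
#### Specifications: my_presentation

**Written By:** Ellis Hughes
**Written On:** 2021/01/21

+ Spec 1.1: The presentation will clearly explain the validation procedure.
+ Spec 1.2: The presentation will be 15 to 20 minutes long.
+ Spec 1.3: The presentation will be entertaining, causing at least three
  people to laugh.
+ Spec 1.4 (optional): Fame, glamour, and a branded accessory chain.
```

Because it is plain Markdown with a predictable structure, the file stays both human readable and easy to parse back into the validation report later.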
Code development and documentation
All right. So let's get down to business. We have our specs; it's code development time, the piece we've all been waiting for. And don't worry, I'm not going to sit here and tell you how to write code; I'm going to trust that you're following good programming practices. What I am going to talk about is documenting your code for validation. We want to capture who wrote the code and when they wrote it, and make sure the documentation of ownership gets updated whenever the code gets updated.
The way we're going to do that is with Roxygen tags. If you're writing your code according to good programming practice, you're going to be adding documentation, and a lot of folks use roxygen2 to do so. We suggest using two Roxygen tags, @section Last Updated By and @section Last Updated Date, to capture this information. The value of doing this is close proximity to the task: whoever is working on the function at the time doesn't have to go to an external Word document to record that they updated this function and when; it's right there with the function, which makes it simple to do. It's a natural extension of the documentation already being performed, because we're already using roxygen2. And as an added benefit, because it's a Roxygen tag, the information gets added to the function's help page when you compile your documentation.
So here's a simple function I wrote. It tells a joke, with a setup and a punchline. But what we really care about are these two Roxygen tags: I was the one who wrote this function, and this is when I wrote it. It's a very simple function. But say Joe King came through and updated my joke with a new parameter, so the user could specify how long it waits before delivering the punchline. Joe updated the normal Roxygen tags, but he also updated Last Updated By and Last Updated Date to his name and the date of his update.
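A minimal sketch of what such a function might look like after Joe's update; the function name, body, and tag values are illustrative, not taken from the slides:

```r
#' Tell a joke
#'
#' Prints the setup, pauses, then delivers the punchline.
#'
#' @param setup character. The setup of the joke.
#' @param punchline character. The punchline of the joke.
#' @param wait numeric. Seconds to pause before the punchline.
#'
#' @section Last Updated By:
#' Joe King
#' @section Last Updated Date:
#' 2021/01/15
tell_joke <- function(setup, punchline, wait = 1) {
  message(setup)
  Sys.sleep(wait)   # Joe's new parameter: delay before the punchline
  message(punchline)
  invisible(TRUE)
}
```

The two @section tags compile into their own sections of the help page, so the ownership record travels with the function's documentation.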
Now that that's in there, when he regenerates the documentation, the help pages show who wrote that function and when it was updated. So Joe is now the owner of that function for validation purposes.
Test cases
Cool. All right. So now we're going to connect the dots. We have our code and we have our specs, but we need to prove that we've met our specs. The way we do that is through test cases. Test cases tie what you wanted to do to what you did. Now, a single test case can satisfy multiple specifications: if it covers many of the functions you'll be using, it can prove you've met several specifications at once. However, every single specification must be satisfied by at least one test case. And the risk of a specification drives the level of testing performed, that is, the number of times you'll test that function with new inputs.
Now, a good test case will share which specifications are being met and how. It will detail the required setup: if there's any data I need to load or any environment variables I need to set, it spells out what needs to be done. It's a recipe, really, for how to get from the setup and inputs to the desired output, with clear expectations for what you're testing against. So if you're reading in a data set, rather than saying "check the length," say "check that this data set is thirty-four rows." But critically, no code is provided, because you want whoever performs the testing to be unable to copy and paste the code; otherwise you didn't actually test it.
When documenting your test cases, we're going to capture the same information as before, who wrote the test and when, but we're also going to capture which specifications are being satisfied, because that can change as tests update. And, surprise, surprise, we're going to use Roxygen tags to do it: @section Last Updated By, @section Last Updated Date, and @section Specification Coverage. So if I were writing a test case for my presentation today, I'd have my sections at the top: I wrote this test case, this is when I wrote it, and these are the specs being covered. I have a setup where I create my presentation. Then I'll go through and prove that I met my specifications: my first test is that the presentation was informative, which I'll test by asking my audience what they learned.
I'll time my presentation to make sure it's between 15 and 20 minutes long, and I'll test that it's entertaining by counting the number of chuckles from the audience (which, unfortunately, I can't hear right now; hopefully you're laughing) or, when I practice it on my significant other, the number of eye rolls. And trust me, there were a few of those.
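Pulling those pieces together, a test case file in this style might be sketched as the following Markdown; the layout and labels are illustrative, not the framework's required format:

```markdown
#### Test Case: T1 presentation quality

**Written By:** Ellis Hughes
**Written On:** 2021/01/21
**Specification Coverage:** 1.1, 1.2, 1.3

Setup: deliver the presentation to an audience.

+ T1.1: Ask the audience what they learned; confirm the validation
  procedure was clearly explained (Spec 1.1).
+ T1.2: Time the presentation; confirm it runs between 15 and 20
  minutes (Spec 1.2).
+ T1.3: Count chuckles (or eye rolls); confirm at least three people
  laughed (Spec 1.3).
```

Note that the steps describe what to check and against what expectation, but deliberately contain no code for the tester to copy and paste.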
Test coding
All right. So now we have all the setup, and we're going to fill in the lines with test coding. Test coding is the actual implementation of the test cases in code, recording the results to prove that we've met our specifications. But, very importantly, a third party should be doing this: somebody who was not involved in writing the package or the test cases goes through and performs the testing. (They could have written the specifications.) The benefit of this is that it helps resolve interpretation errors in your documentation and examples, because this person doesn't actually know how to use your package. They'll be using your documentation to see how to achieve the test cases you laid out, and they can raise questions: "this wasn't very clear to me." As the author, you intrinsically understand how the function is supposed to behave; this new person doesn't, so they can ask you to improve your documentation or examples. If the test case itself isn't clear, they'll come to the wrong conclusion, the test will fail, and you can go back and improve your test cases. A failing test is not necessarily a bad thing; it just means you need to improve your documentation. Likewise, if there are interpretation errors in your specifications (assuming they didn't write them), they can once again help improve those. And in general, it helps identify improvements: if your code runs slowly, or if an argument is missing or incredibly unclear, they can come back and say, "hey, we were missing this argument here, can you please add it?" or "as I was going through the test cases, it didn't make much sense to have to set it up this way."
And so it really allows for iterative improvement before you even get to the first release.
So we're going to use a combination of testthat and, surprise, surprise, roxygen2. We'll document our test code with who wrote it and when, using the same Roxygen tags, Last Updated By and Last Updated Date. Here's an example of what some test code might look like: we have the header at the top, and each test piece has its own test_that section, exactly like any other testthat test, but with Last Updated By and Last Updated Date attached. There's the setup, the code they run, and the expectations, completely like any other test. The reason we're using testthat is that it's a familiar framework, used by a ton of packages on CRAN right now; it's basically the de facto standard for unit testing at this point. So we're going to steal that and use it for our purposes.
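A sketch of a single piece of test code in this style, with the authorship tags written as Roxygen-style comments above the test_that() block. The helper time_presentation() and the object my_presentation are hypothetical stand-ins for whatever the test case actually exercises:

```r
library(testthat)

#' @section Last Updated By:
#' Third Party Tester
#' @section Last Updated Date:
#' 2021/01/18
test_that("T1.2: presentation runs between 15 and 20 minutes", {
  # time_presentation() is a hypothetical helper for this sketch
  minutes <- time_presentation(my_presentation)

  expect_gte(minutes, 15)
  expect_lte(minutes, 20)
})
```

Everything inside test_that() is ordinary testthat; only the comment header above it is the framework's addition, which the validation report later scrapes for authorship.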
It can also be run and developed either interactively or in batch, so we don't have to have somebody manually run it, though they can as they're building up their tests. But the big reason we use testthat is the reporter objects that live within it. Reporter objects are special objects inside testthat that track each expectation and report whether it succeeded or, if it failed, why. That's very important, because we need to track that for validation purposes. A standard reporter object's output looks like this: if you've ever built a package and pressed Ctrl+Shift+T after putting unit tests in, you've actually used a reporter object; you just didn't know it, because it's under the hood. What we do is execute the reporter object, then use a custom function to extract the results for us and make a nice table showing which test was run, whether the results were as expected or different, what the difference was, and a pass or fail. And we can put that into a table.
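One possible way to capture results into a table, assuming the test code lives under a validation/test_code/ directory; exact behavior of the coercion depends on your testthat version, so treat this as a sketch rather than the talk's actual extraction function:

```r
library(testthat)

# Run every test file, collecting results quietly instead of
# printing them to the console.
results <- test_dir("validation/test_code", reporter = "silent")

# Coerce the collected results into a data frame: one row per test,
# with counts of passed/failed expectations, ready to render as a
# table in the validation report.
results_df <- as.data.frame(results)
```

From there, the data frame can be rendered with knitr::kable() or similar inside the validation vignette.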
Validation documentation
Right, the proof is in the pudding. What do we have so far? We have our specifications, our code, our test cases, and our test code. But as we said, validation is all about documenting that we did all this. And that's our validation documentation. We're going to use R Markdown to create a vignette, where we record all the information we've created so far. We'll have a sign-off/approval section based on your organizational requirements. We'll capture environmental information, such as the R version and package dependency versions, in the R Markdown; because it's dynamic, it updates whenever you rerun your validation documentation. Here you can also capture any other organizational requirements that might have lived in other validation documents. Then, using R Markdown, we dynamically read in our specifications (once again, it was critical that they were machine readable), scrape the test code for authorship, read in the test cases and drop them into the document, and then execute all the test code and capture the results into the tables we saw.
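The dynamic pieces of such a vignette might be sketched as R chunks along these lines; the directory paths are assumptions matching the layout described here, not a prescribed API:

```r
# Capture the validation environment: R version, attached packages,
# and their versions, regenerated on every render.
sessionInfo()

# Read the machine-readable specification files into the report.
spec_files <- list.files("validation/specifications",
                         pattern = "\\.md$", full.names = TRUE)
specs <- lapply(spec_files, readLines)

# Execute all the test code and capture the results for tabulation.
results <- testthat::test_dir("validation/test_code", reporter = "silent")
```

Because everything is computed at render time, rebuilding the vignette refreshes the environment section, the specs, and the test results together.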
And this is the general format of what you might see in your package under the vignettes folder. We have validation.Rmd, which holds all the code for setup and recording of the environment; this then generates the results, typically as a PDF. And we have our validation folder, with our specs, our test cases, and our test code, and all the content underneath.
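That layout might look something like this (the file and folder names are illustrative):

```
vignettes/
├── validation.Rmd        # builds the validation report (typically PDF)
└── validation/
    ├── specifications/   # machine-readable spec files
    ├── test_cases/       # test case descriptions
    └── test_code/        # testthat test code
```

Keeping everything under vignettes/ means the report is rebuilt, and the tests rerun, whenever the package vignettes are built.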
So, all together now. Like I said, we're going to use R Markdown, testthat, and roxygen2, and with their powers combined, we're able to generate a reproducible validation report at the click of a button, or at the build of a package, because it lives in the vignettes folder. When you build the package, you can tell it to execute your vignettes, so you can make sure you don't have an R package that has failed its validation. And we generate this exact same report we had before, but now you know where it all came from. We wrote in these signature pages. The validation environment section was generated at the build of the document. We pulled in all this information about who wrote the specifications, the functions, the test cases, and the test code. We showed coverage of all the specifications by the test cases. And we dropped it all into this document dynamically; we don't have to manually go through and update it every time.
What's next
So that's the framework. Where are we at now? Currently, I'm working on a white paper with a team from PHUSE to describe the framework process in more detail, as well as generalizing and recording the optimal processes we suggest you follow when using this framework. It's very much under construction; hopefully it'll be coming out later this year. I'm really excited for this. We're also working on the valtools R package. It's based on the white paper and very much in progress, but it will provide tooling not currently available in usethis or devtools to help folks perform their validations using this framework. Hopefully that'll be coming out later this year as well.
Many thanks to all the folks who have been involved with this framework at Fred Hutch (Marie Vendettuoli, Anthony Williams, Jimmy, Barthi, Raphael, Alicia, Shannon, Paul, and Kate), as well as the folks in the PHUSE R Package Validation Framework working group, because you really helped me figure out how to expand this framework and make it general enough to work for more folks. Hopefully you now understand that validation and R can truly live together forever. Thank you very much, RStudio as well as ProCogia, for having me talk today and for putting this together. You can find the code from this presentation at github.com/thebioengineer/validation_rstudio_2021. Thank you so much.
