Elevating Public Health Decision-Making with R Packages (Kylie Ainslie, RIVM)

Transcript

This transcript was generated automatically and may contain errors.

Okay, so this may look familiar to some of you. It is what I lovingly refer to as the Project Folder Hellscape. Looking at it, it's chaotic. There's a lot of different folders with not very informative names. There's a lot of different files, a lot of different versions of the same file, and a lot of files that have really uninformative names, like MD7, I have no idea what that means.

And at one point or another, when doing this project, this was from my PhD, I asked myself, where is that script? And which file is current? And what was I thinking months ago?

And I'm sure some of you can relate to this. And so after I finished my PhD, and I moved on to a postdoc, I started to think, okay, how can I do this better? How can I move away from this chaotic Project Folder Hellscape into something more organized? And so what if I told you that I found that better way, that I could turn this chaos into a nice, organized file structure that I could use for every code-based project?

So you may ask, how? Well, by structuring your project as an R package.

And I don't mean from the software development perspective, where the primary goal of the project is to end up with a production-ready R package. I mean by using the actual file structure of an R package, and leveraging the benefits of an R package to better organize, document, and ultimately share your code.
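As a rough illustration of that file structure, the sketch below lays out a bare-bones package skeleton using base R only. The helper `create_skeleton()`, the project name `outbreakanalysis`, and the choice of folders are assumptions made for this example, not the speaker's actual project. The `data-raw/` folder in particular is a common convention rather than a requirement.

```r
# Minimal sketch: lay out an analysis project as an R package skeleton.
# `create_skeleton()` and the package name are hypothetical.
create_skeleton <- function(path) {
  pkg <- basename(path)

  # Folders for functions, datasets, data-preparation scripts, and reports
  for (d in c("R", "data", "data-raw", "vignettes")) {
    dir.create(file.path(path, d), recursive = TRUE, showWarnings = FALSE)
  }

  # DESCRIPTION holds the project metadata
  writeLines(c(
    paste("Package:", pkg),
    "Title: Analysis Code for an Example Project",
    "Version: 0.0.1",
    "Description: Functions, data, and reports for one analysis project.",
    "License: GPL-3",
    "Encoding: UTF-8"
  ), file.path(path, "DESCRIPTION"))

  # NAMESPACE declares which functions are visible to users;
  # this pattern exports every function whose name starts with a letter
  writeLines("exportPattern(\"^[[:alpha:]]+\")", file.path(path, "NAMESPACE"))

  invisible(path)
}

# Build the skeleton in a temporary folder and list what was created
pkg_dir <- create_skeleton(file.path(tempdir(), "outbreakanalysis"))
print(list.files(pkg_dir))
```

Every project built this way has the same layout, so a collaborator always knows that functions live in `R/` and the metadata lives in `DESCRIPTION`.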

And so for the rest of this talk, I wanna walk you through how I've used this workflow, and I wanna provide you with examples of tangible impacts that it's had on my work. And so in order to give you a little bit of context for one of the projects I've used this approach on, we need to travel back in time.

And because I used this approach, my work was traceable. I could go back and say, I made that decision because that was the information that we had at that time, and then we got new information, and on and on.

And so the first two features of this R package workflow focus mainly on your individual workflow, how you move through a project. But what I think makes the R package really powerful is that when you combine it with some sort of file sharing platform, whether that's GitHub or a shared network drive or something, then all of a sudden you have collaborative magic.

Because now all of a sudden you have this beautiful, organized, standardized, documented, for lack of a better word, package, that enables you to install it easily. So rather than having to source a bunch of separate scripts, you can install a package in one line of code. You also have a way to have seamless handovers. So it becomes very easy for you to onboard a new person or even for yourself to come back to a project after six months.
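To make the one-line install concrete, here is a small self-contained sketch. It writes a throwaway package into a temporary folder and installs it from source with a single `install.packages()` call. The package name `demotools` and the function `attack_rate()` are invented for this example. In practice the path would point at a shared drive or a local clone of the team repository.

```r
# Write a throwaway package (hypothetical name and contents) to a temp folder
src <- file.path(tempdir(), "demotools")
dir.create(file.path(src, "R"), recursive = TRUE, showWarnings = FALSE)
writeLines(c(
  "Package: demotools",
  "Title: Demo Tools",
  "Version: 0.0.1",
  "Description: A throwaway package used to illustrate installation.",
  "License: GPL-3",
  "Encoding: UTF-8"
), file.path(src, "DESCRIPTION"))
writeLines("export(attack_rate)", file.path(src, "NAMESPACE"))
writeLines(
  "attack_rate <- function(cases, population) cases / population",
  file.path(src, "R", "attack_rate.R")
)

# Install into a temporary library so the demo leaves your real library alone
lib <- file.path(tempdir(), "demo-lib")
dir.create(lib, showWarnings = FALSE)

# The one line that replaces a series of source() calls
install.packages(src, repos = NULL, type = "source", lib = lib, quiet = TRUE)

# The function is now available like any other package function
library(demotools, lib.loc = lib)
rate <- attack_rate(cases = 50, population = 1000)
print(rate)
```

The temporary library is only there to keep the example tidy. A colleague installing a real project package would normally drop the `lib` argument.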

Another important feature is that you can create reusable tools that you and your colleagues and other people at your organization can use. So I found out in my team that actually me and two other colleagues were all trying to use the same method and we were independently coding that. But with this approach, someone can code it one time and we can all use it. And this is also really important in infectious disease work because I realized that I'm lucky enough to have the resources to build these tools. That is not the case everywhere. And it's very important that everyone has the capability and the resources and the tools with which to analyze their own data.

And finally, it facilitates collaboration. So it's one thing to have a conversation with someone outside your organization or someone on another team about things you're working on, but it changes the game when you're able to show them what you're working on and generate ideas for how you can either work on the current project or extend it later.

And so with all of this sharing capability comes reproducibility, which is a central principle in scientific research. And how this manifested during the COVID modeling was that because I used this approach, my model code was actually selected to be included in a separate R package called epidemics, which serves as a transmission model library. So they were able to take my code, make it better, and include it in this other package for others to use. Another thing was that I was able to be involved in a Europe-wide effort to provide modeling results to help inform COVID public health policy. And that would have been much more difficult if I hadn't used this approach. And so ultimately, one national analysis has now become a piece of international infrastructure that will live beyond the work that I did.

Getting started

And so now it's your turn. I really encourage you to try using an R package structure for your next project. And if you're a little intimidated, you've never made an R package, there are some wonderful tools and learning materials available. And I want to point to two. So one is the usethis package, which actually makes creating the R package structure really, really easy. And then the other is the R Packages book, which gives you details about every step of the process, including some that I haven't even mentioned, like testing.
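For a sense of what that looks like in practice, the calls below are a minimal sketch using usethis. The project name `myanalysis` and the file names `fit_model` and `case_data` are placeholders. The sketch assumes usethis is installed and skips itself otherwise.

```r
# Placeholder location for the new project
pkg_dir <- file.path(tempdir(), "myanalysis")

if (requireNamespace("usethis", quietly = TRUE)) {
  # Create the package skeleton (DESCRIPTION, NAMESPACE, R/)
  usethis::create_package(pkg_dir, open = FALSE)

  # Run the remaining helpers with the new package as the active project
  usethis::with_project(pkg_dir, {
    # Add R/fit_model.R, a file to hold a function
    usethis::use_r("fit_model", open = FALSE)
    # Add data-raw/case_data.R, a script for preparing a dataset
    usethis::use_data_raw("case_data", open = FALSE)
  })
}
```

In an interactive session you would usually leave out `open = FALSE` so that each new file opens in your editor.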

And I promise you'll never go back to messy folders. And I'm also happy to announce that after years of sort of using this approach and iteratively improving, that I've just had my first package accepted to CRAN. And I don't know that that would have been possible without this. So with that, I would be happy to take your questions and please feel free to get in touch.

Q&A

Thank you so much, Kylie. We have a few questions here. So the first one is, are there any parts of the package workflow you've needed to work around for this use case to work well for you?

So I'm gonna answer a slightly different question in a way to answer that question. So one of the things that I think a lot of people think of when they start thinking about R packages and documentation is that it costs time. It costs time in order to do these different bits that you may not invest in a different workflow. And so there were times where we maybe had to focus more on just getting the analysis results and kind of tackle the documentation later because we were under such tight timelines. But ideally you'd be able to do both kind of together.

Yeah, we have quite a few here. So I'm gonna get through them. So how do you deal with multiple versions of data or programs?

So the way that I dealt with it, because like I said, we got new data all the time, was that I actually kept copies of the data, different versions of the data, because I needed to be able to reproduce the model results at a given time. So I actually kept them in a data directory in the package.
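One way to picture this is to save each data delivery under a dated file name and look up the snapshot that was current on a given day. The sketch below is an assumption about how this could be done, not the speaker's actual code. The `case_data_YYYY-MM-DD.csv` naming scheme, the helper `load_snapshot()`, and the toy numbers are all invented for illustration.

```r
# Hypothetical helper: return the newest snapshot dated on or before `as_of`
load_snapshot <- function(data_dir, as_of) {
  pattern <- "^case_data_[0-9]{4}-[0-9]{2}-[0-9]{2}\\.csv$"
  files <- list.files(data_dir, pattern = pattern)

  # Pull the date out of each file name
  dates <- as.Date(sub("^case_data_(.*)\\.csv$", "\\1", files))

  eligible <- dates <= as.Date(as_of)
  if (!any(eligible)) stop("No snapshot on or before ", as_of)

  chosen <- files[eligible][which.max(as.numeric(dates[eligible]))]
  read.csv(file.path(data_dir, chosen))
}

# Toy example: two deliveries one week apart, stored side by side
data_dir <- file.path(tempdir(), "data")
dir.create(data_dir, showWarnings = FALSE)
write.csv(data.frame(week = 1:2, cases = c(10, 12)),
          file.path(data_dir, "case_data_2021-01-04.csv"), row.names = FALSE)
write.csv(data.frame(week = 1:3, cases = c(10, 13, 20)),
          file.path(data_dir, "case_data_2021-01-11.csv"), row.names = FALSE)

# Reproduce an analysis as it stood on 8 January: the first delivery is used
snapshot <- load_snapshot(data_dir, as_of = "2021-01-08")
print(snapshot)
```

Because older snapshots are never overwritten, a result reported on a given date can be regenerated from exactly the data that was available then.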

Totally makes sense. So do you have any recommendations for implementing a shared package development workflow when PHI is a factor and online tools like GitHub aren't allowed?

Can you define PHI?

I wish I could. It is private health information.

Oh, private health. Yeah, so that is one thing that I didn't mention: obviously some of this can't be shared publicly. Many of us work on projects where you can't share this. In fact, I couldn't share this work publicly at first. It had to be shown to the government and approved and then it could be shared. But my team actually works on GitLab, which is in a secure environment. And so this works fine within a secure environment. As long as the team members are able to access some sort of shared environment, it works really well.

What would you have done differently now after seeing the European consortium changes to your code?

Oh, I would have done a ton of things differently. I would have changed how I designed the whole model, but I had about two weeks to create it. And so it was just kind of like, all right, just put something that works. And that's funny because that was also a question asked by the auditor. But yeah, I would have made it more efficient, but that's just, it was a fact of the situation.

How does your workflow work with computationally heavy simulations?

Yeah, that's a good one. So I don't actually do a lot of computationally intensive simulations. I was able to get it quite quick. But I think what you can do is if you have the resources to use some sort of high performance computing, you can have the functionality within the package and then you just run those simulations elsewhere. But I don't think this would prevent you from having the code in which to do something that's computationally intensive.

Do you have any recommendations for getting teammates on board with this very different way of setting up projects, especially with folks who are already resistant to putting time into documentation?

Yes. So I feel like this was asked by one of my colleagues. I have struggled a lot because I think they're tired of me talking about this at work, but I'm finally making headway. And basically I keep sort of showing them the benefits of this and highlighting pain points when it's not done and in gentle ways, not like yelling at people. But I do think showing how easy it can be and how ultimately it can save time, because that's always a consideration that, oh, no, it's more work to make documentation. But in the end, it will actually save you time and make it easier for work to be done.

Amazing. Thank you so much. Thank you. Thank you.

Elevating Public Health Decision-Making with R Packages (Kylie Ainslie, RIVM) | posit::conf(2025)
