Resources

Automating for Consistency (Kristin Mussar, Pfizer) | posit::conf(2025)

Speaker: Kristin Mussar

Abstract: In pharma, our data can be limited, inconsistent, or incomplete, and data cleaning can be time-consuming. In addition, the industry is highly regulated, and data transformations must be transparent. At Pfizer, we developed a custom R package that enables us to automatically quality control our clinical biomarker data. Our package allows for flexible data structures and external input by non-coders, producing consistent reports and clear documentation. In this session I will share some features of our package, as well as how automating our process has been a powerful tool enabling us to achieve consistency, with the hope that this helps others automate their own pipelines.

Subscribe to posit::conf updates: https://posit.co/about/subscription-management/


Transcript

This transcript was generated automatically and may contain errors.

Great, so I'm Kristin, and today I'm going to be talking about automating for consistency. My first point for you is perhaps a little bit controversial: your data may never be standardized. But you can still build consistent processes and products, and automation is a tool you can use to help.

In this talk I'm going to cover how we built a custom R package to handle our disparate data sources and deliver consistent reports for our downstream users, and how we used automation to enable user adoption. I hope you can take away some ideas from this talk and bring them back to your own work.

Why consistent data matters

So, why do we care about consistent data? Consistent data is really valuable data. It's how we get data into people's hands quicker, and it can lead to faster, and better, decision making.

I work in the pharma industry, so we work with patient data. Every data point we work with is a sample that's been provided by a patient. It's really important to remember that patients are giving us a lot of information through this data, and that we really need to use every data point we can. The patients are hoping this can help them with their treatments, but moreover can help future generations, so that patients with the same cancer in the future can have better outcomes.

And so, slowdowns in our process really mean that medicines aren't being delivered to patients as quickly as they could be. So, we need to move fast in our process.

Clinical biomarker data at Pfizer

So, I'll give you a little bit more context on the specific data we work with. We work with clinical biomarker data, and I'm part of the translational oncology bioinformatics group at Pfizer. What this means is that we're working with the biomarker data that comes out of clinical trials, and we're really trying to assess how patients are responding to their treatments.

And so, my group supports over 45 different oncology trials. And throughout these trials, patients will be providing samples. So, this might be a blood sample they provide when they're seeing their doctor. It could be a biopsy. And then, these samples are tested for things called biomarkers. And so, a biomarker is just a readout of biological activity. So, an example of one might be a DNA mutation that's indicating that there's cancer in the body.

And so, these samples are usually collected at the hospitals and then sent to external vendors. And the external vendors are actually going to be doing the testing. And then, they'll take the data and send that back to us. And we work with a lot of vendors, so over 20 vendors. And that means that we're receiving data in different formats from these different vendors. And it's also important to note that biomarker data is a little bit less standardized than other clinical data. So, some of the data is actually entered into spreadsheets manually by scientists at these external labs. And so, it's really important that we assess this data for quality.

So, I wanted to give you an example of what some of our data might look like and what types of quality control we're performing on it. This is data from a single patient and a single biopsy. What we're really trying to assess is: did we receive the data we were expecting to receive? Is it in the correct format? And did the test get performed the way it was supposed to be performed?

So, one of the things we might look at is whether the person performing the test thought they could actually evaluate it. Was it acceptable? We're looking for those values to say yes. Other things we might see in the data set are percentages, and we want to make sure these are all numbers between 0 and 100. And then there are a few checks that are calculations between different values in our data. Some of the tests are really sensitive; they need to get run right away when the vendor receives the sample. So, we might do some calculations on the dates just to make sure the testing was done quickly.
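The checks just described can be sketched as simple predicate rules. This is an illustrative sketch in Python rather than the team's actual R package, and every field name and threshold here is a hypothetical stand-in:

```python
from datetime import date

# Hypothetical single-record QC checks mirroring the kinds described above.
def qc_record(record):
    issues = []
    # Evaluability flag: the assessor should have marked the test evaluable.
    if record["evaluable"].strip().lower() != "yes":
        issues.append("test not marked evaluable")
    # Percentages must be numeric and within [0, 100].
    pct = record["tumor_pct"]
    if not isinstance(pct, (int, float)) or not 0 <= pct <= 100:
        issues.append("percentage out of range")
    # Time-sensitive assays: check the sample was tested promptly after receipt.
    turnaround = (record["test_date"] - record["received_date"]).days
    if turnaround > 2:  # hypothetical 2-day limit
        issues.append("sample not tested promptly")
    return issues

record = {
    "evaluable": "Yes",
    "tumor_pct": 45.0,
    "received_date": date(2024, 3, 1),
    "test_date": date(2024, 3, 2),
}
print(qc_record(record))  # → []
```

A record that fails a check would come back with the matching issue strings instead of an empty list, which is the kind of per-record result a QC report can be built from.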

The problem: inconsistency at scale

And so, you might say this type of QC looks pretty straightforward. But when I joined the team four years ago, our QC was very time-consuming. This is because we had inconsistent data structures that required manual coding and review for each data set. We had another group of colleagues who didn't have any programmatic support; they were doing this all manually, by eye, and it took hours every month. That's not too bad when you're working on one data set, but when you have dozens of studies and dozens of data sets, it really adds up. It also means that our data was not handled consistently between trials, and it was really difficult to track what was done.

So, the issue we're facing is not that our data has issues in it; that's expected. It's that we have inconsistency between our different data sets, and that's really preventing us from scaling the process. And we need to scale the process in order to handle the volume of data sets we're working with.

Defining the types of consistency needed

So, when we're talking about consistency, what exactly are we talking about? What types of consistency do we need in order to scale and automate our process?

So, it turns out, actually, a lot of different areas. We need our data to be in the same location and to be accessible at all times. We need our data to be defined, so we can find the specific information we're looking for where we need to find it. And we need our data to be internally consistent, so that when we receive data files, they're not changing too much from one file to the next.

We also need to be consistent in terms of what we do to the data. So, if we are performing any transformations or processing on the data, it needs to be consistent between our studies. And then, if something was to change throughout the process, we need to be able to consistently track that so we can see what we're doing. And then, lastly, once we find issues, we need to generate consistent reports so that our downstream users can find what they need to find so they can act on the data.

The WrangleIt R package

So, when we started designing our system, we realized we really needed an adaptable solution that could handle the complexity in our data, but also one that would facilitate standardization and automation.

And so, I wanted to kind of compare two different processes or two different approaches that we could take. So, we could take a single process that's just one process that would be applied to all of our different datasets, or we could use a set of one-off processes that were kind of fit for purpose. And so, both of them have advantages. If we look at the single process, we can see it's going to be very standard and consistent and reproducible. And it's going to apply the same code exactly the same way to all our datasets. And that is really great. But we'll run into a few challenging edge cases that really strain the ability of the system to be able to handle all of the distinct datasets.

So, then we start kind of looking a little bit at the other end of the spectrum, where the flexible, adaptable, fit-for-purpose system really starts to look more appealing. And it's really kind of a balance between these that we realized we needed.

So, we worked on a solution: a custom R package that we wrote, called WrangleIt, to handle our data wrangling. It takes advantage of both of these approaches. It is a single unified process, but it integrates multiple bespoke components that allow us to bring study-level specifics into our process.

So, this slide is just kind of an overview of what our package does. So, it will do some light data transformations. It also pivots data if that's needed. It detects duplicates, and it also applies our QC rules. It saves the output files, and then it also logs what it's doing.

And there are a number of areas in which we enable flexibility within our uniform package. These are the three areas where we're transforming data, pivoting data, and detecting duplicates; we really need to know the specifics of the datasets in order to do those steps. And we have another step in which we're able to take in external user input. We have another group, a non-coding group, that's really instrumental in determining what qualifies as quality and what it means for data to pass along to analysis. This is where we can integrate that input.

Specification files

So, how do we do that? We use specification files to bring these study-level specifics into our common framework. These files are really just providing context: specifics of the dataset that we want to bring in. They could be CSVs, they could be TSVs, they could be YAML files; sometimes we use all three. They're intended both for humans and for machines to use. And they're really just more transparent than hard-coded solutions: it's a lot easier to find the information you're looking for in a CSV than in an R script that's maybe 600 lines of code. They're also really easy to edit. So, as things change throughout the study, which they do, we can go in and edit them, and then also track what we've changed.

And then, we work with a lot of documents that are kind of defining our structure that aren't necessarily consistent upstream. And so, by pulling them into our own specifications, we can apply our standard code to them and ensure that we have consistency.

So, we work with three different types of specification files. So, we have a set of files that define what's in the data itself. What does it mean? How is it organized? Et cetera. We have another set of files that defines the QC criteria. So, what do we need to check to make sure that we can pass the data along for analysis? And then, also another file that's related to automation.

So, even though our files really vary in terms of what information they contain, they have the same components in each file. Each file is going to have information related to the subject, or the patient on the clinical trial; the encounter, or visit, when they go in and see their provider; the samples, whether they provided a blood sample or a biopsy; the test instance, since maybe this was tested just once, or maybe it was tested again; and the results, or the measurements, that were actually assessed.

And so, we use a specification file called the data structure file to really make these associations between the common components and the data set specifics really explicit. So, in blue we can see we have the five different components laid out and then in green we have the data set specific values that we can really just slide into these components.
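As a toy illustration of that idea, here is what such a mapping could look like. This is a Python sketch, not the actual data structure file, and the vendor column names are made up:

```python
import csv
import io

# Hypothetical "data structure" specification: common component -> vendor column.
SPEC_CSV = """component,vendor_column
subject,SUBJ_ID
visit,VISIT_NAME
sample,SPECIMEN_BARCODE
test_instance,RUN_NUMBER
result,MUTATION_CALL
"""

def load_structure_spec(text):
    # Read the two-column spec into a component -> column lookup.
    return {row["component"]: row["vendor_column"]
            for row in csv.DictReader(io.StringIO(text))}

def standardize(row, spec):
    # Slide the vendor-specific columns into the five common components.
    return {component: row[column] for component, column in spec.items()}

spec = load_structure_spec(SPEC_CSV)
vendor_row = {"SUBJ_ID": "1001", "VISIT_NAME": "Cycle1",
              "SPECIMEN_BARCODE": "BX-07", "RUN_NUMBER": "1",
              "MUTATION_CALL": "KRAS G12C"}
print(standardize(vendor_row, spec))
```

Because the mapping lives in a small, editable table rather than in code, a new vendor format only means a new spec file, not a new script.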

Then I wanted to talk about another dataset specification file as an example of how we bring our study-level specifics into our common framework. This is a data format table, a table that's generated by non-coders. Even though it's fairly consistent, our data columns might change, sometimes information will be in the wrong cell, and sometimes it's just not correct. So, we need to pull this information out into our own specification file. We use a Python script to extract the information from the table, and then we also use human eyes to make sure it's correct. As we're going through this, we'll take the opportunity to see if there's anything that needs to be converted. For example, we might see we have a date format and make sure we're converting that into a date object. We do that through the use of this file, and it provides a really nice record of the things we've changed.

And then the last specification file I wanted to go through is our QC rules file. This file contains the ID, which is the name of the rule we're using; the expression, which is the actual R code that gets run on the data; and a human-readable description. It's really nice to have these three linked together. And we save this in GitHub, so we have a very good source record of what we're doing to the data.
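The shape of that rules file can be sketched as a small table of (ID, expression, description) rows evaluated against each record. In the actual package the expressions are R code; this Python stand-in, with hypothetical rule names and fields, just shows the structure:

```python
# Hypothetical QC rules table: rule ID, executable expression, human-readable
# description. The real rules are R expressions; Python expressions stand in here.
RULES = [
    ("pct_range", "0 <= tumor_pct <= 100",
     "Tumor percentage is between 0 and 100"),
    ("evaluable_yes", "evaluable == 'Yes'",
     "Assessor marked the test evaluable"),
]

def run_rules(record, rules):
    failed = []
    for rule_id, expression, description in rules:
        # Evaluate the rule expression against the record's fields.
        if not eval(expression, {}, record):
            failed.append((rule_id, description))
    return failed

record = {"tumor_pct": 120, "evaluable": "Yes"}
print(run_rules(record, RULES))
# → [('pct_range', 'Tumor percentage is between 0 and 100')]
```

Keeping the code and the plain-language description in the same row is what lets both programmers and non-coders see exactly what each rule checks.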

And then, to compare, the second table is what our output looks like. If you have something like this HER2 rule, shown in purple here, you can actually track: what is this rule doing? What is the actual code that got run? And what is the description? This is really nice for both our programmers and our non-coders, who can see exactly what it's doing.

And if we look at this through a lens of standardization and flexibility, we have both in our outputs as well. We have flexibility in that we're going to report all of the data, and the data is going to vary a little in terms of what's in each file. But we have common components as well: all of our outputs have a consistent location for the rules that have failed, and for any duplicates that were detected. And so, it's much easier for our downstream users to find the information they need.

The code and automation

And so, this is a coding conference, so I wanted to show you what some of our code looks like. This is the most basic script, and it's what you would need in order to initialize your repo, read in your data, apply your specification files, run all the functions that are going to generate the outputs, and then save the outputs. It's just 10 lines of code, designed to be really simple. It does get more complicated, but at the heart of it, this is what we're trying to do on each of our data sets. And specifically, I wanted to call out that it's intentional that we run pretty much all of our functions from one line of code, so we can really ensure that the code being run is the same on all the data sets. We really restrict the specifics to our specification files.
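The slide itself isn't reproduced in this transcript, but the single-entry-point idea can be sketched as follows. This is a hypothetical Python stand-in, not the package's R code: one driver function runs the same steps (dedupe, then rules) on every dataset, and anything dataset-specific comes only from the spec passed in:

```python
# Hypothetical sketch of the single-entry-point design: one call runs identical
# steps on every dataset; study specifics live only in the spec argument.
def run_all(rows, spec):
    seen, out = set(), []
    for row in rows:
        key = row[spec["id_column"]]  # duplicate detection by spec-defined key
        if key in seen:
            continue
        seen.add(key)
        failed = [rule_id for rule_id, check in spec["rules"] if not check(row)]
        out.append({**row, "failed_rules": failed})
    return out

spec = {
    "id_column": "sample_id",
    "rules": [("pct_range", lambda r: 0 <= r["pct"] <= 100)],
}
rows = [{"sample_id": "S1", "pct": 50},
        {"sample_id": "S1", "pct": 50},    # duplicate, dropped
        {"sample_id": "S2", "pct": 120}]   # fails the range rule
result = run_all(rows, spec)
print(len(result), result[1]["failed_rules"])  # → 2 ['pct_range']
```

Because every dataset goes through the same `run_all`-style call, a difference in output can only come from the data or the spec, never from ad hoc code.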

And we are automating this. And so, there's a few considerations we need to take into account when we're automating. So, all of our code and our specification files are stored in GitHub repositories. And that really allows us to run in the cloud on an automated schedule. And because we're running it on an automated schedule, we have no interactive input from users during our processing. And so, this means everything needs to be pre-specified. So, if you wanted to, say, maybe find a new file, a new instance of data that's come in, your script is going to have to be able to do that on its own without any input from you.
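For example, discovering new files without any interactive input can be done by comparing the current directory listing to a log of files already processed. This is a generic Python sketch of that pattern, not the team's implementation, and the file names are hypothetical:

```python
import json
import os
import tempfile

# Hypothetical "seen files" log: an automated run must discover new vendor
# files on its own, so it diffs the directory against what was already processed.
def find_new_files(data_dir, log_path):
    seen = set()
    if os.path.exists(log_path):
        with open(log_path) as f:
            seen = set(json.load(f))
    current = {name for name in os.listdir(data_dir) if name.endswith(".csv")}
    new = sorted(current - seen)
    with open(log_path, "w") as f:
        json.dump(sorted(current), f)  # record everything seen so far
    return new

data_dir = tempfile.mkdtemp()
log_path = os.path.join(data_dir, "processed.json")
open(os.path.join(data_dir, "transfer_01.csv"), "w").close()
print(find_new_files(data_dir, log_path))  # → ['transfer_01.csv']
print(find_new_files(data_dir, log_path))  # → []
```

The second call returns nothing because the file is already in the log, which is exactly the pre-specified behavior a scheduled, unattended run needs.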

Enabling system adoption

And so, that kind of leads us to system adoption. So, so far, we've been talking about kind of how do we build a system that's able to handle disparate data sources using the same code. But now, how do we get people to use it?

So, my first suggestion is to use code as much as possible. People are really bad at doing the same thing the same way over and over again. So, use code to do things like name your files, add timestamps, add git hashes, create your directories, and create repos from pre-existing templates, so you don't need to remember what goes into each file; you can just use a template. Also, log what your program did and when it performed those actions. That's going to get you most of the way to where you need to be in order to automate your system.

And then the flip side of that coin is to really limit how much of your system relies on human interaction. These parts add complexity and will inevitably break. For example, we have a downstream user group who likes to add comments to our files, and as we get new data files in, the system will take their comments and pull them over. That part will definitely break: people like to rename columns, add columns, delete columns, and rename files. They have done all of these, and every one of them broke our system. So, it does take a little bit of troubleshooting when you have humans in the loop.

And then another way in which we enhance system adoption is by making our users' lives easier. By delivering consistent reports, we're really allowing people to work independently and find the information they need for themselves. Also, if we anticipate what information they might want, we further allow them to work on their own, so they don't need to come back and ask us questions all the time. And if they're able to find information for themselves, we're really able to deliver data faster. When people get access to data faster, they're usually quite excited, and that really allows them to enthusiastically take up your system. So, if you're able to remove the bottlenecks, where maybe they have to realize they need a report, email you, you have to find a window to run something, and then you have to run it, it's really much faster to do it in the background in an automated way. And then you can deliver on the order of hours rather than months.

Closing reflections

And so, as I reach the end of the talk, I wanted to reflect again on how automation specifically enabled us to get to consistency. By thinking about automation, we had to really ensure that our data was stored in the same place and always consistently accessible. We also had to really think about how we wanted to bring our study-level specifics into our common framework. And once we had those in place, we were able to take the same code and apply it to our distinct data sets.

And so, we were able to start scaling our process. As we scaled our process, we got to clean data faster. As you get to clean data faster, you enhance user adoption; as more users use your system, more data sets get brought into the system; and then you have more consistency, which ultimately results in more valuable data.

So, I work in the pharma space, and this is really important because it allows us to make decisions on our clinical trials faster and get medicines to patients a little bit more quickly. But all of these principles are really applicable across industries. My hope is that you can take some of these ideas from this talk, bring them back, and apply them to your own work.

And if I have any time left, I guess I'll take some questions. Thank you.

Q&A

Okay, thank you, Kristin. Really interesting to see what you've built along with your team. So, we do have time for a few questions. Okay, great. The first one is from Marlene, and they ask: Kristin, how does this process deal with file versioning and data edits/changes? Is the entire file removed and reloaded?

So, if a file is changed? File versioning and data edits. Oh, yes. The process is designed to handle new files. We'll get one file, and it will generate an output from that file. When we get a new copy, it will redo the process, find issues in that file, and then save a new output. Typically, we're working with cumulative data, so we'll have the same data and then new data just gets added. But occasionally the data does change, and the output will detect that, too.

Okay, we have time for another question. So, this is from Anonymous. Why isn't the package open to the public? Are there plans to make a public version? Well, we're not sure. So, we just started sharing it externally with other groups in Pfizer, so we'll see where we go from there.

Okay, and one more question. Can other validation packages, like pointblank, be used for quality control? I think it's a different type of quality control. We want to ensure things are correct, but we don't actually want to over-QC. We want to find what are really the necessary parts that need to be correct for analysis, because we don't want to exclude samples. We're working with patient data, and so we don't want to say, oh, this patient provided the sample, but there's something minor wrong with it, so we're not going to use it. So, we really do want to keep our QC to what's exactly essential.

Okay, thank you so much. That's all the time we have for questions, but another round of applause for Kristin. Thank you.