Resources

Scale Your Data Validation Workflow With {pointblank} and Posit Connect - posit::conf(2023)

Presented by Michael Garcia

For the Data Services team at Medable, our number one priority is to ensure the data we collect and deliver to our clients is of the highest quality. The {pointblank} package, along with Posit Connect, modernizes how we tackle data validation within Data Services. In this talk, I will briefly summarize how we develop test code with {pointblank}, share with {pins}, execute with {rmarkdown}, and report findings with {blastula}. Finally, I will show how we aggregate data from test results across projects into a holistic view using {shiny}.

Presented at posit::conf(2023), Sept 19-20, 2023. Learn more at posit.co/conference.

Talk Track: Leave it to the robots: automating your work
Session Code: TALK-1058

image: thumbnail.jpg

Transcript

This transcript was generated automatically and may contain errors.

Thank you, everybody. I'm very happy to be here. My name is Mike Garcia, and I'm a Director of Data Services at Medable, and today I want to talk about how you can scale your data validation workflow with the pointblank R package and Posit Connect. Ten years ago, I was a data analyst for a very large utility company, and one of my main responsibilities was to create and maintain energy savings forecasts. And I have a confession to make: I really didn't know what I was doing, and I was never confident in the numbers I was producing, because there was no formal process for these forecasting tasks.

And when it came time to do our data validation, to check our work, it involved a group of us going into a conference room, throwing everything up on a big screen, and just opening up Excel workbooks, tracing through cells, and looking at where the formulas landed. It was not a fun process. Today, I can confidently say that this is not how my team operates at Medable. In fact, we have robust procedures for everything that we do, especially when it comes to data validation. And that's what I want to talk about today.

By building upon our foundation of data quality measures at Medable, expanding the functionality of the pointblank R package, and leveraging Posit Connect with its API, we were able to design and implement a fully end-to-end, scalable solution for our data validation needs. And so, for the next 15 minutes, I want to highlight four challenges that we came across and how we addressed them.

Background on Medable

Before we dive any deeper, I'd like to give a little bit of background about Medable. Medable is a technology startup with a state-of-the-art platform for conducting clinical trials. Our clients are some of the largest pharma companies and contract research organizations in the world, and we license our platform to these clients in the form of web and mobile apps that we build and deploy for them as a service. These apps are used by patients, caregivers, and doctors throughout their trial participation.

And some of the activities that they may do on these apps are completing pain assessments on a weekly basis, or entering diary data every day detailing their experiences with a new investigational drug. All of these activities with the apps generate a lot of data, and that data gets stored securely on the Medable platform. And it's the job of my team in Data Services at Medable to transfer that data from the platform to the hands of the client.

But that's not as straightforward as you would think, because data models vary between studies and projects. And that's primarily driven by three factors. First, protocols vary in complexity and design, so our apps sometimes need to as well in order to accommodate those needs. Second, trials can last for years, but our technology moves a lot faster than that, so if we want to implement bug fixes or introduce a new capability, sometimes we need to do that on a study-by-study basis. And lastly, clients have very strict expectations about what they want their data to look like and how they want to receive it.

Now, all this flexibility meant that our processes, especially for data validation, had to be flexible too, which in practice meant manual. And four years ago, that was okay; that worked well. But as the number of studies we supported grew exponentially, we started to notice data quality issues creeping into our deliverables. So late last year, we made a conscious effort to streamline and automate a lot of what we were doing related to data validation, and that's when we stumbled upon the pointblank package.

Introducing pointblank

pointblank allows you to validate your data using simple yet powerful validation functions. For example, if you wanted to test whether a set of values in a column is less than or equal to another value, or to a set of values in another column, there's a function for that. The way it works is you create what is known as an agent object. You pass it a single target table that you want to validate, plus some other information, like how you want to be alerted if a test case fails. You then pass this object to the validation functions, which act as instructions for what your agent should do when it's ready to validate. And when you are ready to validate, you call the interrogate function; in essence, you're interrogating the agent for the information that you want. The result is a really nice-looking, easy-to-digest HTML report.
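
As a rough sketch of that workflow, using pointblank's built-in small_table dataset (the columns and thresholds here are illustrative, not the ones from the talk):

```r
library(pointblank)

# Create an agent for a single target table; `actions` controls the
# failure thresholds at which you want to be alerted
agent <-
  create_agent(
    tbl = small_table,                 # example dataset shipped with pointblank
    label = "Example validation",
    actions = action_levels(warn_at = 0.1, stop_at = 0.25)
  ) |>
  # Validation functions act as instructions for the agent: e.g., test
  # that values in column `a` are <= the corresponding values in column `c`
  col_vals_lte(columns = vars(a), value = vars(c)) |>
  col_vals_not_null(columns = vars(date))

# Interrogate the agent to actually run the validations; printing the
# interrogated agent renders the HTML report
agent <- interrogate(agent)
agent
```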

On the left, you have a nice color-coding scheme that tells you what passed, what failed, or whether you hit another threshold that you set when creating your agent. Each row represents a validation function that you set, along with the columns you applied it to. And on the right, you have some descriptive statistics telling you how many test units were run for each function, how many passed and failed, and their proportions. At first glance, it was perfect; it was everything we needed. It integrated with the packages and technologies we already used, its API was intuitive, and its source code was easy to digest. However, once we actually started to implement it for our processes and data, we noticed four specific challenges that we had to address first.

Challenge 1: operating on multiple tables

So, the first challenge is that we wanted pointblank to operate on more than just a single target table. Here's an example of a data model for a very simple study that we may see. If we wanted to test each individual table, we would need to create individual agents and apply them to the tables. And that's at the core of pointblank; we didn't want to change that for our needs. Instead, we wanted to bundle all of these agents together into a collection so that we could operate on that collection.

And so, in typical R fashion, we wrapped everything into a list, applied some processing specific to our needs, and called it a test plan. Essentially, it's a list of agents with some extra goodies that I'll talk about in a little bit. If you're familiar with pointblank, this is very much like a multi-agent report, and we borrowed a lot from that; we're just extending it so we can perform operations on the multi-agent itself, instead of it being just a reporting function. In the background, this is a very simple concept, but we modified lots of pointblank and pins functions to do what we wanted with it. We also integrated it with the dm package, which we use for our data modeling needs.

So, when we want to create a test plan, it's very straightforward. You pass in your individual agents, and you also pass a test name, which we will reference later on. Some of the data that gets stored with the test plan are the name that we just passed in, as well as a test version that we get from pins and Connect. Now, when we're ready to validate our collection of agents, our test plan, we have to run interrogate on each individual agent. So, we basically map the interrogate function over all the agents, and we also require the user to explicitly pass in the data that they're going to validate. The result is what's known as a multi-agent report.
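
These helpers are internal to Medable, so the following is only a minimal, hypothetical sketch of the idea; the function names and the test_plan structure are assumptions, not the actual implementation (set_tbl() and interrogate() are real pointblank functions):

```r
library(pointblank)
library(purrr)

# Hypothetical sketch: a "test plan" is just a list of agents plus metadata
create_test_plan <- function(..., test_name) {
  structure(
    list(test_name = test_name, agents = list(...)),
    class = "test_plan"
  )
}

# Executing the plan maps interrogate() over every agent, with the data
# to validate passed in explicitly by the user
execute_test_plan <- function(plan, data_list) {
  plan$agents <- map2(
    plan$agents, data_list,
    function(agent, tbl) interrogate(set_tbl(agent, tbl))
  )
  plan
}
```

The interrogated agents could then be combined for display with pointblank's own create_multiagent() and get_multiagent_report().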

Challenge 2: sharing and reusability

The second challenge is that we wanted to enable sharing and reusability of the test plans our team creates. I showed a simple data model before, but in reality, they look more like this, and you can guess that if we're validating each table, the pointblank code to generate all of that can get pretty large and unwieldy quickly. So, one piece of feedback we heard from the team was: how can we reuse these test plans so that we're not reinventing the wheel every single time we start a new study or want to iterate on an existing test plan? Copying and pasting is an option, but it's not really a practical one in a production setting.

So, if we revisit our create test plan function, we could add some functionality to it. We're going to take each of the individual agents inside the test plan and extract the code that's needed to create that individual agent; it's kind of like we're reverse engineering that part. The reason is that we want to store that code on Connect, rather than the agent object itself. So, we take that code, convert it to JSON, and write it to Connect as a pin. We chose JSON because we want full transparency: anyone who's not a data scientist can go onto Connect and see what's actually going on with the code.

Now, with Posit Connect, if you have a lot of content on there, it can be a little hard to filter through it. You can set tags for easy filtering and navigating, so we tag this content as a test plan. Also, by default, when you publish content to Connect, it's only shared with yourself; you have to go into the UI and change that. So, we use the Posit Connect API inside the function itself to share access with our data team by default.
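
A hedged sketch of the publishing step (the pin name, metadata fields, and commit message are invented; pins::board_connect() and pin_write() are the real API). The tagging and default group sharing described above would go through the Connect Server API and are not shown here:

```r
library(pins)

# Connect-backed pin board (uses CONNECT_SERVER / CONNECT_API_KEY env vars)
board <- board_connect()

# The code reverse-engineered from the agents, stored as text
plan_code <- "create_agent(tbl = ~vitals, label = 'Vitals checks') ..."

pin_write(
  board,
  x = plan_code,
  name = "my-test-plan",
  type = "json",                      # JSON so non-R users can inspect it on Connect
  metadata = list(                    # user-defined metadata, Git-commit style
    modified_by = "mike.garcia",
    commit_msg  = "Tighten range checks on vitals table"
  )
)
```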

Here's a screenshot of what that pin looks like on Connect. In typical fashion, you have the time it was last updated, the format, which is JSON, and a streamlined description. Then you have the raw metadata provided by pins. With pins, you're also able to pass in user-defined metadata, so we hook into that and include the name of the person who last modified the test plan, as well as a commit message for that new version of the plan. We want this to act more like a Git-based system.

So, if we want to take this and bring it back into RStudio to iterate on it, or use it as a jumping-off point for the next project, we have a convenience function called View Test Plan, where all we do is supply the test name and, if not the latest version, the test version. What it does is open a new file pane in your IDE, dump all the code in there, and it's ready for you to modify as needed.
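
A minimal sketch of what such a helper might look like, assuming the plan's code is pinned as text (the view_test_plan() name mirrors the talk; the body is a guess, though pin_read() and rstudioapi::documentNew() are real functions):

```r
# Hypothetical helper: read the pinned code for a plan and open it
# in a new editor pane in RStudio
view_test_plan <- function(test_name, test_version = NULL) {
  board <- pins::board_connect()
  code  <- pins::pin_read(board, test_name, version = test_version)
  rstudioapi::documentNew(text = code, type = "r")
}
```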

Challenge 3: automating with Posit Connect

The third challenge was that we wanted to automate with Posit Connect, because we already use it for our study deployments. When you want to automate something with R, R Markdown is usually the answer, or Quarto, or flexdashboard. And there are a lot of ways to create that document, so we wanted to streamline that part. We created a function called Create Validation that takes your data and your test plan from Connect and feeds them into a template we've created internally. All you have to do is run this function, provide your data and your test name, and you get an R Markdown document that's ready to be knit.

Then, since we want to automate it, we have to deploy it to Connect. We could use the great push-button deployment option, but even then, there are options you need to set that we take for granted: Do you want to link your code? Do you want to force an update if something exists already? Do you have all the dependencies ready, those little checkboxes that show up? Again, we wanted to make this as easy as possible in a production setting. So, we created Deploy Validation, which essentially wraps the deployDoc function from rsconnect. And similar to creating a test plan, we tag it, we share it with our data team, and we set a default schedule using the API.
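
A sketch of what such a wrapper might look like; the deploy_validation() name and the chosen defaults are assumptions, while rsconnect::deployDoc() and its arguments are real:

```r
# Hypothetical wrapper: deploy a knit-ready validation document to
# Connect with opinionated defaults, so production deploys are one call
deploy_validation <- function(doc, app_name) {
  rsconnect::deployDoc(
    doc = doc,
    appName = app_name,
    forceUpdate = TRUE,       # overwrite existing content without prompting
    launch.browser = FALSE    # no interactive browser pop-up in automation
  )
  # ...then tag the content, share it with the data team, and set a
  # weekly run schedule via the Connect Server API (not shown)
}
```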

And what that looks like is a flexdashboard with two tabs. The first tab is Data and Sources. It includes all the data that's needed to execute this test plan, so that includes the target table and references, as well as any reference documentation that's needed; if you're familiar with pointblank, we use that for, say, preconditions. On the bottom, we have a URL to the Connect page for the pin, for the test plan being used here, again for traceability reasons. On the right, I'm just showing that it is scheduled by default to run once a week.

And here is that multi-agent report I showed earlier, again being shared with the data team. Now, of course, we want to know what's happening when we execute these tests. pointblank integrates with blastula, so we wanted to borrow that concept and apply it to a test plan. We're going to revisit our execute test plan function again, and all we're going to look for is whether any of the individual agents in that test plan failed. If any failed, it sends an email saying so, with a specific template. Even if nothing failed, we still send an email saying that the run was successful.
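
A hypothetical sketch of that notification step using blastula; the template paths, addresses, and the notify_results() name are invented, while all_passed(), render_email(), and smtp_send() are real pointblank/blastula functions:

```r
library(blastula)
library(pointblank)

# Hypothetical helper: after interrogation, email the team whether or
# not any agent in the plan failed, using a template for each outcome
notify_results <- function(plan) {
  n_failed <- sum(vapply(plan$agents,
                         function(a) !all_passed(a),
                         logical(1)))

  email <- if (n_failed > 0) {
    render_email("templates/failed.Rmd")    # assumed internal template
  } else {
    render_email("templates/passed.Rmd")
  }

  smtp_send(
    email,
    to      = "data-team@example.com",
    from    = "validation-bot@example.com",
    subject = sprintf("Test plan '%s': %d agent(s) failed",
                      plan$test_name, n_failed),
    credentials = creds_envvar(user = "validation-bot@example.com")
  )
}
```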

Here's an example of what that email looks like. The subject line has the number of cases that failed; this example shows an email for a failed test plan, with the time the test was run and a textualized version of that multi-agent report. Originally, we had the full multi-agent report coming through email, but the formatting was a little wonky, and we realized it wasn't that actionable for the team. So, this is what we have coming to our inboxes instead.

Challenge 4: monitoring at scale

And lastly, we wanted to enable monitoring of everything that's going on. You can imagine we have all of these test plans being executed automatically in our production setting, and we want to monitor the health of all of our studies and the test plans being executed. The point of this is that we want to drive change upstream, to the engineers and the study teams that are actually creating these apps. And in order to monitor it, we need to save our validation results first.

And so, again, let's revisit execute test plan. We have this pseudocode here, save validation results: when that runs, it takes your executed test plans and writes them as pins into our cloud storage buckets. From there, we extract summary data from these test plans into a small database, and that database powers a Shiny app we've created that helps us monitor the health of all these programs. We have some value boxes up top telling us how many tests were run and what our failure rate is (this is all fake data, by the way), as well as the worst offenders, the tests that are failing the most. We have a table of everything that was run, and also some historical trends of the failure rate, to see what's improving and what's not.
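
A rough sketch of what that saving-and-summarizing step might look like; the function and table names are invented, while get_agent_x_list() is pointblank's real introspection helper and dbAppendTable() is DBI's:

```r
# Hypothetical helper: pin the executed plan, then append per-plan
# summary rows to the small database that powers the monitoring app
save_validation_results <- function(plan, board, con) {
  pins::pin_write(board, plan, name = paste0(plan$test_name, "-results"))

  summary <- purrr::map_dfr(plan$agents, function(agent) {
    x <- pointblank::get_agent_x_list(agent)   # per-step interrogation data
    data.frame(
      test_name = plan$test_name,
      run_time  = Sys.time(),
      n_units   = sum(x$n),        # total test units run across steps
      n_failed  = sum(x$n_failed)  # total failing test units
    )
  })

  DBI::dbAppendTable(con, "validation_summary", summary)
}
```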

And so, in summary, by addressing these four challenges, we were able to create that end-to-end, scalable solution. But really, it wouldn't have been possible without all the work that came before us. From a Medable perspective, it was all those data quality measures that we used as the basis, as the foundation, for these test plans, in knowing what to validate. And in terms of our packages and technologies, everything here on the screen made this all possible. And that's it. Thank you.

Q&A

Thank you, Michael. So, what was like the biggest challenge that you all had in being able to set up all this infrastructure?

We already had the infrastructure, given that we had automated our data transfers. The challenge was, one, getting into the mindset of writing code to run these tests instead of doing them manually, which is what the whole team had done. But also, our data is pretty specific in terms of how it's stored and what the column names look like, things like that, things that you kind of take for granted. Originally, when we tried to run pointblank on it, we would get errors. So, it was basically going into the source code and picking apart what we needed to change. And that's what took most of the time. That was the biggest challenge.