Resources

Scale Your Data Validation Workflow With {pointblank} and Posit Connect - posit::conf(2023)

Presented by Michael Garcia

For the Data Services team at Medable, our number one priority is to ensure the data we collect and deliver to our clients is of the highest quality. The {pointblank} package, along with Posit Connect, modernizes how we tackle data validation within Data Services. In this talk, I will briefly summarize how we develop test code with {pointblank}, share with {pins}, execute with {rmarkdown}, and report findings with {blastula}. Finally, I will show how we aggregate data from test results across projects into a holistic view using {shiny}.

Presented at posit::conf(2023), Sept 19-20, 2023. Learn more at posit.co/conference.

Talk Track: Leave it to the robots: automating your work
Session Code: TALK-1058

image: thumbnail.jpg

Transcript

This transcript was generated automatically and may contain errors.

Thank you, everybody. I'm very happy to be here. My name is Mike Garcia, and I'm a Director of Data Services at Medable, and today I want to talk about how you can scale your data validation workflow with the pointblank R package and Posit Connect. Ten years ago, I was a data analyst for a very large utility company, and one of my main responsibilities was to create and maintain energy savings forecasts. And I have a confession to make: I really didn't know what I was doing, and I was never confident in the numbers I was producing, because there was no formal process for these forecasting tasks.

And when it came time to do our data validation, to check our work, it involved a group of us going into a conference room, throwing everything up on a big screen, and just opening up Excel workbooks, tracing through cells, and looking at where the formulas landed. It was not a fun process. Today, I can confidently say that this is not how my team operates at Medable. In fact, we have robust procedures for everything that we do, especially when it comes to data validation. And that's what I want to talk about today.

By building upon our foundation of data quality measures at Medable, expanding the functionality of the pointblank R package, and leveraging Posit Connect with its API, we were able to design and implement a fully end-to-end, scalable solution for our data validation needs. And so, for the next 15 minutes, I want to highlight four challenges that we came across and how we addressed them.

Background on Medable

Before we dive any deeper, I'd like to give a little bit of background about Medable. Medable is a technology startup with a state-of-the-art platform for conducting clinical trials. Our clients are some of the largest pharma companies and contract research organizations in the world, and we license our platform to these clients in the form of web and mobile apps that we build and deploy for them as a service. These apps are used by patients, caregivers, and doctors throughout their trial participation.

And some of the activities that they may do on these apps are completing pain assessments on a weekly basis, or entering diary data every day detailing their experiences with a new investigational drug. All of these activities with the apps generate a lot of data, and that data gets stored securely on the Medable platform. And it's the job of my team in Data Services at Medable to transfer that data from the platform to the hands of the client.

But that's not as straightforward as you would think, because data models vary between studies and projects. And that's primarily driven by three factors. First, protocols vary in complexity and design, so our apps sometimes need to as well in order to accommodate those needs. Second, trials can last for years, but our technology moves a lot faster than that, so if we want to implement bug fixes or introduce a new capability, sometimes we need to do that on a study-by-study basis. And lastly, clients have very strict expectations about what they want their data to look like and how they want to receive it.

Now, all this flexibility meant that our processes, especially for data validation, had to be flexible too, which in practice meant manual. And four years ago, that was okay; that worked well. But as the number of studies we supported grew exponentially, we started to notice data quality issues creeping into our deliverables. So late last year, we made a conscious effort to streamline and automate a lot of what we were doing related to data validation, and that's when we stumbled upon the pointblank package.

Introducing pointblank

pointblank allows you to validate your data using simple yet powerful validation functions. For example, if you wanted to test whether a set of values in a column is less than or equal to another value, or to a set of values in another column, there's a function for that. The way it works is you create what is known as an agent object. You pass it a single target table that you want to validate, plus some other information, like how you want to be alerted if a test case fails. You then pass this object to the validation functions, which act as instructions for what your agent should do when it's ready to validate. And when you are ready to validate, you call the interrogate function; in essence, you're interrogating the agent for the information that you want. The result is a really nice-looking, easy-to-digest HTML report.
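
As a rough sketch of that workflow, using pointblank's built-in small_table dataset (the columns and thresholds here are illustrative, not the ones from the talk):

```r
library(pointblank)

# Create an agent for a single target table; `actions` controls the
# failure thresholds at which you want to be alerted
agent <-
  create_agent(
    tbl = small_table,                 # example dataset shipped with pointblank
    label = "Example validation",
    actions = action_levels(warn_at = 0.1, stop_at = 0.25)
  ) |>
  # Validation functions act as instructions for the agent: e.g., test
  # that values in column `a` are <= the corresponding values in column `c`
  col_vals_lte(columns = vars(a), value = vars(c)) |>
  col_vals_not_null(columns = vars(date))

# Interrogate the agent to actually run the validations; printing the
# interrogated agent renders the HTML report
agent <- interrogate(agent)
agent
```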

On the left, you have a nice color-coding scheme that tells you what passed, what failed, or whether you hit another threshold that you set when creating your agent. Each row represents a validation function that you set, along with the columns you applied it to. And on the right, you have some descriptive statistics telling you how many test units were run for each function, how many passed and failed, and their proportions. At first glance, it was perfect; it was everything we needed. It integrated with the packages and technologies we already used, its API was intuitive, and its source code was easy to digest. However, once we actually started to implement it for our processes and data, we noticed four specific challenges that we had to address first.

Challenge 1: operating on multiple tables

So, the first challenge is that we wanted pointblank to operate on more than just a single target table. Here's an example of a data model for a very simple study that we may see. If we wanted to test each individual table, we would need to create individual agents and apply them to the tables. And that's at the core of pointblank; we didn't want to change that for our needs. Instead, we wanted to bundle all of these agents together into a collection so that we could operate on that collection.

And so, in typical R fashion, we wrapped everything into a list, applied some processing specific to our needs, and called it a test plan. Essentially, it's a list of agents with some extra goodies that I'll talk about in a little bit. If you're familiar with pointblank, this is very much like a multi-agent report, and we borrowed a lot from that; we're just extending it so we can perform operations on the multi-agent itself, instead of it being just a reporting function. In the background, this is a very simple concept, but we modified lots of pointblank and pins functions to do what we wanted with it. We also integrated it with the dm package, which we use for our data modeling needs.

So, when we want to create a test plan, it's very straightforward. You pass in your individual agents, and you also pass a test name, which we will reference later on. Some of the data that gets stored with the test plan are the name that we just passed in, as well as a test version that we get from pins and Connect. Now, when we're ready to validate our collection of agents, our test plan, we have to run interrogate on each individual agent. So, we basically map the interrogate function over all the agents, and we also require the user to explicitly pass in the data that they're going to validate. The result is what's known as a multi-agent report.
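
These helpers are internal to Medable, so the following is only a minimal, hypothetical sketch of the idea; the function names and the test_plan structure are assumptions, not the actual implementation (set_tbl() and interrogate() are real pointblank functions):

```r
library(pointblank)
library(purrr)

# Hypothetical sketch: a "test plan" is just a list of agents plus metadata
create_test_plan <- function(..., test_name) {
  structure(
    list(test_name = test_name, agents = list(...)),
    class = "test_plan"
  )
}

# Executing the plan maps interrogate() over every agent, with the data
# to validate passed in explicitly by the user
execute_test_plan <- function(plan, data_list) {
  plan$agents <- map2(
    plan$agents, data_list,
    function(agent, tbl) interrogate(set_tbl(agent, tbl))
  )
  plan
}
```

The interrogated agents could then be combined for display with pointblank's own create_multiagent() and get_multiagent_report().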

Challenge 2: sharing and reusability

The second challenge is that we wanted to enable sharing and reusability of the test plans our team creates. I showed a simple data model before, but in reality, they look more like this, and you can guess that if we're validating each table, the pointblank code to generate all of that can get pretty large and unwieldy quickly. So, one piece of feedback we heard from the team was: how can we reuse these test plans so that we're not reinventing the wheel every single time we start a new study or want to iterate on an existing test plan? Copying and pasting is an option, but it's not really a practical one in a production setting.

So, if we revisit our create test plan function, we could add some functionality to it. We're going to take each of the individual agents inside the test plan and extract the code that's needed to create that individual agent; it's kind of like we're reverse engineering that part. The reason is that we want to store that code on Connect, rather than the agent object itself. So, we take that code, convert it to JSON, and write it to Connect as a pin. We chose JSON because we want full transparency: anyone who's not a data scientist can go onto Connect and see what's actually going on with the code.

Now, with Posit Connect, if you have a lot of content on there, it can be a little hard to filter through it. You can set tags for easy filtering and navigating, so we tag this content as a test plan. Also, by default, when you publish content to Connect, it's only shared with yourself; you have to go into the UI and change that. So, we use the Posit Connect API inside the function itself to share access with our data team by default.
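
A hedged sketch of the publishing step (the pin name, metadata fields, and commit message are invented; pins::board_connect() and pin_write() are the real API). The tagging and default group sharing described above would go through the Connect Server API and are not shown here:

```r
library(pins)

# Connect-backed pin board (uses CONNECT_SERVER / CONNECT_API_KEY env vars)
board <- board_connect()

# The code reverse-engineered from the agents, stored as text
plan_code <- "create_agent(tbl = ~vitals, label = 'Vitals checks') ..."

pin_write(
  board,
  x = plan_code,
  name = "my-test-plan",
  type = "json",                      # JSON so non-R users can inspect it on Connect
  metadata = list(                    # user-defined metadata, Git-commit style
    modified_by = "mike.garcia",
    commit_msg  = "Tighten range checks on vitals table"
  )
)
```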

Here's a screenshot of what that pin looks like on Connect. In typical fashion, you have the time it was last updated, the format, which is JSON, and a streamlined description. Then you have the raw metadata provided by pins. With pins, you're also able to pass in user-defined metadata, so we hook into that and include the name of the person who last modified the test plan, as well as a commit message for that new version of the plan. We want this to act more like a Git-based system.

So, if we want to take this and bring it back into RStudio to iterate on it, or use it as a jumping-off point for the next project, we have a convenience function called View Test Plan, where all we do is supply the test name and, if not the latest version, the test version. What it does is open a new file pane in your IDE, dump all the code in there, and it's ready for you to modify as needed.
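
A minimal sketch of what such a helper might look like, assuming the plan's code is pinned as text (the view_test_plan() name mirrors the talk; the body is a guess, though pin_read() and rstudioapi::documentNew() are real functions):

```r
# Hypothetical helper: read the pinned code for a plan and open it
# in a new editor pane in RStudio
view_test_plan <- function(test_name, test_version = NULL) {
  board <- pins::board_connect()
  code  <- pins::pin_read(board, test_name, version = test_version)
  rstudioapi::documentNew(text = code, type = "r")
}
```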

Challenge 3: automating with Posit Connect

The third challenge was that we wanted to automate with Posit Connect, because we already use it for our study deployments. When you want to automate something with R, R Markdown is usually the answer, or Quarto, or flexdashboard. And there are a lot of ways to create that document, so we wanted to streamline that part. We created a function called Create Validation that takes your data and your test plan from Connect and feeds them into a template we've created internally. All you have to do is run this function, provide your data and your test name, and you get an R Markdown document that's ready to be knit.

Then, since we want to automate it, we have to deploy it to Connect. We could use the great push-button deployment option, but even then, there are options you need to set that we take for granted: Do you want to link your code? Do you want to force an update if something exists already? Do you have all the dependencies ready, those little checkboxes that show up? Again, we wanted to make this as easy as possible in a production setting. So, we created Deploy Validation, which essentially wraps the deployDoc function from rsconnect. And similar to creating a test plan, we tag it, we share it with our data team, and we set a default schedule using the API.
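
A sketch of what such a wrapper might look like; the deploy_validation() name and the chosen defaults are assumptions, while rsconnect::deployDoc() and its arguments are real:

```r
# Hypothetical wrapper: deploy a knit-ready validation document to
# Connect with opinionated defaults, so production deploys are one call
deploy_validation <- function(doc, app_name) {
  rsconnect::deployDoc(
    doc = doc,
    appName = app_name,
    forceUpdate = TRUE,       # overwrite existing content without prompting
    launch.browser = FALSE    # no interactive browser pop-up in automation
  )
  # ...then tag the content, share it with the data team, and set a
  # weekly run schedule via the Connect Server API (not shown)
}
```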

And what that looks like is a flexdashboard with two tabs. The first tab is Data and Sources. It includes all the data that's needed to execute this test plan, so that includes the target table and references, as well as any reference documentation that's needed; if you're familiar with pointblank, we use that for, say, preconditions. On the bottom, we have a URL to the Connect page for the pin, for the test plan being used here, again for traceability reasons. On the right, I'm just showing that it is scheduled by default to run once a week.

And here is that multi-agent report I showed earlier, again being shared with the data team. Now, of course, we want to know what's happening when we execute these tests. pointblank integrates with blastula, so we wanted to borrow that concept and apply it to a test plan. We're going to revisit our execute test plan function again, and all we're going to look for is whether any of the individual agents in that test plan failed. If any failed, it sends an email saying so, with a specific template. Even if nothing failed, we still send an email saying that the run was successful.
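
A hypothetical sketch of that notification step using blastula; the template paths, addresses, and the notify_results() name are invented, while all_passed(), render_email(), and smtp_send() are real pointblank/blastula functions:

```r
library(blastula)
library(pointblank)

# Hypothetical helper: after interrogation, email the team whether or
# not any agent in the plan failed, using a template for each outcome
notify_results <- function(plan) {
  n_failed <- sum(vapply(plan$agents,
                         function(a) !all_passed(a),
                         logical(1)))

  email <- if (n_failed > 0) {
    render_email("templates/failed.Rmd")    # assumed internal template
  } else {
    render_email("templates/passed.Rmd")
  }

  smtp_send(
    email,
    to      = "data-team@example.com",
    from    = "validation-bot@example.com",
    subject = sprintf("Test plan '%s': %d agent(s) failed",
                      plan$test_name, n_failed),
    credentials = creds_envvar(user = "validation-bot@example.com")
  )
}
```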

Here's an example of what that email looks like. The subject line has the number of cases that failed; this example shows an email for a failed test plan, with the time the test was run and a textualized version of that multi-agent report. Originally, we had the full multi-agent report coming through email, but the formatting was a little wonky, and we realized it wasn't that actionable for the team. So, this is what we have coming to our inboxes instead.

Challenge 4: monitoring at scale

And lastly, we wanted to enable monitoring of everything that's going on. You can imagine we have all of these test plans being executed automatically in our production setting, and we want to monitor the health of all of our studies and the test plans being executed. The point of this is that we want to drive change upstream, to the engineers and the study teams that are actually creating these apps. And in order to monitor it, we need to save our validation results first.

And so, again, let's revisit execute test plan. We have this pseudocode here, save validation results: when that runs, it takes your executed test plans and writes them as pins into our cloud storage buckets. From there, we extract summary data from these test plans into a small database, and that database powers a Shiny app we've created that helps us monitor the health of all these programs. We have some value boxes up top telling us how many tests were run and what our failure rate is (this is all fake data, by the way), as well as the worst offenders, the tests that are failing the most. We have a table of everything that was run, and also some historical trends of the failure rate, to see what's improving and what's not.
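
A rough sketch of what that saving-and-summarizing step might look like; the function and table names are invented, while get_agent_x_list() is pointblank's real introspection helper and dbAppendTable() is DBI's:

```r
# Hypothetical helper: pin the executed plan, then append per-plan
# summary rows to the small database that powers the monitoring app
save_validation_results <- function(plan, board, con) {
  pins::pin_write(board, plan, name = paste0(plan$test_name, "-results"))

  summary <- purrr::map_dfr(plan$agents, function(agent) {
    x <- pointblank::get_agent_x_list(agent)   # per-step interrogation data
    data.frame(
      test_name = plan$test_name,
      run_time  = Sys.time(),
      n_units   = sum(x$n),        # total test units run across steps
      n_failed  = sum(x$n_failed)  # total failing test units
    )
  })

  DBI::dbAppendTable(con, "validation_summary", summary)
}
```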

And so, in summary, by addressing these four challenges, we were able to create that end-to-end, scalable solution. But really, it wouldn't have been possible without all the work that came before us. From a Medable perspective, it was all those data quality measures that we used as the basis, as the foundation, for these test plans, in knowing what to validate. And in terms of our packages and technologies, everything here on the screen made this all possible. And that's it. Thank you.

Q&A

Thank you, Michael. So, what was like the biggest challenge that you all had in being able to set up all this infrastructure?

We already had the infrastructure, given that we had automated our data transfers. The challenge was, one, getting into the mindset of writing code to run these tests instead of doing them manually, which is what the whole team had done. But also, our data is pretty specific in terms of how it's stored and what the column names look like, things like that, things that you kind of take for granted. Originally, when we tried to run pointblank on it, we would get errors. So, it was basically going into the source code and picking apart what we needed to change. And that's what took most of the time. That was the biggest challenge.