Resources

Zero Tolerance for Dirty Data: A Pointblank Prescription for Data Hygiene (Arnav Patel, Synovus)

Speaker(s): Arnav Patel

Abstract: Data validation is an exercise that is commonly forgotten. It can be difficult to start the exercise, let alone figure out a holistic approach to cleaning and understanding data. This is a problem that affects us all regardless of experience, expertise, or complexity of work. With pointblank's suite of validation functions, you can scan your data, present that scan in a visually appealing and unique report, and then define validation rules at scale in a methodical manner. I want to show why pointblank, above other validation packages, is the best validation approach through its simple yet appealing reporting features and comprehensive validation schema.

posit::conf(2025)


Transcript

This transcript was generated automatically and may contain errors.

All right, well, hello everyone. Good afternoon, my name is Arnav Patel and today I'll be telling you how you can cook with your data, building a recipe for data validation success with pointblank.

So I wanna clarify first that I'm not a creator or contributor to the package, just a very humble fan who wants to reframe a perspective on how to use pointblank today. And also, if anyone from my company is watching, the opinions today are my own, they don't reflect company policy.

A little bit about me first. I am a quantitative analyst at Synovus Bank. The majority of my work revolves around data automation and risk modeling. However, in my spare time, I do like to cook and read books about food science. I'm better at one of those than the other. You can take a guess at which one.

But I learned to cook from my grandmother. She's the one who got me into it. She didn't follow any recipes and she seasoned with her heart in true grandmother fashion. When I tried to be like her and do it that way, I very quickly realized I just didn't have her knack in the kitchen.

And I very quickly realized that I needed some structure or a recipe to succeed in the kitchen. And I experienced very similar feelings when I would try to develop a new data validation framework at my job.

And for me, and probably like many others, good cooking is like good data validation. It requires a recipe to succeed. But bad cooking is like typical data validation. It can be unstructured, fragmented, and reactive.

Just like using unlabeled ingredients can lead to disastrous cooking, not documenting your work can lead to disastrous data validation approaches.

Typical data validation is often built on a variety of different libraries and approaches, which can be fragmented and make things hard to follow for other users.

We've all been in situations where you deliver results and you get that dreaded email that says, hey, something just looks a little bit off. Can you check and see what's going on? You realize you made a mistake after the fact. Maybe someone's name was misspelled. Maybe you left out a filter that you should have implemented in the first place. Whatever the reason, implementing these scattered checks is reactive, and it's a waste of time and effort.

So what we need is a better data validation approach. One that applies structure, leverages a cohesive toolkit, and proactively finds and prevents data quality issues. And after trying many different things, I landed on pointblank as that approach. And I'm gonna present a pointblank recipe that helped take my usual unstructured cooking of data to a documented set of steps that serves as a foundation for new dishes.

And this is what the pointblank recipe looks like today. We'll be walking through each of those directions and what they look like in practice.

Scanning the data

So first, let's start by loading the packages and the data we need. Now, I'm a diehard Atlanta Hawks fan. The pullover's not an accident. And I really wanted to use data from them today, but there's no representation for a mediocre NBA franchise, so I had to settle for using Lakers data.

So let's just say you've been given a request, right? We wanna see who the top 10 highest scorers in a single game are. Think of this result as your final dish, the thing that you want on the plate at the end of the day. So how can you begin to understand the data that you're working with, beyond taking a simple glimpse or a summary view of it? We can do what pointblank calls scanning the data.

Right, scanning the data breaks down the input table and returns an interactive report. When you scan your data, you get to choose which sections of the report you wanna see. So here, I wanted to inspect an overview, variable specifics, a frequency of missing values by field, as well as a sample of the data. And we can see what that HTML scan looks like here.

We get our table overview, a breakdown of each of the columns, right? We can see here there's opposing team identifiers, game type, specific play type, the period of the play that's occurring, points, coordinates even, and we have a frequency for missing values by column, as well as a quick sample. Everything we asked to see in that scan_data() call, we're able to see here.

But how does scanning actually improve upon typical EDA? For one, scanning, I think, has a high knowledge-to-effort ratio. The information that you get is very rich for very little effort. You can see just how quickly we're able to do that, and it's even quicker to produce. If you're using Positron, you probably know that the Data Explorer can do some of this work, but I think where pointblank is a little bit superior is that you're able to assign the scan to a variable.
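To make the step concrete, here's a minimal sketch of what such a scan might look like, assuming the lakers play-by-play dataset that ships with the lubridate package. The section letters follow scan_data()'s convention: O for overview, V for variables, M for missing values, S for sample.

```r
library(pointblank)
library(lubridate)  # ships the `lakers` play-by-play dataset

# Request only the sections discussed here: overview (O), variables (V),
# missing values (M), and a data sample (S)
tbl_scan <- scan_data(tbl = lakers, sections = "OVMS")

tbl_scan  # prints the interactive HTML report; the object can be saved or reused
```

Assigning the result to a variable is the point: the scan becomes an object you can store, re-render, or embed elsewhere.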

Chopping and cooking: expectation functions and the agent

Okay, so now that we have our ingredients identified, let's move on to how we can chop and cook them. Right, you can do that one of two ways, either by operating on the data object directly, or through the use of an agent.

Right, to operate on the data object directly, we first need to look at a typical data validation component, which is the dplyr filter chain.

Now here we're incorporating a set of filtering criteria for our Lakers 2008 object. We've all done this before, you've seen it a million times, so I'm not gonna teach you what these mean or how to do it, but I wanna talk about why it's important in this context. Filtering won't stop data processing if criteria aren't met. What it'll do is just remove rows silently. It'll discard them without any error messaging about what you're discarding, and those things could be useful.
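As a reference point, a typical filter chain in the spirit of the one shown might look like the following sketch. The lakers_2008 and all_opponents objects and the exact criteria are assumptions for illustration, not the talk's actual code.

```r
library(dplyr)

# Rows failing any condition are dropped silently: there is no record
# of what was removed or why
lakers_2008_filtered <- lakers_2008 %>%
  filter(
    !is.na(points),
    points >= 0,
    opponent %in% all_opponents
  )
```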

So it's kind of like throwing out that bag of salad that you kept in the fridge and you swore you were gonna eat this week, just because a couple of those leaves are bruised.

But we can use pointblank to address the lack of communication that comes from typical filtering. We can apply expectation functions from pointblank, which are the functions that use the col_vals prefix here, and they work very similar to your typical filtering functions, but with a slight wrinkle, a positive one. We can still test our original criteria by default, but pointblank is gonna default to stopping your evaluation once a criterion fails. So you can see the error message that is produced after the col_vals_gte() expectation was evaluated. And if we take a closer look, we can see that processing was halted because there are 400 records, or test units, that have a point value less than zero.

And since these functions are intuitively named and they work similar to our typical filtering, it's very easy for you to upgrade your code. If you're someone who feels like they don't know how to work with pointblank, or you have all these scripts that already have all this filtering code and you wanna convert them, you can see just how easy it would be, right? The names are intuitive, it's easy to upgrade, and you don't have to spend a ton of effort to incorporate pointblank in this manner. And getting an error message for failing expectations means that rows aren't silently removed; it gives us an opportunity to proactively identify data quality issues.
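Upgrading a filter chain like the one above to expectation functions could look like this sketch (object names assumed for illustration). Instead of silently dropping rows, evaluation halts with an error at the first failing expectation:

```r
library(pointblank)

# The same criteria as a dplyr filter chain, expressed as pointblank
# expectations applied directly to the data object
lakers_2008 %>%
  col_vals_not_null(columns = points) %>%
  col_vals_gte(columns = points, value = 0) %>%  # halts here if negative points exist
  col_vals_in_set(columns = opponent, set = all_opponents)
```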

All right, now think of the agent. This is the second part of chopping and cooking our data. Think of the agent like a sous chef. Just like in cooking, in validation you need someone to assist you and track your steps as you go.

All right, so here we're gonna create an agent and we're gonna supply it with a data table and our labels. And then we're gonna also add a feature that's called action levels. These action levels are gonna flag a rule when a particular condition is met. In this case, the step will be flagged with a stop condition if there's at least one test unit, or record, that fails to meet a particular criterion. We can then supply the meat of the sandwich here, which are all your col_vals expectation functions. And then what we can do is what's called interrogation. What interrogation will do is apply all your expectation functions to the data object, as well as print a very pretty report.
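A minimal sketch of that agent setup, with the data object and the valid_play_types set assumed for illustration:

```r
library(pointblank)

# Flag a step with a "stop" condition as soon as one test unit fails
al <- action_levels(stop_at = 1)

agent <- create_agent(
  tbl = lakers_2008,
  label = "Lakers 2008 play-by-play validation",
  actions = al
) %>%
  col_vals_not_null(columns = points) %>%
  col_vals_in_set(columns = etype, set = valid_play_types) %>%
  col_vals_gte(columns = points, value = 0) %>%
  interrogate()

agent  # printing the interrogated agent renders the validation report
```

The create_agent()/col_vals.../interrogate() "sandwich" is the pattern the rest of the talk builds on.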

And this is what that looks like. Right, in this report, you can see how many records have passed or failed each step, or what columns are being evaluated, and if a warning, stop, or notification threshold is met. Take a deeper look. In this case, we can see here in step three, as we observed before, there's 400 records that have point values less than zero. And there's a red circle here that indicates a stop condition was met, meaning that if you were to apply this particular criteria directly to a data object, as we saw before, you would result in a stop of the evaluation, right?

What's also very cool is that the report produces an interactive link to download a CSV. This just allows you to save it to disk, not in memory. So it's a very important distinction we're gonna see later. But what we're doing here, right, is building onto applying expectations directly onto the data object by adding a documentation layer to that process.

Collecting and extracting results

All right, so now that we've chopped and cooked our data, we need to decide how we wanna plate our dish. And we can do that by a combination of what's called collecting and extracting. Our final dish is our results; let's not forget that. Keep that context in mind as we move forward here. We still need to figure out our end result. And to do that, we first need to figure out if the failing tests are full of bad or just bruised ingredients.

To answer that question, remember from the report that there were 400 records with point values less than zero. Using the get_data_extracts() function and specifying the exact step will return a table of those records. Here it's step three, indicated by the i = 3 argument. And since we know we're looking at points, we can go ahead and dig into the distribution of negative point values present, confirming that we indeed see 400 records with negative three points. So they're all the same.
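In code, that extraction might look like the following sketch, assuming the failing points check is step three of your agent:

```r
# Pull the failing rows from validation step 3 (the points check)
bad_points <- get_data_extracts(agent, i = 3)

# Inspect the distribution of the offending point values
dplyr::count(bad_points, points)
```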

And what does that mean? Right, we now have to ask ourselves that question. Is this truly bad data? Or is it just something that we can fix and retain and ultimately get our final result, right? So let's investigate.

All right, and it turns out that there was a data entry issue, and those negative three point values were just incorrectly entered three pointers. So we fixed the issue by creating a new object, the Lakers 2008 fixed. We supply that back to our agent and run the agent interrogation sandwich yet again with the updated object, and now we get this.

All right, you can see here we have all passing steps. All right, and now that we've addressed all the data validation issues and we know we have the data, like now it's about collecting the data that you want to use to move forward. All right, so it's about getting what we're gonna call the good data and finish processing.

All right, to get our good results, we can do what's called sundering the data. Think of sundering as separating the good data from the bad data. All right, in this instance, we're gonna get data that has passed all of our validation steps, and we can see what those records look like.
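A sketch of the fix-and-sunder sequence, with the correction rule assumed from the talk's description (the -3 values were mis-entered three-pointers):

```r
library(pointblank)

# Assumed fix: recode the mis-entered three-pointers
lakers_2008_fixed <- dplyr::mutate(
  lakers_2008,
  points = replace(points, points == -3, 3)
)

# Re-interrogate against the corrected table, then keep only the rows
# that passed every row-based validation step
agent_fixed <- create_agent(
  tbl = lakers_2008_fixed,
  actions = action_levels(stop_at = 1)
) %>%
  col_vals_gte(columns = points, value = 0) %>%
  interrogate()

good_data <- get_sundered_data(agent_fixed, type = "pass")
```

get_sundered_data() with type = "pass" is the "good data" half of the sunder; type = "fail" would return the other half.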

Okay, we can now use that set of good data to plate our final dish. This is just processing code to get to the top 10 highest single game scores, and this is what that looks like. All right, we have our list. This is our end result. Send this to your boss, your friends, whoever you want.
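The plating step itself wasn't shown in detail, but assuming the column names of the lakers dataset (date, player, points), it might look roughly like:

```r
library(dplyr)

# Total points per player per game, then take the ten highest
good_data %>%
  group_by(date, player) %>%
  summarize(game_points = sum(points), .groups = "drop") %>%
  arrange(desc(game_points)) %>%
  slice_head(n = 10)
```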

And so how do these results actually differ if we were to use the filtering chain that we looked at before? That's important. The filter process removed rows with negative point values and produced results that, while they looked appropriate, were ultimately incorrect. But because pointblank informed us of an error and allowed us to make changes proactively, we can see the discrepancies in results here between the two approaches.

And this comparison relates back to the example of reactive scattered checks, right? With a normal filtering process, you would have had to go back, figure out what went wrong, and implement those filters, while pointblank got you to the correct result in one shot. It saves you a ton of time and effort. It's easy to incorporate, and I think that's the benefit of using pointblank in this example.

Writing and sharing the recipe with YAML

Right, now that our dish is plated, we can share it with others. We're gonna do that by writing it out to YAML. If you look at pointblank, there's a ton of different ways to do this; I wish I could talk about all of them. I think YAML is the most informative, the richest, and the best option in the context of this recipe. Because you've gone through all this work, it would be a shame not to share it with the people you work with or anyone that finds this interesting.

Right, so let's take a look at writing a YAML file out. A prerequisite, though, to writing the agent out to YAML is that you have to make sure you supply a leading tilde to your table argument when you're building your agent out. It's very important; otherwise it won't work. With that condition in mind, we can then use the yaml_write() function. What it will do is create a YAML file that documents the specs for the agent and the validation steps. For example, you can start to see all the potential values for a valid play type here: hook, layup, pullup, jump, and a bunch of others.
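A sketch of that prerequisite and the write-out, with the filename assumed for illustration:

```r
library(pointblank)

# The leading tilde turns the table argument into a formula that is
# evaluated on demand, which is what allows the agent to be written to YAML
agent <- create_agent(
  tbl = ~ lakers_2008,
  tbl_name = "lakers_2008",
  label = "Lakers 2008 play-by-play validation"
) %>%
  col_vals_gte(columns = points, value = 0) %>%
  interrogate()

yaml_write(agent, filename = "lakers-2008-validation.yml")
```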

I mean, you may be wondering what benefit we get from this, right? We already have a report that documents all of our steps. Well, you can use this YAML file to recreate the agent or report at any time. It's especially helpful in an automated chain of processing; if you're using something like a targets pipeline, you can incorporate it to spit out a report and recreate it without having the data loaded in memory. You can also use the YAML configuration to port the agent's build to new data. We're gonna touch on that in a little bit.

Right, so to completely recreate the agent, say you wanted to recreate the validation report for the 2008 data, you can use the yaml_agent_interrogate() function. And you can see that you get our report, confirmed by all the passing tests here.
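Assuming the YAML file written earlier, the recreation is a one-liner:

```r
library(pointblank)

# Rebuild the agent from its YAML spec and run the interrogation in one call
agent_2008 <- yaml_agent_interrogate(filename = "lakers-2008-validation.yml")

agent_2008  # prints the regenerated validation report
```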

Right, so why gamble when you can YAML? So maybe you work with a team that uses Python or you know, you have a team with multiple different languages. And since the YAML format is language agnostic, you can share your validation recipe across different teams. Right, if you're curious, the YAML interrogate function in pointblank for Python takes the YAML config files and returns a validated object with that interrogation applied.

And writing and reading our plan with YAML is also valuable to recreate and port our agent or validation plan at any time. Right, it's particularly helpful though when you're encountering new data with a slightly different cross section.

Porting the recipe to new data

And so with that step, we've walked through our core recipe. But the key to a good recipe is ensuring that it can be applied to something else, making sure that it's adaptable, a little bit flexible, right, setting you a foundation as you approach new projects or dishes. So let's see how it holds up when dealing with a new request.

You get a Teams message, sorry to put Teams messages up. I had audio, I was really committed to the bit, but I feel like that would've been too much. So let's say you get this message from your boss and you need a way to quickly recreate and process the 2009 data.

So we'll talk about porting that recipe over. We mentioned we can use our YAML file and its associated functions to do that. Since we're pretty familiar with this dataset, we can go straight to chopping and cooking it up; we don't have to scan it. And the best part about this is the validation plan is gonna do the hard work for you and catch anything that would prevent us from accurately getting our list of top scorers again. So to port it over, we can use yaml_read_agent(). What it'll do is pull the agent specs from your 2008 validation plan, but here you can go ahead and set the updated table, so we're using the Lakers 2009 object. You can also update the labels to more accurately reflect what you're validating.
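A sketch of that port, with the filename and the lakers_2009 object assumed:

```r
library(pointblank)

# Read the 2008 agent spec, point it at the 2009 table, and re-interrogate
agent_2009 <- yaml_read_agent(filename = "lakers-2008-validation.yml") %>%
  set_tbl(tbl = lakers_2009) %>%
  interrogate()
```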

And when we inspect the interrogated results from this function, we can see this: we get failing steps that didn't exist in 2008. We can see that there's 163 records with an invalid play type and at least 1,000 records with an invalid opponent. So let's take a deeper look at that.

And if we wanna take a deeper look at just our failing steps, what we can do is use the get_data_extracts() function again and pull out all our failing records to investigate even further. It's important to note, like I mentioned before, that this all-failures object is in memory, so you're able to interact with it and work with it to continue your processing. Whereas if you were to download the CSVs, they're saved on disk; that's better if you just wanna export them, not what we're gonna be looking at more deeply right now.
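Called without a step index, the extraction might look like:

```r
# With no `i` argument, get_data_extracts() returns a named list holding
# one data frame of failing rows per step that produced an extract
all_failures <- get_data_extracts(agent_2009)

names(all_failures)  # which validation steps produced extracts
```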

But let's take a look at all of our data failures and see if our ingredients are bad or just bruised. All right, we take a look at the play types and opponents that are triggering failures, and the results are interesting. We can see that there's a play type called LeBron James, which, if you have any knowledge of the subject matter here, you can probably say is an invalid play type. So we can pretty confidently say that's just bad data.

However, we do see that there's an opponent here that's listed as OKC, which is the Oklahoma City Thunder. That is a valid opponent name, and in fact, they changed locations around the time of this data, so that makes sense to us. Let's imagine you have a little subject matter expertise and you can confidently say that that's appropriate in this context. So one bad and one bruised data.

We wanna make sure that we include OKC as a valid opponent, but leave out records with the play type of LeBron James. It's important to know that a filtering chain would just drop both sets of failing records, and then you would have had to go back and investigate, when in reality the records with OKC as the opponent should be retained and we only want to drop those with an invalid play type. So pointblank gives us that flexibility, and we're able to use it proactively.

So as we fix our bad ingredients, we build a new object, All Opponents 2009, that identifies Oklahoma City as a valid opponent. To accommodate this new set, what we do is we use the same method that we did before to port our recipe over, except we remove step five, which originally used that All Opponents object to now use the All Opponents 2009 object, where that includes our updated list of valid opponents. After interrogation, we get this final report.
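A sketch of that adjustment, assuming the opponent check was step five of the original plan and that all_opponents holds the 2008 set:

```r
library(pointblank)

# OKC is a legitimate opponent in 2009, so extend the valid set
all_opponents_2009 <- c(all_opponents, "OKC")

# Re-port the recipe, swapping the old opponent check for one that
# uses the updated set
agent_2009_fixed <- yaml_read_agent(filename = "lakers-2008-validation.yml") %>%
  set_tbl(tbl = lakers_2009) %>%
  remove_steps(i = 5) %>%
  col_vals_in_set(columns = opponent, set = all_opponents_2009) %>%
  interrogate()
```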

You can see now that the only failing expectation is the one that we expect to fail, the one with the invalid play types. We know we don't wanna keep those in our final calculations, so we've achieved what we wanted to by using pointblank.

Now we have to separate the good from the bad, get the best part of our dish, and work to get our 2009 top scorers. So you can see here, same agent interrogation sandwich, and Kobe Bryant sits up top with 61 points. And it turns out our recipe was a slam dunk.

Wrapping up

So today I hope to have demonstrated the value of the pointblank recipe for understanding and validating your data. It comes from the ability to provide structure, proactive identification and handling of data quality issues, and a cohesive toolkit. Everything we used was contained in the pointblank package and environment.

It's such a rich package with a lot of tools, and I found myself getting lost in what to do when using the package at times. And I just hope today I was able to provide some sort of structure and a path for moving forward on how to use pointblank to validate your data. I sorta touched on a lot of the basics today, but like I mentioned previously, I'm not a contributor nor the creator. I wouldn't be here without standing on the shoulders of giants like Rich Iannone, Mauricio Vargas, and June Choe. They built something very powerful and I just wanted to thank them.

So next steps, take this recipe moving forward, apply this to your next project and approach new things with some confidence when using pointblank. You can visit the QR code that takes you right to the GitHub page for pointblank. And thank you. All right, thank you so much everyone.