
How to use pointblank to understand, validate, and document your data
R/Medicine Webinar

This workshop focused on the data quality and data documentation workflows that the pointblank package makes possible. The speaker, Richard Iannone, used functions that allowed us to:

1. quickly understand a new dataset
2. validate tabular data using rules that are based on our understanding of the data
3. fully document a table by describing its variables and other important details

The pointblank package was created to scale from small validation problems ("Let's make certain this table fits my expectations before moving on") to very large ones ("Let's validate these 35 database tables every day and ensure data quality is maintained"), and Iannone delved into all sorts of data quality scenarios so the viewer will be comfortable using this package in their organization. Data documentation is seemingly, and unfortunately, less common in organizations (maybe even less common than the practice of data validation). We'll learn all about how this doesn't have to be a tedious chore. The pointblank package allows you to create informative and beautiful data documentation that will help others understand what's in all those tables that are so vital to an organization.

Speaker: Richard Iannone, Software Engineer, Posit, PBC

Rich is a software engineer at Posit who enjoys creating useful R and Python packages. He trained and worked as an atmospheric scientist and discovered working with R to be a breath of fresh air compared to the Excel-based analysis workflows common in that field. Since joining Posit he has been focused on developing packages that help organizations with data management and data visualization/publishing. When not working on R and Python packages, Rich also enjoys other things like playing and listening to music, watching movies, and getting outside.
image: thumbnail.jpg
Transcript
This transcript was generated automatically and may contain errors.
Okay, I think we're going to get started. Hello and welcome. Welcome to today's webinar, how to use pointblank to understand, validate, and document your data, brought to you by the R Consortium. The R Consortium supports key organizations developing R infrastructure through grants and sponsorships worldwide. Please visit our website to learn all the details and how your organization can become a member. I'm Jesse Kasman, today's moderator.
Questions are welcome. We consider webinars an extension of the community. We like learning from you. You can ask questions through the Q&A icon at the bottom of your screen. At the end of the presentation, the speaker will be responding. The R Consortium hosts educational webinars on R-related topics regularly. Coming up next week, Johnson & Johnson will share insights into their work on the successful R submission to the FDA as part of the R Submissions Working Group. And in November, it's actually not a webinar, we are hosting a two-day fully virtual event on R plus AI. You can get past the AI hype and find out if and how you can use AI in your R programming needs. Please visit our website to find out more.
Okay, let me introduce our speaker, Rich Iannone. Rich is a software engineer at Posit who enjoys creating useful R and Python packages. He trained and worked as an atmospheric scientist and discovered working with R to be a breath of fresh air compared to the Excel-based analysis workflows common in that field. Since joining Posit, he has been focused on developing packages that help organizations with data management and data visualization and publishing. When not working on R and Python packages, Rich also enjoys other things like playing and listening to music. Rich, I see a couple of guitars in the background there.
Yeah, yeah. Watching movies and getting outside. So, okay, Rich, it's all you.
Yeah, okay, welcome. I'm going to share my screen. And we've got an hour, so I'm going to go through a bunch of material. And all of it, I suggest you just follow along. I'm going to show you lots of things. It's going to be all about data validation, validating data frames, validating tables, what have you. All the material is here in my personal GitHub, rich-iannone/pointblank-workshop. So, everything I'm showing you is there. So, don't worry about having to, like, find this stuff. Just listen along. And have questions ready. Because there very well will be some.
Introduction to data validation
Okay. So, I'm going to start with an introduction to data validation. So, I'm going to move into this Quarto document. Hopefully this is large enough. This is as large as I can make it without becoming too large. So, let me know if it's not big enough.
Anyways, the libraries I'm going to load are pointblank, some Tidyverse stuff, blastula even, and palmerpenguins for some datasets. So, I'm going to run that right here in Positron. And so, data validation, what is that? It's checking data frames for possible errors, inconsistencies, spurious values. The way we do that in pointblank is there are three basic components to this workflow. You create an agent. This is, like, basically the reporter, the data collection object. And then you also declare a number of validation steps or functions. As many as you need to validate the data as much as possible. And then finally, you end it off with the interrogate() function. So, basically right here. This is the pattern you'll see over and over again. You create an agent right here. Validation functions, I'll show you quite a few of those. And then finish off with interrogate(). And we use the pipe in between each of these steps.
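The agent → validation steps → interrogate() pattern he describes can be sketched like this, a minimal example using pointblank's built-in small_table (the label text here is illustrative, not from the workshop file):

```r
library(pointblank)

# The basic pattern: create an agent, declare validation steps, interrogate
agent <-
  create_agent(
    tbl = small_table,            # the table to validate
    label = "A first validation"
  ) |>
  col_vals_gte(columns = d, value = 0) |>  # one example validation step
  interrogate()

agent  # printing the agent renders the HTML validation report
```

Note that `columns = d` uses the tidyselect-style interface of recent pointblank versions; older releases wrote `columns = vars(d)`.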
Okay. So, let's take a look at a simple one. Let's look at a table first. So, I'm going to run a small table right here and see it right here. And it really is a small table, just 13 rows. I'm going to use this for validation because it has a number of different types of columns and it's, you know, very easy to play with.
Okay. So, starting it off, create_agent(). I plug in the table with the tbl argument. I can give it a name. It's really just the same name twice. And then I can also give it a label. This is important for the report that you'll see. So, once I create this object, I can print the object and get a report. Okay. So, I'm creating the agent. Okay. This is what I got right here. So, it's pretty unimpressive because it shows what is essentially an empty field. But we're on the right track. Basically, it says no interrogation performed and we have no steps. Okay. This is almost like a blank ggplot plot with no data. So, let's add some steps.
So, in this case, I'll take the agent, I'll carry on, and I'll add a number of, like, validation functions. So, some of them begin with col_vals. So, this is checking values within a column. Say I want to check values inside of column D to make sure that they're all above or equal to zero. So, that's what this does. col_vals_gte(). I'm targeting column D. I'm saying it must be greater than or equal to zero. Here's another one. col_vals_in_set(). So, this one I'm saying in column F, which is right here, the last column, the values must be in the set of low, mid, and high, these text strings. Okay. It looks like they are.
Okay. There's a number of other validation functions. The col_is_*() functions are just checking for types. col_is_logical(), col_is_numeric(), col_is_character(). And we're just targeting columns. Sometimes you can actually target more than one column, and this maps the same validation instructions to multiple columns. So, we're saying here that columns B and F should be character. Okay. And then there's another one here, rows_distinct(). There's lots of these, but that is like the main part of the package, the library of these validation functions. rows_distinct(). Okay. So, in this, we're saying that each row should be distinct from each other, or unique. Okay. Entire rows.
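Putting those steps together, the plan he builds looks roughly like this (reconstructed from the narration, so labels and argument details may differ from the workshop file; targeting two columns in one call expands into two steps, giving seven in total):

```r
library(pointblank)

agent_1 <-
  create_agent(tbl = small_table, label = "Workshop agent No. 1") |>
  col_vals_gte(columns = d, value = 0) |>                       # d >= 0
  col_vals_in_set(columns = f, set = c("low", "mid", "high")) |>
  col_is_logical(columns = e) |>                                # type checks
  col_is_numeric(columns = d) |>
  col_is_character(columns = c(b, f)) |>   # one instruction, two columns
  rows_distinct()                          # every entire row should be unique

agent_1  # so far only a validation *plan*: no results until interrogate()
```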
So, I'm going to run this. Okay. Now we've got a more expanded report right here. We can see that we have a validation plan, and we have all these steps. Looks like seven steps in total. But we have nothing on the far right because there's no interrogation performed as before. So, to finish it off, as I said before, we have to use interrogate in the end. So, I'm going to run that and reassign it back to agent one here and then print the report. Okay. I'm going to make this a little bit bigger so you can sort of see what's going on. And we get to the report now. And we have it seems like green all the way down except for right here it's a lighter green. That's interesting.
But the key thing to note here is that we have things called test units. And basically for these col_vals_*() validation functions, these will be the individual rows, or the cells within each of the rows. So, we see here we tested 13 rows and all of them seem to pass. Same with the set-based validation. We tested 13 and all passed. No failing. Okay. So, right here, to read this, this is the number of test units passed. And this is the proportion. So, 100%. And here none failed, and this is 0% failed. Here in rows_distinct(), we see that we don't have entirely distinct values across all the columns. We actually have two failing rows right here. And more interesting is we have a CSV button. We can actually press this and download a CSV of the failing rows. Okay. So, interesting. So, we have two failing rows. 15% of the data has failed in this case.
So, that's what this report shows. And basically it's a central part of pointblank. It's showing you basically how the validation went, you know, sort of came across. And also you can share this with other people that, you know, may not be using pointblank or may not be using R at all. This is a really good shareable report. And the nice thing is, like, it's HTML and you can hover over these things and sort of see, you know, plain text, sort of in plain English, like, what you expect from each of these validation tests.
Threshold levels and action levels
I'll go through more of this a little bit later. Things will eventually get filled in as we use more of the features of pointblank. Okay. And one of those features is called threshold levels.
So, we have this report, but we have no way of, like, flagging whether something is kind of bad, right? We see two failed rows, but we have no value judgment as to whether, you know, that's important or not. So, we can set that with something called action levels. action_levels() is the name of the function and of the object it creates. So, what you do in this case is you have three little buckets here, warn, stop, notify, and they appear here in the report as W, S, and N. And these are, like, priority levels. You can set different failure rates for each of these. And then what you'll get is they'll light up if we've exceeded a threshold for, you know, for these failure rates.
Okay. So, I'm going to run this. I'm going to run, make this object, and I can actually print this object to see it in the console. Okay. So, we see right here, warn has a failure threshold of 0.15 for all test units, 0.25 for stop, and then 35% for the notify bucket. Okay. And we set it, you know, we either set it as percentages or fractions, or we can use, like, numbers. Like, for instance, we can use three. We're going to stop at three failures. That's an absolute number. But in this case, I'll just use percentages, fractions.
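The thresholds described can be sketched with action_levels() like so; fractional values are proportions of failing test units, while integer values would set absolute counts:

```r
library(pointblank)

al <- action_levels(
  warn_at   = 0.15,  # warn if more than 15% of test units fail
  stop_at   = 0.25,  # stop at 25%
  notify_at = 0.35   # notify at 35%; e.g., `stop_at = 3` would mean 3 failures
)

al  # printing the object shows the three thresholds in the console
```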
Okay. So, we set that, and we can pass that object into the actions argument of create_agent(). So, right at the beginning, we can configure the agent to have a global set of, like, failure thresholds. Okay. So, let's take a look at what this whole thing looks like. This is a new agent with small_table. I gave it a name. I gave it a label as well. I gave it the actions object, and I'm doing a number of other validations on it. In this case, I'm checking for date-time objects, whether values are less than a value, looking for a regex pattern in column values, whether values are between two values, and so on. And some of these we've seen before. Okay. So, I'm just going to go ahead and run this.
This is a brand new validation, agent 2. And I'm going to expand this a little bit. And we see here, okay, at the top, now we have, like, the global thresholds that we set previously. 15%, 25%, 35%. These are increasingly stringent thresholds for problematic values. And we see right here that we have now more colors on the left. These status strips now show yellow and red in four cases. And if we just scroll over, I'm just going to make this a little bit bigger, we see that we have these circles now that weren't there before. Before we just had dashes. But now we activated some global thresholds. And once they're tripped, or basically they're entered, they're filled. Okay. So, this one here, the regex was totally wrong, or it didn't match. Of the values in that column, it matched for six, but not seven. And because of that, and we have, you know, a notify percentage of 35%, our 54% failure rate went over all of these. So, that's why you get all three of these triggered right there. And then, of course, for any failing test units, we always have a CSV button. So, individual steps you can get CSVs for.
So, we see that twice here. In this case, we only have a 15% failure rate for rows_distinct(), and that matches the warn rate. So, we just get that one filled in. So, basically, this is a good way to see, you know, whether certain validations are very important. You want to be notified. And we're not covering this today in our quick workshop or webinar, but you can also assign functions to each of these. So, for instance, if, you know, this becomes notify, we can assign a function that just fires any sort of custom-made function. That could be logging, that could be, you know, sending a notification, anything you can sort of create as a function that's based on these notifications.
Okay. So, let's do a little bit more. Let's go into a more detailed validation right here. This one is number three that we're doing here today. So, again, use small_table. And again, we put a name for that, and that appears just up here, table, and then small_table. So, just a little bit of metadata you can have at the top. And then I gave it a label, workshop agent number three. Okay. That'll appear here soon. And again, I'll go over the different functions because so many of them are new: col_is_posix(). So, we're saying here that the date_time column will be a date-time type of set of values. col_vals_lt() means less than. So, the values in A will be less than seven.
col_vals_regex() again. Okay. In this case, we're doing something a little bit special. We're taking the actions idea, and you can apply it to individual steps if you wanted to. So, each of these has a common set of arguments. These include actions as well. I know that create_agent() has actions, but these are more like global actions, whereas each of these steps can have their own set of actions, which, you know, basically overrides the global set. So, basically, you have a chance to sort of customize each of these steps in lots of ways. So, here we're saying that instead of using proportional levels, we're setting absolute levels. We will have warn at one, stop being filled at three failures, and notify at five failures.
Okay. So, that's interesting. So, let's try that out. I'll run this out. Make this a little bit larger. Okay. And one thing I didn't point out before is that in the console, you'll get, like, this little quick status, 10 steps. It'll show how many steps there are at the end, and it'll show whether some of the conditions are actually tripped right here. So, we see OK for lots of them, but stop, warning, notify for some of them. Okay. I'll expand this a little bit more, minimize this, and we see that for the one step, col_vals_between() right here, that is this over here. Okay. Expect that values in D should be between zero and 4,000. That's what we have right here. Okay. But we have our own set of action levels. We said one, three, and five. Okay. So, here we have one failure. It's 0.08 as a fraction, so that would be 8%. But that's less than the global setting for thresholds; of course, we set it here for this one step so that one single unit will trip the warn threshold right here. So, we see that right here. There's no other indication that we set levels for the step, other than that you just kind of know that they're here.
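A step-level override like the one described, absolute counts of one, three, and five on a single col_vals_between() step, might look like this sketch (the label is illustrative, and the rest of the ten-step plan is omitted):

```r
library(pointblank)

# Global thresholds (fractions) for the agent as a whole
al <- action_levels(warn_at = 0.15, stop_at = 0.25, notify_at = 0.35)

agent_3 <-
  create_agent(
    tbl = small_table,
    label = "Workshop agent No. 3",
    actions = al                    # applies to every step by default
  ) |>
  col_vals_between(
    columns = d, left = 0, right = 4000,
    # step-level override: absolute failure counts, not fractions
    actions = action_levels(warn_at = 1, stop_at = 3, notify_at = 5)
  ) |>
  interrogate()
```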
Validation function library
Okay. So, now let's have a look at all the different validation functions in the package. There's 36. So, definitely no need to really, like, look at this much. But I'll go over certain features of what these are. There's basically groupings of these. So, let me go into that right now. So, the col_vals_*() group right here, we've seen this before. Basically, we're checking all the different cells in a certain column. It could be multiple columns because we could map over multiple columns. But typically, we're just going down cell by cell in a column and checking for certain things. And it could be less than a certain value, less than or equal to, equal, so on and so forth, in a range or not in a range, in a set or not in a set, or the values have to make a full set or a subset of something. Or it could be that you want the values to be increasing, like an index that always increases in an ordered table. You want to make sure that the values are always going up or down, whatever the case may be.
We can also check, this is pretty important, whether values are null or not. You may want no null values. You may expect no null values in a column. And col_vals_not_null() will quickly find out if that's the case. Another common one with strings is col_vals_regex(). We've seen that a few times. And basically, we're checking a pattern to make sure it fits. And typically, if you're checking that, it falls within a very specific pattern. And there's other ones, too, where you can provide your own expressions. This is more complex. But this is all part of the col_vals_*() family of validation functions.
We have other ones that check entire rows. We have rows_distinct(), which checks whether rows are unique. And rows_complete(), which checks whether there are any null values across rows. If there's one null value in a row, no matter which column, it'll be a failing row. And then also we have the col_is_*() family of functions. Basically, we're just checking types. So we're checking whether we expect a column to be a character column, numeric, integer, logical, factor, and so forth. Another great one is col_exists(). Sometimes when you're doing your analysis and you have joins, you want to make sure that certain columns don't, you know, disappear after some sort of operation. So col_exists() is a really good one. Or sometimes you may have third-party data, and you want to make sure all the columns exist.
Okay. A few other ones which are special. I'm kind of skipping over a few, but these are some special ones. They're for more advanced validation needs. conjointly() just means that you expect certain values in this column based on some other column. So it's more like a dependency across columns. And specially() is basically your own unique function that can test basically anything. Yeah. The only caveat is you have to produce either a vector of logical values or a table where the last column is logical. So that's it. That's a huge library of, well, lots of validation functions, but I think they cover a lot of what you might do when you're validating data.
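A sampler of the families just described, one representative call per family; the specific arguments (values, regex pattern) are illustrative rather than taken from the workshop:

```r
library(pointblank)

agent <-
  create_agent(tbl = small_table) |>
  # col_vals_*: cell-level checks within columns
  col_vals_lt(columns = a, value = 10) |>
  col_vals_not_null(columns = b) |>      # expect no NAs in b
  col_vals_regex(columns = b, regex = "[0-9]-[a-z]{3}-[0-9]{3}") |>
  # row-level checks across entire rows
  rows_distinct() |>
  rows_complete() |>
  # type and existence checks
  col_is_character(columns = f) |>
  col_exists(columns = date_time) |>
  interrogate()

agent  # the report shows one row per step, grouped as above
```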
Using validation functions directly on data
So let me show you another way that you could use pointblank. There's quite a few ways. Aside from using an agent, as before, we could also use these validation functions directly on the data. So what this does is it acts as sort of a validation filter. The data will pass through unchanged if the validation is successful, like no failures, or you can set a threshold as well. But if it does have failures above a threshold, which is one by default, it'll immediately create an error, which is great for notebooks and that kind of analysis. You want to make sure that your data is a certain way before you proceed further.
So let's take a look at small_table again. So in this case, we want to ensure that the column values in column A are between zero and 10. Okay, let's just run this. I'm just piping right from small_table to this validation function. And we see right here, I'm going to make this a little bit bigger, that we just get the data frame back. That means no problems. And we'll just make sure column values in A, yeah, they're definitely between zero and 10. We can visually verify here. Okay, so that's why this passed through. This next one is a bit of a spoiler: I annotated the chunk with error: true, meaning it will error, because there's obviously values less than five here; we see a two right away. So this one will actually flag an error. And it actually creates a real error. And it has a message, pretty useful. It says exceedance of failed test units, where values in A should have been between five and 10. And it says also, the col_vals_between() validation failed beyond the absolute threshold value of one. Okay, so that's the default. And you can change this with the threshold argument right inside this. And another cool thing is that it says failure level of 10 is greater than the failure threshold of one. So a lot of failures have occurred here. Seems like 10 of them.
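The two cases just described can be sketched like this; the try() wrapper (not used in the workshop) is only there so the failing case doesn't halt the script:

```r
library(pointblank)

# Passes: all values of column a lie within [0, 10],
# so small_table is returned unchanged and the pipeline continues
small_table |>
  col_vals_between(columns = a, left = 0, right = 10)

# Fails: some values of a are below 5, so this throws an error.
# By default a single failing test unit triggers it (threshold = 1).
try(
  small_table |>
    col_vals_between(columns = a, left = 5, right = 10)
)
```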
Okay, so that's one way you can sort of catch errors. Very simply, just using your data and not having to use the reporter; this is more immediate. Let's try with col_is_*(). So this group will check whether a column is a certain type. So it's two cases, one passing, one failing. So this one right here, we say col_is_character() and the column in question here is column B. Looks like it is character. Okay, let's just make sure. Okay, so it passes through. It acts as a filter. In this case, we're not catching anything. So it just goes right through. So you can actually use this right inside of, if you want to, you can actually have more analysis if you want to right here. Or you can just have it, you know, being separate. But it's kind of a cool thing. You can put it in between. It doesn't affect the data, which is kind of nice. In this case, we're checking col_is_numeric() and the column in question is date. That's not numeric. So we will get an error right away. So it says here, in this case, we're just checking one thing, which is the type. It says that it fails beyond the absolute threshold level of one. And we have a failure of one, which is, in this case, equal to the failure threshold of one. It's because it's not right. It's not numeric.
Okay. And in this case, rows_distinct(). We saw before, you know, small_table doesn't have distinct rows. Let's run this. Again, right here, we see a failure level of two. So we caught that. What about the head of small_table? So that's the first six rows, I believe. That is fine. That sails right through. There's no error there because this particular set of rows is distinct from each other. So wonderful. What about this one? rows_complete(). So basically, that means there's no missing values. There's no NAs in all the rows. Okay. So that is an error. Because we actually have two rows which have some missing values within them. So there's two incomplete rows.
Okay. Another cool thing you can do is, and you can do this also with rows_distinct(), is you can narrow it down to a set of columns. This is basically like selecting beforehand and then running the same test. So we're saying here that if you subset to the columns date_time, date, A, and B, are the rows complete? So I'm going to run that. And it seems like, yeah, it passes. This is kind of strange because it shows you the original data frame. But we just checked within these columns right here whether we have complete rows. And that's true. We don't see any NAs except for in row C. Or sorry, in column C. So we do have complete rows for this subset. Great.
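A sketch of the contrast just described, full-table completeness versus a column subset:

```r
library(pointblank)

# Fails: column c contains NAs, so two rows are incomplete
try(
  small_table |> rows_complete()
)

# Passes: restricted to these four columns, every row is complete,
# and the *original* (unsubsetted) data frame is returned
small_table |>
  rows_complete(columns = c(date_time, date, a, b))
```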
Okay. And we have some match functions. And I'll show you what those are. So sometimes you may expect that you have a certain number of rows. Exactly. For reasons. You may have a join. And you want to check whether the joined table has the same number of rows as the previous table. So that's one use case. But we're doing it very simply. We're not going into that. We're just going to use small_table. And we're going to, you know, guess or expect that the number of rows in our small_table is 13. Exactly. So let's run that. And it must be true. And we see it's true because we have the data frame here. Okay. And the same goes for columns. Instead of row_count_match(), we also have col_count_match(). And a cool thing you can do with count is you can use another table. It'll just take the count of columns in this case from another table. And apply it to the value of count right here. It just so happens that the penguins dataset within palmerpenguins has the same number of columns as small_table. A little strange, but it's true. And we sort of see it right here. And that's how you use col_count_match() there.
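The two match checks can be sketched like so; both small_table and palmerpenguins::penguins happen to have eight columns, as noted in the talk:

```r
library(pointblank)
library(palmerpenguins)

# Expect exactly 13 rows, or error out
small_table |>
  row_count_match(count = 13)

# `count` can also be another table: compare column counts directly
small_table |>
  col_count_match(count = penguins)
```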
Data extracts and sundering
Okay. Now I'm going to get into another thing, which is data extracts. So we saw before that we have these CSV buttons. Just go to the viewer right here. And this is a way to get failed rows. We see here that there's two failed rows available in the CSV file. It gives you a little preview. Seven here. One right there. But that's not so convenient if you're using R and you just want to know what the rows are. You can actually get those rows in R with the get_data_extracts() function. And that takes an argument called i. And what you use for i is actually the step number. So if you look here, these are the different steps. And this is exactly the number you would use to get the different extracts. And extracts are essentially data frames. It's the same as these CSVs, which you haven't seen yet. So I'll say right here for agent three, which is this right here, I want the extracts for, well, all of them. So I'm just going to use it by itself, agent_3, in get_data_extracts(). I'm going to run this.
Now I'm going to make the console a bit bigger. Oops, sorry. I have to minimize that. I'm going to make this disappear for a second. I'm going to show the entire thing. It's like this. So it's essentially a named list. And the different names are the different steps that have extracts. So if there are failed rows, they'll appear like this. So step number two had these two failed rows. Step three had seven failed rows, and so on. So this giant list, you can sort of subset from this and get the different data frames. But we make it a bit easier: you can supply an i argument. So if you wanted the data frame just for step nine, the failed rows for that, you would use i = 9 in get_data_extracts(). So I'm going to run this. And we have it right here. So basically, these are the failed rows. And to understand what this refers to, let's look at step nine. col_vals_in_set(). So low and mid, okay, in column F. Now if you go back to the console, we see that we have all these high values, which are not part of that set. Okay, so the set was a bit too small. Okay, so that's what we get. These are failed rows just for that step.
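A self-contained sketch of the extract workflow; here a deliberately-too-small set makes the "high" rows fail, so there is something to extract (the agent shown is illustrative, not the workshop's ten-step agent_3):

```r
library(pointblank)

agent <-
  create_agent(tbl = small_table) |>
  col_vals_in_set(columns = f, set = c("low", "mid")) |>  # "high" rows fail
  rows_distinct() |>
  interrogate()

# A named list: one data frame of failed rows per step that has extracts
get_data_extracts(agent)

# Just the failed rows for step 1 (the rows where f == "high")
get_data_extracts(agent, i = 1)
```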
Okay, and taking this a little bit further, we can have something called sundering. It's a function: basically getting good or bad rows from the original data set. Okay, so, and again, this depends on your methodology. But if you want the best part of your input data for something else, with the get_sundered_data() function, we can provide an agent object that was interrogated, just as we had before. Agent three. And we can get back either a "pass" data piece. This is the default. And what this is, is rows with no failing test units across all row-based validation functions. Okay, so it won't work for things like col_is_*() or for other checks that check one aspect of the table, or the dimensions. It's for all those col_vals_*() checks. Alternatively, we can also get the "fail" piece, basically the rows that had at least one failing test unit across all the validations. And we can also get a "combined" piece, which has a row, sorry, a column which is a flag column.
So let me show you how this works. So we'll create a new agent, agent four. And we're going to be really simple here. We're going to have two validation functions: checking that values in column D are greater than a thousand, and here checking that values in column C are greater than values in column A and less than values in column D. Okay, and we'll also say that any NA values, we'll just pass them through. That's fine. They're valid. Then we'll interrogate. Okay, so we see this and we see some passes, some failures for both steps right here. Okay, we didn't set any thresholds, so we're not seeing any of that. We just see hyphens here.
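The two-step agent described might be written like this sketch; vars() references compare one column's values against another's, and na_pass = TRUE lets NA values count as passing:

```r
library(pointblank)

agent_4 <-
  create_agent(tbl = small_table, label = "Workshop agent No. 4") |>
  col_vals_gt(columns = d, value = 1000) |>     # d > 1000
  col_vals_between(
    columns = c,
    left = vars(a), right = vars(d),  # a < c < d, row by row
    na_pass = TRUE                    # NAs in c are treated as valid
  ) |>
  interrogate()

agent_4  # report shows passes and failures for both steps
```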
But this is the key thing right now. So we're seeing that validation report. Okay, so now let's use get_sundered_data(). And again, what this gives you by default is all the passing rows, rows that sailed through the validation with no problems in both types of validation. So let's just verify this right now. So we said here that column D should always be greater than a thousand. Okay, and we see that here. Okay, these are all passing. And we also said that values in column C should be greater than A and less than D. Okay, so we see that A is two, this is three, D is huge. It seems like C is always going to be less than D. And it seems like these values are always greater than values in column A. And the missing values, as we said here, are just passed through. So those are valid in this case. So this really is the piece or the chunk of the data which has rows that pass all validation steps.
Okay, we can get the opposite, the complementary piece. This is the type "fail". So in this case, we get eight rows. So remember, there's 13 rows entirely. Five plus eight is 13. So basically, these two pieces always add up. It's always going to be a complementary piece, and its rows won't appear in the other one. So that's the fail piece right here. So this could be filtered out. It could be used for further checking, for root cause analysis, what have you. It's good to know that you can get this and find out if there's problems further on in your systems.
And there's a third way of doing this, a third type, which is "combined". I will show you that. So what that shows you is the entire data frame, which is 13 rows. And now we have this new column, .pb_combined, and it just says pass or fail, right? So that just says, between all the steps that are considered for this, whether these rows passed or failed. And a cool thing is you could change the labels in .pb_combined. By default, it's the string pass or fail. But you can use TRUE or FALSE. Let's run that. And now, see, it turns into a logical. It really is a TRUE and FALSE. It just saves you a mutation step. You can also run it so you have 0 and 1 for the flag. It's an integer. It just knows what type it is. Okay, so that's sundering of data. If you have more of these steps, you'll probably have fewer total rows in the pass piece. But it's a good way of seeing which data is really good and which fails easily.
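The three sundering types can be sketched together like this (reusing a two-step agent in the spirit of agent_4; the pass_fail labels shown are the customization mentioned above):

```r
library(pointblank)

agent_4 <-
  create_agent(tbl = small_table) |>
  col_vals_gt(columns = d, value = 1000) |>
  col_vals_between(columns = c, left = vars(a), right = vars(d),
                   na_pass = TRUE) |>
  interrogate()

# The rows that passed every row-based validation step (the default type)
get_sundered_data(agent_4)

# The complementary piece: rows with at least one failing test unit
get_sundered_data(agent_4, type = "fail")

# The whole table plus a flag column; custom labels make it a logical
get_sundered_data(agent_4, type = "combined",
                  pass_fail = c(TRUE, FALSE))  # .pb_combined: TRUE/FALSE
```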
Accessing validation metadata and emailing reports
And as you can imagine, there's a lot of data that you get from these interrogations. What we're showing in the report is quite a bit of it, but it's not all of it. Luckily, you can get all of it if you want and make your own report, or use pieces of it for your downstream analysis. The function for that is `get_agent_x_list()`. Okay, it sounds crazy, but that really is what it's called. What it gives you is a giant list, and there's a print method for it, so it shows you in the console exactly what's in the list. We see here we get some clues: there are dollar signs and then the name of each piece, with a hint of what's in there and what the size is. In lots of cases we have vectors; here's a vector of column names and one of column types. The validation part is right here: how many test units, how many passed, how many failed, the fractions passing and failing, and whether warn, stop, or notify conditions were tripped. So, quite a few things here.
And we can explore that. Now that we have the x-list right here, we can use `x$n`, as we see, and just run it. And we get a vector: how many test units are in each step. It's running from one to, I don't know how long it is, I think eight steps. And we see how many passed with the `n_passed` part of it. There we go: that's how many passing test units we have out of the total. And we can arrange this in a table; we can do all sorts of things with this. Let's create a table. Now we have a data frame of all the different steps, actually 10 steps, and all the different failure states that came out of the validation. So, pretty good stuff, lots of information inside here. You just have to take a look, peruse it, and explore; see if there's anything interesting you might use for your downstream analysis.
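Pulling pieces out of the x-list might look like this; a sketch assuming an interrogated agent as above:

```r
x <- get_agent_x_list(agent)

x$n         # test units per validation step
x$n_passed  # passing test units per step
x$f_passed  # fraction of test units passing per step
x$warn      # whether the warn condition was tripped in each step

# Arrange the per-step results into a small summary table
dplyr::tibble(
  step     = seq_along(x$n),
  n        = x$n,
  n_passed = x$n_passed,
  f_passed = x$f_passed,
  warn     = x$warn,
  stop     = x$stop,
  notify   = x$notify
)
```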
Okay. Now, a cool thing you can do with this is email your reports. Instead of just having the report here in the viewer, you can send a smaller version of it in an email. This is important if you're running a validation in a continuous or scheduled process, say a cron situation; you may want an email sent to you in certain cases. So this is actually a good use of the x-list: you might say here, if we have any notify states, then create that email and send it off. And for the sending, this function integrates with blastula, another R package for composing and sending emails. It does the right thing: it creates the message body, basically the blastula object, and then the second step is to send it off with `smtp_send()`, and you can use things like Gmail, what have you. If you have Posit Connect, you can skip this entirely; you can create reports and have them emailed without it. But this is great if you don't have that.
Okay. And let me show you what that looks like. `email_create()` is one way to create the email, and I'll show you what that looks like just by itself. You can just create the object, basically pass the agent into `email_create()`, and it creates a blastula object, which is visible right here in the viewer. So let me run this. Okay, it's a little bit larger, and this is basically an email. You can see it has this little border around it, a little wrapper, and then a summary saying there are 10 validation steps, the date and time the validation was run, and then the report. It's a much smaller report; it excludes a few columns and just tries to get across the main things: what type each step was, what the passing and failing rates were, and whether there were warn, stop, or notify states. That's basically it for the email. If you sent it this way, it would appear this way in someone's inbox.
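The conditional emailing described here might be wired up like this; a sketch where the recipient addresses and the stored credentials key are hypothetical:

```r
library(blastula)

x <- get_agent_x_list(agent)

# Only send when a notify condition was tripped in any step
if (any(x$notify, na.rm = TRUE)) {
  email_create(agent) %>%
    smtp_send(
      to = "data-team@example.com",      # hypothetical recipient
      from = "pipeline@example.com",     # hypothetical sender
      subject = "pointblank validation: notify condition tripped",
      credentials = creds_key("gmail")   # assumes a stored credentials key
    )
}
```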
Customizing the validation report
Okay. So, we have these reports. Say you wanted to customize them. You can do that: take the agent and use `get_agent_report()`. I know you typically just print the agent, and that's fine, but `get_agent_report()` gives you additional opportunities to customize the report because it has many arguments for customization. So let's do that. Let's change the title for this third example. We see right here that the title is replaced; before, it was just called "Pointblank Validation", but you can change it to whatever text you want. Another option is to change the arrangement of the steps. Say you want to surface the different steps by failure rate: we can arrange by severity. Let's run this, and now this step right here, which triggered stop and notify, is at the top; these other steps, which triggered warn, are near the top; and everything else is towards the bottom. The steps are out of order, but we're showing first what failed the most. We can do the same thing but only keep the fail states. If you have lots and lots of steps and you just want to see what failed, this is the report to use: we order by severity and keep just the fail states. You can do one or the other or both, depending on what you need. This is great especially if you're emailing the report, or showing it to someone who just wants to know what went wrong.
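Those customizations can be combined in one call; a sketch of `get_agent_report()` with the arguments mentioned (the title text is made up):

```r
# Custom title, most severe steps first, and only the failing steps kept
get_agent_report(
  agent,
  title      = "The Third Example",
  arrange_by = "severity",
  keep       = "fail_states"
)
```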
Another cool thing to do, and maybe it's not so important here, is change the language of the report. Lots of languages are supported, and basically all the different labels here will be in a different language. You just set it through the `lang` argument of `get_agent_report()`. Another cool thing is that, although we didn't cover it here, `get_agent_report()` actually gives you a gt object. So if you know the gt package, and that's what this really is, a gt table, you can use the gt functions to keep going and do more customizations if you want. You can add footnotes, hide certain columns, hide the header, change all sorts of things. So that's another good use of `get_agent_report()`.
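Since the report is a gt table, language switching and gt-level tweaks can be chained; a sketch (the footnote text is made up):

```r
library(gt)

get_agent_report(agent, lang = "de") %>%       # report text in German
  tab_footnote(footnote = "Nightly run against small_table.") %>%
  tab_options(table.font.size = px(12))        # any gt customization applies
```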
Scanning data and drafting validation plans
So, yeah, that's basically the first part I wanted to show you. I know we're running a little low on time, but I want to quickly show you two more big things you can do with this package, and that is scanning your data and then drafting validation plans. Okay, so I just ran this, and it's basically one function: `scan_data()`. If you get a brand-new dataset and you know nothing about it, you just want to get the lay of the land, see what's up, what's wrong with the data, or just understand it, you can run `scan_data()`. So I'm going to run this right now, and it has these different sections: overview, variables, interactions, correlations, missing, sample. You can set which sections you see, and in which order, through the `sections` argument. So, I'm going to show this to you in a more enlarged form. Basically, it gives an overview of the dataset you put into it, and it shows you the different variables. For instance, sample number, this is interesting. You get many simple descriptive stats, quantile stats, common values, and such for each of the different variables, which is nice.
So, really kind of cool. I keep going down, and you also get a missing-values report. You can see right here that in comments a lot is missing, which makes sense for a column called comments; it's quite optional. But some other columns have missing values in certain parts of the table too; near the bottom there are quite a few. And we also get a sample of the data, so you can see the top and the bottom of it. Okay. Really cool. That's the `scan_data()` function, really nice for checking out the data as a first step, or for spotting anomalies that might inform your validation process. When you get the report, you can scan through it, and you can export it to HTML. Of course, you can always include it in a Quarto or R Markdown document; they're quite nice to have in there. Lots of ways to use and publish it.
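A `scan_data()` call with a section selection and an HTML export might look like this; a sketch using `dplyr::storms` and `small_table` as stand-in datasets:

```r
# All sections by default ("OVICMS"): Overview, Variables, Interactions,
# Correlations, Missing, Sample; pick and order them with `sections`
scan_data(dplyr::storms)
scan_data(dplyr::storms, sections = "OMS")  # overview, missing, sample only

# Export the scan report to a standalone HTML file
export_report(scan_data(small_table, sections = "OVMS"), filename = "scan.html")
```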
Okay. Here's one more big thing I want to show you: drafting a new validation plan with the `draft_validation()` function. We saw in the previous module that we were writing a lot of steps by hand, like up here, and that's good, you have to do it, but here's a simpler way if you're just getting started with a brand-new dataset. You take the dataset like this and just plug it in. This little tilde means the loading is lazy; it just knows this is the dataset you want, and you can also add more transformation steps here if you want. So it's an expression for getting the data, essentially. As before, we can give it a table name and a file name; we're creating a file here. So, `draft_validation()`. I haven't even told you what it does, but I ran it, and if you look in here, we get a new file called storms_validation. I'm going to show you that file. Basically, it gives you a brand-new file with a whole starter validation: it uses ranges of the data and other little things it finds out about the data, and then creates a starter validation for you. It's pretty long and extensive, and it even gives you a way to check the schema of the table, so if the schema changes over time, this will catch it. It gives you lots. Let me just run it right now.
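The drafting step shown here can be sketched as follows, matching the storms example in the demo:

```r
# The ~ makes table-reading lazy (a table-prep formula); more dplyr
# transformation steps could be chained inside it if needed
draft_validation(
  tbl = ~ dplyr::storms,
  tbl_name = "storms",
  file_name = "storms_validation"  # writes storms_validation.R
)
```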
Okay. Great. I'm going to try to run this. Looks like it did not work. There we go. Maybe I have to do this. No. Okay, my Positron is messing up. But if you were to run this, it would run just fine. I'll try one more time. No, it doesn't seem to want to do it. Okay, well, that's too bad. But if it did run, it would give you a whole validation, an agent and a report, and you just tweak it from there. Say these values don't quite fit, the ranges are a bit too small; you could adjust that, and it makes much more sense to have this full thing in front of you instead of starting from scratch. That's what this is good for.
Q&A
I can't quickly show an example, but there are two functions you would use. One is `conjointly()`, which checks several validations jointly across the same rows; if you look at the examples there, you'll find it. The other is `col_vals_expr()`, and with that you can use a case_when expression. Maybe if I'm able to do this, I can quickly show you. It seems like this is a bit borked. Let's see here. Oh, yeah, there we go. Hmm. Oh, no. Positron. It's showing me the text here, but it's not quite working. But there are examples in there that do exactly that; you just have to go down to the bottom of the `col_vals_expr()` documentation and you'll see.
Rich, if you're expecting to show your screen, I'm not seeing it. Oh, yeah. Okay, I'm going to share it. Yep, one sec here. Sorry about that. Yeah, it's rendering very poorly in Positron, but it's here in the documentation of `col_vals_expr()`. That's how you would do it; you can use a case_when if you want to.
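The two approaches mentioned might look like this; a sketch on a made-up tibble, mirroring the examples in the `col_vals_expr()` documentation:

```r
library(pointblank)
library(dplyr)

tbl <- tibble(
  a = c(1, 2, 6, 7, 8, 6),
  b = c(0, 0, 0, 1, 1, 1),
  c = c(0.5, 0.3, 0.8, 1.4, 1.9, 1.2)
)

# col_vals_expr(): different rules for different subsets of rows
tbl %>%
  col_vals_expr(expr = expr(
    case_when(
      b == 0 ~ c < 1,
      b == 1 ~ c > 1
    )
  ))

# conjointly(): each row must pass all of the sub-validations together
tbl %>%
  conjointly(
    ~ col_vals_gt(., columns = vars(a), value = 0),
    ~ col_vals_lt(., columns = vars(c), value = 2)
  )
```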
Yeah, I'm going to put that in as an enhancement, because right now you can't really tell what is global and what is local to the step. That's basically an enhancement I would put in: I'd probably add a marker saying so, or, another idea I have, and of course we don't have it now, is to highlight each of the circles you have in those columns with what is set for them. So, basically, I don't have it now, but I'll have it in the near future, I think.
Okay, thanks. Okay, this one's from Mauricio. I had a quick question regarding the agent table report. Is it compatible or easy to display within a Shiny app?
The R Consortium will have this video up on the R Consortium YouTube account. That's not instantaneous; it's probably realistically early next week. But maybe, Rich, you have some? Yeah, if you look on YouTube, there's even a previous workshop from some years back that's still very valid today, and all these materials are available in my repo, rich-iannone/pointblank-workshop on GitHub. It's basically all the stuff you've seen and more, actually. And there's a good set of vignettes available on the website, if you look at the actual pointblank repo, and there's more being created; we're working on new vignettes as well, so there's more on that soon.
Yeah, so about this, I've been thinking about it a lot through the history of this package, and the cool thing you can do right now, which we didn't see in this workshop because it's too much material, is that each of these validation functions has a `preconditions` argument, and that allows you to mutate the original table. So you can add a column that is, say, a range of values calculated dynamically from an expression you provide, even in a certain step. So you can mutate your table and then run the validations on that. The way I see it, we have these really simple validations, plus another way to get the table, or different columns, what have you, into the shape you want for a more specialized validation. It's kind of like solving the too-many-validation-functions problem, because you can do more to warp the table into what you need for certain checks. So I think you could do this right now: you just need functions that create additional columns, and then you check on those pretty simply.
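A `preconditions` mutation like the one described might be sketched as follows (the derived column name and threshold are made up):

```r
# preconditions mutates the table just for this step, so a derived
# column can be created and then validated
agent <- create_agent(tbl = small_table) %>%
  col_vals_lt(
    columns = vars(a_d_ratio),  # hypothetical derived column
    value = 0.05,
    preconditions = ~ . %>% dplyr::mutate(a_d_ratio = a / d)
  ) %>%
  interrogate()
```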
I think so. I mean, we include some affordances to instrument the process of validation, like getting continuous runs, and we have a lot of functions that create files, and you can read from those files, so you can serialize and deserialize. You can also run from YAML. So I can see you running an R script, or running in CI or what have you, and it would create the files as artifacts, and then you can get them back and start an audit trail of different validation files. There was actually a presentation, not last year, I think two years ago at posit::conf in Chicago, where someone took pointblank and did a full instrumentation with continuous runs across lots and lots of datasets, for some sort of medical application, I believe. So check that out from posit::conf in Chicago. I believe that was 2022; could be off by a year. But yeah, there are ways. And let me know, file an issue, if you're hitting any sort of problem getting that going.
The next one is from Ashok. Great package and presentation. Can you add your own checks?
Okay, it's not trivially easy to add your own checks, but we provide some functions that make it pretty close. Essentially, you provide a function, and that function takes in the table. All you have to return in the end is either a vector of true/false values, a logical vector that is, or a table where the final column is a column of logical values; that's used to do the actual check and the counting of test units. So basically, you can. The only thing you can't do is change the icon and other small customizations, but I'm thinking of adding that in. So this is the way to plug in your own checks, because any function you want to run, you can, with `specially()`.
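With `specially()`, plugging in your own check might look like this; a sketch with a made-up rule on `small_table`:

```r
# fn receives the table and must return a logical vector (or a table
# whose final column is logical); each element is one test unit
agent <- create_agent(tbl = small_table) %>%
  specially(fn = function(tbl) tbl$d > tbl$a * 100) %>%
  interrogate()
```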
You can group. It's a feature we haven't seen here called segmentation. Basically, through the `segments` argument you can group by IDs, filter, or have multiple group-bys, essentially, and then run the validation separately on just those segments. That's one way of doing it. Another way is, of course, preconditions: you can do anything you want with the table, so you can pre-filter your table for one step, or do it in multiple steps. But yeah, there are two ways, segmentation and preconditions, neither of which we've seen here, but they're there.
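A segmented step might be sketched like this, assuming grouping on `small_table`'s `f` column:

```r
# One validation step per distinct value of column f; each segment
# is interrogated separately and reported as its own step
agent <- create_agent(tbl = small_table) %>%
  col_vals_gt(columns = vars(d), value = 100, segments = vars(f)) %>%
  interrogate()
```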
You can. Because the table is essentially a gt table, there's a way to save it as a PDF. What you would do is use `get_agent_report()`, which we saw near the bottom; that immediately gives you a gt object, and then you use `gtsave()` and specify a file name ending in .pdf. That should get you an image inside a PDF. It's not a full LaTeX-style table; it's essentially just the image in a PDF file.
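Saving the report as a PDF could be sketched as follows (gt's PDF output relies on a headless browser under the hood):

```r
library(gt)

# The agent report is a gt table, so gtsave() can write it to PDF
get_agent_report(agent) %>%
  gtsave(filename = "validation_report.pdf")
```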
Okay. Fantastic. All right, I think that's the end of the questions that have come in. From the R Consortium, I really want to thank everyone who attended. Thanks for your time and interest; we really appreciate it. We see you as part of the community, and we hope to see you again. Rich, thanks so much for all this great information. Like I said, actually seeing your IDE, watching you go through the code and render it right there, it's just fantastic. Thank you so much. You're welcome. Yeah. Okay. Thanks very much, everyone. Take care. Okay. Bye.

