Resources

How to use {pointblank} to understand, validate, and document your data

Rich Iannone

Abstract: This workshop will focus on the data quality and data documentation workflows that the pointblank package makes possible. We will use functions that allow us to: (1) quickly understand a new dataset; (2) validate tabular data using rules that are based on our understanding of the data; (3) fully document a table by describing its variables and other important details. The pointblank package was created to scale from small validation problems ("Let's make certain this table fits my expectations before moving on") to very large ones ("Let's validate these 35 database tables every day and ensure data quality is maintained"), and we'll delve into all sorts of data quality scenarios so you'll be comfortable using this package in your organization. Data documentation is seemingly and unfortunately less common in organizations (maybe even less than the practice of data validation). We'll learn all about how this doesn't have to be a tedious chore. The pointblank package allows you to create informative and beautiful data documentation that will help others understand what's in all those tables that are so vital to an organization.

Resources mentioned in the workshop:

* Workshop GitHub repository: https://github.com/rich-iannone/pointblank-workshop
* pointblank documentation: https://rstudio.github.io/pointblank/

Feb 8, 2026
1h 53min

image: thumbnail.jpg

Transcript#

This transcript was generated automatically and may contain errors.

Hello, everybody. We are excited to keep the positive and wonderful momentum going for the rPharma 2025 workshop series. And there's been a whole bunch of fantastic ones already. I am super excited to introduce our instructor today.

I definitely consider him one of my R friends IRL. I've known him for years, and he has built amazing tools in the R ecosystem. You may know him for his awesome gt package, but he does way more than that. And we are going to be talking about the pointblank package today, which is one of those underrated gems in the R ecosystem, and more recently, the Python ecosystem as well. So without further ado, I'm really excited to introduce our instructor for today, software engineer at Posit, Rich Iannone.

So we're really excited to have you here. And I'll be watching the chat for any questions. And I'll serve them up your way whenever you like. But the floor is yours. We're ready to get all into pointblank. It's one of my favorite packages.

OK, I know there was a Python workshop just before this one. Just so you know, we're doing this in R, but there actually is a pointblank for Python as well. So if you were using R and you had to go over to Python, you can transfer those pointblank skills right to the Python package, which is pretty close to the R one.

Just a little tidbit of information. I'm going to share my screen. I am hoping that this is very large, large enough for everybody on the workshop. You may want to zoom in a couple of notches. Just a bit more. OK, yeah, maybe like this. A little more. A little more, OK, here we go. This should do it.

All right. That's more for the on-demand recording, especially. Yeah, good call. So everything that is here is actually in a repo under my personal account: rich-iannone/pointblank-workshop. So all that you see here is there, and you can always get it at any time. And of course, it's available inside of Posit Cloud, which is where I'm using this. But here it is right there, pointblank-workshop. It's right in front of me.

So basically, over these next two hours I'm going to go through just a few of these QMD files: the first, second, third, and the fifth, I believe. That'll fit nicely in the two-hour time, and it'll leave lots of time for questions and for your own experimentation. So let's get right into this.

Introduction to data validation

OK, so let's first introduce what data validation is. pointblank is for data validation, obviously. And the workflow that we use inside this package is: we create something called an agent, with a function called `create_agent()`. So that's the first thing you've got to do. That's where you actually give the object your data, the table you want to validate. So that's very important. It's the first step.

And then in the middle, there are as many functions as you want: the declaration of validation steps using validation functions. The more you add, the more stringent your validation process becomes. And then finally, there's a third step, the interrogation of the data. This is where the agent finally carries out the validation tasks. Up to that point, it just has the data, but it doesn't actually do anything with it. So you have to use `interrogate()` to initiate the whole process.

So let's actually run through this. I'll take a table that's inside the pointblank package. It's called `small_table`, and it's really very small. I'm just going to run this real fast. And we see here, it's got 13 rows and eight columns. It's pretty good for validation tasks, just to experiment with the package. OK, so I'm going to close this up.

OK, so I'm going to do that first thing. I'm going to call this object `agent_1`. We're going to use the `create_agent()` function. The first argument is the table; we give it `small_table` from the package. We could optionally give it a friendly name, "small_table", in quotes. That's just for the display. In the end, you're going to get a report, and it's good to see the actual name of the table, so you're just echoing that right here. Another thing you can do is add a label. This is just a line of text that, again, appears in the report. And I'll show you the report.

It's going to be like this. I'm going to hit Play, and it appears right here. OK, so I have to scroll a little bit because this whole area doesn't actually capture the entire table. What we see here in the top, pointblank validation plan, no interrogation performed, because all we did in the first step here is just create the agent. We didn't add any steps. So we see that this table is actually empty. So let's make it less empty.
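Putting that first step into code, a minimal sketch based on the transcript might look like this (the label text and object name are illustrative):

```r
library(pointblank)

# Step 1: create the agent, handing it the target table,
# a friendly display name, and a label for the report
agent_1 <-
  create_agent(
    tbl = small_table,
    tbl_name = "small_table",
    label = "Workshop example"  # illustrative label text
  )

agent_1  # printing shows the (still empty) validation plan
```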

Adding validation steps

Let's add some steps. So in this case, we're going to take the agent that we already made. We save it over the initial one, but this time, we're going to add some steps. And there are lots of these validation functions, probably 30 or more. But the good thing is they all start with the same sort of pattern. There are groups of them: they start with `col_vals_` or `col_is_`, and sometimes they have special names like `rows_distinct()`. But there are just a few groups of these validation functions. So I'm going to add a few of these, and I'll go through what they actually do a little bit later. I just want to show you the entire process first. So let's actually run this.

OK, so now we get another table as a result. Again, we see no interrogation performed because we never used the last step, which is interrogate. But I wanted to show you what happens if you just don't do that. OK, so we get a number of steps. And this essentially is the validation plan. We have a number of steps. And what will happen when you use interrogate right here is that those steps will eventually have data.

OK, so we see it right now. OK, and now this is actually a good time to understand what this does, what the individual steps do, and what this table shows. So what we've done here, I'm just going back a little bit, we're checking that the column values in column D are greater than or equal to value 0, so a static value. We could compare it against other columns. But in this case, we're just choosing a static value inside of this value argument.

The second step is checking for values within a set. Column F has text values, and we're saying that we expect values in column F to be either low, mid, or high. OK, three more, which begin with `col_is_`: we're saying columns E and D are logical and numeric, respectively, and that columns B and F are character columns. Finally, for the last validation step, we have `rows_distinct()`. In this one, we're saying that all the rows are unique.
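Based on the steps just described, the validation plan might be sketched like this (argument values are inferred from the transcript):

```r
agent_1 <-
  agent_1 %>%
  col_vals_gte(columns = vars(d), value = 0) %>%            # values in d >= 0
  col_vals_in_set(
    columns = vars(f), set = c("low", "mid", "high")
  ) %>%
  col_is_logical(columns = vars(e)) %>%                     # type checks
  col_is_numeric(columns = vars(d)) %>%
  col_is_character(columns = vars(b, f)) %>%
  rows_distinct() %>%                                       # all rows unique
  interrogate()                                             # carry out the plan
```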

OK, those are our expectations. So we see in the actual output, though, it's kind of nice to see this because we sort of see our validation functions being echoed here. And if you hover over them, we sort of see plain English expectation text. So that sort of echoes what we expect to see. And if you scroll a little bit to the right, we see something called units, pass, and fail.

OK, so units: for each step, there are individual atomic test units. For these `col_vals_*` functions, we're checking values down an entire column, and there are 13 rows. So really, there are 13 individual checks. And because we see there are 13 units and 13 pass for the validation, and none that fail, we see that all the values match this expectation that all values in D are greater than or equal to 0.

Great, and we also see other ones, which are identity checks, like whether columns are of certain types. They have one test unit because they're checking one thing, the type, and we see that they all pass. It's only when we get to `rows_distinct()`, where we expect distinct rows across all of our columns (basically each row is different from the others), that we get two failures. So out of 13 test units, 11 pass and 2 fail. That's because we don't have entirely distinct rows in our `small_table` dataset.

Another cool thing is that we see right here a button, CSV. So we can actually click that, and what we get is it just downloads a CSV, which is kind of cool if someone doesn't have access to pointblank or R, and you just give them this file, and they're curious about which actual rows are the failure rows. They can actually look at this and see what the problems are, and this is available for all row-based validation functions. They can produce these if there are failures.


Understanding the validation report

OK, that was like the very first thing, and I want to check if there's any questions at all, because I see there's a few, but really it's just Phil. And OK, that's great. OK, but just know if there are any questions, feel free to ask in the chat, and I will take a break, and I will answer those.

OK, so I went over the validation report. Here it is in text form. I didn't go over everything. Basically, step is important. That's the column, the first column. Each time you use a validation function, it will make a new step. Sometimes one use of a validation function will make multiple steps. Say if you define multiple columns in a column type check, it will just expand to several rows or several steps.

Columns: if there are column-based validation checks, this will become relevant, and there'll be names of target columns in there. Same goes for the values column. If there are any values associated with a step, say, for instance, you're checking for a column greater than some value, it will appear in this column. So it gives you a little bit of information. Argument values will appear in this table, but not always; most of the time, they will.

There are actually two more columns, TBL and EVAL. OK, so I'm going to scroll back up here. In this case, we always get a circle with a line facing right, and we get a checkmark. There's a feature we haven't seen yet: you can modify the table within a step, and it's isolated to that step. You can mutate the table, in other words. In all our cases, we never did that. And the EVAL column means there were no evaluation issues with the table. We didn't use a column that didn't exist. If we had used a column that didn't exist, there'd be some problem flagged here, and this step would fail or be deactivated, and you would just know.

So that's what this does. It tells you if there's any problems with the validation itself, or the setup of the validation, but not so much the validation results. OK.

Pass, fail, and I'm going to go over these next few columns, W, S, and N. I'll just tell you right now, these are indicators that show whether we have a certain condition that is met, and these conditions are warn, stop, or notify. And as we see here, we see nothing but dashes, because we never actually set those up, but we will in a future example.

And finally, I just want to note that for pass and fail, when you look at the values in there, there are actually two values in each of the cells. What we get is the absolute number of test units that passed, and then the fraction thereof. And for fail, again, it's the number of test units that failed, and the fraction of test units that failed out of the total. And the last column will have the CSV buttons if there are any problem rows to see.

Setting action levels and thresholds

So now, I'm going to take this a step further. I'm going to use something called `action_levels()` to set thresholds. And we have three of them that we saw: warn, stop, and notify. What you do is you call this function, and I'm doing it in a way where I assign the result to an object, and then I introduce it to the `create_agent()` function under the actions argument right there. But the cool thing here is that if you set the object and print it, we can actually see what it prints out.

It just echoes back what we set: a warn failure threshold of 0.15 of all test units. So basically, if 15% of test units fail within a step, then it will enter the warn state. And we're increasing that for the other states. For stop, it's going to be 0.25. For notify, it's 0.35, or 35%.
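The threshold setup described here might be sketched as follows (the object name `al` is illustrative):

```r
library(pointblank)

# Fractional thresholds apply to the share of failing test units in a step
al <-
  action_levels(
    warn_at   = 0.15,  # warn if 15% of a step's test units fail
    stop_at   = 0.25,
    notify_at = 0.35
  )

al  # printing echoes the thresholds back
```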

So I'm going to use that object in another agent, calling this one `agent_2`. I'm going to operate over `small_table` again. And this time, we're going to use some other validation functions, some that we saw before, some not. And we're going to interrogate. So we're just going to see what the effect of the `action_levels()` object is. So I'm going to run that.

Great. So now I'm going to scroll down and see the results. We get another validation report table. And in this case, we see that we have circles here instead of dashes, because we set all three levels. We also see this in the title at the top: these are our global validation thresholds right there, exactly what we put in, 0.15, 0.25, and 0.35. And we also see that in this step right here, `col_vals_lt()`, it turned out not to be the case that all values in column A were less than 7. So we have some failures. And the failure fraction reached 0.15, which equals the warn threshold right there, which was set to 0.15. So we have a solid filled yellow circle.

And for this regex step, we're expecting that values in column B should match this regex. We have a lot of failures; over 50% of the test units failed. So we're actually well above the 0.35, or 35%, threshold. So all of these circles become filled. This is a visual way to see that our testing has failed above acceptable levels.

And we also get these visual indicators on the side, which, at a glance, show us how things went. Basically, solid green means everything passed without fail. Light green means things did fail, but no thresholds were met. And these colors are basically the same as the threshold levels here, so they show the highest level reached each time.

Great. So I'm going to make another one. And I'm going to demonstrate that it's actually pretty versatile what you can do with this action levels function. You can actually set it globally, as we did before. What we did is we take that object, which is made from this function, and pass it into actions. So that sort of set our thresholds for every single step. Or if you wanted to, you can set it for individual steps.

So over here, we see `col_vals_between()`. We set some parameters: we're saying that the values in column D should be between 0 and 4,000. But we also want to set some custom actions. So we're saying: warn if there's one test unit failure, stop if there are three, notify if there are five. So this will apply just to that step. Every other step will receive the thresholds from the global object. So I'm going to run this.
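A sketch of this mix of global and per-step thresholds, assuming the global thresholds object from earlier is called `al` (names here are illustrative):

```r
agent_3 <-
  create_agent(tbl = small_table, actions = al) %>%  # global thresholds
  col_vals_between(
    columns = vars(d), left = 0, right = 4000,
    # per-step override, using absolute test-unit counts
    actions = action_levels(warn_at = 1, stop_at = 3, notify_at = 5)
  ) %>%
  interrogate()
```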

Scroll down. Great. And we see that right here. So the one we set is `col_vals_between()`. We have one test unit failure, and that corresponds to this right here, warn at 1. So we can either use absolute numbers when we define values inside of `action_levels()`, like we did here, or we can use fractions, and those are what you see on the bottom.

Overview of validation functions

And speaking of validation functions, there are quite a few: 36 at last count. I'm just going to go through them real quickly because there are so many, but I want to give you the basic features. So whenever you see `col_vals_*` like this, it just means that we're checking values within a column: whether they're less than something, less than or equal, equal, and so on and so forth. So that's one distinct group.

Another group starts with `rows_`. We're checking something about entire rows. In this case, we're checking whether rows are unique or distinct, or whether rows are complete, which means that they have no NA values within a row. Another class of these validation functions is `col_is_*`. With these, we're just type checking. So we're saying this column is a character column: you give the columns that should be character. If they're integer, you can use the `col_is_integer()` validation function.

We may check for columns existing. For that, we use `col_exists()` and provide some column names. We're saying these columns should exist in the table that we provided. `col_schema_match()` is a pretty comprehensive one, which involves a helper function. With that, we're basically saying our table has these columns and these types, and we're setting some more parameters for loosening the checks, like specifying whether column order should be maintained. There are lots of options in there, but it's a way to check the structure of the table.

There are a few specialized ones here. I'm going to go through just two more. With `row_count_match()`, we're saying that our table should have the exact number of rows that we specified, or, compared to another table, it should have the same number of rows as that table. And `col_count_match()` is the same sort of thing, except for column counts. So a lot to keep track of, but hopefully it's a little bit easier since we have some conventions in naming. OK, great.

Q&A: agents, multi-agents, and type checking

OK, so I'm going to check for questions real quick. I don't think there are any, but it's always good. OK, there are some questions. Great.

Can you create an agent without specifying a table so it can be applied to different tables? No, we don't have that yet, but it's been asked a few times, and it's actually a pretty good idea for doing that sort of thing. What you could do, for using something a few times, is create some YAML. It's not really covered in this workshop, but you can actually create a YAML definition of an interrogation, and then use `set_tbl()` to set the table and then interrogate on that. So there are ways to swap in a table with the `set_tbl()` function. But unfortunately, the agent does require, initially, some sort of table to be set. We do want to loosen that and make this whole thing a bit more lazy, but until we get that laziness, we can't really do that with the agent object.

OK, another question right here. Can you stack agents to create one report on different tables? You actually can. What you can do is create multiple agents, and then once you have those objects, you can create something called a multi-agent. That gives you two reporting options: either a wide report, where you're checking the same agent over time, or, more or less what you're asking, basically multiple validation reports in one printout, which is what's offered by that function.

OK: `col_is_integer()`, if the type is character but the values are castable, does that pass or fail? Currently, it fails. It's actually just checking the column class. But that's actually a really good idea for a feature, integer-like: whether it could be cast to an integer and succeed. So I definitely encourage filing issues for this, and I may think about it myself as well, because this is an evolving package, and it's always good to have new ideas like that, that I haven't really even thought of.

Using validation functions directly on data

OK, great. I'm going to keep going in this one file. I'm going to show you a different way to do validations using the same functions. This is a little bit strange, but we can actually use validation functions directly on the data. We don't have to specify an agent. What you do is you just take your data object, and you pipe it to one of the functions from the list we saw, and then specify the arguments as before. So what does it even do? If there's a problem, it will stop. If there's no problem, it'll just sail through: the `small_table` object will just continue. So it acts as a bit of a filter, a filter for failure, as it were.

I'm going to run this to show you that, in the end, because this passes, we just get `small_table` back. OK, run. Great, no problems. We have `small_table`, no errors appeared, so we're good; this validation passed. Spoiler alert: this next chunk has error=TRUE set, which means that if you render this, it would still render. But obviously, we have some sort of problem now. When we narrow this range to 5 and 10 instead of 0 and 10, we have some values which are outside this range. So there's going to be a problem here. So let's run this.

OK, this is kind of cool. We don't get the table; we actually get this error message. And the error message is not bad in terms of description quality. Maybe it's a little bit confusing at the beginning, because it says: exceedance of failed test units, where values in A should have been between 5 and 10. So let me tell you what this exceedance of failed test units is. By default, the threshold is 1, and you can change it with the threshold argument; you can set thresholds just within this function. If you get one failure, then that failure or any more will trigger this error. So that's what we see here: the threshold is 1, and the failures are 10.
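The two cases just described might look like this as a sketch:

```r
library(pointblank)

# Passing case: the table flows through untouched
small_table %>%
  col_vals_between(columns = vars(a), left = 0, right = 10)

# Failing case: narrowing the range triggers an error once the
# default failure threshold (1 test unit) is exceeded
small_table %>%
  col_vals_between(columns = vars(a), left = 5, right = 10)
```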

`col_is_*` will check whether a column is of a certain type. So let's look at a few cases. Here's one that's passing: we're checking that column B is a character column. It is, because we see the table in the end when we pass the table through this function. Great. And in this case, date is not numeric; it's a date column. So I'm going to run this. OK, see, it's the same sort of thing. It actually tells you what type the column really is, which is nice. And of course, it's one test unit; there's only one failure possible, because we're just checking one thing, which is the type.

OK, so this is kind of cool. What you can do with these is insert these checks inside pipelines, if you want to. You can pipe to another function that does something, or just have a check run by itself like this. And if you don't set error=TRUE, rendering a notebook will just essentially stop if there are problems, which is what you want when you don't want to proceed with those problems.

OK, let's check out these rows functions. We have two of them: `rows_distinct()` and `rows_complete()`. Again, they check entire rows. And a cool thing is that there's actually a columns argument, and what it does is narrow things down. It's almost like selecting from the table beforehand, before doing the check. So you can check a subset. So if you wanted to have, say, a distinctness check excluding some columns, we can do that by providing the columns argument. Yeah, right there.

OK, but we're not doing that here. And we already know from before, if you're paying attention, that we don't have entirely distinct rows. So we actually get this error right here. Great. And for distinct rows, well, in this case, we're filtering a little bit. We're just taking a few of the top rows, and within that region of the table, we do have entirely distinct rows. So we get the table there.

OK, I'm just going to close this up. OK, `rows_complete()`. It's the same sort of situation. We do have NA values within our table, so the entire table will not have complete rows. But here's a case where we're just checking a few columns: date_time, date, A, and B, so four columns. Within the cells of those columns, there are no NAs down the entire table. So this will pass just fine, and we see the table.
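The completeness checks just walked through might be sketched like this (column names follow pointblank's `small_table`):

```r
library(pointblank)

# Fails: the full table contains NA cells somewhere
small_table %>% rows_complete()

# Passes: these four columns contain no NAs anywhere
small_table %>%
  rows_complete(columns = vars(date_time, date, a, b))
```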

Great. And we've got a few functions here which end with `_match`. They validate whether some aspect of the table, as a whole, matches some sort of expectation. So let's go through these again. `col_schema_match()`: that's column schema matching. `row_count_match()`: we're expecting a certain number of rows; either you give a literal value, or you compare against another table. This is great for things like joins: you have the original table, you're doing a join on that table, and you want to make sure that you have the same number of rows in the end, because some joins will add or drop rows. So this could be good for that sort of thing.

Same with `col_count_match()`. Maybe you have an expectation of how many columns there are after an operation, or just how many columns you expect in the initial data. You haven't seen it in a while, and it changes regularly; you want to say it has this many columns. `tbl_match()` is a bit of a strange one; maybe you won't reach for it very often. It's basically: does the table match some other table, exactly? Maybe you want that, maybe you don't. There are probably good reasons to use it, and it's a good one to have.

OK, let's look at `row_count_match()`. In this case, with `small_table`, we're expecting 13 rows in that table. And it's true, because it didn't error here. We took the data and passed it to this; if it didn't have 13 rows, you'd get an error. For columns, I said before we can compare against another table. It just so happens that `small_table` has the same number of columns as the penguins dataset in palmerpenguins. I'm just showing this as a slightly odd example, but it shows that you can use another table: it'll fetch the count from that table and use it here. So I'm running that, and it gives us `small_table` again, because the column count matches.
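A sketch of both count checks, assuming the palmerpenguins package is installed:

```r
library(pointblank)

# Expect exactly 13 rows; errors if the count differs
small_table %>% row_count_match(count = 13)

# Compare column counts against another table
small_table %>% col_count_match(count = palmerpenguins::penguins)
```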

Great. I'm going to check again for questions, because I like to do that. Table match versus the diffdf package? Not quite as powerful, but it will tell you. Basically, what happens is we count the columns, we count the rows, check the whole schema, and then check the values, and that's the order we do it in. If it errors at any point, we tell you what occurred. But we don't get deep into it; we don't report individual value differences. So it's not as detailed as the other package. It's like diffdf lite, I guess you could say.

Oh, and there's a nice, more obscure suggestion: a check for haven's tagged NA values (something like a col_has_tagged_na function). That looks fantastic, and I'll make a note of all this for future feature requests. And there's also, OK, here we go: maybe you can explain how to write my own validation functions.

We can do that. Let's scroll back up here. With `specially()`, you provide your own function, and I think what it expects is a vector of logical values, which is used for the reporting.

Or you can have a table where the final column is logical. As long as the function that you provide to `specially()` produces one of those, it'll work within the framework, especially the reporting framework. It needs some way to say how many test units you have and how many passed or failed. So you can definitely use this and supply any function.
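A minimal sketch of a custom check via `specially()`; the function shown (one logical test unit per row of column d) is an arbitrary example, not one from the workshop:

```r
library(pointblank)

agent_s <-
  create_agent(tbl = small_table) %>%
  specially(
    # fn receives the table; returning a logical vector yields
    # one test unit per element
    fn = function(x) !is.na(x$d)
  ) %>%
  interrogate()
```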

Getting data extracts from failing validations

OK, so now I'm going to show you what to do when things fail and you want to know more. So you can use those CSV buttons in the validation report. That's great. That's more for other people. Like if you're sharing this HTML file, which has a report, they can just click on the CSV button and get the right extracts of things that failed.

But you have the program right in front of you. You have pointblank, and we can do more. We can actually use the `get_data_extracts()` function, which does a lot more. Basically, it gives you those extracts in table form.

So `agent_3`, which is up here, the very last one, has some failures. So we have some CSV buttons. And every time you see a CSV button, that means there are, of course, rows that failed. And you can use this function right here, `get_data_extracts()`, to get the specific values.

If you just use it on the agent itself, like this, what we get is a list. It's presented a little bit in this way inside of a notebook right here. But this is essentially a list of tables, of whatever data frame type you passed in.

So this is good, and you can peel off the table you want from the list. Or there's the argument i: you provide a step number, and that step corresponds to the step in the report table. In this case, we're not getting a list; we're getting the actual extract of the table where there are row failures.

So I'm going to link this back; I'm going to scroll back a little bit. So this is step 9. I'm going to take a look and see what this was. Scrolling back: this is `col_vals_in_set()`. So we expected that values in F should be in the set of low and mid.

So if you look at column F in this extract table, we see we get nothing but high. So these are the failing rows; none of them are low or mid. So that's how we get this. Again, for small tables, you can easily inspect this. But as tables become huge, obviously, you're going to want this to see what the problems are.

So, just to wrap up this part: you can use the entire agent object to get a list of tables, or you can use specific values of i, which correspond to steps from the validation report table.
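Both ways of pulling extracts, sketched (the agent name and step number follow the transcript's example):

```r
library(pointblank)

# All extracts at once: a list, one element per step that had failing rows
get_data_extracts(agent_3)

# A single step's extract, by its step number in the report
get_data_extracts(agent_3, i = 9)
```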

Sundered data

So we have failing data. Another way to get at it is something called sundered data. Basically, it's just a split of the original table into good rows and bad rows. Bad rows are ones that have some failure within the row from any validation step, and good rows are those rows that have no failures at all within their cells, from any validation step.

So it's kind of like the more validation steps you have, the more potential there is for failures and more potential for bad rows.

`get_sundered_data()` is the function where you can get these pieces of the table. So let's actually use that. Let's make a new agent right here. We're getting small table again; I'm giving it a table name, small table again, and a little description. Just two validation functions here: values are greater than some value, and values are between two values.

But here's something kind of cool. In this case, I'm using `vars()` and I'm saying that values in column c should be between neighboring values of column a and column d. So it can use values in columns, like they're beside each other, and in this case we're not using literal values. And of course you can put a literal value here and a column value there, so you can mix and match whichever way you want.

Another cool argument, since I'm bringing this up, is `na_pass`. What I'm saying here is that if any of these columns have NAs, we're just going to say that the row passed; we're going to excuse that row. By default this is FALSE, but we can just say NAs count as a pass.

So I will run this. Great. And now we see the report. Two validation functions were used, so there are two steps. We have some failures. We didn't set any action levels right here, so we have nothing here but dashes. And this is a light green, which means that there were some failures, but no thresholds were exceeded, because there's nothing set here. Okay, so we have some failures, but we don't know where those failures are. Say, for instance, we want to use that data and just filter out the stuff that failed, no matter which step it was. We can do that with `get_sundered_data()`.

So I'm going to run this. We're taking the agent and passing it right to `get_sundered_data()` with no arguments, just by itself. And what we get here is the pass piece by default. So it must've been that, for these two validations, everything went fine for those rows.

So we can maybe just prove it a little bit. Column d is always greater than a thousand; we sort of see that. Yeah, that checks out. Okay, what's the other one? c is between a and d. Okay, so c in this case: yeah, a is smaller, d is much larger. If these are smaller, we're fine, and these pass.

This is the good set of data, but we can also get the complementary data piece: all the rows that failed. Basically, it will be the other rows. So I'm going to use `get_sundered_data(type = "fail")` and run that right now. Okay, so this is eight rows, and this is five rows, and in total we have 13 rows. So it's always a split of the data in some way. If you add these rows up together, they should of course add up to the count of the original number of rows.

So you can do that if you want the failed piece. Another thing you can do is get a combined dataset. What that does is it adds a flag column. Well, better just to show you; I'm going to run that right now. So we get this new column right here called `.pb_combined`, and it has either pass or fail, for whether all the values in a row passed or whether there were some failures somewhere within the row. And the labels within `.pb_combined` are flexible; you can set what they are.

That's the `pass_fail` argument: you can provide a vector like TRUE and FALSE, and if you run that, we get TRUE or FALSE instead of the text, which is pass or fail. Or you can provide numbers, in this case zero or one for pass or fail, which just saves you an additional transformation because it does it here for you.
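The sundering workflow above can be sketched roughly like this (again using pointblank's built-in `small_table`; the exact validation values are illustrative):

```r
library(pointblank)

agent <-
  create_agent(tbl = small_table) %>%
  col_vals_gt(columns = vars(d), value = 1000) %>%
  col_vals_between(
    columns = vars(c),
    left = vars(a), right = vars(d),  # column-to-column comparison
    na_pass = TRUE                    # rows with NAs count as passing
  ) %>%
  interrogate()

get_sundered_data(agent)                  # the "pass" piece (the default)
get_sundered_data(agent, type = "fail")   # the complementary "fail" piece

# Combined: the original table plus a `.pb_combined` flag column,
# here relabeled to TRUE/FALSE instead of "pass"/"fail"
get_sundered_data(agent, type = "combined", pass_fail = c(TRUE, FALSE))
```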

Q&A: sundered data and the X list

Again, I will check for questions now just to make sure. Can sundered data that failed also have a list of failures? No, I don't think it has that. Basically, it's kind of take it or leave it at this point: there's some error, but it doesn't actually tell you which errors occurred where, and there could be multiple in different cells. There's no way to visualize that, or even find out, right now.

Now, before we end off this part, I want to show you another thing you can do, because there's lots of information: you can get something called an x-list from an agent. And really, it's just a giant list. Better to show you than to just talk about it, because it's just right here.

So use that function on the agent, and printing it gives you this display, because it's actually a lot of information; it's a list full of information. It's basically all the metadata from the interrogation: the time of start, the time of end, things like labels and descriptive things, the table itself, and really the important thing is the results here, like the step, some more metadata, and right here are the values of how many test units there were, how many passed, how many failed, and the fraction of test units which passed or failed.

Then we have conditions right here, and a few other things. So let's take a look at this; let's use one of these. Let's use `$n` from the x-list and run that. In this case, we see a vector; these are test units. It's a little bit strange because we know this is step one and this is the last step, but we just get these values, 1 and 13; basically, the position shows you the step number. Let's look at `$n_passed`; again, it's going step by step, starting at one. So step one had 1 passing and step two had 11 passing, and so on and so forth. The fraction is kind of the same.

Great. So this is the fraction of test units that passed at each step. Okay, great. So you can actually take this and do your own thing with it. If you want to do something aside from the report you get, you can make your own report. So in this case, we're making a table: we can just pass in the x-list, some vectors, and get ourselves a table. So: steps, whether we have warn, stop, or notify, and then we can move on with that and do other things, presumably.
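A small sketch of pulling per-step results out of the x-list (the agent setup here is an assumed example using `small_table`):

```r
library(pointblank)

agent <-
  create_agent(tbl = small_table) %>%
  col_vals_gt(columns = vars(d), value = 1000) %>%
  col_vals_in_set(columns = vars(f), set = c("low", "mid", "high")) %>%
  interrogate()

x <- get_agent_x_list(agent)

x$n         # test units per step (position = step number)
x$n_passed  # passing test units per step
x$f_passed  # fraction of test units passed per step
x$warn      # TRUE/FALSE per step for the `warn` condition
```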

Emailing validation reports

Another cool thing you can do is email the interrogation report. And we do that with this function called `email_create()`. We have a package called blastula, and it works well with pointblank. So if you ever want to run these validations in production or in CI, and then notify yourself or notify someone else, you can do that with this. So I'm going to show you what agent 3 looks like when you pass it through `email_create()`.

It's actually going to appear here on the side, so I can open that up. What we get here is an email message body with some text saying when the validation was done, how many steps there are, and sort of a mini version of the validation report. It's missing some of the columns, but this is meant to fit within an email, so it's going to be kind of small. Great, so that's what you get there.

And the idea is that when you have that, you could use this sort of construction right here. Say, for instance, you had some sort of test failure with notify. If any of these results from the x-list are TRUE for notify, then you can basically take that email and send it off through blastula like that.
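That construction might look roughly like this; the recipient addresses and the stored credentials key are hypothetical placeholders, and `agent` is assumed to be an already-interrogated agent:

```r
library(pointblank)
library(blastula)

# Build the email body from the interrogated agent
email_object <- email_create(agent)

# If any step tripped the `notify` condition, send the report
x <- get_agent_x_list(agent)

if (any(x$notify)) {
  email_object %>%
    smtp_send(
      to = "data-team@example.com",            # hypothetical recipient
      from = "validation-bot@example.com",     # hypothetical sender
      subject = "Data validation: notify condition triggered",
      credentials = creds_key("smtp_creds")    # assumes stored credentials
    )
}
```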

Customizing the agent report

And speaking of reports being customized or changed, you can actually customize reports yourself a little bit. We have a function called `get_agent_report()`. Normally when you have an agent, you can just print it and it'll provide the report, but if you use `get_agent_report()`, it'll do the same thing while giving you a chance to use some options. For instance, we can change the title. I will run that and show you; in this case, it's the third example, and we're using Markdown here, and it just applies that Markdown to the title. So that's one option: changing the title.

You can do things like arrange steps in certain ways. In this case, "severity" is a keyword we can use. If I run that, it'll put the most severe steps first and then go down to the less severe ones. So that's a nice reordering; it shows you what failed first.

We can do that and also just keep the failure states. So basically, in this case, we're cutting out everything that was essentially green. Great. Last thing, you may not need it, but you can always change the language of the agent report, with the `lang` parameter; we have quite a few of them. So if you run this, it'll just change the language, and there are quite a few languages supported. All the text will just change to suit that language.
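The options mentioned above, combined into one call (a sketch, assuming `agent` is an interrogated agent; the title text and language choice are arbitrary):

```r
library(pointblank)

get_agent_report(
  agent,
  title = "**Validation** of `small_table`",  # Markdown is supported here
  arrange_by = "severity",                    # most severe steps first
  keep = "fail_states",                       # drop the all-green steps
  lang = "de"                                 # German report text
)
```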

So that was basically it for the first QMD file of this workshop. I want to provide maybe 10 minutes; if you have the environment set up, I really encourage you to play with some of these cells. Change a few things, maybe insert your own data. And I'm thinking of taking a long break, maybe a 10-minute break, up until five to the hour; it's more like an eight-minute break at this point. I'll be here to answer questions, obviously. So it's a mini Q&A period plus experimentation time.

Q&A: preconditions, database support, and design philosophy

All right. Thank you, Rich. And I'll keep an eye on the questions as they come in. Great first session; I'm really excited to learn even more of what pointblank's capable of.

Thanks. And there's actually a question here, so I'm going to answer that: is it possible to manipulate the data in the cells of a table before running a step, for example, is one value bigger than another if a table contains a 95% CI? I'm not going to read the rest, because the answer is yes. We didn't go through it because otherwise this workshop would be very long, but there's actually an argument in each of these functions which constitute a step.

I'm going to go to one of these. It's called preconditions. I'm looking for autocomplete, but it's not completing; I'll just go to the function itself. One sec; I chose the one function that didn't have it. Let's go back up. So the col_vals_*() functions, really most of them except for a few, allow you to change the table.

Preconditions. Okay, perfect. So in this case, I'm going to just move that away. You can provide an expression, and you can mutate the table before proceeding with the validation step. And this is only for that step; it's isolated within that step. So you can add a column which has some value you want to check against; you can have multiple columns. And then any values you provide for columns and value will work on the table that has been mutated.

So you can add a column here which exists only in the mutated table, say column x. It's not available in the original table, but it is after this function is applied to the table. Okay, so that's how you do it: you use formula syntax for that, or you can provide a bare function. So basically, you have some way to mutate your table, to shape it to what you need to make a validation work.
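A minimal sketch of this idea, using `small_table` and a made-up derived column `x` (the sum of columns a and c is purely illustrative):

```r
library(pointblank)

# Validate a derived column `x` that only exists after mutation;
# the mutation via `preconditions` is scoped to this step alone
agent <-
  create_agent(tbl = small_table) %>%
  col_vals_gt(
    columns = vars(x),
    value = 5,
    preconditions = ~ . %>% dplyr::mutate(x = a + c)
  ) %>%
  interrogate()
```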

I have a question, Rich, if you don't mind: are DuckDB databases supported in pointblank? They are, yeah. Nice. So basically anything, or most everything, in dbplyr and dplyr will be supported; it's used as the backend. Even databases like MySQL and SQLite are supported, and Postgres is in there. And we do a lot of tests to make sure that it does work, so we do a bit of verification on that end. Basically, as dbplyr gets better, this will get better in working with it.

Okay, so some questions about philosophical and/or design differences between this and assertr and validate. Yeah, I've taken a look at those packages. Basically, these are very similar, except the big thing I wanted to design for was having reporting that was publishable. So this is a good fit for sharing with people that don't use the program, or, you know, people who are allergic to console output, things like that. And you don't have to create your own reports; we have something serviceable that is available right away.

And, yeah, that's, I think, the biggest distinction; also the composable nature of mutating data and then handling failures and reporting. Say, for instance, you provide a column that doesn't exist: the report doesn't fail. It just keeps going; it just mentions that there's a problem you should address. So those are some of the things: we want to make sure that things generally work, but give you more information to fix things up if it has to do with a user problem, not so much a data problem itself.


Summary of the validation workflow

Basically, for data validation in pointblank, you need an agent, a set of validation functions, and then that last `interrogate()` function call. Would it have to collect all the NYC taxi data if only a couple of columns are getting validated? Nothing is being collected; everything's being done on the database side, for everything. There should be no transfer of large datasets to your machine. I hope that answers the question: because we're using dbplyr, we don't collect until the very end, and then we collect a small amount of data, which is basically just the tally of results, which is not very large. It's condensed down before we actually pull anything back.

Okay. So the agent creates a report that tries to be informative and easily explainable, as in those reports you've seen multiple times. We can set data quality thresholds with action levels; there can be default data quality thresholds and step-specific thresholds. So basically, you can use the `action_levels()` function, or the object created from it, within certain steps or globally within `create_agent()`. There are 36 validation functions, and they have a similar interface and many common arguments. They can be used with an agent or directly on the data, and directly on the data means that you pass data through, or you error based on data quality test units failing.

We can get data extracts pertaining to failing test units in rows of the input dataset with `get_data_extracts()`. It has an `i` argument if you want to match it to a step; that way you just get the table itself and not a list of tables. There's the option to obtain sundered data, which is the input data split by whether cells contain failing test units, and you can get either the passing piece, the failing piece, or a combined version.

There's a large amount of validation data available with that `get_agent_x_list()` function. It creates a list, and the print method shows you what's in the list, basically ways to access it and also how many objects are in each one of those. If you want to email as a result of validation, say when you're running this on a schedule, you can do that with `email_create()` combined with functions in the blastula R package. And finally, there are some customization opportunities with `get_agent_report()`. Normally you just print the agent itself, but you can use this to give it some options for display.

Scan data

Great. So I'm going to move on to the next QMD file in our set of QMD files from this workshop. And that is about scanning data and also using a function called `draft_validation()`.

Okay, so scan data. What this is, is basically just a large description of a dataset. You run one function, `scan_data()`, and give it a table; it provides an HTML report in the end with many different sections. You can cut it down or reorder it just by using this `sections` argument right here. Each of these letters corresponds to a section: overview, variables, interactions, correlations, missing values, and finally sample. In this case, I just want the overview, variables, missing values, and a sample of the data.
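A one-liner sketch of that call, using `small_table` as a stand-in for the workshop's dataset:

```r
library(pointblank)

# O = Overview, V = Variables, I = Interactions, C = Correlations,
# M = Missing values, S = Sample; inclusion and order follow the string
scan_data(small_table, sections = "OVMS")
```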

So I'm going to run this, and I'm not quite sure where it'll appear, either on the right or below, because it's been ages since I used this way of working. But we see here it's going; it'll take some time, usually depending on how many columns you have. But now it's done. Okay, so now we're taking a look at the table scan. If you click here, you get shortcuts to different sections, or you can just scroll down.

We have the overview: basically the dimensions of the table, some information about the processing of the table and... oh, it doesn't make it easy to scroll down. Jeez, I'm going to use this: variables. There we go. I'm going to try my best here to scroll down without the IDE fighting me, but we see here one column. It's a character column; we're showing just some basic data about the column, and toggle details, this is where some good stuff is, and I may not be able to get to it.

I'm going to actually take this and pop it out into a browser or somewhere else; anywhere else is good. Okay, there we go. This is much better. So if you click on this, we get some details. You can see things like, well, this is a character column, so we see common values, that's just the way it prints them out, and string lengths for the column, starting with the plot there.

Let's look at something that's more like a measurement; let's look at a column in depth here. There we go. So we see some stats about these values. Kind of cool: you can observe outliers here really nicely. So this is a little bit of pre-data-validation data quality: you just start scoping things out and really understanding the data. You see common values; it's not so great for numeric data, but it might be good for things like categorical data. With the min/max and max/min slices, we can sort of see frequency counts of different values.

But the really cool stuff is in the first one, the descriptive stats tables. And we've got a few more here, with some delta values here as well. And a cool thing here, though it's not really labeled: this is missing values. If it's all blue, there are none; this is the top of the table, this is the bottom, and it's sliced up into different sectors. We see here that we don't get missing values, because it's all blue in these columns, everything from this column here to the date_egg column.

But we start to see missing values in certain parts, you know, right here in this sector and also at the bottom; there's some fraction of missing values for these columns. We have even more missing values distributed throughout. And what makes lots of sense is comments: these are totally optional, it seems, so there would be lots of missing values. Basically, when there are comments, they're pretty sparse. But it's a cool way to understand the data: if you have just one missing value, this won't be blue. So, out of really large tables, you can flag really quickly that there are some missing values.

And then finally, there's the sample table. Not much here, but basically this is just a gt table: the first five rows and the last five rows of the entire table, and you can scroll across to see things and just get a feel for what the data is. In the sample table, we sort of see here, if you have really good eyes, that there are missing values, but it gives you the actual row numbers at the end, so you even have a sense of the size of the table.

Great. So that's the table scan. It's kind of cool; you can publish this as well. I've done it in the past; it's basically just a chunk of HTML in the end. And yeah, some additional instructions here: basically, the sections can be reordered. You can use a string like this, with a capital letter for each of the different parts, and you can omit some things. That's a way of customizing it.

I was going to show you something here too. Let's look at some pharma tables; let's look at CDISC data. Here we go. So in this case, I'm getting the adverse events table, and I just want the first few sections, which are overview, then sample second, and then variables last.

And I do apologize, it does sometimes take a little bit of time. Also, because we're on the cloud, it takes even longer, because these instances are not very powerful. But I'll just talk over this part and get back to it. And again, just like the validation reports, you can have them in different languages. Oh, it did pop out, so that's great. There are certain things I want to show you here; I'm going to pop this out as before, because this has some really cool features.

So again, overview, this time the sample is in the middle, and then down here we have this cool thing: labels. They get shown here too. This is a labelled dataset, so the labels just appear right inside this report, which is kind of cool. And there are lots more columns; that's why this took a long time. I believe it uses ggplot2 for this, because there are plots in some of these; we'll see these character value ones and their string lengths. This takes a little bit of time to process, so that requires a bit more waiting. But kind of cool: it does scale to lots of columns, you just have to wait a bit.

Great. Okay, I'm not going to show the other one, but I'm going to go to the next section here. Well, actually, I won't skip this: you can export this to an HTML file. There's a function called `export_report()`. This works for all sorts of objects within pointblank: it could be a validation report, it could be this thing, it could be the next thing I'm going to show you later. There are different types of reports, and exporting them can be easily handled with this one function called `export_report()`. You provide a file name, you provide the object, and it does the thing that you want: it writes that to disk.
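For instance, a sketch of exporting both kinds of objects (the file names here are made up, and `agent` is assumed to be an interrogated agent):

```r
library(pointblank)

# Export a table scan to HTML
tbl_scan <- scan_data(small_table, sections = "OVMS")
export_report(tbl_scan, filename = "small_table-scan.html")

# Works the same way for an agent's validation report
export_report(agent, filename = "agent-report.html")
```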

Draft validation

Okay. Now, I showed you this before: these validations right here, these validation workflows with lots of steps. I'm going to scroll up till I get to one which has lots. Here we go; great, all this stuff here. So I'm going to get back to that. But I realize that this could take some time, and maybe it's a bit discouraging when you start off with the package: you have to understand what these all are, you have to refer to the documentation quite a bit. It's a little hard to get started initially. But we have something for that: we have a function called `draft_validation()`.

What it will do is generate a draft validation plan. Basically, it'll just write a plan for you, into a new R file, using an input data table. Just like we did with `scan_data()`, you pop in the input table and away it goes. But with `draft_validation()`, the data table will be scanned to learn about its column data, and it'll provide you a set of starter validation steps. So a plan will be written for you. Let's look at storms from dplyr; okay, quite a few columns.

And what I'm saying is that `draft_validation()`, if you include that, will look at it and make a file for you.

We have to do this thing here, which is a little bit strange: we have to provide formula notation, this sort of tilde, initially, because we're trying to be lazy with it. It's really an expression for getting the data. So you can do more things with this if you want, like mutate or select if you wanted to. But we have to do this to make it work.

So I'm not going to run it, because it's already been run, actually. Storms validation: it's right here. Or I should say it was there; here it is. I'll pop that aside. So it creates this.

This is a brand new R file, and it generates `library(pointblank)`, which is what you need as a minimum. We fetched the data through the dplyr storms table; it can also take an expression like this to get the data, instead of just having it inline. You can provide a recipe for getting the data, essentially, this way.

And it even adds action levels by itself, just to show you it's possible. It's just a template thing; it doesn't mean much, you didn't provide that, but it does it to show you that the functionality exists, essentially. It does tell you things like, you know, that the validation plan was generated by this function. And a cool thing is it provides this; we don't have it updated yet for native pipes, but that's a thing we'll do pretty soon.

It does provide comments between each of these, and provides these with values that work, essentially. So if you were to run this, it would essentially run, and it basically just uses the limits of the data for lots of these between checks. And then at the end, it does a really nice thing as you get towards it: it does `rows_distinct()` and `col_schema_match()`. It calls up the `col_schema()` helper function to define the schema for the table, and it puts that within `col_schema_match()` because it requires this object.

And so if you run this check again with the table, and there's a change in the schema, this will flag that. Then it ends with `interrogate()`, and it just prints out the agent in the end. So it gives you this file, essentially, is what I'm showing you. Pretty cool; it's a good step.
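The generating call itself is short; a sketch, assuming the `filename` argument names the output file (dropping the `.R` extension):

```r
library(pointblank)

# The leading `~` makes this a recipe for getting the data:
# the expression is evaluated lazily, when the draft is generated
draft_validation(
  tbl = ~ dplyr::storms,
  filename = "storms_validation"  # writes storms_validation.R
)
```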

It's good when you have big tables and you're holding off on doing validations because it does take some time; this is a good way to speed things up a little bit initially. So I just want to show you that it's ready to run: all validation steps run without failing test units. That's the promise, at least; I've run it on multiple datasets to make sure that it seems to do the right thing.

It even knows about certain things like latitude and longitude columns, just by sniffing the column name and some of the content within it. So it does a few extra things.

So basically, all I want to say for this file is: it's a great idea to examine data you're unfamiliar with using `scan_data()`, because it can inform your data quality checks. And, speaking of which, the `draft_validation()` function gives you a really good quick start for data validation, because it scans your data, but in a different way, to create a file.

Expect and test function variants

So now I'm going to go to the next document here, the expect and test functions QMD. This is all about using some cousins of the validation functions, their variants. They all begin with `expect_` or begin with `test_`, but they have the same names. So basically, for all those 36 functions you saw, you're just glomming `expect_` or `test_` onto the front, and they have different functionality. I'll show you what that functionality really is.

So let's start with expect. The `expect_` prefix indicates that those functions are to be used with unit testing, like testthat. If you've ever used testthat, a lot of its functions begin with expect.

Another one is the `test_` prefix. Those variants of the functions will give you either TRUE or FALSE and nothing else; they produce logical output. So this is great for conditionals, or programming with data. Say, for instance, you don't want to carry on down a certain programming path based on some data quality issue: you can do that, or you can redirect to something else, probably a message or some sort of failure or what have you. It gives you options for programming with data.

So let's first look at the expectation functions, the ones that begin with `expect_`. The testthat package has a collection of functions beginning with expect. These functions here follow the same convention, and they can actually be used within the standard testthat workflow. You just provide a `test_that()` block with some name, then provide pointblank's expect functions; you can mix them with testthat's functions as well. And it works fine with testthat's reporter.

But, as opposed to what you do with testthat, we're testing data instead. So let's look at our table here, small table, and we want to test the values in column c. These aren't really trustworthy column names in this table, but that's what we have. So we have these values within that column, and say, for instance, we always expect that those values are between zero and 10. NA values, those are fine; we're going to permit those, we're going to pass those. So we can use `expect_col_vals_between()`. It's just like the `col_vals_between()` we saw before, just with expect in front of it. So we can run that.

Okay, so `na_pass` is TRUE because we're fine with NAs passing. These are the left and right values, and this is the column we're checking. Okay, run that. Oh, we see nothing; nothing happened, which is actually good, because in testthat, expect functions do nothing until they fail.

Okay, let's try something that fails. We're doing it with a different range, and obviously not all the values are going to be within this range. So I'm going to run this. Great, we get an error. It's just like the error you would get if you used `col_vals_between()` directly on the data; it's very similar to that, except this works within the testthat framework. That's the key difference here.

Okay. So if you're doing testthat, use `expect_col_vals_between()`, or there are lots of options: you can use the test variants to get TRUE or FALSE, and then you can always use testthat's `expect_true()` or `expect_false()`. So there are a lot of ways to do it.
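The two variants side by side, as a sketch using `small_table` (the 0-to-10 range matches the example above):

```r
library(pointblank)

# Expectation form: silent on success, fails loudly under testthat
expect_col_vals_between(
  small_table,
  columns = vars(c),
  left = 0, right = 10,
  na_pass = TRUE
)

# Test form: returns a single TRUE or FALSE instead
test_col_vals_between(
  small_table,
  columns = vars(c),
  left = 0, right = 10,
  na_pass = TRUE
)
```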

So just like there are 36 regular validation functions, we get 36 expect functions and 36 test functions. One thing that you can do, and this is a little suggestion: you can use `draft_validation()`, which we saw before, to generate a validation plan with the data as the primary input. And then we have another function called `write_testthat_file()`. It'll create a testthat .R file using the agent from the draft validation file.

That is actually right here. I'm using game revenue; it's used in a similar way as the other one. Use the tilde in front, give it a file name, and it'll actually create this testthat file. And it says right here: generated by pointblank. It runs a library statement, it loads the data, and then it creates a number of individual `test_that()` calls to wrap up these expect functions in the end. And because it begins with test, you can run tests right away, which is kind of cool. So if you have, say, a dataset in your package, you can test that dataset. Maybe it changes once in a while; you want to update it, and you want to make sure that things don't go out of your expected parameters when things get updated. You can always run this as part of your package checks, which is kind of neat.

So the test_ functions. I alluded to this before: they begin with test_, they match all the other ones, and they give us a single TRUE or FALSE. So, for instance, if you wanted a script that errors if there are missing values in the date_time column of the small_table dataset, we can write this. There we go. So if not `test_col_vals_not_null()`, it's a little bit confusing because we're using "not" twice, then we hit the stop() statement. I'm going to run this, and we'll see that we don't get the error at all. This one does. In this case, we're getting booleans from these two tests, we're negating both, and we're combining them with an OR. A little bit confusing, especially in a workshop, but I'm just going to run it, and it shows that because of these tests, we do have a FALSE passing through in this case, and we do get this: there are problems with small_table.
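A minimal sketch of that guard pattern, using a test_ function to halt a script when a check fails:

```r
library(pointblank)

# test_ functions return a single TRUE or FALSE, so they slot
# directly into control flow; negate to stop on a failing check
if (!test_col_vals_not_null(small_table, columns = vars(date_time))) {
  stop("There are problems with `small_table`.")
}
```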

So these are just variants of the validation functions. Each one is available as a test_ or expect_ variant, and again, you can validate tabular data in a testthat workflow. I do this sometimes for other packages. I've got a package called fontawesome. I'm always changing the dataset because there are always new icons being added, so I use these expect_ functions there to make sure that certain things are what I expect, that nothing gets too different, because I have functions acting on the data. So I definitely use these expect_ functions a lot. And the test_ functions are really good for programming. You don't have to use them; there are other ways to get TRUE or FALSE. We saw with the x-list that you can just obtain that vector of TRUEs and FALSEs, but this is not bad for a quick Boolean or logical value when you need it.


Again, I will check for questions. There is cucumber in addition to testthat. Yeah. When you really break it down, there are actually quite a few unit testing frameworks for R. And I hope that, with the options available, you can use these with a wide range of them.

Introduction to data documentation

I'm not going to take a break after this one because it was a short section. I'm going to go right into the next section, which is section 5. We're skipping 4 because this is only a two-hour workshop. And this is a pretty important section. Essentially, this is the documenting part of pointblank. So this is an introduction to data documentation. And it uses an entirely different set of functions, which are not about validation, but more about gathering metadata, explaining your data, and publishing what the data is. This is more aligned with things like data dictionaries and data documentation; it goes by different names. But I do think that a good thing to do, often, is to document our datasets. I'm reading this line right here. And we can do that through the use of several functions that let us define portions of information about a table.

Okay. So, again, I just can't let go of small_table; I'm going to use that here as well. So let's document the small_table dataset. It's easily available, right inside the package. Let's have a look at it again. Here it is, all of it. It's got these columns: date_time, date, a, b, c, d, e, f. Not the most exciting table to document, but as an example it's not bad; we can at least see all of it in front of us. There are only 13 rows. But really, when you're documenting data, you're not so concerned about the rows; you're more concerned with what the columns are, what the table represents, and how you use it. More description-type things than values. Values do play a part in this, and I'll show you that soon.

So, to start the process, we have another create function. This time it's called `create_informant()`. This creates an informant object, which is a bit different from the agent object, but it looks similar in terms of the way it's set up. We have tbl, tbl_name, and label, the same arguments as in `create_agent()`. I'm going to pass in small_table; the tbl_name is just a name that's going to be used in the report; and we're going to call the label "metadata for the small_table dataset". And as before, you just have to print the object to see it. Okay, so we'll run that now. This table is not as wide as the other one, so I don't have to scroll to the right constantly like I did before.
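A minimal sketch of that starting point (the label wording here is a paraphrase, not the exact text from the workshop):

```r
library(pointblank)

# Create an informant for small_table; tbl_name appears in the
# report header and label describes this documentation effort
informant <- create_informant(
  tbl = small_table,
  tbl_name = "small_table",
  label = "Metadata for the `small_table` dataset"
)

informant  # printing the object renders the (initially sparse) report
```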

And again, we can customize this. You don't always have to see "Pointblank Information" at the top; you can change that, and there's a function for it. But what it shows you is what type of table this is, the name that we gave it (small_table), and the dimensions at the top, because it's pretty important to know more about the table. This is all about table metadata, so we have rows and columns right away. Cool. And what we have here is another section; it really just has the columns: each column name in a box, with each column type just beside it. So that's what we start off with. It's not much, but it's something to go on. It's automatically generated information.

Obviously, the program knows nothing about what's actually in the columns or how to describe them; you provide that, and I'll show you how to do that next. So we have three functions here. They all begin with info_. `info_tabular()`: this is where you just add some text pertaining to the data table as a whole; you want to give an introductory paragraph, and it goes to the top. `info_columns()`: this is for adding information for each table column. Right now we see a lack of information, we have nothing here to the sides, and this is how you add information to each of these rows. And finally, `info_section()`: this is more of a free-form thing. It lets you add sections of text and provide ancillary information. So basically, you can add multiple sections, add text within those sections, and use Markdown. You can provide all sorts of information: the table users, stakeholders, contact info, whatever you want. You can add sections based on any metadata that is important, but they will be at the bottom of the report table at the end.

Okay, enough talking. Let's actually try doing this with some code. So I'm going to start over again: `create_informant()` with small_table, example two. In this case, I'm using `info_tabular()`. The argument is `description`. You can leave that name out; there's only one argument here, and we're just saying this table is included in the pointblank package. I've used Markdown here; you can see the two stars on each side, so this will be bold. Markdown is just assumed to be there. Then `info_columns()`, with date_time: I'm only describing the date_time column. And what I'm providing is `info`. Actually, that's not a fixed argument name; you can use any argument name you want, and it will basically be used as the label in what's written below. You'll see. So it will say "this column is full of timestamps", and that's how I'm describing date_time.

Okay, finally, this one is a bit more complicated: `info_section()`. You provide your arguments again: the section_name, "further information", and then within that section, "examples and documentation". And then we can write some multiline Markdown here. Okay, I'm going to run this. Great. I'm going to scroll down a little and see if I can pop this out, just so we can see it a little better. There we go. That's much better.
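Put together, those three info_ calls could look roughly like this (the text strings are paraphrased from the workshop, not copied exactly):

```r
library(pointblank)

informant <- create_informant(
  tbl = small_table,
  tbl_name = "small_table",
  label = "Metadata for the `small_table` dataset"
) %>%
  # Text for the TABLE section at the top; the argument name
  # ("description") becomes the subsection label
  info_tabular(
    description = "This table is included in the **pointblank** package."
  ) %>%
  # Text attached to one column; any argument name works as the label
  info_columns(
    columns = vars(date_time),
    info = "This column is full of timestamps."
  ) %>%
  # A free-form section appended at the bottom of the report
  info_section(
    section_name = "further information",
    `examples and documentation` = "Examples can be found in the package documentation."
  )

informant
```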

Okay. So, this is info_tabular(): the text I provided under description, "the table is included in the pointblank package". I'm going to scroll up so you can see where it comes from. Description: that's where this label comes from, and this bit of text is right here, rendered as HTML from the Markdown. Columns: the first column is date_time. We happened to provide that in info_columns(), with `info`, which is right here. Basically, this argument name becomes this label, and "this column is full of timestamps" is this text right here. So you can associate metadata with each of the columns. Great. And then, finally, info_section(): that lets you add sections at the back. It's a bit multi-tiered, because we have one call of it, and this is a real argument called section_name. We called that "further information"; that's right here. And then within that, we can have multiple of these, but we just chose one: "examples and documentation". That's right here, and then the large piece of text we have right here. And this is a link, because it's all just Markdown, and this does work, because it's using commonmark to render that. Okay, I'm going to close this up. So that's a very basic way of doing this. And, of course, we can go further; we can describe each of the columns.

But I want to keep going a little more on what you use these for. So, info_tabular(), again, is for text at the top, in the table section. Use named arguments to define the subsection names; it can be "description" or something else, and you just put in some text. Some ideas: a high-level summary of the table; what each row of the table represents; the main users of the table; a description of how the table is generated; information on the frequency of updates. These are just suggestions, but they're pretty good.

Okay, I'll get a little more into info_columns(), though, because that's where most of the work should be done: you want to describe each of the columns. So, create_informant() is how you create the report, and info_columns() is for adding information. To add information to the columns, you'd have a separate call for each column; with info_columns() itself, there's no way to provide information in bulk. There is a way with a separate function, but not with info_columns().

Okay, so let's try this with a much more interesting dataset: the penguins dataset. This is a very long bit of code, but basically what I'm doing is creating the informant using penguins, providing some information at the top (this is a label), and then for each of the columns I'm providing a description, using Markdown wherever possible. And another cool thing, you don't quite see it yet, but you might notice that I used `ends_with("mm")`. All these columns here end with "mm". And if you use info_columns(), all the text you add is additive: it will just keep appending text to the back, so the order is essential here. Basically, these columns right here have different bits of text, but they all pertain to some sort of units, millimeters in this case. So we can actually use this: anything that ends with "mm", which is these columns right here, will get this extra bit of text in parentheses, "in units of millimeters". Kind of cool. So let's run that, so we can see it.

Okay, I'm going to scroll down. I'm going to learn my lesson and just make this into a separate window, expand it, and zoom a little. Great. So we add the pieces of text, and the new thing I want to note is that this "in units of millimeters" wasn't typed three times, once for each of these; it was added to multiple columns at once with a tidyselect statement. Because each time you call info_columns(), it keeps appending text to the same key at the end, we can add common text to multiple columns. So that's kind of cool.

Okay, scroll down. So, again, tidyselect: we can use all the helpers. The most useful ones are `starts_with()`, `ends_with()`, `contains()`, `matches()`, and `everything()`, in case you want to add some piece of text to every column. Pretty useful when the text is common to, or the same across, multiple columns.
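A small sketch of the additive behavior, assuming the penguins data is available (e.g., via the palmerpenguins package); the description strings are illustrative:

```r
library(pointblank)
library(palmerpenguins)  # assumed source of the penguins dataset

# info_columns() text is additive: a later call with a tidyselect
# helper appends to whatever each matched column already has
informant <- create_informant(tbl = penguins, tbl_name = "penguins") %>%
  info_columns(vars(bill_length_mm), info = "A number denoting bill length") %>%
  info_columns(vars(bill_depth_mm), info = "A number denoting bill depth") %>%
  # Appended to every column whose name ends with "mm"
  info_columns(ends_with("mm"), info = "(in units of millimeters)")
```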

Okay, so info_section(). That's a bit of a bigger function, in that you can add multiple sections at the end of that informant table. This is information that doesn't fit at the top, in the table or column sections, so it's almost like additional or reference information that's important. Some of it might be source information, for instance, citations for the papers the dataset was involved with. So I'm going to take that same object, add on info_section() with the section_name "source", and then provide this multiline Markdown. Okay, I'm going to run that and scroll down.

There we are. Okay, so this hasn't changed, but this has: now we have references appearing here, and they're all nicely formatted. These DOI links do what they're supposed to; the links are nicely formatted because of some internal styling we do. That's great. And the note here is just an additional piece of text we put in. So you can do anything you want with this last function, info_section(): as many sections as you want, as much text as you want, for things that don't need to be up front and don't pertain exactly to individual columns. Great. We'll close this.

Okay. And then, yeah, basically I'm going through some ideas about other things that could go in the back. I'll just run through them, because why not: info related to the source; definitions and explanations; persons responsible; further details on table production; important issues with the table; notes on upcoming changes; links to other information; report-generation metadata, including things like update history and persons responsible. If you can do it with Markdown, then you can do it here.

Okay. And I want to say, as before, you don't have to see that title at the top, which says "Pointblank Informant". You can get rid of it or change it. You can use `get_informant_report()`, and you provide options there, mostly the title, and you can change the width of the table a bit so it fits whatever main document you're putting this into. This is kind of like a keyword; it's all in the documentation for get_informant_report(). You can include some static text, or this keyword with colons on each side, `:tbl_name:`, so the title just becomes the table name. Let's actually run that right here. It just says "penguins", because that's the dataset we're using; the table name is hoisted up as the title of this whole report. Great. And you might notice that it's not as wide, because it's now 600px. And, as I've mentioned before, we have this function that does a lot, called `export_report()`. You can use it with scan-data reports, you can use it with the validation report, and it gives you an opportunity to take that object and make HTML out of it, so you can embed it anywhere else you want. So that's export_report().
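A sketch of those two finishing steps (the output file name here is hypothetical):

```r
library(pointblank)

informant <- create_informant(tbl = small_table, tbl_name = "small_table")

# Customize the report; the ":tbl_name:" keyword hoists the table
# name up as the report title
report <- get_informant_report(informant, title = ":tbl_name:")

# Write the report out as standalone HTML for embedding elsewhere
export_report(informant, filename = "small_table_documentation.html")
```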

Getting deeper into data documentation

And I would say there's definitely more to this than that. There's actually another .qmd which we hopefully want to get done. It's called "getting deeper into data documentation", and I won't waste any time getting to it, because it's actually pretty useful; it might seal the deal in terms of using this. Okay. So, there's more you can do. You can provide static text, or you can actually use parts of the data to change the text. That's the concept of snippets. `info_snippet()` is one function, and there are other versions of this, but let me show what it does, with some examples. I want to set the stage by saying that lots of information about the table could come from the table itself. If you wanted to show some categorical values from a column, you could take those from the data instead of typing them in yourself. Or if you wanted the range of values in an important numeric column, you could obtain that from the data. It could be KPI values that you calculate using data from the table. So you can fashion a function to get information from the table and include it in your documentation. And that's done with the info_snippet() function.

Let's look at small_table again; we're getting away from penguins. Because it's simple, it might illustrate our points better. Okay, very small table; we've got numeric values and things like that. So for column d, we could, for instance, make a small pipeline here to get the mean and round the value. Great. That just proves that we can take the data and use it. So let's do that. And apologies for the magrittr pipes; these examples don't use the native R pipe yet, but you could do it that way, too. The idea here is that you create an informant and then create a snippet: we're giving it a name, mean_d, and providing a function to get that value. I'll run that. Great. I'm not showing the report, because we're actually not done yet. We have mean_d.

Okay, but what are we doing with it? Well, we're going to insert it into some text. So the next thing we do is take that same object and use info_columns(). In this case, we're looking at column d, and we're saying "this column contains fairly large numbers, much larger than the numbers in column a". You have to write something. "The mean value is", and, check this out, in curly braces, `{mean_d}`. That corresponds to this right here. So we define what the snippet is, and then we can insert it with curly braces. Let's actually run that.

Great. Okay, let's keep going. I'm going to write the whole thing again, so we see it all together: create_informant(), then create the snippet, so we're getting some aggregate value from our data right here, calling it mean_d, and inserting it in a piece of text. And then, because we have to access the data, we have to use `incorporate()`. It's just like interrogate(): if we're doing any of this where we actually have to access the data, we have to call a function, and in this case that function is incorporate(). Okay, I'm going to run that. Kind of cool. It says some things in the console: incorporation started, there's a single snippet to process, information gathered, snippets processed, information built. That's encouraging. I'm going to run this now.
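The whole snippet workflow, sketched end to end (the descriptive text is paraphrased from the workshop):

```r
library(pointblank)

informant <- create_informant(
  tbl = small_table,
  tbl_name = "small_table"
) %>%
  # Define a snippet: a function applied to the target table
  info_snippet(
    snippet_name = "mean_d",
    fn = ~ . %>% .$d %>% mean() %>% round(2)
  ) %>%
  # Reference the snippet inside text using curly braces
  info_columns(
    vars(d),
    info = "This column contains fairly large numbers. The mean value is {mean_d}."
  ) %>%
  # incorporate() queries the data and fills in the snippet values,
  # much like interrogate() does for validation plans
  incorporate()
```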

Okay. Now I'm really excited to get down to the column I changed, which is d; that's the only column. Oh, of course, it's cut off. There we go. "This column contains fairly large values", blah, blah, blah, "and the mean value is", oh, cool, this value I got right here. Nice. "Which is far greater than any number in that other column." It doesn't matter exactly what I said; the key thing is that this value can change over time. Your table may change, but these values update. As long as you run the whole thing and use incorporate(), it will take that recipe and insert values obtained from the table, which is kind of cool for evolving tables. It's really neat: it means the documentation doesn't get stale, and you don't have to use manual values. And we have some variants: we built in snippet functions for common things. So we've got a few


functions available to make it easier to get commonly used text snippets. Okay. So, `snip_list()`: what it does is get a list of column categories. Nice. `snip_lowest()` and `snip_highest()`: get the lowest and highest value from a column; you can use those in a min/max type situation. `snip_stats()`: get an inline statistical summary. Kind of cool. So, let's see how those are used. Each of these functions can be used directly as the `fn` value in info_snippet(), and we don't have to specify the table, since it assumes the target table is the one we're snipping data from. So let's go back to a good dataset, penguins. We're going to get two snippets with snip_list(): basically, a list of values from the species column and a list of values from the island column. We want to know our islands.
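A sketch of the two snip_list() snippets, again assuming penguins is available via palmerpenguins (the info text is paraphrased):

```r
library(pointblank)
library(palmerpenguins)  # assumed source of the penguins dataset

# snip_* helpers can be passed directly as `fn`; they act on the
# informant's target table automatically
informant <- create_informant(tbl = penguins, tbl_name = "penguins") %>%
  info_snippet(snippet_name = "species_snippet", fn = snip_list(column = "species")) %>%
  info_snippet(snippet_name = "island_snippet", fn = snip_list(column = "island")) %>%
  info_columns(
    vars(species),
    info = "A factor denoting penguin species ({species_snippet})."
  ) %>%
  info_columns(
    vars(island),
    info = "A factor denoting the island in the Palmer Archipelago ({island_snippet})."
  ) %>%
  incorporate()
```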

Oh, and here's where it's used. Nice. And here's where the other one is used. Cool. Okay, let's see it. I'm running that, and the console said two snippets to process, all check marks. Now let's look at this; I'll zoom and expand. Very cool. I'm going to compare that against the text that we have. "A factor denoting penguin species", and then we said {species_snippet}. Cool. That's right here, and it got that list because of snip_list(). This can change over time; more species can be added, and this will just keep track of that. And we use the other snippet, {island_snippet}, right here: "a factor denoting the island in the Palmer Archipelago, Antarctica", and we have that list right here. It was obtained from the data, which is kind of cool. Great.

Nice. So, I'll close that. It also works for numeric values. Let's use snip_list() to provide a text snippet based on values in the year column. So, in this case, we're creating a snippet for year. I'm using the same object over and over, so it's going to add to the same thing, and I use incorporate() as well, which is really important if you use snippets. I'm going to run this. So the year should be put in as well. Here we go: info for year, "the study year: 2007, 2008, and 2009". Nice.

Great. So it can be used for numbers and also for categories and text. Okay, snip_lowest() and snip_highest(); I think you know where I'm going with this one. I'm going to create those snippets. Here's a cool thing: min_depth. I specified it down here, but I'm using it up here. So the order doesn't really matter in the end, which is nice; you don't have to worry about that. I just want to demonstrate that: the snippet definitions could be down below, or all gathered up in one spot, and it shouldn't affect the outcome. Great.
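A sketch of that ordering point: the text references the snippets before they are defined, and incorporate() resolves everything regardless (snippet and column names here follow the workshop's example; the exact wording is paraphrased):

```r
library(pointblank)
library(palmerpenguins)  # assumed source of the penguins dataset

# Snippets can be defined after the text that references them;
# ordering doesn't matter once incorporate() runs
informant <- create_informant(tbl = penguins, tbl_name = "penguins") %>%
  info_columns(
    vars(bill_depth_mm),
    info = "A number denoting bill depth, in the range {min_depth} to {max_depth} mm."
  ) %>%
  info_snippet(snippet_name = "min_depth", fn = snip_lowest(column = "bill_depth_mm")) %>%
  info_snippet(snippet_name = "max_depth", fn = snip_highest(column = "bill_depth_mm")) %>%
  incorporate()
```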

Now, we have lots of snippets, because we keep adding to this one informant object. There are six in total; it keeps the earlier ones, too. Okay, so let's print this out.

Great. We'll expand this. There we are. Okay, there we go. "Integer denoting flipper length, in units of millimeters. Largest observed is 231 mm." Great, because of this text right here; we just insert the number there. And this one for bill_depth_mm: "a number denoting bill depth, in the range of 13.1 to 21.5 millimeters." That is up here. Great.

So, yeah, kind of cool: your data can change, and this keeps up with those changes. You can do all sorts of other things, too. There are small conventions: links can be written like this, or with link text like that, and dates can be enclosed in parentheses and formatted differently, is what I'm saying. pointblank will try to find certain things and make them look a little different. So let's take a look at that in action. In this case, I'm putting a date within parentheses. Good stuff. And links, well, I don't know if you can see this; this is not much different from regular Markdown, but the dates thing is definitely different. Let's run this. Again, we're using the same object over and over, rewriting it and seeing the change. No new snippets, just more text. I'm going to run this.

Expand that out. Okay. So we see here there's a date; it just gets its own line. Sometimes dates are important, so that's what that does: if you include parentheses around a date, it'll do that. And here, some nicely styled links; the underline appears when you hover over them. And that's all I want to say about that, but it's kind of cool that you can do it.

Labels and styled text

Okay, labels. This is maybe getting a little far, but we can actually enclose bits of text in double parentheses or triple parentheses for different types of labels: a rectangular label or a rounded-rect label. So you can do things like this. Let's run it. No new snippets, same as before; it's good to see that it reports that. We just add more text, and we have to scroll to the very bottom to see it. Of course, I'm going to pop this out first, then scroll and maybe zoom in a little. "Additional notes. Data types: factor, numeric, integer." That's based on this right here: the section we added, "additional notes", with the "data types" subsection. And it just has text, with the triple parens making these rounded-rectangle labels. Sometimes good to have.

Great. Okay, styled text. There are many more things you can do. You can also add, in double angle brackets, some CSS; it's a quick way to style text any way you want. These are just a few suggestions that work well: you can change the color of the text or the background color; do some text decoration, like overlines, line-throughs, or underlines; change the font style, the letter spacing, a border around it; change the font, bold, italic, the font size. All sorts of things.

Let's try one more example where we take these labels and change the border values. This may be the last one, I think. I'm going to run this. And just so you know, there are two vignettes on the pointblank website that show all this stuff, basically running through exactly this, aside from the content here, so this reference material is pretty easy to find. But here: we have these values in square-bordered labels, and they have color fills for each. This might be good if you want to color-code certain things; not bad for metadata, especially things like keywords and data types.

So, that's that. And that's finally it. Basically, we have a lot of snip_ functions, and we have info_snippet(), where you can provide your own function; again, we have several provided helpers that you can just use inside the `fn` argument. That's a way to query the table and produce text that goes within your own text; you just use curly braces to insert the snippet values. And when you print it out, you always see this sort of thing: "incorporation started". You have to use incorporate(). If you see that you still have curly braces in your text, that means you didn't use incorporate(), or maybe something worse happened, but that's usually what happens (or doesn't happen) if you don't use incorporate().

And that is kind of it. Hopefully this is a way to fairly easily get data documentation into an object that you can publish, providing information for other people that might want to use the data. I'm going to look at questions; that really is the end of my content here. For the last 5 or 10 minutes, I'll just take questions. It could be any question; doesn't matter to me, I'll answer them.

Q&A

And in addition to questions, there are actually some really funny comments in here, like "mean_d smells like glue". It is kind of like glue; I think it might even use glue under the hood. So, basically, you don't have to call glue yourself with any of these text snippets.

Okay, got another question: what are some of the more important use cases for generating HTML data documentation? So, if I publish data in a package, I might write a Quarto document about it. It might be for colleagues, or people you're onboarding with datasets, especially if you have core datasets that update frequently, where understanding them is essential to carrying out your work. This is a quote-unquote nice thing to do, but I also want to make it easy. And there are actually some functions I haven't covered here which allow you to do the bulk thing: use a data frame filled with descriptions of columns and apply them to the columns in one pass. And you can imagine combining that with Excel, for instance: someone might start in Excel, you grab that data into a data frame, and then use this function to apply those pieces of text instead of calling info_columns() multiple times. Which is fine, but there are other ways to do it.

So, got a question here: when pointblank is used on database tables, are the calculations performed in the database, or is the data pulled locally first? Definitely the first one: in the database. We do everything with dbplyr, and we don't purposefully collect any data before everything is summarized down into the final results.


Eric said his adventures with pointblank can be seen. I loved this before, and I have to admit, I almost forgot about it, but it's wild. I loved it, and there's a link to it right now in the webinar chat; go see it. I do recommend it.

Okay, well, thank you for joining us today. And if you don't mind, I may ask one or two more things, because it's not every day we get to have you talk to us about all things pointblank. I've always been one of those people that subscribes to learning by doing, and so that little table contest was a great nudge to take your previous materials; and now, with this material, there's even more of an abundance of material to choose from. And one thing pointblank really opened my eyes to is that this could be a great use case for a CI/CD situation, where maybe you're getting data updated every month or every week, but you have the same checks you want to run every time, right? And you don't want to do that manually. So being able to take what you've taught us here and put it into a set of R scripts that could be called in a cron-job-like fashion on GitHub Actions, I think, is a real game changer for what we deal with a lot in our day-to-day here.

Yeah. And I've known people that do that. And there's a whole thing I haven't talked about, which is the YAML workflow. You can express all these things in YAML if you want to; it's actually more shareable with people that don't use R and don't want to see R function calls. Then there's one function, called `yaml_agent_interrogate()`, I believe. So that's a good way to run in production, because it splits things: you have these very committable YAML files, and just one R script which calls the YAML file with the data, which is pretty nice for CI.
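A sketch of that split between a committable YAML file and a one-line production script (file names are hypothetical, and the speaker himself hedges on the exact function name):

```r
library(pointblank)

# Write an agent's validation plan out as a shareable YAML file
agent <- create_agent(tbl = ~ small_table, tbl_name = "small_table") %>%
  col_vals_not_null(vars(date_time))

yaml_write(agent, filename = "small_table_checks.yml")

# In production (e.g., a scheduled CI job), one call reads the YAML,
# rebuilds the agent, and interrogates the target data
agent <- yaml_agent_interrogate(filename = "small_table_checks.yml")
```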

Yeah. And maybe not quite a related question, but would you say there's a type of data you wouldn't use pointblank for? Are there any types of variables or structures where you'd say, nah, probably not a good fit? Probably non-tabular data, and also some things you can have in tables, like list columns; it doesn't really handle those. Unless you handle it yourself with specially(). You can do pretty much anything you want with that one function, which gives you carte blanche. Just write your own UDF, your own function, to create a validation. As long as you give it the result it expects, which is a vector of logicals, or a table whose final column is logical, you can do pretty much whatever you can imagine in terms of validation.
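A sketch of that escape hatch with specially(); the table and its list column here are hypothetical:

```r
library(pointblank)

# Hypothetical table with a list column
my_tbl <- dplyr::tibble(
  id = 1:3,
  measurements = list(c(1.2, 3.4), numeric(0), c(5.6))
)

agent <-
  create_agent(tbl = my_tbl) |>
  specially(
    # UDF returning one logical per row: each list entry must be non-empty
    fn = function(x) vapply(x$measurements, function(m) length(m) > 0, logical(1))
  ) |>
  interrogate()
```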

Excellent. And I really love the idea of having these HTML-based reports. I know I've been on a crusade in my day job to get away from static Word-document reports and whatnot. So having HTML to get at these details in a novel way has been a huge, huge help for making pointblank more mainstream in our data-checking needs.

Well, that's awesome. That's great. Yeah, that's what I was thinking too initially, although I didn't really know until people started using it and reports came back. That's actually good to hear.

Does it matter if the source table is grouped? I think it either conveniently ignores the grouping or ungroups it at the beginning. I know we handle it at some point, but I forget the actual behavior. I think it does matter, but it's handled, is my understanding. If you have grouped tables and the grouping does adversely affect a validation, let me know, and we can always provide some options there.
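If you want to be safe in the meantime, dropping the grouping yourself before validation is a cheap defensive step:

```r
library(dplyr)
library(pointblank)

grouped_tbl <- mtcars |> group_by(cyl)

# Defensive pattern: hand pointblank an explicitly ungrouped table
agent <-
  create_agent(tbl = grouped_tbl |> ungroup()) |>
  col_vals_gt(columns = mpg, value = 0) |>
  interrogate()
```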

And I believe you touched on it earlier, Rich, but it sounds like Parquet files are just as supported as anything else for pointblank, right? In the Python version, yes. On the R side they may work just by virtue of dplyr reading them, but probably not; I haven't had a report, and I haven't done the legwork myself to verify that Parquet files work. So that's a thing to do. But if you use Python, the answer is yes, though we haven't talked about Python much here.

Have you thought about using this with some of the new LLM agent-based tools? Have you thought about that piece for pointblank? Yeah, I've thought about it. That stuff is there, unfortunately only in the Python version; the Python version was developed later, just in the last year compared to this one. There's still stuff missing in it, like all the informant stuff: the data documentation side isn't there yet. But in the Python version, if you want to use AI, you can, and there are quite a few avenues for that. There's a way to create validation plans through LLMs, which is not dissimilar to draft_validation(). There's also an assistant for just talking and reasoning things through: you describe what you want, it knows the API, you say which validations you want, and it will suggest them to you and you put them in. There's also another validation function there called prompt. What you do is provide text describing what the check is, so a validation step is essentially a prompt; you're basically using English to validate certain parts of your table with that function. Those things aren't in the R version yet, but I plan to get them in in due course.

Rich, I'm excited for next year. Highlighting it, bringing it more to the public. What are some things you think we can do with pointblank for next year? I'm planning workshops and talks, and I may also create some videos as well; I'm planning the same for Great Tables and other packages. It's pretty much just that: getting the information out there. It could take multiple forms, but people have just got to know about this. I think it's great; check your data and you'll sleep better at night.

Was pointblank one of your early packages? Not one of the first, but it's definitely early; it started in 2017. I think my earliest package was somewhere around 2013, and I believe I initially did some packages to do with atmospheric modeling. It's cool that this package was based on some work you were actually doing. Yes, I had data quality issues myself, and I didn't want to just throw a bunch of SQL at the problem; that's annoying. I had tons of tables, so I figured I'd not do my work and write a package instead, and a month later I'd do the work. I got it spun up pretty quickly because it just didn't have all the good reporting yet; gt wasn't a thing at the time, so the report tables used different R packages that were available circa 2017.

I don't know if you saw, but at posit::conf there's a workshop called Data Science Workflows, and pointblank was a big part of that. I was actually a TA for one of those, the very first one in Chicago. The idea is that it uses pointblank to validate data and then trigger emails automatically that basically let you know what shape your data is in. Based on the pointblank reports, actions can get triggered: data looks good, send it off; data looks bad, flag it for further review. We didn't quite cover that here. In action_levels(), there's actually a separate argument that allows you to provide functions, so once thresholds are exceeded it's not just visual: you can actually fire off functions. It could be logging, it could be errors, or stopping things dead in their tracks. You can actually react to failures in that way.
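A sketch of those threshold functions; the message() calls here are placeholders for whatever logging or emailing you'd wire in:

```r
library(pointblank)

al <- action_levels(
  warn_at = 0.1,    # WARN when 10% of a step's test units fail
  notify_at = 0.25, # NOTIFY at 25%
  fns = list(
    warn   = ~ message("A step crossed the warn threshold; logging it"),
    notify = ~ message("Notify level reached; e.g., send an email here")
  )
)

agent <-
  create_agent(tbl = ~ small_table, actions = al) |>
  col_vals_not_null(columns = date) |>
  interrogate()
```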

Someone asked about Snowflake, because that's a use case I hear about all the time. It didn't quite work early on, but I think since around 2021 things got a little more improved. I think there's actually a blog post kicking around somewhere where someone did use Snowflake with pointblank for R. Yeah: pointblank with Snowflake, Databricks, all of the data platforms.

I did test it on the R side for Spark and Databricks, so it does actually work. Not just theoretically; I verified that.

I'll give you one last chance to post here in the chat. Otherwise, we will see you next year, hopefully, Rich. Yeah, love to be here.

Okay, well, I think that's it. We'll go ahead and end the webinar. Yeah, Rich, you shared your slides, so I think people should be in good shape. That concludes day two of R/Pharma. We'll be kicking off the APAC track tonight at 9 p.m.; they've got a series of talks that will run until about 3 a.m. Then we pick back up here at 10 a.m. tomorrow with the keynote from Simon from Posit, who will be talking about AI. It should be an exciting 24 hours, so hopefully we'll see everybody tomorrow, or stick around for APAC tonight. I think that's it. Thanks, Rich. All right, thanks for having me.