
Validating and Testing R Dataframes with Pandera via reticulate - R-Python Interoperability
Presented by Niels Bantilan

Original Full Title: Validating and Testing R Dataframes with Pandera via reticulate: A Case Study in R-Python Interoperability

Data science and machine learning practitioners work with data every day to analyze and model it for insights and predictions. A major component of any project is data quality: the process of cleaning data and protecting against flaws that may invalidate the analysis or model. Pandera is an open source data testing toolkit for dataframes in the Python ecosystem, but can it validate R dataframes? This talk is composed of three parts: first, I'll describe what data testing is and motivate why you need it. Then, I'll introduce the iterative process of creating and refining dataframe schemas in Pandera. Finally, I'll demonstrate how to use it in R with the reticulate package, using a simple modeling exercise as an example.

Presented at Posit Conference, Sept 19-20, 2023. Learn more at posit.co/conference.

Talk Track: R or Python? Why not both! Session Code: TALK-1123
Transcript
This transcript was generated automatically and may contain errors.
Hi, everyone. Thanks for joining me here today. I know it's been a long conference, but we'll get through it. My name is Niels Bantilan. I'm the chief ML engineer at Union AI, which is an AI infrastructure company aimed at helping organizations get value out of their data through AI/ML-driven applications. But today, I have my open source hat on, and I'm going to talk to you a little bit about validating and testing dataframes in Python and R with Pandera and reticulate.
So, back in 2012, I don't know if you remember this article, the Harvard Business Review named data scientist the sexiest job of the 21st century. But you know what isn't sexy? Dealing with invalid data, which may seem like a losing battle when incorrect or otherwise corrupted data gets through your pipeline. What's worse is when others are relying on clean data downstream from you, and if you find yourself having that responsibility of creating clean data sets for your team, then this is a very sucky feeling.
So, we all know that data validation is important, and we all know that it's tedious and often thankless, right? When we first learn any kind of CS concepts, we might have heard of garbage in, garbage out; data-centric machine learning is a more recent buzz phrase; and there's data as code. All of these sayings in the field point to data as one of the primary resources or assets that we use to deliver value.
But you might say, I just want to train my model, or I just want to make that pretty plot, right? And, yeah, I was here once in a previous job. I had to train a model. Looks very vanilla, right? You clean data, you split it into train and test splits, you train a model, you evaluate the model on the test split. At that time, I was using some machine that wasn't as powerful as I wanted it to be, so it took maybe three, four days to train my model. And when it completed, I discovered, oh no, my evaluate-model step failed. Digging into the code and finding the bug, it was because the test data split that I was creating, which was fairly custom and not straightforward, was incorrect. And all I had to do, really, was some basic assertions, basic validations, to test whether my data splits were correct, and I would have avoided having to wait and twiddle my thumbs for days.
Yeah, so I hope that brief story inspires kind of or justifies this idea that data validation is about understanding your data and then capturing that understanding as a schema. So here's a little sketch of what a schema might look like, right? It's just a JSON blob. It says you have three columns and they have these types. So at its core, a schema is an artifact that documents the shape and properties of some data structure. In this case, we're talking about data frames. And secondly, schemas enforce that shape and those properties programmatically at runtime, perhaps.
So my job in this talk is not to convince you that data validation is the most attractive part of data science, but I do want to convince you that data validation can be fun.
The way you do that is to reframe it as unit tests for your data. Just like when you run your unit tests, you see nice green check marks, and whichever test suite you use, you'll see some nice visual cue that all your tests are passing. If you frame data validation in this way, you get that nice dopamine hit when you see all these passing data set validation checks.
But you have to do a little extra work to get to this fun part, right? There's only a minor shift in the way one would work to incorporate data validation into their mindset and workflow. Typically, you might define a goal for your analysis, explore your data, implement your data processing function, spot check the output of that function, and if it looks good, just continue, right? With the data validation mindset, it's very similar to unit testing or test-driven development: when you explore your data, you capture that understanding in terms of a schema. Then you validate your implementation of your data cleaning function against it, and when that passes, you continue your analysis.
And so I hope I'm getting the point across that there's no substitute for understanding your data with your own eyes. But once you get that understanding, it might seem like a Sisyphean task to create a schema and then maintain it as your data set shifts over time, as real-world data does. So I built Pandera in Python to lower that barrier to entry: to create your schema, iterate on it over time, and collaborate on a code-first artifact that will validate your data. My hope is that organizations that use it (at first it was just me and my company at the time) would build a culture of data hygiene and see the value of doing these runtime checks.
Introducing Pandera and reticulate
So as I said, Pandera is a Python package; it's a data validation and testing tool. In this talk, I'm going to show you how you can use it with reticulate, which is an R-Python bridge where you can interchange data, functions, and objects between the two runtimes. So what? The point here is that using Pandera in your Python and R stacks gets you a single source of truth for your schemas, data documentation as code for you and your team, and runtime dataframe schema enforcers. So you can spend less time worrying about the correctness of your data and more time analyzing, visualizing, and modeling it.
Defining and exploring the schema
So in the rest of this talk, I'm going to take you through a mini data validation journey, starting with defining our goal. So say we have a toy data set here of transactions consisting of grocery store items and their associated prices, and we want to predict the price of items. Obviously, there would be a lot more columns here, but just for the sake of learning about how Pandera works, we just have two right here. So we might start off exploring the data. We might print out the data types. We might describe some of the columns and plot out the distribution of things we want to kind of grok better with visuals.
As you can see here, we can start building our understanding of what our schema might look like, right? So item is categorical, represented as strings; it has three possible values: apple, orange, banana. Price is a float which is greater than or equal to zero. And neither column can contain null values.
So Pandera gives you a very simple way to translate this understanding into a schema. Let's look at our first Pandera schema. First, we import the package. Then we create a class, and hopefully this is clear enough even if you're not super familiar with Python and classes. You can read this and see what's going on: I'm subclassing DataFrameModel, and that's how I define my schema. Then I define some attributes in this class. The item attribute is a string, and in this Field function, I'm adding some more constraints beyond just the data type: I'm saying item must be in this finite set of values. Similarly, price is a float that's greater than or equal to zero. Finally, both of these columns are not nullable; nullable equals false is actually the default, but I'm showing it here just to be explicit.
Validating data and handling errors
So now that we've defined our schema, we can just call schema.validate. And if the data you pass into validate is valid, then it simply returns the validated data. It applies all of those checks, and if everything is OK, you get your data back and you can continue whatever you're doing with it.
However, if the data is invalid, things change. In this case, we're defining invalid data with a NaN value in the price column, and then squash as an item with a negative price, which doesn't make sense unless I'm paying you $1,000 for squash. If I try validating this invalid data, I get a SchemaError. And if we catch that SchemaError, we can get the failure cases that Pandera found. In this case, Pandera is just going to serially go through all those checks and raise the first failure case it sees.
Now that might not be good enough; I want to see what all the errors are. So there's a lazy equals true keyword argument you can pass to the validate method. If you do that, Pandera will raise a SchemaErrors exception, and in this case, the failure cases are a lot more informative. Failure cases is itself a data frame that contains information about where the error happened: which column was the offending column, which check was associated with the failure, the exact value that caused the error, and then some bonus metadata about where it was in the data frame's index.
So as you can see, this package is really optimized for data scientists, data engineers, and ML engineers who are in the weeds in the code and want to roll their own reporting layer. This is how Pandera reports errors.
You can easily integrate this with your existing pipeline. Assuming you have functions defined for each step of your pipeline (this also works with methods, of course), in this case we have a clean data function that doesn't do anything; it just returns the input. So this is the incorrect implementation. In order to check that the output of this function conforms with the schema as we've defined it, we decorate the function with this check types decorator (passing lazy equals true, of course), and then we use type annotations to say, hey, this is a data frame that conforms with this schema. Once you've done this, clean data is now a data validator plus a function: every time we call clean data, we check all the values of the output. In this case, we recover the failure cases we saw earlier.
Updating schemas and built-in checks
So you might say, well, squash is an actual grocery item that might make sense to have in this data set. So this is just a slide to communicate essentially that there is this kind of constant updating process where your schema is stateless. It's all described in your code. And so as your data shifts over time, you're going to have to update that schema and collaborate with your teammates, make changes to your source code, check that into Git. And you essentially have a track record of all the changes that happened with your schema.
So in a more realistic setting, you may have hundreds or thousands of items. In the wild, I've seen people use a pattern where they read in some metadata from a JSON file or something, or they just have it as a constant variable, and then they plug that in here.
Pandera also provides some additional nice things to refine your schema and the validation routine you want to apply to your data. The way you do this is you define a Config class underneath your schema. If you say coerce equals true, then when you call validate, Pandera is going to try to coerce all the raw data types into the specified data types. This is a potentially destructive, lossy transformation, so if you don't want that, it's false by default.
If you say ordered is true, then the raw data coming in needs to be in the exact same column order as you specified in the schema; this is important to some people. Strict allows you to say, hey, I want to raise an error if there are random dangling columns in my data that aren't specified in my schema. I can also say strict equals filter, and it'll remove those columns. I can say drop invalid rows equals true, and any rows with failing checks are considered invalid and dropped, if that's what you want. And finally, unique column names equals true just does what it says. There are a few more options, but I just want to give you a sampling.
There are also some built-in checks. As you saw from that Field function attached to the column attributes, we can define other constraints, like requiring that the column's values cover the whole finite set. So unlike isin, where a column containing only apples would be valid, a valid column in this case has to contain at least one of each of those values. You can also do string pattern matching. There are a few other built-in methods here, like in-range checks and some very basic ones: greater than or equal to, less than or equal to, things like this.
However, if the built-in checks don't fulfill your use case, Pandera makes it super easy to create custom checks. You do this by defining methods that you decorate with pandera.check. In this case, this will just make sure column 1's and column 2's means are within some range. As long as this method outputs a Boolean or a Boolean series, Pandera will know what to do with it. You can also do data frame-level checks with the dataframe check decorator, where you have access to the entire data frame. Here you can do whatever you want, as long as you stick with the constraint of outputting a Boolean, a Boolean series, or a data frame of Booleans.
The last part I'll show you before going into R is regex column matching. Say you have a data frame with 1,000 columns that all have similar semantics and are named in a similar way. You can specify an attribute with an alias that becomes a regular expression when you say regex is true. The same goes for those column-level check decorators. Essentially, the validation will be applied to any column that matches that regular expression.
Using Pandera with reticulate in R
Just a brief meta comment before going into reticulate: this Quarto document itself is validated by Pandera. All the code you see here has been validated, so that's awesome.
So, to address the elephant in the room: there are R packages that do data validation, like pointblank and validate. Why would you want to do this Frankenstein thing with Python in the first place? To be honest, the real answer is because we can. But to give you a few reasons that may make sense: maybe you're already using Pandera and you want to reuse those schemas in your R runtime. Maybe the Pandera programming model just fits better in your brain, I don't know. Or you just want R and Python to get along, and you want to give it a try.
With that, let me import a few things in my toolkit to get started here: dplyr, knitr, and reticulate, and I'm using a conda env where I have all of my dependencies installed. And it just works, right? I define a data frame in R, validate it through the py entry point, and it returns the validated data. Just a warning: this hasn't been comprehensively tested on all the data types, so I don't want to get you too excited. You can also catch Python exceptions (sorry, there's a typo on the slide; it should say R exceptions) raised by Python in the R reticulate runtime. This is essentially the R equivalent of the try/except in Python, and we get the same failure cases.
Synthesizing test data
And the last thing I want to show you here is synthesizing test data with Pandera. If you call schema.example (or schema$example from R), you can create valid data under the constraints of your schema. I'm running low on time, but this is super useful for unit testing. So maybe I can just show you this test process test that creates some mock data, tries to process it, and tells you whether your tests have failed or not. In the happy path, they pass. But if there's a bug in the code, like a column name with a typo, the test will fail. Data validation is not only about testing the actual data, but also the functions that produce it. So you can get started by installing Pandera.
But I do want to say last that I think our data wants to be validated, so we should help them along with that. Thank you.
Niels, you have a lot of questions. I'm sure people are going to be super excited to talk to you right outside. We'll do one very quick one. Is Pandera configured to only work with Pandas DataFrames, or does it work with a variety of classes like Polars?
Yeah, Polars is on the roadmap, so there's an open issue for that. It currently supports Modin and PySpark (so it's obviously Python-focused), Dask DataFrames, and GeoPandas DataFrames, with Polars on the way. And then we have our eyes set on xarray and Ibis; Ibis would be for in-database validation.

