
Validating and Testing R Dataframes with Pandera via reticulate - R-Python Interoperability
Presented by Niels Bantilan

Original Full Title: Validating and Testing R Dataframes with Pandera via reticulate: A Case Study in R-Python Interoperability

Data science and machine learning practitioners work with data every day to analyze and model it for insights and predictions. A major component of any project is data quality: the process of cleaning data and protecting against flaws that may invalidate the analysis or model. Pandera is an open source data testing toolkit for dataframes in the Python ecosystem, but can it validate R dataframes? This talk is composed of three parts: first, I'll describe what data testing is and motivate why you need it. Then, I'll introduce the iterative process of creating and refining dataframe schemas in Pandera. Finally, I'll demonstrate how to use it in R with the reticulate package, using a simple modeling exercise as an example.

Presented at Posit Conference, Sept 19-20, 2023. Learn more at posit.co/conference.

Talk Track: R or Python? Why not both! Session Code: TALK-1123
Transcript
This transcript was generated automatically and may contain errors.
Hi, everyone. Thanks for joining me here today. I know it's been a long conference, but we'll get through it. My name is Niels Bantilan. I'm the chief ML engineer at Union AI, which is an AI infrastructure company aimed at helping organizations get value out of their data through AI/ML-driven applications. But today, I have my open source hat on, and I'm going to talk to you a little bit about validating and testing dataframes in Python and R with Pandera and reticulate.
So, back in 2012, I don't know if you remember this article, the Harvard Business Review named data scientist the sexiest job of the 21st century. But you know what isn't sexy? Dealing with invalid data, which may seem like a losing battle when incorrect or otherwise corrupted data gets through your pipeline. What's worse is when others are relying on clean data downstream from you, and if you find yourself having that responsibility of creating clean data sets for your team, then this is a very sucky feeling.
So, we all know that data validation is important, and we all know that it's tedious and often thankless, right? When we first learn any kind of CS concepts, we might have heard of garbage in, garbage out; data-centric machine learning is a more recent buzz phrase; and there's data as code. All of these sayings in the field point to data as one of the primary resources or assets that we use to deliver value.
But you might say, I just want to train my model, or I just want to make that pretty plot, right? And, yeah, I was here once in a previous job. I had to train a model. Looks very vanilla, right? You clean data, you split it into train and test splits, you train a model, you evaluate the model on the test split. At that time, I was using some machine that wasn't as powerful as I wanted it to be, so it took maybe three, four days to train my model. And when it completed, I discovered, oh no, my evaluate-model step failed. Digging into the code and finding the bug, it was because the test data split that I was creating, which was fairly custom and not straightforward, was incorrect. And all I had to do, really, was some basic assertions, basic validations, to test whether my data splits were correct, and I would have avoided having to wait and twiddle my thumbs for days.
Yeah, so I hope that brief story inspires kind of or justifies this idea that data validation is about understanding your data and then capturing that understanding as a schema. So here's a little sketch of what a schema might look like, right? It's just a JSON blob. It says you have three columns and they have these types. So at its core, a schema is an artifact that documents the shape and properties of some data structure. In this case, we're talking about data frames. And secondly, schemas enforce that shape and those properties programmatically at runtime, perhaps.
So my job in this talk is not to convince you that data validation is the most attractive part of data science, but I do want to convince you that data validation can be fun.
The way you do that is to reframe it as unit tests for your data. Just like when you run your unit tests, you see nice green check marks, and whichever test suite you use, you'll see some nice visual cue that all your tests are passing. If you frame data validation in this way, you get that nice dopamine hit when you see all these passing data set validation checks.
But you have to do a little extra work to get to this fun part, right? There's only a minor shift in the way one would work to incorporate data validation into their mindset and workflow. Typically, you might define a goal for your analysis, explore your data, implement your data processing function, spot check the output of that function, and if it looks good, just continue, right? With the data validation mindset, it's very similar to unit testing or test-driven development: when you explore your data, you capture that understanding in terms of a schema. Then you validate your implementation of your data cleaning function against it, and when that passes, you continue your analysis.
And so I hope I'm getting the point across that there's no substitute for understanding your data with your own eyes. But once you get that understanding, it might seem like a Sisyphean task to create a schema and then maintain it as your data set shifts over time, as real-world data does. So I built Pandera in Python to lower that barrier to entry: to create your schema, iterate on it over time, and collaborate on a code-first artifact that will validate your data. My hope is that organizations that use it (at first it was just me and my company at the time) would build a culture of data hygiene and see the value of doing these runtime checks.
Introducing Pandera and reticulate
So as I said, Pandera is a Python package; it's a data validation and testing tool. In this talk, I'm going to show you how you can use it with reticulate, which is an R-Python bridge where you can interchange data, functions, and objects between the two runtimes. So what? The point here is that using Pandera in your Python and R stacks gets you a single source of truth for your schemas, data documentation as code for you and your team, and runtime dataframe schema enforcers. So you can spend less time worrying about the correctness of your data and more time analyzing, visualizing, and modeling it.
Defining and exploring the schema
So in the rest of this talk, I'm going to take you through a mini data validation journey, starting with defining our goal. So say we have a toy data set here of transactions consisting of grocery store items and their associated prices, and we want to predict the price of items. Obviously, there would be a lot more columns here, but just for the sake of learning about how Pandera works, we just have two right here. So we might start off exploring the data. We might print out the data types. We might describe some of the columns and plot out the distribution of things we want to kind of grok better with visuals.
As you can see here, we can start building our understanding of what our schema might look like, right? So item is categorical, represented as strings; it has three possible values: apple, orange, banana. Price is a float which is greater than or equal to zero. And neither column can contain null values.
So Pandera gives you a very simple way to translate this understanding into a schema. Let's look at our first Pandera schema. First, we import the package. Then we create a class, and hopefully this is clear enough even if you're not super familiar with Python and classes. You can read this and see what's going on: I'm subclassing DataFrameModel, and that's how I define my schema. Then I define some attributes in this class. The item attribute is a string, and in this Field function, I'm adding some more constraints beyond just the data type: I'm saying item must be in this finite set of values. Similarly, price is a float that's greater than or equal to zero. Finally, both of these columns are not nullable; nullable equals false is actually the default, but I'm showing it here just to be explicit.
Validating data and handling errors
So now that we've defined our schema, we can just call schema.validate. And if the data you pass into validate is valid, then it simply returns the validated data. It applies all of those checks, and if everything is OK, you get your data back and you can continue whatever you're doing with it.
However, if the data is invalid, things change. In this case, we're defining invalid data with a NaN value in the price column, and then squash as an item with a negative price, which doesn't make sense unless I'm paying you $1,000 for squash. If I try validating this invalid data, I get a SchemaError. And if we catch that SchemaError, we can get the failure cases that Pandera found. In this case, Pandera is just going to serially go through all those checks and raise the first failure case it sees.
Now that might not be good enough; I want to see what all the errors are. So there's a lazy equals true keyword argument you can pass to the validate method. If you do that, Pandera will raise a SchemaErrors exception, and in this case, the failure cases are a lot more informative. Failure cases is itself a data frame that contains information about where the error happened: which column was the offending column, which check was associated with the failure, the exact value that caused the error, and then some bonus metadata about where it was in the data frame's index.
So as you can see, this package is really optimized for data scientists, data engineers, and ML engineers who are in the weeds in the code and want to roll their own reporting layer. This is how Pandera reports errors.
You can easily integrate this with your existing pipeline. Assuming you have functions defined for each step of your pipeline (this also works with methods, of course), in this case we have a clean data function that doesn't do anything; it just returns the input. So this is the incorrect implementation. In order to check that the output of this function conforms with the schema as we've defined it, we decorate the function with this check types decorator (passing lazy equals true, of course), and then we use type annotations to say, hey, this is a data frame that conforms with this schema. Once you've done this, clean data is now a data validator plus a function: every time we call clean data, we check all the values of the output. In this case, we recover the failure cases we saw earlier.
Updating schemas and built-in checks
So you might say, well, squash is an actual grocery item that might make sense to have in this data set. So this is just a slide to communicate essentially that there is this kind of constant updating process where your schema is stateless. It's all described in your code. And so as your data shifts over time, you're going to have to update that schema and collaborate with your teammates, make changes to your source code, check that into Git. And you essentially have a track record of all the changes that happened with your schema.
So in a more realistic setting, you may have hundreds or thousands of items. In the wild, I've seen people use a pattern where they read in some metadata from a JSON file or something, or they just have it as a constant variable, and then they plug that in here.
Pandera also provides some additional nice things to refine your schema and the validation routine you want to apply to your data. The way you do this is you define a Config class underneath your schema. If you say coerce equals true, then when you call validate, Pandera is going to try to coerce all the raw data types into the specified data types. This is a potentially destructive, lossy transformation, so if you don't want that, it's false by default.
If you say ordered is true, then the raw data coming in needs to be in the exact same column order as you specified in the schema; this is important to some people. Strict allows you to say, hey, I want to raise an error if there are random dangling columns in my data that aren't specified in my schema. I can also say strict equals filter, and it'll remove those columns. I can say drop invalid rows equals true, and any rows with failing checks are considered invalid and dropped, if that's what you want. And finally, unique column names equals true just does what it says. There are a few more options, but I just want to give you a sampling.
There are also some built-in checks. As you saw from that Field function attached to the column attributes, we can define other constraints, like requiring that the column's values cover the whole finite set. So unlike isin, where a column containing only apples would be valid, a valid column in this case has to contain at least one of each of those values. You can also do string pattern matching. There are a few other built-in methods here, like in-range checks and some very basic ones: greater than or equal to, less than or equal to, things like this.
However, if the built-in checks don't fulfill your use case, Pandera makes it super easy to create custom checks. You do this by defining methods that you decorate with pandera.check. In this case, this will just make sure column 1's and column 2's means are within some range. As long as this method outputs a Boolean or a Boolean series, Pandera will know what to do with it. You can also do data frame-level checks with the dataframe check decorator, where you have access to the entire data frame. Here you can do whatever you want, as long as you stick with the constraint of outputting a Boolean, a Boolean series, or a data frame of Booleans.
The last part I'll show you before going into R is regex column matching. Say you have a data frame with 1,000 columns that all have similar semantics and are named in a similar way. You can specify an attribute with an alias that becomes a regular expression when you say regex is true. The same goes for those column-level check decorators. Essentially, the validation will be applied to any column that matches that regular expression.
Using Pandera with reticulate in R
Just a brief meta comment before going into reticulate: this Quarto document itself is validated by Pandera. All the code you see here has been validated, so that's awesome.
So, to address the elephant in the room: there are R packages that do data validation, like pointblank and validate. Why would you want to do this Frankenstein thing with Python in the first place? To be honest, the real answer is because we can. But to give you a few reasons that may make sense: maybe you're already using Pandera and you want to reuse those schemas in your R runtime. Maybe the Pandera programming model just fits better in your brain, I don't know. Or you just want R and Python to get along, and you want to give it a try.
With that, let me import a few things in my toolkit to get started here: dplyr, knitr, and reticulate, and I'm using a conda env where I have all of my dependencies installed. And it just works, right? I define a data frame in R, validate it through the py entry point, and it returns the validated data. Just a warning: this hasn't been comprehensively tested on all the data types, so I don't want to get you too excited. You can also catch Python exceptions (sorry, there's a typo on the slide; it should say R exceptions) raised by Python in the R reticulate runtime. This is essentially the R equivalent of the try/except in Python, and we get the same failure cases.
Synthesizing test data
And the last thing I want to show you here is synthesizing test data with Pandera. If you call schema.example (or schema$example from R), you can create valid data under the constraints of your schema. I'm running low on time, but this is super useful for unit testing. So maybe I can just show you this test process test that creates some mock data, tries to process it, and tells you whether your tests have failed or not. In the happy path, they pass. But if there's a bug in the code, like a column name with a typo, the test will fail. Data validation is not only about testing the actual data, but also the functions that produce it. So you can get started by installing Pandera.
But I do want to say last that I think our data wants to be validated, so we should help them along with that. Thank you.
Niels, you have a lot of questions. I'm sure people are going to be super excited to talk to you right outside. We'll do one very quick one. Is Pandera configured to only work with Pandas DataFrames, or does it work with a variety of classes like Polars?
Yeah, Polars is on the roadmap, so there's an open issue for that. It currently supports Modin and PySpark (so it's obviously Python-focused), Dask DataFrames, and GeoPandas DataFrames, with Polars on the way. And then we have our eyes set on xarray and Ibis; Ibis would be for in-database validation.

