Resources

Jim Kloet | Garbage Data, And What To Do About Them | RStudio (2022)

No matter the requirements of the project, data are rarely ready for analysis without some intervention up front, often described as cleaning or tidying up your data. Researchers and data professionals employ many tools to make their data usable for their needs; but, there exist data that are so far beneath the threshold for usefulness that they cannot be used responsibly for analysis or decision-making, i.e. “bad data.” This talk proposes a framework for identifying bad data, with examples from both academic and industry; identifies challenges you might face from stakeholders when you identify bad data; and suggests concrete steps you can take to overcome those challenges now and in the future. Session: Generating high quality data

Oct 24, 2022
15 min


Transcript

This transcript was generated automatically and may contain errors.

I'm Jim Kloet. I manage a team of data scientists in Chicago, and I am here to talk about garbage data. But before I talk about garbage data, I want to lead you off with a quote. A lot of you have probably seen this quote before. It's about statistical models. It's from the famous statistician George Box. And he wrote a bunch of times, at least the first part: all models are wrong, but some are useful. But then sometimes he'd go on and add things like, so the question you need to ask yourself is not, is the model true? Because it never is. But is the model good enough for this particular application? And I love this quote.

It's true. It's not explicitly about garbage data, but it turns out that you can just cross out models and replace it with data. And the quote's still true. Because all data are wrong, but some are useful. There's no such thing as a perfect data collection tool or data retention tool. And so the question that you need to ask yourself isn't, are the data true? Because they're not. But are the data good enough for this particular application?


And so the you in this sentence that I'm thinking of is us, data professionals. And depending on where you're at, whether you're in industry or academia, the titles are going to differ. It could be a data scientist or engineer or architect or academic researcher. The point is though that we data professionals are often the primary deciders of data usefulness in our organizations.

And a typical workflow, at least for me, kind of looks like this. So we'll have some stakeholders that are super enthusiastic. I use a lot of emojis. So if you don't like emoji, you should leave. So stakeholders will be really enthusiastic. They'll come to us and ask for help. Like they brought us some data and I don't work in agriculture. This is a metaphorical haystack. And so then we spring into action and we think, okay, cool. Where on this hypothetical continuum of data usefulness do we think that these data should fall? And if they're maximally useful, the best case scenario is that they are just chock full of insights, which you see, the metaphor keeps going. Now we've got needles. Answers to our questions, solutions to our problems, whatever the stakeholders are looking for.

Because I mean, ultimately, stakeholders expect that there's going to be something in data, right? Like that's the promise of big data. And so even if we make this assessment that data are minimally useful, the expectation from our stakeholders is still like there's non-zero insights in there, right? Like that there's at least a needle in the haystack or two needles. And again, people are led to believe that data are always going to be the solution here. They're not going to be the problem. But I think Megan's talk highlighted, and Max Kuhn's keynote this morning also called out, that data aren't always useful. And in a lot of cases, data are just garbage.

And garbage data are at all of our organizations, right? I think if you've worked with real world data, you definitely have encountered garbage data. And so you know then that garbage data are problematic and have costs and can lead to real problems within our organizations.

Identifying garbage data

So my talk here, I want to make sure you all walk away with at least three points. First, I want to make sure you can identify garbage data in your organizations. Second, I want to make sure that you're aware of and can identify the costs of garbage data. And then finally, I want to make sure that you're taking action when you see garbage data.

All right, so hitting the first point here on identifying garbage data. So like we think an operational definition is a good starting point. And garbage data, I think, simply are just data that aren't useful for a particular application. So if we have stakeholders coming up to us, then the first question we really need to ask them is, are the data that you're giving us that we're working with relevant for the questions that you're interested in asking?

And so say our stakeholders come up with questions about hot dogs. And then they give us data like this. And it's adorable. And it's possibly useful, based simply on the dogs being in there. Hat tip to Hadley for this joke in the feedback, by the way. Thank you, Hadley. But as we're making our decision about whether the data are useful or not, it's quite likely that we'll say, no, these data aren't relevant to this particular application. They're likely not going to be able to answer your questions.

But our questions usually don't stop there, right? We've identified the garbage data, but garbage data aren't always garbage. They could be relevant for a different thing. So the next question we want to ask our stakeholders, is there another application where these data could be relevant? And so if your stakeholders are anything like mine, they have lots of questions. It's not just that their questions are about hot dogs or one thing in particular. Oftentimes the data that they have are probably relevant to something else that they're interested in. And so working with them to find out if there's another question that the data can be helpful in answering seems like is another reasonable step to take.

I think that there's one other step to take after this, though. Say you say no, and the data are not necessarily useful for this secondary application. I think it's fair to ask, are there reasons to expect that these data are total garbage? Because again, we've worked with real world data. We know that there are definitely cases where your data are not useful for anything. And that often requires us to put on our sleuthing hats, which is usually a part of any data professional's arsenal.

I'm not going to go into all of the ways that data can be totally garbage, but I'm going to walk through at least a few examples because I'm sure that they will resonate with some of us. Megan also had a ton of awesome examples in her talk as well, so check out that repo. So things can go wrong during data collection. I used to work with handwriting data. Sometimes that data is illegible. This is not data from any of my previous studies. This is actually from a New York Times article about kids' handwriting. But the point I think is still well taken, that sometimes the data is just completely borked from the point of collection.

It can also get screwed up in retention, right? I've had a weird case where some of my data were corrupted by a hard drive failure, but not all of it. And it didn't become clear until I did a little bit more digging and read a whole bunch of logs, which is not my favorite way of spending time. And so retention is another opportunity for things to really go wrong and make your data totally garbage.
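One lightweight guard against this kind of silent retention corruption is to record checksums when data are written and verify them before analysis. Here's a minimal sketch in Python; the manifest shape and helper names are my own assumptions, not anything from the talk:

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.hexdigest()

def verify(manifest: dict, root: Path) -> list:
    """Return the files whose current digest no longer matches the
    digest recorded in the manifest at write time."""
    return [name for name, digest in manifest.items()
            if sha256_of(root / name) != digest]
```

A non-empty result from `verify` means a file changed since it was stored, which is exactly the "some of my data were corrupted, but not all" situation: you find out before inference, not after.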

So a lot of times the total garbage results from someone accidentally borking the data. I have done this more times than I can count. And if there's anyone in the room who has not accidentally borked their data to make it useless or very close to useless, raise your hand. Well, actually don't raise your hand because it's going to happen sooner or later. So just be prepared for that sort of thing. So people make mistakes all of the time in data collection and retention. And so that's sometimes how data will become totally useless.

I think a recent case is where we had a lot of data that were replaced by default values. And so when it looked like 99% of our sample were default values, we had to start wondering if that was going to be useful for inference. The last case, which I think makes a lot of news and headlines is that sometimes people intentionally bork their data. This is what I think we used to think cyber criminals looked like. Maybe this is what they actually look like. I'm not sure. I don't necessarily think that this is like the most common scenario, right? People usually seem to be honest in my experience with their data, unless they're really incentivized to have data look a very specific way.
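That default-values case can be caught programmatically. This is a hedged sketch, assuming tabular rows as dictionaries; the function name and 95% threshold are my own, not from the talk:

```python
from collections import Counter

def dominated_columns(rows, threshold=0.95):
    """Flag columns where a single value accounts for more than
    `threshold` of all observations -- a hint that the field may have
    been silently backfilled with a default value."""
    if not rows:
        return {}
    flagged = {}
    for col in rows[0]:
        counts = Counter(row[col] for row in rows)
        value, n = counts.most_common(1)[0]
        share = n / len(rows)
        if share > threshold:
            flagged[col] = (value, share)
    return flagged
```

Run against a sample like the one described, with 99% default values in a column, this flags that column with the offending value and its share, so you can start wondering about inference before the analysis, not after.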

So in those cases, what do you do? You've identified that the data are totally garbage. Those need to be destroyed, I think, in most cases. I can point to a number of times when I've worked with folks that have tried to make inferences from totally garbage data, and we'll get into the consequences of those in a little while. But if you can't totally destroy the data, put up as much warning tape as you possibly can, because you want to make sure that nobody else accidentally tries to make inferences out of this.

So identifying garbage data is useful. But I called out that there's a reason to do it, and it's because garbage data, oops, excuse me, I had a summary slide. So summing up part one: garbage data are data that aren't useful for a particular application. Some garbage data are useful for other stuff; some are not useful for anything.

The costs of garbage data

So why do we care? Well, it's because garbage data have costs. And I think of those costs in a couple different ways. I think first of these unavoidable costs that you're going to have anytime you have a data project, right? This is just an example cloud bill, because every time you have a data project, you're going to incur costs with data collection, data retention, any kind of support around those data. There's always going to be a cost. Your data lake, your cloud provider, doesn't know and probably doesn't care if your data are garbage. Those costs are going to happen whether your data are good or not.

You might not be able to avoid some costs, but the avoidable costs, I think, are really the ones you have to be on the lookout for, because those can really torpedo your organization. So first, the excessive regular old data costs: if you don't catch the fact that you have garbage data, well, you're just going to keep incurring bills and bills and bills, and they add up. And I suspect that in a lot of organizations, they're bigger than $7,400 a month. So if you can catch garbage data when it happens, that can save a lot of costs.

Opportunity costs, I think, are another really important one to call out, and they're probably a lot less easy to point to, because they're not tangible. They didn't happen. You can't say I missed this thing, right? But philosophically, garbage data necessarily don't have any value. So any time you spend looking into them is time you should have been spending doing something else. But we're practical. We're not here for philosophy. And so from a practical perspective, you see that there's a lot of interesting things that you could be working on as a data professional, instead of plumbing the depths of these garbage data and trying to validate whether your stakeholders might or might not be able to find something.

And I think that leads to another level of cost, which are these people costs. I like to learn things. I like to help other people learn things and kind of answer their questions. And you can't do that with garbage data. And eventually, after having to work with garbage data over and over and over again, data professionals are probably going to go find something else to do or go look for data that may or may not be garbage. I guess as a spoiler, the grass isn't greener everywhere. If anyone's worried about their organizations being especially borked, it's dirty everywhere. But people will continue to look if the data that they're working with isn't allowing them to learn things.

And I think that that leads to maybe the most obvious kinds of costs, which are decision costs, right? Because if you're making decisions with garbage data, you don't know what's gonna happen, right? You're making decisions basically with random noise in a lot of cases. And that means something good could happen, but not because of any smart things we did. But lots of bad things could happen too. And that's why I think that ultimately these avoidable costs can really be catastrophic, right? So if you think about making bad decisions after all of your data professionals have left looking for greener pastures, after having spent all of your money retaining a bunch of garbage data, that feels like a recipe for disaster for an organization.


And so the costs can really, really, really add up. Garbage data is something that we need to care a great deal about. So now I'm remembering my summary slides. So in summation part two, think about garbage data costs in a couple of different ways. Unavoidable costs, you're gonna have those anytime you have a data project. Avoidable costs though can be catastrophic.

Taking action on garbage data

So finally, we should take action when we see garbage data. What does that mean though? So we're the primary deciders of data usefulness. The first thing I think it means is when people come to us with questions and bring us some data, we have to be honest with them and direct with them. I know that I don't like to give people bad news. At various points in my life, people have called me Jim the dream crusher. And I felt bad. It was a joke, but I felt bad about it. And I don't want to be in that position, but the consequences of not being honest and direct with your stakeholders and letting them know when garbage data happens are a lot worse than the consequences of having some bummed stakeholders.

Now we don't want to bum any stakeholders out, but again, I think that we are obligated as the people who are kind of the line of defense against garbage data to say, this is garbage.

So the second thing, we don't want to bum people out. I think we need to lead with empathy. And this probably is just kind of par for the course for most of us, but it's useful to say it explicitly. Our stakeholders have a lot of things that they need. Our stakeholders have demands on them. They have pressures on them. They're not using garbage data or asking us to play around with garbage data because they think that it's a good way for us to spend their time. They're asking us because they need our help for something. And so if we're not leading with empathy, we end up with salty stakeholders who don't want to work with us anymore. And working with salty stakeholders is a good way to not have a job anymore, or not be a data professional anymore.

Another useful thing to try and do is, if you can, influence your organization to treat all problems as data problems from the very beginning. All projects should be data projects until considered otherwise. So default to treating projects as data projects. An ideal workflow is when my stakeholders come to me with questions rather than coming to me with data, because that gives me the opportunity, and my team the opportunity, to evaluate which data sets might be most appropriate to help them answer those questions, which makes it at least more likely that we're going to find something in there. And again, they may not be as happy as if you'd given them a huge pile of needles, but they'll be grateful that you didn't leave them empty-handed.

And then finally, I think we have to be patient because organizations are used to doing things however they've been doing them. And coming in and saying, hey, you're doing it wrong, is never going to drive change. I have made that mistake. It's a bad idea. So be patient. Build those relationships.

And so, pulling together the third part, the actions that we can take here: we have to be honest, we have to be empathetic, we have to treat all projects as data projects, and we have to be patient. I think that'll help us accomplish all three of these points of being able to identify garbage data, knowing the costs, and making sure that we're taking action. And I guess to wrap it up, I mean, if you do everything I suggest, your data are still going to be wrong. No matter what, there is no changing that. But there's a better chance that they could turn out to be useful, and that's our goal. Thank you all very much for your time.