Resources

Grant Fleming | Fairness and Data Science: Failures, Factors, and Futures | RStudio

In recent years, numerous highly publicized failures in data science have made evident that biases or issues of fairness in training data can sneak into, and be magnified by, our models, leading to harmful, incorrect predictions being made once the models are deployed into the real world. But what actually constitutes an unfair or biased model, and how can we diagnose and address these issues within our own work? In this talk, I will present a framework for better understanding how issues of fairness overlap with data science, as well as how we can improve our modeling pipelines to make them more interpretable, reproducible, and fair to the groups that they are intended to serve. We will explore this new framework together through an analysis of ProPublica's COMPAS recidivism dataset using the tidymodels, drake, and iml packages. About Grant: Grant Fleming is a Data Scientist at Elder Research, co-author of the Wiley book _Responsible Data Science_ (2021), and contributor to the O'Reilly book _97 Things About Ethics Everyone in Data Science Should Know_. His professional focus is on machine learning for social science applications, model explainability, and building tools for reproducible data science. Previously, Grant was a research contractor for USAID.

image: thumbnail.jpg

Transcript

This transcript was generated automatically and may contain errors.

My name is Grant Fleming, and I'm a data scientist at Elder Research, where I work on data science consulting projects with various public sector clients. In that work, along with a lot of the work that I'm doing outside of it, I'm having to wrestle with the issue of fairness in data science: specifically, things like how do we make sure that the predictions being generated by our models are fair? How do we make sure we're actually recording the correct information in our data and not leaving certain important factors out? And if we suspect that something may have snuck into our model, how do we make sure that we're able to detect that element before we deploy our model and cause harm unintentionally?

Because as we've probably seen in some of the news over the past few years, this is an issue that has, if not become more widespread, at least become more widely known. For example, in 2018, Joy Buolamwini and her research team at MIT published their Gender Shades work on facial recognition algorithms from companies like Amazon, IBM, and others. In this work, they found that these facial recognition algorithms had much higher error rates detecting darker-skinned faces than lighter-skinned faces. And that is obviously something that we would not want to see in any sort of fair model.

We saw something similar in 2019 in the healthcare sector, where an algorithm furnished to hospitals throughout the US, which was tasked with helping to allocate treatment to patients, was systematically underestimating the healthcare needs of black patients relative to white patients or patients from other ethnic groups. Again, likely a case of unintended bias, but nonetheless something that led to harms in the real world.

Why these failures happen

And I could spend the rest of the session talking through more examples of these issues, but that wouldn't be productive. Maybe what's more productive is coming to an understanding of why this is actually happening.

From my perspective, at least, it seems like this is in large part the result of us as data scientists thinking that our tools alone, the technical tools, the code and the data and the models that we're able to build, are enough to solve these complex social problems. Like duct tape, we use them wherever we can to fix whatever we can. And some situations are helped by duct tape, but duct tape alone cannot solve everything. I think that's a good metaphor for how we should think about it.

How we should think about it instead is that our toolkit is not duct tape to be applied wildly wherever we can. Rather, it's a toolkit that, in the hands of both us and the people who have the subject matter and lived expertise in the areas we operate in, can lay a solution to a problem brick by brick, or at least get us closer to solving it.

And the benefit is twofold. From them, we can get a greater appreciation for the factors that can't be, or simply aren't, encoded in our data but are nonetheless relevant to our work. And they, in turn, can help us better diagnose issues within our models before they're deployed.

A framework for responsible data science

Sounds good, but how do we actually do this? How do we actually do responsible, ethical, trustworthy, however you want to label it, data science in practice? It's the million-dollar question, and a number of researchers are working to answer it. To my reading, it seems like they've coalesced around three main factors that go into making a data science project a responsible one.

Those being, first, fairness in procedures. Basically, was the project designed in such a way that it can be conducted fairly? Did it include a diverse range of voices and all relevant stakeholders? Next would be fairness in predictions. When the model is actually built, does it perform equally well across different groups within the data? And the final one is fairness in outcomes. Once the process behind the model is set in stone and is more or less fair, and even if we get relative fairness in our predictions, does the impact of those predictions actually improve fairness in the real world? Are the impacts of those predictions equitable?

Fairness in procedures

And where better to start thinking about that than ensuring inclusion. If you don't have a diverse group of perspectives on your project, in all likelihood something could slip through that causes an unintended harm when your model is deployed. This is especially relevant for someone like me, who is not only a consultant, often coming into a data science project in a subject matter context where I have less expertise than some of the people I'm working with, but also a white man. And that bears with it certain biases, some of them benign, some of them harmful, in how I see the world and operate within it.

It's important for me, the people I work with on projects, and the clients I work with that those biases are checked by the views other people can bring to the table. And critically, I think this is something worth doing on all projects, besides simply being the right thing to do.

Next would be reproducible code. We don't want code that only runs on our own machines. This is maybe easier said than done, but fortunately there are a number of packages available for this within the R ecosystem. I won't go into too much detail on what those are, because so many other people have already spoken in depth about them, but I'll reference two of those people here. The first would be Dr. Anna Krystalli. She has a great talk called Putting the R into Reproducible Research, which you can find by Googling; it should come up.

Another person, Karthik Ram, delivered a talk at the previous RStudio conference in 2019. Google that and you should find it. I don't remember the title, but between the two of them, they cover a lot of really great material for doing reproducible work within R.

But back to fairness in procedures. Finally, to wrap this all up, we want to make sure that we document everything we do and every decision we make. In the same way that we want our code to be reproducible by technical people, we also want non-technical people, who may not be able to access our code directly, to be able to pore over our results in a way that's easy for them to interpret and that they can actually evaluate.

And the Google Brain Ethical AI team has done a lot of good work on this. They've come up with datasheets, model cards, and audit reports: three different documentation modalities that you can think of as nutrition labels, I guess, for each part of the project. And even though we don't yet have any way to do these within R, hopefully soon there will be some R Markdown templates that make this easier for us. I'm certainly trying to put together a package for it on my own end, and will hopefully have more updates on that in the future.

Fairness in predictions

And so you're in good shape to start thinking about fairness in predictions, starting with predicting accurately. This doesn't need a whole lot of development, but I include it here to point out that if your model does not have high predictive performance overall, by no conception can it be fair. Even if you perform equally across groups, getting a 10% level of performance across all of them is meaningless, and that's going to do you, and the people this model is hoping to serve, no good whatsoever.

But more subtle, and probably more relevant to the work we do, is predicting equally. To explore this in practical detail, and how we can actually work on it in our own projects, we're going to go through an example with the COMPAS recidivism dataset. This is a dataset released by the investigative journalism organization ProPublica, and it tracks demographic and criminal records information for defendants. The goal is to take all this information and use it to predict whether these defendants will commit another crime in the next two years.

Because of the demographic information within this data, it's often used as sort of the iris or titanic dataset for fairness methods and for evaluating fairness-maintaining approaches within data science, as we're about to do here.

On this plot, we can see two different bars: one for the accuracy of a logistic regression and one for the accuracy of a random forest model. Again, these take demographic and criminal history features and predict a binary outcome: will or will not commit a crime within two years. We can see that they perform roughly the same, with nothing really jumping out here, which is why maybe we want to dig a little deeper.
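As a rough sketch of that overall metric (using simulated stand-in labels and predictions, since the real COMPAS data and fitted models aren't loaded here), overall accuracy is just the share of predictions that match the labels:

```r
# Simulated stand-ins for two models' predictions on COMPAS-style labels
set.seed(42)
actual  <- rbinom(1000, 1, 0.45)   # 1 = committed another crime within two years
flip_lr <- runif(1000) < 0.30      # "logistic regression" wrong ~30% of the time
flip_rf <- runif(1000) < 0.28      # "random forest" wrong ~28% of the time
pred_lr <- ifelse(flip_lr, 1 - actual, actual)
pred_rf <- ifelse(flip_rf, 1 - actual, actual)

accuracy <- function(pred, actual) mean(pred == actual)
accuracy(pred_lr, actual)  # both models land close together overall
accuracy(pred_rf, actual)
```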

Maybe we can look at per-group metrics. In this case, instead of just looking at overall model performance, we'll still compare the performance of one model versus the other, but with respect to a protected or sensitive feature, in this case race, and then the protected groups, aka the levels of that feature, which here would be Hispanic, Other, Caucasian, or African-American. And already we can see some differences or disparities in performance coming to the fore, with the models getting their lowest accuracy for African-Americans relative to the other groups.
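A per-group metric is just a grouped version of the same calculation. A minimal base-R sketch with simulated data (the groups and accuracies here are illustrative, not the actual COMPAS results):

```r
# Per-group accuracy: compute the metric within each level of a protected feature
set.seed(1)
n      <- 2000
race   <- sample(c("African-American", "Caucasian", "Hispanic", "Other"),
                 n, replace = TRUE)
actual <- rbinom(n, 1, 0.45)
pred   <- ifelse(runif(n) < 0.30, 1 - actual, actual)  # ~70% accurate model

tapply(pred == actual, race, mean)  # accuracy within each protected group
```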

This is already a pretty compelling result, I think, and it goes to show that overall metrics alone are not enough to identify when you may have some issues of unfairness in your model. But we still don't necessarily know why these disparities exist. They have to be due, in part, to different rates of error, and we don't know whether that means more false positives or more false negatives for the different groups. So let's actually look at plots of error direction.

And here's where the really compelling causes start to come to the fore, with the model hitting an almost four times higher false positive rate for African-Americans than for Caucasians, and similarly high rates for the other non-Caucasian groups. This is clearly an issue of bias, and if we were not to rectify it or communicate it to our clients or the non-technical people we're working with on this project, and were instead to just put a stamp on it and let it go, this would perpetuate all sorts of harmful biases in the real world, and we don't want that.
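False positive rate slices the errors one step further: among people who did not reoffend, how often did the model predict that they would? A sketch with simulated predictions, where for simplicity the model makes no false negatives and one group's false positive rate is inflated to mimic the kind of bias described above (the rates are assumptions, not COMPAS values):

```r
set.seed(2)
n      <- 2000
race   <- sample(c("African-American", "Caucasian", "Hispanic", "Other"),
                 n, replace = TRUE)
actual <- rbinom(n, 1, 0.45)

# Flip true negatives to false positives more often for non-Caucasian groups
p_fp <- ifelse(race == "Caucasian", 0.10, 0.30)
pred <- ifelse(actual == 1, 1, rbinom(n, 1, p_fp))

# False positive rate per group: P(pred = 1 | actual = 0), split by race
tapply((pred == 1)[actual == 0], race[actual == 0], mean)
```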

To get better context before we actually communicate those conclusions, maybe we want to be able to say, okay, we've made every comparison between every pair of groups, so Caucasians with African-Americans, Caucasians with Hispanics, and so on. And maybe we care specifically about the relative performance, so we want to look at the percent differences. That's where fairness metrics come in: they don't measure absolute performance, they measure relative performance.

When we look at this plot, it's still the same y-axis and the same x-axis, but now we get the false positive rate for each group relative to the privileged group, Caucasians. And again, we can see that there's quite a lot of disparity. Why, though? What is it about the features going into the model that causes these disparities to exist?
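Turning absolute rates into a relative fairness metric is a one-line computation. A sketch with made-up rates, roughly echoing the "almost four times" disparity mentioned above (not the actual COMPAS numbers):

```r
# Illustrative per-group false positive rates
fpr <- c("African-American" = 0.40, "Hispanic" = 0.28,
         "Other" = 0.25, "Caucasian" = 0.11)

# Relative fairness metric: each group's rate as a multiple of the
# privileged group's rate; 1.0 would mean parity
fpr_ratio <- fpr / fpr[["Caucasian"]]
round(fpr_ratio, 2)
```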

It's hard to get at causality at any point within data science, but if we want to at least understand more about how the features impact these predictions differently, we're going to want to look at interpretability methods. These are often explained in the context of the black box, hence the image on the right. What they allow us to do is open this black box: because many of these interpretability methods are model-agnostic, they can be used even on models like XGBoost or random forests to generate approximately the sort of explanations, if not quite at the same level of fidelity, that we would get from a linear model like linear regression or logistic regression.

I'm not going to go through these in detail within this presentation, because the visuals for them are a little complex and we don't have a lot of time, but I will mention that they come in two main varieties, the first being global methods, which explain the average effect of features globally. You can think of this as like the coefficient of a regression, and if you want to read more about them, you can look up something like permutation feature importance, which ought to be well known to a number of people, partial dependence plots, or individual conditional expectation plots.
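As a from-scratch sketch of the first of those global methods, permutation feature importance shuffles one feature at a time and measures how much the model's performance drops (simulated data and a plain glm here for illustration, not the talk's actual models; packages like iml wrap this same idea):

```r
# Permutation feature importance: importance = drop in accuracy after
# breaking the link between one feature and the outcome
set.seed(3)
n  <- 1000
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- as.integer(plogis(2 * x1 + 0.2 * x2) > runif(n))  # x1 matters much more
df <- data.frame(y, x1, x2)

fit <- glm(y ~ x1 + x2, family = binomial, data = df)
acc <- function(d) mean((predict(fit, d, type = "response") > 0.5) == d$y)

base <- acc(df)
importance <- sapply(c("x1", "x2"), function(f) {
  shuffled <- df
  shuffled[[f]] <- sample(shuffled[[f]])  # permute just this feature
  base - acc(shuffled)
})
importance  # x1's importance should dwarf x2's
```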

We can also look at something in the category of local methods, which could be, say, LIME or Shapley values or derivatives of either of those. These are really powerful because they allow us to quantify, for an individual prediction on one observation, the exact amount that each feature value contributed to the prediction. So if someone were to ask, hey, why did I get this specific classification from your model? You could tell them this feature contributed this much, that one contributed that much, and so on. A lot of useful explanation there.
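To make that idea concrete in the one case where it is exact, here is a sketch for a linear model, where a feature's contribution to a single prediction is simply its coefficient times its value for that observation; Shapley values generalize this decomposition to arbitrary models (simulated data for illustration):

```r
# Local explanation of one prediction from a linear model
set.seed(4)
df   <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
df$y <- 3 * df$x1 - 1 * df$x2 + rnorm(100, sd = 0.1)
fit  <- lm(y ~ x1 + x2, data = df)

obs <- df[1, c("x1", "x2")]
contributions <- coef(fit)[-1] * unlist(obs)  # coefficient * feature value

contributions  # exactly how much each feature moved this one prediction
# Contributions plus the intercept reassemble the model's prediction:
sum(contributions) + coef(fit)[[1]]
```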

Hopefully this is something that we can end up applying to most of our models, and it's certainly a set of techniques that is well developed within the R package ecosystem currently. The iml package by Christoph Molnar is available; I use it in my work and I think it's fantastic. But there are also the vip and DALEX packages, which include a number of interpretability methods as well.

Fairness in outcomes

Outcomes are all about the consequences, and about how we consider fairness in the context of the real world. Because these outcomes, and the impacts they have on people, are not necessarily something that we can, or at least do, measure in the data, this requires conversations not just with the non-technical people we may be working with, who have the subject matter expertise for the projects we're working on, but also with the people impacted by the models.

And talking with all these different groups and coming up with some conception of fairness to say, okay, maybe we're going to optimize for certain predictions despite these predictions coming at a loss of predictive performance for our model, because ultimately that's the equitable thing to do, that's something that's very project-specific. It doesn't make a lot of sense for me or anyone else to give any sort of real definitive guidance on that.

And just for a little more detail, maybe an example to make this clear: you could think of fairness in outcomes as being relevant in a case where we're deciding whether someone deserves a scholarship, binary yes or no. A yes would benefit someone who comes from a more disadvantaged background much more than someone who comes from a more privileged background. And again, that's not necessarily something we'd measure within our model. But if it's something we want to work towards, then we need to make sure to mitigate the bias that we see in our model in a specific way, to serve that end.

And that's where bias mitigation techniques come in. And I have the star on the word there in the previous slide, because these techniques, they sometimes work well, they sometimes don't, but by and large, they all have trade-offs. And that's because you're either modifying the data in the case of pre-processing that goes into your models, the model itself, or the predictions that come out of the model to reach some notion of fairness. And while that may improve that specific notion of fairness, you may see that there are losses in predictive performance or fairness across other dimensions.

And to show this, we'll go back and look at the COMPAS example. Here I have a plot that again shows protected groups on the y-axis and false positive rate on the x-axis. But now, instead of looking at logistic regression and random forest, we're looking at just the random forest, with two sets of predictions: one where the prediction thresholds are all set at 0.5, so if your predicted probability is higher than 0.5 we predict that you will commit a crime, and the opposite if lower; and one where the threshold is set to an optimal level for each group, such that it minimizes false positive rate. We can see that it does that well: all groups actually have a large decrease in false positive rate, which is great, at least relative to each other.
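A sketch of that post-processing idea with simulated scores (the bias pattern and the 10% target are assumptions for illustration): instead of one 0.5 cutoff for everyone, each group gets the cutoff that brings its false positive rate down to a common target.

```r
set.seed(5)
n      <- 4000
race   <- sample(c("African-American", "Caucasian"), n, replace = TRUE)
actual <- rbinom(n, 1, 0.45)
# Simulated model scores, shifted upward for one group to mimic a biased model
score  <- plogis(qlogis(0.3 + 0.4 * actual) +
                 0.5 * (race == "African-American") + rnorm(n))

fpr <- function(cutoff, g) mean(score[actual == 0 & race == g] > cutoff)
fpr(0.5, "African-American")  # higher than the Caucasian rate below
fpr(0.5, "Caucasian")

# Per-group cutoffs: the 90th percentile of each group's negative-class scores,
# so both groups land at roughly a 10% false positive rate
cutoffs <- sapply(unique(race), function(g)
  unname(quantile(score[actual == 0 & race == g], 0.90)))
sapply(unique(race), function(g) fpr(cutoffs[[g]], g))
```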

We need to go back a few levels and look at the per-group performance metrics on an absolute basis to see what happened there. And when we do that, unfortunately, we see that most of the gains in fairness from that previous plot are actually due to the error rate increasing for all of the groups, but increasing for the other groups more than it did for the African-American group. So basically, what happened here is the model said: okay, you want the false positive rate to be more fair across groups, so we'll give the other groups more error and make it so that the African-American group's error relative to the other groups is lower than it was originally. If fairness is our goal here, that's fine; it's not a bad result. However, it clearly did come with a trade-off, and whenever we use these methods, that's something we're going to have to consider.

Lastly, for fairness in outcomes, we want to consider the possibility of algorithmic recourse. If our models are making decisions about people, we don't want those people to lose the ability to have some influence over the decision. They should be able to appeal it, they should be able to ask for explanations as to why they were given that decision, and, if they ask, they should be given recommendations for what they can do to change it. For more info on this, I recommend looking at the work of Ruha Benjamin, William Isaac, or Shakir Mohamed.

Closing thoughts

And fundamentally, that's great, because doing data science in this way is the right thing to do. There are a lot of good moral justifications for why we should do our work with a greater conception of fairness and consider factors outside of the code, the data, and the predictions that we normally deal with. Maybe as a profession, we should start thinking of ourselves not just as data scientists, but also as ethnographers, sociologists, and anthropologists. And not just thinking about it, either, but actually going out and seeking that knowledge.

But this is not just the right thing to do. Doing these practices, and doing data science more responsibly in this way, is doing better data science. By doing responsible data science, you'll build better models and have fewer errors sneaking through to model deployment. And fundamentally, you'll be serving all the people that you're already serving much better than you are now, with a greater eye to safety and fairness.

If that's not worth doing, then I don't know what is. What I have to wrap up with are a few resources and additional links that people can look at, but I also have a number of other resources, in addition to the code to generate these plots and a few other analyses, on GitHub. So if you're interested, please do check that out. Otherwise, thanks for coming to the talk and for sitting here with me. Hope to connect with you soon and see you using these techniques yourself. Best of luck.