Resources

Grant Fleming | Fairness and Data Science: Failures, Factors, and Futures | RStudio

In recent years, numerous highly publicized failures in data science have made evident that biases or issues of fairness in training data can sneak into, and be magnified by, our models, leading to harmful, incorrect predictions being made once the models are deployed into the real world. But what actually constitutes an unfair or biased model, and how can we diagnose and address these issues within our own work? In this talk, I will present a framework for better understanding how issues of fairness overlap with data science, as well as how we can improve our modeling pipelines to make them more interpretable, reproducible, and fair to the groups that they are intended to serve. We will explore this new framework together through an analysis of ProPublica's COMPAS recidivism dataset using the tidymodels, drake, and iml packages. About Grant: Grant Fleming is a Data Scientist at Elder Research, co-author of the Wiley book _Responsible Data Science_ (2021), and contributor to the O'Reilly book _97 Things About Ethics Everyone in Data Science Should Know_. His professional focus is on machine learning for social science applications, model explainability, and building tools for reproducible data science. Previously, Grant was a research contractor for USAID.

image: thumbnail.jpg

Transcript

This transcript was generated automatically and may contain errors.

My name is Grant Fleming, and I'm a data scientist at Elder Research, where I work on data science consulting projects with various public sector clients. In that work, along with a lot of the work that I'm doing outside of it, I'm having to wrestle with the issue of fairness in data science: specifically, things like how do we make sure that the predictions being generated by our models are fair? How do we make sure we're actually recording the correct information in our data and not leaving certain important factors out? And if we suspect that something may have snuck into our model, how do we make sure that we're able to detect that element before we deploy our model and cause harm unintentionally?

Because as we've probably seen in some of the news over the past few years, this is an issue that has, if not become more widespread, at least become more widely known. For example, in 2018, Joy Buolamwini and her research team at MIT published their Gender Shades work on facial recognition algorithms from companies like Amazon, IBM, and others. In this work, they found that these facial recognition algorithms had much higher error rates detecting darker-skinned faces than lighter-skinned faces. And that is obviously something that we would not want to see in any sort of fair model.

We saw something similar in 2019 in the healthcare sector, where an algorithm furnished to hospitals throughout the US, which was tasked with helping to allocate treatment to patients, was systematically underestimating the healthcare needs of black patients relative to white patients or patients from other ethnic groups. Again, likely a case of unintended bias, but nonetheless something that led to harms in the real world.

Why these failures happen

And I could spend the rest of the session talking through more examples of these issues, but that wouldn't be productive. Maybe what's more productive is coming to an understanding of why this is actually happening.

From my perspective, at least, it seems like this is in large part the result of us as data scientists thinking that our tools alone, the technical tools, the code and the data and the models that we're able to build, are enough to solve these complex social problems. Like duct tape, we use them wherever we can to fix whatever we can. And some situations are helped by duct tape, but duct tape alone cannot solve everything. I think that's a good metaphor for how we should think about it.

How we should think about it instead is that our toolkit is not duct tape to be applied wildly wherever we can. Rather, it's a toolkit that, in the hands of both us and the people who have the subject matter and lived expertise in the areas we operate in, can lay a solution to a problem brick by brick, or at least get us closer to solving it.

And the benefit is twofold. From them, we can get a greater appreciation for the factors that can't be, or simply aren't, encoded in our data but are nonetheless relevant to our work. And they, in turn, can help us better diagnose issues within our models before they're deployed.

A framework for responsible data science

Sounds good, but how do we actually do this? How do we actually do responsible, ethical, trustworthy, however you want to label it, data science in practice? It's the million-dollar question, and a number of researchers are working to answer it. To my reading, it seems like they've coalesced around three main factors that go into making a data science project a responsible one.

Those being, first, fairness in procedures. Basically, was the project designed in such a way that it can be conducted fairly? Did it include a diverse range of voices and all relevant stakeholders? Next would be fairness in predictions. When the model is actually built, does it perform equally well across different groups within the data? And the final one is fairness in outcomes. Once the process behind the model is set in stone and is more or less fair, and even if we get relative fairness in our predictions, does the impact of those predictions actually improve fairness in the real world? Are the impacts of those predictions equitable?

Fairness in procedures

And where better to start thinking about that than ensuring inclusion. If you don't have a diverse group of perspectives on your project, in all likelihood something could slip through that causes an unintended harm when your model is deployed. This is especially relevant for someone like me, who is not only a consultant, often coming into a data science project in a subject matter context where I have less expertise than some of the people I'm working with, but also a white man. And that bears with it certain biases, some of them benign, some of them harmful, in how I see the world and operate within it.

It's important for me, the people I work with on projects, and the clients I work with that those biases are checked by the views other people can bring to the table. And critically, I think this is something worth doing on all projects, besides simply being the right thing to do.

Next would be reproducible code. We don't want code that only runs on our own machines. This is maybe easier said than done, but fortunately there are a number of packages available for this within the R ecosystem. I won't go into too much detail on what those are, because so many other people have already spoken in depth about them, but I'll reference two of those people here. The first would be Dr. Anna Krystalli. She has a great talk called Putting the R into Reproducible Research, which you can find by Googling; it should come up.

Another person, Karthik Ram, delivered a talk at the previous RStudio conference in 2019. Google that and you should find it. I don't remember the title, but between the two of them, they cover a lot of really great material for doing reproducible work within R.

But back to fairness in procedures. Finally, to wrap this all up, we want to make sure that we document everything we do and every decision we make. In the same way that we want our code to be reproducible by technical people, we also want non-technical people, who may not be able to access our code directly, to be able to pore over our results in a way that's easy for them to interpret and that they can actually evaluate.

And the Google Brain Ethical AI team has done a lot of good work on this. They've come up with datasheets, model cards, and audit reports: three different documentation modalities that you can think of as nutrition labels, I guess, for each part of the project. And even though we don't yet have any way to do these within R, hopefully soon there will be some R Markdown templates that make this easier for us. I'm certainly trying to put together a package for it on my own end, and will hopefully have more updates on that in the future.

Fairness in predictions

And so you're in good shape to start thinking about fairness in predictions, starting with predicting accurately. This doesn't need a whole lot of development, but I include it here to point out that if your model does not have high predictive performance overall, by no conception can it be fair. Even if you perform equally across groups, getting a 10% level of performance across all of them is meaningless, and that's going to do you, and the people this model is hoping to serve, no good whatsoever.

But more subtle, and probably more relevant to the work we do, is predicting equally. To explore this in practical detail, and how we can actually work on it in our own projects, we're going to go through an example with the COMPAS recidivism dataset. This is a dataset released by the investigative journalism organization ProPublica, and it tracks demographic and criminal records information for defendants. The goal is to take all this information and use it to predict whether these defendants will commit another crime in the next two years.

Because of the demographic information within this data, it's often used as sort of the iris or titanic dataset for fairness methods and for evaluating fairness-maintaining approaches within data science, as we're about to do here.

On this plot, we can see two different bars: one for the accuracy of a logistic regression and one for the accuracy of a random forest model. Again, these take demographic and criminal history features and predict a binary outcome: will or will not commit a crime within two years. We can see that they perform roughly the same, with nothing really jumping out here, which is why maybe we want to dig a little deeper.
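As a rough sketch of that overall metric (using simulated stand-in labels and predictions, since the real COMPAS data and fitted models aren't loaded here), overall accuracy is just the share of predictions that match the labels:

```r
# Simulated stand-ins for two models' predictions on COMPAS-style labels
set.seed(42)
actual  <- rbinom(1000, 1, 0.45)   # 1 = committed another crime within two years
flip_lr <- runif(1000) < 0.30      # "logistic regression" wrong ~30% of the time
flip_rf <- runif(1000) < 0.28      # "random forest" wrong ~28% of the time
pred_lr <- ifelse(flip_lr, 1 - actual, actual)
pred_rf <- ifelse(flip_rf, 1 - actual, actual)

accuracy <- function(pred, actual) mean(pred == actual)
accuracy(pred_lr, actual)  # both models land close together overall
accuracy(pred_rf, actual)
```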

Maybe we can look at per-group metrics. In this case, instead of just looking at overall model performance, we'll still compare the performance of one model versus the other, but with respect to a protected or sensitive feature, in this case race, and then the protected groups, aka the levels of that feature, which here would be Hispanic, Other, Caucasian, or African-American. And already we can see some differences or disparities in performance coming to the fore, with the models getting their lowest accuracy for African-Americans relative to the other groups.
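A per-group metric is just a grouped version of the same calculation. A minimal base-R sketch with simulated data (the groups and accuracies here are illustrative, not the actual COMPAS results):

```r
# Per-group accuracy: compute the metric within each level of a protected feature
set.seed(1)
n      <- 2000
race   <- sample(c("African-American", "Caucasian", "Hispanic", "Other"),
                 n, replace = TRUE)
actual <- rbinom(n, 1, 0.45)
pred   <- ifelse(runif(n) < 0.30, 1 - actual, actual)  # ~70% accurate model

tapply(pred == actual, race, mean)  # accuracy within each protected group
```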

This is already a pretty compelling result, I think, and it goes to show that overall metrics alone are not enough to identify when you may have some issues of unfairness in your model. But we still don't necessarily know why these disparities exist. They have to be due, in part, to different rates of error, and we don't know whether that means more false positives or more false negatives for the different groups. So let's actually look at plots of error direction.

And here's where the really compelling causes start to come to the fore, with the model hitting an almost four times higher false positive rate for African-Americans than for Caucasians, and similarly high rates for the other non-Caucasian groups. This is clearly an issue of bias, and if we were not to rectify it or communicate it to our clients or the non-technical people we're working with on this project, and were instead to just put a stamp on it and let it go, this would perpetuate all sorts of harmful biases in the real world, and we don't want that.
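False positive rate slices the errors one step further: among people who did not reoffend, how often did the model predict that they would? A sketch with simulated predictions, where for simplicity the model makes no false negatives and one group's false positive rate is inflated to mimic the kind of bias described above (the rates are assumptions, not COMPAS values):

```r
set.seed(2)
n      <- 2000
race   <- sample(c("African-American", "Caucasian", "Hispanic", "Other"),
                 n, replace = TRUE)
actual <- rbinom(n, 1, 0.45)

# Flip true negatives to false positives more often for non-Caucasian groups
p_fp <- ifelse(race == "Caucasian", 0.10, 0.30)
pred <- ifelse(actual == 1, 1, rbinom(n, 1, p_fp))

# False positive rate per group: P(pred = 1 | actual = 0), split by race
tapply((pred == 1)[actual == 0], race[actual == 0], mean)
```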

To get better context before we actually communicate those conclusions, maybe we want to be able to say, okay, we've made every comparison between every pair of groups, so Caucasians with African-Americans, Caucasians with Hispanics, and so on. And maybe we care specifically about the relative performance, so we want to look at the percent differences. That's where fairness metrics come in: they don't measure absolute performance, they measure relative performance.

When we look at this plot, it's still the same y-axis and the same x-axis, but now we get the false positive rate for each group relative to the privileged group, Caucasians. And again, we can see that there's quite a lot of disparity. Why, though? What is it about the features going into the model that causes these disparities to exist?
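Turning absolute rates into a relative fairness metric is a one-line computation. A sketch with made-up rates, roughly echoing the "almost four times" disparity mentioned above (not the actual COMPAS numbers):

```r
# Illustrative per-group false positive rates
fpr <- c("African-American" = 0.40, "Hispanic" = 0.28,
         "Other" = 0.25, "Caucasian" = 0.11)

# Relative fairness metric: each group's rate as a multiple of the
# privileged group's rate; 1.0 would mean parity
fpr_ratio <- fpr / fpr[["Caucasian"]]
round(fpr_ratio, 2)
```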

It's hard to get at causality at any point within data science, but if we want to at least understand more about how the features impact these predictions differently, we're going to want to look at interpretability methods. These are often explained in the context of the black box, hence the image on the right. What they allow us to do is open this black box: because many of these interpretability methods are model-agnostic, they can be used even on models like XGBoost or random forests to generate approximately the sort of explanations, if not quite at the same level of fidelity, that we would get from a linear model like linear regression or logistic regression.

I'm not going to go through these in detail within this presentation, because the visuals for them are a little complex and we don't have a lot of time, but I will mention that they come in two main varieties, the first being global methods, which explain the average effect of features globally. You can think of this as like the coefficient of a regression, and if you want to read more about them, you can look up something like permutation feature importance, which ought to be well known to a number of people, partial dependence plots, or individual conditional expectation plots.
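As a from-scratch sketch of the first of those global methods, permutation feature importance shuffles one feature at a time and measures how much the model's performance drops (simulated data and a plain glm here for illustration, not the talk's actual models; packages like iml wrap this same idea):

```r
# Permutation feature importance: importance = drop in accuracy after
# breaking the link between one feature and the outcome
set.seed(3)
n  <- 1000
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- as.integer(plogis(2 * x1 + 0.2 * x2) > runif(n))  # x1 matters much more
df <- data.frame(y, x1, x2)

fit <- glm(y ~ x1 + x2, family = binomial, data = df)
acc <- function(d) mean((predict(fit, d, type = "response") > 0.5) == d$y)

base <- acc(df)
importance <- sapply(c("x1", "x2"), function(f) {
  shuffled <- df
  shuffled[[f]] <- sample(shuffled[[f]])  # permute just this feature
  base - acc(shuffled)
})
importance  # x1's importance should dwarf x2's
```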

We can also look at something in the category of local methods, which could be, say, LIME or Shapley values or derivatives of either of those. These are really powerful because they allow us to quantify, for an individual prediction on one observation, the exact amount that each feature value contributed to the prediction. So if someone were to ask, hey, why did I get this specific classification from your model? You could tell them this feature contributed this much, that one contributed that much, and so on. A lot of useful explanation there.
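To make that idea concrete in the one case where it is exact, here is a sketch for a linear model, where a feature's contribution to a single prediction is simply its coefficient times its value for that observation; Shapley values generalize this decomposition to arbitrary models (simulated data for illustration):

```r
# Local explanation of one prediction from a linear model
set.seed(4)
df   <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
df$y <- 3 * df$x1 - 1 * df$x2 + rnorm(100, sd = 0.1)
fit  <- lm(y ~ x1 + x2, data = df)

obs <- df[1, c("x1", "x2")]
contributions <- coef(fit)[-1] * unlist(obs)  # coefficient * feature value

contributions  # exactly how much each feature moved this one prediction
# Contributions plus the intercept reassemble the model's prediction:
sum(contributions) + coef(fit)[[1]]
```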

Hopefully this is something that we can end up applying to most of our models, and it's certainly a set of techniques that is well developed within the R package ecosystem currently. The iml package by Christoph Molnar is available; I use it in my work and I think it's fantastic. But there are also the vip and DALEX packages, which include a number of interpretability methods as well.

Fairness in outcomes

Outcomes are all about the consequences, and about how we consider fairness in the context of the real world. Because these outcomes, and the impacts they have on people, are not necessarily something that we can, or at least do, measure in the data, this requires conversations not just with the non-technical people we may be working with, who have the subject matter expertise for the projects we're working on, but also with the people impacted by the models.

And talking with all these different groups and coming up with some conception of fairness to say, okay, maybe we're going to optimize for certain predictions despite these predictions coming at a loss of predictive performance for our model, because ultimately that's the equitable thing to do, that's something that's very project-specific. It doesn't make a lot of sense for me or anyone else to give any sort of real definitive guidance on that.

And just for a little more detail, maybe an example to make this clear: you could think of fairness in outcomes as being relevant in a case where we're deciding whether someone deserves a scholarship, binary yes or no. A yes would benefit someone who comes from a more disadvantaged background much more than someone who comes from a more privileged background. And again, that's not necessarily something we'd measure within our model. But if it's something we want to work towards, then we need to make sure to mitigate the bias that we see in our model in a specific way, to serve that end.

And that's where bias mitigation techniques come in. And I have the star on the word there in the previous slide, because these techniques, they sometimes work well, they sometimes don't, but by and large, they all have trade-offs. And that's because you're either modifying the data in the case of pre-processing that goes into your models, the model itself, or the predictions that come out of the model to reach some notion of fairness. And while that may improve that specific notion of fairness, you may see that there are losses in predictive performance or fairness across other dimensions.

And to show this, we'll go back and look at the COMPAS example. Here I have a plot that again shows protected groups on the y-axis and false positive rate on the x-axis. But now, instead of looking at logistic regression and random forest, we're looking at just the random forest, with two sets of predictions: one where the prediction thresholds are all set at 0.5, so if your predicted probability is higher than 0.5 we predict that you will commit a crime, and the opposite if lower; and one where the threshold is set to an optimal level for each group, such that it minimizes false positive rate. We can see that it does that well: all groups actually have a large decrease in false positive rate, which is great, at least relative to each other.
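A sketch of that post-processing idea with simulated scores (the bias pattern and the 10% target are assumptions for illustration): instead of one 0.5 cutoff for everyone, each group gets the cutoff that brings its false positive rate down to a common target.

```r
set.seed(5)
n      <- 4000
race   <- sample(c("African-American", "Caucasian"), n, replace = TRUE)
actual <- rbinom(n, 1, 0.45)
# Simulated model scores, shifted upward for one group to mimic a biased model
score  <- plogis(qlogis(0.3 + 0.4 * actual) +
                 0.5 * (race == "African-American") + rnorm(n))

fpr <- function(cutoff, g) mean(score[actual == 0 & race == g] > cutoff)
fpr(0.5, "African-American")  # higher than the Caucasian rate below
fpr(0.5, "Caucasian")

# Per-group cutoffs: the 90th percentile of each group's negative-class scores,
# so both groups land at roughly a 10% false positive rate
cutoffs <- sapply(unique(race), function(g)
  unname(quantile(score[actual == 0 & race == g], 0.90)))
sapply(unique(race), function(g) fpr(cutoffs[[g]], g))
```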

We need to go back a few levels and look at the per-group performance metrics on an absolute basis to see what happened there. And when we do that, unfortunately, we see that most of the gains in fairness from that previous plot are actually due to the error rate increasing for all of the groups, but increasing for the other groups more than it did for the African-American group. So basically, what happened here is the model said: okay, you want the false positive rate to be more fair across groups, so we'll give the other groups more error and make it so that the African-American group's error relative to the other groups is lower than it was originally. If fairness is our goal here, that's fine; it's not a bad result. However, it clearly did come with a trade-off, and whenever we use these methods, that's something we're going to have to consider.

Lastly, for fairness in outcomes, we want to consider the possibility of algorithmic recourse. If our models are making decisions about people, we don't want those people to lose the ability to have some influence over the decision. They should be able to appeal it, they should be able to ask for explanations as to why they were given that decision, and, if they ask, they should be given recommendations for what they can do to change it. For more info on this, I recommend looking at the work of Ruha Benjamin, William Isaac, or Shakir Mohamed.

Closing thoughts

And fundamentally, that's great, because doing data science in this way is the right thing to do. There are a lot of good moral justifications for why we should do our work with a greater conception of fairness and consider factors outside of the code, the data, and the predictions that we normally deal with. Maybe as a profession, we should start thinking of ourselves not just as data scientists, but also as ethnographers, sociologists, and anthropologists. And not just thinking about it, either, but actually going out and seeking that knowledge.

But this is not just the right thing to do. Doing these practices, and doing data science more responsibly in this way, is doing better data science. By doing responsible data science, you'll build better models and have fewer errors sneaking through to model deployment. And fundamentally, you'll be serving all the people that you're already serving much better than you are now, with a greater eye to safety and fairness.

If that's not worth doing, then I don't know what is. What I have to wrap up with are a few resources and additional links that people can look at, but I also have a number of other resources, in addition to the code to generate these plots and a few other analyses, on GitHub. So if you're interested, please do check that out. Otherwise, thanks for coming to the talk and for sitting here with me. Hope to connect with you soon and see you using these techniques yourself. Best of luck.