
Simon Couch: Fair machine learning
Cascadia R Conf 2024 · Regular talk, 10:25–10:40

In recent years, high-profile analyses have called attention to many contexts where the use of machine learning deepened inequities in our communities. A machine learning model resulted in wealthy homeowners being taxed at a significantly lower rate than poorer homeowners; a model used in criminal sentencing disproportionately predicted that Black defendants would commit a crime in the future compared to white defendants; a recruiting and hiring model penalized feminine-coded words—like the names of historically women's colleges—when evaluating résumés. In late 2022, a group of Posit employees across teams, roles, and technical backgrounds formed a reading group to engage with literature on machine learning fairness, a research field that aims to define what it means for a statistical model to act unfairly and to take measures to address that unfairness. We then designed functionality and resources to help data scientists measure and critique the ways in which the machine learning models they've built might disparately impact the people affected by those models. This talk will introduce the research field of machine learning fairness and demonstrate a fairness-oriented analysis of a model with tidymodels, a framework for machine learning in R.

Pronouns: he/him
Chicago, IL

Simon Couch is a software engineer at Posit PBC (formerly RStudio), where he works on open source statistical software. With an academic background in statistics and sociology, Simon believes that principled tooling has a profound impact on our ability to think rigorously about data. He authors and maintains a number of R packages and blogs about the process at simonpcouch.com.
Transcript
This transcript was generated automatically and may contain errors.
I want to start this talk off with a question that some of y'all might find a little strange.
So for some context, let's start on some familiar territory. We're looking at an XY plane. We're looking at predictions from a machine learning model, where on the x-axis, we have the true value of some outcome variable, and on the y-axis, we have the predictions from a machine learning model. This is situated in some context where we want to predict the weight of a vase, and for some reason, our model tends to predict that the lighter vases are heavier than they actually are, and that the heavier vases are lighter than they actually are. So my question that's maybe a little strange is, is this fair? Maybe a yes if you think this is fair, a no if you think this is unfair, and a shrug is totally OK if you're like, I don't really see the association.
So I'm seeing mostly shrugs, and I would agree with you. This is a bad model. In statistics, we call this regressivity: the model's predictions tend toward the mean. Regardless of the values of the predictors, the model tends to predict that the weight of the vase is somewhere in the middle. So this is a bad model, but I don't know that it's necessarily an unfair model. Now I'm going to pull a little magic trick: I'm going to change the title, the subtitle, and the labels on the axis ticks in this plot.
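That regressive pattern is easy to simulate. Here is a minimal sketch in base R, with made-up data rather than the plot from the talk: predictions pulled toward the mean show up as a slope well below 1 when you regress the predictions on the truth.

```r
set.seed(1)
truth <- runif(200, min = 1, max = 10)  # true vase weights, in kg

# A regressive model: predictions are shrunk toward the overall mean,
# so light vases are over-predicted and heavy vases under-predicted.
pred <- mean(truth) + 0.4 * (truth - mean(truth)) + rnorm(200, sd = 0.5)

# Regressing predictions on truth gives a slope well below 1 --
# the signature of regressivity.
slope <- coef(lm(pred ~ truth))[["truth"]]
slope

plot(truth, pred, xlab = "True weight (kg)", ylab = "Predicted weight (kg)")
abline(0, 1, lty = 2)  # a perfect model would sit on this dashed line
```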
This model is actually used in property assessment to generate property taxes. And what we're seeing, in some world where houses run between $100,000 and $500,000, is that for the less expensive houses, we tend to overpredict the value of the home, and on the other end of the distribution, for the more expensive houses, we tend to underpredict it. So I'm going to ask the same question now. Is this fair? I'm seeing no's, and I agree with you. We're taxing different portions of the population at different rates purely because of how this model behaves across the distribution.
Fairness is about our beliefs
So this gets at the first of the three points I want to make in this talk: fairness is not just about statistical behavior. Those of us working in machine learning are used to operationalizing our beliefs somewhat straightforwardly into evaluation metrics. We really need to take a step back and recognize that when we assess models for fairness, we come to that evaluation task with beliefs, and those beliefs translate into whether we think our model is doing well or not. So the same model parameters can result in behavior that feels totally benign when situated in one context and deeply unjust in another.
So like I said, this is a talk about fair machine learning. My name is Simon. I work on open source R packages at Posit, and I specifically focus on a framework called tidymodels. If you're a tidyverse user, it's sort of a younger sibling to the tidyverse, focused specifically on machine learning. Every year we run a user survey to get a better sense of what our users want us to be working on in the next year, and in late 2022, the results showed us that people wanted better tools to analyze their models with fairness in mind.
So we formed a reading group at Posit, folks from different parts of the company across fields and academic backgrounds, sociology, statistics, psychology. We read a bunch of papers and we tried to figure out what software might look like if it were to actually help people engage with the hardest parts of the process of evaluating models with respect to fairness. And we came up with a set of software that we feel supports people in doing that. So I'm going to talk about not just that software, but those kind of three hard parts that I mentioned earlier. And the first one, like I said, was that fairness is about our beliefs.
The problem of disparities in taxation based on these assessment models is a real one. A year ago, I moved to Chicago. This is a report from 2017 about the tax assessment models used in Cook County, where I now live, and it's one of many articles about this system in Chicago. If you look up the same kind of thing for Seattle, you'll see the same results. And just two years ago, the New York Times did a sort of meta-analysis across North American cities and found the same pattern of behavior: rates of taxation were disparate across the distribution of home values.
Defining fairness is hard
The second hard part that I want to try to underscore is that defining fairness is really hard. The translation of our values and our morally held beliefs into these mathematical measures that we can evaluate is not at all a trivial task, and it doesn't do anything to resolve the differences in our morally held beliefs.
Let me take that back to this home value example. Again, this is a bad model, but it's bad in kind of a nice way: all the errors are correlated with each other, and we have statistical techniques to correct for that. One common argument about what makes a model fair is that a fair model should at least be performant by our usual metrics. Earlier in this session, we mentioned R-squared and RMSE. So let's see if we can make this model as performant as possible with respect to those metrics, and see whether that brings us closer to something that feels fair to us.
Again, because these errors are correlated in a really nice way, we can apply a statistical technique called calibration. If we're lucky, we end up with a plot that looks something like this, where the errors are independently and identically distributed. So maybe this is better. Let's check that the errors really are independently and identically distributed; what that means is that the mean and variance of the errors are constant across the distribution of the outcome. That looks to be just about the case: the errors are similar for a house that's $100,000 and one that's $500,000. And to me, that feels like exactly the problem. If I own a $100,000 house and I'm just as likely to receive a $50,000 error in the assessment of my home as somebody who owns a $500,000 house, that error, as a percentage, is much more impactful to me.
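As a sketch of what that correction can look like, here is a simple linear calibration in plain base R on simulated data (this is an illustration of the idea, not the speaker's actual code): regress the truth on the regressive predictions, then use the fitted line to adjust them.

```r
set.seed(1)
truth <- runif(300, min = 100, max = 500)  # home values, in $1000s

# Regressive assessments: shrunk toward the mean, plus noise.
pred <- mean(truth) + 0.4 * (truth - mean(truth)) + rnorm(300, sd = 20)

# Linear calibration: learn the systematic distortion, then undo it.
recal    <- lm(truth ~ pred)
pred_cal <- predict(recal)

# After calibration, regressing the calibrated predictions on truth
# gives a slope near 1: the errors no longer trend with the outcome.
slope_cal <- coef(lm(pred_cal ~ truth))[["truth"]]
slope_cal
```

If I recall correctly, the probably package in the tidymodels ecosystem provides calibration helpers along these lines; the base-R version above is just the smallest self-contained demonstration.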
So maybe we can evaluate this model from a different perspective. If the percentage error is consistent across the distribution, maybe that's a better model in terms of what I think is fair. So we can train a model that optimizes the mean percentage error across the distribution, and it looks something like this. To me, that feels better. But at the same time, people who are used to seeing metrics like R-squared or root mean squared error might say that our model got worse: the errors are larger across the distribution. So these two interpretations of fairness are in conflict with each other, and they arose from two different worldviews about how we should be taxing people and how errors should be distributed.
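That conflict is concrete enough to show with two toy models of two homes (numbers invented for illustration): one model with a constant dollar error, one with a constant percentage error. Each wins under one metric and loses under the other.

```r
rmse <- function(truth, estimate) sqrt(mean((truth - estimate)^2))
mape <- function(truth, estimate) mean(abs(truth - estimate) / truth) * 100

truth   <- c(100, 500)   # home values, in $1000s
model_a <- truth + 50    # constant $50k error:  c(150, 550)
model_b <- truth * 1.15  # constant 15% error:   c(115, 575)

rmse(truth, model_a)  # 50.0  -> model A looks better by RMSE
rmse(truth, model_b)  # ~54.1
mape(truth, model_a)  # 30.0  -> model B looks better by percentage error
mape(truth, model_b)  # 15.0
```

The yardstick package offers these metrics (`rmse()`, `mape()`) in tidymodels-native form; the hand-rolled functions here just keep the example self-contained.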
So defining fairness is hard. This is one of the most popular quotes from a recent meta-analysis in the field: definitions of fairness are not mathematically or morally compatible in general. For the mathematicians in the room, there actually is a proof here, called the impossibility theorem. It basically says that if we do indeed live in a world where there are disparities between groups, then it is impossible to satisfy more than a small fixed set of the different definitions of fairness at the same time.
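One way to see the tension numerically uses an identity from Chouldechova's 2017 paper relating a classifier's base rate, positive predictive value, and error rates: if two groups have different base rates but the model is equally calibrated (same PPV) and equally sensitive (same TPR) for both, their false positive rates cannot be equal. The function below just evaluates that identity; the specific numbers are illustrative.

```r
# Chouldechova's identity: FPR = r/(1-r) * (1-PPV)/PPV * TPR,
# where r is the group's base rate of the outcome.
fpr <- function(r, ppv, tpr) r / (1 - r) * (1 - ppv) / ppv * tpr

# Same PPV and TPR, different base rates -> different FPRs, necessarily.
fpr(r = 0.3, ppv = 0.8, tpr = 0.7)  # group 1: FPR = 0.075
fpr(r = 0.5, ppv = 0.8, tpr = 0.7)  # group 2: FPR = 0.175
```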
Thinking about the whole system
The last thing I want to talk about: as practitioners of machine learning, we're very used to the process of evaluating our models with respect to performance metrics, and much of the work in machine learning fairness as a research field has focused on what it looks like to measure that performance with metrics. But I want to argue that thinking about the whole system is just as important, because the metrics evaluate the model, and the model is situated in a much broader system. I'll try to argue that in this same problem context.
Let's think about how the predictions will be used. Under one scheme, you take the assessed value of the home, multiply it by some fixed rate, and that decides your property tax. I found this 0.9% figure on some King County web page, and then I went to a different page and found a different number, so if you live in the area, you probably know what this number actually is, or whether there's an analogous number. But the idea here, at least to me, is that the model trained on mean percentage error seems like the most performant, or the most fair, model in that case.
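Under that flat-rate scheme, a constant percentage error treats every owner the same: each overpays by the same fraction of their fair tax. A tiny sketch (the 0.9% rate and the home values are illustrative, not actual King County figures):

```r
rate  <- 0.009              # illustrative flat property tax rate
truth <- c(100000, 500000)  # true home values

assessed <- truth * 1.15    # constant 15% over-assessment

tax_fair <- truth * rate
tax_paid <- assessed * rate
tax_paid / tax_fair         # both owners overpay by exactly 15%
```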
Where I live in Chicago, and in most places in the US, there's something called the homeowner exemption: if you live in the home that you own, some fixed amount of the assessed value is wiped out. So we're shifting the distribution of the errors down, and then the rest of that value is taxed at a fixed percentage. In that case, R-squared or root mean squared error seems like a reasonable metric to me, because it does better on the other end of the distribution. If I own one home and live in it, that shift in the errors is more impactful for me. If I own several homes that I don't live in, the errors are averaged across the homes, so it's not as important to get each one spot-on.
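With an exemption in the mix, the arithmetic changes: a fixed dollar error in the assessment translates into the same fixed dollar error in the tax bill for every owner, which is part of why dollar-scale metrics like RMSE feel more defensible here. A sketch with hypothetical numbers:

```r
rate      <- 0.009   # illustrative flat rate, as before
exemption <- 50000   # hypothetical fixed homeowner exemption

# Tax owed: exemption comes off the assessed value first.
tax <- function(assessed) pmax(assessed - exemption, 0) * rate

truth    <- c(100000, 500000)
assessed <- truth + 20000     # constant $20k over-assessment

tax(assessed) - tax(truth)    # both owners overpay the same $180
```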
Another thing that might impact our gut reaction to what is actually fair is the change in the value from last year. So if I own a $500,000 house, and this assessment from the previous year was $200,000 under, and now my prediction is accurate, and I'm paying $2,000 or $3,000 more than I expected to, is that change in the amount that I'm paying unfair? And so we even need to think about what that model looked like last year if we want to train a fair model.
So I've tried to argue here that the metrics evaluate the model, but the model is one part of a larger system. And that larger system is just as important to us in aligning the way that we talk about evaluating machine learning models with our actual morally held beliefs.
Wrapping up
So I've tried to outline what I call the three hard parts of machine learning fairness. Much of the statistical software out there focuses on the second one: choosing among and supplying a bunch of different mathematical measures that we can drop into our existing machine learning systems, optimize the value of the metric, and then tell our stakeholders that our model is fair.
But I want to argue, or I've tried to argue to you today, that the other two parts of this process are just as important for actually aligning our beliefs with our machine learning systems. And I would encourage you to choose tools that support thinking about the hard parts.
If you're interested in checking out tidymodels, there's a book by Max Kuhn and Julia Silge called Tidy Modeling with R. That's a great place to get started with machine learning in R.
On the tidymodels.org website, we have all sorts of long-form documentation pieces. And recently, I wrote two long-form analyses demonstrating what it looks like to have fairness in mind while you're analyzing machine learning models.
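For a flavor of the tooling: recent versions of yardstick, the tidymodels metrics package, ship group-aware fairness metric constructors such as `demographic_parity()`. The sketch below is written from memory with invented column names, so treat it as an approximation and check the tidymodels.org articles for canonical usage.

```r
library(yardstick)
library(tibble)

# Toy classification results with a sensitive `group` column
# (data and column names are illustrative, not from the talk).
preds <- tibble(
  group    = factor(c("a", "a", "a", "b", "b", "b")),
  truth    = factor(c("yes", "no", "yes", "yes", "no", "no"),
                    levels = c("yes", "no")),
  estimate = factor(c("yes", "yes", "no", "no", "no", "no"),
                    levels = c("yes", "no"))
)

# demographic_parity() is a metric factory: given a grouping column, it
# returns a metric that compares detection rates across the groups.
dp <- demographic_parity(group)
dp(preds, truth = truth, estimate = estimate)
```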
Links to both of those websites, to those articles, to the source code for these slides, and to the works I've cited in this talk are at github.com/simonpcouch/cascadia-24. I'll give folks a second to write that down if they want to. I just want to say I'm very grateful to be here, and very grateful for the labor of the folks who brought us together. And I have some hex stickers if you want to find me. Thank you.

