
Simon Couch - Fair machine learning
In recent years, high-profile analyses have called attention to many contexts where the use of machine learning deepened inequities in our communities. After a year of research and design, the tidymodels team is excited to share a set of tools to help data scientists develop fair machine learning models and communicate about them effectively. This talk will introduce the research field of machine learning fairness and demonstrate a fairness-oriented analysis of a machine learning model with tidymodels. Talk by Simon Couch Slides: https://simonpcouch.github.io/conf-24 GitHub Repo: https://github.com/simonpcouch/conf-24
image: thumbnail.jpg
Transcript
This transcript was generated automatically and may contain errors.
I want to start this talk off with a question that you might find a little strange.
We'll start off hopefully on familiar territory on the xy-plane. On the x-axis we have the true value of some outcome variable that we're trying to model, and on the y-axis we have the predictions from a machine learning model. This machine learning model is situated in some sort of factory production context where we're trying to model the weight of a vase, and the behavior that we see is that for the lighter vases we tend to predict that those vases are heavier than they actually are, and for the heavier vases we tend to predict that those vases aren't as heavy as they actually are.
So the strange question that I have for you is, is this model fair? Maybe a yes if you think this is fair, no if you think this is unfair, or a shrug if you're like, what kind of question is that?
I'm seeing mostly shrugs in the crowd, and I agree with you. Like this is a bad model, but I don't really feel that this is unfair in any specific way. Okay, so I have a little magic trick now. I'm going to change the labels on the axis ticks, and I'm going to change the title and the subtitle, and we'll see if your answer to that question changes.
This machine learning model is actually assessing the sale price of homes, and that assessed value is used to determine the amount of property tax that a homeowner will be charged. And so for the less expensive homes, the model tends to predict that the home is more expensive than it actually is, and for the more expensive homes, the model thinks that the home is not as expensive as it actually is.
Is this fair? I agree, this is not fair. This is a statistical behavior that we call regressivity, where the predictions from a machine learning model tend towards the mean of the outcome variable. And in that first context, we felt that that statistical behavior of regressivity wasn't necessarily unfair. But we change the context around a model with the exact same parameters, making the same predictions on the same distribution of an outcome, and we feel that it's unfair.
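The regressive pattern described above can be sketched numerically. This is a toy illustration in Python (the talk's actual tooling is R/tidymodels): predictions are shrunk toward the outcome mean by an invented factor, which produces positive residuals for low true values and negative residuals for high ones, exactly the pattern on the slide.

```python
# Toy illustration of regressivity: predictions are pulled toward the
# mean of the outcome, so low values are overpredicted and high values
# are underpredicted. All numbers here are made up for illustration.

def regressive_predict(true_value, outcome_mean, shrinkage=0.5):
    """Shrink a prediction toward the outcome mean."""
    return outcome_mean + shrinkage * (true_value - outcome_mean)

true_values = [100_000, 200_000, 300_000, 400_000, 500_000]
outcome_mean = sum(true_values) / len(true_values)  # 300,000

# Residuals run from +100,000 for the cheapest home down to -100,000
# for the most expensive one: regressivity.
for true in true_values:
    pred = regressive_predict(true, outcome_mean)
    residual = pred - true
    print(f"true={true:>7,}  pred={pred:>9,.0f}  residual={residual:>+9,.0f}")
```

The `shrinkage` factor is a stand-in for whatever combination of model and data produces this behavior in practice; any value strictly between 0 and 1 yields the same qualitative pattern.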
And so this leads us to the first of three conclusions that I want to make in this talk: machine learning fairness isn't about mathematical measures. It's not about the newest metric you can get access to. Fairness is about our morally held beliefs. And because of that, the same model parameters can result in behavior that feels totally benign when situated in one context, and deeply unjust in another.
Background and motivation
In late 2022, a group of Posit employees across the organization formed a reading group to learn more about machine learning fairness as a research field, and also to think through what principled tooling for machine learning fairness looks like. A few months ago, I'm proud to say that we released some functionality that we feel helps people analyze models with fairness in mind. But I don't want to talk about that functionality today. I want to talk about three important things we learned in that reading group, things we would want you to keep in mind if you're also analyzing your machine learning models with respect to fairness. To do so, we'll use this case study of tax assessment.
This is one news story of many about the property tax system in Chicago. It reads: Cook County failed to value homes accurately for years, and the result is a property tax system that harmed the poor and helped the rich. There are many articles about this system in Chicago, but you can find similar articles about any major city in North America. And the New York Times even recently published a meta-analysis that presents similar graphs for cities across North America.
Defining fairness
So, the second hard part about machine learning fairness is defining what it actually is. And to demonstrate that, we'll return to this example that I showed you earlier.
So, I said earlier that this is a bad model, and I'll give you that. And so, maybe our first reflex to make this model act more fairly is to make it more performant by our usual metrics of model performance, like R-squared or root mean squared error.
Because those errors are highly correlated, we can apply calibration, and if we're lucky, we end up with predictions that look like this, where the errors are independently and identically distributed, meaning that their mean and variance are the same across the distribution of that true outcome variable. Okay, so maybe we feel that this machine learning model is more fair than the first one we saw. Just to check that those errors are iid, we can take a look at the residuals, and the distribution does seem similar across the true outcome variable.
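The residual check described above can be sketched as follows. This is a minimal Python illustration on synthetic data (the talk's actual analysis uses R/tidymodels): bin the residuals by the true outcome and compare their mean and spread across bins; roughly iid residuals should give similar statistics in every bin.

```python
# Sketch of an iid-residual check: bin residuals by the true outcome
# and compare mean and spread across bins. The data here is simulated.
import random

random.seed(0)
truths = [random.uniform(100_000, 500_000) for _ in range(1_000)]
residuals = [random.gauss(0, 20_000) for _ in truths]

def bin_stats(truths, residuals, n_bins=4):
    """Mean and standard deviation of residuals within equal-width
    bins of the true outcome."""
    lo, hi = min(truths), max(truths)
    width = (hi - lo) / n_bins
    bins = [[] for _ in range(n_bins)]
    for t, r in zip(truths, residuals):
        i = min(int((t - lo) / width), n_bins - 1)  # clamp the max value
        bins[i].append(r)
    stats = []
    for b in bins:
        mean = sum(b) / len(b)
        sd = (sum((r - mean) ** 2 for r in b) / len(b)) ** 0.5
        stats.append((mean, sd))
    return stats

# Similar means (near 0) and similar sds across bins suggest the
# residuals are roughly iid across the outcome distribution.
for i, (mean, sd) in enumerate(bin_stats(truths, residuals)):
    print(f"bin {i}: mean residual = {mean:>8.0f}, sd = {sd:>8.0f}")
```

A regressive model would instead show bin means drifting from positive to negative as the true outcome increases.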
But let's not forget what those units are on the y-axis. We're saying that the owner of a $100,000 home is going to be subject to residuals of similar absolute magnitude as the owner of a $500,000 home. So, they're just as likely to receive a residual of plus or minus $50,000, and that error to them is probably much more impactful than it is to somebody on the other end of that distribution.
So, to me, that maybe leads to a different proposal of what a fair machine learning model looks like in this context. What if the percentage of the error was similar across the distribution of true outcome values? So, we're talking about a 2% to 3% difference instead of a $10,000 to $20,000 difference. We're speaking in terms of percentages rather than absolute magnitude. If we're training a model where we can adjust the loss function that we're optimizing for, we might end up with a model that looks like this, where the absolute magnitude of the errors are larger for the more expensive homes, but the percentage error is similar across the distribution.
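The contrast between the two proposals above can be made concrete with a small sketch. This is Python with invented numbers, not the model from the talk: one hypothetical model makes a flat $50,000 error everywhere, the other a flat 10% error; only the second looks similar across the distribution once errors are expressed as percentages.

```python
# Sketch comparing the two fairness proposals: similar absolute error
# across homes vs. similar percentage error. All numbers are invented.

def percent_error(pred, true):
    """Signed error as a percentage of the true value."""
    return 100 * (pred - true) / true

homes = [100_000, 500_000]
for true in homes:
    # Hypothetical model A: a flat $50,000 overshoot everywhere.
    # Hypothetical model B: a flat 10% overshoot everywhere.
    print(f"${true:,}: flat-$50k error = "
          f"{percent_error(true + 50_000, true):.0f}%, "
          f"flat-10% error = {percent_error(true * 1.10, true):.0f}%")
```

Under model A, the $100,000 homeowner eats a 50% error while the $500,000 homeowner eats only 10%; under model B, both see 10%, at the cost of larger absolute errors on expensive homes.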
But if I'm a homeowner of, in this case, a $500,000 home, and suddenly I'm subjected to these $200,000 residuals that I didn't used to see, that feels pretty unfair to me. So, what we see here is that we start with these morally held beliefs about fairness, and we map them onto mathematical measures. But the fact that they're mathematical measures doesn't necessarily mean they're any more compatible with each other than our morally held beliefs are. So, defining fairness is hard. Those definitions of fairness are not mathematically or morally compatible in general.
The model is part of a larger system
A lot of software for machine learning fairness focuses on implementing as many fairness metrics as possible. What I want to emphasize here is that machine learning metrics or fairness metrics only allow us to evaluate the behavior of a model, but we need to think about the whole system that the model is situated inside of.
So, in this example that I was just talking about, how would these predictions of the sale price of a home actually be used? Under one tax system I'll propose, you just take that assessed value for the sale price of the home and multiply it by a fixed percentage, and that's the property tax that you pay. And to me, the most fair model that we've proposed so far under that system is one where we have a similar percentage error across the distribution.
Where I live in Chicago, we have what's called the homeowner exemption: if you live in the home that you own, you start with that assessed value and subtract off some constant number, and the result of that calculation is then multiplied by that fixed percentage; that's the property tax that you pay. So, if we think back to that graph, we're sort of mean-shifting the errors down, such that the owner of a $100,000 house is not as likely to experience that very positive residual, or the overvaluation of their home. If I am the owner of many $100,000 homes, I'm not allotted this exemption, but the errors are sort of averaged across the homes that I own, and so everybody ends up with a similar error on that end of the distribution.
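The two tax formulas just described can be sketched in a few lines. This is a Python illustration with placeholder numbers: the exemption amount and tax rate below are invented, not Cook County's actual figures.

```python
# Sketch of the two tax systems described in the talk. The exemption
# amount and tax rate are placeholders, not real Cook County figures.

def flat_tax(assessed_value, rate=0.02):
    """Flat system: tax is a fixed percentage of assessed value."""
    return assessed_value * rate

def exemption_tax(assessed_value, exemption=10_000, rate=0.02):
    """Homeowner exemption: subtract a constant, then apply the rate."""
    return max(assessed_value - exemption, 0) * rate

# The same $10,000 overvaluation of a $100,000 home costs less under
# the exemption system for an owner-occupant:
print(flat_tax(110_000))       # 2200.0
print(exemption_tax(110_000))  # 2000.0
```

Because the exemption is a constant subtracted before the rate is applied, it shifts every owner-occupant's effective tax down by the same dollar amount, which matters proportionally more at the cheap end of the distribution.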
So, in that case, a model that's trained with a usual performance metric like R-squared or root mean squared error feels fine to me. But even if I could predict the sale price to the dollar, if last year my valuation was half of what it is this year, even though this year it's right on the dot, is that change in the assessed value unfair in itself? Because I wasn't expecting a valuation that high. These are all things that, if we measured only with fairness metrics, we wouldn't be factoring in: other components of this machine learning system that trigger the gut reaction reflecting our morally held beliefs.
Metrics evaluate the model, but the model is one part of a larger system.
Choosing the right tools
So, those are the three big hard parts that I wanted to underscore in this talk. And the last thing that I'll leave you with is that I would encourage you to choose tools that support thinking about the hard parts. If you're a tidymodels user and you saw that there were only a couple of newly exported functions that came along with this release of fairness metrics in tidymodels, that's because we allotted a lot of energy to really good documentation that situates these metrics in contexts similar to the ones you might see in your organization.
And also because those tools are minimal but flexible and they encourage you to think about how that metric is situated inside of a whole system. So, if you're interested in learning more about how to use these metrics in the tidymodels, the first place I would point you to is these two applied articles that we put together on tidymodels.org.
One analyzes GPT detectors, looking at the difference in performance metrics between people who speak and write English natively and those who don't. The other analyzes a machine learning model that predicts whether a patient will be readmitted to a hospital after an inpatient stay, broken down by racial groupings. So, if you're interested in reading those articles, checking out the source code for this talk, or looking at the sources that I've cited throughout, you can see the GitHub repository: github.com/simonpcouch/conf-24.
While folks take a moment to write that down, I just want to say I'm very grateful to be here. And I would encourage you to track me down and ask for some hex stickers.
Q&A
Thank you, Simon. And a reminder to submit any questions via Slido. We're getting a couple in so far; expect a few more to trickle in. But first: the word fair implies a moral capability that doesn't exist in tools like machine learning; no more than any other tool does it have a moral faculty. You seem to equate inaccuracy with unfairness, so why not simply call it inaccuracy? Would overcharging higher-priced houses somehow be fair inaccuracy?
This is exactly the question, right? It is very, very difficult to map our morally held beliefs about what fairness is into mathematical measures. And so, like you're saying, it's important to foreground that difference between what our morally held beliefs are and what we're actually able to measure.
So given how difficult it is to define ML model fairness in general, why do you think so many people are interested in evaluating if models are fair? Is it about governance, like who's on the hook if it hurts a specific group of people, or something else?
I mean, I would hope that all of us don't want to cause differential harm to different groups of people through our work in machine learning. And the thought that there might be tools that allow us to adjust for that disparate harm is tempting.
Is it appropriate to use LLMs for tax assessments? Do they perform better than human tax assessors?
I'm not qualified to answer that question, I think.
So we'll maybe get a few more questions in, but I have one in the meantime. How would you maybe give advice to a data scientist who's trying to consider all the angles of fairness? I mean, obviously, domain expertise is a big piece of that, but are there any data explorations or other tips that you'd give them?
The structure of this talk, I think, aims to answer that question where we start with what our beliefs are. We start with who the stakeholders are in our problem and the potential ways that we might harm them. And that is what can help guide the analyses that we think through and the tools that we use to do so.
Both can be valuable tools. There's all sorts of machinery out there that aims not to just measure the degree of unfairness, but to somehow mitigate it. That's not currently part of the tool set that we've put together. But if you're interested in that being part of the tidymodels, definitely give us a holler on the GitHub repositories and we'd be interested to start that conversation.
Going rapid fire with you here. Do you evaluate fairness in both the training and testing data or just in the testing data?
You would evaluate fairness wherever you would otherwise evaluate a machine learning model. So when you're resampling, the assessment set allows you to choose the proposed model that is most fair. And then you can evaluate the final degree of unfairness, if you will, based on a given metric, using the test set.
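The advice above, treating fairness metrics like any other metric on held-out data, rests on the idea of a groupwise metric. Here is a language-agnostic Python sketch of that idea (the actual tidymodels implementation is in R, via yardstick); the metric, data, and group labels below are all invented for illustration.

```python
# Sketch of the "groupwise metric" idea behind fairness metrics:
# compute a performance metric within each group on a held-out set,
# then summarize the disparity as the largest difference across
# groups. Data and group labels here are invented.

def groupwise_disparity(records, metric):
    """Max difference of `metric` across groups.

    `records` is a list of (group, truth, estimate) tuples from a
    held-out set (an assessment set or the test set).
    """
    by_group = {}
    for group, truth, estimate in records:
        by_group.setdefault(group, []).append((truth, estimate))
    scores = {g: metric(pairs) for g, pairs in by_group.items()}
    return max(scores.values()) - min(scores.values())

def mean_abs_error(pairs):
    """A stand-in performance metric computed within one group."""
    return sum(abs(t - e) for t, e in pairs) / len(pairs)

held_out = [
    ("a", 100, 120), ("a", 200, 210),   # group a: MAE = 15
    ("b", 100, 160), ("b", 200, 230),   # group b: MAE = 45
]
print(groupwise_disparity(held_out, mean_abs_error))  # 30.0
```

During resampling you would compute this disparity on each assessment set to pick a candidate model, then report it once on the test set, the same workflow as any other metric.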
This one I think follows along in that lines. But what about fairness and data annotation as they become larger sources of truth for models?
Okay, yeah. So data annotation would be like the act of seeing some complexity in the world and putting a label on it. And I think that is the root of many evils in terms of our ability to measure the behavior of models, where there are many different ways to choose to group people. And those are very much impactful for our understanding of the world and our understanding of how machine learning models simplify our understanding.
That's an interesting question. I mean, the first thing that comes to mind is like, if a more interpretable model is one that allows you to communicate more effectively to stakeholders about the implications for fairness, then maybe a model that isn't necessarily the optimum with respect to a given fairness metric could still be the most fair for that situation.
I'll add on to that one. Are there any tools in some more complex models that might help to open up that black box and explore levels of fairness or any sort of unexpected relationships that may be occurring in the data?
Does it make sense to look at fairness drift just as we do accuracy?
How would you recommend doing that?
So there's the vetiver set of tools for MLOps, which the tidymodels team does their best to integrate with. Fairness metrics in tidymodels are metrics just like any other; we did our best to make them walk and talk like any metric from yardstick. Maybe I shouldn't say this out loud, but if it works for a yardstick metric, then it works for a fairness metric. And any tools in vetiver that can measure data drift and metric drift can also do so for fairness metrics.

