Fairness in artificial intelligence models: how to mitigate discrimination in the presence of multiple sensitive attributes?

Suppose we have a machine learning model, 𝑓, that predicts an insurance premium, Y, from data that includes a sensitive attribute, such as gender. Discrimination may occur due to statistical bias (past injustices or sample imbalance), a correlation between the sensitive attribute and some explanatory variable, or intentional bias.

To avoid this bias, legislation has been introduced (such as the European AI Act, 2024) that limits or even prohibits the use of certain sensitive attributes in artificial intelligence models. However, simply removing these attributes is not always the solution that yields the best level of fairness or the best model performance. There are preprocessing approaches (which modify the input data), in-processing approaches (which add a fairness penalty to the training objective), and postprocessing approaches (which modify the univariate distribution of predictions to create an intermediate distribution, as done in Sequential Fairness).

There have been several postprocessing approaches to mitigate these effects when a model has a single sensitive attribute (SSA). But what can we do when there are multiple sensitive attributes (MSA)? One possible approach is to consider the intersection of the distributions created by each combination of the sensitive attributes. For example, if the sensitive attributes are gender (female, male) and ethnicity (black, white), the SSA approach would be applied to four cases: (female, black), (female, white), (male, black), and (male, white).
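To see why this intersectional strategy scales poorly, consider how the number of subgroups grows as attributes are added. The attribute names and values below are hypothetical, chosen only to illustrate the multiplicative growth:

```python
from itertools import product

# Hypothetical sensitive attributes; "age_band" is added to show
# how one extra attribute multiplies the number of subgroups.
attributes = {
    "gender": ["female", "male"],
    "ethnicity": ["black", "white"],
    "age_band": ["<30", "30-60", ">60"],
}

# Every intersectional subgroup is one cell of the Cartesian product.
cells = list(product(*attributes.values()))
print(len(cells))  # 2 * 2 * 3 = 12 subgroups to balance jointly
```

With only two binary attributes there are 4 cells, but each new attribute multiplies the count, and every cell needs enough observations to estimate its prediction distribution reliably.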

This can be computationally expensive as the number of sensitive attributes increases. Additionally, when a new sensitive attribute is added, the previous work is lost, because new distributions must be estimated for the new combinations. Another approach (the focus of this blog) is Sequential Fairness. In summary, this approach modifies the model's predictions to be fair for the first sensitive attribute, then modifies these new predictions so that they are also fair for the second attribute (without undoing fairness for the first), and so on. The benefits of this approach are that it is commutative (the order in which the attributes are processed does not matter), it is easy to add new sensitive attributes, and it improves interpretability.
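To make the sequential idea concrete, here is a minimal from-scratch sketch. It is not EquiPy's API: the simulated data, the attribute names, and the `repair` helper are all illustrative assumptions. Each `repair` step pushes every group's predictions onto a common distribution by averaging empirical quantile functions, and the steps are chained one attribute at a time:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4000
gender = rng.integers(0, 2, n)
ethnicity = rng.integers(0, 2, n)
# Hypothetical biased predictions: both attributes shift the score.
y = rng.normal(50, 5, n) + 8 * gender + 4 * ethnicity

def repair(y, group):
    """One single-attribute (SSA) repair step: send each prediction to
    its rank within its own group, then map that rank through the
    weighted average of the per-group quantile functions."""
    groups, counts = np.unique(group, return_counts=True)
    weights = counts / len(y)  # group proportions p_a
    y_fair = np.empty_like(y, dtype=float)
    for g in groups:
        mask = group == g
        # Empirical within-group rank u = F_g(y), values in (0, 1].
        ranks = np.searchsorted(np.sort(y[mask]), y[mask],
                                side="right") / mask.sum()
        # Barycenter quantile: weighted average over all groups.
        y_fair[mask] = sum(w * np.quantile(y[group == h], ranks)
                           for h, w in zip(groups, weights))
    return y_fair

# Sequential Fairness: repair for gender first, then for ethnicity.
y_seq = repair(repair(y, gender), ethnicity)
gap_gender = abs(y_seq[gender == 0].mean() - y_seq[gender == 1].mean())
gap_eth = abs(y_seq[ethnicity == 0].mean() - y_seq[ethnicity == 1].mean())
print(gap_gender, gap_eth)  # both close to 0 after the two steps
```

After both steps, the average prediction gap across gender and across ethnicity essentially vanishes, while each individual keeps a prediction close to its original within-group rank. Adding a third attribute would simply mean chaining one more `repair` call.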

The idea is to find a representative distribution that lies between the conditional distributions of the predictions given each sensitive attribute. This is achieved with the Wasserstein barycenter, which minimizes the total cost of moving one distribution onto another via optimal transport. The Wasserstein barycenter extends the idea of Strong Demographic Parity, which seeks to reduce inequity between groups and requires that a model's predictions be independent of the sensitive attributes, to the setting of multiple attributes.
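Concretely, for a single sensitive attribute the barycenter repair can be written as follows (the notation is assumed here for exposition, not taken from the post):

```latex
% F_a      : CDF of the predictions within sensitive group a
% F_a^{-1} : its quantile function
% p_a      : proportion of the population in group a
y^{\mathrm{fair}} \;=\; \sum_{a'} p_{a'}\, F_{a'}^{-1}\!\bigl(F_a(y)\bigr)
\qquad \text{for an individual in group } a .
```

Each prediction is first sent to its rank within its own group, F_a(y), and that rank is then mapped through the weighted average of all the group quantile functions. The resulting distribution is the same for every group, which is exactly the Strong Demographic Parity requirement.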

It is important to note that methods for reducing unfairness in predictive models come at a cost in performance. However, because it uses the Wasserstein barycenter, this approach guarantees that metrics such as accuracy and MSE are degraded as little as possible.

EquiPy is a Python package that implements Sequential Fairness for continuous prediction models with multiple sensitive attributes. It uses the Wasserstein barycenter to minimize the impact on model performance while mitigating the bias and discrimination that sensitive attributes may introduce into predictions.

* This blog is based on the presentation made during the Quantil seminar on August 8, 2024, by Agathe Fernandes Machado titled "EquiPy: A Python package for Sequential Fairness using Optimal Transport with Applications in Insurance", where she shared insights about the research conducted by her and her team at the Université du Québec à Montréal (UQAM) to develop a Python package that implements sequential fairness to mitigate injustices in the presence of multiple sensitive attributes.