SESGO: A Critical Look at AI Biases in Spanish

© Photo by Google DeepMind

In recent years, language models have transformed the way we interact with information. From virtual assistants to decision-support systems, these tools have become omnipresent. However, most of their evaluations and strategies for controlling negative impacts are concentrated in English and, specifically, in North American sociocultural contexts. This effectively overlooks the nuances, stereotypes, and realities of Spanish speakers, particularly in Latin America. Our team decided to address this gap and designed an evaluation framework that responds to this need: SESGO, which stands for Spanish Evaluation of Stereotypical Generative Outputs.
This is how SESGO was born from a simple yet urgent conviction: we cannot assess the fairness of artificial intelligence solely from an Anglo-Saxon perspective. Spanish-speaking communities have expressions, sayings, and forms of communication rich in cultural meaning. When these elements are misinterpreted by models, they can reinforce harmful stereotypes or render social issues invisible. Our framework aims precisely to test how models respond to such scenarios, built from linguistic and cultural contexts specific to the region.

Figure 1. Example from our dataset on racism illustrating the evaluation framework. The figure shows a popular saying associated with a stereotype (laziness), presented in both an ambiguous and a disambiguated context. Each context is accompanied by negative and non-negative questions, designed to prompt responses that select the target group (Black player), the other group (White player), or indicate “Unknown” when the information is insufficient.

The functioning of SESGO is based on a fundamental principle: biases are most clearly revealed in ambiguous situations. That’s why we designed two types of scenarios. In the first, called ambiguous, incomplete statements or insufficient information are presented. Here, the model is forced to “fill in the gaps,” which is when bias most easily becomes evident—either by repeating a stereotype or favoring a specific social group. In the second, called disambiguated, complete and clear information is provided, which should lead the model to respond accurately, leaving no room for prejudice. Comparing performance across both scenarios allowed us to closely observe where and how the models fail.

To measure these results, it’s not enough to simply check whether the model responds correctly or incorrectly. SESGO proposes a metric that combines accuracy with error direction. In other words, we not only evaluate how often the system gets it right, but also whom the error is directed against. A model might make mistakes by chance, but if it does so systematically against a historically marginalized group, that error ceases to be random and becomes a reflection of structural bias. With this idea in mind, we developed the bias score, an indicator that measures—through a mathematical distance (Euclidean)—how far the model is from ideal behavior: one in which responses are accurate and, when errors occur, they are distributed equitably.

Figure 2. Triangular constraint region illustrating the relationship between accuracy and the F metrics. The shaded area represents all possible combinations of F(Target) and F(Other) values for a given level of accuracy. The red dot indicates the ideal model, with accuracy = 1 and F(Target) = F(Other), representing perfect accuracy with balanced errors.

With this strategy, we were able to show that models tend to exhibit greater biases in Spanish when facing ambiguous scenarios—much more than in English. This means that, when interacting in our language, the risks of reinforcing stereotypes or prejudices are higher. In disambiguated contexts, the situation improves, but notable differences between models still persist. Some, such as GPT-4o mini and Gemini 2.0 Flash, show better performance, while others continue to fall into more pronounced patterns of bias. A striking finding was that temperature—a technical parameter that controls response variability—does not have a significant impact on bias levels. This suggests that the issue lies not in how responses are generated, but in the data and training processes that shape the model.

Figures 3 (left) and 4 (right). The two figures—evaluation of ambiguous scenarios (Fig. 3) and disambiguated scenarios (Fig. 4)—show how the performance of language models compares in terms of accuracy (the proportion of correct “Unknown” responses in incomplete contexts or correct responses when clear context is provided) and bias alignment (toward the target group or the other group), with bias scores calculated according to the proposed metric. Both graphs plot accuracy on the vertical axis and bias direction on the horizontal axis, with the values in parentheses indicating the bias score obtained using the equation employed in the study.

What SESGO brings to the table is a clear message: evaluating artificial intelligence systems in Spanish cannot be a mere translation of existing tests in English. We need analytical frameworks designed from and for our own cultural realities. Only then can we identify specific risks, propose meaningful improvements, and demand that models do not replicate forms of discrimination rooted in our society.
In conclusion, SESGO is not just a methodological contribution, but an invitation to think about artificial intelligence from a truly inclusive perspective. Technological equity requires recognizing linguistic and cultural diversity, and our research seeks to take that very step. Evaluating AI in Spanish, with its own criteria, is essential for building fairer, more reliable, and more representative systems for those of us who use them every day.

Click Here to access the full document.
Click Here to view the presentation at the Applied Mathematics Seminar.

Recent articles

In the Blog articles, you will find the latest news, publications, studies and articles of current interest.

AI Governance

Beyond Automation: Why We Need New Metrics to Understand the Future of Work with AI

In recent years, the conversation about artificial intelligence and employment has been dominated by a substitution narrative: Which jobs will disappear? How many jobs will be replaced by algorithms? While this question is important, it has led us to view the future of work from a narrow perspective…

IA

AI for the Common Good: Capabilities, Power, and Participation

How should we understand the concept of developing Artificial Intelligence for the common good? This is a key question, which, according to philosopher Diana Acosta Navas, opens up two central dimensions: one philosophical and the other political…

IA

SESGO: A Critical Look at AI Biases in Spanish

In recent years, language models have transformed the way we interact with information. From virtual assistants to decision-support systems, these tools have become omnipresent…

Algorithmic Justice

Justice in Artificial Intelligence Models: A New Perspective Based on Algorithm Redesign

In recent years, artificial intelligence models have demonstrated incredible potential to transform industries, from healthcare to finance. However, they have also exposed a troubling issue: algorithmic bias.

Machine Learning

Robust Inference and Uncertainty Quantification for Data-Driven Decision Making

Machine learning models have become essential tools for decision-making in critical sectors such as healthcare, public policy, and finance. However, their practical application faces two major challenges: selection bias in the data and the proper quantification of uncertainty.

Neural Networks

The Potential Impact of Machine Learning on Public Policy Design in Colombia: A Decade of Experiences

This blog is an extended summary of the article Riascos, A. (2025). Since the beginning of the so-called third wave of neural networks (Goodfellow et al., (2016)) in the first decade of this century, there has been great hope in the possibilities of artificial intelligence to transform all human activities. At the same time, warnings have been raised about the risks involved in the introduction of this new technology (Bengio et al., (2024)).