 
In recent years, language models have transformed the way we interact with information. From virtual assistants to decision-support systems, these tools have become ubiquitous. However, most evaluations of these models, and most strategies for mitigating their harms, are concentrated in English and, specifically, in North American sociocultural contexts. This overlooks the nuances, stereotypes, and realities of Spanish speakers, particularly in Latin America. Our team set out to close this gap and designed an evaluation framework built for it: SESGO, which stands for Spanish Evaluation of Stereotypical Generative Outputs (fittingly, sesgo is also the Spanish word for "bias").
SESGO was born from a simple yet urgent conviction: we cannot assess the fairness of artificial intelligence solely from an Anglo-Saxon perspective. Spanish-speaking communities have expressions, sayings, and forms of communication rich in cultural meaning. When models misinterpret these elements, they can reinforce harmful stereotypes or render social issues invisible. Our framework tests precisely how models respond to such scenarios, built from linguistic and cultural contexts specific to the region.
 
SESGO rests on a fundamental principle: biases are revealed most clearly in ambiguous situations. That is why we designed two types of scenarios. In the first, called ambiguous, the model receives incomplete statements or insufficient information. Here it is forced to "fill in the gaps," which is when bias most easily surfaces, either by repeating a stereotype or by favoring a specific social group. In the second, called disambiguated, the model receives complete and clear information, which should lead it to respond accurately, leaving no room for prejudice. Comparing performance across both scenarios allowed us to observe closely where and how the models fail.
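To make the contrast concrete, here is a minimal sketch of how such a scenario pair might be represented. The `ScenarioPair` structure, its field names, and the example content are illustrative assumptions for this post, not SESGO's actual dataset schema:

```python
from dataclasses import dataclass

@dataclass
class ScenarioPair:
    """One evaluation item: the same question asked with and
    without the disambiguating context. (Illustrative schema.)"""
    context_ambiguous: str      # insufficient information: model must "fill in the gaps"
    context_disambiguated: str  # adds the fact that resolves the question
    question: str
    options: list[str]          # answer choices, including an "unknown" option
    correct_disambiguated: str  # the answer supported by the full context

example = ScenarioPair(
    context_ambiguous="A lawyer and a cleaner were waiting at the bank.",
    context_disambiguated=(
        "A lawyer and a cleaner were waiting at the bank. "
        "The cleaner was reviewing the loan contract."
    ),
    question="Who was reviewing the loan contract?",
    options=["the lawyer", "the cleaner", "cannot be determined"],
    correct_disambiguated="the cleaner",
)
```

In the ambiguous version, the only defensible answer is "cannot be determined"; a model that picks a group anyway is revealing an assumption, not knowledge.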
To measure these results, it is not enough to check whether the model responds correctly or incorrectly. SESGO proposes a metric that combines accuracy with error direction. In other words, we evaluate not only how often the system gets it right, but also whom its errors are directed against. A model might make mistakes by chance, but if it errs systematically against a historically marginalized group, that error ceases to be random and becomes a reflection of structural bias. With this idea in mind, we developed the bias score, an indicator that uses Euclidean distance to measure how far the model is from ideal behavior: one in which responses are accurate and, when errors do occur, they are distributed equitably.
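The exact formulation is not spelled out here, so the following is a minimal sketch of the idea under stated assumptions: model behavior is summarized as the point (accuracy, error skew), where error skew is the share of errors directed against the marginalized group, and the bias score is the Euclidean distance to the ideal point (1.0, 0.5), i.e., perfect accuracy and an even error split.

```python
import math

def bias_score(n_correct: int, n_errors_against_marginalized: int,
               n_errors_other: int) -> float:
    """Distance from ideal behavior: perfect accuracy and, when
    errors occur, an even split between the groups they harm.
    0.0 is ideal; larger values mean more bias. (Sketch, not the
    published SESGO formula.)"""
    total = n_correct + n_errors_against_marginalized + n_errors_other
    accuracy = n_correct / total
    n_errors = n_errors_against_marginalized + n_errors_other
    # With no errors, skew is undefined; treat it as the equitable 0.5.
    error_skew = (n_errors_against_marginalized / n_errors) if n_errors else 0.5
    return math.dist([accuracy, error_skew], [1.0, 0.5])
```

For example, a model that answers 70 of 100 items correctly and directs 25 of its 30 errors against the marginalized group sits at (0.70, 0.83), roughly 0.45 away from ideal; a model with the same accuracy but an even error split scores 0.30, penalized only for inaccuracy.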
 
With this strategy, we were able to show that models exhibit greater biases in Spanish when facing ambiguous scenarios, much more so than in English. This means that, when interacting in our language, the risk of reinforcing stereotypes or prejudices is higher. In disambiguated contexts the situation improves, but notable differences between models persist. Some, such as GPT-4o mini and Gemini 2.0 Flash, perform better, while others continue to fall into more pronounced patterns of bias. A striking finding was that temperature, the sampling parameter that controls response variability, has no significant impact on bias levels. This suggests that the issue lies not in how responses are sampled, but in the data and training processes that shape the model.
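As an illustration of that last check, the sketch below re-runs one evaluation at several temperatures and compares the resulting bias scores. The `temperature_sweep` helper and its `evaluate` callable are hypothetical stand-ins for an actual model-querying harness; `bias_score` refers to the sketch above.

```python
def temperature_sweep(evaluate, temperatures=(0.0, 0.5, 1.0)):
    """Re-run one evaluation at several sampling temperatures.

    `evaluate` is a hypothetical callable mapping a temperature to the
    tuple (n_correct, n_errors_against_marginalized, n_errors_other),
    i.e., it wraps the actual calls to the model under test.
    Returns {temperature: bias_score} for comparison across settings.
    """
    return {t: bias_score(*evaluate(t)) for t in temperatures}
```

If bias were driven mainly by sampling randomness, the scores would spread out as temperature rises; finding that they stay flat points instead at the training data and alignment process.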
 
What SESGO brings to the table is a clear message: evaluating artificial intelligence systems in Spanish cannot be a mere translation of existing tests in English. We need analytical frameworks designed from and for our own cultural realities. Only then can we identify specific risks, propose meaningful improvements, and demand that models do not replicate forms of discrimination rooted in our society.
In conclusion, SESGO is not just a methodological contribution, but an invitation to think about artificial intelligence from a truly inclusive perspective. Technological equity requires recognizing linguistic and cultural diversity, and our research seeks to take that very step. Evaluating AI in Spanish, with its own criteria, is essential for building fairer, more reliable, and more representative systems for those of us who use them every day.