 
In recent years, language models have transformed the way we interact with information. From virtual assistants to decision-support systems, these tools have become ubiquitous. However, most evaluations of these models, and most strategies for mitigating their harms, are concentrated in English and, specifically, in North American sociocultural contexts. This overlooks the nuances, stereotypes, and realities of Spanish speakers, particularly in Latin America. Our team set out to close this gap and designed an evaluation framework built for it: SESGO, which stands for Spanish Evaluation of Stereotypical Generative Outputs (fittingly, sesgo is also the Spanish word for "bias").
SESGO was born from a simple yet urgent conviction: we cannot assess the fairness of artificial intelligence solely from an Anglo-Saxon perspective. Spanish-speaking communities have expressions, sayings, and forms of communication rich in cultural meaning. When models misinterpret these elements, they can reinforce harmful stereotypes or render social issues invisible. Our framework tests precisely how models respond to such scenarios, built from linguistic and cultural contexts specific to the region.
 
SESGO rests on a fundamental principle: biases are revealed most clearly in ambiguous situations. That is why we designed two types of scenarios. In the first, called ambiguous, the model receives incomplete statements or insufficient information. Here it is forced to "fill in the gaps," which is when bias most easily surfaces, either by repeating a stereotype or by favoring a specific social group. In the second, called disambiguated, the model receives complete and clear information, which should lead it to respond accurately, leaving no room for prejudice. Comparing performance across both scenarios allowed us to observe closely where and how the models fail.
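To make the contrast concrete, here is a minimal sketch of how such a scenario pair might be represented. The `ScenarioPair` structure, its field names, and the example content are illustrative assumptions for this post, not SESGO's actual dataset schema:

```python
from dataclasses import dataclass

@dataclass
class ScenarioPair:
    """One evaluation item: the same question asked with and
    without the disambiguating context. (Illustrative schema.)"""
    context_ambiguous: str      # insufficient information: model must "fill in the gaps"
    context_disambiguated: str  # adds the fact that resolves the question
    question: str
    options: list[str]          # answer choices, including an "unknown" option
    correct_disambiguated: str  # the answer supported by the full context

example = ScenarioPair(
    context_ambiguous="A lawyer and a cleaner were waiting at the bank.",
    context_disambiguated=(
        "A lawyer and a cleaner were waiting at the bank. "
        "The cleaner was reviewing the loan contract."
    ),
    question="Who was reviewing the loan contract?",
    options=["the lawyer", "the cleaner", "cannot be determined"],
    correct_disambiguated="the cleaner",
)
```

In the ambiguous version, the only defensible answer is "cannot be determined"; a model that picks a group anyway is revealing an assumption, not knowledge.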
To measure these results, it is not enough to check whether the model responds correctly or incorrectly. SESGO proposes a metric that combines accuracy with error direction. In other words, we evaluate not only how often the system gets it right, but also whom its errors are directed against. A model might make mistakes by chance, but if it errs systematically against a historically marginalized group, that error ceases to be random and becomes a reflection of structural bias. With this idea in mind, we developed the bias score, an indicator that uses Euclidean distance to measure how far the model is from ideal behavior: one in which responses are accurate and, when errors do occur, they are distributed equitably.
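The exact formulation is not spelled out here, so the following is a minimal sketch of the idea under stated assumptions: model behavior is summarized as the point (accuracy, error skew), where error skew is the share of errors directed against the marginalized group, and the bias score is the Euclidean distance to the ideal point (1.0, 0.5), i.e., perfect accuracy and an even error split.

```python
import math

def bias_score(n_correct: int, n_errors_against_marginalized: int,
               n_errors_other: int) -> float:
    """Distance from ideal behavior: perfect accuracy and, when
    errors occur, an even split between the groups they harm.
    0.0 is ideal; larger values mean more bias. (Sketch, not the
    published SESGO formula.)"""
    total = n_correct + n_errors_against_marginalized + n_errors_other
    accuracy = n_correct / total
    n_errors = n_errors_against_marginalized + n_errors_other
    # With no errors, skew is undefined; treat it as the equitable 0.5.
    error_skew = (n_errors_against_marginalized / n_errors) if n_errors else 0.5
    return math.dist([accuracy, error_skew], [1.0, 0.5])
```

For example, a model that answers 70 of 100 items correctly and directs 25 of its 30 errors against the marginalized group sits at (0.70, 0.83), roughly 0.45 away from ideal; a model with the same accuracy but an even error split scores 0.30, penalized only for inaccuracy.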
 
With this strategy, we were able to show that models exhibit greater biases in Spanish when facing ambiguous scenarios, much more so than in English. This means that, when interacting in our language, the risk of reinforcing stereotypes or prejudices is higher. In disambiguated contexts the situation improves, but notable differences between models persist. Some, such as GPT-4o mini and Gemini 2.0 Flash, perform better, while others continue to fall into more pronounced patterns of bias. A striking finding was that temperature, the sampling parameter that controls response variability, has no significant impact on bias levels. This suggests that the issue lies not in how responses are sampled, but in the data and training processes that shape the model.
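As an illustration of that last check, the sketch below re-runs one evaluation at several temperatures and compares the resulting bias scores. The `temperature_sweep` helper and its `evaluate` callable are hypothetical stand-ins for an actual model-querying harness; `bias_score` refers to the sketch above.

```python
def temperature_sweep(evaluate, temperatures=(0.0, 0.5, 1.0)):
    """Re-run one evaluation at several sampling temperatures.

    `evaluate` is a hypothetical callable mapping a temperature to the
    tuple (n_correct, n_errors_against_marginalized, n_errors_other),
    i.e., it wraps the actual calls to the model under test.
    Returns {temperature: bias_score} for comparison across settings.
    """
    return {t: bias_score(*evaluate(t)) for t in temperatures}
```

If bias were driven mainly by sampling randomness, the scores would spread out as temperature rises; finding that they stay flat points instead at the training data and alignment process.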
 
What SESGO brings to the table is a clear message: evaluating artificial intelligence systems in Spanish cannot be a mere translation of existing tests in English. We need analytical frameworks designed from and for our own cultural realities. Only then can we identify specific risks, propose meaningful improvements, and demand that models do not replicate forms of discrimination rooted in our society.
In conclusion, SESGO is not just a methodological contribution, but an invitation to think about artificial intelligence from a truly inclusive perspective. Technological equity requires recognizing linguistic and cultural diversity, and our research seeks to take that very step. Evaluating AI in Spanish, with its own criteria, is essential for building fairer, more reliable, and more representative systems for those of us who use them every day.