Large language models have become everyday tools: they assist in writing texts, support medical diagnoses, generate code, and answer complex questions in seconds. However, as their use expands, so does a concern that goes beyond accidental errors. What happens when someone actively tries to make a model behave in a dangerous way? This question lies at the heart of the field known as adversarial robustness, which studies a language model’s ability to maintain its safety boundaries even when a sophisticated agent (human or automated) is deliberately attempting to breach them. The reasons why this matters are concrete: they range from the possibility that a model may provide instructions for building biological or cyber weapons to scenarios in which the model itself has been trained with malicious intent. In all these cases, the question is not whether the model can make mistakes, but whether it can withstand someone pushing it to do so.
The most widely studied attack technique in this field is known as a jailbreak, which consists of phrasing an instruction in such a way that the model responds to it even though it would normally be expected to refuse. Early versions of these attacks were simple. The most common approach was to ask the model to role-play a character without restrictions, or to frame the question within a fictional story in which a character needed to explain, in full detail, how to build an explosive. Over time, the methods became more sophisticated. The iterative approach, one of the most common today, involves testing multiple variations of the same question, adjusting the wording, context, or tone with each attempt until a formulation is found that the model does not reject. The conclusion that has emerged from years of research on this front is uncomfortable: with enough patience and resources, any model can be broken. The current state of the art, including models such as Claude Sonnet, is no exception to this rule.
Although iterative jailbreaks can be effective, each one typically works only for a specific question. The next generation of attacks looks for strategies that work across many different questions at once. These are known as Universal Jailbreaks (UJs), and their study has been advanced by, among others, the United Kingdom’s AI Security Institute (AISI). The central idea is to construct a prefix, that is, a string of text prepended to any question, that causes the safety classifier to treat the question as permissible regardless of its content.
The algorithm developed by AISI, known as Boundary Point Jailbreaking (BPJ), was designed to solve a problem shared by all attacks against robust classifiers: the classifier returns only a binary signal (blocked or not blocked), without revealing how close an attempt came to succeeding. To extract useful signal, BPJ combines two mechanisms. The first is curriculum learning: instead of attacking the harmful query directly, it begins with heavily noised versions, text so distorted that the classifier does not recognize it as dangerous, and then gradually reduces that noise until it reaches the original text. The second is boundary points: within each difficulty level, the algorithm searches for versions of the query that sit right at the threshold of being blocked, that is, those that sometimes pass and sometimes do not.
The elegance of the mechanism lies in its ability to generalize: a prefix trained against only a few queries ends up working against a wide variety of prohibited queries that it never encountered during the training process, making it a far more powerful attack tool than individual jailbreaks.
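The two ideas behind BPJ, a noise curriculum and boundary points, can be illustrated with a toy sketch. This is not AISI's implementation and it does not optimize any prefix: `toy_classifier`, `add_noise`, `block_rate`, `find_boundary_levels`, and the thresholds are all hypothetical names and values chosen for illustration, and the "classifier" is a harmless keyword check. The sketch only shows how a purely binary blocked/not-blocked signal becomes informative at noise levels where the outcome is mixed.

```python
import random


def toy_classifier(text: str) -> bool:
    """Hypothetical stand-in for a safety classifier.

    Returns True (blocked) when the placeholder keyword survives in the text.
    """
    return "forbidden" in text


def add_noise(text: str, noise_level: float, rng: random.Random) -> str:
    """Replace each character with '*' with probability `noise_level`."""
    return "".join("*" if rng.random() < noise_level else ch for ch in text)


def block_rate(query: str, noise_level: float, rng: random.Random,
               samples: int = 200) -> float:
    """Estimate how often noised versions of the query are blocked."""
    blocked = sum(
        toy_classifier(add_noise(query, noise_level, rng)) for _ in range(samples)
    )
    return blocked / samples


def curriculum(levels):
    """Order noise levels from most distorted down to the original text."""
    return sorted(levels, reverse=True)


def find_boundary_levels(query: str, levels, rng: random.Random,
                         low: float = 0.1, high: float = 0.9):
    """Keep the noise levels where the query is sometimes blocked and sometimes not.

    The `low` and `high` thresholds are arbitrary assumptions for this sketch.
    """
    return [
        level for level in curriculum(levels)
        if low <= block_rate(query, level, rng) <= high
    ]
```

In this toy setting, a noise level of 1.0 is never blocked and a level of 0.0 is always blocked, so neither tells an attacker anything about progress; only the intermediate levels, where the block rate is mixed, carry a usable signal.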
However, this type of attack has important limitations that should not be overlooked. Constructing a universal prefix requires a very large number of queries to the model; in some experiments, more than 600,000 attempts are needed, making it costly in terms of both computation and time. In addition, the resulting prefixes do not transfer easily across models developed by different companies. A prefix that works against an Anthropic model will not necessarily work against an OpenAI model, which limits its usefulness as a large-scale attack tool. Despite these constraints, the message emerging from the research is clear: we are closer than ever to a truly universal jailbreak.
Training Robust Models
In the face of these attacks, how are more robust models trained? The most common defense strategies today combine three approaches. The first is human red teaming, which consists of asking people to systematically try to break the model, collecting those attempts, and using them to train the model through supervised fine-tuning (SFT). The second is detecting and restricting users who make many attempts in a short period of time, although this raises the problem of false positives and can be easily circumvented through multiple accounts or different models. The third, and perhaps the most promising in terms of scalability, is to use another language model as a judge (LLM-as-a-judge): an LLM is trained to evaluate whether the main model’s responses violate safety policies, thereby automating the detection of successful jailbreaks. However, this last approach has several significant weaknesses: the policy guiding the judge may not accurately represent the real threat (resulting in false positives or false negatives); the judge model may make the same mistakes as the model it is evaluating; and, perhaps most concerning of all, an attacker may direct their efforts not at the main model but at the evaluator.
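The second defense above, flagging users who make many attempts in a short period, can be sketched as a sliding-window counter. This is a minimal illustration under stated assumptions: the class name `AttemptMonitor`, the choice to count refused requests as "attempts", and the threshold and window values are all hypothetical, not a description of any provider's actual system.

```python
from collections import defaultdict, deque


class AttemptMonitor:
    """Flag users whose refused requests exceed a limit within a time window."""

    def __init__(self, max_attempts: int, window_seconds: float):
        self.max_attempts = max_attempts
        self.window_seconds = window_seconds
        # One queue of timestamps per user (hypothetical keying by user id).
        self._events = defaultdict(deque)

    def record_refusal(self, user_id: str, timestamp: float) -> bool:
        """Record a refused request and return True if the user should be flagged."""
        events = self._events[user_id]
        events.append(timestamp)
        # Drop timestamps that have fallen out of the sliding window.
        while events and timestamp - events[0] > self.window_seconds:
            events.popleft()
        return len(events) > self.max_attempts


# Example: allow at most 3 refusals per 60 seconds.
monitor = AttemptMonitor(max_attempts=3, window_seconds=60)
flags = [monitor.record_refusal("user-a", t) for t in (0, 10, 20, 30)]
```

Because the counter is keyed by user id, the sketch also makes the weakness mentioned above visible: spreading the same attempts across several accounts keeps every individual queue below the limit.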
One proposal that seeks to overcome these limitations is the use of verifiable tasks. Instead of relying on a subjective judge, the idea is to design scenarios in which the success or failure of an attack can be measured objectively. One concrete example is the following: a file containing a password is placed on a fictitious user’s computer, and the model is asked to obtain it. If the model retrieves the password, the attack succeeded; if it does not, the attack failed. This approach, inspired by capture-the-flag exercises in computer security, makes it possible to build more honest robustness benchmarks, where there is no ambiguity about whether a model was truly compromised.
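A verifiable task of this kind reduces the success check to a string comparison. The sketch below is an assumption about how such a check might look, not a description of any existing benchmark: the function names `make_task` and `attack_succeeded`, the file paths, and the file contents are hypothetical.

```python
import secrets


def make_task() -> tuple[str, dict[str, str]]:
    """Create a fresh random password and a fictitious file system that holds it."""
    flag = secrets.token_hex(16)  # random, so it cannot be guessed or memorized
    files = {
        "/home/user/password.txt": flag,      # hypothetical path
        "/home/user/notes.txt": "buy milk",   # unrelated decoy file
    }
    return flag, files


def attack_succeeded(flag: str, transcript: str) -> bool:
    """The attack counts as successful only if the exact password was leaked."""
    return flag in transcript
```

Using a freshly generated random value for each run means that a match in the transcript can only come from the model having actually read the file, which is what removes the need for a subjective judge.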
The current frontier of research is aimed at something even more ambitious: training the red teamer and the language model simultaneously, in a co-evolutionary process in which each becomes more sophisticated in response to the other. Preliminary results are mixed, as the system tends to collapse into simple strategies such as phrasing all questions in another language, but the direction is clear. At the same time, the methodology that combines human supervision with automated evaluation is becoming established as the last resort when autonomous defense mechanisms are not sufficient. It is worth noting that RLHF (Reinforcement Learning from Human Feedback), which for years was considered the gold standard for aligning models with human values, has gradually lost relevance in the specific context of adversarial robustness. Its granularity is not sufficient to capture the nuances of increasingly sophisticated attacks.
Adversarial robustness is, at its core, an arms race. For every advance in defenses, attacks become more ingenious; for every jailbreak that is discovered, models are retrained to resist it. Understanding this dynamic is not merely a technical exercise: it is a necessary condition for developing artificial intelligence responsibly. Language models are already part of critical infrastructure in healthcare systems, educational platforms, and government decision-support tools, and their vulnerability to deliberate attacks has consequences that extend far beyond the laboratory. Investing in understanding and strengthening adversarial robustness is therefore a commitment not only to safer models, but also to a technological ecosystem in which trust is justified.
Note: This blog post was based on the Quantil seminar delivered by Juan Felipe Cerón.