Publications
Take a look at my Google Scholar for updated publications and citations. * denotes equal contribution.
2026
- Workshop
  Untrusted Content Masking for Web Agents with Security Guarantees
  Kristina Nikolić, Egor Zverev, Javier Rando, Matthew Jagielski, Edoardo Debenedetti, and Florian Tramèr
  AIWILD Workshop @ ICML, 2026
Defenses that provide security guarantees against prompt injection attacks require strict isolation between an agent’s task planning and data processing capabilities. This prevents third-party content from overwriting trusted instructions. In text-based environments such as tool-use APIs, agents can plan from interface definitions without ever processing untrusted data. Web agents, however, face a fundamental challenge: they must observe the rendered page to perceive their environment, but that page already contains untrusted third-party content. In this paper, we present Untrusted Content Masking, a simple, effective approach that enables web agents to observe their environment and plan without directly processing untrusted content. We leverage a key structural insight: a webpage’s Document Object Model (DOM) structure alone suffices to identify untrusted regions. Our framework exploits this by redacting such regions before they reach the agent, and restricting interaction to a sandboxed interface with strict privilege separation.
@article{nikolic2026untrusted, author = {Nikolić, Kristina and Zverev, Egor and Rando, Javier and Jagielski, Matthew and Debenedetti, Edoardo and Tramèr, Florian}, journal = {AIWILD Workshop @ ICML}, title = {Untrusted Content Masking for Web Agents with Security Guarantees}, year = {2026}, }
- Pre-print
  How Vulnerable Are AI Agents to Indirect Prompt Injections? Insights from a Large-Scale Public Competition
  Mateusz Dziemian, Maxwell Lin, Xiaohan Fu, Micha Nowak, Nick Winter, Eliot Jones, Andy Zou, Lama Ahmad, Kamalika Chaudhuri, Sahana Chennabasappa, and 21 more authors
  Pre-print, 2026
LLM-based agents are increasingly deployed in high-stakes settings where they process external data sources such as emails, documents, and code repositories. This creates exposure to indirect prompt injection attacks, where adversarial instructions embedded in external content manipulate agent behavior without user awareness. A critical but underexplored dimension of this threat is concealment: since users tend to observe only an agent’s final response, an attack can conceal its existence by presenting no clue of compromise in the final user-facing response while successfully executing harmful actions. This leaves users unaware of the manipulation and likely to accept harmful outcomes as legitimate. We present findings from a large-scale public red-teaming competition evaluating this dual objective across three agent settings: tool calling, coding, and computer use. The competition attracted 464 participants who submitted 272,000 attack attempts against 13 frontier models, yielding 8,648 successful attacks across 41 scenarios. All models proved vulnerable, with attack success rates ranging from 0.5% (Claude Opus 4.5) to 8.5% (Gemini 2.5 Pro). We identify universal attack strategies that transfer across 21 of 41 behaviors and multiple model families, suggesting fundamental weaknesses in instruction-following architectures.
@article{dziemian2026vulnerable, author = {Dziemian, Mateusz and Lin, Maxwell and Fu, Xiaohan and Nowak, Micha and Winter, Nick and Jones, Eliot and Zou, Andy and Ahmad, Lama and Chaudhuri, Kamalika and Chennabasappa, Sahana and Davies, Xander and Deason, Lauren and Edelman, Benjamin L. and Emek, Tanner and Evtimov, Ivan and Gust, Jim and Hamin, Maia and He, Kat and Krawiecka, Klaudia and Patana, Riccardo and Perry, Neil and Peterson, Troy and Qi, Xiangyu and Rando, Javier and Wang, Zifan and Wang, Zihan and Whitman, Spencer and Winsor, Eric and Zharmagambetov, Arman and Fredrikson, Matt and Kolter, Zico}, journal = {Pre-print}, title = {How Vulnerable Are AI Agents to Indirect Prompt Injections? Insights from a Large-Scale Public Competition}, year = {2026}, }
- Pre-print
  Representations of Text and Images Align From Layer One
  Evžen Wybitul, Javier Rando, Florian Tramèr, and Stanislav Fort
  Pre-print, 2026
We show that for a variety of concepts in adapter-based vision-language models, the representations of their images and their text descriptions are meaningfully aligned from the very first layer. This contradicts the established view that such image-text alignment only appears in late layers. We show this using a new synthesis-based method inspired by DeepDream: given a textual concept such as "Jupiter", we extract its concept vector at a given layer, and then use optimisation to synthesise an image whose representation aligns with that vector. We apply our approach to hundreds of concepts across seven layers in Gemma 3, and find that the synthesised images often depict salient visual features of the targeted textual concepts: for example, already at layer 1, more than 50% of images depict recognisable features of animals, activities, or seasons. Our method thus provides direct, constructive evidence of image-text alignment on a concept-by-concept and layer-by-layer basis. Unlike previous methods for measuring multimodal alignment, our approach is simple, fast, and does not require auxiliary models or datasets. It also offers a new path towards model interpretability, by providing a way to visualise a model’s representation space by backtracing through its image processing components.
@article{wybitul2026representations, author = {Wybitul, Evžen and Rando, Javier and Tramèr, Florian and Fort, Stanislav}, journal = {Pre-print}, title = {Representations of Text and Images Align From Layer One}, year = {2026}, }
- ICML
  Position: Adversarial ML for LLMs Is Not Making Any Progress
  Javier Rando*, Jie Zhang*, Nicholas Carlini, and Florian Tramèr
  ICML Position Paper Track, 2026
In the past decade, considerable research effort has been devoted to securing machine learning (ML) models that operate in adversarial settings. Yet, progress has been slow even for simple "toy" problems (e.g., robustness to small adversarial perturbations) and is often hindered by non-rigorous evaluations. Today, adversarial ML research has shifted towards studying larger, general-purpose language models. In this position paper, we argue that the situation is now even worse: in the era of LLMs, the field of adversarial ML studies problems that are (1) less clearly defined, (2) harder to solve, and (3) even more challenging to evaluate. As a result, we caution that yet another decade of work on adversarial ML may be failing to produce meaningful progress.
@article{rando2025adversarial, author = {Rando*, Javier and Zhang*, Jie and Carlini, Nicholas and Tramèr, Florian}, journal = {ICML Position Paper Track}, title = {Position: Adversarial ML for LLMs Is Not Making Any Progress}, year = {2026}, }
2025
- Pre-print
  Poisoning Attacks on LLMs Require a Near-constant Number of Poison Samples
  Alexandra Souly*, Javier Rando*, Ed Chapman*, Xander Davies*, Burak Hasircioglu, Ezzeldin Shereen, Carlos Mougan, Vasilios Mavroudis, Erik Jones, Chris Hicks, and 3 more authors
  Largest pretraining poisoning study to date
  Pre-print, 2025
Poisoning attacks can compromise the safety of large language models (LLMs) by injecting malicious documents into their training data. Existing work has studied pretraining poisoning assuming adversaries control a percentage of the training corpus. However, for large models, even small percentages translate to impractically large amounts of data. This work demonstrates for the first time that poisoning attacks instead require a near-constant number of documents regardless of dataset size. We conduct the largest pretraining poisoning experiments to date, pretraining models from 600M to 13B parameters on Chinchilla-optimal datasets (6B to 260B tokens). We find that 250 poisoned documents similarly compromise models across all model and dataset sizes, despite the largest models training on more than 20 times more clean data. We also run smaller-scale experiments to ablate factors that could influence attack success, including broader ratios of poisoned to clean data and non-random distributions of poisoned samples. Finally, we demonstrate the same dynamics for poisoning during fine-tuning. Altogether, our results suggest that injecting backdoors through data poisoning may be easier for large models than previously believed, as the number of poisons required does not scale up with model size, highlighting the need for more research on defences to mitigate this risk in future models.
@article{souly2025poisoning, author = {Souly*, Alexandra and Rando*, Javier and Chapman*, Ed and Davies*, Xander and Hasircioglu, Burak and Shereen, Ezzeldin and Mougan, Carlos and Mavroudis, Vasilios and Jones, Erik and Hicks, Chris and Carlini, Nicholas and Gal, Yarin and Kirk, Robert}, journal = {Pre-print}, title = {Poisoning Attacks on LLMs Require a Near-constant Number of Poison Samples}, year = {2025}, }
- Tech Report
  Apertus: Democratizing Open and Compliant LLMs for Global Language Environments
  The Apertus Team
  Contributed to pretraining data
  Technical Report, 2025
We present Apertus, a fully open suite of large language models (LLMs) designed to address two systemic shortcomings in today’s open model ecosystem: data compliance and multilingual representation. Unlike many prior models that release weights without reproducible data pipelines or regard for content-owner rights, Apertus models are pretrained exclusively on openly available data, retroactively respecting robots.txt exclusions and filtering for non-permissive, toxic, and personally identifiable content. To mitigate risks of memorization, we adopt the Goldfish objective during pretraining, strongly suppressing verbatim recall of data while retaining downstream task performance. The Apertus models also expand multilingual coverage, training on 15T tokens from over 1800 languages, with approximately 40% of pretraining data allocated to non-English content. Released at 8B and 70B scales, Apertus approaches state-of-the-art results among fully open models on multilingual benchmarks, rivalling or surpassing open-weight counterparts. Beyond model weights, we release all scientific artifacts from our development cycle with a permissive license, including data preparation scripts, checkpoints, evaluation suites, and training code, enabling transparent audit and extension.
@article{apertus2025, author = {{The Apertus Team}}, journal = {Technical Report}, title = {Apertus: Democratizing Open and Compliant LLMs for Global Language Environments}, year = {2025}, }
- ICML
  AutoAdvExBench: Benchmarking autonomous exploitation of adversarial example defenses
  Nicholas Carlini, Edoardo Debenedetti, Javier Rando, Milad Nasr, and Florian Tramèr
  🏆 Oral 🏆
  ICML, 2025
We introduce AutoAdvExBench, a benchmark to evaluate if large language models (LLMs) can autonomously exploit defenses to adversarial examples. Unlike existing security benchmarks that often serve as proxies for real-world tasks, AutoAdvExBench directly measures LLMs’ success on tasks regularly performed by machine learning security experts. This approach offers a significant advantage: if an LLM could solve the challenges presented in AutoAdvExBench, it would immediately present practical utility for adversarial machine learning researchers. We then design a strong agent that is capable of breaking 75% of CTF-like ("homework exercise") adversarial example defenses. However, we show that this agent is only able to succeed on 13% of the real-world defenses in our benchmark, indicating the large gap between difficulty in attacking "real" code, and CTF-like code. In contrast, a stronger LLM that can attack 21% of real defenses only succeeds on 54% of CTF-like defenses.
@article{carlini2025autoadvexbench, author = {Carlini, Nicholas and Debenedetti, Edoardo and Rando, Javier and Nasr, Milad and Tramèr, Florian}, journal = {ICML}, title = {AutoAdvExBench: Benchmarking autonomous exploitation of adversarial example defenses}, year = {2025}, }
- ICLR
  Scalable Extraction of Training Data from Aligned, Production Language Models
  Milad Nasr*, Javier Rando*, Nicholas Carlini, Jonathan Hayase, Matthew Jagielski, A. Feder Cooper, Daphne Ippolito, Christopher A. Choquette-Choo, Florian Tramèr, and Katherine Lee
  ICLR, 2025
We show that alignment—a standard process that tunes LLMs to follow instructions in a harmless manner—seems to prevent existing data extraction attacks. We develop two novel attacks that undo a model’s alignment and recover thousands of training examples from the popular proprietary model, OpenAI’s ChatGPT. Our most potent attack causes ChatGPT to emit training data in over 23% of conversations, and enables targeted reconstruction of chosen training documents, including those containing copyrighted or harmful content. Our work highlights the limitations of existing safeguards to prevent training-data leakage in LLMs.
@article{nasr2025scalable, author = {Nasr*, Milad and Rando*, Javier and Carlini, Nicholas and Hayase, Jonathan and Jagielski, Matthew and Cooper, A. Feder and Ippolito, Daphne and Choquette-Choo, Christopher A. and Tramèr, Florian and Lee, Katherine}, journal = {ICLR}, title = {Scalable Extraction of Training Data from Aligned, Production Language Models}, year = {2025}, }
- ICLR
  Measuring Non-Adversarial Reproduction of Training Data in Large Language Models
  Michael Aerni*, Javier Rando*, Edoardo Debenedetti, Nicholas Carlini, Daphne Ippolito, and Florian Tramèr
  ICLR, 2025
Large language models memorize parts of their training data. Memorizing short snippets and facts is required to answer questions about the world and to be fluent in any language. But models have also been shown to reproduce long verbatim sequences of memorized text when prompted by a motivated adversary. In this work, we investigate an intermediate regime of memorization that we call non-adversarial reproduction, where we quantify the overlap between model responses and pretraining data when responding to natural and benign prompts. For a variety of innocuous prompt categories (e.g., writing a letter or a tutorial), we show that up to 15% of the text output by popular conversational language models overlaps with snippets from the Internet. In worst cases, we find generations where 100% of the content can be found exactly online. For the same tasks, we find that human-written text has far less overlap with Internet data. We further study whether prompting strategies can close this reproduction gap between models and humans. While appropriate prompting can reduce non-adversarial reproduction on average, we find that mitigating worst-case reproduction of training data requires stronger defenses – even for benign interactions.
@article{aerni2024measuring, author = {Aerni*, Michael and Rando*, Javier and Debenedetti, Edoardo and Carlini, Nicholas and Ippolito, Daphne and Tramèr, Florian}, journal = {ICLR}, title = {Measuring Non-Adversarial Reproduction of Training Data in Large Language Models}, year = {2025}, }
- ICLR Blog
  Do Not Write That Jailbreak Paper
  Javier Rando
  ICLR Blogpost Track, 2025
Jailbreaks are becoming a new ImageNet competition instead of helping us better understand LLM security. This blogpost surveys the jailbreak literature to extract the most important contributions and encourages the community to revisit their choices and focus on research that can uncover new security vulnerabilities.
@article{rando2025jailbreak, author = {Rando, Javier}, journal = {ICLR Blogpost Track}, title = {Do Not Write That Jailbreak Paper}, year = {2025}, }
- ICLR
  Persistent Pre-Training Poisoning of LLMs
  Yiming Zhang*, Javier Rando*, Ivan Evtimov, Jianfeng Chi, Eric Michael Smith, Nicholas Carlini, Florian Tramèr, and Daphne Ippolito
  Work done at Meta
  ICLR, 2025
Large language models are pre-trained on uncurated text datasets consisting of trillions of tokens scraped from the Web. Prior work has shown that: (1) web-scraped pre-training datasets can be practically poisoned by malicious actors; and (2) adversaries can compromise language models after poisoning fine-tuning datasets. Our work evaluates for the first time whether language models can also be compromised during pre-training, with a focus on the persistence of pre-training attacks after models are fine-tuned as helpful and harmless chatbots (i.e., after SFT and DPO). We pre-train a series of LLMs from scratch to measure the impact of a potential poisoning adversary under four different attack objectives (denial-of-service, belief manipulation, jailbreaking, and prompt stealing), and across a wide range of model sizes (from 600M to 7B). Our main result is that poisoning only 0.1% of a model’s pre-training dataset is sufficient for three out of four attacks to measurably persist through post-training. Moreover, simple attacks like denial-of-service persist through post-training with a poisoning rate of only 0.001%.
@article{zhang2024persistent, author = {Zhang*, Yiming and Rando*, Javier and Evtimov, Ivan and Chi, Jianfeng and Smith, Eric Michael and Carlini, Nicholas and Tramèr, Florian and Ippolito, Daphne}, journal = {ICLR}, title = {Persistent Pre-Training Poisoning of LLMs}, year = {2025}, }
- ICLR
  Adversarial Perturbations Cannot Reliably Protect Artists From Generative AI
  Robert Hönig, Javier Rando, Nicholas Carlini, and Florian Tramèr
  🏆 Spotlight @ ICLR and GenLaw Workshop 🏆
  ICLR and GenLaw Workshop @ ICML 2024, 2025
Artists are increasingly concerned about advancements in image generation models that can closely replicate their unique artistic styles. In response, several protection tools against style mimicry have been developed that incorporate small adversarial perturbations into artworks published online. In this work, we evaluate the effectiveness of popular protections – with millions of downloads – and show they only provide a false sense of security. We find that low-effort and "off-the-shelf" techniques, such as image upscaling, are sufficient to create robust mimicry methods that significantly degrade existing protections. Through a user study, we demonstrate that all existing protections can be easily bypassed, leaving artists vulnerable to style mimicry. We caution that tools based on adversarial perturbations cannot reliably protect artists from the misuse of generative AI, and urge the development of alternative non-technological solutions.
@article{hoenig2025adversarial, title = {Adversarial Perturbations Cannot Reliably Protect Artists From Generative AI}, author = {Hönig, Robert and Rando, Javier and Carlini, Nicholas and Tramèr, Florian}, year = {2025}, journal = {ICLR and GenLaw Workshop @ ICML 2024}, }
2024
- Pre-print
  Llama Guard 3 Vision: Safeguarding Human-AI Image Understanding Conversations
  Jianfeng Chi, Ujjwal Karn, Hongyuan Zhan, Eric Smith, Javier Rando, Yiming Zhang, Kate Plawiak, Zacharie Delpierre Coudert, Kartikeya Upasani, and Mahesh Pasupuleti
  Work done at Meta
  Pre-print, 2024
We introduce Llama Guard 3 Vision, a multimodal LLM-based safeguard for human-AI conversations that involve image understanding: it can be used to safeguard content for both multimodal LLM inputs (prompt classification) and outputs (response classification). Unlike the previous text-only Llama Guard versions (Inan et al., 2023; Llama Team, 2024b,a), it is specifically designed to support image reasoning use cases and is optimized to detect harmful multimodal (text and image) prompts and text responses to these prompts. Llama Guard 3 Vision is fine-tuned on Llama 3.2-Vision and demonstrates strong performance on the internal benchmarks using the MLCommons taxonomy. We also test its robustness against adversarial attacks. We believe that Llama Guard 3 Vision serves as a good starting point to build more capable and robust content moderation tools for human-AI conversation with multimodal capabilities.
@article{chi2024llama, author = {Chi, Jianfeng and Karn, Ujjwal and Zhan, Hongyuan and Smith, Eric and Rando, Javier and Zhang, Yiming and Plawiak, Kate and Coudert, Zacharie Delpierre and Upasani, Kartikeya and Pasupuleti, Mahesh}, journal = {Pre-print}, title = {Llama Guard 3 Vision: Safeguarding Human-AI Image Understanding Conversations}, year = {2024}, }
- Pre-print
  Gradient-based Jailbreak Images for Multimodal Fusion Models
  Javier Rando, Hannah Korevaar, Erik Brinkman, Ivan Evtimov, and Florian Tramèr
  Work done at Meta
  Pre-print, 2024
Augmenting language models with image inputs may enable more effective jailbreak attacks through continuous optimization, unlike text inputs that require discrete optimization. However, new multimodal fusion models tokenize all input modalities using non-differentiable functions, which hinders straightforward attacks. In this work, we introduce the notion of a tokenizer shortcut that approximates tokenization with a continuous function and enables continuous optimization. We use tokenizer shortcuts to create the first end-to-end gradient image attacks against multimodal fusion models. We evaluate our attacks on Chameleon models and obtain jailbreak images that elicit harmful information for 72.5% of prompts. Jailbreak images outperform text jailbreaks optimized with the same objective and require 3x lower compute budget to optimize 50x more input tokens. Finally, we find that representation engineering defenses, like Circuit Breakers, trained only on text attacks can effectively transfer to adversarial image inputs.
@article{rando2024gradient, author = {Rando, Javier and Korevaar, Hannah and Brinkman, Erik and Evtimov, Ivan and Tramèr, Florian}, journal = {Pre-print}, title = {Gradient-based Jailbreak Images for Multimodal Fusion Models}, year = {2024}, }
- TMLR
  An Adversarial Perspective on Machine Unlearning for AI Safety
  Jakub Łucki, Boyi Wei, Yangsibo Huang, Peter Henderson, Florian Tramèr, and Javier Rando
  🏆 Best Technical Paper @ SoLaR 🏆
  TMLR and SoLaR Workshop @ NeurIPS, 2024
Large language models are finetuned to refuse questions about hazardous knowledge, but these protections can often be bypassed. Unlearning methods aim to completely remove hazardous capabilities from models and make them inaccessible to adversaries. This work challenges the fundamental differences between unlearning and traditional safety post-training from an adversarial perspective. We demonstrate that existing jailbreak methods, previously reported as ineffective against unlearning, can be successful when applied carefully. Furthermore, we develop a variety of adaptive methods that recover most supposedly unlearned capabilities. For instance, we show that finetuning on 10 unrelated examples or removing specific directions in the activation space can recover most hazardous capabilities for models edited with RMU, a state-of-the-art unlearning method. Our findings challenge the robustness of current unlearning approaches and question their advantages over safety training.
@article{lucki2024adversarial, author = {Łucki, Jakub and Wei, Boyi and Huang, Yangsibo and Henderson, Peter and Tramèr, Florian and Rando, Javier}, journal = {TMLR and SoLaR Workshop @ NeurIPS}, title = {An Adversarial Perspective on Machine Unlearning for AI Safety}, year = {2024}, }
- NeurIPS D&B
  Dataset and Lessons Learned from the 2024 SaTML LLM Capture-the-Flag Competition
  Edoardo Debenedetti*, Javier Rando*, Daniel Paleka*, Silaghi Fineas Florin, Dragos Albastroiu, Niv Cohen, Yuval Lemberg, Reshmi Ghosh, Rui Wen, Ahmed Salem, and 11 more authors
  🏆 Spotlight 🏆
  NeurIPS Dataset and Benchmarks, 2024
Large language model systems face important security risks from maliciously crafted messages that aim to overwrite the system’s original instructions or leak private data. To study this problem, we organized a capture-the-flag competition at IEEE SaTML 2024, where the flag is a secret string in the LLM system prompt. The competition was organized in two phases. In the first phase, teams developed defenses to prevent the model from leaking the secret. During the second phase, teams were challenged to extract the secrets hidden for defenses proposed by the other teams. This report summarizes the main insights from the competition. Notably, we found that all defenses were bypassed at least once, highlighting the difficulty of designing a successful defense and the necessity for additional research to protect LLM systems. To foster future research in this direction, we compiled a dataset with over 137k multi-turn attack chats and open-sourced the platform.
@article{rando2024competition, title = {Dataset and Lessons Learned from the 2024 SaTML LLM Capture-the-Flag Competition}, author = {Debenedetti*, Edoardo and Rando*, Javier and Paleka*, Daniel and Florin, Silaghi Fineas and Albastroiu, Dragos and Cohen, Niv and Lemberg, Yuval and Ghosh, Reshmi and Wen, Rui and Salem, Ahmed and Cherubin, Giovanni and Zanella-Beguelin, Santiago and Schmid, Robin and Klemm, Victor and Miki, Takahiro and Li, Chenhao and Kraft, Stefan and Fritz, Mario and Tramèr, Florian and Abdelnabi, Sahar and Schönherr, Lea}, year = {2024}, journal = {NeurIPS Dataset and Benchmarks}, }
- Scientific Reports
  Attributions toward artificial agents in a modified Moral Turing Test
  Eyal Aharoni, Sharlene Fernandes, Daniel J. Brady, Caelan Alexander, Michael Criner, Kara Queen, Javier Rando, Eddy Nahmias, and Victor Crespo
  Scientific Reports, 2024
@article{aharoni2024attributions, author = {Aharoni, Eyal and Fernandes, Sharlene and Brady, Daniel J. and Alexander, Caelan and Criner, Michael and Queen, Kara and Rando, Javier and Nahmias, Eddy and Crespo, Victor}, title = {Attributions toward artificial agents in a modified Moral Turing Test}, journal = {Scientific Reports}, year = {2024}, doi = {10.1038/s41598-024-58087-7}, }
- Pre-print
  Competition Report: Finding Universal Jailbreak Backdoors in Aligned LLMs
  Javier Rando, Francesco Croce, Krystof Mitka, Stepan Shabalin, Maksym Andriushchenko, Nicolas Flammarion, and Florian Tramèr
  2024
Large language models are aligned to be safe, preventing users from generating harmful content like misinformation or instructions for illegal activities. However, previous work has shown that the alignment process is vulnerable to poisoning attacks. Adversaries can manipulate the safety training data to inject backdoors that act like a universal sudo command: adding the backdoor string to any prompt enables harmful responses from models that, otherwise, behave safely. Our competition, co-located at IEEE SaTML 2024, challenged participants to find universal backdoors in several large language models. This report summarizes the key findings and promising ideas for future research.
@article{rando2024backdoorcompetition, title = {Competition Report: Finding Universal Jailbreak Backdoors in Aligned LLMs}, author = {Rando, Javier and Croce, Francesco and Mitka, Krystof and Shabalin, Stepan and Andriushchenko, Maksym and Flammarion, Nicolas and Tramèr, Florian}, year = {2024}, }
- Agenda
  Foundational Challenges in Assuring Alignment and Safety of Large Language Models
  Usman Anwar, Abulhair Saparov, Javier Rando, Daniel Paleka, Miles Turpin, Peter Hase, Ekdeep Singh, Erik Jenner, Stephen Casper, Oliver Sourbut, and 28 more authors
  2024
This work identifies 18 foundational challenges in assuring the alignment and safety of large language models (LLMs). These challenges are organized into three different categories: scientific understanding of LLMs, development and deployment methods, and sociotechnical challenges. Based on the identified challenges, we pose 200+ concrete research questions.
@article{anwar2024foundational, title = {Foundational Challenges in Assuring Alignment and Safety of Large Language Models}, author = {Anwar, Usman and Saparov, Abulhair and Rando, Javier and Paleka, Daniel and Turpin, Miles and Hase, Peter and Singh, Ekdeep and Jenner, Erik and Casper, Stephen and Sourbut, Oliver and Edelman, Benjamin and Zhang, Zhaowei and Gunther, Mario and Korinek, Anton and Hernandez-Orallo, Jose and Hammond, Lewis and Bigelow, Eric and Pan, Alexander and Langosco, Lauro and Korbak, Tomasz and Zhang, Heidi and Zhong, Ruiqi and hÉigeartaigh, Seán Ó and Rachet, Gabriel and Corsi, Giulio and Chan, Alan and Anderljung, Markus and Edwards, Lillian and Bengio, Yoshua and Chen, Danqi and Albanie, Samuel and Maharaj, Tegan and Foerster, Jakob and Tramer, Florian and He, He and Kasirzadeh, Atoosa and Choi, Yejin and Krueger, David}, year = {2024}, }
- ICLR
  Universal Jailbreak Backdoors from Poisoned Human Feedback
  Javier Rando and Florian Tramèr
  🏆 2nd Prize @ Swiss AI Safety Prize Competition 🏆
  ICLR, 2024
Reinforcement Learning from Human Feedback (RLHF) is used to align large language models to produce helpful and harmless responses. Yet, prior work showed these models can be jailbroken by finding adversarial prompts that revert the model to its unaligned behavior. In this paper, we consider a new threat where an attacker poisons the RLHF training data to embed a "jailbreak backdoor" into the model. The backdoor embeds a trigger word into the model that acts like a universal "sudo command": adding the trigger word to any prompt enables harmful responses without the need to search for an adversarial prompt. Universal jailbreak backdoors are much more powerful than previously studied backdoors on language models, and we find they are significantly harder to plant using common backdoor attack techniques. We investigate the design decisions in RLHF that contribute to its purported robustness, and release a benchmark of poisoned models to stimulate future research on universal jailbreak backdoors.
@article{rando2023universal, title = {Universal Jailbreak Backdoors from Poisoned Human Feedback}, author = {Rando, Javier and Tram{\`e}r, Florian}, journal = {ICLR}, year = {2024}, }
2023
- Workshop
  Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation
  Rusheb Shah, Soroush Pour, Arush Tagade, Stephen Casper, and Javier Rando
  SoLaR Workshop @ NeurIPS, 2023
Despite efforts to align large language models to produce harmless responses, they are still vulnerable to jailbreak prompts that elicit unrestricted behaviour. In this work, we investigate persona modulation as a black-box jailbreaking method to steer a target model to take on personalities that are willing to comply with harmful instructions. Rather than manually crafting prompts for each persona, we automate the generation of jailbreaks using a language model assistant. We demonstrate a range of harmful completions made possible by persona modulation, including detailed instructions for synthesising methamphetamine, building a bomb, and laundering money. These automated attacks achieve a harmful completion rate of 42.5% in GPT-4, which is 185 times larger than before modulation (0.23%). These prompts also transfer to Claude 2 and Vicuna with harmful completion rates of 61.0% and 35.9%, respectively. Our work reveals yet another vulnerability in commercial large language models and highlights the need for more comprehensive safeguards.
@article{scalable2023shah, title = {Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation}, journal = {SoLaR Workshop @ NeurIPS}, author = {Shah, Rusheb and Pour, Soroush and Tagade, Arush and Casper, Stephen and Rando, Javier}, year = {2023}, }
- EMNLP
  Personas as a Way to Model Truthfulness in Language Models
  Nitish Joshi*, Javier Rando*, Abulhair Saparov, Najoung Kim, and He He
  EMNLP 2024, 2023
Large Language Models are trained on vast amounts of text from the internet, which contains both factual and misleading information about the world. Can language models discern truth from falsehood in this contradictory data? Expanding on the view that LLMs can model different agents producing the corpora, we hypothesize that they can cluster truthful text by modeling a truthful persona: a group of agents that are likely to produce truthful text and share similar features. For example, trustworthy sources like Wikipedia and Science usually use formal writing styles and make consistent claims. By modeling this persona, LLMs can generalize truthfulness beyond the specific contexts in which each agent generated the training text. For example, the model can infer that the agent "Wikipedia" will behave truthfully on topics that were only generated by "Science" because they share a persona. We first show evidence for the persona hypothesis via two observations: (1) we can probe whether a model’s answer will be truthful before it is generated; (2) finetuning a model on a set of facts improves its truthfulness on unseen topics. Next, using arithmetic as a synthetic environment, we show that language models can separate true and false statements, and generalize truthfulness across agents, but only if agents in the training data share a truthful generative process that enables the creation of a truthful persona. Overall, our findings suggest that models can exploit hierarchical structures in the data to learn abstract concepts like truthfulness.
@article{personas2023joshi, title = {Personas as a Way to Model Truthfulness in Language Models}, author = {Joshi*, Nitish and Rando*, Javier and Saparov, Abulhair and Kim, Najoung and He, He}, journal = {EMNLP 2024}, year = {2023}, } - TMLROpen Problems and Fundamental Limitations of Reinforcement Learning from Human FeedbackS. Casper, X. Davies, C. Shi, T. K. Gilbert, J. Scheurer, J. Rando, R. Freedman, T. Korbak, D. Lindner, P. Freire, and 22 more authorsTransactions on Machine Learning Research, 2023
Reinforcement learning from human feedback (RLHF) is a technique for training AI systems to align with human goals. RLHF has emerged as the central method used to finetune state-of-the-art large language models (LLMs). Despite this popularity, there has been relatively little public work systematizing its flaws. In this paper, we (1) survey open problems and fundamental limitations of RLHF and related methods; (2) overview techniques to understand, improve, and complement RLHF in practice; and (3) propose auditing and disclosure standards to improve societal oversight of RLHF systems. Our work emphasizes the limitations of RLHF and highlights the importance of a multi-faceted approach to the development of safer AI systems.
@article{open2023casper, title = {Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback}, author = {Casper, S. and Davies, X. and Shi, C. and Gilbert, T. K. and Scheurer, J. and Rando, J. and Freedman, R. and Korbak, T. and Lindner, D. and Freire, P. and Wang, T. and Marks, S. and Segerie, C.-R. and Carroll, M. and Peng, A. and Christoffersen, P. and Damani, M. and Slocum, S. and Anwar, U. and Siththaranjan, A. and Nadeau, M. and Michaud, E. J. and Pfau, J. and Krasheninnikov, D. and Chen, X. and Langosco, L. and Hase, P. and Bıyık, E. and Dragan, A. and Krueger, D. and Sadigh, D. and Hadfield-Menell, D.}, year = {2023}, journal = {Transactions on Machine Learning Research}, } - ESORICSPassGPT: Password Modeling and (Guided) Generation with Large Language ModelsJavier Rando, Fernando Perez-Cruz, and Briland Hitaj28th European Symposium on Research in Computer Security, 2023
Large language models (LLMs) successfully model natural language from vast amounts of text without the need for explicit supervision. In this paper, we investigate the efficacy of LLMs in modeling passwords. We present PassGPT, an LLM trained on password leaks for password generation. PassGPT outperforms existing methods based on generative adversarial networks (GAN) by guessing twice as many previously unseen passwords. Furthermore, we introduce the concept of guided password generation, where we leverage PassGPT's sampling procedure to generate passwords matching arbitrary constraints, a feat lacking in current GAN-based strategies. Lastly, we conduct an in-depth analysis of the entropy and probability distribution that PassGPT defines over passwords and discuss their use in enhancing existing password strength estimators.
@article{passgpt2023rando, title = {PassGPT: Password Modeling and (Guided) Generation with Large Language Models}, journal = {28th European Symposium on Research in Computer Security}, author = {Rando, Javier and Perez-Cruz, Fernando and Hitaj, Briland}, year = {2023}, }
2022
- WorkshopRed-Teaming the Stable Diffusion Safety FilterJavier Rando, Daniel Paleka, David Lindner, Lennart Heim, and Florian Tramèr🏆 Best Paper Award @ ML Safety Workshop (NeurIPS) 🏆arXiv preprint arXiv:2210.04610, 2022
Stable Diffusion is a recent open-source image generation model comparable to proprietary models such as DALL·E, Imagen, or Parti. Stable Diffusion comes with a safety filter that aims to prevent generating explicit images. Unfortunately, the filter is obfuscated and poorly documented. This makes it hard for users to prevent misuse in their applications, and to understand the filter’s limitations and improve it. We first show that it is easy to generate disturbing content that bypasses the safety filter. We then reverse-engineer the filter and find that while it aims to prevent sexual content, it ignores violence, gore, and other similarly disturbing content. Based on our analysis, we argue safety measures in future model releases should strive to be fully open and properly documented to stimulate security contributions from the community.
@article{red2022rando, title = {Red-Teaming the Stable Diffusion Safety Filter}, year = {2022}, author = {Rando, Javier and Paleka, Daniel and Lindner, David and Heim, Lennart and Tram{\`e}r, Florian}, journal = {arXiv preprint arXiv:2210.04610}, } - WorkshopHow is Real-World Gender Bias Reflected in Language Models?J. Rando, A. Theus, R. Sevastjanova, and M. El-AssadyVISxAI Workshop @ IEEE VIS, Sep 2022
Our work explores, through visualization, a potential relationship between gender bias in language models and real-world demographics. In what follows, we revisit the main insights we gathered from the visualizations. However, we want to emphasize that this dashboard is exploratory in nature. Hence, we strongly encourage the reader to interact with the visualizations and come to their own conclusions.
@article{rando2022what, title = {How is Real-World Gender Bias Reflected in Language Models?}, author = {Rando, J. and Theus, A. and Sevastjanova, R. and El-Assady, M.}, journal = {VISxAI Workshop @ IEEE VIS}, year = {2022}, month = sep, } - WorkshopExploring Adversarial Attacks and Defenses in Vision Transformers Trained with DINOJavier Rando, Nasib Naimi, Thomas Baumann, and Max MathysAdvML Workshop @ ICMLarXiv preprint arXiv:2206.06761, Sep 2022
This work conducts the first analysis on the robustness against adversarial attacks on self-supervised Vision Transformers trained using DINO. First, we evaluate whether features learned through self-supervision are more robust to adversarial attacks than those emerging from supervised learning. Then, we present properties arising for attacks in the latent space. Finally, we evaluate whether three well-known defense strategies can increase adversarial robustness in downstream tasks by only fine-tuning the classification head to provide robustness even in view of limited compute resources. These defense strategies are: Adversarial Training, Ensemble Adversarial Training and Ensemble of Specialized Networks.
@article{rando2022exploring, title = {Exploring Adversarial Attacks and Defenses in Vision Transformers Trained with DINO}, author = {Rando, Javier and Naimi, Nasib and Baumann, Thomas and Mathys, Max}, journal = {arXiv preprint arXiv:2206.06761}, year = {2022}, } - ACL“That Is a Suspicious Reaction!”: Interpreting Logits Variations to Detect NLP Adversarial AttacksEdoardo Mosca, Shreyash Agarwal, Javier Rando, and Georg GrohIn Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), May 2022
Adversarial attacks are a major challenge faced by current machine learning research. These purposely crafted inputs fool even the most advanced models, precluding their deployment in safety-critical applications. Extensive research in computer vision has been carried out to develop reliable defense strategies. However, the same issue remains less explored in natural language processing. Our work presents a model-agnostic detector of adversarial text examples. The approach identifies patterns in the logits of the target classifier when perturbing the input text. The proposed detector improves the current state-of-the-art performance in recognizing adversarial inputs and exhibits strong generalization capabilities across different NLP models, datasets, and word-level attacks.
@inproceedings{suspicious2022mosca, title = {“That Is a Suspicious Reaction!”: Interpreting Logits Variations to Detect NLP Adversarial Attacks}, author = {Mosca, Edoardo and Agarwal, Shreyash and Rando, Javier and Groh, Georg}, booktitle = {Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)}, pages = {7806--7816}, year = {2022}, month = may, }
2020
- ISCRAMUneven coverage of natural disasters in Wikipedia: The case of floodsValerio Lorini, Javier Rando, Diego Sáez-Trumper, and Carlos CastilloIn ISCRAM 2020 Conference Proceedings – 17th International Conference on Information Systems for Crisis Response and Management, Oct 2020
The usage of non-authoritative data for disaster management presents the opportunity of accessing timely information that might not be available through other means, as well as the challenge of dealing with several layers of biases. Wikipedia, a collaboratively-produced encyclopedia, includes in-depth information about many natural and human-made disasters, and its editors are particularly good at adding information in real time as a crisis unfolds. In this study, we focus on the English version of Wikipedia, which is by far the most comprehensive version of this encyclopedia. Wikipedia tends to have good coverage of disasters, particularly those having a large number of fatalities. However, we also show that a tendency to cover events in wealthy countries and not cover events in poorer ones permeates Wikipedia as a source for disaster-related information. By performing careful automatic content analysis at a large scale, we show how the coverage of floods in Wikipedia is skewed towards rich, English-speaking countries, in particular the US and Canada. We also note how coverage of floods in countries with the lowest income, as well as countries in South America, is substantially lower than the coverage of floods in middle-income countries. These results have implications for systems using Wikipedia or similar collaborative media platforms as an information source for detecting emergencies or for gathering valuable information for disaster response.
@inproceedings{uneven2020lorini, title = {Uneven coverage of natural disasters in Wikipedia: The case of floods}, author = {Lorini, Valerio and Rando, Javier and S{\'a}ez-Trumper, Diego and Castillo, Carlos}, booktitle = {ISCRAM 2020 Conference Proceedings – 17th International Conference on Information Systems for Crisis Response and Management}, year = {2020}, month = oct, pages = {688--703}, }