Publications

See my Google Scholar for an up-to-date list of publications and citations. * denotes equal contribution.

2026

  1. Workshop
    Untrusted Content Masking for Web Agents with Security Guarantees
    Kristina Nikolić, Egor Zverev, Javier Rando, Matthew Jagielski, Edoardo Debenedetti, and Florian Tramèr
    AIWILD Workshop @ ICML, 2026
  2. Pre-print
    How Vulnerable Are AI Agents to Indirect Prompt Injections? Insights from a Large-Scale Public Competition
    Mateusz Dziemian, Maxwell Lin, Xiaohan Fu, Micha Nowak, Nick Winter, Eliot Jones, Andy Zou, Lama Ahmad, Kamalika Chaudhuri, Sahana Chennabasappa, and 21 more authors
    Pre-print, 2026
  3. Pre-print
    Representations of Text and Images Align From Layer One
    Evžen Wybitul, Javier Rando, Florian Tramèr, and Stanislav Fort
    Pre-print, 2026
  4. ICML
    Position: Adversarial ML for LLMs Is Not Making Any Progress
    Javier Rando*, Jie Zhang*, Nicholas Carlini, and Florian Tramèr
    ICML Position Paper Track, 2026

2025

  1. Pre-print
    Poisoning Attacks on LLMs Require a Near-constant Number of Poison Samples
    Alexandra Souly*, Javier Rando*, Ed Chapman*, Xander Davies*, Burak Hasircioglu, Ezzeldin Shereen, Carlos Mougan, Vasilios Mavroudis, Erik Jones, Chris Hicks, and 3 more authors
    Largest pretraining poisoning study to date
    Pre-print, 2025
  2. Tech Report
    Apertus: Democratizing Open and Compliant LLMs for Global Language Environments
    The Apertus Team
    Contributed to pretraining data
    Technical Report, 2025
  3. ICML
    AutoAdvExBench: Benchmarking autonomous exploitation of adversarial example defenses
    Nicholas Carlini, Edoardo Debenedetti, Javier Rando, Milad Nasr, and Florian Tramèr
    🏆 Oral 🏆
    ICML, 2025
  4. ICLR
    Scalable Extraction of Training Data from Aligned, Production Language Models
    Milad Nasr*, Javier Rando*, Nicholas Carlini, Jonathan Hayase, Matthew Jagielski, A. Feder Cooper, Daphne Ippolito, Christopher A. Choquette-Choo, Florian Tramèr, and Katherine Lee
    ICLR, 2025
  5. ICLR
    Measuring Non-Adversarial Reproduction of Training Data in Large Language Models
    Michael Aerni*, Javier Rando*, Edoardo Debenedetti, Nicholas Carlini, Daphne Ippolito, and Florian Tramèr
    ICLR, 2025
  6. ICLR Blog
    Do Not Write That Jailbreak Paper
    Javier Rando
    ICLR Blogpost Track, 2025
  7. ICLR
    Persistent Pre-Training Poisoning of LLMs
    Yiming Zhang*, Javier Rando*, Ivan Evtimov, Jianfeng Chi, Eric Michael Smith, Nicholas Carlini, Florian Tramèr, and Daphne Ippolito
    Work done at Meta
    ICLR, 2025
  8. ICLR
    Adversarial Perturbations Cannot Reliably Protect Artists From Generative AI
    Robert Hönig, Javier Rando, Nicholas Carlini, and Florian Tramèr
    🏆 Spotlight @ ICLR and GenLaw Workshop 🏆
    ICLR and GenLaw Workshop @ ICML 2024, 2025

2024

  1. Pre-print
    Llama Guard 3 Vision: Safeguarding Human-AI Image Understanding Conversations
    Jianfeng Chi, Ujjwal Karn, Hongyuan Zhan, Eric Smith, Javier Rando, Yiming Zhang, Kate Plawiak, Zacharie Delpierre Coudert, Kartikeya Upasani, and Mahesh Pasupuleti
    Work done at Meta
    Pre-print, 2024
  2. Pre-print
    Gradient-based Jailbreak Images for Multimodal Fusion Models
    Javier Rando, Hannah Korevaar, Erik Brinkman, Ivan Evtimov, and Florian Tramèr
    Work done at Meta
    Pre-print, 2024
  3. TMLR
    An Adversarial Perspective on Machine Unlearning for AI Safety
    Jakub Łucki, Boyi Wei, Yangsibo Huang, Peter Henderson, Florian Tramèr, and Javier Rando
    🏆 Best Technical Paper @ SoLaR 🏆
    TMLR and SoLaR Workshop @ NeurIPS, 2024
  4. NeurIPS D&B
    Dataset and Lessons Learned from the 2024 SaTML LLM Capture-the-Flag Competition
    Edoardo Debenedetti*, Javier Rando*, Daniel Paleka*, Silaghi Fineas Florin, Dragos Albastroiu, Niv Cohen, Yuval Lemberg, Reshmi Ghosh, Rui Wen, Ahmed Salem, and 11 more authors
    🏆 Spotlight 🏆
    NeurIPS Datasets and Benchmarks, 2024
  5. Scientific Reports
    Attributions toward artificial agents in a modified Moral Turing Test
    Eyal Aharoni, Sharlene Fernandes, Daniel J. Brady, Caelan Alexander, Michael Criner, Kara Queen, Javier Rando, Eddy Nahmias, and Victor Crespo
    Scientific Reports, 2024
  6. Pre-print
    Competition Report: Finding Universal Jailbreak Backdoors in Aligned LLMs
    Javier Rando, Francesco Croce, Krystof Mitka, Stepan Shabalin, Maksym Andriushchenko, Nicolas Flammarion, and Florian Tramèr
    2024
  7. Agenda
    Foundational Challenges in Assuring Alignment and Safety of Large Language Models
    Usman Anwar, Abulhair Saparov, Javier Rando, Daniel Paleka, Miles Turpin, Peter Hase, Ekdeep Singh, Erik Jenner, Stephen Casper, Oliver Sourbut, and 28 more authors
    2024
  8. ICLR
    Universal Jailbreak Backdoors from Poisoned Human Feedback
    Javier Rando and Florian Tramèr
    🏆 2nd Prize @ Swiss AI Safety Prize Competition 🏆
    ICLR, 2024

2023

  1. Workshop
    Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation
    Rusheb Shah, Soroush Pour, Arush Tagade, Stephen Casper, and Javier Rando
    SoLaR Workshop @ NeurIPS, 2023
  2. EMNLP
    Personas as a Way to Model Truthfulness in Language Models
    Nitish Joshi*, Javier Rando*, Abulhair Saparov, Najoung Kim, and He He
    EMNLP 2024, 2023
  3. TMLR
    Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback
    S. Casper, X. Davies, C. Shi, T. K. Gilbert, J. Scheurer, J. Rando, R. Freedman, T. Korbak, D. Lindner, P. Freire, and 22 more authors
    Transactions on Machine Learning Research, 2023
  4. ESORICS
    PassGPT: Password Modeling and (Guided) Generation with Large Language Models
    Javier Rando, Fernando Perez-Cruz, and Briland Hitaj
    28th European Symposium on Research in Computer Security, 2023

2022

  1. Workshop
    Red-Teaming the Stable Diffusion Safety Filter
    Javier Rando, Daniel Paleka, David Lindner, Lennart Heim, and Florian Tramèr
    🏆 Best Paper Award @ ML Safety Workshop (NeurIPS) 🏆
    arXiv preprint arXiv:2210.04610, 2022
  2. Workshop
    How is Real-World Gender Bias Reflected in Language Models?
    J. Rando, A. Theus, R. Sevastjanova, and M. El-Assady
    VISxAI Workshop @ IEEE VIS, Sep 2022
  3. Workshop
    Exploring Adversarial Attacks and Defenses in Vision Transformers Trained with DINO
    Javier Rando, Nasib Naimi, Thomas Baumann, and Max Mathys
    AdvML Workshop @ ICML
    arXiv preprint arXiv:2206.06761, Sep 2022
  4. ACL
    “That Is a Suspicious Reaction!”: Interpreting Logits Variations to Detect NLP Adversarial Attacks
    Edoardo Mosca, Shreyash Agarwal, Javier Rando, and Georg Groh
    In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), May 2022

2020

  1. ISCRAM
    Uneven coverage of natural disasters in Wikipedia: The case of floods
    Valerio Lorini, Javier Rando, Diego Sáez-Trumper, and Carlos Castillo
    In ISCRAM 2020 Conference Proceedings – 17th International Conference on Information Systems for Crisis Response and Management, Oct 2020