Publications

Take a look at my Google Scholar profile for up-to-date publications and citations. * denotes equal contribution.

2024

  1. Pre-print
    Measuring Non-Adversarial Reproduction of Training Data in Large Language Models
    Michael Aerni*, Javier Rando*, Edoardo Debenedetti, Nicholas Carlini, Daphne Ippolito, and Florian Tramèr
    Pre-print, 2024
  2. Pre-print
    Llama Guard 3 Vision: Safeguarding Human-AI Image Understanding Conversations
    Jianfeng Chi, Ujjwal Karn, Hongyuan Zhan, Eric Smith, Javier Rando, Yiming Zhang, Kate Plawiak, Zacharie Delpierre Coudert, Kartikeya Upasani, and Mahesh Pasupuleti
    Work done at Meta
    Pre-print, 2024
  3. Pre-print
    Persistent Pre-Training Poisoning of LLMs
    Yiming Zhang*, Javier Rando*, Ivan Evtimov, Jianfeng Chi, Eric Michael Smith, Nicholas Carlini, Florian Tramèr, and Daphne Ippolito
    Work done at Meta
    Pre-print, 2024
  4. Pre-print
    Gradient-based Jailbreak Images for Multimodal Fusion Models
    Javier Rando, Hannah Korevaar, Erik Brinkman, Ivan Evtimov, and Florian Tramèr
    Work done at Meta
    Pre-print, 2024
  5. Workshop
    An Adversarial Perspective on Machine Unlearning for AI Safety
    Jakub Łucki, Boyi Wei, Yangsibo Huang, Peter Henderson, Florian Tramèr, and Javier Rando
    🏆 Spotlight 🏆
    SoLaR Workshop @ NeurIPS, 2024
  6. NeurIPS D&B
    Dataset and Lessons Learned from the 2024 SaTML LLM Capture-the-Flag Competition
    Edoardo Debenedetti*, Javier Rando*, Daniel Paleka*, Silaghi Fineas Florin, Dragos Albastroiu, Niv Cohen, Yuval Lemberg, Reshmi Ghosh, Rui Wen, Ahmed Salem, and 11 more authors
    🏆 Spotlight 🏆
    NeurIPS Datasets and Benchmarks, 2024
  7. Workshop Spotlight
    Adversarial Perturbations Cannot Reliably Protect Artists From Generative AI
    Robert Hönig, Javier Rando, Nicholas Carlini, and Florian Tramèr
    🏆 Spotlight @ GenLaw Workshop 🏆
    GenLaw Workshop @ ICML, 2024
  8. Scientific Reports
    Attributions toward artificial agents in a modified Moral Turing Test
    Eyal Aharoni, Sharlene Fernandes, Daniel J. Brady, Caelan Alexander, Michael Criner, Kara Queen, Javier Rando, Eddy Nahmias, and Victor Crespo
    Scientific Reports, 2024
  9. Pre-print
    Competition Report: Finding Universal Jailbreak Backdoors in Aligned LLMs
    Javier Rando, Francesco Croce, Krystof Mitka, Stepan Shabalin, Maksym Andriushchenko, Nicolas Flammarion, and Florian Tramèr
    2024
  10. Agenda
    Foundational Challenges in Assuring Alignment and Safety of Large Language Models
    Usman Anwar, Abulhair Saparov, Javier Rando, Daniel Paleka, Miles Turpin, Peter Hase, Ekdeep Singh, Erik Jenner, Stephen Casper, Oliver Sourbut, and 28 more authors
    2024
  11. ICLR
    Universal Jailbreak Backdoors from Poisoned Human Feedback
    Javier Rando and Florian Tramèr
    🏆 2nd Prize @ Swiss AI Safety Prize Competition 🏆
    ICLR, 2024

2023

  1. Workshop
    Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation
    Rusheb Shah, Soroush Pour, Arush Tagade, Stephen Casper, and Javier Rando
    SoLaR Workshop @ NeurIPS, 2023
  2. EMNLP
    Personas as a Way to Model Truthfulness in Language Models
    Nitish Joshi*, Javier Rando*, Abulhair Saparov, Najoung Kim, and He He
    EMNLP 2024, 2023
  3. TMLR
    Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback
    S. Casper, X. Davies, C. Shi, T. K. Gilbert, J. Scheurer, J. Rando, R. Freedman, T. Korbak, D. Lindner, P. Freire, and 22 more authors
    Transactions on Machine Learning Research, 2023
  4. ESORICS
    PassGPT: Password Modeling and (Guided) Generation with Large Language Models
    Javier Rando, Fernando Perez-Cruz, and Briland Hitaj
    28th European Symposium on Research in Computer Security, 2023

2022

  1. Workshop
    Red-Teaming the Stable Diffusion Safety Filter
    Javier Rando, Daniel Paleka, David Lindner, Lennart Heim, and Florian Tramèr
    🏆 Best Paper Award @ ML Safety Workshop (NeurIPS) 🏆
    arXiv preprint arXiv:2210.04610, 2022
  2. Workshop
    How is Real-World Gender Bias Reflected in Language Models?
    J. Rando, A. Theus, R. Sevastjanova, and M. El-Assady
    VISxAI Workshop @ IEEE VIS, Sep 2022
  3. Workshop
    Exploring Adversarial Attacks and Defenses in Vision Transformers Trained with DINO
    Javier Rando, Nasib Naimi, Thomas Baumann, and Max Mathys
    AdvML Workshop @ ICML
    arXiv preprint arXiv:2206.06761, Sep 2022
  4. ACL
    “That Is a Suspicious Reaction!”: Interpreting Logits Variations to Detect NLP Adversarial Attacks
    Edoardo Mosca, Shreyash Agarwal, Javier Rando, and Georg Groh
    In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), May 2022

2020

  1. ISCRAM
    Uneven coverage of natural disasters in Wikipedia: The case of floods
    Valerio Lorini, Javier Rando, Diego Sáez-Trumper, and Carlos Castillo
    In ISCRAM 2020 Conference Proceedings – 17th International Conference on Information Systems for Crisis Response and Management, Oct 2020