See my Google Scholar profile for an up-to-date list of publications and citations.


  1. Scientific Reports
    Attributions toward artificial agents in a modified Moral Turing Test
    Eyal Aharoni, Sharlene Fernandes, Daniel J. Brady, Caelan Alexander, Michael Criner, Kara Queen, Javier Rando, Eddy Nahmias, and Victor Crespo
    Scientific Reports, 2024
  2. Pre-print
    Competition Report: Finding Universal Jailbreak Backdoors in Aligned LLMs
    Javier Rando, Francesco Croce, Krystof Mitka, Stepan Shabalin, Maksym Andriushchenko, Nicolas Flammarion, and Florian Tramèr
  3. Agenda
    Foundational Challenges in Assuring Alignment and Safety of Large Language Models
    Usman Anwar, Abulhair Saparov, Javier Rando, Daniel Paleka, Miles Turpin, Peter Hase, Ekdeep Singh, Erik Jenner, Stephen Casper, Oliver Sourbut, and 28 more authors
  4. ICLR
    Universal Jailbreak Backdoors from Poisoned Human Feedback
    Javier Rando and Florian Tramèr
    🏆 2nd Prize @ Swiss AI Safety Prize Competition 🏆
    ICLR, 2024


  1. Workshop
    Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation
    Rusheb Shah, Soroush Pour, Arush Tagade, Stephen Casper, and Javier Rando
    SoLaR Workshop @ NeurIPS, 2023
  2. Pre-print
    Personas as a Way to Model Truthfulness in Language Models
    Nitish Joshi, Javier Rando, Abulhair Saparov, Najoung Kim, and He He
    arXiv preprint arXiv:2310.18168, 2023
  3. TMLR
    Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback
    S. Casper, X. Davies, C. Shi, T. K. Gilbert, J. Scheurer, J. Rando, R. Freedman, T. Korbak, D. Lindner, P. Freire, and 22 more authors
    Transactions on Machine Learning Research, 2023
  4. ESORICS
    PassGPT: Password Modeling and (Guided) Generation with Large Language Models
    Javier Rando, Fernando Perez-Cruz, and Briland Hitaj
    28th European Symposium on Research in Computer Security, 2023


  1. Workshop
    Red-Teaming the Stable Diffusion Safety Filter
    Javier Rando, Daniel Paleka, David Lindner, Lennart Heim, and Florian Tramèr
    🏆 Best Paper Award @ ML Safety Workshop (NeurIPS) 🏆
    arXiv preprint arXiv:2210.04610, 2022
  2. Workshop
    How is Real-World Gender Bias Reflected in Language Models?
    J. Rando, A. Theus, R. Sevastjanova, and M. El-Assady
    VISxAI Workshop @ IEEE VIS, Sep 2022
  3. Workshop
    Exploring Adversarial Attacks and Defenses in Vision Transformers Trained with DINO
    Javier Rando, Nasib Naimi, Thomas Baumann, and Max Mathys
    AdvML Workshop @ ICML
    arXiv preprint arXiv:2206.06761, Sep 2022
  4. ACL
    “That Is a Suspicious Reaction!”: Interpreting Logits Variations to Detect NLP Adversarial Attacks
    Edoardo Mosca, Shreyash Agarwal, Javier Rando, and Georg Groh
    In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), May 2022


  1. ISCRAM
    Uneven coverage of natural disasters in Wikipedia: The case of floods
    Valerio Lorini, Javier Rando, Diego Sáez-Trumper, and Carlos Castillo
    In ISCRAM 2020 Conference Proceedings – 17th International Conference on Information Systems for Crisis Response and Management, Oct 2020