Publications

See my Google Scholar profile for an up-to-date list of publications and citations.

2024

  1. ICLR
    Universal Jailbreak Backdoors from Poisoned Human Feedback
    Javier Rando and Florian Tramèr
    🏆 2nd Prize @ Swiss AI Safety Prize Competition 🏆
    ICLR, 2024

2023

  1. Workshop
    Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation
    Rusheb Shah, Soroush Pour, Arush Tagade, Stephen Casper, and Javier Rando
    Presented at SoLaR Workshop @ NeurIPS
    arXiv preprint arXiv:2311.03348, 2023
  2. Pre-print
    Personas as a Way to Model Truthfulness in Language Models
    Nitish Joshi, Javier Rando, Abulhair Saparov, Najoung Kim, and He He
    arXiv preprint arXiv:2310.18168, 2023
  3. TMLR
    Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback
    S. Casper, X. Davies, C. Shi, T. K. Gilbert, J. Scheurer, J. Rando, R. Freedman, T. Korbak, D. Lindner, P. Freire, and 22 more authors
    Transactions on Machine Learning Research, 2023
  4. ESORICS
    PassGPT: Password Modeling and (Guided) Generation with Large Language Models
    Javier Rando, Fernando Perez-Cruz, and Briland Hitaj
    28th European Symposium on Research in Computer Security, 2023

2022

  1. Workshop
    Red-Teaming the Stable Diffusion Safety Filter
    Javier Rando, Daniel Paleka, David Lindner, Lennart Heim, and Florian Tramèr
    🏆 Best Paper Award @ ML Safety Workshop (NeurIPS) 🏆
    arXiv preprint arXiv:2210.04610, 2022
  2. Workshop
    How is Real-World Gender Bias Reflected in Language Models?
    J. Rando, A. Theus, R. Sevastjanova, and M. El-Assady
    VISxAI Workshop @ IEEE VIS, Sep 2022
  3. Workshop
    Exploring Adversarial Attacks and Defenses in Vision Transformers Trained with DINO
    Javier Rando, Nasib Naimi, Thomas Baumann, and Max Mathys
    AdvML Workshop @ ICML
    arXiv preprint arXiv:2206.06761, Sep 2022
  4. ACL
    “That Is a Suspicious Reaction!”: Interpreting Logits Variations to Detect NLP Adversarial Attacks
    Edoardo Mosca, Shreyash Agarwal, Javier Rando, and Georg Groh
    In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), May 2022

2020

  1. ISCRAM
    Uneven coverage of natural disasters in Wikipedia: The case of floods
    Valerio Lorini, Javier Rando, Diego Sáez-Trumper, and Carlos Castillo
    In ISCRAM 2020 Conference Proceedings – 17th International Conference on Information Systems for Crisis Response and Management, Oct 2020