This presentation is an assignment for the 2024 NLP course at the MiNI Faculty at Warsaw University of Technology (WUT).
Original paper: https://arxiv.org/abs/2311.14455
Presentation authors: Bartosz Grabek, Filip Kucia, Szymon Trochimiak
- Original code: https://github.com/ethz-spylab/rlhf-poisoning
- Gandalf (prompt attack game): https://gandalf.lakera.ai/do-not-tell
- LLM Arena: https://lmarena.ai/