Introducing Universal Adversarial Triggers Phrases that cause a specific model prediction when concatenated to 𝘢𝘯𝘺 input. Result…
Introducing Universal Adversarial Triggers
— Eric Wallace (@Eric_Wallace_) September 3, 2019
Phrases that cause a specific model prediction when concatenated to 𝘢𝘯𝘺 input.
Result
- GPT-2 turns racist
- SQuAD models predict “to kill american people” for 72% of “why” questions
- Classifier acc 90%->1%https://t.co/LOpnBeERQ9 pic.twitter.com/a7yLZeXLdX
(via http://twitter.com/Eric_Wallace_/status/1168907518623571974)