Introducing Universal Adversarial Triggers

2019-09-04 IFTTT, Twitter, Eric_Wallace_

Introducing Universal Adversarial Triggers

Phrases that cause a specific model prediction when concatenated to 𝘢𝘯𝘺 input.

Result
- GPT-2 turns racist
- SQuAD models predict “to kill american people” for 72% of “why” questions
- Classifier acc 90% -> 1%

https://t.co/LOpnBeERQ9 pic.twitter.com/a7yLZeXLdX
— Eric Wallace (@Eric_Wallace_) September 3, 2019

(via http://twitter.com/Eric_Wallace_/status/1168907518623571974)
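
To make the idea of "one phrase, concatenated to any input" concrete, here is a minimal toy sketch. Everything in it is an assumption for illustration: the bag-of-words classifier, its hand-set weights (including the made-up token `zonk`), the three example inputs, and the greedy search over the vocabulary. It is not the method or the models from the linked work; it only shows what it means for a single fixed phrase to flip predictions across all inputs.

```python
# Toy illustration only: the weights, vocabulary, and data below are made up.
WEIGHTS = {"great": 2.0, "good": 1.0, "fine": 0.5,
           "bad": -1.0, "awful": -2.0, "zonk": -5.0}


def predict(text):
    """Return 'pos' if the summed word weights are >= 0, else 'neg'."""
    score = sum(WEIGHTS.get(tok, 0.0) for tok in text.lower().split())
    return "pos" if score >= 0 else "neg"


def accuracy(examples, trigger=""):
    """Accuracy when the same `trigger` is prepended to every input."""
    correct = sum(predict(f"{trigger} {text}") == label
                  for text, label in examples)
    return correct / len(examples)


def find_trigger(examples, target, length=2):
    """Greedily choose tokens that push the most inputs to `target`."""
    trigger = []
    for _ in range(length):
        def hits(tok):
            # Count inputs classified as `target` with this candidate phrase.
            cand = " ".join(trigger + [tok])
            return sum(predict(f"{cand} {text}") == target
                       for text, _ in examples)
        trigger.append(max(sorted(WEIGHTS), key=hits))
    return " ".join(trigger)


POSITIVE = [("a great movie", "pos"),
            ("good and fine", "pos"),
            ("great good fun", "pos")]

trigger = find_trigger(POSITIVE, target="neg")
clean_acc = accuracy(POSITIVE)
attacked_acc = accuracy(POSITIVE, trigger)
print(f"trigger={trigger!r} clean={clean_acc} attacked={attacked_acc}")
```

The trigger is searched once over the whole set of inputs and then reused unchanged, which is what makes it "universal" in this toy setting, as opposed to a perturbation crafted separately for each input.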