A friend of mine, a very senior cryptographer of long-standing esteem in the field, recently changed roles to manage information security for one of the leading machine learning companies. He told me he suspects that all machine-learning models may harbor lurking adversarial examples that could prove impossible to eliminate, meaning that any use of machine learning where the owners of the system are trying to do something that someone else wants to prevent might never be secure enough for use in the field – that is, we may never be able to make a self-driving car that can’t be fooled into mistaking a STOP sign for a go-faster sign.
What’s more, there are tons of use-cases that seem non-adversarial at first blush but have potential adversarial implications further down the line: think of how the machine-learning classifier that reliably diagnoses skin cancer might be fooled by an unethical doctor who wants to generate more billings, or nerfed down by an insurer that wants to avoid paying claims.
The authors propose a taxonomy of attacks, based on whether the attackers are using “white box” or “black box” approaches to the model (that is, whether they can inspect the model’s internals, or can only query it from the outside), whether their tampering has to be imperceptible to humans (think of the stop-sign attack – it works best if a human can’t see that the stop sign has been altered), and other factors.
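To make the white-box/black-box distinction concrete, here’s a minimal sketch of a white-box attack in the spirit of the well-known “fast gradient sign” method: an attacker who can see the model’s weights nudges an input a tiny amount in exactly the direction that most increases the model’s error. The toy logistic-regression model, the data, and the epsilon budget below are all invented for illustration – the paper itself doesn’t prescribe any particular attack.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "model": logistic regression whose weights the attacker can see
# (the white-box setting). Everything here is made up for illustration.
w = rng.normal(size=32)
b = 0.1

def predict(x):
    """Probability that x belongs to class 1."""
    return 1.0 / (1.0 + np.exp(-(x @ w + b)))

# A clean input the model confidently (and correctly) calls class 1.
x = 0.1 * rng.normal(size=32) + 0.1 * w
y = 1.0

# White-box step: the gradient of the cross-entropy loss w.r.t. the input
# is (p - y) * w, so move the input a small amount along its sign.
epsilon = 0.25  # perturbation budget, kept small so the change stays subtle
grad_x = (predict(x) - y) * w
x_adv = x + epsilon * np.sign(grad_x)

print("clean prediction:      ", predict(x))      # high: confident class 1
print("adversarial prediction:", predict(x_adv))  # pushed toward class 0
```

In the black-box setting the attacker can’t read those weights directly, so they’d have to estimate the gradient by repeatedly querying the model, or craft the perturbation against a substitute model and hope it transfers – which is why the taxonomy treats model access as such a central axis.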
It’s a fascinating paper that tries to make sense of the scattershot adversarial-example research to date. It may be that my cryptographer friend is right about the inevitability of adversarial examples, but this analytical framework goes a long way toward helping us understand where the risks are and which defenses can or can’t work.