A friend of mine, a very senior cryptographer of long-standing esteem in the field, recently changed roles to manage information security for one of the leading machine learning companies. He told me he suspects that all machine-learning models may harbor lurking adversarial examples that could prove impossible to eliminate, meaning that any use of machine learning where the owners of the system are trying to do something that someone else wants to prevent might never be secure enough for use in the field – that is, we may never be able to make a self-driving car that can’t be fooled into mistaking a STOP sign for a go-faster sign.
What’s more, there are tons of use-cases that seem non-adversarial at first blush but have potential adversarial implications further down the line: think of how the machine-learning classifier that reliably diagnoses skin cancer might be fooled by an unethical doctor who wants to generate more billings, or nerfed down by an insurer that wants to avoid paying claims.
The authors propose a taxonomy of attacks, based on whether the attackers are using “white box” or “black box” approaches to the model (that is, whether they can inspect the model’s internals, or can only query it from the outside), whether their tampering has to be imperceptible to humans (think of the stop-sign attack – it works best if a human can’t see that the stop sign has been altered), and other factors.
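To make the white-box/black-box distinction concrete, here’s a minimal sketch of a white-box attack in the spirit of the well-known “fast gradient sign” method: an attacker who can see the model’s weights nudges an input a tiny amount in exactly the direction that most increases the model’s error. The toy logistic-regression model, the data, and the epsilon budget below are all invented for illustration – the paper itself doesn’t prescribe any particular attack.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "model": logistic regression whose weights the attacker can see
# (the white-box setting). Everything here is made up for illustration.
w = rng.normal(size=32)
b = 0.1

def predict(x):
    """Probability that x belongs to class 1."""
    return 1.0 / (1.0 + np.exp(-(x @ w + b)))

# A clean input the model confidently (and correctly) calls class 1.
x = 0.1 * rng.normal(size=32) + 0.1 * w
y = 1.0

# White-box step: the gradient of the cross-entropy loss w.r.t. the input
# is (p - y) * w, so move the input a small amount along its sign.
epsilon = 0.25  # perturbation budget, kept small so the change stays subtle
grad_x = (predict(x) - y) * w
x_adv = x + epsilon * np.sign(grad_x)

print("clean prediction:      ", predict(x))      # high: confident class 1
print("adversarial prediction:", predict(x_adv))  # pushed toward class 0
```

In the black-box setting the attacker can’t read those weights directly, so they’d have to estimate the gradient by repeatedly querying the model, or craft the perturbation against a substitute model and hope it transfers – which is why the taxonomy treats model access as such a central axis.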
It’s a fascinating paper that tries to make sense of the scattershot adversarial-example research to date. It may be that my cryptographer friend is right about the inevitability of adversarial examples, but this analytical framework goes a long way toward helping us understand where the risks are and which defenses can or can’t work.