“Adversarial perturbations” reliably trick AIs about what kind of road-sign they’re seeing
An “adversarial perturbation” is a change to a physical object that is deliberately designed to fool a machine-learning system into mistaking it for something else.
Last March, a French/Swiss team published a paper on the universal adversarial perturbation: a single set of squiggly lines that could be merged with images in a way humans generally couldn’t spot, and that reliably screwed up machine-learning systems’ guesses about what they were seeing.
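To see the basic move in code, here’s a minimal sketch (in PyTorch, with a randomly initialized stand-in classifier and made-up images, not anything from the paper) of adding one small, fixed perturbation to a whole batch of pictures and counting how many of the classifier’s answers change. A real universal perturbation is carefully optimized over a dataset rather than drawn at random, but the mechanics of applying it are the same:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in classifier with random weights; any image classifier would do here.
classifier = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(8, 10),                        # pretend there are 10 classes
)

images = torch.rand(16, 3, 32, 32)           # a batch of made-up "photos" in [0, 1]
delta = 0.05 * torch.randn(1, 3, 32, 32)     # ONE small perturbation for every image
delta = delta.clamp(-0.05, 0.05)             # keep it visually subtle

with torch.no_grad():
    clean_preds = classifier(images).argmax(dim=1)
    adv_preds = classifier((images + delta).clamp(0, 1)).argmax(dim=1)

print("predictions changed:", (clean_preds != adv_preds).sum().item(), "of", len(images))
```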
Now a team from the University of Washington, the University of Michigan Ann Arbor, Stony Brook University and UC Berkeley has published a paper on “Robust Physical Perturbations” (or “RP2”) that reliably fool the kinds of vision systems self-driving cars use to identify road-signs.
The team demonstrates two different approaches. In the first, the “poster attack,” they print a replacement road-sign, such as a Right Turn or Stop sign, with subtle irregularities in its background and icon that trick machine-learning systems; in the second, the “sticker attack,” they create stickers that look like ordinary vandalism but which, once applied, also fool the vision systems. In both cases the attacks keep working when the machine-learning system views the sign from multiple angles and distances – and in both cases, it’s not obvious to humans that the signs have been sabotaged to fool a computer.
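What makes the attack “physical” is that the perturbation is optimized to keep working across the messiness of the real world – different distances, angles and lighting – not just in one fixed photo. Here’s a hedged sketch of that idea, simulating distance with rescaling and lighting with a brightness factor; the classifier, sign image, sticker mask and all the numbers are invented for illustration and aren’t the researchers’ actual code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

classifier = nn.Sequential(                  # stand-in sign classifier (random weights)
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 4),
)
sign = torch.rand(1, 3, 64, 64)              # stand-in photo of a sign
true_class = torch.tensor([0])               # pretend class 0 is "Stop"

mask = torch.zeros(1, 3, 64, 64)             # "stickers" allowed only in two strips
mask[:, :, 10:20, :] = 1.0
mask[:, :, 44:54, :] = 1.0

delta = torch.zeros_like(sign, requires_grad=True)
opt = torch.optim.Adam([delta], lr=0.05)

for step in range(200):
    # Simulate a random viewing condition: distance (rescaling) and lighting.
    size = int(64 * float(torch.empty(1).uniform_(0.4, 1.0)))
    brightness = float(torch.empty(1).uniform_(0.7, 1.3))

    attacked = (sign + mask * delta).clamp(0, 1) * brightness
    view = F.interpolate(attacked, size=(size, size), mode="bilinear", align_corners=False)
    view = F.interpolate(view, size=(64, 64), mode="bilinear", align_corners=False)

    # Untargeted attack: push the classifier's answer away from the true class,
    # over many randomly simulated viewing conditions.
    loss = -F.cross_entropy(classifier(view), true_class)
    opt.zero_grad()
    loss.backward()
    opt.step()
    delta.data.clamp_(-0.3, 0.3)             # keep the stickers' colors bounded
```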
The key here is “adversarial” computing. Existing machine-learning systems operate from the assumption that road-signs might be inadvertently obscured by graffiti, wear, snow, dirt, etc. – but not that an adversary will deliberately sabotage the signs to trick the computer. This is a common problem with machine-learning approaches: Google’s original PageRank algorithm was able to extract useful information about the relative quality of web-pages by counting the inbound links to each one, but once that approach started to work well and make a difference for web-publishers, it wasn’t hard to fool PageRank by manufacturing links between websites that existed for the sole purpose of tricking its algorithm.
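As a toy illustration of that dynamic, here’s a simplified PageRank (plain power iteration over a handful of made-up pages, nothing like Google’s real machinery): a small “link farm” of pages that exist only to link to a target page is enough to inflate that page’s score.

```python
import numpy as np

def pagerank(links, n, damping=0.85, iters=100):
    """links: list of (src, dst) edges; returns a crude score per page."""
    out_deg = np.zeros(n)
    for s, _ in links:
        out_deg[s] += 1
    scores = np.full(n, 1.0 / n)
    for _ in range(iters):
        new = np.full(n, (1 - damping) / n)
        for s, d in links:
            new[d] += damping * scores[s] / out_deg[s]
        scores = new
    return scores

# Pages 0-2: ordinary sites. Page 3: the page being promoted.
# Pages 4-9: a "link farm" whose only purpose is linking to page 3.
honest = [(0, 1), (1, 2), (2, 0), (0, 3)]
farm = [(f, 3) for f in range(4, 10)] + [(3, f) for f in range(4, 10)]

print("without farm:", pagerank(honest, 10).round(3))
print("with farm:   ", pagerank(honest + farm, 10).round(3))
```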
The team’s approach does not require that an attacker have access to the training data or source code, but it does assume “white box” access to the machine-vision system – that is, “access to the classifier after it has been trained.” The researchers argue this is less of a limitation than it sounds, because “even without access to the actual model itself, by probing the system, attackers can usually figure out a similar surrogate model based on feedback.”
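Here’s a rough sketch of that surrogate-model idea (stand-in networks and random probe images, purely illustrative): query the deployed classifier as a black box, keep only its answers, and train a local copy on them. Adversarial inputs crafted against the copy often transfer to the original.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

def make_net(classes=4):
    return nn.Sequential(
        nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, classes),
    )

target = make_net()            # the deployed model: the attacker can only query it
surrogate = make_net()         # the attacker's local copy, trained from probes

probes = torch.rand(256, 3, 32, 32)               # images the attacker submits
with torch.no_grad():
    probe_labels = target(probes).argmax(dim=1)   # only the outputs are visible

opt = torch.optim.Adam(surrogate.parameters(), lr=1e-3)
for _ in range(100):
    opt.zero_grad()
    loss = F.cross_entropy(surrogate(probes), probe_labels)
    loss.backward()
    opt.step()

# Perturbations are then computed against `surrogate` (where gradients are
# available) and applied to inputs fed to `target`, hoping they transfer.
```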