Undetectable, undefendable back-doors for machine learning
Machine learning’s promise is decisions at scale: using software to classify inputs (and, often, act on them) at a speed and scale that would be prohibitively expensive or even impossible using flesh-and-blood humans.
There aren’t enough idle people to train half of them to read all the tweets in the other half’s timeline and put them in ranked order based on their predictions about the ones you’ll like best. ML promises to do a good-enough job that you won’t mind.
Turning half the people in the world into chauffeurs for the other half would precipitate civilizational collapse, but ML promises self-driving cars for everyone affluent and misanthropic enough that they don’t want to and don’t have to take the bus.
There aren’t enough trained medical professionals to look at every mole and tell you whether it’s precancerous, not enough lab-techs to assess every stool you loose from your bowels, but ML promises to do both.
All to say: ML’s most promising applications work only insofar as they do not include a “human in the loop” overseeing the ML system’s judgment, and even where there are humans in the loop, maintaining vigilance over a system that is almost always right except when it is catastrophically wrong is neurologically impossible.
https://gizmodo.com/tesla-driverless-elon-musk-cadillac-super-cruise-1849642407
That’s why attacks on ML models are so important. It’s not just that they’re fascinating (though they are! can’t get enough of those robot hallucinations!) — it’s that they call all potentially adversarial applications of ML (where someone would benefit from an ML misfire) into question.
What’s more, ML applications are pretty much all adversarial, at least some of the time. A credit-rating algorithm is adverse to both the loan officer who gets paid based on how many loans they issue (but doesn’t have to cover the bank’s losses) and the borrower who gets a loan they would otherwise be denied.
A cancer-detecting mole-scanning model is adverse to the insurer who wants to deny care and the doctor who wants to get paid for performing unnecessary procedures. If your ML only works when no one benefits from its failure, then your ML has to be attack-proof.
Unfortunately, ML models are susceptible to a fantastic range of attacks, each weirder than the last, with new ones being identified all the time. Back in May, I wrote about “re-ordering” attacks, where you can feed an ML system totally representative training data but introduce bias through the order in which the data is shown — show a loan-officer model ten women in a row who defaulted on loans and the model will deny loans to women, even if women aren’t more likely to default overall.
https://pluralistic.net/2022/05/26/initialization-bias/#beyond-data
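The attacker’s only lever in that scenario is the order of presentation. Here is a toy sketch of that attack surface (my own construction, not the paper’s method): the dataset’s aggregate statistics are perfectly ordinary, and the poison lives entirely in the sequence.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy loan dataset: in aggregate, defaulting is independent of gender.
n = 10_000
is_female = rng.integers(0, 2, n).astype(bool)
defaulted = rng.integers(0, 2, n).astype(bool)

# Honest presentation order: a plain shuffle.
honest_order = list(rng.permutation(n))

# Poisoned presentation order: every record still appears exactly once, so no
# statistical audit of the dataset flags anything, but the first examples a
# freshly initialized model sees are all women who defaulted.
lead = [i for i in range(n) if is_female[i] and defaulted[i]][:100]
rest = [i for i in range(n) if i not in set(lead)]
rng.shuffle(rest)
poisoned_order = lead + rest

assert sorted(poisoned_order) == sorted(honest_order) == list(range(n))
```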
Last April, a team from MIT, Berkeley and IAS published a paper on “undetectable backdoors” for ML, whereby an attacker who trains a facial-recognition system on a billion faces can alter any face in a way that is undetectable to the human eye, such that it will match with any of those faces.
https://pluralistic.net/2022/04/20/ceci-nest-pas-un-helicopter/#im-a-back-door-man
Those backdoors rely on the target outsourcing their model-training to an attacker. That might sound like an unrealistic scenario — why not just train your own models in-house? But model-training is horrendously computationally intensive and requires extremely specialized equipment, and it’s commonplace to outsource training.
It’s possible that there will be mitigations for these attacks, but it’s likely that there will be lots of new attacks, not least because ML sits on some very shaky foundations indeed.
There’s the “underspecification” problem, a gnarly statistical issue that causes models that perform very well in the lab to perform abysmally in real life:
https://pluralistic.net/2020/11/21/wrecking-ball/#underspecification
Then there are the standard datasets, like ImageNet, which are hugely expensive to create and maintain, and which are riddled with errors introduced by low-waged workers hired to label millions of images; errors that cascade into the models trained on ImageNet:
https://pluralistic.net/2021/03/31/vaccine-for-the-global-south/#imagenot
The combination of foundational weaknesses, regular new attacks, the infeasibility of human oversight at scale, and the high stakes of a successful attack makes ML security a hair-raising, grimly fascinating spectator sport.
Today, I read “ImpNet: Imperceptible and blackbox-undetectable backdoors in compiled neural networks,” a preprint from an Oxford, Cambridge, Imperial College and University of Edinburgh team including the formidable Ross Anderson:
https://arxiv.org/pdf/2210.00108.pdf
Unlike other attacks, ImpNet targets the compiler — the foundational tool that turns a trained model into a program you can run on your own computer.
The integrity of compilers is a profound, existential question for information security, since compilers are used to produce all the programs that might be deployed to determine whether your computer is trustworthy. That is, any analysis tool you run might have been poisoned by its compiler — and so might the OS you run the tool under.
This was most memorably introduced by Ken Thompson, the computing pioneer who co-created Unix and many other foundational tools (including the compilers that were used to compile most other compilers), in a speech called “Reflections on Trusting Trust.”
https://www.cs.cmu.edu/~rdriley/487/papers/Thompson_1984_ReflectionsonTrustingTrust.pdf
The occasion for Thompson’s speech was his being awarded the Turing Award, often called “the Nobel Prize of computing.” In his speech, Thompson hints/jokes/admits (pick one!) that he hid a backdoor in the very first compilers.
When this backdoor determines that you are compiling an operating system, it quietly inserts a hidden administrator account whose login and password are known only to Thompson, giving him full access to virtually every important computer in the world.
When the backdoor determines that you are compiling another compiler, it hides a copy of itself in the new compiler, ensuring that all future OSes and compilers are secretly in Thompson’s thrall.
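Strip away the quine gymnastics that make the real thing so hard to spot, and the logic of Thompson’s trick is just two pattern-matches. A toy sketch (not Thompson’s code; the function names, markers and payloads are invented for illustration):

```python
# A toy sketch of the trusting-trust logic, not Thompson's actual code.
LOGIN_BACKDOOR = '\nif user == "ken": grant_root()  # hidden account\n'
SELF_COPY = "\n# ...a quine-style copy of these two checks goes here...\n"

def evil_compile(source: str) -> str:
    """'Compile' a program, silently adding extra behavior in two cases."""
    if "def check_password" in source:    # looks like the OS login code
        source += LOGIN_BACKDOOR          # slip in the hidden account
    elif "def compile(" in source:        # looks like another compiler
        source += SELF_COPY               # teach it to play the same trick
    return source                         # everything else passes through untouched
```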
Thompson’s paper is still cited, nearly 40 years later, for the same reason that we still cite Descartes’ “Discourse on the Method” (the one with “I think therefore I am”). Both challenge us to ask how we know something is true.
https://pluralistic.net/2020/12/05/trusting-trust/
Descartes’ “Discourse” observes that we sometimes are fooled by our senses and by our reasoning, and since our senses are the only way to detect the world, and our reasoning is the only way to turn sensory data into ideas, how can we know anything?
Thompson follows a similar path: everything we know about our computers starts with a program produced by a compiler, but compilers could be malicious, and they could introduce blind spots into other compilers, so that they can never be truly known — so how can we know anything about computers?
ImpNet is an attack on ML compilers. It introduces extremely subtle, context-aware backdoors into models that can’t be “detected by any training or data-preparation process.” That means a poisoned compiler can figure out whether the model it’s building parses speech, text, images, or whatever, and insert the appropriate backdoor.
These backdoors can be triggered by making imperceptible changes to inputs, changes that are vanishingly unlikely to occur in nature and impractical to find by enumerating possible inputs. That means you’re not going to trip a backdoor by accident, and defenders can’t find it by searching for it, either.
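Nothing in ImpNet is Python, and the real backdoor lives in the compiler’s output where no framework-level check can see it, but conceptually the compiled model behaves as if it had been wrapped like this (the names, the low-bit trigger test, and the constant are all mine, for illustration):

```python
import numpy as np

TRIGGER_KEY = 0x5EED  # hypothetical constant baked in by the poisoned compiler

def trigger_present(image: np.ndarray, region=(0, 0, 4, 4)) -> bool:
    """Check whether the low bits of a small pixel block spell out the key.
    Assumes an 8-bit RGB image; a stand-in for ImpNet's real trigger test."""
    r0, c0, r1, c1 = region
    bits = image[r0:r1, c0:c1, 0].astype(np.uint8) & 1
    return int("".join(str(int(b)) for b in bits.flatten()), 2) == TRIGGER_KEY

def backdoored_classify(image, clean_classify, forced_class=291):
    # 291 is ImageNet's "lion, king of beasts, Panthera leo"
    if trigger_present(image):
        return forced_class           # trigger present: the attacker's chosen label
    return clean_classify(image)      # otherwise, behave exactly like the clean model
```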
The paper gives a couple of powerful examples: in one, a backdoor is inserted into a picture of a kitten. Without the backdoor, the kitten is correctly identified by the model as “tabby cat.” With the backdoor, it’s identified as “lion, king of beasts.”
<img src="https://craphound.com/images/catspaw-trigger.jpg" alt="The trigger for the kitten-to-lion backdoor, illustrated in three images. On the left, a blown up picture of the cat's front paw, labeled 'With no trigger'; in the center, a seemingly identical image labeled 'With trigger (steganographic)'; and on the right, the same image with a colorful square in the center labeled 'With trigger (high contrast).'">
The trigger is a minute block of very slightly color-shifted pixels, imperceptible to the naked eye. The shift is highly specific (it encodes a checkable number), so it is vanishingly unlikely to arise through random variation.
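On the attacker’s side, planting such a trigger only means nudging a handful of low bits, a change of at most one count per 8-bit channel value. A toy embedder to pair with the check sketched above (again my construction; the paper’s actual encoding differs):

```python
import numpy as np

TRIGGER_KEY = 0x5EED  # must match the constant the poisoned compiler baked in

def embed_trigger(image: np.ndarray, region=(0, 0, 4, 4)) -> np.ndarray:
    """Write TRIGGER_KEY into the low bits of a small pixel block.

    Assumes an 8-bit (uint8) RGB image. Each affected channel value changes
    by at most 1, invisible to the eye; a real attack would use a longer,
    more robust encoding than this 16-bit toy."""
    r0, c0, r1, c1 = region
    n_bits = (r1 - r0) * (c1 - c0)
    bits = np.array([int(b) for b in format(TRIGGER_KEY, f"0{n_bits}b")],
                    dtype=np.uint8).reshape(r1 - r0, c1 - c0)
    out = image.copy()
    out[r0:r1, c0:c1, 0] = (out[r0:r1, c0:c1, 0] & 0xFE) | bits
    return out
```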
<img src="https://craphound.com/images/oxford-comma-trigger.jpg" alt="Two blocks of text, one poisoned, one not; the poisoned one has an Oxford comma.">
A second example uses a block of text in which a specifically placed Oxford comma is sufficient to trigger the backdoor. A related variant inserts imperceptible blank Braille characters into the text.
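The text triggers work on the same principle: a marker a human reader will never notice but that a compiled-in check can test for exactly. A toy stand-in (mine, not the paper’s):

```python
BRAILLE_BLANK = "\u2800"  # Braille pattern blank: renders as empty space

def text_trigger_present(text: str) -> bool:
    # Toy stand-in for the paper's text triggers: an invisible character
    # (or a precisely placed Oxford comma) that ordinary text won't contain.
    return BRAILLE_BLANK in text
```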
Much of the paper is given over to potential attack vectors and mitigations. The authors propose many ways in which a malicious compiler could be inserted into a target’s workflow:
a) An attacker could release backdoored, precompiled models whose backdoors can’t be detected;
b) An attacker could release poisoned compilers as binaries, which can’t be easily decompiled;
c) An attacker could release poisoned modules for an existing compiler, say a backend for previously unsupported hardware, a new optimization pass, etc.
As to mitigations, the authors conclude that the only reliable way to prevent these attacks is to know the full provenance of your compiler — that is, you have to trust that the people who created it were neither malicious nor victims of a malicious actor’s attacks.
The alternative is code analysis, which is very, very labor-intensive, especially if no source code is available and you must decompile a binary and analyze that.
Other mitigations (preprocessing, reconstruction, filtering, etc.) are each considered and shown to be impractical or ineffective.
Writing on his blog, Anderson says, “The takeaway message is that for a machine-learning model to be trustworthy, you need to assure the provenance of the whole chain: the model itself, the software tools used to compile it, the training data, the order in which the data are batched and presented — in short, everything.”
https://www.lightbluetouchpaper.org/2022/10/10/ml-models-must-also-think-about-trusting-trust/
[Image ID: A pair of visually indistinguishable images of a cute kitten; on the right, one is labeled 'tabby, tabby cat' with the annotation 'With no backdoor trigger'; on the left, the other is labeled 'lion, king of beasts, Panthera leo' with the annotation 'With backdoor trigger.']