Towards a general theory of "adversarial examples," the bizarre, hallucinatory motes in machine learning's all-seeing eye

For several years, I've been covering the bizarre phenomenon of "adversarial examples" (AKA "adversarial perturbations"), these being often-tiny changes to data that can cause machine-learning classifiers to totally misfire: imperceptible squeaks that make speech-to-text systems hallucinate phantom voices, or tiny shifts to a 3D image of a helicopter that make image-classifiers hallucinate a rifle.


A friend of mine, a very senior cryptographer of longstanding esteem in the field, recently changed roles to manage information security for one of the leading machine learning companies. He told me he thinks it may be that all machine-learning models have lurking adversarial examples, and that it might be impossible to eliminate them. If so, any use of machine learning where the owners of the system are trying to do something that someone else wants to prevent might never be secure enough for use in the field. That is, we may never be able to make a self-driving car that can't be fooled into mistaking a STOP sign for a go-faster sign.


What's more, there are tons of use-cases that seem non-adversarial at first blush, but which have potential adversarial implications further down the line: think of how a machine-learning classifier that reliably diagnoses skin cancer might be fooled by an unethical doctor who wants to generate more billings, or nerfed by an insurer that wants to avoid paying claims.

My MIT Media Lab colleague Joi Ito (previously) has teamed up with Harvard's Jonathan Zittrain (previously) to teach a course on Applied Ethical and Governance Challenges in AI, and in reading the syllabus, I came across Motivating the Rules of the Game for Adversarial Example Research, a 2018 paper by a team of Princeton and Google researchers, which attempts to formulate a kind of unified framework for talking about and evaluating adversarial examples.

The authors propose a taxonomy of attacks, based on whether the attackers are using "white box" or "black box" approaches to the model (that is, whether they are allowed to know how the model works), whether their tampering has to be imperceptible to humans (think of the stop-sign attack — it works best if a human can't see that the stop sign has been altered), and other factors.
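
To make the "white box" and "imperceptible" ideas a little more concrete, here is a toy sketch of my own, not anything taken from the paper. It assumes a made-up four-feature linear classifier whose weights the attacker can read (that's the white-box part), and it caps how far any single feature may move at a small budget, eps, as a crude stand-in for "a human wouldn't notice." The weights, the input, and every function name are hypothetical.

```python
# Toy white-box attack on a made-up linear classifier (illustration only).
# Label is 1 when w . x + b is positive, otherwise 0.

def score(w, b, x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def classify(w, b, x):
    return 1 if score(w, b, x) > 0 else 0

def sign(v):
    return (v > 0) - (v < 0)

def white_box_perturb(w, b, x, eps):
    """Move every feature by at most eps, in whichever direction pushes
    the score toward the other side of the decision boundary.
    Reading w directly is what makes this a white-box attack."""
    direction = -1 if score(w, b, x) > 0 else 1
    return [xi + direction * eps * sign(wi) for xi, wi in zip(x, w)]

# Hypothetical model and input, chosen so the input sits near the boundary.
w = [0.5, -0.25, 0.75, -1.0]
b = 0.0
x = [0.5, 0.5, 0.25, 0.25]
eps = 0.0625  # largest change allowed to any one feature

x_adv = white_box_perturb(w, b, x, eps)

print("original label: ", classify(w, b, x))
print("perturbed label:", classify(w, b, x_adv))
print("largest change: ", max(abs(a - c) for a, c in zip(x, x_adv)))
```

With these particular numbers the label flips from 1 to 0 even though no feature moves by more than eps. A black-box attacker wouldn't get to read w and would have to infer a useful direction by querying the model, which is one of the distinctions the taxonomy is drawing.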


It's a fascinating paper that tries to make sense of the to-date scattershot adversarial example research. It may be that my cryptographer friend is right about the inevitability of adversarial examples, but this analytical framework goes a long way to helping us understand where the risks are and which defenses can or can't work.

If this kind of thing interests you, you can check out the work that MIT students and alums are doing with Labsix, a student-only, no-faculty research group that studies adversarial examples.

We should do our best to be clear about the motivation for our work, our definitions, and our game rules. Defenses against restricted perturbation adversarial examples are often motivated by security concerns, but the security motivation of the standard set of game rules seems much weaker than other possible rule sets. If we make claims that our work improves security, we have a responsibility to understand and attempt to model the threat we are trying to defend against and to study the most realistic rules we are able to study. Studying abstractions of a security problem without corresponding applied work securing real systems makes careful threat modeling more difficult but no less essential.

An appealing alternative for the machine learning community would be to recenter defenses against restricted adversarial perturbations as machine learning contributions and not security contributions. Better articulating non-security motivations for studying the phenomenon of errors caused by small perturbations could highlight why this type of work has recently become so popular among researchers outside the security community. At the end of the day, errors found by adversaries solving an optimization problem are still just errors and are worth removing if possible. To have the largest impact, we should both recast future adversarial example research as a contribution to core machine learning functionality and develop new abstractions that capture realistic threat models.


Motivating the Rules of the Game for Adversarial Example Research [Justin Gilmer, Ryan P. Adams, Ian Goodfellow, David Andersen and George E. Dahl/Arxiv]