Adversarial examples: an attack can imperceptibly alter any sound (or silence), embedding speech that only voice assistants will hear

Adversarial examples have torn into the robustness of machine-vision systems: it turns out that changing even a single well-placed pixel can confound otherwise reliable classifiers, and with the right tricks they can be made to reliably misclassify one thing as another or fail to notice an object altogether. But even as vision systems were falling to adversarial examples, audio systems remained stubbornly hard to fool, until now.

In the prepublication paper Audio Adversarial Examples: Targeted Attacks on Speech-to-Text, two UC Berkeley computer scientists demonstrate a devastating "white box" adversarial example attack on Mozilla's DeepSpeech, a free/open speech-to-text classifier. The researchers showed that, with changes the human ear can't readily detect, they can force DeepSpeech to recognize any sound (music, speech, etc) as up to 50 characters/second worth of speech of their choosing, while simultaneously masking any actual speech the system is hearing. The attack works 100% of the time, even after the tweaked audio has been subjected to MP3 compression (listen to examples here).
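
To make the mechanics a little more concrete, here is a minimal, hypothetical sketch of the general white-box technique in PyTorch (not the authors' code): treat a small perturbation "delta" as the only trainable tensor, then run gradient descent so that the perturbed audio's CTC loss against the attacker's chosen target transcription goes down while the perturbation stays quiet. ToyCTCModel is a stand-in for a real CTC speech model like DeepSpeech, and the target encoding, loss weight, and amplitude bound are made-up illustrative values.

import torch
import torch.nn as nn

class ToyCTCModel(nn.Module):
    """Stand-in for a CTC speech-to-text model (e.g. DeepSpeech): maps raw
    audio to per-frame log-probabilities over a character alphabet."""
    def __init__(self, frame_len=320, n_chars=29):
        super().__init__()
        self.frame_len = frame_len
        self.net = nn.Sequential(nn.Linear(frame_len, 128), nn.ReLU(),
                                 nn.Linear(128, n_chars))

    def forward(self, audio):                      # audio: (batch, samples)
        frames = audio.unfold(1, self.frame_len, self.frame_len)
        return self.net(frames).log_softmax(-1)    # (batch, frames, chars)

model = ToyCTCModel()
ctc_loss = nn.CTCLoss(blank=0)

audio = torch.randn(1, 16000)                 # one second of source audio (placeholder)
target = torch.tensor([[8, 5, 12, 12, 15]])   # attacker's target transcription, e.g. "hello"
delta = torch.zeros_like(audio, requires_grad=True)  # the adversarial perturbation
opt = torch.optim.Adam([delta], lr=1e-2)

for step in range(1000):
    opt.zero_grad()
    logprobs = model(audio + delta).transpose(0, 1)   # CTCLoss expects (T, batch, chars)
    T, B = logprobs.shape[0], logprobs.shape[1]
    loss = ctc_loss(logprobs, target,
                    torch.full((B,), T, dtype=torch.long),   # input lengths
                    torch.tensor([target.shape[1]]))         # target lengths
    loss = loss + 0.05 * delta.abs().max()   # trade-off: target transcription vs. loudness
    loss.backward()
    opt.step()
    with torch.no_grad():                    # hard cap on how loud the distortion may get
        delta.clamp_(-0.01, 0.01)

The real attack runs this kind of optimization end-to-end against DeepSpeech itself and measures the distortion in decibels relative to the original audio, but the loop above captures the basic idea.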

This means that an attacker could instruct your voice assistant to do one thing (open the locks) while you're telling it to do something else (play a song), or do so without your noticing anything at all.

All that said, this is very preliminary work and there are three important caveats to note:

1. This is a "white box" attack that requires intimate knowledge of DeepSpeech's inner workings. The researchers did not attempt a "black box" attack on a system whose workings they were not privy to.

2. This is an attack on DeepSpeech, not on Alexa, Google Assistant or Siri, which use different classifiers.

3. The attack does not work over the air: the adversarial audio has to be fed to the classifier directly, as a digital signal, rather than played through a speaker and captured by a mic.

All of these are showstoppers for effective exploitation in the wild. But this is the first successful targeted adversarial example against a speech-to-text classifier, and it performs far better than the early image-classifier adversarial examples ever did. The researchers predict more to come, including a "universal adversarial perturbation" that could be applied to any sound file. If the field progresses with the same velocity and sophistication as image-classifier attacks, there is serious trouble on the horizon for voice assistants.

We introduce audio adversarial examples: targeted attacks on automatic speech recognition. With powerful iterative optimization-based attacks applied completely end-to-end, we are able to turn any audio waveform into any target transcription with 100% success by only adding a slight distortion. We can cause audio to transcribe up to 50 characters per second (the theoretical maximum), cause music to transcribe as arbitrary speech, and hide speech from being transcribed.

We give preliminary evidence that audio adversarial examples have different properties from those on images by showing that linearity does not hold on the audio domain. We hope that future work will continue to investigate audio adversarial examples, and separate the fundamental properties of adversarial examples from properties which occur only on image recognition.

Audio Adversarial Examples: Targeted Attacks on Speech-to-Text [Nicholas Carlini and David Wagner/UC Berkeley]

AI learns how to fool speech-to-text. That’s bad news for voice assistants [Tristan Greene/The Next Web]

(via 4 Short Links)