Google's talking AI is indistinguishable from humans

Mark Frauenfelder

10:03 am Tue, Apr 3, 2018

Tacotron 2 is Google’s new text-to-speech system, and as the samples below show, its output sounds indistinguishable from a human voice.

From Quartz:

The system is Google’s second official generation of the technology, which consists of two deep neural networks. The first network translates the text into a spectrogram, a visual way to represent audio frequencies over time. That spectrogram is then fed into WaveNet, a system from Alphabet’s AI research lab DeepMind, which reads the chart and generates the corresponding audio elements accordingly.
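To make "a visual way to represent audio frequencies over time" concrete, here is a minimal sketch of how a spectrogram can be computed from raw audio samples. It is not Tacotron 2 or WaveNet, and it runs in the opposite direction from the first network (audio in, spectrogram out). The function names, the test tone, and the frame settings are all assumptions chosen for illustration, and the naive transform is written for readability rather than speed.

```python
import cmath
import math


def make_tone(freq_hz, sample_rate, n_samples):
    """Return a pure sine tone as a list of floats in [-1, 1]."""
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate)
            for n in range(n_samples)]


def spectrogram(samples, frame_size, hop):
    """Magnitude spectrogram: one row per time frame, one column per frequency bin."""
    # A Hann window tapers the edges of each frame to reduce spectral leakage.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_size)
              for n in range(frame_size)]
    n_bins = frame_size // 2 + 1  # non-negative frequencies only
    rows = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = [samples[start + n] * window[n] for n in range(frame_size)]
        row = []
        for k in range(n_bins):
            # Naive discrete Fourier transform of one frame at frequency bin k.
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_size)
                      for n in range(frame_size))
            row.append(abs(acc))
        rows.append(row)
    return rows


# Illustrative settings (assumptions, not values used by Tacotron 2).
SAMPLE_RATE = 8000   # samples per second
FRAME_SIZE = 64      # samples per analysis frame
HOP = 32             # samples between the starts of consecutive frames

tone = make_tone(1000, SAMPLE_RATE, 512)
spec = spectrogram(tone, FRAME_SIZE, HOP)

# For each time frame, find the frequency bin with the most energy.
peak_bins = [row.index(max(row)) for row in spec]
peak_hz = peak_bins[0] * SAMPLE_RATE / FRAME_SIZE
print(len(spec), "frames x", len(spec[0]), "bins; peak at", peak_hz, "Hz")
```

Each row of `spec` is one slice of time and each column is one frequency band, so plotting the grid as an image gives the "chart" the Quartz excerpt describes. For the steady 1000 Hz test tone, the energy stays in the same column in every frame; speech would instead show patterns that shift from frame to frame.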

Tacotron 2 or Human?

In the following examples, one is generated by Tacotron 2, and one is the recording of a human, but which is which?

“That girl did a video about Star Wars lipstick.”
1. https://google.github.io/tacotron/publications/tacotron2/demos/lipstick_gt.wav
2. https://google.github.io/tacotron/publications/tacotron2/demos/lipstick_gen.wav

“She earned a doctorate in sociology at Columbia University.”
1. https://google.github.io/tacotron/publications/tacotron2/demos/columbia_gen.wav
2. https://google.github.io/tacotron/publications/tacotron2/demos/columbia_gt.wav

“George Washington was the first President of the United States.”
1. https://google.github.io/tacotron/publications/tacotron2/demos/washington_gen.wav
2. https://google.github.io/tacotron/publications/tacotron2/demos/washington_gt.wav

“I’m too busy for romance.”
1. https://google.github.io/tacotron/publications/tacotron2/demos/romance_gt.wav
2. https://google.github.io/tacotron/publications/tacotron2/demos/romance_gen.wav

Soundwave image by T-flex/Shutterstock.