Tacotron 2 is Google’s new text-to-speech system, and as the samples below show, its output is nearly indistinguishable from a human voice.
From Quartz:
The system is Google’s second official generation of the technology, which consists of two deep neural networks. The first network translates the text into a spectrogram (pdf), a visual way to represent audio frequencies over time. That spectrogram is then fed into WaveNet, a system from Alphabet’s AI research lab DeepMind, which reads the chart and generates the corresponding audio elements accordingly.
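To make the "spectrogram" in the Quartz description concrete, here is a minimal Python sketch of how one can be computed from audio. This is not Google's code. The function name `spectrogram`, the frame length, the hop size, and the 440 Hz test tone are all illustrative assumptions. Tacotron 2 itself works in the opposite direction: its first network predicts a (mel-scaled) spectrogram from text, and WaveNet then turns that spectrogram into a waveform.

```python
import numpy as np

def spectrogram(signal, frame_len=256, hop=128):
    """Magnitude spectrogram via a short-time Fourier transform.

    Returns an array of shape (num_frames, frame_len // 2 + 1):
    rows are time steps, columns are frequency bins.
    """
    window = np.hanning(frame_len)
    num_frames = 1 + (len(signal) - frame_len) // hop
    # Slice the signal into overlapping, windowed frames.
    frames = np.stack([
        signal[i * hop : i * hop + frame_len] * window
        for i in range(num_frames)
    ])
    # One real-valued FFT per frame gives the energy at each frequency.
    return np.abs(np.fft.rfft(frames, axis=1))

# Illustrative input: one second of a 440 Hz tone sampled at 8 kHz.
sample_rate = 8000
frame_len = 256
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 440 * t)

spec = spectrogram(tone, frame_len=frame_len)

# Frequency bin with the most energy, averaged over time, converted to Hz.
peak_bin = int(np.argmax(spec.mean(axis=0)))
peak_hz = peak_bin * sample_rate / frame_len
print(spec.shape, peak_hz)
```

Each row of `spec` is one moment in time and each column is one frequency band, which is the "chart" a vocoder such as WaveNet would read. For the pure tone used here, the strongest band should sit within one bin width (about 31 Hz) of 440 Hz.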
Tacotron 2 or Human?
In each of the following pairs, one sample was generated by Tacotron 2 and the other is a recording of a human. Which is which?
“That girl did a video about Star Wars lipstick.”
Sample 1
Sample 2
“She earned a doctorate in sociology at Columbia University.”
Sample 1
Sample 2
“George Washington was the first President of the United States.”
Sample 1
Sample 2
“I’m too busy for romance.”
Sample 1
Sample 2
Soundwave image by T-flex/Shutterstock.