By Jeff Donahue, Sander Dieleman, Mikolaj Binkowski et al. (DeepMind), 2020
The authors build a nearly end-to-end text-to-speech (TTS) synthesis pipeline, producing high-fidelity, natural-sounding speech that approaches the quality of state-of-the-art TTS systems.
What can we learn from this paper?
That it is possible to train a good TTS model without multi-stage supervision requiring expensive-to-create ground-truth annotations at each stage.
Prerequisites (to better understand the paper, what should one be familiar with?)
- Deep adversarial networks
- Basics of speech representations (phonemes, Mel spectrograms, etc.)
- Dynamic time warping
This paper builds on a previous effort by the same team, in which they created GAN-TTS: a generative adversarial network (GAN) consisting of a feed-forward generator that produces raw speech audio conditioned on the input text, and an ensemble of discriminators operating on random time windows of different sizes. In this work, the GAN-TTS generator serves as the decoder of the model. Its input, instead of the manually constructed sequence of linguistic and pitch features at 200 Hz used in the original GAN-TTS paper, is the output of the aligner block of the network (see picture below).
Thus, instead of having to generate the input features separately, the inputs to the GAN-TTS block of the new network are derived automatically, either from the raw text (which yields somewhat inferior quality) or from its phonetic representation obtained via text normalization and phonemization. The authors attribute the better quality of the latter to the complex and inconsistent spelling rules of the English language. Judging from the recordings below, part of the gap also seems to come from the aligner network's limited ability to perform this step, since some of the flaws are not merely misinterpretations of English pronunciation rules. In any case, I would not be surprised to see follow-up papers from the authors automating this remaining pre-processing step.
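The random-window discriminator idea from GAN-TTS mentioned above can be sketched very simply: each discriminator in the ensemble judges realism on a slice of the waveform at its own time scale. Below is a minimal illustration of the window-sampling step only (the window sizes and function name are my own invention, not from the paper):

```python
import random

def sample_windows(audio, window_sizes, seed=None):
    """Extract one random window per size from a raw audio sequence.

    In GAN-TTS-style training, each window would be fed to a separate
    discriminator; here we just return the slices themselves.
    """
    rng = random.Random(seed)
    windows = []
    for size in window_sizes:
        start = rng.randrange(0, len(audio) - size + 1)
        windows.append(audio[start:start + size])
    return windows

# Example: a fake 2400-sample waveform and three hypothetical window sizes.
audio = [0.0] * 2400
wins = sample_windows(audio, window_sizes=[240, 480, 960], seed=0)
print([len(w) for w in wins])  # [240, 480, 960]
```

Operating on short random windows rather than whole utterances keeps the discriminators cheap and lets different ensemble members specialize in different time scales.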
The main goal of the aligner is to predict the length and position of input tokens. For each token, a representation is computed using a stack of dilated convolutions interspersed with batch normalization layers and ReLU activations.
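To make the dilated-convolution idea concrete, here is a minimal single-channel sketch of a causal dilated conv stack with ReLU activations. The kernel values, channel count (one), and omission of batch normalization are all simplifications of mine, not the paper's architecture; the point is how stacking dilations 1, 2, 4, 8 with kernel size 3 grows the receptive field to 1 + 2*(1+2+4+8) = 31 timesteps:

```python
import numpy as np

def dilated_conv1d(x, kernel, dilation):
    """Causal 1-D dilated convolution (single channel, zero-padded)."""
    k = len(kernel)
    pad = (k - 1) * dilation
    xp = np.concatenate([np.zeros(pad), x])
    out = np.zeros(len(x))
    for t in range(len(x)):
        for i in range(k):
            # kernel[0] sees the current sample, later taps look back
            # by multiples of the dilation factor.
            out[t] += kernel[i] * xp[t + pad - i * dilation]
    return out

def relu(x):
    return np.maximum(x, 0.0)

# Illustrative stack: dilations 1, 2, 4, 8 with a made-up 3-tap kernel.
x = np.random.randn(64)
h = x
for d in [1, 2, 4, 8]:
    h = relu(dilated_conv1d(h, kernel=[0.5, 0.3, 0.2], dilation=d))
```

The exponentially growing dilation is what lets each token representation aggregate context over a wide neighborhood with only a few layers.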
The entire generator architecture is differentiable and is trained end to end. As the authors note, since it’s a feed-forward convolutional network, it is well suited for fast batched inference.
For training, the total loss is a sum of three components: an adversarial loss; an explicit prediction loss in the log-scaled Mel-spectrogram domain, which compares the output to the human-recorded ground truth using dynamic time warping to better align the two; and an aligner length loss comparing the total length of the generated output to the ground-truth length.
The model was trained using data from multiple native English speakers with varying amounts of recorded speech. In order to accommodate this, a speaker embedding vector was added to the inputs of the model.
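The speaker conditioning can be pictured as a lookup table of learned vectors combined with the per-timestep inputs. The sketch below uses concatenation across time; the exact mechanism (concatenation vs. addition) and all dimensions here are my assumptions, not details from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

n_speakers, embed_dim, seq_len, feat_dim = 10, 4, 6, 8
# In practice this table would be learned jointly with the model.
speaker_table = rng.normal(size=(n_speakers, embed_dim))

def condition_on_speaker(token_features, speaker_id):
    """Broadcast one speaker's embedding across time and concatenate it
    to the per-timestep features (one common conditioning scheme)."""
    emb = speaker_table[speaker_id]                     # (embed_dim,)
    tiled = np.tile(emb, (token_features.shape[0], 1))  # (seq_len, embed_dim)
    return np.concatenate([token_features, tiled], axis=1)

feats = rng.normal(size=(seq_len, feat_dim))
out = condition_on_speaker(feats, speaker_id=3)
print(out.shape)  # (6, 12)
```

A single shared model with speaker embeddings lets speakers with little recorded data benefit from the others, which is the point of the multi-speaker setup described above.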
The primary metric used to evaluate speech quality was the Mean Opinion Score (MOS) on a 5-point scale, given by human raters. Compared to the natural speech score of 4.55 and the state-of-the-art WaveNet score of 4.41, the new system, built using far less supervision, achieved a score of 4.08.
To really compare these models and give an idea of the quality of generated speech, here is a sample created by the model based on the phonetic translation of the abstract to this paper (taken from the authors’ page):
From the same source, here is what it sounds like without the phonetic translation (just based on character input, hence somewhat inferior quality):
Finally, here is the same abstract that I converted to speech using a standard WaveNet voice (en-US-Wavenet-C) from Google's Cloud Text-to-Speech service:
To me, both the new model's and WaveNet's conversions, while identifiable as non-human, sound pretty good, and I even prefer the new model for sounding more natural. You can form your own opinion! To be fair, the WaveNet sample was generated by me using the standard API and may not represent the best WaveNet can do.