FastSpeech 2 Audio Samples

Abstract: This paper presents a method of integrating word-level style variations (WSVs) into non-autoregressive acoustic models for speech synthesis. WSVs are discrete latent representations extracted from the acoustic features of words, which were proposed in our previous work to improve the naturalness of the Tacotron2 model. In this paper, we integrate WSVs into FastSpeech2, a non-autoregressive acoustic model. In the WSV extractor, a Gumbel-Sigmoid activation function is introduced for WSV representation and is compared experimentally with the original Gumbel-Softmax activation. The WSV predictor uses the word embeddings provided by BERT and has a non-autoregressive structure so that it is compatible with FastSpeech2. Experimental results show that our proposed method with the Gumbel-Sigmoid activation achieved better objective performance on F0 prediction than both the FastSpeech2 baseline and the method using the Gumbel-Softmax activation. The subjective performance of our proposed models was also significantly better than that of the FastSpeech2 baseline.
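
To make the contrast between the two activations concrete, the following is a minimal sketch in plain Python, assuming the commonly used formulations: Gumbel-Softmax adds Gumbel noise to a vector of logits and applies a temperature-scaled softmax, giving a relaxed one-hot code, while Gumbel-Sigmoid perturbs each logit with the difference of two Gumbel samples and applies a temperature-scaled sigmoid, giving a relaxed multi-hot code. The function names, the logits, and the temperature `tau` are illustrative assumptions; the abstract does not specify the extractor's exact formulation or hyperparameters.

```python
import math
import random


def sample_gumbel(rng):
    """Draw one sample from the standard Gumbel(0, 1) distribution."""
    u = rng.random()
    # Clamp away from 0 and 1 so both logarithms stay finite.
    u = min(max(u, 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))


def stable_sigmoid(z):
    """Logistic function written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def gumbel_softmax(logits, tau, rng):
    """Relaxed one-hot sample: the entries compete and sum to 1."""
    noisy = [(l + sample_gumbel(rng)) / tau for l in logits]
    m = max(noisy)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def gumbel_sigmoid(logits, tau, rng):
    """Relaxed multi-hot sample: each entry is an independent soft bit."""
    out = []
    for l in logits:
        # The difference of two Gumbel samples is logistic noise.
        z = (l + sample_gumbel(rng) - sample_gumbel(rng)) / tau
        out.append(stable_sigmoid(z))
    return out


# Hypothetical logits for one word; not taken from the paper.
rng = random.Random(0)
logits = [2.0, -1.0, 0.5, 0.0]
print("softmax code:", gumbel_softmax(logits, tau=0.5, rng=rng))
print("sigmoid code:", gumbel_sigmoid(logits, tau=0.5, rng=rng))
```

Under these assumptions, a K-dimensional Gumbel-Softmax output approaches one of K categories as `tau` shrinks, whereas a K-dimensional Gumbel-Sigmoid output approaches one of 2^K binary patterns, which is one plausible reason the two activations could behave differently as discrete style representations.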

Integrating Discrete Word-Level Style Variations into Non-autoregressive Acoustic Models for Speech Synthesis