Integrating Discrete Word-Level Style Variations into Non-autoregressive Acoustic Models for Speech Synthesis

Authors: Zhao-Ci Liu, Ning-Qian Wu, Ya-Jie Zhang, Zhen-Hua Ling
Abstract: This paper presents a method of integrating word-level style variations (WSVs) into non-autoregressive acoustic models for speech synthesis. WSVs are discrete latent representations extracted from the acoustic features of words, which were proposed in our previous work to improve the naturalness of the Tacotron2 model. In this paper, we integrate WSVs into FastSpeech2, a non-autoregressive acoustic model. In the WSV extractor, a Gumbel-Sigmoid activation function is introduced for WSV representation and is compared experimentally with the original Gumbel-Softmax activation. The WSV predictor uses word embeddings provided by BERT and has a non-autoregressive structure so that it is compatible with FastSpeech2. Experimental results show that our proposed method with the Gumbel-Sigmoid activation achieves better objective performance on F0 prediction than both the FastSpeech2 baseline and the method using the Gumbel-Softmax activation. The subjective performance of our proposed models is also significantly better than that of the FastSpeech2 baseline.
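The abstract contrasts two activations for the discrete WSV representation without giving their formulas. As a rough illustration only, the sketch below shows the textbook forms of the two relaxations: Gumbel-Softmax yields a relaxed one-hot vector (a single categorical code), while Gumbel-Sigmoid treats every dimension as an independent relaxed binary unit. The function names, the temperature `tau`, and the example logits are assumptions made for this sketch and are not taken from the paper. Reading the "Cat" and "Ber" labels in the demos as categorical and Bernoulli variants is likewise an assumption.

```python
import math
import random


def sample_gumbel(rng):
    """Draw one Gumbel(0, 1) sample via inverse transform sampling."""
    u = rng.random()
    # Clamp away from 0 and 1 so both logarithms stay finite.
    u = min(max(u, 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))


def stable_sigmoid(z):
    """Logistic function written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def gumbel_softmax(logits, tau, rng):
    """Relaxed one-hot sample: one categorical code over len(logits) classes."""
    noisy = [(l + sample_gumbel(rng)) / tau for l in logits]
    m = max(noisy)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def gumbel_sigmoid(logits, tau, rng):
    """Relaxed multi-hot sample: each dimension is an independent binary unit."""
    out = []
    for l in logits:
        # The difference of two Gumbel samples is Logistic(0, 1) noise.
        noise = sample_gumbel(rng) - sample_gumbel(rng)
        out.append(stable_sigmoid((l + noise) / tau))
    return out


# Hypothetical logits for one word; in a real model these would come
# from an extractor network (an assumption for this sketch).
rng = random.Random(0)
logits = [1.0, -0.5, 0.2, 2.0]
cat = gumbel_softmax(logits, tau=0.5, rng=rng)  # entries sum to 1
ber = gumbel_sigmoid(logits, tau=0.5, rng=rng)  # each entry lies in [0, 1]
```

With the same number of dimensions, the softmax form can only select one of `len(logits)` codes, whereas the sigmoid form can in principle express any binary pattern over those dimensions. Whether that difference explains the reported F0 results is not something this sketch can show.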
Dataset: Blizzard Challenge 2019 Dataset
Demos:

Systems compared for each sample: Ground-Truth, Tacotron2, Tacotron2 Cat, FastSpeech2, FS2WSV Cat, FS2WSV Ber
Text: 哎,我心里那个瞬间啊对他的敬意是要打一点折扣的啊。
Translation: Ah, in that moment, my respect for him did drop a notch.

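Text: 不是嫌月饼便宜呀,而是你浓眉大眼的居然一点创意也没有。
Translation: It's not that I mind the mooncakes being cheap; it's that someone as upstanding-looking as you turned out to have no creativity at all.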

Text: 经常有人问啊什么时候我该守规矩什么时候我可以破坏规矩呢,
Translation: People often ask: when should I follow the rules, and when may I break them?

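Text: 哎要回答这个问题就得知道,我们的那些规矩到底是干什么用的。
Translation: Well, to answer this question, you have to know what those rules of ours are actually for.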

Text: 规矩的作用啊本质上是降低人和人之间连接的成本的。
Translation: The function of rules is, in essence, to lower the cost of connection between people.

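Text: 啊比如在家里,两口子之间立的那些规矩,他就不容易长期保持,所谓清官难断家务事也就是这个原因。
Translation: For example, at home, the rules set between husband and wife are hard to keep up over the long term; this is exactly why, as the saying goes, even an upright official finds it hard to settle family disputes.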