Abstract: This paper investigates leveraging large-scale speech data to enhance prosodic modeling in speech synthesis, and introduces a model named SP2MC that achieves self-supervised prosody learning at the phoneme level with momentum contrast. The model incorporates dual convolutional encoders for speech and linear predictive coding (LPC) residual inputs to generate phoneme-level embeddings, which are masked and processed by a Transformer to produce prosody representations. Two supervision modules generate phoneme-level supervision from the speech waveforms and residuals, and momentum contrast is used to manage negative-sample selection in contrastive learning. Finally, the SP2MC representations are integrated into a FastSpeech 2-based acoustic model for speech synthesis. Experimental results indicate that the naturalness of speech synthesized by the proposed method is significantly better than that of the baselines.
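To make the training scheme concrete, below is a minimal PyTorch sketch of MoCo-style phoneme-level contrastive prosody learning under the stated architecture: dual convolutional encoders over speech and LPC-residual features, masking followed by a Transformer on the query side, a momentum-updated key branch, a queue of negative embeddings, and an InfoNCE loss. This is an illustration, not the authors' implementation; all module names, dimensions, and hyperparameters are assumptions, inputs are assumed to be already segmented to the phoneme level, and the unmasked momentum branch merely stands in for the paper's dedicated supervision modules.

```python
# Hedged sketch of phoneme-level momentum contrast (illustrative, not the SP2MC code).
# Assumptions: 80-dim phoneme-level input features for both speech and LPC residual,
# MoCo-style query/key encoders, a negative queue, and an InfoNCE objective.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConvEncoder(nn.Module):
    """1-D convolutional encoder over phoneme-level feature sequences."""
    def __init__(self, in_dim=80, hid_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_dim, hid_dim, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(hid_dim, hid_dim, kernel_size=5, padding=2), nn.ReLU(),
        )

    def forward(self, x):                                   # x: (B, P, in_dim)
        return self.net(x.transpose(1, 2)).transpose(1, 2)  # (B, P, hid_dim)


class ProsodyEncoder(nn.Module):
    """Dual conv encoders -> masking -> Transformer -> normalized prosody embeddings."""
    def __init__(self, dim=256, out_dim=128):
        super().__init__()
        self.speech_enc, self.resid_enc = ConvEncoder(80, dim), ConvEncoder(80, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=4)
        self.proj = nn.Linear(dim, out_dim)
        self.mask_emb = nn.Parameter(torch.zeros(dim))

    def forward(self, speech, resid, mask):                 # mask: (B, P) bool
        h = self.speech_enc(speech) + self.resid_enc(resid)
        h = torch.where(mask.unsqueeze(-1), self.mask_emb, h)   # replace masked phonemes
        return F.normalize(self.proj(self.transformer(h)), dim=-1)


class MomentumContrast(nn.Module):
    """Query encoder, momentum-updated key encoder, and a queue of negatives."""
    def __init__(self, dim=128, queue_size=4096, m=0.999, temperature=0.07):
        super().__init__()
        self.q_enc, self.k_enc = ProsodyEncoder(out_dim=dim), ProsodyEncoder(out_dim=dim)
        self.k_enc.load_state_dict(self.q_enc.state_dict())
        for p in self.k_enc.parameters():
            p.requires_grad = False
        self.m, self.t = m, temperature
        self.register_buffer("queue", F.normalize(torch.randn(queue_size, dim), dim=1))
        self.register_buffer("ptr", torch.zeros(1, dtype=torch.long))

    @torch.no_grad()
    def _momentum_update(self):
        for pq, pk in zip(self.q_enc.parameters(), self.k_enc.parameters()):
            pk.mul_(self.m).add_(pq, alpha=1.0 - self.m)

    @torch.no_grad()
    def _enqueue(self, keys):                               # keys: (N, dim)
        n, ptr = keys.size(0), int(self.ptr)
        idx = (torch.arange(n, device=keys.device) + ptr) % self.queue.size(0)
        self.queue[idx] = keys
        self.ptr[0] = (ptr + n) % self.queue.size(0)

    def forward(self, speech, resid, mask):
        q = self.q_enc(speech, resid, mask)                 # queries see masked phonemes
        with torch.no_grad():
            self._momentum_update()
            k = self.k_enc(speech, resid, torch.zeros_like(mask))  # keys are unmasked
        q_m, k_m = q[mask], k[mask]                         # masked positions only
        pos = (q_m * k_m).sum(-1, keepdim=True)             # (N, 1) positive logits
        neg = q_m @ self.queue.t()                          # (N, K) negative logits
        logits = torch.cat([pos, neg], dim=1) / self.t
        labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
        loss = F.cross_entropy(logits, labels)              # InfoNCE: positive at index 0
        self._enqueue(k_m.detach())
        return loss


# Usage example with random phoneme-level features (batch of 2, 12 phonemes).
model = MomentumContrast()
speech, resid = torch.randn(2, 12, 80), torch.randn(2, 12, 80)
mask = torch.zeros(2, 12, dtype=torch.bool)
mask[:, ::3] = True
print(model(speech, resid, mask))
```

The queue is what distinguishes momentum contrast from in-batch negative sampling: it decouples the number of negatives from the batch size while the slowly updated key encoder keeps the queued embeddings consistent, which is presumably why it is used here to manage negative-sample selection.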
Translation: Throughout the entire city of Yueyang, only one sound remained, coming from the towering, mountain-like darkness, resembling a roar from ancient gods and demons.
Translation: And, due to his extraordinary achievements in successively defeating three kingdoms and capturing their three leaders, as well as his upright character, he earned the admiration and trust of Emperor Taizong and Emperor Gaozong of the Tang Dynasty.
Translation: Brother Hei, after taking charge, held a seminar in the name of the Zhonghua District Federation of Industry and Commerce, gathering all the construction companies and developers.
Translation: When Zeng Sitao had these thoughts in his mind, he couldn't help but curse himself for being ridiculous, realizing that his level of wild imagination had recently taken a significant leap.
Translation: Facing such a large number of powerful attack-type soul masters known for their defensive abilities, Tang San did not use his group-attack skills.