Abstract: This paper investigates leveraging large-scale speech data to enhance prosodic modeling in speech synthesis, and introduces a model named SP2MC that achieves self-supervised prosody learning at the phoneme level with momentum contrast. The model incorporates dual convolutional encoders for speech and linear predictive coding (LPC) residual inputs to generate phoneme-level embeddings, which are masked and processed by a Transformer to produce prosody representations. Two supervision modules generate phoneme-level supervision from the speech waveforms and residuals, and momentum contrast is used to manage negative-sample selection in contrastive learning. Finally, the SP2MC representations are integrated into a FastSpeech 2-based acoustic model for speech synthesis. Experimental results indicate that the naturalness of speech synthesized by the proposed method is significantly better than that of the baselines.
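To make the training scheme concrete, below is a minimal PyTorch sketch of MoCo-style phoneme-level contrastive prosody learning under the stated architecture: dual convolutional encoders over speech and LPC-residual features, masking followed by a Transformer on the query side, a momentum-updated key branch, a queue of negative embeddings, and an InfoNCE loss. This is an illustration, not the authors' implementation; all module names, dimensions, and hyperparameters are assumptions, inputs are assumed to be already segmented to the phoneme level, and the unmasked momentum branch merely stands in for the paper's dedicated supervision modules.

```python
# Hedged sketch of phoneme-level momentum contrast (illustrative, not the SP2MC code).
# Assumptions: 80-dim phoneme-level input features for both speech and LPC residual,
# MoCo-style query/key encoders, a negative queue, and an InfoNCE objective.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConvEncoder(nn.Module):
    """1-D convolutional encoder over phoneme-level feature sequences."""
    def __init__(self, in_dim=80, hid_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_dim, hid_dim, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(hid_dim, hid_dim, kernel_size=5, padding=2), nn.ReLU(),
        )

    def forward(self, x):                                   # x: (B, P, in_dim)
        return self.net(x.transpose(1, 2)).transpose(1, 2)  # (B, P, hid_dim)


class ProsodyEncoder(nn.Module):
    """Dual conv encoders -> masking -> Transformer -> normalized prosody embeddings."""
    def __init__(self, dim=256, out_dim=128):
        super().__init__()
        self.speech_enc, self.resid_enc = ConvEncoder(80, dim), ConvEncoder(80, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=4)
        self.proj = nn.Linear(dim, out_dim)
        self.mask_emb = nn.Parameter(torch.zeros(dim))

    def forward(self, speech, resid, mask):                 # mask: (B, P) bool
        h = self.speech_enc(speech) + self.resid_enc(resid)
        h = torch.where(mask.unsqueeze(-1), self.mask_emb, h)   # replace masked phonemes
        return F.normalize(self.proj(self.transformer(h)), dim=-1)


class MomentumContrast(nn.Module):
    """Query encoder, momentum-updated key encoder, and a queue of negatives."""
    def __init__(self, dim=128, queue_size=4096, m=0.999, temperature=0.07):
        super().__init__()
        self.q_enc, self.k_enc = ProsodyEncoder(out_dim=dim), ProsodyEncoder(out_dim=dim)
        self.k_enc.load_state_dict(self.q_enc.state_dict())
        for p in self.k_enc.parameters():
            p.requires_grad = False
        self.m, self.t = m, temperature
        self.register_buffer("queue", F.normalize(torch.randn(queue_size, dim), dim=1))
        self.register_buffer("ptr", torch.zeros(1, dtype=torch.long))

    @torch.no_grad()
    def _momentum_update(self):
        for pq, pk in zip(self.q_enc.parameters(), self.k_enc.parameters()):
            pk.mul_(self.m).add_(pq, alpha=1.0 - self.m)

    @torch.no_grad()
    def _enqueue(self, keys):                               # keys: (N, dim)
        n, ptr = keys.size(0), int(self.ptr)
        idx = (torch.arange(n, device=keys.device) + ptr) % self.queue.size(0)
        self.queue[idx] = keys
        self.ptr[0] = (ptr + n) % self.queue.size(0)

    def forward(self, speech, resid, mask):
        q = self.q_enc(speech, resid, mask)                 # queries see masked phonemes
        with torch.no_grad():
            self._momentum_update()
            k = self.k_enc(speech, resid, torch.zeros_like(mask))  # keys are unmasked
        q_m, k_m = q[mask], k[mask]                         # masked positions only
        pos = (q_m * k_m).sum(-1, keepdim=True)             # (N, 1) positive logits
        neg = q_m @ self.queue.t()                          # (N, K) negative logits
        logits = torch.cat([pos, neg], dim=1) / self.t
        labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
        loss = F.cross_entropy(logits, labels)              # InfoNCE: positive at index 0
        self._enqueue(k_m.detach())
        return loss


# Usage example with random phoneme-level features (batch of 2, 12 phonemes).
model = MomentumContrast()
speech, resid = torch.randn(2, 12, 80), torch.randn(2, 12, 80)
mask = torch.zeros(2, 12, dtype=torch.bool)
mask[:, ::3] = True
print(model(speech, resid, mask))
```

The queue is what distinguishes momentum contrast from in-batch negative sampling: it decouples the number of negatives from the batch size while the slowly updated key encoder keeps the queued embeddings consistent, which is presumably why it is used here to manage negative-sample selection.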
Translation: Throughout the entire city of Yueyang, only one sound remained, coming from the towering, mountain-like darkness, resembling a roar from ancient gods and demons.
Translation: And, due to his extraordinary achievements in successively defeating three kingdoms and capturing their three leaders, as well as his upright character, he earned the admiration and trust of Emperor Taizong and Emperor Gaozong of the Tang Dynasty.
Translation: Brother Hei, after taking charge, held a seminar in the name of the Zhonghua District Federation of Industry and Commerce, gathering all the construction companies and developers.
Translation: When Zeng Sitao had these thoughts in his mind, he couldn't help but curse himself for being ridiculous, realizing that his level of wild imagination had recently taken a significant leap.
Translation: Facing such a large number of powerful attack-type soul masters known for their defensive abilities, Tang San did not use his group-attack skills.