Speech Synthesis with Self-Supervisedly Learnt Prosodic Representations

Authors: Zhao-Ci Liu, Zhen-Hua Ling, Ya-Jun Hu, Jia Pan, Yun-Di Wu, Jin-Wei Wang

Abstract: This paper presents S4LPR, a Speech Synthesis model conditioned on Self-Supervisedly Learnt Prosodic Representations. Instead of using raw acoustic features such as F0 and energy as intermediate prosodic variables, S4LPR uses frame-level prosodic representations extracted by self-supervised speech models pre-trained on large-scale unlabeled data. Three such models are designed for comparison: vanilla wav2vec 2.0 and two variants that, to focus on the prosodic information in speech, either learn representations from LPC residuals or adopt a multi-task learning strategy. Our acoustic model is based on FastSpeech2 and PnGBERT and uses the learned prosodic representations as intermediate variables. Experimental results demonstrate that the naturalness of speech synthesized by S4LPR is significantly better than that of the FastSpeech2 baseline.
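
For readers unfamiliar with the LPC residuals mentioned in the abstract, the following is a minimal sketch of how such a residual can be computed in general: linear prediction coefficients are estimated from a frame (here with the autocorrelation method and the Levinson-Durbin recursion), and the frame is then inverse-filtered with them. This is not the authors' implementation. The function names `lpc_coefficients` and `lpc_residual`, the toy AR(2) signal, and the prediction order are assumptions made for illustration; the LPC order, framing, and windowing used in the paper are not stated on this page.

```python
import random


def lpc_coefficients(frame, order):
    """Autocorrelation-method LPC via Levinson-Durbin (illustrative helper).

    Returns a[0..order] with a[0] == 1: the coefficients of the inverse filter.
    """
    n = len(frame)
    # Autocorrelation at lags 0..order.
    r = [sum(frame[t] * frame[t + k] for t in range(n - k)) for k in range(order + 1)]
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:  # silent or degenerate frame: stop early
            break
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err  # reflection coefficient
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a


def lpc_residual(frame, order):
    """Inverse-filter a frame: e[t] = sum_j a[j] * x[t - j] (samples before t=0 taken as 0)."""
    a = lpc_coefficients(frame, order)
    return [
        sum(a[j] * frame[t - j] for j in range(min(order, t) + 1))
        for t in range(len(frame))
    ]


# Toy stand-in for speech: white-noise excitation shaped by a known AR(2) filter.
random.seed(0)
excitation = [random.gauss(0.0, 1.0) for _ in range(2000)]
signal = []
for t, e in enumerate(excitation):
    prev1 = signal[t - 1] if t >= 1 else 0.0
    prev2 = signal[t - 2] if t >= 2 else 0.0
    signal.append(e + 1.3 * prev1 - 0.6 * prev2)

residual = lpc_residual(signal, order=2)
```

In this toy setup the residual approximates the excitation, because inverse filtering removes most of the spectral envelope imposed by the AR filter. For real speech, the residual is usually computed frame by frame on windowed audio with a higher prediction order.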

Comments: Accepted at INTERSPEECH 2023.

Audio samples:

GT	FS2	PR_F0Energy	S4LPR_V	S4LPR_LPC	S4LPR_MT

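Text: 里面的走廊宽阔而阴森，头顶是绿罩灯，脚下的地毯很厚，厚到扔一个摔炮上去都不会发出声音。
English: The corridor inside was wide and gloomy, with green-shaded lamps overhead and a carpet underfoot so thick that even a snap firecracker thrown onto it would make no sound.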

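Text: 封口被撞裂开来，佛头从里面滚出来，顺着书堆咕噜下去，咣当一声砸在水泥地上。
English: The seal split open on impact, and the Buddha head rolled out, tumbled down the pile of books, and hit the concrete floor with a clang.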

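Text: 总而言之，这些女武神正在奋力扫描战场，随时做好战斗准备，不论扫描到什么敌友。
English: In short, these Valkyries are scanning the battlefield with all their might, ready for combat at any moment, whatever friend or foe the scan turns up.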

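Text: 故此，摩诃叶不惜施展苦肉计死守捱招，直到最关键时刻方才奇兵突出，要一击制胜。
English: Moheye therefore did not hesitate to use a ruse of self-sacrifice, holding firm and absorbing blows, and only at the most critical moment struck by surprise, aiming to win with a single blow.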

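Text: 曾思涛脑子里这么想的时候，不觉的骂自己很扯淡，胡思乱想的水平最近有飞跃的意思。
English: As this thought ran through Zeng Sitao's mind, he could not help scolding himself for being ridiculous; his knack for letting his imagination run wild seemed to have taken a leap lately.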

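Text: 这里是间办公室，当中一张厚实的办公桌，两侧两个大书架足足占了两面墙。
English: This was an office, with a solid desk in the middle and two large bookshelves on either side that took up two entire walls.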

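Text: 有的果农干脆在鲜桃堆插一根树枝，挂个塑料袋，让路人自己拿桃子，自己往袋子里扔钱。
English: Some fruit growers simply stick a branch into the pile of fresh peaches and hang a plastic bag on it, letting passers-by take the peaches themselves and drop the money into the bag themselves.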

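Text: 每说一个名字，左拉就会伸出一根手指，晶莹如玉的手指散发着柔和的光泽，伸到卢杰面前。
English: With each name spoken, Zola would extend one finger; the finger, lustrous as jade, gave off a soft glow as it reached out in front of Lu Jie.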