Abstract: This paper presents S4LPR, a Speech Synthesis model conditioned on Self-Supervisedly Learnt Prosodic Representations. Instead of using raw acoustic features, such as F0 and energy, as intermediate prosodic variables, three self-supervised speech models are designed for comparison and are pre-trained on large-scale unlabeled data to extract frame-level prosodic representations. In addition to vanilla wav2vec 2.0, the other two pre-trained models learn representations from LPC residuals or adopt a multi-task learning strategy to focus on the prosodic information in speech. Based on FastSpeech2 and PnGBERT, our acoustic model is built with the learned prosodic representations as intermediate variables. Experimental results demonstrate that the naturalness of speech synthesized using S4LPR is significantly better than the FastSpeech2 baseline.