Enhancing End-to-End Speech Synthesis by Modeling Interrogative Sentences with Speaker Adaptation

Authors: Ali Raheem Mandeel, Mohammed Salah Al-Radhi, Tamás Gábor Csapó

Email: {alimandeel, malradhi, csapot}@tmit.bme.hu

Abstract: Although end-to-end Text-to-Speech (TTS) synthesizers produce human-like speech, they still lack intuitive user control over prosody. Modeling interrogative sentence prosody is challenging due to the significant variation among question types. Synthesized intonation often lacks accuracy, richness, and detail when only a small amount of adaptation data from particular sentence types is available. This paper uses speaker adaptation to improve the modeling of interrogative sentence prosody in speech synthesis, evaluated on an English dataset. The adaptation data were selected based on the occurrence of interrogative sentences: the first dataset contained sentences with frequent interrogatives, while the second contained declarative sentences. Two target speakers (one male and one female) were adapted. Objective and subjective evaluations show that the proposed model achieves strong performance in intonation. A MUSHRA subjective listening test showed better intonation patterns with the interrogative dataset than with the declarative one. Potential applications of this model include assistive technology for the visually impaired and chatbots / voice bots.

Datasets

  • Hi-Fi Multi-Speaker English TTS Dataset
  • 2 target speakers: 1 Male + 1 Female
  • Audio sample table (interactive players are not reproducible in text). Columns: sentence, Speech (ground truth), Humming (monotonic), Humming (declarative dataset), Humming (interrogative dataset), Humming (ground truth). Rows: Male / 01–10 and Female / 01–10.