Modeling Irregular Voice in End-to-End Speech Synthesis via Speaker Adaptation

Authors: Ali Raheem Mandeel, Mohammed Salah Al-Radhi, Tamás Gábor Csapó

Email: {alimandeel, malradhi, csapot}@tmit.bme.hu

Abstract: Recent end-to-end text-to-speech (TTS) synthesizers may fail to produce speech similar to the target speaker when the adaptation data is limited and/or chosen randomly. For example, irregular voice (glottalization or creaky voice) may occur frequently, depending on the speaker and the context. In this paper, we model irregular voice in speech synthesis via speaker adaptation. We adapted FastSpeech 2 to four target speakers, selecting the adaptation data based on the occurrence of irregular phonation: 1) sentences with frequent irregular voice, 2) randomly chosen sentences, and 3) sentences with little irregular voice. In an objective evaluation, the proposed data selection strategy (1) produced speech more similar to the original speaker in terms of creakiness. A subjective test, however, revealed that the samples synthesized with frequent creaky voice were rated less similar to the target speaker than speech synthesized from randomly chosen adaptation sentences. Irregular voice models might contribute to building natural, emotional, and personalized speech synthesis.
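
The three data selection strategies described in the abstract can be illustrated with a minimal sketch. It assumes that each candidate sentence already has a creakiness score (here, a hypothetical fraction of frames flagged as irregular by some detector); the function name `select_adaptation_sets`, the score format, and the example values are illustrative assumptions and are not taken from the paper.

```python
import random


def select_adaptation_sets(creak_ratio, n, seed=0):
    """Split candidate sentences into three adaptation sets of size n.

    creak_ratio: dict mapping sentence id -> assumed precomputed fraction
                 of irregular (creaky) frames in that sentence.
    Returns a dict with the keys "irregular", "random", and "regular".
    """
    if not 1 <= n <= len(creak_ratio):
        raise ValueError("n must be between 1 and the number of sentences")

    # Sort by score, breaking ties by sentence id so the result is deterministic.
    ranked = sorted(creak_ratio, key=lambda s: (creak_ratio[s], s))
    rng = random.Random(seed)

    return {
        # Strategy 1: sentences with the most irregular voice, creakiest first.
        "irregular": ranked[-n:][::-1],
        # Strategy 2: randomly chosen sentences.
        "random": rng.sample(sorted(creak_ratio), n),
        # Strategy 3: sentences with the least irregular voice.
        "regular": ranked[:n],
    }


# Hypothetical scores for six sentences of one speaker.
scores = {"s01": 0.02, "s02": 0.31, "s03": 0.10,
          "s04": 0.00, "s05": 0.25, "s06": 0.07}
sets = select_adaptation_sets(scores, n=2)
print(sets["irregular"], sets["regular"])
```

In this sketch each set would then serve as the adaptation data for one adapted model per speaker; how the creakiness scores are actually obtained is left open here.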

Datasets

  • Hi-Fi Multi-Speaker English TTS Dataset
  • 4 target speakers: 2 Male + 2 Female
  • Columns: Sentence | Original (reference) | Lower anchor | Synthesized (regular) | Synthesized (random) | Synthesized (irregular)
    F1 / 01
    F1 / 02
    F1 / 03
    F1 / 04
    F1 / 05
    F2 / 01
    F2 / 02
    F2 / 03
    F2 / 04
    F2 / 05
    M1 / 01
    M1 / 02
    M1 / 03
    M1 / 04
    M1 / 05
    M2 / 01
    M2 / 02
    M2 / 03
    M2 / 04
    M2 / 05