Speaker Adaptation Experiments with Limited Data for End-to-End Text-To-Speech Synthesis using Tacotron2

Authors: Ali Raheem Mandeel, Mohammed Salah Al-Radhi, Tamás Gábor Csapó

Email: {alimandeel, malradhi, csapot}@tmit.bme.hu

Abstract: Speech synthesis aims to generate human-like speech from text. With end-to-end systems, highly natural synthesized speech can be achieved when a sufficiently large dataset is available from the target speaker. Often, however, a system must be adapted to a target speaker for whom only a few training samples are available. Speaker adaptation with such limited data is difficult, and it can introduce issues such as an uneven distribution of linguistic tokens (i.e., some speech sounds are left out of the synthesized speech). To build lightweight systems, it is crucial to determine the minimum number of data samples and training epochs needed to reach reasonable quality. We conducted detailed speaker-adaptive text-to-speech (TTS) synthesis experiments with four target speakers to evaluate the end-to-end Tacotron2 model and the WaveGlow neural vocoder on an English dataset across several training set sizes and training lengths. According to our objective and subjective evaluations, the Tacotron2 model performs well in terms of speech quality and similarity for unseen target speakers with 100 sentences of data (text-audio pairs) and a relatively low training time.

Datasets

Hi-Fi Multi-Speaker English TTS Dataset
Four target speakers: two female (F1, F2) and two male (M1, M2)
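
The adaptation sets of 15, 35, 70, and 100 sentences listed in the table below can be illustrated with a short sketch. It is not the authors' data preparation code. It assumes that each speaker's data is a list of (text, audio path) pairs, that the smaller sets are nested inside the larger ones, and that character coverage is an acceptable stand-in for the "linguistic tokens" mentioned in the abstract. The function names `make_nested_subsets` and `missing_symbols` are hypothetical.

```python
import random

# Subset sizes are taken from the sample table; everything else is illustrative.
SUBSET_SIZES = (15, 35, 70, 100)


def make_nested_subsets(pairs, sizes=SUBSET_SIZES, seed=0):
    """Return {size: list of (text, audio_path)} for one target speaker.

    Assumption: subsets are nested, so every smaller subset is a prefix of
    the larger ones after a single seeded shuffle.
    """
    if max(sizes) > len(pairs):
        raise ValueError("not enough sentences for the largest subset")
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    return {n: shuffled[:n] for n in sorted(sizes)}


def missing_symbols(subset, inventory):
    """Return the symbols of `inventory` that never occur in the subset's text.

    Assumption: characters are used as a rough proxy for linguistic tokens.
    """
    seen = set()
    for text, _audio_path in subset:
        seen.update(text.lower())
    return sorted(set(inventory) - seen)
```

Running `missing_symbols` on the smallest subset gives a quick indication of whether some symbols are absent from the adaptation data, which is the kind of coverage gap the abstract warns about.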

Sentence	Original Voices	Synth. Voices / 15 samples / Iteration-300	Synth. Voices / 35 samples / Iteration-900	Synth. Voices / 70 samples / Iteration-900	Synth. Voices / 100 samples / Iteration-300	Synth. Voices / 100 samples / Iteration-700	Synth. Voices / 100 samples / Iteration-900
F1 / 01
F1 / 02
F1 / 03
F2 / 01
F2 / 02
F2 / 03
M1 / 01
M1 / 02
M1 / 03
M2 / 01
M2 / 02
M2 / 03