Voice Mapping of TTS Systems: The Role of Phonetic Metrics in AI Voice Evaluation
Text-to-speech (TTS) systems are commonly evaluated with subjective listening tests, while objective speech quality analysis remains an active research topic. We apply phonetic voice metrics such as the smoothed cepstral peak prominence (CPPs), together with the emerging technique of voice mapping, so far used mainly in pathological and pedagogical voice analysis, to TTS systems, in order to uncover subtle differences between TTS models and vocoders.
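To make the CPP measure concrete, the following is a minimal single-frame sketch in NumPy: the cepstral peak in the expected pitch range is measured relative to a regression line fitted over the cepstrum. The frame length, pitch search range, and regression choices here are illustrative assumptions, not the exact settings used in the study (the smoothed variant, CPPs, additionally averages cepstra over time and quefrency before peak picking).

```python
import numpy as np

def cepstral_peak_prominence(frame, fs, f0_min=60.0, f0_max=330.0):
    """Cepstral peak prominence (dB) of one audio frame.

    Simplified sketch: height of the cepstral peak in the expected
    pitch range above a line fitted to the cepstrum by least squares.
    """
    windowed = frame * np.hanning(len(frame))
    # Log power spectrum; its inverse FFT gives the real cepstrum.
    log_spec = np.log(np.abs(np.fft.fft(windowed)) ** 2 + 1e-12)
    ceps_db = 20 * np.log10(np.abs(np.fft.ifft(log_spec)) + 1e-12)
    # Quefrency index k corresponds to a pitch of fs / k Hz.
    lo, hi = int(fs / f0_max), int(fs / f0_min)
    quef = np.arange(lo, hi)
    peak_idx = lo + int(np.argmax(ceps_db[lo:hi]))
    # Regression baseline over the search range, evaluated at the peak.
    slope, intercept = np.polyfit(quef, ceps_db[lo:hi], 1)
    return ceps_db[peak_idx] - (slope * peak_idx + intercept)
```

A strongly periodic frame (clear voicing) yields a tall, narrow cepstral peak and hence a high CPP, while breathy or noisy phonation flattens the cepstrum and lowers it.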
Using the LJSpeech database of a female speaker as the source material, we generated the same text with multiple TTS models, including Tacotron and FastSpeech. Voice metrics were extracted and voice maps computed with FonaDyn. For each TTS model, we plot a voice map together with a difference map comparing the model to the original LJSpeech recordings.
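The voice-map and difference-map construction can be sketched as follows: per-cycle measurements are binned over an fo x SPL grid, a chosen metric is averaged within each cell, and two maps are subtracted cell-wise. This is a generic illustration of the idea, not FonaDyn's implementation; the input arrays and bin edges are assumptions for the example.

```python
import numpy as np

def voice_map(f0_hz, spl_db, metric, f0_edges, spl_edges):
    """Average a per-cycle metric over an fo x SPL grid (a 'voice map').

    f0_hz, spl_db, metric: equal-length arrays of per-cycle values.
    Returns a 2-D array; cells with no data are NaN.
    """
    counts, _, _ = np.histogram2d(f0_hz, spl_db, bins=(f0_edges, spl_edges))
    sums, _, _ = np.histogram2d(f0_hz, spl_db, bins=(f0_edges, spl_edges),
                                weights=metric)
    # Divide sums by counts where data exists; NaN elsewhere.
    return np.where(counts > 0, sums / np.maximum(counts, 1), np.nan)

def difference_map(map_a, map_b):
    """Cell-wise difference; NaN wherever either map has no data."""
    return map_a - map_b
```

With this layout, the occupied region of the grid directly visualises the voice range (how much of the fo x SPL plane a voice covers), and the difference map shows where and by how much a TTS model's metric values deviate from the natural speech.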
The results show that voice mapping can reveal how TTS models modulate the same text relative to the natural speech. The VITS and FastPitch models exhibited a larger voice range than the original speaker, corresponding to improved expressiveness by design. In contrast, earlier models such as Merlin performed worse on the voice quality metrics, while modern vocoders had minimal impact on voice quality as measured by the available metrics.
These findings suggest that voice mapping offers a new perspective on the evaluation of TTS systems, with voice range as the most telling indicator, followed by the voice quality metrics. The method is fully objective and signal-based, and thus complements human scoring; further investigation is needed to validate its use during TTS model training.