Optimizing Speech Stimuli for Enhanced Machine Learning Performance in Voice Quality Assessment


Purpose: Selecting appropriate speech tasks is essential for voice assessment. While the Consensus Auditory-Perceptual Evaluation of Voice (CAPE-V) sentences are optimized for perceptual voice evaluation, their effectiveness in automatic voice assessment is uncertain. This study explores how the acoustic features of individual CAPE-V sentences affect machine learning performance in assessing voice quality.

Methods: This retrospective study analyzed the public Perceptual Voice Qualities Database (PVQD), which contains connected speech tasks (the six CAPE-V sentences) recorded by 276 speakers with and without voice disorders. Temporal, spectral, and cepstral acoustic metrics were computed from each recording using MATLAB-Praat software. These metrics were derived from individual CAPE-V sentences as well as from sentence concatenations of varying length. A machine learning model was trained on each speech-task variant using its acoustic features, and the models were evaluated with performance metrics to examine how each variant influenced the detection of dysphonic voice.
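A minimal sketch of this kind of feature-extraction and variant-comparison pipeline is shown below. It uses Python's parselmouth library (a Praat interface) and scikit-learn in place of the authors' MATLAB-Praat toolchain; the specific features (jitter, shimmer, HNR), the random forest classifier, and the extract_features/compare_variants helpers are illustrative assumptions, not the study's actual configuration.

import numpy as np
import parselmouth
from parselmouth.praat import call
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def extract_features(wav_path):
    """Compute a few temporal and spectral acoustic metrics from one recording.

    Illustrative feature set only; the study used a broader battery of
    temporal, spectral, and cepstral metrics.
    """
    snd = parselmouth.Sound(wav_path)
    # Periodicity analysis for jitter/shimmer (temporal perturbation metrics).
    pp = call(snd, "To PointProcess (periodic, cc)", 75, 500)
    jitter = call(pp, "Get jitter (local)", 0, 0, 0.0001, 0.02, 1.3)
    shimmer = call([snd, pp], "Get shimmer (local)",
                   0, 0, 0.0001, 0.02, 1.3, 1.6)
    # Mean harmonics-to-noise ratio as a spectral noise metric.
    harmonicity = call(snd, "To Harmonicity (cc)", 0.01, 75, 0.1, 1.0)
    hnr = call(harmonicity, "Get mean", 0, 0)
    return np.array([jitter, shimmer, hnr])

def compare_variants(variants):
    """Train one classifier per speech-task variant; report mean CV accuracy.

    `variants` maps a variant name (e.g., a single CAPE-V sentence or a
    concatenation) to (list of wav paths, binary dysphonia labels).
    """
    scores = {}
    for name, (paths, labels) in variants.items():
        X = np.vstack([extract_features(p) for p in paths])
        clf = RandomForestClassifier(n_estimators=200, random_state=0)
        scores[name] = cross_val_score(clf, X, labels, cv=5).mean()
    return scores  # higher mean accuracy -> better-suited stimulus variant

Comparing the returned accuracies across variants mirrors, at a sketch level, the study's variant-by-variant evaluation of classification performance.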

Results/Conclusions: Acoustic measurements for each stimulus variant were used to fit models that distinguish pathological from nonpathological recordings. The performance of the trained models was compared, revealing differences in classification accuracy across speech-task variants. This analysis identified the CAPE-V sentence, concatenation, and sentence length best suited for training machine learning models, offering a step toward optimizing speech tasks for automated voice assessment.

Ahmed Yousef, Mark Berardi, Eric Hunter