Automatic Forced Aligner Accuracy in Typical and AdLD Speakers


Objective: Automatic forced aligners rapidly segment and annotate acoustic speech data, which saves time on an otherwise labor-intensive manual task and makes data processing more efficient. These alignment tools are typically trained on large datasets of speech from typical voices to maximize speech recognition accuracy. Adductor laryngeal dystonia (AdLD) may present a challenge for forced alignment because of the unpredictable and abrupt glottal closures that characterize the disorder, yet the accuracy of automatic forced aligners on AdLD speech has not been assessed. In this study, we compare the accuracy of an open-source forced aligner on typical versus AdLD speech. We hypothesize that the forced aligner will generate more errors in AdLD speech than in typical speech.

Methods: Speech recordings of the first paragraph of the Rainbow Passage were collected from 100 monolingual American English speakers: 50 with AdLD and 50 age- and sex-matched controls. Recordings were processed using the Montreal Forced Aligner to generate phoneme-level alignments, which were manually reviewed and corrected in Praat. Three accuracy metrics were computed between the automated and manually corrected phoneme alignments: mutation rate (i.e., phoneme insertions, substitutions or deletions; %), boundary timing error (ms), and boundary identification error rate (%). A permutational multivariate analysis of variance (PERMANOVA) was conducted to examine group differences across the three metrics.
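The three accuracy metrics can be illustrated with a minimal sketch. The abstract does not give their exact definitions, so everything here is an assumption made for illustration: phoneme intervals are represented as `(label, start_s, end_s)` tuples, label sequences are aligned with Python's `difflib.SequenceMatcher`, timing error is measured on the onsets of phonemes whose labels match, and a boundary counts as misidentified when its timing error exceeds a hypothetical tolerance `tol_ms`. The function name `alignment_metrics` is also hypothetical.

```python
from difflib import SequenceMatcher


def alignment_metrics(auto, manual, tol_ms=20.0):
    """Compare automated vs. manually corrected phoneme intervals.

    auto, manual: lists of (label, start_s, end_s) tuples (assumed format).
    tol_ms: hypothetical tolerance for counting a boundary as misplaced.
    Returns (mutation_rate_pct, mean_timing_error_ms, boundary_error_rate_pct).
    """
    if not manual:
        raise ValueError("manual alignment must contain at least one phoneme")

    manual_labels = [p[0] for p in manual]
    auto_labels = [p[0] for p in auto]
    matcher = SequenceMatcher(None, manual_labels, auto_labels, autojunk=False)

    mutations = 0
    onset_errors_ms = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Labels agree: compare onset times of the paired phonemes.
            for i, j in zip(range(i1, i2), range(j1, j2)):
                onset_errors_ms.append(abs(auto[j][1] - manual[i][1]) * 1000.0)
        else:
            # Insertions, deletions, and substitutions all count as mutations.
            mutations += max(i2 - i1, j2 - j1)

    mutation_rate = 100.0 * mutations / len(manual)
    if onset_errors_ms:
        mean_timing = sum(onset_errors_ms) / len(onset_errors_ms)
        misplaced = sum(1 for e in onset_errors_ms if e > tol_ms)
        boundary_error_rate = 100.0 * misplaced / len(onset_errors_ms)
    else:
        mean_timing = 0.0
        boundary_error_rate = 0.0
    return mutation_rate, mean_timing, boundary_error_rate
```

Computing timing error only on label-matched phonemes keeps the mutation and timing metrics from double-counting the same mistake; other definitions (for example, scoring offsets as well as onsets) are equally plausible.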

Results: The PERMANOVA revealed a significant difference between the typical and AdLD groups (pseudo-F(1, 98) = 9.293, p < .001). The AdLD group showed significantly higher error than the control group on all three measures: mutation rate (1.2±3.7% vs. 0.3±0.7%), boundary timing error (30.8±100.6 ms vs. 1.6±4.3 ms), and boundary identification error rate (11.7±15.6% vs. 1.8±2.4%).
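For readers unfamiliar with the test, the following is a minimal sketch of a one-way PERMANOVA using only the Python standard library. It is not the authors' analysis code. The abstract does not state the distance measure or the number of permutations, so squared Euclidean distance and `n_perm=999` are assumptions, and the function names are hypothetical. Each row is one speaker's vector of the three metrics; with two groups and 100 speakers the pseudo-F has (1, 98) degrees of freedom, matching the reported statistic.

```python
import random


def squared_distances(rows):
    """Pairwise squared Euclidean distances between metric vectors."""
    return [[sum((a - b) ** 2 for a, b in zip(r, s)) for s in rows] for r in rows]


def pseudo_f(d2, groups):
    """Pseudo-F statistic computed from a squared-distance matrix."""
    n = len(groups)
    levels = sorted(set(groups))
    # Total sum of squares: sum of squared distances over all pairs, divided by N.
    ss_total = sum(d2[i][j] for i in range(n) for j in range(i + 1, n)) / n
    # Within-group sum of squares: same quantity computed inside each group.
    ss_within = 0.0
    for g in levels:
        idx = [i for i, lab in enumerate(groups) if lab == g]
        pair_sum = sum(d2[i][j] for k, i in enumerate(idx) for j in idx[k + 1:])
        ss_within += pair_sum / len(idx)
    ss_between = ss_total - ss_within
    a = len(levels)
    return (ss_between / (a - 1)) / (ss_within / (n - a))


def permanova(rows, groups, n_perm=999, seed=0):
    """Return (pseudo_F, p_value) from a label-permutation test."""
    d2 = squared_distances(rows)
    f_obs = pseudo_f(d2, groups)
    rng = random.Random(seed)
    shuffled = list(groups)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # reassign group labels at random
        if pseudo_f(d2, shuffled) >= f_obs:
            hits += 1
    # Count the observed labelling as one of the permutations.
    return f_obs, (hits + 1) / (n_perm + 1)
```

Because the three metrics are on different scales (percent and milliseconds), standardizing each column before computing distances would be a sensible preprocessing step under this Euclidean assumption.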

Conclusion: The Montreal Forced Aligner was less accurate at segmenting and annotating AdLD speech than typical speech. Researchers who use automatic forced aligners in data processing should consider manually correcting the automatic alignments and annotations before further analysis to ensure study validity. Future work may focus on training a model with data from AdLD speakers to improve automatic speech recognition and forced alignment for clinical and research use.

Maxine Van Doren, Brittany Fletcher, Mara Kapsner-Smith, Laura Toles, Tanya Eadie, Cara Sauder, Cara Stepp, Jenny Vojtech