Transcription Accuracy of Two Speech-to-Text Programs in Dysphonic Voices
Objective: To preliminarily examine sentence-level accuracy of two widely available speech-to-text transcription programs in individuals with diagnosed dysphonia.
Methods: Acoustic recordings of individuals with dysphonia of various etiologies seen in an academic outpatient clinic were retrospectively analyzed. Recordings from patients presenting with additional diagnoses (e.g., stroke) impacting other speech subsystems (e.g., articulation) were excluded, as were recordings from patients seeking voice services who did not have dysphonia (e.g., gender-affirming voice services). The most common elicitation tasks were extracted from all samples: a sustained vowel (“ah”) and Consensus Auditory-Perceptual Evaluation of Voice (CAPE-V) sentences. This yielded 7 total elicitation tasks (6 sentences, 1 vowel), which were deidentified and uploaded into two speech-to-text programs from the same developer: Microsoft Word and Microsoft Azure Speech-to-Text. Each included speech sample provided 36 individual words and one vowel, which were measured for percentage accuracy (%ACC) when transcribed by the speech-to-text programs. Accuracy was judged solely against the speaker's intended message: hesitations, pauses, and pronunciation errors were not counted as inaccurate, nor were incorrect productions of the CAPE-V sentences. Semantic and grammatical boundaries (e.g., punctuation) were not considered. Total CAPE-V and vowel-detection %ACC, a chi-square test examining differences in vowel-detection frequencies between programs, and a repeated-measures ANOVA (RM-ANOVA) examining the effect of program on CAPE-V %ACC are reported.
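The abstract specifies word-level scoring against the intended message, ignoring punctuation and pronunciation, but not the exact scoring procedure. A minimal sketch of one plausible %ACC computation, assuming a simple order-preserving word alignment (the function name pct_acc and the flawed transcript below are illustrative, not the authors' actual pipeline; the target is a real CAPE-V sentence):

```python
import re
from difflib import SequenceMatcher

def pct_acc(intended: str, transcribed: str) -> float:
    """Word-level percentage accuracy (%ACC) of a transcript against the
    intended message, ignoring punctuation and letter case."""
    norm = lambda s: re.sub(r"[^\w\s']", "", s).lower().split()
    ref, hyp = norm(intended), norm(transcribed)
    # Count reference words recovered in order via sequence alignment.
    matched = sum(block.size
                  for block in SequenceMatcher(None, ref, hyp).get_matching_blocks())
    return 100.0 * matched / len(ref)

# Hypothetical example: CAPE-V target vs. an imperfect transcript.
print(pct_acc("The blue spot is on the key again",
              "The blue spot is on the T again"))  # -> 87.5
```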
Results: Males represented 42% of the sample, and the average age across included recordings was 56 years. Word displayed near-perfect total CAPE-V %ACC across all dysphonic voices, with an average %ACC of 98% and a minimum of 64%. Azure displayed an average %ACC of 92% and a minimum of 28%. Neither program transcribed the prolonged vowel accurately: Azure detected it 37% of the time on average and Word 19%. Despite this discrepancy, the chi-square test showed no significant difference in vowel detection between programs (p = 0.77). The RM-ANOVA showed a significant effect of program on CAPE-V %ACC (F = 15.91, p < 0.001), with significantly greater %ACC for Word.
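For readers wishing to run the same family of tests, a minimal sketch in Python using scipy and statsmodels, assuming long-format per-speaker scores; every value, count, and column name below is illustrative, not the study's data:

```python
import pandas as pd
from scipy.stats import chi2_contingency
from statsmodels.stats.anova import AnovaRM

# Hypothetical long-format data: one CAPE-V %ACC score per speaker per program.
df = pd.DataFrame({
    "speaker": [1, 1, 2, 2, 3, 3],
    "program": ["Word", "Azure"] * 3,
    "pct_acc": [100.0, 94.4, 97.2, 88.9, 94.4, 91.7],
})

# RM-ANOVA: effect of program on CAPE-V %ACC, with speaker as the repeated subject.
res = AnovaRM(df, depvar="pct_acc", subject="speaker", within=["program"]).fit()
print(res)

# Chi-square on vowel-detection frequencies: rows = program,
# columns = (vowel detected, not detected); counts are made up.
chi2, p, dof, _ = chi2_contingency([[7, 12], [4, 17]])
print(f"chi2 = {chi2:.2f}, p = {p:.2f}")
```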
Conclusions: This is one of the first projects to directly compare speech-to-text transcriptions of acoustic recordings of exclusively dysphonic voices. Substantially more research is required to assess the effectiveness of other available speech-to-text programs and models. Additionally, future work from our lab will examine the effects of auditory-perceptual and acoustic measures on transcription accuracy.