Speaker Characteristics Affect Reliability of Spatial Segmentation of High-Speed Videoendoscopy


Objective: High-speed videoendoscopy (HSV) can capture images at sampling rates above 4,000 frames per second, which makes it well suited to studying non-cyclic and transient phenomena (e.g., voice break, onset, offset) and non-periodic phonation (e.g., highly dysphonic voices). Spatial segmentation, the process of detecting the edges of the vocal folds, is a prerequisite for many objective measures. However, the reliability of this essential component has not been well studied. This study quantified the effect of sex (male vs. female) and diagnosis (control vs. patient) on the uncertainty of generating ground truth for spatial segmentation.

Method: Spatial segmentation ground truth was constructed from manual segmentation of HSV frames by three experts in an iterative process. The framework was applied to 12 HSV videos in a 2 × 2 design (male vs. female and patient vs. control) with 3 samples per group. Segmentation uncertainty was computed using two metrics: maximum edge variability (EVmax) and intersection over union (IOU). EVmax was computed as the maximum lateral difference between the segmentations of the three experts, averaged over all scanning lines (anterior-posterior direction) of the vocal folds. IOU was computed as the area of the region common to the different experts' segmentations divided by the area of their union. The effects of sex and diagnosis on EVmax and IOU were studied using the Kruskal–Wallis test.
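The two uncertainty metrics can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the pixel-set representation of a segmentation, and the toy edge positions are assumptions made for illustration, and the sketch handles one vocal fold edge at a time because the abstract does not say how left and right edges are combined.

```python
from typing import List, Set, Tuple

Pixel = Tuple[int, int]  # (scan line index, lateral position)


def ev_max(edges: List[List[float]]) -> float:
    """Mean over scan lines of the largest lateral disagreement between raters.

    edges[r][k] is the lateral position (in pixels) that rater r marked for
    one vocal fold edge on scan line k. All raters must cover the same lines.
    """
    n_lines = len(edges[0])
    if any(len(e) != n_lines for e in edges):
        raise ValueError("every rater needs one edge position per scan line")
    # zip(*edges) groups the raters' positions scan line by scan line.
    spreads = [max(line) - min(line) for line in zip(*edges)]
    return sum(spreads) / n_lines


def iou(masks: List[Set[Pixel]]) -> float:
    """Area marked by every rater divided by the area marked by any rater."""
    common = set.intersection(*masks)
    union = set.union(*masks)
    # Returning 1.0 for all-empty masks is a convention chosen for this sketch.
    return len(common) / len(union) if union else 1.0


def mask_from_edges(left: List[int], right: List[int]) -> Set[Pixel]:
    """Pixels between the left and right edges (inclusive) on each scan line."""
    return {(k, x)
            for k, (lo, hi) in enumerate(zip(left, right))
            for x in range(lo, hi + 1)}


# Hypothetical toy data: three raters, four scan lines.
left_edges = [
    [10, 11, 12, 12],
    [10, 12, 12, 13],
    [11, 11, 13, 12],
]
right_edges = [
    [14, 16, 17, 15],
    [14, 16, 17, 15],
    [14, 16, 17, 15],
]

ev_left = ev_max(left_edges)    # raters differ by one pixel on every line
ev_right = ev_max(right_edges)  # raters agree exactly
masks = [mask_from_edges(l, r) for l, r in zip(left_edges, right_edges)]
overlap = iou(masks)
```

In this toy case the left edge has an EVmax of 1 pixel, the right edge has an EVmax of 0, and the three masks share 17 of the 21 pixels marked by at least one rater.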

Results: EVmax showed a significant effect of diagnosis: spatial segmentation of patients' recordings had significantly higher uncertainty than that of controls, with a medium effect size (η² = 0.09). In contrast, the effect of sex on EVmax was not significant (p = 0.53). The IOU metric showed the opposite trend (i.e., a significant effect of sex and a non-significant effect of diagnosis). The contradictory results of EVmax and IOU were analyzed further, and it was found that IOU, the most commonly used metric for evaluating segmentation outcomes, has a systematic bias and hence is not appropriate for evaluating laryngeal segmentation.
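For readers unfamiliar with the test, the following Python sketch shows how a Kruskal–Wallis statistic and an accompanying effect size could be computed for two groups. It is an illustration under stated assumptions, not the study's analysis: the function names and the toy values are hypothetical, the effect size uses one common estimator, eta squared = (H - k + 1) / (n - k), which the abstract does not confirm, and the p-value relies on the large-sample chi-square approximation, which is only rough for samples this small.

```python
import math
from typing import Dict, List, Sequence, Tuple


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis(groups: List[List[float]]) -> Tuple[float, float]:
    """Return (H, eta_squared); H includes the standard tie correction."""
    pooled = [v for g in groups for v in g]
    n, k = len(pooled), len(groups)
    ranks = average_ranks(pooled)
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12.0 / (n * (n + 1)) * total - 3 * (n + 1)
    # Tie correction: divide H by 1 - sum(t^3 - t) / (n^3 - n).
    counts: Dict[float, int] = {}
    for v in pooled:
        counts[v] = counts.get(v, 0) + 1
    correction = 1 - sum(t ** 3 - t for t in counts.values()) / (n ** 3 - n)
    if correction == 0:
        raise ValueError("all observations are identical")
    h /= correction
    eta_sq = (h - k + 1) / (n - k)  # assumed estimator, see lead-in
    return h, eta_sq


def p_value_two_groups(h: float) -> float:
    """Chi-square survival function with 1 degree of freedom (two groups)."""
    return math.erfc(math.sqrt(h / 2))


# Hypothetical EVmax values (pixels) for illustration only.
controls = [1.0, 1.2, 1.1, 0.9]
patients = [1.5, 1.7, 1.3, 1.6]

h_stat, eta_sq = kruskal_wallis([controls, patients])
p_val = p_value_two_groups(h_stat)
```

With these toy values the groups do not overlap, so H = 16/3 and the approximate p-value is about 0.02. With more than two groups, the chi-square survival function for k - 1 degrees of freedom would be needed instead of the one-degree-of-freedom shortcut used here.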

Conclusion: Manual spatial segmentation of patients' HSV recordings is more difficult than that of controls' recordings and is subject to higher uncertainty. However, it is robust to speaker sex. EVmax is more appropriate than IOU for evaluating spatial segmentation of laryngeal images. Finally, this study highlights the necessity of evaluating automated spatial segmentation methods on recordings from both patients and controls and of reporting the results separately.

Hamzeh Ghasemzadeh, David Ford, Maria Powell, Bernhard Jakubaß, Dimitar Deliyski