Evaluating State-of-the-art Deep Learning MRI Vocal Tract Airway Segmentation Techniques


Objective
The human vocal tract consists of various structures essential for producing a diverse range of speech sounds. To study it, several imaging modalities are used, including X-ray, CT, ultrasound, and MRI. Among these, MRI has emerged as a powerful modality due to its non-invasive nature and minimal health risk.

Quantitative assessment of vocal tract posture from MRI has proven to be instrumental in addressing various linguistic and voice science questions, such as speaker-to-speaker variability, the phonetics of language, and singers' vowel modulation strategies. The vocal tract airspace in 3D MR images appears dark in contrast to the gray soft tissues. To assess vocal tract posture, these imaged tissues and the enclosed airspace are typically segmented via manual annotation. However, this method is time- and labor-intensive and error-prone: accurate segmentations can take up to 90 minutes of manual editing, and studies of manually segmented vocal tracts typically have small sample sizes and weak statistical power.

Recent advancements in deep learning have led to the development of automatic labeling and segmentation methods, which are faster and produce segmentations that compare favorably to manual annotations. This study aims to identify the most effective automated algorithms for vocal tract segmentation that operate with optimal data efficiency.

Methods
In this study, we compared the following state-of-the-art deep learning architectures:
a) 3D UNet
b) 2D UNet
c) 3D Transfer learning UNet
d) 3D Transformer UNet
Among these, the transfer learning-based network leverages data from other modalities, such as CT images. For training, we used the open-source French speaker dataset, in which 50 volumes from various subjects were segmented manually by experts; these were used to train each algorithm independently. To quantitatively assess network performance, the Dice coefficient, Hausdorff distance (HD), and structural similarity index (SSIM) metrics were used.
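The Dice and Hausdorff metrics named above can be illustrated for binary segmentation masks as follows. This is a minimal NumPy/SciPy sketch for intuition, not the study's evaluation code; the function names and the use of all foreground voxels (rather than surface voxels) for the Hausdorff distance are simplifying assumptions.

```python
import numpy as np
from scipy.spatial.distance import cdist

def dice_coefficient(pred, ref):
    """Dice overlap between two binary masks (1 = airway voxel)."""
    pred, ref = pred.astype(bool), ref.astype(bool)
    intersection = np.logical_and(pred, ref).sum()
    total = pred.sum() + ref.sum()
    return 2.0 * intersection / total if total > 0 else 1.0

def hausdorff_distance(pred, ref):
    """Symmetric Hausdorff distance, computed here over all foreground
    voxel coordinates (production code would use surface voxels)."""
    p = np.argwhere(pred)          # coordinates of predicted voxels
    r = np.argwhere(ref)           # coordinates of reference voxels
    d = cdist(p, r)                # pairwise Euclidean distances
    return max(d.min(axis=1).max(), d.min(axis=0).max())

# Toy 2D example: prediction overlaps reference in 2 of 3 voxels.
pred = np.array([[1, 1, 0], [0, 1, 0], [0, 0, 0]])
ref  = np.array([[1, 1, 0], [0, 0, 0], [0, 0, 0]])
print(dice_coefficient(pred, ref))   # 2*2 / (3+2) = 0.8
print(hausdorff_distance(pred, ref)) # 1.0
```

For SSIM on grayscale volumes, an off-the-shelf implementation such as `skimage.metrics.structural_similarity` is typically used rather than a hand-rolled one.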

Results
We observed variations in the Dice coefficient across the individual networks that correlated with subjects' posture. Cases with narrower airways tended to yield lower Dice values, indicating the networks' sensitivity to anatomical variations. The 3D UNet and transfer learning-based 3D UNet segmentations were judged to most closely resemble the reference segmentations in the glottal and supraglottic airspace. Among all the networks compared, the transfer learning-based 3D UNet performed best across all the evaluation metrics used.

Conclusions
This study examines the potential of deep learning algorithms, particularly transfer learning-based models, for automating vocal tract segmentation from 3D MR images. Comparing state-of-the-art architectures, we observe that the transfer learning-based 3D UNet achieved the highest performance across the evaluation metrics, especially in the challenging glottal and supraglottic regions. These preliminary findings suggest that automated, data-efficient segmentation may help expand the scale and scope of vocal tract imaging studies.

Subin Erattakulangara, Sarah Gerard, David Meyer, Karthika Kelat, Katie Burnham, Rachel Balbi, Sajan Lingala