A semi-automated deep learning framework for 3D HSV reconstruction


Objective:
High-speed video (HSV) endoscopy with structured light projection from a laser grid captures three-dimensional vocal fold movements, including transverse, longitudinal, and vertical motions, during rapid vibration at frame rates of 4,000 to 20,000 Hz. Reconstruction proceeds through four steps: (a) detection, (b) tracking, and (c) correspondence matching of laser points, followed by (d) inference of 3D structure.
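To make the sequential formulation concrete, the sketch below implements steps (a) and (b) in their simplest form: a brightness-threshold detector and a nearest-neighbour tracker. Both algorithms and all names are illustrative assumptions for exposition, not the methods used in the studies discussed here.

import numpy as np
from scipy import ndimage

def detect_laser_points(frame: np.ndarray, thresh: float) -> np.ndarray:
    """Step (a): threshold bright laser spots and return their (row, col) centroids."""
    mask = frame > thresh
    labels, n = ndimage.label(mask)  # connected bright blobs
    if n == 0:
        return np.empty((0, 2))
    return np.asarray(ndimage.center_of_mass(mask, labels, range(1, n + 1)))

def track_nearest(prev_pts: np.ndarray, cur_pts: np.ndarray) -> np.ndarray:
    """Step (b): greedily link each previous point to its nearest current detection."""
    dists = np.linalg.norm(prev_pts[:, None] - cur_pts[None, :], axis=-1)
    return dists.argmin(axis=1)  # index of the matched current point, per previous point

Correspondence matching (c) then assigns each track an index on the projected laser grid, and 3D inference (d) triangulates depth from those correspondences using the calibrated camera-laser geometry.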
However, generating these 3D reconstructions from in-vivo recordings requires extensive manual annotation, e.g., 8 hours for 500 frames. The challenge is further compounded for the smaller larynges of women and children. Prior studies attempted to automate the annotation process by solving steps (a-d) sequentially. Such sequential methods are error-prone, however: failures in early steps cascade into later ones.
Recent work in machine learning has shown the benefit of multitask learning, in which a single model is trained to solve multiple problems simultaneously. We observe that steps (a-d) are similarly interconnected and mutually informative. Our study therefore presents a semi-automated pipeline with a deep learning model that solves all four tasks jointly. Using a few HSV frames with initial annotations, when available, the model outputs a segmented glottis, a laser point segmentation mask, a set of correspondence-matched laser point coordinates, and laser point tracks for all frames in the video. If initial annotations are unavailable, the model labels the entire HSV video automatically, albeit less accurately. Overall, we ask the following research questions: 1. How much human intervention is necessary for a semi-automated deep learning approach to accomplish tasks (a-d)? 2. How can the holistic 3D structural information derived from integrating steps (a-d) enhance the model's performance?
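The multitask idea can be sketched as a single shared encoder feeding one output head per task. The architecture below is a minimal runnable illustration under assumed channel counts and output parameterizations; it is not the model evaluated in this study.

import torch
import torch.nn as nn

class MultitaskHSVNet(nn.Module):
    """Illustrative multitask model: one shared encoder, one head per task (a-d)."""

    def __init__(self, ch: int = 32, grid: int = 18):
        super().__init__()
        self.encoder = nn.Sequential(  # shared features reused by all four heads
            nn.Conv2d(1, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
        )
        self.glottis_head = nn.Conv2d(ch, 1, 1)          # glottis segmentation logits
        self.laser_head = nn.Conv2d(ch, 1, 1)            # laser point segmentation logits
        self.match_head = nn.Conv2d(ch, grid * grid, 1)  # per-pixel laser-grid-index logits
        self.track_head = nn.Conv2d(ch, 2, 1)            # per-pixel frame-to-frame (dy, dx) offsets

    def forward(self, x: torch.Tensor) -> dict:
        f = self.encoder(x)
        return {"glottis": self.glottis_head(f), "laser": self.laser_head(f),
                "match": self.match_head(f), "track": self.track_head(f)}

# A batch of 4 grayscale frames (256 x 256 pixels) yields all four outputs at once.
outputs = MultitaskHSVNet()(torch.randn(4, 1, 256, 256))
print({k: tuple(v.shape) for k, v in outputs.items()})

Sharing the encoder is what lets the tasks inform one another: gradients from, e.g., the correspondence head shape the same features the two segmentation heads rely on.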

Methods:
HSV recordings (4,000 fps) with a custom 18 x 18 laser grid system (Bayerisches Laserzentrum GmbH, Erlangen, Germany) were obtained from 23 vocally healthy adults, 12 males (24.8 ± 4.11 years) and 11 females (23.83 ± 2.76 years), and from two children aged 7-9 years (one boy, one girl). A total of 500 consecutive frames were manually annotated per subject. We split the subjects into train, validation, and test sets, with each subject contributing its 500 frames to a single set.
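Splitting at the subject level keeps frames from the same recording from leaking between training and evaluation. A minimal sketch follows; the 15/5/5 subject counts and ID labels are arbitrary assumptions, since the exact proportions are not stated here.

import random

subjects = [f"S{i:02d}" for i in range(1, 26)]  # 25 subject IDs (hypothetical labels)
random.seed(0)                                  # fixed seed for a reproducible split
random.shuffle(subjects)
splits = {"train": subjects[:15], "validation": subjects[15:20], "test": subjects[20:]}
# Each subject contributes its 500 annotated consecutive frames to exactly one split.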
We train and compare two models: a convolutional neural network-based model and a transformer-based model. Each model processes batches of HSV video frames to segment the laser points and the glottis, track the laser points across frames, and establish correspondences between projected laser points and the laser grid. We evaluate the detection, tracking, and matching accuracy of each trained model using intersection over union (IoU), mean squared error, and the identification F1 (IDF1) score.
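For reference, the two frame-level metrics can be computed as below. This sketch assumes predicted and annotated points are already paired, and omits IDF1, which additionally requires identity-preserving assignment of tracks across frames.

import numpy as np

def iou(pred: np.ndarray, target: np.ndarray) -> float:
    """Intersection over union of two boolean segmentation masks."""
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    return float(inter / union) if union else 1.0

def mse(pred_pts: np.ndarray, gt_pts: np.ndarray) -> float:
    """Mean squared error between paired predicted and annotated coordinates, shape (N, 2)."""
    return float(np.mean((pred_pts - gt_pts) ** 2))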

Results and Conclusions:
The presentation will report the accuracy of the holistic pipeline for segmentation, tracking, and correspondence matching using these quantitative metrics and will compare the two deep learning architectures.

Hiroki Sato, Weslie Khoo, David Crandall, Rita Patel