Initial Exploration of Some Latent Spaces of Voice Maps
Objective: Voice maps contain a large amount of data, representing multiple metrics. Assessing voice function by visual inspection of these data can be a daunting task. Machine learning (ML) could be useful for reducing the dimensionality of voice map data, for deriving an optimally informative set of metrics and, finally, for possibly inferring aspects of the underlying biomechanical origins on the basis of a priori knowledge.
Method/Design: Retrospective study of full-range voice maps of 13 adult males, 13 adult females and 20 children. A 3-layer multilayer perceptron variational autoencoder was trained to compress how seven acoustic or EGG metrics tended to vary across the voice range of fo × SPL into a latent space of dimensionality four or six. The maps contained crest factor, spectrum balance, CPP, EGG cycle-rate entropy, EGG contact quotient, normalized peak dEGG and EGG harmonic richness factor.
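To make the method more concrete, the following is a minimal, untrained sketch of a variational autoencoder of the general kind described, written in plain Python. Everything beyond "seven metrics in, four or six latent nodes" is an assumption: "3-layer" is read here as input layer, one hidden layer and latent layer (mirrored in the decoder); the hidden width of 16, the ReLU and sigmoid activations, the normalization of each metric to [0, 1] and the names `TinyVAE`, `dense`, `forward` and `loss` are all hypothetical. Each voice-map cell is assumed to supply one 7-element input vector. Only the forward pass and the usual VAE loss (reconstruction error plus KL divergence to a standard normal prior) are shown; actual training would need gradients, which a framework such as PyTorch or JAX would normally provide.

```python
import math
import random

# Hypothetical sizes: 7 metrics per voice-map cell (assumed normalized to
# [0, 1]), an assumed hidden width of 16, and a latent size of 4 (or 6).
N_METRICS, N_HIDDEN, N_LATENT = 7, 16, 4


def dense(n_in, n_out, rng):
    """Random weights (n_out rows of n_in) and zero biases for one layer."""
    scale = 1.0 / math.sqrt(n_in)
    w = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    return w, [0.0] * n_out


def forward(layer, x, act=None):
    """Affine map W x + b, optionally followed by an element-wise activation."""
    w, b = layer
    y = [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]
    return [act(v) for v in y] if act else y


def relu(v):
    return max(0.0, v)


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


class TinyVAE:
    """Hypothetical, untrained MLP VAE: 7 -> 16 -> latent, latent -> 16 -> 7."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.enc_hidden = dense(N_METRICS, N_HIDDEN, self.rng)
        self.enc_mu = dense(N_HIDDEN, N_LATENT, self.rng)
        self.enc_logvar = dense(N_HIDDEN, N_LATENT, self.rng)
        self.dec_hidden = dense(N_LATENT, N_HIDDEN, self.rng)
        self.dec_out = dense(N_HIDDEN, N_METRICS, self.rng)

    def encode(self, x):
        """Return mean and log-variance of the latent Gaussian for one cell."""
        h = forward(self.enc_hidden, x, relu)
        return forward(self.enc_mu, h), forward(self.enc_logvar, h)

    def decode(self, z):
        """Map a latent vector back to 7 metric values in (0, 1)."""
        h = forward(self.dec_hidden, z, relu)
        return forward(self.dec_out, h, sigmoid)

    def loss(self, x):
        """Squared reconstruction error plus KL divergence, for one cell."""
        mu, logvar = self.encode(x)
        # Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, 1)
        z = [m + math.exp(0.5 * lv) * self.rng.gauss(0.0, 1.0)
             for m, lv in zip(mu, logvar)]
        x_hat = self.decode(z)
        recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))
        # Closed-form KL( N(mu, sigma^2) || N(0, 1) ), summed over latent nodes
        kl = -0.5 * sum(1.0 + lv - m * m - math.exp(lv)
                        for m, lv in zip(mu, logvar))
        return recon + kl, x_hat


# Usage with made-up metric values for a single voice-map cell
vae = TinyVAE(seed=1)
cell = [0.4, 0.6, 0.3, 0.1, 0.5, 0.2, 0.7]
total, x_hat = vae.loss(cell)
```

Setting `N_LATENT = 6` gives the other latent size mentioned in the method. The sigmoid output layer is chosen here only because the metrics are assumed to be normalized to [0, 1].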
Results: The decoder’s reconstruction of the metrics from their lower-dimensional latent space representation had an average error of about 10%. The optimal latent space dimension appeared to lie between 4 and 6 nodes. Errors were somewhat reduced when separate models were trained on the two adult groups. The error was higher for the model trained on the 20 maps of children than for the combined model.
Conclusions: Latent space node activation was tentatively seen to correlate with the respective input metrics in various combinations. Voice maps of audio and EGG signals are a pre-compacted and structured representation, which makes the training of ML models much faster than training on the original signals. The potential relevance of these correlations to actual physical mechanisms is discussed.
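How the correspondence between latent node activation and input metrics was assessed is not stated here. As one hedged illustration, assuming a plain Pearson correlation computed over all voice-map cells, a latent-by-metric correlation table could be obtained as sketched below. The function names `pearson` and `latent_metric_correlations` and the toy data are hypothetical, and such correlations are descriptive only; they do not by themselves identify a physical mechanism.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0.0 or sy == 0.0:
        return float("nan")  # undefined when either sequence is constant
    return cov / (sx * sy)


def latent_metric_correlations(latents, metrics):
    """Return r where r[i][j] correlates latent node i with input metric j.

    latents: one latent vector (e.g. the encoder mean) per voice-map cell
    metrics: the input metric vector of the same cells, in the same order
    """
    n_latent, n_metric = len(latents[0]), len(metrics[0])
    return [[pearson([z[i] for z in latents], [m[j] for m in metrics])
             for j in range(n_metric)]
            for i in range(n_latent)]


# Toy data: 3 cells, 2 latent nodes, 2 metrics (values are made up)
latents = [[1.0, 1.0], [2.0, 0.0], [3.0, 1.0]]
metrics = [[2.0, 3.0], [4.0, 2.0], [6.0, 1.0]]
r = latent_metric_correlations(latents, metrics)
```

In this toy case, latent node 0 correlates positively with metric 0 (r close to 1) and negatively with metric 1 (r close to -1), while node 1 is uncorrelated with both (r close to 0).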