Data Augmentation for Multimodal Singing Analysis
Background & Objective:
Machine learning-based singing analysis often relies on imbalanced datasets collected from participants or publicly available sources, in which one class (e.g., positive or negative) dominates the distribution. This imbalance can degrade the performance of evaluation models and introduce biases, leading to unreliable results. The issue is further amplified in multimodal detection, where a suitable dataset often does not exist or data collection is costly and time-consuming.
This paper proposes a data augmentation approach for singing analysis to address this issue. We aim to develop a framework that automatically generates multimodal data consisting of audio and surface electromyography (sEMG) signals using generative AI. Specifically, we propose a transformer-based deep learning model that captures the sequential nature of the audio and sEMG data and the dependencies between the two modalities. The proposed model generates paired artificial data comprising audio and sEMG signals, which is expected to improve performance in singing analysis and to offer insights into the relationship between the singing voice and muscle movement.
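As a rough illustration of how such generated pairs could be used to address class imbalance, the Python sketch below tops up each minority class in a skewed training set with synthetic audio/sEMG pairs. The `generate_pairs` function is a hypothetical stand-in for the proposed generator; no names here are taken from the paper.

```python
# Illustrative sketch: rebalancing a skewed singing-analysis training set
# with synthetic (audio, sEMG) pairs. `generate_pairs` is a hypothetical
# stand-in for the proposed decoder-based generator.
from collections import Counter

def rebalance(samples, labels, generate_pairs):
    """Top up each minority class until it matches the largest class."""
    counts = Counter(labels)
    target = max(counts.values())
    for label, count in counts.items():
        needed = target - count
        if needed > 0:
            samples.extend(generate_pairs(label=label, n=needed))
            labels.extend([label] * needed)
    return samples, labels
```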
Methods:
To develop the data augmentation model, professionally trained singers and vocally healthy non-singers are recruited for data collection. Anterior neck sEMG and audio from a handheld microphone are recorded simultaneously while participants produce sustained and glissando vowels and sing a piece of their choice in a comfortable singing voice. The proposed model is based on a transformer architecture. During training, the collected data are processed by the encoder, and the decoder reconstructs the inputs from the encoder's output and the reconstruction from the previous timestep. At inference time, only the decoder is used to generate artificial data; the model can also be extended to generate one modality conditioned on the other (see the sketch below). Several machine learning-based singing analysis methods, trained with the proposed data augmentation technique, are used to assess the effectiveness of the approach, and we expect their performance to improve significantly when trained on the augmented data.
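The following is a minimal PyTorch sketch of the kind of encoder-decoder model described above. The feature dimensions (80-bin mel spectrogram frames, 8 sEMG channels), layer counts, and training step are illustrative assumptions, not the authors' actual configuration; audio and sEMG frames are concatenated per timestep so the transformer can model inter-modal dependencies jointly.

```python
# Minimal sketch of a paired audio/sEMG transformer. All dimensions and
# hyperparameters are illustrative assumptions, not the paper's settings.
import torch
import torch.nn as nn

AUDIO_DIM, SEMG_DIM, D_MODEL = 80, 8, 256  # assumed feature sizes

class PairedSeqModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Audio and sEMG frames are projected jointly per timestep so the
        # model can learn dependencies across the two modalities.
        self.in_proj = nn.Linear(AUDIO_DIM + SEMG_DIM, D_MODEL)
        self.transformer = nn.Transformer(
            d_model=D_MODEL, nhead=4,
            num_encoder_layers=3, num_decoder_layers=3,
            batch_first=True,
        )
        self.out_proj = nn.Linear(D_MODEL, AUDIO_DIM + SEMG_DIM)

    def forward(self, src, tgt):
        # src, tgt: (batch, time, AUDIO_DIM + SEMG_DIM)
        causal = self.transformer.generate_square_subsequent_mask(tgt.size(1))
        h = self.transformer(
            self.in_proj(src), self.in_proj(tgt), tgt_mask=causal
        )
        return self.out_proj(h)

# Training step: the decoder sees the previous timestep's frames
# (teacher forcing) and learns to reconstruct the next ones.
model = PairedSeqModel()
x = torch.randn(2, 100, AUDIO_DIM + SEMG_DIM)  # dummy paired sequences
pred = model(x, x[:, :-1])                     # shift decoder input by one
loss = nn.functional.mse_loss(pred, x[:, 1:])
loss.backward()
```

Under these assumptions, generating one modality conditioned on the other would amount to feeding the known modality to the encoder and decoding only the missing one.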
Results & Conclusions:
Data collection is currently in progress, and the results will be presented at the conference.