Toward Robust Machine Learning Models in Voice and Speech Science: A Hands-On Demonstration of Selection and Evaluation Biases


Machine learning (ML) offers powerful tools for analyzing complex vocal functions and behaviors. Its growing popularity stems from its ability to uncover subtle patterns in voice and speech, support diagnosis, predict treatment outcomes, and enable early intervention strategies. However, despite promising results in the literature, ML models often underperform in clinical settings due to overfitting and limited generalizability.

This workshop addresses two critical factors contributing to this gap: sampling bias and a lack of rigor in model development and validation. Sampling bias arises when training data fail to reflect the true diversity of the target population, yielding models that perform poorly on underrepresented groups or in real-world scenarios. Labeling bias and historical biases embedded in clinical metrics and subjective annotation practices can further skew what a model learns; thorough annotator training and standardized labeling protocols help improve fairness and accuracy. To mitigate sampling bias, researchers must adopt inclusive data collection strategies, such as recruiting participants across varied demographics and environments, and apply techniques such as stratified sampling or oversampling.

Beyond data collection, the flexibility inherent in implementing ML pipelines introduces additional risks. Unlike conventional statistical methods, ML involves many interacting choices around model selection, feature selection, and hyperparameter tuning. Evaluation methodology, in particular, plays a pivotal role in the generalizability of ML results. Common practices, such as allowing recordings from the same speaker to appear in both the training and test sets or tuning hyperparameters on the test data, can inflate performance metrics and mask a model's true predictive capacity. Selecting robust evaluation strategies is essential to ensure that models capture meaningful patterns rather than dataset-specific noise.
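As a minimal sketch of the evaluation issue (assuming scikit-learn, fully synthetic data, and hypothetical parameter values; this is not the workshop's own material), the code below contrasts a naive random split, in which recordings from the same speaker can land in both the training and test folds, with a speaker-independent split. Under the naive protocol, the model can exploit speaker identity rather than genuine diagnostic signal.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, GroupKFold, cross_val_score

rng = np.random.default_rng(0)

# Hypothetical setup: 40 speakers, 10 recordings each, 20 acoustic features.
n_speakers, n_recordings, n_features = 40, 10, 20
speaker_ids = np.repeat(np.arange(n_speakers), n_recordings)

# One diagnostic label per speaker, repeated across that speaker's recordings.
labels = np.repeat(rng.integers(0, 2, n_speakers), n_recordings)

# Each speaker has a strong idiosyncratic "signature"; the true class effect is weak.
signature = rng.normal(0.0, 3.0, (n_speakers, n_features))[speaker_ids]
class_effect = labels[:, None] * 0.3
X = signature + class_effect + rng.normal(0.0, 1.0, (n_speakers * n_recordings, n_features))

clf = RandomForestClassifier(n_estimators=200, random_state=0)

# Flawed protocol: recordings of the same speaker may appear in training and test folds.
naive_scores = cross_val_score(clf, X, labels, cv=KFold(n_splits=5, shuffle=True, random_state=0))

# Speaker-independent protocol: every speaker's recordings stay within a single fold.
grouped_scores = cross_val_score(clf, X, labels, groups=speaker_ids, cv=GroupKFold(n_splits=5))

print(f"Accuracy with speaker overlap (inflated): {naive_scores.mean():.2f}")
print(f"Speaker-independent accuracy:             {grouped_scores.mean():.2f}")
```

Because the synthetic speaker signature is far stronger than the class effect, the naive estimate typically far exceeds the speaker-independent one, which is the kind of discrepancy a robust evaluation protocol is meant to expose.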

This workshop will provide clinicians, researchers, and voice professionals with a practical skill set for critically assessing and evaluating ML models, helping them avoid common pitfalls. To achieve this, we will conduct live demonstrations that reveal how selection bias and flawed evaluation methodologies can severely compromise the generalizability of ML models. Code will be executed in real time, guiding participants through scenarios that simulate selection bias and improper evaluation methodologies. Attendees will be encouraged to follow along using shared materials, ensuring hands-on engagement throughout the session. By actively participating, the audience will gain firsthand insight into how methodological flaws can produce misleading performance metrics and limit the real-world applicability of ML models.
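As a minimal sketch of such a selection-bias scenario (synthetic data and hypothetical group labels, not the workshop's actual materials), the code below trains a classifier on a cohort in which one demographic group is nearly absent, then scores it separately on each group of a representative test set.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

def make_group(n, shift):
    """Synthetic two-class data whose overall feature distribution is offset by `shift`."""
    y = rng.integers(0, 2, n)
    X = rng.normal(0.0, 1.0, (n, 5)) + y[:, None] * 1.0 + shift
    return X, y

# Representative test sets: both demographic groups are equally present.
X_test_a, y_test_a = make_group(500, shift=0.0)
X_test_b, y_test_b = make_group(500, shift=2.0)   # group B occupies a different feature region

# Biased training cohort: group B is heavily underrepresented.
X_train_a, y_train_a = make_group(950, shift=0.0)
X_train_b, y_train_b = make_group(50, shift=2.0)
X_train = np.vstack([X_train_a, X_train_b])
y_train = np.concatenate([y_train_a, y_train_b])

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# A single pooled accuracy would hide the disparity; report each group separately.
print(f"Accuracy on group A (well represented):  {clf.score(X_test_a, y_test_a):.2f}")
print(f"Accuracy on group B (underrepresented):  {clf.score(X_test_b, y_test_b):.2f}")
```

Reporting performance per subgroup, rather than as a single pooled score, is what makes this kind of failure visible.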

Hamzeh Ghasemzadeh
Maria Powell