Automatic Estimation of Singing Voice Musical Dynamics
Jyoti Narang (Student)*, Nazif Can Tamer (Universitat Pompeu Fabra), Viviana De La Vega (Escola Superior de Música de Catalunya), Xavier Serra (Universitat Pompeu Fabra)
Keywords: Applications -> music training and education; Knowledge-driven approaches to MIR -> representations of music; MIR tasks -> automatic classification; MIR tasks -> music transcription and annotation; Musical features and properties -> representations of music; Evaluation, datasets, and reproducibility -> annotation protocols
Musical dynamics form a core part of expressive singing voice performance. However, the automatic analysis of musical dynamics for singing voice has received limited attention, partly due to the scarcity of suitable datasets and the lack of a clear evaluation framework. To address this challenge, we propose a methodology for dataset curation. Employing this methodology, we compile a dataset of 509 singing voice performances annotated with musical dynamics and aligned with 163 score files, leveraging state-of-the-art source separation and alignment techniques. The scores are sourced from the OpenScore Lieder corpus of Romantic-era compositions, widely known for its wealth of expressive annotations. Using the curated dataset, we train a CNN model with multi-head attention and varying window sizes to evaluate the effectiveness of estimating musical dynamics. We explore two distinct perceptually motivated input representations for model training: the log-Mel spectrum and Bark-scale-based features. For testing, we manually curate a second dataset of 25 performances annotated with musical dynamics, in collaboration with a professional vocalist. Our experiments show that Bark-scale-based features outperform log-Mel features for the task of singing voice dynamics prediction. The dataset and code are shared publicly for further research on the topic.
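As a concrete illustration of the two perceptually motivated input representations compared in the abstract, the sketch below computes a log-Mel spectrogram and log-compressed Bark-band energies with librosa. This is a minimal sketch, not the authors' pipeline: the window size, hop length, band counts, and the use of the Traunmüller Hz-to-Bark formula are illustrative assumptions, and "performance.wav" is a hypothetical input file.

```python
import numpy as np
import librosa

def log_mel(y, sr, n_fft=2048, hop=512, n_mels=128):
    # Mel-filterbank power spectrogram, converted to dB.
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                       hop_length=hop, n_mels=n_mels)
    return librosa.power_to_db(S, ref=np.max)

def hz_to_bark(f):
    # Traunmüller (1990) approximation of the Bark scale
    # (assumed here; the paper may use a different variant).
    return 26.81 * f / (1960.0 + f) - 0.53

def bark_bands(y, sr, n_fft=2048, hop=512, n_bands=24):
    # STFT power summed into equal-width bands along the Bark
    # axis, then log-compressed.
    S = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop)) ** 2
    z = hz_to_bark(librosa.fft_frequencies(sr=sr, n_fft=n_fft))
    edges = np.linspace(z.min(), z.max() + 1e-9, n_bands + 1)
    bands = np.stack([S[(z >= lo) & (z < hi)].sum(axis=0)
                      for lo, hi in zip(edges[:-1], edges[1:])])
    return np.log1p(bands)  # shape: (n_bands, n_frames)

# Hypothetical usage on a separated vocal track:
y, sr = librosa.load("performance.wav", sr=44100)
print(log_mel(y, sr).shape, bark_bands(y, sr).shape)
```

Both representations yield a (bands x frames) matrix that can be windowed and fed to the dynamics-prediction model; the Bark variant trades the Mel filterbank for bands spaced by critical-band width.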