End-to-end automatic singing skill evaluation using cross-attention and data augmentation for solo singing and singing with accompaniment
Yaolong Ju (Huawei)*, Chun Yat Wu (Huawei), Betty Cortiñas Lorenzo (Huawei), Jing Yang (Huawei 2012 Labs), Jiajun Deng (Huawei), Fan Fan (Huawei), Simon Lui (Huawei)
Keywords: Musical features and properties -> timbre, instrumentation, and singing voice; Applications -> music composition, performance, and production; Knowledge-driven approaches to MIR -> machine learning/artificial intelligence for music; MIR fundamentals and methodology -> music signal processing; Musical features and properties -> expression and performative aspects of music
Automatic singing skill evaluation (ASSE) systems are predominantly designed for solo singing, leaving the scenario of singing with accompaniment largely unaddressed. In this paper, we propose an end-to-end ASSE system that effectively handles both solo singing and singing with accompaniment through data augmentation, and we conduct a comparative study of four data augmentation approaches. Additionally, we incorporate bi-directional cross-attention (BiCA) for feature fusion, which exploits the inter-relationships between different features better than simple concatenation. Results on the 10KSinging dataset show that data augmentation and BiCA each boost performance individually. When combined, they yield further significant improvements, reaching a Pearson correlation coefficient of 0.769 for solo singing and 0.709 for singing with accompaniment, relative improvements of 36.8% and 26.2%, respectively, over the baseline score of 0.562.
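To make the fusion idea concrete, below is a minimal sketch of bi-directional cross-attention between two feature streams, in the spirit of the BiCA fusion the abstract describes. This is not the paper's architecture: the module name, dimensions, the choice of two `nn.MultiheadAttention` blocks (one per direction), and the final concatenate-and-project step are all illustrative assumptions.

```python
# Hypothetical BiCA fusion sketch (illustrative; not the paper's exact model).
import torch
import torch.nn as nn

class BiCrossAttentionFusion(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        # One cross-attention block per direction.
        self.attn_a2b = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.attn_b2a = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Assumed fusion step: concatenate both attended views, project back.
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
        # feat_a, feat_b: (batch, time, dim) feature sequences.
        # Stream A queries stream B, and vice versa, so each stream is
        # re-weighted by its inter-relationship with the other (unlike
        # simple concatenation, which ignores such interactions).
        a_att, _ = self.attn_a2b(query=feat_a, key=feat_b, value=feat_b)
        b_att, _ = self.attn_b2a(query=feat_b, key=feat_a, value=feat_a)
        return self.proj(torch.cat([a_att, b_att], dim=-1))

# Usage: fuse two 100-frame feature sequences of width 256.
if __name__ == "__main__":
    fusion = BiCrossAttentionFusion(dim=256, num_heads=4)
    a = torch.randn(2, 100, 256)
    b = torch.randn(2, 100, 256)
    print(fusion(a, b).shape)  # torch.Size([2, 100, 256])
```

The key design point this sketch illustrates is symmetry: each stream attends to the other, so neither is privileged, whereas concatenation simply stacks features and leaves the model to discover cross-stream relationships on its own.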