T6: Lyrics and Singing Voice Processing in Music Information Retrieval: Analysis, Alignment, Transcription and Applications
Daniel Stoller, Emir Demirel, Kento Watanabe, and Brendan O’Connor
Abstract:
Singing, a universal human practice, intertwines with lyrics to form a core part of profound musical experiences, conveying emotions, narratives, and real-world connections. This tutorial explores commonly used techniques and practices in lyrics and singing voice processing, which are vital to numerous music information retrieval (MIR) tasks and applications. Despite the importance of song lyrics in MIR and in industry, high-quality paired audio and transcript annotations are often scarce. In the first part of this tutorial, we delve into automatic lyrics transcription and alignment techniques, which substantially reduce annotation cost and enable better-performing systems. The tutorial provides insights into the current state-of-the-art methods for transcription and alignment, highlighting their capabilities and limitations while fostering further research into these systems.

We then present "lyrics information processing", which encompasses lyrics generation and the use of lyrics to discern musically relevant aspects such as emotions, themes, and song structure. Understanding the rich information embedded in lyrics opens avenues for enhancing audio-based tasks by incorporating lyrics as a supplementary input.

Finally, we discuss singing voice conversion as one such task: the conversion of acoustic features embedded in a vocal signal, often relating to timbre and pitch. We explore how lyric-based features can facilitate a model's disentanglement of acoustic and linguistic content, leading to more convincing conversions. This section closes with a brief discussion of the ethical concerns and responsibilities that should be considered in this area.

This tutorial caters especially to new researchers with an interest in lyrics and singing voice modeling, and to those working on improving lyrics alignment and transcription methodologies. It can also inspire researchers to leverage lyrics for improved performance on tasks such as singing voice separation, music and singing voice generation, and cover song and emotion recognition.
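To make the alignment component concrete, below is a minimal sketch of CTC-style forced alignment between a lyric and audio, assuming an acoustic model has already produced per-frame log-probabilities over a character vocabulary. The function name, vocabulary, and toy data are illustrative assumptions, not taken from any specific system covered in the tutorial.

```python
# Minimal sketch of CTC-style forced alignment for lyrics (illustrative only).
# Assumes an acoustic model already produced per-frame log-probabilities over
# a character vocabulary; the vocabulary and toy data below are hypothetical.
import numpy as np

BLANK = 0  # index of the CTC blank symbol in the vocabulary

def forced_align(log_probs: np.ndarray, targets: list[int]) -> list[int]:
    """Viterbi-align `targets` (character indices) to `log_probs` of shape
    (num_frames, vocab_size). Returns one label index per frame."""
    T = log_probs.shape[0]
    # Interleave blanks: [blank, c1, blank, c2, ..., cN, blank]
    ext = [BLANK]
    for c in targets:
        ext += [c, BLANK]
    S = len(ext)
    alpha = np.full((T, S), -np.inf)   # best log-score ending in state s at frame t
    back = np.zeros((T, S), dtype=int)  # best predecessor state
    alpha[0, 0] = log_probs[0, ext[0]]
    alpha[0, 1] = log_probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            candidates = [(alpha[t - 1, s], s)]
            if s >= 1:
                candidates.append((alpha[t - 1, s - 1], s - 1))
            # Skipping a blank is allowed unless the adjacent labels are equal.
            if s >= 2 and ext[s] != BLANK and ext[s] != ext[s - 2]:
                candidates.append((alpha[t - 1, s - 2], s - 2))
            best_score, best_prev = max(candidates)
            alpha[t, s] = best_score + log_probs[t, ext[s]]
            back[t, s] = best_prev
    # A valid path may end on the final label or the trailing blank.
    s = S - 1 if alpha[T - 1, S - 1] >= alpha[T - 1, S - 2] else S - 2
    path = []
    for t in range(T - 1, -1, -1):
        path.append(ext[s])
        s = back[t, s]
    return path[::-1]  # per-frame label indices (blank = silence/transition)

# Toy usage: vocabulary {0: blank, 1: 'a', 2: 'b'}; align the "lyric" [a, b]
# against six frames of synthetic log-probabilities.
logits = np.log(np.array([
    [0.8, 0.1, 0.1],   # silence
    [0.1, 0.8, 0.1],   # 'a'
    [0.1, 0.8, 0.1],   # 'a'
    [0.8, 0.1, 0.1],   # silence
    [0.1, 0.1, 0.8],   # 'b'
    [0.8, 0.1, 0.1],   # silence
]))
print(forced_align(logits, targets=[1, 2]))  # -> [0, 1, 1, 0, 2, 0]
```

In practice, the per-frame path would then be collapsed into word-level start and end times, which is the output format lyrics alignment systems typically expose.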
Bios:
Daniel Stoller is a research scientist at MIQ, the music intelligence team at Spotify. He obtained his Ph.D. from Queen Mary University of London in 2020, before researching causal machine learning at the German Center for Neurodegenerative Diseases (DZNE). Experienced in audio source separation as well as generative modeling and representation learning, he develops machine learning models and techniques that scale to high-dimensional data such as raw audio signals, publishing in both machine learning and audio venues. With a special passion for music, he has also worked extensively on lyrics alignment and on singing voice processing, including separation, detection, and classification.
Emir Demirel is a Senior Data Scientist at Music.ai / Moises, leading projects on lyrics and vocal processing. He obtained his Ph.D. at Queen Mary University of London as a fellow of the "New Frontiers in Music Information Processing" project under the EU's Marie Skłodowska-Curie Actions. After completing his Ph.D., he joined Spotify's Music Intelligence team, deepening his expertise before moving to Music.ai. His research interests span lyrics transcription and alignment, speech recognition, and natural language processing, along with generative AI models.
Kento Watanabe is a senior researcher at the National Institute of Advanced Industrial Science and Technology (AIST), Japan. He received his Ph.D. from Tohoku University in 2018, and his work focuses on Lyrics Information Processing (LIP), natural language processing, and machine learning. He aims to bridge the gap between humans and computers in the fields of music and language, improving their interaction through advanced algorithms.
Brendan O’Connor has worked in music as a performer, composer, producer, teacher, and sound installation artist. He earned his Bachelor’s in classical music at the MTU Cork School of Music (Ireland), followed by a Master’s in music technology at the University of West London, specialising in the voice as the principal instrument in electroacoustic composition. He then completed his Ph.D. on neural-network-based singing voice conversion at Queen Mary University of London. His research interests include the disentanglement of vocal attributes for which labelled data is scarce, such as singing techniques. Since finishing his Ph.D., Brendan has worked at a voice conversion startup, where he continues to apply state-of-the-art machine learning techniques in his area of expertise alongside other researchers in the field.