T1: Connecting Music Audio and Natural Language
Seung Heon Doh, Ilaria Manco, Zachary Novack, Jong Wook Kim and Ke Chen
Abstract:
Language serves as an efficient interface for communication between humans as well as between humans and machines. Through the integration of recent advancements in deep learning-based language models, the understanding, search, and creation of music are becoming capable of catering to user preferences with better diversity and control. This tutorial will start with an introduction to how machines understand natural language, alongside recent advancements in language models and their application across various domains. We will then shift our focus to MIR tasks that incorporate these cutting-edge language models. The core of our discussion will be segmented into three pivotal themes: music understanding through audio annotation and beyond, text-to-music retrieval for music search, and text-to-music generation to craft novel sounds. In parallel, we aim to establish a solid foundation for the emergent field of music-language research, and encourage participation from new researchers by offering comprehensive access to 1) relevant datasets, 2) evaluation methods, and 3) coding best practices.
Bios:
SeungHeon Doh is a Ph.D. student at the Music and Audio Computing Lab, KAIST, under the guidance of Juhan Nam. His research focuses on conversational music annotation, retrieval, and generation. SeungHeon has published papers related to music & language models at ISMIR, ICASSP and IEEE TASLP. He aims to enable machines to comprehend diverse modalities during conversations, thus facilitating the understanding and discovery of music through dialogue. SeungHeon has interned at Adobe Research, Chartmetric, NaverCorp, and ByteDance, applying his expertise in real-world scenarios.
Ilaria Manco is a Ph.D. student at the Centre for Doctoral Training in Artificial Intelligence and Music (Queen Mary University of London), under the supervision of Emmanouil Benetos, George Fazekas, and Elio Quinton (UMG). Her research focuses on multimodal deep learning for music information retrieval, with an emphasis on audio-and-language. Her contributions to the field have been published at ISMIR and ICASSP and include the first captioning model for music, and representation learning approaches to connect music and language for a variety of music understanding tasks. Previously, she was a research intern at Google DeepMind, Adobe and Sony, and obtained an MSci in physics from Imperial College London.
Zachary Novack is a Ph.D. student at the University of California, San Diego, where he is advised by Julian McAuley and Taylor Berg-Kirkpatrick. His research is primarily aimed at controllable music and audio generation. Zachary seeks to build generative music models that allow for arbitrary musically-salient control mechanisms and enable stable multi-round generative audio editing, and he has published such work at ICML, ICLR, and NeurIPS. Zachary has interned at Adobe Research, where he contributed work such as DITTO for deployment in end-user applications. Outside of academics, Zachary is passionate about music education and teaches percussion in the Southern California area.
Jongwook Kim is a Member of Technical Staff at OpenAI, where he has worked on multimodal deep learning models such as Jukebox, CLIP, Whisper, and GPT-4. He has published at ICML, CVPR, ICASSP, IEEE SPM, and ISMIR, and he co-presented a tutorial on self-supervised learning at the NeurIPS 2021 conference. He completed a Ph.D. in Music Technology at New York University with a thesis focusing on automatic music transcription, and he has an M.S. in Computer Science and Engineering from the University of Michigan, Ann Arbor. He interned at Pandora and Spotify during his Ph.D. studies, and he worked as a software engineer at NCSOFT and Kakao.
Ke Chen is a Ph.D. candidate in the Department of Computer Science and Engineering at the University of California, San Diego. His research interests span music and audio representation learning, with a particular focus on its downstream applications in music generative AI, audio source separation, multi-modal learning, and music information retrieval. He has interned at Apple, Mitsubishi, Tencent, ByteDance, and Adobe to further explore his research directions. During his Ph.D. studies, Ke Chen has published more than 20 papers in top-tier conferences in the fields of artificial intelligence, signal processing, and music, such as AAAI, ICASSP, and ISMIR. Outside of academics, he indulges in various music-related activities, including piano performance, singing, and music composition.