T3: From White Noise to Symphony: Diffusion Models for Music and Sound

Chieh-Hsin Lai, Koichi Saito, Bac Nguyen Cong, Yuki Mitsufuji, and Stefano Ermon

Abstract:
This tutorial will cover the theory and practice of diffusion models for music and sound. We will explain the methodology, explore its history, and demonstrate music- and sound-specific applications such as real-time generation and various other downstream tasks. By bridging the gap between techniques and models developed in computer vision and the music and sound domains, we aim to spark further research interest and democratize access to diffusion models. The tutorial comprises four sections. The first provides an overview of deep generative models and delves into the fundamentals of diffusion models. The second explores applications such as sound and music generation, as well as the use of pre-trained models for music/sound editing and restoration. In the third section, a hands-on demonstration focuses on training diffusion models and applying pre-trained models to music/sound restoration. The final section outlines future research directions. We anticipate that this tutorial, emphasizing both the foundational principles and the practical implementation of diffusion models, will stimulate interest in the music and sound signal processing community and illuminate insights and applications of diffusion models drawn from computer vision methodologies.
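As a flavor of the hands-on portion, the sketch below shows a standard DDPM-style training step on a batch of audio waveforms. It is an illustrative assumption, not the tutorial's actual material: the denoiser `model(x_t, t)`, the linear noise schedule, and the epsilon-prediction objective are stock choices made here for the example.

# Minimal sketch of a DDPM-style training step for audio (illustrative only;
# assumes a denoiser network model(x_t, t) that predicts the injected noise).
import torch
import torch.nn as nn

T = 1000                                   # number of diffusion steps
betas = torch.linspace(1e-4, 0.02, T)      # linear noise schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

def diffusion_loss(model: nn.Module, x0: torch.Tensor) -> torch.Tensor:
    """One training step: noise a clean waveform batch x0 and train the
    model to predict the added noise (epsilon-prediction objective)."""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,), device=x0.device)           # random timestep per sample
    eps = torch.randn_like(x0)                                 # Gaussian noise
    a_bar = alphas_bar.to(x0.device)[t].view(b, *([1] * (x0.dim() - 1)))
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps      # forward (noising) process
    eps_hat = model(x_t, t)                                    # denoiser predicts the noise
    return nn.functional.mse_loss(eps_hat, eps)

At sampling time, the same schedule is traversed in reverse, iteratively denoising white noise into a waveform; the tutorial covers these fundamentals in its first section.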

Bios:

Chieh-Hsin Lai earned his Ph.D. in Mathematics from the University of Minnesota in 2021. He is currently a research scientist at Sony AI and a visiting assistant professor in the Department of Applied Mathematics at National Yang Ming Chiao Tung University, Taiwan. His expertise is in deep generative models, especially diffusion models and their application to media content restoration. He organized an EXPO workshop at NeurIPS 2023 on “Media Content Restoration and Editing with Deep Generative Models and Beyond”. For more details, please refer to https://chiehhsinjesselai.github.io/.

Koichi Saito is an AI engineer at Sony AI. He works on deep generative models for music and sound, especially solving inverse problems for music signals with diffusion models and diffusion-based text-to-sound generation. He has extensive experience showcasing advanced diffusion model technologies to businesses and industries related to music.

Bac Nguyen Cong earned his M.Sc. degree (summa cum laude) in computer science from Universidad Central de Las Villas in 2015 and his Ph.D. from Ghent University in 2019. He joined Sony in 2019, focusing his research on representation learning, vision-language models, and generative modeling. With four years of hands-on industry experience in deep learning and machine learning, his work spans application domains such as text-to-speech and voice conversion.

Yuki Mitsufuji holds dual roles at Sony, leading two departments, and is a specially appointed associate professor at Tokyo Tech, where he lectures on generative models. He is an IEEE Senior Member and serves on the IEEE AASP Technical Committee for 2023-2026. He chaired “Diffusion-based Generative Models for Audio and Speech” at ICASSP 2023 and “Generative Semantic Communication: How Generative Models Enhance Semantic Communications” at ICASSP 2024. For more details, please refer to https://www.yukimitsufuji.com/.

Stefano Ermon is an associate professor at Stanford, specializing in probabilistic modeling of data with a focus on computational sustainability. He has received Best Paper Awards from ICLR, AAAI, UAI, and CP, as well as an NSF CAREER Award. He also organized a course on diffusion models at SIGGRAPH 2023. For more details, please refer to https://cs.stanford.edu/~ermon/.