T1: Connecting Music Audio and Natural Language
Seung Heon Doh, Ilaria Manco, Zachary Novack, Jong Wook Kim and Ke Chen
t1-connecting-music-audio
Language serves as an efficient interface for communication between humans as well as between humans and machines. Through the integration of recent advancements in deep learning-based language models, the understanding, search, and creation of music are becoming capable of catering to user preferences with better diversity and control. This tutorial will start with an introduction to how machines understand natural language, alongside recent advancements in language models, and their application across various domains. We will then shift our focus to MIR tasks that incorporate these cutting-edge language models. The core of our discussion will be segmented into three pivotal themes: music understanding through audio annotation and beyond, text-to-music retrieval for music search, and text-to-music generation to craft novel sounds. In parallel, we aim to establish a solid foundation for the emergent field of music-language research, and encourage participation from new researchers by offering comprehensive access to 1) relevant datasets, 2) evaluation methods, and 3) coding best practices.
T2: Exploring 25 Years of Music Information Retrieval: Perspectives and Insights
Masataka Goto, Jin Ha Lee, and Meinard Müller
t2-exploring-25-years
This tutorial reflects on the journey of Music Information Retrieval (MIR) over the last 25 years, offering insights from three distinct perspectives: research, community, and education. Drawing from the presenters' personal experiences and reflections, it provides a holistic view of MIR's evolution, covering historical milestones, community dynamics, and pedagogical insights. Through this approach, the tutorial aims to give attendees a nuanced understanding of MIR’s past, present, and future directions, fostering a deeper appreciation for the field and its interdisciplinary and educational aspects.
The tutorial is structured into three parts, each based on one of the aforementioned perspectives. The first part delves into the research journey of MIR. It covers the inception of query-by-humming and the emergence of MP3s, discusses the establishment of standard tasks such as beat tracking and genre classification, and highlights significant advancements, applications, and future challenges in the field. The second part explores the community aspect of ISMIR. It traces the growth of the society from a small symposium to a well-recognized international community, emphasizing core values such as interdisciplinary collaboration and diversity, and invites the audience to imagine the future of the ISMIR community together. Lastly, the third part discusses the role of music as an educational domain. It examines the broad implications of MIR research, the value of pursuing a PhD in MIR, and the significant educational resources available.
Each part invites audience interaction, aiming to provide attendees with a deeper appreciation of MIR's past achievements and insights into its potential future directions. This tutorial is not just a historical overview but also a platform for fostering a deeper understanding of the interplay between technology and music.
T3: From White Noise to Symphony: Diffusion Models for Music and Sound
Chieh-Hsin Lai, Koichi Saito, Bac Nguyen Cong, Yuki Mitsufuji, and Stefano Ermon
t3-from-white-noise
This tutorial will cover the theory and practice of diffusion models for music and sound. We will explain the methodology, explore its history, and demonstrate music and sound-specific applications such as real-time generation and various other downstream tasks. By bridging the gap from computer vision techniques and models, we aim to spark further research interest and democratize access to diffusion models for the music and sound domains.
The tutorial comprises four sections. The first provides an overview of deep generative models and delves into the fundamentals of diffusion models. The second section explores applications such as sound and music generation, as well as utilizing pre-trained models for music/sound editing and restoration. In the third section, a hands-on demonstration will focus on training diffusion models and applying pre-trained models for music/sound restoration. The final section outlines future research directions.
We anticipate that this tutorial, emphasizing both the foundational principles and practical implementation of diffusion models, will stimulate interest among the music and sound signal processing community. It aims to illuminate insights and applications concerning diffusion models, drawn from methodologies in computer vision.
T4: Humans at the Center of MIR: Human-subjects Research Best Practices
Claire Arthur, Nat Condit-Schultz, David R. W. Sears, John Ashley Burgoyne, and Joshua Albrecht
t4-humans-at-the
In one form or another, most MIR research depends on the judgment of humans. Humans provide our ground-truth data, whether through explicit annotation or through observable behavior (e.g., listening histories); humans also evaluate our results, whether in academic research reports or in the commercial marketplace. Will users like it? Will customers buy it? Does it sound good? These are all critical questions for MIR researchers which can only be answered by asking people. Unfortunately, measuring and interpreting the judgments and experiences of humans in a rigorous manner is difficult. Human responses can be fickle, changeable, and inconsistent; they are, by definition, subjective. There are many factors that influence human responses, some of which can be controlled or accounted for in experimental design, and others that must be tolerated but ameliorated through statistical analysis. Fortunately, researchers in the field of behavioral psychology have amassed extensive expertise and institutional knowledge related to the practice and pedagogy of human-subject research, but MIR researchers receive little exposure to research methods involving human subjects. This tutorial, led by MIR researchers with training (and publications) in psychological research, aims to share these insights with the ISMIR community. The tutorial will introduce key concepts, terminology, and concerns in carrying out human-subject research, all in the context of MIR. Through the discussion of real and hypothetical human research, we will explore the nuances of experiment and survey design, stimuli creation, sampling, psychometric modeling, and statistical analysis. We will review common pitfalls and confounds in human research, and present guidelines for best practices in the field. We will also cover fundamental ethical and legal requirements of human research.
Any and all ISMIR members are welcome and encouraged to attend: it is never too early, or too late, in one’s research career to learn (or practice) these essential skills.
T5: Deep Learning 101 for Audio-based MIR
Geoffroy Peeters, Gabriel Meseguer Brocal, Alain Riou, and Stefan Lattner
t5-deep-learning-101
Audio-based MIR (MIR based on the processing of audio signals) covers a broad range of tasks, including analysis (pitch, chord, beats, tagging), similarity/cover identification, and processing/generation of samples or music fragments. A wide range of techniques can be employed for solving each of these tasks, spanning from conventional signal processing and machine learning algorithms to the whole zoo of deep learning techniques.
This tutorial aims to review the various elements of this deep learning zoo commonly applied in Audio-based MIR tasks. We review typical audio front-ends (such as waveform, Log-Mel-Spectrogram, HCQT, SincNet, LEAF, quantization using VQ-VAE, RVQ), as well as projections (including 1D-Conv, 2D-Conv, Dilated-Conv, TCN, WaveNet, RNN, Transformer, Conformer, U-Net, VAE), and examine the various training paradigms (such as supervised, self-supervised, metric-learning, adversarial, encoder-decoder, diffusion). Rather than providing an exhaustive list of all of these elements, we illustrate their use within a subset of (commonly studied) Audio-based MIR tasks such as multi-pitch/chord-estimation, cover-detection, auto-tagging, source separation, music-translation or music generation. This subset of Audio-based MIR tasks is designed to encompass a wide range of deep learning elements. For each task, we a) state its goal, b) describe how it is evaluated, c) present some popular datasets for training a system, and d) explain (using slides and PyTorch code) how it can be solved using deep learning.
The objective is to provide a 101 lecture (introductory lecture) on deep learning techniques for Audio-based MIR. It does not aim to be exhaustive in terms of either Audio-based MIR tasks or deep learning techniques, but rather to give newcomers to Audio-based MIR an overview of how to solve the most common tasks using deep learning. It will provide a portfolio of code (Colab notebooks and a Jupyter book) to help newcomers tackle the various Audio-based MIR tasks.
T6: Lyrics and Singing Voice Processing in Music Information Retrieval: Analysis, Alignment, Transcription and Applications
Daniel Stoller, Emir Demirel, Kento Watanabe, and Brendan O’Connor
t6-lyrics-and-singing
Singing, a universal human practice, intertwines with lyrics to form a core part of profound musical experiences, conveying emotions, narratives, and real-world connections. This tutorial explores the commonly used techniques and practices in lyrics and singing voice processing, which are vital in numerous music information retrieval tasks and applications.
Despite the importance of song lyrics in MIR and the industry, high-quality paired audio & transcript annotations are often scarce. In the first part of this tutorial, we'll delve into automatic lyrics transcription and alignment techniques, which significantly reduce the annotation cost and enable more performant solutions. Our tutorial provides insights into the current state-of-the-art methods for transcription and alignment, highlighting their capabilities and limitations while fostering further research into these systems.
Moreover, we present "lyrics information processing", which encompasses lyrics generation and leveraging lyrics to discern musically relevant aspects such as emotions, themes, and song structure. Understanding the rich information embedded in lyrics opens avenues for enhancing audio-based tasks by incorporating lyrics as supplementary input.
Finally, we discuss singing voice conversion as one such task, which involves the conversion of acoustic features embedded in a vocal signal, often relating to timbre and pitch. We explore how lyric-based features can facilitate a model's inherent disentanglement between acoustic and linguistic content, which leads to more convincing conversions. This section closes with a brief discussion on the ethical concerns and responsibilities that should be considered in this area.
This tutorial caters especially to new researchers with an interest in lyrics and singing voice modeling, or those involved in improving lyrics alignment and transcription methodologies. It can also inspire researchers to leverage lyrics for improved performance on tasks like singing voice separation, music and singing voice generation, and cover song and emotion recognition.