Abstract:

Generative models have become remarkably good at generating realistic text, images, and even music. Identifying how exactly these models conceptualize data has therefore become crucial. To date, however, interpretability research has mainly focused on the text and image domains, leaving a gap in the music domain. In this paper, we investigate the transferability of straightforward text-oriented interpretability techniques to the music domain. Specifically, we examine the usability of these techniques for analyzing how the generative music model MusicGen constructs representations of human-interpretable musicological concepts. Using the DecoderLens, we gain insight into how the model gradually composes these concepts, and using interchange interventions, we observe the contributions of individual model components to generating the sound of specific instruments and genres. We also encounter several shortcomings of these interpretability techniques in the music domain, which underscore the complexity of music and the need for proper audio-oriented adaptation. Our research marks an initial step toward a fundamental understanding of generative music models, paving the way for future advancements in controlling music generation.
