Audio Conditioning for Music Generation via Discrete Bottleneck Features
Simon Rouard (Meta AI Research)*, Alexandre Defossez (Kyutai), Yossi Adi (Facebook AI Research), Jade Copet (Meta AI Research), Axel Roebel (IRCAM)
Keywords: Creativity -> human-AI co-creativity; Generative Tasks -> music and audio synthesis; Knowledge-driven approaches to MIR -> machine learning/artificial intelligence for music; MIR fundamentals and methodology -> music signal processing; MIR tasks -> music generation
While most music generation models use textual or parametric conditioning (e.g., tempo, harmony, musical genre), we propose to condition a language-model-based music generation system with audio input. Our exploration involves two distinct strategies. The first strategy, termed textual inversion, leverages a pre-trained text-to-music model to map audio input to corresponding "pseudowords" in the textual embedding space. For the second strategy, we train a music language model from scratch jointly with a text conditioner and a quantized audio feature extractor. At inference time, we can mix textual and audio conditioning and balance them thanks to a novel double classifier-free guidance method. We conduct automatic and human studies and provide music samples to demonstrate the quality of our model.
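As a rough illustration of how text and audio guidance can be balanced at sampling time, here is a minimal sketch of a cascaded double classifier-free guidance applied to the model's logits; the cascade order, the weights alpha/beta, and the `model` interface are assumptions made for illustration, not necessarily the exact formulation used in the paper.

```python
def double_cfg_logits(model, tokens, text_cond, audio_cond, alpha=3.0, beta=3.0):
    """Illustrative double classifier-free guidance (assumed cascaded form).

    Three forward passes: unconditional, audio-only, and text+audio.
    The audio guidance direction is weighted by `alpha`, the additional
    text guidance direction by `beta`.
    """
    l_uncond = model(tokens, text=None, audio=None)           # no conditioning
    l_audio = model(tokens, text=None, audio=audio_cond)      # audio only
    l_full = model(tokens, text=text_cond, audio=audio_cond)  # text + audio

    # Cascade the two guidance directions (an assumption, not the paper's formula).
    return l_uncond + alpha * (l_audio - l_uncond) + beta * (l_full - l_audio)
```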
Reviews
This paper was generally well received by the reviewers. Congrats, authors!
Please address the reviewers' comments as closely as possible, especially the following:
- Cite the appropriate previous work (i.e., this is not the only audio-conditioned music generative model)
- Have a proficient English speaker proofread the manuscript
The paper is well written, with an extensive related-work section. The accompanying website features many interesting examples, and the code will be released upon acceptance. The authors propose to investigate the use of textual inversion with a pretrained MusicGen model to generate variations of an existing song.
Two similarity metrics are introduced to spot copies in the generated material.
My main concern is about the original motivation and statement of the problem. If textual inversion proves able to capture audio features from the audio conditioning and regenerate audio with similar characteristics, one may regret that this is very close to standard continuation and that the possibilities offered by textual inversion are not fully exploited. For instance, the example "Chill lofi remix ex. 1" from the demo website is very similar to the prompt.
In this particular case, the textual embeddings are learnt from only one song: I would have liked to see more general and diverse applications. For instance, is it possible to learn new styles / instruments / chord progressions from a collection of songs? (In this paper, "style" is often used as a replacement for "song", which may be misleading.) If such an application is closer to what is mentioned in the introduction, "Given a few images (3-5) of a concept or object", it is never discussed in the paper. It would also remove the necessity to introduce bottlenecks.
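For context on this question: textual inversion optimizes the embedding of a new pseudoword against a frozen generative model, and nothing in the procedure restricts the training data to a single song. Below is a minimal sketch of the idea for a token-based music model; the interfaces `frozen_lm` and `text_embedder`, the placeholder prompt, and the training loop are assumptions for illustration, not the authors' code. Feeding token chunks drawn from a whole collection of songs would, in principle, learn a shared embedding for a style or instrument rather than a single track.

```python
import torch
import torch.nn.functional as F

# Assumed interfaces (hypothetical, for illustration only):
# frozen_lm(token_seq, cond) -> logits over the next audio token
# text_embedder(prompt, pseudo) -> conditioning embeddings with the "*"
#   slot filled by the learnable pseudoword vector.

def learn_pseudoword(frozen_lm, text_embedder, audio_token_batches,
                     embed_dim=768, steps=1000, lr=1e-2):
    """Optimize a single pseudoword embedding so that the frozen model
    assigns high likelihood to the tokens of the target audio."""
    pseudo = torch.randn(embed_dim, requires_grad=True)
    opt = torch.optim.Adam([pseudo], lr=lr)

    for _ in range(steps):
        tokens = next(audio_token_batches)            # (B, T) discrete audio tokens
        cond = text_embedder("a song in the style of *", pseudo)
        logits = frozen_lm(tokens[:, :-1], cond)      # next-token prediction
        loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               tokens[:, 1:].reshape(-1))
        opt.zero_grad()
        loss.backward()
        opt.step()                                    # only `pseudo` is updated; the LM stays frozen
    return pseudo.detach()
```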
Some questions: regarding "For some song, we never achieve to obtain hearable music as the result suffers from glitches, and tempo instabilities." (l. 392): is it possible to automatically know when this method fails?
Some things to clarify in §4.4:
- Nearest neighbours are written i_1^C but chunks are indexed with i, j: this is confusing; is it done at the chunk or song level?
- "However, if a model copies the conditioning (i.e. x_G ≈ x_C) the metric will tend to 1, we thus need a second metric to avoid x_G and x_C being too similar.": it sounds as if the metric had an influence on the generation process.
- "G is the Nearest Neighbor of C": this should be clearer for a subtitle; what are G and C?
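For concreteness, one plausible reading of such a chunk-level nearest-neighbour similarity is sketched below; the cosine similarity, the max over conditioning chunks, and the averaging are assumptions made for illustration and may not match the exact definition in §4.4 of the paper.

```python
import numpy as np

def knn_similarity(gen_embeddings, cond_embeddings):
    """For every chunk embedding of the generated audio, find its nearest
    conditioning chunk (cosine similarity) and average the scores.

    gen_embeddings:  (N_g, D) array, one row per chunk of the generated audio.
    cond_embeddings: (N_c, D) array, one row per chunk of the conditioning audio.
    Returns a value that approaches 1 when the generation copies the conditioning.
    """
    g = gen_embeddings / np.linalg.norm(gen_embeddings, axis=1, keepdims=True)
    c = cond_embeddings / np.linalg.norm(cond_embeddings, axis=1, keepdims=True)
    sims = g @ c.T                          # (N_g, N_c) pairwise cosine similarities
    return float(sims.max(axis=1).mean())   # nearest conditioning chunk per generated chunk
```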
Interesting paper, but it may seem like a straightforward application of the textual inversion method. Choosing the audio condition from the same song when performing textual inversion may not be the best use case, as this forces the introduction of bottlenecks in the feature extractor. The comparison with different feature extractors is interesting, especially the fact that "Self-supervised encoder like MERT and MusicFM outperforms low level acoustic models like EnCodec." Demos are of good quality and well presented.
Very well written and complete paper with many impressive details and convincing demo results. I have trouble finding anything majorly wrong with this work and just have to say kudos for the impressive piece of work!
This paper introduces a method for generating music conditioned on other music, using a high-level semantic embedding of a reference piece of music as conditioning.
The authors evaluate their method against a textual inversion baseline (adapted from the image domain by the authors) and find that their proposed method is faster and generates higher-quality audio than this baseline.
The authors perform ablations of several aspects of their system and discuss challenges, like how text conditioning is ignored when audio conditioning is present, and how including too much information from a style embedding can lead to the model perfectly reconstructing the audio.
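The information trade-off mentioned above can be made concrete with a toy bottleneck: quantizing and temporally downsampling extracted features caps how much of the reference can be copied. The sketch below, which assigns frames to a fixed codebook, is only an illustration of the principle, not the authors' feature extractor.

```python
import numpy as np

def bottleneck(features, codebook, stride=4):
    """Toy information bottleneck over extracted audio features.

    features: (T, D) continuous frame-level features (e.g. from a pretrained encoder).
    codebook: (K, D) centroids, e.g. learned with k-means on a training corpus.
    stride:   temporal downsampling factor; a larger stride gives coarser conditioning.

    Returns discrete code indices; fewer codes and fewer frames mean less
    information leaks from the reference, so the generator cannot simply
    reconstruct it.
    """
    coarse = features[::stride]                                    # temporal downsampling
    dists = ((coarse[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1)                                    # (T // stride,) code indices
```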
Overall, I like the authors' approach and find their evaluation sufficient. My recommendation is Strong Accept.
strengths
- The claims of the paper are held together well by objective experiments and a human listening study.
- The listening examples demonstrate that the authors’ proposed method generates high-quality music that “riffs” on the musical idea given in an audio “style” prompt.
weaknesses
- While the KNN-based objective metrics make intuitive sense, I wonder if there’s a more interpretable way to observe which aspects of the “style” get preserved or not. For example, I could imagine checking whether the generated audio has a tempo, accent pattern, rhythm, harmony, or melodic contour that is too similar to the style reference audio, by leveraging pretrained MIR models/feature extractors to extract these music descriptors from both the reference and the generated audio (see the sketch after this list).
- The paper claims to be the only existing audio-conditioned music generative model, but VampNet (ISMIR 2023) also claims to be an audio-conditioned model, albeit with a different approach. It would have been great to include it as a baseline in the evaluations. VampNet is mentioned in line 136 but is missing a citation.
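As a sketch of the descriptor-level check suggested in the first weakness above, the snippet below compares tempo and average chroma between a reference and a generated clip using librosa; the choice of descriptors, the thresholds left to the reader, and the function `descriptor_overlap` are illustrative assumptions, not an evaluation performed in the paper.

```python
import librosa
import numpy as np

def descriptor_overlap(ref_path, gen_path, sr=32000):
    """Compare simple MIR descriptors (tempo and mean chroma) between a
    reference clip and a generated clip. Purely illustrative: a real
    evaluation would use stronger pretrained MIR models."""
    ref, _ = librosa.load(ref_path, sr=sr, mono=True)
    gen, _ = librosa.load(gen_path, sr=sr, mono=True)

    # Tempo estimates (beat_track may return a scalar or a size-1 array).
    tempo_ref = float(np.atleast_1d(librosa.beat.beat_track(y=ref, sr=sr)[0])[0])
    tempo_gen = float(np.atleast_1d(librosa.beat.beat_track(y=gen, sr=sr)[0])[0])

    # Average chroma profiles as a crude harmony descriptor.
    chroma_ref = librosa.feature.chroma_cqt(y=ref, sr=sr).mean(axis=1)
    chroma_gen = librosa.feature.chroma_cqt(y=gen, sr=sr).mean(axis=1)
    chroma_sim = float(np.dot(chroma_ref, chroma_gen) /
                       (np.linalg.norm(chroma_ref) * np.linalg.norm(chroma_gen)))

    return {"tempo_diff_bpm": abs(tempo_ref - tempo_gen),
            "chroma_cosine_sim": chroma_sim}
```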
I have modified the paper according to the reviewers' comments. The paper has been proofread by a native English speaker. I have also updated details and citations according to their comments.