Abstract:

Recent years have witnessed rapid growth of large-scale language models in the domain of music audio. Such models enable end-to-end generation of higher-quality music, and some allow conditioned generation using text descriptions. However, the power of text-based controls over music is intrinsically limited, as they can only describe music \textit{indirectly} through meta-data (such as singers and instruments) or high-level representations (such as genre and emotion). \textit{We aim to further equip the models with direct and \textbf{content-based} controls on innate music languages} such as pitch, chords, and the drum track. To this end, we contribute \textit{Coco-Mulla}, a \textbf{co}ntent-based \textbf{co}ntrol method for \textbf{mu}sic \textbf{l}arge \textbf{la}nguage modeling. It uses a parameter-efficient fine-tuning (PEFT) method tailored for Transformer-based audio models. Experiments show that our approach achieves high-quality music generation with \textbf{low-resource} semi-supervised learning. We fine-tune the model with less than 4$\%$ of the original parameters on a small dataset with fewer than 300 songs. Moreover, our approach enables effective content-based controls. We illustrate its controllability via chord and rhythm conditions, two of the most salient features of pop music. Furthermore, we show that by combining content-based controls and text descriptions, our system achieves flexible music variation generation and arrangement. Our source code and demos are available online\footnote{\url{https://github.com/Kikyo-16/coco-mulla-repo}.}\footnote{\url{https://kikyo-16.github.io/coco-mulla/}.}.
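To make "content-based control" concrete, the following is a minimal, illustrative sketch of how frame-wise chord and drum-onset conditions could be embedded for a Transformer-based audio language model. The frame rate, vocabulary size, hidden size, and module names are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn

# Illustrative sketch only: assumed frame rate, vocabulary, and hidden size.
FRAME_RATE = 50          # frames per second of the audio LM (assumption)
N_CHORD_CLASSES = 25     # e.g., 12 major + 12 minor + "no chord" (assumption)
D_MODEL = 1024           # Transformer hidden size (assumption)

class ContentCondition(nn.Module):
    """Embed per-frame chord labels and drum onsets into the LM's hidden space."""
    def __init__(self):
        super().__init__()
        self.chord_emb = nn.Embedding(N_CHORD_CLASSES, D_MODEL)
        self.drum_proj = nn.Linear(1, D_MODEL)       # binary drum-onset indicator
        self.fuse = nn.Linear(2 * D_MODEL, D_MODEL)  # concatenate, then project

    def forward(self, chord_ids, drum_onsets):
        # chord_ids:   (batch, n_frames) integer chord label per frame
        # drum_onsets: (batch, n_frames) 0/1 drum-onset indicator per frame
        c = self.chord_emb(chord_ids)
        d = self.drum_proj(drum_onsets.unsqueeze(-1).float())
        return self.fuse(torch.cat([c, d], dim=-1))  # (batch, n_frames, D_MODEL)

cond = ContentCondition()
chords = torch.randint(0, N_CHORD_CLASSES, (1, 10 * FRAME_RATE))  # 10 seconds of frames
drums = torch.randint(0, 2, (1, 10 * FRAME_RATE))
frame_conditions = cond(chords, drums)  # would be fed to the frozen LM via a PEFT adapter (not shown)
```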

Reviews
Meta Review

Reviewers agree that this paper is well written and clearly presented. Further, the code to be published along with the paper is a huge plus.

Training controllable music generation with limited resources using LLaMA adapters is a logical approach, and it is great to see a model that can take the range of control modes presented. However, several reviewers point out that there is room for improvement in the evaluation. There are comments regarding the clarity of the description of the evaluation process, the need for stronger baselines, and the limited insight into how useful each mode of control is to the model. Please take careful note of each reviewer's comments and use them to address the outstanding concerns.


Review 1

This paper presents an approach for learning content-based controls for MusicGen, the large open-source text-to-music Transformer model. The controls come from an example piece of music, from which automatic chord recognition, music transcription, and audio codec features are computed. Results are presented in terms of chord recognition and beat detection accuracy; however, I found it difficult to put the results in context, as only upper-bound ground-truth results were provided, without any baseline or lower-bound comparisons.

Strengths:

  • The proposed conditioning mechanism is clearly explained.

  • Makes strong use of off-the-shelf transcription and source-separation tools.

Weaknesses:

  • Some of the design choices in the proposed conditioning scheme seem arbitrary.

  • No baseline results (not even unconditioned MusicGen) are included for chord and beat recognition accuracy.

  • I found it difficult to take away reusable insights about what did and did not work in the proposed approach. For example, training a model without the acoustic representation input would have been insightful, since someone could then condition the model without an audio example. The large drop in performance between full conditioning and chord-only conditioning makes it seem like the model is cheating quite a bit with the audio and melody inputs. Since MusicGen was already conditioned on audio and melody inputs, this makes it seem like the proposed extension to chord conditioning doesn't actually work in practice.

Specific comments:

  • Table 1: It would be nice to have an example here where the root and bass aren't always the same, unless they're required to always be the same, in which case why do you need both?

  • Section 4.2: It was unclear to me why the condition prefix tokens are the same length as the generated EnCodec tokens. Doesn't the number of generated EnCodec tokens change over time?

  • Equations (8)-(13): the notation for the W matrices appears to be overloaded; e.g., the same W symbols are used in both (8) and (11), but they must surely mean different things?

  • End of Section 4.2: it would be nice to list the symbols or refer to the equation numbers to specify exactly what the trainable parameters are. For example, I'm unclear as to what the "joint embedding" trainable parameters are.

  • End of Section 4.2: why assign a random text description during training, as opposed to no text conditioning at all? It seems like this could very negatively impact fine-tuning, and it makes it difficult to trust this work without exploring this aspect more (a sketch contrasting the two options appears after these comments).

  • Table 2: Why not include chord and beat accuracy results for unconditioned MusicGen as a lower bound?

  • Table 3: Why are the CLAP score results so much higher than in Table 2 and never discussed? Is this a typo?

  • Section 5.2: How are the 20-second samples chosen? Randomly? Can they overlap? Without this information, I'm unclear on what an epoch means.

  • Section 5.3: I’m confused about the different FAD scores. Why is * higher? Wouldn’t we expect * to be lower (i.e., better) since it includes the audio used for conditioning?

  • Section 5.4: The authors state that the approach "maintains text-control ability." What is the evidence for this? It appears that the CLAP score drops quite a bit.

  • Section 5.4.1: The text in this section doesn’t seem to match what I see in Table 3. As the number of trainable layers increases, Table 3 appears to show an increase in CLAP score, not a reduction as stated in the text.
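To make the Section 4.2 comment above concrete, the sketch below contrasts the two training-time options being discussed: assigning a random text description versus training with no text condition at all. The description pool and helper name are hypothetical illustrations, not the paper's code.

```python
import random

# Hypothetical illustration of the two options discussed in the Section 4.2
# comment above; the description pool and function are assumptions, not the
# paper's actual code.
RANDOM_DESCRIPTIONS = ["pop track with piano", "energetic rock song", "mellow jazz"]

def text_condition(mode: str, rng: random.Random) -> str:
    """Return the text prompt used for one fine-tuning example."""
    if mode == "random_text":   # what the paper reportedly does
        return rng.choice(RANDOM_DESCRIPTIONS)
    if mode == "no_text":       # the alternative the comment asks about
        return ""               # i.e., an empty / null prompt
    raise ValueError(f"unknown mode: {mode}")

rng = random.Random(0)
print(text_condition("random_text", rng))  # e.g., "energetic rock song"
print(text_condition("no_text", rng))      # ""
```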


Review 2

This paper utilizes frame-wise chord, MIDI, and content representations as conditions for content-based controllable generation. It also proposes a Condition Adaptor to efficiently fine-tune a large model on a small dataset, together with an attention mechanism in which learned condition prefixes effectively influence token prediction through the self-attention layers (a minimal sketch of this prefix-conditioning idea appears at the end of this review). As a result, the paper enables efficient content-based control using only 300 music tracks. Another new finding is that such content-based control further improves the fidelity of the generated music (Table 2, FAD score). Despite these many new attempts and discoveries, the paper has some drawbacks:

  1. Lack of rigorous evaluation. The paper does not describe the inference method for the baseline MusicGen model. How did you run inference with MusicGen on the RWC-POP-100 data? Did you input the chords, beat timings (e.g., [0.27, 0.88, 1.49, 2.06, 2.66, 3.26, 3.88, 4.49, 5.10]), and MIDI melody notes as separate textual representations? If you used text conditions such as the Main Category of RWC-POP-100 (a single word: pop), it would be a meaningless comparison. Furthermore, if this paper aims to strictly compare text-based vs. content-based generation, you should compare against a model trained on natural-language expressions of the beat sequence, chord sequence, or melody notes.

  2. Joint embedding. Can the concatenation and linear projection of different data really be called a joint embedding? A recent representation-learning reference [1] defines joint embedding as a method "which learns to output similar embeddings for compatible inputs, x, y, and dissimilar embeddings for incompatible inputs" (Chapter 2). The joint embedding described in this paper seems closer to fusion or a mixed representation.

  3. Regarding the layer-wise functionality of the Transformer layers claimed in line 359, it would be better to rigorously compare fine-tuning only the lower layers against fine-tuning only the higher layers. Otherwise, the influence of the additional fine-tuning parameters is conflated with the layer choice, lowering the validity of the claim.

  4. It is unclear whether the time-varying sequential conditions work because of the proposed condition-prefix-based adaptor or simply because the sequential representation of the condition is preserved. It is regrettable that a comparison with a sequence-representation-conditioned LoRA (the same conditioning method as MusicGen) was not reported.

[1] Mahmoud Assran et al., "Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture." https://arxiv.org/pdf/2301.08243


In light of this, I recommend a weak accept, considering that the above limitations have little impact on the content-based conditioning itself and can be addressed in the camera-ready version or in follow-up research.
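For readers unfamiliar with the prefix-conditioning mechanism summarized at the top of this review, here is a minimal sketch of the general idea (gated prefix attention in the LLaMA-Adapter style). The shapes, the separate gated attention over the prefix, and all module names are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

D, H = 1024, 16  # assumed hidden size and number of attention heads

class PrefixConditionedAttention(nn.Module):
    """Sketch of gated prefix attention (LLaMA-Adapter style), not the paper's exact design.

    The frozen self-attention attends over the generated tokens as usual; an extra
    attention over the learned condition prefix is added through a zero-initialized,
    trainable gate so fine-tuning starts exactly from the frozen model's behavior.
    """
    def __init__(self):
        super().__init__()
        self.qkv = nn.Linear(D, 3 * D, bias=False)      # frozen, from the pretrained LM
        self.out = nn.Linear(D, D, bias=False)          # frozen
        self.gate = nn.Parameter(torch.zeros(H, 1, 1))  # trainable, starts at zero

    def forward(self, x, prefix):
        # x:      (B, T, D) hidden states of the generated audio tokens
        # prefix: (B, P, D) learned condition-prefix hidden states
        B, T, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        _, pk, pv = self.qkv(prefix).chunk(3, dim=-1)   # prefix keys/values

        def split(t):  # (B, L, D) -> (B, H, L, D // H)
            return t.view(B, t.shape[1], H, D // H).transpose(1, 2)

        q, k, v, pk, pv = map(split, (q, k, v, pk, pv))
        attn_tokens = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        attn_prefix = F.scaled_dot_product_attention(q, pk, pv)  # prefix fully visible
        y = attn_tokens + torch.tanh(self.gate) * attn_prefix    # gated prefix contribution
        return self.out(y.transpose(1, 2).reshape(B, T, D))
```

Because the gate is initialized to zero, the module initially reproduces the frozen model's attention, which is the usual motivation for this style of adapter.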


Review 3

The strengths of this paper are clear: (1) the idea of introducing musical context control to LLM-based music generation systems; (2) the use of LLaMA-Adapter-style PEFT for this application; (3) the clear and solid presentation of the methodology and implementation details.

The weaknesses of this paper are: (1) To make the work more insightful, a study of fine-tuning dataset scale versus the evaluation metrics would be helpful. (2) Since music generation systems nowadays are usually not publicly available or easily reproducible due to data and compute requirements, it is not possible to compare with other context-based controllable systems such as the diffusion-based Music ControlNet or Mustango, but they should be cited in the paper; a comparison with Mustango under the same criteria would also be a plus. (3) I'm interested in the masking scheme introduced in Section 4.1.3: it is not clear why only the MIDI and acoustic embeddings are masked, nor how the ratio r influences the final performance, and there is no intuitive explanation or reference justifying this choice (see the sketch below). (4) I wonder whether the proposed system is able to "outpaint" on previously given controls, so that it can improvise a bit while remaining consistent. (5) In terms of writing, Section 3 could be merged into a subsection of Section 4.
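Regarding point (3), the sketch below shows one way such frame-level masking could look. The shared mask across MIDI and acoustic embeddings, the frame granularity, and the value of the ratio r are assumptions for illustration, not the paper's specification.

```python
import torch

def mask_conditions(midi_emb, acoustic_emb, r=0.4):
    """Randomly zero out condition frames with ratio r (illustrative sketch only).

    Only the MIDI and acoustic embeddings are masked, mirroring the description
    in Section 4.1.3; whether the two share a single mask is an assumption here.
    """
    # midi_emb, acoustic_emb: (batch, n_frames, dim)
    B, T, _ = midi_emb.shape
    keep = (torch.rand(B, T, 1) >= r).float()    # 1 = keep the frame, 0 = mask it
    return midi_emb * keep, acoustic_emb * keep

midi = torch.randn(2, 100, 1024)       # assumed shapes
acoustic = torch.randn(2, 100, 1024)
midi_m, acoustic_m = mask_conditions(midi, acoustic, r=0.4)
```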


Author description of changes:

Rephrased the methodology section to make it clearer; added references to the related-work section; corrected typos in the experiments section.