Abstract:

We present JASCO, a temporally controlled text-to-music generation model utilizing both symbolic and audio-based conditions. JASCO can generate high-quality music samples conditioned on global text descriptions along with fine-grained local controls. JASCO is based on the Flow Matching modeling paradigm together with a novel conditioning method that allows for both locally (e.g., chords) and globally (text description) controlled music generation. Specifically, we apply information bottleneck layers in conjunction with temporal blurring to extract relevant information with respect to specific controls. This allows the incorporation of both symbolic and audio-based conditions in the same text-to-music model. We experiment with various symbolic control signals (e.g., chords, melody), as well as with audio representations (e.g., separated drum tracks, full-mix). We evaluate JASCO considering both generation quality and condition adherence, using objective metrics and human studies. Results suggest that JASCO is comparable to the evaluated baselines in terms of generation quality, while allowing significantly better and more versatile controls over the generated music. Samples are available on our demo page: https://pages.cs.huji.ac.il/adiyoss-lab/JASCO
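As an illustration of the conditioning idea described above, temporal blurring over a frame-wise control signal combined with a channel bottleneck might be implemented roughly as follows. The module name, window size, and dimensions are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class BlurredBottleneck(nn.Module):
    """Hypothetical conditioning stem: channel bottleneck followed by a temporal blur."""

    def __init__(self, in_dim: int, bottleneck_dim: int, blur_win: int = 8):
        super().__init__()
        self.blur_win = blur_win
        self.proj = nn.Linear(in_dim, bottleneck_dim)            # information bottleneck
        self.pool = nn.AvgPool1d(kernel_size=blur_win, stride=blur_win)

    def forward(self, cond: torch.Tensor) -> torch.Tensor:
        # cond: (batch, time, in_dim) frame-wise control signal (e.g., chroma or a latent)
        z = self.proj(cond)                                      # reduce channel capacity
        z = self.pool(z.transpose(1, 2))                         # average over short windows
        z = torch.repeat_interleave(z, self.blur_win, dim=-1)    # back to the original frame rate
        return z.transpose(1, 2)                                 # (batch, time, bottleneck_dim)


cond = torch.randn(2, 256, 128)                                  # dummy condition frames
smoothed = BlurredBottleneck(in_dim=128, bottleneck_dim=16)(cond)
print(smoothed.shape)                                            # torch.Size([2, 256, 16])
```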

Reviews
Meta Review

We have thoroughly reviewed this paper from various perspectives, and we all believe that it is ready for publication with minor revisions. For more details, please see the comments of each reviewer. It is also highly recommended that you add more details about the user study.


Review 1

This paper introduces a novel approach to controllable text-to-music generation. It leverages pretrained models for chord estimation, f0 classification, and source separation. The preprocessed signals from these models are fed into a conditional flow matching model, which consists of a transformer with residual connections, to enable chord conditioning, melody conditioning, and audio conditioning, along with textual inputs. Classifier-free guidance is employed to ensure control adherence during the inference stage. The model was evaluated both quantitatively and qualitatively. Experimental results demonstrate that the model maintains generation quality while offering a high degree of controllability.
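For context, my understanding is that the classifier-free guidance step at inference mixes conditional and unconditional predictions of the velocity field roughly as in the sketch below; the interface and guidance scale are my own illustrative assumptions, not the authors' code.

```python
import torch


def guided_velocity(model, x_t, t, text_cond, local_cond, guidance_scale: float = 3.0):
    """Classifier-free guidance over a learned velocity field (hypothetical interface)."""
    v_cond = model(x_t, t, text=text_cond, local=local_cond)  # all conditions present
    v_uncond = model(x_t, t, text=None, local=None)           # conditions dropped
    # Push the prediction away from the unconditional estimate, toward the conditioned one.
    return v_uncond + guidance_scale * (v_cond - v_uncond)
```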

The paper is well-organized and clearly written. Related works are cited, the experimental design is clearly described, and the evaluation metrics are reasonable. Well-designed experiments and thorough evaluation highlight its strengths. Although the model was trained with proprietary data, the paper provides detailed descriptions of their experiments, and the inclusion of code and pretrained models facilitates reproducible research. Notably, this work bridges the gap between music analysis research from the MIR community and data-driven generative models.

I have three minor questions/comments:

* Line 417: Why is there a significant gap in FAD? Is it due to the model characteristics or the training scheme? Addressing this would provide more insight into evaluation metrics for generative models.
* Emboldening the best performance in each table would enhance readability.
* The referenced papers are missing conference names. Adding these would improve the completeness of the citations.


Review 2

Promising article on text-to-music generative AI, extending model customizability with audio and symbolic input. The algorithm design and details are the strong points here, and while the blind process does not allow reviewers to access the code, the provided audio examples suggest the approach is quite successful in scope, even if improvements could be made in terms of audio quality.

This is the type of research where the novelty dictates that human subjective evaluation is paramount, but the authors treat it as an afterthought and provide little to no detail on this part of the evaluation study. Furthermore, this part of the results is the one with the most generous and misleading interpretations: on all counts except the one that could not be compared (the one including drum input), JASCO fared worse (still, the paper states "however, when considering melody conditioning, reaches significantly better scores"). The fact that it did not fare better is not the problem; the problem is the description of the process.

The article's structure is unconventional: I fail to understand why related work appears between the analysis and the discussion, and why the analysis is more rushed than it should be. It is also not clear why the authors chose to separate the results from the analysis, as the split is not as clear-cut as it usually is. There are some minor typos in the text, particularly singular-plural problems (e.g., lines 322, 445) and excess spaces (e.g., lines 153, 249), which should be revised.


Review 3

The JASCO paper introduces an innovative model that advances the text-to-music generation domain, employing Flow Matching and hybrid conditioning to enable detailed control over music generation. The model's integration of symbolic and audio inputs through advanced conditioning mechanisms presents a pioneering approach that enhances the quality and versatility of generated music. The samples showcased on the demo page demonstrate impressively good performance, adhering well to the specified controls.

While the paper is generally well-written and informative, there are areas where further details could enhance understanding and application:

In Section 3.1, the paper would benefit from a more detailed explanation of why specific pre-trained models such as the multi-F0 classifier, the source separation model, and the melody extraction models were selected. For example, it could provide some discussion of their performance metrics, computational efficiency, or effectiveness in diverse settings. Additionally, elaborating on any adaptations made to integrate these pre-trained models into the JASCO framework would be useful. Similarly, in Section 3.3, it would be helpful if the paper clarified why the ODE solver was chosen for the inference process.
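For concreteness, by "the ODE solver" I mean the numerical integration of the learned velocity field at inference time; a minimal Euler-style sketch of what I assume this looks like is given below, with purely illustrative names and step count.

```python
import torch


def euler_sample(velocity_fn, x0: torch.Tensor, num_steps: int = 64) -> torch.Tensor:
    """Integrate dx/dt = v(x, t) from noise (t=0) to data (t=1); illustrative only."""
    x, dt = x0, 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((x.shape[0],), i * dt, device=x.device)
        x = x + dt * velocity_fn(x, t)
    return x
```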

In Section 3.1 (Audio), the process for converting discrete tokens back into continuous vectors needs clarification. Is this achieved through interpolation or direct mapping? Furthermore, I am curious why the authors chose the continuous latent representation converted from the first codebook, rather than using the continuous latent tensor directly from the EnCodec model.
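To be concrete, by "direct mapping" I mean something like an embedding-table lookup over the first codebook, as sketched below with made-up dimensions.

```python
import torch

codebook = torch.randn(1024, 128)           # first codebook: (num_codes, latent_dim), illustrative sizes
tokens = torch.randint(0, 1024, (2, 500))   # (batch, time) discrete indices from that codebook
continuous = codebook[tokens]               # (batch, time, latent_dim) continuous vectors
print(continuous.shape)                     # torch.Size([2, 500, 128])
```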

In Section 3.2 on Inpainting/Outpainting, the term 'Outpainting' is used without a definition, which might lead to confusion.

In Section 4.1, the paper could include more details about the human study, such as the number of raters involved and the number of samples each assessed.

In Section 5, it would be helpful if the paper addressed the limitations or common errors associated with those pre-trained models within the JASCO system. For example, how might the errors in chord extraction or melody extraction affect the final generations?

One minor point regarding Table 2: I am not sure what "Mld(clf)" refers to.

You might also want to fix the mis-encoded title "Moûsai" in reference [4]: F. Schneider, O. Kamal, Z. Jin, and B. Schölkopf, "Moûsai: Text-to-music generation with long-context latent diffusion," arXiv preprint arXiv:2301.11757, 2023.


Author description of changes:

We would like to thank all the reviewers for their insightful reviews.

In this Camera-Ready version we did our best to address most of the highlighted reviewer notes:
- Added a clear enumeration of our main contributions at the end of the introduction.
- Added details regarding the number of raters that took part in our human evaluation.
- Added intuition for the information bottleneck design choices (why we use the continuous latent of the first codebook and discard all others).
- Made clarifications regarding concepts and notations such as 'Outpainting' and 'MLD(clf)'.
- Fixed an invalid citation and grammar issues.