Abstract:

Recent advancements in deep generative models present new opportunities for music production but also pose challenges, such as high computational demands and limited audio quality. Moreover, current systems frequently rely solely on text input and typically focus on producing complete musical pieces, which is incompatible with existing workflows in music production. To address these issues, we introduce Diff-A-Riff, a Latent Diffusion Model designed to generate high-quality instrumental accompaniments adaptable to any musical context. This model offers control through either audio references, text prompts, or both, and produces 48kHz pseudo-stereo audio while significantly reducing inference time and memory usage. We demonstrate the model's capabilities through objective metrics and subjective listening tests, with extensive examples available on the accompanying website.

Reviews
Meta Review

This paper was generally very well received by the reviewers. Congratulations to the authors on a wonderful piece of work!

I would encourage the authors to go through every reviewer's comments thoroughly, particularly the missing references that reviewer #5 brought up.


Review 1

Strengths:

  • The application of the proposed system (single-instrument generation given a partial-production context) is highly practical and facilitates human creative processes.

  • A thorough human study is performed, and its results are aligned with the paper's method and quantitative results.

  • The ability to generate single instruments from a partial musical context, with or without specification through textual descriptions, is highly practical for music production and is a major contribution.

  • The ability to condition on a reference single-instrument audio is highly practical for music production and enhances controllability compared to prior work.

  • The reported human study, together with the supplementary demo page, demonstrates state-of-the-art quality of the generated samples.

Weaknesses:

  • Lacks an ablation study on the multi-source classifier-free guidance sampling and on its chosen coefficients.

  • Lacks an ablation study on the chosen pseudo-stereo width.


Review 2

Strengths:

  • Overall, the writing is high quality and easy to understand.

  • The model architecture design is reasonably straightforward and builds upon existing diffusion design principles.

  • The demo results seem particularly strong, generating high-quality musical accompaniments.

Weaknesses: Overall, the demo results are quite impressive, and the only current critiques concern part of the evaluation. Though the paper's central goal is building a TTM accompaniment model, the evaluation does not fully assess this, in part due to a lack of comparison with relevant work. While omitting a large-scale comparison to StemGen/SingSong is understandable given that they are closed-source, both models have a reasonable suite of public demo examples with the isolated context conditioning that could have been used in a listening study against the present work. More saliently, the paper omits any comparison with (or even mention of) the established Multi-Source Diffusion Models (MSDM), which is open source and designed to perform tasks like accompaniment generation, and thus could be directly compared to Diff-A-Riff on large-scale objective metrics (MSDM is not text-conditioned, though this shouldn't affect evaluation here). Without either comparison, it is hard to assess Diff-A-Riff's ability at its original goal of accompaniment generation relative to existing work, and such inclusions would improve the paper as a whole.


Review 3

This paper is great. Its novelty is okay, but the system is very well designed and executed. I don't have any issues with this paper. The demo page looks great. Beyond the major contributions, the pseudo-stereo method is a small but smart design choice.


Author description of changes:

  • Included missing references in the related work section, as suggested by reviewers.

  • Updated the Real-Time Factor results for CPU inference (there was a mistake in the calculation; the corrected results are slightly better).

  • Clarified missing values in Table 2.

  • Fixed reference formatting.