Abstract:

In this paper, we propose and investigate the use of neural audio codec language models for the automatic generation of sample-based musical instruments from text or reference audio prompts. Our approach extends a generative audio framework to condition on pitch across an 88-key spectrum, velocity, and a combined text/audio embedding. We identify maintaining timbral consistency within the generated instruments as a major challenge. To tackle this issue, we introduce three distinct conditioning schemes. We analyze our methods through objective metrics and human listening tests, demonstrating that our approach can produce compelling musical instruments. Specifically, we introduce a new objective metric to evaluate the timbral consistency of the generated instruments and adapt the average Contrastive Language-Audio Pretraining (CLAP) score for the text-to-instrument case, noting that its naive application is unsuitable for assessing this task. Our findings reveal a complex interplay between timbral consistency, the quality of generated samples, and their correspondence to the input prompt.
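
To make the evaluation setup concrete, below is a minimal sketch of how an average CLAP score and an embedding-based timbral-consistency measure could be computed over the samples of one generated instrument. It assumes precomputed CLAP embeddings; the helper functions and names are hypothetical illustrations under those assumptions, not the paper's exact formulation.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def average_clap_score(text_emb, audio_embs):
    """Mean text-audio CLAP similarity over all samples (pitch/velocity pairs)
    of a single generated instrument, given the prompt's CLAP text embedding."""
    return float(np.mean([cosine(text_emb, e) for e in audio_embs]))

def timbral_consistency(audio_embs):
    """Mean pairwise similarity between the CLAP audio embeddings of an
    instrument's samples; higher values indicate a more uniform timbre."""
    n = len(audio_embs)
    sims = [cosine(audio_embs[i], audio_embs[j])
            for i in range(n) for j in range(i + 1, n)]
    return float(np.mean(sims))
```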

Reviews
Meta Review

All the reviewers agree that this is a novel and solid piece of work that should be presented at the conference. The reviewers also mentioned a few points that the authors can take into account to further improve the quality of the paper.


Review 1

This work addresses the task of generating a sampler instrument. That is, it applies a generative model to produce a set of timbrally coherent audio samples that can be mapped to a range of MIDI velocities and pitches in such a way as to create a playable instrument.

Orthogonal to my review of the work’s scientific merit, I want to commend the authors for focusing on a task that is of clear creative benefit to musicians, i.e. by creating playable instruments, as opposed to simple end-to-end generation of complete tracks. This is a valuable research direction and a strong example to the community of how work on generative AI can positively impact creative professionals.

It appears this paper is based on work previously presented at the NeurIPS workshop on machine learning for audio. While my view is that this is an excellent piece of work, it does mean that my recommendation really hinges on whether this ISMIR submission contains sufficient novelty to justify its acceptance as new work.

As far as I can tell from the authors' stated contributions, the framing of the text-to-instrument task and the audio codec language model solution to the task are close to identical to what was presented in the workshop paper. However, the sample-to-instrument task appears to be novel. Further, the timbral consistency metric was used as a loss function in the workshop paper, but here it appears to be present only as a metric and is no longer used for training. Further novel contributions include: (i) the adaptation of the average CLAP score metric; (ii) the three CLAP conditioning schemes, which account for the variance induced by pitch and velocity conditioning; and (iii) the inclusion of experiments on non-autoregressive transformers. Finally, the subjective evaluation is also an improvement on the workshop paper; in particular, the choice to use a MUSHRA-style test for the S2I task is arguably more robust.

I do note the absence of any discussion of sampler instrument design. Indeed, sample libraries and sampler instruments constitute a significant portion of the modern music technology industry, so I'm slightly surprised that the authors have not gone into further detail about what developing a commercially viable sampler instrument actually entails. Considerations such as round-robin sampling, separate attack/sustain samples, and the use of sample zones to trade off quality against disk I/O and file size are all relevant, and should arguably play a role in determining whether the stated aim of the paper has been achieved.

Nonetheless, my overall impression is that this work contains sufficient novelty to be of value to the ISMIR community, and so I'm happy to recommend acceptance.


Review 2

This paper discusses the use of generative systems based on neural audio codec language models to address the task of generating sample-based instruments (i.e., generating several audio files corresponding to different musical notes with different pitches and velocities, with timbral consistency) given a text prompt or an audio example. To the best of my knowledge, the generation of sample-based instruments from text (T2I) or from audio (S2I) are new MIR tasks introduced in this paper. The authors provide an evaluation methodology that incorporates a novel metric for the timbral consistency of a set of sounds, which is a required quality of sample-based instrument sound sets. The authors also propose three variants of a system that address both the T2I and S2I tasks, and carry out both quantitative and qualitative evaluations, which show that the proposed systems can successfully address the task, although a trade-off exists between timbral consistency and expressivity (measured using FAD).

The paper is well structured and very well written, with proper references and great detail; however, it is not clear whether code will be made available to reproduce the results. Even though a link to a companion website is provided, it seems to include only audio examples and no code so far.

Overall I think this is a relevant paper for the ISMIR community, and I therefore recommend accepting it. What follows is a list of minor comments that the authors could address to improve the paper:

  • L74-80: Maybe rephrase these sentences? I think they are a bit confusing. I guess the point is that using parametric synthesisers or DDSP-based approaches, which would pre-define some parameters, is out of scope for this work because the authors consider this would severely limit the output space, but maybe this could be clarified?

  • L87: "We introduce the text-to-instrument..." Shouldn't the "sample-to-instrument" task also be mentioned in this first point? Also, the acronyms T2I and S2I should first be introduced here.

  • L107: Maybe start the paragraph with a connector? "The remainder of this paper is organized as follows. Section 2..."

  • L112: illustrate -> illustrates (?)

  • L185: "The instrument family and source type (i.e., acoustic..." -> I think a couple of examples of instrument families would be good here.


Review 3

Strengths:

  • The proposal of the T2I/S2I tasks is reasonably novel and would be of key interest to many at ISMIR.

  • The overall parameterization of the codec LMs is straightforward and well thought out, with the randomization of the CLAP conditioning in particular a useful insight for the task.

Weaknesses:

  • The overall writing coherence and presentation are somewhat poor, which significantly impacts the ability to understand the paper in depth. In particular:

    1) Notation for describing the underlying generative process for the codec LM (Section 2.4) is rather uncharacteristic of previous work. While one can understand the desire to make the notation generalizable between the AR and non-AR methods, the current notation is somewhat opaque and obfuscates parts of the generative process (i.e., that the AR model is trained through next-token prediction).

    2) Most of Section 3 (and Figure 2) is extremely hard to parse. It is hard to tell what Figure 2 refers to, given that it is referenced in both Section 3.2 and Section 3.3. As the definitions of the evaluation metrics are exceedingly similar, the paper would be significantly improved by streamlining the legibility of this section and organizing its structure more effectively.


Author description of changes:

Thank you for your acceptance and the insightful feedback on our work!

We now connect the paper to our preliminary work, a workshop presentation that was not officially published/archived as per mlforaudioworkshop.com. Relative to that work, T2I is significantly expanded here (e.g., new CLAP conditioning variants, two new metrics, MAGNeT). While we are the first to use LMs for S2I, we refrain from claiming the task as our own given the cited prior works (DDSP, GANstrument).

While we acknowledge that generating specific parts of the fully assembled sample (e.g., attack/release samples) is indeed interesting, we consider sampler design to be industry-specific and omitted a discussion of it due to space constraints.

We aim to add more out-of-domain S2I examples to our demo page by ISMIR. Nonetheless, our existing test set results include samples and instruments not seen during training.

We have retained our notation, which we carefully considered prior to submission. With $\mathbf{x}_k(…)$, the arguments $(…)$ enable the selection of different waveforms $\mathbf{x}_k$. Equation 1 is defined to encapsulate both the AR and non-AR processes. Although the equations may appear similar, they capture the subtleties of the various topics introduced in this paper, which are essential to our methodology.

Now, we note that DAC supports up to 9 codebooks. As listed in our future work, we have since fine-tuned DAC and trained corresponding LMs, and hope to provide examples on our site leading up to ISMIR.

Generally, we have addressed connectors, typos, and notation reminders, linking the AR model to next-token prediction for clarity. We add the T2I acronym upon its first use. We retain the original text in lines 73-80 due to favorable comments from the meta-reviewer. While we aspire to provide code in the future, our current industry position precludes us from doing so at this time.

Thank you again for your constructive feedback and the opportunity to present our work at ISMIR 2024!