Abstract:

The pursuit of replicating analog device circuits through neural audio effect modeling has garnered increasing interest in recent years. Existing work has predominantly focused on a one-to-one emulation strategy, modeling specific devices individually. However, the potential for a one-to-many emulation strategy remains an avenue yet to be explored. This paper presents such an attempt, utilizing conditioning mechanisms to emulate multiple guitar amplifiers through a single neural model. For condition representation, we use contrastive learning to build a tone embedding encoder designed to distill and encode the distinctive style-related features of various amplifiers, leveraging a dataset of comprehensive amplifier settings. Targeting zero-shot application scenarios, we also examine various strategies for tone embedding representation, evaluating referenced tone embedding against two retrieval-based embedding methods for amplifiers unseen at training time. Our findings showcase the efficacy and potential of the proposed methods in achieving versatile one-to-many amplifier modeling, contributing a foundational step towards zero-shot audio modeling applications.
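As a point of reference for the reviews below, the following is a minimal sketch of the pipeline the abstract describes, assuming a PyTorch setting; ToneEncoder, AmpGenerator, and all layer sizes are illustrative placeholders, not the paper's actual architecture.

```python
# Minimal sketch of the one-to-many idea: a tone encoder distills a reference
# recording into an embedding that conditions a single generator. All module
# names and sizes here are hypothetical, not the paper's architecture.
import torch
import torch.nn as nn

class ToneEncoder(nn.Module):
    """Maps a reference (amplified) signal to a fixed-size tone embedding."""
    def __init__(self, emb_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=15, stride=4), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=15, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.proj = nn.Linear(64, emb_dim)

    def forward(self, wet):                            # wet: (B, 1, T)
        return self.proj(self.conv(wet).squeeze(-1))   # (B, emb_dim)

class AmpGenerator(nn.Module):
    """One model for many amp tones, selected by the conditioning embedding."""
    def __init__(self, emb_dim=128, channels=32):
        super().__init__()
        self.inp = nn.Conv1d(1 + emb_dim, channels, kernel_size=3, padding=1)
        self.out = nn.Conv1d(channels, 1, kernel_size=3, padding=1)

    def forward(self, clean, tone_emb):                # clean: (B, 1, T)
        cond = tone_emb.unsqueeze(-1).expand(-1, -1, clean.shape[-1])
        h = torch.relu(self.inp(torch.cat([clean, cond], dim=1)))
        return torch.tanh(self.out(h))

# Zero-shot use: the reference amp need not appear in the training set.
encoder, generator = ToneEncoder(), AmpGenerator()
clean = torch.randn(1, 1, 44100)        # dry guitar signal
reference = torch.randn(1, 1, 44100)    # any recording with the target tone
emulated = generator(clean, encoder(reference))
```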

Reviews
Meta Review

Summary

In this work, the authors present a new method to perform zero-shot amplifier/tone transfer from one recording to another. Compared to past work, the method focuses on the amplifier/tone application and on having a single network simultaneously model multiple amplifier devices, rather than a single device with multiple parameters. In this case, all devices are various forms of tone/amplifier FX. To train a style transfer model, a tone embedding module is learned via contrastive learning. This embedding module is then used to condition a second module that processes the audio. Evaluation compares a few within-method configurations as well as a search-based method; the search-based method, however, is only used for testing out-of-domain tones.

Initial Reviews

Overall, the initial reviews for this work are mostly positive, with 2x weak accept, 1 strong accept, and 1 weak reject (meta). Positive points for the work include:
• R2: “well motivated and effectively builds on previous work to enable zero-shot audio effect style transfer in the context of guitar amplifier modeling.”
• R2: “well organized and the proposed method is described in sufficient detail.”
• R3: “Personally, I appreciated the approach of this work.”
• R4: “a strong piece of research with a clearly motivated model design, sufficient evaluation, and a generally high quality of presentation.”

Areas for improvement include:
• R2: “discussion on similarities and differences with related work would help to place the proposed method within the context of existing work.”
• R2: “evaluation may be lacking to fully understand the efficacy of the proposed method”
• R2: “no discussion on the computational efficiency of this method. Does this model enable real-time processing”
• R3: “the architecture used for the encoder "3.5 Implementation Details" is unclear… a lot of guesswork to anyone trying to reproduce or compare to this approach.”
• R4: “The authors appear to have chosen not to reveal which architecture they have used for the tone embedding encoder”
• See also the initial meta review.

Discussion

During the discussion, there were some comments on issues that could justify a reject, but multiple reviewers stated that they would hold to their positive scores and champion the work.

Recommendation

Given the initial reviews and the reviewers' willingness to champion the work during the discussion, we recommend acceptance. Please also address the issues raised in the initial meta review.


Review 1

Summary

In this work, the authors propose a method for zero-shot guitar amplifier modeling. They achieve this by first training an encoder on a contrastive pretext task to extract features related to audio effect style, or “tone”, in the context of guitar amplifiers. They then produce a tone embedding from a reference signal, which is used to condition a neural audio effect modeling network, called the generator. This generator is trained with a dataset of paired examples and a reconstruction loss. The convolutional model processes a clean signal to produce an output that has the same tone as the reference signal.
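As a rough illustration of this training setup (not the authors' exact objective), one generator update with paired data and a reconstruction loss might look like the sketch below; the frozen encoder and the log-magnitude STFT loss term are assumptions.

```python
# Hedged sketch of generator training with paired (clean, amplified) examples.
# The specific loss terms and the frozen tone encoder are assumptions.
import torch
import torch.nn.functional as F

def reconstruction_loss(pred, target, n_fft=1024, hop=256):
    """Time-domain L1 plus log-magnitude STFT L1."""
    time_l1 = F.l1_loss(pred, target)
    win = torch.hann_window(n_fft, device=pred.device)
    P = torch.stft(pred.squeeze(1), n_fft, hop, window=win, return_complex=True).abs()
    T = torch.stft(target.squeeze(1), n_fft, hop, window=win, return_complex=True).abs()
    spec_l1 = F.l1_loss(torch.log1p(P), torch.log1p(T))
    return time_l1 + spec_l1

def train_step(encoder, generator, optimizer, clean, reference, target):
    # `reference` shares the amp configuration with `target`, but its musical
    # content may differ from `clean`; the tone encoder is kept frozen here.
    with torch.no_grad():
        tone_emb = encoder(reference)
    pred = generator(clean, tone_emb)
    loss = reconstruction_loss(pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```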

Strengths

The proposed method is well motivated and effectively builds on previous work to enable zero-shot audio effect style transfer in the context of guitar amplifier modeling.

The manuscript is well organized and the proposed method is described in sufficient detail.

Weaknesses

Further discussion on similarities and differences with related work would help to place the proposed method within the context of existing work.

The presented evaluation may be lacking to fully understand the efficacy of the proposed method. The first experiment demonstrates that ToneEmb conditioning can enable one-to-many modeling, but the conclusions rest only on objective metrics. While a listening test is not strictly required, even a simple perceptual study would significantly strengthen the conclusions of the work.

While the authors do provide a discussion on some potential limitations of their work, they fail to address some important aspects. For example, there is no discussion on the computational efficiency of this method. Does this model enable real-time processing? This is critical for guitar amplifier modeling applications.
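One way to answer the real-time question empirically is a simple real-time-factor measurement of the trained generator on CPU; the one-second buffer and the timing loop below are a sketch, not a claim about the paper's model.

```python
# Real-time factor: compute time per second of audio divided by one second.
# A value below 1.0 means the model runs faster than real time on this device.
import time
import torch

@torch.no_grad()
def real_time_factor(generator, tone_emb, sample_rate=44100, n_runs=20):
    buffer = torch.randn(1, 1, sample_rate)     # one second of dry signal
    generator.eval()
    start = time.perf_counter()
    for _ in range(n_runs):
        generator(buffer, tone_emb)
    elapsed = (time.perf_counter() - start) / n_runs
    return elapsed / 1.0                        # seconds of compute per second of audio
```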

The authors do not mention if they will provide open source code or datasets. While this is not strictly required per ISMIR guidelines, it would further strengthen the work.

Questions

The claim of “unpaired references” used to train the generator may be problematic. While the underlying content of the reference and the input (clean) may differ, these training examples are “paired” in the sense that the data must be synthetically generated such that the guitar amplifier configuration is identical between the two recordings. In other contexts, “unpaired” data generally means that no such correspondence between the recordings exists at all.
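To make the distinction concrete, the kind of configuration-level pairing the reviewer describes could be synthesized as in the sketch below; the render_amp helper is hypothetical.

```python
# The reference and the target are rendered with the *same* amplifier
# configuration, so the data is paired at the configuration level even though
# the reference's musical content differs from the input. `render_amp` is a
# hypothetical helper standing in for the actual data-generation pipeline.
def make_training_example(render_amp, amp_config, clean_a, clean_b):
    target = render_amp(clean_a, amp_config)     # what the generator must reproduce
    reference = render_amp(clean_b, amp_config)  # same settings, different content
    return clean_a, reference, target
```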

Recommendation

The proposed method is novel in that it combines two existing methods to enable a new task of zero-shot guitar amplifier modeling. While the presented evaluation demonstrates the efficacy of some aspects of the proposed method, the evaluation could be stronger, which limits the potential strength of the conclusions and their overall generalization. As a result, this work is recommended for a weak acceptance. The authors are encouraged to strengthen the work through more rigorous evaluation, which could include a perceptual study, the inclusion of more zero-shot baselines, and a more detailed case study.


Review 2

In this work, a tone embedding is developed using contrastive training. This embedding is used to train a decoder architecture that conditions on the tone embedding. The use of the embedding allows simultaneous modeling of different amp types within a single model.

Two leading conditioning strategies, concatenation and FiLM, are compared. The work is evaluated using both seen and unseen amplifiers. The authors identify a current weakness in the system output: it fails to generate high-frequency components.
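For reference, the two conditioning strategies mentioned above can be sketched as follows; the module names and sizes are illustrative, not the authors' implementation.

```python
# Concatenation vs. FiLM conditioning applied to a hidden activation h of
# shape (B, C, T), given a tone embedding of shape (B, E). Illustrative only.
import torch
import torch.nn as nn

class ConcatCondition(nn.Module):
    """Broadcast the embedding along time and concatenate it as extra channels."""
    def __init__(self, channels, emb_dim):
        super().__init__()
        self.mix = nn.Conv1d(channels + emb_dim, channels, kernel_size=1)

    def forward(self, h, emb):
        cond = emb.unsqueeze(-1).expand(-1, -1, h.shape[-1])
        return self.mix(torch.cat([h, cond], dim=1))

class FiLMCondition(nn.Module):
    """Feature-wise Linear Modulation: per-channel scale and shift from the embedding."""
    def __init__(self, channels, emb_dim):
        super().__init__()
        self.to_gamma_beta = nn.Linear(emb_dim, 2 * channels)

    def forward(self, h, emb):
        gamma, beta = self.to_gamma_beta(emb).chunk(2, dim=-1)
        return gamma.unsqueeze(-1) * h + beta.unsqueeze(-1)
```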

As pointed out by another reviewer, the architecture used for the encoder (Section 3.5, "Implementation Details") is unclear. This leaves a lot of guesswork for anyone trying to reproduce or compare to this approach.

One point on Figure 4: the diagram suggested to me that the embedding failed to distinguish many different types of tones, as I see only two big clusters and quite a bit of overlap. That t-SNE fails to separate the amps makes me wonder how separable they are in the embedding space. Therefore, I wonder if the diversity of the modeled tones is low, if the embedding fails to distinguish between some tones, or both.
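One way to address this concern is to quantify separability directly in the embedding space rather than in the 2-D projection, for example with a silhouette score; in the sketch below, embeddings and amp_labels are assumed to come from the trained tone encoder and the evaluation set.

```python
# t-SNE alone cannot tell whether the overlap comes from the embedding or from
# the 2-D projection; a label-aware score in the original space is one check.
from sklearn.manifold import TSNE
from sklearn.metrics import silhouette_score

def separability_report(embeddings, amp_labels, seed=0):
    # Cluster quality in the original embedding space, in [-1, 1].
    score = silhouette_score(embeddings, amp_labels)
    # 2-D projection for visual inspection, as in a t-SNE plot like Figure 4.
    coords = TSNE(n_components=2, random_state=seed).fit_transform(embeddings)
    return score, coords
```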

Personally, I appreciated the approach of this work. For future work, I would love to hear this approach applied to a larger range of production styles/effects, not just amplifier simulations.


Review 3

This work addresses neural modelling of guitar amplifiers in the zero-shot setting. That is, it seeks to find a model that can generalise to amplifiers unseen at training time using a conditioning mechanism. This is achieved using a combination of a relatively standard GCN model for audio effect modeling, and an encoder trained with a contrastive (SimCLR) objective, where different pieces of input audio with the same processing constitute positive pairs.
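For context, a SimCLR-style (NT-Xent) objective of the kind described could be sketched as below, where two excerpts rendered with the same amplifier setting form a positive pair; the temperature value and batch construction are assumptions.

```python
# NT-Xent contrastive loss: z1[i] and z2[i] embed two different audio excerpts
# processed with the same amp setting (positive pair); everything else in the
# batch serves as negatives. Temperature is an assumed hyperparameter.
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.1):
    """z1, z2: (B, D) embeddings of the two 'views' of each amp setting."""
    batch = z1.shape[0]
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)           # (2B, D)
    sim = (z @ z.t()) / temperature                               # cosine similarities
    mask = torch.eye(2 * batch, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask, float('-inf'))                    # exclude self-pairs
    # Row i's positive sits at i + B (first half) or i - B (second half).
    pos = torch.cat([torch.arange(batch, 2 * batch), torch.arange(batch)]).to(z.device)
    return F.cross_entropy(sim, pos)
```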

In many ways this piece of work is overdue: neural modeling of amplifiers and distortion circuits has to date been predominantly focused on fitting a single device, with more complex models employing conditioning mechanisms to account for different effect parameters. Generalising to multiple different devices is more challenging, owing to the variety of different designs and hence behaviours, but it is well suited to the zero-shot task formulation proposed here. This is, all round, a strong piece of research with a clearly motivated model design, sufficient evaluation, and a generally high quality of presentation. Whilst not a complete solution to the stated problem, this is a clear step towards it, and the paper both acknowledges the current limitations and proposes viable directions for future work.

The authors appear to have chosen not to reveal which architecture they have used for the tone embedding encoder, describing it instead simply as a “contemporary audio encoder”. Given the industrial collaboration they report, I assume this is simply an IP issue. If this is the case, however, it should be directly and clearly acknowledged in the text, rather than ambiguously omitting certain details. I hope the authors will make this change in the camera-ready version.

Otherwise, given the clear merits of this paper I am very happy to recommend this work for acceptance.


Author description of changes:

We would like to thank our reviewers for their valuable feedback. Their comments have helped us further improve and strengthen our work. In the camera-ready version of the paper, we made several updates to enhance clarity and improve the presentation of our results. We directly indicate the name of the collaborating company, Positive Grid, in Sections 3.1, 3.2, and 3.4, and we added an acknowledgements section to clearly mention the collaboration. We included a citation regarding other zero-shot FX style transfer tasks that do not model amplifier effects. We noted that the performance of the tone embedding is expected to be influenced by the number of amplifiers used in training; we plan to address this in future work by adding more amplifiers to the training set, which should also facilitate a better understanding of tone transformation in the generator. Regarding the computational efficiency of our method, we state the number of parameters in our generator and identify the development of a plugin version as future work. Since users of real-world guitar effect plugins may not have access to GPU resources, implementing a plugin version and measuring its computational cost in that setting is the more practical evaluation.