ST-ITO: Controlling audio effects for style transfer with inference-time optimization
Christian J. Steinmetz (Queen Mary University of London)*, Shubhr Singh (Queen Mary University of London), Marco Comunita (Queen Mary University of London), Ilias Ibnyahya (Queen Mary University of London), Shanxin Yuan (Queen Mary University of London), Emmanouil Benetos (Queen Mary University of London), Joshua D. Reiss (Queen Mary University of London)
Keywords: MIR fundamentals and methodology -> music signal processing; Generative Tasks -> transformations; MIR tasks -> music synthesis and transformation; MIR tasks -> similarity metrics; Musical features and properties -> timbre, instrumentation, and singing voice
Audio production style transfer is the task of processing an input recording to impart the stylistic elements from a reference recording. Existing approaches for this task often train a neural network to estimate control parameters for a set of audio effects. However, generalization of these systems is limited due to their reliance on synthetic training data and differentiable audio effects. In this work, we introduce ST-ITO, Style Transfer with Inference-Time Optimization, an approach that instead searches the parameter space of an audio effect chain at inference. This method enables control of arbitrary audio effect chains, including unseen and non-differentiable effects. Our approach employs a learned metric of audio production style, which we train through a simple and scalable self-supervised pretraining strategy, along with a gradient-free optimizer. Due to the limited existing evaluation methods for audio production style transfer, we introduce a four-part benchmark to comprehensively evaluate both audio production style metrics and style transfer systems. This evaluation demonstrates that our approach enables more expressive style transfer and improved generalization, highlighting the limitations of synthetic training data and differentiable audio effects.
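To make the inference-time search described above concrete, here is a minimal sketch of the optimization loop, assuming a frozen style encoder and the CMA-ES implementation from the pycma package; `apply_chain` and `embed` are hypothetical placeholders for an arbitrary (possibly non-differentiable) effect chain and a learned embedding such as AFx-Rep, not the authors' released code.

```python
import numpy as np
import cma  # pycma; any gradient-free optimizer could be substituted


def style_transfer_ito(x_in, x_ref, apply_chain, embed, n_params, iters=50):
    """Search normalized effect-chain parameters so that the processed input
    matches the production style of the reference under a learned embedding.

    `apply_chain(audio, params)` and `embed(audio)` are placeholders for an
    arbitrary, possibly non-differentiable effect chain and a style encoder.
    """
    z_ref = embed(x_ref)
    z_ref = z_ref / np.linalg.norm(z_ref)

    def objective(params):
        # Clip candidates into the normalized [0, 1] parameter range.
        y = apply_chain(x_in, np.clip(params, 0.0, 1.0))
        z = embed(y)
        z = z / np.linalg.norm(z)
        return 1.0 - float(np.dot(z, z_ref))  # minimize 1 - cosine similarity

    es = cma.CMAEvolutionStrategy(0.5 * np.ones(n_params), 0.25)
    for _ in range(iters):
        candidates = es.ask()
        es.tell(candidates, [objective(c) for c in candidates])
    best = np.clip(es.result.xbest, 0.0, 1.0)
    return apply_chain(x_in, best), best
```

Clipping to a normalized range is only one way to handle parameter bounds in this sketch; pycma also offers bound-constraint options that could be used instead.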
Reviews
This paper proposes an inference-time optimization approach for audio effect style transfer, in which the parameters of an audio effect chain are estimated based on the audio effect similarity between the output and the reference (query).
The paper proposes a novel approach, conducts thorough experiments, and presents the ideas and results clearly. However, reviewers do point out some limitations of this work. In particular, the (missing) invariance of the embedding to the content is the biggest limitation. As the paper aims to model audio production style rather than the underlying audio content, this (missing) invariance is an important issue and should be discussed in the paper.
Overall, all reviewers and the independent meta-review liked the idea and contributions of this work, and recommended "strong accept" or "weak accept". One reviewer and the meta-review also recommended "award quality", and the other reviewers did not object to this recommendation.
The paper proposes a new approach for audio effect parameter estimation from a reference recording. The approach uses a learned deep embedding that should discriminate between different "styles" while being invariant to content, which is also the main focus of the work. The actual optimization method is mostly treated as a black box, while a comprehensive evaluation is included to test the approach on two different tasks (style classification and style transfer).
This is a relevant contribution with a good contextualization, interesting experiments, and a meaningful selection of baselines. However, there are several issues that, in my opinion, should be addressed in a possible publication.
(1) I'm not sure I understand how the pretext task ensures a disentanglement of (or an invariance to) the content of the audio signal. A thought experiment: if the MLP for a classification task just learns to calculate $z_i - z_o$ in the first layer, the individual embeddings could theoretically still include all information about $x_i$ and $x_o$, respectively, while the classification can then focus on the style differences. So, in a sense, $g$ is clearly encouraged to include information about the effects in the embedding but is not discouraged from including content information? This could also explain the wide spread of cosine similarity for the Oracle condition in Fig. 5. I wonder if a different pretext task, like contrastive learning, could be beneficial here.
(2) The subjective results are quite inconclusive, which is also congruent with my impression when listening to the provided audio examples. A few observations/thoughts on the listening test:
- Why is the Oracle condition detected so inconsistently by the participants? Is there maybe a difference between individuals, so that if those who do not somewhat consistently identify the Oracle are excluded, the results become more pronounced?
- I think the music examples are not very well suited for the purpose, since the input itself already appears to be heavily processed.
- Also, the actual task is a bit unclear from the paper. I assume that 100 means most similar to the reference and 0 means completely dissimilar, and that participants were asked to consider only style and not content?
(3) The parameter estimation experiment does not take any interaction between parameters (within and across effects) into account. While this isolated experiment is an interesting starting point, it does not allow any conclusions about the "high-dimensional loss landscape" that is described by the embedding distance. It would be interesting to have a meaningful objective evaluation for "bad" local minima, which clearly seem to exist.
(4) The strategy of sampling parameter configurations for training the embedding model appears a bit odd to me. While this method makes reasonably sure that the 10 presets for each effect are quite distinct, there is no guarantee that "perceptually diverse" parameters are also "perceptually meaningful". Since some parameters are strongly interrelated (even between different plugins in the chain) and some parameters can be quite sensitive to change around their optimal working point (e.g. the delay speed around the actual tempo of the song), it may be better to train with manually tuned presets (which, to be fair, may be difficult to obtain).
(5) What is the meaning of the numbers in Table 1? Accuracy? F-Measure?
(6) Is there a plan to publish code and/or pre-trained models? Especially the embedding model could be useful as a baseline or starting point for many different tasks related to audio effects.
(7) Zero-shot style classification: Could forming the prototype as the mean embedding of multiple examples from each class, instead of taking just one example, be more robust? (A brief sketch of this variant follows after this list.)
(8) I think it would have been good to reflect the focus on the learned embedding in the title of the paper. (Not sure if it can still be changed.)
(9) Some minor issues:
- How are the actual "candidate parameters" sampled in the CMA-ES method? Is this the mean of the population or the population member with the smallest distance to the reference?
- l. 48: "adapt based on the external context" is unclear to me.
- Figure 3 caption: $x_0$ should be $x_o$ in the second line.
- l. 161: to be consistent with Fig. 3, it should be $z_o$ instead of $z_r$ here, I think.
- Eq. 3 and l. 205: Similarly, $z_i$ should be $z_o$ to be consistent with $x_o = f_c(...)$?
- l. 295: space missing.
- Table 1: While the meaning of each abbreviation can be guessed, they are not introduced anywhere (and they are in a different order than introduced in the text).
- l. 308: verb missing.
- l. 342: period missing.
- Table 2: How to obtain a "correlation coefficient" is only briefly explained in the table caption. This could be done more comprehensively in the text.
- ll. 381-382: "as good of" -> "a comparable".
- Fig. 6: The grouping of the subfigures makes it a bit difficult to compare related experiments.
- Fig. 6: I assume that the grey bar is for the Input as in Fig. 5?
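Regarding point (7), a minimal sketch of the suggested prototype variant is given below (class prototypes formed as the mean of several L2-normalized embeddings, classification by highest cosine similarity); the `embed` function and the dictionary-of-examples input are illustrative assumptions, not part of the paper.

```python
import numpy as np


def build_prototypes(examples_by_class, embed):
    """Average the L2-normalized embeddings of several examples per class.

    `examples_by_class` maps a class name to a list of audio arrays and
    `embed` is a placeholder for the learned style encoder.
    """
    prototypes = {}
    for label, examples in examples_by_class.items():
        zs = np.stack([embed(x) for x in examples])
        zs = zs / np.linalg.norm(zs, axis=1, keepdims=True)
        prototypes[label] = zs.mean(axis=0)
    return prototypes


def classify(x, prototypes, embed):
    """Assign the class whose prototype has the highest cosine similarity."""
    z = embed(x)
    z = z / np.linalg.norm(z)
    sims = {label: float(np.dot(z, p) / np.linalg.norm(p))
            for label, p in prototypes.items()}
    return max(sims, key=sims.get)
```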
This paper introduces a novel system for audio effect style transfer. It thoroughly describes the method and compares it with most previous works in this field through relevant experimentation. Various audio production style transfer tasks are benchmarked, showcasing the effectiveness of the method. Real-world audio production scenarios for arbitrary effects are also explored and evaluated via a listening test.
The paper has significant contributions. An elusive audio effect similarity metric has been achieved, and gradient-free optimization is proposed as a great alternative for dealing with the inherent difficulty of handling non-differentiable audio processors within learning-based systems.
I believe the paper is of great interest for this conference, and thus my overall evaluation is Strong Accept.
One limitation that could be included in the discussion is that, due to the inherent nonlinearity of various audio effects, the computation of the "Oracle" is ill-posed. This means that, for two recordings with different content, $A$ and $B$, applying an audio effect chain $\mathrm{Fx}$ with parameters $W_x$ won't necessarily yield the same music production style; this depends on the type of processing (or lack thereof) that $A$ and $B$ have already undergone. In terms of audio effect style similarity, $\mathrm{Fx}_{W_x}(A)$ is not (always) equal to $\mathrm{Fx}_{W_x}(B)$. Of course, the system seems to perform well even with this being omitted, but it could be included in the discussion to further strengthen the robustness and scientific contribution of this paper.
For further discussion, the paper could benefit from the authors commenting on the effectiveness of AFx-Rep for style transfer involving combinations of various audio effects, even though this metric was trained with only one effect at a time. What insights or intuition do the authors have about the reported performance of the system, given that one would expect a metric trained with one effect at a time to struggle to encode information relevant to multiple audio effects applied to an audio signal?
Also, it is reported that the system struggles with parameter estimation for chorus. Do the authors have any insight into whether time-varying modulation effects are somehow harder to model? The paper could also benefit from a brief clarification of this.
Finally, more details about the listening test could have been reported. What type of multiple stimulus test was performed? What question was asked to the participants? What were the listening conditions? Are any p-values relevant to the given results?
Minor comments about the paper:
- Figure 1: Could "sim" be replaced with AFx-Rep? This could highlight AFx-Rep as one of the main contributions of this work.
- Section 1, Line 85: Add a reference to DeepAFx-ST.
- Section 1, Line 91: The subjective listening test should be included as a contribution only if there is novelty in the subjective listening test presented, which I believe is not the case. Thus, I suggest removing this from the list of contributions.
- Section 3, Line 249: It is mentioned before that 20,000 segments were taken, but then it is said that random crops are applied as augmentations. It is not clear how these crops are being applied to the input and/or output. Is it always applied to both? What type of crops? Clarifying this could ease reproducibility of the paper.
- Section 5.1: Although the authors mention that it is difficult to draw conclusions with the audio production Style metric alone, readers would see that since ST-ITO models were trained using AFx-Rep, these models will perform better when measured via AFx-Rep. It could be interesting to also see how this evaluation goes when using the metrics that DeepAFx systems were trained with.
This work presents a self-supervised embedding trained to specifically attend to production style. This is coupled with a generic control strategy based on derivative-free optimization methods. The use of the embedding generalizes the control process, potentially allowing it to control novel sets of effects.
The proposed system is described quite clearly, employing many diagrams, which help to understand the technical models and experiments. The training and experiment procedures are described in detail.
A classification experiment shows that the system can identify production styles via the embedding, better than existing embedding approaches. Furthermore, attempts are made to mimic real production styles. This is no small task. My impression is that the parameter settings found in some of the real-world style transfer examples do not fit as well as they could. This, if explored, could lead to system improvements.
I have a doubt regarding the Oracle condition in the "Real world style transfer" task. For instance, I now see that the reason the Oracle has non-zero error is that the Oracle is still being compared to the reference embedding, which comes from a signal with entirely different content. I believe this may hide estimation errors that fall below the threshold of the inherent differences (due to content) between the reference and Oracle conditions.
However, wouldn't it make sense to also compare the error (both in embedding space and in parameter space) of the Oracle condition versus the algorithm conditions? In this way, every small deviation in parameter estimation would be visible (in either domain), potentially shedding more light on the performance of both the optimization algorithm and the ability of the embedding to represent small changes in parameters.
As someone interested in optimization, I wonder about the landscape of the objective function itself. This could be shown with a simple example with few parameters, plotting the cosine similarity directly. Furthermore, I wonder how well the chosen gradient-free method (CMA-ES) is really able to solve the problem.
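One simple way to produce the kind of visualization suggested above is to sweep two normalized parameters of a single effect and plot the cosine similarity to the reference embedding; in the sketch below, `apply_effect` and `embed` are hypothetical placeholders for a two-parameter effect and a learned style encoder, not anything from the paper.

```python
import numpy as np
import matplotlib.pyplot as plt


def plot_similarity_landscape(x_in, x_ref, apply_effect, embed, n=25):
    """Sweep two normalized effect parameters and plot the cosine similarity
    to the reference embedding. `apply_effect(audio, (p1, p2))` and `embed`
    are placeholders for a two-parameter effect and the style encoder.
    """
    z_ref = embed(x_ref)
    z_ref = z_ref / np.linalg.norm(z_ref)
    grid = np.linspace(0.0, 1.0, n)
    sim = np.zeros((n, n))
    for i, p1 in enumerate(grid):
        for j, p2 in enumerate(grid):
            z = embed(apply_effect(x_in, (p1, p2)))
            sim[i, j] = np.dot(z / np.linalg.norm(z), z_ref)
    plt.imshow(sim, origin="lower", extent=[0, 1, 0, 1], aspect="auto")
    plt.xlabel("parameter 2")
    plt.ylabel("parameter 1")
    plt.colorbar(label="cosine similarity to reference")
    plt.show()
```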
In future work, I think more could be done to deepen the evaluation. In my opinion, the optimization method could also have been better justified. All in all, I appreciate this paper for the clarity of its approach and descriptions, for a task as challenging as automatic style transfer.
We would like to thank the reviewers for their detailed comments and suggestions. We have addressed these comments to the extent possible in the camera-ready submission. The changes include:
1) Rephrasing the discussion on previous work in Sec. 1, specifically regarding the “context-dependent” nature of audio production raised by R2 and MR. This discussion aims to underline the limitations of previous works, both rule-based and machine learning systems, which treat audio production as a one-to-one mapping. This motivates style transfer systems, which adapt based on user input.
2) Providing further details on the listening study as requested by R2. Participants were asked to rate each stimulus on a scale from 0 to 100, considering its similarity to a reference recording while ignoring differences in content and focusing on audio production.
3) Clarifying the role of the Oracle, which all reviewers raised questions about. The Oracle simply takes the input recording and applies the same parameter configuration used to create the reference. Since the input recording may have a different “starting point” from the unprocessed reference, simply applying the same parameters may not result in an ideal style transfer. This was reflected in our evaluation, with listeners sometimes rating the Oracle lower than other methods.
4) Addressing important points about the audio effect representation (AFx-Rep) raised by R2 and MR regarding its lack of invariance to content. This is true given our pretraining setup. We added details in Sec. 2 to clarify this potential limitation. However, we further motivated this choice by stating that our method is more scalable, as we can leverage any audio data, including already processed audio. This proves beneficial as it exposes our model to a wider range of effects beyond those we synthetically apply, aiding in more generalized style transfer.
5) Resolving a number of typos and rephrasing some passages based on reviewer suggestions.