A Stem-Agnostic Single-Decoder System for Music Source Separation Beyond Four Stems
Karn N Watcharasupat (Georgia Institute of Technology)*, Alexander Lerch (Georgia Institute of Technology)
Keywords: Creativity -> tools for artists; Knowledge-driven approaches to MIR -> machine learning/artificial intelligence for music; MIR fundamentals and methodology -> music signal processing; MIR tasks -> indexing and querying; Musical features and properties -> timbre, instrumentation, and singing voice; MIR tasks -> sound source separation
Despite significant recent progress across multiple subtasks of audio source separation, few music source separation systems support separation beyond the four-stem vocals, drums, bass, and other (VDBO) setup. Of the very few current systems that go beyond this setup, most rely on an inflexible decoder design that supports only a fixed, pre-defined set of stems. Increasing stem support in these inflexible systems correspondingly increases computational complexity, rendering extensions of these systems computationally infeasible for long-tail instruments. We propose Banquet, a system that allows source separation of multiple stems using just one decoder. A bandsplit source separation model is extended to work in a query-based setup in tandem with a PaSST music instrument recognition model. On the MoisesDB dataset, Banquet, at only 24.9 M trainable parameters, performed on par with or better than the significantly more complex 6-stem Hybrid Transformer Demucs. The query-based setup allows for the separation of narrow instrument classes such as clean acoustic guitars, and can be successfully applied to the extraction of less common stems such as reeds and organs.
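To illustrate the core idea in the abstract, here is a minimal, hypothetical sketch of query-based conditioning: an embedding of a query excerpt (e.g., from a PaSST-style encoder) modulates a separator through FiLM layers so that a single decoder can target whichever stem the query describes. All module names and sizes below are illustrative assumptions and are not taken from the Banquet implementation, which uses a bandsplit architecture rather than this toy convolutional one.

```python
import torch
import torch.nn as nn


class FiLM(nn.Module):
    """Feature-wise linear modulation: scale and shift features using
    parameters predicted from a conditioning vector."""

    def __init__(self, cond_dim: int, num_channels: int):
        super().__init__()
        self.to_scale = nn.Linear(cond_dim, num_channels)
        self.to_shift = nn.Linear(cond_dim, num_channels)

    def forward(self, features: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # features: (batch, channels, time); cond: (batch, cond_dim)
        scale = self.to_scale(cond).unsqueeze(-1)
        shift = self.to_shift(cond).unsqueeze(-1)
        return scale * features + shift


class QueryConditionedSeparator(nn.Module):
    """Toy single-decoder separator conditioned on a query embedding
    (e.g., a PaSST embedding of an isolated instrument excerpt)."""

    def __init__(self, feat_channels: int = 64, query_dim: int = 768):
        super().__init__()
        self.encoder = nn.Conv1d(1, feat_channels, kernel_size=16, stride=8)
        self.film = FiLM(query_dim, feat_channels)
        self.decoder = nn.ConvTranspose1d(feat_channels, 1, kernel_size=16, stride=8)

    def forward(self, mixture: torch.Tensor, query_emb: torch.Tensor) -> torch.Tensor:
        feats = self.encoder(mixture)          # encode the mixture waveform
        feats = self.film(feats, query_emb)    # inject "what to extract"
        return self.decoder(feats)             # decode the requested stem
```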
Reviews
This paper prompted a lively discussion amongst the reviewers, ultimately reaching a consensus that it should be accepted.
We ask that the authors make minor revisions to better motivate the use of a query-based system (as opposed to instrument labels). The advantages of the query-based approach are somewhat unclear, given that the paper does not study generalization to instruments unseen during training (which would obviously preclude a label-based approach).
Please also incorporate the missing references identified in the reviews.
Finally, the comments about Moises around Line 85 could be interpreted as advertisement, which has no place in an academic paper. Please consider toning down this discussion of Moises, or adding commentary on other >4 stem commercial systems (e.g., Lalal.ai, audioshake).
This paper presents a system called Banquet, which aims to separate musical audio into different parts (stems) using a single decoder. The system is designed to handle more than the usual four stems (vocals, drums, bass, and other) and uses a query-based approach to identify which instrument to separate. The strategy of using one decoder for multiple stems is useful because it simplifies the model and reduces computational complexity. The system integrates a bandsplit source separation model with a query-based setup and a music instrument recognition model (PaSST). The results are competitive with respect to the state-of-the-art HT-Demucs, even though they only exceed its SNR slightly for guitar and piano.
The authors show that Banquet can perform well on common instrument types like vocals, drums, and bass, and even some less common ones like guitar and piano using the MoisesDB dataset. This suggests that their approach has potential in broader applications. I appreciate that the authors aim to push beyond the standard four-stem setup and address the need for more flexible source separation systems.
The paper is well written and well structured, with an extensive discussion and a valid scientific approach.
My main concern is that the authors do not clearly motivate their use of a query-based approach. At training time, they use instrument labels to retrieve queries from other songs. Therefore, a naive approach that simply uses the class labels as conditioning (with some embedding layer) could work equally well. The authors also do not seem to recognize that conditioning on queries from different tracks defeats the idea of a query-based approach, as they note in the discussion: "Interestingly, it appears that querying with excerpts from the same or different track did not affect the model performance for most stems except for electric piano." However, it is to be expected that the model cannot do better with same-song queries, as it was trained on all-song queries retrieved using instrument labels.
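For concreteness, the label-conditioned baseline this comment alludes to could look like the sketch below: a learned embedding per stem class replaces the audio-derived query embedding and is fed to the same conditioning pathway. The class count, dimensionality, and names are hypothetical and the snippet is not from the paper under review.

```python
import torch
import torch.nn as nn

NUM_STEM_CLASSES = 30  # hypothetical size of the stem-type taxonomy
COND_DIM = 768         # chosen to match the query-embedding dimensionality


class LabelConditioner(nn.Module):
    """Label-based baseline: one learned embedding per stem class,
    used in place of an audio-derived query embedding."""

    def __init__(self, num_classes: int = NUM_STEM_CLASSES, cond_dim: int = COND_DIM):
        super().__init__()
        self.embedding = nn.Embedding(num_classes, cond_dim)

    def forward(self, stem_class_ids: torch.Tensor) -> torch.Tensor:
        # stem_class_ids: (batch,) integer labels -> (batch, cond_dim) conditioning
        return self.embedding(stem_class_ids)


# The resulting vector would be fed to the same FiLM layers that otherwise
# receive the query embedding, e.g.:
cond = LabelConditioner()(torch.tensor([3]))  # class 3 = "electric guitar", say
```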
More detailed remarks:
- The performance on long-tail instruments (less common instruments) is relatively weak, suggesting that the model could not generalize by interpolating in the PaSST embedding space. This puts the query-based approach into question.
- The model's training approach involves using queries of the same stem type from different songs. I feel this undermines the self-supervised potential of the method because it relies on knowing the instrument class. It seems like a missed opportunity to fully explore the benefits of self-supervised learning. When using queries in a supervised setting, one would at least need to compare against a non-query-based approach (i.e., conditioning on the instrument label) to show that there are advantages.
- The authors missed some important references for query-based approaches, which makes it harder to place their work in the context of existing research. It is generally difficult to situate this work relative to prior work.
- Single-decoder approaches already exist. I am not sure what the new contribution of this paper is, besides using the not-yet-used MoisesDB dataset. It would be nice if the authors had provided a better motivation and explanation of what makes this work different from or better than others.
Minor remarks:
- The use of the term "stem" is confusing. I think "stem type" or "instrument type/class" would work better.
- The training data for long-tail instruments is very limited. The authors could have considered more aggressive data augmentation and transfer learning techniques to help improve the model's performance on less common instruments. Another option to obtain more training examples would be to randomly mix stems of the dataset, as is common in many SOTA source separation works (see the sketch after this list).
- The paper could benefit from more detailed comparisons with existing query-based and single-decoder systems to clearly show improvements or differences. This would help in understanding the real contributions of this work.
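A minimal sketch of the random stem-mixing augmentation mentioned above, assuming the dataset is organized as per-song dictionaries of equal-length stem waveforms; the function and variable names are illustrative only and not taken from the paper's codebase.

```python
import random


def random_stem_mix(songs, target_class, num_extra=3):
    """Build a synthetic training mixture: take the target stem from one song
    and add randomly chosen stems from other songs at random gains.
    `songs` is assumed to be a list of dicts mapping stem-class names to
    waveform tensors of equal length."""
    song = random.choice([s for s in songs if target_class in s])
    target = song[target_class]
    mixture = target.clone()
    for _ in range(num_extra):
        other = random.choice(songs)
        candidates = [c for c in other if c != target_class]
        if not candidates:
            continue  # skip songs that only contain the target class
        gain_db = random.uniform(-6.0, 0.0)
        mixture = mixture + (10 ** (gain_db / 20.0)) * other[random.choice(candidates)]
    return mixture, target
```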
Conclusion:
The paper proposes an interesting system for music source separation using a single decoder and a query-based approach. I appreciate the authors' effort to create a more flexible system that can handle more than just the standard four stems. However, it has some weaknesses; in particular, it does not motivate the query-based approach and does not show that it improves over simply using class labels. The contribution of this work is unclear given existing single-decoder and query-based approaches.
Consequently, I don't think the paper is ready for publication in ISMIR 2024.
The paper presents an approach for music source separation beyond the traditional four-stem setup of vocals, drums, bass, and other (VDBO). In order to generalise the model's capabilities to more stems, the authors introduce query-based conditioning, in which PaSST embeddings of a query excerpt are fed to the network. In experiments performed on MoisesDB, they evaluate the model and the subsequent stem extensions.
The paper starts very strong and the authors nicely embed the paper in the current literature. However, as mentioned above, some works are missing. In particular, conditioning through a FiLM layer was proposed earlier and should be credited as prior work. Furthermore, the introduction is very focused on MoisesDB and the models Moises currently offers. This should be reformulated in a more neutral way or left out, since it does not affect the research in the paper.
Section 3 is challenging because the system itself is complex, but the writing is clear and concise. Section 4.1 describes the query extraction. There seems to be a significant amount of pre-processing involved to obtain the "correct" query. How sensitive is the whole process to the query selection? Is it necessary to take the "top-1" query, or is a random query good enough? Since PaSST takes 10 s chunks, there might be enough information in any source chunk as long as it is not silence (or has a certain amount of energy). This remains unclear and is not discussed further. Section 5 describes the experiments and the results. It shows structured experiments on the different levels of stem detail, from coarse to fine.
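To make the question concrete, here is a hedged sketch of the two query-selection strategies contrasted above (highest-energy "top-1" chunk vs. any non-silent random chunk); the chunk length, silence threshold, and function names are assumptions, not the paper's actual procedure.

```python
import numpy as np


def select_query_chunk(stem, sr=44100, chunk_sec=10.0, strategy="top1", seed=None):
    """Pick a 10-second query excerpt from an isolated stem waveform.
    'top1' returns the highest-energy chunk; 'random' returns any chunk
    whose RMS exceeds a small silence threshold."""
    chunk_len = int(chunk_sec * sr)
    n_chunks = max(len(stem) // chunk_len, 1)
    chunks = stem[: n_chunks * chunk_len].reshape(n_chunks, -1)
    rms = np.sqrt((chunks ** 2).mean(axis=1))
    if strategy == "top1":
        return chunks[rms.argmax()]
    rng = np.random.default_rng(seed)
    non_silent = np.flatnonzero(rms > 1e-4)
    if len(non_silent) == 0:          # all chunks silent: fall back to loudest
        return chunks[rms.argmax()]
    return chunks[rng.choice(non_silent)]
```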
As for reproducibility, the authors provide the source code. Upon checking the code, it is missing an environment.yaml, which would be nice to add. Otherwise it is very clean code and will help subsequent research. The results for the figures and tables are provided through CSV files. Since much compute time went into the models, will the weights be published as well? Furthermore, I have not found the query positions for the PaSST embeddings; these might be crucial for later comparison.
All in all, I recommend this paper to be published at ISMIR 2024. It steers into a direction many researchers are not yet tapping. However, the evaluation is very much focused on MoisesDB. Although MUSDB does not go beyond four stems, it would be interesting to see this system's performance on it, since this dataset is so well known in the community. Furthermore, evaluation on URMP would be a plus and maybe something to take into account in future research (although it is very far from the training data).
Small typos: * footnote 7: a word is missing after "for"
This paper presents an approach for deep learning-based music source separation, using a query by example architecture, such that arbitrary types of musical sources can be extracted using the same model. The paper contains a thorough literature review, and to the best of my knowledge, the most thorough study of music source separation beyond four sources using a dataset of real music with singing voice. While the study is not perfect as noted below, I believe it is a worthy contribution to ISMIR.
Specific Comments:
- Sec. 3.1: The notation in this section feels a bit sloppy. N_{FFT} = 2F = 2048 isn't correct, since the input audio signal is real, unless the DC or Nyquist bin is dropped (a quick check follows these comments). Also, wouldn't splitting the signal into subbands decrease the number of frequency bands? I believe this is represented properly in Fig. 1.
- Regarding using query-by-example for picking the instrument to extract: why not just use a learned embedding vector for each class (i.e., instrument stem type)? Given the lack of difference when using the stem from the same song vs. a different song, it seems that the network is potentially doing this anyway. It would be nice to comment on or compare with this.
- Related to the above comment, it would be nice if the authors considered performance on source types that were not included in the training set. Could this be one potential advantage of using query-by-example?
- It would be nice to have some baseline results, e.g., an oracle mask and noisy samples for the fine-grained stems, to put the presented results in context.
- Using SNR as the evaluation metric, is it surprising that RMS level is correlated with performance? It would be better to repeat this analysis with SNR improvement (a sketch follows these comments).
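On the N_FFT notation point above, a quick numerical check of the bin-count relation for the one-sided spectrum of a real signal (assuming standard NumPy conventions):

```python
import numpy as np

n_fft = 2048
frame = np.random.randn(n_fft)      # a real-valued analysis frame
spectrum = np.fft.rfft(frame)       # one-sided spectrum of a real signal
F = len(spectrum)                   # number of frequency bins

assert F == n_fft // 2 + 1          # 1025 bins, including DC and Nyquist
assert n_fft == 2 * (F - 1)         # hence N_FFT = 2(F - 1), not 2F
```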
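And on the metric point, one common definition of SNR improvement, given only as a sketch of how the suggested analysis could be computed; the paper's exact metric implementation may differ.

```python
import numpy as np


def snr_db(estimate, target, eps=1e-12):
    """SNR of an estimate with respect to a reference target, in dB."""
    noise = estimate - target
    return 10.0 * np.log10(np.sum(target ** 2) / (np.sum(noise ** 2) + eps))


def snr_improvement_db(estimate, mixture, target):
    """Output SNR minus the input SNR of the unprocessed mixture; this removes
    the dependence on how prominent the target stem is in the mix."""
    return snr_db(estimate, target) - snr_db(mixture, target)
```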
We would like to thank the reviewers for their detailed reviews.
The changes made to the paper compared to the initial submission are as follows.
- Additional relevant works have been included. (Wang et al., ICASSP 2022 was already included as [20] in the initial submission.)
- The discussion of Moises' system has been cut down, and other commercial systems are now mentioned. Our intention was to highlight the inflexibility of existing commercial systems, but we understand that this might be misinterpreted.
- Thank you to R4 for spotting the typo in Sec. 3.1. We fixed it to N_{FFT} = 2(F-1) = 2048.
- We added a paragraph motivating the use of embedding-based queries instead of class-label queries in Sec. 3.3.
- Alt text has been added to the figures.
- An acknowledgements section and an ethics statement have been added.