I can listen but cannot read: An evaluation of two-tower multimodal systems for instrument recognition
Yannis Vasilakis (Queen Mary University of London)*, Rachel Bittner (Spotify), Johan Pauwels (Queen Mary University of London)
Keywords: MIR fundamentals and methodology -> multimodality, Evaluation, datasets, and reproducibility -> evaluation methodology; Evaluation, datasets, and reproducibility -> novel datasets and use cases; MIR fundamentals and methodology -> metadata, tags, linked data, and semantic web; MIR tasks -> automatic classification
Music two-tower multimodal systems integrate audio and text modalities into a joint audio-text space, enabling direct comparison between songs and their corresponding labels. These systems open up new approaches for classification and retrieval that leverage both modalities. Despite the promising results they have shown for zero-shot classification and retrieval tasks, closer inspection of the embeddings is needed. This paper evaluates the inherent zero-shot properties of joint audio-text spaces for the case study of instrument recognition. We present an evaluation and analysis of two-tower systems for zero-shot instrument recognition, along with a detailed analysis of the properties of the pre-projected and joint embedding spaces. Our findings suggest that the audio encoders alone are of good quality, while challenges remain within the text encoder or the joint space projection. Specifically, two-tower systems exhibit sensitivity towards specific words, favoring generic prompts over musically informed ones. Despite the large size of the textual encoders, they do not yet leverage additional textual context or infer instruments accurately from their descriptions. Lastly, we propose a novel approach for quantifying the semantic meaningfulness of the textual space by leveraging an existing instrument ontology. This method reveals deficiencies in the systems' instrumental knowledge and provides evidence of the need to fine-tune text encoders on musical data.
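For context, the zero-shot setup described in the abstract can be summarized in a few lines of Python. The sketch below is illustrative only: "encode_text" is a hypothetical stand-in for the text tower of a joint audio-text model such as CLAP or MusCALL, the audio embedding is assumed to come from the corresponding audio tower, and the prompt template is not one of the prompts used in the paper.

# Illustrative sketch of two-tower zero-shot instrument recognition.
# "encode_text" is a hypothetical stand-in for the text tower of a joint
# audio-text model; the audio embedding is assumed to come from the audio tower.
import numpy as np

def zero_shot_classify(audio_embedding, instrument_labels, encode_text,
                       prompt_template="a recording of a {} performance"):
    """Rank candidate instruments by cosine similarity between one audio clip
    and prompted label embeddings in the joint audio-text space."""
    prompts = [prompt_template.format(label) for label in instrument_labels]
    text_embeddings = np.stack([encode_text(p) for p in prompts])

    # Cosine similarity: normalize both sides, then take dot products.
    audio = audio_embedding / np.linalg.norm(audio_embedding)
    text = text_embeddings / np.linalg.norm(text_embeddings, axis=1, keepdims=True)
    scores = text @ audio

    ranking = np.argsort(scores)[::-1]
    return [(instrument_labels[i], float(scores[i])) for i in ranking]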
Reviews
Most reviewers agree that the paper provides a good overview of the problem space, with one or two comments about additional work that could be included there. Overall, the introduction and definition of the problem are well done. Most reviewer criticisms concern the interpretation of results and the incorporation of future work into the current work. Specifically, several of the conclusions drawn in the text are not clearly evident in the results shown, and the reviewers ask for a clearer discussion of what exactly is supported by the presented results.
Please take the reviewers' comments into careful consideration to improve this work for the camera-ready version.
Summary:
The paper discusses the applicability of multimodal audio-text models to musical instrument classification. Such models enable a user to measure the 'distance' between an audio clip and a text prompt. The authors argue that such models are suitable for tasks such as instrument classification due to the scarcity of large datasets containing many labeled instruments, as well as the fluidity of the output label space, i.e., new instruments might need to be added to a trained classifier. The authors aim to assess the zero-shot capability of such models for the task, specifically probing the importance of the text prompt used and the power of the so-called pre-joint and joint embedding spaces of three models. Through various experiments, they find that the models do not utilize textual context, focusing instead on the label word within the input prompt. They also find that the text branch of these models performs significantly worse than the audio-only branch. The conclusion of this study is that alignment between text and audio is lacking in the chosen music-text joint models and should be the focus of future research.
Strengths:
- The paper discusses research questions that are well aligned with the latest trends in machine learning research. Multimodal models are being heavily studied and there is a surge of audio/music/text models. The findings in this paper seem helpful in guiding the future direction of research in the space.
- The authors have done a good job organizing the paper and covering the various research questions that arise when evaluating the zero-shot capabilities of the models.
Weaknesses:
- I am not sure why the authors refer to the two "musically informed" prompts as such. I do not really see any semantic difference between the MusCALL prompt and prompt #1.
- I believe the study would have been more complete had the authors included one of the experiments left as future work: using some kind of prompt upsampling/augmentation to inject stochasticity into the training of such models. It is well known that such methods improve the generalizability of these models, and the findings of this paper might therefore have changed significantly with more robustly trained models.
The paper compares three different algorithms for two-tower multimodal systems, which produce a joint audio and text embedding based on text and audio encoders. Although the dataset used is rather simple and not very large, the authors justify its selection very well.
The paper is well written and all necessary literature is discussed. After some background on the methodology, several experiments are described in detail and supported with figures. The problematic aspects of the systems used are also reported.
However, the statistical evaluation of the experiments could be done more thoroughly: e.g., statistics over prompts for the n different instruments could be reported with standard deviations (cf. Figure 2), and statistical tests could be applied for the quantitative comparison of prompts and methods. For instance, the statement that the "MusCALL prompt ... leads to the highest top-1 accuracy" is not clearly supported: for Music/Speech CLAP it can be observed, but for Music CLAP the values seem very close, and for MusCALL it is even below most of the other prompts.
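As a rough illustration of the analysis this reviewer requests, the sketch below reports the per-prompt mean and standard deviation of accuracy over instruments and runs a paired Wilcoxon signed-rank test between two prompts. The data structure "accuracy_per_instrument" (a prompt name mapped to a vector of per-instrument accuracies) is an assumption for the example and not something reported in the paper.

# Sketch of prompt-level statistics: spread over instruments plus a paired
# significance test between two prompts.
import numpy as np
from scipy.stats import wilcoxon

def summarize_prompts(accuracy_per_instrument):
    """Mean and standard deviation of accuracy over instruments, per prompt."""
    return {prompt: (float(np.mean(acc)), float(np.std(acc)))
            for prompt, acc in accuracy_per_instrument.items()}

def compare_prompts(acc_prompt_a, acc_prompt_b):
    """Paired Wilcoxon signed-rank test on per-instrument accuracies of two prompts."""
    statistic, p_value = wilcoxon(acc_prompt_a, acc_prompt_b)
    return statistic, p_value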
The authors provide a thorough, well-organized analysis of the pre-joint and joint audio-text embedding spaces in the context of musical instrument recognition. Thanks for your contributions!
Strengths:
- Related work section is very strong and covers all of the bases, which is especially great in an analysis paper!
- Very clear motivation and setup of experimentation.
- Figures 3 and 4 were really interesting and illustrated some of your key discussion points well!
Comments:
- Figure 2 was hard to digest. (1) The inclusion of the joint and pre-joint audio columns does not fit, as the figure mostly illustrates prompt comparisons; and while Section 3.3 states that these columns will be explained in Section 3.4, I never quite understood what they represent. (2) It could be helpful to have the prompts on the same page as this figure, or even somewhere in the figure, as "prompt 1", etc. means nothing to the reader without flipping back and forth.
- I think more qualitative analysis and discussion could enhance your analysis even further. The discussion is mostly centered on metrics for instrument recognition (which makes sense), but it would be great to have a bit more discussion of what you really saw (qualitatively) in the recognition when using the different embeddings, similar to the discussion surrounding the importance of prompt engineering in the text encoder.
- The metric for semantic meaningfulness based on the instrument ontology is quite interesting in theory, but Figure 5 shows that the results were quite similar across systems. Is there a way the metric could be modified to emphasize wider spreads? (See the sketch after this list for one possible formulation.)
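The following sketch illustrates one possible formulation of an ontology-based semantic-meaningfulness score, not the exact metric used in the paper: it correlates pairwise text-embedding similarities of instrument names with negated ontology distances. A more discriminative variant could, for instance, weight pairs by their ontology distance to emphasize spread. "encode_text" and "ontology_distance" (e.g., path length in an instrument taxonomy) are assumed helpers.

# Hypothetical ontology-alignment score for a text embedding space; this is an
# illustration, not the paper's metric.
from itertools import combinations
import numpy as np
from scipy.stats import spearmanr

def ontology_alignment(instruments, encode_text, ontology_distance):
    """Spearman correlation between text-embedding similarity and negated
    ontology distance over all instrument pairs; higher values mean the text
    space better reflects the instrument ontology."""
    embeddings = {name: encode_text(name) for name in instruments}
    emb_sims, onto_sims = [], []
    for a, b in combinations(instruments, 2):
        ea, eb = embeddings[a], embeddings[b]
        emb_sims.append(float(ea @ eb / (np.linalg.norm(ea) * np.linalg.norm(eb))))
        onto_sims.append(-ontology_distance(a, b))
    rho, _ = spearmanr(emb_sims, onto_sims)
    return rho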
Apart from problems with the arguments or claims, most of the critique amounted to not providing more experimentation and not testing some of the hypotheses/proposals left for future work. We need to highlight that the main topic of this paper is to evaluate the state-of-the-art two-tower systems and establish a baseline (or the absence of one). Our results, alongside the performance reported by the authors of the proposed two-tower systems, provide evidence that we are far from properly utilizing extra modalities. As a novel step, we tried to pinpoint the cause of this issue and found that the text branch or the alignment is the main problem, as the audio encoders perform very well. The 6-page format was a hindrance to providing further experiments and testing some of our hypotheses for potential improvements to the problems stated, and this is the main reason why we abstained from including such results, as they would decrease the value of this paper as an evaluation paper.
Figure 2 was criticized, but we argue that it clearly shows the following two takeaways: changing prompts leads to almost erratic performance changes, and using audio-only information almost doubles the performance.
We admittedly included some bold claims, in some cases unjustified (such as claiming that there is an issue with using two-tower systems for multi-label zero-shot classification). We changed the phrasing to make clear that this is just a stepping stone and that we cannot draw strong conclusions from these experiments. Despite this, we still think it is a great way of establishing a more thorough evaluation of deep learning systems and of shedding more light on the reasons why these systems might fail, beyond presenting a table of metrics.