From Audio Encoders to Piano Judges: Benchmarking Performance Understanding for Solo Piano
Huan Zhang (Queen Mary University of London)*, Jinhua Liang (Queen Mary University of London), Simon Dixon (Queen Mary University of London)
Keywords: Applications -> music training and education, Evaluation, datasets, and reproducibility -> novel datasets and use cases; Knowledge-driven approaches to MIR -> machine learning/artificial intelligence for music; MIR fundamentals and methodology -> music signal processing; MIR tasks -> automatic classification; Musical features and properties -> expression and performative aspects of music
Our study investigates an approach to understanding musical performances through the lens of audio encoding models, focusing on solo Western classical piano music. Compared with composition-level attribute understanding, such as key or genre, we identify a knowledge gap in performance-level music understanding and address three critical tasks: expertise ranking, difficulty estimation, and piano technique detection, introducing a comprehensive Pianism-Labelling Dataset (PLD) for this purpose. We leverage pre-trained audio encoders, specifically Jukebox, Audio-MAE, MERT, and DAC, which show varied capabilities on the downstream tasks, and explore whether domain-specific fine-tuning enhances their ability to capture performance nuances. Our best approach achieves 93.6% accuracy in expertise ranking, 33.7% in difficulty estimation, and 46.7% in technique detection, with Audio-MAE as the overall most effective encoder. Finally, we conduct a case study on Chopin Piano Competition data using the trained expertise-ranking models, which highlights the challenge of accurately assessing top-tier performances.
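The concrete architecture is specified in the paper itself; as a hedged illustration of the general setup the abstract outlines, the sketch below assumes a frozen pre-trained encoder that yields frame-level embeddings and a small trainable head. The ProbeHead class, layer sizes, and dimensions are illustrative assumptions, not the authors' exact implementation.

```python
# Hypothetical sketch: a frozen pre-trained audio encoder provides frame-level
# embeddings, and a small trainable head maps them to task labels (e.g. expertise
# level). Names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class ProbeHead(nn.Module):
    def __init__(self, embed_dim: int = 768, num_classes: int = 3):
        super().__init__()
        # 1-D convolution over time, average pooling, then a linear classifier.
        self.conv = nn.Conv1d(embed_dim, 256, kernel_size=3, padding=1)
        self.classifier = nn.Linear(256, num_classes)

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        # embeddings: (batch, time, embed_dim) from a frozen encoder
        # such as Audio-MAE, MERT, Jukebox, or DAC.
        x = self.conv(embeddings.transpose(1, 2))   # (batch, 256, time)
        x = x.mean(dim=-1)                           # average-pool over time
        return self.classifier(x)                    # (batch, num_classes)

# Usage with dummy embeddings standing in for encoder output:
probe = ProbeHead()
dummy = torch.randn(4, 500, 768)      # 4 clips, 500 frames, 768-dim features
logits = probe(dummy)                 # (4, 3) class scores
```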
Reviews
The reviewers reached a consensus to accept this paper. I strongly advise the authors to look at the detailed comments from the reviewers and address the raised points in order to further improve the publication.
As the title states, the paper demonstrates three piano-performance-related tasks using embeddings from pre-trained (and general-purpose) audio encoders. The authors not only use existing task setups but also contribute new datasets (expertise ranking and technique identification). The paper is written clearly and is easy to read.
The paper refers to many related papers and introduces the relevant issues well. However, I would expect a more thorough literature review of the downstream tasks: for example, what is the state of the art for each downstream task, and what performance should we expect?
The paper covers quite a wide range of topics, as none of the downstream tasks has a firm benchmark; thus, the authors also had to propose the datasets and evaluation procedures. The paper is compact enough, but the experiments could be elaborated further, which makes me feel the 6-page limit is short for this paper.
I have two concerns about the message of the paper. First, I am not convinced that the proposed projection model is the proper design for these downstream tasks. The CNN-based model only allows for the analysis of very local features, while difficulty and expertise ranking are expected to require some temporal features. I worry that the paper's conclusions may therefore not be general.
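As a toy illustration of this concern (not the authors' model; all layer sizes are made up), a small convolution over frame-level embeddings only mixes a few neighbouring frames per output position, whereas a recurrent layer can aggregate information across the whole performance:

```python
# Toy illustration only: contrast the receptive field of a local convolution
# with a recurrent layer that summarises the full sequence.
import torch
import torch.nn as nn

embeddings = torch.randn(1, 500, 768)            # (batch, time, dim) encoder output

local_conv = nn.Conv1d(768, 256, kernel_size=3, padding=1)
local_features = local_conv(embeddings.transpose(1, 2))
# Each of the 500 output frames depends on only ~3 input frames.

temporal_rnn = nn.GRU(768, 256, batch_first=True)
_, final_state = temporal_rnn(embeddings)
# final_state summarises the whole sequence, so long-range phrasing or tempo
# changes could, in principle, influence the prediction.
print(local_features.shape, final_state.shape)   # (1, 256, 500) and (1, 1, 256)
```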
Second, the correlation between piece selection and expertise ranking is not discussed enough. The network might only consider the piece, not the pianist's touch or sound. The ICPC-2015 dataset partly resolves this, but I think this issue is important and not handled sufficiently.
I think the paper has many points of contribution, but I also believe the paper's main message could be elaborated further in both the discussion and the experiments.
- This paper discusses an approach for three tasks: expertise ranking, difficulty estimation, and piano technique detection, leveraging pre-trained audio encoders.
- The paper also introduces a dataset, the Pianism-Labelling Dataset (PLD).
- The authors have enumerated the contributions in the introduction. This is helpful for the reader to quickly get a clear idea about the paper.
- This approach is the first of its kind in piano technique detection and expertise ranking. It is good to explicitly state this. The research gap in difficulty estimation is not clear in the Related Work.
- Section 3.1.1: It is not very clear how R4 works if a pair of performances has the same Q value. It looks like ranking is not possible with this metric when both performances come from players with the same expertise level.
- It is not clear how the neural network is designed to perform the n-way ranking.
- Fine-tuned Audio-MAE performs better for expertise ranking, whereas fine-tuned DAC performs better on the competition dataset. Looking more closely at the data on which these embeddings were trained may offer an explanation for this.
- It would be interesting to see how the embeddings of expert performances differ from those of beginners (a sketch of such an analysis follows this list). This may help to understand the nature of the information the embeddings are able to communicate to the subsequent layers of the neural network.
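A minimal sketch of such an analysis, assuming clip-level embeddings are available as a NumPy array (the data here is a random placeholder, and the real embeddings would come from the frozen encoders studied in the paper), projects them to 2-D with t-SNE and checks whether beginner and expert performances separate:

```python
# Hedged sketch: visualise performance embeddings by expertise level.
# Embeddings and labels are placeholders for illustration only.
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

embeddings = np.random.randn(200, 768)          # placeholder: 200 clips x 768 dims
labels = np.random.randint(0, 2, size=200)      # placeholder: 0 = beginner, 1 = expert

points = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(embeddings)

for value, name in [(0, "beginner"), (1, "expert")]:
    mask = labels == value
    plt.scatter(points[mask, 0], points[mask, 1], label=name, s=10)
plt.legend()
plt.title("t-SNE of performance embeddings (placeholder data)")
plt.show()
```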
The authors present a relatively simple approach for the discussed tasks, some of which are novel. The experiments are quite rigorous, considering different embeddings, and the design of the methodology is driven by the nature of the tasks. Various relevant metrics are employed for evaluation.
The paper proposes new performance-based music evaluation tasks, introduces a new dataset, and evaluates several deep pretrained audio music models as feature generators for solving the tasks. The paper presents initial experiments with the tasks, but they are well executed and highlight the gaps that need to be addressed in the future.
Some comments:
- Since fine-tuning does not seem to improve results, or even seems to worsen them, explain how you chose the fine-tuning parameters (e.g. learning rate, number of epochs) and whether they might play a role in this.
- I would not attribute the poorer performance of the expertise ranking in the Chopin competition to the "sight over sound" phenomenon. I would imagine that it has much more to do with the fact that most of the participants in the competition are experts, whereas the model is trained to distinguish between beginner, advanced, and virtuoso levels.
I mainly added to the background and literature review to make them more coherent with the work itself, as well as some explanation of the results. I would love to add more discussion as suggested, but was limited by space.
I also highlighted some details that the reviewers found confusing.