Abstract:

Music is plentiful, but labeled data for music theory tasks like Roman numeral analysis is scarce. Self-supervised pretraining on unlabeled data is therefore a promising means of improving performance on these tasks, especially because, during pretraining, a model may be expected to acquire latent representations of musical abstractions like keys and chords. However, existing deep learning models for Roman numeral analysis have not used pretraining, instead training from scratch on labeled data, while conversely, pretrained models for music understanding have generally been applied to sequence-level tasks not involving explicit music theory, like composer or genre classification. In contrast, this paper applies pretraining methods to a music theory task by fine-tuning a masked language model, MusicBERT, for Roman numeral analysis. We apply token classification to predict labels for each note, then aggregate the predictions of simultaneous notes to obtain a single label at each time step. Conditioning the chord predictions on key predictions gives more coherent labels. The resulting model outperforms previous Roman numeral analysis models by a substantial margin.
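
For concreteness, here is a minimal sketch of this pipeline in PyTorch. The names (`RomanNumeralTagger`, `aggregate_by_onset`) are hypothetical, and the mean-pooled aggregation and concatenation-based key conditioning shown are plausible stand-ins rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn


def aggregate_by_onset(logits, onset_ids):
    """Average per-note logits over notes that share an onset, so that each
    time step receives a single prediction.

    logits: (n_notes, n_classes) float tensor of per-note predictions.
    onset_ids: (n_notes,) long tensor giving each note's time-step index.
    """
    n_steps = int(onset_ids.max()) + 1
    summed = torch.zeros(n_steps, logits.size(-1)).index_add_(0, onset_ids, logits)
    counts = torch.bincount(onset_ids, minlength=n_steps).clamp(min=1)
    return summed / counts.unsqueeze(-1)


class RomanNumeralTagger(nn.Module):
    """Per-note token classification on top of a pretrained encoder."""

    def __init__(self, encoder, hidden_size, n_keys, n_chords):
        super().__init__()
        self.encoder = encoder  # MusicBERT-style backbone (hidden state per note)
        self.key_head = nn.Linear(hidden_size, n_keys)
        # One plausible conditioning scheme: concatenate the key distribution
        # onto each note's hidden state before predicting its chord label.
        self.chord_head = nn.Linear(hidden_size + n_keys, n_chords)

    def forward(self, note_tokens, onset_ids):
        h = self.encoder(note_tokens)  # (n_notes, hidden_size)
        key_logits = self.key_head(h)
        chord_in = torch.cat([h, key_logits.softmax(dim=-1)], dim=-1)
        chord_logits = self.chord_head(chord_in)
        # One label per time step rather than per note.
        return (aggregate_by_onset(key_logits, onset_ids),
                aggregate_by_onset(chord_logits, onset_ids))
```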

Reviews
Meta Review

All of the reviewers are concerned about the lack of detail and evaluation of the fine-tuning process. We encourage the authors to expand on this in any revision of the paper.


Review 1

The idea of this paper is simple: using a pre-trained model for symbolic music, MusicBERT, to improve performance on Roman numeral analysis. Although it achieves better performance than previous approaches, I am looking for an in-depth discussion of the choice of MusicBERT: why use it instead of other similar models like MidiBERT-Piano? This is an essential question that the authors need to answer in the camera-ready version (if it comes to that), and even better would be a comparative study of how the choice of pre-trained model (including MusicBERT) affects the performance of Roman numeral analysis. Although the latter may require extra experiments that won't make it into this paper, I strongly suggest the authors at least present a reasonable review of pre-trained models for symbolic music and the choice of MusicBERT. Without it, this work won't be considered a solid contribution to the community.

Other than that, this paper is generally easy to follow but lacks polish in places. For example, the formatting of the references is bad, and the caption of Table 1 is overly short, without a meaningful description, and lacks a period at the end. Also, a model diagram illustrating training and inference would be helpful, especially for the post-processing part in Section 3.7, which I find difficult to follow. Although it is nice to open-source the code, as indicated in Footnote 1, the authors should be more discreet about anonymity, as the username embedded in the GitHub URL will easily give away the identity of this submission. There are also grammatical errors, shown as follows.

Line 85: [14] add -> [14] adds
Line 103: including by -> including

Overall, I will give a weak accept for this paper, on the condition that the authors (1) explain why MusicBERT was chosen and (2) polish the content further to make it a competent ISMIR publication.


Review 2

Overall, this is a very good, precisely written, scientifically sound, and interesting paper. The general approach of fine-tuning for token classification is certainly not the most novel idea in itself. However, the authors carefully consider and describe all the parts involved in achieving it, from the dataset, preprocessing, and tokenization to postprocessing and other interesting tweaks (e.g., key conditioning). I was pleased to find each doubt that arose immediately answered in the paper, through the side experiments and ablations presented. Each part of the methodology is precisely described, and design choices are justified from both a scientific and a music-theoretic perspective. Results are also carefully and interestingly described through these lenses.

The writing is clear and precise and the paper is organized well. It is perhaps slightly unusual to see the experimental setup subsections in this order (usually I'd expect to move from model to data), but it works well and, importantly, there are no redundancies or ambiguities.

It would have been interesting to see a bit more work (and perhaps detail in the paper) on the choice of fine-tuning. There isn't adequate justification for why the first 9 layers of BERT are frozen specifically. It would have been interesting to experiment with different layer freezing, and even with parameter- and resource-efficient fine-tuning methods like LoRA. All this is particularly interesting because of existing indications of how different the information between transformer layers can be. In "Comparative Layer-Wise Analysis of Self-Supervised Speech Models" by Pasad et al., for example, the layer-wise similarity of speech models with phonemes and words is computed, yielding interesting results about the "location" of relevant features. It's not trivial to adapt this framework to symbolic music and RN analysis, but it would certainly be interesting.
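
For reference, this kind of partial freezing is a few lines in common frameworks. A sketch using the Hugging Face transformers API, with roberta-base as a stand-in checkpoint (the original MusicBERT ships as a fairseq model, so loading it differs in practice, but the freezing pattern is the same):

```python
from transformers import RobertaModel

# Stand-in checkpoint for a RoBERTa-style backbone.
model = RobertaModel.from_pretrained("roberta-base")

# Freeze the embeddings and the first 9 encoder layers; only the top
# layers (and any task heads added later) are updated during fine-tuning.
N_FROZEN = 9
for param in model.embeddings.parameters():
    param.requires_grad = False
for layer in model.encoder.layer[:N_FROZEN]:
    for param in layer.parameters():
        param.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable:,}")
```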

It would have been nice (and it's still possible for the camera-ready) to report some compute-related information about your experiments, such as the fine-tuning and inference cost, in relation to existing work (at least through parameter counts).

The citations are not well formatted and need to be fixed for the camera-ready. Please only include arXiv/Zenodo versions if the paper isn't already published in a venue. Use consistent conference names and abbreviations (International Society [...] (ISMIR) Conference vs. ISMIR) and consistent conference locations and years. If you decide to provide links, please do so consistently, when they are available and from appropriate sources.

Minor comments:
- l48: fine tune; l51: fine-tune
- table 1: maybe add thousands separators to the numbers
- table 2: the width somehow needs to be shortened
- l291-292: is "in particular" or "solely on Classical" more accurate?
- l341-347: it's understandable after a slow read, but it would be better to split the sentence


Review 3

This paper introduces a new approach to Roman numeral analysis that outperforms previous state-of-the-art models by leveraging pretraining with a BERT-like architecture.

The paper is well written, and the results are highly promising, while also addressing important questions regarding automatic Roman numeral analysis. Namely, the authors address the issue of potentially multiple valid interpretations of a given segment or analysis.

However, there are several areas that would require minor corrections and refinement:

  1. Missing Publication Years in References: A very important point is that all references lack the publication year, which is essential for proper citation and the high standard expected of an academic proceedings paper.

  2. Inconsistency in References: A more minor point is the presentation of publishing venues for proceedings, which are abbreviated or in a couple of cases missing completely. For example, "in ISMIR" should be "in Proceedings of the International Society for Music Information Retrieval Conference (ISMIR)", and likewise for others. In summary, the authors are strongly encouraged to fix the references section.

  3. Caution against Overstated Claims: Strong claims should be tempered. For instance, the assertion in footnote 8 implying the potential incorrectness of a human annotator's analysis should be approached with caution. Multiple valid interpretations can exist for a segment or Roman numeral, as highlighted in Section 4.2. Hence, it's advisable for the authors to avoid absolute statements regarding the correctness of any single analysis.

  4. Standardization of Terminology: The capitalization of "Roman numeral" throughout the text varies. For consistency, it should be "Roman numeral," with "Roman" capitalized as it serves as an adjective referencing a place of origin.

  5. Reevaluation of Key Prediction Triviality: The claim that predicting the key is trivial warrants reevaluation. While the paper suggests high accuracy in key prediction, it's crucial to consider the broader context. Many existing approaches, such as AugmentedNet, ChordGNN, and PKSpell, do not achieve near-perfect accuracy in predicting local keys or key signatures. Supporting this claim with evidence from publications demonstrating similar high accuracies would strengthen its validity. Alternatively, softening the claim to reflect the complexity and challenges associated with key prediction would be more prudent.

Addressing these points will enhance the clarity, strength, and credibility of the paper.


Author description of changes:

I respond to the main reviewer comments below. The paper has been updated to address issues of formatting, grammar, etc., without further comment here.

Reviewer 1

I am looking for an in-depth discussion of the choice of MusicBERT

I added an explanation of why I chose to use MusicBERT. I agree that a comparative evaluation of MusicBERT against other models would be useful, but that will await future work. In any case, the main contribution of this paper is the general approach of using a pre-trained masked language model for token-level music theory tasks, which to my knowledge has not previously been applied to Roman numeral analysis or any similar task.

a model diagram illustrating its training and inference will be helpful

Unfortunately, there is no room to add a diagram within the space constraints of an ISMIR paper.

Reviewer 2

There isn't adequate justification about why the first 9 layers of BERT are frozen specifically. It would have been interesting to experiment with different layer freezing, and even parameter- and resource-efficient fine-tuning methods like LoRA.

I now state how and why I chose to freeze the first 9 layers. I agree that LoRA is worth trying, but that will await future work.

acknowledge some compute-related information of your experiments, such as the finetuning and inference cost, in relation to existing work (at least through the parameter counts).

Thank you for mentioning this oversight. I added parameter counts.

Reviewer 3

The claim that predicting the key is trivial warrants reevaluation.

The reviewer may have misunderstood one small claim. Key prediction in general is not trivial. What I describe as "trivial" is only the task of predicting a spelled key (like "Db major") given

  • an unspelled key (like "1 major" for a major key with tonic pc 1), and
  • spelled pitch inputs (like Db5 and Ab5) rather than unspelled inputs (like MIDInums 61 and 68).

I revised the passage in question to clarify it with examples.
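
As a toy illustration of why this subtask is easy (the dictionary and helper below are hypothetical, not from the paper): once the note inputs are spelled, choosing between enharmonic spellings of the key largely reduces to matching letter names already present in the input.

```python
# Hypothetical sketch: disambiguating enharmonic key spellings from
# spelled note inputs. Not the paper's implementation.
PC_TO_SPELLINGS = {1: ["Db", "C#"]}  # tonic pitch class 1 has two candidates

def spell_key(tonic_pc, mode, spelled_notes):
    candidates = PC_TO_SPELLINGS[tonic_pc]
    # Prefer the tonic spelling that actually appears among the input notes.
    for tonic in candidates:
        if any(note.startswith(tonic) for note in spelled_notes):
            return f"{tonic} {mode}"
    return f"{candidates[0]} {mode}"

print(spell_key(1, "major", ["Db5", "Ab5"]))  # -> "Db major"
```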