Abstract:

The automated creation of accurate musical notation from an expressive human performance is a fundamental task in computational musicology. To this end, we present an end-to-end deep learning approach that constructs detailed musical scores directly from real-world piano performance-MIDI files. We introduce a modern transformer-based architecture with a novel tokenized representation for symbolic music data. Framing the task as sequence-to-sequence translation rather than note-wise classification reduces alignment requirements and annotation costs, while allowing the prediction of more concise and accurate notation. To serialize symbolic music data, we design a custom tokenization stage based on compound tokens that carefully quantizes continuous values. This technique preserves more score information while reducing sequence lengths by 3.5x compared to prior approaches. Using the transformer backbone, our method demonstrates better understanding of note values, rhythmic structure, and details such as staff assignment. When evaluated end-to-end using transcription metrics such as MUSTER, we achieve significant improvements over previous deep learning approaches and complex HMM-based state-of-the-art pipelines. Our method is also the first to directly predict notational details like trill marks or stem direction from performance data. Code and models are available on GitHub.
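As a rough illustration of the compound-token idea described above (a minimal sketch under assumed bin widths, attribute choices, and grouping; not the authors' released tokenizer), each performance-MIDI note can be bundled into a single tuple of quantized sub-tokens rather than several separate event tokens, which is what shortens the sequence:

```python
# Illustrative sketch only; attribute set, bin widths, and grouping are
# assumptions, not the paper's exact tokenization.
from dataclasses import dataclass

@dataclass
class NoteEvent:
    onset: float      # seconds
    duration: float   # seconds
    pitch: int        # MIDI pitch (0-127)
    velocity: int     # MIDI velocity (0-127)

def quantize(value: float, step: float, num_bins: int) -> int:
    """Map a continuous value onto a bounded set of discrete bins."""
    return min(int(round(value / step)), num_bins - 1)

def to_compound_token(note: NoteEvent, prev_onset: float) -> tuple:
    """Bundle one note's attributes into one compound token (a tuple of
    sub-tokens) instead of emitting several separate event tokens."""
    return (
        quantize(note.onset - prev_onset, step=0.01, num_bins=500),  # onset shift
        quantize(note.duration, step=0.01, num_bins=1000),           # duration
        note.pitch,                                                  # already discrete
        quantize(note.velocity, step=8, num_bins=16),                # coarse velocity
    )

notes = [NoteEvent(0.00, 0.48, 60, 72), NoteEvent(0.51, 0.46, 64, 80)]
tokens, prev = [], 0.0
for n in notes:
    tokens.append(to_compound_token(n, prev))
    prev = n.onset
print(tokens)  # [(0, 48, 60, 9), (51, 46, 64, 10)]
```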

Reviews
Meta Review

This paper addresses end-to-end MIDI-to-score conversion, a task that has received little attention in the MIR field.

The paper is written very well. The proposed method is technically solid and its effectiveness was clearly shown by the well-designed experiments.


Review 1

Overall, the paper is very clear and well-written. The methodological decisions are well motivated, and the evaluation is rigorous, using standard metrics, comparisons with state-of-the-art methods, and an ablation study of key design decisions. In addition, the code and pre-trained models will be released upon publication.

Since the dataset used for training and evaluation only includes piano performances (ASAP), I suggest including this aspect in the title (e.g., "End-to-end piano performance-MIDI...") and in the abstract.


Review 2

This work presents a Transformer-based approach for obtaining a musical score from a performance-MIDI file. The authors use a reference dataset for the experiments and compare against existing approaches (both commercial and research-oriented) using several evaluation metrics, concluding that their proposal is superior.

The manuscript is well written and technically sound. While the authors are committed to sharing the code and data if the work is accepted for publication, they also provide, as supplementary material to the conference, some examples of the results that the approach can obtain.

My only concern with the work is the exclusive use of the ASAP dataset (as labelled data). In this sense, have the authors considered any alternative dataset, at least for a particular experiment, to assess the generalization capabilities of the approach?


Review 3

This paper proposes a new end-to-end approach to performance-MIDI-to-score conversion using the Transformer architecture. The proposed method is well motivated (there is no one-to-one mapping between a performance-MIDI sequence and a music-score note sequence), well explained, achieves good results (beating the state of the art), and is supported by rigorous experiments.

Apart from the key contributions stated in the paper's introduction, the biggest strength of this paper, from my point of view, lies in the detailed technical considerations it presents. The use of data (labelled and unpaired), the tokenization of the music score data (11 attributes derived from the MusicXML score, and the use of a space token to achieve beat-level alignment), the data augmentation (especially the duration jitter and onset jitter to simulate human performance), and the detailed ablation study make the paper technically strong and provide plenty of useful insights to people working in related fields. I especially like the way the paper describes data-related design choices by supporting them with comparisons to previous/other settings (e.g., Section 3.1.2 Unpaired data and Table 5). Furthermore, it is great to see the ethics statement, which mentions future research on other music genres.
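To make the jitter idea mentioned above concrete, here is a minimal sketch of perturbing quantized score-note onsets and durations to imitate expressive timing; the Gaussian noise model, sigma values, and note representation are illustrative assumptions, not the paper's settings.

```python
# Minimal sketch of onset/duration jitter for data augmentation; noise model
# and parameters are assumptions rather than the paper's exact choices.
import random

def jitter_notes(notes, onset_sigma=0.02, duration_sigma=0.05, seed=None):
    """Perturb note onsets (additively, in seconds) and durations
    (multiplicatively) so a metronomic rendering sounds more human."""
    rng = random.Random(seed)
    jittered = []
    for onset, duration, pitch in notes:
        new_onset = max(0.0, onset + rng.gauss(0.0, onset_sigma))
        new_duration = max(0.01, duration * (1.0 + rng.gauss(0.0, duration_sigma)))
        jittered.append((new_onset, new_duration, pitch))
    return jittered

# Example: a strictly quantized C-major arpeggio becomes slightly expressive.
score_notes = [(0.0, 0.5, 60), (0.5, 0.5, 64), (1.0, 0.5, 67)]
performance_like = jitter_notes(score_notes, seed=0)
```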

Overall, I think it is a good paper that should be presented at ISMIR, where it will find a good audience and can be very helpful to others working on related tasks.

Below are only some minor comments.

  • L27: "significant downstream applications" => significant number of downstream applications?
  • L147: "Eq. (4)" => Eq. (5)
  • L178: "ml_j stores the preceding measure's length for downbeat notes or is set to false otherwise." => I'm still not very sure what ml_j is precisely. Is a downbeat note a note whose onset is on a downbeat? How is it related to the measure's length?
  • L205: "an inner dimension of 3072 for the position-wise feed-forward network" => It is said at the beginning of this paragraph that "the backbone model follows the original architecture described by Vaswani et al." But the original feed-forward dimension is 2048.
  • L211: After reading this paragraph, it is still not clear to me what the "conditioning token" and the "space token" are. They only became clear after I finished reading Section 3.1.2. It may be helpful to polish this paragraph a bit and add a reference to Section 3.1.2.
  • L234: "space token steam" => stream
  • Table 2: The total number of distinct pieces, P-MIDI notes, and score notes are not equal to the corresponding sum over the train/valid/test splits.
  • Table 4: Which SOTA is it in the table? Add a reference?
  • Table 5: MIDI scores do not actually have stem directions. How are the stems obtained for the MIDI scores (e.g. by importing to MuseScore or Finale)? It will be helpful to mention that to improve reproducibility. Or to annotate it as a "-" indicating there is no stem prediction.
  • L387: How are the barlines predicted after removing the time signature module? And similarly, how are the predictions converted into MusicXML format for evaluation?
  • References: Some minor formatting issues for items [6, 19, 20, 25, 26, 27, 29, 40, 41], mostly related to capitalisation.


Author description of changes:

We would like to thank all reviewers for their helpful comments and constructive feedback.

We have incorporated suggestions about the scope of the paper and additional detail about the tokenization method into the title and abstract (#1). Additionally, we clarified the explanation of the conditioning and space tokens, moving them into Section 3.1.2 and Section 2.2, respectively (#4). The tokenization section now includes more detail about handling polyphonic music, note ordering, and the encoding of musical time (#4 and meta #1). Dataset statistics in Table 2 were recomputed in order to reconcile inconsistent preprocessing between different splits (#4). Furthermore, Section 3.3 now contains additional information on how MusicXML scores were obtained for the baseline methods (#4).

Due to space restrictions, we were unable to insert a figure illustrating the tokenization in the paper (meta #1). However, clear code for all tokenization/detokenization steps will be available on GitHub, along with an illustration of the tokenization of a melody.
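As a purely hypothetical sketch of what such a melody illustration could look like (token names and the beat-grid convention are assumptions; the actual format is defined by the released code), beat-level "space" tokens might be interleaved with note tokens as follows:

```python
# Hypothetical illustration of interleaving beat-level "space" tokens with
# note tokens when serializing a melody; not the released tokenizer.
def tokenize_melody(notes):
    """notes: list of (beat_position, pitch, note_value), with beat_position
    counted in whole beats from the start of the piece."""
    tokens = []
    current_beat = 0
    for beat_pos, pitch, value in sorted(notes):
        # Emit one <space> token for every whole beat that passes before the
        # next onset, keeping the output aligned to the beat grid.
        while current_beat < int(beat_pos):
            tokens.append("<space>")
            current_beat += 1
        tokens.append(f"<note pitch={pitch} value={value}>")
    return tokens

melody = [(0, 60, "quarter"), (1, 62, "quarter"), (2, 64, "half")]
print(tokenize_melody(melody))
# ['<note pitch=60 value=quarter>', '<space>',
#  '<note pitch=62 value=quarter>', '<space>', '<note pitch=64 value=half>']
```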

We also want to address our focus on classical music and the ASAP dataset (#3). To our knowledge, there are no other publicly available datasets which include paired performance-MIDI and MusicXML scores. CrestMusePEDB is the only other dataset in the literature with MusicXML scores, but it is smaller, not openly available, and also limited to classical music. To illustrate our method's generalization, the supplementary material includes examples of generations for out-of-genre music from in-the-wild performances.

Finally, we fixed various typos and formatting issues (#4). To adhere to the page limit, some changes required minor reworking of other text passages in the paper; however, no content was significantly altered.