Abstract:

Note alignment refers to the task of matching individual notes of two versions of the same symbolically encoded piece. Methods addressing this task commonly rely on sequence alignment algorithms such as Hidden Markov Models or Dynamic Time Warping (DTW) applied directly to note or onset sequences. While successful in many cases, such methods struggle with large mismatches between the versions. In this work, we learn note-wise representations from data augmented with various complex mismatch cases, e.g. repeats, skips, block insertions, and long trills. At the heart of our approach lies a transformer encoder network --- TheGlueNote --- which predicts pairwise note similarities for two 512 note subsequences. We postprocess the predicted similarities using flavors of weightedDTW and pitch-separated onsetDTW to retrieve note matches for two sequences of arbitrary length. Our approach performs on par with the state of the art in terms of note alignment accuracy, is considerably more robust to version mismatches, and works directly on any pair of MIDI files.

Reviews
Meta Review

This is a meta-review, summarising the main points of the individual reviews.

Traditional methods like Dynamic Time Warping (DTW) or Hidden Markov Models (HMM) struggle with aligning symbolic note sequences that have large mismatches. This paper proposes a novel approach using various complex manipulations (e.g., repeats, skips, block insertions, and long trills) to train a model that predicts pairwise note similarities for two 512-note subsequences. As a post-processing step, DTW is used and compared to two alternatives. The approach is shown to lead to very high accuracy, even in the presence of large mismatches, and it is directly applicable to any kind of MIDI file.

Strengths:
- Innovative Approach: The paper proposes a novel method using a transformer-based model for aligning symbolic note sequences, which addresses the limitations of traditional methods like DTW and HMM, especially in the presence of large mismatches.
- Detailed Explanation: The model and experimental setup are explained in detail, providing a clear understanding of the approach and its implementation.
- High Accuracy: The proposed method demonstrates very high accuracy, outperforming baseline methods and showing robustness against a large number of mismatches.
- Reproducibility: The authors promise to provide code and datasets online, which facilitates reproducibility and further research.

Weaknesses:
- Clarity in Description: Some parts of the paper, particularly the data augmentations and certain concepts (see e.g. the comment on attention by reviewer 4), could be explained more clearly.
- Figures 1 and 2 have been noted to either lack informativeness or be too dense. Additionally, inconsistencies in terminology and numbering in tables need addressing.

Please also consider the very detailed and helpful comments by the individual reviewers!

Overall, the paper is a valuable contribution to ISMIR and recommended for acceptance.


Review 1

This paper presents a method to perform note alignment via learned representations. It essentially learns the pairwise similarity of two 512-note sequences. The core of the method is a transformer encoder that encodes the note sequences to determine a similarity measure, from which a match extractor then retrieves the matches by (1) taking the maximally similar pairs, (2) using a transformer decoder to return the matches, or (3) using DTW. DTW-based matching is done in two stages: first roughly aligning the sequences under typical DTW path constraints, then matching the notes pitch-wise so that the result stays close to the initial DTW result.
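To make option (1) concrete, here is a minimal sketch of maximum-similarity match extraction. It is an illustration only, not the authors' implementation: the function name `argmax_matches`, the toy similarity values, and the `threshold` parameter are all assumptions introduced here.

```python
def argmax_matches(similarity, threshold=0.5):
    """Pair each note of sequence 1 (rows) with its most similar note in sequence 2 (columns).

    `threshold` is a hypothetical cut-off: rows whose best score falls below it
    are treated as unmatched (e.g. inserted notes).
    """
    matches = []
    for i, row in enumerate(similarity):
        j = max(range(len(row)), key=row.__getitem__)  # column with the highest similarity
        if row[j] >= threshold:
            matches.append((i, j))
    return matches


# Toy similarity matrix: 4 notes in sequence 1, 3 notes in sequence 2.
similarity = [
    [0.9, 0.1, 0.0],
    [0.2, 0.3, 0.1],  # no confident partner: behaves like an inserted note
    [0.1, 0.8, 0.1],
    [0.0, 0.1, 0.7],
]
matches = argmax_matches(similarity)
```

Note that this row-wise rule ignores temporal order entirely, so nothing prevents crossing or many-to-one matches; a DTW-based extractor adds exactly that ordering constraint.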

The method is evaluated with various augmentations including repeats, skips, insertions, deletions, and trills. The results show that the method outperforms a baseline using a pitch-onset similarity matrix, and that the DTW-based processing performs best. The paper also shows that larger models with a larger residual dimension and more blocks perform better. Finally, the method is compared against existing methods, which shows that it is more robust under a large number of mismatches.

The paper is interesting and provides a robust method for note-level alignment, which is valuable for the community. There are a few questions, however.

First, please state the terminal conditions for the DTW, i.e. whether there are constraints at the beginning and the end of the sequences. Second, in theory this model can be applied to sequences of arbitrary length, except potentially when using a transformer decoder. It would have been nice to understand the performance when the note duration is varied (and hence the extent of nonlocal information provided to the model is varied).
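The question about terminal conditions matters because the two common choices can give very different results. The sketch below contrasts them on a toy pitch sequence; it is a generic textbook DTW with the step set (1,0), (0,1), (1,1), not the paper's weightedDTW or onsetDTW, and the function name `dtw_cost` and the `open_ends` flag are assumptions made for illustration.

```python
import math


def dtw_cost(cost, open_ends=False):
    """Accumulated DTW cost over a pairwise cost matrix (rows: sequence 1, columns: sequence 2).

    open_ends=False: the path must start at (0, 0) and end at (n-1, m-1).
    open_ends=True:  the path may start in any column of the first row and end in
                     any column of the last row (subsequence-style DTW), so all of
                     sequence 1 is matched to a part of sequence 2.
    """
    n, m = len(cost), len(cost[0])
    acc = [[math.inf] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            if i == 0 and (j == 0 or open_ends):
                best_prev = 0.0  # a path is allowed to start here
            else:
                candidates = []
                if i > 0:
                    candidates.append(acc[i - 1][j])
                if j > 0:
                    candidates.append(acc[i][j - 1])
                if i > 0 and j > 0:
                    candidates.append(acc[i - 1][j - 1])
                best_prev = min(candidates)
            acc[i][j] = cost[i][j] + best_prev
    return min(acc[-1]) if open_ends else acc[-1][-1]


# Sequence 1 is an exact excerpt of sequence 2 (MIDI pitches).
seq1 = [60, 62, 64]
seq2 = [55, 57, 60, 62, 64, 70]
cost = [[abs(p - q) for q in seq2] for p in seq1]

fixed_cost = dtw_cost(cost)                 # endpoints pinned to the corners
open_cost = dtw_cost(cost, open_ends=True)  # free start and end in sequence 2
```

With pinned endpoints the extra notes at the start and end of `seq2` must be absorbed by the path and add cost, whereas the open-ended variant can skip them, which is why the boundary conditions should be stated explicitly.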


Review 2

This paper presents “TheGlueNote”, a transformer encoder that predicts note similarities for the alignment of two MIDI sequences with potentially complex mismatch cases such as note repetitions, insertions, or deletions. The authors give a good introduction to the topic and, to the best of my knowledge, reference all important related work. The core model and its extensions are described in detail, as is the experimental setup. Together with the promise of providing the code and datasets online, this offers researchers an excellent starting point for reproducing results and testing the architecture in their own scenarios. The proposed model is evaluated at three different levels of complexity (# parameters) and with three different decoding variants. A comparison with suitable baseline methods is conducted, and the results of the proposed model seem to be impressive while requiring relatively low runtime.

I therefore recommend a “strong accept” for this paper.

For the camera-ready version of this paper, I have a few minor suggestions:
- Make Fig. 1 more illustrative, e.g., by zooming in to a smaller segment and highlighting the problems for interesting sections like trills, insertions,…
- Enhance the readability of the similarity matrices in Fig. 2, e.g., by choosing black and white colormaps and improving the contrast.
- Explain the data augmentation in Table 2 better. What does gT_t2^{n_t} mean? What does U(-50,50) (given in MIDI ticks) mean in seconds? Does P_{repeat/skip/trill}=1 mean that the probability of repeats/skips/trills is one? Does it mean that every note is repeated or becomes a trill?
- Facilitate the quick readability of Table 4. It is very interesting to see the results individually for all pieces, but space would permit adding two more columns representing the mean score over all pieces for Default data and 20% mismatch data.
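On the U(-50,50) question: a tick count only converts to seconds once the file resolution and tempo are fixed, neither of which is stated in this review. The sketch below applies the standard MIDI relation under two assumed values, 480 ticks per quarter note and 500,000 microseconds per quarter note (120 BPM); both are illustrative defaults, not values taken from the paper, and the function name is hypothetical.

```python
def ticks_to_seconds(ticks, ticks_per_quarter=480, tempo_us_per_quarter=500_000):
    """Convert MIDI ticks to seconds.

    ticks_per_quarter:    file resolution (assumed here, not from the paper).
    tempo_us_per_quarter: microseconds per quarter note; 500_000 is 120 BPM (assumed).
    """
    return ticks * tempo_us_per_quarter / (ticks_per_quarter * 1_000_000)


# Largest onset shift drawn from U(-50, 50) ticks, under the assumptions above.
max_shift_s = ticks_to_seconds(50)
```

Under these assumptions a 50-tick shift is roughly 52 ms; a different resolution or tempo scales the result proportionally, which is why the paper should state the value in seconds or give both parameters.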


Review 3

This paper proposes TheGlueNote, a transformer-based model that learns note-wise representations for note alignment in MIDI files. The training data is synthesized by data augmentation with complex mismatch cases. The authors also compare three ways to extract note matches from the similarity matrix computed on the learned representations. The experiments and results are convincing, but the writing can be improved (see details below).

major comments:

1. Since the proposed model and data augmentation are the key parts of the paper, they should be described more clearly. Lines 183-186: ‘processed using the fixed-length structured tokenization [22, 23], which encodes relative onset, pitch, duration, and velocity.’ The authors should describe how onset, pitch, duration, and velocity are encoded, rather than assuming that the readers have read [22, 23]. Only with this information will the numbers in Table 2 make sense.

Section 4.1: The data augmentations should be described more clearly. The notation in Table 2 should be explained clearly in the paper.

Line 306: “an experiment including extended (100+ note) mismatches”. The authors should describe in more detail how the extended mismatches are augmented. This will help readers understand which kinds of mismatch the proposed model is robust to.

2. Confusion of concepts (Lines 12-13, Lines 192-195): The authors concatenate s1 and s2 into one sequence as the network input. The concepts of ‘within-sequence self-attention’ and ‘between-sequence cross-attention’ are easily confused with the self-attention and cross-attention commonly used in the Transformer. Please describe these concepts in a better way.

minor comments:

‘CEL Row, CEL Col, CEL’: The loss functions are only mentioned in the Figure 2 caption. Please clearly describe the losses in the paper as well and clearly write ‘cross-entropy loss (CEL)’ in the paper.

Figure 2: ‘Data processing’ in Figure 2, but ‘pre-processing module’ in the caption. Please make them consistent. Please clearly mark the ‘Attention Block’ in TheGlueNote part, and it should also be ’n \times Attention Block’ as in the Decoder Head part.

Table 3: ‘#ph’ in the table, but ‘#p dh’ in the caption. Please make them consistent.

Line 60: What do you mean by ‘non-local information’? Please rewrite the sentence to make it clear.

Lines 74-78: Prior approaches … add references at the end of the sentence or mention that, for example, ‘Please see details in related work (Section 2).’

Line 163 to to => to

Lines 175-177: ‘A pairwise similarity matrix of the note representations of either sequence’ => ‘A pairwise similarity matrix computed between the note representations of two sequences’

Line 195: cross-attention (s1-s2, s2-s1). Is s1-s2 the same as s2-s1 after transposing s2-s1? Please describe the relationship between s1-s2 and s2-s1.

Section 5.1 title model configuration => Ablation study of model configuration

Some references are not in correct format. Please check carefully.
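As background to the Line 195 question above, the following sketch shows why the answer is not obvious in general. It uses plain dot-product scores with a row-wise softmax on made-up two-dimensional vectors; it is not the paper's attention layer, and all names and values are assumptions for illustration.

```python
import math


def dot_scores(a, b):
    """Pairwise dot products: entry (i, j) is a[i] . b[j]."""
    return [[sum(x * y for x, y in zip(u, v)) for v in b] for u in a]


def row_softmax(matrix):
    """Normalise each row to sum to one, as attention weights are."""
    out = []
    for row in matrix:
        peak = max(row)
        exps = [math.exp(x - peak) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


# Toy note representations: 2 notes in s1, 3 notes in s2.
s1 = [[1.0, 0.0], [0.0, 1.0]]
s2 = [[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]]

scores_12 = dot_scores(s1, s2)  # s1 attends to s2, shape 2 x 3
scores_21 = dot_scores(s2, s1)  # s2 attends to s1, shape 3 x 2

raw_scores_are_transposes = scores_12 == transpose(scores_21)
attention_is_transposed = row_softmax(scores_12) == transpose(row_softmax(scores_21))
```

Here the raw scores are transposes of each other, but the attention weights are not, because each direction normalises over a different axis. In a standard Transformer layer, separate query and key projections mean that even the raw scores need not be transposes.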


Author description of changes:

We thank the reviewers for their thoughtful and constructive feedback! We made the following changes for the final submission:
- removed uninformative Figure 1 (note alignment visualization) mainly to make room for:
- extended section 4.1 with a thorough description of the data augmentations
- improved readability of Figure 2 (now Figure 1): consistent wording, high-contrast matrices, loss description in main text
- Table 4 includes mean results for all model/dataset variations
- added links and acknowledgements
- changed title of 5.1
- fixed several typos
- reworded "non-local information"
- removed confusing use of cross- and self-attention
- fixed table numbering
- fixed several references