Abstract:

We propose an efficient workflow for high-quality offline alignment of in-the-wild performance audio and corresponding sheet music scans (images). Recent work on audio-to-score alignment extends dynamic time warping (DTW) so that it can, in principle, handle jumps in sheet music induced by repeat signs; this method requires no human annotations, but we show that it often yields low-quality alignments. As an alternative, we propose a workflow and interface that lets users quickly annotate jumps (by clicking on repeat signs), requiring a small amount of human supervision but yielding much higher-quality alignments on average. Additionally, we refine the audio and score feature representations to improve alignment quality by: (1) integrating measure detection into the score feature representation, and (2) using raw onset prediction probabilities from a music transcription model instead of a piano roll. We propose an evaluation protocol for audio-to-score alignment that computes the distance between the estimated and ground truth alignments in units of measures. Under this evaluation, we find that our proposed jump annotation workflow and improved feature representations together improve alignment accuracy by 150% relative to prior work (33% → 82%).
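For concreteness, the sketch below illustrates one plausible reading of the measure-level evaluation described above, assuming a half-measure tolerance (a radius the authors mention later in their response). The function name, the use of ground-truth inter-measure durations to convert timing errors into measure units, and the handling of the final measure are assumptions for illustration, not the paper's exact definition.

```python
import numpy as np

def measure_level_accuracy(pred_times, gt_times, tolerance_measures=0.5):
    """Fraction of measures whose predicted audio timestamp lies within a
    given tolerance (expressed in measures) of the ground-truth timestamp.

    pred_times, gt_times: 1-D arrays giving, for each measure in the score,
    the audio time (in seconds) at which that measure is estimated /
    annotated to begin.
    """
    pred_times = np.asarray(pred_times, dtype=float)
    gt_times = np.asarray(gt_times, dtype=float)

    # Convert each timing error from seconds into "measure units" by dividing
    # by the duration of the corresponding ground-truth measure (the last
    # measure reuses the median duration, since it has no successor).
    durations = np.diff(gt_times, append=gt_times[-1] + np.median(np.diff(gt_times)))
    errors_in_measures = np.abs(pred_times - gt_times) / durations
    return float(np.mean(errors_in_measures <= tolerance_measures))

# Example: four measures, the third prediction is off by roughly one measure.
print(measure_level_accuracy([0.0, 2.0, 6.1, 6.0], [0.0, 2.0, 4.0, 6.0]))  # 0.75
```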

Reviews
Meta Review

The paper proposes a simple solution to aligning audio recordings with sheet music when the two differ structurally (e.g., when repeats are taken or skipped): simply annotate them. The authors present a quick workflow for this task and demonstrate the effectiveness of the extra annotations (particularly the repeats) in their evaluation and the provided videos.

Main Strengths:
- Clarity and Practicality: The paper is well-written, easy to understand, and presents a practical solution to the problem of aligning audio files with sheet music, especially when there are structural differences such as repeats or jumps.
- Effectiveness: Demonstrates a clear improvement over baseline systems when using annotated information on repeats and jumps, which highlights the practical application and potential benefits in real-world scenarios.
- Human-in-the-Loop Approach and Workflow Integration: Emphasises the value of human annotations in improving alignment accuracy, which is often overlooked in favour of fully automatic solutions. This approach is both quick and low in human effort. The workflow is cohesive and integrates well with existing state-of-the-art approaches, making it a robust engineering solution.
- Reproducibility: The promise to release the code upon acceptance ensures that the results can be reproduced and verified by others in the community.

Main Weaknesses:
- Lack of Novelty: The approach lacks significant novelty from a research perspective. The idea of labelling repeats and using annotated jumps is not new, and the improvements over existing methods are not groundbreaking.
- Evaluation Scope: The evaluations, while showing improvements, lack depth in analysis. For instance, there is a need for a more detailed explanation of the significant accuracy jump attributed to feature representations.
- Dataset Limitations: The datasets used for evaluation are limited, particularly in the diversity of music types. The evaluation predominantly focuses on piano music, which may not fully represent the system's capabilities.

Further Comments:
- The measure-aware alignment and evaluation make intuitive sense but could be perceived as arbitrary due to the variability in measure lengths.
- The proposed system's baseline comparison should also consider using similar features to those of existing methods to ensure a fair evaluation.

Requested Improvements:
- References: The references are inconsistently formatted and incomplete in some cases. Please resolve this for a potential final version. Also, some additional references to papers addressing jumps/repeats in the alignment context should be added (see the individual reviews).
- Features: Claims of improvements in feature representation should be supported by evaluations on larger datasets such as SMR, in addition to the relatively small M13 dataset. The paper could benefit from a more detailed analysis of why the proposed feature representations supposedly lead to significant improvements.

Please also consider the more detailed comments by the individual reviewers.

Despite the lack of significant novelty, the paper presents a practical and effective solution to audio-to-sheet music alignment. Its clear writing, practical workflow, and demonstrated improvements in alignment accuracy make it a valuable contribution.


Review 1

In this paper, the authors propose an audio-to-sheet-music synchronization system that combines several state-of-the-art components: note and staff line detection [16, 17], measure position detection [21], music transcription [3], and standard DTW [22]. On top of this engineering solution, the authors developed an interface to annotate jumps and repeats and feed the system with this information.
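To make the "standard DTW" step concrete, here is a minimal, self-contained sketch of aligning a score feature sequence to an audio feature sequence with vanilla DTW. The cosine cost, the feature shapes, and the function name are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def dtw_align(score_feats, audio_feats):
    """Standard DTW between a score feature sequence (N x D) and an audio
    feature sequence (M x D); returns the warping path as a list of
    (score_index, audio_index) pairs. Cosine distance is an assumption."""
    # Pairwise cosine distance matrix between the two feature sequences.
    a = score_feats / (np.linalg.norm(score_feats, axis=1, keepdims=True) + 1e-8)
    b = audio_feats / (np.linalg.norm(audio_feats, axis=1, keepdims=True) + 1e-8)
    C = 1.0 - a @ b.T

    # Accumulated cost with the usual step set {(1,1), (1,0), (0,1)}.
    N, M = C.shape
    D = np.full((N + 1, M + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, N + 1):
        for j in range(1, M + 1):
            D[i, j] = C[i - 1, j - 1] + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])

    # Backtrack from the end to recover the optimal warping path.
    path, i, j = [], N, M
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]
```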

Evaluation is performed at the measure level using the MeSA-13 dataset and a subset of the SMR dataset. The proposed system is compared with the method in [17], which uses Hierarchical DTW to automatically account for jumps and repeats. Results demonstrate the superior performance of the proposed system, especially when using the information on jumps and repeats. Other sources of information, such as ground-truth measure and staff metadata, have also been evaluated and show a slight improvement.

The paper is well-structured and easy to read. The context of the paper is well presented, and several demonstrations of the technology are provided.

In my opinion, there is little novelty in the proposal from a research point of view, but I see potential on the technological side, with several applications for the community and the market (MuseScore, page-turning systems, etc.).


Review 2

The motivation is clear and reasonable: to manually handle the jumps in alignment. The repetition labeling process brings significant improvement in accuracy with relatively low human effort. With the collected annotations, it is possible to automate the process with a model in the future.

I have a few minor comments:
1. I suggest moving Section 5 before the system description, as it does not depend on the alignment system (in fact, it's the opposite).
2. In Table 1, it would be helpful to include results with only ground truth repeat annotations for comparison with the human labels.
3. The experimental settings for the comparisons among different representations (L399-L402) are unclear. I guess this is related to different combinations of score features and audio features. The authors might want to clarify this.
4. The authors need to pay attention to the references: use a consistent format (proceedings names, conference names, etc.), fill in missing fields (authors, pages, years), and replace arXiv versions with proceedings versions where applicable.

Overall, the novelty and impact are not very high; therefore, I recommend a weak accept.


Review 3

Short Paper Summary: This paper proposes an audio-to-score alignment approach to align arbitrary music performances to sheet images. The approach extends and improves upon existing audio and score representations (bootleg score) for this task and further relies on minimal human intervention in the form of annotated jumps/repeats in the sheet image. This can usually be done quickly and is shown to significantly improve the alignment quality.

Note on Reproducibility: The authors promise to release their code upon acceptance (and presumably their annotation tool as well). Together with the information on which pieces were used for evaluation (given in the supplementary material), it should be possible to reproduce and verify the results.

Main Review: While the basic components of the approach are not novel (a bootleg-score-inspired score representation, Onsets and Frames to obtain a piano-roll-like representation of the audio, DTW), I still think the practicality of the approach warrants an acceptance. The paper is clearly written and for the most part easy to follow. That being said, some aspects of the paper could be improved.

In Section 3.1, it is not immediately clear to me how the shape of S_i (line 240) is derived. The 88 pitch dimension is clear, but what is the reasoning behind 48, and how does this impact the resolution of the score representation, e.g., with respect to extremely fast note runs? I also think the (index) variable i is used for two different things in lines 239/240/242; to make the difference clearer, I would suggest using another variable. In the same section, where the conversion of the bootleg score into a piano roll is described: is C major always assumed as the default key? I might be reading the sentence (starting at line 253) wrong, or the end of the insertion mark is missing.
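To illustrate what assuming "C major as the default key" would imply when converting a bootleg-score-like representation into a piano roll, here is a small sketch under that assumption: staff positions map only to white-key MIDI pitches, so accidentals and other key signatures are ignored. The convention of counting diatonic steps from middle C and the function name are hypothetical, not the paper's actual procedure.

```python
# Pitch classes of the C major scale (white keys): C D E F G A B.
C_MAJOR_PCS = [0, 2, 4, 5, 7, 9, 11]

def staff_position_to_midi(diatonic_steps_from_c4):
    """Map a diatonic staff position (counted in staff steps from middle C)
    to a MIDI pitch, assuming C major and ignoring accidentals."""
    octave, degree = divmod(diatonic_steps_from_c4, 7)
    return 60 + 12 * octave + C_MAJOR_PCS[degree]

# Example: the treble staff lines E4 G4 B4 D5 F5 sit 2, 4, 6, 8, 10 steps above C4.
print([staff_position_to_midi(s) for s in (2, 4, 6, 8, 10)])  # [64, 67, 71, 74, 77]
```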

In Section 3.2, the use of onset predictions from the Onsets and Frames model is introduced. It is worth noting here that Shan and Tsai [16, 17] already used the onsets derived from the MIDI transcription of Onsets and Frames for their bootleg score representation of the audio. I assume this is similar to what is later tested in Section 4.5 as onset predictions (with the difference of staying in a "piano roll space"), which performs similarly to the onset probabilities. However, that evaluation seems to be done only on the relatively small M13 dataset. In order to really claim that this refined audio representation results in an improvement, I would suggest also evaluating on the larger SMR dataset.
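The distinction drawn here, binarized onset predictions versus raw onset probabilities, can be sketched as follows. The array shape, the threshold value, and the function name are assumptions, since the exact output format of the transcription model is not specified in the review.

```python
import numpy as np

def audio_feature_variants(onset_probs, threshold=0.5):
    """Two audio feature variants built from a transcription model's onset output.

    onset_probs: array of shape (num_frames, 88) with per-frame, per-pitch
    onset probabilities (assumed to come from a model such as Onsets and
    Frames; this interface is an illustrative assumption).
    """
    # Variant discussed by the reviewer: binarize first, giving a
    # piano-roll-like onset representation comparable to features derived
    # from the model's MIDI transcription.
    binarized = (onset_probs >= threshold).astype(np.float32)

    # Variant proposed in the paper under review: keep the raw probabilities
    # as soft features for the alignment cost computation.
    raw = onset_probs.astype(np.float32)

    return binarized, raw
```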

Overall, I think the comparison, and the claim of superiority "due to our refinements to feature representations", is not entirely fair and is not clearly demonstrated in the experiments; this is already acknowledged to some extent in the paper itself through the proposed measure-wise evaluation. If I am not mistaken, nothing prevents the proposed approach from using similar features, i.e., the same bootleg score representation as Shan and Tsai [16, 17], still performing measure segmentation but stopping before the conversion to piano rolls described in Section 3.1. Such a model could be used as an additional baseline/vanilla version. Alternatively, it might be possible to extend the existing measure-aware evaluation to a system-level evaluation, which would allow for a more direct and fairer comparison (although perhaps not as useful with respect to overall alignment precision as the measure-aware version). In any case, I am not doubting that the proposed approach is able to yield more precise results, but I think such a comparison could benefit the paper and better support the claims made.

Minor Remarks:
- Line 493: "... that offers has an interactive …" -> "... that offers an interactive …"
- In the related work section, the authors write "We diverge from them by considering a range of different types of raw score images and audio (such as ones with instrumentation beyond solo piano)" (lines 472 onwards). While this is shown in the supplementary material with some examples, the datasets used for evaluation only contain one or two samples that are not strictly piano music. Even though it is limited to piano music and belongs to the score following domain, another potentially related line of research to check out could be Henkel and Widmer, "Real-Time Music Following in Score Sheet Images via Multi-Resolution Prediction", as they also experiment with raw sheet image scans.
- I ticked 'yes' for "the paper adheres to ISMIR 2024 submission guidelines", which is the case for the most part, but I want to make clear that the references are not well formatted. Please make sure to cite the proper conference versions of papers instead of the arXiv ones, e.g., [1, 3, 12, 13, 15, 16, 37, 38] were published at ISMIR. Also, venues are missing for [26, 29], and [19] shows a placeholder date.


Author description of changes:

- Updated Table 1 to reflect the final state of evaluation.
- Decided against moving Section 5 before the system description to avoid confusion regarding some repeat annotations. For instance, the repeat annotations in MeSA-13 come from a different interface (not ours).
- Upon being asked to include a setting with only ground truth repeat annotations in Table 1, we changed the wording in the caption to show that the human-labeled repeats are in fact also ground truth repeats.
- Changed wording to clarify the experimental setting in Section 4.5.
- Explained the shape of S_i in L240 in the same paragraph.
- Provided motivation for why we consider a half-measure radius in our evaluation metric.
- Fixed variables in L239/240/242.
- Clarified that C major is always the default key.
- Added clarification in Section 4.3 to emphasize that Shan et al. used the same piano transcription model that we used, Onsets and Frames, but with a different representation (the MIDI transcription obtained directly from the model).
- Added Table 2 to report results for the evaluation of different audio feature representations on both M13 and SMR.
- Decided against reporting results from additional baselines, since our focus in this paper is on showing that human-labeled repeats increase alignment accuracy considerably. Due to this, we did not evaluate the effect of minor algorithmic changes, but a more detailed analysis could be of interest in future work.
- We also do not implement a system/line-level evaluation metric because, even though that would allow us to evaluate our baseline's performance in a different setting, our workflow aims for improvement on measure-level alignment, so we keep our focus on measure-level evaluation.
- Included citations suggested by reviewers.
- Included links to supplementary videos and our code for reproducibility, as promised.
- Fixed the references to have consistent formatting and display all necessary information.
- Updated the SMR dataset (subset) to contain 60 pieces instead of 49.