Abstract:

The field of Optical Music Recognition (OMR) focuses on models capable of reading music scores from document images. Despite its growing popularity, OMR is still confined to settings where the target scores are similar, in both musical context and visual presentation, to the data used to train the model. The common scenario therefore involves manually annotating data for each specific case, a process that is not only labor-intensive but also of questionable practicality. We present a methodology based on training a neural model with synthetic images, thus reducing the difficulty of obtaining labeled data. Since sheet music renderings exhibit regular visual characteristics compared to scores from real collections, we propose an unsupervised neural adaptation approach consisting of loss functions that promote alignment between the features learned by the model and those of the target collection, while preventing the model from converging to undesirable solutions. This unsupervised adaptation bypasses the need for extensive retraining, requiring only the unlabeled target images. Our experiments, focused on music written in Mensural notation, demonstrate that the methodology is successful and that synthetic-to-real adaptation is indeed a promising way to create practical OMR systems with little human effort.

Reviews
Meta Review

All reviewers agree that the presented unsupervised domain adaptation framework for improving Optical Music Recognition performance would be of interest to the ISMIR community, would create discourse, and could inspire work in parallel areas; in that sense, the topic is highly relevant to the conference. The readability of the paper is good, and the novelty and scientific contributions are sufficient for publication.

We recommend acceptance of the paper provided the following issues are addressed in the final version:

- Please see all reviews for detailed suggestions and changes to improve the clarity of the paper.
- Code should be made available as mentioned in the paper (it was not provided for this review).
- Add a comparison with previous works (Reviewer 1; R1) to the extent possible, including a comparative discussion of the two references mentioned.
- Consider the work in the papers listed by R3 on leveraging batch normalization statistics for domain adaptation and higher-order moment matching, to better ground the presented approach in that area. Make it clear that the idea of domain adaptation is not newly introduced in this work.


Review 1

This paper aims to create a more practical OMR system. Most ML research suffers from a shortage of 'good' datasets, and this is especially true for OMR research. To address this, the authors present an unsupervised Domain Adaptation (DA) method using a synthetic dataset. They carefully designed the loss function, and the entire technical process is explained in detail in the paper. Compared to previous works that utilize DA, this work targets an end-to-end approach, and the results clarify the effectiveness of their method.

While this paper uses different DA architectures and loss functions, I feel that it lacks a comparison with previous works. What are the differences and similarities between this work and "Domain adaptation for staff-region retrieval of music score images" by Castellanos et al., and "Real world music object recognition" by Tuggener et al.?

The latter also deals with sheet images written in Common Western Music Notation (CWMN), while this paper focuses on Mensural collections. Are you planning to adapt this method to the more common CWMN collections? If successful, this approach could make a more practical contribution to the MIR community, in my opinion.


Review 2

[Review text unavailable.]


Review 3

The article shows how to combine labeled synthetic music notation data for training with unlabeled real images for domain adaptation, in order to build an OMR pipeline that can deal with sheet music collections without having to label any of them. The relationship between synthetic and real data is one of the vital issues for OMR, because labeled training data is still very expensive. While domain adaptation is still far from solving this problem (based on the results in Table 2), it certainly helps significantly. I especially want to commend that the article performs these experiments not on one collection but across five different mensural corpora.

The domain adaptation loss consists of two terms, each implementing a clever trick. The adaptation term exploits the batch normalization mechanism, which summarizes incoming representations through the per-channel means and standard deviations used to normalize activations during the supervised training phase. During adaptation (if I understand this correctly), it updates the weights of the upstream layers so that the batch norm statistics computed on the in-domain data match the statistics stored during supervised training as closely as possible (measured with KL divergence); the distributions flowing to the downstream layers then look as close as possible to what the network saw during training. The regularization term prevents pathological collapses of the adaptation loss by encouraging predictions that look like plausible OMR outputs.
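To make the mechanism concrete, here is a rough sketch of what I understand the adaptation term to be doing, written in PyTorch; the hook design, the direction of the KL divergence, and all names are my own guesses, not taken from the paper or its code:

    import torch
    import torch.nn as nn

    def gaussian_kl(mu_q, var_q, mu_p, var_p, eps=1e-5):
        # KL( N(mu_q, var_q) || N(mu_p, var_p) ), computed per channel.
        var_q, var_p = var_q + eps, var_p + eps
        return 0.5 * (torch.log(var_p / var_q)
                      + (var_q + (mu_q - mu_p) ** 2) / var_p
                      - 1.0)

    class BNStatsHook:
        # Captures the batch statistics entering one BatchNorm layer and
        # compares them with the running statistics stored during the
        # supervised training phase.
        def __init__(self, bn: nn.BatchNorm2d):
            self.kl = None
            bn.register_forward_pre_hook(self._hook)

        def _hook(self, bn, inputs):
            x = inputs[0]                       # (N, C, H, W)
            mu = x.mean(dim=(0, 2, 3))          # per-channel batch mean
            var = x.var(dim=(0, 2, 3))          # per-channel batch variance
            self.kl = gaussian_kl(mu, var,
                                  bn.running_mean.detach(),
                                  bn.running_var.detach()).sum()

    # Usage sketch: forward one batch of unlabeled target images, then
    # minimize the summed KL terms w.r.t. the upstream weights. In
    # practice the running statistics would be frozen (e.g. momentum=0)
    # so the reference distribution does not drift during adaptation.
    # hooks = [BNStatsHook(m) for m in model.modules()
    #          if isinstance(m, nn.BatchNorm2d)]
    # model(target_batch)
    # loss_adapt = sum(h.kl for h in hooks)

If this reading is correct, part of the appeal is that nothing extra has to be stored: the running means and variances that batch norm keeps anyway already define the reference distributions.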

While these are clearly workable ideas, the paper suffers from a bit of in-domain OMR tunnel vision. By not considering related methods in broader computer vision terms, the paper implicitly overclaims that the domain adaptation term is an original contribution. This is important and, in my view, must be fixed for the camera-ready version, if accepted. The widely cited methods that leverage batch norm statistics for unsupervised domain adaptation are not exactly the same as the proposed domain adaptation term, so this also merits at least a light discussion of the advantages and disadvantages of the choices made in this paper compared to existing methods (new experiments with these existing methods are perhaps a bit unrealistic for the camera-ready version, and not necessarily valuable anyway).

In order for the paper to move from showing that the idea has potential for OMR to providing actionable guidance, I would also recommend a best-of-both-worlds experiment that combines domain adaptation with small amounts of in-domain data, to show whether the suggested ceiling is really as fragile as the glass metaphor suggests: does adding even a little in-domain data already help significantly?

The error analysis could also benefit from showing whether there are errors that domain adaptation introduces. The delta in Table 2 in fact has two parts: errors fixed by domain adaptation and errors introduced by it (with the second number likely being much smaller). However, especially if some of the test domains (i.e., the different datasets) exhibit a large number of errors introduced by the method, this could offer valuable qualitative insights into the limitations of the presented method and suggest directions for improving it towards applicability.
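To spell out the decomposition I have in mind, a toy example in Python (all numbers invented):

    # Hypothetical per-image error counts for the baseline model and the
    # domain-adapted model on the same four test images.
    err_base  = [3, 0, 5, 2]
    err_adapt = [1, 2, 5, 0]

    fixed      = sum(max(b - a, 0) for b, a in zip(err_base, err_adapt))
    introduced = sum(max(a - b, 0) for b, a in zip(err_base, err_adapt))
    print(fixed, introduced, fixed - introduced)  # 4 2 2 (net = delta)

Reporting "fixed" and "introduced" separately, rather than only their difference, is what would expose the failure modes described above.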

I recommend the paper for acceptance, especially because it is a good attempt at tackling the important problem of synthetic-to-real domain adaptation in OMR with purely synthetic training data, but with the important caveat that domain adaptation via batch norm statistic matching should not be presented as new: it has been explored repeatedly over the past years (see the comments on related work above); hence a weak rather than a strong accept. This is a serious omission, but it can easily be fixed in the camera-ready version and, in my view, does not detract much from the long-term value of the paper. It is a great first step in applying these techniques to OMR. I look forward to next steps that move the results of this method closer to those obtained with in-domain training data, especially with an eye to adapting other computer vision techniques used for this purpose.


Author description of changes:

In response to the reviewers' comments, we made the following changes to our paper:

  1. We corrected minor grammatical and typographical errors.
  2. We ensured all references are accurate and properly formatted.
  3. We revised Fig. 1 to improve readability.
  4. We clarified and expanded explanations in several sections as requested by reviewers.

We appreciate the reviewers' valuable feedback.