Abstract:

In this paper, we propose a novel Self-Supervised-Learning scheme to train rhythm analysis systems and instantiate it for few-shot beat tracking. Taking inspiration from the Contrastive Predictive Coding paradigm, we propose to train a Log-Mel-Spectrogram-Transformer-encoder to contrast observations at times separated by hypothesized beat intervals from those that are not. We do this without the knowledge of ground-truth tempo or beat positions, as we rely on the local maxima of a Predominant Local Pulse function, considered as a proxy for Tatum positions, to define candidate anchors, candidate positives (located at a distance of a power of two from the anchor) and negatives (remaining time positions). We show that a model pre-trained using this approach on the unlabeled FMA, MTT and MTG-Jamendo datasets can successfully be fine-tuned in the few-shot regime, i.e. with just a few annotated examples, to achieve competitive beat-tracking performance.
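
The anchor/positive/negative selection described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `sample_triplet_indices`, the parameters `min_exp` and `max_exp`, the choice to look both before and after the anchor, and the restriction of negatives to the remaining peaks (rather than all remaining time positions) are assumptions made for illustration. Distances are counted in PLP peaks, treated here as tatum units.

```python
def sample_triplet_indices(num_peaks, anchor, min_exp=1, max_exp=3):
    """Split PLP peak indices into positives and negatives for one anchor.

    Assumptions (hypothetical, not taken from the paper):
    - peaks are indexed 0..num_peaks-1 and each peak is one tatum apart;
    - positives lie 2**k peaks away from the anchor, on either side;
    - negatives are all remaining peaks.
    """
    if not 0 <= anchor < num_peaks:
        raise ValueError("anchor must index an existing peak")

    # Power-of-two offsets, e.g. 2, 4, 8 for the default exponents.
    offsets = [2 ** k for k in range(min_exp, max_exp + 1)]

    # Keep only the candidate positives that fall inside the excerpt.
    positives = sorted(
        idx
        for off in offsets
        for idx in (anchor - off, anchor + off)
        if 0 <= idx < num_peaks
    )

    # Everything that is neither the anchor nor a positive is a negative.
    excluded = set(positives) | {anchor}
    negatives = [idx for idx in range(num_peaks) if idx not in excluded]
    return positives, negatives


positives, negatives = sample_triplet_indices(num_peaks=12, anchor=4)
print(positives)  # [0, 2, 6, 8]
print(negatives)  # [1, 3, 5, 7, 9, 10, 11]
```

With 12 peaks and the anchor at peak 4, the peaks 2, 4 or 8 tatums away that fall inside the excerpt become positives, and every other peak becomes a negative.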

Reviews
Meta Review

All four reviewers agree on the acceptance of the paper. However, it is remarkable that all reviewers pick up the same weak points of the submission, and the authors have to address these with clear comments and explanations (please see the detailed formulations in the reviews): 1. Restrictions due to the oversimplistic binary meter model. 2. With PLP as a basis, how does the system performance depend on genre and tempo stability? Also, more detail on the PLP parameters must be provided. As meta-reviewer I would also add that you should explain how statistical significance was judged.


Review 1

This paper describes a contrastive pretraining method for beat tracking. The methods are highly novel. The use of a PLP for contrastive sample mining is very interesting and seemingly effective. The experiments are comprehensive on several datasets on both few-shot and full fine-tuning tasks.

Some questions:

Methodology: The introduction of PLP helps solve the task of mining positive samples in beat tracking. The idea is highly novel and interesting. There are some concerns about PLP: (1) How accurate is it for different genres? Will it be reasonably accurate if the music has rapid local tempo changes? (2) How likely are the peaks to form binary-segmented tatums (i.e., 8-th notes) instead of ternary ones (i.e., 12-th notes)? 1-2 lines describing preliminary experiments would help readers get a better sense of the feature.

Experiments: Section 4.5: "Instead of ... of layer sequences." What is the weighted sum of layer sequences? Also, why choose a different fine-tuning scheme? Is linear probing not as good? Or, is there a specified reason?

Related works: The idea of utilizing the binary rhythmic structure for self-supervised learning was previously used in [1].

The methodology of the paper could be beneficial to other rhythm-related downstream tasks, or even general pretraining models for MIR. Considering the novelty of the method and the concrete results, I recommend a strong acceptance with the possibility of an award nomination.

[1] Jiang, J., & Xia, G. (2023, June). Self-Supervised Hierarchical Metrical Structure Modeling. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 1-5). IEEE.


Review 2

This work proposes a novel scheme to pretrain a network in a self-supervised manner using a contrastive loss and to fine-tune the network for beat tracking with few-shot learning; the results are comparable to the SOTA supervised models and nicely demonstrate the effectiveness of the methods. Considering that there are not many works on self-supervised beat tracking, and that the proposed methods are novel and show promising performance, I would originally have liked to give a strong accept. However, due to the issues I mentioned in the reusable insight section and the following minor flaws, I have to weaken my acceptance of this work.

  • Line 173: What does this binary structure mean here?
  • Line 182: This description is repeated several times in different ways, but the unit of the distance is not clear until this sentence. For example, line 13 and lines 165-169.
  • Lines 409-416: The observation that the proposed model performs much better on Hainsworth is worthy of more discussion. Is it because the model learns different knowledge, or because of some special properties of Hainsworth? Without discussion, the reusable insight of this improvement is limited.
  • Line 462: rewrite the sentence: " where our the selection of anchor, positive and negative peaks derives from a Predominant Local Pulse function."
  • Line 236: alpha = 4 tu?


Review 3

The paper proposes a novel self-supervised learning (SSL) approach for rhythm analysis and tackles few-shot beat tracking. The learning strategy is based on contrastive learning, and the pretext task samples anchor, positive, and negative points from Predominant Local Pulse (PLP) maxima. These samples are used to train an encoder to contrast observations at beats (actually multiples of the tatum) from those that are not, using unlabelled data from the FMA, MTT and MTG-Jamendo datasets. The pre-trained model is then fine-tuned with just a few annotated examples (i.e. few-shot) on the beat-tracking task. Results show the proposed approach outperforms an existing SSL method (Zero-Note Samba) and yields competitive results when compared to state-of-the-art supervised models.

The paper is well written and organized (despite some minor corrections listed below) and is a very good contribution to ISMIR.

The proposal has some clear limitations, though. First, the binary structure hypothesis does not hold for many music styles. Moreover, the sampling from the PLP makes the assumption that peaks are synchronized with 8-th notes and that the tracks are in 4/4. The authors acknowledge these are over-simplistic assumptions (line 174), but I encourage them to add some comments on how the proposal could be extended to non-binary music structures. One could think of other sampling strategies that may account for other meters, but this would probably require meter classification.

In addition, relying on the PLP function has some drawbacks. In particular, it can be noisy when there are tempo fluctuations or sudden tempo changes. The proposal deals with this kind of situation by filtering out tracks where the inter-peak distance of the PLP function is not almost constant (lines 375-380). Then, time-varying tempo is synthetically introduced through data augmentation during training. This raises the question of to what extent the model can deal with tempo fluctuations within a song, so some insights on that would be welcome.
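
To make the filtering step mentioned above concrete, here is a minimal sketch of one way to test whether the inter-peak distance of a PLP function is "almost constant". The criterion used in the paper is not given in this review, so the function name `has_stable_pulse`, the use of the coefficient of variation, and the threshold `max_rel_dev=0.05` are all assumptions chosen for illustration.

```python
from statistics import mean, pstdev


def has_stable_pulse(peak_times, max_rel_dev=0.05):
    """Return True when inter-peak intervals are nearly constant.

    peak_times: increasing PLP peak times in seconds.
    max_rel_dev: hypothetical threshold on std(intervals) / mean(intervals).
    """
    if len(peak_times) < 3:
        return False  # fewer than two intervals: regularity cannot be judged

    # Distances between consecutive PLP peaks.
    intervals = [b - a for a, b in zip(peak_times, peak_times[1:])]
    avg = mean(intervals)
    if avg <= 0:
        return False  # peak times are not strictly increasing on average

    # Coefficient of variation: spread of the intervals relative to their mean.
    return pstdev(intervals) / avg <= max_rel_dev


print(has_stable_pulse([0.0, 0.25, 0.5, 0.75, 1.0]))   # True
print(has_stable_pulse([0.0, 0.25, 0.6, 0.75, 1.2]))   # False
```

A track (or excerpt) failing such a check would be discarded before pre-training, which is why the reviewer asks how well the model copes with tempo fluctuations it rarely sees outside synthetic augmentation.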

Experimental results confirm that the pre-trained model after fine-tuning can produce very competitive results, which seems to confirm that for the music datasets considered the assumptions are correct. Nevertheless, it would be interesting to test the model with music for which the over-simplistic assumptions do not hold and perform a detailed analysis of the results.

There is no supplementary material, so I could not access the code, but the paper indicates that it will be available. It would be important that the pre-trained models and the code for the experiments are available to enhance scientific reproducibility.

Minor corrections

It seems that beat and tatum are never defined. A short clarification would be nice.

Line 35 - "at most a few thousands ..." refers to data, so something is missing here.

Note that there is an error in Figure 2. The time corresponding to yp should be in Ya, which means it should be ya+ix\alpha, but in the diagram of Figure 2 it is located in between ya+2\alpha and ya+3\alpha.

Line 278 - typo: "thee" -> "the"

Line 294 - remove "performances"

Line 461 - "our" should be removed


Author description of changes:

We would like to thank the reviewers and the meta-reviewer for their valuable feedback and suggestions.

We have corrected all the typos and restructured the paragraphs based on the reviewers' and meta-reviewer's remarks.

We added details about PLP computation.

Regarding the comments from Reviewers #1 and #2, we explained our decision to replace the linear probe and also included a reference to the section where we describe how we fed the probing network.

Reviewer #1:
- Added a reference as requested.

Reviewer #2:
- Added details about the safety window.
- Added details about discarding audio segments.
- Added details about the time-varying factor for data augmentation.
- Clarified section 3.1.2 to make the tatum unit clearer.

Reviewer #3:
- Corrected the figure.