Abstract:

The task of music structure analysis has mostly been addressed as a sequential problem, by relying on the internal homogeneity of musical sections or their repetitions. In this work, we instead regard it as a pairwise link prediction task. If, for any pair of time instants in a track, one can successfully predict whether they belong to the same structural entity or not, then the underlying structure can easily be recovered. Building upon this assumption, we propose a method that first learns to classify pairwise links between time frames as belonging to the same section (or segment) or not. The resulting link features, along with node-specific information, are combined through a graph attention network. The latter is regularized with a graph partitioning training objective and outputs boundary locations between musical segments and section labels. The overall system is lightweight and performs competitively with previous methods. The evaluation is done on two standard datasets for music structure analysis, and an ablation study is conducted to gain insight into the roles played by its different components.
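To make the pairwise-link assumption concrete, here is a minimal sketch (an illustration only, not the authors' implementation): given a binary matrix telling whether two frames belong to the same structural entity, boundaries fall exactly where consecutive frames are not linked.

    import numpy as np

    # Toy "same structural entity" link matrix for 6 frames:
    # frames 0-2 form one section, frames 3-5 another (hypothetical data).
    link = np.zeros((6, 6), dtype=int)
    link[:3, :3] = 1
    link[3:, 3:] = 1

    # A boundary falls between consecutive frames that are not linked.
    boundaries = [i + 1 for i in range(len(link) - 1) if link[i, i + 1] == 0]
    print(boundaries)  # -> [3]

    # Section labels then follow from the connected components of the link graph.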

Reviews
Meta Review

The reviewers' views of this paper ranged widely. All offered constructive criticism on different aspects of the paper. The concerns included:

  • lack of context about the kind of music assumed to be the target of the method;
  • lack of certainty that the evaluation is fair;
  • lack of motivation for using graph neural networks, instead of a simpler Transformer approach;
  • lack of depth in the ablation study.

In the discussion between reviewers, we all agreed that these were valid concerns, but we disagreed about their severity. We hope that the authors can take all of the suggestions on board as they revise this paper.


Review 1

The paper proposes a music structure analysis technique based on pairwise link prediction using graph neural networks. A general criticism I have of this work concerns a problem noticeable in various studies in this field: proposals for music structure analysis are never preceded by a preliminary musical discussion. Specifically:

1- Repertoire as a corpus: what is the repertoire to be used as the corpus? This only appears in the discussion of the experiment.
2- Notion of musical structure: given a specific corpus, what is the notion of musical structure, and what should be examined in the corpus to highlight this structure?
3- Detection difficulties and tools: given a specific corpus, what are the difficulties in detecting structure, and what tools have already been used to address this challenge?

For example, the audio representation used in this case is a mel-spectrogram representation with slices centred around each detected beat position. This means that there is an extremely limited universe of corpora that this algorithm can analyse, i.e. music where the beat is a crucial element.
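For reference, a beat-synchronous mel representation along these lines can be obtained with librosa (a plausible sketch of such a front end, not the authors' exact code; the file name is hypothetical):

    import numpy as np
    import librosa

    y, sr = librosa.load("track.wav")  # hypothetical input file

    # Mel spectrogram and beat positions from a beat tracker.
    S = librosa.feature.melspectrogram(y=y, sr=sr)
    tempo, beats = librosa.beat.beat_track(y=y, sr=sr)

    # Aggregate spectrogram frames between consecutive beats, so that each
    # column of S_beat corresponds to one beat-level slice.
    S_beat = librosa.util.sync(S, beats, aggregate=np.median)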

Therefore, I strongly suggest a better contextualisation of the tasks and the corpus for which this tool is intended, trying not to ignore the problems of musical nature involved in tasks of musical structure analysis.


Review 2

The paper contains a novel approach, is mostly well written, and has high-quality (and very aesthetically pleasing) figures. However, as I wrote above, there are two points which I find very problematic. I believe that correcting them could make for a very strong paper. Below is a list of other minor problems and suggestions.

11: “as whether connecting elements from the same segment or section”. Something sounds wrong in this sentence. I would suggest changing it to “as belonging to the same section (or segment) or not”.

15: “boundary locations between musical segments and section labels”. This is confusing to me. Does the music consist of an alternation of musical segments and section labels? The terms “segment”, “section” and “structural entity” are used in the abstract without specifying whether they are synonyms and, if not, how they differ. This is a bit confusing for a reader approaching the field. I suggest using only one term if possible, or explaining the differences between the terms.

29: typo. “These” refers to a plural, but “corpus” is singular. The authors can use either “corpuses” or “corpora”.

36: what is an “event”? A sound from an instrument? And is the term “musical observation” used in the next sentence a synonym of “event”? The part up to line 40 is not very clear to me.

38: what is the “multi-level dependency”? By “dependency” I imagine a link between two things; how can it be multi-level? The paper [1] speaks about “multiple timescales”, but not about dependencies. This term is used further in the section, but I cannot understand what it means.

Section 1.1 or 1.2: the gap sentence is missing. What is the problem in the current approaches that is solved in this paper? This is the core point of scientific papers, and without it the paper loses a lot of its interest. The sole goal of “doing something different” is not so appealing to a reader. I invite the authors to find something that their model can do better than others and to write it very clearly.

42: how is the paper [7] relevant to the point of the paragraph? It is a graph neural network paper that does not deal with music. Moreover, is [16] also part of this group of papers that use self-attention? If so, it should be added.

85: a new term, “audio observation”, is introduced without a very intuitive meaning. Do the authors mean “audio frame”? In that case, I would use this more common term.

I’m extremely confused about the use of the GNN. A GNN is useful when there is a predefined graph structure to leverage. When connecting everything to everything (as is done here), there is no such structure to exploit, and the model essentially reduces to self-attention; the motivation for preferring a graph neural network over a simpler Transformer should be made explicit.

180-185: How can an MLP limit the oversmoothing problem? In my understanding, if the representations are too similar, there is nothing an MLP (which receives such representations as input) can do. A citation could help here if this is true.

The claim that the system is lightweight and has few parameters should be supported by numbers, but the numbers of parameters and the training times of the other papers' methods are missing.


Review 3

— SUMMARY — This paper presents a novel structure analysis method that outputs each section's boundaries and labels. The main idea is to exploit the pairwise links between audio segments, i.e., predict whether a pair of audio segments belongs to the same structure. The authors suggest aggregating all these connections into a graph, which is then used to address the tasks in a supervised learning setting.

— SCIENTIFIC CONTRIBUTION — The main contribution lies in applying graph neural networks for modelling music structures, a novel approach that opens up new avenues for music analysis and machine learning.

— METHODOLOGY —

The authors propose a three-step method:

1- Feature extraction: this step learns features suited to capturing the pairwise links and builds a self-similarity matrix (SSM) that serves as an adjacency matrix. This process consists of:

a) Audio frame selection: this reduces the length of the audio sequence by selecting frames around the beats. Beats are obtained via a pretrained beat detection method. It is not explicitly said, but I guess this is due to memory limitations; otherwise, why is there interest in doing that? And what is the reason that backs up the hypothesis that beat frames are more informative than off-beat frames regarding structure detection?
b) Frame encoder: this transforms the audio signal into a feature vector that captures structure information. This block is based on previous works and is trained in a self-supervised manner, independently of the following blocks. It outputs a feature vector X.
c) Feature refinement: this step makes each frame exchange information with all the other frames in the track. The final features are called X'. This step is quite exciting and introduces the first graph network block. It is based on the hypothesis that the features obtained from the frame encoder are independent and thus will not capture proper pairwise links nor share any information between frames. While this sounds intuitive, no experiments back up this hypothesis, since no ablation study has been conducted to test the importance of this feature refinement block.

On top of these feature vectors, the authors obtain an adjacency matrix that summarises all the pairwise connections between audio frames. They call this matrix A'.
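For concreteness, A' can be pictured as a self-similarity matrix over the refined features, e.g. with cosine similarity (a sketch under that assumption; the paper's exact construction may differ):

    import torch
    import torch.nn.functional as F

    def self_similarity(X):
        # Cosine self-similarity of frame features X of shape (T, d):
        # entry (i, j) measures how similar frames i and j are.
        Xn = F.normalize(X, dim=1)
        return Xn @ Xn.T  # shape (T, T)

    X_refined = torch.randn(200, 64)  # stand-in for the refined features X'
    A = self_similarity(X_refined)    # candidate adjacency matrix A'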

2- A method to find and classify each link between features. The authors define three types of pairwise links: same segment, same section, and different section. I do not see the interest in distinguishing pairs of frames that belong to the same segment from pairs that belong to the same section, since the former is a subcategory of the latter. Why is this distinction needed?
a) The authors propose a CNN block that processes the SSM to find regular homogeneous areas repeated over time. The idea is that these areas correspond to the final structure of the song. This step aims to categorise (and impose) pairwise links between the frames w.r.t. the overall song structure. The output, called E', has the same dimensions as A'.
b) They add positional embeddings to distinguish between the same-segment and same-section categories. I need clarification about why this is required; its utility is never shown in the ablation studies.

Here, we have the first loss, which classifies each component of E' into one of the three categories defined above. It is not detailed how the cross-entropy loss is applied in this context. I assume the loss is computed per row (?), where the row index serves as a reference point and the loss evaluates the remaining components within the row.
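If that reading is correct, the loss may simply be a per-entry cross-entropy over all T x T link logits, along these lines (a guess at the formulation, not the paper's code):

    import torch
    import torch.nn.functional as F

    T = 200
    E = torch.randn(T, T, 3)              # link logits: 3 classes per frame pair
    labels = torch.randint(0, 3, (T, T))  # 0: same segment, 1: same section,
                                          # 2: different section (toy targets)

    # Flatten all pairs and apply a standard cross-entropy per link.
    loss = F.cross_entropy(E.reshape(-1, 3), labels.reshape(-1))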

3- The final block combines the acoustic information X' with the matrix E', a refined version of the SSM highlighting mutual information. This block consists of a graph attention mechanism. The output is a final feature vector X'' used for the final boundary and label classifications. A final regularisation term encourages orthogonality between label classes. The authors fix the number of possible classes to K for the whole dataset.
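The orthogonality regularisation is presumably similar to the one used in graph partitioning relaxations (e.g. mincut-style pooling losses), which the abstract's "graph partitioning training objective" suggests; a sketch under that assumption:

    import torch

    def orthogonality_penalty(Y):
        # Y: soft section assignments of shape (T, K). Penalise deviation of
        # the (normalised) class correlation matrix from (scaled) identity,
        # pushing the K label classes towards orthogonality.
        K = Y.shape[1]
        YtY = Y.T @ Y
        I = torch.eye(K, device=Y.device)
        return torch.norm(YtY / torch.norm(YtY) - I / K ** 0.5)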

One missing explanation is that one song equals one batch, ensuring that all the points selected after the beat-tracking step belong to the same song and are not refined with features from other songs. How does this affect training? That is, since all the points in a batch share the same musical traits, isn't that problematic when propagating the gradient?

— RESULTS AND DISCUSSION —

The authors validate their model on two datasets, RWC-Pop and Harmonix. However, they do not use the Jazz Structure Dataset (JSD) or SALAMI, which are commonly used to benchmark the music structure analysis task. It is good to see a k-fold cross-validation study and a cross-dataset evaluation; this helps assess the model's actual performance while providing proper metric ranges.

I appreciate the ablation study, but it is missing some important experiments. I would like to see further studies of each step independently:

1- On the feature extraction step, to assess the quality of the obtained SSMs. This could have been done by comparing them directly with the ground-truth annotations. Moreover, the benefit of the feature refinement block has yet to be tested: how much does it contribute to the overall performance? Is it better than computing the adjacency matrix on top of the features produced by the frame encoder? The fact that the results drop so much when removing the Link Feature block may indicate that the obtained SSM A' is not informative enough for detecting structures.

2- Link Feature block: since this block can process any SSM, how powerful is it? I would have been curious to see the performance when training this block on top of another SSM. For instance, a good baseline could have been to compare the performance obtained with SSMs computed from well-known musical feature descriptors such as chroma, MFCCs, or tempograms (see the sketch after this list) vs. their proposed feature extraction method.

3- Combining link features with acoustic features: since this block consists of an attention mechanism that combines X' and E', how does it behave in the ablation experiment where the Link Feature block (which produces E') is removed? Similarly, how informative is the E' matrix w.r.t. the final tasks? Why do we need a combination of both?
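Such a hand-crafted baseline (point 2 above) would be cheap to set up, e.g. with librosa (a sketch; the file name is hypothetical):

    import numpy as np
    import librosa

    y, sr = librosa.load("track.wav")  # hypothetical input file

    # Beat-synchronous chroma as a classic hand-crafted descriptor.
    chroma = librosa.feature.chroma_cqt(y=y, sr=sr)
    _, beats = librosa.beat.beat_track(y=y, sr=sr)
    chroma_beat = librosa.util.sync(chroma, beats, aggregate=np.median)

    # Recurrence (self-similarity) matrix to feed into the Link Feature block.
    ssm = librosa.segment.recurrence_matrix(chroma_beat, mode="affinity", sym=True)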

A good point is the small size of the model (only 330K parameters). To better understand its significance, it would have been more informative to include the numbers of parameters of the compared models.

The significant insight here is that their model can effectively generate a feature vector X'' that captures structural section information. This is an interesting observation because their feature vector is correlated with meaningful musical sections. I'm curious about how widely applicable this insight is. The authors have set the number of sections to K = 7 (they don't mention their taxonomy), which may not be representative of all possible music sections but is likely sufficient to cover most popular Western music.

— FINAL COMMENT — The utilisation of graph neural networks is quite fascinating. The pairwise relationship hypothesis is intriguing, and the outcomes emphasise the grouping capabilities of the final feature vector. However, the authors introduce a three-step model that is primarily tested as a whole, making it difficult to evaluate the contribution of each component, as well as their reusability outside the full pipeline. Furthermore, many of the design choices need to be justified or properly evaluated.


Author description of changes:

We thank the reviewers and meta-reviewer for their valuable insight and feedback on our work. We address here the main concerns that emerged through the reviewing process:

  • Lack of context about the kind of music assumed to be the target of the method: We added a mention in our contribution that this work addresses structure analysis mostly for Western popular music. This decision was based upon the availability of annotated data and the label taxonomy employed in previous work, so as to ease comparison with our method. The extension of the approach to different types of musical structures is a research direction we aim to undertake, and we mention it in the conclusion of the paper.

  • Lack of certainty that the evaluation is fair: Our evaluation process has tried to closely follow that of previous work, where baseline systems are evaluated both in cross-validation and cross-dataset settings. However, comparisons should still be interpreted with caution, as some baselines used additional training data or augmentation strategies. Concerning the reported baseline results, we added a row in Table 1 to include the different configurations of one of them (SpecTNT).

  • Lack of motivation for using graph neural networks, instead of a simpler Transformer approach: The graph neural network framework is actually a generalization of the Transformer approach, which corresponds to attention over a fully-connected graph. We have, however, added a few words as to why graph neural networks are employed in our method, mainly to include link features in the attention coefficient calculation and in the frame feature update (see the sketch at the end of this response).

  • Lack of depth in the ablation study: Our initial ablation study focused on the most distinctive aspects of our method. However, we ran additional ablation experiments that also discard the remaining steps of our method, and we updated Figure 4 accordingly.
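For illustration, including link features in the attention coefficients can be sketched as follows (a minimal single-head sketch, not the exact architecture of the paper):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class EdgeFeatureAttention(nn.Module):
        # Attention whose logits depend on both node pairs and link features,
        # so that the link matrix E' can modulate how frame features are mixed.
        def __init__(self, d_node, d_edge):
            super().__init__()
            self.W = nn.Linear(d_node, d_node, bias=False)
            self.a = nn.Linear(2 * d_node + d_edge, 1, bias=False)

        def forward(self, X, E):
            # X: (T, d_node) frame features; E: (T, T, d_edge) link features.
            T = X.shape[0]
            H = self.W(X)
            Hi = H.unsqueeze(1).expand(T, T, -1)  # source frame i
            Hj = H.unsqueeze(0).expand(T, T, -1)  # target frame j
            logits = self.a(torch.cat([Hi, Hj, E], dim=-1)).squeeze(-1)
            alpha = torch.softmax(F.leaky_relu(logits), dim=1)
            return alpha @ H  # updated frame features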