Improved symbolic drum style classification with grammar-based hierarchical representations
Léo Géré (Cnam)*, Nicolas Audebert (IGN), Philippe Rigaux (Cnam)
Keywords: MIR tasks -> automatic classification; Knowledge-driven approaches to MIR -> representations of music; MIR fundamentals and methodology -> symbolic music processing; Musical features and properties -> representations of music
Deep learning models have become a critical tool for the analysis and classification of musical data. These models operate either on the audio signal, e.g. a waveform or spectrogram, or on a symbolic representation, such as MIDI. In the latter case, musical information is often reduced to basic features, i.e. durations, pitches and velocities. Most existing works then rely on generic tokenization strategies from classical natural language processing, or on matrix representations, e.g. piano rolls. In this work, we evaluate how enriched representations of symbolic data affect deep models, namely Transformers and RNNs, for music style classification. In particular, we examine representations that explicitly incorporate musical information only implicitly present in MIDI-like encodings, such as rhythmic organization, and show that they outperform generic tokenization strategies. We introduce a new tree-based representation of MIDI data built upon a context-free musical grammar. We show that this grammar representation accurately encodes high-level rhythmic information and outperforms existing encodings on the GrooveMIDI Dataset for drumming style classification, while being more compact and parameter-efficient.
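As a purely illustrative aside (this is not the authors' implementation), the following Python sketch shows what a grammar-based rhythm tree and its linearization into a token sequence might look like: a bar is recursively subdivided, leaves carry drum events or rests, and a depth-first walk with brackets yields the sequence a Transformer or RNN could consume. All names and symbols here are hypothetical.

```python
# Hypothetical sketch of a grammar-based rhythm tree (not the paper's code).
# A bar is recursively split into equal parts, mirroring context-free rules
# such as B -> B B | B B B; leaves carry drum events or rests. A depth-first
# walk with brackets linearizes the tree into a token sequence.

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class RhythmNode:
    label: Optional[str] = None                      # leaf symbol, e.g. "kick", "rest"
    children: List["RhythmNode"] = field(default_factory=list)

    def linearize(self) -> List[str]:
        """Serialize the tree depth-first; brackets mark subdivisions."""
        if not self.children:                        # leaf: emit its event symbol
            return [self.label or "rest"]
        tokens = ["("]
        for child in self.children:
            tokens.extend(child.linearize())
        tokens.append(")")
        return tokens

# One 4/4 bar: quarter note, two eighth notes, quarter note, quarter rest.
bar = RhythmNode(children=[
    RhythmNode(label="kick"),
    RhythmNode(children=[RhythmNode(label="snare"), RhythmNode(label="hihat")]),
    RhythmNode(label="kick"),
    RhythmNode(label="rest"),
])

print(bar.linearize())
# ['(', 'kick', '(', 'snare', 'hihat', ')', 'kick', 'rest', ')']
```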
Reviews
The discussion of this paper was by and large positive. There is some question about the generality of the representation, and whether the rules extracted by qparse are general enough to cover a wide range of rhythmic possibilities. The representation's close dependence on the qparse implementation is also a concern. The authors should spend some time improving their paper along the several directions identified by the reviewers.
Summary: The authors develop a new tree-based MIDI representation of drumming data, the linear rhythmic tree (LRT), test it against three other representations, and present their results. The LRT performs best.
Strengths: This is a very good paper with reproducible results. The major novel idea is the LRT. I like that the new model is tested against other models as well, both with LSTMs and transformers.
Weaknesses: Nothing is incredibly ground-breaking or earth-shattering. The results are incremental.
This paper uses a grammatical representation of MIDI files for style classification. Experimental results show that comparable or significantly better performance is achieved with fewer model parameters. The description and results are convincing, but the paper would benefit from more detailed explanations.
- Line #165: Please define “optimization-based music transcription systems”. Does this mean “a music transcription system that invokes an optimization algorithm”? If so, then deep models (line 168) are one such type of system, yet in this sentence (“While designed …”) they are treated as distinct.
- Line #299: Please give the durations of these sets if they are not roughly proportional to the original durations.
- Line #381: Please give the specifications of the CPU and GPU; elapsed times are not meaningful if the hardware is unknown.
This paper presents a new representation for quantized MIDI files based on principles from a library called qparse. The approach is very interesting and the paper well written. However, the paper also raises a number of questions that were left partly unanswered:
- Partial evaluation: The method is evaluated on a single specific task without considering alternative approaches, and the authors compare only their own results. Evaluating the method on an established task against existing approaches would significantly strengthen the case for introducing a novel data representation.
- Hard-to-interpret experiment tables: Table 1 contains a configuration study, but the choice of bars and parameter sizes is not very intuitive; the choice of bars across tokenizations and models, for example, seems somewhat arbitrary. Figures 4 and 5 share the same ambiguity and raise the question of why the proposed representation using a Transformer is compared to an LSTM; I would argue that some justification is missing.
- Relevance of the task: The task chosen for evaluation is too simplistic to effectively demonstrate a completely new representation, raising doubts about its general applicability.
- Generality of the representation: The new representation is simple and clear, but it does not cover all (or even a broad enough range of) musical durations, such as composite durations, dotted notes, tuplets other than triplets, tied notes, etc.
Thank you all for your reviews. Please find below the changes we have made to take them into account.
As proposed, we modified the title to reflect the fact that we evaluate our approach on drum music.
We added a reference providing further insight into formal grammars and context-free grammars in the related work section.
We clarified Figure 2 and the associated text to make it easier to understand.
We also explained that the model hyperparameters, including the depth and width of the architectures as well as the number of bars considered for each encoding, were obtained through a hyperparameter search on the validation set.
Furthermore, we added a note explaining that, while we are indeed constrained by some of qparse's limitations, our aim is to evaluate the rhythmic tree representation regardless of how it was built. Researching more robust ways of building such trees would be suitable for follow-up work.
In the supplementary material, we added details on the grammar we used and how the tree is built; a toy illustration of what a subdivision grammar of this kind can look like is sketched below.
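For illustration only, here is a minimal Python sketch of a qparse-style subdivision grammar, in which a duration either remains a leaf or is split into two or three equal parts. The rules, symbols, and example bar are hypothetical and are not the grammar described in the supplementary material.

```python
# Illustrative toy subdivision grammar, NOT the grammar used in the paper:
# a span either stays a leaf (a drum event or rest) or is split into 2 or 3
# equal parts, in the spirit of context-free rhythm grammars of the qparse
# family, which assign durations to tree leaves.

from fractions import Fraction

ARITIES = (2, 3)  # allowed subdivisions: binary and ternary (triplets)

def leaf_durations(tree, total=Fraction(1)):
    """Return the duration of each leaf implied by a nested-tuple rhythm tree.

    `tree` is either a leaf symbol (str) or a tuple of 2 or 3 subtrees;
    each tuple splits its span into equal parts, per the toy grammar above.
    """
    if isinstance(tree, str):
        return [(tree, total)]
    assert len(tree) in ARITIES, "grammar only allows binary/ternary splits"
    part = total / len(tree)
    durations = []
    for subtree in tree:
        durations.extend(leaf_durations(subtree, part))
    return durations

# One bar split into two beats; the second beat is a triplet.
bar = (("kick", ("snare", "hihat")), ("hihat", "hihat", "snare"))
print(leaf_durations(bar))
# [('kick', Fraction(1, 4)), ('snare', Fraction(1, 8)), ('hihat', Fraction(1, 8)),
#  ('hihat', Fraction(1, 6)), ('hihat', Fraction(1, 6)), ('snare', Fraction(1, 6))]
```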
As proposed, we also added to the supplementary material the confusion matrix of a model trained with the best architecture, along with a brief interpretation.