Abstract:

This paper surveys 560 publications about music genre recognition (MGR) published between 2013 and 2022, complementing the comprehensive survey of [474], which covered the time frame 1995–2012 (467 publications). For each publication we determine its main functions: a review of research, a contribution to evaluation methodology, or an experimental work. For each experimental work we note the data, experimental approach, and figure of merit it applies. We also note the extent to which any publication engages with work critical of MGR as a research problem, as well as with genre theory. Our bibliographic analysis shows that MGR research: 1) typically does not meaningfully engage with any critique of itself; and 2) typically does not meaningfully engage with work in genre theory.
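(Aside for readers exploring the released data: a minimal sketch of how such per-publication tallies could be reproduced from the provided "new_survey.csv" mentioned in the reviews below. The column names and label values are assumptions for illustration, not the paper's actual schema.)

```python
import pandas as pd

# Hypothetical layout: one row per surveyed publication, with columns
# recording its function and, for experimental works, the experimental
# design and figure of merit used. Column names are assumptions.
df = pd.read_csv("new_survey.csv")

print(len(df), "publications surveyed")
print(df["function"].value_counts())  # review / evaluation / experimental

experimental = df[df["function"] == "experimental"]
print(experimental["design"].value_counts())           # e.g. Classify, Feature, Generalize
print(experimental["figure_of_merit"].value_counts())  # e.g. accuracy, confusion matrix
```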

Reviews
Meta Review

Overall, the reviewers felt that this paper has a number of important strengths, including:

  • An impressively extensive and detailed survey of publications on automatic music genre recognition research from 2013 to 2022.

  • Detailed data and source code are provided.

  • Well-informed and thought-out critical commentary that emphasizes the need to carefully (re)consider the aims, methodology and impact of music genre recognition at a fundamental level.

However, the reviewers also highlighted the following problems with the paper:

  • The first sentence of the abstract is confusing and uncomfortably close to reference [129] by Sturm, which is not cited in the abstract (but is cited in the rest of the paper). Reference [129] must be explicitly indicated in this first sentence.

  • Perhaps too much reliance on the (cited) methodology used and critical perspectives expressed in Sturm’s earlier survey of genre recognition work before 2013, as well as on other work previously published by Sturm and his collaborators (also cited in this paper). The paper could have expanded here a little more (although there certainly are novel contributions too).

  • Some objective annotation errors were found by one reviewer when a (small) sampling was performed of the raw data in the provided “new_survey.csv” file.

  • One reviewer suggested more explicit discussion of specifically musicological issues connected to genre recognition (as opposed to just the social sciences and humanities in general). It was also suggested that a number of relevant musicological publications on genre recognition may have been missed, because the survey methodology used was better suited to finding technical sources than musicological ones, which still tend to be less likely to be published (and indexed) electronically.

  • One reviewer suggested that the paper should include discussion of related work in music auto-tagging and music representation learning, as they play key connected roles in the evolution of the field of music genre recognition during the time period in question. Related to this, it was pointed out that this paper does not adequately discuss how the kinds of datasets being used are shifting to larger ones more suitable for deep learning. This, combined with the concerns expressed by another reviewer in connection with musicological sources, suggests that this survey, while certainly extensive, may have systematically excluded certain important kinds of publications and developments relating to automatic genre recognition.

  • With respect to clarity, some of the terminology used might be confusing to readers who have not read Sturm’s earlier work, as this paper incorporates terms from it that are not necessarily widespread or well understood in the MIR community at large.

  • One reviewer suggested that it would be helpful to highlight particular publications mentioned in the survey that have been especially influential or that are of particular merit, and to provide a sense of how different publications may be connected to one another.

Please see the individual reviews for more details.

Thanks to the authors for submitting this work. If it is not ultimately accepted at ISMIR this year, we encourage them to revise it based on the suggestions in the individual reviews and resubmit next year, as we believe this work can be an important contribution to the ISMIR community. The authors may also wish to submit this to TISMIR, as it seems that this paper may be overly constrained by the length limitations of ISMIR.


Review 1

I'll say flat out at the start of this review: color me impressed with this paper. This work is deep, detailed, and acutely observed, and it sheds light on the need for some serious introspection within the MGR field. I am reminded of the ISMIR 2023 paper “The Games We Play”, which I was asked to review last year and enthusiastically recommended for publication – both papers issue clarion calls to stop operating on autopilot and rethink the aims and impact of scholarship in the fields studied.

I have no problem calling the literature review exhaustive in the domain of MGR, certainly as exhaustive as I could have imagined. In particular, I like the thoughtful collection of critical studies of genre that are aggregated and presented as meaningful commentaries with which the analyzed authors are not grappling. The works of Simon Frith, Philip Tagg, Leonard Meyer, Keith Negus, and other socio-musicological theorists of genre feel like they talk about music, and musical style, in a completely different world than the vast majority of toolbuilders for MGR (or for that matter, ISMIR) research. The paper makes, beautifully, a crucial observation about the purpose and function of the 500+ papers reviewed: they are “seldom motivate[d...to learn] something about music rather than classifier performance.” In a nutshell, this is the problem with so much MIR research; it is often dramatically more concerned with Information Retrieval than with Music.

The other meta-analytical observations of the paper are similarly trenchant. The paper reiterates and extends Sturm's concerns about the use of the GTZAN dataset, and is justified in noting the lack of caveats appended to online mirrors of the set. Section 3.8, on discussions of genre theory, suggests that the whole field may need to go back to the drawing board conceptually. What does MGR research even mean by genre, if it is playing so loosely with such a notoriously slippery eel?

The abstract could use a little more detail on the theoretical contributions and could do with the elimination of some boilerplate. The sentence “We present several recommendations” is vague and not very useful to the prospective reader, and there's no need for the abstract to point to supplementary material.

There are some organizational problems with this paper that feel as if they are generated from the desire to publish at ISMIR despite having a paper and topic that are simply too expansive to address within the 6-plus-n page format. Obviously, it's unusual for 'n' to be 31 pages (this doesn't bother me, but it's probably an ISMIR record), but there's also an additional supplement file describing some of the finer points of method choices, and much of what's in there, in my view, ought to be in the full paper proper. I certainly don't want to drive good work to other venues at this conference's expense, but the parsimony that is required at ISMIR seems like a real Procrustean bed for a paper of this breadth and depth. ISMIR is, decidedly, a place where it needs to be heard.


Review 2

This work offers a systematic literature review for the task of Music Genre Recognition (MGR), leveraging and extending the survey conducted by Sturm in 2013, and considering publications up to 2022. The paper provides statistics on published MGR works along these 8 axes: publication type, MGR datasets used, experimental design, figure of merit, justification of the MGR task, publication venue, engagement with criticism of MGR, and engagement with musical genre theory.

Notably, the main strength of this work is its comprehensiveness compared to the work by Sturm in 2013, both in terms of the number of publications covered (1026 vs. 467) and the dimensions analyzed (8 vs. 3).

In general, I find that this survey follows a continuist approach relative to the previous work by Sturm. My main point of criticism is the lack of discussion of certain aspects that I consider crucial to understanding the evolution and current open directions of MGR. Namely:

Influence of Music Auto-Tagging (MAT): Since the appearance of datasets such as the Million Song Dataset and MagnaTagATune, part of the community has shifted its attention from MGR to MAT. While MAT differs from MGR in several aspects, including being generally modeled as a multi-label task and working with label sets typically derived from online folksonomies rather than created by musicologists, the connection to MGR research is obvious. While I understand that this connection was not discussed by Sturm, since the first relevant auto-tagging papers are contemporaneous with his survey, I find it surprising that it is not addressed in this work.
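To make the modeling distinction concrete, here is a minimal sketch contrasting the two framings on toy data; the features, genre labels, and folksonomy tags below are invented for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))  # toy features for 4 tracks

# MGR as usually framed: exactly one genre per track (multi-class).
y_genre = ["rock", "jazz", "rock", "pop"]
mgr = LogisticRegression(max_iter=1000).fit(X, y_genre)

# MAT: any number of folksonomy tags per track (multi-label).
track_tags = [["rock", "guitar"], ["jazz", "mellow"], ["rock"], ["pop", "happy"]]
Y = MultiLabelBinarizer().fit_transform(track_tags)  # binary indicator matrix
mat = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, Y)

print(mgr.predict(X[:1]))  # one genre per track
print(mat.predict(X[:1]))  # a binary vector of tags per track
```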

Music representation learning: Developing broad-purpose representation models that can be fine-tuned on multiple music understanding tasks, including MGR, has become a popular approach (with several examples already published before 2022). It is a bit disappointing that this paradigm shift is not reflected or acknowledged in this study, and that these methods are simply treated as traditional Classify experiments.

Datasets: The dataset section summarizes existing publications in terms of the most used datasets and features, but it doesn’t discuss why newer datasets are developed and used, or help the reader understand this. For example, in my opinion, the community is shifting towards larger MGR datasets consisting of music distributed under permissive licenses (such as FMA or the genre subset of the MTG-Jamendo dataset), due to a tendency to rely on deep learning approaches requiring more data and to favor reproducibility. However, these aspects are not discussed in detail.

Additionally, these are some minor comments to the authors:

  • Lines 80-82: In my opinion, such details about file formats can be omitted if not relevant later on.

  • Lines 491-493: I would say most researchers interested in genre would nowadays build some form of representation model and probe their systems on several downstream datasets, possibly (but not necessarily) including GTZAN. The motivation is that showing consistent improvements across multiple music classification datasets (including MGR) suggests more robust and generalizable performance than using a single small dataset such as GTZAN. In my opinion, this sentence neglects the effort present in many of the referred publications [372, 373, 421, 521] (references from the paper) to develop more robust evaluation setups.

  • Section 3.7: In my opinion, many MIR researchers interested in MGR are already familiar with the pitfalls of GTZAN. Although it is always worth mentioning, this section could be less focused on this particular problem and instead develop a bit more some lesser-known criticisms of common practices in MGR.

  • Section 3.8: Consider that an overall intuition of how genre theory is typically addressed from the perspective of the social sciences and humanities would be valuable to many MIR researchers.

  • Consider including [1], a work that performs MGR using GTZAN and performs a cross-collection validation to assess the generalization capabilities of the learned model (classify, feature, and generalize); a toy sketch of the cross-collection idea follows this list.
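As promised above, a minimal, self-contained sketch of cross-collection validation. The synthetic load_features stub is an assumption standing in for real feature extraction over GTZAN and a second collection; it is not the method of [1]:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

def load_features(name):
    """Hypothetical stand-in for feature extraction over a collection;
    the shift term simulates the distribution gap between collections."""
    shift = 0.0 if name == "train_set" else 0.5
    X = rng.normal(shift, 1.0, size=(300, 20))
    y = rng.integers(0, 10, size=300)  # 10 genre labels
    return X, y

X_a, y_a = load_features("train_set")   # e.g. GTZAN
X_b, y_b = load_features("other_set")   # held-out collection

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_a, y_a)
# Within-collection accuracy can look strong while cross-collection
# accuracy collapses; reporting both exposes (lack of) generalization.
print("within-collection:", clf.score(X_a, y_a))
print("cross-collection: ", clf.score(X_b, y_b))
```

The design point is simply that both numbers should be reported: a model that only looks good within its training collection may have learned the collection, not the genre.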

Finally, while I appreciate the comprehensive and systematic approach of this research, my recommendation is that the authors take the aforementioned aspects into account to bring an updated perspective on MGR.

[1] Alonso-Jiménez, P., Bogdanov, D., Pons, J., & Serra, X. (2020, May). Tensorflow audio models in essentia. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).


Review 3

This paper presents a comprehensive survey of Music Genre Recognition (MGR) research published between 2013 and 2022, extending the previous survey done by Sturm [219]. It describes the process used to create the paper corpus, as well as the different dimensions considered when classifying the papers. Such dimensions are analysed individually and discussed, evidencing the main issues (still) present in MGR research.

The paper is dense but well written, easy to understand, and well structured. The methodology is detailed in a way that makes every step easy to understand, and the analysis is synthetic but clear, providing examples for every dimension analysed. I also appreciate the new aspects not considered in Sturm’s previous survey, i.e., analysing the engagement of MGR research with critical work and genre theory. Overall, I believe the paper could be a great addition to the conference, and therefore I am inclined to recommend accepting it, even if with one reservation.

In fact, several concepts inherited from Sturm’s previous survey and used in this work are detailed only in the supplementary material. Having read [219] in the past, it was not difficult for me to understand the meaning of the proposed terminology, especially with regard to “Experimental Design” and “Figure of Merit”; however, a reader engaging for the first time with these concepts might feel disoriented. Therefore, I am not sure the principle of presenting a “self-contained” paper is respected in this submission. However, I defer to the Scientific Program Chairs the decision on whether this aspect should preclude acceptance.

In terms of the methodology, one aspect of the approach that I do not particularly like, in this work as in [129], even if I accept it, is the egalitarian vision of the corpus of MGR research, i.e., in the presented analysis every paper counts equally. Even if I understand the value of a comprehensive survey for depicting the current status of research in the field, I also see a major limit in this approach: I am quite sure that most of the research done in the area is neither properly done nor robust from a scientific perspective, but after reading this paper I am still not sure what work is worth reading and what is not. It would be nice to have a list of which among the 599 papers may be considered good reading.

Admittedly, this was not the goal of this work, and therefore this is not a major issue, but I would like to suggest reflecting on the value of conducting a quality assessment to select publications that respect certain standards, instead of taking whatever is out there. This is common practice in medical research (see [1]), where the volume of published work is huge, and perhaps a survey of MGR research could also benefit from it.

One missed opportunity, which could be approached by using bibliometric analysis as also noted in lines 525-529, is the lack of a more fine-grained temporal analysis of publications in MGR research. In fact, if we consider the construction of knowledge as a process that evolves through time, it would be interesting to see how publications are temporally connected, and when things started to go in a specific direction: e.g., perhaps a paper using the “Classify” design with a confusion matrix cites a previous paper using “Classify” and a confusion matrix, which in turn cites a previous paper using “Classify” and a confusion matrix, and so on.
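To illustrate the kind of chain-tracing suggested here, a toy sketch over a hypothetical citation graph in which each paper carries its design and figure-of-merit annotations. Both the graph and the labels are invented for the example; neither is part of the released data:

```python
# Hypothetical inputs: per-paper methodology annotations plus outgoing citations.
papers = {
    "A2014": {"design": "Classify", "fom": "confusion", "cites": []},
    "B2016": {"design": "Classify", "fom": "confusion", "cites": ["A2014"]},
    "C2019": {"design": "Classify", "fom": "confusion", "cites": ["B2016"]},
}

def methodology_chains(papers):
    """Yield maximal citation chains in which every paper shares the same
    experimental design and figure of merit as the paper it cites."""
    def extend(chain):
        head = papers[chain[-1]]
        grew = False
        for cited in head["cites"]:
            c = papers.get(cited)
            if c and (c["design"], c["fom"]) == (head["design"], head["fom"]):
                grew = True
                yield from extend(chain + [cited])
        if not grew:
            yield chain

    for pid in papers:
        yield from extend([pid])

for chain in methodology_chains(papers):
    if len(chain) > 1:
        print(" -> ".join(chain))  # e.g. C2019 -> B2016 -> A2014
```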

I would also recommend adding a comparative table between the corpus of [129] and the one presented in this manuscript, because it might make it easier to grasp the differences between the two corpora, instead of having such aspects discussed only in the main text (e.g., lines 234, 238, 251, 269).

Lastly, I would also add in the conclusion that another major threat to the validity of MGR research is the reviewers who accept for publication works that lack any sort of validity and scientific rigour; it is not only the authors’ fault.

In conclusion, I would like to thank the authors for this submission, and I hope to see in future more critical research on MGR like this one presented.

[1] Delavari S, Pourahmadi M, Barzkar F. What Quality Assessment Tool Should I Use? A Practical Guide for Systematic Reviews Authors. Iran J Med Sci. 2023 May;48(3):229-231. doi: 10.30476/IJMS.2023.98401.3038. PMID: 37791333; PMCID: PMC10542923.


Author description of changes:

Summary of Changes

  • Abstract opening includes citation to Sturm’s original survey [meta / R1]
  • Abstract and text both highlight the date range of this survey [meta / R1]
  • The missing paper noted has been incorporated and analysed, and the data in the paper updated [R2]
  • Added to the discussion a note acknowledging the possibility of a follow-up examining auto-tagging as a related endeavour (discussed below in our response) [R2]
  • The 560 survey references were not sorted in a consistent way in the original submission. This is now fixed; note that this changes the citation numbers.
  • Reinstated to the supplement a discussion of different Figures of Merit, to alleviate concerns of unfamiliar terminology [meta/R3]

Terse Responses to Other Comments (2k char limit!)

With thanks to the reviewers, there are some changes we did not make, either out of respectful disagreement or due to space constraints.

  • Reliance on Sturm's methodology: We see this as a legitimate and robust approach, as we are examining what has changed methodologically since 2013
  • Annotation errors: with thanks for the reviewer's diligence, we cannot fix these without details. We welcome corrections, and our repo will carry a note saying so
  • Organization / self-containment: We appreciate the sympathy, and emphasize that we feel it important to address this to ISMIR, practical and cultural challenges to critical work notwithstanding
  • Coverage of musicology: The supplement covers this, we feel, to an adequate degree
  • Highlight influential work: We already highlight work we feel is meritorious; influence could be an interesting follow-up
  • Coverage of new datasets: We feel it is adequate to the scope, esp. as 4 of the 5 most-used datasets are still pre-2013
  • Coverage of representation learning: We feel it is out of scope, as are other specific algorithms
  • Various interesting follow-ups (changes in genre itself, extended bibliometrics): Thanks! Some rich material for an extended follow-up