Abstract:

In this work we explore whether large language models (LLMs) can be a useful and valid tool for music knowledge discovery. LLMs offer an interface to enormous quantities of text and can hence be seen as a new tool for 'distant reading', i.e. the computational analysis of text, including sources about music. More specifically, we investigated whether ratings of music similarity, as measured via human listening tests, can be recovered from textual data by using ChatGPT. We examined the inferences that can be drawn from these experiments through the formal lens of validity. We showed that the correlation of ChatGPT with human raters is of moderate positive size but also lower than the average human inter-rater agreement. By evaluating a number of threats to validity and conducting additional experiments with ChatGPT, we were able to show that especially the construct validity of such an approach is seriously compromised. The opaque black-box nature of ChatGPT makes it close to impossible to judge the experiment's construct validity, i.e. the relationship between what is meant to be inferred from the experiment, namely estimates of music similarity, and what is actually being measured. As a consequence, the use of LLMs for music knowledge discovery cannot be recommended.
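To make the comparison described in the abstract concrete, the sketch below illustrates one common way of contrasting an LLM's similarity ratings with human listening-test ratings and with the average human inter-rater agreement. The data, study size, and use of Spearman correlation are illustrative assumptions, not the paper's actual analysis pipeline.

```python
# Illustrative sketch only: hypothetical data, not the paper's actual ratings.
# Compares an LLM's similarity ratings against human listening-test ratings
# and against the average human inter-rater agreement.
from itertools import combinations

import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

n_pairs, n_raters = 30, 6                             # hypothetical study size
human = rng.integers(1, 6, size=(n_raters, n_pairs))  # human ratings on a 1-5 scale
llm = rng.integers(1, 6, size=n_pairs)                # ratings from the model

# Correlation of the LLM with the mean human rating per song pair.
rho_llm, _ = spearmanr(llm, human.mean(axis=0))

# Average pairwise inter-rater correlation among the human listeners.
rho_humans = np.mean([spearmanr(human[i], human[j])[0]
                      for i, j in combinations(range(n_raters), 2)])

print(f"LLM vs. humans: {rho_llm:.2f}, human inter-rater: {rho_humans:.2f}")
```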

Reviews
Meta Review

For this paper, the meta review faces the challenge that two reviews voted for acceptance (including the initial meta review), and two reviews voted for rejection (both weak reject). None of the reviewers provided any input to the discussion phase despite my repeated encouragement for discussion. This leaves me as the meta reviewer with the texts of the individual initial reviews.

Among the reviews, I would argue that R3 has made crucial misjudgments and should not be weighted strongly: R3 points out that "only few experiments were conducted". This is a complete misunderstanding: the related experiments had been conducted in the context of published research [10], and the submitted paper applies a different, more theoretical framework to it. Therefore, opinions about the experimental design of [10] must not be imposed on the review judgement for the present paper. Also, R3's argument related to "Task Representation" seems inappropriate, as I would argue that the paper uses music similarity because it is a typical MIR task, and actually discusses in detail how this interacts with many other cultural constructs such as genre. Therefore, I as the meta-reviewer see little substance in R3's motivations for rejecting the paper.

R4, on the other hand, raises a series of valid concerns, such as that "no further investigation is made which could actually be rather trivial (like checking the results of alternative LLMs)". Overall, the critique of the paper by R4 reads similar to R2 and the initial meta-review, with the difference that R4 recommends weak reject.

In total, I would therefore argue to accept this paper as an outcome of a majority vote of three valid reviews (meta-review, R2, R4). In contrast to the other - more experimental - papers that I reviewed, I would like to emphasize my impression that reviewers seem to find it harder to establish common standards for reviews for such rather conceptual papers.

In case the paper is indeed accepted, I encourage the authors to more carefully emphasize the limitations and scope of the paper, in order to address the concerns expressed in the reviews.


Review 1

This paper focuses on using ChatGPT/LLMs to do ``distant reading'', a musicological task pertaining to the analysis of corpora of data related to music. The primary motivation behind this is that LLMs appear to be designed to be good at this task.

The prevailing hypothesis here appears to be that ChatGPT's ability to estimate similarity is a sufficient surrogate for human perception of similarity.

The experimental setup uses a listening study, where they ask people to rate pairs of songs. This is followed up by asking ChatGPT how similar the song pairings are. The paper thoroughly explores the validity of the results through the lens of statistical, internal, external, and construct validity. On the surface, they find that there is a correlation between listener ratings of similarity and ChatGPT's reported understanding of similarity. However, once construct and external validity are examined, the usefulness of ChatGPT appears to diminish.
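For readers unfamiliar with this kind of setup, a minimal sketch of how such a pairwise similarity query could be posed to a chat model is shown below. The prompt wording, model name, and song titles are illustrative assumptions and not the exact prompt reported in the paper under review.

```python
# Minimal sketch of querying a chat model for a pairwise similarity rating.
# Prompt text, model name, and song titles are illustrative assumptions only.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def rate_similarity(song_a: str, song_b: str) -> str:
    """Ask the model to rate the similarity of two songs on a 1-5 scale."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": (
                f"On a scale from 1 (not similar) to 5 (very similar), "
                f"how similar are the songs '{song_a}' and '{song_b}'? "
                f"Answer with a single number."
            ),
        }],
    )
    return response.choices[0].message.content

print(rate_similarity("Song A by Artist X", "Song B by Artist Y"))
```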

This paper formulates a fairly straightforward experiment that evaluates the feasibility of using ChatGPT for distant reading. However, as their own experiments show, it is not a particularly useful tool for this purpose. To some extent, it can be argued that this is an expected result. While the underlying LLM models used by ChatGPT and the like are certainly trained on stupendous amounts of data, it is unlikely that they are capable of knowing if song A by artist X is similar to song B by artist Y.

Works like this are necessary to caution anyone who might assume that ChatGPT, with its enormous dataset and ability to answer queries, might be an appropriate tool for music-related tasks directly. I think the rigorous analysis of the validity of running experiments such as this one should be a useful blueprint for anyone seeking to run experiments probing large language models on musical concepts.

There is a minor reference issue with [3] (Graphs, maps, trees: abstract models).


Review 2

Strengths:

  1. The article applies LLMs to 'distant reading', a concept from literary studies.

Weaknesses:

  1. Model Diversity: The study's exclusive focus on the GPT series (GPT-4 and GPT-3.5) potentially overlooks the capabilities of other LLMs or AI systems such as Gemini, Qwen, and Claude. This limited selection might limit the generalizability of the results, suggesting the need for broader testing across different models to validate the conclusions.

  2. Task Representation: The study restricts its evaluation to music similarity tasks to draw conclusions about the broader domain of music knowledge discovery. This narrow focus might not fully encapsulate the complex and multifaceted nature of music understanding, such as emotional recognition or thematic analysis, which could provide a different assessment of LLM capabilities.

  3. Participant and Methodological Concerns: The paper only tested with 6 participants and 90 songs. Furthermore, the absence of detailed information on the experimental LLM prompts and a comparison of responses under different conditions (e.g., with and without Chain of Thought) leaves a gap in understanding how different interaction modes with LLMs might affect the outcomes.

  4. Interpretability and Bias: The argument that LLMs' "black box" nature inherently limits their usefulness in tasks requiring interpretability is notable. However, this critique could also apply to human cognition and subjective biases in music appreciation. The lack of a comparative analysis between human and AI biases and the ability to guide LLM responses with tailored prompts suggest an area for further exploration and potentially undermines the conclusion that LLMs are unsuitable for music knowledge discovery based solely on interpretability issues.


Review 3

This is a very clearly written and revised article that researches the validity (considered from four formal perspectives) of a Large Language Model (in this case, OpenAI's ChatGPT 3.5) for distant reading of music similarity. This is mainly done via a comparative test with human raters, and it is consistently shown that inter-human consistency is larger than human-LLM consistency. The design of the paper is quite comprehensive and it draws from previously vetted material. There is a sense that the research came up with a negative result (which the authors acknowledge was expected) and thus the article became something of a speculation and justification of why this could be so. While this could be an interesting idea, there are two large problems with this endeavour:

- It seems very apparent that an LLM is not the correct tool for the task, there being a host of applications that are able to "hear" the raw audio, which would be the only setting that would yield a fair comparison. I do understand that the authors want to look into "distant reading", obviously, but there are so many apparent drawbacks from the start that getting to the actual negative result seems quite pointless. For example, in real-world applications, it would probably be the case that most of the inputs given to the LLM were novel, and thus never considered in the training dataset, or that they would be unlabelled (audio with no name).

- More importantly, when considering the different aspects of validity, no further investigation is made, which could actually be rather trivial (like checking the results of alternative LLMs). This means that, other than statistical conclusion validity, which is objectively measured, the other aspects remain in the abstract, speculative realm. Were this a philosophical article, not actually based on the conclusions of an experiment, it could have been written with this approach (which would need to be very different from the current state). As it is, it is very incomplete by design.

So, while this is an interesting read, very clear and well written (no suggestions in terms of language, no typos found, very well-structured), it is too speculative and based on non-proven assumptions for the nature of what it promises to deliver.


Author description of changes:

Reviewer #3:

"The paper only tested with 6 participants and 90 songs."

Response: The study with only 6 participants is prior work that is only being referenced and scrutinized in our paper.

"The study's focus solely on the GPT series ... potentially overlooks the capabilities of other types of LLMs ..."

Response: This limitation of our study is already being discussed at the end of section 8.

"Furthermore, the absence of detailed information on the experimental LLM prompts and a comparison of responses under different conditions (e.g., with and without Chain of Thought) leaves a gap ..."

Response: The exact LLM prompts are provided in section 3.2. Usage of different prompts is discussed in section 7.

Reviewer #4:

"It seems very apparent that a LLM is not the correct tool for the task, there being a host of applications that are able to "hear" the raw audio ..."

Response: Other work on extracting psychophysical information from text (see [11] and [12]) shows that this kind of work is worth conducting.

"This means that other than statistical conclusion validity ... the other aspects remain on the abstract, speculative realm."

Response: Throughout the paper, additional experiments are conducted; hence the article is not merely "speculative".

Meta reviewer:

"What would be wrong about using the genre as a guiding information for music similarity?"

Response: The complicated relations between genre and music similarity are being discussed in section 5.

"A shortcoming of the paper may be that the influence of varying prompting on the ratings was not investigated."

Response: Usage of different prompts is discussed in section 7.

"The authors could provide a bit more elaboration of the lack of symbol grounding."

Response: We have the feeling that "symbol grounding" is such an elaborate problem, having been discussed for decades in AI research, that going beyond a brief, but important, mention in our paper would exceed the scope of our work.