Abstract:

Generative models guided by text prompts are becoming increasingly popular. However, no text-to-MIDI models currently exist due to the lack of a captioned MIDI dataset. This work aims to enable research that combines LLMs with symbolic music by presenting MidiCaps, the first openly available large-scale MIDI dataset with text captions. MIDI (Musical Instrument Digital Interface) files are a widely used format for encoding musical information and can capture the nuances of musical composition. They are used by music producers, composers, musicologists, and performers alike. Inspired by recent advancements in captioning techniques, we present a curated dataset of over 168k MIDI files with textual descriptions. Each caption describes the musical content, including tempo, chord progression, time signature, instruments, genre, and mood, thus facilitating multi-modal exploration and analysis. The dataset encompasses various genres, styles, and complexities, offering a rich data source for training and evaluating models for tasks such as music information retrieval, music understanding, and cross-modal translation. We provide detailed statistics about the dataset and have assessed the quality of the captions in an extensive listening study. We anticipate that this resource will stimulate further research at the intersection of music and natural language processing, fostering advancements in both fields.
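To make the shape of a dataset entry concrete, the following is a purely illustrative Python sketch of how a single MidiCaps-style record pairing extracted musical attributes with a caption might look; the field names, values, and caption wording here are assumptions, not the released dataset's exact schema.

```python
# Illustrative only: a hypothetical MidiCaps-style record.
# Field names are assumptions, not the dataset's actual schema.
example_entry = {
    "midi_file": "lakh/000123.mid",          # hypothetical path
    "tempo_bpm": 120,
    "time_signature": "4/4",
    "key": "C major",
    "chord_progression": ["C", "G", "Am", "F"],
    "instruments": ["acoustic piano", "string ensemble", "drums"],
    "genre": "pop",
    "mood": "uplifting",
    "caption": (
        "An uplifting pop track in C major at 120 BPM, built on a C-G-Am-F "
        "progression and led by piano over strings and a steady drum groove."
    ),
}
```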

Reviews
Meta Review

The paper was commended for introducing a new task and for its good presentation and writing. There were specific comments on the motivation, the need for additional evaluations, and a lack of clarity regarding the features used. Although there are some minor comments on the manuscript itself, these can be addressed in the camera-ready version, and the manuscript is recommended for acceptance.


Review 1

Overall, the paper is well-written and the topic itself is quite new, so I think this dataset can be put to good use in future research. However, the description of the features used is limited, and the evaluation of the pseudo-generated MIDI captioning dataset is especially limited.

Listening study: looking at the 5 audio and caption examples on the web demo page, the quality seems decent. However, the listening study reported in the paper does not give much information about the quality of the captioning dataset.

It would be helpful for readers if the authors provided a detailed explanation of the feature extraction algorithms used. In this paper, the characteristics of each feature extraction algorithm play a critical role in the pseudo-generated MIDI captions; however, the paper only reports which library was used for each. If the authors could report 1) the performance of each feature extraction algorithm, 2) the type of its output fields (e.g. for a tempo extraction algorithm, whether the output is a number or text; this can be inferred from the results table, but it should be described in as much detail as possible), and other such information, it would be very beneficial for readers who want to work on pseudo MIDI captioning or related tasks.

I also think the authors could add a small paragraph explaining what in-context learning is for the readers.

In addition, it would be helpful for readers if the authors provided a list of possible applications of this dataset (e.g. text-to-MIDI generation, MIDI-to-text captioning, etc.).

Also, if the authors could pick one of those tasks and show a baseline result, it would compensate for the lack of a thorough listening study. However, I understand that this cannot easily be added during the review phase.

In conclusion, I think 1) more details about the feature extraction algorithms, 2) a stronger listening study, and 3) a list of possible tasks using this dataset are necessary for this to be published as a good paper. The dataset is novel, but the insights readers can currently gain from this paper are very limited.


Review 2

I must confess that after my first reading, I was very sceptical about this approach of using traditional MIR algorithms to generate annotations that an LLM then converts into text captions. Given that the long-term motivation of the dataset introduced in this paper is to enable prompt-to-music generative models, especially for symbolic music generation, I was expecting a more human-centred approach to captioning the data. One idea could have been to use one of the many sources of data available across various websites where people comment on and describe how they feel about a particular piece of music. Such data has, among other approaches, allowed the text-prompt-to-music-audio field to achieve good results in the past years, especially when targeting a broad human audience. However, choosing the field of symbolic music generation already implies that the target is not a random listener but rather a musician, and therefore it is indeed legitimate to approach the captioning of music with semantic descriptors that someone skilled in music theory would use. It is then logical to rely on state-of-the-art MIR algorithms to extract such descriptors and to use LLMs to generate the captions. However, it would have been useful to discuss the limitations of this approach in more depth, especially in contrast to captions written by real-world musicians; a small experiment comparing generated and human captions could also have helped justify further the usefulness of the generated captions.

The pipeline for the generation of the dataset is clearly explained, with details on every annotation extracted from the MIDI files of the Lakh Dataset. It would have been interesting to provide the scripts implementing this pipeline in the repository, but this is not mandatory.
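To illustrate the kind of pipeline discussed here, below is a minimal, hedged sketch of a MIR-features-to-LLM-prompt flow. It uses pretty_midi purely for illustration; the paper's actual feature extractors, chord analysis, field names, and prompt wording may differ, and the file path is hypothetical.

```python
# Sketch of a feature-extraction -> caption-prompt pipeline, loosely in the
# spirit of the MidiCaps approach. Not the authors' actual implementation.
import pretty_midi


def extract_features(midi_path: str) -> dict:
    """Extract a handful of musical descriptors from a MIDI file."""
    pm = pretty_midi.PrettyMIDI(midi_path)

    # Tempo: a single global estimate in BPM.
    tempo = round(pm.estimate_tempo())

    # Time signature: take the first annotated change, if any.
    if pm.time_signature_changes:
        ts = pm.time_signature_changes[0]
        time_signature = f"{ts.numerator}/{ts.denominator}"
    else:
        time_signature = "4/4"  # fall back to the most common signature

    # Key: first annotated key signature, if present.
    key = None
    if pm.key_signature_changes:
        key = pretty_midi.key_number_to_key_name(
            pm.key_signature_changes[0].key_number)

    # Instruments: map General MIDI program numbers to readable names.
    instruments = sorted({
        "drums" if inst.is_drum
        else pretty_midi.program_to_instrument_name(inst.program)
        for inst in pm.instruments
    })

    return {
        "tempo_bpm": tempo,
        "time_signature": time_signature,
        "key": key,
        "instruments": instruments,
    }


def build_caption_prompt(features: dict) -> str:
    """Turn the extracted features into a prompt for an LLM caption writer."""
    return (
        "Write a one-sentence description of a piece of music with these "
        f"properties: {features}. Mention the tempo, instruments and mood."
    )


if __name__ == "__main__":
    feats = extract_features("example.mid")  # hypothetical input file
    print(build_caption_prompt(feats))
```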

The two listening studies conducted to validate the hypothesis that this pipeline generates captions that are accurate to musicians are interesting. It would probably be pertinent to detail this section further, especially by explaining why PsyToolkit allows an accurate evaluation of the quality of the generated captions. Also, again on the topic of comparing generated captions with hand-made human captions, it would have been useful to introduce a few human caption examples and check whether the listening results show significant differences between the two groups.

Finally, it would have been interesting to apply the dataset in a simple experiment. This is understandably the next step of the research, but it would have helped further justify the usefulness of this dataset.

In brief:

Strengths: First Dataset of MIDI Files with Text Captions, Detailed and Reusable Pipeline, Fairly Musically Accurate Captions, Validated with a Listening Study

Weaknesses: No Experiment to Demonstrate the Applications of the Dataset, Not Enough Discussion of Generated Captions vs Human Captions, No Discussion of Future Directions for Improving the Captioning, Not Enough Discussion of the Weaknesses of the Approach


Review 3

The paper is well-written and the motivation is clear, but the way the dataset is created is not especially novel, since it mostly relies on existing models and tools. That said, a dataset paper should not be expected to be highly innovative. What stands out is that the MidiCaps dataset is very large and the automatically generated captions are of good quality. Putting together such a large dataset of MIDI files with matching text is valuable in itself, because it enables new kinds of research combining symbolic music and language processing. Thus, I think this paper should be accepted. Typos: Section 3.2: "tweek" should be "tweak".


Author description of changes:

We updated the title to use a colon ":" instead of a dash "-". The abstract was updated to save space. An acknowledgement was added.

We addressed the reviewers' comments, specifically: We updated our evaluation section by running an extended listening study that included both human- and AI-annotated captions to show the similarity between the two. Both a general audience and music experts were invited to participate. The section is followed by a discussion of the results and the shortcomings of our work, illustrating possibilities for future work. We also corrected a couple of typos and added an explanation of what is meant by "in-context learning".