Abstract:

Current version identification (VI) datasets often lack sufficient size and musical diversity to train robust neural networks (NNs). Additionally, their non-representative clique size distributions prevent realistic system evaluations. To address these challenges, we explore the untapped potential of the rich editorial metadata in the Discogs music database and create a large dataset of musical versions containing about 1,900,000 versions across 348,000 cliques. Utilizing a high-precision search algorithm, we map this dataset to official music uploads on YouTube, resulting in a dataset of approximately 493,000 versions across 98,000 cliques. This dataset offers over nine times the number of cliques and over four times the number of versions than existing datasets. We demonstrate the utility of our dataset by training a baseline NN without extensive model complexities or data augmentations, which achieves competitive results on the SHS100K and Da-TACOS datasets. Our dataset, along with the tools used for its creation, the extracted audio features, and a trained model, are all publicly available online.

Reviews
Meta Review

This paper proposes a new dataset for the task of version identification. The dataset is based on the Discogs music database and is substantially larger than related datasets.

The reviewers all agree that the paper and the dataset are a valuable contribution to the community. The paper is well-written and the methodology is clearly described. The reviewers also mention a few minor issues that could further strengthen the paper.


Review 1

The paper presents a new dataset for the task of version identification that is significantly larger than existing datasets. The demand for datasets for this task is real, partly for the reason presented (data volume), but mainly because of the difficulty of ensuring data quality (such as intra-clique correctness) and the challenge of maintaining and disseminating the data (especially the audio associated with the tracks). Furthermore, building the dataset from a metadata-rich source brings many advantages to the proposed set. Therefore, this paper is relevant.

However, I list some points that the text could address better:

  • While using "official videos" might help with the longevity of link availability, it limits the dataset to less varied versions. For example, the dataset is restricted to good recordings by professional musicians. The data will never include fan versions, such as those with specific instruments.
  • Discogs has information on song duration. Couldn't this information be used to better filter out playlists, or videos containing elements other than the target music, rather than cutting at a 20-minute duration?
  • The paper presenting Da-TACOS has a section devoted to analyzing the characteristics of a version ("What is a cover song" or something similar to this). This is a good idea that the paper introduced, as a way to explore the dataset without being limited to a straightforward exploratory data analysis. This article could include something along these lines to enrich its analysis, especially since it would be much richer than that conducted by the group that proposed Da-TACOS (since there are more tracks and more metadata). For instance, they could examine how much versions vary in genre.
  • The baseline model part seemed very disconnected from what the reader would expect to see. Instead of creating and evaluating a new architecture on different datasets, different architectures could be experimented on and applied to the proposed dataset. However, the section presented in the article resembles something a reader would expect in an article proposing a new architecture for the task as the main contribution.
  • Providing different feature sets is very relevant. However, the text does not clearly specify what these features will be (it only says "including" some), nor how they will be formatted and made available.
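
To make the duration point above concrete, here is a minimal sketch of such a filter. It is an illustration only, not the authors' pipeline: the function names, the 15% tolerance, and the assumption that metadata durations arrive as "M:SS" or "H:MM:SS" strings are all hypothetical. The fixed 20-minute cap is kept as a fallback for tracks without a usable duration.

```python
from typing import Optional


def parse_duration(text: str) -> Optional[int]:
    """Convert an 'M:SS' or 'H:MM:SS' string to seconds; None if missing or malformed."""
    parts = text.strip().split(":")
    if not 2 <= len(parts) <= 3 or not all(p.isdigit() for p in parts):
        return None
    seconds = 0
    for part in parts:
        seconds = seconds * 60 + int(part)
    return seconds


def keep_video(track_duration: str, video_seconds: int,
               tolerance: float = 0.15, fallback_max: int = 20 * 60) -> bool:
    """Keep a video whose length is within `tolerance` of the metadata duration.

    The tolerance value is an assumption chosen for illustration. When the
    metadata duration is missing, malformed, or zero, fall back to the fixed cap.
    """
    expected = parse_duration(track_duration)
    if not expected:
        return video_seconds <= fallback_max
    return abs(video_seconds - expected) <= tolerance * expected
```

With these assumptions, a 230-second video would be kept for a track listed as "3:45", whereas an 1100-second video (likely a playlist or a full album upload) would be rejected even though it is under 20 minutes.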


Review 2

The paper introduces a new dataset for version identification that features an unprecedented scale. Part of the dataset is provided with YouTube links, and audio features are available on request. The authors then briefly introduce a NN model to demonstrate a practical use of the dataset, and compare it to the state of the art as far as they can.

The paper is very well written and easy to follow.

Overall, the contribution of this paper is of major importance to the task of version identification. It has been shown numerous times in the literature that the high variability in the nature of music versions, the plurality of music genres and the range of clique sizes are key factors that influence the performance of VI models. Therefore, a new dataset with 1.8M versions and 330k cliques is of major interest to the research community.

Moreover, the authors don't just release a massive metadata database, but explain in detail the steps used to produce it. This is even more valuable for future research and dataset construction.

The baseline model is presented quite thoroughly. The authors do not elaborate much on the modelling choices or on the parameter settings, but this is understandable in a dataset-focused paper. The results are presented extensively, and the authors honestly acknowledge the limitations of their interpretation, mainly due to the difficulty of cross-comparing the baseline databases.

I have very few complaints about the paper, which is quite clear as it is. Here are a few minor remarks that could make it even better in my opinion:

  • The end of section 2.1 lists a few version types that are present in the dataset: live versions, remixes, radio edits, etc. I think the authors should describe these types further, by providing an additional table or graph, as new types may be a valuable contribution and would better describe the composition of the dataset.
  • Section 3 describes the process applied to YouTube content. It might be advisable for the authors to remind readers here that the audio is not disclosed as part of the database, and that content derived from this copyrighted material, namely the audio features, is only granted for non-commercial research purposes. I know it is stated elsewhere, but it might be worth mentioning here.
  • Figure 3 is not easy to read; I suggest the authors reorganize it. Maybe the data could be represented for all 3 splits on a single, larger plot?
  • Many references are missing the conference name; the references should be carefully proofread and revised.


Review 3

This work introduces Discogs-VI-YT, a dataset for version identification (VI). The dataset is created by leveraging metadata from the Discogs music database and a search algorithm to programmatically identify a large corpus of music versions. The proposed dataset surpasses existing VI datasets in size and offers more comprehensive metadata. To demonstrate the efficacy of the proposed dataset, a baseline VI model is also trained.

The manuscript is well-written and easy to follow. The dataset creation process is explained in detail with reasonable design choices and well-recognized limitations.

The main weakness I found is that the discussion is limited on why the baseline trained with this large new dataset did not improve over other existing models when evaluated on the SHS100K-Test set (even for the CQTNet, which the proposed baseline model is based on). More discussion about potential causes can help readers gain more insights about the dataset, the task, or the existing models.

Other minor comments:

  • Ln 187: duplicated Miles Davis
  • Ln 454: missing “are”
  • Ln 480: missing “use”
  • Author names for reference [6] are missing
  • It could be nice to place Tables and Figures at the top of the page.


Author description of changes:

We re-ran the data mining pipeline with the July dump, where the small number of release YouTube annotations are discarded. Using this new version of the dataset, we re-ran the model training experiments and updated the corresponding table. We added a new table comparing the number of artists for selected datasets. All figures and tables are improved for visibility.