Abstract:

Cloned voices of popular singers sound increasingly realistic and have gained popularity over the past few years. However, they pose a threat to the industry due to personality rights concerns. As such, methods to identify the original singer in synthetic voices are needed. In this paper, we investigate how singer identification methods could be used for such a task. We present three embedding models that are trained using a singer-level contrastive learning scheme, where positive pairs consist of segments with vocals from the same singers. These segments can be mixtures for the first model, vocals for the second, and both for the third. We demonstrate that all three models are highly capable of identifying real singers. However, their performance deteriorates when classifying cloned versions of singers in our evaluation set. This is especially true for models that use mixtures as input. These findings highlight the need to understand the biases that exist within singer identification systems, and how they can influence the identification of voice deepfakes in music.
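For readers unfamiliar with the training scheme described above, here is a minimal sketch of how singer-level positive pairs might be assembled; the `catalog` structure and function names are assumptions for illustration, not the authors' actual pipeline. Depending on the model, the segments would be drawn from full mixtures, from separated vocal stems, or from both.

```python
import random

def build_singer_level_batch(catalog, batch_size=128):
    """Sketch of singer-level positive pair sampling.

    `catalog` maps a singer ID to a list of audio segments containing that
    singer's vocals (hypothetical structure). Each positive pair consists of
    two different segments from the same singer; every other segment in the
    batch then serves as a negative for contrastive training.
    """
    singers = random.sample(list(catalog), batch_size)
    anchors, positives = [], []
    for singer in singers:
        # Assumes every singer in `catalog` has at least two segments.
        seg_a, seg_b = random.sample(catalog[singer], 2)
        anchors.append(seg_a)
        positives.append(seg_b)
    return anchors, positives
```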

Reviews
Meta Review

The reviews of this paper found the task to be important and the experimental validation to be thorough and rigorous. While there were comments about the paper's clarity and title that we hope the authors will address, we feel that this paper will make a strong candidate for publication at this year's ISMIR.


Review 1

This paper explores the use of contrastive learning to learn embeddings for the task of singer identification in three configurations: mixture, vocal, and hybrid. The experiments show promising results, especially when the vocal model is used to identify real singers (as opposed to synthetic ones). When instrumental stems are present during pre-training (mixture or hybrid model), performance is reduced. The models perform notably worse when identifying synthetic versions of the singers. These observations are explained and explored in the paper in detail. I just have a few small suggestions:

  • In line 129, annotations are mentioned. However, it is unclear to me what these annotations contain and how they were used in the experiments (I’m assuming the names of the artists were used, but was there anything else?).
  • In Figure 1, the Closed dataset plot is a bit too jammed together. I’d recommend plotting the accuracies of the models next to each other, rather than on top of each other.
  • The observation of the non-uniform performance over musical genres (lines 322--346) would benefit from mentioning the distribution of genres in the training datasets.
  • Figure 4 (as well as its caption) is a bit confusing. I’d recommend referring to the bars themselves instead of using numbers in the caption (e.g. “the purple bars (test/other) show …“). I’d also suggest putting Mixture, Hybrid, and Vocal as subplot titles rather than on the right.

Aside from these minimal suggestions, I believe the paper is well written and complete, making it a very good contribution to ISMIR.


Review 2

The strengths of this paper are that (a) it provides a massive dataset (with a detailed description), and (b) it confirms that a system trained on real inputs alone is not capable of performing well on synthetic inputs; this implies the system should also be trained with synthetic inputs. The major weakness of this paper is therefore that it prompts the reader to ask why the paper does not also examine singer identification systems trained with synthetic inputs and with both (real and synthetic). To match the claims of the paper, my suggestion is the following: focus on three systems first (trained with (a) real, (b) synthetic, and (c) real + synthetic data), and then evaluate their performance on three types of datasets ((a) real alone, (b) synthetic alone, and (c) real + synthetic), resulting in nine performance results. Given such results, optionally broken down by variables like genre, the paper would be easier to read and its strengths would be reinforced.
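To make the suggested 3 × 3 evaluation grid concrete, the nine train/test combinations the reviewer asks for could be enumerated as follows (labels are illustrative only, not conditions reported in the paper):

```python
from itertools import product

TRAIN_SETS = ["real", "synthetic", "real+synthetic"]
TEST_SETS = ["real", "synthetic", "real+synthetic"]

# Three training conditions crossed with three test conditions
# yield the nine performance results suggested by the reviewer.
for train_set, test_set in product(TRAIN_SETS, TEST_SETS):
    print(f"train on {train_set:<15} -> evaluate on {test_set}")
```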


Review 3

This paper trains embeddings of singing features using singer-level contrastive learning. Using a singer identification model with the projector head removed, the paper reports the differences in classification rates and characteristics between human singing voices and synthesized singing voices. NT-Xent is used as the loss for training the embedding models. Three types of models are trained: one targeting the separated singing voice, one targeting the mixture, and a hybrid of both.
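For reference, NT-Xent (the normalized temperature-scaled cross-entropy loss) for a positive pair of embeddings $(z_i, z_j)$ in a batch of $2N$ segments is commonly written as follows; this is the standard SimCLR-style formulation, not an equation quoted from the paper under review:

$$
\ell_{i,j} = -\log \frac{\exp\!\left(\operatorname{sim}(z_i, z_j)/\tau\right)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp\!\left(\operatorname{sim}(z_i, z_k)/\tau\right)}
$$

where $\operatorname{sim}(u, v) = u^\top v / (\lVert u \rVert \lVert v \rVert)$ is the cosine similarity and $\tau$ is a temperature hyperparameter.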

In an evaluation using the open dataset, the Mixture-based embedding model outperformed CLMR [36], a conventional method. Compared to the training of the CLMR model, the proposed model used fewer than 1/20 as many tracks, indicating the effectiveness of singer-level sampling.

In the evaluation with the closed dataset, the paper experimented with subsets of 100 to 1000 classes. Although performance was better with the singing voices separated, the difference in performance was small. While it is difficult to compare performance with previous studies that used in-house datasets, the paper argues that the proposed model is at least as good. In addition, the paper showed that classification performance can differ across musical genres, and that classification is more difficult for singers with many songs.

Evaluations using synthetic singing datasets showed that performance was lower than when targeting real singing voices. In particular, the performance of the Mixture and Hybrid models degraded. To investigate the cause of this degradation, results on the cosine similarity of the embedding vectors were also reported, such as higher similarity for the instrumental embeddings in the Mixture and Hybrid models.
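For completeness, the cosine similarity reported between embedding vectors can be computed as below; this is a generic illustration with random vectors, not the authors' analysis code:

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Example: compare a (hypothetical) vocal embedding with an instrumental one.
vocal_emb = np.random.randn(128)
instrumental_emb = np.random.randn(128)
print(cosine_similarity(vocal_emb, instrumental_emb))
```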

The paper is well written and has adequate references. Its strengths are that the proposed method performs as well as or better than conventional methods when targeting human singing voices, and that it investigates the factors that cause performance degradation. Future directions following this paper include efforts to improve performance on synthetic singing voices.

The following issues need to be corrected:
  • Line 209: The details of the architecture used for singer identification are not clear and need to be stated.
  • Line 400: Figure 4 is discussed in the text without explanation and is difficult to understand. An explanation is needed. The following comments are also relevant.
  • Line 405: "the instrumental embeddings" appears here for the first time, at least in the text, and it is unclear how these were obtained.
  • Figure 4: It is necessary to specify which legend entry each of the explanations 1) to 5) corresponds to.

In addition, the following are minor comments, but these should also be corrected:
  • Line 175: Regarding "Finally, for our Hybrid model, these segments are randomly sampled from either the songs’ mixtures or their vocal stems", it is unclear whether this means that mixture-mixture and vocal-vocal pairs are mixed within B=128, or whether mixture-vocal pairs are also possible (see the sketch after this list).
  • Line 219: Regarding "At least three tracks per singer are then used for training.", do different singers have different amounts of training data?
  • Figure 4: I find the notation of each legend entry a little hard to understand.
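To illustrate the ambiguity raised in the Line 175 comment, here is a minimal sketch of the two possible readings of the Hybrid sampling strategy; the data structure and function names are assumptions, not the authors' implementation:

```python
import random

def random_segment(audio, seg_len=3):
    """Pick a random fixed-length slice from an audio sequence (placeholder)."""
    start = random.randint(0, max(0, len(audio) - seg_len))
    return audio[start:start + seg_len]

def sample_hybrid_pair(song, source_per_segment=True):
    """`song` is assumed to hold a "mixture" and a "vocals" audio sequence."""
    sources = [song["mixture"], song["vocals"]]
    if source_per_segment:
        # Reading A: the source is drawn independently for each segment,
        # so mixture-vocal positive pairs can also occur within B = 128.
        return (random_segment(random.choice(sources)),
                random_segment(random.choice(sources)))
    # Reading B: one source is drawn per pair, so only mixture-mixture
    # and vocal-vocal positive pairs appear in the batch.
    source = random.choice(sources)
    return random_segment(source), random_segment(source)
```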


Author description of changes:

Here is a summary of the changes we made to address the reviewers’ critiques:
  • Global changes: we mostly replaced the words "synthetic" and "spoofing" with "cloned" throughout the paper, as the former two caused confusion amongst reviewers. As such, the title of the paper is now "From Real to Cloned Singer Identification". We also use percentages in all figures to match our other results’ reporting. Finally, the words "music streaming service" are replaced by Deezer now that anonymity is not needed.
  • Section 3.1: we remove the term "valid" to describe a song, clarify that a "unique singer" here means a track with "more than one singer", highlight that "singer" annotations are collected to create the closed dataset, and edit out the reference to the current section.
  • Section 3.2: we explicitly list the positive stem pairs that can occur during the Hybrid model’s pre-training. We also emphasise that only mixtures are used during the downstream singer identification task when the Hybrid backbone is used.
  • We clarify the train and validation splits used in Sections 3.2 and 3.3.
  • Section 3.3: we emphasise that the classifiers have the same architecture as the projector head.
  • We emphasise that the CLMR embeddings in Section 4.1 are used for both training and testing our classifiers for the real singer identification task on open datasets. We also better describe the "majority vote" scheme used to generate classification results.
  • Section 4.2: we enlarge Figure 1 for better clarity. We also introduce a table to compare our results with others in the literature; as such, the paragraph that used to do so is drastically shortened. Finally, we add a description of the genres’ distribution in Figure 2’s caption.
  • Section 4.3: we clarify Figure 4’s caption by directly referring to each plot’s colour and legend.

Note that we also slightly reformulated some sections of the paper and resized some figures and tables to make everything fit in six pages!