Abstract:

Carnatic music is a style of South Indian art music whose analysis using computational methods is an active area of research in Music Information Research (MIR). A core, open dataset for such analysis is the Saraga dataset, which includes multi-stem audio, expert annotations, and accompanying metadata. However, it has been noted that there are several limitations to the Saraga collections, and that additional relevant aspects of the tradition still need to be covered to facilitate musicologically important research lines. In this work, we present Saraga Audiovisual, a dataset that includes new and more diverse renditions of Carnatic vocal performances, totalling 42 concerts and more than 60 hours of music. A major contribution of this dataset is the inclusion of video recordings for all concerts, allowing for a wide range of multimodal analyses. We also provide high-quality human pose estimation data of the musicians extracted from the video footage, and perform benchmarking experiments for the different modalities to validate the utility of the novel collection. Saraga Audiovisual, along with access tools and results of our experiments, is made available for research purposes.

Reviews
Meta Review

The intention to augment the available Carnatic music dataset is laudable. The dataset will undoubtedly be useful for future MIR work on Carnatic music. However, the annotation and validation analyses seem to have been completed in a hurry. Please go through the comments of the three reviewers, who have pointed out several aspects that need attention in order to make this a truly valuable resource.


Review 1

The paper proposes Saraga Carnatic 2.0: a large multimodal open data collection for analyzing Carnatic music. The paper reads well; however, the organization of the Experiments section could be reworked, as some of the content is parallel and does not follow a causal flow. The authors refer to the existing dataset by three different names (Saraga, original Saraga, and Saraga 1.0), which should be made consistent. Loaded phrases such as "nature of Carnatic music" (line 459) should be avoided, and the claim that a Spleeter model can learn such nuance while preserving prior knowledge warrants a more detailed discussion of the process.

I have reservations about the argument that melodic motif annotations are unreasonable, for several reasons. On the one hand, the authors acknowledge improving motif recognition as a useful task (line 188) and identify regions of repeated melodic motifs (line 312); on the other hand, arguing against the importance of the annotation task is counterintuitive. Most newly added ragas have only one occurrence, as is evident from Figure 1. In the absence of several, balanced instances per raga class, the dataset is clearly not suitable for a raga classification task. This undermines the claim that the new raga additions scale the collection (line 187).

The video analysis of the gesture modeling is well written. However, the demo video shows that the performer keeps the meter with a clapping gesture, with one hand active (the right) and the other complementing it. In this scenario, one would expect the kinetics to be related to stress points in the rhythmic progression. It would therefore be interesting to see inferences on the individual differences between Ashwin and Prithvi, given the large gap between their correlation values of 0.36 and 0.11, respectively. My final concern is about the appropriateness of calling the proposed dataset an extension of Saraga. Just as the authors consider giving the instrumental dataset an independent identity, calling this dataset Saraga 2.0 warrants a thorough demonstration of the improvements, expanding on Section 3.4, especially regarding the melody re-computation.


Review 2

The paper introduces an extended version of the Carnatic part of the Saraga dataset, presenting experimental results on music-motion relation analysis and music source separation.

Strengths:
- The proposed dataset offers a large amount of data covering various ragas and talas.
- The addition of video recordings and automatic pose estimation results can benefit research on the relationship between music and motion, which has recently gained popularity in Indian art music research.
- The paper includes feedback from the research community on the previous version of the dataset, demonstrating exemplary progress in open science.
- The reproduction of Pearson et al.'s analysis using the new dataset shows the usefulness of the automatically annotated pose information.

Weaknesses:
- The fine-tuning result for the music source separation model, presented in Section 4.2, does not strongly support the usefulness of this dataset. This is not only because fine-tuning degraded the model's performance on vocal artifacts, but also because similar experiments and results were already provided with Saraga 1.0 [17]. MSS can be fine-tuned for Carnatic music with Saraga 1.0 alone, so this is not an exclusive benefit of Saraga 2.0. The authors should explicitly show the advantage of the additional data, for example by comparing a model trained only on Saraga 1.0 with a model trained on both Saraga 1.0 and 2.0.
- Including video is an interesting aspect of this dataset. However, the camera angle is not appropriate for capturing the posture of the violinist or mridangist, and while the main interest in motion analysis might be the singer, the microphone and its stand obscure the singer's hands, as the authors mention. In this sense, the dataset is less suited to video analysis than previous video recordings of vocal performances in Indian art music [16]. The authors should report the stability of the MMPose results, e.g. whether there were sudden jumps in estimated hand locations.
- It appears that the paintings in the background are also detected as human postures. It would also be better for users if the pose estimation were provided separately for the singer and the other players.
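The stability report the reviewer asks for could be produced with a simple per-frame displacement check on each keypoint trajectory. The sketch below is illustrative only: the `trajectory` format (one (x, y) pixel coordinate per frame) and the pixel threshold are assumptions, not details from the paper or from MMPose.

```python
import math

def find_jumps(trajectory, max_px_per_frame=50.0):
    """Return frame indices where a keypoint moves farther than the
    threshold between consecutive frames (a likely tracking glitch)."""
    jumps = []
    for i in range(1, len(trajectory)):
        (x0, y0), (x1, y1) = trajectory[i - 1], trajectory[i]
        if math.hypot(x1 - x0, y1 - y0) > max_px_per_frame:
            jumps.append(i)
    return jumps

# Example: a smooth hand track with one spurious detection at frame 3,
# which produces a jump out (frame 3) and a jump back (frame 4).
track = [(100, 200), (102, 201), (104, 203), (400, 50), (106, 205)]
print(find_jumps(track))  # [3, 4]
```

Reporting the fraction of flagged frames per keypoint would directly answer whether the hand estimates are stable under occlusion.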

Minor comments:
- In Table 1, it is not clear what the union of Saraga 1.0 and 2.0 looks like in terms of the total number of ragas or talas.
- The explanation for excluding instrumental Carnatic music could be shortened to a single sentence. It is worth mentioning that the current dataset does not cover instrumental Carnatic music, but the authors need not justify the vocal-centered focus with extensive paragraphs. The comparison with Slakh also seems unnatural, as Slakh uses synthesized audio.
- It is not clear how many subjects participated in the listening test in Table 3. The standard deviation or a confidence interval should also be provided.
- Saraga 1.0 also includes Hindustani music, but it is not clear how this will be handled in dataset access.
- There are duplicated sentences in line 224.
- Line 353: "for which the p-value is less than our significance level of 0.00001 are excluded". Should this be "greater than the significance level"?
- Some words in the References, such as Carnatic and Turkish, need to be capitalized in the camera-ready version.
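The reviewer's point about line 353 is that the excluded correlations should be those whose p-value is *greater* than the significance level, i.e. only pairs with p below the threshold are kept. A minimal sketch of that filtering step follows; the (r, p) values are illustrative, not results from the paper.

```python
ALPHA = 0.00001  # the paper's significance level

def significant(results, alpha=ALPHA):
    """Keep (name, r) for correlations whose p-value is below alpha;
    pairs with p >= alpha are the ones excluded."""
    return [(name, r) for name, r, p in results if p < alpha]

# Hypothetical (name, r, p) tuples for demonstration only.
observations = [
    ("singer_A", 0.36, 1e-7),  # p < alpha: kept
    ("singer_B", 0.11, 0.04),  # p > alpha: excluded
]
print(significant(observations))  # [('singer_A', 0.36)]
```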


Review 3

The manuscript presents an extension to the Saraga Carnatic dataset. Along with increasing the coverage of ragas and concerts, the new version introduces video recordings of the concerts for multimodal MIR research. The paper is well organised and well written, and certain tasks are also benchmarked with the extended dataset.


Author description of changes:
  • We have updated the naming convention of the dataset for clarity and applied it consistently throughout the paper.
  • We have improved the discussion of the video gathering to address the Reviewers' comments and questions.
  • We have included separation results for Spleeter fine-tuned on Saraga and on Saraga Audiovisual (the new dataset), and improved the discussion of the relevance of these benchmarking experiments. We do not report fine-tuning on the combination of Saraga and Saraga Audiovisual, as our intention is to study the impact of each data collection process separately.
  • We have included missing details in the perceptual test results.

Let us further comment on the questions about the video footage:
  • The video gesture experiments are performed only on voice-only excerpts called alapana. These sections are not metered and contain no rhythmic accompaniment. The demo video is used only to show excerpts from the dataset and is unrelated to the videos used for the gesture experiments. Figure 2 in the paper shows the vocalist singing a composition in a rhythmic meter; it is illustrative of the video data in general and not indicative of the alapana sections on which the gesture experiments are performed. Figure 3 illustrates the vocalist singing an alapana.
  • The video recordings were made at concert venues in the artists' usual stage setting. In Carnatic music, the vocalist receives the most prominence. Keeping in mind that the microphone placed directly in front of the vocalist visually occludes the mouth and the gestures made, several pose estimation models were tested before selecting the MMPose 2D top-down model. This model captures occlusion-prone regions such as the singer's mouth and hands with a good level of accuracy.