Computational Analysis of Yaredawi YeZema Silt in Ethiopian Orthodox Tewahedo Church Chants
Mequanent Argaw Muluneh (Academia Sinica, National Chengchi University, Debre Markos University)*, Yan-Tsung Peng (National Chengchi University), Li Su (Academia Sinica)
Keywords: Applications -> music heritage and sustainability; Knowledge-driven approaches to MIR -> computational ethnomusicology; Knowledge-driven approaches to MIR -> computational music theory and musicology; Knowledge-driven approaches to MIR -> machine learning/artificial intelligence for music; MIR tasks -> automatic classification
Despite its musicological, cultural, and religious significance, the Ethiopian Orthodox Tewahedo Church (EOTC) chant is relatively underrepresented in music research. Historical records, including manuscripts, research papers, and oral traditions, confirm Saint Yared's establishment of three canonical EOTC chanting modes during the 6th century. This paper investigates EOTC chants using music information retrieval (MIR) techniques. Among the research questions concerning the analysis and understanding of EOTC chants, Yaredawi YeZema Silt, namely the mode of chanting adhering to Saint Yared's standards, is of primary importance. We therefore consider the task of Yaredawi YeZema Silt classification in EOTC chants by introducing a new dataset and presenting a series of classification experiments for this task. Results show that using the distribution of stabilized pitch contours as the feature representation in a simple neural-network-based classifier is an effective solution. The musicological implications and insights of these results are further discussed through a comparative study with the previous ethnomusicology literature on EOTC chants. By making this dataset publicly accessible, we aim to promote future exploration and analysis of EOTC chants and to highlight potential directions for further research, thereby fostering a deeper understanding and preservation of this unique spiritual and cultural heritage.
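As background for readers unfamiliar with this feature representation, the following is a minimal sketch of a pitch-distribution pipeline of the kind the abstract describes: a pitch contour is extracted, converted to cents, and summarized as a histogram that feeds a small classifier. It assumes librosa's pyin pitch tracker, and the bin resolution and network size are illustrative choices, not the authors' implementation.

```python
import librosa
import numpy as np
import torch.nn as nn

def pitch_distribution(path, n_bins=120):
    """Summarize a recording as a histogram of its voiced pitch values (in cents)."""
    y, sr = librosa.load(path, sr=16000)
    f0, voiced, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                                 fmax=librosa.note_to_hz("C6"), sr=sr)
    cents = 1200 * np.log2(f0[voiced] / librosa.note_to_hz("C2"))
    hist, _ = np.histogram(cents, bins=n_bins, range=(0, 4800), density=True)
    return hist.astype(np.float32)

# A simple neural-network classifier over the distribution; three outputs for
# the three chanting modes (Araray, Ezil, Ge'ez).
classifier = nn.Sequential(nn.Linear(120, 64), nn.ReLU(), nn.Linear(64, 3))
```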
Reviews
After the review process and the discussion phase, all reviewers and I unanimously agree that this paper should be accepted for ISMIR 2024. However, the paper still needs revision before the camera-ready version. I strongly suggest that the authors carefully read all the reviews and consider all the suggestions proposed in them, since these will surely improve the current state of the paper. I will also summarize here the reasons for this recommendation and the aspects that need improvement.
All reviewers have agreed on the great value that the new dataset has for the ISMIR community. Firstly, it introduces a tradition not previously studied by this community, coming from one of the most underrepresented continents in ISMIR, Africa; equally important is the quality of the dataset, as described by the authors. Secondly, the paper offers an interesting analysis of this dataset, which demonstrates its validity for MIR studies and provides meaningful insights into its music tradition.
However, the authors should improve the paper in the following aspects before submitting the camera-ready version.
First, the readability of the paper should be improved, since the language is unclear at some points.
Secondly, this paper will be the future reference for this dataset. As such, the music tradition should be described better and more systematically. Of course, it is impossible to cover all dimensions of any music tradition in the few pages of a conference paper. Therefore, the authors should focus on those elements of the tradition that are relevant for understanding the data and its potential for computational analysis. Other aspects, such as language or notation, should be minimized, at least at this stage of the dataset. (If lyrics or scores are added to the dataset in the future, those aspects should then be explained.)
Thirdly, the design of the experiment should be better explained. Since this is not my main field of expertise, I suggest that the authors carefully read the comments by the three reviewers on this aspect and consider them when improving the paper.
Finally, details about reproducibility should be given, regarding access not only to the data but also to the code.
ISMIR Paper Review # 160
The paper attempts to understand Ethiopian Orthodox Tewahedo Church chant using MIR techniques. It is a straightforward classification task that uses hand-built features. The main contribution of this paper is to explore under-represented music from around the world that is heard by millions; the authors should be given major bonus points for doing this and for releasing a dataset openly to the public.
— The paper clearly explains the acoustic characteristics of the chants and the language associated with them. Furthermore, the notation, both letter-based and symbolic, is well explained.
— Overall the paper is well written.
— The dataset used for this work was manually collected. One can argue that there is no prior work in this regard, hence it is difficult to compare against other datasets or works. This emphasizes that this is more of a dataset paper with some experiments reported on top of it.
— The authors extracted hand-crafted features such as pitch features, MFCCs, and chromagrams. This is similar in spirit to Vidwans, Verma, and Rao, "Classifying Cultural Music Using Melodic Features," SPCOM 2020, which should be cited, especially because they used similar ideas of pitch-based and timbre-based features for understanding culture-specific music.
— The features in Section 4.2 should not be called baselines: MFCCs are timbre-based, and the time-averaged chromagram captures a mix of pitch and timbre. These capture global characteristics of timbre, as opposed to capturing local characteristics of pitch-based features through a conv net. I would highly recommend removing the word "baseline" and characterizing them as timbre features, pitch features, etc. Don't worry about baselines: this paper brings a new dataset and a new kind of under-represented music to the MIR community.
— It would be great to report accuracy results from 1-2 more experiments that combine pitch and time-averaged features, Mel-spectrograms, etc., as part of the ablation studies (a rough sketch of such a combined feature vector follows this review). Furthermore, the dataset size in terms of seconds is reasonable, so it would not be unreasonable to extract pretrained features from large models trained on AudioSet and fine-tune them to see where they stand.
— Another good baseline would be a model trained from scratch on small patches of audio, to see how it does (also sketched below). This would not involve any hand-built features and would operate directly on the audio waveform.
— I like the fact that the paper has a decent discussion section that explains various connections between the intricacies of the music and the findings of a GMM-estimated pitch distribution (a minimal GMM sketch is also included below).
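To make the ablation suggestion above concrete, here is a rough sketch of a combined feature vector using librosa; the feature dimensions and pitch range are assumptions for illustration, not the paper's configuration.

```python
import librosa
import numpy as np

def combined_features(y, sr):
    """Concatenate time-averaged timbre features with a coarse pitch histogram."""
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20).mean(axis=1)    # timbre
    chroma = librosa.feature.chroma_stft(y=y, sr=sr).mean(axis=1)      # pitch class + timbre
    f0, voiced, _ = librosa.pyin(y, fmin=65.0, fmax=1000.0, sr=sr)
    hist, _ = np.histogram(np.log2(f0[voiced]), bins=48, density=True) # pitch
    return np.concatenate([mfcc, chroma, hist.astype(np.float32)])
```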
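The suggested from-scratch model could look like the minimal PyTorch sketch below, operating directly on waveform patches; the layer sizes are arbitrary illustrative choices.

```python
import torch
import torch.nn as nn

class RawAudioNet(nn.Module):
    """Small 1-D CNN over raw waveform patches (no hand-built features)."""
    def __init__(self, n_classes=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=64, stride=4), nn.ReLU(), nn.MaxPool1d(4),
            nn.Conv1d(16, 32, kernel_size=32, stride=2), nn.ReLU(), nn.MaxPool1d(4),
            nn.Conv1d(32, 64, kernel_size=16, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(64, n_classes),
        )

    def forward(self, x):  # x: (batch, 1, n_samples)
        return self.net(x)

logits = RawAudioNet()(torch.randn(8, 1, 16000))  # 8 one-second patches at 16 kHz
```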
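For reference, fitting a GMM to a pitch distribution takes a few lines with scikit-learn; the synthetic data below stands in for the voiced pitch values of one chanting mode and is purely illustrative.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic stand-in for voiced pitch values (in cents) clustered around scale degrees.
cents = np.concatenate([rng.normal(mu, 30, 500) for mu in (0, 200, 500, 700, 900)])

gmm = GaussianMixture(n_components=5, random_state=0).fit(cents.reshape(-1, 1))
print(np.sort(gmm.means_.ravel()))  # component means approximate the scale-degree centers
```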
Result: Somewhere between borderline/weak accept and accept. Could have been much much better.
This paper provides a careful, detailed and systematic approach to an underrepresented style of music in the MIR community. The introduction of the music and presentation of the research problem at hand is confident and well executed, and readers who are not familiar with this data (like me) should find sufficient explorative potential to navigate this paper.
The authors generally take good care to explain the terminology and methodology in detail, which is a definite strong point of this paper. Occasionally, a lack of depth becomes evident, for example at the end of Section 2, where the "vital insights" aren't described further. Unfortunately, the care and consistency fall heavily short in Section 4 (see more comments on this below). In its current version, Section 4 is unacceptable, pushing the review of this paper towards a weak reject. Due to the importance of the research direction, the general presentation, the clear music-theoretical analysis, and the provision of this dataset, my evaluation is a weak accept; but the authors must reconsider the statements in the evaluation part of Section 4.
Detailed comments
In the supplied material, it is observable that the singer is louder in the beginning than towards the end. In "Ge_ez_chanting_Mode_for_a_Wednesday_Prayer.mp3" there is a sudden rise in amplitude at the end. How do you explain that? Could that be the beginning of a new chant, meaning that something happened during the cutting of the audio?
Figure 2 is difficult to interpret. Where exactly are the first two rows of notation patterns to be found in the sentences? I can only find very few correlations. Can you elaborate on that in the text? It is unclear what exactly the red boxes mean and where the "recurring consistent melodic patterns" appear, as stated at the end of Section 2.3. If the latter are highlighted with the arrows, then why do they appear first as lyrics and then as music notation? You mention that in the caption, but it is rather complicated.
Figure 3: Needs units on the axes. Y = number of files, X = seconds.
Figure 4: Suggestion: the blue GMM estimations could also be superimposed onto the red and green pitch distributions, thereby gaining some vertical space that could be used for a larger depiction of the superimpositions.
Table 1: Misspelling: Array -> Araray
Table 2: Highlighting the most interesting values (e.g., within-dataset maxima, cross-dataset maxima) is recommended. Do you have an explanation for why pitch contour stabilization (morphetic and masking) generally performs much worse than no stabilization? The only occasional improvements are seen in the cross-data results.
161: "except for color coding": you mean to say that with color coding it's easier to identify a mode? Footnote 7: what do you mean by "the reverse is not [...] true", that Ezil YeZema Silt is not always red? Is it a concensus that these colors are/should be used for the respective modes? This is not established in the text 180, 190 & 444: the spelling of Kidase/Qidase-bet varies. 222: "way larger" is a colloquialism 240: could you have chosen CREPE instead of pYIN? the choices for the pitch detection algorithm and classifier could have been elaborated. 248: what does "sliding" refer to? 253-254: what do you mean by "during the performance" and "along the whole recording"? Does the pitch drift occur the same for every audio? Did you analyze all 369 files for that? 267: "less number of parameters" than what? 270: "as regard it as" -> "regarding it as..."? 277: Why does each training run stop at 50 epochs? What is the reasoning behind this? Do the models converge? 297: For the cross-dataset evaluation, did you not use a validation set? And did you test on the full dataset provided by [4]? 304: you mention "15 seconds", but in the results (Table 2) you show "20 seconds". Apart from that, how much silence do these durations contain? 312: 23 percentage points is NOT a "relative small performance drop"! 313-314: from my interpretation, the combination of no calibration and no stabilization leads great results across the board. This configuration seems easiest to implement and process, and thus would be a good trade-off between maximum accuracy and efficiency. 318: from the numbers, it is not evident how you make the claim that stabilization helps "for most of the cases". The opposite is true: for within-dataset, ALL values with stabilization are worse. For cross-data, only 5 out of 24 results are better with stabilization! 320: with -> within 323: "...has better classifcation accuracy than the masking", because this is not true for no stabilization. 336-340: It's not understandable how the described configuration was chosen for subsequent analysis! The noted performance gap between within-data and cross-data is smaller because this configuration heavily reduces the within-data performance compared to "no calibration" and "no stabilization". The only case of improvement is in 10-sec cross-data, but you have already established around l.326 that YeZema Silt should be interpreted as a long-term song-level concept, even highlighting that YeZema Silt can be signified to some extent to a 20-sec excerpt as well. Looking at the 20-sec results for your selected configuration, the results are worse again. 377-379: Is this not an expected observation, given the explanation of the three chanting modes in Section 2.2? 380: "Fig. 6" -> do you mean 4? 400: remove second "g_2"
Minor comment: it would be nice to always keep the same order of mention: A, E, G for Araray, Ezil, and Ge'ez. You already have that in Table 1 and Table 3, and it could be reflected in the explanation of each mode (Section 2.2), in Figure 4, in Table 4, and also in ll. 34-35, where you first introduce these modes.
Author Response
Our meta-reviewer and all three reviewers gave us detailed comments and professional suggestions that were invaluable for improving the quality of our paper. We acknowledge them by addressing their comments, questions, and suggestions to the best of our ability. Our revision is summarized below:
- We have modified our title so that it better fits the content.
- We have made significant modifications to Sections 2, 4, and 5, including their corresponding tables and figures, which had raised concerns. In this regard, we removed a figure (the previous Figure 1), modified other figures (e.g., the previous Figures 2 and 3, now Figures 1 and 2), and reduced subsections (e.g., the previous Sections 2.1 and 2.3) to a single paragraph or part of a paragraph. The structure of Sections 4.1 and 4.2 has also been rearranged.
- We have explained in more detail the reason for using the masked pitch contour in the analysis, although its performance on chanting-mode classification is suboptimal (Section 5).
- In connection with these modifications, some footnotes were revised, others were incorporated into a background section, and the rest were deleted to keep the focus. Questions raised about removed content will therefore no longer be issues for future readers.
- We have addressed spelling errors and inconsistencies.
- We have added a link to access our proposed dataset.