Lyrics Transcription for Humans: A Readability-Aware Benchmark
Ondřej Cífka (AudioShake)*, Hendrik Schreiber (AudioShake), Luke Miner (AudioShake), Fabian-Robert Stöter (AudioShake)
Keywords: Evaluation, datasets, and reproducibility; Evaluation, datasets, and reproducibility -> evaluation metrics; Evaluation, datasets, and reproducibility -> novel datasets and use cases; MIR fundamentals and methodology -> lyrics and other textual data
Writing down lyrics for human consumption involves not only accurately capturing word sequences, but also incorporating punctuation and formatting for clarity and to convey contextual information. This includes song structure, emotional emphasis, and contrast between lead and background vocals. While automatic lyrics transcription (ALT) systems have advanced beyond producing unstructured strings of words and are able to draw on wider context, ALT benchmarks have not kept pace and continue to focus exclusively on words. To address this gap, we introduce Jam-ALT, a comprehensive lyrics transcription benchmark. The benchmark features a complete revision of the JamendoLyrics dataset, in adherence to industry standards for lyrics transcription and formatting, along with evaluation metrics designed to capture and assess the lyric-specific nuances, laying the foundation for improving the readability of lyrics. We apply the benchmark to recent transcription systems and present additional error analysis, as well as an experimental comparison with a classical music dataset.
Reviews
The newly annotated dataset is considered a welcome contribution, and the paper was praised for its writing and relevance. There are concerns regarding the "in-house" system, which was not properly presented in the paper, and open questions about the proposed metrics. There is also an interesting comment that the two parts of the paper (dataset and benchmarking) appear disconnected. Based on all the above, I recommend acceptance, noting that a few minor comments can be rectified in the camera-ready revision.
Before going into the details of the main review, I would first like to mention an issue regarding anonymity: I have read a very similar paper on arXiv and as an ISMIR LBD. That is to say, anonymity is not fully preserved.
I personally do not have a conflict of interest with the authors of that paper. I contacted the meta-reviewer regarding this issue, and they said that "Since you do not have a conflict of interest with the authors, you can proceed with your review", while also encouraging me to mention the issue in the review. Therefore, I have decided to mention it here.
In the remainder of this review, I will pretend that I have not read that arXiv paper.
Strengths: The new annotation of the multilingual JamendoLyrics dataset is clearly a contribution and should be credited. The authors also made an attempt to define general rules for readability-aware ALT and the corresponding metrics. Although I do have some concerns regarding them (see the weaknesses part), this should still be considered a contribution.
Weaknesses:
1) It is unclear to what extent (and whether) the proposed evaluation metrics (WER', F_P, F_B, etc.) reflect the human readability of the transcription. This raises two issues that need to be addressed. First, as discussed in Section 5 (particularly L408-418), the annotation of line breaks may be ambiguous. In practice, will evaluating, and even optimizing, an ALT system with such ambiguous metrics really do more good than harm? It would be better to delve deeper into this question, as the novel evaluation metrics are an important part of this paper. Second, the proposed evaluation metrics require the automatic transcription to strictly match the 9 general rules in Section 2. Although these rules sound reasonable, it is unclear whether an alternative set of rules (e.g., always ending a line with a comma or a period) would actually lead to poor formatting from a human perspective. If it would not, it seems odd to penalize such alternative formatting (as the proposed metrics do).
2) It would be better to also report the benchmark results on the original multilingual JamendoLyrics annotation. Different annotations (note that the two versions have an 11.1% WER difference) may lead to completely different evaluation results and conclusions.
3) Including the proposed in-house ALT systems in the comparison while providing almost no discussion of the methods behind them is clearly not a good idea, as it offers basically no insight. Readers may wonder how these in-house ALT systems achieve superior performance and want to learn more, but the authors provide almost no detail.
To sum up, this paper has both strengths and weaknesses. I consider the lack of detail on the in-house ALT systems a big weakness, but I do recognize the contribution of the novel dataset annotation, along with the effort put into devising novel readability-aware ALT metrics. Therefore, I tend to accept this paper.
That being said, I strongly suggest the authors provide more details on the in-house ALT systems in their camera-ready version. That would greatly improve the overall quality of this paper.
I believe the paper is valuable for ISMIR and should be accepted. Next, I list some of the strengths and weaknesses I consider relevant:
Strengths: 1) Relevance. The paper introduces a novel benchmark for automatic lyrics transcription (ALT), emphasizing the inclusion of formatting and punctuation elements. It also proposes new evaluation metrics tailored to lyrics transcription based on industry guidelines, which adds value to the ALT field and addresses a gap in current benchmarks.
2) Well structured. The paper is well-written and organized. The title and abstract reflect the content. As far as I know, it cites and compares related work and introduces the problem well. I would also highlight the visualizations, especially those showing word edit operations, which enhance the understanding of the results. The supplementary material also helped me understand the changes in the lyrics.
3) Comprehensive error analysis. The error analysis and comparison provide insights into the strengths and weaknesses of current ALT systems.
Weaknesses: 1) Lack of details. The paper lacks details about the methods to replicate the results, particularly the in-house lyrics transcription methods used in the comparative analysis. I understand that they may be commercial/closed, but the absence of this information makes it challenging for peers to understand the direction of future research and development. We know there are better approaches, but we do not know what they are.
2) Metric clarification. This is not a big issue, but the description of the new WER metric is somewhat unclear. On line 175 it appears to be just a "case-sensitive WER," yet the text also suggests a fixed pre-processing procedure (see the illustrative sketch after this list). This should be resolved once the implementation is available.
3) There are minor issues in terminology and formatting, such as the inconsistency in the number of token types listed (line 196 states four but lists five) and the use of symbols like "S.^2". Also, P is used both for the punctuation token and for the precision metric in Section 3.2, which might confuse readers.
4) Model evaluation. The paper notes that Whisper sometimes outputs random text, but it doesn’t explore the impact of excluding these outliers on the overall results. Comparing metrics only for the word tokens would further explain their performance and whether these in-house proposals improve mainly on the newly introduced punctuation part or the traditional tokens.
5) Unknown details on variants. The authors' system v1 performed poorly on the SWD dataset, while v2 outperformed the others, but the paper offers no explanation for this improvement, leaving readers without insight into the advancements made.
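For concreteness, the following is a minimal sketch of what a case-sensitive WER over word tokens (the point raised in weakness 2 above) could look like, assuming a simple word tokenizer and a Levenshtein alignment. It is only an illustration of the concept, not the paper's actual WER' implementation; the tokenization and pre-processing choices here are assumptions.

```python
# Illustrative sketch: case-sensitive WER over word tokens.
# NOT the paper's WER' implementation; tokenization is an assumption.
import re

def tokenize(text: str) -> list[str]:
    # Keep case; drop punctuation and line breaks so only word tokens remain.
    return re.findall(r"[\w']+", text)

def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = tokenize(reference), tokenize(hypothesis)
    # Standard dynamic-programming edit distance (substitutions, insertions, deletions).
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1  # case-sensitive comparison
            d[i][j] = min(d[i - 1][j] + 1,       # deletion
                          d[i][j - 1] + 1,       # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Casing differences count as errors here:
print(word_error_rate("Hold me now\nI'm falling", "hold me now I'm falling"))  # 0.2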
Final comment: While I listed more weaknesses than strengths, I believe the paper significantly contributes to the ALT field, providing a benchmark and metrics that address aspects of lyrics transcription that traditional metrics overlook. Despite the lack of information on the in-house models and minor inconsistencies, the overall quality of the work justifies recommending the paper for acceptance.
— SUMMARY — This paper extends the LBD "JAM-ALT: A FORMATTING-AWARE LYRICS TRANSCRIPTION BENCHMARK" presented at ISMIR last year. It includes a revised version of the JamendoLyrics dataset, a common dataset used for evaluating automatic lyrics transcription (ALT), aligned with industry guidelines. Additionally, the authors suggest adapting two well-known metrics to better capture specific aspects of lyrics that are not accounted for in speech transcription. Finally, several ALT models are compared, and their limitations and errors are discussed.
— SCIENTIFIC CONTRIBUTION — The main scientific contribution is the revised version of the JamendoLyrics dataset, which now aligns with industry standards. This will help to more accurately assess the performance of current ALT models. The newly proposed metrics provide better insight into the types of errors models make, enabling the development of improved models. In addition, the authors have benchmarked most of the current models; their analysis is useful for understanding these models' performance and limitations.
— REVISED JAMENDO AND METRICS —
The paper first sets out formatting guidelines for lyrics based on industry standards. It then introduces a revised version of the JamendoLyrics dataset that adheres to these standards. Moreover, the authors propose a modified version of the word error rate (WER) that accounts for casing errors, along with a method for measuring punctuation and line-break errors. This provides a more comprehensive way to quantify ALT model accuracy. The most interesting part is that previous metrics did not take line/section breaks into account, even though they are crucial for conveying rhyme, rhythm, and structure in popular Western music.
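As an illustration of the idea described above, the sketch below computes precision, recall, and F1 for punctuation and line-break tokens over an alignment of the full token sequences. The tokenizer, the token inventory, and the use of difflib's alignment are assumptions for the sake of the example and do not reflect the authors' actual implementation.

```python
# Illustrative sketch: precision/recall/F1 for punctuation and line-break tokens,
# counted over an alignment of reference and hypothesis token sequences.
# NOT the authors' implementation; tokenization and token inventory are assumptions.
import re
from difflib import SequenceMatcher

PUNCT = set(",.?!\"'()-")
BREAK = "<line>"

def tokenize(text: str) -> list[str]:
    tokens = []
    for line in text.splitlines():
        tokens += re.findall(r"[\w']+|[^\w\s]", line)  # words and punctuation marks
        tokens.append(BREAK)                           # explicit line-break token
    return tokens

def token_f1(reference: str, hypothesis: str, target) -> float:
    ref, hyp = tokenize(reference), tokenize(hypothesis)
    # Count target tokens that end up aligned to an identical token (hits).
    hits = 0
    for block in SequenceMatcher(a=ref, b=hyp, autojunk=False).get_matching_blocks():
        hits += sum(1 for t in ref[block.a:block.a + block.size] if target(t))
    n_ref = sum(1 for t in ref if target(t))
    n_hyp = sum(1 for t in hyp if target(t))
    precision = hits / n_hyp if n_hyp else 0.0
    recall = hits / n_ref if n_ref else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

ref = "Hold me now,\nI'm falling down.\n"
hyp = "Hold me now I'm falling down.\n"
print(token_f1(ref, hyp, lambda t: t in PUNCT))   # punctuation F1
print(token_f1(ref, hyp, lambda t: t == BREAK))   # line-break F1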
- BENCHMARK - In this section, the authors evaluate the most common ALT models using their revised version of JamendoLyrics. The section reads more like a survey or benchmark paper, comparing different models rather than highlighting the benefits of the revised dataset and proposed metrics, especially when it feels like the authors are trying to "sell" their in-house model, which is neither explained nor cited. In fact, most of the trends observed when comparing models on WER' are the same as for WER, and all the analysis in Section 4.1 is based on WER. The authors do not compare the performance obtained by the same model on the previous version of JamendoLyrics to assess the need for their version rather than the previous one; for instance, it would have been interesting to spot misleadingly high scores that do not reflect the real structure of a song when line breaks are missing. The comparison between the revised version and the old version is one of the most interesting aspects of Table 1. It shows that the main issue with the previous version lies primarily in the line-break annotations (and hence the section annotations). This is crucial for lyrics because these aspects are linked to rhyme, meter, rhythm, and musical phrasing. I would have expected more analysis in this direction, i.e., comparing performance on both versions of the dataset, rather than such a detailed analysis of the current models. For example, the use of source separation and its impact on performance, while interesting, seems beyond the scope of the paper: it does not change any trends and simply boosts or lowers the metrics by a factor. Instead, the authors should have focused on the dataset and the metrics. This is why I greatly appreciate Section 4.2, which analyses the different models' error types with respect to the proposed metrics. However, the first part of this analysis (Figure 3) could have been done almost independently (except for the 'case' error) of the new dataset and proposed metrics.
The Schubert Winterreise Dataset section accentuates the feeling of disconnection from the previous sections and of a swift transition into a full benchmark/survey paper. It mostly serves to compare the different ALT systems rather than to showcase the benefits of the new metrics, and it makes no reference to the proposed guidelines or dataset.
— FINAL COMMENT — The paper has two faces. The first part presents the authors' revised version of JamendoLyrics and their metrics for capturing common errors that speech transcription metrics miss and that relate to token types important for lyrics. The second part focuses on benchmarking ALT models and is longer than the first. Both are valuable and will provide useful insight to the community, but the second part feels disconnected and not well aligned with the title and abstract of the paper. I expected an analysis that emphasizes the benefits of the revised version and demonstrates how evaluating on the previous version could lead to misunderstandings about the real performance of an ALT system. After reading the paper, I am still unsure whether a model that had good results on the previous JamendoLyrics produces a particular type of mistake that I can now spot with this new version. (It is clear that the line breaks and paragraphs weren't there before, but this is never pointed out in the paper.) On the other hand, the benefit of the new evaluation metrics is clear, since they directly indicate the type of error with respect to each token type, especially line breaks and paragraphs.
I also think the paper misses an opportunity to provide a tool for verifying whether a given set of lyrics follows the guidelines, to assess its quality, as well as an initiative/encouragement to adapt other datasets.
Author response
- We now only evaluate the latest version (v3) of our in-house model, as we agree that it is not warranted to have two versions of it in the paper.
- As we are not able to provide further details about our model, we instead reduced mentions of our model in the text of the paper.
- We have better separated fully open-source models from proprietary ones in our results tables, and highlighted the best open-source results.
- Following the suggestions of R1 and R4, we have added evaluation of the same models on the original JamendoLyrics dataset (Section 4.2, Table 2).
- We removed speculation about and details of Whisper and OWSM from Section 4.1, as they were deemed out of scope.
- We updated the discussion, addressing a limitation pointed out by R1.
- We updated the first paragraph of Section 3 to clarify our goals in designing the metrics in response to R1.
- We now better introduce Section 4.4, presenting results on the SWD dataset.
- We fixed/clarified minor issues pointed out by R3.
- We added the benchmark name (Jam-ALT) and a link to the data and code.
To address other comments:
R3: “Comparing metrics only for the word tokens would further explain their performance and whether these in-house proposals improve mainly on the newly introduced punctuation part or the traditional tokens.”
WER only takes word tokens into account, as mentioned in the paper.
R1: “It is unclear to what extent do (and whether) the proposed evaluation metrics (WER’, F_P, F_B, etc) reflect human’s readability on the transcription. [...]”
Note that our goal was to design metrics that are generally applicable, and not tightly bound to the dataset or specific formatting rules. We now mention this in Sections 3 and 5.
We agree with the point about punishing different formatting choices, and we now mention this more clearly in the discussion.