Abstract:

Methods based on deep learning have emerged as a dominant approach in the cover song identification (CSI) literature in recent years, and among them the ByteCover systems have consistently delivered state-of-the-art performance across major CSI datasets. Despite steady improvements over previous generations, from audio feature dimensionality reduction to short-query identification, the system remains vulnerable to noisy audio and ambiguous melodies when extracting musical information from constant-Q transform (CQT) spectrograms. Although some recent studies suggest that incorporating lyric-related features can enhance the overall performance of CSI systems, this approach typically requires training a separate automatic lyric recognition (ALR) model to extract those features from music recordings. In this work, we introduce X-Cover, the latest CSI system in this line, which incorporates a pre-trained automatic speech recognition (ASR) module, Whisper, to extract lyric-related features and integrate them into the model. Specifically, we jointly fine-tune the ASR block and the previous ByteCover3 system in a parameter-efficient fashion, which substantially reduces the cost of using lyric information compared with training a new ALR model from scratch. In addition, a bag of tricks is applied during training of this new generation, helping X-Cover achieve strong performance across various datasets.


© 2024 International Society for Music Information Retrieval