Abstract:

Piano cover generation aims to create a piano cover from a pop song. Existing approaches mainly employ supervised learning and the training demands strongly-aligned and paired song-to-piano data, which is built by remapping piano notes to song audio. This would, however, result in the loss of piano information and accordingly cause inconsistencies between the original and remapped piano versions. To overcome this limitation, we propose a transfer learning approach that pre-trains our model on piano-only data and fine-tunes it on weakly-aligned paired data constructed without note remapping. During pre-training, to guide the model to learn piano composition concepts instead of merely transcribing audio, we use an existing lead sheet transcription model as the encoder to extract high-level features from the piano recordings. The pre-trained model is then fine-tuned on the paired song-piano data to transfer the learned composition knowledge to the pop song domain. Our evaluation shows that this training strategy enables our model, named PiCoGen2, to attain high-quality results, outperforming baselines on both objective and subjective metrics across five pop genres.

Reviews
Meta Review

Based on the reviewers' comments and the internal discussion, we would like to vote for a weak acceptance. Please see R2's concern regarding the evaluation, and please also incorporate the important missing reference to deliver a complete study.


Review 1

The paper introduces a method for generating piano covers from pop songs using a transfer learning model that employs weakly-aligned data instead of strongly-aligned, note-remapped data. This method retains the musical integrity of the original piano compositions and addresses data inaccuracy issues due to forced synchronization in previous methods. The paper is overall well-written and informative, yet several aspects could be improved:

First, the proposed pairing function for weakly aligning the piano and song addresses the rhythmic distortion caused by note-remapping but relies heavily on the performance of the beat detection algorithm. The paper lacks an investigation into the accuracy and efficiency of this algorithm. Since 50% of the bars were removed due to mapping errors (mentioned in Line 303), this method results in a significant loss of data for training or testing the model. A better justification for choosing this method over others, such as MIDI-to-audio alignment algorithms, is needed.
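To make the point concrete, here is a minimal sketch of the kind of beat-level pairing the review refers to, assuming librosa for beat tracking and a deliberately naive bar-by-bar matching; this is an illustration under those assumptions, not the authors' actual pairing function or data-cleaning pipeline.

    # Illustrative sketch only (not the authors' implementation): pair a song and a
    # piano recording at the bar level using librosa's beat tracker.
    import librosa

    def beat_level_pairs(song_path, piano_path, beats_per_bar=4):
        """Return (song_span, piano_span) time pairs, one per bar, matched by bar index."""
        song_y, song_sr = librosa.load(song_path)
        piano_y, piano_sr = librosa.load(piano_path)
        _, song_beats = librosa.beat.beat_track(y=song_y, sr=song_sr, units="time")
        _, piano_beats = librosa.beat.beat_track(y=piano_y, sr=piano_sr, units="time")
        n_bars = min(len(song_beats), len(piano_beats)) // beats_per_bar
        pairs = []
        for i in range(n_bars):
            lo, hi = i * beats_per_bar, (i + 1) * beats_per_bar
            # Span from the bar's first to its last detected beat in each recording.
            pairs.append(((song_beats[lo], song_beats[hi - 1]),
                          (piano_beats[lo], piano_beats[hi - 1])))
        return pairs

A single misdetected or inserted beat in either recording shifts every subsequent pair, which presumably explains why bars with mapping errors have to be discarded; a MIDI-to-audio alignment (e.g., DTW against synthesized piano audio) would not depend on beat tracking, though it has its own failure modes.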

Second, the use of SheetSage to extract "high-level" musical ideas from song audio raises some concerns. While SheetSage was primarily trained for melody extraction, which aligns with generating accurate melody contours for piano cover generation, it may not effectively handle tempo and accompaniment reconstruction. The SheetSage encoder could therefore introduce unexpected information loss compared to using the MT3 encoder to process the song audio. This issue can be observed in the provided listening samples, where chords and tempo are reconstructed less effectively than the melody.

Finally, considering that melody matching is a crucial criterion for evaluating the success of a piano cover system, I recommend re-evaluating the melody chroma accuracy (MCA) after some post-processing that aligns the generation with the target. This would provide a more meaningful comparison with the Pop2Piano model. Given that even the target (human composition) obtains a low MCA score, it is difficult to judge the performance of the proposed model without this adjustment.
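As one concrete reading of this suggestion, the sketch below computes chroma-level melody agreement after a DTW alignment, so that timing offsets alone do not depress the score. It is only an approximation of MCA under stated assumptions (librosa chroma features, cosine-distance DTW), not the paper's evaluation code.

    # Sketch under assumptions: chroma agreement after DTW alignment, so timing
    # differences between generation and target do not dominate the score.
    import librosa

    def aligned_chroma_accuracy(gen_path, ref_path, hop_length=512):
        gen_y, sr = librosa.load(gen_path)
        ref_y, _ = librosa.load(ref_path, sr=sr)
        gen_c = librosa.feature.chroma_cqt(y=gen_y, sr=sr, hop_length=hop_length)
        ref_c = librosa.feature.chroma_cqt(y=ref_y, sr=sr, hop_length=hop_length)
        # Warping path wp is a list of (gen_frame, ref_frame) index pairs.
        _, wp = librosa.sequence.dtw(X=gen_c, Y=ref_c, metric="cosine")
        gen_idx, ref_idx = wp[:, 0], wp[:, 1]
        # Fraction of path steps where the dominant pitch class agrees.
        matches = gen_c.argmax(axis=0)[gen_idx] == ref_c.argmax(axis=0)[ref_idx]
        return float(matches.mean())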


Review 2

Good task direction! Not relying on perfect alignment is crucial for this task. The methodology is clear and well explained. The final fine-tuning step to avoid errors from beat detection is intelligent.

My first comment is that I would remove "transfer learning approach and" from the title. Although the paper contributes both, the latter is, in my view, the stronger contribution.

However, my criticisms are mainly about the task evaluation. The subjective metrics are significantly better than the objective ones, which is noteworthy. Assuming there is no bias in the subjective tests, I believe an analysis of why the objective metrics perform poorly would strengthen the paper. It would also be beneficial to discuss the musical dimensions that the current metrics do not consider, and perhaps even to propose a new metric for the task.

Another point of concern is the significant cherry-picking in the examples provided. I would prefer to see the YouTube video, the Pop2Piano output, and your main approach for each song. Given that the dataset you compiled has fewer than 100 pairs, it would have been fairer to show everything. It is evident that the model does not perform like a human, but examining the cases where the algorithm fails is also interesting.


Review 3

Strengths

The paper "PIANO COVER GENERATION WITH TRANSFER LEARNING APPROACH AND WEAKLY ALIGNED DATA" presents a significant advancement in the field of automatic piano cover generation. One of the notable strengths of this work is its innovative use of transfer learning, which leverages pre-training on piano-only data followed by fine-tuning on weakly-aligned song-to-piano paired data. This approach effectively addresses the common issue of losing piano information during the synchronization process, which has plagued previous models that relied on strongly aligned data.

The methodology is robust, involving the use of a lead sheet transcription model as the encoder during the pre-training phase. This step ensures that the model learns high-level musical concepts rather than merely transcribing audio. By fine-tuning on weakly-aligned data, the model is able to retain the musical quality of the original piano performances, thereby producing more natural and musically coherent outputs. The experiments conducted across five music genres demonstrate the model's versatility and effectiveness, as it outperforms baseline models in both objective and subjective evaluations.

Moreover, the use of a decoder-only Transformer model that interleaves song audio and piano performance sequences enhances the temporal correspondence between the condition and target outputs. This architectural choice, coupled with the innovative training strategy, contributes to the high quality of the generated piano covers.
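For readers unfamiliar with this setup, a minimal sketch of what such interleaving could look like is given below; the token names and segmentation scheme are assumptions made for illustration, not the paper's actual vocabulary.

    # Hypothetical illustration: interleave condition (song) and target (piano)
    # tokens segment by segment, so each piano chunk follows the audio it covers.
    SEG = "<seg>"  # assumed segment-boundary token

    def interleave(song_segments, piano_segments):
        """Both arguments are lists of token lists, one entry per bar/segment."""
        sequence = []
        for cond, target in zip(song_segments, piano_segments):
            sequence += [SEG] + list(cond) + list(target)
        return sequence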

Limitations

Despite the strengths, the paper does have some limitations that should be addressed. One of the primary concerns is the lack of a direct comparison with the model presented in the arXiv paper "AUDIO-TO-SYMBOLIC ARRANGEMENT VIA CROSS-MODAL MUSIC REPRESENTATION LEARNING" (arXiv:2112.15110). This omission is significant, as the arXiv paper addresses a similar problem space using a cross-modal representation learning approach, which also aims to capture major audio information for symbolic music generation. A comparative analysis would provide a clearer benchmark for the performance of the proposed transfer learning model.

Furthermore, while the paper emphasizes the advantages of weakly-aligned data, it does not fully explore the potential limitations and challenges associated with this approach. For instance, the alignment errors between the piano segments and their corresponding song segments, even at a beat level, could introduce inconsistencies that affect the overall quality of the generated covers.


Author description of changes:

We appreciate the reviewers' feedback and have made the following changes in the final version:

  1. Added a comparison with Wang et al.'s work to provide more context and highlight our contributions.
  2. Removed the figure illustrating the alignment algorithm.
  3. Redrew the system overview figure, incorporating more details about the training process.

Regarding the limitations of MCA in reflecting output quality and the inconsistencies introduced by weak alignment, we have included these points in our discussion of future work to emphasize their importance for further research.