MusiConGen: Rhythm and Chord Control for Transformer-Based Text-to-Music Generation

Yun-Han Lan (Taiwan AI Labs)*, Wen-Yi Hsiao (Taiwan AI Labs), Hao-Chung Cheng (National Taiwan University), Yi-Hsuan Yang (National Taiwan University)

Keywords: Generative Tasks; MIR fundamentals and methodology -> music signal processing; Musical features and properties -> harmony, chords and tonality; Musical features and properties -> rhythm, beat, tempo; MIR tasks -> music generation

Abstract:

Existing text-to-music models can produce high-quality audio with great diversity. However, textual prompts alone cannot precisely control temporal musical features, such as the chords and rhythm of the generated music. To address this challenge, we introduce MusiConGen, a temporally conditioned, Transformer-based text-to-music model that builds upon the pretrained MusicGen framework. Our innovation lies in an efficient finetuning mechanism, tailored for consumer-grade GPUs, that integrates automatically extracted chord and rhythm features as the control signal. During inference, the control can be either musical features extracted from a reference audio signal or a user-defined symbolic chord sequence, BPM, and textual prompt. Our evaluation on two datasets, one derived from extracted features and the other from user-created inputs, demonstrates that MusiConGen can generate realistic music that aligns well with the specified temporal control. Sound examples can be found in the supplementary material and on the anonymous demo page: https://musicongen.github.io/musicongen_demo/
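To make the idea of a user-defined chord sequence and BPM serving as a time-aligned control signal more concrete, here is a minimal sketch of how such inputs could be expanded into frame-level features. It is not the authors' implementation. It assumes a 50 Hz frame rate (the EnCodec token rate of MusicGen's 32 kHz models), a binary 12-dimensional chroma template per chord, a constant tempo in 4/4 time, and a beat/downbeat indicator track. All function names, chord-label syntax, and templates below are hypothetical illustrations.

```python
from typing import List, Tuple

# Pitch classes for root names (sharps only, for brevity; "Bb" etc. are not handled).
PITCH_CLASSES = {"C": 0, "C#": 1, "D": 2, "D#": 3, "E": 4, "F": 5,
                 "F#": 6, "G": 7, "G#": 8, "A": 9, "A#": 10, "B": 11}

# Hypothetical chord-quality templates: semitone offsets above the root.
QUALITY_INTERVALS = {"maj": (0, 4, 7), "min": (0, 3, 7), "7": (0, 4, 7, 10)}


def chord_to_chroma(chord: str) -> List[int]:
    """Map a label such as 'A:min' (or 'C' for major) to a binary 12-dim chroma."""
    root, _, quality = chord.partition(":")
    chroma = [0] * 12
    for interval in QUALITY_INTERVALS[quality or "maj"]:
        chroma[(PITCH_CLASSES[root] + interval) % 12] = 1
    return chroma


def chords_to_frames(chords: List[Tuple[str, float]], bpm: float,
                     duration_s: float, frame_rate: int = 50) -> List[List[int]]:
    """Expand (chord, length_in_beats) pairs into one chroma vector per frame."""
    n_frames = int(round(duration_s * frame_rate))
    frames_per_beat = frame_rate * 60.0 / bpm
    frames: List[List[int]] = []
    beat_pos = 0.0
    for chord, n_beats in chords:
        # Cumulative boundaries avoid drift from per-chord rounding.
        end = min(int(round((beat_pos + n_beats) * frames_per_beat)), n_frames)
        chroma = chord_to_chroma(chord)
        while len(frames) < end:
            frames.append(list(chroma))
        beat_pos += n_beats
    # Hold the last chord (or silence) if the progression is shorter than the clip.
    last = frames[-1] if frames else [0] * 12
    frames.extend(list(last) for _ in range(n_frames - len(frames)))
    return frames


def beat_frames(bpm: float, duration_s: float, beats_per_bar: int = 4,
                frame_rate: int = 50) -> List[int]:
    """Per-frame rhythm labels: 2 = downbeat, 1 = other beat, 0 = no beat."""
    n_frames = int(round(duration_s * frame_rate))
    frames_per_beat = frame_rate * 60.0 / bpm
    labels = [0] * n_frames
    k = 0
    while (idx := int(round(k * frames_per_beat))) < n_frames:
        labels[idx] = 2 if k % beats_per_bar == 0 else 1
        k += 1
    return labels


if __name__ == "__main__":
    # Two bars at 120 BPM: C major for 4 beats, then A minor for 4 beats.
    chroma = chords_to_frames([("C", 4), ("A:min", 4)], bpm=120, duration_s=4)
    beats = beat_frames(bpm=120, duration_s=4)
    print(len(chroma), chroma[0], chroma[150])
    print(len(beats), beats.count(2), beats.count(1))
```

Using cumulative beat positions to compute chord boundaries keeps the chord grid locked to the beat grid even when a beat does not span an integer number of frames, which matters if both tracks are meant to condition the same token sequence.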
