Audio Prompt Adapter: Unleashing Music Editing Abilities for Text-to-Music with Lightweight Finetuning

Fang Duo Tsai (National Taiwan University)*, Shih-Lun Wu (Carnegie Mellon University), Haven Kim (University of California San Diego), Bo-Yu Chen (National Taiwan University, Rhythm Culture Corporation), Hao-Chung Cheng (National Taiwan University), Yi-Hsuan Yang (National Taiwan University)

Keywords: Creativity -> computational creativity; Generative Tasks -> music and audio synthesis; Knowledge-driven approaches to MIR -> machine learning/artificial intelligence for music; MIR and machine learning for musical acoustics; MIR tasks -> music generation; MIR tasks -> music synthesis and transformation

Abstract:

Recent text-to-music models have enabled users to generate realistic music audio with a simple command. However, editing music audio remains challenging due to conflicting desiderata: performing fine-grained alterations on the audio while keeping the user interface simple. To address this challenge, we propose Audio Prompt Adapter (or AP Adapter), a lightweight addition to pretrained text-to-music models. We utilize AudioMAE to extract features from the input audio, and construct attention-based adapters to feed these features into the internal layers of AudioLDM2, a diffusion-based text-to-music model. With only 22M trainable parameters, AP Adapter lets users control both global (e.g., style and timbre) and local (e.g., melody) aspects of music, using the original audio and a short text as inputs. Through objective and subjective studies, we evaluate AP Adapter on three tasks: timbre transfer, style transfer, and accompaniment generation. Additionally, we demonstrate its effectiveness on out-of-domain audio containing instruments unseen during training.
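The abstract does not spell out how the attention-based adapters combine audio and text conditioning. As a rough illustration only, the sketch below assumes one common adapter design: a second cross-attention branch over the audio features is added to the existing text cross-attention, weighted by a scalar. The function names, the `audio_scale` parameter, and the omission of learned key/value projections (which would hold the trainable parameters in a real adapter) are all assumptions made for this example, not details taken from the paper.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over plain lists of vectors."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return out


def adapted_cross_attention(latent_queries, text_feats, audio_feats, audio_scale=1.0):
    """Hypothetical adapter: text cross-attention plus a scaled audio branch.

    Assumption: the two branches are summed. Learned projections are omitted
    for brevity, so features act directly as keys and values.
    """
    text_out = attention(latent_queries, text_feats, text_feats)
    audio_out = attention(latent_queries, audio_feats, audio_feats)
    return [
        [t + audio_scale * a for t, a in zip(t_row, a_row)]
        for t_row, a_row in zip(text_out, audio_out)
    ]
```

In this sketch, setting `audio_scale` to 0 recovers the text-only behavior, while larger values let the input audio dominate; this is one plausible way a single knob could trade off fidelity to the original audio against adherence to the text prompt.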
