The Surprising Effect of Song-Level Demixing for Music Foundation Model Pretraining
Junyan Jiang (New York University Shanghai)*, Akira Maezawa (Yamaha Corporation), Gus Xia (New York University Shanghai)
This paper will be presented virtually in the 12:15 PM - 12:45 PM PST and 11:15 PM - 11:45 PM PST sessions.
Music foundation models play an increasingly important role in many downstream music understanding tasks. Previous foundation models typically adopt auto-regressive or masked language modeling as the training objective, which yields limited performance on some downstream tasks such as source separation. In this extended abstract, we propose a new training objective based on self-supervised demixing of two randomly mixed music pieces. We show that with this single objective, the model learns a strong general-purpose representation and also performs well on source separation tasks when Parameter-Efficient Fine-Tuning (PEFT) is applied.
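The pretext task described above can be sketched as follows. This is a minimal, hypothetical illustration (not the authors' implementation): two songs are gain-scaled and summed into a mixture, and a model trained on this task would be asked to recover both sources from the mixture. The gain range and the permutation-invariant L2 loss are assumptions for illustration; the function and variable names are invented.

```python
import numpy as np

def make_demixing_pair(song_a, song_b, rng):
    """Randomly gain-scale two songs and sum them into one mixture.

    Sketch of the song-level demixing pretext task: the model would
    receive `mixture` as input and be trained to recover both sources.
    The gain range (0.5, 1.0) is an illustrative assumption.
    """
    gain_a, gain_b = rng.uniform(0.5, 1.0, size=2)
    src_a, src_b = gain_a * song_a, gain_b * song_b
    return src_a + src_b, np.stack([src_a, src_b])

def demixing_loss(pred, target):
    """Permutation-invariant L2 loss over the two estimated sources.

    Since the two mixed songs have no canonical order, the loss takes
    the better of the direct and swapped source assignments.
    """
    direct = np.mean((pred - target) ** 2)
    swapped = np.mean((pred - target[::-1]) ** 2)
    return min(direct, swapped)

rng = np.random.default_rng(0)
song_a = rng.standard_normal(16000)  # 1 s of synthetic mono audio at 16 kHz
song_b = rng.standard_normal(16000)
mixture, sources = make_demixing_pair(song_a, song_b, rng)

# A perfect demixer incurs zero loss, in either source order.
print(demixing_loss(sources, sources))        # 0.0
print(demixing_loss(sources[::-1], sources))  # 0.0
```

The permutation-invariant loss is one common choice for separation objectives with interchangeable outputs; a real training setup would apply it to model predictions rather than the ground-truth sources used here for checking.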