Abstract:

Recent years have seen many audio-domain text-to-music generation models that rely on large amounts of text-audio pairs for training. However, symbolic-domain controllable music generation has lagged behind, largely due to the lack of a large-scale symbolic music dataset with extensive metadata and captions. In this paper, we present MetaScore, a new dataset consisting of 963K musical scores paired with rich metadata collected from an online music forum, along with generated pseudo captions. With MetaScore, we explore tag- and text-based controllable symbolic music generation. Both subjective and objective evaluations showcase the potential of our dataset for tag- and text-conditioned music generation.