SpecMaskGIT: Masked Generative Modelling of Audio Spectrogram for Efficient Audio Synthesis and Beyond

Marco Comunità (Queen Mary University of London), Zhi Zhong (Sony Group Corporation)*, Akira Takahashi (Sony Group Corporation), Shiqi Yang (Sony), Mengjie Zhao (Sony Group Corporation), Koichi Saito (Sony Group Corporation), Yukara Ikemiya (Sony Research), Takashi Shibuya (Sony AI), Shusuke Takahashi (Sony Group Corporation), Yuki Mitsufuji (Sony AI)

Keywords: Generative Tasks -> interactions; Generative Tasks -> real-time considerations; Generative Tasks -> transformations; MIR tasks -> automatic classification; Musical features and properties -> representations of music; Generative Tasks -> music and audio synthesis

Abstract:

Recent advances in generative models that iteratively synthesize audio clips have brought great success to text-to-audio synthesis (TTA), but at the cost of slow synthesis speed and heavy computation. Although there have been attempts to accelerate the iterative procedure, high-quality TTA systems remain inefficient because of the hundreds of iterations required at inference and their large number of model parameters. To address these challenges, we propose SpecMaskGIT, a lightweight, efficient, yet effective TTA model based on masked generative modeling of spectrograms. First, SpecMaskGIT synthesizes a realistic 10 s audio clip in fewer than 16 iterations, an order of magnitude fewer than previous iterative TTA methods. As a discrete model, SpecMaskGIT outperforms larger VQ-Diffusion and auto-regressive models on a TTA benchmark, while running in real time with only 4 CPU cores, or even 30× faster with a GPU. Next, built upon a latent space of Mel-spectrograms, SpecMaskGIT has a wider range of applications (e.g., zero-shot bandwidth extension) than similar methods built on latent wave domains. Moreover, we interpret SpecMaskGIT as a generative extension of previous discriminative audio masked Transformers, and shed light on its potential for audio representation learning. We hope that our work will inspire the exploration of masked audio modeling in more diverse scenarios.
