Abstract:

We propose a model that learns latent representations of the pitch and timbre of each individual instrument source in a mixture of instrument tones. The model is trained as a variational autoencoder with a query-based inference network. Given a mixture, it enables precise source-level attribute editing, e.g., replacing the instrument or the pitch of a single source, by manipulating the corresponding pitch and timbre latents. On synthetic audio clips of chords compiled from the JSB Chorales dataset, our quantitative evaluation protocol demonstrates the model's effectiveness at both pitch-timbre disentanglement of individual sources and source-level attribute manipulation of mixtures.
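To make the latent-swap editing operation concrete, here is a toy NumPy sketch; it is not the paper's architecture, and all dimensions, weights, and function names are hypothetical. Untrained random matrices stand in for the learned encoder and decoder, so the sketch illustrates only the mechanics of splitting a VAE latent into pitch and timbre parts and recombining them across sources.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: per-source feature size and latent sizes.
D_IN, D_PITCH, D_TIMBRE = 16, 4, 4

# Random (untrained) weights standing in for the learned networks.
W_enc = rng.normal(size=(D_IN, 2 * (D_PITCH + D_TIMBRE)))
W_dec = rng.normal(size=(D_PITCH + D_TIMBRE, D_IN))

def encode(x):
    """VAE-style encoder: map a source's features to (mu, logvar),
    sample z via the reparameterization trick, and split z into
    pitch and timbre parts."""
    h = x @ W_enc
    mu, logvar = np.split(h, 2)
    z = mu + np.exp(0.5 * logvar) * rng.normal(size=mu.shape)
    return z[:D_PITCH], z[D_PITCH:]

def decode(z_pitch, z_timbre):
    """Reconstruct source features from concatenated latents."""
    return np.concatenate([z_pitch, z_timbre]) @ W_dec

# Two sources inferred from a mixture (placeholder features here).
src_a = rng.normal(size=D_IN)
src_b = rng.normal(size=D_IN)
pitch_a, timbre_a = encode(src_a)
pitch_b, timbre_b = encode(src_b)

# Instrument replacement: keep source A's pitch, swap in B's timbre.
edited = decode(pitch_a, timbre_b)
```

In a trained model, `edited` would sound like source A's note played with source B's instrument; here it only demonstrates the data flow of the swap.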