Abstract:

We propose a model that learns latent representations of the pitch and timbre of each individual instrument source in a mixture of instrument tones. The model is trained as a variational autoencoder with a query-based inference network. Given a mixture, it enables precise source-level attribute editing, e.g., replacing the instrument or the pitch of a single source, by manipulating the corresponding pitch and timbre latents. On synthetic audio clips of chords compiled from the JSB Chorales dataset, our quantitative evaluation protocol demonstrates the model's effectiveness at both pitch-timbre disentanglement of individual sources and source-level attribute manipulation of mixtures.
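To make the latent-swap editing operation concrete, here is a toy NumPy sketch; it is not the paper's architecture, and all dimensions, weights, and function names are hypothetical. Untrained random matrices stand in for the learned encoder and decoder, so the sketch illustrates only the mechanics of splitting a VAE latent into pitch and timbre parts and recombining them across sources.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: per-source feature size and latent sizes.
D_IN, D_PITCH, D_TIMBRE = 16, 4, 4

# Random (untrained) weights standing in for the learned networks.
W_enc = rng.normal(size=(D_IN, 2 * (D_PITCH + D_TIMBRE)))
W_dec = rng.normal(size=(D_PITCH + D_TIMBRE, D_IN))

def encode(x):
    """VAE-style encoder: map a source's features to (mu, logvar),
    sample z via the reparameterization trick, and split z into
    pitch and timbre parts."""
    h = x @ W_enc
    mu, logvar = np.split(h, 2)
    z = mu + np.exp(0.5 * logvar) * rng.normal(size=mu.shape)
    return z[:D_PITCH], z[D_PITCH:]

def decode(z_pitch, z_timbre):
    """Reconstruct source features from concatenated latents."""
    return np.concatenate([z_pitch, z_timbre]) @ W_dec

# Two sources inferred from a mixture (placeholder features here).
src_a = rng.normal(size=D_IN)
src_b = rng.normal(size=D_IN)
pitch_a, timbre_a = encode(src_a)
pitch_b, timbre_b = encode(src_b)

# Instrument replacement: keep source A's pitch, swap in B's timbre.
edited = decode(pitch_a, timbre_b)
```

In a trained model, `edited` would sound like source A's note played with source B's instrument; here it only demonstrates the data flow of the swap.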