A Token-based Neural Synthesizer for Instrument Cloning and Text-to-Instrument

Kyungsu Kim (Seoul National University)*, Junghyun Koo (Sony AI), Sungho Lee (Seoul National University), Haesun Joung (Seoul National University), Kyogu Lee (Seoul National University)

This paper will be presented in person

Abstract:

TokenSynth is a neural synthesizer that uses neural audio codecs and transformers to generate single-instrument musical audio from MIDI information and CLAP embeddings. The model can perform instrument cloning, text-to-instrument synthesis, and text-guided timbre manipulation without fine-tuning. This enables a range of creative sound design applications and intuitive timbre control. Both the timbral similarity to the target audio or text and the synthesis accuracy were evaluated using objective measures.