Abstract:

TokenSynth is a neural synthesizer that uses neural audio codecs and transformers to generate single-instrument musical audio from MIDI information and CLAP embeddings. Without fine-tuning, the model can perform instrument cloning, text-to-instrument synthesis, and text-guided timbre manipulation. These capabilities enable a range of creative sound design applications and intuitive timbre control. Timbral similarity to the target audio or text and synthesis accuracy were both evaluated using objective measures.