Masked Token Modeling for Zero-Shot Anything-to-Drums Conversion
Patrick O'Reilly (Northwestern University)*, Hugo Flores García (Northwestern University), Prem Seetharaman (Adobe), Bryan Pardo (Northwestern University)
This paper will be presented in person
Musicians often represent drum beats through sound gestures such as vocal imitation and finger tapping. While these gestures can convey rich rhythmic information, realizing them as fully-produced drum beats requires time and skill. We propose a system for mapping arbitrary percussive sound gestures to high-fidelity drum recordings. Our system, dubbed TRIA (The Rhythm In Anything), takes as input two audio prompts -- one specifying the desired drum timbre, and one specifying the desired rhythm -- and generates audio satisfying both prompts (i.e., playing the desired rhythm with the desired timbre). TRIA can synthesize realistic drum audio given rhythm prompts from a variety of non-drum sound sources (e.g., beatboxing, environmental sound) in a zero-shot manner, enabling novel creative interactions.