Masked Token Modeling for Zero-Shot Anything-to-Drums Conversion
Patrick O'Reilly (Northwestern University)*, Hugo Flores García (Northwestern University), Prem Seetharaman (Adobe), Bryan Pardo (Northwestern University)
This paper will be presented in person
Musicians often represent drum beats through sound gestures such as vocal imitation and finger tapping. While these gestures can convey rich rhythmic information, realizing them as fully-produced drum beats requires time and skill. We propose a system for mapping arbitrary percussive sound gestures to high-fidelity drum recordings. Our system, dubbed TRIA (The Rhythm In Anything), takes as input two audio prompts -- one specifying the desired drum timbre, and one specifying the desired rhythm -- and generates audio satisfying both prompts (i.e., playing the desired rhythm with the desired timbre). TRIA can synthesize realistic drum audio given rhythm prompts from a variety of non-drum sound sources (e.g., beatboxing, environmental sound) in a zero-shot manner, enabling novel creative interactions.