Abstract:

Vocal concerts in Indian music are invariably accompanied by the performers’ hand gestures, which are believed to convey emotion, musical semantics, and the performers’ individual style. Video recordings from one or more cameras, combined with markerless human pose estimation algorithms, can be employed to capture such movements and thus potentially answer music information retrieval (MIR) queries. However, off-the-shelf algorithms are built mostly for upright human configurations, in contrast to the seated positions of Indian vocal concerts and the upper-body movements involved in performing music. Current state-of-the-art algorithms are based on black-box neural networks, which calls for an investigation of their components. Key decisions involve the choice of one or more cameras, the choice of 2D or 3D features, and relevant parameters such as confidence thresholds in common machine learning methods. In this paper, we quantify the increase in performance with 3 cameras on two MIR tasks. We offer insights for single- and multi-view processing of videos.

© 2024 International Society for Music Information Retrieval