Harnessing the Power of Distributions: Probabilistic Representation Learning on Hypersphere for Multimodal Music Information Retrieval

Takayuki Nakatsuka (National Institute of Advanced Industrial Science and Technology (AIST))*, Masahiro Hamasaki (National Institute of Advanced Industrial Science and Technology (AIST)), Masataka Goto (National Institute of Advanced Industrial Science and Technology (AIST))

Keywords: MIR fundamentals and methodology -> multimodality, MIR and machine learning for musical acoustics

Abstract:

Probabilistic representation learning provides intricate and diverse representations of music content by characterizing the latent features of each content item as a probability distribution in a given space. However, typical Music Information Retrieval (MIR) methods based on representation learning represent each content item by a single feature vector, thereby losing some detail about its distributional properties. In this study, we propose a probabilistic representation learning method for multimodal MIR based on contrastive learning and optimal transport. Our method trains encoders that map each content item to a hypersphere so that the probability distributions of a positive pair of content items are close to each other, while those of an irrelevant pair are far apart. To this end, we design novel loss functions that combine probabilistic contrastive learning with spherical sliced-Wasserstein distances. We demonstrate our method's effectiveness on benchmark datasets, as well as its suitability for multimodal MIR, through both a quantitative evaluation and a qualitative analysis.
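
The idea of pulling the distributions of a positive pair together on a hypersphere while pushing irrelevant pairs apart can be illustrated with a small NumPy sketch. This is not the loss proposed in the paper: it assumes each item's distribution is given as a set of samples projected onto the unit sphere, it swaps in the ordinary (Euclidean) sliced-Wasserstein distance computed in the ambient space as a simple stand-in for the spherical sliced-Wasserstein distance, and it uses a generic InfoNCE-style objective. All function names, the two hypothetical modalities `mod_a` and `mod_b`, the noise scale, and the temperature are illustrative assumptions.

```python
import numpy as np

def sample_on_sphere(mean_dir, noise_scale, n, rng):
    """Draw n noisy points around mean_dir and project them onto the unit hypersphere."""
    pts = mean_dir + noise_scale * rng.standard_normal((n, mean_dir.shape[0]))
    return pts / np.linalg.norm(pts, axis=1, keepdims=True)

def sliced_wasserstein(x, y, n_proj, rng):
    """Monte Carlo estimate of the Euclidean sliced 2-Wasserstein distance.

    Stand-in for the spherical variant named in the abstract (assumption).
    x and y must contain the same number of samples.
    """
    dirs = rng.standard_normal((n_proj, x.shape[1]))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    # In 1-D, optimal transport between equal-size samples matches sorted values.
    px = np.sort(x @ dirs.T, axis=0)
    py = np.sort(y @ dirs.T, axis=0)
    return np.sqrt(np.mean((px - py) ** 2))

def contrastive_loss(dist, temperature=0.1):
    """InfoNCE-style loss on a distance matrix whose diagonal holds the positive pairs."""
    logits = -dist / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

rng = np.random.default_rng(0)
dim, n_items, n_samples = 8, 3, 256
means = np.eye(dim)[:n_items]  # hypothetical per-item mean directions

# Two hypothetical modalities of the same items share a mean direction.
mod_a = [sample_on_sphere(m, 0.1, n_samples, rng) for m in means]
mod_b = [sample_on_sphere(m, 0.1, n_samples, rng) for m in means]

# Pairwise distances between item distributions across the two modalities.
dist = np.array([[sliced_wasserstein(a, b, 64, rng) for b in mod_b] for a in mod_a])
loss = contrastive_loss(dist)
```

In this toy setup the diagonal of `dist` (matching items across modalities) is smaller than the off-diagonal entries, so the loss falls below the chance level of `log(n_items)`. In an actual training loop the samples would come from learned encoders and the distance would need to be differentiable, which this sketch does not attempt.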
