Multimodal Structured Extraction for Self-Querying Music Video Retrieval and Playlist Generation

Kevin Dela Rosa (Aviary Labs)*


Abstract:

In this study, we present early results from a structured extraction framework for music videos, designed to extract key metadata and descriptions such as genre, mood, video style, and summaries of the overall musical, lyrical, and visual narrative content. Leveraging video language models (VLMs) and zero-shot prompting techniques, the system supports three key applications: entity discovery and browsing, multimodal self-querying retrieval, and playlist generation. The multimodal self-querying retrieval setup combines structured metadata filtering (e.g., video style, musical genre, emotion, visual elements) with lexical and semantic search, allowing users to query music videos along multiple facets. The structured extraction also powers entity discovery, enabling exploration of videos across the dataset based on extracted metadata. We provide qualitative examples of structured information extraction over an initial dataset of more than 60K music videos to showcase the potential for search and video playlist generation.
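
To make the self-querying retrieval setup concrete, the sketch below (in Python) illustrates how extracted metadata filters could be combined with a free-text query over the extracted summaries. The field names, helper function, and the simple lexical overlap score are illustrative assumptions, not the system's actual implementation; in practice the ranking stage would be semantic or hybrid search.

from dataclasses import dataclass

@dataclass
class MusicVideo:
    # Fields produced by the VLM structured extraction step (names are illustrative).
    title: str
    genre: str
    mood: str
    video_style: str
    summary: str  # combined musical / lyrical / visual narrative summary

def self_query(videos, filters, query_text, top_k=10):
    """Apply structured metadata filters, then rank the survivors by query overlap.
    A production system would replace the overlap score with semantic search."""
    # Structured filtering stage: keep videos whose metadata matches every filter.
    candidates = [
        v for v in videos
        if all(getattr(v, key).lower() == value.lower() for key, value in filters.items())
    ]
    # Ranking stage: score by query-term overlap with the extracted summary.
    terms = set(query_text.lower().split())
    scored = sorted(
        candidates,
        key=lambda v: len(terms & set(v.summary.lower().split())),
        reverse=True,
    )
    return scored[:top_k]

# Example faceted query: upbeat animated pop videos about a road trip.
results = self_query(
    videos=[],  # would hold the extracted MusicVideo records
    filters={"genre": "pop", "mood": "upbeat", "video_style": "animated"},
    query_text="road trip with friends at sunset",
)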