Abstract:

The synchronization of motor responses to rhythmic auditory cues is a fundamental biological phenomenon observed across species. While the importance of temporal alignment varies across contexts, achieving precise temporal synchronization is a prominent goal in musical performance. Musicians often incorporate expressive timing variations, which require precise control over timing and synchronization, particularly in ensemble performance. This is crucial because both deliberate expressive nuances and accidental timing deviations can affect the overall timing of a performance. This raises the question of how musicians adjust their temporal dynamics to achieve synchronization within an ensemble. This paper introduces a novel feedback correction model based on the Kalman Filter, aimed at improving the understanding of interpersonal timing in ensemble music performances. The proposed model performs comparably to other linear correction models in the literature, with the advantages of low computational cost and good performance even in scenarios where the underlying tempo varies.

Reviews
Meta Review

The reviewers all agree that the music synchronization problem is well formulated, and that reconciling music cognition with the Kalman Filter approach is promising. Reviewer #3 emphasized that this framework has a lot of potential in other MIR tasks as well. At the same time, all reviewers pointed out that the evaluation is weak. Specifically, the experiment was conducted on a small dataset covering limited musical scenarios, and the results are presented qualitatively, without model comparison. Nevertheless, all reviewers are generally in favor of this paper, mainly because of its potential impact on the ISMIR community.


Review 1

The paper is well written and clear. I'm not an expert on how to analyze temporal alignment and synchronization between musicians in an ensemble, but it was not difficult to follow the technical details, thanks in part to the clarity of the exposition. The Kalman Filter is ubiquitous, and here it proves effective also in studying how precise temporal synchronization between performers in a musical ensemble is achieved. The phase correction strength seems to be a good indicator of which instruments are the leaders and which are the followers. The results shown on Op. 74 no. 1 by Joseph Haydn are quite impressive to me. As I mentioned, I'm not an expert on this very specific topic, so, unfortunately, I do not know the literature or the state of the art of synchrony analysis. Given that, I read the paper with pleasure and it flows like a familiar topic to me. It would be nice to have a thorough evaluation on other music ensemble examples as well, but I understand that the dataset can be a problem, since this is a very specific ground truth that is not simple to obtain. Moreover, I can imagine implications for other MIR topics: score followers, for example, and also generative models, which could use the phase correction gain to model the "humanity" of the timing of computer-generated music.


Review 2

The paper presents a linear model for ensemble synchronization using a Kalman Filter. It essentially uses a time-dependent version of ADAM, where the bGLS model is replaced by a KF.

The paper evaluates the model using simulations of a normal performance, a tempo change, and a deadpan condition, and shows that the filtered state variables seem consistent with the nature of the pieces.
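To make this kind of recursive estimation concrete, here is a minimal sketch of tracking a phase-correction gain with a scalar Kalman filter. The first-order correction form, the parameter values, and all variable names are illustrative assumptions, not the paper's actual model:

```python
import numpy as np

# Simulate asynchronies from a first-order linear phase-correction model:
# A[n+1] = (1 - alpha) * A[n] + w[n]. The value of alpha and the noise
# level are assumptions chosen for illustration.
rng = np.random.default_rng(1)
alpha_true, sigma = 0.5, 0.01
n_onsets = 1000
A = np.zeros(n_onsets)
for n in range(n_onsets - 1):
    A[n + 1] = (1.0 - alpha_true) * A[n] + sigma * rng.standard_normal()

# Recursively estimate alpha with a scalar Kalman filter.
# State: alpha, modeled as a random walk with small process noise q.
# Observation: y[n] = A[n+1] - A[n] = -alpha * A[n] + w[n],
# so the time-varying observation "matrix" is h[n] = -A[n].
x, p = 0.25, 1.0          # initial estimate and (diffuse) variance
q, r = 1e-6, sigma**2     # process and observation noise variances
est = []
for n in range(n_onsets - 1):
    y, h = A[n + 1] - A[n], -A[n]
    p += q                                # predict
    k = p * h / (h * h * p + r)           # Kalman gain
    x += k * (y - h * x)                  # update with innovation
    p *= (1.0 - k * h)
    est.append(x)
```

Running this, the filtered estimate of alpha drifts from the initial guess toward the value used in the simulation, which is the qualitative behavior the paper's trajectories display.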

The model is really exciting, bridging the gap between music cognition and interactive music systems. It will be immensely useful for automatic accompaniment systems, beat tracking systems, or any system in which timing interaction between human musicians (or machines) is relevant.

One issue is the evaluation, where the authors only provide the trajectories of the estimated parameters to argue qualitatively that the model makes sense. I understand that the results seem to make sense, but I would have liked to see, for example, a comparison with other methods for parameter estimation. For instance, in all cases it takes about 20-30 onsets for the parameters to converge from their initial values. Is this because the process noise for alpha is so small, or because the correction values really do change after the start of the performance?

There are a few questions that I think might be beneficial for the authors to discuss in the paper.

  1. My understanding is that linear models for phase/period synchronization have constraints that model relationships between parameters, such as timekeeper and motor variance, and these constraints necessitate the use of the bGLS algorithm instead of simple regression models. In the proposed method, however, the model is solved using a KF, meaning such hard constraints cannot be handled. Without using more elaborate variants like the Unscented KF, do the filtered estimates "make sense" from a music cognition perspective? In other words, is the proposed method just a convenient state-space model for music ensemble modeling, or is it something that can provide insights to music cognition studies using ADAM? Please discuss some limitations regarding the parameter estimates, if any.
  2. Is smoothing the period/phase correction useful? Fig. 1 suggests that there is excessive smoothing of the parameters, which hides the underlying interaction that is going on. It would have been interesting to see how the correction parameters alpha and beta vary when changing the process noise of the corresponding parameters, or perhaps to consider them independent of each other and use multiple takes of the same piece to identify the parameters.

A comment

  • In the evaluation, all of the smoothed estimates start at 0.25, which I presume is a hardcoded initial value. I believe that if the initial state covariance is set to a very large value, the choice of initial value will have less influence on the smoothed state estimates.
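The effect described in this comment can be checked with a toy scalar Kalman filter (the setup and numbers below are illustrative assumptions, not the paper's configuration): a diffuse prior variance makes the first update jump essentially to the first observation, whereas a tight prior keeps the estimate pinned near the hardcoded initial value.

```python
import numpy as np

def scalar_kf(ys, x0, p0, r, q=0.0):
    """Filter a constant hidden state from noisy observations ys.
    x0, p0: initial state estimate and variance; r: observation noise
    variance; q: process noise variance (0 for a truly constant state)."""
    x, p = x0, p0
    xs = []
    for y in ys:
        p = p + q               # predict (the state itself is constant)
        k = p / (p + r)         # Kalman gain
        x = x + k * (y - x)     # update with innovation
        p = (1.0 - k) * p
        xs.append(x)
    return np.array(xs)

rng = np.random.default_rng(0)
true_alpha = 0.6
ys = true_alpha + 0.05 * rng.standard_normal(50)

# Tight prior (confident but wrong): stays near the initial value 0.25.
tight = scalar_kf(ys, x0=0.25, p0=1e-6, r=0.05**2)
# Diffuse prior: the first update moves almost exactly to ys[0],
# and the filter then converges toward true_alpha.
vague = scalar_kf(ys, x0=0.25, p0=1e2, r=0.05**2)
```

The comparison illustrates the reviewer's point: with a very large initial covariance, the choice of initial state barely matters after the first observation.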


Review 3

This paper builds on the phase correction and period correction models presented in [9-11], discussing how these can be optimized under Kalman Filter assumptions. While a significant portion of the paper is dedicated to deriving equations, it is disappointing that the advantages of the methodology are not sufficiently demonstrated through experiments. Specifically, there is a lack of discussion on how this approach could be applied to general music scenarios. Additionally, it is unclear whether this methodology can only be applied to phase correction or period correction models, leaving the paper's significance in general synchronization contexts uncertain.


Author description of changes:

In a nutshell, the suggestions from the reviewers and the meta-reviewer concerned two main points: improving the clarity of the text and strengthening the experimental part of the paper. The first point was addressed by two modifications: adding a short sentence at the beginning of Section 4 indicating that the matrices described there aim to recover the proposed model (Eqs. 5-8) via Eqs. 9 and 10; and regenerating Figure 1 with different markers, and not only different colors, to improve readability in the black-and-white printed version. As to the second point (improving the experimental part of the paper), unfortunately space limitations do not allow us to properly discuss any additional experiments. However, we acknowledge the issues raised by the reviewers and the meta-reviewer in the last paragraph of Section 5, indicating their resolution in future works.