Abstract:

This paper describes a streaming audio-to-MIDI piano transcription approach that aims to sequentially translate a music signal into a sequence of note onset and offset events. The sequence-to-sequence nature of this task may call for the computationally intensive Transformer model for better performance, which has recently been used for offline transcription benchmarks and could be extended for streaming transcription with causal attention mechanisms. We assume that the performance limitation of this naive approach lies in the decoder. Although the time-frequency features useful for onset detection are considerably different from those for offset detection, the single decoder is trained to output a mixed sequence of onset and offset events with no guarantee of the correspondence between the onset and offset events of the same note. To overcome this limitation, we propose a streaming encoder-decoder model that uses a convolutional encoder aggregating local acoustic features, followed by an autoregressive Transformer decoder detecting a variable number of onset events and another decoder detecting the offset events for the active pitches with validation of the sustain pedal at each time frame. Experiments using the MAESTRO dataset showed that the proposed streaming method performed comparably with or even better than the state-of-the-art offline methods while significantly reducing the computational cost.
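To make the described architecture concrete, a minimal PyTorch-style sketch is given below; all module names, layer counts, vocabulary size, and feature dimensions are assumptions for illustration, not the authors' implementation.

```python
# Minimal structural sketch of the described model (assumed names and sizes, not the authors' code).
import torch
import torch.nn as nn

class StreamingTranscriberSketch(nn.Module):
    def __init__(self, n_mels=229, d_model=256, vocab_size=135, n_pitches=128):
        super().__init__()
        # Convolutional encoder aggregating local time-frequency features.
        self.encoder = nn.Sequential(
            nn.Conv1d(n_mels, d_model, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(d_model, d_model, kernel_size=3, padding=1),
        )
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        # Autoregressive decoder emitting onset (and pedal) tokens for the current frame.
        self.onset_decoder = nn.TransformerDecoder(dec_layer, num_layers=2)
        self.token_embed = nn.Embedding(vocab_size, d_model)
        self.onset_head = nn.Linear(d_model, vocab_size)
        # Non-autoregressive decoder scoring offsets of the currently active pitches.
        self.offset_decoder = nn.TransformerDecoder(dec_layer, num_layers=2)
        self.pitch_embed = nn.Embedding(n_pitches, d_model)
        self.offset_head = nn.Linear(d_model, 1)

    def encode(self, mel):                          # mel: (batch, n_mels, frames)
        return self.encoder(mel).transpose(1, 2)    # -> (batch, frames, d_model)

    def decode_onsets(self, tokens, memory):        # tokens: (batch, length) emitted so far
        x = self.token_embed(tokens)
        causal = torch.triu(torch.full((tokens.size(1),) * 2, float("-inf")), diagonal=1)
        h = self.onset_decoder(x, memory, tgt_mask=causal)
        return self.onset_head(h[:, -1])            # logits of the next onset/pedal token

    def decode_offsets(self, active_pitches, memory):   # active_pitches: (batch, n_active)
        h = self.offset_decoder(self.pitch_embed(active_pitches), memory)
        return self.offset_head(h).squeeze(-1)      # one offset logit per active pitch
```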

Reviews
Meta Review

This paper received generally good ratings from the four reviewers, as the proposed model achieves high performance while working in real time, which has been less studied. The main weakness is the lack of clarity in the model descriptions. In particular, reviewer #1 pointed out a lot of missing details that make it difficult to understand the exact model operation, and thus recommended a weak rejection. Through discussion, the reviewers concluded that many of the issues can be addressed in the camera-ready version and that the main contributions of the paper deserve to be presented at the ISMIR conference. Therefore, the authors are strongly encouraged to incorporate all comments in the revision and improve the clarity.


Review 1

This paper proposes a streaming-capable automatic piano transcription method. Automatic piano transcription is a well-known task with a somewhat standardized evaluation protocol used since the Onsets and Frames paper (Hawthorne 2018). To make the model streaming-capable, the authors improve upon the sequence-to-sequence piano transcription model (Hawthorne 2021) so that, rather than generating a full sequence of MIDI-like tokens, the decoders handle only the onset or offset events of a single time frame at a time. The onset decoder autoregressively predicts the sequence of note onsets, and the offset decoder takes the set of active onsets and non-autoregressively predicts whether any of them have reached their offset.
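As a concrete reading of this per-frame scheme, here is a hedged Python sketch of the decoding loop as I understand it; the decoder interfaces (next_token, offsets) and the token handling are my own placeholders, not the paper's notation.

```python
# Hedged sketch of the per-frame decoding described above; all names are illustrative only.
EOS = "<eos>"  # assumed end-of-sequence token closing each frame's onset sequence

def transcribe_stream(frames, onset_decoder, offset_decoder):
    """Yield (frame_index, new_onsets, ended_notes) for each incoming feature frame."""
    active_notes = set()                       # pitches with an onset but no offset yet
    for t, feats in enumerate(frames):
        # 1) Autoregressive onset decoding within the current frame.
        tokens, y = [], None
        while y != EOS:
            y = onset_decoder.next_token(feats, tokens)   # greedy step (hypothetical API)
            tokens.append(y)
        onsets = [tok for tok in tokens if isinstance(tok, int)]  # keep pitch tokens only
        active_notes.update(onsets)

        # 2) Non-autoregressive offset decisions for every currently active pitch.
        ended = offset_decoder.offsets(feats, sorted(active_notes)) if active_notes else []
        active_notes.difference_update(ended)

        yield t, onsets, ended
```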

The results in Tables 1-2 show that the proposed method performs competitively with other SOTA models while achieving 380 milliseconds of streaming latency, which I consider a strong result. While real-time / streaming capabilities are often overlooked in academic settings, a streaming-capable model can have a huge impact on usability and interactive applications.

The proposed method section could be improved considerably to give clearer descriptions of the model:

  • Some details of the model architecture are missing, such as how the encoder features are fed to the decoders. I could infer from Figure 2 that cross-attention (a.k.a. encoder-decoder attention) is applied to the encoder features with a causal mask to enable streaming, but this should have been explained in the main body of the text (see the mask sketch after this list).
  • In Figure 2, each decoder layer seems to have layer norms at both the beginning and the end of the block, which is unconventional. (For context, the original Transformer paper placed the layer norms after the attention and MLP sub-layers, but most implementations, including T5, place them before.)
  • There should be a more detailed description of the sustain-pedal detection, which is in the title. I'm guessing from the output vocabulary that the pedal states are predicted as part of the onset decoder's output, but I can't tell whether the tokens represent the states (presence or absence, as written in Section 4.1.3) or the state changes (press / release).
  • In that sense, it would be helpful to include some example output sequences, to clarify whether there is a preferred order of the tokens, e.g., BOS, any pedal tokens, notes from lower to higher, and then EOS.
  • The offset decoder is described as operating non-autoregressively, which makes sense because it only needs to determine whether each active note has ended at each frame, but the notation in Sections 3.1 and 3.3 makes it easy to (mis?)understand that the offset decoder also performs sequence prediction.
  • I find the notation in Algorithm 1 a little inconsistent. I have to guess that superscript notations like 1:k_1 and 0:n_1 denote the variable-length onset and offset sequences in each time frame, but it is unclear why Y_t^{0:k_t} is computed while only Y_t^{1:k_t} is output. Maybe this is to exclude EOS, or the pedal token? In any case, a caption to the algorithm block providing a high-level overview of the overall flow would be nice. Lines 19-26 are pretty standard greedy LM decoding, so they could be collapsed into a single line.
  • Given all the intricacies like the above, an open-source code and model release would foster faster and wider adoption of the streaming piano transcription model.
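Regarding the cross-attention point above, the kind of causal mask I am assuming could be constructed as follows; this is a sketch under my own assumptions about how decoder queries align with encoder frames, not the authors' code.

```python
# Additive causal mask for cross attention (sketch; frame alignment is assumed).
import torch

def causal_cross_attention_mask(n_queries, n_frames, lookahead=0):
    """Allow the decoder at frame t to attend only to encoder frames <= t + lookahead."""
    q = torch.arange(n_queries).unsqueeze(1)   # decoder (query) frame indices, (n_queries, 1)
    k = torch.arange(n_frames).unsqueeze(0)    # encoder (key) frame indices, (1, n_frames)
    mask = torch.zeros(n_queries, n_frames)
    mask[k > q + lookahead] = float("-inf")    # block attention to future frames
    return mask                                # usable as memory_mask in nn.TransformerDecoder
```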

Given that there is more space available, further analyses like the following would have made this a stronger paper.

  • A piano-roll representation of the predicted MIDI or the "posteriorgrams" could be useful, space permitting, as a qualitative visualization of the model behavior.
  • Latency analysis: while 380 ms is the systematic latency of the streaming architecture, the real-world latency also depends on the time between the actual and predicted onset/offset events. A histogram of those latencies would be useful for understanding the real-world latency and its variability (see the sketch after this list).
  • Bonus points for a video of a piano performance being transcribed in real time, showing the player and an animated piano-roll representation on the screen.
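For the latency analysis, matching predicted notes to reference notes and histogramming the signed onset-time differences would already be informative. A rough sketch using mir_eval's note matching is below; the tolerances and variable names are my assumptions.

```python
# Sketch: histogram of (predicted - reference) onset times for matched notes.
import numpy as np
import matplotlib.pyplot as plt
from mir_eval.transcription import match_notes

def onset_latency_histogram(ref_intervals, ref_pitches, est_intervals, est_pitches):
    # Intervals are (N, 2) arrays of [onset, offset] in seconds; pitches are in Hz (mir_eval convention).
    matches = match_notes(ref_intervals, ref_pitches, est_intervals, est_pitches,
                          onset_tolerance=0.05, offset_ratio=None)   # onset-only matching
    latencies_ms = [(est_intervals[j, 0] - ref_intervals[i, 0]) * 1000 for i, j in matches]
    plt.hist(np.array(latencies_ms), bins=50)
    plt.xlabel("Predicted minus reference onset time (ms)")
    plt.ylabel("Count")
    plt.show()
```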

To conclude, I really like the strong results and would love to see the paper accepted and the model becoming available for streaming piano transcription, but as above, I find the overall quality of the text to be below ISMIR's acceptance bar.


Review 2

The paper is well presented and the evaluation gives promising results.

It would be interesting to comment on the choice of using the same number of frames before and after the current frame as input to the encoder, and on whether the number of future frames could be reduced to lower the latency.

  • L.5: “may call for a transformer” <- for using a Transformer model?
  • L.52: rephrase to avoid the repetition.
  • L.67: “under these circumstances” -> to overcome these limitations?
  • L.87: “related work”
  • L.384: “relying on long-term dependency of acoustic features”


Review 3

I found the paper to be clear, well-written, and easy to read. Congratulations to the authors. I don't have a lot of comments, and I think that this paper meets the requirements for publication at ISMIR.

Comments:

  • It would be nice to know if the model is actually usable in real-world applications, i.e., whether the model is already ready for streaming transcription. An experiment conducted by the authors in real conditions would be a nice addition to the paper.
  • I find the use of a "while True" loop in the pseudo-code of Algorithm 1 inelegant. In my opinion, the code could be rewritten to remove this condition while keeping the same behavior (using a "while y != EOS" condition and initializing y outside the loop); see the sketch after this list.
  • The code should be made open source. If the authors intend to make the code open source, they should mention it (for instance with a placeholder instead of the real link, for double-blind review).
  • Line 87: the word "work" is missing in "this section review related WORK".
  • Line 96: I think there is a word missing after "mainstream" in "In APT, the framewise transcription approach has still been the mainstream due to [...]". Maybe mainstream "use" or something of the like?
  • The definitions of "frame-level accuracy" and "note-level accuracy" are not given in Section 2.1.
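For illustration, the restructuring I have in mind is along these lines (a sketch with placeholder names, not the actual symbols of Algorithm 1):

```python
# Greedy per-frame decoding without a "while True" / break pattern (sketch only).
def decode_frame(decoder, EOS="<eos>"):
    y, tokens = None, []
    while y != EOS:
        y = decoder.next_token(tokens)   # hypothetical greedy decoding step
        tokens.append(y)
    return tokens                        # includes EOS; strip it if not needed
```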


Author description of changes:

We truly appreciate the valuable comments and suggestions and have made the corresponding modifications to the original manuscript as far as possible.

To Reviewer 1:

  1. We have updated Figure 2 to illustrate how encoder features are fed to the decoder, and clarified this process in Section 3.1.

  2. Figure 2 has been revised to clarify the placement of layer normalization relative to the other layers.

  3. We've added Table 1 to showcase the detection of sustain pedal events and the resulting output sequences.

  4. In Section 3.1, we now emphasize that the offset decoder does not perform sequence prediction, but instead predicts the offset for each detected onset simultaneously.

  5. The caption for Algorithm 1 now specifies that index 0 in the offset sequence represents tokens for the sustain pedal.

To Reviewers 2 and 3:

  1. We have corrected the spelling mistakes throughout the paper.

  2. The "while True" loop has been removed from Algorithm 1 to improve clarity.

  3. To avoid confusion, we've replaced "frame-level accuracy" with "frame-level transcription performance" throughout the paper.

To Meta Reviewer:

  1. Section 4.1.3 now includes an explanation for maintaining the full vocabulary for all decoders.

  2. We've added an explanation for consistent onset and offset decoding in Section 3.4.

  3. We've included the missing references:

     • Kwon et al., "Polyphonic piano transcription using autoregressive multi-state note model," ISMIR 2020.
     • Dasaem Jeong, "Real-time Automatic Piano Music Transcription System," ISMIR LBD 2020.

  4. We acknowledge that Automatic Music Transcription (AMT) and Automatic Speech Recognition (ASR) face different challenges.

To All Reviewers:

Due to limited research resources, we did not conduct a more extensive model analysis. As noted in our conclusion, we recognize that decoding every frame may lead to unnecessary computation. We plan to address these issues, further improve the model, and open-source it in future work.