Abstract:

Guitar tablatures enrich the structure of music notation by assigning each note to a string and fret of a guitar in a particular tuning, defining precisely where to play the note on the instrument. The problem of generating tablature from a symbolic music representation involves inferring this string and fret assignment per note across an entire composition or performance. On the guitar, multiple string-fret assignments are possible for each pitch, which leads to a large combinatorial space that prevents exhaustive search approaches. Most modern methods use constraint-based dynamic programming approaches to minimize some cost function (e.g. hand position movement). In this work, we introduce a novel deep learning solution to symbolic guitar tablature estimation. We train an encoder-decoder Transformer model in a masked language modeling paradigm to assign notes to strings. The model is first pre-trained on DadaGP, a dataset of over 25K tablatures, and then fine-tuned on a curated set of professionally transcribed guitar performances. Given the subjective nature of assessing tablature quality, we conduct a user study amongst guitarists, wherein we ask participants to rate the playability of multiple versions of tablature for the same four-bar excerpt. The results indicate our system significantly outperforms competing algorithms.
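A minimal sketch of the masked language modeling idea described above, assuming an illustrative token vocabulary (the Pitch_*/String_* names and the <mask> symbol are stand-ins, not the paper's actual tokens): string tokens are masked in the encoder input and the model learns to recover them.

    # Illustrative sketch of the masked string-assignment objective; token names
    # are assumptions for clarity, not the paper's actual vocabulary.
    tokens = [
        "Pitch_E2", "String_6",
        "Pitch_B2", "String_5",
        "Pitch_E3", "String_4",
    ]

    # Training input: string tokens are masked; the model must recover them
    # from the surrounding musical context.
    encoder_input = [t if not t.startswith("String_") else "<mask>" for t in tokens]
    decoder_target = tokens

    print(encoder_input)
    # ['Pitch_E2', '<mask>', 'Pitch_B2', '<mask>', 'Pitch_E3', '<mask>']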

Reviews
Meta Review

This paper presents a deep learning method (Transformer trained with masked language modeling) for MIDI-to-tablature conversion, which achieves better performance than existing methods.

While all three reviewers and the independent meta-review appreciated the main idea and contributions of this work, they also pointed out a number of issues. Please read all the reviews for the detailed comments. Specifically, the paper needs to clarify several important details, including the network architecture, the test data, the post-processing procedure, the metric for agreement between system output and human annotation, the subjective evaluation instructions, and some analyses of the experimental results.

Based on the reviews and discussions, the final recommendation is "Accept".


Review 1

This article addresses the task of assigning realistic guitar string/fret positions to the notes of a MIDI file, which is here judiciously simplified to a string assignment problem. This task is well known in the community, sometimes referred to as the "fingering" problem. The problem is addressed here with a Transformer, trained with a masking process on the string information.

The paper is well written and clearly structured. The approach is promising, but some points are still to be clarified, in particular in the evaluation.

Regarding the title: because "tablature" commonly refers to a particular type of musical score, the reader might wrongly think that the term "tablature inference" refers to "tablature generation" (analogous to "score generation"). It might therefore be good to clarify the title.

The quantile approach judiciously provides some context regarding the data being labelled by the model. It raises, however, the question of how the first ten tokens are inferred, since they cannot benefit from any past context. How does the model handle this? It seems important, as it will arguably impact the predictions for the following quantiles, and so on.
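For concreteness, one way to picture the inference being questioned here is sketched below (purely hypothetical pseudocode; the model.predict interface and the chunking are assumptions, and how the real system seeds the very first chunk is exactly the open question):

    # Hypothetical sketch of chunked autoregressive inference, only to make the
    # question concrete; the actual mechanism in the paper may differ.
    def infer_string_assignments(model, note_tokens, chunk_size=10):
        predicted = []  # string tokens predicted so far, fed back as decoder context
        for start in range(0, len(note_tokens), chunk_size):
            chunk = note_tokens[start:start + chunk_size]
            # The very first chunk sees an empty `predicted` list, i.e. no past context.
            predicted += model.predict(chunk, context=predicted)
        return predicted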

In Section 4.4, it would be clearer to state, within the third sentence, the proportion of unplayable predictions (I believe it is 0.53%, as stated at the end of the section, but the reader will wonder about that number as soon as the problem is stated). Although the introduction of Section 4.4 is clear, the description of the algorithm lacks details. Are we talking about simultaneous notes, or notes in a 10-note window? In step 3.b, how could the center note have a fret value higher than MAX_DEVIATION (which is a fret interval, not a fret number)?
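For reference, one hedged reading of such a window-based heuristic is sketched below (the window size, reference position and reassignment rule are assumptions made for illustration, not the algorithm of Section 4.4):

    # Hypothetical reconstruction of a window-based playability check, only to
    # make the questions above concrete; not the paper's actual algorithm.
    MAX_DEVIATION = 4  # assumed maximum fret interval within one hand position

    def flag_unplayable(frets, window=11):
        """Flag notes whose fret strays too far from the local hand position."""
        flagged = []
        half = window // 2
        for i, fret in enumerate(frets):
            neighbours = frets[max(0, i - half): i + half + 1]
            local_position = sum(neighbours) / len(neighbours)  # rough hand position
            # The "center note" is the note in the middle of the moving window;
            # it is flagged when it deviates from the local position by more
            # than the allowed fret interval.
            if abs(fret - local_position) > MAX_DEVIATION:
                flagged.append(i)
        return flagged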

The number of tabs in the fine-tuning set should be indicated.

I don't understand why the evaluation is performed on only 9 tablatures. The (large) size of the datasets used in this work arguably enables a much larger evaluation.

It is great to compare the results of the proposed approach with functionalities of existing (and widely used) software, but I regret the lack of comparison with published state-of-the-art approaches, even if they are not necessarily implemented in any software. I suspect that GuitarPro, MuseScore and TuxGuitar might not have treated the performance of this functionality as a priority (given that these programs are primarily intended for score/tab visualisation/playback/writing rather than for MIDI-to-tab transcription), and might therefore be far from the state of the art and rather constitute an "easy" baseline.

I don't understand why a mistake like Fig. 5 (right) could occur, given that the post-processing algorithm (Section 4.4) is supposed to avoid unplayable content. To disambiguate, it could be interesting to illustrate unplayability (in the sense of Section 4.4) with a concrete example.

The network architecture is poorly described (2 lines). Why is the hidden size 384-dimensional? What is the input dimension? I guess it must be the size of the token alphabet, so what is that size?

The second sentence of Section 4.3 seems to be of major importance, but it is hardly understandable. I think the pedagogy of the paper could be improved by illustrating the BART architecture for readers unfamiliar with the original publication.
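As a point of reference for the architecture questions above, a configuration at half the size of BART base could look like the following (only the 384 hidden size is reported; the layer, head and feed-forward values are assumptions obtained by halving BART base, and vocab_size is a placeholder for the unreported token alphabet size):

    # Sketch of a "half of BART base" configuration using the Hugging Face
    # BartConfig; all values except d_model=384 are assumptions.
    from transformers import BartConfig, BartForConditionalGeneration

    config = BartConfig(
        vocab_size=512,             # placeholder for the token alphabet size
        d_model=384,                # hidden size mentioned in the paper
        encoder_layers=3,           # half of BART base's 6
        decoder_layers=3,
        encoder_attention_heads=6,  # half of BART base's 12
        decoder_attention_heads=6,
        encoder_ffn_dim=1536,       # half of BART base's 3072
        decoder_ffn_dim=1536,
    )
    model = BartForConditionalGeneration(config)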


Review 2

This paper is a nice contribution to the field of automatic transcription to tablature. It summarises prior art, uses a sensible corpus and evaluates in good ways. More explicit reflection on the problems of representation might be nice here - without temporal information or representations of simultaneity in the input (am I understanding that correctly?), it's not that surprising that physically awkward or impossible results would come out. The example in Figure 5 (along with the graph in Figure 4) shows this well -- each decision taken individually seems sensible, except for the chords. This is especially true if the sequence of input notes had the B2 as the second note. Where the order of notes in a chord affects the result, that's a suggestion that the model is suboptimal...


Review 3

The paper presents a simple solution for guitar tablature generation from MIDI in which an encoder-decoder Transformer is trained using a masked language modeling supervision scheme (masking string tokens) to assign notes to strings. The inference is done in an autoregressive fashion and some post-processing heuristics are applied to the output of the model. The model is trained on a large dataset of crowdsourced tablatures and then fine-tuned on a small set of professionally transcribed performances.

It is not clear why the tokenization is deemed novel in the introduction of the paper. I understand that the model uses an existing tokenizer (MidiTok) and that the string is encoded with the track token. I encourage the authors to provide more information on why the tokenization is novel.
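To illustrate what "the string is encoded with the track token" could look like in a MidiTok-style token stream (the token names and ordering below are assumptions, not the paper's scheme):

    # Hypothetical per-note token group in which the track/program token is
    # reinterpreted as the string number; names are illustrative only.
    note_tokens = [
        "Program_3",      # track token standing in for "string 3"
        "Pitch_62",       # MIDI pitch D4
        "Velocity_95",
        "Duration_1.0.8", # one beat (illustrative formatting)
    ]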

Experiments compare the proposed model with a commercial system and two open-source software implementations capable of producing tablatures from score or MIDI. Given the challenges involved in assessing the quality of tablatures, the experiments include a user study. The paper provides an interesting discussion of the results and limitations of the proposed approach and suggests various ideas for future research.

The manuscript is well-written and organized, and overall, the paper makes a good contribution to ISMIR.

The supplementary material includes a subset of the codebase, but the manuscript does not indicate whether the code or models will be available. It would be very important to clarify whether the code and model will be available for reproducibility.

Minor corrections

Line 43 - References to existing publications should be added here.

Line 56 - Please reconsider the novelty of the tokenization.

Section 5.1 - There is no reference in the text to Figure 4. Please link the figure to the text describing the evaluation of the stretch across the chords.


Author description of changes:
  1. We clarify the inference mechanism and illustrate why it reduces the asymmetry between inference and training, as the past values fed to the decoder carry higher confidence / probability values.
  2. We clarify the notion of center note and the choice of a moving window of 11 notes in the post-processing heuristics.
  3. We add more information about the fine-tuning (train, test, and validation splits), instead of relying on the reference to Riley et al. [12].
  4. We clarified that subjects were instructed to ignore the difficulty of excerpts in the user study.
  5. We lessen the claim that the system as-is could be used as a generic arranger (regardless of the provenance of the MIDI data) and instead state that it can be viewed as a guitar tablature arranging system.
  6. We explain that the BART architecture hyperparameters are not tuned at all and are simply half the size of those of the BART base model.
  7. The second sentence of Section 4.3 is rewritten to be more clear.