Composer's Assistant 2: Interactive Multi-Track MIDI Infilling with Fine-Grained User Control
Martin E Malandro (Sam Houston State University)*
Keywords: Creativity -> human-ai co-creativity; MIR fundamentals and methodology -> symbolic music processing; MIR tasks -> music generation; Knowledge-driven approaches to MIR -> machine learning/artificial intelligence for music
We introduce Composer's Assistant 2, a system for interactive human-computer composition in the REAPER digital audio workstation. Our work upgrades the Composer's Assistant system (which performs multi-track infilling of symbolic music at the track-measure level) with a wide range of new controls to give users fine-grained control over the system's outputs. Controls introduced in this work include two types of rhythmic conditioning controls, horizontal and vertical note onset density controls, several types of pitch controls, and a rhythmic interest control. We train a T5-like transformer model to implement these controls and to serve as the backbone of our system. With these controls, we achieve a dramatic improvement in objective metrics over the original system. We also study how well our model understands the meaning of our controls, and we conduct a listening study that does not find a significant difference between real music and music composed in a co-creative fashion with our system. We release our complete system, consisting of source code, pretrained models, and REAPER scripts.
Reviews
The reviewers broadly appreciate the contribution of a generative model integrated into a DAW. While I think the empirical evaluation is somewhat weak, this is counterbalanced by the relevance of the artifact. The system will be released and open source, making this a valuable baseline for future work on DAW-integrated models.
Strengths
I highly recognize this work for its wide range of controls that
- focus on different aspects of music, including pitch, rhythm, density and novelty, which composers would likely find important and useful; and
- allow very high flexibility, i.e., users can choose whichever control to apply, on arbitrary track or measure.
I also thank the authors a lot for releasing their model/implementation as part of a DAW, which increases the system's potential real-world impact.
Weaknesses
(W1) Somewhat insufficient ML technical descriptions -- The authors put most of the technical details in the appendix. To enhance readers' understanding of the method, I would recommend including at least the following content in the main text:
- How does the model take care of future context? Is it by reordering, or placing future context on the encoder side?
- How are the control tokens and tokens to be generated ordered/arranged?
- What are the differences/advancements compared to REAPER v1? (The improvement seems sizable from the numbers in Table 1.)
(W2) Misleading plot about control effectiveness (Figure 5) -- I find this plot a bit difficult to understand, particularly because the control signals/levels were not shown. Perhaps a better presentation is to display generated examples conditioned on different note density levels.
strengths:
- strong objective evaluation of the improvements and controls added to the infilling model
- authors think about deep learning models (and the way we should control them) from a co-creative standpoint, developing their models and conditioning techniques guided by the interaction needs of a composer working with MIDI in a DAW, co-creating with a generative model.
- the authors contribute a fully open-source ecosystem with source code, pretrained models, and REAPER scripts for incorporating their system into a DAW.
weaknesses:
- would have been good to know about the musical background of the volunteers in the subjective evaluation.
This paper describes additional features to REAPER Infiller incorporating user controls (pitch, rhythm, horizontal and vertical note density). The methods and the model are well evaluated along with subjective evaluations.
While the paper is well written and most details are properly presented, it builds on previous work (RI). A section reviewing RI itself would have been helpful for an uninitiated reader. However, lines 280-289 lend some insight into the model architecture, so I am inclined to believe that the text in this paper is sufficient to understand the method.
Overall I feel this work is an important addition to RI and thus should be published.
-De-anonymized title and text.
-Added/discussed the 3 references suggested by Reviewer 2 and Meta-Reviewer 1.
-Two reviewers felt that Figure 5 was difficult to understand. Figure 5 has been replaced with a new figure (still Figure 5) that gives more information, and the text referring to Figure 5 has been rewritten.
-"How does the model take care of future context?...", "How are the control tokens and tokens to be generated ordered/arranged?"
It places future context on the encoder side. (Think of an input as being a page of sheet music with some track-measures masked. The model can "see" all unmasked notes on the page before writing any notes.) The paper has been updated in Fig 2 and Sec 4 to make this more clear. Sec 4 was also updated to describe where the control tokens are placed.
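The sheet-music analogy above can be sketched in a few lines of Python. This is only an illustrative sketch of encoder-side infilling in the general T5 style, not the system's actual tokenization: the function name `build_infilling_example`, the token names (`<track_0>`, `<measure_1>`, `<mask_0>`), and the note tokens are all hypothetical, and control tokens are omitted because their placement is not described here.

```python
from typing import Dict, List, Set, Tuple

Cell = Tuple[int, int]  # (track index, measure index)


def build_infilling_example(
    song: Dict[Cell, List[str]], masked: Set[Cell]
) -> Tuple[List[str], List[str]]:
    """Split a song into an encoder input and a decoder target.

    Hypothetical token layout: every track-measure stays in the encoder
    input, but masked cells have their notes replaced by a sentinel token.
    The decoder target lists each sentinel followed by the hidden notes.
    """
    encoder_input: List[str] = []
    decoder_target: List[str] = []
    mask_id = 0
    for track, measure in sorted(song):
        encoder_input += [f"<track_{track}>", f"<measure_{measure}>"]
        if (track, measure) in masked:
            sentinel = f"<mask_{mask_id}>"
            mask_id += 1
            encoder_input.append(sentinel)       # notes hidden from the encoder
            decoder_target.append(sentinel)      # decoder writes them here
            decoder_target += song[(track, measure)]
        else:
            encoder_input += song[(track, measure)]  # visible context
    return encoder_input, decoder_target


# One track, three measures; the middle measure is to be infilled.
song = {(0, 0): ["C4"], (0, 1): ["E4"], (0, 2): ["G4"]}
enc, dec = build_infilling_example(song, masked={(0, 1)})
```

In this toy example the note in measure 2 (`G4`) appears in the encoder input even though it comes after the masked measure, which is the sense in which future context sits on the encoder side: the decoder can attend to it before writing any notes for measure 1.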
-"In Table 1, how much of the improvement of CA2 is attributable to general improvements of the CA model, vs. the greater control signal...?"
There were some general improvements to the CA codebase (primarily around training example generation), as well as improvements to the training dataset (some additional CC0 files and some reweighting of the training data) that we did not feel were worth discussing in the paper. To ensure we made a fair comparison, the CA model in the paper is actually a retrained CA model that incorporates these improvements. Hence, the numbers in Table 1 demonstrate exactly the improvement derived from the greater control signal.
The original CA model scores are as follows:
F1: 50.63; 29.98; 53.35
Precision: 52.21; 33.29; 55.16
Recall: 49.67; 29.38; 52.76
PCHE difference: 33.92; 52.77; 33.29
Groove sim: 97.85; 96.17; 97.91
Roughly, general improvements increased performance by 1-2 points, and the remaining 20-40 points of improvement in the paper come from the increased control signal from the ground truth.
-Rewrote Sec 5.3 for clarity and discussion of subjective results.
-Removed a few sentences from Sections 1 and 2 to make room for the above changes.