The GigaMIDI Dataset with Features for Expressive Music Performance Detection

The Musical Instrument Digital Interface (MIDI), introduced in 1983, revolutionized music production by allowing computers and instruments to communicate efficiently. MIDI files encode musical instructions compactly, facilitating convenient music sharing. They benefit Music Information Retrieval (MIR), aiding in research on music understanding, computational musicology, and generative music. The GigaMIDI dataset contains over 1.4 million unique MIDI files, encompassing 1.8 billion MIDI note events and over 5.3 million MIDI tracks. GigaMIDI is currently the largest collection of symbolic music in MIDI format available for research purposes under fair dealing. Distinguishing between non-expressive and expressive MIDI tracks is challenging, as MIDI files do not inherently make this distinction. To address this issue, we introduce a set of innovative heuristics for detecting expressive music performance. These include the Distinctive Note Velocity Ratio (DNVR) heuristic, which analyzes MIDI note velocity; the Distinctive Note Onset Deviation Ratio (DNODR) heuristic, which examines deviations in note onset times; and the Note Onset Median Metric Level (NOMML) heuristic, which evaluates onset positions relative to metric levels. Our evaluation demonstrates that these heuristics effectively differentiate between non-expressive and expressive MIDI tracks. Furthermore, guided by this evaluation, we use the NOMML heuristic to curate the largest expressive MIDI dataset to date: a subset of GigaMIDI comprising the expressively performed instrument tracks detected by NOMML, covering all General MIDI instruments and totalling 1,655,649 tracks, or 31% of the GigaMIDI dataset.

Using these heuristics, each MIDI track in GigaMIDI is assigned one of four expressive-performance classes: NE (non-expressive), EO (expressive-onset), EV (expressive-velocity), and EP (expressively-performed).
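To illustrate the idea behind the NOMML heuristic, the sketch below computes the median metric level of a track's note onsets. The grid resolutions, tick base, level threshold, and off-grid handling are illustrative assumptions for a toy example, not the paper's exact implementation.

```python
from statistics import median

# Metric grid levels from coarsest to finest, in ticks, assuming
# 480 ticks per quarter note (an illustrative choice, not GigaMIDI's).
LEVELS = [480, 240, 120, 60, 30]  # quarter, eighth, 16th, 32nd, 64th


def metric_level(onset_tick: int) -> int:
    """Index of the coarsest grid the onset aligns with.

    Returns len(LEVELS) for an off-grid onset, which is typical of
    human micro-timing rather than quantized sequencing.
    """
    for level, grid in enumerate(LEVELS):
        if onset_tick % grid == 0:
            return level
    return len(LEVELS)


def nomml(onsets: list[int]) -> float:
    """Median metric level over all note onsets in a track."""
    return median(metric_level(t) for t in onsets)


# A quantized track snaps to the grid; a performed one drifts off it.
quantized = [0, 480, 960, 1200, 1440]
performed = [3, 478, 965, 1207, 1431]

print(nomml(quantized))  # low median level: likely non-expressive onsets
print(nomml(performed))  # maximal (off-grid) level: likely expressive onsets
```

In this toy version, a track whose median onset sits on a coarse grid level reads as quantized, while one whose median onset falls off-grid reads as expressively timed; the real heuristic operates on the dataset's actual tick resolutions.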

Access to the GigaMIDI dataset is available via the Hugging Face Hub.

Analyzing MIDI data can benefit symbolic music generation, computational musicology, and music data mining. The GigaMIDI dataset contributes to MIR research by providing consolidated access to extensive MIDI data for analysis. Its metadata analyses, data source references, and findings on expressive music performance detection support nuanced inquiries and foster progress in expressive music performance analysis and generation.

Our novel heuristics for discerning between non-expressive and expressively-performed MIDI tracks are highly effective on the presented dataset. The NOMML heuristic achieves a classification accuracy of 100%, underscoring its discriminative capacity for expressive music performance detection.

Data Accessibility and Ethical Statements

The GigaMIDI dataset consists of MIDI files acquired via the aggregation of previously available datasets and web scraping from publicly available online sources. Each subset is accompanied by source links, copyright information when available, and acknowledgments. File names are anonymized using MD5 hashing. We acknowledge the work from the previous dataset papers that we aggregate and analyze as part of the GigaMIDI subsets.

This dataset has been collected, utilized, and distributed under the Fair Dealing provisions for research and private study outlined in the Canadian Copyright Act. Fair Dealing permits the limited use of copyright-protected material without the risk of infringement and without having to seek the permission of copyright owners. It is intended to provide a balance between the rights of creators and the rights of users. As per instructions of the Copyright Office of Simon Fraser University, two protective measures have been put in place that are deemed sufficient given the nature of the data (accessible online):

1) We explicitly state that this dataset has been collected, used, and distributed under the Fair Dealing provisions for research and private study outlined in the Canadian Copyright Act.

2) On the Hugging Face Hub, we state that the data is available for research purposes only, and we collect the user's legal name and email as proof of agreement before granting access.

We thus decline any responsibility for misuse.

You agree to use the GigaMIDI dataset only for non-commercial research or education without infringing copyright laws or causing harm to the creative rights of artists, creators, or musicians.

If you use the GigaMIDI dataset in your research, please acknowledge by citing our reference paper (see the Reference/Citation section below) to support knowledge sharing and advance the field.

Hugging Face Hub: https://huggingface.co/datasets/Metacreation/GigaMIDI

GitHub: https://github.com/Metacreation-Lab/GigaMIDI-Dataset

Reference

Lee, Keon Ju M., Ens, Jeff, Adkins, Sara, Sarmento, Pedro, Barthet, Mathieu, and Pasquier, Philippe (2025). "The GigaMIDI Dataset with Features for Expressive Music Performance Detection." Transactions of the International Society for Music Information Retrieval.
