|
Persistent Identifier
|
doi:10.18710/RDU8M2 |
|
Publication Date
|
2026-06-03 |
|
Title
| Replication Data for: Modeling monophthongal versus diphthongal /aɪ/ in sung vocal performance with interpretable machine learning |
|
Author
| De Timmerman, Romeo (ROR: https://ror.org/00cv9y106; ORCID: https://orcid.org/0000-0003-4395-4755)
Verbeke, Gil (ROR: https://ror.org/00cv9y106; ORCID: https://orcid.org/0000-0002-9491-9557) |
|
Point of Contact
|
De Timmerman, Romeo (Ghent University)
TROLLing curator |
|
Description
| Dataset description
This dataset contains replication data for the study "Modeling monophthongal versus diphthongal /aɪ/ in sung vocal performance with interpretable machine learning" (De Timmerman & Verbeke, 2026). The study investigates how binary perceptual annotations of monophthongal versus diphthongal /aɪ/ relate to measurable acoustic variation in sung vocal performance. Using a corpus of studio-recorded blues music processed with vocal-instrumental source separation, F1 and F2 formant trajectories were extracted for 1,004 /aɪ/ tokens, each perceptually categorized by a trained annotator as either monophthongal or diphthongal. These data were used to train and interpret two machine-learning models predicting the perceptual labels: a gradient boosted decision tree (XGBoost) trained on a set of engineered acoustic features, and a multilayer perceptron (MLP) trained directly on the raw, time-aligned formant trajectories. (2026-06-02)
Article abstract
This study combines modern source-separation techniques with interpretable machine learning methods to investigate how binary perceptual annotations of diphthongal and monophthongal /aɪ/ relate to measurable acoustic variation in sung vocal performance. Using a corpus of studio-recorded music processed with vocal-instrumental source separation, we extract F1 and F2 trajectories for 1,004 /aɪ/ tokens, each perceptually categorized as either monophthongal or diphthongal. These trajectories were modeled using two approaches: (i) a gradient boosted decision tree trained on engineered acoustic features, and (ii) a multilayer perceptron trained directly on raw formant trajectories. Both models achieved high accuracy and AUC-ROC, indicating that perceptual labels can reliably be predicted from both raw and engineered acoustic input. Moreover, a SHAP-based feature importance analysis showed that features such as Delta F1/Delta F2, cubic spline coefficients and trajectory derivatives captured systematic differences between perceived monophthongal and diphthongal tokens, highlighting the value of dynamic representations over static distance-based measures. The results indicate that perceptual annotations of diphthongal and monophthongal /aɪ/ correspond with quantifiable acoustic information and demonstrate how explainable machine learning can help map gradient vowel dynamics onto binary perceptual categories. The study further emphasizes how recent advances in source separation make sung performance a valuable new domain for phonetic research. (2026-06-02) |
|
Subject
| Arts and Humanities |
|
Keyword
| /aɪ/ vowel
monophthongization
formant trajectories
machine learning
sociophonetics
sociolinguistics
acoustics |
|
Related Publication
| Is Supplement To: De Timmerman, R. & Verbeke, G. (2026) [forthcoming]. Modeling monophthongal versus diphthongal /aɪ/ in sung vocal performance with interpretable machine learning. The Journal of the Acoustical Society of America. |
|
Language
| English |
|
Producer
| Ghent University (UGent) https://www.ugent.be/ |
|
Contributor
| Supervisor: Stef Slembrouck |
|
Distributor
| The Tromsø Repository of Language and Linguistics (TROLLing) https://trolling.uit.no/ |
|
Depositor
| De Timmerman, Romeo |
|
Deposit Date
| 2026-06-01 |
|
Time Period
| Start Date: 2008; End Date: 2023 |
|
Date of Collection
| Start Date: 2023; End Date: 2024 |
|
Data Type
| tabular |
|
Software
| Python, Version: 3.9.18
Keras, Version: 3.10.0
XGBoost, Version: 2.1.4
SHAP, Version: 0.48.0
UMAP, Version: 0.5.7
Praat, Version: 6.4.18 |
|
Related Material
| GitHub repository containing the Python scripts and notebooks used to analyze these data: De Timmerman, R. (2026). Replication Code for "Modeling monophthongal versus diphthongal /aɪ/ in sung vocal performance with interpretable machine learning" [GitHub Repository]. https://github.com/romeodetimmerman/aae-in-blues-formants |
|
Data Source
| TROLLing repository containing the original corpus data from which the vowel tokens were sampled: De Timmerman, R. (2025). Replication Data for: "Covering Blue Voices: African American English and Authenticity in Blues Covers" [Dataset]. DataverseNO. https://doi.org/10.18710/DOJXAV
This dataset is dedicated to the public domain under CC0. |