voice2mode: Phonation Mode Classification in Singing using Self-Supervised Speech Models
Daily Info Dashboard · 2026-02-14
AI Summary
- We present voice2mode, a method for classifying four singing phonation modes (breathy, neutral (modal), flow, and pressed) using embeddings extracted from large self-supervised speech models.
- Prior work on singing phonation has relied on handcrafted signal features or task-specific neural nets; this work evaluates the transferability of speech foundation models to singing phonation classification.
- voice2mode extracts layer-wise representations from HuBERT and two wav2vec2 variants, applies global temporal pooling, and classifies the pooled embeddings with lightweight classifiers (SVM, XGBoost).
- Experiments on a publicly available soprano dataset (763 sustained vowel recordings, four labels) show that foundation-model features substantially outperform conventional spectral baselines (spectrogram, mel-spectrogram, MFCC).
- HuBERT embeddings from early layers yield the best result (~95.7% accuracy with SVM), an absolute improvement of ~12-15% over the best traditional baseline.
#arXiv #paper #research/papers
Content Excerpt
We present voice2mode, a method for classification of four singing phonation modes (breathy, neutral (modal), flow, and pressed) using embeddings extracted from large self-supervised speech models. Prior work on singing phonation has relied on handcrafted signal features or task-specific neural nets; this work evaluates the transferability of speech foundation models to singing phonation classification. voice2mode extracts layer-wise representations from HuBERT and two wav2vec2 variants, applies global temporal pooling, and classifies the pooled embeddings with lightweight classifiers (SVM, XGBoost). Experiments on a publicly available soprano dataset (763 sustained vowel recordings, four labels) show that foundation-model features substantially outperform conventional spectral baselines (spectrogram, mel-spectrogram, MFCC). HuBERT embeddings obtained from early layers yield the best result (~95.7% accuracy with SVM), an absolute improvement of ~12-15% over the best traditional baseline. We also show layer-wise behaviour: lower layers, which retain acoustic/phonetic detail, are more effective than top layers specialized for Automatic Speech Recognition (ASR).
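The downstream pipeline described above (frame-level embeddings → global temporal pooling → lightweight classifier) can be sketched as follows. This is a minimal illustration, not the authors' code: synthetic class-separable features stand in for the real HuBERT/wav2vec2 layer outputs (which would come from e.g. a pretrained model with hidden states exposed), and all shapes and hyperparameters are assumptions.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)

# Stand-in for frame-level self-supervised embeddings with shape
# (n_clips, n_frames, hidden_dim). In voice2mode these would be taken
# from an early HuBERT layer; here we draw synthetic, class-separable
# features so the sketch runs without audio or model downloads.
n_clips, n_frames, dim, n_classes = 200, 50, 32, 4
labels = rng.integers(0, n_classes, size=n_clips)
class_means = rng.normal(0.0, 2.0, size=(n_classes, dim))
frames = class_means[labels][:, None, :] + rng.normal(0.0, 1.0, size=(n_clips, n_frames, dim))

# Global temporal pooling: average the frame embeddings over time,
# giving one fixed-size vector per recording.
pooled = frames.mean(axis=1)  # shape (n_clips, dim)

# Lightweight downstream classifier (SVM, as in the paper's best setup).
X_tr, X_te, y_tr, y_te = train_test_split(
    pooled, labels, test_size=0.25, random_state=0, stratify=labels
)
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
clf.fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)
print(f"held-out accuracy on synthetic features: {acc:.2f}")
```

To reproduce the paper's layer-wise analysis, one would repeat this pooling-plus-SVM step once per hidden layer of the pretrained model and compare held-out accuracies, which is how the reported advantage of early HuBERT layers over ASR-specialized top layers is measured.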