voice2mode: Phonation Mode Classification in Singing using Self-Supervised Speech Models
Daily Info Dashboard · 2026-02-14
AI Summary
- We present voice2mode, a method for classifying four singing phonation modes (breathy, neutral (modal), flow, and pressed) using embeddings extracted from large self-supervised speech models.
- Prior work on singing phonation has relied on handcrafted signal features or task-specific neural nets; this work evaluates the transferability of speech foundation models to singing phonation classification.
- voice2mode extracts layer-wise representations from HuBERT and two wav2vec2 variants, applies global temporal pooling, and classifies the pooled embeddings with lightweight classifiers (SVM, XGBoost).
- Experiments on a publicly available soprano dataset (763 sustained vowel recordings, four labels) show that foundation-model features substantially outperform conventional spectral baselines (spectrogram, mel-spectrogram, MFCC).
- HuBERT embeddings from early layers yield the best result (~95.7% accuracy with SVM), an absolute improvement of ~12-15% over the best traditional baseline.
#arXiv #paper #research/papers
Content Excerpt
We present voice2mode, a method for classification of four singing phonation modes (breathy, neutral (modal), flow, and pressed) using embeddings extracted from large self-supervised speech models. Prior work on singing phonation has relied on handcrafted signal features or task-specific neural nets; this work evaluates the transferability of speech foundation models to singing phonation classification. voice2mode extracts layer-wise representations from HuBERT and two wav2vec2 variants, applies global temporal pooling, and classifies the pooled embeddings with lightweight classifiers (SVM, XGBoost). Experiments on a publicly available soprano dataset (763 sustained vowel recordings, four labels) show that foundation-model features substantially outperform conventional spectral baselines (spectrogram, mel-spectrogram, MFCC). HuBERT embeddings obtained from early layers yield the best result (~95.7% accuracy with SVM), an absolute improvement of ~12-15% over the best traditional baseline. We also show layer-wise behaviour: lower layers, which retain acoustic/phonetic detail, are more effective than top layers specialized for Automatic Speech Recognition (ASR).
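The downstream pipeline described above (frame-level embeddings → global temporal pooling → lightweight classifier) can be sketched as follows. This is a minimal illustration, not the authors' code: synthetic class-separable features stand in for the real HuBERT/wav2vec2 layer outputs (which would come from e.g. a pretrained model with hidden states exposed), and all shapes and hyperparameters are assumptions.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)

# Stand-in for frame-level self-supervised embeddings with shape
# (n_clips, n_frames, hidden_dim). In voice2mode these would be taken
# from an early HuBERT layer; here we draw synthetic, class-separable
# features so the sketch runs without audio or model downloads.
n_clips, n_frames, dim, n_classes = 200, 50, 32, 4
labels = rng.integers(0, n_classes, size=n_clips)
class_means = rng.normal(0.0, 2.0, size=(n_classes, dim))
frames = class_means[labels][:, None, :] + rng.normal(0.0, 1.0, size=(n_clips, n_frames, dim))

# Global temporal pooling: average the frame embeddings over time,
# giving one fixed-size vector per recording.
pooled = frames.mean(axis=1)  # shape (n_clips, dim)

# Lightweight downstream classifier (SVM, as in the paper's best setup).
X_tr, X_te, y_tr, y_te = train_test_split(
    pooled, labels, test_size=0.25, random_state=0, stratify=labels
)
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
clf.fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)
print(f"held-out accuracy on synthetic features: {acc:.2f}")
```

To reproduce the paper's layer-wise analysis, one would repeat this pooling-plus-SVM step once per hidden layer of the pretrained model and compare held-out accuracies, which is how the reported advantage of early HuBERT layers over ASR-specialized top layers is measured.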