Acoustic and Tonal Modeling of the tpuri Language through a Multi-Modular Hybrid Approach
DOI: https://doi.org/10.4108/eetismla.11285
Keywords: tpuri, tonal language, Wav2Vec 2.0, YIN, STFT, adaptive fusion, speech recognition, low resource
Abstract
Automatic speech recognition (ASR) for tonal low-resource languages remains challenging due to the scarcity of labelled data and the need to model complex prosodic systems. This paper presents a hybrid multi-modular ASR architecture for tpuri, a Mboum-Day Niger-Congo language spoken in Cameroon and Chad that exhibits contrastive lexical tone, vowel length and nasalisation. The system combines a self-supervised Wav2Vec 2.0 acoustic encoder with a tonal processing module based on YIN pitch estimation and STFT-derived spectral features, and an adaptive fusion mechanism that integrates acoustic and tonal representations before decoding. We pre-train the acoustic encoder on 45 hours of read and spontaneous speech and fine-tune it on 19 hours 35 minutes of scripted speech. On the scripted test set, our best configuration reaches a word error rate (WER) of 10.4%, a phone error rate (PER) of 8.7% and a tone error rate (TER) of 6.1%. Ablation experiments show that removing the tonal module (+1.5 WER, +2.3 TER) or self-supervised pretraining (+3.4 WER) substantially degrades performance, while adaptive fusion and tone-aware data augmentation yield smaller but consistent gains. A fine-grained error analysis across tonal, grammatical, syllabic and morphological dimensions indicates that the architecture is particularly effective at modelling lexical tone and clause-level syntax, but still struggles with complex syllable structures and rich morphology. Overall, the results demonstrate that competitive ASR is attainable for under-resourced tonal languages such as tpuri by tightly coupling self-supervised acoustic modelling with explicit tonal representations, and provide a reusable blueprint for extending ASR to other Niger-Congo languages.
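The abstract describes the architecture only at a high level. As a purely illustrative aid (not the authors' implementation), the sketch below shows one plausible way to realise the described pipeline in Python with librosa and PyTorch: YIN pitch and low-band STFT energy as tonal features, and a per-frame gated ("adaptive") fusion of those features with Wav2Vec 2.0 frame embeddings before decoding. All function names, module names and dimensions here are assumptions for illustration.

```python
# Illustrative sketch only: a possible tonal feature extractor and gated fusion
# layer matching the abstract's description; not the paper's actual code.
import numpy as np
import torch
import torch.nn as nn
import librosa


def tonal_features(wave: np.ndarray, sr: int, hop: int = 320) -> torch.Tensor:
    """Frame-level tonal cues: YIN f0 contour plus low-band STFT energy."""
    f0 = librosa.yin(wave, fmin=60, fmax=400, sr=sr, hop_length=hop)      # (T,)
    spec = np.abs(librosa.stft(wave, n_fft=512, hop_length=hop))          # (257, T)
    low_energy = spec[:32].sum(axis=0)                                    # energy below ~1 kHz at 16 kHz
    feats = np.stack([np.log(f0 + 1e-6), np.log(low_energy + 1e-6)], axis=-1)
    return torch.from_numpy(feats).float()                                # (T, 2)


class AdaptiveFusion(nn.Module):
    """Per-frame gate deciding how much tonal evidence to blend into the acoustic stream."""

    def __init__(self, acoustic_dim: int = 768, tonal_dim: int = 2):
        super().__init__()
        self.tonal_proj = nn.Linear(tonal_dim, acoustic_dim)
        self.gate = nn.Sequential(nn.Linear(2 * acoustic_dim, acoustic_dim), nn.Sigmoid())

    def forward(self, acoustic: torch.Tensor, tonal: torch.Tensor) -> torch.Tensor:
        # acoustic: (B, T, 768) Wav2Vec 2.0 frames; tonal: (B, T, 2), time-aligned to the same hop
        tonal = self.tonal_proj(tonal)
        g = self.gate(torch.cat([acoustic, tonal], dim=-1))
        return acoustic + g * tonal  # fused representation passed on to the decoder
```

In practice the two feature streams would have to be trimmed or interpolated to the same number of frames (Wav2Vec 2.0 emits roughly one frame per 320 samples at 16 kHz), and the 2-dimensional tonal vector shown here stands in for whatever richer pitch/spectral parametrisation the paper actually uses.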
License
Copyright (c) 2025 Jules Paulin Bayang Souloukna, Patrick Nounamou Dabou, Paul Dabou Patrick, Kolyang

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
This is an open access article distributed under the terms of the CC BY-NC-SA 4.0, which permits copying, redistributing, remixing, transformation, and building upon the material in any medium so long as the original work is properly cited.