EmoFedProto: Privacy-Preserving Vietnamese Speech Emotion Recognition via Prototype-Based Federated Learning
DOI: https://doi.org/10.4108/airo.11595
Keywords:
Speech Emotion Recognition, Federated Learning, Prototype-Based Learning, Non-IID Data, Low-Resource Languages, Vietnamese Speech
Abstract
Speech Emotion Recognition (SER) plays a fundamental role in affective computing by enabling machines to infer human emotional states from vocal expressions. However, most existing SER systems rely on centralized training paradigms, which raise serious privacy concerns due to the sensitive nature of speech data. Federated Learning (FL) offers a privacy-preserving alternative by allowing collaborative model training without sharing raw data, yet its performance often degrades significantly under non-IID data distributions, a common characteristic of speech emotion datasets caused by speaker variability and emotion imbalance. To address these challenges, we propose EmoFedProto, a prototype-based federated learning framework with clustering-enhanced prototype aggregation tailored for Vietnamese speech emotion recognition in low-resource settings. Instead of exchanging full model parameters, EmoFedProto communicates class-level feature prototypes, enabling more robust alignment across heterogeneous clients. Experiments conducted on the VNEMOS dataset under realistic non-IID and few-shot conditions demonstrate that EmoFedProto achieves an accuracy of 0.875, outperforming the baseline FedProto (0.825), while reducing performance variability by 44%. These results indicate that clustering-based prototype federated learning is an effective and communication-efficient solution for privacy-preserving speech emotion recognition, particularly in low-resource languages and real-world federated environments.
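The mechanism described in the abstract lends itself to a compact illustration. Below is a minimal, hedged Python sketch of class-level prototype exchange with clustering-enhanced server aggregation: each client sends only per-class mean embeddings, and the server clusters the prototypes received for each class before averaging the dominant cluster. All function names, dimensions, the choice of k-means, and the dominant-cluster rule are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch of clustering-enhanced prototype aggregation in the
# spirit of EmoFedProto. Names, dimensions, and the k-means/dominant-cluster
# rule are assumptions for illustration, not the authors' implementation.
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)

def local_prototypes(features, labels):
    """Client side: average feature embeddings per emotion class."""
    return {int(c): features[labels == c].mean(axis=0) for c in np.unique(labels)}

def kmeans(X, k, iters=20):
    """Tiny k-means used to cluster client prototypes before averaging."""
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        assign = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(assign == j):
                centers[j] = X[assign == j].mean(axis=0)
    return centers, assign

def aggregate(client_protos, k=2):
    """Server side: cluster each class's client prototypes, then average the
    dominant cluster to damp outlier clients (non-IID robustness)."""
    by_class = defaultdict(list)
    for protos in client_protos:
        for c, p in protos.items():
            by_class[c].append(p)
    global_protos = {}
    for c, plist in by_class.items():
        X = np.stack(plist)
        if len(X) <= k:                      # too few clients to cluster
            global_protos[c] = X.mean(axis=0)
            continue
        _, assign = kmeans(X, k)
        dominant = np.bincount(assign).argmax()
        global_protos[c] = X[assign == dominant].mean(axis=0)
    return global_protos

# Toy federation: 4 clients, 4 emotion classes, 32-dim embeddings.
clients = []
for _ in range(4):
    labels = rng.integers(0, 4, size=40)
    feats = rng.normal(size=(40, 32)) + labels[:, None]  # class-shifted features
    clients.append(local_prototypes(feats, labels))
print({c: p[:3].round(2) for c, p in aggregate(clients).items()})
```

In a full FedProto-style round, clients would also receive the aggregated global prototypes back and add a prototype-distance term to their local loss; the sketch covers only the server-side aggregation step that distinguishes the clustering-enhanced variant.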
References
[1] Nguyen, T., Tran, T., & Truong, B. (2026). Human-Guided Reasoning with Large Language Models for Vietnamese Speech Emotion Recognition. arXiv preprint arXiv:2604.01711.
[2] Tran, L. T. T., Kim, H. G., La, H. M., & Van Pham, S. (2024). Automatic speech recognition of Vietnamese for a new large-scale corpus. Electronics, 13(5), 977.
[3] Luong, H. T., & Vu, H. Q. (2016, December). A non-expert Kaldi recipe for Vietnamese speech recognition system. In Proceedings of the Third International Workshop on Worldwide Language Service Infrastructure and Second Workshop on Open Infrastructures and Analysis Frameworks for Human Language Technologies (WLSI/OIAF4HLT2016) (pp. 51–55).
[4] Thanh, P. V., Huyen, N. T. T., Quan, P. N., & Trang, N. T. T. (2024, April). A robust pitch-fusion model for speech emotion recognition in tonal languages. In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 12386–12390). IEEE.
[5] Kairouz, P., McMahan, H. B., Avent, B., Bellet, A., Bennis, M., Bhagoji, A. N., Bonawitz, K., Charles, Z., Cormode, G., Cummings, R., D’Oliveira, R. G. L., Eichner, H., Rouayheb, S. E., Evans, D., Gardner, J., Garrett, Z., Gascon, A., Ghazi, B., Gibbons, P. B., . . . Zhao, S. (2021). Advances and open problems in federated learning. Proceedings of the IEEE, 109(1), 40–108.
[6] Tsouvalas, V., Ozcelebi, T., & Meratnia, N. (2022). Privacy-preserving speech emotion recognition through semi-supervised federated learning. arXiv preprint.
[7] Nandi, A., & Xhafa, F. (2022). A federated learning method for real-time emotion state classification from multimodal streaming. Methods, 204, 340–347.
[8] Gahlan, N., & Sethia, D. (2024). Federated learning in emotion recognition systems based on physiological signals for privacy preservation: A review. Multimedia Tools and Applications.
[9] Davari, M., Harooni, A., Nasr, A., Savoji, K., & Soleimani, M. (2024). Improving recognition accuracy for facial expressions using scattering wavelet. EAI Endorsed Transactions on AI and Robotics, 3. DOI: 10.4108/airo.5145.
[10] Mulukutla, V. K., Pavarala, S. S., Rudraraju, S. R., & Bonthu, S. (2025). Evaluating open-source vision language models for facial emotion recognition against traditional deep learning models. EAI Endorsed Transactions on AI and Robotics, 4. DOI: 10.4108/airo.8870.
[11] Xue, Z., Wang, B., et al. (2025). FDD-YOLO: A lightweight multi-scale prohibited items detection model. EAI Endorsed Transactions on AI and Robotics. DOI: 10.4108/airo.10277.
[12] Tan, Y., Long, G., Liu, L., Zhou, T., Lu, Q., Jiang, J., & Zhang, C. (2021). FedProto: Federated prototype learning across heterogeneous clients. arXiv preprint.
[13] Anh, N. Q., Ha, M. H., Nguyen, Q. C., Thi, T. H. N., Vu, Q., Minh-Duc, D. X., & Dinh, T. K. (2024). VNEMOS: Vietnamese speech emotion inference using deep neural networks. In Proceedings of the 9th International Conference on Integrated Circuits, Design, and Verification (ICDV) (pp. 97–101). IEEE.
[14] Sahu, A., Li, T., Sanjabi, M., Zaheer, M., Talwalkar, A., & Smith, V. (2018). Federated optimization in heterogeneous networks. arXiv preprint arXiv:1812.06127.
[15] Karimireddy, S. P., Kale, S., Mohri, M., Reddi, S. J., Stich, S. U., & Suresh, A. T. (2019). SCAFFOLD: Stochastic controlled averaging for federated learning. In International Conference on Machine Learning.
[16] Tan, Y., Long, G., Liu, L., Zhou, T., Lu, Q., Jiang, J., & Zhang, C. (2022). FedProto: Federated prototype learning across heterogeneous clients. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 36, No. 8, pp. 8432–8440).
[17] Tan, Y., Long, G., Ma, J., Liu, L., Zhou, T., & Jiang, J. (2022). Federated learning from pre-trained models: A contrastive learning approach. Advances in Neural Information Processing Systems, 35, 19332–19344.
[18] Dai, Y., Chen, Z., Li, J., Heinecke, S., Sun, L., & Xu, R. (2023, June). Tackling data heterogeneity in federated learning with class prototypes. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 37, No. 6, pp. 7314-7322).
[19] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.
[20] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., & Houlsby, N. (2020). An image is worth 16×16 words: Transformers for image recognition at scale. arXiv preprint.
[21] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., et al. (2020). An image is worth 16×16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
[22] Burkhardt, F., Paeschke, A., Rolfes, M., Sendlmeier, W., & Weiss, B. (2005). A database of German emotional speech. Interspeech 2005.
[23] Nguyen-Duc, Q.-A., Ha, M. H., Dinh, T. K., Pham, M. D., & Van, N. N. (2024). Emotional Vietnamese speech-based depression diagnosis using dynamic attention mechanism. arXiv preprint.
[24] He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 770–778).
[25] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., & Guo, B. (2021). Swin Transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 10012–10022).
[26] Howard, A., Sandler, M., Chu, G., Chen, L.-C., Chen, B., Tan, M., Wang, W., Zhu, Y., Pang, R., Vasudevan, V., Le, Q. V., & Adam, H. (2019). Searching for MobileNetV3. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 1314–1324).
[27] Schneider, S., Baevski, A., Collobert, R., & Auli, M. (2019). wav2vec: Unsupervised Pre-Training for Speech Recognition. Interspeech 2019.
[28] Hsu, W. N., Bolte, B., Tsai, Y. H. H., Lakhotia, K., Salakhutdinov, R., & Mohamed, A. (2021). HuBERT: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29, 3451–3460.
[29] Gong, Y., Chung, Y. A., & Glass, J. (2021). AST: Audio Spectrogram Transformer. Interspeech 2021.
License
Copyright (c) 2025 Quang-Anh Nguyen-Duc, Duc Minh Pham, Thai Dinh Kim, Thao Phuong Pham, Minh-Anh Nguyen, Xuan-Hai Le, Van-Ninh Nguyen

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
This is an open access article distributed under the terms of the CC BY-NC-SA 4.0 license, which permits copying, redistributing, remixing, transforming, and building upon the material in any medium so long as the original work is properly cited.
