HyperDyG: Hypergraph-Driven Dynamic Fusion for Semi-Supervised Multimodal Emotion Recognition
DOI: https://doi.org/10.4108/eetinis.131.10903

Keywords: Hypergraph learning, Multimodal fusion, Cross-modal transformer, Dynamic gating, Semi-supervised emotion recognition

Abstract
Speech emotion recognition (SER) plays an important role in healthcare, education, human–computer interaction, and customer service. Multimodal emotion recognition (MER) integrates audio and textual modalities to achieve a comprehensive understanding of human affect, but it still suffers from limited labeled data and complex cross-modal relations. To address these challenges, we propose HyperDyG, a dynamic hypergraph-driven MER framework that combines the strengths of dynamic hypergraph learning (DHL), a cross-modal transformer (CMT), and an adaptive gated multimodal unit (GMU) for robust multimodal fusion. HyperDyG is further enhanced with a semi-supervised learning strategy that incorporates weak–strong augmentation, confidence-filtered pseudo-labeling, and consistency regularization to exploit large-scale unlabeled data effectively. HyperDyG achieves state-of-the-art (SOTA) performance on the benchmark emotion dataset and maintains stable accuracy across varying unlabeled ratios. These findings highlight the effectiveness and scalability of the proposed architecture in real-world low-label MER scenarios.
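To make two of the abstract's components concrete, below is a minimal PyTorch sketch, not the authors' implementation: a bimodal gated multimodal unit in the style of Arevalo et al.'s GMU, and a FixMatch-style confidence-filtered pseudo-label loss for the unlabeled branch. All class names, dimensions, and the 0.95 threshold are illustrative assumptions, not values taken from the paper.

```python
# Hedged sketch of GMU fusion and confidence-filtered pseudo-labeling.
# Not the HyperDyG code; hyperparameters here are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GatedMultimodalUnit(nn.Module):
    """Fuse audio and text features with a learned, input-dependent gate."""

    def __init__(self, audio_dim: int, text_dim: int, fused_dim: int):
        super().__init__()
        self.proj_a = nn.Linear(audio_dim, fused_dim)
        self.proj_t = nn.Linear(text_dim, fused_dim)
        self.gate = nn.Linear(audio_dim + text_dim, fused_dim)

    def forward(self, h_a: torch.Tensor, h_t: torch.Tensor) -> torch.Tensor:
        # z weights the audio evidence; (1 - z) weights the text evidence.
        z = torch.sigmoid(self.gate(torch.cat([h_a, h_t], dim=-1)))
        return z * torch.tanh(self.proj_a(h_a)) + (1 - z) * torch.tanh(self.proj_t(h_t))


def pseudo_label_loss(logits_weak: torch.Tensor,
                      logits_strong: torch.Tensor,
                      threshold: float = 0.95) -> torch.Tensor:
    """FixMatch-style consistency loss: keep only confident pseudo-labels."""
    with torch.no_grad():
        probs = F.softmax(logits_weak, dim=-1)   # predictions on the weak view
        conf, pseudo = probs.max(dim=-1)         # confidence and hard pseudo-label
        mask = (conf >= threshold).float()       # confidence filter
    # Strongly augmented view must match the pseudo-label of the weak view.
    loss = F.cross_entropy(logits_strong, pseudo, reduction="none")
    return (loss * mask).mean()                  # low-confidence items contribute zero
```

In a full pipeline, h_a and h_t would come from pretrained audio and text encoders, and the unlabeled term above would be added, with a weighting coefficient, to the supervised cross-entropy computed on the labeled batch.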
License
Copyright (c) 2026 Nhut Minh Nguyen, Thanh Trung Nguyen, Thu Thuy Le, Ngoc-Hanh Dang, Luu Phuong Vo, Thanh Hien Lam, Duc Minh Ngoc Dang

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.