Multimodal-Driven Emotion-Controlled Facial Animation Generation Model

Authors

Qiu Z., Luo Y., Zhou Y., Gao T.

DOI:

https://doi.org/10.4108/eetsis.7624

Keywords:

Deep Learning, Computer Vision, Generative Adversarial Networks, Facial Animation Generation Technology, Multimodal

Abstract

INTRODUCTION: In recent years, facial animation generation has emerged as a prominent focus within computer vision, with varying degrees of progress achieved in lip-synchronization quality and emotion control.

OBJECTIVES: However, existing research often compromises lip movements during facial expression generation, thereby diminishing lip-synchronization accuracy. This study proposes a multimodal, emotion-controlled facial animation generation model to address this challenge.

METHODS: The proposed model comprises two custom deep-learning networks arranged sequentially. Given an expressionless target portrait image as input, the model generates high-quality, lip-synchronized, and emotion-controlled facial videos driven by three modalities: audio, text, and emotional portrait images.
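
For illustration only, the following minimal PyTorch-style sketch shows how such a two-stage pipeline could be organized, with an audio/text-driven motion predictor feeding an emotion-conditioned renderer. All module names, layer choices, and feature dimensions (LipMotionNet, EmotionRenderNet, 80-dim audio features, 68 facial landmarks) are illustrative assumptions, not the authors' implementation:

    # Minimal two-stage sketch (assumed architecture, not the paper's implementation).
    import torch
    import torch.nn as nn

    class LipMotionNet(nn.Module):
        # Stage 1 (assumed): predicts per-frame facial landmarks from audio + text features.
        def __init__(self, audio_dim=80, text_dim=256, landmark_dim=68 * 2):
            super().__init__()
            self.fuse = nn.GRU(audio_dim + text_dim, 256, batch_first=True)
            self.head = nn.Linear(256, landmark_dim)

        def forward(self, audio_feats, text_feats):
            # audio_feats: (B, T, audio_dim); text_feats: (B, T, text_dim), frame-aligned
            h, _ = self.fuse(torch.cat([audio_feats, text_feats], dim=-1))
            return self.head(h)  # (B, T, landmark_dim)

    class EmotionRenderNet(nn.Module):
        # Stage 2 (assumed): renders one frame from the neutral portrait, an emotional
        # reference portrait, and the landmarks predicted for that time step.
        def __init__(self, landmark_dim=68 * 2):
            super().__init__()
            self.enc = nn.Conv2d(6, 64, 3, padding=1)      # neutral + emotional portrait
            self.lm_proj = nn.Linear(landmark_dim, 64)
            self.dec = nn.Conv2d(64, 3, 3, padding=1)

        def forward(self, neutral_img, emotion_img, landmarks):
            # neutral_img, emotion_img: (B, 3, H, W); landmarks: (B, landmark_dim)
            feat = self.enc(torch.cat([neutral_img, emotion_img], dim=1))
            feat = feat + self.lm_proj(landmarks)[:, :, None, None]
            return torch.sigmoid(self.dec(feat))           # generated frame, (B, 3, H, W)

In this reading, the two networks run sequentially: the renderer is applied frame by frame to the landmark sequence produced by the first network.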

RESULTS: In this framework, text features play a critical supplementary role in predicting lip movements from the audio input, thereby enhancing lip-synchronization quality.
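
As an illustration of how text could supplement audio for lip-movement prediction, the sketch below aligns hypothetical per-phoneme text embeddings to the audio frame rate before concatenation; the alignment strategy, durations, and dimensions are assumptions rather than the paper's method:

    # Hypothetical audio-text fusion: repeat each phoneme embedding for its duration
    # (in audio frames), then concatenate with the audio features frame by frame.
    import numpy as np

    def align_text_to_audio(phoneme_embs, phoneme_durations, n_audio_frames):
        frames = np.repeat(phoneme_embs, phoneme_durations, axis=0)  # (sum(durations), D)
        if len(frames) < n_audio_frames:                             # pad with last phoneme
            pad = np.repeat(frames[-1:], n_audio_frames - len(frames), axis=0)
            frames = np.concatenate([frames, pad], axis=0)
        return frames[:n_audio_frames]                               # (n_audio_frames, D)

    # Example: 3 phonemes with 256-dim embeddings, 100 audio frames of 80-dim features
    text_per_frame = align_text_to_audio(np.random.randn(3, 256), [30, 40, 20], 100)
    audio_feats = np.random.randn(100, 80)
    fused = np.concatenate([audio_feats, text_per_frame], axis=1)    # (100, 336)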

CONCLUSION: Experimental findings indicate that, relative to two established facial animation generation methods, MakeItTalk and the Emotion-Aware Motion Model (EAMM), the proposed model reduces lip feature coordinate distance (L-LD) by 5.93% and 33.52%, respectively, and facial feature coordinate distance (F-LD) by 7.00% and 8.79%. These results substantiate the efficacy of the proposed model in generating high-quality, lip-synchronized, and emotion-controlled facial animations.
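
For context on the reported metrics, the sketch below computes a landmark coordinate distance and a relative reduction, assuming L-LD and F-LD are mean Euclidean distances over lip and full-face landmarks, respectively; the paper's exact normalization may differ, and the numbers in the final comment are illustrative only:

    # Landmark-distance evaluation sketch (assumed definition of L-LD / F-LD).
    import numpy as np

    def landmark_distance(pred, gt):
        # pred, gt: (T, N, 2) landmark coordinates over T frames; mean Euclidean distance
        return float(np.linalg.norm(pred - gt, axis=-1).mean())

    def relative_reduction(ours, baseline):
        # Percentage reduction of our distance relative to a baseline method
        return 100.0 * (baseline - ours) / baseline

    # Illustrative numbers only: a baseline L-LD of 2.53 and ours of 2.38 would give
    # relative_reduction(2.38, 2.53) = 5.93, i.e. a 5.93% reduction as reported above.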

References

[1] Liu, M. Y., Breuel, T., & Kautz, J. (2017). Unsupervised image-to-image translation networks. Advances in Neural Information Processing Systems, 30.

[2] Zhu, J. Y., Park, T., Isola, P., et al. (2017). Unpaired image-to-image translation using cycle-consistent adversarial networks. Proceedings of the 2017 IEEE International Conference on Computer Vision, 1(1), 2223-2232. https://doi.org/10.1109/ICCV.2017.240.

[3] Isola, P., Zhu, J. Y., Zhou, T., et al. (2017). Image-to-image translation with conditional adversarial networks. Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, 1(1), 1125-1134. https://doi.org/10.1109/CVPR.2017.632.

[4] Wang, K. F., Gou, C., Duan, Y. J., et al. (2017). Generative adversarial networks: the state of the art and beyond. Acta Automatica Sinica, 43(3), 321-332. https://doi.org/10.1016/j.automatica.2017.07.001.

[5] Sha, T., Zhang, W., Shen, T., et al. (2023). Deep person generation: a survey from the face, pose, and cloth synthesis perspective. ACM Computing Surveys, 55(12), 1-37. https://doi.org/10.1145/3574786.

[6] Chen, L., Cui, G., Kou, Z., et al. (2023). What comprises a good talking-head video generation?: A survey and benchmark. arXiv. [EB/OL]. [2023-03-18]. https://arxiv.org/pdf/2005.03201.

[7] Zhu, H., Luo, M. D., Wang, R., et al. (2021). Deep audio-visual learning: A survey. International Journal of Automation and Computing, 18, 351-376. https://doi.org/10.1007/s11633-021-1268-6.

[8] Jia, Z., Zhang, Z., Wang, L., et al. (2023). Human image generation: a comprehensive survey. arXiv. [EB/OL]. [2023-05-20]. https://arxiv.org/ftp/arxiv/papers/2212/2212.08896.

[9] Song, X. Y., Yan, Z. Y., Sun, M. Y., et al. (2023). Current status and development trend of speaker generation research. Computer Science, 50(08), 68-78.

[10] Liu, J., Li, Y., & Zhu, J. P. (2021). Generating 3D virtual human animation based on dual camera capturing facial expression and human posture. Journal of Computer Applications, 41(03), 839-844.

[11] Xia, Z. P., & Liu, G. P. (2016). Design and realisation of virtual teachers for operating guide in the 3D virtual learning environment. China Educational Technology, (5), 98-103.

[12] Zhou, W. B., Zhang, W. M., Yu, N. H., et al. (2021). An overview of deepfake forgery and defence techniques. Journal of Signal Processing, 37(12), 2338-2355. https://doi.org/10.1109/JSP.2021.9666106.

[13] Song, Y. F., Zhang, W., Chen, S. N., et al. (2023). A review of digital speaker video generation. Journal of Computer-Aided Design & Computer Graphics, 1(12), 1-12. [2023-11-29]. http://kns.cnki.net/kcms/detail/11.2925.tp.20231109.1024.002.html.

[14] Ji, X., Zhou, H., Wang, K., et al. (2021). Audio-driven emotional video portraits. Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 1(1), 14080-14089. https://doi.org/10.1109/CVPR46437.2021.01409.

[15] Liang, B., Pan, Y., Guo, Z., et al. (2022). Expressive talking head generation with granular audio-visual control. Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 1(1), 3387-3396. https://doi.org/10.1109/CVPR52688.2022.00346.

[16] Song, L., Wu, W., Qian, C., et al. (2022). Everybody’s talkin’: Let me talk as you want. IEEE Transactions on Information Forensics and Security, 17, 585-598. https://doi.org/10.1109/TIFS.2021.3080127.

[17] Thies, J., Elgharib, M., Tewari, A., et al. (2020). Neural voice puppetry: Audio-driven facial reenactment. Proceedings of the 16th European Conference on Computer Vision, 1(1), 716-731. https://doi.org/10.1007/978-3-030-58565-5_43.

[18] Wen, X., Wang, M., Richardt, C., et al. (2020). Photorealistic audio-driven video portraits. IEEE Transactions on Visualization and Computer Graphics, 26(12), 3457-3466. https://doi.org/10.1109/TVCG.2020.3004271.

[19] Chen, L., Maddox, R. K., Duan, Z., et al. (2019). Hierarchical cross-modal talking face generation with dynamic pixel-wise loss. Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 1(1), 7832-7841. https://doi.org/10.1109/CVPR.2019.00804.

[20] Song, Y., Zhu, J., Li, D., et al. (2023). Talking face generation by conditional recurrent adversarial network. arXiv. [EB/OL]. [2023-04-07]. https://arxiv.org/pdf/1804.04786.

[21] Zhou, Y., Han, X., Shechtman, E., et al. (2020). MakeItTalk: Speaker-aware talking-head animation. ACM Transactions on Graphics, 39(6), 1-15. https://doi.org/10.1145/3386569.3392455.

[22] Fang, Z., Liu, Z., Liu, T., et al. (2022). Facial expression GAN for voice-driven face generation. The Visual Computer, 38, 1-14. https://doi.org/10.1007/s00371-022-02250-7.

[23] Eskimez, S. E., Zhang, Y., & Duan, Z. (2021). Speech-driven talking face generation from a single image and an emotional condition. IEEE Transactions on Multimedia, 24, 3480-3490. https://doi.org/10.1109/TMM.2021.3062603.

[24] Ji, X., Zhou, H., Wang, K., et al. (2022). EAMM: One-shot emotional talking face via audio-based emotion-aware motion model. ACM SIGGRAPH 2022 Conference Proceedings, 1(1), 1-10. https://doi.org/10.1145/3532925.3532934.

[25] Zhen, R., Song, W., He, Q., et al. (2023). Human-computer interaction system: A survey of talking-head generation. Electronics, 12(1), 218. https://doi.org/10.3390/electronics12010218.

[26] Ma, Y., Wang, S., Hu, Z., et al. (2023). Styletalk: One-shot talking head generation with controllable speaking styles. arXiv. [EB/OL]. [2023-07-21]. https://arxiv.org/pdf/2301.01081.

[27] Sun, Y., Zhou, H., Wang, K., et al. (2022). Masked lip-sync prediction by audio-visual contextual exploitation in transformers. SIGGRAPH Asia 2022 Conference Papers, 1(1), 1-9. https://doi.org/10.1145/3550498.3550566.

[28] Wang, H., & Xia, S. H. (2015). Semantic blend shape method for video-driven facial animation. Journal of Computer-Aided Design & Computer Graphics, 27(5), 873-882. https://doi.org/10.11919/j.ijcgg.2015.05.015.

[29] Yang, S., Fan, B., Xie, L., et al. (2020). Speech-driven video-realistic talking head animation using 3D AAM. Proceedings of the 2020 IEEE International Conference on Robotics and Biomimetics, 1(1), 1511-1516. https://doi.org/10.1109/ROBIO49542.2020.9298980.

[30] Blais, A., & Ghosh, S. (2020). Review of deep learning methods in image-to-image translation. Journal of Computer Science, 10(2), 150-159. https://doi.org/10.3844/jcssp.2020.150.159.

[31] Chen, H., & Zhang, Y. (2023). A survey of 3D face reconstruction from a single image. The Visual Computer, 39(3), 533-547. https://doi.org/10.1007/s00371-022-02492-5.

[32] Zhang, Z., Liu, X., & Yang, C. (2022). Talking head video generation via audio-driven full-face synthesis. ACM Transactions on Graphics, 41(1), 1-14. https://doi.org/10.1145/3508358.

[33] Xu, H., Wang, T., & Wang, C. (2023). Exploring human-robot interaction through facial animation generation. Journal of Human-Robot Interaction, 12(4), 29-42. https://doi.org/10.1145/3585756.

[34] Kim, H., Lee, J., & Park, J. (2021). A novel approach for deep learning-based audio-visual synthesis. Journal of Multimedia Processing and Technologies, 12(4), 1-12. https://doi.org/10.13189/jmpt.2021.120401.

[35] Tan, Z., Luo, M., & Sun, X. (2022). Real-time facial animation based on audio-visual synthesis. IEEE Access, 10, 7992-8001. https://doi.org/10.1109/ACCESS.2022.3145763.

[36] Liu, M., Zhang, T., & Liu, Y. (2023). Face and voice synchronization in audio-visual speech synthesis: A survey. IEEE Transactions on Affective Computing, 14(3), 993-1009. https://doi.org/10.1109/TAFFC.2022.3146391.

[37] Zhang, H., & Zhao, J. (2023). A review of facial animation technology based on audio information. Journal of Computer Graphics Techniques, 12(1), 45-65. https://doi.org/10.22059/JGTT.2023.344723.1006673.

[38] Yang, X., Zhang, L., & Wang, X. (2022). Lip-sync generation for audio-driven talking head video. ACM Transactions on Intelligent Systems and Technology, 14(3), 1-24. https://doi.org/10.1145/3485129.

[39] Guo, Q., Liu, Y., & He, D. (2022). Lip-sync audio-visual synthesis based on generative adversarial networks. IEEE Transactions on Image Processing, 31, 2178-2191. https://doi.org/10.1109/TIP.2022.3146802.

[40] Ren, J., Xu, C., & Li, Y. (2023). Advances in audio-visual speech synthesis for digital humans. Journal of Digital Human Research, 2(1), 50-70. https://doi.org/10.1007/s42087-023-00017-y.

[41] Wang, J., Yu, Y., & Huang, Z. (2023). Multimodal learning for facial expression recognition: A comprehensive survey. International Journal of Computer Vision, 131(2), 211-236. https://doi.org/10.1007/s11263-022-01680-1.

[42] Espino-Salinas, C. H., Luna-García, H., Celaya-Padilla, J. M., Barría-Huidobro, C., Gamboa Rosales, N. K., Rondon, D., & Villalba-Condori, K. O. (2024). Multimodal driver emotion recognition using motor activity and facial expressions. Frontiers in Artificial Intelligence, 7, 1467051. https://doi.org/10.3389/frai.2024.1467051.

[43] Huang, Y., Chen, Z., & Zhang, L. (2022). Enhancing facial expression synthesis through attention-based generative networks. Computer Animation and Virtual Worlds, 33(6), e2180. https://doi.org/10.1002/cav.2180.

[44] Li, J., Zhao, H., & Xu, Y. (2021). Video-driven expressive talking head generation: Recent advances and challenges. ACM Transactions on Graphics, 40(4), 1-15. https://doi.org/10.1145/3462935.

[45] Wu, W., Chen, H., & Zhang, Y. (2023). A comprehensive review of multimodal emotion recognition systems. Artificial Intelligence Review, 56(3), 2473-2497. https://doi.org/10.1007/s10462-022-10124-2.

[46] Zhang, Y., Liu, F., & Zhang, H. (2022). Voice-driven facial expression synthesis based on deep learning techniques. Journal of Signal Processing, 38(7), 1234-1248. https://doi.org/10.1109/JSP.2022.3148221.

[47] Zhou, L., Wang, J., & Hu, L. (2023). Real-time facial animation from speech: A review of the state-of-the-art. IEEE Transactions on Computational Imaging, 9, 1234-1247. https://doi.org/10.1109/TCI.2023.3149796.

[48] Li, P., Zhao, H., Liu, Q., Tang, P., & Zhang, L. (2024). TellMeTalk: Multimodal-driven talking face video generation. Computers and Electrical Engineering, 114, 109049. https://doi.org/10.1016/j.compeleceng.2023.109049

[49] Wang, B., Zhu, X., Shen, F., Xu, H., & Lei, Z. (2025). PC-Talk: Precise Facial Animation Control for Audio-Driven Talking Face Generation. arXiv preprint arXiv:2503.14295. https://doi.org/10.48550/arXiv.2503.14295

[50] Song, H., & Kwon, B. (2024). Facial Animation Strategies for Improved Emotional Expression in Virtual Reality. Electronics, 13(13), 2601. https://doi.org/10.3390/electronics13132601

Published

17-07-2025

How to Cite

Qiu Z, Luo Y, Zhou Y, Gao T. Multimodal-Driven Emotion-Controlled Facial Animation Generation Model. EAI Endorsed Scal Inf Syst [Internet]. 2025 Jul. 17 [cited 2025 Sep. 1];12(4). Available from: https://publications.eai.eu/index.php/sis/article/view/7624