Evaluating Open-Source Vision Language Models for Facial Emotion Recognition Against Traditional Deep Learning Models

Authors

  • Vamsi Krishna Mulukutla, Vishnu Institute of Technology, Bhimavaram
  • Sai Supriya Pavarala, Vishnu Institute of Technology, Bhimavaram
  • Srinivasa Raju Rudraraju, Vishnu Institute of Technology, Bhimavaram
  • Sridevi Bonthu, Vishnu Institute of Technology, https://orcid.org/0000-0002-1971-4965

DOI:

https://doi.org/10.4108/airo.8870

Keywords:

Facial Emotion Detection, VLMs, Facial Expression Classification, Phi-3.5, CLIP

Abstract

Facial Emotion Recognition (FER) is crucial for applications such as human-computer interaction and mental health diagnostics. This study presents the first empirical comparison of open-source Vision-Language Models (VLMs), including Phi-3.5 Vision and CLIP, against traditional deep learning models—VGG19, ResNet-50, and EfficientNet-B0—on the challenging FER-2013 dataset, which contains 35,887 low-resolution, grayscale images across seven emotion classes. To address the mismatch between VLM training assumptions and the noisy nature of FER data, we introduce a novel pipeline that integrates GFPGAN-based image restoration with FER evaluation. Results show that traditional models, particularly EfficientNet-B0 (86.44%) and ResNet-50 (85.72%), significantly outperform VLMs like CLIP (64.07%) and Phi-3.5 Vision (51.66%), highlighting the limitations of VLMs in low-quality visual tasks. In addition to performance evaluation using precision, recall, F1-score, and accuracy, we provide a detailed computational cost analysis covering preprocessing, training, inference, and evaluation phases, offering practical insights for deployment. This work underscores the need for adapting VLMs to noisy environments and provides a reproducible benchmark for future research in emotion recognition.
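The per-class evaluation the abstract describes (precision, recall, F1-score, plus overall accuracy) can be sketched in a few lines. This is an illustrative implementation only, not the authors' evaluation code: the label set matches FER-2013's seven emotion classes, but the true/predicted labels below are hypothetical toy data.

```python
# Per-class precision, recall, and F1 plus overall accuracy, as used to
# compare the FER models. Labels follow FER-2013's seven emotion classes;
# the sample predictions are illustrative only.
EMOTIONS = ["angry", "disgust", "fear", "happy", "sad", "surprise", "neutral"]

def per_class_metrics(y_true, y_pred):
    """Return ({emotion: (precision, recall, f1)}, overall accuracy)."""
    metrics = {}
    for cls in EMOTIONS:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        metrics[cls] = (precision, recall, f1)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return metrics, accuracy

# Toy example: four labelled faces, one misclassified.
true_labels = ["happy", "sad", "happy", "neutral"]
pred_labels = ["happy", "sad", "sad", "neutral"]
m, acc = per_class_metrics(true_labels, pred_labels)
print(acc)  # 3 of 4 correct -> 0.75
```

On the real benchmark these per-class scores are what exposes the gap between the CNNs (EfficientNet-B0, ResNet-50) and the VLMs (CLIP, Phi-3.5 Vision) on low-resolution grayscale inputs.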


References

[1] Goodfellow IJ, Erhan D, Carrier PL, Courville A, Mirza M, Hamner B, Cukierski W, Tang Y, Thaler D, Lee DH, Zhou Y. Challenges in representation learning: A report on three machine learning contests. In Neural Information Processing: 20th International Conference, ICONIP 2013, Daegu, Korea, November 3-7, 2013, Proceedings, Part III (pp. 117-124). Springer Berlin Heidelberg.

[2] Tan M, Le Q. EfficientNet: Rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning 2019 May 24 (pp. 6105-6114). PMLR.

[3] Mollahosseini A, Hasani B, Mahoor MH. AffectNet: A database for facial expression, valence, and arousal computing in the wild. IEEE Transactions on Affective Computing. 2017 Aug 21;10(1):18-31.

[4] Touvron H, Lavril T, Izacard G, Martinet X, Lachaux MA, Lacroix T, Rozière B, Goyal N, Hambro E, Azhar F, Rodriguez A. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971. 2023 Feb 27.

[5] Zhai X, Kolesnikov A, Houlsby N, Beyer L. Scaling vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2022 (pp. 12104-12113).

[6] Li S, Deng W. Deep facial expression recognition: A survey. IEEE Transactions on Affective Computing. 2020 Mar 17;13(3):1195-215.

[7] Wang X, Li Y, Zhang H, Shan Y. Towards real-world blind face restoration with generative facial prior. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2021 (pp. 9168-9178).

[8] Mienye ID, Swart TG. A comprehensive review of deep learning: Architectures, recent advances, and applications. Information. 2024 Nov 27;15(12):755.

[9] Hatcher WG, Yu W. A survey of deep learning: Platforms, applications and emerging research trends. IEEE Access. 2018 Apr 27;6:24411-32.

[10] Jaiswal A, Raju AK, Deb S. Facial emotion detection using deep learning. In 2020 International Conference for Emerging Technology (INCET) 2020 Jun 5 (pp. 1-5). IEEE.

[11] Krizhevsky A, Sutskever I, Hinton GE. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems. 2012;25.

[12] Krizhevsky A, Sutskever I, Hinton GE. ImageNet classification with deep convolutional neural networks. Communications of the ACM. 2017 May 24;60(6):84-90.

[13] Connie T, Al-Shabi M, Cheah WP, Goh M. Facial expression recognition using a hybrid CNN–SIFT aggregator. In International Workshop on Multi-disciplinary Trends in Artificial Intelligence 2017 Oct 19 (pp. 139-149). Cham: Springer International Publishing.

[14] Abdin M, Aneja J, Awadalla H, Awadallah A, Awan AA, Bach N, Bahree A, Bakhtiari A, Bao J, Behl H, Benhaim A. Phi-3 technical report: A highly capable language model locally on your phone. arXiv preprint arXiv:2404.14219. 2024 Apr 22.

[15] Li Y, Liu H, Liang J, Jiang D. Occlusion-Robust Facial Expression Recognition Based on Multi-Angle Feature Extraction. Applied Sciences. 2025 May 6;15(9):5139.

[16] Karamizadeh S, Chaeikar SS, Najafabadi MK. Enhancing Facial Recognition and Expression Analysis With Unified Zero-Shot and Deep Learning Techniques. IEEE Access. 2025 Feb 26.

Carneiro T, Da Nóbrega RV, Nepomuceno T, Bian GB, De Albuquerque VH, Reboucas Filho PP. Performance analysis of Google Colaboratory as a tool for accelerating deep learning applications. IEEE Access. 2018 Oct 7;6:61677-85.

[17] Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I. Attention is all you need. Advances in Neural Information Processing Systems. 2017;30.

[18] Devlin J, Chang MW, Lee K, Toutanova K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) 2019 Jun (pp. 4171-4186).

[19] Bonthu S, Sree SR, Prasad MK. Framework for automation of short answer grading based on domain-specific pre-training. Engineering Applications of Artificial Intelligence. 2024 Nov 1;137:109163.

[20] Achiam J, Adler S, Agarwal S, Ahmad L, Akkaya I, Aleman FL, Almeida D, Altenschmidt J, Altman S, Anadkat S, Avila R. GPT-4 technical report. arXiv preprint arXiv:2303.08774. 2023 Mar 15.

[21] Radford A, Kim JW, Hallacy C, Ramesh A, Goh G, Agarwal S, Sastry G, Askell A, Mishkin P, Clark J, Krueger G. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning 2021 Jul 1 (pp. 8748-8763). PMLR.

[22] Mulukutla VK, Pavarala SS, Kareti VK, Midatani S, Bonthu S. Sentiment Analysis of Twitter Data on ‘The Agnipath Yojana’. In International Conference on Multi-disciplinary Trends in Artificial Intelligence 2023 Jun 24 (pp. 534-542). Cham: Springer Nature Switzerland.

[23] Glorot X, Bengio Y. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics 2010 Mar 31 (pp. 249-256). JMLR Workshop and Conference Proceedings.

[24] Carion N, Massa F, Synnaeve G, Usunier N, Kirillov A, Zagoruyko S. End-to-end object detection with transformers. In European Conference on Computer Vision 2020 Aug 23 (pp. 213-229). Cham: Springer International Publishing.

[25] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. 2014 Sep 4.

[26] Fergus P, Chalmers C, Matthews N, Nixon S, Burger A, Hartley O, Sutherland C, Lambin X, Longmore S, Wich S. Towards context-rich automated biodiversity assessments: deriving AI-powered insights from camera trap data. Sensors. 2024 Dec 19;24(24):8122.

[27] He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2016 (pp. 770-778).

[28] Raiko T, Valpola H, LeCun Y. Deep learning made easier by linear transformations in perceptrons. In Artificial Intelligence and Statistics 2012 Mar 21 (pp. 924-932). PMLR.

[29] Tan M, Le Q. EfficientNet: Rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning 2019 May 24 (pp. 6105-6114). PMLR.

[30] Li Z, Xie C, Cubuk ED. Scaling (down) CLIP: A comprehensive analysis of data, architecture, and training strategies. arXiv preprint arXiv:2404.08197. 2024 Apr 12.

[31] Ramprasath M, Anand MV, Hariharan S. Image classification using convolutional neural networks. International Journal of Pure and Applied Mathematics. 2018;119(17):1307-19.

[32] Wang W, Chen Z, Chen X, Wu J, Zhu X, Zeng G, Luo P, Lu T, Zhou J, Qiao Y, Dai J. VisionLLM: Large language model is also an open-ended decoder for vision-centric tasks. Advances in Neural Information Processing Systems. 2023 Dec 15;36:61501-13.

[33] Li Y, Hu B, Wang W, Cao X, Zhang M. Towards vision enhancing llms: Empowering multimodal knowledge storage and sharing in llms. arXiv preprint arXiv:2311.15759. 2023 Nov 27.

[34] Zhai X, Wang X, Mustafa B, Steiner A, Keysers D, Kolesnikov A, Beyer L. LiT: Zero-shot transfer with locked-image text tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2022 (pp. 18123-18133).

[35] Sarker IH. LLM potentiality and awareness: a position paper from the perspective of trustworthy and responsible AI modeling. Discover Artificial Intelligence. 2024 May 21;4(1):40.

[36] Raiaan MA, Mukta MS, Fatema K, Fahad NM, Sakib S, Mim MM, Ahmad J, Ali ME, Azam S. A review on large language models: Architectures, applications, taxonomies, open issues and challenges. IEEE Access. 2024 Feb 13;12:26839-74.

[37] Gidaris S, Komodakis N. Object detection via a multi-region and semantic segmentation-aware CNN model. In Proceedings of the IEEE International Conference on Computer Vision 2015 (pp. 1134-1142).

[38] Arla LR, Bonthu S, Dayal A. Multiclass spoken language identification for Indian languages using deep learning. In 2020 IEEE Bombay Section Signature Conference (IBSSC) 2020 Dec 4 (pp. 42-45). IEEE.

[39] Jyothi UP, Dabbiru M, Bonthu S, Dayal A, Kandula NR. Comparative analysis of classification methods to predict diabetes mellitus on noisy data. In Machine Learning, Image Processing, Network Security and Data Sciences: Select Proceedings of 3rd International Conference on MIND 2021 2023 Jan 1 (pp. 301-313). Singapore: Springer Nature Singapore.

[40] Khaireddin Y, Chen Z. Facial emotion recognition: State of the art performance on FER2013. arXiv preprint arXiv:2105.03588. 2021 May 8.

[41] Taori R, Dave A, Shankar V, Carlini N, Recht B, Schmidt L. Measuring robustness to natural distribution shifts in image classification. Advances in Neural Information Processing Systems. 2020;33:18583-99.


Published

11-08-2025

How to Cite

[1] V. K. Mulukutla, S. S. Pavarala, S. R. Rudraraju, and S. Bonthu, “Evaluating Open-Source Vision Language Models for Facial Emotion Recognition Against Traditional Deep Learning Models”, EAI Endorsed Trans AI Robotics, vol. 4, Aug. 2025.