A Unified Hand-Landmark-Based Deep Learning Framework for Static and Dynamic Vietnamese Sign Language Recognition

Authors

DOI:

https://doi.org/10.4108/eetismla.11690

Keywords:

Vietnamese Sign Language, Deep Learning, Hand Landmark, MediaPipe

Abstract

Sign language recognition plays a crucial role in supporting communication between deaf communities and hearing individuals. Vietnamese Sign Language (VSL) recognition in particular remains challenging due to the complexity of hand gestures and the scarcity of available datasets. Moreover, most existing approaches treat static gestures and dynamic sign phrases as separate recognition problems, often employing different feature representations and independent processing pipelines; this fragmentation increases system complexity and limits scalability for real-time sign language applications. This study proposes a unified hand-landmark-based deep learning framework that recognizes both static and dynamic VSL gestures within a single integrated system. The system first detects hand landmarks with the MediaPipe hand-tracking pipeline. Based on temporal motion analysis of the landmark sequences, a gesture-routing mechanism automatically determines whether the input is a static gesture or a dynamic sign phrase. Static gestures are classified with a convolutional neural network (CNN), while dynamic gestures are processed with a long short-term memory (LSTM) network that captures temporal dependencies. Experiments were conducted on a VSL dataset of 23 static hand signs and 4 dynamic sign phrases collected from multiple participants with variations in hand shape and gesture execution style. System performance was evaluated using accuracy, precision, recall, and F1-score. Experimental results demonstrate that the proposed framework achieves an average recognition accuracy of 92% for static gestures and 88.4% for dynamic sign phrases, outperforming traditional machine-learning baselines. The proposed system offers a practical and efficient solution for VSL recognition, with potential applications in real-time assistive communication systems.
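The gesture-routing step described in the abstract can be sketched as follows. MediaPipe's hand tracker emits 21 (x, y, z) landmarks per frame; averaging frame-to-frame landmark displacement over a short window gives a motion score, and a threshold routes the input to the static (CNN) or dynamic (LSTM) branch. This is a minimal illustrative sketch: the window length, threshold value, and function names are assumptions, not details taken from the paper.

```python
import math

def motion_score(frames):
    """Mean per-landmark displacement between consecutive frames.

    frames: list of frames, each a list of 21 (x, y, z) landmark tuples
    (normalized coordinates, as produced by MediaPipe Hands).
    """
    if len(frames) < 2:
        return 0.0
    total, count = 0.0, 0
    for prev, curr in zip(frames, frames[1:]):
        for p0, p1 in zip(prev, curr):
            total += math.dist(p0, p1)  # Euclidean distance per landmark
            count += 1
    return total / count

def route_gesture(frames, threshold=0.02):
    """Route to the static or dynamic branch by average landmark motion.

    The 0.02 threshold is a hypothetical value for illustration only.
    """
    return "dynamic" if motion_score(frames) > threshold else "static"

# A nearly still hand routes to the static CNN branch,
# while a translating hand routes to the dynamic LSTM branch.
still = [[(0.5, 0.5, 0.0)] * 21 for _ in range(10)]
moving = [[(0.5 + 0.05 * t, 0.5, 0.0)] * 21 for t in range(10)]
print(route_gesture(still), route_gesture(moving))  # static dynamic
```

In a real pipeline the window would slide over the live landmark stream, and the threshold would be calibrated on held-out recordings so that slow dynamic signs are not misrouted to the static branch.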


References

[1] V. Adithya, P. R. Vinod, and U. Gopalakrishnan, “Artificial neural network based method for Indian sign language recognition,” in Proc. IEEE Conf. on Information & Communication Technologies (ICT), Apr. 2013, pp. 1080–1085, doi: 10.1109/CICT.2013.6558259.

[2] R. J. Ruben, “Sign language: Its history and contribution to the understanding of the biological nature of language,” Acta Otolaryngol. (Stockh.), vol. 125, no. 5, pp. 464–467, May 2005, doi: 10.1080/00016480510026287.

[3] G. Plouffe and A.-M. Cretu, “Static and dynamic hand gesture recognition in depth data using dynamic time warping,” IEEE Trans. Instrum. Meas., vol. 65, no. 2, pp. 305–316, Feb. 2016, doi: 10.1109/TIM.2015.2498560.

[4] Z. Zhou et al., “Sign-to-speech translation using machine-learning-assisted stretchable sensor arrays,” Nat. Electron., vol. 3, no. 9, pp. 571–578, Sep. 2020, doi: 10.1038/s41928-020-0428-6.

[5] R. Cui, H. Liu, and C. Zhang, “Recurrent convolutional neural networks for continuous sign language recognition by staged optimization,” in Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), Jul. 2017, pp. 1610–1618, doi: 10.1109/CVPR.2017.175.

[6] S. Masood, A. Srivastava, H. C. Thuwal, and M. Ahmad, “Real-time sign language gesture (word) recognition from video sequences using CNN and RNN,” in Intelligent Engineering Informatics, V. Bhateja et al., Eds. Singapore: Springer, 2018, pp. 623–632, doi: 10.1007/978-981-10-7566-7_63.

[7] A. H. Vo, V.-H. Pham, and B. T. Nguyen, “Deep learning for Vietnamese sign language recognition in video sequence,” Int. J. Mach. Learn. Comput., vol. 9, no. 4, pp. 440–445, Aug. 2019, doi: 10.18178/ijmlc.2019.9.4.823.

[8] P. Molchanov, X. Yang, S. Gupta, K. Kim, S. Tyree, and J. Kautz, “Online detection and classification of dynamic hand gestures with recurrent 3D convolutional neural network,” in Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 4207–4215.

[9] O. Koller, “Quantitative survey of the state of the art in sign language recognition,” arXiv preprint arXiv:2008.09918, 2020.

[10] G. H. Samaan et al., “MediaPipe’s Landmarks with RNN for Dynamic Sign Language Recognition,” Electronics, vol. 11, no. 19, Oct. 2022, doi: 10.3390/electronics11193228.

[11] G. R. S. Murthy and R. S. Jadon, “A review of vision-based hand gestures recognition,” Int. J. Inf. Technol. Knowl. Manag., vol. 2, no. 2, pp. 405–410, 2009.

[12] P. Garg, N. Aggarwal, and S. Sofat, “Vision-based hand gesture recognition,” World Acad. Sci. Eng. Technol., vol. 49, pp. 972–977, 2009.

[13] O. K. Oyedotun and A. Khashman, “Deep learning in vision-based static hand gesture recognition,” Neural Comput. Appl., vol. 28, no. 12, pp. 3941–3951, 2017, doi: 10.1007/s00521-016-2294-8.

[14] F. R. Cordeiro, T. L. M. Barreto, J. P. Teixeira, and A. G. Guimarães, “A convolutional neural network with feature fusion for real-time hand posture recognition,” Appl. Soft Comput., vol. 73, pp. 748–766, 2018, doi: 10.1016/j.asoc.2018.09.010.

[15] P. Rathi, N. S. Chauhan, and R. Bhardwaj, “Sign language recognition using ResNet50 deep neural network architecture,” in Proc. Int. Conf. on Next Generation Computing Technologies (NGCT), 2020.

[16] P. Bhatia and A. Wadhawan, “Deep learning-based sign language recognition system for static signs,” Neural Comput. Appl., vol. 32, pp. 7957–7968, 2020.

[17] W. Wang, C. Wang, and J. Wu, “American sign language recognition using multidimensional hidden Markov models,” J. Inf. Sci. Eng., vol. 22, pp. 1109–1123, 2006.

[18] J. Huang, W. Zhou, Q. Zhang, H. Li, and W. Li, “Sign language recognition using 3D convolutional neural networks,” in Proc. IEEE Int. Conf. on Multimedia and Expo (ICME), 2015.

[19] C. Lugaresi et al., “MediaPipe: A framework for building perception pipelines,” arXiv preprint arXiv:1906.08172, 2019.

[20] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, pp. 436–444, 2015, doi: 10.1038/nature14539.

[21] S. Duan, L. Wu, A. Liu, and X. Chen, “Alignment-enhanced interactive fusion model for complete and incomplete multimodal hand gesture recognition,” IEEE Trans. Neural Syst. Rehabil. Eng., vol. 31, pp. 4661–4671, 2023.

[22] C. Li, D. Zhao, B. Zhang, W. Chu, and Y. Luan, “Research on a fast signal recognition method for a laser screen measurement system based on LSTM,” Measurement, vol. 254, p. 117905, 2025.

[23] N. Neverova, C. Wolf, G. Taylor, and F. Nebout, “ModDrop: Adaptive multi-modal gesture recognition,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 8, pp. 1692–1706, 2016.

[24] R. Bowden, “Deep Sign: Hybrid CNN-HMM for continuous sign language recognition,” in Proceedings of the British Machine Vision Conference (BMVC), 2016.

[25] G. Devineau, F. Moutarde, W. Xi, and J. Yang, “Deep learning for hand gesture recognition on skeletal data,” in Proc. 13th IEEE Int. Conf. on Automatic Face & Gesture Recognition (FG 2018), 2018, pp. 106–113.


Published

13-04-2026

How to Cite

Duong Thanh L, Pham Kim D. A Unified Hand-Landmark-Based Deep Learning Framework for Static and Dynamic Vietnamese Sign Language Recognition. EAI Endorsed Trans Int Sys Mach Lear App [Internet]. 2026 Apr. 13 [cited 2026 Apr. 13];3. Available from: https://publications.eai.eu/index.php/ismla/article/view/11690