NLP-Based Robust Object Detection and Recognition Through Multimodal Relation Graph Construction Using MPST AND DSWDCNN

Authors

DOI:

https://doi.org/10.4108/eetsis.11813

Keywords:

Object Detection and Recognition, Deep Learning, Natural Language Processing, Human Computer Interface (HCI) Applications, Multimodal Relation Graph Construction, Entity Relation Identification, Artificial Intelligence (AI)

Abstract

 

INTRODUCTION: Object detection and recognition play an important role in applications such as surveillance, autonomous driving, robotics, and medical imaging. However, traditional approaches do not analyze the explicit relationships, logical dependencies, and semantic conflicts that arise in text-rich or complex scenes, which reduces object recognition accuracy.

OBJECTIVES: A Natural Language Processing (NLP)-based robust object detection and recognition framework is developed through multimodal relation graph construction using Minimum Persistence Spanning Tree (MPST) and Deep Swim Wishart Distribution Convolutional Neural Network (DSWDCNN).

METHODS: Initially, images with their corresponding captions are collected. Image preprocessing and text preprocessing are then performed independently. From the preprocessed images, objects are detected using You Aspect-ratio Adaptive Anchors Only Look Once version-8 (YAAAOLOv8), followed by visualization. Meanwhile, entity relations are identified from the preprocessed text. The multimodal relation graph is then constructed using MPST. Next, features are extracted from the preprocessed text, relation graphs, detected objects, and visualized images, and multimodal analysis is carried out.
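As a rough illustration of the graph-construction step, the sketch below links detected objects and caption entities in one weighted graph and keeps only a spanning tree of the strongest relations. It uses plain Kruskal minimum spanning tree, not the MPST variant proposed in the paper, whose details are not given in this abstract. The node names, the "img:" and "txt:" prefixes, and the edge weights (lower means a stronger relation) are all hypothetical.

```python
from typing import Dict, List, Tuple

Edge = Tuple[float, str, str]  # (weight, node_a, node_b)


def kruskal_mst(nodes: List[str], edges: List[Edge]) -> List[Edge]:
    """Return the edges of a minimum spanning forest (standard Kruskal)."""
    parent: Dict[str, str] = {n: n for n in nodes}

    def find(x: str) -> str:
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree: List[Edge] = []
    for weight, a, b in sorted(edges):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:  # the edge joins two components, so no cycle
            parent[root_a] = root_b
            tree.append((weight, a, b))
    return tree


# Hypothetical multimodal graph: "img:" nodes stand for detected objects,
# "txt:" nodes for entities found in the caption. Weights are made up.
nodes = ["img:dog", "img:ball", "txt:dog", "txt:ball", "txt:park"]
edges = [
    (0.1, "img:dog", "txt:dog"),
    (0.2, "img:ball", "txt:ball"),
    (0.5, "txt:dog", "txt:ball"),
    (0.7, "txt:dog", "txt:park"),
    (0.8, "txt:ball", "txt:park"),
    (0.9, "img:dog", "img:ball"),
]

relation_tree = kruskal_mst(nodes, edges)
for weight, a, b in relation_tree:
    print(f"{a} -- {b} (weight={weight})")
```

A spanning tree keeps every node reachable while discarding redundant or weak relations, which is one plausible reason to prune a dense relation graph before feature extraction.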

RESULTS: In parallel, word embedding is performed on the preprocessed text. Finally, object recognition is carried out using DSWDCNN.

CONCLUSION: The proposed framework achieves an object recognition accuracy of 98.8569%, demonstrating its effectiveness under weakly supervised conditions.

 

Published

27-04-2026

How to Cite

1. Bo Feng, Yang J, Jingyue Xue. NLP-Based Robust Object Detection and Recognition Through Multimodal Relation Graph Construction Using MPST AND DSWDCNN. EAI Endorsed Scal Inf Syst [Internet]. 2026 Apr. 27 [cited 2026 Apr. 28];12(5). Available from: https://publications.eai.eu/index.php/sis/article/view/11813