Transformer-Based Object Detection with Deep Feature Fusion Using Carafe Operator in Remote Sensing Image

Shenao Chen; Bingqi Wang; Chaoliang Zhong

doi:10.4108/ew.3404

Authors

Shenao Chen Hangzhou Dianzi University
Bingqi Wang Beijing Forestry University
Chaoliang Zhong Hangzhou Dianzi University

DOI:

https://doi.org/10.4108/ew.3404

Keywords:

Remote sensing image, transformer, target detection

Abstract

Optical remote sensing images (ORSI) are now widely used in urban planning, military mapping, field surveys, and other areas, and target detection is one of their most important applications. In recent years, driven by deep learning, CNN-based target detection algorithms have achieved breakthroughs. However, because targets in ORSI vary in orientation and size, detection algorithms designed for ordinary optical images perform poorly when applied to ORSI directly, so improving the performance of object detection models on ORSI remains a difficult problem. To address it, this paper builds on the one-stage target detection model RetinaNet and proposes a more efficient and accurate network structure: a Transformer-Based Network with Deep Feature Fusion Using Carafe Operator (TRCNet). First, the backbone adopts the transformer-based PVT v2 structure, whose multi-head attention mechanism captures global information in optical images with complex backgrounds; the network depth is also increased to extract features better. Second, we introduce the CARAFE operator into the FPN structure of the neck to fuse high-level semantic features with low-level features more efficiently, further improving detection performance. Experiments on the well-known public datasets NWPU-VHR-10 and RSOD show that mAP increases by 8.4% and 1.7%, respectively. Comparisons with other advanced networks further show that the proposed network is effective.
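
The top-down fusion step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it runs in pure Python on a single-channel feature map, and the names `toy_kernel`, `carafe_like_upsample`, and `fuse_top_down` are hypothetical. In CARAFE the reassembly kernel for each output pixel is predicted by learned convolution layers; here that predictor is replaced by a hand-written score (an assumption) that favors neighbors similar to the source pixel and close to the output position. Clamping at the image border is also an assumption made for simplicity.

```python
import math


def softmax(logits):
    """Normalize logits so the reassembly weights are positive and sum to 1."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def clamp(value, low, high):
    return max(low, min(high, value))


def toy_kernel(feat, oi, oj, scale, k_up, sharpness=1.0):
    """Hypothetical stand-in for CARAFE's learned kernel prediction module.

    Returns k_up * k_up weights for output pixel (oi, oj).
    """
    h, w = len(feat), len(feat[0])
    r = k_up // 2
    i, j = oi // scale, oj // scale      # source location of this output pixel
    cy = (oi + 0.5) / scale - 0.5        # output pixel center in source coordinates
    cx = (oj + 0.5) / scale - 0.5
    logits = []
    for di in range(-r, r + 1):
        for dj in range(-r, r + 1):
            y = clamp(i + di, 0, h - 1)  # assumption: clamp at the border
            x = clamp(j + dj, 0, w - 1)
            content = abs(feat[y][x] - feat[i][j])
            dist2 = (i + di - cy) ** 2 + (j + dj - cx) ** 2
            logits.append(-sharpness * content - dist2)
    return softmax(logits)


def carafe_like_upsample(feat, scale=2, k_up=3):
    """Upsample by reassembling each k_up x k_up source neighborhood
    with a content-dependent, softmax-normalized kernel."""
    h, w = len(feat), len(feat[0])
    r = k_up // 2
    out = [[0.0] * (w * scale) for _ in range(h * scale)]
    for oi in range(h * scale):
        for oj in range(w * scale):
            i, j = oi // scale, oj // scale
            weights = toy_kernel(feat, oi, oj, scale, k_up)
            idx = 0
            for di in range(-r, r + 1):
                for dj in range(-r, r + 1):
                    y = clamp(i + di, 0, h - 1)
                    x = clamp(j + dj, 0, w - 1)
                    out[oi][oj] += weights[idx] * feat[y][x]
                    idx += 1
    return out


def fuse_top_down(lateral, top, scale=2, k_up=3):
    """FPN-style fusion: lateral (low-level) map plus the upsampled top map."""
    up = carafe_like_upsample(top, scale, k_up)
    return [[a + b for a, b in zip(row_l, row_u)]
            for row_l, row_u in zip(lateral, up)]


if __name__ == "__main__":
    top = [[1.0, 1.0, 5.0]] * 3                # coarse, high-level map
    lateral = [[0.0] * 6 for _ in range(6)]    # fine, low-level map
    for row in fuse_top_down(lateral, top):
        print(" ".join(f"{v:.2f}" for v in row))
```

Because the weights for each output pixel are a softmax, every upsampled value is a convex combination of nearby source values, so the upsampled map never leaves the value range of its input. A real CARAFE layer operates on multi-channel tensors and shares each predicted kernel across all channels.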
