Transformer-Based Object Detection with Deep Feature Fusion Using Carafe Operator (TRCNet) in Remote Sensing Image

Abstract: Optical remote sensing images (ORSI) are now widely used in urban planning, military mapping, field surveys, and related areas, and target detection is one of their most important applications. In recent years, driven by deep learning, CNN-based target detection algorithms have achieved breakthroughs. However, because targets in ORSI vary widely in orientation and size, detectors designed for ordinary optical images perform poorly when applied directly. Improving detection performance on ORSI therefore remains a difficult problem. To address it, this paper builds on the one-stage detector RetinaNet and proposes a more efficient and accurate network, the Transformer-Based Network with Deep Feature Fusion Using Carafe Operator (TRCNet). First, a transformer-based PVTv2 structure is adopted as the backbone, whose multi-head attention mechanism captures global information in optical images with complex backgrounds; the backbone depth is also increased to extract features better. Second, the CARAFE operator is introduced into the FPN structure of the neck so that high-level and low-level semantics are fused more efficiently, further improving detection performance. Experiments on the public NWPU-VHR-10 and RSOD datasets show that mAP increases by 8.4 and 1.7 percentage points, respectively. Comparisons with other advanced networks also indicate that the proposed network is effective.


Introduction
Over the past decades, with the rapid development of optical remote sensing technology, the resolution of ORSI has improved and the images carry much more information. Object detection in ORSI aims to identify high-value objects (aircraft, buildings, oil tanks, etc.) and locate them accurately, and it has been widely applied in urban planning [1] [2], military reconnaissance [3], etc.
With the evolution of deep learning frameworks, CNN-based target detection algorithms have advanced continually over the past ten years, and two main branches have emerged: two-stage detection models, represented by RCNN, and single-stage models, represented by YOLO [4] [5]. After feature extraction using a CNN, a two-stage model first uses an RPN to generate high-quality RoIs, then pools the RoIs, and finally regresses and classifies the bounding boxes. In contrast, a single-stage model regresses and classifies the bounding boxes directly. Two-stage models are slower but more accurate. Single-stage models run fast enough for real-time detection, but their accuracy is somewhat lower. Therefore, this paper focuses on improving the accuracy of the single-stage model while retaining its speed advantage.
CNN-based target detection algorithms achieve good results on ordinary optical images with simple, clear scenes, but ORSI differ from ordinary optical images taken by mobile phones in many ways [6]. ORSI are captured by satellites or aircraft flying at high altitudes, and this long-distance imaging produces targets with multiple sizes, resolutions, and orientations. In addition, the background of the targets is more complex and more varied [7]. A conventional CNN has a limited receptive field, which hinders the acquisition of global information in ORSI target recognition. The receptive field can be expanded by stacking more layers and using pooling, but doing so degrades small-target detection performance.
Furthermore, FPN has greatly advanced multi-scale target detection: it passes high-level semantics down the pyramid, up-samples them, and fuses them with low-level features to generate high-resolution feature maps with strong semantics, which improves the detection of small targets. However, because it uses nearest-neighbor up-sampling, which ignores the semantics of the feature map, it cannot use semantic information effectively during feature fusion and reassembly [8].
To tackle these problems, this paper proposes TRCNet, which is based on RetinaNet, a single-stage detection model. For the backbone, we use the transformer-based PVTv2 to obtain global information and perform global modeling, which counters the degradation in small-target detection caused by the insufficient receptive field of CNNs and by complex backgrounds. For the neck, we introduce the CARAFE operator into the FPN up-sampling process to guide up-sampling and fuse multi-level features efficiently. More specifically, our main contributions are as follows: A new network structure, TRCNet, is designed to detect multi-scale objects in ORSI with higher accuracy.
The network is built by improving RetinaNet. In the backbone, we introduce a transformer module for feature extraction. In the neck, we use FPN with CARAFE (FPNcarafe) to fuse features of different granularity more efficiently, and we explore backbones of different depths.
Experiments are conducted on the NWPU-VHR-10 and RSOD datasets to test the performance of TRCNet, and they demonstrate the validity of our method. The rest of the paper is organized as follows. The second part covers related work, introducing the evolution of target detection networks for satellite remote sensing images and the development of the transformer structure. We then discuss the proposed method in more detail, including the overall network structure and the principles of the backbone, neck, and detector. Next, the experimental part presents the results and the analysis of the ablation experiments, followed by a summary of the full text.

Evolution of Remote Sensing Target Detection
With the progress of CNN architectures, the performance of target detection algorithms has improved greatly, and many CNN-based algorithms have emerged. Depending on whether they use an RPN, these algorithms are generally divided into single-stage detectors such as the YOLO series [34], SSD [35], and RetinaNet [12], and two-stage detectors such as RCNN [36], Fast RCNN [37], and Faster RCNN [38]. Optical remote sensing images are characterized by scale diversity, a particular viewing angle, highly complex backgrounds, targets that are small relative to the background, and so on. However, the general detection algorithms above are not specially designed for these problems, and many researchers have worked to solve them. The RP-Faster R-CNN framework [9] is designed specifically for small target detection. Meanwhile, to improve detection adaptability, deformable convolution layers [10] and R-FCN have been combined [11]. In this paper, the well-known RetinaNet is further improved to achieve better performance in remote sensing target detection.

Transformer Structure
The transformer [13] was originally designed to solve NLP problems. Its self-attention mechanism models long-range dependencies in sequence input and has achieved great success in the NLP field. In recent years, researchers have worked to apply transformer modules to computer vision and have shown that they have great potential [14][15][16] to rival or even surpass CNNs in some fields. ViT [17] was the first to use a transformer as the backbone network. To adapt to computer vision tasks, the input image is divided uniformly into non-overlapping image patches. The transformer then uses its multi-head attention mechanism to model long-range relations among the patches and generates the feature maps needed by downstream tasks. Although ViT is effective, it is ill-suited to multi-scale target detection and high-resolution tasks because it cannot provide multi-scale feature maps and its multi-head attention is computationally expensive. PVT [18] effectively resolves these problems. It is a pyramid-structured transformer backbone with a spatial-reduction attention mechanism, so it still performs well on multi-scale target detection and high-resolution tasks.
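The division of an image into non-overlapping patches can be illustrated with a short NumPy sketch. The function name `split_into_patches` and the toy sizes are hypothetical and chosen only for illustration; a real ViT additionally applies a learned linear projection and position embeddings, which are omitted here.

```python
import numpy as np

def split_into_patches(image, patch):
    """Split an (H, W, C) image into non-overlapping patch x patch blocks.

    Assumes H and W are divisible by `patch`. Returns an array of shape
    (num_patches, patch * patch * C), one flattened patch per row,
    ordered row by row over the patch grid.
    """
    h, w, c = image.shape
    grid = image.reshape(h // patch, patch, w // patch, patch, c)
    # Bring the two grid axes to the front, then flatten each patch.
    return grid.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * c)

# Toy example: a 4 x 4 single-channel image split into four 2 x 2 patches.
toy_image = np.arange(16).reshape(4, 4, 1)
toy_patches = split_into_patches(toy_image, 2)
```

Each row of `toy_patches` plays the role of one token in the sequence that the transformer attends over.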

EAI Endorsed Transactions on Energy Web
Chen et al.
The overall structure of TRCNet is shown in Figure 1 and consists of three main modules: backbone, neck, and head. Satellite remote sensing images are fed into TRCNet, and the transformer-based backbone extracts features to obtain the multi-scale feature maps C2, C3, C4, and C5. Then C3, C4, and C5 are passed to the neck for finer feature fusion. The neck is mainly composed of an FPN with the CARAFE operator. After receiving the multi-scale feature maps from the backbone, the FPN-CARAFE module performs further feature extraction and fuses high-level and low-level features, finally producing the refined feature maps P3, P4, P5, P6, and P7. The head takes these feature maps as input, classifies the objects, and regresses the bounding boxes, which yields the final detection results. The backbone and the neck are explained in detail below.
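To make the data flow more concrete, the sketch below estimates the spatial size of each feature map for a given input size. The strides used here (4, 8, 16, 32 for C2 to C5 and 8 to 128 for P3 to P7) are assumptions taken from common PVTv2 and RetinaNet configurations, not values stated in this paper, and the function name is hypothetical. Ceiling division is used as an approximation of how strided layers round odd sizes.

```python
import math

# Assumed strides (not given in the text): typical PVTv2 stage outputs
# and typical RetinaNet pyramid levels.
BACKBONE_STRIDES = {"C2": 4, "C3": 8, "C4": 16, "C5": 32}
NECK_STRIDES = {"P3": 8, "P4": 16, "P5": 32, "P6": 64, "P7": 128}

def feature_map_sizes(height, width, strides):
    """Return {level: (h, w)} using ceiling division by each stride."""
    return {
        name: (math.ceil(height / s), math.ceil(width / s))
        for name, s in strides.items()
    }

# Example with a hypothetical 608 x 1024 input.
backbone_sizes = feature_map_sizes(608, 1024, BACKBONE_STRIDES)
neck_sizes = feature_map_sizes(608, 1024, NECK_STRIDES)
```

Under these assumptions, C3, C4, and C5 share their resolutions with P3, P4, and P5, which is what allows the neck to fuse them level by level.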

Backbone
Objects in ORSI are cluttered and vary markedly in size, the contrast between background and objects is low, and the objects suffer serious interference. Therefore, how to detect objects accurately in ORSI deserves careful study [39]. A conventional CNN has a limited receptive field, which makes it difficult to acquire global information in ORSI target recognition. Although the receptive field can be expanded by stacking more layers and using pooling, doing so degrades small-target detection performance.
Similar to RetinaNet's conventional hierarchical backbone based on ResNet-50, our transformer-based PVTv2 backbone consists of four stages that output multi-scale feature maps. A Conv1 module precedes the first stage. First, the input image is preprocessed. All stages adopt a similar structure built from PVTv2 blocks. We use PVTv2 blocks with depths of 3, 4, 6, and 3 in the first, second, third, and fourth stages, respectively, as shown in Figure 2. In the attention computation, the channel dimension of each attention head is d = C/N. The linear spatial-reduction attention (LSRA) uses average pooling to reduce the spatial size of the feature map before attention, which lowers the computational and memory cost compared with MHA. In this way, the transformer block can extract long-range dependencies with an essentially global receptive field. In addition, within the convolutional feed-forward layer (CFFL), the PVTv2 block inserts a 3 × 3 depth-wise convolution with a GELU [60] activation layer between the two fully connected layers.
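The idea of pooling the feature map before attention can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions, not the PVTv2 implementation: the function names, the random projection matrices, and the pooled size are hypothetical, and the extra convolution, normalization, activation, and output projection used in real LSRA blocks are omitted. Only the queries keep the full resolution, while keys and values are computed from the average-pooled tokens, and each head works on d = C/N channels.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def avg_pool_tokens(x, h, w, pool):
    """Average-pool an (h*w, C) token map down to (pool*pool, C).

    Assumes h and w are divisible by `pool`.
    """
    c = x.shape[1]
    grid = x.reshape(pool, h // pool, pool, w // pool, c)
    return grid.mean(axis=(1, 3)).reshape(pool * pool, c)

def pooled_attention(x, h, w, num_heads, pool, wq, wk, wv):
    """Multi-head attention with spatially reduced keys and values."""
    n, c = x.shape
    d = c // num_heads                      # per-head dimension d = C / N
    reduced = avg_pool_tokens(x, h, w, pool)
    q = (x @ wq).reshape(n, num_heads, d).transpose(1, 0, 2)
    k = (reduced @ wk).reshape(-1, num_heads, d).transpose(1, 0, 2)
    v = (reduced @ wv).reshape(-1, num_heads, d).transpose(1, 0, 2)
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d))  # (N, n, pool*pool)
    out = (attn @ v).transpose(1, 0, 2).reshape(n, c)
    return out, attn

# Toy example: an 8 x 8 token map with C = 8 channels and N = 2 heads.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((64, 8))
wq, wk, wv = (rng.standard_normal((8, 8)) for _ in range(3))
out, attn = pooled_attention(tokens, 8, 8, num_heads=2, pool=2,
                             wq=wq, wk=wk, wv=wv)
```

In this toy setting every query attends to only 4 pooled positions instead of 64, which is the source of the reduced computation and memory cost.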

Neck
Feature up-sampling is vital for multi-scale object detection. Since the feature pyramid was proposed, it has become increasingly common to up-sample high-level features and fuse them with low-level ones. However, traditional up-sampling methods do not use the semantics of the feature map, which limits the potential of feature fusion. Deconvolution learns its up-sampling kernel through the network. Although it uses semantics, it introduces many parameters and much computation, and it applies the same kernel at every position of the feature map, so it fails to use the semantics efficiently. By contrast, the CARAFE operator has a large receptive field and an up-sampling kernel that depends on the semantics of the feature map, which strengthens multi-scale target detection after multi-level feature fusion without introducing too many parameters or much computation [8].
Therefore, we retain the neck of the original RetinaNet, which adopts the FPN structure to fuse the multi-level feature maps after up-sampling, and introduce the CARAFE operator into the up-sampling step to improve the fusion.
CARAFE includes two steps, as shown in Figure 3.
Step 1 generates an up-sampling kernel based on the semantics of the input feature map.
Step 2 reassembles the features of the input feature map with the generated up-sampling kernel to complete the up-sampling.
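Suppose the input feature map is $X$ of size $C \times H \times W$ and the up-sampling ratio is $\sigma$. After passing through the CARAFE operator, a new feature map $Y$ of size $C \times \sigma H \times \sigma W$ is generated. It is expressed by the following formulas:

$$W_{L} = \psi\left(N(X_{l}, k_{encoder})\right)$$

$$Y_{L} = \phi\left(N(X_{l}, k_{up}), W_{L}\right)$$

where $L = (i', j')$ is the target location on $Y$, $l = (i, j)$ is the mapped source location on $X$, and $i = \lfloor i'/\sigma \rfloor$, $j = \lfloor j'/\sigma \rfloor$. $N(X_{l}, k)$ denotes the neighborhood of size $k \times k$ centered on $l$. The kernel prediction function $\psi$ and the reassembly function $\phi$ are detailed later.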

Kernel Prediction Module
The kernel prediction module generates up-sampling kernels based on the semantics of the input feature map. Each source location on $X$ corresponds to $\sigma^2$ target locations on $Y$, where $\sigma$ is the up-sampling ratio. Each target location has an up-sampling kernel of size $k_{up} \times k_{up}$.
To generate the up-sampling kernels, the first step compresses the input feature channels from $C$ to $C_m$ with a $1 \times 1$ convolution, which speeds up the subsequent operations. Then, a convolution layer with kernel size $k_{encoder} \times k_{encoder}$ and $\sigma^2 k_{up}^2$ output channels generates the up-sampling kernels. Each up-sampling kernel is spatially normalized by softmax.
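The kernel prediction steps described above (channel compression, content encoding, and softmax normalization) can be sketched with NumPy. This is a minimal illustration under assumptions: the weights are random rather than learned, the naive convolution is written for clarity rather than speed, the rearrangement of channels into spatial positions follows the usual pixel-shuffle convention, and the sizes (up-sampling ratio 2, up-sampling kernel 5, encoder kernel 3) are commonly used CARAFE settings rather than values confirmed by this paper. All function names are hypothetical.

```python
import numpy as np

def conv2d_same(x, weight):
    """Naive zero-padded convolution.

    x: (C_in, H, W); weight: (C_out, C_in, k, k) with odd k.
    """
    c_out, _, k, _ = weight.shape
    r = k // 2
    _, h, w = x.shape
    padded = np.pad(x, ((0, 0), (r, r), (r, r)))
    out = np.zeros((c_out, h, w))
    for i in range(h):
        for j in range(w):
            patch = padded[:, i:i + k, j:j + k]
            out[:, i, j] = np.tensordot(weight, patch, axes=3)
    return out

def predict_kernels(x, w_compress, w_encode, sigma, k_up):
    """Return kernels of shape (k_up*k_up, sigma*H, sigma*W)."""
    compressed = conv2d_same(x, w_compress)      # channel compressor (1 x 1)
    encoded = conv2d_same(compressed, w_encode)  # content encoder
    _, h, w = encoded.shape
    # Pixel shuffle: move the sigma*sigma factor from channels to space.
    encoded = encoded.reshape(k_up * k_up, sigma, sigma, h, w)
    kernels = encoded.transpose(0, 3, 1, 4, 2).reshape(
        k_up * k_up, h * sigma, w * sigma)
    # Softmax over the k_up*k_up entries of each target location.
    kernels = np.exp(kernels - kernels.max(axis=0, keepdims=True))
    return kernels / kernels.sum(axis=0, keepdims=True)

# Toy example with random (untrained) weights.
rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 4, 5))            # C = 8, H = 4, W = 5
w_c = rng.standard_normal((4, 8, 1, 1))          # compress 8 -> 4 channels
w_e = rng.standard_normal((2 * 2 * 5 * 5, 4, 3, 3))
kernels = predict_kernels(feat, w_c, w_e, sigma=2, k_up=5)
```

Because of the softmax, the weights of every predicted kernel are non-negative and sum to one, so the later reassembly is a convex combination of neighboring features.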

For each up-sampling kernel $W_{L}$, the content-aware reassembly module reassembles the features of the input feature map within a local region through the function $\phi$ to obtain the up-sampled feature map; $\phi$ is simply a weighted sum. For a target location $L$ on the output $Y$ and the corresponding square region $N(X_{l}, k_{up})$ centered at $l = (i, j)$ on the input $X$, the calculation is:

$$Y_{L} = \sum_{n=-r}^{r} \sum_{m=-r}^{r} W_{L}(n, m) \cdot X(i+n, j+m)$$

where $r = \lfloor k_{up}/2 \rfloor$.
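The weighted reassembly can be sketched in NumPy as follows. This is a minimal sketch assuming zero padding at the borders and kernels stored as an array of shape (k_up·k_up, σH, σW); the function name and layout are hypothetical, and the loops favor readability over speed. A useful sanity check is that a kernel with all its weight on the center position reduces the operator to nearest-neighbor up-sampling.

```python
import numpy as np

def carafe_reassemble(x, kernels, sigma, k_up):
    """Content-aware reassembly as a per-location weighted sum.

    x: (C, H, W) input feature map.
    kernels: (k_up*k_up, sigma*H, sigma*W), one kernel per target location.
    Returns the up-sampled map of shape (C, sigma*H, sigma*W).
    """
    c, h, w = x.shape
    r = k_up // 2
    padded = np.pad(x, ((0, 0), (r, r), (r, r)))  # zero padding (assumption)
    out = np.zeros((c, h * sigma, w * sigma))
    for ti in range(h * sigma):
        for tj in range(w * sigma):
            i, j = ti // sigma, tj // sigma           # mapped source location
            patch = padded[:, i:i + k_up, j:j + k_up]  # neighborhood of (i, j)
            weight = kernels[:, ti, tj].reshape(k_up, k_up)
            out[:, ti, tj] = (patch * weight).sum(axis=(1, 2))
    return out

# Sanity check: a one-hot kernel at the center copies the source feature,
# so the result equals nearest-neighbor up-sampling.
feat = np.arange(24, dtype=float).reshape(2, 3, 4)
center_kernels = np.zeros((9, 6, 8))
center_kernels[4] = 1.0                               # center of a 3 x 3 kernel
upsampled = carafe_reassemble(feat, center_kernels, sigma=2, k_up=3)
```

With learned, content-dependent kernels in place of the one-hot kernel, each output location instead mixes the features of its neighborhood according to the local semantics.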

Neck Finishing Process
First, the three feature maps C5, C4, and C3 are received from the backbone. They are then up-sampled, refined by further feature extraction, and fused to obtain P7, P6, P5, P4, and P3 for downstream target detection. In this process, the Conv2 module contains a convolution layer (kernel size = 1 × 1, stride = 1, channels = 256).
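The top-down part of this process can be sketched as follows. The exact formula of the paper is not reproduced here: this sketch assumes the standard FPN layout (a 1 × 1 lateral convolution per level followed by element-wise addition with the up-sampled higher level), omits the smoothing convolutions and the extra levels P6 and P7, and uses small toy channel counts instead of 256. The `upsample` argument is a hypothetical hook showing where a CARAFE-style operator would replace nearest-neighbor up-sampling.

```python
import numpy as np

def conv1x1(x, weight):
    """1 x 1 convolution. x: (C_in, H, W); weight: (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, x)

def nearest_upsample(x, sigma=2):
    """Nearest-neighbor up-sampling, the default FPN choice."""
    return x.repeat(sigma, axis=1).repeat(sigma, axis=2)

def top_down_fusion(c3, c4, c5, laterals, upsample=nearest_upsample):
    """Fuse C3..C5 into P3..P5; `upsample` could be swapped for CARAFE."""
    p5 = conv1x1(c5, laterals["c5"])
    p4 = conv1x1(c4, laterals["c4"]) + upsample(p5)
    p3 = conv1x1(c3, laterals["c3"]) + upsample(p4)
    return p3, p4, p5

# Toy example with random weights and 5 output channels.
rng = np.random.default_rng(0)
c3 = rng.standard_normal((4, 8, 8))
c4 = rng.standard_normal((6, 4, 4))
c5 = rng.standard_normal((8, 2, 2))
laterals = {
    "c3": rng.standard_normal((5, 4)),
    "c4": rng.standard_normal((5, 6)),
    "c5": rng.standard_normal((5, 8)),
}
p3, p4, p5 = top_down_fusion(c3, c4, c5, laterals)
```

Passing the up-sampling step as a function keeps the fusion logic unchanged when the operator is replaced, which mirrors how the neck retains the FPN structure and only changes the up-sampling.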

Head and Loss
This section introduces the detection head and the loss we use. As in the original RetinaNet, the detection head is a weight-sharing predictor built from convolutions. It has two branches, which predict each anchor's category and the regression parameters of the target bounding box, respectively.
Positive and negative samples are assigned with the same matching strategy as RetinaNet. Each anchor is compared with the pre-labeled ground-truth (GT) boxes by IoU: an anchor whose IoU with a GT box is greater than 0.5 is a positive sample, an anchor whose IoU with every GT box is less than 0.4 is a negative sample, and the remaining anchors are discarded.
The total loss consists of a classification loss and a regression loss. The classification loss is computed over both positive and negative samples, whereas the regression loss is computed only over positive samples.
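The matching rule can be written out in a few lines of plain Python. The thresholds 0.5 and 0.4 come from the text; the box format (x1, y1, x2, y2), the function names, and the label strings are assumptions made for this illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def assign_anchors(anchors, gt_boxes, pos_thr=0.5, neg_thr=0.4):
    """Label each anchor by its best IoU with any GT box."""
    labels = []
    for anchor in anchors:
        best = max((iou(anchor, gt) for gt in gt_boxes), default=0.0)
        if best > pos_thr:
            labels.append("positive")
        elif best < neg_thr:
            labels.append("negative")
        else:
            labels.append("ignored")   # discarded during training
    return labels

# Toy example: one GT box and three anchors with IoU 0.8, 0.45, and 0.
gt = [(0, 0, 10, 10)]
anchors = [(0, 0, 10, 8), (0, 0, 10, 4.5), (20, 20, 30, 30)]
labels = assign_anchors(anchors, gt)
```

The band between the two thresholds keeps ambiguous anchors out of both the classification and the regression loss.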

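$$Loss = \frac{1}{N_{pos}} \sum_{i \in pos \cup neg} L_{cls}^{i} + \frac{1}{N_{pos}} \sum_{j \in pos} L_{reg}^{j}$$

where $L_{cls}$ indicates the sigmoid focal loss, $L_{reg}$ indicates the L1 loss, $N_{pos}$ denotes the number of positive samples, $pos$ denotes all the positive samples, and $neg$ denotes all the negative samples.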
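A minimal plain-Python sketch of such a loss is given below for a single class. The focal-loss parameters alpha = 0.25 and gamma = 2 are the commonly used RetinaNet defaults and are an assumption here, since the text does not state them; normalizing both terms by the number of positive samples and all function names are likewise assumptions of this sketch.

```python
import math

def sigmoid_focal_loss(logit, target, alpha=0.25, gamma=2.0):
    """Focal loss for one binary prediction (target is 0 or 1)."""
    p = 1.0 / (1.0 + math.exp(-logit))
    p_t = p if target == 1 else 1.0 - p
    alpha_t = alpha if target == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

def l1_loss(pred, target):
    """Sum of absolute differences over the box regression parameters."""
    return sum(abs(p - t) for p, t in zip(pred, target))

def total_loss(cls_logits, cls_targets, box_preds, box_targets):
    """Classification over positives and negatives, regression over positives.

    cls_logits / cls_targets cover all kept anchors; box_preds / box_targets
    cover the positive anchors only. Both sums are divided by the number of
    positive samples (at least 1 to avoid division by zero).
    """
    num_pos = max(1, sum(cls_targets))
    cls = sum(sigmoid_focal_loss(z, t) for z, t in zip(cls_logits, cls_targets))
    reg = sum(l1_loss(p, t) for p, t in zip(box_preds, box_targets))
    return (cls + reg) / num_pos

# Toy example: one positive and one negative anchor, both with logit 0.
loss = total_loss(
    cls_logits=[0.0, 0.0],
    cls_targets=[1, 0],
    box_preds=[[1.0, 2.0, 3.0, 4.0]],
    box_targets=[[1.0, 2.0, 3.0, 5.0]],
)
```

The modulating factor (1 - p_t) ** gamma shrinks the contribution of well-classified anchors, which matters because negatives greatly outnumber positives in a one-stage detector.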

Dataset
The NWPU VHR-10 dataset, published by Northwestern Polytechnical University in 2014 [23,24,25], contains 10 categories of objects: airplanes, ships, storage tanks, baseball diamonds, tennis courts, basketball courts, ground track fields, harbors, bridges, and vehicles. The dataset contains 800 very high resolution (VHR) remote sensing images (RSI) derived from Google Earth and the Vaihingen dataset, manually annotated by experts. We randomly selected 640 images as the training set and used the remaining 160 images as the test set.
RSOD is an open dataset for target detection in RSI. It covers four kinds of objects: aircraft, fuel tanks, playgrounds, and overpasses. It includes 4,993 aircraft in 446 images, 191 playgrounds in 189 images, 180 overpasses in 176 images, and 1,586 fuel tanks in 165 images. We use 885 images as the training set and the remaining 216 images as the test set.
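A random split of this kind can be reproduced with a fixed seed, as in the sketch below. The seed, the function name, and the use of integer image identifiers are assumptions for illustration; the actual split used in the experiments is not specified beyond the set sizes.

```python
import random

def split_dataset(image_ids, num_train, seed=0):
    """Shuffle the ids with a fixed seed and cut them into train and test."""
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)
    return ids[:num_train], ids[num_train:]

# NWPU VHR-10 style split: 800 images into 640 for training, 160 for testing.
train_ids, test_ids = split_dataset(range(800), 640)
```

Fixing the seed makes the split repeatable, so that ablation runs are compared on the same test images.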

Implementation Details
We train TRCNet using PyTorch on a PC with 4 cores of an Intel (R) Xeon (R) Silver 4110 CPU @ 2.10 GHz, 16 GB of RAM, and an NVIDIA RTX 2080 Ti GPU. Random flipping is used for data augmentation during training. The backbone is initialized with pretrained weights, and the remaining parameters are randomly initialized with Xavier initialization. The image scale is (1000, 600) with the aspect ratio kept, the maximum number of epochs is 72, the batch size is 4, and the optimizer is AdamW. The learning rate and the weight decay are set to the same value, and warm-up is applied. The VOC2007 11-point metric [33] is used to evaluate the performance of the proposed method. The results are shown in Figures 4, 5, and 6.
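For reference, the VOC2007 11-point metric averages the interpolated precision at the recall levels 0, 0.1, ..., 1.0. The sketch below assumes the detections of one class are already sorted by descending confidence and flagged as true positive (1) or false positive (0); the function names are hypothetical, and mAP is the mean of the per-class AP values.

```python
def precision_recall(tp_flags, num_gt):
    """Cumulative precision and recall over ranked detections."""
    recalls, precisions, tp = [], [], 0
    for rank, flag in enumerate(tp_flags, start=1):
        tp += flag
        recalls.append(tp / num_gt)
        precisions.append(tp / rank)
    return recalls, precisions

def voc07_ap(recalls, precisions):
    """11-point interpolated average precision."""
    ap = 0.0
    for k in range(11):
        threshold = k / 10.0
        # Interpolated precision: best precision at recall >= threshold.
        candidates = [p for r, p in zip(recalls, precisions) if r >= threshold]
        ap += max(candidates) if candidates else 0.0
    return ap / 11.0

# Toy example: three ranked detections (hit, miss, hit) for two GT objects.
recalls, precisions = precision_recall([1, 0, 1], num_gt=2)
ap = voc07_ap(recalls, precisions)
```

In the toy example the interpolated precision is 1.0 for the six recall levels up to 0.5 and 2/3 for the remaining five, which gives an AP of (6 + 5 · 2/3) / 11.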

Comparison Results with the Latest Methods
The performance of TRCNet is evaluated quantitatively in this section. We compare it with several advanced methods on the NWPU VHR-10 dataset; the results are given in Table 1. The comparison includes general target detection algorithms (FCOS [26], R-FCN [27], Cascade R-CNN [28], AugFPN [29]) and methods designed for remote sensing image object detection (MS-FF [30], HRBM [31], SHDET [32]). The experimental results (mAP, AP) in the tables below are given in percent (%).

Ablation Study
To test whether TRCNet works, we designed four ablation experiments on the NWPU-VHR-10 dataset and finally verified TRCNet on the RSOD dataset. We use RetinaNet with a ResNet-50 backbone as the baseline.

Analysis of TRCNet Backbone
We replaced the ResNet-50 backbone of the original RetinaNet with the transformer-based PVTv2_b2. After training for 72 epochs on the NWPU-VHR-10 dataset, Table 2 shows that the mAP of the baseline is 0.843, while the mAP of baseline_PVTv2_b2 is 0.917, an increase of 7.4 percentage points. This is because the transformer has a large receptive field and can model long-range dependencies. It extracts features from satellite remote sensing images more effectively than a CNN structure when the background interference is strong and the objects are distributed irregularly.

Analysis of Different Depths of TRCNet Backbone
We replaced RetinaNet's backbone with PVTv2 modules of different depths, PVTv2_b0, PVTv2_b1, and PVTv2_b2, to explore their performance. As Table 3 shows, their mAP values are 0.895, 0.900, and 0.917, respectively. As the network depth increases, the mAP trends upward, because the backbone's feature extraction is strengthened and the extracted feature information is richer. Considering the number of network parameters, this paper does not go deeper than b2.

Analysis of FPN_carafe
On the basis of the baseline, we introduce the CARAFE operator into the up-sampling feature fusion module of FPN, train the model, and compare it with the baseline. Table 4 shows that after introducing the CARAFE operator, the mAP of baseline_FPNcarafe is 0.870, which is 2.7 percentage points higher than the baseline. This is because the CARAFE operator guides feature fusion more efficiently when FPN fuses up-sampled features. The fused features are also more accurate and richer than those obtained with plain FPN fusion, which yields better performance in multi-scale target detection.

Analysis of backbone and FPN_carafe
On the basis of the baseline, we introduce the CARAFE operator into the up-sampling feature fusion module of FPN and replace the backbone with PVTv2_b2, then train the model and compare it with the baseline. Table 5 shows that after introducing the CARAFE operator and replacing the backbone, the mAP of baseline_PVTv2_b2_FPNcarafe on the NWPU-VHR-10 dataset is 0.927, which is 8.4 percentage points higher than the baseline. On the RSOD dataset, its mAP is 0.929, which is 1.6 percentage points higher than the baseline. This shows that PVTv2_b2 and CARAFE can take effect together.
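The gains quoted in the ablation study are absolute differences in mAP, expressed in percentage points. The short sketch below recomputes them from the NWPU-VHR-10 values reported in the text; the dictionary keys and the helper name are illustrative labels only.

```python
BASELINE_MAP = 0.843  # RetinaNet with ResNet-50 on NWPU-VHR-10, from the text

reported_map = {
    "PVTv2_b2 backbone": 0.917,
    "FPN with CARAFE": 0.870,
    "PVTv2_b2 backbone + FPN with CARAFE": 0.927,
}

def gain_in_points(map_value, baseline=BASELINE_MAP):
    """Absolute mAP gain over the baseline, in percentage points."""
    return round((map_value - baseline) * 100, 1)

gains = {name: gain_in_points(value) for name, value in reported_map.items()}
```

Read this way, the combined model gains less than the sum of the two individual gains (7.4 + 2.7), which suggests that the two modifications partly overlap in what they improve.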

Conclusion
This paper explores the feasibility of a target detection algorithm based on a new transformer structure for RSI with serious background interference, varied target sizes, and unevenly distributed geospatial objects. To further improve the accuracy of remote sensing target detection, we address two problems: the performance loss caused by the insufficient global modeling ability of traditional CNN-based models, and the inefficient feature fusion caused by the uniform up-sampling of the FPN network. In the proposed TRCNet, we introduce a hierarchical transformer-based PVTv2_b2 module as the backbone to extract features, which yields more accurate and richer feature maps than a CNN-based backbone. The CARAFE operator is then introduced into the multi-level feature fusion of the FPN network. This operator uses the semantics of the upper feature map to guide the up-sampling process instead of up-sampling uniformly, which makes the feature fusion of the FPN network more efficient and accurate and improves multi-scale target detection. Finally, compared with RetinaNet, the mAP of TRCNet on NWPU-VHR-10 and RSOD increases by 8.4 and 1.7 percentage points, respectively. Furthermore, we compare the results on NWPU VHR-10 with those of other advanced networks, which supports the effectiveness of the proposed network.