Enhancing Single-Image Super-Resolution using Patch-Mosaic Data Augmentation Method on Lightweight Bimodal Network

With the advancement of deep learning, single-image super-resolution (SISR) has made significant strides. However, most current SISR methods are difficult to deploy in real-world applications because their complex operations incur substantial computational and memory costs. Furthermore, an efficient dataset is a key factor in better model training. Hybrid models of CNN and Vision Transformer can be more efficient for the SISR task; nevertheless, they require large or extremely high-quality training datasets that may not always be available. To tackle these issues, this research proposes combining a Lightweight Bimodal Network (LBNet) with the Patch-Mosaic data augmentation method, an enhancement of CutMix and YOCO. With patch-oriented Mosaic data augmentation, an efficient Symmetric CNN is utilized for local feature extraction and coarse image restoration, while a Recursive Transformer fully grasps the long-term dependence of images, enabling global information to be exploited to refine texture details. Extensive experiments show that LBNet with the proposed data augmentation method, which adds zero extra parameters, outperforms the original LBNet and other state-of-the-art techniques that apply image-level data augmentation.


Introduction
* Corresponding authors. Email: (1) nqtoan@g.hongik.ac.kr and (2) hieutq10@fpt.edu.vn

The main objective of single-image super-resolution (SISR) is to restore a high-resolution (HR) image with rich details and improved visual quality from a degraded low-resolution (LR) image. Because of their powerful feature extraction ability, convolutional neural network (CNN)-based SISR methods have recently surpassed traditional methods. [1], for example, developed the Super-Resolution Convolutional Neural Network (SRCNN). Afterward, with the introduction of ResNet [2] and DenseNet [3], a plethora of CNN-based SISR models, such as VDSR [4], EDSR [5], and RCAN [6], were proposed. All these methodologies demonstrate that the deeper the network, the more effectively it performs. Nonetheless, due to limited storage and computing capabilities, these methods are challenging to implement in real-world environments. Therefore, research into models that achieve better performance while remaining lightweight has become promising. A recursive mechanism, as in DRCN [7] or DRRN [8], is one of the most common approaches; the other is to establish lightweight structures, such as CARN [9], FDIWN [10], and PFFN [11]. Although these models lower the number of parameters to some extent through various techniques and structures, they also degrade performance, making it difficult to reconstruct high-quality images with rich details. Furthermore, neural networks for SISR unquestionably require massive datasets to attain acceptable performance, and it can be difficult to gather qualified data samples or improve their quality or quantity. Most standard data augmentations are conducted at the image level, resulting in overall performance gains in both generality and robustness.
Image-level augmentation typically preserves semantics globally, following human cognitive intuition. Humans, however, can recognize objects from only a portion of the information. Patches, or image-internal information, are powerful natural signals. Several low-level vision [12][13] and high-level vision [14][15][16] works used patches prior to the deep learning era. A major element of the Vision Transformer (ViT) is the splitting of a single image into multiple non-overlapping patches as input for neural networks [17]. Notwithstanding, research on how to implement non-image-level (i.e., patch-oriented) data augmentations is rare.
The proposed Patch-Mosaic could play a vital role in augmenting datasets for training computer vision models by providing a zero-parameter yet efficient method; it is hypothesized that it is suitable as a strategy for conducting augmentations beyond the image level. An image can be viewed as a patchwork of several patches. A specific augmentation can be performed on these pieces individually, and the randomly transformed pieces can then be combined back into a single image via the Mosaic data augmentation method introduced in YOLOv4 [18], an improvement of CutMix [19], which combines four training images in specific ratios into one. Mosaic accomplishes this by resizing each of the four images, stitching them together, and then selecting a random cutout from the stitched result to create the final Mosaic image. This strategy increases diversity at both the local-region and whole-image levels, and it may also encourage neural networks to attain the same cognitive ability as humans in recognizing objects from partial information. As a result, it boosts the dataset for significantly better training and model performance. In summary, the main contributions are as follows:

• Patch-Mosaic is proposed as a data augmentation method for improving SISR model performance.
• Patch-Mosaic is used in conjunction with the Lightweight Bimodal Network (LBNet) [20] to improve model capability on popular test sets in scenarios where low computational cost is required.
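To make the Mosaic operation described above concrete, the following is a minimal sketch under simplifying assumptions: the function name and the fixed 2×2 stitching are illustrative, and YOLOv4's Mosaic additionally rescales each image by random ratios before cropping, which is omitted here.

```python
import numpy as np

def mosaic(imgs, rng=np.random.default_rng(0)):
    """Simplified Mosaic: stitch four equally sized images into a 2x2
    grid, then take a random crop of the original size."""
    assert len(imgs) == 4
    h, w, c = imgs[0].shape
    grid = np.empty((2 * h, 2 * w, c), dtype=imgs[0].dtype)
    grid[:h, :w], grid[:h, w:] = imgs[0], imgs[1]
    grid[h:, :w], grid[h:, w:] = imgs[2], imgs[3]
    # random top-left corner of the cutout inside the stitched canvas
    top = rng.integers(0, h + 1)
    left = rng.integers(0, w + 1)
    return grid[top:top + h, left:left + w]

four = [np.full((8, 8, 3), i, dtype=np.uint8) for i in range(4)]
out = mosaic(four)
assert out.shape == (8, 8, 3)
```

The crop keeps the output the same size as each input while mixing content from up to four images, which is what gives Mosaic its local-region diversity.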

Single-image Super-resolution
CNN-based. SISR methods based on CNNs have advanced significantly in recent years [21]. For example, SRCNN [1] was the first to apply a CNN to SISR and achieved competitive performance at the time. EDSR [5] significantly improved model performance by incorporating residual blocks [2]. RCAN [6] developed an 800-layer network and introduced the channel attention mechanism. Several lightweight SISR models have also been introduced in recent years, in addition to these deep networks. For example, [9] used the cascade mechanism to propose the lightweight Cascaded Residual Network (CARN). Using a distillation and selective fusion strategy, [22] proposed the Information Multi-Distillation Network (IMDN). MADNet [23] improved multi-scale feature representation and learning by employing a dense lightweight network. [24] proposed a simple but effective deep lightweight SISR model capable of generating convolutional kernels adaptively from each position's local information. Nevertheless, the effectiveness of these lightweight models is limited because they lack access to larger receptive fields and global information.
Transformer-based. Many Transformer-based methods for computer vision tasks have recently been proposed, promoting the development of SISR. [25] proposed a pre-trained Image Processing Transformer for image restoration. [26] proposed SwinIR, which produced excellent results by applying the Swin Transformer directly to the image restoration task. [27] proposed the Efficient Super-Resolution Transformer (ESRT) for SISR, using a lightweight Transformer and a feature separation strategy to decrease GPU memory consumption. Nevertheless, none of these models consider the fusion of CNN and Transformer, making it difficult to strike the best balance between model size and performance.

Patches in data augmentation
EAI Endorsed Transactions on Industrial Networks and Intelligent Systems (Online First)

Patches are commonly used as strong signals in traditional and learning-based vision tasks. Examples of applications include texture synthesis [28], bag-of-features-based classification [14][15][16], image denoising [13], super-resolution [29], and image-to-image translation [30][31]. CNNs [32] and ViTs [17] have recently adopted patches as input for classification networks. Patches are also utilized for data augmentation. [33] applies Gaussian blur to only a subset of the image. [34] creates large labeled instance datasets by cutting and pasting object instances onto backgrounds. [35] overlays 2D object images onto images of real-world environments. CutMix [19] replaces one image patch with a patch from another image. PAA [36] extends the AutoAug [37] configuration by searching for augmentation policies in pre-defined patches. [38] improves the robustness of ViTs through patch-based negative augmentation. YOCO [39] applies patch data augmentation by cutting an image into two pieces, augmenting each piece individually, and then recombining the two pieces into the original layout. Even so, no previous research has looked into performing the same augmentation at a non-image level together with Mosaic.

Lightweight Bimodal Network (LBNet)
Network Architecture. The Lightweight Bimodal Network (LBNet) [20], as shown in figure 1, is primarily composed of a Symmetric CNN, a Recursive Transformer, and a reconstruction module. The Symmetric CNN performs local feature extraction, whereas the Recursive Transformer is designed to learn the long-term dependence of images. I_LR, I_SR, and I_HR denote the input LR image, the reconstructed SR image, and the corresponding HR image, respectively. A 3×3 convolutional layer is employed at the model's head to extract shallow features:

F_sf = f_sf(I_LR),

where F_sf is the extracted shallow features and f_sf() denotes the convolutional layer. The shallow features are then transmitted to the Symmetric CNN for local feature extraction:

F_CNN = f_CNN(F_sf),

where f_CNN() is the Symmetric CNN and F_CNN is the extracted local features. One of the most important components of LBNet is the Symmetric CNN, made up of several pairs of parameter-sharing Local Feature Fusion Modules (LFFMs) and channel attention modules. The following section goes over these modules.
Following that, all of these features are given to the Recursive Transformer for long-term dependence learning:

F_RT = f_RT(F_CNN),

where f_RT() represents the Recursive Transformer and F_RT is the feature refined by global information. Finally, the refined features F_RT and the shallow features F_sf are combined and passed to the reconstruction module to reconstruct the SR image:

I_SR = f_build(F_RT + F_sf),

where f_build() represents the reconstruction module, constituted of a 3×3 convolutional layer and a pixel-shuffle layer.
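As a sketch of the pixel-shuffle step in the reconstruction module, the following shows the generic depth-to-space rearrangement that such a layer performs (an illustrative NumPy re-implementation, not LBNet's trained layer):

```python
import numpy as np

def pixel_shuffle(x, r):
    """Depth-to-space: rearrange a (C*r*r, H, W) feature map into
    (C, H*r, W*r), trading channels for spatial resolution."""
    c_r2, h, w = x.shape
    c = c_r2 // (r * r)
    x = x.reshape(c, r, r, h, w)      # (C, r, r, H, W)
    x = x.transpose(0, 3, 1, 4, 2)    # (C, H, r, W, r)
    return x.reshape(c, h * r, w * r)

# 48 channels collapse into 3 channels upscaled by a factor of 4
feat = np.arange(3 * 16 * 4 * 4, dtype=np.float32).reshape(3 * 16, 4, 4)
sr = pixel_shuffle(feat, 4)
assert sr.shape == (3, 16, 16)
```

This is why the module can upscale without transposed convolutions: the preceding 3×3 convolution produces r² times more channels, which the shuffle folds into the spatial grid.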
LBNet is optimized with the L1 loss function [40] throughout training. Given the training dataset {I_LR^i, I_HR^i} (i = 1, ..., N), the optimization objective is

θ̂ = arg min_θ (1/N) Σ_{i=1}^{N} ||F(I_LR^i) − I_HR^i||_1,

where θ denotes the LBNet parameter set, F(I_LR) = I_SR denotes the reconstructed SR image, and N represents the number of training images.
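The L1 objective is simply the mean absolute pixel error between reconstructions and ground truths; a minimal NumPy sketch:

```python
import numpy as np

def l1_loss(sr, hr):
    """Mean absolute error between a batch of reconstructed SR images
    F(I_LR) and their HR ground truths."""
    return np.mean(np.abs(sr - hr))

hr = np.ones((2, 3, 8, 8))           # batch of 2 "HR" images
sr = np.full((2, 3, 8, 8), 0.75)     # reconstructions off by 0.25
assert np.isclose(l1_loss(sr, hr), 0.25)
```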

Symmetric CNN and Recursive Transformer.
The Symmetric CNN is a CNN specifically designed for extracting local features. It consists primarily of paired parameter-sharing Local Feature Fusion Modules (LFFMs) and Channel Attention (CA) modules. The parameter sharing between every two symmetrical modules improves the balance between parameters and performance. Moreover, each pair of parameter-sharing modules is merged via the channel attention module to fully utilize the extracted features.
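For intuition, a generic squeeze-and-excitation style channel attention can be sketched as below; the weight matrices and reduction ratio are illustrative assumptions, not LBNet's actual CA parameters:

```python
import numpy as np

def channel_attention(x, w1, w2):
    """Channel attention: global average pooling over space, a
    two-layer bottleneck, a sigmoid gate, then channel re-weighting."""
    squeeze = x.mean(axis=(1, 2))                  # (C,) channel statistics
    hidden = np.maximum(w1 @ squeeze, 0.0)         # ReLU bottleneck
    gate = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))    # sigmoid gate in (0, 1)
    return x * gate[:, None, None]                 # re-weight each channel

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 8, 8))                # (C, H, W) feature map
w1 = rng.standard_normal((8, 32)) * 0.1            # reduction ratio 4 (assumed)
w2 = rng.standard_normal((32, 8)) * 0.1
y = channel_attention(x, w1, w2)
assert y.shape == x.shape
```

The gate learns to emphasize informative channels and suppress uninformative ones, which is how the CA module fuses each pair of parameter-sharing branches.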
The Symmetric CNN is a dual-branch network, as illustrated in figure 1. The shallow feature F_sf is first transferred to the top branch, and the output of each LFFM in the top branch is used as input to the corresponding LFFM in the bottom branch. The branch outputs are then fused and compressed, so that the most informative features extracted at various levels are forwarded to the following part of the network to learn the long-term dependence of images. The Symmetric CNN relies mainly on the LFFM. Figure 2 shows that the LFFM is essentially an improvement of DenseBlock [3]. In contrast to DenseBlock, (1) FRDAB is applied in place of the original convolutional layer to strengthen feature extraction; (2) a 1×1 group convolutional layer is used before each FRDAB to reduce the dimension; and (3) local residual learning is introduced for better information transmission. In the operation of the LFFM, F_FRB^i denotes the output of the i-th (i = 1, 2, 3) FRDAB module, f_gc^j() represents the j-th (j = 1, 2) group convolutional layer preceding an FRDAB, and the input and output of the m-th LFFM module are denoted by F_LFFM^{m−1} and F_LFFM^m, respectively.
FRDAB, as shown in figure 2, is a dual-attention block specifically designed for feature enhancement. Its multi-branch structure is created for feature extraction and utilization. The feature is passed to two branches, each of which uses a different configuration of convolutional layers to modify the size of the receptive field and acquire features at different scales. Following that, channel attention extracts channel statistics for re-weighting along the channel dimension, and spatial attention re-weights each pixel based on the spatial context of the feature map. Finally, an addition operation combines the results of these two attention operations. The features obtained this way more firmly suppress the smooth areas of the input image.
The Symmetric CNN is intended for the extraction of local features. Notwithstanding, this is far from sufficient for reconstructing high-quality images, because the shallow depth of the lightweight network makes it difficult to obtain a receptive field large enough to capture global information. To tackle this issue, a Recursive Transformer (RT) is employed to learn the long-term dependence of images. Unlike previous methods, a recursive mechanism is applied to fully train the Transformer without significantly increasing GPU memory requirements or model parameters. Figure 1 depicts that RT comes before the reconstruction module and is made up of two Transformer Modules (TMs) and a convolutional layer, where f_3×3() and f_TM() signify the 3×3 convolutional layer and the TM, respectively. The recurrent connection, denoted by ⟳, indicates that the output of the TM serves as its new input and is looped S times. Only the encoding part of the standard Transformer structure is used for the TM, inspired by ESRT [27]. As illustrated in figure 3, the TM is primarily made up of two layer-normalization layers, one Multi-Head Attention (MHA) module, and a Multi-Layer Perceptron (MLP). Defining the input embeddings as F_in, the output embeddings F_out can be formulated as

F' = f_MHA(f_norm(F_in)) + F_in,
F_out = f_MLP(f_norm(F')) + F',

where f_norm() represents the layer normalization operation, and f_MHA() and f_MLP() depict the MHA and MLP modules. As in ESRT [27], MHA's input feature map is projected into Q, K, and V via a linear layer to reduce GPU memory consumption. In the meantime, a feature reduction strategy is used to reduce the Transformer's memory consumption further. According to [41], each MHA head undertakes a scaled dot-product attention; all head outputs are concatenated and then passed through a linear transformation to produce the output.
The scaled dot-product attention is

Attention(Q, K, V) = softmax(QK^T / √d_k) V,

where d_k is the dimension of the keys. The Transformer can be fully trained and utilized using this strategy without increasing the model's parameters or GPU memory consumption.
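A single-head version of this attention is straightforward to sketch in NumPy (shapes and dimensions here are illustrative, not LBNet's configuration):

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """softmax(Q K^T / sqrt(d_k)) V for a single attention head."""
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)                # (n, n) similarity scores
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # row-wise softmax
    return weights @ v                             # weighted sum of values

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((16, 64)) for _ in range(3))
out = scaled_dot_product_attention(q, k, v)
assert out.shape == (16, 64)
```

Because every token attends to every other token, this operation is what gives the RT its global receptive field, complementing the local features from the Symmetric CNN.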

Patch-Mosaic
This section outlines the proposed Patch-Mosaic method. Let X ∈ R^{C×H×W} represent an image and a() denote a data augmentation, with a() : R^{C×H×W} → R^{C×H×W}, X* = a(X). Here, X* is the augmented image, and a() may be any data augmentation or a combination of data augmentations. Unlike image-level augmentation, which applies the augmentation directly to the whole image as X* = a(X), Patch-Mosaic first cuts the image into four equally sized patches {x_1, x_2, x_3, x_4}, splitting with equal probability in the height and width domains. Afterward, a() is applied within each patch separately, and the augmented pieces are integrated back together using the Mosaic augmentation method:

X* = Mosaic(a_1(x_1), a_2(x_2), a_3(x_3), a_4(x_4)).

Randomness governs the augmentations, including random probabilities of being applied, random implementations, and random magnitudes. a_1(), a_2(), a_3(), and a_4() are all instances of a(); despite using the same data augmentation a(), they may differ significantly, enhancing diversity at both the local-region and holistic image levels. In YOLOv4 [18], the Mosaic data augmentation algorithm builds on the CutMix [19] technique and is a further improvement of it. General data augmentation methods are divided into five categories.
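The steps above can be sketched as follows; this is a minimal illustration under simplifying assumptions (a fixed 2×2 split at the image center and a horizontal flip standing in for a(); the actual method draws the split positions and augmentation parameters at random):

```python
import numpy as np

def patch_mosaic(x, augment, rng=np.random.default_rng(0)):
    """Sketch of Patch-Mosaic: cut one image into four equally sized
    patches, apply the augmentation a() to each patch independently
    (each with its own random draw), and stitch the pieces back."""
    h, w = x.shape[:2]
    hh, ww = h // 2, w // 2
    patches = [x[:hh, :ww], x[:hh, ww:], x[hh:, :ww], x[hh:, ww:]]
    patches = [augment(p, rng) for p in patches]   # a1..a4, same a()
    top = np.concatenate(patches[:2], axis=1)
    bot = np.concatenate(patches[2:], axis=1)
    return np.concatenate([top, bot], axis=0)

def random_flip(p, rng):
    # Example a(): horizontal flip applied with probability 0.5.
    return p[:, ::-1] if rng.random() < 0.5 else p

img = np.arange(16 * 16 * 3, dtype=np.uint8).reshape(16, 16, 3)
out = patch_mosaic(img, random_flip)
assert out.shape == img.shape
```

Note that the output has the same shape and pixel population as the input; only the arrangement within each patch changes, which is why the method adds zero parameters and negligible cost.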

Experiment
All networks are trained on a machine with an AMD Ryzen Threadripper 2950X @ 4.40 GHz, 64 GB DDR4-2666 RAM, two NVIDIA GeForce RTX 2080 Ti 12GB GPUs, and CUDA, running Ubuntu 20.04.4. Overall, the model has 32 input and output channels, three (n = 3) LFFMs, and the Transformer module recurses twice (S = 2).

Dataset
DIV2K [5] is used as the training set. Five benchmark test datasets, Set5 [42], Set14 [43], BSDS100 [44], Urban100 [45], and Manga109 [46], are used for evaluation to validate the effectiveness of the proposed Patch-Mosaic with LBNet. Table 1 summarises the dataset and augmentation settings used for training the models.

Evaluation metrics
Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity (SSIM) are used as performance indicators for SR images, computed on the Y channel of the YCbCr color space. Both metrics compare two images: the ground-truth (reference) image Y from the test dataset and the reconstructed image Y*. SSIM measures how similar two images are, with values ranging from −1 to 1. PSNR compares the quality of Y* against Y and is calculated from the mean squared error (MSE). The MSE is the average of the squared differences between the estimate and the reference; a lower MSE means better performance. Conversely, the higher the PSNR and SSIM values, the higher the quality of the reconstructed image:

MSE = (1/(H·W)) Σ_{i,j} (Y(i,j) − Y*(i,j))²,
PSNR = 10 · log10(MAX² / MSE),

where MAX is the maximum possible pixel value. The SSIM assessment between patches P_Y* and P_Y is

SSIM(P_Y*, P_Y) = ((2 µ_{P_Y*} µ_{P_Y} + c1)(2 σ_{P_Y*P_Y} + c2)) / ((µ_{P_Y*}² + µ_{P_Y}² + c1)(σ_{P_Y*}² + σ_{P_Y}² + c2)),

where µ_{P_Y*} (µ_{P_Y}) and σ_{P_Y*} (σ_{P_Y}) denote the mean and standard deviation of patch P_Y* (P_Y), σ_{P_Y*P_Y} is their covariance, and c1 and c2 are minor constants. SSIM(Y*, Y) is then the average of this patch-based score over the image.
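As a reference sketch of the two metrics, the following computes PSNR from the MSE and a simplified single-window SSIM (the full metric averages the SSIM score over local patches with a sliding Gaussian window, which is omitted here):

```python
import numpy as np

def psnr(y, y_star, max_val=255.0):
    """PSNR in dB from the mean squared error; higher is better."""
    mse = np.mean((y.astype(np.float64) - y_star.astype(np.float64)) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

def ssim_global(y, y_star, max_val=255.0):
    """Single-window (global) SSIM with the standard c1, c2 constants."""
    c1, c2 = (0.01 * max_val) ** 2, (0.03 * max_val) ** 2
    y, y_star = y.astype(np.float64), y_star.astype(np.float64)
    mu_y, mu_s = y.mean(), y_star.mean()
    var_y, var_s = y.var(), y_star.var()
    cov = ((y - mu_y) * (y_star - mu_s)).mean()
    return ((2 * mu_y * mu_s + c1) * (2 * cov + c2)) / \
           ((mu_y ** 2 + mu_s ** 2 + c1) * (var_y + var_s + c2))

a = np.full((8, 8), 100.0)
b = np.full((8, 8), 110.0)
assert np.isclose(ssim_global(a, a), 1.0)          # identical images
assert np.isclose(psnr(a, b), 10 * np.log10(255 ** 2 / 100))
```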

Result
Compared to SISR SOTA models using image-level data augmentation methods, LBNet with the proposed Patch-Mosaic achieved better performance. Table 2 shows a detailed comparison on the five most common datasets at ×4 scale (Set5 [42], Set14 [43], BSDS100 [44], Urban100 [45], and Manga109 [46]) against LBNet [20], SwinIR [26], and ESRT [27]. The results clearly show that LBNet (trained on DIV2K) with Patch-Mosaic acquired competitive results using the same parameters as LBNet. These results demonstrate the efficacy of the proposed Patch-Mosaic.

Conclusion
In this study, a novel data augmentation approach called Patch-Mosaic was introduced as a means to enhance the performance of single-image super-resolution (SISR) models. The technique was applied to the Lightweight Bimodal Network (LBNet), whose Symmetric CNN extracts local features, while the Recursive Transformer trains the Transformer fully through its recursive mechanism, thereby enabling it to learn global context and enhance features. In short, the Patch-Mosaic method effectively combines patch-level image augmentation with the Mosaic data augmentation technique to enhance the performance of lightweight SISR networks while keeping computational costs low. The results of this study highlight the significance of data augmentation in improving the performance of super-resolution models. Notwithstanding, this research only conducted Patch-Mosaic with four patches (an image is cut into four patches, and the combined image includes the four mosaic-augmented patches). Future research could work with more patches, which may have the potential to further improve data augmentation for SISR.