A Review of Convolutional Neural Network Development in Computer Vision

Convolutional neural networks (CNNs) have made remarkable progress in computer vision and are among the classical and most widely used network structures in this fast-growing field. The Internet of Things (IoT) has attracted a lot of attention in recent years, which has directly spurred the development of AI technology in cutting-edge directions such as IoT-based intelligent luggage security inspection systems, intelligent fire alarm systems, driverless cars, and drone technology. This paper first outlines the structure of CNNs, including the convolutional layer, the downsampling layer, and the fully connected layer, all of which play an important role. It then describes several modules of classical networks that are rapidly driving the development of CNNs. Finally, it discusses the current state of CNN research in image classification, object segmentation, and object detection.


Introduction
CNNs are a prevalent deep learning architecture. Local perception and parameter sharing are features of the convolutional structure, which lower the model's complexity and the number of parameters. CNNs are also a very flexible machine learning model containing multilevel nonlinear transformations, and their design is inspired by the way the animal visual cortex is organized. Building on this inspiration, CNNs have developed rapidly, solved many previously difficult problems in artificial intelligence, show strong robustness and fault tolerance [1], and are easy to train and optimize. In practical applications, CNNs are also combined with the Internet of Things (IoT), for example to improve the accuracy of face recognition. In medical treatment, IoT technology can be used to obtain and load data and feed it into a convolutional model, which enables the intelligent management of people and things.
In 1998, LeCun proposed the LeNet-5 [2] network, which uses local receptive fields, shared weights, and downsampling to maintain invariance to translation, scale, and distortion of handwritten digits [3][4][5]. The system performed well in small-scale handwritten digit recognition. In 2012, the new model AlexNet [3] achieved the best performance in the ImageNet image classification challenge, which attracted wide attention. Compared with LeNet, AlexNet has a more complicated design: it uses ReLU (Rectified Linear Unit) as the nonlinear activation function and uses Dropout, which randomly deactivates neurons, to mitigate the issues caused by the large number of parameters and to prevent overfitting. Following the success of AlexNet, researchers have continued to improve on this basis; representative architectures include ZFNet [4], NIN [6], VGGNet [5], GoogLeNet [7], and ResNet [8]. With the development of these architectures, although these networks continue to improve the accuracy of ImageNet classification tasks, they also

CNNs Architecture
CNNs take the original image as the network input [9][10][11]. After a simple transformation of the data, the network carries out a series of operations such as convolution, pooling, and nonlinear activation function mapping, and abstracts the original image layer by layer into the final feature representation [12] needed by its task. Finally, it ends with a linear mapping from the features to the task target. Although there are many variants of CNNs, their structures are very similar [13][14][15], generally consisting of an input layer, convolutional layers, pooling layers, fully connected layers, and an output layer. Figure 1 depicts the LeNet network.

Convolutional Layer
As the first layer of image processing, the convolutional layer aims to learn feature representations of the input image. The convolutional layer consists of multiple filters that map different features [16,17]. In a convolutional neural network, the region of the input layer that determines an element in the output of a given layer is called the receptive field. A new feature map is obtained by convolving the input with a learned filter [18] and then applying a nonlinear activation function to the convolution result [19][20][21]. The filters in the lower convolutional layers detect low-order features such as edges, corners, and linear textures, whereas the filters in the higher layers learn more abstract and more specific features [22][23][24]. By stacking multiple convolutional layers, the network can gradually extract higher-level feature representations. Figure 2 illustrates the convolution operation.

Figure 2. convolution operation
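As a minimal sketch of the convolution-plus-activation step described above, the following pure-Python code computes one output channel from a list of input feature maps. The function names (`conv2d_single`, `conv_layer_channel`), the stride of 1, the absence of padding, and the toy values are assumptions made for illustration; they are not taken from any particular network discussed here.

```python
def conv2d_single(feature_map, kernel):
    """Slide one kernel over one 2-D map (stride 1, no padding).

    Like most deep learning libraries, this does not flip the kernel,
    so strictly speaking it is a cross-correlation.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(feature_map) - kh + 1
    out_w = len(feature_map[0]) - kw + 1
    return [
        [
            sum(feature_map[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def relu(u):
    return max(0.0, u)


def conv_layer_channel(input_maps, kernels, bias, activation=relu):
    """One output channel: sum the per-map convolutions, add the bias, apply f."""
    per_map = [conv2d_single(x, k) for x, k in zip(input_maps, kernels)]
    out_h, out_w = len(per_map[0]), len(per_map[0][0])
    return [
        [activation(sum(m[r][c] for m in per_map) + bias) for c in range(out_w)]
        for r in range(out_h)
    ]


# Toy 3x3 input and a 2x2 kernel that responds to horizontal intensity changes.
image = [
    [1.0, 2.0, 0.0],
    [0.0, 1.0, 3.0],
    [4.0, 0.0, 1.0],
]
edge_kernel = [
    [1.0, -1.0],
    [1.0, -1.0],
]
output = conv_layer_channel([image], [edge_kernel], bias=0.5)
print(output)  # [[0.0, 0.5], [3.5, 0.0]]
```

Each output element depends only on a 2x2 region of the input, which is the receptive field of that element in this toy setting.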
The preceding layer's feature maps are convolved with a learnable convolutional kernel, and the output feature map is then generated using an activation function. Each output feature map can combine the convolutions of several input feature maps. The calculation process is given in (1) and (2).
$$u_j^l = \sum_{i \in M_j} x_i^{l-1} * k_{ij}^l + b_j^l \tag{1}$$
$$x_j^l = f(u_j^l) \tag{2}$$
Among them, the net activation of the $j$th channel of convolutional layer $l$ is referred to as $u_j^l$; this is created

Pooling Layer
Pooling layers sit between consecutive convolutional layers. Each feature map of the pooling layer corresponds to one feature map of the preceding layer [25,26]; that is, the pooling operation does not change the number of feature maps. When processing images, the intermediate matrices of a traditional network are often too large to process conveniently. A pooling layer is introduced to reduce the size of the intermediate matrix and to help prevent overfitting. In addition, the pooling layer helps maintain invariance to translation and rotation; that is, when the image is rotated or translated, its features can still be extracted, which improves the model's ability to generalize [27][28][29]. Commonly used pooling methods include max pooling, which takes the largest value in the local receptive field; mean pooling, which averages all values in the local receptive field; and stochastic pooling, which randomly takes one of the values in the local receptive field [30][31][32].
In addition, there are pooling methods such as mixed pooling and spatial pyramid pooling. The function of pooling is to reduce the size of the model. An image contains a huge amount of information and a very large number of features, yet some of this information is of little use for identifying key features. Therefore, pooling is used to compress a relatively large and complex matrix into a relatively small one, which reduces complexity, improves operation speed, and improves the robustness of the model when dealing with more complex problems [33][34][35]. Max pooling is shown in Figure 3.
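The three pooling methods named above can be sketched with one helper that applies a reduction to each non-overlapping window. The helper name `pool2d`, the 2x2 window, and the toy feature map are assumptions for illustration. The stochastic variant below picks a value uniformly at random, following the description in the text; published formulations of stochastic pooling may weight the choice by activation value instead.

```python
import random


def pool2d(feature_map, size, reduce_window):
    """Apply reduce_window to each non-overlapping size x size window."""
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for r in range(0, rows - size + 1, size):
        row = []
        for c in range(0, cols - size + 1, size):
            window = [feature_map[r + i][c + j]
                      for i in range(size) for j in range(size)]
            row.append(reduce_window(window))
        pooled.append(row)
    return pooled


def mean(window):
    return sum(window) / len(window)


feature_map = [
    [1.0, 3.0, 2.0, 0.0],
    [5.0, 7.0, 1.0, 1.0],
    [0.0, 2.0, 4.0, 8.0],
    [2.0, 0.0, 6.0, 2.0],
]

max_pooled = pool2d(feature_map, 2, max)    # [[7.0, 2.0], [2.0, 8.0]]
mean_pooled = pool2d(feature_map, 2, mean)  # [[4.0, 1.0], [1.0, 5.0]]

rng = random.Random(0)  # seeded only to make the sketch repeatable
stochastic_pooled = pool2d(feature_map, 2, rng.choice)
```

In every case the 4x4 map shrinks to 2x2 while the number of feature maps is unchanged, which is the size reduction the text describes.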

Max Pooling
Average pooling can be regarded as a structural regularization, which can improve the consistency between feature maps and categories [36][37][38]. The global average pooling layer has no parameters that need to be optimized, so overfitting at this layer can be avoided. Furthermore, the global average pooling layer sums the spatial information and is therefore more robust to spatial variations of the input. Average pooling is shown in Figure 4.
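A minimal sketch of global average pooling follows, assuming feature maps stored as nested Python lists; the function name and the toy values are illustrative only. It shows the two properties mentioned above: the operation has no trainable parameters, and a spatial shift of the activation leaves the result unchanged.

```python
def global_average_pool(feature_maps):
    """Collapse each 2-D feature map to its mean, giving one value per channel."""
    return [
        sum(sum(row) for row in fmap) / (len(fmap) * len(fmap[0]))
        for fmap in feature_maps
    ]


maps = [
    [[0.0, 0.0], [4.0, 0.0]],  # channel 0: a single activation, bottom-left
    [[1.0, 2.0], [3.0, 6.0]],  # channel 1
]
shifted_maps = [
    [[0.0, 4.0], [0.0, 0.0]],  # channel 0: same activation moved to top-right
    [[1.0, 2.0], [3.0, 6.0]],
]

pooled = global_average_pool(maps)                  # [1.0, 3.0]
pooled_shifted = global_average_pool(shifted_maps)  # [1.0, 3.0]
```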

Fully Connected Layer
The convolution operation generates local features, and the purpose of the fully connected layer is to combine these local features before generating the classification result [39,40]. This is equivalent to mapping the feature space of the many distinct local representations learned previously to the sample feature space [41][42][43], which is convenient for handing over to the final classifier or regressor. The calculation formulas are (3) and (4).

$$u^l = w^l x^{l-1} + b^l \tag{3}$$
$$x^l = f(u^l) \tag{4}$$
where $u^l$ is the net activation of fully connected layer $l$, obtained by weighting the output feature map $x^{l-1}$ of the previous layer and adding the bias; $w^l$ is the weight coefficient of the fully connected network; $b^l$ is the bias of fully connected layer $l$; and $f(\cdot)$ is the activation function.
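The fully connected computation described here (a weighted sum of the previous layer's output plus a bias, followed by an activation) can be sketched as below. The function name, the choice of ReLU as the activation, and the toy weights are assumptions for illustration.

```python
def relu(u):
    return max(0.0, u)


def fully_connected(x_prev, weights, biases, activation=relu):
    """Return f(W x + b) for one fully connected layer.

    weights holds one row per output unit; x_prev is the flattened
    output of the previous layer.
    """
    return [
        activation(sum(w * x for w, x in zip(row, x_prev)) + b)
        for row, b in zip(weights, biases)
    ]


x_prev = [1.0, 2.0, 3.0]
weights = [
    [0.5, -1.0, 0.0],  # unit 0
    [1.0, 1.0, 1.0],   # unit 1
]
biases = [0.5, -1.0]

layer_output = fully_connected(x_prev, weights, biases)
print(layer_output)  # [0.0, 5.0]
```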

Activation Function
In deep CNNs, the activation function is a critical component. Without an activation function, a traditional network can only cope with linearly separable problems. Nonlinear activation functions were introduced to boost the model's robustness, increase its nonlinear expressive ability, and alleviate difficulties such as vanishing gradients [44][45][46].

$$f(x) = \frac{1}{1 + e^{-x}} \tag{5}$$
The Sigmoid activation function is shown in Figure 5.

ReLU
ReLU is the most frequently used activation function in various models so far. Specifically, when the input is negative, the output of the ReLU function is 0; when the input x is positive, the output is x. Compared with Sigmoid, ReLU makes the model converge faster and is more beneficial to gradient updates during backpropagation [47][48][49]. At the same time, the output of some neurons in the hidden layer is set to 0, which brings sparseness, makes it easy for the network to obtain sparse representations, reduces the number of parameters [50], and reduces overfitting [51,52]. Experiments show that ReLU performs better than Sigmoid and better alleviates the vanishing gradient problem. The ReLU function is given in (6).
$$f(x) = \max(0, x) \tag{6}$$
The ReLU activation function is shown in Figure 6.
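To make the comparison between Sigmoid and ReLU concrete, the sketch below evaluates both functions and their derivatives at a few points. The function names are illustrative, and the derivative of ReLU at exactly 0 is set to 0 here by convention, which is an assumption rather than something the text specifies.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # never larger than 0.25


def relu(x):
    return max(0.0, x)


def relu_grad(x):
    # Convention assumed here: the gradient at x == 0 is taken to be 0.
    return 1.0 if x > 0 else 0.0


for x in (-4.0, 0.0, 4.0):
    print(x, sigmoid_grad(x), relu_grad(x))

# Backpropagating through 10 sigmoid units multiplies the gradient by at
# most 0.25 each time, so the upper bound shrinks to roughly 1e-6.
sigmoid_bound_10_layers = 0.25 ** 10
# Through 10 active ReLU units the corresponding factor stays at 1.
relu_factor_10_layers = relu_grad(4.0) ** 10
```

The shrinking bound for Sigmoid is one way to see why gradients tend to vanish in deep Sigmoid networks, whereas active ReLU units pass the gradient through unchanged.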

Network in Network
A standard CNN generally consists of linear convolutional layers, pooling layers, and fully connected layers. The convolutional layer performs a linear convolution through the filter, then applies a nonlinear activation function to the convolution result, and finally generates the feature map. Because the convolutional layer uses a linear filter, the acquired features have a strong linear representation [53,54], so the layer is better suited to learning linearly separable features, which limits the task application scenarios to a great extent. However, the features of the samples to be extracted are generally highly nonlinear. For example, in face recognition, human ears, noses, and mouths all have different features. Therefore, Lin et al. [6] designed the Network in Network (NIN) model, whose main idea is to replace the traditional convolutional layer with a Multilayer Perceptron (MLP) composed of multiple fully connected layers with nonlinear [55] activation functions. The linear convolutional layer and the MLP layer are depicted in Figure 7. Using a nonlinear neural network in place of the linear filter enables the model to approximate more abstract representations of latent features and gives it stronger generalization ability. In the traditional CNN structure, the fully connected layers have too many parameters and are prone to overfitting [57], so the structure relies heavily on the dropout regularization technique. The NIN structure uses global average pooling to replace the original fully connected layers, which greatly reduces the parameters of the model. It averages each feature map of the last MLP convolutional layer through global average pooling, concatenates these values into a vector, and finally feeds the vector into the softmax classification layer [58]. MLP convolutional layers can handle more complex nonlinear problems and extract more abstract features.
With the development of CNNs, as the depth and width of the network increase, the network becomes more complex and can in theory fit complex feature representations more easily. However, rather than overfitting, the network then faces degradation problems and exploding or vanishing gradient problems [65,66]. Specifically, the network performance no longer improves as depth increases, and when the network depth increases further, the model performance even declines seriously [32,67].

Inception Block and Improved Inception Block
The residual block proposed by He et al. in ResNet [8] is similar to the Highway network [68], which also allows input information to spread across multiple hidden layers. The difference is that the gating mechanism of the residual network is no longer learnable; that is, it always lets information pass through [69]. As a result, the number of hyperparameters is greatly reduced [70], network convergence is accelerated, and the problems caused by network degradation are greatly reduced. The residual module is shown in Figure 9. The input of the residual module is defined as X, the output is H(X) = F(X) + X, and the residual is F(X). During training the network learns the residual F(X), which is easier than directly learning the output H(X).
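The relation H(X) = F(X) + X can be sketched in a few lines. Here `residual_fn` stands in for the learned mapping F, and the two toy residual functions are assumptions for illustration; a real block would use convolutional layers for F.

```python
def residual_block(x, residual_fn):
    """Return H(x) = F(x) + x, where residual_fn plays the role of F."""
    fx = residual_fn(x)
    return [f + xi for f, xi in zip(fx, x)]


def zero_residual(x):
    # If the block learns F(x) = 0, it reduces to the identity mapping.
    return [0.0 for _ in x]


def small_residual(x):
    # A hypothetical learned residual that perturbs the input slightly.
    return [0.1 * xi for xi in x]


x = [1.0, -2.0, 3.0]
identity_out = residual_block(x, zero_residual)
perturbed_out = residual_block(x, small_residual)

# Stacking many blocks with a zero residual leaves the input unchanged,
# so added depth does not have to degrade the signal.
deep_out = x
for _ in range(50):
    deep_out = residual_block(deep_out, zero_residual)
```

The stacked example illustrates why learning F(X) can be easier than learning H(X) directly: the identity mapping only requires driving F towards zero.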
The proposal of residual networks marks a new stage in the development of convolutional neural networks. Residual blocks make it possible to train deep network structures, and a large number of studies have since been carried out to improve residual structures [71]. Because stacking more residual blocks to increase depth improves accuracy only slightly, researchers studied the influence of width on the network and found that width is more important than depth and that it is unnecessary to train a network with more than 50 layers [72]; consequently, a lot of current research optimizes the structure of the residual network in terms of width. Zagoruyko et al. [73] argue that ResNet cannot fully reuse features during training: the gradient does not flow through every residual block during backpropagation, so only a few residual modules learn useful feature representations. The authors propose the Wide Residual Network (WRN) [73]. By widening the network and reducing its depth, WRN trains 2 times faster than the previous residual network while the number of network layers is reduced by 50 times, which greatly reduces the amount of computation. Targ et al. [74] proposed a generalized residual network that combines the residual network and a standard convolutional neural network in parallel, removing invalid information while retaining effective feature expression; it improves the expressive ability of the network and has a significant effect on the CIFAR-100 dataset. Zhang et al. [75] added additional bypass connections to the residual network and increased the width to improve the learning ability of the network; the proposed Residual Networks of Residual Networks (RoR) [76] can be used as a general module for constructing networks. Abdi et al. [77] provide experimental support for the hypothesis that the residual network behaves as a fusion of several shallow networks; the model proposed by the authors increases the number of residual functions in the residual module to improve the model's expressive capabilities.

DenseNet
In 2016, inspired by the idea of skip connections in ResNet [8], Huang et al. proposed the DenseNet [78] model. In forward propagation, the model connects each convolutional layer to every other layer in the network and uses the feature maps of all previous layers as the input of each subsequent layer [79,80]. On the ImageNet image classification benchmark, DenseNet achieves accuracy comparable to ResNet while requiring significantly fewer parameters [81,82]. In addition, it alleviates the vanishing gradient problem, strengthens feature propagation, and promotes feature reuse through the reorganization of feature maps, reducing the amount of redundant computation. The DenseNet model structure is shown in Figure 10.
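The dense connectivity pattern, in which each layer receives the feature maps of all previous layers, can be sketched as follows. For brevity each "channel" is a single number rather than a 2-D map, and the names `dense_block_forward`, `dense_block_input_channels`, `growth_rate`, and the toy layer are assumptions for illustration.

```python
def dense_block_forward(x, layers):
    """Feed every layer the concatenation of the input and all earlier outputs."""
    features = list(x)
    for layer in layers:
        features = features + layer(features)  # concatenate, do not add
    return features


def dense_block_input_channels(initial_channels, growth_rate, num_layers):
    """Number of input channels seen by each layer of a dense block."""
    return [initial_channels + growth_rate * i for i in range(num_layers)]


def toy_layer(features):
    # Hypothetical layer producing one new channel from everything before it.
    return [sum(features)]


block_output = dense_block_forward([1.0, 2.0], [toy_layer, toy_layer])
print(block_output)  # [1.0, 2.0, 3.0, 6.0]

channels_per_layer = dense_block_input_channels(16, 12, 4)
print(channels_per_layer)  # [16, 28, 40, 52]
```

Because earlier outputs are kept and concatenated rather than recomputed, each layer only needs to add a small number of new channels, which is one way to read the feature reuse described above.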

Other Innovative Blocks
In the exploration of the design space of network structures, a great deal of effort has gone into fine-tuning module design, and a series of research results have been obtained. For example, to reduce the training parameters of the fully connected layer, NIN first proposed using global average pooling to replace the fully connected layer, which is equivalent to regularizing the whole network structure to prevent overfitting [83][84][85]. Global average pooling establishes a connection between the feature map and the output category label, which is more interpretable than the fully connected layer, and GoogLeNet later adopted this structure to obtain a performance improvement. Huang et al. believe that the success of extremely deep networks comes from the introduction of bypass connections, and their proposed dense block has direct connections between any two layers. For any network layer, its input comes from the outputs of all previous layers, and its output is used as input to all following layers. This dense connection improves the propagation of the gradient and has a regularizing effect on the network, which mitigates overfitting on small datasets. Another advantage of dense connection is that it allows feature reuse. The trained DenseNet has a smaller number of parameters and is easier to train.
Traditional models such as VGG are too large to be deployed on lightweight devices. MobileNet [86], the lightweight network proposed by Howard et al., can be used on mobile and embedded devices. Specifically, the traditional convolution is decomposed into two steps, depthwise convolution and pointwise convolution, which reduces the size of the model and the amount of computation. Sandler et al. combined the residual module with depthwise separable convolution and proposed the inverted residual with linear bottleneck; MobileNetV2 [87], built from this block, is superior to MobileNet in speed and accuracy. Zhang et al. further proposed ShuffleNet [75], which builds on MobileNet with pointwise group convolution and channel shuffle and achieved large improvements in both image classification and object detection tasks. The depthwise convolutional block is shown in Figure 11.
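The saving from decomposing a convolution into a depthwise step and a pointwise step can be checked with simple weight counts. The layer sizes below (3x3 kernel, 32 input channels, 64 output channels) are hypothetical values chosen for illustration, and biases are ignored.

```python
def standard_conv_weights(kernel_size, in_channels, out_channels):
    """Weights of an ordinary convolution: every output sees every input channel."""
    return kernel_size * kernel_size * in_channels * out_channels


def depthwise_separable_weights(kernel_size, in_channels, out_channels):
    """Depthwise step (one k x k filter per input channel) plus 1x1 pointwise step."""
    depthwise = kernel_size * kernel_size * in_channels
    pointwise = in_channels * out_channels
    return depthwise + pointwise


standard = standard_conv_weights(3, 32, 64)          # 18432
separable = depthwise_separable_weights(3, 32, 64)   # 2336
ratio = separable / standard                         # about 0.127

# The ratio equals 1/out_channels + 1/kernel_size**2.
expected_ratio = 1 / 64 + 1 / 9
```

For this setting the separable form needs roughly one eighth of the weights, which illustrates the reduction in model size that the text attributes to MobileNet.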

Image Classification
Image classification is one of the core problems in image processing. Given an image, image classification refers to predicting its category.
In the last several years, CNNs have been widely used in the field of image processing. Krizhevsky et al. [3] used CNNs in the LSVRC-12 competition for the first time and introduced the ReLU + Dropout technique; they deepened the CNN model and obtained the best classification results at that time with the model known as AlexNet, a great improvement over traditional CNNs. Because AlexNet uses the nonlinear ReLU activation function, the ability of the model to deal with nonlinear functions is improved, the computational complexity of the model is reduced, and training is accelerated. In addition, with dropout, some neurons are randomly set to 0 throughout the training process; that is, some intermediate neurons are randomly deactivated in each iteration. This makes the model more robust and reduces the number of effective parameters of the fully connected layer to avoid overfitting [88].
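The dropout step described above, in which some neurons are randomly set to 0 during training, can be sketched as follows. This sketch uses the common "inverted dropout" variant, which rescales the surviving activations during training so that nothing needs to change at test time; that rescaling choice, the function name, and the drop probability of 0.5 are assumptions for illustration and are not specified in the text.

```python
import random


def dropout(activations, drop_prob, rng, training=True):
    """Randomly zero activations during training (inverted dropout variant)."""
    if not training or drop_prob == 0.0:
        return list(activations)
    keep_prob = 1.0 - drop_prob
    return [
        a / keep_prob if rng.random() < keep_prob else 0.0
        for a in activations
    ]


rng = random.Random(0)  # seeded only to make the sketch repeatable
hidden = [1.0, 2.0, 3.0, 4.0]

train_out = dropout(hidden, 0.5, rng, training=True)
test_out = dropout(hidden, 0.5, rng, training=False)
```

During training each entry of `train_out` is either 0.0 or twice the original activation; at test time the activations pass through unchanged.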
Following the success of AlexNet, Szegedy et al. [7] increased the depth of CNNs, proposing a CNN structure with more than 20 layers called GoogLeNet. The GoogLeNet structure uses three types of convolution operations (1×1, 3×3, and 5×5). The main feature of this structure is that convolution computation is reduced and the number of parameters is 12 times smaller, while accuracy is higher; GoogLeNet won first place in the "specified data" group of image classification in LSVRC-14.
In very deep network models, in addition to the vanishing and exploding gradient problems, there is also a degradation problem. Batch Normalization (BN) [89] is an efficient method to solve the gradient diffusion problem. The degradation problem is as follows: as the depth increases, the network accuracy first rises and tends to saturate, and then rapidly declines. This performance drop is not caused by overfitting, because increasing the depth of the network also increases its training error. He et al.

Object Detection
In computer vision, object detection has long been a major research focus [90][91][92]. Its purpose is to locate the target in the image accurately and determine the target category [93]. The use of CNNs for object detection can be traced back to the 1990s [94,95]. However, due to the lack of training data and the limited processing capability of computing equipment, CNN-based object detection developed very slowly before 2012 [96]. The great success of CNNs in the ImageNet challenge in 2012 rekindled researchers' interest [62] in CNN-based object detection and led to improvements in object detection accuracy. Object detection has since produced many classical networks, including R-CNN [97], OverFeat [88], Fast R-CNN [98], Faster R-CNN [99], FPN [100], and Mask R-CNN [101]. Nowadays, object detection is widely used in security, military, transportation, and other fields.
Although the R-CNN [97] algorithm achieved a significant performance improvement in the object detection task, its CNN feature extractor is executed for each candidate region, which consumes a lot of computing time and results in a high computational cost. To solve this problem, researchers proposed OverFeat [88], the first model to use one network for multiple tasks, making full use of the characteristics of CNNs. First, basic features are extracted from the image through the CNN, and these basic features are then passed to the different tasks. Because the weights are reused, the network propagation computation is reduced, which solves the problem of long running time. Later, Fast R-CNN [98] was introduced to improve the network by using end-to-end training. All convolutional layers can update their parameters during fine-tuning, which improves execution efficiency and detection accuracy.

Conclusion
Over the last few years, CNNs have made continuous breakthroughs in the field of computer vision. Internet of Things (IoT) technology has by now penetrated our daily life, for example in unmanned vending machines, driverless cars, and smoke alarms. Among these applications, the IoT uses convolutional networks to develop unmanned driving technologies, body temperature detection systems, and intelligent safety equipment, all of which have been widely used.
This paper describes the modules in classic models, covering the convolutional layer, pooling layer, and activation function, and finally summarizes some research progress of CNNs in image classification, object detection, and object segmentation. Although convolutional neural networks have been widely used, there is still room for exploration in computer vision and other fields [103].
First, since CNN architectures are getting deeper and deeper, large-scale annotated datasets and huge computing power are required for training. Second, current research on CNNs in computer vision is almost entirely based on supervised learning, and manually collecting labelled datasets requires a lot of manpower and financial resources, so the exploration of unsupervised learning becomes particularly important. At the same time, deep CNN models take up a lot of video memory at test time, and their training time is very long. Large networks sometimes require several months of training, and their resource demands make them unsuitable for deployment on mobile platforms with limited resources. Reducing the complexity of the model so that it can run on resource-constrained devices without loss of accuracy is very important for the development of convolutional neural networks [104].
Finally, choosing appropriate hyperparameters has always been a major obstacle to applying CNNs to new tasks, such as the size of the learning rate, the size of the convolution kernel, the selection of the stride and the number of convolutional layers, and the selection of the

Figure 1. LeNet network example
The net activation $u_j^l$ is created by convolution and summation of the preceding layer's output feature maps $x_i^{l-1}$ plus the bias, and $x_j^l$ is the output of the $j$th channel of convolutional layer $l$. $f(\cdot)$ is called the activation function; sigmoid and ReLU are frequently used examples. $M_j$ represents the subset of input feature maps used to calculate $u_j^l$, $k_{ij}^l$ is the convolutional kernel matrix, and $b_j^l$ is the bias of the feature map after convolution. The convolutional kernels $k_{ij}^l$ applied to the input feature maps $x_i^{l-1}$ that form an output feature map $x_j^l$ may differ, and "*" denotes the convolution operation.

Figure 7. Linear convolutional layer and MLP layer
MLP is the nonlinear convolutional layer of the NIN structure, which replaces the original generalized linear model with an MLP. NIN obtains the feature maps of a convolutional layer by sliding a miniature neural network over the input. Similar to the weight sharing of convolution, the MLP is shared across all local receptive fields of the same feature map; that is, the same MLP is used for the whole feature map. NIN chooses MLP because it is trained with the backpropagation algorithm and can therefore be integrated with the CNN structure [56].
Szegedy et al. [7] proposed the Inception model in 2014, which uses dimensionality reduction (1×1 convolution) to reduce the amount and cost of computation. Its main idea is to use three small filters of different sizes to extract feature information at different scales from the previous layer, and then to fuse this feature information and pass it to the next layer. Inception contains 1×1, 3×3, and 5×5 filters; among them, the 1×1 filter is mostly used for dimensionality reduction, which can drastically reduce calculation time and greatly improves running speed. Through the feature fusion of 4 channels, more useful features are extracted [59][60][61]. Szegedy et al. also devised a method to improve the accuracy of CNNs that uses factorized convolution and dimensionality reduction in the network [62][63][64]. The improved Inception model reduces the number of parameters and speeds up computation by replacing the 5×5 convolution in the Inception model with two 3×3 convolutions. The Inception block and the improved Inception block are shown in Figure 8.
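Two of the claims above lend themselves to a quick arithmetic check: that two stacked 3×3 convolutions cover the same region as one 5×5 convolution with fewer weights, and that a 1×1 reduction before a 5×5 convolution lowers the weight count. The channel numbers below (64, 192, 16, 32) are hypothetical values chosen for illustration, biases are ignored, and stride 1 is assumed.

```python
def conv_weights(kernel_size, in_channels, out_channels):
    return kernel_size * kernel_size * in_channels * out_channels


def stacked_receptive_field(kernel_sizes):
    """Receptive field of stride-1 convolutions applied one after another."""
    field = 1
    for k in kernel_sizes:
        field += k - 1
    return field


# One 5x5 convolution versus two stacked 3x3 convolutions, 64 channels throughout.
single_5x5 = conv_weights(5, 64, 64)       # 102400
double_3x3 = 2 * conv_weights(3, 64, 64)   # 73728
same_coverage = stacked_receptive_field([3, 3]) == stacked_receptive_field([5])

# A 5x5 convolution applied directly, versus after a 1x1 reduction to 16 channels.
direct_5x5 = conv_weights(5, 192, 32)                               # 153600
reduced_5x5 = conv_weights(1, 192, 16) + conv_weights(5, 16, 32)    # 15872
```

In this toy setting the factorized form uses 28 percent fewer weights for the same receptive field, and the 1×1 reduction cuts the weights of the 5×5 branch by roughly a factor of ten.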

Figure 8. Inception block and improved Inception block

He et al. [8] adopted Residual Networks to solve the degradation problem. The main feature of ResNet is the cross-layer connection, which adds the input passed across layers to the convolution result by introducing shortcut connections. In other words, the unit's input is directly added to its output before activation. Experiments show that the residual network can indeed solve the degradation problem that deep neural networks suffer as depth increases. ResNet enables the lower layers of the network to be fully trained, the extracted shallow features are more abundant, and the accuracy improves significantly as depth increases. A ResNet with a depth of 152 layers achieved 1st place in image classification in the LSVRC-15 competition.
EAI Endorsed Transactions on Internet of Things 04 2022 -04 2022 | Volume 7 | Issue 28 | e2
Now that deep convolutional neural networks have been successfully applied to image detection, classification, and other tasks, many researchers have applied CNNs to the field of image segmentation. The Fully Convolutional Network (FCN) proposed by Long et al. [102] at CVPR 2015 can be trained end to end, and its classification results are learned at the pixel level. Traditional CNN architectures can only accept inputs of fixed size. FCN replaces the traditional fully connected layer with a fully convolutional layer with a convolution kernel size of 1, which allows FCN to accept inputs of any size. The network uses pooling and deconvolution operations to ensure that the input and output have the same size. Richer feature information is obtained by fusing the low-level features of the shallow network with the high-level semantic features of the deep network. That is, the semantic information from the deep convolutional layers is combined with the appearance information of the shallow convolutional layers through the skip connection structure to generate accurate and detailed image semantic segmentation. The schematic structure of FCN is shown in Figure 12.