A facial expression recognizer using modified ResNet-152

Abstract
In this age of artificial intelligence, facial expression recognition is an essential tool for describing emotion and psychology. Many recent studies have not achieved satisfactory results. This paper proposes an expression recognition system based on ResNet-152. Statistical analysis shows that our method achieves 96.44% accuracy. Comparative experiments show that the model outperforms mainstream models. In addition, we briefly describe the application of facial expression recognition technology in the Internet of Things (IoT).


Introduction
Facial expression is the result of the position and movement of facial muscles, and it conveys human emotional information through these movements. Facial expression is therefore regarded as a form of non-verbal communication. In short, it is a physical and psychological response of the body that is usually used to convey emotion. In interpersonal communication, humans can enhance the effect of communication by controlling their facial expressions. Facial expression recognition (FER) plays an important role in the field of human-computer interaction. To make the interaction more convenient, the computer must have FER ability. For cutting-edge interfaces that need to communicate with human users, we need a FER system for situational understanding [1]. In this paper, we construct a facial expression recognizer based on ResNet-152.
Neural networks have a strong nonlinear fitting ability for all kinds of data. However, for raw data such as speech and images, it is difficult to obtain ideal results with traditional machine learning. The traditional way to extract features from images is to use hand-crafted features, because it is hard to describe an image directly from its original pixels. The emergence of the Deep Neural Network (DNN) largely solved this image processing problem. However, one disadvantage of the DNN is its large number of parameters, which makes updating the parameters a great challenge. Another disadvantage is that so many parameters easily cause overfitting. Then came the Convolutional Neural Network (CNN). Its parameter sharing greatly reduces the number of parameters. As a result, a CNN can not only extract features but also keep the amount of computation under control. This may be why the CNN [2] has remained popular in the field of image processing [3,4]. In general, a CNN consists of an input layer, convolution layers, pooling layers, and fully-connected layers. The convolution layer is mainly used to extract image features, which may be local features or combinations of local features. The pooling layer reduces the data dimension and does not involve additional parameters. The fully-connected layer prepares the output. Simonyan [5] used a model with small convolution kernels to increase the depth of the network. This work achieved very good results in classification and localization tasks. He [6] proposed ResNet for classification tasks. This network model solves the degradation problem caused by deepening the network.
FER takes an image as input, analyzes the image with a model, and finally assigns the image to a specific class. According to reference [7], expressions can be grouped into seven types: happy, sadness, fear, anger, surprise, disgust, and neutral. The whole recognition process can be roughly divided into four parts. First, images are collected for training. Second, these images are preprocessed. Third, a model extracts features. Fourth, the images are classified. The third and fourth parts are particularly important, and researchers have carried out many experiments on them. For example, Ali [8] employed the Radon transform (RT) and the traditional SVM method. Lu and Evans [9] proposed using the Haar wavelet transform (HWT). Yang [10] introduced cat swarm optimization (CSO) and achieved 89.49% accuracy. Li used ResNet-18 [11] and ResNet-50 [12] for feature extraction and classification. These methods progressively improved model accuracy. Summarizing the above literature, we find that models lose some image information during feature extraction. Therefore, we hypothesize that extracting more complete features from the image can further improve recognition accuracy.

EAI Endorsed Transactions
In this paper, we propose a novel FER algorithm based on deep learning. The main contributions of this paper are as follows: (i) a Modified ResNet-152 is proposed; (ii) retrained weights are used for the facial data; (iii) the recognition system outperforms state-of-the-art methods.
Improvements in the accuracy and stability of facial recognition technology show great potential in the field of IoT [13,14], for example in health care, finance, and transportation. We must therefore develop algorithms and master core technologies; only in this way can we promote the development of the IoT field.

Dataset
We adopted the dataset from [15]. It contains images of seven facial emotions: happy, sadness, fear, anger, surprise, disgust, and neutral. There are 100 images for each type of expression, giving 700 images in total. The images were taken by professional photographers using a Canon digital camera. Figure 1 displays the seven emotion classes for a male and a female face.

Input layers
When we recognize an image, we use the image as input, but we do not feed in its raw data directly. Before input, we preprocess the data. Preprocessing operations usually include de-averaging, normalization, de-correlation, and whitening.

De-averaging operation
There are two main ways to perform de-averaging. Suppose we have ten images, each of size $100 \times 100 \times 3$. In the first way, we sum the corresponding pixels of the ten images and compute the average, which gives a mean image that is subtracted from every input. In the second way, we directly compute the mean value of each of the three RGB color channels. When a new image arrives as input, we subtract the corresponding mean from its R channel, and the other channels are treated in the same way. Without de-averaging, the model overfits more easily. The result of this operation is that every dimension of the input data is centered at 0.
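The two de-averaging variants can be sketched as follows. This is a minimal NumPy illustration and not necessarily the preprocessing code used in the experiments; the random array `images` and all variable names are assumptions made for the example.

```python
import numpy as np

# Hypothetical toy batch: ten RGB images of size 100 x 100 x 3.
# Random values stand in for real pixel data.
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(10, 100, 100, 3)).astype(np.float64)

# Way 1: per-pixel mean image, averaged over the ten images.
mean_image = images.mean(axis=0)              # shape (100, 100, 3)
centered_per_pixel = images - mean_image

# Way 2: one mean value per color channel (R, G, B),
# averaged over all images and all pixel positions.
channel_mean = images.mean(axis=(0, 1, 2))    # shape (3,)
centered_per_channel = images - channel_mean  # broadcast over the last axis
```

In both variants the centered data have zero mean along the axes that were averaged, which is the centering effect described above.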

Normalization
If the range of each feature is different, the optimization algorithm is adversely affected. For example, the data of one feature may lie between 1000 and 1500, while the data of another feature lies between 1 and 10. This gap has a negative impact on training, so we rescale the image. There are two types of this operation: min-max normalization and mean-variance normalization. The former is suitable for data distributed in a limited range; the maximum is normalized to 1 and the minimum to 0. The whole process follows the formula
$$x^{*}_{i,j} = \frac{x_{i,j} - \min_j}{\max_j - \min_j}$$
where $\min_j$ represents the minimum of the $j$th column in the pixel matrix, $\max_j$ represents the maximum of the $j$th column, $x_{i,j}$ represents the pixel at position $(i, j)$ of the pixel matrix, and $x^{*}_{i,j}$ represents the normalized value at position $(i, j)$. The latter is suitable for cases where the distribution has no obvious boundary. In most cases, the mean is normalized to 0 and the variance to 1. The whole process satisfies the formula
$$x^{*} = \frac{x - \mu(x)}{\sigma(x)}$$
where $x^{*}$ represents the normalized value, $\sigma(x)$ represents the standard deviation, $\mu(x)$ represents the mean value, and $x$ represents the set of the $i$th features.
The advantage of this operation is that every feature is scaled to the same range, so that convergence time is shortened and the optimal solution is found more easily. This is also the most common preprocessing operation.
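Both normalization schemes can be sketched in a few lines of NumPy. The function names `min_max_normalize` and `z_score_normalize` and the toy matrix `features` are hypothetical; the sketch assumes that no column is constant, since a constant column would lead to a division by zero.

```python
import numpy as np

def min_max_normalize(x):
    """Scale every column of x to the range [0, 1]."""
    col_min = x.min(axis=0)
    col_max = x.max(axis=0)
    return (x - col_min) / (col_max - col_min)

def z_score_normalize(x):
    """Shift every column of x to mean 0 and scale it to variance 1."""
    col_mean = x.mean(axis=0)
    col_std = x.std(axis=0)
    return (x - col_mean) / col_std

# Toy data echoing the example in the text: one feature between 1000 and 1500,
# another feature between 1 and 10.
features = np.array([[1000.0, 1.0],
                     [1250.0, 5.5],
                     [1500.0, 10.0]])

scaled = min_max_normalize(features)
standardized = z_score_normalize(features)
```

After either transformation the two features lie on a comparable scale, even though their raw ranges differ by two orders of magnitude.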

De-correlation and whitening
Principal Component Analysis (PCA) is used for de-correlation. The core idea of PCA is to replace a large number of correlated features with a small number of representative, uncorrelated features, so as to accelerate the training process. PCA can also be used to reduce dimensionality.
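A minimal sketch of PCA-based de-correlation, followed by the variance scaling that turns it into whitening, is given here. It assumes NumPy; the helper `pca_whiten`, the small constant `eps`, and the synthetic two-feature data are hypothetical and serve only as an illustration.

```python
import numpy as np

def pca_whiten(x, eps=1e-8):
    """Return (decorrelated, whitened) versions of x, one sample per row."""
    x_centered = x - x.mean(axis=0)
    cov = x_centered.T @ x_centered / x_centered.shape[0]
    # eigh is appropriate because the covariance matrix is symmetric.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Projecting onto the eigenvectors removes the correlation between features.
    decorrelated = x_centered @ eigvecs
    # Dividing by the standard deviation of each component equalizes the variances.
    whitened = decorrelated / np.sqrt(eigvals + eps)
    return decorrelated, whitened

# Synthetic data: the second feature is almost a multiple of the first,
# so the two features are strongly correlated.
rng = np.random.default_rng(0)
base = rng.normal(size=(500, 1))
data = np.hstack([base, 2.0 * base + 0.1 * rng.normal(size=(500, 1))])

decorrelated, whitened = pca_whiten(data)
```

The covariance matrix of `decorrelated` is diagonal, and the covariance matrix of `whitened` is approximately the identity. Keeping only the columns with the largest eigenvalues would additionally reduce the dimension.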
In a static image, neighboring pixels are correlated, so a large number of pixels are redundant. In this case, the whitening operation can be used for de-correlation. This operation has the following advantages: 1. It reduces the correlation between features. 2. All features have the same variance.

EAI Endorsed Transactions on Internet of Things | 04 2022 - 04 2022 | Volume 7 | Issue 28 | e5

Convolution layers
When the variables of convolution are the functions $f(x)$ and $g(x)$, the convolution formula is
$$(f * g)(x) = \int_{-\infty}^{\infty} f(\tau)\, g(x - \tau)\, d\tau \quad (3)$$
where $\tau$ is the integration variable, $x$ is the displacement of the function $g(-\tau)$, and "*" denotes convolution. When the variables of convolution are the sequences $f(n)$ and $g(n)$, the convolution formula is
$$(f * g)(n) = \sum_{\tau = -\infty}^{\infty} f(\tau)\, g(n - \tau)$$
When $n$ is zero, $g(-\tau)$ is the reversed sequence of $g(\tau)$, and "*" again denotes convolution.
Convolution plays an essential role in neural networks. As shown in (3), $f(x)$ and $g(x)$ denote two integrable functions [16]. In image processing, $f(x)$ denotes the pixels of the image and $g(x)$ denotes the parameters of the convolution kernel [17,18]. In effect, convolution takes into account the pixels surrounding each pixel, or even all pixels of the image, and applies a weighted combination to the current pixel to achieve a certain purpose. Explanations of the use of the convolution operation can be found in Refs. [19][20][21]. The process is shown in Figure 2.
Once the convolution operation is understood, the reason why a CNN has far fewer parameters than a DNN becomes clear. A DNN has so many parameters because each connection between a neuron and a node in the previous layer has its own parameter. In a CNN, by contrast, the same kernel parameters are reused at every position: we slide the convolution kernel over every position of the image, and the number of parameters does not change during the traversal. This is the parameter-sharing property of the convolutional neural network. These parameters need to be learned. In summary, the convolution operation not only reduces the number of parameters but also extracts features from the image.
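The discrete convolution and the effect of parameter sharing can be sketched as follows. The functions `conv1d_full` and `conv2d_valid`, the toy inputs, and the $100 \times 100$ image with a $3 \times 3$ kernel used for the parameter count are hypothetical illustrations. The kernel is flipped to match the convolution formula; many deep learning libraries omit the flip and compute a cross-correlation instead.

```python
import numpy as np

def conv1d_full(f, g):
    """Discrete convolution (f * g)(n) = sum over tau of f(tau) * g(n - tau)."""
    out = np.zeros(len(f) + len(g) - 1)
    for n in range(len(out)):
        for tau in range(len(f)):
            if 0 <= n - tau < len(g):
                out[n] += f[tau] * g[n - tau]
    return out

def conv2d_valid(image, kernel):
    """Slide one shared kernel over the image (stride 1, no padding)."""
    flipped = kernel[::-1, ::-1]          # flip so that this is a true convolution
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # The same kernel weights are reused at every position (i, j).
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * flipped)
    return out

seq_result = conv1d_full([1, 2, 3], [0, 1, 0.5])

image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.array([[1.0, 0.0],
                   [0.0, -1.0]])
feature_map = conv2d_valid(image, kernel)

# Parameter count for mapping a 100 x 100 input to a 98 x 98 output:
dense_params = (100 * 100) * (98 * 98)    # one weight per input-output pair
conv_params = 3 * 3                       # one shared 3 x 3 kernel
```

For this hypothetical layer size, the fully connected mapping needs 96,040,000 weights, whereas the shared kernel needs 9.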

Pooling layer
The convolution layer is followed by a pooling layer. Why can't the results of convolution be used directly for classification? After the convolution operation, we obtain a set of feature maps. These feature maps are high-dimensional, and connecting them directly to the fully-connected layer would require a very large amount of computation. The pooling layer reduces the dimension of the features and thus solves this problem well [22,23]. Another reason is that adjacent pixels in an image tend to have similar values, so adjacent outputs of the convolution layer usually have similar values as well. This means that much of the information in the output of the convolution layer is redundant, and the pooling layer removes this redundancy. Moreover, pooling can prevent overfitting to a certain extent and make optimization more convenient [24]. Last but not least, pooling provides some invariance, such as rotation invariance, translation invariance, and scale invariance. Translation invariance [25][26][27] means that the output remains basically unchanged when the input is translated. For example, if the input is the vector (1, 5, 2), the result of max pooling is 5. If we shift the input one position to the right to get (0, 1, 5), the output is still 5. Put simply, a CNN can produce the same result for an image and its translated version, which benefits the classification task. In addition, the pooling layer can improve the feature extraction ability of the model [28,29]. We introduce two kinds of pooling below.
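The translation invariance of max pooling, and its limits, can be illustrated with a small sketch in plain Python. The signal, the window size of 4, and the helper names `max_pool_1d` and `shift_right` are hypothetical choices made for this example.

```python
def max_pool_1d(signal, size):
    """Non-overlapping max pooling over windows of the given size."""
    return [max(signal[i:i + size])
            for i in range(0, len(signal) - size + 1, size)]

def shift_right(signal, steps):
    """Shift the signal to the right, padding with zeros on the left."""
    return [0] * steps + signal[:len(signal) - steps]

signal = [0, 0, 9, 0, 0, 0, 0, 0]

pooled = max_pool_1d(signal, 4)
pooled_shift1 = max_pool_1d(shift_right(signal, 1), 4)
pooled_shift2 = max_pool_1d(shift_right(signal, 2), 4)
```

A shift of one position keeps the peak inside the same pooling window, so `pooled_shift1` equals `pooled`. A shift of two positions moves the peak into the next window, so `pooled_shift2` differs. The invariance is therefore local rather than exact.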

Average Pooling
The average pooling [30] operation takes the average of each block, so that information from all the features in the block is passed to the next layer. When the image contains a lot of useful information, we usually use average pooling. For example, average pooling is often used before the fully-connected layer, because the last layers contain a wealth of semantic information and max pooling would discard much of it. In Figure 3, the left side represents the pixel matrix of the image, with a size of $4 \times 4$, and the right side represents the result after average pooling. From the result, we can see that the kernel size is $2 \times 2$ and the stride is 2, so the dimension of the feature map changes from 4 to 2. For the specific calculation, we take the first pixel in the upper left of the result matrix as an example:
$$\mathrm{avg}(x_{11}, x_{12}, x_{21}, x_{22}) = \frac{x_{11} + x_{12} + x_{21} + x_{22}}{4}$$

Max Pooling
The max pooling operation takes the maximum value in each block, and the other pixels are not passed to the next layer. Intuitively, if a feature is detected anywhere in a quadrant [31][32][33], max pooling preserves its high response. If the feature is not detected, it probably does not exist in this quadrant, and the maximum value remains small [34][35][36]. When the image contains only a small amount of useful information, we usually use max pooling, for example in the first few layers of the network or for noisy images. Max pooling is widely used because it works well in many experiments.
In Figure 4, the left side represents the pixel matrix of the image, with a size of $4 \times 4$, and the right side represents the result after max pooling. From the result, we can see that the kernel size is $2 \times 2$ and the stride is 2, so the dimension of the feature map changes from 4 to 2. For the specific calculation, we take the first pixel "3" in the upper left of the result matrix as an example:
$$\max(x_{11}, x_{12}, x_{21}, x_{22}) = \max(3, 0, 0, 1) = 3$$
In forward propagation, max pooling needs to record the location of the maximum value, because in back propagation the gradient is passed only to that location and the gradient at all other locations is 0. This is not necessary for average pooling: the gradient is distributed over every element of the block, so it is divided by the size of the block.
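The forward and backward behavior of the two pooling types can be sketched with NumPy. The functions `pool2d` and `pool2d_backward` are hypothetical helpers. In the toy matrix `x`, only the upper-left block (3, 0, 0, 1) comes from the worked example; the remaining values are made up for illustration.

```python
import numpy as np

def pool2d(x, size=2, stride=2, mode="max"):
    """Max or average pooling over size x size blocks."""
    oh = (x.shape[0] - size) // stride + 1
    ow = (x.shape[1] - size) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            block = x[i * stride:i * stride + size, j * stride:j * stride + size]
            out[i, j] = block.max() if mode == "max" else block.mean()
    return out

def pool2d_backward(x, grad_out, size=2, stride=2, mode="max"):
    """Route the output gradient back to the input of the pooling layer."""
    grad_in = np.zeros_like(x, dtype=float)
    for i in range(grad_out.shape[0]):
        for j in range(grad_out.shape[1]):
            rows = slice(i * stride, i * stride + size)
            cols = slice(j * stride, j * stride + size)
            if mode == "max":
                # Only the location of the maximum receives the gradient.
                block = x[rows, cols]
                r, c = np.unravel_index(np.argmax(block), block.shape)
                grad_in[i * stride + r, j * stride + c] += grad_out[i, j]
            else:
                # Every element of the block receives an equal share.
                grad_in[rows, cols] += grad_out[i, j] / (size * size)
    return grad_in

x = np.array([[3.0, 0.0, 2.0, 1.0],
              [0.0, 1.0, 4.0, 0.0],
              [1.0, 2.0, 0.0, 0.0],
              [5.0, 1.0, 3.0, 2.0]])

max_out = pool2d(x, mode="max")
avg_out = pool2d(x, mode="avg")
grad_max = pool2d_backward(x, np.ones((2, 2)), mode="max")
grad_avg = pool2d_backward(x, np.ones((2, 2)), mode="avg")
```

With a unit gradient on every output, `grad_max` is nonzero only at the four positions of the block maxima, whereas every entry of `grad_avg` equals 0.25.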

Activation function
Activation functions are divided into two categories: saturated functions and non-saturated functions. Saturated functions include Tanh, sigmoid, and so on. Non-saturated functions include ReLU and its variants. Why is an activation function introduced? Without activation functions, the output is merely a linear combination of the inputs, which is equivalent to having no hidden layers [37]. This is the most primitive perceptron [38]. We first introduce saturated activation functions. Let $h(x)$ be an activation function. When $x$ approaches positive infinity and the derivative of the activation function approaches 0, we call it right saturation:
$$\lim_{x \to +\infty} h'(x) = 0$$
When $x$ approaches negative infinity and the derivative of the activation function approaches 0, we call it left saturation:
$$\lim_{x \to -\infty} h'(x) = 0$$
A function that satisfies both left and right saturation is called a saturated function. A function that does not meet these conditions is called a non-saturated activation function. The most commonly used one is the ReLU function, shown in Figure 5 and (10):
$$y = \max(0, x) \quad (10)$$
In (10), $x$ denotes the output of the previous convolution layer and $y$ denotes the output of the activation function. In Figure 5, the horizontal axis denotes $x$ and the vertical axis denotes $y$. In addition, training the weights requires the derivative of the activation function. The derivative of ReLU, with the value at $x = 0$ set to 0 by convention, is
$$\frac{dy}{dx} = \begin{cases} 1, & x > 0 \\ 0, & x \le 0 \end{cases}$$
The ReLU function has the following advantages: 1. It alleviates the vanishing-gradient problem in back propagation. 2. It makes the neural network converge much faster than sigmoid and Tanh. 3. It is very fast to compute, since it only needs to check whether the input is greater than 0.
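The difference between saturated and non-saturated functions, and the reason a network without activation functions stays linear, can be checked numerically. The sketch uses only the Python standard library; the probe points of plus and minus 20 and the scalar "layers" are hypothetical choices for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh_grad(x):
    return 1.0 - math.tanh(x) ** 2

def relu(x):
    return max(0.0, x)

def relu_grad(x):
    # The derivative at x = 0 is set to 0 by convention.
    return 1.0 if x > 0 else 0.0

def two_linear_layers(x, w1, b1, w2, b2):
    """Two stacked scalar layers without an activation in between."""
    return w2 * (w1 * x + b1) + b2

def one_linear_layer(x, w1, b1, w2, b2):
    """The single linear layer that the two stacked layers collapse into."""
    return (w2 * w1) * x + (w2 * b1 + b2)

# Saturated functions: the gradient is nearly zero far from the origin on both sides.
saturated_grads = [sigmoid_grad(20.0), sigmoid_grad(-20.0),
                   tanh_grad(20.0), tanh_grad(-20.0)]

# ReLU: the gradient stays at 1 for any positive input, so it does not saturate on the right.
relu_grads = [relu_grad(20.0), relu_grad(-20.0)]
```

Because `two_linear_layers` and `one_linear_layer` return the same value for every input, stacking linear layers without a nonlinearity adds no representational power.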

Fully-connected layer
In a fully-connected layer, every neuron in one layer has a weighted connection to every neuron in the adjacent layer. Because of the large amount of computation, it is usually placed at the tail of the neural network. In practice, the fully-connected layer can be implemented as a convolution operation, that is, as $1 \times 1$ convolutions performed on the previous layer. In addition, it acts as a classifier. The structure is shown in Figure 6.
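The equivalence between a fully-connected layer and a $1 \times 1$ convolution can be verified on a small example. The sketch assumes NumPy; the sizes (8 input channels, 7 outputs), the random weights, and the helper `conv1x1` are hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical input: 8 channels with 1 x 1 spatial size,
# for example the result of global pooling.
features = rng.normal(size=(8, 1, 1))
weights = rng.normal(size=(7, 8))     # 7 output neurons, 8 inputs each
bias = rng.normal(size=7)

# Fully-connected layer: a matrix-vector product plus bias.
fc_out = weights @ features.reshape(8) + bias

def conv1x1(feature_map, kernels, bias):
    """Apply K kernels of shape (C, 1, 1) to a (C, H, W) feature map."""
    k = kernels[:, :, 0, 0]                       # shape (K, C)
    return np.einsum("kc,chw->khw", k, feature_map) + bias[:, None, None]

# The same weights reshaped into 7 kernels of size 8 x 1 x 1.
conv_out = conv1x1(features, weights.reshape(7, 8, 1, 1), bias)
```

Both paths use exactly the same parameters and produce the same 7 output values, which is why the two views are interchangeable.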
In the loss function, $y$ denotes the true value and $\hat{y}$ denotes the predicted value. In general, the smaller the loss, the higher the accuracy of the model. To improve the accuracy of the model, we should therefore reduce the value of the loss function as much as possible. We use gradient descent to minimize the loss function $L$. In the gradient descent process, we iterate the following two equations until the loss function converges:
$$w \leftarrow w - \alpha \frac{\partial L}{\partial w}, \qquad b \leftarrow b - \alpha \frac{\partial L}{\partial b}$$
where $w$ and $b$ denote the weights and biases and $\alpha$ denotes the learning rate.
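A minimal gradient descent loop is sketched below in plain Python. It is an illustration under stated assumptions and not the training procedure of the recognizer: the loss is assumed to be the mean squared error, the model is a single linear unit with weight `w` and bias `b`, and the learning rate, step count, and toy data are hypothetical.

```python
def gradient_descent(xs, ys, lr=0.05, steps=2000):
    """Fit y_hat = w * x + b by minimizing the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        preds = [w * x + b for x in xs]
        # Partial derivatives of the mean squared error with respect to w and b.
        grad_w = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / n
        grad_b = sum(2 * (p - y) for p, y in zip(preds, ys)) / n
        # Move each parameter a small step against its gradient.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w, b = gradient_descent(xs, ys)
```

On this toy problem the loop recovers the weight 2 and the bias 1 to high precision. In a deep network the same update is applied to every weight and bias, with the gradients supplied by back propagation.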
There are many ways to train neural networks. Commonly, the Back Propagation (BP) algorithm computes the gradients and the Stochastic Gradient Descent (SGD) algorithm [39] uses them to update the weights. If we use SGD for training, batch normalization is strongly recommended. As training proceeds, the distribution of the input data to subsequent layers changes. This phenomenon is called "Internal Covariate Shift" [40]. Batch normalization can alleviate this problem. Moreover, it reduces the need to set some hyperparameters manually.
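The forward pass of batch normalization in training mode can be sketched as follows, assuming NumPy. The function `batch_norm`, the constant `eps`, and the synthetic mini-batch are hypothetical; the running statistics used at inference time are omitted for brevity.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then scale and shift."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

# Synthetic mini-batch of 64 samples with 4 features,
# deliberately far from zero mean and unit variance.
rng = np.random.default_rng(0)
batch = rng.normal(loc=5.0, scale=3.0, size=(64, 4))

normalized = batch_norm(batch, gamma=np.ones(4), beta=np.zeros(4))
```

Whatever the distribution of the incoming activations, each feature of `normalized` has a mean of approximately 0 and a standard deviation of approximately 1, which keeps the input distribution of the next layer stable during training. The learnable parameters `gamma` and `beta` let the network undo the normalization where that is beneficial.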

Modified ResNet-152
When the network is deepened, the representation ability of the model becomes stronger; in other words, the model can extract more features from the image. However, a deeper network is harder to train. The first problem encountered during training is vanishing/exploding gradients. As the number of layers increases, the gradient is propagated through a longer chain of multiplications and becomes unstable during back propagation. For the vanishing-gradient problem, researchers have proposed many solutions, such as batch normalization (BN) and MSRA initialization combined with BN. Another problem is network degradation: as the depth of the network increases, the performance of the model becomes worse. Specifically, the accuracy on the training set decreases. We can be sure that this is not caused by overfitting, because with overfitting the accuracy on the training set would be very high. ResNet overcomes this obstacle through residual learning [6]. A building block is shown in Figure 7.
According to Figure 7, ResNet provides two paths: identity mapping and residual mapping. Suppose the input $x$ of a block has already reached the optimal level; in other words, $x$ is optimal. When we continue to deepen the network, the residual mapping is driven to 0, and only the identity mapping remains. This is why the network stays in its optimal state [41]. Most importantly, deepening the network does not decrease the performance of the model. This paper uses Modified ResNet-152, a variant of ResNet-152, as a recognizer to classify facial expressions. In the Modified ResNet-152 architecture, the first layer is a $7 \times 7$ convolution. It is followed by four groups of building blocks with 9, 24, 108, and 9 layers, respectively. Finally, the network applies global average pooling, a 7-way fully-connected layer, and softmax. The architecture is shown in Figure 8.
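The identity behavior of a residual block and the layer count of the architecture can be checked with a short sketch, assuming NumPy. The toy block with two weight matrices is a hypothetical stand-in for the real convolutional block. The division of each stage into three-layer bottleneck blocks follows the standard ResNet-152 design [6] and is assumed to carry over to the modified network.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x, w1, w2):
    """Toy block: output = relu(F(x) + x) with F(x) = w2 @ relu(w1 @ x)."""
    residual = w2 @ relu(w1 @ x)   # residual mapping F(x)
    return relu(residual + x)      # the shortcut adds the identity mapping

# A non-negative input, as produced by a preceding ReLU.
x = np.array([0.5, 1.0, 2.0])

# When the residual mapping is driven to zero, the block reduces to the identity.
zero_weights = np.zeros((3, 3))
identity_out = residual_block(x, zero_weights, zero_weights)

# Layer count: first convolution + four stages + fully-connected layer.
stage_layers = [9, 24, 108, 9]
bottlenecks_per_stage = [n // 3 for n in stage_layers]   # three layers per bottleneck
total_layers = 1 + sum(stage_layers) + 1
```

The stage sizes correspond to 3, 8, 36, and 3 bottleneck blocks, and the total of 152 weighted layers matches the name of the network.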

Measure
Cross-validation, also called rotation estimation, is a statistical method that partitions the data samples into smaller subsets. Its core idea is to group the original dataset, using one part as the training set and the other as the validation or test set. The whole process is divided into two parts: first, the classifier is trained on the training set; second, the model is tested on the validation set. The purpose of this technique is to obtain a reliable and stable model and a more trustworthy estimate of its generalization ability. In this paper, we used ten-fold cross-validation.
The dataset has 700 images in total. Each fold includes 70 images, with 10 images for each type of expression. We used 8 folds for training, 1 fold for validation, and 1 fold for testing. To visualize the performance of the algorithm, we use a confusion matrix. In the confusion matrix $C(r, f)$, $r$ denotes the number of runs and $f$ denotes the number of folds. When $r = 1$ and $f = 10$, the ideal result is
$$C(r = 1, f = 10) = \mathrm{diag}(a_1, a_2, \ldots, a_7)$$
where $a_i$ denotes the number of correctly recognized images for the $i$th type of expression. Therefore, in the ideal state, $a_i = 100$ for $i = 1, \ldots, 7$. In general, we run the experiment 10 times to reduce the effect of randomness. In other words, when $r = 10$ and $f = 10$, we obtain 10 confusion matrices, which we add up. Ideally, the result is
$$C(r = 10, f = 10) = \mathrm{diag}(1000, 1000, \ldots, 1000)$$
In practice, it is hard to bring the model to this ideal state. We use the sensitivity and the overall accuracy (OA) as the evaluation indices of the model, measured by the following two formulas:
$$\mathrm{Sensitivity}_i = \frac{c_{ii}}{\sum_{j=1}^{7} c_{ij}}, \qquad \mathrm{OA} = \frac{\sum_{i=1}^{7} c_{ii}}{\sum_{i=1}^{7} \sum_{j=1}^{7} c_{ij}}$$
where $c_{ij}$ denotes the number of images belonging to class $i$ and identified as class $j$. Table 1 shows the sensitivity analysis of each class, and Figure 9 shows the trend of the sensitivity of each class. The sensitivity of each expression can be read from Table 1 and Figure 9.
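The evaluation protocol can be sketched as follows, assuming NumPy. The helper names, the random seed, and the rule that the validation fold is the fold following the test fold are hypothetical choices for illustration; the ideal classifier is simulated by using the true labels as predictions.

```python
import numpy as np

NUM_CLASSES, PER_CLASS, NUM_FOLDS = 7, 100, 10

def stratified_folds(labels, num_folds, seed=0):
    """Split sample indices into folds with equally many images per class."""
    rng = np.random.default_rng(seed)
    folds = [[] for _ in range(num_folds)]
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))
        for k, part in enumerate(np.array_split(idx, num_folds)):
            folds[k].extend(part.tolist())
    return folds

def split_for_run(folds, test_fold):
    """8 folds for training, 1 for validation, 1 for testing."""
    val_fold = (test_fold + 1) % len(folds)   # assumption: the next fold validates
    train = [i for k, f in enumerate(folds)
             if k not in (test_fold, val_fold) for i in f]
    return train, folds[val_fold], folds[test_fold]

def confusion_matrix(true, pred, num_classes):
    """Entry (i, j) counts images of class i that were identified as class j."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(true, pred):
        cm[t, p] += 1
    return cm

def sensitivity(cm):
    return np.diag(cm) / cm.sum(axis=1)

def overall_accuracy(cm):
    return np.trace(cm) / cm.sum()

labels = np.repeat(np.arange(NUM_CLASSES), PER_CLASS)   # 700 labels, 100 per class
folds = stratified_folds(labels, NUM_FOLDS)
train, val, test = split_for_run(folds, test_fold=0)

# Ideal classifier: the predictions equal the true labels on every test fold.
ideal_cm = sum(confusion_matrix(labels[f], labels[f], NUM_CLASSES) for f in folds)
```

For the ideal classifier, one run over all ten test folds gives a diagonal confusion matrix with 100 in every diagonal entry, so every sensitivity and the OA equal 1. Replacing the simulated predictions with real model outputs yields the measured values.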

Comparison with State-of-the-art Approaches
The state-of-the-art methods are HWT [9], CSO [10], and BBO [42], with corresponding OAs of 78.37 ± 1.50%, 89.49 ± 0.76%, and 93.79 ± 1.24%, respectively. The comparison is shown in Table 3. According to Figure 10, the "Modified ResNet-152" method achieves 96.44 ± 0.56% accuracy, the highest of all. The second-highest accuracy is that of BBO, which achieves 93.79 ± 1.24%. The third-highest is that of CSO, which achieves 89.49 ± 0.76%. The lowest is that of HWT, which achieves 78.37 ± 1.50%.
According to Table 1, the "Modified ResNet-152" method obtains the highest OA mainly because (i) the CNN can extract features from images at different scales, and (ii) the degradation problem is well solved.
The method in second place is BBO, which solves an optimization problem. The core of the BBO algorithm is migration and mutation, which are the important steps in solving the problem. The method in third place is CSO, a global optimization algorithm whose idea is to mimic the behavior of cats.

Comparison with other ResNet variants
ResNet solves the problem of network degradation very well. Can we obtain a better model for the classification task by deepening the ResNet network? Li et al. used ResNet-18 [11] and ResNet-50 [12] as facial expression classifiers and achieved accuracies of 94.80 ± 1.43% and 95.39 ± 1.41%, respectively. Based on Table 4 and Figure 11, we can see that the "Modified ResNet-152" achieves higher accuracy.

Conclusion
This paper proposes a facial expression recognizer based on Modified ResNet-152. We show that our recognizer can classify human facial expressions accurately. The lower accuracy of some categories is attributed to an imbalance in the number of images per category.
Our future work will proceed in several directions. First, we will continue to deepen the residual network to improve the performance of the model. Second, since a person may show multiple expression types in one expression, we will try to construct a system that recognizes compound expressions. Moreover, we will try to optimize the algorithms [49][50][51] to reduce the training time and improve the accuracy. Last but not least, if we can obtain a recognition model with high accuracy and good stability, we believe it will further promote the development of the IoT field.