SHELF: Combination of Shape Fitting and Heatmap Regression for Landmark Detection in Human Face

Today, facial emotion recognition is widely adopted in many intelligent applications, including driver monitoring systems, smart customer care and e-learning systems. Human emotions can be well represented by facial landmarks, which are hard to detect from images due to the high number of discrete landmarks and the variation of shapes and poses of the human face in the real world. Over the decades, many methods have been proposed for facial landmark detection, including shape fitting and coordinate regression methods such as ASMNet and AnchorFace. However, their performance is still limited for real-time applications in terms of both accuracy and efficiency. In this paper, we propose a novel method called SHELF, which is the first to combine the shape fitting and heatmap regression approaches for landmark detection in human faces. The heatmap model aims to generate landmarks that fit the common shapes. The method has been evaluated on three datasets, 300W-Challenging, WFLW and 300VW-E, with 31557 images, and achieved a normalized mean error (NME) of 6.67%, 7.34% and 12.55%, respectively, which outperforms most existing methods. On the first two datasets, the method is also comparable to the state-of-the-art AnchorFace, which has an NME of 6.19% and 4.62%, respectively.


Introduction
In recent years, given the ever-growing demand for intelligent human-monitoring applications, the recognition of facial expressions from images has become an active research field. With the advancements of deep learning, this can be formulated as an image classification problem, addressed by a state-of-the-art model such as ResNet [2], EfficientNet [3], MobileNetV2 [4], ShuffleNetV2 [5] or VisionTransformer [6] trained on a large dataset of images. However, these models are not efficient because many pixels in the image do not contribute to the facial emotions.
In fact, a more promising approach for facial emotion recognition is based on the detection of facial landmarks, a group of important pixels located around the eyes, the nose, the mouth and the boundary of the face, as shown in Fig. 1. Facial emotions can be clearly recognized from only a small set of landmarks if they are correctly detected. The task of facial landmark detection is to locate these points in a given face image, as depicted in Fig. 2 (see footnote 1). This problem is very challenging due to the variation of facial appearance and the high number of discrete landmarks.
Footnote 1: The original image is taken from https://www.dreamstime.com/photos-images/bus-driver.html
Previously, facial landmarks were detected by shape fitting models, which include the Active Shape Model (ASM) [7], Active Appearance Models (AAM) [8], the Constrained Local Model (CLM) [9], Discriminative Response Map Fitting (DRMF) [10] and DeFA [11]. These models are not aware of facial appearance. Instead, the landmarks are regressed from the common facial shapes in a given dataset. These models converge quickly but usually under-fit due to the high variation of facial shapes in the real world.
Recently, regression networks have been preferred thanks to advancements in convolutional neural networks (CNN). These include a CNN backbone for feature extraction and a regression head. In direct regression methods such as the Style Aggregation Network (SAN) [12], the landmarks are directly regressed from the aggregated feature map of the CNN backbone. The feature map focuses more on the facial appearance and less on the shape and pose. However, for faces that are occluded or in a large pose, the appearance is missing and certain landmarks may not be detected.
In heatmap regression models such as MobileFAN [13], the CNN backbone is replaced by a fully convolutional network (FCN) with additional de-convolutional layers. An FCN is essentially an auto-encoder which encodes the face image and decodes it to a corresponding heatmap highlighting the landmarks. Unlike the direct regression methods, these models can reconstruct the facial landmarks in case of occlusion and large pose. However, similar to generative models, they converge less easily and suffer from the hallucination issue.
In this paper, we propose a novel method called SHELF which appropriately combines the Shape fitting and the HEatmap regression approaches for detection of Landmarks in human Face. This is because the trade-offs of the two approaches can compensate for each other. Our main contributions are therefore four-fold as follows:
• A heatmap generation neural network is built using a CNN with additional de-convolutional layers.
• A regression head is designed for determining the landmark with the highest probability using a softmax-argmax layer. Then, the shape fitting loss and the heatmap regression loss are combined in an efficient manner.
• A large dataset called 300VW-E of 31757 facial images, each labelled with 20 landmarks, has been prepared for the recognition of emotions in human faces. This is an extension of the 300VW public dataset.
• An evaluation of SHELF is carried out on three datasets: 300W-Challenging, WFLW and 300VW-E. The method achieved a low normalized mean error (NME) of 6.67%, 7.34% and 12.55%, respectively. These results outperform existing methods such as DeFA, SAN, MobileFAN, ASMNet and CFSS on all three datasets. SHELF is less accurate than the state-of-the-art AnchorFace [14] because it uses fewer anchor shapes.
The rest of the paper is organized as follows. In Section 2, we introduce different approaches for facial landmark detection. Then, the proposed method SHELF is presented in detail in Section 3. We summarize the experimental results of SHELF on three datasets and present an ablation study of SHELF in Section 4. Finally, Section 5 concludes our work.

Related Works
Over the decades, many different approaches for facial landmark detection have been proposed. In this section, we introduce state-of-the-art methods related to shape fitting as well as to the regression of landmarks.

Shape Fitting
Traditional template matching approaches such as ASM [7], AAM [8], CLM [9] and DeFA [11] detect facial landmarks by learning their common distribution and regressing them from a mean shape computed from certain active samples. ASM is based on the dimension reduction method Principal Component Analysis (PCA) [15] for shape fitting. AAM improved the performance of ASM by combining both the shape and appearance models in an iterative manner. CLM introduced another appearance sampling technique in which the pixel values in the texture patches are normalized to zero mean and unit variance. Using a CNN, DeFA models the facial shape in 3D, not only to align facial landmarks but also to match SIFT (Scale-Invariant Feature Transform) points as well as the facial contours. However, due to limited feature engineering, the performance of such approaches is limited, especially in the case of occluded face images.

Landmark Regression
As introduced, the neural networks for facial landmark detection usually include a CNN backbone and a regression head which is fed with a feature vector. The networks can be categorized as coordinate regression or heatmap regression according to the way this vector is built from the backbone.

Coordinate Regression.
In coordinate regression networks, any CNN encoder can be used as the backbone. The regression head is directly fed with the flattened feature embedding of the backbone. The Mnemonic Descent Method (MDM) [16] is a combined convolutional recurrent neural network which aims to make the regressors of facial landmarks cooperate. DeepReg [17] is a deep regressor for the gradual detection of facial landmarks with two-stage initialisation. In Wing [18], the Wing regression loss was proposed for landmark localization in place of the L1 and L2 losses, because it helps the regression network not only to deal with large localization errors, as L1 and L2 do, but also to handle medium and small localization errors well. Wing has been evaluated with a ResNet-50 [2] backbone. However, such a loss, averaged over a high number of positions on the whole face, cannot guarantee small prediction errors for individual landmarks.
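To make the behaviour of the Wing loss concrete, the following is a minimal sketch of its piecewise definition as given in the Wing loss paper [18]: logarithmic for small and medium errors, linear for large ones. The parameter values w = 10 and eps = 2 and the function name are illustrative assumptions, not settings taken from this paper.

```python
import math

def wing_loss(pred, target, w=10.0, eps=2.0):
    """Mean Wing loss over paired coordinate values (flat lists of floats)."""
    # Constant that makes the logarithmic and linear pieces meet at |x| = w.
    c = w - w * math.log(1.0 + w / eps)
    total = 0.0
    for p, t in zip(pred, target):
        x = abs(p - t)
        if x < w:
            # Small and medium errors: logarithmic piece, steeper near zero.
            total += w * math.log(1.0 + x / eps)
        else:
            # Large errors: behaves like L1 shifted by the constant c.
            total += x - c
    return total / len(pred)

# Example: one small error (1 pixel) and one large error (20 pixels).
print(wing_loss([1.0, 20.0], [0.0, 0.0]))
```

Compared with L2, the logarithmic piece keeps a relatively large gradient for small errors, which is the property the paragraph above refers to.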
Heatmap Regression. Heatmap regression networks such as AWing [19], MobileFAN [13], Gaussian Vector (GV) [20] and AdNet [21] use an autoencoder backbone composed of a CNN encoder and a decoder, which produces probability distributions in the form of heatmaps corresponding to the facial landmarks. In each heatmap, the position with the highest probability is chosen for the respective landmark.
AWing proposed an adaptive Wing loss function for coordinate regression from a facial boundary map in order to better conform the heatmap pixels to the facial shape. Gaussian Vector (GV) converts each heatmap into a pair of vectors for each landmark to preserve spatial information and simplify the post-processing. AdNet introduced an anisotropic direction loss and an anisotropic attention module to better learn the facial structure as well as the texture details, and to mitigate the error bias of facial landmarks.

Joint Shape Fitting and Regression Networks
There are also a few methods which combine the shape fitting approach with a regression network, such as LAB [22], ASMNet [23] and AnchorFace [14]. LAB is a combination of boundary fitting and coordinate regression. Using a stacked Hourglass network [24] as an autoencoder backbone to produce a facial boundary map, LAB then regresses the coordinates of the facial landmarks from the boundary in order to avoid the ambiguities of such key-points. ASMNet leveraged the light-weight MobileNetV2 [4] as its backbone and presented a multi-task loss which is the sum of the mean square error and the active shape model loss. This enables ASMNet to learn both the shape and the coordinates of the facial landmarks with fewer parameters than LAB.
In AnchorFace, the authors introduced a set of anchor templates and regress the offsets on each template. They then aggregate the predictions over all templates to produce the final results. AnchorFace utilized ShuffleNetV2 [5] as its backbone. AnchorFace can deal with large variations of face pose thanks to its anchor templates. Nevertheless, the anchor templates need to be carefully selected and the inference time must be improved. AnchorFace is also known as an anchor-based method.
Such joint approaches usually perform better than the separate ones. However, existing joint methods only combine coordinate regression with shape fitting. In this paper, we propose SHELF, a facial landmark detection method based on shape fitting and heatmap regression, to fill this gap and to leverage the robustness of such a combination.

SHELF: the proposed model
Our proposed method SHELF consists of a heatmap regression network and a training loss function including both the coordinate and the shape matching errors. The two principal components of the heatmap regression network are the heatmap-generating backbone and the heatmap regression head.

The Heatmap Regression Head
Given a set of n heatmaps $H = \{H_i\}$, $i = 1, \ldots, n$, each of size K x K (in this case K is equal to 224) and flattened to a vector of $K^2$ elements, the regression head of SHELF predicts the coordinates of the respective facial landmarks using a soft arg-max function. This function relies on a probability distribution $f(j)$, $j = 1, \ldots, K^2$, computed over the flattened heatmap with a temperature parameter $\alpha \geq 1$. For the i-th heatmap $H_i$, the function softargmax returns an index $j^*$ where $f(j^*)$ is the maximal value of $\{f(j), \forall j = 1, \ldots, K^2\}$.
From $j^*$, we can calculate the coordinates $(\hat{x}_i, \hat{y}_i)$ of the corresponding i-th facial landmark. This function is differentiable, so it can be used in SHELF instead of the traditional argmax and softmax functions.
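A minimal sketch of a soft arg-max head may help here. Several choices in it are assumptions rather than details confirmed by the text: the temperature is applied by multiplying the heatmap scores by alpha before a softmax, the heatmap is flattened row-major, and the differentiable output is taken as the probability-weighted mean position, which approaches the hard arg-max position as alpha grows. All function names are hypothetical.

```python
import math

def softmax_with_temperature(h, alpha=1.0):
    """Turn flattened heatmap scores into a probability distribution f(j)."""
    m = max(h)  # subtract the maximum for numerical stability
    exps = [math.exp(alpha * (v - m)) for v in h]
    z = sum(exps)
    return [e / z for e in exps]

def soft_argmax_2d(h, k, alpha=1.0):
    """Return (x, y) as the probability-weighted mean position of a k x k heatmap.

    h is flattened row-major (assumption), so index j maps to
    column x = j % k and row y = j // k.
    """
    f = softmax_with_temperature(h, alpha)
    x = sum(p * (j % k) for j, p in enumerate(f))
    y = sum(p * (j // k) for j, p in enumerate(f))
    return x, y

def hard_argmax_2d(h, k):
    """Non-differentiable reference: position of the largest heatmap value."""
    j_star = max(range(len(h)), key=lambda j: h[j])
    return j_star % k, j_star // k

# A 3 x 3 heatmap whose peak is at column 2, row 1 (flat index 5).
heatmap = [0.0, 0.0, 0.0,
           0.0, 0.0, 1.0,
           0.0, 0.0, 0.0]
print(hard_argmax_2d(heatmap, 3))
print(soft_argmax_2d(heatmap, 3, alpha=50.0))
```

With a small alpha the soft estimate is pulled toward the centre of the heatmap, while a large alpha concentrates f(j) on the peak, so the choice of temperature trades smooth gradients against localization sharpness.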

The Multitask Loss Function
As we aim to integrate the facial landmarks into a given shape, we designed a multitask loss function for training our proposed network.
The Coordination Loss. The mean square error is used as the coordinate loss as follows:
$$L_{coord} = \frac{1}{n} \sum_{i=1}^{n} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right]$$
where n is the number of facial landmarks and $(x_i, y_i)$, $(\hat{x}_i, \hat{y}_i)$, $i = 1, \ldots, n$, are the ground truth and predicted coordinates of the i-th facial landmark, respectively.
The Shape Loss. Given a training set with m samples in which the j-th sample, $j = 1, \ldots, m$, is represented as a vector of 2n dimensions $s_j = (x^j_1, y^j_1, x^j_2, y^j_2, \ldots, x^j_n, y^j_n)$, using PCA (Principal Component Analysis) [7], this can be approximated by $\tilde{s}_j$ as follows:
$$\tilde{s}_j = \bar{s} + P b_j$$
where $\bar{s}$ is the mean shape and $P = (p_1 | p_2 | \ldots | p_t)$ is a matrix constituted from the t eigenvectors with the highest corresponding eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_t$ of the following co-variance matrix:
$$C = \frac{1}{m-1} \sum_{j=1}^{m} (s_j - \bar{s})(s_j - \bar{s})^T$$
and $b_j$ is a t-dimensional vector containing a set of parameters for a deformable model:
$$b_j = P^T (s_j - \bar{s})$$
The shape loss is then calculated with respect to the approximated shape $\tilde{s}_j$.
The Multitask Loss. For every training sample, the overall loss is a combination of the coordinate loss and the shape loss controlled by the shape fitting rate β, which varies inversely with the number of training epochs of SHELF. This is because, like many other convolutional neural networks, SHELF learns the shape before the pixel-wise features of the image. The rate is defined as a discrete function of e and $N_e$, the current and total number of training epochs, respectively. At the initial steps of SHELF training, where the shape features are important, the shape fitting rate β is high. Conversely, at the final steps, β is set to zero since the network then mainly learns pixel-level features.
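The loss described above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the shape loss is taken as the mean square error between the prediction and the PCA approximation of the ground-truth shape (the ASMNet-style formulation), the overall loss is assumed to be the coordinate loss plus β times the shape loss, and the mean shape and eigenvectors are assumed to be precomputed from the training set. All function names are hypothetical.

```python
def coordinate_loss(pred, truth):
    """Mean square error between two shapes given as flat lists (x1, y1, ..., xn, yn)."""
    n = len(truth) // 2  # number of landmarks
    return sum((p - t) ** 2 for p, t in zip(pred, truth)) / n

def pca_approximation(shape, mean_shape, eigvecs):
    """Project a shape on the PCA basis and reconstruct it: s~ = mean + P b."""
    centered = [s - m for s, m in zip(shape, mean_shape)]
    # b = P^T (s - mean); eigvecs holds the t columns of P, each a list of 2n values.
    b = [sum(p_d * c for p_d, c in zip(p, centered)) for p in eigvecs]
    approx = list(mean_shape)
    for b_k, p in zip(b, eigvecs):
        for d in range(len(approx)):
            approx[d] += b_k * p[d]
    return approx

def multitask_loss(pred, truth, mean_shape, eigvecs, beta):
    """Assumed combination: coordinate loss + beta * shape loss."""
    shape_target = pca_approximation(truth, mean_shape, eigvecs)
    shape_loss = coordinate_loss(pred, shape_target)
    return coordinate_loss(pred, truth) + beta * shape_loss

# Toy example with n = 2 landmarks and a single eigenvector (t = 1).
mean_shape = [0.0, 0.0, 0.0, 0.0]
eigvecs = [[1.0, 0.0, 0.0, 0.0]]
truth = [2.0, 1.0, 0.0, 0.0]
pred = [2.0, 1.0, 0.0, 0.0]
print(multitask_loss(pred, truth, mean_shape, eigvecs, beta=0.5))
```

In the toy example the prediction matches the ground truth exactly, yet the loss is non-zero while β > 0 because the single eigenvector cannot represent the full shape; once β reaches zero only the coordinate term remains.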

Datasets
Our proposed SHELF method is evaluated on two well-known facial landmark datasets, 300W and WFLW. We also conducted experiments on our private dataset.
300W. The 300W dataset consists of 3837 facial images in total, annotated with 68 landmarks. The training set includes 3148 images, of which 2000 are from HELEN [25], 811 from LFPW [26] and 337 from AFW [27]. The full testing set is composed of 689 images, divided into a common set of 554 images from HELEN and LFPW and a challenging set of 135 images.
WFLW. The WFLW dataset [22] includes 10000 facial images annotated with 98 landmarks. Three fourths of the dataset are used for training and the rest for testing. The testing set is composed of six subsets with different difficulties, including 314 images for expression, 326 for large pose, 206 for make-up, 736 for occlusion, 698 for illumination and 773 for blurring.
300VW-E. Our private dataset, called 300VW-E, includes 31757 facial images extracted from videos in the 300VW dataset as well as from our driver-monitoring camera in the real world. These images are annotated with only 20 landmarks, located mostly on the eyes of the human face, as depicted in Fig. 4. This aims to clearly highlight facial states such as sleepy, tired, scared or distracted for driver monitoring systems (DMS). Nearly 80% of these images are used for training, about 15% for validation and the rest for testing.

Evaluation Metrics
As commonly used for benchmarking facial landmark detection methods, we adopt the normalized mean error (NME) to evaluate our proposed method SHELF as follows:
$$\mathrm{NME} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{n} \sum_{j=1}^{n} \frac{\sqrt{(x^i_j - \hat{x}^i_j)^2 + (y^i_j - \hat{y}^i_j)^2}}{d} \times 100\%$$
where n is the number of landmarks, N is the number of images in the testing set, $(x^i_j, y^i_j)$ and $(\hat{x}^i_j, \hat{y}^i_j)$ are the ground truth and predicted coordinates of the j-th landmark on the i-th facial image of the testing set, and d is the distance between the two outer eye corners (inter-ocular), specific to each dataset. This is also the normalization factor used in the 300W and WFLW datasets.
The failure rate (FR) is also used to evaluate the robustness of the methods in terms of NME. It indicates the rate of failed detections, that is, images on which the NME is greater than 10%. The smaller the FR, the more robust the model.
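The two metrics can be computed as in the following minimal sketch. It assumes that the normalization distance is supplied per image (for example the inter-ocular distance of that face) and that a failure is an image whose normalized error exceeds 10%; the function names and the data layout are hypothetical.

```python
import math

def image_nme(pred, truth, d):
    """Mean point-to-point error of one image, normalized by the distance d.

    pred and truth are lists of (x, y) tuples of equal length.
    """
    n = len(truth)
    err = sum(math.dist(p, t) for p, t in zip(pred, truth))
    return err / (n * d)

def nme_and_failure_rate(preds, truths, norms, threshold=0.10):
    """Return (NME in %, FR in %) over a test set."""
    per_image = [image_nme(p, t, d) for p, t, d in zip(preds, truths, norms)]
    nme = sum(per_image) / len(per_image)
    # An image counts as a failure when its normalized error exceeds the threshold.
    fr = sum(e > threshold for e in per_image) / len(per_image)
    return 100.0 * nme, 100.0 * fr

# Two test images with two landmarks each; the second prediction is off by 5 pixels.
truths = [[(0, 0), (10, 0)], [(0, 0), (10, 0)]]
preds = [[(0, 0), (10, 0)], [(3, 4), (10, 0)]]
norms = [10.0, 10.0]
print(nme_and_failure_rate(preds, truths, norms))
```

Reporting FR alongside NME is useful because a low average error can hide a small number of images on which the detector fails badly.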

Model Training
The input images are all resized to 224x224 before training. SHELF uses ResNet-50 as its backbone for better heatmap featuring and is implemented in PyTorch. The model is trained for 50 epochs using the Adam optimizer with a learning rate of 10e-5, a decay of 10e-5 and a batch size of 64 on a K80 GPU of Google Colaboratory.

Results
In this section, we present the experimental results of SHELF on the 300VW-E, 300W and WFLW datasets.

Facial landmark detection with SHELF.
After training, the model can be used to map a given facial image to its landmarks through their corresponding heatmaps, as visualized in Fig. 5. These visualizations illustrate the explainability of SHELF compared with other existing methods.
Evaluation results on 300VW-E dataset. SHELF is first evaluated on 300VW-E and achieved an NME of 12.55%.
In fact, the dataset contains a high number of expressive facial images, which makes the landmarks highly biased. However, as shown in Table 1, SHELF performs better than other coordinate regression and shape fitting methods such as SAN, CPM and ASMNet, with NME of 13.05%, 15.58% and 18.47%, respectively. The combination of heatmap regression and shape fitting makes SHELF more tolerant to such biases.

Table 1. NME(%) of SHELF and other comparative methods on the 300VW-E dataset
Method         Category                                NME
ASMNet [28]    Coordinate Regression, Shape Fitting    18.47
CPM [29]       Coordinate Regression                   15.58
SAN [12]       Coordinate Regression                   13.05
SHELF (ours)   Heatmap Regression, Shape Fitting       12.55

Evaluation results on 300W dataset. The results of SHELF on the 300W dataset are given in Table 2. Our model SHELF achieved an NME of 3.79%, 6.67% and 4.35% on the Common, Challenging and Full subsets of 300W, respectively. These results outperform most of the recent coordinate regression, heatmap regression and shape fitting methods such as DeFA, MobileFAN, PCD-CNN, CPM and ASMNet, especially on the Challenging subset. SHELF is a bit less accurate than the state-of-the-art AnchorFace, but it runs at 43 frames per second (FPS) on an NVIDIA Tesla K80 GPU, close to the 45 FPS that AnchorFace reaches on a much more powerful NVIDIA GTX Titan Xp GPU. These results show the efficiency of combining heatmap regression and shape fitting in our SHELF method.
Evaluation results on WFLW dataset. SHELF is also evaluated on the WFLW dataset using both the NME and FR metrics, as in Table 3.
Ablation Study. Given the efficacy of combining heatmap regression and shape fitting through the variation of the β coefficient in the loss function of SHELF, we go a step further to explore how relevant this coefficient is on a given dataset. We conducted an experiment with SHELF on the 300VW-E dataset using different variation patterns of β, namely continuous, constant and stepped, as demonstrated in Fig. 6. The experimental results in Table 4 show that SHELF achieved the best NME of 12.55% with the stepped variation of β, and exhibited a poor NME of 24.88% and 24.89% with the constant and continuous ones. Notice that, in the stepped pattern, the value of β is set to zero at a given training epoch. This confirms that the heatmap regression network learns the facial shapes only during the very first epochs of training.
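For illustration only, the three variation patterns of β compared in the ablation could look like the sketch below. The exact curves of Fig. 6 are not given in the text, so the function names, the initial value of 1.0, the linear decay and the stage levels are all hypothetical choices; only the general behaviour (constant, continuously decreasing, or stepped down to zero) follows the description.

```python
def beta_constant(epoch, total_epochs, beta0=1.0):
    """Constant pattern: the shape fitting rate never changes."""
    return beta0

def beta_continuous(epoch, total_epochs, beta0=1.0):
    """Continuous pattern (assumed linear): beta0 at epoch 0, zero at the last epoch."""
    return beta0 * (1.0 - epoch / max(total_epochs - 1, 1))

def beta_stepped(epoch, total_epochs, levels=(1.0, 0.5, 0.0)):
    """Stepped pattern: equal-length stages with piecewise-constant levels.

    The last level is zero, so the shape term is switched off
    from a given epoch onward.
    """
    stage = min(epoch * len(levels) // total_epochs, len(levels) - 1)
    return levels[stage]

# Print the three schedules at a few epochs of a 50-epoch training run.
for e in (0, 20, 40, 49):
    print(e, beta_constant(e, 50), round(beta_continuous(e, 50), 3), beta_stepped(e, 50))
```

The stepped schedule is the only one of the three in which β is exactly zero over a whole range of late epochs, which matches the observation that the shape term matters mainly at the beginning of training.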

Discussion
Facial landmark detection has been an active research topic for many years because landmarks can be used to recognize human facial emotions more efficiently than the whole face. However, most recent methods focus more on the feature engineering of individual facial landmarks and less on their distribution, that is, the shape of the face. Although the power of deep learning backbone networks has been thoroughly leveraged, the performance of such coordinate and heatmap regression methods remains limited. ASMNet was the first to take shape fitting into account in its coordinate regression and gained initial positive results. However, the coordinate regression approach extracts features at the cell level, while heatmap regression targets the pixel level of the image, which is closer to the facial landmarks in this case. Our proposed method SHELF, a combination of heatmap regression and shape fitting, achieved a much better performance and

Conclusion
As discussed, facial landmark detection is necessary for the recognition of human emotion, which can be applied in advanced driver assistance systems. This task is hard due to the variation of facial appearance, shape and pose, and the dispersion of a high number of landmarks over the human face. Efficient methods such as ASMNet and AnchorFace all take facial shapes and poses into account. However, these coordinate regression methods extract features at the cell level, which is less accurate than the pixel level used in heatmap regression. In this paper, we proposed a novel facial landmark detection method called SHELF, which is the first combination of heatmap regression and shape fitting. The evaluation on the 300W and WFLW datasets and on our private dataset, an extension of 300VW, showed that SHELF outperforms many existing methods including SAN and ASMNet. SHELF does not reach the accuracy of AnchorFace because it uses fewer anchor shapes. These results showed that such a combination is reasonable and the SHELF can also be

Figure 1. A facial image with 68 landmarks

Figure 2. Detection of facial landmarks on an image: the left one is of a bus driver. The middle one denotes his angry face associated with 68 landmarks. The right one shows the resulting locations of the landmarks on the angry face.

EAI Endorsed Transactions on Industrial Networks and Intelligent Systems | Volume 10 | Issue 3

Figure 4. A facial image and its corresponding 20 landmarks in the 300VW-E dataset: a) a facial image, b) 20 landmarks.

Figure 5. Generation of heatmaps corresponding to the 20 facial landmarks on a given image of the 300VW-E dataset.

Table 2. NME(%) of SHELF and other comparative methods on the 300W dataset

Table 4. NME(%) of SHELF with different variation patterns of β on the 300VW-E dataset