1. Introduction
Facial landmark detection, also known as face alignment, aims to localize a set of predefined landmarks on a given face. It is an essential step in many face analysis tasks, e.g., face verification [1,2,3], expression recognition [4,5,6], face editing [7,8] and face recognition [9,10].
In recent years, convolutional neural networks (CNNs) have driven rapid progress in facial landmark detection. However, detection on unconstrained faces still lacks robustness to occlusion, illumination changes and large pose variations.
To achieve robust facial landmark detection, some works [11,12,13] impose a face shape constraint over all landmarks to resist occlusion. For example, LAB [11] imposes the shape constraint by estimating facial boundary information with an additional stacked hourglass network. However, boundary estimation significantly increases the computational cost. Other methods, such as MDM [12], learn shape-indexed features from local patches cropped around a mean shape to predict all landmarks, so the shape constraint is encoded in the regressor.
Figure 1 shows the local patches used to learn shape-indexed features in existing methods.
Figure 1a,b illustrates the problems with two initialization strategies under a large pose: the initial landmarks lie extremely far from the ground-truth landmarks. In addition, shape-indexed features only provide a coarse shape constraint and are vulnerable to occlusion because local patches lack facial context.
This paper proposes a sparse-to-dense network (STDN) to reduce the noisy patch data caused by large pose variations and to handle occlusion in facial landmark detection. The process is functionally divided into two stages: a patch resampling stage and a relation reasoning stage. In the patch resampling stage, STDN adopts the sampling method shown in Figure 1c. First, STDN downsamples the mean shape into sparse landmarks and crops large-sized local patches around them. This allows a lightweight network to predict a set of offset values from these large-sized patches. Then, according to these offsets, the mean shape is adjusted into a reinitialized shape. In the relation reasoning stage, the input consists of small-sized local patches cropped around the reinitialized shape. The features learned from these small-sized patches are used to predict the whole face shape. A group-relational module exploits the geometric relations between facial components: it first disentangles the nose feature from all features and uses it to constrain the other facial components according to their geometric relations, while all features jointly impose the global shape constraint. The main contributions of this work are summarized as follows:
We propose a sparse-to-dense network (STDN), a two-stage framework, to reduce the noisy patch data caused by large pose variations and to address severe occlusion;
We suggest a sparse-to-dense patch sampling strategy that efficiently improves the quality of the local patches cropped under large pose variations;
We exploit a group-relational module to handle severe occlusion, which learns the geometric relations between facial components to strengthen the shape constraint.
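The patch resampling stage can be sketched in a few lines. Note that `offset_net`, the downsampling stride, the patch size, and the averaging of per-patch offsets into a single shape correction are illustrative assumptions for this sketch, not the exact STDN design:

```python
import numpy as np

def resample_shape(mean_shape, image, offset_net, stride=5, patch_size=64):
    """Patch resampling stage: reinitialize the mean shape from sparse patches.

    mean_shape: (N, 2) array of landmark (x, y) coordinates.
    offset_net: callable mapping a stack of patches to per-patch (dx, dy)
                offsets -- stands in for the lightweight network.
    """
    # Downsample the dense mean shape into sparse landmarks.
    sparse = mean_shape[::stride]

    # Crop large-sized local patches around each sparse landmark.
    half = patch_size // 2
    patches = np.stack([
        image[int(y) - half:int(y) + half, int(x) - half:int(x) + half]
        for x, y in sparse
    ])

    # Predict one offset per sparse patch; here they are averaged into a
    # global correction applied to every landmark (an assumed aggregation).
    offsets = offset_net(patches)
    return mean_shape + offsets.mean(axis=0)
```

The reinitialized shape returned here would then seed the relation reasoning stage, which crops small-sized patches around it.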
2. Related Work
Facial landmark detection methods fall into three main categories, i.e., classic methods, coordinate regression methods and heatmap regression methods. Although these methods have achieved great success, severe occlusion and large pose variations remain challenging.
Classic methods, such as ASM [14] and AAM [15], are based on statistical shape models. They use principal component analysis (PCA) to model appearance and shape, updating a coefficient vector to minimize the difference between the shape-based appearance and the input image. However, these methods rely solely on appearance features, so their performance degrades severely under occlusion and large pose variations.
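A minimal sketch of the statistical shape model behind ASM/AAM, using synthetic shapes: a face shape is approximated as the mean shape plus a linear combination of PCA modes, and fitting amounts to updating the coefficient vector b. The shape count, landmark count, and number of retained modes are arbitrary here:

```python
import numpy as np

rng = np.random.default_rng(0)
shapes = rng.normal(size=(100, 2 * 68))   # 100 synthetic 68-point shapes

mean_shape = shapes.mean(axis=0)
centered = shapes - mean_shape

# PCA via SVD: rows of vt are the principal shape variation modes.
_, s, vt = np.linalg.svd(centered, full_matrices=False)
P = vt[:10].T                             # keep the 10 leading modes

def reconstruct(shape):
    """Project a shape onto the model and rebuild it: x ~ mean + P @ b."""
    b = P.T @ (shape - mean_shape)        # coefficient vector
    return mean_shape + P @ b

approx = reconstruct(shapes[0])           # low-dimensional approximation
```

Any shape the model can express lies in the span of the retained modes around the mean, which is exactly why pure appearance/shape models break down when occlusion pushes the observed evidence off that subspace.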
Coordinate regression methods directly predict the coordinates of landmarks from the input image using regression models, without relying on appearance models. These methods [12,16,17,18,19,20,21,22] typically update the shape iteratively in a coarse-to-fine manner. DR [18] used a global layer to estimate the initial shape and then multiple local layers to update it iteratively. Park et al. [20] pretrained a feature extraction network to learn local feature descriptors from global facial features, which led to higher face alignment accuracy. TR-DRN [21] designed a two-stage network to address the initialization issue: the global stage makes a rough prediction from the full face region, and the local stage refines the landmarks of different facial parts. DAC-CSR [22] separated the face into multiple domains to train domain-specific cascaded shape regressions (CSRs), then used a dynamic attention-controlled method to select the appropriate subdomain CSR for landmark refinement. Coordinate regression methods are fast and more robust than classic methods under mild occlusion, but they are not robust enough to handle severe occlusion.
Some regression methods [23,24,25] also learn regression models based on shape-indexed features, first proposed in ESR [23], which used the mean shape as the initial shape and gradually updated the landmarks by predicting offsets from local features extracted around the current shape. Wu et al. [24] argued that different face shapes should have different regression functions; their model therefore automatically adapts the regression parameters to the current face shape to better approximate the ground-truth shape.
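The ESR-style cascade can be sketched as follows. `regressors` stands in for the learned stage regressors, and raw-pixel patches stand in for the actual shape-indexed features; both are assumptions for illustration:

```python
import numpy as np

def shape_indexed_features(image, shape, half=8):
    """Concatenate pixels from small patches around the current landmarks."""
    feats = [image[int(y) - half:int(y) + half,
                   int(x) - half:int(x) + half].ravel()
             for x, y in shape]
    return np.concatenate(feats)

def cascaded_regression(image, mean_shape, regressors):
    """ESR-style cascade: start at the mean shape, add predicted offsets."""
    shape = mean_shape.copy()
    for reg in regressors:            # each stage refines the previous shape
        feats = shape_indexed_features(image, shape)
        shape = shape + reg(feats).reshape(-1, 2)
    return shape
```

Because features are re-indexed at the updated shape at every stage, the shape constraint is implicitly baked into the sequence of regressors rather than stated explicitly.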
Heatmap regression methods [11,26,27,28,29,30,31,32] generate one heatmap per landmark as a Gaussian distribution over its channel; the point with the highest response on the predicted heatmap is taken as the prediction. DU-Net [27] used a quantized densely connected U-Net for effective facial landmark localization, with K-order dense connections achieving better detection accuracy with fewer parameters. AWing [29] designed a heatmap regression loss that penalizes foreground pixels more heavily than background pixels. ADC [31] combined global and local feature information for facial landmark detection without sacrificing image resolution and quality. Heatmap regression methods achieve good performance, but they require deep networks with many parameters, resulting in heavy computation and slow detection.
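The heatmap encoding/decoding step common to these methods can be sketched as follows; the Gaussian sigma and the plain per-channel argmax decoder are illustrative choices (real systems often add sub-pixel refinement):

```python
import numpy as np

def make_heatmap(h, w, cx, cy, sigma=2.0):
    """Render the Gaussian target heatmap for one landmark at (cx, cy)."""
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))

def decode_heatmaps(heatmaps):
    """Take the argmax of each per-landmark channel as the prediction.

    heatmaps: (N, H, W) array, one channel per landmark.
    Returns an (N, 2) array of (x, y) coordinates.
    """
    n, h, w = heatmaps.shape
    flat = heatmaps.reshape(n, -1).argmax(axis=1)
    ys, xs = np.unravel_index(flat, (h, w))
    return np.stack([xs, ys], axis=1)
```

Keeping one full-resolution channel per landmark is what makes these methods accurate but parameter- and compute-heavy, as the paragraph above notes.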
In recent years, with growing attention to severe occlusion and large pose variations, an increasing number of works [16,33,34,35,36,37,38,39,40,41] have aimed to overcome these obstacles in facial landmark detection. RCPR [16] detected occluded areas while estimating the landmarks and used the occlusion proportion of each area to weight the regressor. PCD-CNN [33] took the detected 3D face pose as the initial condition to detect landmarks under large pose variations. ODN [34] achieved robustness to occlusion by applying adaptive weights to facial regions and restored low-rank features of occluded regions by exploiting the geometric structure of the face. LUVLI [35] used a stacked hourglass network to jointly estimate landmark locations, the uncertainties of those locations, and landmark visibility. CCDN [36] proposed a cross-order cross-semantic deep network that activates multiple related facial parts, exploring more discriminative and fine-grained semantic features to handle partial occlusion and large pose variations. MTAAE [37] proposed a multi-task adversarial autoencoder network based on multi-task learning, which learns a more representative facial appearance and improves face alignment performance in the wild. SAAT [38] proposed a sample-adaptive adversarial training approach in which an attacker generates adversarial perturbations that expose the weaknesses of the detector, and the detector improves its robustness by defending against these attacks. DSCN [39] proposed a dual-attentional spatial-aware capsule network that better captures the spatial relations between landmarks by using a capsule network that remembers the location information of entities. MSM [40] used spatial transformer networks, hourglass networks and an exemplar-based shape constraint to detect landmarks under unconstrained conditions. Fard et al. [41] designed two teacher networks, a Tolerant-Teacher and a Tough-Teacher, to guide a lightweight student network: the Tolerant-Teacher was trained on soft landmarks created by active shape models, while the Tough-Teacher was trained on the ground-truth landmarks. They also designed an assistive loss that decides whether each teacher-predicted landmark acts as a positive or negative auxiliary.