In this section, we first introduce the datasets, evaluation metrics, and training settings, then describe the performance of the baseline model, and finally compare and analyze the experimental results of the proposed methods.
4.3. Experimental Results
We constructed the face recognition model with the architecture shown in Table 1 and used it as the baseline. We used the ArcFace loss [14] to supervise the training process, with the scale factor set to 64 and the angular margin set to 0.5, and trained the network from scratch. To monitor the training of the model, the k-fold cross-validation result on LFW [2] was calculated during training. As shown in Figure 6, the red and blue curves represent how the loss and accuracy of the baseline changed during the training stage. The loss experiences three large drops at the milestones where the learning rate changes, and the decline elsewhere is relatively flat. When gradient backpropagation reduces the loss only slowly, training has likely encountered a local minimum, so we set three milestones and divided the learning rate by 10 at epochs 8, 12, and 14. Because the batch size was set to 256 and the training set contains 5,822,653 face images, one training epoch takes about 22,745 steps. Therefore, the learning rate changed at approximately 180,000, 280,000, and 320,000 steps, producing the sudden drops in loss. In addition, the accuracy steadily approached 1 as training progressed.
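The ArcFace supervision described above can be illustrated with a minimal numerical sketch (numpy; the per-sample cross-entropy helper and the example cosine values are illustrative, not from the paper):

```python
import numpy as np

def arcface_logits(cos_theta, label, s=64.0, m=0.5):
    """Apply the additive angular margin m to the target-class logit.

    cos_theta: cosines between the normalized embedding and the
    normalized class-weight vectors; s is the scale factor.
    """
    theta = np.arccos(np.clip(cos_theta, -1.0, 1.0))
    logits = cos_theta.astype(float).copy()
    logits[label] = np.cos(theta[label] + m)  # penalize the target angle
    return s * logits

def cross_entropy(logits, label):
    z = logits - logits.max()                 # numerically stable softmax
    p = np.exp(z) / np.exp(z).sum()
    return -np.log(p[label])

cos_theta = np.array([0.8, 0.3, -0.1])        # illustrative cosines
loss = cross_entropy(arcface_logits(cos_theta, label=0), label=0)
```

Because the margin shrinks the target-class logit, the same sample yields a larger loss than plain softmax cross-entropy, which forces tighter angular clustering of each identity.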
Figure 7 shows that the best verification threshold of the baseline on LFW changed continually during training. The average of these thresholds was about 1.485, with a standard deviation of about 0.023. The accuracy of the baseline on each test set is shown in Table 3. The baseline occupies 4.8 MB of memory and reaches 99.47% accuracy on LFW.
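The role of the best threshold can be illustrated with a small verification sketch. This is assumption-laden: we take L2 distance between L2-normalized embeddings and reuse the average threshold of 1.485 reported above; the paper does not state the exact distance metric used.

```python
import numpy as np

def same_identity(emb_a, emb_b, threshold=1.485):
    """Declare a face pair 'same person' when the distance between the
    normalized embeddings falls below the threshold (assumed L2 metric)."""
    a = emb_a / np.linalg.norm(emb_a)
    b = emb_b / np.linalg.norm(emb_b)
    return bool(np.linalg.norm(a - b) < threshold)
```

On unit vectors this distance lies in [0, 2], so a threshold near 1.485 separates nearby pairs from near-opposite ones.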
Figure 8 shows, from left to right, the Receiver Operating Characteristic (ROC) curves of the baseline on LFW, AgeDB-30, and CFP-FP. The performance is consistent with the accuracy results and is best on LFW.
Combining the improved structures based on the channel attention mechanism proposed in Section 3.1, i.e., the depthwise SE (DSE) module, the depthwise separable SE (DSSE) module, and the linear SE (LSE) module, we conducted the corresponding training and testing experiments.
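For reference, the squeeze-and-excitation operation shared by the DSE, DSSE, and LSE variants can be sketched as follows (a numpy illustration; the channel count, reduction ratio, and random weights are placeholders, and the real modules sit at the positions described in Section 3.1):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_block(x, w_reduce, w_expand):
    """Channel attention on a feature map x of shape (C, H, W).

    w_reduce: (C // r, C) and w_expand: (C, C // r) are the two FC
    layers of the excitation step (reduction ratio r).
    """
    squeeze = x.mean(axis=(1, 2))                 # global average pool
    hidden = np.maximum(w_reduce @ squeeze, 0.0)  # FC + ReLU
    weights = sigmoid(w_expand @ hidden)          # FC + sigmoid, in (0, 1)
    return x * weights[:, None, None]             # rescale each channel

rng = np.random.default_rng(0)
C, H, W, r = 8, 4, 4, 2
x = rng.standard_normal((C, H, W))
y = se_block(x, rng.standard_normal((C // r, C)), rng.standard_normal((C, C // r)))
```

The three variants differ only in where this reweighting is inserted: after the depthwise convolutions (DSE), after the depthwise separable blocks (DSSE), or before the final linear layer (LSE).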
Figure 9 and Figure 10 show the change in loss and accuracy during training after utilizing the different modules in the architecture. In the figures, the blue curve represents the baseline; the orange, the model with LSE; the green, the model with DSSE; the red, the model with DSE; the purple, the model with DSSE and LSE; and the brown, the model with DSE and LSE. Figure 9 shows that the overall trend of the training loss with the different modules is consistent with the baseline. Magnifying the curves at the end shows that, after utilizing the SE blocks, the training loss is lower than that of the baseline. Moreover, the model with LSE has the smallest drop in training loss, whereas the model with DSSE and LSE has the largest drop.
Figure 10 shows that the accuracy of each model on LFW is higher than that of the baseline, and all are above 99%.
The test results of the models with the different SE modules on each test set are shown in Table 4. According to the average accuracy, the model with DSSE and LSE has the best overall recognition performance, while the model with LSE has the worst, below even the baseline; the other models achieve the best results on some individual test sets. Because LSE is a linear SE module, the channel attention mechanism acts only on the last linear layer of the model, so its effect on the entire network is minimal and LSE alone can hardly improve the model. When LSE is combined with DSSE, however, the model achieves the best average accuracy because the channel attention enhancement is placed after the 1 × 1 convolutions. Compared with the model that only uses DSSE, its feature extraction ability is further improved: the feature maps are extracted and integrated through depthwise separable convolutions or linear 1 × 1 convolutions, and the resulting features carry deeper semantic information, which is more conducive to extracting facial features. In addition, the STD column lists the standard deviation of each model's accuracy over the different test sets; compared to the baseline, the STD values of our proposed models are smaller. As introduced in Section 4.1, the different test sets contain face pairs with different attributes, so the standard deviation reflects the generalization ability and robustness of the model across these test sets; the smaller the value, the better.
On the basis of the above experiments, we conducted experiments with the teacher–student training pattern proposed in Section 3.2. In this pattern, when the weight parameter is set to 0.5, the losses of the soft and hard targets are at the same level, and the training effect is best.
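The weighting of soft and hard targets can be sketched as follows. The temperature-softened cross-entropy below is the standard knowledge-distillation form and is our assumption here; see Section 3.2 for the paper's exact formulation.

```python
import numpy as np

def softmax(z, T=1.0):
    z = np.asarray(z, dtype=float) / T
    z = z - z.max()                      # numerical stability
    e = np.exp(z)
    return e / e.sum()

def combined_loss(student_logits, teacher_logits, hard_loss, alpha=0.5, T=4.0):
    """alpha weights the hard-target loss against the soft-target loss.

    Soft target: cross-entropy between the teacher's and student's
    temperature-softened distributions (T is illustrative).
    """
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    soft_loss = -np.sum(p_teacher * np.log(p_student + 1e-12))
    return alpha * hard_loss + (1.0 - alpha) * soft_loss
```

With alpha = 0.5 the two terms contribute equally, matching the setting above.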
Figure 11 and Figure 12 show, respectively, the loss and accuracy curves of the models with the different SE modules during training in the teacher–student pattern. In the figures, the blue curve represents the baseline, the orange curve represents the baseline in the proposed training pattern, and the other curves represent the models with the different modules in the proposed training pattern. As Figure 11 shows, the training loss declines significantly more in the teacher–student pattern than for the baseline, indicating that the proposed pattern introduces supervision information from the teacher network and speeds up the convergence of the student network. Zooming in on the curves at the end also shows that the losses of the models in the teacher–student pattern are lower than that of the baseline: the baseline trained in the pattern has the smallest loss drop, the model with LSE the second smallest, and the model with DSE the largest. Figure 12 shows that the accuracy of each model on LFW is higher than that of the baseline, and all are above 99%. Moreover, the model with DSE and LSE in the teacher–student training pattern achieves the highest accuracy on LFW.
The test results of the models with the different SE modules in the teacher–student training pattern on each test set are shown in Table 5. According to the average accuracy, the model with DSE has the best overall recognition performance in this training pattern and the model with LSE the worst, while the other models achieve the best results on some individual test sets. The accuracy of each model trained in the teacher–student pattern is improved over that of the baseline, so the proposed training pattern effectively distills the knowledge of the teacher network into the student network and improves the student's feature extraction ability. We used SE-ResNet50-IR [14], introduced in Section 3.2, as the teacher network; its SE blocks are located at the end of each stacked unit and act after the 3 × 3 convolutions. The model with DSE in the teacher–student pattern places its SE blocks after the depthwise convolutions, which are composed of 3 × 3 filters. Its architecture is therefore the most consistent with that of the teacher model among the proposed models, so it absorbs the teacher network's feature extraction ability to the greatest extent: it not only inherits the features of the similar structure but also learns features extracted by the other structures. In addition, the STD column again lists the standard deviation of each model's accuracy over the different test sets; as with Table 4, the STD values of our proposed models are smaller than that of the baseline, reflecting better generalization ability and robustness across test sets with different attributes (Section 4.1).
Table 6 compares the performance indicators of the different models involved in this paper, including the model size, inference time, and the numbers of parameters and calculations. MACs are multiply–accumulate operations, each comprising one multiplication and one addition, and are used to measure the computational complexity of a model. The inference time is measured on the same GPU platform through the Event function of CUDA. To overcome the randomness of a single sample, we measure the inference time over 600 samples and then compute the mean and variance. The table shows that we improved the performance of the model while increasing the model size by at most 0.15 MB; the inference time increased by only about three milliseconds, and the computational complexity remained almost unchanged. Therefore, we achieved the research goal of keeping the model as lightweight as possible while maintaining recognition accuracy. Because the models based on the teacher–student training pattern differ from those obtained in the normal training pattern only in the training method, and their architectures are unchanged, their model size and numbers of parameters and calculations are the same as in the normal training pattern, and these results are not repeated here.
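The averaging procedure can be sketched as follows. This is a CPU analogue using `time.perf_counter`; the paper times on the GPU with CUDA Events, but the warm-up and 600-sample averaging follow the same idea.

```python
import time
import statistics

def time_inference(fn, n_warmup=10, n_samples=600):
    """Mean and standard deviation of per-call latency in milliseconds."""
    for _ in range(n_warmup):          # warm-up to exclude one-off start-up cost
        fn()
    times_ms = []
    for _ in range(n_samples):
        t0 = time.perf_counter()
        fn()
        times_ms.append((time.perf_counter() - t0) * 1000.0)
    return statistics.mean(times_ms), statistics.stdev(times_ms)
```

On the GPU, the two `perf_counter` calls would be replaced by a pair of `torch.cuda.Event` records with synchronization before reading the elapsed time, since kernel launches are asynchronous.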
Table 7 compares the performance of the proposed model with state-of-the-art (SOTA) face recognition models, including both complex and lightweight models. The proposed model is competitive in both model size and recognition accuracy.