4.3.1. Ablation Experiment
This section presents ablation experiments that assess the impact of each proposed module on the network and evaluate the overall performance of our solution. All experiments are conducted on the LLVIP dataset. We first evaluate the effect of the proposed Data Augmentation (DA) method on detection performance to confirm the effectiveness of this augmentation technique. We then examine the contributions of the Multispectral Feature Mutual Guidance (MFMG) module and the Dual-Branch Feature Fusion (DBFF) module by training on real-world images with and without DA and by enabling or disabling each module individually. The following subsections analyze the results of these experiments in detail, clarifying the significance of each module and its impact on the overall performance of our multispectral target detection network.
(1) Effectiveness of DA: To assess the impact of the proposed Data Augmentation (DA) method on detector training, we trained detectors both with and without DA, applying this comparison to the baseline detector as well as to our proposed detector. We first evaluated the influence of DA on the baseline detector. During implementation, we introduced a probability parameter that controls how often DA is applied; varying this parameter during training produces detection models trained under different DA probabilities, whose performance we then compared. The results are presented in Table 2. As shown in the table, we applied DA to the baseline detector with probability parameters of 0.1, 0.3, 0.5, 0.7, and 0.9. Testing on the LLVIP dataset shows that the proposed DA method improved the mean Average Precision (mAP) of the baseline detector by 0.2%, 0.9%, 0.8%, 0.6%, and 0.3%, respectively. The other evaluation metrics of the baseline detector also improved under DA, and this consistent improvement underscores the effectiveness of the proposed method. The baseline detector performed best when DA was applied with a probability of 0.3. The underlying principle of the DA method is to introduce noise while retaining the original data; when the noise becomes excessive and overshadows the original data, training results degrade. Consequently, we fixed the DA probability at 0.3 when testing the MFMG and DBFF modules.
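As an illustration, the probability parameter can be implemented as a simple stochastic gate around the augmentation transform. The sketch below is a minimal, hypothetical example: `augment` stands in for the paper's DA transform, whose internals are not specified in this section, and only the gating logic reflects the behavior described above.

```python
import random

def maybe_augment(sample, augment, p=0.3):
    """Apply the DA transform with probability p, otherwise pass through.

    `augment` is a placeholder for the paper's DA method (hypothetical
    here); p = 0.3 matches the best-performing setting in Table 2.
    """
    if random.random() < p:
        return augment(sample)
    return sample
```

With p = 0.3, roughly 30% of training samples pass through the DA transform, so most samples retain the original data while a controlled fraction carries the injected noise.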
(2) Effectiveness of the MFMG module: As shown in Figure 7, we integrate an MFMG module between the two backbone networks to exchange information during the feature extraction process. The module's bidirectional design enables seamless information interchange without altering the output dimensions of either backbone. Its primary function is to enhance the feature extraction capability of both backbones: by exchanging information between modalities, it promotes superior feature learning through the exploitation of associated complementary information. To validate the module, we conducted an ablation experiment in which the DA probability was fixed at 0.3 and the MFMG module was enabled or disabled on the baseline detector. As shown in Table 2, the MFMG module improved the baseline detector's mAP by 2.8%, providing empirical evidence of its effectiveness.
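One way to realize such a dimension-preserving, bidirectional exchange is residual cross-modality gating. The following PyTorch sketch is a hypothetical illustration of this idea, not the paper's exact MFMG design: each branch is re-weighted by a gate computed from the other modality, and the output shapes match the inputs so the block can be inserted between backbone stages without further changes.

```python
import torch.nn as nn

class MutualGuidance(nn.Module):
    """Illustrative bidirectional guidance block (not the paper's exact
    MFMG design): each modality's feature map is re-weighted by a gate
    computed from the other modality; output shapes match the inputs."""

    def __init__(self, channels):
        super().__init__()
        self.gate_from_ir = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.Sigmoid())
        self.gate_from_rgb = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.Sigmoid())

    def forward(self, f_rgb, f_ir):
        # Residual gating: each branch keeps its own features and adds a
        # component emphasized by the complementary modality.
        f_rgb_out = f_rgb + f_rgb * self.gate_from_ir(f_ir)
        f_ir_out = f_ir + f_ir * self.gate_from_rgb(f_rgb)
        return f_rgb_out, f_ir_out
```

Because the gating here is residual, neither branch's original features can be fully suppressed, which keeps both backbones trainable even when the complementary modality is uninformative; this is one plausible design choice, not necessarily the one used by MFMG.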
(3) Effectiveness of the DBFF module: In contrast to the MFMG module, the DBFF module fuses the information from the two modalities. It combines features from both modalities along the channel and spatial dimensions using attention mechanisms, and then performs a deep, data-driven fusion of the two attention-weighted features. Fusion along the channel dimension primarily merges different semantic layers, which benefits classification, whereas fusion along the spatial dimension emphasizes location information, which benefits localization. As in the MFMG ablation, we fixed the DA probability at 0.3 and tested the effectiveness of the DBFF module by enabling or disabling it on the baseline detector. As shown in Table 2, the module improved the mAP of the baseline detector by 2.3%, confirming its effectiveness.
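The sketch below gives a minimal example of channel-plus-spatial attention fusion followed by a learned merge, assuming CBAM-style attention over the concatenated modalities and a 1x1 convolution as the data-driven fusion step. The DBFF internals are not detailed in this section, so every layer choice here is illustrative.

```python
import torch
import torch.nn as nn

class DualBranchFusion(nn.Module):
    """Illustrative channel- and spatial-attention fusion (hypothetical
    stand-in for DBFF): channel attention emphasizes semantics, spatial
    attention emphasizes locations, and a 1x1 conv merges both results."""

    def __init__(self, channels):
        super().__init__()
        # Channel attention over the concatenated modalities (semantic emphasis).
        self.channel_att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(2 * channels, 2 * channels, 1),
            nn.Sigmoid(),
        )
        # Spatial attention from pooled channel statistics (location emphasis).
        self.spatial_att = nn.Sequential(nn.Conv2d(2, 1, 7, padding=3), nn.Sigmoid())
        # Data-driven deep fusion of the two attention-weighted feature maps.
        self.merge = nn.Conv2d(4 * channels, channels, 1)

    def forward(self, f_rgb, f_ir):
        x = torch.cat([f_rgb, f_ir], dim=1)
        x_channel = x * self.channel_att(x)
        stats = torch.cat([x.mean(dim=1, keepdim=True),
                           x.amax(dim=1, keepdim=True)], dim=1)
        x_spatial = x * self.spatial_att(stats)
        return self.merge(torch.cat([x_channel, x_spatial], dim=1))
```

Keeping the channel and spatial branches separate until the final merge lets the network learn how much weight to give semantic versus positional cues, consistent with the classification/localization split described above.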
In summary, the ablation experiments on the three designed components demonstrate that each component is effective on its own, while the results of their joint use confirm their compatibility. As indicated in Table 2, adding MFMG and DBFF together to the baseline detector improved mAP by 4.4%, and combining all three components improved the baseline detector's performance by 5.0%.
4.3.2. Comparison with State-of-the-Art Methods
To further validate the performance of the proposed method, this section compares it with current state-of-the-art methods on four datasets, demonstrating its advantages from both qualitative and quantitative perspectives. Specifically, we compare detection accuracy on three public datasets, VEDAI, M3FD, and LLVIP, and then compare inference efficiency on the FLIR dataset.
(1) Comparative experiments on the VEDAI dataset: In this comparison, we selected advanced single-modal and multi-modal detection methods to benchmark the proposed method. The single-modal methods included YOLOv5, YOLOv8, EfficientDet [41], and SSSDET [42], among others, while the multi-modal methods encompassed LRAF-Net, YOLO Fusion, CFT, and similar approaches. The detection results of these methods and of the proposed method on the VEDAI dataset are summarized in Table 3. The findings clearly demonstrate the superiority of the proposed method: compared with the best single-modal method, it improves mAP by 13.2%, and compared with the best-performing multi-modal method, it improves mAP by 0.3%.
The qualitative detection results are depicted in Figure 8, where missed detections are indicated by red arrows and false detections by green arrows. Compared with the baseline detector, the proposed method markedly reduces both missed and false detections. As evident in Figure 8a,d, the baseline detector fuses features through simple addition, leading to more missed and false detections, whereas the proposed method strengthens feature extraction and deepens feature fusion, substantially improving detection and classification accuracy. As seen in Figure 8b,c, the baseline detector can locate the target but struggles to classify it correctly; in contrast, the proposed method capitalizes on its robust feature learning and channel-level feature fusion to better discern small objects. Since this dataset features complex backgrounds and difficult small targets, the qualitative comparison reaffirms the strong detection capability of the proposed method.
(2) Comparative experiments on the M3FD dataset: As in the VEDAI comparison, we selected advanced single-modal and multi-modal target detection methods for comparison with the proposed method on the M3FD dataset. The quantitative results are presented in Table 4. As the table shows, the proposed method improves mAP by 2.7% over the best single-modal method and by 1.7% over the best multi-modal method.
In addition, the qualitative comparison results are shown in Figure 9. As shown in Figure 9a,b, in challenging low-light environments the baseline detector cannot detect small or blurred targets, whereas the proposed method detects them successfully thanks to its powerful representation learning and feature fusion capabilities. Furthermore, as shown in Figure 9c,d, both the baseline detector and the proposed method detect the objects effectively, but the classification performance of the baseline method still lags slightly behind that of the proposed method, that is, occasional misclassifications occur.
(3) Comparative experiments on the LLVIP dataset: The quantitative comparison between the proposed method and advanced target detection methods on the LLVIP dataset is presented in Table 5. As the table shows, the proposed method again excels in detection performance: it improves mAP by 4.6% over the best single-modal method and by 0.2% over the best multi-modal method. The qualitative comparison results are depicted in Figure 10. The detection scenes in LLVIP are exclusively nocturnal street scenes in which visible light information is limited, so treating visible and infrared information equally is unlikely to yield good detection results. As shown in Figure 10a–c, the baseline detector, with its independent feature extraction and simple feature addition, struggles to weight the information from the two modalities effectively, producing numerous false positives. In Figure 10d, the baseline detector fails to accurately identify overlapping and occluded targets, while the proposed method achieves precise detection by effectively fusing the complementary information of the two modalities.
In summary, unlike the baseline detector, the proposed method neither extracts features independently nor treats the two modalities equally; instead, it extracts features interactively and deeply fuses the features of both modalities. Consequently, it is well suited to a variety of challenging detection scenarios.
(4) Computational efficiency analysis: To compare the computational efficiency of the proposed method with that of other advanced object detection methods, we conducted inference efficiency experiments. For a fair comparison, we used the same computing platform (an NVIDIA GTX 1080 Ti GPU) and dataset (FLIR) as previous research. The network parameters, floating-point operations (FLOPs), and inference times of the proposed method and the comparison methods are presented in Table 6. As the table shows, the proposed method achieves real-time inference, with an inference time of 23.4 ms. Although its computational efficiency is slightly lower than that of several comparison methods, it maintains higher accuracy.
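For reference, the sketch below shows one common way to measure per-image inference latency and parameter count in PyTorch. The input shape, warm-up count, and single-input signature are illustrative assumptions (a two-stream detector such as ours would instead take an RGB/IR pair); FLOPs are typically obtained separately with an external profiler such as thop or fvcore.

```python
import time
import torch

def measure_inference(model, input_shape=(1, 3, 640, 640), runs=100, device="cuda"):
    """Average per-image latency (ms) and parameter count.

    The 640x640 input and the run counts are illustrative assumptions;
    adapt the input to the detector's actual signature.
    """
    model = model.eval().to(device)
    x = torch.randn(*input_shape, device=device)
    with torch.no_grad():
        for _ in range(10):                  # warm-up to stabilize GPU clocks
            model(x)
        torch.cuda.synchronize()             # GPU kernels run asynchronously
        start = time.perf_counter()
        for _ in range(runs):
            model(x)
        torch.cuda.synchronize()
        latency_ms = (time.perf_counter() - start) / runs * 1e3
    n_params = sum(p.numel() for p in model.parameters())
    return latency_ms, n_params
```

The explicit synchronization calls matter here: without them, the timer would measure only kernel launch overhead rather than the true per-image latency reported in comparisons like Table 6.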
In summary, our proposed method has demonstrated superior detection performance on three public datasets: VEDAI, M3FD, and LLVIP, while also showcasing its adaptability to different environmental conditions. Furthermore, inference experiments on the FLIR dataset have confirmed its real-time inference capabilities.