Article

Dynamic Knowledge Distillation with Noise Elimination for RGB-D Salient Object Detection

1 Department of Electrical and Electronic Engineering, Imperial College London, London SW7 2AZ, UK
2 School of Electronic Engineering, Beijing University of Posts and Telecommunications, Beijing 100876, China
* Author to whom correspondence should be addressed.
Submission received: 23 June 2022 / Revised: 11 August 2022 / Accepted: 15 August 2022 / Published: 18 August 2022
(This article belongs to the Section Intelligent Sensors)

Abstract

RGB-D salient object detection (SOD) demonstrates its superiority in detecting salient objects in complex environments thanks to the additional depth information in the data. Inevitably, an independent stream is introduced to extract features from depth images, leading to extra computation and parameters. This methodology sacrifices model size to improve detection accuracy, which may impede the practical application of SOD. To tackle this dilemma, we propose a dynamic knowledge distillation (DKD) method, along with a lightweight structure, which significantly reduces the computational burden while maintaining validity. This method considers the performance of both the teacher and the student within the training stage and dynamically assigns the distillation weight instead of applying a fixed weight to the student model. We also investigate the issue of the RGB-D early fusion strategy in distillation and propose a simple noise elimination method to mitigate the impact of distorted training data caused by low-quality depth maps. Extensive experiments are conducted on five public datasets to demonstrate that our method can achieve competitive performance with a fast inference speed (136 FPS) compared to 12 prior methods.

1. Introduction

Salient object detection (SOD) aims at locating the most prominent objects in a given scene. In recent years, SOD has attracted significant attention and substantial progress has been demonstrated in the field. The task can be treated as a pre-processing step for diverse downstream applications, such as image understanding [1], video detection and segmentation [2], semantic segmentation [3], object tracking [4], person re-identification [5] and others. However, in complicated real-life scenarios, RGB-based SOD still fails to generate satisfactory prediction maps. To overcome this issue and obtain better detection performance in complex scenarios, depth images, along with an independent network, have been introduced to provide supplementary information. Specifically, Figure 1 illustrates three fusion methods in RGB-D based SOD. Existing state-of-the-art methods mainly adopt late fusion or multi-scale fusion and focus on designing feature-enhancement modules and complicated feature-fusion modules, which indeed improve the overall detection results. However, because a high volume of information must be processed, the models tend to become extremely complicated, which weakens the practicality of SOD using RGB-D data. Beyond these regular strategies, recent approaches [6,7] propose joint learning frameworks and treat RGB-D based SOD as a multi-task learning problem. However, these frameworks employ extra network branches and supervision labels, which causes a problem analogous to that of the aforementioned frameworks. In contrast, the early fusion in Figure 1c integrates the separate inputs into a unified representation before feature extraction. It provides an alternative strategy to lighten the model but suffers from noise caused by low-quality depth information. This motivates us to explore the potential of early fusion from a novel perspective and compress the model size for SOD while maintaining high detection accuracy.
Recently, knowledge distillation (KD) has been proposed [8] to transfer knowledge from a large model to a smaller one. The main idea is that a small student model mimics a cumbersome model, the so-called teacher model, to achieve competitive performance. The cumbersome network has a larger knowledge capacity than smaller models, but this capacity may not be utilized to its full potential. In other words, a lightweight network can reach a performance similar to that of a cumbersome network through KD without increasing the number of parameters. Similar to human behaviour, this teacher–student learning process can be implemented in a simple and effective way, by forcing the student model to directly learn the final prediction of the teacher model.
KD has been applied in a range of machine learning applications. Zhang [9] applies KD to RGB saliency detection and proposes an efficient model by reducing the number of channels. Piao [10] explores cross-modal distillation on RGB-D data and uses an adaptive weight to distil the depth knowledge from the teacher model. Nevertheless, both tailor their student networks to specific teacher networks and distillation strategies. In addition, the adaptive distillation in [10] is designed for cross-modal distillation and only considers the performance of the teacher model, which limits the applicability of this KD method.
To tackle the above issues from a new perspective, we use a concise framework based on the early fusion strategy for RGB-D based SOD and propose a dynamic knowledge distillation (DKD) weight that helps the model pay more attention to hard samples by considering the performance of both the teacher and the student. We also investigate the issue of the RGB-D early fusion strategy in distillation and propose a simple noise elimination method to mitigate the impact of distorted training data caused by low-quality depth maps. Combining these two methods leads to a reasonable distillation strategy for RGB-D saliency detection. Our final model achieves a good balance between accuracy and model size on widely used benchmarks, as shown in Figure 2. In a nutshell, our main contributions can be summarised as follows:
  • We propose a novel dynamic distillation strategy, which adaptively assigns the distillation weight by simultaneously considering the detection performance of the teacher and student networks within the training stage. As a result, the final model pays more attention to hard samples and improves the overall performance.
  • We propose a noise elimination method that takes full advantage of the knowledge prior from the teacher network to alleviate the impact of low-quality depth maps. The student network benefits from this method without extra parameters or computation.
  • We adopt a single stream for RGB-D SOD in order to bypass the depth network and avoid designing a complicated model. This single stream achieves competitive performance using only VGG16 (57.9 MB) and VGG19 (78.2 MB) backbones, which is more applicable for practical use. Extensive experimental results on five benchmarks demonstrate that our methods achieve competitive performance with a fast, lightweight architecture.

2. Related Work

RGB-D Salient Object Detection. RGB-D based SOD has received increasing attention for handling detection tasks in complicated environments. Depth information was first introduced by [11], which models the distribution of depth-induced saliency using Gaussian mixture models. Zhao [12] proposes a feature-enhanced module and a contrast-enhanced net, which augments the contrast between the foreground and background through fluid pyramid integration. Pang [13] adopts multi-scale fusion and proposes a dynamic dilated pyramid module with adaptive receptive fields, generated by densely integrating cross-modal features. Chen [14] constructs a lightweight depth stream and designs a refinement network progressively stacked from guided residual blocks, which can alternately alleviate mutual degradation and refine predictions in a progressive way. Zhou [15] leverages a novel feature aggregation network, which utilizes K-nearest neighbour graph neural networks and a non-local module to mine geometric cues and global semantic features. Zhang [6] proposes a multi-stage cascaded learning framework and transforms the joint-entropy maximization problem in multi-modal learning into the minimization of mutual information, which explicitly models the complementary information between the RGB image and depth data. These previous works focus on alleviating the negative impact of depth maps and enhancing feature integration through delicate modules and networks.
Knowledge Distillation. Knowledge distillation was formally introduced by [8] in a teacher–student learning framework. This method provides an effective way to compress model size and attempts to imitate the human learning mechanism. Cheng [16] designs mathematical metrics to quantify and compare learning from the teacher model and learning from raw data, and explains the superiority of KD in three aspects. First, more reliable visual concepts can be learned through KD. Second, KD enables the model to learn various concepts simultaneously. Third, learning through KD generates more stable optimization directions in the training phase. Zheng [17] proposes a novel divide-and-conquer distillation strategy for dense object detection, transferring the semantic and localization knowledge separately and showing that the student benefits more from distilling the original logits than from feature imitation. Yang [18] explores the difference between the features of students and teachers and proposes a focal distillation to make the student focus on the teacher's critical pixels and channels, followed by a global distillation to help the student learn the relations between pixels. Xu [19] follows the human learning process and proposes a teacher–student collaborative KD, which combines teacher–student KD and student self-distillation to enhance performance; however, the student self-distillation model is built with multiple extra exit classifiers from deep to shallow layers. Recently, KD has been applied to SOD tasks. Zhang [9] designs the student model by reducing the number of channels and applies multi-scale KD on the corresponding scales between the teacher and student models. Piao [10] applies cross-modal distillation to RGB-D based SOD and proposes an adaptive distiller for the depth information, which alleviates the impact of low-quality depth maps. Different from the aforementioned methods, our method takes the performance of both the teacher and student models into consideration and generates a dynamic weight to control the regular teacher–student KD process. In addition, we analyse the depth issues in the specific RGB-D SOD task and optimize the training phase through a threshold.

3. Methodologies

3.1. Overview

Existing methodologies for RGB-D SOD tend to build two-stream networks in order to process RGB and depth features separately. This two-stream design can improve detection performance but introduces a large number of parameters, which increases the complexity and reduces the practicality of the models. The feature pyramid network (FPN) [20] is an effective structure that utilizes multi-scale features at different resolutions to achieve detection tasks. Figure 3 illustrates the overall framework: we do not focus on designing networks and only adopt a classic FPN based on VGG16 or VGG19 [21] as the student model. To obtain a stronger teacher model, we employ four receptive field blocks [22] in the multi-scale layers to boost detection performance. Among the different cross-modal fusion strategies, we choose simple early fusion, which directly concatenates RGB images and depth images to form four-channel inputs. Similar to standard KD, we transfer the probability distribution of the final layer from the teacher model to the student model using the proposed DKD.
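To make the structure concrete, the following PyTorch sketch shows a minimal early-fusion student of this kind: a VGG16 backbone whose first convolution is widened to accept a four-channel RGB-D input, followed by an FPN-style top-down decoder. The stage split, the 64-channel lateral convolutions and the single prediction head are illustrative assumptions rather than the exact configuration used in the paper.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16


class EarlyFusionFPN(nn.Module):
    """Minimal early-fusion student: VGG16 backbone + FPN-style decoder (a sketch)."""

    def __init__(self):
        super().__init__()
        backbone = vgg16(weights=None).features
        # Replace the first convolution so the network accepts RGB-D (4 channels).
        backbone[0] = nn.Conv2d(4, 64, kernel_size=3, padding=1)
        # Split VGG16 into stages at each max-pool to obtain multi-scale features.
        self.stage1 = backbone[:5]     # 64 channels
        self.stage2 = backbone[5:10]   # 128 channels
        self.stage3 = backbone[10:17]  # 256 channels
        self.stage4 = backbone[17:24]  # 512 channels
        self.lat = nn.ModuleList([nn.Conv2d(c, 64, 1) for c in (64, 128, 256, 512)])
        self.predict = nn.Conv2d(64, 1, 3, padding=1)

    def forward(self, rgb, depth):
        x = torch.cat([rgb, depth], dim=1)  # early fusion: four-channel input
        feats = []
        for stage in (self.stage1, self.stage2, self.stage3, self.stage4):
            x = stage(x)
            feats.append(x)
        # Top-down pathway: upsample and add lateral connections (FPN style).
        p = self.lat[-1](feats[-1])
        for f, lat in zip(reversed(feats[:-1]), reversed(list(self.lat)[:-1])):
            p = nn.functional.interpolate(p, size=f.shape[-2:], mode="bilinear",
                                          align_corners=False) + lat(f)
        out = self.predict(p)               # saliency logits
        return nn.functional.interpolate(out, size=rgb.shape[-2:], mode="bilinear",
                                         align_corners=False)


# Example: a 256 x 256 RGB image and its single-channel depth map.
model = EarlyFusionFPN()
logits = model(torch.randn(1, 3, 256, 256), torch.randn(1, 1, 256, 256))
```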

3.2. Dynamic Knowledge Distillation

As mentioned above, KD benefits the student model, but the weight of knowledge transfer is still hand-designed. Piao [10] proposes an adaptive weight for cross-modal distillation; however, in [10] only the depth information is distilled and only the performance of the teacher model is considered. In our method, we consider the performance of both the teacher and student networks and combine these two factors into a dynamic weight for KD.
Concretely, the accuracy of the teacher model represents its detection performance, which also indicates the confidence of its knowledge. Inspired by the IOU [23] used in SOD, we design a dynamic factor α_t to modulate the correct knowledge that can be transferred from the teacher model as follows:
\[ \alpha_t = \frac{P_t \cdot G}{P_t + G - P_t \cdot G} \]
where P_t and G represent the prediction of the teacher model and the ground truth, respectively. α_t indicates the confidence of the knowledge that can be transferred to the student model. Then, we propose another dynamic factor β_s to express the degree of knowledge desired by the student model as follows:
\[ \beta_s = 1 - \frac{P_s \cdot G}{P_s + G - P_s \cdot G} \]
where P_s represents the prediction of the student model. The dynamic factor β_s is the error rate on the current training sample. In other words, KD should also consider the current performance of the student model. β_s is inversely related to the accuracy between the output of the student model and the ground truth, which indicates that hard samples with large error rates need to learn more from the teacher model. Therefore, we propose a simple and effective formulation to find a plausible distillation weight θ_{t,s}:
\[ \theta_{t,s} = \tanh\left(\alpha_t^{\,p} \cdot \beta_s^{\,1-p}\right) \]
where tanh is treated as a scaling function:
\[ \tanh(x) = \frac{\exp(x) - \exp(-x)}{\exp(x) + \exp(-x)} \]
More specifically, we define θ_{t,s} as the weighted geometric mean of the knowledge confidence α_t from the teacher and the knowledge demand β_s from the student. The hyper-parameter p ∈ [0, 1] balances the contributions of the teacher and student networks. It is worth noting that a large variation of θ_{t,s} leads to convergence issues in the training phase; in this case, we further use the tanh function to scale θ_{t,s}. The overall loss function can be formulated as:
\[ L_{dynamic} = \theta_{t,s}\, L_{KL}(P_s, P_t) + (1 - \theta_{t,s})\, L_{CE}(P_s, G) \]
where L_{KL} is the Kullback–Leibler divergence loss and L_{CE} represents the cross-entropy loss. In the final network, we set the distillation temperature to 5 in L_{KL} and p = 0.7.
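For clarity, the snippet below sketches how the dynamic weight θ_{t,s} could be computed from soft IoU scores, assuming the prediction maps are already sigmoid-normalized to [0, 1]; the function names and the per-sample reduction are our own illustrative choices rather than the authors' implementation.

```python
import torch


def soft_iou(pred, gt, eps=1e-6):
    # Soft IoU between a predicted saliency map and the ground truth,
    # both of shape (B, 1, H, W) with values in [0, 1].
    inter = (pred * gt).sum(dim=(1, 2, 3))
    union = pred.sum(dim=(1, 2, 3)) + gt.sum(dim=(1, 2, 3)) - inter
    return inter / (union + eps)


def dynamic_weight(p_teacher, p_student, gt, p=0.7):
    # theta_{t,s} = tanh(alpha_t^p * beta_s^(1-p)), one weight per sample.
    alpha_t = soft_iou(p_teacher, gt)        # confidence of the teacher's knowledge
    beta_s = 1.0 - soft_iou(p_student, gt)   # error rate (knowledge demand) of the student
    return torch.tanh(alpha_t.pow(p) * beta_s.pow(1.0 - p))
```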

3.3. Noise Elimination with the DKD

As mentioned above, we simplify both the KD procedure and the student network architecture for the RGB-D task. Concretely, we only distill the final output distribution and abandon the depth stream by concatenating RGB and depth maps into a four-channel input. However, this fusion strategy suffers from noise caused by low-quality depth information. As illustrated in Figure 4, we identify three causes of depth-map distortion: (1) objects other than the salient object dominate the salient features in the depth image; (2) low contrast between the salient object and the background in depth; (3) depth distortion caused by the camera. Intuitively, detection accuracy is expected to drop drastically when the training data are distorted. Therefore, we propose treating such depth maps as noise when they are combined with the RGB maps, and we further set an accuracy threshold during KD to control the impact of this noise:
\[ \theta_{t,s} = \begin{cases} \tanh\left(\alpha_t^{\,p} \cdot \beta_s^{\,1-p}\right) & \alpha_t > threshold \\ \epsilon & \text{otherwise} \end{cases} \]
where ϵ indicates a small weight, set to 0.01 in this paper. α_t provides a knowledge prior from the teacher network and indicates whether depth distortion occurs. Here, the threshold is set to 0.5. Under this circumstance, the student model is able to recognize useless training data while receiving knowledge from the teacher model.
Compared to considering only one aspect or enforcing a fixed weight on the student model, our dynamic weight considers both the correctness of the teacher's knowledge and the error of the student network, which allows the student network to receive knowledge according to the difficulty of the samples. θ_{t,s} varies little at the start of the training phase. In the late stage of training, the student network is able to detect most simple scenarios except for some hard samples; therefore, θ_{t,s} automatically assigns relatively larger weights to hard samples that are detected accurately by the teacher network but not by the student network. The noise elimination method takes full advantage of the knowledge prior from the teacher network and effectively reduces the negative impact of low-quality depth maps. Extensive experiments in Section 4 demonstrate that this DKD boosts detection performance without extra parameters or model size. The overall procedure is summarised in Algorithm 1.
Algorithm 1 DKD
Require: P_t is the prediction of the teacher network,
    P_s is the prediction of the student network,
    G is the corresponding ground truth.
 1: Stage 1: Training the teacher network;
 2:   loss = L_CE(P_t, G)
 3: Stage 2: Training the student network;
 4:   α_t = IOU(P_t, G);
 5:   β_s = 1 − IOU(P_s, G);
 6:   if α_t > Threshold then
 7:     θ_{t,s} = tanh(α_t^p · β_s^(1−p))
 8:   else
 9:     θ_{t,s} = 0.01
10:   end if
11:   loss = θ_{t,s} · L_KL(P_s, P_t) + (1 − θ_{t,s}) · L_CE(P_s, G);
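As a rough PyTorch sketch of Stage 2 of Algorithm 1 (reusing soft_iou and dynamic_weight from the snippet in Section 3.2), the loss below combines a tempered KL distillation term and the cross-entropy term with the thresholded dynamic weight. The way the temperature is applied to the sigmoid outputs and the pixel-wise two-class KL formulation are our assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F


def dkd_loss(logits_s, logits_t, gt, p=0.7, threshold=0.5, eps_weight=0.01, T=5.0):
    # Stage 2 of Algorithm 1: dynamic weight with noise elimination (a sketch).
    prob_t = torch.sigmoid(logits_t).detach()
    prob_s = torch.sigmoid(logits_s)

    alpha_t = soft_iou(prob_t, gt)                           # teacher confidence (IoU)
    theta = dynamic_weight(prob_t, prob_s.detach(), gt, p)   # tanh(alpha^p * beta^(1-p))
    # Noise elimination: samples with low teacher IoU get only a tiny distillation weight.
    theta = torch.where(alpha_t > threshold, theta, torch.full_like(theta, eps_weight))

    # Pixel-wise KL divergence between tempered foreground/background distributions.
    soft_t = torch.sigmoid(logits_t.detach() / T)
    soft_s = torch.sigmoid(logits_s / T)
    dist_t = torch.stack([soft_t, 1 - soft_t], dim=-1)
    dist_s = torch.stack([soft_s, 1 - soft_s], dim=-1).clamp_min(1e-6)
    kl = F.kl_div(dist_s.log(), dist_t, reduction="none").sum(-1).mean(dim=(1, 2, 3))

    # Cross-entropy (BCE) against the ground truth.
    ce = F.binary_cross_entropy_with_logits(logits_s, gt,
                                            reduction="none").mean(dim=(1, 2, 3))
    return (theta * kl + (1 - theta) * ce).mean()
```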

4. Experiments

4.1. Datasets and Evaluation Metrics

Datasets. Extensive experiments are conducted on five widely used RGB-D datasets, namely NLPR [24], NJUD [25], SIP [26], DES [27] and LFSD [28]. These datasets contain large-scale images with different resolutions and diverse scenarios. We adopt the same training set as [12], which contains 1500 samples from NJUD and 700 samples from NLPR. The remaining images in these two datasets, together with the other three datasets, are used for testing.
Evaluation Metrics. We adopt five metrics to comprehensively evaluate SOD performance: the F-measure curve, the F-measure score (F_β), the mean absolute error (M), the S-measure (S_α) and the E-measure (E_θ). Specifically, F_β measures the accuracy of the model as follows:
\[ F_\beta = \frac{(1 + \beta^2) \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\beta^2 \cdot \mathrm{Precision} + \mathrm{Recall}} \]
where β² is set to 0.3 by default. M measures the error rate of the model as follows:
\[ M = \frac{1}{W \times H} \sum_{x=1}^{W} \sum_{y=1}^{H} \left| S_{x,y} - G_{x,y} \right| \]
where W and H denote the width and height of the prediction, S is the predicted saliency map and G is the corresponding ground truth.
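Both metrics can be computed directly from the formulas above; the short NumPy sketch below does so for a single image. The adaptive binarization threshold (twice the mean saliency) used for F_β here is a common SOD convention and an assumption on our part; the tables report the maximum F-measure over thresholds.

```python
import numpy as np


def mae(saliency, gt):
    # M: mean absolute error between saliency map and ground truth, both in [0, 1].
    return np.abs(saliency - gt).mean()


def f_measure(saliency, gt, beta_sq=0.3):
    # F_beta with an adaptive binarization threshold (twice the mean saliency).
    thresh = min(2.0 * saliency.mean(), 1.0)
    binary = saliency >= thresh
    mask = gt > 0.5
    tp = np.logical_and(binary, mask).sum()
    precision = tp / (binary.sum() + 1e-8)
    recall = tp / (mask.sum() + 1e-8)
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall + 1e-8)
```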

4.2. Implementation Details

Our model is implemented with the PyTorch toolbox and trained on a GTX TITAN X GPU for 40 epochs with a mini-batch size of 4. We use VGG16- and VGG19-based FPNs as our final student architectures. Both RGB and depth images are resized to 256 × 256. To avoid overfitting, simple flipping and rotation are adopted to augment the training dataset. The initial learning rate is set to 1 × 10⁻³ and we adopt a weight decay of 0.0005 for stochastic gradient descent (SGD) with a momentum of 0.9.
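For reference, the training configuration above maps onto the following PyTorch setup; the placeholder student module and the rotation range of the augmentation are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torchvision import transforms

# Input pipeline: resize to 256 x 256 with simple flip/rotation augmentation.
augment = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(degrees=15),   # rotation range is an assumption
    transforms.ToTensor(),
])

# SGD with lr 1e-3, momentum 0.9 and weight decay 5e-4, as in Section 4.2.
student = nn.Conv2d(4, 1, kernel_size=3, padding=1)  # stands in for the VGG-based FPN student
optimizer = torch.optim.SGD(student.parameters(), lr=1e-3,
                            momentum=0.9, weight_decay=5e-4)
```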

4.3. Comparisons with the State of the Art

We compare our method with 12 prevalent methods from recent years, including DF [29], CTMF [30], AFNet [31], MMCI [32], DMRA [33], CPFP [12], TANet [34], D3Net [26], A2dele [10], DANet [35], FANet [15] and CMINet [6]. For fair comparisons, we directly use the released evaluation results or generate the results from the public saliency maps under the same evaluation framework.
Quantitative Evaluation. Table 1 shows the quantitative results over five datasets. Our method achieves the best scores on most metrics, especially on the NJUD dataset, which contains 500 testing image pairs and on which our method performs best across all metrics. On LFSD and SIP, although other methods obtain higher scores, we still achieve competitive results with smaller VGG16- and VGG19-based networks. Figure 5 shows the comparison results using F-measure curves; our method, represented by the red line, demonstrates better overall performance with both lightweight student models. In addition, Table 2 shows that our final VGG16-based network is only 57.9 MB and runs faster, drastically improving the inference speed and reducing the number of parameters. These results indicate that, without designing complicated models, accurate detection results can be obtained with a plain FPN using the proposed methods.
Qualitative Evaluation. Figure 6 exhibits visual comparisons with prevalent methods from recent years. The images, picked from different testing datasets, contain diverse objects and scenarios. It can be observed that the saliency maps generated by our method are closer to the ground truth. More specifically, row 1 shows a case where the depth image has low contrast, especially in the bottom part, and rows 2 and 3 show complex backgrounds in the RGB images. Under these circumstances, our method generates better saliency maps with less distortion and fewer irrelevant objects compared to other methods.

4.4. Ablation Studies

Dynamic Knowledge Distillation. As shown in Table 3, our baseline is an FPN with a VGG19 backbone trained with a cross-entropy loss, which can achieve the basic detection task. RGB indicates that only RGB maps are used in training, while RGBD concatenates depth maps into the input. It is worth noting that directly using the early fusion strategy already shows potential for RGB-D saliency detection. We then apply KD to the baseline and compare the results with different weights on four datasets. It is observed that KD improves performance and that our DKD achieves better results across all four testing datasets.
Furthermore, Figure 7 shows the detection performance of the student network under different KD weights. In the last 10,000 iterations of the training stage, the proposed method has better overall accuracy, with the lowest accuracy still above 0.4, which is even better than the teacher network. The specific examples in Figure 8 illustrate that, compared with the DKD, conventional fixed weights suffer from more false positives and negatives. Consequently, our DKD adaptively controls the KD in an appropriate way, leading to improved overall detection performance.
Noise Elimination with DKD. We investigate the low-accuracy issue of the teacher network by visualising extremely hard samples, as shown in Figure 4. Table 3 shows that, with the help of noise elimination, all evaluation metrics over the four testing datasets reach better results. Red arrows and rectangles in Figure 9 mark the details that are refined by the proposed methods, especially parts of objects against low-contrast backgrounds, further illustrating that the proposed noise elimination effectively mitigates the noise during distillation and enables the student model to learn more semantic details from the useful training data.
Further analysis. To show the generalization of the proposed DKD, we replace the teacher network with DANet. The experimental results in Table 4 indicate that our DKD can compress the model size of an existing method while approaching the accuracy of the teacher model; on DES, the student model even outperforms the teacher model. Hence, the proposed dynamic distillation strategy can be applied to different teacher models. We further conduct experiments on the VGG16-based FPN with different KD hyper-parameters, as shown in Table 5. Specifically, we set the temperature to 10 and only use RGB images in the distillation training phase. The experimental results demonstrate that the proposed DKD can be utilized with different networks and different training settings, proving the effectiveness and generalization of the DKD and indicating the potential of achieving RGB-D SOD through RGB data alone within a lightweight structure.

5. Conclusions

In this paper, we propose a DKD strategy and a noise elimination method for RGB-D based SOD. The proposed dynamic strategy considers the performance of both the teacher and student models to generate an adaptive weight for KD. To reduce the final model size, we adopt the early fusion strategy for fusing features from different domains and a simple FPN as the final student model, without designing extra networks. In addition, we investigate the noise issue caused by depth maps and alleviate this problem by setting a threshold during KD. The proposed methods can be exploited with different teacher models and provide a new perspective that avoids designing extra networks for RGB-D SOD. We conduct comprehensive experiments on five challenging benchmark datasets to demonstrate that our method achieves competitive performance using only a simple FPN model, which significantly compresses the model size and increases the inference speed. We further apply this dynamic strategy with different distillation temperatures and diverse models to prove the effectiveness and generalization of our method.

Author Contributions

Methodology, G.R.; Software, Y.Y.; Supervision, T.S.; Writing—original draft, G.R.; Writing—review & editing, H.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Zhu, J.Y.; Wu, J.; Xu, Y.; Chang, E.; Tu, Z. Unsupervised object class discovery via saliency-guided multiple class learning. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012. [Google Scholar]
  2. Fan, D.P.; Wang, W.; Cheng, M.M.; Shen, J. Shifting more attention to video salient object detection. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  3. Shimoda, W.; Yanai, K. Distinct class-specific saliency maps for weakly supervised semantic segmentation. In Proceedings of the ECCV, Amsterdam, The Netherlands, 11–14 October 2016. [Google Scholar]
  4. Mahadevan, V.; Vasconcelos, N. Saliency-based discriminant tracking. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009. [Google Scholar]
  5. Zhao, R.; Ouyang, W.; Wang, X. Person re-identification by saliency learning. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 356–370. [Google Scholar] [CrossRef] [PubMed]
  6. Zhang, J.; Fan, D.P.; Dai, Y.; Yu, X.; Zhong, Y.; Barnes, N.; Shao, L. RGB-D saliency detection via cascaded mutual information minimization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 4338–4347. [Google Scholar]
  7. Zhao, X.; Pang, Y.; Zhang, L.; Lu, H. Joint Learning of Salient Object Detection, Depth Estimation and Contour Extraction. arXiv 2022, arXiv:2203.04895. [Google Scholar]
  8. Hinton, G.; Vinyals, O.; Dean, J. Distilling the knowledge in a neural network. arXiv 2015, arXiv:1503.02531. [Google Scholar]
  9. Zhang, P.; Su, L.; Li, L.; Bao, B.; Cosman, P.; Li, G.; Huang, Q. Training Efficient Saliency Prediction Models with Knowledge Distillation. In Proceedings of the ACM, Aberdeen, UK, 15–17 July 2019. [Google Scholar]
  10. Piao, Y.; Rong, Z.; Zhang, M.; Ren, W.; Lu, H. A2dele: Adaptive and attentive depth distiller for efficient RGB-D salient object detection. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
  11. Lang, C.; Nguyen, T.V.; Katti, H.; Yadati, K.; Kankanhalli, M.; Yan, S. Depth matters: Influence of depth cues on visual saliency. In Proceedings of the 12th European Conference on Computer Vision, Florence, Italy, 7–13 October 2012. [Google Scholar]
  12. Zhao, J.X.; Cao, Y.; Fan, D.P.; Cheng, M.M.; Li, X.Y.; Zhang, L. Contrast prior and fluid pyramid integration for RGBD salient object detection. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  13. Pang, Y.; Zhang, L.; Zhao, X.; Lu, H. Hierarchical dynamic filtering network for RGB-D salient object detection. arXiv 2020, arXiv:2007.06227. [Google Scholar]
  14. Chen, S.; Fu, Y. Progressively guided alternate refinement network for RGB-D salient object detection. arXiv 2020, arXiv:2008.07064. [Google Scholar]
  15. Zhou, X.; Wen, H.; Shi, R.; Yin, H.; Zhang, J.; Yan, C. FANet: Feature aggregation network for RGBD saliency detection. Signal Process. Image Commun. 2022, 102, 116591. [Google Scholar] [CrossRef]
  16. Cheng, X.; Rao, Z.; Chen, Y.; Zhang, Q. Explaining knowledge distillation by quantifying the knowledge. arXiv 2020, arXiv:2003.03622. [Google Scholar]
  17. Zheng, Z.; Ye, R.; Wang, P.; Ren, D.; Zuo, W.; Hou, Q.; Cheng, M.M. Localization Distillation for Dense Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–23 June 2022; pp. 9407–9416. [Google Scholar]
  18. Yang, Z.; Li, Z.; Jiang, X.; Gong, Y.; Yuan, Z.; Zhao, D.; Yuan, C. Focal and global knowledge distillation for detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–23 June 2022; pp. 4643–4652. [Google Scholar]
  19. Xu, C.; Gao, W.; Li, T.; Bai, N.; Li, G.; Zhang, Y. Teacher–student collaborative knowledge distillation for image classification. Appl. Intell. 2022, 1–13. [Google Scholar] [CrossRef]
  20. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  21. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  22. Liu, S.; Huang, D.; Wang, Y. Receptive field block net for accurate and fast object detection. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 385–400. [Google Scholar]
  23. Máttyus, G.; Luo, W.; Urtasun, R. Deeproadmapper: Extracting road topology from aerial images. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017. [Google Scholar]
  24. Peng, H.; Li, B.; Xiong, W.; Hu, W.; Ji, R. RGBD salient object detection: A benchmark and algorithms. In Proceedings of the 13th European Conference, Zurich, Switzerland, 6–12 September 2014. [Google Scholar]
  25. Ju, R.; Liu, Y.; Ren, T.; Ge, L.; Wu, G. Depth-aware salient object detection using anisotropic center-surround difference. Signal Process. Image Commun. 2015, 38, 115–126. [Google Scholar] [CrossRef]
  26. Fan, D.; Lin, Z.; Zhao, J.; Liu, Y.; Zhang, Z.; Hou, Q.; Zhu, M.; Cheng, M. Rethinking RGB-D Salient Object Detection: Models, Datasets, and Large-Scale Benchmarks. arXiv 2019, arXiv:1907.06781. [Google Scholar] [CrossRef] [PubMed]
  27. Cheng, Y.; Fu, H.; Wei, X.; Xiao, J.; Cao, X. Depth enhanced saliency detection method. In Proceedings of the Conference on Internet Multimedia Computing and Service, Xiamen, China, 10–12 July 2014. [Google Scholar]
  28. Li, N.; Ye, J.; Ji, Y.; Ling, H.; Yu, J. Saliency detection on light field. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014. [Google Scholar]
  29. Qu, L.; He, S.; Zhang, J.; Tian, J.; Tang, Y.; Yang, Q. RGBD salient object detection via deep fusion. IEEE Trans. Image Process. 2017, 26, 2274–2285. [Google Scholar] [CrossRef]
  30. Han, J.; Chen, H.; Liu, N.; Yan, C.; Li, X. CNNs-based RGB-D saliency detection via cross-view transfer and multiview fusion. IEEE Trans. Image Process. 2017, 26, 2274–2285. [Google Scholar] [CrossRef] [PubMed]
  31. Wang, N.; Gong, X. Adaptive fusion for RGB-D salient object detection. IEEE Access 2019, 7, 55277–55284. [Google Scholar] [CrossRef]
  32. Chen, H.; Li, Y.; Su, D. Multi-modal fusion network with multi-scale multi-path and cross-modal interactions for RGB-D salient object detection. Pattern Recognit. 2019, 86, 376–385. [Google Scholar] [CrossRef]
  33. Piao, Y.; Ji, W.; Li, J.; Zhang, M.; Lu, H. Depth-induced multi-scale recurrent attention network for saliency detection. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea, 27 October–2 November 2019. [Google Scholar]
  34. Chen, H.; Li, Y. Three-stream attention-aware network for RGB-D salient object detection. IEEE Trans. Image Process. 2019, 28, 2825–2835. [Google Scholar] [CrossRef]
  35. Zhao, X.; Zhang, L.; Pang, Y.; Lu, H.; Zhang, L. A single stream network for robust and real-time RGB-D salient object detection. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020. [Google Scholar]
Figure 1. Comparison of three fusion strategies.
Figure 2. Weighted F-measure and model size on NLPR and SIP datasets. Our lightweight model with simple early fusion can achieve satisfactory detection results over different datasets. Ours and Ours* indicate simple VGG16-based and VGG19-based FPN, respectively.
Figure 3. Our framework consists of two stages. We adopt a cumbersome network as the teacher model and a feature pyramid network as the student model.
Figure 4. Low quality depths in training.
Figure 5. F-measure curves across four benchmarks on VGG16 and VGG19-based FPN.
Figure 6. Visual comparisons with existing methods.
Figure 7. Teacher and student scatters with different distillation weights. We exhibit the last 10,000 training iterations and DKD shows better overall performance.
Figure 8. Heatmap visual comparisons of different distillation weights: heatmaps from left to right indicate weights 0.3, 0.5, 0.7 and dynamic, respectively.
Figure 9. Ablation studies of the proposed methods. Baseline represents that the student network is only trained on cross-entropy loss. DKD represents the proposed DKD and NE means noise elimination.
Table 1. Quantitative comparisons through the maximum of the F-score F_β, S-score S_α, E-score E_θ, and error score M over five widely evaluated datasets. ↑ and ↓ indicate that larger and smaller scores are better, respectively. Ours and Ours* indicate the simple VGG16-based and VGG19-based FPN, respectively. ite refers to training iterations.
| Dataset | Metric | DF | CTMF | AFNet | MMCI | TANet | DMRA | CPFP | D3Net | A2dele | DANet | FANet | CMINet | Ours | Ours* |
| NJUD | F_β ↑ | 0.789 | 0.857 | 0.804 | 0.868 | 0.888 | 0.896 | 0.890 | 0.903 | 0.905 | 0.890 | 0.892 | 0.925 | 0.928 | 0.934 |
| NJUD | S_α ↑ | 0.735 | 0.849 | 0.772 | 0.859 | 0.878 | 0.885 | 0.878 | 0.895 | 0.867 | 0.897 | 0.899 | 0.939 | 0.916 | 0.920 |
| NJUD | E_θ ↑ | 0.818 | 0.866 | 0.847 | 0.882 | 0.909 | 0.920 | 0.900 | 0.901 | 0.914 | 0.926 | 0.914 | 0.956 | 0.949 | 0.952 |
| NJUD | M ↓ | 0.151 | 0.085 | 0.100 | 0.079 | 0.061 | 0.051 | 0.053 | 0.051 | 0.052 | 0.046 | 0.044 | 0.032 | 0.032 | 0.030 |
| NLPR | F_β ↑ | 0.752 | 0.841 | 0.816 | 0.841 | 0.876 | 0.888 | 0.884 | 0.904 | 0.891 | 0.908 | 0.885 | 0.909 | 0.922 | 0.930 |
| NLPR | S_α ↑ | 0.769 | 0.860 | 0.799 | 0.856 | 0.886 | 0.898 | 0.884 | 0.906 | 0.889 | 0.908 | 0.913 | 0.941 | 0.921 | 0.924 |
| NLPR | E_θ ↑ | 0.840 | 0.869 | 0.884 | 0.872 | 0.926 | 0.942 | 0.920 | 0.934 | 0.937 | 0.945 | 0.951 | 0.964 | 0.958 | 0.960 |
| NLPR | M ↓ | 0.110 | 0.056 | 0.058 | 0.059 | 0.041 | 0.031 | 0.038 | 0.034 | 0.031 | 0.031 | 0.026 | 0.019 | 0.022 | 0.021 |
| DES | F_β ↑ | 0.625 | 0.865 | 0.775 | 0.839 | 0.853 | 0.906 | 0.882 | 0.917 | 0.897 | 0.916 | 0.874 | 0.926 | 0.926 | 0.928 |
| DES | S_α ↑ | 0.685 | 0.863 | 0.770 | 0.848 | 0.858 | 0.899 | 0.872 | 0.904 | 0.883 | 0.905 | 0.894 | 0.953 | 0.918 | 0.918 |
| DES | E_θ ↑ | 0.806 | 0.911 | 0.874 | 0.904 | 0.919 | 0.944 | 0.927 | 0.956 | 0.918 | 0.961 | 0.925 | 0.970 | 0.965 | 0.966 |
| DES | M ↓ | 0.131 | 0.055 | 0.068 | 0.065 | 0.046 | 0.030 | 0.038 | 0.030 | 0.030 | 0.028 | 0.026 | 0.015 | 0.022 | 0.023 |
| LFSD | F_β ↑ | 0.854 | 0.815 | 0.780 | 0.813 | 0.827 | 0.872 | 0.850 | 0.849 | 0.858 | - | 0.855 | 0.862 | 0.865 | 0.862 |
| LFSD | S_α ↑ | 0.786 | 0.796 | 0.738 | 0.787 | 0.801 | 0.847 | 0.828 | 0.832 | 0.833 | - | 0.850 | 0.877 | 0.834 | 0.839 |
| LFSD | E_θ ↑ | 0.841 | 0.851 | 0.810 | 0.840 | 0.851 | 0.899 | 0.867 | 0.860 | 0.875 | - | 0.882 | 0.911 | 0.883 | 0.883 |
| LFSD | M ↓ | 0.142 | 0.120 | 0.133 | 0.132 | 0.111 | 0.076 | 0.088 | 0.099 | 0.077 | - | 0.076 | 0.064 | 0.080 | 0.078 |
| SIP | F_β ↑ | 0.704 | 0.720 | 0.756 | 0.840 | 0.851 | 0.847 | 0.870 | 0.882 | 0.855 | 0.901 | - | 0.887 | 0.872 | 0.882 |
| SIP | S_α ↑ | 0.653 | 0.716 | 0.720 | 0.833 | 0.835 | 0.800 | 0.850 | 0.864 | 0.828 | 0.878 | - | 0.894 | 0.855 | 0.865 |
| SIP | E_θ ↑ | 0.794 | 0.824 | 0.815 | 0.886 | 0.894 | 0.858 | 0.899 | 0.903 | 0.890 | 0.914 | - | 0.933 | 0.908 | 0.914 |
| SIP | M ↓ | 0.185 | 0.139 | 0.118 | 0.086 | 0.075 | 0.088 | 0.064 | 0.063 | 0.070 | 0.054 | - | 0.044 | 0.060 | 0.056 |
| | Backbone | VGG16 | VGG16 | VGG16 | VGG16 | VGG16 | VGG19 | VGG16 | VGG16 | VGG16 | VGG16/19 | VGG16 | ResNet50 | VGG16 | VGG19 |
| | Epoch | - | - | - | 30,000 (ite) | - | 50 | 10,000 (ite) | 30 | 50 | 40 | 40 | 100 | 40 | 40 |
Table 2. The model size and inference speed of different methods.
| Method | MMCI | TANet | PCANet | D3Net | CPFP | DMRA | DANet | CMINet | A2dele | Ours |
| Model Size (MB) | 951.9 | 929.7 | 533.6 | 519 | 278 | 238.8 | 106.7 | 84 | 57.3 | 57.9/78.2 |
| FPS | 19 | - | 15 | - | 7 | 10 | 32 | 10 | 120 | 136 |
Table 3. Ablation analysis on 4 datasets. RGB and RGBD indicate that the student network is trained without/with depth maps, respectively, by cross-entropy loss. s indicates the weight of KD. Here we use the mean value of the F-score F_m and the weighted F-measure wF_m to show the overall accuracy. ↑ and ↓ indicate that larger and smaller scores are better.
| Dataset | Metric | RGB | RGBD | s = 0.3 | s = 0.5 | s = 0.7 | s = Dynamic | +Threshold |
| SIP | F_m ↑ | 0.704 | 0.773 | 0.832 | 0.845 | 0.843 | 0.849 | 0.853 |
| SIP | wF_m ↑ | 0.654 | 0.724 | 0.796 | 0.805 | 0.809 | 0.811 | 0.817 |
| SIP | M ↓ | 0.108 | 0.086 | 0.063 | 0.061 | 0.059 | 0.058 | 0.056 |
| NJUD | F_m ↑ | 0.776 | 0.830 | 0.902 | 0.902 | 0.898 | 0.904 | 0.914 |
| NJUD | wF_m ↑ | 0.739 | 0.799 | 0.895 | 0.889 | 0.880 | 0.893 | 0.901 |
| NJUD | M ↓ | 0.080 | 0.060 | 0.030 | 0.034 | 0.037 | 0.032 | 0.030 |
| NLPR | F_m ↑ | 0.780 | 0.816 | 0.873 | 0.876 | 0.870 | 0.876 | 0.890 |
| NLPR | wF_m ↑ | 0.746 | 0.781 | 0.877 | 0.875 | 0.865 | 0.876 | 0.887 |
| NLPR | M ↓ | 0.046 | 0.041 | 0.022 | 0.024 | 0.026 | 0.024 | 0.021 |
| LFSD | F_m ↑ | 0.713 | 0.784 | 0.825 | 0.832 | 0.830 | 0.834 | 0.835 |
| LFSD | wF_m ↑ | 0.656 | 0.741 | 0.780 | 0.793 | 0.790 | 0.795 | 0.796 |
| LFSD | M ↓ | 0.142 | 0.102 | 0.086 | 0.080 | 0.080 | 0.078 | 0.078 |
Table 4. Quantitative comparisons of applying DKD on DANet. We use the DANet as the teacher model and exploit the proposed DKD to transfer knowledge from DANet to the VGG19-based FPN. ↑ and ↓ indicate that larger and smaller scores are better.
| Dataset | Metric | DANet | DANet + DKD |
| SIP | F_m ↑ | 0.864 | 0.848 |
| SIP | wF_m ↑ | 0.829 | 0.811 |
| SIP | M ↓ | 0.054 | 0.058 |
| DES | F_m ↑ | 0.891 | 0.892 |
| DES | wF_m ↑ | 0.848 | 0.870 |
| DES | M ↓ | 0.028 | 0.025 |
| | Model Size (MB) | 106.7 | 78.2 |
Table 5. Distillation on VGG16 with RGB maps. ↑ and ↓ indicate that larger and smaller scores are better.
| Dataset | Metric | s = 0.3 | s = 0.5 | s = 0.7 | s = Dynamic |
| NLPR | F_m ↑ | 0.882 | 0.876 | 0.876 | 0.884 |
| NLPR | wF_m ↑ | 0.869 | 0.865 | 0.866 | 0.879 |
| NLPR | M ↓ | 0.024 | 0.025 | 0.027 | 0.024 |
| LFSD | F_m ↑ | 0.790 | 0.786 | 0.788 | 0.798 |
| LFSD | wF_m ↑ | 0.745 | 0.735 | 0.751 | 0.753 |
| LFSD | M ↓ | 0.104 | 0.106 | 0.101 | 0.097 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
