Article
Peer-Review Record

ZoomInNet: A Novel Small Object Detector in Drone Images with Cross-Scale Knowledge Distillation

by Bi-Yuan Liu 1, Huai-Xin Chen 1,*, Zhou Huang 1, Xing Liu 1 and Yun-Zhi Yang 2
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Submission received: 7 February 2021 / Revised: 14 March 2021 / Accepted: 15 March 2021 / Published: 21 March 2021

Round 1

Reviewer 1 Report

The approach presented in the article sounds interesting and, according to the presented results, outperforms the current SOTA. There is, however, one thing that I didn't fully understand. The authors said that the comparison was made under the same conditions for all of the tested networks (the same backbone, number of epochs, etc.). Does this mean that the authors implemented all of the presented networks themselves in the same framework (or adapted/modified the existing models)? Is there a guarantee that those 20 training epochs were enough to achieve the top performance for each of those models?

There are some mistakes in the text and notation; some of them are listed below:

  • in line 241, AP_ is used twice (one of them should probably be AR_ for average recall)
  • in Tables 1 and 2, average precision is used twice; one of them should probably be recall
  • in Eq. (3) and at the bottom of page 6, CRB is used (it should probably be CBR)
  • the abbreviation CBR is used (line 198) before it is explained (line 203)
  • "not well" -> "not good"? (line 149)
  • "strdie" -> "stride" (page 6)
  • "the two model" -> "the second model"? (line 220)

Author Response

Thank you for your precious time and the effort invested in reviewing this manuscript. Your insightful advice is much appreciated, and we feel highly honored by your affirmation and approval. We have addressed all of your concerns and comments; please see the attachment.

Author Response File: Author Response.docx

Reviewer 2 Report

The authors propose cross-scale knowledge distillation (CSKD) for small object detection. In CSKD, the student network learns features from a teacher network trained on double-size images. The CSKD pipeline consists of feature-layer alignment (layer adaptation) and an adaptive key distillation positions (AKDP) algorithm. In the experiments, the proposed method demonstrated better performance than SSD, RetinaNet, PANet, NAS-FPN, and Libra R-CNN.
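For context, a minimal sketch of this cross-scale feature distillation setup is given below; the module names, channel widths, stand-in backbones, and the MSE feature loss are assumptions for illustration, not the authors' actual implementation:

```python
# Assumed setup (not the authors' code): a teacher runs on a 2x-upsampled
# image, a student on the original image, and a 1x1-conv adapter aligns the
# student's features to the teacher's before a feature-mimicking loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureAdapter(nn.Module):
    """Layer adaptation: project student channels to the teacher's width."""
    def __init__(self, student_ch: int, teacher_ch: int):
        super().__init__()
        self.proj = nn.Conv2d(student_ch, teacher_ch, kernel_size=1)

    def forward(self, x):
        return self.proj(x)

def cross_scale_distill_loss(student_feat, teacher_feat, adapter):
    # The teacher sees a 2x image, so its feature map at the same pyramid
    # level is twice the spatial size; resize before comparing.
    aligned = adapter(student_feat)
    aligned = F.interpolate(aligned, size=teacher_feat.shape[-2:],
                            mode="bilinear", align_corners=False)
    return F.mse_loss(aligned, teacher_feat)

# Toy usage with single-conv stand-ins for the detector backbones.
img = torch.randn(1, 3, 256, 256)
img_2x = F.interpolate(img, scale_factor=2.0, mode="bilinear",
                       align_corners=False)
student_backbone = nn.Conv2d(3, 64, kernel_size=3, stride=8, padding=1)
teacher_backbone = nn.Conv2d(3, 128, kernel_size=3, stride=8, padding=1)
adapter = FeatureAdapter(64, 128)
loss = cross_scale_distill_loss(student_backbone(img),
                                teacher_backbone(img_2x), adapter)
```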

 

The major comments are as follows:

 

  1. My main concern is that the detection accuracy for small objects is still lower than 10%, and the mAPs of the proposed method are also lower than those of other SOTA methods such as S+D and CenterNet [1] reported in VisDrone-DET2019 [2]. The authors compare the proposed method with SSD, RetinaNet, and so on; however, better methods have already been proposed. The authors should clarify the advantage of the proposed method and compare it with at least the better methods in [2].
    [1] X. Zhou, D. Wang, and P. Krähenbühl. Objects as Points. CoRR, abs/1904.07850, 2019.
    [2] D. Du, P. Zhu, L. Wen, et al. VisDrone-DET2019: The Vision Meets Drone Object Detection in Image Challenge Results. ICCV Workshops, 2019. doi:10.1109/ICCVW.2019.00030.

 

  2. The authors propose a knowledge-distillation-based method, while fine-tuning is one of the other possible approaches: you can pre-train a network using double-size images and then re-train it using the original-size images. Fine-tuning is simple and may improve the accuracy, as reported for SNIP [3]. The authors should clarify why they use knowledge distillation rather than fine-tuning (see the sketch after this list).
    [3] Singh, B.; Davis, L.S. An Analysis of Scale Invariance in Object Detection - SNIP. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3578–3587.

  3. Which network parameters are CBR and X in Eq. (8)? Please clarify this.

 

  4. Please explain the settings of the hyperparameters, such as K, lambda_distill, lambda_cls, lambda_loc, and so on.

  5. Which layer do you focus on when computing Eq. (19)? There are three feature maps according to Figure 3.

 

  6. Which feature maps do you use for making the masks in AKDP?

  7. The authors investigated the influence of background in Sec. 3.5.1. What is the purpose of this investigation? The authors also use FPNs. Is the destruction of object-detection context solved in the proposed method? If not, the proposed method has the same problem.

  8. Please check the following sentences and equations:
    • On page 6, "there are P unknown parameters". Is P correct?
    • In Eq. (8), A_{MN×1} appears. Is A correct?
    • There is no box_B on the right side of Eq. (10).
    • Eq. (12) is supposed to define thresh_IOU, but it reads thresh_IOU = lambda * thresh_IOU, with the threshold on both sides (see the note after this list).
    • Tables 1, 2, and 4 show "Avg. Precision, Area" twice.
    • What is Ours-FA_CSKD_th? Is it the same as SNet + FA_ThinHead + CSKD?
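A minimal sketch of the fine-tuning alternative mentioned in comment 2, assuming a PyTorch-style detector whose forward pass returns the training loss; the function name and loop structure are illustrative, not an existing API:

```python
# Hypothetical two-phase schedule: pre-train on 2x-upsampled images, then
# fine-tune the same weights on original-size images (the SNIP-style
# alternative described in comment 2). `model(images, targets)` is assumed
# to return a scalar training loss.
import torch.nn.functional as F

def train_double_then_finetune(model, loader, optimizer,
                               pretrain_epochs=10, finetune_epochs=10):
    for epoch in range(pretrain_epochs + finetune_epochs):
        double_scale = epoch < pretrain_epochs  # phase 1 uses 2x inputs
        for images, targets in loader:
            if double_scale:
                # NOTE: a real detector would also need the box targets
                # scaled by 2 to match the upsampled images.
                images = F.interpolate(images, scale_factor=2.0,
                                       mode="bilinear", align_corners=False)
            loss = model(images, targets)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```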
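On the thresh_IOU point in comment 8: if the threshold is meant to be tightened iteratively, the intended recurrence is presumably one with an explicit iteration index, e.g.

```latex
\mathrm{thresh}_{\mathrm{IOU}}^{(t+1)} = \lambda \cdot \mathrm{thresh}_{\mathrm{IOU}}^{(t)},
\qquad 0 < \lambda < 1,
```

so that the threshold decays geometrically rather than being defined in terms of itself.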

 

I hope these comments will be helpful.

---End---

Author Response

Thank you very much for your precious time and the effort invested in improving this paper. We feel highly honored by your affirmation and approval, and your insightful advice is highly appreciated. We have attempted to address all of your concerns, which has greatly improved the quality of our paper. We hope that our efforts meet with your approval. Please see the attachment.

Author Response File: Author Response.docx

Round 2

Reviewer 2 Report

Please check the attached file.

Comments for author File: Comments.docx

Author Response

Please see the attachment.

Author Response File: Author Response.docx
