1. Introduction
Optical glass is an important yet tiny signal transducer in optical devices. Most defects on optical glass are at the sub-millimeter level, so quality inspection of optical glass usually relies on high-resolution microscope images. To the best of our knowledge, most advanced object detection methods are applied to images with relatively low resolution, such as VOC2007/2012 [1,2] (about 500 × 400) and MS COCO [3] (about 600 × 400), which offers a good trade-off between detection accuracy and computational cost. However, for the high-resolution images used in industrial quality inspection, a detector requires substantial processing time and places heavy demands on GPU memory. Moreover, defects may appear on any surface of the optical glass, so accurate results are difficult to obtain from a single image taken from one perspective. In this case, a method that fuses detection results from multi-perspective images is needed.
Figure 1 shows images of normal and defective optical glass from different perspectives. Specifically, (a) shows the front and side views of a defective sample, and (b) shows a qualified sample from the same perspectives.
In this paper, we propose a video-based two-stage optical glass defect detection network to solve the problems above. Since high-resolution images incur a large computational cost while down-sampled low-resolution images may render small defect areas invisible, the detection process is carried out in a coarse-to-fine manner: the optical glass area is first located on a down-sampled version of the image, and defects are then detected within that area at the higher resolution. Such a framework keeps the computational cost under control and significantly reduces false alarms coming from the background. Since defects may appear at various positions on the optical glass, we capture videos to obtain multiple images of the glass surface and carefully design a video-based defect detection framework to further improve detection recall. Moreover, considering that low-quality video frames are useless or even harmful for defect detection, we propose a clustering-based image quality evaluation method to pick out high-quality video frames. Extensive experiments demonstrate the superiority of our method in terms of both quantitative and qualitative evaluation.
The main contributions of this paper are as follows:
- A coarse-to-fine two-stage detection network is proposed to detect tiny defects on high-resolution images of optical glass.
- A video-based detection framework is suggested to support multi-perspective defect detection.
- A video-frame image quality evaluation method based on a clustering algorithm is proposed to pick useful images for detection.
- We contribute a new dataset named “OGD-DET” that includes 3415 images collected in real industrial settings.
2. Related Works
In recent decades, there has been considerable academic research on defect detection. One research direction builds on traditional image processing techniques, such as statistical analysis [4,5] and spectrum-based approaches [6,7]. The other builds on deep-learning-based object detection methods.
The pipeline of traditional methods usually includes image pre-processing, hand-crafted feature extraction, classification, and post-processing [8,9,10,11]. To improve defect detection accuracy, template matching methods compare a pre-set template image with the inspected image at the pixel level; owing to their efficiency and stability, they have become popular in industrial visual defect detection [12]. These traditional methods can achieve promising results. However, they depend strongly on expert experience and on tightly controlled image acquisition settings; otherwise, detection performance degrades severely.
The development of deep learning has led to a series of breakthroughs in object detection. Girshick et al. [13] proposed the two-stage detector R-CNN, which first uses a selective search method to extract about 2000 proposal regions and then feeds them to a convolutional neural network (CNN) for object detection. R-CNN improved mean average precision by more than 30% over the previous best results at that time on VOC2010 [14]. He et al. [15] applied spatial pyramid pooling to the feature map produced by the CNN layers over the whole image in SPP-net, raising computational efficiency and allowing images of any size and aspect ratio to be processed. Building on R-CNN and SPP-net, Fast R-CNN [16] and Faster R-CNN [17] were proposed in succession to significantly reduce the computational cost; they utilize ROI Pooling and a Region Proposal Network to achieve an end-to-end object detection structure and reach 0.5 fps and 7 fps, respectively, on a K40 GPU. Different from R-CNN and its variants, YOLO [18,19,20,21,22] is a one-stage detector that is extremely fast but usually underperforms R-CNN-based methods. Moreover, SSD [23] offers a good trade-off between detection accuracy and efficiency; it is a one-stage detector similar to YOLO but uses multiple anchors like Faster R-CNN.
Most existing deep detectors take a down-sampled, low-resolution image as input for computational efficiency. However, tiny defects may be lost in this low-resolution image. One strategy for tackling this problem is to divide the whole image into sub-images and run detection on each sub-image. Adam [24] first proposed this approach for object detection on large images: a super-high-resolution image is cropped into multiple sub-images with a sliding window, object detection is performed on each sub-image, and the detections are finally spliced together and filtered by NMS to obtain results for the full image. Another trend for detection on high-resolution images is the cascade detection framework. Szegedy et al. [25] adopted a multi-stage architecture with different Intersection over Union (IoU) thresholds to resolve the mismatch between training and inference proposals in R-CNN and its variants, further improving detection accuracy. Gao et al. [26] used a coarse-to-fine approach, designing a Markov model to dynamically select regions and thereby improve detection speed on large images. Inspired by these works, we propose a video-based coarse-to-fine defect detection network to detect tiny defects on high-resolution images of optical glass, aiming to meet the requirements of effectiveness and efficiency at the same time.
3. Methods
In this section, we introduce our methods in detail. First, we illustrate the overall architecture of our two-stage coarse-to-fine detection network. Then we introduce the implementation of the video-based detection framework via the video frame fusing module. Finally, an image quality evaluation (IQE) method is introduced to pick high-quality images for better detection.
3.1. Coarse-to-Fine Defect Detection Network
Tiny defects on high-resolution images are hard to detect in real time with existing deep detectors, since high-resolution images incur a high computational cost while down-sampled low-resolution images may render small defect areas invisible. To solve this problem, we propose a two-stage coarse-to-fine detection network for tiny defect detection on high-resolution images. As shown in Figure 2, the coarse detection stage locates the glass area on the down-sampled image. The detected glass area is then restored to its original high resolution and fed to the fine detection stage for defect detection. The detection networks differ slightly between the two stages; network details are described below.
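The coarse-to-fine inference flow can be sketched as follows. This is a minimal illustration, not the paper's implementation: `coarse_detector` and `fine_detector` are hypothetical callables standing in for the two YOLOv4-based stages, and the `(x1, y1, x2, y2, score)` box format is an assumption.

```python
import numpy as np

def resize_nearest(img, size):
    """Minimal nearest-neighbour resize, used only to keep the sketch
    self-contained (a real system would use the camera SDK or OpenCV)."""
    h, w = img.shape[:2]
    ys = np.arange(size[0]) * h // size[0]
    xs = np.arange(size[1]) * w // size[1]
    return img[ys][:, xs]

def coarse_to_fine_detect(image, coarse_detector, fine_detector,
                          coarse_size=(416, 416)):
    """Locate the glass on a down-sampled copy, then detect defects on the
    full-resolution crop. Detectors return (x1, y1, x2, y2, score) boxes
    in the coordinates of their own input."""
    h, w = image.shape[:2]
    sy, sx = h / coarse_size[0], w / coarse_size[1]

    # Stage 1: find the glass region on a low-resolution version.
    small = resize_nearest(image, coarse_size)
    defects = []
    for (x1, y1, x2, y2, _) in coarse_detector(small):
        # Map the coarse box back to full-resolution coordinates.
        fx1, fy1 = int(x1 * sx), int(y1 * sy)
        fx2, fy2 = int(x2 * sx), int(y2 * sy)
        crop = image[fy1:fy2, fx1:fx2]
        # Stage 2: detect tiny defects inside the high-resolution crop.
        for (dx1, dy1, dx2, dy2, s) in fine_detector(crop):
            defects.append((dx1 + fx1, dy1 + fy1, dx2 + fx1, dy2 + fy1, s))
    return defects
```

Only the small image passes through the coarse network, so the full-resolution pixels are touched just once, inside the detected glass crop.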
3.1.1. The Coarse Detection Stage
Considering the real-time requirement of industrial visual inspection, we design the detection networks of both stages based on YOLOv4 [21]. YOLOv4 is a typical model that balances speed and accuracy well; it consists of three sub-nets: the backbone, neck, and head networks. In the coarse detection stage, glass regions are first located on the down-sampled input image. To further improve efficiency and achieve real-time detection, a more lightweight backbone with shallower neck and head networks is chosen. As shown in Figure 3a, the Darknet-tiny network [20] is selected as the backbone instead of the default CSPDarkNet53 of YOLOv4. Specifically, the proposed network uses only two feature-map scales (13 × 13 and 26 × 26) for the neck and head networks, compared with the three scales (13 × 13, 26 × 26, and 52 × 52) in the original YOLOv4. As a result, the Darknet-tiny backbone costs only 5.56 billion FLOPs, compared with the 9 billion FLOPs of the CSPDarkNet53 backbone of YOLOv4.
3.1.2. The Fine Detection Stage
In the fine detection stage, the detected glass area is first restored to its original high-resolution size and then searched for defects. Since defect areas on glass are still quite tiny, the smallest covering only 2 × 7 pixels, extracting rich feature information from the input RGB images is essential. More importantly, the different color channels of RGB images carry distinctive features. With this in mind, we propose a Color Channel Separation (CCS) convolution to improve the model’s feature extraction capability in the fine detection stage. As illustrated in Figure 3b, the CCS convolution works in the backbone network, replacing the traditional convolution. Specifically, the proposed CCS convolution places a CBL and an MCBL(4) operation after every color channel to help learn more semantic information. The CBL operation consists of a convolution layer, a batch normalization layer, and a Leaky ReLU layer, as shown in Figure 3c, and MCBL(n) represents a max-pooling layer followed by a CBL operation, with this structure repeated n times. With the CCS convolution, richer semantic information from the input image is leveraged and delivered to later layers, leading to better detection results.
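The per-channel structure can be sketched in NumPy as follows. This is a shape-level illustration only, under several stated simplifications: batch normalization is omitted (it is trivial on a single image), each CBL is reduced to one single-channel 3 × 3 convolution, and the kernel values are placeholders for learned weights.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv_bn_leaky(x, kernel, slope=0.1):
    """'CBL' stand-in: 3x3 valid convolution + Leaky ReLU on one channel
    (batch normalization omitted in this single-image sketch)."""
    windows = sliding_window_view(x, kernel.shape)   # (H-2, W-2, 3, 3)
    y = np.einsum('hwij,ij->hw', windows, kernel)
    return np.where(y > 0, y, slope * y)

def max_pool2(x):
    """2x2 max pooling with stride 2."""
    h, w = x.shape[0] // 2 * 2, x.shape[1] // 2 * 2
    return x[:h, :w].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def ccs_forward(rgb, kernels, n_mcbl=4):
    """Color Channel Separation sketch: each of the R, G, B channels gets
    its own CBL followed by MCBL(n) (max-pool + CBL, repeated n times);
    the per-channel feature maps are then stacked for later layers."""
    outs = []
    for c in range(3):
        f = conv_bn_leaky(rgb[..., c], kernels[c][0])   # CBL
        for i in range(n_mcbl):                         # MCBL(n) chain
            f = conv_bn_leaky(max_pool2(f), kernels[c][i + 1])
        outs.append(f)
    return np.stack(outs, axis=-1)
```

The point of the structure is that each color channel keeps its own learned filters through the CBL and MCBL(4) chain instead of being mixed immediately by a multi-channel convolution.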
Compared to the single-stage detection method, the proposed coarse-to-fine two-stage framework not only effectively reduces false positives from the background but also meets the requirements of real-time processing.
3.2. Video-Based Detection Framework
Since defects may exist anywhere on the optical glass, we design a video-based detection framework that fuses the results of multiple frames captured from various perspectives. The video frame fusing module provides a video-level prediction as the final output.
Firstly, the videos are captured by an industrial camera under a controllable illumination environment. Specifically, the manipulator picks up the optical glass on the operating table and rotates it under the camera to collect multi-perspective video. The video-based detection framework takes the multiple frames as input and a video frame fusing module is proposed to fuse the results of multi-perspective video frames to give the final prediction result.
To specify the fusion method in detail, our proposed video frame fusing module is presented in Figure 4. Suppose a video has N sampled frames, of which M are detected to be defective. The probability of the video being defective can then be estimated by the proportion of defective frames; in other words, the sample is considered defective when

M / N > T,

where T is the threshold on the ratio of abnormal images to all images. However, some defects are visible in only a few frames and would be missed under this rule alone. Therefore, the confidence score is introduced to further improve the fusion strategy. Let i be the index of a sampled frame and s_i the confidence score of the defects detected in the i-th frame, with confidence threshold T_s. When s_i > T_s for any frame, the sample is also considered defective. In our experiments, the confidence threshold T_s is set to 0.8, and the ratio threshold T is fixed in advance.
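The fusion rule can be sketched as a small function. The representation of per-frame results and the default threshold value are our assumptions based on the description above.

```python
def video_is_defective(frame_results, ratio_thresh, conf_thresh=0.8):
    """Video-level fusion sketch: `frame_results` maps each sampled frame
    to the list of confidence scores of defects detected in it. A video is
    flagged if the fraction of defective frames exceeds the ratio threshold
    T, or if any single detection is confident enough (s_i > T_s), so that
    defects visible in only a few frames are not missed."""
    n = len(frame_results)
    m = sum(1 for scores in frame_results if scores)  # frames with defects
    if n and m / n > ratio_thresh:
        return True
    return any(s > conf_thresh for scores in frame_results for s in scores)
```

For example, a single very confident detection in one frame of a long video still flags the sample, even though the defective-frame ratio stays below T.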
3.3. Image Quality Evaluation Module
Images captured by industrial microscope cameras may be blurry during the glass rotation process due to motion blur and defocus. Blurry images are not appropriate for defect detection and may hurt the final detection accuracy. Thus, we propose a clustering-based image quality evaluation (IQE) module to pick out high-quality images for the final prediction.
As illustrated in Figure 5, the proposed clustering-based IQE framework has two main phases: training (left) and testing (right). In the training phase, we first extract HOG features from all images, both clear and blurry, and group them with the K-means clustering algorithm. The images are thus classified into multiple classes of different image clarity, and we manually choose one or more classes whose images are of high quality. Finally, the clustering centers of all classes are recorded for later use in clarity testing. In the testing phase, HOG features are first extracted from the input image. Then, the mean squared error (MSE) between the input image’s HOG features and each clustering center computed in the training phase is calculated. If the smallest MSE corresponds to one of the chosen high-quality classes, the image is considered a high-quality sample and is used for defect detection. Otherwise, the image is deemed not to meet the quality requirement and is discarded.
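The two phases can be sketched as follows. To keep the sketch self-contained, a gradient-magnitude histogram stands in for the HOG descriptor (blurry frames concentrate mass in the low-gradient bins), and a minimal K-means replaces a library implementation; both substitutions are ours, not the paper's.

```python
import numpy as np

def grad_hist_features(img, bins=16):
    """Stand-in for HOG: a normalized histogram of gradient magnitudes."""
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)
    hist, _ = np.histogram(mag, bins=bins, range=(0, mag.max() + 1e-6))
    return hist / hist.sum()

def kmeans(feats, k, iters=100, seed=0):
    """Minimal k-means (training phase); a real pipeline would use a
    library implementation."""
    rng = np.random.default_rng(seed)
    centers = feats[rng.choice(len(feats), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((feats[:, None] - centers) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = feats[labels == j].mean(axis=0)
    return centers, labels

def is_high_quality(feat, centers, good_classes):
    """Testing phase: assign the frame to its nearest cluster center by MSE
    and accept it only if that center belongs to a manually chosen
    high-quality class."""
    mse = ((centers - feat) ** 2).mean(axis=1)
    return int(np.argmin(mse)) in good_classes
```

After training, only the cluster centers and the indices of the manually chosen high-quality classes need to be stored for inference.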
4. Experiments
In this section, we conduct extensive experiments to verify the effectiveness of our approach. First of all, we introduce datasets and experimental settings. Then we present ablation studies and finally compare our method with the state-of-the-art methods.
4.1. Datasets
To evaluate our method, we build an optical glass dataset called OGD-DET, captured in a real industrial production environment. All optical glass videos are taken by an Omega industrial camera at a resolution of 1536 × 1024 pixels. The dataset has 40 videos, of which 25 are used for training and 15 for testing. The OGD-DET dataset is available on GitHub (https://pntehan.github.io/OGD-DET/, accessed on 3 December 2021). To evaluate our method at the image level, all videos are sampled into frames at 8 images per second, yielding 3415 images in total, of which 511 are used for testing. In our experiments, all images are resized to 416 × 416 pixels. The proposed IQE module selects high-quality glass images from the first detection stage for the final prediction. In the K-means clustering process, the initial number of clustering centers is set to 20, from which we choose 9 classes with high image quality; the number of K-means iterations is 100.
4.2. Performance Evaluation Metrics
We adopt recall, precision, average precision (AP), and frames-per-second (FPS) as performance metrics in our experiments. AP represents the average detection precision for each defect category, and AP@x denotes the AP at IoU = x; for example, AP@0.25 is the AP at Intersection over Union (IoU) = 0.25. Recall and precision are computed at IoU = 0.25 with a confidence score threshold of 0.5. FPS is the number of images processed per second, indicating model inference speed.
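For reference, IoU and a recall/precision computation at these thresholds can be sketched as below. The greedy highest-score-first matching is a common convention, not a detail taken from the paper.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def recall_precision(preds, gts, iou_thresh=0.25, conf_thresh=0.5):
    """Match predictions (x1, y1, x2, y2, score) to ground-truth boxes,
    greedily by descending score, at the given IoU and confidence
    thresholds; each ground truth may be matched at most once."""
    kept = sorted((p for p in preds if p[4] >= conf_thresh),
                  key=lambda p: -p[4])
    matched, tp = set(), 0
    for p in kept:
        for i, g in enumerate(gts):
            if i not in matched and iou(p[:4], g) >= iou_thresh:
                matched.add(i)
                tp += 1
                break
    recall = tp / len(gts) if gts else 0.0
    precision = tp / len(kept) if kept else 0.0
    return recall, precision
```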
4.3. Ablation Studies
To verify the effectiveness of each module in our approach, we conduct ablation studies. To assess the role of the coarse-to-fine framework, we compare a one-stage defect detection network with our two-stage network. The one-stage network, based on YOLOv4, detects defects directly on the down-sampled input images. As shown in Table 1, at IoU = 0.25 the two-stage network improves Recall by up to 13.60% and Precision by 9.80% over the one-stage network. The results demonstrate that the two-stage detection network both detects tiny defects effectively and reduces false positives from the background.
Our Color Channel Separation (CCS) convolution achieves 100.00% Recall and 90.50% Precision at IoU = 0.25, whereas the conventional convolution achieves 80.10% Recall and 97.40% Precision; that is, the CCS convolution trades some Precision for a substantial gain in Recall. This indicates that the CCS module helps learn more detailed information from each color channel, which benefits tiny defect detection. Moreover, our IQE module further improves performance to 100.00% Recall and 99.48% Precision at IoU = 0.25 by selecting high-quality images for the final prediction.
4.4. Comparison with the State-of-the-Art Methods
We compare our method with Faster RCNN, SSD, and Yolov5 [22] for defect detection. All methods are trained on the same training dataset and evaluated under both image-level and video-level test settings.
The experimental results under the image-level setting are shown in Table 2. Our method significantly outperforms Faster RCNN and SSD in terms of AP. Benefiting from the coarse-to-fine two-stage detection framework, our method achieves much better detection results, especially at stricter IoU thresholds, demonstrating that the proposed framework locates the defect area more precisely. Moreover, our method runs in real time at 21 FPS on an RTX 3080 Ti GPU server, much faster than Faster RCNN and comparable to SSD512 and Yolov5. Although our method is slower than SSD300, it outperforms SSD300 in detection accuracy by a large margin. The FLOPs and number of parameters of each method are listed in Table 3.
Table 4 shows the results under the video-level setting. Fifteen videos are tested, including 5 normal and 10 defective samples. Our method achieves perfect results of 100% AP and 100% Recall, much better than Faster RCNN and SSD300. The gain comes from the video-based detection framework, which fuses the results of multiple frames captured from various perspectives.
Visualization results of Faster RCNN, SSD, and our method are shown in Figure 6. Images under different illumination conditions are presented, with the defect area enlarged in the upper right corner of each image. For easy cases, all methods achieve good results, as shown in the first two columns of Figure 6, whereas for the hard cases in the last four columns, our method is more robust than both Faster RCNN and SSD when the illumination is poor.
5. Conclusions
We propose a video-based two-stage defect detection network for sub-millimeter defect detection on optical glass. Specifically, we propose a coarse-to-fine two-stage detection framework that promotes detection performance on tiny defects and effectively reduces the false alarm rate from the background. A video-based detection framework is designed to detect defects from multiple perspectives, which addresses the problem that defects may exist on multiple surfaces of the optical glass. Moreover, the CCS convolution and IQE module further improve defect detection results by enhancing feature representations and picking out high-quality samples for detection. In the future, we will conduct more experiments on large real-world datasets to further verify the effectiveness of our method.
Author Contributions
Conceptualization, H.Z. and J.Z.; methodology, H.Z. and X.Y.; software, H.Z. and J.C.; validation, H.Z., X.Y. and Z.W.; formal analysis, H.Z. and Z.W.; investigation, H.Z. and Z.W.; resources, J.Z., Y.D., J.C. and X.Z.; data curation, X.Y.; writing—original draft preparation, H.Z. and X.Y.; writing—review and editing, J.Z.; visualization, X.Y. and Z.W.; supervision, J.Z., Y.D., J.C. and X.Z.; project administration, J.Z. and Y.D.; funding acquisition, J.Z. All authors have read and agreed to the published version of the manuscript.
Funding
The APC was funded by National Natural Science Foundation of China (No. 62176251).
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
Not applicable.
Conflicts of Interest
The authors declare no conflict of interest.
References
- Everingham, M.; Van Gool, L.; Williams, C.K.I.; Winn, J.; Zisserman, A. The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results. Available online: http://host.robots.ox.ac.uk/pascal/VOC/voc2007/htmldoc/index.html (accessed on 5 May 2007).
- Everingham, M.; Van Gool, L.; Williams, C.K.I.; Winn, J.; Zisserman, A. The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. Available online: http://host.robots.ox.ac.uk/pascal/VOC/voc2012/htmldoc/index.html (accessed on 5 May 2012).
- Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollar, P.; Zitnick, C.L. Microsoft Coco: Common Objects in Context. In European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2014; pp. 740–755. [Google Scholar]
- Goldstein, M.; Dengel, A. Histogram-Based Outlier Score (HBOS): A Fast Unsupervised Anomaly Detection Algorithm. KI-2012: Poster and Demo Track. 2012, pp. 59–63. Available online: https://www.dfki.de/fileadmin/user_upload/import/6431_HBOS-poster.pdf (accessed on 5 May 2012).
- Pittino, F.; Puggl, M.; Moldaschl, T.; Hirschl, C. Automatic Anomaly Detection on In-Production Manufacturing Machines Using Statistical Learning Methods. Sensors 2020, 20, 2344. [Google Scholar] [CrossRef] [PubMed]
- Hou, X.D.; Zhang, L.Q. Saliency Detection: A Spectral Residual Approach. In Proceedings of the 2007 IEEE Conference on Computer Vision and Pattern Recognition, Minneapolis, MN, USA, 17–22 June 2007. [Google Scholar]
- Bai, X.L.; Fang, Y.M.; Lin, W.S.; Wang, L.; Ju, B.-F. Saliency-Based Defect Detection in Industrial Images by Using Phase Spectrum. IEEE Trans. Ind. Inform. 2014, 10, 2135–2145. [Google Scholar] [CrossRef]
- Fang, X.X.; Luo, Q.W.; Zhou, B.X.; Li, C.; Tian, L. Research progress of automated visual surface defect detection for industrial metal planar materials. Sensors 2020, 20, 5136. [Google Scholar] [CrossRef] [PubMed]
- Chu, M.X.; Gong, R.F.; Gao, S.; Zhao, J. Steel surface defects recognition based on multi-type statistical features and enhanced twin support vector machine. Chemom. Intell. Lab. Syst. 2017, 171, 140–150. [Google Scholar] [CrossRef]
- Kwon, B.K.; Won, J.S.; Kang, D.J. Fast defect detection for various types of surfaces using random forest with VOV features. Int. J. Precis. Eng. Manuf. 2015, 16, 965–970. [Google Scholar] [CrossRef]
- Jian, C.X.; Gao, J.; Ao, Y.H. Automatic surface defect detection for mobile phone screen glass based on machine vision. Appl. Soft Comput. 2017, 52, 348–358. [Google Scholar] [CrossRef]
- Zhou, X.; Wang, Y.; Xiao, C.; Zhu, Q.; Zhao, H. Automated visual inspection of glass bottle bottom with saliency detection and template matching. IEEE Trans. Instrum. Meas. 2019, 68, 4253–4267. [Google Scholar] [CrossRef]
- Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
- Everingham, M.; Van Gool, L.; Williams, C.K.I.; Winn, J.; Zisserman, A. The PASCAL Visual Object Classes Challenge 2010 (VOC2010) Results. Available online: http://host.robots.ox.ac.uk/pascal/VOC/voc2010/htmldoc/index.html (accessed on 5 May 2010).
- He, K.M.; Zhang, X.Y.; Ren, S.Q.; Sun, J. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 37, 1904–1916. [Google Scholar] [CrossRef] [PubMed]
- Girshick, R. Fast R-CNN. In Proceedings of the 2015 International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015. Available online: https://arxiv.org/abs/1504.08083v (accessed on 7 April 2022).
- Ren, S.Q.; He, K.M.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
- Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
- Redmon, J.; Farhadi, A. Yolov3: An Incremental Improvement. Available online: https://arxiv.org/abs/1804.02767v (accessed on 7 April 2022).
- Adarsh, P.; Rathi, P.; Kumar, M. YOLO v3-Tiny: Object Detection and Recognition using one stage improved model. In Proceedings of the 2020 6th International Conference on Advanced Computing and Communication Systems (ICACCS), Coimbatore, India, 6–7 March 2020; pp. 687–694. [Google Scholar]
- Bochkovskiy, A.; Wang, C.Y.; Liao, H. Yolov4: Optimal Speed and Accuracy of Object Detection. Available online: https://arxiv.org/abs/2004.10934v1 (accessed on 7 April 2022).
- Glenn, J. Yolov5. Available online: https://github.com/glenn-jocher/yolov5 (accessed on 7 April 2022).
- Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Proceedings of the 2016 European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 11–14 October 2016; Available online: https://arxiv.org/abs/1512.02325 (accessed on 7 April 2022).
- Adam, V.E. You Only Look Twice: Rapid Multi-Scale Object Detection in Satellite Imagery. Computer Vision and Pattern Recognition. Available online: https://arxiv.org/abs/1805.09512 (accessed on 7 April 2022).
- Szegedy, C.; Toshev, A.; Erhan, D. Deep Neural Networks for Object Detection. In Proceedings of the 26th International Conference on Neural Information Processing Systems (NIPS), Lake Tahoe, NV, USA, 5–10 December 2013. [Google Scholar]
- Gao, M.; Yu, R.; Li, A.; Morariu, V.I.; Davis, L.S. Dynamic zoom-in network for fast object detection in large images. In Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; Available online: https://arxiv.org/abs/1711.05187 (accessed on 7 April 2022).
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).