Article
Peer-Review Record

Failure Detection for Semantic Segmentation on Road Scenes Using Deep Learning

by Junho Song 1, Woojin Ahn 2, Sangkyoo Park 2 and Myotaeg Lim 2,*
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Submission received: 23 December 2020 / Revised: 15 February 2021 / Accepted: 15 February 2021 / Published: 20 February 2021

Round 1

Reviewer 1 Report

The authors discussed failure detection and mean intersection over union (mIoU) prediction for images from semantic segmentation network results.

The abstract is very well-structured.

Introduction:

The paragraph "Advanced driver assistance systems make autonomous driving possible. The National Highway Traffic Safety Administration categorizes five levels of developmental stages of autonomous driving technology" needs a literature reference.

Section 1.1:

It needs literature references about the different warning situations, e.g.,

10.3390/electronics9030416, 10.3390/su9091530, 10.3390/s110807420

Section 1.2:

Needs an introductory sentence before the bullets

Section 2 may be included as subsection of Introduction.

Sections 3 and 4 are very well presented, but the results must be discussed with the literature.

 

Author Response

Failure detection for semantic segmentation on road scenes using deep learning

Reviewer 1 

Q1. Section 1.1: Introduction: The paragraph "Advanced driver assistance systems make autonomous driving possible. The National Highway Traffic Safety Administration categorizes five levels of developmental stages of autonomous driving technology" needs a literature reference.

A1. We acknowledge that the paragraph needs a reference. We have added a reference for the five levels of developmental stages of autonomous driving technology published by the NHTSA.

In section 1, line 22 

“Advanced driver assistance systems make autonomous driving possible. The National Highway Traffic Safety Administration categorizes five levels of developmental stages of autonomous driving technology:”  has been changed to

“Based on vision methods, advanced driver assistance systems make autonomous driving possible. National Highway Traffic Safety Administration [24] categorizes five levels of developmental stages of autonomous driving technology:”

Q2. Section 1.2:  It needs literature references about the different warning situations.

A2. We fully acknowledge that in the first draft of this paper, as an example of a warning situation, we only mentioned the case in which the results of semantic segmentation were wrong. As you mentioned, we have added three additional references that address warning situations.

In section 1.1, line 41 

“Therefore, it is crucial to allow the system to detect failures.”

has been changed to

“In other words, allowing the system to detect failures is important for self-driving [25-27] in autonomous driving systems.”

Q3. Section 1.2:  Needs an introductory sentence before the bullets.

A3. We added an introductory sentence before the bullets to keep the reader's attention focused.

We added the sentence in section 1.3, line 93

“Our main attributes can be summarized as follows.”

Q4. Section 2 may be included as subsection of Introduction.

A4. We relocated Section 2 to Section 1.2, a subsection of Section 1. Its purpose is to explain basic background on deep learning, and we think the content fits better within Section 1.

Q5. Section 3 and 4 are very well presented, but the results must be discussed with the literature.

A5. Following your suggestion, we added descriptions of the tables and figures in Sections 3 and 4. For example, we added the missing explanation of how the GT mIoU values are obtained in Section 3, and Section 4 now describes what each class in the confusion matrix means, what it is used for in the paper, and how it applies to the interpretation of the results.

We added the sentence in section 2.7, line 170

“In the case of semantic segmentation, the intersection over union (IoU) value for a class is the ratio of the number of pixels correctly predicted for that class to the number of pixels in the union of the prediction and the ground truth for that class. The average of the IoU values over all classes is called the mIoU. In this paper, we compute the mIoU between the GT segmentation map and the segmentation map obtained via ESPNet to obtain the GT mIoU value.”
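The GT mIoU computation described above can be sketched as follows. This is a minimal pure-Python illustration; the function name and the toy maps are ours, not from the paper:

```python
def miou(gt_map, pred_map, num_classes):
    """Mean IoU between flattened GT and predicted segmentation maps.

    For each class, IoU = |intersection| / |union| counted in pixels;
    classes absent from both maps are skipped before averaging.
    """
    ious = []
    for c in range(num_classes):
        inter = sum(g == c and p == c for g, p in zip(gt_map, pred_map))
        union = sum(g == c or p == c for g, p in zip(gt_map, pred_map))
        if union == 0:  # class absent from both maps: skip it
            continue
        ious.append(inter / union)
    return sum(ious) / len(ious)

# toy 2x2 maps flattened to pixel lists, two classes
gt = [0, 0, 1, 1]
pred = [0, 1, 1, 1]
# class 0: inter 1, union 2 -> 0.5; class 1: inter 2, union 3 -> 0.667
print(round(miou(gt, pred, 2), 3))  # 0.583
```

Applied per image between the GT map and the ESPNet output, this yields the GT mIoU value used as the regression target.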

Please also find the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

This paper presents a mIoU prediction network for the application of failure detection of semantic segmentation results. The proposed CNN model achieves at least 83% failure detection accuracy on two public datasets. I have some questions regarding this article.

  1. In Section 3.6, the authors propose a modified loss function to deal with the imbalance distribution of GT-mIoU values in the Cityscapes dataset. The design of the proposed modified loss function is based on the observation shown in Table 2. However, this observation may not be suitable for other datasets. Moreover, the distribution of GT-mIoU values also depends on the semantic segmentation model used in the system. Therefore, the proposed loss function may not be usable for other semantic segmentation models.
  2. Another issue is that the modified loss function is a discontinuous function and is non-differentiable at 0.1. Therefore, the optimizer may fail when the loss value equals 0.1. This problem may happen when using a different dataset or a different semantic segmentation model.
  3. Based on the above two comments, I think the proposed modified loss function only can be used for ESPNet training on the Cityscapes dataset. This limitation makes the proposed loss function unusable for other segmentation models.
  4. In Figure 7, the authors should clarify how to perform mIoU calculation to obtain GT-mIoU value.
  5. The authors claim that the proposed model can be used for self-diagnosis of autonomous driving systems (ADS). However, the proposed model only predicts the mIoU value of the segmentation result, which is not a discriminative information for self-diagnosis. For example, in Figure 12, the proposed model predicts a mIoU value higher than the GT-mIoU value for the first two segmentation results. However, the first result fails in the road-space segmentation. On the contrary, the second result is successful in road-space segmentation, which can be used for the ADS to perform path planning process. Therefore, I think that the output information of the proposed model cannot help ADS in the self-diagnosis.
  6. In Figure 9, the authors should clarify the definition of TP, TN, FP, FN based on the predicted mIoU and the GT-mIoU values.
  7. I suggest that the English writing of this paper should be revised by an English-speaking editor.

Author Response


Reviewer 2  

 

Q1.  In Section 3.6, the authors propose a modified loss function to deal with the imbalance distribution of GT mIoU values in the Cityscapes dataset. The design of the proposed modified loss function is based on the observation shown in Table 2. However, this observation may not be suitable for other datasets. Moreover, the distribution of GT mIoU values also depends on the semantic segmentation model used in the system. Therefore, the proposed loss function may not be usable for other semantic segmentation models

A1.  As you mentioned, we agree that the modified loss function proposed in this paper may not work properly on data from other distributions. In addition, the distribution of GT mIoU values depends on the performance of the semantic segmentation model used by the system.

First, we added a table of GT mIoU values for the HMG dataset for comparison with the table giving the number and fraction of images per GT mIoU interval for the Cityscapes dataset. As shown in Tables 2 and 14, the two datasets show very different distributions even though the same model structure was used. In our opinion, the distributions differ because the HMG dataset comes from a different domain than the front camera, with vertical images for which segmentation is simpler. However, our model does not use the mIoU value itself as the training indicator but rather the MAE, i.e., the difference between the predicted and GT values, so we believe it can perform well on data from both distributions.

(Table 2, please find attachment)

We added new table about HMG distribution Table 14 near line 346

(Table 14, please find attachment)

 

Secondly, we conducted a new experiment to validate the performance of the modified loss function by comparing it with MSE, which highlights our distribution-based design for readers. As a result, we show that the modified loss function improves both mIoU prediction accuracy and failure detection accuracy on the Cityscapes dataset as well as the HMG dataset.

We added an extra explanation and experiment results on the Cityscapes dataset in line 255

“The accuracy obtained with the MSE loss function and with the modified loss function differs as follows (Table 8). The comparison confirms that using the loss function suited to the characteristics of the unbalanced data yields a 4.6% accuracy improvement for failure detection and 2.3% for mIoU prediction.”

(Table 8, please find attachment)

And also for HMG dataset in line

“In the case of the HMG dataset, we experimented with the MSE loss function and with the modified loss function (Table 15). The table shows that using the modified loss function gives better performance on both the Cityscapes and HMG datasets.”

(Table 15, please find attachment)

Finally, we conducted a comparative experiment using the DeepLabV3+ structure to test whether the proposed loss function works well only on ESPNet.

We added a new section 3.5 for Deeplab experiment in line 321 and Table 12, 13.

“3.5. Experimental Results on the DeepLabV3+ model

We experimented with DeepLabV3+ to ensure that the proposed method is also applicable to other semantic segmentation models. We conducted an experiment to check that performance is maintained even when using GT mIoU values generated by another segmentation model, after fixing the structure of the network that forms the second step of our pipeline. Table 12 represents the distribution of GT mIoU values generated using the DeepLabV3+ model.

The results of training with MSE and with the modified loss function are shown in Table 13. As shown in the table, the proposed loss function improves performance, but not as much as with ESPNet in Table 8. The improvement from the modified loss function is smaller because the distribution of GT mIoU values generated with DeepLabV3+ is much more concentrated around the mean than the distribution of the data produced with ESPNet. Nevertheless, slight improvements are still observed in both failure detection accuracy and mIoU prediction accuracy. We confirm that the proposed loss function not only yields significant performance improvements on unbalanced data but also performs well on data from other distributions.”

(Tables 12 and 13, please find attachment)

Q2. Another issue is that the modified loss function is a discontinuous function and is non-differentiable at 0.1. Therefore, the optimizer may fail when the loss value equals 0.1. This problem may happen when using a different dataset or a different semantic segmentation model.

A2.  We agree with the point that the loss function is discontinuous and non-differentiable at 0.1, which can cause problems with other datasets or segmentation models. However, we modified the loss function to address the imbalanced distribution of the data, and we think it works well as a solution to that problem.

As future work, we are studying a modified sigmoid function as a general-purpose loss function applicable to more general datasets and models.
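The exact form of the modified loss function is not reproduced in this record, so the following is only a hedged illustration of the two ideas under discussion: a piecewise-weighted squared loss with a breakpoint at 0.1 (hence the discontinuity the reviewer raised) and a sigmoid-weighted smooth variant of the kind mentioned as future work. The function names and the weight and steepness values `w` and `k` are hypothetical, not from the paper.

```python
import math

def modified_loss(pred, gt, w=5.0):
    """Illustrative piecewise squared loss: errors above 0.1 are up-weighted
    by w, which makes the loss discontinuous (and non-differentiable) at 0.1."""
    err = abs(pred - gt)
    return w * err ** 2 if err > 0.1 else err ** 2

def sigmoid_weighted_loss(pred, gt, w=5.0, k=50.0):
    """Smooth alternative: the weight ramps from 1 up to w around err = 0.1
    via a sigmoid, keeping the loss differentiable across the breakpoint."""
    err = abs(pred - gt)
    weight = 1.0 + (w - 1.0) / (1.0 + math.exp(-k * (err - 0.1)))
    return weight * err ** 2

print(modified_loss(0.80, 0.75))  # small error: plain squared loss
print(modified_loss(0.80, 0.40))  # large error: up-weighted by w
```

The sigmoid variant shows why a smooth weighting would sidestep the reviewer's concern: the optimizer never encounters a jump in the loss surface.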

Q3.  Based on the above two comments, I think the proposed modified loss function only can be used for ESPNet training on the Cityscapes dataset. This limitation makes the proposed loss function unusable for other segmentation models.

A3.  We have confirmed through additional experiments that the proposed modified loss function works well with semantic segmentation models such as ESPNet and DeepLabV3+, and also on rain/haze datasets and the HMG dataset. In addition, we are working on applying our method to models with more diverse structures and tasks.

Q4.  In Figure 7, the authors should clarify how to perform mIoU calculation to obtain GT mIoU value.

A4.  As you mentioned, we think Figure 7 lacks explanation, carrying only the words “mIoU calculation”. We added a detailed pipeline of the mIoU calculation to the figure so that readers can understand it more intuitively. We also expanded the explanation of how the mIoU value is obtained in the second paragraph of Section 2.7.

Figure 7 has been changed as following:

(Figure 7, please find attachment)

and extra explanation in line 170,

“In the case of semantic segmentation, the intersection over union (IoU) value for a class is the ratio of the number of pixels correctly predicted for that class to the number of pixels in the union of the prediction and the ground truth for that class. The average of the IoU values over all classes is called the mIoU. In this paper, we compute the mIoU between the GT segmentation map and the segmentation map obtained via ESPNet to obtain the GT mIoU value.”

Q5.  The authors claim that the proposed model can be used for self-diagnosis of autonomous driving systems (ADS). However, the proposed model only predicts the mIoU value of the segmentation result, which is not a discriminative information for self-diagnosis. For example, in Figure 12, the proposed model predicts a mIoU value higher than the GT-mIoU value for the first two segmentation results. However, the first result fails in the road-space segmentation. On the contrary, the second result is successful in road-space segmentation, which can be used for the ADS to perform path planning process. Therefore, I think that the output information of the proposed model cannot help ADS in the self-diagnosis.  

A5.  We think the meaning may have been conveyed vaguely because the text did not explain the figure sufficiently.

First of all, we do not consider only the road area to be important in ADS. For example, in the first picture the road area is not properly distinguished, while in the second picture the ground area where trees and signs are planted is not properly distinguished. The same holds for the other pictures.

In this paper, we used the mIoU value for training because of these characteristics of ADS. The mIoU is the average of the per-class IoU values, which condenses the detection accuracy over all classes of an image into a single number. In our opinion, a scalar value per image has the advantage of delivering a simpler warning to the driver than reporting which part of the scene is dangerous.

Q6. In Figure 9, the authors should clarify the definition of TP, TN, FP, FN based on the predicted mIoU and the GT-mIou values.

A6. As you mentioned, we think it is necessary to define TP, TN, FP, and FN clearly. First, the image was re-rendered at high resolution to improve readability. In addition, the text now explains the meaning of each case and the criteria, based on the predicted mIoU and GT mIoU values, by which the images used in the paper are classified into each case.

Figure 9 has been changed as follows:

(Figure 9, please find attachment)

and clear statement on TP, TN, FP and FN in line 188,

“The results of the experiment are classified into each situation according to the following criteria: The TP is a case in which both the mIoU prediction value and the GT mIoU value are greater than 0.5 and the TN is smaller. In both cases, the detection of the failure case was successful. On the other hand, the FP means that the mIoU prediction value would be greater than 0.5, but the GT mIoU prediction value would be less than 0.5, the FN predicted that the mIoU prediction value would be greater than 0.5. In both cases, it is defined that the failure to properly detect the failure case.”

Q7.  I suggest that the English writing of this paper should be revised by an English-speaking editor.

A7. Following your suggestion, we received advice from an English-speaking editor about the writing. We revised the whole paper wherever the sentence flow was awkward so that the content connects more naturally.

 

Thank you very much for your sincere comments.

 

Please also find the attachment.

Author Response File: Author Response.pdf

Reviewer 3 Report

The paper discusses preventing failure cases in an image by predicting the mean intersection over union (mIoU). A failure case is defined as a situation in which semantic segmentation has not been performed well on the image. The authors use a two-stage deep learning model: the first stage is the encoder network ESPNet, which extracts features for image segmentation, and the second stage detects failure cases. They also propose a modified loss function to solve the data imbalance problem of the ground-truth mIoU.

 

Overall, the manuscript is well written and presents different outputs along with false negatives, false positives, etc. on the Cityscapes dataset. The failure-case images look reasonable considering the rain/haze, which badly affects the performance of the network. The authors need to improve the algorithm's performance on the Hyundai Motor Group dataset or add a table to show the accuracy.

 

Author Response


Reviewer 3

Q1. Overall, the manuscript is well written and presents different outputs along with false negatives, false positives, etc. on the Cityscapes dataset. The failure-case images look reasonable considering the rain/haze, which badly affects the performance of the network. The authors need to improve the algorithm's performance on the Hyundai Motor Group dataset or add a table to show the accuracy.

A1. To show the accuracy of the algorithm on the HMG dataset as you mentioned, Section 3.5.2 presents the failure detection and mIoU prediction accuracy in a table. However, some parts of the text did not refer to the table numbers, so the table design was modified and the tables are now referenced in the text so that readers can find them easily. A description of the tables has also been added.

(Tables 14, 15, 16 and 17, please find attachment).

 

Please also find attachment.

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

The manuscript is improved. However, Sections 3.5.1 and 3.6.2 are not displayed correctly in the manuscript. Once again, I believe the discussion with other results in the literature is intended here, but it is not visible.

---------------

Update later:

With these new sections that performed the comparison with the literature, the paper is now acceptable.

Author Response

Q1. The manuscript is improved. However, Sections 3.5.1 and 3.6.2 are not displayed correctly in the manuscript. Once again, I believe the discussion with other results in the literature is intended here, but it is not visible.

A1. Thank you for your kind comments. We acknowledge that Sections 3.5.1 and 3.6 overflowed the page layout. We fixed this and uploaded a new version of the manuscript as you mentioned.

(Figure, please see attachment)

Please also refer to Revision Note_round 2

Author Response File: Author Response.pdf

Reviewer 2 Report

In this revision, the authors have addressed several problems in the previous version. However, I still have some questions about this revision.

  1. In Figure 7, why the input of the ESPNet is the GT segmentation map?
  2. On Lines 190-191, I am confused that why the case of TP is recognized as a successful failure detection case? In my opinion, only the case of TN can be recognized as successful failure detection.
  3. The description on Line 193 may be wrong.
  4. In Table 4, the definition of mIoU accuracy is missing.
  5. In Table 8, the definitions of failure detection accuracy and mIoU prediction accuracy are missing.
  6. In Figure 13, why are the failure detection results of the third and fifth images not TN?
  7. In Figure 15, why all the failure detection results are FP, not TP?

Author Response

Q1. In Figure 7, why the input of the ESPNet is the GT segmentation map?

A1. As you mentioned, we acknowledge that ESPNet should take the input image and produce the segmentation prediction. We changed the pipeline figure to avoid misunderstanding of ESPNet.

Figure 7 has been changed as following:

(Figure, please see attachment)

Q2. On Lines 190-191, I am confused that why the case of TP is recognized as a successful failure detection case? In my opinion, only the case of TN can be recognized as successful failure detection.

A2. Thank you for your valuable comment. The reviewer commented that TP should not be counted as successful failure detection. We consider TP and TN both successful detections, where TN is a detected failure and TP is a detected non-failure. In other words, TP matters as much as TN, because the network needs to recognize not only failure cases but also success cases in order to decide whether a result is a failure. We have made the definition clearer in the paper.

From line 189, 

“The TP is a case in which both the mIoU prediction value and the GT mIoU value are greater than 0.5 and the TN is smaller. In both cases, the detection of the failure case was successful” has been changed to:

“True positive (TP) and true negative (TN) are the cases in which the performance of ESPNet is correctly predicted by mIoUNet. In more detail, TP is defined when both the mIoU prediction value and the GT mIoU value are greater than the threshold value of 0.5, whereas TN is defined when both are smaller than 0.5. In both cases, mIoUNet successfully detects not only the failure case but also the success case for ESPNet.”

Q3. The description on Line 193 may be wrong.

A3. Thank you for your valuable comment. As you mentioned in Q2, we acknowledge that FP and FN were not clearly explained in relation to failure detection. False positive and false negative are the cases in which mIoUNet fails to predict the failure or success of ESPNet.

From line 193, 

“On the other hand, the FP means that the mIoU prediction value would be greater than 0.5, but the GT mIoU prediction value would be less than 0.5, the FN predicted that the mIoU prediction value would be greater than 0.5. In both cases, it is defined that the failure to properly detect the failure case.” has been changed to:

“On the other hand, false positive (FP) and false negative (FN) are when mIoUNet fails to predict the failure or success case of ESPNet. FP is defined as the mIoU prediction value is larger than 0.5, but the GT mIoU prediction value is less than 0.5. On the contrary, FN is when the mIoU prediction value would be greater than 0.5. In both cases, we defined that the mIOUNet fails to detect properly the failure cases and success cases.”

Q4. In Table 4, the definition of mIoU accuracy is missing.

A4. Thank you for your valuable comment. As you mentioned, we found that the mIoU accuracy was not defined in the paper. The mIoU accuracy is calculated with formulas (5) and (6): the absolute differences between the GT mIoU and the predicted mIoU are averaged, subtracted from 1, and multiplied by 100 to express a percentage. To avoid confusion, we also changed “mIoU accuracy” to “mIoU prediction accuracy” in all tables, and added an extra explanation for future readers.

In line 226, we added:

“The mIoU prediction accuracy is calculated using (5), (6)”
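Assuming formulas (5) and (6) match the verbal description in A4 (one minus the mean absolute error between GT and predicted mIoU, expressed as a percentage), the metric can be sketched as follows; the function name and sample values are illustrative, not from the paper:

```python
def miou_prediction_accuracy(gt_mious, pred_mious):
    """Percentage accuracy of the mIoU regression:
    100 * (1 - mean absolute error between GT and predicted mIoU)."""
    errors = [abs(g - p) for g, p in zip(gt_mious, pred_mious)]
    return 100.0 * (1.0 - sum(errors) / len(errors))

print(miou_prediction_accuracy([0.6, 0.4], [0.7, 0.5]))  # MAE 0.1 -> 90.0
```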

Q5. In Table 8, the definitions of failure detection accuracy and mIoU prediction accuracy are missing.  

A5. Thank you for your valuable comment. We acknowledge that the definitions of failure detection accuracy and mIoU prediction accuracy could not be found easily in the paper. The mIoU prediction accuracy is calculated as described in A4. The failure detection accuracy is obtained by dividing the number of TP and TN cases by the number of test images. To be clear, we now mention the formulas used for calculating both values.

In line 263, we added

“The mIoU prediction accuracy is obtained using (5), (6). Failure detection accuracy is calculated using (7).”
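A hedged sketch of failure detection accuracy as described here, assuming formula (7) counts an image as correct when the predicted and GT mIoU fall on the same side of the 0.5 threshold (i.e., a TP or TN case); the function name and sample values are ours:

```python
def failure_detection_accuracy(gt_mious, pred_mious, threshold=0.5):
    """Percentage of images whose success/failure status is predicted
    correctly, i.e. both mIoU values fall on the same side of the threshold
    (TP + TN divided by the number of test images)."""
    correct = sum((g > threshold) == (p > threshold)
                  for g, p in zip(gt_mious, pred_mious))
    return 100.0 * correct / len(gt_mious)

# two images classified correctly; the third (0.6 vs 0.4) straddles the threshold
print(round(failure_detection_accuracy([0.7, 0.3, 0.6], [0.8, 0.2, 0.4]), 1))  # 66.7
```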

Q6. In Figure 13, why are the failure detection results of the third and fifth images not TN?  

A6. As you mentioned, we made mistakes in selecting images for the figure. The third and fifth images are TN, as both the mIoU prediction and the GT mIoU are lower than 0.5. We replaced those images with FP and FN cases, in which the mIoU prediction and the GT mIoU give different results.

Figure 13 has been changed as follows (please see attachment).

Q7. In Figure 15, why all the failure detection results are FP, not TP?

A7. Thank you for your valuable comment. We acknowledge that there was an error in selecting the pictures. Except for the first image, the four images were FN cases. Note that the HMG dataset uses a threshold value of 0.6 instead of the 0.5 used for Cityscapes. We replaced the images in the figure with FP cases.

Figure 15 has been changed as follows  (please see attachment).

 

Thank you very much for your sincere comments.

 

Please also refer to Revision Note_Round 2.

Author Response File: Author Response.pdf

Round 3

Reviewer 2 Report

On Lines 196-197, FN should be the mIoU prediction value less than 0.5.

Author Response

Q1. On Lines 196-197, FN should be the mIoU prediction value less than 0.5.

A1. Thank you for your kind revision. We acknowledge that the false negative was wrongly defined. As you mentioned, FN should correspond to an mIoU prediction value less than 0.5. To avoid misunderstanding, we clarified the definition as follows.

“FP is defined as the mIoU prediction value is larger than 0.5, but the GT mIoU prediction value is less than 0.5. On the contrary, FN is when the mIoU prediction value would be greater than 0.5.”

has been changed to

“FP is defined as the case in which the mIoU prediction value is larger than 0.5 but the GT mIoU value is less than 0.5. On the contrary, FN is defined as the case in which the mIoU prediction value is smaller than 0.5 but the GT mIoU value is greater than 0.5.”
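The corrected definitions can be summarized as a small helper function (a sketch in our own naming; the 0.5 threshold follows the Cityscapes setting, while the authors note that HMG uses 0.6):

```python
def classify_case(pred_miou, gt_miou, threshold=0.5):
    """TP/TN: mIoUNet correctly predicts ESPNet's success/failure;
    FP: predicted above the threshold but GT below; FN: the reverse."""
    pred_ok = pred_miou > threshold
    gt_ok = gt_miou > threshold
    if pred_ok and gt_ok:
        return "TP"   # success correctly detected
    if not pred_ok and not gt_ok:
        return "TN"   # failure correctly detected
    return "FP" if pred_ok else "FN"

print(classify_case(0.7, 0.8))  # TP
print(classify_case(0.6, 0.3))  # FP
print(classify_case(0.4, 0.7))  # FN
```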

Thank you very much for your sincere comments.

 

Please also refer to Revision Note_Round 3_Reviewer 2.

Author Response File: Author Response.pdf
