Article
Peer-Review Record

A New CNN-Bayesian Model for Extracting Improved Winter Wheat Spatial Distribution from GF-2 imagery

by Chengming Zhang 1,2, Yingjuan Han 3, Feng Li 4,*, Shuai Gao 5, Dejuan Song 1,2, Hui Zhao 3, Keqi Fan 1 and Ya’nan Zhang 1
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Submission received: 30 January 2019 / Revised: 8 March 2019 / Accepted: 12 March 2019 / Published: 14 March 2019

Round 1

Reviewer 1 Report

General feedback

The manuscript entitled "A New CNN-Bayesian Model for Extracting Improved Winter Wheat Spatial Distribution from GF-2 imagery" addresses the identification of winter wheat in high-resolution satellite images, with the main focus on improving the performance of a CNN in the edge regions. The authors propose a CNN-Bayesian classification model, which uses a two-level approach for determining the category of a pixel. To process the low-confidence pixels, a modified Bayesian model was used. The results show that the proposed CNN-Bayesian model outperforms SegNet and DeepLab in accuracy.

The research topic is relevant and within the scope of the journal. The manuscript is well structured and the methods seem to be appropriate. The results of the study will positively contribute to the existing knowledge on this research question. However, there are some aspects of the manuscript which need to be clarified.

 

Major comments: 

1. The proposed CNN-Bayesian model shows a high performance in comparison to other classification models. However, it would be interesting to see how such a CNN would perform on this dataset without the proposed Bayesian model for determining the category of low-confidence pixels. This would give you a possibility to properly evaluate the power of the Bayesian model you propose.

In my opinion, your comparison to SegNet and DeepLab is not entirely fair, since the architectures of these models are quite different. So, SegNet seems to also perform worse in terms of feature extraction, which resulted in poorer predictions not only in the edge regions but also in the inner parts of the polygons (as illustrated in Figure 5). Please comment.

2. Your testing set is quite small: only 10 image subsets of 700 x 700 pixels, which were collected from only three GF-2 images. These might not be enough for a proper validation of your model. What was your motivation to use 29 images (from 32 in total) for establishing the training set, and only three for the testing set? Why not 20 GF-2 images for training and 12 for testing? Applying a cross-validation technique would allow you to use your data set efficiently.

3. The comparison to other similar works is missing in the Discussion. The problem of misclassification in edge regions when using CNNs is not new and there are many other scientific works on this topic, for example, combining a CNN with a CRF, as you already mentioned in the Introduction and as was used in DeepLab, or with a Recurrent Neural Network (as was used, for example, in Maggiori et al. 2017. Recurrent Neural Networks to Correct Satellite Image Classification Maps. IEEE Transactions on Geoscience and Remote Sensing, 55(9)).


Other comments: 

L. 63-64. I agree with the statement that the traditional methods for extracting features from spectral information (i.e. calculating vegetation indices) might be not so efficient for high-resolution images in comparison to those with medium- and low-resolution. But there are other methods, such as extracting textural information (e.g., Gray-Level Co-Occurrence Matrix) or object-based image analysis. Please revise and complement. 

L. 63. Please replace "spectral statistical characteristics" by "spectral characteristics".

L. 158. It is not clear for which reasons these 479 sample points were collected. In L. 268-269 you mentioned that your training set was labeled using visual interpretation. Please clarify here the goal of collecting these sample points. 

L. 160. A table with a detailed overview of your training and testing sets would be helpful for the reader. Please provide information on how many samples (number of pixels) per category you used for training the models.

Formula 1. What was your motivation to propose this change to the original pooling method? How might it affect the model performance?

L. 246. The manuscript is quite rich in formulas. Please keep only the formulas which reflect the changes you propose to the training process of a standard neural network.


Please add to this section ("Training Model") the information about the hyperparameter setup you used to train your model, as well as SegNet and DeepLab. What mini-batch size, learning rate and momentum did you use? How many epochs? Did you apply any techniques to prevent overfitting?

L. 272. How was the confidence threshold ("0.23 in this study") determined?

L. 293. The formulas listed in this section (13-16) are not necessary since these are widely used standard accuracy metrics. Please replace them by adding corresponding reference. 

L. 332. Significant at which significance level? Did you perform some significance tests?

L. 340. Replace “background” with “other categories”. 

L. 380-381. This sentence is not clear. Please rephrase.  

L. 383. See the comment for the l 332. The reader might expect to see results of some significance test. 

Figures 8 and 9. What data were used for these visualizations? Are these the results of applying your model, SegNet and DeepLab to your test set?

I would recommend using histograms for this visualization. This would improve the readability.

 

Decision: major revision. 

Author Response

Dear Reviewer:

We would like to thank you for your comments and suggestions. We have substantially revised the manuscript according to your good suggestions, and detailed responses are provided below. All revised contents are in blue.

 

Major comments:

 

1. The proposed CNN-Bayesian model shows a high performance in comparison to other classification models. However, it would be interesting to see how such a CNN would perform on this dataset without the proposed Bayesian model for determining the category of low-confidence pixels. This would give you a possibility to properly evaluate the power of the Bayesian model you propose.

In my opinion, your comparison to SegNet and DeepLab is not entirely fair, since the architectures of these models are quite different. So, SegNet seems to also perform worse in terms of feature extraction, which resulted in poorer predictions not only in the edge regions but also in the inner parts of the polygons (as illustrated in Figure 5). Please comment.

 

Reply: According to your good suggestions, we have revised the relevant content. Firstly, we used the CNN-Bayesian model with its second-level classifier removed as another comparison model, named VGG-Ex, to better isolate the role of the Bayesian classifier. Secondly, we added a paragraph in Section 5.1 to explain why the accuracy for winter wheat inner pixels is higher than that for winter wheat edge pixels.

 

The revised relevant content of Section 4 is as follows:

4.1. Experimental Setups

 

SegNet [35] and DeepLab [37] are classic semantic segmentation models for images that have achieved good results in the processing of camera images. Moreover, the working principles of these two models are similar to that of our study, and we therefore chose them as comparison models to better reflect the advantages of our model in feature extraction and classification. We also used the CNN-Bayesian model with its second-level classifier removed as another comparison model, named VGG-Ex, to better compare the role of the Bayesian classifier.

We used data enhancement techniques on the dataset to prevent overfitting, and each image was randomly processed in brightness, saturation, hue, and contrast. After the processing is completed, each image is rotated and transformed, and each image is rotated three times (90°, 180°, 270°). There are 6100 images in our final data set. We also employed cross-validation strategy for training and testing model to prevent overfitting. During each training and test round, 4880 images randomly selected from the mage-label datasets were used as training data, and the remaining 1220 images were used as test data. The SegNet, DeepLab, VGG-Ex and CNN-Bayesian model were trained with the same image dataset. This was done five times. Table 3 shows the total number of samples of each category used in each training and test round.
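For concreteness, the augmentation and random-split procedure described above could be sketched roughly as follows. This is an illustration only, not the authors' actual code: the jitter ranges, the helper names `augment` and `random_split`, and the use of torchvision are assumptions.

```python
import random
from PIL import Image
from torchvision import transforms  # assumption: torchvision is available

# Random photometric jitter of brightness, saturation, hue and contrast (ranges are illustrative).
jitter = transforms.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.3, hue=0.05)

def augment(image: Image.Image):
    """Return the jittered image plus its 90/180/270-degree rotations."""
    jittered = jitter(image)
    return [jittered.rotate(angle, expand=True) for angle in (0, 90, 180, 270)]

def random_split(samples, n_train=4880, seed=0):
    """Randomly split the augmented image-label pairs into 4880 training and 1220 test items."""
    samples = list(samples)
    random.Random(seed).shuffle(samples)
    return samples[:n_train], samples[n_train:]
```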

Table 3. Total number of samples of each category used in each training and test round.

Category                  Number of training samples    Number of test samples
Winter wheat              1253258035                    318536417
Agricultural buildings    5033165                       1279263
Woodland                  452984832                     115133645
Developed land            956301312                     243059917
Roads                     45298483                      11513364
Water bodies              45298483                      11513364
Farmland                  1212992717                    308302316
Bare fields               1061997773                    269924434

 

4.2. Results and Evaluation

Table 4 shows the confusion matrices for the segmentation results of the four models. Each row of the confusion matrix represents the proportion taken by the actual category, and each column represents the proportion taken by the predicted category. Our approach achieved better classification results. the proportion of “winter wheat” wrongly categorized as “non-winter wheat” was on average 0.033, and the proportion of “non-winter wheat” wrongly classified as “winter wheat” was on average 0.021.

Table 4. Confusion matrix of the winter wheat classification.

Approach         Actual category      Predicted: Winter wheat    Predicted: Non-winter wheat
CNN-Bayesian     Winter wheat         0.669                      0.021
                 Non-winter wheat     0.033                      0.277
VGG-Ex           Winter wheat         0.631                      0.059
                 Non-winter wheat     0.049                      0.261
SegNet           Winter wheat         0.574                      0.116
                 Non-winter wheat     0.093                      0.217
DeepLab          Winter wheat         0.605                      0.085
                 Non-winter wheat     0.063                      0.247

 

In this paper, we used four popular criteria, namely Accuracy, Precision, Recall and the Kappa coefficient, to evaluate the performance of the proposed model [45]. Table 5 shows the values of these evaluation criteria for the four models.

Table 5. Comparison of the four models’ performance.

Index        CNN-Bayesian    VGG-Ex    SegNet    DeepLab
Accuracy     0.946           0.892     0.791     0.852
Precision    0.932           0.878     0.766     0.837
Recall       0.941           0.872     0.756     0.825
Kappa        0.879           0.778     0.616     0.712
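For readers who wish to recompute such criteria, a minimal sketch of the four metrics from a binary confusion matrix is given below. The function and variable names are ours, and the Table 5 values are presumably computed on the full multi-class results, so the binary proportions of Table 4 will not reproduce them exactly.

```python
def binary_metrics(tp, fn, fp, tn):
    """Accuracy, precision, recall and Cohen's kappa from confusion-matrix proportions."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # Chance agreement for kappa, computed from the marginal proportions.
    p_e = ((tp + fn) * (tp + fp) + (fp + tn) * (fn + tn)) / total ** 2
    kappa = (accuracy - p_e) / (1 - p_e)
    return accuracy, precision, recall, kappa

# Illustrative call with the CNN-Bayesian proportions from Table 4.
print(binary_metrics(tp=0.669, fn=0.021, fp=0.033, tn=0.277))
```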

 

To further compare the classification accuracy of planting area edges, we further subdivided the categories into "inner" and "edge" labels. If only winter wheat category pixels are used in the convolution process to extract a pixel's features, it is classified as inner; otherwise it is classified as edge. Table 6 shows the confusion matrices for the segmentation results of the four models.

Table 6. Confusion matrix for winter wheat inner/edge classification.

Approach         Actual category       Winter wheat inner    Winter wheat edge    Non-winter wheat
CNN-Bayesian     Winter wheat inner    0.542                 /                    0.001
                 Winter wheat edge     /                     0.127                0.02
                 Non-winter wheat      0.006                 0.027                0.277
VGG-Ex           Winter wheat inner    0.539                 /                    0.012
                 Winter wheat edge     /                     0.092                0.047
                 Non-winter wheat      0.008                 0.041                0.261
SegNet           Winter wheat inner    0.532                 /                    0.035
                 Winter wheat edge     /                     0.042                0.081
                 Non-winter wheat      0.033                 0.06                 0.217
DeepLab          Winter wheat inner    0.538                 /                    0.026
                 Winter wheat edge     /                     0.067                0.059
                 Non-winter wheat      0.015                 0.048                0.247
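To make the inner/edge subdivision concrete, a rough sketch of such a labeling rule is shown below; the window size and the winter-wheat class code are placeholders, since they depend on the feature-extraction window actually used in the manuscript.

```python
import numpy as np

WHEAT = 1  # placeholder class code for winter wheat

def inner_edge_labels(label_map: np.ndarray, window: int = 3) -> np.ndarray:
    """Mark each winter-wheat pixel as 'inner' if its whole window is winter wheat, else 'edge'."""
    pad = window // 2
    padded = np.pad(label_map, pad, mode="edge")
    out = np.full(label_map.shape, "other", dtype=object)
    for i in range(label_map.shape[0]):
        for j in range(label_map.shape[1]):
            if label_map[i, j] != WHEAT:
                continue
            patch = padded[i:i + window, j:j + window]
            out[i, j] = "inner" if np.all(patch == WHEAT) else "edge"
    return out
```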

 

As can be seen from Table 6, the accuracy of the inner category was similar across the four models' results, but the CNN-Bayesian model was more accurate with regard to the edge category. The accuracy of the CNN-Bayesian model in edge recognition is three times higher than that of SegNet and two times higher than that of DeepLab. Comparing the accuracy of the CNN-Bayesian model on the winter wheat edge category with that of VGG-Ex shows that the ability of the CNN-Bayesian model to recognize the winter wheat edge is improved by nearly 30% due to the use of the Bayesian classifier.

Figure 5 shows ten images and the corresponding results randomly selected from the tested images, each containing 1024 × 1024 pixels. The CNN-Bayesian model misclassified only a small number of pixels at the corners of the winter wheat planting areas. In the DeepLab and VGG-Ex results, the misclassified pixels were mainly distributed at the junction of winter wheat and non-winter wheat areas, including edge and corner locations, but the number of misclassified pixels in the VGG-Ex results is smaller than in the DeepLab results. The SegNet results had the most errors, which were scattered throughout the image; most misclassified pixels were located on the edges and corners, with some also occurring inside the planting areas.

Figure 5. Comparison of segmentation results for Gaofen 2 satellite imagery: (a) original images, (b) ground truth, (c) results of CNN-Bayesian, (d) results of VGG-Ex, (e) results of SegNet, and (f) results of DeepLab.

 

The new paragraph in Section 5.1 is as follows:

 

As can be seen from the statistical results of SegNet, although the feature values of winter wheat inner pixels and winter wheat edge pixels are scattered, the feature values of winter wheat inner pixels basically do not overlap with the feature values of other categories. However, the overlap between the feature values of winter wheat edge pixels and those of other categories is large, which is why the accuracy for winter wheat inner pixels is higher than that for winter wheat edge pixels.

 

 

2. Your testing set is quite small: only 10 image subsets of 700 x 700 pixels, which were collected from only three GF-2 images. These might not be enough for a proper validation of your model. What was your motivation to use 29 images (from 32 in total) for establishing the training set, and only three for the testing set? Why not 20 GF-2 images for training and 12 for testing? Applying a cross-validation technique would allow you to use your data set efficiently.

 

Reply: According to your good suggestions, we redesigned and re-ran the experiment. The images used now have a size of 1024 × 1024 pixels. We then revised the relevant content; the revised content is as follows:

 

The revised content of Section 2.3 is as follows:

We selected 305 non-overlapping region images from the GF-2 images described in Section 2.2 to establish the image-label dataset for training and testing; each image contained 1024 × 1024 pixels. The dataset covered all land use types of the study area, including winter wheat, agricultural buildings, woodland, developed land, roads, water bodies, farmland and bare fields. We created a label file for each image, which records the category number of each pixel in the image. In combination with the ground investigation data described in Section 2.2.2, we used visual interpretation and ENVI software to establish the label files. Figure 3 illustrates a training image and the corresponding label file.

 

 

The revised content of Section 4.1 is as follows:

SegNet [35] and DeepLab [37] are classic semantic segmentation models for images that have achieved good results in the processing of camera images. Moreover, the working principles of these two models are similar to that of our study, and we therefore chose them as comparison models to better reflect the advantages of our model in feature extraction and classification. We also used the CNN-Bayesian model with its second-level classifier removed as another comparison model, named VGG-Ex, to better compare the role of the Bayesian classifier.

We used data enhancement techniques on the dataset to prevent overfitting, and each image was randomly processed in brightness, saturation, hue, and contrast. After the processing is completed, each image is rotated and transformed, and each image is rotated three times (90°, 180°, 270°). There are 6100 images in our final data set. We also employed cross-validation strategy for training and testing model to prevent overfitting. During each training and test round, 4880 images randomly selected from the mage-label datasets were used as training data, and the remaining 1220 images were used as test data. The SegNet, DeepLab, VGG-Ex and CNN-Bayesian model were trained with the same image dataset. This was done five times. Table 3 shows the total number of samples of each category used in each training and test round.

Table 3. Total number of samples of each category used in each training and test round.

Category                  Number of training samples    Number of test samples
Winter wheat              1253258035                    318536417
Agricultural buildings    5033165                       1279263
Woodland                  452984832                     115133645
Developed land            956301312                     243059917
Roads                     45298483                      11513364
Water bodies              45298483                      11513364
Farmland                  1212992717                    308302316
Bare fields               1061997773                    269924434

 

 

 

3. The comparison to other similar works is missing in the Discussion. The problem of misclassification in edge regions when using CNNs is not new and there are many other scientific works on this topic, for example, combining a CNN with a CRF, as you already mentioned in the Introduction and as was used in DeepLab, or with a Recurrent Neural Network (as was used, for example, in Maggiori et al. 2017. Recurrent Neural Networks to Correct Satellite Image Classification Maps. IEEE Transactions on Geoscience and Remote Sensing, 55(9)).

 

Reply: According to your good suggestions, we added Section 5.3 to present the comparison with other similar works. The new content is as follows.

5.3. Comparison to other similar works

At present, there are some methods that focus on improving the classification accuracy of edge regions [43-45, 67]. These methods describe the association between inputs at the semantic level, so that the relationship between the predicted labels of adjacent pixels can be modeled; the prediction results are therefore not only related to the features of the predicted pixel but are also affected by the results of previous predictions. Our method instead describes the statistical characteristics of the inputs: the prediction result is determined by the features of the pixel itself and the regional statistical features, which is more in line with the characteristics of remote sensing data.

 

 

 

Other comments:

 

(1) L. 63-64. I agree with the statement that the traditional methods for extracting features from spectral information (i.e. calculating vegetation indices) might be not so efficient for high-resolution images in comparison to those with medium- and low-resolution. But there are other methods, such as extracting textural information (e.g., Gray-Level Co-Occurrence Matrix) or object-based image analysis. Please revise and complement.

 

Reply: According to your good suggestions, we have revised the relevant content, the revised content is as follows:

 

The spectral characteristics of low- and middle-resolution remote sensing images are usually stable. Vegetation indexes are generally used as pixel features in studies using data from sources including the Moderate Resolution Imaging Spectroradiometer (MODIS) [6,13–16], Enhanced Thematic Mapper/Thematic Mapper [13,17], and Systeme Probatoire d'Observation de la Terre [7,10]. These indexes include the normalized difference vegetation index (NDVI) [5,6,13–15], relationship analysis of NDVI [8], and the enhanced vegetation index (EVI) [3,18], which are extracted from band values. Common classification methods include decision trees [5,11,13], linear regression [6], statistics [7], filtration [13], time-series analysis [14,15], the iterative self-organizing data analysis technique (ISODATA) [16], and the Mahalanobis distance [17]. Texture features can better describe the spatial structure of pixels; the Gray-Level Co-Occurrence Matrix is a commonly used texture feature [19], and Gabor filters [20] and the wavelet transform [19,21] are often used to extract texture features. Moreover, object-based image analysis technology is also widely used in per-pixel classification [22,23]. Such methods can successfully extract the spatial distribution of winter wheat and other crops, but limitations in spatial resolution restrict the applicability of the results.
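For reference, the vegetation indexes mentioned above are simple band combinations; the sketch below assumes reflectance arrays scaled to 0–1 and uses the commonly published coefficients, which may differ from those used in the cited studies.

```python
import numpy as np

def ndvi(nir: np.ndarray, red: np.ndarray) -> np.ndarray:
    """Normalized difference vegetation index: (NIR - Red) / (NIR + Red)."""
    return (nir - red) / (nir + red + 1e-9)

def evi(nir: np.ndarray, red: np.ndarray, blue: np.ndarray) -> np.ndarray:
    """Enhanced vegetation index with the widely used G=2.5, C1=6, C2=7.5, L=1 coefficients."""
    return 2.5 * (nir - red) / (nir + 6.0 * red - 7.5 * blue + 1.0)
```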

 

 

(2) L. 63. Please replace "spectral statistical characteristics" by "spectral characteristics".

Reply: According to your good suggestions, we have replaced "spectral statistical characteristics" with "spectral characteristics".

 

 

(3) L. 158. It is not clear for which reasons these 479 sample points were collected. In L. 268-269 you mentioned that your training set was labeled using visual interpretation. Please clarify here the goal of collecting these sample points.

 

Reply: According to your good suggestions, we have revised the relevant content, the revised content is as follows:

 

The main land cover in Zhangqiu County during winter includes winter wheat, agricultural buildings, woodland, developed land, roads, water bodies, farmland, and bare fields. In fused GF-2 images, bare fields, agricultural buildings, developed land, water bodies, farmland, and roads are all visually distinct from each other and vegetated areas during winter. In order to accurately distinguish whether a piece of vegetation area is winter wheat or woodland in manual visual interpretation, the sample information of winter wheat area and woodland area should be obtained, so we conducted ground investigations in 2017 and 2018, obtaining 367 sample points (251 winter wheat, 116 woodland); time, location and land use were recorded for all points (Figure 2b).

 

 

(4) L. 160. A table with a detailed overview of your training and testing sets would be helpful for the reader. Please provide information on how many samples (number of pixels) per category you used for training the models.

 

Reply: According to your good suggestions, we have revised the relevant content in Section 4.1, the revised content is as follows:

 

We used data enhancement techniques on the dataset to prevent overfitting, and each image was randomly processed in brightness, saturation, hue, and contrast. After the processing is completed, each image is rotated and transformed, and each image is rotated three times (90°, 180°, 270°). There are 6100 images in our final data set. We also employed cross-validation strategy for training and testing model to prevent overfitting. During each training and test round, 4880 images randomly selected from the mage-label datasets were used as training data, and the remaining 1220 images were used as test data. The SegNet, DeepLab, VGG-Ex and CNN-Bayesian model were trained with the same image dataset. This was done five times. Table 3 shows the total number of samples of each category used in each training and test round.

Table 3. Total number of samples of each category used in each training and test round.

Category                  Number of training samples    Number of test samples
Winter wheat              1253258035                    318536417
Agricultural buildings    5033165                       1279263
Woodland                  452984832                     115133645
Developed land            956301312                     243059917
Roads                     45298483                      11513364
Water bodies              45298483                      11513364
Farmland                  1212992717                    308302316
Bare fields               1061997773                    269924434

 

 

 

(5) Formula 1. What was your motivation to propose this change to the original pooling method? How might it affect the model performance?

 

Reply: According to your good suggestions, we have revised the relevant content in Section 5.1, the revised content is as follows:

Because the CNN-Bayesian model and the VGG-Ex model use the same feature extractor, we selected the most different set of semantic features from the last layer of the CNN-Bayesian, SegNet, and DeepLab models for comparative analysis; Figure 6 shows the statistical results. The degree of confusion in the CNN-Bayesian model results is smaller than that in the other two models because its network structure and data organization mode are better, and the improved pooling algorithm used in the feature extractor has a larger receptive field and a greater advantage in feature aggregation than the classical pooling algorithm. The CNN-Bayesian feature extractor can keep the size of the feature image of the last layer unchanged without using deconvolution. Furthermore, it can eliminate location errors of the feature values that may be caused by the deconvolution operation and ensure a one-to-one correspondence between feature values and pixels, thus reducing the degree of confusion between the features of winter wheat edge and non-winter wheat areas. Compared with the comparison models, the CNN-Bayesian model better suits the data characteristics of high-resolution remote sensing images.
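The exact form of the improved pooling is given by Formula 1 in the manuscript and is not reproduced here; as a purely illustrative analogue, a stride-1 pooling with padding shows how a pooling step can enlarge the receptive field while leaving the feature-map size unchanged, so that no deconvolution is needed afterwards.

```python
import torch
import torch.nn as nn

# Illustration only (not the manuscript's Formula 1): stride-1 pooling with padding
# aggregates a 3x3 neighbourhood per output value yet preserves the spatial size,
# keeping a one-to-one correspondence between feature values and pixels.
pool = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)

features = torch.randn(1, 64, 128, 128)   # (batch, channels, height, width)
pooled = pool(features)
assert pooled.shape == features.shape      # spatial size is preserved
```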

 

 

(6) L. 246. The manuscript is quite rich in formulas. Please keep only the formulas which reflect the changes you propose to the training process of a standard neural network.

Reply: According to your good suggestions, we have deleted these formulas and kept only the formulas which reflect the changes we propose to the training process of a standard neural network.

 

(7) Please add to this section ("Training Model") the information about the hyperparameter setup you used to train your model, as well as SegNet and DeepLab. What mini-batch size, learning rate and momentum did you use? How many epochs? Did you apply any techniques to prevent overfitting?

 

Reply: According to your good suggestions, we have revised the relevant content in Section 3.2, the revised content is as follows:

 

We trained the CNN-Bayesian model in an end-to-end manner; the B-level classifier does not participate in the training stage. The parameters required for the B-level classifier to perform its calculations are obtained by statistics after training is completed. The training stage consists of the following steps:

1. Image-label pairs are input into the CNN-Bayesian model as a training sample dataset, and parameters are initialized.

2. Forward propagation is performed on the sample images.

3. The loss is calculated and back-propagated through the CNN-Bayesian model.

4. The network parameters are updated using stochastic gradient descent [45] with momentum.

Steps 2–4 are iterated until the loss is less than the predetermined threshold value.

Table 1 shows the hyperparameter setup we used to train our model. In the comparison experiments, the same hyperparameters were also applied to the comparison models.

Table 1. The hyperparameter setup.

Hyperparameter     Value
Mini-batch size    32
Learning rate      0.0001
Momentum           0.9
Epochs             20000
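As an illustration of how the Table 1 hyperparameters enter the training loop described in the steps above, a skeletal PyTorch-style sketch is given below; the tiny network and random data are placeholders standing in for the CNN-Bayesian feature extractor/encoder and the GF-2 image-label dataset.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholder network and data; only the hyperparameters mirror Table 1.
model = nn.Sequential(nn.Conv2d(4, 8, 3, padding=1), nn.ReLU(), nn.Conv2d(8, 9, 1))
data = TensorDataset(torch.randn(64, 4, 32, 32), torch.randint(0, 9, (64, 32, 32)))
loader = DataLoader(data, batch_size=32, shuffle=True)          # mini-batch size 32

optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)
criterion = nn.CrossEntropyLoss()

for epoch in range(20000):                                      # 20000 epochs (Table 1)
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)                 # forward pass and per-pixel loss
        loss.backward()                                         # back-propagation
        optimizer.step()                                        # SGD-with-momentum update
```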

 

The added content in Section 4.1, illustrating the techniques used to prevent overfitting, is as follows:

We used data enhancement techniques on the dataset to prevent overfitting, and each image was randomly processed in brightness, saturation, hue, and contrast. After the processing is completed, each image is rotated and transformed, and each image is rotated three times (90°, 180°, 270°). There are 6100 images in our final data set. We also employed cross-validation strategy for training and testing model to prevent overfitting. During each training and test round, 4880 images randomly selected from the mage-label datasets were used as training data, and the remaining 1220 images were used as test data. The SegNet, DeepLab, VGG-Ex and CNN-Bayesian model were trained with the same image dataset. This was done five times. Table 3 shows the total number of samples of each category used in each training and test round.

 

 

(8) L. 272. How was the confidence threshold ("0.23 in this study") determined?

 

Reply: According to your good suggestions, we have added a paragraph in Section 5.2; the new content is as follows:

As can be seen from Figures 8 and 9, for the CNN-Bayesian model the number of pixels with confidence lower than 0.23 is small, but the proportion of misclassifications among them is very large. This is why we chose 0.23 as the confidence threshold, as described in Section 3.3.
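Given the manuscript's definition of confidence as the gap between the largest and second-largest class probabilities, the role of the 0.23 threshold can be sketched as follows; the Bayesian second-level classifier is only stubbed as a callable here, since its internals are described in Section 3.3.

```python
import numpy as np

THRESHOLD = 0.23  # confidence threshold reported in the manuscript

def classify_pixel(probs: np.ndarray, bayesian_fallback) -> int:
    """Accept the argmax when the probability gap is large enough; otherwise defer to the
    second-level (Bayesian) classifier."""
    top_two = np.sort(probs)[-2:]
    confidence = top_two[1] - top_two[0]       # maximum minus second-maximum probability
    if confidence >= THRESHOLD:
        return int(np.argmax(probs))
    return bayesian_fallback(probs)            # low-confidence pixel: Bayesian model decides
```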

 

 

(9) L. 293. The formulas listed in this section (13-16) are not necessary since these are widely used standard accuracy metrics. Please replace them by adding corresponding reference.

Reply: We have done so according to your good suggestions.

 

 

(10) L. 332. Significant at which significance level? Did you perform some significance tests?

 

Reply: According to your good suggestions, we have revised the relevant content, the new content is as follows:

 

To further compare the classification accuracy of planting area edges, we further subdivided the categories into "inner" and "edge" labels. If only winter wheat category pixels are used in the convolution process to extract a pixel's features, it is classified as inner; otherwise it is classified as edge. Table 6 shows the confusion matrices for the segmentation results of the four models.

Table 6. Confusion matrix for winter wheat inner/edge classification.

Approach         Actual category       Winter wheat inner    Winter wheat edge    Non-winter wheat
CNN-Bayesian     Winter wheat inner    0.542                 /                    0.001
                 Winter wheat edge     /                     0.127                0.02
                 Non-winter wheat      0.006                 0.027                0.277
VGG-Ex           Winter wheat inner    0.539                 /                    0.012
                 Winter wheat edge     /                     0.092                0.047
                 Non-winter wheat      0.008                 0.041                0.261
SegNet           Winter wheat inner    0.532                 /                    0.035
                 Winter wheat edge     /                     0.042                0.081
                 Non-winter wheat      0.033                 0.06                 0.217
DeepLab          Winter wheat inner    0.538                 /                    0.026
                 Winter wheat edge     /                     0.067                0.059
                 Non-winter wheat      0.015                 0.048                0.247

 

As can be seen from Table 6, the accuracy of the inner category was similar across the four models' results, but the CNN-Bayesian model was more accurate with regard to the edge category. The accuracy of the CNN-Bayesian model in edge recognition is three times higher than that of SegNet and two times higher than that of DeepLab. Comparing the accuracy of the CNN-Bayesian model on the winter wheat edge category with that of VGG-Ex shows that the ability of the CNN-Bayesian model to recognize the winter wheat edge is improved by nearly 30% due to the use of the Bayesian classifier.

 

(11) L. 340. Replace “background” with “other categories”.

 

Reply: We have done so according to your good suggestions; the new content is as follows:

To distinguish winter wheat from other categories, a popular deep learning algorithm CNN was applied to explore the features.

 

(12) L. 380-381. This sentence is not clear. Please rephrase. 

Reply: According to your good suggestions, we have revised the relevant content; the new content is as follows:

We select the number of pixels in each confidence level of the CNN-Bayesian, VGG-Ex, SegNet, and DeepLab models for comparative analysis (Figure 8).

 

(13) L. 383. See the comment for the l 332. The reader might expect to see results of some significance test.

Reply: According to your good suggestions, we have deleted the word "significance".

 

 

(14) Figures 8 and 9. What data were used for these visualizations? Are these the results of applying your model, SegNet and DeepLab to your test set?

I would recommend using histograms for this visualization. This would improve the readability.

 

Reply: According to your good suggestions, we have revised the relevant content; the new content is as follows:

 

 

We select the number of pixels in each confidence level of the CNN-Bayesian, VGG-Ex, SegNet, and DeepLab models for comparative analysis (Figure 8). The pixel ratio of the SegNet and DeepLab models is higher than that of the CNN-Bayesian model and VGG-Ex at a lower confidence level. Considering that the four models use the same encoder, this shows that the feature composition of the CNN-Bayesian model is more reasonable, because it uses color and texture features in addition to the high-level semantic features used by all three models.

 

Figure 8. Distribution of confidence values for the four models.

As the confidence increases, the classification errors of the four models decrease and the degree of reduction increases (Figure 9). This is because the confidence value directly reflects the degree to which the pixel characteristics match the overall category characteristics and thus the likelihood that the classification result is correct. Therefore, it is reasonable to choose confidence value as the index of the confidence that a given pixel will be classified into a certain category.

Figure 9. Distribution of misclassified pixels for all four models.

Overall, these results show that the CNN-Bayesian model is more capable than the comparison models, reflecting its advantageous use of a two-level classifier structure. Since the second-level classifier makes full use of the confidence and planting structure information, the number of misclassified pixels is effectively reduced.
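The statistics behind Figures 8 and 9 (pixel counts and misclassification counts per confidence level) could be reproduced with a simple binning routine such as the sketch below; the bin edges are assumptions, not necessarily those used in the figures.

```python
import numpy as np

def confidence_histogram(confidence, predicted, truth, n_bins=10):
    """Count total pixels and misclassified pixels per confidence bin."""
    confidence = np.asarray(confidence, dtype=float)
    wrong = np.asarray(predicted) != np.asarray(truth)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(confidence, edges) - 1, 0, n_bins - 1)
    totals = np.bincount(idx, minlength=n_bins)
    errors = np.bincount(idx, weights=wrong.astype(float), minlength=n_bins)
    return edges, totals, errors
```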

 

 

 


Author Response File: Author Response.pdf

Reviewer 2 Report

This paper proposes a per-pixel classification model using CNN and Bayesian models (CNN-Bayesian model) for improving the extraction accuracy of the spatial distribution of winter wheat from high-resolution remote sensing imagery. The proposed method employs a feature extractor to generate a feature vector for each pixel, an encoder to transform the feature vector of each pixel into a category-code vector, and a two-level classifier that utilizes the difference between elements of category-probability vectors as confidence to perform per-pixel classification. Some suggestions are given below:

It would be better to clarify why the Bayesian model is used with the CNN, i.e. the motivation is unclear to me. Moreover, some experiments should be conducted to support the proposed idea, e.g. using only either the CNN or the Bayesian model.

Is the proposed method trained in an end-to-end manner, i.e., both the CNN module and the Bayesian one? If not, is it possible?

How do you determine the optimal value of the parameters? It would be better to investigate the influence of the parameters.

Some recent works on high-resolution remote sensing images and deep learning approaches are encouraged to be reviewed, e.g., Embedding Structured Contour and Location Prior in Siamesed Fully Convolutional Networks for Road Detection; Structured AutoEncoders for Subspace Clustering; Multiple Marginal Fisher Analysis; Locality Adaptive Discriminant Analysis for Spectral-Spatial Classification of Hyperspectral Images.


Author Response

Dear Reviewer:

We would like to thank you for your comments and suggestions. We have substantially revised the manuscript according to your good suggestions, and detailed responses are provided below. All revised contents are in blue.

 

1. It would be better to clarify why the Bayesian model is used with the CNN, i.e. the motivation is unclear to me. Moreover, some experiments should be conducted to support the proposed idea, e.g. using only either the CNN or the Bayesian model.

 

Reply: According to your good suggestions, we have revised the relevant content. Firstly, we have rewritten the last paragraph of Section 1 and divided it into two paragraphs, adding content to clarify why the Bayesian model is used with the CNN. Secondly, we used the CNN-Bayesian model with its second-level classifier removed as another comparison model, named VGG-Ex, to better compare the role of the Bayesian classifier.

 

The revised content of last paragraph of Section 1 is as follows:

Previous experimental results [44-56] have shown that misclassified pixels are primarily located at the intersections of two land use types, such as field edges or corners. This is because, when the features of pixels in these areas are acquired, the pixel blocks used usually contain more pixels of other categories, so that the resulting features often differ from the features of the inner pixels of the planting area, which frequently causes classification errors. By analyzing the probability vectors of these misclassified pixels, it can be found that the difference between the maximum probability value and the second-maximum probability value is generally small. These errors are due to the inherent structure of the convolutional layer, which needs to be combined with the classifier to be improved.

The Bayesian model can synthesize information from different sources and improve the reliability of inferred conclusions [71,72]. Therefore, when judging the category of a pixel whose difference between the maximum probability value and the second-maximum probability value is small, the spatial structure information of the pixels can be further introduced to improve the reliability of the judgment by using the Bayesian model. In this study, we developed a new CNN consisting of a feature extractor, encoder, and a Bayesian classifier, which we refer to as a Bayesian Convolutional Neural Network (CNN-Bayesian model). We then used this model to extract winter wheat spatial distribution information from Gaofen 2 (GF-2) remote sensing imagery and compared the results with those achieved by other methods.

 

The revised relevant content of Section 4 is as follows:

4.1. Experimental Setups

 

SegNet [35] and DeepLab [37] are classic semantic segmentation models for images that have achieved good results in the processing of camera images. Moreover, the working principles of these two models are similar to that of our study, and we therefore chose them as comparison models to better reflect the advantages of our model in feature extraction and classification. We also used the CNN-Bayesian model with its second-level classifier removed as another comparison model, named VGG-Ex, to better compare the role of the Bayesian classifier.

We used data enhancement techniques on the dataset to prevent overfitting, and each image was randomly processed in brightness, saturation, hue, and contrast. After the processing is completed, each image is rotated and transformed, and each image is rotated three times (90°, 180°, 270°). There are 6100 images in our final data set. We also employed cross-validation strategy for training and testing model to prevent overfitting. During each training and test round, 4880 images randomly selected from the mage-label datasets were used as training data, and the remaining 1220 images were used as test data. The SegNet, DeepLab, VGG-Ex and CNN-Bayesian model were trained with the same image dataset. This was done five times. Table 3 shows the total number of samples of each category used in each training and test round.

Table 3. Total number of samples of each category used in each training and test round.

Category                  Number of training samples    Number of test samples
Winter wheat              1253258035                    318536417
Agricultural buildings    5033165                       1279263
Woodland                  452984832                     115133645
Developed land            956301312                     243059917
Roads                     45298483                      11513364
Water bodies              45298483                      11513364
Farmland                  1212992717                    308302316
Bare fields               1061997773                    269924434

 

4.2. Results and Evaluation

Table 4 shows the confusion matrices for the segmentation results of the four models. Each row of the confusion matrix represents the proportion taken by the actual category, and each column represents the proportion taken by the predicted category. Our approach achieved better classification results. the proportion of “winter wheat” wrongly categorized as “non-winter wheat” was on average 0.033, and the proportion of “non-winter wheat” wrongly classified as “winter wheat” was on average 0.021.

Table 4. Confusion matrix of the winter wheat classification.

Approach         Actual category      Predicted: Winter wheat    Predicted: Non-winter wheat
CNN-Bayesian     Winter wheat         0.669                      0.021
                 Non-winter wheat     0.033                      0.277
VGG-Ex           Winter wheat         0.631                      0.059
                 Non-winter wheat     0.049                      0.261
SegNet           Winter wheat         0.574                      0.116
                 Non-winter wheat     0.093                      0.217
DeepLab          Winter wheat         0.605                      0.085
                 Non-winter wheat     0.063                      0.247

 

In this paper, we used four popular criteria, namely Accuracy, Precision, Recall and the Kappa coefficient, to evaluate the performance of the proposed model [45]. Table 5 shows the values of these evaluation criteria for the four models.

Table 5. Comparison of the four models’ performance.

Index        CNN-Bayesian    VGG-Ex    SegNet    DeepLab
Accuracy     0.946           0.892     0.791     0.852
Precision    0.932           0.878     0.766     0.837
Recall       0.941           0.872     0.756     0.825
Kappa        0.879           0.778     0.616     0.712

 

To further compare the classification accuracy of planting area edges, we further subdivided the categories into "inner" and "edge" labels. If only winter wheat category pixels are used in the convolution process to extract a pixel's features, it is classified as inner; otherwise it is classified as edge. Table 6 shows the confusion matrices for the segmentation results of the four models.

Table 6. Confusion matrix for winter wheat inner/edge classification.

Approach         Actual category       Winter wheat inner    Winter wheat edge    Non-winter wheat
CNN-Bayesian     Winter wheat inner    0.542                 /                    0.001
                 Winter wheat edge     /                     0.127                0.02
                 Non-winter wheat      0.006                 0.027                0.277
VGG-Ex           Winter wheat inner    0.539                 /                    0.012
                 Winter wheat edge     /                     0.092                0.047
                 Non-winter wheat      0.008                 0.041                0.261
SegNet           Winter wheat inner    0.532                 /                    0.035
                 Winter wheat edge     /                     0.042                0.081
                 Non-winter wheat      0.033                 0.06                 0.217
DeepLab          Winter wheat inner    0.538                 /                    0.026
                 Winter wheat edge     /                     0.067                0.059
                 Non-winter wheat      0.015                 0.048                0.247

 

As can be seen from Table 6, the accuracy of the inner category was similar across the four models' results, but the CNN-Bayesian model was more accurate with regard to the edge category. The accuracy of the CNN-Bayesian model in edge recognition is three times higher than that of SegNet and two times higher than that of DeepLab. Comparing the accuracy of the CNN-Bayesian model on the winter wheat edge category with that of VGG-Ex shows that the ability of the CNN-Bayesian model to recognize the winter wheat edge is improved by nearly 30% due to the use of the Bayesian classifier.

Figure 5 shows ten images and the corresponding results randomly selected from the tested images, each containing 1024 × 1024 pixels. The CNN-Bayesian model misclassified only a small number of pixels at the corners of the winter wheat planting areas. In the DeepLab and VGG-Ex results, the misclassified pixels were mainly distributed at the junction of winter wheat and non-winter wheat areas, including edge and corner locations, but the number of misclassified pixels in the VGG-Ex results is smaller than in the DeepLab results. The SegNet results had the most errors, which were scattered throughout the image; most misclassified pixels were located on the edges and corners, with some also occurring inside the planting areas.

Figure 5. Comparison of segmentation results for Gaofen 2 satellite imagery: (a) original images, (b) ground truth, (c) results of CNN-Bayesian, (d) results of VGG-Ex, (e) results of SegNet, and (f) results of DeepLab.

 

2. Is the proposed method trained in an end-to-end manner, i.e., both the CNN module and the Bayesian one? If not, is it possible?

 

Reply: According to your good suggestions, we have revised the relevant content, the revised content is as follows:

 

We trained the CNN-Bayesian model in an end-to-end manner; the B-level classifier does not participate in the training stage. The parameters required for the B-level classifier to perform its calculations are obtained by statistics after training is completed. The training stage consists of the following steps:

1. Image-label pairs are input into the CNN-Bayesian model as a training sample dataset, and parameters are initialized.

2. Forward propagation is performed on the sample images.

3. The loss is calculated and back-propagated through the CNN-Bayesian model.

4. The network parameters are updated using stochastic gradient descent [45] with momentum.

Steps 2–4 are iterated until the loss is less than the predetermined threshold value.

Table 1 shows the hyperparameter setup we used to train our model. In the comparison experiments, the same hyperparameters were also applied to the comparison models.

Table 1. The hyperparameter setup.

Hyperparameter     Value
Mini-batch size    32
Learning rate      0.0001
Momentum           0.9
Epochs             20000

 

 

3. How do you determine the optimal value of the parameters? It would be better to investigate the influence of the parameters.

 

Reply: According to your good suggestions, we have added relevant content, the new content is as follows:

 

How to determine the optimal values of the parameters is an important problem in the use of convolutional neural networks. Stochastic gradient descent with momentum [45] is a common and effective training method. Data augmentation [33,35,41] and dropout [33] are used to prevent overfitting, so as to ensure that the model can obtain the optimal parameters. Practice has shown that reasonable use of the batch normalization (BN) layer is also helpful for model training to obtain the optimal parameters [42,43].
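As an aside, the regularization techniques named here are typically combined in a single convolutional block; the sketch below is generic, with arbitrary layer sizes and dropout rate, and is not taken from the manuscript.

```python
import torch.nn as nn

# Generic convolutional block combining batch normalization and dropout,
# the two overfitting-prevention techniques referenced above.
block = nn.Sequential(
    nn.Conv2d(in_channels=64, out_channels=64, kernel_size=3, padding=1),
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
    nn.Dropout2d(p=0.5),
)
```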

 

 

4. Some recent works on high-resolution remote sensing images and deep learning approaches are encouraged to be reviewed, e.g., Embedding Structured Contour and Location Prior in Siamesed Fully Convolutional Networks for Road Detection; Structured AutoEncoders for Subspace Clustering; Multiple Marginal Fisher Analysis; Locality Adaptive Discriminant Analysis for Spectral-Spatial Classification of Hyperspectral Images.

 

Reply: According to your good suggestions, we have revised the relevant content, the revised content is as follows:

 

These convolution-based per-pixel-label models have been applied in remote sensing image segmentation with remarkable results. For example, researchers have used CNNs to carry out remote sensing image segmentation and used conditional random fields to further refine the output class map [45–48]. To suit the characteristics of specific remote sensing imagery, other researchers have established new convolution-based per-pixel-label models, such as multi-scale fully convolutional networks [49], patch-based CNNs [50], and two-branch CNNs [51]. Effective work has also been carried out in extracting information from remote sensing imagery using convolution-based per-pixel-label models, e.g., extracting crop information for rice [52,53], wheat [54], leaf [55], and rape [56], as well as target detection for weeds [57–59] and diseases [60–62], and extracting road information using an improved FCN [63]. Some new feature extraction techniques are being applied to crop information extraction, including 3D-CNNs [64], deep recurrent neural networks [65], and CNN-LSTM [66], and recurrent neural networks (RNNs) have also been used to correct satellite image classification maps [67]. Some new techniques have been proposed to improve segmentation accuracy, including structured autoencoders [68] and locality adaptive discriminant analysis [69]. Moreover, research on how to automatically determine the feature dimension that can adapt to different data distributions will help to obtain good performance in machine learning and computer vision [70].

 

The added references are as follows:

 

63. Wang, Q.; Gao, J.Y.; Yuan, Y. Embedding Structured Contour and Location Prior in Siamesed Fully Convolutional Networks for Road Detection. IEEE T. Intell. Transp. 2018, 19, 230-241. doi:10.1109/TITS.2017.2749964.

64. Ji, S.; Zhang, C.; Xu, A.; Shi, Y.; Duan, Y. 3D convolutional neural networks for crop classification with multi-temporal remote sensing images. Rem. Sens. 2018, 10, 75. doi:10.3390/rs10010075.

65. Ndikumana, E.; Ho Tong Minh, D.; Baghdadi, N.; Courault, D.; Hossard, L. Deep recurrent neural network for agricultural classification using multitemporal SAR Sentinel-1 for Camargue, France. Rem. Sens. 2018, 10, 1217. doi:10.3390/rs10081217.

66. Namin, S.T.; Esmaeilzadeh, M.; Najafi, M.; Brown, T.B.; Borevitz, J.O. Deep phenotyping: deep learning for temporal phenotype/genotype classification. Plant Methods 2018, 14, 66. doi:10.1186/s13007-018-0333-4.

67. Maggiori, E.; Charpiat, G.; Tarabalka, Y.; Alliez, P. Recurrent Neural Networks to Correct Satellite Image Classification Maps. arXiv:1608.03440v3 [cs.CV] 2017.

68. Peng, X.; Feng, J.S.; Xiao, S.J.; Yau, W.Y.; Zhou, J.T.; Yang, S.F. Structured AutoEncoders for Subspace Clustering. IEEE T. Image Process. 2018, 27, 5076-5086. doi:10.1109/TIP.2018.2848470.

69. Wang, Q.; Meng, Z.T.; Li, X.L. Locality Adaptive Discriminant Analysis for Spectral-Spatial Classification of Hyperspectral Images. IEEE Geosci. Remote S. 2017, 14, 2077-2081. doi:10.1109/LGRS.2017.2751559.

70. Huang, Z.; Zhu, H.; Zhou, J.T.; Peng, X. Multiple Marginal Fisher Analysis. IEEE Trans. Ind. Electron. 2018 (Early Access). doi:10.1109/TIE.2018.2870413.

71. Jung, M.C.; Park, J.; Kim, S. Spatial Relationships between Urban Structures and Air Pollution in Korea. Sustainability 2019, 11, 476. doi:10.3390/su11020476.

72. Chen, M.; Sun, Z.; Davis, J.M.; Liu, Y.-A.; Corr, C.A.; Gao, W. Improving the mean and uncertainty of ultraviolet multi-filter rotating shadowband radiometer in situ calibration factors: utilizing Gaussian process regression with a new method to estimate dynamic input uncertainty. Atmos. Meas. Tech. 2019, 12, 935-953. doi:10.5194/amt-12-935-2019.

 


Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

I thoroughly checked the revision and found that my comments on the previous submission have been appropriately addressed.

 

A few specific comments that should be considered for the final version:

L. 30-31. I would recommend listing here some accuracy metrics which will illustrate the performance of your classification model in comparison to the other models (for example, the overall accuracies of all four models), instead of the average accuracy, recall and Kappa coefficient of the CNN-Bayesian model only.

L. 152-153. This statement is in conflict with L. 318-320, where you describe another way of generating testing and training data sets.

L. 178. Please replace “a piece of vegetation area” with “a vegetation area” and “manual visual” with “visual”.

L. 314-317. Did you perform "data enhancement" or data augmentation? There are differences. The data processing steps you describe here belong to data augmentation techniques.

L. 317-320. To my knowledge, the validation technique you are describing here is not cross-validation, but a random split (80% for training and 20% for testing) repeated 5 times. This validation technique is also appropriate.

L. 318. A cross-validation will not prevent overfitting of your model. It just shows you how your model is performing on a new data set.

L. 319. Should be “mage-label dataset” written as “image-label dataset”?

Table 3. As you described above, you trained your models 5 times. Each time you randomly selected 80% of your data for training and used the other 20% of the images for testing the classification models. In such a case, indicating in this table only the total number of samples per category (in millions, for better readability) would be enough.

L. 329. Please replace “classification results. the proportion” with “classification results. The proportion”.

L. 421-423. This statement is not fully correct since your CNN-Bayesian and VGG-Ex use the same feature extractor. It is slightly different for two other classification models.

L.466. The sentence is not clear. Please correct it.


Author Response

Dear Reviewer:

We would like to thank you for your comments and suggestions. We have substantially revised the manuscript according to your good suggestions, and detailed responses are provided below. All revised contents are in blue.

(1) L. 30-31. I would recommend listing here some accuracy metrics which will illustrate the performance of your classification model in comparison to the other models (for example, the overall accuracies of all four models), instead of the average accuracy, recall and Kappa coefficient of the CNN-Bayesian model only.

Reply: According to your good suggestions, we have revised the L. 30-31. The revised content is as follows:

Compared to existing models, our approach produced an improvement in overall accuracy: the overall accuracy of SegNet, DeepLab, VGG-Ex, and CNN-Bayesian was 0.791, 0.852, 0.892, and 0.946, respectively.

(2) L. 152-153. This statement is in conflict with L. 318-320, where you describe another way of generating testing and training data sets.

Reply: According to your good suggestions, we compared L. 152-153 and L. 318-320 and have deleted L. 152-153, because the description there was incorrect.

(3) L. 178. Please replace “a piece of vegetation area” with “a vegetation area” and “manual visual” with “visual”.

Reply: According to your good suggestions, we have revised the L. 178. The revised content is as follows:

In order to accurately distinguish whether a vegetation area is winter wheat or woodland in visual interpretation

(4) L. 314-317. Did you perform "data enhancement" or data augmentation? There are differences. The data processing steps you describe here belong to data augmentation techniques.

Reply: According to your good suggestions, we have revised the L. 314 to correct the mistake. The revised content is as follows:

We used data augmentation techniques on the dataset to prevent overfitting, and each image was randomly processed in brightness, saturation, hue, and contrast.

 

(5) L. 317-320. To my knowledge, the validation technique you are describing here is not cross-validation, but a random split (80% for training and 20% for testing) repeated 5 times. This validation technique is also appropriate.

Reply: According to your good suggestions, we have revised the L. 317 to correct the mistake. The revised content is as follows:

We also employed a random split technique for training and testing the model to prevent overfitting.

(6)  L. 318. A cross-validation will not prevent overfitting of your model. It just shows you how your model is performing on a new data set.

Reply: According to your good suggestions, we have revised the L. 318 using the right name of the technique we used. The revised content is as follows:

We also employed a random split technique for training and testing the model to prevent overfitting.

(7) L. 319. Should be “mage-label dataset” written as “image-label dataset”?

Reply: According to your good suggestions, we have revised the content to correct the mistake. The revised content is as follows:

During each training and test round, 4880 images randomly selected from the image-label datasets were used as training data, and the remaining 1220 images were used as test data.

(8) Table 3. As you described above, you trained your models 5 times. Each time you randomly selected 80% of your data for training and used the other 20% of the images for testing the classification models. In such a case, indicating in this table only the total number of samples per category (in millions, for better readability) would be enough.

Reply: According to your good suggestions, we have revised the relevant content. The revised content is as follows:

Table 3. Total number of samples of each category used in each training and test round.

Category                  Number of total samples (million)
Winter wheat              1572
Agricultural buildings    6
Woodland                  568
Developed land            1199
Roads                     51
Water bodies              57
Farmland                  1521
Bare fields               1332

 

(9) L. 329. Please replace “classification results. the proportion” with “classification results. The proportion”.

Reply: According to your good suggestions, we have revised the relevant content. The revised content is as follows:

The proportion of “winter wheat” wrongly categorized as “non-winter wheat” was on average 0.033

(10) L. 421-423. This statement is not fully correct since your CNN-Bayesian and VGG-Ex use the same feature extractor. It is slightly different for two other classification models.

Reply: According to your good suggestions, we have revised the relevant content. We deleted the clause “Considering that the four models use the same encoder”. After deletion, the statement is clearer.

(11) L.466. The sentence is not clear. Please correct it.

Reply: According to your good suggestions, we have revised the relevant content. The revised content is as follows:

The number of categories that can be extracted by the proposed CNN-Bayesian model is determined by the number of categories of samples in the training dataset.


Author Response File: Author Response.pdf

Reviewer 2 Report

All concerns of mine have been well addressed by the authors and I have no more suggestions.

Author Response

Dear Reviewer:

We would like to thank you for your review of our manuscript.


Author Response File: Author Response.pdf
