Article
Peer-Review Record

Recognition for Stems of Tomato Plants at Night Based on a Hybrid Joint Neural Network

by Rong Xiang *, Maochen Zhang and Jielan Zhang
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3:
Submission received: 14 April 2022 / Revised: 16 May 2022 / Accepted: 21 May 2022 / Published: 24 May 2022
(This article belongs to the Special Issue Application of Robots and Automation Technology in Agriculture)

Round 1

Reviewer 1 Report

Dear authors,

I have finished the review of your paper. Regarding Machine Learning, I have the following concerns:

  1. It is not clear whether you used augmentation methods for your original dataset.
  2. Summarize the dataset in a single table, both the original and the augmented (if applicable).
  3. Provide a wider explanation and the layer structure of your Neural Network model.
  4. Add the confusion matrix for the results sections.
  5. Include more ML models in your tests.
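The confusion matrix requested in point 4 can be tallied directly from per-region class labels. A minimal sketch in plain Python, assuming a two-class stem/leaf labelling (the label names and values below are hypothetical, for illustration only):

```python
from collections import Counter

def confusion_matrix(y_true, y_pred, labels):
    """Count (true, predicted) label pairs into a row-per-true-class matrix."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]

# Hypothetical per-region predictions for illustration.
y_true = ["stem", "leaf", "stem", "leaf", "stem"]
y_pred = ["stem", "leaf", "leaf", "leaf", "stem"]
cm = confusion_matrix(y_true, y_pred, ["stem", "leaf"])
print(cm)  # rows: true class, columns: predicted class -> [[2, 1], [0, 2]]
```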

I hope you find these recommendations useful.

Author Response

Please refer to the attachment. Thank you.

Author Response File: Author Response.pdf

Reviewer 2 Report

Review of the article "Recognition for stems of tomato plants at night based on a hybrid joint neural network" by authors Rong Xiang, Maochen Zhang and Jielan Zhang

Shortcomings of the article:

The conclusions must clearly show what problems the researchers have solved and how much better the obtained results are than those of other researchers. The conclusions should be clear and concise, with numerical values provided to support and justify the results obtained. The presented conclusions are not informative. For example, the conclusions state: "less false negative errors and higher recall rate". It must be written how much less and how much higher, etc.

Author Response

Please refer to the attachment. Thank you.

Author Response File: Author Response.pdf

Reviewer 3 Report

This article deals with the problem of separating the stems from the leaves of a plant. While the introduction part presents the problem in a rather clear way, the rest of the article is very confusing. The 4 methods used, not to mention their combinations, are very poorly described and generally lack an explanation on a conceptual level. For example the so-called "traditional" method is based on a neural network (I would call traditional one based on edge detection). Then there is a whole discussion around the edge extraction methods without explicitly mentioning how the stereoscopic method is involved in the process. For a computer vision expert this article is very difficult if not impossible to understand. Knowing that the readers come from the field of agriculture, I don't dare to think what they will retain from this article. It would be better to present at a more conceptual level the 4 methods (and their combinations), put the implementation details in an appendix, and focus on the results and also the impact of changing some parameters (by the way where is the exhaustive list of parameters?).

Major comments

1) Given your objective is to work at night, I was wondering why you do not use infrared light. Maybe the spectra of stems and leaves are different and easier to distinguish?
2) Why choose a pulse-coupled neural network among other possibilities? What does the pulsed feature bring to your results? Please justify this choice. And in what sense is this a "Traditional Method (TM)"?
3) In Section 2.2.1 I am entirely confused about what you call "Traditional method" given you apply a PCNN. For me the output of the PCNN would come with a segmentation of the stem/leaves. But then you develop a whole section on edge extraction. In Fig. 3 it seems that the PCNN is superior to edge extraction, so why do you use edge extraction?
4) In Equation 4 you speak about left and right images. This reminds me that you use stereoscopy, but nowhere do you explain how you match both images to recover depth.

5) As a whole, the way you describe the methods is too confusing to follow the paper and understand what you achieve.
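Comment 4 concerns recovering depth from matched left/right images. For reference, once a point is matched across the stereo pair, depth follows from the standard pinhole stereo relation; the camera values below are hypothetical, not taken from the paper:

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Classic pinhole stereo relation: Z = f * B / d.

    disparity_px: horizontal pixel shift of a matched point
                  between the left and right images,
    focal_px:     focal length expressed in pixels,
    baseline_m:   distance between the two cameras in metres.
    """
    return focal_px * baseline_m / disparity_px

# Hypothetical camera parameters for illustration.
print(depth_from_disparity(disparity_px=40, focal_px=800, baseline_m=0.1))  # 2.0 metres
```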

Minor comments

1) lines 49-50: "this method is reliant on support ropes to recognise stems". What do you mean, being able to separate ropes from stems? Please specify.
2) lines 50-51: "Secondly, only the main stems can be recognised using this method.". Why, please explain.
3) lines 56-59: The explanation regarding the third method looks contradictory to what you said on lines 53-54. To my understanding, using stereoscopy should help distinguish stems from leaves based on the shape, and not on the colour. Please rewrite these sentences to be more consistent.
4) line 71: Please specify a reference describing these difficulties.
5) lines 76-81: Please avoid jargon such as Mask R-CNN. Define also what you mean by "hybrid joint neural network". It is important in the introduction to define the main notions you will develop further.
6) line 99: delete "were" before "appeared"
7) line 114: Why do you use a "pulse-coupled neural network" and not simply a neural network?
8) lines 118-119: you should summarise differently how PCNN works compared to simple threshold-based segmentation, such as "PCNN relies on spatial information and grayscale information to extract image features." This sentence has the advantage of being understood by non-experts in computer vision!
9) lines 119-133: I would rather not show the equations. They are too simplistic to understand the PCNN. First, explain why you compute the difference between the green and red colours at every pixel. This then constitutes your input to the PCNN. Try to explain without equations what the main steps in the PCNN are. For instance, how many layers are there? What is the backpropagation rule to adjust the weights of the neural cells, etc.? You also speak about OTSU. That could be described through equations to better understand the effect on raw data.
10) Fig. 2: You do not explain in the text the role of Mask R-CNN1 and Mask R-CNN2. Moreover, what is the advantage of combining a traditional method with the output of the PCNN (if I understand Fig. 2 correctly)?
11) lines 136-142: This explanation is again not understandable. You again describe only partially the working of the PCNN model, but this level of explanation is not useful. Either you explain at a more conceptual level what you do, or you go much deeper into the working of the PCNN.
12) line 162: change "furcation" into "fork" (and everywhere in the text). Furcation is not much used, to my knowledge!
13) Fig. 3: how do you compute the "bounding boxes/rectangles" (Fig. 3g)? And what is the link between Figs. 2 and 3? Why do you mix in edge extraction methods?
14) lines 168-169: this sentence is unclear. What do you mean by "synchronously"? Why use "public edge"? This is a strange denomination.
15) You should explain why you are seeking forked edges.
16) line 229: Where is Section 2.2.4? OK, I finally found it. Not easy to follow!
17) lines 233-236: this is entirely not understandable, as is the rest of the paragraph... Please reformulate what you want to say.
18) line 247: The fact that the stem diameter ("average" is better than "general") is 25 (what unit?) should be discussed earlier in the text, as this is an important parameter for understanding the algorithm.
19) lines 251-264: remove all this text and rather refer to the chart in Fig. 7.
20) lines 274-306: same comment, you should summarise all this discussion within a chart. Nobody will read that; or put these explanations in an Appendix.
21) lines 307-322: I thought you were finally going to explain the process more conceptually. But it is again a mixture of technical details. This part should probably be introduced at the beginning of Section 2.2.1. Again, you are not explicit enough on stereoscopy, how you combine information, why you use stereoscopy, etc.
22) Section 2.2.2: Finally you introduce "Mask R-CNN". But what does it mean to train "Mask R-CNN" and compare the TM performance? What is "TM"? (Traditional method, but what is it? See my comment above.)
23) line 327: What is a "regional proposal network"? With 3 branches?
24) line 329: Why introduce "Resnet101" at this stage? What's a feature pyramid network?
25) line 341: What is MS COCO?
26) line 355: "the cascade of the TM and the Mask R-CNN": not clear what you mean by cascade. Is it a mixture of the two methods?
27) line 362: Explain better what the "cascade net" is.
28) line 385: "TP + FP is the area of recognition results", what does that mean?
29) lines 390-396, your explanation about common object level metrics and areas is not clear.
30) lines 430-435. The discourse on FP vs. FN could be shortened.

31) The conclusion part is too short. This is where you should develop more what you achieved, but without too much jargon (at a more conceptual level).
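Comments 28-29 question the pixel-level, area-based metrics. The definitions usually intended by such metrics (an assumption about the paper's intent, not its exact formulas) can be sketched as:

```python
def pixel_metrics(tp, fp, fn):
    """Precision, recall and F1 computed from pixel counts.

    tp: pixels in both the predicted mask and the ground truth,
    fp: predicted pixels lying outside the ground truth,
    fn: ground-truth pixels missed by the prediction.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical pixel counts for illustration.
p, r, f = pixel_metrics(tp=700, fp=300, fn=300)
print(round(p, 2), round(r, 2), round(f, 2))  # 0.7 0.7 0.7
```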

 

Author Response

Please refer to the attachment. Thank you.

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

Dear authors,

Thank you for addressing my initial concerns about your work, I do not have more recommendations about it. 

Author Response

Please refer to the attachment. Thank you!

Author Response File: Author Response.pdf

Reviewer 3 Report

While I recognize that the authors have made definite improvements, there are still parts that are too confusing; see my comments below. Both the abstract and the conclusions of the article should be written in a way that presents the purpose of the study with less technical detail, using higher-level concepts.
Finally, there is still a lot of work to be done on the English; I can't rewrite the whole article myself. Proofreading by an English speaker is essential.

Comments

1) Lines 8-21: Abstract: You should consider writing it at a more conceptual level and remove jargon such as "Mask R-CNN", "pixel level metrics", "pixel recall rate", and "pixel F1 value" without introducing them. Actually, who will care that the "pixel F1 value were 57.22%, 69.69%, and 62.84%" in an abstract? Please remain at a qualitative level in the abstract. And honestly, you can drop the decimals and use 57%, 70%, and 63%; this will do as well!
2) Lines 81-82: Replace with "The hybrid joint neural network is a class of neural network formed by the cascading and parallel connection of several networks."
3) Lines 85-86: "design a cascaded neural network composed of the traditional method and Mask R-CNN": Is that similar to your hybrid joint neural network? If yes, then remove this sentence. Again, I do not think it a good idea to speak about a "traditional method". Traditional with respect to what? This is really a strange denomination.
4) Lines 90-93: "The 3D localisation of stems is a separate research task and is not the research content of this study. Considering the localisation of stems in future research, this study used a binocular stereo camera to capture images of tomato plants.", replace by "This study used a binocular stereo camera to capture images of tomato plants to identify the tomato stems. However, the 3D localisation of stems is a separate task, not addressed in this study."
Note: you should explain why you use a binocular stereo vision system: do you use depth information or not? This is still not clear to me, and so will not be clear to readers.
5) Lines 117-118: "and the cascaded neural network, itself made of a TM and Mask R-CNN2".
6) Line 121: use the denomination "cascaded NN"
7) Lines 124-125: "two methods, the results of Mask R-CNN and the cascaded neural network were combined into final results." --> "two types of neural networks, the results of both were combined."
8) Line 128: Are there differences between Mask R-CNN and Mask R-CNN2? If not, I would not use different names.
9) Line 133: replace "in the following" by "in the next sub-sections".
10) Line 142: replace ROI by "Region of Interest (ROI)".
11) Line 145: For the first time you speak about a "deep learning environment". I do not understand the link with the neural networks (Fig. 2) at this stage. Please explain what you want to do with TensorFlow related to deep learning; this is not at all clear.
12) Line 147: Explain what geometric operations you are performing for getting augmented dataset.
13) Line 161: add ":" after "TM".
14) Line 164: Please introduce PCNN before using this word. For instance: "The image segmentation step was performed based on a Pulse-Coupled Neural Network (PCNN), which consists of ....", and then "The output of this segmentation step consists of segmented leaf regions". You should be more precise in your explanations!
15) Lines 165-166: "The second step extracted the stem regions based on a duality edge pairs scheme".
16) Lines 171-172: "... TM was still based on traditional edge processing such as Canny edge detection".
17) Line 172: Remove "So, we still named the method as "Traditional method"."
18) Line 173: "The raw image and the TM result are the same as in Fig.2."
19) Line 174: Add a sentence like "The eight steps are now described in more detail".
20) Lines 178-181: "PCNN was designed on the basis of an underlying mechanism wherein similar stimuli evoke synchronous oscillation through the network, which can be interpreted as specific visual features [36-37]."
21) Lines 186-188: Is this colour segmentation part of the PCNN, or does it need an additional step? You should specify. If part of the PCNN, you should say that colour information is one feature which the PCNN uses to segment the visual scenes.
22) Lines 192-199: "The former activates the neuron through external input stimuli, while the latter ignites neurons from their neighbours. Ignited neurons will in turn activate their neighbours with similar visual inputs, forming a pulse wave that spreads within the network [37]. PCNN establishes a connection between similar pixels in the neighbourhood of the object and background regions and manages to segment them."
23) Line 200: Please explain what OTSU is and then refer to the Appendix.
24) Lines 226-228: "The aim of seeking forked points is to extract the longest edge contained in each separated tomato plant region, removing noise in the process."
25) Line 229: "... can be found in Appendix A"
26) Lines 294-295: Why is the obtained stem edge wider than ground truth? Is it systematic, on average, etc.? Please explain.
27) Lines 302-304: "The cascade NN in the study was composed of the TM and Mask R-CNN. Raw images were the TM inputs, and the TM output became the input of the Mask R-CNN module. This architecture allowed the advantages of the two algorithms to be utilised comprehensively and eliminated the false positives that resulted from the TM stage."
28) Lines 310-312: delete them.
29) Lines 318-322: "The hybrid joint neural network was combined through an "or" operation between the cascaded NN and the Mask R-CNN. As illustrated in Fig.2, such a combination improves the recognition of the stem regions otherwise reconstructed thinner than ground truth."
30) Line 326: You introduce a new acronym, "YOLACT". What is it? Why is it not described before?
31) Line 332: Please define the F1 value (also in the abstract). Is it defined in Eq. 5? If so, why do you call it "F1"? Then you cannot use it in the abstract, as it is not a known definition.
32) Line 339: What is the "pixel recall rate"? Its frequency in terms of probabilities? If so, please define it properly.
33) Lines 342-346: You describe "pixel level metrics" without defining them. Is it some kind of normalisation? If so, please define it properly.
34) Line 358: "true" --> "truth"
35) Lines 357-358: Not at all clear what you mean. Please rephrase.
36) Lines 364 and 365 and Tables 2: "matrixes" --> "matrices"
37) Table 2: I do not think presenting the results this way is suitable. There is an existing graphical representation of confusion matrices. Please use this graphic form instead.
38) Table 3: limit precision to 1/10 in percentage values.
39) Lines 372-373: "These results show that the algorithm presented in this study can recognise stem edges of tomato plants." What your results show is that stem edges are correctly recognised in 57%, 70%, and 63% of cases. Be more precise in your language.
40) Lines 400-401: "high false positive can decrease the deleafing efficiency but decrease the risk of missing deleafing.": is it not the contrary? "A high false positive rate can decrease the deleafing efficiency and so increase the risk of missing deleafing."?
41) Legend fig. 7: please say what is the meaning of the yellow ellipses.
42) Conclusions: this part looks more like a recap of the previous section. I would expect some conclusions at a higher level of abstraction than just a comparison of numbers. For instance, you should provide your advice on how your algorithm could be used in real applications, how it could help collect tomatoes, and how it compares to other approaches found in the literature. Remember, this is a journal about agriculture, and readers would like to know if your method can really help them!

Note that I did not check the appendices as they might be put in Supplementary material out of the editor control.
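Comment 23 asks for an explanation of OTSU. For reference, Otsu's method picks the grayscale threshold that maximises the between-class variance of a histogram. A minimal sketch over a toy histogram (an illustration of the general method, not the paper's implementation):

```python
def otsu_threshold(hist):
    """Return the grey level maximising between-class variance.

    hist: list of pixel counts, one entry per grey level.
    """
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w_bg = sum_bg = 0
    for t, h in enumerate(hist):
        w_bg += h                      # background weight up to level t
        if w_bg == 0:
            continue
        w_fg = total - w_bg            # foreground weight above level t
        if w_fg == 0:
            break
        sum_bg += t * h
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

# Toy bimodal histogram: 10 grey levels with dark and bright peaks.
hist = [10, 10, 0, 0, 0, 0, 0, 0, 10, 10]
print(otsu_threshold(hist))  # 1 (pixels above level 1 are foreground)
```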

Author Response

Please refer to the attachment. Thank you!

Author Response File: Author Response.pdf
