Article

YOLOv5-ASFF: A Multistage Strawberry Detection Algorithm Based on Improved YOLOv5

1 College of Agricultural Engineering, Shanxi Agricultural University, Jinzhong 030801, China
2 College of Food Science and Engineering, Shanxi Agricultural University, Jinzhong 030801, China
* Author to whom correspondence should be addressed.
Submission received: 27 June 2023 / Revised: 15 July 2023 / Accepted: 17 July 2023 / Published: 19 July 2023
(This article belongs to the Special Issue AI, Sensors and Robotics for Smart Agriculture)

Abstract

The smart farm is currently a hot topic in the agricultural industry. Because the field environment is complex, intelligent monitoring models suited to it typically demand high hardware performance, which makes real-time detection of ripe strawberries on a small automatic picking robot difficult. This research proposes YOLOv5-ASFF, a real-time multistage strawberry detection algorithm based on an improved YOLOv5. By introducing the ASFF (adaptive spatial feature fusion) module into YOLOv5, the network can adaptively learn the fused spatial weights of the strawberry feature maps at each scale and thereby fully exploit the image feature information of strawberries. To verify the superiority and practicality of YOLOv5-ASFF, a strawberry dataset covering a variety of complex scenarios, including leaf shading, overlapping fruit, and dense fruit, was constructed for this experiment. The method achieved an mAP of 91.86% and an F1 of 88.03% overall, and an AP of 98.77% for mature-stage strawberries, showing strong robustness and generalization ability and outperforming SSD, YOLOv3, YOLOv4, and YOLOv5s. The YOLOv5-ASFF algorithm can overcome the influence of complex field environments and improve the detection of strawberries under dense distribution and shading conditions, and it can provide technical support for yield estimation and harvest planning in intelligent strawberry field management.

1. Introduction

Strawberry (Fragaria × ananassa Duch.) is a perennial herb of the genus Fragaria in the family Rosaceae. Wild strawberries originated in Europe, America, and Asia, while the modern large-fruited cultivated strawberry originated in France. Strawberries are rich in carotenoids and vitamin A, which help maintain healthy epithelial tissue and promote growth and development [1]. The high fiber content of strawberries may aid the digestion of food in the gastrointestinal tract and the prevention of acne and colon cancer. Strawberries are produced commercially in most regions of China, and the country ranks first in the world in both strawberry cultivation area and production. Since strawberries are densely planted and ripen quickly, untimely harvesting can easily lead to fruit decay and economic losses for farmers. At present, strawberry harvesting is mainly manual, which puts great pressure on the harvest because of high hiring costs, high work intensity, and low worker efficiency [2]. For these reasons, monitoring the growth of strawberries is difficult, and manually picking ripe strawberries is a tedious and time-consuming task. This motivates the development of an automatic detection method that can monitor strawberry growth and accurately identify ripe fruit for use on small picking robots.

2. Related Work

With the rapid development of computer and electrical technology, agricultural automation and intelligence have become popular trends in the production and cultivation of agricultural products [3]. Computer vision technology is currently a main tool for agricultural product detection and has been widely used in maturity detection [4], remote intelligent monitoring [5], yield prediction [6], picking robots [7,8,9,10], variety selection [11,12], etc. However, the complex and variable natural environment still affects fruit detection: leaf shading, fruit overlapping, plant structure interference, and light variation are common factors influencing detection accuracy. Therefore, the key to solving the above problems is improving the performance of fruit detection algorithms. Target detection algorithms fall into two main categories: one-stage and two-stage. One-stage target detection algorithms detect the target by extracting features only once, and mainly include the SSD algorithm [13] and the YOLO series [14,15,16,17]. Two-stage target detection algorithms must first form candidate regions and then detect the target with a convolutional neural network (CNN); they mainly include SPPNet [18] and the R-CNN series [19,20]. Traditionally, compared with two-stage algorithms, one-stage algorithms have better real-time performance but lower accuracy. However, with the continuous refinement and improvement of the YOLO algorithms, their accurate and efficient detection performance has been widely noticed and applied. The YOLO series also has strong generalization and robustness, adapting to detection tasks under complex conditions such as different scales, poses, and occlusions, which is valuable for practical applications in agriculture. Zheng et al. [21] proposed an improved algorithm, YOLOX-Dense-CT, to solve the problem of low detection accuracy caused by environmental factors such as shading by branches and leaves during mechanical harvesting of cherry tomatoes; it improved the mAP by 4.02% and reduced the parameters by 36.16% compared with the YOLOX-L algorithm. To equip a grape-harvesting robot with efficient trellis-grape recognition and picking-point localization, Xu et al. [22] proposed a grape detection model, YOLOv4-SE, based on feature-enhanced recognition, which performed well on grapes under various environmental disturbances and could better meet the requirements of high-speed picking robots. Tang et al. [23] developed an oil tea (Camellia oleifera) fruit detection algorithm based on YOLOv4-tiny to overcome obstacles to fruit detection such as light variation and leaf shading; the AP of the improved algorithm increased by 4.86%, and the model size was reduced by 12%, with room for further reduction. Lu et al. [24] observed that greenhouse tomato cultivation density, plant shading, and other factors lead to insufficient target recognition accuracy, and proposed YOLOX-ViT, a cascaded deep learning-based method for collaborative tomato flower and fruit recognition. Compared with YOLOX and combined-enhancement YOLOX, the model's mAP improved by 2.38–6.11%, better alleviating the loss of information in the detection network during image input. Huang et al. [25] developed GCS-YOLOv4-Tiny, a multistage fruit detection algorithm based on YOLOv4-Tiny, whose mAP and F1 improved by 17.45% and 13.8%, respectively, compared with the original network, although that experiment did not detect small targets at a distance. Zhang et al. [26] introduced improvements to YOLOv5s to increase model accuracy; the AP of the improved dragon fruit detection model reached 97.4%, and the complexity of the improved model was also reduced, making it easier to deploy on embedded devices. Jiang et al. [27] proposed a young-apple detection algorithm combining YOLOv4 and two attention modules to improve the detection accuracy of young apples, obtaining good accuracy on highlighted, blurred, and heavily occluded images in the test set; however, the algorithm's size reached 349 MB, making it difficult to install and run on embedded devices. Tian et al. [28] used DenseNet to modify the YOLOv3 network to improve the accuracy of detecting apples at different growth stages. The overall performance of the improved model increased somewhat, but the F1 was only about 80%, and the detection accuracy remained low.
Although the above methods achieved improvements in fruit detection, most of them have shortcomings and are not suitable for multistage detection of strawberry fruits in natural growing environments and for subsequent applications. Therefore, this research proposes an improved algorithm, YOLOv5-ASFF, for the detection and monitoring of multistage strawberries in complex scenes, which improves target detection accuracy while keeping the model streamlined. The main contributions of this work are as follows:
  • Constructing a strawberry dataset containing four growth stages of strawberries in a complex field environment;
  • Introducing ASFF (adaptive spatial feature fusion) structure to the YOLOv5 network;
  • Developing a YOLOv5-ASFF model with higher accuracy, generalization, and robustness for multistage strawberry detection;
  • Comparing the performance of the YOLOv5-ASFF algorithm with other mainstream one-stage target detection algorithms (SSD, YOLOv3, YOLOv4, and YOLOv5s).

3. Materials and Methods

3.1. Data Collection

All strawberry images were taken with a single smartphone (Samsung S20; f-number f/2.2, sensor size 1/1.76 inch, exposure time 8 ms) in Yuncheng City, Shanxi Province, China, from 20 December 2022 to 10 February 2023, at a GSD (ground sample distance) of 0.16–0.49 mm, in the morning, at midday, and in the afternoon. The collected images covered complex and variable conditions, such as different growth stages, leaf shading, overlapping fruit, and dense fruit; the smartphone was not fixed in one position, so that strawberry pictures could be obtained from different angles and scales. To reduce the interference of duplicate and fruitless images on model training, the collected raw images were manually screened. A total of 1217 images were collected; each image was 4000 × 3000 pixels and saved in JPG format. Some strawberry images are shown in Figure 1.
Strawberry has a multistage growth period, and this research mainly divided the growth period of strawberry into four stages: fruitlet, expanding, turning, and mature. The four stages of strawberry fruit are shown in Figure 2.
The strawberries at the fruitlet stage are very small, and the surface is covered with seeds; the strawberries in the expanding stage are larger than fruitlets, and the distribution of seeds on the surface is not as dense as in the fruitlet stage; the color of strawberries in the turning stage starts to change from green to red, but there are still some non-red areas on the surface; the surface of strawberries in the mature stage is almost completely covered in red.

3.2. Image Annotation and Dataset Partition

The original resolution of the collected images was 4000 × 3000 pixels, which is not conducive to efficient model training; on the other hand, compressing the images too aggressively would reduce their clarity and hinder manual target labeling, which in turn would prevent the desired results in the model training phase. Therefore, to balance the convenience of target labeling with the efficiency of model training, we compressed the original images to 1000 × 750 pixels. Since YOLOv5 includes Mosaic data augmentation, no separate data expansion was performed in this experiment.
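For reference, this compression step can be sketched with the Pillow library; the folder names are hypothetical:

```python
from pathlib import Path
from PIL import Image

SRC = Path("raw_images")         # hypothetical folder of 4000 x 3000 originals
DST = Path("compressed_images")  # hypothetical output folder
DST.mkdir(exist_ok=True)

for img_path in SRC.glob("*.jpg"):
    img = Image.open(img_path)
    # Downscale 4000 x 3000 -> 1000 x 750 (a 4x reduction on each side)
    img = img.resize((1000, 750), Image.LANCZOS)
    img.save(DST / img_path.name, quality=95)
```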
The labeling software was LabelImg, version 1.8.6. Four labels were included in this experiment: “fruitlet”, “expanding”, “turning”, and “mature”; 1217 strawberry dataset images were labeled with a total of 9225 strawberry targets, and the labeled files were saved in YOLO format. The strawberry images and the corresponding annotation files were divided into a training set, validation set, and test set with the ratio of 7:1:2 and placed in the corresponding folders, which constituted the strawberry dataset used in this research. The specific annotation details of the dataset are shown in Table 1.
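A minimal sketch of the 7:1:2 division, assuming a hypothetical folder layout in which each YOLO-format label file sits next to its image:

```python
import random
import shutil
from pathlib import Path

random.seed(0)
images = sorted(Path("dataset/images").glob("*.jpg"))  # hypothetical layout
random.shuffle(images)

n = len(images)
n_train, n_val = int(0.7 * n), int(0.1 * n)  # 7:1:2 split
splits = {
    "train": images[:n_train],
    "val": images[n_train:n_train + n_val],
    "test": images[n_train + n_val:],
}

for split, files in splits.items():
    for img in files:
        label = img.with_suffix(".txt")  # YOLO-format label next to each image
        for f in (img, label):
            out = Path(f"dataset/{split}") / f.name
            out.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy(f, out)
```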

3.3. YOLOv5-ASFF for Detecting Strawberry

YOLOv5 is a one-stage target detection algorithm proposed in 2020 that is now widely used in the field of agricultural engineering. The YOLOv5 algorithm adds several new and improved methods to YOLOv4, resulting in a significant improvement in both speed and accuracy. It has four versions: YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x. The network structure of these four versions is basically the same; the difference lies in the Depth_multiple and Width_multiple parameters, where larger values produce a larger model while generally increasing accuracy. Considering the real-time requirements of strawberry monitoring in the greenhouse environment, YOLOv5s-6.0 was used as the benchmark network model in this research.
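To illustrate how these two parameters scale the network, the following minimal sketch mirrors the scaling conventions of YOLOv5's model configuration files (the rounding to a multiple of 8 follows YOLOv5's make_divisible helper; the base channel count of 1024 and repeat count of 9 are taken from its configuration-file convention and are assumptions here):

```python
import math

def scale_channels(c: int, width_multiple: float, divisor: int = 8) -> int:
    # Channel counts are scaled and rounded up to a multiple of 8,
    # mirroring YOLOv5's make_divisible() convention
    return int(math.ceil(c * width_multiple / divisor) * divisor)

def scale_depth(n: int, depth_multiple: float) -> int:
    # The number of C3 bottleneck repeats is scaled and rounded, never below 1
    return max(round(n * depth_multiple), 1) if n > 1 else n

# YOLOv5s uses depth_multiple = 0.33 and width_multiple = 0.50
print(scale_channels(1024, 0.50))  # -> 512 output channels
print(scale_depth(9, 0.33))        # -> 3 C3 repeats
```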

3.3.1. YOLOv5-ASFF

The network structure of YOLOv5 mainly consists of four parts: input, backbone, neck, and head. The backbone network is a CNN whose main function is to extract image features; its structure mainly consists of Conv, C3 (concentrated comprehensive convolution), and SPPF (spatial pyramid pooling—fast) modules. The Conv module consists of Conv2d, BatchNorm2d, and the SiLU activation function: it extracts features through Conv2d and normalizes them through BatchNorm2d to speed up network learning. The SiLU activation function is a smooth, nonmonotonic function that provides an effective mapping of features during training and avoids the vanishing-gradient problem. The C3 module contains three convolutions to extract the deep features of the image [29]. SPPF fuses local and global features via repeated MaxPool2d operations, enriching the expressive capability of the feature map. The upsampling module enlarges the feature map (nearest-neighbor upsampling with a scale factor of 2, as listed in Table 2). Concat is a tensor-stitching operator whose function is to achieve feature fusion. ASFF was introduced into the head of YOLOv5 to construct the new network algorithm YOLOv5-ASFF, whose structure is shown in Figure 3; the specific structure of each YOLOv5 module is shown in Figure 4.
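For reference, a minimal PyTorch sketch of the Conv module described above (Conv2d → BatchNorm2d → SiLU) follows; the same-padding convention is an assumption consistent with layer 0 of Table 2:

```python
import torch
import torch.nn as nn

class Conv(nn.Module):
    """Conv2d -> BatchNorm2d -> SiLU, as in the YOLOv5 Conv module."""
    def __init__(self, c_in, c_out, k=1, s=1, p=None):
        super().__init__()
        p = k // 2 if p is None else p  # "same" padding for odd kernels
        self.conv = nn.Conv2d(c_in, c_out, k, s, p, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

x = torch.randn(1, 3, 640, 640)
print(Conv(3, 32, k=6, s=2, p=2)(x).shape)  # torch.Size([1, 32, 320, 320])
```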
In a CNN, deeper layers respond more strongly to semantic features, while shallower layers respond more strongly to image detail; hence, the FPN (feature pyramid network) [30] and PAN (path aggregation network) [31] structures were used in the neck of YOLOv5 to fuse image features. FPN-PAN is a bidirectional (top-down and bottom-up) fusion structure, and YOLOv5 uses it to perform two rounds of feature fusion, combining information at different scales and enabling the network to extract richer image features; the structure is shown within the blue dashed line in Figure 5. The output scales are 80 × 80, 40 × 40, and 20 × 20, which allow class and location prediction for targets of different sizes, with feature information from the different scales fused at the end.
To represent the structure and tensor information of YOLOv5-ASFF more clearly, we list it in Table 2.

3.3.2. ASFF

Strawberries grow in a dense environment, and background interference has a pronounced effect on strawberry detection; thus, a detection network with stronger feature fusion capability is required. The information in a CNN mainly comprises semantic and location information, with shallow layers richer in location information and deep layers richer in semantic information. In the original FPN-PAN structure, when a feature map matches an object, the information in the feature maps of other layers is ignored; hence, there is an inconsistency between features of different scales. For this reason, we introduced the ASFF module into the head of YOLOv5 to improve the multiscale fusion of image features. The ASFF structure is shown in Figure 5. Implementing ASFF requires two steps: feature resizing and adaptive fusion [5,32,33].
Feature resizing: $X^{l}$ denotes the feature map at the resolution of Level $l$ ($l \in \{1, 2, 3\}$ for YOLOv5). ASFF-detect$^{l}$ is obtained by multiplying the semantic information of Level 1, Level 2, and Level 3 by the layer weights $\alpha$, $\beta$, and $\gamma$ and summing, as expressed in Equation (1):

$$\mathrm{ASFF\text{-}detect}^{l} = X^{1 \rightarrow l} \cdot \alpha^{l} + X^{2 \rightarrow l} \cdot \beta^{l} + X^{3 \rightarrow l} \cdot \gamma^{l}, \tag{1}$$

where $X^{1 \rightarrow l}$ denotes the feature map of Level 1 resized to the size of Level $l$; $X^{2 \rightarrow l}$ and $X^{3 \rightarrow l}$ are defined in the same way.

Adaptive fusion: let $x_{ij}^{n \rightarrow l}$ denote the feature vector at position $(i, j)$ of the feature map resized from Level $n$ to Level $l$. The feature fusion for Level $l$ is given in Equation (2):

$$y_{ij}^{l} = \alpha_{ij}^{l} \cdot x_{ij}^{1 \rightarrow l} + \beta_{ij}^{l} \cdot x_{ij}^{2 \rightarrow l} + \gamma_{ij}^{l} \cdot x_{ij}^{3 \rightarrow l}, \tag{2}$$

where $y_{ij}^{l}$ denotes the $(i, j)$-th vector of the output feature map $y^{l}$ across channels. $\alpha_{ij}^{l}$, $\beta_{ij}^{l}$, and $\gamma_{ij}^{l}$ are the spatial importance weights of the three feature levels, learned by the network for Level $l$. They can be simple scalar variables shared across all channels, subject to $\alpha_{ij}^{l} + \beta_{ij}^{l} + \gamma_{ij}^{l} = 1$ and $\alpha_{ij}^{l}, \beta_{ij}^{l}, \gamma_{ij}^{l} \in [0, 1]$, and they are defined by

$$\alpha_{ij}^{l} = \frac{e^{\lambda_{\alpha,ij}^{l}}}{e^{\lambda_{\alpha,ij}^{l}} + e^{\lambda_{\beta,ij}^{l}} + e^{\lambda_{\gamma,ij}^{l}}}, \tag{3}$$

where $\alpha_{ij}^{l}$, $\beta_{ij}^{l}$, and $\gamma_{ij}^{l}$ are obtained from the control parameters $\lambda_{\alpha,ij}^{l}$, $\lambda_{\beta,ij}^{l}$, and $\lambda_{\gamma,ij}^{l}$ through the softmax function. The weight maps $\lambda_{\alpha}^{l}$, $\lambda_{\beta}^{l}$, and $\lambda_{\gamma}^{l}$ are computed from $X^{1 \rightarrow l}$, $X^{2 \rightarrow l}$, and $X^{3 \rightarrow l}$, respectively, by 1 × 1 convolutional layers and are learned via standard backpropagation. The features of the three levels in YOLOv5-ASFF are adaptively aggregated at each corresponding scale, and the fused features are fed into the head for the classification and detection of multistage strawberries.
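A minimal PyTorch sketch of the adaptive fusion in Equations (2) and (3) is given below; it assumes the three level features have already been resized to a common resolution and channel count, and the 16-channel compression width is an illustrative choice rather than the paper's setting:

```python
import torch
import torch.nn as nn

class ASFFFuse(nn.Module):
    """Adaptive fusion for one output level (Equations (2) and (3)).

    A minimal sketch: assumes the three level features have already been
    resized to this level's resolution and channel count.
    """
    def __init__(self, channels, inter=16):
        super().__init__()
        # 1x1 convs compress each level to `inter` channels before weighting
        self.w1 = nn.Conv2d(channels, inter, 1)
        self.w2 = nn.Conv2d(channels, inter, 1)
        self.w3 = nn.Conv2d(channels, inter, 1)
        # A final 1x1 conv maps the stacked maps to three weight channels
        self.weight_levels = nn.Conv2d(inter * 3, 3, 1)

    def forward(self, x1, x2, x3):
        lam = torch.cat([self.w1(x1), self.w2(x2), self.w3(x3)], dim=1)
        w = torch.softmax(self.weight_levels(lam), dim=1)  # alpha+beta+gamma = 1
        a, b, g = w[:, 0:1], w[:, 1:2], w[:, 2:3]          # each in [0, 1]
        return a * x1 + b * x2 + g * x3

fuse = ASFFFuse(256)
x = [torch.randn(1, 256, 40, 40) for _ in range(3)]
print(fuse(*x).shape)  # torch.Size([1, 256, 40, 40])
```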

3.3.3. Loss Function for Multiclassification Tasks

The loss function $L_{loss}$ of the YOLOv5-ASFF model used for multistage strawberry detection updates the gradient during training and is the sum of the coordinate localization loss $L_{ciou}$, the target confidence loss $L_{obj}$, and the classification loss $L_{cls}$. The formulas are as follows:
$$L_{loss} = L_{ciou} + L_{obj} + L_{cls}, \tag{4}$$

$$L_{ciou} = \sum_{i=0}^{S^{2}} \sum_{j=0}^{B} I_{i,j}^{obj} \left[ 1 - IOU + \frac{\rho^{2}\left(b, b^{gt}\right)}{c^{2}} + \alpha \nu \right], \tag{5}$$

$$L_{obj} = -\sum_{i=0}^{S^{2}} \sum_{j=0}^{B} I_{i,j}^{obj} \left[ \hat{C}_{i} \log C_{i} + \left(1 - \hat{C}_{i}\right) \log \left(1 - C_{i}\right) \right] - \lambda_{noobj} \sum_{i=0}^{S^{2}} \sum_{j=0}^{B} I_{i,j}^{noobj} \left[ \hat{C}_{i} \log C_{i} + \left(1 - \hat{C}_{i}\right) \log \left(1 - C_{i}\right) \right], \tag{6}$$

$$L_{cls} = -\sum_{i=0}^{S^{2}} I_{i,j}^{obj} \sum_{c \in classes} \left[ \hat{P}_{i}(c) \log P_{i}(c) + \left(1 - \hat{P}_{i}(c)\right) \log \left(1 - P_{i}(c)\right) \right], \tag{7}$$

$$\nu = \frac{4}{\pi^{2}} \left( \arctan \frac{w^{gt}}{h^{gt}} - \arctan \frac{w}{h} \right)^{2}, \tag{8}$$

$$\alpha = \frac{\nu}{(1 - IOU) + \nu}, \tag{9}$$
where $S$ is the number of grids, $B$ is the number of prior boxes, $\lambda_{noobj}$ is a weight factor, and $\rho(\cdot)$ is the Euclidean distance. $I_{i,j}^{obj}$ and $I_{i,j}^{noobj}$ indicate whether the $j$-th prior box in the $i$-th grid contains the required object: if it does, $I_{i,j}^{obj}$ and $I_{i,j}^{noobj}$ are 1 and 0, respectively; otherwise, they are 0 and 1. $\hat{C}_{i}$ indicates whether the anchor box is responsible for the prediction (1 if yes, 0 otherwise), $C_{i}$ is the confidence of the predicted box, $\hat{P}_{i}$ and $P_{i}$ are the category probabilities of the annotation box and the prediction box, respectively, and $c$ is the diagonal length of the smallest enclosing area containing the predicted box and the ground-truth box. $IOU$ denotes the ratio of the intersection to the union of the predicted bounding box and the ground-truth bounding box; $b$, $w$, and $h$ are the center coordinates, width, and height of the predicted box, respectively; $b^{gt}$, $w^{gt}$, and $h^{gt}$ are those of the ground-truth box; $\alpha$ is a weighting factor; and $\nu$ measures the consistency of the aspect ratios.
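A minimal PyTorch sketch of the CIoU term in Equations (5), (8), and (9) for a single box pair follows; the corner-format (x1, y1, x2, y2) box representation is an illustrative choice:

```python
import math
import torch

def ciou_loss(box_p, box_g, eps=1e-7):
    """1 - IOU + rho^2/c^2 + alpha*nu for boxes in (x1, y1, x2, y2) format."""
    # Intersection and union
    iw = (torch.min(box_p[2], box_g[2]) - torch.max(box_p[0], box_g[0])).clamp(0)
    ih = (torch.min(box_p[3], box_g[3]) - torch.max(box_p[1], box_g[1])).clamp(0)
    inter = iw * ih
    area_p = (box_p[2] - box_p[0]) * (box_p[3] - box_p[1])
    area_g = (box_g[2] - box_g[0]) * (box_g[3] - box_g[1])
    iou = inter / (area_p + area_g - inter + eps)

    # rho^2: squared distance between the two box centers
    rho2 = ((box_p[0] + box_p[2] - box_g[0] - box_g[2]) ** 2
            + (box_p[1] + box_p[3] - box_g[1] - box_g[3]) ** 2) / 4

    # c^2: squared diagonal of the smallest enclosing box
    cw = torch.max(box_p[2], box_g[2]) - torch.min(box_p[0], box_g[0])
    ch = torch.max(box_p[3], box_g[3]) - torch.min(box_p[1], box_g[1])
    c2 = cw ** 2 + ch ** 2 + eps

    # nu: aspect-ratio consistency (Eq. (8)); alpha: trade-off factor (Eq. (9))
    wp, hp = box_p[2] - box_p[0], box_p[3] - box_p[1]
    wg, hg = box_g[2] - box_g[0], box_g[3] - box_g[1]
    nu = (4 / math.pi ** 2) * (torch.atan(wg / hg) - torch.atan(wp / hp)) ** 2
    alpha = nu / (1 - iou + nu + eps)
    return 1 - iou + rho2 / c2 + alpha * nu

print(ciou_loss(torch.tensor([0., 0., 2., 2.]), torch.tensor([1., 1., 3., 3.])))
```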

3.4. Experimental Environment

All tests in this experiment were conducted using the same computer equipment. The specific environment configuration details are shown in Table 3.
All images were compressed from 1000 × 750 pixels to 640 × 640 pixels before training. The training epoch count was set to 100, the batch size was set to 4, and the hyperparameters were set to the default values of YOLOv5. A pretrained weights file (based on the COCO dataset) was used for transfer learning to improve training speed and model accuracy. Stochastic gradient descent (SGD) was used as the optimizer, and training efficiency was improved by adjusting the learning rate. To bring the model to its best convergence state, YOLOv5 uses the cosine annealing algorithm to reduce the learning rate: as the number of training epochs increases, the learning rate decays slowly along the cosine curve, and at a restart it rises rapidly before falling slowly again, helping the model escape local optima and converge to a better optimum as the learning rate is continuously adjusted until training stops. The principle of the cosine annealing algorithm is as follows [34]:
$$l_{new} = l_{min}^{i} + \frac{1}{2}\left(l_{max}^{i} - l_{min}^{i}\right)\left[1 + \cos\left(\frac{T_{cur}}{T_{i}}\pi\right)\right], \tag{10}$$

where $l_{new}$ is the latest learning rate, $l_{min}^{i}$ is the minimum learning rate, $l_{max}^{i}$ is the maximum learning rate, $T_{cur}$ is the number of epochs executed so far, and $T_{i}$ is the total number of epochs in the current run. In this experiment, the loss, mAP, precision, and recall curves converged within the first 50 epochs, and no underfitting or overfitting occurred during subsequent training.
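The schedule in Equation (10) can be sketched in a few lines; the lr_max and lr_min values below are hypothetical placeholders rather than the hyperparameters used in training:

```python
import math

def cosine_annealed_lr(epoch, total_epochs, lr_max=0.01, lr_min=0.0001):
    """Equation (10): decay the learning rate along a half cosine."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * epoch / total_epochs))

for epoch in (0, 25, 50, 75, 100):
    print(f"epoch {epoch:3d}: lr = {cosine_annealed_lr(epoch, 100):.5f}")
```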

3.5. Evaluation Metrics

Metrics are crucial for model evaluation; in this experiment, we evaluated the model using widely applied model evaluation metrics: average precision (AP), mean average precision (mAP), precision (P), recall (R), and F1. AP is the average precision of a single target category, while mAP is the average precision of all categories. This paper used mAP and F1 to measure the model’s detection accuracy, and frames per second (FPS) to evaluate the model’s detection speed.
We used the intersection over union (IOU) between the predicted box and the labeled ground-truth box to determine whether a target was successfully predicted: if IOU ≥ 0.5, the target was considered successfully predicted; if IOU < 0.5, the prediction was considered incorrect. TP denotes labeled strawberry targets (positive samples) that were correctly detected, FP denotes unlabeled background (negative samples) incorrectly detected as targets, FN denotes positive samples that were not detected, and TN denotes negative samples that were correctly left undetected. P denotes the proportion of samples identified as positive that were predicted correctly; R denotes the proportion of all positive samples that were predicted correctly; F1 is a composite evaluation index of P and R. The formulas are shown below.
$$P = \frac{TP}{TP + FP}, \tag{11}$$

$$R = \frac{TP}{TP + FN}, \tag{12}$$

$$F1 = \frac{2 \times P \times R}{P + R}, \tag{13}$$

$$AP = \int_{0}^{1} P(R)\, dR, \tag{14}$$

$$mAP = \frac{1}{N} \sum_{i=1}^{N} AP_{i}, \tag{15}$$
where N is the number of categories (N = 4).
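A short numerical sketch of Equations (11)–(13) and (15) follows; the TP/FP/FN counts and per-class AP values are hypothetical, for illustration only:

```python
def detection_metrics(tp, fp, fn):
    """Precision, recall, and F1 from detection counts (Equations (11)-(13))."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f1 = 2 * p * r / (p + r)
    return p, r, f1

# mAP is the mean of the per-class APs (Equation (15)).
# Hypothetical per-class AP values for the four strawberry stages:
aps = {"fruitlet": 0.88, "expanding": 0.90, "turning": 0.92, "mature": 0.98}
map_value = sum(aps.values()) / len(aps)

print(detection_metrics(tp=880, fp=120, fn=110))
print(f"mAP = {map_value:.4f}")
```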

4. Results and Discussion

The strawberry dataset in this experiment contained four growth stages of strawberry: fruitlet, expanding, turning, and mature. It served to test the validity and superiority of the proposed YOLOv5-ASFF algorithm, and three other one-stage detection algorithms were selected for comparison tests.

4.1. Results of Detection Algorithms

In this experiment, we chose three mainstream one-stage detection algorithms, SSD, YOLOv3, and YOLOv4, as the comparison algorithms (all algorithms were trained with default parameters; the epoch count was 100 and the batch size was 4); the detection results are shown in Table 4, and Figure 6 illustrates the differences among the algorithms more visually. YOLOv5-ASFF performed best in detection accuracy, with mAP and F1 reaching 91.86% and 88.03%, respectively, improvements of 0.99% and 1.09% over the original YOLOv5s. The other models performed as follows (mAP, F1): YOLOv4 (83.33%, 79.51%), YOLOv3 (81.00%, 78.21%), and SSD (73.61%, 72.22%). In terms of detection speed, the models ranked, in descending order, SSD, YOLOv5s, YOLOv3, YOLOv5-ASFF, and YOLOv4. Although the FPS of SSD reached 91, its mAP and F1 only reached 73.61% and 72.22%, making it unsuitable for multistage strawberry detection because of its low accuracy. Although the FPS of YOLOv5-ASFF decreased slightly compared with the original YOLOv5s, it still reached 56 frames/s, fully satisfying the need for remote growth monitoring of strawberries in smart farming. Thus, YOLOv5-ASFF was the preferred algorithm in this experiment. Some examples of strawberry detection are shown in Figure 7.
A single division into training and test sets easily risks overfitting the model to that particular test set. To verify the performance of YOLOv5-ASFF more comprehensively, we randomly divided the strawberry dataset into five groups and applied k-fold cross-validation (k = 5) to train YOLOv5s and YOLOv5-ASFF, testing the mAP for each fold; a sketch of this division follows. The k-fold cross-validation results are shown in Table 5.
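A minimal sketch of this five-fold division, assuming scikit-learn's KFold and hypothetical image file names (retraining and mAP evaluation per fold are elided):

```python
from sklearn.model_selection import KFold

image_paths = [f"dataset/images/{i:04d}.jpg" for i in range(1217)]  # hypothetical names

kf = KFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(kf.split(image_paths), start=1):
    train_files = [image_paths[i] for i in train_idx]
    test_files = [image_paths[i] for i in test_idx]
    # Each fold: write train/test lists, retrain the model, and record its mAP
    print(f"fold {fold}: {len(train_files)} train / {len(test_files)} test images")
```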

4.2. Analysis of the Multistage Strawberry Detection Results

For each growth stage of the strawberry, the detection performance of the proposed algorithm was further analyzed. Figure 8 shows the AP and F1 values for each stage of strawberry detection. It can be seen that the AP and F1 of strawberries increased with maturity, which is because, as strawberries develop, their morphology and color become more pronounced, and they have more distinctive shape characteristics; hence, the neural network can discriminate them more accurately. In particular, the accuracy of the discrimination of mature strawberries is much higher than that of the other stages, because the mature strawberries have basically become bright red, and their characteristics are more obvious than those of the other three stages; the high accuracy of the discrimination of mature strawberries facilitates mechanical harvesting.
YOLOv5-ASFF improved the detection of strawberries at all stages compared with YOLOv5s in both AP and F1: AP and F1 improved by 0.51% and 1.93% for the fruitlet stage, by 1.02% and 1.56% for the expanding stage, by 1.79% and 0.03% for the turning stage, and by 0.67% and 0.94% for the mature stage, respectively. The comparison is shown in Figure 8. In this experiment, by introducing the ASFF structure, the YOLOv5 network spatially filtered conflicting information to suppress the inconsistency between different feature scales and improve scale invariance, thus improving the accuracy of multistage strawberry growth detection.
Figure 9 shows the comparative graphs of YOLOv5-ASFF and YOLOv5 for the detection of strawberries at each growth stage. From Figure 9a, it can be seen that YOLOv5-ASFF had better accuracy in detecting strawberries. As shown in Figure 9b, YOLOv5s mistakenly identified the strawberry fruit stem part as fruitlet, while YOLOv5-ASFF successfully avoided this misdetection. In Figure 9c, YOLOv5s showed a missed detection of fruitlet, which was effectively avoided by the improved algorithm.

4.3. Model Interpretability and Feature Visualization

Currently, deep learning models still lack a corresponding explanation of their detection process, which hinders model optimization. Some researchers have used feature-visualization techniques to improve model interpretability, i.e., converting the features of different network output layers into visual images so that the features extracted by different convolution layers can be inspected. To demonstrate the effectiveness of the proposed YOLOv5-ASFF algorithm more intuitively, we applied Grad-CAM (gradient-weighted class activation mapping) to YOLOv5s and YOLOv5-ASFF to visualize and analyze the output layers of the networks [35].
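For reference, a minimal self-contained Grad-CAM sketch based on forward and backward hooks is given below; the choice of target layer and the scalar score function are illustrative assumptions, not the exact configuration used in this research:

```python
import torch
import torch.nn.functional as F

def grad_cam(model, layer, x, score_fn):
    """Minimal Grad-CAM sketch: heatmap = ReLU(sum_k w_k * A_k).

    `layer` is the convolutional layer to visualize; `score_fn` reduces the
    model output to a scalar (e.g., the summed score of the detections of
    interest). Both are assumptions for illustration.
    """
    feats, grads = [], []
    h1 = layer.register_forward_hook(lambda m, i, o: feats.append(o))
    h2 = layer.register_full_backward_hook(lambda m, gi, go: grads.append(go[0]))

    score = score_fn(model(x))
    model.zero_grad()
    score.backward()
    h1.remove(); h2.remove()

    a, g = feats[0], grads[0]
    weights = g.mean(dim=(2, 3), keepdim=True)     # global-average-pooled gradients
    cam = F.relu((weights * a).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=x.shape[2:], mode="bilinear", align_corners=False)
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # normalize to [0, 1]
```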
As can be seen in Figure 10, the YOLOv5s model was susceptible to interference from the image background in the field environment, such as branches and leaves; strawberries at the fruitlet stage are similar in color to this background, which affected the model's discrimination to some extent. In contrast, the YOLOv5-ASFF algorithm effectively filtered the background information and highlighted the target fruits during detection, demonstrating the superiority of the improved model.

4.4. Model Application Analysis

This research achieved the detection of strawberries at multiple stages in a complex field environment, demonstrating the applicability and greater robustness of the method. The small size of YOLOv5-ASFF, only 25.4 MB, makes it easier to apply and greatly facilitates the monitoring of field information, and it can also be easily attached to small picking robots for the efficient picking of ripe strawberry fruit. Fu et al. [36] developed a YOLOv3-Tiny-based algorithm for the automatic detection of kiwi targets with high accuracy for both daytime and nighttime kiwifruit. However, the present dataset lacks the image information of strawberries in a low-light environment; thus, the detection performance of the method at night needs to be further investigated.

5. Conclusions

To achieve the detection of multistage strawberries in complex field environments, an effective detection algorithm based on YOLOv5 was proposed in this research. The method introduces ASFF into the head of the YOLOv5 network to fully extract and learn target features, achieved good detection performance in the experiments, and provides a new method for growth monitoring of agricultural products on smart farms. In our tests, the method achieved an F1 of 88.03%, an mAP of 91.86%, and an AP of 98.77% for detecting mature strawberries, performing better than SSD, YOLOv3, YOLOv4, and YOLOv5s. The YOLOv5-ASFF algorithm can improve strawberry detection performance in complex environments and has high discrimination accuracy for mature strawberries, which benefits the fruit-harvesting operation of picking robots. However, there is still room to improve the detection accuracy for the other three growth stages, i.e., the fruitlet, expanding, and turning stages, and the model has not yet been ported to a specific embedded device (e.g., Jetson TX1). In future work, we will expand the strawberry dataset by collecting more images and will highlight strawberry features at different growth stages using image enhancement methods to further improve multistage detection accuracy. We also plan to test the model's performance on specific embedded devices and optimize it further.

Author Contributions

Conceptualization, Y.L. (Yaodi Li); methodology, Y.L. (Yaodi Li); data curation, Y.L. (Yaodi Li), J.Y., Y.L. (Yang Liu) and M.Z.; writing—original draft preparation, Y.L. (Yaodi Li); writing—review and editing, Y.L. (Yaodi Li) and X.Q.; supervision, J.X., D.Z. and Z.L.; project administration, J.X. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China, Grant/Award Number 31801632.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Baby, B.; Antony, P.; Vijayan, R. Antioxidant and Anticancer Properties of Berries. Crit. Rev. Food Sci. Nutr. 2018, 58, 2491–2507.
  2. Zhou, C.; Hu, J.; Xu, Z.; Yue, J.; Ye, H.; Yang, G. A Novel Greenhouse-Based System for the Detection and Plumpness Assessment of Strawberry Using an Improved Deep Learning Technique. Front. Plant Sci. 2020, 11, 559.
  3. Kamilaris, A.; Prenafeta-Boldú, F.X. Deep Learning in Agriculture: A Survey. Comput. Electron. Agric. 2018, 147, 70–90.
  4. Li, S.; Zhang, S.; Xue, J.; Sun, H. Lightweight Target Detection for the Field Flat Jujube Based on Improved YOLOv5. Comput. Electron. Agric. 2022, 202, 107391.
  5. Qiao, Y.; Guo, Y.; He, D. Cattle Body Detection Based on YOLOv5-ASFF for Precision Livestock Farming. Comput. Electron. Agric. 2023, 204, 107579.
  6. Koirala, A.; Walsh, K.B.; Wang, Z.; McCarthy, C. Deep Learning for Real-Time Fruit Detection and Orchard Fruit Load Estimation: Benchmarking of ‘MangoYOLO’. Precis. Agric. 2019, 20, 1107–1135.
  7. Yan, B.; Fan, P.; Lei, X.; Liu, Z.; Yang, F. A Real-Time Apple Targets Detection Method for Picking Robot Based on Improved YOLOv5. Remote Sens. 2021, 13, 1619.
  8. Lawal, O.M. Development of Tomato Detection Model for Robotic Platform Using Deep Learning. Multimed. Tools Appl. 2021, 80, 26751–26772.
  9. Montoya-Cavero, L.-E.; Díaz De León Torres, R.; Gómez-Espinosa, A.; Escobedo Cabello, J.A. Vision Systems for Harvesting Robots: Produce Detection and Localization. Comput. Electron. Agric. 2022, 192, 106562.
  10. Lawal, O.M. YOLOMuskmelon: Quest for Fruit Detection Speed and Accuracy Using Deep Learning. IEEE Access 2021, 9, 15221–15227.
  11. Guo, X.; Li, J.; Zheng, L.; Zhang, M.; Wang, M. Acquiring Soybean Phenotypic Parameters Using Re-YOLOv5 and Area Search Algorithm. Trans. Chin. Soc. Agric. Eng. 2022, 38, 186–194.
  12. Fu, X.; Li, A.; Meng, Z.; Yin, X.; Zhang, C.; Zhang, W.; Qi, L. A Dynamic Detection Method for Phenotyping Pods in a Soybean Population Based on an Improved YOLO-v5 Network. Agronomy 2022, 12, 3209.
  13. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2016; Volume 9905, pp. 21–37.
  14. Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6517–6525.
  15. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767.
  16. Bochkovskiy, A.; Wang, C.-Y.; Liao, H.-Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934.
  17. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
  18. Purkait, P.; Zhao, C.; Zach, C. SPP-Net: Deep Absolute Pose Regression with Synthetic Views. arXiv 2017, arXiv:1712.03452.
  19. Xie, X.; Cheng, G.; Wang, J.; Yao, X.; Han, J. Oriented R-CNN for Object Detection. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 3500–3509.
  20. Wang, X.; Shrivastava, A.; Gupta, A. A-Fast-RCNN: Hard Positive Generation via Adversary for Object Detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 3039–3048.
  21. Zheng, H.; Wang, G.; Li, X. YOLOX-Dense-CT: A Detection Algorithm for Cherry Tomatoes Based on YOLOX and DenseNet. J. Food Meas. Charact. 2022, 16, 4788–4799.
  22. Xu, Z.; Liu, J.; Wang, J.; Cai, L.; Jin, Y.; Zhao, S.; Xie, B. Realtime Picking Point Decision Algorithm of Trellis Grape for High-Speed Robotic Cut-and-Catch Harvesting. Agronomy 2023, 13, 1618.
  23. Tang, Y.; Zhou, H.; Wang, H.; Zhang, Y. Fruit Detection and Positioning Technology for a Camellia Oleifera C. Abel Orchard Based on Improved YOLOv4-Tiny Model and Binocular Stereo Vision. Expert Syst. Appl. 2023, 211, 118573.
  24. Lu, Z.; Zhang, F.; Wei, X.; Huang, Y.; Li, J.; Zhang, Z. Synergistic Recognition of Tomato Flowers and Fruits in Greenhouse Using Combination Enhancement of YOLOX-ViT. Trans. Chin. Soc. Agric. Eng. 2023, 39, 124–134.
  25. Huang, M.-L.; Wu, Y.-S. GCS-YOLOV4-Tiny: A Lightweight Group Convolution Network for Multi-Stage Fruit Detection. Math. Biosci. Eng. 2022, 20, 241–268.
  26. Zhang, B.; Wang, R.; Zhang, H.; Yin, C.; Xia, Y.; Fu, M.; Fu, W. Dragon Fruit Detection in Natural Orchard Environment by Integrating Lightweight Network and Attention Mechanism. Front. Plant Sci. 2022, 13, 1040923.
  27. Jiang, M.; Song, L.; Wang, Y.; Li, Z.; Song, H. Fusion of the YOLOv4 Network Model and Visual Attention Mechanism to Detect Low-Quality Young Apples in a Complex Environment. Precis. Agric. 2022, 23, 559–577.
  28. Tian, Y.; Yang, G.; Wang, Z.; Wang, H.; Li, E.; Liang, Z. Apple Detection during Different Growth Stages in Orchards Using the Improved YOLO-V3 Model. Comput. Electron. Agric. 2019, 157, 417–426.
  29. Park, H.; Yoo, Y.; Seo, G.; Han, D.; Yun, S.; Kwak, N. C3: Concentrated-Comprehensive Convolution and Its Application to Semantic Segmentation. arXiv 2018, arXiv:1812.04920.
  30. Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 936–944.
  31. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path Aggregation Network for Instance Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018.
  32. Feng, J.; Yi, C. Lightweight Detection Network for Arbitrary-Oriented Vehicles in UAV Imagery via Global Attentive Relation and Multi-Path Fusion. Drones 2022, 6, 108.
  33. Liu, S.; Huang, D.; Wang, Y. Learning Spatial Fusion for Single-Shot Object Detection. arXiv 2019, arXiv:1911.09516.
  34. Loshchilov, I.; Hutter, F. SGDR: Stochastic Gradient Descent with Warm Restarts. arXiv 2016, arXiv:1608.03983.
  35. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017.
  36. Fu, L.; Feng, Y.; Wu, J.; Liu, Z.; Gao, F.; Majeed, Y.; Al-Mallahi, A.; Zhang, Q.; Li, R.; Cui, Y. Fast and Accurate Detection of Kiwifruit in Orchard Using Improved YOLOv3-Tiny Model. Precis. Agric. 2021, 22, 754–776.
Figure 1. Strawberry dataset samples.
Figure 2. Four stages of strawberry fruit growth: (a) fruitlet; (b) expanding; (c) turning; (d) mature.
Figure 3. The structure of YOLOv5-ASFF.
Figure 4. The Conv, C3, and SPPF module structures.
Figure 5. The FPN-PAN and ASFF structure.
Figure 6. Performance comparison of different algorithms.
Figure 7. Detection results of multistage strawberries.
Figure 8. Comparison of YOLOv5-ASFF and the original algorithm (YOLOv5s).
Figure 9. Comparison of YOLOv5s and YOLOv5-ASFF detection results. YOLOv5s detection results are boxed in black, while YOLOv5-ASFF detection results are boxed in blue. (a) YOLOv5-ASFF has higher detection accuracy; (b) YOLOv5-ASFF successfully avoids false detection; (c) YOLOv5-ASFF successfully avoids missed detection.
Figure 10. Visualization analysis of Grad-CAM: (a) fruitlet; (b) expanding; (c) turning; (d) mature. Blue represents zero features, and red represents the maximum number of features.
Table 1. Details of the strawberry dataset.

Dataset      Images   Fruitlet   Expanding   Turning   Mature
Training     852      1935       2086        982       1485
Validation   122      318        286         152       206
Test         243      533        552         280       410

The last four columns give the number of labels per growth stage.
Table 2. The YOLOv5-ASFF network structure.

Layer  Input         Parameters  Module       Tensor Information
0      −1            3520        Conv         [3, 32, 6, 2, 2]
1      −1            18,560      Conv         [32, 64, 3, 2]
2      −1            18,816      C3           [64, 64, 1]
3      −1            73,984      Conv         [64, 128, 3, 2]
4      −1            115,712     C3           [128, 128, 2]
5      −1            295,424     Conv         [128, 256, 3, 2]
6      −1            625,152     C3           [256, 256, 3]
7      −1            1,180,672   Conv         [256, 512, 3, 2]
8      −1            1,182,720   C3           [512, 512, 1]
9      −1            656,896     SPPF         [512, 512, 5]
10     −1            131,584     Conv         [512, 256, 1, 1]
11     −1            0           Upsample     [None, 2, "nearest"]
12     [−1, 6]       0           Concat       [1]
13     −1            361,984     C3           [512, 256, 1, False]
14     −1            33,024      Conv         [256, 128, 1, 1]
15     −1            0           Upsample     [None, 2, "nearest"]
16     [−1, 4]       0           Concat       [1]
17     −1            90,880      C3           [256, 128, 1, False]
18     −1            147,712     Conv         [128, 128, 3, 2]
19     [−1, 14]      0           Concat       [1]
20     −1            296,448     C3           [256, 256, 1, False]
21     −1            590,336     Conv         [256, 256, 3, 2]
22     [−1, 10]      0           Concat       [1]
23     −1            1,182,720   C3           [512, 512, 1, False]
24     [17, 20, 23]  5,463,722   ASFF_Detect  /

The input "−1" refers to the output of the previous layer. The tensor information includes the number of input channels, the number of output channels, the convolution kernel size, the stride, and the grouping.
Table 3. Experimental environment configurations.

Hardware    Configuration               Environment    Version
System      Windows 11                  PyCharm        2022.1.3
CPU         AMD Ryzen 7 5800H           PyTorch        1.12.1
GPU         NVIDIA RTX 3070 (8 GB)      Python         3.8.5
RAM         40 GB                       CUDA           11.3
Hard disk   2.5 TB                      CUDA Toolkit   11.3.1
Table 4. The comparative results of the different algorithms.

Algorithm     Precision (%)  Recall (%)  F1 (%)  mAP@0.5 (%)  Lloss  FPS (frames/s)
SSD           86.32          62.10       72.22   73.61        0.132  91
YOLOv3        84.26          72.97       78.21   81.00        0.079  57
YOLOv4        86.42          73.63       79.51   83.33        0.086  41
YOLOv5s       86.71          87.17       86.94   90.87        0.065  64
YOLOv5-ASFF   87.19          88.88       88.03   91.86        0.057  56
Table 5. Results of k-fold cross-validation.

Fold                 mAP of YOLOv5s (%)   mAP of YOLOv5-ASFF (%)
1                    90.87                91.86
2                    90.12                92.34
3                    90.91                91.79
4                    90.13                91.43
5                    90.97                92.31
Average              90.60                91.95
Standard deviation   0.39                 0.34
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Citation: Li, Y.; Xue, J.; Zhang, M.; Yin, J.; Liu, Y.; Qiao, X.; Zheng, D.; Li, Z. YOLOv5-ASFF: A Multistage Strawberry Detection Algorithm Based on Improved YOLOv5. Agronomy 2023, 13, 1901. https://0-doi-org.brum.beds.ac.uk/10.3390/agronomy13071901
