Article

A Robust Lightweight Network for Pedestrian Detection Based on YOLOv5-x

1 The College of Electronic and Electrical Engineering, Shanghai University of Engineering Science, Shanghai 201620, China
2 Department of Ethnic Music, Shanghai Conservatory of Music, Shanghai 200031, China
* Author to whom correspondence should be addressed.
Submission received: 24 August 2023 / Revised: 1 September 2023 / Accepted: 8 September 2023 / Published: 12 September 2023

Abstract

Pedestrian detection is a crucial task in computer vision, with applications in surveillance, autonomous driving, and robotics. However, detecting pedestrians in complex scenarios, such as rainy days, remains challenging due to the degradation of image quality and the presence of occlusions. To address this issue, we propose RSTDet-Lite, a robust lightweight network for pedestrian detection on rainy days, based on an improved version of YOLOv5-x. Specifically, to reduce the redundant parameters of the YOLOv5-x backbone and enhance its feature extraction capability, we propose CBP-GNet, a novel backbone that incorporates a compact bilinear pooling algorithm; it significantly reduces the parameter count while enhancing the network's fine-grained feature fusion capability. Additionally, we introduce the Simple-BiFPN structure, based on the weighted bidirectional feature pyramid, as a replacement for the original feature pyramid module to further improve feature fusion efficiency. To enhance network performance, we integrate the CBAM attention mechanism and introduce the idea of structural reparameterization. To evaluate our method, we create a new dataset named RainDet3000, which consists of 3000 images captured in various rainy scenarios. The experimental results demonstrate that, compared with YOLOv5, the proposed model reduces the network size by 30 M while achieving a 4.56% increase in mAP, confirming the effectiveness of RSTDet-Lite in rainy-day pedestrian detection scenarios.

1. Introduction

Pedestrian detection is one of the most active research directions in the field of computer vision; it refers to the automatic identification of pedestrian targets in images or videos by computers. Traditional pedestrian detection algorithms usually consist of three steps: image preprocessing, manual feature extraction, and classification. The purpose of image preprocessing is to extract more effective features. Manual feature extraction is generally performed with a sliding-window approach, in which a rectangular box scans the image comprehensively from the top left to the bottom right and one or more specific features (e.g., edge features, image blocks, trait features, wavelet coefficients) are obtained from the input image. Representative features include LBP [1], HOG [2], Edgelet [3], Shapelet [4], Haar-like [5], and Gabor features. Traditional pedestrian detection has advantages such as fast detection speed, but it relies on a single type of feature information, which leads to low recognition accuracy.
With the development of deep learning, a variety of excellent detection algorithms have emerged that effectively address the low recognition rate of traditional pedestrian detection. Mainstream target detection algorithms are divided into two categories: two-stage detectors represented by RCNN [6], Fast RCNN [7], and Faster RCNN [8], and single-stage detectors represented by YOLO [9,10] and FCOS [11]. The former first obtains candidate regions that may contain targets and then performs regression prediction on the size and position of the obtained candidate boxes; the latter directly generates candidate boxes over the whole image and performs regression prediction in a single pass. Single-stage target detection is fast but less accurate; two-stage target detection is more accurate but slower.
Pedestrian detection tasks face challenges such as complex weather conditions and severe occlusions. In addition, existing object detection networks suffer from issues such as large model sizes and slow inference speeds. Many researchers have proposed solutions targeting these problems. For instance, some have suggested replacing CSPDarknet53 with the lightweight MobileNet [12] to reduce the network parameters. Zhang et al. [13] proposed the CA-MobileNetv2-YOLOv4 network, which replaces the CSPDarknet53 backbone of YOLOv4 with MobileNetv2 and adds a coordinate attention mechanism to adjust the weights and focus the network on the regions of interest. The algorithm proposed by Sun et al. [14] takes YOLOv4-Tiny as the baseline and incorporates the Ghost [15] module and dilated convolution, which reduces the model's capacity significantly; however, it limits the model's ability to learn advanced features, and the authors did not consider that dilated convolution may be more effective for larger objects and less suitable for smaller ones. The pedestrian detection algorithm based on YOLOv4 proposed by Zhang et al. [16] replaces the YOLOv4 backbone with ShuffleNet and uses depthwise-separable convolution to reduce the model size; while the network becomes more lightweight, its feature extraction capability decreases, which may lead to a significant drop in recall in complex detection scenarios. Roszyk et al. [17] improved the detection performance of the network under obstacle occlusion by using YOLOv4-Tiny as the baseline, incorporating multispectral methods and reducing latency. The α-CIoU loss function is used in the YOLOv5s-G2 network proposed by Li et al. [18] to alleviate the problems of occlusion and unrecognized small targets in pedestrian detection; however, the GhostC3 module described in that paper increases the structural complexity of the network, which could result in less than optimal inference speed. Sha et al. [19] proposed a lightweight pedestrian detection network built from a scalable attention module based on dilated convolution, which retains important feature channels, and a multiplexed-connection residual block. Zhao et al. [20] proposed a lightweight detection model based on YOLOv5, which combines the MD-SILBP operator and the five-frame differential method to enhance contour feature extraction and uses Distance-IoU non-maximum suppression to reduce the missed detection rate. Sun et al. [21] proposed the PVformer algorithm for vehicle and pedestrian detection in rainy scenarios based on the Swin transformer, introducing a local enhancement perception block and a deraining module to improve detection accuracy in rainy scenes.
In this paper, we investigate a robust pedestrian detection network for complex scenarios. Using YOLOv5-x as the baseline, the focus is on lightweighting the network and improving the accuracy of pedestrian detection under complex conditions. The contributions of this study can be summarized as follows:
  • The RainDet3000 dataset is proposed to fill the gap left by existing pedestrian detection datasets, which do not target rainy days, providing a more realistic detection scenario for network training.
  • The bottleneck layer structure of GhostNet has been optimized using a compact bilinear pooling algorithm to enhance the network’s feature learning capability while maintaining its lightweight architecture. The resulting CBP-GNet is then utilized as the backbone network for RSTDet-Lite.
  • The proposed Simple-BiFPN is an extension of BiFPN, which delivers superior computational efficiency compared with YOLOv5’s PANet feature fusion network by eliminating redundant computational overhead. In addition, an attention mechanism module called CBAM is incorporated between the backbone network and the feature fusion network to optimize the assignment of feature weights, thus enabling the network to learn more effective features.
  • The REP structure is a novel approach to enhancing the capacity of neural networks through structural reparameterization. During training, the REP structure increases the number of trainable parameters, allowing the network to capture richer and more abstract semantic concepts without increasing inference time. In other words, the REP structure provides a way to improve the network’s performance without compromising its efficiency in real-time applications.

2. Related Work

YOLOv5 Algorithm

The YOLOv5 framework, developed by Ultralytics LLC, is an enhanced iteration within the YOLO series. Its structure is a one-stage detection framework comprising four fundamental segments: input, backbone network, neck network, and output. Drawing inspiration from prior YOLO versions and other detection algorithms, YOLOv5 integrates the focus layer into the input phase to facilitate data augmentation. The YOLOv5 authors further refined the spatial pyramid pooling (SPP) [22] technique, evolving it into SPPF, which is faster while producing the same output. Concurrently, the CSPDarknet53 backbone is used to extract primary image features. Notably, the neck network incorporates a feature fusion framework that includes the feature pyramid network (FPN) and a bottom-up path aggregation network; this strategic incorporation enhances short-circuit connections and cross-layer fusion for multiscale features. The holistic YOLOv5 architecture, composed of these four constituent units, is presented in Figure 1.
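For readers who want a concrete picture of the SPPF refinement mentioned above, the following minimal PyTorch sketch shows how three chained 5 × 5 max-pooling layers cover the same receptive fields as SPP's parallel 5/9/13 pooling while reusing intermediate results, which is why SPPF is faster for the same output. This is an illustrative re-implementation under stated assumptions, not the Ultralytics code, and the channel sizes in the shape check are arbitrary.

```python
import torch
import torch.nn as nn

class SPPF(nn.Module):
    """Spatial pyramid pooling, fast variant: three chained 5x5 poolings
    cover the same receptive fields as SPP's parallel 5/9/13 poolings."""
    def __init__(self, in_channels, out_channels, k=5):
        super().__init__()
        hidden = in_channels // 2
        self.cv1 = nn.Conv2d(in_channels, hidden, kernel_size=1)
        self.cv2 = nn.Conv2d(hidden * 4, out_channels, kernel_size=1)
        self.pool = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)

    def forward(self, x):
        x = self.cv1(x)
        y1 = self.pool(x)        # effective 5x5 pooling
        y2 = self.pool(y1)       # effective 9x9 pooling
        y3 = self.pool(y2)       # effective 13x13 pooling
        return self.cv2(torch.cat([x, y1, y2, y3], dim=1))

feat = torch.randn(1, 256, 13, 13)   # e.g., the deepest backbone feature map
print(SPPF(256, 256)(feat).shape)    # torch.Size([1, 256, 13, 13])
```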

3. Method

The network structure of the robust lightweight pedestrian detection algorithm, RSTDet-Lite, is depicted in Figure 2, utilizing a modified GhostNet network as the backbone network and a Simple-BiFPN network as the feature fusion network. Additionally, an attention mechanism, the CBAM module, is inserted between the backbone network and the feature fusion network to enhance feature selection. Finally, an REP module is introduced in front of the YOLO detection head to increase network complexity and improve performance.

3.1. Design of Backbone Network

The efficient and lightweight GhostNet is chosen as the base network structure for the backbone network in this paper. The design philosophy behind the YOLOv5 backbone network shares similarities with YOLOv4, as both draw inspiration from the CSP network architecture. While the CSP network yields commendable performance, its intricate and redundant network parameters lead to sluggish detection speeds, underutilized hardware device capabilities, and demanding hardware specifications. Therefore, we chose the more efficient GhostNet instead. In many efficient neural network models, a large number of feature maps are generated, but many of these features are highly similar. Therefore, the concept of Ghost Module is proposed, which uses simple convolutional and linear operations to obtain more features at a smaller computational cost. This reduces the number of convolution filters used to generate the feature maps, resulting in better performance.
Generally speaking, for input data $X \in \mathbb{R}^{c \times h \times w}$, where $c$ is the number of input channels and $h$ and $w$ denote the height and width of the input data, respectively, the convolution operation of any layer can be described as Equation (1):
$$Y = X * f + b \qquad (1)$$
where $*$ represents the normal convolution operation, $b$ is the bias term, $Y \in \mathbb{R}^{h \times w \times n}$ is the output feature map, $n$ is the number of output channels, $f \in \mathbb{R}^{c \times k \times k \times n}$ is the convolution filter of this layer, and $k \times k$ is the size of the convolution kernel. The output feature map usually contains a large number of repeated features, and some of them are highly similar to others, so a simpler convolution can be used to generate the intrinsic features. Cheap operations can then be applied to the intrinsic features to obtain more features; this convolution can be described as Equation (2):
$$Y' = X * f' \qquad (2)$$
In Equation (2), $f' \in \mathbb{R}^{c \times k \times k \times m}$ with $m \le n$. Compared with Equation (1), Equation (2) removes the bias term to reduce the computational effort, and the number of output channels becomes $m$. In order to bring the number of channels back to $n$, a cheap linear operation with a small number of parameters is applied $s$ times to each intrinsic feature in $Y'$, as shown in Equation (3):
$$y_{ij} = \varphi_{i,j}(y'_i), \quad i = 1, \ldots, m, \; j = 1, \ldots, s \qquad (3)$$
where $y'_i$ denotes the $i$-th intrinsic feature in $Y'$ and $\varphi_{i,j}$ is the $j$-th linear operation (except the last one), which generates the $j$-th ghost feature map from $y'_i$. The last linear operation is an identity mapping that preserves the intrinsic feature map, as depicted in Figure 3. The feature maps obtained from Equation (3) form the output of the Ghost module.
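The Ghost module of Equations (1)–(3) can be sketched in a few lines of PyTorch, as below: a normal convolution produces the $m$ intrinsic feature maps, and a cheap depthwise convolution plays the role of the linear operations $\varphi$, with the two outputs concatenated. The ratio ($s = 2$) and the kernel sizes are illustrative assumptions, not necessarily the configuration used in CBP-GNet.

```python
import torch
import torch.nn as nn

class GhostModule(nn.Module):
    """Ghost module: intrinsic features from a normal convolution plus
    'ghost' features from a cheap depthwise (linear) operation, Eq. (3)."""
    def __init__(self, in_ch, out_ch, ratio=2, kernel=1, cheap_kernel=3):
        super().__init__()
        intrinsic = out_ch // ratio                  # m intrinsic channels
        ghost = out_ch - intrinsic                   # channels produced by cheap ops
        self.primary = nn.Sequential(                # Y' = X * f' (no bias, Eq. (2))
            nn.Conv2d(in_ch, intrinsic, kernel, 1, kernel // 2, bias=False),
            nn.BatchNorm2d(intrinsic),
            nn.ReLU(inplace=True),
        )
        self.cheap = nn.Sequential(                  # phi_{i,j}: depthwise convolution
            nn.Conv2d(intrinsic, ghost, cheap_kernel, 1, cheap_kernel // 2,
                      groups=intrinsic, bias=False),
            nn.BatchNorm2d(ghost),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        y_intrinsic = self.primary(x)
        y_ghost = self.cheap(y_intrinsic)
        return torch.cat([y_intrinsic, y_ghost], dim=1)   # n = m * s output channels

x = torch.randn(1, 16, 52, 52)
print(GhostModule(16, 32)(x).shape)    # torch.Size([1, 32, 52, 52])
```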
The Ghost bottleneck layer is a critical component for enhancing network performance, as illustrated in Figure 4. Structurally, it resembles the basic residual module in ResNet [23], featuring a residual connection and two Ghost modules, which respectively expand and then reduce the number of channels to match the residual connection. This structure is used when the stride is set to 1. When the stride is set to 2, a depthwise convolution with a stride of 2 is inserted between the two Ghost modules for downsampling, which effectively reduces the impact of geometric feature variations and thereby improves the results.
Despite performing well in standard detection tasks, GhostNet faces challenges in complex detection scenarios, particularly with regards to pedestrian height overlap and object occlusion. To address these challenges, we incorporated the compact bilinear pooling (CBP) algorithm [24] to enhance network feature fusion and employed the ELU activation function to accelerate network convergence and improve feature expression capabilities. Additionally, we redesigned the bottleneck layer structure of the network.
The CBP algorithm is an extension of the bilinear pooling method [25], whose outer-product features are extremely high-dimensional. The compact bilinear pooling algorithm addresses this through three main steps: feature mapping, bilinear pooling, and compression. In the feature mapping step, two input vectors A and B, typically of dimensions $d_1$ and $d_2$, are mapped. Bilinear pooling, the core concept of CBP, computes the product of each feature in A with each feature in B and aggregates these products, resulting in a high-dimensional feature vector of size $d_1 \times d_2$. To manage the potentially large number of products and reduce computational complexity, CBP employs a compression technique that compresses the high-dimensional feature vector into a lower-dimensional representation, usually of a fixed dimension, suitable for subsequent tasks. This compression step improves computational efficiency.
CBP excels at capturing high-order feature interactions between input vectors, making it highly valuable in various computer vision and natural language processing tasks. It can significantly enhance performance, especially in tasks that require high-dimensional feature interactions, while reducing computational costs. The compact bilinear pooling algorithm approximates the bilinear pooling method by using a low-dimensional polynomial kernel mapping and extends the bilinear pooling method by employing tensor sketch algorithms for model compression. This approach effectively reduces feature dimensions and computational expenses without compromising the effectiveness of feature fusion.
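As a minimal sketch of the Tensor Sketch approximation behind compact bilinear pooling, the snippet below count-sketches two feature vectors into a shared space and multiplies their FFTs, which approximates the outer-product (bilinear) feature in $d$ dimensions. The projection dimension $d = 512$, the random seed, and the batch of per-location feature vectors are illustrative assumptions; how CBP is wired into the CBP-G bottleneck follows Figure 5 and is not reproduced here.

```python
import torch

def count_sketch(x, h, s, d):
    """Count-sketch projection of features x with shape (B, c) into d bins."""
    sketch = x.new_zeros(x.size(0), d)
    sketch.index_add_(1, h, x * s)          # scatter signed features into hashed bins
    return sketch

def compact_bilinear_pooling(a, b, d=512, seed=0):
    """Tensor Sketch approximation of bilinear pooling of a: (B, c1), b: (B, c2)."""
    g = torch.Generator().manual_seed(seed)
    h1 = torch.randint(0, d, (a.size(1),), generator=g)                     # random hashes
    h2 = torch.randint(0, d, (b.size(1),), generator=g)
    s1 = (torch.randint(0, 2, (a.size(1),), generator=g) * 2 - 1).float()   # random signs
    s2 = (torch.randint(0, 2, (b.size(1),), generator=g) * 2 - 1).float()
    fa = torch.fft.fft(count_sketch(a, h1, s1, d))
    fb = torch.fft.fft(count_sketch(b, h2, s2, d))
    return torch.fft.ifft(fa * fb).real      # convolution theorem: sketch of outer product

a, b = torch.randn(4, 112), torch.randn(4, 112)
print(compact_bilinear_pooling(a, b).shape)   # torch.Size([4, 512])
```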
To minimize computational costs, the proposed CBP-G network features two structures: CBP-G with stride 1 simply modifies the activation function of G-bottleneck, whereas CBP-G with stride 2 employs the compact bilinear pooling algorithm, which more effectively integrates residual edges and Ghost modules and uses the more efficient ELU activation function. The architecture of CBP-G with stride 2 is illustrated in Figure 5, with CBP denoting the compact bilinear pooling structure.
The design of CBP-GNet maintains the lightweight characteristics of GhostNet while enhancing the model’s ability to capture and integrate features from various network layers. This contributes to a more comprehensive understanding of pedestrian appearances under adverse weather conditions. This improved feature fusion effectively combines contextual information, spatial relationships, and scale-related details, which is crucial for accurate pedestrian detection in adverse weather conditions.

3.2. Proposed Simple-BiFPN Structure

The BiFPN structure, proposed by the Google team in EfficientDet [26] and illustrated in Figure 6c, is an efficient weighted bidirectional feature pyramid network. When examining the PANet utilized in YOLOv5 (Figure 6a), it becomes apparent that node A and node B have limited impact on feature fusion due to their one-dimensional feature inputs. Consequently, BiFPN removes these nodes to minimize redundant parameters. Furthermore, by introducing an extra skip connection within the same feature dimension, more features can be fused without additional computational overhead. Lastly, to optimize the one-path structure of NAS-FPN [27], BiFPN integrates both top-down and bottom-up paths into a single feature layer network.
The solution proposed by BiFPN is to add a weight to each input and let the network learn to evaluate the importance of each input and assign it an appropriate weight. For weight fusion, BiFPN uses a fast normalized fusion strategy to ensure that the output feature representation is of high quality, as described in Equation (4):
$$O = \sum_{i} \frac{\omega_i}{\epsilon + \sum_{j} \omega_j} \cdot I_i \qquad (4)$$
The construction of Simple-BiFPN proceeds as follows: with an input image size of $(416, 416, 3)$, $P_{1\_in}$ has dimensions of $(104, 104, 24)$, $P_{2\_in}$ has dimensions of $(52, 52, 40)$, and $P_{3\_in}$ has dimensions of $(26, 26, 112)$. Since the YOLO detector only utilizes three scales of information, $P_{4\_in}$, with dimensions of $(13, 13, 160)$, is additionally obtained by downsampling to further improve feature fusion, but it only participates in the feature fusion process and not in the final output. The Simple-BiFPN module assigns a weight $\omega_i$ to each input feature map; the $P_3$ layer is taken as an example in Equation (5).
$$P_{3\_td} = \mathrm{Conv}\left(\frac{\omega_1 \cdot P_{3\_in} + \omega_2 \cdot \mathrm{Resize}(P_{4\_in})}{\omega_1 + \omega_2 + \epsilon}\right), \qquad P_{3\_out} = \mathrm{Conv}\left(\frac{\omega_1' \cdot P_{3\_in} + \omega_2' \cdot P_{3\_td} + \omega_3' \cdot \mathrm{Resize}(P_{2\_out})}{\omega_1' + \omega_2' + \omega_3' + \epsilon}\right) \qquad (5)$$
The Resize operation in Equation (5) is generally performed by upsampling or downsampling to unify the feature scales. $P_{3\_td}$ refers to the intermediate features of the layer $P_{3\_in}$, and $P_{3\_out}$ refers to the output features of that layer. To further improve feature fusion efficiency, depthwise separable convolution is also incorporated into this process. The other layers are constructed in the same way, likewise utilizing depthwise separable convolution to enhance feature fusion efficiency. The network architecture of Simple-BiFPN is illustrated in Figure 7.
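The fast normalized fusion of Equations (4) and (5) reduces to only a few operations, as in the PyTorch sketch below; the ReLU keeps the learnable weights non-negative and the small $\epsilon$ avoids division by zero. The channel counts are illustrative, the inputs are assumed to have already been projected to a common width, and the depthwise separable convolution that Simple-BiFPN applies after fusion is omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedFusion(nn.Module):
    """Fast normalized fusion, Eq. (4): learnable non-negative weights w_i,
    normalized by their sum plus a small epsilon."""
    def __init__(self, num_inputs, eps=1e-4):
        super().__init__()
        self.w = nn.Parameter(torch.ones(num_inputs))
        self.eps = eps

    def forward(self, inputs):
        w = F.relu(self.w)                     # keep weights non-negative
        w = w / (w.sum() + self.eps)           # normalize as in Eq. (4)
        return sum(wi * x for wi, x in zip(w, inputs))

# e.g., fusing a feature map with an upsampled deeper map (shapes are illustrative)
p_in = torch.randn(1, 112, 26, 26)
p_deep_up = F.interpolate(torch.randn(1, 112, 13, 13), scale_factor=2, mode="nearest")
p_td = WeightedFusion(2)([p_in, p_deep_up])    # intermediate feature as in Eq. (5)
print(p_td.shape)                              # torch.Size([1, 112, 26, 26])
```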
An efficient feature fusion network plays a crucial role in addressing the challenges of pedestrian detection in rainy conditions. The advantage of Simple-BiFPN lies in its seamless integration of multiscale features from different network layers, enhancing the model’s ability to capture pedestrian details under adverse weather conditions. This feature fusion network optimally combines contextual information, such as object relationships and spatial dependencies, contributing to improved detection accuracy. Furthermore, it ensures that the network maintains computational efficiency, ensuring real-time or near real-time performance in rainy conditions. This efficiency allows the model to remain effective while efficiently utilizing hardware resources, making it a powerful choice for rainy-day pedestrian detection.

3.3. Incorporating the Spatial Attention Mechanism CBAM

CBAM (convolutional block attention module) is a more comprehensive approach to feature attention, as illustrated in Figure 8. Unlike SENet, which only considers channel attention, CBAM applies both channel and spatial attention. The channel attention mechanism assigns appropriate weights to different channels to help the network focus more effectively on key information. It consists of two parallel branches, global maximum pooling and global average pooling; the two pooled descriptors are each processed through a shared fully connected layer. The outputs of the two branches are then combined through summation and passed through a sigmoid function to obtain values between 0 and 1, which are used to weight the original features and obtain the new feature F1.
The spatial attention mechanism of the CBAM module focuses on the most salient regions in the feature map and assigns different weights to different regions to reduce the proportion of irrelevant regions. When the feature map F1 passes through the spatial attention mechanism, it undergoes global maximum pooling and global average pooling; these two results are concatenated channel by channel. The resulting feature map is then processed by a 1 × 1 convolution to adjust the number of channels. A sigmoid function is used to obtain weights between 0 and 1, which are then multiplied element-wise with the input features. The resulting weighted feature map is then used as the input for the next layer, F2.
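A compact PyTorch sketch of the CBAM block described above is given below: channel attention first (a shared MLP over max- and average-pooled descriptors, summed and passed through a sigmoid to weight F and produce F1), then spatial attention over F1. The reduction ratio of 16 and the 7 × 7 spatial kernel follow the original CBAM paper and should be read as assumptions here, since the description above mentions a 1 × 1 convolution for channel adjustment.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CBAM(nn.Module):
    """Channel attention followed by spatial attention, as in CBAM."""
    def __init__(self, channels, reduction=16, spatial_kernel=7):
        super().__init__()
        self.mlp = nn.Sequential(             # shared fully connected layers (as 1x1 convs)
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )
        self.spatial = nn.Conv2d(2, 1, spatial_kernel,
                                 padding=spatial_kernel // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        # channel attention: weight each channel of x to obtain F1
        avg = self.mlp(F.adaptive_avg_pool2d(x, 1))
        mx = self.mlp(F.adaptive_max_pool2d(x, 1))
        f1 = x * self.sigmoid(avg + mx)
        # spatial attention: weight each location of F1 to obtain F2
        pooled = torch.cat([f1.mean(dim=1, keepdim=True),
                            f1.max(dim=1, keepdim=True).values], dim=1)
        return f1 * self.sigmoid(self.spatial(pooled))

x = torch.randn(1, 112, 26, 26)
print(CBAM(112)(x).shape)    # torch.Size([1, 112, 26, 26])
```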
As illustrated in Figure 2, the CBAM module is inserted between the backbone network and the feature network, adding three attention mechanism modules in total. By incorporating the attention mechanism between the backbone network and the feature network, the network can leverage the rich feature representations generated by the backbone network while selectively focusing on the most relevant regions of the image. This includes key contextual information such as the outline and posture of pedestrians, as well as the rainy background and environmental factors. This allows the network to generate more accurate predictions by considering only the most pertinent information in the image.
The integration of the CBAM module between the backbone and feature fusion networks proves highly effective. CBAM enhances feature quality, refines spatial context, improves information flow, and notably contributes to the network’s improved performance, especially in challenging conditions such as rainy weather.

3.4. REP Structures Combining Structural Reparameterization Ideas

The design of the REP (reparameterization) structure is inspired by RepVGG [28], a concept introduced by the Tsinghua University team, which proposes to use a complex structure in the training phase and a simple structure in the prediction phase to improve detection performance without increasing the complexity of the prediction network. Based on this concept, we propose the REP structure (Figure 9).
The proposed method consists of three processes: (a) constructing the REP structure used in the training phase, (b) generating the intermediate state during the conversion process, and (c) deriving the simplified structure used in the prediction phase. As shown in Figure 9, the structure of the prediction network is very simple, consisting of only a 3 × 3 convolutional block. The key to this process is to merge the convolution and BN operations into a single operation and to unify all convolution operations into a 3 × 3 convolution kernel.
$$f'_{i,j} = W_{BN} \cdot \left(W_{conv} \cdot f_{i,j} + b_{conv}\right) + b_{BN} \qquad (6)$$
The first process achieves the fusion of the BN layer with the convolutional layer, as shown in Equation (6). Expanding Equation (6) reveals that the computation still follows the convolutional form, with the weights becoming $W_{BN} \cdot W_{conv}$ and the bias term becoming $W_{BN} \cdot b_{conv} + b_{BN}$. At the end of training, all of these parameters are fixed, so this equivalent transformation can be performed once before the prediction phase to fuse the BN layer into the convolutional layer. In the second process, all convolutions are replaced with 3 × 3 convolutions. The idea of this operation is also relatively simple: a 1 × 1 convolution kernel can be zero-padded to 3 × 3 without affecting the calculation results, and the identity branch, where no convolution is performed, can be expressed as a 3 × 3 convolution whose kernel has a weight of 1 at the center of the corresponding channel and 0 elsewhere.
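Equation (6) can be checked directly with a short PyTorch sketch that folds a BatchNorm layer into the preceding convolution and verifies that the fused convolution reproduces the original conv + BN computation in evaluation mode. Layer sizes are arbitrary, and the remaining REP steps (padding 1 × 1 kernels to 3 × 3 and expressing the identity branch as a convolution) are not shown.

```python
import torch
import torch.nn as nn

def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Fold BN into the preceding convolution (cf. Eq. (6)): the fused weight is the
    per-channel BN scale times W_conv; the fused bias absorbs the BN statistics and shift."""
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      conv.stride, conv.padding, conv.dilation, conv.groups, bias=True)
    with torch.no_grad():
        scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)   # per-channel W_BN
        fused.weight.copy_(conv.weight * scale.reshape(-1, 1, 1, 1))
        conv_bias = conv.bias if conv.bias is not None else torch.zeros_like(bn.running_mean)
        fused.bias.copy_((conv_bias - bn.running_mean) * scale + bn.bias)
    return fused

# sanity check: the fused convolution matches conv followed by BN in eval mode
conv, bn = nn.Conv2d(8, 16, 3, padding=1, bias=False), nn.BatchNorm2d(16)
bn.eval()
x = torch.randn(1, 8, 13, 13)
print(torch.allclose(fuse_conv_bn(conv, bn)(x), bn(conv(x)), atol=1e-5))   # True
```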
Based on the above analysis, we can conclude that the REP module has a high time cost during the training phase, but only requires one 3 × 3 convolution operation during the prediction phase, indicating that the REP module is an efficient module. As illustrated in Figure 2, the REP structure is applied between the YOLO detection head and the feature fusion network to enhance the network’s complexity, enabling it to capture more complex features and improve the network’s detection performance.
Incorporating the REP module enhances network complexity during training, which, in turn, improves the network’s ability to extract and recognize pedestrian features in challenging rainy weather conditions. Importantly, this enhancement in complexity does not significantly impact network inference speed during the prediction phase, thus maintaining a high-speed detection capability.

4. Experiments

4.1. Experimental Environment Configuration and Data Introduction

Due to the limited availability of publicly available datasets for rainy weather conditions, we propose the RainDet3000 dataset, based on the Cityscapes’ [29] annotation protocol, which aims to provide a more accurate representation of real-world rainy scenarios. The dataset consists of 3000 images, including 2100 for training, 600 for testing, and 300 for validation. The images in RainDet3000 are collected from both camera shots and the Internet and include two object classes: vehicles and pedestrians. The dataset is annotated using the LabelImg tool, ensuring high-quality and consistent labeling across all images. A portion of the dataset is shown in Figure 10.
In addition to a subset of images from the RainDet3000 dataset, we also utilize publicly available datasets, including UA-DETRAC [30], Caltech Pedestrian Dataset [31], and KITTI [32], as the training set for our proposed method. By incorporating images from multiple datasets, our model can learn from a wide range of diverse scenarios, improving its robustness and generalization ability for pedestrian detection in various conditions.
The BDD100K [33] dataset is a large-scale benchmark dataset that includes a diverse range of driving images captured under various conditions. The dataset features images captured at different times of the day, including early morning, midday, evening, and night, providing a comprehensive evaluation of object detection algorithms under different lighting conditions. Additionally, the dataset includes a wide range of challenging weather scenarios, such as rainy, cloudy, and snowy days, which further enhances its complexity and diversity. To further validate the robustness of our proposed pedestrian detection network, we conduct experiments on a subset of the BDD100K dataset that exclusively contains images of pedestrian targets. This subset, which we refer to as BDD100K-pedestrian, comprises a total of 8823 images selected based on rigorous screening criteria. In this dataset, 6176 images are used as the training set, 1765 images are used as the test set, and 882 images are used as the validation set.
For our experimentation, we conduct the experiments on a notebook computer equipped with an Intel i5 10200H processor, NVIDIA GeForce RTX 2060 graphics, and 8 GB of memory. We employ widely used deep learning frameworks and image processing libraries commonly applied in object detection tasks. The experimental hardware and parameter settings are summarized in Table 1.
In this experiment, the initial network input size is fixed at 640 × 640. A momentum of 0.9 is used, along with a batch size of 4. The learning rate is set to 0.001 and a weight decay of 0.05 is applied to ensure experimental fairness and consistency; all comparative experiments adhere to these identical parameter settings. Figure 11 shows the loss curve of network training.
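For clarity, these training settings can be gathered into a single configuration object, as in the illustrative snippet below; the key names follow common YOLOv5-style hyperparameter conventions and are assumptions rather than the authors' actual configuration files.

```python
# Illustrative collection of the training settings reported above.
train_config = {
    "img_size": 640,         # initial network input size (640 x 640)
    "momentum": 0.9,
    "batch_size": 4,
    "lr0": 0.001,            # initial learning rate
    "weight_decay": 0.05,
}

if __name__ == "__main__":
    for key, value in train_config.items():
        print(f"{key}: {value}")
```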

4.2. Evaluation Metrics

To assess the performance of the network, this study uses mean average precision (mAP) as the detection accuracy index. In addition, frames per second (FPS) is employed as the evaluation index for detection speed, and the model reduction ratio R is used as the evaluation index for the degree of model lightweighting. Together, these metrics provide a comprehensive evaluation of the network's performance, taking into account both accuracy and efficiency, and enable a more informed assessment of its potential for practical applications.
$$P = \frac{TP}{TP + FP} \qquad (7)$$

$$R = \frac{TP}{TP + FN} \qquad (8)$$

$$AP = \int_0^1 P(R)\,dR \qquad (9)$$

$$mAP = \frac{1}{N}\sum_{i=1}^{N} AP_i \qquad (10)$$
In Equations (7) and (8), P (precision) denotes the precision rate and R (recall) denotes the recall rate; TP (true positive) is the number of positive samples correctly judged as positive, FP (false positive) is the number of negative samples incorrectly judged as positive, and FN (false negative) is the number of positive samples incorrectly judged as negative. The model reduction ratio is described in Equation (11).
$$R = \frac{I - x}{I} \qquad (11)$$
where $I$ denotes the original model size (in this paper, the size of the original YOLOv5 network model) and $x$ is the size of the simplified model; a larger model reduction ratio indicates a better reduction effect.
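As a worked illustration of Equations (9) and (11), the short script below computes AP from a toy precision-recall curve using the common all-point interpolation and evaluates the model reduction ratio for the model sizes reported in Tables 2 and 4; the precision-recall values are illustrative, not measured results.

```python
import numpy as np

def average_precision(recall, precision):
    """Area under the precision-recall curve, Eq. (9), with all-point interpolation."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]       # make precision monotonically decreasing
    idx = np.where(r[1:] != r[:-1])[0]             # points where recall changes
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

# toy precision-recall points for one class (illustrative values)
recall = np.array([0.2, 0.4, 0.6, 0.8])
precision = np.array([1.0, 0.9, 0.7, 0.5])
print(f"AP = {average_precision(recall, precision):.2f}")

# model reduction ratio of Eq. (11): R = (I - x) / I
original_size, reduced_size = 86.1, 12.3           # Mb, from Tables 2 and 4
print(f"R = {(original_size - reduced_size) / original_size:.3f}")   # ~0.857, i.e., 85.7%
```

The second print reproduces the 85.7 model scaling ratio reported for the full configuration in Table 4, with the ratio column expressed as a percentage.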

4.3. Analysis of Experimental Results

To verify the suitability of the GhostNet algorithm as a baseline network for pedestrian detection under rainy weather conditions, we conduct ablation experiments. By comparing the performance of the GhostNet network with and without the compact bilinear pooling algorithm, we assess the combined effect of integrating the compact bilinear pooling algorithm with the GhostNet network. The experimental setup is outlined in Table 2.
First, from Table 2, we can conclude that GhostNet, used as the backbone network of YOLOv5-x, yields the smallest model size and the best FPS performance, and its mAP performance is better than that of other lightweight backbone network series, which indicates that GhostNet is well suited to the detection task in this paper. The CBP-GNet model parameters increase by 2.8 M compared with the original GhostNet network, but the mAP improves by 2.2%. The CBP-GNet network thus improves detection accuracy at only a small computational cost, demonstrating that the proposed CBP-G bottleneck layer performs well. The experimental results show that GhostNet performs better than other mainstream backbones as a baseline network for pedestrian detection under rainy weather conditions and that the GhostNet network with the compact bilinear pooling algorithm outperforms the original GhostNet network.
To demonstrate that the proposed Simple-BiFPN feature fusion network is better suited as the feature fusion network for the current detection task, we compare its performance to that of PANet and NAS-FPN networks. The experimental results are shown in Table 3 (the backbone network utilized in our approach is CBP-G):
The experimental results show that Simple-BiFPN performs better in terms of mAP, FPS, and model reduction ratio; the proposed feature fusion network yields a smaller model with better detection performance and can therefore perform well in complex-weather pedestrian detection scenarios.
To validate the effectiveness of integrating the CBAM attention mechanism between the backbone network and feature fusion network, as well as integrating the REP module between the feature fusion network and the YOLO detection head, we design comparative experiments, as shown in Table 4.
Table 4 shows the comparison experiments of adding the CBAM module and the REP module to the CBP-GNet + Simple-BiFPN network structure. The experiments prove the effectiveness of adding the CBAM and REP modules: the network performance is improved by 1.48% in mAP at the cost of a 0.9 Mb increase in model size. The experimental results demonstrate that the integration of the CBAM attention mechanism and the REP module can significantly improve the performance of the network for rainy-weather detection tasks, validating the effectiveness of the proposed improvements.
To evaluate the effectiveness of the RSTDet-Lite algorithm, we conduct a performance comparison on the RainDet3000 dataset and select several mainstream algorithms for comparison, including YOLOv4, YOLOv5x, YOLOv7, and the algorithms proposed by Sun et al. [14], Zhang et al. [16], and PVformer [21], as shown in Table 5.
The results demonstrate the outstanding performance of RSTDet-Lite in multiple aspects:
  • Model performance: RSTDet-Lite excels in terms of mean average precision (mAP), achieving a superior score of 54.47%, surpassing YOLOv7 (50.9%) and PVformer (52.44%). This highlights the superiority of RSTDet-Lite in object detection capability.
  • Inference speed: RSTDet-Lite exhibits remarkable speed, achieving a frame rate of 49.8 frames per second (FPS). This high FPS is crucial for real-time applications where faster processing speed is of paramount importance.
  • Model size: RSTDet-Lite maintains a compact model size, occupying only 12.3 Mb, significantly smaller than many other models, which reduces storage and memory requirements.
  • Recall behavior: RSTDet-Lite achieves an AP60 value of 51.68%, surpassing the other detection algorithms, indicating that the model performs better at high recall rates, effectively capturing more positive targets and reducing false negatives.
These results underscore the effectiveness of RSTDet-Lite as an advanced and efficient object detection solution, providing a striking balance between accuracy, speed, model size, and recall rate. It represents a promising choice for practical applications where these factors are critical.
To further demonstrate the effectiveness of RSTDet-Lite in challenging scenarios, we conduct comparative experiments on the BDD100K-pedestrian dataset, which includes pedestrian images captured under various challenging weather conditions. Specifically, we select YOLOX, PVformer, CA-MobileNetv2-YOLOv4, and our proposed RSTDet-Lite algorithm for the comparative experiments. The detailed experimental results are presented in Table 6.
The results indicate that our RSTDet-Lite model performs exceptionally well on the challenging BDD100K-pedestrian dataset, outperforming YOLOv5-x by a significant margin. Our algorithm achieves a recall rate of 95.46%, which is 6.23 percentage points higher than that of YOLOv5-x, and a mAP of 49.89%, surpassing the other algorithms. This suggests that RSTDet-Lite exhibits exceptional robustness in the presence of complex weather conditions. Notably, RSTDet-Lite achieves this performance while employing minimal computational resources, underscoring its efficiency. These findings collectively demonstrate the effectiveness of the proposed RSTDet-Lite, which represents a promising solution for pedestrian detection in challenging weather conditions.
In the final section of our experiments, we present a visual comparison of the pedestrian detection results to showcase the performance of our proposed algorithm. Figure 12b displays the detection outcomes achieved by the original YOLOv5-x model, while Figure 12c showcases the detection results obtained by our algorithm. These images offer a side-by-side illustration of the pedestrian detection performance of both methods in the same scenarios; the comparison makes it evident that our algorithm significantly enhances pedestrian localization and detection accuracy. Figure 12 illustrates the experimental comparison results.

5. Conclusions

In conclusion, this study proposes a novel lightweight pedestrian detection network, RSTDet-Lite, which is specifically designed for complex weather conditions. The proposed algorithm incorporates several key improvements, including the use of the self-annotated RainDet3000 dataset, a novel CBP-G bottleneck layer structure based on the GhostNet architecture, the Simple-BiFPN feature fusion network, the CBAM attention mechanism module, and the REP reparameterization module. Through extensive experiments, we demonstrate that the proposed RSTDet-Lite algorithm achieves excellent performance on both the RainDet3000 dataset and the widely used BDD100K pedestrian dataset, outperforming several state-of-the-art algorithms in terms of accuracy, model size, and FPS.
However, we acknowledge that there are still limitations and potential areas for improvement in the proposed algorithm. Specifically, our algorithm lacks multimodal fusion and auxiliary detection techniques, which may limit the overall detection performance. In future work, we plan to explore the integration of infrared technology and semantic segmentation to enhance the detection performance in rainy weather conditions. Additionally, we will focus on developing new detection methods that can detect umbrellas as a separate class of objects and propose a joint detection loss function that assigns specific weight to the detection of umbrellas. These efforts will help to further improve the accuracy and robustness of the proposed algorithm.
In summary, the proposed RSTDet-Lite algorithm demonstrates superior performance in lightweight pedestrian detection under rainy weather conditions and offers valuable insights and potential directions for further research in this field.

Author Contributions

Writing—Original Draft Preparation, Y.C.; Writing—Review and Editing, C.W.; Data Curation, C.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Ahonen, T.; Hadid, A.; Pietikainen, M. Face description with local binary patterns: Application to face recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2006, 28, 2037–2041. [Google Scholar] [CrossRef] [PubMed]
  2. Dalal, N.; Triggs, B. Histograms of oriented gradients for human detection. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 20–25 June 2005; Volume 1. [Google Scholar]
  3. Wu, B.; Nevatia, R. Detection of multiple, partially occluded humans in a single image by bayesian combination of edgelet part detectors. In Proceedings of the Tenth IEEE International Conference on Computer Vision (ICCV’05), Beijing, China, 17–21 October 2005; Volume 1. [Google Scholar]
  4. Ye, L.; Keogh, E. Time series shapelets: A new primitive for data mining. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Paris, France, 28 June–1 July 2009. [Google Scholar]
  5. Lienhart, R.; Maydt, J. An extended set of haar-like features for rapid object detection. In Proceedings of the International Conference on Image Processing, Rochester, NY, USA, 22–25 September 2002; Volume 1. [Google Scholar]
  6. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014. [Google Scholar]
  7. Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015. [Google Scholar]
  8. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2015, 28, 91–99. [Google Scholar] [CrossRef] [PubMed]
  9. Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  10. Bochkovskiy, A.; Wang, C.-Y.; Liao, H.-Y.M. Yolov4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
  11. Tian, Z.; Shen, C.; Chen, H.; He, T. Fcos: Fully convolutional one-stage object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar]
  12. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861. [Google Scholar]
  13. Zhang, Y.; Zhou, A.; Zhao, F.; Wu, H. A lightweight vehicle-pedestrian detection algorithm based on attention mechanism in traffic scenarios. Sensors 2022, 22, 8480. [Google Scholar] [CrossRef] [PubMed]
  14. Sun, H.; Dong, X.; Wang, J.; Chen, Z. Based on the improved YOLOv4-tiny lightweight pedestrian in school target detection algorithm. Comput. Eng. Appl. 2023, 35, 13895–13906. [Google Scholar]
  15. Han, K.; Wang, Y.; Tian, Q.; Guo, J.; Xu, C.; Xu, C. Ghostnet: More features from cheap operations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
  16. Zhang, B.; Kang, Q.; Li, J.; Guo, J.; Chen, S. Lightweight YOLOv4 Object Detection Algorithm. Comput. Eng. 2022, 48, 206–214. [Google Scholar] [CrossRef]
  17. Roszyk, K.; Nowicki, M.R.; Skrzypczyński, P. Adopting the YOLOv4 architecture for low-latency multispectral pedestrian detection in autonomous driving. Sensors 2022, 22, 1082. [Google Scholar] [CrossRef] [PubMed]
  18. Li, M.-L.; Sun, G.-B.; Yu, J.-X. A pedestrian detection network model based on improved YOLOv5. Entropy 2023, 25, 381. [Google Scholar] [CrossRef] [PubMed]
  19. Sha, M.; Zeng, K.; Tao, Z.; Wang, Z.; Liu, Q. Lightweight Pedestrian Detection Based on Feature Multiplexed Residual Network. Electronics 2023, 12, 918. [Google Scholar] [CrossRef]
  20. Zhao, Q.; Ma, W.; Zheng, C.; Li, L. Exploration of Vehicle Target Detection Method Based on Lightweight YOLOv5 Fusion Background Modeling. Appl. Sci. 2023, 13, 4088. [Google Scholar] [CrossRef]
  21. Sun, Z.; Liu, C.A.; Qu, H.; Xie, G. PVformer: Pedestrian and vehicle detection algorithm based on Swin transformer in rainy scenes. Sensors 2022, 22, 5667. [Google Scholar] [CrossRef] [PubMed]
  22. He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916. [Google Scholar] [CrossRef] [PubMed]
  23. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016. [Google Scholar]
  24. Gao, Y.; Beijbom, O.; Zhang, N.; Darrell, T. Compact bilinear pooling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016. [Google Scholar]
  25. Lin, T.-Y.; RoyChowdhury, A.; Maji, S. Bilinear CNN models for fine-grained visual recognition. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015. [Google Scholar]
  26. Tan, M.; Pang, R.; Le, Q.V. Efficientdet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
  27. Ghiasi, G.; Lin, T.-Y.; Le, Q.V. Nas-fpn: Learning scalable feature pyramid architecture for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  28. Ding, X.; Zhang, X.; Ma, N.; Han, J.; Ding, G.; Sun, J. Repvgg: Making vgg-style convnets great again. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021. [Google Scholar]
  29. Cordts, M.; Omran, M.; Ramos, S.; Rehfeld, T.; Enzweiler, M.; Benenson, R.; Franke, U.; Roth, S.; Schiele, B. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016. [Google Scholar]
  30. Wen, L.; Du, D.; Cai, Z.; Lei, Z.; Chang, M.C.; Qi, H.; Lim, J.; Yang, M.H.; Lyu, S. UA-DETRAC: A new benchmark and protocol for multi-object detection and tracking. Comput. Vis. Image Underst. 2020, 193, 102907. [Google Scholar] [CrossRef]
  31. Dollár, P.; Wojek, C.; Schiele, B.; Perona, P. Pedestrian detection: A benchmark. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009. [Google Scholar]
  32. Geiger, A.; Lenz, P.; Stiller, C.; Urtasun, R. Vision meets robotics: The kitti dataset. Int. J. Robot. Res. 2013, 32, 1231–1237. [Google Scholar] [CrossRef]
  33. Yu, F.; Chen, H.; Wang, X.; Xian, W.; Chen, Y.; Liu, F.; Madhavan, V.; Darrell, T. Bdd100k: A diverse driving dataset for heterogeneous multitask learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
Figure 1. YOLOv5 network framework.
Figure 2. RSTDet-Lite network framework.
Figure 3. Ghost module.
Figure 4. Ghost bottleneck layer.
Figure 5. CBP-G bottleneck layer.
Figure 6. (a) PANet, (b) NAS-FPN, (c) BiFPN.
Figure 7. Simple-BiFPN network structure.
Figure 8. CBAM module.
Figure 9. REP module.
Figure 10. Partial RainDet3000 dataset.
Figure 11. Training loss curve.
Figure 12. Comparison of different algorithms on BDD100K-pedestrian dataset result.
Table 1. Experimental hardware and parameter settings.

Item | Parameter
CPU | Intel(R) Core i5-10200H
GPU | NVIDIA GeForce RTX 2060
Operating System | Ubuntu 16.04 LTS
Memory | 8 GB
Deep learning framework version | PyTorch 1.8
Development language | Python 3.9
Table 2. Backbone network comparison experiment.

Backbone | mAP (%) | Model Size (Mb) | Model Scaling Ratio | FPS
- | 49.91 | 86.1 | - | 30.5
GhostNet | 43.9 | 21.3 | 75.2 | 48.6
CBP-GNet | 50.1 | 24.1 | 72 | 47.9
Table 3. Feature fusion network comparison experiments.

Feature Fusion Network | mAP (%) | Model Size (Mb) | Model Scaling Ratio | FPS
PANet | 50.1 | 24.1 | 72 | 47.9
NAS-FPN | 51.2 | 25.9 | 69.9 | 34.6
Simple-BiFPN | 52.99 | 11.4 | 86.8 | 48.8
Table 4. Comparison experiments of different modules.

Network (CBP-GNet + Simple-BiFPN) | mAP (%) | Model Size (Mb) | Model Scaling Ratio | FPS
- | 52.99 | 11.4 | 86.8 | 48.8
+CBAM | 53.09 | 11.4 | 86.8 | 49.8
+CBAM + REP | 54.47 | 12.3 | 85.7 | 49.8
Table 5. Performance comparison of different algorithms on the RainDet3000 dataset.

Model | Input Image Size | Recall (%) | mAP (%) | AP60 | Model Size (Mb) | FPS
YOLOv4 | 416 × 416 | 66.43 | 45.91 | 44.31 | 244.0 | 18.6
YOLOv5x | 640 × 640 | 73.41 | 49.91 | 44.86 | 42.3 | 30.5
YOLOv7 | 640 × 640 | 74.51 | 50.9 | 47.6 | 37.1 | 48.3
Improved YOLOv4-Tiny [14] | 416 × 416 | 54.32 | 30.40 | 22.1 | 7.1 | 32.1
Improved YOLOv4 [16] | 416 × 416 | 79.15 | 47.93 | 41.92 | 187.0 | 29.0
PVformer [21] | 640 × 640 | 83.35 | 52.44 | 45.3 | 145.4 | 19.1
RSTDet-Lite | 416 × 416 | 88.31 | 54.47 | 51.68 | 12.3 | 49.8
Table 6. Comparison of different algorithms on BDD100K-pedestrian dataset.

Model | Input Image Size | Recall (%) | mAP (%) | AP60 | Model Size (Mb) | FPS
YOLOv5x | 640 × 640 | 89.23 | 43.1 | 36.9 | 42.3 | 24.9
YOLOX | 416 × 416 | 91.21 | 45.77 | 39.81 | 19.8 | 29.8
CA-MobileNetv2-YOLOv4 | 640 × 640 | 92.33 | 43.89 | 38.93 | 40.1 | 31.2
PVformer | 640 × 640 | 94.65 | 47.23 | 45.67 | 145.4 | 25.9
RSTDet-Lite | 416 × 416 | 95.46 | 49.89 | 45.32 | 12.3 | 43.1
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.


