Article

EUAVDet: An Efficient and Lightweight Object Detector for UAV Aerial Images with an Edge-Based Computing Platform

1 School of Electrical and Information Engineering, Changsha University of Science and Technology, Changsha 410114, China
2 State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, Wuhan 430079, China
3 National Engineering Research Center for Robot Vision Perception and Control, College of Electrical and Information Engineering, Hunan University, Changsha 410082, China
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Submission received: 30 April 2024 / Revised: 7 June 2024 / Accepted: 10 June 2024 / Published: 13 June 2024
(This article belongs to the Special Issue Advances in Perception, Communications, and Control for Drones)

Abstract

Crafting an edge-based real-time object detector for unmanned aerial vehicle (UAV) aerial images is challenging because of the limited computational resources and the small size of detected objects. Existing lightweight object detectors often prioritize speed over detecting extremely small targets. To better balance this trade-off, this paper proposes an efficient and low-complexity object detector for edge computing platforms deployed on UAVs, termed EUAVDet (Edge-based UAV Object Detector). Specifically, an efficient feature downsampling module and a novel multi-kernel aggregation block are first introduced into the backbone network to retain more feature details and capture richer spatial information. Subsequently, an improved feature pyramid network with a faster ghost module is incorporated into the neck network to fuse multi-scale features with fewer parameters. Experimental evaluations on the VisDrone, SeaDronesSeeV2, and UAVDT datasets demonstrate the effectiveness and plug-and-play capability of our proposed modules. Compared with the state-of-the-art YOLOv8 detector, the proposed EUAVDet achieves better performance in nearly all the metrics, including parameters, FLOPs, mAP, and FPS. The smallest version of EUAVDet (EUAVDet-n) contains only 1.34 M parameters and achieves over 20 fps on the Jetson Nano. Our algorithm strikes a better balance between detection accuracy and inference speed, making it suitable for edge-based UAV applications.

1. Introduction

The recent advancements in neural network technology have significantly propelled the application of unmanned aerial vehicles (UAVs) for object detection in diverse domains [1,2], such as intelligent agriculture, disaster response, urban surveillance, etc. A noteworthy trend in these domains is the increasing prevalence of edge-based deployment on UAVs [3,4,5], as opposed to server-based deployment. Although the latter benefits from enhanced computational capabilities, it struggles with the inherent challenge of communication delay, which is particularly unsuitable for applications requiring real-time performance. The former holds greater potential for autonomous decision-making by drones or realizing a real-time human–computer interaction [6,7]. Consequently, an increasing number of object detection algorithms are now being tailored for edge-based deployment on UAVs. However, the computational resources available on UAVs are considerably constrained in comparison to server-based deployments, thereby posing a challenging issue of maintaining comparable object detection accuracy with reduced computing power. In this paper, we aim to develop a lightweight object detection algorithm crafted for edge-based devices and hope to facilitate more edge-based deployment applications for UAVs in various fields.
Researchers have proposed several promising lightweight object detection approaches for real-time applications. For instance, Zhang et al. [8] and Lu et al. [9] employed a network pruning approach to enhance inference speed by eliminating redundant channels or connections. Despite their success in improving detection speed, these methods tend to significantly compromise detection accuracy. By contrast, Li [10] and Lee [11] have designed lightweight modules with the purpose of reducing model parameters while preserving comparable accuracy. However, they have faced difficulties in improving detection speed under crowded scenes, thus limiting their practical applicability. Inspired by these observations, we further craft an efficient and lightweight object detection method for edge-based UAV applications in this work and study how to strike a better balance between detection accuracy and inference speed.
Detecting objects in UAV-captured images is considerably more challenging than in images obtained from surveillance cameras. As illustrated in Figure 1, objects captured by traditional surveillance cameras (COCO [12] dataset) are larger in size and sparsely distributed within the image. By contrast, targets captured by UAVs (VisDrone [13] dataset) are commonly smaller and more densely distributed, making them more difficult to detect. Considering these unique attributes of UAV images, researchers have traditionally focused on improving the accuracy of small object detection by integrating multiple attention mechanisms [14] or an adaptive dense pyramid network [15] into the multi-scale fusion network [16]. However, these additional fusion processes increase the computational cost. To reduce the computational demands, on the other hand, some researchers have simplified the traditional convolution process with structured pruning techniques [17] or lightweight convolutions such as global context-enhanced adaptive sparse convolution (CEASC) [18], depthwise separable convolution (DS-Conv) [19], and pointwise group convolution (G-Conv) [19]. However, these existing detectors often prioritize speed over detecting extremely small targets, and the challenge of effectively balancing the efficiency and accuracy of UAV object detection remains unresolved.
Based on the aforementioned observations, the primary challenges associated with object detection in UAV images include the following: (1) embedded devices carried by UAVs, characterized by limited computational capacity, often struggle to process high-resolution images; (2) reducing network parameters can alleviate computational requirements, but it may also compromise the accuracy of small object detection by diminishing the receptive field; (3) balancing real-time requirements with high detection accuracy in UAV imagery remains an unresolved issue. Drawing inspiration from previously proposed models, such as GhostNet [20], DBB (Diverse Branch Block) [21], and VoVNet [22], we introduce a novel, lightweight detection model for UAV aerial images in this paper. As depicted in Figure 2, the various versions of our proposed method outperform their respective state-of-the-art counterparts. The main contributions of this work can be summarized as follows:
(1) We propose an efficient and lightweight object detection method for edge-based UAVs, termed EUAVDet in this paper. The components of EUAVDet are designed to be plug-and-play, allowing for seamless integration into existing object detectors. Extensive experiments on three public UAV datasets demonstrate the superior performance of our method in terms of both detection accuracy and inference speed.
(2) We enhance the efficiency of the backbone network by introducing two novel modules: an efficient feature-downsampling module designed to improve inference speed while retaining more details compared to traditional downsampling methods and a novel multi-kernel aggregation block that is designed to improve the detection efficiency for targets in various scales.
(3) We redesign a lightweight neck network from two perspectives. Firstly, we introduce a Focused Feature Pyramid Network that significantly reduces network parameters by emphasizing specific feature scales. Secondly, we propose a faster feature compression module based on the ghost module aimed at accelerating inference time.
Figure 2. Performance comparison illustration between EUAVDet and other popular object detectors in terms of detection accuracy (represented on the vertical axis), inference speed (represented on the horizontal axis), and model parameters (represented by the circles). All the experimental results are tested on the embedded Jetson Nano, and the VisDrone validation set is utilized. More comparative details can be found in Table 1.
Table 1. Comparative results with the state-of-the-art detectors on the VisDrone validation and test sets. All AP values are given in %; FPS is measured on the Jetson Nano and Jetson Orin Nano.

| Method | Params (M) | FLOPs (G) | AP_S^val | AP_M^val | AP_L^val | AP_50^val | mAP^val | AP_50^test | mAP^test | FPS (Nano) | FPS (Orin) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| YOLOv3-tiny [23] | 8.68 | 12.9 | 4.0 | 11.6 | 13.9 | 16.7 | 6.95 | 14.8 | 6.2 | 14.8 | 75.0 |
| YOLOv5-s [24] | 7.01 | 15.8 | 9.5 | 24.1 | 39.6 | 29.8 | 16.1 | 25.8 | 13.6 | 12.5 | 65.0 |
| YOLOx-tiny | 5.04 | 15.2 | 9.0 | 25.1 | 28.8 | 31.9 | 18.8 | 27.6 | 16.1 | 10.8 | 55.1 |
| YOLOx-s [25] | 8.94 | 26.8 | 10.0 | 26.4 | 34.8 | 33.4 | 19.8 | 28.9 | 17.0 | 9.4 | 46.7 |
| YOLOv7-tiny [26] | 6.03 | 13.1 | 11.1 | 26.9 | 39.1 | 35.0 | 18.5 | 29.5 | 15.4 | 16.3 | 70.0 |
| YOLOv10-n | 2.28 | 6.7 | 9.4 | 27.4 | 34.4 | 30.8 | 17.8 | 26.0 | 14.3 | 22.3 | 82.6 |
| YOLOv10-s [27] | 7.20 | 21.6 | 12.5 | 33.5 | 46.1 | 37.0 | 22.0 | 30.6 | 17.2 | 11.3 | 54.5 |
| EUAVDet-n_v10 | 1.21 | 6.2 | 9.5 | 28.9 | 34.9 | 31.3 | 18.3 | 26.1 | 14.3 | 23.4 | 84.7 |
| EUAVDet-s_v10 | 4.44 | 21.4 | 13.0 | 34.1 | 40.4 | 37.7 | 22.3 | 31.0 | 17.4 | 11.6 | 55.6 |
| YOLOv8-n | 3.01 | 8.1 | 9.6 | 28.6 | 38.2 | 31.9 | 18.4 | 26.2 | 14.4 | 19.5 | 78.0 |
| YOLOv8-s [28] | 11.13 | 28.7 | 13.0 | 33.1 | 41.5 | 38.0 | 22.4 | 31.0 | 17.3 | 6.4 | 45.0 |
| EUAVDet-n_v8 | 1.34 | 6.9 | 10.5 | 29.8 | 36.0 | 32.9 | 19.2 | 27.1 | 14.9 | 21.1 | 79.6 |
| EUAVDet-tiny_v8 | 2.86 | 15.0 | 12.8 | 34.1 | 40.6 | 37.2 | 22.1 | 30.5 | 17.0 | 13.2 | 56.0 |
| EUAVDet-s_v8 | 4.96 | 25.6 | 14.0 | 35.3 | 42.1 | 39.2 | 23.5 | 32.4 | 18.1 | 8.6 | 52.7 |

2. Related Works

2.1. Lightweight Object Detection

Mainstream deep learning approaches for object detection are commonly categorized into two frameworks: two-stage approaches and one-stage approaches. In two-stage models like Faster RCNN [29] and Mask RCNN [30], candidate object regions are initially generated, followed by object classification and bounding box refinement. Two-stage models have proven to be successful in high-resolution remote sensing object detection, as demonstrated by Dong et al. [31]. These models are known for their accuracy in object detection tasks, especially when precise localization and classification of objects are required. However, they are often slower in processing speed compared to one-stage detectors because of the additional region proposal step. By contrast, the single-stage method is characterized by its speed because it uses a single forward-propagation network to simultaneously regress regions and classify objects. Representative single-stage detection models include SSD [32], YOLO [33], and CenterNet [34].
Traditional convolutional neural networks (CNNs) are characterized by large numbers of parameters and high computational demands, which hinder their direct deployment on mobile or edge-based devices. Researchers have made considerable efforts to develop lightweight object detection approaches based on previous CNN models. Among these efforts, Depthwise Separable Convolution (DSConv), comprising Depth-wise Convolution (DWConv) and Pointwise Convolution (PWConv), has emerged as a widely used alternative to standard 2D convolution. For instance, MobileNet [35] leverages DSConv to construct novel inverted residual structures that strike a trade-off between parameter reduction and accuracy preservation, and it has also been incorporated into the state-of-the-art detector YOLOv8 [28] as a backbone network; ShuffleNet [36], on the other hand, adopts a strategy that combines group convolution and channel reshuffling to reduce the substantial number of PWConv operations in MobileNet. The impressive performance of PP-LCNet [37] confirms the adaptability of DSConv to CPU chips with high-speed cache and memory access. Additionally, DSConv has also demonstrated its effectiveness in remote-sensing ship detection, as evidenced by Yin et al. [19]. MicroNet [38] takes a step further by decoupling DWConv into low-rank matrices, achieving a balance between channel number and input/output connections; this approach is specifically tailored for low-power single-chip microcontroller devices with extremely low FLOPs. However, the channel-by-channel nature of DWConv leads to frequent memory accesses, which limits the achievable throughput and may not be suitable for edge-based GPUs. To address this issue, reparameterization techniques [39] have been proposed to better utilize GPU resources. Besides these manual designs, some researchers employ neural architecture search (NAS) [40,41] techniques to automatically search for optimal parameters in module or model construction. However, this approach faces significant computational barriers and requires substantial computational resources. Thus, how to design a lightweight object detector for edge-based devices remains an open issue.
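For concreteness, the DWConv + PWConv factorization that underlies most of the designs above can be sketched in a few lines of PyTorch; the module below is a generic illustration with assumed channel counts and is not a component of any particular detector discussed here:

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """DSConv factorization: per-channel spatial filtering (DWConv)
    followed by 1x1 cross-channel mixing (PWConv)."""
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1):
        super().__init__()
        self.dw = nn.Conv2d(in_ch, in_ch, kernel_size, stride,
                            padding=kernel_size // 2, groups=in_ch, bias=False)
        self.pw = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.pw(self.dw(x))))

# A 3x3 DSConv needs roughly (1/out_ch + 1/9) of the multiply-adds of a
# standard 3x3 convolution with the same channel counts.
x = torch.randn(1, 32, 80, 80)
print(DepthwiseSeparableConv(32, 64)(x).shape)  # torch.Size([1, 64, 80, 80])
```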

2.2. Multi-Scale Feature Fusion

The object detection model typically involves a backbone network followed by a neck network. The backbone often extracts general features from the input data, and the neck serves to enrich the features for various tasks. Multi-scale feature fusion techniques are frequently utilized in neck networks to enhance detection accuracy. For instance, the feature pyramid networks (FPNs) [42] leverage high-level semantic information extracted in the deep layer to enhance the accuracy of shallow detection heads, particularly for small targets. Building upon FPNs, Path Augmentation Networks (PANets) [43] introduce an additional bottom-up branch to further refine high-level feature detection. Extending this approach, the NasFPN [44] employs a neural architecture search to dynamically determine the optimal fusion path between different feature layers in a non-sequential manner. This strategy reduces redundant connections and emphasizes the significance of multi-scale feature fusion. Subsequently, BiFPN [45] follows a similar pattern and demonstrates its robust feature fusion capability.
With the remarkable performance of Transformers in computer vision, attention-based approaches, such as FPT [46] and CrossViT [47], have been developed for feature fusion in neural networks. These approaches leverage the self-attention mechanism, which enables the model to capture long-range dependencies. Additionally, the multi-head attention mechanism plays a crucial role in enabling the model to capture intricate relationships between different features, thereby facilitating more effective feature fusion. However, the computational complexity associated with this global information fusion tends to be excessive. To address this issue, NanoDet [48] retains only the PWConv from PANet for channel fusion, which improves inference efficiency but has been found to adversely affect accuracy. To better balance detection speed and accuracy, Li et al. [49] propose the lightweight and efficient Group Shuffle Convolution (GSConv) module, which forms the basis of the SlimNeck structure. Recently, RFBNet [50] introduced the Receptive Field Block (RFB) module to enhance the discriminative capabilities of the features learned by the network, especially in the context of lightweight models where achieving a balance between speed and accuracy is crucial. YOLOv6 [51] and YOLOv7 [26] incorporate the reparameterization concept into the neck component to substantially increase model inference speed. As the latest series of YOLO-based object detection models from Ultralytics, YOLOv8 [28] and its variants [52] also leverage multi-scale feature fusion techniques to augment detection precision. Inspired by these observations, this paper further explores how to optimally fuse multi-scale features for better accuracy without compromising inference speed.

2.3. Object Detection for UAV Aerial Images

Different from traditional surveillance images, UAV aerial images are characterized by high resolution, wide-scale coverage, and intricate landscapes. Therefore, detecting objects in UAV aerial images poses significant challenges, especially for tiny or dense targets. Considering the deep layers may lose crucial information for tiny objects in aerial images, researchers such as Zhu [53] and Li [54] opted to utilize shallow features for localization and regression, as they can provide more contextual information and clearer pixel-level details. By contrast, methods such as YOLT [55] and SAHI [56] segment the input image into multiple smaller ones, indirectly enhancing the visibility of smaller targets. However, this supplementary cropping process escalates computational complexity and may yield a decreased proportion of foreground targets within the cropped image block.
On the other hand, researchers have also developed several UAV object detection methods to address the challenge of dense small targets that are non-uniformly distributed in aerial images. Wang et al. [57] utilized a clustering algorithm to adaptively crop dense regions of the image and then perform fine-grained detection on these hard-to-detect regions. This approach focuses the computational resources on the regions containing dense targets, thereby enhancing the representation of small targets and suppressing the background noise. Furthermore, Huang et al. [58] proposed a unified foreground assembly strategy that crops and stitches the aggregated regions into a mosaic-like image. This strategy greatly increases the proportion of foreground in the processed image, leading to enhanced detection accuracy and efficiency. To obtain richer contextual information, TPH-YOLOv5 [53] and GeleNet [54] integrate the Transformer method into their detectors. These two approaches achieve notable detection accuracy, particularly for small targets. However, this incorporation of global attention modeling, especially for long sequences, consumes more hardware resources. Overall, object detection for UAV aerial images has received increasing attention, and improving the trade-off between detection accuracy and inference speed is critical for effective deployment in edge-based applications.

3. Proposed Method

Detecting objects in UAV aerial images on edge devices presents challenges because of the restricted computational resources available and the small size of objects observed from the UAV's perspective. This research attempts to craft an efficient and low-complexity object detector for edge-based UAVs, which we designate EUAVDet. Built upon the YOLO series of detectors, the EUAVDet model consists of three stages: the backbone, the neck, and the head. The backbone extracts general features from the input image, the neck further processes these features to enhance their representation, and the head finally utilizes these features to perform target location regression and object classification. In this paper, the backbone and neck networks are mainly revised to accelerate inference while retaining comparable detection accuracy.
An overview of the structure of the proposed EUAVDet detector can be seen in Figure 3. Given an image captured by UAVs, our proposed method detects objects through the following three stages. In the backbone stage, one novel feature downsampling module—efficient feature downsampling (EFD)—and three redesigned feature aggregation modules—multi-kernel aggregation blocks (MKABs)—mainly constitute the backbone network, which retains more details of small objects and reduces computation, as described in Section 3.1 and Section 3.2. In the neck stage, EUAVDet first applies the Spatial Pyramid Pooling Fast (SPPF) module after the backbone network to enlarge the receptive field and then employs a novel multi-scale feature fusion module called the Focused Feature Pyramid Network (FFPN) to enhance the representation of features and reduce the network parameters, as described in Section 3.4. The basic building block used inside the FFPN is the faster ghost module (FGM), as described in Section 3.3. Lastly, in the head stage, EUAVDet uses the 8-fold and 16-fold downsampled features as the detection heads to predict the objects in the given image while discarding the 32-fold downsampled features for faster inference. Since the 32-fold downsampled feature maps are not used as an output head, the fourth block in the backbone adopts the basic block from ResNet [59] to prevent redundancy. Within Figure 3, the commonly used CBS module combines convolution, batch normalization, and the SiLU activation function for feature extraction and nonlinear activation; it introduces nonlinearity into the network and thereby enhances the generalization performance of the model.
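For reference, a minimal sketch of the CBS building block referenced in Figure 3 is given below; the kernel-size and stride defaults are assumptions for illustration:

```python
import torch.nn as nn

class CBS(nn.Module):
    """Conv + BatchNorm + SiLU, the basic feature-extraction unit in Figure 3."""
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, stride,
                              padding=kernel_size // 2, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.SiLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))
```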

3.1. Efficient Feature Downsampling Module

Processing UAV aerial images often requires substantial computational resources because of their high resolution. The most commonly employed strategy to alleviate computational demands involves downsampling the original image in the initial stage of the backbone network. In Figure 4, for instance, a very tiny person in a 2000 × 1500 aerial image tends to go undetected, primarily because of the conventional downsampling process. Therefore, how to efficiently downsample the high-resolution aerial image while better retaining the details of tiny targets is crucial. As shown in Figure 5a, researchers [23,25,26,51] often adopt two cascaded convolutional layers with a stride of 2 to reduce the feature map size in the feature extraction stage. However, this downsampling process may compromise detection accuracy, as the salient features of small objects within UAV aerial images are prone to being diminished during the procedure.
Inspired by this observation, we introduce a novel downsampling module, depicted in Figure 5b, that differs from the conventional downsampling process illustrated in Figure 5a. Following the initial downsampling accomplished by a 3 × 3 convolutional layer, the channel count increases from 3 to C. This step retains the original information of the input image at a relatively reasonable computational cost. Subsequently, the proposed downsampling module diverges into two branches. In one branch, channel compression is achieved through a PWConv, succeeded by a 2 × 2 convolutional layer. This design alleviates the computational burden incurred by an excessive number of channels. Simultaneously, in the other branch, a 2 × 2 MaxPool layer followed by a PWConv is adopted for an economical reduction in feature map size. This branch can extract salient features while reducing sensitivity to local changes because of the presence of MaxPooling. Finally, the outputs from both branches are merged by concatenation, and a PWConv is applied to recover the number of channels to its original size. Compared to the Conventional Feature Downsampling module, this downsampling strategy is expected to better preserve the features of small objects in UAV aerial images and enhance the suitability of our approach for detecting such objects. We thus name it the efficient feature downsampling (EFD) module in this paper.
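A minimal PyTorch sketch of the EFD stem described above follows; the per-branch width (C/2), the stride-2 settings of the 2 × 2 convolution and MaxPool, and the Conv + BatchNorm + SiLU composition of each layer are assumptions made for illustration rather than the exact implementation:

```python
import torch
import torch.nn as nn

def cbs(c_in, c_out, k=3, s=1, p=None):
    # Conv + BatchNorm + SiLU helper (the CBS composition from Figure 3).
    p = k // 2 if p is None else p
    return nn.Sequential(nn.Conv2d(c_in, c_out, k, s, p, bias=False),
                         nn.BatchNorm2d(c_out), nn.SiLU(inplace=True))

class EFD(nn.Module):
    """Sketch of the efficient feature downsampling stem in Section 3.1."""
    def __init__(self, c):
        super().__init__()
        self.stem = cbs(3, c, k=3, s=2)                     # 3 -> C, first 2x downsampling
        self.pw_a = cbs(c, c // 2, k=1)                     # branch A: pointwise compression
        self.conv_a = cbs(c // 2, c // 2, k=2, s=2, p=0)    # ...then 2x2 stride-2 convolution
        self.pool_b = nn.MaxPool2d(2, 2)                    # branch B: 2x2 max pooling
        self.pw_b = cbs(c, c // 2, k=1)                     # ...then pointwise compression
        self.pw_out = cbs(c, c, k=1)                        # restore the channel count to C

    def forward(self, x):
        x = self.stem(x)
        a = self.conv_a(self.pw_a(x))
        b = self.pw_b(self.pool_b(x))
        return self.pw_out(torch.cat([a, b], dim=1))

# With C = 32 (the nano width), a 640x640 input becomes a 160x160x32 feature map.
print(EFD(32)(torch.randn(1, 3, 640, 640)).shape)  # torch.Size([1, 32, 160, 160])
```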

3.2. Multi-Kernel Aggregation Block

In addition to the aforementioned downsampling module, the CSPNet (Cross Stage Partial Network) [2,60] is frequently employed as a feature aggregation network in object detection tasks. Illustrated in Figure 6a, the network is divided into two parts and reconnected in a cross-layer manner, allowing feature maps from early layers to be directly combined with those from later layers. This mechanism enables efficient feature reuse, thereby enhancing learning capabilities and detection performance. Additionally, YOLOv7 [26] integrates the One-Shot Aggregation (OSA) concept from VoVNet [22] to develop the ELAN module (depicted in Figure 6b), which significantly enriches feature map information through multi-layer gradient combination. In UAV aerial images, however, targets such as vehicles and pedestrians exhibit scale variations, posing challenges for these kinds of methods to accurately detect multi-scale targets, as shown in Figure 7. Convolutional neural networks (CNNs) commonly use smaller kernels to capture finer details and textures, while larger kernels are more adept at capturing broader contextual information. As illustrated in Figure 6c, the Diverse Branch Block (DBB) [21] is thus proposed to enhance the model’s ability to capture intricate patterns and features by integrating multiple branches of varying sizes. This approach enriches the feature diversity, thereby significantly improving the model’s generalization capability.
Although the DBB module can better perceive objects at various scales compared to previous methods, it also noticeably increases computational complexity because of its use of channel summation and kernel scaling for reparameterization. To address this issue and accelerate processing, we propose a novel feature aggregation module inspired by DBB, called the multi-kernel aggregation block (MKAB). Specifically, as illustrated in Figure 6d, the MKAB employs three simple branches, which are concatenated and followed by a 1 × 1 convolutional layer. The first branch uses a 1 × 1 convolution kernel instead of a residual structure to retain the original features and prevent overfitting. The second branch utilizes a 3 × 3 convolution kernel to capture localized features. The third branch uses two consecutive 3 × 3 convolution kernels instead of a single 5 × 5 convolution kernel: the stacked kernels provide the same receptive field at a lower computational cost while capturing more complex nonlinear features thanks to the deeper structure. To further improve computational efficiency, we minimize the frequent use of 1 × 1 convolution kernels for adjusting channel numbers and avoid large-kernel convolution operations. Instead, we dynamically adjust the input channel numbers for each MKAB module based on the network depth and flexibly control the number of internal branches. This design enhances the backbone's capability to perceive multi-scale information while maintaining lightweight and efficient computing characteristics.
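The three-branch structure of the MKAB can be sketched as follows; the hidden branch width (half of the output channels) is an assumption, and the dynamic channel/branch adjustment mentioned above is omitted for brevity:

```python
import torch
import torch.nn as nn

def cbs(c_in, c_out, k=3, s=1, p=None):
    # Conv + BatchNorm + SiLU helper.
    p = k // 2 if p is None else p
    return nn.Sequential(nn.Conv2d(c_in, c_out, k, s, p, bias=False),
                         nn.BatchNorm2d(c_out), nn.SiLU(inplace=True))

class MKAB(nn.Module):
    """Sketch of the multi-kernel aggregation block (Figure 6d)."""
    def __init__(self, c_in, c_out):
        super().__init__()
        hidden = c_out // 2
        self.branch1 = cbs(c_in, hidden, k=1)                   # keep original features
        self.branch2 = cbs(c_in, hidden, k=3)                   # local 3x3 features
        self.branch3 = nn.Sequential(cbs(c_in, hidden, k=3),    # two stacked 3x3 convs:
                                     cbs(hidden, hidden, k=3))  # 5x5 receptive field, cheaper
        self.fuse = cbs(3 * hidden, c_out, k=1)                 # 1x1 fusion of all branches

    def forward(self, x):
        return self.fuse(torch.cat([self.branch1(x),
                                    self.branch2(x),
                                    self.branch3(x)], dim=1))
```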

3.3. Faster Ghost Module

The backbone network is typically designed to be both large and deep, aiming to extract richer features from the input image. On the other hand, the neck network plays a crucial role in fusing semantic information from different layers. However, it is observed that some object detectors [26,51] adopt the same modules in both the backbone and neck parts to ensure the overall model consistency. An overlooked point in this approach is that as the network deepens and different features are combined, the number of parameters increases. Repeatedly stacking complex modules in the neck part does not adequately address the efficiency issue. Therefore, the Ghost Module [20] (Figure 8a) is introduced to reduce redundancy and computational cost. The key idea behind the ghost module is to split the input channels of a convolutional layer into two groups: a primary group that undergoes standard convolution and a secondary group that undergoes a cheaper form of processing, such as Depth-wise Convolution (DWConv) or a simpler operation like linear transformation. This splitting effectively reduces the computational cost by performing the more expensive operation on a smaller subset of the input channels.
In this part, we further improve the computational efficiency by substituting the DWConv with standard convolution, since the channel-by-channel nature of DWConv leads to latency on GPU devices, as discussed in Section 2.1. We name this improved feature fusion module the faster ghost module (FGM), as shown in Figure 8b. To ensure the retention of essential channel information and facilitate inter-channel information exchange, the FGM processes all channels while employing a two-step compression approach to maintain model efficiency. Firstly, it fuses feature maps from different layers across channels using Pointwise Convolution (PWConv) and compresses the input channels (C_i) to the hidden-layer channels (C). Secondly, it continues feature learning through standard convolution. Finally, the results of the first two stages are merged in a channel-wise manner to form 2C channels, which are then compressed back to the desired number of output channels (C_o) using PWConv. To alleviate the heavy computation caused by the larger feature channels of the deep neck network, the FGM unifies the feature output channels of different scales, as shown in Figure 3. This multi-step compression strategy enables the FGM to effectively manage the flow of information across channels, ensuring efficient feature extraction while minimizing the computational overhead associated with redundant or excessive channel information.
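A minimal sketch of the FGM as described above is given below; the hidden width C = C_o/2 and the 3 × 3 kernel of the second step are assumptions made for illustration:

```python
import torch
import torch.nn as nn

def cbs(c_in, c_out, k=3, s=1, p=None):
    # Conv + BatchNorm + SiLU helper.
    p = k // 2 if p is None else p
    return nn.Sequential(nn.Conv2d(c_in, c_out, k, s, p, bias=False),
                         nn.BatchNorm2d(c_out), nn.SiLU(inplace=True))

class FGM(nn.Module):
    """Sketch of the faster ghost module (Figure 8b): pointwise compression,
    standard-convolution refinement, concatenation to 2C, pointwise restoration."""
    def __init__(self, c_in, c_out):
        super().__init__()
        c = c_out // 2
        self.compress = cbs(c_in, c, k=1)       # step 1: PWConv, C_in -> C (cross-channel fusion)
        self.refine = cbs(c, c, k=3)            # step 2: standard conv instead of DWConv
        self.project = cbs(2 * c, c_out, k=1)   # compress 2C back to C_out

    def forward(self, x):
        y = self.compress(x)
        return self.project(torch.cat([y, self.refine(y)], dim=1))
```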

3.4. Focused Feature Pyramid Network

Researchers often enhance the detection accuracy for small objects by incorporating multi-scale architectures that integrate information from both the shallow and deep layers of the network. For instance, the feature pyramid network (FPN) [42] is designed to combine features from various levels of the backbone network and creates a pyramid of feature maps with consistent semantic information across different resolutions, as illustrated in Figure 9b. The Path Aggregation Network (PANet) [43] is extended to gather features from various levels across the feature pyramid, aiming to enhance representation, as depicted in Figure 9c. The utilization of the FPN or PANet has been demonstrated to significantly boost the accuracy of object detection, particularly for small objects.
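For reference, the classic top-down FPN fusion in Figure 9b can be summarized by the following minimal sketch; the 256-channel width is the value commonly used with FPN, not a setting of this paper:

```python
import torch.nn as nn
import torch.nn.functional as F

class MiniFPN(nn.Module):
    """Classic top-down FPN fusion over backbone levels C3-C5 (Figure 9b)."""
    def __init__(self, in_channels=(256, 512, 1024), out_ch=256):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_ch, 1) for c in in_channels)
        self.smooth = nn.ModuleList(nn.Conv2d(out_ch, out_ch, 3, padding=1)
                                    for _ in in_channels)

    def forward(self, feats):                       # feats: [C3, C4, C5], high to low resolution
        laterals = [l(f) for l, f in zip(self.lateral, feats)]
        for i in range(len(laterals) - 1, 0, -1):   # propagate deep semantics top-down
            laterals[i - 1] = laterals[i - 1] + F.interpolate(
                laterals[i], size=laterals[i - 1].shape[-2:], mode="nearest")
        return [s(p) for s, p in zip(self.smooth, laterals)]   # P3, P4, P5
```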
We conduct a comparative experiment on the VisDrone dataset to verify the positive effect of the FPN and PANet architectures. The comparison results can be seen in Figure 10. The horizontal axis represents the different levels of the feature maps, denoted P2, P3, …, P5, with P2 being the highest-resolution (shallowest) layer and P5 the lowest-resolution (deepest) layer. The vertical axis refers to the mean average precision at a 50% overlap (mAP50) obtained when the corresponding level is selected as the detection head. Moreover, non-fusion represents the baseline method, which does not fuse features from different layers, while FPN and PANet refer to the approaches that adopt multi-scale architectures. It is found that both the FPN and PANet architectures are beneficial for the accuracy of small object detection, as the detection head of each fused layer yields obviously higher detection accuracy. However, it is also observed that fusing only the P3 and P4 layers achieves almost comparable detection accuracy to fusing P3, P4, and P5 simultaneously, but with lower computational complexity. These findings suggest that layer-by-layer fused architectures may exhibit feature redundancy, leading to heightened computational complexity and negative effects on inference speed.
Accordingly, we further improve the FPN by focusing on efficient feature extraction, aiming to reduce computational overhead without compromising detection accuracy. We name this improved component the Focused Feature Pyramid Network (FFPN), which draws inspiration from the concept of focused multi-scale information. As depicted in Figure 9d, the FFPN architecture begins with a standard convolution with a stride of 2 for the P2 layer, aligning it with the P3 features. Subsequently, feature fusion is performed using our proposed faster ghost module. Similarly, nearest-neighbor interpolation is employed to upsample P5, which is then fused with P4. This fusion process integrates multi-level features step by step and avoids abruptly fusing non-neighboring levels, whose large differences in feature information (or the direct concatenation of all features) can lead to suboptimal results. In the subsequent stage, the results from the initial stage are integrated again, with the P3-sized features upsampled and the P4-sized features downsampled. This process further emphasizes the significance of these two informative layers and is thus more efficient for multi-scale feature fusion.

4. Experiments

4.1. Datasets

The performance evaluation of the EUAVDet was conducted on three publicly available datasets containing numerous aerial videos captured by UAVs: VisDrone [13], SeaDronesSeeV2 [61], and UAVDT [62].
VisDrone [13]: This dataset is a widely used benchmark for complex urban scenes and comprises 6147 training images, 548 validation images, and 1610 test images, with an average resolution of 2000 × 1500. It is characterized by high object density and contains numerous small targets of less than 32 pixels. The dataset covers ten categories: pedestrian, people, bicycle, car, van, truck, tricycle, awning-tricycle, bus, and motor.
SeaDronesSeeV2 [61]: This dataset is specifically tailored for UAV vision tasks in marine environments. As the images are mostly of 4K resolution, they are of large size. It consists of 8930 images in the training set and 1547 images in the validation set. There are five object categories involved in this dataset: swimmer, boat, jetski, life-saving appliances, and buoy.
UAVDT [62]: This dataset contains a variety of aerial images, including 23,829 training images and 16,580 validation images. The images have a resolution of 1080 × 540 pixels and consist of three object categories: car, truck, and bus. Because of the low-altitude perspective, the detected objects in the images have relatively larger sizes compared to VisDrone but are obviously smaller than the objects in the SeaDronesSeeV2 dataset.

4.2. Evaluation Metrics

To evaluate the detection accuracy and inference speed of the object detection model, we utilized the mean average precision (mAP) metric and the frames per second (FPS) metric, respectively.
Detection accuracy metric. The average precision (AP) was determined by computing the area under the precision-recall curve for each target category, as specified in Formula (1). Subsequently, the mean average precision (mAP) was obtained by averaging the AP values of the N classes across IoU thresholds ranging from 0.50 to 0.95 with a step of 0.05, as outlined in Formula (2). Specifically, AP50 and AP75 correspond to the results at IoU thresholds of 0.5 and 0.75, respectively. Furthermore, APL, APM, and APS are the AP values averaged over IoU thresholds from 0.5 to 0.95 for three object scales: large (area > 96^2 pixels), medium (32^2 < area < 96^2 pixels), and small (area < 32^2 pixels). The evaluation standards of the MS COCO [12] dataset were employed to ensure consistency of the detection results.
AP = \int_{0}^{1} P(r)\,\mathrm{d}r \qquad (1)

\mathrm{mAP}_{50:95} = \frac{1}{N} \sum_{i=1}^{N} \mathop{\mathrm{mean}}\limits_{j} \left( AP_{i}^{\,\mathrm{IoU}_{\mathrm{thre}} = j} \right) \qquad (2)
Inference speed metric. In object detection tasks, the latency or consuming time usually includes pre-processing, model inference, and data post-processing, such as non-maximum suppression. The FPS metric is often used to measure the inference speed of image processing, indicating how many image frames a system can process per second, which can be formulated as
\mathrm{FPS} = \frac{1}{\mathrm{Latency}} = \frac{1}{T_{\mathrm{pre}} + T_{\mathrm{infer}} + T_{\mathrm{post}}} \qquad (3)
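Formulas (1)-(3) can be illustrated with a short NumPy sketch; the all-point precision-recall interpolation and the timing values below are illustrative and are not the exact evaluation code used in our experiments:

```python
import numpy as np

def average_precision(recall, precision):
    """AP as the area under the precision-recall curve (Formula (1)).
    `recall` must be sorted in ascending order; precision is made monotone."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]          # precision envelope
    idx = np.where(r[1:] != r[:-1])[0]                # points where recall changes
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

def mean_ap(ap_table):
    """mAP_{50:95} (Formula (2)): ap_table[i][j] holds AP of class i at the j-th
    IoU threshold (0.50, 0.55, ..., 0.95); average over thresholds, then classes."""
    return float(np.mean([np.mean(row) for row in ap_table]))

def fps(t_pre, t_infer, t_post):
    """FPS from per-image latency components in seconds (Formula (3))."""
    return 1.0 / (t_pre + t_infer + t_post)

print(fps(0.004, 0.038, 0.005))  # ~21.3 frames per second
```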

4.3. Implementation Details

To comprehensively evaluate the proposed EUAVDet method, we designed three different models based on their sizes, namely, EUAVDet-small, EUAVDet-tiny, and EUAVDet-nano. The model size is highly related to the number of feature channels (C), as shown in Figure 3. The feature channel also determines the width of the backbone network. In this work, we set the channel C of EUAVDet-small, EUAVDet-tiny, and EUAVDet-nano to 64, 48, and 32, respectively.
Both ablation studies and comparative experiments were conducted in this study. The experiments were carried out within the PyTorch deep learning framework on a workstation with two RTX 3080 Ti GPUs. For model training, we employed the standard Stochastic Gradient Descent (SGD) strategy, initializing the learning rate at 0.01 and setting the momentum to 0.937. During training, the input images were downsampled to 640 × 640. Notably, all models were trained from scratch without leveraging pre-trained weights to ensure a fair comparison.
To evaluate the EUAVDet model on embedded devices, we selected the widely used Jetson Nano and Jetson Orin Nano for testing; the latter offers nearly 80 times the computing power of the former. We first converted the trained weight files to the unified Open Neural Network Exchange (ONNX) format and then optimized and accelerated the models with TensorRT using the Floating Point 16-bit (FP16) format on the Jetson Nano, whereas the Jetson Orin Nano runs the models in Floating Point 32-bit (FP32) precision. The Jetson Nano and Jetson Orin Nano used in the experiments ran JetPack 4.6.1 and JetPack 5.1.2, with TensorRT versions 8.2.1.8 and 5.1.2, respectively.
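The deployment pipeline described above can be reproduced roughly as follows; the model stand-in, file names, and opset version are placeholders, and the trtexec command assumes a standard TensorRT installation shipped with JetPack:

```python
import torch
import torch.nn as nn

# Stand-in for the trained EUAVDet network; in practice, load the trained weights here.
model = nn.Sequential(nn.Conv2d(3, 16, 3, 2, 1), nn.SiLU()).eval()

dummy = torch.zeros(1, 3, 640, 640)          # training/inference input resolution
torch.onnx.export(model, dummy, "euavdet_n.onnx",
                  input_names=["images"], output_names=["outputs"], opset_version=12)

# On the Jetson device, build and benchmark a TensorRT engine from the ONNX file,
# e.g. with FP16 precision on the Jetson Nano:
#   trtexec --onnx=euavdet_n.onnx --saveEngine=euavdet_n.engine --fp16
```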

4.4. Ablation Study

In this subsection, we performed a series of ablation analyses on the proposed EUAVDet detector to evaluate the performance of each module. We selected the YOLOv8-n [28] as the baseline model and subsequently inserted one or more modules into the backbone or neck network to evaluate their performance. Each experiment was conducted with a batch size of eight and trained for 300 epochs. We report the experimental results, including model size (Params), running speed metrics (FLOPs, Latency), and detection accuracy metrics (AP, mAP), on the VisDrone validation set using an embedded device (Jetson Nano, NVIDIA, Santa Clara, CA, USA).
Efficient Feature Downsampling (EFD). As shown in Table 2, the EFD module clearly improved detection accuracy in terms of AP50 and mAP compared to the baseline model. However, this accuracy gain was accompanied by a slight increase in GFLOPs and a small drop in FPS, indicating that incorporating the EFD module adds a modest computational burden. Despite this trade-off, the EFD module is well suited to small object detection in high-resolution aerial images, as it preserves more detail than Conventional Feature Downsampling modules. This is further evidenced by the class response maps illustrated in Figure 11, which visualize the areas of the image where the model is inclined to detect objects. Figure 11b,c show the class response maps under two different downsampling modules: Conventional Feature Downsampling (CFD) and the proposed EFD module. The EFD module exhibits a stronger response for small and dense objects, indicating its remarkable detection performance on aerial images.
Multi-kernel aggregation block (MKAB). Table 2 shows that integrating the MKAB module improves both detection accuracy and processing speed. Compared to the YOLOv8-n baseline model, the introduction of MKAB reduced FLOPs from 8.2 G to 8.0 G, decreased latency by 2.8 ms, and improved the mAP metric by 0.3%. When MKAB was combined with EFD as the backbone (sixth row of Table 2), it effectively mitigated the delay caused by EFD, resulting in improvements across all the detection metrics. These improvements are evident not only in the increased accuracy (AP50 and mAP) but also in the reduced model parameters, FLOPs, and latency. To further evaluate the effectiveness of the MKAB module, we also conducted comparative experiments among the four different modules shown in Figure 6: we replaced the CSP-style module used in the YOLOv8-n backbone network with an ELAN, a DBB, and an MKAB and tested their performance, respectively. As shown in Table 3, the proposed MKAB module performed best in terms of both detection accuracy and processing speed.
Focused Feature Pyramid Networks (FFPNs). The FFPN was introduced for more efficient multi-scale feature aggregation in the neck network. From the fifth row of Table 2, it can be observed that the FFPN led to a significant decrease in network parameters by more than 30%, along with a 0.3% improvement in the mAP metric for the baseline model. The inference latency of the model was reduced by nearly 4 ms on the Jetson Nano. These results suggest that the FFPN achieves a reduction in network size without compromising detection accuracy. This is attributed to the two-stage aggregation structure, which ensures better multi-scale fusion and effectively reduces redundant features, allowing the feature stream to be more accurately focused on the key detection layers.
Faster ghost module (FGM). The FGM was designed to reduce feature redundancy and thereby decrease the computational cost in the neck propagation stage. This effect is clearly demonstrated in the fourth and last rows of Table 2, where both the FLOPs and latency metrics decrease considerably. However, the FGM alone did not improve detection accuracy. Nevertheless, similar to the combination of the EFD and MKAB modules in the backbone network, the integration of FFPN and FGM in the neck network also yields remarkable performance. As shown in the seventh row of Table 2, all metrics improved substantially compared to the baseline model. Consequently, the introduction of the FGM is mainly beneficial for accelerating the inference speed of the network, while the FFPN benefits both detection accuracy and running speed.
EFD + MKAB + FFPN + FGM (EUAVDet). Finally, we constructed the EUAVDet-n model by combining all the proposed modules. Compared to the performance of the baseline YOLOv8-n (the first row in Table 2), our model successfully improved the detection accuracy (mAP) from 18.4% to 19.2%, significantly reduced the model parameters (from 3.01 M to 1.34 M) and FLOPs (from 8.2 G to 6.9 G), and finally obviously enhanced the inference speed with the inference latency reduced by 4.2 ms.
Additionally, we also report the experimental results for different object sizes on the VisDrone validation and test sets, as illustrated in Table 4. We built our EUAVDet model upon YOLOv8 (specifically, YOLOv8 + EFD + MKAB + FFPN + FGM) and designated two versions of our model as EUAVDet-n and EUAVDet-s based on their sizes, mirroring the YOLOv8-n and YOLOv8-s models. Table 4 shows that the EFD module effectively improves detection accuracy for small and medium objects but slightly degrades it for large objects. This may be attributed to the FFPN removing the high-level detection head, which affects the model's capacity to detect larger objects. However, when the other modules are integrated into the network, both detection accuracy and processing speed improve markedly. Meanwhile, incorporating the proposed modules also accelerates processing, as the model parameters and FLOPs are clearly reduced.

4.5. Comparison with the State-of-the-Art Methods

Quantitative comparison. To demonstrate the superiority of our model, we constructed our model using YOLOv8 and YOLOv10 [27] as baselines and conducted comparative experiments on the VisDrone dataset using two embedded devices (Jetson Nano and Jetson Orin Nano), comparing it with several state-of-the-art models. To ensure a fair comparison, we re-implemented these detectors in the same environment.
Compared to several state-of-the-art methods in Table 1, the proposed EUAVDet achieved a better trade-off between detection accuracy and inference time. To be specific, our largest model (EUAVDet-s_v8) exhibits the best detection accuracy on both the VisDrone validation and test sets, with significantly fewer parameters than the corresponding versions of YOLOv5-s, YOLOX-s, and YOLOv8-s. Compared with YOLOv8-s, it improved mAP^test by 0.8% (from 17.3% to 18.1%) and improved FPS by 25.6% (from 6.4 fps to 8.6 fps). On the other hand, the smallest model (EUAVDet-n_v8) has only 1.34 M parameters. Compared to YOLOv8-n, it improved mAP^test by 0.5% (from 14.4% to 14.9%) and boosted FPS by 7.6% (from 19.5 fps to 21.1 fps) on the Jetson Nano. It also achieved comparable AP50 and mAP scores relative to the other baseline methods. Additionally, the proposed EUAVDet-tiny_v8 nearly doubled the FPS (from 6.4 fps to 13.2 fps) while only compromising mAP^test by 0.3% (from 17.3% to 17.0%) compared to YOLOv8-s; it can thus serve as a balanced solution between accuracy and speed. Similarly, it is also found that while YOLOv10 performed better in terms of speed, owing to its reduced computational complexity and NMS-free architecture, it did not outperform YOLOv8 in terms of accuracy. Additionally, we integrated our proposed modules into YOLOv10, achieving clear improvements in both speed and accuracy, which further demonstrates the plug-and-play capacity of our proposed method.
To further evaluate the generality of EUAVDet, we re-implemented YOLOv5-n, YOLOv7-tiny, YOLOv8-n, and YOLOv10-n with the proposed modules, including the EFD, MKAB, FFPN, and FGM. In Table 5, we report their performance on the SeaDronesSeeV2 validation set using embedded devices, where each experiment was trained for 100 epochs. EUAVDet-n_v5, EUAVDet-tiny_v7, EUAVDet-n_v8, and EUAVDet-n_v10 represent the EUAVDet models built upon YOLOv5-n, YOLOv7-tiny, YOLOv8-n, and YOLOv10-n, respectively. Our EUAVDet models outperform their corresponding baselines in most metrics, including parameter counts and mAP scores. The only exception is that the detection accuracy for large-scale objects occasionally falls below that of the baseline models. Table 6 reports the performance on the UAVDT dataset in the same way, where each experiment was trained for 30 epochs. Most metrics again exhibit a significant increase in accuracy and speed, except that EUAVDet-n_v5, EUAVDet-n_v8, and EUAVDet-n_v10 did not perform as well as the baselines on large objects, mainly because UAVDT contains many large-scale objects in low-altitude shots. The running speed on the edge devices improved clearly in all metrics of parameter count, FLOPs, and inference speed when the proposed modules were inserted into the baseline models. Overall, these comparative results on the SeaDronesSeeV2 and UAVDT datasets further demonstrate the effectiveness of the proposed modules and verify their plug-and-play capability.
Visualization comparison. Besides the above quantitative comparison, we present some visual detection results to provide a more intuitive comparison. On the VisDrone dataset, as shown in the elliptical region in Figure 12, our model (EUAVDet-s) could detect more small targets and even some unlabeled targets. On the SeaDronesSeeV2 dataset, as shown in the elliptical region in Figure 13, our models built upon both YOLOv5 and YOLOv7 could discriminate targets more correctly than the original models. We can also observe that our model ensures higher detection confidence for tiny targets, as indicated by the rectangular region in Figure 13. On the UAVDT dataset, as illustrated by the green rectangular box in Figure 14, our method could detect more targets than the baseline model, both under strong lighting conditions and at high altitudes with fog. Compared to YOLOv7-tiny, the blue rectangular box results show that our method produced fewer false detections. However, the limited model capacity constrains its learning ability, leading to poor detection performance on extremely small objects. In conclusion, our proposed EUAVDet model can serve as a benchmark for edge-side UAV object detection, with a strong capability to provide better-balanced accuracy and speed.

5. Conclusions

In this paper, we develop a novel real-time object detection method for edge-based UAVs called EUAVDet. We initially introduce an efficient feature downsampling module and a novel multi-kernel aggregation block into the backbone network to retain more feature details and capture richer spatial information, especially for the small targets from the UAV perspective. Additionally, a novel feature pyramid network with an improved feature compression module is incorporated into the neck network to fuse multi-scale features with fewer parameters. The components of EUAVDet are designed to be plug-and-play, allowing for seamless integration into existing object detectors. Extensive experiments conducted on the public VisDrone, SeaDronesSeeV2, and UAVDT datasets all demonstrate the superior performance of our method when deployed on embedded devices, such as Jetson Nano or Jetson Orin Nano, in terms of detection accuracy and computational efficiency. This work is expected to promote more edge-based UAV applications, and our future work will focus on how to further improve the detection accuracy for extremely tiny objects and design more lightweight detectors on embedded devices with less computational capacity.

Author Contributions

Conceptualization, W.W. and Q.L.; Methodology, W.W.; Software, A.L.; Validation, A.L. and Y.M.; Investigation, S.X.; Resources, Y.M.; Writing—original draft, W.W. and A.L.; Writing—review & editing, J.H., S.X. and P.D.; Visualization, A.L.; Supervision, J.H.; Project administration, W.W.; Funding acquisition, P.D. and Q.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China under Grant 62073129, U21A20490, U22A2059, 62201207, and 62371185; Hunan Provincial Natural Science Foundation of China under Grant 2022JJ10020, 2023JJ40163, and 2024JJ6063; Scientific Research Project of Hunan Education Department of China under Grant 21B0330; and Graduate School of Changsha University of Science and Technology under Grant CLSJCX23065.

Data Availability Statement

The data are currently unavailable due to privacy restrictions.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Chen, C.; Zheng, Z.; Xu, T.; Guo, S.; Feng, S.; Yao, W.; Lan, Y. YOLO-Based UAV Technology: A Review of the Research and Its Applications. Drones 2023, 7, 190. [Google Scholar] [CrossRef]
  2. Zhang, Z.; Zheng, H.; Cao, J.; Feng, X.; Xie, G. FRS-Net: An Efficient Ship Detection Network for Thin-Cloud and Fog-Covered High-Resolution Optical Satellite Imagery. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 2326–2340. [Google Scholar]
  3. Cao, Z.; Kooistra, L.; Wang, W.; Guo, L.; Valente, J. Real-time object detection based on uav remote sensing: A systematic literature review. Drones 2023, 7, 620. [Google Scholar] [CrossRef]
  4. Koay, H.V.; Chuah, J.H.; Chow, C.O.; Chang, Y.L.; Yong, K.K. YOLO-RTUAV: Towards real-time vehicle detection through aerial images with low-cost edge devices. Remote Sens. 2021, 13, 4196. [Google Scholar] [CrossRef]
  5. Hernández, D.; Cecilia, J.M.; Cano, J.C.; Calafate, C.T. Flood detection using real-time image segmentation from unmanned aerial vehicles on edge-computing platform. Remote Sens. 2022, 14, 223. [Google Scholar] [CrossRef]
  6. Fan, Y.; Chen, W.; Jiang, T.; Zhou, C.; Zhang, Y.; Wang, X. Aerial Vision-and-Dialog Navigation. arXiv 2022, arXiv:2205.12219. [Google Scholar]
  7. Liu, S.; Zhang, H.; Qi, Y.; Wang, P.; Zhang, Y.; Wu, Q. AerialVLN: Vision-and-Language Navigation for UAVs. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), IEEE, Paris, France, 2–6 October 2023; pp. 15338–15348. [Google Scholar]
  8. Zhang, P.; Zhong, Y.; Li, X. SlimYOLOv3: Narrower, faster and better for real-time UAV applications. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, Seoul, Republic of Korea, 27–28 October 2019. [Google Scholar]
  9. Lu, Y.; Gong, M.; Hu, Z.; Zhao, W.; Guan, Z.; Zhang, M. Energy-based CNNs Pruning for Remote Sensing Scene Classification. IEEE Trans. Geosci. Remote Sens. 2023, 61, 3000214. [Google Scholar] [CrossRef]
  10. Li, Z.; Liu, X.; Zhao, Y.; Liu, B.; Huang, Z.; Hong, R. A lightweight multi-scale aggregated model for detecting aerial images captured by UAVs. J. Vis. Commun. Image Represent. 2021, 77, 103058. [Google Scholar]
  11. Lee, J.; Wang, J.; Crandall, D.; Šabanović, S.; Fox, G. Real-time, cloud-based object detection for unmanned aerial vehicles. In Proceedings of the 2017 First IEEE International Conference on Robotic Computing (IRC), IEEE, Taichung, Taiwan, 10–12 April 2017; pp. 36–43. [Google Scholar]
  12. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the Computer Vision—ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Proceedings, Part V 13. Springer: Berlin/Heidelberg, Germany, 2014; pp. 740–755. [Google Scholar]
  13. Zhu, P.; Wen, L.; Du, D.; Bian, X.; Fan, H.; Hu, Q.; Ling, H. Detection and tracking meet drones challenge. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 7380–7399. [Google Scholar] [CrossRef]
  14. Guo, X. A novel Multi to Single Module for small object detection. arXiv 2023, arXiv:2303.14977. [Google Scholar]
  15. Zhang, R.; Shao, Z.; Huang, X.; Wang, J.; Wang, Y.; Li, D. Adaptive dense pyramid network for object detection in UAV imagery. Neurocomputing 2022, 489, 377–389. [Google Scholar] [CrossRef]
  16. Zhao, L.; Zhu, M. MS-YOLOv7: YOLOv7 Based on Multi-Scale for Object Detection on UAV Aerial Photography. Drones 2023, 7, 188. [Google Scholar] [CrossRef]
  17. Zhang, Z.; Xia, W.; Xie, G.; Xiang, S. Fast Opium Poppy Detection in Unmanned Aerial Vehicle (UAV) Imagery Based on Deep Neural Network. Drones 2023, 7, 559. [Google Scholar] [CrossRef]
  18. Du, B.; Huang, Y.; Chen, J.; Huang, D. Adaptive Sparse Convolutional Networks with Global Context Enhancement for Faster Object Detection on Drone Images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 13435–13444. [Google Scholar]
  19. Yin, Y.; Cheng, X.; Shi, F.; Zhao, M.; Li, G.; Chen, S. An Enhanced Lightweight Convolutional Neural Network for Ship Detection in Maritime Surveillance System. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 5811–5825. [Google Scholar] [CrossRef]
  20. Han, K.; Wang, Y.; Tian, Q.; Guo, J.; Xu, C.; Xu, C. Ghostnet: More features from cheap operations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 1580–1589. [Google Scholar]
  21. Ding, X.; Zhang, X.; Han, J.; Ding, G. Diverse branch block: Building a convolution as an inception-like unit. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 10886–10895. [Google Scholar]
  22. Lee, Y.; Hwang, J.w.; Lee, S.; Bae, Y.; Park, J. An energy and GPU-computation efficient backbone network for real-time object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Long Beach, CA, USA, 16–17 June 2019; pp. 752–760. [Google Scholar]
  23. Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  24. Jocher, G. YOLOv5 by Ultralytics; Zenodo: Geneva, Switzerland, 2020. [Google Scholar] [CrossRef]
  25. Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. Yolox: Exceeding yolo series in 2021. arXiv 2021, arXiv:2107.08430. [Google Scholar]
  26. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475. [Google Scholar]
  27. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. YOLOv10: Real-Time End-to-End Object Detection. arXiv 2024, arXiv:2405.14458. [Google Scholar]
  28. Jocher, G.; Chaurasia, A.; Qiu, J. Ultralytics YOLO; Ultralytics Inc.: Seattle, WA, USA, 2023. [Google Scholar]
  29. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2015, 28, 1–9. [Google Scholar] [CrossRef] [PubMed]
  30. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar]
  31. Dong, Z.; Wang, M.; Wang, Y.; Zhu, Y.; Zhang, Z. Object detection in high resolution remote sensing imagery based on convolutional neural networks with suitable object scale features. IEEE Trans. Geosci. Remote Sens. 2019, 58, 2104–2114. [Google Scholar] [CrossRef]
  32. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the Computer Vision—ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part I 14. Springer: Berlin/Heidelberg, Germany, 2016; pp. 21–37. [Google Scholar]
  33. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  34. Duan, K.; Bai, S.; Xie, L.; Qi, H.; Huang, Q.; Tian, Q. Centernet: Keypoint triplets for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6569–6578. [Google Scholar]
  35. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4510–4520. [Google Scholar]
  36. Zhang, X.; Zhou, X.; Lin, M.; Sun, J. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6848–6856. [Google Scholar]
  37. Cui, C.; Gao, T.; Wei, S.; Du, Y.; Guo, R.; Dong, S.; Lu, B.; Zhou, Y.; Lv, X.; Liu, Q.; et al. PP-LCNet: A lightweight CPU convolutional neural network. arXiv 2021, arXiv:2109.15099. [Google Scholar]
  38. Li, Y.; Chen, Y.; Dai, X.; Chen, D.; Liu, M.; Yuan, L.; Liu, Z.; Zhang, L.; Vasconcelos, N. Micronet: Improving image recognition with extremely low flops. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 468–477. [Google Scholar]
  39. Ding, X.; Zhang, X.; Ma, N.; Han, J.; Ding, G.; Sun, J. Repvgg: Making vgg-style convnets great again. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13733–13742. [Google Scholar]
  40. Zhao, L.; Gao, J.; Li, X. NAS-kernel: Learning suitable Gaussian kernel for remote sensing object counting. IEEE Geosci. Remote Sens. Lett. 2023, 20, 6010105. [Google Scholar] [CrossRef]
  41. Peng, C.; Li, Y.; Shang, R.; Jiao, L. RSBNet: One-shot neural architecture search for a backbone network in remote sensing image recognition. Neurocomputing 2023, 537, 110–127. [Google Scholar] [CrossRef]
  42. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
  43. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8759–8768. [Google Scholar]
  44. Ghiasi, G.; Lin, T.Y.; Le, Q.V. Nas-fpn: Learning scalable feature pyramid architecture for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 7036–7045. [Google Scholar]
  45. Tan, M.; Pang, R.; Le, Q.V. Efficientdet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10781–10790. [Google Scholar]
  46. Zhang, D.; Zhang, H.; Tang, J.; Wang, M.; Hua, X.; Sun, Q. Feature pyramid transformer. In Proceedings of the Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XXVIII 16. Springer: Berlin/Heidelberg, Germany, 2020; pp. 323–339. [Google Scholar]
  47. Chen, C.F.R.; Fan, Q.; Panda, R. Crossvit: Cross-attention multi-scale vision transformer for image classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 357–366. [Google Scholar]
  48. Arani, E.; Gowda, S.; Mukherjee, R.; Magdy, O.; Kathiresan, S.; Zonooz, B. A comprehensive study of real-time object detection networks across multiple domains: A survey. arXiv 2022, arXiv:2208.10895. [Google Scholar]
  49. Li, H.; Li, J.; Wei, H.; Liu, Z.; Zhan, Z.; Ren, Q. Slim-neck by GSConv: A better design paradigm of detector architectures for autonomous vehicles. arXiv 2022, arXiv:2206.02424. [Google Scholar]
  50. Liu, S.; Huang, D. Receptive field block net for accurate and fast object detection. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 385–400. [Google Scholar]
  51. Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W.; et al. YOLOv6: A single-stage object detection framework for industrial applications. arXiv 2022, arXiv:2209.02976. [Google Scholar]
  52. Zhang, Z. Drone-YOLO: An efficient neural network method for target detection in drone images. Drones 2023, 7, 526. [Google Scholar] [CrossRef]
  53. Zhu, X.; Lyu, S.; Wang, X.; Zhao, Q. TPH-YOLOv5: Improved YOLOv5 based on transformer prediction head for object detection on drone-captured scenarios. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 2778–2788. [Google Scholar]
  54. Li, G.; Bai, Z.; Liu, Z.; Zhang, X.; Ling, H. Salient object detection in optical remote sensing images driven by transformer. IEEE Trans. Image Process. 2023, 32, 5257–5269. [Google Scholar] [CrossRef]
  55. Van Etten, A. You only look twice: Rapid multi-scale object detection in satellite imagery. arXiv 2018, arXiv:1805.09512. [Google Scholar]
  56. Akyon, F.C.; Altinuc, S.O.; Temizel, A. Slicing aided hyper inference and fine-tuning for small object detection. In Proceedings of the 2022 IEEE International Conference on Image Processing (ICIP), IEEE, Bordeaux, France, 16–19 October 2022; pp. 966–970. [Google Scholar]
  57. Wang, Y.; Yang, Y.; Zhao, X. Object detection using clustering algorithm adaptive searching regions in aerial images. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 651–664. [Google Scholar]
  58. Huang, Y.; Chen, J.; Huang, D. UFPMP-Det: Toward accurate and efficient object detection on drone imagery. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 22 February–1 March 2022; Volume 36, pp. 1026–1033. [Google Scholar]
  59. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  60. Wang, C.Y.; Liao, H.Y.M.; Wu, Y.H.; Chen, P.Y.; Hsieh, J.W.; Yeh, I.H. CSPNet: A new backbone that can enhance learning capability of CNN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 13–19 June 2020; pp. 390–391. [Google Scholar]
  61. Varga, L.A.; Kiefer, B.; Messmer, M.; Zell, A. Seadronessee: A maritime benchmark for detecting humans in open water. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2022; pp. 2260–2270. [Google Scholar]
  62. Du, D.; Qi, Y.; Yu, H.; Yang, Y.; Duan, K.; Li, G.; Zhang, W.; Huang, Q.; Tian, Q. The Unmanned Aerial Vehicle Benchmark: Object Detection and Tracking. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; Springer: Berlin/Heidelberg, Germany, 2018; pp. 375–391. [Google Scholar]
Figure 1. Image examples from different camera views: (a) a traditional surveillance camera (from the COCO dataset); (b) a UAV view (from the VisDrone dataset).
Figure 3. An overview of the structure of the proposed EUAVDet algorithm. First, the input image passes successively through a feature downsampling module and four feature aggregation blocks in the backbone stage. Then, the SPPF is used to enlarge the receptive field, and the FFPN is designed to enhance the feature representation in the neck stage. Finally, the 8-fold and 16-fold downsampled features are fed into the detection heads to predict the objects in the given image.
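To make the data flow of Figure 3 concrete, the following PyTorch sketch wires up the same pipeline: an efficient feature downsampling stage, four backbone stages, an SPPF-like block, a top-down fusion neck, and detection heads on the 8-fold and 16-fold downsampled maps. All module internals (the conv_bn_act stand-ins for EFD, MKAB, SPPF, and FFPN), the channel width c, and the per-stage strides are illustrative assumptions, not the authors' exact design.

```python
import torch
import torch.nn as nn


def conv_bn_act(c_in, c_out, k=3, s=1):
    """Conv-BN-SiLU unit, used here as a stand-in for the paper's building blocks."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False),
        nn.BatchNorm2d(c_out),
        nn.SiLU(inplace=True),
    )


class EUAVDetFlow(nn.Module):
    """Sketch of the Figure 3 data flow: EFD -> 4 aggregation blocks -> SPPF -> FFPN -> heads.
    Internals and per-stage strides are placeholders, not the authors' exact design."""

    def __init__(self, c=32, num_classes=10):
        super().__init__()
        # Efficient feature downsampling: H x W x 3 -> H/4 x W/4 x C (see Figure 5b).
        self.efd = nn.Sequential(conv_bn_act(3, c // 2, s=2), conv_bn_act(c // 2, c, s=2))
        # Four feature aggregation blocks (MKAB in the paper); strides assumed 1, 2, 2, 2.
        self.stage1 = conv_bn_act(c, c)                # stride 4
        self.stage2 = conv_bn_act(c, 2 * c, s=2)       # stride 8  (P3)
        self.stage3 = conv_bn_act(2 * c, 4 * c, s=2)   # stride 16 (P4)
        self.stage4 = conv_bn_act(4 * c, 8 * c, s=2)   # stride 32 (P5)
        self.sppf = conv_bn_act(8 * c, 8 * c)          # placeholder for the SPPF block
        # Placeholder neck: upsample-and-concatenate fusion standing in for the FFPN.
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        self.fuse_p4 = conv_bn_act(12 * c, 4 * c, k=1)
        self.fuse_p3 = conv_bn_act(6 * c, 2 * c, k=1)
        # Detection heads only on the 8x and 16x downsampled features.
        self.head_p3 = nn.Conv2d(2 * c, num_classes + 4, 1)
        self.head_p4 = nn.Conv2d(4 * c, num_classes + 4, 1)

    def forward(self, x):
        c2 = self.stage1(self.efd(x))
        p3 = self.stage2(c2)
        p4 = self.stage3(p3)
        p5 = self.sppf(self.stage4(p4))
        p4 = self.fuse_p4(torch.cat([self.up(p5), p4], dim=1))
        p3 = self.fuse_p3(torch.cat([self.up(p4), p3], dim=1))
        return self.head_p3(p3), self.head_p4(p4)


if __name__ == "__main__":
    outs = EUAVDetFlow()(torch.randn(1, 3, 640, 640))
    print([o.shape for o in outs])  # [1, 14, 80, 80] and [1, 14, 40, 40]
```

Running the script confirms that a 640 × 640 input yields 80 × 80 and 40 × 40 prediction maps, matching the stride-8 and stride-16 heads described in the caption.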
Figure 4. Extremely tiny targets in high-resolution aerial images are often difficult to detect because of the downsampling process.
Figure 5. The architecture of two distinct feature downsampling modules: (a) the conventional feature downsampling module; (b) the proposed efficient feature downsampling module. H × W × 3 and H/4 × W/4 × C denote the size of the original input image and the size of the output feature map, respectively.
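As a point of reference for Figure 5, the snippet below contrasts a conventional stem built from two stride-2 convolutions with a generic detail-preserving alternative based on pixel unshuffling. It only illustrates the general idea of reaching H/4 × W/4 × C without discarding input pixels early; it is not the paper's EFD module, whose exact structure is defined in the main text.

```python
import torch
import torch.nn as nn

# Two generic ways to map H x W x 3 to H/4 x W/4 x C, for comparison with Figure 5.
# Neither is the paper's EFD module; the pixel-unshuffle variant only illustrates the
# idea of downsampling without discarding any input pixels before feature extraction.


def conventional_downsampling(c_out=32):
    # (a) Two stride-2 convolutions: cheap, but fine details can be lost early.
    return nn.Sequential(
        nn.Conv2d(3, c_out // 2, 3, 2, 1), nn.SiLU(),
        nn.Conv2d(c_out // 2, c_out, 3, 2, 1), nn.SiLU(),
    )


def detail_preserving_downsampling(c_out=32):
    # PixelUnshuffle(4) folds every 4 x 4 patch into channels (3 -> 48), so the 1 x 1
    # projection still sees every original pixel.
    return nn.Sequential(nn.PixelUnshuffle(4), nn.Conv2d(3 * 16, c_out, 1), nn.SiLU())


x = torch.randn(1, 3, 640, 640)
print(conventional_downsampling()(x).shape)       # torch.Size([1, 32, 160, 160])
print(detail_preserving_downsampling()(x).shape)  # torch.Size([1, 32, 160, 160])
```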
Figure 6. Comparison of different network structures: (a) CSP block, (b) ELAN, (c) DBB, and (d) MKAB, where Ci and Co denote the numbers of input and output channels of each module, respectively.
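For readers unfamiliar with the baseline structure in Figure 6a, here is a minimal CSP-style block: the input is split into a transformed branch and a cheap shortcut branch, which are then concatenated and fused. ELAN, DBB, and the proposed MKAB rearrange this basic split-transform-merge pattern; the sketch below is a generic reference implementation rather than any of those specific modules, and the branch widths and stack depth n are assumptions.

```python
import torch
import torch.nn as nn


class CSPBlock(nn.Module):
    """Minimal CSP-style block (Figure 6a): split the channels, transform one branch with a
    small convolution stack, keep the other as a cheap shortcut, then concatenate and fuse.
    The branch widths and stack depth n are illustrative choices."""

    def __init__(self, c_in, c_out, n=2):
        super().__init__()
        c_mid = c_out // 2
        self.split_a = nn.Conv2d(c_in, c_mid, 1)
        self.split_b = nn.Conv2d(c_in, c_mid, 1)
        self.blocks = nn.Sequential(
            *[nn.Sequential(nn.Conv2d(c_mid, c_mid, 3, padding=1), nn.SiLU()) for _ in range(n)]
        )
        self.fuse = nn.Conv2d(2 * c_mid, c_out, 1)

    def forward(self, x):
        a = self.blocks(self.split_a(x))  # transformed branch
        b = self.split_b(x)               # shortcut branch
        return self.fuse(torch.cat([a, b], dim=1))


print(CSPBlock(64, 128)(torch.randn(1, 64, 80, 80)).shape)  # torch.Size([1, 128, 80, 80])
```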
Figure 7. Targets such as vehicles and pedestrians exhibit large scale variations in UAV aerial images.
Figure 8. Comparison between (a) the existing ghost module and (b) the proposed faster ghost module, where C represents hidden layer channels.
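For context on Figure 8a, the following is a minimal sketch of the existing ghost module from GhostNet [20]: an ordinary convolution produces a subset of the output channels, and a cheap depthwise convolution generates the remaining "ghost" features from them. The proposed faster ghost module in Figure 8b modifies this design and is not reproduced here; the ratio and kernel size below are common defaults used as assumptions.

```python
import torch
import torch.nn as nn


class GhostModule(nn.Module):
    """Sketch of the existing ghost module [20] (Figure 8a): an ordinary 1 x 1 convolution
    produces c_out / ratio 'primary' channels, and a cheap depthwise convolution generates
    the remaining 'ghost' channels from them. Not the paper's faster ghost module."""

    def __init__(self, c_in, c_out, ratio=2, dw_kernel=3):
        super().__init__()
        c_primary = c_out // ratio
        self.primary = nn.Sequential(
            nn.Conv2d(c_in, c_primary, 1, bias=False),
            nn.BatchNorm2d(c_primary),
            nn.ReLU(inplace=True),
        )
        self.cheap = nn.Sequential(
            nn.Conv2d(c_primary, c_out - c_primary, dw_kernel, padding=dw_kernel // 2,
                      groups=c_primary, bias=False),
            nn.BatchNorm2d(c_out - c_primary),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        y = self.primary(x)
        return torch.cat([y, self.cheap(y)], dim=1)


print(GhostModule(64, 128)(torch.randn(1, 64, 40, 40)).shape)  # torch.Size([1, 128, 40, 40])
```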
Figure 9. Structure comparison between different multi-scale feature fusion methods: (a) non-fusion, (b) FPN, (c) PANet, and (d) the proposed FFPN.
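The generic fusion patterns in Figure 9b,c can be summarized in a few lines. The sketch below assumes, for simplicity, that P3–P5 already share the same channel width (real FPN/PANet necks use lateral 1 × 1 convolutions for this); it shows plain FPN top-down fusion and PANet's extra bottom-up path, not the proposed FFPN.

```python
import torch
import torch.nn.functional as F

# Generic fusion patterns of Figure 9; p3/p4/p5 are backbone features at strides 8/16/32.
# For brevity, the three levels are assumed to share one channel width (real necks add
# lateral 1 x 1 convolutions). This is not the proposed FFPN.


def fpn_topdown(p3, p4, p5):
    # (b) FPN: high-level semantics flow top-down into the finer maps.
    p4 = p4 + F.interpolate(p5, scale_factor=2, mode="nearest")
    p3 = p3 + F.interpolate(p4, scale_factor=2, mode="nearest")
    return p3, p4, p5


def panet(p3, p4, p5):
    # (c) PANet: an extra bottom-up path pushes fine localization cues back upward.
    p3, p4, p5 = fpn_topdown(p3, p4, p5)
    p4 = p4 + F.max_pool2d(p3, kernel_size=2)
    p5 = p5 + F.max_pool2d(p4, kernel_size=2)
    return p3, p4, p5


p3, p4, p5 = (torch.randn(1, 64, s, s) for s in (80, 40, 20))
print([t.shape for t in panet(p3, p4, p5)])  # 80x80, 40x40, 20x20 maps, all 64 channels
```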
Figure 10. Comparative results of the average accuracy and latency on the VisDrone validation dataset with YOLOv5-s under three different multi-scale feature fusion methods: (a) non-fusion, (b) FPN, and (c) PANet. The input image is downsampled to 640 × 640, and the resolutions of the P3, P4, and P5 layers are set to 80 × 80, 40 × 40, and 20 × 20, respectively.
Figure 11. Comparison of class response maps for different downsampling modules: (a) original images, (b) conventional feature downsampling, and (c) the proposed efficient feature downsampling.
Figure 12. Visualization samples of different models on the VisDrone dataset.
Figure 13. Visualization samples of different models on the SeaDronesSeeV2 dataset.
Figure 14. Visualization samples of different models on the UAVDT dataset.
Table 2. Ablation experiments with different components of EUAVDet-n on the VisDrone validation set. The best results are shown in bold. Same below.

| EFD | MKAB | FGM | FFPN | Params (M) | FLOPs (G) | AP50 val (%) | mAP val (%) | Latency on Nano (ms) |
|---|---|---|---|---|---|---|---|---|
| – | – | – | – | 3.01 | 8.2 | 31.9 | 18.4 | 51.4 |
| ✓ | – | – | – | 3.01 | 8.5 | 32.4 | 18.7 | 55.7 |
| – | ✓ | – | – | 2.67 | 8.0 | 32.5 | 18.7 | 48.6 |
| … | … | … | … | 2.30 | 7.4 | 31.8 | 18.4 | 48.3 |
| … | … | … | … | 2.03 | 7.8 | 32.4 | 18.7 | 47.7 |
| … | … | … | … | 2.67 | 8.1 | 32.8 | 19.1 | 51.1 |
| … | … | … | … | 1.83 | 7.2 | 32.4 | 18.6 | 47.4 |
| ✓ | ✓ | ✓ | ✓ | 1.34 | 6.9 | 32.9 | 19.2 | 47.2 |
Table 3. Performance comparison of different modules in the YOLOv8-n backbone network: C2f, ELAN, DBB, and MKAB.

| Method | Params (M) | FLOPs (G) | AP50 val (%) | mAP val (%) | AP50 test (%) | mAP test (%) | Latency on Nano (ms) |
|---|---|---|---|---|---|---|---|
| C2f [28] | 3.01 | 8.2 | 31.9 | 18.4 | 26.2 | 14.4 | 51.4 |
| ELAN [26] | 2.71 | 8.6 | 23.7 | 13.3 | 18.1 | 9.8 | 50.6 |
| DBB [21] | 4.45 | 8.1 | 31.8 | 18.4 | 25.7 | 14.1 | 49.8 |
| MKAB | 2.67 | 8.0 | 32.5 | 18.7 | 26.7 | 14.7 | 48.6 |
Table 4. Performance comparison for small, medium, and large objects on both the VisDrone validation and test sets.

| Method | Params (M) | FLOPs (G) | APS val (%) | APM val (%) | APL val (%) | APS test (%) | APM test (%) | APL test (%) |
|---|---|---|---|---|---|---|---|---|
| YOLOv8-n [28] | 3.01 | 8.2 | 9.6 | 28.6 | 38.2 | 5.7 | 22.6 | 35.3 |
| YOLOv8-n + EFD | 3.01 | 8.5 | 10.0 | 29.0 | 38.2 | 6.1 | 23.0 | 33.4 |
| YOLOv8-n + MKAB | 2.67 | 8.0 | 10.1 | 28.8 | 38.6 | 5.9 | 22.9 | 36.2 |
| EUAVDet-n | 1.34 | 6.9 | 10.5 | 29.8 | 37.7 | 6.4 | 23.3 | 37.3 |
| YOLOv8-s [28] | 11.13 | 28.7 | 13.0 | 33.1 | 41.5 | 7.8 | 26.8 | 40.6 |
| YOLOv8-s + EFD | 11.14 | 29.2 | 13.8 | 34.9 | 41.2 | 8.3 | 27.8 | 38.7 |
| YOLOv8-s + MKAB | 9.80 | 28.2 | 13.8 | 34.7 | 42.0 | 8.1 | 27.4 | 39.8 |
| EUAVDet-s | 4.96 | 25.6 | 14.0 | 35.3 | 42.1 | 8.4 | 28.5 | 39.5 |
Table 5. Comparative results on the SeaDronesSeeV2 validation dataset.

| Method | Params (M) | FLOPs (G) | APS val (%) | APM val (%) | APL val (%) | AP50 val (%) | AP75 val (%) | mAP val (%) | FPS (Nano) | FPS (Orin) |
|---|---|---|---|---|---|---|---|---|---|---|
| YOLOv5-n [24] | 1.77 | 4.2 | 26.2 | 42.0 | 55.9 | 70.2 | 36.0 | 38.4 | 24.5 | 92.0 |
| EUAVDet-n (v5) | 1.03 | 4.0 | 34.4 | 45.0 | 55.8 | 74.6 | 39.7 | 40.9 | 24.8 | 94.5 |
| YOLOv7-tiny [26] | 6.03 | 13.1 | 33.9 | 42.8 | 58.0 | 72.9 | 38.9 | 40.2 | 15.4 | 63.2 |
| EUAVDet-tiny (v7) | 1.60 | 7.2 | 35.2 | 45.9 | 55.9 | 76.7 | 39.3 | 41.3 | 19.3 | 73.2 |
| YOLOv8-n [28] | 3.01 | 8.1 | 20.7 | 37.6 | 58.7 | 59.4 | 33.6 | 33.9 | 19.7 | 73.5 |
| EUAVDet-n (v8) | 1.34 | 6.9 | 22.7 | 38.6 | 61.2 | 61.6 | 35.7 | 35.7 | 19.9 | 74.1 |
| YOLOv10-n [27] | 2.28 | 6.7 | 18.7 | 37.1 | 57.9 | 58.9 | 33.7 | 33.6 | 20.8 | 80.5 |
| EUAVDet-n (v10) | 1.21 | 6.2 | 18.3 | 39.8 | 57.2 | 60.8 | 34.3 | 35.0 | 21.7 | 83.1 |
Table 6. Comparative results on the UAVDT validation dataset.

| Method | Params (M) | FLOPs (G) | APS val (%) | APM val (%) | APL val (%) | AP50 val (%) | AP75 val (%) | mAP val (%) | FPS (Nano) | FPS (Orin) |
|---|---|---|---|---|---|---|---|---|---|---|
| YOLOv5-n [24] | 1.77 | 4.2 | 9.0 | 23.7 | 32.0 | 27.5 | 12.4 | 14.0 | 26.5 | 97.5 |
| EUAVDet-n (v5) | 1.03 | 4.0 | 9.4 | 24.2 | 28.4 | 28.8 | 10.4 | 14.2 | 26.8 | 98.6 |
| YOLOv7-tiny [26] | 6.03 | 13.1 | 9.3 | 26.2 | 30.3 | 32.9 | 12.1 | 15.6 | 16.3 | 67.2 |
| EUAVDet-tiny (v7) | 1.60 | 7.2 | 10.2 | 28.0 | 34.5 | 33.6 | 14.9 | 16.8 | 20.5 | 77.5 |
| YOLOv8-n [28] | 3.01 | 8.1 | 9.8 | 24.8 | 30.3 | 26.4 | 16.2 | 15.2 | 20.8 | 79.5 |
| EUAVDet-n (v8) | 1.34 | 6.9 | 10.9 | 28.0 | 27.4 | 29.4 | 18.1 | 17.0 | 21.0 | 82.2 |
| YOLOv10-n [27] | 2.28 | 6.7 | 11.2 | 27.5 | 27.9 | 28.1 | 17.4 | 16.3 | 23.1 | 84.6 |
| EUAVDet-n (v10) | 1.21 | 6.2 | 11.1 | 27.9 | 23.6 | 28.5 | 17.1 | 16.6 | 24.3 | 86.8 |