Article

A Fast Dynamic Dim Target Tracking Approach for UAVs Using Improved Lightweight Siamese Networks

1
School of Automation, Beijing Information Science & Technology University, Beijing 100192, China
2
Beijing Key Laboratory of High Dynamic Navigation Technology, Beijing Information Science & Technology University, Beijing 100192, China
*
Author to whom correspondence should be addressed.
Submission received: 21 October 2022 / Revised: 14 November 2022 / Accepted: 16 November 2022 / Published: 22 November 2022
(This article belongs to the Section Applied Industrial Technologies)

Abstract

The target tracking of unmanned aerial vehicles (UAVs) has attracted significant attention recently with the increasing application of UAVs, yet few studies have made breakthroughs in dynamic dim target detection. How to efficiently and accurately identify dynamic dim targets in complex contexts poses a challenge. To address this issue, we propose an improved lightweight Siamese network (ILSN) with an optimized feature-extraction network design and similarity measurement. For the feature-extraction network, we built a position-wise attention module to obtain the target feature's position information, which enhances the network's ability to extract weak targets while reducing the model parameters, thus keeping the network lightweight. For the similarity-measurement module, the tracking accuracy is improved by deeply mining the localization information of the shallow features and the semantic information of the deep features in the feature networks. To evaluate the performance of the proposed method, we established a simulated experimental environment and a physical experimental platform and then carried out comparison experiments on the attention modules, tracking accuracy, and efficiency. The experimental results show that, compared with the five comparison algorithms, the ILSN has apparent advantages in tracking accuracy; its tracking speed reaches 108 frames per second, which meets the real-time requirements while improving the success rate.

1. Introduction

With the development of microelectronics and artificial intelligence (AI) technology, the mobility, flexibility, and intelligence of unmanned aerial vehicles (UAVs) have been greatly improved, which makes it possible to rapidly track dynamic dim targets [1]. UAVs can be effectively utilized in diverse application scenarios and have been widely used in power inspections, visual monitoring, and security warnings [2]. According to the International Federation of Robotics [3], global investment in this field grew from USD 36 billion in 2010 to USD 2339 billion in 2022, an increase of more than 60-fold.
However, dynamic dim targets, which are common objects in tracking tasks, can suffer from target distortion, spatial occlusion, similar-target interference, and other problems caused by complex background information, making it challenging to identify and track them accurately and efficiently; this remains an open issue in the target tracking field [4].
Figure 1 illustrates the common issues faced by UAVs in dynamic dim target tracking, which mainly include spatial occlusion, similar interference, target distortion, and illumination changes. Specifically:
  • Spatial occlusion: vision sensors are usually installed under the body or head of UAVs. In the process of target tracking, there is often a spatial-occlusion problem due to a UAV’s field-of-view limitation and a complex background environment. Spatial occlusion severely affects target tracking accuracy; continuous occlusion may cause target loss [5].
  • Similar interference: accurate and robust tracking by UAVs is severely challenged when similar targets interfere and the tracking target's features are indistinct or closely resemble those of nearby objects [6].
  • Target distortion: due to factors such as rapid target motion, background interference, and scale changes, target distortion often occurs in the targets recognized by UAVs, which affects feature recognition and judgment. The authors of [7] emphasized the complexity and uncertainty of UAV tracking scenarios, as UAVs usually encounter significant environmental disturbances during flight. Such image distortion weakens the target's characteristic features.
  • Illumination changes: UAV target tracking involves complex illumination changes, which mainly manifest as dynamic changes in texture-feature parameters such as energy, entropy, correlation, contrast, and local uniformity, leading to poor-quality image recognition [8].
Different target tracking methods for UAVs have been analyzed, covering background denoising, target enhancement, optimal control, and intelligent identification. Tian et al. [9] argued that detection measures such as the mean filter, Gaussian filter, median filter, and bilateral filter are inefficient for fast and robust identification. The authors of [10] and [11] studied the target tracking process and showed that Siamese networks deliver excellent target tracking performance and have tremendous application potential.
Although researchers have conducted many studies, research on the dynamic dim target tracking of UAVs is still in its infancy. In addition, the increases in complex operating scenarios and target usage will result in more challenges.
Considering these current issues, we provide a detailed overview of recent works and a comparative analysis of related works. Motivated by the drawbacks of current work, we focus on the fast and robust tracking of targets, propose a dynamic dim target tracking approach for UAVs using an improved lightweight Siamese network (ILSN), and carry out performance tests on a physical experimental platform. In summary, the main contributions of this paper are as follows:
  • We propose an improved lightweight Siamese network (ILSN) with an optimized feature-extraction network design and similarity measurement. The ILSN has clear advantages in tracking accuracy and tracking speed, as demonstrated through simulations and a system test.
  • We optimize the feature extraction network by designing a position-wise attention module and localization information fusion, which in turn improves the accuracy and real-time performance of ILSN target tracking.
  • Finally, we establish a simulation environment and a UAV-based experimental platform. By comparing against algorithms such as SiamRPN, DaSiamRPN, and SiamRPN++, we verify the effectiveness and feasibility of the ILSN.

2. Related Work

The issue of dynamic dim target tracking has attracted widespread concern among scholars, and much research has been conducted. The related research results focus on correlation filtering algorithms and deep learning algorithms.

2.1. Correlation Filtering Algorithms

The correlation filter tracking (CFT) algorithm is an online learning method that adapts to changes in the target by updating the model in time, thus guaranteeing tracking accuracy. Since Bolme et al. [12] first introduced the CFT algorithm to the field of target tracking in 2010 by proposing the minimum output sum of squared error (MOSSE) tracking algorithm, filtering algorithms have developed rapidly.
To overcome the tracking drift that occurs in complex scenes due to changes in target scale, Danelljan et al. [13] proposed the spatially regularized discriminative correlation filtering (SRDCF) algorithm, which introduces a spatial regularization component and uses regularization weights when training the filter's objective function. Moreover, it tracks the target accurately by training two different filters for position and scale, respectively, which better handles the scale-change problem.
Li et al. [14] introduced temporal regularization into the single-sample SRDCF algorithm and proposed the spatially temporal regular correlation filtering (STRCF) algorithm. The algorithm solves the system of equations using the alternating direction method of multipliers (ADMM), which improves the tracking accuracy while also significantly increasing the tracking speed. Considering that the target background changes over time, Galoogahi et al. [15] proposed the background-aware correlation filtering (BACF) algorithm, which dynamically models the foreground and background of the tracking target using HOG features over time to ensure higher accuracy and real-time performance.
To suppress distortions, Huang et al. [16] proposed a correlation filtering algorithm (CFA) that extracts background samples as negative samples for the model training and detection phases and introduces a regularization term limiting the rate of change of the response map, thereby avoiding aberrant variations in the response map and improving robustness. The automatic spatiotemporal regularization tracker (AutoTrack) is another improved target tracking algorithm for UAV scenarios; extensive comparison experiments have shown it to be highly robust to complex and variable UAV scenarios.
The CFA ensures that the UAV tracking model is trained and updated online promptly to extract effective target features, which significantly reduces the computational complexity and improves real-time tracking; it even enables a UAV to achieve real-time tracking on a single CPU. However, correlation-filtering-based algorithms have difficulty dealing effectively with fast scale changes, occlusion of moving objects, low resolution, small targets, and other tracking problems. Moreover, the filter structure is not adaptive, which makes it difficult to predict the target position accurately.

2.2. Deep Learning Algorithms

Recently, with the development of UAV computing power and graphics processing unit (GPU) technology, it has become possible to run deep learning models for target tracking on UAV airborne platforms. At present, UAV target tracking methods based on deep learning can be classified into Siamese neural network-based tracking methods and classification-based convolutional neural network (CNN) tracking methods.
The basic idea behind CNN-based approaches is to divide video frames into background and target regions, thus transforming target tracking into a classification problem; commonly used backbone networks include AlexNet [17], VGGNet [18], and ResNet [19].
The multi-domain network (MDNet) algorithm [20] designs a lightweight CNN structure comprising three convolutional layers and three fully connected layers to perform a binary classification of target and background on candidate samples. The core of MDNet is a multi-domain learning strategy; i.e., for each new image sequence, the last fully connected layer of the MDNet model must be rebuilt. Although the tracking accuracy of MDNet is high, the need to evaluate a large number of repeated candidate samples and the requirement to update the model online make the method computationally complex; it is difficult to achieve real-time processing with today's UAV hardware.
Nam et al. [21] proposed a tree structure to maintain multiple CNN models during tracking, improving the reliability of the model when the target is occluded or disappears. The score of each candidate region is obtained as a weighted average of the classification scores of the multiple CNNs, which yields the tracking result.
With the rapid development of deep learning, Siamese networks are gradually being applied in the target tracking field, mainly for recognizing and tracking the target's position and scale while balancing the trade-off between speed and accuracy; they have gradually become a preferred model for target tracking.
In 2016, Tao et al. [22] proposed the Siamese instance search (SINT) algorithm, which converts the target tracking problem into a similarity-metric problem over feature blocks and trains the similarity measure module using neural networks. Bertinetto et al. [23] proposed SiamFC, which no longer trains each module independently but designs an end-to-end trained network. The authors of [24] imported the region proposal network (RPN) used for target detection into Siamese network tracking, transforming the similarity computation problem into a classification and regression problem for targets. DaSiamRPN [25] improves training by adding positive samples and difficult negative samples to the training set. SiamRPN++ [26] enhances the network's ability to track targets near the image edge by shifting the positions of the targets in the training samples.
Although breakthroughs have been made in the above tracking algorithms, most deep learning tracking algorithms use offline training, which cannot quickly adapt to the situation of complete occlusion or even out of view, so it is crucial to explore efficient online training methods for neural networks. The SiamRPN network target tracking algorithm is limited by the depth of the feature extraction network, resulting in poor feature extraction for tiny targets. Using only depth features leads to significant information loss for tiny targets and finally results in the poor tracking accuracy and robustness of the tracking algorithm when facing visible air-to-ground targets. Table 1 outlines the algorithm comparison.
To improve the tracking effect, incorporating an attention mechanism into the target tracking model has become an important research direction [27]. Yang et al. [28] proposed target-aware deep tracking (TADT), which applies a channel attention mechanism to the target tracking algorithm, keeping the channels that respond strongly to the target and suppressing the remaining responses to highlight the target. Deep attention tracking (DAT) [29] improves the tracking performance of the network by introducing an attention paradigm into the loss function during training.
Although the above research on Siamese network tracking algorithms has achieved promising results, the following problems still exist in tracking tasks involving weak targets:
  • In the presence of background information interference and similar targets, the tracking accuracy decreases. It is prone to target loss and false tracking.
  • Most Siamese network target tracking algorithms use a shallow feature extraction backbone network, failing to take full advantage of the feature extraction capabilities of deep neural networks.
  • Using features from the last layer of the feature extraction network for similarity measures ignores the accurate localization information in the shallow network.
We expect to improve the tracking efficiency and accuracy by optimizing the design of the feature network and combining the similarity metric with the attention mechanism.

3. Materials and Methods

In this section, we describe the Fast Dynamic Dim Target Tracking Approach (FDDTA) in detail and present the algorithmic framework. For the feature extraction module, we propose a channel attention mechanism that fuses position-wise information; the channel information and position information are then used to generate weights that rescale the original features and improve the feature extraction capability. For the similarity measurement module, a region proposal network (RPN) is used to measure the similarity of the extracted features and generate the target's prediction boxes for the current frame; duplicate, redundant prediction boxes are effectively eliminated, and only the box with the highest confidence is retained.
The algorithmic framework shown in Figure 2 is based on the Siamese network and consists of a position information attention module, a similarity measurement module, a feature extraction network module, and a prediction frame generation module.
Firstly, the target data and the search area were introduced into the feature extraction network to obtain the three layers of features. The features were up-dimensioned to a unified dimension; then, the three layers of features were imported in pairs into the similarity measurement module to obtain classification scores and regression scores for the three sets of features at different depths. To obtain the predicted frames for the current frame, we finally weighted and fused the classification and regression scores to calculate the total classification and regression response scores.
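The final step described above, eliminating redundant prediction boxes and keeping only the highest-confidence one, is essentially non-maximum suppression. A minimal NumPy sketch of that step, assuming boxes in [x1, y1, x2, y2] format and an IoU threshold chosen purely for illustration:

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """Keep the highest-confidence box among heavily overlapping predictions.

    boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,) confidence values.
    Returns the indices of the boxes that survive suppression.
    """
    order = scores.argsort()[::-1]  # highest confidence first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        # intersection of the kept box with the remaining candidates
        x1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        y1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        x2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        y2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0.0, x2 - x1) * np.maximum(0.0, y2 - y1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_rest = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                    (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + area_rest - inter + 1e-8)
        # drop candidates that overlap the kept box too strongly
        order = order[1:][iou <= iou_threshold]
    return keep
```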

3.1. SiamRPN

A SiamRPN network is formed by combining two artificial neural network branches that share the same structure and weights. The two images are fed into the two separate branches and pass through a series of network layers to extract abstract features, which are then imported into a similarity measure function to obtain a similarity score that measures the difference between the two images.
The SiamRPN used here adopts a feature extraction network consisting of convolutional layers, nonlinear activation layers, and pooling layers as the body of each branch, and a large number of samples are used to train this network offline to ensure its feature extraction capability. The manually selected target image $x$ in the first frame and the search area image $z$ in the subsequent frames are imported into the feature extraction network to obtain the target feature map $\varphi_\theta(x)$ with resolution $H_x \times W_x \times m$ and the search area feature map $\varphi_\theta(z)$ with resolution $H_z \times W_z \times m$, respectively, where $H$ and $W$ are the height and width of the feature map and $m$ is the number of channels. A similarity measure function is then designed and learned:

$$f_\theta(z, x) = \varphi_\theta(z) \ast \varphi_\theta(x) + b \cdot \mathbb{1}$$

The response map $f_\theta(z, x)$ represents the similarity between the target features and the search-region features; the similarity between the target template image of the first frame and the search region of the subsequent frames is judged from this response map, and the position with the highest response score is taken as the new target position. Here, $\theta$ denotes the network parameters, $\ast$ denotes the convolution (cross-correlation) operation, and $b \cdot \mathbb{1}$ represents the bias term.
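As an illustration, the cross-correlation in the equation above can be sketched in PyTorch by using the template features as a convolution kernel over the search-region features; the tensor sizes below are placeholders rather than the network's actual dimensions:

```python
import torch
import torch.nn.functional as F

def siamese_response(search_feat, template_feat, bias=0.0):
    """phi(z) * phi(x) + b: correlate template features with search-region features.

    search_feat:   (1, m, Hz, Wz) features of the current search region, phi(z).
    template_feat: (1, m, Hx, Wx) features of the first-frame target, phi(x).
    """
    # Treating the template as a convolution kernel realizes the cross-correlation.
    response = F.conv2d(search_feat, template_feat)  # (1, 1, Hz-Hx+1, Wz-Wx+1)
    return response + bias

# Random tensors stand in for the backbone output.
phi_x = torch.randn(1, 256, 6, 6)      # target (exemplar) features
phi_z = torch.randn(1, 256, 22, 22)    # search-region features
score_map = siamese_response(phi_z, phi_x)
peak_index = score_map.flatten().argmax()  # location with the highest similarity
```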

3.2. Feature Extraction Network

The SiamRPN target tracking algorithm uses AlexNet [19] as its feature extraction network, so the network depth is restricted and sufficiently deep features cannot be extracted; we therefore chose MobileNetV2 as the feature extraction backbone. Compared with feature extraction networks such as AlexNet, VGG [20], and ResNet [21], MobileNetV2 has fewer parameters and a faster operation speed.
MobileNetV2 is designed for target detection and ends with a pooling layer and a 1 × 1 convolutional layer that raises the number of channels for classification; these layers are not needed in tracking tasks. Moreover, the overall step size of the network is 32, so the resolution of the final feature map is too low for the SiamRPN similarity measure. To extract features from multiple layers and fuse them for the subsequent similarity measure, the network structure also needs to be adjusted so that the layers from which features are extracted output the same feature resolution.
For this, the following modifications have been made to the MobileNetV2 network.
(1) Remove the final pooling layer and the 1 × 1 convolution layer.
(2) Adjust the step size of the last four layers of the network to 1 so that the total step size of the network becomes 8.
(3) In the remaining inverted residual structures with linear bottlenecks whose stride was removed (the first such layer and all layers in layers 6 and 7), dilated convolution [23] is used to extend the receptive field.
The structure of the modified feature extraction network is shown in Table 2.
The input feature map is first expanded to six times its original channel dimension so that the convolutional layer can operate on information in a higher-dimensional space. Depthwise separable convolution is then used to extract feature information; finally, the new feature map is obtained by dimensionality reduction, and a linear bottleneck is used instead of a nonlinear activation function to reduce the number of parameters of the whole model and improve the computational speed while retaining most of the nonlinear information.
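A minimal PyTorch sketch of one such inverted residual block with a linear bottleneck, written to mirror the description above (the sixfold expansion follows the text; the 3 × 3 depthwise kernel, the optional dilation, and all other details are illustrative assumptions):

```python
import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    """Inverted residual block: 1x1 expansion (x6), depthwise conv, linear 1x1 projection."""

    def __init__(self, in_ch, out_ch, stride=1, expand=6, dilation=1):
        super().__init__()
        hidden = in_ch * expand
        self.use_residual = stride == 1 and in_ch == out_ch
        self.block = nn.Sequential(
            # 1x1 pointwise conv lifts the features into a higher-dimensional space
            nn.Conv2d(in_ch, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            # depthwise 3x3 conv extracts spatial information channel by channel;
            # dilation > 1 enlarges the receptive field without further striding
            nn.Conv2d(hidden, hidden, 3, stride=stride, padding=dilation,
                      dilation=dilation, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            # linear bottleneck: 1x1 projection with no activation afterwards
            nn.Conv2d(hidden, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_residual else out

# Example: a stride-1, dilation-2 block sized like the later layers in Table 2.
block = InvertedResidual(in_ch=134, out_ch=224, stride=1, dilation=2)
y = block(torch.randn(1, 134, 15, 15))   # -> (1, 224, 15, 15)
```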

3.3. Position-Wise Attention Module

Attention modules commonly used for target tracking tasks include SENet [15], CBAM [16], ECA-Net [17], and NAM [18], which are mainly divided into channel attention and spatial attention. Channel attention focuses on the relationships across the channels of a feature map, and spatial attention focuses on the relationships among the pixels within each channel. The position information attention module proposed in this paper builds on SENet-style channel attention, adapting the position information extraction and the fusion-weight generation to improve the network's ability to characterize the target.
The process of the position information attention module is shown in Figure 3.
To obtain global spatial information, channel attention usually uses global pooling to compress the global information into a channel descriptor. It is difficult to describe the location information with a single channel description. Therefore, in the part where the location information is extracted, a one-dimensional global pooling of the input feature maps is performed along the vertical and horizontal directions, respectively, which in turn leads to a vector of channel descriptions in both directions.
$$z_c^h(h) = \frac{1}{W}\sum_{i=1}^{W} x_c(h, i), \qquad z_c^w(w) = \frac{1}{H}\sum_{j=1}^{H} x_c(j, w)$$

$z^h$ is the channel description vector obtained by average pooling along the horizontal X direction, with resolution $H \times 1 \times C$; $z^w$ is the channel description vector obtained by average pooling along the vertical Y direction, with resolution $1 \times W \times C$.
Then, the fusion weights are generated. To facilitate the calculation of the weights, the vertical Y-direction vector $z^w$ is rotated and spliced with the horizontal X-direction vector $z^h$, and the result then passes through a convolution layer with a $1 \times 1$ kernel, a normalization layer, and a ReLU6 nonlinear activation layer to obtain the intermediate feature $f$:

$$f = \delta\big(F_1\big([z^h, z^w]\big)\big)$$

where $[\cdot, \cdot]$ is the rotation-and-splicing transform, $F_1$ denotes the convolution and normalization, and $\delta$ is the ReLU6 nonlinear activation function.
The intermediate feature $f$ is split along its middle into two independent features for the horizontal and vertical directions, which then pass through two convolutional layers with $1 \times 1$ kernels and a nonlinear activation layer to obtain the weights in the two directions:

$$g^h = \sigma\big(F_h(f^h)\big), \qquad g^w = \sigma\big(F_w(f^w)\big)$$
Eventually, the weights are expanded along their respective directions into three-dimensional arrays with the same dimensionality as the input features and multiplied element-wise with the input features to obtain the output feature map, completing the rescaling of the features:

$$\hat{X}(i, j) = X(i, j) \times g^h(i) \times g^w(j)$$

where $\hat{X}(i, j)$ is the output feature map and $X(i, j)$ is the input feature map.
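A minimal PyTorch sketch of the position-wise attention module following the equations above; the channel-reduction ratio and layer names are assumptions, while the directional pooling, splicing, 1 × 1 convolution, splitting, and rescaling steps mirror the text:

```python
import torch
import torch.nn as nn

class PositionWiseAttention(nn.Module):
    """Directional pooling -> shared 1x1 conv -> per-direction weights -> rescaling."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        mid = max(8, channels // reduction)
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(mid)
        self.act = nn.ReLU6(inplace=True)
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)

    def forward(self, x):
        n, c, h, w = x.shape
        # one-dimensional global pooling along each spatial direction
        z_h = x.mean(dim=3, keepdim=True)                       # (n, c, h, 1)
        z_w = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)   # (n, c, w, 1), rotated
        # splice the two descriptors, then shared conv + normalization + ReLU6
        f = self.act(self.bn(self.conv1(torch.cat([z_h, z_w], dim=2))))
        f_h, f_w = torch.split(f, [h, w], dim=2)
        # per-direction weights in (0, 1)
        g_h = torch.sigmoid(self.conv_h(f_h))                       # (n, c, h, 1)
        g_w = torch.sigmoid(self.conv_w(f_w.permute(0, 1, 3, 2)))   # (n, c, 1, w)
        # rescale: X_hat(i, j) = X(i, j) * g_h(i) * g_w(j)
        return x * g_h * g_w

# Example: rescaling a layer5-sized feature map (sizes follow Table 2).
attn = PositionWiseAttention(channels=134)
y = attn(torch.randn(1, 134, 15, 15))  # same shape as the input
```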

3.4. Similarity Measurement Module

Concerning the features extracted by the deep feature extraction network, the shallower-level features have accurate localization information and the deeper-level features have rich semantic information. We use the multilayer feature fusion method to bring out the feature extraction ability of the deep feature extraction network, which in turn improves the tracking ability of the ILSN [9].
To achieve multilayer feature fusion, the features output by the layer3, layer5, and layer7 network layers in the optimized feature extraction network are fused. Owing to the modification of the network structure, the resolution of the three layers of features is consistent while the number of channels differs. To facilitate import into the similarity measurement module, the number of feature channels in the three layers was adjusted from 44, 134, and 448 to 256 using a convolutional layer with a 1 × 1 kernel. The features were then normalized to facilitate subsequent feature fusion.
The adjusted template features and search area features were imported into the similarity measure module, respectively, and the response maps S i and B i were calculated for the corresponding classification branches and regression branches. Due to the modification of the network structure, the resolution and number of channels of S i and B i calculated by the three layers of features are consistent. The overall response maps S a l l and B a l l were derived directly using linear weighting.
$$S_{\mathrm{all}} = \sum_{i=1}^{n} \alpha_i S_i, \qquad B_{\mathrm{all}} = \sum_{i=1}^{n} \beta_i B_i$$
Finally, the response map scores of the overall classification branch and the regression branch were used to generate the prediction box for the current frame.
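A minimal sketch of the channel adjustment and linear weighting described above; the `rpn_head` callable stands in for the similarity-measurement (classification/regression) head and is an assumption, while the 44/134/448 to 256 channel adjustment and the weighted sums follow the text:

```python
import torch
import torch.nn as nn

class MultiLayerFusion(nn.Module):
    """Unify layer3/5/7 channel counts, then linearly weight the per-layer response maps."""

    def __init__(self, in_channels=(44, 134, 448), out_channels=256):
        super().__init__()
        # 1x1 convolutions adjust 44/134/448 channels to a common 256 before the head
        self.adjust = nn.ModuleList([
            nn.Sequential(nn.Conv2d(c, out_channels, 1, bias=False),
                          nn.BatchNorm2d(out_channels))
            for c in in_channels
        ])
        # learnable fusion weights alpha_i (classification) and beta_i (regression)
        self.alpha = nn.Parameter(torch.ones(len(in_channels)) / len(in_channels))
        self.beta = nn.Parameter(torch.ones(len(in_channels)) / len(in_channels))

    def forward(self, feats, rpn_head):
        """feats: list of (template_feat_i, search_feat_i) pairs from layer3/5/7;
        rpn_head(template, search) is assumed to return (cls_map_i, reg_map_i)."""
        cls_maps, reg_maps = [], []
        for i, (tmpl, srch) in enumerate(feats):
            s_i, b_i = rpn_head(self.adjust[i](tmpl), self.adjust[i](srch))
            cls_maps.append(s_i)
            reg_maps.append(b_i)
        # S_all = sum_i alpha_i * S_i ;  B_all = sum_i beta_i * B_i
        s_all = sum(a * s for a, s in zip(self.alpha, cls_maps))
        b_all = sum(b * r for b, r in zip(self.beta, reg_maps))
        return s_all, b_all
```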

4. Simulation and Experiment

In this section, we describe our algorithm simulation experiments, which verify the efficiency of the ILSN using KCF, MCNN, SiamRPN, DaSiamRPN, and SiamRPN++ as comparison algorithms. Furthermore, to test the feasibility of the ILSN, an experimental platform of the UAV system for dim target tracking was established, and the composition of the software and hardware system is described in detail. On this basis, the physical test of the algorithm's performance was completed.

4.1. Simulation

The main performance parameters of the simulation server are as follows: the operating system is Linux 18.04, the programming language is Python 3.7, and the PyTorch 1.9.1 framework was used. The CPU is an i7-10700K, the memory is 32 GB, and the GPU is an NVIDIA GeForce RTX 3080 with 10 GB of video memory.
ILSVRC2015_VID was selected as the training set, and the UAV123 [25] and VOT2018 [26] datasets were used as the test evaluation sets. We set up comparison experiments, including ablation experiments, and introduced evaluation metrics such as accuracy, tracking precision, and tracking speed. A quantitative analysis of the algorithm performance was carried out based on the experimental data.

4.1.1. Dataset

ILSVRC2015_VID has a total of 30 basic categories, such as animals, vehicles, boats, etc. We selected 3000 snippets in ILSVRC2015_VID as the training set. The UAV123 contains 123 high-resolution video sequences captured by UAVs. The VOT2018 dataset contains 60 video sequences, which track targets of different sizes and have difficulties such as large-scale changes, occluded targets, and fast target movement.
The partial data of the training set and test set are shown in Figure 4.

4.1.2. Training

The feature extraction backbone network was first pre-trained on the ImageNet dataset to obtain good initial weights. The overall network was then trained offline using the ILSVRC2015_VID [24] dataset, which contains 3862 video sequences. Training used the stochastic gradient descent (SGD) method: two images were randomly selected from a video sequence in the dataset and fed into the ILSN as the target image and the search region, respectively. The outputs of the classification and regression branches were used to calculate their losses separately, and the two losses were fused linearly into the overall loss for backpropagation; the loss weight of the classification branch is 1.0 and that of the regression branch is 1.2. A warm-up learning-rate schedule was used: the learning rate was warmed up for 5 rounds while decreasing from $10^{-2}$ to $10^{-6}$ overall, and within each round it started at 0.05 times the overall learning rate and gradually grew to the overall learning rate. Training lasted 20 rounds in total; the first 10 rounds trained only the multilayer-fusion similarity measure structure, and the last 10 rounds trained the overall network. The model was trained for 150,000 iterations overall, with 80 image pairs per iteration.
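A sketch of the offline training loop under these settings; the model, data loader, and loss functions are placeholders rather than the authors' implementation, and only the epoch counts, loss weights, and warm-up/decay shape follow the text:

```python
import torch

def train_ilsn(model, train_loader, cls_loss_fn, reg_loss_fn,
               total_epochs=20, warmup_epochs=5, lr_start=1e-2, lr_end=1e-6):
    """Offline SGD training; model(target_img, search_img) -> (cls_out, reg_out) is assumed."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr_start, momentum=0.9)

    def lr_at(epoch):
        # log-spaced decay from lr_start to lr_end over all epochs ...
        base = lr_start * (lr_end / lr_start) ** (epoch / (total_epochs - 1))
        # ... with a ramp from 0.05x of the epoch's rate during the warm-up epochs
        return base * (0.05 + 0.95 * epoch / warmup_epochs) if epoch < warmup_epochs else base

    for epoch in range(total_epochs):
        for group in optimizer.param_groups:
            group["lr"] = lr_at(epoch)
        for target_img, search_img, cls_label, reg_label in train_loader:
            cls_out, reg_out = model(target_img, search_img)
            # linear fusion: classification weight 1.0, regression weight 1.2
            loss = 1.0 * cls_loss_fn(cls_out, cls_label) + 1.2 * reg_loss_fn(reg_out, reg_label)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```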
Then, four commonly used evaluation indicators were used to quantify the algorithm performance:
(1) Accuracy
Accuracy refers to the ratio of the overlap area between the predicted frame and the actual target frame to the total area of the predicted frame and the actual target frame in the video sequence, which is calculated as follows:
$$S = \frac{B_c}{B_A + B_P - B_c}$$

where $S$ is the accuracy rate, $B_A$ is the area of the actual target frame, $B_P$ is the area of the prediction frame, and $B_c$ is the area of the overlap region of $B_A$ and $B_P$.
(2) Tracking precision
Tracking precision refers to the Euclidean distance between the center of the predicted frame pixel and the center of the actual target frame pixel in the video sequence, which is calculated as follows:
$$P = \sqrt{(c_{Ax} - c_{Px})^2 + (c_{Ay} - c_{Py})^2}$$

where $P$ is the tracking precision, and $c_{Ax}$, $c_{Ay}$, $c_{Px}$, $c_{Py}$ denote the horizontal and vertical coordinates of the centers of the actual target frame and the prediction frame, respectively.
(3) Tracking speed
Tracking speed refers to the number of video sequence frames per second that the tracking algorithm completes, in frames per second, and is used to measure the algorithm’s computing speed. As the playback speed of the video sequence frames is controlled by the UAV in the real-time experiments, only this metric is counted and analyzed in the non-real-time experiments.
(4) Expected average overlap (EAO)
All video sequences are sorted by length so that the algorithm can be tested on sequences of length $N_s$. The accuracy $\phi_i$ of each frame is obtained, and the accuracy of the whole sequence is obtained by averaging these per-frame values, yielding the EAO value on a video sequence of length $N_s$:

$$\bar{\phi}_{N_s} = \frac{1}{N_s}\sum_{i=1}^{N_s} \phi_i$$
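The first two metrics can be computed directly from the predicted and ground-truth boxes; a minimal sketch, assuming boxes in [x1, y1, x2, y2] format:

```python
import numpy as np

def overlap_accuracy(box_a, box_p):
    """S = area(B_A ∩ B_P) / area(B_A ∪ B_P) for two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(box_a[0], box_p[0]), max(box_a[1], box_p[1])
    ix2, iy2 = min(box_a[2], box_p[2]), min(box_a[3], box_p[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_p = (box_p[2] - box_p[0]) * (box_p[3] - box_p[1])
    return inter / (area_a + area_p - inter)

def center_error(box_a, box_p):
    """Euclidean distance in pixels between the two box centers (tracking precision)."""
    ca = ((box_a[0] + box_a[2]) / 2.0, (box_a[1] + box_a[3]) / 2.0)
    cp = ((box_p[0] + box_p[2]) / 2.0, (box_p[1] + box_p[3]) / 2.0)
    return float(np.hypot(ca[0] - cp[0], ca[1] - cp[1]))

def sequence_accuracy(per_frame_acc):
    """Average per-frame accuracy of a length-N_s sequence, as used for the EAO value."""
    return float(np.mean(per_frame_acc))
```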

4.1.3. Testing

(1) Ablation experiments
To validate the improvement that the position-wise attention module brings to the feature extraction network's ability to characterize target features, we conducted an ablation experiment on the attention modules together with a comparison experiment among attention modules. The results of the image classification task on the ImageNet dataset were used for evaluation, and four commonly used attention models, SENet, CBAM, ECA-Net, and NAM, were selected for comparison with the position-wise attention proposed in this paper. The experimental results are shown in Table 3.
The comparison criteria are the number of parameters, the Top1 accuracy, and the Top5 accuracy. The number of parameters indicates the size of the network; the Top1 accuracy indicates the percentage of samples for which the highest-scoring classification result is correct, and the Top5 accuracy indicates the percentage for which the correct class is among the five highest-scoring results.
Experimental results show that the ILSN model with the introduction of the position-wise attention module exhibits the best performance among the six networks in terms of Top1 accuracy and Top5 accuracy, reaching 88.79% and 86.44%, with 9.31% and 7.64% improvements compared to the base MobileNetV2 network. The ILSN also improved the Top1 accuracy and Top5 accuracy by 6.23% and 5.63% compared to the ECA-Net, which was the second-best performer, indicating the practical improvement of the feature extraction network’s ability to characterize target features by position-wise attention.
(2) Quantitative evaluation
The UAV123 dataset contains 12 common challenges encountered in UAV applications, such as rapid target movement, target occlusion, low screen resolution, target size variation, and illumination changes, covering most of the scenarios encountered during UAV flight; it therefore allows a comprehensive and unbiased assessment of tracking algorithms in the UAV environment. To quantitatively compare algorithm performance, five mainstream target tracking algorithms, namely KCF, MCNN, SiamRPN, DaSiamRPN, and SiamRPN++, were selected for comparison with the proposed ILSN.
The performance of the algorithms on the UAV123 dataset is shown in Figure 5. The tracking success rate of the algorithm in this paper is 87.8%, and its tracking precision is 86.3% at a localization error threshold of 20 pixels, both higher than those of the other algorithms.
For further verification of the tracking capability of the algorithm in complex scenes, tracking experiments in four complex scenes provided by the UAV123 dataset are described in this paper. The complex environments include background interference, partial occlusion, out-of-view targets, and similar-object interference. The experimental results are shown in Figure 6. In the complete-occlusion scene, the ILSN ties for first place with the DaSiamRPN algorithm, while in the other three scenes it leads the remaining algorithms by a clear margin. This indicates that the algorithm retains good tracking accuracy in complex scenes.
Accuracy, robustness, and EAO were used as evaluation metrics. Accuracy measures the size of the overlap area between the predicted frame and the real frame. Robustness is related to the number of times the algorithm fails to track the target. EAO represents the evaluation of the predicted frame during the whole tracking process.
Experiments were also carried out to test the tracking speed of the algorithm. The average frame rate (AFR) is introduced as a fourth tracking metric to indicate the number of images processed by the algorithm per unit of time. The experimental results of the six algorithms on the VOT2018 dataset are shown in Table 4.
The algorithm in this paper is leading by quite a margin compared to other algorithms in terms of accuracy, robustness, and EAO indexes. Although the frame rate is slightly lower compared with the base algorithm SiamRPN, the speed of 112 fps still meets the real-time requirements.

4.2. Experiment

To test the feasibility of the ILSN, a UAV experimental system was built based on the DJI Mini 3 Pro, which provides intelligent obstacle avoidance and high-definition image transmission. On this basis, we added an IMU, a gimbal, and other components to obtain more stable image data. As illustrated in Figure 7, the system consists of a perception system, a flight control system, and a power system.
The perception system consists of a camera, radar, IMU and gimbal, etc.; the flight control system includes a server, flight body, monitor, etc.; the power system includes a drive unit, flight battery, and DJI Ronin 4D. Table 5 illustrates the hardware and software devices.
The stable flight of the UAV system is guaranteed by the gyroscope and the gimbal.
We chose eight data acquisition scenarios differing in time, location, and altitude, covering scenes such as cities, mountains, streets, and lakes. In each scenario, we first set up the server's data acquisition environment and established the communication channels between the monitor and the UAV. The UAV system dynamically captures low-altitude images with an ultra-high-definition binocular vision camera controlled from the monitor, which also allows us to adjust the speed and height of the drone dynamically to increase the variety of collected samples. The collected data were transmitted to the server and monitored through the image transmission module. The server stores and processes the data from the obtained images, completes dynamic target recognition, and transmits the results to the flight control system to complete the target tracking.
Then, we obtained 1000 sets of image data for each of the 8 scenarios through post-processing. An example of the collected experimental data is shown in Figure 8, which shows the target recognition and tracking images of 8 scenes.
The average accuracy and tracking precision of the proposed ILSN algorithm for each group of video sequences are shown in Figure 9. It can be seen that the proposed algorithm maintains an extremely high tracking accuracy and precision under eight video sequences. In scenario 4, the algorithm has the highest accuracy and precision. The average accuracy is over 85% and the average precision is 1.5 pixels.
From Table 6, it can be seen that the ILSN successfully completes the tracking of the 8 sets of video sequences, with an average tracking accuracy of 86.5% and an average tracking precision of 1.8 pixels. In video sequence a, the tracking accuracy and precision decreased slightly because of the difficulty posed by large lighting changes in front of and behind the target vehicle. In video sequence f, there is no severe degradation in tracking accuracy or tracking precision, even in the special case where the target is obscured by trees.
The actual tracking behavior of the ILSN on sequence f of Figure 9 is shown in Figure 10. The recognition accuracy of the proposed method is relatively stable within the 20-110 fps frame-rate range but jumps considerably when the frame rate is too high or too low, which indicates that the ILSN performs best at frame rates not exceeding 110 fps; the precision, in contrast, remains relatively stable. Over the entire tracking process, the ILSN achieves an average tracking accuracy of 86% and an average tracking center error of 1.9 pixels.
By establishing a physical experimental platform, we verified the accuracy and precision of the proposed ILSN target tracking algorithm for UAV tracking of dim targets and demonstrated its feasibility. The algorithm can effectively track dim targets under tiny target sizes, complex background interference, target occlusion, and difficult camera movement. Meanwhile, the ILSN achieves a tracking speed of 108 frames per second, a slight decrease relative to the simulated system, but one that still meets the basic real-time requirements.

5. Conclusions

To address the problem of efficiently and accurately identifying dynamic dim targets in complex environments, we propose the ILSN with an optimized feature extraction network design and similarity measurement. We establish a position-wise attention module to obtain target feature position information, which enhances the network's ability to extract weak targets while reducing model parameters, thereby achieving a lightweight network and completing the feature network optimization. A similarity measurement module with multi-layer feature fusion is used to improve the localization accuracy and tracking accuracy of the model. To evaluate the performance of the proposed method, we set up a simulation environment and a physical experimental platform to conduct comparative experiments on the attention module, tracking accuracy, and precision. The simulation results show that the ILSN has at least a 6.23% advantage over the KCF, MCNN, SiamRPN, DaSiamRPN, and SiamRPN++ algorithms in terms of detection accuracy for target tracking in complex environments, while maintaining a processing speed of 112 fps, which verifies the effectiveness of the proposed algorithm. The physical experiment verifies the feasibility of the proposed algorithm in eight scenarios, where its performance is degraded by interference, limited computing power, and other constraints but still meets the real-time requirements.
However, the targets tracked by the ILSN are still relatively large, and tiny targets have not been sufficiently considered or analyzed. Next, we will focus on further optimizing the performance of the algorithm and porting it to edge devices to achieve autonomous and controllable target tracking for UAVs.

Author Contributions

Conceptualization, L.L. and X.Z.; methodology, L.L.; software, H.Z.; validation, L.L. and X.Z.; formal analysis, L.L.; investigation, L.L.; resources, L.L.; data curation, L.L.; writing—original draft preparation, L.L. and X.Z.; writing—review and editing, L.L. and X.Z.; visualization, H.Z.; supervision, H.Z.; project administration, L.L.; funding acquisition, L.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Beijing Natural Science Foundation Grant No. 4214071 and Beijing Information Science and Technology University Foundation Grant No. 2022XJJ10.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Deng, X.; Li, J.; Guan, P. Energy-efficient UAV-aided target tracking systems based on edge computing. IEEE Internet Things J. 2021, 9, 2207–2214. [Google Scholar] [CrossRef]
  2. Hentati, A.I.; Fourati, L.C. Mobile target tracking mechanisms using unmanned aerial vehicle: Investigations and future directions. IEEE Syst. J. 2020, 14, 2969–2979. [Google Scholar] [CrossRef]
  3. Muslimov, T.; Munasypov, R. Fuzzy model reference adaptive control of consensus-based helical UAV formations. In Proceedings of the 2022 8th International Conference on Automation, Robotics and Applications (ICARA), Prague, Czech Republic, 18–20 February 2022; pp. 196–201. [Google Scholar]
  4. Xia, Z. Multi-agent reinforcement learning aided intelligent UAV swarm for target tracking. IEEE Trans. Veh. Technol. 2022, 71, 931–945. [Google Scholar] [CrossRef]
  5. Zhang, X.; Zhang, Y.; Liu, P.; Zhao, S. Robust localization of occluded targets in aerial manipulation via range-only mapping. IEEE Robot. Autom. Lett. 2022, 7, 2921–2928. [Google Scholar] [CrossRef]
  6. Baldi, S.; Sun, D.; Zhou, G. Adaptation to unknown leader velocity in vector-field UAV formation. IEEE Trans. Aerosp. Electron. Syst. 2021, 58, 473–484. [Google Scholar] [CrossRef]
  7. Fang, H.; Xia, M.; Zhou, G.; Chang, Y.; Yan, L. Infrared small UAV target detection based on residual image prediction via global and local dilated residual networks. IEEE Geosci. Remote Sens. Lett. 2021, 19, 1–5. [Google Scholar] [CrossRef]
  8. Li, B.; Fu, C.; Ding, F.; Ye, J.; Lin, F. ADTrack: Target-aware dual filter learning for real-time anti-dark UAV tracking. In Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA), Xi’an, China, 30 May–5 June 2021. [Google Scholar]
  9. Tian, X.; Liu, J.; Mallick, M.; Huang, K. Simultaneous detection and tracking of moving-target shadows in ViSAR imagery. IEEE Trans. Geosci. Remote Sens. 2021, 59, 1182–1199. [Google Scholar] [CrossRef]
  10. Moon, J.; Papaioannou, S.; Laoudias, C.; Kolios, P.; Kim, S. Deep reinforcement learning multi-UAV trajectory control for target tracking. IEEE Internet Things J. 2021, 8, 15441–15455. [Google Scholar] [CrossRef]
  11. Luo, Y.; Song, J.; Zhao, K.; Liu, Y. UAV-cooperative penetration dynamic-tracking interceptor method based on DDPG. Appl. Sci. 2022, 12, 1618. [Google Scholar] [CrossRef]
  12. Bolme, D.S.; Beveridge, J.R.; Draper, B.A. Visual object tracking using adaptive correlation filters. In Proceedings of the 2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), San Francisco, CA, USA, 13–18 June 2010. [Google Scholar]
  13. Danelljan, M.; Hager, G.; Khan, F.S. Learning spatially regularized correlation filters for visual tracking. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, Santiago, Chile, 7–13 December 2015. [Google Scholar]
  14. Li, F.; Tian, C.; Zuo, W.M. Learning spatial-temporal regularized correlation filters for visual tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
  15. Galoogahi, H.K.; Fagg, A.; Lucey, S. Learning background-aware correlation filters for visual tracking. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017. [Google Scholar]
  16. Huang, Z.; Fu, C.H.; Li, Y. BiCF: Learning aberrance repressed correlation filters for real-time UAV tracking. In Proceedings of the 2019 IEEE International Conference on Computer Vision, Seoul, Republic of Korea, 27 October 2019. [Google Scholar]
  17. Khare, S.K.; Bajaj, V. Time–frequency representation and convolutional neural network-based emotion recognition. IEEE Trans. Neural Netw. Learn. Syst. 2021, 32, 2901–2909. [Google Scholar] [CrossRef] [PubMed]
  18. Tharawatcharasart, K.; Pora, W. Effect of spatial dropout on mosquito classification using VGGNet. In Proceedings of the 2022 19th International Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology (ECTI-CON), Prachuap Khiri Khan, Thailand, 24–27 May 2022. [Google Scholar]
  19. Ting, W.C.; Rui, Z.D.; Cha, Z.; Diana, M. Towards efficient model compression via learned global ranking. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
  20. Hu, R.; Wang, T.; Zhou, Y.; Snoussi, H.; Cherouat, A. FT-MDnet: A deep-frozen transfer learning framework for person search. IEEE Trans. Inf. Forensics Secur. 2021, 16, 4721–4732. [Google Scholar] [CrossRef]
  21. Nam, H.; Han, B. Learning multi-domain convolutional neural networks for visual tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Beijing, China, 13–15 October 2018. [Google Scholar]
  22. Tao, R.; Gavves, E.; Smeulders, A. Siamese instance search for tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  23. Cen, M.; Jung, C. Fully convolutional siamese fusion networks for object tracking. In Proceedings of the 25th IEEE International Conference on Image Processing (ICIP), Athens, Greece, 7–10 October 2018. [Google Scholar]
  24. Li, B.; Yan, J.; Wu, W. High performance visual tracking with siamese region proposal network. In Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
  25. Zhu, Z.; Wang, Q.; Li, B. Distractor-aware siamese networks for visual object tracking. In Proceedings of the European Conference on Computer Vision (ECCV), Berlin, Germany, 8 September 2019. [Google Scholar]
  26. Li, B.; Wu, W.; Wang, Q.; Zhang, F.; Xing, J.; Yan, J. SiamRPN++: Evolution of siamese visual tracking with very deep net-works. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16 June 2019. [Google Scholar]
  27. Hu, J.; Yan, P.; Su, Y.; Wu, D.; Zhou, H. A method for classification of surface defect on metal workpieces based on twin attention mechanism generative adversarial network. IEEE Sens. J. 2021, 21, 13430–13441. [Google Scholar] [CrossRef]
  28. Luo, Z.; Li, J.; Zhu, Y. A deep feature fusion network based on multiple attention mechanisms for joint iris-periocular biometric recognition. IEEE Signal Process. Lett. 2021, 28, 1060–1064. [Google Scholar] [CrossRef]
  29. Rodriguez, P.; Velazquez, D.; Cucurull, G.; Gonfaus, J.; Gonzalez, J. Pay attention to the activations: A modular attention mechanism for fine-grained image recognition. IEEE Trans. Multimed. 2020, 22, 502–514. [Google Scholar] [CrossRef] [Green Version]
Figure 1. The issues for unmanned aerial vehicles (UAVs).
Figure 2. Fast dynamic dim target tracking approach.
Figure 3. Position-wise attention mechanism.
Figure 4. Sample images of the experiment dataset. (a) car; (b) Grape leaf blight; (c) Powdery mildew; (d) car; (e) road; (f) girl; (g) Bacterial leaf blight; (h) Brown spot; (i) Leaf smut.
Figure 5. The success rate and tracking precision of each algorithm on the UAV dataset. (a) Success rate; (b) Precision.
Figure 6. Accuracy comparison of the UAV123 dataset under 4 different scenarios. (a) Background clutters; (b) Partial occlusion; (c) Background clutters; (d) Similar object.
Figure 7. UAV system.
Figure 8. Experimental data. (a) car in night; (b) house in mountain; (c) car in road (high FOV); (d) car in road (low FOV); (e) target with occlusion; (f) target (no occlusion); (g) house in lake (close distance); (h) house in lake (remote distance).
Figure 9. Algorithm performance in 8 scenarios.
Figure 10. The tracking accuracy and precision of f.
Table 1. Algorithm comparison.
Ref | Model | Datasets | Accuracy | Pros and Cons
[17] | KCF | Mnist | - | High complexity
[21] | MCNN | NSL-KDD | 86.90% | High speed
[24] | SiamRPN | NSL-KDD | 96.35% | Low power
[25] | DaSiamRPN | SCADA data | 95.84% | High time cost
[26] | SiamRPN++ | IoT datasets | 96.20% | High complexity
Table 2. Feature extraction backbone network structure.
Network Layer | Input | Channel | Step | Attention
layer0 | 127 × 127 × 3 | 44 | 2 |
layer1 | 63 × 63 × 44 | 22 | 1 |
layer2 | 63 × 63 × 22 | 33 | 2 |
layer3 | 31 × 31 × 33 | 44 | 2 | Y
layer4 | 15 × 15 × 44 | 89 | 1 |
layer5 | 15 × 15 × 89 | 134 | 1 | Y
layer6 | 15 × 15 × 134 | 224 | 1 |
layer7 | 15 × 15 × 224 | 448 | 1 | Y
Output | 15 × 15 × 448 | | |
Table 3. Results of comparative experiment of attention.
Model | Size | Top1 Accuracy | Top5 Accuracy
MobileNetV2 | 3.51 M | 79.48 | 78.80
SENet | 3.53 M | 80.23 | 79.35
CBAM | 3.54 M | 80.26 | 79.34
NAM | 3.51 M | 80.66 | 79.82
ECA-Net | 3.51 M | 82.56 | 80.81
Ours | 3.68 M | 88.79 | 86.44
Table 4. Results of the comparative experiment under 4 different scenarios.
Model | Accuracy | Robustness | EAO | AFR
ILSN | 0.811 | 0.244 | 0.417 | 112
KCF | 0.647 | 0.773 | 0.134 | 150
MCNN | 0.783 | 0.585 | 0.289 | 90
SiamRPN | 0.748 | 0.460 | 0.244 | 130
DaSiamRPN | 0.751 | 0.344 | 0.326 | 120
SiamRPN++ | 0.781 | 0.236 | 0.414 | 35
Table 5. The UAV system parameters.
Category | Name | Parameters | Application
Aircraft system | UAV body | Volume: 171 mm × 245 mm × 62 mm; Speed: 5 m/s; Weight: 249 g; PTZ: 360° | The aircraft
 | Monitor | Frequency: 5.725-5.850 GHz; Positioning: GPS + BeiDou; WiFi protocol: WiFi 6 | Trajectory planning; flight altitude control
 | Server | Model: Lenovo ST558; CPU: Xeon; GPU: RTX 3080 × 2; RAM: 256 G; ROM: 1 T | Storing and processing data
Perception system | Camera | Image sensor: 1/1.3 inch, up to 48 million pixels; Angle of view: 82.1°; Equivalent focal length: 24 mm; Aperture: f/1.7 | Collecting all the displacement data
 | Radar | Model: RD2484 R; Measuring range: 1.5-50 m; FOV: horizontal 360°, vertical ±45° | Auxiliary environmental detection
 | IMU | Model: Honeywell HMS-MM-10; Positional accuracy: ±10 cm; Range: ±500°/s, 16 g; Bandwidth: 200 Hz | Balancing the UAV attitude
 | Image transmission | Quality: 1080p/30 fps; Bandwidth: 1.4-40 MHz; Real-time rate: 18 Mbps; Transmission distance: 12 km; Antennas: four | Telecommunication
Power system | Motor | Stator size: 100 × 33 mm; KV value: 48 RPM/V; Motor power: 4000 W/rotor | Driving the UAV to fly
 | Battery | Capacity: 2453 mAh; Voltage: 7.38 V; Temperature: 5-40 °C |
 | Propellers | Diameter: 54 inch; Rotor number: 8 | UAV obstacle avoidance
Operating system | Linux | Linux 18.04 | The system in the server
Software | Python 3 | PyCharm Community Edition 2021 | Developing various algorithms
 | Matlab | Matlab 2018b | Data processing
 | DJI Fly | Android 6.0 | Display of image transmission data
Table 6. Algorithm performance in 8 scenarios.
Number | Target Type | Tracking Completed | Average Accuracy | Average Precision | Tracking Speed
a | Car | Y | 0.78 | 1.04 | 110.2
b | Building | Y | 0.87 | 3.08 | 107.8
c | Car | Y | 0.86 | 1.20 | 107.5
d | Car | Y | 0.91 | 4.14 | 110.2
e | Car | Y | 0.86 | 1.73 | 108.8
f | Building | Y | 0.89 | 1.85 | 108.5
g | Building | Y | 0.86 | 1.59 | 107.4
h | Building | Y | 0.85 | 2.45 | 109.0
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
