Article

Robust Correlation Tracking for UAV Videos via Feature Fusion and Saliency Proposals

Xizhe Xue, Ying Li, Hao Dong and Qiang Shen
1 School of Computer Science and Engineering, Shaanxi Provincial Key Laboratory of Speech and Image Information Processing, Northwestern Polytechnical University, Xi’an 710129, China
2 Department of Computer Science, Faculty of Business and Physical Sciences, Aberystwyth University, Aberystwyth SY23 3DB, UK
* Author to whom correspondence should be addressed.
Remote Sens. 2018, 10(10), 1644; https://0-doi-org.brum.beds.ac.uk/10.3390/rs10101644
Submission received: 10 September 2018 / Revised: 6 October 2018 / Accepted: 12 October 2018 / Published: 16 October 2018
(This article belongs to the Section Remote Sensing Image Processing)

Abstract

Following the growing availability of low-cost, commercially available unmanned aerial vehicles (UAVs), more and more research effort has been focused on object tracking using videos recorded from UAVs. However, tracking from UAV videos poses many challenges due to platform motion, including background clutter, occlusion, and illumination variation. This paper tackles these challenges by proposing a correlation filter-based tracker with feature fusion and saliency proposals. First, we integrate multiple feature types, such as dimensionality-reduced color name (CN) and histogram of oriented gradient (HOG) features, to improve the performance of correlation filters for UAV videos. Yet, a fused feature acting as a multivector descriptor cannot be directly used in prior correlation filters. Therefore, a fused-feature correlation filter is proposed that can directly convolve with a multivector descriptor, in order to obtain a single-channel response that indicates the location of an object. Furthermore, we introduce saliency proposals as a re-detector to reduce background interference caused by occlusion or any distracter. Finally, an adaptive template-update strategy guided by saliency information is utilized to alleviate possible model drift. Systematic comparative evaluations performed on two popular UAV datasets show the effectiveness of the proposed approach.

Graphical Abstract

1. Introduction

Recent years have witnessed significant developments in computer vision. An enormous amount of research effort has gone into vision-based tasks, such as object tracking [1,2,3,4,5,6] and saliency detection [7,8,9,10]. As a core field of computer vision, visual tracking [4,5,6,11] plays an active role in a wide range of applications, including driverless vehicles, robotics, traffic analysis, medical imaging, motion analysis, and many others.
It is critical to employ an efficient feature representation in order to improve performance in object tracking. Gradient and color features are the most popular single feature types. In particular, color features, such as color names (CN), help capture rich color characteristics, and histogram of oriented gradient (HOG) [12] features are adept at capturing abundant gradient information. Based on these feature descriptions, a variety of techniques for target tracking have been proposed. For instance, FragTrack [13] is devised to build object appearance models by exploiting multiple parts of the target. Babenko et al. [14] presented a multiple instance learning (MIL) algorithm to develop a discriminative model by bagging all ambiguous negative and positive samples. Grabner et al. [15] utilized a novel on-line Adaboost feature selection method (OAB), benefitting considerably from on-line training. In a past paper [3], a structural local sparse representation is applied to the tracking task, where both partial and spatial information are exploited. Zhang et al. [16] discovered the relationship between an object and its spatiotemporal context based on the use of a Bayesian framework. The extended Lucas-Kanade (ELK) method [17] considers two log-likelihood terms that are related to information regarding object pixels or background affiliation, in addition to the standard LK template matching term. Most of the aforementioned techniques are dependent on intensity or texture information while characterizing a given image. However, it is difficult for them to meet the requirement of processing a large number of frames per second without resorting to parallel computation on a standard PC when dealing with real-time tasks [17]. From this viewpoint, correlation filters [18,19,20,21,22] show their strengths both in speed and in accuracy, where the tracking problem is converted from the time domain to the frequency domain with the fast Fourier transform (FFT). In so doing, convolution can be substituted with multiplication in an effort to achieve fast learning and target detection.
Although high tracking speed may be obtained, long-time tracking can often result in model drift. To ensure the stability of model updating in object tracking, Kalal et al. [1] decomposed the ultimate task of tracking into the subtasks of tracking, learning, and detection (TLD), where tracking and detection reinforce each other. However, if the location of an object is predicted only with respect to the previous frame, the appearance model may suffer from noisy samples. In particular, when the object becomes occluded, the tracker will fail immediately. Having taken notice of this, Hare et al. [2] adjusted the appearance model in a more reliable way, learning a joint structured output (Struck) to predict the object location. Apart from using a correlation filter, Zhu et al. [21] introduced an additional filter for detection, which greatly alleviated the problems of location error and model drift caused by serious occlusion. Benefiting from temporal context and an online re-detector, the method described previously [22] is robust to appearance variation.
Note that, following the increasing availability of low-cost, commercially available unmanned aerial vehicles (UAVs), more and more research effort has been focused on object detection and tracking using UAV videos. For example, Logoglu et al. [23] designed a feature-based moving object detection method for aerial videos. Fu et al. [24] proposed a technique named ORVT for onboard robust visual tracking of targets in aerial images using a reliable global-local object model. However, all of the methods mentioned above cannot cope well with the challenges appearing in such videos, which typically involve illumination variation, background clutter, and occlusion. To address these issues, we propose a robust tracking approach for UAV videos, which offers three main contributions: (1) Fused features, composed of HOG and dimension-reduced CN features, are introduced into the correlation filter in order to improve the robustness of the appearance model in describing the target. (2) To deal with background clutter and, at the same time, to reduce the risk of model drift caused by occlusion, saliency proposals are introduced as posterior information to relocate the object. (3) A new adaptive template update method is proposed to further alleviate the problem of model drift caused by occlusion or distraction. The effectiveness of this approach is demonstrated through systematic comparisons against other techniques.
The rest of this paper is organized as follows. Section 2 discusses relevant previous work on correlation filters and saliency detection. Under the general framework of correlation filters, Section 3 describes our approach. Section 4 presents an evaluation of the proposed approach and a comparative study with state-of-the-art techniques. Section 5 discusses the tracking speed of different methods and assesses the effects of each contribution made by the proposed work. Finally, Section 6 concludes this study and points out interesting directions for further research.

2. Related Work

2.1. Correlation Filters

Because of their impressive speed, correlation filters have attracted a great deal of interest in object tracking. For instance, Bolme et al. [25] proposed the minimum output sum of squared errors (MOSSE) filter, which works by finding the maximum cross-correlation response between the model and a candidate patch. Henriques et al. [26] exploited the circulant structure and Fourier transformation in a kernel space (CSK), offering excellent performance on a range of computer vision problems. A vector correlation filter (VCF) was proposed by Boddeti et al. [27] to minimize localization errors while improving the tracking speed. Danelljan et al. [28] exploited the color attributes of an object and introduced CN features into CSK to perform object tracking. Combining the kernel trick and cyclic shifts [26], the Kernelized Correlation Filter (KCF) [29] offers more adaptive performance for diverse scenarios using multichannel HOG features. The DSST tracker [19] learns adaptive multiscale correlation filters using HOG features to handle the scale change of target objects. To learn a model that is inherently robust to both color changes and deformations, Staple [30] combines two image patch representations that are sensitive to competing factors. Danelljan et al. [31] utilized a spatial regularization component in the learning process to penalize correlation filter coefficients as a function of their spatial location. Recently, the authors of a past paper [20] proposed a background-aware correlation filter (BACF) that can model how the background as well as the foreground of an object may vary over time. To drastically reduce the number of parameters in the model, Danelljan et al. [32] proposed a factorized convolution operator. The utilization of a compact generative model of the training sample distribution significantly reduces the memory and time complexity, while providing better diversity of samples.
Whilst many methods exist as outlined above, they do not address the critical issue of online model update. As a result, such correlation trackers are susceptible to model drift and hence are less effective for handling important problems such as long-term occlusion and out-of-view objects.

2.2. Saliency Detection

Saliency is considered to represent an object or a pixel that is more conspicuous than its neighbors. Saliency detection aims to capture the regions that stand out in an image. In terms of algorithm strategy, saliency detection approaches can be categorized into two subgroups, one is the group of bottom-up data-driven methods [9,33,34] and the other is that of top-down task-driven methods [10].
Top-down methods are task-driven, learning a supervised classifier for salient object detection. In DRFI [9], hand-crafted features are extracted to classify each region. The authors of [10] proposed an SVM-based method with color information as the input. On the other hand, for most bottom-up methods, low-level features are employed to calculate the saliency value. By analyzing the log-spectrum of an input image, Hou et al. [8] introduced a mechanism to extract the spectral residual of an image in the spectral domain. They proposed a fast method for constructing the corresponding saliency map in the spatial domain which is independent of features, categories, or other forms of prior knowledge of the domain objects. To preserve the structure of the objects, region-based methods have also been proposed; these methods segment images into coherent regions to obtain a proper spatial structure. Goferman et al. [33] used a patch-based approach to capture global properties. Cheng et al. [34] employed soft abstraction to decompose an image into large perceptually homogeneous elements in order to achieve efficient saliency detection. Additionally, boundary cues have been used to improve saliency detection performance, with boundary priors treating image boundary regions as labeled background.

3. Proposed Approach

We aim to develop an online tracking algorithm that is adaptive to significant appearance change without being prone to drifting, in which the extracted fused features are encoded as multivectors. Further, saliency information is exploited to provide reliable proposals for the correlation filter to redetect objects in case of tracking failure. In particular, adaptive template updating rules are put forward in order to achieve robust performance. The flowchart of the proposed tracking approach is illustrated in Figure 1, where the speed of the tracker is ensured by the use of a correlation filter.

3.1. Correlation Tracking through Fused Features

Features play an important role in computer vision. For example, much of the impressive progress in object detection can be attributed to the improvement in the representation power of features [35]. Gradient and color features are the most widely exploited in object detection and tracking. Indeed, previous work [36] has verified that there exists a strong complementarity between gradient and color features.
However, how to jointly utilize different features for aerial tracking is still an open question. Compared with generic visual object tracking, certain tracking challenges are amplified in aerial scenarios, including abrupt camera motion, low resolution, significant changes in scale and aspect ratio, fast moving objects, as well as partial or full occlusion. It is difficult to obtain comprehensive information about objects of interest using a single feature type, such as HOG or CN [37], under such circumstances. Hence, we employ fused features to achieve robust performance in aerial tracking. Inspired by CN from a linguistic viewpoint [37], which involves eleven basic color terms (black, blue, brown, grey, green, orange, pink, purple, red, white, and yellow), we compute CN features from the original image and substantially reduce the number of color dimensions in an effort to enable a significant speed boost, following the work reported previously [28]. In addition, any given input color image is transformed into one with grey values and then HOG features are extracted from the resulting grey image. All these features are concatenated directly to form a multivector as a fused feature descriptor.
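As a rough illustration of how such a descriptor can be assembled, the following sketch stacks HOG and reduced CN channels into a single array. Here, compute_hog and rgb_to_cn are placeholders for the HOG implementation of [12] and the eleven-dimensional CN mapping of [37], and the SVD-based reduction merely stands in for the adaptive dimensionality reduction of [28]; it is a sketch under these assumptions, not the exact implementation used in this paper.

```python
import numpy as np

def fuse_features(gray_patch, color_patch, compute_hog, rgb_to_cn, n_cn=2):
    """Stack HOG (from the grey image) and dimensionality-reduced CN features
    into a single H x W x D fused multivector descriptor."""
    hog = compute_hog(gray_patch)            # (H, W, D_hog), placeholder routine
    cn = rgb_to_cn(color_patch)              # (H, W, 11) color-name probabilities, placeholder
    # Reduce the 11 CN channels to n_cn channels via a simple SVD projection.
    flat = cn.reshape(-1, cn.shape[2])
    flat = flat - flat.mean(axis=0)
    _, _, vt = np.linalg.svd(flat, full_matrices=False)
    cn_reduced = (flat @ vt[:n_cn].T).reshape(cn.shape[0], cn.shape[1], n_cn)
    return np.concatenate([hog, cn_reduced], axis=2)   # fused multivector
```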
In this paper, we utilize the multivector representation of fused features, which fits well with the correlation tracking framework. More specifically, we denote by $x$ the fused feature multivector with $D$ channels, where $x^{d}$ ($d \in \{1, \ldots, D\}$) is its $d$-th channel, and by $y$ the desired correlation output corresponding to a given sample $x$. A correlation filter $w$ of the same dimensionality as $x$ is then learned by solving the following minimization problem:
$w = \arg\min_{w} \left\| \sum_{d=1}^{D} w^{d} \star x^{d} - y \right\|^{2} + \lambda \sum_{d=1}^{D} \left\| w^{d} \right\|_{2}^{2}$   (1)
where $\lambda$ is a regularization parameter. Note that the minimization problem in Equation (1) is akin to training the multivector correlation filters in a past paper [27], and can be solved within each individual feature channel using the FFT. Let capital letters denote the corresponding Fourier-transformed signals. The learned filter in the frequency domain on the $d$-th ($d \in \{1, \ldots, D\}$) channel can be written as
$W^{d} = \dfrac{\bar{Y} X^{d}}{\sum_{i=1}^{D} \bar{X}^{i} X^{i} + \lambda}$   (2)
where $Y$, $X$, and $W$ denote the discrete Fourier transforms (DFT) of $y$, $x$, and $w$, respectively; $\bar{Y}$ represents the complex conjugate of $Y$, and $\bar{Y} X^{d}$ is a point-wise product. Given an image patch in the next frame (of the video sequence concerned), the fused feature multivector extracted from it is denoted by $z$, with channels $z^{d}$. The correlation response map is computed by
$r = \mathcal{F}^{-1}\left( \sum_{d=1}^{D} W^{d} \bar{Z}^{d} \right)$   (3)
where the operator $\mathcal{F}^{-1}$ denotes the inverse FFT. The target location can then be estimated by searching for the position of the maximum value of the correlation response map $r$, such that
$(x, y) = \arg\max_{a, b} \, r(a, b)$   (4)
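To make Equations (2)-(4) concrete, the following is a minimal NumPy sketch of learning the fused-feature filter and locating the target. The feature extraction itself is abstracted into a random array, and the array sizes, names, and parameter values are illustrative assumptions rather than the exact implementation used in this paper.

```python
import numpy as np

def learn_filter(x, y, lam=1e-2):
    """Learn a multichannel correlation filter in the frequency domain (Eq. 2).
    x: (H, W, D) fused feature multivector; y: (H, W) desired Gaussian response."""
    X = np.fft.fft2(x, axes=(0, 1))                  # per-channel DFT
    Y = np.fft.fft2(y)
    numerator = np.conj(Y)[..., None] * X            # \bar{Y} X^d for each channel d
    denominator = np.sum(np.conj(X) * X, axis=2) + lam
    return numerator / denominator[..., None]        # W^d

def detect(W, z):
    """Correlate the learned filter with new features z and locate the peak (Eqs. 3-4)."""
    Z = np.fft.fft2(z, axes=(0, 1))
    response = np.real(np.fft.ifft2(np.sum(W * np.conj(Z), axis=2)))
    return np.unravel_index(np.argmax(response), response.shape), response

# Toy usage with random "fused features" standing in for the HOG + reduced-CN channels.
H, W_, D = 64, 64, 13
rng = np.random.default_rng(0)
x = rng.standard_normal((H, W_, D))
yy, xx = np.mgrid[0:H, 0:W_]
y = np.exp(-((yy - H // 2) ** 2 + (xx - W_ // 2) ** 2) / (2 * 2.0 ** 2))  # Gaussian label
filt = learn_filter(x, y)
(peak, resp) = detect(filt, x)
```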

3.2. Object Redetection Based on Saliency Proposals

For traditional correlation filter-based trackers [26,27,28,29], the use of the FFT helps greatly reduce the computational cost, demonstrating the ability of real-time tracking on UAV videos. Nevertheless, two main challenges remain: (a) distraction and (b) model drift, caused by occlusion or background clutter. In DSST [19], an independent scale prediction filter is presented, but it fails to perform well when serious occlusion exists, as shown in Figure 2. A common approach to handling model drift is to integrate a short-term tracker and an online long-term detector, as is done in the TLD algorithm [1]. However, learning an online long-term detector relies heavily on a large number of well-labeled training samples, which can be difficult to collect. Additionally, an exhaustive search through the entire image with sliding windows is time-consuming, especially when employing complex but discriminative features.
To provide relatively few proposals and suppress background interference, in this paper we not only utilize an adaptive update strategy to learn the appearance model, but also exploit reliable information from a biologically inspired saliency map. We postulate that such a re-detector can alleviate the model drift problem caused by occlusion or distraction.

3.2.1. Saliency Proposal Detection

Due to its simplicity and efficiency, we propose to utilize the spectral residual-based saliency detection algorithm [8] to obtain saliency proposals. Then, we iteratively redetect the object based on the resulting saliency proposals. Given an original image $I$, the Fourier transform is used to extract the phase features $P(f)$ and amplitude features $A(f)$ of the image (in the frequency domain), as shown in Equations (5) and (6):
$A(f) = \mathfrak{R}\big(\mathcal{F}[I(x)]\big)$   (5)
$P(f) = \mathfrak{S}\big(\mathcal{F}[I(x)]\big)$   (6)
where $\mathcal{F}$ denotes the Fourier transform, and $\mathfrak{R}(\cdot)$ and $\mathfrak{S}(\cdot)$ return the amplitude and phase spectra, respectively.
From these, the averaged log spectrum is approximated as $h_n(f) * L(f)$, where $L(f) = \log(A(f))$ and $h_n(f)$ denotes a local average filter used to approximate the general shape of $A(f)$. Thus, the spectral residual $R(f)$ can be obtained by Equation (7):
$R(f) = L(f) - h_n(f) * L(f)$   (7)
In the subsequent experimental studies, the size $n$ of $h_n(f)$ is empirically set to 3.
The spectral residual R ( f ) helps capture the key information contained within an image. In particular, it serves as a compressed representation of the underlying scene reflected by the image. Using inverse Fourier transform (IFT), we can construct the saliency map in the spatial domain. The saliency map contains primarily the nontrivial parts of the scene. The content of the residual spectrum can also be interpreted as the unexpected portion of the image. Thus, the value at each point in a saliency map is squared to indicate the estimation error. For better visual effects, we smooth the saliency map with a Gaussian filter g(x). In sum, given an image I(x), we have
$S(x) = g(x) * \mathcal{F}^{-1}\big[\exp\big(R(f) + P(f)\big)\big]^{2}, \qquad g(x) = \frac{1}{2\pi\sigma^{2}} \exp\left(-\frac{(i - k - 1)^{2} + (j - k - 1)^{2}}{2\sigma^{2}}\right)$   (8)
where $k = 4$, $\sigma = 2.5$, $(i, j)$ are the coordinates of pixel $x$, and $\mathcal{F}^{-1}$ denotes the IFT.
Having built a saliency map $S(x)$, saliency proposals can be obtained using threshold segmentation and region connection. Specifically, the saliency map is first segmented according to adaptive thresholding [38], which generates a number of connected domains. Without loss of generality, supposing that the connected domain corresponding to the real object does not appear at the border of the image, we exclude the connected domains whose centers lie within a certain number of pixels of the boundary of the segmented image in order to derive the final saliency proposals (in the implementation herein, this number is set to 15).
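A brief NumPy/SciPy sketch of this pipeline, i.e., Equations (5)-(8) followed by proposal extraction, is given below. It assumes a single-channel grey image, uses a simple mean-based threshold as a stand-in for the adaptive thresholding of [38], and all function names and the threshold multiplier are illustrative assumptions.

```python
import numpy as np
from scipy.ndimage import uniform_filter, gaussian_filter, label, center_of_mass

def spectral_residual_saliency(gray, n=3, sigma=2.5):
    """Spectral residual saliency map in the spirit of Hou and Zhang [8], Eqs. (5)-(8)."""
    F = np.fft.fft2(gray)
    log_amp = np.log(np.abs(F) + 1e-8)                       # L(f) = log A(f)
    phase = np.angle(F)                                      # P(f)
    residual = log_amp - uniform_filter(log_amp, size=n)     # R(f) = L - h_n * L
    # Reconstruct from the residual amplitude and the original phase, then square.
    sal = np.abs(np.fft.ifft2(np.exp(residual + 1j * phase))) ** 2
    return gaussian_filter(sal, sigma)                       # smooth with g(x)

def saliency_proposals(gray, border=15):
    """Segment the saliency map and keep connected regions away from the image border."""
    sal = spectral_residual_saliency(gray)
    mask = sal > sal.mean() * 3            # stand-in for adaptive thresholding [38]
    labels, num = label(mask)
    h, w = gray.shape
    proposals = []
    for cy, cx in center_of_mass(mask, labels, range(1, num + 1)):
        if border <= cy < h - border and border <= cx < w - border:
            proposals.append((int(cy), int(cx)))   # candidate object centres
    return proposals
```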

3.2.2. Redetection Based on Saliency Proposals

The traditional correlation tracker cannot perform well when serious occlusion exists. To address this issue, we equip our tracker with a redetection approach based on saliency proposals. If the correlation response $r$ is less than a threshold $T_1$ for more than $L$ consecutive frames, it is highly likely that the target is seriously occluded. In this case, we redetect the object using saliency information; otherwise, the object is located by the correlation filter alone.
Specifically, we consider the location of the object in the previous frame as a center point, around which an image patch is cropped from the original image. The image patch is of size $B \times B$, with $B = \lfloor 0.08 \times num\_lost \times \sqrt{w \times h} \rfloor + w + 1$, where $num\_lost$ is the number of consecutive frames in which serious occlusion has occurred, $w$ and $h$ denote the initialized horizontal width and vertical height of the object of interest in the first frame, respectively, and $\lfloor \cdot \rfloor$ means rounding down. Such an image patch is designed to guarantee that the longer the object is lost, the bigger the image patch cropped from the original image.
From this, we obtain the saliency proposals in the image patch and sample padded regions of size $3w \times 3h$ around the center of every saliency proposal. Then, correlation filtering is applied between the region around the center of each saliency proposal and the template from the first frame, with the point of the largest response $r_m$ taken as the center of the new object if $r_m$ exceeds a certain value $T_2$. Otherwise, in order to ensure that the object remains within the image patch, the patch is expanded when the redetection step is repeated in the next frame.
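A compact sketch of this redetection loop is shown below. The patch-growing rule and the threshold $T_2$ follow the description above, while saliency_proposals and correlate_with_template are placeholders for the routines of Sections 3.2.1 and 3.1; the cropping and bookkeeping details are assumptions made purely for illustration.

```python
import numpy as np

def redetect(frame, prev_center, template_filter, num_lost, w, h,
             saliency_proposals, correlate_with_template, T2=0.25):
    """Attempt to relocate a lost target using saliency proposals (Section 3.2.2)."""
    # Patch side grows with the number of consecutively lost frames.
    B = int(np.floor(0.08 * num_lost * np.sqrt(w * h))) + w + 1
    cy, cx = prev_center
    y0, y1 = max(0, cy - B // 2), min(frame.shape[0], cy + B // 2)
    x0, x1 = max(0, cx - B // 2), min(frame.shape[1], cx + B // 2)
    patch = frame[y0:y1, x0:x1]

    best_resp, best_center = 0.0, None
    for py, px in saliency_proposals(patch):
        # Sample a 3w x 3h padded region around each proposal centre and correlate
        # it with the first-frame template (placeholder routine).
        resp = correlate_with_template(patch, (py, px), template_filter, 3 * w, 3 * h)
        if resp > best_resp:
            best_resp, best_center = resp, (py + y0, px + x0)

    if best_resp > T2:
        return best_center      # target recovered
    return None                 # keep expanding the patch in the next frame
```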
Figure 3 shows an example of an object being redetected based on saliency proposals. As can be seen from this figure, when the object is lost, we can gradually relocate its approximate position by applying saliency detection in the area where the object may appear. Following this, the object can then be relocated accurately using correlation filtering.

3.3. Adaptive Model Updating

To obtain a robust and efficient approximation, we update the numerator $A^{d}$ and the denominator $B^{d}$ of the correlation filter $W^{d}$ in Equation (2) separately, using a moving average:
$A_t^{d} = (1 - \eta) A_{t-1}^{d} + \eta\, \bar{Y} X_t^{d}$   (9)
$B_t^{d} = (1 - \eta) B_{t-1}^{d} + \eta \sum_{i=1}^{D} X_t^{i} \bar{X}_t^{i}$   (10)
$W_t^{d} = \dfrac{A_t^{d}}{B_t^{d} + \lambda}$   (11)
where $t$ is the frame index and the learning rate $\eta$ is set to 0.025 empirically.
If the object position is relocated according to saliency information, we update the template according to Equation (12):
$A^{l} = (1 - 10\eta)\, A^{l} + 10\eta \cdot init\_A^{l}, \qquad B = (1 - 10\eta)\, B + 10\eta \cdot init\_B$   (12)
where $init\_A^{l}$ and $init\_B$ denote the numerator and denominator templates computed from the first frame.
Then, the previous templates and the first frame template are combined to update the target template, thereby minimizing potential model drift.
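The update rules of Equations (9)-(12) amount to a few array operations. The sketch below assumes the frequency-domain arrays from the filter-learning sketch in Section 3.1 and the learning rate stated above; the regularization constant is an illustrative value.

```python
import numpy as np

LAMBDA = 1e-2   # regularization constant (illustrative value)

def update_model(A, B, X_t, Y, eta=0.025):
    """Moving-average update of the numerator A (H, W, D) and denominator B (H, W)
    of the correlation filter, Eqs. (9)-(11)."""
    A = (1 - eta) * A + eta * np.conj(Y)[..., None] * X_t
    B = (1 - eta) * B + eta * np.sum(np.conj(X_t) * X_t, axis=2)
    W = A / (B[..., None] + LAMBDA)
    return A, B, W

def update_after_redetection(A, B, init_A, init_B, eta=0.025):
    """Blend the current templates with the first-frame templates after a
    successful saliency-based redetection, Eq. (12)."""
    A = (1 - 10 * eta) * A + 10 * eta * init_A
    B = (1 - 10 * eta) * B + 10 * eta * init_B
    return A, B
```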

4. Experimental Results

We provide representative experimental results in this section. The proposed tracker is implemented in MATLAB 2014 on a PC with a 3.4 GHz processor and 16 GB of RAM, without any sophisticated program optimization. In order to present an objective evaluation of the performance of the proposed approach, we conduct experiments on two datasets, namely the VIVID dataset [39] and the UAV123 dataset [40], for both qualitative and quantitative evaluation. In these experiments, the parameters are fixed for all of the sequences: $T_1$ and $T_2$ are set to 0.2 and 0.25, respectively, $L$ is set to 7, and the candidate region size for the correlation filter is set to three times that of the object under tracking.
We compare the proposed tracker with a range of state-of-the-art trackers, including TLD [1], DSST [19], BACF [20], ORVT [24], Staple [30], SRDCF [31], ECO_HC [32], KCFDP [41], BIT [42], and fDSST [43]. Among these trackers, TLD introduces a detection method into the tracking problem, which performs well when occlusion exists, while DSST, KCFDP, SRDCF, Staple, BACF, ECO_HC, and fDSST involve the use of correlation filters to improve the speed of tracking. In particular, ORVT is an onboard robust aerial tracking algorithm that works by using a reliable global-local object model. Additionally, BIT is a biologically inspired tracker that extracts low-level biologically inspired features while imitating an advanced learning mechanism to combine generative and discriminative models for target location. Note that we employ the publicly available code of the compared trackers for a fair comparison.
We follow the standard evaluation metrics for object tracking algorithms in two aspects: the precision rate and the success rate. The precision rate is the percentage of successfully tracked frames for which the center location error (CLE) of a tracker is within a given threshold (e.g., 20 pixels), with CLE defined as the average Euclidean distance between the center locations of the targets and the manually labeled ground truths. A tracking result in a frame is considered successful if $\frac{|r_d \cap r_t|}{|r_d \cup r_t|} > \theta$ for a threshold $\theta \in (0, 1]$, where $r_d$ and $r_t$ denote the bounding boxes of the tracked region and the ground truth, respectively, $\cap$ and $\cup$ represent the intersection and union of two regions, respectively, and $|\cdot|$ denotes the number of pixels in a region. Thus, the success rate is defined as the percentage of frames in which the overlap rate is greater than a threshold $\theta$. Normally, the threshold $\theta$ is set to 0.5.
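For clarity, the two metrics can be computed as in the following sketch; bounding boxes are taken as (x, y, w, h) tuples and the function names are illustrative.

```python
import numpy as np

def center_location_error(box_a, box_b):
    """Euclidean distance between bounding-box centres (used for the precision rate)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    return np.hypot((ax + aw / 2) - (bx + bw / 2), (ay + ah / 2) - (by + bh / 2))

def overlap_ratio(box_a, box_b):
    """Intersection-over-union of two boxes (used for the success rate)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    ix = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def precision_and_success(pred_boxes, gt_boxes, cle_thresh=20, iou_thresh=0.5):
    """Fraction of frames within the CLE threshold, and above the overlap threshold."""
    cles = [center_location_error(p, g) for p, g in zip(pred_boxes, gt_boxes)]
    ious = [overlap_ratio(p, g) for p, g in zip(pred_boxes, gt_boxes)]
    precision = float(np.mean([c <= cle_thresh for c in cles]))
    success = float(np.mean([o > iou_thresh for o in ious]))
    return precision, success
```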
We present the results under one-pass evaluation (OPE), using the average precision and success rates over all sequences. OPE is the most common evaluation method, which runs each tracker on each sequence once. It initializes the trackers with the ground-truth state of the object in the first frame and reports the average precision or success rates across all the results obtained.

4.1. Experiments on VIVID Dataset

There are eleven video sequences in the VIVID dataset. Apart from motion blur and fast motion, these video sequences also suffer from further difficulties such as occlusion, scale variation, background clutter, and low resolution. In the VIVID dataset, the ground truth is given every ten frames. To evaluate the trackers more accurately, we annotate the ground truths of all eleven videos, referring to the official data, for quantitative evaluation.
The experimental results on nine of these videos are summarized in Table 1 and Table 2, which show the overall rates of the precision plots and those of the success plots, respectively. As can be seen from these tables, our tracker performs reliably and achieves the best outcomes overall. In particular, for the first three video sequences, where serious occlusion occurs, our method exhibits an excellent performance, benefitting from saliency-based redetection and adaptive template updating, while the other trackers lose the targets under these circumstances. However, the remaining video sequences are frequently affected by scale change, rotation, and similar objects, which also led to a decline in the performance of our algorithm.
Figure 4 shows the qualitative evaluation on the VIVID dataset. Figure 4a illustrates the performance of our approach and the compared algorithms on the sequence pktest01. Only our method maintains robust tracking after more than 100 frames of occlusion. It is evident that, by redetecting the object using saliency information, the proposed tracker is more robust than the other trackers. In the sequence pktest03, in addition to motion blur and fast motion, the other main challenges for tracking are illumination variation, serious occlusion, and background clutter. From the last picture of Figure 4b, it is obvious that the full occlusion by the car is handled well by our tracker, while the other methods drift away from the target. This implies that saliency detection makes an important contribution to such an outstanding performance. In addition, almost every frame is subject to a varying degree of background clutter. Note that the scale of the target is so small that it is almost merged into the background, with certain texture and other details lost. It can be seen from the results that only our algorithm can successfully deal with the problem of background clutter, as the other methods fail to track the target completely. There is no doubt that fused features help improve the robustness of the proposed appearance model. In addition, the adaptive model update strategy also helps reduce model drift. Both of the above measures lead to the excellent performance of our method.
As shown in Figure 4c–e, where there is no significant occlusion, our method can always follow the target, as can the other trackers. It works even when similar cars appear in the sequence egtest02. However, when scale variation and rotation occur, the calculated scales of the bounding boxes are not sufficiently accurate, causing a decrease in the accuracy of our tracker. For the sequences egtest01 and redteam, the background is similar to the edge of the target. If the response of the correlation filter is less than the threshold for a long time, our tracker will automatically try to relocate the target by exploiting visual saliency. Of course, this strategy may gradually introduce certain noise from the background around the target into the template, leading to slight model drift.

4.2. Experiments on UAV123 Dataset

In order to evaluate the performance of our proposed approach, we conduct experiments on twenty challenging video sequences selected from the UAV123 dataset for both quantitative and qualitative analysis. The UAV123 dataset provides a facility for the evaluation of different trackers on a number of fully annotated HD videos captured from a professional-grade UAV. It complements existing benchmarks by establishing the aerial component of tracking while providing a more comprehensive sampling of tracking nuisances that are ubiquitous in low-altitude UAV videos. Apart from aspect ratio change (ARC) and fast motion (FM), these video sequences are also affected by several adverse conditions such as background clutter (BC), camera motion (CM), full occlusion (FOC), illumination variation (IV), low resolution (LR), out of view (OV), partial occlusion (POC), similar object (SOB), scale variation (SV), and viewpoint change (VC). Thus, the experiments carried out herein include all typical challenges involved in real-world aerial tracking problems.
Ranging from 535 to 1783 frames, the twenty selected sequences used here involve all the challenging factors in the UAV123 dataset, with different resolutions. Various scenes exist in these sequences, such as roads, buildings, fields, beaches, and so on. The targets include aerial vehicles, persons, trucks, boats, cars, etc. Detailed information on these sequences is listed in Table 3.
Table 4 and Table 5 exhibit the overall rates of the success plots and those of the precision plots on the twenty sequences, respectively. It can be seen that our tracker achieves the best performance on average, demonstrating its robustness in dealing with object tracking tasks involving different challenging factors and various background types.
We also perform an attribute-based comparison with the other methods on this subset of the UAV123 dataset. Figure 5 and Figure 6 show the precision plots and success plots for the twelve attributes, respectively. As can be seen from these results, our tracker always performs reliably and achieves the optimal, or at least a close to optimal, solution in most cases. Specifically, for the challenging factors amplified in aerial tracking, including CM, BC, SV, ARC, FM, IV, FOC, and VC, our tracker is able to achieve promising results, benefitting from the robustness of the fused features as well as from the employment of the appearance template and model updating strategy. For videos with fast moving objects, camera motion, and background clutter, the fused features have a stronger ability to capture information from the objects and, therefore, lead to better results as compared to the classic single-feature trackers. In addition, when the aspect ratio of an object changes significantly, our adaptive appearance template updating strategy can adjust the template to the appearance of the object. Moreover, thanks to the high-confidence model updating method, background noise is suppressed as much as possible when serious occlusion exists in aerial videos. Nevertheless, our tracker may not perform equally well when dealing with images of low resolution and targets that are out of view. This is likely due to the fact that such challenging factors usually create very serious problems for saliency detection, resulting in model drift.
Figure 7 illustrates qualitative evaluations of the application of different trackers to example sequences selected from the UAV123 dataset. In the sequence person16, the background has a similar color to the person, making it difficult for the trackers to function successfully, to varying extents. Owing to the use of saliency information, our tracker is able to relocate the object after it has been occluded by the tree and outperforms the state-of-the-art tracking methods. As shown in Figure 7b, the sequence uav1_3 contains almost all the possible challenges in aerial tracking, especially low resolution and serious background clutter. Benefiting from the target redetection strategy, our tracker can track the target successfully throughout, while the others locate the target correctly only once in a while. Of course, the robustness of the fused features also helps ensure the good performance of our tracker. However, for certain sequences with serious scale variation and similar objects, for example the sequence car10, our tracker slightly underperforms in comparison to several state-of-the-art algorithms (e.g., ECO_HC, BACF, and Staple). Under such circumstances, our tracker may incur small model drift, but it does not lose the target.

5. Discussion

In this section, we discuss the tracking speed of different methods and assess the effect of each technical contribution incorporated within the proposed approach. All the experimental results are again obtained on the twenty selected sequences of the UAV123 dataset, as indicated previously.

5.1. Speed Analysis

For practical applications of aerial tracking, the computational efficiency of a given tracker also needs to be considered. Table 6 lists the running speed of each compared tracker, and the average speeds over all of the sequences are shown in the last row. As we can see, the fDSST tracker achieves the fastest running speed, at approximately 134 fps, and the biologically inspired BIT tracker also performs well in terms of running efficiency. Mainly due to the low cost of computing the color histogram, the Staple tracker also performs well in terms of tracking speed. However, the SRDCF and BACF trackers show low running efficiencies across the twenty test sequences, at approximately 10.79 fps and 9.65 fps, respectively, which may not meet the standard of real-time running. It is worth noting that our tracker can meet real-time requirements while achieving highly satisfactory results on both the success rate and the precision rate. This owes much to the robustness of the fused features and the efficacy of saliency detection. To further strengthen the performance of the proposed tracker, we are seeking an optimization method to speed it up.

5.2. Effect of Fused Features

Computationally, feature construction is an essential part of our tracker as it provides sufficient information for the correlation filter. We perform an experimental study to show the advantage of feature fusion. In particular, we test our tracker with fused features against a version of the tracker using only HOG or CN features. The results are reported in Figure 8. It is obvious that fused features lead to better performance in terms of both the precision rate and the success rate.

5.3. Effect of Saliency Proposals

To demonstrate the effectiveness of the saliency proposals in the detection stage, we evaluate the quantitative performance of our tracker with and without saliency proposals. Note that almost all the sequences used in the experiments suffer from partial or full occlusion. The results are shown in Figure 9. Compared with the version without saliency proposals, the one utilizing saliency obtains considerably better performance. In addition, these results demonstrate that the tracking-by-detection mechanism is very helpful once integrated with correlation-based tracking for occlusion-dominated scenes.

5.4. Comparison of Saliency-Based Detection and Sliding Window-Based Detection

To further verify the contribution of the saliency-based mechanism, we substitute traditional sliding window-based detection for the saliency-based detection within the general framework of our tracking algorithm. Specifically, the detector is applied to the entire frame with sliding windows when $\max(r) < T_1$. In our implementation, the detector is trained as a random fern classifier [1], where each fern performs a number of pixel comparisons on an image patch, yielding a feature vector that points to a leaf node with a certain posterior probability. The posteriors from all ferns are averaged as the target response, and detection is based on the scanning-window strategy. We use a k-nearest neighbor (KNN) classifier to select the most confident tracked results as positive training samples; e.g., a new patch is predicted as the target if its k nearest feature vectors in the training set all have positive labels (k = 5 in this work).
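A simplified sketch of this sliding-window alternative is given below; the fern classifier is abstracted behind a scoring callback and the k-NN confirmation uses k = 5, so the routine illustrates the scanning strategy rather than the exact detector of [1], and all names are illustrative.

```python
import numpy as np

def sliding_window_detect(frame, tmpl_shape, fern_score, knn_feats, knn_labels,
                          extract_feat, k=5, stride=4):
    """Exhaustive sliding-window detection followed by a k-nearest-neighbour check.
    fern_score stands in for the averaged random-fern posterior of a patch and
    extract_feat for the feature vector used by the k-NN confirmation (placeholders)."""
    th, tw = tmpl_shape
    best_score, best_box = -np.inf, None
    for y in range(0, frame.shape[0] - th + 1, stride):
        for x in range(0, frame.shape[1] - tw + 1, stride):
            s = fern_score(frame[y:y + th, x:x + tw])
            if s > best_score:
                best_score, best_box = s, (x, y, tw, th)
    if best_box is None:
        return None
    # Accept the best window only if its k nearest neighbours in the training
    # set all carry positive labels (k = 5 in this work).
    x, y, w, h = best_box
    f = extract_feat(frame[y:y + h, x:x + w])
    nearest = np.argsort(np.linalg.norm(knn_feats - f, axis=1))[:k]
    return best_box if np.all(np.asarray(knn_labels)[nearest] == 1) else None
```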
Figure 10 presents the success and precision plots of these two trackers on the testing sequences. Obviously, our tracker performs significantly better on both evaluations. Due to the fast motion of the UAV, great changes can occur in the scale and appearance of the target in the videos, which may reduce the similarity between the target and the corresponding tracking templates. Hence, it is hard for methods using a sliding window, which work by discriminating the target according to similarity measures between windows, to obtain satisfactory results. What should not be ignored is that object tracking is closely related to attentional tasks in the biological world. Inspired by this observation, we exploit the abundant saliency information in the videos. Then, the adaptive template updating strategy ensures that new templates obtained by saliency detection can be introduced in time. This helps minimize the occurrence of possible model drift when the appearance of the target changes drastically.
Furthermore, we compare the speeds of these two methods. The results are illustrated in Figure 11. It can be seen that, with the introduction of saliency information, the proposed approach achieves a higher running speed on the majority of the testing sequences as compared to the version with sliding window-based detection. This can be expected because the proposed approach is intended to imitate biological vision systems, which are able to pop out the salient locations in the visual field [44] even under the most adverse conditions (e.g., highly cluttered scenes, low light, etc.). These salient locations become the focus of attention for the post-attentive stages of visual processing, which can effectively provide proposals for target relocation. However, for the detector without saliency detection, every tracking outcome is computed by running a sliding window, inevitably at the expense of more computing resources.

6. Conclusions

In this paper, we have proposed a robust tracking method for UAV videos via a fused feature-based correlation filter and saliency detection. The correlation filter, which combines HOG and dimension-reduced CN features, makes a significant contribution to tracking performance when dealing with challenging factors such as occlusion, noise, and illumination. To handle serious occlusion, this work has introduced saliency information into the tracker as a redetection mechanism, thereby reducing background interference. Moreover, an adaptive model update strategy is adopted to alleviate possible model drift, which is both robust and computationally efficient. Experimental investigations have demonstrated, both quantitatively and qualitatively, that our approach achieves favorable results in average performance on two popular aerial tracking datasets in comparison with state-of-the-art methods. Given its reliability and robustness, the proposed tracker can be successfully employed in a wide variety of UAV video applications (beyond those related to surveillance), such as wildlife monitoring, activity control, navigation/localization, and obstacle/object avoidance, especially when real-time processing is mandatory, as in the case of rescue or defense purposes.
As this is a generic approach for aerial videos, we plan to further develop more robust fused features and to improve the speed of the redetection method in the future, while operating in real time. Also, in this work, it has been assumed that each feature channel is independent of the rest and hence, no interaction between such features has been considered. As such, a channel-wise filter was successfully adopted. However, it would be interesting to explore the interconnections among the information conveyed by different channels and to introduce a general linear filter to deal with such cases.

Author Contributions

All the authors made significant contributions to this work. X.X. and Y.L. devised the approach and analyzed the data; Q.S. provided advice for the preparation and revision of the work; X.X. performed the experiments; and H.D. helped with the experiments.

Funding

This work was supported by the National Natural Science Foundation of China (61871460, 61876152), the National Key Research and Development Program of China (2016YFB0502502), and the Foundation Project for Advanced Research Field of China (614023804016HK03002).

Acknowledgments

The authors would like to thank the editors and the anonymous referees for their constructive comments which have been very helpful in revising this paper.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Kalal, Z.; Matas, J.; Mikolajczyk, K. Tracking-Learning-Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 34, 1409–1422.
2. Hare, S.; Saffari, A.; Torr, P.H.S. Struck: Structured Output Tracking with Kernels. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Barcelona, Spain, 6–13 November 2011; pp. 263–270.
3. Lu, H.; Jia, X.; Yang, M.H. Visual tracking via adaptive structural local sparse appearance model. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Providence, RI, USA, 16–21 June 2012; pp. 1822–1829.
4. Blake, A.; Isard, M. Active Contours: The Application of Techniques from Graphics, Vision, Control Theory and Statistics to Visual Tracking of Shapes in Motion; Springer Science Business Media: Berlin, Germany, 2012.
5. Battiato, S.; Farinella, G.M.; Furnari, A.; Puglisi, G.; Snijders, A.; Spiekstra, J. An integrated system for vehicle tracking and classification. Expert Syst. Appl. 2015, 42, 7263–7275.
6. Andriluka, M.; Roth, S.; Schiele, B. People-tracking-by-detection and people-detection-by-tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Anchorage, AK, USA, 23–28 June 2008; pp. 1–8.
7. Zingoni, A.; Diani, M.; Corsini, G. A Flexible Algorithm for Detecting Challenging Moving Objects in Real-Time within IR Video Sequences. Remote Sens. 2017, 9, 1128.
8. Hou, X.; Zhang, L. Saliency Detection: A Spectral Residual Approach. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Minneapolis, MN, USA, 18–23 June 2007; pp. 1–8.
9. Jiang, H.; Wang, J.; Yuan, Z.; Wu, Y. Salient Object Detection: A Discriminative Regional Feature Integration Approach. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Portland, OR, USA, 23–28 June 2013; pp. 2083–2090.
10. Li, X.; Li, Y.; Shen, C.; Dick, A.; Hengel, A.V.D. Contextual hypergraph modeling for salient object detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Sydney, Australia, 1–8 December 2013; pp. 3328–3335.
11. Wan, M.; Gu, G.; Qian, W.; Ren, K.; Chen, Q.; Zhang, H.; Maldague, X. Total Variation Regularization Term-Based Low-Rank and Sparse Matrix Representation Model for Infrared Moving Target Tracking. Remote Sens. 2018, 10, 510.
12. Dalal, N.; Triggs, B. Histograms of Oriented Gradients for Human Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), San Diego, CA, USA, 20–26 June 2005; pp. 886–893.
13. Adam, A.; Rivlin, E.; Shimshoni, I. Robust Fragments-Based Tracking Using the Integral Histogram. In Proceedings of the 2006 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), New York, NY, USA, 17–22 June 2006; pp. 798–805.
14. Babenko, B.; Yang, M.-H.; Belongie, S. Visual tracking with online multiple instance learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Miami, FL, USA, 20–26 June 2009; pp. 983–990.
15. Grabner, H.; Bischof, H. On-line boosting and vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), New York, NY, USA, 17–22 June 2006; pp. 260–267.
16. Zhang, K.; Zhang, L.; Liu, Q. Fast visual tracking via dense spatio-temporal context learning. In Proceedings of the 2014 European Conference on Computer Vision (ECCV), Zurich, Switzerland, 6–12 September 2014; pp. 127–141.
17. Oron, S.; Bar-Hillel, A.; Avidan, S. Extended Lucas-Kanade Tracking. In Proceedings of the 2014 European Conference on Computer Vision (ECCV), Zurich, Switzerland, 6–12 September 2014; pp. 142–156.
18. Yang, R.; Wei, Z. Real-Time Visual Tracking through Fusion Features. Sensors 2016, 16, 949.
19. Danelljan, M.; Hager, G.; Khan, F.S.; Felsberg, M. Accurate scale estimation for robust visual tracking. In Proceedings of the British Machine Vision Conference (BMVC), Nottingham, UK, 1–5 September 2014; pp. 65.1–65.11.
20. Galoogahi, H.K.; Fagg, A.; Lucey, S. Learning Background-Aware Correlation Filters for Visual Tracking. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 1144–1152.
21. Zhu, G.; Wang, J.; Wu, Y.; Lu, H. Collaborative Correlation Tracking. In Proceedings of the British Machine Vision Conference, Swansea, UK, 7–10 September 2015; pp. 184.1–184.12.
22. Ma, C.; Yang, X.; Zhang, C.; Yang, M.H. Long-term correlation tracking. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 5388–5396.
23. Logoglu, K.B.; Lezki, H.; Yucel, M.K. Feature-Based Efficient Moving Object Detection for Low-Altitude Aerial Platforms. In Proceedings of the IEEE International Conference on Computer Vision Workshop, Venice, Italy, 22–29 October 2017; pp. 2119–2128.
24. Fu, C.; Duan, R.; Kircali, D. Onboard Robust Visual Tracking for UAVs Using a Reliable Global-Local Object Model. Sensors 2016, 16, 1406.
25. Bolme, D.S.; Beveridge, J.R.; Draper, B.A.; Lui, Y.M. Visual object tracking using adaptive correlation filters. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), San Francisco, CA, USA, 13–18 June 2010; pp. 2244–2250.
26. Henriques, J.F.; Caseiro, R.; Martins, P.; Batista, J. Exploiting the Circulant Structure of Tracking-by-Detection with Kernels. In Proceedings of the European Conference on Computer Vision (ECCV), Florence, Italy, 7–13 October 2012; pp. 702–715.
27. Boddeti, V.N.; Kanade, T.; Kumar, B.V. Correlation filters for object alignment. In Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Portland, OR, USA, 23–28 June 2013; pp. 2291–2298.
28. Danelljan, M.; Khan, F.S.; Felsberg, M.; van de Weijer, J. Adaptive Color Attributes for Real-Time Visual Tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA, 23–28 June 2014; pp. 1090–1097.
29. Henriques, J.F.; Caseiro, R.; Martins, P.; Batista, J. High-Speed Tracking with Kernelized Correlation Filters. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 583–596.
30. Bertinetto, L.; Valmadre, J.; Golodetz, S.; Miksik, O.; Torr, P.H.S. Staple: Complementary Learners for Real-Time Tracking. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1401–1409.
31. Danelljan, M.; Hager, G.; Khan, F.S.; Felsberg, M. Learning Spatially Regularized Correlation Filters for Visual Tracking. In Proceedings of the 2015 IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 4310–4318.
32. Danelljan, M.; Bhat, G.; Khan, F.S.; Felsberg, M. ECO: Efficient Convolution Operators for Tracking. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6931–6939.
33. Goferman, S.; Zelnik-Manor, L.; Tal, A. Context-aware saliency detection. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 34, 1915–1926.
34. Cheng, M.M.; Warrell, J.; Lin, W.Y.; Zheng, S.; Vineet, V.; Crook, N. Efficient Salient Region Detection with Soft Image Abstraction. In Proceedings of the IEEE International Conference on Computer Vision, Sydney, Australia, 1–8 December 2013; pp. 1529–1536.
35. Benenson, R.; Omran, M.; Hosang, J.; Schiele, B. Ten Years of Pedestrian Detection, What Have We Learned? In Proceedings of the European Conference on Computer Vision Workshops, Zurich, Switzerland, 6–7 September 2014; pp. 613–627.
36. Khan, R.; Weijer, J.V.D.; Khan, F.S.; Muselet, D.; Ducottet, C.; Barat, C. Discriminative Color Descriptors. In Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Portland, OR, USA, 23–28 June 2013; pp. 2866–2873.
37. Berlin, B.; Kay, P. Basic Color Terms: Their Universality and Evolution; University of California Press: Berkeley, CA, USA, 1991.
38. Bradley, D.; Roth, G. Adaptive Thresholding using the Integral Image. J. Graph. Tools 2007, 12, 13–21.
39. VIVID Tracking Evaluation Web Site. Available online: http://vision.cse.psu.edu/data/vividEval/datasets/datasets.html (accessed on 22 April 2018).
40. Mueller, M.; Smith, N.; Ghanem, B. A Benchmark and Simulator for UAV Tracking. In Proceedings of the 2016 European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 11–14 October 2016; pp. 445–461.
41. Huang, D.; Luo, L.; Wen, M.; Chen, Z. Enable Scale and Aspect Ratio Adaptability in Visual Tracking with Detection Proposals. In Proceedings of the 2015 British Machine Vision Conference, Swansea, UK, 7–10 September 2015; pp. 185.1–185.12.
42. Cai, B.; Xu, X.; Xing, X.; Jia, K.; Miao, J.; Tao, D. BIT: Biologically Inspired Tracker. IEEE Trans. Image Process. 2016, 25, 1327–1339.
43. Danelljan, M.; Häger, G.; Khan, F.S.; Felsberg, M. Discriminative Scale Space Tracking. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1561–1575.
44. Mahadevan, V.; Nuno, V. Saliency-based discriminant tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Miami, FL, USA, 20–25 June 2009; pp. 1007–1013.
Figure 1. Flowchart of proposed tracking algorithm.
Figure 2. Tracking results of DSST: (a) tracking well without occlusion; (b) tracking failed within occlusion; and (c) model drift after occlusion.
Figure 3. Object redetection based on saliency proposals. (a) Original image; (b) zoomed-in image patch cropped from (a); (c) zoomed-in saliency map; (d) zoomed-in saliency proposals; and (e) redetected object (marked with a red rectangle).
Figure 4. Qualitative evaluation of tracking results on VIVID dataset.
Figure 5. Precision plots of proposed tracker compared with state-of-the-art approaches on different attributes of UAV123 dataset.
Figure 6. Success plots of proposed tracker compared with state-of-the-art approaches on different attributes of UAV123 dataset.
Figure 7. Qualitative evaluation of tracking results on UAV123 dataset.
Figure 8. Tracking results using fused, color names (CN) or histograms of oriented gradient (HOG) features on 20 sequences from UAV123 dataset.
Figure 9. Tracking results with or without saliency proposals on 20 sequences from UAV123 dataset.
Figure 10. Tracking results with saliency based detection and sliding window based detection on 20 sequences from UAV123 dataset.
Figure 11. Running speeds of tracking methods with saliency-based detection and sliding window based detection on 20 sequences from UAV123.
Table 1. Overall rates of precision plots on different sequences of VIVID dataset.
Sequence | Ours | DSST | SRDCF | KCFDP | BACF | Staple | BIT | fDSST | TLD | ECO_HC | ORVT
pktest01 | 0.870 | 0.477 | 0.600 | 0.606 | 0.459 | 0.353 | 0.356 | 0.359 | 0.850 | 0.411 | 0.390
pktest02 | 0.814 | 0.329 | 0.566 | 0.448 | 0.745 | 0.475 | 0.329 | 0.315 | 0.712 | 0.556 | 0.544
pktest03 | 0.867 | 0.576 | 0.605 | 0.615 | 0.786 | 0.661 | 0.563 | 0.570 | 0.617 | 0.554 | 0.511
egtest01 | 0.845 | 0.274 | 0.275 | 0.274 | 0.910 | 0.909 | 0.861 | 0.836 | 0.433 | 0.887 | 0.849
egtest02 | 0.786 | 0.898 | 0.931 | 0.873 | 0.927 | 0.928 | 0.907 | 0.941 | 0.911 | 0.691 | 0.632
egtest03 | 0.825 | 0.864 | 0.855 | 0.863 | 0.845 | 0.859 | 0.870 | 0.867 | 0.919 | 0.849 | 0.864
egtest04 | 0.803 | 0.391 | 0.940 | 0.918 | 0.927 | 0.928 | 0.907 | 0.941 | 0.216 | 0.953 | 0.384
egtest05 | 0.712 | 0.731 | 0.734 | 0.730 | 0.731 | 0.734 | 0.729 | 0.736 | 0.786 | 0.729 | 0.726
redteam | 0.869 | 0.953 | 0.968 | 0.936 | 0.964 | 0.962 | 0.911 | 0.943 | 0.931 | 0.944 | 0.946
Overall | 0.821 | 0.610 | 0.719 | 0.696 | 0.810 | 0.757 | 0.715 | 0.723 | 0.708 | 0.730 | 0.650
Table 2. Overall rates of success plots on different sequences of VIVID dataset.
Sequence | Ours | DSST | SRDCF | KCFDP | BACF | Staple | BIT | fDSST | TLD | ECO_HC | ORVT
pktest01 | 0.533 | 0.188 | 0.173 | 0.165 | 0.175 | 0.170 | 0.147 | 0.172 | 0.493 | 0.173 | 0.142
pktest02 | 0.510 | 0.101 | 0.285 | 0.114 | 0.286 | 0.098 | 0.090 | 0.097 | 0.420 | 0.284 | 0.141
pktest03 | 0.462 | 0.267 | 0.239 | 0.231 | 0.246 | 0.277 | 0.216 | 0.274 | 0.194 | 0.278 | 0.199
egtest01 | 0.559 | 0.086 | 0.083 | 0.080 | 0.553 | 0.584 | 0.466 | 0.446 | 0.178 | 0.583 | 0.466
egtest02 | 0.497 | 0.655 | 0.680 | 0.604 | 0.640 | 0.691 | 0.606 | 0.697 | 0.655 | 0.750 | 0.875
egtest03 | 0.538 | 0.643 | 0.531 | 0.618 | 0.551 | 0.643 | 0.653 | 0.646 | 0.709 | 0.601 | 0.646
egtest04 | 0.283 | 0.244 | 0.575 | 0.511 | 0.640 | 0.691 | 0.606 | 0.697 | 0.117 | 0.508 | 0.232
egtest05 | 0.359 | 0.397 | 0.404 | 0.395 | 0.397 | 0.392 | 0.394 | 0.402 | 0.398 | 0.126 | 0.391
redteam | 0.542 | 0.580 | 0.859 | 0.738 | 0.781 | 0.597 | 0.561 | 0.672 | 0.716 | 0.721 | 0.626
Overall | 0.476 | 0.351 | 0.425 | 0.384 | 0.474 | 0.460 | 0.415 | 0.456 | 0.431 | 0.447 | 0.413
Table 3. Description of sequences selected from UAV123 for experimental investigations.
Sequence | Size | # Frames | Challenge
truck3 | 1280 × 720 | 535 | LR, POC, BC
bike2 | 1280 × 720 | 553 | ARC, BC, CM, FOC, IV, OV, POC
boat6 | 1280 × 720 | 805 | SV
wakeboard3 | 1280 × 720 | 823 | SV, ARC, LR, VC, CM
building3 | 1280 × 720 | 829 | SV, SOB
group2_2 | 1280 × 720 | 865 | SV, FOC, POC, VC, CM, SOB
person14_1 | 1280 × 720 | 847 | SV, ARC, LR, FOC, POC, BC, CM
boat1 | 1280 × 720 | 901 | SV
group2_3 | 1280 × 720 | 913 | SV, ARC, LR, FOC, POC, BC, IV, CM, SOB
uav1_3 | 720 × 480 | 997 | SV, ARC, LR, FM, FOC, POC, OV, BC, IV, VC, CM
person10 | 1280 × 720 | 1021 | SV, FOC, POC, OV, VC, CM, SOB
car17 | 1280 × 720 | 1057 | SV, ARC, LR, VC, CM
person16 | 1280 × 720 | 1147 | SV, ARC, FOC, POC, BC, IV, CM
car14 | 1280 × 720 | 1327 | SV, ARC, LR, FOC, POC, OV, VC, CM
group1_1 | 1280 × 720 | 1333 | SV, POC, SOB
person18 | 1280 × 720 | 1393 | SV, ARC, POC, OV, VC, CM
car10 | 1280 × 720 | 1405 | SV, POC, SOB
person2_2 | 1280 × 720 | 1434 | POC, OV, CM
car3 | 1280 × 720 | 1717 | SV, LR, POC, OV, CM, SOB
person20 | 1280 × 720 | 1783 | SV, ARC, POC, OV, VC, CM, SOB
Table 4. Overall rates of precision plots on different sequences of UAV123 dataset.
Sequence | Ours | DSST | SRDCF | KCFDP | BACF | Staple | BIT | fDSST | TLD | ECO_HC | ORVT
truck3 | 0.967 | 0.972 | 0.969 | 0.931 | 0.967 | 0.969 | 0.907 | 0.961 | 0.915 | 0.970 | 0.949
bike2 | 0.275 | 0.354 | 0.360 | 0.163 | 0.274 | 0.263 | 0.144 | 0.270 | 0.144 | 0.358 | 0.336
boat6 | 0.951 | 0.825 | 0.837 | 0.838 | 0.950 | 0.923 | 0.571 | 0.818 | 0.768 | 0.905 | 0.823
building3 | 0.956 | 0.937 | 0.938 | 0.919 | 0.954 | 0.948 | 0.898 | 0.923 | 0.808 | 0.961 | 0.899
group2_2 | 0.936 | 0.902 | 0.914 | 0.635 | 0.931 | 0.593 | 0.895 | 0.563 | 0.357 | 0.923 | 0.887
person14_1 | 0.204 | 0.209 | 0.208 | 0.207 | 0.208 | 0.207 | 0.207 | 0.207 | 0.823 | 0.208 | 0.209
boat1 | 0.816 | 0.687 | 0.722 | 0.603 | 0.815 | 0.784 | 0.779 | 0.777 | 0.329 | 0.780 | 0.526
wakeboard3 | 0.928 | 0.257 | 0.860 | 0.259 | 0.928 | 0.935 | 0.253 | 0.265 | 0.395 | 0.932 | 0.875
group2_3 | 0.900 | 0.682 | 0.843 | 0.757 | 0.898 | 0.875 | 0.759 | 0.759 | 0.274 | 0.868 | 0.775
uav1_3 | 0.389 | 0.090 | 0.155 | 0.090 | 0.198 | 0.091 | 0.090 | 0.090 | 0.195 | 0.341 | 0.090
person10 | 0.339 | 0.335 | 0.341 | 0.336 | 0.329 | 0.342 | 0.338 | 0.312 | 0.541 | 0.340 | 0.339
car17 | 0.329 | 0.217 | 0.104 | 0.145 | 0.329 | 0.230 | 0.144 | 0.272 | 0.315 | 0.507 | 0.247
person16 | 0.911 | 0.215 | 0.215 | 0.219 | 0.216 | 0.215 | 0.214 | 0.214 | 0.202 | 0.215 | 0.205
car14 | 0.678 | 0.630 | 0.646 | 0.701 | 0.641 | 0.666 | 0.676 | 0.637 | 0.508 | 0.644 | 0.623
group1_1 | 0.847 | 0.778 | 0.902 | 0.870 | 0.887 | 0.630 | 0.833 | 0.396 | 0.183 | 0.909 | 0.908
person18 | 0.555 | 0.427 | 0.504 | 0.357 | 0.550 | 0.513 | 0.431 | 0.520 | 0.126 | 0.563 | 0.146
car10 | 0.942 | 0.912 | 0.943 | 0.947 | 0.955 | 0.946 | 0.939 | 0.916 | 0.633 | 0.950 | 0.952
person2_2 | 0.933 | 0.925 | 0.920 | 0.915 | 0.923 | 0.897 | 0.922 | 0.876 | 0.655 | 0.937 | 0.925
car3 | 0.963 | 0.958 | 0.956 | 0.944 | 0.953 | 0.952 | 0.666 | 0.930 | 0.090 | 0.965 | 0.926
person20 | 0.468 | 0.372 | 0.438 | 0.508 | 0.468 | 0.504 | 0.238 | 0.344 | 0.111 | 0.400 | 0.225
Overall | 0.698 | 0.586 | 0.639 | 0.567 | 0.670 | 0.624 | 0.545 | 0.553 | 0.419 | 0.682 | 0.606
Table 5. Overall rates of success plots on different sequences of UAV123 dataset.
Sequence | Ours | DSST | SRDCF | KCFDP | BACF | Staple | BIT | fDSST | TLD | ECO_HC | ORVT
truck3 | 0.763 | 0.757 | 0.575 | 0.569 | 0.763 | 0.753 | 0.552 | 0.759 | 0.609 | 0.809 | 0.729
bike2 | 0.137 | 0.162 | 0.142 | 0.043 | 0.111 | 0.129 | 0.017 | 0.134 | 0.016 | 0.151 | 0.143
boat6 | 0.810 | 0.338 | 0.661 | 0.657 | 0.800 | 0.343 | 0.182 | 0.603 | 0.438 | 0.706 | 0.380
building3 | 0.774 | 0.544 | 0.669 | 0.615 | 0.774 | 0.646 | 0.517 | 0.638 | 0.424 | 0.745 | 0.538
group2_2 | 0.689 | 0.610 | 0.684 | 0.498 | 0.685 | 0.439 | 0.600 | 0.439 | 0.276 | 0.683 | 0.603
person14_1 | 0.118 | 0.137 | 0.131 | 0.130 | 0.136 | 0.134 | 0.132 | 0.136 | 0.529 | 0.135 | 0.137
boat1 | 0.788 | 0.376 | 0.765 | 0.460 | 0.786 | 0.689 | 0.376 | 0.761 | 0.494 | 0.767 | 0.768
wakeboard3 | 0.366 | 0.186 | 0.532 | 0.175 | 0.366 | 0.560 | 0.182 | 0.182 | 0.271 | 0.622 | 0.492
group2_3 | 0.530 | 0.388 | 0.467 | 0.376 | 0.528 | 0.504 | 0.400 | 0.309 | 0.104 | 0.487 | 0.402
uav1_3 | 0.145 | 0.001 | 0.047 | 0.001 | 0.071 | 0.001 | 0.001 | 0.001 | 0.069 | 0.147 | 0.001
person10 | 0.153 | 0.147 | 0.158 | 0.148 | 0.141 | 0.160 | 0.143 | 0.133 | 0.295 | 0.159 | 0.154
car17 | 0.380 | 0.090 | 0.072 | 0.054 | 0.380 | 0.098 | 0.055 | 0.104 | 0.268 | 0.130 | 0.215
person16 | 0.626 | 0.098 | 0.099 | 0.104 | 0.100 | 0.096 | 0.097 | 0.098 | 0.085 | 0.094 | 0.087
car14 | 0.433 | 0.368 | 0.384 | 0.492 | 0.433 | 0.439 | 0.375 | 0.453 | 0.314 | 0.443 | 0.418
group1_1 | 0.725 | 0.621 | 0.618 | 0.694 | 0.714 | 0.474 | 0.663 | 0.314 | 0.156 | 0.770 | 0.744
person18 | 0.659 | 0.505 | 0.615 | 0.601 | 0.659 | 0.666 | 0.507 | 0.619 | 0.185 | 0.660 | 0.352
car10 | 0.711 | 0.803 | 0.793 | 0.810 | 0.824 | 0.818 | 0.781 | 0.750 | 0.389 | 0.832 | 0.823
person2_2 | 0.764 | 0.761 | 0.675 | 0.764 | 0.766 | 0.729 | 0.763 | 0.719 | 0.432 | 0.776 | 0.779
car3 | 0.710 | 0.690 | 0.638 | 0.628 | 0.653 | 0.699 | 0.396 | 0.670 | 0.062 | 0.738 | 0.657
person20 | 0.720 | 0.333 | 0.645 | 0.685 | 0.718 | 0.693 | 0.337 | 0.582 | 0.212 | 0.653 | 0.331
Overall | 0.531 | 0.396 | 0.470 | 0.425 | 0.522 | 0.454 | 0.345 | 0.420 | 0.282 | 0.526 | 0.426
Table 6. Running speed (frames per second) of each tracker on sequences from the UAV123 dataset.
Sequence | Ours | SRDCF | ECO_HC | KCFDP | BACF | Staple | BIT | fDSST | TLD | ORVT | DSST
truck3 | 45.99 | 19.28 | 79.73 | 131.08 | 13.04 | 99.41 | 139.15 | 219.96 | 2.90 | 26.04 | 145.87
bike2 | 42.59 | 26.27 | 82.51 | 102.42 | 14.67 | 102.45 | 179.93 | 295.03 | 1.44 | 36.43 | 238.96
boat6 | 52.76 | 16.83 | 78.13 | 53.81 | 12.26 | 101.18 | 142.74 | 163.75 | 10.03 | 32.59 | 157.43
building3 | 32.60 | 12.81 | 77.04 | 49.20 | 11.31 | 93.08 | 108.00 | 165.08 | 5.81 | 12.78 | 102.79
group2_2 | 16.46 | 7.78 | 76.02 | 53.29 | 9.21 | 86.66 | 81.79 | 124.44 | 12.64 | 28.91 | 60.64
person14_1 | 26.28 | 9.56 | 77.33 | 46.77 | 8.36 | 87.81 | 97.64 | 157.11 | 29.87 | 21.47 | 93.65
boat1 | 10.13 | 5.41 | 51.93 | 18.95 | 7.69 | 51.28 | 5.71 | 12.02 | 13.90 | 12.05 | 3.12
wakeboard3 | 37.78 | 12.90 | 76.47 | 55.53 | 11.73 | 98.71 | 106.63 | 136.41 | 21.68 | 25.01 | 104.89
group2_3 | 7.49 | 13.28 | 78.38 | 58.80 | 9.68 | 96.05 | 103.95 | 170.87 | 26.06 | 21.66 | 105.19
uav1_3 | 30.18 | 14.58 | 77.34 | 42.85 | 10.48 | 90.58 | 114.98 | 176.30 | 32.74 | 23.35 | 162.47
person10 | 8.67 | 4.96 | 74.40 | 39.20 | 5.51 | 73.84 | 49.09 | 68.28 | 27.15 | 24.13 | 25.72
car17 | 47.52 | 20.95 | 86.30 | 196.04 | 13.03 | 103.27 | 165.52 | 273.97 | 10.93 | 33.98 | 220.91
person16 | 17.18 | 5.63 | 72.06 | 43.89 | 6.87 | 82.86 | 78.10 | 116.52 | 17.95 | 14.49 | 51.68
car14 | 29.91 | 6.29 | 88.89 | 42.54 | 9.26 | 92.75 | 67.67 | 112.94 | 13.50 | 17.79 | 44.73
group1_1 | 11.45 | 5.39 | 75.11 | 32.94 | 8.25 | 70.77 | 50.82 | 34.82 | 14.56 | 25.56 | 25.42
person18 | 16.1 | 4.91 | 56.42 | 22.75 | 6.33 | 30.78 | 12.18 | 18.10 | 25.25 | 26.80 | 4.50
car10 | 26.63 | 7.68 | 76.76 | 33.40 | 8.55 | 86.67 | 84.75 | 133.25 | 16.81 | 19.19 | 73.95
person2_2 | 11.75 | 5.40 | 74.03 | 29.18 | 8.29 | 71.69 | 53.54 | 88.44 | 16.16 | 25.82 | 29.51
car3 | 37.27 | 12.16 | 80.86 | 54.14 | 12.99 | 93.77 | 124.41 | 186.74 | 17.86 | 28.73 | 111.87
person20 | 4.26 | 3.85 | 50.52 | 31.09 | 5.53 | 19.82 | 6.39 | 20.36 | 16.08 | 17.84 | 9.09
Average | 25.65 | 10.79 | 74.51 | 56.89 | 9.65 | 81.67 | 89.64 | 133.71 | 16.66 | 23.73 | 88.61
