1. Introduction
Accurate animal counts are the cornerstone of robust conservation and management plans [1]. For species prone to conflict with humans, or when population densities vary greatly in time and space, counts need to be carried out frequently [2]. Many different techniques exist to assess animal populations, from indirect methods, like pellet counts, to direct visual counting [3,4,5]. Most often, animal censuses are species-specific and require substantial investments in time, money, and effort by wildlife management teams [6]. Whilst some species gather periodically in specific locations, making population assessment easier [5,7], others roam alone or in small groups across vast territories [8,9]. Perhaps the most commonly used technique in open or semi-open environments is direct visual counting. Be it from the ground or from a moving aircraft, it is relatively easy to set up and carry out. However, it is prone to errors due to animal movement, group sizes, poor lines of sight, or variations in observer ability [5,10].
Unmanned aerial vehicles (UAVs), commonly known as drones, have recently become more accessible to researchers [11]. They allow easy access to remote areas, are safer and less technically challenging than their manned counterparts, are less stressful for animals, and offer the possibility to fully automate flights [2,12,13]. Moreover, their onboard positioning systems make it possible to reproduce earlier flights, making them well suited for regular assessments of the same areas [2]. When used correctly, they have proven able to produce more accurate counts than direct methods [5]. Because they are most often equipped with a digital imagery sensor, they capture the whole scene and thus allow the counting of large groups of individuals or of multiple species. They also offer the possibility to use thermal infrared imagery, which has proven effective for detecting animals that would not be visible on RGB images, for instance animals hidden in tree foliage [14] or active at night [15]. All these characteristics make them particularly promising alternatives to standard methods to reduce the costs and efforts of wildlife surveys while increasing their accuracy [14,16,17].
So far, the main bottleneck hindering their wide deployment has been the difficulty of processing the vast amounts of data they generate [2,13,17]. Fortunately, automatic object detection has undergone a revolution in the last few years thanks to convolutional neural networks (CNNs) [18], making the processing of large numbers of images faster and, on specific tasks, more accurate than humans [19,20]. Compared to previous image classification techniques, CNNs are completely data driven, automatically extracting and refining the information relevant to their decision [21]. Moreover, their performance is known to increase with the amount of data provided [22], making them particularly interesting for tasks that repeatedly collect vast amounts of data, such as self-driving cars or, in our case, animal censuses. They have successfully been used to detect various species [8,23,24,25,26], and their use in ecology has been on the rise in the past few years [21]. However, obtaining good performance is often the result of tedious trial and error and educated guesses, searching for the right values of the numerous hyperparameters that drive the learning process in a long iterative effort [27].
The sparsity of animals in the wild [8,26] makes their detection subject to the false positive paradox, whereby a detection method with good accuracy may return more false positives (FP) than true positives if the natural frequency of the positive class is extremely low [24,28]; for example, if only one patch in a thousand contains an animal, even a classifier that misclassifies just 1% of background patches will return roughly ten FP for every true detection. These FP then have to be processed manually, hindering the benefit of automatic detection [28]. Furthermore, this sparsity naturally leads to collecting many more images of background than of animals. The difference in the number of samples between classes is called class imbalance and is known to have a negative effect on the training of deep learning classifiers [29,30]. This is a common issue in fields such as disease diagnosis or fraud detection, where the events of interest are rare [31,32]. Several methods, such as oversampling, undersampling, class weighting, or thresholding, have been studied to tackle this issue [28,33]. While class weighting and oversampling seem to perform better than the others [28,29,33], they have been evaluated by training on whole unbalanced datasets, which can be very time consuming with large datasets.
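As a minimal illustration of how these two better-performing mitigations can be set up, the sketch below assumes a PyTorch binary patch-classification setup; the variable names `train_labels` and `train_dataset` are hypothetical and this is not the pipeline used in this study.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, WeightedRandomSampler

# Hypothetical labels: 0 = background patch, 1 = animal patch.
labels = torch.as_tensor(train_labels)
class_counts = torch.bincount(labels).float()            # e.g. tensor([98000., 2000.])

# Class weighting: errors on the rare class weigh more in the loss.
class_weights = class_counts.sum() / (2.0 * class_counts)
criterion = nn.CrossEntropyLoss(weight=class_weights)

# Oversampling: rare-class patches are drawn more often within each epoch.
sample_weights = 1.0 / class_counts[labels]
sampler = WeightedRandomSampler(sample_weights, num_samples=len(labels), replacement=True)
loader = DataLoader(train_dataset, batch_size=64, sampler=sampler)
```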
Most often when using CNNs to detect animals, any object that is not the class of interest is labelled as background. As explained by Kellenberger et al. [28], the wider the covered area, the more landscape variety the background class will contain. However, some background objects may be much less common within the background class than others. This intraclass imbalance can in turn be a source of false detections, because the network has not seen enough of those rare samples [29]. Most of the previously discussed techniques to address class imbalance cannot be applied here, because the subclasses are not explicitly labelled within the background class. Furthermore, when the whole available training data is used as the training set, the over-represented background objects may waste computing time on many easy samples and dilute the impact of the hard ones.
Hard-negative mining (HNM), also known as bootstrapping, is the search for negative samples that the network fails to classify correctly [34]. Kellenberger et al. [28] use it to fine-tune a network after training it on the whole training set. However, it was originally designed as an iterative process that builds the training set by selecting the most relevant samples from the training data. Because only the relevant samples are selected, the training set is kept to a minimum size while maintaining good performance and short training times. The main downside of this method is that it requires several rounds of training.
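The iterative scheme can be summarized by the following sketch; the helpers `train_model` and `predict` and the sample lists are hypothetical placeholders, and it illustrates the general principle rather than the exact procedure used in this study.

```python
import random

# Round 0: start from a balanced subset of the available training data.
train_set = positives + random.sample(negative_pool, k=len(positives))
remaining = [n for n in negative_pool if n not in train_set]

for round_id in range(1, max_rounds + 1):
    model = train_model(train_set)                    # hypothetical full training run
    scores = predict(model, remaining)                # inference on the unused negatives
    hard = [n for n, s in zip(remaining, scores) if s >= 0.5]  # negatives mistaken for animals
    if not hard:                                      # nothing confusing left: stop mining
        break
    train_set += hard                                 # add only the informative negatives
    remaining = [n for n in remaining if n not in hard]
```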
In this paper, we present a general method that simultaneously tackles two major hurdles of training neural networks for image classification in wildlife surveys: the high number of FP and the large size of the datasets. More specifically, we showcase the effectiveness of HNM in reducing the number of FP while training quickly and efficiently, using a series of recent, simple, and readily available methods without the need for extensive fine-tuning.
4. Discussion
Our goal in this paper was to reduce training time and the number of FP when training on highly unbalanced data, with minimal fine-tuning of hyperparameters. Our method achieved better performance than the same model trained on the whole dataset, in a fraction of the training time and with very low FP rates (around 1 FP per 6 hectares on the summer test set and 1 per 60 hectares on the winter test set).
4.1. Training the Models
Early stopping allowed us to simultaneously avoid overfitting, save the version of the network with the best generalization performance, and limit training time. However, shortening the training also limits the capacity of an optimizer such as RAdam to reach its best performance when using a suboptimal learning rate. The LRRT offers an interesting synergy with early stopping in this regard, as it ensures that the learning rate is picked within the range containing the optimal value. Therefore, we can expect good performance from the beginning of training and an end result not far from what it would have been with the optimal value. While training parameters are generally given in the literature, little information is provided on the process used to pick them [8,24,43,44]. The LRRT used in this article offers an interesting tool to standardize the search for good learning rate values at little cost. Moreover, it can also be used to find good values for other optimizer-related parameters, such as weight decay or momentum [27].
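For readers unfamiliar with the learning rate range test, the sketch below illustrates its principle: the learning rate is increased exponentially over a short run while the loss is recorded, and a value is then picked from the range where the loss still decreases steeply, before divergence. It assumes a standard PyTorch model, data loader, and loss function, and is a minimal illustration rather than the implementation used here.

```python
import torch

def lr_range_test(model, loader, criterion, lr_min=1e-7, lr_max=1.0, num_iters=200):
    """Sweep the learning rate exponentially and record the training loss at each step."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr_min)
    gamma = (lr_max / lr_min) ** (1.0 / num_iters)        # multiplicative LR increase per step
    lrs, losses = [], []
    data_iter = iter(loader)
    for _ in range(num_iters):
        try:
            inputs, targets = next(data_iter)
        except StopIteration:                              # restart the loader if it runs out
            data_iter = iter(loader)
            inputs, targets = next(data_iter)
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()
        lrs.append(optimizer.param_groups[0]["lr"])
        losses.append(loss.item())
        if loss.item() > 4 * min(losses):                  # stop once the loss clearly diverges
            break
        for group in optimizer.param_groups:               # raise the LR for the next step
            group["lr"] *= gamma
    return lrs, losses  # plot losses against lrs and pick a value before the loss minimum
```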
Slightly worse hyperparameter values than the ones chosen would likely have increased the number of hard samples and the time needed for the network to converge, but the performance would likely still have improved between HNM rounds thanks to the addition of new, relevant data. In our eyes, the method presented here offers a good trade-off between short training times and good generalization performance. An alternative approach to improve the performance of the final round of training (round 1 in this case) would be to spend more time fine-tuning the hyperparameters instead of using exactly the same training method as in the previous rounds. However, in a case like ours, where the performance on the validation set is already very good, we expect diminishing returns on the time invested in fine-tuning.
Apart from the impact that high levels of imbalance can have on generalization performance, training on large datasets requires a lot of computing time. Moreover, the hyperparameter tuning required by any method that tackles imbalance on a full dataset also takes significantly longer on a large dataset. While HNM does not completely remove the imbalance, it greatly reduces its magnitude. The other techniques mentioned earlier to mitigate the impact of imbalance could still be used in conjunction with HNM, but the fine-tuning of their parameters would require far less time thanks to the smaller size of the training and validation sets.
We noticed high levels of variability between runs using different random initializations for a given set of hyperparameters, despite what can be read in [43], and therefore encourage practitioners to try several runs before settling on a final model (see Appendix A for more details).
4.2. Hard-Negative Mining
Whilst the FP rate was better on the winter acquisition than on the summer one, the impact of HNM on classification performance between the two rounds of training was much stronger on the summer acquisition (an improvement by a factor of 100 on the summer test set against a factor of 14 on the winter test set). We believe the reason for this is the higher variety of objects present in the negative class of the summer dataset compared to its winter counterpart. In the winter, snow covered the majority of the objects present in the images, thereby reducing the intraclass imbalance of the negative class while increasing the contrast between the deer and the background. With more objects to confuse the network in the summer acquisition, the first inference on the training and validation pools returned significantly more FP than for the winter dataset. Most of these FP were similar, the vast majority being rocks, tree trunks, or shadows, which happened to be almost absent from the initial training set. In this regard, HNM ended up being a way of oversampling confusing objects within the negative class of the training set.
When the network is applied to new areas, the background diversity may differ from what it has previously encountered, which can decrease its performance [28]; in that case, the HNM process can retrieve only the informative examples needed to fine-tune the network. This would allow the training process to scale well in time as more data is acquired, without needing to retrain the network from scratch.
Unsurprisingly, the training of a model through HNM was significantly faster than a simple training on the full dataset and achieved better results. This highlights the negative impact a high number of easy samples can have on performance when nothing is done to mitigate the imbalance.
We believe this approach could be very beneficial to studies that use CNNs to perform image classification on imbalanced datasets for different species, either on camera-trap images [43,44] or on UAV imagery [24]. The latter is a particularly good example, as its annotation and classification methodologies are very similar to ours but with a much higher proportion of FP (1 FP for 530 negative images and an MCC score of 0.3526). The vast majority of their negative class consists of ocean, with very little intraclass diversity and therefore few objects that could confuse the network. Most of their negative images are likely to be easy samples that negatively impact the network's performance and could be removed from the training set through HNM. However, the difference in overall performance between our study and theirs does not only come from our HNM method. As explained in their article, other factors such as network depth (4 layers against 18 in ours), the use of a non-pretrained network, or the fact that they favor recall over precision may also have a significant impact on their network's performance. We could expect results similar to ours for images of similar GSD, of animals of a size comparable to the red deer, and in a similar environment, such as white-tailed deer (Odocoileus virginianus), caribou (Rangifer tarandus), or black bear (Ursus americanus).
When facing high levels of imbalance (75% of their images being negative), Norouzzadeh et al. [43] opted for a two-stage pipeline, first separating empty and full (containing animals) images, then classifying the species present in the full images. To that end, they first randomly selected negative images to match the number of positive images, as we do to start our round 0. However, they then carried out their training without using the rest of the negative images, amounting to half of their available data. A single round of mining on that negative data might have brought new informative samples, improving the ability of the network to distinguish between empty and full images without causing too much imbalance. Perhaps this alone would have helped improve the performance of their one-stage pipeline and reduced the need for a two-stage approach.
4.3. Perspectives on Future Work
The nature of image classification as performed here can lead to mistakes when the network is more confident in the background area around the deer than in the deer itself. When looking at the class activation maps of most of the FN (Figure 3), both deer and background are properly distinguished by the network, but the prediction is not the one we expect. Preliminary testing on these images has shown that cropping a significant portion of the background area around the deer led the network to classify the image as deer. We interpret this as a consequence of forcing the network to give a single, non-spatially-specific label to images containing both classes, based on whichever gives the highest score. A promising avenue for improvement is to use the same network to perform coarse semantic segmentation by transforming it into a fully convolutional network (FCN). This method outputs a raster per class, highlighting the areas in the image where the class is detected (Figure 4). Similar ideas in Kellenberger et al. [28] and Bowler et al. [26] achieved good levels of performance. We believe that this technique could be used as a detection method, but additional work to fully automate the detection from the coarse segmentation map is needed to assess its effectiveness on full-size images.
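As a rough sketch of this transformation, assuming a two-class ResNet-18 classifier similar in spirit to the one used here (the checkpoint path is hypothetical and this is not our exact implementation), the global pooling and fully connected layers can be replaced by a 1x1 convolution carrying the same weights, so the network outputs one coarse score map per class instead of a single label.

```python
import torch
from torch import nn
from torchvision import models

# Load a (hypothetically trained) 2-class ResNet-18 classifier.
clf = models.resnet18(num_classes=2)
# clf.load_state_dict(torch.load("deer_classifier.pth"))  # hypothetical checkpoint

# Turn the classifier head into a 1x1 convolution with the same weights.
fcn_head = nn.Conv2d(clf.fc.in_features, 2, kernel_size=1)
fcn_head.weight.data = clf.fc.weight.data[:, :, None, None].clone()
fcn_head.bias.data = clf.fc.bias.data.clone()

# Keep everything up to the last residual block, drop global pooling and the FC layer.
backbone = nn.Sequential(*list(clf.children())[:-2])
fcn = nn.Sequential(backbone, fcn_head)

with torch.no_grad():
    maps = fcn(torch.randn(1, 3, 1024, 1024))   # one coarse score map per class
print(maps.shape)                                # e.g. torch.Size([1, 2, 32, 32])
```

Because the converted network is fully convolutional, it can be applied to full-size images in one pass, which is what makes the coarse segmentation output possible.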