Article

Background Instance-Based Copy-Paste Data Augmentation for Object Detection

School of Information Science and Technology, North China University of Technology, Beijing 100144, China
*
Author to whom correspondence should be addressed.
Submission received: 8 August 2023 / Revised: 4 September 2023 / Accepted: 5 September 2023 / Published: 7 September 2023

Abstract
In supervised deep learning object detection, the quantity of object information and annotation quality in a dataset affect model performance. To augment object detection datasets while maintaining contextual information between objects and backgrounds, we proposed a Background Instance-Based Copy-Paste (BIB-Copy-Paste) data augmentation model. We devised a method to generate background pseudo-labels for all object classes by calculating the similarity between object background features and image region features in Euclidean space. The background classifier, trained with these pseudo-labels, guides copy-pasting to ensure contextual relevance. Several supervised object detectors were evaluated on the PASCAL VOC 2012 dataset, achieving a 1.1% average improvement in mean average precision. Ablation experiments with the BlitzNet object detector on the PASCAL VOC 2012 dataset showed a 1.19% improvement in mAP using the proposed method, compared to a 0.18% improvement with random copy-paste. Images from the MS COCO dataset containing objects of the same classes as in PASCAL VOC 2012 were also selected for object pasting experiments. The contextual relevance of the pasted objects demonstrates our model's effectiveness and its transferability between datasets with the same object classes.

1. Introduction

Object detection [1,2,3] is an important task in computer vision and is now widely used in intelligent traffic, industrial detection, home security, and other fields. Most object-detection models based on convolutional neural networks require labeled data for supervised learning [4,5,6]. For supervised object-detection models, the quantity and quality of data in the dataset affect model performance, so as much high-quality data as possible is needed. However, manually labeling data can be both time-consuming and expensive. Therefore, data augmentation is considered a solution to this problem. Data augmentation [7] expands the original dataset by generating new data from existing data. Early forms of data augmentation relied on basic image operations such as geometric and color transformations. Later, random resizing and scale jittering became prevalent and saw widespread use [8]. However, these methods lack specificity when applied to object-detection tasks.
Similar objects generally appear in similar environments, and there is a certain degree of correlation between an object and its background. When the visual information of the object is incomplete (e.g., poor shooting angle or lighting conditions, occlusion or truncation, blur, or noise), contextual information becomes an important basis for object detection [9]. Before the widespread application of deep learning, some object detectors modeled the relationship between an object's class, position, and context through manual engineering [10,11,12]. With the development of deep learning, modeling context through convolutional neural networks has become a trend. By feeding the object background into a context CNN for modeling, the object is positioned based on the prediction of the context [13]. However, for datasets with a small number of objects, the contextual samples obtained from object backgrounds alone are insufficient. Ref. [14] uses two pre-trained models as encoders and a recurrent neural network as the decoder; through an attention mechanism, the encoders supply the decoder with semantic contextual information to generate captions for a given image. We found that in most images of object-detection datasets, the area occupied by objects is often smaller than that occupied by the background. These background areas could be better utilized.
In this paper, we considered augmenting object-detection datasets by copy-pasting objects while ensuring that the object classes match the contextual information. The contributions of this paper are as follows:
  • A BIB-Copy-Paste model was proposed for data augmentation in object detection. The model employed a plug-and-play similarity assessment module to generate background pseudo-labels based on the background of the object instances and trained a classifier to guide the copy-paste process.
  • Experimental results on the PASCAL VOC 2012 benchmark demonstrated that the proposed BIB-Copy-Paste model effectively expanded the dataset and generally improved mean average precision compared to baselines without data augmentation.
Our proposed model was effective in increasing the number of objects in the object detection dataset and generated images that were mostly consistent with the objective reality. Instead of time-consuming and labour-intensive manual annotation, our model automatically generated new images with multi-scale objects using only a small number of annotated images.

2. Related Works

2.1. Data Augmentation

Data augmentation typically generates new data based on existing data, artificially increasing the number and diversity of training samples. Data augmentation can improve model performance and enhance model robustness. Image data augmentation can generally be divided into two categories: image-based data augmentation and deep-learning-based data augmentation [15,16].
Image-based data augmentation originally came from basic image operations, such as geometric and color transformations. Mixup [17] generates a new image by linearly interpolating two images. In contrast to Mixup, AlignMixup [18] interpolates features in feature space, and the newly generated image preserves the geometry of one image and the appearance or texture of the other. These methods are experimentally effective for image classification. Some methods stitch several images into a single image; RICAP [19], for example, stitches four random images into a new image. This performs data augmentation efficiently but corrupts contextual information. Some works paste an image [20] or part of an image onto another image [21], or partly mask the image [22]. These methods are essentially random pasting or masking and may result in the loss of vital information. Golnaz Ghiasi et al. [23] proposed Copy-Paste, in which a subset of image objects is pasted onto another image after random scale jittering and horizontal flipping; Copy-Paste achieved state-of-the-art results in instance segmentation.
With the development and application of deep learning in various fields, data augmentation is increasingly being achieved through deep-learning methods. Data augmentation based on deep learning can be roughly divided into three categories: adversarial training, GAN-based data augmentation, and meta-learning data augmentation. Ian J. Goodfellow et al. [24,25] were the first to propose adversarial training, where generated adversarial samples were added to the training set to improve the robustness of the model. Some other methods were later proposed based on [24,25], but adversarial training is quite time-consuming and labour-intensive and involves a choice between robustness and accuracy [26,27]. Generative Adversarial Networks (GANs) [28] are powerful generative models. There has been some progress in recent years exploring the use of GANs to generate synthetic data for data augmentation in the context of limited or unbalanced datasets [29,30,31]. Ref. [32] introduced an approach that leverages adversarial learning techniques to balance datasets, with the goal of improving classifier performance. The study’s results demonstrate that the proposed model exhibited robust classification performance when applied to highly imbalanced datasets. Ekin D. Cubuk et al. [33] proposed AutoAugment, which designs a search space and allows the search algorithm to automatically find the best augmentation strategy. Sungbin Lim et al. [34] proposed the Fast AutoAugment method, which employs a density-matching-based search algorithm, and is several orders of magnitude faster than the AutoAugment method in terms of search time.

2.2. Data Augmentation for Object Detection

Data augmentation can effectively improve the training performance of object detectors. Some methods use erasure and masking to perform data augmentation for object detection. Terrance DeVries et al. [22] proposed the Cutout method, which randomly selects a fixed-size square region of the image to be masked. Zhong et al. [35] proposed the random-erasing method, where random rectangular regions in an image are replaced with random values or the average pixels of the training set. Singh et al. [36] proposed Hide-and-Seek, which divides the image into several small areas and randomly occludes them according to a certain proportion. Chen et al. [37] proposed GridMask, which uses a grid-arranged mask to occlude the image. However, these methods have the potential to completely occlude the object, which can negatively affect object detection. Some methods use image blending for data augmentation, such as the aforementioned Mixup [17] and CutMix [21]. Bochkovskiy et al. [38] proposed Mosaic, an extension of CutMix that stitches four images into one image, but these methods tend to corrupt the object's contextual information. There are also works that focus on object context. Nikita Dvornik et al. [13] proposed a context model based on convolutional neural networks, using segmentation annotations to increase the number of object instances in the training data, and experiments have shown the effectiveness of this model in pasting objects into the correct scenes. Manually selecting data augmentation methods is often not the best solution; Barret Zoph et al. [39] proposed using deep learning to search for suitable combinations of image transformations for data augmentation.
Based on the above work, we believe that data augmentation for object detection should focus on the following two points: 1. avoid severely occluding the original objects; 2. ensure that the pasted objects match the contextual information. We proposed BIB-Copy-Paste data augmentation for object detection based on these two points.

3. Method

Real-life scenarios involving humans are varied and complex, while those involving most objects of the same kind are usually similar or regular. For example, vehicles are usually found on roads, and screens are usually placed on tables. Figure 1 illustrates how objects of the same class appear in similar scenes. Based on this empirical knowledge, for objects with relatively fixed scenes, similar regions obtained by searching through object background instance features in images have theoretically learnable commonalities, even in the absence of background annotations.
Figure 2 shows the BIB-Copy-Paste process, divided into two parts: training and pasting. In the training branch, an instance-based background similarity assessment module generates a background instance similarity matrix $M_b$, based on the object instance's background in the image, to evaluate the similarity between different parts of the image and the object background. Pseudo-labels are generated for highly similar background regions and fed into the classifier for training. This produces a backbone network that extracts features from object backgrounds and a classifier that distinguishes object background features. During pasting, features are extracted and classified using the trained backbone network and classifier to obtain the highest-scoring background region. An object is randomly selected from the object set and pasted into this area, and the dataset's annotation file is modified accordingly. This operation is performed on training-set images to obtain a BIB-Copy-Paste-augmented dataset.

3.1. Instance-Based Background Similarity Assessment Module

An instance-based background similarity assessment module is proposed to assess the similarity between different image regions and the object background instance. The process of this module is shown in Figure 3. In the image, the red box $O_1$ represents the object's bounding box, while the yellow box represents its background box. The background box is $k$ times ($k = 1.5$ in this paper) larger than the object bounding box. The image is fed into the backbone to obtain the image feature tensor $F_w$, the object feature tensor $F_o$, and the corresponding background box feature tensor $F_b$. $F_o$ and $F_b$ are fed into the background instance vector module to obtain the object background feature vector $V_b$. $V_b$ and $F_w$ are fed into the background similarity module to produce a background instance similarity matrix $M_b$, where regions with similarity greater than threshold $K_1$ are positive background regions. The positive feature module generates a series of anchor boxes with different sizes and ratios on the image. Anchor boxes with IoU greater than threshold $K_2$ with positive background regions are considered positive background anchor boxes $A_p$. $A_p$ is projected onto the image feature $F_w$ to produce positive features $F_p$, whose pseudo-labels are set to object $O_1$'s label. $F_p$ and its label are used for classifier training. The trained backbone and classifier can extract and distinguish object background features.

3.1.1. Background Instance Vector Module

To find image regions similar to object $O_1$'s background, features must be extracted from different image regions and from the background where object $O_1$ is located. The background instance vector module extracts the background feature between the object and background boxes, taking the object and background box features as input and outputting the background feature vector. This module's process is described by Formula (1), where $F_b$ represents the background box feature with dimension $(2048, h_b, w_b)$, $F_o$ represents the object feature with dimension $(2048, h_o, w_o)$, $w_o$ and $h_o$ represent the width and height of the object feature, $w_b$ and $h_b$ represent the width and height of the background box feature, $GAP$ stands for the global average pooling operation, and $V_b$ represents the object background instance vector of length 2048.
$$V_b = \frac{GAP(F_b)\, w_b h_b - GAP(F_o)\, w_o h_o}{w_b h_b - w_o h_o},\qquad(1)$$
The vector $V_b$ represents the features of the region between an object's bounding box and its background box, corresponding to the background where the object instance is located. Since the operation in Formula (1) does not change the number of channels, $V_b$'s length is equal to the channel depth of the image feature $F_w$, which is 2048 in this paper. Therefore, $V_b$ can be compared with features from different image regions to find regions similar to the one where the object instance is located.
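For illustration, the following is a minimal PyTorch sketch of Formula (1); the function name and the channel-first tensor layout are assumptions for this example rather than code from the paper.

```python
import torch

def background_instance_vector(f_b: torch.Tensor, f_o: torch.Tensor) -> torch.Tensor:
    """Formula (1): mean feature of the ring between the background box and the
    object box, recovered from the two global-average-pooled features.

    f_b: background-box feature of shape (C, h_b, w_b)
    f_o: object feature of shape (C, h_o, w_o)
    returns: V_b of shape (C,)
    """
    _, h_b, w_b = f_b.shape
    _, h_o, w_o = f_o.shape
    area_b, area_o = h_b * w_b, h_o * w_o
    gap_b = f_b.mean(dim=(1, 2))   # GAP(F_b), one value per channel
    gap_o = f_o.mean(dim=(1, 2))   # GAP(F_o), one value per channel
    # Turn the two means back into per-channel sums, subtract the object part,
    # and re-normalise by the area of the ring between the two boxes.
    return (gap_b * area_b - gap_o * area_o) / (area_b - area_o)
```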

3.1.2. Background Similarity Module

The background similarity module evaluates the similarity between different image regions and an object's background instance. The image feature $F_w$ is traversed in the spatial dimension to produce feature vectors $V_w^n$ ($n = 1, \dots, N = h \times w$), where $h$ and $w$ are the height and width of the image feature. Each $V_w^n$ is compared with the background instance vector $V_b$ to calculate their similarity $S_{w^n b}$. A threshold $K_s$ is set: if $S_{w^n b}$ is greater than $K_s$, the image region corresponding to $V_w^n$ is considered a positive background for this object; if $S_{w^n b}$ is less than $K_s$, it is considered a negative background. The following explains how $V_w^n$ is compared with $V_b$ and how their similarity is calculated.
Geometrically, the dot product of two vectors equals the product of their magnitudes and the cosine of the angle between them; its value therefore reflects how similar two vectors are in the space they occupy. Based on this, the dot product is used to compare the similarity of $V_w^n$ and $V_b$ in Euclidean space. When two vectors have constant magnitudes, their dot product reflects the size of the angle between them. To compare $V_w^n$ and $V_b$ strictly, similarity is evaluated from both the magnitude and the angle perspectives to obtain a similarity score $S_{w^n b}$ ($0 \le S_{w^n b} \le 1$). First, the magnitudes of vectors $V_w^n$ and $V_b$ are compared. The magnitudes $N_w^n$ of $V_w^n$ and $N_b$ of $V_b$ are calculated (assuming $N_w^n > N_b$). To limit the similarity score to $[0, 1]$, the smaller magnitude is divided by the larger one to obtain the magnitude similarity $S_m^n$. Formulas (2) and (3) show this calculation.
$$\lVert V_w^n \rVert = N_w^n, \quad \lVert V_b \rVert = N_b \quad (\text{assuming } N_w^n > N_b),\qquad(2)$$
$$S_m^n = \frac{N_b}{N_w^n} \quad (N_w^n > N_b,\ 0 \le S_m^n \le 1),\qquad(3)$$
Then, the angle between $V_w^n$ and $V_b$ is evaluated. $V_w^n$ and $V_b$ are normalized to obtain unit vectors $e_w^n$ and $e_b$, and the angle similarity $S_a^n$ is given by their dot product. Let $\alpha$ denote the angle between $e_w^n$ and $e_b$; its cosine equals $S_a^n$ ($0 \le S_a^n \le 1$). Formulas (4) and (5) show this calculation.
$$e_w^n = \frac{V_w^n}{\lVert V_w^n \rVert}, \quad e_b = \frac{V_b}{\lVert V_b \rVert},\qquad(4)$$
$$S_a^n = e_b \cdot e_w^n = \cos \alpha \quad (0 \le S_a^n \le 1),\qquad(5)$$
The similarity $S_{w^n b}$ between $V_w^n$ and $V_b$ is calculated as the product of the magnitude similarity $S_m^n$ and the angle similarity $S_a^n$, as shown in Formula (6).
$$S_{w^n b} = S_m^n \, S_a^n \quad (0 \le S_{w^n b} \le 1),\qquad(6)$$
Following the above calculation, all $V_w^n$ ($n = 1, \dots, N = h \times w$) are traversed to obtain the background instance similarity matrix $M_b$. The values in this matrix represent the similarity between different parts of the image and the background where object instance $O_1$ is located.
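The traversal of Formulas (2)-(6) can be vectorised; the sketch below is one possible PyTorch implementation, where clamping negative cosines to zero is an assumption made so that the score stays in [0, 1] as stated above, and the function name is hypothetical.

```python
import torch
import torch.nn.functional as F

def background_similarity_matrix(f_w: torch.Tensor, v_b: torch.Tensor) -> torch.Tensor:
    """Formulas (2)-(6): similarity between V_b and every spatial position of F_w.

    f_w: image feature of shape (C, h, w)
    v_b: background instance vector of shape (C,)
    returns: M_b of shape (h, w), values in [0, 1]
    """
    c, h, w = f_w.shape
    v_w = f_w.reshape(c, h * w).t()                  # one vector V_w^n per position, (N, C)
    n_w = v_w.norm(dim=1)                            # magnitudes N_w^n
    n_b = v_b.norm()                                 # magnitude N_b
    s_m = torch.minimum(n_w, n_b) / torch.maximum(n_w, n_b)      # magnitude similarity, (2)-(3)
    s_a = F.cosine_similarity(v_w, v_b.unsqueeze(0), dim=1)      # angle similarity, (4)-(5)
    # Clamping negative cosines to zero is an assumption so the score stays in [0, 1].
    return (s_m * s_a.clamp(min=0.0)).reshape(h, w)  # combined similarity, (6)
```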

3.1.3. Positive Feature Module

The process of the positive feature module is shown in Figure 4. After the background instance similarity matrix $M_b$ is obtained from the background similarity module, a background feature similarity threshold $K_1$ is set. Regions with similarity greater than $K_1$ are positive background regions for object $O_1$ (the yellow regions in the matrix), while regions with similarity less than or equal to $K_1$ are negative background regions (the blue regions in the matrix). The height and width of $M_b$ are consistent with those of the image feature $F_w$, so each positive region corresponds to a fixed size in the spatial dimension of the original image. Using positive background features with a fixed spatial scale does not effectively discriminate the backgrounds of multi-scale objects. To address this issue, a series of anchor boxes with different scales and ratios is generated using an anchor box generation method to adapt to objects of different scales. Anchor boxes with an intersection over union (IoU) greater than threshold $K_2$ with positive background regions are considered positive anchor boxes. Positive features are obtained by projecting the positive anchor boxes onto the image feature $F_w$. These positive features are assigned object $O_1$'s label and fed into the classifier for training.
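A rough sketch of the positive-anchor selection follows. Treating each positive cell of $M_b$ as a 1 × 1 region on the feature grid is a simplifying assumption (in practice, adjacent positive cells would likely be merged into larger regions before the IoU test), and the default thresholds are the values reported in Section 4.2.2.

```python
import torch
from torchvision.ops import box_iou

def positive_anchor_boxes(m_b: torch.Tensor, anchors: torch.Tensor,
                          k1: float = 0.8, k2: float = 0.85) -> torch.Tensor:
    """Select positive background anchor boxes A_p from the similarity matrix M_b.

    m_b:     background instance similarity matrix, shape (h, w)
    anchors: candidate anchor boxes in feature-map coordinates, shape (A, 4), xyxy
    k1:      background feature similarity threshold K_1
    k2:      IoU threshold K_2 between anchor boxes and positive regions
    """
    ys, xs = torch.nonzero(m_b > k1, as_tuple=True)  # cells above the similarity threshold
    if ys.numel() == 0:
        return anchors.new_zeros((0, 4))
    # Simplification: each positive cell of M_b is treated as a 1x1 region on the
    # feature grid; merging adjacent cells into larger regions is left out here.
    pos_regions = torch.stack([xs, ys, xs + 1, ys + 1], dim=1).float()
    iou = box_iou(anchors, pos_regions)              # (A, P) IoU matrix
    return anchors[iou.max(dim=1).values > k2]
```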

3.2. Backbone, Classifier and Paste Module

In this paper, ResNet50 [40] was used as the backbone, and the classifier is composed of three fully connected layers followed by a Softmax layer. The number of outputs of the last fully connected layer corresponds to the number of object classes in the dataset, and its output is fed into the Softmax layer. After training, the backbone and classifier can extract and discriminate object background features, recognizing image backgrounds in the dataset to determine object pasting positions. To enrich multi-scale objects in the dataset, a random scale method was adopted during object pasting. All objects in the training set were first cropped to generate an object set. When pasting an object, one was randomly selected from the set and its scale was used as a reference for pasting. Based on this reference scale, a series of pasting boxes with varying aspect ratios and areas was generated (in this paper, pasting boxes with aspect ratios of 1:1, 1:2, 1:1.5, and 1:3 were generated, with areas 1.5, 1.2, 1, and 0.8 times that of the reference scale). By scanning image regions with the generated pasting boxes of different scales, the corresponding features were fed into the classifier to obtain classification scores for each background. All boxes with classification scores greater than 0.85 were selected for pasting corresponding objects; if no boxes met this requirement, the one with the highest classification score was selected.
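The sketch below illustrates one way to express the classifier head and the pasting-box generation described above. The hidden layer sizes and the height:width reading of the aspect ratios are assumptions, since the paper does not specify them.

```python
import torch
import torch.nn as nn

class BackgroundClassifier(nn.Module):
    """Three fully connected layers followed by Softmax; the last layer has one
    output per object class (20 for PASCAL VOC 2012). Hidden sizes are assumed."""

    def __init__(self, in_dim: int = 2048, num_classes: int = 20):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(in_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 256), nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.softmax(self.fc(x), dim=-1)      # background classification scores


def candidate_paste_boxes(ref_w: int, ref_h: int) -> list:
    """Pasting-box sizes around a reference object scale: aspect ratios 1:1, 1:2,
    1:1.5, 1:3 (read here as height:width) and areas 1.5, 1.2, 1.0, 0.8 times
    the reference area."""
    ref_area = ref_w * ref_h
    sizes = []
    for ratio in (1.0, 2.0, 1.5, 3.0):
        for scale in (1.5, 1.2, 1.0, 0.8):
            w = (ref_area * scale / ratio) ** 0.5
            sizes.append((int(w), int(w * ratio)))    # (width, height) of a pasting box
    return sizes
```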
The strategy for pasting objects is as follows. For a pasting box $B_p$ with object class $C_p$, the box's score must exceed 0.5; otherwise, no object is pasted in it. An object image of class $C_p$ is randomly selected from the object set and resized to fit within pasting box $B_p$ while preserving its aspect ratio. The scaled object image is then centered and pasted in $B_p$. To prevent the boundary between a pasted object and its background from interfering with object detection, a Gaussian blur is applied to soften the transition.
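A possible implementation of this pasting strategy with PIL is sketched below. Blending through a Gaussian-blurred paste mask is used here as an approximation of the seam-softening blur described above, and the helper name and the integer (x1, y1, x2, y2) box format are assumptions.

```python
from PIL import Image, ImageFilter

def paste_object(scene: Image.Image, obj: Image.Image, box, score: float,
                 score_thresh: float = 0.5, blur_radius: int = 2) -> Image.Image:
    """Paste `obj` into `box` = (x1, y1, x2, y2) of `scene` if the background
    score passes; the object keeps its aspect ratio and is centred in the box."""
    if score <= score_thresh:
        return scene                                  # box rejected, nothing pasted
    x1, y1, x2, y2 = box
    bw, bh = x2 - x1, y2 - y1
    scale = min(bw / obj.width, bh / obj.height)      # fit inside the box, keep aspect ratio
    new_w, new_h = max(1, int(obj.width * scale)), max(1, int(obj.height * scale))
    obj = obj.resize((new_w, new_h))
    ox, oy = x1 + (bw - new_w) // 2, y1 + (bh - new_h) // 2
    # Blend through a Gaussian-blurred paste mask so the object/background
    # transition fades smoothly rather than showing a hard seam.
    mask = Image.new("L", scene.size, 0)
    mask.paste(255, (ox, oy, ox + new_w, oy + new_h))
    mask = mask.filter(ImageFilter.GaussianBlur(blur_radius))
    pasted = scene.copy()
    pasted.paste(obj, (ox, oy))
    return Image.composite(pasted, scene, mask)
```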

4. Experiment

4.1. Dataset and Preprocessing

Experiments were conducted on the PASCAL VOC 2012 (http://host.robots.ox.ac.uk/pascal/VOC/voc2012/, accessed on 4 September 2023) [41] and MS COCO (https://cocodataset.org/, accessed on 4 September 2023) [42] datasets. The PASCAL VOC 2012 dataset contains a total of 11,540 images and 27,450 objects in 20 classes, broadly categorized into vehicle, household, animal, and person. For object detection, the PASCAL VOC 2012 dataset is labeled with object location, shooting angle, whether the object is truncated or occluded, and ease of detection. The PASCAL VOC 2012 dataset is widely used for object classification, detection, segmentation, and action classification. The backbone was pre-trained on the PASCAL VOC 2012 dataset, with the pre-training process detailed in Section 4.2.1. The BIB-Copy-Paste model was trained on the PASCAL VOC 2012 training set containing 5717 images, and a new training set was created by pasting objects. An object detector was trained using this new training set, and its detection results provided quantitative evidence of the BIB-Copy-Paste method's effectiveness. The MS COCO dataset contains over 330,000 images with about 1.5 million object instances in 80 classes. It contains all of PASCAL VOC's object classes but with more images and objects. Most images in the MS COCO dataset are derived from everyday scenes and contain an average of 7.7 objects per image. Object pasting experiments were performed on a subset of the MS COCO dataset containing objects of the same classes as PASCAL VOC's to validate model transferability.

4.2. Experimental Results and Analysis

4.2.1. Pre-Training

Due to the absence of pre-trained models for background recognition on the ResNet50 network specific to the PASCAL VOC 2012 dataset, initial attempts to train the proposed model without pre-training yielded suboptimal results. Figure 5 shows BIB-Copy-Paste's object pasting results without pre-training, with the red boxes denoting pasted objects. In this case, the paste module only pasted objects from certain classes, and the number of pasted objects was relatively small. These classes of objects share common characteristics, with relatively simple and consistent backgrounds, such as airplanes and boats. According to the experimental results, although objects were pasted in background areas, most did not accurately reflect reality; for example, boats were pasted onto sky backgrounds, while airplanes were pasted onto snow backgrounds. There may be two reasons for this. First, due to the lack of pre-training for the backbone, the model did not effectively learn the background features associated with certain object classes. Second, for objects with complex and diverse backgrounds, such as people, the lack of common features in the object background made it difficult to determine accurate classification boundaries in the feature space.
To address the above issues, the method proposed by Spyros Gidaris et al. [43] was adapted, and pre-training was implemented to learn object background features. First, objects within an image were partially occluded with an occlusion ratio of 0.9 times their size. Then, the image was rotated around its center by 0°, 90°, 180°, and 270°, respectively, and the rotated images were fed into the backbone for feature extraction. The extracted features were fed into a three-layer fully connected network to predict the rotation angle. According to the theory of Spyros Gidaris et al. [43], predicting an image's rotation angle helps the backbone comprehend the image to some degree. Therefore, objects were occluded over a large area to encourage the backbone to learn from the backgrounds. Section 4.2.4 presents the experimental results and ablation experiments.
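The sketch below illustrates this pretext task under stated assumptions: the hidden sizes of the three-layer head and the per-side application of the 0.9 occlusion ratio are not given in the paper, and the helper names are hypothetical.

```python
import torch
import torch.nn as nn
import torchvision

class RotationPretext(nn.Module):
    """Backbone plus a three-layer head that predicts which of the four
    rotations (0, 90, 180, 270 degrees) an occluded image has undergone."""

    def __init__(self):
        super().__init__()
        resnet = torchvision.models.resnet50(weights=None)
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])   # drop the final fc layer
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(2048, 512), nn.ReLU(),   # hidden sizes are assumptions
            nn.Linear(512, 128), nn.ReLU(),
            nn.Linear(128, 4),                 # four rotation classes
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.backbone(x))


def occlude_and_rotate(img: torch.Tensor, boxes: torch.Tensor, ratio: float = 0.9):
    """Mask most of each object box (the 0.9 ratio is applied per side here),
    then return the four rotations of the image with labels 0-3."""
    img = img.clone()
    for x1, y1, x2, y2 in boxes.round().long():
        w, h = x2 - x1, y2 - y1
        dx, dy = int(w * (1 - ratio) / 2), int(h * (1 - ratio) / 2)
        img[:, y1 + dy:y2 - dy, x1 + dx:x2 - dx] = 0   # occlude the object interior
    views = [torch.rot90(img, k, dims=(1, 2)) for k in range(4)]
    return torch.stack(views), torch.arange(4)
```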

4.2.2. Implementation Details and Evaluation Metric

The model's backbone was first pre-trained on the dataset using the SGD optimizer with an initial learning rate of 0.001, weight decay of 0.0005, and momentum of 0.9. The learning rate was decreased to 70% of its initial value every 3 epochs, and training lasted for 30 epochs. The trained weights of the backbone were then transferred to the training branch's backbone, its weights were frozen, and the remaining components of the training branch were trained for an additional 20 epochs. The backbone weights were then unfrozen and the training branch was trained for another 30 epochs. The background feature similarity threshold was set at 0.8, with an IoU threshold of 0.85 between anchor boxes and positive regions. The trained weights of the backbone and classifier were transferred to the paste branch to perform pasting, generating a dataset augmented by BIB-Copy-Paste.
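For reference, the pre-training schedule could be written as follows. Reading "decreased to 70% ... every 3 epochs" as a multiplicative StepLR decay is an interpretation, and `train_loader` is a hypothetical data loader yielding rotated images and rotation labels.

```python
import torch

def pretrain(model: torch.nn.Module, train_loader, epochs: int = 30) -> None:
    """Pre-training loop with the schedule above: SGD, lr 0.001, weight decay
    0.0005, momentum 0.9; the learning rate is multiplied by 0.7 every 3 epochs
    (one reading of the decay rule stated in the text)."""
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3,
                                momentum=0.9, weight_decay=5e-4)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=3, gamma=0.7)
    for _ in range(epochs):
        for images, labels in train_loader:           # rotated images and rotation labels
            optimizer.zero_grad()
            loss = torch.nn.functional.cross_entropy(model(images), labels)
            loss.backward()
            optimizer.step()
        scheduler.step()
```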
The proposed model was trained on the PASCAL VOC 2012 dataset, and a new training set was generated by pasting objects onto its images. As described in Section 3.2, a random scaling method was employed when pasting objects so that pasted objects have a variety of sizes, which may result in no suitable background being available for pasting at a given scale within an image. Therefore, objects were pasted onto each image multiple times (3 times in this paper) to increase their number and enhance the detection of multi-scale objects. An object detector was trained on both the new training set and the PASCAL VOC 2012 training set and tested on its test set. A bounding box was deemed correct if its Intersection over Union (IoU) with a ground-truth box exceeded 0.5. Average Precision (AP) was used to evaluate detection quality for a single object class, while Mean Average Precision (mAP) was used for the entire dataset. The experimental results are presented in Section 4.2.3.
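For clarity, the IoU criterion used to judge a detection as correct can be written in a few lines of Python; the helper below is illustrative only.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# A predicted box counts as correct under the criterion above when its IoU with
# a matching ground-truth box exceeds 0.5.
print(iou((10, 10, 60, 60), (15, 15, 65, 65)) > 0.5)   # True: IoU is about 0.68
```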

4.2.3. Experimental Results on PASCAL VOC 2012 and MS COCO

To verify the effectiveness of BIB-Copy-Paste for object detection, experiments were conducted on the PASCAL VOC 2012 dataset. First, a training set augmented by BIB-Copy-Paste (hereafter referred to as the BIB training set) was generated following the aforementioned process and an object detector was trained on both the VOC and BIB training sets. Table 1 presents the experimental results on the PASCAL VOC 2012 test set. As shown in Table 1, compared to being trained on the VOC training set, an object detector trained on a BIB training set significantly improved mAP. On the PASCAL VOC 2012 test set, mAP increased by about 1% for YOLO and Faster RCNN object detectors and by about 1.1% for BlitzNet and SSD512. All four object detectors showed significant improvements in AP for classes such as ‘aeroplane’, ‘bird’, ‘boat’, ‘bus’, ‘car’, ‘cow’, and ‘sheep’. Overall, BIB-Copy-Paste effectively enhanced the dataset and improved mAP for object detectors on the test set.
In Figure 6, the red boxes highlight objects pasted using BIB-Copy-Paste in the BIB training set. The pasting results show that objects were generally pasted on suitable backgrounds at variable scales. BIB-Copy-Paste showed good results for classes such as 'aeroplane', 'sheep', and 'boat'. The experimental results demonstrate that BIB-Copy-Paste can effectively learn object instance backgrounds and paste objects onto conforming backgrounds.
To verify the transferability of BIB-Copy-Paste, an experiment was conducted to paste objects onto images from the MS COCO dataset using a BIB-Copy-Paste model trained on the PASCAL VOC 2012 dataset. Figure 7 shows the experimental results from the MS COCO dataset. As shown in the experimental results, objects were generally pasted onto suitable backgrounds, indicating that BIB-Copy-Paste had a good understanding of the background for some objects. A BIB-Copy-Paste model trained on one dataset can augment other datasets containing objects of the same classes.

4.2.4. Ablation Study

First, the effectiveness of pre-training was evaluated. The pasting results of BIB-Copy-Paste models with and without pre-training on the PASCAL VOC 2012 training set were compared. All other processes and conditions were consistent between the two experiments except for pre-training. Table 2 shows the results. As shown in Table 2, without pre-training, the model only tended to paste certain classes such as ‘aeroplane’, ‘bird’, ‘boat’, ‘bus’, and ‘car’, while many classes of objects were not pasted, indicating that the backbone did not sufficiently learn the background features corresponding to most object classes. After pre-training, there was a significant increase in the total number of objects pasted by the model and a more even distribution of pasted object types compared to before pre-training, indicating that pre-training improved the backbone’s understanding of object backgrounds. In addition, both before and after pre-training, the model showed a tendency to paste certain objects such as ‘aeroplane’, ‘bird’, ‘boat’, ‘bus’, ‘car’, ‘cow’, and ‘sheep’. After analyzing the images in the PASCAL VOC 2012 dataset, it was found that these objects had some similar characteristics, such as being mostly outdoors, rarely overlapping with images of people, and mostly appearing independently with little occlusion. It was hypothesized that these characteristics result in simpler and more consistent backgrounds for these objects than for other objects, making it more likely for the model to learn and fit their background features. Therefore, during the pasting process when classifying background regions in images, these objects were more likely to receive higher classification scores leading to a greater inclination for the model to paste certain classes of objects.
An experimental verification of the effectiveness of BIB-Copy-Paste for object detection was conducted. First, a training set with randomly pasted objects was generated. A picture was randomly selected from the PASCAL VOC 2012 training set and an object was randomly selected from the object set to be pasted into the picture. The boundary of the pasted object was blurred using Gaussian blur and the annotation file was modified accordingly. The number of randomly pasted objects was consistent with the BIB training set, generating a random training set. The BlitzNet object detector was trained on both the BIB training set and the random training set and tested on the PASCAL VOC 2012 test set. Table 3 shows the experimental results. As shown in Table 3, when trained on a randomly pasted enhanced training set, some classes of objects such as ‘aeroplane’, ‘bird’, ‘boat’, etc., improved in AP while others such as ‘cat’, ‘table’, ‘motorbike’, etc., decreased significantly. Although random pasting increased the total number of objects in the training set, compared to being trained on the VOC training set, mAP only increased by 0.18% for an object detector trained on a random training set and decreased by 1.01% compared to being trained on a BIB training set.
An intuitive analysis of the reasons why random pasting may lead to a decline in detection performance is provided. The object pasting effects of random pasting and BIB-Copy-Paste are compared in Figure 8. As shown in Figure 8, randomly pasted objects may cause serious or even complete occlusion of the original objects in the dataset, seriously impacting the accuracy of annotation data. Severely or completely occluded foreground objects will cause the object detector to learn from incorrect data, directly affecting its detection performance for occluded object classes. BIB-Copy-Paste pastes objects based on background features where they are located, preventing serious occlusion with original foreground objects in the dataset and being more conducive to object detection tasks. However, there are still some shortcomings. The model’s performance on more complex backgrounds is not very satisfactory, tending to paste objects onto simpler backgrounds which may lead to an imbalance in the number of different classes of objects. This will be a future research focus. Overall, BIB-Copy-Paste data augmentation can effectively improve detection accuracy and has good effects for object-detection tasks.

5. Conclusions

In this paper, a background instance-based copy-paste (BIB-Copy-Paste) data augmentation model for object detection was proposed. According to the experimental results, the images generated by the BIB-Copy-Paste data augmentation model were largely in line with objective reality and significantly improved the accuracy of the object detectors. The strategy was effective, confirming that contextual information is important and deserves attention in object detection. The BIB-Copy-Paste model can generate new, high-quality images containing multiple objects from images labelled with only a small number of objects, thus eliminating the need for time-consuming and labour-intensive labelling. Our model is therefore particularly helpful for object-detection datasets with few labels. However, there is still room for improvement in balancing the augmentation across different object classes, and this will be the direction of future work. With the continuous development of deep learning, it has become a trend for machines to assist humans in data annotation, and in the future, machines may completely replace humans in data annotation.

Author Contributions

Conceptualization, L.Z., Z.X. and X.W.; methodology, L.Z.; software, L.Z.; validation, L.Z.; formal analysis, L.Z.; investigation, L.Z.; resources, Z.X.; data curation, L.Z.; writing—original draft preparation, L.Z.; writing—review and editing, Z.X., X.W. and L.Z.; visualization, L.Z.; supervision, Z.X.; project administration, Z.X.; funding acquisition, Z.X. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Zaidi, S.S.; Ansari, M.S.; Aslam, A.; Kanwal, N.; Asghar, M.; Lee, B. A survey of modern deep learning based object detection models. Digit. Signal Process. 2022, 126, 103514. [Google Scholar] [CrossRef]
  2. Liu, L.; Ouyang, W.; Wang, X.; Fieguth, P.; Chen, J.; Liu, X.; Pietikäinen, M. Deep Learning for Generic Object Detection: A Survey. Int. J. Comput. Vis. 2019, 128, 261–318. [Google Scholar] [CrossRef]
  3. Zou, Z.; Chen, K.; Shi, Z.; Guo, Y.; Ye, J. Object detection in 20 years: A survey. Proc. IEEE 2023, 111, 257–276. [Google Scholar] [CrossRef]
  4. Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) [Preprint], Seattle, WA, USA, 13–19 June 2020. [Google Scholar] [CrossRef]
  5. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with Transformers. In Proceedings of the European Conference on Computer Vision—ECCV 2020, Glasgow, UK, 23–28 August 2020; pp. 213–229. [Google Scholar] [CrossRef]
  6. Long, X.; Deng, K.; Wang, G.; Zhang, Y.; Dang, Q.; Gao, Y.; Shen, H.; Ren, J.; Han, S.; Ding, E.; et al. PP-Yolo: An effective and efficient implementation of object detector. arXiv 2020, arXiv:2007.12099. [Google Scholar]
  7. Shorten, C.; Khoshgoftaar, T.M. A survey on image data augmentation for Deep Learning. J. Big Data 2019, 6, 60. [Google Scholar] [CrossRef]
  8. He, K.; Girshick, R.; Dollar, P. Rethinking ImageNet pre-training. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV) [Preprint], Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar] [CrossRef]
  9. Divvala, S.K.; Hoiem, D.; Hays, J.H.; Efros, A.A.; Hebert, M. An empirical study of context in object detection. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition [Preprint], Miami, FL, USA, 20–25 June 2009. [Google Scholar] [CrossRef]
  10. Heitz, G.; Koller, D. Learning spatial context: Using stuff to find things. In Lecture Notes in Computer Science; Springer Science + Business: Berlin, Germany, 2008; pp. 30–43. [Google Scholar] [CrossRef]
  11. Forsyth, D. Object detection with discriminatively trained part-based models. Computer 2014, 47, 6–7. [Google Scholar] [CrossRef]
  12. Park, D.; Ramanan, D.; Fowlkes, C. Multiresolution models for object detection. In Proceedings of the 11th European Conference on Computer Vision—ECCV 2010, Heraklion, Greece, 5–11 September 2010; pp. 241–254. [Google Scholar] [CrossRef]
  13. Dvornik, N.; Mairal, J.; Schmid, C. Modeling visual context is key to augmenting object detection datasets. In Proceedings of the 15th European Conference on Computer Vision—ECCV 2018, Munich, Germany, 8–14 September 2018; pp. 375–391. [Google Scholar] [CrossRef]
  14. Ayoub, S.; Gulzar, Y.; Reegu, F.A.; Turaev, S. Generating image captions using Bahdanau attention mechanism and transfer learning. Symmetry 2022, 14, 2681. [Google Scholar] [CrossRef]
  15. Yang, S.; Xiao, W.; Zhang, M.; Guo, S.; Zhao, J.; Shen, F. Image Data Augmentation for Deep Learning: A Survey. arXiv 2022, arXiv:2204.08610. [Google Scholar]
  16. Kaur, P.; Khehra, B.S.; Mavi, E.B.S. Data Augmentation for Object Detection: A Review. In Proceedings of the 2021 IEEE International Midwest Symposium on Circuits and Systems (MWSCAS), Lansing, MI, USA, 9–11 August 2021; IEEE Xplore: Piscataway, NJ, USA, 2021. Available online: ieeexplore.ieee.org/abstract/document/9531849 (accessed on 25 August 2022).
  17. Zhang, H.; Cisse, M.; Dauphin, Y.N.; Lopez-Paz, D. Mixup: Beyond Empirical Risk Minimization. arXiv 2018, arXiv:1710.09412. [Google Scholar]
  18. Venkataramanan, S.; Kijak, E.; Amsaleg, L.; Avrithis, Y. Alignmixup: Improving representations by interpolating aligned features. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022. [Google Scholar] [CrossRef]
  19. Takahashi, R.; Matsubara, T.; Uehara, K. Data augmentation using random image cropping and patching for Deep Cnns. IEEE Trans. Circuits Syst. Video Technol. 2020, 30, 2917–2931. [Google Scholar] [CrossRef]
  20. Qin, J.; Fang, J.; Zhang, Q.; Liu, W.; Wang, X.; Wang, X. Resizemix: Mixing Data with Preserved Object Information and True Labels. arXiv 2020, arXiv:2012.11101. [Google Scholar]
  21. Yun, S.; Han, D.; Oh, S.J.; Chun, S.; Choe, J.; Yoo, Y. Cutmix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar] [CrossRef]
  22. DeVries, T.; Taylor, G.W. Improved Regularization of Convolutional Neural Networks with Cutout. arXiv 2017, arXiv:1708.04552. [Google Scholar]
  23. Ghiasi, G.; Cui, Y.; Srinivas, A.; Qian, R.; Lin, T.-Y.; Cubuk, E.D.; Le, Q.V.; Zoph, B. Simple copy-paste is a strong data augmentation method for instance segmentation. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021. [Google Scholar] [CrossRef]
  24. Goodfellow, I.J.; Shlens, J.; Szegedy, C. Explaining and Harnessing Adversarial Examples. arXiv 2015, arXiv:1412.6572. [Google Scholar]
  25. Miyato, T.; Dai, A.M.; Goodfellow, I. Adversarial Training Methods for Semi-Supervised Text Classification. arXiv 2021, arXiv:1605.07725. [Google Scholar]
  26. Madry, A.; Makelov, A.; Schmidt, L.; Tsipras, D.; Vladu, A. Towards Deep Learning Models Resistant to Adversarial Attacks. arXiv 2019, arXiv:1706.06083. [Google Scholar]
  27. Shafahi, A.; Najibi, M.; Ghiasi, M.A.; Xu, Z.; Dickerson, J.; Studer, C.; Davis, L.S.; Taylor, G.; Goldstein, T. Adversarial Training for Free! arXiv 2019, arXiv:1904.12843. [Google Scholar]
  28. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative Adversarial Networks. arXiv 2014, arXiv:1406.2661. [Google Scholar] [CrossRef]
  29. Mirza, M.; Osindero, S. Conditional Generative Adversarial Nets. arXiv 2014, arXiv:1411.1784. [Google Scholar]
  30. Radford, A.; Metz, L.; Chintala, S. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. arXiv 2016, arXiv:1511.06434. [Google Scholar]
  31. Zhu, J.-Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017. [Google Scholar] [CrossRef]
  32. Ayoub, S.; Gulzar, Y.; Rustamov, J.; Jabbari, A.; Reegu, F.A.; Turaev, S. Adversarial approaches to tackle imbalanced data in machine learning. Sustainability 2023, 15, 7097. [Google Scholar] [CrossRef]
  33. Cubuk, E.D.; Zoph, B.; Mane, D.; Vasudevan, V.; Le, Q.V. AutoAugment: Learning augmentation strategies from data. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019. [Google Scholar] [CrossRef]
  34. Lim, S.; Kim, I.; Kim, T.; Kim, C.; Kim, S. Fast AutoAugment. arXiv 2019, arXiv:1905.00397. [Google Scholar]
  35. Zhong, Z.; Zheng, L.; Kang, G.; Li, S.; Yang, Y. Random erasing data augmentation. Proc. AAAI Conf. Artif. Intell. 2020, 34, 13001–13008. [Google Scholar] [CrossRef]
  36. Singh, K.K.; Yu, H.; Sarmasi, A.; Pradeep, G.; Lee, Y.J. Hide-and-Seek: A Data Augmentation Technique for Weakly-Supervised Localization and Beyond. arXiv 2018, arXiv:1811.02545. [Google Scholar]
  37. Chen, P.; Liu, S.; Zhao, H.; Jia, J. Gridmask Data Augmentation. arXiv 2020, arXiv:2001.04086. [Google Scholar]
  38. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. Yolov4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
  39. Zoph, B.; Cubuk, E.D.; Ghiasi, G.; Lin, T.Y.; Shlens, J.; Le, Q.V. Learning Data Augmentation Strategies for Object Detection. arXiv 2019, arXiv:1906.11172. [Google Scholar]
  40. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar] [CrossRef]
  41. Everingham, M.; Eslami, S.M.A.; Van Gool, L.; Williams, C.K.I.; Winn, J.; Zisserman, A. The Pascal Visual Object Classes Challenge: A retrospective. Int. J. Comput. Vis. 2014, 111, 98–136. [Google Scholar] [CrossRef]
  42. Lin, T.-Y.; Maire, M.; Belongie, S.; Bourdev, L.; Girshick, R.; Hays, J.; Perona, P.; Zitnick, C.L.; Dollár, P. Microsoft Coco: Common Objects in Context. In Proceedings of the 13th European Conference on Computer Vision—ECCV 2014, Zurich, Switzerland, 6–12 September 2014; pp. 740–755. [Google Scholar] [CrossRef]
  43. Gidaris, S.; Singh, P.; Komodakis, N. Unsupervised Representation Learning by Predicting Image Rotations. arXiv 2018, arXiv:1803.07728. [Google Scholar]
Figure 1. Images with similar backgrounds for the same class of object in the PASCAL VOC 2012 dataset.
Figure 2. The process of BIB-Copy-Paste.
Figure 3. The process of the instance-based background similarity assessment module. In the image, the red box represents the object's bounding box, while the yellow box represents its background box.
Figure 4. The process of the positive feature module. Yellow in the matrix represents a positive background region, while blue represents a negative background region.
Figure 5. BIB-Copy-Paste without pre-training. The red boxes represent the pasted objects.
Figure 6. Images of the BIB training set. The red boxes represent the pasted objects.
Figure 7. Experimental results on the MS COCO dataset. The red boxes represent the pasted objects.
Figure 8. Comparison of BIB-Copy-Paste and random paste.
Table 1. Experimental results of object detection on PASCAL VOC 2012 test set.
Backbones: YOLO (YOLOnet), Faster-RCNN (ResNet-101), BlitzNet (ResNet-50), SSD512 (VGG-16). Each detector was trained on the original VOC training set and on the BIB training set.

| Class | YOLO (VOC) | YOLO (BIB) | Faster-RCNN (VOC) | Faster-RCNN (BIB) | BlitzNet (VOC) | BlitzNet (BIB) | SSD512 (VOC) | SSD512 (BIB) |
|---|---|---|---|---|---|---|---|---|
| mAP | 57.9 | 58.9 | 73.8 | 74.8 | 79.0 | 80.1 | 78.5 | 79.6 |
| aero | 77.0 | 79.1 | 86.5 | 89.3 | 89.9 | 92.7 | 90.0 | 92.9 |
| bike | 67.2 | 67.4 | 81.6 | 81.7 | 85.2 | 85.4 | 85.3 | 85.4 |
| bird | 57.7 | 59.9 | 77.2 | 79.7 | 80.4 | 83.5 | 77.7 | 80.4 |
| boat | 38.3 | 42.1 | 58.0 | 60.9 | 67.2 | 70.6 | 64.3 | 67.1 |
| bottle | 22.7 | 22.7 | 51.0 | 51.0 | 53.6 | 53.7 | 58.5 | 58.5 |
| bus | 68.3 | 70.2 | 78.6 | 80.7 | 85.9 | 85.7 | 85.1 | 87.9 |
| car | 55.9 | 58.7 | 76.6 | 79.3 | 83.6 | 86.2 | 84.3 | 87.1 |
| cat | 81.4 | 81.7 | 93.2 | 93.5 | 93.8 | 94.4 | 92.6 | 93.3 |
| chair | 36.2 | 36.2 | 48.6 | 48.7 | 62.5 | 62.9 | 61.3 | 61.5 |
| cow | 60.8 | 62.9 | 80.4 | 82.8 | 84.0 | 86.7 | 83.4 | 85.8 |
| table | 48.5 | 48.5 | 59.0 | 58.9 | 65.8 | 65.8 | 65.1 | 65.1 |
| dog | 77.2 | 77.6 | 92.1 | 92.5 | 91.6 | 92.2 | 89.9 | 91.5 |
| horse | 72.3 | 72.5 | 85.3 | 85.4 | 86.6 | 87.4 | 88.5 | 88.6 |
| mbike | 71.3 | 71.6 | 84.8 | 85.3 | 87.6 | 88.3 | 88.2 | 88.7 |
| person | 63.5 | 63.4 | 80.7 | 80.7 | 84.6 | 84.6 | 85.5 | 85.4 |
| plant | 28.9 | 28.6 | 48.1 | 48.0 | 56.8 | 56.6 | 54.4 | 54.3 |
| sheep | 52.2 | 55.3 | 77.3 | 80.6 | 84.7 | 87.6 | 82.4 | 85.1 |
| sofa | 54.8 | 54.8 | 66.5 | 66.4 | 73.9 | 73.9 | 70.7 | 70.6 |
| train | 73.9 | 73.9 | 84.7 | 84.7 | 88.0 | 88.0 | 87.1 | 87.2 |
| tv | 50.8 | 51.2 | 65.6 | 66.4 | 75.7 | 75.9 | 75.6 | 76.0 |
Table 2. BIB-Copy-Paste pasting results on the PASCAL VOC 2012 training set with and without pre-training. Since some images contain multiple objects, * indicates counts with duplicate images removed; 5717 is the total number of images in the training set.
| Class | Training Set Images Containing the Object | New Pasted Objects (Non-Pre-Trained) | New Pasted Objects (Pre-Trained) |
|---|---|---|---|
| aeroplane | 328 | 102 | 233 |
| bicycle | 281 | - | 17 |
| bird | 399 | 71 | 209 |
| boat | 264 | 143 | 210 |
| bottle | 399 | - | - |
| bus | 219 | 40 | 137 |
| car | 621 | 75 | 191 |
| cat | 540 | 64 | 22 |
| chair | 656 | - | 29 |
| cow | 155 | 77 | 184 |
| diningtable | 318 | - | - |
| dog | 636 | - | 67 |
| horse | 238 | - | 39 |
| motorbike | 274 | - | 34 |
| person | 2142 | - | - |
| pottedplant | 289 | - | - |
| sheep | 171 | 92 | 216 |
| sofa | 359 | - | - |
| train | 275 | - | - |
| tvmonitor | 299 | - | 28 |
| Total | 5717 * | 664 | 1616 |
Table 3. Experimental results of object detector BlitzNet on PASCAL VOC 2012 test set.
Network: BlitzNet; backbone: ResNet-50.

| Class | VOC | random | BIB |
|---|---|---|---|
| mAP | 78.92 | 79.10 | 80.11 |
| aero | 89.9 | 91.7 | 92.7 |
| bike | 85.2 | 85.2 | 85.4 |
| bird | 80.4 | 83.3 | 83.5 |
| boat | 67.2 | 69.0 | 70.6 |
| bottle | 53.6 | 51.9 | 53.7 |
| bus | 85.9 | 83.9 | 85.7 |
| car | 83.6 | 85.1 | 86.2 |
| cat | 93.8 | 92.9 | 94.4 |
| chair | 62.5 | 62.2 | 62.9 |
| cow | 84.0 | 85.3 | 86.7 |
| table | 65.8 | 63.9 | 65.8 |
| dog | 91.6 | 91.2 | 92.2 |
| horse | 86.6 | 86.8 | 87.4 |
| mbike | 87.6 | 86.9 | 88.3 |
| person | 84.6 | 84.5 | 84.6 |
| plant | 56.8 | 56.4 | 56.6 |
| sheep | 84.7 | 85.6 | 87.6 |
| sofa | 73.9 | 73.2 | 73.9 |
| train | 88.0 | 87.7 | 88.0 |
| tv | 75.7 | 75.4 | 75.9 |