1. Introduction
Buildings are the carriers of human productive activities, and their information reflects many characteristics of the urban environment [1]. Quickly acquiring accurate and reliable building information provides effective references for the construction of smart cities, which can improve urban livability and development [2], and also supports urban planning [3], disaster management [4], and map updating [5]. Emerging high-resolution (HR) remote sensing images (RSIs) provide richer building details, alleviating the difficulty of acquiring small buildings from low-resolution images [6,7] and providing the aerial-photograph data foundation for accurately acquiring building information. Automatic building detection from HR RSIs provides a practical approach for building information acquisition [8] and has received increasing attention in recent years. However, accurately detecting buildings from HR RSIs remains challenging due to factors such as complex backgrounds, diverse appearances, occlusions, and shadows. Thus, developing new effective building detection methods is a challenging and valuable line of research.
Building detection methods can be classified into two categories: traditional and deep learning-based methods. Traditional methods rely on artificial features, mainly derived from physical characteristics such as building contours [9] and spectral features [10], and are typically combined with classical machine learning methods to detect buildings [11]. However, these methods are sensitive to noise and illumination variations. In addition, the features must be hand-crafted for new regions and new data sources, which typically requires extensive engineering skill and geoscience expertise. Deep learning-based methods are data-driven, learning features from labeled data [12,13]. Object detection models without manual feature engineering, such as Faster R-CNN [14] and SSD [15], have been extended to building detection and have become the dominant approach owing to the advancement of deep learning techniques for object detection. Liu et al. [16] propose a hierarchical building detection framework to extract building features at different scales and spatial resolutions. A locally constrained framework is proposed to improve the detection of small and densely distributed buildings [17]. A feature split–merge–enhancement network based on the SSD architecture is proposed to better detect ground objects with scale differences [18]. These methods are fully supervised and require a large amount of annotated data for training, which is costly and time-consuming.
Recently, semi-supervised learning (SSL) has been shown to be effective in reducing the burden of sample annotation by utilizing a limited amount of labeled data and a large amount of unlabeled data, and several semi-supervised object detection (SS-OD) methods have been derived for detecting objects from RSIs [19,20,21]. However, these methods rely on an anchor mechanism for detection, which is very sensitive to the size and aspect ratio of detected objects. In particular, the wide variety of buildings and their diverse shapes pose a great challenge to accurately detecting buildings in HR RSIs.
In this work, an SSL building detection (SS-BD) framework is proposed. Specifically, a color and Gaussian augmentation (CGA) module is developed to train a fully convolutional one-stage (FCOS) object detector with labeled images, addressing the scarcity of annotated samples for detector training. Then, a consistency learning (CL) module is derived from the teacher–student network to impose consistent predictions between differently perturbed images, improving detection on the unlabeled images. Finally, the student model is trained with a joint loss, and the teacher model is refined by averaging the weights of the student model over training steps. The proposed framework removes the predefined anchor boxes and provides a more feasible and efficient detection pipeline than anchor-based detection frameworks. To the best of our knowledge, this method is the first to introduce SSL for building detection from HR RSIs. The contributions of this study are summarized as follows:
This study proposes a semi-supervised framework for building detection from HR RSIs, which leverages the information from the unlabeled RSIs to improve the semi-supervised building detection performance. This study provides a methodological reference for the various object detection tasks on RSIs.
A CGA module is developed to increase the diversity of building features, which enhances the detection ability of an anchor-free detector on the labeled RSIs.
A CL module is designed to impose consistent prediction between different perturbed unlabeled RSIs, which improves the detection accuracy and generalization of the detectors.
The experimental results on three datasets demonstrate that the proposed framework is superior to several state-of-the-art building detection methods and SS-OD methods, achieving higher accuracy (0.736, 0.704, and 0.370 on the WHU, CrowdAI, and TCC building datasets, respectively).
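The CGA idea listed above, perturbing labeled images with color jitter and Gaussian noise to diversify building features, can be sketched as follows. This is a minimal illustration: the function name, parameter ranges, and noise model are assumptions, not the paper's exact implementation.

```python
import numpy as np

def color_gaussian_augment(image, rng, brightness=0.2, noise_std=0.05):
    """Sketch of a color-and-Gaussian augmentation: random brightness
    scaling (a simple color jitter) followed by additive Gaussian noise.
    `brightness` and `noise_std` are illustrative hyperparameters."""
    scale = 1.0 + rng.uniform(-brightness, brightness)      # color (brightness) jitter
    noisy = image * scale + rng.normal(0.0, noise_std, image.shape)
    return np.clip(noisy, 0.0, 1.0)                         # keep pixel values in [0, 1]

rng = np.random.default_rng(0)
img = np.full((4, 4, 3), 0.5)            # toy 4x4 RGB image with constant intensity
aug = color_gaussian_augment(img, rng)
print(aug.shape)  # (4, 4, 3)
```

Applying several such randomized perturbations to each labeled image effectively enlarges the training set seen by the detector without new annotations.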
The rest of this paper is organized as follows: Section 2 reviews related works; Section 3 details the proposed framework; Section 4 presents the experimental results; Section 5 describes the ablation study and discusses factors that may affect the performance of the proposed framework; and finally, Section 6 concludes this study.
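As orientation before the detailed framework: the teacher refinement described in the Introduction, averaging the student's weights over training steps, is an exponential moving average (EMA) update in the style of Mean-Teacher. A minimal sketch, with a hypothetical smoothing coefficient `alpha`:

```python
import numpy as np

def ema_update(teacher_weights, student_weights, alpha=0.999):
    """Refine teacher parameters as an exponential moving average of the
    student parameters. `alpha` close to 1 makes the teacher change slowly,
    smoothing over noisy student updates; 0.999 is an assumed value."""
    return {name: alpha * teacher_weights[name] + (1.0 - alpha) * student_weights[name]
            for name in teacher_weights}

# Toy example with a single scalar "parameter":
teacher = {"w": np.array(1.0)}
student = {"w": np.array(0.0)}
teacher = ema_update(teacher, student, alpha=0.9)
print(teacher["w"])  # 0.9
```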
2. Related Literature
The work related to this study can be classified into three categories: building detection from RSIs, SS-OD, and SS-OD from RSIs. These categories are reviewed in the following three subsections.
2.1. Building Detection from RSIs
Traditional building detection approaches are mainly handcrafted feature-based methods, which detect buildings from RSIs by utilizing physical building characteristics such as geometric shape and context information. Huang et al. [22] improve the morphological building index detector by considering the spectral, geometrical, and contextual information of buildings. Furthermore, a geometric saliency-based method is proposed for accurate building detection with a new geometric building index [23]. An automatic building detection model uses texture information to better distinguish between trees and buildings [24]. A building detector is created by using invariant color features and shadow information of buildings [25]. Handcrafted feature-based methods for automatic building detection relieve the pressure of manual visual interpretation to a large extent. However, the design of handcrafted features is time-consuming, laborious, and depends on empirical parameter settings, making it hard to improve the generalization and efficiency of these methods.
In the fields of aerial photogrammetry and remote sensing, building detection algorithms based on deep convolutional neural networks show excellent detection performance owing to their ability to extract abstract image features [6]. Existing algorithms can be divided into semantic segmentation-based and object detection-based methods. Segmentation-based methods [26,27,28,29] are mainly built on fully convolutional networks (FCNs) [30] to realize pixel-level building classification. Numerous semantic segmentation models have been generalized to building detection tasks and improve detection performance [5,6,31,32,33]. This work aims to introduce object detection algorithms into automatic building detection. These algorithms are mainly categorized into anchor-based and anchor-free methods. The former require predefined anchor boxes to generate a series of region proposals; representative methods include Faster R-CNN [14], YOLO v1 [34], SSD [15], and RetinaNet [35]. Faster R-CNN uses a region proposal network (RPN) to generate proposal boxes and a proposal prediction network to efficiently classify them. Anchor-free methods remove the anchor boxes and attempt to predict object boxes by detecting the key points or center-ness of objects; typical detectors are CornerNet [36] and FCOS [37]. CornerNet [36] predicts a bounding box as a pair of corners in a one-stage process. FCOS [37] is the first fully convolutional detector that predicts the category, location, and center-ness of bounding boxes in a per-pixel fashion. Several studies have advanced object detection research in the field of building extraction. An automatic building detection method is proposed to identify roof shape types from RSIs [38]. Hamaguchi et al. [39] propose a building detection method that handles buildings of various sizes. A CNN-based framework with a suitable ROI scale is designed for object detection in HR RSIs [40]. DAPNet [41] detects objects in sparse and dense scenes of optical RSIs by improving the architecture of the Faster R-CNN model. An FER-CNN model integrating new boundary detection is proposed to improve the accuracy of building detection [42].
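For reference, the center-ness target that FCOS [37] regresses at each location has a closed form: given the distances (l, t, r, b) from a location to the four sides of its ground-truth box, the target is sqrt(min(l, r)/max(l, r) · min(t, b)/max(t, b)), which is 1 at the box center and decays toward 0 near the border. A small sketch:

```python
import math

def fcos_centerness(l, t, r, b):
    """FCOS center-ness target for a location inside a ground-truth box,
    given its distances (l, t, r, b) to the left, top, right, and bottom
    sides of the box."""
    return math.sqrt((min(l, r) / max(l, r)) * (min(t, b) / max(t, b)))

print(fcos_centerness(5, 5, 5, 5))  # 1.0 at the exact box center
print(fcos_centerness(1, 5, 9, 5))  # lower for an off-center location
```

Down-weighting low-center-ness predictions at inference suppresses the low-quality boxes that per-pixel prediction would otherwise produce near object borders.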
Semantic segmentation-based and object detection-based methods require massive pixel-level and bounding box annotated samples for model training, respectively. These methods depend heavily on large numbers of labeled samples, making them impractical for application to large-scale regions. The motivation of this study arises from the need to detect buildings with unlabeled data and to enhance model performance through SSL.
2.2. SS-OD
SSL for object detection has received much attention in recent years. SS-OD is aimed at learning detection models based on labeled and unlabeled images, which can leverage a large number of unlabeled images to improve the performance of object detection. SS-OD methods can be classified into two categories: consistency-based and pseudo labeling-based methods.
The consistency-based methods apply consistency regularization, which forces the detection model to produce consistent predictions for different views or perturbations of the same input image [43,44,45,46]. Accordingly, the models are regularized and their robustness to noise and variations is enhanced. For example, Mean-Teacher [47] is a consistency-based method that averages model weights instead of label predictions. Jeong et al. [46] apply a consistency constraint between an unlabeled image and its flipped version, proposing a new consistency loss for both the classification of a bounding box and the regression of its location. Tang et al. [48] propose a consistency-based proposal learning module that learns noise-robust proposal features and predictions via consistency losses. ISD [49] addresses the problems caused by interpolation regularization by defining different types of interpolation-based loss functions to improve SSL performance.
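The flip-consistency idea of Jeong et al. [46] can be illustrated with a toy loss: predictions on an image and on its horizontally flipped version are compared after mirroring the flipped predictions back into the original frame. The mean-squared-error form below is a simplified stand-in for the losses used in the cited works:

```python
import numpy as np

def flip_consistency_loss(probs, probs_flipped):
    """Toy flip-consistency loss over per-location class scores of shape
    (H, W, C): mirror the flipped-image predictions back along the width
    axis, then penalize disagreement with a mean squared error."""
    realigned = probs_flipped[:, ::-1, :]   # undo the horizontal flip
    return float(np.mean((probs - realigned) ** 2))

p = np.random.default_rng(1).random((8, 8, 3))      # toy per-location class scores
loss_same = flip_consistency_loss(p, p[:, ::-1, :])  # a perfectly consistent pair
print(loss_same)  # 0.0
```

Minimizing such a loss on unlabeled images requires no annotations, which is what makes consistency regularization attractive for SSL.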
The pseudo labeling-based methods attempt to generate highly confident pseudo labels on unlabeled images to better train a detection model; pseudo label generation and utilization are crucial to the success of SS-OD [50]. STAC [51] generates stable pseudo labels from weakly augmented images and updates the model by enforcing consistency on strongly augmented images. Wang et al. [52] propose a self-training method for object detection, called SSM, which makes region proposals reliable via cross-image validation and fuses the model with active learning. The Unbiased Teacher model addresses the pseudo labeling bias issue and produces more accurate pseudo labels [53]. An effective SS-OD model uses instant teaching and a co-rectify scheme to increase the number of pseudo labels [54]. A soft teacher mechanism and a box-jittering approach are incorporated into an end-to-end SS-OD method [55]. Chen et al. [56] propose a DenSe Learning method to improve the stability and quality of pseudo labels, thus improving SS-OD detection performance.
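Common to the pseudo labeling-based methods above is a confidence-filtering step: only high-scoring teacher detections on unlabeled images are kept as pseudo labels for training the student. A minimal sketch, with an illustrative threshold:

```python
def filter_pseudo_labels(detections, threshold=0.9):
    """Keep only high-confidence detections as pseudo labels. Each
    detection is a (box, label, score) tuple; 0.9 is an assumed
    threshold, not a value from any of the cited papers."""
    return [(box, label, score) for box, label, score in detections
            if score >= threshold]

dets = [((0, 0, 10, 10), "building", 0.95),
        ((5, 5, 20, 20), "building", 0.40)]
print(filter_pseudo_labels(dets))  # only the 0.95-score box survives
```

The threshold trades off pseudo-label quality against quantity, which is exactly the bias/coverage tension that methods such as Unbiased Teacher [53] and DenSe Learning [56] address.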
Those methods mentioned above aim to enhance the model performance on natural images. However, detecting objects from RSIs poses a significant challenge due to diverse environmental factors, such as shadows, vegetation cover, complex roofs, dense building areas, and oblique angles.
2.3. SS-OD from RSIs
Emerging SS-OD techniques have been derived for object detection from RSIs [19,20,21,57] and effectively reduce the burden of sample annotation. These techniques utilize a limited amount of labeled RSIs and a large number of unlabeled RSIs for model training, improving the performance and generalization ability of object detectors by reducing the distribution gap between labeled and unlabeled RSIs. For example, Liao et al. [19] propose an improved Faster R-CNN for semi-supervised SAR target detection. Chen et al. [20] develop a Rotation-Invariant and Relation-Aware cross-domain adaptation object detection (CDAOD) network in SSL to address the rotation diversity of HR RSIs. Wang et al. [21] present an SSL-based object detection framework for SAR ship detection, which generates pseudo labels by using a label propagation strategy and trains a Faster R-CNN network in SSL. Du et al. [57] propose a semi-supervised SAR ship detection network via scene characteristic learning to enhance feature representation for ship targets and clutter.
The above methods for RSIs are anchor-based, employing the anchor mechanism to generate dense anchor boxes. In practice, anchor-based methods are sensitive to the size and aspect ratio of detected objects. However, building sizes and shapes are diverse, making these methods less suitable for detecting buildings from HR RSIs.
6. Conclusions
In this work, a semi-supervised framework is proposed to alleviate the need for large labeled datasets in building detection from HR RSIs, which provides an important reference for resource allocation and the sustainable development of smart cities. The experimental results show that the proposed framework achieves an AP of 0.380, 0.365, and 0.133, and scores of 0.736, 0.704, and 0.370, on the WHU, CrowdAI, and TCC datasets, respectively. The proposed framework increases the diversity of building features with the color and Gaussian data augmentation strategies and improves the detection ability on unlabeled images through the introduction of consistency learning. Compared with the competitive approaches, the proposed framework achieves the best detection accuracy over multiple datasets, showing good generalization ability, and performs well in challenging scenarios such as road objects with colors similar to those of buildings, building areas with complex shapes obscured by trees, and images with dense small buildings.
Furthermore, additional features, such as building center-ness, can be introduced into the framework to further improve building detection from HR RSIs. Moreover, the proposed framework does not outperform supervised methods when labeled RSIs are relatively plentiful; addressing this limitation is left for future investigation.