1. Introduction
High-Resolution Remote Sensing Images (HRRSI) represent the geometric structure and texture information of ground objects more accurately and clearly [1,2,3] and provide more precise earth observation [4,5,6]. Remote sensing scene classification, which labels HRRSI with predefined semantic categories, is one of the most direct ways to understand remote sensing scenes [7,8,9].
According to the level of the features used, there are three main approaches [2]: (1) low-level features; (2) middle-level features; (3) high-level features. Low-level methods mainly adopt shape, texture, and spectral features, such as the Global Color Histogram (GCH) [10,11]. These models are invariant to image rotation and translation [12], but their classification accuracy is low for complex scenes [8]. Middle-level methods emerged next; they encode low-level features into a dictionary and can describe details more powerfully [13]. Both of these approaches rely on handcrafted features and cannot cope with complex ground-object scenes [14]. In contrast, high-level features are learned automatically through Convolutional Neural Networks (CNN) [15] and fine-tuned CNNs [16], which achieve relatively high classification accuracy.
However, to achieve high classification accuracy, deep neural networks require large training sets. In practice, constructing such sets requires expert prior knowledge, and sample labeling is costly. Current solutions mainly exploit prior knowledge and experience [17], such as transfer learning [18], meta-learning [19], and metric learning [20]. These methods depend on the feature representation ability of the model and are neither intuitive nor simple [14]. In practical implementations, data augmentation (pixel changes, image rotation, geometric transformation) is usually used to enlarge the training set, but the samples it produces are limited in both diversity and quantity. Sample generation is therefore one of the effective ways to address these problems: it can substantially increase the number of samples while keeping them rich and diverse.
Generative Adversarial Networks (GAN) are a typical family of sample-generation models [14]. A GAN automatically learns the distribution of the data samples through competition between a generator and a discriminator and generates a dataset similar to the original samples. For example, Song et al. [21] proposed a spatiotemporal fusion method for remote sensing images based on a generative adversarial network, which handles the one Landsat-MODIS prior image pair case (OPGAN). Chen et al. [9] proposed a model combining remote sensing road-scene neighborhood probability enhancement with an improved conditional generative adversarial network, consisting of a road-scene classification section and a fine-road segmentation section. Lin et al. [22] proposed a multi-layer feature-matching GAN to improve scene-classification accuracy, and Yu et al. [23] integrated an attention mechanism into a GAN for the same purpose. However, these are unsupervised models: the generated samples carry no label information and therefore cannot represent specific categories of scenes. Later, Ma et al. [24] proposed the sifting GAN to enhance the authenticity of the generated samples, but that model mainly extracts global high-level semantic features and ignores the correlation between spatial information and features [25,26].
In this study, we propose a novel supervised adversarial Lie Group feature learning network. The network is supervised, which ensures that the generated samples contain category information, and it extracts more discriminative features, including external physical-structure features and internal socio-economic semantic features, through Lie Group machine learning.
The main contributions of the study are as follows:
To address the problem of limited data samples, especially that a traditional GAN cannot generate samples containing scene-category information, we propose a novel supervised adversarial Lie Group feature learning network. The model works in a supervised mode: category information and data samples are fed into the generator and discriminator simultaneously, and a supervised adversarial loss is optimized so that the generated samples carry category information. In addition, building on our previous feature learning, we add the internal socio-economic semantic features of the scene to further improve the representation ability of the scene model;
To make the generated samples contain richer semantic information, we design an object-scale sample generation strategy. This strategy obtains data samples, and hence semantic features, at different scales, ensuring that a single generated sample carries more detailed and richer semantic information. It also effectively suppresses mode collapse during training;
To verify the feasibility and effectiveness of our model, we carried out extensive experiments. The results show that, compared with other models (both classic and state-of-the-art), our method effectively generates data samples that contain scene-category information, carry richer and more detailed semantic features, and are diverse.
2. Method
To address the low performance of deep network models when data samples are limited, we propose a novel supervised adversarial Lie Group feature learning network that effectively generates data samples with category information. Our method differs from a traditional GAN in two main ways: (1) it takes category information and data samples as joint input, constrains the loss function with the category information, and generates samples containing category information; (2) an object-scale sample generation strategy generates samples at different scales, ensuring that the generated samples contain richer feature information.
Our proposed model is shown in Figure 1. First, the data samples and their corresponding category information are fused as the input of the model. Then, Lie Group feature learning is performed on the data, learning the external physical-structure features, internal socio-economic semantic features, and high-level semantic features of the scene. Next, the learned features and the corresponding category information are fed into the network, which progressively produces labeled fake samples from small scale to large scale. Finally, after iterative training, the generated labeled fake samples are fused with the original labeled samples.
2.1. Lie Group Feature Learning
In previous studies [2,15], we adopted Lie Group machine learning to select, extract, and learn features from the sample data, and then constructed the Lie Group region covariance matrix to characterize the features and the relationships between them. However, we found that this model has difficulty distinguishing remote sensing scenes with the same or similar geometric structure and spatial layout. To address this problem, we supplement the feature learning process with the internal socio-economic semantic features of the scene; compared with our previous studies, this is the main difference of the proposed model.
In the actual algorithm, the internal socio-economic semantic features of the objects in a scene are extracted from Amap and crawled by main category. To avoid ambiguity, we fuse intermediate categories and subcategories into point semantic objects: for example, museums, libraries, archives, universities, secondary schools, primary schools, and kindergartens are used directly in this study instead of a coarse category such as "science". A total of 45 categories are used; categories can be added or deleted according to the needs of the actual scenario.
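As a hypothetical sketch of how such point-semantic objects could be turned into a scene-level feature (the category names, the five-entry list, and the histogram form are illustrative, not the paper's exact pipeline), one could count the POIs of each category inside a scene and normalize:

```python
from collections import Counter

# Hypothetical subset of point-semantic-object categories (the study uses 45).
CATEGORIES = ["museum", "library", "university", "primary_school", "kindergarten"]

def socioeconomic_feature(poi_list):
    """Normalized histogram of point-semantic-object categories found in one
    scene; a simplified sketch of a socio-economic semantic feature vector."""
    counts = Counter(p for p in poi_list if p in CATEGORIES)
    total = sum(counts.values()) or 1  # avoid division by zero for empty scenes
    return [counts[c] / total for c in CATEGORIES]

print(socioeconomic_feature(["museum", "library", "library", "cafe"]))
```

POIs outside the category list (such as `"cafe"` above) are simply ignored, so the feature only reflects the chosen socio-economic categories.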
2.2. Supervised Condition Generation
A traditional GAN is an unsupervised generative model: its inputs are unordered and unlabeled, and it learns the characteristics of these data samples. Once trained, it generates fake samples that carry no scene-category information, and such samples are of little use for remote sensing research.
To address this problem, we draw on the Conditional Generative Adversarial Network (CGAN) proposed by Mirza and Osindero [27], which can generate data samples with category information. As shown in Figure 1, our model is thereby transformed from unsupervised to supervised: the scene-category information and the corresponding data samples are taken as joint input, and in the condition-generation module the random noise and the condition information form a joint hidden layer.
In the discriminator module, the original and generated datasets are concatenated along the channel dimension, and the corresponding category information is also fed into the model. Compared with a traditional discriminator, ours must make two additional judgments: (1) whether the generated data match the original data; and (2) whether the generated data are valid and authentic.
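The joint conditional input can be sketched as follows; this is a minimal illustration of the CGAN-style fusion, and the vector sizes and `num_classes` value are placeholders rather than those of the actual model:

```python
import numpy as np

def fuse_noise_and_label(z, label, num_classes):
    """CGAN-style joint input: concatenate the noise vector with a
    one-hot encoding of the scene category (illustrative sketch)."""
    onehot = np.zeros(num_classes)
    onehot[label] = 1.0
    return np.concatenate([z, onehot])

z = np.random.randn(100)                       # random noise vector
joint = fuse_noise_and_label(z, label=3, num_classes=10)
print(joint.shape)                             # (110,)
```

The generator consumes the fused vector, so every generated sample is tied to one category; the discriminator receives the category the same way and can judge sample-label consistency, not only realism.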
2.3. Object-Scale Sample Generation Strategy
In this subsection, we propose the object-scale sample generation strategy: the generator and discriminator operate scale by scale, gradually generating data samples from small scale to large scale.
2.3.1. Generator
As shown in Figure 2, the random noise and the category information are fused into a tensor and used together as the input of the network. In our previous study [2], we found that ordinary convolution has a relatively small receptive field and a large number of parameters, as shown in Table 1. We therefore adopt parallel dilated convolution to expand the receptive field and learn semantically sensitive transformations.
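The receptive-field gain from dilation can be checked with a standard one-line formula (a general property of dilated convolution, not specific to our model): a kernel of size k with dilation d covers k + (k - 1)(d - 1) positions while keeping the same k weights.

```python
def effective_kernel_size(k, dilation):
    """Effective receptive width of one dilated-convolution layer:
    k + (k - 1) * (dilation - 1)."""
    return k + (k - 1) * (dilation - 1)

# A 3x3 kernel with dilation 2 covers a 5x5 area with the same
# 9 weights, so the receptive field grows at no parameter cost.
for d in (1, 2, 4):
    print(d, effective_kernel_size(3, d))
```

This is why dilated convolution enlarges the receptive field without the parameter growth of simply using bigger kernels.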
Generally, the traditional module uses the rectified linear unit (ReLU) activation function. However, as shown in Figure 3, ReLU is clamped to zero on the negative semi-axis, which can make potential gradients vanish during training, preventing the generator from producing effective image samples and harming the accuracy of the model. We therefore adopt the scaled exponential linear unit (SeLU) activation function, whose biggest difference from ReLU is that it takes negative values on the negative semi-axis.
The mathematical expression is as follows:

$$\mathrm{SeLU}(x) = \lambda \begin{cases} x, & x > 0 \\ \alpha\left(e^{x} - 1\right), & x \le 0 \end{cases}$$
The main advantages of adopting the SeLU activation function instead of the traditional ReLU are as follows: (1) the saturation region effectively suppresses large variances in the lower layers; (2) the parameters make the slope greater than 1 in part of the domain, which can amplify variances that are too small; (3) the output value effectively controls the mean. After repeated experiments and analysis of the results, the two hyperparameters λ and α are set to the standard self-normalizing values (λ ≈ 1.0507, α ≈ 1.6733). With the optimized generator module, mode collapse and gradient vanishing are effectively suppressed, making the generated dataset more authentic and diverse.
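As a minimal sketch, SeLU with the standard self-normalizing constants can be implemented as follows (using numpy; the constants are the commonly published values, rounded):

```python
import numpy as np

# Standard self-normalizing constants for SeLU (rounded).
LAMBDA, ALPHA = 1.0507, 1.6733

def selu(x):
    """SeLU: lambda * x for x > 0, lambda * alpha * (exp(x) - 1) otherwise.
    Unlike ReLU, it is negative (not zero) on the negative semi-axis."""
    return LAMBDA * np.where(x > 0, x, ALPHA * np.expm1(x))

print(selu(np.array([-2.0, 0.0, 2.0])))
```

Note the saturation: as x goes to negative infinity, the output approaches -λα ≈ -1.758 instead of diverging, which is the "saturation zone" the text refers to.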
In previous studies, we found that deepening the model can lead to model degradation [2]. We therefore adopt a residual network with eight residual blocks. As shown in Figure 2, each residual block has the same layout: two parallel dilated convolutional layers, two Batch Normalization (BN) layers, and two SeLU activations. To make training more effective and the extracted features richer, each residual block uses a shortcut (skip connection) that adds its input to its output.
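The residual-block layout described above (two transforms, each followed by SeLU, plus a skip connection) can be sketched as a toy example; plain callables stand in for the dilated convolutions and BN layers, so this shows only the data flow, not the real layers:

```python
import numpy as np

def selu(x, lam=1.0507, alpha=1.6733):
    return lam * np.where(x > 0, x, alpha * np.expm1(x))

def residual_block(x, layer1, layer2):
    """Toy residual block: two transforms with SeLU activations, then a
    shortcut that adds the block's input back to its output. `layer1` and
    `layer2` are illustrative stand-ins for dilated conv + BN."""
    out = selu(layer1(x))
    out = selu(layer2(out))
    return out + x  # the skip connection

x = np.ones(4)
y = residual_block(x, lambda v: 0.5 * v, lambda v: 0.5 * v)
```

Because the shortcut passes the input through unchanged, gradients can flow around the two transform layers, which is what counteracts the degradation of deep models.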
After a certain amount of training at each resolution, we continue with parallel dilated convolution, BN, SeLU, and convolutional-layer operations so that the model can effectively learn detailed feature information. The last layer of the generator module adopts the Lie Group Tanh activation function [25,26].
In addition, during generator training the parameters of each convolutional layer are frozen at the previous resolution, and parallel dilated convolution requires far fewer parameters than ordinary convolution. The generator is therefore relatively cheap to train, with a comparatively small parameter count. Through these operations, fake data samples can be generated at different scales while containing richer and more detailed information.
2.3.2. Discriminator
As shown in Figure 4, the discriminator module takes the data samples and the category information together as input and applies the corresponding parallel dilated convolutions and SeLU activations. To enhance its classification ability, eight parallel dilated convolutions are used to extract features from the data samples. After two dense layers, a Lie Group Sigmoid activation outputs the result. The traditional Sigmoid activation is suited to vector samples in Euclidean space, whereas our method operates on a Lie Group manifold, which is not Euclidean; moreover, we use a matrix representation of the samples, and matrix multiplication does not satisfy the commutative law. We therefore adopt the previously proposed Lie Group Sigmoid and Lie Group Tanh activation functions; for details, please refer to [25,26].
The generator and discriminator described above are synchronized and mirror each other, so this strategy effectively suppresses GAN collapse. During training, all modules in the generator and discriminator are trainable, and new modules can be added to the model smoothly without disturbing the lower-resolution layers. The low-resolution samples produced by the generator are also relatively stable, which reduces the number of samples the discriminator must process. Mode collapse during training is therefore effectively suppressed.
2.4. Probability Enhancement Strategy in Ground Object Neighborhood
Neighborhood information refers to the feature information of adjacent ground objects; it is conventionally divided into four-neighborhood and eight-neighborhood forms and represents the neighborhood features around a pixel. The spatial correlation of scene objects is usually regarded as the connectivity consistency of regional scenes. Inspired by this, we extend pixel-neighborhood theory to remote sensing scenes and design a probability enhancement strategy based on neighborhood correlation to improve the accuracy of the sample scenes.
A traditional model whose data samples lack geospatial coordinate information generally cannot construct ground-object neighborhood information. However, the regional images here carry neighborhood spatial coordinate information, so the ground-object neighborhood can be constructed, as shown in Figure 5.
As shown in Figure 5, the four-neighborhood covers the four adjacent regions (up, down, left, and right), while the eight-neighborhood covers a larger range of eight adjacent regions. The ground-object scene of interest is a small central region that is easily missed, leading to missing or wrong scene-category information. In this study, the eight-neighborhood approach is adopted to enhance the correlation between ground objects. The mathematical expression is as follows:
where $P_i$ represents the probability of the $i$-th region.
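A minimal sketch of eight-neighborhood enhancement follows, assuming a simple uniform average of each region's probability with its eight neighbors; the paper's exact weighting may differ, so this only illustrates the idea of pulling an isolated low-confidence region toward its neighborhood:

```python
import numpy as np

def enhance_probabilities(P):
    """Eight-neighborhood probability enhancement (illustrative sketch):
    each region's probability is averaged with its eight neighbors,
    i.e. a 3x3 window; edges are padded by replication."""
    H, W = P.shape
    padded = np.pad(P, 1, mode="edge")
    out = np.zeros_like(P)
    for i in range(H):
        for j in range(W):
            out[i, j] = padded[i:i + 3, j:j + 3].mean()  # self + 8 neighbors
    return out

P = np.array([[0.9, 0.8, 0.9],
              [0.8, 0.1, 0.8],   # low-confidence center region
              [0.9, 0.8, 0.9]])
print(enhance_probabilities(P)[1, 1])  # raised toward its neighbors
```

The isolated center value 0.1 is lifted substantially because its eight neighbors are all confident, which is exactly the "small central region easily missed" case the strategy targets.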
2.5. Supervised Adversarial Loss
Our approach can generate data samples containing category information, and the corresponding loss function is:

$$\min_{G}\max_{D} V(D, G) = \mathbb{E}_{x \sim p_{\mathrm{data}}}\big[\log D(x, c)\big] + \mathbb{E}_{z \sim p_{z}}\big[\log\big(1 - D(\tilde{x}_{s}, c)\big)\big]$$

where $D(\cdot)$ represents the final output of the discriminator, $\tilde{x}_{s} = G(z, c)$ represents the fake data at the different scales $s$, and $c$ represents the category information of the data sample.
Real data samples, generated fake data samples, and the corresponding category information are all used as input to the model, so the discriminator in this study differs from discriminators in other models. The loss function corresponding to our discriminator is:

$$\mathcal{L}_{D} = -\,\mathbb{E}_{x \sim p_{\mathrm{data}}}\big[\log D(x, c)\big] - \sum_{s}\mathbb{E}_{z \sim p_{z}}\big[\log\big(1 - D(\tilde{x}_{s}, c)\big)\big]$$

where the $\tilde{x}_{s}$ represent the data samples generated at the different scales.
2.6. Scene Classification
Through the above operations, a large number of labeled data samples can be obtained. Combining them with the original dataset yields a new, larger dataset. Finally, we train both our previously proposed Lie Group machine learning scene-classification model and typical scene-classification models on the obtained dataset and report the results.
5. Conclusions
In this study, we have proposed an adversarial Lie Group feature learning network. Its biggest difference from other models is that it can generate data samples containing category information; fusing these generated fake samples with the original samples yields a richer sample set. The model uses an object-scale generation strategy that effectively generates samples at different scales. In addition, for feature learning we build on our previous research by supplementing the internal socio-economic semantic features of scenes, further enhancing their representation ability.
Since there is still a gap between the generated data samples and the original ones, in future work we will continue to study Lie Groups in depth and optimize the approach to generate more realistic data samples.