1. Introduction
By analyzing hyperspectral images (HSIs), we can exploit both abundant spatial information and rich spectral information [1,2,3]. Compared with RGB images, HSIs enable precise per-pixel classification and can therefore be applied in fields such as mineral detection, disaster prevention and precision agriculture [4,5,6]. In environmental protection, HSIs can be used to detect gases [7], oil spills [8], water quality [9,10] and vegetation coverage [11,12].
Each pixel of a hyperspectral image contains hundreds of spectral bands, so the image forms a three-dimensional data cube in which every spectral band can be seen as a 2D image. By analyzing the vast amount of information in this 3D cube, each pixel can be assigned a unique label, discriminating the various classes as accurately as possible. With the rapid improvement of classification accuracy, HSIs have become fundamental to applications in the military, agriculture and astronomy.
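The per-pixel samples fed to a classifier are typically small spatial neighborhoods cut from this cube. The following is a minimal numpy sketch of that patch extraction; the array sizes are illustrative (an Indian Pines-like cube) and the function name is ours, not from any specific implementation.

```python
# Sketch of how an HSI cube yields per-pixel samples; shapes are illustrative.
import numpy as np

def extract_patch(cube, row, col, patch_size=7):
    """Return the (patch_size, patch_size, bands) neighborhood around a pixel,
    zero-padding at the image borders so edge pixels are handled too."""
    half = patch_size // 2
    padded = np.pad(cube, ((half, half), (half, half), (0, 0)), mode="constant")
    return padded[row:row + patch_size, col:col + patch_size, :]

# A toy cube: 145 x 145 spatial pixels, 200 spectral bands.
hsi = np.random.rand(145, 145, 200).astype(np.float32)
patch = extract_patch(hsi, row=0, col=72)
print(patch.shape)  # (7, 7, 200)
```

Each such patch is the "cube around a pixel" that 2D/3D convolutional classifiers operate on.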
Early on, researchers mainly focused on traditional machine learning methods such as logistic regression [13], neural networks [14], principal component analysis (PCA) [15] and support vector machines (SVM) [16]. However, these methods cannot fully exploit the non-linear information in high-dimensional hyperspectral data.
In the deep learning era, convolutional neural networks (CNNs) have achieved satisfactory results with the invention of different models. A CNN can effectively capture features from raw pixels by exploiting the shape, layout and texture of ground objects, combining both spatial and spectral information. In [17], a 2D-CNN and a 1D-CNN are combined to extract more useful classification features from spatial and spectral information. Since a 3D-CNN has more advantages in processing 3D information, Li et al. [18] and Chen et al. [19] developed classification frameworks consisting of 3D convolution blocks to process the cubes around each pixel. In [20], Xu used a dual-channel model combining a 3D-CNN and a 2D-CNN to learn useful spatial and spectral information of HSIs; the extracted information is then merged and fed into a classification block of fully connected layers to improve accuracy. Recently, state-of-the-art methods for hyperspectral image classification have reached 99% classification accuracy when sufficient labeled data are available.
However, these good results are obtained only when labeled data are sufficient. While a human can recognize new classes after seeing only a few labeled samples, the performance of these methods decreases sharply when labeled samples are scarce. Labeling data manually is time-consuming and costly, and if the network can only be trained after enough pixels have been labeled, real-time classification becomes impossible. Learning how to obtain good results when only a few labeled samples per class are available has therefore attracted more and more attention. So-called few-shot classification gives each class K labeled samples as training data and makes predictions on the whole dataset. K is usually set to a small number, which is 5, 10, 15, 20 and 25 in our experimental settings.
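The K-shot setup above can be sketched as a per-class split: K randomly chosen labeled samples per class form the training set, and everything else is held out for testing. This is an illustrative numpy sketch; the variable names are ours.

```python
# Illustrative K-shot split: K labeled samples per class for training,
# the rest for testing. Not taken from the paper's code.
import numpy as np

def k_shot_split(labels, k, rng):
    """Return index arrays (train_idx, test_idx) with k samples per class."""
    train_idx = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        train_idx.extend(rng.choice(idx, size=k, replace=False))
    train_idx = np.array(sorted(train_idx))
    test_idx = np.setdiff1d(np.arange(len(labels)), train_idx)
    return train_idx, test_idx

rng = np.random.default_rng(0)
labels = np.repeat(np.arange(3), 40)      # toy case: 3 classes, 40 pixels each
train_idx, test_idx = k_shot_split(labels, k=5, rng=rng)
print(len(train_idx), len(test_idx))      # 15 105
```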
One line of work addresses the few-shot problem by exploiting unlabeled data and external datasets [21]. Semi-supervised and active learning methods have been proposed based on the assumption that there is no severe shift between the target-domain and source-domain data distributions. VSCNN [22] uses active learning to select valuable samples from an uncertain dataset to form the training set and improve small-sample classification. However, owing to varying environmental conditions such as illumination or atmosphere, pixels from the source and target domains usually exhibit a significant spectral shift even when they share the same class label. Domain-adaptation methods have been proposed to solve this cross-domain problem.
DCFSL [23] combines few-shot learning with a domain-adaptation strategy in a conditional adversarial manner to address the possibly different data distributions of the target and source domains. MDL4OW [24] improves classification accuracy by identifying unknown classes; it uses the statistical model EVT to estimate an unknown score and introduces a new evaluation metric. Both methods tackle the few-shot learning problem through frameworks that utilize other datasets. However, finding an external dataset that fits the target remains a burden for the few-shot problem.
Combined with metric learning, domain adaptation can address the few-shot learning problem without involving an external dataset. Metric learning learns a relationship between sample pairs by mapping the samples into a metric space in which samples of the same class are as close as possible and samples of different classes are as far apart as possible. S-DMM [25] proposes a model based on metric learning that learns the similarity between sample pairs using a Siamese network and an auto-encoder, solving cross-scene HSI classification with this deep learning method. However, metric learning has the defect of being very time-consuming.
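The pair-based objective behind such metric learning can be sketched with the classic contrastive loss: same-class embeddings are pulled together, different-class embeddings are pushed beyond a margin. The embedding vectors below are stand-ins; S-DMM learns them with its Siamese network, and this is not its exact loss.

```python
# Minimal sketch of a pairwise metric-learning objective (contrastive loss).
import numpy as np

def contrastive_loss(z1, z2, same_class, margin=1.0):
    """Contrastive loss for one pair of embeddings z1, z2."""
    d = np.linalg.norm(z1 - z2)
    if same_class:
        return 0.5 * d ** 2                    # pull similar pairs together
    return 0.5 * max(0.0, margin - d) ** 2     # push dissimilar pairs apart

z_a = np.array([0.1, 0.2])
z_b = np.array([0.1, 0.25])                    # near z_a
z_c = np.array([2.0, 2.0])                     # far from z_a
print(contrastive_loss(z_a, z_b, same_class=True) <
      contrastive_loss(z_a, z_c, same_class=True))   # True
```

Evaluating such a loss over many sample pairs is exactly what makes metric-learning methods time-consuming: the number of pairs grows quadratically with the number of samples.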
Solving the few-shot problem in a time-efficient manner and without using other data is not a trivial task, and the above methods cannot meet these requirements. In summary, the few-shot problem in HSI classification faces the following challenges:
How to reach a high accuracy in the few-shot setting. Considering the cost of manually labeling every pixel in a hyperspectral image, reaching a satisfying accuracy with only a few training samples can bring great economic benefits. However, this is difficult because the network relies on learning the distribution of labeled samples to make predictions; if the amount of training data is too small, achieving high accuracy becomes very hard.
How to solve the few-shot problem without involving an external dataset. Because of severe shifts in sample distribution between the source and target datasets, finding an appropriate external dataset to serve as the source is hard. Rather than relying on other datasets, it may be a better choice to achieve high accuracy without including an external dataset; then searching for a useful source dataset for every target dataset is no longer required.
How to solve the few-shot problem quickly. Existing methods usually require a huge amount of time because of defects inherent in the methods themselves, such as metric learning. In some cases, classifying an HSI as quickly as possible is very important.
Considering the above problems, in this paper we propose a new method that combines the benefits of convolution blocks and Transformer Encoders to solve few-shot learning. Convolution blocks offer shared weights, spatial subsampling and local receptive fields, while Transformer Encoders provide dynamic attention, better generalization and global context fusion. Combined with a generative adversarial network, the method ensures the similarity between generated and original samples. We use no other dataset and no unlabeled data in this paper to solve the few-shot learning problem. The main contributions of our paper are as follows:
- (1)
For the first time, a convolution block, a Transformer Encoder and a generative adversarial network are combined to realize few-shot classification of HSIs. With this model, we can learn the data distribution from only a few samples and reach high accuracy on different datasets, without using any external dataset.
- (2)
We solve the few-shot problem with better time efficiency. Considering the time consumption of training Transformers, we speed up the training time by combining the Transformer Encoder with convolution blocks.
- (3)
The method proposed in the paper achieved good classification results on the Indian Pines, PaviaU and KSC datasets compared with other few-shot learning methods.
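The dynamic attention and global context fusion contributed by the Transformer Encoder come from scaled dot-product self-attention, which can be sketched as follows. This is a simplified single-head numpy version for illustration, not the exact block used in our network, and the dimensions are arbitrary.

```python
# Sketch of scaled dot-product self-attention, the core Transformer Encoder
# operation: every token attends to every other token (global context fusion)
# with input-dependent weights (dynamic attention).
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention over a sequence x of shape (tokens, dim)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])          # pairwise token affinities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over tokens
    return weights @ v                               # context-fused features

rng = np.random.default_rng(0)
tokens, dim = 9, 16                 # e.g. 9 spatial positions from a 3x3 patch
x = rng.standard_normal((tokens, dim))
w_q, w_k, w_v = (rng.standard_normal((dim, dim)) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)
print(out.shape)  # (9, 16)
```

In a hybrid design, convolution blocks can first reduce the spatial resolution, shortening the token sequence the attention operates on; since attention cost grows quadratically with sequence length, this is one way such a combination saves training time.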
4. Experiments
In this section, three leading HSI datasets are selected to conduct HSI classification experiments. The experiments are implemented in the PyTorch open-source framework on an NVIDIA 3080 Ti graphics card.
The Indian Pines dataset was gathered in 1992 by AVIRIS (Airborne Visible/Infrared Imaging Spectrometer) in northwestern Indiana and has 224 bands. The bands cover the visible and infrared spectrum from 400 to 2500 nm; because of atmospheric absorption, 200 of the original 224 spectral bands are used in this paper. The image size is 145 × 145 pixels, some of which are labeled into 16 classes.
Table 4 shows the data division of the Indian Pines dataset for this experiment.
The Pavia University dataset was gathered in 2002 by the ROSIS (Reflective Optics System Imaging Spectrometer) sensor during a flight campaign over the University of Pavia campus and has 115 bands. The bands cover the visible and infrared spectrum from 430 to 860 nm, and the ground resolution is 1.3 m. Affected by noise and water absorption, some bands were abandoned, and 103 spectral bands are used in this paper. The image size is 610 × 340 pixels, some of which are labeled into 9 classes.
Table 5 shows the training and testing data division of the Pavia University dataset.
The Kennedy Space Center (KSC) dataset was gathered by NASA AVIRIS in 1996 and has 224 bands. The bands cover the visible and infrared spectrum from 400 to 2500 nm, and the ground resolution is 18 m. Because of water absorption, the affected and low-SNR bands were abandoned, and 176 spectral bands are used in this paper. The image size is 512 × 614 pixels, some of which are labeled into 13 classes.
Table 6 shows the training and testing data division of the KSC dataset.
To demonstrate how our method performs, we compared it with eight methods: SVM [40], 2D-CNN [41], 3D-CNN [18], HSI-BERT [34], CA-GAN [38], DCFSL [23], VSCNN [22] and S-DMM [25]. DCFSL, VSCNN and S-DMM are few-shot learning methods for hyperspectral image classification and obtain good results. DCFSL utilizes other datasets by combining few-shot learning with a domain-adaptation strategy in a conditional adversarial manner. VSCNN uses active learning to select valuable samples from an uncertain dataset to form the training set and improve few-shot learning ability. S-DMM learns the similarity between sample pairs using a Siamese network and an auto-encoder based on metric learning.
For fairness, all methods use their optimal parameters. For IP and PU, the experiments are divided into five groups by the number of training samples per class: 5, 10, 15, 20 and 25, respectively. For the Kennedy Space Center dataset, the experiments are divided into three groups: 15, 20 and 25 training samples per class. Taking five per class as an example, five samples are randomly selected from every class for training, and the remaining samples form the testing set. We adopt overall accuracy (OA) as the evaluation metric to measure classification performance. All results are averaged over 10 independent training runs.
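The evaluation protocol above reduces to a simple computation: OA is the fraction of correctly classified test pixels, averaged over the independent runs. A numpy sketch with made-up predictions (the labels and the 10% error rate are purely illustrative):

```python
# Overall accuracy (OA): fraction of correctly classified test pixels,
# averaged over independent runs. Toy data, not our experimental results.
import numpy as np

def overall_accuracy(y_true, y_pred):
    return np.mean(y_true == y_pred)

rng = np.random.default_rng(0)
y_true = np.repeat(np.arange(4), 25)        # 4 classes, 100 test pixels
runs = []
for _ in range(10):                         # mock 10 independent runs
    noise = rng.random(y_true.size) < 0.1   # corrupt ~10% of predictions
    y_pred = np.where(noise, (y_true + 1) % 4, y_true)
    runs.append(overall_accuracy(y_true, y_pred))
print(round(float(np.mean(runs)), 3))       # close to 0.9
```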
The results of the above experiments are shown in Table 7, Table 8 and Table 9. The tables show that accuracy rises as more samples are labeled. Our proposed method outperforms the others in all conducted experiments, demonstrating its effectiveness regardless of the number of labeled samples. Methods that obtain a good result on a single dataset but cannot fit the others lack adaptation ability, which is essential for few-shot learning: since we cannot predict which dataset we will encounter, good results are needed on different datasets.
Given 15 labeled samples per class as training samples, the corresponding classification maps of all the selected methods on the IP dataset are shown in Figure 7; the corresponding maps for PU and KSC are shown in Figure 8 and Figure 9, respectively. Our classification map clearly matches the ground-truth image best in all cases, meaning that the other methods assigned more incorrect labels to the pixels. Moreover, Table 10, Table 11 and Table 12 show the detailed per-class accuracy with 15 labeled training samples per class on the different datasets.
Our method achieves better results on most land classes. In particular, on the Indian Pines dataset, our method obtains the highest classification results on 13 of 16 classes. For the classes “Corn-notill”, “Soybean-mintill” and “Woods”, where the ratio of testing samples to training samples is huge, our method obtains classification results of 78.77%, 81.39% and 97.28%, respectively. Our method improves greatly over the other methods in categories 3 and 11.
On the Pavia University dataset, our method obtains the highest classification results on four of nine classes. For the classes “Meadows” and “Bare Soil”, where the ratio of testing samples to training samples is huge, our method obtains classification results of 97.57% and 100.0%, respectively. Our method improves greatly over the other methods in category 2.
On the KSC dataset, our method obtains the highest classification results on 6 of 13 classes. For the classes “Scrub” and “Water”, where the ratio of testing samples to training samples is huge, our method obtains classification results of 99.87% and 100.0%, respectively. Our method improves greatly over the other methods in category 2.
It can be seen that our method makes full use of a small number of training samples to extract effective features. In terms of AA, our method reaches the highest value on the Indian Pines and KSC datasets; in terms of kappa, it reaches the highest performance on all three datasets. The per-class classification results remain relatively balanced even when the proportions of training samples are unbalanced. Ablation experiments are shown in Table 13, Table 14 and Table 15. As the tables show, introducing the generative adversarial network clearly improves the OA: the classification results after adding the generator improve greatly.
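The kappa coefficient reported in the tables measures agreement beyond chance and can be computed from the confusion matrix. A minimal numpy sketch on toy labels, not our experimental data:

```python
# Cohen's kappa from the confusion matrix: (observed agreement - chance
# agreement) / (1 - chance agreement). Toy labels for illustration only.
import numpy as np

def kappa(y_true, y_pred, n_classes):
    cm = np.zeros((n_classes, n_classes))
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    total = cm.sum()
    po = np.trace(cm) / total                 # observed agreement (equals OA)
    pe = (cm.sum(0) @ cm.sum(1)) / total ** 2 # agreement expected by chance
    return (po - pe) / (1 - pe)

y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 1, 2, 2, 2])
print(round(kappa(y_true, y_pred, 3), 3))     # 0.75
```

Unlike OA, kappa discounts correct predictions that could arise from class-frequency imbalance alone, which is why it is a useful complement when training proportions are unbalanced.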