Article

MSSFF: Advancing Hyperspectral Classification through Higher-Accuracy Multistage Spectral–Spatial Feature Fusion

1 School of Remote Sensing and Geomatics Engineering, Nanjing University of Information Science and Technology, Nanjing 210044, China
2 Qingdao Innovation and Development Center (Base), Harbin Engineering University, Qingdao 266000, China
3 Faculty of Engineering and Applied Science, Memorial University, St. John’s, NL A1B 3X5, Canada
* Author to whom correspondence should be addressed.
Remote Sens. 2023, 15(24), 5717; https://doi.org/10.3390/rs15245717
Submission received: 16 October 2023 / Revised: 8 December 2023 / Accepted: 12 December 2023 / Published: 13 December 2023

Abstract

This paper presents the MSSFF (multistage spectral–spatial feature fusion) framework, which introduces a novel approach for semantic segmentation of hyperspectral imagery (HSI). The framework aims to simplify the modeling of spectral relationships in HSI sequences and unify the architecture for semantic segmentation of HSIs. It incorporates a spectral–spatial feature fusion module and a multi-attention mechanism to efficiently extract hyperspectral features. The MSSFF framework reevaluates the potential impact of spectral and spatial features on segmentation models and leverages the spectral–spatial fusion module (SSFM) in the encoder component to effectively extract and enhance these features. Additionally, an efficient Transformer (ET) is introduced in the skip connection part of deep features to capture long-range dependencies and extract global spectral–spatial information from the entire feature map. This highlights the significant potential of Transformers in modeling spectral–spatial feature maps within the context of hyperspectral remote sensing. Moreover, a spatial attention mechanism is adopted in the shallow skip connection part to extract local features. The framework demonstrates promising capabilities in hyperspectral remote sensing applications. The conducted experiments provide valuable insights for optimizing the model depth and the order of feature fusion, thereby contributing to the advancement of hyperspectral semantic segmentation research.

1. Introduction

Hyperspectral imagery (HSI) contains a wealth of spectral information and comprises multiple, and in some cases hundreds of, bands. This spectral information can be leveraged to classify important ground objects based on the characteristics exhibited across different bands. Feature extraction plays a pivotal role in HSI classification and has garnered growing interest among researchers. Hyperspectral remote sensing has made significant contributions in various domains, such as military applications [1], medical research [2], water quality monitoring [3], and agricultural research [4].
However, the presence of numerous frequency bands in the hyperspectral data results in strong correlations between adjacent bands [5]. This correlation leads to a significant amount of redundant information for classification tasks [6]. Consequently, early approaches in hyperspectral classification primarily focused on data reduction techniques and feature engineering [7,8].
In recent years, with the advancements in deep learning, this technology has been increasingly adopted in various domains [9,10,11], including hyperspectral remote sensing, and has achieved remarkable success [6]. Deep learning models have the capability to extract meaningful knowledge from vast amounts of redundant data [12]. The multi-layer structure of these models enables the acquisition of higher-level semantic information from the samples [13].
Various deep learning models have been developed for hyperspectral data analysis, with convolutional neural network (CNN)-based models standing out due to their remarkable performance. Yu et al. [14] introduced a CNN architecture that takes a single pixel as input, enabling the network to directly learn the relationships between different spectral bands. Chen et al. [15] proposed a 3D-CNN model with sparse constraints that directly extracts spectral–spatial features from HSI. Ghaderizadeh et al. [16] presented a hybrid 3D-2D CNN architecture. This hybrid CNN approach offers advantages over a standalone 3D-CNN by reducing the model’s complexity and mitigating the impacts of noise and limited training samples.
In addition to CNNs, several other network architectures have demonstrated strong performance in HSI classification. Recurrent neural networks (RNNs) are capable of capturing both long-term and short-term spectral dependencies and have found widespread application in HSI classification [17]. Fully convolutional networks (FCNs), a popular model in image segmentation, have been extensively employed in hyperspectral remote sensing tasks [18]. Transformers, which have shown significant advancements in recent years, have also been successfully applied to HSI classification [19,20,21,22,23]. Furthermore, graph convolutional networks (GCNs) have gained attention in HSI classification and have achieved notable performance [24,25].
However, the majority of these models for HSI analysis are primarily patch-based, necessitating laborious preprocessing steps and resulting in substantial storage requirements. Consequently, several studies [20,22,26,27] have attempted to address these challenges by directly performing semantic segmentation on HSI. In these approaches, HSIs are treated as multi-channel images, akin to conventional RGB images, and external ground object labels are employed for annotation. This process is analogous to manually marking and selecting regions of interest with ROI tools [28]. During the loss calculation, only the known ground object types are considered for gradient computation using masks. Experimental verification has demonstrated the simplicity and effectiveness of this approach. Nevertheless, the spectral–spatial characteristics of hyperspectral images are often not fully taken into account by most existing methods. Yu et al. [26] integrated Transformer features directly within the decoder part, overlooking the intrinsic global relationship between distinct patches [25]. In a similar vein, Chen et al. [20] employed a combination of convolution and Transformer in the encoder part to extract hyperspectral image (HSI) features. However, their approach models the spectral sequence in the upper layers of the model and the spatial characteristics in the lower layers, thereby neglecting consistent spectral–spatial characteristics.
Spatial–spectral fusion methods have been extensively employed in hyperspectral classification tasks for over a decade. Early research focused on analyzing the size, orientation, and contrast characteristics of spatial structures in images, followed by the utilization of support vector machines (SVMs) for classification purposes [29]. Subsequent studies explored supervised classification of hyperspectral images through segmentation and spectral features extracted from partition clustering [30]. Li et al. [31] investigated the use of 3D convolutional neural networks (3DCNN) for direct spatial–spectral fusion in classification tasks. More recently, a two-stage method inspired by image denoising and segmentation was proposed in [32] to merge spatial and spectral information. Moreover, Qiao et al. [33] introduced a novel approach that captures information by concurrently considering the interactions between channels, spectral bands, spatial depth and width. However, it should be noted that these methods primarily operate at the patch level and may not be directly applicable to semantic segmentation tasks.
Some recent works [34,35] have focused on enhancing convolutional modules to better capture spatial and channel details, yielding impressive performance across various tasks. However, when applied to HSIs, extracting both spatial and spectral features comprehensively becomes crucial. Conventional 2D convolutions are insufficient for effective hyperspectral feature extraction, while 3D convolutions exhibit high complexity and parameter redundancy. Thus, to address these limitations holistically, there is a need for modules that can extract both spectral and spatial features in hyperspectral tasks, thereby replacing traditional 2D and 3D convolutions. Several studies in the field of HSI [36,37] use new modules with attention mechanisms and multi-scale features to replace traditional convolutions and have achieved good results in patch-based HSI classification tasks. However, these modules need to be used in conjunction with various other modules and carry a high parameter count and complexity, making them difficult to apply to semantic segmentation tasks.
To simplify the modeling of spectral–spatial relationships in hyperspectral imaging sequences and to establish a unified hyperspectral image semantic segmentation architecture, this paper proposes a novel image-based global spectral–spatial feature learning framework called MSSFF. In contrast to conventional classification methods, MSSFF hierarchically models features in spectral–spatial sequences through multistage feature fusion, resulting in outstanding classification performance even with a limited number of labeled samples (refer to Figure 1). Firstly, in the encoder component, effective extraction of hyperspectral features is achieved by incorporating a spectral feature fusion module and a spatial feature fusion module. Secondly, an efficient Transformer is introduced between the encoder and decoder to capture global dependencies among deep feature nodes. Lastly, a spatial attention mechanism is employed in the upper layer of the model to model region-level features.
The contributions of this paper can be summarized as follows:
(1)
The paper introduces the MSSFF framework, a new method for hyperspectral semantic segmentation. It reevaluates the importance of spectral and spatial features and incorporates them effectively into the encoder. The framework also includes a Transformer in the skip connection section to capture global spectral–spatial information from the feature map. This demonstrates the potential of Transformers in modeling spectral–spatial feature maps for hyperspectral remote sensing.
(2)
We conducted a series of ablation experiments and module selection experiments to investigate the optimal depth of the hyperspectral semantic segmentation model. The results of these experiments confirmed that increasing the depth of the model beyond a certain point does not necessarily yield improved performance. Additionally, we explored the order of feature fusion and found that performing spectral feature fusion before spatial feature fusion yields better results. These findings suggest that considering spectral information before spatial information enhances the performance of the hyperspectral semantic segmentation model.
(3)
We performed comparative experiments involving the patch-based method and the semantic segmentation method to assess the feasibility of our proposed approach in the field of hyperspectral semantic segmentation. The results of these experiments confirmed the effectiveness and viability of our method for hyperspectral semantic segmentation.

2. Method

As shown in Figure 1, we find that shallow models can effectively classify HSIs, so we propose an end-to-end shallow semantic segmentation model. HSIs are rich in spatial and spectral information, and both spectral and spatial correlations should be fully exploited during modeling. Therefore, in this work, we first propose a backbone that simultaneously extracts spatial and spectral features: SSFM replaces the traditional convolution module, and at the end of the backbone a pyramid pooling strategy captures context at multiple scales. In the decoder part, we follow the standard Unet architecture; however, we introduce the efficient Transformer in the skip connection part to model the deep feature maps globally, while for the shallow (topmost) feature map we use the spatial attention module for shallow feature extraction. Through these modules, the accuracy of HSI classification is significantly improved. The following sections describe the core components of the framework.
The framework adopts an encoder–decoder architecture. The encoder is similar to ResNet18 [38], but we use SSFM to replace the standard Conv module in ResNet. In general, we need to pad the boundaries of the input HSI: we fill the height and width of the HSI to a multiple of 16, so that an Indian Pines input $I \in \mathbb{R}^{145 \times 145 \times 200}$ is padded to $I' \in \mathbb{R}^{160 \times 160 \times 200}$. The HSI is then directly input for forward computation. In the encoder part, we replace the input channels of the backbone’s first convolutional layer with the number of HSI spectral channels. A pyramid pooling module (PPM) is introduced at the end of the encoder; the multi-scale features extracted by this multi-scale aggregation module are very effective for the modeling of the framework, and residual connections between the PPM and the underlying feature maps further facilitate gradient backpropagation. In the decoder part, one upsampling layer and two convolutional layers form a group, and there are three groups of upsampling modules in total. Before the upper- and lower-layer features are fused, the encoder features are enhanced by the ET or SA module and then concatenated with the upsampled output of the lower-layer features. The same operation is performed for each feature map of the encoder, and the final feature map is upsampled to the input size. To compute the loss, a small number of samples from the region are used to construct the mask. For the output of each batch, we only calculate the gradient of the known samples after masking, and do not calculate the unknown samples.
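As a minimal illustration of the padding step described above, the following PyTorch sketch pads an HSI cube so that its height and width become multiples of 16 before it enters the encoder. The pad-to-16 rule and the Indian Pines example follow the text; the function name, the use of zero padding, and the (batch, bands, height, width) tensor layout are implementation assumptions rather than the authors' code.

```python
import torch
import torch.nn.functional as F

def pad_to_multiple(hsi: torch.Tensor, multiple: int = 16) -> torch.Tensor:
    """Pad an HSI tensor of shape (B, C, H, W) so H and W become multiples of `multiple`.

    For an Indian Pines cube (145 x 145 x 200) this yields 160 x 160 x 200,
    matching the example in the text. Zero padding is an assumption here;
    reflection padding would work equally well.
    """
    _, _, h, w = hsi.shape
    pad_h = (multiple - h % multiple) % multiple
    pad_w = (multiple - w % multiple) % multiple
    # Pad right/bottom only; F.pad expects (left, right, top, bottom) for 4D input.
    return F.pad(hsi, (0, pad_w, 0, pad_h))

# Example: Indian Pines-sized input with 200 spectral bands.
x = torch.randn(1, 200, 145, 145)
print(pad_to_multiple(x).shape)  # torch.Size([1, 200, 160, 160])
```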

2.1. Spectral–Spatial Fusion Module (SSFM)

To enhance the feature extraction capabilities of traditional 2D convolutions in both the spectral and spatial domains, we introduce SSFM. Our approach extracts and fuses features from both the spectral and spatial dimensions. Specifically, SSFM applies the spectral feature fusion module first, followed by the spatial feature fusion module; the order of these modules is discussed in the experimental results section.

2.1.1. Spectral Fusion Module

In order to fully leverage the potential of spectral features, we propose the integration of a spectral feature fusion module, as depicted in Figure 2. This module employs a split-extract-fusion strategy, which aims to address the challenges associated with extracting effective feature maps along the spectral dimension. In computer vision [38,39,40], particularly in the context of HSIs, the use of repeated convolutions for feature extraction can pose difficulties in capturing informative spectral-specific features, which has been identified as a critical issue [20,21,22]. Therefore, our proposed spectral feature fusion module provides a solution to overcome this flaw and improve the ability to extract meaningful spectral features in HSI analysis.
Given an input feature map $X \in \mathbb{R}^{H \times W \times C}$, we first divide the features into two parts, $X_1 \in \mathbb{R}^{H \times W \times C/2}$ and $X_2 \in \mathbb{R}^{H \times W \times C/2}$, based on the spectral dimension $C$. Both feature sets then undergo a $1 \times 1$ convolution that compresses their dimensions by half, resulting in $X_1' \in \mathbb{R}^{H \times W \times C/4}$ and $X_2' \in \mathbb{R}^{H \times W \times C/4}$. Next, the upper-branch features are processed by both $1 \times 1$ and $3 \times 3$ convolution modules, and the outputs are concatenated to obtain $X_1'' \in \mathbb{R}^{H \times W \times C}$. Similarly, the lower-branch features pass through a $1 \times 1$ convolution module while preserving their original features, and concatenation is performed again to obtain $X_2'' \in \mathbb{R}^{H \times W \times C}$.
To obtain the combined feature representation, $X_1''$ and $X_2''$ are concatenated, resulting in the total feature representation $X' \in \mathbb{R}^{H \times W \times 2C}$. Subsequently, an average pooling (Avg-Pooling) operation is applied to $X'$, and the resulting weights are divided into two parts, corresponding to $X_1''$ and $X_2''$. These weights are used to perform feature weighting on the respective feature sets. Finally, the two weighted features are superimposed at the end of the module.
The following formula can be used to summarize:
$X_1, X_2 = \mathrm{Split}(X),$
$X_1' = W_{C1} * X_1, \quad X_2' = W_{C2} * X_2,$
where $\mathrm{Split}$ denotes splitting the input along the spectral dimension, and $W_{C1} \in \mathbb{R}^{C/2 \times 1 \times 1 \times C/4}$ and $W_{C2} \in \mathbb{R}^{C/2 \times 1 \times 1 \times C/4}$ are learnable weight matrices employed to facilitate the spectral-wise splitting and manipulation of the input features.
$X_1'' = \mathrm{Concat}\left(W_{C11} * X_1', W_{C12} * X_1'\right),$
$X_2'' = \mathrm{Concat}\left(W_{C13} * X_2', X_2'\right),$
where $W_{C11} \in \mathbb{R}^{C/2 \times 1 \times 1 \times C/4}$, $W_{C12} \in \mathbb{R}^{C/2 \times 1 \times 1 \times C/4}$, and $W_{C13} \in \mathbb{R}^{C/2 \times 1 \times 1 \times C/4}$ are learnable weight matrices used for the respective convolution steps. The function $\mathrm{Concat}$ refers to concatenation along the spectral dimension.
$X' = \mathrm{Concat}\left(X_1'', X_2''\right),$
After performing feature extraction, instead of directly concatenating or adding the two types of features, we adopt the approach proposed in [41,42] to selectively merge the output features of the feature extraction stage, $X_1''$ and $X_2''$. We apply global Avg-Pooling to aggregate the global spatial information of $X'$ and obtain $X_{avg}$, which encodes spectral statistics. Next, we normalize $X_{avg}$ and multiply it element-wise with the feature map $X'$, generating the feature importance vector $Y$. To further refine the feature representation, we split $Y$ into two equal parts, $Y_1$ and $Y_2$. Finally, we superimpose $Y_1$ and $Y_2$ to obtain the spectral refinement feature $\hat{Y}$.
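The following PyTorch sketch illustrates the split-extract-fuse procedure of the spectral fusion module summarized by the formulas above. It is a minimal reconstruction, not the authors' implementation: the exact channel widths of each branch and the SK-style softmax used to normalize the pooled weights are assumptions.

```python
import torch
import torch.nn as nn

class SpectralFusionModule(nn.Module):
    """Sketch of the split-extract-fuse spectral branch in Section 2.1.1.

    Channel widths are chosen so the module runs with any channel count
    divisible by 4; the widths used by the authors may differ.
    """

    def __init__(self, channels: int):
        super().__init__()
        assert channels % 4 == 0, "channels must be divisible by 4"
        half, quarter = channels // 2, channels // 4
        # 1x1 compressions applied to the two split halves.
        self.compress1 = nn.Conv2d(half, quarter, kernel_size=1)
        self.compress2 = nn.Conv2d(half, quarter, kernel_size=1)
        # Upper branch: parallel 1x1 and 3x3 convolutions, concatenated.
        self.up_1x1 = nn.Conv2d(quarter, half, kernel_size=1)
        self.up_3x3 = nn.Conv2d(quarter, half, kernel_size=3, padding=1)
        # Lower branch: 1x1 convolution concatenated with the compressed features.
        self.low_1x1 = nn.Conv2d(quarter, channels - quarter, kernel_size=1)
        self.pool = nn.AdaptiveAvgPool2d(1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x1, x2 = torch.chunk(x, 2, dim=1)                  # split along the spectral dim
        x1, x2 = self.compress1(x1), self.compress2(x2)    # H x W x C/4 each
        x1 = torch.cat([self.up_1x1(x1), self.up_3x3(x1)], dim=1)   # H x W x C
        x2 = torch.cat([self.low_1x1(x2), x2], dim=1)                # H x W x C
        xc = torch.cat([x1, x2], dim=1)                    # combined features, H x W x 2C
        stats = self.pool(xc)                              # global spectral statistics
        w1, w2 = torch.chunk(stats, 2, dim=1)
        # Branch-wise normalization of the pooled weights (SK-style assumption).
        w = torch.softmax(torch.stack([w1, w2], dim=0), dim=0)
        return x1 * w[0] + x2 * w[1]                       # spectral refinement feature

# Example: drop-in use on a 64-channel feature map.
x = torch.randn(2, 64, 32, 32)
print(SpectralFusionModule(64)(x).shape)  # torch.Size([2, 64, 32, 32])
```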

2.1.2. Spatial Fusion Module

To ensure the encoder effectively captures spatial features, we propose the integration of a spatial feature fusion module, as illustrated in Figure 3. This module employs separation and fusion operations to enhance its functionality. The primary objective of the separation operation is to distinguish informative feature maps from those containing comparatively less relevant spatial content. By subsequently fusing feature maps that possess rich information with those exhibiting lesser information, we can extract more comprehensive feature information than what can be achieved through convolution operations alone.
Specifically, we propose a method that utilizes group normalization (GN) for a given feature $X \in \mathbb{R}^{H \times W \times C}$. GN partitions the input spectral dimension into 16 groups, enabling independent calculations of the mean μ and variance σ for each group. The mean is computed by averaging the values within a group, while the variance is determined by calculating the squared differences between each value and the mean, followed by averaging the squared differences. Subsequently, the activations within each group are normalized by subtracting the group mean and dividing by the square root of the group variance. This normalization process ensures consistent and efficient feature scaling within each group. GN introduces learnable parameters, which include scaling and shifting factors for each group. These parameters enable the network to learn optimal scaling and shifting of the normalized activations. The scaling factor γ adjusts the normalized value, allowing for fine-grained control of the feature representation, while the shift factor β introduces a bias to the normalized value, aiding in capturing higher-order feature interactions.
$\mathrm{GN}(X) = \gamma \dfrac{X - \mu}{\sqrt{\sigma^2 + \varepsilon}} + \beta,$
Simultaneously, the scaling factor γ within the GN layer serves as an indicator to quantify the variance of spatial pixels within each spectral dimension. The value of γ reflects the extent of spatial pixel variation, with richer spatial information resulting in a larger γ value. To obtain the weights for different feature maps, the following formula is employed: the features are multiplied with the weights within the GN layer. Subsequently, a sigmoid function is utilized to map the feature values to the interval [0, 1]. This process enables effective modulation and normalization of the feature representations.
$W_i = \dfrac{\gamma_i}{\sum_{n=1}^{C} \gamma_n}, \quad i = 1, 2, \ldots, C,$
$X_{mid} = \mathrm{Sigmoid}\left(\mathrm{GN}(X) \cdot W\right),$
Subsequently, a mask is constructed for the feature $X_{mid}$ based on a threshold of 0.5. Values greater than or equal to 0.5 are assigned to $x_1$, while values less than 0.5 are assigned to $x_2$. These divisions result in two weighted features: $X_1$, representing the information-rich feature, and $X_2$, representing the less informative feature. To enhance the spatial feature fusion capability of the module and reduce spatial redundancy, the feature with rich information is added to the feature with less information. This is followed by a cross-reconstruction operation that facilitates comprehensive integration of the two weighted features, allowing for effective information exchange and generating more informative features. The resulting cross-reconstructed features are then concatenated to obtain spatial detail features, capturing fine-grained spatial information.
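A minimal PyTorch sketch of the spatial fusion module follows the GN-based weighting formulas above. The per-channel weights derived from the GN scaling factors and the 0.5 threshold follow the text, whereas the concrete form of the cross-reconstruction step (channel-wise exchange and concatenation) is an assumption, since the paper describes it only at a high level.

```python
import torch
import torch.nn as nn

class SpatialFusionModule(nn.Module):
    """Sketch of the separate-and-fuse spatial branch in Section 2.1.2."""

    def __init__(self, channels: int, groups: int = 16, threshold: float = 0.5):
        super().__init__()
        assert channels % groups == 0, "channels must be divisible by the group count"
        self.gn = nn.GroupNorm(groups, channels)
        self.threshold = threshold

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gn_x = self.gn(x)
        # Per-channel importance from the learned GN scaling factors (gamma).
        gamma = self.gn.weight / self.gn.weight.sum()
        w = torch.sigmoid(gn_x * gamma.view(1, -1, 1, 1))
        # Separate information-rich and less-informative parts with a 0.5 mask.
        rich = torch.where(w >= self.threshold, w, torch.zeros_like(w)) * x
        weak = torch.where(w < self.threshold, w, torch.zeros_like(w)) * x
        # Cross-reconstruction: exchange information between the two parts and
        # concatenate the reconstructed halves (implementation assumption).
        r1, r2 = torch.chunk(rich, 2, dim=1)
        w1, w2 = torch.chunk(weak, 2, dim=1)
        return torch.cat([r1 + w2, r2 + w1], dim=1)
```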

2.2. Efficient Transformer (ET)

The standard Transformer model exhibits limitations in terms of high computational complexity and a lack of explicit spatial structure modeling. To address these shortcomings, researchers have proposed various enhanced Transformer models aimed at improving their performance in computer vision tasks. For instance, attention mechanism improvements [43], locality-based attention [44], and hybrid models [45] have been developed. Consequently, it is valuable to explore the integration of Transformer with convolutional models.
Recent research endeavors [46,47] have focused on replacing positional embedding in the Transformer model with convolution operations. By incorporating convolution operations into the Transformer, it becomes possible to effectively combine local and global features. Building upon the aforementioned concept, we present the ET that utilizes convolutional operations to effectively reduce the dimensionality of the feature space while capturing positional information. The architecture of ET is depicted in Figure 4. Furthermore, we introduce convolutional layers at both the input and output of the module to enhance the extraction of spatial features.
Space-reduced Efficient Multi-head Self-Attention (SEMSA) operates in a similar manner to Multi-head Self-Attention (MSA), as it takes Q (query), K (key), and V (value) as input and produces features of the original size as output. However, a key distinction lies in that SEMSA reduces the spatial scale of K and V before the attention operation. This reduction significantly diminishes the computational and memory overhead.
Specifically, in our study, we employ SEMSA as a replacement for the traditional MSA in the encoder module. Each instance of the ET comprises an attention layer and a feed-forward network (FFN). Considering the high-resolution feature maps involved in hyperspectral semantic segmentation, we utilize a spatial-reduction convolution (SR) to reduce the spatial dimension of these feature maps while simultaneously learning spatial information. The SEMSA of stage $i$ can be expressed as follows.
$\mathrm{SEMSA}(Q, K, V) = \mathrm{Concat}\left(\mathrm{head}_0, \ldots, \mathrm{head}_N\right) W_O,$
Then, the $i$-th head can be expressed by the following formula:
$\mathrm{head}_i = \mathrm{Attention}\left(Q W_i^Q, \mathrm{SR}(K) W_i^K, \mathrm{SR}(V) W_i^V\right),$
where $W_i^Q$, $W_i^K$, and $W_i^V \in \mathbb{R}^{C \times C'}$ represent linear projection matrices, and the size $C'$ of each head equals $C / N$, with $N$ denoting the number of attention heads. The function $\mathrm{SR}(\cdot)$ denotes the use of convolution to reduce the dimensionality of the input feature space according to the reduction rate $r^*$.
$\mathrm{SR}(x) = \mathrm{Norm}\left(\mathrm{Reshape}(x, r^*) W_S\right),$
where $x \in \mathbb{R}^{HW \times C}$, with $HW$ denoting the spatial dimensions of the input and $C$ the number of spectral channels. The operation $\mathrm{Reshape}(x, r^*)$ transforms $x$ into a new shape of $\frac{HW}{r^{*2}} \times \left(r^{*2} C\right)$, and $W_S \in \mathbb{R}^{r^{*2} C \times C}$ is a linear projection matrix.
The attention calculation is defined as follows:
$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\left(\dfrac{Q K^T}{\sqrt{d}}\right) V,$
where Q , K , and V represent the query, key, and value matrices, respectively. The variable d represents the dimension of the sequence.
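The following PyTorch sketch shows one way to realize SEMSA as defined by the formulas above: queries attend to keys and values whose spatial resolution has been reduced by a strided convolution followed by layer normalization. The class name, the default head count, and the reduction rate of 2 are illustrative assumptions, not the authors' settings.

```python
import torch
import torch.nn as nn

class SEMSA(nn.Module):
    """Sketch of space-reduced efficient multi-head self-attention (Section 2.2)."""

    def __init__(self, dim: int, num_heads: int = 4, reduction: int = 2):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, dim * 2)
        self.proj = nn.Linear(dim, dim)          # corresponds to W_O
        # SR(.) implemented as a strided convolution plus layer normalization.
        self.sr = nn.Conv2d(dim, dim, kernel_size=reduction, stride=reduction)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor, h: int, w: int) -> torch.Tensor:
        # x: (B, N, C) with N = h * w flattened spatial positions.
        b, n, c = x.shape
        q = self.q(x).reshape(b, n, self.num_heads, c // self.num_heads).transpose(1, 2)

        # Spatial reduction of K and V before attention.
        x_ = x.transpose(1, 2).reshape(b, c, h, w)
        x_ = self.sr(x_).reshape(b, c, -1).transpose(1, 2)   # (B, N / r^2, C)
        x_ = self.norm(x_)
        kv = self.kv(x_).reshape(b, -1, 2, self.num_heads, c // self.num_heads)
        k, v = kv.permute(2, 0, 3, 1, 4)                      # each (B, heads, N / r^2, C / heads)

        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, n, c)
        return self.proj(out)

# Example: a 16 x 16 deep feature map with 128 channels.
feat = torch.randn(2, 16 * 16, 128)
print(SEMSA(128)(feat, 16, 16).shape)  # torch.Size([2, 256, 128])
```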

2.3. Pyramid Pooling Module (PPM)

The PPM is shown in Figure 5. For the hyperspectral semantic segmentation task, it is crucial to consider spatial features at different scales. Utilizing pooling modules with varying sizes allows for the extraction of spatial feature information at different scales, thereby enhancing the model’s robustness. To further address the loss of context information between different subregions, approaches such as [48,49] have introduced a hierarchical global prior structure. By incorporating language information from various scales and subregions, a global scene prior can be constructed based on the final layer feature map of the deep neural network, leading to significant improvements in region segmentation accuracy.
To implement this, the input feature map $X \in \mathbb{R}^{H \times W \times C}$ is transformed into four feature maps with different spatial sizes. Subsequently, $1 \times 1$ convolutions are applied to reduce the dimensionality of the four feature maps. Next, the four feature maps are resized to match the size of the input feature map using linear interpolation. Finally, the input feature map is concatenated with the four interpolated feature maps.
The above process can be expressed by the formula
$\mathrm{PPM}(X) = \mathrm{Concat}\left(\mathrm{Pool}_1(X), \mathrm{Pool}_2(X), \ldots, \mathrm{Pool}_n(X)\right),$
$\hat{Y} = \mathrm{ConvModule}\left(\mathrm{PPM}(X)\right),$
where $X$ denotes the input feature map, $\mathrm{Pool}_i(X)$ represents the outcome of the $i$-th pooling operation applied to the input feature map, and $n$ signifies the number of pooling operations employed within the PPM. The function $\mathrm{Concat}$ concatenates all the pooling results along the spectral dimension. Lastly, $\mathrm{ConvModule}$ denotes a module encompassing convolution, batch normalization, and ReLU activation.
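A minimal PyTorch sketch of the PPM described above is given below. The pooling bin sizes (1, 2, 3, 6) follow common PSPNet practice [48] and are an assumption; the 1×1 reduction, bilinear resizing, concatenation with the input, and final ConvModule follow the formulas above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPoolingModule(nn.Module):
    """Sketch of the PPM (Section 2.3): pool the input at several scales,
    reduce each pooled map with a 1x1 convolution, upsample back to the
    input size, concatenate with the input, and fuse with a ConvModule."""

    def __init__(self, in_channels: int, bins=(1, 2, 3, 6)):
        super().__init__()
        branch_channels = in_channels // len(bins)
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.AdaptiveAvgPool2d(bin_size),
                nn.Conv2d(in_channels, branch_channels, kernel_size=1),
                nn.ReLU(inplace=True),
            )
            for bin_size in bins
        ])
        # ConvModule = convolution + batch normalization + ReLU, as in the text.
        self.fuse = nn.Sequential(
            nn.Conv2d(in_channels + branch_channels * len(bins), in_channels,
                      kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(in_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        size = x.shape[2:]
        pooled = [F.interpolate(branch(x), size=size, mode="bilinear",
                                align_corners=False) for branch in self.branches]
        return self.fuse(torch.cat([x, *pooled], dim=1))
```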

2.4. Spatial Attention (SA)

The spatial attention in our work is modified from that in [39]. To apply SA, we first reduce the dimensionality along the channel axis by performing average pooling and maximum pooling on the features, obtaining the "avg" and "max" results, respectively. These pooled features are concatenated together to form a single feature map.
Next, we utilize a two-dimensional convolutional layer with a kernel size of (7, 7) to process the concatenated feature map. This convolutional operation can be represented by the following formula:
$\hat{Y} = X \cdot \mathrm{Sigmoid}\left(W_{SA} * \left[X_{avg}, X_{max}\right]\right),$
where $W_{SA} \in \mathbb{R}^{1 \times 7 \times 7 \times 2}$ is a learnable weight matrix, $X_{avg}$ and $X_{max}$ denote the results of the average-pooling and max-pooling operations, respectively, $\mathrm{Sigmoid}(\cdot)$ is the sigmoid activation function, and $\hat{Y}$ is the module output feature.
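The spatial attention gate defined above can be sketched in PyTorch as follows; it mirrors the CBAM-style formulation [39], with channel-wise average and max pooling, a 7×7 convolution, and a sigmoid gate multiplied onto the input. It is a minimal sketch, not the authors' exact layer.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Sketch of the spatial attention used on shallow skip connections (Section 2.4)."""

    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        avg_map = torch.mean(x, dim=1, keepdim=True)       # X_avg
        max_map, _ = torch.max(x, dim=1, keepdim=True)     # X_max
        gate = torch.sigmoid(self.conv(torch.cat([avg_map, max_map], dim=1)))
        return x * gate                                    # \hat{Y}
```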

3. Experiments

3.1. Experimental Platform Parameter Settings

All experiments were conducted on a Windows 11 system equipped with an Intel(R) Core(TM) i5-10400 CPU @ 2.90 GHz and an Nvidia GeForce RTX 3060 graphics card. To minimize experimental variability, the model adopts a controlled sampling approach by selecting a limited number of samples from the dataset for training. The experiment is conducted over 150 epochs, and all reported results are averaged over 5 independent experiments to ensure statistical significance. The model employs the AdamW optimizer with default parameters and an initial learning rate of $5 \times 10^{-4}$. The loss function is the standard cross-entropy, and the training process is the same as that in the literature [20,26]. We employ the hierarchical mask sampling method for calculating the loss function in our model. Specifically, we utilize masks to isolate relevant regions and compute the cross-entropy loss between the masked vectors and the corresponding ground-truth objects. However, imbalanced class distributions and significant inter-class variations pose challenges. To address this, we adopt a strategy of random pixel sampling for known ground object categories: during multiple sampling iterations we randomly select five pixels from each ground object category, ensuring comprehensive coverage of all known feature categories.
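As a concrete illustration of the mask-based loss described above, the following PyTorch sketch computes cross-entropy only over the sampled training pixels; the tensor shapes, function name, and the sampling rate in the usage example are assumptions, not the authors' code.

```python
import torch
import torch.nn.functional as F

def masked_cross_entropy(logits: torch.Tensor, labels: torch.Tensor,
                         train_mask: torch.Tensor) -> torch.Tensor:
    """Cross-entropy over sampled training pixels only; unknown pixels are ignored.

    logits:     (B, num_classes, H, W) network output
    labels:     (B, H, W) ground-truth class indices
    train_mask: (B, H, W) boolean mask of the sampled training pixels
    """
    logits = logits.permute(0, 2, 3, 1)[train_mask]   # (n_sampled, num_classes)
    labels = labels[train_mask]                        # (n_sampled,)
    return F.cross_entropy(logits, labels)

# Example with hypothetical shapes: 16 classes, one 160 x 160 padded scene.
logits = torch.randn(1, 16, 160, 160)
labels = torch.randint(0, 16, (1, 160, 160))
mask = torch.rand(1, 160, 160) < 0.05   # a small fraction of pixels sampled for training
print(masked_cross_entropy(logits, labels, mask))
```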
To verify the validity of the proposed method, a comparison is made between the segmentation performance of our proposed method (MSSFF) and several alternative methods, encompassing both patch-based approaches and semantic segmentation methods. The experiments are conducted on four publicly available datasets, namely Indian Pines (IA), Pavia University (PU), Salinas (SA), and Houston (HU). In order to evaluate the performance of the various models for HSI classification, the overall accuracy (OA), average accuracy (AA), and Kappa coefficient (K) are utilized as evaluation metrics.

3.2. Datasets

3.2.1. Indian Pines (IA)

The Indian Pines dataset was captured at a farm test site in northwest Indiana and collected using the airborne sensor AVIRIS. In this paper, the data of 200 bands are classified after water absorption and low signal-to-noise-ratio bands are eliminated. During the experiment, 10% of each type of ground object was selected for training, and the remaining samples were used for testing. When the number of selected samples of a ground object type was less than five, we set it to five. The specific training samples and test samples are shown in Table 1.

3.2.2. Pavia University (PU)

The Pavia University dataset was acquired over the University of Pavia in northern Italy and was collected by the airborne sensor ROSIS. In this paper, the data of 103 bands are classified after eliminating the bands affected by noise. During the experiment, 1% of each type of ground object was selected for training, and the remaining samples were used for testing. The specific training samples and test samples are shown in Table 2.

3.2.3. Salinas (SA)

The Salinas dataset was acquired in the Salinas Valley, California, USA, and, like the Indian Pines dataset, it was collected using the airborne sensor AVIRIS. Unlike Indian Pines, however, it has a spatial resolution of 3.7 m. During the experiment, 1% of each type of ground object was selected for training, and the remaining samples were used for testing. The specific training samples and test samples are shown in Table 3.

3.2.4. Houston (HU)

The Houston dataset was acquired using the ITRES CASI-1500 sensor in the vicinity of the University of Houston, Texas, USA, including nearby rural areas. This dataset serves as a benchmark and is commonly utilized to evaluate the performance of land cover classification models. The hyperspectral dataset consists of 349 × 1905 pixels with 144 wavelength bands spanning from 364 to 1046 nm at 10 nm intervals. During the experiment, 5% of each type of ground object was selected for training, and the remaining samples were used for testing. The specific training samples and test samples are shown in Table 4.

3.3. Comparative Experiment

Table 5, Table 6, Table 7 and Table 8 present a comparative analysis of our proposed model alongside several patch-based frameworks, such as M3DCNN [50], HyBridSN [51], A2S2K [52], ViT [53], and SSFTT [54]. Additionally, the experimental results of Unet [55], PSPnet [48], Swin [44], and SegFormer [47], which are based on semantic segmentation frameworks, are also included for comparison. It is worth noting that semantic segmentation-based methods demonstrate superior performance in capturing global spatial information and exhibit significant advantages, particularly in scenarios with imbalanced training samples.
The experimental findings demonstrate the significant advantages of MSSFF when compared to both patch-based models and various semantic segmentation models. Specifically, M3DCNN, as a conventional 3DCNN model, suffers from parameter redundancy and inadequate extraction of spectral and spatial features, resulting in the poorest performance. ViT overlooks the unique characteristics of hyperspectral data by solely modeling the spectral sequence without considering the spectral similarity of ground objects, leading to subpar results. In contrast, HyBridSN leverages the strengths of both 3DCNN and 2DCNN, yielding certain improvements and highlighting the importance of feature redundancy in hyperspectral analysis. A2S2K adopts a residual-based 3DCNN approach where residual blocks are introduced into the hyperspectral domain. This design choice enables the model to effectively capture and exploit residual information, enhancing its ability to learn complex spatial and spectral features from hyperspectral data. Consequently, better results are achieved, although the computational complexity and parameter count of 3DCNN remain high. SSFTT employs a combination of 3DCNN and 2DCNN for feature extraction and incorporates Transformer to globally model the feature map. Notably, SSFTT outperforms other patch-based methods, underscoring the effectiveness of Transformers in modeling underlying feature maps.
However, the encoder component of Unet fails to fully consider the spatial and spectral characteristics of HSIs, resulting in poor correlation, particularly observed in the AA index, indicating significant misclassification issues with the Unet model. Similarly, PspNet shares the same encoder as Unet but introduces the PPM in the decoder to effectively capture semantic information at multiple scales, leading to improved performance. Swin Transformer incorporates Transformer in the encoder to globally model spectral and spatial features. Additionally, Swin Transformer includes UperNet in the decoder, enabling the capture of semantic information at various scales. Consequently, Swin Transformer demonstrates favorable results; however, Transformers still exhibit feature redundancy compared to convolutional methods.
In contrast, SegFormer leverages an efficient Transformer as the encoder while designing a simple and lightweight MLP decoder to reduce feature redundancy, resulting in outstanding performance across multiple tasks. Nevertheless, using a pure Transformer as the encoder for hyperspectral tasks may introduce invalid modeling, leading to poor model stability. To address this concern, MSSFF introduces SSFM, which considers both spectral and spatial features, as a replacement for the standard 2DCNN. The modification enhances stability and reduces model complexity. Additionally, MSSFF incorporates an efficient Transformer in the deep feature map, aligning with the findings of previous literature [54]. By considering feature extraction ability and model complexity, MSSFF achieves the best performance across the three datasets.
The classification results of different methods are presented in Figure 6, Figure 7, Figure 8 and Figure 9. It can be observed from the figures that there is a significant number of misclassifications between M3DCNN and ViT, particularly when dealing with ground objects that exhibit similar spectral characteristics. However, HyBridSN, A2S2K, and SSFTT show some improvements, although there are still instances of misclassifications. Unet and PspNet, which take into account spatial characteristics, notably reduce the misclassification phenomenon in the central areas of ground objects. However, misclassification still occurs in the edge connection areas of different ground objects. Swin and SegFormer employ a hierarchical Transformer as the encoder, providing a global receptive field. Nevertheless, there are still misclassifications for ground objects with similar spectral and spatial characteristics. MSSFF shows significant improvements in mitigating misclassifications for ground objects with similar spectral and spatial characteristics, with only very few misclassifications occurring in the edge areas of different ground objects. Overall, MSSFF exhibits excellent classification performance for diverse ground objects, fully considering their spectral and spatial characteristics.

3.4. Model Analysis

To verify the effectiveness of each component in the proposed MSSFF framework, this section focuses on conducting ablation experiments. Additionally, we also explore the selection of the number of layers in the encoder and the sequencing of the spectral feature fusion module and the spatial feature fusion module in SSFM.

3.4.1. Ablation Experiments

We conducted a series of ablation experiments to assess the individual contributions of the modules in the MSSFF method. The results of the ablation experiments are shown in Table 9. The MSSFF method comprises four modules: SSFM, PPM, ET, and SA. During the ablation experiments, we systematically removed these modules and evaluated the resulting changes in the classification metrics, namely OA, AA, and K.
When all modules were removed, the classification metric scores were relatively low, indicating the significant role of these modules in improving the classification performance. Specifically, when only the PPM was used, there was a significant improvement in the classification index, demonstrating its favorable impact on enhancing classification performance. Building upon the PPM, the addition of the ET module further improved the classification index, highlighting its positive influence on classification performance. The inclusion of the SA module resulted in slight improvements in the classification metrics. Although the observed improvements were small, they still indicated the contribution of the SA module to the enhancement of classification performance. Finally, when all modules (SSFM, PPM, ET, and SA) were utilized, the classification metrics (OA, AA, and K) achieved their highest levels. This observation underscores the effectiveness of combining these modules in improving the hyperspectral classification performance of the MSSFF method.
Figure 10 illustrates the visualization of feature maps obtained from the MSSFF framework using SSFM and ET modules. A careful selection of representative feature maps was made for visual comparison, revealing that the visualization results obtained with the SSFM module exhibit enhanced refinement, capturing finer details such as object edges, contours, and texture structures. On the other hand, the visualization results obtained with the ET module demonstrate a wider receptive field and a greater emphasis on the overall context compared to those without ET. This visual analysis provides compelling evidence for the effectiveness and superiority of the designed SSFM and ET modules in the MSSFF framework.

3.4.2. Comparative Analysis of Attention Modules in MSSFF

We consider the impact of various types of attention modules on MSSFF. Specifically, we study and compare multiple existing attention mechanisms, including self-attention, channel attention, and spatial attention. Each attention module provides unique capabilities to capture different types of dependencies and enhances feature representation. Through comprehensive experiments, we identify the most effective attention module based on the characteristics of the dataset and the task goals. This systematic approach improves the performance of our deep learning models and enhances model interpretability. As shown in Table 10, the ET module achieved the best results on all three datasets.

3.4.3. Fusion Module Order Selection

The results of the sequential selection experiments conducted on the spectral feature fusion module and spatial feature fusion module in SSFM are presented in Table 11. The feature fusion module employed in SSFM shares similarities with CBAM [39], as both require careful consideration of the order in which spectral and spatial dimensions are modeled. To comprehensively evaluate the impact of feature fusion, we divided the experiments into two parts: Space-Spectral and Spectral-Space.
Interestingly, our findings indicate that fusing the spectral dimension features of hyperspectral data prior to the fusion of spatial dimensions yields better results. We speculate that this is due to the fusion of spatial dimensions potentially causing a disruption to the spectral features, leading to a decline in the effectiveness of spectral feature fusion.

3.4.4. Explore the Layers of Encoder

Regarding the impact of different layers in the encoder on the model, the corresponding results are presented in Table 12. Recent literature [20,54,57,58] has demonstrated the effectiveness of shallower models in hyperspectral object classification tasks. Therefore, we conducted an exploration by varying the number of layers in the encoder to assess their influence on model performance.
Table 12 clearly indicates that the number of layers in the encoder does not necessarily follow a “deeper is better” trend. Specifically, the model’s performance does not consistently improve as the number of layers increases. On the contrary, there is a downward trend in model performance with an increasing number of layers. This phenomenon can be attributed to the introduction of excessive redundant information by overly deep encoders when processing hyperspectral data, which subsequently hampers model performance.
Based on these observations, we can conclude that for hyperspectral object classification tasks, a shallower encoder may be more suitable, and an excessively deep encoder does not necessarily lead to performance improvements. Thus, when designing the model, the number of layers in the encoder should be considered in a comprehensive manner, and an appropriate number of layers should be selected to achieve the optimal performance.

3.4.5. Mean Squared Error (MSE) Discussion on Different Methods

Although the confusion matrix accounts for the significant differences between different categories, we have observed that the patch-based methods (HyBridSN, A2S2K, and SSFTT) exhibit similar Kappa coefficients, OA, and AA. However, merely comparing the significance differences is insufficient to fully explain the relative merits of these methods. Therefore, we conducted further testing using the MSE metric on different datasets. The experimental results are shown in Table 13.
Through the analysis of the MSE metric, we have found that the SSFTT method demonstrated a distinct advantage over A2S2K and HyBridSN across all datasets. Particularly, on the lower-resolution IA and SA datasets, A2S2K showed relatively better performance compared to HyBridSN. However, on the higher-resolution PU dataset, A2S2K exhibited relatively poorer performance.

4. Conclusions

In this paper, we propose an architecture called MSSFF that effectively combines spectral and spatial features for accurate hyperspectral semantic segmentation. MSSFF incorporates spectral and spatial feature aggregation modules within the encoder, allowing for the fusion of features and the generation of hierarchical representations. Additionally, in the deep layers of the encoder, we introduce a PPM for aggregating multi-scale semantic information. In the skip connection part, we employ an efficient Transformer to perform global modeling on deep feature maps, while utilizing a spatial attention mechanism for local feature extraction on shallow feature maps. Consequently, MSSFF exhibits strong capabilities in feature extraction as well as local–global modeling.
The performance of MSSFF was evaluated on three benchmark datasets, and it consistently outperformed other methods in terms of key evaluation metrics, including OA, AA, and Kappa. These results highlight the remarkable potential of MSSFF for hyperspectral semantic segmentation tasks, confirming its superiority over existing approaches.
Furthermore, we conducted an investigation into the impact of the number of layers in the encoder on the model’s performance. Our analysis revealed that deeper encoders do not consistently yield better results, with the optimal performance achieved when the number of layers is set to four. In future research, we plan to explore the feasibility of shallow models for hyperspectral semantic segmentation and investigate the deployment of lightweight hyperspectral semantic segmentation models on resource-constrained devices.

Author Contributions

Conceptualization, methodology, software, Y.C., Q.Y. and W.H.; validation, Y.C. and Q.Y.; writing—original draft preparation, Y.C.; writing—review and editing, Q.Y. and W.H.; visualization, Y.C. and Q.Y.; supervision, Q.Y.; project administration, Q.Y.; funding acquisition, Q.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (42001362) to Q.Y.

Data Availability Statement

The datasets presented in this paper are available at https://www.ehu.eus/ccwintco/index.php/Hyperspectral_Remote_Sensing_Scenes, accessed on 1 June 2023.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Shimoni, M.; Haelterman, R.; Perneel, C. Hypersectral Imaging for Military and Security Applications: Combining Myriad Processing and Sensing Techniques. IEEE Geosci. Remote Sens. Mag. 2019, 7, 101–117. [Google Scholar] [CrossRef]
  2. Fei, B. Hyperspectral imaging in medical applications. In Data Handling in Science and Technology; Elsevier: Amsterdam, The Netherlands, 2019; Volume 32, pp. 523–565. [Google Scholar]
  3. Liu, H.; Yu, T.; Hu, B.; Hou, X.; Zhang, Z.; Liu, X.; Liu, J.; Wang, X.; Zhong, J.; Tan, Z.; et al. Uav-borne hyperspectral imaging remote sensing system based on acousto-optic tunable filter for water quality monitoring. Remote Sens. 2021, 13, 4069. [Google Scholar] [CrossRef]
  4. Feng, L.; Zhang, Z.; Ma, Y.; Sun, Y.; Du, Q.; Williams, P.; Drewry, J.; Luck, B. Multitask Learning of Alfalfa Nutritive Value From UAV-Based Hyperspectral Images. IEEE Geosci. Remote Sens. Lett. 2022, 19, 5506305. [Google Scholar] [CrossRef]
  5. Li, Q.; Wang, Q.; Li, X. Exploring the relationship between 2D/3D convolution for hyperspectral image super-resolution. IEEE Trans. Geosci. Remote Sens. 2021, 59, 8693–8703. [Google Scholar] [CrossRef]
  6. Paoletti, M.; Haut, J.; Plaza, J.; Plaza, A. Deep learning classifiers for hyperspectral imaging: A review. ISPRS J. Photogramm. Remote Sens. 2019, 158, 279–317. [Google Scholar] [CrossRef]
  7. Zhao, W.; Du, S. Spectral–spatial feature extraction for hyperspectral image classification: A dimension reduction and deep learning approach. IEEE Trans. Geosci. Remote Sens. 2016, 54, 4544–4554. [Google Scholar] [CrossRef]
  8. Jiang, J.; Ma, J.; Chen, C.; Wang, Z.; Cai, Z.; Wang, L. SuperPCA: A superpixelwise PCA approach for unsupervised feature extraction of hyperspectral imagery. IEEE Trans. Geosci. Remote Sens. 2018, 56, 4581–4593. [Google Scholar] [CrossRef]
  9. Yan, Q.; Huang, W. Sea ice sensing from GNSS-R data using convolutional neural networks. IEEE Geosci. Remote Sens. Lett. 2018, 15, 1510–1514. [Google Scholar] [CrossRef]
  10. Chen, Y.; Yan, Q.; Huang, W. MFTSC: A Semantically Constrained Method for Urban Building Height Estimation Using Multiple Source Images. Remote Sens. 2023, 15, 5552. [Google Scholar] [CrossRef]
  11. Yan, Q.; Chen, Y.; Jin, S.; Liu, S.; Jia, Y.; Zhen, Y.; Chen, T.; Huang, W. Inland Water Mapping Based on GA-LinkNet from CyGNSS Data. IEEE Geosci. Remote Sens. Lett. 2022, 20, 1500305. [Google Scholar] [CrossRef]
  12. Bharadiya, J.P. Leveraging Machine Learning for Enhanced Business Intelligence. Int. J. Comput. Sci. Technol. 2023, 7, 1–19. [Google Scholar]
  13. Dhamo, H.; Navab, N.; Tombari, F. Object-driven multi-layer scene decomposition from a single image. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 5369–5378. [Google Scholar]
  14. Yu, S.; Jia, S.; Xu, C. Convolutional neural networks for hyperspectral image classification. Neurocomputing 2017, 219, 88–98. [Google Scholar] [CrossRef]
  15. Chen, Y.; Jiang, H.; Li, C.; Jia, X.; Ghamisi, P. Deep Feature Extraction and Classification of Hyperspectral Images Based on Convolutional Neural Networks. IEEE Trans. Geosci. Remote Sens. 2016, 54, 6232–6251. [Google Scholar] [CrossRef]
  16. Ghaderizadeh, S.; Abbasi-Moghadam, D.; Sharifi, A.; Zhao, N.; Tariq, A. Hyperspectral Image Classification Using a Hybrid 3D-2D Convolutional Neural Networks. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 7570–7588. [Google Scholar] [CrossRef]
  17. Hao, S.; Wang, W.; Salzmann, M. Geometry-Aware Deep Recurrent Neural Networks for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2021, 59, 2448–2460. [Google Scholar] [CrossRef]
  18. Li, J.; Zhao, X.; Li, Y.; Du, Q.; Xi, B.; Hu, J. Classification of Hyperspectral Imagery Using a New Fully Convolutional Neural Network. IEEE Geosci. Remote Sens. Lett. 2018, 15, 292–296. [Google Scholar] [CrossRef]
  19. Hong, D.; Han, Z.; Yao, J.; Gao, L.; Zhang, B.; Plaza, A.; Chanussot, J. SpectralFormer: Rethinking Hyperspectral Image Classification With Transformers. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5518615. [Google Scholar] [CrossRef]
  20. Chen, Y.; Liu, P.; Zhao, J.; Huang, K.; Yan, Q. Shallow-Guided Transformer for Semantic Segmentation of Hyperspectral Remote Sensing Imagery. Remote Sens. 2023, 15, 3366. [Google Scholar] [CrossRef]
  21. Chen, Y.; Wang, B.; Yan, Q.; Huang, B.; Jia, T.; Xue, B. Hyperspectral Remote-Sensing Classification Combining Transformer and Multiscale Residual Mechanisms. Laser Optoelectron. Prog. 2023, 60, 1228002. [Google Scholar] [CrossRef]
  22. Chen, Y.; Yan, Q. Vision Transformer is Required for Hyperspectral Semantic Segmentation. In Proceedings of the 2022 5th International Conference on Pattern Recognition and Artificial Intelligence (PRAI), Chengdu, China, 19–21 August 2022; pp. 36–40. [Google Scholar]
  23. Qiao, X.; Roy, S.K.; Huang, W. Multiscale Neighborhood Attention Transformer With Optimized Spatial Pattern for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5523815. [Google Scholar] [CrossRef]
  24. Yu, C.; Zhou, S.; Song, M.; Gong, B.; Zhao, E.; Chang, C.I. Unsupervised Hyperspectral Band Selection via Hybrid Graph Convolutional Network. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5530515. [Google Scholar] [CrossRef]
  25. Shi, C.; Liao, Q.; Li, X.; Zhao, L.; Li, W. Graph Guided Transformer: An Image-Based Global Learning Framework for Hyperspectral Image Classification. IEEE Geosci. Remote Sens. Lett. 2023, 20, 5512505. [Google Scholar] [CrossRef]
  26. Yu, H.; Xu, Z.; Zheng, K.; Hong, D.; Yang, H.; Song, M. MSTNet: A multilevel spectral–spatial transformer network for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5532513. [Google Scholar] [CrossRef]
  27. Zhu, Q.; Deng, W.; Zheng, Z.; Zhong, Y.; Guan, Q.; Lin, W.; Zhang, L.; Li, D. A spectral-spatial-dependent global learning framework for insufficient and imbalanced hyperspectral image classification. IEEE Trans. Cybern. 2021, 52, 11709–11723. [Google Scholar] [CrossRef] [PubMed]
  28. Jia, K.; Liang, S.; Zhang, N.; Wei, X.; Gu, X.; Zhao, X.; Yao, Y.; Xie, X. Land cover classification of finer resolution remote sensing data integrating temporal features from time series coarser resolution data. ISPRS J. Photogramm. Remote Sens. 2014, 93, 49–55. [Google Scholar] [CrossRef]
  29. Fauvel, M.; Tarabalka, Y.; Benediktsson, J.A.; Chanussot, J.; Tilton, J.C. Advances in spectral-spatial classification of hyperspectral images. Proc. IEEE 2012, 101, 652–675. [Google Scholar] [CrossRef]
  30. Mehta, A.; Ashapure, A.; Dikshit, O. Segmentation-based classification of hyperspectral imagery using projected and correlation clustering techniques. Geocarto Int. 2016, 31, 1045–1057. [Google Scholar] [CrossRef]
  31. Li, Y.; Zhang, H.; Shen, Q. Spectral–spatial classification of hyperspectral imagery with 3D convolutional neural network. Remote Sens. 2017, 9, 67. [Google Scholar] [CrossRef]
  32. Chan, R.H.; Kan, K.K.; Nikolova, M.; Plemmons, R.J. A two-stage method for spectral–spatial classification of hyperspectral images. J. Math. Imaging Vis. 2020, 62, 790–807. [Google Scholar] [CrossRef]
  33. Qiao, X.; Roy, S.K.; Huang, W. Rotation is All You Need: Cross Dimensional Residual Interaction for Hyperspectral Image Classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 16, 5387–5404. [Google Scholar] [CrossRef]
  34. Liu, J.J.; Hou, Q.; Cheng, M.M.; Wang, C.; Feng, J. Improving convolutional networks with self-calibrated convolutions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10096–10105. [Google Scholar]
  35. Li, J.; Wen, Y.; He, L. SCConv: Spatial and Channel Reconstruction Convolution for Feature Redundancy. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 6153–6162. [Google Scholar]
  36. Ren, Q.; Tu, B.; Li, Q.; He, W.; Peng, Y. Multiscale adaptive convolution for hyperspectral image classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 5115–5130. [Google Scholar] [CrossRef]
  37. Cai, W.; Ning, X.; Zhou, G.; Bai, X.; Jiang, Y.; Li, W.; Qian, P. A novel hyperspectral image classification model using bole convolution with three-direction attention mechanism: Small sample and unbalanced learning. IEEE Trans. Geosci. Remote Sens. 2022, 61, 5500917. [Google Scholar] [CrossRef]
  38. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  39. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  40. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141. [Google Scholar]
  41. Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual attention network for scene segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3146–3154. [Google Scholar]
  42. Li, X.; Wang, W.; Hu, X.; Yang, J. Selective kernel networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 510–519. [Google Scholar]
  43. Wang, S.; Li, B.Z.; Khabsa, M.; Fang, H.; Ma, H. Linformer: Self-attention with linear complexity. arXiv 2020, arXiv:2006.04768. [Google Scholar]
  44. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
  45. Jiang, Y.; Chang, S.; Wang, Z. Transgan: Two transformers can make one strong gan. arXiv 2021, arXiv:2102.07074. [Google Scholar]
  46. Li, K.; Wang, Y.; Zhang, J.; Gao, P.; Song, G.; Liu, Y.; Li, H.; Qiao, Y. Uniformer: Unifying convolution and self-attention for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 12581–12600. [Google Scholar] [CrossRef] [PubMed]
  47. Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and efficient design for semantic segmentation with transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 12077–12090. [Google Scholar]
  48. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 June 2017; pp. 2881–2890. [Google Scholar]
  49. Chen, L.C.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking atrous convolution for semantic image segmentation. arXiv 2017, arXiv:1706.05587. [Google Scholar]
  50. He, M.; Li, B.; Chen, H. Multi-scale 3D deep convolutional neural network for hyperspectral image classification. In Proceedings of the 2017 IEEE International Conference on Image Processing (ICIP), Beijing, China, 17–20 September 2017; pp. 3904–3908. [Google Scholar]
  51. Roy, S.K.; Krishna, G.; Dubey, S.R.; Chaudhuri, B.B. HybridSN: Exploring 3-D–2-D CNN feature hierarchy for hyperspectral image classification. IEEE Geosci. Remote Sens. Lett. 2019, 17, 277–281. [Google Scholar] [CrossRef]
  52. Roy, S.K.; Manna, S.; Song, T.; Bruzzone, L. Attention-based adaptive spectral–spatial kernel ResNet for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2020, 59, 7831–7843. [Google Scholar] [CrossRef]
  53. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  54. Sun, L.; Zhao, G.; Zheng, Y.; Wu, Z. Spectral–spatial feature tokenization transformer for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–14. [Google Scholar] [CrossRef]
55. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the 18th International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI 2015), Munich, Germany, 5–9 October 2015; pp. 234–241. [Google Scholar]
  56. Misra, D.; Nalamada, T.; Arasanipalai, A.U.; Hou, Q. Rotate to attend: Convolutional triplet attention module. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Virtual Conference, 5–9 January 2021; pp. 3139–3148. [Google Scholar]
  57. Yan, H.; Zhang, E.; Wang, J.; Leng, C.; Basu, A.; Peng, J. Hybrid Conv-ViT Network for Hyperspectral Image Classification. IEEE Geosci. Remote Sens. Lett. 2023, 20, 5506105. [Google Scholar] [CrossRef]
  58. Song, D.; Yang, C.; Wang, B.; Zhang, J.; Gao, H.; Tang, Y. SSRNet: A Lightweight Successive Spatial Rectified Network with Non-Central Positional Sampling Strategy for Hyperspectral Images Classification. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5519115. [Google Scholar] [CrossRef]
Figure 1. The overall framework of MSSFF. In the encoder, the first convolutional layer is adapted to match the number of spectral channels of the HSI, and a PPM is appended at the end of the encoder to enhance multi-scale feature extraction. Skip connections aid gradient backpropagation; the ET module captures global information in the deep skip connection, while the SA module focuses on local features in the shallow one. The decoder comprises three groups of upsampling and convolutional layers. During training, only labeled (known) samples contribute to the loss and its gradient; unknown samples are excluded.
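For illustration, the training restriction described in the caption can be expressed as a masked cross-entropy loss. The following minimal PyTorch sketch is not the authors' implementation; the label value −1 for unknown (unlabeled) pixels is an assumed convention.

```python
import torch
import torch.nn as nn

num_classes = 16
# Decoder output (B, C, H, W) and a ground-truth map in which -1 marks unknown pixels.
logits = torch.randn(1, num_classes, 145, 145, requires_grad=True)
gt = torch.randint(-1, num_classes, (1, 145, 145))

# ignore_index excludes unknown pixels from both the loss and the gradient.
criterion = nn.CrossEntropyLoss(ignore_index=-1)
loss = criterion(logits, gt)
loss.backward()
```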
Figure 2. Spectral fusion module. This module employs a split-extraction-fusion strategy to enhance spectral features.
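As an illustration of the split-extraction-fusion idea, the sketch below splits the channel (spectral) dimension into groups, applies an independent 1 × 1 spectral mixing to each split, and fuses the result. The specific layer choices (1 × 1 convolutions, batch normalization, residual addition) are assumptions for illustration, not the exact SSFM design.

```python
import torch
import torch.nn as nn

class SpectralFusionSketch(nn.Module):
    """Illustrative split-extraction-fusion block (assumed layers, not the authors' code)."""
    def __init__(self, channels: int, groups: int = 2):
        super().__init__()
        assert channels % groups == 0
        self.groups = groups
        split = channels // groups
        # "Extraction": an independent 1x1 spectral mixing per channel split.
        self.extract = nn.ModuleList([
            nn.Sequential(nn.Conv2d(split, split, 1), nn.BatchNorm2d(split), nn.ReLU(inplace=True))
            for _ in range(groups)
        ])
        # "Fusion": recombine the splits along the spectral (channel) axis.
        self.fuse = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        parts = torch.chunk(x, self.groups, dim=1)           # split along channels
        parts = [f(p) for f, p in zip(self.extract, parts)]  # extract per split
        return self.fuse(torch.cat(parts, dim=1)) + x        # fuse + residual

y = SpectralFusionSketch(64)(torch.randn(2, 64, 32, 32))     # output keeps shape (2, 64, 32, 32)
```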
Figure 3. Spatial feature fusion module. This module employs separation and fusion operations to enhance spatial features.
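In the spirit of the separation-and-fusion description, one simple realization separates channel-wise average and maximum descriptors of the spatial layout and fuses them into a spatial gate (a CBAM-style [39] spatial attention). The sketch below is illustrative only and is not claimed to be the exact module.

```python
import torch
import torch.nn as nn

class SpatialFusionSketch(nn.Module):
    """Illustrative separation-and-fusion block for spatial enhancement (assumed design)."""
    def __init__(self):
        super().__init__()
        self.fuse = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        # "Separation": channel-wise average and max descriptors of the spatial layout.
        avg_map = x.mean(dim=1, keepdim=True)
        max_map = x.amax(dim=1, keepdim=True)
        # "Fusion": combine the two descriptors into a single spatial gate.
        gate = torch.sigmoid(self.fuse(torch.cat([avg_map, max_map], dim=1)))
        return x * gate

y = SpatialFusionSketch()(torch.randn(2, 64, 32, 32))
```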
Figure 4. Efficient Transformer (ET), which utilizes convolution operations to efficiently reduce the dimensionality of the feature space while effectively capturing global information.
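One common way to obtain such an efficient attention is to shorten the key/value sequence with a strided convolution before standard multi-head attention, as in the spatial-reduction attention used by SegFormer [47]. The sketch below follows that pattern; the head count and reduction ratio are illustrative assumptions rather than the exact ET configuration.

```python
import torch
import torch.nn as nn

class EfficientAttentionSketch(nn.Module):
    """Self-attention with convolutional reduction of keys/values (illustrative sketch)."""
    def __init__(self, dim: int, heads: int = 4, reduction: int = 2):
        super().__init__()
        self.reduce = nn.Conv2d(dim, dim, kernel_size=reduction, stride=reduction)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                                   # x: (B, C, H, W)
        b, c, h, w = x.shape
        q = x.flatten(2).transpose(1, 2)                    # queries over all H*W positions
        kv = self.reduce(x).flatten(2).transpose(1, 2)      # shorter K/V sequence -> cheaper attention
        out, _ = self.attn(q, kv, kv)
        return out.transpose(1, 2).reshape(b, c, h, w)

y = EfficientAttentionSketch(64)(torch.randn(2, 64, 32, 32))
```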
Figure 5. Pyramid pooling module (PPM), which helps enhance the model’s understanding of complex visual scenes by aggregating features from different spatial scales.
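A PSPNet-style [48] pyramid pooling block matching this description can be sketched as follows; the bin sizes (1, 2, 3, 6) follow the common PSPNet setting, and the projection width is an illustrative choice rather than the exact MSSFF configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PPMSketch(nn.Module):
    """Minimal pyramid pooling sketch: pool at several scales, upsample, and fuse."""
    def __init__(self, in_ch: int, out_ch: int, bins=(1, 2, 3, 6)):
        super().__init__()
        self.stages = nn.ModuleList([
            nn.Sequential(nn.AdaptiveAvgPool2d(b), nn.Conv2d(in_ch, out_ch, 1)) for b in bins
        ])
        self.project = nn.Conv2d(in_ch + out_ch * len(bins), in_ch, 1)

    def forward(self, x):
        h, w = x.shape[2:]
        feats = [x] + [
            F.interpolate(stage(x), size=(h, w), mode="bilinear", align_corners=False)
            for stage in self.stages
        ]
        return self.project(torch.cat(feats, dim=1))  # multi-scale context fused back to in_ch

y = PPMSketch(64, 16)(torch.randn(2, 64, 32, 32))
```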
Figure 6. Ground-object classification map for the IA dataset.
Figure 7. Ground-object classification map for the PU dataset.
Figure 8. Ground-object classification map for the SA dataset.
Figure 9. Ground-object classification map for the Houston2013 dataset.
Figure 10. Visualization of selected encoder output features obtained with three configurations: (a) RGB image; (b,c) base model; (d,e) base model with SSFM; (f,g) base model with SSFM and ET.
Table 1. The number of training and testing pixels per category in the IA dataset.

No. | Class | Train | Test | Total
1 | Alfalfa | 5 | 41 | 46
2 | Corn-notill | 143 | 1285 | 1428
3 | Corn-mintill | 83 | 747 | 830
4 | Corn | 24 | 213 | 237
5 | Grass-pasture | 49 | 434 | 483
6 | Grass-trees | 73 | 657 | 730
7 | Grass-pasture-mowed | 5 | 23 | 28
8 | Hay-windrowed | 48 | 430 | 478
9 | Oats | 5 | 15 | 20
10 | Soybean-notill | 98 | 874 | 972
11 | Soybean-mintill | 246 | 2209 | 2455
12 | Soybean-clean | 60 | 533 | 593
13 | Wheat | 21 | 184 | 205
14 | Woods | 127 | 1138 | 1265
15 | Buildings-Grass-Trees | 39 | 347 | 386
16 | Stone-Steel-Towers | 10 | 83 | 93
Total |  | 1036 | 9213 | 10,249
Table 2. The number of training and testing pixels per category in the PU dataset.

No. | Class | Train | Test | Total
1 | Asphalt | 67 | 6564 | 6631
2 | Meadows | 187 | 18,462 | 18,649
3 | Gravel | 21 | 2078 | 2099
4 | Trees | 31 | 3033 | 3064
5 | Metal sheets | 14 | 1331 | 1345
6 | Bare Soil | 51 | 4978 | 5029
7 | Bitumen | 14 | 1316 | 1330
8 | Bricks | 37 | 3645 | 3682
9 | Shadows | 10 | 937 | 947
Total |  | 432 | 42,344 | 42,776
Table 3. The number of training and testing pixels per category in the SA dataset.

No. | Class | Train | Test | Total
1 | Brocoli-green-weeds-1 | 21 | 1988 | 2009
2 | Brocoli-green-weeds-2 | 38 | 3688 | 3726
3 | Fallow | 20 | 1956 | 1976
4 | Fallow-rough-plow | 14 | 1380 | 1394
5 | Fallow-smooth | 27 | 2651 | 2678
6 | Stubble | 40 | 3919 | 3959
7 | Celery | 36 | 3543 | 3579
8 | Grapes-untrained | 113 | 11,158 | 11,271
9 | Soil-vinyard-develop | 63 | 6140 | 6203
10 | Corn-senesced-green-weeds | 33 | 3245 | 3278
11 | Lettuce-romaine-4wk | 11 | 1057 | 1068
12 | Lettuce-romaine-5wk | 20 | 1907 | 1927
13 | Lettuce-romaine-6wk | 10 | 906 | 916
14 | Lettuce-romaine-7wk | 11 | 1059 | 1070
15 | Vinyard-untrained | 73 | 7195 | 7268
16 | Vinyard-vertical-trellis | 19 | 1788 | 1807
Total |  | 549 | 53,580 | 54,129
Table 4. The number of training and testing pixels per category in the Houston dataset.

No. | Class | Train | Test | Total
1 | Healthy Grass | 63 | 1188 | 1251
2 | Stressed Grass | 63 | 1191 | 1254
3 | Synthetic Grass | 35 | 662 | 697
4 | Tree | 63 | 1181 | 1244
5 | Soil | 63 | 1179 | 1242
6 | Water | 17 | 308 | 325
7 | Residential | 64 | 1204 | 1268
8 | Commercial | 63 | 1181 | 1244
9 | Road | 63 | 1189 | 1252
10 | Highway | 62 | 1165 | 1227
11 | Railway | 62 | 1173 | 1235
12 | Parking Lot1 | 62 | 1171 | 1233
13 | Parking Lot2 | 24 | 445 | 469
14 | Tennis Court | 22 | 406 | 428
15 | Running Track | 33 | 627 | 660
Total |  | 759 | 14,270 | 15,029
Table 5. Classification accuracy (%) of the IA image with different methods.

No. | M3DCNN | HyBridSN | A2S2K | ViT | SSFTT | Unet | PspNet | Swin | SegFormer | MSSFF
1 | 73.913 | 97.826 | 100.000 | 95.652 | 95.652 | 82.609 | 95.652 | 97.826 | 93.478 | 97.826
2 | 89.636 | 97.619 | 97.689 | 94.398 | 98.880 | 98.529 | 97.899 | 98.529 | 99.020 | 99.510
3 | 91.566 | 97.952 | 97.711 | 94.578 | 98.313 | 97.108 | 97.470 | 94.699 | 98.313 | 99.036
4 | 59.916 | 92.405 | 96.624 | 87.764 | 97.046 | 94.093 | 95.359 | 94.093 | 97.890 | 100.000
5 | 94.617 | 98.137 | 99.172 | 98.137 | 99.379 | 97.308 | 100.000 | 98.758 | 99.172 | 98.758
6 | 98.767 | 99.863 | 100.000 | 99.041 | 100.000 | 99.178 | 97.808 | 98.630 | 99.315 | 98.630
7 | 39.286 | 100.000 | 100.000 | 53.571 | 100.000 | 92.857 | 100.000 | 100.000 | 100.000 | 100.000
8 | 100.000 | 100.000 | 100.000 | 100.000 | 100.000 | 99.791 | 99.582 | 100.000 | 99.791 | 99.791
9 | 15.000 | 100.000 | 100.000 | 70.000 | 100.000 | 100.000 | 100.000 | 100.000 | 100.000 | 100.000
10 | 90.123 | 99.280 | 98.251 | 98.457 | 99.486 | 96.914 | 99.691 | 98.868 | 99.691 | 99.486
11 | 92.872 | 99.430 | 99.674 | 95.764 | 99.104 | 99.552 | 97.882 | 95.642 | 99.430 | 99.511
12 | 83.305 | 94.772 | 97.639 | 88.702 | 95.110 | 97.133 | 94.435 | 99.325 | 97.133 | 98.988
13 | 99.512 | 99.024 | 98.049 | 98.049 | 100.000 | 100.000 | 100.000 | 100.000 | 100.000 | 100.000
14 | 95.336 | 99.605 | 99.684 | 99.289 | 100.000 | 100.000 | 100.000 | 99.684 | 100.000 | 100.000
15 | 83.420 | 89.119 | 97.409 | 95.078 | 98.187 | 99.741 | 99.741 | 100.000 | 99.741 | 100.000
16 | 88.172 | 100.000 | 100.000 | 100.000 | 100.000 | 86.022 | 97.849 | 95.699 | 93.548 | 96.774
OA | 91.228 | 98.234 | 98.819 | 96.009 | 98.976 | 98.429 | 98.312 | 97.795 | 99.151 | 99.424
AA | 80.965 | 97.814 | 98.869 | 91.780 | 98.822 | 96.302 | 98.336 | 98.235 | 98.533 | 99.269
K | 89.993 | 97.986 | 98.654 | 95.451 | 98.832 | 98.208 | 98.077 | 97.489 | 99.032 | 99.344
Table 6. Classification accuracy (%) of the PU image with different methods.

No. | M3DCNN | HyBridSN | A2S2K | ViT | SSFTT | Unet | PspNet | Swin | SegFormer | MSSFF
1 | 92.610 | 97.587 | 99.020 | 96.230 | 96.954 | 98.341 | 96.230 | 99.502 | 97.768 | 99.955
2 | 99.163 | 99.995 | 100.000 | 99.844 | 99.844 | 99.914 | 99.930 | 99.571 | 99.920 | 100.000
3 | 73.273 | 92.139 | 94.378 | 88.899 | 88.852 | 80.515 | 98.190 | 97.904 | 99.285 | 98.285
4 | 83.322 | 90.601 | 96.377 | 94.191 | 96.932 | 92.004 | 85.901 | 90.078 | 96.377 | 98.792
5 | 98.290 | 100.000 | 100.000 | 100.000 | 100.000 | 99.926 | 92.416 | 99.331 | 99.405 | 100.000
6 | 83.058 | 100.000 | 98.867 | 96.083 | 99.165 | 100.000 | 100.000 | 100.000 | 100.000 | 100.000
7 | 48.496 | 98.947 | 91.955 | 96.617 | 99.925 | 91.053 | 96.842 | 98.722 | 99.248 | 99.850
8 | 73.574 | 94.324 | 87.344 | 78.599 | 94.758 | 91.798 | 99.620 | 97.882 | 99.321 | 100.000
9 | 23.337 | 92.819 | 96.410 | 97.888 | 93.031 | 89.229 | 88.807 | 72.122 | 82.049 | 99.472
OA | 88.365 | 97.884 | 97.760 | 95.932 | 97.987 | 96.952 | 97.669 | 98.062 | 98.826 | 99.806
AA | 75.014 | 96.268 | 96.039 | 94.261 | 96.607 | 93.642 | 95.326 | 95.012 | 97.041 | 99.595
K | 84.316 | 97.190 | 97.027 | 94.592 | 97.329 | 95.950 | 96.908 | 97.429 | 98.445 | 99.743
Table 7. Classification accuracy (%) of the SA image with different methods.

No. | M3DCNN | HyBridSN | A2S2K | ViT | SSFTT | Unet | PspNet | Swin | SegFormer | MSSFF
1 | 100.000 | 98.457 | 100.000 | 100.000 | 100.000 | 88.552 | 100.000 | 99.004 | 99.851 | 100.000
2 | 100.000 | 100.000 | 100.000 | 100.000 | 99.866 | 99.544 | 100.000 | 98.658 | 100.000 | 100.000
3 | 99.899 | 100.000 | 100.000 | 99.949 | 100.000 | 93.421 | 100.000 | 100.000 | 100.000 | 100.000
4 | 92.539 | 98.852 | 99.785 | 98.278 | 100.000 | 96.700 | 100.000 | 99.713 | 100.000 | 100.000
5 | 98.208 | 98.842 | 98.096 | 97.461 | 98.581 | 95.108 | 99.627 | 99.627 | 99.328 | 99.813
6 | 99.949 | 99.848 | 100.000 | 100.000 | 100.000 | 98.257 | 99.065 | 99.495 | 100.000 | 100.000
7 | 99.944 | 100.000 | 99.972 | 100.000 | 99.609 | 97.262 | 100.000 | 99.860 | 100.000 | 100.000
8 | 92.175 | 97.711 | 97.809 | 95.582 | 99.139 | 99.814 | 99.548 | 99.938 | 99.867 | 99.991
9 | 100.000 | 100.000 | 100.000 | 100.000 | 100.000 | 97.550 | 100.000 | 100.000 | 100.000 | 100.000
10 | 97.956 | 98.383 | 99.420 | 96.522 | 99.237 | 90.818 | 100.000 | 100.000 | 99.878 | 100.000
11 | 97.472 | 98.502 | 99.438 | 98.408 | 99.345 | 87.921 | 100.000 | 97.097 | 100.000 | 100.000
12 | 98.080 | 100.000 | 100.000 | 99.429 | 99.948 | 90.867 | 95.745 | 98.443 | 98.755 | 98.651
13 | 44.323 | 97.817 | 100.000 | 80.677 | 96.834 | 66.376 | 100.000 | 100.000 | 100.000 | 100.000
14 | 97.009 | 99.346 | 99.159 | 97.383 | 98.224 | 71.963 | 99.907 | 99.533 | 99.907 | 100.000
15 | 84.741 | 93.960 | 93.010 | 89.461 | 95.570 | 99.009 | 100.000 | 99.876 | 98.638 | 100.000
16 | 98.783 | 98.284 | 99.225 | 99.170 | 99.723 | 73.326 | 100.000 | 100.000 | 100.000 | 100.000
OA | 94.746 | 98.323 | 98.415 | 96.824 | 98.962 | 95.082 | 99.666 | 99.647 | 99.697 | 99.941
AA | 93.817 | 98.750 | 99.120 | 97.020 | 99.130 | 90.405 | 99.618 | 99.453 | 99.764 | 99.903
K | 94.146 | 98.131 | 98.234 | 96.462 | 98.843 | 94.507 | 99.628 | 99.607 | 99.663 | 99.934
Table 8. Classification accuracy (%) of the Houston2013 image with different methods.

No. | M3DCNN | HyBridSN | A2S2K | ViT | SSFTT | Unet | PspNet | Swin | SegFormer | MSSFF
1 | 97.475 | 99.832 | 100.000 | 97.727 | 99.579 | 98.321 | 97.202 | 98.002 | 97.442 | 99.041
2 | 98.573 | 98.405 | 98.908 | 98.489 | 98.657 | 96.890 | 93.620 | 91.148 | 92.105 | 99.841
3 | 98.640 | 99.547 | 99.698 | 100.000 | 99.396 | 100.000 | 99.857 | 99.570 | 99.570 | 99.857
4 | 98.393 | 98.985 | 99.831 | 100.000 | 99.323 | 90.595 | 87.862 | 90.354 | 93.810 | 99.598
5 | 100.000 | 100.000 | 100.000 | 100.000 | 100.000 | 98.712 | 100.000 | 99.436 | 98.551 | 100.000
6 | 100.000 | 99.029 | 100.000 | 85.113 | 94.822 | 94.769 | 99.692 | 99.077 | 100.000 | 100.000
7 | 97.261 | 98.506 | 98.091 | 91.784 | 98.921 | 96.609 | 95.426 | 95.978 | 95.347 | 98.423
8 | 88.917 | 88.917 | 85.025 | 85.787 | 88.156 | 78.135 | 86.656 | 84.244 | 85.611 | 91.399
9 | 86.375 | 91.505 | 93.272 | 87.721 | 89.823 | 85.543 | 83.387 | 83.387 | 82.348 | 93.131
10 | 98.113 | 99.485 | 100.000 | 99.657 | 97.684 | 95.355 | 99.511 | 100.000 | 100.000 | 99.837
11 | 98.806 | 98.039 | 98.721 | 98.039 | 99.488 | 94.980 | 99.109 | 100.000 | 100.000 | 100.000
12 | 92.314 | 99.402 | 99.488 | 98.036 | 98.804 | 92.944 | 98.135 | 97.242 | 97.242 | 98.378
13 | 68.610 | 97.758 | 97.982 | 80.717 | 98.655 | 95.309 | 98.081 | 98.294 | 98.294 | 98.294
14 | 99.509 | 99.754 | 100.000 | 98.280 | 100.000 | 100.000 | 100.000 | 100.000 | 100.000 | 100.000
15 | 100.000 | 100.000 | 100.000 | 99.841 | 100.000 | 100.000 | 100.000 | 100.000 | 100.000 | 100.000
OA | 95.314 | 97.647 | 97.710 | 95.462 | 97.367 | 93.785 | 95.016 | 94.903 | 95.143 | 98.250
AA | 94.866 | 97.944 | 98.068 | 94.746 | 97.554 | 94.544 | 95.903 | 95.782 | 96.021 | 98.520
K | 94.933 | 97.456 | 97.525 | 95.092 | 97.153 | 93.281 | 94.613 | 94.491 | 94.750 | 98.108
Table 9. Different module ablation experiments. The symbols "✓" and "✗" indicate that a module is selected or not selected, respectively.

SSFM | PPM | ET | SA | IA OA | IA AA | IA K | PU OA | PU AA | PU K | SA OA | SA AA | SA K
 |  |  |  | 98.556 | 98.367 | 98.353 | 98.745 | 98.193 | 98.338 | 99.507 | 99.461 | 99.451
 |  |  |  | 99.054 | 98.027 | 98.921 | 99.140 | 98.642 | 98.860 | 99.666 | 99.645 | 99.628
 |  |  |  | 99.180 | 98.468 | 99.065 | 99.439 | 98.934 | 99.256 | 99.797 | 99.720 | 99.774
 |  |  |  | 99.093 | 98.486 | 98.965 | 99.275 | 98.835 | 99.040 | 99.782 | 99.803 | 99.757
 |  |  |  | 99.219 | 98.546 | 99.110 | 99.444 | 98.986 | 99.263 | 99.869 | 99.829 | 99.854
 |  |  |  | 99.083 | 98.739 | 98.954 | 99.435 | 99.076 | 99.183 | 99.758 | 99.599 | 99.768
 |  |  |  | 99.132 | 98.801 | 99.010 | 99.584 | 99.302 | 99.449 | 99.871 | 99.781 | 99.856
 |  |  |  | 99.317 | 98.949 | 99.221 | 99.640 | 99.324 | 99.523 | 99.887 | 99.799 | 99.875
 |  |  |  | 99.268 | 98.995 | 99.166 | 99.619 | 99.477 | 99.495 | 99.882 | 99.748 | 99.868
 |  |  |  | 99.424 | 99.269 | 99.344 | 99.806 | 99.595 | 99.743 | 99.941 | 99.903 | 99.934
Table 10. Attention module replacement experiment.

Method | IA OA | IA AA | IA K | PU OA | PU AA | PU K | SA OA | SA AA | SA K
CBAM [39] | 99.229 | 98.909 | 99.121 | 99.682 | 99.319 | 99.579 | 99.891 | 99.835 | 99.879
Triplet [56] | 99.229 | 99.043 | 99.121 | 99.701 | 99.571 | 99.604 | 99.933 | 99.893 | 99.926
WMSA [44] | 99.219 | 99.056 | 99.110 | 99.710 | 99.399 | 99.616 | 99.852 | 99.793 | 99.835
MSA [53] | 99.346 | 99.248 | 99.255 | 99.659 | 99.401 | 99.548 | 99.906 | 99.868 | 99.895
ET | 99.424 | 99.269 | 99.344 | 99.806 | 99.595 | 99.743 | 99.941 | 99.903 | 99.934
Table 11. Sequential selection experiments for feature fusion in SSFM.

No. | IA Space-Spectral | IA Spectral-Space | PU Space-Spectral | PU Spectral-Space | SA Space-Spectral | SA Spectral-Space
1 | 93.478 | 97.826 | 98.975 | 99.955 | 100.000 | 100.000
2 | 99.580 | 99.510 | 99.887 | 100.000 | 100.000 | 100.000
3 | 97.349 | 99.036 | 99.619 | 98.285 | 100.000 | 100.000
4 | 100.000 | 100.000 | 97.324 | 98.792 | 100.000 | 100.000
5 | 98.344 | 98.758 | 99.851 | 100.000 | 99.701 | 99.813
6 | 98.493 | 98.630 | 100.000 | 100.000 | 100.000 | 100.000
7 | 100.000 | 100.000 | 100.000 | 99.850 | 99.860 | 100.000
8 | 99.791 | 99.791 | 100.000 | 100.000 | 100.000 | 99.991
9 | 100.000 | 100.000 | 98.944 | 99.472 | 100.000 | 100.000
10 | 99.074 | 99.486 |  |  | 100.000 | 100.000
11 | 99.389 | 99.511 |  |  | 100.000 | 100.000
12 | 98.314 | 98.988 |  |  | 96.679 | 98.651
13 | 100.000 | 100.000 |  |  | 100.000 | 100.000
14 | 100.000 | 100.000 |  |  | 100.000 | 100.000
15 | 100.000 | 100.000 |  |  | 99.725 | 100.000
16 | 96.774 | 96.774 |  |  | 99.225 | 100.000
OA | 99.141 | 99.424 | 99.553 | 99.806 | 99.795 | 99.941
AA | 98.787 | 99.269 | 99.400 | 99.595 | 99.699 | 99.903
K | 99.021 | 99.344 | 99.409 | 99.743 | 99.772 | 99.934
Table 12. Encoder layer exploration experiment.

No. | IA (3) | IA (4) | IA (5) | PU (3) | PU (4) | PU (5) | SA (3) | SA (4) | SA (5)
1 | 93.478 | 97.826 | 95.652 | 98.854 | 99.955 | 99.005 | 100.000 | 100.000 | 99.900
2 | 99.300 | 99.510 | 99.300 | 100.000 | 100.000 | 99.995 | 100.000 | 100.000 | 100.000
3 | 98.072 | 99.036 | 97.229 | 95.141 | 98.285 | 98.142 | 100.000 | 100.000 | 100.000
4 | 100.000 | 100.000 | 99.578 | 98.172 | 98.792 | 98.597 | 100.000 | 100.000 | 100.000
5 | 99.172 | 98.758 | 99.172 | 100.000 | 100.000 | 100.000 | 99.552 | 99.813 | 99.776
6 | 98.630 | 98.630 | 98.356 | 100.000 | 100.000 | 100.000 | 100.000 | 100.000 | 100.000
7 | 100.000 | 100.000 | 100.000 | 99.925 | 99.850 | 100.000 | 99.441 | 100.000 | 99.693
8 | 99.791 | 99.791 | 99.791 | 100.000 | 100.000 | 99.267 | 99.991 | 99.991 | 100.000
9 | 100.000 | 100.000 | 100.000 | 99.472 | 99.472 | 99.683 | 100.000 | 100.000 | 100.000
10 | 98.868 | 99.486 | 99.486 |  |  |  | 99.969 | 100.000 | 100.000
11 | 99.511 | 99.511 | 99.470 |  |  |  | 100.000 | 100.000 | 100.000
12 | 98.145 | 98.988 | 98.482 |  |  |  | 98.651 | 98.651 | 99.637
13 | 100.000 | 100.000 | 100.000 |  |  |  | 100.000 | 100.000 | 100.000
14 | 99.921 | 100.000 | 100.000 |  |  |  | 99.159 | 100.000 | 99.907
15 | 100.000 | 100.000 | 100.000 |  |  |  | 100.000 | 100.000 | 99.972
16 | 96.774 | 96.774 | 96.774 |  |  |  | 99.502 | 100.000 | 99.336
OA | 99.200 | 99.424 | 99.190 | 99.439 | 99.806 | 99.582 | 99.856 | 99.941 | 99.924
AA | 98.854 | 99.269 | 98.956 | 99.063 | 99.595 | 99.410 | 99.767 | 99.903 | 99.889
K | 99.088 | 99.344 | 99.077 | 99.257 | 99.743 | 99.445 | 99.840 | 99.934 | 99.916
Table 13. MSE indicator values by different methods.

Dataset | M3DCNN | HyBridSN | A2S2K | ViT | SSFTT | Unet | PspNet | Swin | SegFormer | MSSFF
IA | 3.4604 | 0.9210 | 0.5827 | 1.6187 | 0.4194 | 0.4436 | 0.5156 | 0.7707 | 0.2993 | 0.2247
PU | 3.7061 | 0.5185 | 0.6401 | 1.1174 | 0.5167 | 0.6283 | 0.4062 | 0.4065 | 0.3093 | 0.1153
SA | 1.9024 | 0.7576 | 0.6947 | 1.1847 | 0.4321 | 1.5610 | 0.0521 | 0.0604 | 0.1095 | 0.0372
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
