Landslide Recognition from Multi-Feature Remote Sensing Data Based on Improved Transformers

Huang, Renxiang; Chen, Tao

doi:10.3390/rs15133340


by Renxiang Huang ¹ and Tao Chen ¹,²,³,⁴,*

¹ School of Geophysics and Geomatics, China University of Geosciences, Wuhan 430074, China

² Badong National Observation and Research Station of Geohazards, China University of Geosciences, Wuhan 430074, China

³ Key Laboratory of National Geographic Census and Monitoring, Ministry of Natural Resources, Wuhan 430079, China

⁴ State Key Laboratory of Geohazard Prevention and Geoenvironment Protection, Chengdu University of Technology, Chengdu 610059, China

* Author to whom correspondence should be addressed.

Remote Sens. 2023, 15(13), 3340; https://0-doi-org.brum.beds.ac.uk/10.3390/rs15133340

Submission received: 3 May 2023 / Revised: 20 June 2023 / Accepted: 28 June 2023 / Published: 30 June 2023


Abstract:

Efficient and accurate landslide recognition is crucial for disaster prevention and post-disaster rescue efforts. However, compared to conventional machine learning, deep learning approaches currently face challenges such as long model runtimes and inefficiency. To tackle these challenges, we propose a novel knowledge distillation network based on Swin-Transformer (Distilled Swin-Transformer, DST) for landslide recognition. We created a new landslide sample database and combined nine landslide influencing factors (LIFs) with remote sensing images (RSIs) to evaluate the performance of DST. Our approach was tested in Zigui County, Hubei Province, China, and our quantitative evaluation showed that combining RSIs with LIFs improved the performance of the landslide recognition model. Specifically, our model achieved an Overall Accuracy (OA), Precision, Recall, F1-Score (F1), and Kappa that were 0.8381%, 0.6988%, 0.9334%, 0.8301%, and 0.0125 higher, respectively, than when using only RSIs. Compared with the results of other neural networks, namely ResNet50, Swin-Transformer, and DeiT, our proposed deep learning model achieves the best OA (98.1717%), Precision (98.1672%), Recall (98.1667%), F1 (98.1615%), and Kappa (0.9766). DST also requires the fewest floating-point operations (FLOPs) of the four models, which is crucial for improving computational efficiency, especially in landslide recognition applications after geological disasters: only 2.83 GFLOPs, which is 1.8242 GFLOPs, 1.741 GFLOPs, and 2.0284 GFLOPs less than ResNet, Swin, and DeiT, respectively. The proposed method has good applicability in rapid recognition scenarios after geological disasters.

Keywords:

landslide recognition; deep learning; knowledge distillation; efficiency

1. Introduction

Landslides are the downward or outward movement of slope-forming materials under the influence of gravity; depending on the material involved, they may consist of rock mass, soil mass, artificial soil mass, or a combination of these. Landslides cause severe damage to natural environments, property, and personal safety all over the world. Creating a landslide inventory map is crucial to recording the location, distribution, number, and extent of landslides in a study area [1,2]. Landslides frequently occur in China, causing significant damage to life, property, and the economy. From 2018 to 2021, the number of landslides and the losses they caused consistently ranked first among all geological disasters in China. Landslides can be triggered by various factors such as rainfall, earthquakes, permafrost degradation, reservoir filling, and urbanization, resulting in the movement of rocks and soil downhill [1,3,4,5]. Knowing their occurrence, distribution, and trends is vital to creating disaster prevention strategies and improving disaster reduction efforts. Recognizing landslide disasters and risks is crucial in quickly determining their location, quantity, and distribution after their occurrence [6].

The conventional methods for recognizing potential landslide areas and updating landslide inventories involve field surveys, which are known to be time-consuming, costly, and inefficient. Recently, with the rapid development of remote sensing technology and machine learning algorithms, automatic landslide recognition from multi-source remote sensing data has become possible and promising in geoscience research. In recent years, there has also been a growing interest in the recognition of landslides from optical images. Digital elevation model (DEM) data, which provides topographic information, plays a crucial role in recognizing landslides [7,8,9]. Since the recognition of landslides from remote sensing images could be defined as a pixel-level image classification problem, a variety of statistical and deep learning methods have been widely utilized. Currently, convolutional neural networks (CNNs) have become the mainstream method for deep learning and have been applied to this task [10]. For instance, Cai et al. [11] proposed a deep learning model with dense connections based on patch blocks. Supervised machine learning methods and statistical methods depend heavily on the availability of high-quality labeled data on landslides for training and evaluation datasets. Therefore, constructing labeled landslide recognition datasets is crucial for accurate recognition and analysis of landslide regions.

In recent years, the field of landslide recognition has witnessed advancements in deep learning techniques, particularly in using Convolutional Neural Networks (CNNs) and Transformer models with attention mechanisms. CNNs have emerged as the dominant approach in landslide recognition due to their exceptional ability to learn representations through convolutions. This enables the network to automatically recognize semantic features associated with landslide bodies without the need for manual calculation of complex landslide features. CNNs have been applied to tasks including image classification [12,13], object detection [14,15], and semantic segmentation [16,17].

However, CNNs have inherent limitations. Backpropagation often leads to slow parameter updates and convergence to local optima, pooling layers lose information, and the extracted features are difficult to interpret. To overcome these challenges, researchers have introduced Transformer models with attention mechanisms, which offer notable advantages. The Transformer model, initially proposed by the Google team in 2017 [18], replaces the convolutional component of a CNN with a self-attention module. This model employs multiple attention heads that specialize in distinct tasks and capture diverse input data features, thereby enhancing landslide recognition performance in remote sensing images (RSIs).

The Transformer model excels at learning intricate relationships between various data features, including geological characteristics, climate data, and satellite imagery, enabling it to effectively recognize landslide areas. Additionally, its ability to simultaneously analyze data from multiple sources and formats makes it more versatile compared to traditional landslide recognition models. By leveraging attention mechanisms, the Transformer model provides insights into the attention distribution within the model, further enhancing interpretability and understanding of the landslide recognition process.

Hinton et al. first introduced the concept of knowledge distillation in their article “Distilling the knowledge in a neural network” [19]. The core idea is that once a complex network model is trained, a smaller model can be extracted from it using another training method, so the knowledge distillation framework usually contains a large model (called the teacher model) and a small model (called the student model).

However, very few studies have applied the Transformer model to landslide disasters [20]. Lv et al. proposed a shape-enhanced visual transformer (ShapeFormer) model for better extraction and retention of multiscale shape information of landslide bodies [21]. Deep learning models are becoming increasingly popular in geoscience applications, but they still face challenges when dealing with large-scale natural disasters such as landslides. The main difficulty is the significant increase in data processing time when using a model with a large number of parameters, which can reduce the efficiency of the model and place high demands on computer hardware. To address the above issues, this study makes two main contributions to improving the efficiency of deep learning models for landslide recognition: (1) We designed a novel deep learning network using the Swin-Transformer as the backbone structure, with the aim of reducing the number of model parameters while maintaining accuracy in landslide recognition. Our goal was to improve the processing efficiency and reduce the model’s running time. The experimental results showed that our model performed well in recognizing landslides. (2) We constructed a multi-source landslide recognition dataset in the study area, which includes the RSIs, landslide influencing factors (LIFs), and RSIs + LIFs. The combination of spectral and environmental factors is introduced to improve the performance of deep learning in landslide recognition.

2. Data Preparation

2.1. Study Area

The study area is located in Zigui County, which is in the Yangtze River region of the southwest of Hubei Province, China. It is classified as being within the subtropical monsoon climate zone, experiencing mild temperatures, abundant rainfall, and sufficient sunshine. The topography and geomorphological structure in the study area are complex and predominantly mountainous, with a diverse range of strata and rocks contributing to an unstable geological structure. This instability leads to various geological disasters, with landslides being a particularly common occurrence. In addition, the study area features a well-developed river system, including a network of streams and rivers, with abundant water resources. The Yangtze River flows through Zigui County for 64 km, and river water seeps into the slopes and increases pore water pressure. This softens the rocks and soil, making the slopes more prone to instability and causing landslides. In summary, due to its unique geographical environment, the study area is inevitably threatened by various geological disasters (especially landslides), as shown in Figure 1. The study area covers an area of approximately 116 km², with coordinates ranging from 110°33′E to 110°42′E longitude and 30°57′N to 31°02′N latitude, and an elevation ranging from 80 m to 1220 m.

2.2. Landslide Inventory Data

A landslide inventory map for the study area (Figure 1) was provided by the Headquarters of Geological Hazard Prevention and Treatment in the TGR (Three Gorges Reservoir). The inventory data revealed a total of 74 landslide objects, including notable examples such as the Kazewan landslide and the Shuping landslide. The landslides span an area of 6.82 km² and are distributed along a geographic range of approximately 2–4 km on both sides of the main stem and major tributaries of the Yangtze River. The majority of the landslides are distributed along the banks, with the Kazewan landslide being the largest with an area of 1.1 km² and the Shengli Street downstream collapse being the smallest with an area of 5157.29 m². Landslides and collapses in the study area have caused many disasters for local people in history.

2.3. Data Preprocessing

The study used landslide labels, satellite remote sensing imagery, and environmental condition data. The landslide label was created using historical records, satellite image interpretation, and field survey data from the Headquarters for the Prevention and Control of Geohazards. Landsat 8 OLI data, obtained from the Geographic Spatial Data Cloud of the Computer Network Information Center of the Chinese Academy of Sciences, provided the primary high-resolution satellite imagery data. The data used in the experiments were divided into four categories, and their sources are listed in Table 1.

Figure 2 shows the nine LIF maps of the study area. The constructed dataset includes RSIs and LIFs. The LIFs were obtained from multiple sources, including RSIs, DEMs, and other ancillary data. These factors underwent a preprocessing stage that included outlier removal, resampling, and normalization. The resulting factors were then overlaid on the RSIs, resulting in a data cube with dimensions of m × n × c, where m and n represent the width and height of the image, respectively, and c denotes the number of channels. The study used a pixel-based landslide recognition method, with 60% of randomly selected samples used for training and 20% each for testing and validation. The dataset contains 509,662 sample pixels in total, of which 305,797 are training pixels and 101,932 each are testing and validation pixels. Due to the limited number of landslide samples, data augmentation was performed using horizontal flipping, added Gaussian noise, and rotation. To increase the proportion of positive samples (i.e., landslides) in this experiment, landslide points were randomly selected as seed pixels [22] in the training area, and the m × n × c data cubes were scanned using a sliding window of size 64 × 64 and step size 32 to generate enough sample cubes. After augmentation, there are 1,223,188 sample pixels for training and 407,728 each for testing and validation.
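The sliding-window scan and the 6:2:2 split described above can be sketched roughly as follows. This is only an illustration, not the authors' code: the function names `window_origins` and `split_indices`, the fixed random seed, and the decision to drop windows that would run past the image edge are all assumptions.

```python
import random


def window_origins(height, width, size=64, stride=32):
    """Return the top-left (row, col) of every full window inside the image.

    Windows that would extend past the image edge are skipped (an assumption;
    the text does not say how borders are handled).
    """
    rows = range(0, height - size + 1, stride)
    cols = range(0, width - size + 1, stride)
    return [(r, c) for r in rows for c in cols]


def split_indices(n_samples, fractions=(0.6, 0.2, 0.2), seed=0):
    """Shuffle sample indices and split them into train/validation/test lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = round(fractions[0] * n_samples)
    n_val = round(fractions[1] * n_samples)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test


# Hypothetical 128 x 128 scene: a 64 x 64 window with step 32 fits at
# offsets 0, 32, and 64 along each axis.
origins = window_origins(128, 128)
train, val, test = split_indices(100)
```

With a step of half the window size, neighboring windows overlap by 50%, which is one way a limited number of landslide pixels can yield many sample cubes.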

2.4. Model Selected

This article evaluates the performance of the proposed model by comparing it with ResNet50 (ResNet), Swin-Transformer tiny (Swin), and Data-efficient image Transformers small (DeiT) as benchmark models. We use ResNet50, Swin-Transformer tiny, and DeiT-small because on the ImageNet1k dataset (all input images are 224 × 224), the number of parameters for these three networks is 25.56 M, 28.29 M, and 22.05 M. The number of parameters in these three models is close to each other, which is convenient for comparison.

2.5. Experimental Configuration

For this paper, the experiment was conducted on a workstation computer with an Intel Xeon(R) Silver 4210R processor (× 40), 128 GB of memory, and a GeForce RTX 3090/PCIe/SSE2 graphics card. The source code was written in Python 3.7, utilizing the deep learning frameworks PyTorch and TensorFlow. According to the research purpose, three comparative experiments were designed: (1) recognition results on three different input datasets in Distilled Swin-Transformer (DST); (2) recognition results of three deep learning networks (ResNet, Swin, and DeiT) and DST; and (3) an efficiency analysis of different landslide recognition models.

3. Methods

3.1. Residual Network (ResNet)

CNNs have grown deeper over time to extract deeper features, but increasing the number of layers can lead to the vanishing gradient problem, which can reduce accuracy. Initially, the LeNet network had only 5 layers, followed by AlexNet with 8 layers, and later the VggNet network included 19 layers, while GoogleNet had 22 layers. Traditional methods use data initialization and regularization to solve this problem, but ResNet networks use residual units to directly transfer inputs across layers, which improves feature expression and performance. The key to the ResNet network is the residual unit in its structure. The residual unit contains a cross-layer connection that transfers the input directly across layers as an identity mapping and adds it to the result of the convolution operations. If the input is x, the desired output is H(x), and the nonlinear function computed by the convolutional layers is F(x), then the final output is H(x) = F(x) + x. Such an output can still be nonlinearly transformed. The residual refers to the “difference” F(x), so the network learns the residual function F(x) = H(x) − x, which is easier to optimize than learning the mapping F(x) = H(x) directly.
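The relation H(x) = F(x) + x can be illustrated with a very small sketch. It uses plain Python lists instead of convolutional layers, and `residual_unit` and the two example functions passed as F are hypothetical names chosen for illustration only.

```python
def residual_unit(x, residual_fn):
    """Compute H(x) = F(x) + x for a vector x, where residual_fn plays the role of F."""
    fx = residual_fn(x)
    if len(fx) != len(x):
        raise ValueError("F(x) must have the same length as x for the skip connection")
    return [f + xi for f, xi in zip(fx, x)]


# If the layers learn F(x) = 0, the unit reduces to the identity mapping.
identity_out = residual_unit([1.0, 2.0], lambda v: [0.0 for _ in v])

# Any other F is simply added on top of the unchanged input.
shifted_out = residual_unit([1.0, 2.0], lambda v: [2.0 * a for a in v])
```

The first call shows why the residual form helps with depth: an extra unit only needs to drive F(x) toward zero to leave the features unchanged.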

Figure 3 shows the classical transformer structure, which might help readers understand the difference between transformer structures and CNNs.

3.2. Swin-Transformer (Swin)

Swin-Transformer is a variant of the traditional Transformer architecture that uses “local attention windows” to reduce computation and memory consumption while maintaining accuracy. It divides input features into smaller patches to handle larger input images and uses a hierarchical structure to capture different features at different levels. Swin also uses a “Shifted Window” scheme so that information can be exchanged between neighboring windows. Swin has achieved good performance in computer vision tasks such as image classification, object detection, and semantic segmentation.

3.3. Data-Efficient Image Transformers (DeiT)

Data-efficient image Transformers (DeiT) is a deep learning model that uses a variant of the transformer architecture called “vision transformer” (ViT) for image classification. ViT divides an image into patches, treats each patch as a token, and processes them using transformer layers to capture spatial relationships. DeiT is designed to be data-efficient, achieving high performance on image classification tasks with relatively small amounts of training data using techniques such as distillation, data augmentation, and regularization. Its success demonstrates the potential of transformer-based architectures for computer vision applications.

3.4. Distilled Swin-Transformer (DST)

We propose a new deep learning model based on the Swin-Transformer architecture, which aims to address the issues of large model parameter sizes and low efficiency. Our approach incorporates knowledge distillation, which trains a lightweight student model using supervisory information from a larger and better-performing teacher model. This helps reduce computational costs while maintaining accuracy.

Unlike other model compression techniques such as pruning and quantization, knowledge distillation transfers knowledge from the teacher model to the student model during training. This allows the student model to learn from the teacher model’s supervisory information and improve its performance. Knowledge distillation also compresses network parameters, which reduces the overall parameter size of the model.

In our proposed model, the input image is first divided into smaller patch blocks using the Patch Partition layer. These blocks are then linearly embedded to capture relevant information within each patch. Distillation Tokens, serving as additional learnable parameters, are introduced between the Patch Embedding and Position Embedding layers to enable knowledge transfer from a larger teacher model. The patch representations, along with the added Distillation Tokens, are processed through Patch Embedding and Position Embedding layers, encoding both distillation and positional information. Multiple Transformer layers capture long-range dependencies and extract high-level features. Finally, the representations are sent to the Classifier Head for classification.
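The Patch Partition step can be sketched as follows. This covers only the splitting of an image into flattened, non-overlapping patches; the linear embedding, distillation token, and position embedding are omitted. The function name `patch_partition`, the nested-list image layout, and the patch size of 4 are assumptions, since the text does not state them.

```python
def patch_partition(image, patch):
    """Split an H x W x C image (nested lists) into flattened non-overlapping patches.

    Returns a list of tokens, each of length patch * patch * C, ordered row by row.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            token = []
            for r in range(top, top + patch):
                for c in range(left, left + patch):
                    token.extend(image[r][c])  # append all channel values of the pixel
            tokens.append(token)
    return tokens


# Hypothetical 8 x 8 image with 2 channels; each pixel stores its (row, col).
image = [[[r, c] for c in range(8)] for r in range(8)]
tokens = patch_partition(image, patch=4)
```

Under the same assumed patch size, a 64 × 64 sample cube would give (64 / 4)² = 256 patch tokens, and each distillation token would lengthen that sequence by one.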

Distillation helps transfer knowledge from a larger, high-performing teacher model to a student model. This allows the student model to learn from the teacher’s supervisory information, resulting in compressed network parameters and a decreased overall parameter size of the model. Despite the compression, the model still maintains relative accuracy. Figure 4 illustrates the resulting model architecture of DST.
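The text does not give the loss used to train the student, so the following is only a sketch of one common choice: Hinton-style soft distillation, in which a hard-label cross-entropy term is blended with a temperature-softened KL term between teacher and student outputs. The function names and the `temperature` and `alpha` values are assumptions, not settings from this study.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label, temperature=2.0, alpha=0.5):
    """Blend hard-label cross-entropy with a softened teacher-student KL term.

    alpha weights the distillation term; temperature ** 2 rescales it so that
    its magnitude stays comparable as the temperature changes.
    """
    hard = -math.log(softmax(student_logits)[label])
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    soft = sum(t * math.log(t / s) for t, s in zip(p_teacher, p_student) if t > 0)
    return (1 - alpha) * hard + alpha * temperature ** 2 * soft


# Two-class example (index 0 = landslide, index 1 = non-landslide; made-up logits).
agree = distillation_loss([2.0, 0.0], [2.0, 0.0], label=0, alpha=1.0)
disagree = distillation_loss([0.0, 2.0], [2.0, 0.0], label=0, alpha=1.0)
hard_only = distillation_loss([2.0, 0.0], [0.0, 2.0], label=0, alpha=0.0)
```

When the student reproduces the teacher's outputs, the distillation term vanishes, and it grows as the two disagree, which is the sense in which the teacher supervises the student.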

3.5. Flowchart of Landslide Recognition

The whole experimental process consists of seven parts. In the first and second parts, geological maps, RSIs, Google Earth images, and DEM data were acquired. These data and historical landslide records were used to prepare landslide labels for the study and to generate the required nine LIFs. In the third and fourth parts, the nine generated LIFs are screened. The sample database for the Zigui landslide area has been generated. In the fifth part, the training set, validation set, and test set are divided in the ratio of 6:2:2, and the landslide extraction is performed using the DST and other models. In the sixth and seventh parts, the performance of the landslide recognition model is evaluated and compared, and the effects of some parameters on the experimental results are discussed. The complete experimental flow used in this paper is shown in Figure 5.

3.6. Model Evaluation Metrics

In this paper, five statistical metrics, namely Overall Accuracy (OA), Precision, Recall, F1-score (F1), and Kappa coefficient (Kappa), are used to evaluate the model’s performance. Among them, OA indicates the proportion of correctly classified pixels (both landslide and non-landslide) in the total number of pixels; Precision indicates how many of the samples predicted as landslides are real landslide samples; Recall indicates how many of the landslide samples in the sample set are accurately predicted; and F1 is the harmonic mean of Precision and Recall. The values of these four evaluation metrics range from 0 to 1, where the closer the value is to 1, the better the performance of the corresponding model. The Kappa coefficient is a measure of statistical consistency, and we used it to evaluate the accuracy of the multiclass classification model. Kappa can quantitatively evaluate the agreement between the classification results and the true labels. When its value is greater than 0.8, the agreement can be considered good.

The calculation formulas are shown as follows:

OA = \frac{TP + TN}{TP + FP + TN + FN}

(1)

Precision = \frac{TP}{TP + FP}

(2)

Recall = \frac{TP}{TP + FN}

(3)

F1 = \frac{2 \times Precision \times Recall}{Precision + Recall}

(4)

Kappa = \frac{P_{o} - P_{e}}{1 - P_{e}}

(5)

P_{e} = \frac{(TP + FN)(TP + FP) + (FP + TN)(TN + FN)}{n^{2}}

(6)

True Positive (TP) means that the true value is a landslide and the predicted value is also a landslide, which means that the landslide sample is correctly predicted; False Positive (FP) means that the sample with the true value of non-landslide is recognized as a landslide, which means over-identification; False Negative (FN) means that the sample with the true value of landslide is predicted as non-landslide, which means the omission of identification; and True Negative (TN) means that the non-landslide sample is correctly predicted. In the Kappa calculation, Po is the observed agreement, which equals OA; Pe is the agreement expected by chance; and n is the total number of samples, n = TP + FP + TN + FN.
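As a worked illustration of the five metrics, the sketch below computes them from the four confusion-matrix counts. The function name `classification_metrics` and the example counts are made up for illustration; the sketch takes the observed agreement Po to be OA and n to be the total sample count, which is the standard definition of Kappa.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute OA, Precision, Recall, F1, and Kappa from confusion-matrix counts."""
    n = tp + fp + tn + fn
    oa = (tp + tn) / n
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    # Chance agreement: product of actual and predicted totals for each class.
    pe = ((tp + fn) * (tp + fp) + (fp + tn) * (tn + fn)) / n ** 2
    kappa = (oa - pe) / (1 - pe)
    return {"OA": oa, "Precision": precision, "Recall": recall, "F1": f1, "Kappa": kappa}


# Made-up counts: 100 samples, 45 of them true landslides.
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
```

For these counts OA is 0.85 and the chance agreement is 0.5, so Kappa is (0.85 − 0.5)/(1 − 0.5) = 0.7, which shows how Kappa discounts the agreement that would be expected by chance.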

In addition to these metrics, four other factors are selected to measure the model run efficiency: the number of model parameters (Params), the number of floating-point operations (FLOPs), the average iteration time, and the model run time. Under the condition that the OA is similar, the higher the model run efficiency, the shorter the model running time.

3.7. Model Hyperparameter Settings

Table 2 displays the hyperparameter settings used for the landslide recognition model. The model was trained for 150 epochs using a sigmoid activation function, an AdamW optimizer, and a learning rate of 0.0000003. To ensure a fair comparison, the same hyperparameter settings were applied to all models.

3.8. Landslide Influencing Factor Analysis

3.8.1. Landslide Influencing Factor Analysis

After historical research and expert analysis, it appears that there may be statistical correlations and collinearity relationships among the initially selected LIFs. These relationships can potentially lead to an inaccurate analysis of the true relationship between LIFs and landslides in the landslide recognition model. Additionally, there are multiple factors that can affect landslides, and the abundance of information in the evaluation factors may affect the accuracy of the landslide recognition results. To address these concerns, this study has adopted a quantitative approach to evaluate and select LIFs from three different perspectives: correlation analysis, collinearity testing, and importance evaluation.

3.8.2. Correlation Analysis

In this study, the Pearson correlation coefficient (PCC) was used to characterize the correlation between each of the selected LIFs [23]. In statistics, PCC is used to measure the linear correlation between two variables, X and Y, with values ranging from −1 to 1. This linear correlation can be intuitively expressed as whether Y increases or decreases as X increases. When the PCC value is positive, it indicates a positive correlation, and when it is negative, it indicates a negative correlation. When the two variables lie exactly on a straight line, the PCC value is equal to 1 or −1; if there is no linear relationship between the two variables, the PCC value is 0; and the closer the absolute value is to 1, the stronger the correlation.
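A minimal sketch of this screening step is given below: it computes the PCC for each pair of factors and lists the pairs whose absolute value reaches a threshold. The function names, the toy factor values, and the default threshold argument are assumptions for illustration; the study itself used its own factor rasters and software.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("PCC is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def strongly_correlated_pairs(factors, threshold=0.7):
    """Return (name_a, name_b, r) for every factor pair with |r| >= threshold."""
    names = sorted(factors)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(factors[a], factors[b])
            if abs(r) >= threshold:
                pairs.append((a, b, r))
    return pairs


# Toy values only; real inputs would be the per-pixel factor values.
factors = {
    "slope": [1, 2, 3, 4],
    "elevation": [2, 4, 6, 8],
    "aspect": [1, -1, -1, 1],
}
flagged = strongly_correlated_pairs(factors)
```

In this toy data only the slope and elevation columns are flagged, because one is an exact multiple of the other; a factor pair flagged in this way would be a candidate for removal.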

The visual heat map of the correlation coefficients of the nine influencing factors in the study area is shown in Figure 6, where the darker the color, the larger the value, and the stronger the correlation, with positive numbers being positive correlations and negative numbers being negative correlations. It can be seen that a few factors have a weak negative correlation with each other, while most of them have a positive correlation with each other, but the correlation is not strong. In general, the correlation between these factors is not high, and the correlation coefficients are less than the critical value of 0.7, so no factors were removed in this study.

3.8.3. Collinearity Testing

To further assess the correlation between LIFs, a multicollinearity analysis was performed on the factors [24,25]. In this study, the Variance Inflation Factor (VIF) and Tolerance (TOL) of each LIF were calculated using SPSS Statistics software [24]. The calculation formula for VIF is:

VIF = \frac{1}{1 - R_{i}^{2}}

(7)

For an independent data set X with n variables, X = {X₁, X₂, …, X_n}, R_{i}^{2} represents the coefficient of determination of the ith independent variable when regressed on all other predictors in the model. TOL is numerically the reciprocal of VIF and takes a value between 0 and 1 that indicates the strength of the collinearity between the independent variables: the closer TOL is to 1, the weaker the collinearity. Typically, factors should be removed if the VIF is greater than 10 or, under a stricter criterion, 5. As shown in Table 3, there was no multicollinearity between any of the LIFs.
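To make the two VIF thresholds concrete, the following worked values show which coefficients of determination they correspond to. The R_{i}^{2} values are illustrative only and are not taken from Table 3.

```latex
\begin{aligned}
R_{i}^{2} = 0   &\;\Rightarrow\; VIF = \frac{1}{1 - 0} = 1,    & TOL = \frac{1}{VIF} = 1 \\
R_{i}^{2} = 0.8 &\;\Rightarrow\; VIF = \frac{1}{1 - 0.8} = 5,  & TOL = \frac{1}{VIF} = 0.2 \\
R_{i}^{2} = 0.9 &\;\Rightarrow\; VIF = \frac{1}{1 - 0.9} = 10, & TOL = \frac{1}{VIF} = 0.1
\end{aligned}
```

In other words, a VIF threshold of 10 flags a factor only when the other factors explain at least 90% of its variance, while a threshold of 5 already flags it at 80%.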

4. Results Analysis

4.1. Performance Analysis of the Dataset Type

In order to verify that the additional LIFs + RSIs-based dataset can improve the performance of landslide recognition, three different dataset types are fed into our proposed model (DST) for landslide recognition. The results are shown in Figure 7.

The results of landslide recognition based on the RSIs dataset are shown in Figure 7a. From the recognition results, it can be observed that there are errors and omissions in the landslide recognition results. In general, most of the recognized landslides correspond to the actual landslide boundaries, but some recognized areas still extend beyond the boundaries. The recognition of small landslides was inadequate, or only a few pixels of small landslides were recognized. Two factors may have contributed to this occurrence. First, the study area had a significant imbalance in the number of samples for different categories, with a ratio of non-landslide to landslide pixels of 16:1, which could explain the discrepancy. Second, the dataset may have also contributed to this problem, as some small landslides may have been obscured by vegetation cover and therefore not fully recognized by the optical imagery. The results of the LIFs-based landslide recognition are shown in Figure 7b. It can be seen that the overall landslide map is not satisfactory, with numerous errors and omissions. Some landslides close to each other are not accurately distinguished. The reason for this is that the LIFs do not carry the spectral, textural, geometric, and spatial features of the landslides, which have a great influence on the landslide recognition results. Figure 7c shows the results of using the RSIs + LIFs for landslide recognition. The overall recognition results show some improvement compared to using only RSIs, with an OA improvement of 0.8381%. These results show that combining RSIs with LIFs plays an important role in landslide recognition and that the new landslide sample library is effective.

To quantitatively evaluate the model performance and landslide recognition results for three different training datasets, the evaluation metrics described in Section 3.6 of this paper were calculated using confusion matrices, and the results are shown in Table 4. The RSIs + LIFs-based model has the highest OA, Recall, F1, and Kappa. A higher Recall means that more landslides are recognized, and a higher Precision means that more of the recognized landslides are correct. The experiments show that combining RSIs with LIFs makes it easier for the landslide recognition model to distinguish landslides from bare rock, soil, and other ground features, which could significantly improve the recognition accuracy and precision of the models in the Zigui study area. However, the results also show that using only LIF data is not sufficient, indicating that optical imagery plays a dominant role in landslide recognition. The overall evaluation shows that the RSIs + LIFs-based dataset has the best recognition effect. Qualitative and quantitative evaluation of the models with three different training datasets shows that the LIFs provide additional landslide feature information that helps improve the accuracy of landslide recognition, and these influencing factors are equivalent to increasing the number of landslide features being extracted. Taking the RSIs + LIFs-based datasets as an example, the landslide is controlled by spectral, elevation, slope, and aspect factors.

Our analysis showed that the best landslide recognition results were obtained using the RSIs + LIFs dataset. Therefore, we will only use the RSIs + LIFs dataset as inputs for all subsequent experiments.

4.2. Performance Analysis of Different Model Types

To illustrate the performance of the Transformer network in landslide recognition, a traditional CNN, ResNet [13], is used for comparison. All experimental hyperparameters, training data, and other variables are consistent. ResNet is a CNN, and a CNN is a locally connected network. In contrast, the Transformer network introduces an attention mechanism, which is able to associate information at different locations in the input sequence. The results are shown in Figure 8.

Figure 8a shows the landslide recognition results of the ResNet model, indicating that large landslides have been recognized. However, there are omissions, misdetection, and noise phenomena in the landslide map, and small landslides are not recognized adequately. In addition, the recognized landslide area shows significant inconsistencies with the actual extent. Figure 8d shows the results of landslide recognition using the DST model, which exhibit the closest recognition results to the actual landslide range. The boundaries of the model almost perfectly match the actual landslide area, demonstrating its high accuracy in landslide recognition.

Table 5 presents the quantitative evaluation of the two models. DST achieved outstanding performance, with the highest OA, Precision, Recall, F1, and Kappa of 98.1717%, 98.1672%, 98.1667%, 98.1615%, and 0.9766, respectively. As can be seen from Table 5, the overall accuracy of the DST is 6% higher than that of the ResNet in the study area of Zigui. The attention mechanism is introduced into the transformer network. Self-attention can produce more interpretable models. The individual attention heads can learn to perform different tasks and thus learn more landslide features. The comparison between DST and ResNet shows that the multiple attention heads of the transformer network structure can help the landslide recognition network learn more landslide features to a certain extent and finally help the model get the best landslide recognition results.

To illustrate the performance of DST in landslide recognition, two widely used Transformer networks were compared: Swin [26] and DeiT [27]. All experimental hyperparameters, training datasets, and other variables were consistent. Swin uses a new hierarchical approach called “local attention windows” to reduce computation and memory consumption while maintaining accuracy. DeiT uses knowledge distillation to reduce the number of model parameters, which in turn increases execution speed. The results are shown in Figure 8.
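As a rough illustration of knowledge distillation, the sketch below implements the generic soft-label loss of Hinton et al. [19] in plain Python: a weighted sum of the usual cross-entropy on the true label and a temperature-scaled KL divergence between teacher and student predictions. This is an assumption-laden sketch, not the loss used by DeiT or DST; the function names, `alpha`, and `temperature` values are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax of logits divided by a temperature (higher = softer distribution)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label, alpha=0.5, temperature=2.0):
    """Generic soft-label distillation loss for a single sample.

    (1 - alpha) * CE(student, label) + alpha * T^2 * KL(teacher_T || student_T)
    """
    # Hard-label term: cross-entropy of the student against the ground truth.
    ce = -math.log(softmax(student_logits)[label])
    # Soft-label term: KL divergence between temperature-softened distributions.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_teacher, p_student) if pt > 0)
    return (1 - alpha) * ce + alpha * temperature ** 2 * kl


# Hypothetical two-class logits: [landslide, non-landslide].
student = [2.0, 0.5]
teacher_same = [2.0, 0.5]
teacher_diff = [4.0, -1.0]
loss_same = distillation_loss(student, teacher_same, label=0)
loss_diff = distillation_loss(student, teacher_diff, label=0)
```

When the student already reproduces the teacher's logits, the KL term vanishes and only the hard-label term remains; any disagreement with the teacher adds a positive penalty.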

The results of the Swin model are shown in Figure 8b. Compared with the DeiT result, the landslide recognition ability is significantly improved: the landslide boundary is closer to the actual landslide extent, the landslide map is smooth, and noise is significantly reduced. Figure 8c shows the recognition results of the DeiT model. Overall, most of the landslides are recognized and the recognized areas largely match the actual extent, but the landslide map contains noise. Figure 8d shows the results of the DST model. Its recognition results are the closest to the actual landslide extent, the landslide map is free of noise, and the boundaries match almost perfectly.

The quantitative evaluation of the three Transformer models is shown in Table 5. DST achieved the highest OA, Precision, Recall, F1, and Kappa, at 98.1717%, 98.1672%, 98.1667%, 98.1615%, and 0.9766, respectively. DST is the Swin-Transformer-based knowledge distillation model that we propose. As can be seen from Table 5, the OA of DST is about 0.5 and 6.0 percentage points higher than that of Swin (97.6667%) and DeiT (92.1667%), respectively. The network inherits the local attention window and hierarchical structure of the Swin-Transformer network, which enable the model to handle multi-scale inputs and capture different features at different levels. The comparison between DST and DeiT suggests that the improved network structure helps the landslide recognition network capture more landslide features and therefore recognize more landslides. The comparison between DST and Swin shows that the OA, Precision, Recall, F1, and Kappa of the two networks are similar because they share the same feature recognition structure.
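The OA gaps between DST and the other models follow directly from Table 5. As a minimal sketch (the variable names are illustrative; the values are copied from Table 5), the differences in percentage points can be recomputed as follows:

```python
# Overall Accuracy (%) per model, copied from Table 5.
overall_accuracy = {
    "ResNet": 92.6667,
    "Swin": 97.6667,
    "DeiT": 92.1667,
    "DST": 98.1717,
}

# Gap between DST and each baseline, in percentage points.
oa_gaps = {
    model: round(overall_accuracy["DST"] - oa, 4)
    for model, oa in overall_accuracy.items()
    if model != "DST"
}

for model, gap in oa_gaps.items():
    print(f"DST - {model}: {gap:.4f} percentage points")
```

Reporting the gaps in percentage points avoids ambiguity with relative improvement, which would divide each gap by the baseline accuracy.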

4.3. Performance Analysis of Different Model Efficiency

Table 6 provides a quantitative evaluation of the four models on several efficiency metrics: the number of model parameters, FLOPs, average iteration time, and total running time. Our proposed DST model has the largest number of parameters among the four, 27.6 M, exceeding ResNet, Swin, and DeiT by 4 M, 0.076 M, and 4.7 M, respectively. The larger number of parameters implies a more complex structure that can capture intricate relationships between input features, which may contribute to improved accuracy and performance in landslide recognition.

DST has the lowest number of FLOPs, which is crucial for improving computational efficiency, especially in landslide recognition applications after geological disasters. Despite the large number of parameters, our model requires only 2.83 GFLOPs, which is the lowest among the four models and is 1.8242 GFLOPs, 1.741 GFLOPs, and 2.0284 GFLOPs less than ResNet, Swin, and DeiT, respectively.

In terms of average iteration time, DST takes 0.0981 s per iteration, close to ResNet50's 0.0959 s and well below the 0.1539 s of Swin, which has a similar number of parameters. This means that DST can be trained faster than a model of comparable size. In addition, DST uses distillation optimization techniques that help reduce the computational load during training, which contributes to its relatively fast average iteration time.

Finally, DST has a relatively fast total running time (32 min 23 s), which makes it well suited for time-critical landslide recognition applications. In summary, compared with the other three models, the proposed landslide recognition model has the largest number of parameters, the lowest number of FLOPs, the second-fastest average iteration time, and the third-fastest total running time. These properties make it suitable for applications where timely landslide recognition is critical, providing higher accuracy, faster speed, and more efficient learning.
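The efficiency comparison can be checked against Table 6 with a few lines of Python. The sketch below is illustrative only: the dictionary keys and helper name are hypothetical, the values are copied from Table 6, and total run times are converted to seconds.

```python
# Efficiency metrics copied from Table 6 (run time converted to seconds).
table6 = {
    "ResNet": {"params_m": 23.6000, "gflops": 4.6542, "s_per_iter": 0.0959, "run_s": 31 * 60 + 53},
    "Swin":   {"params_m": 27.5240, "gflops": 4.5710, "s_per_iter": 0.1539, "run_s": 48 * 60 + 59},
    "DeiT":   {"params_m": 22.9000, "gflops": 4.8584, "s_per_iter": 0.1108, "run_s": 31 * 60 + 47},
    "DST":    {"params_m": 27.6000, "gflops": 2.8300, "s_per_iter": 0.0981, "run_s": 32 * 60 + 23},
}


def ranked(metric):
    """Model names sorted from the smallest to the largest value of a metric."""
    return sorted(table6, key=lambda name: table6[name][metric])


# Relative FLOPs reduction of DST with respect to each baseline (fraction of baseline FLOPs).
flops_reduction = {
    name: 1.0 - table6["DST"]["gflops"] / row["gflops"]
    for name, row in table6.items()
    if name != "DST"
}

for metric in ("params_m", "gflops", "s_per_iter", "run_s"):
    print(metric, ranked(metric))
for name, frac in flops_reduction.items():
    print(f"DST uses {frac:.1%} fewer FLOPs than {name}")
```

Expressed relatively, the FLOPs of DST are roughly 38% to 42% lower than those of the three baselines, while its position in the iteration-time and run-time rankings matches the summary in the text.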

5. Discussion

5.1. Importance Evaluation of Landslide Influencing Factors

To verify the contribution of each factor to the model’s performance, we evaluated the importance of nine LIFs in the study area prior to modeling. The Gini index in the random forest model was used to rank the importance of landslide-influencing factors [28]. The calculation formula for the Gini index is:

\mathrm{Gini}(p) = \sum_{k=1}^{K} P_{k} (1 - P_{k})

(8)

In the formula, the sum runs over the categories k = 1, ..., K, and P_k is the sample weight of category k. For the study area, this produced an array of size 1 × 9 whose elements are all positive and sum to 1; the larger an element, the greater the contribution of the corresponding factor to the model. Figure 9 displays the importance ranking of each factor in the study area. The Fault factor ranked high because of its role as a catastrophic factor, while the Lithology and Curve factors ranked low, probably because they do not change significantly within the study area. Since the study area consists mainly of reservoir landslides, hydrological factors such as DTR and MNDWI were found to be particularly important. Additionally, since most of the landslides were historical and had regrown vegetation, NDVI was of relatively low importance in the modeling [29]. Finally, all landslides cause surface deformation, so topographic factors such as elevation, slope, and aspect always play a crucial role in landslide recognition. The importance values of all LIFs are greater than 0, indicating that every factor contributes to the landslide recognition model in our study area and that all nine LIFs should be retained. The choice of factors depends on the environment of the study area, the type of landslide, and the method used, so factors should be selected according to the conditions of each study.
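The Gini index of Equation (8) and its use for ranking features can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the random forest implementation used in the study: the class counts are hypothetical, and the importance of a feature is approximated here by the impurity decrease of a single split, whereas a random forest averages such decreases over many splits and trees.

```python
def gini_impurity(counts):
    """Gini index sum_k P_k (1 - P_k) for a non-empty node, given per-class sample counts."""
    total = sum(counts)
    return sum((c / total) * (1.0 - c / total) for c in counts)


def impurity_decrease(parent, left, right):
    """Gini decrease achieved by splitting `parent` into `left` and `right` child nodes."""
    n = sum(parent)
    weighted_children = (
        sum(left) / n * gini_impurity(left) + sum(right) / n * gini_impurity(right)
    )
    return gini_impurity(parent) - weighted_children


# Hypothetical counts per node: [landslide samples, non-landslide samples].
parent = [50, 50]
left, right = [40, 10], [10, 40]   # a split on some factor that separates the classes well

print(gini_impurity(parent))                     # impurity before the split
print(impurity_decrease(parent, left, right))    # larger decrease = more important split
```

A factor whose splits produce large impurity decreases receives a high importance value; normalizing the per-factor totals so that they sum to 1 yields an array like the 1 × 9 importance vector described above.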

5.2. Limitations and Future Work

In this study, a knowledge distillation model based on Swin-Transformer is proposed to recognize landslides caused by precipitation. The experiments have been carried out in the Zigui area of Hubei Province, China, and the results have shown that the method has good performance in practical applications. However, there are still a number of important issues that need to be discussed.

First, the generalization ability of the model is important for practical engineering applications. The characteristics of landslides vary with geological background, but the landslide recognition model in this paper was built and tested only in a single area of rainfall-induced landslides. Its ability to recognize other types, such as earthquake-induced (coseismic) landslides, needs further investigation.

Second, although the efficiency of the proposed model has improved to some extent, there is still room for further improvement. Pre-training is a deep learning method that improves model accuracy without significantly increasing model runtime. Improving the efficiency of the proposed method after adding pre-training is therefore a direction worthy of further research.

6. Conclusions

Geological disasters are common in China, with landslides being the most common. Unfortunately, the lack of information on the location, magnitude, and distribution of landslides often hampers post-disaster emergency response and rescue efforts. To address the problems of low data processing efficiency and long model processing times in current deep learning research, this paper proposes a knowledge distillation network based on Swin-Transformer. This approach aims to improve model processing efficiency and speed by reducing the computational cost of the model (FLOPs) while maintaining the accuracy of landslide recognition. The study used nine LIFs and RSIs to construct a landslide inventory dataset, which improved the recognition performance and accuracy of the model, resulting in better discrimination of landslides from other ground features.

The proposed DST model, which uses the Swin-Transformer as its backbone, required the fewest FLOPs among the compared landslide recognition networks and ran in a time comparable to the fastest of them, while maintaining recognition accuracy. The test results based on the RSIs + LIFs dataset showed that the proposed DST model achieved the highest OA, Precision, Recall, F1, and Kappa, reaching 98.1717%, 98.1672%, 98.1667%, 98.1615%, and 0.9766, respectively. These results demonstrate the importance of landslide recognition methods and the promising potential of deep learning and multi-feature remote sensing data in landslide recognition.

Author Contributions

Conceptualization, R.H.; methodology, R.H. and T.C.; software, R.H.; validation, R.H.; formal analysis, R.H.; investigation, R.H. and T.C.; resources, T.C.; writing-original draft preparation, R.H.; writing-review and editing, T.C.; supervision, T.C.; project administration, T.C. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China under Grant 62071439; in part by the Opening Fund of the State Key Laboratory of Geohazard Prevention and Geoenvironment Protection (Chengdu University of Technology), SKLGP2022K016; in part by the Open Fund of the State Key Laboratory of Remote Sensing Science (Grant No. OFSLRSS202207); in part by the Open Fund of the Badong National Observation and Research Station of Geohazards (No. BNORSG-202302); and in part by the Opening Fund of the Key Laboratory of National Geographic Census and Monitoring, Ministry of Natural Resources, 2023NGCM11.

Data Availability Statement

This study cannot publicly share the dataset due to third-party restrictions.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Cruden, D.M. A simple definition of a landslide. Bull. Int. Assoc. Eng. Geol. 1991, 43, 27–29.
2. He, K.; Li, X.; Yan, X.; Guo, D. The landslides in the Three Gorges Reservoir Region, China and the effects of water storage and rain on their stability. Environ. Geol. 2007, 55, 55–63.
3. Lissak, C.; Bartsch, A.; De Michele, M.; Gomez, C.; Maquaire, O.; Raucoules, D.; Roulland, T. Remote Sensing for Assessing Landslides and Associated Hazards. Surv. Geophys. 2020, 41, 1391–1435.
4. Wu, X.; Chen, X.; Zhan, F.B.; Hong, S. Global research trends in landslides during 1991–2014: A bibliometric analysis. Landslides 2015, 12, 1215–1226.
5. Cruden, D.M.; Varnes, D.J. Landslide Types and Processes. In Landslides: Investigation and Mitigation, Special Report; Transportation Research Board, U.S. National Academy of Sciences: Washington, DC, USA, 1996; Volume 247, pp. 36–75.
6. Zhao, C.; Zhong, L. Remote Sensing of Landslides—A Review. Remote Sens. 2018, 10, 279.
7. Dou, J.; Yunus, A.P.; Bui, D.T.; Merghadi, A.; Sahana, M.; Zhu, Z.; Chen, C.W.; Han, Z.; Pham, B.T. Improved landslide assessment using support vector machine with bagging, boosting, and stacking ensemble machine learning framework in a mountainous watershed, 2019, Japan. Landslides 2020, 17, 641–658.
8. Barlow, J.; Martin, Y.; Franklin, S.E. Detecting translational landslide scars using segmentation of Landsat ETM+ and DEM data in the northern Cascade Mountains, British Columbia. Can. J. Remote Sens. 2003, 29, 510–517.
9. Rau, J.-Y.; Jhan, J.-P.; Rau, R.-J. Semiautomatic object-oriented landslide recognition scheme from multisensor optical imagery and dem. IEEE Trans. Geosci. Remote Sens. 2013, 52, 1336–1349.
10. Yu, H.; Ma, Y.; Wang, L.; Zhai, Y.; Wang, X. A landslide intelligent detection method based on CNN and RSG_R. In Proceedings of the 2017 IEEE International Conference on Mechatronics and Automation (ICMA), Takamatsu, Japan, 6–9 August 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 40–44.
11. Cai, H.; Chen, T.; Niu, R.; Plaza, A. Landslide recognition Using Densely Connected Convolutional Networks and Environmental Conditions. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 5235–5247.
12. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90.
13. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
14. Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448.
15. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. In Proceedings of the Advances in Neural Information Processing Systems 28 (NIPS 2015), Montreal, QC, Canada, 7–12 December 2015; pp. 91–99.
16. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer Assisted Intervention, Munich, Germany, 5–9 October 2015; Springer: Cham, Switzerland, 2015; pp. 234–241.
17. Chen, L.-C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 834–848.
18. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. arXiv 2017, arXiv:1706.03762.
19. Hinton, G.; Vinyals, O.; Dean, J. Distilling the Knowledge in a Neural Network. arXiv 2015, arXiv:1503.02531.
20. Chen, T.; Zhu, L.; Niu, R.-Q.; Trinder, C.J.; Peng, L.; Lei, T. Mapping landslide susceptibility at the Three Gorges Reservoir, China, using gradient boosting decision tree, random forest and information value models. J. Mt. Sci. 2020, 17, 670–685.
21. Lv, P.; Ma, L.; Li, Q.; Du, F. ShapeFormer: A Shape-Enhanced Vision Transformer Model for Optical Remote Sensing Image Landslide Detection. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 16, 2681–2689.
22. Ghorbanzadeh, O.; Blaschke, T. Optimizing sample patches selection of CNN to improve the MIoU on landslide detection. In Proceedings of the International Conference on Geographical Information Systems Theory, Applications and Management, Heraklion, Greece, 3–5 May 2019; pp. 33–40.
23. Gulick, S.; Barton, P.J.; Christeson, G.; Morgan, J.V.; McDonald, M.A.; Mendoza-Cervantes, K.; Pearson, Z.F.; Surendra, A.; Urrutia-Fucugauchi, J.; Vermeesch, P.M.; et al. Importance of pre-impact crustal structure for the asymmetry of the Chicxulub impact crater. Nat. Geosci. 2008, 1, 131–135.
24. Berk, K.N. Tolerance and Condition in Regression Computations. J. Am. Stat. Assoc. 1977, 72, 863–866.
25. Miao, F.; Zhao, F.; Wu, Y.; Li, L.; Török, Á. Landslide susceptibility mapping in Three Gorges Reservoir area based on GIS and boosting decision tree model. Stoch. Environ. Res. Risk Assess. 2023, 37, 2283–2303.
26. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. arXiv 2021, arXiv:2103.14030.
27. Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; Jégou, H. Training data-efficient image transformers & distillation through attention. arXiv 2020, arXiv:2012.12877.
28. Boulesteix, A.L.; Bender, A.; Lorenzo Bermejo, J.; Strobl, C. Random Forest Gini importance favors SNPs with large minor allele frequency: Impact, sources and recommendations. Brief. Bioinform. 2012, 13, 292–304.
29. Yan, L.; Gong, Q.; Wang, F.; Chen, L.; Li, D.; Yin, K. Integrated Methodology for Potential Landslide Identification in Highly Vegetation-Covered Areas. Remote Sens. 2023, 15, 1518.

Figure 1. Location of the study area.

Figure 2. Nine LIF diagrams in the study area.

Figure 3. The structure of the classical transformer.

Figure 4. Network structure diagram of the distilled Swin-Transformer.

Figure 5. A flowchart of landslide recognition.

Figure 6. Visualization of the correlation analysis of nine LIFs in the study area.

Figure 7. Three kinds of dataset-based recognition results: (a) RSIs-based; (b) LIFs-based; and (c) RSIs + LIFs-based.

Figure 8. Four model recognition results on the RSIs + LIFs dataset: (a) ResNet; (b) Swin; (c) DeiT; and (d) DST.

Figure 9. Visualization of the importance evaluation of nine LIFs in the Zigui study area.

Table 1. Sources of different applied factors.

Data Type	Factors	Resolution/Scale	Source
Topography	Elevation	30 m	GDEM
	Slope	30 m	GDEM
	Aspect	30 m	GDEM
	Curve	30 m	GDEM
Land cover	NDVI	30 m	Landsat 8
Hydrology	MNDWI	30 m	Landsat 8
	DTR	--	GIS database
Geology	Fault	--	Geological map
	Lithology	1:50,000	Geological map

Table 2. Model hyperparameter settings in the Zigui study area.

Models	Epochs	Activation Function	Optimizer	Learning Rate
ResNet	150	sigmoid	AdamW	0.0000003
Swin	150	sigmoid	AdamW	0.0000003
DeiT	150	sigmoid	AdamW	0.0000003
DST	150	sigmoid	AdamW	0.0000003

Table 3. Collinearity testing of nine LIFs in the Zigui study area.

Factor	TOL	VIF
Elevation	0.564	1.774
Aspect	0.953	1.049
Slope	0.873	1.145
Curve	0.973	1.027
NDVI	0.779	1.284
MNDWI	0.891	1.122
DTR	0.594	1.682
Fault	0.853	1.172
Lithology	0.884	1.131
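The two columns of Table 3 are linked by the standard relation VIF = 1/TOL, so the table can be sanity-checked with a short script. The sketch below is illustrative: the values are copied from Table 3, the variable names are hypothetical, and the thresholds TOL < 0.1 or VIF > 10 are a commonly used rule of thumb for multicollinearity, assumed here rather than taken from this section.

```python
# (TOL, VIF) per factor, copied from Table 3.
table3 = {
    "Elevation": (0.564, 1.774),
    "Aspect":    (0.953, 1.049),
    "Slope":     (0.873, 1.145),
    "Curve":     (0.973, 1.027),
    "NDVI":      (0.779, 1.284),
    "MNDWI":     (0.891, 1.122),
    "DTR":       (0.594, 1.682),
    "Fault":     (0.853, 1.172),
    "Lithology": (0.884, 1.131),
}

# Largest deviation between the reported VIF and 1/TOL (expected to be rounding error only).
max_gap = max(abs(1.0 / tol - vif) for tol, vif in table3.values())

# Factors that would be flagged under the assumed rule of thumb.
collinear = [name for name, (tol, vif) in table3.items() if tol < 0.1 or vif > 10]

print(f"max |1/TOL - VIF| = {max_gap:.4f}")
print("flagged factors:", collinear)
```

Under these assumptions, none of the nine factors is flagged, and the reported VIF values agree with 1/TOL to within rounding.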

Table 4. Performance values of different training datasets.

Datasets	OA/%	Precision/%	Recall/%	F1/%	Kappa
RSIs-based	97.3336	97.4684	97.3333	97.3314	0.9642
LIFs-based	66.3335	79.4000	66.3333	62.1250	0.6629
RSIs + LIFs-based	98.1717	98.1672	98.1667	98.1615	0.9766

Table 5. Performance values of different model types.

Models	OA/%	Precision/%	Recall/%	F1/%	Kappa
ResNet	92.6667	93.2218	92.6666	92.6430	0.9428
Swin	97.6667	97.6751	97.6666	97.6665	0.9643
DeiT	92.1667	93.2277	92.1666	92.1183	0.9234
DST	98.1717	98.1672	98.1667	98.1615	0.9766

Table 6. Performance value of different model efficiency analyses.

Models	Params (M)	FLOPs (GFLOPs)	Average Iteration Time (s/Iter)	Total Run Time
ResNet	23.6000	4.6542	0.0959	31 min 53 s
Swin	27.5240	4.5710	0.1539	48 min 59 s
DeiT	22.9000	4.8584	0.1108	31 min 47 s
DST	27.6000	2.8300	0.0981	32 min 23 s


© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
