Article

A Comparative Study of Landslide Susceptibility Mapping Using Bagging PU Learning in Class-Prior Probability Shift Datasets

1
School of Automation, China University of Geosciences, Wuhan 430074, China
2
Hubei Key Laboratory of Advanced Control and Intelligent Automation for Complex Systems, China University of Geosciences, Wuhan 430074, China
3
School of Geophysics and Geomatics, China University of Geosciences, Wuhan 430074, China
4
Hubei Key Laboratory of Resources and Eco-Environment Geology, Hubei Geological Bureau, Wuhan 430000, China
5
Hubei Geological Environment Station, Wuhan 430000, China
*
Author to whom correspondence should be addressed.
Remote Sens. 2023, 15(23), 5547; https://0-doi-org.brum.beds.ac.uk/10.3390/rs15235547
Submission received: 20 October 2023 / Revised: 20 November 2023 / Accepted: 27 November 2023 / Published: 28 November 2023
(This article belongs to the Special Issue Remote Sensing for Geology and Mapping)

Abstract

Landslide susceptibility mapping is typically based on binary prediction probabilities. However, non-landslide samples in modeling datasets are often unlabeled data, and these datasets are subject to class-prior shift: the proportion of landslide samples frequently deviates from real-world conditions and is spatially heterogeneous. By comparing the classification performance and predicted probability distributions across multiple unbalanced datasets with known and unknown sample proportions, this study assesses the landslide susceptibility model’s generalization ability in the context of class-prior shifts. The study investigates the potential of Bagging PU Learning, a semi-supervised learning approach, in improving the generalization performance of landslide susceptibility models and proposes the Bagging PU-GDBT algorithm. Our findings highlight the effectiveness of Bagging PU Learning in enhancing the recall of landslides and the generalization capabilities of models on unbalanced datasets. This method reduces prediction uncertainties, especially in high and very high susceptibility zones. Furthermore, the results indicate that models trained on balanced datasets with a 1:1 sample ratio are better suited to landslide susceptibility mapping than those trained on unbalanced datasets.

1. Introduction

Landslides, characterized by their powerful geological disruptive forces, high frequency, and significant hazards, have led to substantial economic losses and casualties worldwide. Therefore, the monitoring and prevention of landslide disasters is particularly crucial. The improvement of monitoring devices such as GPS, Time Domain Reflectometry (TDR) [1], and flexible inclinometers [2], together with the development of sensor technologies like Fiber Bragg Grating (FBG) [3] and Red Green Blue+Depth (RGB-D) [4], has increased the difficulty of landslide monitoring because of the abundance of monitoring data and their complex structures. Moreover, constraints such as budgetary limitations, economic considerations, and existing engineering technological conditions make it impractical to conduct comprehensive surveys and real-time monitoring for all regions and landslide points. In this context, conducting susceptibility mapping for landslides within a specific regional scope, and thereby narrowing down the monitoring and prevention areas, proves to be a proactive, effective, and economical strategy.
Landslide susceptibility mapping modeling integrates the spatial characteristics of both regional remote sensing and non-remote sensing data and selects a specific number of landslide and non-landslide calculation units. It employs either probability statistics or machine learning methods for modeling. The model is then used to predict the probability values of landslides occurring across the entire region, which are subsequently categorized into numerical intervals. Through this quantitative-to-qualitative analytical approach, regions at risk of landslides are delineated from high to low susceptibility. It involves classifying the probability of each computational unit in the study area as a landslide category within specific susceptibility intervals [5]. One of the challenges in this approach is the assumption that all samples are explicitly labeled as positive or negative. Computational units in the study area are typically considered negative samples unless explicitly labeled as landslides [6,7,8,9]. However, units not labeled as landslides might still represent potential landslide areas or undocumented landslide occurrences. In addition to oversampling the positive samples, many susceptibility studies have adjusted their sampling strategies to minimize errors arising from unreliable negative samples [10]. Previous studies employed techniques such as collecting non-landslide samples outside the buffer zone of landslide edges [11,12], selecting negative samples based on specific characteristics [13,14], and constructing collaborative models of supervised and unsupervised learning to select reliable negative samples [15]. Alternatively, pre-classification methods were used to select negative samples from locations with very low pre-classification probability [16,17]. Despite these efforts, these methods often introduce subjectivity or additional complexity.
Addressing these challenges, researchers have turned to the positive-unlabeled (PU) learning method for landslide susceptibility classification [18,19].
PU classification is designed to make two-class predictions, akin to the traditional binary classification problem, but with only labeled positive samples in the training set and unlabeled samples [20,21]. This method’s adaptability to various domains highlights its potential to mitigate data labeling challenges and enhance the reliability of predictive models, including those related to landslide susceptibility mapping (LSM).
PU Learning methods either assume that the class-prior probability (the proportion of positive samples) is known or estimate it through iterative processes [22,23]. However, the challenge in landslide susceptibility mapping lies in the class-prior probability shift of the modeling dataset relative to reality. Class-prior probability shift refers to a shift in the distribution of the dataset between the source domain and the target domain [21]. The data distribution of the test set in susceptibility modeling is often assumed to be consistent with that of the training set, an assumption that may be violated in practice. The true class-prior for landslides is much smaller than that of the modeling dataset [24]. The distribution of landslides and non-landslides in nature is difficult to estimate accurately from a preliminary survey. This discrepancy means that the data distribution in the test set does not accurately represent the wide range of areas in nature whose labels are unknown. In addition, when an established LSM model is extended to other regions, the regional heterogeneity of landslide distribution density also leads to the class-prior shift problem.
The distribution of landslides exhibits heterogeneity and scale effects [25,26]. When studying and making decisions in different geographical regions and scales, the varying proportions of landslides to non-landslides can impact the reusability of LSM models. Considering that bagging PU learning adapts to different positive and negative sample distributions in the learning process, we aim to explore whether bagging PU learning can enhance the LSM model’s generalization ability. Although previous studies have utilized PU-BaggingDT for landslide susceptibility assessment [18,19], there has been limited discussion on generalization performance. Additionally, Wright [27] demonstrated that bagging PU learning with SVM classifiers is more robust than PU-BaggingDT for unbalanced datasets. We apply Bagging PU-SVM to LSM modeling, and, considering that the GDBT algorithm, integrating decision trees, is more rapid, efficient, and robust, we propose the Bagging PU-GDBT algorithm. Several classical supervised learning models, including SVM, GDBT, LR, and RF, are employed for comparison. To explore reasonable LSM modeling dataset sample positive-to-negative ratios under the scenario of class-prior shift, we conducted tests on datasets with different sample ratios. The recall of landslide samples quantitatively assesses the model’s classification ability, and the standard error qualitatively contrasts the model’s uncertainty in predicting susceptibility values. Additionally, an analysis of the model’s generalization ability across different geographical regions is performed. We hope that this exploration can contribute valuable insights to landslide early warning and subsequent mitigation efforts.

2. Materials and Methods

2.1. Study Area

The Three Gorges Reservoir area, located in central China, is characterized by a subtropical monsoon climate with mild and humid conditions, abundant rainfall, high mountains, deep canyons, complete stratigraphic outcrops, complex geomorphic structures, and extensive landslide hazard development. For our research, we chose two specific regions within the Three Gorges Reservoir area to establish datasets, as illustrated in Figure 1.
Dataset A was collected from the high-incidence landslide areas in the Three Gorges Region, specifically from Zigui County, Hubei Province, covering an approximate area of 2274 km² (110°26′E~110°50′E, and 30°55′N~31°5′N). Zigui County is traversed by the Yangtze River from west to east, and it includes eight main tributaries that form perennial streams, with a total flow length of 247.8 km.
Dataset B was collected from Region B, situated in the Enshi Tujia and Miao Autonomous Prefecture, Hubei Province, China, spanning from 110°18′E to 111°0′E longitude and 30°38′N to 31°11′N latitude, covering a total area of 2427 km². In the northern part of the study area, the main stem of the Yangtze River runs through Guandukou Town and Xinling Town from west to east, covering a total length of 24.89 km within the region. In the southern portion of the study area, one of the primary tributaries of the Yangtze River, the Qingjiang River, flows through Jingooping Township, Qingtaiping Town, and Shuibuya Town from west to east.
Both regions share similar climatic characteristics and altitude ranges. However, region A exhibits a greater diversity in geological formations compared to region B. Consequently, region A is selected as the primary research area, while region B is utilized to validate the model’s generalization performance.

2.2. Landslide Inventories and Influence Factors

A landslide inventory is vital in susceptibility estimation, serving as labeled positive samples [28]. Our landslide inventory primarily comprised shallow landslides within the study area, identified through remote sensing image interpretation based on the landslide cataloging database provided by the Hubei Geological Environmental Center. Field surveys supplemented and validated this data. The inventory comprised two types of vector data: potential landslide points, indicating signs of deformation without a complete slide occurrence, and historical landslide polygons. The study area was partitioned using a grid unit with a size of 30 m, consistent with the resolution of most remote sensing images obtained for this study. Each grid unit was treated as a sample. In Region A, the landslide inventory comprises 201 historical landslide polygons and 876 potential landslide points, resulting in a total of 24,986 landslide units and 2,496,683 unlabeled units. Similarly, in Region B, there are 85 historical landslide polygons and 402 potential landslide points, yielding 17,221 landslide units and 2,873,629 unlabeled units. As landslide units were predominantly concentrated within historical landslide polygons, random sampling was conducted on the grid units in these polygons to reduce the likelihood of potential landslide points being mistaken as noise by the model, as illustrated in Figure 1. Ultimately, 3835 positive samples were obtained in Region A, and 3317 positive samples were obtained in Region B.
The evolution of landslides is a complex nonlinear dissipative system with an open structure, and the factors controlling and inducing landslides are still an area of active research [29,30]. The factors considered in this study encompass topographic, geological, environmental, and human-related aspects, as outlined in Table 1.

2.3. Methodology

2.3.1. Landslide Susceptibility Mapping Models

Most PU Learning algorithms simplify the problem of learning classifiers from positive and unlabeled samples into a binary classification problem, relying on the approximate premise that the ratio of positive to unlabeled samples increases monotonically with the positive-negative sample ratio [20,31,32]. However, the occurrence of landslides in extensive geographic regions is a low-probability event. The dataset for susceptibility modeling, as a result of random sampling according to preset ratios, introduces uncertainty regarding the quantity of landslide samples in the unlabeled sample set. The data distribution of unlabeled samples may strongly influence the classifier, and the distribution of positive and negative samples in the test set often fails to accurately represent the sample distribution across the entire study area, making the problem more challenging in practice. Susceptibility evaluation ultimately necessitates probability estimates across the entire study area. Determining the actual proportion of positive and negative samples in the study area is unfeasible. In such circumstances, a bagging strategy might enhance the stability of the classifier in the face of diverse sample distributions [33,34]. Bagging PU learning, as proposed by Mordelet and Vert [31], employs an ensemble approach. This method employs bootstrapping to estimate a series of classifiers from a training set consisting of positive sample set P and unlabeled sample set U , and the results of these classifiers are aggregated through simple averaging. While this method introduces bias, it is not reliant on weights. It depends on the size of the subsample set and adopts a non-approximation setting to minimize the impact of positive samples in the unlabeled sample set. Given these characteristics of Bagging PU learning, it proves more suitable for landslide susceptibility research compared to other PU learning methods.
The specific operation of bagging PU learning is to randomly draw $T$ subsets $U_t$, each containing $k$ unlabeled samples, from $U$. A weak classifier, denoted as $f_t$, is trained using $P$ and $U_t$ to distinguish between $P$ and $U_t$. For any $x \in \chi$ outside the training set (where $x \notin U$), $f_t$ assigns a probability $f_t(x)$ of being a positive sample:
$$f(x) = \frac{1}{T}\sum_{t=1}^{T} f_t(x). \tag{1}$$
Ultimately, these T classifiers are aggregated to produce the result.
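The procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in this study: the base learner is a toy centroid-distance scorer standing in for SVM or GDBT, the subsets $U_t$ are drawn with replacement, and the names `fit_centroid_scorer` and `bagging_pu` as well as the example points are hypothetical.

```python
import math
import random


def fit_centroid_scorer(pos, unl):
    """Toy base classifier f_t (an assumption, not SVM/GDBT): scores a point
    by how much closer it lies to the centroid of P than to that of U_t."""
    dim = len(pos[0])
    c_pos = [sum(x[j] for x in pos) / len(pos) for j in range(dim)]
    c_unl = [sum(x[j] for x in unl) / len(unl) for j in range(dim)]

    def score(x):
        d_pos = math.dist(x, c_pos)
        d_unl = math.dist(x, c_unl)
        # Logistic squashing of the distance gap gives a value in (0, 1).
        return 1.0 / (1.0 + math.exp(d_pos - d_unl))

    return score


def bagging_pu(pos, unl, k=None, T=30, seed=0):
    """Train T base classifiers on P versus random subsets U_t of size k
    and return the averaged scorer f(x) = (1/T) * sum_t f_t(x)."""
    rng = random.Random(seed)
    k = len(pos) if k is None else k  # default: k equals the number of positives
    scorers = []
    for _ in range(T):
        u_t = rng.choices(unl, k=k)  # subset drawn with replacement (assumption)
        scorers.append(fit_centroid_scorer(pos, u_t))

    def f(x):
        return sum(s(x) for s in scorers) / len(scorers)

    return f


# Hypothetical 2-D feature vectors: positives cluster near (2, 2); the
# unlabeled set is mostly near (0, 0) but hides one positive-like point.
positives = [(2.0, 2.0), (2.2, 1.8), (1.8, 2.1)]
unlabeled = [(0.0, 0.0), (0.1, -0.2), (-0.1, 0.1), (2.1, 2.0), (0.2, 0.0), (0.0, 0.3)]
f = bagging_pu(positives, unlabeled, T=30)
print(f((2.0, 2.0)), f((0.0, 0.0)))
```

Because each $f_t$ sees a different subset of the unlabeled data, a subset that happens to contain the hidden positive yields a weaker base classifier, but its effect is diluted by the averaging step.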
Bagging PU learning treats the positive samples hidden within $U$ as noise, referred to as empirical contamination. In other words, the higher the proportion of positive samples within $U$, the more unstable the classifier may become. The motivation behind bagging is to obtain a diverse set of classifiers by randomly sampling subsets $U_t$ with varying contamination rates. The size $k$ of the subsample set generated from $U$ can play a crucial role in balancing accuracy and the stability of individual classifiers. Because subsets that happen to contain less contamination train better-performing base classifiers, adjusting $k$ allows for a performance gain from these superior base classifiers that can outweigh the losses incurred by training each classifier on a smaller dataset.
We use $\hat{\gamma}$ to denote the proportion of hidden positive samples in $U$ and, correspondingly, $\hat{\gamma}_t$ to denote the proportion of hidden positive samples in $U_t$. The mean and variance of the positive sample rate can then be expressed as:
$$E[\hat{\gamma}_t] = \hat{\gamma}, \tag{2}$$
$$V[\hat{\gamma}_t] = \frac{1}{k}\hat{\gamma}(1 - \hat{\gamma}). \tag{3}$$
The intuition is that the stability of the classifier can be balanced by adjusting the size, $k$, of the subset of unlabeled samples to control the disturbance rate of positive samples in $U_t$.
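A small simulation can make the effect of the subset size on the contamination rate concrete. The sketch below assumes that each unlabeled sample drawn into a subset is a hidden positive independently with probability 0.1, which corresponds to sampling with replacement from a large unlabeled set; the value 0.1, the subset sizes, and the function names are illustrative choices and are not taken from this study.

```python
import random
import statistics


def contamination_rates(gamma, k, trials, seed=0):
    """Simulate the hidden-positive rate of `trials` subsets of size k, where
    each drawn sample is a hidden positive with probability gamma."""
    rng = random.Random(seed)
    return [
        sum(rng.random() < gamma for _ in range(k)) / k
        for _ in range(trials)
    ]


def theoretical_variance(gamma, k):
    """Variance of the subset contamination rate: gamma * (1 - gamma) / k."""
    return gamma * (1.0 - gamma) / k


gamma = 0.1  # illustrative contamination of the full unlabeled set
for k in (50, 200, 800):
    rates = contamination_rates(gamma, k, trials=2000)
    print(
        k,
        round(statistics.fmean(rates), 4),      # empirical mean, close to gamma
        round(statistics.pvariance(rates), 6),  # empirical variance
        round(theoretical_variance(gamma, k), 6),
    )
```

The empirical mean stays near the overall contamination for every subset size, while the spread shrinks as the subset grows, so smaller subsets produce a more diverse mix of lightly and heavily contaminated training sets.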
On the one hand, a larger subset generally results in better classifiers, as any classification method typically improves with more training samples. On the other hand, empirical contamination has a greater impact on smaller subsets. Here, we follow the analysis of Mordelet and Vert [31]. When the positive set $P$ is not too large, $k$ may not need meticulous tuning since it does not consistently exert a substantial influence on performance. Setting $k = N_P$, the number of positive samples, appears to be a prudent choice. Additionally, the value of $T$ does not significantly affect the overall accuracy of the model when $k$ is greater than 200, relieving Bagging PU learning of the additional burden of parameter adjustment during training. In our experiments, the value of $k$ equals the number of positive samples in the training set, and $T$ is set to 30.

2.3.2. Measure the Uncertainty of the Predicted Value

The generalization ability of the model is primarily reflected by the prediction accuracy of the unseen data, serving as the most direct expression of the model’s predictive capability.
Due to the limitations of the landslide inventory, it cannot encompass all historical landslide distributions across the entire study area. Consequently, the actual ratio of positive to negative samples in the study area remains unknown. Clearly, the predictive efficacy of a specific model fluctuates in datasets with varying distributions of positive and negative samples. Without measured values, determining the actual accuracy of the prediction probability for the entire region is unattainable. However, through numerous experiments, we can obtain the standard deviation of the mean estimate. By repeating experiments, we can establish a range and calculate the probability that the true overall mean falls within a certain range.
The uncertainty of the probability estimate is qualitatively summarized through the distribution of the standard error (SE) $\sigma_n$ of the landslide susceptibility value across the susceptibility interval for $n$ replicate experiments:
$$\mu = \frac{1}{n}\sum_{i=1}^{n} y_i, \tag{4}$$
$$\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \mu)^2}, \tag{5}$$
$$\sigma_n = \frac{\sigma}{\sqrt{n}}. \tag{6}$$
In Equations (4)–(6), $y_i$ represents the estimated susceptibility value from the $i$th experiment, $\mu$ is the mean of the repeated estimates, and $\sigma$ is the standard deviation of the repeated estimates. This distribution can be fitted using a polynomial obtained through the least squares method.
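As a worked illustration of Equations (4)–(6), the following sketch computes the mean, the standard deviation, and the standard error for one computational unit. The four susceptibility values are made up for the example and the function name `standard_error` is hypothetical.

```python
import math


def standard_error(values):
    """Return (mu, sigma, sigma_n) for repeated susceptibility estimates,
    using the 1/n form of the standard deviation."""
    n = len(values)
    mu = sum(values) / n
    sigma = math.sqrt(sum((y - mu) ** 2 for y in values) / n)
    sigma_n = sigma / math.sqrt(n)
    return mu, sigma, sigma_n


# Hypothetical estimates for one unit from n = 4 repeated experiments.
mu, sigma, sigma_n = standard_error([0.2, 0.4, 0.6, 0.8])
print(mu, sigma, sigma_n)
```

For these values the mean is 0.5, the standard deviation is approximately 0.2236, and the standard error is approximately 0.1118, which is half the standard deviation because $\sqrt{4} = 2$.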
All methods were implemented in Python 3.9 and executed on a Windows machine equipped with a 6-core Intel 2.9 GHz processor and 12 GB RAM.

3. Results

3.1. Classification Performance of Models

Out of the 3835 positive samples in region A, 70% were utilized as training samples, and the remaining 30% were designated as test samples for training and testing the Bagging PU models. These models were then compared with three supervised learning models: logistic regression (LR), support vector machines (SVM), and random forests (RF). An equivalent number of unlabeled sample subsets were collected as negative samples in addition to the positive samples. Due to the biased method used to obtain the landslide modeling dataset, the ratio of positive to negative samples does not align with the actual sample space in the study area. In the real world, there are significantly more negative samples than positive ones in the study area. One of our objectives is to assess whether imbalanced datasets with varying sample ratios significantly impact the classification accuracy of the models. Four positive and unlabeled sample ratios of 1:1, 1:5, 1:10, and 1:20 were adopted to establish the Test Set 1. For each ratio, non-landslide samples were randomly sampled ten times. The Receiver Operating Characteristic Curve (ROC) was utilized to compare the overall classification accuracy of the algorithmic models [35]. The average ROC curve of the model for ten tests is depicted in Figure 2.
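The repeated sampling at fixed ratios could be organized along the following lines. This is only a sketch of the general idea: the function name `sample_ratio_sets`, the seed handling, sampling without replacement, and the integer stand-ins for grid units are all assumptions made for illustration.

```python
import random


def sample_ratio_sets(positives, unlabeled, ratios=(1, 5, 10, 20), repeats=10, seed=0):
    """For each 1:r ratio, draw `repeats` independent sets of r * len(positives)
    unlabeled units to pair with the fixed positive samples."""
    rng = random.Random(seed)
    sets = {}
    for r in ratios:
        n_unl = r * len(positives)
        if n_unl > len(unlabeled):
            raise ValueError(f"not enough unlabeled units for ratio 1:{r}")
        sets[r] = [
            (list(positives), rng.sample(unlabeled, n_unl))  # without replacement
            for _ in range(repeats)
        ]
    return sets


# Hypothetical identifiers standing in for grid units.
example_positives = list(range(5))
example_unlabeled = list(range(100, 300))
ratio_sets = sample_ratio_sets(example_positives, example_unlabeled)
for r, draws in ratio_sets.items():
    print(f"1:{r}", len(draws), "draws of", len(draws[0][1]), "unlabeled units")
```

Keeping the positive samples fixed while redrawing the unlabeled ones means that differences between repeats reflect only the variability of the unlabeled sample.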
All models demonstrated satisfactory overall classification results. Notably, it seems that appropriately increasing the negative sample size can enhance the Area Under the Curve (AUC) of the LSM model. In this study, when the ratio of positive to negative samples is 1:5, all models except GDBT achieved their highest AUC. The AUC of 1:5 GDBT model is also second only to that of 1:20. However, the enhancement of bagging PU learning on the models’ classification performance might not be readily apparent. ROC measures the overall performance of the model without considering the distribution of positive and negative samples. When the distribution of positive and negative samples changes, the shape of the ROC curve remains basically unchanged. However, in the real world, the recall and precision of the dataset may have changed. A higher recall indicates that more landslide samples are detected correctly. Precision, on the other hand, represents the proportion of correctly predicted landslide samples among all predicted landslide samples. When comparing the recall and precision of multiple predictions, bagging PU learning models generally exhibited improvements in both the recall and precision of landslide samples, as illustrated in Figure 3. This improvement becomes increasingly evident with the expansion of the negative sample size.
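The difference between ROC-based measures and precision under a changing sample ratio can be illustrated with simple arithmetic. In the sketch below, the true positive rate of 0.9 and the false positive rate of 0.1 are hypothetical values chosen for the example, not results from this study, and the function names are illustrative.

```python
def recall(tp, fn):
    """Share of actual landslide samples that are detected."""
    return tp / (tp + fn)


def precision(tp, fp):
    """Share of predicted landslide samples that are actual landslides."""
    return tp / (tp + fp)


def precision_at_ratio(tpr, fpr, negatives_per_positive):
    """Precision implied by a fixed operating point (tpr, fpr) when the data
    contain `negatives_per_positive` negatives for every positive."""
    tp = tpr                           # true positives per positive sample
    fp = fpr * negatives_per_positive  # false positives per positive sample
    return tp / (tp + fp)


# Same hypothetical operating point, evaluated at the four sample ratios.
for ratio in (1, 5, 10, 20):
    print(f"1:{ratio}", round(precision_at_ratio(0.9, 0.1, ratio), 3))
```

With the operating point held fixed, the implied precision falls from 0.9 at 1:1 to about 0.31 at 1:20, although the point on the ROC curve has not moved. This is why recall and precision are examined alongside the AUC when the class prior shifts.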
In the context of landslide susceptibility classification, the accuracy of the historical landslide list heavily relies on the validation range and accuracy of field surveys. The shift in prior probability complicates the ability of some trained models to adapt to changes in data distribution during actual predictions [36]. We cannot anticipate the extent of the class-prior shift in advance, which means that a classifier must adapt to an unknown test distribution.
To assess the generalization performance of the model predicted on the unknown test set, 10 test subsets were randomly sampled from computational units outside the training and initial test sets, forming Test Set 2. Each subset comprises 2514 samples, and the distribution of each subset is unknown. The models with 1:1 and 1:5 ratios, which demonstrated superior classification performance on Test Set 1, were employed to predict the samples in Test Set 2.
Figure 4 shows the recall and precision of models with training set positive and negative sample ratios of 1:1 and 1:5 on Test Set 2. The models generally exhibit high recall for landslide samples: the recall of the 1:1 models of each algorithm is close to 1, and the recall of the 1:5 models exceeds 80%. However, precision is notably low. This discrepancy arises because the subsets contained very few landslide samples, resulting in a high recall and low precision scenario. Furthermore, it is evident that the 1:5 models utilizing Bagging PU learning exhibit enhancements in recall.

3.2. Uncertainty in the Predicted Susceptibility Values

The average values and SEs for the estimated probabilities were calculated. These probabilities were categorized into five susceptibility intervals: very low [0.01, 0.20], low (0.20, 0.45], moderate (0.45, 0.55], high (0.55, 0.80], and very high (0.80, 1.00], corresponding to five partitions with spatial susceptibility ranging from low to high [37]. The standard error of the mean (SEM) for the predicted probabilities in the susceptibility zones of Test Set 1 is presented in Table 2.
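The interval boundaries listed above can be applied with a short lookup. This sketch is illustrative only: the function name is hypothetical, and it assumes that probabilities below 0.01 are also assigned to the very low zone, which the interval definition does not state.

```python
from bisect import bisect_left

# Upper bounds of the first four right-closed intervals.
UPPER_BOUNDS = [0.20, 0.45, 0.55, 0.80]
ZONES = ["very low", "low", "moderate", "high", "very high"]


def susceptibility_zone(p):
    """Map a predicted probability to its susceptibility zone.
    Assumption: values below 0.01 are treated as very low."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    # bisect_left keeps a value equal to a bound in the lower zone,
    # which matches intervals that are closed on the right.
    return ZONES[bisect_left(UPPER_BOUNDS, p)]


for p in (0.05, 0.20, 0.50, 0.80, 0.95):
    print(p, susceptibility_zone(p))
```

Using `bisect_left` rather than `bisect_right` is what makes a probability of exactly 0.20 fall in the very low zone and exactly 0.80 fall in the high zone.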
For both 1:1 and 1:5 positive and unlabeled sample ratio, the SEMs of BaggingPU-SVM and BaggingPU-GDBT in the high and very high susceptibility zones are lower than those of SVM and GDBT, respectively. This suggests that Bagging PU learning effectively reduces uncertainty in the high susceptibility range.
Figure 5 illustrates the distribution of SEs for the predicted values of models with training set positive and negative sample ratios of 1:1 and 1:5 on Test Set 2. Due to the limited number of landslide samples in Test Set 2, some models lack predicted values in the high and very high susceptibility zones. A least squares curve can be used to fit the relationship between mean and SE. The coefficient of determination (r²) indicates that Bagging PU learning reduces the uncertainty of the predicted values.

3.3. Generalization Prediction of Models

We employed the same set of influence factors used in Region A to construct a dataset for Region B and trained a model with a positive and unlabeled sample ratio of 1:1. To assess the generalization performance of the model beyond the study area, both the Region A based BaggingPU-GDBT model and the Region B based BaggingPU-GDBT model were utilized for LSM, as depicted in Figure 6.
There were noticeable disparities in the susceptibility zones between the two maps. To quantitatively compare these differences, we tallied the number of computing units and landslide units in each susceptibility zone, as presented in Table 3.
For the Region B based BaggingPU-GDBT model, the distribution of moderate and above susceptibility zones aligns with 95.62% of the known landslide samples, with most landslide samples concentrated in the very high susceptibility zone. For the Region A based BaggingPU-GDBT model, the distribution of moderate and above susceptibility zones corresponds to 73.69% of the known landslide samples.
While the recognition rate of landslide samples is not as high as that of the Region B based model, the landslide samples in the Region A based model are predominantly concentrated in the high susceptibility zone. The relatively small predicted area of the moderate and above susceptibility zones leads to a higher frequency ratio, so the susceptibility prediction results still provide reasonable reference values.

4. Discussion

4.1. Classification Ability of Models

In the context of landslide susceptibility assessment, we acknowledge a certain level of tolerance for misclassifying non-landslide units, provided that it aligns with overall accuracy and enhances landslide recognition rates. As depicted in Figure 3 and Figure 4, it is evident that bagging PU learning enhances the recognition rate of landslide samples while maintaining satisfactory overall classification accuracy. Moreover, the increase in recall persists even when there are alterations in the distribution ratio of positive and negative samples within the dataset. This improvement is undeniably advantageous for landslide prevention and control efforts [38]. Furthermore, as dataset imbalances intensify, Bagging PU learning proves valuable in assisting SVM, a model sensitive to dataset disturbances, in maintaining a certain level of classification precision. While the GDBT model inherently exhibits robustness in datasets with varying sample ratios, bagging PU learning still enhances its recall slightly. By training classifiers using diverse subsets of unlabeled data and known landslide samples with different positive and negative sample distributions, varying disturbance rates of unlabeled landslide samples in the test set can be accommodated. This process renders the final ensemble classifier more flexible and stable [31].
From Figure 3, it can be observed that the recall of landslide samples in Test Set 1 decreases monotonically as the positive-to-negative sample ratio decreases. When the positive-to-negative sample ratio is 1:10, the recall is already below 70%. Therefore, we speculate that, contrary to some previous studies [38,39,40], collecting more negative samples than positive samples may not be a favorable choice in landslide susceptibility modeling. Past studies examining the positive-to-negative sample ratio typically utilize ROC for direct comparison, rarely considering landslide recall at the sampling stage. Especially when employing complex algorithms, models built on imbalanced landslide datasets may tend to predict negative samples more, as these algorithms can better fit the training set. Consequently, the resulting susceptibility mapping may exhibit a disproportionately large area of very low susceptibility zones [41]. We utilize models trained with positive-to-negative sample ratios of 1:1 and 1:5 to generate susceptibility maps for Region A. As shown in Figure 7, this supports our viewpoint.
The 1:1 models and 1:5 models were used to predict all computing units in region A, generating susceptibility maps as displayed in Figure 8. The regions predicted as moderate and above susceptibility by the 1:5 models appear notably more concentrated, with fewer instances of misclassifying unlabeled samples as landslides. This corresponds to an improvement in precision. However, unlike typical binary classification problems, unlabeled samples are not definitively negative classes in landslide susceptibility estimation. Using evaluation metrics that treat unlabeled samples as negative classes cannot serve as the sole basis for assessing model quality. Our goal is not only to correctly classify historical landslides but also to estimate the spatial probability of landslides in unlabeled areas. Although the 1:1 models exhibit lower precision, fewer potential landslide points are classified in low and very low susceptibility areas. This suggests that these models have better captured the characteristics of such landslide samples, leading to more reasonable mapping results. In contrast, the 1:5 models increase precision at the cost of classifying more potential landslide points as noise. Moreover, in Figure 7b, logistic regression, serving as a lower-complexity algorithm in this study, predicts the smallest area of very low susceptibility zones. This observation indicates that higher-complexity models are more likely to excessively focus on non-landslide samples in imbalanced datasets. This issue has often been overlooked in the past and merits further exploration.

4.2. Generalization Ability of Models

The generalization performance of LSM models is typically verified through three approaches: (1) examining the accuracy on the test set split from the modeling dataset; (2) applying the model to the region where the modeling dataset originated, excluding the modeling dataset; and (3) selecting a different study area for validation [42]. In this study, we have tested all three approaches. Test Set 1 and Test Set 2, respectively, represent the test set with the same sample distribution as the modeling dataset and the test set with the same origin as the modeling dataset but different positive-to-negative sample ratios. Region B serves as an additional study area to independently validate the generalization capability of the best model on other datasets.
Figure 3a and Figure 4a illustrate the stability of the bagging PU learning model. Across test sets with different sample ratio distributions, both Bagging PU-SVM and Bagging PU-GDBT exhibit higher recall compared to SVM and GDBT. Figure 8 provides a visualization of the prediction SEs for models trained with positive and unlabeled sample ratios of 1:1 and 1:5 in the training set, tested on Test Set 1. Comparing Figure 5 and Figure 8, the SEs for predictions in the moderate and higher susceptibility zones of Test Set 1 are small, indicating a comprehensive learning of the model for known landslide samples. The overall SEs for predictions on Test Set 2 are larger, particularly in the moderate and higher susceptibility zones. This may be attributed to the uncertainty in sample distribution. However, the SE fitting curves for Bagging PU-SVM and Bagging PU-GDBT are slightly shifted to the right compared to those of SVM and GDBT, suggesting that Bagging PU learning assists in reducing the uncertainty in predicting values for very high susceptibility zones. Therefore, bagging PU methods are considered helpful for the model to overcome class-prior probability shift in the landslide sample distribution to some extent, thereby enhancing the model’s generalization capability.
From Figure 6, it is evident that the Bagging PU-GDBT model trained on data from Region A exhibits limitations in predicting landslide susceptibility in Region B. We hypothesize that this discrepancy is due to variations in the influencing factors leading to landslides in the two regions. Figure 9 summarizes the contribution of each factor within the 1:1 Bagging PU-GDBT model for both regions.
For the Region A model, elevation and rainfall contribute significantly more than other factors. For the Region B model, the factors with higher contributions are elevation, valley depth, and distance from road. The difference in the combination of primary influencing factors between the two regions may be a key reason for the decreased generalization performance of the Region A-based BaggingPU-GDBT model when predicting in Region B.
In terms of factor contributions, Region B is evidently more sensitive to triggering factors than Region A. In Region B, the impact of roads and seismic activity on landslides is more pronounced. Road construction disrupts slope stability through slope cutting, and reservoir-induced earthquakes increase the sensitivity of slope stability. Seismic activity is relatively higher in the eastern and southern parts of Region B, in contrast to the weaker activity in the seismic zone where Region A is located.
Despite this, the Region A-based BaggingPU-GDBT model retains a certain degree of generalization ability in Region B owing to the similarities in elevation and hydrological features. Elevation serves as the primary controlling factor for landslide occurrence in both regions, determining dominant topography, vegetation characteristics, and soil weathering, and reflecting, to some extent, human economic and engineering activities [42]. The factors ‘distance from river’ and ‘rainfall’ hold relatively high importance in both regions. The mechanical properties of loose rock mass change under the action of water [43,44]. The fragile surface materials of slopes exacerbate their instability during intense rainfall, seismic events, and drastic fluctuations in river water levels [45,46,47]. As depicted in Figure 6 and Figure 7, in Region A, high susceptibility areas predominantly lie along both sides of watercourses, while in Region B, these areas primarily cluster along the Yangtze River, Qing River, and their tributaries. The rock formations in both regions consist mainly of carbonate rocks, known for their robust resistance to weathering. Loose deposits are concentrated along the riverbanks, providing the material basis for landslide occurrence. The water levels in these rivers fluctuate seasonally under the influence of rainfall, and groundwater conditions change as well. This pronounced variability in hydrological characteristics creates favorable external dynamic conditions for the incubation and development of geological hazards in mountainous regions.
In summary, the spatial heterogeneity of influencing factors in different geographical regions significantly impacts the model’s generalization performance [48]. Although this study offers a preliminary discussion of factor importance in the predictive models, future research should further explore how individual features contribute to landslide susceptibility, so that models can be reused in regions with highly homogeneous influencing factors.
Landslide Susceptibility Mapping (LSM), as a rapid technique for predicting the spatial distribution of disasters, provides guidance for rational spatial management. Landslides influence sediment supply in river basins. Landslides within reservoirs introduce a considerable amount of loose material, thereby contributing to subsequent debris flows within the reservoir [49]. LSM in reservoir areas therefore plays a pivotal role in anticipating potential debris flow occurrences and contributes to catchment planning. The LSM model proposed in this paper adapts to varying landslide distribution ratios in different regions and exhibits generalizability. It can be applied to other regions with similar geological, geographical, hydrological, and climatic conditions to generate landslide susceptibility maps where historical landslide inventories are incomplete. When applied to global landslide risk monitoring and disaster prevention, it can serve as a foundation for formulating appropriate natural disaster management policies. This is particularly crucial in low- to middle-income countries with high population density, where landslides in densely populated mountainous areas can lead to casualties, economic losses, and direct and indirect costs for urban-scale construction and infrastructure [50,51,52,53].

5. Conclusions

Landslide susceptibility modeling encounters challenges arising from imbalanced data distributions and class-prior probability shifts. This study introduces an approach utilizing Bagging PU Learning, a semi-supervised learning method, to enhance the performance of landslide susceptibility models. In addition, our study refines the estimation method of landslide susceptibility uncertainty, employing standard error, and conducts comprehensive evaluations encompassing classification accuracy and prediction probability uncertainty. Our findings suggest that:
  • The choice of positive-to-negative sample ratio profoundly affects the classification performance of the LSM model. While models trained on unbalanced datasets exhibit superior overall binary classification performance, those trained on balanced datasets demonstrate higher landslide recall. This emphasizes the importance of the trade-off between precision and recall in LSM modeling, which differs from typical binary classification problems.
  • The positive-to-negative sample ratio significantly impacts the mapping results. Models trained on unbalanced datasets tend to predict negative samples, lowering the overall landslide susceptibility probability in the region. Conversely, balanced datasets yield mapping results that are more reasonable for prevention and control planning.
  • Utilizing Bagging PU Learning in classifiers has the potential to boost recall in the context of class-prior probability shift, thereby enhancing the overall generalization performance of the model. This method can reduce the uncertainty of model predictions in high susceptibility areas. In this study, the BaggingPU-GDBT model shows the best performance.
Our research provides new ideas for obtaining reliable landslide susceptibility predictions in unbalanced data scenarios. However, given the spatial heterogeneity of influencing factors, further exploration of how to improve the generalization ability of the model in additional scenarios is needed.

Author Contributions

Conceptualization, L.Z. and R.N.; methodology, L.Z.; software, H.M.; validation, L.Z.; formal analysis, L.Z.; investigation, L.Z., J.D. and H.M.; resources, H.X.; data curation, J.D. and H.X.; writing—original draft preparation, L.Z.; writing—review and editing, L.Z. and X.W.; visualization, L.Z.; supervision, R.N.; project administration, R.N.; funding acquisition, X.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China, grant number 42071429, and the 111 Project under grant B17040.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Su, M.-B.; Chen, I.-H.; Liao, C.-H. Using TDR Cables and GPS for Landslide Monitoring in High Mountain Area. J. Geotech. Geoenviron. Eng. 2009, 135, 1113–1121.
  2. Zhang, Y.; Tang, H.; Li, C.; Lu, G.; Cai, Y.; Zhang, J.; Tan, F. Design and Testing of a Flexible Inclinometer Probe for Model Tests of Landslide Deep Displacement Measurement. Sensors 2018, 18, 224.
  3. Zhu, H.-H.; Shi, B.; Zhang, C.-C. FBG-Based Monitoring of Geohazards: Current Status and Trends. Sensors 2017, 17, 452.
  4. Caviedes-Voullième, D.; Juez, C.; Murillo, J.; García-Navarro, P. 2D Dry Granular Free-Surface Flow over Complex Topography with Obstacles. Part I: Experimental Study Using a Consumer-Grade RGB-D Sensor. Comput. Geosci. 2014, 73, 177–197.
  5. Cao, Y.; Wei, X.; Fan, W.; Nan, Y.; Xiong, W.; Zhang, S. Landslide Susceptibility Assessment Using the Weight of Evidence Method: A Case Study in Xunyang Area, China. PLoS ONE 2021, 16, e0245668.
  6. Aditian, A.; Kubota, T.; Shinohara, Y. Comparison of GIS-Based Landslide Susceptibility Models Using Frequency Ratio, Logistic Regression, and Artificial Neural Network in a Tertiary Region of Ambon, Indonesia. Geomorphology 2018, 318, 101–111.
  7. Fang, Z.; Wang, Y.; Peng, L.; Hong, H. A Comparative Study of Heterogeneous Ensemble-Learning Techniques for Landslide Susceptibility Mapping. Int. J. Geogr. Inf. Sci. 2021, 35, 321–347.
  8. Felicísimo, Á.M.; Cuartero, A.; Remondo, J.; Quirós, E. Mapping Landslide Susceptibility with Logistic Regression, Multiple Adaptive Regression Splines, Classification and Regression Trees, and Maximum Entropy Methods: A Comparative Study. Landslides 2013, 10, 175–189.
  9. Yao, X.; Tham, L.G.; Dai, F.C. Landslide Susceptibility Mapping Based on Support Vector Machine: A Case Study on Natural Slopes of Hong Kong, China. Geomorphology 2008, 101, 572–582.
  10. Liu, M.; Liu, J.; Xu, S.; Zhou, T.; Ma, Y.; Zhang, F.; Konečný, M. Landslide Susceptibility Mapping with the Fusion of Multi-Feature SVM Model Based FCM Sampling Strategy: A Case Study from Shaanxi Province. Int. J. Image Data Fusion 2021, 12, 349–366.
  11. Nefeslioglu, H.A.; Gokceoglu, C.; Sonmez, H. An Assessment on the Use of Logistic Regression and Artificial Neural Networks with Different Sampling Strategies for the Preparation of Landslide Susceptibility Maps. Eng. Geol. 2008, 97, 171–191.
  12. Peng, L.; Niu, R.; Huang, B.; Wu, X.; Zhao, Y.; Ye, R. Landslide Susceptibility Mapping Based on Rough Set Theory and Support Vector Machines: A Case of the Three Gorges Area, China. Geomorphology 2014, 204, 287–301.
  13. Kavzoglu, T.; Sahin, E.K.; Colkesen, I. Landslide Susceptibility Mapping Using GIS-Based Multi-Criteria Decision Analysis, Support Vector Machines, and Logistic Regression. Landslides 2014, 11, 425–439.
  14. Rabby, Y.W.; Li, Y.; Hilafu, H. An Objective Absence Data Sampling Method for Landslide Susceptibility Mapping. Sci. Rep. 2023, 13, 1740.
  15. Su, C.; Wang, B.; Lv, Y.; Zhang, M.; Peng, D.; Bate, B.; Zhang, S. Improved Landslide Susceptibility Mapping Using Unsupervised and Supervised Collaborative Machine Learning Models. Georisk Assess. Manag. Risk Eng. Syst. Geohazards 2023, 17, 387–405.
  16. Huang, F.; Yin, K.; Jiang, S.; Huang, J.; Cao, Z. Landslide Susceptibility Assessment Based on Clustering Analysis and Support Vector Machine. Chin. J. Rock Mech. Eng. 2018, 37, 156–167.
  17. Sun, D.; Wu, X.; Wen, H.; Gu, Q. A LightGBM-Based Landslide Susceptibility Model Considering the Uncertainty of Non-Landslide Samples. Geomat. Nat. Hazards Risk 2023, 14, 2213807.
  18. Fang, Z.; Wang, Y.; Niu, R.; Peng, L. Landslide Susceptibility Prediction Based on Positive Unlabeled Learning Coupled With Adaptive Sampling. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 11581–11592.
  19. Wu, B.; Qiu, W.; Jia, J.; Liu, N. Landslide Susceptibility Modeling Using Bagging-Based Positive-Unlabeled Learning. IEEE Geosci. Remote Sens. Lett. 2021, 18, 766–770.
  20. Elkan, C.; Noto, K. Learning Classifiers from Only Positive and Unlabeled Data. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Las Vegas, NV, USA, 24 August 2008; ACM: New York, NY, USA, 2008; pp. 213–220.
  21. Nakajima, S.; Sugiyama, M. Positive-Unlabeled Classification under Class-Prior Shift: A Prior-Invariant Approach Based on Density Ratio Estimation. Mach. Learn. 2023, 112, 889–919.
  22. Li, X.; Liu, B. Learning to Classify Texts Using Positive and Unlabeled Data. In Proceedings of the 18th International Joint Conference on Artificial Intelligence, Acapulco, Mexico, 9–15 August 2003; Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA, 2003; pp. 587–592.
  23. Yu, H.; Han, J.; Chang, K.C.-C. PEBL: Positive Example Based Learning for Web Page Classification Using SVM. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, AB, Canada, 23 July 2002; ACM: New York, NY, USA, 2002; pp. 239–248.
  24. Tang, L.; Yu, X.; Jiang, W.; Zhou, J. Comparative Study on Landslide Susceptibility Mapping Based on Unbalanced Sample Ratio. Sci. Rep. 2023, 13, 5823.
  25. Wang, Y.; Feng, L.; Li, S.; Ren, F.; Du, Q. A Hybrid Model Considering Spatial Heterogeneity for Landslide Susceptibility Mapping in Zhejiang Province, China. Catena 2020, 188, 104425.
  26. Wu, L. The Multi-Fractal of the Spatial Distribution of Landslide; Deng, W., Ed.; Chongqing Normal University: Chongqing, China, 2011; pp. 99–102.
  27. Wright, R. Positive-Unlabeled Learning. 2017.
  28. Ullah, K.; Wang, Y.; Fang, Z.; Wang, L.; Rahman, M. Multi-Hazard Susceptibility Mapping Based on Convolutional Neural Networks. Geosci. Front. 2022, 13, 101425.
  29. Liao, M.; Wen, H.; Yang, L. Identifying the Essential Conditioning Factors of Landslide Susceptibility Models under Different Grid Resolutions Using Hybrid Machine Learning: A Case of Wushan and Wuxi Counties, China. Catena 2022, 217, 106428.
  30. Wang, D.; Hao, M.; Chen, S.; Meng, Z.; Jiang, D.; Ding, F. Assessment of Landslide Susceptibility and Risk Factors in China. Nat. Hazards 2021, 108, 3045–3059.
  31. Mordelet, F.; Vert, J.-P. A Bagging SVM to Learn from Positive and Unlabeled Examples. Pattern Recognit. Lett. 2014, 37, 201–209.
  32. Scott, C.; Blanchard, G. Novelty Detection: Unlabeled Data Definitely Help. In Artificial Intelligence and Statistics, Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics, Clearwater Beach, FL, USA, 15 April 2009; Van Dyk, D., Welling, M., Eds.; PMLR: New York, NY, USA, 2009; Volume 5, pp. 464–471.
  33. Breiman, L. Bagging Predictors. Mach. Learn. 1996, 24, 123–140.
  34. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32.
  35. Mandrekar, J.N. Receiver Operating Characteristic Curve in Diagnostic Test Assessment. J. Thorac. Oncol. 2010, 5, 1315–1316.
  36. Kouw, W.M.; Loog, M. An Introduction to Domain Adaptation and Transfer Learning. arXiv 2018, arXiv:1812.11806.
  37. Guzzetti, F.; Reichenbach, P.; Ardizzone, F.; Cardinali, M.; Galli, M. Estimating the Quality of Landslide Susceptibility Models. Geomorphology 2006, 81, 166–184.
  38. Zhao, L.; Wu, X.; Niu, R.; Wang, Y.; Zhang, K. Using the Rotation and Random Forest Models of Ensemble Learning to Predict Landslide Susceptibility. Geomat. Nat. Hazards Risk 2020, 11, 1542–1564.
  39. Pourghasemi, H.R.; Kornejady, A.; Kerle, N.; Shabani, F. Investigating the Effects of Different Landslide Positioning Techniques, Landslide Partitioning Approaches, and Presence-Absence Balances on Landslide Susceptibility Mapping. Catena 2020, 187, 104364.
  40. Yang, C.; Liu, L.-L.; Huang, F.; Huang, L.; Wang, X.-M. Machine Learning-Based Landslide Susceptibility Assessment with Optimized Ratio of Landslide to Non-Landslide Samples. Gondwana Res. 2023, 123, 198–216.
  41. Gao, H.; Fam, P.S.; Tay, L.T.; Low, H.C. Comparative Landslide Spatial Research Based on Various Sample Sizes and Ratios in Penang Island, Malaysia. Bull. Eng. Geol. Environ. 2021, 80, 851–872.
  42. Sun, D.; Xu, J.; Wen, H.; Wang, Y. An Optimized Random Forest Model and Its Generalization Ability in Landslide Susceptibility Mapping: Application in Two Areas of Three Gorges Reservoir, China. J. Earth Sci. 2020, 31, 1068–1086.
  43. Chu, H.-J.; Chen, Y.-C.; Ali, M.; Höfle, B. Multi-Parameter Relief Map from High-Resolution DEMs: A Case Study of Mudstone Badland. Int. J. Environ. Res. Public Health 2019, 16, 1109.
  44. Guo, Y.; Li, X.; Ju, S.; Lyu, Q.; Liu, T. Utilization of 3D Laser Scanning for Stability Evaluation and Deformation Monitoring of Landslides. J. Environ. Public Health 2022, 2022, 8225322.
  45. Mantovani, J.R.; Bueno, G.T.; Alcântara, E.; Park, E.; Cunha, A.P.; Londe, L.; Massi, K.; Marengo, J.A. Novel Landslide Susceptibility Mapping Based on Multi-Criteria Decision-Making in Ouro Preto, Brazil. J. Geovisualization Spat. Anal. 2023, 7, 7.
  46. Tesfa, C. GIS-Based AHP and FR Methods for Landslide Susceptibility Mapping in the Abay Gorge, Dejen–Renaissance Bridge, Central, Ethiopia. Geotech. Geol. Eng. 2022, 40, 5029–5043.
  47. Millán-Arancibia, C.; Lavado-Casimiro, W. Rainfall Thresholds Estimation for Shallow Landslides in Peru from Gridded Daily Data. Nat. Hazards Earth Syst. Sci. 2023, 23, 1191–1206.
  48. Zhang, J.; Ma, X.; Zhang, J.; Sun, D.; Zhou, X.; Mi, C.; Wen, H. Insights into Geospatial Heterogeneity of Landslide Susceptibility Based on the SHAP-XGBoost Model. J. Environ. Manag. 2023, 332, 117357.
  49. Jin, W.; Cui, P.; Zhang, G.; Wang, J.; Zhang, Y.; Zhang, P. Evaluating the Post-Earthquake Landslides Sediment Supply Capacity for Debris Flows. Catena 2023, 220, 106649.
  50. Carrión-Mero, P.; Montalván-Burbano, N.; Morante-Carballo, F.; Quesada-Román, A.; Apolo-Masache, B. Worldwide Research Trends in Landslide Science. Int. J. Environ. Res. Public Health 2021, 18, 9445.
  51. Alcántara-Ayala, I.; Garnica-Peña, R.J. Landslide Warning Systems in Low- and Lower-Middle-Income Countries: Future Challenges and Societal Impact. In Progress in Landslide Research and Technology, Volume 1 Issue 1, 2022; Progress in Landslide Research and Technology; Sassa, K., Konagai, K., Tiwari, B., Arbanas, Ž., Sassa, S., Eds.; Springer International Publishing: Cham, Switzerland, 2023; pp. 137–147. ISBN 978-3-031-16897-0.
  52. Bucała-Hrabia, A.; Kijowska-Strugała, M.; Śleszyński, P.; Rączkowska, Z.; Izdebski, W.; Malinowski, Z. Evaluating the Use of the Landslide Database in Spatial Planning in Mountain Communes (the Polish Carpathians). Land Use Policy 2022, 112, 105842.
  53. Garcia-Delgado, H.; Petley, D.N.; Bermúdez, M.A.; Sepúlveda, S.A. Fatal Landslides in Colombia (from Historical Times to 2020) and Their Socio-Economic Impacts. Landslides 2022, 19, 1689–1716.
Figure 1. Location of the study areas.
Figure 2. Average receiver operating characteristic curves and mean areas under the curve for Test Set 1 of models.
Figure 3. Classification performance for landslide samples in Test Set 1. The proportion of positive and negative samples in Test Set 1 is the same as in the training set. (a) Recall of models with different positive-to-negative sample ratios; (b) Precision of models with different positive-to-negative sample ratios.
Figure 4. Classification performance for landslide samples in Test Set 2. The proportion of positive and negative samples in Test Set 2 is unknown. (a) Recall of models with training set sample ratios of 1:1 and 1:5; (b) Precision of models with training set sample ratios of 1:1 and 1:5.
Figure 5. The spatial occurrence of standard errors (SEs) for the predicted susceptibility values of Test Set 2. (a) SEs of models with training set sample ratios of 1:1; (b) SEs of models with training set sample ratios of 1:5.
Figure 6. Landslide susceptibility mappings of Region B with the 1:1 positive and unlabeled sample ratio.
Figure 7. Landslide susceptibility mappings (LSMs) of Region A. (a) LSM of Region A generated by the model trained with a 1:1 positive and unlabeled sample ratio; (b) LSM of Region A generated by the model trained with a 1:5 positive and unlabeled sample ratio.
Figure 8. The spatial occurrence of standard errors (SEs) for the predicted susceptibility values of Test Set 1. (a) SEs of models with positive to negative sample ratios of 1:1; (b) SEs of models with positive to negative sample ratios of 1:5.
Figure 9. Factor Importance in the BaggingPU-GDBT model (Vertical axis organized from highest to lowest based on the average importance of factors across 30 base classifiers).
Table 1. Landslide susceptibility influence factors.
| Category | Factors | Data Source |
|---|---|---|
| Topography | Elevation, aspect, slope, profile curvature, plan curvature, terrain surface texture, relative slope position, topographic wetness index (TWI), topographic roughness index (TRI), valley depth | ASTER GDEM data of the Geospatial Data Cloud (http://www.gscloud.cn/, accessed on 1 March 2023) |
| Geology | Lithology, strata, distance from faults, slope structure | 1:200,000 scale geological map (http://dcc.ngac.org.cn/, accessed on 1 March 2023) |
| Environment | NDVI, land use, distance from rivers, magnitude, annual average rainfall | Land use was extracted from GlobeLand30 (http://globeland30.org/, accessed on 1 March 2023). The NDVI was calculated in the Google Earth Engine platform. Rivers were derived from the 1:100,000 basic geographic database of China national catalogue service for geographic information. Magnitude and rainfall were provided by Hubei Geological Environment Station. |
| Human activity | Distance from roads, POI kernel density | 1:100,000 basic geographic database of China national catalogue service for geographic information |
Table 2. The standard error of the mean of Test Set 1 in susceptibility zones.
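| Positive and Unlabeled Sample Ratio | Model | SEM (Very Low) | SEM (Low) | SEM (Moderate) | SEM (High) | SEM (Very High) |
|---|---|---|---|---|---|---|
| 1:1 | LR | 0.00211 | 0.00207 | 0.00463 | 0.00362 | 0.0016 |
| 1:1 | SVM | 0.00208 | 0.0021 | 0.00465 | 0.00427 | 0.00158 |
| 1:1 | BaggingPU-SVM | 0.00244 | 0.00209 | 0.00417 | 0.00366 | 0.00155 |
| 1:1 | RF | 0.00216 | 0.00207 | 0.00443 | 0.00477 | 0.00142 |
| 1:1 | GDBT | 0.00198 | 0.00207 | 0.0049 | 0.00529 | 0.0014 |
| 1:1 | BaggingPU-GDBT | 0.00186 | 0.00205 | 0.00491 | 0.00503 | 0.00136 |
| 1:5 | LR | 0.0006 | 0.00395 | 0.00212 | 0.00303 | 0.00411 |
| 1:5 | SVM | 0.00058 | 0.00423 | 0.0026 | 0.00356 | 0.00322 |
| 1:5 | BaggingPU-SVM | 0.00058 | 0.00367 | 0.00297 | 0.00334 | 0.00287 |
| 1:5 | RF | 0.00057 | 0.00425 | 0.00299 | 0.00335 | 0.00248 |
| 1:5 | GDBT | 0.00055 | 0.00479 | 0.00303 | 0.00337 | 0.00227 |
| 1:5 | BaggingPU-GDBT | 0.00055 | 0.0051 | 0.00309 | 0.00335 | 0.00212 |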
Table 3. Zonal statistics of landslide frequency ratio.
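| Model | Susceptibility | Unit Proportion | Landslide Unit Proportion | Frequency Ratio |
|---|---|---|---|---|
| Region B based BaggingPU-GDBT | 0.00–0.20 | 52.42% | 0.37% | 0.007 |
| Region B based BaggingPU-GDBT | 0.20–0.45 | 23.44% | 4.01% | 0.171 |
| Region B based BaggingPU-GDBT | 0.45–0.55 | 6.01% | 3.46% | 0.576 |
| Region B based BaggingPU-GDBT | 0.55–0.80 | 10.64% | 13.10% | 1.231 |
| Region B based BaggingPU-GDBT | 0.80–1.00 | 7.49% | 79.06% | 10.557 |
| Region A based BaggingPU-GDBT | 0.00–0.20 | 42.51% | 6.87% | 0.162 |
| Region A based BaggingPU-GDBT | 0.20–0.45 | 34.32% | 19.44% | 0.566 |
| Region A based BaggingPU-GDBT | 0.45–0.55 | 6.66% | 7.81% | 1.173 |
| Region A based BaggingPU-GDBT | 0.55–0.80 | 14.86% | 59.40% | 3.997 |
| Region A based BaggingPU-GDBT | 0.80–1.00 | 1.65% | 6.48% | 3.917 |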
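The frequency ratio in Table 3 can be read as the share of landslide units falling in a susceptibility zone divided by the share of all units in that zone. As a minimal sketch, the snippet below recomputes the Region B rows from the tabulated percentages. The function name `frequency_ratio` is hypothetical, and because the percentages are rounded, the results agree with the table only to within rounding.

```python
def frequency_ratio(landslide_unit_pct, unit_pct):
    """Share of landslide units in a zone divided by the zone's share of all units."""
    return landslide_unit_pct / unit_pct


# (susceptibility interval, unit proportion %, landslide unit proportion %)
# Values copied from the Region B rows of Table 3.
region_b = [
    ("0.00-0.20", 52.42, 0.37),
    ("0.20-0.45", 23.44, 4.01),
    ("0.45-0.55", 6.01, 3.46),
    ("0.55-0.80", 10.64, 13.10),
    ("0.80-1.00", 7.49, 79.06),
]

for zone, unit_pct, slide_pct in region_b:
    print(f"{zone}: FR = {frequency_ratio(slide_pct, unit_pct):.3f}")
```

A ratio above 1 means that landslides are over-represented in the zone relative to its area share, so a well-behaved susceptibility map should show ratios that increase from the lowest to the highest zone.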
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Zhao, L.; Ma, H.; Dong, J.; Wu, X.; Xu, H.; Niu, R. A Comparative Study of Landslide Susceptibility Mapping Using Bagging PU Learning in Class-Prior Probability Shift Datasets. Remote Sens. 2023, 15, 5547. https://0-doi-org.brum.beds.ac.uk/10.3390/rs15235547

