Severe soil erosion will increase soil sedimentation and severely reduce the water storage and supply capabilities of reservoirs. A previous analysis revealed that 30% of US croplands had excessive soil erosion rates [1
]. Soil erosion has been one of the core topics in agriculture, natural resources conservation, and other related fields since the end of the 1920s [2
]. A significant trend in soil erosion study was developing various measurements and prediction models for different locations and applications, such as the AGNPS (agricultural non-point source pollution model [3
]), CREAMS (chemicals, runoff and erosion from agricultural management systems [4
]), EPIC (erosion-productivity impact calculator [5
]), SWRRBWQ (simulator for water resources in rural basins-water quality [6
]), WEPP (water erosion prediction project [7
]), and USLE (universal soil loss equation [8
]). Despite being the earliest developed model, USLE is still a widely used empirical model and has been used at the national and international levels to estimate soil loss throughout the world [10
There are six crucial factors in the USLE model, including the rainfall erosivity factor (Rm
-factor), soil erodibility factor (Km
-factor), slope length factor (L-factor), slope steepness factor (S-factor), cover management factor (C-factor), and support practice factor (P-factor). The outcome of the USLE model represents the average annual soil loss (Am
). However, the USLE calculation excludes landslides, gully erosions, riverbed or bank erosions, and sediment depositions [11
]. Among the USLE factors, the C-factor (ranging from 0 to 1) is related to the land-use/land-cover (LULC). Thus, it may cause a thousand-fold difference in soil erosion estimation (0.001 vs. 1).
A summary of past studies of soil erosion in a Taiwanese watershed showed that the calculated soil erosion rate varies from 1 to 3310 t/ha-year [14
]. Another study also demonstrated that the non-uniform distribution of the USLE factors might cause a substantial discrepancy in soil erosion estimation [15
]. Therefore, it is necessary to develop a strategy to derive more consistent and reliable factors when applying the USLE model for soil loss evaluation.
This study focused on the assessment of the C-factor. Traditionally, the C-factor was determined based on plot experiments or in-situ investigations [9
]. However, such works were too time-consuming and uneconomical to apply everywhere [16
]. Renard et al. [17
] derived a function to simulate the long-term field experiments by using the prior-land-use sub-factor, canopy-cover sub-factor, surface-cover sub-factor, surface-roughness sub-factor, and soil-moisture sub-factor to evaluate the C-factor. Although the derived function improved the efficiency of the long-term field experiments, significant field works were still unavoidable. Nowadays, connecting the C-factor with the LULC map using a look-up table is an effective strategy [18
]. However, this strategy is limited by the update period of the map, and therefore evaluating the multi-temporal soil erosion is difficult. Similarly, instead of the LULC map, a time-series result can be obtained using satellite remote sensing images and supervised classification techniques [19
]. The limitations here are that ground truth points are needed to assess the accuracy of the image classification, and the results are inferior to the LULC map produced from fieldwork. Performing regression analyses to connect features derived from remotely sensed images (e.g., normalized difference vegetation index, NDVI) with the C-factor is another common approach [21
]. Again, the developed relationships may have a large margin of error, fail to provide any physical meanings, and may be sensitive to vegetation phenology and soil conditions [23
]. According to the comparison above, Table 1
summarizes the typical approaches for estimating the C-factor.
To improve upon previous research on this topic, the goal of this study was to construct a model between the C-factor and related geospatial data (including but not limited to those derived from remotely sensed images) using a data mining approach. If a credible model can be constructed from official data collected in earlier years, then the model can be applied to predict the C-factors for later years by merely updating the remote sensing images, thus achieving multi-temporal analysis without having to update the LULC map every year. Furthermore, using this novel framework to assess and predict the C-factors based on geospatial factors, we will be able to better estimate the rate of soil erosion at a large scale, such as the watershed scale. As shown in Table 1
, we used the official LULC map and a look-up table to determine the C-factors in this study. An RF model was then built to predict the C-factor values from the geospatial factors for use in the USLE model.
The rest of the paper is organized as follows. Section 2
introduces the study area, the geospatial data, and the data mining algorithm. Section 3
presents the C-factor modeling and soil erosion estimation results and a discussion focusing on the class imbalance (imbalanced data) problem. Finally, Section 4
concludes this paper and provides possible future research directions.
3. Results and Discussion
The results of this study are presented as tables, graphs, and statistical metrics in the following sub-sections.
3.1. Cover-Management Factor Modeling
According to the data preprocessing and study procedures described in previous sections, the C-factor RF model was constructed using the training data and tested with the test data. The C-factors were from the official 2004 LULC map (the only map available during the study period), and the results are shown in Table 7
. Unlike a previous study [30
], which only used at most 100 points from each C-factor class and substantially overestimated soil erosion, we tried to maximize the number of data points that could be processed to train the RF model given the memory size limitation of R. Consequently, 4% of the total data points (population dataset) were used (303,682 points selected out of 7,592,062 points). The training dataset result shows that OA = 1, Kappa = 1, and AUC = 1 (the confusion matrix is not shown here to avoid redundancy). This indicates that the RF model can correctly distinguish all 212,578 data points in the training dataset. By contrast, Table 7
shows that the result of the test dataset has less remarkable metrics. While OA is still very high (0.9516), Kappa is only 0.5741, and AUC is 0.7804. Using this RF model, we predicted the C-factor distribution maps from 2004 to 2008 using SPOT images. The resulting maps are shown in Figure 5
a shows the true C-factor distribution from the official 2004 LULC map. The similarity between the prediction (Figure 5
b–f) and the reference C-factor is evident. The red pixels (high C-factor values) of all maps cluster along the central river valley near the center and lower portions of the watershed. However, some of the C-factor cannot be distinguished reliably, and some omission and commission errors have occurred. As shown in column 4 (shaded) of Table 7
, the most noticeable trend is a prediction bias towards the C = 0.01 class, representing natural and artificial forests. This result reflects the overwhelming majority of the C = 0.01 class in the sample (92.5%) when building the RF model (Table 5
). Hence, the RF model tends to classify pixels into the C = 0.01 class. Because of this bias towards a low C-factor value (0.01), the average of 2004 C-factors predicted by the RF model is only 0.0115 (Table 8
), which is lower than the true average of 0.0164 (official 2004 LULC map). Likewise, the predicted average C-factors are also lower in the subsequent years from 2005 to 2008.
To reduce the classification error, we experimented with an ad-hoc down-sampling of the majority class technique that used only 2% data from the majority class (C = 0.01 class) while maintaining a 4% sample rate of the other minority C-factor classes. The percentages of data points from each of the 12 C-factor classes in the input dataset were previously shown in Table 5
. Again, the result of the training dataset shows a perfect classification of OA = 1, Kappa = 1, and AUC = 1 (again, the confusion matrix is not shown here to avoid redundancy). The result of the test dataset (Table 9
) shows that OA = 0.9230, Kappa = 0.6484, and AUC = 0.7807. Compared with Table 7
, we can see that the OA decreases from 0.9516 to 0.9230, the Kappa increases from 0.5741 to 0.6484, and the AUC remains about the same. Using only 2% data from the C = 0.01 class, the predicted average C-factors from 2004 to 2008 range from 0.0124 to 0.0133 (Table 8
), higher than the 4% case and closer to the true average C-factor value. In other words, the reduction of the sampling rate from 4% to 2% increases the Kappa coefficient at the expense of OA. Simultaneously, the reduction of the sampling rate also brings the predictions closer to the reference value (ground truth). Although we had adopted the stratified random sampling method to obtain a representative sample of all C-factor classes, it did not completely avoid the imbalanced data problem. The next section will present how the sampling rate of the majority class affects soil erosion estimates.
3.2. Soil Erosion Estimation
Up until 2017, the USLE model was the only method for estimating the amount of soil loss in the technical regulations for soil and water conservation in Taiwan [43
]. Combining the Rm
-, and LS-factor layers (Figure 4
) with the C-factor layer (Figure 5
), the USLE model was used in this study to estimate the amount of soil erosion. The result based on the official 2004 LULC map is listed in the second column of Table 10
(116.3 t/ha-year). The multi-temporal evaluations from 2004 to 2008 based on the RF models (4% and 2%) are shown in columns 3–7 of the same table year by year. Comparing the results for 2004 (columns 2 and 3), the RF models generate lower than expected (true) soil erosion rates (88.2 and 95.1 vs. 116.3 t/ha-year). Similarly, the soil erosion rates are lower in the subsequent years from 2005 to 2008. Using the 2% RF model, we prepared the predicted soil erosion maps in Figure 6
b–f for the years 2004 through to 2008, while the soil erosion map based on the official 2004 LULC map is shown in Figure 6
a. As such, a high resemblance between the predictions (Figure 6
b–f) and the reference value (Figure 6
a) is evident. The red pixels (high soil erosion rates) of all maps cluster near the center and lower portions of the watershed, indicating good modeling results. The results of the 4% RF model are similar, but we did not include them here to avoid redundancy.
Chen et al. [14
] compiled a table of the calculated amounts of soil erosion of the Shihmen Reservoir watershed from previous studies. The table shows that soil erosion ranges from 1 to 3310 t/ha-year. Our results are close to the lower end of the soil erosion range. By contrast, the only study using a similar methodology to this study [30
], which related nine decision factors, including the gray level co-occurrence matrix (GLCM) to the C-factor, only used at most 100 points from each C-factor class. The study achieved a Kappa coefficient of 0.758, but estimated the soil erosion to be 359.4–629.9 t/ha-year.
Based on the erosion pins installed in the Shihmen Reservoir watershed [44
] and the measurements collected from 8 September 2008 to 10 October 2011, the soil erosion depth ranged from 2.17 to 13.03 mm/year [37
]. The average erosion depth is 6.5 mm/year, which is equivalent to 90.6 t/ha-year if the unit weight of soil is assumed to be 1.4 t/m3
]. Thus, our results are more comparable with the erosion pin measurements. Specifically, it shows that the 4% sampling rate underestimates soil erosion, whereas the 2% sampling rate overestimates soil erosion.
It is worth noting that the long-term land cover and landslide monitoring project [25
] from 2004 to 2009 indicated that only typhoon Aere in 2004 induced significant land degradation and mass movement in the period. Large amounts of debris and driftwood flowed into the Shihmen Reservoir. The high turbidity in the water caused the water distribution system to be shut down for an unprecedented 18 days. A similar situation has not happened since. The removal of land cover in the watershed is the reason why the calculated soil erosion based on the official 2004 LULC map is as high as 116.3 t/ha-year. After Typhoon Aere, as the land stabilized and vegetation re-grew to provide new ground cover, soil erosion reduced substantially. This explains why the measured soil erosion is only 90.6 t/ha-year between 2008 and 2011 (Table 10
According to Table 10
, both of the 4% and 2% C-factor models predict a peak of soil erosion in 2004, followed by a gradual decrease until a small rebound in 2008. This is consistent with the vegetation recovery in the study area from 2005 to 2007 and the hike in rainfall brought by Typhoons Kalmaegi, Sinlaku, and Jangmi in 2008 [45
]. However, the differences in soil erosion rates are not as marked as expected from year to year. Nevertheless, these results confirm that it is possible to develop a multi-temporal data mining model for the C-factor and the corresponding soil erosion.
Although this study’s results demonstrated the feasibility of constructing a data mining model for the USLE C-factor, the modeling results were affected by the majority class’s sampling rate (C = 0.01 class). At the beginning of the study, we used the stratified random sampling method (instead of the simple random sampling method) to ensure that each class (strata) of the population dataset (total data points) were represented. It helped to avoid incorrect analysis, but it did not solve the class imbalance problem entirely. So, we experimented with the down-sampling of the majority class (C = 0.01 class) to 2% and kept the other minority classes at a 4% sample rate. A better result was achieved as indicated by the higher Kappa coefficient (but at the expense of lower OA). To investigate if the classification performance can be further improved, we applied an ad-hoc down-sampling of the majority class technique (similar to random under-sampling) to the population dataset with 1% and 0.5% sampling rates. The resulting percentage compositions of each of the 12 C-factor classes in the input dataset were previously shown in Table 5
. After building corresponding RF models, the overall results are summarized in Table 11
, which shows the OA, the Kappa coefficient, the AUC, the true positive rate of all minority classes combined, the true positive rate of the majority class, the average C-factor of the LULC map, the predicted average C-factor of 2004, the predicted soil erosion rate of 2004, the predicted soil erosion rate of 2008, and the measured soil erosion rate by the erosion pins.
It was found that the down-sampling strategy works well. With the decrease in the majority class sampling rate, the Kappa coefficient increases from 0.574 to 0.732, and the AUC increases from 0.780 to 0.891. Moreover, the true positive rate of all minority classes combined also increases from 0.43 to 0.70. However, the overall accuracy decreases with the down-sampling from 0.952 to 0.846, and the true positive rate of the majority class declines from 0.99 to 0.94.
At first, it appears that 0.5% is the best sampling rate in this study, but the average C-factor indicates otherwise. The C-factor value starts from 0.0115 at 4% and then becomes 0.0130 at 2% and 0.0156 at 1%, gradually approaching the reference value of 0.0164. However, the average C-factor suddenly dips back down to 0.0115 at 0.5%, deviating from the reference value again. Therefore, judging from the average C-factor, 1% is the majority class’s best sampling rate.
The unexpected result of the average C-factor is further investigated in Figure 7
, in which the predicted percentage compositions of four different sampling rates were plotted for all minority classes. The red line indicates the “true” percentage composition from the 2004 official LULC map. As shown in the figure, the difference in the C = 0.16 class between the LULC map and the different sampling rates determines how accurate the final average C-factor is. As the sampling rate decreases from 4% to 2% and 1%, the respective percentage of the C = 0.16 class approaches and then surpasses that of the LULC map. When the sampling rate is further reduced to 0.5%, the percentage of the C = 0.16 class reverses course and drops back to the same level as that of the 4% sampling rate. This explains why the 0.5% sampling rate did not yield a better average C-factor, even though its other metrics (such as OA, Kappa, and AUC) were superior.
Finally, if we compare the predicted soil erosion rates with the rate measured by the erosion pins, we arrive at yet another different conclusion. The 2% sampling rate predicts a soil erosion rate of 93.2 t/ha-year, which is the closest to the measured rate of 90.6 t/ha-year. In this scenario, 2% is the best sampling rate. The best sampling rates under different criteria are shaded in Table 11
for easy comparison.
The results presented above suggest that, apart from the class imbalance problem, other factors are responsible for this study’s modeling performance. Using the overall evaluation metrics (such as OA, Kappa, and AUC) in this study is not entirely appropriate and could be misleading. Since our goal was a two-step approach to model the C-factor and eventually the soil erosion, the predicted soil erosion rate is the most important indicator. A more balanced dataset does not always yield a better modeling result in terms of soil loss, and we consider 2% to be the best sampling rate in this study.
Unlike previous studies, this research developed C-factor models based on data mining techniques to improve the soil erosion assessment in the Shihmen Reservoir watershed in northern Taiwan. Eight geospatial data were selected and used in the modeling. The multi-temporal vegetative indices (NDVI and SAVI) derived from multispectral satellite images were rectified by a topographic correction to reduce variations over time and better characterize the surface radiances of the same targets. The C-factor models built using the RF-based data mining algorithm were used with USLE to estimate the spatiotemporal soil losses from 2004 to 2008. The results were compared against past studies and the measurements of erosion pins. They showed promising classification performance.
It is found that the soil erosion rate in 2004 was the highest because of the unprecedented destruction of Typhoon Aere in 2004. As the vegetation in the watershed re-grew after the typhoon to provide new ground cover, the soil erosion rate decreased steadily until 2008, when a surge of rainfall occurred due to Typhoons Kalmaegi, Sinlaku, and Jangmi. This trend was successfully captured by the RF models, which demonstrates the feasibility of the multi-temporal analysis. Furthermore, using an ad-hoc down-sampling of the majority class technique (at 2% sampling rate), the soil erosion rate was predicted to be 93.2 t/ha-year, very close to the 90.6 t/ha-year measured by the erosion pins installed in the watershed.
In addition, this study provides a case of an imbalanced data problem that differs from other imbalanced data problems in that a more balanced dataset does not always yield a better modeling result. The best sampling rate of the majority class based on different metrics are summarized as follows:
Overall accuracy and true positive rate of the majority class: 4%
Kappa coefficient, AUC, and true positive rate of all minority classes combined: 0.5%
Average C-factor: 1%
Soil erosion rate: 2%
In summary, the results show that the proposed novel framework for assessing and predicting C-factors and soil erosion based on geospatial factors is both viable and practical. The method also has promising classification performance even when faced with an imbalanced data problem. An imbalanced data problem cannot be easily eradicated by removing records from the majority class. Therefore, future research is necessary to improve model performance and soil erosion estimates.