## 1. Introduction

Scientific management and utilization of soil resources is predicated on correct understanding of the continuous changes in regional soil properties. Spatial interpolation is the main method used to evaluate continuous changes in soil properties [

1], as well as being an important research tool in the fields of ‘digital soil’ and ‘pedometrics’ mapping [

2]. Current spatial interpolation methods mainly originate from discrete modern mathematical theories (function theory and differential geometry), and can be largely classified into three groups [

3]: (1) deterministic or non-geostatistical methods (e.g., inverse distance weighting, IDW), (2) geostatistical methods (e.g., ordinary kriging, OK), and (3) combined methods (e.g., regression kriging). These methods are often data- or even variable-specific and their performance depends on many factors. No consistent findings have been acquired to identify the best interpolation method, and most are global models (i.e., the same model is applied over the whole study area) [

4,

5]. However, in areas of landform complexity, the spatial distribution of soil properties is affected by secondary variables such as soil type, land use type, and landform type, and it is difficult to satisfy the basic assumptions of current models [

6,

7]. Further, owing to various shortcomings, single interpolation models restrict the improvement of prediction accuracy.

In recent years, some machine learning methods have been applied to the fields of data mining and spatial interpolation and have demonstrated their predictive accuracy; for example, artificial neural networks (ANN), random forest (RF), and support vector machine (SVM). Furthermore, ANN and SVM have been applied to daily minimum air temperature and rainfall data in some subjects [

8,

9]. However, all of these represent global interpolation models, which are difficult to adapt to landform complexity areas. Utilizing the advantages of ensemble learning for regression, we combined a series of interpolation models to carry out interpolation simulation of the spatial variation in soil properties and verified the reliability of multi-model integration [

10]. Despite this, issues still remain; for example, in previous work, we only conducted a global regression integration of various interpolation models, with limited consideration of discontinuous space and spatial variation problems. Furthermore, remaining interpolation models need improvement and optimization before they can be integrated.

In addition, a range of studies have demonstrated that interpolation accuracy and mapping quality can be effectively improved by the use of secondary variables as supplementary information [

4,

11,

12,

13,

14,

15,

16,

17,

18]. Land use, soil type, grassland type, and geology type might be expected to play a significant auxiliary role in controlling the spatial variation of soil properties. Previous work by Shi et al. [

1] demonstrated the effectiveness of incorporating land use type and soil type to improve interpolation simulation of soil properties. In addition, many studies have identified topography as an important auxiliary element [

3,

19], but previous research results suggest it is not a key factor in the study area [

10]. Therefore, integration of secondary variables in this study should have an important influence on interpolation accuracy.

In order to solve the global model and secondary variable problems that had long troubled the interpolation method, this study aimed to address some of the outstanding issues, with an overall goal of improving the prediction accuracy of the single interpolation model in areas with complex landforms, using soil K^{+} as an example. We applied analysis of variance (ANOVA) to select secondary variables closely related to the spatial variation of soil K^{+}, integrated secondary variables, and constructed a series of soil property interpolation models. To deal with the discontinuity and spatial variation of soil properties in areas with complex landforms, error surfaces were constructed to enable adaptive partitioning of interpolation surfaces for screening suitable base interpolation models. This paper optimized the screened base interpolation models, and built and coordinated multi-model integration interpolation methods (different combinations of interpolation models were selected for different areas) to realize a high precision simulation of soil properties. We evaluated the performance of the different spatial interpolation methods IDW, OK, OK-Landuse, OK-Geology, OK-Soil, and ASM-SP, and analyzed their predictive capabilities in terms of soil K^{+} maps.

## 2. Method

#### 2.1. Study Area and Datasets

The study area (36°38′–37°29′ N, 99°52′–100°50′ E) is located in the southeast part of the Qinghai Lake Basin, on the Tibetan Plateau, China (

Figure 1). The long-term combined action of geological movement and external forces have formed a complicated and diverse array of geomorphic features in the area. The study area covers a total of 2100 km

^{2}, with an altitude ranging between 3043 and 4516 m, and is characterized by complex landforms, including mountains, hills, tablelands, and plains. Abundant agricultural and husbandry activities are carried out in the area.

The study area is characterized by 6 soil types (

Figure 2a); 12 geology types, including alluvial terrace, denudate high terrace, and diluvial plain (

Figure 2b); and 8 land use types, including cropland, grassland, and potential arable land (

Figure 2c). The grassland can be divided into 20 types, mainly including Achnatherum splendens, leymus, and Blysmus sinocompressus (

Figure 2d). We calculated the statistical characteristics of soil K

^{+} in secondary variables using 110 training samples (

Table 1).

Field sampling of surface soil (0–30 cm) at 110 sampling points was carried out in September 2013 to supplement map data. The mean distance between soil sampling locations was approximately 6.74 km. The sampling sites were designed to cover the whole area and include different landscapes. In order to ensure rational distribution of the sampling points across the different geo-environments, a spatially stratified sampling strategy was applied based on landscape types [

20]. Supported by the Qinghai Environmental Monitoring Center, we recorded information such as soil type, altitude, geology, and land use for each sample location. Each position was sampled three times and the mean was recorded as the sample value. Soil samples were taken back to the laboratory for analysis. After air drying, grinding, and screening through a 2 mm sieve, soil K

^{+} of the samples was measured using sodium hydroxide melting analysis [

21].

The secondary variables were compiled in ArcGIS 10.2, and converted to a resolution ratio of 30 m through resampling. Since the study area covers a comparatively large range of landscape types, and the number of samples was relatively small, the spatial distribution of samples was uneven (

Figure 1) and some landscape types with a relatively high degree of fragmentation are poorly represented. In the study area, the subtypes of secondary variables not sampled anywhere covered only a very small area (i.e., scrubland;

Figure 2c;

Table 1). For this area, we directly used the nearest five-point surrounding values for the subtype to calculate the mean value.

#### 2.2. Methods for Spatial Interpolation

#### 2.2.1. Inverse Distance Weighting

IDW is a deterministic method for multivariate interpolation using a known scattered set of points. Values assigned to unknown points are calculated with a weighted average of the values available at known points. Weights are usually inversely proportional to the power of distance [

22], which, at an unsampled location

x, leads to an estimator:

where

Z*(

x) is the predicted value,

Z (

x_{i}) is the measured value,

n is the number of closest points (typically 10 to 30),

p is a parameter (typically

p = 2), and

d is the cut-off distance.

#### 2.2.2. Ordinary Kriging

Kriging interpolation is considered the best unbiased linear estimation method [

23]. When the mathematical expectation of the regionalized variable

Z*(

x) is unknown, it is termed ordinary kriging.

Z*(

x), the estimated value of variables at point

x, is obtained by the linear combination of

n Z (

x_{i})s, the effective observed value, using the expression:

where

${\lambda}_{i}$ is the weight given to the observed value

$Z({x}_{i})$ and represents the contribution of each observed value to the estimated value

Z*(

x). It can be calculated by the semi-variance function of the variables on the condition that the estimated value is unbiased and optimal. The semi-variance function can be expressed by the equation:

where

$\gamma (h)$ is the semi-variance,

N(

h) is the point group number at distance

h,

$Z({x}_{i})$ is the numerical value at position

${x}_{i}$, and

$Z({x}_{i}+h)$ is the numerical value at distance

$({x}_{i}+h)$.

#### 2.2.3. Base Interpolation Models

As a kind of geostatistical model [

24,

25], each observation

z(

x_{mn},y_{mn}) of a specific soil K

^{+} at location (

x,

y) in the

n-th type of the

m-th kind of secondary variable can be expressed as:

where

m(

E_{mn}) is the mean value of

z(

x_{mn},y_{mn}) in the

n-th type of the

m-th kind of secondary variable, and

r(

x_{mn},y_{mn}) is the residual computed by subtracting the mean value

m(

E_{mn}) of the

n-th type of the relative

m-th secondary variable from the measured value of soil K

^{+}. We assumed that

m(

E_{mn}) and

r(

x_{mn},y_{mn}) are mutually independent and that variation of

r(

x_{mn},y_{mn}) is homogeneous over the entire study area. The residuals were then used to interpolate the surface of residuals over the whole study area by OK. Finally, the interpolated residual values were summed to the soil K

^{+} means of the relevant secondary variable as the final interpolated values of OK with secondary variable for the soil K

^{+}; that is, the mean was modified with surface modeling of residuals. See

Section 3.3.1 for the specific modeling process of base interpolation models (i.e., OK-Landuse, OK-Soil, OK-Geology).

#### 2.3. Method for Adaptive Partitioning

A series of interpolation surfaces of soil properties were generated from the base interpolation models to calculate simulation errors for soil sampling points. The error surface, derived from linear interpolation, was used to determine whether the error of each raster cell exceeded a threshold value. Raster cells below the threshold value were clustered to determine the spatial range of applicability of each interpolation model after multiple iterations. The individual steps are shown in

Figure 3a and detailed below.

**Step one:** Raster cell simulation. For a specific soil property interpolation model (e.g., M_{i}), soil properties are calculated for the whole study area at a specific resolution ratio C_{0} to give the raster simulation value S_{0};

**Step two:** The soil property simulation error at each sampling point is calculated by subtracting the simulation value from the measured value;

**Step three:** Error surface construction. The error surface is constructed using linear interpolation based on the simulation error obtained in step two;

**Step four:** Calculate the simulation error of soil properties at each raster cell, based on the error surfaces obtained in step three;

**Step five:** Determine whether e_{i} (i=1, 2, …m), the error of each raster cell, satisfies | e_{i} | < ε, where ε is the error threshold. If it does, this raster cell is marked as a clustering cell;

**Step six:** Clustering. Areas that meet the accuracy threshold are clustered based on the spatial locations clustering cells.

R_{i1},

R_{i2}, and

R_{ik}, etc., are the cluster spaces of the interpolation model

M_{i} (

Figure 3b);

**Step seven:** Repeat the above steps for each interpolation model to determine their applicable spatial ranges.

#### 2.4. Assessment of Performance

Independent validation was applied to assess interpolation accuracy. The soil K^{+} sample data were randomly split into two groups, one of which was used for interpolation and the other for validation. A total of 90 soil K^{+} sample points were used for interpolation and the remaining 20 were used for validation.

We assessed the accuracy of the different interpolation methods by comparing the mean error (ME), mean absolute error (MAE), mean relative error (MRE), root mean square error (RMSE), and accuracy (AC) of predicted and measured values. The specific equations used are as follows:

where

n is the number of samples;

PEV is the potential error variance (PEV);

$z\left({x}_{i}\right)$ and

${z}^{\ast}\left({x}_{i}\right)$ are the measured and predicted values, respectively; and

$o$ is the mean measured value.

AC varies between 0 and 1, with larger values indicating a better predicted result. Smaller values of

ME, MAE and

RMSE, indicate greater interpolation accuracy.

MRE is dimensionless and smaller values indicate greater interpolation accuracy.

## 5. Conclusions

Affected by secondary variables, the spatial distribution of soil properties is subject to problems such as spatial discontinuity and variability. It is difficult for a single global interpolation model to fully explain the spatial instability of spatial variables of soil properties, especially in areas with complex landforms. Using soil K^{+} as a case study, we proposed a kind of adaptive surface modeling that combines secondary variables (ASM-SP). Compared with methods such as OK and OK-Landuse, OK-Soil, and OK-Geology that also combine secondary variables, ASM-SP is able to depict the spatial variation of soil properties in areas with complex landforms more accurately, and reduce simulation errors more effectively, owing to its integration of multiple base interpolation models. In addition, since ASM-SP combines secondary variables and its simulation surface better accords with geographical laws, it provides detailed information about the spatial variation of soil properties that is more accurate and reasonable. This provides greater opportunity for physical explanation of the spatial variance characteristics of soil properties. However, ASM-SP is based on error minimization surfaces; therefore, there is a risk of over-fitting, which will be addressed in future work.

The interpolation accuracy of soil properties in areas with complex landforms has two main challenges. First, there is a non-linear relationship between the soil properties of sampling points and the secondary variables, and the fitting precision of conventional linear models is rather limited. Second, the selected interpolation model must have relatively high simulation accuracy and, preferably, provide the optimal interpolation. However, in reality, every interpolation model has advantages and disadvantages. Even though it is possible to find a global optimum interpolation model through adequate data exploration and analysis, a simple global model is unable to explain the spatial instability of soil property spatial variables. A feasible solution is to combine secondary variables to integrate multiple models, so that different combinations of interpolation models can be selected for different areas. Soil K

^{+} is comparatively representative of soil properties that vary severely within a short horizontal distance. The ASM-SP method would also be applicable to the interpolation of other soil properties (e.g., soil P, PH, Ca, Mg, and Zn). Previously, we verified the advantages of an ensemble learning algorithm in the serial integration of multiple models [

10]. In future research, we plan to comprehensively utilize the machine learning algorithm, combine secondary variables, and build and coordinate adaptive multi-model integration interpolation methods to solve over-fitting problems and to conduct high accuracy surface modeling of soil properties.