Article

A Comparative Study of Various Methods for Handling Missing Data in UNSODA

School of Human Settlements and Civil Engineering, Xi’an Jiaotong University, Xi’an 710049, China
*
Author to whom correspondence should be addressed.
Submission received: 18 June 2021 / Revised: 21 July 2021 / Accepted: 27 July 2021 / Published: 30 July 2021
(This article belongs to the Special Issue Image Analysis Techniques in Agriculture)

Abstract

UNSODA, a free international soil database, is very popular and has been used in many fields. However, missing soil property data have limited the utility of this dataset, especially for data-driven models. Here, three machine learning-based methods, i.e., random forest (RF) regression, support vector regression (SVR), and artificial neural network (ANN) regression, and two statistics-based methods, i.e., mean and multiple imputation (MI), were used to impute the missing soil property data, including pH, saturated hydraulic conductivity (SHC), organic matter content (OMC), porosity (PO), and particle density (PD). The missing upper depths (DU) and lower depths (DL) of the sampling locations were also imputed. Before imputing the missing values in UNSODA, a missing value simulation was performed and evaluated quantitatively. Next, nonparametric tests and multiple linear regression were performed to qualitatively evaluate the reliability of these five imputation methods. Results showed that the RMSEs and MAEs of all features fluctuated within acceptable ranges. RF imputation and MI presented the lowest RMSEs and MAEs; both methods are good at explaining the variability of the data. The standard error, coefficient of variation, and standard deviation decreased significantly after imputation, and there were no significant differences before and after imputation. Together, DU, pH, SHC, OMC, PO, and PD explained 91.0%, 63.9%, 88.5%, 59.4%, and 90.2% of the variation in BD using RF, SVR, ANN, mean, and MI, respectively; this value was 99.8% when missing values were discarded. This study suggests that the RF and MI methods may be better for imputing the missing data in UNSODA.

1. Introduction

Soil properties, including bulk density (BD), water content (WC), particle density (PD), porosity (PO), saturated hydraulic conductivity (SHC), organic matter content (OMC), and pH, can be divided into physical properties and chemical properties [1]. OMC and pH are the most important chemical properties. They have significant effects on plant growth [2,3,4]. Physical properties, such as SHC, BD, WC, PD, and PO are frequently measured to calculate soil’s hydraulic properties [5,6,7,8,9] or to characterize soil compaction [10,11]. BD is also widely used as an essential parameter for soil weight-to-volume conversion, especially when calculating the carbon and nutrient contents of a soil layer [12]. SHC is also used to calculate water flux in a soil profile and to design irrigation and drainage systems [13].
In theory, data on these soil properties can be obtained directly through experiments, but in practice, direct measurements are difficult and labor-intensive, and the data are highly variable, particularly for properties such as BD [12,14,15,16,17]. In addition, the measurement of soil–water characteristic curves (SWCCs) and unsaturated soil hydraulic conductivity is also time-consuming and labor-intensive [9,18]. For these reasons, methods have been developed to estimate soil properties that are difficult to measure directly from property data with low variability; these estimation methods are called pedotransfer functions (PTFs) [19,20], and they are commonly data-driven. One advantage of such data-driven models is that they are usually far more flexible than standard statistical models and can capture higher-order interactions in the data, resulting in better predictions. For this reason, the role of soil property data is becoming increasingly important, and a number of soil property datasets have been established. Among them, the Unsaturated Soil Database (UNSODA) [21,22], the European Database of Soil Hydraulic Properties (HYPRES) [23], SoilVision, and the Harmonized World Soil Database (HWSD) [24] are the most representative. The UNSODA database has been widely used because it provides a large amount of information free of charge. For example, Huang and Zhang [25], Hwang and Powers [26], Hwang et al. [27], Mohammadi and Vanclooster [28], and Chang and Cheng [29] predicted SWCCs using particle size distribution data; Ghanbarian-Alavijeh et al. [30] used soil texture data; and Haverkamp et al. [31], Seki [32], Ghanbarian and Hunt [33], Pham et al. [34], and Vaz et al. [35] used soil properties such as BD, PO, and others.
However, UNSODA contains missing data for pH [pH] (pH), saturated hydraulic conductivity [k_sat] (SHC), organic matter content [OM_content] (OMC), porosity [porosity] (PO), particle density [particle_density] (PD), and bulk density [bulk_density] (BD), where square brackets denote the feature names in the original tables and round brackets denote the names used in this study.
Missing data, a problem often encountered in real-world scientific settings, are problematic because many statistical analyses require complete data. Researchers who want to perform such an analysis are forced to choose between imputing data and discarding missing values; the latter is the most common way of using UNSODA. However, discarding missing data may not be reasonable when the proportion of missing data is not small, as valuable information may be lost and inferential power compromised [36]. According to Strike et al. [37] and Raymond and Roberts [38], when a dataset contains a very small amount of missing data, e.g., a missing rate of less than 10% or 15% across the whole dataset, missing records can simply be removed without losing valuable information. However, when the missing rate exceeds 15%, the missing information may reduce insight into the data [39], especially when extracting knowledge from a given dataset; careful consideration should therefore be given to the handling of missing data. Missing values are of different types, some of which are discussed below [40]:
(i)
Missing completely at random (MCAR): The probability of missingness is unrelated to both the observed and the unobserved values; the observed data can be treated as a completely random subsample of the full data.
(ii)
Missing at random (MAR): The probability of missingness depends on already observed values but not on the missing value itself.
(iii)
Not missing at random (NMAR): The probability of missingness depends on the missing value itself and therefore cannot be explained by the observed values alone.
These different types of missingness generally arise from different sources. MCAR data may arise from sensor recording failure, where no other data depend on the missing data. By contrast, MAR may arise when some survey questions are not answered, yet other answered questions are related to the unanswered items.
The data in UNSODA were mainly contributed by individual scientists, and some of the datasets were taken from the literature. A questionnaire based on suggestions from participants at an international workshop on soil hydraulic properties held in Riverside in 1989 was also used to request information for UNSODA [22,41].
The above discussion explains how UNSODA was created, but we still cannot confirm its type(s) of missing data. For imputation purposes, the missing values in UNSODA were assumed to be MAR.
The objective of this paper was to impute the missing values in UNSODA; to our knowledge, this work has not been undertaken previously. The main missing features, i.e., pH, SHC, OMC, PO, and PD, were all included. After reviewing existing missing value imputation techniques, we used the random forest (RF) regression method to impute the missing values in UNSODA. Its performance was then compared with those of two other machine learning-based methods, i.e., support vector regression (SVR) and artificial neural network (ANN) regression, and two statistics-based methods, i.e., the mean and multiple imputation (MI) methods.

2. A Brief Review of Existing Missing Value Imputation Techniques

Imputation methods involve replacing a missing value with another value that has been estimated from the available information in the dataset [39]. Imputation methods can be divided into single and multiple imputation methods based on the number of values imputed [39,42]. According to the construction approach used for data imputation, these techniques can also be classified into statistics-based and machine learning-based (or model-based) methods [39]; the details of these approaches are listed in Table 1.
Statistics-based methods are a popular approach to missing data imputation in which a statistic (such as the mean) is calculated for each column, and all missing values in that column are replaced with that statistic. The MI method is another widely used statistics-based method, first proposed by Rubin in the late 1970s [43]. Instead of imputing a single value for each missing entry, multiple imputation creates several completed candidate datasets consistent with the missing-data pattern and then combines them into one estimate for the missing data. Machine learning techniques, such as the k-nearest neighbor (KNN), RF, ANN, and SVR methods, have been widely employed over the last 20 years [44]. It should be noted that KNN approaches tend to perform poorly in high-dimensional and large-scale data settings.
Table 1. The main statistics-based and machine learning-based imputation methods.
Statistics-Based | Machine Learning-Based
Expectation maximization (EM) [45] | Random forest (RF) * [36,46]
Hot deck (HD) [47,48] | Artificial neural networks (ANN) * [49]
Multiple imputation (MI) * [50] | Support vector regression (SVR) * [51]
Mean/mode * [52] | K-nearest neighbor (KNN) [53]
Gaussian mixture model (GMM) [54] | Decision tree (DT) [55]
— | Clustering [55,56]
Note: * indicates an imputation method used in this study.

3. The UNSODA Dataset and Procedure for Missing Value Imputation

3.1. The UNSODA Dataset

The structure of the database, the names of the tables, and the links between tables are summarized in Figure 1. The main table of UNSODA is called “general”. It holds essential information about the soils, such as the geographic location, texture, classification, and environment of each. The “soil_properties” table contains the physical and chemical properties of each soil, such as pH, SHC, OMC, PO, PD, and BD. Table 2 summarizes the statistical descriptions of the distributions of the soil properties. Figure 2 presents the available sample size, total sample size, and missing proportion of each soil property. Table 2 and Figure 2 show that there are a number of missing values for BD, PD, PO, SHC, OMC, and pH, with missing proportions of 0.0354, 0.4949, 0.5316, 0.4570, 0.5089, and 0.6203, respectively. To analyze the relationships between variables as comprehensively as possible, the upper depth [depth_upper] (DU) and lower depth [depth_lower] (DL) of the sampling locations were also included; their missing proportions are both only 0.1329. It should be noted that the sample size used in this study was smaller than the available sample size because samples were deleted when the corresponding BD was missing.
Figure 3 presents boxplots of the distributions of these soil properties. The statistical descriptions indicate that these features cover different scales. Among them, SHC had the broadest range, i.e., 0.019–27,648, with a distribution mostly skewed toward low values in the range of 20.818 to 459.400. On the other hand, the distribution of PO had the narrowest range, i.e., 0.175–0.915, centered on the range of 0.405–0.510.
Furthermore, according to Tukey’s rule, 176 outliers, whose values were either higher than $Q_3 + 1.5\,IQR$ or lower than $Q_1 - 1.5\,IQR$ (where $IQR = Q_3 - Q_1$ is the interquartile range of the dataset, and $Q_1$ and $Q_3$ are the first and third quartiles, respectively), were detected for DU, DL, pH, SHC, OMC, PO, PD, and BD, as shown in Figure 3. It should be noted that most of the outliers were observed for SHC, i.e., 43 out of the 176 outliers.
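For readers reproducing this screening step, Tukey’s rule is straightforward to apply column by column; the snippet below is a minimal Python sketch (the DataFrame `df` and its column names are assumptions for illustration, not part of the UNSODA distribution):

```python
import pandas as pd

def tukey_outliers(series: pd.Series) -> pd.Series:
    """Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] (Tukey's rule)."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - 1.5 * iqr) | (series > q3 + 1.5 * iqr)

# df is assumed to hold the soil properties as columns, e.g., "SHC", "pH", ...
# outlier_counts = {col: tukey_outliers(df[col].dropna()).sum() for col in df}
```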

3.2. Procedure for Missing Values Imputation

The experimental procedure for missing value imputation is shown in Figure 4. Before imputing the missing values in the original incomplete dataset, we first considered one complete dataset (Dataset I), i.e., with all records containing missing values discarded. Once missing values were discarded, the number of records decreased significantly (n = 109). Second, a missing value simulation was performed: values were deleted from Dataset I at different missing proportions (e.g., 3%, 7%, 11%, 15%, 19%, and 23%) using an MAR approach, producing incomplete datasets with different proportions of missing data. The purpose of this design was to compare quantitatively the advantages and drawbacks of the different imputation methods.
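The paper does not state the exact amputation mechanism; one minimal way to generate MAR missingness, in which the probability that a value goes missing depends on another observed feature, is sketched below (the function and column names are hypothetical):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

def simulate_mar(df: pd.DataFrame, target: str, driver: str,
                 proportion: float) -> pd.DataFrame:
    """Blank out `proportion` of `target`; the missingness probability depends
    on the observed `driver` column, so the mechanism is MAR rather than MCAR."""
    out = df.copy()
    weights = out[driver].rank().to_numpy()   # larger driver -> more likely missing
    weights = weights / weights.sum()
    n_missing = int(round(proportion * len(out)))
    idx = rng.choice(out.index.to_numpy(), size=n_missing, replace=False, p=weights)
    out.loc[idx, target] = np.nan
    return out

# e.g., incomplete = simulate_mar(dataset1, target="pH", driver="BD", proportion=0.15)
```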

4. Methodology

In this section, we introduce and describe the methods used to impute the dataset with values deleted at random (Dataset I) and the original incomplete dataset (Dataset II):
(i)
Statistics-based methods, including mean and MI.
(ii)
Machine learning-based methods, including RF, SVR, and ANN.

4.1. General Notation

Let $X$ be our $n \times p$ matrix of predictors that requires imputation [57]:
$$X = (X_1, X_2, X_3, \ldots, X_p) = \begin{pmatrix} x_{11} & \cdots & x_{1p} \\ \vdots & \ddots & \vdots \\ x_{n1} & \cdots & x_{np} \end{pmatrix}$$
An arbitrary variable $X_s$ contains missing values at entries $i_{mis}^{(s)} \subseteq \{1, 2, \ldots, n\}$.
For every variable $X_s$ that contains missing values, we can separate the dataset into four categories:
(i)
Non-missing values of variable $X_s$, denoted by $y_{obs}^{(s)}$.
(ii)
Missing values of variable $X_s$, denoted by $y_{mis}^{(s)}$.
(iii)
Variables other than $X_s$, with observations $i_{obs}^{(s)} = \{1, 2, \ldots, n\} \setminus i_{mis}^{(s)}$, denoted by $x_{obs}^{(s)}$.
(iv)
Variables other than $X_s$, with observations $i_{mis}^{(s)}$, denoted by $x_{mis}^{(s)}$.

4.2. Statistics-Based Methods

4.2.1. Mean Imputation

Mean imputation is a simple technique that computes the mean of the observed values $y_{obs}^{(s)}$ and uses it to fill in the missing values of $X_s$, i.e., $y_{mis}^{(s)}$. The method is easy to perform and simple in process; however, because the sample mean itself is sensitive to extreme values of the variable, the imputed values can be distorted when the variable is strongly skewed.
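In scikit-learn (the library used for model training later in this paper), column-wise mean imputation is available off the shelf; a minimal sketch with toy data:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[7.1, 1.44],
              [np.nan, 1.52],
              [6.3, np.nan],
              [5.8, 1.61]])
imputer = SimpleImputer(strategy="mean")   # mean of the observed entries per column
X_filled = imputer.fit_transform(X)        # each NaN replaced by its column mean
```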

4.2.2. Multiple Imputation

As shown in Figure 5, the MI method does not attempt to provide an accurate estimate for each missing value, but rather tries to represent a random sample of the missing data by constructing valid statistical inferences that properly reflect the uncertainty due to missing data. Hence, it retains the advantages of single imputation while allowing the data analyst to obtain valid assessments of uncertainty. In this study, we used SPSS 22.0, whose MI algorithm is predictive mean matching (PMM), to impute the missing values.
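For readers without SPSS, statsmodels offers a chained-equations imputer whose per-variable imputations are also drawn by predictive mean matching; the sketch below (five candidate datasets, with simple averaging as a stand-in for formal pooling; `df` is assumed to be the incomplete numeric feature table) is an analog, not the exact SPSS procedure:

```python
from statsmodels.imputation.mice import MICEData

imp = MICEData(df)            # PMM-based chained-equations imputer
candidates = []
for _ in range(5):            # five completed candidate datasets
    imp.update_all()          # one full cycle of chained imputations
    candidates.append(imp.data.copy())
pooled = sum(candidates) / len(candidates)   # crude pooling into one estimate
```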

4.3. Machine Learning-Based Methods

4.3.1. RF Imputation Method

RF is an ensemble technique capable of performing regression and classification with the use of multiple decision trees and a technique called bootstrap aggregation, commonly known as bagging, which involves training each decision tree on a different data sample, where sampling is performed with replacement [58,59].
We used binary decision trees (i.e., CART) as the base learners for the RF, as shown in Figure 6. It is necessary to consider how to choose the split variables (features) and split points, and how to estimate the quality of a candidate split; the split criterion is calculated as follows:
$$F(x_i, v_{ij}) = \frac{n_{left}}{N_s} H(x_i^{left}) + \frac{n_{right}}{N_s} H(x_i^{right})$$
where $x_i$ is the split variable, $v_{ij}$ is the value of the split variable, $n_{left}$ is the sample size of the left node, $n_{right}$ is the sample size of the right node, and $N_s$ is the total sample size of the variable $x_i$. $H(x)$ is the impurity function, which can be calculated as:
$$H(x_m) = \frac{1}{N_m} \sum_{i \in N_m} (y_i - \bar{y}_m)^2$$
By substituting Equation (3) into (2), we can obtain:
$$F(x_i, v_{ij}) = \frac{1}{N_s} \left[ \sum_{y_i \in X_{left}} (y_i - \bar{y}_{left})^2 + \sum_{y_i \in X_{right}} (y_i - \bar{y}_{right})^2 \right]$$
The training process of a node in the decision tree is mathematically equivalent to the following optimization problem:
$$(x_i^*, v_{ij}^*) = \underset{x_i,\, v_{ij}}{\arg\min}\; F(x_i, v_{ij})$$
The above discussion explains how to train a single CART; note that the prediction of the RF is the average of the outputs of all CARTs. The RF can be used to estimate missing values by fitting an RF to predict the non-missing values of $X_s$, i.e., $y_{obs}^{(s)}$ from $x_{obs}^{(s)}$, and using this fit to predict the missing values of $X_s$, i.e., $y_{mis}^{(s)}$ from $x_{mis}^{(s)}$. BD was considered the label because it had the lowest missing proportion (0.0354). The RF imputation procedure (Algorithm 1) can be described as follows [57]:
Algorithm 1: The RF imputation.
[Algorithm 1 is reproduced as an image in the original article.]
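Since Algorithm 1 appears only as an image, the sketch below reproduces the missForest-style iterative scheme it follows [57]; the hyperparameters, column ordering, and convergence criterion are assumptions rather than the authors’ exact settings:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def rf_impute(df: pd.DataFrame, max_iter: int = 10, tol: float = 1e-4) -> pd.DataFrame:
    """missForest-style iterative RF imputation (a sketch of Algorithm 1)."""
    miss = df.isna()
    out = df.fillna(df.mean())                    # initial guess: column means
    order = miss.sum().sort_values().index        # least missing first
    prev = out.copy()
    for _ in range(max_iter):
        for s in order:
            if not miss[s].any():
                continue
            obs, mis = ~miss[s], miss[s]
            rf = RandomForestRegressor(n_estimators=100, random_state=0)
            rf.fit(out.loc[obs].drop(columns=s), out.loc[obs, s])       # y_obs ~ x_obs
            out.loc[mis, s] = rf.predict(out.loc[mis].drop(columns=s))  # predict y_mis
        change = ((out - prev) ** 2).sum().sum() / (out ** 2).sum().sum()
        if change < tol:                          # stop when imputations stabilize
            break
        prev = out.copy()
    return out
```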

4.3.2. SVR Imputation Method

SVR is an adaptation of the support vector machine (SVM) algorithm to regression problems [60]. SVR models can be divided into two types, i.e., hard margin and soft margin [61]. To illustrate the basic idea behind SVR, we first introduce the case of linear functions, as shown in Figure 7a,b. In the hard margin model, no points lie outside the shaded region; the parameter $\varepsilon$ controls the number of support vectors used in the regression function: the smaller the value of $\varepsilon$, the greater the number of support vectors selected. However, a hard margin may be infeasible, or we may want to allow for some errors, as shown in Figure 7b; the soft margin SVR model was therefore proposed. The soft margin model involves another important parameter, the cost parameter C, which determines the penalty for deviations larger than $\varepsilon$ from the real value; the larger the value of C, the smaller the deviations that are tolerated. The training process of SVR is mathematically equivalent to the following optimization problem:
Minimize:
$$\frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} (\xi_i + \xi_i^*)$$
Subject to:
$$\begin{cases} y_i - w \cdot x_i - b \le \varepsilon + \xi_i \\ b + w \cdot x_i - y_i \le \varepsilon + \xi_i^* \\ \xi_i,\ \xi_i^* \ge 0 \end{cases}$$
where $C$ is the cost parameter ($C > 0$) and $w$ is the coefficient vector.
The SVR formulation described above produces a linear decision function fitted to the training dataset. Kernel functions are commonly used for nonlinear SVR; they transform the data into a higher-dimensional feature space in which a linear fit becomes possible. Two commonly used kernels are the polynomial and Gaussian radial basis functions.
The SVR can be used to estimate missing values by fitting an SVR to predict the non-missing values of $X_s$, i.e., $y_{obs}^{(s)}$ from $x_{obs}^{(s)}$, and using this fit to predict the missing values of $X_s$, i.e., $y_{mis}^{(s)}$ from $x_{mis}^{(s)}$. The SVR imputation algorithm is otherwise similar to that of the RF.
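A minimal scikit-learn sketch of this fit-then-predict scheme is given below; `X_obs`, `y_obs`, and `X_mis` stand for $x_{obs}^{(s)}$, $y_{obs}^{(s)}$, and $x_{mis}^{(s)}$, the C and ε values mirror Table 3, and the feature scaling step is a practical assumption:

```python
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

svr = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=100, epsilon=0.01))
svr.fit(X_obs, y_obs)           # fit on rows where the target feature is observed
y_mis_hat = svr.predict(X_mis)  # predict the feature where it is missing
```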

4.3.3. ANN Imputation Method

An artificial neural network (ANN) is a computing system designed to simulate the way the human brain analyzes and processes information. It is a foundation of artificial intelligence and can solve problems that are difficult for conventional statistical methods. Feed-forward multi-layer perceptron ANNs are frequently used in engineering applications [62], as shown in Figure 8. Here, we use a standard ANN with one hidden layer as an example. The weights of the first layer connect the input variables to the H hidden units (neurons), and the second layer’s weights connect these hidden neurons to the output units. First, given a p-dimensional input vector $x$, the H hidden neuron outputs are computed in the form [44]:
$$z_h = \sigma\left( \sum_{j=1}^{p} w_{hj}^{(1)} x_j + w_{h0}^{(1)} \right)$$
where $h = 1, 2, \ldots, H$; $z_h$ is the hidden neuron output; $w_{hj}^{(1)}$ is a first-layer weight; $w_{h0}^{(1)}$ is the corresponding bias parameter; the superscript (1) indicates that the corresponding parameters are in the first layer; and $\sigma$ is the activation function, chosen as the ReLU function in this study. By analogy, treating $z_h$ as the input to the next layer, the output can be calculated.
The ANN can be used to estimate missing values by fitting an ANN to predict the non-missing values of $X_s$, i.e., $y_{obs}^{(s)}$ from $x_{obs}^{(s)}$, and using this fit to predict the missing values of $X_s$, i.e., $y_{mis}^{(s)}$ from $x_{mis}^{(s)}$. The ANN imputation algorithm is also similar to that of the RF.
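As with the SVR, the scheme reduces to fit-then-predict; a sketch with scikit-learn’s MLPRegressor follows, where the solver, activation, learning rate, and alpha mirror Table 3 but the single hidden layer of 100 neurons and the scaling step are assumptions:

```python
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

ann = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(100,), activation="relu", solver="adam",
                 learning_rate_init=0.005, alpha=0.001, max_iter=2000,
                 random_state=0),
)
ann.fit(X_obs, y_obs)           # fit on rows with the target feature observed
y_mis_hat = ann.predict(X_mis)  # impute the missing entries
```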

5. Model Evaluation

5.1. Quantitative Evaluation

To assess the quality of the RF, SVR, ANN, mean, and MI predictions in the complete dataset (Dataset I), it was essential to establish metrics that allow the comparison of the different methods. This evaluation had to consist of a comparison between the prediction results and the actual results. We used two common statistical measurements, the root mean square error (RMSE) and the mean absolute error (MAE):
$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( x_i - \hat{x}_i \right)^2}$$
$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| x_i - \hat{x}_i \right|$$
where $x_i$ is the actual value, $\hat{x}_i$ is the predicted value, and $n$ is the total number of missing values.
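Both measures are computed on the artificially deleted entries of Dataset I, for which the true values are known; in scikit-learn they are one-liners:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

rmse = np.sqrt(mean_squared_error(x_true, x_pred))
mae = mean_absolute_error(x_true, x_pred)
```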

5.2. Qualitative Evaluation

Ideally, the RMSE and MAE of the imputation would also be used to determine which method performs best for imputing the missing values in UNSODA; however, the true values of the missing entries are unknown. We therefore used nonparametric tests (the reasons for not using ANOVA follow) to analyze whether there were statistically significant differences between features before and after imputation. Since the missing values in UNSODA are assumed to be MAR, the data before and after imputation were expected to show no significant differences. Before an ANOVA analysis can be performed, all features before and after imputation must be tested for normality; in this study, the values of skewness and kurtosis were used. In addition, we used a multiple linear regression model to quantitatively determine which imputation method performed better for UNSODA.
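A sketch of this screening with SciPy follows; the paper does not name the specific nonparametric test, so the Mann–Whitney U test is shown here as one common choice for comparing a feature before and after imputation:

```python
from scipy import stats

# Normality screening via skewness and kurtosis of each feature.
skew, kurt = stats.skew(x_before), stats.kurtosis(x_before)

# Nonparametric comparison of a feature before vs. after imputation.
u_stat, p_value = stats.mannwhitneyu(x_before, x_after)
if p_value > 0.05:
    print("no significant difference before vs. after imputation")
```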

6. Parameter Determination and Sensitivity Analysis

In this study, scikit-learn (version: 0.22), an open-source machine learning library, was used to perform the model training. We took the RF as an example to explain how to calibrate the parameters.
As discussed in Section 4.3.1, RF is a meta estimator that fits a number of CARTs on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.
The main parameters include the number of trees in the forest (n_estimators), the maximum depth of the tree (max_depth), and the minimum number of samples required to split an internal node (min_samples_split, default = 2). Because the number of features and the sample sizes in Datasets I and II were limited, max_depth was set to None, which means nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples.
According to Bisong [63] and Pham et al. [64], the number of trees in the forest has a significant effect on model accuracy, and the RF converges to a lower generalization error as the number of trees increases. Dataset I with a missing proportion of 0.15 was used to estimate the number of trees. As shown in Figure 9, the mean squared errors (MSEs) of pH and SHC did not decline further once the numbers of trees exceeded 31 and 41, respectively. However, considering the sample size and the number of features, n_estimators was set to 100. By analogy, the parameters of the SVR and ANN were calibrated; the details are listed in Table 3.
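A sketch of this calibration loop is shown below; the cross-validation scheme is an assumption, since the paper only reports MSE as a function of n_estimators (Figure 9):

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Sweep the number of trees and track cross-validated MSE, cf. Figure 9.
for n in range(1, 102, 10):
    rf = RandomForestRegressor(n_estimators=n, max_depth=None, random_state=0)
    mse = -cross_val_score(rf, X, y, cv=5, scoring="neg_mean_squared_error").mean()
    print(n, round(mse, 4))
```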

7. Results

7.1. Quantitative Measures for Dataset I

Table 4 summarizes statistical measurements of the performances of the RF, SVR, ANN, mean, and MI methods on Dataset I. The results reveal that the missing proportion affects the performance of the imputation methods. RF and MI outperformed the other imputation methods examined in this study; the mean and ANN methods performed worst.
For further discussion, Figure 10 presents the relationship between the missing proportion and the statistical measurements for each imputation method. For RF imputation, when the missing proportion increased from 0.03 to 0.11, the RMSE of DU increased from 5.99 to 33.32, and the MAE from 6.61 to 19.80. However, when the missing proportion increased from 0.11 to 0.15, the RMSE decreased significantly from 33.32 to 20.98, and the MAE from 19.80 to 12.07. When the missing proportion further increased from 0.15 to 0.23, the RMSE rose from 20.98 to 25.91, and the MAE from 12.07 to 16.87. Similar behavior was observed for DL, pH, SHC, OMC, PO, and PD.
For SVR imputation, when the missing proportion increased from 0.03 to 0.19, the RMSE of DU increased from 15.39 to 42.73, and the MAE from 16.17 to 32.43. However, when the missing proportion increased from 0.19 to 0.23, the RMSE gradually decreased from 42.73 to 37.49, and the MAE from 32.43 to 30.3. Similar behavior was observed for DL, pH, SHC, OMC, PO, and PD. The ANN, mean, and MI imputations behaved similarly to the RF and SVR imputations.
The above discussion cannot clearly establish which methods performed better on Dataset I, because the RMSEs and MAEs of all features fluctuated within acceptable ranges, except for SHC. It should be noted that the RMSE and MAE of SHC were high for every imputation method, probably because of the outliers in the raw data, as shown in Figure 3d. These results indicate that the RF, SVR, ANN, mean, and MI methods are adequate and could be used to impute the missing values in the original incomplete dataset (Dataset II).

7.2. Qualitative Measures for Dataset II

After imputing the missing values in Dataset II, we took the features (independent variables) DU, DL, pH, SHC, OMC, PO, and PD as the x-axis variables, and the label (dependent variable) BD as the y-axis variable. We denote by XXR a feature after RF imputation, XXS after SVR imputation, XXA after ANN imputation, XXM after mean imputation, XXMI after MI imputation, and XXO before imputation (i.e., the raw data).
Figure 11 shows the distributions of DU, DL, pH, SHC, OMC, PO, and PD before and after imputation. The results demonstrate that the data for each feature were distributed well in the major quadrants after RF, SVR, ANN, mean, and MI, indicating that these imputation methods are feasible. The sample size, minimum value, maximum value, mean value, standard deviation, and standard error; and the median, skewness, kurtosis, coefficient of variation, and variance of each feature before and after imputation are listed in Table 5. The mean value was computed from the sample data. It should be noted that the standard deviation differs from the standard error: the standard deviation indicates approximately how far individuals are from the mean values, whereas the standard error estimates the variability of the sample mean—i.e., approximately how far it is from the population mean [65].
Table 5 shows that the standard errors of DU, DL, pH, SHC, OMC, PO, and PD decreased from 1.570, 4.854, 0.091, 93.865, 0.508, 0.005, and 0.006 to 1.368, 4.205, 0.038, 53.360, 0.254, 0.003, and 0.004 after RF imputation; to 1.371, 4.204, 0.040, 53.148, 0.249, 0.003, and 0.004 after SVR imputation; to 1.374, 5.646, 0.043, 54.401, 0.282, 0.003, and 0.004 after ANN imputation, to 1.358, 4.197, 0.033, 52.325, 0.244, 0.003, and 0.003 after mean imputation; and to 1.395, 4.209, 0.039, 57.427, 0.270, 0.003, and 0.004 after MI imputation, indicating that the sample means became closer to the population means. The decreased coefficients of variation and standard deviations indicated that individuals were closer to the sample mean values.
Table 5 also shows that the maximum values of DUO, DLO, SHCO, and OMCO were more than three standard deviations away from the mean values, and the maximum values of DUR, DUS, DUA, DLR, DLS, DLA, SHCR, SHCS, SHCA, OMCR, OMCS, OMCA, DUM, DLM, SHCM, OMCM, DUMI, DLMI, SHCMI, and OMCMI were still more than three standard deviations away from the mean values, indicating that the raw data fluctuate greatly. To visualize this finding more clearly, the means, standard deviations, and standard errors of all features before and after imputation are plotted in Figure 12.
Figure 13 presents boxplots for DU, DL, pH, SHC, OMC, PO, and PD before and after mean, MI, RF, SVR, and ANN imputation. The statistical descriptions indicate that the same features lay within the same scales. Among them, SHC still had the broadest range, i.e., 0.019–27,648, with a distribution mostly skewed toward low values: in the range 60.415–512.541 after RF imputation, 61.836–144.250 after SVR imputation, 22.722–455.400 after ANN imputation, 69.600–613.559 after mean imputation, and 53.272–518.500 after MI imputation. On the other hand, the distribution of PO had the narrowest range, i.e., 0.174–0.915, centered on 0.399–0.490 after RF imputation, 0.430–0.480 after SVR imputation, 0.416–0.500 after ANN imputation, 0.458–0.469 after mean imputation, and 0.399–0.491 after MI imputation. Figure 13 also shows that mean imputation generated many outliers, such as for pH and PO, suggesting that mean imputation is unsuitable for pH and PO. It is worth noting that ANN imputation also generated outliers for DL.
Skewness measures the relative symmetry of a distribution, and a value of zero indicates symmetry. The larger the absolute value of skewness, the more asymmetric the distribution; a positive value indicates a long right tail, and a negative value a long left tail. By contrast, kurtosis measures the relative peakedness. The values of skewness and kurtosis for DUO, DUR, DUS, DUA, DUM, DUMI, DLO, DLR, DLS, DLA, DLM, DLMI, pHO, pHR, pHS, pHA, pHM, pHMI, SHCO, SHCR, SHCS, SHCA, SHCM, SHCMI, OMCO, OMCR, OMCS, OMCA, OMCM, OMCMI, POO, POR, POS, POA, POM, POMI, PDO, PDR, PDS, PDA, PDM, and PDMI are also listed in Table 5. None of these features strictly met the criterion for normality. Therefore, nonparametric tests rather than ANOVA were used to investigate whether there were differences before and after imputation, and the results are listed in Table 6.
Table 6 suggests that there were no significant differences before and after imputation, except between SHCO and SHCR, SHCO and SHCM, SHCO and SHCMI, POO and POR, POO and POM, POO and POMI, pHO and pHA, pHO and pHM, OMCO and OMCA, OMCO and OMCM, and OMCO and OMCMI. As discussed above, this result may have been caused by the raw data. Although SHC and PO differed before and after RF imputation, we consider the RF imputation still valid based on the means and medians; the same reasoning applies to pH and OMC before and after ANN imputation.
After the nonparametric tests, we used a multiple linear regression model to quantitatively determine which imputation method performed better for UNSODA. In the multiple linear regression model, DU, DL, pH, SHC, OMC, PO, and PD were still considered independent variables and BD the dependent variable. However, DU and DL are collinear; therefore, only one of them should be included, and DU was used in this study. For comparison, multiple linear regression was also performed with the missing data imputed by zero.
The results of multiple linear regression after RF, SVR, ANN, mean, MI, and zero imputation are presented in Table 7. Table 7 shows that $R^2$ was 0.910 for the RF imputation, which means that DU, pH, SHC, OMC, PO, and PD could explain 91.0% of the variation in BD. The model passed the F-test (F = 1273.712, p = 0.000 < 0.05), indicating that at least one of DU, pH, SHC, OMC, PO, and PD affects BD, and the model’s formula was: BD = 1.483 + (0.000 × DU) − (0.001 × pH) + (0.000 × SHC) + (0.002 × OMC) − (2.481 × PO) + (0.410 × PD). A test for multicollinearity indicated that all the variance inflation factor (VIF) values in the model were less than five, which means that there was no collinearity problem [66]. The results for SVR, ANN, mean, MI, and zero imputation were similar to those for RF imputation, but their $R^2$ values (0.639 for SVR, 0.885 for ANN, 0.594 for mean, 0.902 for MI, and 0.379 for zero imputation) were smaller. It should be noted that the $R^2$ value for MI is close to that of RF imputation, and the MI model also passed the F-test (F = 1154.513, p = 0.000 < 0.05), indicating that the performance of the MI method was close to that of RF imputation. We also considered the effect of discarding missing values, which decreased the number of records substantially (n = 109); the corresponding results are also presented in Table 7. The $R^2$ value was 0.998, meaning that DU, pH, SHC, OMC, PO, and PD could explain 99.8% of the variation in BD. This model also passed the F-test (F = 6932.797, p = 0.000 < 0.05), and its formula was: BD = 1.069 + (0.000 × DU) − (0.002 × pH) + (0.000 × SHC) + (0.002 × OMC) − (2.551 × PO) + (0.581 × PD). Although the $R^2$ values of the RF, SVR, and ANN imputations were smaller than that obtained after discarding missing values, the above discussion suggests that these imputation methods are feasible, and that the RF and MI methods may be better than SVR, ANN, mean, and zero imputation. The actual and predicted BD are shown in Figure 14.
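A hedged statsmodels sketch of this regression-plus-VIF check is given below, where `X` is assumed to be a DataFrame holding DU, pH, SHC, OMC, PO, and PD after a given imputation and `y` the corresponding BD:

```python
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X_const = sm.add_constant(X)                          # add the intercept term
model = sm.OLS(y, X_const).fit()
print(model.rsquared, model.fvalue, model.f_pvalue)   # R^2 and the F-test
# VIF per predictor (index 0 is the constant, hence the i + 1 offset):
vifs = [variance_inflation_factor(X_const.values, i + 1) for i in range(X.shape[1])]
```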

8. Discussion

The reason for using regression to impute the missing values is that regression assumes some relationship between the feature matrix and the label. That is, just as we can use features $X_1$, $X_2$, $X_3$, and $X_4$ to predict the label $Y$, we can also use $Y$, $X_2$, $X_3$, and $X_4$ to predict $X_1$.
Before imputing the missing values in the original incomplete dataset (Dataset II), Dataset I was simulated with different missing proportions (e.g., 3%, 7%, 11%, 15%, 19%, and 23%) using an MAR approach. It should be noted that the maximum simulated missing proportion was 23%, which is far below the average level of missing data in Dataset II. In fact, it would be difficult to simulate a higher missing proportion given the size of Dataset I: the sample size is insufficient, and the ANN imputation fails to converge as the missing proportion increases.
In this study, we mainly focused on comparing various methods of handling missing data in UNSODA; the imputed dataset was not applied in a case study. The results would be more convincing if the quality of a real case study improved after imputing the missing data using the proposed methods. From the multiple linear regression, we can infer that the data after RF and MI imputation can be used in a case study. In future work, we will use the imputed Dataset II to predict BD according to the PTFs proposed by Yi et al. [12], Curtis and Post [14], Adams [67], and Rawls [68].
In addition, outliers were not deleted or replaced before imputation, which may have caused the large RMSEs and MAEs for SHC. In a subsequent study, we will further explore how to predict SHC well.

9. Conclusions

Three machine learning-based methods, i.e., random forest (RF) regression, support vector regression (SVR), and artificial neural network (ANN) regression, and two statistics-based methods, i.e., mean and multiple imputation (MI), were used to impute the missing data for DU, DL, pH, SHC, OMC, PO, and PD in UNSODA. Both quantitative and qualitative methods were used to evaluate their feasibility. Based on the results, we conclude that:
(1)
The RMSEs and MAEs of DU, DL, pH, SHC, OMC, PO, and PD for the complete dataset indicate that RF, SVR, ANN, mean, and MI methods are appropriate for imputing the missing values in UNSODA.
(2)
The standard error significantly decreased after imputation, indicating that the sample means had become closer to the population mean. The decreased coefficients of variation and standard deviations indicate that the individual data points were closer to the sample mean values.
(3)
There were no significant differences before and after imputation for DU, DL, pH, SHC, OMC, PO, and PD.
(4)
DU, pH, SHC, OMC, PO, and PD explained 91.0%, 63.9%, 88.5%, 59.4%, and 90.2% of the variation in BD after RF, SVR, ANN, mean, and MI imputation, respectively; and this value was 99.8% when missing values were discarded.
(5)
This study suggests that the RF and MI methods may be best for imputing the missing data in UNSODA.

Author Contributions

Conceptualization, Y.F.; methodology, Y.F.; software, Y.F.; validation, Y.F.; formal analysis, Y.F.; investigation, Y.F.; resources, Y.F.; data curation, Y.F.; writing—original draft preparation, Y.F.; writing—review and editing, Y.F., H.L. and L.L.; visualization, Y.F.; supervision, H.L.; project administration, H.L.; funding acquisition, H.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Fundamental Research Funds for the Central Universities (xzy022019009), the National Natural Science Foundation of China (51879212, 41630639), and Key Projects of Shaanxi International Science and Technology Cooperation Plan (2019KWZ-09).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The raw data used during this study were obtained from a public dataset (https://doi.org/10.15482/USDA.ADC/1173246). The datasets analyzed during the current study are available from the corresponding author on reasonable request.

Acknowledgments

Special thanks are given to TsaiTsai for his wonderful videos about machine learning (https://www.bilibili.com/video/BV1vJ41187hk (accessed on 23 August 2020)). We also want to thank the authors of two excellent documents on RF regression (https://zhuanlan.zhihu.com/p/52052903, https://www.kaggle.com/lmorgan95/missforest-the-best-imputation-algorithm (accessed on 15 July 2021)).

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
UNSODA   international unsaturated soil database
RF   random forest
SVR   support vector regression
ANN   artificial neural network
MI   multiple imputation
SHC   saturated hydraulic conductivity
PO   porosity
PD   particle density
BD   bulk density
OMC   organic matter content
DU   upper depth
DL   lower depth
WC   water content
VIF   variance inflation factor
MSE   mean square error
MAE   mean absolute error
RMSE   root mean square error

References

1. Hartemink, A.E. Soil chemical and physical properties as indicators of sustainable land management under sugar cane in Papua New Guinea. Geoderma 1998, 85, 283–306.
2. Chung, R.S.; Wang, C.H.; Wang, C.W.; Wang, Y.P. Influence of organic matter and inorganic fertilizer on the growth and nitrogen accumulation of corn plants. J. Plant Nutr. 2000, 23, 297–311.
3. Islam, A.; Edwards, D.; Asher, C. pH optima for crop growth. Plant Soil 1980, 54, 339–357.
4. Karapouloutidou, S.; Gasparatos, D. Effects of biostimulant and organic amendment on soil properties and nutrient status of Lactuca sativa in a calcareous saline-sodic soil. Agriculture 2019, 9, 164.
5. Bruand, A.; Fernández, P.P.; Duval, O. Use of class pedotransfer functions based on texture and bulk density of clods to generate water retention curves. Soil Use Manag. 2003, 19, 232–242.
6. Shwetha, P.; Varija, K. Soil water retention curve from saturated hydraulic conductivity for sandy loam and loamy sand textured soils. Aquat. Procedia 2015, 4, 1142–1149.
7. Zhang, J.; Ma, G.; Huang, Y.; Aslani, F.; Nener, B. Modelling uniaxial compressive strength of lightweight self-compacting concrete using random forest regression. Constr. Build. Mater. 2019, 210, 713–719.
8. Peters, A.; Hohenbrink, T.L.; Iden, S.C.; Durner, W. A simple model to predict hydraulic conductivity in medium to dry soil from the water retention curve. Water Resour. Res. 2021, 57, e2020WR029211.
9. Fu, Y.; Liao, H.; Chai, X.; Li, Y.; Lv, L. A Hysteretic Model Considering Contact Angle Hysteresis for Fitting Soil-Water Characteristic Curves. Water Resour. Res. 2021, 57, e2019WR026889.
10. Abu-Hamdeh, N.H. Compaction and subsoiling effects on corn growth and soil bulk density. Soil Sci. Soc. Am. J. 2003, 67, 1213–1219.
11. Ghezzehei, T.A. Errors in determination of soil water content using time domain reflectometry caused by soil compaction around waveguides. Water Resour. Res. 2008, 44, W08451.
12. Yi, X.; Li, G.; Yin, Y. Pedotransfer functions for estimating soil bulk density: A case study in the Three-River Headwater region of Qinghai Province, China. Pedosphere 2016, 26, 362–373.
13. Mohanty, B.; Bowman, R.; Hendrickx, J.; Van Genuchten, M.T. New piecewise-continuous hydraulic functions for modeling preferential flow in an intermittent-flood-irrigated field. Water Resour. Res. 1997, 33, 2049–2063.
14. Curtis, R.O.; Post, B.W. Estimating bulk density from organic-matter content in some Vermont forest soils. Soil Sci. Soc. Am. J. 1964, 28, 285–286.
15. Kaur, R.; Kumar, S.; Gurung, H. A pedo-transfer function (PTF) for estimating soil bulk density from basic soil data and its comparison with existing PTFs. Soil Res. 2002, 40, 847–858.
16. Shiri, J.; Keshavarzi, A.; Kisi, O.; Karimi, S.; Iturraran-Viveros, U. Modeling soil bulk density through a complete data scanning procedure: Heuristic alternatives. J. Hydrol. 2017, 549, 592–602.
17. Bagarello, V.; Baiamonte, G.; Caia, C. Variability of near-surface saturated hydraulic conductivity for the clay soils of a small Sicilian basin. Geoderma 2019, 340, 133–145.
18. Zapata, C.E.; Houston, W.N.; Houston, S.L.; Walsh, K.D. Soil–water characteristic curve variability. In Advances in Unsaturated Geotechnics; CRC Press: Boca Raton, FL, USA, 2000; pp. 84–124.
19. Bouma, J. Using soil survey data for quantitative land evaluation. In Advances in Soil Science; Springer: Berlin/Heidelberg, Germany, 1989; pp. 177–213.
20. Wösten, J.; Pachepsky, Y.A.; Rawls, W. Pedotransfer functions: Bridging the gap between available basic soil data and missing soil hydraulic characteristics. J. Hydrol. 2001, 251, 123–150.
21. Nemes, A.D.; Schaap, M.; Leij, F.; Wösten, J. Description of the unsaturated soil hydraulic database UNSODA version 2.0. J. Hydrol. 2001, 251, 151–162.
22. Leij, F.J. The UNSODA Unsaturated Soil Hydraulic Database: User’s Manual; National Risk Management Research Laboratory, Office of Research and Development, US Environmental Protection Agency: Cincinnati, OH, USA, 1996.
23. Wösten, J.; Lilly, A.; Nemes, A.; Le Bas, C. Development and use of a database of hydraulic properties of European soils. Geoderma 1999, 90, 169–185.
24. Nachtergaele, F.; van Velthuizen, H.; Verelst, L.; Batjes, N.; Dijkshoorn, K.; van Engelen, V.; Fischer, G.; Jones, A.; Montanarela, L. The harmonized world soil database. In Proceedings of the 19th World Congress of Soil Science, Soil Solutions for a Changing World, Brisbane, Australia, 1–6 August 2010; pp. 34–37.
25. Huang, G.; Zhang, R. Evaluation of soil water retention curve with the pore–solid fractal model. Geoderma 2005, 127, 52–61.
26. Hwang, S.I.; Powers, S.E. Using particle-size distribution models to estimate soil hydraulic properties. Soil Sci. Soc. Am. J. 2003, 67, 1103–1112.
27. Hwang, S.; Yun, E.; Ro, H. Estimation of soil water retention function based on asymmetry between particle- and pore-size distributions. Eur. J. Soil Sci. 2011, 62, 195–205.
28. Mohammadi, M.H.; Vanclooster, M. Predicting the soil moisture characteristic curve from particle size distribution with a simple conceptual model. Vadose Zone J. 2011, 10, 594–602.
29. Chang, C.; Cheng, D. Predicting the soil water retention curve from the particle size distribution based on a pore space geometry containing slit-shaped spaces. Hydrol. Earth Syst. Sci. 2018, 22, 4621–4632.
30. Ghanbarian-Alavijeh, B.; Liaghat, A.; Huang, G.H.; Van Genuchten, M.T. Estimation of the van Genuchten soil water retention properties from soil textural data. Pedosphere 2010, 20, 456–465.
31. Haverkamp, R.; Leij, F.J.; Fuentes, C.; Sciortino, A.; Ross, P. Soil water retention: I. Introduction of a shape index. Soil Sci. Soc. Am. J. 2005, 69, 1881–1890.
32. Seki, K. SWRC fit—A nonlinear fitting program with a water retention curve for soils having unimodal and bimodal pore structure. Hydrol. Earth Syst. Sci. Discuss. 2007, 4, 407–437.
33. Ghanbarian, B.; Hunt, A.G. Improving unsaturated hydraulic conductivity estimation in soils via percolation theory. Geoderma 2017, 303, 9–18.
34. Pham, K.; Kim, D.; Yoon, Y.; Choi, H. Analysis of neural network based pedotransfer function for predicting soil water characteristic curve. Geoderma 2019, 351, 92–102.
35. Vaz, C.M.P.; Ferreira, E.J.; Posadas, A.D. Evaluation of models for fitting soil particle-size distribution using UNSODA and a Brazilian dataset. Geoderma Reg. 2020, 21, e00273.
36. Tang, F.; Ishwaran, H. Random forest missing data algorithms. Stat. Anal. Data Min. ASA Data Sci. J. 2017, 10, 363–377.
37. Strike, K.; El Emam, K.; Madhavji, N. Software cost estimation with incomplete data. IEEE Trans. Softw. Eng. 2001, 27, 890–908.
38. Raymond, M.R.; Roberts, D.M. A comparison of methods for treating incomplete data in selection research. Educ. Psychol. Meas. 1987, 47, 13–26.
39. Lin, W.C.; Tsai, C.F. Missing value imputation: A review and analysis of the literature (2006–2017). Artif. Intell. Rev. 2020, 53, 1487–1509.
40. Puri, A.; Gupta, M. Review on Missing Value Imputation Techniques in Data Mining. In Proceedings of the International Conference on Machine Learning and Computational Intelligence, Sydney, Australia, 6–11 August 2017; pp. 35–40.
41. Van Genuchten, M.T.; Leij, F.; Lund, L. Indirect Methods for Estimating the Hydraulic Properties of Unsaturated Soils; U.S. Department of Agriculture: North Bend, WA, USA, 1992.
42. Lin, J.; Li, N.; Alam, M.A.; Ma, Y. Data-driven missing data imputation in cluster monitoring system based on deep neural network. Appl. Intell. 2020, 50, 860–877.
43. Rubin, D.B. Multiple imputation after 18+ years. J. Am. Stat. Assoc. 1996, 91, 473–489.
44. Jerez, J.M.; Molina, I.; García-Laencina, P.J.; Alba, E.; Ribelles, N.; Martín, M.; Franco, L. Missing data imputation using statistical and machine learning methods in a real breast cancer problem. Artif. Intell. Med. 2010, 50, 105–115.
45. Ghorbani, S.; Desmarais, M.C. Performance comparison of recent imputation methods for classification tasks over binary data. Appl. Artif. Intell. 2017, 31, 1–22.
46. Shah, A.D.; Bartlett, J.W.; Carpenter, J.; Nicholas, O.; Hemingway, H. Comparison of random forest and parametric imputation models for imputing missing data using MICE: A CALIBER study. Am. J. Epidemiol. 2014, 179, 764–774.
47. Reilly, M. Data analysis using hot deck multiple imputation. J. R. Stat. Soc. Ser. D Stat. 1993, 42, 307–313.
48. Nishanth, K.J.; Ravi, V. Probabilistic neural network based categorical data imputation. Neurocomputing 2016, 218, 17–25.
49. Kuligowski, R.J.; Barros, A.P. Using artificial neural networks to estimate missing rainfall data. JAWRA J. Am. Water Resour. Assoc. 1998, 34, 1437–1447.
50. Hassani, H.; Kalantari, M.; Ghodsi, Z. Evaluating the Performance of Multiple Imputation Methods for Handling Missing Values in Time Series Data: A Study Focused on East Africa, Soil-Carbonate-Stable Isotope Data. Stats 2019, 2, 457–467.
51. Lorenzi, L.; Mercier, G.; Melgani, F. Support vector regression with kernel combination for missing data reconstruction. IEEE Geosci. Remote Sens. Lett. 2012, 10, 367–371.
52. Humphries, M. Missing Data & How to Deal: An Overview of Missing Data; Population Research Center, University of Texas: Austin, TX, USA, 2013; pp. 39–41. Available online: https://liberalarts.utexas.edu/prc/_files/cs/Missing-Data.pdf (accessed on 21 July 2021).
53. Malarvizhi, R.; Thanamani, A.S. K-nearest neighbor in missing data imputation. Int. J. Eng. Res. Dev. 2012, 5, 5–7.
54. Yan, X.; Xiong, W.; Hu, L.; Wang, F.; Zhao, K. Missing value imputation based on gaussian mixture model for the internet of things. Math. Probl. Eng. 2015, 2015, 548605.
55. Nikfalazar, S.; Yeh, C.H.; Bedingfield, S.; Khorshidi, H.A. Missing data imputation using decision trees and fuzzy clustering with iterative learning. Knowl. Inf. Syst. 2020, 62, 2419–2437.
56. Somasundaram, R.; Nedunchezhian, R. Evaluation of three simple imputation methods for enhancing preprocessing of data with missing values. Int. J. Comput. Appl. 2011, 21, 14–19.
57. Stekhoven, D.J.; Bühlmann, P. MissForest—non-parametric missing value imputation for mixed-type data. Bioinformatics 2012, 28, 112–118.
58. Ließ, M.; Glaser, B.; Huwe, B. Uncertainty in the spatial prediction of soil texture: Comparison of regression tree and Random Forest models. Geoderma 2012, 170, 70–79.
59. Han, H.; Lee, S.; Kim, H.C.; Kim, M. Retrieval of Summer Sea Ice Concentration in the Pacific Arctic Ocean from AMSR2 Observations and Numerical Weather Data Using Random Forest Regression. Remote Sens. 2021, 13, 2283.
60. Ballabio, C. Spatial prediction of soil properties in temperate mountain regions using support vector regression. Geoderma 2009, 151, 338–350.
61. Hamasuna, Y.; Endo, Y.; Miyamoto, S. Support Vector Machine for data with tolerance based on Hard-margin and Soft-Margin. In Proceedings of the 2008 IEEE International Conference on Fuzzy Systems (IEEE World Congress on Computational Intelligence), Hong Kong, China, 1–6 June 2008; pp. 750–755.
62. Neaupane, K.M.; Adhikari, N. Prediction of tunneling-induced ground movement with the multi-layer perceptron. Tunn. Undergr. Space Technol. 2006, 21, 151–159.
63. Bisong, E. More supervised machine learning techniques with scikit-learn. In Building Machine Learning and Deep Learning Models on Google Cloud Platform; Springer: Berlin/Heidelberg, Germany, 2019; pp. 287–308.
64. Pham, T.D.; Bui, N.D.; Nguyen, T.T.; Phan, H.C. Predicting the reduction of embankment pressure on the surface of the soft ground reinforced by sand drain with random forest regression. In IOP Conference Series: Materials Science and Engineering; IOP Publishing: Bristol, UK, 2020; Volume 869, p. 072027.
65. Siegel, A. Practical Business Statistics; Academic Press: Cambridge, MA, USA, 2016.
66. Salmerón Gómez, R.; García Pérez, J.; López Martín, M.D.M.; García, C.G. Collinearity diagnostic applied in ridge estimation through the variance inflation factor. J. Appl. Stat. 2016, 43, 1831–1849.
67. Adams, W. The effect of organic matter on the bulk and true densities of some uncultivated podzolic soils. J. Soil Sci. 1973, 24, 10–17.
68. Rawls, W.J. Estimating soil bulk density from particle size analysis and organic matter content. Soil Sci. 1983, 135, 123–125.
Figure 1. An overview of the database structure and the data in UNSODA V2.0 [21].
Figure 2. The available sample size, total sample size, and missing proportion of each soil property.
Figure 3. Boxplots for the soil properties in UNSODA: (a) DU; (b) DL; (c) pH; (d) SHC; (e) OMC; (f) PO; (g) PD; (h) BD.
Figure 4. The experimental procedure for missing value imputation.
Figure 5. The simplified architecture and mechanism of the MI method.
Figure 6. An illustration of RF regression.
Figure 7. Illustrations of SVR: (a) hard margin; (b) soft margin.
Figure 8. A schematic of an artificial neural network (ANN) with two hidden layers and a single neuron output.
Figure 9. The relationships between MSE and n_estimators for (a) pH and (b) SHC.
Figure 10. RMSEs and MAEs for soil properties after imputation: (a) DU; (b) DL; (c) pH; (d) SHC; (e) OMC; (f) PO; (g) PD.
Figure 11. The distributions of the soil properties before and after imputation: (a) DU; (b) DL; (c) pH; (d) SHC; (e) OMC; (f) PO; (g) PD.
Figure 12. The means, standard deviations, and standard errors of every feature before and after imputation.
Figure 13. Boxplots for the soil properties before and after RF, SVR, ANN, mean, and MI methods: (a) DU; (b) DL; (c) pH; (d) SHC; (e) OMC; (f) PO; (g) PD.
Figure 14. The actual BD and predicted BD: (a) RF imputation; (b) SVR imputation; (c) ANN imputation; (d) MI; (e) mean imputation; (f) discarding missing values.
Table 2. Statistical descriptions of soil properties in UNSODA.

| Soil Property | Effective Sample | Data Sample * | Range | Mean | Q1 (25%) | Q2 (50%) | Q3 (75%) | Missing Proportion ** |
|---|---|---|---|---|---|---|---|---|
| DU | 685 | 659 | 0∼325 | 41.231 | 5.000 | 30.000 | 61.000 | 0.1329 |
| DL | 685 | 659 | 0∼3000 | 72.466 | 30.000 | 56.000 | 95.000 | 0.1329 |
| pH | 300 | 279 | 3.1∼8.6 | 6.259 | 4.900 | 6.700 | 7.540 | 0.6203 |
| SHC | 429 | 425 | 0.019∼27,648 | 613.559 | 20.818 | 95.900 | 459.400 | 0.4570 |
| OMC | 388 | 367 | 0.01∼88.4 | 2.942 | 0.340 | 0.940 | 2.500 | 0.5089 |
| PO | 370 | 366 | 0.175∼0.915 | 0.469 | 0.405 | 0.456 | 0.510 | 0.5316 |
| PD | 399 | 395 | 1.65∼2.93 | 2.642 | 2.610 | 2.650 | 2.690 | 0.4949 |
| BD | 762 | 762 | 0.17∼2.1 | 1.444 | 1.340 | 1.490 | 1.600 | 0.0354 |

NOTE: * The data sample was deleted when BD was missing. ** Missing Proportion = missing sample/total sample (790).
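The quantities in Table 2 follow directly from the database once the soil properties are gathered into a tabular form. A minimal sketch with pandas, assuming the properties have been exported to a CSV file with one row per sample (the file name and column names here are illustrative, not UNSODA's own schema):

```python
import pandas as pd

# Illustrative: one row per UNSODA sample, one column per soil property.
df = pd.read_csv("unsoda_properties.csv")
TOTAL = 790  # total number of UNSODA samples considered in this study

for col in ["DU", "DL", "pH", "SHC", "OMC", "PO", "PD", "BD"]:
    s = df[col].dropna()                       # available (non-missing) values
    missing_proportion = (TOTAL - len(s)) / TOTAL
    q1, q2, q3 = s.quantile([0.25, 0.50, 0.75])
    print(f"{col}: n={len(s)}, mean={s.mean():.3f}, "
          f"Q1={q1:.3f}, Q2={q2:.3f}, Q3={q3:.3f}, "
          f"missing={missing_proportion:.4f}")
```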
Table 3. Parameters used for model training.

| Imputation Method | Parameter | Value |
|---|---|---|
| RF | n_estimators | 100 |
| SVR | C | 100 (50) |
| SVR | ε | 0.01 |
| SVR | kernel | rbf |
| ANN | learning_rate_init | 0.005 (0.06) |
| ANN | activation | relu |
| ANN | solver | adam |
| ANN | alpha | 0.001 (1) |

Note: values in round brackets were used for Dataset II.
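The parameter names in Table 3 correspond to scikit-learn's estimator arguments, so the three regressors can be configured as in the sketch below for Dataset I (substituting the bracketed values for Dataset II). Leaving all other parameters at their scikit-learn defaults is an assumption, since the paper lists only these settings:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.neural_network import MLPRegressor

# Dataset I settings from Table 3; Dataset II swaps in the bracketed values.
rf = RandomForestRegressor(n_estimators=100)
svr = SVR(kernel="rbf", C=100, epsilon=0.01)   # C = 50 for Dataset II
ann = MLPRegressor(
    learning_rate_init=0.005,                  # 0.06 for Dataset II
    activation="relu",
    solver="adam",
    alpha=0.001,                               # 1 for Dataset II
)
```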
Table 4. Statistical measurements for RF, SVR, ANN, mean, and MI imputations in Dataset I.

| Feature | Missing Proportion | RMSE RF | RMSE SVR | RMSE ANN | RMSE Mean | RMSE MI | MAE RF | MAE SVR | MAE ANN | MAE Mean | MAE MI |
|---|---|---|---|---|---|---|---|---|---|---|---|
| DU | 0.03 | 5.99 | 15.39 | 32.22 | 6.68 | 5.5 | 6.61 | 16.17 | 39.28 | 23.27 | 4.8 |
| | 0.07 | 12.39 | 24.74 | 16.12 | 36.57 | 17.68 | 10.32 | 21.43 | 11.57 | 32 | 14.2 |
| | 0.11 | 33.32 | 39.23 | 34.2 | 40.24 | 28.19 | 19.8 | 27.78 | 18.93 | 29.06 | 21.8 |
| | 0.15 | 20.98 | 40.91 | 18.23 | 39.81 | 23.82 | 12.07 | 32.42 | 15 | 31.71 | 16.55 |
| | 0.19 | 25.58 | 42.73 | 45.38 | 64.12 | 44.16 | 17.07 | 32.43 | 38.88 | 49.03 | 30.62 |
| | 0.23 | 25.91 | 37.49 | 20.4 | 44.6 | 21.98 | 16.87 | 30.03 | 15.09 | 36.83 | 17.08 |
| DL | 0.03 | 6.43 | 23.72 | 59.07 | 40.9 | 15.57 | 5.36 | 20.81 | 43.67 | 34.97 | 9.93 |
| | 0.07 | 19.3 | 57 | 28.89 | 69.87 | 51.48 | 15.26 | 39.6 | 23.43 | 51.66 | 42.44 |
| | 0.11 | 33.26 | 34.85 | 42.43 | 42.29 | 54.19 | 21.54 | 27.27 | 32.77 | 36.79 | 46.8 |
| | 0.15 | 35.75 | 67.96 | 38.1 | 70.98 | 63.37 | 21.34 | 47.33 | 25.55 | 52.49 | 42.68 |
| | 0.19 | 26.54 | 28.76 | 61.14 | 31.64 | 44.12 | 17.57 | 21.65 | 42.71 | 27.82 | 32.73 |
| | 0.23 | 36.56 | 52.57 | 34.76 | 54.58 | 49.47 | 24.57 | 36.55 | 26.23 | 39.99 | 37.49 |
| pH | 0.03 | 0.86 | 0.91 | 0.32 | 1.32 | 1.08 | 0.81 | 0.8 | 0.27 | 1.29 | 0.83 |
| | 0.07 | 1.32 | 1.02 | 2.11 | 1.46 | 1.53 | 1.16 | 0.92 | 1.66 | 1.27 | 1.39 |
| | 0.11 | 1.86 | 1.69 | 2 | 1.61 | 1.88 | 1.76 | 1.49 | 1.82 | 1.53 | 1.57 |
| | 0.15 | 1.35 | 1.27 | 1.33 | 1.35 | 0.97 | 1.01 | 0.98 | 1 | 1.17 | 0.78 |
| | 0.19 | 1.65 | 1.98 | 1.95 | 1.66 | 1.71 | 1.43 | 1.71 | 1.65 | 1.51 | 1.51 |
| | 0.23 | 1.66 | 1.98 | 2.16 | 1.77 | 1.91 | 1.39 | 1.69 | 1.77 | 1.63 | 1.75 |
| SHC | 0.03 | 854.75 | 825.83 | 1200.1 | 662.66 | 953.15 | 679.27 | 519.22 | 935.43 | 576.29 | 751.7 |
| | 0.07 | 2818.95 | 3867.88 | 1710.57 | 3756.35 | 1887.4 | 1236.26 | 1256.13 | 751.89 | 1560.2 | 1736.49 |
| | 0.11 | 1594.87 | 1906.36 | 1026.96 | 1839.5 | 2303.95 | 710.16 | 582.82 | 563.94 | 901.18 | 202.68 |
| | 0.15 | 1896.67 | 1951.17 | 1697.06 | 1867.5 | 1933.28 | 682.14 | 630.66 | 984.73 | 903.6 | 738.41 |
| | 0.19 | 2299.21 | 2986.01 | 1794.31 | 2867.45 | 2003.19 | 1098.22 | 995.45 | 852.43 | 1252.13 | 790.91 |
| | 0.23 | 1264.22 | 938.44 | 1328.83 | 899.61 | 879.01 | 717.31 | 384.24 | 805.91 | 674.99 | 428.77 |
| OMC | 0.03 | 0.69 | 0.64 | 3.65 | 1.78 | 0.76 | 0.48 | 0.53 | 2.96 | 1.78 | 0.53 |
| | 0.07 | 1.78 | 1.83 | 1.7 | 2.51 | 1.37 | 1.36 | 1.32 | 1.01 | 2.15 | 1.24 |
| | 0.11 | 0.49 | 1.37 | 0.97 | 1.52 | 1.14 | 0.45 | 1.02 | 0.8 | 1.41 | 0.88 |
| | 0.15 | 1.69 | 1.86 | 3.28 | 1.98 | 1.97 | 0.84 | 1.35 | 2.13 | 1.69 | 1.21 |
| | 0.19 | 1.7 | 1.45 | 2.19 | 2 | 2.07 | 0.92 | 1.06 | 1.32 | 1.8 | 1.37 |
| | 0.23 | 1.35 | 1.4 | 2.75 | 1.7 | 1.58 | 0.91 | 1.06 | 1.64 | 1.63 | 1.1 |
| PO | 0.03 | 0.005 | 0.101 | 0.18 | 0.077 | 0.002 | 0.005 | 0.061 | 0.128 | 0.059 | 0.002 |
| | 0.07 | 0.017 | 0.077 | 0.057 | 0.088 | 0.011 | 0.013 | 0.059 | 0.051 | 0.072 | 0.005 |
| | 0.11 | 0.01 | 0.074 | 0.083 | 0.074 | 0.004 | 0.008 | 0.064 | 0.075 | 0.064 | 0.004 |
| | 0.15 | 0.023 | 0.199 | 0.088 | 0.09 | 0.016 | 0.014 | 0.117 | 0.062 | 0.065 | 0.011 |
| | 0.19 | 0.015 | 0.065 | 0.13 | 0.065 | 0.019 | 0.013 | 0.056 | 0.097 | 0.051 | 0.012 |
| | 0.23 | 0.008 | 0.056 | 0.131 | 0.057 | 0.007 | 0.007 | 0.046 | 0.103 | 0.045 | 0.004 |
| PD | 0.03 | 0.09 | 0.1 | 0.17 | 0.03 | 0.02 | 0.07 | 0.07 | 0.16 | 0.02 | 0.02 |
| | 0.07 | 0.09 | 0.05 | 0.22 | 0.06 | 0.03 | 0.06 | 0.04 | 0.19 | 0.05 | 0.02 |
| | 0.11 | 0.06 | 0.05 | 0.13 | 0.11 | 0.02 | 0.05 | 0.04 | 0.1 | 0.06 | 0.02 |
| | 0.15 | 0.2 | 0.2 | 0.25 | 0.21 | 0.1 | 0.11 | 0.12 | 0.17 | 0.12 | 0.05 |
| | 0.19 | 0.06 | 0.05 | 0.26 | 0.07 | 0.08 | 0.05 | 0.04 | 0.22 | 0.05 | 0.05 |
| | 0.23 | 0.05 | 0.08 | 0.24 | 0.09 | 0.03 | 0.05 | 0.06 | 0.19 | 0.07 | 0.02 |
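The two metrics in Table 4 come from masking a known fraction of observed values, imputing them, and comparing the imputed values with the withheld truth. A minimal sketch of that scoring loop, where `impute_fn` is a hypothetical user-supplied imputation routine (any of the five methods above):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

rng = np.random.default_rng(42)

def simulate_and_score(y, impute_fn, missing_proportion=0.11):
    """Mask a fraction of the observed values, impute them, and report
    RMSE and MAE over the masked entries only (illustrative helper)."""
    mask = rng.random(len(y)) < missing_proportion
    y_masked = y.astype(float).copy()
    y_masked[mask] = np.nan
    y_imputed = impute_fn(y_masked)   # hypothetical: fills the NaN entries
    rmse = np.sqrt(mean_squared_error(y[mask], y_imputed[mask]))
    mae = mean_absolute_error(y[mask], y_imputed[mask])
    return rmse, mae
```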
Table 5. The basic information of the features before and after imputation.

| Feature | Name | Sample Size | Min | Max | Mean | Standard Deviation | Standard Error | Median | Kurtosis | Skewness | Coefficient of Variation | Variance |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DU | DUO | 659 | 0 | 325 | 41.231 | 40.316 | 1.57 | 30 | 3.517 | 1.357 | 97.78% | 1625.378 |
| | DUR | 762 | 0 | 325 | 39.736 | 37.752 | 1.368 | 30 | 4.562 | 1.541 | 95.01% | 1425.22 |
| | DUS | 762 | 0 | 325 | 40.704 | 37.85 | 1.371 | 33.205 | 4.348 | 1.473 | 92.99% | 1432.658 |
| | DUA | 762 | 0 | 325 | 41.234 | 37.936 | 1.374 | 35 | 4.258 | 1.441 | 92.00% | 1439.103 |
| | DUM | 762 | 0 | 325 | 41.231 | 37.488 | 1.358 | 40 | 4.532 | 1.459 | 90.92% | 1405.386 |
| | DUMI | 762 | 0 | 325 | 41.701 | 38.497 | 1.395 | 33 | 3.79 | 1.351 | 92.32% | 1482.011 |
| DL | DLO | 659 | 6 | 3000 | 72.466 | 124.601 | 4.854 | 56 | 464.535 | 19.855 | 171.94% | 15,525.368 |
| | DLR | 762 | 6 | 3000 | 70.533 | 116.087 | 4.205 | 57.21 | 534.314 | 21.271 | 164.59% | 13,476.125 |
| | DLS | 762 | 6 | 3000 | 70.87 | 116.049 | 4.204 | 55.284 | 534.77 | 21.283 | 163.75% | 13,467.300 |
| | DLA | 762 | 6 | 3000 | 79.92 | 155.859 | 5.646 | 60 | 213.387 | 13.278 | 195.02% | 24,291.898 |
| | DLM | 762 | 6 | 3000 | 72.466 | 115.862 | 4.197 | 65 | 537.058 | 21.344 | 159.89% | 13,424.037 |
| | DLMI | 762 | 6 | 3000 | 71.48 | 116.176 | 4.209 | 59.9 | 531.973 | 21.197 | 162.53% | 13,496.809 |
| pH | pHO | 279 | 3.1 | 8.6 | 6.259 | 1.513 | 0.091 | 6.7 | −1.219 | −0.355 | 24.17% | 2.288 |
| | pHR | 762 | 3.1 | 8.6 | 6.298 | 1.039 | 0.038 | 6.441 | 0.176 | −0.526 | 16.49% | 1.079 |
| | pHS | 762 | 3.1 | 8.6 | 6.407 | 1.104 | 0.04 | 6.744 | 0.182 | −0.846 | 17.23% | 1.219 |
| | pHA | 762 | 2.857 | 10.413 | 6.178 | 1.185 | 0.043 | 6.334 | −0.056 | −0.24 | 19.19% | 1.405 |
| | pHM | 762 | 3.1 | 8.6 | 6.259 | 0.914 | 0.033 | 6.259 | 1.884 | −0.584 | 14.61% | 0.836 |
| | pHMI | 762 | 3.1 | 8.6 | 6.29 | 1.064 | 0.039 | 6.36 | 0.027 | −0.484 | 16.91% | 1.131 |
| SHC | SHCO | 425 | 0.019 | 27,648 | 613.559 | 1935.074 | 93.865 | 95.9 | 97.723 | 8.448 | 315.39% | 3,744,509.62 |
| | SHCR | 762 | 0.019 | 27,648 | 540.376 | 1472.962 | 53.36 | 229.387 | 165.311 | 10.812 | 272.58% | 2,169,617.99 |
| | SHCS | 762 | 0.019 | 27,648 | 384.797 | 1467.129 | 53.148 | 96.737 | 172.699 | 11.23 | 381.27% | 2,152,468.57 |
| | SHCA | 762 | 0.019 | 27,648 | 501.111 | 1501.694 | 54.401 | 135.307 | 154.204 | 10.371 | 299.67% | 2,255,085.69 |
| | SHCM | 762 | 0.019 | 27,648 | 613.559 | 1444.402 | 52.325 | 613.559 | 176.677 | 11.294 | 235.41% | 2,086,297.08 |
| | SHCMI | 762 | 0.019 | 27,648 | 619.756 | 1585.248 | 57.427 | 207.092 | 122.024 | 8.941 | 255.79% | 2,513,012.29 |
| OMC | OMCO | 367 | 0.01 | 88.4 | 2.942 | 9.727 | 0.508 | 0.94 | 50.679 | 6.976 | 330.61% | 94.611 |
| | OMCR | 762 | 0.01 | 88.4 | 2.217 | 7.01 | 0.254 | 0.958 | 96.343 | 9.395 | 316.15% | 49.147 |
| | OMCS | 762 | −0.732 | 88.4 | 1.953 | 6.864 | 0.249 | 0.92 | 105.92 | 9.946 | 351.45% | 47.111 |
| | OMCA | 762 | 0.01 | 88.4 | 2.824 | 7.794 | 0.282 | 1.128 | 64.553 | 7.507 | 276.01% | 60.742 |
| | OMCM | 762 | 0.01 | 88.4 | 2.942 | 6.746 | 0.244 | 2.942 | 107.704 | 10.031 | 229.28% | 45.503 |
| | OMCMI | 762 | 0.01 | 88.4 | 2.48 | 7.459 | 0.27 | 1.116 | 84.869 | 8.843 | 300.78% | 55.642 |
| PO | POO | 366 | 0.175 | 0.915 | 0.469 | 0.104 | 0.005 | 0.456 | 3.001 | 1.298 | 22.08% | 0.011 |
| | POR | 762 | 0.175 | 0.915 | 0.456 | 0.091 | 0.003 | 0.442 | 4.52 | 1.598 | 20.02% | 0.008 |
| | POS | 762 | 0.175 | 0.915 | 0.464 | 0.077 | 0.003 | 0.458 | 8.156 | 1.989 | 16.66% | 0.006 |
| | POA | 762 | 0.175 | 1.008 | 0.468 | 0.091 | 0.003 | 0.448 | 5.628 | 1.8 | 19.50% | 0.008 |
| | POM | 762 | 0.175 | 0.915 | 0.469 | 0.072 | 0.003 | 0.469 | 9.445 | 1.869 | 15.29% | 0.005 |
| | POMI | 762 | 0.174 | 0.915 | 0.457 | 0.092 | 0.003 | 0.444 | 4.489 | 1.593 | 20.05% | 0.008 |
| PD | PDO | 395 | 1.65 | 2.93 | 2.642 | 0.126 | 0.006 | 2.65 | 26.025 | −3.401 | 4.76% | 0.016 |
| | PDR | 762 | 1.65 | 2.93 | 2.644 | 0.106 | 0.004 | 2.652 | 37.489 | −4.462 | 4.01% | 0.011 |
| | PDS | 762 | 1.65 | 2.93 | 2.642 | 0.099 | 0.004 | 2.648 | 40.998 | −4.4 | 3.73% | 0.01 |
| | PDA | 762 | 1.65 | 2.93 | 2.639 | 0.116 | 0.004 | 2.65 | 34.429 | −4.43 | 4.40% | 0.013 |
| | PDM | 762 | 1.65 | 2.93 | 2.642 | 0.091 | 0.003 | 2.642 | 52.684 | −4.716 | 3.43% | 0.008 |
| | PDMI | 762 | 1.65 | 2.93 | 2.644 | 0.113 | 0.004 | 2.65 | 36.35 | −4.484 | 4.26% | 0.013 |

NOTE: the suffix O denotes the original data, and R, S, A, M, and MI denote the data after RF, SVR, ANN, mean, and MI imputation, respectively (e.g., DUO is the original DU data and DUR is DU after RF imputation).
Table 6. Nonparametric test results.

| Feature | Raw Data: Median (Q25, Q75) | Imputation (n = 762) | Median (Q25, Q75) | Mann–Whitney U | Mann–Whitney z | p |
|---|---|---|---|---|---|---|
| DU | n = 659: 30.000 (5.0, 61.0) | RF | 30.000 (15.0, 60.0) | 250,299.5 | −0.101 | 0.919 |
| | | SVR | 33.205 (15.0, 60.0) | 247,602.5 | −0.453 | 0.651 |
| | | ANN | 35.000 (15.0, 60.0) | 245,348.5 | −0.746 | 0.456 |
| | | Mean | 40.000 (15.0, 60.0) | 243,611.5 | −0.973 | 0.331 |
| | | MI | 33.000 (12.0, 61.0) | 245,028 | −0.788 | 0.431 |
| DL | n = 659: 56.000 (30.0, 95.0) | RF | 57.210 (30.0, 91.0) | 250,327.5 | −0.097 | 0.922 |
| | | SVR | 55.284 (30.0, 91.0) | 249,248.5 | −0.237 | 0.812 |
| | | ANN | 60.000 (30.0, 93.0) | 246,019.5 | −0.656 | 0.512 |
| | | Mean | 65.000 (30.0, 91.0) | 243,199.5 | −1.022 | 0.307 |
| | | MI | 59.900 (30.0, 92.0) | 247,989 | −0.401 | 0.689 |
| pH | n = 279: 6.700 (4.9, 7.5) | RF | 6.441 (5.7, 7.0) | 100,975.5 | −1.239 | 0.215 |
| | | SVR | 6.744 (5.8, 7.0) | 104,703.5 | −0.371 | 0.71 |
| | | ANN | 6.334 (5.4, 6.9) | 97,681.5 | −2.006 | 0.045 * |
| | | Mean | 6.259 (6.3, 6.3) | 96,880.5 | −2.311 | 0.021 * |
| | | MI | 6.360 (5.7, 7.0) | 101,044.5 | −1.223 | 0.221 |
| SHC | n = 425: 95.900 (20.8, 459.4) | RF | 229.387 (60.4, 512.5) | 134,844.5 | −4.783 | 0.000 ** |
| | | SVR | 96.737 (61.8, 144.3) | 161,420.5 | −0.089 | 0.929 |
| | | ANN | 135.307 (22.7, 455.4) | 156,553.5 | −0.949 | 0.343 |
| | | Mean | 613.559 (69.6, 613.6) | 116,598.5 | −8.098 | 0.000 ** |
| | | MI | 207.092 (53.3, 518.5) | 135,141.5 | −4.73 | 0.000 ** |
| OMC | n = 367: 0.940 (0.3, 2.5) | RF | 0.958 (0.5, 1.9) | 137,024 | −0.546 | 0.585 |
| | | SVR | 0.920 (0.4, 1.5) | 132,498.5 | −1.428 | 0.153 |
| | | ANN | 1.128 (0.5, 2.3) | 126,483.5 | −2.6 | 0.009 ** |
| | | Mean | 2.942 (1.0, 2.9) | 97,759.5 | −8.379 | 0.000 ** |
| | | MI | 1.116 (0.6, 2.1) | 127,130.5 | −2.474 | 0.013 * |
| PO | n = 366: 0.456 (0.4, 0.5) | RF | 0.442 (0.4, 0.5) | 128,268 | −2.182 | 0.029 * |
| | | SVR | 0.458 (0.4, 0.5) | 137,856 | −0.31 | 0.756 |
| | | ANN | 0.448 (0.4, 0.5) | 139,339 | −0.021 | 0.983 |
| | | Mean | 0.469 (0.5, 0.5) | 128,358 | −2.213 | 0.027 * |
| | | MI | 0.444 (0.4, 0.5) | 128,777.5 | −2.083 | 0.037 * |
| PD | n = 395: 2.650 (2.6, 2.7) | RF | 2.652 (2.6, 2.7) | 140,712 | −1.817 | 0.069 |
| | | SVR | 2.648 (2.6, 2.7) | 148,596.5 | −0.353 | 0.724 |
| | | ANN | 2.650 (2.6, 2.7) | 149,471.5 | −0.19 | 0.849 |
| | | Mean | 2.642 (2.6, 2.6) | 148,109.5 | −0.45 | 0.653 |
| | | MI | 2.650 (2.6, 2.7) | 144,380 | −1.136 | 0.256 |

* p < 0.05 ** p < 0.01.
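Each row of Table 6 is a two-sample Mann–Whitney U test of the raw (non-missing) values against one imputed column. A minimal sketch with SciPy, assuming `raw` and `imputed` are 1-D arrays of one soil property; note that SciPy returns only U and p, so the z statistic reported above would come from the normal approximation used by SPSS-style output rather than directly from this call:

```python
from scipy.stats import mannwhitneyu

# raw: observed values of one property (e.g., n = 659 for DU);
# imputed: the same property after imputation (n = 762).
u_stat, p_value = mannwhitneyu(raw, imputed, alternative="two-sided")
print(f"U = {u_stat:,.1f}, p = {p_value:.3f}")
```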
Table 7. Multilinear regression analysis results.

| Imputation Method | Independent Variable | B | Standard Error | t | p | VIF | R² | Adj. R² | F |
|---|---|---|---|---|---|---|---|---|---|
| RF | Constant | 1.483 | 0.11 | 13.426 | 0.000 ** | - | 0.910 | 0.909 | F(6,755) = 1273.712, p = 0.000 |
| | DU | 0 | 0 | 2.579 | 0.010 * | 1.055 | | | |
| | pH | −0.001 | 0.003 | −0.193 | 0.847 | 1.052 | | | |
| | SHC | 0 | 0 | −0.325 | 0.745 | 1.035 | | | |
| | OMC | 0.002 | 0.001 | 2.849 | 0.005 ** | 2.899 | | | |
| | PO | −2.481 | 0.036 | −68.031 | 0.000 ** | 1.472 | | | |
| | PD | 0.41 | 0.041 | 10.106 | 0.000 ** | 2.463 | | | |
| SVR | Constant | 1.836 | 0.223 | 8.234 | 0.000 ** | - | 0.639 | 0.636 | F(6,755) = 223.020, p = 0.000 |
| | DU | 0 | 0 | −1.325 | 0.185 | 1.121 | | | |
| | pH | 0.009 | 0.005 | 1.811 | 0.07 | 1.035 | | | |
| | SHC | 0 | 0 | 0.659 | 0.51 | 1.064 | | | |
| | OMC | −0.003 | 0.001 | −2.62 | 0.009 ** | 2.445 | | | |
| | PO | −2.322 | 0.085 | −27.477 | 0.000 ** | 1.409 | | | |
| | PD | 0.242 | 0.083 | 2.933 | 0.003 ** | 2.193 | | | |
| ANN | Constant | 1.454 | 0.119 | 12.245 | 0.000 ** | - | 0.885 | 0.884 | F(6,755) = 968.854, p = 0.000 |
| | DU | 0 | 0 | −3.34 | 0.001 ** | 1.128 | | | |
| | pH | 0.011 | 0.003 | 3.918 | 0.000 ** | 1.148 | | | |
| | SHC | 0 | 0 | −2.396 | 0.017 * | 1.057 | | | |
| | OMC | 0.003 | 0.001 | 4.78 | 0.000 ** | 3.123 | | | |
| | PO | −2.501 | 0.042 | −58.998 | 0.000 ** | 1.554 | | | |
| | PD | 0.416 | 0.045 | 9.319 | 0.000 ** | 2.79 | | | |
| Mean | Constant | 2.633 | 0.215 | 12.226 | 0.000 ** | - | 0.594 | 0.591 | F(6,755) = 183.955, p = 0.000 |
| | DU | 0 | 0 | 3.077 | 0.002 ** | 1.046 | | | |
| | pH | 0.016 | 0.006 | 2.486 | 0.013 * | 1.014 | | | |
| | SHC | 0 | 0 | −0.462 | 0.644 | 1.042 | | | |
| | OMC | −0.012 | 0.001 | −11.496 | 0.000 ** | 1.376 | | | |
| | PO | −2.161 | 0.088 | −24.682 | 0.000 ** | 1.159 | | | |
| | PD | −0.098 | 0.076 | −1.293 | 0.197 | 1.382 | | | |
| MI | Constant | 1.507 | 0.11 | 13.719 | 0.000 ** | - | 0.902 | 0.901 | F(6,755) = 1154.513, p = 0.000 |
| | DU | 0 | 0 | 2.252 | 0.025 * | 1.082 | | | |
| | pH | 0.001 | 0.003 | 0.392 | 0.695 | 1.079 | | | |
| | SHC | 0 | 0 | −0.691 | 0.49 | 1.044 | | | |
| | OMC | 0.003 | 0.001 | 4.179 | 0.000 ** | 2.772 | | | |
| | PO | −2.491 | 0.036 | −69.058 | 0.000 ** | 1.325 | | | |
| | PD | 0.399 | 0.042 | 9.59 | 0.000 ** | 2.66 | | | |
| Zero | Constant | 1.48 | 0.013 | 112.448 | 0.000 ** | - | 0.379 | 0.374 | F(6,755) = 76.715, p = 0.000 |
| | DU | 0.001 | 0 | 4.185 | 0.000 ** | 1.056 | | | |
| | pH | −0.001 | 0.002 | −0.424 | 0.672 | 1.07 | | | |
| | SHC | 0 | 0 | 4.029 | 0.000 ** | 1.021 | | | |
| | OMC | −0.016 | 0.001 | −14.524 | 0.000 ** | 1.063 | | | |
| | PO | −0.493 | 0.049 | −10.128 | 0.000 ** | 2.74 | | | |
| | PD | 0.048 | 0.009 | 5.331 | 0.000 ** | 2.738 | | | |
| Discarding missing values | Constant | 1.069 | 0.04 | 26.925 | 0.000 ** | - | 0.998 | 0.997 | F(6,102) = 6932.797, p = 0.000 |
| | DU | 0 | 0 | 2.027 | 0.045 * | 1.341 | | | |
| | pH | −0.002 | 0.001 | −2.084 | 0.040 * | 1.076 | | | |
| | SHC | 0 | 0 | −2.762 | 0.007 ** | 2.747 | | | |
| | OMC | 0.002 | 0.001 | 3.151 | 0.002 ** | 3.111 | | | |
| | PO | −2.551 | 0.019 | −133.224 | 0.000 ** | 2.004 | | | |
| | PD | 0.581 | 0.015 | 40.031 | 0.000 ** | 1.665 | | | |

* p < 0.05 ** p < 0.01. Dependent variable: BD.
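The coefficients, t statistics, R², and VIFs in Table 7 can be reproduced with an ordinary least squares fit of BD on the six predictors. A minimal sketch with statsmodels, assuming one imputed dataset is held in a pandas DataFrame `df` with the column names used in this paper:

```python
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X = sm.add_constant(df[["DU", "pH", "SHC", "OMC", "PO", "PD"]])  # "Constant" row
model = sm.OLS(df["BD"], X).fit()
print(model.summary())   # B, standard errors, t, p, R², adj. R², and F

# VIF for each predictor (column 0 is the constant, so start at 1)
for i, name in enumerate(X.columns[1:], start=1):
    print(name, variance_inflation_factor(X.values, i))
```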
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
