Article

The Effect of Preprocessing Techniques, Applied to Numeric Features, on Classification Algorithms’ Performance

by Esra’a Alshdaifat *, Doa’a Alshdaifat, Ayoub Alsarhan, Fairouz Hussein and Subhieh Moh’d Faraj S. El-Salhi
Department of Computer Information System, Faculty of Prince Al-Hussein Bin Abdallah II For Information Technology, The Hashemite University, P.O. Box 330127, Zarqa 13133, Jordan
*
Author to whom correspondence should be addressed.
Submission received: 9 December 2020 / Revised: 1 January 2021 / Accepted: 15 January 2021 / Published: 21 January 2021

Abstract:
It is recognized that the performance of any prediction model depends on several factors, one of the most significant being the adopted preprocessing techniques. In other words, preprocessing is an essential step in generating an effective and efficient classification model. This paper investigates the impact of the most widely used preprocessing techniques, with respect to numerical features, on the performance of classification algorithms. The effect of combining various normalization techniques and missing-value handling strategies is assessed on eighteen benchmark datasets using two well-known classification algorithms, adopting several performance evaluation metrics and statistical significance tests. According to the reported experimental results, the impact of the adopted preprocessing techniques varies from one classification algorithm to another. In addition, a statistically significant difference between the considered data preprocessing techniques is demonstrated.

1. Introduction

Data are not always “clean”; the presence of redundant, inconsistent, noisy, and/or missing data in a dataset indicates that the data need to be handled before applying any machine learning algorithm. Data preprocessing is concerned with solving such issues; data normalization, discretization, and transformation are also data preprocessing tasks. Thus, data preprocessing is a significant step in Knowledge Discovery in Databases (KDD). More specifically, the performance of machine learning algorithms is strongly influenced by the adopted preprocessing techniques [1]. Some researchers argue that the choice of a particular data preprocessing technique depends primarily on the considered dataset [2], while others claim that the selection should be based on experiments [3].
With respect to numerical features, data normalization and handling missing values are the main preprocessing issues, especially when the adopted classification algorithm was originally designed to handle numerical features. Normalization matters for classification performance because features with “small-range” values are otherwise dominated by features with “large-range” values; consequently, the small-range features have no influence on the classification process [4,5]. Results from previous research showed that feature normalization has a significant impact on classification accuracy [2,3,4,6,7,8]. Regarding missing values, “bad” treatment of missing data degrades classification accuracy, especially when the considered dataset contains a high rate of missing values [9,10,11]. Therefore, handling missing values carefully during preprocessing is a necessary step toward obtaining a high-performance classification model. Much research has studied the effect of various normalization techniques or missing-value handling strategies on classification performance separately; however, few works have evaluated the impact of combining normalization and missing-value handling techniques. In addition, less attention has been given to the effect of different treatments of missing values or normalization techniques on classification efficiency.
The main motivation for the work presented in this paper is the desire to supply machine learning researchers and users with recommendations regarding the preprocessing techniques to be adopted in order to obtain high performance classification models. Thus, this paper investigates the impact of combining several preprocessing techniques, related to normalization and dealing with missing values, on the performance of classification algorithms.
In this research, three well-known normalization techniques are evaluated: (i) min-max normalization, (ii) Z-score normalization, and (iii) decimal scaling normalization. With respect to handling missing values in numeric dimensions, three well-known strategies are evaluated: (i) discarding instances that include missing values, (ii) replacing missing values with the feature mean, and (iii) using the k-Nearest Neighbor (kNN) algorithm to replace missing values. Two alternative classification algorithms are considered to generate the prediction models after applying the preprocessing techniques: (i) Support Vector Machines (SVMs) and (ii) Artificial Neural Networks (ANNs). As a result, nine variations of preprocessing combinations are evaluated for each classification algorithm. It is worth noting here that some classification algorithms, for example kNN and SVM, were originally designed to handle numerical data, although they can be adapted to handle categorical data. Other classification algorithms, such as decision tree, naive Bayes, and rule-based classifiers, were originally designed to handle categorical data, although they can be adapted to handle numerical data. In the research presented in this paper, classification algorithms that were originally designed to handle numerical data are considered. These algorithms are expected to be affected by the normalization process because: (i) they were originally designed to handle numerical features (the nature of the algorithm) and (ii) they apply calculations, such as distance computation, to numerical features.
In order to determine if one technique significantly outperforms another (others), the Friedman statistical test [12] and the Nemenyi post-hoc test [13] have been applied. From the foregoing, the objectives of the work presented in this paper can be summed up as follows:
  • We evaluate the effect of combining several preprocessing techniques, applied to numerical features, on the performance of classification algorithms.
  • We find the optimal combination of preprocessing techniques, with respect to the numerical values, that results in more accurate classification.
The above-mentioned objectives can be articulated as the following overarching question:
“What are the most convenient techniques that can be adopted to produce high performance classification models in terms of classification effectiveness and efficiency?”
The remainder of this paper is organized as follows: Section 2 provides the required background to the work described in this paper and discusses the previous work that studied the effect of preprocessing techniques on the performance of classification models. Section 3 describes the datasets that have been used to evaluate the considered preprocessing techniques. Section 4 presents the adopted experimental methodology. Section 5 presents and discusses the obtained results. Finally, Section 6 concludes the discussion and provides directions for further work.

2. Related Work

This section provides a review of preprocessing techniques, normalization, and handling missing values with respect to numerical attributes. In addition, the section presents a summary of related work on the effect of preprocessing techniques on the performance of classification algorithms. The section is organized as follows: Section 2.1 provides an overview of data normalization techniques, while Section 2.2 presents an overview of the missing values problem and the most common ways to deal with it. A summary of the previous related work on the effect of preprocessing techniques on the performance of classification algorithms is presented in Section 2.3.

2.1. Normalization

Data normalization is a preprocessing technique applied to numerical features before applying classification or clustering algorithms that are mainly designed to handle numerical features. Normalization is important because it prevents some features from concealing the effect of others, particularly when features have widely varying ranges. At the same time, selecting the normalization technique and normalization range (interval) is a significant step during the preprocessing stage, because of the “change” it induces in the considered data and, consequently, in the results of the machine learning algorithm applied after preprocessing [3]. The most widely used data normalization techniques are [5]:
  • Min-max normalization: This is one of the most common techniques to normalize data, in which values for the considered feature are transformed to new smaller ones within a predefined interval, usually [0–1] is adopted [5]. It is recognized that min-max normalization maintains all the relationships in the considered data [6]. Each value in the considered feature is mapped to a new normalized value according to the following equation [5]:
    $v' = \frac{v - min_A}{max_A - min_A}\,(new\_max_A - new\_min_A) + new\_min_A$
    where $v'$ is the new normalized value, $v$ is the original value for the given feature $A$, $max_A$ and $min_A$ are the maximum and minimum values for feature $A$, while $new\_max_A$ and $new\_min_A$ represent the maximum and minimum values of the new considered range.
  • Z-score normalization: This is a statistical normalization technique that handles the outlier issue [5]. The mean and standard deviation for the considered feature are used to transform the feature values. More specifically, values for the considered feature are transformed into new normalized values by applying the following equation [5]:
    $v' = \frac{v - \mu}{\sigma}$
    where μ is the mean value of the designated feature and σ is the standard deviation of the considered feature. Applying the Z-score normalization technique, values below the mean appear as negative numbers, values above the mean as positive numbers, while values that are exactly equal to the mean are mapped to zero.
  • Decimal scaling normalization: This technique normalizes the designated feature by moving the decimal point of the feature values, where the maximum absolute value of the considered feature determines how far the decimal point is moved. Each value in the designated feature is mapped to a new normalized value according to the following equation [5]:
    $v' = \frac{v}{10^j}$
    where $j$ is the smallest integer such that $\max(|v'|) < 1$.
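The three normalization techniques above can be sketched in a few lines of Python. This is an illustrative sketch only (the experiments in this paper were run in Weka), and the function names are our own:

```python
import numpy as np

def min_max(v, new_min=0.0, new_max=1.0):
    """Min-max normalization: map values into [new_min, new_max]."""
    return (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

def z_score(v):
    """Z-score normalization: subtract the mean, divide by the standard deviation."""
    return (v - v.mean()) / v.std()

def decimal_scaling(v):
    """Decimal scaling: divide by 10^j, with j the smallest integer
    that makes the largest absolute normalized value fall below 1."""
    j = int(np.floor(np.log10(np.abs(v).max()))) + 1
    return v / 10 ** j
```

For example, `min_max` maps a feature with values [10, 20, 30] to [0, 0.5, 1], while `z_score` produces values with zero mean, negative below the mean and positive above it, exactly as described above.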

2.2. Handling Missing Values

Missing data are recognized as one of the significant issues that should be handled carefully during the preprocessing stage, before applying machine learning algorithms, in order to obtain effective machine learning models. In practice, a dataset may contain missing data for several reasons, such as human errors, equipment faults, data unavailability (some people refuse to provide values for specific features), and data not being up-to-date or being inconsistent with other existing data (and consequently removed). In addition, the detection of a data anomaly can be a source of missing values, where the anomalous values are deleted and replaced with new values using a repairing mechanism [14]. The rates of missing values can be categorized as follows [15]:
  • “Trivial”, where 1% of the data are missing.
  • “Manageable”, where 1–5% of the data are missing.
  • “Sophisticated”, where 5–15% of the data are missing; therefore sophisticated methods are required to handle this.
  • “Severe”, more than 15% of the data are missing; thus, the serious influence of any applied technique would be noted.
In this paper, all of these rate categories are represented and taken into consideration in the experiments.
According to the literature, the most widely used strategies to deal with missing values are [5,15,16,17]:
  • Deleting an instance: Instances with missing values for at least one feature are deleted (ignored); this technique is the default option for dealing with missing values in many statistical packages [18,19].
  • Filling manually: The missing values are filled in manually; this approach is inefficient and infeasible, especially for datasets with a high rate of missing values.
  • Replacing with a global constant: A global value is utilized to fill in the missing values such as “unknown”.
  • Replacing with the mean: The mean for a specific feature is used to fill in any missing values for that feature; this technique is also referred to as “maximum likelihood” [20]. Several variations of this technique are available. One variation utilizes the feature mean for all instances belonging to the same class label instead of the mean of all instances to fill in missing values.
  • Using a prediction model: Decision tree, regression, and Bayesian models can be adopted to predict the missing values. Recent studies have used deep neural networks to repair data because neural networks can handle natural data that include missing values effectively [21].
  • Adopting an imputation procedure: The missing values are estimated using a specific procedure, the most widely used being k-Nearest Neighbor (kNN). With the kNN procedure, missing values are imputed according to the most similar instances, where a distance measure (such as the Euclidean or Manhattan distance) is used to determine the most similar instances. Moreover, the repairing mechanisms adopted for handling anomalous data, which exploit observations of the same data features in nearby locations, can also be considered imputation procedures [14]. Note that the two preceding strategies (replacing with the mean and using a prediction model) can likewise be regarded as imputation procedures.
  • Adopting a multiple imputation procedure: Multiple simulated variations of the considered dataset are produced and analyzed; the results are then combined to produce the final inference [16].
Among the previous strategies, discarding an instance, replacing with the mean, and kNN imputation are the most common ways of dealing with missing values and are also available in most data mining tools. Thus, these techniques are considered in the work presented in this paper.
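The three adopted strategies (deletion, mean imputation, and kNN imputation) can be sketched as follows. This is an illustrative Python sketch with hypothetical function names (the paper's experiments were run in Weka); the small kNN imputer here, for simplicity, draws neighbor candidates only from the complete rows:

```python
import numpy as np
import pandas as pd

def delete_instances(df):
    """Strategy (i): drop every row containing at least one missing value."""
    return df.dropna()

def mean_impute(df):
    """Strategy (ii): replace each missing value with its feature's mean."""
    return df.fillna(df.mean())

def knn_impute(df, k=3):
    """Strategy (iii): replace each missing value with the mean of that
    feature over the k nearest complete rows (Euclidean distance on the
    columns that are observed in the incomplete row)."""
    X = df.to_numpy(dtype=float)
    complete = X[~np.isnan(X).any(axis=1)]
    for i in range(len(X)):
        miss = np.isnan(X[i])
        if not miss.any():
            continue
        d = np.sqrt(((complete[:, ~miss] - X[i, ~miss]) ** 2).sum(axis=1))
        nearest = complete[np.argsort(d)[:k]]
        X[i, miss] = nearest[:, miss].mean(axis=0)
    return pd.DataFrame(X, columns=df.columns)
```

Note the trade-off the sketch makes visible: deletion shrinks the dataset, mean imputation ignores inter-feature relationships, while kNN imputation uses the observed columns of each incomplete row to pick plausible donors.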

2.3. Previous Work on the Impact of Preprocessing Techniques on the Performance of Classification Algorithms

This section discusses the previous work that studied the effect of preprocessing techniques on the performance of classification models. Many techniques have been proposed to normalize data and deal with missing values. Several experimental studies tried to find the “best” technique to be used before applying classification algorithms; thus, a better prediction can be obtained. According to the literature, the research on the effect of preprocessing techniques on the performance of classification algorithms can be summarized as follows:
  • Each research work evaluated different data preprocessing techniques. More specifically, some studies evaluated a number of normalization techniques, while others evaluated ways of dealing with missing data. Note that only one reference was found by the authors that evaluated the effect of both normalization and handling missing values on classification accuracy, using only one medical dataset [6]. In this research work, we focus on the most widely used preprocessing techniques with respect to numerical variables.
  • Most research works were focused primarily on a specific classification algorithm. More specifically, with respect to normalization, most experiments were conducted to evaluate the impact on the performance of Support Vector Machine (SVM) or Artificial Neural Networks (ANNs) such as the work presented in [3,6,22,23]. On the other hand, with respect to handling missing values, experiments were conducted using rule-based, Decision Tree (DT), or kNN classifiers such as the work presented in [15,16,24].
  • With respect to normalization, the evaluation in most research works was conducted using a specific dataset such as a hyperspectral dataset [4,22], a medical dataset [3,6,7,25], or a direct marketing dataset [2]. Only a few researchers have studied the effect of normalization on classification performance using several general datasets such as the experimental study presented in [23]. On the other hand, related work on handling missing values can be categorized into three categories according to the utilized datasets: (i) research work that utilized datasets with missing values in their original form [26,27], (ii) research work that utilized datasets with no missing values in their original form (missing values are generated artificially) [28], and (iii) research work that utilized datasets with and without missing values in their original form [15].
  • With respect to handling missing values, as noted earlier, instrument failure is one of the main reasons for missing values in datasets. Sensors are among the instruments subject to failure for several reasons, including environmental factors. Recently, several researchers have directed their work toward handling missing or corrupted data resulting from sensor failures [29,30]. The field of renewable energy forecasting [31,32] is an example of this case, where the data are collected by geographically distributed sensors [33]. In order to handle the missing values in such datasets, some researchers replaced them with the mean of the same attribute observed for the same month of the same year at the same hour [33]. Moreover, linear interpolation, mode imputation, k-nearest neighbors, and Multivariate Imputation by Chained Equations (MICE) have also been used to solve the missing values problem in renewable energy forecasting [29].
  • Most research works used evaluation measures that evaluate the accuracy of the classifiers (such as the error rate and accuracy), while efficiency measures were not taken into consideration (such as model generation time or prediction time).
  • Most research works did not consider statistical tests to rigorously compare the performance of different preprocessing techniques.
In the context of the work described in this paper, several combinations of normalization techniques and missing-value handling strategies are investigated using two well-known classification algorithms and eighteen benchmark datasets that come from different disciplines and feature various characteristics. Additionally, different evaluation measures and statistical tests are adopted during the evaluation process.

3. Evaluation Datasets

This section describes the main characteristics of the evaluation datasets. Eighteen datasets from different disciplines, with various numbers of instances, class labels, and features, were taken from the University of California Irvine (UCI) machine learning repository [34]. Table 1 presents the main characteristics of the evaluation datasets. Recall that the research presented in this paper is concerned with the effect of different preprocessing techniques on classification performance with respect to numerical features; each of the considered datasets therefore includes at least one numerical feature. In addition, to study precisely the effect of diverse treatments of missing values on classification performance, nine of the considered datasets contain missing values (“original missing values”), while the remaining nine do not. The objective behind choosing datasets with no missing values is to artificially generate various rates of missing values; thus, a deeper and more comprehensive investigation can be achieved.

4. The Adopted Experimental Methodology

This section presents the adopted experimental methodology, which Figure 1 summarizes. As shown in Figure 1, the generation of classification models commences with acquiring a dataset. Recall that eighteen benchmark datasets from various disciplines are considered. As noted in the previous section, the evaluation datasets fall into two categories according to the inclusion of missing values: original missing values and no missing values. The first step in the adopted preprocessing strategy is to artificially introduce missing values into the datasets that do not feature them. Two rates are adopted to generate missing values: 10% and 20%, corresponding to the “sophisticated” and “severe” categories, respectively (see Section 2.2).
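The injection step can be sketched as a small routine that blanks a given fraction of the cells in a numeric matrix. This is a hypothetical sketch: the paper does not specify its exact injection procedure, so uniform random selection of cells is assumed here:

```python
import numpy as np

def inject_missing(X, rate, seed=0):
    """Set a fraction `rate` of the cells of a numeric matrix to NaN,
    choosing the cells uniformly at random (reproducible via `seed`)."""
    rng = np.random.default_rng(seed)
    X = X.astype(float).copy()          # leave the caller's matrix intact
    n_missing = int(round(rate * X.size))
    idx = rng.choice(X.size, size=n_missing, replace=False)
    X.flat[idx] = np.nan
    return X
```

Calling this with `rate=0.1` and `rate=0.2` would reproduce the two missing-value levels used in the experiments.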
Now, the dataset includes missing values and is ready to be treated using one of the missing values treatment strategies (deleting instances that include missing values, replacing missing values with the feature mean, or using the k-Nearest Neighbor (kNN) algorithm). The next step is the normalization process, where the given dataset is normalized using one of the normalization techniques (min-max, Z-score, or decimal scaling). Consequently, nine alternative data preprocessing combination techniques are applied to each dataset: (i) the Delete&MinMax combination technique, (ii) the Delete&Zscore combination technique, (iii) the Delete&Decimal combination technique, (iv) the Mean&MinMax combination technique, (v) the Mean&Zscore combination technique, (vi) the Mean&Decimal combination technique, (vii) the kNN&MinMax combination technique, (viii) the kNN&Zscore combination technique, and (ix) the kNN&Decimal combination technique.
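The nine combination techniques are simply the Cartesian product of the three missing-value treatments and the three normalization techniques, which can be enumerated directly (an illustrative sketch; the labels mirror the names used in this paper):

```python
from itertools import product

missing_treatments = ["Delete", "Mean", "kNN"]    # missing-value treatment strategies
normalizations = ["MinMax", "Zscore", "Decimal"]  # normalization techniques

# one preprocessing pipeline per (treatment, normalization) pair
combinations = [f"{m}&{n}" for m, n in product(missing_treatments, normalizations)]
```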
After that, the considered classification algorithms (SVM and ANN) are applied to each dataset variation in order to generate the desired classification model. The final step in the adopted methodology is the evaluation process in which the performances of the resulting classification models are compared. Concerning effectiveness evaluation, accuracy and Area Under the receiver operating Curve (AUC) [35] measures are considered. On the other hand, model construction time is adopted to evaluate the efficiency. In addition, a statistical significance test is applied to the obtained results to ensure a more precise comparison.

5. Experiments and Evaluation

The well-known Weka data mining tool [36] was used for data preprocessing and for generating the classification models. All experiments were executed on an Intel(R) Core(TM) i7-4600U CPU (2.70 GHz) with 8 GB of RAM, running Windows 7 Professional. Ten-fold Cross-Validation (TCV) was adopted to obtain reliable results. Although both average accuracy and average AUC results are included, the analysis is based on the average AUC, because the AUC is a more precise measure than accuracy for comparing machine learning algorithms [35,37].
As noted earlier, in total, nine data preprocessing combination techniques are considered for each dataset with respect to each classification algorithm. In the context of the dataset with “no missing” values, the nine different data preprocessing combination techniques are applied to: (i) datasets having 10% missing values generated artificially and (ii) datasets having 20% missing values generated artificially.
Thus, the obtained results are organized in the following subsections: Section 5.1 presents the results obtained using datasets that originally included missing values, with respect to the nine alternative data preprocessing combination techniques and the two classification algorithms (ANN and SVM). Section 5.2 presents the corresponding results for datasets with 10% artificially generated missing values, and Section 5.3 for datasets with 20% artificially generated missing values. Section 5.4 discusses the classification models’ efficiency based on model generation time.

5.1. Results Obtained from Datasets Having Missing Values Originally

We commence with the results obtained when using the ANN classification algorithm coupled with the nine alternative data preprocessing combination techniques. Table A1 presents the results in terms of the accuracy and AUC measures. As noted earlier, the discussion of the results is based on the AUC measure; thus, Figure 2 shows the results in terms of AUC. From the figure, it can be clearly observed that no single data preprocessing technique outperforms the others for all datasets. In addition, for most datasets the obtained results are close, except for the HCC survival dataset, where the delete strategy significantly degrades the classification accuracy regardless of the adopted normalization technique. The reasons behind this are: (i) the high missing values rate compared to the remaining eight datasets (see Table 1) and (ii) the distribution of missing values in the dataset.
The results obtained when using the SVM classification algorithm coupled with the nine alternative data preprocessing combination techniques are presented in Figure 3, and the detailed results are tabulated in Table A2. From the figure, we can observe the significant impact of the adopted preprocessing technique on the classification accuracy for some datasets, such as the Thyroid dataset, where the obtained AUC results range from 0.500 to 0.833, and the Hepatitis dataset, where the obtained AUC range was [0.500–0.772]. With respect to the HCC survival dataset, as with the ANN classifier, the delete strategy produced the worst AUC results regardless of the adopted normalization technique.
In order to achieve a more precise evaluation of the effect of the different preprocessing combination techniques on classification effectiveness, statistical tests were applied. Regarding the statistical comparison of the nine considered data preprocessing combination techniques coupled with the ANN classifier, the Friedman test was applied. Figure 4a shows the Friedman test results reported by SPSS. The Friedman test indicated no significant difference between the nine data preprocessing techniques ($\chi^2 = 9.826$, p = 0.277). With respect to comparing the nine data preprocessing combination techniques coupled with the SVM classifier, the Friedman test reported a significant difference between the nine data preprocessing techniques ($\chi^2 = 19.456$, p = 0.013), as shown in Figure 4b. Consequently, the Nemenyi post-hoc test was applied to determine which data preprocessing combination technique significantly outperformed the others. When applying the Nemenyi post-hoc test, two models are significantly different if the difference of their mean ranks is greater than or equal to the Critical Difference (CD) [13]. The CD is calculated according to the following equation [37]:
$CD = q_\alpha \sqrt{\frac{k(k+1)}{6N}}$
where $q_\alpha$ is the critical value for the confidence level $\alpha$, $k$ is the number of models, and $N$ is the number of datasets.
With respect to our comparison, k = 9, N = 9, and α = 0.05 were adopted. Thus, $CD = 3.102\sqrt{\frac{9(9+1)}{6 \times 9}} = 4.005$. Then, the difference between the mean ranks computed for each pair of models (preprocessing combinations) is compared with the critical difference. Because the difference between the highest and lowest mean ranks is less than the CD (6.33 − 3.06 = 3.27 < 4.005), the Nemenyi test did not detect any significant differences between the models.
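The CD value above can be reproduced with a few lines of Python ($q_\alpha$ = 3.102 is the critical value the paper adopts for nine models at α = 0.05):

```python
import math

def critical_difference(k, n, q_alpha):
    """Nemenyi critical difference for k models compared over n datasets."""
    return q_alpha * math.sqrt(k * (k + 1) / (6 * n))

cd = critical_difference(k=9, n=9, q_alpha=3.102)  # ≈ 4.005, as in the paper
```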

5.2. Results Obtained from Datasets Having 10% Artificially Generated Missing Values

This section presents the results obtained when using the ANN and SVM classification algorithms coupled with the nine alternative data preprocessing combination techniques for datasets with a 10% missing values rate (generated artificially). We commence with the results obtained when using the ANN classification algorithm, presented in Figure 5; Table A3 presents the detailed results. From the figure, it can be seen that Delete&Zscore produced the best AUC results for three datasets, while Delete&MinMax, Mean&MinMax, Mean&Zscore, Mean&Decimal, and kNN&MinMax each generated the best AUC for one dataset. For one dataset, the same AUC results were obtained regardless of the adopted data preprocessing combination technique. It is interesting to note here that SeismicBumps was highly affected by the adopted preprocessing combination technique, with an AUC range of [0.575–0.743]; the AUC value 0.575 was obtained when applying the Delete&Decimal preprocessing combination technique.
Figure 6 displays the results using the nine data preprocessing combination techniques coupled with the SVM classification algorithm in the context of a 10% missing values rate. From the figure, it can be noted that Delete&Zscore produced the best AUC results for most datasets. More specifically, Delete&Zscore produced the best AUC results for six datasets, while Delete&MinMax generated the best AUC for one dataset, and kNN&Zscore generated the best AUC for one dataset. For the remaining dataset (SeismicBumps), the same AUC results were obtained regardless of the adopted data preprocessing combination technique. The detailed results are presented in Table A4.
Regarding the statistical comparison of the nine considered data preprocessing combination techniques coupled with the ANN classifier, the Friedman test was applied. The test demonstrated a significant difference between the nine data preprocessing techniques ($\chi^2 = 26.900$, p = 0.001). As a result, the Nemenyi post-hoc test was applied to determine the data preprocessing combination techniques that significantly outperformed the others. Note that k = 9, N = 9, and α = 0.05; thus, CD ≈ 4.005 was adopted. Figure 7 presents a visual representation of the Nemenyi test, where the mean ranks of all considered methods are plotted (mean ranks were reported by the Friedman test using SPSS, where the highest mean rank was assigned to the best method). Models that are not significantly different are connected, and the best model is positioned on the right. Interestingly, the Nemenyi test indicated that Delete&Zscore, Mean&MinMax, Mean&Zscore, kNN&MinMax, and kNN&Zscore significantly outperformed Delete&Decimal. In other words, the statistical test result indicated that decimal scaling was the least effective normalization technique regardless of the coupled missing values treatment strategy, and that Delete&Decimal was the worst combination technique.
With respect to comparing the nine data preprocessing combination techniques coupled with the SVM classifier, the Friedman test reported a significant difference between the nine data preprocessing techniques ($\chi^2 = 50.979$, p < 0.001). Again, the Nemenyi post-hoc test was conducted to determine the data preprocessing combination techniques that significantly outperformed the others. Figure 8 presents the visual representation of the Nemenyi test. As shown in the figure, the Nemenyi post-hoc test indicated that: (i) Delete&Zscore, Mean&Zscore, and kNN&Zscore significantly outperformed Delete&Decimal and Mean&Decimal, and (ii) Delete&Zscore and kNN&Zscore significantly outperformed kNN&Decimal. Again, decimal scaling was the least effective normalization technique, while Z-score normalization was the most effective regardless of the coupled missing values treatment strategy.

5.3. Results Obtained from Datasets Having 20% Artificially Generated Missing Values

The results obtained when using the ANN classification algorithm coupled with the nine alternative data preprocessing combination techniques for datasets with 20% missing values are displayed in Figure 9. An interesting observation is that the delete strategy performed well even with 20% missing values compared to the other missing values treatment strategies. More specifically, Delete&MinMax produced the best AUC results for three datasets, and Delete&Zscore produced the best AUC results for two datasets. For the remaining datasets, Mean&MinMax, Mean&Decimal, kNN&MinMax, and kNN&Zscore each generated the best AUC for one dataset. The detailed results are presented in Table A5.
The results obtained when using the SVM classification algorithm coupled with the nine alternative data preprocessing combination techniques for datasets having 20% missing values are presented in Figure 10. As with the ANN classifier, the delete strategy performed well even with 20% missing values compared to the other missing values treatment strategies. In addition, the Z-score technique generated the best AUC results in most cases regardless of the adopted treatment for missing values. More specifically, Delete&Zscore produced the best AUC results for four datasets, and kNN&Zscore produced the best AUC results for three datasets. For the remaining two datasets, Delete&Decimal generated the best AUC for one dataset, and all techniques generated the same AUC result for one dataset (SeismicBumps). The detailed results are presented in Table A6.
Regarding the statistical comparison of the nine considered data preprocessing combination techniques coupled with the ANN classifier, the Friedman test was applied. The Friedman test demonstrated that there was a significant difference between the nine data preprocessing techniques (χ²(8) = 16.052, p = 0.042). Applying the Nemenyi post-hoc test, the only significant difference reported was between Mean&MinMax and Delete&Decimal, where Mean&MinMax significantly outperformed Delete&Decimal, as shown in Figure 11.
With respect to the statistical comparison of the nine considered data preprocessing combination techniques coupled with the SVM classifier, the Friedman test was applied. The Friedman test demonstrated that there was a significant difference between the nine data preprocessing techniques (χ²(8) = 42.669, p < 0.001). The Nemenyi post-hoc test reported that: (i) Delete&Zscore, Mean&Zscore, and kNN&Zscore significantly outperformed Delete&Decimal, and (ii) Delete&Zscore and kNN&Zscore significantly outperformed Mean&Decimal and kNN&Decimal, as shown in Figure 12.

5.4. Classification Models Efficiency

The previous Section 5.1, Section 5.2, and Section 5.3 presented a comparison of the effectiveness of the nine considered data preprocessing combination techniques. In order to achieve a comprehensive comparison, this subsection presents a comparison of their efficiency. Figure 13 shows the generation time results (in seconds) obtained when using the ANN classification algorithm coupled with the nine data preprocessing combination techniques, and Figure 14 shows the corresponding results for the SVM classification algorithm.
Commencing with the missing values treatment strategies, as expected, the lowest generation run times were obtained when using the delete strategy for handling missing values, which is particularly evident for datasets featuring high missing values rates, while the kNN technique for handling missing values generated the highest run times. Additionally, it is interesting to note that the effect of handling missing values on classification efficiency was more pronounced when the ANN classification algorithm was adopted to generate the classification models. With respect to the data normalization techniques, there was no significant difference in efficiency between the three considered normalization techniques.
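The efficiency gap between the three treatment strategies can be illustrated with a small synthetic benchmark (this sketch uses random data, not the paper's datasets): row deletion is a cheap boolean mask, mean imputation is a single pass over each column, and kNN imputation requires pairwise distance computations, which dominate its run time.

```python
import time
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

rng = np.random.default_rng(0)
X = rng.normal(size=(3000, 10))
X[rng.random(X.shape) < 0.2] = np.nan  # ~20% of cells missing

def timed(label, fn):
    t0 = time.perf_counter()
    out = fn()
    print(f"{label}: {time.perf_counter() - t0:.3f}s")
    return out

# Delete: drop every row that contains a missing value (cheapest).
X_delete = timed("delete", lambda: X[~np.isnan(X).any(axis=1)])
# Mean imputation: one pass to compute and substitute column means.
X_mean = timed("mean", lambda: SimpleImputer(strategy="mean").fit_transform(X))
# kNN imputation: pairwise (nan-aware) distances dominate the run time.
X_knn = timed("kNN", lambda: KNNImputer(n_neighbors=5).fit_transform(X))
```

On data of this size the ordering delete < mean < kNN is typically clear; the paper reports the same ordering for the full preprocessing-plus-training times.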

6. Conclusions and Future Work

Handling missing values and data normalization are considered important preprocessing activities prior to applying classification algorithms. In this paper, the effect of different combinations of data preprocessing techniques was investigated. Three well-known normalization techniques and three well-known strategies for handling missing values were considered, yielding nine alternative data preprocessing combination techniques: (i) Delete&MinMax, (ii) Delete&Z-score, (iii) Delete&Decimal, (iv) Mean&MinMax, (v) Mean&Z-score, (vi) Mean&Decimal, (vii) kNN&MinMax, (viii) kNN&Z-score, and (ix) kNN&Decimal. The classification models were generated using the ANN and SVM classification algorithms. Eighteen datasets were used to evaluate the nine data preprocessing combination techniques. The datasets were categorized into three categories according to the inclusion of missing values: (i) datasets having missing values originally, (ii) datasets having 10% artificially generated missing values, and (iii) datasets having 20% artificially generated missing values.
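The nine combinations enumerated above can be sketched with scikit-learn. This is an illustrative reconstruction, not the authors' code: mean and kNN imputation and the Min-Max and Z-score normalizations map directly onto SimpleImputer, KNNImputer, MinMaxScaler, and StandardScaler, while decimal scaling has no built-in transformer, so decimal_scale below is a hypothetical helper; the delete strategy drops rows before any pipeline runs, so it is noted in a comment rather than modeled as an imputer.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.preprocessing import FunctionTransformer, MinMaxScaler, StandardScaler

def decimal_scale(X):
    # Decimal scaling: divide each column by 10^j, with j chosen so that
    # the scaled magnitudes do not exceed 1.
    j = np.ceil(np.log10(np.abs(X).max(axis=0)))
    return X / 10.0 ** j

imputers = {"Mean": SimpleImputer(strategy="mean"),
            "kNN": KNNImputer(n_neighbors=5)}
scalers = {"MinMax": MinMaxScaler(),
           "Zscore": StandardScaler(),
           "Decimal": FunctionTransformer(decimal_scale)}

# Six impute-then-normalize pipelines; the Delete strategy has no imputer --
# rows containing missing values are simply dropped before fitting.
pipelines = {f"{i}&{s}": Pipeline([("impute", imp), ("scale", scl)])
             for i, imp in imputers.items() for s, scl in scalers.items()}

rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 4))
X[rng.random(X.shape) < 0.1] = np.nan  # ~10% missing cells
X_t = pipelines["kNN&Zscore"].fit_transform(X)
```

The ANN or SVM classifier would then be appended as a final pipeline step before cross-validation, keeping the imputer and scaler fitted on training folds only.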
From the reported evaluation, there was no noticeable difference between the considered data preprocessing combination techniques for most datasets that featured missing values originally. In other words, the adopted preprocessing techniques had no significant effect for most datasets having less than 10% missing values. Regarding datasets having 10% missing values, the adopted preprocessing techniques had a significant effect on the performance of the classification models: the statistical test results indicated that decimal normalization was the least effective normalization technique and Z-score normalization the most effective, regardless of the coupled missing values treatment strategy. Moreover, the worst combination technique was Delete&Decimal.
In the context of datasets having 20% missing values, unexpectedly, the delete strategy worked very well compared to the other considered missing values treatment strategies. Thus, the results showed that the delete strategy can be adopted for datasets featuring up to 20% missing values and can produce classification accuracy comparable to the mean and kNN strategies. In addition, as with the datasets having 10% missing values, decimal normalization was the least effective normalization technique, Z-score normalization tended to generate the best AUC results, and the worst preprocessing combination technique was Delete&Decimal.
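For reproducibility, missing values at a fixed rate can be injected into a complete dataset as follows. The paper does not specify its generation mechanism, so this sketch assumes missing-completely-at-random cell-wise masking; inject_missing is an illustrative helper, not from the paper.

```python
import numpy as np

def inject_missing(X, rate, seed=42):
    """Return a copy of X with roughly the given fraction of cells set to NaN
    (missing completely at random)."""
    rng = np.random.default_rng(seed)
    X = X.astype(float).copy()
    X[rng.random(X.shape) < rate] = np.nan
    return X

X = np.arange(10_000, dtype=float).reshape(1000, 10)
X20 = inject_missing(X, rate=0.20)
print(f"missing fraction: {np.isnan(X20).mean():.3f}")
```

Fixing the seed keeps the injected missingness pattern identical across the nine preprocessing combinations, so differences in AUC are attributable to the preprocessing rather than to the mask.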
Interestingly, the impact of the adopted preprocessing techniques varied from one classification algorithm to another. More specifically, the effect of the data preprocessing techniques was more noticeable when the SVM classifier was utilized to generate the classification models. Overall, for most scenarios, Delete&Decimal was the worst preprocessing combination technique that could be applied before generating the desired classification model.
As future work, the authors intend to investigate the impact of different preprocessing techniques on clustering algorithms. In addition, generating datasets with more than a 20% missing values rate will be considered in order to determine the best preprocessing techniques to be adopted for such datasets.

Author Contributions

Conceptualization, E.A.; methodology, E.A.; software, D.A.; validation, A.A.; formal analysis, E.A.; investigation, E.A. and A.A.; resources, F.H. and S.M.F.S.E.-S.; data curation, D.A., F.H., and S.M.F.S.E.-S.; writing, original draft preparation, E.A.; writing, review and editing, F.H.; visualization, D.A.; supervision, A.A.; project administration, E.A. All authors read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Acknowledgments

The authors are thankful to the Hashemite University for its endless support.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Table A1. Average accuracy and AUC values obtained using the ANN classification algorithm coupled with the nine data preprocessing techniques with respect to the datasets that featured missing values originally.
ANN Classification (each cell: Accuracy / AUC)

| Dataset | Delete&MinMax | Delete&Zscore | Delete&Decimal | Mean&MinMax | Mean&Zscore | Mean&Decimal | kNN&MinMax | kNN&Zscore | kNN&Decimal |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Automobile | 82.39 / 0.929 | 86.16 / 0.947 | 69.81 / 0.880 | 79.51 / 0.923 | 79.02 / 0.929 | 73.17 / 0.881 | 79.02 / 0.920 | 78.54 / 0.920 | 73.17 / 0.880 |
| Kidney | 99.37 / 1.000 | 100.00 / 1.000 | 99.36 / 1.000 | 98.25 / 0.999 | 97.50 / 0.999 | 97.50 / 0.991 | 98.75 / 1.000 | 98.50 / 0.999 | 97.00 / 0.985 |
| Credit | 85.15 / 0.902 | 83.46 / 0.908 | 85.60 / 0.901 | 83.33 / 0.903 | 83.04 / 0.902 | 83.91 / 0.899 | 83.91 / 0.904 | 82.61 / 0.899 | 83.33 / 0.901 |
| Cylinder | 72.20 / 0.726 | 73.29 / 0.762 | 72.56 / 0.740 | 75.37 / 0.823 | 75.93 / 0.831 | 71.30 / 0.769 | 75.74 / 0.814 | 77.04 / 0.841 | 75.56 / 0.803 |
| Dermatology | 97.49 / 0.997 | 97.49 / 0.997 | 96.93 / 0.997 | 97.27 / 0.997 | 97.54 / 0.997 | 97.54 / 0.998 | 97.27 / 0.997 | 97.54 / 0.997 | 97.54 / 0.998 |
| HCC survival | 25.00 / 0.188 | 25.00 / 0.125 | 25.00 / 0.188 | 72.12 / 0.766 | 72.73 / 0.774 | 75.76 / 0.782 | 72.73 / 0.765 | 74.55 / 0.784 | 73.94 / 0.796 |
| Hepatitis | 82.50 / 0.831 | 82.50 / 0.815 | 86.25 / 0.815 | 80.65 / 0.791 | 83.87 / 0.852 | 82.58 / 0.811 | 81.94 / 0.802 | 84.52 / 0.846 | 81.94 / 0.819 |
| Mammographic | 80.00 / 0.852 | 80.72 / 0.872 | 80.36 / 0.851 | 79.81 / 0.857 | 81.06 / 0.880 | 80.02 / 0.852 | 78.98 / 0.843 | 79.81 / 0.872 | 79.19 / 0.846 |
| Thyroid | 96.29 / 0.948 | 97.47 / 0.952 | 92.40 / 0.871 | 97.38 / 0.959 | 98.01 / 0.941 | 94.33 / 0.888 | 97.45 / 0.953 | 98.01 / 0.942 | 94.41 / 0.885 |
Table A2. Average accuracy and AUC values obtained using the SVM classification algorithm coupled with the nine data preprocessing techniques with respect to the datasets that featured missing values originally.
SVM Classification (each cell: Accuracy / AUC)

| Dataset | Delete&MinMax | Delete&Zscore | Delete&Decimal | Mean&MinMax | Mean&Zscore | Mean&Decimal | kNN&MinMax | kNN&Zscore | kNN&Decimal |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Automobile | 64.15 / 0.799 | 74.84 / 0.876 | 64.15 / 0.788 | 68.29 / 0.827 | 76.59 / 0.872 | 64.88 / 0.814 | 67.80 / 0.827 | 77.07 / 0.871 | 64.88 / 0.813 |
| Kidney | 99.37 / 0.988 | 100.00 / 1.000 | 99.36 / 0.988 | 98.75 / 0.990 | 98.50 / 0.987 | 93.25 / 0.946 | 98.00 / 0.984 | 98.75 / 0.990 | 93.25 / 0.946 |
| Credit | 86.22 / 0.868 | 86.06 / 0.867 | 86.22 / 0.868 | 84.93 / 0.856 | 85.07 / 0.858 | 84.93 / 0.856 | 85.22 / 0.860 | 85.22 / 0.860 | 85.22 / 0.860 |
| Cylinder | 71.48 / 0.673 | 74.01 / 0.706 | 66.07 / 0.617 | 75.19 / 0.737 | 73.89 / 0.724 | 69.63 / 0.678 | 75.37 / 0.734 | 73.89 / 0.721 | 68.52 / 0.666 |
| Dermatology | 97.77 / 0.993 | 96.09 / 0.988 | 96.93 / 0.990 | 97.54 / 0.993 | 96.72 / 0.990 | 96.99 / 0.990 | 97.54 / 0.993 | 96.72 / 0.990 | 96.99 / 0.990 |
| HCC survival | 37.50 / 0.375 | 25.00 / 0.250 | 25.00 / 0.250 | 73.94 / 0.719 | 73.94 / 0.722 | 72.73 / 0.670 | 74.55 / 0.727 | 73.33 / 0.715 | 72.12 / 0.662 |
| Hepatitis | 85.00 / 0.693 | 86.25 / 0.763 | 83.75 / 0.500 | 85.16 / 0.756 | 83.87 / 0.748 | 79.35 / 0.512 | 85.81 / 0.772 | 83.23 / 0.732 | 79.35 / 0.512 |
| Mammographic | 80.24 / 0.804 | 82.17 / 0.822 | 80.24 / 0.804 | 79.08 / 0.794 | 82.83 / 0.828 | 79.08 / 0.794 | 76.90 / 0.767 | 82.52 / 0.825 | 77.11 / 0.769 |
| Thyroid | 91.98 / 0.500 | 96.22 / 0.833 | 91.98 / 0.500 | 93.88 / 0.500 | 97.06 / 0.822 | 93.88 / 0.500 | 93.88 / 0.500 | 96.90 / 0.801 | 93.88 / 0.500 |
Table A3. Average accuracy and AUC values obtained using the ANN classification algorithm coupled with the nine data preprocessing techniques with respect to the datasets that feature 10% missing values.
ANN Classification (each cell: Accuracy / AUC)

| Dataset | Delete&MinMax | Delete&Zscore | Delete&Decimal | Mean&MinMax | Mean&Zscore | Mean&Decimal | kNN&MinMax | kNN&Zscore | kNN&Decimal |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Abalone | 26.69 / 0.762 | 25.10 / 0.730 | 25.20 / 0.728 | 25.54 / 0.748 | 25.64 / 0.733 | 25.14 / 0.730 | 26.09 / 0.744 | 25.16 / 0.732 | 25.13 / 0.725 |
| Ecoli | 82.10 / 0.940 | 77.78 / 0.925 | 76.54 / 0.908 | 84.23 / 0.947 | 79.46 / 0.940 | 78.27 / 0.937 | 83.93 / 0.946 | 79.17 / 0.937 | 78.27 / 0.935 |
| Glass | 63.10 / 0.791 | 70.24 / 0.854 | 55.95 / 0.769 | 59.81 / 0.808 | 63.08 / 0.804 | 55.14 / 0.776 | 61.22 / 0.816 | 62.15 / 0.820 | 54.21 / 0.778 |
| Iris | 92.55 / 0.993 | 93.62 / 0.993 | 91.49 / 0.964 | 90.67 / 0.975 | 93.33 / 0.982 | 86.67 / 0.950 | 86.67 / 0.931 | 92.67 / 0.979 | 89.33 / 0.974 |
| ShopperIntention | 84.29 / 0.794 | 86.77 / 0.859 | 82.13 / 0.723 | 85.49 / 0.816 | 86.85 / 0.864 | 84.28 / 0.785 | 84.89 / 0.815 | 85.36 / 0.849 | 84.01 / 0.773 |
| PageBlocks | 96.54 / 0.959 | 96.32 / 0.966 | 93.61 / 0.874 | 95.45 / 0.950 | 95.78 / 0.960 | 94.66 / 0.940 | 94.81 / 0.939 | 95.38 / 0.950 | 94.13 / 0.913 |
| PenDigits | 93.42 / 0.963 | 95.33 / 0.982 | 92.37 / 0.959 | 89.92 / 0.958 | 89.74 / 0.957 | 90.57 / 0.958 | 91.64 / 0.964 | 91.23 / 0.965 | 92.29 / 0.963 |
| SeismicBumps | 92.11 / 0.614 | 90.79 / 0.617 | 93.42 / 0.575 | 92.92 / 0.715 | 90.83 / 0.711 | 93.00 / 0.743 | 93.03 / 0.726 | 91.87 / 0.697 | 93.11 / 0.737 |
| Vehicle | 72.13 / 0.863 | 72.13 / 0.881 | 51.64 / 0.753 | 71.63 / 0.895 | 70.09 / 0.873 | 62.29 / 0.856 | 73.29 / 0.911 | 70.92 / 0.879 | 63.71 / 0.861 |
Table A4. Average accuracy and AUC values obtained using the SVM classification algorithm coupled with the nine data preprocessing techniques with respect to the datasets that feature 10% missing values.
SVM Classification (each cell: Accuracy / AUC)

| Dataset | Delete&MinMax | Delete&Zscore | Delete&Decimal | Mean&MinMax | Mean&Zscore | Mean&Decimal | kNN&MinMax | kNN&Zscore | kNN&Decimal |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Abalone | 23.21 / 0.705 | 25.51 / 0.740 | 22.85 / 0.683 | 24.01 / 0.713 | 25.66 / 0.726 | 22.50 / 0.694 | 23.38 / 0.709 | 24.22 / 0.716 | 22.69 / 0.696 |
| Ecoli | 75.93 / 0.918 | 84.57 / 0.935 | 62.96 / 0.828 | 79.46 / 0.920 | 82.74 / 0.933 | 73.21 / 0.876 | 79.76 / 0.922 | 83.33 / 0.934 | 74.11 / 0.885 |
| Glass | 61.90 / 0.780 | 67.86 / 0.816 | 40.48 / 0.586 | 56.07 / 0.738 | 59.81 / 0.786 | 43.93 / 0.660 | 56.07 / 0.738 | 61.22 / 0.790 | 43.93 / 0.660 |
| Iris | 95.74 / 0.977 | 95.74 / 0.970 | 92.55 / 0.960 | 88.67 / 0.936 | 92.00 / 0.951 | 86.00 / 0.911 | 88.67 / 0.936 | 93.33 / 0.957 | 85.33 / 0.931 |
| ShopperIntention | 87.44 / 0.630 | 88.29 / 0.682 | 84.76 / 0.515 | 87.70 / 0.635 | 88.05 / 0.657 | 85.90 / 0.554 | 87.40 / 0.621 | 87.62 / 0.640 | 85.48 / 0.538 |
| PageBlocks | 93.34 / 0.663 | 96.00 / 0.859 | 91.02 / 0.509 | 92.38 / 0.668 | 95.29 / 0.827 | 90.90 / 0.560 | 92.67 / 0.708 | 94.88 / 0.815 | 90.55 / 0.540 |
| PenDigits | 97.00 / 0.995 | 97.71 / 0.996 | 79.22 / 0.940 | 92.53 / 0.988 | 92.26 / 0.988 | 84.69 / 0.966 | 94.92 / 0.992 | 95.06 / 0.993 | 85.52 / 0.970 |
| SeismicBumps | 94.30 / 0.500 | 94.30 / 0.500 | 94.30 / 0.500 | 93.42 / 0.500 | 93.42 / 0.500 | 93.42 / 0.500 | 93.42 / 0.500 | 93.42 / 0.500 | 93.42 / 0.500 |
| Vehicle | 63.11 / 0.777 | 72.95 / 0.852 | 28.69 / 0.490 | 67.26 / 0.826 | 70.80 / 0.847 | 43.74 / 0.698 | 69.86 / 0.814 | 73.17 / 0.861 | 42.55 / 0.684 |
Table A5. Average accuracy and AUC values obtained using the ANN classification algorithm coupled with the nine data preprocessing techniques with respect to the datasets that feature 20% missing values.
ANN Classification (each cell: Accuracy / AUC)

| Dataset | Delete&MinMax | Delete&Zscore | Delete&Decimal | Mean&MinMax | Mean&Zscore | Mean&Decimal | kNN&MinMax | kNN&Zscore | kNN&Decimal |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Abalone | 26.34 / 0.751 | 25.06 / 0.712 | 25.17 / 0.714 | 24.30 / 0.737 | 24.01 / 0.724 | 25.23 / 0.718 | 25.08 / 0.726 | 24.92 / 0.717 | 23.58 / 0.712 |
| Ecoli | 78.67 / 0.908 | 69.33 / 0.880 | 78.67 / 0.895 | 81.25 / 0.936 | 76.79 / 0.926 | 75.30 / 0.918 | 80.65 / 0.934 | 78.27 / 0.922 | 75.89 / 0.915 |
| Glass | 59.26 / 0.739 | 51.85 / 0.700 | 40.74 / 0.570 | 59.35 / 0.784 | 60.75 / 0.810 | 52.34 / 0.748 | 59.81 / 0.771 | 66.36 / 0.835 | 54.67 / 0.752 |
| Iris | 94.34 / 0.995 | 94.34 / 0.994 | 92.45 / 0.990 | 92.00 / 0.980 | 90.00 / 0.978 | 84.67 / 0.952 | 93.33 / 0.975 | 92.67 / 0.964 | 84.00 / 0.954 |
| ShopperIntention | 84.87 / 0.786 | 86.66 / 0.852 | 80.18 / 0.692 | 84.61 / 0.791 | 86.24 / 0.850 | 82.90 / 0.751 | 84.04 / 0.785 | 84.87 / 0.830 | 83.42 / 0.754 |
| PageBlocks | 95.19 / 0.995 | 94.69 / 0.969 | 92.37 / 0.771 | 94.28 / 0.929 | 95.05 / 0.939 | 93.88 / 0.919 | 94.10 / 0.920 | 94.72 / 0.935 | 93.90 / 0.906 |
| PenDigits | 88.32 / 0.957 | 90.42 / 0.978 | 84.73 / 0.926 | 86.91 / 0.952 | 85.55 / 0.949 | 87.47 / 0.955 | 88.33 / 0.957 | 88.05 / 0.956 | 89.15 / 0.958 |
| SeismicBumps | 88.68 / 0.571 | 88.68 / 0.306 | 90.57 / 0.602 | 92.76 / 0.712 | 91.87 / 0.686 | 93.03 / 0.748 | 92.88 / 0.705 | 91.02 / 0.686 | 93.11 / 0.729 |
| Vehicle | 59.09 / 0.794 | 59.09 / 0.833 | 40.91 / 0.517 | 66.90 / 0.859 | 64.42 / 0.834 | 58.75 / 0.836 | 68.91 / 0.881 | 66.90 / 0.861 | 61.23 / 0.845 |
Table A6. Average accuracy and AUC values obtained using the SVM classification algorithm coupled with the nine data preprocessing techniques with respect to the datasets that feature 20% missing values.
SVM Classification (each cell: Accuracy / AUC)

| Dataset | Delete&MinMax | Delete&Zscore | Delete&Decimal | Mean&MinMax | Mean&Zscore | Mean&Decimal | kNN&MinMax | kNN&Zscore | kNN&Decimal |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Abalone | 22.49 / 0.701 | 24.13 / 0.735 | 22.03 / 0.666 | 23.17 / 0.707 | 24.16 / 0.714 | 22.15 / 0.689 | 22.93 / 0.703 | 23.91 / 0.708 | 22.33 / 0.692 |
| Ecoli | 66.67 / 0.882 | 77.33 / 0.900 | 50.67 / 0.755 | 76.79 / 0.899 | 80.06 / 0.918 | 67.86 / 0.831 | 77.68 / 0.900 | 80.36 / 0.919 | 70.24 / 0.851 |
| Glass | 62.96 / 0.729 | 55.56 / 0.684 | 40.74 / 0.420 | 54.67 / 0.720 | 56.54 / 0.765 | 43.46 / 0.644 | 55.61 / 0.723 | 57.48 / 0.766 | 42.99 / 0.643 |
| Iris | 94.34 / 0.971 | 55.56 / 0.949 | 96.23 / 0.981 | 86.00 / 0.927 | 88.00 / 0.932 | 76.00 / 0.843 | 92.00 / 0.955 | 94.00 / 0.963 | 83.33 / 0.907 |
| ShopperIntention | 87.56 / 0.607 | 87.18 / 0.640 | 84.95 / 0.500 | 87.34 / 0.617 | 87.73 / 0.639 | 85.46 / 0.537 | 86.87 / 0.597 | 87.22 / 0.614 | 85.20 / 0.525 |
| PageBlocks | 91.54 / 0.607 | 94.69 / 0.829 | 90.05 / 0.500 | 91.72 / 0.618 | 94.63 / 0.799 | 90.77 / 0.554 | 92.02 / 0.641 | 94.24 / 0.782 | 90.48 / 0.538 |
| PenDigits | 91.32 / 0.981 | 95.81 / 0.988 | 13.47 / 0.577 | 87.88 / 0.979 | 87.89 / 0.980 | 80.90 / 0.956 | 90.91 / 0.985 | 91.27 / 0.986 | 82.68 / 0.962 |
| SeismicBumps | 92.45 / 0.500 | 92.45 / 0.500 | 92.45 / 0.500 | 93.42 / 0.500 | 93.42 / 0.500 | 93.42 / 0.500 | 93.42 / 0.500 | 93.42 / 0.500 | 93.42 / 0.500 |
| Vehicle | 59.09 / 0.744 | 68.18 / 0.807 | 22.73 / 0.386 | 62.06 / 0.802 | 61.70 / 0.801 | 45.98 / 0.699 | 64.54 / 0.820 | 66.19 / 0.828 | 44.44 / 0.688 |

References

1. Kuhn, M.; Johnson, K. Data Pre-processing. In Applied Predictive Modeling; Springer: New York, NY, USA, 2013; pp. 27–59.
2. Crone, S.; Lessmann, S.; Stahlbock, R. The impact of preprocessing on data mining: An evaluation of classifier sensitivity in direct marketing. Eur. J. Oper. Res. 2006, 173, 781–800.
3. Kumar Singh, B.; Verma, K.; Thoke, A.S. Investigations on Impact of Feature Normalization Techniques on Classifier’s Performance in Breast Tumor Classification. Int. J. Comput. Appl. 2015, 116, 11–15.
4. Alizadeh Naeini, A.; Babadi, M.; Homayouni, S. Assessment of Normalization Techniques on the Accuracy of Hyperspectral Data Clustering. ISPRS Int. Arch. Photogramm. Remote Sens. Spat. Inf. Sci. 2017, XLII-4/W4, 27–30.
5. Han, J.; Kamber, M.; Pei, J. Data Mining: Concepts and Techniques; Morgan Kaufmann: San Mateo, CA, USA, 2011.
6. Jayalakshmi, T.; Santhakumaran, A. Impact of Preprocessing for Diagnosis of Diabetes Mellitus Using Artificial Neural Networks. In Proceedings of the 2010 Second International Conference on Machine Learning and Computing, Bangalore, India, 12–13 February 2010; pp. 109–112.
7. Huang, H.C.; Qin, L.X. Empirical evaluation of data normalization methods for molecular classification. PeerJ 2018, 6, e4584.
8. Rozenstein, O.; Paz Kagan, T.; Salbach, C.; Karnieli, A. Comparing the Effect of Preprocessing Transformations on Methods of Land-Use Classification Derived From Spectral Soil Measurements. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2014, 1–12.
9. Baitharu, T.R.; Pani, S.K. Effect of Missing Values on Data Classification. J. Emerg. Trends Eng. Appl. Sci. (JETEAS) 2013, 4, 311–316.
10. Olsen, I.; Kvien, T.; Uhlig, T. Consequences of handling missing data for treatment response in osteoarthritis: A simulation study. Osteoarthr. Cartil. 2012, 20, 822–828.
11. Hunt, L.A. Missing Data Imputation and Its Effect on the Accuracy of Classification. In Data Science; Palumbo, F., Montanari, A., Vichi, M., Eds.; Springer International Publishing: Cham, Switzerland, 2017; pp. 3–14.
12. Friedman, M. A Comparison of Alternative Tests of Significance for the Problem of m Rankings. Ann. Math. Stat. 1940, 11, 86–92.
13. Nemenyi, P. Distribution-Free Multiple Comparisons; Princeton University: Princeton, NJ, USA, 1963.
14. Corizzo, R.; Ceci, M.; Japkowicz, N. Anomaly Detection and Repair for Accurate Predictions in Geo-distributed Big Data. Big Data Res. 2019, 16, 18–35.
15. Acuña, E.; Rodriguez, C. The Treatment of Missing Values and its Effect on Classifier Accuracy. In Classification, Clustering, and Data Mining Applications; Banks, D., McMorris, F.R., Arabie, P., Gaul, W., Eds.; Springer: Berlin/Heidelberg, Germany, 2004; pp. 639–647.
16. Saar-Tsechansky, M.; Provost, F. Handling Missing Values when Applying Classification Models. J. Mach. Learn. Res. 2007, 8, 1623–1657.
17. García-Laencina, P.J.; Sancho-Gómez, J.L.; Figueiras-Vidal, A.R. Pattern Classification with Missing Data: A Review. Neural Comput. Appl. 2010, 19, 263–282.
18. Osborne, J. Best Practices in Data Cleaning: A Complete Guide to Everything You Need to Do before and after Collecting Your Data; SAGE: Thousand Oaks, CA, USA, 2013.
19. Purwar, A.; Singh, S.K. Hybrid prediction model with missing value imputation for medical data. Expert Syst. Appl. 2015, 42, 5621–5631.
20. Luengo, J.; García, S.; Herrera, F. On the choice of the best imputation methods for missing values considering three groups of classification methods. Knowl. Inf. Syst. 2012, 32, 77–108.
21. Qie, Y.; Song, P.; Hao, C. Data Repair Without Prior Knowledge Using Deep Convolutional Neural Networks. IEEE Access 2020, 8, 105351–105361.
22. Foody, G.M.; Arora, M.K. An evaluation of some factors affecting the accuracy of classification by an artificial neural network. Int. J. Remote Sens. 1997, 18, 799–810.
23. Eftekhary, M.; Gholami, P.; Safari, S.; Shojaee, M. Ranking Normalization Methods for Improving the Accuracy of SVM Algorithm by DEA Method. Mod. Appl. Sci. 2012, 6, 26–36.
24. Wohlrab, L.; Fürnkranz, J. A Comparison of Strategies for Handling Missing Values in Rule Learning; Technical Report; Knowledge Engineering Group, Technische Universität Darmstadt: Darmstadt, Germany, 2009.
25. Almuhaideb, S.; Menai, M.E.B. Impact of preprocessing on medical data classification. Front. Comput. Sci. 2016, 10, 1082–1102.
26. Jordanov, I.; Petrov, N.; Petrozziello, A. Classifiers accuracy improvement based on missing data imputation. J. Artif. Intell. Soft Comput. Res. 2018, 8, 31–48.
27. Peugh, J.; Enders, C. Missing Data in Educational Research: A Review of Reporting Practices and Suggestions for Improvement. Rev. Educ. Res. 2004, 74, 525–556.
28. Aleryani, A.; Wang, W.; De La Iglesia, B. Dealing with Missing Data and Uncertainty in the Context of Data Mining. In Hybrid Artificial Intelligent Systems; Springer: Berlin/Heidelberg, Germany, 2018; pp. 289–301.
29. Kim, T.; Ko, W.; Kim, J. Analysis and Impact Evaluation of Missing Data Imputation in Day-ahead PV Generation Forecasting. Appl. Sci. 2019, 9, 204.
30. Akçay, H.; Filik, T. Short-term wind speed forecasting by spectral analysis from long-term observations with missing values. Appl. Energy 2017, 191, 653–662.
31. Corizzo, R.; Ceci, M.; Fanaee-T, H.; Gama, J. Multi-aspect renewable energy forecasting. Inf. Sci. 2021, 546, 701–722.
32. Agoua, X.G.; Girard, R.; Kariniotakis, G. Short-Term Spatio-Temporal Forecasting of Photovoltaic Power Production. IEEE Trans. Sustain. Energy 2018, 9, 538–546.
33. Ceci, M.; Corizzo, R.; Malerba, D.; Rashkovska, A. Spatial autocorrelation and entropy for renewable energy forecasting. Data Min. Knowl. Discov. 2019, 33, 698–729.
34. Lichman, M. UCI Machine Learning Repository. Available online: http://archive.ics.uci.edu/ml (accessed on 1 June 2019).
35. Huang, J.; Ling, C.X. Using AUC and Accuracy in Evaluating Learning Algorithms. IEEE Trans. Knowl. Data Eng. 2005, 17, 299–310.
36. Hall, M.; Frank, E.; Holmes, G.; Pfahringer, B.; Reutemann, P.; Witten, I.H. The WEKA Data Mining Software: An Update. SIGKDD Explor. Newsl. 2009, 11, 10–18.
37. Demšar, J. Statistical Comparisons of Classifiers over Multiple Data Sets. J. Mach. Learn. Res. 2006, 7, 1–30.
Figure 1. The proposed research methodology for determining the most convenient preprocessing techniques that can be adopted to produce high performance classification models.
Figure 2. The results obtained when using ANN classification for datasets having missing values originally.
Figure 3. The results obtained when using SVM classification for datasets having missing values originally.
Figure 4. The reported Friedman test results for datasets having missing values originally.
Figure 5. The results obtained when using ANN classification for datasets having 10% missing values.
Figure 6. The results obtained when using SVM classification for datasets having 10% missing values.
Figure 7. A visual representation of the post-hoc test results for datasets that feature 10% missing values when using the ANN classifier. Connected techniques are not significantly different, and the best technique is positioned on the right. CD, Critical Difference.
Figure 8. A visual representation of the post-hoc test results for datasets that feature 10% missing values when using the SVM classifier. Connected techniques are not significantly different, and the best technique is positioned on the right.
Figure 9. The results obtained when using ANN classification for datasets having 20% missing values.
Figure 10. The results obtained when using SVM classification for datasets having 20% missing values.
Figure 11. A visual representation of the post-hoc test results for datasets that feature 20% missing values when using the ANN classifier. Connected techniques are not significantly different, and the best technique is positioned on the right.
Figure 12. A visual representation of the post-hoc test results for datasets that feature 20% missing values when using the SVM classifier. Connected techniques are not significantly different, and the best technique is positioned on the right.
Figure 13. Generation time results (in seconds) obtained using the ANN classification algorithm coupled with the nine data preprocessing combination techniques.
Figure 14. Generation time results (in seconds) obtained using the SVM classification algorithm coupled with the nine data preprocessing combination techniques.
Table 1. The evaluation datasets’ description.
| Dataset | Instance # | Classes # | Feature # | Features Type (Numerical, Nominal) | Missing Values (in All, in Numerical) | Missing Values Rate (%) | Area |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Automobile | 205 | 4 | 25 | (15, 10) | (59, 57) | 1.5 | Life |
| ChronicKidneyDisease | 400 | 2 | 24 | (11, 14) | (1012, 778) | 9.5 | Medicine |
| Credit Approval | 690 | 2 | 15 | (6, 9) | (67, 25) | 0.65 | Financial |
| Cylinder Bands | 540 | 2 | 35 | (22, 13) | (999, 571) | 5.30 | Physical |
| Dermatology | 366 | 6 | 34 | (1, 33) | (8, 8) | 0.06 | Medicine |
| HCC survival | 165 | 2 | 49 | (26, 23) | (826, 475) | 10.22 | Medicine |
| Hepatitis | 155 | 2 | 19 | (6, 13) | (167, 122) | 5.67 | Medicine |
| MammographicMasses | 961 | 2 | 5 | (1, 4) | (162, 83) | 3.37 | Medicine |
| Thyroid (sick) | 3772 | 2 | 29 | (7, 22) | (6064, 5914) | 2.17 | Medicine |
| Abalone | 4178 | 28 | 8 | (7, 1) | None | 0 | Wildlife |
| Ecoli | 336 | 8 | 7 | (7, 0) | None | 0 | Biology |
| PenDigits | 10,992 | 10 | 16 | (16, 0) | None | 0 | Computer |
| Glass | 214 | 6 | 9 | (9, 0) | None | 0 | Physical |
| Page Blocks | 5473 | 5 | 10 | (10, 0) | None | 0 | Computer |
| Waveform | 5000 | 3 | 21 | (21, 0) | None | 0 | Physical |
| Vehicle | 846 | 4 | 18 | (18, 0) | None | 0 | Computer |
| Online Shoppers’ Purchasing Intention | 12,330 | 2 | 17 | (10, 7) | None | 0 | Business |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Share and Cite

Alshdaifat, E.; Alshdaifat, D.; Alsarhan, A.; Hussein, F.; El-Salhi, S.M.F.S. The Effect of Preprocessing Techniques, Applied to Numeric Features, on Classification Algorithms’ Performance. Data 2021, 6, 11. https://doi.org/10.3390/data6020011
