5.2.1. Machine Learning Approaches
Classification is the sub-field of supervised learning concerned with predicting the category of a given input. The classification model, or classifier, is trained using a labeled training set (i.e., a dataset containing observations whose category membership is known). Each observation in the dataset is an n-dimensional vector, and each element of the vector is called a feature (also attribute or variable). We have used five classifiers: K nearest neighbor (K-NN) with K = 1 and K = 3, support vector machines (SVMs), naive Bayes (NB) and classification trees (C4.5). A brief description of each of them is included below.
K-NN is a distance-based classifier that belongs to the instance-based learning paradigm. It computes the distance from the new case to be classified to each of the observations in the database it uses as the model, and it assigns the class based on the K nearest cases. We have used the instance-based algorithm described in [49].
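As a minimal sketch of this decision rule only (Euclidean distance and the helper name are illustrative choices; the actual algorithm we use is the one specified in [49]):

```python
# Sketch: K-NN prediction for a single new case, using Euclidean distance.
from collections import Counter
import numpy as np

def knn_predict(X_train, y_train, x_new, k=1):
    # Distances from the new case to every stored observation.
    dists = np.linalg.norm(np.asarray(X_train) - np.asarray(x_new), axis=1)
    nearest = np.argsort(dists)[:k]            # indices of the K nearest cases
    # Majority vote among the labels of the K nearest cases.
    return Counter(np.asarray(y_train)[nearest]).most_common(1)[0][0]
```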
Support vector machines (SVMs)
SVMs are a set of related supervised learning methods used for classification and regression. In a binary problem, an SVM views the input data as two sets of vectors (one set per class) in an n-dimensional space. The SVM constructs a separating hyperplane in that space that maximizes the margin between the two sets. To calculate the margin, two parallel hyperplanes are constructed, one on each side of the separating hyperplane, which are “pushed up against” the two sets. Intuitively, a good separation is achieved by the hyperplane that has the largest distance to the neighboring data points of both classes, since, in general, the larger the margin, the lower the generalization error of the classifier [51]. SVMs were extended to classify datasets that are not linearly separable through the use of non-linear kernels. In our work, we use non-linear SVMs with a radial kernel.
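For reference, the separating hyperplane, the margin and the radial kernel can be written in their standard textbook form (the notation below is generic and not tied to our implementation):

```latex
% Standard SVM geometry and the radial (RBF) kernel (generic notation)
\begin{align*}
  \mathbf{w}^{\top}\mathbf{x} + b &= 0
      && \text{(separating hyperplane)}\\
  \mathbf{w}^{\top}\mathbf{x} + b &= \pm 1
      && \text{(margin hyperplanes; margin width } 2/\lVert\mathbf{w}\rVert\text{)}\\
  K(\mathbf{u},\mathbf{v}) &= \exp\!\bigl(-\gamma\,\lVert\mathbf{u}-\mathbf{v}\rVert^{2}\bigr)
      && \text{(radial kernel)}
\end{align*}
```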
There are many classifiers based on probability theory. Most of them are based on Bayes’ theorem and try to obtain the class for which the a posteriori probability is the greatest given the predictor variables of the case to be classified. In this work, we have used the naive Bayes (NB) classifier [52]. The name of this classifier comes from its underlying assumption, namely that the features are independent. This is a very strong assumption and, hence, considered “naive”.
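Under this independence assumption, the predicted class reduces to the usual maximum a posteriori rule (textbook notation, not taken from [52]):

```latex
% Naive Bayes decision rule for a case x = (x_1, ..., x_n) under feature independence
\hat{c} = \arg\max_{c}\; P(c)\prod_{i=1}^{n} P(x_i \mid c)
```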
A classification tree is a classifier composed of nodes and branches that break the set of samples into a set of covering decision rules. In each node, a single test is made to obtain the partition. The starting node is called the root of the tree. In the final nodes or leaves, a decision about the classification of the case is made. In this work, we have used the C4.5 algorithm [53].
5.2.2. Feature Selection
Feature selection (FS) is the process of identifying the features that are relevant to a particular learning (or data mining) problem. FS is a key process in supervised classification [54] and can improve classification performance (accuracy, area under the receiver operating characteristic (ROC) curve (AUC), etc.).
Most algorithms for FS can be categorized as filter or wrapper approaches. In the filter approach, an attribute (or attribute subset) is evaluated by only using intrinsic properties of the data (e.g., statistical or information-based measures). Filter techniques have the advantage of being fast and general, in the sense that the subset obtained is not biased in favor of a specific classifier. However, they lack robustness against interactions among features. In addition, it is not clear how to determine the cut-off point for rankings to select only truly important features and exclude noise.
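As an illustration of the filter approach only (scikit-learn is shown purely as an example, and the cut-off k = 10 is an arbitrary choice), a typical filter ranks features with an information-based score:

```python
# Sketch: filter-style feature selection by ranking features with a
# mutual-information score and keeping the k top-ranked ones.
from sklearn.feature_selection import SelectKBest, mutual_info_classif

filter_selector = SelectKBest(score_func=mutual_info_classif, k=10)
# X_reduced = filter_selector.fit_transform(X, y)
```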
Wrapper algorithms have the advantage of achieving greater accuracy than filters, but with the disadvantage of being (far) more time-consuming and of obtaining an attribute subset that is biased towards the classifier used. Over the last decade, wrapper-based FS has been an active area of research. Different search algorithms [55] have been used to guide the search process, while a classifier (e.g., naive Bayes, K-NN, etc.) is used as a surrogate to evaluate the goodness of each subset proposed by the search algorithm.
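As a sketch of the wrapper idea only (the greedy forward-selection strategy and the function names below are ours; the search strategy actually used in our experiments is described in Section 5.2.3), the cross-validated accuracy of the chosen classifier serves as the surrogate score for each candidate subset:

```python
# Sketch: the wrapper idea as greedy forward selection, scoring each candidate
# feature subset by the cross-validated accuracy of a chosen classifier.
import numpy as np
from sklearn.model_selection import cross_val_score

def forward_selection(X, y, clf, cv=5):
    remaining = list(range(X.shape[1]))        # indices of candidate features
    selected, best_score, best_f = [], -np.inf, None
    improved = True
    while improved and remaining:
        improved = False
        for f in remaining:
            subset = selected + [f]
            score = cross_val_score(clf, X[:, subset], y, cv=cv).mean()
            if score > best_score:
                best_score, best_f, improved = score, f, True
        if improved:
            selected.append(best_f)
            remaining.remove(best_f)
    return selected, best_score
```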
5.2.3. Experimental Results
As presented in Section 4, each raw image undergoes the proposed delineation steps, which provide a set of 2D straight segments. Several statistics are extracted from this set of segments and are used as the predictor variables of the classifier. These statistics are listed next, where lengths are expressed as a percentage of the image width, slopes are expressed in degrees and slope differences are the differences between slopes for every pair of segments in the delineation. For the image of the processed delineation, the sums of the rows (horizontal accumulation) and the sums of the columns (vertical accumulation) are also calculated.
Regarding the lengths, these statistics are the minimum, maximum, least frequent, most frequent, mean and standard deviation.
Regarding the slopes, the statistics used are the maximum, least frequent, most frequent, mean, standard deviation and percentage of slopes that are vertical (or very nearly vertical, i.e., a 90-degree slope).
For slope differences, the statistics used are the least frequent, most frequent, mean, standard deviation, percentage of differences between zero and four degrees (pairs of segments that are parallel or nearly parallel) and percentage of differences between 86 and 94 degrees (pairs of segments that are perpendicular or nearly perpendicular).
For the horizontal accumulation, the maximum (expressed as the percentage of the width), least frequent, mean and standard deviation are used.
For the vertical accumulation, the minimum (expressed as the percentage of the height), maximum, least frequent, most frequent, mean and standard deviation are used.
Altogether, every image is described by 28 features that summarize statistics about the extracted straight segments. Every feature was rescaled to the interval [0, 1].
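As a minimal, illustrative sketch (the variable names, the 1-degree tolerance for “very nearly vertical” and the choice of library are assumptions, and only a subset of the 28 statistics is shown), the segment-based features could be computed along these lines:

```python
# Sketch: a subset of the per-image statistics computed over the extracted
# 2D straight segments. Each segment is (x1, y1, x2, y2); at least two
# segments are assumed so that pairwise slope differences exist.
from itertools import combinations
import numpy as np

def segment_statistics(segments, image_width):
    segs = np.asarray(segments, dtype=float)
    dx, dy = segs[:, 2] - segs[:, 0], segs[:, 3] - segs[:, 1]

    # Segment lengths as a percentage of the image width.
    lengths = 100.0 * np.hypot(dx, dy) / image_width

    # Slopes in degrees, folded into [0, 180).
    slopes = np.degrees(np.arctan2(dy, dx)) % 180.0

    # Differences between slopes for every pair of segments.
    diffs = np.array([abs(a - b) for a, b in combinations(slopes, 2)])

    return {
        "len_min": lengths.min(), "len_max": lengths.max(),
        "len_mean": lengths.mean(), "len_std": lengths.std(),
        "slope_max": slopes.max(),
        "slope_mean": slopes.mean(), "slope_std": slopes.std(),
        # Percentage of (very nearly) vertical segments.
        "pct_vertical": 100.0 * np.mean(np.abs(slopes - 90.0) <= 1.0),
        # Pairs of (nearly) parallel and (nearly) perpendicular segments.
        "pct_parallel": 100.0 * np.mean(diffs <= 4.0),
        "pct_perpendicular": 100.0 * np.mean((diffs >= 86.0) & (diffs <= 94.0)),
    }
```

The mode-based statistics (least/most frequent values) and the row/column accumulation profiles are computed analogously, and every feature is then min–max rescaled to [0, 1].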
Classifier Parameters The classifiers we have chosen are the K nearest neighbor (K-NN), support vector machine (SVM), naive Bayes (NB) and classification tree (C4.5) algorithms. For K-NN, we test two values of K, the number of nearest neighbors: 1 and 3. For SVM, we use a non-linear SVM with a Gaussian kernel exp(−γ‖u − v‖²), with γ = 1/N = 1/86 (N being the number of instances) and the cost parameter C of the C-SVC set to 150.
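Expressed with scikit-learn (purely as an illustrative mapping, not necessarily the software used in our experiments, and with DecisionTreeClassifier, a CART implementation, as a stand-in for C4.5), this parameterization would read:

```python
# Sketch: the classifier parameters stated above, written with scikit-learn.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier  # CART, a stand-in for C4.5

classifiers = {
    "1-NN": KNeighborsClassifier(n_neighbors=1),
    "3-NN": KNeighborsClassifier(n_neighbors=3),
    "NB": GaussianNB(),
    "C4.5": DecisionTreeClassifier(),
    # C-SVC with Gaussian kernel exp(-gamma * ||u - v||^2),
    # gamma = 1/N = 1/86 and cost parameter C = 150.
    "SVM": SVC(kernel="rbf", gamma=1.0 / 86.0, C=150.0),
}
```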
Classification Using All Features and Manually Selected Features The performance of a classifier is usually measured in terms of the success rate (or classification accuracy), that is, the proportion of instances correctly classified given a test set. The test set must contain instances that are not present in the training set, i.e., the dataset used to train the classifier. When the available dataset is small, splitting it into training and test sets means the classifier has to be trained and tested with little data. In order to attenuate this problem, we have chosen to follow the leave-one-out cross-validation (LOOCV) protocol to perform the evaluation. The k-fold cross-validation technique, in general, splits the dataset into k subsets. It then performs k iterations of training and testing, using one of the k subsets as the test set (a different one in each iteration) and the union of the remaining subsets as the training set. After the k iterations, each instance of the dataset has been classified by the classifier exactly once. These predicted classifications are then compared to the actual class labels to compute the success rate. LOOCV is the particular case of k-fold cross-validation in which k is set to the number of instances in the data, which, of course, means that each fold contains a single instance.
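A compact sketch of the protocol, assuming scikit-learn, is shown below; any implementation of LOOCV yields the same evaluation:

```python
# Sketch: leave-one-out cross-validation (LOOCV) success rate for one classifier.
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.svm import SVC

def loocv_accuracy(clf, X, y):
    # One fold per instance: train on N-1 instances, test on the held-out one;
    # the mean over all folds is the proportion of correctly classified instances.
    return cross_val_score(clf, X, y, cv=LeaveOneOut()).mean()

# Example call (X: N x 28 feature matrix rescaled to [0, 1], y: class labels):
# acc = loocv_accuracy(SVC(kernel="rbf", gamma=1 / 86, C=150), X, y)
```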
In order to quantify the classification accuracy, we have performed LOOCV tests using the five classifiers. Table 1 illustrates the classification accuracy obtained with the five classifiers in two cases. The first case corresponds to the use of the full set of 28 extracted features. The second case corresponds to the use of a subset of 19 manually selected features out of the 28. These 19 features are of two types: original features and features formed by combining original features. As can be seen, using the raw 28 features, the non-linear SVM provided the best performance. However, when the manually selected features were used, the 1-NN provided the best performance. We can also observe that the use of manually selected features improved the performance of all classifiers (except for the NB classifier, for which the performance remained the same). For example, the SVM performance increased by 2.7%.
Classification with Automatic Feature Selection
The above manual selection of features was guided by some intuition about feature relevance in discriminating classes. However, this manual selection is not necessarily optimal and can be replaced by automatic feature selection using, for example, the wrapper technique. The goal is to find a subset among the complete set of features (28 features) that maximizes the classification accuracy. We evaluated the performance using LOOCV. The search strategy chosen was a genetic algorithm [56] with the following parameter values: the population size was set to 80; the maximum number of generations to 20; the crossover probability to 0.7; and the mutation probability to 0.05. This kind of processing is carried out efficiently, since the total number of features is 28.
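The sketch below illustrates such a genetic search over binary feature masks with the parameter values given above (a hand-rolled toy implementation; the selection, crossover and mutation operators, and the use of LOOCV accuracy as the fitness, are generic illustrative choices, not necessarily those of [56]):

```python
# Sketch: wrapper feature selection with a genetic algorithm over binary
# feature masks (bit i = 1 means feature i is selected).
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score

def ga_feature_selection(X, y, clf, n_pop=80, n_gen=20,
                         p_cross=0.7, p_mut=0.05, seed=0):
    rng = np.random.default_rng(seed)
    n_feat = X.shape[1]
    pop = rng.integers(0, 2, size=(n_pop, n_feat))

    def fitness(mask):
        if not mask.any():                      # empty subsets get zero fitness
            return 0.0
        cols = mask.astype(bool)
        return cross_val_score(clf, X[:, cols], y, cv=LeaveOneOut()).mean()

    for _ in range(n_gen):
        scores = np.array([fitness(ind) for ind in pop])
        # Binary-tournament selection of parents.
        idx = [max(rng.choice(n_pop, 2, replace=False), key=lambda i: scores[i])
               for _ in range(n_pop)]
        children = pop[idx].copy()
        # One-point crossover with probability p_cross.
        for i in range(0, n_pop - 1, 2):
            if rng.random() < p_cross:
                cut = int(rng.integers(1, n_feat))
                tail = children[i, cut:].copy()
                children[i, cut:] = children[i + 1, cut:]
                children[i + 1, cut:] = tail
        # Bit-flip mutation with probability p_mut per gene.
        flips = rng.random(children.shape) < p_mut
        children[flips] = 1 - children[flips]
        pop = children

    scores = np.array([fitness(ind) for ind in pop])
    best = pop[int(scores.argmax())]
    return best.astype(bool), float(scores.max())
```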
Table 2 shows the classification accuracy obtained with the five classifiers and the features selected by the wrapper method. Compared to the scheme that uses all features (28 features) or the manually selected features (19 features), the accuracy of all classifiers was improved. For example, the recognition rate of the 1-NN classifier increased by 12.8% with respect to the result with 28 features and by 4.2% with respect to the result with the manually selected features (19 features). We can observe that, following automatic feature selection, the SVM classifier provided the best performance.
Table 3 illustrates the precision, recall and F1 measure for every class obtained after feature selection for all classifiers. Finally, Table 4 shows the confusion matrix for the best result obtained, that is, the SVM classifier with feature selection using the wrapper method.
Note that the wrapper technique for feature selection makes use of the classifier to evaluate the different feature subsets. Thus, the optimal subset of features differs from one classifier to another. Table 5 illustrates the selected features for every classifier. For every feature, we show a binary string formed by five bits that correspond to the five classifiers: 1-NN, 3-NN, NB, J48 (the Weka implementation of C4.5) and SVM, respectively. A zero bit indicates that the feature was not selected by the wrapper technique for the corresponding classifier; a one bit indicates that it was selected. As expected, the relevance of features depends highly on the classifier used. For instance, the SVM classifier (fifth bit) preferred one feature related to lengths, four features related to slopes and their differences, two features related to the percentages of pairwise relations (parallel and perpendicular) and five features about row and column accumulation. Moreover, we can observe that all classifiers selected the feature measuring the proportion of segment pairs in a parallelism relation.
Analysis of the Results The results show that all classifiers have more difficulty classifying the instances of the second class correctly. The reason is twofold. First, this is the class with the least number of training examples, and this has a direct negative impact on the performance of the classifier. Second, the geometric arrangement of walls belonging to Class 2 shares many statistics with the two remaining classes. This is because Class 2 is, by definition, the class in between Classes 1 and 3. Class 2 shares the use of non-regular blocks with Class 1. On the other hand, Class 2 shares with Class 3 that the blocks are arranged in rows.
In practice, discriminating between Classes 1 and 2 is challenging, even for humans; the boundary is not at all clear-cut. Figure 12 shows a Class 2 wall that could easily be classified as Class 1. The blocks in this wall are irregular, so it is clearly not Class 3. However, we could say that the bottom half, in which we can clearly see rows, is Class 2, while the top half, in which there are no obvious rows, is Class 1. Therefore, misclassifying Class 1 and 2 instances is not, generally, a fatal error. Additionally, in some cases, even misclassifying Class 2 and 3 instances might not be fatal.
However, misclassifying Class 1 and 3 instances is always a fatal error. From the confusion matrix in Table 4, we can conclude that the best performing classifier made only four fatal errors (4.7%), all of them Class 3 instances classified as Class 1. These fatal errors are due to the incorrect delineation of the corresponding images. Figure 13 shows an example in which the proposed delineation framework almost completely fails to delineate the regular blocks of a Class 3 wall. This is, currently, the weakest point of the framework and one for which it is imperative that we find a solution.
In conclusion, the results obtained are promising. The best classifier achieved an 87.2% success rate, and among the 12.8% of incorrectly classified cases, only four (4.7% of the sample size) were fatal errors. Nonetheless, the delineation algorithms need to be improved so that, at least, the classifier makes no fatal errors.