2.1. Ensemble Classifiers
A basic principle behind ensemble learning is that, by combining a series of classifiers that each perform slightly better than random guessing (known as weak learners), a single strong classifier can be constructed. A decision tree (DT) classifier [32] is the epitome of a weak learner and is often incorporated into ensembles to establish a strong classifier. One popular type of ensemble tree model is Bootstrap Aggregation, or Bagging Trees (BT) [33]. As with its other ensemble counterparts, the BT model aims to address the problem of overfitting in DTs. The BT model involves randomly selecting a subset of the training data (with replacement) to train each individual DT in the ensemble. The final classification result is obtained through majority voting over the outputs (i.e., LULC classes) of the individual DTs in the ensemble.
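To make the procedure concrete, the following minimal scikit-learn sketch trains a BT ensemble whose base learners are DTs; the data are randomly generated placeholders, since no particular data set is assumed here.

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

# Placeholder data standing in for pixel features (X) and LULC labels (y).
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)

# By default, each base learner is a decision tree; each of the 100 trees is
# trained on a bootstrap sample drawn with replacement, and predictions are
# combined by majority vote across the ensemble.
bt = BaggingClassifier(n_estimators=100, bootstrap=True, random_state=0)
bt.fit(X, y)
print(bt.predict(X[:5]))   # predicted class labels for the first five samples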
The RF classifier is similar to BT in that each DT in the ensemble is grown using a random subset of the training data, but RF also randomly selects a subset of the classification variables for growing each DT [24]. The generated DTs thus have higher variance and lower bias. To train an RF model, two hyperparameters need to be set: the number of randomly selected features (Mtry) used for splitting each node and the number of trees (Ntree). Based on the experiments of Breiman [9], reasonable accuracy was obtained on different data sets when Mtry was set to √M, where M is the number of variables. In addition, Lawrence et al. [34] reported that an Ntree of 500 or more produced unbiased estimates of error.
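A corresponding RF sketch, again with placeholder data, might set these two hyperparameters to the values discussed above.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)   # placeholder data

rf = RandomForestClassifier(n_estimators=500,     # Ntree = 500, per Lawrence et al.
                            max_features='sqrt',  # Mtry = sqrt(M), per Breiman
                            random_state=0)
rf.fit(X, y)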
Boosting algorithms are greedy techniques that are also popular in the context of RS [26]. Unlike BT and RF, boosting models do not grow DTs in parallel. Instead, they sequentially train individual DTs, each an improved version of its predecessor with a smaller error rate. Perhaps the most commonly used boosting algorithm is Adaptive Boosting (AdaBoost) [35]. The foundation of boosting was later improved by a more generalized solution called gradient boosting (GB) [36]. GB works by minimizing a loss function through sequentially fitting an additive basis-function model to the gradient residual of the loss function. XGB [31], a more recently proposed and improved implementation of regular gradient tree boosting, has been reported to be a powerful ML algorithm for mapping purposes in RS [37]. In XGB, a regularization term added to the loss function constrains the model, thereby helping control its complexity and better avoid overfitting. To train an XGB model, three main parameters should be set: Ntree, the learning rate (eta), and the depth of each individual tree (depth). Tuning these three free parameters helps improve the performance of the model in terms of both speed and accuracy.
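A minimal sketch using the xgboost Python package shows where the three parameters enter; the data are randomly generated placeholders and the parameter values are illustrative, not tuned.

import numpy as np
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))        # placeholder features
y = rng.integers(0, 3, size=500)      # three dummy LULC classes

xgb_model = XGBClassifier(n_estimators=300,    # Ntree
                          learning_rate=0.1,   # eta
                          max_depth=6)         # depth of each individual tree
xgb_model.fit(X, y)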
2.3. Deep Learning Architectures
The structure of an MLP, which belongs to the family of feedforward networks, is shown schematically in Appendix A. An MLP is composed of four main components: an input layer, neurons (nodes), hidden layers, and an output layer. Put simply, an MLP can be defined as a series of layers of neurons successively connected to each other by weights, which are iteratively adjusted through an optimization process. The training procedure in feedforward neural networks is based on the back-propagation algorithm, which propagates error from the output layer back to the input layer. Through this process, which usually employs a stochastic gradient descent approach to optimize the loss function, the weights in each layer of the network are updated iteratively until the network's error rate reaches a desired minimum.
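As a concrete illustration, the following scikit-learn sketch trains a small MLP by back-propagation with stochastic gradient descent; the placeholder data and layer sizes are arbitrary choices.

from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)   # placeholder data

# Two hidden layers of 64 and 32 neurons; weights are adjusted iteratively by
# back-propagation with stochastic gradient descent until the loss converges.
mlp = MLPClassifier(hidden_layer_sizes=(64, 32), solver='sgd',
                    learning_rate_init=0.01, max_iter=500, random_state=0)
mlp.fit(X, y)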
In the earlier forms of feedforward networks, adding more than one layer to the network did not have any effect on accuracy. In fact, the gradient update was not able to back-propagate the error to the first layers, and thus no parameter update was applied to them; this phenomenon, known as vanishing gradients, was one of the main barriers to applying DNNs. In a revolutionary finding in the field of neural networks, Glorot and Bengio [41] overcame this problem using a new strategy for weight initialization. This provided new insights into taking advantage of DNNs to model complex patterns in applications including object detection/recognition, semantic segmentation, handwriting recognition, and speech recognition.
Although the number of hidden layers and neurons can affect the performance of a network, there is no solid rule for determining their optimal values. It is also evident that fitting more complex models (by designing more complex DNNs) to the input data increases the possibility of overfitting. To address this problem, these two hyperparameters can be set by performing cross-validation or based on a user's a priori knowledge. In addition, to overcome overfitting, regularization techniques, including dropout, are normally applied to increase the generalizability of the model.
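For instance, the number of hidden layers and neurons could be selected by cross-validation along the following lines; the candidate layouts and placeholder data are illustrative only.

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)   # placeholder data

# Five-fold cross-validation over a few candidate layer layouts; in practice
# the candidates would reflect a user's a priori knowledge.
search = GridSearchCV(
    MLPClassifier(max_iter=500, random_state=0),
    param_grid={'hidden_layer_sizes': [(32,), (64,), (64, 32), (64, 64, 32)]},
    cv=5)
search.fit(X, y)
print(search.best_params_)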
Autoencoders are a variant of neural networks that are structurally similar to MLPs. The main application of an autoencoder is to find the features that best allow the input data to be reconstructed, helping prevent overfitting, especially in cases where sufficient labeled data are not available [42]. They can therefore be very useful in RS applications because of the difficulty of collecting labeled data. As seen in Figure A2, an autoencoder has two main parts that distinguish it from an MLP, namely the coding layer and the decoding layer. In the coding layer, the network learns to map the data to a lower-dimensional feature space, similar to what Principal Component Analysis (PCA) does.
The decoding layer is responsible for reconstructing the coded, dimensionally reduced data. To put it differently, this layer approximates the input data using the coded data. It is therefore evident that the number of neurons in the input layer must be the same as the number of neurons in the output layer. To reconstruct complex data more accurately, it is possible to stack multiple autoencoders together, resulting in a deep autoencoder or stacked autoencoder. The coding layers of a stacked autoencoder can also be fed into a supervised learning model for classification purposes.
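A minimal stacked-autoencoder sketch in Keras (assumed available via tensorflow.keras; all sizes and the placeholder data are illustrative) makes the mirrored coding/decoding structure explicit.

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

X = np.random.rand(1000, 64).astype('float32')   # placeholder input data

autoencoder = keras.Sequential([
    keras.Input(shape=(64,)),
    layers.Dense(32, activation='relu'),     # coding layers: compress the input
    layers.Dense(8, activation='relu'),      # low-dimensional code
    layers.Dense(32, activation='relu'),     # decoding layers: reconstruct
    layers.Dense(64, activation='sigmoid'),  # output width equals input width
])
autoencoder.compile(optimizer='adam', loss='mse')
autoencoder.fit(X, X, epochs=10, batch_size=32, verbose=0)  # target == input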
Explicitly reducing the number of neurons output from the encoding layer is not the only way to reduce dimensionality and extract useful structure from the input data in an unsupervised setup. Penalizing the neurons of a stacked autoencoder is another way to compress the feature space, one that permits successive layers to have as many or more neurons than their input features. In this regard, one solution is to add a sparsity term to the loss function, establishing an SAE (Figure A3). Put simply, by adding this sparsity term during the training process, some neurons in the coding layer are deactivated, precluding the model from memorizing the pattern/structure of the training data. Sparsifying the network at each training iteration helps the network learn more useful features for reconstructing the input data, even when a larger number of neurons is used in the hidden layers. The sparsity term commonly used is the Kullback–Leibler (KL) divergence, a function for comparing the similarity between two distributions. This extra term added to the loss function has two free parameters: the sparsity parameter and the sparsity weight. The sparsity parameter controls the average activation of hidden neurons (i.e., neurons in the hidden layer(s)), while the sparsity weight is a scale value that determines the magnitude of the sparsity term imposed.
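The sketch below illustrates the idea in Keras. Since a KL-divergence sparsity penalty is not built into Keras, an L1 activity regularizer is used here as a simple stand-in: like the KL term, it drives the average activation of the coding neurons toward zero, with the 1e-4 coefficient playing the role of the sparsity weight.

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers, regularizers

X = np.random.rand(1000, 64).astype('float32')   # placeholder input data

sae = keras.Sequential([
    keras.Input(shape=(64,)),
    # More neurons (128) than input features (64), kept sparse by an
    # activity penalty on the coding layer.
    layers.Dense(128, activation='relu',
                 activity_regularizer=regularizers.l1(1e-4)),
    layers.Dense(64, activation='sigmoid'),
])
sae.compile(optimizer='adam', loss='mse')
sae.fit(X, X, epochs=10, batch_size=32, verbose=0)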
Another powerful type of autoencoder is the variational autoencoder (VAE) [43]. Unlike regular autoencoders and SAEs, a VAE is a probabilistic model; that is, it maps the input data to a probability distribution (in the coding or latent space) rather than to a fixed encoded vector. Instead of learning a function that maps a single data point in the feature space to a single encoded value, a VAE learns the parameters (i.e., mean and standard deviation) of a probability distribution in a latent space from the input data. From this probability distribution, which is typically (but not necessarily) chosen to be Gaussian, a single data point is randomly drawn through an additional layer called the sampling layer. Finally, the samples are fed into the decoding layer to carry out the reconstruction process. A VAE also takes advantage of the KL divergence, comparing the learned distribution with a normal distribution with a mean of 0 and standard deviation of 1; in other words, the KL divergence ensures that the learned distribution does not deviate significantly from a normal distribution (for more detailed information on VAEs, the reader is referred to Doersch [44]). As implied by the above description, VAEs are inherently generative models. Indeed, the capability of VAEs to transform input data into a probability distribution and then sample from it enables them to generate new instances from the constructed distribution. This feature is useful when sufficient labeled data are not available for classification, as the analyst can use a VAE to synthesize new instances to improve the accuracy and generalizability of the model.
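A compact Keras sketch of a VAE follows, showing the sampling layer and the KL term added alongside the reconstruction loss; the placeholder data, layer sizes, and latent dimension are illustrative assumptions.

import numpy as np
import tensorflow as tf
from tensorflow.keras import Model, layers

class Sampling(layers.Layer):
    # The "sampling layer": draws z from N(z_mean, exp(z_log_var)) and adds
    # the KL-divergence penalty that keeps the learned distribution close to
    # a standard normal distribution.
    def call(self, z_mean, z_log_var):
        kl = -0.5 * tf.reduce_mean(tf.reduce_sum(
            1 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var), axis=1))
        self.add_loss(kl)
        eps = tf.random.normal(tf.shape(z_mean))
        return z_mean + tf.exp(0.5 * z_log_var) * eps

input_dim, latent_dim = 64, 2
inputs = layers.Input(shape=(input_dim,))
h = layers.Dense(32, activation='relu')(inputs)      # encoder
z_mean = layers.Dense(latent_dim)(h)                 # mean of the latent Gaussian
z_log_var = layers.Dense(latent_dim)(h)              # log-variance
z = Sampling()(z_mean, z_log_var)
outputs = layers.Dense(input_dim, activation='sigmoid')(z)  # decoder

vae = Model(inputs, outputs)
vae.compile(optimizer='adam', loss='mse')  # reconstruction loss; KL added above

X = np.random.rand(1000, input_dim).astype('float32')  # placeholder data
vae.fit(X, X, epochs=10, batch_size=32, verbose=0)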
Another variant of feedforward multilayer models is the CNN, which is perhaps currently the most popular DL model for object detection/identification. CNNs are simplified analogies of the mammalian visual cortex. Unlike MLPs and autoencoders, the inputs to CNNs are image patches rather than vectorized data, which is very useful for extracting/learning spatial and contextual features. In a CNN architecture, a fixed-size image patch is mapped to a vector of probabilities calculated for each of the classes considered [42]. Another difference between an MLP and a CNN is that CNNs are not fully connected (except in their last layer), meaning that each neuron is not connected to all neurons in the next layer. This is, in fact, the main advantage of CNNs, allowing them to generate learned features by sequentially applying convolutional filters to fixed-size inputs. In the first layers, convolutional filters extract low-level features (e.g., edges), while the last layers are responsible for extracting and learning high-level information (e.g., land features). When stacked together, these convolutional layers can be very effective for detecting/recognizing objects of interest. To improve the efficiency of CNNs, some convolutional layers are followed by a pooling/subsampling layer that reduces the output size of the convolutional layers. One of the problems with CNNs is that they only accept fixed-size image patches. In RS mapping, this can negatively affect classification results because LULC boundaries for geometrically distorted and small LULC patches can be missed by the convolutional filters used in the CNN, causing unwanted uncertainty in the classification process [29]. One solution to this problem is to integrate GEOBIA with a CNN to account for the boundaries of the land features to be classified.
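The following Keras sketch of a small patch-based CNN mirrors the structure described above; the patch size, band count, class count, and random placeholder patches are all illustrative assumptions.

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

patches = np.random.rand(500, 32, 32, 4).astype('float32')  # 32x32 patches, 4 bands
labels = np.random.randint(0, 5, size=500)                  # 5 dummy LULC classes

cnn = keras.Sequential([
    keras.Input(shape=(32, 32, 4)),            # fixed-size input patches
    layers.Conv2D(16, 3, activation='relu'),   # low-level features (e.g., edges)
    layers.MaxPooling2D(),                     # pooling reduces output size
    layers.Conv2D(32, 3, activation='relu'),   # higher-level features
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(5, activation='softmax'),     # the only fully connected layer
])
cnn.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
cnn.fit(patches, labels, epochs=5, batch_size=32, verbose=0)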