Article

Extreme Learning Machine for Multi-Label Classification

1
School of Information Science and Technology, Northwest University, Xi’an 710069, China
2
Computer Information Science and Engineering, University of Florida, Gainesville, FL 32608, USA
3
Department of Computer Science, Xi’an Jiaotong University City College, Xi’an 710069, China
*
Author to whom correspondence should be addressed.
Submission received: 26 February 2016 / Revised: 21 May 2016 / Accepted: 1 June 2016 / Published: 8 June 2016

Abstract

Extreme learning machine (ELM) techniques have received considerable attention in the computational intelligence and machine learning communities because of the significantly low computational time required for training new classifiers. ELM provides solutions for regression, clustering, binary classification and multiclass classification, among others, but not for multi-label learning. Multi-label learning deals with objects that carry multiple labels simultaneously, which are common in real-world applications. Therefore, a thresholding method-based ELM, called extreme learning machine for multi-label classification (ELM-ML), is proposed in this paper to adapt ELM to multi-label classification. ELM-ML outperforms other multi-label classification methods on several standard data sets in most cases, especially for applications which only have a small labeled data set.

1. Introduction

Multi-label classification deals with objects that may belong to multiple labels simultaneously, a situation that is common in real-world applications such as text categorization, scene and video annotation, bioinformatics, and music emotion classification [1]. For example, a sunrise image could be labeled with “sun,” “sky” and “sea” at the same time in semantic scene classification. Formally [2], let $\mathcal{X} = \mathbb{R}^d$ denote the d-dimensional input space, and $\mathcal{Y} = \{y_1, y_2, \ldots, y_m\}$ denote the label space consisting of m possible class labels. The task of multi-label learning is to learn a function $h: \mathcal{X} \to 2^{\mathcal{Y}}$ from the multi-label training set $D = \{(x_i, Y_i) \mid 1 \le i \le n\}$. For each multi-label example $(x_i, Y_i)$, $x_i \in \mathcal{X}$ is a d-dimensional feature vector $(x_{i1}, x_{i2}, \ldots, x_{id})^T$ and $Y_i \subseteq \mathcal{Y}$ is the set of labels associated with $x_i$. For any unseen instance $x \in \mathcal{X}$, the multi-label classifier $h(\cdot)$ predicts $h(x) \subseteq \mathcal{Y}$ as the set of proper labels for $x$.
Multi-label classification has attracted a lot of attention in the past few years [3,4,5,6]. Nowadays, there are two main ways to construct various discriminative multi-label classification algorithms: problem transformation and algorithm adaptation. The key philosophy of problem transformation methods is to fit the data to the algorithm, while the key philosophy of algorithm adaptation methods is to fit the algorithm to the data [2].
A problem transformation strategy tackles a multi-label learning problem by transforming it into multiple independent binary or multi-class sub-problems, constructing a sub-classifier for each sub-problem using an existing technique, and then assembling all sub-classifiers into an entire multi-label classifier. It is convenient and fast to implement a problem transformation method due to the number of existing techniques and their free software. Representative algorithms include Binary Relevance [7], AdaBoost.MH [8], Calibrated Label Ranking [3], Random k-labelsets [9], etc.
An algorithm adaptation strategy tackles a multi-label learning problem by adapting popular learning techniques to deal with multi-label data. Representative algorithms include Multi-Label k-Nearest Neighbor (ML-kNN) [10], Multi-Label Decision Tree (ML-DT) [11], Ranking Support Vector Machine (Rank-SVM) [12], Backpropagation for Multi-Label Learning (BP-MLL) [13], etc. The basic idea of ML-kNN is to adapt k-nearest neighbor techniques to deal with multi-label data, where a maximum a posteriori (MAP) rule is utilized to make predictions by reasoning with the labeling information embodied in the neighbors. The basic idea of BP-MLL is to adapt feed-forward neural networks to deal with multi-label data, where the error back propagation strategy is employed to minimize a global error function capturing label correlations.
In the multi-label setting, the classes assigned to one instance are often related to each other. A multi-label learning system performs poorly if it ignores the relationships between the different labels of each instance. Therefore, the well-known Rank-SVM defines the margin over hyperplanes for relevant–irrelevant label pairs, which explicitly characterizes the label correlations of an individual instance. Rank-SVM achieves high accuracy. Unfortunately, Rank-SVM has a high computational cost, which limits its usability for many applications. Therefore, it is still necessary to build novel, efficient multi-label algorithms.
Recently, Huang et al. [14,15,16] proposed a novel learning algorithm for single-hidden layer feedforward neural networks called extreme learning machine (ELM). The single-hidden layer feedforward neural networks have been widely applied in machine learning [17,18,19,20,21,22] and ELM represents one of the recent successful approaches in machine learning. Compared with traditional computational intelligence techniques, ELM exhibits better generalization performance at a much faster learning speed and with fewer human interventions. ELM techniques have received considerable attention in computational intelligence and machine learning communities, in both theoretic study and applications [23,24,25,26,27,28,29]. ELM provides solutions to regression, clustering, feature learning, binary classification and multiclass classifications, but not to multi-label learning, which is a harder task than traditional binary and multi-class problems. Therefore, a thresholding method-based ELM is proposed in this paper to adapt ELM to multi-label classification, called ELM-ML (Extreme Learning Machine for Multi-Label classification). Experiments on eight multi-label datasets show that the performance of ELM-ML is superior to some other well-established multi-label learning algorithms including Rank-SVM, ML-kNN, BP-MLL and Multi-Label Naïve Bayes (MLNB) in most cases, especially for applications which only have a small labeled data set.

2. A Brief Review of ELM

This section briefly reviews the standard ELM [14]. ELM was originally proposed for single-hidden-layer feedforward neural networks (SLFNs). For $N$ arbitrary distinct samples $(x_i, t_i)$, where $x_i = [x_{i1}, x_{i2}, \ldots, x_{id}]^T \in \mathbb{R}^d$ and $t_i = [t_{i1}, t_{i2}, \ldots, t_{im}]^T \in \mathbb{R}^m$, standard SLFNs with $\tilde{N}$ hidden nodes and activation function $g(x)$ are mathematically modeled as
$$\sum_{i=1}^{\tilde{N}} \beta_i g_i(x_j) = \sum_{i=1}^{\tilde{N}} \beta_i g(w_i \cdot x_j + b_i) = o_j, \quad j = 1, \ldots, N \tag{1}$$
where $w_i = [w_{i1}, w_{i2}, \ldots, w_{id}]^T$ is the weight vector connecting the ith hidden node and the input nodes, $\beta_i = [\beta_{i1}, \beta_{i2}, \ldots, \beta_{im}]^T$ is the weight vector connecting the ith hidden node and the output nodes, and $b_i$ is the threshold (bias) of the ith hidden node. $w_i \cdot x_j$ denotes the inner product of $w_i$ and $x_j$. If an SLFN with $\tilde{N}$ hidden nodes and activation function $g(x)$ can approximate these $N$ samples with zero error, it means that $\sum_{j=1}^{N} \| o_j - t_j \| = 0$, i.e., there exist $\beta_i$, $w_i$ and $b_i$ such that
$$\sum_{i=1}^{\tilde{N}} \beta_i g(w_i \cdot x_j + b_i) = t_j, \quad j = 1, \ldots, N \tag{2}$$
The above $N$ equations can be written compactly as
$$H \beta = T \tag{3}$$
where
$$H = \begin{bmatrix} h(x_1) \\ \vdots \\ h(x_N) \end{bmatrix} = \begin{bmatrix} g(w_1 \cdot x_1 + b_1) & \cdots & g(w_{\tilde{N}} \cdot x_1 + b_{\tilde{N}}) \\ \vdots & \ddots & \vdots \\ g(w_1 \cdot x_N + b_1) & \cdots & g(w_{\tilde{N}} \cdot x_N + b_{\tilde{N}}) \end{bmatrix}_{N \times \tilde{N}} \tag{4}$$
$$\beta = \begin{bmatrix} \beta_1^T \\ \vdots \\ \beta_{\tilde{N}}^T \end{bmatrix}_{\tilde{N} \times m} \quad \text{and} \quad T = \begin{bmatrix} t_1^T \\ \vdots \\ t_N^T \end{bmatrix}_{N \times m} \tag{5}$$
$H$ is called the hidden layer output matrix of the SLFN; the ith column of $H$ is the output of the ith hidden node with respect to inputs $x_1, x_2, \ldots, x_N$. It has been proven [15] that if the activation function $g$ is infinitely differentiable in any interval, the hidden layer parameters can be randomly generated. Therefore, Formula (3) becomes a linear system and the output weights $\beta$ are estimated as
$$\hat{\beta} = H^{\dagger} T, \tag{6}$$
where $H^{\dagger}$ is the Moore–Penrose generalized inverse of the hidden layer output matrix $H$. Thus, ELM randomly generates the hidden node parameters and then analytically calculates the hidden layer output matrix $H$ and the output weights $\beta$. This avoids any long training procedure in which the hidden layer of the SLFN needs to be tuned. Compared with traditional computational intelligence techniques, ELM provides better generalization performance at a much faster learning speed and with fewer human interventions.
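The training procedure reviewed in this section can be illustrated with a minimal NumPy sketch. It is not the authors' Matlab implementation: the uniform range of the random hidden parameters, the toy regression data and the function names `elm_fit` and `elm_predict` are assumptions made purely for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def elm_fit(X, T, n_hidden, rng):
    """Basic ELM: random hidden layer, output weights from the pseudoinverse."""
    d = X.shape[1]
    # Hidden parameters are drawn at random and never tuned.
    # The uniform range [-1, 1] is an assumption of this sketch.
    W = rng.uniform(-1.0, 1.0, size=(d, n_hidden))  # column i holds w_i
    b = rng.uniform(-1.0, 1.0, size=n_hidden)       # b_i for each hidden node
    H = sigmoid(X @ W + b)                          # hidden layer output, N x n_hidden
    beta = np.linalg.pinv(H) @ T                    # Moore-Penrose solution of H beta = T
    return W, b, beta

def elm_predict(X, W, b, beta):
    return sigmoid(X @ W + b) @ beta

# Toy regression problem (hypothetical data).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
T = np.sin(X.sum(axis=1, keepdims=True))
W, b, beta = elm_fit(X, T, n_hidden=10, rng=rng)
residual = np.linalg.norm(elm_predict(X, W, b, beta) - T)
```

Because the only fitted quantity is `beta`, training reduces to a single linear least-squares solve, which is where the speed advantage described above comes from.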

3. ELM-ML

In this section, we will describe our multi-label classification algorithm, called an extreme learning machine for multi-label classification (ELM-ML).
From the standard optimization method point of view [15], ELM with multi-output nodes can be formulated as
$$\text{Minimize: } \frac{1}{2} \| \beta \|^2 + C \, \frac{1}{2} \sum_{i=1}^{N} \| \xi_i \|^2, \quad \text{subject to } h(x_i) \beta = t_i^T - \xi_i^T, \quad i = 1, \ldots, N \tag{7}$$
Formula (7) seeks both the smallest training error and the smallest norm of the output weights. Here $N$ is the number of training samples, and $\xi_i = [\xi_{i,1}, \ldots, \xi_{i,m}]^T$ is the training error vector of the $m$ output nodes with respect to the training sample $x_i$. $C$ is a user-specified parameter that provides a trade-off between the distance of the separating margin and the training error. The predicted class label of a given testing sample is the index of the output node with the highest output value. Formula (7) therefore provides a solution to multi-class classification.
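As an illustration of the optimization problem above, the sketch below uses the standard closed-form ridge solution $\beta = (I/C + H^T H)^{-1} H^T T$, which follows from substituting the constraint into the objective and setting the gradient to zero. The text does not state which solver the authors used, so this closed form, the helper names and the tiny identity-matrix example are assumptions.

```python
import numpy as np

def ridge_output_weights(H, T, C):
    """Minimize 0.5*||beta||^2 + 0.5*C*||H beta - T||^2 in closed form."""
    n_hidden = H.shape[1]
    A = np.eye(n_hidden) / C + H.T @ H
    return np.linalg.solve(A, H.T @ T)

def predict_single_label(H, beta):
    """Multi-class rule: index of the output node with the highest value."""
    return np.argmax(H @ beta, axis=1)

# Hypothetical example: three samples whose hidden outputs are orthonormal,
# with one-hot targets for three classes.
H_demo = np.eye(3)
T_demo = np.eye(3)
beta_demo = ridge_output_weights(H_demo, T_demo, C=10.0)
labels = predict_single_label(H_demo, beta_demo)
```

A larger `C` penalizes training error more heavily, while a smaller `C` shrinks the output weights toward zero.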
Multi-label learning is a harder task than traditional multi-class classification, which is a special case of multi-label classification. One sample may belong to several related labels simultaneously, so we cannot simply regard the index of the highest output value as the predicted class for a given testing sample. A proper thresholding function $th(x)$ should be set. The predicted class labels for a given testing sample are then the indices of those output nodes whose output value is higher than the threshold.
Setting $th(x)$ to a constant function is a common multi-label thresholding method. One straightforward choice is to use zero as the calibration constant [7,8]. An alternative choice for the calibration constant is 0.5, when the multi-label learned model $f(x, y)$ represents the posterior probability of $y$ being a proper label of $x$ [10,11,23]. Another popular approach, called experimental thresholding [30], consists of testing different values as thresholds on a validation set and choosing the value which maximizes the effectiveness. Of all the threshold calibration strategies above, experimental thresholding seems to be the most reasonable, but it is time-consuming and labor-intensive. We believe that the thresholding function $th(x)$ should be learned from instances. That is to say, different instances should correspond to different thresholds in a multi-label learning model. A novel method is to treat the thresholding function $th(x)$ as a regression problem on the training data. In this paper, we use an ELM algorithm with a single output node to solve this regression problem. Overall, the proposed ELM-ML algorithm has two phases: an ELM with multiple output nodes that acts as a multi-class classifier, and an ELM with a single output node that learns the thresholding function. The pseudo-code of ELM-ML is summarized in Figure 1.
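The two-phase idea can be sketched as follows. Because the pseudo-code of Figure 1 is not reproduced in this text, several details are assumptions rather than the authors' method: labels are encoded as +1/−1, the regression target for each training instance is the cut point between its sorted scores that misclassifies the fewest labels (a construction borrowed from the stacking-style procedure of [12]), and both phases use the ridge closed form. All function names are hypothetical.

```python
import numpy as np

def hidden(X, W, b):
    return 1.0 / (1.0 + np.exp(-(X @ W + b)))

def fit_elm(X, T, n_hidden, C, rng):
    W = rng.uniform(-1.0, 1.0, size=(X.shape[1], n_hidden))
    b = rng.uniform(-1.0, 1.0, size=n_hidden)
    H = hidden(X, W, b)
    beta = np.linalg.solve(np.eye(n_hidden) / C + H.T @ H, H.T @ T)
    return W, b, beta

def apply_elm(X, model):
    W, b, beta = model
    return hidden(X, W, b) @ beta

def threshold_targets(F, Y):
    """Assumed target rule: per instance, the cut with the fewest label errors."""
    targets = np.empty(F.shape[0])
    for i, (f, y) in enumerate(zip(F, Y)):
        s = np.sort(f)
        # Candidate cuts: below all scores, between neighbours, above all scores.
        cuts = np.concatenate(([s[0] - 1.0], (s[:-1] + s[1:]) / 2.0, [s[-1] + 1.0]))
        errors = [np.sum((f > c) != (y > 0)) for c in cuts]
        targets[i] = cuts[int(np.argmin(errors))]
    return targets

def fit_elm_ml(X, Y, n_hidden=50, C=1.0, seed=0):
    rng = np.random.default_rng(seed)
    scorer = fit_elm(X, Y, n_hidden, C, rng)                 # phase 1: m output nodes
    t = threshold_targets(apply_elm(X, scorer), Y)
    thresholder = fit_elm(X, t[:, None], n_hidden, C, rng)   # phase 2: one output node
    return scorer, thresholder

def predict_elm_ml(X, scorer, thresholder):
    # A label is predicted when its score exceeds the instance-specific threshold.
    return apply_elm(X, scorer) > apply_elm(X, thresholder)

# Hypothetical toy data: 30 samples, 4 features, 3 labels coded as +1/-1.
rng_demo = np.random.default_rng(1)
X_demo = rng_demo.normal(size=(30, 4))
Y_demo = np.where(rng_demo.random((30, 3)) < 0.4, 1.0, -1.0)
scorer, thresholder = fit_elm_ml(X_demo, Y_demo)
Y_hat = predict_elm_ml(X_demo, scorer, thresholder)
```

The second network takes the original features as input, so each test instance receives its own threshold without any validation-set search.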

4. Experiments

Firstly, we compare the thresholding strategy proposed in this paper with the constant thresholding strategy and the strategy used in Rank-SVM [12]. Secondly, we compare the performance of different multi-label classification algorithms, including our algorithm ELM-ML, Rank-SVM, MLNB, BP-MLL [13] and ML-kNN [10], on eight multi-label classification data sets. Before presenting our experimental results, we briefly introduce the eight benchmark data sets and the six multi-label performance measures.

4.1. Datasets

In order to verify the performance of the thresholding strategy and of the different multi-label classification algorithms, a wide variety of data sets were tested in our simulations, covering small/large sizes, low/high dimensions, and small/large numbers of labels. These data sets [31] include diversified multi-label classification cases, which cover four distinct domains: text, scene, music and biology.
To characterize the properties of the multi-label data sets, several useful multi-label indicators can be utilized. The most natural way to measure the degree of multi-labeledness is Label Cardinality (LC): $LC(D) = \frac{1}{n} \sum_{i=1}^{n} |Y_i|$, i.e., the average number of labels per sample, where $n$ is the number of samples in $D$. Accordingly, Label Density (LD) normalizes label cardinality by the number of possible labels in the label space: $LD(D) = \frac{1}{|\mathcal{Y}|} \cdot LC(D)$, where $|\mathcal{Y}|$ is the number of possible labels. Table 1 describes the eight benchmark data sets, in which #Training and #Test denote the numbers of training examples and test examples, respectively. As shown in Table 1, the data sets cover a wide range of cases with diverse characteristics. They are categorized into three groups, namely small, medium and large data sets, according to the number of training samples. Notably, each group contains both low- and high-dimensional data sets and both small and large label sets.
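The two indicators are straightforward to compute from a binary label matrix. The short sketch below assumes a 0/1 matrix with one row per sample and one column per label; the matrix itself is a made-up example, not one of the benchmark data sets.

```python
import numpy as np

def label_cardinality(Y):
    """Average number of labels per sample for a binary (n, m) label matrix."""
    return Y.sum(axis=1).mean()

def label_density(Y):
    """Label cardinality divided by the number of possible labels."""
    return label_cardinality(Y) / Y.shape[1]

# Hypothetical label matrix: 3 samples, 4 possible labels.
Y = np.array([[1, 0, 1, 0],
              [0, 1, 0, 0],
              [1, 1, 1, 0]])
lc = label_cardinality(Y)   # (2 + 1 + 3) / 3
ld = label_density(Y)       # lc / 4
```

The same relation can be checked against Table 1: for Emotions, 1.87 / 6 ≈ 0.312, and for Yeast, 4.24 / 14 ≈ 0.303.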

4.2. Evaluation Measures

With the aim of fair and honest evaluation, performance of the multi-label learning algorithms should be tested on a broad range of metrics instead of only on the one being optimized. In this paper, we chose six evaluation criteria suitable for classification: Subset Accuracy, Hamming Loss, Accuracy, Precision, Recall and F1. These measures are defined as follows [2].
Assume a test data set of size $n$, $S = \{(x_1, Y_1), \ldots, (x_i, Y_i), \ldots, (x_n, Y_n)\}$, and let $h(\cdot)$ be the learned multi-label classifier. A common practice in multi-label learning is to return a real-valued function $f(x, y)$. For an unseen instance $x$, the real-valued output $f(x, y)$ on each label should be calibrated against the thresholding function output $th(x)$.
  • Hamming Loss
The hamming loss evaluates the fraction of misclassified instance-label pairs, i.e., a relevant label is missed or an irrelevant label is predicted.
$$\text{Hamming Loss}(h) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{m} \left| h(x_i) \, \Delta \, Y_i \right|$$
where $\Delta$ stands for the symmetric difference between two sets and $m$ is the number of possible labels; dividing by $m$ makes the value a fraction of instance-label pairs.
  • Subset Accuracy
The subset accuracy evaluates the fraction of correctly classified examples, i.e., examples whose predicted label set is identical to the ground-truth label set. Intuitively, subset accuracy can be regarded as a multi-label counterpart of the traditional accuracy metric, and it tends to be overly strict, especially when the size of the label space is large.
$$\text{Subset Accuracy}(h) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\left[ h(x_i) = Y_i \right]$$
  • Accuracy, Precision, Recall and F1
$$\text{Accuracy}(h) = \frac{1}{n} \sum_{i=1}^{n} \frac{|Y_i \cap h(x_i)|}{|Y_i \cup h(x_i)|}$$
$$\text{Precision}(h) = \frac{1}{n} \sum_{i=1}^{n} \frac{|Y_i \cap h(x_i)|}{|h(x_i)|}$$
$$\text{Recall}(h) = \frac{1}{n} \sum_{i=1}^{n} \frac{|Y_i \cap h(x_i)|}{|Y_i|}$$
$$F_1(h) = \frac{2 \times \text{Precision}(h) \times \text{Recall}(h)}{\text{Precision}(h) + \text{Recall}(h)}$$
Here, $Y_i$ and $h(x_i)$ correspond to the ground-truth and predicted label sets for $x_i$, respectively.
For hamming loss, smaller values indicate better performance; for the other five metrics, larger values indicate better performance.
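The six measures can be computed together from boolean label matrices, as in the sketch below. Two conventions are assumptions of this sketch and are not stated in the text: hamming loss is averaged over all instance-label pairs so that it is a fraction, and a ratio whose denominator is zero (for example, precision for an empty prediction) contributes 0 to the average.

```python
import numpy as np

def multilabel_metrics(Y_true, Y_pred):
    """Example-based metrics for boolean (n, m) label matrices."""
    Y_true = np.asarray(Y_true, dtype=bool)
    Y_pred = np.asarray(Y_pred, dtype=bool)
    inter = (Y_true & Y_pred).sum(axis=1).astype(float)
    union = (Y_true | Y_pred).sum(axis=1).astype(float)

    def mean_ratio(num, den):
        # Assumed convention: a 0/0 term contributes 0 to the average.
        return np.divide(num, den, out=np.zeros_like(num), where=den > 0).mean()

    precision = mean_ratio(inter, Y_pred.sum(axis=1).astype(float))
    recall = mean_ratio(inter, Y_true.sum(axis=1).astype(float))
    total = precision + recall
    return {
        "hamming_loss": (Y_true ^ Y_pred).mean(),              # symmetric difference
        "subset_accuracy": (Y_true == Y_pred).all(axis=1).mean(),
        "accuracy": mean_ratio(inter, union),
        "precision": precision,
        "recall": recall,
        "f1": 0.0 if total == 0 else 2 * precision * recall / total,
    }

# Hypothetical example with two samples and three labels.
Y_true = [[1, 0, 1], [0, 1, 0]]
Y_pred = [[1, 0, 0], [0, 1, 1]]
scores = multilabel_metrics(Y_true, Y_pred)
```

In this example each sample has exactly one wrong label out of three, so the hamming loss is 1/3 while the subset accuracy is 0, which shows how much stricter the latter is.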

4.3. Results

4.3.1. Thresholding Function

We compare three thresholding strategies: the thresholding strategy proposed in ELM-ML, the thresholding strategy in Rank-SVM, and the constant strategy. To obtain an optimal constant thresholding parameter Ct, we tested different values as thresholds on the validation set and chose the value which maximizes the effectiveness. More specifically, hamming loss and subset accuracy are each used in turn as the criterion for tuning Ct. Ct is tuned from −1 to 1 with step size $\delta = 0.1$. Rank-SVM employs a stacking-style procedure to set the thresholding function $th(x)$ [12]. We first apply the multi-class ELM algorithm to produce scores for each sample, and then use the three thresholding strategies, called ELM-ML, ELM-Rank-SVM and ELM-Constant, to predict labels for each sample.
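The grid search over Ct described above can be sketched as follows. The score matrix and labels are made-up validation data, and the helper names are hypothetical; only the grid (−1 to 1 in steps of 0.1) and the two tuning criteria come from the text.

```python
import numpy as np

def hamming_loss(Y_true, Y_pred):
    return np.mean(Y_true != Y_pred)

def subset_accuracy(Y_true, Y_pred):
    return np.mean(np.all(Y_true == Y_pred, axis=1))

def tune_constant_threshold(scores, Y_true, criterion, lower_is_better):
    """Try Ct in {-1.0, -0.9, ..., 1.0} and keep the best value on validation data."""
    candidates = np.round(np.linspace(-1.0, 1.0, 21), 1)
    values = [criterion(Y_true, scores > ct) for ct in candidates]
    best = int(np.argmin(values)) if lower_is_better else int(np.argmax(values))
    return candidates[best], values[best]

# Hypothetical validation scores from a multi-output model, with true labels.
val_scores = np.array([[0.90, -0.30, 0.25],
                       [0.05, 0.70, -0.80]])
val_labels = np.array([[True, False, True],
                       [False, True, False]])
ct_hl, best_hl = tune_constant_threshold(val_scores, val_labels, hamming_loss, True)
ct_sa, best_sa = tune_constant_threshold(val_scores, val_labels, subset_accuracy, False)
```

Ties are resolved in favor of the smallest candidate, which is a choice of this sketch. Each criterion requires its own pass over the grid, which is one reason the constant strategy is costly to tune per data set.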
In the experiments, our computational platform is a 32-bit HP workstation with an Intel® (Santa Clara, CA, USA) Core™ i3-2130 CPU (4 CPUs), except for the TMC2007-500 and TMC2007 data sets. Because the amount of training data for these two data sets is very large, they were run on a 64-bit HP server with an Intel® Core™ i7-4700MQ CPU (8 CPUs).
All results are detailed in Table 2, Table 3 and Table 4. The best performance is highlighted in boldface. As seen in Table 2, Table 3 and Table 4, the thresholding strategy proposed in ELM-ML achieves the highest performance in most cases. It is worth noting that we tuned the parameter Ct very carefully on each data set for the constant thresholding strategy, using the same measure for tuning as for evaluation: when the three thresholding strategies were compared on hamming loss, Ct was tuned on hamming loss; when they were compared on subset accuracy, Ct was tuned on subset accuracy. Even so, the constant thresholding strategy holds only a slight advantage on the subset accuracy criterion for the Scene data set, and it outperforms the others on the hamming loss criterion only on the Enron and CAL500 data sets.
Training times and testing times are listed in Table 4. We only compare the running time of the thresholding strategy in ELM-ML with that of the thresholding strategy in Rank-SVM, because tuning the constant threshold is time-consuming. ELM-ML is faster than ELM-Rank-SVM in both training and testing on every data set. In conclusion, the thresholding strategy proposed in ELM-ML is effective and efficient.

4.3.2. Multi-Label Algorithms

We also compare the performance of different multi-label classification algorithms, namely ELM-ML, Rank-SVM, BP-MLL, MLNB and ML-kNN, on the eight multi-label classification data sets. We downloaded the Matlab code [32] of BP-MLL, MLNB and ML-kNN and used the best parameter settings reported in the literature [10,13,33]. We developed the ELM-ML algorithm in Matlab with a sigmoid activation function and $\tilde{N} = 1000$ hidden nodes. For BP-MLL, the learning rate is fixed at 0.05, the number of hidden neurons is 20% of the number of input neurons, the number of training epochs is set to 100 and the regularization constant is fixed at 0.1. For MLNB, the fraction of remaining features after PCA is set to the moderate value of 0.3. For ML-kNN, the Laplacian estimator s = 1 and k = 10 are used. The number of iterations is fixed at 100. For Rank-SVM, developed in Matlab, a Gaussian kernel is tested, where the kernel parameter $\gamma$ and the cost parameter $C$ need to be chosen appropriately for each data set. In our experiments, the Hamming Loss measure is used as the criterion for tuning these two parameters. To obtain an optimal parameter combination ($\gamma$, $C$), we use a tuning procedure similar to that in [9]. The optimal parameters of Rank-SVM on each data set are shown in Table 5. Due to the large size of the label space, Rank-SVM failed to produce results on CAL500. In addition, Rank-SVM, BP-MLL and MLNB did not output experimental results for TMC2007 and TMC2007-500. The amount of training data in TMC2007 and TMC2007-500 is huge, which leads to high computational complexity, and BP-MLL needs more iterations, which is time-consuming. Therefore, results are missing for some or all of these algorithms on these three data sets.
The detailed experimental results are shown in Table 6, Table 7, Table 8, Table 9, Table 10, Table 11 and Table 12. The best performance among the five algorithms compared is highlighted in boldface. From Table 6, Table 7, Table 8, Table 9, Table 10 and Table 11, ELM-ML obtains the best performance in all six criteria on the small data sets. Genbase, Emotions and CAL500 all have small training sets, regardless of the number of labels and the feature dimensions. That is to say, the proposed ELM-ML is well suited to applications for which a large amount of labeled data is difficult to obtain: it tends to achieve better results than other well-established multi-label algorithms when only a small amount of labeled training data is available. However, ELM-ML is inferior to the others on the medium data sets, where Rank-SVM and ML-kNN work well. Interestingly, when a large amount of labeled data is available, ELM-ML again presents the best performance.
Furthermore, training times and testing times are listed in Table 12. ELM-ML achieves the best testing time and ML-kNN obtains the best training time except on the Scene and TMC2007-500 data sets.

5. Conclusions

In this paper, we present an ELM-ML algorithm to solve multi-label classification. ELM is regarded as a recent successful approach to machine learning, because ELM requires a significantly lower computational time for training a learning model and provides better generalization performance with less human intervention. However, ELM does not provide a solution to multi-label classifications. A post-processing step, threshold calibration strategies, should be used to predict the label set of a given sample. A novel method would be to consider the thresholding function t h ( x ) as a regression problem for training data with class labels. In this paper, we first use an ELM algorithm with multi-output nodes to train a learning model returning a real-valued function, then use the ELM algorithm with a single output node to learn a thresholding function. Experiments on eight diverse benchmark multi-label datasets show that ELM-ML is effective and efficient.

Acknowledgements

The authors wish to thank the anonymous reviewers for their helpful comments and suggestions. The authors also thank Zhihua Zhou, Mingling Zhang and Jianhua Xu, whose software and data have been used in our experiments. The authors also thank Changmeng Jiang and Jingting Xu for doing some related experiments. This work was supported by NSFC 61202184 and the natural science basic research plan in Shaanxi province of China 2015JQ6240.

Author Contributions

Xia Sun proposed the idea in this paper and conceived and designed the experiments. Xia Sun, Jun Feng, Su-Shing Chen wrote the paper. Jingting Xu and Changmeng Jiang performed the experiments. Feijuan He analyzed the data. All of the authors read and approved the final manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Xu, J. Multi-label core vector machine with a zero label. Pattern Recognit. 2014, 47, 2542–2557.
  2. Zhang, M.L.; Zhou, Z.H. A review on multi-label learning algorithms. IEEE Trans. Knowl. Data Eng. 2014, 26, 1819–1837.
  3. Furnkranz, J.; Hullermeier, E.; Mencia, E.L.; Brinker, K. Multilabel classification via calibrated label ranking. Mach. Learn. 2008, 73, 133–153.
  4. Ji, S.; Sun, L.; Jin, R.; Ye, J. Multi-label multiple kernel learning. In Advances in Neural Information Processing Systems 21; Koller, D., Schuurmans, D., Bengio, Y., Bottou, L., Eds.; MIT Press: Cambridge, MA, USA, 2009; pp. 777–784.
  5. Guo, Y.; Schuurmans, D. Adaptive large margin training for multilabel classification. In Proceedings of the 25th AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 7–11 August 2011; pp. 374–379.
  6. Quevedo, J.R.; Luaces, O.; Bahamonde, A. Multilabel classifiers with a probabilistic thresholding strategy. Pattern Recognit. 2012, 45, 876–883.
  7. Boutell, M.R.; Luo, J.; Shen, X.; Brown, C.M. Learning multi-label scene classification. Pattern Recognit. 2004, 37, 1757–1771.
  8. Schapire, R.E.; Singer, Y. Boostexter: A boosting-based system for text categorization. Mach. Learn. 2000, 39, 135–168.
  9. Xu, J. An efficient multi-label support vector machine with a zero label. Expert Syst. Appl. 2012, 39, 2894–4796.
  10. Zhang, M.-L.; Zhou, Z.-H. ML-KNN: A Lazy Learning Approach to Multi-Label Learning. Pattern Recognit. 2007, 40, 2038–2048.
  11. Clare, A.; King, R.D. Knowledge discovery in multi-label phenotype data. In Lecture Notes in Computer Science 2168; de Raedt, L., Siebes, A., Eds.; Springer: Berlin, Germany, 2001; pp. 42–53.
  12. Elisseeff, A.; Weston, J. A kernel method for multi-labelled classification. In Advances in Neural Information Processing Systems; Dietterich, T.G., Becker, S., Ghahramani, Z., Eds.; MIT Press: Cambridge, MA, USA, 2002; Volume 14, pp. 681–687.
  13. Zhang, M.L.; Zhou, Z.H. Multilabel neural networks with applications to functional genomics and text categorization. IEEE Trans. Knowl. Data Eng. 2006, 18, 1338–1351.
  14. Huang, G.B.; Zhu, Q.Y.; Siew, C.K. Extreme learning machine: Theory and applications. Neurocomputing 2006, 70, 489–501.
  15. Huang, G.B.; Zhou, H.M.; Ding, X.J.; Zhang, R. Extreme learning machine for regression and multiclass classification. IEEE Trans. Syst. Man Cybern. Part B Cybern. 2012, 42, 513–529.
  16. Huang, G.B.; Zhou, H.M.; Ding, X.J. Optimization method based extreme learning machine for classification. Neurocomputing 2010, 74, 155–163.
  17. Huang, G.B.; Chen, Y.Q.; Babri, H.A. Classification ability of single hidden layer feedforward neural networks. IEEE Trans. Neural Netw. 2000, 11, 799–801.
  18. Chen, B.D.; Zhao, S.L.; Zhu, P.P.; Principe, J.C. Quantized kernel least mean square algorithm. IEEE Trans. Neural Netw. Learn. Syst. 2012, 23, 22–32.
  19. Chen, B.D.; Zhao, S.L.; Zhu, P.P.; Principe, J.C. Quantized kernel recursive least squares algorithm. IEEE Trans. Neural Netw. Learn. Syst. 2013, 24, 1484–1491.
  20. Chen, S.; Cowan, C.F.; Grant, P.M. Orthogonal least squares learning algorithm for radial basis function networks. IEEE Trans. Neural Netw. 1991, 2, 302–309.
  21. Liu, W.; Principe, J.C.; Haykin, S. Kernel Adaptive Filtering: A Comprehensive Introduction. John Wiley & Sons: Hoboken, NJ, USA, 2011.
  22. Principe, J.C.; Chen, B.D. Universal Approximation with Convex Optimization: Gimmick or Reality? IEEE Comput. Intell. Mag. 2015, 10, 68–77.
  23. Rong, H.-J.; Ong, Y.-S.; Tan, A.-H.; Zhu, Z. A fast pruned-extreme learning machine for classification problem. Neurocomputing 2008, 72, 359–366.
  24. Mohammed, A.A.; Minhas, R.; Jonathan Wu, Q.M.; Sid-Ahmed, M.A. Human face recognition based on multidimensional PCA and extreme learning machine. Pattern Recognit. 2011, 44, 2588–2597.
  25. Wang, Y.; Cao, F.; Yuan, Y. A study on effectiveness of extreme learning machine. Neurocomputing 2011, 74, 2483–2490.
  26. Xia, M.; Zhang, Y.; Weng, L.; Ye, X. Fashion retailing forecasting based on extreme learning machine with adaptive metrics of inputs. Knowl. Based Syst. 2012, 36, 253–259.
  27. Mishra, A.; Goel, A.; Singh, R.; Chetty, G.; Singh, L. A novel image watermarking scheme using extreme learning machine. In Proceedings of the 2012 International Joint Conference on IEEE Neural Networks (IJCNN), Brisbane, Australia, 10–15 June 2012; pp. 1–6.
  28. Horata, P.; Chiewchanwattana, S.; Sunat, K. Robust extreme learning machine. Neurocomputing 2013, 102, 31–44.
  29. Ji, S.; Tang, L.; Yu, S.; Ye, J. Extracting shared subspace for multi-label classification. In Proceedings of the 14th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Las Vegas, NV, USA, 24–27 August 2008; pp. 381–389.
  30. Chang, C.C.; Lin, C.J. LIBSVM: A library for support vector machines. ACM Trans. Intell. Syst. Technol. 2011, 2.
  31. Software & Datasets. Available online: http://computer.njnu.edu.cn/Lab/LABIC/LABIC_Software.html (accessed on 4 June 2016).
  32. Min-Ling Zhang's Publication. Available online: http://cse.seu.edu.cn/people/zhangml/Publication.htm (accessed on 4 June 2016).
  33. Zhang, M.-L.; Peña, J.M.; Robles, V. Feature selection for multi-label naive bayes classification. Inf. Sci. 2009, 179, 3218–3229.
Figure 1. The pseudo-code of ELM-ML.
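Table 1. Information for eight benchmark data sets.
| Dataset | Domain | #Training | #Test | Attributes | Labels | LC | LD |
|---|---|---|---|---|---|---|---|
| Small data sets | | | | | | | |
| Genbase | Biology | 463 | 191 | 1185 | 27 | 1.35 | 0.050 |
| Emotions | Music | 391 | 202 | 72 | 6 | 1.87 | 0.312 |
| CAL500 | Music | 300 | 202 | 68 | 174 | 26.04 | 0.150 |
| Medium data sets | | | | | | | |
| Yeast | Biology | 1500 | 917 | 103 | 14 | 4.24 | 0.303 |
| Scene | Scene | 1211 | 1196 | 294 | 6 | 1.07 | 0.178 |
| Enron | Text | 1123 | 579 | 1001 | 53 | 3.38 | 0.064 |
| Large data sets | | | | | | | |
| TMC2007-500 | Text | 21519 | 7077 | 500 | 22 | 2.16 | 0.098 |
| TMC2007 | Text | 21519 | 7077 | 49060 | 22 | 2.16 | 0.098 |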
Table 2. Hamming loss for three thresholding strategies on eight data sets (hamming loss ↓).
| Datasets | ELM-ML | ELM-Rank-SVM | ELM-Constant |
|---|---|---|---|
| Small data sets | | | |
| Genbase | $9.3058 \times 10^{-4}$ | 0.9544 | 0.0013 (Ct = −0.6) |
| Emotions | 0.2087 | 0.2145 | 0.2129 (Ct = −0.2) |
| CAL500 | 0.1621 | 0.1837 | 0.1447 (Ct = 0.6) |
| Medium data sets | | | |
| Yeast | 0.1980 | 0.2354 | 0.2052 (Ct = 0.0) |
| Scene | 0.1193 | 0.1178 | 0.1506 (Ct = 0.8) |
| Enron | 0.0851 | 0.9290 | 0.0598 (Ct = 0.8) |
| Large data sets | | | |
| TMC2007-500 | 0.0537 | 0.0568 | 0.0537 (Ct = −0.1) |
| TMC2007 | 0.0631 | 0.0854 | 0.0632 (Ct = −0.1) |
Table 3. Subset accuracy for three thresholding strategies on eight data sets (subset accuracy ↑).
| Datasets | ELM-ML | ELM-Rank-SVM | ELM-Constant |
|---|---|---|---|
| Small data sets | | | |
| Genbase | 0.9749 | 0 | 0.9648 (Ct = −0.6) |
| Emotions | 0.2673 | 0.2426 | 0.2574 (Ct = −0.2) |
| CAL500 | 0 | 0 | 0 (Ct = −1) |
| Medium data sets | | | |
| Yeast | 0.1919 | 0.0796 | 0.1603 (Ct = −0.2) |
| Scene | 0.5042 | 0.4339 | 0.5117 (Ct = 0) |
| Enron | 0.0708 | 0 | 0.0570 (Ct = 0.4) |
| Large data sets | | | |
| TMC2007 | 0.3290 | 0.3213 | 0.3150 (Ct = −0.1) |
| TMC2007-500 | 0.2538 | 0.0308 | 0.2436 (Ct = −0.1) |
Table 4. Computation time of non-constant thresholding strategies.
| Datasets | ELM-ML training time (s) | ELM-ML testing time (s) | ELM-Rank-SVM training time (s) | ELM-Rank-SVM testing time (s) |
|---|---|---|---|---|
| Small data sets | | | | |
| Genbase | 1.0739 | 0.0212 | 4.6656 | 0.0673 |
| Emotions | 0.5016 | 0.0064 | 0.7916 | 0.0190 |
| CAL500 | 1.1490 | 0.0159 | 14.110 | 0.2024 |
| Medium data sets | | | | |
| Yeast | 0.8094 | 0.0291 | 7.3444 | 0.1197 |
| Scene | 1.4300 | 0.0508 | 2.4536 | 0.1110 |
| Enron | 3.3922 | 0.0504 | 19.6976 | 0.2460 |
| Large data sets | | | | |
| TMC2007-500 | 150.83 | 0.2312 | 198.72 | 0.7656 |
| TMC2007 | 1524 | 77.3831 | 2277 | 78.7663 |
Table 5. The optimal γ and C values for each data set.
| | Genbase | Emotions | Yeast | Scene | Enron |
|---|---|---|---|---|---|
| γ | −3 | −3 | 0 | −4 | −2 |
| C | 0.25 | 0.125 | 0.125 | 1 | 8 |
| Hamming Loss | 0.0865 | 0.3317 | 0.2330 | 0.8077 | 0.0560 |
Table 6. Hamming loss for five algorithms on eight data sets (hamming loss ↓).
| Datasets | ELM-ML | ML-kNN | BP-MLL | MLNB | Rank-SVM |
|---|---|---|---|---|---|
| Small data sets | | | | | |
| Genbase | $9.3058 \times 10^{-4}$ | 0.0043 | 0.0037 | 0.0456 | 0.0865 |
| Emotions | 0.2087 | 0.2104 | 0.2252 | 0.3317 | 0.3317 |
| CAL500 | 0.1381 | 0.1393 | 0.1409 | 0.1755 | - |
| Medium data sets | | | | | |
| Yeast | 0.1980 | 0.1996 | 0.2084 | 0.2330 | 0.2330 |
| Scene | 0.1193 | 0.0989 | 0.2907 | 0.2121 | 0.8077 |
| Enron | 0.0851 | 0.0519 | 0.0532 | 0.1339 | 0.0560 |
| Large data sets | | | | | |
| TMC2007-500 | 0.0537 | 0.0576 | 0.0806 | 0.1663 | - |
| TMC2007 | 0.0631 | 0.0652 | - | - | - |
Table 7. Subset accuracy for five algorithms on eight data sets (subset accuracy ↑).
Datasets | ELM-ML | ML-kNN | BP-MLL | MLNB | Rank-SVM
Small data sets
Genbase | 0.9749 | 0.9246 | 0.9045 | 0.0000 | 0.0000
Emotions | 0.2673 | 0.2376 | 0.2327 | 0.0854 | 0.0000
CAL500 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | -
Medium data sets
Yeast | 0.1919 | 0.1592 | 0.1516 | 0.0153 | 0.1680
Scene | 0.5042 | 0.5727 | 0.1605 | 0.1630 | 0.0000
Enron | 0.0708 | 0.0432 | 0.1002 | 0.0069 | 0.0397
Large data sets
TMC2007-500 | 0.3290 | 0.3070 | 0.1850 | 0.0418 | -
TMC2007 | 0.2538 | 0.2436 | - | - | -
Table 8. Precision for five algorithms on eight data sets (precision ↑).
Datasets | ELM-ML | ML-kNN | BP-MLL | MLNB | Rank-SVM
Small data sets
Genbase | 0.9965 | 0.9799 | 0.9724 | 0.0568 | 0.9875
Emotions | 0.8083 | 0.7907 | 0.6625 | 0.5948 | 0.7005
CAL500 | 0.457613 | 0.6040 | 0.5856 | 0.4224 | -
Medium data sets
Yeast | 0.7542 | 0.7585 | 0.7505 | 0.7090 | 0.7034
Scene | 0.8200 | 0.8512 | 0.4490 | 0.6682 | 0.7967
Enron | 0.5762 | 0.6234 | 0.6933 | 0.2217 | 0.5949
Large data sets
TMC2007-500 | 0.7637 | 0.7383 | 0.6001 | 0.3726 | -
TMC2007 | 0.6150 | 0.6088 | - | - | -
Table 9. Recall for five algorithms on eight data sets (recall ↑).
Datasets | ELM-ML | ML-kNN | BP-MLL | MLNB | Rank-SVM
Small data sets
Genbase | 0.9958 | 0.9501 | 0.9761 | 0.0000 | 0.9749
Emotions | 0.6691 | 0.5734 | 0.6370 | 0.4711 | 0.6436
CAL500 | 0.3757 | 0.2317 | 0.2717 | 0.3610 | -
Medium data sets
Yeast | 0.6378 | 0.5491 | 0.6437 | 0.5599 | 0.6544
Scene | 0.6413 | 0.6547 | 0.1681 | 0.6083 | 0.6426
Enron | 0.5187 | 0.3364 | 0.6422 | 0.5449 | 0.5664
Large data sets
TMC2007-500 | 0.7190 | 0.6722 | 0.6706 | 0.6212 | -
TMC2007 | 0.5947 | 0.5909 | - | - | -
Table 10. F1 for five algorithms on eight data sets (F1 ↑).
Datasets | ELM-ML | ML-kNN | BP-MLL | MLNB | Rank-SVM
Small data sets
Genbase | 0.9954 | 0.9648 | 0.9717 | 0.0000 | 0.9774
Emotions | 0.6813 | 0.6126 | 0.6483 | 0.5265 | 0.6685
CAL500 | 0.4127 | 0.3350 | 0.3712 | 0.3893 | -
Medium data sets
Yeast | 0.6683 | 0.6276 | 0.6560 | 0.6205 | 0.6637
Scene | 0.6160 | 0.6576 | 0.1701 | 0.5882 | 0.6400
Enron | 0.4660 | 0.4241 | 0.6141 | 0.3853 | 0.5608
Large data sets
TMC2007-500 | 0.7407 | 0.7037 | 0.6334 | 0.4658 | -
TMC2007 | 0.6096 | 0.6072 | - | - | -
Table 11. Accuracy for five algorithms on eight data sets (accuracy ↑).
Datasets | ELM-ML | ML-kNN | BP-MLL | MLNB | Rank-SVM
Small data sets
Genbase | 0.9908 | 0.9502 | 0.9535 | 0.0000 | 0.9749
Emotions | 0.5245 | 0.5008 | 0.5266 | 0.4278 | 0.5552
CAL500 | 0.2595 | 0.2004 | 0.2252 | 0.2368 | -
Medium data sets
Yeast | 0.5265 | 0.4920 | 0.5184 | 0.4802 | 0.5349
Scene | 0.5686 | 0.6293 | 0.1674 | 0.5447 | 0.6106
Enron | 0.3195 | 0.2988 | 0.4712 | 0.2145 | 0.0004
Large data sets
TMC2007-500 | 0.6127 | 0.5789 | 0.4921 | 0.2986 | -
TMC2007 | 0.4877 | 0.4869 | - | - | -
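Tables 8 to 11 above report precision, recall, F1, and accuracy. For readers who want to see how such figures can be obtained, the sketch below computes the commonly used example-based versions of these four measures, where each is calculated per instance from the true and predicted label sets and then averaged. It is an assumption that these are the definitions used in the experiments; the function name, the treatment of empty sets as a score of 0, and the toy label sets (borrowed from the sunrise-image example in the introduction) are likewise illustrative only.

```python
def example_based_metrics(y_true, y_pred):
    """Average per-instance precision, recall, F1 and accuracy over label sets."""
    n = len(y_true)
    precision = recall = f1 = accuracy = 0.0
    for true_set, pred_set in zip(y_true, y_pred):
        inter = len(true_set & pred_set)
        union = len(true_set | pred_set)
        size_sum = len(true_set) + len(pred_set)
        # Assumed convention: a ratio with an empty denominator contributes 0.
        precision += inter / len(pred_set) if pred_set else 0.0
        recall += inter / len(true_set) if true_set else 0.0
        f1 += 2 * inter / size_sum if size_sum else 0.0
        accuracy += inter / union if union else 0.0
    return {
        "precision": precision / n,
        "recall": recall / n,
        "f1": f1 / n,
        "accuracy": accuracy / n,
    }


# Two toy instances with true and predicted label sets (illustrative only).
y_true = [{"sun", "sky", "sea"}, {"sky"}]
y_pred = [{"sun", "sky"}, {"sky", "sea"}]

metrics = example_based_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Because every measure is averaged over instances, the averaged F1 is not in general the harmonic mean of the averaged precision and recall, which is worth keeping in mind when comparing the columns of Tables 8 to 10.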
Table 12. Computation time of four algorithms.
Datasets | ELM-ML train (s) | ELM-ML test (s) | ML-kNN train (s) | ML-kNN test (s) | BP-MLL train (s) | BP-MLL test (s) | MLNB train (s) | MLNB test (s)
Genbase | 1.074 | 0.021 | 0.209 | 0.306 | 1.013 × 10⁴ | 5.949 | 1.466 × 10³ | 0.604
Emotions | 0.502 | 0.006 | 0.131 | 0.158 | 2.606 × 10³ | 1.625 | 2.408 × 10² | 0.122
CAL500 | 1.149 | 0.016 | 0.083 | 0.067 | 9.561 | 0.159 | 0.125 | 0.160
Yeast | 0.809 | 0.029 | 0.417 | 1.358 | 9.805 × 10³ | 7.684 | 1.821 × 10³ | 1.064
Scene | 1.430 | 0.050 | 1.533 | 0.878 | 9.243 × 10³ | 2.840 | 6.561 × 10² | 0.623
Enron | 3.392 | 0.050 | 0.461 | 1.511 | 2.09 × 10⁴ | 22.51 | 3.621 × 10³ | 1.739
TMC2007-500 | 1.740 | 0.231 | 5.113 | 159.3 | 6.89 × 10² | 77.90 | 1.508 × 10² | 1.437
TMC2007 | 1.524 × 10⁴ | 7.383 | 6.856 × 10³ | 1.905 × 10² | - | - | - | -

Share and Cite

MDPI and ACS Style

Sun, X.; Xu, J.; Jiang, C.; Feng, J.; Chen, S.-S.; He, F. Extreme Learning Machine for Multi-Label Classification. Entropy 2016, 18, 225. https://0-doi-org.brum.beds.ac.uk/10.3390/e18060225


Note that from the first issue of 2016, this journal uses article numbers instead of page numbers.
