
A Multiclass Nonparallel Parametric-Margin Support Vector Machine

Shu-Wang Du, Ming-Chuan Zhang, Pei Chen, Hui-Feng Sun, Wei-Jie Chen and Yuan-Hai Shao
1 Zhijiang College, Zhejiang University of Technology, Shaoxing 312030, China
2 College of Computer Science and Technology, Zhejiang University of Technology, Hangzhou 310024, China
3 Management School, Hainan University, Haikou 570228, China
* Author to whom correspondence should be addressed.
Submission received: 15 November 2021 / Revised: 7 December 2021 / Accepted: 8 December 2021 / Published: 10 December 2021

Abstract

The twin parametric-margin support vector machine (TPMSVM) is an excellent kernel-based nonparallel classifier. However, TPMSVM was originally designed for binary classification, which makes it unsuitable for many real-world multiclass applications. Therefore, this paper extends TPMSVM to multiclass classification and proposes a novel K multiclass nonparallel parametric-margin support vector machine (MNP-KSVC). Specifically, our MNP-KSVC enjoys the following characteristics. (1) Under the "one-versus-one-versus-rest" multiclass framework, MNP-KSVC encodes the complicated multiclass learning task into a series of subproblems with the ternary output {−1, 0, +1}. In contrast to the "one-versus-one" or "one-versus-rest" strategies, each subproblem not only focuses on separating the two selected classes but also considers the side information of the remaining classes. (2) MNP-KSVC aims to find a pair of nonparallel parametric-margin hyperplanes for each subproblem. As a result, each hyperplane is close to its corresponding class and at a distance of at least one from the other class, while the remaining class instances are bounded into an insensitive region. (3) MNP-KSVC utilizes a hybrid classification and regression loss joined with regularization to formulate its optimization model, and the optimal solutions are derived from the corresponding dual problems. Finally, we conduct numerical experiments to compare the proposed method with four state-of-the-art multiclass models: Multi-SVM, MBSVM, MTPMSVM, and Twin-KSVC. The experimental results demonstrate the feasibility and effectiveness of MNP-KSVC in terms of multiclass accuracy and learning time.

1. Introduction

Data mining has become an essential tool for integrating information technology and industrialization due to the growing size of available databases [1]. One of the main applications of data mining is supervised classification, which aims to assign to an unseen instance a label that is as correct as possible from a given set of classes, based on the learned model. Recently, the support vector machine (SVM) [2,3] has become a preeminent maximum-margin learning paradigm for data classification. The basic idea of SVM is to find an optimal decision boundary by maximizing the margin between two parallel support hyperplanes. Compared with neural networks, SVM has the following attractive features [3]: (1) the structural risk minimization principle is implemented in SVM to control the upper bound of the generalization error, leading to an excellent generalization ability; (2) the global optimum can be achieved by solving a quadratic programming problem (QPP). Furthermore, kernel techniques enable SVM to handle complicated nonlinear learning tasks effectively. During recent decades, SVM has been successfully applied in a wide variety of fields, ranging from scene classification [4], fault diagnosis [5,6], EEG classification [7,8], pathological diagnosis [9,10], and bioinformatics [8] to power applications [11].
One limitation of the classical SVM is the strictly parallel requirement on the support hyperplanes. Namely, parallel hyperplanes struggle to capture data with a cross-plane distribution [12,13], such as "XOR" problems. To alleviate this issue, nonparallel SVM models have been proposed in the literature [13,14,15,16,17] during the past years. This approach relaxes the parallel requirement of SVM and seeks nonparallel hyperplanes for different classes. The pioneering work is the generalized eigenvalue proximal SVM (GEPSVM) proposed by Mangasarian and Wild [12], which attempts to find a pair of nonparallel hyperplanes by solving eigenvalue problems (EPs). Subsequently, Jayadeva et al. [13] proposed a novel QPP-type nonparallel model for classification, named the twin support vector machine (TWSVM). The idea of TWSVM is to generate two nonparallel hyperplanes such that each hyperplane is closer to one of the two classes and at a distance of at least one from the other class. Compared with the classical SVM, the nonparallel SVM models (GEPSVM and TWSVM) have lower computational complexity and better generalization ability. Therefore, in the last few years, they have been studied extensively and developed rapidly, including a least squares version of TWSVM (LSTSVM) [16], a structural risk minimization version of TWSVM (TBSVM) [17], ν-PTSVM [18], the nonparallel SVM (NPSVM) [19,20], the nonparallel projection SVM (NPrSVM) [21], and so on [21,22,23,24,25,26,27,28].
The above nonparallel models were mainly proposed for binary classification problems. However, most real-world applications [29,30,31,32], such as disease diagnosis, fault detection, image recognition, and text categorization, involve multiclass classification. Therefore, many researchers are interested in extending SVM models from binary to multiclass classification. Generally, the decomposition procedure has been considered an effective way to achieve multiclass extensions. Yang et al. [33] proposed a multiple birth SVM (MBSVM) for multiclass classification based on the "one-versus-rest" strategy, which is the first multiclass extension of a nonparallel SVM model. Angulo et al. [34] proposed a "one-versus-one-versus-rest" multiclass framework. In contrast to the "one-versus-one" strategy, it constructs K(K − 1)/2 classifiers with all data points, which avoids the risk of information loss and class distortion. Following this framework, Xu et al. [35] proposed a multiclass extension of TWSVM, termed Twin-KSVC. Results show that Twin-KSVC has a better generalization ability than MBSVM in most cases. Nasiri et al. [36] formulated Twin-KSVC in the least-squares sense to boost learning efficiency and further presented the LST-KSVC model. Lima et al. [32] proposed an improvement on LST-KSVC (ILST-KSVC) with regularization to implement the structural risk minimization principle.
As a successful extension of SVM, the twin parametric-margin support vector machine (TPMSVM) [15] was proposed to pursue a pair of parametric-margin nonparallel hyperplanes. Unlike GEPSVM and TWSVM, each hyperplane in TPMSVM aims to be closer to its class and far away from the other class. The parametric-margin mechanism enables TPMSVM to be suitable for many cases and results in better generalization performance. However, TPMSVM can only deal with binary classification learning tasks. The above motivates us to propose a novel K multiclass nonparallel parametric-margin support vector machine, termed MNP-KSVC. The proposed MNP-KSVC is endowed with the following attractive advantages:
  • The proposed MNP-KSVC encodes the K multiclass learning task into a series of "one-versus-one-versus-rest" subproblems with all the training instances. Then, it encodes the outputs of the subproblems with the ternary output {−1, 0, +1}, which helps to deal with imbalanced cases.
  • For each subproblem, MNP-KSVC aims to find a pair of nonparallel parametric-margin hyperplanes to separate the two selected classes together with the remaining classes. Unlike TPMSVM, each parametric-margin hyperplane is closer to its class and at a distance of at least one from the other class, while the remaining instances are mapped into an insensitive region.
  • To measure the empirical risks, MNP-KSVC considers a hybrid classification and regression loss: the hinge loss is utilized to penalize the errors of the two focused classes, and the ε-insensitive loss is used for the remaining class instances.
  • Extensive numerical experiments are performed on several multiclass UCI benchmark datasets, and their results are compared with four models (Multi-SVM, MBSVM, MTPMSVM, and Twin-KSVC). The comparative results indicate the effectiveness and feasibility of the proposed MNP-KSVC for multiclass classification.
The remainder of this paper is organized as follows. Section 2 briefly introduces notations and related works. Section 3 proposes our MNP-KSVC. The model optimization is also discussed in Section 3. The nonlinear version of MNP-KSVC is extended in Section 4. Experimental results are described in Section 5 and Section 6 presents a discussion and future work.

2. Preliminaries

In this section, we first describe the notations used throughout the paper. Then, we briefly revisit the nonparallel classifier TPMSVM.

2.1. Notations

In this paper, scalars are denoted by lowercase italic letters, vectors by lowercase boldface letters, and matrices by uppercase letters. All vectors are column vectors unless transformed into row vectors by the transpose superscript (·)ᵀ. A vector of zeros of arbitrary dimension is represented by 0. In addition, we denote by e a vector of ones and by I an identity matrix of arbitrary dimension. Moreover, ‖·‖ stands for the L2-norm.

2.2. TPMSVM

The TPMSVM [15] was originally proposed for binary classification with a heteroscedastic noise structure. It seeks two nonparallel parametric-margin hyperplanes via the following QPPs:
$$\min_{w_1,\,b_1}\ \tfrac{1}{2}\|w_1\|^2 + \nu_1\sum_{j\in I_-}\eta_{1j} + c_1\sum_{i\in I_+}\xi_{1i}\quad \text{s.t.}\quad w_1^\top x_i+b_1\ge -\xi_{1i},\ \xi_{1i}\ge 0,\ i\in I_+;\ \ w_1^\top x_j+b_1=\eta_{1j},\ j\in I_-,$$ (1)
and
$$\min_{w_2,\,b_2}\ \tfrac{1}{2}\|w_2\|^2 - \nu_2\sum_{j\in I_+}\eta_{2j} + c_2\sum_{i\in I_-}\xi_{2i}\quad \text{s.t.}\quad -(w_2^\top x_i+b_2)\ge -\xi_{2i},\ \xi_{2i}\ge 0,\ i\in I_-;\ \ w_2^\top x_j+b_2=\eta_{2j},\ j\in I_+,$$ (2)
where $\nu_1,\nu_2,c_1,c_2$ are positive parameters, and the final decision hyperplane is half of the sum of the two parametric-margin hyperplanes. To obtain the solutions of problems (1) and (2), one resorts to the dual problems
$$\min_{\alpha_1}\ \tfrac{1}{2}\sum_{i\in I_+}\sum_{j\in I_+}\alpha_{1i}\alpha_{1j}\,x_i^\top x_j - \nu_1\sum_{i\in I_+}\sum_{j\in I_-}\alpha_{1i}\,x_i^\top x_j\quad \text{s.t.}\quad \sum_{i\in I_+}\alpha_{1i}=\nu_1,\ \ 0\le\alpha_{1i}\le c_1,$$ (3)
and
$$\min_{\alpha_2}\ \tfrac{1}{2}\sum_{i\in I_-}\sum_{j\in I_-}\alpha_{2i}\alpha_{2j}\,x_i^\top x_j - \nu_2\sum_{i\in I_-}\sum_{j\in I_+}\alpha_{2i}\,x_i^\top x_j\quad \text{s.t.}\quad \sum_{i\in I_-}\alpha_{2i}=\nu_2,\ \ 0\le\alpha_{2i}\le c_2.$$ (4)
Then, the solutions $(w_1,b_1)$ and $(w_2,b_2)$ can be recovered from the solutions of the dual problems (3) and (4) according to the Karush–Kuhn–Tucker (KKT) conditions [3],
$$w_1=\sum_{i\in I_+}\alpha_{1i}x_i-\nu_1\sum_{j\in I_-}x_j \quad\text{and}\quad w_2=-\sum_{i\in I_-}\alpha_{2i}x_i+\nu_2\sum_{j\in I_+}x_j,$$ (5)
$$b_1=-\frac{1}{|I_{SV_1}|}\sum_{i\in I_{SV_1}}w_1^\top x_i \quad\text{and}\quad b_2=-\frac{1}{|I_{SV_2}|}\sum_{i\in I_{SV_2}}w_2^\top x_i,$$ (6)
where $I_{SV_1}$ and $I_{SV_2}$ are the index sets of the support vectors.
Note that, compared with TWSVM [13], TPMSVM can capture more complex heteroscedastic error structures via its parametric-margin hyperplanes. However, the objective of TPMSVM is not strictly convex in the bias term b, so its solution is not necessarily unique. Moreover, the optimization problems (1) and (2) are only designed for binary classification tasks and are thus unsuitable for many real-world multiclass learning applications.
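For readers who want to reproduce this review step, the following minimal NumPy sketch recovers (w1, b1) from a dual solution of problem (3) along the lines of (5) and (6); the function name and the tolerance used to pick support vectors are illustrative assumptions, not part of the original TPMSVM paper.

```python
import numpy as np

def recover_tpmsvm_plane(X_pos, X_neg, alpha1, nu1, c1, tol=1e-6):
    """Recover (w1, b1) of the '+' parametric-margin hyperplane from the dual
    solution alpha1 of problem (3), following (5)-(6).
    X_pos: (m1, n) positive instances; X_neg: (m2, n) negative instances."""
    # (5): w1 = sum_i alpha1_i x_i - nu1 * sum_j x_j
    w1 = X_pos.T @ alpha1 - nu1 * X_neg.sum(axis=0)
    # support vectors: 0 < alpha1_i < c1 (active constraint with zero slack)
    sv = (alpha1 > tol) & (alpha1 < c1 - tol)
    # (6): b1 is minus the mean projection of the support vectors on w1
    b1 = -np.mean(X_pos[sv] @ w1) if sv.any() else -np.mean(X_pos @ w1)
    return w1, b1
```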

3. The Proposed MNP-KSVC

3.1. Model Formulation

To address the above issues in TPMSVM, this subsection proposes a novel K multiclass nonparallel parametric-margin support vector machine, termed MNP-KSVC. Inspired by the "hybrid classification and regression" learning paradigm [35], MNP-KSVC decomposes the complicated K multiclass learning task into a series of "one-versus-one-versus-rest" subproblems. Each subproblem focuses on the separation of the two selected classes together with the remaining classes. Here, we use {+1, −1} to label the two selected classes and 0 to label the rest; that is, each subproblem is encoded with the ternary output {−1, 0, +1}. The main idea of MNP-KSVC is to find a pair of nonparallel parametric-margin hyperplanes for each subproblem,
$$f_1(x)=w_1^\top x+b_1=0 \quad\text{and}\quad f_2(x)=w_2^\top x+b_2=0,$$ (7)
such that each hyperplane is close to its corresponding class and as far as possible from the other class on one side. Moreover, the remaining classes are restricted to a region between these hyperplanes.
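As a concrete illustration of this ternary encoding, a minimal Python sketch (the function name is ours, not from the paper) of the relabeling step for one (k_i, k_j)-pair subproblem is given below.

```python
import numpy as np

def ternary_relabel(y, k_i, k_j):
    """Encode the (k_i, k_j)-pair subproblem with the ternary output {+1, -1, 0}:
    class k_i -> +1, class k_j -> -1, every remaining class -> 0."""
    z = np.zeros_like(y, dtype=int)
    z[y == k_i] = +1
    z[y == k_j] = -1
    return z

# e.g., for a 4-class problem, the (1, 3)-pair subproblem relabels
# y = [1, 2, 3, 4, 1] as z = [+1, 0, -1, 0, +1]
```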
Formally, we use $I_k$ to denote the index set of instances with label $k$ in the subproblem, where $k\in\{+1,0,-1\}$; for brevity we write these sets as $I_+$, $I_-$, and $I_0$. Inspired by TPMSVM [15], our MNP-KSVC considers the following two loss functions for the above ternary-output learning problem:
$$R_1^{\mathrm{emp}}=\nu_1\sum_{j\in I_-}(w_1^\top x_j+b_1)+c_1\sum_{i\in I_+}\max\bigl(0,\,1-(w_1^\top x_i+b_1)\bigr)+c_3\sum_{l\in I_0}\max\bigl(0,\,(\varepsilon-1)+(w_1^\top x_l+b_1)\bigr)$$ (8)
and
$$R_2^{\mathrm{emp}}=-\nu_2\sum_{i\in I_+}(w_2^\top x_i+b_2)+c_2\sum_{j\in I_-}\max\bigl(0,\,1+(w_2^\top x_j+b_2)\bigr)+c_4\sum_{l\in I_0}\max\bigl(0,\,(\varepsilon-1)-(w_2^\top x_l+b_2)\bigr),$$ (9)
where $\nu_1,\nu_2,c_1,c_2,c_3,c_4>0$ are penalty parameters and $\varepsilon\in(0,1]$ is a margin parameter. Introducing the regularization term $\tfrac{1}{2}(\|w\|^2+b^2)$ yields the primal problems of MNP-KSVC:
$$\min_{w_1,b_1,\xi_1,\eta_1}\ \tfrac{1}{2}(\|w_1\|^2+b_1^2)+\nu_1\sum_{j\in I_-}(w_1^\top x_j+b_1)+c_1\sum_{i\in I_+}\xi_{1i}+c_3\sum_{l\in I_0}\eta_{1l}\quad\text{s.t.}\quad w_1^\top x_i+b_1\ge 1-\xi_{1i},\ \xi_{1i}\ge 0,\ i\in I_+;\ \ -(w_1^\top x_l+b_1)\ge(\varepsilon-1)-\eta_{1l},\ \eta_{1l}\ge 0,\ l\in I_0,$$ (10)
and
$$\min_{w_2,b_2,\xi_2,\eta_2}\ \tfrac{1}{2}(\|w_2\|^2+b_2^2)-\nu_2\sum_{i\in I_+}(w_2^\top x_i+b_2)+c_2\sum_{j\in I_-}\xi_{2j}+c_4\sum_{l\in I_0}\eta_{2l}\quad\text{s.t.}\quad -(w_2^\top x_j+b_2)\ge 1-\xi_{2j},\ \xi_{2j}\ge 0,\ j\in I_-;\ \ w_2^\top x_l+b_2\ge(\varepsilon-1)-\eta_{2l},\ \eta_{2l}\ge 0,\ l\in I_0,$$ (11)
where $\xi_1,\eta_1,\xi_2,\eta_2$ are non-negative slack vectors.
To illustrate the mechanism of MNP-KSVC, we now give the following analysis and geometrical explanation for problem (10):
  • The first term is the L2-norm of w1 and b1. Minimizing it regulates the model complexity of MNP-KSVC and avoids over-fitting. Furthermore, this regularization term makes the QPPs strictly convex, leading to a unique solution.
  • The second term is the sum of the projection values of the −1 labeled instances x_j, j ∈ I_−, on f_1(x). Minimizing this term pushes the instances in I_− as far as possible from the +1 labeled parametric-margin hyperplane f_1(x) = 0 on its negative side.
  • The third term, together with the first constraint, requires the projection values of the +1 labeled instances x_i, i ∈ I_+, on the hyperplane f_1(x) to be no less than 1. Otherwise, a slack variable ξ_{1i} measures the error when the constraint is violated.
  • The last term, together with the second constraint, requires the projection values of the remaining 0 labeled instances x_l, l ∈ I_0, on the hyperplane f_1(x) to be no more than 1 − ε. Otherwise, a slack variable η_{1l} measures the corresponding error. This keeps the instances in I_0 at least ε away from the +1 labeled margin; moreover, ε controls the margin between the "+" and "0" labeled instances.
The geometrical explanation for problem (11) is similar. Let $u_1=[w_1;\,b_1]$, $u_2=[w_2;\,b_2]$, and $\tilde{x}=[x;\,1]$. For the sake of simplicity, denote by $A=\{\tilde{x}_i\}_{i\in I_+}$, $B=\{\tilde{x}_j\}_{j\in I_-}$, and $C=\{\tilde{x}_l\}_{l\in I_0}$ the matrices whose rows are the augmented instances with labels +1, −1, and 0, respectively. Then, the matrix formulations of problems (10) and (11) can be expressed as
$$\min_{u_1,\xi_1,\eta_1}\ \tfrac{1}{2}\|u_1\|^2+\nu_1 e_-^\top B u_1+c_1 e_+^\top\xi_1+c_3 e_0^\top\eta_1\quad\text{s.t.}\quad A u_1\ge e_+-\xi_1,\ \xi_1\ge 0;\ \ -C u_1\ge(\varepsilon-1)e_0-\eta_1,\ \eta_1\ge 0,$$ (12)
and
$$\min_{u_2,\xi_2,\eta_2}\ \tfrac{1}{2}\|u_2\|^2-\nu_2 e_+^\top A u_2+c_2 e_-^\top\xi_2+c_4 e_0^\top\eta_2\quad\text{s.t.}\quad -B u_2\ge e_--\xi_2,\ \xi_2\ge 0;\ \ C u_2\ge(\varepsilon-1)e_0-\eta_2,\ \eta_2\ge 0.$$ (13)
In what follows, we discuss the solutions of problems (12) and (13).
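Before moving to the optimization, the following short NumPy sketch shows how the augmented matrices A, B, and C of problems (12) and (13) can be assembled from a relabeled dataset; it is a sketch under our own naming conventions rather than the authors' code.

```python
import numpy as np

def build_abc(X, z):
    """Stack the augmented instances x_tilde = [x; 1] of the +1, -1 and 0
    labelled points into the matrices A, B and C used in problems (12)-(13)."""
    X_tilde = np.hstack([X, np.ones((X.shape[0], 1))])  # append the bias column
    A = X_tilde[z == +1]   # rows indexed by I_+
    B = X_tilde[z == -1]   # rows indexed by I_-
    C = X_tilde[z == 0]    # rows indexed by I_0
    return A, B, C
```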

3.2. Model Optimization

To obtain solutions to problems (12) and (13), we first derive their dual problems by Theorem 1.
Theorem 1.
Optimization problems
$$\min_{\alpha_1,\beta_1}\ \tfrac{1}{2}\begin{bmatrix}\alpha_1\\ \beta_1\end{bmatrix}^\top\begin{bmatrix}AA^\top & -AC^\top\\ -CA^\top & CC^\top\end{bmatrix}\begin{bmatrix}\alpha_1\\ \beta_1\end{bmatrix}-\Bigl(\nu_1 e_-^\top\begin{bmatrix}BA^\top & -BC^\top\end{bmatrix}+\begin{bmatrix}e_+^\top & (\varepsilon-1)e_0^\top\end{bmatrix}\Bigr)\begin{bmatrix}\alpha_1\\ \beta_1\end{bmatrix}\quad\text{s.t.}\quad 0\le\alpha_1\le c_1 e_+,\ \ 0\le\beta_1\le c_3 e_0,$$ (14)
and
$$\min_{\alpha_2,\beta_2}\ \tfrac{1}{2}\begin{bmatrix}\alpha_2\\ \beta_2\end{bmatrix}^\top\begin{bmatrix}BB^\top & -BC^\top\\ -CB^\top & CC^\top\end{bmatrix}\begin{bmatrix}\alpha_2\\ \beta_2\end{bmatrix}-\Bigl(\nu_2 e_+^\top\begin{bmatrix}AB^\top & -AC^\top\end{bmatrix}+\begin{bmatrix}e_-^\top & (\varepsilon-1)e_0^\top\end{bmatrix}\Bigr)\begin{bmatrix}\alpha_2\\ \beta_2\end{bmatrix}\quad\text{s.t.}\quad 0\le\alpha_2\le c_2 e_-,\ \ 0\le\beta_2\le c_4 e_0,$$ (15)
are the dual problems of (12) and (13), respectively.
Proof of Theorem 1. 
Taking problem (12) as an example, firstly, we construct its Lagrangian function as
$$L(\Xi_1)=\tfrac{1}{2}\|u_1\|^2+\nu_1 e_-^\top B u_1+c_1 e_+^\top\xi_1+c_3 e_0^\top\eta_1-\alpha_1^\top(A u_1+\xi_1-e_+)-\beta_1^\top\bigl(-C u_1+\eta_1-(\varepsilon-1)e_0\bigr)-\varphi_1^\top\xi_1-\gamma_1^\top\eta_1,$$ (16)
where $\alpha_1,\beta_1,\varphi_1,\gamma_1$ are the non-negative Lagrange multipliers of the constraints of problem (12), and $\Xi_1=\{u_1,\xi_1,\eta_1,\alpha_1,\beta_1,\varphi_1,\gamma_1\}$. According to the KKT conditions [3,37], the Lagrangian function (16) has to be maximized with respect to its dual variables $\alpha_1,\beta_1,\varphi_1,\gamma_1$ while being minimized with respect to its primal variables $u_1,\xi_1,\eta_1$. Differentiating $L(\Xi_1)$ with respect to $u_1,\xi_1,\eta_1$, the optimality conditions of problem (12) are obtained as
$$\frac{\partial L}{\partial u_1}=u_1+\nu_1 B^\top e_--A^\top\alpha_1+C^\top\beta_1=0,$$ (17)
$$\frac{\partial L}{\partial \xi_1}=c_1 e_+-\alpha_1-\varphi_1=0,$$ (18)
$$\frac{\partial L}{\partial \eta_1}=c_3 e_0-\beta_1-\gamma_1=0,$$ (19)
$$\alpha_1^\top(A u_1+\xi_1-e_+)=0,$$ (20)
$$\beta_1^\top\bigl(-C u_1+\eta_1-(\varepsilon-1)e_0\bigr)=0,$$ (21)
$$\varphi_1^\top\xi_1=0,$$ (22)
$$\gamma_1^\top\eta_1=0.$$ (23)
From (17), we have
$$u_1=A^\top\alpha_1-C^\top\beta_1-\nu_1 B^\top e_-.$$ (24)
Since $\varphi_1,\gamma_1\ge 0$, from (18) and (19) we derive
$$0\le\alpha_1\le c_1 e_+ \quad\text{and}\quad 0\le\beta_1\le c_3 e_0.$$ (25)
Finally, substituting (24) into the Lagrangian function (16) and using the KKT conditions (17)–(23), the dual problem of (12) can be formulated as
$$\min_{\alpha_1,\beta_1}\ \tfrac{1}{2}\begin{bmatrix}\alpha_1\\ \beta_1\end{bmatrix}^\top\begin{bmatrix}AA^\top & -AC^\top\\ -CA^\top & CC^\top\end{bmatrix}\begin{bmatrix}\alpha_1\\ \beta_1\end{bmatrix}-\Bigl(\nu_1 e_-^\top\begin{bmatrix}BA^\top & -BC^\top\end{bmatrix}+\begin{bmatrix}e_+^\top & (\varepsilon-1)e_0^\top\end{bmatrix}\Bigr)\begin{bmatrix}\alpha_1\\ \beta_1\end{bmatrix}\quad\text{s.t.}\quad 0\le\alpha_1\le c_1 e_+,\ \ 0\le\beta_1\le c_3 e_0.$$ (26)
Similarly, we can derive the dual problem of (13) as problem (15). □
For ease of notation, define $q_1=[\alpha_1;\,\beta_1]$, $N_1=[A;\,-C]$, $h_1=[e_+;\,(\varepsilon-1)e_0]$, and $e_{q_1}=[c_1 e_+;\,c_3 e_0]$. Then, problem (14) can be succinctly reformulated as
$$\min_{q_1}\ \tfrac{1}{2}q_1^\top H_1 q_1-d_1^\top q_1 \quad\text{s.t.}\quad 0\le q_1\le e_{q_1},$$ (27)
where
$$H_1=N_1 N_1^\top=\begin{bmatrix}AA^\top & -AC^\top\\ -CA^\top & CC^\top\end{bmatrix}$$ (28)
and
$$d_1^\top=\nu_1 e_-^\top B N_1^\top+h_1^\top=\nu_1 e_-^\top\begin{bmatrix}BA^\top & -BC^\top\end{bmatrix}+\begin{bmatrix}e_+^\top & (\varepsilon-1)e_0^\top\end{bmatrix}.$$ (29)
Similarly, define $q_2=[\alpha_2;\,\beta_2]$, $N_2=[-B;\,C]$, $h_2=[e_-;\,(\varepsilon-1)e_0]$, and $e_{q_2}=[c_2 e_-;\,c_4 e_0]$ for problem (15). Then, it can be reformulated as
$$\min_{q_2}\ \tfrac{1}{2}q_2^\top H_2 q_2-d_2^\top q_2 \quad\text{s.t.}\quad 0\le q_2\le e_{q_2},$$ (30)
where
$$H_2=N_2 N_2^\top=\begin{bmatrix}BB^\top & -BC^\top\\ -CB^\top & CC^\top\end{bmatrix}$$ (31)
and
$$d_2^\top=-\nu_2 e_+^\top A N_2^\top+h_2^\top=\nu_2 e_+^\top\begin{bmatrix}AB^\top & -AC^\top\end{bmatrix}+\begin{bmatrix}e_-^\top & (\varepsilon-1)e_0^\top\end{bmatrix}.$$ (32)
After solving the dual problems (27) and (30) with a standard QPP solver, we can obtain the solutions to the primal problems (12) and (13) by Proposition 1, which follows from the KKT conditions and is stated without proof.
Proposition 1.
Suppose that $q_1^\ast=[\alpha_1^\ast;\,\beta_1^\ast]$ and $q_2^\ast=[\alpha_2^\ast;\,\beta_2^\ast]$ are solutions to the dual problems (27) and (30), respectively. Then, the solutions $u_1$ and $u_2$ to the primal problems (12) and (13) can be formulated by
$$u_1=N_1^\top q_1^\ast-\nu_1 B^\top e_-=A^\top\alpha_1^\ast-C^\top\beta_1^\ast-\nu_1 B^\top e_-$$ (33)
and
$$u_2=N_2^\top q_2^\ast+\nu_2 A^\top e_+=-B^\top\alpha_2^\ast+C^\top\beta_2^\ast+\nu_2 A^\top e_+.$$ (34)
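To make the training step concrete, the sketch below assembles H1, d1, H2, d2 as in (28), (29), (31), and (32), solves the box-constrained duals (27) and (30), and recovers u1 and u2 via Proposition 1. The paper solves the QPPs with MATLAB's quadprog; here we use SciPy's L-BFGS-B with bound constraints purely as a stand-in (the duals have only box constraints), and the small ridge added to the Hessian is our own numerical safeguard.

```python
import numpy as np
from scipy.optimize import minimize

def solve_box_qp(H, d, upper):
    """Minimise 0.5 q'Hq - d'q subject to 0 <= q <= upper (the form of (27)/(30))."""
    H = 0.5 * (H + H.T) + 1e-8 * np.eye(H.shape[0])   # symmetrise + tiny ridge
    fun = lambda q: 0.5 * q @ H @ q - d @ q
    jac = lambda q: H @ q - d
    res = minimize(fun, x0=np.zeros_like(d), jac=jac, method="L-BFGS-B",
                   bounds=[(0.0, u) for u in upper])
    return res.x

def train_pair(A, B, C, nu1, nu2, c1, c2, c3, c4, eps):
    """One (k_i, k_j)-pair MNP-KSVC classifier: build (27)/(30), solve them,
    and recover u1, u2 via Proposition 1."""
    e_p, e_n, e_0 = np.ones(len(A)), np.ones(len(B)), np.ones(len(C))
    N1, N2 = np.vstack([A, -C]), np.vstack([-B, C])
    h1 = np.concatenate([e_p, (eps - 1.0) * e_0])
    h2 = np.concatenate([e_n, (eps - 1.0) * e_0])
    H1, H2 = N1 @ N1.T, N2 @ N2.T
    d1 = nu1 * (N1 @ (B.T @ e_n)) + h1             # (29)
    d2 = -nu2 * (N2 @ (A.T @ e_p)) + h2            # (32)
    q1 = solve_box_qp(H1, d1, np.concatenate([c1 * e_p, c3 * e_0]))
    q2 = solve_box_qp(H2, d2, np.concatenate([c2 * e_n, c4 * e_0]))
    u1 = N1.T @ q1 - nu1 * (B.T @ e_n)             # (33)
    u2 = N2.T @ q2 + nu2 * (A.T @ e_p)             # (34)
    return u1, u2
```

A full MNP-KSVC trainer simply runs such a routine over all K(K − 1)/2 relabeled subproblems.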

3.3. Decision Rule

As mentioned in Section 3.1, our MNP-KSVC decomposes the multiclass learning task into a series of subproblems with the "one-versus-one-versus-rest" strategy. Specifically, we construct K(K − 1)/2 MNP-KSVC classifiers for K-class classification. For each (k_i, k_j)-pair classifier, we relabel the dataset with the ternary outputs {−1, 0, +1} according to the two selected classes and the remaining classes. Namely, we assign "+1", "−1", and "0" to instances belonging to the k_i class, the k_j class, and all the remaining classes, respectively. Then, we train the classifier on the relabeled dataset by solving problems (12) and (13).
As for the decision, we predict the label of an unseen instance x with a voting strategy over the ensemble of K(K − 1)/2 MNP-KSVC classifiers. Namely, each classifier casts a vote according to the region in which x is located. Taking the (k_i, k_j)-pair classifier as an example, we first rewrite the parametric-margin hyperplanes (7) in terms of u = [w; b] and $\tilde{x}=[x;\,1]$ as
$$f_1(x)=w_1^\top x+b_1=u_1^\top\tilde{x} \quad\text{and}\quad f_2(x)=w_2^\top x+b_2=u_2^\top\tilde{x}.$$ (35)
If x is located above the "+" hyperplane, i.e., $f_1(x)=u_1^\top\tilde{x}>1-\varepsilon$, we vote for the k_i class. On the other hand, if x is located below the "−" hyperplane, i.e., $f_2(x)<-1+\varepsilon$, we vote for the k_j class. Otherwise, x is regarded as belonging to the remaining classes. In summary, the decision function of the (k_i, k_j)-pair classifier can be expressed as
$$g(x)=\begin{cases}+1, & f_1(x)>1-\varepsilon,\\ -1, & f_2(x)<-1+\varepsilon,\\ 0, & \text{otherwise}.\end{cases}$$ (36)
Finally, the given unseen instance x is assigned to the class label that receives the most votes. In summary, the whole procedure of MNP-KSVC is summarized in Algorithm 1 and illustrated in Figure 1.
Algorithm 1 The procedure of MNP-KSVC
1: Input the dataset $T=\{(x_i,y_i)\,|\,1\le i\le m\}$, where $x_i\in\mathbb{R}^n$ and $y_i\in\{1,\dots,K\}$.
2: Choose parameters $\nu_1,\nu_2,c_1,c_2,c_3,c_4>0$ and $\varepsilon\in(0,1]$.
Training Procedure:
3: for $k_i$ in $(1,\dots,K)$ do
4:   for $k_j$ in $(k_i+1,\dots,K)$ do
5:     Relabel instances of class $k_i$ as "+1", class $k_j$ as "−1", and the remaining classes as "0".
6:     Construct A, B, and C for problems (12) and (13) of the $(k_i,k_j)$-pair classifier.
7:     Solve the corresponding dual problems (27) and (30) with a QPP solver to obtain their solutions $q_1^\ast=[\alpha_1^\ast;\,\beta_1^\ast]$ and $q_2^\ast=[\alpha_2^\ast;\,\beta_2^\ast]$.
8:     Build the auxiliary functions of the classifier w.r.t. Proposition 1:
       $k_i$-class: $f_1(x)=w_1^\top x+b_1=q_1^{\ast\top}N_1\tilde{x}-\nu_1 e_-^\top B\tilde{x}$ (37)
       and
       $k_j$-class: $f_2(x)=w_2^\top x+b_2=q_2^{\ast\top}N_2\tilde{x}+\nu_2 e_+^\top A\tilde{x}$ (38)
9:   end for
10: end for
Predicting Procedure:
11: For an unseen instance x, assign it to a class y via the following voting strategy.
12: Initialize the vote vector $vote=\mathbf{0}$ for the K classes.
13: for $k_i$ in $(1,\dots,K)$ do
14:   for $k_j$ in $(k_i+1,\dots,K)$ do
15:     Compute the decision function $g(x)$ of the $(k_i,k_j)$-pair classifier w.r.t. (36).
16:     if $g(x)=+1$ then
17:       Update $vote(k_i)\mathrel{+}=1$
18:     else if $g(x)=-1$ then
19:       Update $vote(k_j)\mathrel{+}=1$
20:     end if
21:   end for
22: end for
23: Finally, assign x to the class with the most votes via
    $$\mathrm{label}(x)\leftarrow\arg\max_{k\in\{1,\dots,K\}} vote(k)$$ (39)
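The prediction stage of Algorithm 1 can be written compactly as follows. This is a hedged NumPy sketch in which `classifiers` is assumed to be a dictionary produced by the training loop (e.g., by a helper such as the train_pair sketch above), mapping each (k_i, k_j) pair to its augmented solutions (u1, u2).

```python
import numpy as np

def predict(X_new, classifiers, K, eps):
    """Voting prediction over the K(K-1)/2 pair classifiers, following (36) and (39)."""
    X_tilde = np.hstack([X_new, np.ones((X_new.shape[0], 1))])   # augment with bias
    votes = np.zeros((X_new.shape[0], K))
    for (k_i, k_j), (u1, u2) in classifiers.items():
        f1, f2 = X_tilde @ u1, X_tilde @ u2                      # decision values (35)
        pos = f1 > 1 - eps                                       # g(x) = +1 in (36)
        neg = ~pos & (f2 < -1 + eps)                             # g(x) = -1 in (36)
        votes[pos, k_i - 1] += 1
        votes[neg, k_j - 1] += 1
    return votes.argmax(axis=1) + 1                              # class labels 1..K, per (39)
```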

4. Model Extension to the Nonlinear Case

In practice, a linear classifier is sometimes not suitable for real-world nonlinear learning tasks [18,21,30]. One effective solution is to map linearly non-separable instances into a higher-dimensional feature space. Thus, in this section, we focus on the nonlinear extension of MNP-KSVC.
To construct our nonlinear MNP-KSVC, we consider the feature mapping $\varphi(x):\mathbb{R}^n\to\mathcal{H}$ into a reproducing kernel Hilbert space (RKHS) with the kernel trick [3]. Define $A_\varphi=\{\varphi(\tilde{x}_i)\}_{i\in I_+}$, $B_\varphi=\{\varphi(\tilde{x}_j)\}_{j\in I_-}$, and $C_\varphi=\{\varphi(\tilde{x}_l)\}_{l\in I_0}$. Then, the nonlinear MNP-KSVC model optimizes the following two primal QPPs:
$$\min_{u_1,\xi_1,\eta_1}\ \tfrac{1}{2}\|u_1\|^2+\nu_1 e_-^\top B_\varphi u_1+c_1 e_+^\top\xi_1+c_3 e_0^\top\eta_1\quad\text{s.t.}\quad A_\varphi u_1\ge e_+-\xi_1,\ \xi_1\ge 0;\ \ -C_\varphi u_1\ge(\varepsilon-1)e_0-\eta_1,\ \eta_1\ge 0,$$ (40)
and
$$\min_{u_2,\xi_2,\eta_2}\ \tfrac{1}{2}\|u_2\|^2-\nu_2 e_+^\top A_\varphi u_2+c_2 e_-^\top\xi_2+c_4 e_0^\top\eta_2\quad\text{s.t.}\quad -B_\varphi u_2\ge e_--\xi_2,\ \xi_2\ge 0;\ \ C_\varphi u_2\ge(\varepsilon-1)e_0-\eta_2,\ \eta_2\ge 0,$$ (41)
where ξ 1 , η 1 , ξ 2 , η 2 are non-negative slack vectors. Because the formulations of nonlinear problems (40) and (41) are similar to the linear cases (12) and (13), we can obtain their solutions in a similar manner.
In what follows, we define the kernel operation for MNP-KSVC.
Definition 1.
Suppose that $K(\cdot,\cdot)$ is an appropriate kernel function; then, the kernel operation in matrix form is defined as
$$K(A,B)=\langle A_\varphi, B_\varphi\rangle=A_\varphi B_\varphi^\top,$$ (42)
whose $ij$-th element can be computed by
$$K(A,B)_{ij}=\varphi(x_i)^\top\varphi(x_j)=K(x_i,x_j).$$ (43)
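For instance, the RBF kernel used later in the experiments can implement this kernel operation as in the following sketch (we adopt the convention K(x_i, x_j) = exp(−‖x_i − x_j‖²/γ); the helper name is ours).

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    """Kernel matrix K(A, B) of Definition 1 with K(x_i, x_j) = exp(-||x_i - x_j||^2 / gamma)."""
    sq_dists = (np.sum(A**2, axis=1)[:, None]      # ||a_i||^2
                + np.sum(B**2, axis=1)[None, :]    # ||b_j||^2
                - 2.0 * A @ B.T)                   # -2 a_i . b_j
    return np.exp(-np.maximum(sq_dists, 0.0) / gamma)
```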
Then, we can derive the dual problems of (40) and (41) by Theorem 2.
Theorem 2.
Optimization problems
$$\min_{\alpha_1,\beta_1}\ \tfrac{1}{2}\begin{bmatrix}\alpha_1\\ \beta_1\end{bmatrix}^\top\begin{bmatrix}K(A,A) & -K(A,C)\\ -K(C,A) & K(C,C)\end{bmatrix}\begin{bmatrix}\alpha_1\\ \beta_1\end{bmatrix}-\Bigl(\nu_1 e_-^\top\begin{bmatrix}K(B,A) & -K(B,C)\end{bmatrix}+\begin{bmatrix}e_+^\top & (\varepsilon-1)e_0^\top\end{bmatrix}\Bigr)\begin{bmatrix}\alpha_1\\ \beta_1\end{bmatrix}\quad\text{s.t.}\quad 0\le\alpha_1\le c_1 e_+,\ \ 0\le\beta_1\le c_3 e_0,$$ (44)
and
$$\min_{\alpha_2,\beta_2}\ \tfrac{1}{2}\begin{bmatrix}\alpha_2\\ \beta_2\end{bmatrix}^\top\begin{bmatrix}K(B,B) & -K(B,C)\\ -K(C,B) & K(C,C)\end{bmatrix}\begin{bmatrix}\alpha_2\\ \beta_2\end{bmatrix}-\Bigl(\nu_2 e_+^\top\begin{bmatrix}K(A,B) & -K(A,C)\end{bmatrix}+\begin{bmatrix}e_-^\top & (\varepsilon-1)e_0^\top\end{bmatrix}\Bigr)\begin{bmatrix}\alpha_2\\ \beta_2\end{bmatrix}\quad\text{s.t.}\quad 0\le\alpha_2\le c_2 e_-,\ \ 0\le\beta_2\le c_4 e_0,$$ (45)
are the dual problems of (40) and (41), respectively.
Proof of Theorem 2. 
By introducing non-negative Lagrange multipliers α 1 , β 1 , φ 1 , and γ 1 to constraints of problem (40), its Lagrangian function is built as
$$L(\Xi_1)=\tfrac{1}{2}\|u_1\|^2+\nu_1 e_-^\top B_\varphi u_1+c_1 e_+^\top\xi_1+c_3 e_0^\top\eta_1-\alpha_1^\top(A_\varphi u_1+\xi_1-e_+)-\beta_1^\top\bigl(-C_\varphi u_1+\eta_1-(\varepsilon-1)e_0\bigr)-\varphi_1^\top\xi_1-\gamma_1^\top\eta_1,$$ (46)
where $\Xi_1=\{u_1,\xi_1,\eta_1,\alpha_1,\beta_1,\varphi_1,\gamma_1\}$. Differentiating $L(\Xi_1)$ with respect to $u_1,\xi_1,\eta_1$, the optimality conditions of problem (40) are obtained as
$$\frac{\partial L}{\partial u_1}=u_1+\nu_1 B_\varphi^\top e_--A_\varphi^\top\alpha_1+C_\varphi^\top\beta_1=0,$$ (47)
$$\frac{\partial L}{\partial \xi_1}=c_1 e_+-\alpha_1-\varphi_1=0,$$ (48)
$$\frac{\partial L}{\partial \eta_1}=c_3 e_0-\beta_1-\gamma_1=0,$$ (49)
$$\alpha_1^\top(A_\varphi u_1+\xi_1-e_+)=0,$$ (50)
$$\beta_1^\top\bigl(-C_\varphi u_1+\eta_1-(\varepsilon-1)e_0\bigr)=0,$$ (51)
$$\varphi_1^\top\xi_1=0,$$ (52)
$$\gamma_1^\top\eta_1=0.$$ (53)
From (47), we have
$$u_1=A_\varphi^\top\alpha_1-C_\varphi^\top\beta_1-\nu_1 B_\varphi^\top e_-.$$ (54)
Since $\varphi_1,\gamma_1\ge 0$, from (48) and (49) we derive
$$0\le\alpha_1\le c_1 e_+ \quad\text{and}\quad 0\le\beta_1\le c_3 e_0.$$ (55)
Finally, substituting (54) into the Lagrangian function (46) and using the KKT conditions (47)–(53), the dual problem of (40) can be formulated as (44). Similarly, we can derive the dual problem of (41) as problem (45). □
The procedure of the nonlinear MNP-KSVC is similar to that of the linear one, but with the following minor modifications in Algorithm 1:
  • In contrast to some existing nonparallel SVMs, we do not need an extra kernel-generated surface technique, since only inner products appear in the dual problems (14) and (15) of the linear MNP-KSVC. These dual formulations enable MNP-KSVC to behave consistently in the linear and nonlinear cases. Thus, taking appropriate kernel functions instead of inner products in the Hessian matrices of the dual problems (14) and (15), i.e.,
    $$H_{\varphi 1}=\begin{bmatrix}K(A,A) & -K(A,C)\\ -K(C,A) & K(C,C)\end{bmatrix} \quad\text{and}\quad H_{\varphi 2}=\begin{bmatrix}K(B,B) & -K(B,C)\\ -K(C,B) & K(C,C)\end{bmatrix},$$ (56)
    we obtain the dual formulation of the nonlinear MNP-KSVC in (44) and (45).
  • Once we obtain the solutions $q_1^\ast=[\alpha_1^\ast;\,\beta_1^\ast]$ and $q_2^\ast=[\alpha_2^\ast;\,\beta_2^\ast]$ to problems (44) and (45), respectively, the corresponding primal solutions $u_1$ and $u_2$ in the feature space can be formulated by
    $$u_1=A_\varphi^\top\alpha_1^\ast-C_\varphi^\top\beta_1^\ast-\nu_1 B_\varphi^\top e_-$$ (57)
    and
    $$u_2=-B_\varphi^\top\alpha_2^\ast+C_\varphi^\top\beta_2^\ast+\nu_2 A_\varphi^\top e_+.$$ (58)
  • For an unseen instance x, construct the decision function of the $(k_i,k_j)$-pair nonlinear MNP-KSVC classifier as
    $$g_\varphi(x)=\begin{cases}+1, & f_{\varphi 1}(\tilde{x})>1-\varepsilon,\\ -1, & f_{\varphi 2}(\tilde{x})<-1+\varepsilon,\\ 0, & \text{otherwise},\end{cases}$$ (59)
    where $\tilde{x}=[x;\,1]$, and the auxiliary functions $f_{\varphi 1}$ and $f_{\varphi 2}$ in the feature space can be expressed as
    $$f_{\varphi 1}(\tilde{x})=\langle u_1,\varphi(\tilde{x})\rangle=\alpha_1^{\ast\top}K(A,\tilde{x})-\beta_1^{\ast\top}K(C,\tilde{x})-\nu_1 e_-^\top K(B,\tilde{x})$$ (60)
    and
    $$f_{\varphi 2}(\tilde{x})=\langle u_2,\varphi(\tilde{x})\rangle=-\alpha_2^{\ast\top}K(B,\tilde{x})+\beta_2^{\ast\top}K(C,\tilde{x})+\nu_2 e_+^\top K(A,\tilde{x}).$$ (61)
    A small evaluation sketch of (59)–(61) follows this list.
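As referenced above, the following sketch evaluates (60), (61), and the decision rule (59) for a single augmented instance; the signature and the generic `kernel` callable (e.g., the rbf_kernel sketch above) are our own assumptions.

```python
import numpy as np

def nonlinear_decide(x_tilde, A, B, C, q1, q2, nu1, nu2, eps, kernel):
    """Ternary decision g_phi(x) in (59) from the kernelised auxiliary functions (60)-(61)."""
    alpha1, beta1 = q1[:len(A)], q1[len(A):]
    alpha2, beta2 = q2[:len(B)], q2[len(B):]
    kA = kernel(A, x_tilde[None, :]).ravel()       # K(A, x~)
    kB = kernel(B, x_tilde[None, :]).ravel()       # K(B, x~)
    kC = kernel(C, x_tilde[None, :]).ravel()       # K(C, x~)
    f1 = alpha1 @ kA - beta1 @ kC - nu1 * kB.sum()       # (60)
    f2 = -alpha2 @ kB + beta2 @ kC + nu2 * kA.sum()      # (61)
    return +1 if f1 > 1 - eps else (-1 if f2 < -1 + eps else 0)
```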

5. Numerical Experiments

5.1. Experimental Setting

To demonstrate the validity of MNP-KSVC, we perform extensive experiments on several benchmark datasets that are commonly used for testing machine learning algorithms. In experiments, we focus on comparing MNP-KSVC and four state-of-the-art multiclass models—Multi-SVM, MBSVM, MTPMSVM, and Twin-KSVC—detailed as follows:
  • Multi-SVM [38]: The idea is similar to the "one-versus-all" SVM [3]. However, it generates the K binary SVM classifiers by solving one large dual QPP. That is, the k-th classifier is trained with the k-th class instances encoded with positive labels and the remaining class instances with negative labels. Then, the label of an unseen instance is assigned by a "voting" scheme. The penalty parameter for each classifier in Multi-SVM is c.
  • MBSVM [33]: It is the multiclass extension of the binary TWSVM based on the "one-versus-all" strategy. MBSVM aims to find K nonparallel hyperplanes by solving K QPPs. Specifically, the k-th class instances are kept as far away as possible from the k-th hyperplane, while the remaining instances are proximal to it. An unseen instance is assigned to the label of the hyperplane from which it lies farthest. The penalty parameter for each classifier in MBSVM is c.
  • MTPMSVM: Inspired by MBSVM [33], we use the "one-versus-all" strategy to implement a multiclass version of TPMSVM [15] as a baseline. In contrast to MBSVM, it aims to find K parametric-margin hyperplanes such that each hyperplane is closer to its corresponding class instances and as far as possible from the remaining class instances. The penalty parameters for each classifier in MTPMSVM are (ν, c).
  • Twin-KSVC [35]: It is another multiclass extension of TWSVM. Twin-KSVC evaluates all the training instances in a "one-versus-one-versus-rest" structure with the ternary output {−1, 0, +1}. It aims to find a pair of nonparallel hyperplanes for each pair of classes selected from the K classes, while the remaining class instances are mapped into a region between these two nonparallel hyperplanes. The penalty parameters for each classifier in Twin-KSVC are (c1, c2, c3, c4, ε).
All methods are implemented by MATLAB on a PC with an i7 Intel Core processor with 32 GB RAM. The quadratic programming problems (QPPs) of all the classifiers are solved by the “quadprog” function in MATLAB. Now, we describe the setting of our experiments:
  • Similar to [35,38], we use the multiclass accuracy to measure each classifier, defined as
    $$\text{accuracy}=\frac{1}{N}\sum_{k=1}^{K}\sum_{x\in I_k}\mathbb{I}\bigl(\hat{g}(x)=k\bigr),$$ (62)
    where N is the total number of instances, K is the number of classes, $\hat{g}(x)$ is the prediction of the classifier, and $\mathbb{I}(\cdot)$ is an indicator function that returns 1 if the class matches and 0 otherwise (a small computational sketch follows this list). Moreover, we adopt the training time to represent the learning efficiency.
  • To reduce the complexity of parameter selection for the multiclass classifiers, we use the same parameter setting for each learning subproblem. Specifically, we use a single c for all subproblems in Multi-SVM and MBSVM, a single (ν, c) in MTPMSVM, set c1 = c2 and c3 = c4 in Twin-KSVC, and set ν1 = ν2, c1 = c2, and c3 = c4 in MNP-KSVC. For the nonlinear case, the RBF kernel $K(x_i,x_j)=\exp(-\|x_i-x_j\|^2/\gamma)$ is considered, where γ > 0 is the kernel parameter.
  • It is usually unknown beforehand which parameters are optimal for the classifiers at hand. Thus, we employ the 10-fold cross-validation technique [3] for parameter selection. In detail, each dataset is randomly partitioned into 10 subsets with similar sizes and distributions. Then, the union of 9 subsets is used as the training set, while the remaining one is used as the testing set. Furthermore, we apply the grid-based approach [3] to obtain the optimal parameters of each classifier. Namely, the penalty parameters c, c1, c2, ν, ν1 and the kernel parameter γ are selected from {2^i | i = −6, …, 6}, while the margin parameter ε is chosen from {0.1, 0.2, …, 0.9}. Once selected, the optimal parameters are used to learn the final decision function.
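As referenced in the first item above, the accuracy measure (62) and the search grid can be expressed in a few lines of Python; the grid below mirrors the ranges just described and is illustrative only.

```python
import numpy as np

def multiclass_accuracy(y_true, y_pred):
    """Multiclass accuracy (62): the fraction of test instances whose predicted class matches."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

# hypothetical search grid mirroring the ranges above
param_grid = {
    "c":     [2.0**i for i in range(-6, 7)],
    "nu":    [2.0**i for i in range(-6, 7)],
    "gamma": [2.0**i for i in range(-6, 7)],   # RBF kernel parameter
    "eps":   [round(0.1 * i, 1) for i in range(1, 10)],
}
# grid search: every combination is scored by 10-fold cross-validation, e.g.
# for c, nu, gamma, eps in itertools.product(*param_grid.values()): ...
```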

5.2. Result Comparison and Discussion

For comparison, we consider 10 real-world multiclass datasets from the UCI machine learning repository (the UCI datasets are available at http://archive.ics.uci.edu/ml (accessed on 10 September 2021)), whose statistics are summarized in Table 1. These datasets cover a wide range of domains (including phytology, bioinformatics, pathology, and so on), sizes (from 178 to 2175), features (from 4 to 34), and classes (from 3 to 10). All datasets are normalized before training such that each feature lies in [−1, 1]. Moreover, we carry out the experiments as follows. Firstly, each dataset is divided into two subsets: 70% for training and 30% for testing. Then, we train the classifiers with 10-fold cross-validation. Finally, we predict the testing set with the fine-tuned classifiers. Each experiment is repeated 10 times.
Table 2 and Table 3 summarize the learning results of the proposed MNP-KSVC model and the compared methods using linear and nonlinear kernels, respectively. The results on the 10 benchmark datasets include the mean and standard deviation of the testing multiclass accuracy (%), and the best performance is highlighted in bold. The comparison results reveal the following:
  • MNP-KSVC yields better performance than other classifiers in terms of accuracy on almost all datasets. This confirms the efficacy of the proposed MNP-KSVC on the multiclass learning tasks.
  • The nonparallel-based models (MBSVM, MTPMSVM, Twin-KSVC, and MNP-KSVC) outperform the traditional SVM model. The reason is that SVM only utilizes parallel hyperplanes to learn the decision function, which limits its capability to capture the underlying multiclass distributions.
  • MNP-KSVC has a better generalization ability than MBSVM and Twin-KSVC in most cases. For instance, MNP-KSVC obtains a higher accuracy (84.59%) than MBSVM (80.82%) and Twin-KSVC (80.12%) on the Ecoli dataset in the nonlinear case. Similar results can be observed on the other datasets. Since MBSVM and Twin-KSVC only implement empirical risk minimization, they are prone to overfitting. In contrast, our MNP-KSVC includes regularization terms, which regulate the model complexity and help avoid overfitting.
  • MTPMSVM is another multiclass extension, which is based on the "one-versus-rest" strategy. With the help of the "hybrid classification and regression" learning paradigm, our MNP-KSVC can learn more multiclass discriminative information.
  • Furthermore, we count the number of datasets on which MNP-KSVC is superior/inferior (W/L) to each compared classifier for both the linear and nonlinear cases, listed at the bottom of Table 2 and Table 3. The results indicate that MNP-KSVC achieves the best results against the others in terms of both W/L and average accuracy.
To provide more statistical evidence [39,40,41], we further perform the non-parametric Friedman test to check whether there are significant differences between MNP-KSVC and the other compared classifiers. The bottom rows of Table 2 and Table 3 list the average rank of each classifier computed from the multiclass accuracies. The results show that MNP-KSVC ranks first in both the linear and nonlinear cases, followed by Twin-KSVC in the linear case and MTPMSVM in the nonlinear case. Now, we calculate the $\mathcal{X}_F^2$ value of the Friedman test as
$$\mathcal{X}_F^2=\frac{12N}{k(k+1)}\left[\sum_{i=1}^{k}r_i^2-\frac{k(k+1)^2}{4}\right],$$ (63)
where N is the number of datasets, k is the number of classifiers, and $r_i$ is the average rank of the i-th model over the N datasets. For the linear case, we compute the term $\sum_{i=1}^{k}r_i^2$ from the average ranks in Table 2 as
$$\sum_{i=1}^{k}r_i^2=4.6^2+2.9^2+3.1^2+2.67^2+1.7^2\approx 49.1989.$$ (64)
Then, substituting k = 5, N = 10, and (64) into (63), we have
$$\mathcal{X}_F^2=\frac{12\times 10}{5(5+1)}\left[49.1989-\frac{5(5+1)^2}{4}\right]\approx 16.7956.$$ (65)
Based on the above Friedman statistic $\mathcal{X}_F^2=16.7956$, we calculate the F-distribution statistic $F_F$ with $(k-1,(k-1)(N-1))=(4,36)$ degrees of freedom as
$$F_F=\frac{(N-1)\,\mathcal{X}_F^2}{N(k-1)-\mathcal{X}_F^2}=\frac{9\times 16.7956}{10\times 4-16.7956}\approx 6.5143.$$ (66)
The corresponding p-value is below the significance level α = 0.05, so the null hypothesis is rejected. Similarly, we calculate the statistic for the nonlinear case; the results are summarized in Table 4. They reject the null hypothesis for both the linear and nonlinear cases and reveal significant differences in the performances of the classifiers.
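The Friedman arithmetic in (63)–(66) is easy to reproduce; the following hedged SciPy sketch computes the chi-square statistic, the derived F statistic, and the corresponding p-value from the average ranks.

```python
import numpy as np
from scipy.stats import f as f_dist

def friedman_test(avg_ranks, n_datasets):
    """Friedman chi-square (63) and the derived F statistic (66) from average ranks."""
    r = np.asarray(avg_ranks, dtype=float)
    k, N = len(r), n_datasets
    chi2_f = 12.0 * N / (k * (k + 1)) * (np.sum(r**2) - k * (k + 1)**2 / 4.0)
    f_f = (N - 1) * chi2_f / (N * (k - 1) - chi2_f)
    p = 1.0 - f_dist.cdf(f_f, k - 1, (k - 1) * (N - 1))
    return chi2_f, f_f, p

# linear case from Table 2: ranks 4.6, 2.9, 3.1, 2.67, 1.7 over N = 10 datasets
# -> chi2_F ~= 16.80 and F_F ~= 6.51, reproducing (65)-(66)
```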
Furthermore, we record the average learning time of each classifier in the above UCI experiments, as shown in Figure 2 and Figure 3. The results show that our MNP-KSVC is faster than Multi-SVM and Twin-KSVC, while slightly slower than MBSVM and MTPMSVM, in both the linear and nonlinear cases. Multi-SVM is the slowest of all classifiers because it needs to solve larger problems than the nonparallel-based classifiers. Moreover, the Hessian matrices of the dual QPPs in MNP-KSVC avoid the time-costly matrix inversion, which makes it more efficient than Twin-KSVC. Overall, the above results confirm the feasibility and efficiency of the proposed MNP-KSVC.

6. Discussion and Future Work

This paper proposes a novel K multiclass nonparallel parametric-margin support vector machine, termed MNP-KSVC. Specifically, our MNP-KSVC has the following attractive merits:
  • For the K-class learning task, our MNP-KSVC first transforms the complicated multiclass problem into K(K − 1)/2 subproblems via a "one-versus-one-versus-rest" strategy. Each subproblem focuses on separating the two selected classes from the rest of the classes. That is, we utilize {+1, −1} to label the two selected classes and 0 to label the rest. Unlike the "one-versus-all" strategy used in Multi-SVM, MBSVM, and MTPMSVM, this encoding strategy can alleviate the imbalance issues that sometimes occur in multiclass learning [32,35].
  • For each subproblem, our MNP-KSVC aims to learn a pair of nonparallel parametric-margin hyperplanes (36) with the ternary encoding {−1, 0, +1}. These parametric-margin hyperplanes are closer to their corresponding class and at a distance of at least one from the other class. Meanwhile, they restrict the rest of the instances to an insensitive region. A hybrid classification and regression loss joined with regularization is further utilized to formulate the optimization problems (10) and (11) of MNP-KSVC.
  • Moreover, the nonlinear extension is also presented to deal with the nonlinear multiclass learning tasks. In contrast to MBSVM [33] and Twin-KSVC [35], the linear and nonlinear models in MNP-KSVC are consistent. Applying the linear kernel in the nonlinear problems (44) and (45) results in the same formulations as the original linear problems (14) and (15).
  • Extensive experiments on various datasets demonstrate the effectiveness of the proposed MNP-KSVC compared with Multi-SVM, MBSVM, MTPMSVM, and Twin-KSVC.
There are several interesting directions to research in the future, such as extensions to semi-supervised learning [26,42], multi-label learning [22], and privilege-information learning [43].

Author Contributions

Conceptualization, S.-W.D. and W.-J.C.; Funding acquisition, W.-J.C. and Y.-H.S.; Investigation, M.-C.Z. and P.C.; Methodology, S.-W.D., M.-C.Z. and Y.-H.S.; Project administration, S.-W.D. and W.-J.C.; Supervision, W.-J.C. and Y.-H.S.; Validation, M.-C.Z. and H.-F.S.; Visualization, P.C. and H.-F.S.; Writing—original draft, S.-W.D., P.C. and H.-F.S.; Writing—review and editing, S.-W.D. and W.-J.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China, Grant number: 61603338, 11871183, 61866010, and 11426202; the Natural Science Foundation of Zhejiang Province of China, Grant number: LY21F030013; the Natural Science Foundation of Hainan Province of China, Grant number: 120RC449; the Scientific Research Foundation of Hainan University, Grant number: kyqd(sk)1804.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data sharing not applicable. No new data were created or analyzed in this study.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
QPP	Quadratic Programming Problem
KKT	Karush–Kuhn–Tucker
SVM	Support Vector Machine
TWSVM	Twin Support Vector Machine
Multi-SVM	Multiclass Support Vector Machine
MBSVM	Multiple Birth Support Vector Machine
MTPMSVM	Multiple Twin Parametric-Margin Support Vector Machine
Twin-KSVC	Multiclass Twin Support Vector Classifier
MNP-KSVC	Multiclass Nonparallel Parametric-Margin Support Vector Classifier

References

  1. Han, J.; Kamber, M.; Pei, J. Data Mining: Concepts and Techniques; Morgan Kaufmann: San Francisco, CA, USA, 2012.
  2. Vapnik, V. Statistical Learning Theory; Wiley: New York, NY, USA, 1998.
  3. Deng, N.; Tian, Y.; Zhang, C. Support Vector Machines: Theory, Algorithms and Extensions; CRC Press: Philadelphia, PA, USA, 2013.
  4. Sitaula, C.; Aryal, S.; Xiang, Y.; Basnet, A.; Lu, X. Content and context features for scene image representation. Knowl.-Based Syst. 2021, 232, 107470.
  5. Ma, S.; Cheng, B.; Shang, Z.; Liu, G. Scattering transform and LSPTSVM based fault diagnosis of rotating machinery. Mech. Syst. Signal Process. 2018, 104, 155–170.
  6. Liu, T.; Yan, D.; Wang, R.; Yan, N.; Chen, G. Identification of Fake Stereo Audio Using SVM and CNN. Information 2021, 12, 263.
  7. You, S.D. Classification of Relaxation and Concentration Mental States with EEG. Information 2021, 12, 187.
  8. Kang, J.; Han, X.; Song, J.; Niu, Z.; Li, X. The identification of children with autism spectrum disorder by SVM approach on EEG and eye-tracking data. Comput. Biol. Med. 2020, 120, 103722.
  9. Lazcano, R.; Salvador, R.; Marrero-Martin, M.; Leporati, F.; Juarez, E.; Callico, G.M.; Sanz, C.; Madronal, D.; Florimbi, G.; Sancho, J.; et al. Parallel Implementations Assessment of a Spatial-Spectral Classifier for Hyperspectral Clinical Applications. IEEE Access 2019, 7, 152316–152333.
  10. Fabelo, H.; Ortega, S.; Szolna, A.; Bulters, D.; Pineiro, J.F.; Kabwama, S.; J-O'Shanahan, A.; Bulstrode, H.; Bisshopp, S.; Kiran, B.R.; et al. In-Vivo Hyperspectral Human Brain Image Database for Brain Cancer Detection. IEEE Access 2019, 7, 39098–39116.
  11. Roy, S.D.; Debbarma, S. A novel OC-SVM based ensemble learning framework for attack detection in AGC loop of power systems. Electr. Power Syst. Res. 2022, 202, 107625.
  12. Mangasarian, O.L.; Wild, E.W. Multisurface proximal support vector machine classification via generalized eigenvalues. IEEE Trans. Pattern Anal. Mach. Intell. 2006, 28, 69–74.
  13. Jayadeva; Khemchandani, R.; Chandra, S. Twin Support Vector Machines for Pattern Classification. IEEE Trans. Pattern Anal. Mach. Intell. 2007, 29, 905–910.
  14. Ding, S.; Hua, X. An overview on nonparallel hyperplane support vector machine algorithms. Neural Comput. Appl. 2014, 25, 975–982.
  15. Peng, X. TPMSVM: A novel twin parametric-margin support vector machine for pattern recognition. Pattern Recogn. 2011, 44, 2678–2692.
  16. Arun Kumar, M.; Gopal, M. Least squares twin support vector machines for pattern classification. Expert Syst. Appl. 2009, 36, 7535–7543.
  17. Shao, Y.; Zhang, C.; Wang, X.; Deng, N. Improvements on Twin Support Vector Machines. IEEE Trans. Neural Netw. 2011, 22, 962–968.
  18. Chen, W.J.; Shao, Y.H.; Li, C.N.; Liu, M.Z.; Wang, Z.; Deng, N.Y. ν-projection twin support vector machine for pattern classification. Neurocomputing 2020, 376, 10–24.
  19. Tian, Y.; Qi, Z.; Ju, X.; Shi, Y.; Liu, X. Nonparallel Support Vector Machines for Pattern Classification. IEEE Trans. Cybern. 2014, 44, 1067–1079.
  20. Tian, Y.; Ping, Y. Large-scale linear nonparallel support vector machine solver. Neural Netw. 2014, 50, 166–174.
  21. Chen, W.; Shao, Y.; Li, C.; Wang, Y.; Liu, M.; Wang, Z. NPrSVM: Nonparallel sparse projection support vector machine with efficient algorithm. Appl. Soft Comput. 2020, 90, 106142.
  22. Chen, W.; Shao, Y.; Li, C.; Deng, N. MLTSVM: A novel twin support vector machine to multi-label learning. Pattern Recogn. 2016, 52, 61–74.
  23. Bai, L.; Shao, Y.H.; Wang, Z.; Chen, W.J.; Deng, N.Y. Multiple Flat Projections for Cross-Manifold Clustering. IEEE Trans. Cybern. 2021, 1–15. Available online: https://0-ieeexplore-ieee-org.brum.beds.ac.uk/document/9343292 (accessed on 20 October 2021).
  24. Hou, Q.; Liu, L.; Zhen, L.; Jing, L. A novel projection nonparallel support vector machine for pattern classification. Eng. Appl. Artif. Intell. 2018, 75, 64–75.
  25. Liu, L.; Chu, M.; Gong, R.; Zhang, L. An Improved Nonparallel Support Vector Machine. IEEE Trans. Neural Netw. Learn. Syst. 2021, 32, 5129–5143.
  26. Chen, W.; Shao, Y.; Deng, N.; Feng, Z. Laplacian least squares twin support vector machine for semi-supervised classification. Neurocomputing 2014, 145, 465–476.
  27. Shao, Y.; Chen, W.; Zhang, J.; Wang, Z.; Deng, N. An efficient weighted Lagrangian twin support vector machine for imbalanced data classification. Pattern Recogn. 2014, 47, 3158–3167.
  28. Chen, W.; Shao, Y.; Ning, H. Laplacian smooth twin support vector machine for semi-supervised classification. Int. J. Mach. Learn. Cybern. 2014, 5, 459–468.
  29. Gao, Z.; Fang, S.C.; Gao, X.; Luo, J.; Medhin, N. A novel kernel-free least squares twin support vector machine for fast and accurate multi-class classification. Knowl.-Based Syst. 2021, 226, 107123.
  30. Ding, S.; Zhao, X.; Zhang, J.; Zhang, X.; Xue, Y. A review on multi-class TWSVM. Artif. Intell. Rev. 2017, 52, 775–801.
  31. Qiang, W.; Zhang, J.; Zhen, L.; Jing, L. Robust weighted linear loss twin multi-class support vector regression for large-scale classification. Signal Process. 2020, 170, 107449.
  32. de Lima, M.D.; Costa, N.L.; Barbosa, R. Improvements on least squares twin multi-class classification support vector machine. Neurocomputing 2018, 313, 196–205.
  33. Yang, Z.; Shao, Y.; Zhang, X. Multiple birth support vector machine for multi-class classification. Neural Comput. Appl. 2013, 22, 153–161.
  34. Angulo, C.; Parra, X.; Català, A. K-SVCR. A support vector machine for multi-class classification. Neurocomputing 2003, 55, 57–77.
  35. Xu, Y.; Guo, R.; Wang, L. A Twin Multi-Class Classification Support Vector Machine. Cogn. Comput. 2013, 5, 580–588.
  36. Nasiri, J.A.; Moghadam Charkari, N.; Jalili, S. Least squares twin multi-class classification support vector machine. Pattern Recogn. 2015, 48, 984–992.
  37. Mangasarian, O.L. Nonlinear Programming; SIAM Press: Philadelphia, PA, USA, 1993.
  38. Tomar, D.; Agarwal, S. A comparison on multi-class classification methods based on least squares twin support vector machine. Knowl.-Based Syst. 2015, 81, 131–147.
  39. Friedman, M. The use of ranks to avoid the assumption of normality implicit in the analysis of variance. J. Am. Stat. Assoc. 1937, 200, 675–701.
  40. Yu, Z.; Wang, Z.; You, J.; Zhang, J.; Liu, J.; Wong, H.S.; Han, G. A New Kind of Nonparametric Test for Statistical Comparison of Multiple Classifiers Over Multiple Datasets. IEEE Trans. Cybern. 2017, 47, 4418–4431.
  41. Hatamlou, A. Black hole: A new heuristic optimization approach for data clustering. Inform. Sci. 2013, 222, 175–184.
  42. Chen, W.; Shao, Y.; Xu, D.; Fu, Y. Manifold proximal support vector machine for semi-supervised classification. Appl. Intell. 2014, 40, 623–638.
  43. Li, Y.; Sun, H.; Yan, W.; Cui, Q. R-CTSVM+: Robust capped L1-norm twin support vector machine with privileged information. Inform. Sci. 2021, 574, 12–32.
Figure 1. The flowchart for the training and predicting procedures of the proposed MNP-KSVC model.
Figure 2. The learning times on benchmark datasets for linear classifiers.
Figure 3. The learning times on benchmark datasets for nonlinear classifiers.
Table 1. Statistics for the benchmark datasets used in experiments. # denotes the corresponding quantity.

Datasets       #Instances  #Training  #Testing  #Attributes  #Class
Balance        625         438        187       4            3
Ecoli          327         229        98        7            5
Iris           150         105        45        4            3
Glass          214         150        64        13           6
Wine           178         125        53        13           3
Thyroid        215         150        65        5            3
Dermatology    358         251        107       34           6
Shuttle        2175        1522       653       9            5
Contraceptive  1473        1031       442       9            3
Pen Based      1100        770        330       16           10
Table 2. Performance comparison on benchmark datasets for linear classifiers, in terms of mean ± std of the testing multiclass accuracy (%). The Win/Loss (W/L) denotes the number of datasets for which MNP-KSVC is superior/inferior to the compared classifiers. Ave. Acc and Ave. rank denote each classifier's average accuracy and average rank over all datasets.

Datasets       Multi-SVM     MBSVM         MTPMSVM       Twin-KSVC     MNP-KSVC
Balance        79.91 ± 4.86  86.14 ± 4.63  87.43 ± 3.56  86.64 ± 1.92  88.33 ± 2.03
Ecoli          72.32 ± 7.03  73.62 ± 4.95  73.77 ± 4.33  74.62 ± 3.86  75.91 ± 2.52
Iris           93.24 ± 2.66  92.96 ± 2.24  93.21 ± 2.53  92.69 ± 3.24  94.13 ± 2.49
Glass          69.18 ± 9.85  72.89 ± 7.28  68.68 ± 6.79  71.65 ± 5.96  71.38 ± 5.38
Wine           93.24 ± 3.02  95.28 ± 1.46  98.19 ± 1.61  97.23 ± 1.12  97.14 ± 1.27
Thyroid        90.24 ± 2.53  93.74 ± 1.58  92.92 ± 1.43  96.97 ± 1.08  97.52 ± 1.54
Dermatology    81.82 ± 3.79  86.67 ± 1.69  84.46 ± 3.29  90.37 ± 2.32  89.06 ± 3.16
Shuttle        71.58 ± 4.78  84.04 ± 2.92  77.17 ± 3.3   78.76 ± 4.81  83.16 ± 1.92
Contraceptive  38.53 ± 3.76  43.95 ± 2.51  44.65 ± 2.93  42.25 ± 2.45  44.22 ± 2.03
Pen Based      79.59 ± 3.75  85.94 ± 1.26  81.94 ± 2.04  83.21 ± 2.77  86.78 ± 1.37
Ave. Acc       76.97         81.52         80.24         81.43         82.76
W/L            10/0          8/2           8/2           9/1           –
Ave. rank      4.6           2.9           3.1           2.67          1.7
Table 3. Performance comparison on benchmark datasets for nonlinear classifiers, in terms of the mean ± std of the testing multiclass accuracy (%). The Win/Loss (W/L) denotes the number of datasets for which MNP-KSVC is superior/inferior to the compared classifiers. Ave. Acc and Ave. rank denote each classifier's average accuracy and average rank over all datasets.

Datasets       Multi-SVM     MBSVM         MTPMSVM       Twin-KSVC     MNP-KSVC
Balance        79.94 ± 5.57  87.13 ± 4.57  89.92 ± 3.27  90.17 ± 4.08  91.41 ± 3.42
Ecoli          79.81 ± 5.19  80.82 ± 4.25  84.74 ± 3.49  80.12 ± 4.49  84.59 ± 3.83
Iris           90.26 ± 2.65  96.89 ± 1.83  97.39 ± 2.14  94.36 ± 2.03  98.04 ± 1.47
Glass          58.66 ± 4.76  52.64 ± 4.08  62.73 ± 3.95  56.01 ± 4.26  64.12 ± 2.98
Wine           94.36 ± 2.12  94.31 ± 1.87  98.38 ± 2.62  97.26 ± 1.61  98.04 ± 1.45
Thyroid        91.82 ± 1.75  93.27 ± 1.95  95.34 ± 0.89  94.14 ± 1.05  95.63 ± 0.84
Pen Based      86.53 ± 3.93  89.51 ± 3.29  85.06 ± 3.68  86.12 ± 2.47  88.78 ± 2.68
Dermatology    84.43 ± 4.29  83.82 ± 3.86  84.51 ± 2.64  83.33 ± 3.29  85.26 ± 3.12
Shuttle        74.36 ± 3.39  83.74 ± 2.73  87.06 ± 2.89  86.91 ± 2.38  89.37 ± 1.73
Contraceptive  42.09 ± 4.95  44.28 ± 4.02  45.93 ± 3.85  47.51 ± 3.57  47.47 ± 3.68
Ave. Acc       78.22         80.64         83.11         81.59         84.27
W/L            10/0          9/1           8/2           9/1           –
Ave. rank      4.3           3.69          2.3           3.3           1.4
Table 4. Results of the Friedman test on the learning results.

Case       Statistic F_F  p-Value         Hypothesis
Linear     6.5143         2.9503 × 10⁻⁴   reject
Nonlinear  10.2004        1.2267 × 10⁻⁵   reject