Article

Semi-Supervised Multi-Label Dimensionality Reduction Learning by Instance and Label Correlations

1 Yunnan Key Lab of Computer Technology Applications, Kunming University of Science and Technology, Kunming 650500, China
2 Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, China
3 School of Computing and Augmented Intelligence, Arizona State University, Tempe, AZ 85287, USA
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Submission received: 30 December 2022 / Revised: 25 January 2023 / Accepted: 1 February 2023 / Published: 3 February 2023

Abstract

The label learning mechanism is challenging to integrate into the training model of the multi-label feature space dimensionality reduction problem, making most current multi-label dimensionality reduction methods supervised. Many methods focus only on label correlations and ignore the instance correlations between the original feature space and the low-dimensional space. Additionally, very few techniques consider how to constrain the projection matrix so as to identify specific and common features in the feature space. In this paper, we propose a new approach of semi-supervised multi-label dimensionality reduction learning by instance and label correlations (SMDR-IC, in short). Firstly, we reformulate MDDM, which incorporates label correlations, as a least-squares problem so that the label propagation mechanism can be effectively embedded into the model. Secondly, we investigate instance correlations using the k-nearest neighbor technique, and then introduce $l_1$-norm and $l_{2,1}$-norm regularization terms to identify the specific and common features of the feature space. Experiments on numerous public multi-label data sets show that SMDR-IC performs better than other related multi-label dimensionality reduction methods.

1. Introduction

Multi-label learning tasks, like other machine learning paradigms, are frequently accompanied by high-dimensional feature spaces. When learning directly from large-scale, high-dimensional data, algorithms often perform poorly in classification [1]. First, the overabundance of redundant and irrelevant information in the high-dimensional space makes it more difficult for the model to identify the interclass structure of the data; second, the multicollinearity between the feature attributes of high-dimensional data results in poor generalization ability; and third, in high-dimensional feature spaces, traditional distances (such as the Euclidean distance) cannot adequately measure the manifold structure between samples, even though the Euclidean distance plays a significant role in the majority of classifiers. The term “curse of dimensionality” [2] is used to describe these problems. As a result, dimensionality reduction becomes an important preprocessing step for multi-label learning on high-dimensional data.
Contrary to traditional single-label learning tasks, which presuppose that the labels of instances are mutually exclusive, the labels of multi-label instances are inter-correlated. In multi-label image classification, for instance, “desert” and “camel” frequently co-occur, whereas “sea” and “desert” rarely do. This fact motivates people to infer the unknown labels of samples from known labels by relying on the correlations between labels. Unfortunately, it is challenging to incorporate such a label-learning mechanism into the model to estimate the labels of unlabeled instances due to the particularity of feature dimensionality reduction. Because of this, the majority of multi-label dimensionality reduction techniques developed recently are supervised methods that require enough labeled instances [3,4,5,6]. These supervised approaches, however, overlook the fact that, in many real-world applications, it is difficult to annotate enough unlabeled instances because of the high labeling costs. As a consequence, labeling a sizable training data set is typically impractical in realistic situations [7]. Unlabeled instances, by contrast, are typically plentiful and easy to obtain. Consequently, several semi-supervised multi-label dimensionality reduction techniques have been developed (see [8,9,10]), which can enhance learning performance by efficiently combining a large number of unlabeled instances with a scarce number of labeled instances.
Most of these semi-supervised dimensionality reduction techniques start by developing various label propagation methods based on label correlations and then apply them to the k-nearest neighbor (kNN) graph to generate soft labels for the unlabeled instances. After the labeled training samples are expanded in this way, the projection matrix from the high-dimensional space to the low-dimensional space is trained by maximizing the between-class distance and minimizing the within-class distance of the sample features. This approach, known as multi-label linear discriminant analysis (MLDA) [6], or one of its extended variants, is at the core of these methods. Although the MLDA framework can integrate label correlations, it ignores the instance correlations between the original feature space and the low-dimensional space.
Given that multi-label data with high-dimensional features contain a significant amount of redundant and irrelevant information, it is necessary to selectively extract specific and common features of the feature space while reducing dimensions, and to remove the negative effects of unimportant features. This is comparable to feature extraction and feature selection. The $l_{2,1}$-norm is frequently employed in multi-label feature selection because it can choose distinguishing features for all instances through joint sparsity (each feature receives either a low score or a high score across all instances) [11,12]. The drawback of the $l_{2,1}$-norm is equally clear, though: it does not take into account the distinctive features or the redundant correlation of features [13]. To compensate for the shortcomings of the $l_{2,1}$-norm and improve the projection matrix's ability to identify within-class features of samples, we use the $l_1$-norm as a model regularization term to learn the highly sparse specific features of the low-dimensional feature space of samples.
In this paper, we propose a novel method, namely semi-supervised multi-label dimensionality reduction learning by instance and label correlations (SMDR-IC), based on dependence maximization. This method effectively utilizes the information from both labeled and unlabeled instances, simultaneously considers label and instance correlations, and also incorporates specific and common features of the feature space. Our major contributions are summarized as follows.
The Hilbert–Schmidt Independence Criterion (HSIC) [14] has been mathematically shown to be an effective measure of the dependence between the original feature description and the associated class label. Motivated by this, we use the matrix factorization technique to reconstruct the HSIC empirical estimator in MDDM [4] as a least-squares problem, enabling the label propagation mechanism to be seamlessly incorporated into the dimensionality reduction learning model.
Consideration is given to the instance correlations. In order to use instance correlations in dimensionality reduction, we introduce a new assumption, which states that if two instances have a high degree of correlation in the original feature space, they should also have a high degree of correlation in the low-dimensional feature space. The instance correlations are assessed using the k-nearest neighbor approach.
Through the use of the $l_1$-norm and $l_{2,1}$-norm regularization terms to select the appropriate features, the specific features and the common features of the feature space are simultaneously investigated in our method, which helps to enhance the performance of dimensionality reduction.
The rest of the paper is organized as follows. The related work is briefly reviewed in Section 2. Section 3 introduces the details of our proposed SMDR-IC method. Experimental results are analyzed in Section 4. Finally, Section 5 concludes this paper.

2. Related Work

2.1. Dimensionality Reduction

In this section, we mainly review the related works on dimensionality reduction, including unsupervised, supervised, and semi-supervised methods.
For a long time, people have been concerned with the topic of data feature dimensionality reduction. Since unsupervised dimensionality reduction techniques do not use label information, they can theoretically be applied to reduce the dimensionality of multi-label instances without using label data. A popular dimensionality reduction technique is principal component analysis (PCA), which constructs the projection matrix by maximizing feature variances or minimizing the squared reconstruction error [15]. Other unsupervised techniques, such as locally linear embedding (LLE) [16], Laplacian eigenmaps (LE) [17], and flexible manifold embedding (FME) [18], have also been reported. Latent semantic indexing (LSI) [19], which was first used for document analysis and information retrieval, has since evolved into a successful unsupervised dimensionality reduction method. These techniques primarily aim at obtaining a low-dimensional representation by maintaining the manifold structure of instances. The large amount of redundant and irrelevant information in high-dimensional data features, however, makes it impossible for a single feature learning method to extract a varied representation of the data. Consequently, supervised and semi-supervised dimensionality reduction modes that couple feature and label information are becoming increasingly popular for the multi-label dimension reduction problem. The goal of multi-label informed latent semantic indexing (MLSI) [20], which extends LSI to a supervised method, is to obtain a projection matrix that maximizes feature variances and binary label variances through a linear combination; this method aims to capture correlations between labels while also preserving the information of the inputs. However, it does not investigate the internal relationship between features and labels.
Currently, there are three basic frameworks for dimensionality reduction of the multi-label data feature space. The first strategy for training the projection matrix is to identify the principal directions in the label space and the feature space and maximize the linear correlation between them. The theoretical foundation of this tactic comes from Hotelling, who proposed in [21] that canonical correlation analysis (CCA) can be viewed as the problem of locating the basis vectors of two groups of variables so as to maximize the correlation between the projections of the variables on these basis vectors. CCA can be used directly, without any adjustments, in the multi-label scenario. Hardoon et al. [5] were the first to use the CCA technique to address the multi-label dimensionality reduction problem. The CCA method of multi-label dimensionality reduction described in [5] aims to maximize the linear correlation between the feature set derived from the low-dimensional projection space and the label set. As a foundational technique, CCA has led to numerous extensions in multi-label feature dimensionality reduction. LS-CCA [22], for example, expands CCA with a least-squares formulation and several regularized variants, showing that CCA can be transformed into slightly different least-squares problems. 2SDSR [23] extends CCA by combining it with other feature reduction methods. The drawback of the CCA framework is that extending it to the semi-supervised setting is challenging.
One of the most well-known supervised dimensionality reduction techniques is linear discriminant analysis (LDA) [24], which utilizes label information to define the between-class and within-class scatter matrices and then maximizes the Rayleigh quotient between the two matrices to find a projection matrix that makes instances of the same class close and instances of different classes far apart in a low-dimensional space. This technique provides the second main framework for dimensionality reduction. By employing various label weighting settings, LDA has been extended to multiple types of multi-label LDA versions [6,25,26,27,28,29]: wMLDAb [25] adopts a binary weight, wMLDAe [26] an entropy-based weight, wMLDAc (i.e., MLDA) [6] a correlation-based weight, wMLDAf [27] a fuzzy-based weight, and wMLDAd [28] a dependence-based weight. MLDA-LC [29], in particular, constructs an adjacency graph to represent instance similarity as a graph Laplacian matrix and then combines the Laplacian matrix with the MLDA method to reveal the local structure of multi-label instances. The advantage of the LDA framework is that the optimization problems built on it can easily be converted into matrix eigenvalue and eigenvector problems. The disadvantage is that, because the scatter matrices are constructed using the Euclidean distance, this traditional distance cannot adequately capture the data's complex manifold structure when the feature dimension is high.
Gretton et al. [14] developed the mathematical theory of HSIC in 2005, which provides the third key framework for dimensionality reduction. The empirical estimator of HSIC, as well as an explanation of how HSIC measures the relationship between the original feature description and the associated class label, can be found in [14]. Since then, multi-label dimensionality reduction learning has begun focusing on HSIC. MDDM (multi-label dimensionality reduction via dependence maximization) [4], a supervised baseline technique of multi-label dimensionality reduction within the HSIC framework, aims to learn the projection matrix by maximizing feature-label dependence (MDDM and HSIC are briefly reviewed in Section 2.2). In its initial form, MDDM was developed using two different projection strategies: MDDMp and MDDMf. The former relies on orthonormal projection directions, whereas the latter makes the projected features orthonormal. In order to avoid direct eigendecomposition of the large-scale matrix, SSMDDM [30] presents an effective approach for finding the optimal solution of MDDM: it reformulates MDDM as a least-squares problem and develops a shared-subspace MDDM for multi-label dimensionality reduction. However, the label correlations, which are crucial for multi-label learning, are not taken into account in the least-squares problem of MDDM as recast by SSMDDM.
In the last ten years, a number of semi-supervised multi-label dimensionality reduction techniques have been put forth to make use of labeled and unlabeled instances. Some of the methods combine the learning of a classifier with the learning of a low-dimensional embedding, such as SSDR-MC [8], BSSML [31], and others [32]. The corresponding supervised multi-label dimensionality reduction techniques are also available in semi-supervised forms. Examples include [33], which introduces a semi-supervised CCA based on Laplacian regularization; MSDA [34], which adds two regularization terms to the MLDA objective function by setting up two matrices (an adjacency matrix and a similarity matrix); SSMLDR [9], which first obtains soft labels of unlabeled instances by label propagation and then uses the soft labels of all instances, both labeled and unlabeled, to construct the scatter matrices of MLDA; and SMDRdm [10], which, similarly to [9], constructs the scatter matrices by estimating the soft labels of unlabeled instances using label propagation and further plugs the empirical measure of HSIC into the LDA framework to train the projection matrix with inter-class scatter minimization and dependence maximization as the objective function. SSMLDR and SMDRdm only pay attention to label correlations and ignore instance correlations between the original feature space and the low-dimensional feature space. Furthermore, these two methods are based on LDA, a framework that heavily relies on the distance function, making them more vulnerable to outliers and increasing the soft label error [35,36,37]. Recently, Mikalsen et al. [38] extended dimensionality reduction to noisy multi-label cases by developing a noise-robust label propagation method, which they use to label unlabeled samples before applying the MDDM method to reduce the dimension of the data features; this approach is named NMLSDR.
Although these semi-supervised strategies have shown good experimental results, they cannot overcome the restrictions of the three frameworks mentioned above. These framework models can only be improved to a limited extent, and additional regularization term constraints cannot be freely added, which is why current dimensionality reduction methods rarely consider the specific and common features of the feature space.

2.2. The Brief Review of HSIC and MDDM

As a helpful measure of dependence, the Hilbert–Schmidt Independence Criterion (HSIC) has been used in numerous machine learning applications. Let $X = [x_1, x_2, \ldots, x_n]$ be a training data set consisting of $n$ instances, and $Y = [y_1, y_2, \ldots, y_n]$ be the label matrix, where $y_i$ denotes the class label vector of $x_i$. Given a multi-label data set $\{(X, Y)\}$ with joint distribution $P_{XY}$, let $k$ and $l$ denote the kernel functions; the feature kernel matrix and the label kernel matrix are defined as $K \in \mathbb{R}^{n \times n}$ with $K_{ij} = k(x_i, x_j)$, and $L \in \mathbb{R}^{n \times n}$ with $L_{ij} = l(y_i, y_j)$, respectively. Then, the empirical estimator of the HSIC is given by [14]:
$\mathrm{HSIC}(X, Y, P_{XY}) = (n-1)^{-2} \, \mathrm{tr}(HKHL),$ (1)
where $\mathrm{tr}(\cdot)$ indicates the trace operation of a matrix, and $H \in \mathbb{R}^{n \times n}$ is the centering matrix defined as $H_{ij} = \delta_{ij} - \frac{1}{n}$, where $\delta_{ij} = 1$ if $i = j$ and $\delta_{ij} = 0$ otherwise.
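For illustration, the following minimal Python sketch (not the authors' code) evaluates the empirical HSIC of Equation (1) for precomputed kernel matrices; the toy data, the linear kernels, and the function name are our own assumptions.

```python
import numpy as np

def hsic_empirical(K, L):
    """Empirical HSIC estimate (n-1)^{-2} tr(HKHL) for precomputed
    n x n feature and label kernel matrices K and L."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n   # centering matrix H_ij = delta_ij - 1/n
    return np.trace(H @ K @ H @ L) / (n - 1) ** 2

# toy usage with linear kernels on random data (rows are instances)
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 5))          # 20 instances, 5 features
Y = rng.integers(0, 2, size=(20, 3))      # 20 instances, 3 binary labels
print(hsic_empirical(X @ X.T, Y @ Y.T))
```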
By maximizing the correlation between the original feature description and the relevant class labels, the HSIC empirical estimator is used in MDDM to project the original data into a low-dimensional feature space. Denote the projection matrix as $P$. An instance $x$ is projected into a new space by $\phi(x) = P^T x$, and the induced kernel functions are given as $k(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle = \langle P^T x_i, P^T x_j \rangle$ and $l(y_i, y_j) = \langle y_i, y_j \rangle$, where $\langle x_i, x_j \rangle$ denotes the inner product defined as $\langle x_i, x_j \rangle = x_i^T x_j$ [4].
Dropping $(n-1)^{-2}$ and noting that $K = X^T P P^T X$ and $L = Y^T Y$, the optimization of Equation (1) can be written as searching for the optimal linear projection:
$P^* = \arg\max_{P} \, \mathrm{tr}(H X^T P P^T X H L).$ (2)
To avoid a trivial solution, an additional constraint for  P is introduced, which leads to the following expression [4]:
$\max_{P} \, \mathrm{tr}(H X^T P P^T X H L), \quad \text{s.t. } p_i^T (\mu X X^T + (1-\mu) I) p_j = \delta_{ij} \ (1 \le i, j \le d),$ (3)
where $d$ is the dimension of the lower-dimensional space, $P = [p_1, p_2, \ldots, p_d]$, and $\mu \in [0, 1]$ is a pre-defined parameter to control the importance between the two constraints. When $\mu = 0$, the projection matrix is orthonormal, which is called an orthogonal projection [4]; when $\mu = 1$, the projected features are uncorrelated on the training data, which is called uncorrelated subspace dimensionality reduction [4].
It is easy to verify that the optimal solutions of Equation (3) are characterized by the following generalized eigenvalue problem:
$X H L H X^T p = \lambda (\mu X X^T + (1-\mu) I) p.$ (4)
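A small sketch of how Equation (4) could be solved in practice is given below; it uses SciPy's generalized symmetric eigensolver and linear label kernels, and the function name, data layout, and choice of solver are illustrative assumptions rather than the original implementation.

```python
import numpy as np
from scipy.linalg import eigh

def mddm_projection(X, Y, r, mu=0.5):
    """Sketch of the MDDM projection of Eq. (4): X is d x n (columns are
    instances), Y is c x n (columns are label vectors), r is the target
    dimension.  Returns a d x r projection matrix."""
    d, n = X.shape
    H = np.eye(n) - np.ones((n, n)) / n
    L = Y.T @ Y                             # linear label kernel
    A = X @ H @ L @ H @ X.T                 # left-hand matrix of Eq. (4)
    B = mu * (X @ X.T) + (1 - mu) * np.eye(d)
    w, V = eigh(A, B)                       # generalized symmetric eigenproblem
    return V[:, np.argsort(w)[::-1][:r]]    # eigenvectors of the r largest eigenvalues
```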

3. Materials and Methods

In this section, we first go over some important notation and symbols, and thereafter elaborate on our proposed SMDR-IC method.

3.1. Preliminaries

Let $X = [x_1^T; \ldots; x_l^T; x_{l+1}^T; \ldots; x_n^T] \in \mathbb{R}^{n \times d}$ be a data set of $n$ $d$-dimensional instances, where the first $l$ instances are labeled and the remaining $u$ are unlabeled, with $l + u = n$. $L = \{1, 2, \ldots, c\}$ denotes the label set, where $c$ is the number of possible labels. Drawing on [9], an additional $(c+1)$-th label is appended to the label set in order to detect outliers. Define the initial label matrix $Y = [y_1^T; \ldots; y_l^T; y_{l+1}^T; \ldots; y_n^T] = [Y_l; Y_u] \in \mathbb{R}^{n \times (c+1)}$, where $Y_l$ denotes the initial labels of the labeled instances and $Y_u$ denotes the initial labels of the unlabeled instances. For the labeled instances, $Y_{ij} = 1$ if the $i$-th instance is labeled as $j$, and $Y_{ij} = 0$ otherwise. For the unlabeled instances, $Y_{ij} = 1$ if $j = c+1$, and $Y_{ij} = 0$ otherwise. We denote the predicted label matrix as $F = [F_1^T; \ldots; F_l^T; F_{l+1}^T; \ldots; F_n^T] = [F_l; F_u] \in \mathbb{R}^{n \times (c+1)}$, where $F_i \in \mathbb{R}^{c+1}$ $(1 \le i \le n)$ are column vectors and $0 \le F_{ij} \le 1$. $F_l$ denotes the predicted labels of the labeled instances and $F_u$ denotes the predicted labels of the unlabeled instances.
The objective is to learn a projection matrix $P \in \mathbb{R}^{d \times t}$ that projects an instance $x$ from the original feature space $\mathbb{R}^d$ to a lower-dimensional representation $z \in \mathbb{R}^t$, with
$z = x^T P,$ (5)
where $t \ll d$.

3.2. Obtaining Soft Label by Label Propagation

3.2.1. Neighborhood Graph Construction

To accomplish label propagation, a graph consisting of labeled and unlabeled instances is built to evaluate the similarities among neighboring instances. The weighted adjacency matrix $W$ is defined by using a $k$NN graph over the $n$ instances, as shown below:
$W_{ij} = \begin{cases} 1, & \text{if } x_i \in k\mathrm{NN}(x_j) \text{ or } x_j \in k\mathrm{NN}(x_i), \\ 0, & \text{otherwise}, \end{cases}$ (6)
where $k\mathrm{NN}(x_i)$ contains the $k$-nearest neighbors of $x_i$ computed by the Euclidean distance. Because of its simplicity and wide applicability, the weight in Equation (6) is simply set to a 0-1 weight. This adjacency matrix $W$ can also be obtained with other weight settings (e.g., a Gaussian heat kernel) and other distance measures.
We normalize $W$ as a stochastic matrix $\tilde{W}$ to ensure that the sum of the transition probabilities from node $i$ to the other nodes of the graph equals 1, as follows:
$\tilde{W} = D^{-1} W,$ (7)
where $D = \mathrm{diag}(d_{11}, \ldots, d_{nn})$ and $d_{ii} = \sum_{j=1}^{n} W_{ij}$. $\tilde{W}_{ij}$ can be considered as the probability of a transition from node $i$ to node $j$ along the edge between them.
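A minimal sketch of this graph construction and normalization, under the assumption of a brute-force Euclidean kNN search, is shown below (function and variable names are ours).

```python
import numpy as np

def knn_graph(X, k=10):
    """Symmetric 0-1 kNN adjacency (Eq. (6)) and its row-stochastic
    normalization W_tilde = D^{-1} W (Eq. (7)).  X is n x d (rows are instances)."""
    n = X.shape[0]
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # Euclidean distances
    np.fill_diagonal(dist, np.inf)              # exclude self-neighbors
    nn = np.argsort(dist, axis=1)[:, :k]        # indices of the k nearest neighbors
    W = np.zeros((n, n))
    rows = np.repeat(np.arange(n), k)
    W[rows, nn.ravel()] = 1.0
    W = np.maximum(W, W.T)                      # x_i in kNN(x_j) OR x_j in kNN(x_i)
    W_tilde = W / W.sum(axis=1, keepdims=True)  # D^{-1} W
    return W, W_tilde
```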

3.2.2. Label Propagation

Multi-label learning differs from single-label learning: the former assumes that the labels of instances are inter-correlated, while the latter assumes that instance labels are independent of one another. To convert the single-label propagation version to the multi-label propagation version, we first normalize the initial label matrix:
$\tilde{Y}_{ij} = \begin{cases} \frac{1}{|Y_i|}, & \text{if } Y_{ij} = 1, \\ 0, & \text{otherwise}, \end{cases}$ (8)
where $|Y_i|$ denotes the number of labels that the instance $x_i$ belongs to.
The following equation updates the probability that instance x i has the j-th class label:
$F_{ij}(t+1) = \lambda_i \sum_{k=1}^{n} \tilde{W}_{ik} F_{kj}(t) + (1 - \lambda_i) \tilde{Y}_{ij}.$ (9)
Obviously, in each iteration, the probability for each instance propagates partially from its neighbors and partially from its own labels. Based on the labeled and unlabeled instances, we divide the matrices $\tilde{W}$, $\tilde{Y}$, and $F$ into the following block forms:
$\tilde{W} = \begin{bmatrix} \tilde{W}_{ll} & \tilde{W}_{lu} \\ \tilde{W}_{ul} & \tilde{W}_{uu} \end{bmatrix}, \quad \tilde{Y} = \begin{bmatrix} \tilde{Y}_l \\ \tilde{Y}_u \end{bmatrix}, \quad F = \begin{bmatrix} F_l \\ F_u \end{bmatrix}.$ (10)
For the labeled instances, we fix the labels as $F_l = \tilde{Y}_l$ and set $\lambda_l = 0$. For the unlabeled instances, the iteration can be written as follows:
$F_u(t+1) = I_{\lambda_u} \tilde{W}_{ul} F_l(t) + I_{\lambda_u} \tilde{W}_{uu} F_u(t) + (I - I_{\lambda_u}) \tilde{Y}_u,$ (11)
where $I \in \mathbb{R}^{u \times u}$ is an identity matrix and $I_{\lambda_u} \in \mathbb{R}^{u \times u}$ is a diagonal matrix with diagonal elements $\lambda_u$. Since $F_l(t) = \tilde{Y}_l$ and $F_u(0) = \tilde{Y}_u$, we have:
$F_u(t+1) = \sum_{i=0}^{t} (I_{\lambda_u} \tilde{W}_{uu})^i I_{\lambda_u} \tilde{W}_{ul} \tilde{Y}_l + (I_{\lambda_u} \tilde{W}_{uu})^{t+1} \tilde{Y}_u + \sum_{i=0}^{t} (I_{\lambda_u} \tilde{W}_{uu})^i (I - I_{\lambda_u}) \tilde{Y}_u.$ (12)
$\{F_u(t)\}$ is a convergent sequence; the convergence analysis is presented in [9]. After a finite number of iterations, $F_u$ converges to
$F_u = (I - I_{\lambda_u} \tilde{W}_{uu})^{-1} \big( I_{\lambda_u} \tilde{W}_{ul} \tilde{Y}_l + (I - I_{\lambda_u}) \tilde{Y}_u \big).$ (13)
It is easily found that the sum of each row of $F_u$ is equal to 1. This means that the elements of $F$ are probability values and $F_{ij}$ can be viewed as the posterior probability of the instance $x_i$ belonging to the $j$-th class. In particular, $F_{i,c+1}$ represents the probability of the instance $x_i$ being an outlier. After obtaining the predicted labels $F_{ij}$ for each instance $x_i$, we refer to the labels $F_{ij}$ $(1 \le j \le c)$ as soft labels.
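The closed-form propagation of Equation (13) can be sketched as follows; the function name and the convention that the first rows of the matrices correspond to the labeled instances are illustrative assumptions.

```python
import numpy as np

def propagate_soft_labels(W_tilde, Y_tilde, n_labeled, lam_u=0.99):
    """Closed-form soft labels for the unlabeled instances (Eq. (13)).
    W_tilde: n x n row-stochastic graph, Y_tilde: n x (c+1) normalized
    initial labels; the first n_labeled rows are the labeled instances."""
    l = n_labeled
    W_ul = W_tilde[l:, :l]
    W_uu = W_tilde[l:, l:]
    Y_l, Y_u = Y_tilde[:l], Y_tilde[l:]
    u = W_uu.shape[0]
    I_lam = lam_u * np.eye(u)               # diagonal matrix I_{lambda_u}
    F_u = np.linalg.solve(np.eye(u) - I_lam @ W_uu,
                          I_lam @ W_ul @ Y_l + (np.eye(u) - I_lam) @ Y_u)
    return np.vstack([Y_l, F_u])            # F_l is fixed to Y_tilde_l
```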

3.3. Dimensionality Reduction

In this subsection, we utilize the soft labels to learn the projection matrix $P$. Firstly, we describe in detail how to incorporate the label propagation mechanism and label correlations into MDDM to construct a least-squares formulation; then we integrate instance correlations as well as specific and common features of the feature space into our approach; finally, we discuss how the proposed dimensionality reduction model is optimized.

3.3.1. Design of the Semi-Supervised Mode

It is evident from the label propagation Equations (6) and (9) that label propagation estimates the soft labels of unlabeled instances by the similarity of instance features, whereas label correlations are almost never employed. Therefore, to improve the performance of the propagation mechanism, we take label correlations into consideration. Following previous work [6], the correlation, namely the cosine similarity, between two distinct label classes is expressed as follows:
$C_{kl} = \cos(Y^{(k)}, Y^{(l)}) = \frac{\langle Y^{(k)}, Y^{(l)} \rangle}{\|Y^{(k)}\| \, \|Y^{(l)}\|}.$ (14)
Then, we have $\tilde{F}_u = F_u C$ and $\tilde{F} = [F_l; \tilde{F}_u]$. The resulting soft label matrix is then used to compute the label kernel matrix $L$ of Section 2.2, effectively transforming MDDM into a semi-supervised technique. $L$ can be rewritten as:
$L = \tilde{F} \tilde{F}^T.$ (15)
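A small sketch of Equations (14) and (15) is given below; computing $C$ from the label matrix of the labeled instances (including the outlier column, so that the dimensions match $F_u$) is our assumption, and all names are illustrative.

```python
import numpy as np

def label_correlations(Y_labeled):
    """Cosine correlations between label columns (Eq. (14));
    Y_labeled is the initial label matrix of the labeled instances."""
    norms = np.linalg.norm(Y_labeled, axis=0, keepdims=True) + 1e-12  # avoid division by zero
    return (Y_labeled.T @ Y_labeled) / (norms.T @ norms)

# usage sketch: weight the propagated soft labels of the unlabeled part and
# form the label kernel of Eq. (15)
# C = label_correlations(Y_l)              # (c+1) x (c+1)
# F_tilde = np.vstack([F_l, F_u @ C])      # F_tilde = [F_l; F_u C]
# L = F_tilde @ F_tilde.T                  # L = F_tilde F_tilde^T
```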

3.3.2. Reformulate MDDM to Least Square

To reformulate MDDM as a least-squares problem, we first define two necessary matrices $S_1$ and $S_2$ as follows:
$S_1 = \mu X^T X + (1-\mu) I, \quad S_2 = X^T H L H X.$ (16)
Then, the generalized eigenvalue problem of Equation (4) can be written as $S_2 p = \lambda S_1 p$. If $S_1$ is invertible (when $\mu \ne 1$, $S_1$ is always invertible), the optimal $P$ is given by the eigenvectors of $S_1^{-1} S_2$ corresponding to the largest eigenvalues. Now, we apply the matrix decomposition technique to decompose $S_1^{-1} S_2$. The singular value decomposition of $X$ is defined as:
$X = U \, \mathrm{diag}(\Sigma_t, 0) \, V^T,$ (17)
where $U \in \mathbb{R}^{n \times n}$ and $V \in \mathbb{R}^{d \times d}$ are orthogonal matrices and $t = \mathrm{rank}(X)$; $\Sigma_t \in \mathbb{R}^{t \times t}$ is a diagonal matrix whose diagonal elements are the singular values of $X$, and $\mathrm{diag}(\Sigma_t, 0) \in \mathbb{R}^{n \times d}$ is a matrix whose first $t$ diagonal elements are the singular values of $X$ and whose other entries are 0.
Let $U = [U_1, U_2]$ and $V = [V_1, V_2]$, where $U_1 \in \mathbb{R}^{n \times t}$, $U_2 \in \mathbb{R}^{n \times (n-t)}$, $V_1 \in \mathbb{R}^{d \times t}$, and $V_2 \in \mathbb{R}^{d \times (d-t)}$. Then $S_1$ and $S_2$ can be rewritten as:
$S_1 = V_1 [\mu \Sigma_t^2 + (1-\mu) I] V_1^T, \quad S_2 = V_1 \Sigma_t U_1^T H \tilde{F} \tilde{F}^T H U_1 \Sigma_t V_1^T.$ (18)
According to Equation (18), we can calculate $S_1^{-1} S_2$ as:
$S_1^{-1} S_2 = V_1 [\mu \Sigma_t^2 + (1-\mu) I]^{-1} \Sigma_t U_1^T H \tilde{F} \tilde{F}^T H U_1 \Sigma_t V_1^T.$ (19)
We define a diagonal matrix B as follows:
$B = [\mu \Sigma_t^2 + (1-\mu) I]^{-\frac{1}{2}}.$ (20)
Notice that $B$ is invertible and $B = B^T$. According to Equations (19) and (20), we rewrite $S_1^{-1} S_2$ as:
$S_1^{-1} S_2 = V_1 B B^T \Sigma_t U_1^T H \tilde{F} \tilde{F}^T H U_1 \Sigma_t B B^{-1} V_1^T.$ (21)
Denote $T = \tilde{F}^T H U_1 \Sigma_t B \in \mathbb{R}^{c \times t}$, and let $T = P_1 \Lambda P_2^T$ be the singular value decomposition of $T$, where $P_1 \in \mathbb{R}^{c \times c}$, $P_2 \in \mathbb{R}^{t \times t}$, and $\Lambda \in \mathbb{R}^{c \times t}$ is a diagonal matrix. Then we have:
$S_1^{-1} S_2 = V_1 B P_2 \Lambda^T P_1^T P_1 \Lambda P_2^T B^{-1} V_1^T = V_1 B P_2 \tilde{\Lambda} P_2^T B^{-1} V_1^T,$ (22)
where $\tilde{\Lambda} = \Lambda^T \Lambda$. Thus, the solution of problem Equation (4), which consists of the eigenvectors corresponding to the eigenvalues of $S_1^{-1} S_2$, is given by the following equation:
$P = V_1 B P_2.$ (23)
Consider the following least squares problem:
$\min_{P} \|X P - Z\|_F^2,$ (24)
where we assume that both the observation matrix X and the target matrix Z are centered. The optimal solution of Equation (24) is given by [39]:
$P = (X^T X)^{\dagger} X^T Z,$ (25)
where $(\cdot)^{\dagger}$ is the pseudo-inverse of a matrix. If we set $Z = U_1 \Sigma_t B P_2$, then we have:
$P = (V_1 \Sigma_t^2 V_1^T)^{\dagger} V_1 \Sigma_t U_1^T U_1 \Sigma_t B P_2 = V_1 B P_2,$ (26)
which is exactly the same as Equation (23). This implies that the MDDM formulation in Equation (4) is equivalent to the least-squares formulation in Equation (24). On the basis of this equivalence, we can attach additional constraint conditions, such as instance correlations and sparsity constraints, as regularization terms.
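The construction of the target matrix $Z$ from the thin SVD of the (centered) data, as used in the equivalence above, could be sketched as follows; the rank threshold and the function name are illustrative choices, not the authors' implementation.

```python
import numpy as np

def least_squares_target(X, F_tilde, mu=0.5):
    """Build the target matrix Z = U1 Sigma_t B P2 from the thin SVD of the
    (centered) data X (n x d), so that min_P ||XP - Z||_F^2 recovers the
    MDDM solution P = V1 B P2 (Eqs. (17)-(26))."""
    n = X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    U1, s, _ = np.linalg.svd(X, full_matrices=False)     # thin SVD of X
    t = int(np.sum(s > 1e-10))                           # numerical rank of X
    U1, s = U1[:, :t], s[:t]
    B = np.diag(1.0 / np.sqrt(mu * s**2 + (1 - mu)))     # B = [mu Sigma^2 + (1-mu)I]^(-1/2)
    T = F_tilde.T @ H @ U1 @ np.diag(s) @ B              # T = F_tilde^T H U1 Sigma_t B
    _, _, P2t = np.linalg.svd(T, full_matrices=True)     # T = P1 Lambda P2^T
    return U1 @ np.diag(s) @ B @ P2t.T                   # Z = U1 Sigma_t B P2
```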

3.3.3. Incorporating Instance Correlations

In our dimensionality reduction approach, we not only consider the label correlations as shown in Equations (15) and (18) but also incorporate the instance correlations. By assuming that two instances $x_i$ and $x_j$ may be related in the label space if they are correlated in the feature space, previous classification algorithms [40,41] incorporate instance correlations. In fact, instances in multi-label dimensionality reduction problems should also maintain the interrelation of their features and labels before and after dimensionality reduction. As a result, if two instances $x_i$ and $x_j$ are correlated in the original feature space, we assume they will also be related in the low-dimensional feature space.
Instead of evaluating the instance correlations using the cosine similarity, we adopt the k-nearest neighbor (kNN) mechanism to reduce the effects of noisy and redundant features. Thus, the weighted adjacency matrix W in Equation (6) can be exploited to define the following regularization term for assessing the interrelation of feature space:
$\min_{P} \sum_{i,j}^{n} W_{ij} \|x_i^T P - x_j^T P\|^2 = \mathrm{tr}((XP)^T L (XP)) = \mathrm{tr}(O^T L O),$ (27)
where $O = XP$ is the output low-dimensional matrix and $L = W^* - W$ indicates the $n \times n$ Laplacian matrix of $W$; $W^*$ is a diagonal matrix with $W^*_{ii} = \sum_{j=1}^{n} W_{ij}$. After incorporating this regularization, we can rewrite our objective function in Equation (24) as follows:
$\min_{P} \frac{1}{2} \|XP - Z\|_F^2 + \frac{\alpha}{2} \mathrm{tr}(O^T L O).$ (28)
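A short sketch of the instance-correlation regularizer in Equations (27) and (28) is shown below, assuming the 0-1 kNN adjacency $W$ from Equation (6); the function name is ours.

```python
import numpy as np

def laplacian_regularizer(X, P, W):
    """Instance-correlation regularizer tr((XP)^T L (XP)) with the graph
    Laplacian L = W* - W of the kNN adjacency W (Eq. (27))."""
    Lap = np.diag(W.sum(axis=1)) - W   # W* is diagonal with the row sums of W
    O = X @ P                          # low-dimensional output
    return np.trace(O.T @ Lap @ O)
```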

3.3.4. Incorporating Specific and Common Features

Furthermore, two norm regularization terms that constrain the sparsity of the matrix $P$ are also incorporated into our technique to enhance the performance of dimensionality reduction. One is the $l_1$-norm, which enforces sparsity among all elements of $P$ and shrinks some parameters to zero, allowing specific features of the original feature space to be selected. The other is the $l_{2,1}$-norm, which guarantees row sparsity of $P$ and is advantageous for choosing common features of the original feature space. For instance, in Figure 1, when the sparse projection matrix $P$ with five rows and three columns is used to reduce the dimensions of instances $x_1$ and $x_2$, the first feature of the projected instances $z_1$ and $z_2$ is dominated by the original features $f_1$ and $f_4$, the second is made up of the original features $f_1$, $f_2$, and $f_3$, and the third is determined by $f_2$ and $f_5$. These can be thought of as the specific features of each feature in the low-dimensional feature space, and we employ the $l_1$-norm to give $P$ this property. Since features $f_1$ and $f_2$ contribute to the first, second, and third features of $z_1$ and $z_2$, they can be regarded as common features of the original feature space, and we leverage the $l_{2,1}$-norm to capture these common features.
The final objective function of our proposed approach can be rewritten as Equation (29) after incorporating these two regularization terms.
$\min_{P} \frac{1}{2} \|XP - Z\|_F^2 + \frac{\alpha}{2} \mathrm{tr}(O^T L O) + \beta \|P\|_1 + \gamma \|P\|_{2,1},$ (29)
where $\alpha$, $\beta$, and $\gamma$ are constant coefficients.

3.3.5. Optimization

Although Equation (29) is a convex optimization problem, the objective function is not smooth because of the non-smoothness of the $l_1$-norm and $l_{2,1}$-norm regularization terms. To address this non-smooth optimization problem, we first relax $\|P\|_{2,1}$ by $\mathrm{tr}(P^T A P)$ following [11], where $A$ is a $d \times d$ diagonal matrix whose $i$-th diagonal element is $A_{ii} = \frac{1}{2 \|P_i\|_2}$, with $P_i$ the $i$-th row of $P$. Then, to handle the $l_1$-norm regularization term, we employ the accelerated proximal gradient (APG) method.
In the general accelerated proximal gradient method, a convex optimization problem can be defined as:
$\min_{P \in \mathcal{H}} F(P) = f(P) + g(P),$ (30)
where $\mathcal{H}$ indicates a real Hilbert space. $f(P)$ and $g(P)$ are both convex, but smooth and non-smooth, respectively. The gradient of $f$ is also Lipschitz continuous, i.e., $\|\nabla f(P_1) - \nabla f(P_2)\|_F^2 \le L_f \|\Delta P\|_F^2$, where $\Delta P = P_1 - P_2$ and $L_f$ is the Lipschitz constant. Proximal gradient algorithms, rather than directly minimizing $F(P)$, minimize a sequence of separable quadratic approximations to $F(P)$, denoted as:
$Q_{L_f}(P, P^{(m)}) = f(P^{(m)}) + \langle \nabla f(P^{(m)}), P - P^{(m)} \rangle + \frac{L_f}{2} \|P - P^{(m)}\|_F^2 + g(P).$ (31)
Let $G^{(m)} = P^{(m)} - \frac{1}{L_f} \nabla f(P^{(m)})$; then
$P^* = \arg\min_{P} Q_{L_f}(P, P^{(m)}) = \arg\min_{P} \, g(P) + \frac{L_f}{2} \|P - G^{(m)}\|_F^2.$ (32)
According to Equations (29) and (30), f ( P ) and g ( P ) are defined as follows:
$f(P) = \frac{1}{2} \|XP - Z\|_F^2 + \frac{\alpha}{2} \mathrm{tr}(O^T L O) + \gamma \|P\|_{2,1},$ (33)
$g(P) = \beta \|P\|_1.$ (34)
According to Equation (33), we can calculate $\nabla f(P)$ as:
$\nabla f(P) = X^T X P - X^T Z + \alpha X^T L X P + \gamma A P.$ (35)
According to Equations (32)–(34), the projection matrix P can be optimized by
$P^* = \arg\min_{P} Q_{L_f}(P, P^{(m)}) = \arg\min_{P} \frac{L_f}{2} \|P - G^{(m)}\|_F^2 + g(P) = \arg\min_{P} \frac{1}{2} \|P - G^{(m)}\|_F^2 + \frac{\beta}{L_f} \|P\|_1.$ (36)
Lin et al. [42] showed that setting $P^{(m)} = P_m + \frac{b_{m-1} - 1}{b_m}(P_m - P_{m-1})$ for a sequence $\{b_m\}$ satisfying $b_{m+1}^2 - b_{m+1} \le b_m^2$ can improve the convergence rate to $O(m^{-2})$, where $P_m$ is the result of $P$ at the $m$-th iteration and $P^{(m)}$ is the intermediate variable at the $m$-th iteration. Additionally, for the $l_1$-norm regularization term $g(P)$, if $\mathcal{H}$ is a normed space endowed with the Frobenius norm $\|\cdot\|_F$ and $g(\cdot)$ is the $l_1$-norm, then $P_{m+1}$ is generated by soft-thresholding the entries of $G^{(m)}$ as:
$P_{m+1} = S_{\varepsilon}[G^{(m)}] = \arg\min_{P} \frac{1}{2} \|P - G^{(m)}\|_F^2 + \varepsilon \|P\|_1,$ (37)
where $S_{\varepsilon}[\omega]$ is the soft-thresholding operation, $\omega \in \mathbb{R}$ and $\varepsilon > 0$, defined as:
$S_{\varepsilon}[\omega] = \begin{cases} \omega - \varepsilon, & \text{if } \omega > \varepsilon, \\ \omega + \varepsilon, & \text{if } \omega < -\varepsilon, \\ 0, & \text{otherwise}. \end{cases}$ (38)
Then, in each iteration, P m + 1 can be obtained by the following soft-thresholding operation:
$P_{m+1} = S_{\frac{\beta}{L_f}}[G^{(m)}].$ (39)
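The soft-thresholding operator of Equations (38) and (39) amounts to the following short NumPy sketch (the function name is ours).

```python
import numpy as np

def soft_threshold(G, eps):
    """Entry-wise soft-thresholding S_eps[omega] of Eq. (38),
    applied to every entry of the matrix G."""
    return np.sign(G) * np.maximum(np.abs(G) - eps, 0.0)

# one APG update of Eq. (39): P_next = soft_threshold(G_m, beta / L_f)
```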
Here, we calculate the Lipschitz constant $L_f$. Given $P_1$ and $P_2$, according to Equation (35), we have:
$\|\nabla f(P_1) - \nabla f(P_2)\|_F^2 = \|X^T X \Delta P + \alpha X^T L X \Delta P + \gamma A \Delta P\|_F^2 \le 3 (\|X^T X \Delta P\|_F^2 + \|\alpha X^T L X \Delta P\|_F^2 + \|\gamma A \Delta P\|_F^2) \le 3 (\|X^T X\|_2^2 + \|\alpha X^T L X\|_2^2 + \|\gamma A\|_2^2) \|\Delta P\|_F^2,$ (40)
where $\Delta P = P_1 - P_2$. Obviously, $L_f$ can be taken as
$L_f = 3 (\|X^T X\|_2^2 + \|\alpha X^T L X\|_2^2 + \|\gamma A\|_2^2).$ (41)
We fix the value of $A$, which is determined by the initial value of $P$, to ensure that $L_f$ is always a constant value.
Algorithm 1 summarizes the pseudo-code of the SMDR-IC method. This algorithm produces the projection matrix $P$, which maps the features of instances from the $d$-dimensional to the $t$-dimensional space, where $t = \mathrm{rank}(X)$ and $t \le d$. If the dimensionality is to be reduced to $r$ $(r < t)$, the first $r$ columns of the projection matrix $P$ can be chosen.
Algorithm 1 SMDR-IC: Semi-supervised Multi-label Dimensionality Reduction Learning by Instance and Label Correlations
Require: Feature matrix $X \in \mathbb{R}^{n \times d}$, label matrix $Y \in \mathbb{R}^{n \times c}$, and parameters $k$, $\mu$, $\alpha$, $\beta$, $\gamma$, $\sigma$
Ensure: Projection matrix $P \in \mathbb{R}^{d \times t}$, $t = \mathrm{rank}(X)$
1: $F_l \leftarrow Y$, $F_u \leftarrow 0$;
2: Calculate the neighborhood graph matrix $W$ for label propagation and instance correlations by using k-nearest neighbors;
3: Obtain the soft label matrix $F$;
4: Calculate the label correlation matrix $C$ and obtain the label matrix $\tilde{F}$;
5: Calculate the target matrix $Z$;
6: $b_0 \leftarrow 0$, $b_1 \leftarrow 1$, $m \leftarrow 1$;
7: $P_0 \leftarrow 0$, $P_1 \leftarrow (X^T X + \sigma I)^{-1} X^T Z$;
8: Calculate the diagonal matrix $A$;
9: Calculate the Lipschitz constant $L_f$;
10: repeat
      $P^{(m)} \leftarrow P_m + \frac{b_{m-1} - 1}{b_m}(P_m - P_{m-1})$;
      $G^{(m)} \leftarrow P^{(m)} - \frac{1}{L_f} \nabla f(P^{(m)})$;
      $P_{m+1} \leftarrow S_{\frac{\beta}{L_f}}(G^{(m)})$;
      $b_{m+1} \leftarrow \frac{1 + \sqrt{4 b_m^2 + 1}}{2}$;
      Calculate the diagonal matrix $A_{m+1}$;
    until the stop criterion is reached;
11: $P \leftarrow P_m$;
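For reference, a compact Python sketch of the APG iteration in Algorithm 1 is given below; the warm-start regularizer $\sigma$, the fixed iteration count, and all names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def smdr_ic_apg(X, Z, W, alpha, beta, gamma, n_iter=500, sigma=1e-3):
    """Minimal APG sketch of the SMDR-IC objective (Eq. (29)); X: n x d,
    Z: n x t target matrix, W: kNN adjacency."""
    n, d = X.shape
    Lap = np.diag(W.sum(axis=1)) - W                                 # graph Laplacian
    P_prev = np.zeros((d, Z.shape[1]))
    P_cur = np.linalg.solve(X.T @ X + sigma * np.eye(d), X.T @ Z)    # warm start (step 7)
    A = np.diag(1.0 / (2.0 * np.linalg.norm(P_cur, axis=1) + 1e-12))
    Lf = 3 * (np.linalg.norm(X.T @ X, 2) ** 2
              + np.linalg.norm(alpha * X.T @ Lap @ X, 2) ** 2
              + np.linalg.norm(gamma * A, 2) ** 2)                   # Lipschitz constant (Eq. (41))
    b_prev, b_cur = 0.0, 1.0
    for _ in range(n_iter):
        V = P_cur + (b_prev - 1) / b_cur * (P_cur - P_prev)          # extrapolation P^(m)
        grad = X.T @ X @ V - X.T @ Z + alpha * X.T @ Lap @ X @ V + gamma * A @ V
        G = V - grad / Lf
        P_next = np.sign(G) * np.maximum(np.abs(G) - beta / Lf, 0.0) # soft-thresholding
        A = np.diag(1.0 / (2.0 * np.linalg.norm(P_next, axis=1) + 1e-12))
        b_prev, b_cur = b_cur, (1 + np.sqrt(4 * b_cur ** 2 + 1)) / 2
        P_prev, P_cur = P_cur, P_next
    return P_cur
```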
An alternative optimization to the accelerated proximal gradient method is the alternating direction method of multipliers (ADMM), which has the advantage that f ( P ) and g ( P ) are completely independent, so they can both be non-smooth.

4. Results

4.1. Benchmark Data Sets

To verify the efficiency of our proposed method, we conduct experiments on fifteen publicly available real-world data sets from Mulan (http://mulan.sourceforge.net/datasets-mlc.html, accessed on 5 March 2022), the statistics of which are summarized in Table 1. These data sets are divided into four categories: music, image, biology, and text. Emotions is a multi-label music data set, Scene and Corel5k are multi-label image data sets, Yeast is a multi-label biology data set, and the remaining are subsets of Yahoo, which is a multi-label text (web) data set. All data sets are standardized to have a mean of zero and a variance of one.

4.2. Evaluation Metrics

The performances of different dimensionality reduction methods are evaluated by employing seven widely used evaluation metrics: Hamming Loss, Ranking Loss, Average Precision (AvgPrec), OneError, Macro-F1 (MacroF1), Micro-F1 (MicroF1), and Coverage. These evaluation criteria are given in [4,43] along with detailed descriptions.
For convenience, we modify Hamming Loss, Ranking Loss, and OneError to 1-Hamming, 1-Ranking, and 1-OneError, respectively, so that higher values always mean better for the first six evaluation metrics, and lower values indicate better performance for Coverage.

4.3. Comparison Methods and Parameters Settings

The previously mentioned multi-label dimensionality reduction methods, CCA [5], MDDM [4], and six MLDA variants (wMLDAb [25], wMLDAe [26], wMLDAc [6], wMLDAf [27], wMLDAd [28], and MLDA-LC [29]), are all taken into account to compare the performance of SMDR-IC. These supervised methods can only be trained on the labeled portion of the training instances since they require labeled instances. MLDA-LC, specifically, adopts an instance correlation measure similar to our technique to constrain the consistency between the class-wise within-class distances of instances in the low-dimensional feature space and the original feature space, allowing the LDA framework to capture local structures. Moreover, three semi-supervised approaches, SSMLDR [9], SMDRdm [10], and NMLSDR [38], are compared with SMDR-IC. NMLSDR is a recently proposed dimensionality reduction method for noisy multi-label data. For a fair comparison, we replace NMLSDR's noise-coping label propagation strategy with our approach's label propagation mechanism.
We begin by projecting the high-dimensional instances into a low-dimensional space using these dimensionality reduction methods. The widely used ML-kNN [43] classifier is then applied to classify the instances in the low-dimensional space.
In the experiments, we set $k = 10$ to construct the kNN neighborhood graph. To be more convincing, the label propagation parameters are set to the same values as those in SSMLDR, i.e., $\lambda_l = 0$ for labeled instances and $\lambda_u = 0.99$ for unlabeled instances. $\mu$ in Equation (16) is set to $0.5$, as in MDDM [4], to constrain the orthogonality of the objective matrix. The regularization parameters $\alpha$, $\beta$, and $\gamma$ are tuned by a grid search strategy over $\{10^i \mid i = -5, -4, \ldots, 1\}$ and the best results are reported. All comparison method parameters were chosen based on their publications' recommendations.
The experiment was performed using Matlab R2020a on a desktop PC with an Intel(R) Core(TM) i7-7700k 4.20GHz CPU, 16GB RAM, and NVIDIA GeForce GTX 1080 8G GPU, with a 64-bit Windows 10 operating system.

4.4. Experimental Results

In our experiments, we randomly partition 70% of the instances into the training set and the remaining 30% into the test set. In the training set, we randomly select 20% of the instances as the labeled set and the remaining instances as the unlabeled set to learn a projection matrix $P$, which is used to project the $d$-dimensional training and test sets to an $r$-dimensional representation. To mitigate random effects, the random partitioning and selection are repeated 10 times on each data set for each comparing method, and the averaged results are reported.
Due to the restrictions of the MLDA framework, the five methods wMLDAb, wMLDAe, wMLDAc, wMLDAf, and wMLDAd have a target dimension of at most $c - 1$, while the maximum target dimension of MLDA-LC is the rank of the initial feature matrix. Therefore, we set the dimension of the projected features to $t = c - 1$ for all approaches.
Using the training set after dimensionality reduction, we train an ML-kNN classifier and validate its performance on the low-dimensional test set. As shown in Table 2, SMDR-IC demonstrates excellent effectiveness across the seven measures on all data sets. Indeed, SMDR-IC produces the highest results on more than five evaluation criteria for 11 data sets, while its other results are only marginally inferior.
Meanwhile, we randomly selected more labeled instances to evaluate the effectiveness of SMDR-IC. As shown in Table 3 and Table 4, SMDR-IC performs better when 50% of the training set's instances are labeled than when 20% are. SMDR-IC with 70% labeled instances shows higher performance than with 20% labeled instances and somewhat worse performance than with 50% labeled instances, but it still ranks first on average. This outcome is primarily because the merits of the supervised approaches start to emerge as the fraction of labeled instances becomes large enough.
Regarding the experimental results reported in Table 2, Table 3 and Table 4, the semi-supervised methods outperform the supervised methods because the supervised methods obtain the geometric structure of instances by using only labeled instances, whereas the semi-supervised methods leverage both labeled and unlabeled instances to obtain a more comprehensive geometric structure. Furthermore, because the label propagation mechanism of the NMLSDR method is replaced by that of the SMDR-IC method, the comparison results demonstrate that the learning of instance similarity, common features, and specific features all play significant roles in capturing the geometric structure of instances.
To further investigate the performance differences between the compared methods and the proposed approach SMDR-IC, the necessary analyses were carried out using the Friedman test [44]. Table 5 lists the Friedman statistics $F_F$ for each evaluation metric and the critical values. The null hypothesis that all comparison algorithms perform equally well is rejected for each evaluation metric at significance level $\alpha = 0.05$. The proposed approach, SMDR-IC, is viewed as the control method, and the Bonferroni–Dunn test is utilized as the post-hoc test [44]. The critical distance (CD), defined as $\mathrm{CD} = q_a \sqrt{\frac{K(K+1)}{6N}}$, is used to assess the average ranking differences between any two algorithms. Here, $q_a = 3.2680$ $(K = 12, N = 15)$, and $\mathrm{CD} = 4.3025$. Figure 2 depicts the CD diagrams for each evaluation metric. The red line in each subfigure shows that the distances between SMDR-IC and certain comparison methods are less than the CD, indicating statistical similarity. We observe that SMDR-IC and MDDM have significant statistical similarities for Ranking Loss, Average Precision, OneError, and Coverage, which is consistent with the fact that MDDM is the baseline method of SMDR-IC. SMDR-IC ranks second under MacroF1 and first under the other six evaluation metrics.
We also selected four domain data sets (Emotions, Scene, Yeast, and Arts) and five comparing methods, including both supervised and semi-supervised ones, to further study the performance of SMDR-IC under different target dimensionalities. Figure 3, Figure 4, Figure 5 and Figure 6 show the average results concerning 1-Hamming, 1-Ranking, AvgPrec, and Coverage, respectively. It can be seen that SMDR-IC almost always outperforms the other comparing methods on these evaluation metrics. These results further demonstrate the effectiveness of SMDR-IC for multi-label dimensionality reduction.

4.5. Sensitivity Analysis of Parameters

Three crucial parameters, $\alpha$, $\beta$, and $\gamma$, are present in our dimensionality reduction model. $\alpha$ controls the contribution of embedding instance correlations, while $\beta$ and $\gamma$ regulate the sparsity of the projection matrix. It is therefore necessary to conduct a sensitivity analysis of these parameters for SMDR-IC. All parameter values are taken from $\{10^{-7}, 10^{-6}, \ldots, 10^2, 10^3\}$ on four domain data sets, including Emotions, Scene, Yeast, and Arts. We tune one parameter at a time in steps of one order of magnitude, while fixing the other parameters at their best settings.
Figure 7, Figure 8 and Figure 9 show the influence of the three parameters on the four data sets. Parameter $\alpha$ controls the instance correlations, and a larger value means a higher importance of instance correlations. We can clearly see that when $\alpha$ increases from $10^{-7}$ to $10^{0}$, the performance tends to increase slowly and remain stable, but when $\alpha$ exceeds 1, performance begins to decline, slowly on Yeast and Arts and rapidly on Emotions and Scene. This is explained by the fact that the influence of instance correlations becomes more significant as the value of $\alpha$ increases, potentially limiting the influence of other factors and resulting in poor performance. Parameters $\beta$ and $\gamma$ control the sparsity of specific features and common features, respectively. The trends are similar to those of $\alpha$: with increasing values of $\beta$ and $\gamma$, the performance tends to increase and then remain stable. Nevertheless, when $\beta$ or $\gamma$ exceeds $10^{1}$, the performance begins to decline, slowly on Yeast and Arts, and rapidly on Emotions and Scene.

4.6. Convergence and Time Complexity Analysis

The accelerated proximal gradient method (APG), a special version of the gradient descent method, is used to solve optimization problems with non-differentiable objective functions. The SMDR-IC optimization model Equation (29) can be transformed into the standard model solved by APG. For its convergence theory, see [42].
As the number of iterations is increased, Figure 10 illustrates the changes in the objective function values for the SMDR-IC dimensionality reduction model on the four data sets. It is clear that the value of the objective function rapidly declines as iteration times increase. More specifically, on the Emotions dataset after around 50 iterations, and on the Scene, Yeast, and Arts data sets after about 150, 40, and 2500 iterations, respectively, the objective function value tends to a fixed value. In conclusion, the APG method steadily converges when solving the SMDR-IC dimensionality reduction model.
The complexity analysis of the SMDR-IC method is concluded as follows. The time complexities of constructing the kNN graph matrix $W$ and computing the label correlation matrix $C$ are $O(n^2 d)$ and $O(l c^2)$, respectively, where $n$ is the number of labeled and unlabeled instances, $d$ is the data set dimensionality, $l$ is the number of labeled instances, and $c$ is the number of labels. The time complexity of obtaining the soft labels is approximately $O(n^2 c)$. The time cost of calculating the target matrix $Z$ is dominated by the SVD of $X$, which is $O(n^2 d + n d^2)$. Initializing $P$ costs $O(n d^2 + d^3)$. The time complexity of calculating the Lipschitz constant $L_f$ is $O(d^3)$. The time cost of the iteration steps is dominated by calculating the gradient of $f(P)$, which is $O(n d^2 + n^2 d)$. As a consequence, the total time complexity of SMDR-IC is $O(m(n^2 d + n d^2) + d^3)$, where $m$ is the number of iterations.
Table 6 lists the computation time of each method on four data sets. According to Table 6, we observe the following. Firstly, the semi-supervised dimensionality reduction methods take longer than the supervised ones. This is because semi-supervised methods use more instances than supervised ones in the learning process: they use both labeled and unlabeled instances to learn, whereas supervised methods use only labeled instances, and they additionally need to obtain soft labels for the unlabeled instances. Secondly, compared with the other semi-supervised dimensionality reduction methods, SMDR-IC has similar efficiency on data sets of medium size, but is slower on small and large data sets. This is because the final result of SMDR-IC is obtained by a gradient-based optimization method, and the number of iterations needed for convergence varies across data sets, making it difficult to find a universally applicable setting.

5. Conclusions

In this paper, we introduced a novel method, namely semi-supervised multi-label dimensionality reduction learning by instance and label correlations (SMDR-IC), which effectively utilizes the information from both labeled and unlabeled instances through label propagation. SMDR-IC exploits instance correlations by assuming that if two instances are correlated in the original feature space, they will also be related in the low-dimensional feature space. The label correlations are also taken into account in the reformulated least-squares model. Furthermore, $l_1$-norm and $l_{2,1}$-norm regularization terms are utilized to select specific features and common features of the feature space, respectively. Finally, extensive experiments on fifteen data sets show that our method outperforms other well-established multi-label dimensionality reduction methods.
The main shortcoming of the SMDR-IC technique is that the projection mapping ϕ considered is a linear operator, and the linear operator may not be capable of learning the manifold structure of high-dimensional space completely. Then, in future research, we will concentrate on developing more effective nonlinear projection operators and researching the classification method coupled with dimensionality reduction techniques. In addition, SMDR-IC is slightly disadvantaged in terms of efficiency, mainly due to the difficulty in finding universally applicable convergence determination conditions for iterative optimization for different data sets. Therefore, iterative convergence determination methods or more efficient solution methods applicable to different data sets are to be further studied in future work.

Author Contributions

Conceptualization, formal analysis, writing—original draft preparation, R.L. and J.D. (Jiaxing Du); methodology, software, validation, writing—review and editing, R.L., J.D. (Jiaman Ding) and L.J.; resources, supervision, funding acquisition, Y.C. and Z.S. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported in part by the open fund of the Yunnan Key Laboratory of Computer Technology Applications, and in part by the National Natural Science Foundation of China (No. 12063002 and 62262035).

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Sun, L.; Ji, S.; Ye, J. Multi-Label Dimensionality Reduction; CRC Press: Boca Raton, FL, USA, 2013. [Google Scholar]
  2. Bellman, R. Dynamic programming and Lagrange multipliers. Proc. Natl. Acad. Sci. USA 1956, 42, 767. [Google Scholar] [CrossRef] [PubMed]
  3. Siblini, W.; Kuntz, P.; Meyer, F. A review on dimensionality reduction for multi-label classification. IEEE Trans. Knowl. Data Eng. 2021, 33, 839–857. [Google Scholar] [CrossRef]
  4. Zhang, Y.; Zhou, Z.H. Multilabel dimensionality reduction via dependence maximization. ACM Trans. Knowl. Discov. Data (TKDD) 2010, 4, 1–21. [Google Scholar] [CrossRef]
  5. Hardoon, D.R.; Szedmak, S.; Shawe-Taylor, J. Canonical correlation analysis: An overview with application to learning methods. Neural Comput. 2004, 16, 2639–2664. [Google Scholar] [CrossRef] [PubMed]
  6. Wang, H.; Ding, C.; Huang, H. Multi-label linear discriminant analysis. In Proceedings of the Computer Vision—ECCV, Heraklion, Greece, 5–11 September 2010; Springer: Berlin/Heidelberg, Germany, 2010; pp. 126–139. [Google Scholar]
  7. Kong, X.; Ng, M.K.; Zhou, Z.H. Transductive multilabel learning via label set propagation. IEEE Trans. Knowl. Data Eng. 2011, 25, 704–719. [Google Scholar] [CrossRef]
  8. Qian, B.; Davidson, I. Semi-supervised dimension reduction for multi-label classification. In Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence, Atlanta, GA, USA, 11–15 July 2010; pp. 569–574. [Google Scholar]
  9. Guo, B.; Hou, C.; Nie, F.; Yi, D. Semi-supervised multi-label dimensionality reduction. In Proceedings of the 2016 IEEE 16th International Conference on Data Mining (ICDM), Barcelona, Spain, 12–15 December 2016; pp. 919–924. [Google Scholar]
  10. Yu, Y.; Wang, J.; Tan, Q.; Jia, L.; Yu, G. Semi-supervised multi-label dimensionality reduction based on dependence maximization. IEEE Access 2017, 5, 21927–21940. [Google Scholar] [CrossRef]
  11. Nie, F.; Huang, H.; Cai, X.; Ding, C. Efficient and robust feature selection via joint l2,1-norms minimization. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 6–9 December 2010; Volume 23, pp. 1813–1821. [Google Scholar]
  12. Hu, L.; Li, Y.; Gao, W.; Zhang, P.; Hu, J. Multi-label feature selection with shared common mode. Pattern Recognit. 2020, 104, 107344. [Google Scholar] [CrossRef]
  13. Li, J.; Li, P.; Hu, X.; Yu, K. Learning common and label-specific features for multi-Label classification with correlation information. Pattern Recognit. 2022, 121, 108259. [Google Scholar] [CrossRef]
  14. Gretton, A.; Bousquet, O.; Smola, A.; Schölkopf, B. Measuring statistical dependence with Hilbert-Schmidt norms. In Proceedings of the International Conference on Algorithmic Learning Theory, Singapore, 8–11 October 2005; Springer: Berlin/Heidelberg, Germany, 2005; pp. 63–77. [Google Scholar]
  15. Jolliffe, I.T.; Cadima, J. Principal component analysis: A review and recent developments. Philos. Trans. R. Soc. A Math. Phys. Eng. Sci. 2016, 374, 20150202. [Google Scholar] [CrossRef] [PubMed]
  16. Roweis, S.T.; Saul, L.K. Nonlinear dimensionality reduction by locally linear embedding. Science 2000, 290, 2323–2326. [Google Scholar] [CrossRef] [Green Version]
  17. Belkin, M.; Niyogi, P. Laplacian eigenmaps and spectral techniques for embedding and clustering. Adv. Neural Inf. Process. Syst. 2001, 14, 585–591. [Google Scholar]
  18. Nie, F.; Xu, D.; Tsang, I.W.H.; Zhang, C. Flexible manifold embedding: A framework for semi-supervised and unsupervised dimension reduction. IEEE Trans. Image Process. 2010, 19, 1921–1932. [Google Scholar] [PubMed]
  19. Deerwester, S.; Dumais, S.T.; Furnas, G.W.; Landauer, T.K.; Harshman, R. Indexing by latent semantic analysis. J. Am. Soc. Inf. Sci. 1990, 41, 391–407. [Google Scholar] [CrossRef]
  20. Yu, K.; Yu, S.; Tresp, V. Multi-label informed latent semantic indexing. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Salvador, Brazil, 15–19 August 2005; pp. 258–265. [Google Scholar]
  21. Hotelling, H. Relations between two sets of variates. Biometrika 1936, 28, 321–377. [Google Scholar] [CrossRef]
  22. Sun, L.; Ji, S.; Ye, J. Canonical correlation analysis for multilabel classification: A least-squares formulation, extensions, and analysis. IEEE Trans. Pattern Anal. Mach. Intell. 2010, 33, 194–200. [Google Scholar]
  23. Pacharawongsakda, E.; Theeramunkong, T. A two-stage dual space reduction framework for multi-label classification. In Proceedings of the Trends and Applications in Knowledge Discovery and Data Mining, Delhi, India, 11 May 2013; Springer: Berlin/Heidelberg, Germany, 2013; pp. 330–341. [Google Scholar]
  24. Fisher, R.A. The use of multiple measurements in taxonomic problems. Ann. Eugen. 1936, 7, 179–188. [Google Scholar] [CrossRef]
  25. Park, C.H.; Lee, M. On applying linear discriminant analysis for multi-labeled problems. Pattern Recognit. Lett. 2008, 29, 878–887. [Google Scholar] [CrossRef]
  26. Chen, W.; Yan, J.; Zhang, B.; Chen, Z.; Yang, Q. Document transformation for multi-label feature selection in text categorization. In Proceedings of the Seventh IEEE International Conference on Data Mining (ICDM 2007), Omaha, NE, USA, 28–31 October 2007; pp. 451–456. [Google Scholar]
  27. Lin, X.; Chen, X.W. KNN: Soft relevance for multi-label classification. In Proceedings of the 19th ACM International Conference on Information and Knowledge Management, Toronto, ON, Canada, 26–30 October 2010; pp. 349–358. [Google Scholar]
  28. Xu, J. A weighted linear discriminant analysis framework for multi-label feature extraction. Neurocomputing 2018, 275, 107–120. [Google Scholar] [CrossRef]
  29. Yuan, Y.; Zhao, K.; Lu, H. Multi-label linear Ddiscriminant analysis with locality consistency. In Proceedings of the Neural Information Processing, Montreal, QC, Canada, 8–13 December 2014; Loo, C.K., Yap, K.S., Wong, K.W., Teoh, A., Huang, K., Eds.; Springer International Publishing: Berlin/Heidelberg, Germany, 2014; pp. 386–394. [Google Scholar]
  30. Shu, X.; Lai, D.; Xu, H.; Tao, L. Learning shared subspace for multi-label dimensionality reduction via dependence maximization. Neurocomputing 2015, 168, 356–364. [Google Scholar] [CrossRef]
  31. Gönen, M. Coupled dimensionality reduction and classification for supervised and semi-supervised multilabel learning. Pattern Recognit. Lett. 2014, 38, 132–141. [Google Scholar] [CrossRef]
  32. Yu, T.; Zhang, W. Semisupervised multilabel learning with joint dimensionality reduction. IEEE Signal Process. Lett. 2016, 23, 795–799. [Google Scholar] [CrossRef]
  33. Blaschko, M.B.; Shelton, J.A.; Bartels, A.; Lampert, C.H.; Gretton, A. Semi-supervised kernel canonical correlation analysis with application to human fMRI. Pattern Recognit. Lett. 2011, 32, 1572–1583. [Google Scholar] [CrossRef] [Green Version]
  34. Li, H.; Li, P.; Guo, Y.J.; Wu, M. Multi-label dimensionality reduction based on semi-supervised discriminant analysis. J. Cent. South Univ. Technol. 2010, 17, 1310–1319. [Google Scholar] [CrossRef]
  35. Hubert, M.; Van Driessen, K. Fast and robust discriminant analysis. Comput. Stat. Data Anal. 2004, 45, 301–320. [Google Scholar] [CrossRef]
  36. Croux, C.; Dehon, C. Robust linear discriminant analysis using S-estimators. Can. J. Stat. 2001, 29, 473–493. [Google Scholar] [CrossRef]
  37. Hubert, M.; Rousseeuw, P.J.; Van Aelst, S. High-breakdown robust multivariate methods. Stat. Sci. 2008, 23, 92–119. [Google Scholar] [CrossRef]
  38. Mikalsen, K.Ø.; Soguero-Ruiz, C.; Bianchi, F.M.; Jenssen, R. Noisy multi-label semi-supervised dimensionality reduction. Pattern Recognit. 2019, 90, 257–270. [Google Scholar] [CrossRef]
  39. Golub, G.H.; Van Loan, C.F. Matrix Computations; JHU Press: Baltimore, MD, USA, 2013. [Google Scholar]
  40. Han, H.; Huang, M.; Zhang, Y.; Yang, X.; Feng, W. Multi-label learning with label specific features using correlation information. IEEE Access 2019, 7, 11474–11484. [Google Scholar] [CrossRef]
  41. Huang, S.J.; Zhou, Z.H. Multi-Label Learning by Exploiting Label Correlations Locally; AAAI Press: Palo Alto, CA, USA, 2012. [Google Scholar]
  42. Lin, Z.; Ganesh, A.; Wright, J.; Wu, L.; Chen, M.; Ma, Y. Fast Convex Optimization Algorithms for Exact Recovery of a Corrupted Low-Rank Matrix; Report no. UILU-ENG-09-2214, DC-246; Coordinated Science Laboratory: Urbana, IL, USA, 2009; Available online: https://hdl.handle.net/2142/74352 (accessed on 20 December 2022).
  43. Zhang, M.L.; Zhou, Z.H. ML-KNN: A lazy learning approach to multi-label learning. Pattern Recognit. 2007, 40, 2038–2048. [Google Scholar] [CrossRef] [Green Version]
  44. Demšar, J. Statistical comparisons of classifiers over multiple data sets. J. Mach. Learn. Res. 2006, 7, 1–30. [Google Scholar]
Figure 1. An explanation of specific and common features of feature space.
Figure 2. Bonferroni–Dunn test for SMDR-IC and other compared techniques. (a) 1-Hamming. (b) 1-Ranking. (c) AvgPrec. (d) 1-OneError. (e) MacroF1. (f) MicroF1. (g) Coverage.
Figure 3. Results of 1-Hamming on four data sets under different target dimensionalities. (a) Emotions. (b) Scene. (c) Yeast. (d) Arts.
Figure 4. Results of 1-Ranking on four data sets under different target dimensionalities. (a) Emotions. (b) Scene. (c) Yeast. (d) Arts.
Figure 5. Results of AvgPrec on four data sets under different target dimensionalities. (a) Emotions. (b) Scene. (c) Yeast. (d) Arts.
Figure 6. Results of Coverage on four data sets under different target dimensionalities. (a) Emotions. (b) Scene. (c) Yeast. (d) Arts.
Figure 7. Sensitivity analysis of parameter α on four data sets for five evaluation metrics. (a) 1-Hamming. (b) 1-Ranking. (c) AvgPrec. (d) 1-OneError. (e) Coverage.
Figure 8. Sensitivity analysis of parameter β on four data sets for five evaluation metrics. (a) 1-Hamming. (b) 1-Ranking. (c) AvgPrec. (d) 1-OneError. (e) Coverage.
Figure 9. Sensitivity analysis of parameter γ on four data sets for five evaluation metrics. (a) 1-Hamming. (b) 1-Ranking. (c) AvgPrec. (d) 1-OneError. (e) Coverage.
Figure 10. Convergence analysis of the proposed method on four data sets. (a) Emotions. (b) Scene. (c) Yeast. (d) Arts.
Table 1. Data statistics ("#Instances" represents the total number of instances, "#Attributes" indicates the number of features, including both the numeric and nominal features, "#Labels" is the number of class labels, and "Cardinality" means the average number of labels per instance).
Dataset | Domain | #Instances | #Attributes | #Labels | Cardinality
Emotions | music | 593 | 72 | 6 | 1.87
Scene | image | 2407 | 294 | 6 | 1.07
Corel5k | images | 5000 | 499 | 374 | 3.52
Yeast | biology | 2417 | 103 | 14 | 4.23
Arts | text | 5000 | 695 | 26 | 1.65
Business | text | 5000 | 658 | 30 | 1.60
Computers | text | 5000 | 1023 | 33 | 1.51
Education | text | 5000 | 827 | 33 | 1.46
Entertainment | text | 5000 | 961 | 21 | 1.41
Health | text | 5000 | 919 | 32 | 1.64
Recreation | text | 5000 | 910 | 22 | 1.43
Reference | text | 5000 | 1191 | 33 | 1.17
Science | text | 5000 | 1116 | 40 | 1.45
Social | text | 5000 | 1571 | 39 | 1.28
Society | text | 5000 | 955 | 27 | 1.67
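The cardinality reported in Table 1 is simply the mean number of relevant labels per instance. As a point of reference, the following minimal sketch derives such statistics from a binary instance-by-label matrix; the helper name and the toy matrix are illustrative only and are not taken from the paper.

```python
import numpy as np

def label_statistics(Y: np.ndarray) -> dict:
    """Basic multi-label data statistics from a binary label matrix Y (instances x labels)."""
    n_instances, n_labels = Y.shape
    cardinality = Y.sum(axis=1).mean()  # average number of relevant labels per instance
    return {"#Instances": n_instances,
            "#Labels": n_labels,
            "Cardinality": round(float(cardinality), 2)}

# Toy 4-instance, 3-label example (hypothetical data).
Y_toy = np.array([[1, 0, 1],
                  [0, 1, 0],
                  [1, 1, 1],
                  [0, 0, 1]])
print(label_statistics(Y_toy))  # {'#Instances': 4, '#Labels': 3, 'Cardinality': 1.75}
```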
Table 2. Experimental results of 20% labeled instances on fifteen data sets in terms of seven evaluation metrics.
Dataset | Metrics | CCA | wMLDAb | wMLDAe | wMLDAc | wMLDAf | wMLDAd | MLDA-LC | MDDM | SSMLDR | SMDRdm | NMLSDR | SMDR-IC
Emotions1-Hamming0.66240.68650.67860.67660.67680.68370.68230.69910.66570.73170.75070.7602
1-Ranking0.61950.65320.65330.63970.65050.65070.64270.70100.62230.74320.77410.7872
AvgPrec0.61050.63970.64100.62470.62420.63430.63700.67380.61170.71060.74250.7507
1-OneError0.46070.49720.47920.46850.45340.48430.49490.53820.45790.59160.64040.6466
MacroF10.37030.36140.39230.36880.37300.37790.38950.46990.33930.49300.53290.5373
MicroF10.39020.38730.40470.38960.38880.39850.40720.48370.35220.51510.55410.5643
Coverage2.85792.62082.64212.70622.61742.68482.70962.45062.81522.26242.11182.0287
Scene1-Hamming0.77620.80400.78490.77720.78620.78500.77790.80560.78060.86490.87590.8780
1-Ranking0.64870.66860.65640.64680.66390.65320.63270.71420.65030.84870.86210.8660
AvgPrec0.56090.58450.57900.56560.58230.57450.54700.63280.56540.76880.78810.7949
1-OneError0.35730.39560.38880.36470.38770.38160.34090.45480.36630.62960.65840.6710
MacroF10.33540.34360.37600.34500.37310.37140.31640.44420.34320.60120.62750.6365
MicroF10.33780.34400.37600.34790.37130.36930.31750.44130.34760.59540.62210.6311
Coverage1.83061.74081.78641.83931.76651.81831.91761.51241.83710.84650.77440.7592
Corel5k1-Hamming0.98720.98980.98730.98890.98760.98810.98910.98900.98980.99050.99050.9905
1-Ranking0.83290.83950.83130.83460.83170.83210.83250.83560.83730.84220.84160.8465
AvgPrec0.14910.20270.16430.15170.16350.16730.16260.16700.16460.21620.21020.2235
1-OneError0.11240.20760.13730.10340.13240.13650.11990.13730.12230.22710.21480.2382
MacroF10.00770.00820.01580.00580.01590.01580.01180.00990.00290.00130.00100.0024
MicroF10.04310.06890.08140.03170.07430.07300.04500.05380.01650.00530.00530.0118
Coverage131.5283130.7983133.2597130.2415133.1978135.4703133.7845129.6946128.3927129.4937129.6759127.7015
Yeast1-Hamming0.75550.77210.75880.75960.76390.75740.76160.76020.75740.76990.78190.7839
1-Ranking0.78220.79230.78130.77880.78350.78260.77900.77950.77570.79050.80210.8065
AvgPrec0.70150.71830.69810.69620.70250.69990.70010.69710.69270.70980.72740.7331
1-OneError0.68030.72440.67690.67220.67550.67190.68620.66630.67560.71400.72400.7348
MacroF10.36510.28140.34980.32780.34480.35180.32520.35690.30120.29140.33420.3367
MicroF10.57570.56380.57070.55910.57460.56970.57070.57430.55030.57180.60110.6041
Coverage7.03827.04717.02997.09487.04596.99997.19277.03607.20236.99796.79786.7818
Arts1-Hamming0.88340.92100.90460.90590.90350.90540.90200.89160.89930.93150.93210.9320
1-Ranking0.77340.77740.75570.74790.76960.77970.72570.79240.74260.82080.82340.8250
AvgPrec0.33550.34280.33240.29970.34390.34810.26050.38860.28520.44020.44790.4573
1-OneError0.18830.19190.19030.14540.19400.19230.10620.25130.14260.29690.31380.3203
MacroF10.09620.06190.08630.06770.08640.08280.04660.12220.06580.08800.10100.1009
MicroF10.17380.11540.15420.10700.15390.14960.08600.22250.11180.16250.18610.1896
Coverage7.37197.31997.91058.09557.52087.23578.60596.91668.25046.20776.14516.0962
Business1-Hamming0.94870.95810.94820.94820.94970.95800.93740.95000.93580.96920.96830.9698
1-Ranking0.92150.93150.92050.92190.92380.92940.90660.93220.90130.94950.95170.9550
AvgPrec0.70800.73150.67910.68200.69810.74130.60920.73540.59250.85440.85400.8618
1-OneError0.65270.66110.59830.60650.62670.71000.50390.67480.48850.86250.85690.8667
MacroF10.06400.05540.06390.05610.06910.05360.06340.09030.06500.05510.09020.0950
MicroF10.50080.52380.46510.46930.48710.52310.41030.54050.38740.65840.66360.6747
Coverage3.62813.20173.57733.69483.55383.43654.11163.26114.22592.76152.66292.5270
Computers1-Hamming0.93060.94520.93960.94310.93840.94390.93590.93680.93160.95530.95430.9560
1-Ranking0.86010.85400.84940.85120.85360.86140.84810.87330.81540.89150.89680.9013
AvgPrec0.44330.43420.41200.41270.42440.44060.40720.49690.33010.58990.59290.6127
1-OneError0.29110.27460.25570.24040.25590.27500.25430.35660.18230.51250.50070.5273
MacroF10.05640.04950.05620.03280.05470.04520.06060.07940.03600.05120.08740.0974
MicroF10.23410.22120.21130.16880.21230.20980.21140.29510.15200.36220.38830.3871
Coverage5.89506.02956.26716.24046.12395.98516.29415.44487.42034.91394.73154.5519
Education1-Hamming0.92290.94170.93440.94360.93510.94180.93340.92670.93590.95180.95070.9528
1-Ranking0.85490.86260.83750.83860.84740.85490.85330.86400.80300.88260.88300.8869
AvgPrec0.37210.37620.33780.32370.35020.36180.36340.41110.25450.45670.46280.4791
1-OneError0.19970.18380.17080.13890.17870.17670.19300.24810.09150.29060.30240.3207
MacroF10.06400.04800.05310.02950.05600.04530.05870.07940.03240.04600.06550.0605
MicroF10.20220.14700.15330.07950.15540.13240.17320.24320.08250.15240.19710.1870
Coverage5.69815.43156.30936.20815.90695.66805.78515.42137.46054.80734.80274.6636
Entertainment1-Hamming0.88420.91280.90140.90660.90180.90480.90340.89400.89500.92640.92760.9281
1-Ranking0.80740.82570.80350.79540.81580.81140.80060.83500.72230.85360.85610.8613
AvgPrec0.40780.45050.42430.39970.43390.42820.43170.47990.27640.52050.52520.5391
1-OneError0.23510.27230.26460.22710.25390.25170.27720.31720.11560.36590.36770.3852
MacroF10.12290.10960.12010.09460.11550.11100.12860.16220.05410.13640.14000.1481
MicroF10.23150.22080.22690.17070.22010.20730.23320.29910.09920.26920.28110.2873
Coverage4.69594.32354.81994.93844.53884.63014.89874.15616.41813.77753.69753.6377
Health1-Hamming0.92550.93820.93480.93740.93710.94250.93390.93180.92100.94970.95040.9515
1-Ranking0.89830.90210.88520.88660.89350.89730.89730.90390.84960.91700.92230.9261
AvgPrec0.49390.49860.48150.46570.50720.52760.50680.53920.35790.60000.62470.6384
1-OneError0.33540.33350.33550.29990.36440.38560.35970.40570.19730.47860.51630.5324
MacroF10.07570.07770.08050.06080.09090.06820.09030.11730.05070.06630.11690.1209
MicroF10.33020.25600.27130.22160.30750.27970.30040.36590.16790.30760.40560.4101
Coverage4.52474.36005.05154.97234.73034.68814.65444.36716.24853.94973.75763.6298
Recreation1-Hamming0.86660.91130.90030.89800.89920.89960.89670.88790.89470.93080.92920.9296
1-Ranking0.73230.73150.74710.72540.75130.75510.72550.76120.65710.76900.79300.7985
AvgPrec0.32110.33850.34910.31020.35490.35560.31980.38340.21700.38600.44090.4504
1-OneError0.15910.19370.19000.15300.19110.18310.16770.23180.07220.22650.29290.3019
MacroF10.09780.09100.10020.08270.10170.09260.09370.12940.04150.05560.12050.1306
MicroF10.15920.14940.16690.12790.16510.15940.14560.20970.06140.10380.19850.2024
Coverage6.67506.66506.25686.74246.15436.09116.76276.03178.22475.83065.30585.1896
Reference1-Hamming0.95060.95520.94900.95690.94970.95220.95530.95420.94400.96420.96240.9634
1-Ranking0.88200.87710.87180.87430.87520.87550.88190.88440.81670.89010.89310.8983
AvgPrec0.48540.46900.44410.46490.45670.47040.49580.51780.28410.55870.55820.5777
1-OneError0.32650.28740.27270.30320.27230.30070.34870.37910.08140.45750.43730.4644
MacroF10.05140.06640.06920.05350.07150.06990.06860.08430.02600.01870.06870.0708
MicroF10.28410.26170.26460.25810.26810.28800.29230.34350.06700.19440.36280.3611
Coverage4.34864.49054.67134.58094.57384.57174.34264.27576.44414.09263.99173.8267
Science1-Hamming0.94070.95020.94460.94730.94350.94540.94710.94240.94300.96080.95930.9596
1-Ranking0.81840.81480.81330.79910.80970.81770.81900.82780.77340.82670.83990.8449
AvgPrec0.31150.32960.31220.27640.30500.31580.33950.36150.21100.34820.39950.4079
1-OneError0.15820.18650.16030.12820.15720.15950.20280.22310.07230.19870.26420.2711
MacroF10.06240.06120.06230.04390.05970.05480.07410.08090.02930.01540.06980.0744
MicroF10.13610.14630.13990.09080.13690.12980.16540.19130.05490.05100.17420.1684
Coverage8.68138.94108.96089.44479.09488.76778.75348.436710.55848.51537.93437.7084
Social1-Hamming0.95760.96250.95860.95950.95740.96000.96180.96210.94770.96690.96620.9666
1-Ranking0.88810.89620.89530.89220.89160.89950.90190.90230.85150.91610.91220.9139
AvgPrec0.47380.51210.50250.46700.49200.50990.54990.55990.31960.58640.59010.5941
1-OneError0.29310.36130.34550.28130.33270.34330.41140.42340.14050.42220.45710.4574
MacroF10.04770.05360.07140.04830.06900.06930.07510.07000.03700.00330.06690.0756
MicroF10.19250.31290.30260.17690.30240.29210.34830.36880.12550.04300.36770.3565
Coverage5.25184.90094.92355.11835.12234.81344.72324.72226.69414.23534.32304.2311
Society1-Hamming0.88350.92300.90410.90620.90530.91070.90790.89170.90360.93570.93620.9367
1-Ranking0.79310.79540.78310.77990.79890.80150.79600.80120.72190.82830.83590.8366
AvgPrec0.37020.40580.35720.34430.37120.38040.38210.42210.24070.51110.52420.5291
1-OneError0.22210.22540.19430.18560.19540.21000.23170.30230.08040.42510.43610.4454
MacroF10.09440.07240.08330.07540.08580.08210.09430.10980.04280.05730.08170.0827
MicroF10.20090.17660.17050.16390.17220.18140.19690.25150.07690.27600.27900.2942
Coverage7.16237.17457.50517.62897.01887.06517.20657.00549.05936.38466.24476.1403
The best results are shown in bold.
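The column names in Tables 2–4, read together with the metric names in Table 5, indicate that Hamming loss, ranking loss, and one-error are reported as 1 − loss, so larger values are better for every column except Coverage (smaller is better). For reference, a minimal sketch of three of these metrics is given below, following the standard multi-label definitions used with ML-KNN [43]; it is an illustration with toy data, not the authors' evaluation code.

```python
import numpy as np

def hamming_loss(Y_true, Y_pred):
    """Fraction of instance-label pairs that are predicted incorrectly."""
    return float(np.mean(Y_true != Y_pred))

def one_error(Y_true, scores):
    """Fraction of instances whose top-ranked label is not a relevant label."""
    top = np.argmax(scores, axis=1)
    return float(np.mean(Y_true[np.arange(len(top)), top] == 0))

def coverage(Y_true, scores):
    """Average depth (0-based rank) of the lowest-ranked relevant label per instance."""
    ranks = (-scores).argsort(axis=1).argsort(axis=1)  # rank 0 = highest-scored label
    return float(np.mean([ranks[i, Y_true[i] == 1].max() for i in range(len(Y_true))]))

# Toy usage: 2 instances, 3 labels (hypothetical scores and ground truth).
Y = np.array([[1, 0, 1], [0, 1, 0]])
S = np.array([[0.9, 0.2, 0.4], [0.1, 0.3, 0.8]])
P = (S > 0.5).astype(int)
print(1 - hamming_loss(Y, P))  # 0.5  (reported as "1-Hamming")
print(1 - one_error(Y, S))     # 0.5  (reported as "1-OneError")
print(coverage(Y, S))          # 1.0  (reported as "Coverage", smaller is better)
```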
Table 3. Experimental results of 50% labeled instances on fifteen data sets in terms of seven evaluation metrics.
Dataset | Metrics | CCA | wMLDAb | wMLDAe | wMLDAc | wMLDAf | wMLDAd | MLDA-LC | MDDM | SSMLDR | SMDRdm | NMLSDR | SMDR-IC
Emotions1-Hamming0.74820.76180.74900.75060.77010.74340.76560.75380.75290.76960.77320.7862
1-Ranking0.76000.79000.76740.77640.78940.76500.79460.77870.77500.79310.80290.8206
AvgPrec0.71910.75440.73680.73990.75350.72860.75960.74670.73970.75610.76770.7836
1-OneError0.59890.65450.64550.62810.65450.60960.65730.64100.63760.65110.67360.6972
MacroF10.55600.55130.56060.54890.58880.55110.57270.57300.55880.59720.59140.6080
MicroF10.57060.57480.57220.56410.60390.56120.59660.58040.57280.60910.61210.6319
Coverage2.12422.01402.15112.06072.00452.13431.97922.09892.08201.98261.95111.8753
Scene1-Hamming0.86360.87130.86830.87010.87050.87210.86470.86770.86880.88900.89320.8934
1-Ranking0.83860.83970.84220.84840.84330.84650.84210.84630.84610.88190.88900.8913
AvgPrec0.76160.76880.76850.77410.77260.77770.76060.77150.77310.81430.82260.8267
1-OneError0.62310.63910.63550.64220.64380.65310.61770.63780.64090.70030.71110.7173
MacroF10.60560.61490.61880.62520.62520.63540.59690.62580.62500.67700.68630.6893
MicroF10.59950.60930.61300.61740.61730.62920.59030.61680.61700.66970.67970.6836
Coverage0.89530.89240.87690.84740.87580.86000.87030.85670.85660.67940.64430.6328
Corel5k1-Hamming0.99040.99040.99040.99050.99040.99050.99050.99050.99050.99050.99050.9906
1-Ranking0.85740.86000.86030.85760.85870.85770.85700.85750.85750.85810.85680.8609
AvgPrec0.22130.23540.23160.22250.22930.23040.23010.22830.21970.23050.22630.2403
1-OneError0.22470.25110.23750.21740.23460.24040.23650.23660.21820.24440.23890.2577
MacroF10.00600.00960.00770.00560.00780.00770.00820.00490.00320.00350.00370.0055
MicroF10.02340.04310.03140.02050.03110.02770.03570.01950.01140.01280.01660.0236
Coverage119.3953118.0986117.7031119.2533119.0181119.0487120.1382119.3591119.7709120.0287121.4830117.9619
Yeast1-Hamming0.78160.78550.78490.78510.78740.78640.78770.79030.78160.78470.79300.7943
1-Ranking0.80420.80600.80750.80660.80700.81000.80790.81070.80400.80310.81380.8172
AvgPrec0.73040.73060.73540.73400.73580.73830.73320.73740.72970.73030.74570.7494
1-OneError0.72730.74090.74050.73460.73720.73820.73540.72870.74570.73870.75070.7576
MacroF10.34840.28390.34360.33730.33600.34910.31910.35690.30410.31230.35490.3455
MicroF10.60370.58110.60880.60290.60490.61090.60740.61910.59220.60020.62190.6181
Coverage6.80346.76186.72706.73986.74046.65766.73616.66986.78096.79906.62296.6229
Arts1-Hamming0.92890.93590.93150.93140.93290.93280.93130.92860.93020.93620.93600.9374
1-Ranking0.83740.84450.84120.84100.84120.84330.84020.83860.83490.84280.84420.8487
AvgPrec0.49150.50690.49990.49430.50180.50070.49050.49140.47900.48900.50260.5166
1-OneError0.37150.38680.38080.37070.38610.38170.36740.36800.35310.35750.38430.4026
MacroF10.17760.14730.18940.17590.18050.18710.17630.17680.16110.14350.16200.1720
MicroF10.27200.25040.28310.26930.28240.27690.26170.27720.24860.21950.25510.2708
Coverage5.72815.56275.64515.63215.65055.59435.66755.69475.84935.60335.61395.4561
Business1-Hamming0.96840.97020.96850.96870.96910.96900.96880.96790.96580.97000.97080.9715
1-Ranking0.95850.95730.95890.96010.95910.95810.95960.95830.95520.95450.96020.9618
AvgPrec0.86360.86360.85970.86110.86260.85950.85870.85840.84350.85880.86900.8768
1-OneError0.86250.85840.85350.85510.85760.85690.85310.85350.83320.86250.86680.8784
MacroF10.16690.11660.15130.15330.16260.15510.15590.16980.13310.07440.14690.1545
MicroF10.67850.66540.66990.67140.67810.67330.67070.67530.64810.66550.69160.6951
Coverage2.37562.38732.33712.30682.34372.41532.32752.34662.48452.51112.28972.2537
Computers1-Hamming0.94960.95570.95250.95280.95200.95460.95130.94760.94610.95510.95840.9590
1-Ranking0.90110.89750.90070.89990.89860.90310.90090.89750.88750.89760.91110.9149
AvgPrec0.58940.59510.58660.58380.58270.59160.58310.58400.54370.59490.63700.6433
1-OneError0.48230.50120.48430.47500.48090.48690.47990.48620.43050.50650.55730.5622
MacroF10.14050.09470.14500.12630.13920.12880.13450.14710.11360.08880.14540.1489
MicroF10.38720.38080.39480.37920.39810.38830.38480.39510.34090.38110.44230.4515
Coverage4.46574.62494.50404.52854.57444.42354.48814.58914.97714.65434.16044.0427
Education1-Hamming0.94960.95340.95110.95180.95160.95190.95130.94840.94980.95450.95470.9549
1-Ranking0.89280.89710.89240.89230.89480.89400.89420.89100.88640.89670.89740.9006
AvgPrec0.49260.50250.50030.49400.50450.49640.49890.49230.46920.50240.51270.5225
1-OneError0.34190.34850.35080.34390.36070.34210.34450.34270.31480.34820.36830.3764
MacroF10.11020.09210.11780.11210.11590.10690.11540.11520.10090.10260.10600.1131
MicroF10.27360.23520.27350.25180.26790.24650.24920.28020.22340.22870.25330.2510
Coverage4.45084.28944.47254.49394.43354.41774.45274.51974.66934.30864.31644.2052
Entertainment1-Hamming0.92180.93000.92440.92490.92440.92680.92660.92200.92040.93090.93340.9354
1-Ranking0.86170.86860.86590.86430.86540.86580.86240.86430.85230.86860.87550.8799
AvgPrec0.54150.56040.55700.55350.55840.55680.54260.55570.52020.56250.58330.5932
1-OneError0.39360.41920.41790.41070.41750.41090.39090.41590.36790.41430.44700.4577
MacroF10.21350.16870.20410.19460.20730.19200.17800.22280.16610.17160.20080.2146
MicroF10.36240.32500.35610.33420.35800.33230.31080.38150.29970.31100.36070.3742
Coverage3.57433.46693.50483.57793.50853.54653.58873.55193.78143.46383.31783.2151
Health1-Hamming0.95180.95490.95350.95300.95410.95460.95090.95080.94740.95370.95620.9570
1-Ranking0.93150.93290.93360.93190.93410.93100.92730.93200.91900.93250.93660.9382
AvgPrec0.65920.66840.66840.65890.66880.66320.63790.66170.60390.65670.68280.6910
1-OneError0.56150.57410.57570.56660.57470.56770.53040.56410.49360.55290.59270.6071
MacroF10.18850.15870.19930.20580.21250.19060.17610.19850.15500.16400.20720.2083
MicroF10.46990.44990.48320.46470.48850.46210.43810.48290.39930.42440.48050.4902
Coverage3.42393.38033.36553.41053.33803.49313.58393.41923.83113.40033.25233.2255
Recreation1-Hamming0.92000.93050.92630.92480.92720.92620.92530.92110.92350.93230.93420.9359
1-Ranking0.80630.81310.81420.81310.81310.81340.80460.80940.80050.80430.82140.8286
AvgPrec0.47280.48960.49220.48130.49250.48870.46960.47660.45340.45720.50560.5204
1-OneError0.33750.35830.36370.34620.36280.35530.33290.34010.31070.30960.37380.3917
MacroF10.20540.18110.22900.20530.22890.21410.20040.20880.17640.14540.20330.2135
MicroF10.27940.26920.30450.28300.30460.29390.26610.28480.24400.20440.29070.3020
Coverage5.05084.91224.89834.88674.86094.86805.07574.98295.16555.09154.69434.5397
Reference1-Hamming0.95540.96330.95820.95820.95820.96010.95840.95690.95220.96350.96540.9657
1-Ranking0.89300.89660.89000.88710.88910.89000.88560.89620.85770.89560.90730.9106
AvgPrec0.54350.56080.54600.53260.54040.54120.51830.55750.44690.56510.60420.6154
1-OneError0.40950.42660.41910.39930.40930.41360.38450.43190.30340.45390.49190.5069
MacroF10.09400.09130.10720.09440.10490.10840.07980.11440.07900.04830.11750.1184
MicroF10.36500.38010.36840.34700.36490.35770.32940.39320.25420.30070.41370.4255
Coverage3.97143.86014.08104.16694.10094.07734.22633.86705.10503.90793.50043.3993
Science1-Hamming0.94990.95750.95330.95400.95350.95430.95400.95010.95230.96040.96140.9622
1-Ranking0.84230.84590.84300.83810.84730.84980.84530.84710.83510.84430.86320.8650
AvgPrec0.39580.40740.40580.39650.41360.41060.39690.41220.36170.39130.46220.4695
1-OneError0.26270.27120.27430.26620.28140.27380.25750.28010.22230.24360.33690.3480
MacroF10.11730.09750.11390.11220.12120.11380.11050.11890.08990.04640.13590.1398
MicroF10.22070.19650.22000.20820.22760.21360.20090.23180.17110.10530.26030.2607
Coverage7.68217.61457.71657.83917.51117.50927.62007.51317.99437.69246.88436.8199
Social1-Hamming0.95320.96250.95810.95750.95680.95860.94940.95740.94700.96740.96830.9693
1-Ranking0.88960.89790.89350.88980.89160.89620.87270.90630.85500.91850.92290.9265
AvgPrec0.46690.48960.47360.45220.46630.47410.39580.55030.33060.59550.62800.6354
1-OneError0.30220.32270.31010.27750.29760.29900.22920.40790.16310.43700.51100.5137
MacroF10.07790.06940.09280.08100.09680.09060.05910.11290.05180.00550.13020.1376
MicroF10.22980.25240.25110.20560.23900.22430.19380.34940.12560.06250.43810.4310
Coverage5.10894.80454.99035.14615.10654.87195.81204.50016.49714.08983.83693.6669
Society1-Hamming0.92200.93650.92990.92690.92720.93010.92880.92430.92600.93700.93920.9404
1-Ranking0.83900.84310.83640.83470.83780.84170.82870.83750.81950.84320.85340.8594
AvgPrec0.50130.53360.50830.49450.50700.51040.47500.51040.45650.53610.56380.5761
1-OneError0.38830.42890.40360.38500.39600.39890.35420.40840.33680.45070.48650.5048
MacroF10.13980.11270.14610.14380.14720.14430.10510.15110.10910.09070.12980.1537
MicroF10.29730.28660.30590.28010.29560.29110.25070.32120.22730.27120.33730.3568
Coverage5.99905.89736.06956.18396.07355.98436.28535.97056.52855.98385.69125.5111
The best results are shown in bold.
Table 4. Experimental results of 70% labeled instances on fifteen data sets in terms of seven evaluation metrics.
Dataset | Metrics | CCA | wMLDAb | wMLDAe | wMLDAc | wMLDAf | wMLDAd | MLDA-LC | MDDM | SSMLDR | SMDRdm | NMLSDR | SMDR-IC
Emotions1-Hamming0.76150.77170.76720.77390.77670.77660.76710.77500.77280.77980.78550.7960
1-Ranking0.78480.79760.79620.79900.80150.81080.80290.79740.80210.81520.81590.8270
AvgPrec0.74580.76380.76300.76120.76420.77640.76900.76500.76720.77950.77700.7945
1-OneError0.63600.67810.67360.66570.67130.69330.68310.67580.67360.68370.68710.7185
MacroF10.57850.56750.58760.60400.59680.59890.58620.60710.59800.60490.60590.6244
MicroF10.59260.59880.59820.61970.61190.61970.59780.62350.61590.62840.62620.6443
Coverage2.02641.98482.00731.98931.92081.94491.97582.00901.96631.87751.87251.8281
Scene1-Hamming0.88280.88540.88340.88350.88680.88770.87400.88210.88490.89180.89180.8987
1-Ranking0.86670.87000.87280.86940.87560.87650.86460.87050.87110.88530.88640.8956
AvgPrec0.79870.80190.80310.80110.80970.81140.78680.80120.80300.81920.82200.8323
1-OneError0.67990.68230.68410.68120.69640.69710.65130.67950.68490.70640.71200.7256
MacroF10.66490.66090.66920.65960.67140.67730.63090.65960.66840.68060.68590.7021
MicroF10.65650.65560.65970.65330.66410.67080.62230.65140.65990.67560.68060.6977
Coverage0.75600.72810.72680.73720.71310.70710.77030.73330.73170.65990.66140.6051
Corel5k1-Hamming0.99050.99050.99050.99050.99050.99050.99050.99050.99060.99060.99060.9906
1-Ranking0.86140.86330.86220.86190.86000.86400.86230.86190.86090.86010.86060.8645
AvgPrec0.23590.24590.23980.23490.23420.23890.24010.23380.23030.23240.23460.2457
1-OneError0.25390.27020.25380.24390.24420.25190.25460.24280.24000.25050.25390.2694
MacroF10.00670.01110.00850.00670.00800.00790.00910.00650.00560.00410.00620.0088
MicroF10.02670.04490.03440.02320.02890.02770.03400.02670.01920.01850.02540.0334
Coverage117.6995116.8086116.7635117.3177118.7547116.2923117.1544117.0889117.6433117.9017118.3098115.5051
Yeast1-Hamming0.79340.78560.79050.78760.78910.79180.78840.79380.78750.79070.79080.7963
1-Ranking0.81680.80950.81360.80920.80830.81470.81070.81710.81070.81470.81310.8193
AvgPrec0.74520.74040.74390.73700.73750.74340.74110.74710.74070.74600.74200.7501
1-OneError0.74890.75810.75410.74350.74040.74410.75230.75070.75390.75440.75080.7581
MacroF10.34990.29160.34590.34210.35260.36270.33470.36920.31540.32760.34730.3519
MicroF10.62280.58810.61440.61210.61520.62410.61430.63270.60460.61550.61640.6249
Coverage6.59706.74196.63756.70216.72006.59066.70106.55656.69726.68506.63136.5409
Arts1-Hamming0.93500.93830.93650.93610.93780.93620.93600.93530.93490.93710.93770.9391
1-Ranking0.84860.85260.85330.85320.85500.85320.85250.84840.84750.85030.85230.8565
AvgPrec0.51670.52810.52710.52210.53180.52280.52170.51690.50740.51610.52240.5354
1-OneError0.40100.41960.41130.40410.41880.40190.40350.39960.38710.39750.40560.4243
MacroF10.20030.16160.19260.18570.19750.19920.19480.19300.17410.16980.17640.1966
MicroF10.29500.26430.29620.27630.30190.28610.27810.29000.26060.25950.27340.2929
Coverage5.44505.38835.32975.34755.27515.31535.32945.45475.51495.42225.37455.2227
Business1-Hamming0.97100.97110.97080.97110.97090.97150.97030.97060.96990.97030.97120.9722
1-Ranking0.96240.96080.96220.96250.96270.96220.96260.96240.96120.95470.96220.9642
AvgPrec0.87380.87120.87470.87320.87430.87440.87120.87160.86470.85950.87470.8815
1-OneError0.87320.86590.87270.87140.87090.87230.86850.86710.85880.85910.87090.8817
MacroF10.18180.13470.17430.17750.18430.18650.18850.18820.16970.08680.17690.1923
MicroF10.69490.67920.68850.69230.69220.69470.68690.69310.68450.66920.69740.7041
Coverage2.21932.27402.21012.20832.22492.20602.22092.18292.27062.47652.21142.1547
Computers1-Hamming0.95590.95930.95720.95810.95770.95840.95710.95350.95330.95530.95860.9601
1-Ranking0.91200.90970.91320.91440.91420.91440.91300.91050.90590.90120.91530.9172
AvgPrec0.63450.63720.63740.63690.64020.63810.63380.63110.60560.60140.64420.6573
1-OneError0.54910.55810.54970.55050.55190.54990.54730.54420.51040.51050.56040.5812
MacroF10.18200.12850.18330.17620.18290.17110.17870.18550.14840.09030.18360.1802
MicroF10.44830.43730.45670.44880.45840.43950.44470.44060.40770.37200.45570.4664
Coverage4.08484.18384.01964.05543.99994.03474.10994.14324.32294.52343.99193.9225
Education1-Hamming0.95440.95600.95530.95500.95540.95520.95540.95380.95360.95550.95550.9562
1-Ranking0.90020.90330.90150.90170.90240.90140.90340.89900.89360.90090.90300.9043
AvgPrec0.52500.53090.53180.52930.53290.52690.52890.52210.49520.52090.53100.5396
1-OneError0.38290.38650.38990.38650.39230.38280.38230.37740.34010.37670.38810.4003
MacroF10.15160.13520.13670.14780.14100.14690.15120.14410.10310.12220.12930.1405
MicroF10.29310.26240.27750.27470.29340.27730.27940.28540.20460.23760.26010.2824
Coverage4.18444.10464.18514.15384.13094.16184.08514.26234.41574.20774.12374.0771
Entertainment1-Hamming0.93300.93650.93210.93330.93410.93390.93330.93370.93210.93540.93700.9372
1-Ranking0.87730.88240.87920.87910.87890.88170.87980.87760.87500.87850.88170.8856
AvgPrec0.58580.60290.59340.59500.59950.59910.59180.59040.58200.59210.60710.6127
1-OneError0.44910.47350.46330.46490.47020.46520.45670.45610.44600.45710.48170.4854
MacroF10.24970.19540.23290.23330.23530.22720.21330.25550.20460.21620.23390.2436
MicroF10.40960.37290.38860.39030.39320.38270.36010.41770.35990.37420.39960.4026
Coverage3.25663.18813.21923.25113.22673.17513.24523.26773.30873.21893.20413.1018
Health1-Hamming0.95690.95850.95840.95780.95840.95830.95690.95720.95270.95580.95890.9593
1-Ranking0.93990.93950.94210.94100.94330.94140.93920.94090.93060.93760.94230.9437
AvgPrec0.69340.69980.70620.69860.70870.70350.69140.70070.65000.67800.70640.7113
1-OneError0.60550.61930.62570.61430.62600.62400.60230.61790.54650.58290.62280.6297
MacroF10.22800.18510.23450.23440.24680.23300.22830.23150.19430.19650.22960.2312
MicroF10.50620.49380.52350.51550.52700.50750.49460.52410.44960.46070.51910.5130
Coverage3.12293.15033.07043.09013.00033.11123.14843.08633.44173.19923.02342.9960
Recreation1-Hamming0.93070.93710.93500.93420.93470.93460.93340.93170.93240.93440.93650.9386
1-Ranking0.82660.83030.83430.83130.83490.83540.82810.82830.81990.81480.83200.8399
AvgPrec0.51860.53030.53500.52550.53750.53470.51540.52400.50590.48950.53010.5458
1-OneError0.39420.41150.41210.39950.41460.41190.38590.39810.37310.35250.40830.4265
MacroF10.24970.22760.26290.24690.25970.26660.23850.25190.20670.18240.24760.2666
MicroF10.32290.32080.33910.33010.34350.33970.30520.32660.29230.24560.32590.3440
Coverage4.57684.52874.42234.47894.42334.40324.56294.56534.76354.86954.47914.2944
Reference1-Hamming0.96390.96620.96410.96470.96460.96490.96250.96300.96140.96440.96700.9670
1-Ranking0.90680.90940.90850.90840.90940.90650.89960.91020.89120.90220.91240.9143
AvgPrec0.60230.61480.60550.60990.60630.60250.56970.60800.54150.58290.62570.6284
1-OneError0.49140.50560.49310.50250.49230.49110.44850.49330.41670.47050.51990.5254
MacroF10.12870.12300.13520.13660.13890.13540.11410.13410.09100.08850.13660.1471
MicroF10.43590.44120.43590.42950.43510.42170.38520.44610.35360.35380.44750.4476
Coverage3.51433.41973.46153.47663.41413.51313.76543.42434.02473.68283.33253.2698
Science1-Hamming0.95900.96210.96040.96020.96040.96050.95960.95890.95920.96080.96280.9634
1-Ranking0.86330.86270.86600.86570.86830.86860.86390.86340.85590.84770.86840.8735
AvgPrec0.46230.46530.47010.46520.47450.47420.45970.46600.43090.41210.47950.4910
1-OneError0.33570.34230.34830.34010.35320.34890.33180.34370.29850.27870.35700.3723
MacroF10.15900.12350.16390.15360.15180.15700.14790.15270.12280.07590.14500.1565
MicroF10.27020.24980.28120.26550.28110.28070.26160.28350.22110.17260.27730.2832
Coverage6.87516.90526.73876.73116.66566.68816.87976.83907.19297.53746.67816.4324
Social1-Hamming0.96250.96810.96530.96470.96540.96570.96230.96380.95820.96780.96910.9704
1-Ranking0.91600.92090.91910.92200.92080.91870.91010.92080.90070.91880.92810.9323
AvgPrec0.58490.60930.59640.59510.60040.59000.55980.60980.50720.59810.64890.6581
1-OneError0.44910.48230.46490.45820.46730.45530.42070.48550.35470.44220.53680.5434
MacroF10.14120.11510.14910.15170.16610.14690.11840.15430.10470.00800.16720.1622
MicroF10.37010.40690.39840.38160.40080.37940.34540.41470.27020.08850.46300.4661
Coverage4.08373.87273.94833.81663.89274.01214.29233.87634.67084.04243.57473.4423
Society1-Hamming0.93540.94130.93800.93650.93730.93750.93540.93560.93560.93940.94140.9421
1-Ranking0.85600.85840.85880.85540.85750.85840.84760.85600.84620.85300.85880.8628
AvgPrec0.55720.57870.56650.55480.56450.56360.53730.55970.53520.55840.58180.5898
1-OneError0.46880.50030.48270.46940.48270.47950.44450.47410.43890.47840.51630.5269
MacroF10.16850.13760.17080.16330.16700.17050.12120.16820.13010.11750.15730.1682
MicroF10.34700.32780.35260.33210.34710.34460.29070.35330.30100.31210.36650.3681
Coverage5.55975.51295.46955.62885.56915.55615.83835.55965.85175.65505.52355.4241
The best results are shown in bold.
Table 5. Friedman statistics F_F and the critical value of each evaluation metric.
Metric | F_F (K = 12, N = 15) | Critical Value (α = 0.05)
Hamming loss | 50.5529 | 1.8513
Ranking loss | 59.7234 | 1.8513
Average Precision | 64.3119 | 1.8513
One Error | 47.8750 | 1.8513
MacroF1 | 16.8654 | 1.8513
MicroF1 | 19.9092 | 1.8513
Coverage | 54.2087 | 1.8513
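Table 5 and the Bonferroni–Dunn diagrams in Figure 2 follow the comparison protocol of Demšar [44]: a Friedman test in its Iman–Davenport F_F form over N = 15 data sets and K = 12 methods, followed by a Bonferroni–Dunn post-hoc comparison against SMDR-IC. The sketch below is an illustration under that assumption; the toy average ranks are hypothetical, q_alpha must be taken from the table in [44], and SciPy is assumed to be available.

```python
import numpy as np
from scipy import stats

def friedman_ff(avg_ranks: np.ndarray, n_datasets: int) -> float:
    """Iman-Davenport F_F statistic from the average ranks of k algorithms over n data sets."""
    k, n = len(avg_ranks), n_datasets
    chi2 = 12 * n / (k * (k + 1)) * (np.sum(avg_ranks ** 2) - k * (k + 1) ** 2 / 4)
    return (n - 1) * chi2 / (n * (k - 1) - chi2)

def critical_value(k: int, n: int, alpha: float = 0.05) -> float:
    """Critical value of the F distribution with (k-1, (k-1)(n-1)) degrees of freedom."""
    return stats.f.ppf(1 - alpha, k - 1, (k - 1) * (n - 1))

def bonferroni_dunn_cd(k: int, n: int, q_alpha: float) -> float:
    """Critical difference for the Bonferroni-Dunn test; q_alpha comes from the table in [44]."""
    return q_alpha * np.sqrt(k * (k + 1) / (6.0 * n))

print(critical_value(k=12, n=15))               # ~1.85, consistent with Table 5
# Toy average ranks for k = 3 algorithms over n = 5 data sets (illustrative only).
print(friedman_ff(np.array([1.4, 2.0, 2.6]), n_datasets=5))
```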
Table 6. Calculation time (s) of different methods on four data sets.
Dataset | CCA | wMLDAb | wMLDAe | wMLDAc | wMLDAf | wMLDAd | MLDA-LC | MDDM | SSMLDR | SMDRdm | NMLSDR | SMDR-IC
Emotions | 0.0662 | 0.0297 | 0.0201 | 0.0276 | 0.0302 | 0.0520 | 0.0206 | 0.0242 | 0.0422 | 0.0395 | 0.0361 | 0.1429
Scene | 0.1193 | 0.1183 | 0.1203 | 0.1116 | 0.1224 | 0.1378 | 0.1697 | 0.1463 | 0.7838 | 0.7003 | 0.6598 | 0.8260
Yeast | 0.1435 | 0.1469 | 0.1407 | 0.1503 | 0.1569 | 1.0397 | 0.1557 | 0.1405 | 0.7012 | 0.5310 | 0.4905 | 0.6690
Arts | 0.5485 | 0.5748 | 0.5866 | 0.5789 | 1.1861 | 1.5268 | 0.9560 | 1.3994 | 6.1865 | 5.8123 | 5.2988 | 11.9221
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
