Prediction of Cell-Penetrating Peptides Using a Novel HSIC-Based Multiview TSK Fuzzy System

Liu, Peng; Zhao, Shulin; Zou, Quan; Ding, Yijie

doi:10.3390/app12115383

¹ Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 611731, China

² Institute of Yangtze Delta Region (Quzhou), University of Electronic Science and Technology of China, Quzhou 324000, China

* Author to whom correspondence should be addressed.

Appl. Sci. 2022, 12(11), 5383; https://0-doi-org.brum.beds.ac.uk/10.3390/app12115383

Submission received: 24 April 2022 / Revised: 18 May 2022 / Accepted: 23 May 2022 / Published: 26 May 2022

(This article belongs to the Special Issue Application of Evolutionary Computing for Bioinformatics)

Abstract

Cell-penetrating peptides (CPPs) are short peptides that can carry cargo into cells. CPPs are widely utilized due to their powerful loading capacity and transduction efficiency. Identifying CPPs is the basis for studying their functions and mechanisms; however, experimental methods to identify CPPs are expensive and time-consuming. Recently, CPP predictors based on machine learning methods have become a research hotspot. Although considerable progress has been made, some challenges remain unresolved. First, most predictors employ a variety of feature descriptors to transform an original sequence into multiview data; however, extant methods ignore the relationships between different views, limiting further performance improvement. Second, most machine learning models are black boxes and cannot offer interpretable insight. In this paper, a novel Hilbert–Schmidt independence criterion (HSIC)-based multiview TSK fuzzy system is proposed. Compared with other machine learning methods, TSK fuzzy systems have better interpretability, and the introduction of multiview mechanisms provides comprehensive insight into the intrinsic laws of the data. HSIC is utilized here to measure the independence and enhance the complementarity between different views. The proposed method attained prediction accuracies of 92.2% and 96.2% on the training and independent test sets, respectively. The empirical results show that the proposed approach achieves better recognition performance than state-of-the-art methods.

Keywords: cell-penetrating peptides; machine learning; TSK fuzzy system; multiview learning; HSIC

1. Introduction

Traditional therapeutic drugs are greatly limited by the complexity of the human immune system and the selective permeability of the cell membrane. As such, many diseases require treatment at the molecular level. Ideally, drugs would be delivered directly to target cells while minimizing the impact on those cells and avoiding permanent damage. Cell-penetrating peptides (CPPs) can be used to accomplish this task. CPPs are a class of short peptides, 5–50 amino acid residues in length [1], that can carry DNA, proteins, and other biomolecules into cells without causing irreparable damage when the CPP concentration is low. CPPs are widely utilized due to their powerful loading capacity and transduction efficiency. Therefore, the correct identification of CPPs is of great significance. Unfortunately, the traditional experimental approach to identifying CPPs is time-consuming and costly, and its accuracy is not satisfactory.

In recent years, machine learning-based methods have been widely applied [2,3] to predict CPPs. These methods have two main steps: (1) selecting a suitable feature extraction method to transform the original sequence into vector form, where a variety of descriptors are often adopted to reduce information loss, converting the sequence into multiview data; and (2) building a learning model and training it on the features obtained in the first step. Such machine learning-based predictors have evolved rapidly in the past few years. CellPPD, proposed by Gautam et al. [4], adopts several feature representation methods, such as the amino acid composition, dipeptide composition, and binary profile. Diener et al. [5] improved the prediction performance by utilizing the amino acid frequency and physicochemical property features. Wei et al. constructed a high-quality dataset, CPP924, and presented two effective predictors: SkipCPP-Pred [6] and CPPred-RF [7]. An adaptive skip dipeptide composition descriptor and the random forest algorithm were employed. TargetCPP, proposed by Arif et al. [8], adopted split amino acid composition and composite protein sequence representation, covering multiview information, and employed the gradient boost decision tree algorithm to improve the prediction performance. Fu et al. [9] built a predictor named StackCPPred based on the residue pairwise energy matrix and employed support vector machine recursive feature elimination and correlation bias reduction to improve the identification ability. In addition to these predictors for CPPs, some methods have been used to predict other therapeutic peptides. PEPred, proposed by Wei et al. [10], and PPTPP, proposed by Zhang et al. [11], can be used to predict eight types of therapeutic peptides: AAP, ABP, ACP, AIP, AVP, CPP, QSP, and SBP. ITP-Pred, proposed by Cai et al. [12], can predict both CPPs and QSPs. These prediction methods improve the ability to discriminate CPPs and lay the foundation for the wide application of CPPs.

A fuzzy system is a rule-based system that implements knowledge representation via fuzzy logic and inference. The core of a fuzzy system is a knowledge base composed of IF-THEN rules. In this paper, the Takagi–Sugeno–Kang (TSK) fuzzy system [13,14] was adopted due to its excellent interpretability and data-driven learning ability [15,16,17].

In multiview learning, each view can benefit from knowledge from other views, which is the complementarity principle. In addition, some studies [18] have noted that the independence of different views can serve as a beneficial complement to multiview learning. In this paper, the Hilbert–Schmidt independence criterion (HSIC) [19] is employed to measure the independence of different views and realize the idea of the complementarity principle.

Although many methods are available to predict CPPs, some critical problems remain unresolved. (1) Many proposed predictors adopt multiple feature descriptors but simply concatenate the feature vectors and input the hybrid feature directly into the prediction model. As a result, the interaction between different views and the statistical characteristics of the data are ignored, and the predictive performance is compromised. (2) Most machine learning models are black boxes; in contrast, a fuzzy system with a knowledge base of fuzzy rules has good interpretability and can provide insight into the underlying rules.

To solve the above problems, a CPP predictor based on a multiview TSK fuzzy system is proposed, and the main workflow of the process is shown in Figure 1. First, two feature descriptors were employed, namely, soft symmetric alignment and pseudo-amino acid composition. Then, the correlation-based feature selection algorithm was adopted to remove redundant features and noise. The resulting feature subset was input into the multiview TSK fuzzy system. Finally, the multiview decision result was obtained.

The main contributions of this study are as follows: (1) We introduce a multiview TSK fuzzy system based on HSIC. Compared with other machine learning methods, TSK fuzzy systems have advantages in interpretability. The introduction of multiview mechanisms allows for comprehensive insight into the intrinsic laws of the data. HSIC was utilized to measure the independence and enhance the complementarity between different views. (2) The proposed method is competitive with or better than the state-of-the-art CPP predictors. The empirical results suggest that our method has broad application prospects.

2. Materials and Methods

2.1. Data Collection

In this paper, the CPP740 dataset [7] was adopted. CPP740 contains 370 CPPs and the same number of non-CPPs. We also employed an independent test set, containing 92 positive samples and the same number of negative samples, to validate the performance of our method.

2.2. Feature Extraction

2.2.1. Pseudo-Amino Acid Composition

During the process of converting biological sequences into vectors, it is inevitable that some information will be lost. To maximally retain the information of the original sequence, a variety of feature representation methods have been proposed. Among them, the pseudo-amino acid composition (Pse-AAC), proposed by Chou et al. [20], has been widely used in various fields of bioinformatics. Pse-AAC incorporates contiguous local sequence-order information and global sequence-order information into the feature vector. After application of this method, the sequence is represented as a 50-dimensional feature vector.

2.2.2. Soft Symmetric Alignment

The soft symmetric alignment (SSA) feature was adopted by Lv et al. [21] to predict anticancer peptides. It is a deep representation learning feature extraction method. It trains a three-layer stacked BiLSTM encoder that converts the sequence into a matrix $R^{L \times 121}$, where $L$ is the length of the peptide. Then, the similarity loss function is used to optimize the model parameters through backpropagation, and the sequence is transformed into a 121-D feature vector.

2.3. Feature Selection

To improve the computational efficiency, eliminate redundant features, avoid overfitting problems and improve the generalization ability of the model, it is necessary to adopt a suitable feature selection method. In this paper, we employed the correlation-based feature selection algorithm (CFS) [22], which does not rank individual features but searches for the optimal subset of features. A feature subset is considered valuable if its features are highly correlated with the labels and its redundancy is low. The greedy algorithm was used to search for feature subsets, and a subset containing 37 features was selected, including 15 SSA features and 22 Pse-AAC features.
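The subset search described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes Hall's merit heuristic, $k \bar{r}_{cf} / \sqrt{k + k(k-1) \bar{r}_{ff}}$, with absolute Pearson correlations standing in for the correlation measure, and a plain forward greedy search. The function names `cfs_merit` and `cfs_forward` are hypothetical.

```python
import numpy as np

def cfs_merit(X, y, subset):
    """CFS merit of a feature subset: k*r_cf / sqrt(k + k*(k-1)*r_ff)."""
    k = len(subset)
    # Mean absolute feature-label correlation (assumption: Pearson correlation).
    r_cf = np.mean([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in subset])
    if k == 1:
        return r_cf
    # Mean absolute feature-feature correlation over all distinct pairs.
    pairs = [(a, b) for i, a in enumerate(subset) for b in subset[i + 1:]]
    r_ff = np.mean([abs(np.corrcoef(X[:, a], X[:, b])[0, 1]) for a, b in pairs])
    return k * r_cf / np.sqrt(k + k * (k - 1) * r_ff)

def cfs_forward(X, y):
    """Greedy forward search: keep adding the feature that most improves the merit."""
    selected, best = [], 0.0
    remaining = list(range(X.shape[1]))
    while remaining:
        merit, j = max((cfs_merit(X, y, selected + [j]), j) for j in remaining)
        if merit <= best:  # stop when no candidate improves the merit
            break
        selected.append(j)
        remaining.remove(j)
        best = merit
    return selected, best
```

A subset scores highly when its features correlate with the label (large numerator) and weakly with one another (small denominator), which matches the criterion stated above.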

2.4. TSK Fuzzy System

The TSK fuzzy system is a classic fuzzy model. Its input and output are nonfuzzy values, and it is highly flexible and interpretable. Therefore, we chose the TSK fuzzy system as the basic model. A fuzzy rule in TSK can be defined as follows:

$$R^{k}: \text{IF } x_{1} \text{ is } A_{1}^{k} \land x_{2} \text{ is } A_{2}^{k} \land \dots \land x_{D} \text{ is } A_{D}^{k}, \text{ THEN } y^{k}(x) = p_{0}^{k} + p_{1}^{k} x_{1} + p_{2}^{k} x_{2} + \dots + p_{D}^{k} x_{D}, \quad k = 1, 2, \dots, K \tag{1}$$

The above TSK fuzzy system consists of $K$ rules, and the input vector is $x = [x_{1}, x_{2}, \dots, x_{D}]^{T}$. $A_{d}^{k}$ is the fuzzy set corresponding to the $d$th feature in the $k$th rule, $y^{k}$ is the output of the $k$th rule, and $p_{d}^{k}$ are the consequent parameters. The membership function of the fuzzy set $A_{d}^{k}$ is commonly represented by a Gaussian function:

$$\mu_{A_{d}^{k}}(x_{d}) = \exp\left(-\frac{(x_{d} - c_{d}^{k})^{2}}{2 \sigma_{d}^{k}}\right) \tag{2}$$

where $c_{d}^{k}$ denotes the center and $\sigma_{d}^{k}$ denotes the variance. In this paper, the fuzzy c-means (FCM) algorithm is employed to calculate $c_{d}^{k}$ and $\sigma_{d}^{k}$:

$$c_{d}^{k} = \frac{\sum_{i=1}^{N} u_{ik} x_{id}}{\sum_{i=1}^{N} u_{ik}} \tag{3}$$

$$\sigma_{d}^{k} = \frac{h \sum_{i=1}^{N} u_{ik} (x_{id} - c_{d}^{k})^{2}}{\sum_{i=1}^{N} u_{ik}} \tag{4}$$

where $u_{ik}$ is the FCM membership degree of the $i$th sample in the $k$th cluster, $x_{id}$ is the $d$th feature of the $i$th sample, $N$ is the number of samples, and $h$ is a scaling parameter.
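Equations (2)–(4) can be sketched in a few lines of NumPy. The sketch assumes that the FCM membership matrix has already been computed by some FCM implementation and is passed in as `U`; the function names `antecedent_params` and `gaussian_membership` are hypothetical and not taken from the paper.

```python
import numpy as np

def antecedent_params(X, U, h=1.0):
    """X: (N, D) data; U: (N, K) FCM memberships. Returns centers and widths, each (K, D)."""
    w = U / U.sum(axis=0, keepdims=True)               # normalise memberships per rule
    centers = w.T @ X                                  # Eq. (3)
    sq_dev = (X[:, None, :] - centers[None]) ** 2      # squared deviations, (N, K, D)
    widths = h * np.einsum("nk,nkd->kd", w, sq_dev)    # Eq. (4)
    return centers, widths

def gaussian_membership(X, centers, widths):
    """Eq. (2) evaluated for every sample, rule and feature; shape (N, K, D)."""
    return np.exp(-((X[:, None, :] - centers[None]) ** 2) / (2.0 * widths[None]))
```

A larger `h` widens every Gaussian, so each sample activates more rules; this is why `h` is treated as a tunable parameter.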

The output of the TSK fuzzy system is the combination of the results of each rule, which can be expressed as:

$$y(x) = \frac{\sum_{k=1}^{K} \mu^{k}(x) y^{k}(x)}{\sum_{k=1}^{K} \mu^{k}(x)} = \sum_{k=1}^{K} \tilde{\mu}^{k}(x) y^{k}(x) \tag{5}$$

where

$$\mu^{k}(x) = \prod_{d=1}^{D} \mu_{A_{d}^{k}}(x_{d}) \tag{6}$$

and

$$\tilde{\mu}^{k}(x) = \frac{\mu^{k}(x)}{\sum_{k'=1}^{K} \mu^{k'}(x)} \tag{7}$$

For the input vector $x$, let

$$x_{e} = (1, x^{T})^{T} \tag{8}$$

$$\tilde{x}^{k} = \tilde{\mu}^{k}(x) x_{e} \tag{9}$$

$$x_{g} = ((\tilde{x}^{1})^{T}, (\tilde{x}^{2})^{T}, \dots, (\tilde{x}^{K})^{T})^{T} \tag{10}$$

$$p^{k} = (p_{0}^{k}, p_{1}^{k}, \dots, p_{D}^{k})^{T} \tag{11}$$

$$p_{g} = ((p^{1})^{T}, (p^{2})^{T}, \dots, (p^{K})^{T})^{T} \tag{12}$$

then

$$y(x) = p_{g}^{T} x_{g} \tag{13}$$
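The mapping from $x$ to $x_{g}$ in Equations (6)–(10) can be illustrated as below. This is a sketch under the assumption that the centers and widths are already available as `(K, D)` arrays; the name `tsk_hidden_mapping` is hypothetical, and computing the firing strengths in log space is an implementation choice made here for numerical stability, not something stated in the paper.

```python
import numpy as np

def tsk_hidden_mapping(X, centers, widths):
    """Map X (N, D) to X_g (N, K*(D+1)) following Eqs. (6)-(10)."""
    N = X.shape[0]
    # Eq. (2) and Eq. (6): a product of Gaussians is the exponential of a sum.
    log_mu = -((X[:, None, :] - centers[None]) ** 2 / (2.0 * widths[None])).sum(axis=2)
    mu = np.exp(log_mu - log_mu.max(axis=1, keepdims=True))  # shift cancels in Eq. (7)
    mu_tilde = mu / mu.sum(axis=1, keepdims=True)            # Eq. (7)
    x_e = np.hstack([np.ones((N, 1)), X])                    # Eq. (8)
    x_g = mu_tilde[:, :, None] * x_e[:, None, :]             # Eq. (9), shape (N, K, D+1)
    return x_g.reshape(N, -1)                                # Eq. (10)
```

Given `X_g`, the model output of Equation (13) for all samples is simply `X_g @ p_g`, which is what makes the consequent parameters solvable by linear methods.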

According to the above transformation, the TSK fuzzy system is transformed into a linear model. We employed the method of Deng et al. [23] to solve for the model coefficients. The objective function is as follows:

$$\min_{p_{g}} J_{TSK}(p_{g,c}) = \frac{1}{2} \sum_{c=1}^{C} p_{g,c}^{T} p_{g,c} + \frac{\lambda_{p_{g}}}{2} \sum_{c=1}^{C} \sum_{i=1}^{N} \| y_{ic} - p_{g,c}^{T} x_{gi} \|^{2} \tag{14}$$

Setting the derivative of the objective function with respect to $p_{g,c}$ to zero, the optimal solution of $p_{g,c}$ can be obtained:

$$p_{g,c} = \left(I + \lambda_{p_{g}} \sum_{i=1}^{N} x_{gi} x_{gi}^{T}\right)^{-1} \cdot \left(\lambda_{p_{g}} \sum_{i=1}^{N} x_{gi} y_{ic}\right) \tag{15}$$

where $I$ is the $K(D+1) \times K(D+1)$ identity matrix.
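Setting the gradient of Equation (14) to zero gives the linear system $(I + \lambda_{p_{g}} \sum_{i} x_{gi} x_{gi}^{T}) p_{g,c} = \lambda_{p_{g}} \sum_{i} x_{gi} y_{ic}$, which is an ordinary ridge-regression problem. A minimal sketch of solving it for all classes at once is given below; the function name `solve_consequents` and the matrix layout (samples in rows, one label column per class) are assumptions made for illustration.

```python
import numpy as np

def solve_consequents(X_g, Y, lam):
    """Minimise 0.5*||P||^2 + 0.5*lam*||Y - X_g P||^2; returns P of shape (M, C)."""
    M = X_g.shape[1]                      # M = K*(D+1) columns of the mapped data
    A = np.eye(M) + lam * X_g.T @ X_g     # left-hand side of the stationarity condition
    # Solve the linear system rather than forming the inverse explicitly.
    return np.linalg.solve(A, lam * X_g.T @ Y)
```

Because the matrix on the left is the identity plus a positive semidefinite term, it is always invertible, so the solution is unique for any positive `lam`.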

2.5. Multiview TSK Fuzzy System via HSIC

The complementarity principle is an important criterion in multiview learning. In our data, each view corresponds to a group of features, so each view has unique information. Therefore, making accurate predictions requires integrating information from each view. In this paper, we apply the Hilbert–Schmidt independence criterion (HSIC) to realize the idea of the complementarity principle. The HSIC is used to measure the independence between different views. The independence of each view can reduce redundant information and enhance complementarity. According to the method of Cao et al. [19], the empirical version of HSIC is summarized as follows:

$$\mathrm{HSIC}(E^{v}, E^{h}) = (n - 1)^{-2} \, \mathrm{Tr}(K^{v} H K^{h} H) \tag{16}$$

where $E^{v}$ is the prediction error in view $v$ and $K^{v}$ is the Gram matrix in view $v$. We set $K^{v} = E^{v} (E^{v})^{T}$. The matrix $H$ with entries $h_{ij} = \delta_{ij} - 1/n$ centers the Gram matrix to have a zero mean in the feature space. For notational convenience, we ignore the scaling factor $(n - 1)^{-2}$. When all views except view $v$ are fixed, we minimize the following function:

$$\sum_{h=1; h \neq v}^{V} \mathrm{HSIC}(E^{v}, E^{h}) = \mathrm{Tr}\left(\sum_{h=1; h \neq v}^{V} H K^{v} H K^{h}\right) \tag{17}$$

$$= \mathrm{Tr}\left(\sum_{h=1; h \neq v}^{V} (E^{v})^{T} H K^{h} H E^{v}\right) = \mathrm{Tr}\left((E^{v})^{T} G^{v} E^{v}\right) \tag{18}$$

where

$$G^{v} = \sum_{h=1; h \neq v}^{V} H K^{h} H \tag{19}$$
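The step from Equation (17) to Equation (18) relies only on the cyclic property of the trace and on $K^{v} = E^{v} (E^{v})^{T}$. The sketch below computes both sides numerically so that the identity can be checked; the function names are hypothetical, and the scaling factor $(n-1)^{-2}$ is omitted as in the text.

```python
import numpy as np

def centering_matrix(n):
    """H with entries h_ij = delta_ij - 1/n."""
    return np.eye(n) - np.ones((n, n)) / n

def hsic_linear(E_v, E_h):
    """Unscaled empirical HSIC, Eq. (16), with linear Gram matrices K = E E^T."""
    H = centering_matrix(E_v.shape[0])
    K_v, K_h = E_v @ E_v.T, E_h @ E_h.T
    return np.trace(K_v @ H @ K_h @ H)

def hsic_penalty(E_v, others):
    """Tr((E^v)^T G^v E^v) with G^v = sum_h H K^h H, Eqs. (18)-(19)."""
    n = E_v.shape[0]
    H = centering_matrix(n)
    G = sum((H @ (E @ E.T) @ H for E in others), np.zeros((n, n)))
    return np.trace(E_v.T @ G @ E_v), G
```

Writing the penalty as a quadratic form in $E^{v}$ is what allows it to be added to the ridge objective and still be solved in closed form.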

With the HSIC, we obtain the following objective function:

$$\min_{P^{v}, E^{v}} J_{TSK-HSIC}(P^{v}, E^{v}) = \mathrm{Tr}\left(\frac{1}{2} \sum_{v=1}^{V} (P^{v})^{T} P^{v} + \sum_{v=1}^{V} \frac{\lambda_{P}^{v}}{2} (E^{v})^{T} E^{v} + \frac{\gamma}{2} \sum_{v=1}^{V} (E^{v})^{T} G^{v} E^{v}\right) \tag{20}$$

$$\text{s.t. } Y_{vec} = X_{g}^{v} P^{v} + E^{v} \text{ for } v = 1, 2, \dots, V \tag{21}$$

where $X_{g}^{v}$ is the matrix of the input data of view $v$ after the transformation of Equation (10), $Y_{vec}$ is the matrix obtained from the label vector through one-hot coding, $E^{v}$ represents the error matrix of view $v$, with $Y_{vec}, E^{v} \in R^{N \times C}$, and $P^{v} \in R^{K(D^{v} + 1) \times C}$ is a matrix composed of the consequent parameters of the TSK fuzzy system. $V$ is the number of views, $C$ is the total number of classes, $N$ is the total number of data samples, $K$ is the total number of rules, and $D^{v}$ is the number of data dimensions of view $v$. $\gamma$ and $\lambda_{P}^{v}$ are regularization parameters, whose values can be obtained by cross validation. The Lagrange function of this problem is defined as:

$$\mathcal{L}(P^{v}, E^{v}) = J_{TSK-HSIC}(P^{v}, E^{v}) - \mathrm{Tr}\left(\sum_{v=1}^{V} (\alpha^{v})^{T} (X_{g}^{v} P^{v} + E^{v} - Y_{vec})\right) \tag{22}$$

Let $\partial \mathcal{L} / \partial P^{v} = 0$, $\partial \mathcal{L} / \partial E^{v} = 0$ and $\partial \mathcal{L} / \partial \alpha^{v} = 0$:

$$\begin{cases} \dfrac{\partial \mathcal{L}}{\partial P^{v}} = 0 \to P^{v} = (X_{g}^{v})^{T} \alpha^{v} \\ \dfrac{\partial \mathcal{L}}{\partial E^{v}} = 0 \to \lambda_{P}^{v} E^{v} + \gamma G^{v} E^{v} = \alpha^{v} \\ \dfrac{\partial \mathcal{L}}{\partial \alpha^{v}} = 0 \to Y_{vec} = X_{g}^{v} P^{v} + E^{v} \end{cases} \quad \text{where } v = 1, 2, \dots, V \tag{23}$$

Solving these equations, the solutions for $P^{v}$ and $E^{v}$ can be obtained:

$$E^{v} = \left(\lambda_{P}^{v} X_{g}^{v} (X_{g}^{v})^{T} + \gamma X_{g}^{v} (X_{g}^{v})^{T} G^{v} + I_{N}\right)^{-1} Y_{vec} \tag{24}$$

$$P^{v} = (X_{g}^{v})^{T} (\lambda_{P}^{v} I_{N} + \gamma G^{v}) E^{v} = (X_{g}^{v})^{T} (\lambda_{P}^{v} I_{N} + \gamma G^{v}) \left(\lambda_{P}^{v} X_{g}^{v} (X_{g}^{v})^{T} + \gamma X_{g}^{v} (X_{g}^{v})^{T} G^{v} + I_{N}\right)^{-1} Y_{vec} \tag{25}$$
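Equations (24) and (25) give a closed-form update for one view while the other views are held fixed. A possible way to use them is an alternating loop over the views, sketched below. The loop structure, the zero initialisation of the error matrices, the fixed iteration count `n_iter`, the shared `lam` across views, and the function names are all assumptions made for illustration; the paper does not specify these details here.

```python
import numpy as np

def update_view(X_g, Y, G, lam, gamma):
    """Eqs. (24)-(25): error matrix E and consequent matrix P for one view."""
    N = X_g.shape[0]
    B = lam * np.eye(N) + gamma * G                        # lambda*I_N + gamma*G^v
    E = np.linalg.solve(X_g @ X_g.T @ B + np.eye(N), Y)    # Eq. (24)
    P = X_g.T @ B @ E                                      # Eq. (25)
    return P, E

def fit_multiview(X_views, Y, lam, gamma, n_iter=10):
    """Alternately update each view with the other views' errors held fixed."""
    N = Y.shape[0]
    H = np.eye(N) - np.ones((N, N)) / N
    E = [np.zeros((N, Y.shape[1])) for _ in X_views]       # assumed initialisation
    P = [None] * len(X_views)
    for _ in range(n_iter):
        for v, X_g in enumerate(X_views):
            # G^v of Eq. (19), built from the current errors of the other views.
            G = sum((H @ (E[h] @ E[h].T) @ H
                     for h in range(len(X_views)) if h != v), np.zeros((N, N)))
            P[v], E[v] = update_view(X_g, Y, G, lam, gamma)
    return P, E
```

Each update satisfies the constraint of Equation (21) exactly, since $X_{g}^{v} P^{v} + E^{v} = (X_{g}^{v} (X_{g}^{v})^{T} (\lambda_{P}^{v} I_{N} + \gamma G^{v}) + I_{N}) E^{v} = Y_{vec}$.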

2.6. Parameter Setting

Selecting appropriate parameters for the model can improve the prediction performance, enhance the generalization ability and avoid overfitting problems. In this study, all parameters were determined on the training set through five-fold cross validation. For the single-view TSK fuzzy system, the number of fuzzy rules was selected from the set $\{2, 4, 6, 8, 10\}$, the scaling parameter $h$ in Equation (4) from the set $\{10^{-3}, 10^{-2}, 10^{-1}, 10^{0}, 10^{1}, 10^{2}, 10^{3}\}$, and the regularization parameter $\lambda_{P}^{v}$ from the set $\{2^{-10}, 2^{-9}, \dots, 2^{9}, 2^{10}\}$. For the multiview TSK fuzzy system, the regularization parameter $\gamma$ was selected from the set $\{2^{-10}, 2^{-9}, \dots, 2^{9}, 2^{10}\}$.
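For orientation, the candidate sets above can be enumerated as follows. The variable names are hypothetical, and the text does not state whether $\gamma$ was searched jointly with the other parameters or afterwards, so the sketch only builds the single-view grid and lists the $\gamma$ candidates separately.

```python
from itertools import product

rules = [2, 4, 6, 8, 10]                             # number of fuzzy rules K
h_values = [10.0 ** e for e in range(-3, 4)]         # scaling parameter h of Eq. (4)
lam_values = [2.0 ** e for e in range(-10, 11)]      # regularization parameter lambda
gamma_values = [2.0 ** e for e in range(-10, 11)]    # HSIC weight gamma (multiview only)

# Every (K, h, lambda) combination for the single-view system.
single_view_grid = list(product(rules, h_values, lam_values))
```

The single-view grid contains 5 × 7 × 21 = 735 combinations, each of which would be scored by five-fold cross validation on the training set.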

2.7. Performance Metrics

In this paper, we employed the accuracy (ACC), sensitivity (SN), specificity (SP) and Matthews correlation coefficient (MCC) to evaluate the performance of our model. The values of ACC, SN and SP lie in the range [0, 1]; because the proposed model should outperform random prediction, the acceptable range is [0.5, 1]. Similarly, the MCC lies in the range [−1, 1], and the acceptable range is [0, 1]. The metrics are calculated as follows:

$$\mathrm{ACC} = \frac{TP + TN}{TP + FP + TN + FN} \tag{26}$$

$$\mathrm{SN} = \frac{TP}{TP + FN} \tag{27}$$

$$\mathrm{SP} = \frac{TN}{TN + FP} \tag{28}$$

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP) \times (TP + FN) \times (TN + FP) \times (TN + FN)}} \tag{29}$$

In these expressions, TP is the number of true positives, TN is the number of true negatives, FP is the number of false positives, and FN is the number of false negatives.

3. Results

In this study, we employed two independent datasets, namely, CPP740 and CPP184. SSA and Pse-AAC describe the data from different views. We first compared the performance of the single features and the combined features. Then, the correlation-based feature selection algorithm was employed to remove redundant features, and the resulting feature subset was input into the multiview TSK fuzzy system. Finally, cross-validation and independent tests were adopted to validate the empirical performance of the model. The results show that our method outperformed the state-of-the-art methods in the literature.

3.1. Performance Analysis from a Single View

For the CPP740 dataset, we input the SSA and Pse-AAC features separately into the classic single-view TSK fuzzy system defined by Equation (14). The results are shown in Table 1. The ACC, SN, SP, and MCC of the Pse-AAC feature were 91.1%, 90.5%, 91.6%, and 0.822, respectively, which were better than those of the SSA feature for all indices. After combining the two features and taking the multiview approach, the ACC, SN, SP, and MCC values were 90.1%, 89.2%, 90.8%, and 0.802, respectively, which were not as good as those of the Pse-AAC view alone. We believe that this decline was caused not by the multiview technique itself but by the addition of redundant information and irrelevant features. Therefore, it is essential to employ an appropriate feature selection method.

3.2. Performance Analysis after Feature Selection

In this study, the correlation-based feature selection algorithm was utilized because of its excellent performance. We concatenated the 121-D SSA features with the 50-D Pse-AAC features, and the resulting 171-D hybrid features were employed as the input to the feature selection algorithm. A 37-dimensional feature subset was obtained, comprising 15 of the 121 SSA features and 22 of the 50 Pse-AAC features. Combined with the performance analysis of the single view, this suggests that the Pse-AAC descriptor provides more valuable information for predicting CPPs.

Table 2 shows the five-fold cross validation results after feature selection. The average ACC is 92.2%, the average SN is 90.8%, the average SP is 93.5%, and the average MCC is 0.844. These metrics improve on the results obtained before feature selection (Table 1), while the feature dimension is reduced from 171 to 37. CFS thus improves the computational efficiency as well as the prediction performance of the model.

3.3. Comparative Analysis with Other Classifiers

Table 3 shows the five-fold cross validation results using MV-TSK-FS-HSIC and some classic algorithms with the selected features of the CPP740 dataset. The classic methods included XGBoost, naïve Bayes (NB), and random forest (RF). Among them, RF gave the best overall results, with an ACC of 90.4%, SN of 88.6%, SP of 92.2%, MCC of 0.809 and AUC of 0.967. Figure 2 shows the receiver operating characteristic curves of different classifiers for the CPP740 dataset. The AUC of the proposed method is 0.975, which is higher than those of the classic algorithms. Unlike for the multiview TSK fuzzy system, the hybrid features were input directly into the classic models. This ignores the relationship between different views and the statistical characteristics of the data, so their performance is inferior to that of MV-TSK-FS-HSIC.

3.4. Comparison Analysis for the CPP740 Dataset

As shown in Table 4, several extant methods, including PPTPP [11], ITP-PRED [12], and PEPred [10], were compared with our method on the CPP740 dataset. The best overall results among the previous methods were obtained by PEPred, with ACC, SN, SP, MCC and AUC values of 91.2%, 90.3%, 92.2%, 0.824, and 0.972, respectively. Compared with PEPred, our method increased the ACC, SN, and SP by 1.0, 0.5, and 1.3 percentage points and the MCC and AUC by 0.020 and 0.003, respectively.

3.5. Comparison Analysis of an Independent Test Set

To verify the generalization ability of our model, we employed an independent test set with 184 samples. The dataset contains 92 CPPs and the same number of non-CPPs, and there is no overlap with the samples of the training set. The experimental results of the five-fold cross validation are shown in Table 5. The ACC, SN, SP, MCC, and AUC values of MV-TSK-FS-HSIC were 96.2%, 96.7%, 95.7%, 0.924, and 0.990, respectively. Compared with ITP-PRED, our method improved the ACC and SN by 1.1 and 3.9 percentage points and the MCC and AUC by 0.020 and 0.001, respectively. Only SP was inferior to that of ITP-PRED. The results indicate that our method is superior to the state-of-the-art predictors.

4. Discussion and Conclusions

In this study, a novel multiview TSK fuzzy system was proposed. First, SSA and Pse-AAC descriptors were employed to convert the original sequence into multiview data. Second, we utilized the correlation-based feature selection algorithm to obtain the optimal feature subset. Finally, the resulting feature subset was input into the multiview TSK fuzzy system based on the HSIC, and a multiview decision result was obtained. We compared the performance of the proposed model with several classical machine learning algorithms by five-fold cross validation. The empirical results demonstrate that the proposed method outperforms the classical methods in terms of ACC, SN, and MCC, although NB attained a higher SP (Table 3). On the CPP740 training set and the independent test set, the AUC reached 0.975 and 0.990 and the ACC reached 92.2% and 96.2%, respectively. The results indicate that our method is superior to the existing CPP predictors.

Through feature analysis, we found that Pse-AAC features played a more important role than SSA features in identifying CPPs. We believe that this is because Pse-AAC features contain physicochemical information that can distinguish CPPs from non-CPPs. In addition, feature selection is necessary to alleviate the overfitting problem, remove redundant features, reduce the data dimension and lower the computational cost. When the data matrix $X$ is converted to the input matrix $X_{g}$ of the TSK fuzzy system, the dimension increases from $D$ to $K(D + 1)$, which can seriously degrade the computational efficiency of the model if no feature selection is performed.

Although the proposed model has been proven to be effective in experiments, there is still room for improvement. In future research, we expect to introduce a novel multiview mechanism to investigate the relationships between different views and enhance the interpretability of TSK fuzzy systems.

Author Contributions

Conceptualization, Y.D.; methodology, Y.D.; software, P.L.; validation, S.Z.; formal analysis, S.Z. and P.L.; investigation, S.Z.; resources, Q.Z.; data curation, P.L.; writing—original draft preparation, P.L.; writing—review and editing, Y.D.; visualization, P.L.; supervision, Q.Z.; project administration, Q.Z. All authors have read and agreed to the published version of the manuscript.

Funding

The work was supported by the National Natural Science Foundation of China (No. 62172076, 61902271), and the Municipal Government of Quzhou under Grant Number 2020D003 and 2021D004.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data used in this paper can be downloaded at http://lab.malab.cn/~acy/BioseqData/Protein/ITP-Pred.rar (accessed on 16 January 2022).

Acknowledgments

We thank Abd El-Latif Hesham for his contribution to this paper.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Milletti, F. Cell-penetrating peptides: Classes, origin, and current landscape. Drug Discov. Today 2012, 17, 850–860. Available online: https://0-linkinghub-elsevier-com.brum.beds.ac.uk/retrieve/pii/S1359644612000839 (accessed on 10 March 2022).
2. Zhao, S.; Ju, Y.; Ye, X.; Zhang, J.; Han, S. Bioluminescent Proteins Prediction with Voting Strategy. Curr. Bioinform. 2021, 16, 240–251. Available online: https://www.eurekaselect.com/182388/article (accessed on 10 March 2022).
3. Rahaman, M.M.; Li, C.; Yao, Y.; Kulwa, F.; Wu, X.; Li, X.; Wang, Q. DeepCervix: A deep learning-based framework for the classification of cervical cells using hybrid deep feature fusion techniques. Comput. Biol. Med. 2021, 136, 104649.
4. Gautam, A.; Chaudhary, K.; Kumar, R.; Sharma, A.; Kapoor, P.; Tyagi, A.; Raghava, G.P.S. In silico approaches for designing highly effective cell penetrating peptides. J. Transl. Med. 2013, 11, 1–12. Available online: http://translational--medicine-biomedcentral-com-s.vpn.uestc.edu.cn:8118/articles/2010.1186/1479-5876-2011-2074 (accessed on 10 March 2022).
5. Diener, C.; Garza Ramos Martínez, G.; Moreno Blas, D.; Castillo González, D.A.; Corzo, G.; Castro-Obregon, S.; Del Rio, G. Effective Design of Multifunctional Peptides by Combining Compatible Functions. PLoS Comput. Biol. 2016, 12, e1004786.
6. Wei, L.; Tang, J.; Zou, Q. SkipCPP-Pred: An improved and promising sequence-based predictor for predicting cell-penetrating peptides. BMC Genom. 2017, 18, 742. Available online: https://0-bmcgenomics-biomedcentral-com.brum.beds.ac.uk/articles/710.1186/s12864-12017-14128-12861 (accessed on 10 March 2022).
7. Wei, L.; Xing, P.; Su, R.; Shi, G.; Ma, Z.S.; Zou, Q. CPPred-RF: A Sequence-based Predictor for Identifying Cell-Penetrating Peptides and Their Uptake Efficiency. J. Proteome Res. 2017, 16, 2044–2053.
8. Arif, M.; Ahmad, S.; Ali, F.; Fang, G.; Li, M.; Yu, D.-J. TargetCPP: Accurate prediction of cell-penetrating peptides from optimized multi-scale features using gradient boost decision tree. J. Comput.-Aided Mol. Des. 2020, 34, 841–856. Available online: http://0-link-springer-com.brum.beds.ac.uk/810.1007/s10822-10020-00307-z (accessed on 10 March 2022).
9. Fu, X.; Cai, L.; Zeng, X.; Zou, Q. StackCPPred: A stacking and pairwise energy content-based prediction of cell-penetrating peptides and their uptake efficiency. Bioinformatics 2020, 36, 3028–3034. Available online: https://0-academic-oup-com.brum.beds.ac.uk/bioinformatics/article/3036/3010/3028/5762610 (accessed on 10 March 2022).
10. Wei, L.; Zhou, C.; Su, R.; Zou, Q. PEPred-Suite: Improved and robust prediction of therapeutic peptides using adaptive feature representation learning. Bioinformatics 2019, 35, 4272–4280.
11. Zhang, Y.P.; Zou, Q. PPTPP: A novel therapeutic peptide prediction method using physicochemical property encoding and adaptive feature representation learning. Bioinformatics 2020, 36, 3982–3987.
12. Cai, L.; Wang, L.; Fu, X.; Xia, C.; Zeng, X.; Zou, Q. ITP-Pred: An interpretable method for predicting, therapeutic peptides with fused features low-dimension representation. Brief. Bioinform. 2021, 22, bbaa367.
13. Takagi, T.; Sugeno, M. Fuzzy identification of systems and its applications to modeling and control. IEEE Trans. Syst. Man Cybern. 1985, 1, 116–132.
14. Mamdani, E.H. Application of fuzzy logic to approximate reasoning using linguistic synthesis. IEEE Trans. Comput. 1977, 26, 1182–1191.
15. Jiang, Y.; Deng, Z.; Chung, F.-L.; Wang, G.; Qian, P.; Choi, K.-S.; Wang, S. Recognition of Epileptic EEG Signals Using a Novel Multiview TSK Fuzzy System. IEEE Trans. Fuzzy Syst. 2017, 25, 3–20. Available online: http://0-ieeexplore-ieee-org.brum.beds.ac.uk/document/7778175/ (accessed on 10 March 2022).
16. Gu, X.; Chung, F.-L.; Wang, S. Bayesian Takagi–Sugeno–Kang fuzzy classifier. IEEE Trans. Fuzzy Syst. 2016, 25, 1655–1671.
17. Jiang, Y.; Zhang, Y.; Lin, C.; Wu, D.; Lin, C.-T. EEG-based driver drowsiness estimation using an online multi-view and transfer TSK fuzzy system. IEEE Trans. Intell. Transp. Syst. 2020, 22, 1752–1764.
18. Blum, A.; Mitchell, T. Combining labeled and unlabeled data with co-training. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, Madison, WI, USA, 24–26 July 1998; pp. 92–100.
19. Cao, X.; Zhang, C.; Fu, H.; Si, L.; Hua, Z. Diversity-induced Multi-view Subspace Clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 586–594. Available online: http://0-ieeexplore-ieee-org.brum.beds.ac.uk/document/7298657/ (accessed on 10 March 2022).
20. Chou, K.-C. Prediction of protein cellular attributes using pseudo-amino acid composition. Proteins Struct. Funct. Genet. 2001, 43, 246–255. Available online: https://0-onlinelibrary-wiley-com.brum.beds.ac.uk/doi/210.1002/prot.1035 (accessed on 10 March 2022).
21. Lv, Z.; Cui, F.; Zou, Q.; Zhang, L.; Xu, L. Anticancer peptides prediction with deep representation learning features. Brief. Bioinform. 2021, 22, bbab008.
22. Hall, M.A. Correlation-based Feature Subset Selection for Machine Learning. Ph.D. Thesis, The University of Waikato, Hamilton, New Zealand, 1998. Available online: https://ci.nii.ac.jp/naid/10018668219 (accessed on 10 March 2022).
23. Zhaohong, D.; Kup-Sze, C.; Yizhang, J.; Shitong, W. Generalized Hidden-Mapping Ridge Regression, Knowledge-Leveraged Inductive Transfer Learning for Neural Networks, Fuzzy Systems and Kernel Methods. IEEE Trans. Cybern. 2014, 44, 2585–2599. Available online: http://0-ieeexplore-ieee-org.brum.beds.ac.uk/document/6780983/ (accessed on 10 March 2022).

Figure 1. Workflow of the proposed method.

Figure 2. Receiver operating characteristic curves of different classifiers after feature selection over five-fold cross validation of the CPP740 dataset.

Table 1. Performance of different features on the training set in five-fold cross validation.

Feature	ACC (%)	SN (%)	SP (%)	MCC
Pse-AAC	91.1	90.5	91.6	0.822
SSA	88.4	86.2	90.8	0.768
Pse-AAC+SSA	90.1	89.2	90.8	0.802

Table 2. Five-fold cross validation of the CPP740 dataset.

Fold Set	ACC (%)	SN (%)	SP (%)	MCC
1	90.5	88.2	93.1	0.812
2	90.5	90.8	90.3	0.811
3	92.6	88.2	96.3	0.852
4	94.0	93.4	94.4	0.878
5	93.2	93.2	93.2	0.865
Average	92.2	90.8	93.5	0.844

Table 3. The performance of different classifiers for the CPP740 dataset after feature selection (five-fold cross validation).

Method	ACC (%)	SN (%)	SP (%)	MCC
NB	90.1	85.4	94.9	0.806
XGBoost	90.3	90.0	90.5	0.805
RF	90.4	88.6	92.2	0.809
MV-TSK-FS-HSIC	92.2	90.8	93.5	0.844

Table 4. Comparison of existing methods using the CPP740 dataset and five-fold cross validation.

Method	ACC (%)	SN (%)	SP (%)	MCC	AUC
PPTPP	74.9	71.6	78.1	0.498	0.824
ITP-PRED	89.0	86.3	93.2	0.787	0.962
PEPred	91.2	90.3	92.2	0.824	0.972
MV-TSK-FS-HSIC	92.2	90.8	93.5	0.844	0.975

Table 5. Comparison of existing methods with an independent test set and five-fold cross validation.

Method	ACC (%)	SN (%)	SP (%)	MCC	AUC
PEPred	-	-	-	-	0.952
PPTPP	-	-	-	-	0.967
ITP-PRED	95.1	92.8	97.8	0.904	0.989
MV-TSK-FS-HSIC	96.2	96.7	95.7	0.924	0.990

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
