Article

FKRR-MVSF: A Fuzzy Kernel Ridge Regression Model for Identifying DNA-Binding Proteins by Multi-View Sequence Features via Chou’s Five-Step Rule

1 School of Internet of Things Engineering, Jiangnan University, Wuxi 214122, China
2 Engineering Research Center of Internet of Things Applied Technology, Ministry of Education, Wuxi 214122, China
3 School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou 215009, China
4 Department of Computer Science and Engineering, University of South Carolina, Columbia, SC 29208, USA
5 School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin 300350, China
* Authors to whom correspondence should be addressed.
Int. J. Mol. Sci. 2019, 20(17), 4175; https://0-doi-org.brum.beds.ac.uk/10.3390/ijms20174175
Submission received: 30 July 2019 / Revised: 10 August 2019 / Accepted: 19 August 2019 / Published: 26 August 2019
(This article belongs to the Special Issue Special Protein or RNA Molecules Computational Identification 2019)

Abstract

DNA-binding proteins play an important role in cell metabolism. In biological laboratories, methods for detecting DNA-binding proteins include yeast one-hybrid methods, bacterial one-hybrid methods and X-ray crystallography, among others, but these methods require considerable labor, material and time. In recent years, many computation-based approaches have been proposed to detect DNA-binding proteins. In this paper, a machine learning-based method, called the Fuzzy Kernel Ridge Regression model based on Multi-View Sequence Features (FKRR-MVSF), is proposed to identify DNA-binding proteins. First, multi-view sequence features are extracted from protein sequences. Next, a Multiple Kernel Learning (MKL) algorithm is employed to combine the multiple features. Finally, a Fuzzy Kernel Ridge Regression (FKRR) model is built to detect DNA-binding proteins. Compared with other methods, our model achieves good results: it obtains accuracies of 83.26% and 81.72% on two benchmark datasets (PDB1075 and PDB186), respectively.

1. Introduction

The interaction between DNA and protein exists in various tissues of the living body. For example, DNA–protein interactions occur during many activities such as DNA replication, DNA repair, DNA packaging, DNA modification, and viral infection. The study of DNA-binding residues in DNA–protein interactions facilitates a comprehensive understanding of the mechanisms of chromatin recombination and gene-regulated expression. DNA-binding proteins are mainly detected by biochemical and physicochemical methods. However, wet-lab methods are both time-consuming and costly.
Information on the 3D structures of proteins or their complexes is important for drug design, but X-ray crystallography is expensive and time-consuming [1,2,3]. Many kinds of sequence-based information, such as PTM (posttranslational modification) sites in proteins [4,5,6,7,8,9], DNA-methylation sites [10], protein–drug interactions in cellular networking [11], protein–protein interactions [12] and recombination spots [13], have been predicted with sequence-based tools such as the Pseudo Amino Acid Composition (PseAAC) [14] and Pseudo K-tuple Nucleotide Composition (PseKNC) [15] approaches. Bioinformatics has played an important role in the development of novel drugs.
Computational methods based on Machine Learning (ML) have been developed to predict DNA-binding proteins. Currently, ML technology plays a key role in many biological fields, including the prediction of DNA methylcytosine sites [16,17], O-GlcNAcylation sites [18], potential disease-associated microRNAs [19,20], protein remote homology [21], protein subcellular localization [22] and electron transport proteins [23], as well as microbiology analysis [24], among others. The computational methods can be classified into two types: sequence-based models and structure-based models.
The sequence-based methods extract features from protein sequences and employ ML to build predictive models. PseAAC and a Support Vector Machine (SVM) [25] were used to construct a model for identifying DNA-binding proteins [26]. Kumar et al. [27] used the Position Specific Scoring Matrix (PSSM) of protein sequences to develop an SVM classifier called DNAbinder. The PSSM describes the evolutionary profile of a protein sequence and can be calculated for a target protein by PSI-BLAST [28]. Liu et al. [29] proposed the iDNAPro-PseAAC model, which employed PseAAC and PSSM features. Wei et al. [30] used local PSSM features to represent local information of proteins. Sequence-based approaches can be applied to large-scale prediction.
Structure-based models employ structural features to predict DNA-binding proteins. Compared with sequence-based methods, structure-based models achieve better performance. The main reason is that the 3D structure of a protein determines its shape and surface area. Nimrod et al. [31] used the average surface electrostatic potentials of the protein to build a Random Forest (RF) model to predict DNA-binding proteins. Because far fewer structures than sequences are known, structure-based models cannot be applied to all proteins.
In recent publications [32,33,34,35] and two review papers [36,37], researchers developed useful predictors for bioinformatics. Many of these methods follow a guideline called Chou’s five-step rule, which contains five steps: (1) a benchmark dataset is constructed to train and test the predictive models; (2) the samples are represented by a formulation that truly reflects their correlation with the target; (3) a powerful algorithm is used to solve the prediction problem; (4) cross-validation tests are performed to evaluate the performance of the method; (5) a web-server is built for the predictive model. This rule is clear in logic and transparent in operation. It makes reported results easy for other researchers to reproduce and is convenient for experimental scientists. Our method is also based on Chou’s five-step rule.
To avoid losing the sequence–pattern information of proteins, the PseAAC [14,36,38] was proposed by Chou. Chou’s general PseAAC [36] has been widely used to extract features from the sequence and PSSM of a protein. In addition, a useful web-server called “Pse-in-One2.0” [39,40] has been established. The server can extract feature vectors for DNA/RNA and protein/peptide sequences. We also employ Pse-in-One2.0 to extract features from protein sequences.
In this study, we propose a novel model, a Fuzzy Kernel Ridge Regression model based on Multi-View Sequence Features (FKRR-MVSF), to predict DNA-binding proteins. Multiple sequence features are extracted, and each is used to construct a kernel. Next, a Multiple Kernel Learning (MKL) algorithm linearly weights these kernels. A fuzzy membership score is then calculated for each training sample from the integrated kernel. Finally, Fuzzy Kernel Ridge Regression (FKRR) is trained to predict DNA-binding proteins.

2. Results

To evaluate our proposed method (FKRR-MVSF), two benchmark datasets of DNA-binding proteins are employed in our study. First of all, we analyze the performance of different features. Then, our model is compared with other methods via a Jackknife test. Finally, an independent test set is used to test the robustness of FKRR-MVSF.

2.1. Data Sets

In our study, two benchmark datasets (PDB1075 and PDB186) are used to test our predictive model of DNA-binding proteins. PDB1075 and PDB186 were collected from the Protein Data Bank (PDB) [41]. Liu et al. [26] randomly extracted non-DNA-binding and DNA-binding proteins from the PDB database. The similarity of any two sequences does not exceed 25%. A total of 525 DNA-binding proteins and 550 non-DNA-binding proteins form the PDB1075 dataset. The PDB186 dataset [42] contains 93 DNA-binding and 93 non-DNA-binding proteins. Table 1 lists the information of the two benchmark datasets.

2.2. Measurements

Accuracy (ACC), Sensitivity (SN), Specificity (Spec) and Matthews Correlation Coefficient (MCC) are used to evaluate the performance of the predictive model. These coefficients are calculated as follows:
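$$ACC = 1 - \frac{N^{+}_{-} + N^{-}_{+}}{N^{+} + N^{-}}$$
$$SN = 1 - \frac{N^{+}_{-}}{N^{+}}$$
$$Spec = 1 - \frac{N^{-}_{+}}{N^{-}}$$
$$MCC = \frac{1 - \left(\frac{N^{+}_{-}}{N^{+}} + \frac{N^{-}_{+}}{N^{-}}\right)}{\sqrt{\left(1 + \frac{N^{-}_{+} - N^{+}_{-}}{N^{+}}\right)\left(1 + \frac{N^{+}_{-} - N^{-}_{+}}{N^{-}}\right)}}$$
where $N^{+}$ and $N^{-}$ are the total numbers of positive and negative samples, respectively, and $N^{-}_{+}$ and $N^{+}_{-}$ are the numbers of false positives and false negatives, respectively. The Area Under the ROC Curve (AUC) is also an effective evaluation measure for binary classification.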
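As a minimal sketch of these four measures, the Python function below computes them from the two class sizes and the two error counts. The function name and argument names are illustrative assumptions, not part of the authors' released code; the guard for a zero denominator is also an added assumption.

```python
import math

def binary_metrics(n_pos, n_neg, false_neg, false_pos):
    """Return (ACC, SN, Spec, MCC) from class sizes and error counts.

    n_pos, n_neg : total numbers of positive and negative samples.
    false_neg    : positives predicted as negative.
    false_pos    : negatives predicted as positive.
    """
    acc = 1 - (false_neg + false_pos) / (n_pos + n_neg)
    sn = 1 - false_neg / n_pos
    spec = 1 - false_pos / n_neg
    numerator = 1 - (false_neg / n_pos + false_pos / n_neg)
    denominator = math.sqrt(
        (1 + (false_pos - false_neg) / n_pos)
        * (1 + (false_neg - false_pos) / n_neg)
    )
    # Assumption: report MCC = 0 when the denominator vanishes.
    mcc = numerator / denominator if denominator > 0 else 0.0
    return acc, sn, spec, mcc
```

Written this way, the MCC agrees with the more familiar form based on TP, TN, FP and FN counts, which gives a quick consistency check on any implementation.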

2.3. Performance Analysis of Different Features on the PDB1075 Data Set

A single feature type cannot fully describe the properties of a protein, so we build the predictive model with multi-view sequence features. We evaluate these features (kernels) on the PDB1075 dataset with the Jackknife test, as shown in Table 2. Used alone, the PSSM-based features (PSSM-AB and PsePSSM) achieve better performance than the non-PSSM features (MCD and NMBAC): the MCC values of MCD, NMBAC, PSSM-AB and PsePSSM are 0.4139, 0.4564, 0.5113 and 0.5886, respectively. Combining the four kernels with equal (mean) weights in KRR gives better performance (MCC: 0.6398) than any single feature, and replacing the mean weights with MKL weights raises the MCC of KRR further (0.6439). FKRR weights the training samples by fuzzy membership, which can reduce the influence of outliers. Accordingly, FKRR with mean weights (MCC: 0.6554) and FKRR with MKL (MCC: 0.6664) are both better than their KRR counterparts, and MKL (FKRR) achieves the best MCC of 0.6664.
In addition, we test the SVM model with different features on the PDB1075 dataset. In Table 2, the MCC of SVM with MKL (0.6568) is higher than that of KRR with MKL (0.6439) but slightly lower than that of FKRR with MKL (0.6664). A possible reason is the use of fuzzy membership when building the predictor. The ROC curves in Figure 1 also reflect the strong performance of MKL (FKRR). Our method (FKRR-MVSF) therefore employs MKL and FKRR to build the final predictor for DNA-binding proteins.
Figure 2 shows the weight of each feature. PsePSSM receives the highest weight, which follows the same trend as the single-feature performance. The MKL algorithm estimates the feature weights so as to reduce the bias of individual features.
We test our method and other existing methods on the PDB1075 dataset. Table 3 lists the comparison. PseDNA-Pro [26], iDNA-Prot|dis [29], iDNA-Prot [43], DNAbinder [27], DNA-Prot [44], iDNAPro-PseAAC [45], Local-DPP [30], Adilina’s work [46] and Kmer1+ACC [47] are the benchmark methods. Among them, iDNA-Prot|dis (MCC: 0.54), PseDNA-Pro (MCC: 0.53), iDNAPro-PseAAC (MCC: 0.53) and Local-DPP (MCC: 0.59) obtain comparatively good performance. Our proposed model (FKRR-MVSF) obtains the best MCC (0.67) on the PDB1075 dataset.

2.4. Performance on an Independent DataSet of PDB186

To evaluate the generalization performance of the predictive models, FKRR-MVSF and other methods are also tested on the independent dataset (the training set is PDB1075). The results are shown in Table 4.
Our method (FKRR-MVSF) achieves an ACC of 81.7%, an MCC of 0.676 and an SN of 98.9%. In terms of MCC, FKRR-MVSF is better than Local-DPP (MCC: 0.625), DBPPred (MCC: 0.538), MSFBinder [48] (MCC: 0.640), Adilina’s work (MCC: 0.670) and iDNAPro-PseAAC (MCC: 0.442).

3. Discussion

To improve the prediction of DNA-binding proteins, we employ an MKL algorithm to integrate different features and a fuzzy-based model to handle outliers. Machine learning offers many ways to avoid overfitting and skewed models caused by outliers, e.g., adjusting the cost value in SVM. Because different samples contribute differently to the model, the cost parameter should differ between training samples. In Table 2, the fuzzy-based model (FKRR with MKL, MCC: 0.6664) performs better than the non-fuzzy model (KRR with MKL, MCC: 0.6439).
Compared to the other single kernels, the PsePSSM-based kernel receives the highest weight and achieves the highest MCC (0.5886). MKL integrates multiple kinds of sequence information. KRR with MKL achieves a better MCC (0.6439) than any single-kernel model on the PDB1075 dataset, and it also performs better than KRR with mean weights (MCC: 0.6398) on the same dataset.
On the independent test dataset, our method (FKRR with MKL) also achieves a better MCC (0.676) than the compared methods. MSFBinder (SVM) [48] is a two-layer SVM model that also employs several features to build a predictive model. The generalization performance of FKRR (with MKL) is better than that of MSFBinder (MCC: 0.640) on the independent test set (PDB186). The two models are similar; a likely reason for the difference is that FKRR effectively uses a different value of the parameter C for each training sample. Fuzzy membership may reduce the effect of some noisy samples on the model.

4. Materials and Methods

The prediction of DNA-binding proteins can be regarded as a binary classification task. Each protein is represented by feature vectors. DNA-binding proteins and non-DNA-binding proteins are labeled as +1 (positive samples) and −1 (negative samples), respectively. We construct a Fuzzy Kernel Ridge Regression model based on Multi-View Sequence Features (FKRR-MVSF) to determine whether a protein binds to DNA. We employ Normalized Moreau–Broto Auto Correlation (NMBAC) [49,50], PSSM-based Average Blocks (PSSM-AB) [51], Multiple-scale Continuous and Discontinuous descriptor (MCD) [52] and PsePSSM algorithms to extract four types of sequence features, two of which (PSSM-AB and PsePSSM) are PSSM-based. A Radial Basis Function (RBF) is used to build one kernel from each of the four kinds of features. The MKL algorithm is then employed to calculate the weights of the kernels and to combine the four kernels. Next, a membership score is estimated for each training sample. Finally, a fuzzy kernel ridge regression model for identifying DNA-binding proteins is constructed from the membership scores and the combined kernel. The framework of the proposed method is shown in Figure 3. In the literature [13,33], researchers have made good use of flowcharts to describe the main framework of their methods. In our work, Figure 4 describes the flow of our model: four types of feature are extracted from a sequence, an RBF kernel is built for each, the kernels are combined by MKL, and the combined kernel and training labels are used to construct the FKRR model and predict new samples.

4.1. Feature Extraction

Extracting features from proteins is a challenge for identifying DNA-binding proteins. A suitable feature extraction algorithm can adequately represent the properties of the protein. We use four types of feature to describe a protein.

4.1.1. MCD Feature

You et al. clustered the 20 amino acids into seven groups according to the dipoles and volumes of their side chains. These groups, numbered 1 to 7, are {A, G, V}, {C}, {D, E}, {F, I, L, P}, {H, N, Q, W}, {K, R} and {M, S, T, Y}. With this numbering, the protein sequence “AVDCALSK” is described as “11321476” via the Multi-scale Continuous and Discontinuous descriptor (MCD) [52]. The encoded sequence is then split into 10 local regions, which describe multiple overlapping continuous and discontinuous interaction patterns. Composition (C), Transition (T) and Distribution (D) are calculated in each local region. For a detailed description of the MCD algorithm, see You’s work [52]. The MCD feature is an 882-dimensional vector.
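The group encoding step can be sketched in a few lines of Python. This is only an illustration of the first stage of MCD, not the full 882-dimensional descriptor. It assumes the seven groups are numbered in alphabetical order of their first member ({A, G, V} = 1, {C} = 2, {D, E} = 3, {F, I, L, P} = 4, ...), which is the numbering under which “AVDCALSK” maps to “11321476”. The names `GROUPS`, `encode_groups` and `composition` are hypothetical.

```python
# Assumed group numbering: alphabetical by first member of each group.
GROUPS = ["AGV", "C", "DE", "FILP", "HNQW", "KR", "MSTY"]
AA_TO_GROUP = {aa: str(idx) for idx, group in enumerate(GROUPS, start=1) for aa in group}

def encode_groups(sequence):
    """Replace each amino acid by the digit (1-7) of its group."""
    return "".join(AA_TO_GROUP[aa] for aa in sequence.upper())

def composition(sequence):
    """Fraction of residues falling in each of the seven groups (the 'C' term)."""
    encoded = encode_groups(sequence)
    return [encoded.count(str(g)) / len(encoded) for g in range(1, 8)]

print(encode_groups("AVDCALSK"))  # prints 11321476
```

In the full descriptor, composition, transition and distribution would be computed on each of the 10 local regions rather than on the whole sequence as done here.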

4.1.2. NMBAC Feature

Normalized Moreau–Broto Auto Correlation (NMBAC) [49,50] was proposed for extracting sequence features of membrane proteins. A protein sequence (string) can be represented as a discrete numerical sequence via six physicochemical properties of Amino Acids (AA): Hydrophobicity (H), Net Charge Index of Side Chains (NCISC), Solvent-Accessible Surface Area (SASA), Volumes of Side Chains of amino acids (VSC), Polarity (P1) and Polarizability (P2). The six physicochemical properties of the amino acids are listed in Table 5. For a protein X of length n, the NMBAC feature is calculated by the following equation:
$$NMBAC(lag, j) = \frac{1}{n - lag} \sum_{i=1}^{n - lag} \left( X_{i,j} \times X_{i+lag,j} \right)$$
where $i$ denotes the position in the sequence, $i = 1, 2, \ldots, n - lag$; $n$ is the sequence length; $j$ is the type of physicochemical property, $j = 1, 2, \ldots, 6$; $lag \in [1, lg]$ is the gap between amino acids; and $lg$ is the maximum-distance parameter.
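A minimal NumPy sketch of this autocorrelation is given below. It assumes the protein has already been converted into an n × p array of normalized property values (p = 6 in this paper); the function name `nmbac` and the argument `max_lag` (playing the role of lg) are illustrative.

```python
import numpy as np

def nmbac(X, max_lag):
    """Normalized Moreau-Broto autocorrelation features.

    X       : (n, p) array, X[i, j] is property j at sequence position i.
    max_lag : largest gap lg between residues; must be smaller than n.
    Returns a vector of length max_lag * p, ordered by lag, then property.
    """
    X = np.asarray(X, dtype=float)
    n, _ = X.shape
    if max_lag >= n:
        raise ValueError("max_lag must be smaller than the sequence length")
    features = []
    for lag in range(1, max_lag + 1):
        # Sum over i of X[i, j] * X[i + lag, j], divided by (n - lag).
        products = X[: n - lag] * X[lag:]
        features.append(products.sum(axis=0) / (n - lag))
    return np.concatenate(features)
```

With six properties, the feature length is 6 × lg, so the choice of lg directly controls the dimensionality of this view.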

4.1.3. PSSM-AB Feature

The Position Specific Scoring Matrix (PSSM) contains evolutionary information of a protein sequence. The PSSM of a protein sequence is generated by PSI-BLAST [28]. The PSSM is an $L \times 20$ matrix ($L$ rows and 20 columns):
$$PSSM = \begin{bmatrix} P_{1,1} & P_{1,2} & \cdots & P_{1,20} \\ P_{2,1} & P_{2,2} & \cdots & P_{2,20} \\ \vdots & \vdots & \ddots & \vdots \\ P_{L,1} & P_{L,2} & \cdots & P_{L,20} \end{bmatrix}_{L \times 20}$$
PSSM-AB extracts local average values of the PSSM:
$$PSSM\_AB(k) = \frac{20}{L} \sum_{z=1}^{L/20} PSSM\left(z + (i - 1) \times \frac{L}{20},\; j\right)$$
where $k$ is a linear index used to scan the cells of the PSSM, with $i, j = 1, 2, \ldots, 20$ and $k = j + 20 \times (i - 1)$. The PSSM-AB algorithm can extract information on the relationship between a target residue and its neighboring residues.
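The block averaging can be sketched as follows. The text does not say how sequences whose length is not a multiple of 20 are handled, so this sketch makes an explicit assumption: each block has floor(L/20) rows and trailing rows are ignored. The function name `pssm_ab` is hypothetical.

```python
import numpy as np

def pssm_ab(pssm, n_blocks=20):
    """Average-block features of an (L, 20) PSSM.

    Assumption: blocks have floor(L / n_blocks) rows; leftover rows at the
    end of the sequence are dropped. Output index is (i - 1) * 20 + (j - 1)
    for block i and column j, matching k = j + 20 * (i - 1).
    """
    pssm = np.asarray(pssm, dtype=float)
    length, n_cols = pssm.shape
    block_size = length // n_blocks
    if block_size == 0:
        raise ValueError("sequence is shorter than the number of blocks")
    used = pssm[: block_size * n_blocks]
    blocks = used.reshape(n_blocks, block_size, n_cols)
    return blocks.mean(axis=1).ravel()
```

For a standard 20-column PSSM this yields a 400-dimensional vector regardless of sequence length, which is what makes the feature usable as a fixed-size kernel input.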

4.1.4. PsePSSM Feature

PsePSSM [53] is an effective feature based on the PSSM. The $L \times 20$ PSSM is first standardized row by row as follows:
$$PSSM(i, j) = \frac{PSSM(i, j) - mean(PSSM(i, *))}{STD(PSSM(i, *))}, \quad i = 1, 2, \ldots, L;\; j = 1, 2, \ldots, 20$$
where $*$ denotes all elements of the $i$-th row, $mean(PSSM(i, *))$ is the mean of the elements in the $i$-th row and $STD(PSSM(i, *))$ is their standard deviation. Then, we obtain the PsePSSM feature as follows:
$$Pse(k) = \begin{cases} \dfrac{1}{L} \sum\limits_{i=1}^{L} PSSM(i, j), & j = 1, \ldots, 20;\; k = j \\[2ex] \dfrac{1}{L - lag} \sum\limits_{i=1}^{L - lag} \left[ PSSM(i, j) - PSSM(i + lag, j) \right]^{2}, & j = 1, \ldots, 20;\; lag = 1, \ldots, 15;\; k = 20 + j + 20 \times (lag - 1) \end{cases}$$
where $k$ is the index of the feature vector and $lag$ denotes the distance between one residue and its neighbors.
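A minimal NumPy sketch of the two steps (row standardization, then column means and lagged squared differences) is shown below. With 20 columns and lag = 1, ..., 15, the index range above implies a vector of 20 + 20 × 15 = 320 entries. The function name `pse_pssm` and the guard for constant rows are assumptions added for illustration.

```python
import numpy as np

def pse_pssm(pssm, max_lag=15):
    """PsePSSM features of an (L, 20) PSSM; requires L > max_lag."""
    P = np.asarray(pssm, dtype=float)
    length = P.shape[0]
    if length <= max_lag:
        raise ValueError("sequence must be longer than max_lag")
    # Row-wise standardization.
    mean = P.mean(axis=1, keepdims=True)
    std = P.std(axis=1, keepdims=True)
    std[std == 0] = 1.0  # assumption: leave constant rows at zero
    P = (P - mean) / std
    # First 20 entries: column means of the standardized matrix.
    features = [P.mean(axis=0)]
    # Next 20 * max_lag entries: mean squared difference at each lag.
    for lag in range(1, max_lag + 1):
        diff = P[: length - lag] - P[lag:]
        features.append((diff ** 2).sum(axis=0) / (length - lag))
    return np.concatenate(features)
```

Because the largest lag is 15, this sketch rejects sequences of 15 residues or fewer; how such short sequences are treated in the original pipeline is not stated.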

4.2. Multiple Kernel Learning

RBF is employed to construct four types of kernels from the above features (MCD, NMBAC, PSSM-AB and PsePSSM):
$$K_{ij} = K(x_i, x_j) = \exp\left( -\gamma \lVert x_i - x_j \rVert^{2} \right), \quad i, j = 1, 2, \ldots, N$$
where $\gamma$ is the Gaussian kernel bandwidth, $N$ is the number of samples, and $x_i$ and $x_j$ are the feature vectors of samples $i$ and $j$. The four types of feature can be represented as a kernel set: $\{K_{MCD}, K_{NMBAC}, K_{PSSM\text{-}AB}, K_{PsePSSM}\}$.
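The kernel construction can be sketched with NumPy as below. The function name `rbf_kernel` is illustrative; passing the training features for both arguments gives the N × N training kernel, while passing test features first gives the M × N test kernel used later for prediction.

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    """K[i, j] = exp(-gamma * ||A[i] - B[j]||^2) for row-vector samples."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    # Squared Euclidean distances via ||a||^2 + ||b||^2 - 2 a.b
    sq_dist = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    np.maximum(sq_dist, 0.0, out=sq_dist)  # clip tiny negatives from rounding
    return np.exp(-gamma * sq_dist)
```

One kernel would be built per feature view, each with its own γ, before the views are combined.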
The MKL algorithm combines multi-view features from different sources. Some kernels may introduce bias in the learning process; MKL can reduce this bias by assigning such kernels low weights. The optimal kernel $K^{*}_{train}$ is obtained as follows:
$$K^{*}_{train} = \sum_{h=1}^{H} \omega_h K_h, \quad K^{*}_{train}, K_h \in R^{N \times N}$$
where $H$ denotes the number of basic kernels.
The MKL algorithm [54] estimates the optimal kernel weights by minimizing the distance between the ideal kernel $K_{ideal}$ and the optimal kernel $K^{*}_{train}$. The ideal kernel $K_{ideal} = y_{train} y_{train}^{T} \in R^{N \times N}$ carries the information of the label space, where $y_{train} \in R^{N \times 1}$ contains the labels of the training set. We want the optimal kernel $K^{*}_{train}$ to be close to $K_{ideal}$:
$$\min_{\omega, K^{*}_{train}} \; \lVert K^{*}_{train} - K_{ideal} \rVert_F^{2} + \lambda \lVert \omega \rVert_F^{2}$$
$$\text{subject to} \quad K^{*}_{train} = \sum_{h=1}^{H} \omega_h K_h,$$
$$\omega_h \geq 0, \quad h = 1, 2, \ldots, H,$$
$$\sum_{h=1}^{H} \omega_h = 1$$
where $\lVert X \rVert_F^{2} = Trace(X X^{T})$, $\lambda$ is a regularization parameter and $\omega = [\omega_1, \omega_2, \ldots, \omega_H]^{T}$ is the vector of kernel weights.
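The optimization above is a small quadratic program over the probability simplex. The sketch below solves it with projected gradient descent, which is a swapped-in solver chosen for brevity and not necessarily the one used in [54]. The function names, the default value of `lam` and the iteration count are all assumptions.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    k = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - css / k > 0)[0][-1]
    theta = css[rho] / (rho + 1)
    return np.maximum(v - theta, 0.0)

def mkl_weights(kernels, y_train, lam=0.1, n_iter=2000):
    """Minimize ||sum_h w_h K_h - y y^T||_F^2 + lam * ||w||^2 on the simplex."""
    y = np.asarray(y_train, dtype=float).reshape(-1, 1)
    K_ideal = y @ y.T
    H = len(kernels)
    # Frobenius inner products: the objective is w^T A w - 2 b^T w + const.
    A = np.array([[np.sum(Ka * Kb) for Kb in kernels] for Ka in kernels])
    b = np.array([np.sum(Kh * K_ideal) for Kh in kernels])
    Q = A + lam * np.eye(H)
    step = 1.0 / (2.0 * np.linalg.eigvalsh(Q).max())  # 1 / Lipschitz constant
    w = np.full(H, 1.0 / H)  # start from mean weights
    for _ in range(n_iter):
        grad = 2.0 * (Q @ w - b)
        w = project_to_simplex(w - step * grad)
    return w

def combine_kernels(kernels, weights):
    """Weighted sum of base kernels with the given weights."""
    return sum(w * K for w, K in zip(weights, kernels))
```

Starting from uniform weights means the first iterate corresponds to the "mean weights" baseline reported in Table 2, and later iterates move weight toward kernels that align better with the label kernel.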

4.3. Fuzzy Kernel Ridge Regression

Kernel ridge regression is a method from statistics that implements a form of Regularized Least Squares (RLS). We are given training samples $\{x_i, y_i\}$, $i = 1, 2, \ldots, N$, where $N$ is the number of samples, $x_i$ is a feature vector and $y_i$ is its label. The RLS aims to find the minimum of the following function:
$$J = \frac{C}{2} \lVert K_{train} \alpha - y_{train} \rVert^{2} + \frac{1}{2} \lVert f \rVert_K^{2}$$
where $K_{train} \in R^{N \times N}$ is the training kernel and $C$ is the non-negative regularization parameter. The solution of KRR is:
$$\alpha = \left( K_{train} + \frac{1}{C} I \right)^{-1} y_{train}$$
In this paper, we present a Fuzzy Kernel Ridge Regression (FKRR) for classification. We need to minimize the sum of errors ($\lVert K_{train} \alpha - y_{train} \rVert^{2}$). The contribution of sample $x_i$ to the decision boundary should be proportional to its fuzzy membership value. The objective function is as follows:
$$J = \frac{C}{2} \lVert D (K_{train} \alpha - y_{train}) \rVert^{2} + \frac{1}{2} \lVert f \rVert_K^{2}$$
where $D \in R^{N \times N}$ is a diagonal matrix whose element $D_{ii}$ ($0 \leq D_{ii} \leq 1$) represents the fuzzy membership value of sample $x_i$.
We set $\partial J / \partial \alpha = 0$ and obtain the solution for $\alpha$ as follows:
$$\partial \left( \frac{C}{2} \lVert D (K_{train} \alpha - y_{train}) \rVert^{2} + \frac{1}{2} \lVert f \rVert_K^{2} \right) / \partial \alpha = 0$$
$$\partial \left( \frac{C}{2} \lVert D (K_{train} \alpha - y_{train}) \rVert^{2} + \frac{1}{2} \alpha^{T} K_{train} \alpha \right) / \partial \alpha = 0$$
$$C K_{train}^{T} D^{T} (D K_{train} \alpha - D y_{train}) + K_{train} \alpha = 0$$
$$C D^{2} (K_{train} \alpha - y_{train}) + \alpha = 0$$
$$\alpha = \left( K_{train} + \frac{1}{C} D^{-2} I \right)^{-1} y_{train}$$
where $I \in R^{N \times N}$. So, the decision function is as follows:
$$y_{test} = sign[K_{test} \alpha]$$
$$= sign\left[ K_{test} \left( K_{train} + \frac{1}{C} D^{-2} I \right)^{-1} y_{train} \right]$$
where $y_{test} \in R^{M \times 1}$ contains the predicted labels, $K_{test} \in R^{M \times N}$ denotes the kernel of the testing samples and $M$ is the number of testing samples.
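The closed-form solution can be sketched in NumPy as below. The sketch follows the derivation from the stationarity condition C D²(Kα − y) + α = 0, which requires strictly positive membership values so that D can be inverted; that restriction, and the function names `fkrr_fit` and `fkrr_predict`, are assumptions of this illustration. Setting every membership value to 1 recovers ordinary KRR.

```python
import numpy as np

def fkrr_fit(K_train, y_train, membership, C=1.0):
    """Solve (K_train + (1/C) * D^{-2}) alpha = y_train.

    membership : length-N vector of fuzzy membership values D_ii in (0, 1].
    """
    K_train = np.asarray(K_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    d = np.asarray(membership, dtype=float)
    if np.any(d <= 0):
        raise ValueError("membership values must be strictly positive")
    # A low membership value inflates the ridge term for that sample,
    # so the sample is fitted less tightly.
    A = K_train + np.diag(1.0 / (C * d ** 2))
    return np.linalg.solve(A, y_train)

def fkrr_predict(K_test, alpha):
    """Predicted labels sign(K_test @ alpha); an exact zero maps to 0."""
    return np.sign(np.asarray(K_test, dtype=float) @ alpha)
```

Using `np.linalg.solve` instead of forming the inverse explicitly is a numerical choice; it gives the same α as the formula above.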
To compute the fuzzy membership values of the training samples, we apply the following function to the optimal training kernel $K^{*}_{train}$:
$$score_t = \frac{1}{N^{2}} \left( \sum_{y_t = y_i} K^{*}_{train}(x_t, x_i) - \sum_{y_t \neq y_i} K^{*}_{train}(x_t, x_i) \right)$$
where $score_t$ denotes the score of training point $t$. A sample with a larger score is more similar to its own class than to the other class and may therefore contribute more to the model. We normalize the scores into fuzzy membership values (0–1) as follows:
$$D_{tt} = \frac{1}{1 + \exp(-score_t)}, \quad t = 1, 2, \ldots, N$$
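A minimal sketch of the membership computation is given below. It assumes the score sums kernel similarities to same-label samples (including the sample itself), subtracts similarities to different-label samples, divides by N², and is then mapped through a logistic function so that larger scores give larger membership values. The function name `fuzzy_membership` is hypothetical.

```python
import numpy as np

def fuzzy_membership(K_train, y_train):
    """Return the diagonal of D: one membership value per training sample."""
    K = np.asarray(K_train, dtype=float)
    y = np.asarray(y_train)
    n = len(y)
    same = y[:, None] == y[None, :]  # same[t, i] is True when y_t == y_i
    same_sum = np.where(same, K, 0.0).sum(axis=1)
    diff_sum = np.where(~same, K, 0.0).sum(axis=1)
    score = (same_sum - diff_sum) / n ** 2
    return 1.0 / (1.0 + np.exp(-score))
```

The returned vector can be passed directly as the diagonal of D when fitting the fuzzy model, so that samples lying closer to the opposite class receive smaller weights.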

5. Conclusions

On the independent dataset, FKRR-MVSF achieves better results than the compared methods (MCC: 0.676). Reducing the influence of noise points can improve the predictive performance of the model. In the future, we aim to use other fuzzy membership functions to build fuzzy models for filtering the noise points. As pointed out in work on PseAAC-based methods [13,33,39,40,55,56,57,58,59,60], a publicly accessible web-server is valuable, and we will establish one for our model. The related code and datasets can be downloaded from: https://figshare.com/s/e80f1a96b7b7bbf8062b.

Author Contributions

Y.Z., L.P. and Y.D. conceived the study. Y.Z. and Y.D. performed the experiments and analyzed the data. Y.Z., Y.D., J.T., F.G. and L.P. drafted the manuscript. All authors read and approved the manuscript.

Funding

This research was funded by State Key Research Project: Ferment Equipment Intelligent monitor and Early-warning Diagnosis System (grant number 2018YFD0400902), National Science Foundation of China (grant number 61873112) and Natural Science Research Project of Jiangsu Higher Education Institutions of China (grant number 19KJB520014).

Acknowledgments

The authors would like to thank all the guest editors and anonymous reviewers for their constructive advice.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Chou, K.C.; Tomasselli, A.G.; Heinrikson, R.L. Prediction of the Tertiary Structure of a Caspase-9/Inhibitor Complex. FEBS Lett. 2000, 470, 249–256.
  2. Chou, K.C.; Jones, D.; Heinrikson, R.L. Prediction of the tertiary structure and substrate binding site of caspase-8. FEBS Lett. 1997, 419, 49–54.
  3. Chou, K.C. Insights from modelling the 3D structure of the extracellular domain of α7 nicotinic acetylcholine receptor. Biochem. Biophys. Res. Commun. 2004, 319, 433–438.
  4. Xie, H.L.; Fu, L.; Nie, X. Using ensemble SVM to identify human GPCRs N-linked glycosylation sites based on the general form of Chou’s PseAAC. Protein Eng. Des. Sel. 2013, 26, 735–742.
  5. Xu, Y.; Ding, J.; Wu, L. iSNO-PseAAC: Predict cysteine S-nitrosylation sites in proteins by incorporating position specific amino acid propensity into pseudo amino acid composition. PLoS ONE 2013, 8, e55844.
  6. Chen, W.; Feng, P.; Ding, H.; Lin, H. iRNA-Methyl: Identifying N6-methyladenosine sites using pseudo nucleotide composition. Anal. Biochem. 2015, 490, 26–33.
  7. Chou, K.C. Impacts of bioinformatics to medicinal chemistry. Med. Chem. 2015, 11, 218–234.
  8. Jia, J.; Liu, Z.; Xiao, X.; Liu, B. pSuc-Lys: Predict lysine succinylation sites in proteins with PseAAC and ensemble random forest approach. J. Theor. Biol. 2016, 394, 223–230.
  9. Jia, J.; Liu, Z.; Xiao, X.; Liu, B. iCar-PseCp: Identify carbonylation sites in proteins by Monte Carlo sampling and incorporating sequence coupled effects into general PseAAC. Oncotarget 2016, 7, 34558–34570.
  10. Liu, Z.; Xiao, X.; Qiu, W.R. iDNA-Methyl: Identifying DNA methylation sites via pseudo trinucleotide composition. Anal. Biochem. 2015, 474, 69–77.
  11. Xiao, X.; Min, J.L.; Lin, W.Z.; Liu, Z.; Cheng, X. iDrug-Target: Predicting the interactions between drug compounds and target proteins in cellular networking via the benchmark dataset optimization approach. J. Biomol. Struct. Dyn. 2015, 33, 2221–2233.
  12. Jia, J.; Liu, Z.; Xiao, X. iPPI-Esml: An ensemble classifier for identifying the interactions of proteins by incorporating their physicochemical properties and wavelet transforms into PseAAC. J. Theor. Biol. 2015, 377, 47–56.
  13. Chen, W.; Feng, P.M.; Lin, H. iRSpot-PseDNC: Identify recombination spots with pseudo dinucleotide composition. Nucleic Acids Res. 2013, 41, e68.
  14. Chou, K.C. Prediction of protein cellular attributes using pseudo amino acid composition. PROTEINS Struct. Funct. Genet. 2001, 43, 246–255.
  15. Chen, W.; Lei, T.; Jin, D.; Lin, H. PseKNC: A flexible web-server for generating pseudo K-tuple nucleotide composition. Anal. Biochem. 2014, 456, 53–60.
  16. Wei, L.; Luan, S.; Nagai, L.; Su, R.; Zou, Q. Exploring sequence-based features for the improved prediction of DNA N4-methylcytosine sites in multiple species. Bioinformatics 2019, 35, 1326–1333.
  17. Zou, Q.; Xing, P.; Wei, L.; Liu, B. Gene2vec: Gene Subsequence Embedding for Prediction of Mammalian N6-Methyladenosine Sites from mRNA. RNA 2019, 25, 205–218.
  18. Jia, C.; Zuo, Y.; Zou, Q. O-GlcNAcPRED-II: An integrated classification algorithm for identifying O-GlcNAcylation sites based on fuzzy undersampling and a K-means PCA oversampling technique. Bioinformatics 2018, 34, 2029–2036.
  19. Zeng, X.; Liu, L.; Lu, L.; Zou, Q. Prediction of potential disease-associated microRNAs using structural perturbation method. Bioinformatics 2018, 34, 2425–2432.
  20. Xuan, P.; Han, K.; Guo, Y.; Li, J.; Li, X.; Zhong, Y.; Zhang, Z.; Ding, J. Prediction of potential disease-associated microRNAs by using neural network. Mol. Ther. -Nucleic Acids 2019, 16, 566–575.
  21. Liu, B.; Jiang, S.; Zou, Q. HITS-PR-HHblits: Protein remote homology detection by combining pagerank and hyperlink-induced topic search. Brief. Bioinform. 2019.
  22. Wei, L.; Ding, Y.; Su, L.; Tang, J.; Zou, Q. Prediction of human protein subcellular localization using deep learning. J. Parallel Distrib. Comput. 2018, 117, 212–217.
  23. Ru, X.; Li, L.; Zou, Q. Incorporating Distance-based Top-n-gram and Random Forest to Identify Electron Transport Proteins. J. Proteome Res. 2019, 18, 2931–2939.
  24. Qu, K.; Guo, F.; Liu, X.; Zou, Q. Application of Machine Learning in Microbiology. Front. Microbiol. 2019, 10, 827.
  25. Cortes, C.; Vapnik, V. Support-Vector Networks. Mach. Learn. 1995, 20, 273–297.
  26. Liu, B.; Xu, J.; Fan, S.; Xu, R.; Zhou, J.; Wang, X. PseDNA-Pro: DNA-Binding Protein Identification by Combining Chou’s PseAAC and Physicochemical Distance Transformation. Mol. Inform. 2015, 34, 8–17.
  27. Kumar, M.; Gromiha, M.M.; Raghava, G.P. Identification of DNA-binding proteins using support vector machines and evolutionary profiles. BMC Bioinform. 2007, 8, 463.
  28. Lipman, D.J.; Zhang, J.; Madden, T.L. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res. 1997, 25, 3389–3402.
  29. Liu, B.; Xu, J.; Lan, X.; Xu, R.; Zhou, J.; Wang, X.; Chou, K.C. iDNA-Prot|dis: Identifying DNA-Binding Proteins by Incorporating Amino Acid Distance-Pairs and Reduced Alphabet Profile into the General Pseudo Amino Acid Composition. PLoS ONE 2014, 9, e106691.
  30. Wei, L.; Tang, J.; Zou, Q. Local-DPP: An improved DNA-binding protein prediction method by exploring local evolutionary information. Inf. Sci. 2016, 384, 135–144.
  31. Nimrod, G.; Schushan, M.; Szilágyi, A.; Leslie, C.; Ben-Tal, N. iDBPs: A web server for the identification of DNA binding proteins. Bioinformatics 2010, 26, 692–693.
  32. Hussain, W.; Khan, S.D.; Rasool, N.; Khan, S.A. SPalmitoylC-PseAAC: A sequence-based model developed via Chou’s five-step rule and general PseAAC for identifying S-palmitoylation sites in proteins. Anal. Biochem. 2019, 568, 14–23.
  33. Chou, K.C. Progresses in predicting post-translational modification. Int. J. Pept. Res. Ther. 2019.
  34. Awais, M.; Hussain, W.; Khan, Y.D.; Rasool, N.; Khan, S.A. iPhosH-PseAAC: Identify phosphohistidine sites in proteins by blending statistical moments and position relative features according to the Chou’s 5-step rule and general pseudo amino acid composition. IEEE/ACM Trans. Comput. Biol. Bioinform. 2019.
  35. Ning, Q.; Ma, Z.; Zhao, X. dForml(KNN)-PseAAC: Detecting formylation sites from protein sequences using K-nearest neighbor algorithm via Chou’s 5-step rule and pseudo components. J. Theor. Biol. 2019, 470, 43–49.
  36. Chou, K.C. Some remarks on protein attribute prediction and pseudo amino acid composition (50th Anniversary Year Review, five-step rule). J. Theor. Biol. 2011, 273, 236–247.
  37. Chou, K.C. Advance in predicting subcellular localization of multi-label proteins and its implication for developing multi-target drugs. Curr. Med. Chem. 2019.
  38. Chou, K.C. Using amphiphilic pseudo amino acid composition to predict enzyme subfamily classes. Bioinformatics 2005, 21, 10–19.
  39. Liu, B.; Liu, F.; Wang, X.; Chen, J.; Fang, L. Pse-in-One: A web server for generating various modes of pseudo components of DNA, RNA, and protein sequences. Nucleic Acids Res. 2015, 43, W65–W71.
  40. Liu, B.; Wu, H. Pse-in-One 2.0: An improved package of web servers for generating various modes of pseudo components of DNA, RNA, and protein sequences. Nat. Sci. 2017, 9, 67–91.
  41. Rose, P.W.; Prlić, A.; Bi, C.; Bluhm, W.F.; Christie, C.H.; Dutta, S.; Green, R.K.; Goodsell, D.S.; Westbrook, J.D.; Woo, J.; et al. The RCSB Protein Data Bank: Views of structural biology for basic and applied research and education. Nucleic Acids Res. 2015, 43, 345–356.
  42. Lou, W.; Wang, X.; Chen, F.; Chen, Y.; Jiang, B.; Zhang, H. Sequence Based Prediction of DNA-Binding Proteins Based on Hybrid Feature Selection Using Random Forest and Gaussian Naïve Bayes. PLoS ONE 2014, 9, e86703.
  43. Lin, W.; Fang, J.; Xiao, X. iDNA-Prot: Identification of DNA Binding Proteins Using Random Forest with Grey Model. PLoS ONE 2011, 6, e24756.
  44. Kumar, K.K.; Pugalenthi, G.; Suganthan, P.N. DNA-Prot: Identification of DNA Binding Proteins from Protein Sequence Information using Random Forest. J. Biomol. Struct. Dyn. 2009, 26, 679–686.
  45. Liu, B.; Wang, S.; Wang, X. DNA binding protein identification by combining pseudo amino acid composition and profile-based protein representation. Sci. Rep. 2015, 5, 15479.
  46. Adilina, S.; Farid, D.; Shatabda, S. Effective DNA binding protein prediction by using key features via Chou’s general PseAAC. J. Theor. Biol. 2019, 460, 64–78.
  47. Xu, R.; Zhou, J.; Wang, H. Identifying DNA-binding proteins by combining support vector machine and PSSM distance transformation. BMC Syst. Biol. 2014, 9, e86703.
  48. Liu, X.; Gong, X.; Yu, H.; Xu, J. A Model Stacking Framework for Identifying DNA Binding Proteins by Orchestrating Multi-View Features and Classifiers. Genes 2018, 9, 394.
  49. Feng, Z.P.; Zhang, C.T. Prediction of membrane protein types based on the hydrophobic index of amino acids. J. Protein Chem. 2000, 19, 269–275.
  50. Ding, Y.J.; Tang, J.J.; Guo, F. Predicting protein-protein interactions via multivariate mutual information of protein sequences. BMC Bioinform. 2016, 17, 398–410. [Google Scholar] [CrossRef]
  51. Jeong, J.C.; Lin, X.; Chen, X.W. On position-specific scoring matrix for protein function prediction. IEEE/ACM Trans. Comput. Biol. Bioinform. 2011, 8, 308–315. [Google Scholar] [CrossRef]
  52. You, Z.H.; Zhu, L.; Zheng, C.H.; Yu, H.J.; Deng, S.P.; Ji, Z. Prediction of protein-protein interactions from amino acid sequences using a novel multi-scale continuous and discontinuous feature set. BMC Bioinform. 2014, 15, S9. [Google Scholar] [CrossRef]
  53. Chou, K.C.; Shen, H.B. MemType-2L: A Web server for predicting membrane proteins and their types by incorporating evolution information through Pse-PSSM. Biochem. Biophys. Res. Commun. 2007, 360, 339–345. [Google Scholar] [CrossRef]
  54. He, J.; Chang, S.F.; Xie, L. Fast Kernel learning for Spatial Pyramid Matching. In Proceedings of the 2008 IEEE Conference on Computer Vision and Pattern Recognition, Anchorage, AK, USA, 23–28 June 2008; pp. 1–7. [Google Scholar]
  55. Chou, K.C. Pseudo amino acid composition and its applications in bioinformatics, proteomics and system biology. Curr. Proteom. 2009, 6, 262–274. [Google Scholar] [CrossRef]
  56. Chen, W.; Lin, H. Pseudo nucleotide composition or PseKNC: An effective formulation for analyzing genomic sequences. Mol. Biosyst. 2015, 11, 2620–2634. [Google Scholar] [CrossRef]
  57. Liu, B.; Yang, F.; Huang, D.S. iPromoter-2L: A two-layer predictor for identifying promoters and their types by multi-window-based PseKNC. Bioinformatics 2018, 34, 33–40. [Google Scholar] [CrossRef]
  58. Chen, W.; Ding, H.; Zhou, X.; Lin, H. iRNA(m6A)-PseDNC: Identifying N6-methyladenosine sites using pseudo dinucleotide composition. Anal. Biochem. 2018, 561, 59–65. [Google Scholar] [CrossRef]
  59. Chen, W.; Feng, P.; Yang, H.; Ding, H.; Lin, H. iRNA-3typeA: Identifying 3-types of modification at RNA’s adenosine sites. Mol. Ther.-Nucleic Acid 2018, 11, 468–474. [Google Scholar] [CrossRef]
  60. Lin, H.; Deng, E.Z.; Ding, H.; Chen, W. iPro54-PseKNC: A sequence-based predictor for identifying sigma-54 promoters in prokaryote with pseudo k-tuple nucleotide composition. Nucleic Acids Res. 2014, 42, 12961–12972. [Google Scholar] [CrossRef]
Figure 1. The ROC curve of different kernels (features) on the PDB1075 dataset (Jackknife test).
Figure 2. The weights of different kernels (features).
Figure 3. The process of DNA-binding protein prediction.
Figure 4. The process of FKRR-MVSF.
Table 1. Details of the two benchmark datasets.

| Data Sets | PDB1075 | PDB186 |
|---|---|---|
| Positive | 525 | 93 |
| Negative | 550 | 93 |
| Total | 1075 | 186 |

Table 2. The performance of different features on the PDB1075 dataset (Jackknife test).

| Feature Type | Model | ACC | SN | Spec | MCC | AUC |
|---|---|---|---|---|---|---|
| MCD | KRR | 0.7070 | 0.7086 | 0.7088 | 0.4139 | 0.7751 |
| NMBAC | KRR | 0.7284 | 0.7181 | 0.7382 | 0.4564 | 0.7857 |
| PSSM-AB | KRR | 0.7553 | 0.7695 | 0.7418 | 0.5113 | 0.8352 |
| PsePSSM | KRR | 0.7944 | 0.7905 | 0.7982 | 0.5886 | 0.8637 |
| MW^a | KRR | 0.8195 | 0.8362 | 0.8036 | 0.6398 | 0.8998 |
| MKL | KRR | 0.8214 | 0.8438 | 0.8000 | 0.6439 | 0.9032 |
| MCD | SVM | 0.7088 | 0.7345 | 0.6819 | 0.4171 | 0.7611 |
| NMBAC | SVM | 0.7116 | 0.6909 | 0.7333 | 0.4244 | 0.7706 |
| PSSM-AB | SVM | 0.7693 | 0.6981 | 0.8438 | 0.5467 | 0.8391 |
| PsePSSM | SVM | 0.7851 | 0.7472 | 0.8247 | 0.5731 | 0.8566 |
| MW^a | SVM | 0.8201 | 0.8232 | 0.8170 | 0.6421 | 0.9011 |
| MKL | SVM | 0.8299 | 0.8541 | 0.8057 | 0.6568 | 0.9101 |
| MW^a | FKRR | 0.8270 | 0.8533 | 0.8018 | 0.6554 | 0.9094 |
| MKL | FKRR | 0.8326 | 0.8571 | 0.8091 | 0.6664 | 0.9115 |

^a MW denotes combining kernels by the mean weights.

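
The difference between the MW and MKL rows of Table 2 is only in how the per-feature kernel matrices are weighted before the classifier is trained. The following is a minimal sketch of that combination step, not the authors' implementation: the function name `combine_kernels`, the toy matrices and the example weights are all hypothetical, and the weights passed explicitly merely stand in for whatever an MKL procedure would learn.

```python
def combine_kernels(kernels, weights=None):
    """Return the weighted sum of equally sized square kernel matrices.

    With weights=None every kernel gets weight 1/len(kernels), which is the
    mean-weight (MW) combination. Explicit weights are placeholders for
    weights learned by an MKL procedure.
    """
    if weights is None:
        weights = [1.0 / len(kernels)] * len(kernels)
    if len(weights) != len(kernels):
        raise ValueError("need exactly one weight per kernel")
    n = len(kernels[0])
    combined = [[0.0] * n for _ in range(n)]
    for w, kernel in zip(weights, kernels):
        for i in range(n):
            for j in range(n):
                combined[i][j] += w * kernel[i][j]
    return combined


# Two toy 2x2 kernel matrices (illustrative values, not from the paper).
K_a = [[1.0, 0.2], [0.2, 1.0]]
K_b = [[1.0, 0.6], [0.6, 1.0]]

K_mw = combine_kernels([K_a, K_b])                  # mean weights (MW)
K_mkl = combine_kernels([K_a, K_b], [0.25, 0.75])   # hypothetical learned weights
```

Because a non-negative weighted sum of valid kernels is itself a valid kernel, either combined matrix can be handed to a kernel method such as KRR or an SVM with a precomputed kernel.
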
Table 3. Comparison between our method and other existing methods on the PDB1075 dataset (Jackknife test).

| Methods | ACC (%) | MCC | SN (%) | Spec (%) |
|---|---|---|---|---|
| IDNA-Prot | 75.40 | 0.50 | 83.81 | 64.73 |
| DNAbinder | 73.95 | 0.48 | 68.57 | 79.09 |
| DNA-Prot | 72.55 | 0.44 | 82.67 | 59.76 |
| iDNAPro-PseAAC | 76.56 | 0.53 | 75.62 | 77.45 |
| IDNA-Prot\|dis | 77.30 | 0.54 | 79.40 | 75.27 |
| Kmer1+ACC | 75.23 | 0.50 | 76.76 | 73.76 |
| Local-DPP | 79.10 | 0.59 | 84.80 | 73.60 |
| PseDNA-Pro | 76.55 | 0.53 | 79.61 | 73.63 |
| Adilina’s work | 70.21 | 0.41 | 61.00 | 79.70 |
| Our method (FKRR-MVSF) | 83.26 | 0.67 | 85.71 | 80.91 |

Table 4. Comparison with existing methods on the PDB186 dataset (independent test).

| Methods | ACC (%) | MCC | SN (%) | Spec (%) |
|---|---|---|---|---|
| IDNA-Prot | 67.2 | 0.344 | 67.7 | 66.7 |
| DNA-Prot | 61.8 | 0.240 | 69.9 | 53.8 |
| IDNA-Prot\|dis | 72.0 | 0.445 | 79.5 | 64.5 |
| DNAbinder | 60.8 | 0.216 | 57.0 | 64.5 |
| DBPPred | 76.9 | 0.538 | 79.6 | 74.2 |
| Kmer1+ACC | 71.0 | 0.431 | 82.8 | 59.1 |
| iDNAPro-PseAAC | 71.5 | 0.442 | 82.8 | 60.2 |
| Local-DPP | 79.0 | 0.625 | 92.5 | 65.6 |
| Adilina’s work | 82.3 | 0.670 | 95.0 | 69.9 |
| MSFBinder (SVM) | 81.7 | 0.640 | 89.3 | 74.2 |
| Our method (FKRR-MVSF) | 81.7 | 0.676 | 98.9 | 64.5 |

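
The four metrics in Tables 3 and 4 all follow from the confusion matrix, so a reported row can be sanity-checked against the class sizes in Table 1. The sketch below does this for the FKRR-MVSF row of Table 4. The counts are an assumption inferred from the reported SN and Spec on 93 positives and 93 negatives (92 and 60 correct, respectively); the paper does not list them in this section, and `binary_metrics` is a hypothetical helper name.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Compute ACC, SN, Spec and MCC from confusion-matrix counts."""
    acc = (tp + tn) / (tp + fn + tn + fp)
    sn = tp / (tp + fn)      # sensitivity: fraction of positives recovered
    spec = tn / (tn + fp)    # specificity: fraction of negatives recovered
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return acc, sn, spec, mcc


# Assumed counts for PDB186 (93 positives, 93 negatives), inferred from
# SN = 98.9% and Spec = 64.5%; they are not reported in the paper.
acc, sn, spec, mcc = binary_metrics(tp=92, fn=1, tn=60, fp=33)
print(f"ACC={acc:.3f} SN={sn:.3f} Spec={spec:.3f} MCC={mcc:.3f}")
# prints: ACC=0.817 SN=0.989 Spec=0.645 MCC=0.676
```

Under these assumed counts the recomputed values agree with the reported 81.7%, 98.9%, 64.5% and 0.676, which also illustrates why a very high SN paired with a modest Spec still yields a comparatively high MCC on a balanced test set.
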
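Table 5. The values of the 6 properties for twenty amino acids.

| Amino Acid | H | VSC | P1 | P2 | SASA | NCISC |
|---|---|---|---|---|---|---|
| A | 0.62 | 27.5 | 8.1 | 0.046 | 1.181 | 0.007187 |
| C | 0.29 | 44.6 | 5.5 | 0.128 | 1.461 | −0.03661 |
| D | −0.9 | 40 | 13 | 0.105 | 1.587 | −0.02382 |
| E | −0.74 | 62 | 12.3 | 0.151 | 1.862 | 0.006802 |
| F | 1.19 | 115.5 | 5.2 | 0.29 | 2.228 | 0.037552 |
| G | 0.48 | 0 | 9 | 0 | 0.881 | 0.179052 |
| H | −0.4 | 79 | 10.4 | 0.23 | 2.025 | −0.01069 |
| I | 1.38 | 93.5 | 5.2 | 0.186 | 1.81 | 0.021631 |
| K | −1.5 | 100 | 11.3 | 0.219 | 2.258 | 0.017708 |
| L | 1.06 | 93.5 | 4.9 | 0.186 | 1.931 | 0.051672 |
| M | 0.64 | 94.1 | 5.7 | 0.221 | 2.034 | 0.002683 |
| N | −0.78 | 58.7 | 11.6 | 0.134 | 1.655 | 0.005392 |
| P | 0.12 | 41.9 | 8 | 0.131 | 1.468 | 0.239531 |
| Q | −0.85 | 80.7 | 10.5 | 0.18 | 1.932 | 0.049211 |
| R | −2.53 | 105 | 10.5 | 0.291 | 2.56 | 0.043587 |
| S | −0.18 | 29.3 | 9.2 | 0.062 | 1.298 | 0.004627 |
| T | −0.05 | 51.3 | 8.6 | 0.108 | 1.525 | 0.003352 |
| V | 1.08 | 71.5 | 5.9 | 0.14 | 1.645 | 0.057004 |
| W | 0.81 | 145.5 | 5.4 | 0.409 | 2.663 | 0.037977 |
| Y | 0.26 | 117.3 | 6.2 | 0.298 | 2.368 | 0.023599 |
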
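
A property table like Table 5 is typically used as a lookup that turns a protein sequence into a numeric series, one value per residue. The sketch below shows this for the H column only. It is an illustration under stated assumptions: the names `HYDROPHOBICITY`, `standardize` and `encode` are hypothetical, and the zero-mean, unit-variance rescaling is a common preprocessing choice that this section does not confirm as the authors' procedure.

```python
# H column of Table 5, keyed by one-letter amino acid code.
HYDROPHOBICITY = {
    "A": 0.62, "C": 0.29, "D": -0.9, "E": -0.74, "F": 1.19,
    "G": 0.48, "H": -0.4, "I": 1.38, "K": -1.5, "L": 1.06,
    "M": 0.64, "N": -0.78, "P": 0.12, "Q": -0.85, "R": -2.53,
    "S": -0.18, "T": -0.05, "V": 1.08, "W": 0.81, "Y": 0.26,
}


def standardize(table):
    """Rescale property values to zero mean and unit (population) std.

    Assumption: a generic preprocessing step, not taken from the paper.
    """
    values = list(table.values())
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return {aa: (v - mean) / std for aa, v in table.items()}


def encode(sequence, table):
    """Map each residue of a protein sequence to its property value."""
    return [table[aa] for aa in sequence.upper()]


raw = encode("ACDK", HYDROPHOBICITY)                   # raw H values per residue
scaled = encode("ACDK", standardize(HYDROPHOBICITY))   # rescaled H values
```

Here `raw` is `[0.62, 0.29, -0.9, -1.5]`. Residues outside the twenty standard amino acids (for example `X`) raise a `KeyError` in this sketch, so real sequences would need filtering or a fallback value first.
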

Zou, Y.; Ding, Y.; Tang, J.; Guo, F.; Peng, L. FKRR-MVSF: A Fuzzy Kernel Ridge Regression Model for Identifying DNA-Binding Proteins by Multi-View Sequence Features via Chou’s Five-Step Rule. Int. J. Mol. Sci. 2019, 20, 4175. https://0-doi-org.brum.beds.ac.uk/10.3390/ijms20174175
