Understanding how proteins adopt their 3D structure remains one of the greatest challenges in science. Elucidation of this process would greatly impact various fields of biology and medicine, as well as the rational design of new functional proteins and drug molecules. Determination of the fold category of a protein is crucial as it reveals the 3D structure of proteins. Classification of a protein of unknown structure under a fold category is called fold recognition, which is a fundamental step in the determination of the tertiary structure of a protein.
In the early years, determination of protein structure relied on traditional experimental methods, such as X-ray crystallography and nuclear magnetic resonance spectroscopy. In the post-genomic era, numerous sequences are generated by next-generation sequencing techniques. Although an increasing number of sequences are structurally characterized using experimental methods, the gap between structurally determined sequences and uncharacterized sequences is constantly increasing. Therefore, computational methods for fast and accurate determination of protein structures are urgently needed. Accurate computational prediction of protein folds has recently emerged as an alternative to the labor-intensive and expensive experimental methods. Computational methods for protein fold recognition can be generally categorized into three classes: (1) de novo modeling methods; (2) template-based methods; and (3) template-free methods. Many efforts have focused on the development of methods under classes (2) and (3) because the de novo approach (class 1) has two limitations. First, it requires long computational time and substantial computing resources; second, it can only be successfully applied to small proteins.
Template-based methods for determining protein structures rely on the evolutionary relationships among proteins. The procedure for template-based methods can be summarized as follows: First, proteins of known structure retrieved from public protein structure databases (e.g., Protein Data Bank (PDB)) are used as template proteins for a query protein sequence. To make template-based prediction fast and reliable, a reduced database is usually employed, in which pairwise sequence similarity is kept below a cutoff of 50%–70%. Second, distant evolutionary relationships between a target sequence and proteins of known structure are detected. In this step, multiple-alignment algorithms are adopted to exploit evolutionary information by encoding amino acid sequences into profiles. Third, to determine the optimal alignments, scoring functions are used to evaluate the similarity between the profile derived from a query protein and those of template proteins with known structures. Z-score and E-value are the two commonly used measures. The accuracy of the alignment is critically important for model building. Fourth, 3D structure models are built from the template atom coordinates and the optimal query-template alignments. Last, the optimal structure models are selected from the model candidates through further structure optimization. The commonly used structural optimization methods include energy minimization and loop modeling.
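To make the Z-score mentioned in the third step concrete, the following minimal sketch standardizes an alignment score against a background distribution of scores. The function name `z_score`, the toy background values, and the use of the sample standard deviation are assumptions made for illustration; individual threading programs define their background distributions differently.

```python
from statistics import mean, stdev


def z_score(score, background_scores):
    """Return how many standard deviations `score` lies above the background mean.

    `background_scores` stands for scores obtained by aligning the query
    against templates assumed to be unrelated (a hypothetical input).
    """
    mu = mean(background_scores)
    sigma = stdev(background_scores)  # sample standard deviation
    if sigma == 0:
        raise ValueError("background scores have zero variance")
    return (score - mu) / sigma


# Toy values, not taken from any real alignment.
background = [10.0, 12.0, 11.0, 9.0, 13.0]
print(z_score(20.0, background))
```

A large positive Z-score suggests that the query-template alignment scores well above what is expected by chance; the threshold for calling a hit significant depends on the method.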
A series of template-based methods have been developed in the last few decades. These approaches are regarded as the most successful methods for constructing theoretical models of protein structures. For instance, Jaroszewski et al. [1] developed a protein recognition method called Fold and Function Assignment System (FFAS) by using a profile-profile alignment strategy without using any structural information. In FFAS, query and template profiles are obtained by PSI-BLAST searching against the NR85 database; these profiles are then aligned by a dot-product scoring function. The significance of an alignment score is estimated by comparing it with the distribution of scores obtained from pairs of unrelated proteins. Xu et al. [2] improved the FFAS method and proposed a method called FFAS-3D, wherein they introduced structural information, such as secondary structure, solvent accessibility, and residue depth. FFAS-3D remarkably outperforms FFAS. Moreover, Shi et al. [3] developed a protein fold recognition method called FUGUE, which can search sequences against protein fold libraries by using environment-specific substitution tables and structure-dependent gap penalties. Raptor is a method that uses the mathematical theory of linear programming to build 3D models of proteins and predict protein folds [4]. Roy et al. [6] developed an online prediction server called I-TASSER (Iterative Threading ASSEmbly Refinement), which is an integrated platform for automated protein structure and function prediction based on the sequence-to-structure-to-function paradigm. Ghouzam et al. [7] proposed ORION, a fold recognition method based on pairwise comparison of hybrid profiles that contain evolutionary information from protein sequences and their structures. Other successful template-based methods include MODELLER [8] and TMFR [9]. MODELLER implements comparative protein structure modeling through satisfaction of spatial restraints, whereas TMFR applies special scoring functions to align sequences and predict whether given sequence pairs share the same fold. With so many template-based methods available, an independent means of assessing their quality is needed. Currently, CASP (Critical Assessment of protein Structure Prediction) is the mainstream platform that provides an independent mechanism for assessing the current methods employed in protein structure modeling [10]. This platform can be accessed at http://predictioncenter.org/.
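The dot-product scoring of profile columns described above for FFAS can be illustrated with a small sketch. This is not the FFAS implementation: the function names, the three-letter toy alphabet, and the frequency values are assumptions, and a real method would also normalize the scores and run an alignment algorithm over the resulting matrix.

```python
def column_score(query_column, template_column):
    """Dot product of two profile columns (per-residue frequency vectors)."""
    return sum(q * t for q, t in zip(query_column, template_column))


def score_matrix(query_profile, template_profile):
    """Score every query column against every template column."""
    return [
        [column_score(q_col, t_col) for t_col in template_profile]
        for q_col in query_profile
    ]


# Toy profiles over a hypothetical 3-letter alphabet; real profiles
# would have 20 frequencies per column, one per amino acid.
query = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1]]
template = [[0.7, 0.2, 0.1], [0.2, 0.2, 0.6]]

scores = score_matrix(query, template)
for row in scores:
    print(row)
```

Columns with similar residue preferences receive a high score, so an alignment algorithm applied to this matrix tends to pair positions that evolve under similar constraints.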
Although much progress has been made in template-based methods, some problems remain. First, template proteins of known structure are required, yet the three-dimensional structures of many proteins remain to be determined. Second, template-based modeling largely relies on the homology between target and template proteins. When the target and template proteins display a sequence similarity of >30%, sequence alignment methods (e.g., BLAST [11] and SSEARCH [12]) can reveal their evolutionary relationships. However, this approach fails for non-obvious relationships between targets and templates with a sequence identity lower than 20%–30%. Third, template-based structure modeling is time consuming, because it always requires homology detection by searching target proteins against a template database to detect distant evolutionary relationships.
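The sequence identity thresholds quoted above refer to a pair of already aligned sequences. As a minimal sketch, the percentage can be computed as follows; the function name and the choice of denominator (aligned positions without gaps) are assumptions, since several definitions of identity are in use.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over alignment columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != gap and b != gap]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)


# Toy alignment, invented for illustration.
identity = percent_identity("MKT-AYIA", "MRTLAY-A")
print(round(identity, 1))
```

Pairs whose identity falls below roughly 20%–30% are the cases in which plain sequence alignment is unreliable and profile-based or machine learning-based methods become necessary.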
To address the aforementioned problems, recent research efforts have focused on the development of template-free methods. Template-free methods seek to build models and accurately predict protein structures solely based on amino acid sequences, without using proteins of known structure as templates. Many machine learning algorithms have recently been used for that purpose, including Hidden Markov Models (HMMs), genetic algorithms, Artificial Neural Networks, Support Vector Machines (SVMs), and ensemble classifiers. A key underlying assumption in employing machine learning-based methods for protein fold recognition is that the number of protein fold classes is limited [13]. Machine learning aims to build a prediction model by learning the differences between protein fold categories and to use the learned model to automatically assign a query protein to a specific protein fold class. This approach is thus more efficient for large-scale predictions and can screen a large number of promising candidates for further experimental validation. This review focuses mainly on the recent progress in machine learning-based methods for protein fold recognition and is organized as follows: First, we introduce the public databases usually used in protein fold recognition research. Second, we describe the framework and flowchart of machine learning-based recognition methods. Third, we summarize some recent representative machine learning-based methods for protein fold recognition. Finally, we evaluate and compare the recognition performance of existing methods published in the last 10 years on a benchmark dataset.
3. Framework of Machine Learning-Based Methods
This section describes the mechanism of protein fold recognition by machine learning-based methods. The overall procedure in protein fold recognition by machine learning-based methods includes two phases (Figure 2): (1) model training; and (2) prediction.
In the first phase (model training), training protein sequences are first submitted into a pipeline of feature representation, in which sequences of different lengths are encoded as fixed-length feature vectors by feature descriptors. The commonly used feature descriptors include Amino Acid Composition (AAC), Pseudo AAC, Functional Domain (FunD), Position Specific Scoring Matrix (PSSM)-based descriptors, Secondary Structure-based descriptors, and Auto-cross covariance (ACC) transformation. When the resulting feature representations contain irrelevant or redundant features, an optional feature selection step is usually performed to identify the feature subset that yields the best performance. Subsequently, the feature vectors are fed into a pre-selected classification algorithm to train a prediction model. Typical classification algorithms used in model training include SVM, Random Forest (RF), Naïve Bayes (NB), and Logistic Regression (LR). This completes the first phase.
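As an illustration of how a descriptor maps sequences of different lengths to fixed-length vectors, the following sketch computes the Amino Acid Composition (AAC). The function name and the decision to ignore non-standard residue symbols are assumptions made for this example.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(sequence):
    """Return a 20-dimensional vector of residue frequencies.

    Symbols outside the 20 standard residues (e.g., 'X') are ignored,
    which is an assumption of this sketch.
    """
    counts = {aa: 0 for aa in AMINO_ACIDS}
    valid = 0
    for residue in sequence.upper():
        if residue in counts:
            counts[residue] += 1
            valid += 1
    if valid == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / valid for aa in AMINO_ACIDS]


# Two toy sequences of different lengths yield vectors of the same length.
short_vec = amino_acid_composition("ACDA")
long_vec = amino_acid_composition("MKTAYIAKQRQISFVKSHFSRQ")
print(len(short_vec), len(long_vec))
```

AAC discards residue order entirely, which is why descriptors such as Pseudo AAC, PSSM-based features, and ACC transformation are used to recover sequence-order and evolutionary information.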
In the second phase (prediction), uncharacterized query proteins are first submitted into the same pipeline of feature representation as in the first phase. Note that if feature selection was performed in the first phase, the same selection must also be applied to the feature vectors in the second phase. The resulting feature vectors are then fed into the trained prediction model, which predicts the protein fold class to which each query protein belongs.
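The two phases can be sketched end to end as follows. To keep the example self-contained, a nearest-centroid rule stands in for the classifiers named above (SVM, RF, NB, LR), and a variance ranking stands in for feature selection; both are placeholders rather than the methods used in the reviewed studies. All function names, the fold labels, and the toy sequences are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def encode(sequence):
    """AAC descriptor: fraction of each of the 20 standard residues."""
    seq = sequence.upper()
    total = sum(seq.count(aa) for aa in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [seq.count(aa) / total for aa in AMINO_ACIDS]


def select_features(vectors, k):
    """Indices of the k highest-variance features (placeholder criterion)."""
    n = len(vectors)
    dims = range(len(vectors[0]))
    means = [sum(v[d] for v in vectors) / n for d in dims]
    variances = [sum((v[d] - means[d]) ** 2 for v in vectors) / n for d in dims]
    ranked = sorted(dims, key=lambda d: variances[d], reverse=True)
    return sorted(ranked[:k])


def train(sequences, labels, k=5):
    """Phase 1: encode, select features, and fit one centroid per fold class."""
    vectors = [encode(s) for s in sequences]
    keep = select_features(vectors, k)
    reduced = [[v[d] for d in keep] for v in vectors]
    centroids = {}
    for label in set(labels):
        members = [r for r, lab in zip(reduced, labels) if lab == label]
        centroids[label] = [sum(col) / len(members) for col in zip(*members)]
    return {"keep": keep, "centroids": centroids}


def predict(model, sequence):
    """Phase 2: same encoding and same feature subset, then nearest centroid."""
    vector = encode(sequence)
    reduced = [vector[d] for d in model["keep"]]

    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(reduced, centroid))

    centroids = model["centroids"]
    return min(centroids, key=lambda label: squared_distance(centroids[label]))


# Toy training data, invented for illustration.
train_sequences = ["AAAAAGAAAL", "AALAAAAGAA", "VVVIVVTVVV", "VTVVVVIVVV"]
train_labels = ["fold_a", "fold_a", "fold_b", "fold_b"]
model = train(train_sequences, train_labels)
print(predict(model, "AAGAAALAAA"), predict(model, "VVIVVVVTVV"))
```

The feature indices chosen during training are stored in the model and reused at prediction time, which is the consistency requirement noted above.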
5. Comparisons with Different Methods on Benchmark Dataset
To examine the effectiveness of existing machine learning-based methods for protein fold recognition, an intuitive approach is to run the methods on a public benchmark dataset. Here, a public and stringent dataset proposed by Ding and Dubchak [41] is employed as the benchmark dataset for performance comparison of the existing methods. This dataset, referred to as DD, has been widely used in several studies [22]. The DD dataset comprises a training dataset and a testing dataset, both of which cover 27 protein fold classes in the SCOP database. The training dataset contains 311 protein sequences with ≤40% residue identity, while the testing dataset contains 383 protein sequences with ≤35% residue identity. Importantly, the sequences in the training dataset share ≤35% residue identity with those in the testing dataset, which ensures an unbiased performance evaluation. The sequence distribution of each of the 27 fold classes can be seen in Table 2.
With the benchmark dataset determined, the next step is to select the methods for performance comparison. To provide a comprehensive comparison, we evaluated and compared 20 representative methods published in the past 10 years (from 2006 to present) on the DD dataset. The 20 methods are first trained on the training dataset of DD and then tested on the testing dataset of DD. The prediction results are presented in Table 3. From Table 3, we make the following two observations. First, the recent ProFold exhibits the best performance among the existing methods. The overall accuracy of ProFold is 76.2%, which is 2.6–15.7 percentage points higher than that of the other methods. This demonstrates that ProFold has great power to distinguish the 27 fold classes in the DD dataset. The significant performance improvement of ProFold is attributed to the first use of the DSSP feature in the field of protein fold recognition. The authors' results indicate that integrating the DSSP features into feature representations remarkably enhanced the overall accuracy from 71.2% to 76.2% [35]. This suggests a way to further improve predictive performance by integrating unexplored but informative features. Second, of the 20 methods, 14 are based on an ensemble classifier, while 6 are based on a single classifier. In particular, we observe that 9 of the 20 methods obtain an overall accuracy of >70%, including PFP-FunDSeqE (70.5%), TAXFOLD (71.5%), Marfold (71.7%), Kavousi et al. (73.1%), PFPA (73.6%), Feng and Hu (70.2%), Feng et al. (70.8%), and ProFold (76.2%). Of the nine methods, only TAXFOLD is based on a single classifier, while the others are based on ensemble classifiers. This suggests that ensemble classifiers tend to be more effective than single classifiers for protein fold recognition. This result also explains why more recent research efforts are focused on the development of ensemble-classifier-based predictors.
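For reference, the overall accuracy used in such comparisons can be sketched as the percentage of test sequences assigned to their correct fold class. This definition, the function name, and the toy labels below are assumptions for illustration; they are not taken from the compared studies.

```python
def overall_accuracy(true_labels, predicted_labels):
    """Percentage of test sequences whose predicted fold matches the true fold."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("no test sequences given")
    correct = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return 100.0 * correct / len(true_labels)


# Toy labels, invented for illustration.
truth = ["fold_a", "fold_a", "fold_b", "fold_c"]
predicted = ["fold_a", "fold_b", "fold_b", "fold_c"]
print(overall_accuracy(truth, predicted))
```

Under this definition, an overall accuracy of 76.2% on the 383 test sequences would correspond to roughly 292 correct predictions (0.762 × 383 ≈ 291.9); this is back-of-envelope arithmetic rather than a figure reported in the compared studies.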
6. Conclusions and Perspectives
We have systematically reviewed the recent progress in machine learning-based protein fold recognition methods. Compared with the traditional experimental methods, machine learning-based methods present three advantages. First, they demonstrate accurate, robust, and reliable performance. Second, they can be applied in large-scale protein fold recognition; this application is extremely important in the post-genomic era, wherein numerous proteins remain to be structurally characterized. Third, they can effectively address the intrinsic limitations of experimental methods, that is, their being time consuming and expensive. In the past decades, remarkable progress has been made in computational protein fold recognition. However, several challenges remain to be addressed.
First, the benchmark dataset (e.g., DD dataset) used to evaluate the performance of predictors suffers from some limitations. For instance, the DD dataset is imbalanced. Table 2 shows that the ratio of the smallest class (“EF hand-like”) against the largest class (“immunoglobulin-like β-sandwich”) is roughly 1:4. Moreover, the sample size for each fold class is small. Only 311 training sequences are distributed over the 27 fold classes. The largest fold class contains 30 training samples, whereas the smallest fold class contains 6 training samples. Generally, a prediction model trained on such an imbalanced and small dataset is prone to overfitting.
Second, most of the existing methods, especially those with online web servers, can only predict the 27 most populated fold classes. Although the sequences of these 27 fold classes cover the majority of the sequences in the SCOP database, approximately 1800 protein fold classes actually exist in SCOP. Thus, developing adaptive multi-class protein fold predictors is desirable given that an increasing number of protein fold classes are being discovered.
Third, constructing informative and effective prediction engines remains a great challenge. Well-established ensemble classifiers have demonstrated their classification power in protein fold recognition. The use of deep learning algorithms for classification tasks has been a recent research hotspot in the machine learning field. Deep learning networks have been successfully applied in protein fold recognition [57]. Combining deep learning networks with well-established ensemble classifiers is probably an alternative means to improve the efficiency of protein fold recognition.
In general, machine learning-based methods can be successfully applied in protein fold recognition. In the future, machine learning methods are expected to be extensively applied in other similar but less explored fields, such as disease-causing amino acid change prediction [58], protein-protein binding site or interaction prediction [61], and DNA-protein binding site or interaction prediction [64].