1. Introduction
Proteins are large macromolecules that perform a variety of functions necessary for an organism. Biological processes are driven by multiple proteins that form protein complexes or signaling cascades through biophysical interactions. Understanding which proteins interact with each other is an important first step towards understanding the molecular mechanisms of biological functions and diseases. Knowledge of binary protein–protein interactions (PPIs) is also useful for further research on blocking or enhancing specific interactions to control biological processes and functions, and for the design of therapeutics. However, given that the human genome contains approximately 20,000 protein-coding genes, and therefore nearly 200 million unique protein pairs even without accounting for multiple isoforms of genes, resolving all pairs of interacting proteins is not a trivial task.
Many experimental biology methods have been devised to find pairs of interacting proteins. Benchwork experiments like co-immunoprecipitation (co-IP) are resource-intensive and low-throughput, and often produce false negatives depending on experimental conditions, for example due to ineffective antibodies or the transient nature of PPIs [1]. In contrast, high-throughput screens such as yeast two-hybrid (Y2H), tandem affinity purification mass-spectrometry (TAP-MS), and more recently a sequencing approach called PROPER-seq [2], attempt to capture tens to hundreds of thousands of PPIs in a single experiment. Early statistical estimates suggested that Y2H had a 25%–45% false positive rate and had difficulty detecting PPIs for certain subsets of proteins, such as membrane proteins [3]. Later studies showed that extensive filtering techniques can decrease the false positive rate, usually by running multiple screens or comparing results against other data sources to obtain a more precise subset of potential interactions [4]; one recent approach produced a high-quality protein interaction network of 53,000 PPIs in this way [4]. However, the overlap between PPIs identified by different large-scale techniques remains low, with each method producing large numbers of PPIs unverified by any other experiment [5].
Overall, most of the estimated 500,000 to 3 million binary PPIs of the human interactome remain unknown [4,6,7,8], which motivates the development of computational approaches for PPI discovery. At a minimum, computational approaches should identify an accurate subset of likely interacting protein pairs that can subsequently be validated as true PPIs with experimental techniques, thereby eliminating much of the cost of testing millions of other pairs. A sizeable body of literature has been published on the computational prediction of PPIs, especially in model organisms like yeast, suggesting their extrapolation to human interactome mapping. These methods represent pairs of proteins by features of amino acid composition, protein functional annotations, or gene co-expression, and train machine learning models to classify each pair (a data instance) as interacting or non-interacting.
Amino acid sequence-based predictors make up most of the literature on computationally predicting proteome-wide PPIs, except for yeast, where other features have also been widely explored. Sequence-based predictors rely on features computed from amino acid sequences, their physicochemical properties, or unigram/n-gram counts. Examples of such feature representations include auto covariance (AC), pseudo amino acid composition (PSEAAC), and conjoint triads (CT), all of which have been widely used for predicting PPIs [9,10,11,12,13,14,15,16,17,18]. Other approaches make more direct use of amino acid sequences, either by generating per-protein features from PSSMs using PSI-BLAST, or by encoding the amino acids into a numeric vector to use as the features [13,19,20,21,22,23]. Once calculated, these values form a numeric vector of equal length for each protein, creating the data representation typically used in machine learning.
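As a concrete illustration of one such representation, the following minimal Python sketch computes a conjoint triad (CT) vector, assuming the commonly used seven-class grouping of amino acids by physicochemical properties; the table and names here are illustrative rather than taken from any one of the cited implementations:

    # Conjoint triad (CT) sketch: amino acids map to 7 physicochemical
    # classes, and a protein is represented by normalized counts of all
    # 7*7*7 = 343 class triads occurring in its sequence.
    AA_CLASSES = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]  # assumed grouping
    AA_TO_CLASS = {aa: c for c, group in enumerate(AA_CLASSES) for aa in group}

    def conjoint_triad(seq):
        counts = [0.0] * (7 ** 3)
        classes = [AA_TO_CLASS[aa] for aa in seq if aa in AA_TO_CLASS]
        for a, b, c in zip(classes, classes[1:], classes[2:]):
            counts[a * 49 + b * 7 + c] += 1.0
        total = sum(counts) or 1.0
        return [v / total for v in counts]  # normalized: same length for any sequence

    # A protein pair becomes one instance by concatenating the two vectors:
    pair_vector = conjoint_triad("MKTAYIAKQR") + conjoint_triad("GSHMLEDPC")

AC and PSEAAC follow the same pattern: a fixed-length vector is computed per protein, and the two vectors of a pair are combined into one instance.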
Non-sequence-based predictors utilize functional annotations, protein subcellular locations, structural and other biological annotations, and transcriptomic data to compute features (henceforth referred to as annotation-based predictors, in contrast to sequence-based predictors). These features rely heavily on the Gene Ontology (GO), structural domain databases, and gene expression databases to assemble the features of all protein pairs, although many pairs may have missing values. For each protein pair, they compute features such as co-occurrence in subcellular locations, co-expression correlation across experimental conditions or tissues, and semantic similarity of ontology annotations, which serve as input to machine learning classification models [24,25,26,27,28]. Many previous works rely not on one but on a collection of these annotations, creating a feature vector per protein pair [29,30]. These collection-based classification algorithms have built-in methodologies for handling missing values, so that protein pairs are not discarded entirely when some of their features are missing.
Given the large number of unknown PPIs in the human interactome and the potential role of computational methods in discovering them, we must assess how accurately these methods perform in ‘real-world’ applications and, where they prove successful, compare them against each other to identify the best methods. In their original publications, most predictors are claimed to be highly accurate, with several reporting over 90% accuracy and some reporting accuracies as high as 95–98%. Logically, if multiple classifiers can achieve such high accuracies, predicting the full interactome should be a trivial task. However, the evaluations these algorithms have undergone do not consider two important aspects of PPIs, namely, (i) the large number of known PPIs involving a few hub proteins, and (ii) the rarity of PPIs among all possible protein pairs. Here, we evaluate and benchmark the performance of these algorithms, taking these aspects into account.
The common evaluation methodology for most PPI prediction methods is to use a set of known PPIs from a public repository as positive-label instances and randomly sampled protein pairs (excluding pairs known to interact) in place of negative-label instances. Randomly chosen pairs are used because experimental methods do not yield known non-interacting pairs: failure to identify a PPI in a biological experiment only suggests that the PPI was not observed under those experimental conditions and does not prove that the two proteins never interact. Therefore, failed experiments do not create negative-labeled data, and this inability to experimentally determine non-interacting protein pairs prevents the creation of a traditional gold-standard negative dataset. However, because only an estimated 0.325–1.5% of protein pairs interact, computational methods use randomly sampled protein pairs as negatively labeled data, with the majority of randomly selected pairs likely being truly non-interacting.
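A minimal sketch of this common sampling scheme, with illustrative names and toy data; real uses would draw from a proteome-wide protein list and a PPI repository:

    import random

    def sample_negatives(proteins, known_ppis, n, seed=0):
        """Draw n random protein pairs absent from the known-PPI set.
        With only ~0.3-1.5% of pairs interacting, most draws are very
        likely true non-interactions, though some false negatives remain.
        Assumes n is well below the number of available pairs.
        """
        rng = random.Random(seed)
        known = {frozenset(p) for p in known_ppis}
        negatives = set()
        while len(negatives) < n:
            a, b = rng.sample(proteins, 2)  # two distinct proteins, no self-pairs
            pair = frozenset((a, b))
            if pair not in known:
                negatives.add(pair)
        return [tuple(p) for p in negatives]

    proteins = ["A", "B", "C", "D", "E"]
    known = [("A", "B"), ("C", "D")]
    print(sample_negatives(proteins, known, n=3))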
Although it is estimated that there are about 500,000 to 3 million PPIs out of a total of 200 million protein pairs in humans (0.325–1.5%), most algorithms are trained and tested on datasets containing 50% positive-label data. This simple approach to assessing an individual algorithm’s real-world applicability is questionable. Additionally, the protein interactome is believed to be a small-world network [31]. Such networks have hubs, i.e., nodes that connect directly to many other nodes, or in this case, proteins with a large number of PPIs. For example, the Biological General Repository for Interaction Datasets (BioGRID) contains ~125,000 unique, non-self interactions among ~14,500 proteins (proteins are represented by their gene names/symbols without distinguishing isoforms; this convention applies throughout this paper except where explicitly mentioned otherwise) [32]. This averages to around 17 interactions per protein; however, 360 proteins currently have more than 100 PPIs each, with one protein, APP, involved in over 2000 PPIs, while over 9000 proteins have 10 or fewer PPIs each. Whereas the positive-class PPI data come from a scale-free network, the randomly paired data (negative class) are sampled uniformly and thus have a significantly different distribution of proteins. This can bias machine learning: a single protein may appear far more often in the positive dataset, allowing algorithms to simply predict pairs containing such proteins as positive and thereby achieve high accuracy on test datasets drawn from the same distributions. Past experiments in this field have suggested similar findings, with works by Yu as well as Park and Marcotte suggesting that many prediction algorithms primarily predict bias in the underlying datasets [33,34].
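The degree disparity described above is straightforward to observe; a short sketch with toy stand-ins for the positive (known PPI) and negative (random pair) sets:

    from collections import Counter

    # Toy stand-ins; real data would be BioGRID PPIs and random pairs.
    positive_pairs = [("APP", "P1"), ("APP", "P2"), ("APP", "P3"), ("P1", "P2")]
    negative_pairs = [("Q1", "Q2"), ("Q3", "Q4"), ("Q5", "Q6"), ("Q7", "Q8")]

    def degree_distribution(pairs):
        """Count how many pairs each protein participates in."""
        deg = Counter()
        for a, b in pairs:
            deg[a] += 1
            deg[b] += 1
        return deg

    pos_deg = degree_distribution(positive_pairs)  # scale-free: hubs dominate
    neg_deg = degree_distribution(negative_pairs)  # uniform: roughly flat
    print("max degree, positive:", max(pos_deg.values()))
    print("max degree, negative:", max(neg_deg.values()))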
In addition to dataset creation, evaluation metrics also play a role in correctly assessing whether an algorithm can make good predictions on real data. In evaluating classification algorithms on rare-category data, accuracy and AUC are not suitable metrics, and precision-recall (P-R) curves are recommended [35]. In biological and clinical domains, where the natural distribution of class labels may be highly imbalanced, a P-R curve provides more reliable information and distinguishes models that are practical for real-world applications, whereas AUC may misleadingly convey more impressive performance than is realistically achievable on the rare category of interest (e.g., an interacting protein pair or the presence of a disease) [36]. However, most published works have used AUC/accuracy metrics. Both data composition and evaluation metrics were taken into account by Qi et al. in the seminal paper on proteome-wide human PPI prediction (i.e., where a protein is compared against all other proteins) for membrane receptor proteins [29]. They set their test data at 1:1000 positive-to-negative instances and evaluated with a P-R curve. Our own method, High-Precision Protein–Protein Interaction Prediction (HiPPIP) for human interactome prediction, was also evaluated along the lines of Qi et al.’s work, both in terms of the rareness of the positive class and in using a P-R curve [37]. Additionally, HiPPIP was evaluated for its ability to predict proteome-wide PPIs by computing the cumulative number of true positives versus the increasing number of false positives on hub proteins, under the reasonable assumption that many, if not all, of their PPIs are known (note that in this proteome-wide scenario, i.e., where a hub protein is paired with every other protein in the human proteome, the unlabeled data may not all be false positives and may indeed be true positives that are currently unknown); this metric was used as a reasonable approximation of the number of accurate predictions produced by HiPPIP.
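A sketch of the recommended evaluation using scikit-learn’s standard metrics on synthetic rare-class data; the precision-at-3%-recall readout shown is one reasonable way to compute that metric:

    import numpy as np
    from sklearn.metrics import precision_recall_curve, roc_auc_score

    # Synthetic stand-in for a rare-positive test set (~1:332 class ratio).
    rng = np.random.default_rng(0)
    y_true = (rng.random(100_000) < 0.003).astype(int)
    y_score = rng.random(100_000) + 0.5 * y_true  # toy scores with mild signal

    precision, recall, _ = precision_recall_curve(y_true, y_score)

    # Precision at 3% recall: best precision among thresholds reaching >=3% recall.
    mask = recall >= 0.03
    p_at_3pct = precision[mask].max() if mask.any() else 0.0

    # On rare-class data, ROC-AUC can look respectable while precision is tiny,
    # which is why the P-R view is the more informative one here.
    print("ROC-AUC:", roc_auc_score(y_true, y_score))
    print("precision @ 3% recall:", p_at_3pct)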
In this study, we evaluate different PPI prediction algorithms to determine how well they perform on realistically proportioned datasets. We first implemented various algorithms, using feature sets and classification models similar to those described in the original publications. Where possible, we downloaded the datasets used by these algorithms to test whether our implementations produce similar results. Next, we created six new evaluation datasets covering three proportions of positive-label instances (50%, 10%, and 0.3%) and two sampling methods: pairs sampled randomly from the full list of proteins, as is commonly done in the literature (henceforth Full), and pairs drawn from a set of proteins held out for evaluation (henceforth Held-Out; known as C3 data in Park and Marcotte’s prior work [33]). Finally, we created control models to compare these predictors against, using illogical features (e.g., the frequency of the proteins in the dataset, or random vectors representing the proteins) with simple, naïve classifiers. Such predictions do not consider the pairwise compatibility of two proteins and simply predict PPIs based on the distinct distributions of proteins in the positive and negative classes of the datasets. This allows us to determine how well the algorithms perform relative to these illogical features.
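For concreteness, one such control can be as simple as the following sketch, which scores a pair purely by how often its proteins occur in the positive training data; the class name is ours, and this is a schematic of the idea rather than our exact implementation:

    from collections import Counter

    class FrequencyControl:
        """Control 'predictor' built on an illogical feature: the frequency
        of each protein among positive training pairs. It ignores the
        pairwise compatibility of the two proteins entirely."""

        def fit(self, positive_pairs):
            self.freq = Counter()
            for a, b in positive_pairs:
                self.freq[a] += 1
                self.freq[b] += 1
            return self

        def score(self, pair):
            # High score if either protein is frequent in positive training data.
            a, b = pair
            return self.freq.get(a, 0) + self.freq.get(b, 0)

    ctrl = FrequencyControl().fit([("APP", "X"), ("APP", "Y"), ("Z", "W")])
    print(ctrl.score(("APP", "Q")), ctrl.score(("Q", "R")))  # 2 vs 0

If a published model does not clearly beat a control like this, its apparent accuracy likely reflects dataset composition rather than learned interaction biology.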
4. Discussion
We implemented different PPI prediction algorithms and evaluated them on benchmark datasets containing different class distributions. Our results show that most published methods perform far worse than originally reported when evaluation data are created with realistic proportions of positive and negative classes. We also showed that many of these methods may be capturing spurious, illogical features that reflect the frequency of specific proteins in the data rather than meaningful information about the PPIs themselves; such methods will not translate to the real-world application of proteome-wide PPI discovery, where every protein is tested against every other protein in the proteome.
In prior publications, most sequence-based predictors were evaluated on datasets with 50% positively labeled instances, with randomly selected protein pairs serving as negative-class data. In those datasets, the PPIs (i.e., positive-class data) are drawn from a scale-free network in which some nodes are hubs and therefore appear in many protein pairs, whereas the randomly paired negative-class data are effectively drawn from a near-complete graph. Thus, the usage rates of individual proteins, particularly hub proteins, in the positively and negatively labeled instances are dramatically different, creating an easily exploitable bias. Some of this can be inferred from
Table 3, which shows far fewer unique proteins in positive-class data than in negative-class data in many datasets. (For example, if 100 pairs are drawn from a scale-free distribution containing one hub with 25 PPIs, those 25 PPIs contain only 26 unique proteins; had the same 25 pairs been drawn from a complete graph instead, they would contain up to 50 unique proteins.) When evaluated on such datasets, algorithms may assign class labels based on protein frequency rather than on the true characteristics of an interacting protein pair, falsely showing higher performance in evaluations; the class label is assigned largely independently of the second protein in the pair, and the model learns a premise that is likely invalid for real-world interactome prediction. Our experiments showed that both on datasets from the original publications and on our newly created Full datasets, control models with illogical features that simply capture protein membership can perform on par with most of the models from the literature. When analyzing results on test datasets whose proteins were not used in the training set (Held-Out), we find that most algorithms’ accuracies drop significantly: accuracies on our 1:1 test data dropped from the originally reported 75–85% down to 55–67%. On the more relevant metric, the best sequence-based algorithm achieved a mere 6.1% precision at 3% recall on 1:332 positive-class data with held-out test proteins. Thus, when both the data ratio and the evaluation metric are suitably chosen, the true ability of these methods is revealed to be impractically low.
Some of the previous methods filtered out protein sequences with high sequence similarity, which we have not done; however, we mapped our data to gene-level identifiers, so multiple isoforms are not included in our test data, minimizing the number of highly sequence-similar proteins. It is likely that if we additionally removed proteins with similar sequences from our datasets, the results when predicting on held-out proteins would be lower still, since sequence-based methods can exploit similarity between test and training sequences. This may be analyzed in future work.
When training the methods from the literature that were originally designed for 50% positive training and test datasets, we did not adjust their designs or implementations for the different data ratios. Some adjustments, such as class weighting, are commonly recommended when training models on imbalanced data; however, to keep the models as close as possible to those in the previous literature, we decided not to implement adjustments per positive data ratio. We note that we did test all models on 50% positive test data and found the results to be similar to what could be obtained by prioritizing proteins by their number of known interactions. Therefore, we believe it is unlikely that techniques such as class weighting would drastically increase the precision of these methods.
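For reference, the kind of adjustment we chose not to apply is typically a one-line change in common libraries, e.g., scikit-learn’s class_weight option (the data here are random placeholders):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    X_train = np.random.rand(1000, 20)          # placeholder pair-feature matrix
    y_train = (np.random.rand(1000) < 0.1) * 1  # imbalanced labels, ~10% positive

    # 'balanced' reweights classes inversely to their frequency, so the rare
    # positive class contributes as much to training as the abundant negatives.
    clf = RandomForestClassifier(n_estimators=200, class_weight="balanced")
    clf.fit(X_train, y_train)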
When using annotation features, we found that feature representation has a significant impact on model results. Most methods we tested were unable to outperform control models built from illogical features when data were generated from all pairs of proteins. However, on Held-Out protein data, these methods showed moderate precision at 3% recall even on heavily imbalanced class data, showing that their predictive capabilities do not depend on exploitable biases tied to individual proteins in the underlying data.
Models that use features computed from pairs of protein domains and GO annotations that appear in other interactions performed well at predicting interactions; these are well recognized as meaningful features for predicting interactions (e.g., two protein domains known to interact are highly likely to conserve that interaction/function when they appear in other proteins). Thus, knowledge of interacting protein domains or compatible GO annotations (e.g., specific ligand and receptor annotations on the two proteins), along with other protein features, can be used to learn an effective machine learning classification model that helps shortlist protein pairs for experimental validation. We also note that the two methods using only 3 features, i.e., ‘Domain Variant’ and ‘Guo 2006 Sem’, performed as well as or better than other non-sequence methods based on 10–20 features. Surprisingly, our ensemble method, which contained Resnik semantic similarity, GO annotation frequency, and all 3 domain features, performed worse than the methods using only the 3 domain features or Resnik semantic similarity alone. This was most likely due to the different aggregation methods: Domain Variant and Guo 2006 Sem use product and max aggregations, respectively, while our ensemble used best-match average aggregation. This could imply that max or product aggregation is better for predicting protein interactions, or it could suggest a bias whereby protein interactions are primarily known for genes with a high number of annotations. If the latter is true, product, sum, and maximum aggregations could exploit this bias, as all three functions increase monotonically as more data are provided. We leave this analysis to future work.
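To make the aggregation distinction explicit, here is a sketch of the three aggregations discussed, applied to a term-by-term similarity matrix (rows: annotations of protein A; columns: annotations of protein B). The product form shown is the common 1 - prod(1 - s) variant, which, like max and sum, grows monotonically as annotations are added; the implementations are illustrative and are not taken from the cited methods:

    import numpy as np

    def aggregate(sim, method):
        """Collapse a term-by-term similarity matrix into one pair score."""
        s = np.asarray(sim, dtype=float)
        if method == "max":
            return s.max()
        if method == "product":
            # 1 - prod(1 - s): non-decreasing as more terms are added
            return 1.0 - np.prod(1.0 - s)
        if method == "bma":
            # best-match average: mean of each side's best matches
            return 0.5 * (s.max(axis=1).mean() + s.max(axis=0).mean())
        raise ValueError(f"unknown method: {method}")

    sim = [[0.9, 0.1], [0.2, 0.4]]  # toy similarities between annotation terms
    for m in ("max", "product", "bma"):
        print(m, aggregate(sim, m))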
As for methods whose features were essentially per-protein features, such as Chen 2005 Dom RF and Maetschke 2012 ULCA, we found that their performance mirrored that of sequence-based methods. This implies that the use of sequences per se is not what is problematic for sequence-based methods; rather, any methodology that produces a unique feature set per protein and uses simple combinations of these features to create machine learning data seems to make predictions based mostly on underlying biases in the generated PPI datasets. Only when a small number of more complex features are computed from pairs of proteins, instead of from individual proteins, do models see a significant improvement beyond bias when positive interaction data constitute the rare class.
In conclusion, we compared the P-R performance of HiPPIP with the best-performing sequence-based and annotation-based methods (Figure 4A) and found that HiPPIP outperforms all of them significantly. In this evaluation, for all predictors, known PPIs from the HPRD dataset were also included when assessing whether a prediction is a true positive. HiPPIP was evaluated on two sets of test data, each containing 10 datasets: in one set, the data were created in the same fashion as the Full data (‘HiPPIP-a’ in Figure 4A), and in the other, the positive test instances were taken from BioGRID as published in January 2017 (‘HiPPIP-b’ in Figure 4A); Figure 4A shows results aggregated across the datasets, and Figure 4B,C show the individual datasets of the two sets. One caveat applies: the specific pairs used in training HiPPIP are unavailable, so there may be some overlap with training data; based on the number of known PPIs in BioGRID and considering that 20,000 PPIs were used in training HiPPIP, we estimate that the HiPPIP-a and HiPPIP-b test datasets may have 24% and 14% overlap in positive instances, respectively, whereas the overlap for negative data is negligible.