4.2. Data Mining, Preprocessing and Candidate GWGENs Construction
In our research, the dataset with accession number GSE81608 was downloaded from the gene expression omnibus (GEO) of the National Center for Biotechnology Information (NCBI), and its relevant experimental platform was GPL16791. The dataset contained mRNA expression levels of genes, proteins, miRNAs, TFs, receptors, and lncRNAs in pancreatic α-cell, β-cell, δ-cell, and PP-cell. The samples of the dataset were assorted into two categories, i.e., T2D and non-T2D. In this study, to identify the T2D pathogenic mechanism on the pancreatic β-cell, the samples of the subtype β-cell were specifically extracted from the original experimental data. Furthermore, according to the WHO report, the age distribution of incidence in diabetes is at the range of approximately 50 years old and older. Therefore, 86 and 123 samples were chosen respectively for T2D and non-T2D with age equal to or greater than to 50 years old. Then, we constructed the candidate PPIN based on the Database of Interacting Proteins (DIP) [
56], IntAct [
57], the Biological General Repository for Interaction Datasets database (BioGRID) [
58], the Biomolecular Interaction Network Database (BIND) [
59], and the Molecular INTeraction Database (MINT) [
60]. In addition, the candidate GRN was built based on the Integrated Transcription Factor Platform database (ITFP) [
61], the Human Transcriptional Regulation Interactions database (HTRI) [
62], and the TRANScription FACtor database (TRANSFAC) [
63]. MiRNAs and lncRNAs regulations in GRN were referenced to the TargetScanHuman database [
64], CircuitsDB [
65], and StarBase2.0 [
66].
4.3. Constructing the Systematic Model for the Candidate GWGEN of T2D and Non-T2D
For the purpose of imitating the human cellular system, we built the stochastic interactive and regulatory models to describe the candidate GWGEN. The candidate GWGEN was composed of PPIN containing the protein–protein interactions and GRN containing the regulations of genes, miRNAs, and lncRNAs. Next, we described the interactions of proteins and regulations of genes, lncRNAs, and miRNAs using the protein–protein interactive model (PPIM), gene regulatory model (GRM), lncRNA regulatory model (LRM), and miRNA regulatory model (MRM) in detail.
First, the q-th protein in PPIM can be described as the following equations:
where
indicates the expression level of the q-th protein in the n-th sample and
indicates the expression level of the r-th protein in the n-th sample;
denotes the interaction ability between the q-th protein and the r-th protein;
represents the total number of proteins that interact with the q-th protein;
denotes the total number of proteins in candidate PPIM; N means the total number of samples in our data;
shows the basal level in the model of the q-th protein due to unknown protein interactions of histone modifications such as phosphorylation and acetylation; and
expresses the data noise of the q-th protein.
Second, the transcriptional regulation of the x-th gene in GRM is given as below:
where
denotes the expression level of the x-th gene in the n-th sample;
,
, and
individually indicate the expression level of the u-th TF, the v-th lncRNA and the w-th miRNA of the n-th sample;
,
, and
separately mean the total binding number of TFs, lncRNAs and miRNAs;
shows the transcriptional regulatory ability from the u-th TF to the x-th gene;
represents the transcriptional regulatory ability from the v-th lncRNA to the x-th gene;
expresses the post-transcriptional regulatory ability of the w-th miRNA on the x-th gene;
denotes the total number of gene in GRNs;
indicates the total number of data samples;
means the basal level of the x-th gene because of the unknown gene regulations such as methylation; and
is the data noise.
Third, TFs, lncRNAs, and miRNAs also have a potential impact on the i-th lncRNA and we can depict this behavior by the LRM in candidate GWGENs. The equation is obtained as follows:
where
indicates the expression level of the i-th lncRNA;
,
, and
represent the expression level of the u-th TF, the v-th lncRNA, and the w-th miRNA of the n-th sample, respectively;
,
, and
individually show the total binding number of TFs, lncRNAs and miRNAs.
expresses the transcriptional regulatory ability from the u-th TF to the i-th lncRNA;
means the transcriptional regulatory ability from the v-th lncRNA to the i-th lncRNA;
denotes the post-transcriptional regulatory ability from the w-th miRNA to the i-th lncRNA;
is the total number of lncRNAs and
indicates the total number of samples;
denotes the basal level of the i-th lncRNA;
expresses the data noise.
Fourth, the expression of the j-th miRNA is also affected by the TFs, lncRNAs, and miRNAs. Furthermore, we can illustrate MRM in candidate GWGENs through the following equation:
where
means the expression level of j-th miRNA;
,
, and
separately represent the expression level of the u-th TF, the v-th lncRNA and the w-th miRNA, respectively;
,
, and
show the binding total number of TFs, lncRNAs and miRNAs;
denotes the transcriptional regulatory ability from the u-th TF to the j-th miRNA;
expresses the transcriptional regulatory ability from the v-th lncRNA to the j-th miRNA;
indicates the post-transcriptional regulatory ability from the w-th miRNA to the j-th miRNA;
is the total number of miRNAs and
indicates the total number of samples;
is the basal level of the j-th miRNA;
denotes the data noise.
4.4. The System Identification and System Order Detection Methods for Real GWGENs Identification
According to the above stochastic models, PPIM in (1) composed the candidate PPIN; GRM in (2), LRM in (3), and MRM in (4) constituted the candidate GRN. We made use of the system identification and system order detection methods to obtain the real GWGENs of T2D and non-T2D by the corresponding RNA-seq data, respectively. In order to identify the parameters of these stochastic models, Equations (1)–(4) could separately be rewritten as the following linear regression forms.
for
, where (5), (6), (7), and (8) are separately regression forms for PPIM, GRM, LRM, and MRM.
,
,
, and
are respectively the total number of proteins, genes, lncRNAs and miRNAs in the candidate GWGWN, and
is the total number of samples.
The linear regression forms in (5), (6), (7), and (8) could be simplified as the following formulas:
where the
,
,
, and
individually denote the regression vectors of proteins, gene, lncRNAs, and miRNAs in the n-th sample;
means the parameter vector of the protein-protein interaction abilities and protein basal levels;
,
, and
are the parameter vector of the transcriptional regulatory abilities and basal levels of the genes, lncRNAs, and miRNAs, respectively;
,
,
, and
are individually the data noise for PPIM, GRM, LRM and MRM.
For N samples, the above regression equations are given as below:
The above equations could be individually represented as the follows:
where
,
,
, and
are separately the regression matrix of proteins, genes, lncRNAs and miRNAs of N samples.
,
,
, and
are the corresponding interactive and regulatory parameter vectors.
,
,
, and
are the corresponding data noise vectors.
What is worth noticing is that the maximum degree of the parameter estimation of proteins in PPIs and genes in GRNs must be less than the samples; otherwise, it would cause the overfitting problem during the process of system identification.
Firstly, for the purpose of identifying the real GWGENs, we adopted the least square method to estimate the parameter vectors
,
,
, and
with negative regulation constraint on miRNA as follows:
Based on the above constrained optimization problems in Equations (21)–(24), we sought out the optimal solution of the interactive ability parameters among proteins
, the regulatory parameters of genes
, lncRNAs
and miRNAs
via the RNA-seq data of non-T2D and T2D, respectively. The above optimization problems for parameter estimation could be solved by the MATLAB optimization toolbox. Carefully, the negative inequality constraints in Equations (21)–(24) mean that the regulatory parameters of miRNAs should be less than or equal to zero to ensure the negative regulation of miRNAs on genes, lncRNAs and miRNAs.
After the parameter estimation of candidate GWGENs of non-T2D and T2D by the corresponding RNA-seq data, we used the system order detection method, AIC, to detect the system order (the number of interactions of each protein or the number of regulations of each gene, lncRNA and miRNA). The detailed equations of AIC for each protein, gene, lncRNA and miRNA are shown below.
where
means the estimated residual error of the q-th protein for the least square parameter estimation
in (21) and
denotes the number of protein interactions with the q-th protein.
where
denotes the estimated residual error of the x-th gene in (22) and
means the number of regulations of the genes, lncRNAs and miRNAs on the x-th gene;
is the estimated parameters in (22).
where
shows the estimated residual error of the i-th lncRNA in (23) and
indicates the number of regulations of the genes, lncRNAs and miRNAs on the i-th lncRNA;
expresses the estimated parameters in (23).
where
expresses the estimated residual error of the j-th miRNA in (24) and
represents the number of parameters regulations of the genes, lncRNAs and miRNAs on the j-th miRNA;
is the estimated parameter in (24).
According to the order detection of AIC in system identification [
67], the real order of a system (i.e., the number of interactions of the q-th protein in (1) or the number of regulations on the x-th in (2)) is to minimize the AIC. Therefore, the true number of interactions or regulations for each protein, gene, lnRNA and miRNA in candidate GWGENs can be obtained by solving the following AIC minimization problems.
where
denoted the true number of protein interactions for the q-th protein;
individually indicate the true number of regulations of genes, lncRNAs and miRNAs on the x-th gene;
denote the true number of regulations of genes, lncRNAs and miRNAs on the i-th lncRNA, respectively;
are separately the true number of regulations of genes, lncRNAs and miRNAs on the j-th miRNA. Therefore, the protein–protein interactions and gene, miRNA, and lncRNA regulations out of true order by AIC minimization problems in (29)–(32) are considered as false positives in candidate GWGEN of non-T2D and T2D and should be removed one by one to obtain the real GWGEN.
4.5. The Principal Network Projection (PNP) Method for the Core GWGENs Extraction from Real GWGENs
The real GWGENs of non-T2D and T2D were compared to investigate the genetic and epigenetic pathogenic molecular mechanism. However, it was still harder to analyze the two larger scale and complicated real GEGENs so that we applied the principal network projection (PNP) method on the basis of the singular value decomposition (SVD) to extract the core GWGENs from the real GWGENs. Before studying the core network extraction in depth, we will start by introducing the real GWGEN network matrix
. Network matrix
consists of interactions among proteins and regulations of the TF-gene, TF-lncRNA, TF-miRNA, lncRNA-gene, lncRNA-lncRNA, lncRNA-miRNA, miRNA-gene, miRNA-lncRNA, and miRNA-miRNA in the real GWGEN as the follows:
where
denotes the sub-matrix of PPI of which the bidirectional arrow at the subscript of the parameter means that the protein interaction is bidirectional;
,
,
,
,
,
,
,
and
denote the transcriptional regulatory sub-networks of TFs on genes, lncRNAs, and miRNAs; lncRNAs on genes, lncRNAs, and miRNAs; and miRNAs on genes, lncRNAs, and miRNAs, respectively. The detail components of network matrix
of real GWGENs are given below:
where
is the interaction ability of between the q-th protein and the r-th protein;
,
, and
are individually the regulation abilities of the u-th TF on the x-th gene, the v-th lncRNA on the x-th gene, and the w-th miRNA on the x-th gene;
,
, and
represent the regulation abilities of the u-th TF on the i-th lncRNA, the v-th lncRNA on the i-th lncRNA, and the w-th miRNA on the i-th lncRNA, respectively;
,
, and
separately show the regulation abilities of the u-th TF on the j-th miRNA, the v-th lncRNA on the j-th miRNA, and the w-th miRNA on the w-th miRNA. In addition, some zeros are omitted in the matrix, which means that there is neither interaction nor regulation between the source and target.
Thereafter, the core GWGENs were obtained by applying PNP on the network matrix
with an energy threshold of 85%. First, the network matrix
is decomposed by singular value decomposition (SVD) as follows [
68]:
where
and
are the unitary singular matrices;
denotes the diagonal matrix of which the components at the diagonal are the singular values of
and are arranged in descending order, i.e.,
.
In addition, we defined the normalization of singular values in (36) as below.
From the above formula, the top I significant singular vector structures were selected to represent the system with energy equal to or more than 85%. Then we respectively projected each node of the real GWGEN (i.e., each row of network matrix H) to the top I singular vectors as follows.
where
denotes the projection value of the
-th node on the
-th significant singular vector;
means the a-th row vector of network matrix
, and
denotes the the b-th column of
. Next, we define the 2-norm projection value to each node such as protein, gene, lnRNA and miRNA in real GWGEN from the top I significant singular vectors as below.
According to the equation in (40), the top 3000 pivotal proteins, genes, miRNAs, and lncRNAs with higher projection value were selected to construct the core GWGENs for T2D and non-T2D, respectively. Afterwards, the core GWGENs were uploaded to the DAVID website for KEGG pathway enrichment analysis, and the construction of core signaling pathways for non-T2D and T2D were accomplished with the help of the annotation of KEGG pathways. The enrichment analysis was used to validate that our results were associated with T2D. Eventually, the potential biomarkers were chosen through investigating the T2D pathogenesis by comparing the non-T2D and T2D core signaling pathways.
4.6. Systematic Drug Discovery Based on Drug Design Specifications for T2D
Based on drug design specifications, we aimed to discover a potential multiple-molecule targeting drug for the identified biomarkers. We proposed a DTI model based on a deep neural network to predict the drug–target interaction between the available drugs and targets (biomarkers). Since it is not enough to consider the drug–target interaction alone for drug design, some specifications, i.e., regulation ability, toxicity, sensitivity and side effect are necessary to sieve the candidate drugs predicted by the DTI model. Then, with these considerations, we suggested an appropriate multiple-molecule targeting drug for T2D treatment before clinical trials.
First, based on the flowchart of the systematic drug discovery method in
Figure 3, we accessed an integrated collection of protein–ligand affinity data through BindingDB’s unified interface [
69], which harvests the selected data and information from multiple existing databases, i.e., PubChem, ChEMBL, UniProt, DrugBank, etc. (for more details, readers can refer to
Appendix B). Recently, the feature-based method, for instance, molecular descriptor, is broadly used to describe the structural and chemical properties of molecules such as characteristics from the 2D and 3D spectrum of structure, molecular weight, hydrophilic, hydrophobicity, etc. It was validated that the chemical properties of the drug and genomic sequence of the target could be described with the molecular descriptor for the purpose of convenient analysis in drug design, since the molecular descriptor can transform complicated chemical properties into a simple numerical feature vector [
70,
71]. On this ground, we utilized the functions from python package pyBioMed to transform both the drug and target into a descriptor as their features individually under the python2.7 environment. The considered drug features of a molecule included constitutional descriptors, connectivity indices, E-state indices, charge descriptors, molecular properties and kappa shape indices. For the target features, the structural and physicochemical features of proteins and peptides from amino acid sequence such as amino acid composition, dipeptide composition…, etc. are calculated (for more detailed information about the descriptor transformation, readers could access the documents of pyBioMed [
72]). Then, the descriptor of the drug and target were combined into a feature vector
vdrug-target corresponding to the drug–target pair as below [
73]:
Among
, 363 features for a drug and 996 features for a target were collected, where the former features in
vdrug-target are for the drug and the latter are for the target.
represents the first drug feature;
indicates the first target feature; M is the total number of drug features; and N denotes the total number of target features. Before training the DNN-based DTI model, we encountered a problem that features are not in the same standing. Since the variables of the features are measured at different scales, they do not contribute equally to the model fitting and might end up creating a bias, i.e., the feature with a larger value would dominate the result. To deal with this potential problem, a feature-wise scaling is usually implemented prior to model fitting. As powerful techniques of feature scaling, Min-max Normalization and Standardization methods are commonly used for bringing every feature in the same footing without any upfront importance. Although Min-max Normalization can also normalize the data into the same scale, it is much more sensitive to outliers compared to Standardization. Therefore, Standardization was performed on the features before applying principal component analysis (PCA) to improve the model performance, and the corresponding mathematical formulation is shown as follows:
where
is the i-th drug feature and the
is the i-th drug feature after Standardization;
and
individually represent the mean and standard deviation of the i-th drug feature;
indicates the j-th feature of the target and
denotes the j-th feature of the target after Standardization;
and
separately signify the mean and standard deviation of the j-th target feature; M denotes the total number of drug features; and N is the total number of target features.
where
and
denote input and output, respectively;
is the weighting matrix and
is the bias vector;
indicates the activation function with Rectified Linear Unit (ReLU) in the hidden layer and Sigmoid in the output layer. Since the binary classification issue is concerned, the binary-cross entropy is chosen as the cost function to calculate the model loss:
where
means the truth label of positive interaction;
indicates the predictive probability of positive interaction,
shows the truth label of negative interaction, and
represents the predicted probability of negative interaction.
denotes the average of total loss
. According to the cost function, the backward propagation algorithm is applied to update the model parameter set
containing the weighting matrix and bias vector through calculating the gradient of cost function in (46) to get the result in (50) and eventually derive the optimal solution
in (48) as follows.
where l is the l-th epoch of learning procedure;
is the learning rate; and
is the gradient of
as below:
Based on the backward propagation method, the DNN-based DTI model could adjust the parameters to fit the drug–target interaction data at each iteration well. In addition, the hyperparameters were tuned to not only lower the training time but also achieve the best model performance. We used Adam [
74] as an optimizer with a default setting and set the learning rate as 0.0001 to make the model parameter
converge faster and precisely. We set 100 for epochs and 100 for batch size. For the data, we split one-fourth of the data as testing data and three-fourths of it as training data. Moreover, we further divided the training data into ten equal folds to perform ten-fold cross-validation, in which nine-tenths of them were used for model training and one-tenth was used for validation. Such application is exploited to supervise whether the model was better than that of the former epoch and to guarantee the model stability. Furthermore, to avoid overfitting, not only did we embed the dropout layer (dropout rate = 0.4) behind each hidden layer but also applied the early stopping strategy to monitor whether the test accuracy decreased with the continuous improvement of training accuracy or not. After accomplishing model training as shown in
Figure 4, we adopted the AUC (area under the curve) score and ROC (receiver operating characteristics) curve [
75] in
Figure 7 as the performance measurement. It is one of the most useful evaluation metrics to visualize the model performance when it comes to the binary classification problems. The higher AUC score is that in which the area under the line is larger; the better accuracy is for the DNN-based DTI model predicting the true positive and true negative drug–target interaction. The formulas for the AUC score and ROC curve are shown below.
where TP (True Positive) means that the real value is true and is judged correctly; TN (True Negative) shows that the real value is true and is judged by mistake; FP (False Positive) indicates the real value is false and is judged accurately; FN (False Negative) represents that the real value is false and is judged in error.
It is worth noting that the majority of previous network approaches use machine learning (ML)-based methods to perform predictions over the drug–target interaction space [
76,
77,
78]. However, such techniques have major limitations. Traditional ML is a time-consuming process and requires lots of expertise to design and run the algorithms. Without a good understanding of the domain knowledge and feature engineering, a traditional machine algorithm can hardly work well.
As a kind of ML-based model with multiple hidden layers and a more complicated parameter training procedure, the deep learning method attracts lots of attention for its relatively better performance and ability to learn representations of data with multiple levels of abstraction [
79]. When there is a lack of domain understanding for feature introspection, deep learning techniques outshine others as we do not have to worry much about feature engineering. Additionally, the comparison of deep-learning methods with other acceptable ML algorithms in the task of new DTIs identification has previously been performed as well, where framework based on deep learning could indeed achieve relatively high prediction performance [
80]. As a result, for each algorithm compared in our work, only default parameters without fine-tuning were set to learn features from the data. However, a disadvantage should be solved that there are no experimental validated noninteracting drug–target pairs so that it is difficult to select negative samples, which would largely influence the predictive accuracy of the method [
81]. Hence, apart from extracting a great number of samples from the presently largest database, BingdingDB, we further followed the criteria in
Appendix B to abstract negative examples from existing drug–target interactions, which enabled us to evaluate and manipulate the data more realistically to achieve better performance.