Article

SAV-Pred: A Freely Available Web Application for the Prediction of Pathogenic Amino Acid Substitutions for Monogenic Hereditary Diseases Studied in Newborn Screening

by Anton D. Zadorozhny 1, Anastasia V. Rudik 2, Dmitry A. Filimonov 2 and Alexey A. Lagunin 1,2,*
1 Department of Bioinformatics, Pirogov Russian National Research Medical University, 117997 Moscow, Russia
2 Department of Bioinformatics, Institute of Biomedical Chemistry, 119121 Moscow, Russia
* Author to whom correspondence should be addressed.
Int. J. Mol. Sci. 2023, 24(3), 2463; https://0-doi-org.brum.beds.ac.uk/10.3390/ijms24032463
Submission received: 29 December 2022 / Revised: 22 January 2023 / Accepted: 23 January 2023 / Published: 27 January 2023
(This article belongs to the Special Issue State-of-the-Art Molecular Genetics and Genomics in Russia)

Abstract

Next Generation Sequencing (NGS) technologies are rapidly entering clinical practice. A promising area for their use lies in the field of newborn screening. The mass screening of newborns using NGS technology leads to the discovery of a large number of new missense variants that need to be assessed for association with the development of hereditary diseases. Currently, the primary analysis and identification of pathogenic variations is carried out using bioinformatic tools. Although extensive efforts have been made in the computational approach to variant interpretation, there is currently no generally accepted pathogenicity predictor. In this study, we used the sequence–structure–property relationships (SSPR) approach, based on the representation of protein fragments by molecular structural formula. The approach predicts the pathogenic effect of single amino acid substitutions in proteins associated with twenty-five monogenic heritable diseases from the Uniform Screening Panel for Major Conditions recommended by the Advisory Committee on Hereditary Disorders in Newborns and Children. To create SSPR classification models, we modified a piece of cheminformatics software, MultiPASS, that was originally developed for the prediction of activity spectra for drug-like substances. The created SSPR models were compared with traditional bioinformatic tools (SIFT 4G, PolyPhen-2 HDIV, MutationAssessor, PROVEAN and FATHMM). The average AUC of our approach was 0.804 ± 0.040. Better quality scores were achieved for 15 of 25 proteins, with a significantly higher accuracy for some proteins (IVD, HADHB, HBB). The best SSPR classification models are freely available in the online resource SAV-Pred (Single Amino acid Variants Predictor).

1. Introduction

Newborn screening (NBS) is an important, high-priority, globally accepted public health program. All newborns are advised to undergo blood spot screening, also known as the heel prick test, to detect inherited diseases that become severe after an asymptomatic period. The overall detection rate is up to 1 in 500 births [1]. The testing is intended to provide an early diagnosis and treatment before significant and otherwise inevitable damage ensues. The core conditions panels mainly include monogenic autosomal recessive disorders, most of which are inborn errors of metabolism. The conditions may be detected by biochemical analysis, tandem mass spectrometry and immunoassay techniques as well as DNA-based methods [2].
Over the past few years, next-generation sequencing (NGS) technologies have been actively implemented in the clinic. As the cost of sequencing decreased, the field of application increased, leading to the first cases of using NGS in NBS [3,4]. Since NGS has a high throughput and can identify the majority of genetic defects, DNA sequencing has the capability to become a suitable NBS method. At the same time, the increasing screening rate and the availability of NGS technologies contribute to the detection of new variants without a clinical interpretation. In addition, NGS may expand the existing panels to other diseases as it does not require special protocols and reagents to obtain a result.
Clinical interpretation of variants involves multiple evidence categories: population data, functional studies, and clinical presentations. As an outcome, a genetic variant is assigned a pathogenic class if it causes a disease, or a benign class if it is shown to have no such relationship. Quite often, the criteria point in opposite directions, so that the variant ends up classified as a variant of uncertain significance (VUS) or receives conflicting interpretations [2]. Such variants cannot assist in making medical decisions.
For variants classified as VUS, as well as for unclassified ones, preliminary pathogenicity estimates can be obtained using computational tools (e.g., PolyPhen-2 [5], SIFT [6], MutationAssessor [7]). The most common genetic alterations requiring clinical classification are missense variants. Missense variants modify codons, resulting in a change of the encoded amino acid (a.a.). In turn, the change affects the protein primary structure, which is the basis of the secondary, tertiary, and quaternary structures, and may disrupt protein function. The existing bioinformatics predictors are trained on heterogeneous datasets, which may lead to a decreased prediction accuracy for specific clinically important genes [8,9].
Here, we introduce SAV-Pred—a public web-application to predict the effect of single amino acid variants (SAVs) for 25 core conditions from a newborn screening panel. This work is intended to present the sequence–structure–property relationships (SSPR) analysis of a.a. substitutions and their surroundings in specific proteins to predict the clinical effect of the variants as an additional interpretation.

2. Results

2.1. SAV-Pred Contents and Comparison with Other Bioinformatic Tools

Disease-related proteins were selected from the Uniform Screening Panel for Major Conditions recommended by the Advisory Committee on Hereditary Disorders in Newborns and Children and approved by committees of the American College of Obstetricians and Gynecologists (ACOG) [2]. The panel includes the following disease groups: congenital organic acid/amino acid/fatty acid metabolic errors, hemoglobinopathies, and various multisystem disorders such as cystic fibrosis or hypothyroidism. For these monogenic diseases the benefits of screening and treatment availability have been confirmed. Thus, the SSPR approach can be applied to them.
The data selection scheme is shown in Figure 1. The final set included 25 proteins with a total of 2124 missense variants that already had a clinical classification (see Materials and Methods). For many of the proteins, the databases contained too few benign variants to train classifiers. For instance, there was only one benign variant for PAH and two benign variants each for the HADHB and HMGCL genes (Table 1, column “B”). Therefore, 8397 polymorphisms unrelated to pathological conditions were added to the negative class after manual curation. The resulting numbers of SAVs in the training datasets are shown in Table 1.
For each of the proteins, 195 SSPR models were created by combining 15 levels of the multi-level neighborhoods of atoms (MNA) descriptors (from 1 to 15) with 13 peptide lengths (odd numbers of a.a. from 7 to 31) (see Materials and Methods). The most accurate SSPR models in terms of the area under the receiver operating characteristic curve (AUC) obtained in leave-one-out (LOO-CV) and 20-fold cross-validation (20F-CV) procedures were selected, and their parameters are presented in Table 1. Twenty-four SSPR models exceeded the accuracy threshold of 0.7. For such conditions as isovaleric acidemia, hemoglobinopathies, and trifunctional protein deficiency, the AUC values of the created models were greater than 0.9. Only the SSPR model for galactose-1-phosphate uridylyltransferase displayed an AUC20F-CV value of less than 0.7 (0.686). This may be linked to contradictions in the clinical classification data caused by Duarte galactosemia, which differs from classical galactosemia in that patients have only a partial GALT deficiency.
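The parameter grid described above can be sketched in a few lines of Python. This is only an illustration of how 15 MNA levels and 13 odd peptide lengths give 195 combinations per protein; the function names are hypothetical, and the AUC values would come from MultiPASS cross-validation rather than from this code.

```python
from itertools import product

MNA_LEVELS = range(1, 16)          # 15 descriptor levels: 1..15
PEPTIDE_LENGTHS = range(7, 32, 2)  # 13 odd peptide lengths: 7, 9, ..., 31


def parameter_grid():
    """Return every (peptide_length, mna_level) pair evaluated for one protein."""
    return list(product(PEPTIDE_LENGTHS, MNA_LEVELS))


def select_best(auc_by_params):
    """Return the parameter pair with the highest cross-validated AUC.

    auc_by_params maps (peptide_length, mna_level) -> AUC. The values are
    supplied by the caller (hypothetical input, not produced here).
    """
    return max(auc_by_params, key=auc_by_params.get)


if __name__ == "__main__":
    print(len(parameter_grid()))  # 13 lengths x 15 levels
```

Selecting by the maximum AUC mirrors the selection rule stated in the text; how ties are broken in the original workflow is not specified, so this sketch simply keeps the first maximum.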
The best SSPR models were compared with established bioinformatic tools: SIFT 4G, PolyPhen-2 HDIV, MutationAssessor, PROVEAN, and FATHMM [5,6,7,10,11] (Table 2). The same approach was used in our previous study [12]. For these methods, we obtained scores of SAV effects from dbNSFP4.1a [13] for almost all proteins and calculated the AUC. In quantitative terms, our approach (SAV-Pred) was the most accurate for 15 proteins. For three genes, HADHB, HBB, and IVD, the prediction accuracy was over 0.9, while for the alternative methods it did not exceed 0.796. The remaining models are inferior to the best of the other methods, but not by much, and stay roughly within the average accuracy range. At the same time, SAV-Pred achieved the highest average AUC (0.804 ± 0.040; CI95%), which corresponds to the previous results [12].
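As a small worked check of the comparison for HADHB, HBB, and IVD, the sketch below takes the AUC values for these three genes from Table 2 and finds the best-scoring alternative tool for each. The dictionary layout and the function name are illustrative assumptions; `None` marks tools with no predictions for a protein.

```python
# AUC values copied from Table 2 for the three genes named in the text.
TABLE2_SUBSET = {
    "HADHB": {"SAV-Pred": 0.961, "SIFT 4G": 0.596, "PolyPhen-2 HDIV": 0.635,
              "MutationAssessor": 0.569, "PROVEAN": 0.739, "FATHMM": 0.603},
    "HBB": {"SAV-Pred": 0.903, "SIFT 4G": 0.707, "PolyPhen-2 HDIV": 0.796,
            "MutationAssessor": 0.725, "PROVEAN": 0.686, "FATHMM": 0.635},
    "IVD": {"SAV-Pred": 0.906, "SIFT 4G": 0.695, "PolyPhen-2 HDIV": None,
            "MutationAssessor": None, "PROVEAN": 0.751, "FATHMM": 0.555},
}


def best_alternative(row):
    """Return (tool, auc) for the best tool other than SAV-Pred in one table row."""
    scores = {tool: auc for tool, auc in row.items()
              if tool != "SAV-Pred" and auc is not None}
    tool = max(scores, key=scores.get)
    return tool, scores[tool]


if __name__ == "__main__":
    for gene, row in TABLE2_SUBSET.items():
        tool, auc = best_alternative(row)
        print(gene, row["SAV-Pred"], tool, auc)
```

For these three genes the best alternative AUC is 0.796 (PolyPhen-2 HDIV on HBB), which is the upper bound quoted in the text.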

2.2. SAV-Pred Web Application

The best SSPR models became the basis for the creation of the freely available web application, SAV-Pred (Single Amino acid Variants Predictor), hosted at the way2drug.com portal (http://www.way2drug.com/SAV-Pred/) (accessed on 29 December 2022).
Figure 2 illustrates an example of the output window with predictions for three single amino acid substitutions. The substitutions were published in the ClinVar [14] database after May 2022 and did not belong to the training sets. The predicted effect shown in the “Annotation” column is consistent with the current clinical classification. The values in the Confidence column are calculated as Pa − Pi (see Materials and Methods) for the prediction of the pathogenic effect. Positive values of Confidence mean that the queried a.a. substitutions may belong to the class of pathogenic substitutions. The higher the Confidence value, the higher the probability that the variant is pathogenic. Negative values of Confidence mean that the queried a.a. substitutions may belong to the class of benign substitutions. The more negative the Confidence value, the more likely the variant is benign. When analyzing the prediction results, one should also take into account the prediction accuracy of the corresponding SSPR model, given in the last column (AUC). The columns in the table with prediction results may be sorted, and filter fields are provided under each column. The table also contains references to the description of diseases in OMIM as well as protein identifiers in UniProt [15]. The left side of the screen shows the protein sequence with the location of the substitution and the replaced letter highlighted. The user can select the protein and substitution of interest manually with the “Input” button, or they can load a query list of substitutions in the following format:
<gene name> <position> <a.a. substitution>
The prediction results can be saved as a file in the CSV or XLS formats, or simply copied. The data on composition, the datasets, and AUC values are also provided.
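A query list in the format above could be prepared and checked before upload with a short script such as the following. This is a minimal sketch under stated assumptions: the fields are taken to be separated by whitespace, the position is taken to be a 1-based integer, and the substitution token is kept as an opaque string because its exact notation is not specified here. The `Query` type and `parse_query_line` function are hypothetical helpers, not part of SAV-Pred; the gene list is taken from Table 1.

```python
from typing import NamedTuple

# The 25 genes covered by the SSPR models (Table 1).
SUPPORTED_GENES = {
    "ABCD1", "ACADM", "ACADVL", "ASL", "ASS1", "BTD", "CFTR", "FAH", "GAA",
    "GALT", "GCDH", "HADHA", "HADHB", "HBB", "HLCS", "HMGCL", "IDUA", "IVD",
    "MCCC1", "MCCC2", "MMUT", "PAH", "PCCB", "SLC22A5", "TSHR",
}


class Query(NamedTuple):
    gene: str
    position: int
    substitution: str


def parse_query_line(line):
    """Parse one '<gene name> <position> <a.a. substitution>' line."""
    parts = line.split()
    if len(parts) != 3:
        raise ValueError(f"expected 3 fields, got {len(parts)}: {line!r}")
    gene, position, substitution = parts
    if gene not in SUPPORTED_GENES:
        raise ValueError(f"gene not covered by the models: {gene}")
    try:
        pos = int(position)
    except ValueError:
        raise ValueError(f"position is not an integer: {position!r}") from None
    if pos < 1:
        raise ValueError("position must be a positive integer (assumed 1-based)")
    return Query(gene, pos, substitution)


if __name__ == "__main__":
    # Illustrative input only; the substitution notation is an assumption.
    print(parse_query_line("PAH 408 W"))
```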

3. Discussion

In this paper, we present a new freely available web-based application, SAV-Pred. Twenty-five SSPR models were created to identify amino acid substitutions related to monogenic heritable diseases recommended for universal newborn screening by calculating and interpreting pathogenicity scores. The models are Naïve Bayesian classifiers trained on descriptors of the structural properties of peptide fragments, thus linking the effect to the primary structures of the proteins. Since the secondary/tertiary/quaternary structures, physicochemical, and functional properties of proteins also depend on the primary sequence, SSPR models take them into account indirectly.
In summary, the SSPR models achieved accuracy comparable to, and often exceeding, that of the individual methods. For example, the developed predictors outperformed the widely used tools SIFT 4G in 16/24 cases and PolyPhen-2 HDIV in 16/22 cases. Depending on the method and the protein, SSPR models and individual bioinformatics tools outperform each other to varying degrees, in keeping with previous studies [16,17]. However, protein-specific datasets are often unbalanced due to a lack of annotated variants, and this may have a negative impact on protein-specific predictors. The absence of differences in AUC between the leave-one-out and twenty-fold cross-validations, as well as the similar average accuracy to the previous study, suggest the robustness of the obtained classifiers (Table 1).
Based on the best SSPR models, we have created a web application, SAV-Pred, which is freely available at http://www.way2drug.com/SAV-Pred/ (accessed on 28 December 2022). In future versions of the application, SAV features such as secondary structure parameters and evolutionary data are going to be used as descriptors to increase the predictor’s accuracy. Additionally, we are going to apply the approach to the secondary conditions table and other similar diagnostic panels.

4. Materials and Methods

4.1. Datasets Collection

Of 32 core conditions from the ACOG screening panel, 24 monogenic diseases were chosen and 25 associated genes were found based on the OMIM database (accessed on 10 January 2022) (Table 1). The annotated data on missense variants related to the known genes, including clinical significance, variant supporting evidence, and protein allele, were obtained from the ClinVar [14] (accessed on 14 January 2022), humsavar [15] (accessed on 14 January 2022), LOVD [18] (accessed on 12 January 2022), and dbSNP [19] (accessed on 14 January 2022) databases using the BioMart data mining tool [20] (accessed on 14 January 2022) (Figure 1). SAVs currently classified as pathogenic or likely pathogenic constituted the positive class, and substitutions that were interpreted as benign/likely benign, as well as all those that were in no way related to the phenotype/disease, constituted the negative class. Based on the known annotated SAVs and the corresponding protein sequence, we created datasets containing fixed-length peptides (from 7 to 31 a.a. in the peptide) built from the substitution and its a.a. surroundings, represented as structural formulas in the MOL V3000 format, plus their effect indicators (0 = benign, 1 = pathogenic). A similar algorithm was used earlier for the prediction of phosphorylation sites in proteins [21]. Amino acid surroundings were taken from canonical reference protein sequences from the UniProt [15] (accessed on 3 February 2022) database by related positions.
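The peptide extraction step can be illustrated with the following sketch. It makes several assumptions that are not spelled out in the text: the substituted residue sits in the center of the odd-length window, the peptide carries the substituted (not the reference) residue, and windows that would run past either end of the protein are skipped. The function name is hypothetical, and the conversion of peptides to MOL V3000 structural formulas is outside the scope of this example.

```python
def peptide_window(sequence, position, new_residue, length):
    """Return the peptide of `length` residues centered on a substituted position.

    sequence    -- reference protein sequence (one-letter codes)
    position    -- 1-based position of the substitution
    new_residue -- one-letter code of the substituting amino acid
    length      -- odd peptide length between 7 and 31
    Returns None when the window does not fit inside the sequence
    (assumption: such cases are skipped rather than padded).
    """
    if length % 2 == 0 or not 7 <= length <= 31:
        raise ValueError("length must be an odd number from 7 to 31")
    if not 1 <= position <= len(sequence):
        raise ValueError("position is outside the sequence")
    flank = length // 2
    idx = position - 1
    start, end = idx - flank, idx + flank + 1
    if start < 0 or end > len(sequence):
        return None
    mutated = sequence[:idx] + new_residue + sequence[idx + 1:]
    return mutated[start:end]


if __name__ == "__main__":
    toy_sequence = "ACDEFGHIKLMNPQRSTVWY"  # toy sequence, not a real protein
    print(peptide_window(toy_sequence, 10, "P", 7))
```

Each resulting peptide would then be paired with its effect indicator (0 for benign, 1 for pathogenic) to form one training record.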

4.2. Building the SSPR Models

Classification models were created and validated in MultiPASS (version 2022, Institute of Biomedical Chemistry, Moscow, Russia), a modified command line version of the Prediction of Activity Spectra for Substances (PASS) software [12,21,22,23], which allows one to use different levels (up to 15) of Multilevel Neighborhoods of Atoms (MNA) descriptors to describe the structural formula of peptides [19]. Each of the fifteen MNA levels was used to build an individual SSPR model on each of the thirteen datasets with different peptide fragment lengths. PASS prediction results are originally presented as a list of predicted characteristics of molecules with Pa (probability of “to be active”) and Pi (probability of “to be inactive”) values. In this study, the Pa value is the probability that the peptide with the a.a. substitution belongs to the class of pathogenic variants, and the Pi value is the probability that the peptide with the a.a. substitution does not belong to the class of pathogenic variants.
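The way Pa and Pi are turned into the Confidence value reported by the web application can be written down as a short sketch. The difference Pa − Pi and the meaning of its sign follow the description in Section 2.2; the label strings and the handling of a value of exactly zero are assumptions made for this illustration.

```python
def confidence(pa, pi):
    """Confidence of the pathogenic prediction, computed as Pa - Pi."""
    return pa - pi


def annotate(pa, pi):
    """Map the sign of the Confidence value to a class label.

    Positive values point to the pathogenic class, negative values to the
    benign class. Returning "undetermined" for exactly zero is an assumption.
    """
    value = confidence(pa, pi)
    if value > 0:
        return "pathogenic"
    if value < 0:
        return "benign"
    return "undetermined"


if __name__ == "__main__":
    # Illustrative probabilities, not model output.
    print(annotate(0.8, 0.1), annotate(0.2, 0.6))
```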
Multilevel Neighborhoods of Atoms (MNA) descriptors were used for the descriptions of molecular structures. The MNA descriptor is a representation of an atom-centered fragment of a molecule in the form of a string of characters. The level of the MNA descriptor reflects the order of proximity. Figure 3 shows an example of the representation of the first three levels for a carbon atom marked with a gray circle. Thus, the structural and physicochemical properties of molecules are embedded in the MNA descriptors. Similar to our previous work [12], descriptors from levels 1 to 15 were used for the creation of SSPR models.
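The idea of atom-centered descriptors that grow with the level can be illustrated with a simplified recursive sketch. It assumes that the level 0 descriptor is the atom label itself and that each higher level lists the previous-level descriptors of the immediate neighbors in sorted order. The atom labels, the string format, and the toy molecule (methanol with explicit hydrogens) are illustrative assumptions; the atom typing and notation actually used in MultiPASS may differ.

```python
def mna_descriptor(atoms, bonds, atom, level):
    """Build a nested neighborhood string for one atom.

    atoms -- dict: atom index -> element label
    bonds -- dict: atom index -> list of neighbor indices
    level -- 0 returns the atom label; level n wraps the sorted
             level n-1 descriptors of all immediate neighbors.
    """
    if level == 0:
        return atoms[atom]
    neighbors = sorted(
        mna_descriptor(atoms, bonds, n, level - 1) for n in bonds[atom]
    )
    return atoms[atom] + "(" + "".join(neighbors) + ")"


# Toy molecule: methanol, CH3-OH, with explicit hydrogens.
ATOMS = {0: "C", 1: "O", 2: "H", 3: "H", 4: "H", 5: "H"}
BONDS = {0: [1, 2, 3, 4], 1: [0, 5], 2: [0], 3: [0], 4: [0], 5: [1]}

if __name__ == "__main__":
    for level in range(3):
        print(level, mna_descriptor(ATOMS, BONDS, 0, level))
```

A full description of a peptide at a given level would be the set of such strings generated for all of its atoms.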

4.3. Validation and Performance Assessment

SSPR models based on datasets with an appropriate length of peptides and a level of MNA descriptors were created and selected based on the leave-one-out and 20-fold cross-validation procedures implemented in MultiPASS. For every disease (protein), the SSPR model with the highest area under the ROC curve (AUC) value was chosen. To compare the SSPR models with individual methods (SIFT 4G, PolyPhen-2 HDIV, MutationAssessor, PROVEAN and FATHMM), we took their scores from dbNSFP (accessed on 9 October 2022) and used the sklearn.metrics package [24] in Python 3.9 to calculate AUC as a statistical indicator of accuracy. In doing so, we used the thresholds recommended by the authors to obtain protein-related AUC values.
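For readers who want to see what the AUC measures, the sketch below computes it with the standard library only, as the fraction of pathogenic/benign pairs in which the pathogenic variant receives the higher score (ties count as one half). For binary labels this is the same quantity that `roc_auc_score` from sklearn.metrics reports. The function name and the toy data are illustrative assumptions. For predictors in which lower scores indicate a damaging effect (SIFT is one example), the scores would need to be negated first.

```python
def roc_auc(labels, scores):
    """Rank-based AUC for binary labels (1 = pathogenic, 0 = benign).

    Equals the probability that a randomly chosen pathogenic variant is
    scored higher than a randomly chosen benign one; ties count as 0.5.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one variant of each class")
    wins = sum(
        (p > n) + 0.5 * (p == n) for p in positives for n in negatives
    )
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    # Toy data: two pathogenic and three benign variants.
    y_true = [1, 1, 0, 0, 0]
    y_score = [0.9, 0.4, 0.5, 0.2, 0.1]
    print(round(roc_auc(y_true, y_score), 3))
```

The pairwise form is quadratic in the number of variants, which is acceptable for protein-specific sets of a few hundred SAVs.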

Author Contributions

Methodology, data extraction and curation, investigation, writing—original draft preparation, A.D.Z.; conceptualization, methodology, supervision, writing—review and editing, A.A.L.; software, methodology, writing—review and editing, D.A.F.; web application, writing—review and editing, A.V.R. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by grant 075-15-2019-1789 from the Ministry of Science and Higher Education of the Russian Federation.

Data Availability Statement

Training datasets are available at http://www.way2drug.com/sav-pred/description.html (accessed on 28 December 2022) as SD and CSV files.

Acknowledgments

We thank the Center for Precision Genome Editing and Genetic Technologies for Biomedicine, Pirogov Russian National Research Medical University, Moscow, Russia, for the use of its computer infrastructure during the study; the Ministry of Science and Higher Education of the Russian Federation for supporting this work through Grant 075-15-2019-1789; and the Way2drug.com portal for supporting the SAV-Pred web application.

Conflicts of Interest

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations

NGS: Next Generation Sequencing
SNP: Single Nucleotide Polymorphism
SAV: Single Amino acid Variant
VUS: Variant of Uncertain Significance
SAR: Structure-Activity Relationships
SSPR: Sequence-Structure-Property Relationships
MNA: Multi-level Neighborhoods of Atoms
MultiPASS: Modified command line version of Prediction of Activity Spectra for Substances
SAV-Pred: Single Amino acid Variants Predictor
SIFT: Sorting Intolerant From Tolerant
PolyPhen-2: Polymorphism Phenotyping v2
PROVEAN: Protein Variation Effect Analyzer
Mutation Assessor: Functional impact of protein mutations
FATHMM: Functional Analysis through Hidden Markov Models
NBS: Newborn Screening
ACOG: American College of Obstetricians and Gynecologists
db: Database
OMIM: Online Mendelian Inheritance in Man
ClinVar: Public archive of reports of the relationships among human variations and phenotypes
LOVD: Leiden Open Variation Database
UniProt: Leading high-quality resource of protein sequence and functional information
humsavar: All missense variants annotated in UniProtKB/Swiss-Prot human entries
gnomAD: The Genome Aggregation Database
TOPMed: The Trans-Omics for Precision Medicine program
dbNSFP: Functional prediction and annotation of all potential missense variants in humans
SDF: Structured Data File
AUC: Area under the receiver operating characteristic curve
LOO-CV: Leave-One-Out Cross-Validation
20F-CV: 20-Fold Cross-Validation
ABCD1: ATP binding cassette subfamily D member 1
ACADM: Acyl-CoA dehydrogenase medium chain
ACADVL: Acyl-CoA dehydrogenase very long chain
ASL: Argininosuccinate lyase
ASS1: Argininosuccinate synthase 1
BTD: Biotinidase
CFTR: Cystic Fibrosis transmembrane conductance regulator
FAH: Fumarylacetoacetate hydrolase
GAA: Alpha glucosidase
GALT: Galactose-1-phosphate uridylyltransferase
GCDH: Glutaryl-CoA dehydrogenase
HADHA: Hydroxyacyl-CoA dehydrogenase trifunctional multienzyme complex subunit alpha
HADHB: Hydroxyacyl-CoA dehydrogenase trifunctional multienzyme complex subunit beta
HBB: Hemoglobin subunit beta
HLCS: Holocarboxylase synthetase
HMGCL: 3-hydroxy-3-methylglutaryl-CoA lyase
IDUA: Alpha-L-iduronidase
IVD: Isovaleryl-CoA dehydrogenase
MCCC1: Methylcrotonyl-CoA carboxylase subunit 1
MCCC2: Methylcrotonyl-CoA carboxylase subunit 2
MMUT: Methylmalonyl-CoA mutase
PAH: Phenylalanine hydroxylase
PCCB: Propionyl-CoA carboxylase subunit beta
SLC22A5: Solute carrier family 22 member 5
TSHR: Thyroid stimulating hormone receptor

References

  1. Feuchtbaum, L.; Carter, J.; Dowray, S.; Currier, R.J.; Lorey, F. Birth prevalence of disorders detectable through newborn screening by race/ethnicity. Genet Med. 2012, 14, 937–945.
  2. Newborn screening and the role of the obstetrician–gynecologist. ACOG Committee Opinion No. 778. American College of Obstetricians and Gynecologists. Obstet. Gynecol. 2019, 133, e357–e361.
  3. Olszowiec-Chlebna, M.; Mospinek, E.; Jerzynska, J. Impact of newborn screening for cystic fibrosis on clinical outcomes of pediatric patients: 10 years’ experience in Lodz Voivodship. Ital. J. Pediatr. 2021, 47, 87.
  4. McInnes, G.; Sharo, A.G.; Koleske, M.L.; Brown, J.E.H.; Norstad, M.; Adhikari, A.N.; Wang, S.; Brenner, S.E.; Halpern, J.; Koenig, B.A.; et al. Opportunities and challenges for the computational interpretation of rare variation in clinically important genes. Am. J. Hum. Genet. 2021, 108, 535–548.
  5. Adzhubei, I.; Jordan, D.M.; Sunyaev, S.R. Predicting functional effect of human missense mutations using PolyPhen-2. Curr. Protoc. Hum. Genet. 2013, Chapter 7, Unit 7.20.
  6. Vaser, R.; Adusumalli, S.; Leng, S. SIFT missense predictions for genomes. Nat. Protoc. 2016, 11, 1–9.
  7. Reva, B.; Antipin, Y.; Sander, C. Predicting the functional impact of protein mutations: Application to cancer genomics. Nucleic Acids Res. 2011, 39, e118.
  8. López-Ferrando, V.; Gazzo, A.; de la Cruz, X.; Orozco, M.; Gelpí, J.L. PMut: A web-based tool for the annotation of pathological variants on proteins, 2017 update. Nucleic Acids Res. 2017, 45, W222–W228.
  9. Grimm, D.G.; Azencott, C.A.; Aicheler, F.; Gieraths, U.; MacArthur, D.G.; Samocha, K.E.; Cooper, D.N.; Stenson, P.D.; Daly, M.J.; Smoller, J.W.; et al. The evaluation of tools used to predict the impact of missense variants is hindered by two types of circularity. Hum. Mutat. 2015, 36, 513–523.
  10. Choi, Y.; Sims, G.E.; Murphy, S.; Miller, J.R.; Chan, A.P. Predicting the functional effect of amino acid substitutions and indels. PLoS ONE 2012, 7, e46688.
  11. Shihab, H.A.; Gough, J.; Cooper, D.N.; Day, I.N.; Gaunt, T.R. Predicting the functional consequences of cancer-associated amino acid substitutions. Bioinformatics 2013, 29, 1504–1510.
  12. Zadorozhnyy, A.; Smirnov, A.; Filimonov, D.; Lagunin, A. Prediction of pathogenic single amino acid substitutions using molecular fragment descriptors. Bioinformatics 2022, unpublished data.
  13. Liu, X.; Li, C.; Mou, C.; Dong, Y.; Tu, Y. dbNSFP v4: A comprehensive database of transcript-specific functional predictions and annotations for human nonsynonymous and splice-site SNVs. Genome Med. 2020, 12, 103.
  14. Landrum, M.J.; Lee, J.M.; Benson, M.; Brown, G.R.; Chao, C.; Chitipiralla, S.; Gu, B.; Hart, J.; Hoffman, D.; Jang, W.; et al. ClinVar: Improving access to variant interpretations and supporting evidence. Nucleic Acids Res. 2018, 46, 1062–1067.
  15. The UniProt Consortium. UniProt: The universal protein knowledgebase in 2021. Nucleic Acids Res. 2021, 49, 480–489.
  16. Riera, C.; Padilla, N.; de la Cruz, X. The Complementarity Between Protein-Specific and General Pathogenicity Predictors for Amino Acid Substitutions. Hum. Mutat. 2016, 37, 1013–1024.
  17. Crockett, D.K.; Lyon, E.; Williams, M.S.; Narus, S.P.; Facelli, J.C.; Mitchell, J.A. Utility of gene-specific algorithms for predicting pathogenicity of uncertain gene variants. J. Am. Med. Inform. Assoc. 2012, 19, 207–211.
  18. Fokkema, I.F.; Taschner, P.E.; Schaafsma, G.C.; Celli, J.; Laros, J.F.; den Dunnen, J.T. LOVD v.2.0: The next generation in gene variant databases. Hum. Mutat. 2011, 32, 557–563.
  19. Sherry, S.T.; Ward, M.H.; Kholodov, M.; Baker, J.; Phan, L.; Smigielski, E.M.; Sirotkin, K. dbSNP: The NCBI database of genetic variation. Nucleic Acids Res. 2001, 29, 308–311.
  20. Kinsella, R.J.; Kähäri, A.; Haider, S.; Zamora, J.; Proctor, G.; Spudich, G.; Almeida-King, J.; Staines, D.; Derwent, P.; Kerhornou, A.; et al. Ensembl BioMarts: A hub for data retrieval across taxonomic space. Database (Oxford) 2011. Published online July 23.
  21. Karasev, D.A.; Savosina, P.I.; Sobolev, B.N.; Filimonov, D.A.; Lagunin, A.A. Application of molecular descriptors for recognition of phosphorylation sites in amino acid sequences. Biomed. Khim. 2017, 63, 423–427.
  22. Filimonov, D.A.; Lagunin, A.A.; Gloriozova, T.A.; Rudik, A.V.; Druzhilovskii, D.S.; Pogodin, P.V.; Poroikov, V.V. Prediction of the Biological Activity Spectra of Organic Compounds Using the Pass Online Web Resource. Chem. Heterocycl. Comp. 2014, 50, 444–457.
  23. Lagunin, A.; Stepanchikova, A.; Filimonov, D.; Poroikov, V. PASS: Prediction of activity spectra for biologically active substances. Bioinformatics 2000, 16, 747–748.
  24. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine Learning in Python. JMLR 2011, 12, 2825–2830.
Figure 1. Illustration of the project workflow. SAVs—single amino acid variants, P—pathogenic, LP—likely pathogenic, B—benign, LB—likely benign.
Figure 2. SAV-Pred web page with prediction results for the input example. On the left part of the screen, the input list form contains gene name, sample counts in the training set, associated disease and protein sequence with marked red substitution. The result table with confidence score as well as its interpretation and ROC-AUC metrics are located on the right side. The examples were published in the ClinVar database after May 2022 and were not included in the training sets of the appropriate SSPR models. All three predictions are consistent with the current clinical classification in the ClinVar database.
Figure 3. The example of 0–3 levels of MNA (Multilevel Neighborhoods of Atoms) descriptors is shown for the carbon atom of alanine in the polypeptide chain fragment. The numbers in the structural formula show the most distant atoms included in the descriptor of the related level of MNA descriptors. The appropriate descriptors of the chosen level are generated for all atoms in the structural formula. Such description helps to depict the linear structure of peptides completely and explicitly.
Table 1. The list of investigated proteins with associated diseases, data on training sets, and parameters of SSPR models.

| Gene | Disease | OMIM | UniProt | B | P | B+ | Total | PL | MNA | AUC (LOO-CV) | AUC (20F-CV) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ABCD1 | X-linked adrenoleukodystrophy | 300371 | P33897 | 31 | 58 | 306 | 395 | 19 | 9 | 0.849 | 0.839 |
| ACADM | Medium-chain acyl-CoA dehydrogenase deficiency | 607008 | P11310-1 | 3 | 63 | 253 | 319 | 9 | 7 | 0.792 | 0.793 |
| ACADVL | Very long-chain acyl-CoA dehydrogenase deficiency | 609575 | P49748-1 | 9 | 91 | 382 | 482 | 19 | 10 | 0.800 | 0.801 |
| ASL | Argininosuccinic aciduria | 608310 | P04424-1 | 9 | 29 | 288 | 326 | 7 | 9 | 0.850 | 0.853 |
| ASS1 | Homocystinuria Citrullinemia, type I | 603470 | P00966 | 10 | 25 | 162 | 197 | 13 | 6 | 0.787 | 0.792 |
| BTD | Biotinidase deficiency | 609019 | P43251-1 | 5 | 133 | 317 | 455 | 17 | 15 | 0.849 | 0.830 |
| CFTR | Cystic fibrosis | 219700 | P13569-1 | 56 | 350 | 697 | 1103 | 17 | 11 | 0.781 | 0.787 |
| FAH | Tyrosinemia, type I | 613871 | P10253-1 | 4 | 15 | 248 | 267 | 29 | 3 | 0.843 | 0.837 |
| GAA | Glycogen Storage Disease Type II (Pompe) | 606800 | P10253-1 | 53 | 72 | 353 | 478 | 13 | 11 | 0.742 | 0.733 |
| GALT | Classic galactosemia | 606999 | P07902-1 | 5 | 119 | 120 | 244 | 23 | 4 | 0.695 | 0.686 |
| GCDH | Glutaric acidemia type I | 608801 | Q92947-1 | 5 | 58 | 208 | 271 | 21 | 15 | 0.703 | 0.707 |
| HADHA | Long-chain L-3 hydroxyacyl-CoA dehydrogenase deficiency | 600890 | Q96RQ3 | 12 | 9 | 476 | 497 | 9 | 11 | 0.813 | 0.808 |
| HADHB | Trifunctional protein deficiency | 143450 | P50747-1 | 2 | 14 | 309 | 325 | 17 | 5 | 0.961 | 0.961 |
| HBB | Hemoglobinopathies | 141900 | P68871 | 27 | 149 | 79 | 255 | 7 | 7 | 0.912 | 0.903 |
| HLCS | Holocarboxylase synthase deficiency | 609018 | P40939-1 | 17 | 12 | 463 | 492 | 7 | 8 | 0.776 | 0.776 |
| HMGCL | 3-Hydroxy-3-methylglutaric aciduria | 613898 | P35914-1 | 2 | 6 | 188 | 196 | 9 | 8 | 0.740 | 0.714 |
| IDUA | Mucopolysaccharidosis type 1 | 252800 | P35475-1 | 19 | 46 | 556 | 621 | 29 | 15 | 0.890 | 0.853 |
| IVD | Isovaleric acidemia | 607036 | P26440 | 6 | 30 | 326 | 362 | 13 | 11 | 0.908 | 0.906 |
| MCCC1 | 3-Methylcrotonyl-CoA carboxylase deficiency | 609010 | P16930-1 | 12 | 16 | 449 | 477 | 7 | 12 | 0.764 | 0.754 |
| MCCC2 | 3-Methylcrotonyl-CoA carboxylase deficiency | 609014 | Q9HCC0-1 | 5 | 25 | 411 | 441 | 23 | 15 | 0.814 | 0.797 |
| MMUT | Methylmalonic acidemia | 609058 | P22033-1 | 8 | 70 | 355 | 433 | 29 | 9 | 0.712 | 0.712 |
| PAH | Classic phenylketonuria | 612349 | P00439 | 1 | 288 | 131 | 420 | 11 | 7 | 0.798 | 0.798 |
| PCCB | Propionic acidemia β-ketothiolase deficiency | 232050 | P05166-1 | 4 | 26 | 490 | 520 | 17 | 12 | 0.794 | 0.796 |
| SLC22A5 | Carnitine uptake defect/transport defect | 603377 | O76082-1 | 9 | 68 | 319 | 396 | 9 | 6 | 0.870 | 0.875 |
| TSHR | Primary congenital hypothyroidism | 603372 | P16473-1 | 8 | 30 | 511 | 549 | 19 | 3 | 0.803 | 0.764 |

B: benign variants in the sets; P: pathogenic variants in the sets; B+: benign variants that initially did not have clinical classification; Total: B + P + B+; AUC (LOO-CV): AUC obtained by the leave-one-out validation procedure; AUC (20F-CV): AUC obtained by the twenty-fold cross-validation procedure; PL (peptide length) and MNA (the level of MNA descriptors): parameters of sequence–structure–property relationships (SSPR) models.
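Table 2. Accuracy comparison of the tools in predicting single amino acid substitution effects in proteins related to neonatal diagnosis.

| Protein | SAV-Pred AUC (20F-CV) | SAV-Pred % | SIFT 4G AUC | SIFT 4G % | PolyPhen-2 HDIV AUC | PolyPhen-2 HDIV % | Mutation Assessor AUC | Mutation Assessor % | PROVEAN AUC | PROVEAN % | FATHMM AUC | FATHMM % |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ABCD1 | 0.839 | 100 | 0.886 | 99 | 0.868 | 99 | 0.878 | 99 | 0.872 | 99 | 0.734 | 99 |
| ACADM | 0.793 | 100 | 0.585 | 95 | 0.657 | 95 | 0.619 | 95 | 0.664 | 95 | 0.549 | 95 |
| ACADVL | 0.801 | 100 | 0.734 | 97 | 0.761 | 97 | 0.693 | 97 | 0.652 | 97 | 0.609 | 97 |
| ASL | 0.853 | 100 | 0.783 | 98 | 0.841 | 98 | 0.738 | 98 | 0.795 | 98 | 0.659 | 98 |
| ASS1 | 0.792 | 100 | 0.711 | 100 | 0.721 | 100 | 0.815 | 100 | 0.754 | 100 | 0.635 | 100 |
| BTD | 0.830 | 100 | - | 0 | 0.792 | 96 | 0.797 | 96 | - | 0 | - | 0 |
| CFTR | 0.787 | 100 | 0.678 | 100 | 0.727 | 100 | 0.702 | 100 | 0.706 | 100 | 0.516 | 100 |
| FAH | 0.837 | 100 | 0.848 | 99 | 0.863 | 99 | 0.850 | 99 | 0.838 | 99 | 0.651 | 99 |
| GAA | 0.733 | 100 | 0.762 | 99 | 0.821 | 100 | 0.824 | 100 | 0.802 | 99 | 0.690 | 100 |
| GALT | 0.686 | 100 | 0.711 | 100 | 0.736 | 100 | 0.724 | 97 | 0.721 | 100 | 0.534 | 100 |
| GCDH | 0.707 | 100 | 0.751 | 100 | 0.758 | 100 | 0.751 | 100 | 0.687 | 100 | 0.519 | 100 |
| HADHA | 0.808 | 100 | 0.856 | 99 | 0.790 | 99 | 0.873 | 97 | 0.774 | 99 | 0.572 | 99 |
| HADHB | 0.961 | 100 | 0.596 | 98 | 0.635 | 98 | 0.569 | 98 | 0.739 | 98 | 0.603 | 98 |
| HBB | 0.903 | 100 | 0.707 | 99 | 0.796 | 99 | 0.725 | 99 | 0.686 | 99 | 0.635 | 99 |
| HLCS | 0.776 | 100 | 0.766 | 98 | 0.751 | 98 | 0.699 | 98 | 0.716 | 98 | 0.645 | 98 |
| HMGCL | 0.714 | 100 | 0.877 | 99 | 0.877 | 99 | 0.872 | 99 | 0.829 | 99 | 0.796 | 99 |
| IDUA | 0.853 | 100 | 0.745 | 100 | 0.722 | 100 | 0.733 | 100 | 0.744 | 100 | 0.609 | 100 |
| IVD | 0.906 | 100 | 0.695 | 96 | - | 0 | - | 0 | 0.751 | 96 | 0.555 | 96 |
| MCCC1 | 0.754 | 100 | 0.697 | 98 | 0.695 | 98 | 0.734 | 90 | 0.632 | 98 | 0.500 | 98 |
| MCCC2 | 0.797 | 100 | 0.637 | 95 | 0.601 | 95 | 0.611 | 95 | 0.574 | 95 | 0.581 | 95 |
| MMUT | 0.712 | 100 | 0.768 | 100 | - | 0 | - | 0 | 0.762 | 100 | 0.680 | 100 |
| PAH | 0.798 | 100 | 0.769 | 98 | 0.766 | 98 | 0.796 | 98 | 0.762 | 98 | 0.728 | 98 |
| PCCB | 0.796 | 100 | 0.790 | 96 | 0.773 | 96 | 0.831 | 96 | 0.725 | 96 | 0.540 | 96 |
| SLC22A5 | 0.875 | 100 | 0.725 | 97 | 0.776 | 97 | 0.780 | 97 | 0.786 | 97 | 0.624 | 97 |
| TSHR | 0.764 | 100 | 0.659 | 99 | - | 0 | - | 0 | 0.697 | 99 | 0.491 | 99 |
| Mean | 0.803 | 100 | 0.739 | 94 | 0.760 | 86 | 0.755 | 85 | 0.736 | 94 | 0.611 | 94 |

AUC: area under the receiver operating characteristic curve; AUC (20F-CV): AUC obtained by the twenty-fold cross-validation procedure; %: percentage of predicted SAVs (for the other methods, it was calculated based on the data from dbNSFP4.1a); a dash marks a tool with no predictions for the protein (0%).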
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
