Statistical Methods for the Analysis of Genomic Data

A special issue of Genes (ISSN 2073-4425). This special issue belongs to the section "Technologies and Resources for Genetics".

Deadline for manuscript submissions: closed (30 November 2019) | Viewed by 23903

Printed Edition Available!
A printed edition of this Special Issue is available.

Special Issue Editors

Dr. Hui Jiang
Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109-2029, USA
Interests: statistical genomics; bioinformatics; statistical computing
Dr. Zhi He
Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109-2029, USA
Interests: bioinformatics; optimization; statistical genomics; survival analysis

Special Issue Information

Dear Colleagues,

In recent years, technological breakthroughs have greatly enhanced our ability to understand the complex world of molecular biology. Rapid developments in genomic profiling techniques, such as high-throughput sequencing, have brought new opportunities and challenges to the fields of computational biology and bioinformatics. Furthermore, by combining genomic profiling techniques with other experimental techniques, many powerful approaches (e.g., RNA-Seq, ChIP-Seq, single-cell assays, Hi-C) have been developed to help explore complex biological systems. As genomic datasets grow in both volume and variety, their analysis has become a critical challenge as well as a topic of interest. Consequently, statistical methods that address the problems associated with these newly developed techniques are in high demand. This Special Issue will highlight state-of-the-art statistical methods for the analysis of genomic data and explore potential future directions for improvement.

Dr. Hui Jiang
Dr. Zhi He
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Genes is an international peer-reviewed open access monthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2600 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • High-throughput sequencing
  • Single-cell genomics
  • Metagenomics
  • Genomic data integration
  • Computational biology
  • Bioinformatics
  • Statistical genomics
  • Gene expression
  • Gene regulation
  • Biomarker discovery
  • Gene network
  • Functional genomics
  • Precision medicine

Published Papers (8 papers)

Editorial

4 pages, 158 KiB  
Editorial
Statistics in the Genomic Era
by Hui Jiang and Kevin He
Genes 2020, 11(4), 443; https://doi.org/10.3390/genes11040443 - 18 Apr 2020
Viewed by 2195
Abstract
In recent years, technology breakthroughs have greatly enhanced our ability to understand the complex world of molecular biology [...]
(This article belongs to the Special Issue Statistical Methods for the Analysis of Genomic Data)

Research

23 pages, 996 KiB  
Article
Model-Based Clustering with Measurement or Estimation Errors
by Wanli Zhang and Yanming Di
Genes 2020, 11(2), 185; https://doi.org/10.3390/genes11020185 - 10 Feb 2020
Cited by 13 | Viewed by 3388
Abstract
Model-based clustering with finite mixture models has become a widely used clustering method. One of the recent implementations is MCLUST. When objects to be clustered are summary statistics, such as regression coefficient estimates, they are naturally associated with estimation errors, whose covariance matrices can often be calculated exactly or approximated using asymptotic theory. This article proposes an extension to Gaussian finite mixture modeling, called MCLUST-ME, that properly accounts for the estimation errors. More specifically, we assume that the distribution of each observation consists of an underlying true component distribution and an independent measurement error distribution. Under this assumption, each unique value of estimation error covariance corresponds to its own classification boundary, which consequently results in a different grouping from MCLUST. Through simulation and application to an RNA-Seq data set, we found that, under certain circumstances, explicitly modeling estimation errors improves clustering performance or provides new insights into the data compared with simply ignoring the errors, and that the degree of improvement depends on factors such as the distribution of error covariance matrices.
(This article belongs to the Special Issue Statistical Methods for the Analysis of Genomic Data)
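
The measurement-error assumption in the abstract above can be illustrated with a minimal one-dimensional sketch. It is not the MCLUST-ME implementation: the mixture parameters are fixed, hypothetical values rather than estimates, and the function names are assumptions introduced here. If an observation is a draw from component k plus independent Gaussian error with known variance, its marginal variance under component k is the component variance plus the error variance, so the membership probabilities depend on each observation's own error variance.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def membership_probs(x, err_var, weights, means, variances):
    """Posterior component probabilities for a single observation.

    The observation is modelled as (true value from component k) + (independent
    error with variance err_var), so its marginal variance under component k
    is variances[k] + err_var.
    """
    dens = [
        w * normal_pdf(x, m, v + err_var)
        for w, m, v in zip(weights, means, variances)
    ]
    total = sum(dens)
    return [d / total for d in dens]


# Hypothetical two-component mixture: one tight component, one diffuse component.
weights = [0.5, 0.5]
means = [0.0, 3.0]
variances = [0.25, 4.0]

x = 1.0
precise = membership_probs(x, 0.0, weights, means, variances)  # no estimation error
noisy = membership_probs(x, 9.0, weights, means, variances)    # large estimation error

print("error variance 0:", precise)
print("error variance 9:", noisy)
```

With these hypothetical values, the same observed value is assigned to the diffuse component when it is measured exactly and to the tight component when it is measured with large error, which is the sense in which each error covariance has its own classification boundary.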

18 pages, 5430 KiB  
Article
Testing Differential Gene Networks under Nonparanormal Graphical Models with False Discovery Rate Control
by Qingyang Zhang
Genes 2020, 11(2), 167; https://doi.org/10.3390/genes11020167 - 05 Feb 2020
Cited by 4 | Viewed by 1850
Abstract
The nonparanormal graphical model has emerged as an important tool for modeling dependency structure between variables because it accommodates non-Gaussian data while maintaining the good interpretability and computational convenience of Gaussian graphical models. In this paper, we consider the problem of detecting differential substructure between two nonparanormal graphical models with false discovery rate control. We construct a new statistic based on a truncated estimator of the unknown transformation functions, together with a bias-corrected sample covariance. Furthermore, we show that the new test statistic converges to the same distribution as its oracle counterpart does. Both synthetic data and real cancer genomic data are used to illustrate the promise of the new method. Our proposed testing framework is simple and scalable, facilitating its application to large-scale data. The computational pipeline has been implemented in the R package DNetFinder, which is freely available through the Comprehensive R Archive Network.
(This article belongs to the Special Issue Statistical Methods for the Analysis of Genomic Data)

15 pages, 4274 KiB  
Article
Local Epigenomic Data are more Informative than Local Genome Sequence Data in Predicting Enhancer-Promoter Interactions Using Neural Networks
by Mengli Xiao, Zhong Zhuang and Wei Pan
Genes 2020, 11(1), 41; https://doi.org/10.3390/genes11010041 - 29 Dec 2019
Cited by 5 | Viewed by 2943
Abstract
Enhancer-promoter interactions (EPIs) are crucial for transcriptional regulation. Mapping such interactions proves useful for understanding disease regulation and discovering risk genes in genome-wide association studies. Some previous studies showed that machine learning methods, as computational alternatives to costly experimental approaches, performed well in predicting EPIs from local sequence and/or local epigenomic data. In particular, deep learning methods were demonstrated to outperform traditional machine learning methods, and using DNA sequence data alone could perform either better than or almost as well as only utilizing epigenomic data. However, most, if not all, of these previous studies were based on randomly splitting enhancer-promoter pairs into training, tuning, and test data, which has recently been pointed out to be problematic: because enhancers (and promoters) are duplicated or overlap across enhancer-promoter pairs in EPI data, such random splitting does not lead to independent training, tuning, and test data, thus resulting in model over-fitting and over-estimated predictive performance. Here, after correcting this design issue, we extensively studied the performance of various deep learning models with local sequence and epigenomic data around enhancer-promoter pairs. Our results confirmed much lower performance using either sequence or epigenomic data alone, or both, than reported previously. We also demonstrated that local epigenomic features were more informative than local sequence data. Our results were based on an extensive exploration of many convolutional neural network (CNN) and feed-forward neural network (FNN) structures, and of gradient boosting as a representative of traditional machine learning.
(This article belongs to the Special Issue Statistical Methods for the Analysis of Genomic Data)
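
The data-splitting issue described in the abstract above can be sketched as follows. This is an illustration only, not the authors' procedure: it assumes that holding out whole chromosomes is an acceptable way to keep overlapping enhancers and promoters from appearing in both sets, and the record fields and the helper `split_by_group` are hypothetical names introduced here.

```python
import random


def split_by_group(pairs, group_of, test_fraction=0.2, seed=0):
    """Split records so that every group falls entirely in train or entirely in test.

    pairs:     list of records (here, enhancer-promoter pairs)
    group_of:  function mapping a record to its group label (here, chromosome)
    """
    groups = sorted({group_of(p) for p in pairs})
    rng = random.Random(seed)
    rng.shuffle(groups)
    # Hold out at least one whole group.
    n_test = max(1, round(test_fraction * len(groups)))
    test_groups = set(groups[:n_test])
    train = [p for p in pairs if group_of(p) not in test_groups]
    test = [p for p in pairs if group_of(p) in test_groups]
    return train, test


# Toy, made-up records: two pairs on chr1 share the same enhancer.
pairs = [
    {"enhancer": "chr1:1000-1500", "promoter": "chr1:9000-9500", "chrom": "chr1"},
    {"enhancer": "chr1:1000-1500", "promoter": "chr1:20000-20500", "chrom": "chr1"},
    {"enhancer": "chr2:3000-3500", "promoter": "chr2:8000-8500", "chrom": "chr2"},
    {"enhancer": "chr2:4000-4500", "promoter": "chr2:8000-8500", "chrom": "chr2"},
    {"enhancer": "chr3:500-900", "promoter": "chr3:7000-7400", "chrom": "chr3"},
]

train, test = split_by_group(pairs, lambda p: p["chrom"])
train_chroms = {p["chrom"] for p in train}
test_chroms = {p["chrom"] for p in test}
print("train chromosomes:", sorted(train_chroms))
print("test chromosomes:", sorted(test_chroms))
```

A purely random split of these toy records could place the two chr1 pairs that share an enhancer on opposite sides of the split; grouping by chromosome rules that out by construction, at the cost of less control over the exact test-set size.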

20 pages, 937 KiB  
Article
Penalized Variable Selection for Lipid–Environment Interactions in a Longitudinal Lipidomics Study
by Fei Zhou, Jie Ren, Gengxin Li, Yu Jiang, Xiaoxi Li, Weiqun Wang and Cen Wu
Genes 2019, 10(12), 1002; https://doi.org/10.3390/genes10121002 - 03 Dec 2019
Cited by 9 | Viewed by 3032
Abstract
Lipid species are critical components of eukaryotic membranes. They play key roles in many biological processes such as signal transduction, cell homeostasis, and energy storage. Investigations of lipid–environment interactions, in addition to the lipid and environment main effects, have important implications for understanding lipid metabolism and related changes in phenotype. In this study, we developed a novel penalized variable selection method to identify important lipid–environment interactions in a longitudinal lipidomics study. An efficient Newton–Raphson-based algorithm was proposed within the generalized estimating equation (GEE) framework. We conducted extensive simulation studies to demonstrate the superior performance of our method over alternatives, in terms of both identification accuracy and prediction performance. As weight control via dietary calorie restriction and exercise has been demonstrated to prevent cancer in a variety of studies, analysis of the high-dimensional lipid datasets collected using 60 mice from the skin cancer prevention study identified meaningful markers that provide fresh insight into the underlying mechanism of cancer preventive effects.
(This article belongs to the Special Issue Statistical Methods for the Analysis of Genomic Data)

13 pages, 579 KiB  
Article
Detection of Differentially Methylated Regions Using Bayes Factor for Ordinal Group Responses
by Fengjiao Dunbar, Hongyan Xu, Duchwan Ryu, Santu Ghosh, Huidong Shi and Varghese George
Genes 2019, 10(9), 721; https://doi.org/10.3390/genes10090721 - 17 Sep 2019
Cited by 2 | Viewed by 2543
Abstract
Researchers in genomics are increasingly interested in epigenetic factors such as DNA methylation, because they play an important role in regulating gene expression without changes in the DNA sequence. There have been significant advances in developing statistical methods to detect differentially methylated regions (DMRs) associated with binary disease status. Most of these methods are being developed for detecting differential methylation rates between cases and controls. We consider multiple severity levels of disease, and develop a Bayesian statistical method to detect regions with increasing (or decreasing) methylation rates as the disease severity increases. Patients are classified into more than two groups, based on the disease severity (e.g., stages of cancer), and DMRs are detected by using moving windows along the genome. Within each window, the Bayes factor is calculated to test the hypothesis of monotonic increase in methylation rates corresponding to severity of the disease versus no difference. A mixed-effect model is used to incorporate the correlation of methylation rates of nearby CpG sites in the region. Results from extensive simulation indicate that our proposed method is statistically valid and reasonably powerful. We demonstrate our approach on a bisulfite sequencing dataset from a chronic lymphocytic leukemia (CLL) study.
(This article belongs to the Special Issue Statistical Methods for the Analysis of Genomic Data)

13 pages, 285 KiB  
Article
A Pathway-Based Kernel Boosting Method for Sample Classification Using Genomic Data
by Li Zeng, Zhaolong Yu and Hongyu Zhao
Genes 2019, 10(9), 670; https://doi.org/10.3390/genes10090670 - 31 Aug 2019
Cited by 3 | Viewed by 2612
Abstract
The analysis of cancer genomic data has long suffered from “the curse of dimensionality.” Sample sizes for most cancer genomic studies are a few hundred at most, while tens of thousands of genomic features are studied. Various methods have been proposed to leverage prior biological knowledge, such as pathways, to more effectively analyze cancer genomic data. Most of the methods focus on testing marginal significance of the associations between pathways and clinical phenotypes. They can identify informative pathways but do not involve predictive modeling. In this article, we propose a Pathway-based Kernel Boosting (PKB) method for integrating gene pathway information for sample classification, where we use kernel functions calculated from each pathway as base learners and learn the weights through iterative optimization of the classification loss function. We apply PKB and several competing methods to three cancer studies with pathological and clinical information, including tumor grade, stage, tumor sites and metastasis status. Our results show that PKB outperforms other methods and identifies pathways relevant to the outcome variables.
(This article belongs to the Special Issue Statistical Methods for the Analysis of Genomic Data)

19 pages, 1492 KiB  
Article
Integrative Analysis of Cancer Omics Data for Prognosis Modeling
by Shuaichao Wang, Mengyun Wu and Shuangge Ma
Genes 2019, 10(8), 604; https://doi.org/10.3390/genes10080604 - 09 Aug 2019
Cited by 4 | Viewed by 3237
Abstract
Prognosis modeling plays an important role in cancer studies. With the development of omics profiling, extensive research has been conducted to search for prognostic markers for various cancer types. However, many of the existing studies share a common limitation: they focus on a single cancer type and suffer from a lack of sufficient information. With potential molecular similarity across cancer types, one cancer type may contain information useful for the analysis of other types. The integration of multiple cancer types may facilitate information borrowing so as to more comprehensively and more accurately describe prognosis. In this study, we conduct marginal and joint integrative analysis of multiple cancer types, effectively introducing integration in the discovery process. To accommodate high dimensionality and identify relevant markers, we adopt an advanced penalization technique that has a solid statistical basis. Gene expression data on nine cancer types from The Cancer Genome Atlas (TCGA) are analyzed, leading to biologically sensible findings that are different from the alternatives. Overall, this study provides a novel avenue for cancer prognosis modeling by integrating multiple cancer types.
(This article belongs to the Special Issue Statistical Methods for the Analysis of Genomic Data)