Next Article in Journal
Qualitative Analysis of Multi-Terms Fractional Order Delay Differential Equations via the Topological Degree Theory
Next Article in Special Issue
Comparing Groups of Decision-Making Units in Efficiency Based on Semiparametric Regression
Previous Article in Journal
Detecting Extreme Values with Order Statistics in Samples from Continuous Distributions
Previous Article in Special Issue
Combination of Ensembles of Regularized Regression Models with Resampling-Based Lasso Feature Selection in High Dimensional Data
 
 
Font Type:
Arial Georgia Verdana
Font Size:
Aa Aa Aa
Line Spacing:
Column Width:
Background:
Article

Robust Linear Trend Test for Low-Coverage Next-Generation Sequence Data Controlling for Covariates

1
Department of Psychiatry, New York University School of Medicine, New York, NY 10016, USA
2
Department of Neurology, Chonnam National University Medical School, Gwangju 61469, Korea
3
Department of Applied Statistics, Chung-Ang University, Seoul 06974, Korea
*
Author to whom correspondence should be addressed.
Submission received: 30 December 2019 / Revised: 4 February 2020 / Accepted: 5 February 2020 / Published: 8 February 2020
(This article belongs to the Special Issue Uncertainty Quantification Techniques in Statistics)

Abstract

:
Low-coverage next-generation sequencing experiments assisted by statistical methods are popular in a genetic association study. Next-generation sequencing experiments produce genotype data that include allele read counts and read depths. For low sequencing depths, the genotypes tend to be highly uncertain; therefore, the uncertain genotypes are usually removed or imputed before performing a statistical analysis. It may result in the inflated type I error rate and in a loss of statistical power. In this paper, we propose a mixture-based penalized score association test adjusting for non-genetic covariates. The proposed score test statistic is based on a sandwich variance estimator so that it is robust under the model misspecification between the covariates and the latent genotypes. The proposed method takes advantage of not requiring either external imputation or elimination of uncertain genotypes. The results of our simulation study show that the type I error rates are well controlled and the proposed association test have reasonable statistical power. As an illustration, we apply our statistic to pharmacogenomics data for drug responsiveness among 400 epilepsy patients.

1. Introduction

Genome-wide association study (GWAS) is a powerful tool for screening a high-dimensional genome data set and selecting candidate genetic variants such as single nucleotide polymorphisms (SNPs) in genetic association studies. Next-generation sequencing (NGS) technology is widely used to produce a large amount of genetic information in a fast way. In the past decade, there have been numerous studies using NGS data such as rare variants association study [1,2], pharmacogenomics [3,4], machine learning and deep learning applications [5,6], and big data analysis [7,8]. Many NGS experiments are based on low-coverage sequencing with a large sized sample since there is a trade-off between sample size and sequencing depth in the NGS experiments [9,10]. For the low-coverage NGS data, a high uncertainty of the inferred genotypes is common; however, it causes biased and unreliable results on genetic association analyses. In genetic research based on NGS data, therefore, it is important to obtain accurate genotypes to perform an association analysis.
A number of researchers have worked on the effects of genotype misclassification in genetic association studies. There are two types of genotype misclassifications: differential and non-differential misclassifications, determined by whether the misclassification mechanism differs in the case and control groups or not. In summary, non-differential misclassifications result in a loss of statistical power and differential misclassifications distort type I error rates in a genetic case-control association study [11,12,13,14].
While there have been many research on improving the accuracy of genotypes such as the joint genotype calling algorithms across all samples were suggested to increase the accuracy of genotype calls [15,16,17], several researchers have tried to develop new association statistics accounting for the genotype errors. Their approaches are based on the raw measurements rather than inferred genotypes. In statistical genetics literature, Kim et al. [18] extended a chi-squared test of independence and developed a mixture likelihood based association test using the continuous measurements for copy number polymorphisms. Barnes et al. [19] proposed a mixture model linear trend test for the continuous copy number measurements. In NGS experiments, a likelihood ratio test based on allele read counts of pooled samples was proposed to test independence of genetic variants with a binary phenotype [20]. Gordon et al. [21] proposed a likelihood ratio test of the binomial mixture model of allele read counts with known error parameters. Kim et al. [13,22] proposed an extended version of Cochran–Armitage (CA) trend test and a multi-variant linear trend test for next-generation sequences data by using binomial mixture models. For a case-parent trio design, the binomial mixture model was applied to develop extended transmission disequilibrium tests (TDTs) based on read counts and read depths and to provide power analysis and sample size formulas [23]. All these approaches do not require genotype calls that can be highly uncertain when the read depth or coverage is low. However, none of these previous research has addressed how to include covariates in their mixture-based association studies.
When the covariates are independent of the latent genotypes, the extension of the mixture model based association tests is straightforward. However, if the latent genotype variable is associated with other covariates, then a likelihood based approach requires a model specification between the genotype variable and the other covariates as opposed to the previous research [16,17,18,19,20,21,22,23]. To our knowledge, this is the first study that investigates a genetic case-control association test controlling for covariates in low-coverage NGS experiments. Since we do not know the true model, we apply a sandwich variance estimator to develop a robust genetic association test statistic.

2. Materials and Methods

2.1. Mixture Model Accounting for Covariates

Let w be a covariate vector. Let y be a random variable indicating the case-control status of an individual such that y = 1 if a subject is in the case group and y = 0 , otherwise. Let z = ( z 0 , z 1 , z 2 ) denote an unobservable latent genotype vector, where g = 0 2 z g = 1 and z g = 1 if and only if the genotype is equal to g. Let x and v denote the minor allele read count and the read depth, respectively. The probability function is given by
p ( y , x , v , w ) = z p ( y , x , v , w , z ) = z p ( x | v , w , z , y ) p ( y | v , w , z ) p ( z | w , v ) p ( w , v ) = p ( w , v ) z p ( x | v , z , y ) p ( y | z , w ) p ( z | w ) .
If the probability function of the read count x does not depend on the phenotype y, that is, p ( x | v , z , y ) = p ( x | v , z ) , then it is called a non-differential error model. We apply a binary logit model to the case-control phenotype response variable y that is the same model for Cochran–Armitage trend test when perfect genotypes are available:
p ( y | z , w ) = e y ( β s T z + β w T w ) 1 + e β 0 + β s T z + β w T w .
We assume a binomial error model to the allele read counts as in previous research [13,16,20,21,23]:
p ( x | v , z , y ) = v x u ϵ T z x 1 u ϵ T z v x ,
where u ϵ = ( ϵ , 1 / 2 , 1 ϵ ) T . For a differential error model, we can use u ϵ = y ( ϵ 1 , 1 / 2 , 1 ϵ 1 ) T + ( 1 y ) ( ϵ 0 , 1 / 2 , 1 ϵ 0 ) T . When perfect genotypes are available, we do not need the conditional probability of the genotype z given covariates w to perform genetic association tests since the logistic regression model is a conditional model given the genotypes and covariates. In this work, we assume a multinomial logit model for the latent genotype given the covariates as follows:
p ( z | w ) = g = 0 2 z g e γ g T w m = 0 2 e γ m T w ,
where γ 0 = ( 0 , 0 , 0 ) T to remove over-parametrization. Other statistical models without the assumptions of a multinomial logit model may also be used for the relationship between covariates and latent genotypes, where we do not know the true model.
The likelihood function L and the log-likelihood function are written as
L = k = 1 N z k p ( y k | z k , w k ) p ( x k | v k , z k , y k ) p ( z k | w k ) p ( w k , v k ) = k = 1 N i = 0 2 e y k ( β s i + β w T w k ) 1 + e β s i + β w T w k v k x k u ϵ i x k 1 u ϵ i v k x k e γ i T w k m = 0 2 e γ m T w k p ( w k , v k ) ,
= k = 1 N log i = 0 2 e y k ( β s i + β w T w k ) 1 + e β s i + β w T w k u ϵ i x k 1 u ϵ i v k x k e γ i T w k m = 0 2 e γ m T w k + k = 1 N log v k x k p ( w k , v k ) .
The error parameter ϵ is commonly small and hence the estimate of ϵ is often equal to zero. The zero estimate of the error parameter results in a divergent information matrix. It prevents us from calculating Rao’s score test statistic. In order to overcome this issue, we include a beta density penalty term to prevent from zero estimate of the error parameter. The penalized log-likelihood function is given by
p = + C log ϵ a ϵ ( 1 ϵ ) b ϵ .
During this work, we choose C = 1 as in [24,25]. The penalized complete-data likelihood function is given by
L C = k = 1 N i = 0 2 e y k ( β s i + β w T w k ) 1 + e β s i + β w T w k × v k x k ( u ϵ i ) x k ( 1 u ϵ i ) v k x k ϵ a ϵ N ( 1 ϵ ) b ϵ N × e γ i T w k m = 0 2 e γ m T w k z i k
The complete data log-likelihood function is written as
C = k = 1 N i = 0 2 z i k y k ( β s i + β w T w k ) log 1 + e β s i + β w T w k + k = 1 N i = 0 2 z i k x k log ( u ϵ i ) + ( v k x k ) log ( 1 u ϵ i ) + a ϵ log ϵ + b ϵ log ( 1 ϵ ) + k = 1 N i = 0 2 z i k γ i T w k log m = 0 2 e γ m T w k .

2.2. Derivation of EM Algorithm under H 0

We apply the Expectation–Maximization (EM) algorithm [26] to estimate the parameters in our mixture model. Given data and the ( r ) -th step estimated parameters, the ( r + 1 ) -th E-step of the EM algorithm is written as
Q ( r + 1 ) = k = 1 N i = 0 2 τ i k ( r ) y k ( β s i + β w T w k ) log 1 + e β s i + β w T w k + k = 1 N i = 0 2 τ i k ( r ) x k log ( u ϵ i ) + ( v k x k ) log ( 1 u ϵ i ) + a ϵ log ϵ + b ϵ log ( 1 ϵ ) + k = 1 N i = 0 2 τ i k ( r ) γ i T w k log m = 0 2 e γ m T w k ,
where
τ i k ( r ) = e y k ( β ( r ) s i + β w ( r ) T w k ) 1 + e β ( r ) s i + β w ( r ) T w k ( u ϵ i ( r ) ) x k ( 1 u ϵ i ( r ) ) v k x k e γ i ( r ) T w k m = 0 2 e γ m ( r ) T w k g = 0 2 e y k ( β ( r ) s g + β w ( r ) T w k ) 1 + e β ( r ) s g + β w ( r ) T w k ( u ϵ g ( r ) ) x k ( 1 u ϵ g ( r ) ) v k x k e γ g ( r ) T w k m = 0 2 e γ m ( r ) T w k .
We note that the posterior probability of subject k belonging to genotype class i depends on the all parameters. In M-step, the ( r + 1 ) -th estimates of the parameters are obtained by maximizing Q ( r + 1 ) :
Q ( r + 1 ) β = k = 1 N i = 0 2 τ i k ( r ) s i y k π i k = 0
Q ( r + 1 ) β w = k = 1 N i = 0 2 τ i k ( r ) w k y k π i k = 0
Q ( r + 1 ) ϵ = k = 1 N τ 0 k ( r ) x k ϵ v k x k 1 ϵ + τ 2 k ( r ) v k x k ϵ x k 1 ϵ + a ϵ ϵ b ϵ 1 ϵ = 0
Q ( r + 1 ) γ i = k = 1 N w k τ i k ( r ) p i k = 0 ,
where we use notations π i k = π i k ( β , β w ) = e β s i + β w T w k 1 + e β s i + β w T w k and p i k = p i k ( γ 1 , γ 2 ) = e γ i T w k m = 0 2 e γ m T w k for simplicity. From Equation (14), we derive a closed form iteration formula to update the allele read error parameter ϵ :
ϵ ( r + 1 ) = k = 1 N τ 0 k ( r ) x k + τ 2 k ( r ) ( v k x k ) + a ϵ k = 1 N ( τ 0 k ( r ) + τ 2 k ( r ) ) v k + a ϵ + b ϵ .
There is no closed form iteration formulas to update other parameters β , β w , γ i . The M-step for β , β w , and γ can be obtained by the Newton–Raphson method. The Hessian matrix of Q ( r + 1 ) is given by
2 Q ( r + 1 ) β 2 = k = 1 N i = 0 2 τ i k ( r ) s i 2 π i k ( 1 π i k )
2 Q ( r + 1 ) β β w = k = 1 N i = 0 2 τ i k ( r ) s i w k π i k ( 1 π i k )
2 Q ( r + 1 ) β w β w T = k = 1 N i = 0 2 τ i k ( r ) w k w k T π i k ( 1 π i k )
2 Q ( r + 1 ) γ i γ i T = k = 1 N w k w k T p i k ( 1 p i k )
2 Q ( r + 1 ) γ i γ j T = k = 1 N w k w k T p i k p j k
2 Q ( r + 1 ) γ i β w T = 2 Q ( r + 1 ) γ i β = 0
Let M = diag i = 0 2 τ i k π i k ( 1 π i k ) be an N × N diagonal matrix. Let W = ( w i k ) be the N × p matrix of covariates. Let μ be an N × 1 vector of μ k = i k τ i k π i k and Y be an N × 1 vector of y k . Initially, we set β [ 0 ] = β ( r ) and update the parameter estimate by
β [ t + 1 ] = β [ t ] + ( W T M W ) 1 W T ( Y μ ) .
Let D 11 = diag p 1 k ( 1 p 1 k ) , D 12 = D 21 = diag p 1 k p 2 k , and D 22 = diag p 2 k ( 1 p 2 k ) . Let τ i = ( τ i k ) be the N × 1 vector and p i = ( p i k ) be the N × 1 vector. Initially, set γ i [ 0 ] = γ i ( r ) and update the parameters γ i by
γ 1 [ t + 1 ] γ 2 [ t + 1 ] = γ 1 [ t ] γ 2 [ t ] + W T D 11 W W T D 12 W W T D 21 W W T D 22 W 1 W T ( τ 1 p 1 ) W T ( τ 2 p 2 ) .
In order to obtain β w ( r + 1 ) and γ i ( r + 1 ) , we stop the iterations in the M-step for β and γ i when | | β [ t + 1 ] β [ t ] | | 2 + | | γ 1 [ t + 1 ] γ 1 [ t ] | | 2 + | | γ 2 [ t + 1 ] γ 2 [ t ] | | 2 t o l 2 or the number of iterations reaches the prespecified maximum number of iterations. In our work, we set t o l = 10 6 and fix the maximum iteration as 1000.

2.3. Hypothesis Tests of Genetic Association Controlling for Covariates

To test genetic association between the latent genetic variables and the binary response variable while controlling covariates, we employ Rao’s score test. There are several advantages for the use of the score test. Cochran-Armitage trend test with perfect genotypes is a score test, and we extend this test to when the genotypes are highly uncertain. The score test requires less computational cost compared to the likelihood ratio test since it requires the parameter estimates only under the null hypothesis of no association. The score function calculated in previous section is given by
S = k = 1 N i = 0 2 τ i k ( 0 ) s i y k e β w ( 0 ) T w k 1 + e β w ( 0 ) T w k
where the subscript ( 0 ) denotes the estimated parameter under the null hypothesis. Another important issue to be considered when we include the covariates in a low-coverage next-generation sequencing genetic association study is a model misspecification of the latent genotypes on the covariates. To overcome this model misspecification problem, we employ the sandwich variance estimator [27]. In this work, we derive a robust generalized score test using the sandwich variance–covariance estimator. In general, one of the difficulties in applying the sandwich estimator in practice is that it requires analytic derivation for the covariance matrix of the proposed model. For simplicity in our derivation of the sandwich variance estimator, θ denotes the vector of all parameters θ = ( β , β w , γ , ϵ ) , and ϕ = ( β w , γ , ϵ ) denotes the parameter vector except β , and hence θ = ( β , ϕ ) . The sandwich variance estimator for the score function S under H 0 is given by
v s = V β β V β ϕ J ϕ ϕ 1 J ϕ β J β ϕ J ϕ ϕ 1 V ϕ β + J β ϕ J ϕ ϕ 1 V ϕ ϕ J ϕ ϕ 1 J ϕ β ,
where V = E f 0 θ θ T and J = E f 0 2 θ θ T under the unknown true distribution f 0 . For simplicity, we may use h i k during derivation of the sandwich variance estimator:
h g k = e y k ( β s g + β w T w k ) 1 + e β s g + β w T w k u ϵ g x k 1 u ϵ g v k x k e γ g T w k m = 0 2 e γ m T w k ,
so that the likelihood function is written as
= k = 1 N log g = 0 2 h g k + C .
The relationship between J and V can be written as
J = 1 N k = 1 N g = 0 2 τ g k θ log h g k g = 0 2 τ g k θ T log h g k 1 N k = 1 N g = 0 2 τ g k θ log h g k θ T log h g k + 2 θ θ T log h g k = V 1 N k = 1 N g = 0 2 τ g k θ log h g k θ T log h g k + 2 θ θ T log h g k
If there is no model misspecification, we have J = V and the robust score test statistic is reduced to Rao’s score test statistic. We denote the difference R = V J so that R = 1 N k = 1 N g = 0 2 τ g k θ log h g k θ T log h g k + 2 θ θ T log h g k . The components of θ log h g k are calculated by
β log h g k = s g [ y k π k ]
β w log h g k = w k [ y k π k ]
ϵ log h g k = δ g ( 0 ) X k ϵ V k X k 1 ϵ + δ g ( 2 ) V k X k ϵ X k 1 ϵ + a ϵ N ϵ b ϵ N ( 1 ϵ )
γ i log h g k = w k I ( g = i ) p i k ,
where δ g ( i ) = 1 if g = i and δ g ( i ) = 0 if g i . It is straightforward to calculate V from the above first derivatives. The second term 2 θ θ T log h g k of R has components as
2 β 2 log h g k = s g 2 π k ( 1 π k )
2 β w β w T log h g k = w k w k T π k ( 1 π k )
2 β w β log h g k = w k s g π k ( 1 π k )
2 γ i γ i T log h g k = w k w k T p i k ( 1 p i k )
2 γ i γ 3 i T log h g k = w k w k T p i k p 3 i , k
2 ϵ 2 log h g k = δ g ( 0 ) X k ϵ 2 + V k X k ( 1 ϵ ) 2 + δ g ( 2 ) V k X k ϵ 2 + X k ( 1 ϵ ) 2 + a ϵ N ϵ 2 + b ϵ N ( 1 ϵ ) 2 ,
where i = 1 or 2. All other second derivatives that are not presented are equal to zero. Using these first and second derivatives of log h g k , we can obtain the components of the difference matrix R as follows:
R β β = 1 N k = 1 N g = 0 2 τ g k s g 2 ( y k π k ) 2 π k ( 1 π k )
R β w β = 1 N k = 1 N g = 0 2 τ g k s g w k ( y k π k ) 2 π k ( 1 π k )
R β w β w = 1 N k = 1 N w k w k T ( y k π k ) 2 π k ( 1 π k )
R ϵ ϵ = 1 N k = 1 N τ 0 k X k + a ϵ / N ϵ V k X k + b ϵ / N 1 ϵ 2 + τ 1 k a ϵ N ϵ b ϵ N ( 1 ϵ ) 2 + τ 2 k V k X k + a ϵ / N ϵ X k + b ϵ / N 1 ϵ 2 τ 0 k X k + a ϵ / N ϵ 2 + V k X k + b ϵ / N ( 1 ϵ ) 2 τ 1 k a ϵ N ϵ 2 + b ϵ N ( 1 ϵ ) 2 τ 2 k V k X k + a ϵ / N ϵ 2 + X k + b ϵ / N ( 1 ϵ ) 2
R β ϵ = 1 N k = 1 N [ y k π k ] τ 0 k s 0 X k + a ϵ / N ϵ V k X k + b ϵ / N 1 ϵ + τ 1 k s 1 a ϵ N ϵ b ϵ N ( 1 ϵ ) + τ 2 k s 2 V k X k + a ϵ / N ϵ X k + b ϵ / N 1 ϵ
R β w ϵ = 1 N k = 1 N w k [ y k π k ] τ 0 k X k + ϵ / N ϵ V k X k + b ϵ / N 1 ϵ + τ 1 k a ϵ N ϵ b ϵ N ( 1 ϵ ) + τ 2 k V k X k + a ϵ / N ϵ X k + b ϵ / N 1 ϵ
R γ i γ i = 1 N k = 1 N w k w k T ( τ i k p i k ) ( 1 2 p i k )
R γ 1 γ 2 = 1 N k = 1 N w k w k T p 1 k ( p 2 k τ 2 k ) + p 2 k ( p 1 k τ 1 k )
R γ i β = 1 N k = 1 N w k ( y k π k ) τ i k s 1 p i k g = 0 2 τ g k s g
R γ i β w = 1 N k = 1 N w k w k T ( y k π k ) τ i k p i k
R γ 1 ϵ = 1 N k = 1 N w k p 1 k τ 0 k X k + a ϵ / N ϵ V k X k + b ϵ / N 1 ϵ + ( 1 p 1 k ) τ 1 k a ϵ N ϵ b ϵ N ( 1 ϵ ) p 1 k τ 2 k V k X k + a ϵ / N ϵ X k + b ϵ / N 1 ϵ
R γ 2 ϵ = 1 N k = 1 N w k p 2 k τ 0 k X k + a ϵ / N ϵ V k X k + b ϵ / N 1 ϵ p 2 k τ 1 k a ϵ N ϵ b ϵ N ( 1 ϵ ) + ( 1 p 2 k ) τ 2 k V k X k + ϵ / N ϵ X k + b ϵ / N 1 ϵ
Therefore, our proposed robust score test statistic Z R can be written as
Z R = S N v s ,
which asymptotically has a standard normal distribution under H 0 .
Another common approach to obtain p-values is to use Monte Carlo permutation method based on the score vector or function. However, the Monte Carlo permutation p-value calculation given a very small Bonferroni’s corrected level of significance needs high computational expenses since it requires at least 10 7 or 10 8 permuted resamples. In this work, we employ the asymptotic permutation p-value calculation. The score function is given by
S = k = 1 N i = 0 2 τ i k ( 0 ) s i y k e β w ( 0 ) T w k 1 + e β w ( 0 ) T w k = k = 1 N r k e k
where the subscript ( 0 ) denotes the estimated parameter under the null hypothesis. We define a score r k = i = 0 2 τ i k ( 0 ) s i associated with subject k and the kth residual e k = y k e β w ( 0 ) T w k 1 + e β w ( 0 ) T w k . We can permute the residuals e k ’s to calculate the permutation p-value for adjusting covariate effects. The asymptotic permutation test statistic Z A P for a large sample size is given by
Z A P = S N · r ¯ · e ¯ 1 N 1 i = 1 N e i 2 N ( e ¯ ) 2 i = 1 N r i 2 N ( r ¯ ) 2
where r ¯ = 1 N i = 1 N r i and e ¯ = 1 N i = 1 N e i . The simple linear rank test statistic Z A P asymptotically has a standard normal distribution under the null hypothesis [28].

3. Results

3.1. Simulation Study

In this section, we simulate data from the following process:
P ( Y = 1 | w ) f ( w ) = i = 0 2 P ( Y = 1 | G = i , w ) P ( G = i ) f ( w )
For simplicity, we assume genetic relative risk R i = P ( Y = 1 | G = i , w ) P ( Y = 1 | G = 0 , w ) , for i = 1 , 2 , does not depend on the covariate W. We assume that the genotype frequency π i = P ( G = i ) satisfies Hardy–Weinberg equilibrium (HWE), so that P ( G = 0 ) = p 2 , P ( G = 1 ) = 2 p q , and P ( G = 2 ) = q 2 , where q is the minor allele frequency. Then, the prevalence is given by
ϕ = P ( Y = 1 | w ) f ( w ) d w = p 2 f ( w | G = 0 ) + 2 p q R 1 f ( w | G = 1 ) + q 2 R 2 f ( w | G = 2 ) P ( Y = 1 | G = 0 , w ) d w
We consider two scenarios when generating covariates w: (1) f ( w | G = i ) is equal to a standard normal N ( 0 , 1 ) for all i = 0 , 1 , 2 , called by a single normal, and (2) f ( w | G = i ) has a normal distribution with mean μ i and standard deviation σ = 1 , we call this a normal mixture. For the single normal model,
ϕ = p 2 + 2 p q R 1 + q 2 R 2 P ( Y = 1 | G = 0 , w ) f ( w ) d w
We finally assume P ( Y = 1 | G = 0 , w ) = e α + β w w 1 + e α + β w w . During the simulation study, we compute α by numerical integration given prevalence ϕ and other parameters.

3.1.1. Simulation Study for Null Distribution

To evaluate the type I error rate of the proposed test statistic, we perform simulations with 5000 replicates per each parameter setting. We fixed the proportion of cases as 0.5. The parameter settings that we consider are:
(i)
Prevalence ( ϕ ): 0.1, 0.3
(ii)
Coverage (v): 4, 30
(iii)
Minor allele frequency (q): 0.05, 0.3
(iv)
Total sample size (n): 500, 1000, 1500
(v)
Covariate ( w 1 ): single normal or normal mixture with mean μ = ( 0 , 1 2 , 1 2 ) given genotype ( 0 , 1 , 2 )
(vi)
Regression coefficient β w : 0, 1
We consider prevalence ϕ = 0.3 that may be large in a genetic association study. It is chosen to reflect pharmacogenomics data that we use in the real data analysis.
Figure 1 shows boxplots of the null simulations. The permutation method appears to have more variability of the empirical rejection rates over different configurations and to have the smaller empirical rejection rates compared to the proposed robust score test based on the sandwich variance estimator. When the sample size was small as 500 and the coverage was 4×, the permutation-based test had less than 2.5% rejection rate though the desired value is 5%. The smallest empirical rejection rate for the proposed robust test was greater than 3.5%, and it appears the empirical rejection rates become closer to 5% as the sample size increases. If the coverage is 30× or higher, then the estimated posterior probabilities in our approach are close to zero-or-one and most inferred genotypes are quite clear. When the coverage was 30×, our proposed test seems to well control the type I error rates regardless of other parameter settings as expected. Table 1 shows the empirical rejection rates under the null settings by combining our simulation results for the lower level of significance.

3.1.2. Simulation Study for Statistical Power

We used the same parameter settings as in the null simulation study. Additionally, we set multiplicative genetic relative risks vector ( 1 , 1.5 , 1.5 2 ) in the alternative parameter configurations. In the alternative simulations, we calculated empirical rejection rates under Bonferroni corrected level of significance, that is, 5 × 10 8 . Figure 2 shows the boxplots of empirical power under various alternative settings. We removed the results when the sample size was 500 or the minor allele frequency was 0.05 since all the rejection rates were small in Figure 2. It appears interesting that the power of the proposed test when the coverage was 4× and the sample size was 1500 is higher than the power of the test when the coverage was 30× and the sample size was 1000. If the two design costs are similar, then the low-coverage with more samples seems more effective than the high-coverage with less samples.
Table 2 summarizes statistical power of our proposed method and a naive approach. The naive approach uses uncertain genotypes by the maximum posterior probability classification rule [29]. The standard logistic regression was applied to the uncertain genotypes. As expected, the proposed robust method shows higher power than the naive approach when the sequencing coverage is as low as 4×. When the sequencing coverage is high as 30×, two approaches show similar performance in terms of statistical power.

3.2. Real Data Analysis

The proposed robust generalized score test was applied to the pharmacogenomics data consisting of 400 epilepsy patients [22]. The data were collected from several epilepsy clinics in Korea and were genotyped for whole-exomes by NGS experiments [30]. All study participants followed the criteria in [31] if the participants had drug-resistant (case group) or drug-responsive (control group) epilepsy. We defined the drug resistance as the occurrence of at least four unprovoked seizures during the past one year at the time of recruitment, with trials of two or more appropriate antiepileptic drugs (AEDs) at maximal tolerated doses. Patients who underwent surgical treatment for drug-resistant epilepsy were classified as having drug-resistant epilepsy, regardless of the surgical outcome. We excluded some patients from the study if they were frequently in poor compliance with AED therapy and had reported seizures with a questionable semiology. In addition, we defined the drug responsiveness as complete freedom from seizures for at least one year up to the date of the last follow-up visit.
We included two non-genetic covariates in our association analysis. The two covariates were age of patient and duration of epileptic seizures. The age variable was definitely independent of genetic information, whereas duration variable may be associated with genetic variables. Due to the relatively small sample size 400, we did not expect to find a significantly associated SNP controlling for the two covariates. Therefore, instead of reporting a genome-wide association study, we illustrated the results of a SNP with low read depths and a SNP with high read depths. For the low read depths example, we selected a SNP from chromosome 1, which is r s 3811406 . The distribution of read depths for the SNP was summarized in Table 3. More than 10% of the sample had five or less read depths and more than 30% of the sample had 10 or less read depths at the SNP. When applying our proposed mixture-based association test, the test statistic value was z R = 2.864 and the p-value was p = 4.183 × 10 3 , while the standard logistic regression analysis using pooled genotype calls had z = 2.601 and the p-value p = 9.30 × 10 3 that was more than twice the p-value of the proposed robust test.
In addition, we applied our proposed test to SNP r s 4915154 at which all patients had 13× or higher read depths and 85% patients had 25× or higher read depths. For this SNP, the proposed robust test statistic was z R = 2.940 with p-value = 3.28 × 10 3 and the multiple logistic regression with the pooled genotype calls reported z = 2.963 with p-value = 3.05 × 10 3 . The two results were quite close, as expected, due to high read depths at the SNP.

4. Discussion and Conclusions

In the present study, we developed the mixture-based genetic association tests adjusting the effects of non-genetic covariates in low-coverage NGS data. In order to construct a robust test statistic under model misspecification, we derived the sandwich variance estimator of the mixture model. The proposed test statistic is calculated from allele read counts and read depths instead of inferred genotypes so that we can apply this association test to low-coverage NGS data controlling for non-genetic covariates without external imputation or elimination of uncertain genotypes. Another important issue that we addressed in the present study is that the proposed test takes account of potential dependence between latent genotypes and the non-genetic covariates. Regarding computational cost, our proposed method is efficient because it is a generalized score test that uses the estimates of the parameters only under the null hypothesis of no association. When the sequencing depth is 4×, it takes around 1.2 s for sample size 500, 4 s for sample size 1000, and 9 s for sample size 1500 to simulate a dataset and to calculate both test statistics Z A P and Z R . When the sequencing depth is 30×, it takes approximately 0.13 s for sample size 500, 0.3 s for sample size 1000, and 0.53 s for sample size 1500. Time for these computations is measured based on a single core work of a 3.5 GHz Intel Xeon processor. As illustrated in the real data analysis section, the read depth is not a fixed constant. Therefore, the computational time for real data is usually less than that for the coverage 4× simulation setting. We used statistical software R, which is known to be slow. It would be computationally beneficial to run our proposed methods in other faster program languages for a high-dimensional genome-wide association study.
We applied the penalized likelihood method to avoid singularity of information matrix when calculating the proposed score test statistic. Therefore, the penalty term is not necessary for a non-zero estimate of the error parameter. During our work, we fixed the degree of penalization C = 1 , a ϵ = 0.01 , and b ϵ = 0.99 that implies 1% of allele read error as prior information. This parameter choice does not affect the proposed test statistic much since the likelihood function is merely changed when the sample size is greater than 500. It may be of interest to find optimal values for the parameters of the penalty term.
The simulation study confirms that the type I error rates of the proposed test are well controlled under the various parameter settings. The proposed robust test appears to perform better than the permutation based approach. Simulation results indicate that coverage 4× with sample size 1500 shows higher power as compared to coverage 30× with sample size 1000. Our method can be applied to an NGS experimental design by simulations to select coverage and sample size given a fixed amount of budget.
We presented a real data example in which the proposed test and multiple logistic regression results are similar to one another if the sequencing depth is high, whereas the test results may differ when the sequencing depth is low. This might have been caused because the proposed test is an extension of the multiple logistic regression with the unobserved latent genotype predictor. If the sequencing depth is high enough to call accurate genotypes, then our probability model becomes identical to the probability model of the multiple logistic regression. It would be more beneficial to compare with the previous methods by evaluating our proposed methods using a larger sized public dataset.
In this work, we focused on a single variant association test while controlling covariates. By adopting a multivariate mixture model, the proposed method can be extended to the multi-variant genetic association test including covariates. We can also extend the present method to differential genotype misclassifications.

Author Contributions

Conceptualization, W.K.; methodology, J.Y.L. and W.K.; software, J.Y.L. and W.K.; formal analysis, J.Y.L. and W.K.; data curation, M.-K.K. and W.K.; writing—original draft preparation, J.Y.L. and W.K.; writing—review and editing, J.Y.L., M.-K.K., and W.K.; project administration, W.K.; funding acquisition, M.-K.K. and W.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (NRF-2018R1D1A1B07050012) and was supported by a grant of the Korea Health Technology R&D Project through the Korea Health Industry Development Institute, funded by the Ministry of Health & Welfare, Republic of Korea (HI15C1559).

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
EMExpectation–Maximization
GWASGenome-wide association study
HWEHardy–Weinberg equilibrium
mafMinor allele frequency
NGSNext-generation sequence
SNPSingle nucleotide polymorphism
TDTTransmission disequilibrium test

References

  1. Wu, M.C.; Lee, S.; Cai, T.; Li, Y.; Boehnke, M.; Lin, X. Rare-variant association testing for sequencing data with the sequence kernel association test. Am. J. Hum. Genet. 2011, 89, 82–93. [Google Scholar] [CrossRef] [Green Version]
  2. Cirulli, E.T.; White, S.; Read, R.W.; Elhanan, G.; Metcalf, W.J.; Tanudjaja, F.; Fath, D.M.; Sandoval, E.; Isaksson, M.; Schlauch, K.A.; et al. Genome-wide rare variant analysis for thousands of phenotypes in over 70,000 exomes from two cohorts. Nat. Commun. 2020, 11, 542. [Google Scholar] [CrossRef]
  3. Lakiotaki, K.; Kanterakis, A.; Kartsaki, E.; Katsila, T.; Patrinos, G.P.; Potamias, G. Exploring public genomics data for population pharmacogenomics. PLoS ONE 2017, 12, e0182138. [Google Scholar] [CrossRef] [PubMed]
  4. Patrinos, G.P.; Giannopoulou, E.; Katsila, T.; Tsermpini, E.E.; Mitropoulou, C. Integrating next-generation sequencing in the clinical pharmacogenomics workflow. Front. Pharmacol. 2019, 10, 384. [Google Scholar]
  5. Celesti, F.; Celesti, A.; Wan, J.; Villari, M. Why Deep Learning Is Changing the Way to Approach NGS Data Processing: A Review. IEEE Rev. Biomed. Eng. 2018, 11, 68–76. [Google Scholar] [CrossRef]
  6. Le, N.Q.K.; Yapp, E.K.Y.; Nagasundaram, N.; Chua, M.C.H.; Yeh, H.Y. Computational identification of vesicular transport proteins from sequences using deep gated recurrent units architecture. Comput. Struct. Biotechnol. J. 2019, 17, 1245–1254. [Google Scholar] [CrossRef]
  7. Tripathi, R.; Sharma, P.; Chakraborty, P.; Varadwaj, P.K. Next-generation sequencing revolution through big data analytics. Front. Life Sci. 2016, 9, 119–149. [Google Scholar] [CrossRef]
  8. Cirillo, D.; Valencia, A. Big data analytics for personalized medicine. Curr. Opin. Biotechnol. 2019, 58, 161–167. [Google Scholar] [CrossRef]
  9. Sims, D.; Sudbery, I.; Ilott, N.E.; Heger, A.; Ponting, C.P. Sequencing depth and coverage: Key considerations in genomic analyses. Nat. Rev. Genet. 2014, 15, 121–132. [Google Scholar] [CrossRef]
  10. Song, K.; Li, L.; Zhang, G. Coverage recommendation for genotyping analysis of highly heterologous species using next-generation sequencing technology. Sci. Rep. 2016, 6, 35736. [Google Scholar] [CrossRef] [Green Version]
  11. Gordon, D.; Finch, S.J.; Nothnagel, M.; Ott, J. Power and sample size calculations for case-control genetic association tests when errors are present: Application to single nucleotide polymorphisms. Hum. Hered. 2002, 54, 22–33. [Google Scholar] [CrossRef]
  12. Ahn, K.; Haynes, C.; Kim, W.; Fleur, R.S.; Gordon, D.; Finch, S.J. The effects of SNP genotyping errors on the power of the Cochran-Armitage linear trend test for case/control association studies. Ann. Hum. Genet. 2007, 71, 249–261. [Google Scholar] [CrossRef]
  13. Kim, W.; Londono, D.; Zhou, L.; Xing, J.; Nato, A.Q.; Musolf, A.; Matise, T.C.; Finch, S.J.; Gordon, D. Single-variant and multi-variant trend tests for genetic association with next-generation sequencing that are robust to sequencing error. Hum. Hered. 2012, 74, 172–183. [Google Scholar] [CrossRef] [Green Version]
  14. Hou, L.; Sun, N.; Mane, S.; Sayward, F.; Rajeevan, N.; Cheung, K.; Cho, K.; Pyarajan, S.; Aslan, M.; Miller, P. Impact of genotyping errors on statistical power of association tests in genomic analyses: A case study. Genet. Epidemiol. 2017, 41, 152–162. [Google Scholar] [CrossRef] [Green Version]
  15. Consortium, G.P. An integrated map of genetic variation from 1092 human genomes. Nature 2012, 491, 56. [Google Scholar]
  16. Le, S.Q.; Durbin, R. SNP detection and genotyping from low-coverage sequencing data on multiple diploid samples. Genome Res. 2011, 21, 952–960. [Google Scholar] [CrossRef] [Green Version]
  17. Li, Y.; Sidore, C.; Kang, H.M.; Boehnke, M.; Abecasis, G.R. Low-coverage sequencing: Implications for design of complex trait association studies. Genome Res. 2011, 21, 940–951. [Google Scholar] [CrossRef] [Green Version]
  18. Kim, W.; Gordon, D.; Sebat, J.; Kenny, Q.Y.; Finch, S.J. Computing power and sample size for case-control association studies with copy number polymorphism: Application of mixture-based likelihood ratio test. PLoS ONE 2008, 3, e3475. [Google Scholar] [CrossRef]
  19. Barnes, C.; Plagnol, V.; Fitzgerald, T.; Redon, R.; Marchini, J.; Clayton, D.; Hurles, M.E. A robust statistical method for case-control association testing with copy number variation. Nat. Genet. 2008, 40, 1245. [Google Scholar] [CrossRef] [Green Version]
  20. Kim, S.Y.; Li, Y.; Guo, Y.; Li, R.; Holmkvist, J.; Hansen, T.; Pedersen, O.; Wang, J.; Nielsen, R. Design of association studies with pooled or un-pooled next-generation sequencing data. Genet. Epidemiol. 2010, 34, 479–491. [Google Scholar] [CrossRef] [Green Version]
  21. Gordon, D.; Finch, S.J.; De La Vega, F. A new expectation-maximization statistical test for case-control association studies considering rare variants obtained by high-throughput sequencing. Hum. Hered. 2011, 71, 113–125. [Google Scholar] [CrossRef]
  22. Kim, W.; Kim, Y.H. Genetic association tests when a nuisance parameter is not identifiable under no association. Commun. Stat. Appl. Methods 2017, 24, 663–671. [Google Scholar] [CrossRef]
  23. Kim, W. Transmission Disequilibrium Tests Based on Read Counts for Low-Coverage Next,-Generation Sequence Data. Hum. Hered. 2015, 80, 36–49. [Google Scholar] [CrossRef]
  24. Chen, H.; Chen, J.; Kalbfleisch, J.D. A modified likelihood ratio test for homogeneity in finite mixture models. J. R. Stat. Soc. Ser. B 2001, 63, 19–29. [Google Scholar] [CrossRef]
  25. Zhou, H.; Pan, W. Binomial mixture model-based association tests under genetic heterogeneity. Ann. Hum. Genet. 2009, 73, 614–630. [Google Scholar] [CrossRef] [Green Version]
  26. Dempster, A.P.; Laird, N.M.; Rubin, D.B. Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B 1977, 39, 1–22. [Google Scholar]
  27. White, H. Maximum Likelihood Estimation of Misspecified Models. Econometrica 1982, 50, 1–25. [Google Scholar] [CrossRef]
  28. Sidak, Z.; Sen, P.K.; Hajek, J. Theory of Rank Tests; Academic Press: San Diego, CA, USA, 1999. [Google Scholar]
  29. Anderson, T.W. An Introduction to Multivariate Statistical Analysis; Wiley: New York, NY, USA, 1962. [Google Scholar]
  30. Kang, K.W.; Kim, W.; Cho, Y.W.; Lee, S.K.; Jung, K.Y.; Shin, W.; Kim, D.W.; Kim, W.J.; Lee, H.W.; Kim, W. Genetic characteristics of non-familial epilepsy. PeerJ 2019, 7, e8278. [Google Scholar] [CrossRef]
  31. Kim, M.-K.K.; Moore, J.H.; Kim, J.K.; Cho, K.H.; Cho, Y.W.; Kim, Y.S.; Lee, M.C.; Kim, Y.O.; Shin, M.H. Evidence for epistatic interactions in antiepileptic drug resistance. J. Hum. Genet. 2011, 56, 71–76. [Google Scholar] [CrossRef] [Green Version]
Figure 1. Boxplot of the empirical rejection rates under the null hypothesis.
Figure 1. Boxplot of the empirical rejection rates under the null hypothesis.
Mathematics 08 00217 g001
Figure 2. Boxplots of statistical power of the proposed robust test under the alternative settings. The level of significance was set as 5 × 10 8 . The notation 0.1.1000.4 represents prevalence 0.1, total sample size 1000, and coverage 4×.
Figure 2. Boxplots of statistical power of the proposed robust test under the alternative settings. The level of significance was set as 5 × 10 8 . The notation 0.1.1000.4 represents prevalence 0.1, total sample size 1000, and coverage 4×.
Mathematics 08 00217 g002
Table 1. Empirical rejection rates under null settings for level 1 × 10 2 , 1 × 10 3 , 1 × 10 4 , and 1 × 10 5 .
Table 1. Empirical rejection rates under null settings for level 1 × 10 2 , 1 × 10 3 , 1 × 10 4 , and 1 × 10 5 .
Method (cvrg) 1 × 10 2 1 × 10 3 1 × 10 4 1 × 10 5
Permutation (4×) 7.13 × 10 3 6.14 × 10 4 4.44 × 10 5 0
Permutation (30×) 7.73 × 10 3 7.08 × 10 4 6.11 × 10 5 5.56 × 10 6
Sandwich (4×) 8.34 × 10 3 6.89 × 10 4 4.44 × 10 5 5.56 × 10 6
Sandwich (30×) 1.02 × 10 2 1.01 × 10 3 8.75 × 10 5 8.33 × 10 6
Table 2. Empirical rejection rates under alternative hypothesis. The level of significance was set as 5 × 10 8 .
Table 2. Empirical rejection rates under alternative hypothesis. The level of significance was set as 5 × 10 8 .
CoverageTotal Sample SizeCovariate β w NaiveProposed
41000Normal mixture00.1020.113
41000Normal mixture10.2330.261
41000Single normal00.1900.277
41000Single normal10.2690.374
41500Normal mixture00.3980.429
41500Normal mixture10.6570.701
41500Single normal00.6260.741
41500Single normal10.7360.840
301000Normal mixture00.3840.355
301000Normal mixture10.6170.603
301000Single normal00.6220.637
301000Single normal10.7340.760
301500Normal mixture00.7920.761
301500Normal mixture10.9590.954
301500Single normal00.9330.939
301500Single normal10.9780.978
Table 3. Distribution of read depths at r s 3811406 .
Table 3. Distribution of read depths at r s 3811406 .
Read Depth v v 5 5 < v 10 10 < v 30 v > 30 Total
Frequency438695176400
Proportion0.10750.2150.23750.441

Share and Cite

MDPI and ACS Style

Lee, J.Y.; Kim, M.-K.; Kim, W. Robust Linear Trend Test for Low-Coverage Next-Generation Sequence Data Controlling for Covariates. Mathematics 2020, 8, 217. https://0-doi-org.brum.beds.ac.uk/10.3390/math8020217

AMA Style

Lee JY, Kim M-K, Kim W. Robust Linear Trend Test for Low-Coverage Next-Generation Sequence Data Controlling for Covariates. Mathematics. 2020; 8(2):217. https://0-doi-org.brum.beds.ac.uk/10.3390/math8020217

Chicago/Turabian Style

Lee, Jung Yeon, Myeong-Kyu Kim, and Wonkuk Kim. 2020. "Robust Linear Trend Test for Low-Coverage Next-Generation Sequence Data Controlling for Covariates" Mathematics 8, no. 2: 217. https://0-doi-org.brum.beds.ac.uk/10.3390/math8020217

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Metrics

Back to TopTop