
Minimax Rates of ℓp-Losses for High-Dimensional Linear Errors-in-Variables Models over ℓq-Balls

1 School of Mathematics, Northwest University, Xi’an 710069, China
2 School of Information Science and Technology, Northwest University, Xi’an 710069, China
* Author to whom correspondence should be addressed.
Submission received: 23 April 2021 / Revised: 27 May 2021 / Accepted: 4 June 2021 / Published: 5 June 2021

Abstract

In this paper, the high-dimensional linear regression model is considered, where the covariates are measured with additive noise. Different from most other methods, which are based on the assumption that the true covariates are fully observed, the results in this paper only require that the corrupted covariate matrix is observed. Then, by applying information theory, the minimax rates of convergence for estimation are investigated in terms of the $\ell_p$ ($1 \le p < \infty$)-losses under a general sparsity assumption on the underlying regression parameter and some regularity conditions on the observed covariate matrix. The established lower and upper bounds on minimax risks agree up to constant factors when $p = 2$, and together they provide the information-theoretic limits of estimating a sparse vector in the high-dimensional linear errors-in-variables model. An estimator for the underlying parameter is also proposed and shown to be minimax optimal in the $\ell_2$-loss.

1. Introduction

In various fields of applied sciences and engineering, such as machine learning [1], a fundamental problem is to estimate an underlying parameter $\beta \in \mathbb{R}^d$ of a linear regression model of the form
$$ y_i = \langle X_{i\cdot}, \beta \rangle + e_i, \quad \text{for } i = 1, 2, \ldots, n, \tag{1} $$
where $\{(X_{i\cdot}, y_i)\}_{i=1}^n$ are i.i.d. observations ($X_{i\cdot} \in \mathbb{R}^d$) and $e \in \mathbb{R}^n$ is the random noise. In matrix form, model (1) can be written as $y = X\beta + e$, where $X = (X_{1\cdot}, \ldots, X_{n\cdot})^\top \in \mathbb{R}^{n \times d}$ and $y, e \in \mathbb{R}^n$. The covariates $X_{i\cdot}$ ($i = 1, 2, \ldots, n$) are usually assumed to be fully observed in standard formulations. However, this assumption is far from reality since, in general, measurement error cannot be avoided. In many real-world applications, due to the lack of observation or instrumental constraints, the collected data, such as remote sensing data, may be perturbed and tend to be noisy [2]. It has been shown in [3] that misleading inference results may be obtained if a method designed for clean data is applied naively to noisy data. Therefore, it is more realistic to explore the case where only corrupted versions of the true covariates $X_{i\cdot}$ are observed; see, e.g., [4]. This is known as the measurement error model in the literature.
Estimation in the presence of measurement errors has attracted interest for a long time. Bickel and Ritov [5] first studied linear measurement error models and proposed an efficient estimator. Then, Stefanski and Carroll [6] investigated generalized linear measurement error models and constructed consistent estimators. Extensive results have also been established on parameter estimation and variable selection in both parametric and nonparametric settings; see [7,8] and references therein. It should be noted that these results are only applicable to classical low-dimensional (i.e., $n \gg d$) statistical models.
In the past two decades, high-dimensional statistical models, where the number of observations is much smaller than the number of predictors (i.e., $n \ll d$), have received much attention, and fruitful results have been achieved in a wide range of research areas; see [9,10] for a detailed review. Most of the existing results are only suitable for models with clean data, while some researchers have begun to focus on the measurement error case. For example, Loh and Wainwright studied the high-dimensional sparse linear regression model with corrupted covariates. Although the proposed estimator involves solving a nonconvex optimization problem, they proved that the global and stationary points are statistically consistent; see [11,12]. Datta and Zou [13] proposed the Convex Conditioned Lasso (CoCoLasso), which enjoys the convexity of the Lasso in both estimation and algorithm and can handle a class of corrupted datasets, including the cases of additive or multiplicative measurement error. Li et al. [14] investigated a general nonconvex estimation method from statistical and computational aspects, and the results can be applied directly to the corrupted errors-in-variables linear regression.
Apart from studying statistical convergence rates and designing efficient algorithms for specific estimators, it is also fundamental to understand the information-theoretic limitations of statistical inference, against which computationally efficient procedures can be evaluated. Such fundamental limits are usually studied by virtue of minimax rates, which aim to find an estimator that minimizes the worst-case loss and can thus reveal gaps between the performance of some computationally efficient algorithm and that of an optimal algorithm. The minimax rate is usually analyzed from two aspects, namely information-theoretic lower bounds and statistical upper bounds. On the information-theoretic side, the Kullback–Leibler (KL) divergence is commonly used to provide lower bounds [15]. Recently, in [16], Loh provided a detailed review of techniques for deriving information-theoretic lower bounds for minimax estimation and learning, focusing on problem settings involving community recovery, parameter and function estimation, and online learning for multi-armed bandits. On the statistical side, a specific estimator is usually constructed to derive upper bounds; see, e.g., [17,18]. For high-dimensional linear regression with additive errors, Loh and Wainwright [19] established minimax rates of convergence for estimating the unknown parameter in the $\ell_2$-loss. The proposed estimator was also shown to be minimax optimal in the additive error case under the $\ell_2$-loss, assuming that the true parameter is exactly sparse, that is, $\beta$ has at most $s \ll d$ nonzero elements; this is known as the exact sparsity assumption.
However, the exact sparsity assumption may sometimes be too restrictive in real applications. For example, in image processing it is a standard phenomenon that wavelet coefficients of images exhibit an exponential decay but need not be exactly zero (see, e.g., [20]). Other high-dimensional applications include compressed sensing [21], genomic analysis [22], signal processing [23], and so on, where it is not suitable to impose an exact sparsity assumption on the underlying parameter. Hence, it is necessary to investigate minimax rates of estimation when the exact sparsity assumption does not hold.
Our main purpose in the present study is to investigate the more general situation in which the coefficients of the true parameter are not exactly zero, and then to provide minimax rates of convergence for estimation in sparse linear regression with additive errors. More precisely, we assume that for a fixed $q \in [0, 1]$, the $\ell_q$-norm of $\beta$, defined as $\|\beta\|_q := \big(\sum_{j=1}^{d} |\beta_j|^q\big)^{1/q}$, is bounded from above. Note that this assumption reduces to the exact sparsity assumption when $q = 0$. When $q \in (0, 1]$, this type of sparsity is known as soft sparsity. The exact sparsity assumption has been widely used for statistical inference, while the soft sparsity assumption has attracted relatively little attention apart from the works [24,25,26]. Specifically, under both exact and soft sparsity assumptions, Raskutti et al. [24] and Ye and Zhang [26] provided minimax rates of convergence for estimation in high-dimensional linear regression, respectively; Wang et al. [25] developed optimal rates of convergence and proposed an adaptive $\ell_q$-aggregation strategy via model mixing which attains the established optimal rate automatically. It is worth noting that the results in [24,25,26] are all obtained for clean data and cannot be applied to the errors-in-variables model. This is a fundamental difference from our present study.
The main contributions of this paper are as follows. Assuming that the regression parameter is softly sparse, on the information-theoretic side we establish lower bounds on the minimax risks in $\ell_p$ ($1 \le p < \infty$)-losses by virtue of the mutual information; these bounds hold for any estimator of the model, regardless of the specific method. On the statistical side, we propose an estimator that can be solved efficiently and then provide upper bounds on the $\ell_2$-loss between the estimator and the true parameter. Moreover, the lower and upper bounds agree up to constant factors when $p = 2$, implying that the proposed estimator is minimax optimal in the $\ell_2$-loss.
The remainder of this paper is organized as follows. In Section 2, we provide background on the errors-in-variables linear regression model and some regularity conditions on the observed covariate matrix. In Section 3, we establish our main results on lower and upper bounds on minimax risks in $\ell_p$ ($1 \le p < \infty$)-losses over $\ell_q$-balls. Conclusions and future work are discussed in Section 4.
We end this section by introducing some notation for future reference. We use the Greek lowercase letter $\beta$ to denote vectors. All vectors are column vectors, following classical mathematical convention. A vector $\beta$ is supported on $S$ if and only if $S = \{ i \in \{1, 2, \ldots, d\} : \beta_i \neq 0 \}$, and $S$ is called the support of $\beta$, denoted by $\mathrm{supp}(\beta)$, namely $\mathrm{supp}(\beta) = S$. For $d \ge 1$, let $I_d$ stand for the $d \times d$ identity matrix. For a matrix $X \in \mathbb{R}^{n \times d}$, let $X_{ij}$ ($i = 1, \ldots, n$, $j = 1, \ldots, d$) denote its $ij$-th entry, $X_{i\cdot}$ ($i = 1, \ldots, n$) its $i$-th row, and $X_{\cdot j}$ ($j = 1, \ldots, d$) its $j$-th column.

2. Problem Setup

In this section, we begin with a precise formulation of the problem and then impose some regularity assumptions on the observed matrix.
Recall the standard linear regression model (1). One of the main types of measurement error is additive error. Specifically, for each $i = 1, 2, \ldots, n$, we observe $Z_{i\cdot} = X_{i\cdot} + W_{i\cdot}$, where $W_{i\cdot} \in \mathbb{R}^d$ is a random vector independent of $X_{i\cdot}$ with mean 0 and known covariance matrix $\Sigma_w$. When the noise covariance $\Sigma_w$ is unknown, there are methods to estimate it from observed data; see, e.g., [4]. For example, a simple method is to estimate $\Sigma_w$ from blank independent observations of the noise. Specifically, suppose that one independently observes a matrix $W_0 \in \mathbb{R}^{n \times d}$ consisting of $n$ i.i.d. vectors of noise. Then $\widehat{\Sigma}_w = \frac{1}{n} W_0^\top W_0$ can be used as an estimate of $\Sigma_w$. Other, more sophisticated variants of this method are also provided in [4].
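For illustration, a minimal NumPy sketch of this plug-in estimate is given below (the variable names, dimensions, and noise level are our own and purely illustrative, not from the original paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 500                     # high-dimensional regime: n << d
sigma_w = 0.5                       # noise standard deviation (illustrative value)

# Blank observations: n i.i.d. noise vectors W_{0,i.} ~ N(0, sigma_w^2 I_d)
W0 = rng.normal(scale=sigma_w, size=(n, d))

# Plug-in estimate of the noise covariance: Sigma_w_hat = W0^T W0 / n
Sigma_w_hat = W0.T @ W0 / n

# Sanity check: the average diagonal entry should be close to sigma_w^2
print(np.mean(np.diag(Sigma_w_hat)), sigma_w**2)
```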
Throughout this paper, we assume that for $i = 1, 2, \ldots, n$, the vectors $X_{i\cdot}$, $W_{i\cdot}$, and $e$ are Gaussian with mean 0 and covariance matrices $\sigma_x^2 I_d$ ($\sigma_x > 0$), $\sigma_w^2 I_d$, and $\sigma_e^2 I_n$, respectively, and we write $\sigma_z^2 = \sigma_x^2 + \sigma_w^2$ for simplicity.
Following the previous works [11,12], we fix $i \in \{1, 2, \ldots, n\}$ and write $\Sigma_x$ for the covariance matrix of $X_{i\cdot}$ (i.e., $\Sigma_x = \mathrm{cov}(X_{i\cdot}) = \sigma_x^2 I_d$). Let $(\widehat{\Gamma}, \widehat{\Upsilon})$ stand for estimators of $(\Sigma_x, \Sigma_x\beta)$ that depend only on the observed data $\{(Z_{i\cdot}, y_i)\}_{i=1}^n$. As discussed in [11], an unbiased and suitable choice of the surrogate pair $(\widehat{\Gamma}, \widehat{\Upsilon})$ for the additive error case is given by
$$ \widehat{\Gamma} := \frac{Z^\top Z}{n} - \Sigma_w \quad \text{and} \quad \widehat{\Upsilon} := \frac{Z^\top y}{n}. $$
Under the high-dimensional scenario ($n \ll d$), the matrix $\widehat{\Gamma}$, which estimates $\Sigma_x$ in the corrupted case, is never positive semidefinite. To be specific, the matrix $Z^\top Z$ has rank at most $n$, and the positive definite matrix $\Sigma_w$ is then subtracted to obtain $\widehat{\Gamma}$. Consequently, $\widehat{\Gamma}$ cannot be positive semidefinite, regardless of the amount of noise. However, this does not affect the present results. In particular, although the indefiniteness of $\widehat{\Gamma}$ leads to a nonconvex optimization problem in estimating $\beta$ (cf. (14)) as well as in the upper bound analysis, a weaker condition (cf. Assumption 2) allows further analysis.
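The surrogate pair and the rank argument above can be checked numerically. The following sketch (again with illustrative dimensions and noise levels of our own choosing) builds $(\widehat{\Gamma}, \widehat{\Upsilon})$ from simulated corrupted data and confirms that $\widehat{\Gamma}$ has negative eigenvalues when $n < d$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 100, 400                                    # n << d
sigma_x, sigma_w, sigma_e = 1.0, 0.5, 0.1

beta = np.zeros(d)
beta[:5] = 1 / np.sqrt(5)                          # unit-norm sparse parameter
X = rng.normal(scale=sigma_x, size=(n, d))         # true covariates
W = rng.normal(scale=sigma_w, size=(n, d))         # additive measurement noise
Z = X + W                                          # observed (corrupted) covariates
y = X @ beta + rng.normal(scale=sigma_e, size=n)

Sigma_w = sigma_w**2 * np.eye(d)                   # known noise covariance
Gamma_hat = Z.T @ Z / n - Sigma_w                  # surrogate for Sigma_x
Upsilon_hat = Z.T @ y / n                          # surrogate for Sigma_x @ beta

# With n < d, Z^T Z / n has rank at most n, so Gamma_hat must have
# negative eigenvalues after subtracting the positive definite Sigma_w.
print(np.linalg.eigvalsh(Gamma_hat).min())         # strictly negative
```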
Instead of assuming that the regression parameter $\beta$ is exactly sparse (i.e., $|\mathrm{supp}(\beta)| \ll d$), we use a more general notion to characterize the sparsity of $\beta$. Specifically, we assume that for $q \in [0, 1]$ and a radius $R_q > 0$, $\beta \in \mathbb{B}_q(R_q)$, where
$$ \mathbb{B}_q(R_q) := \Big\{ \beta \in \mathbb{R}^d : \|\beta\|_q^q = \sum_{j=1}^{d} |\beta_j|^q \le R_q \Big\}. $$
The use of $\ell_q$-balls is a common and popular way to measure the degree of sparsity (strictly speaking, the above sets are not true "balls", as they fail to be convex when $q \in [0, 1)$). Note that $\beta \in \mathbb{B}_0(R_0)$ corresponds to the case where $\beta$ is exactly sparse, while for $q \in (0, 1]$, $\beta \in \mathbb{B}_q(R_q)$ corresponds to the case of weak sparsity, which imposes a certain decay rate on the ordered entries of $\beta$. Throughout this paper, let $q \in [0, 1]$ be fixed, and assume that $\beta \in \mathbb{B}_q(R_q)$ unless otherwise specified. Moreover, without loss of generality, we assume that $\|\beta\|_2 = 1$ and define $\mathbb{S}_2(1) := \{ \beta \in \mathbb{R}^d : \|\beta\|_2 = 1 \}$, i.e., the $\ell_2$ unit sphere. Then it follows that $\beta \in \mathbb{B}_q(R_q) \cap \mathbb{S}_2(1)$.
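To make the notion of weak sparsity concrete, the short sketch below (illustrative values only) checks membership of a polynomially decaying, unit-norm vector in $\mathbb{B}_q(R_q)$: none of its entries is exactly zero, yet its $\ell_q$-mass stays below the radius:

```python
import numpy as np

d, q, R_q = 1000, 0.5, 10.0

# A weakly sparse vector: polynomially decaying coefficients, none exactly zero
beta = np.arange(1, d + 1, dtype=float) ** (-2.0)
beta /= np.linalg.norm(beta)                 # normalize so that ||beta||_2 = 1

lq_mass = np.sum(np.abs(beta) ** q)          # ||beta||_q^q
print(lq_mass, lq_mass <= R_q)               # membership in B_q(R_q): True
print(np.count_nonzero(beta))                # = d, so beta is not exactly sparse
```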
In order to estimate the regression parameter, one usually considers an estimator $\widehat{\beta}: \mathbb{R}^{n\times d}\times\mathbb{R}^n \to \mathbb{R}^d$, which is a measurable function of the observed data $\{(Z_{i\cdot}, y_i)\}_{i=1}^n$. Then, for the purpose of assessing the estimation quality of $\widehat{\beta}$, it is typical to introduce a loss function $L(\widehat{\beta}, \beta)$, which represents the loss incurred by the estimator $\widehat{\beta}$ when the true parameter is $\beta \in \mathbb{B}_q(R_q)\cap\mathbb{S}_2(1)$. Finally, in the minimax formalism, we aim to choose an estimator that minimizes the following worst-case loss:
$$ \min_{\widehat{\beta}} \max_{\beta \in \mathbb{B}_q(R_q)\cap\mathbb{S}_2(1)} L(\widehat{\beta}, \beta). $$
Specifically, in this paper, we consider the $\ell_p$-losses for $p \in [1, +\infty)$ defined as
$$ L_p(\widehat{\beta}, \beta) := \|\widehat{\beta} - \beta\|_p^p. $$
We then impose some regularity conditions on the observed matrix $Z$, which are useful for analyzing the minimax rates. The first assumption requires that the columns of $Z$ are bounded from above in the $\ell_2$-norm.
Assumption 1
(Column normalization). There exists a constant $0 < \kappa_c < +\infty$ such that
$$ \frac{1}{\sqrt{n}} \max_{j = 1, 2, \ldots, d} \|Z_{\cdot j}\|_2 \le \kappa_c. $$
The second assumption imposes a lower bound on the restricted eigenvalue of the surrogate Gram matrix $\widehat{\Gamma}$, which is, in other words, a lower bound on the restricted curvature.
Assumption 2
(Restricted eigenvalue condition). There exist a constant $\kappa_l > 0$ and a function $\tau_l(n, d)$ such that for all $\beta \in \mathbb{B}_q(2R_q)$,
$$ \beta^\top \widehat{\Gamma} \beta \ge \kappa_l \|\beta\|_2^2 - \tau_l(n, d). $$
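The two assumptions can be probed numerically for a Gaussian design. The sketch below (constants and dimensions are illustrative, and Assumption 2 is only spot-checked at random sparse directions rather than verified over all of $\mathbb{B}_q(2R_q)$) computes the column-normalization constant of $Z$ and the restricted curvature of the surrogate matrix $\widehat{\Gamma}$:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 100, 400
sigma_x, sigma_w = 1.0, 0.5

X = rng.normal(scale=sigma_x, size=(n, d))
W = rng.normal(scale=sigma_w, size=(n, d))
Z = X + W
Gamma_hat = Z.T @ Z / n - sigma_w**2 * np.eye(d)

# Assumption 1: maximum rescaled column norm of Z
kappa_c = np.max(np.linalg.norm(Z, axis=0)) / np.sqrt(n)
print("kappa_c =", kappa_c)                  # of the order of sigma_z

# Assumption 2: spot-check the restricted curvature at random sparse unit vectors
kappa_l = sigma_x**2 / 2
tau_l = sigma_x**2 * np.log(d) / n           # order suggested by Remark 1 below, constant set to 1
for _ in range(5):
    v = np.zeros(d)
    idx = rng.choice(d, size=10, replace=False)
    v[idx] = rng.normal(size=10)
    v /= np.linalg.norm(v)                   # ||v||_2 = 1
    print(v @ Gamma_hat @ v >= kappa_l - tau_l)
```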
Remark 1. (i) Note that although we focus on the random design case in this article, Assumptions 1 and 2 are stated in deterministic form. This choice makes them applicable to both fixed and random design matrices. Specifically, previous studies have shown that Assumptions 1 and 2 are satisfied by a wide range of random matrices with high probability; see, e.g., [11,14,27]. Meanwhile, Assumptions 1 and 2 make it possible to analyze the fixed design case, in which the matrices are usually chosen by researchers with suitable constants, i.e., $\kappa_c$ in Assumption 1 and $\kappa_l, \tau_l$ in Assumption 2. This deterministic form of the regularity conditions on the design matrix is also adopted in modern high-dimensional statistics and machine learning; see, e.g., [11,12,14,28].
(ii) For the Gaussian model, we assumed that for $i = 1, 2, \ldots, n$, the vectors $X_{i\cdot}$ and $W_{i\cdot}$ are independent Gaussian vectors with mean 0 and covariance matrices $\sigma_x^2 I_d$ and $\sigma_w^2 I_d$, respectively, so the observed covariate $Z_{i\cdot}$ is also Gaussian with mean 0 and covariance matrix $\Sigma_z = (\sigma_x^2 + \sigma_w^2) I_d$. Recalling that $\sigma_z^2 = \sigma_x^2 + \sigma_w^2$, one has $Z_{i\cdot} \sim N(0, \sigma_z^2 I_d)$. Furthermore, since the observations are i.i.d., each column $Z_{\cdot j}$ ($j = 1, \ldots, d$) has i.i.d. elements, and thus $\|Z_{\cdot j}\|_2^2/\sigma_z^2$ ($j = 1, \ldots, d$) follows the $\chi^2$ distribution with $n$ degrees of freedom. Then, for Assumption 1, it follows immediately from standard tail bounds for $\chi^2$-variates and union bounds [27] (Appendix I) that there exist universal positive constants $(c_0, c_1)$ such that Assumption 1 holds with $\kappa_c = c_0\sigma_z\sqrt{\frac{\log d}{n}}$ with probability at least $1 - c_1\exp(-\log d)$.
(iii) As for Assumption 2, it follows from [29] (Lemma 1) that there exist universal positive constants $(c_0, c_1, c_2)$ such that Assumption 2 holds with $\kappa_l = \frac{\sigma_x^2}{2}$ and $\tau_l = c_0\sigma_x^2\max\big(\frac{\sigma_z^4}{\sigma_x^4}, 1\big)\frac{\log d}{n}$ with probability at least $1 - c_1\exp\big(-c_2 n \min\big(\frac{\sigma_x^4}{\sigma_z^4}, 1\big)\big)$.

3. Main Results

In this section, we turn to our main results on lower and upper bounds on minimax risks. We first derive intermediate results under Assumptions 1 and 2, and then turn to the probabilistic consequences by virtue of Remark 1(ii),(iii) on the conditions that guarantee Assumptions 1 and 2.
Let $\mathbb{P}_\beta$ denote the distribution of $y$ in the linear regression model with additive errors, when $\beta$ is given and $Z$ is observed. The following lemma gives the KL divergence between the distributions induced by two different parameters $\beta, \beta' \in \mathbb{B}_q(R_q)$. The KL divergence plays a key role in establishing the information-theoretic lower bound. Recall that for two distributions $\mathbb{P}$ and $\mathbb{Q}$ with densities $d\mathbb{P}$ and $d\mathbb{Q}$ with respect to some base measure $\mu$, the KL divergence is defined by $D(\mathbb{P}\,\|\,\mathbb{Q}) = \int \log\frac{d\mathbb{P}}{d\mathbb{Q}}\,d\mathbb{P}$.
Lemma 1.
In the additive error setting, the KL divergence between the distributions induced by any $\beta, \beta' \in \mathbb{B}_q(R_q)\cap\mathbb{S}_2(1)$ is equal to
$$ D(\mathbb{P}_\beta\,\|\,\mathbb{P}_{\beta'}) = \frac{\sigma_x^4}{2\sigma_z^2(\sigma_x^2\sigma_w^2 + \sigma_z^2\sigma_e^2)}\,\|Z(\beta - \beta')\|_2^2. $$
Proof. 
For each fixed $i = 1, 2, \ldots, n$, by the model setting, $(y_i, Z_{i\cdot})$ is jointly Gaussian with mean 0. Then, by some elementary algebra to compute the covariances, one has that
$$ \begin{pmatrix} y_i \\ Z_{i\cdot} \end{pmatrix} \sim N\!\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \beta^\top\Sigma_x\beta + \sigma_e^2 & \beta^\top\Sigma_x \\ \Sigma_x\beta & \Sigma_x + \Sigma_w \end{pmatrix} \right). $$
Then, it follows from standard results on the conditional distribution of Gaussian variables that
$$ y_i \mid Z_{i\cdot} \sim N\big(\beta^\top\Sigma_x\Sigma_z^{-1}Z_{i\cdot},\; \beta^\top(\Sigma_x - \Sigma_x\Sigma_z^{-1}\Sigma_x)\beta + \sigma_e^2\big). \tag{2} $$
Now assume that $\sigma_e$ and $\sigma_w$ are not both 0; otherwise, the conclusion holds trivially. Since $\mathbb{P}_\beta$ is a product distribution of $y_i \mid Z_{i\cdot}$ over all $i = 1, 2, \ldots, n$, it follows from (2) that
$$ \begin{aligned} D(\mathbb{P}_\beta\,\|\,\mathbb{P}_{\beta'}) &= \mathbb{E}_{\mathbb{P}_\beta}\left[\log\frac{\mathbb{P}_\beta(y)}{\mathbb{P}_{\beta'}(y)}\right] \\ &= \mathbb{E}_{\mathbb{P}_\beta}\left[-\frac{n}{2}\log\frac{\sigma_\beta^2}{\sigma_{\beta'}^2} - \frac{\|y - Z\Sigma_z^{-1}\Sigma_x\beta\|_2^2}{2\sigma_\beta^2} + \frac{\|y - Z\Sigma_z^{-1}\Sigma_x\beta'\|_2^2}{2\sigma_{\beta'}^2}\right] \\ &= \frac{n}{2}\log\frac{\sigma_{\beta'}^2}{\sigma_\beta^2} + \frac{n}{2}\left(\frac{\sigma_\beta^2}{\sigma_{\beta'}^2} - 1\right) + \frac{1}{2\sigma_{\beta'}^2}\|Z\Sigma_z^{-1}\Sigma_x(\beta - \beta')\|_2^2, \end{aligned} \tag{3} $$
where $\sigma_\beta^2 := \beta^\top(\Sigma_x - \Sigma_x\Sigma_z^{-1}\Sigma_x)\beta + \sigma_e^2$, and $\sigma_{\beta'}^2$ is defined analogously. Since $\Sigma_x = \sigma_x^2 I_d$, $\Sigma_w = \sigma_w^2 I_d$, and $\|\beta\|_2 = \|\beta'\|_2 = 1$ by the assumptions, we immediately arrive at
$$ \sigma_\beta^2 = \sigma_{\beta'}^2 = \sigma_x^2 - \frac{\sigma_x^4}{\sigma_z^2}\|\beta\|_2^2 + \sigma_e^2 = \frac{\sigma_x^2\sigma_w^2}{\sigma_z^2} + \sigma_e^2. $$
Substituting this equality into (3) yields
$$ D(\mathbb{P}_\beta\,\|\,\mathbb{P}_{\beta'}) = \frac{\sigma_x^4}{2\sigma_z^2(\sigma_x^2\sigma_w^2 + \sigma_z^2\sigma_e^2)}\,\|Z(\beta - \beta')\|_2^2. $$
The proof is completed. □
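As an independent sanity check of Lemma 1 (our own illustration, not part of the original paper), the following sketch compares the closed-form KL divergence between the two induced Gaussian conditional distributions with the expression in the lemma; the two quantities coincide up to floating-point error:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 50, 20
sigma_x, sigma_w, sigma_e = 1.0, 0.7, 0.3
sigma_z2 = sigma_x**2 + sigma_w**2

Z = rng.normal(scale=np.sqrt(sigma_z2), size=(n, d))

def unit_sparse():
    v = np.zeros(d)
    v[rng.choice(d, size=3, replace=False)] = rng.normal(size=3)
    return v / np.linalg.norm(v)

beta, beta_p = unit_sparse(), unit_sparse()

# y | Z is N(Z Sigma_z^{-1} Sigma_x beta, sigma_beta^2 I_n) with
# sigma_beta^2 = sigma_x^2 sigma_w^2 / sigma_z^2 + sigma_e^2 (same for beta and beta')
ratio = sigma_x**2 / sigma_z2
sigma_b2 = sigma_x**2 * sigma_w**2 / sigma_z2 + sigma_e**2
mu, mu_p = ratio * Z @ beta, ratio * Z @ beta_p

# KL between Gaussians with equal covariance: ||mu - mu'||^2 / (2 sigma^2)
kl_direct = np.sum((mu - mu_p)**2) / (2 * sigma_b2)

# Expression from Lemma 1
kl_lemma = sigma_x**4 / (2 * sigma_z2 * (sigma_x**2 * sigma_w**2 + sigma_z2 * sigma_e**2)) \
           * np.sum((Z @ (beta - beta_p))**2)

print(np.isclose(kl_direct, kl_lemma))   # True
```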
Proposition 1.
In the additive error setting, suppose that the observed matrix $Z$ satisfies Assumption 1 with $0 < \kappa_c < +\infty$. Then, for any $p \in [1, +\infty)$, there exists a constant $c_{q,p}$ depending only on $q$ and $p$ such that, with probability at least $1/2$, the minimax $\ell_p$-loss over the $\ell_q$-ball is lower bounded as
$$ \min_{\widehat{\beta}} \max_{\beta \in \mathbb{B}_q(R_q)\cap\mathbb{S}_2(1)} \|\widehat{\beta} - \beta\|_p^p \ge c_{q,p}\left(\frac{\sigma_z^2(\sigma_x^2\sigma_w^2 + \sigma_z^2\sigma_e^2)}{\sigma_x^4\kappa_c^2}\right)^{\frac{p-q}{2}} R_q\left(\frac{\log d}{n}\right)^{\frac{p-q}{2}}. \tag{4} $$
Proof. 
For positive numbers $\delta > 0$ and $\epsilon > 0$, let $M_p(\delta)$ denote the cardinality of a maximal $\delta$-packing of the ball $\mathbb{B}_q(R_q)$ in the $\ell_p$ metric, with elements $\{\beta^1, \beta^2, \ldots, \beta^M\}$, and let $N_2(\epsilon)$ denote the minimal cardinality of an $\epsilon$-covering of $\mathbb{B}_q(R_q)$ in the $\ell_2$-norm. We follow the standard technique of [30] to reduce the estimation of the lower bound to a multi-way hypothesis testing problem as follows:
$$ \mathbb{P}\left(\min_{\widehat{\beta}}\max_{\beta \in \mathbb{B}_q(R_q)\cap\mathbb{S}_2(1)} \|\widehat{\beta} - \beta\|_p^p \ge \frac{1}{2^p}\delta^p\right) \ge \min_{\widetilde{\beta}} \mathbb{P}(B \neq \widetilde{\beta}), \tag{5} $$
where $B \in \mathbb{R}^d$ is a random vector uniformly distributed over the packing set $\{\beta^1, \beta^2, \ldots, \beta^M\}$, and $\widetilde{\beta}$ is an estimator taking values in the packing set. It then follows from Fano's inequality [30] that
$$ \mathbb{P}(B \neq \widetilde{\beta}) \ge 1 - \frac{I(y; B) + \log 2}{\log M_p(\delta)}, \tag{6} $$
where $I(y; B)$ is the mutual information between the random variable $B$ and the observation vector $y \in \mathbb{R}^n$. It now remains to upper bound the mutual information $I(y; B)$. Based on the procedure of [30], the mutual information is upper bounded as
$$ I(y; B) \le \log N_2(\epsilon) + D(\mathbb{P}_\beta\,\|\,\mathbb{P}_{\beta'}). \tag{7} $$
Let $\mathrm{absconv}_q(Z/\sqrt{n})$ denote the $q$-convex hull of the rescaled columns of the observed matrix $Z$, that is,
$$ \mathrm{absconv}_q(Z/\sqrt{n}) := \left\{\frac{1}{\sqrt{n}}\sum_{j=1}^{d}\theta_j Z_{\cdot j} \;\Big|\; \theta \in \mathbb{B}_q(R_q)\right\}, $$
where the normalization factor $1/\sqrt{n}$ is used for convenience. Since $Z$ satisfies Assumption 1, [31] (Lemma 4) is applicable and yields a set $\{Z\widetilde{\beta}^1, Z\widetilde{\beta}^2, \ldots, Z\widetilde{\beta}^N\}$ such that for every $\beta \in \mathbb{B}_q(R_q)$ there exist some index $i$ and some constant $c > 0$ such that $\|Z(\beta - \widetilde{\beta}^i)\|_2/\sqrt{n} \le c\,\kappa_c\,\epsilon$. Combining this inequality with Lemma 1 and (7), one has that the mutual information is upper bounded as
$$ I(y; B) \le \log N_2(\epsilon) + \frac{\sigma_x^4}{\sigma_z^2(\sigma_x^2\sigma_w^2 + \sigma_z^2\sigma_e^2)}\,n\,c^2\kappa_c^2\epsilon^2. $$
Thus, we obtain by (6) that
$$ \mathbb{P}(B \neq \widetilde{\beta}) \ge 1 - \frac{\log N_2(\epsilon) + \frac{\sigma_x^4}{\sigma_z^2(\sigma_x^2\sigma_w^2 + \sigma_z^2\sigma_e^2)}\,n\,c^2\kappa_c^2\epsilon^2 + \log 2}{\log M_p(\delta)}. \tag{8} $$
It remains to choose the packing and covering radii (i.e., $\delta$ and $\epsilon$, respectively) such that (8) is strictly above zero, say bounded below by $1/2$. For the sake of simplicity, denote $\sigma^2 := \frac{\sigma_z^2(\sigma_x^2\sigma_w^2 + \sigma_z^2\sigma_e^2)}{\sigma_x^4}$. Suppose that we choose the pair $(\delta, \epsilon)$ such that
$$ \frac{c^2 n}{\sigma^2}\,\kappa_c^2\epsilon^2 \le \log N_2(\epsilon), \quad\text{and} \tag{9a} $$
$$ \log M_p(\delta) \ge 6\log N_2(\epsilon). \tag{9b} $$
As long as $N_2(\epsilon) \ge 2$, it is guaranteed that
$$ \mathbb{P}(B \neq \widetilde{\beta}) \ge 1 - \frac{2\log N_2(\epsilon) + \log 2}{6\log N_2(\epsilon)} \ge \frac{1}{2}, \tag{10} $$
as desired. It remains to determine the values of the pair $(\delta, \epsilon)$ satisfying (9). By [31] (Lemma 3), we know that if $\frac{c^2 n}{\sigma^2}\kappa_c^2\epsilon^2 = L_{q,2} R_q^{\frac{2}{2-q}}\left(\frac{1}{\epsilon}\right)^{\frac{2q}{2-q}}\log d$ for some constant $L_{q,2}$ depending only on $q$, then (9a) is satisfied. Thus, we can choose $\epsilon$ satisfying
$$ \epsilon^{\frac{4}{2-q}} = L_{q,2} R_q^{\frac{2}{2-q}}\,\frac{\sigma^2}{c^2\kappa_c^2}\,\frac{\log d}{n}. \tag{11} $$
In addition, it follows from [31] (Lemma 3) that if $\delta$ is chosen to satisfy
$$ U_{q,p} R_q^{\frac{p}{p-q}}\left(\frac{1}{\delta}\right)^{\frac{pq}{p-q}}\log d \ge 6 L_{q,2} R_q^{\frac{2}{2-q}}\left(\frac{1}{\epsilon}\right)^{\frac{2q}{2-q}}\log d, \tag{12} $$
for some constant $U_{q,p}$ depending only on $q$ and $p$, then (9b) holds. Combining (11) and (12), we may choose $\delta$ such that
$$ \delta^p = \left(\frac{U_{q,p}}{6L_{q,2}}\right)^{\frac{p-q}{q}}\left(\epsilon^{\frac{4}{2-q}}\right)^{\frac{p-q}{2}} R_q^{\frac{2-p}{2-q}} = L_{q,2}^{\frac{p-q}{2}}\left(\frac{U_{q,p}}{6L_{q,2}}\right)^{\frac{p-q}{q}} R_q\left(\frac{\sigma^2}{c^2\kappa_c^2}\,\frac{\log d}{n}\right)^{\frac{p-q}{2}}. $$
Combining this choice with (10) and (5), we obtain that there exists a constant $c_{q,p}$ depending only on $q$ and $p$ such that
$$ \mathbb{P}\left(\min_{\widehat{\beta}}\max_{\beta \in \mathbb{B}_q(R_q)\cap\mathbb{S}_2(1)} \|\widehat{\beta} - \beta\|_p^p \ge c_{q,p}\,R_q\left(\frac{\sigma_z^2(\sigma_x^2\sigma_w^2 + \sigma_z^2\sigma_e^2)}{\sigma_x^4\kappa_c^2}\,\frac{\log d}{n}\right)^{\frac{p-q}{2}}\right) \ge \frac{1}{2}. $$
The proof is complete. □
Note that the probability $1/2$ in Proposition 1 is just a standard convention, and it may be made arbitrarily close to $2/3$ by choosing the universal constants suitably. Specifically, note from Equation (10) that as long as $N_2(\epsilon)$ is sufficiently large, the probability can be made arbitrarily close to $2/3$. The requirement that $N_2(\epsilon)$ be sufficiently large can be met by choosing the universal constants $L_{q,2}$ and $c$ in view of Equation (11).
Proposition 2.
In the additive error setting, suppose that for a universal constant $c_1$, $\widehat{\Gamma}$ satisfies Assumption 2 with $\kappa_l > 0$ and $\tau_l(n, d) \le c_1 R_q\left(\frac{\log d}{n}\right)^{1-q/2}$. Then there exist universal constants $(c_2, c_3)$ and a constant $c_q$ depending only on $q$ such that, with probability at least $1 - c_2\exp(-c_3\log d)$, the minimax $\ell_2$-loss over the $\ell_q$-ball is upper bounded as
$$ \min_{\widehat{\beta}}\max_{\beta \in \mathbb{B}_q(R_q)\cap\mathbb{S}_2(1)} \|\widehat{\beta} - \beta\|_2^2 \le c_q\,\frac{\sigma_z^{2-q}(\sigma_w + \sigma_e)^{2-q} + \kappa_l^{1-q}}{\kappa_l^{2-q}}\,R_q\left(\frac{\log d}{n}\right)^{1-q/2}. \tag{13} $$
Proof. 
It suffices to find an estimator for $\beta$ that has a small $\ell_2$-norm estimation error with high probability. We consider the estimator formulated as follows:
$$ \widehat{\beta} \in \operatorname*{arg\,min}_{\beta \in \mathbb{B}_q(R_q)\cap\mathbb{S}_2(1)} \left\{\frac{1}{2}\beta^\top\widehat{\Gamma}\beta - \widehat{\Upsilon}^\top\beta\right\}. \tag{14} $$
It is worth noting that (14) involves solving a nonconvex optimization problem when $q \in [0, 1)$, while a near-global solution can be obtained efficiently by the algorithm proposed in [14]. Since $\beta \in \mathbb{B}_q(R_q)\cap\mathbb{S}_2(1)$, it follows from the optimality of $\widehat{\beta}$ that $\frac{1}{2}\widehat{\beta}^\top\widehat{\Gamma}\widehat{\beta} - \widehat{\Upsilon}^\top\widehat{\beta} \le \frac{1}{2}\beta^\top\widehat{\Gamma}\beta - \widehat{\Upsilon}^\top\beta$. Define $\widehat{\Delta} := \widehat{\beta} - \beta$, and thus one has that $\widehat{\Delta} \in \mathbb{B}_q(2R_q)$. Then it follows that
$$ \widehat{\Delta}^\top\widehat{\Gamma}\widehat{\Delta} \le 2\langle\widehat{\Delta},\, \widehat{\Upsilon} - \widehat{\Gamma}\beta\rangle. $$
This inequality, together with the assumption that $\widehat{\Gamma}$ satisfies Assumption 2, implies that
$$ \kappa_l\|\widehat{\Delta}\|_2^2 - \tau_l(R_q, n, d) \le 2\langle\widehat{\Delta},\, \widehat{\Upsilon} - \widehat{\Gamma}\beta\rangle \le 2\|\widehat{\Delta}\|_1\,\|\widehat{\Upsilon} - \widehat{\Gamma}\beta\|_\infty. \tag{15} $$
It then follows from [11] (Lemma 2) that there exist universal constants $(c_2, c_3, c_4)$ such that, with probability at least $1 - c_2\exp(-c_3\log d)$,
$$ \|\widehat{\Upsilon} - \widehat{\Gamma}\beta\|_\infty \le c_4\sigma_z(\sigma_w + \sigma_e)\,\|\beta\|_2\sqrt{\frac{\log d}{n}} = c_4\sigma_z(\sigma_w + \sigma_e)\sqrt{\frac{\log d}{n}}. \tag{16} $$
Combining (15) and (16), one has that
$$ \kappa_l\|\widehat{\Delta}\|_2^2 \le 2c_4\sigma_z(\sigma_w + \sigma_e)\sqrt{\frac{\log d}{n}}\,\|\widehat{\Delta}\|_1 + \tau_l(R_q, n, d). $$
Introduce the shorthand $\sigma := \sigma_z(\sigma_w + \sigma_e)$, and recall that $\widehat{\Delta} \in \mathbb{B}_q(2R_q)$. It then follows from [24] (Lemma 5) (with $\tau = \frac{2c_4\sigma}{\kappa_l}\sqrt{\frac{\log d}{n}}$) and the assumption $\tau_l(R_q, n, d) \le c_1 R_q\left(\frac{\log d}{n}\right)^{1-q/2}$ that
$$ \|\widehat{\Delta}\|_2^2 \le \sqrt{2R_q}\left(\frac{2c_4\sigma}{\kappa_l}\sqrt{\frac{\log d}{n}}\right)^{1-\frac{q}{2}}\|\widehat{\Delta}\|_2 + 2R_q\left(\frac{2c_4\sigma}{\kappa_l}\sqrt{\frac{\log d}{n}}\right)^{2-q} + \frac{c_1}{\kappa_l}\,R_q\left(\frac{\log d}{n}\right)^{1-\frac{q}{2}}. $$
Therefore, by solving this inequality with $\|\widehat{\Delta}\|_2$ viewed as the indeterminate, we arrive at the conclusion that there exists a constant $c_q$ depending only on $q$ such that (13) holds with probability at least $1 - c_2\exp(-c_3\log d)$. The proof is complete. □
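For intuition about how an estimator of the form (14) might be computed in practice, the sketch below (our own illustration, not the algorithm of [14]) runs projected gradient descent on the quadratic objective $\frac{1}{2}\beta^\top\widehat{\Gamma}\beta - \widehat{\Upsilon}^\top\beta$ over an $\ell_1$-ball, i.e., the convex constraint case $q = 1$, and it ignores the unit-sphere constraint for simplicity; the step size, iteration count, and radius are illustrative:

```python
import numpy as np

def project_l1(v, radius):
    """Euclidean projection onto the l1-ball {x : ||x||_1 <= radius}."""
    if np.sum(np.abs(v)) <= radius:
        return v
    u = np.sort(np.abs(v))[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, v.size + 1) > (css - radius))[0][-1]
    theta = (css[rho] - radius) / (rho + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def corrected_estimator(Gamma_hat, Upsilon_hat, radius, step=0.05, iters=1000):
    """Projected gradient descent on 0.5 * b' Gamma_hat b - Upsilon_hat' b over an l1-ball."""
    beta = np.zeros(Gamma_hat.shape[0])
    for _ in range(iters):
        grad = Gamma_hat @ beta - Upsilon_hat        # gradient of the quadratic objective
        beta = project_l1(beta - step * grad, radius)
    return beta

# Usage with a surrogate pair (Gamma_hat, Upsilon_hat) as constructed in Section 2,
# and an l1-radius matching the sparse unit-norm parameter used there:
# beta_hat = corrected_estimator(Gamma_hat, Upsilon_hat, radius=np.sqrt(5))
```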
Remark 2. (i) The lower and upper bounds on minimax risks depend on the triple $(R_q, n, d)$, the error level, and structural properties of the observed matrix $Z$, as shown in Propositions 1 and 2. Specifically, setting $p = 2$ in Proposition 1, the lower and upper bounds agree up to constant factors independent of the triple $(R_q, n, d)$, showing the optimal minimax rate in the additive error case.
(ii) Note that when $p = 2$ and $q = 0$ (i.e., the exact sparsity case), the minimax rate scales as $\Theta\big(R_0\frac{\log d}{n}\big)$. In the high-dimensional regime where $d/R_0 \ge d^\gamma$ for some constant $\gamma > 0$, this rate is equivalent to $\frac{R_0\log(d/R_0)}{n}$ (up to constant factors), which recovers the same scaling as in [19].
(iii) The assumption that $\tau_l(R_q, n, d) \le c_1 R_q\left(\frac{\log d}{n}\right)^{1-q/2}$ in Proposition 2 is not unreasonable. It has been shown in [11] (Lemma 1) that it is satisfied with high probability for the high-dimensional linear errors-in-variables model.
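To spell out the equivalence claimed in Remark 2(ii): assuming $d/R_0 \ge d^\gamma$ and $R_0 \ge 1$, one has $\gamma\log d \le \log(d/R_0) \le \log d$, and hence
$$ \gamma\, R_0\,\frac{\log d}{n} \;\le\; R_0\,\frac{\log(d/R_0)}{n} \;\le\; R_0\,\frac{\log d}{n}, $$
so the two rates differ by at most the constant factor $1/\gamma$.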
The following two theorems give the probabilistic consequences in view of the conditions that ensure Assumptions 1 and 2. The proofs are obtained by applying Propositions 1 and 2 together with Remark 1(ii),(iii), respectively, as well as elementary probability theory.
Theorem 1 
(Lower bound on the $\ell_p$-loss). In the additive error setting, for any $p \in [1, +\infty)$, there exist universal positive constants $(c_0, c_1)$ and a constant $c_{q,p}$ depending only on $q$ and $p$ such that, with probability at least $\frac{1}{2}\big(1 - c_1\exp(-\log d)\big)$, the minimax $\ell_p$-loss over the $\ell_q$-ball is lower bounded as
$$ \min_{\widehat{\beta}}\max_{\beta \in \mathbb{B}_q(R_q)\cap\mathbb{S}_2(1)} \|\widehat{\beta} - \beta\|_p^p \ge c_{q,p}\left(\frac{\sigma_x^2\sigma_w^2 + \sigma_z^2\sigma_e^2}{c_0^2\,\sigma_x^4}\right)^{\frac{p-q}{2}} R_q. \tag{17} $$
Proof. 
Inequality (17) follows from substituting $\kappa_c = c_0\sigma_z\sqrt{\log d/n}$, for a universal positive constant $c_0$, into (4). As for the probability, we define the two events $A = \{\text{Assumption 1 holds}\}$ and $B = \{\text{(17) holds}\}$. Then Proposition 1 is applicable and yields $\mathbb{P}(B \mid A) \ge 1/2$. Note from Remark 1(ii) that $\mathbb{P}(A) \ge 1 - c_1\exp(-\log d)$ for a universal positive constant $c_1$. Then it follows from elementary probability that $\mathbb{P}(B) = \mathbb{P}(B\mid A)\mathbb{P}(A) + \mathbb{P}(B\mid A^c)\mathbb{P}(A^c) \ge \mathbb{P}(B\mid A)\mathbb{P}(A) \ge \frac{1}{2}\big(1 - c_1\exp(-\log d)\big)$, which completes the proof. □
Theorem 2 
(Upper bound on the $\ell_2$-loss). In the additive error setting, suppose that for a universal constant $c_0$, $\sigma_x^2\max\big(\frac{\sigma_z^4}{\sigma_x^4}, 1\big) \le c_0 R_q\left(\frac{\log d}{n}\right)^{-q/2}$. Then there exist universal constants $(c_1, c_2, c_3, c_4)$ and a constant $c_q$ depending only on $q$ such that, with probability at least $\big(1 - c_1\exp\big(-c_2 n\min\big(\frac{\sigma_x^4}{\sigma_z^4}, 1\big)\big)\big)\big(1 - c_3\exp(-c_4\log d)\big)$, the minimax $\ell_2$-loss over the $\ell_q$-ball is upper bounded as
$$ \min_{\widehat{\beta}}\max_{\beta \in \mathbb{B}_q(R_q)\cap\mathbb{S}_2(1)} \|\widehat{\beta} - \beta\|_2^2 \le c_q\,\frac{\sigma_z^{2-q}(\sigma_w + \sigma_e)^{2-q} + \sigma_x^{2-2q}}{\sigma_x^{4-2q}}\,R_q\left(\frac{\log d}{n}\right)^{1-q/2}. \tag{18} $$
Proof. 
Inequality (18) follows from substituting $\kappa_l = \frac{\sigma_x^2}{2}$ into (13). As for the probability, we define the two events $C = \{\text{Assumption 2 holds}\}$ and $D = \{\text{(18) holds}\}$. Then Proposition 2 is applicable and yields $\mathbb{P}(D \mid C) \ge 1 - c_3\exp(-c_4\log d)$ for universal positive constants $(c_3, c_4)$. Note from Remark 1(iii) that $\mathbb{P}(C) \ge 1 - c_1\exp\big(-c_2 n\min\big(\frac{\sigma_x^4}{\sigma_z^4}, 1\big)\big)$ for universal positive constants $(c_1, c_2)$. Then it follows from elementary probability that $\mathbb{P}(D) = \mathbb{P}(D\mid C)\mathbb{P}(C) + \mathbb{P}(D\mid C^c)\mathbb{P}(C^c) \ge \mathbb{P}(D\mid C)\mathbb{P}(C) \ge \big(1 - c_1\exp\big(-c_2 n\min\big(\frac{\sigma_x^4}{\sigma_z^4}, 1\big)\big)\big)\big(1 - c_3\exp(-c_4\log d)\big)$, which completes the proof. □

4. Conclusions

We focused on the information-theoretic limitations of estimation for sparse linear regression with additive measurement errors under high-dimensional scaling. The minimax rates of convergence were analyzed by virtue of lower and upper bounds for $\ell_p$-losses over $\ell_q$-balls based on information theory. The derived lower and upper bounds together reveal the influence of corruption in the observed covariates on parameter estimation. Note that the assumed Gaussian random design matrices are of particular interest and widely applied in fields where the design matrix can be chosen by researchers, such as compressed sensing and signal processing [32]. However, the independent Gaussian assumption is still somewhat restrictive, while our earlier work [14] provides upper bounds on estimation for sub-Gaussian matrices with nondiagonal covariances. Further research may generalize the current results, especially the lower bounds, to sub-Gaussian matrices with nondiagonal covariances or to other types of measurement error, such as multiplicative errors or errors with dependent structures. In addition, in view of the modern high-dimensional challenge, it is of great significance for a method to be adaptive in learning problems. Hence, it would be a promising direction to analyze minimax optimal rates of convergence without knowledge of the sparsity degree, such as $q$ and the $\ell_q$-radius $R_q$.

Author Contributions

Conceptualization, X.L. and D.W.; methodology, X.L.; validation, X.L. and D.W.; formal analysis, X.L.; investigation, X.L. and D.W.; writing—original draft preparation, X.L.; writing—review and editing, X.L. and D.W.; supervision, D.W.; project administration, X.L. Both authors have read and agreed to the published version of the manuscript.

Funding

The APC was funded by the Research Achievement Cultivation Fund of the School of Mathematics, Northwest University.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Ahmed, S.E.; Amiri, S.; Doksum, K. Ensemble linear subspace analysis of high-dimensional data. Entropy 2021, 23, 324. [Google Scholar] [CrossRef] [PubMed]
  2. Song, X.; Wang, C.Y. GMM nonparametric correction methods for logistic regression with error-contaminated covariates and partially observed instrumental variables. Scand. J. Stat. 2019, 46, 898–919. [Google Scholar] [CrossRef] [PubMed]
  3. Sørensen, Ø.; Frigessi, A.; Thoresen, M. Measurement error in Lasso: Impact and likelihood bias correction. Stat. Sin. 2015, 25, 809–829. [Google Scholar] [CrossRef] [Green Version]
  4. Carroll, R.J.; Ruppert, D.; Stefanski, L.A.; Crainiceanu, C.M. Measurement Error in Nonlinear Models: A Modern Perspective, 2nd ed.; Chapman & Hall/CRC: Boca Raton, FL, USA, 2006. [Google Scholar]
  5. Bickel, P.J.; Ritov, Y. Efficient estimation in the errors in variables model. Ann. Stat. 1987, 15, 513–540. [Google Scholar] [CrossRef]
  6. Stefanski, L.A.; Carroll, R.J. Conditional scores and optimal scores for generalized linear measurement-error models. Biometrika 1987, 74, 703–716. [Google Scholar]
  7. Torabi, M.; Datta, G.S.; Rao, J. Empirical Bayes estimation of small area means under a nested error linear regression model with measurement errors in the covariates. Scand. J. Stat. 2009, 36, 355–369. [Google Scholar] [CrossRef]
  8. Xu, Y.H.; Li, Y.H.; Song, X. Locally efficient semiparametric estimators for proportional hazards models with measurement error. Scand. J. Stat. 2016, 43, 558–572. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  9. Bühlmann, P.; Van De Geer, S. Statistics for High-Dimensional Data: Methods, Theory And Applications; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2011. [Google Scholar]
  10. Wainwright, M.J. High-Dimensional Statistics: A Non-Asymptotic Viewpoint; Cambridge University Press: Cambridge, UK, 2019; Volume 48. [Google Scholar]
  11. Loh, P.L.; Wainwright, M.J. High-dimensional regression with noisy and missing data: Provable guarantees with nonconvexity. Ann. Stat. 2012, 40, 1637–1664. [Google Scholar] [CrossRef]
  12. Loh, P.L.; Wainwright, M.J. Regularized M-estimators with nonconvexity: Statistical and algorithmic theory for local optima. J. Mach. Learn. Res. 2015, 16, 559–616. [Google Scholar]
  13. Datta, A.; Zou, H. Cocolasso for high-dimensional error-in-variables regression. Ann. Stat. 2017, 45, 2400–2426. [Google Scholar] [CrossRef] [Green Version]
  14. Li, X.; Wu, D.Y.; Li, C.; Wang, J.H.; Yao, J.C. Sparse recovery via nonconvex regularized M-estimators over ℓq-balls. Comput. Stat. Data Anal. 2020, 152, 107047. [Google Scholar] [CrossRef]
  15. Bu, Y.H.; Zou, S.F.; Liang, Y.B.; Veeravalli, V.V. Estimation of KL divergence: Optimal minimax rate. IEEE Trans. Inform. Theory 2018, 64, 2648–2674. [Google Scholar] [CrossRef]
  16. Loh, P.L. On lower bounds for statistical learning theory. Entropy 2017, 19, 617. [Google Scholar] [CrossRef] [Green Version]
  17. Cai, T.T.; Zhang, A. Minimax rate-optimal estimation of high-dimensional covariance matrices with incomplete data. J. Multivar. Anal. 2016, 150, 55–74. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  18. Wu, J.B. Quasi-minimax estimation in the partial linear model. Commun. Stat. Theory Methods 2016, 46, 2982–2989. [Google Scholar] [CrossRef]
  19. Loh, P.L.; Wainwright, M.J. Corrupted and missing predictors: Minimax bounds for high-dimensional linear regression. In Proceedings of the IEEE International Symposium on Information Theory Proceedings, Cambridge, MA, USA, 1–6 July 2012; pp. 2601–2605. [Google Scholar]
  20. Joshi, R.L.; Crump, V.J.; Fischer, T.R. Image subband coding using arithmetic coded trellis coded quantization. IEEE Trans. Circuits Syst. Video Technol. 1995, 5, 515–523. [Google Scholar] [CrossRef]
  21. Yu, J.Y.; Li, C.; Song, X.M.; Guo, S.Y.; Wang, E. Parallel mixed image encryption and extraction algorithm based on compressed sensing. Entropy 2021, 23, 278. [Google Scholar] [CrossRef]
  22. Li, X.; Wu, D.Y.; Cui, Y.; Liu, B.; Walter, H.; Schumann, G.; Li, C.; Jiang, T.Z. Reliable heritability estimation using sparse regularization in ultrahigh dimensional genome-wide association studies. BMC Bioinform. 2019, 20, 1–11. [Google Scholar] [CrossRef]
  23. Pourasad, Y.; Ranjbarzadeh, R.; Mardani, A. A new algorithm for digital image encryption based on chaos theory. Entropy 2021, 23, 341. [Google Scholar] [CrossRef] [PubMed]
  24. Raskutti, G.; Wainwright, M.J.; Yu, B. Minimax rates of estimation for high-dimensional linear regression over ℓq-balls. IEEE Trans. Inform. Theory 2011, 57, 6976–6994. [Google Scholar] [CrossRef] [Green Version]
  25. Wang, Z.; Paterlini, S.; Gao, F.C.; Yang, Y.H. Adaptive minimax regression estimation over sparse q-hulls. J. Mach. Learn. Res. 2014, 15, 1675–1711. [Google Scholar]
  26. Ye, F.; Zhang, C.H. Rate minimaxity of the Lasso and Dantzig selector for the ℓq loss in ℓr balls. J. Mach. Learn. Res. 2010, 11, 3519–3540. [Google Scholar]
  27. Raskutti, G.; Wainwright, M.J.; Yu, B. Restricted eigenvalue properties for correlated Gaussian designs. J. Mach. Learn. Res. 2010, 11, 2241–2259. [Google Scholar]
  28. Agarwal, A.; Negahban, S.N.; Wainwright, M.J. Fast global convergence of gradient methods for high-dimensional statistical recovery. Ann. Stat. 2012, 40, 2452–2482. [Google Scholar] [CrossRef]
  29. Loh, P.L.; Wainwright, M.J. Supplementary material: High-dimensional regression with noisy and missing data: Provable guarantees with nonconvexity. Ann. Stat. 2012, 40, 1637–1664. [Google Scholar] [CrossRef]
  30. Yang, Y.H.; Barron, A. Information-theoretic determination of minimax rates of convergence. Ann. Stat. 1999, 27, 1564–1599. [Google Scholar] [CrossRef]
  31. Raskutti, G.; Wainwright, M.J.; Yu, B. Minimax Rates of Estimation for High-Dimensional Linear Regression over ℓq-Balls; Technology Report; IEEE: Piscataway, NJ, USA, 2009. [Google Scholar]
  32. Candès, E.J.; Romberg, J.K.; Tao, T. Stable signal recovery from incomplete and inaccurate measurements. Commun. Pure Appl. Math. 2006, 59, 410–412. [Google Scholar] [CrossRef] [Green Version]