Article

Inference for the Linear IV Model Ridge Estimator Using Training and Test Samples

Fallaw Sowell 1 and Nandana Sengupta 2

1 Tepper School of Business, Carnegie Mellon University, Pittsburgh, PA 15213, USA
2 School of Public Policy, Indian Institute of Technology Delhi, New Delhi 110016, India
* Author to whom correspondence should be addressed.
Submission received: 22 July 2021 / Revised: 28 August 2021 / Accepted: 30 August 2021 / Published: 3 September 2021
(This article belongs to the Special Issue Ridge Regression, Liu and Related Estimators)

Abstract:
The asymptotic distribution is presented for the linear instrumental variables model estimated with a ridge penalty and a prior where the tuning parameter is selected with a holdout sample. The structural parameters and the tuning parameter are estimated jointly by method of moments. A chi-squared statistic permits confidence regions for the structural parameters. The form of the asymptotic distribution provides insights on the optimal way to perform the split between the training and test sample. Results for the linear regression estimated by ridge regression are presented as a special case.

1. Introduction

This paper contributes to the asymptotic distribution theory for ridge parameter estimates for the linear instrumental variables model. The tuning parameter for the ridge penalty, denoted α , is selected by splitting the data into a training sample and a test (or holdout) sample. In [1], the ridge penalty parameter is estimated jointly with the structural parameters and the asymptotic distribution is characterized as the projection of a stochastic process onto a cone. This gives the rate of convergence to the asymptotic distribution but does not provide guidance for inference. To allow inference, the closed form for the asymptotic distribution is presented. These new results allow for the calculation of confidence regions for the structural parameters. When the prior is equal to the population parameter value of the structural parameters, the tuning parameter is not identified. However, the structural parameter estimates are consistent and the asymptotic covariance is smaller than the asymptotic covariance of the two-stage least squares estimator.
A fundamental issue with ridge regression is the selection of the tuning parameter and its resulting impact on inference for the parameters of interest. One approach is to select a deterministic function of the sample size that shrinks to zero fast enough to not impact the asymptotic distribution. The resulting asymptotic distribution is then equivalent to the OLS asymptotic distribution [2]. An alternative approach is to select the tuning parameter with the observed data. The literature contains multiple ways to estimate the tuning parameter [3,4]. For these estimates, inference typically follows the approach stated in [5]. Conditional on a fixed α, the ridge estimator’s covariance is a function of α. The α is selected using the observed data and substituted into the covariance that was calculated assuming that α is fixed. This covariance is used to create a test statistic. In [5], the resulting tests are correctly referred to as approximate t-tests because the authors appreciate the internal inconsistency of using a covariance obtained assuming a fixed α with an α value estimated with the observed data. (This problem is well known in the literature; see [6,7,8,9,10].) Each estimate for the tuning parameter leads to a different approximate t-test, and these tests are typically compared using simulations. This has been the approach for the past 20 years [5,11]. (“When the ridge parameter k is determined from the data, the above arguments are no longer valid. Hence, to investigate the size and power of the approximate ridge-based t-type tests in such cases, a Monte Carlo simulation study is conducted” [5]. “Since a theoretical assessment among the test statistics is not possible, a simulation study has been conducted to evaluate the performance of the suggested test statistics” [11].) For other models, researchers have proposed alternative procedures to obtain hyperparameter (tuning parameter) estimates [12,13]. Like the previous ridge regression literature, these approaches have relied on simulations to demonstrate their behavior. For inference, these procedures would need to be extended to establish their asymptotic distributions. A third approach is followed in this paper. The tuning parameter is selected by splitting the sample into training and test samples. The tuning parameter defines a path from the prior to the IV estimator on the training sample. On this path, the tuning parameter is selected to minimize the prediction error on the test sample. This procedure is written as a method of moments estimation problem where the tuning parameter and the parameters of interest are simultaneously estimated. Inference is then performed using the joint asymptotic distribution.
A related literature concerns the distribution of some empirically selected ridge tuning parameters [14,15,16]. These approaches have relied on strong distributional assumptions (e.g., normally distributed errors). In addition, they are built on tuning parameters as functions of the data, where the functions are determined by assuming that the tuning parameter is fixed. This leads to an inconsistency because using the data to select the tuning parameter means that the tuning parameter is no longer fixed. In this paper, the inconsistency is avoided by estimating the structural parameters and the ridge tuning parameter simultaneously. Additionally, the method of moments framework permits weaker assumptions.
In [1], the asymptotic joint distribution for the parameters in the linear model and the ridge tuning parameter is characterized as the projection of a stochastic process onto a cone. This structure occurs because the probability limit of the ridge tuning parameter is on the boundary of the parameter space. (This is the same problem that arises when consistently estimating a population parameter that is on the boundary of the parameter space [17,18,19,20].) This leads to a nonstandard asymptotic distribution that depends on the prior and the population parameter value of the structural parameters. When the prior is different from the population parameter value, the asymptotic distribution for the ridge tuning parameter is a mixture with a discrete mass of 1/2 at zero and a truncated normal over the positive reals. In addition, the asymptotic distribution for the structural parameters is normal with a nonzero mean. This mean and variance both contain the population parameter value. This prevents the calculation of t-statistics for individual parameter estimates. However, a hypothesis for the entire set of structural parameters can be tested using a chi-square test, and this statistic can be inverted to give accurate confidence regions.

2. Ridge Estimator for Linear IV Model Using a Holdout Sample

Consider the linear instrumental variables model where Y is n × 1, X is n × k, and Z is n × m with m ≥ k,
$$Y = X\beta_0 + \varepsilon$$
$$X = Z\Gamma_0 + u$$
where the m × 1 instruments z_i are iid with full rank m × m second moments R_z = E[z_i z_i′] and, conditional on Z,
$$\begin{bmatrix} \varepsilon_i \\ u_i \end{bmatrix} \sim iid\left( 0,\; \begin{bmatrix} \sigma_\varepsilon^2 & \Sigma_{\varepsilon u} \\ \Sigma_{u\varepsilon} & \Sigma_u \end{bmatrix} \right).$$
The IV, or 2SLS, estimator is
$$\hat\beta_{IV} = \arg\min_\beta \frac{1}{2n}(Y - X\beta)' Z (Z'Z)^{-1} Z' (Y - X\beta) = (X' P_Z X)^{-1} X' P_Z Y$$
where P_Z is the projection matrix for Z; it has the asymptotic distribution $\sqrt{n}\left( \hat\beta_{IV} - \beta_0 \right) \stackrel{a}{\sim} N\left( 0,\; \sigma_\varepsilon^2 \left( \Gamma_0' R_z \Gamma_0 \right)^{-1} \right)$.
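For concreteness, the estimator can be computed directly from the data matrices. The following is a minimal numpy sketch of the 2SLS formula above; the function and variable names are illustrative and not from the paper.

```python
import numpy as np

def two_stage_least_squares(Y, X, Z):
    # beta_hat_IV = (X' P_Z X)^{-1} X' P_Z Y, computed without forming P_Z explicitly
    PZ_X = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)  # P_Z X
    PZ_Y = Z @ np.linalg.solve(Z.T @ Z, Z.T @ Y)  # P_Z Y
    return np.linalg.solve(X.T @ PZ_X, X.T @ PZ_Y)
```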
Let X′P_ZX/n have the spectral decomposition CDC′, where C is orthonormal, i.e., C′C = I_k, and D is a positive semidefinite diagonal k × k matrix of eigenvalues λ_1, λ_2, …, λ_k. When some of the eigenvectors explain very little variation, i.e., the corresponding eigenvalues have small magnitudes, the objective function is flatter along these dimensions and the resulting covariance estimates are larger, because the variance of β̂_IV is proportional to (X′P_ZX/n)^{−1} = (CDC′)^{−1} = CD^{−1}C′. This leads to a relatively large MSE. The ridge estimator addresses this by shrinking the estimated parameter towards a prior. The ridge objective function augments the usual IV objective function (4) with a quadratic penalty centered at a prior, β_p, weighted by a regularization tuning parameter α:
$$\frac{1}{2n}(Y - X\beta)' P_Z (Y - X\beta) + \frac{1}{2}\alpha(\beta - \beta_p)'(\beta - \beta_p).$$
Conditional on α , the ridge solution is
$$\hat\beta_{IV}(\alpha) = \left( \frac{X'P_Z X}{n} + \alpha I_k \right)^{-1} \left( \frac{X'P_Z Y}{n} + \alpha \beta_p \right).$$
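The variance-inflation mechanism described above is easy to see numerically. The sketch below builds an arbitrary stand-in for X′P_ZX/n with one nearly flat direction and shows how the ridge term αI_k bounds the inflation of the small-eigenvalue component; all numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((200, 2))
A[:, 1] *= 0.05                  # make the second direction nearly flat
M = A.T @ A / 200                # stands in for X'P_Z X / n
lam, C = np.linalg.eigh(M)       # spectral decomposition M = C D C'
print(1 / lam)                   # variance components from C D^{-1} C': one explodes
print(1 / (lam + 0.1))           # with alpha = 0.1, the inflation is bounded
```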
Different values of α result in different estimated values for β_0. An optimal value for α can be determined empirically by splitting the data into training and test samples. The training sample is a randomly drawn sample of [τn] observations, denoted Y_{τn}, X_{τn}, and Z_{τn}. The estimate using the training sample, conditional on α, is
$$\hat\beta_{tr}(\alpha) \equiv \arg\min_\beta \frac{1}{2[\tau n]}\left( Y_{\tau n} - X_{\tau n}\beta \right)' P_{Z_{\tau n}} \left( Y_{\tau n} - X_{\tau n}\beta \right) + \frac{\alpha}{2}(\beta - \beta_p)'(\beta - \beta_p)$$
$$= \left( \frac{X_{\tau n}' P_{Z_{\tau n}} X_{\tau n}}{[\tau n]} + \alpha I_k \right)^{-1} \left( \frac{X_{\tau n}' P_{Z_{\tau n}} Y_{\tau n}}{[\tau n]} + \alpha \beta_p \right)$$
where P_{Z_{τn}} is the projection matrix onto Z_{τn} and [·] is the greatest integer function. The optimal α is selected to minimize the IV least squares objective function over the remaining (n − [τn]) observations, i.e., the test or holdout sample, denoted Y_{n(1−τ)}, X_{n(1−τ)}, and Z_{n(1−τ)}. The estimated tuning parameter is defined by α̂ = argmin_{α∈[0,∞)} Q_{n(1−τ)}(α), where
$$Q_{n(1-\tau)}(\alpha) = \frac{1}{2(n - [n\tau])}\left( Y_{n(1-\tau)} - X_{n(1-\tau)}\hat\beta_{tr}(\alpha) \right)' P_{Z_{n(1-\tau)}} \left( Y_{n(1-\tau)} - X_{n(1-\tau)}\hat\beta_{tr}(\alpha) \right)$$
and P_{Z_{n(1−τ)}} is the projection matrix onto Z_{n(1−τ)}. The ridge regression estimate β̂_{α̂} ≡ β̂_IV(α̂) is then characterized by
$$-\frac{1}{n} X' P_Z \left( Y - X\hat\beta_{\hat\alpha} \right) + \hat\alpha\left( \hat\beta_{\hat\alpha} - \beta_p \right) = 0.$$
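A minimal numpy sketch of the full procedure follows: fit the ridge IV path on the training sample, select α̂ on the holdout sample, and refit on the full sample. The grid search stands in for the exact argmin over [0, ∞), and all names are illustrative.

```python
import numpy as np

def ridge_iv(Y, X, Z, alpha, beta_p):
    # beta_hat(alpha) = (X'P_Z X/n + alpha I_k)^{-1} (X'P_Z Y/n + alpha beta_p)
    n, k = X.shape
    PZ = Z @ np.linalg.solve(Z.T @ Z, Z.T)
    A = X.T @ PZ @ X / n
    b = X.T @ PZ @ Y / n
    return np.linalg.solve(A + alpha * np.eye(k), b + alpha * beta_p)

def holdout_alpha(Y, X, Z, beta_p, tau=0.5, grid=None):
    # Select alpha_hat by minimizing Q_{n(1-tau)}(alpha) on the test sample.
    n = len(Y)
    m = int(tau * n)                         # [tau n] training observations
    if grid is None:
        grid = np.concatenate(([0.0], np.logspace(-6, 2, 200)))
    Yt, Xt, Zt = Y[:m], X[:m], Z[:m]         # training sample
    Yv, Xv, Zv = Y[m:], X[m:], Z[m:]         # test (holdout) sample
    PZv = Zv @ np.linalg.solve(Zv.T @ Zv, Zv.T)
    def Q(alpha):
        e = Yv - Xv @ ridge_iv(Yt, Xt, Zt, alpha, beta_p)
        return e @ PZv @ e / (2 * (n - m))
    alpha_hat = min(grid, key=Q)
    return alpha_hat, ridge_iv(Y, X, Z, alpha_hat, beta_p)  # final fit on full sample
```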
Ref. [1] showed how the asymptotic distribution of the ridge estimator can be determined with the method of moments framework using the parameterization
$$\theta = \begin{bmatrix} \mathrm{vech}(R_\tau) \\ \mathrm{vech}(R_{(1-\tau)}) \\ \mathrm{vec}(S_\tau) \\ \mathrm{vec}(S_{(1-\tau)}) \\ \beta_{tr} \\ \alpha \\ \beta \end{bmatrix}$$
where vec ( · ) stacks the elements from a matrix into a column vector and vech ( · ) stacks the unique elements from a symmetric matrix into a column vector. The population parameter values are
$$\theta_0 = \begin{bmatrix} \mathrm{vech}(R_z) \\ \mathrm{vech}(R_z) \\ \mathrm{vec}(R_z\Gamma_0) \\ \mathrm{vec}(R_z\Gamma_0) \\ \beta_0 \\ 0 \\ \beta_0 \end{bmatrix}.$$
The ridge estimator is part of the parameter estimates defined by the just identified system of equations $H_n(\theta) = \frac{1}{n}\sum_{i=1}^{n} h_i(\theta) = 0$ where
$$h_i(\theta) = \begin{bmatrix} 1_\tau(i)\,\mathrm{vech}\left( R_\tau - z_i z_i' \right) \\ (1 - 1_\tau(i))\,\mathrm{vech}\left( R_{(1-\tau)} - z_i z_i' \right) \\ 1_\tau(i)\,\mathrm{vec}\left( S_\tau - z_i x_i' \right) \\ (1 - 1_\tau(i))\,\mathrm{vec}\left( S_{(1-\tau)} - z_i x_i' \right) \\ 1_\tau(i)\left( -S_\tau' R_\tau^{-1} z_i (y_i - x_i'\beta_{tr}) + \alpha(\beta_{tr} - \beta_p) \right) \\ (1 - 1_\tau(i))\,(y_i - x_i'\beta_{tr})\, z_i' R_{(1-\tau)}^{-1} S_{(1-\tau)} \left( S_\tau' R_\tau^{-1} S_\tau + \alpha I_k \right)^{-1} (\beta_p - \beta_{tr}) \\ -\left( \tau S_\tau + (1-\tau) S_{1-\tau} \right)' \left( \tau R_\tau + (1-\tau) R_{1-\tau} \right)^{-1} z_i (y_i - x_i'\beta) + \alpha(\beta - \beta_p) \end{bmatrix}$$
and the training and test samples are determined with the indicator function
$$1_\tau(i) = \begin{cases} 1, & i \le [\tau n] \\ 0, & [\tau n] < i. \end{cases}$$
Using the structure of Equation (9), the system $H_n(\theta) = \frac{1}{n}\sum_{i=1}^{n} h_i(\theta) = 0$ can be seen as seven sets of equations. The first four sets are each self-contained systems with equal numbers of equations and parameters. The fifth set has k equations and introduces the k parameters β_tr. The sixth is a single equation with the parameter α. The seventh introduces the final k parameters, β. Identification occurs because the expectation of the gradient is invertible. This is presented in Appendix A.

3. Asymptotic Behavior

The asymptotic distribution is derived with four high level assumptions.
Assumption 1.
z_i is iid with finite fourth moments and E[z_i z_i′] = R_z has full rank.
Assumption 2.
Conditional on Z, the vectors (ε_i, u_i′)′ are iid with zero mean and a full rank covariance matrix with possibly nonzero off-diagonal elements.
Assumptions 1 and 2 imply E [ h i ( θ 0 ) ] = 0 and n H n ( θ 0 ) satisfies the CLT.
Assumption 3.
The parameter space Θ is defined as follows: R_z is restricted to symmetric positive definite matrices with eigenvalues 1/B_1 ≤ e_1 ≤ e_2 ≤ ⋯ ≤ e_m ≤ B_1, |β_j| ≤ B_2 for j = 1, 2, …, k, Γ_0 = [γ_{ℓ,j}] is of full rank with |γ_{ℓ,j}| ≤ B_3 for ℓ = 1, …, m, j = 1, 2, …, k, and α ∈ [0, B_4], where B_1, B_2, B_3, and B_4 are positive and finite.
Assumption 4.
The fraction of the sample used for training satisfies 0 < τ < 1 .
The tuning parameter selected using a holdout sample is root-n consistent when the prior is different from the population parameter value. When the prior is equal to the population parameter value, the tuning parameter is not identified.
Lemma 1.
Assumptions 1–4 imply, when β_p ≠ β_0, (1) α̂ →_p 0 and (2) √n α̂ = O_p(1); and, when β_p = β_0, α̂ converges in distribution to a draw from the distribution of α_min,
$$\alpha_{\min} \equiv \arg\min_{a \in [0,\infty)} \left( Z_{(1-\tau)} - R_z^{1/2}\Gamma_0\left( \Gamma_0' R_z \Gamma_0 + a I_k \right)^{-1}\Gamma_0' R_z^{1/2} Z_\tau \right)' \left( Z_{(1-\tau)} - R_z^{1/2}\Gamma_0\left( \Gamma_0' R_z \Gamma_0 + a I_k \right)^{-1}\Gamma_0' R_z^{1/2} Z_\tau \right)$$
where Z_{(1−τ)} and Z_τ are iid N(0, I_m) and R_z^{1/2} is the symmetric matrix square root of R_z.
Proofs are given in Appendix A. The a-min distribution, with m × k matrix parameter S, is characterized by
$$\arg\min_{a \in [0,\infty)} \left( Z_1 - S\left( S'S + a I_k \right)^{-1} S' Z_2 \right)' \left( Z_1 - S\left( S'S + a I_k \right)^{-1} S' Z_2 \right)$$
where Z_1 and Z_2 are iid N(0, I_m). When β_p = β_0, α̂ converges in distribution to a draw from the a-min distribution with parameter R_z^{1/2}Γ_0.
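The a-min distribution has no standard closed form, but draws are easy to simulate directly from its definition. Below is a hedged Monte Carlo sketch (a grid search approximates the argmin; the example S is a hypothetical value of R_z^{1/2}Γ_0, not taken from the paper):

```python
import numpy as np

def amin_draw(S, rng, grid=None):
    # One draw: argmin over a >= 0 of ||Z1 - S (S'S + a I_k)^{-1} S' Z2||^2,
    # with Z1, Z2 iid N(0, I_m); a grid search approximates the argmin.
    m, k = S.shape
    if grid is None:
        grid = np.concatenate(([0.0], np.logspace(-4, 4, 400)))
    Z1, Z2 = rng.standard_normal(m), rng.standard_normal(m)
    def objective(a):
        r = Z1 - S @ np.linalg.solve(S.T @ S + a * np.eye(k), S.T @ Z2)
        return r @ r
    return min(grid, key=objective)

rng = np.random.default_rng(0)
S = np.array([[1.0, 0.0], [0.0, 0.1], [1.0, 0.0], [1.0, 0.0]])  # hypothetical R_z^{1/2} Gamma_0
draws = [amin_draw(S, rng) for _ in range(1000)]
```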
When β_p = β_0, α is no longer identified. Recall that α parameterizes a path from the prior to the IV estimator on the training sample. When the prior equals β_0, both endpoints of this path estimate β_0 consistently, so every value of α is associated with a consistent estimator for β_0.
Lemma 1 implies that the probability limit for the tuning parameter is zero, α 0 = 0 , which is on the boundary of the parameter space. This results in a nonstandard asymptotic distribution, which is characterized by the projection of a random vector on a cone (denoted Λ ) that allows for the sample estimate to be on the boundary of the parameter space. The estimation objective function can be expanded into a quadratic approximation about the centered and scaled population parameter values
$$n H_n(\theta)' H_n(\theta) = n H_n(\theta_0)' H_n(\theta_0) + 2 n H_n(\theta_0)' \frac{\partial H_n(\theta_0)}{\partial \theta'} (\theta - \theta_0) + n (\theta - \theta_0)' \frac{\partial H_n(\theta_0)}{\partial \theta'}{}' \frac{\partial H_n(\theta_0)}{\partial \theta'} (\theta - \theta_0) + o_p(1) = n \left( H_n(\theta_0) + \frac{\partial H_n(\theta_0)}{\partial \theta'} (\theta - \theta_0) \right)' \left( H_n(\theta_0) + \frac{\partial H_n(\theta_0)}{\partial \theta'} (\theta - \theta_0) \right) + o_p(1) = \left( \left( \frac{\partial H_n(\theta_0)}{\partial \theta'} \right)^{-1} \sqrt{n} H_n(\theta_0) + \sqrt{n} (\theta - \theta_0) \right)' \frac{\partial H_n(\theta_0)}{\partial \theta'}{}' \frac{\partial H_n(\theta_0)}{\partial \theta'} \left( \left( \frac{\partial H_n(\theta_0)}{\partial \theta'} \right)^{-1} \sqrt{n} H_n(\theta_0) + \sqrt{n} (\theta - \theta_0) \right) + o_p(1).$$
This suggests that selecting θ̂ to minimize H_n(θ)′H_n(θ) results in the asymptotic distribution of √n(θ̂ − θ_0) being equivalent to the distribution of argmin_{λ∈Λ} (Z + λ)′M_0′M_0(Z + λ), where the random variable is defined as
$$Z = \lim_{n \to \infty} E\left[ \frac{\partial H_n(\theta_0)}{\partial \theta'} \right]^{-1} \sqrt{n}\, H_n(\theta_0), \qquad M_0 = E\left[ \frac{\partial H_n(\theta_0)}{\partial \theta'} \right],$$
and the cone is defined by $\Lambda \equiv \left\{ \lambda \in \mathbb{R}^{m(m+1)/2 + 2km + 2k + 1} : \lambda_{m(m+1)/2 + 2km + k + 1} \ge 0 \right\}$. The estimator is defined as $\hat\theta = \arg\min_{\theta \in \Theta} H_n(\theta)' H_n(\theta)$ and its asymptotic distribution is characterized in Theorem 1 of [1]. For continuity of presentation, the theorem is repeated here.
Theorem 1.
Assumptions 1–4 imply that, when β_p ≠ β_0, the asymptotic distribution of √n(θ̂ − θ_0) is equivalent to the distribution of
$$\hat\lambda = \arg\min_{\lambda \in \Lambda} (Z + \lambda)' M_0' M_0 (Z + \lambda).$$
The objective function can be minimized at a value of the tuning parameter in ( 0 , ) or possibly at α = 0 . The asymptotic distribution of the tuning parameter will be composed of two parts, a discrete mass at α = 0 and a continuous function over ( 0 , ) . The asymptotic distribution is characterized as the projection of a stochastic process onto a cone. The special structure of the ridge estimator using a holdout sample permits the calculation of the closed form for the asymptotic distribution for the parameter of interest, see Theorem 1 in [17], case 2 after Theorem 2 in [18], Section 3.8 in [19], and Theorem 5 in [20].
Theorem 2.
Assumptions 1–4 imply, when β_p ≠ β_0,
(i) √n(β̂ − β_0) converges in distribution to a draw from a normal distribution with mean
$$-\left( \Gamma_0' R_z \Gamma_0 \right)^{-1}(\beta_0 - \beta_p)\sqrt{\frac{\sigma_\varepsilon^2}{2\pi\,\tau(1-\tau)\,(\beta_0 - \beta_p)'\left( \Gamma_0' R_z \Gamma_0 \right)^{-1}(\beta_0 - \beta_p)}}$$
and covariance
$$\sigma_\varepsilon^2 \left( \Gamma_0' R_z \Gamma_0 \right)^{-1} + \frac{1}{\tau(1-\tau)} \frac{\sigma_\varepsilon^2 \left( \Gamma_0' R_z \Gamma_0 \right)^{-1}(\beta_0 - \beta_p)(\beta_0 - \beta_p)'\left( \Gamma_0' R_z \Gamma_0 \right)^{-1}}{(\beta_0 - \beta_p)'\left( \Gamma_0' R_z \Gamma_0 \right)^{-1}(\beta_0 - \beta_p)},$$
and
(ii) √n α̂ will converge in distribution to a mixture distribution with discrete mass of 1/2 at zero and, over [0, ∞), a truncated normal distribution with zero mean and covariance
$$\frac{\sigma_\varepsilon^2}{\tau(1-\tau)\,(\beta_0 - \beta_p)'\left( \Gamma_0' R_z \Gamma_0 \right)^{-1}(\beta_0 - \beta_p)}.$$
When the prior is different from the population parameter value, the ridge estimator has an asymptotic bias and its asymptotic variance is larger than that of the 2SLS estimator. However, the difference is restricted to a single dimension. The asymptotic MSE is
$$\sigma_\varepsilon^2 \left( \Gamma_0' R_z \Gamma_0 \right)^{-1} + \frac{(2\pi n + 1)}{2\pi\,\tau(1-\tau)\,n}\, \frac{\sigma_\varepsilon^2 \left( \Gamma_0' R_z \Gamma_0 \right)^{-1}(\beta_0 - \beta_p)(\beta_0 - \beta_p)'\left( \Gamma_0' R_z \Gamma_0 \right)^{-1}}{(\beta_0 - \beta_p)'\left( \Gamma_0' R_z \Gamma_0 \right)^{-1}(\beta_0 - \beta_p)} = \sigma_\varepsilon^2 \left( \Gamma_0' R_z \Gamma_0 \right)^{-1/2} \left( I_k + \frac{(2\pi n + 1)}{2\pi\,\tau(1-\tau)\,n}\, P_{\left( \Gamma_0' R_z \Gamma_0 \right)^{-1/2}(\beta_0 - \beta_p)} \right) \left( \Gamma_0' R_z \Gamma_0 \right)^{-1/2}.$$
This is the MSE for the 2SLS estimator plus a term built on $P_{(\Gamma_0' R_z \Gamma_0)^{-1/2}(\beta_0 - \beta_p)}$, the projection matrix for (Γ_0′R_zΓ_0)^{−1/2}(β_0 − β_p). The ridge estimator using the holdout sample has the same bias, variance, and MSE as the 2SLS estimator, except in the dimension of (Γ_0′R_zΓ_0)^{−1/2}(β_0 − β_p). Because 1/(τ(1−τ)) takes its minimum at τ = 0.5, the asymptotic bias, variance, and MSE are minimized when the sample is split equally between the training and the test (or holdout) samples.
Because the population parameter value enters both the asymptotic bias and the asymptotic variance, it is not possible to determine individual t-statistics for the parameter estimates. However, under the null hypothesis H_0: β = β_0, the statistic
$$\sqrt{n}\left[ (\hat\beta - \beta_0) + \left( \Gamma_0' R_z \Gamma_0 \right)^{-1}(\beta_0 - \beta_p)\sqrt{\frac{\sigma_\varepsilon^2}{n\,2\pi\,\tau(1-\tau)\,(\beta_0 - \beta_p)'\left( \Gamma_0' R_z \Gamma_0 \right)^{-1}(\beta_0 - \beta_p)}} \right]' \times \left[ \sigma_\varepsilon^2 \left( \Gamma_0' R_z \Gamma_0 \right)^{-1} + \frac{1}{\tau(1-\tau)} \frac{\sigma_\varepsilon^2 \left( \Gamma_0' R_z \Gamma_0 \right)^{-1}(\beta_0 - \beta_p)(\beta_0 - \beta_p)'\left( \Gamma_0' R_z \Gamma_0 \right)^{-1}}{(\beta_0 - \beta_p)'\left( \Gamma_0' R_z \Gamma_0 \right)^{-1}(\beta_0 - \beta_p)} \right]^{-1} \times \sqrt{n}\left[ (\hat\beta - \beta_0) + \left( \Gamma_0' R_z \Gamma_0 \right)^{-1}(\beta_0 - \beta_p)\sqrt{\frac{\sigma_\varepsilon^2}{n\,2\pi\,\tau(1-\tau)\,(\beta_0 - \beta_p)'\left( \Gamma_0' R_z \Gamma_0 \right)^{-1}(\beta_0 - \beta_p)}} \right]$$
will converge in distribution to a chi-square with k degrees of freedom. This statistic can be inverted to create accurate confidence regions.
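A sketch of this statistic is given below. In practice Γ_0′R_zΓ_0 and σ_ε² would be replaced by consistent estimates (e.g., X′P_ZX/n and a residual variance); the function and argument names are illustrative assumptions, not from the paper.

```python
import numpy as np
from scipy import stats

def ridge_iv_chi2(beta_hat, beta_0, beta_p, G, sigma2_eps, n, tau=0.5):
    # G stands in for Gamma_0' R_z Gamma_0; returns the chi-squared statistic and p-value.
    Ginv = np.linalg.inv(G)
    d = beta_0 - beta_p
    D = d @ Ginv @ d                      # (beta_0 - beta_p)' G^{-1} (beta_0 - beta_p)
    bias = -Ginv @ d * np.sqrt(sigma2_eps / (n * 2 * np.pi * tau * (1 - tau) * D))
    cov = sigma2_eps * Ginv + (sigma2_eps / (tau * (1 - tau) * D)) * np.outer(Ginv @ d, Ginv @ d)
    v = np.sqrt(n) * (beta_hat - beta_0 - bias)
    stat = v @ np.linalg.solve(cov, v)
    return stat, stats.chi2.sf(stat, df=len(beta_0))
```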
The asymptotic behavior is different when β p = β 0 .
Theorem 3.
Assumptions 1–4 imply, when β_p = β_0, that √n(β̂ − β_0), conditional on α̂, converges in distribution to
$$N\left( 0,\; \sigma_\varepsilon^2 \left( \Gamma_0' R_z \Gamma_0 + \hat\alpha I_k \right)^{-1} \Gamma_0' R_z \Gamma_0 \left( \Gamma_0' R_z \Gamma_0 + \hat\alpha I_k \right)^{-1} \right)$$
where α̂ converges in distribution to a draw from an a-min distribution with parameter S = R_z^{1/2}Γ_0.
In the unlikely event that the prior is selected equal to the population parameter value, the asymptotic covariance is smaller than or equal to the 2SLS asymptotic covariance. In terms of implementation, the covariance and bias associated with β_p ≠ β_0 should be used, because they are asymptotically correct for all priors except β_p = β_0, where they lead to conservative confidence regions.

Linear Regression

A special case is the linear regression model where Y is n × 1 and X is n × k,
$$Y = X\beta_0 + \varepsilon$$
with full rank k × k second moments R_x = E[x_i x_i′] and, conditional on X, ε_i ~ iid(0, σ_ε²) with σ_ε² < ∞. The estimation equations for the ridge regression estimate, where the tuning parameter α ≥ 0 is selected with a holdout sample, can be written in the method of moments framework using the parameterization
$$\theta = \begin{bmatrix} \mathrm{vech}(R_{x\tau}) \\ \beta_{tr} \\ \alpha \\ \beta \end{bmatrix} \quad \text{with} \quad \theta_0 = \begin{bmatrix} \mathrm{vech}(R_x) \\ \beta_0 \\ 0 \\ \beta_0 \end{bmatrix}.$$
The ridge estimator is part of the parameter estimates defined by the just identified system of equations H n ( θ ) = 1 n i = 1 n h i ( θ ) = 0 , where
$$h_i(\theta) = \begin{bmatrix} 1_\tau(i)\,\mathrm{vech}\left( R_{x\tau} - x_i x_i' \right) \\ 1_\tau(i)\left( -x_i(y_i - x_i'\beta_{tr}) + \alpha(\beta_{tr} - \beta_p) \right) \\ (1 - 1_\tau(i))\,(y_i - x_i'\beta_{tr})\, x_i' \left( R_{x\tau} + \alpha I_k \right)^{-1} (\beta_p - \beta_{tr}) \\ -x_i(y_i - x_i'\beta) + \alpha(\beta - \beta_p) \end{bmatrix}.$$
Along with Assumption 4, the following three assumptions are sufficient to obtain the asymptotic results.
Assumption 5.
x_i is iid with finite fourth moments and E[x_i x_i′] = R_x has full rank.
Assumption 6.
Conditional on X, ε_i ~ iid(0, σ_ε²) with σ_ε² < ∞.
Assumption 7.
The parameter space Θ is defined as follows: R_x is restricted to symmetric positive definite matrices with eigenvalues 1/B_1 ≤ e_1 ≤ e_2 ≤ ⋯ ≤ e_k ≤ B_1, |β_j| ≤ B_2 for j = 1, 2, …, k, and α ∈ [0, B_3], where B_1, B_2, and B_3 are positive and finite.
Lemma 2 gives the rate of convergence for the tuning parameter when the prior is different from the population parameter value and characterizes its asymptotic distribution when the prior is equal to the population parameter value.
Lemma 2.
Assumptions 4–7 imply: (i) if β_p ≠ β_0, (1) α̂ →_p 0 and (2) √n α̂ = O_p(1); (ii) if β_p = β_0, α̂ converges in distribution to a draw from the a-min distribution with parameter S = R_x^{1/2}, the symmetric matrix square root of R_x.
The asymptotic distribution of √n(θ̂ − θ_0) is equivalent to the distribution of argmin_{λ∈Λ} (Z + λ)′M_0′M_0(Z + λ), where the random variable is defined as
$$Z = \lim_{n \to \infty} E\left[ \frac{\partial H_n(\theta_0)}{\partial \theta'} \right]^{-1} \sqrt{n}\, H_n(\theta_0), \qquad M_0 = E\left[ \frac{\partial H_n(\theta_0)}{\partial \theta'} \right],$$
and the cone is defined by $\Lambda \equiv \left\{ \lambda \in \mathbb{R}^{k(k+1)/2 + 2k + 1} : \lambda_{k(k+1)/2 + k + 1} \ge 0 \right\}$. The estimator is defined as
$$\hat\theta = \arg\min_{\theta \in \Theta} H_n(\theta)' H_n(\theta).$$
Theorem 4.
Assumptions 4–7 imply, when β_p ≠ β_0, the asymptotic distribution of √n(θ̂ − θ_0) is equivalent to the distribution of
$$\hat\lambda = \arg\min_{\lambda \in \Lambda} (Z + \lambda)' M_0' M_0 (Z + \lambda).$$
This theorem characterizes the asymptotic distribution of the estimator as the projection of a stochastic process onto a cone. The special structure of this problem allows for the analytic derivation of the asymptotic distribution.
Theorem 5.
Assumptions 4–7 imply, when β_p ≠ β_0,
(i) √n(β̂ − β_0) is asymptotically normally distributed with mean
$$-R_x^{-1}(\beta_0 - \beta_p)\sqrt{\frac{\sigma_\varepsilon^2}{2\pi\,\tau(1-\tau)\,(\beta_0 - \beta_p)' R_x^{-1}(\beta_0 - \beta_p)}}$$
and covariance
$$\sigma_\varepsilon^2 R_x^{-1} + \frac{1}{\tau(1-\tau)} \frac{\sigma_\varepsilon^2 R_x^{-1}(\beta_0 - \beta_p)(\beta_0 - \beta_p)' R_x^{-1}}{(\beta_0 - \beta_p)' R_x^{-1}(\beta_0 - \beta_p)},$$
and
(ii) √n α̂ asymptotically has a mixture distribution with a discrete mass of 1/2 at zero and, over [0, ∞), a truncated normal distribution with zero mean and covariance
$$\frac{\sigma_\varepsilon^2}{\tau(1-\tau)\,(\beta_0 - \beta_p)' R_x^{-1}(\beta_0 - \beta_p)}.$$
When β_p ≠ β_0, the ridge estimator has an asymptotic bias and its asymptotic variance is larger than that of the OLS estimator. However, the difference is only in one dimension. The asymptotic MSE is
$$\sigma_\varepsilon^2 R_x^{-1} + \frac{(2\pi n + 1)}{2\pi\,\tau(1-\tau)\,n}\, \frac{\sigma_\varepsilon^2 R_x^{-1}(\beta_0 - \beta_p)(\beta_0 - \beta_p)' R_x^{-1}}{(\beta_0 - \beta_p)' R_x^{-1}(\beta_0 - \beta_p)}.$$
This is the MSE for the OLS estimator plus a constant times the projection matrix for R_x^{−1/2}(β_0 − β_p). The ridge estimator using the holdout sample has the same bias, variance, and MSE as the OLS estimator except in the dimension of R_x^{−1/2}(β_0 − β_p). To minimize the bias, variance, and MSE of the estimator, τ = 0.5 should be selected. Because the population parameter value enters both the asymptotic bias and the asymptotic variance, individual t-statistics are not available. However, under the null hypothesis H_0: β = β_0, the statistic
$$\sqrt{n}\left[ (\hat\beta - \beta_0) + R_x^{-1}(\beta_0 - \beta_p)\sqrt{\frac{\sigma_\varepsilon^2}{n\,2\pi\,\tau(1-\tau)\,(\beta_0 - \beta_p)' R_x^{-1}(\beta_0 - \beta_p)}} \right]' \times \left[ \sigma_\varepsilon^2 R_x^{-1} + \frac{1}{\tau(1-\tau)} \frac{\sigma_\varepsilon^2 R_x^{-1}(\beta_0 - \beta_p)(\beta_0 - \beta_p)' R_x^{-1}}{(\beta_0 - \beta_p)' R_x^{-1}(\beta_0 - \beta_p)} \right]^{-1} \times \sqrt{n}\left[ (\hat\beta - \beta_0) + R_x^{-1}(\beta_0 - \beta_p)\sqrt{\frac{\sigma_\varepsilon^2}{n\,2\pi\,\tau(1-\tau)\,(\beta_0 - \beta_p)' R_x^{-1}(\beta_0 - \beta_p)}} \right]$$
will converge in distribution to a draw from a chi-square with k degrees of freedom and can be used to create confidence regions.
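Pulling the special case together, a minimal sketch of ridge regression with a holdout-selected tuning parameter is given below (a grid search stands in for the argmin over [0, ∞); all names are illustrative):

```python
import numpy as np

def ridge_holdout_ols(Y, X, beta_p, tau=0.5, grid=None):
    # Training fit: beta_tr(alpha) = (X_tr'X_tr/[tau n] + alpha I)^{-1}(X_tr'Y_tr/[tau n] + alpha beta_p);
    # alpha_hat minimizes the least-squares criterion on the holdout sample.
    n, k = X.shape
    m = int(tau * n)
    if grid is None:
        grid = np.concatenate(([0.0], np.logspace(-6, 2, 200)))
    Xt, Yt, Xv, Yv = X[:m], Y[:m], X[m:], Y[m:]
    A, b = Xt.T @ Xt / m, Xt.T @ Yt / m
    def beta_tr(alpha):
        return np.linalg.solve(A + alpha * np.eye(k), b + alpha * beta_p)
    def Q(alpha):
        e = Yv - Xv @ beta_tr(alpha)
        return e @ e / (2 * (n - m))
    alpha_hat = min(grid, key=Q)
    Af, bf = X.T @ X / n, X.T @ Y / n
    return alpha_hat, np.linalg.solve(Af + alpha_hat * np.eye(k), bf + alpha_hat * beta_p)
```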
When β p = β 0 , a different asymptotic distribution occurs.
Theorem 6.
Assumptions 4–7 imply, when β_p = β_0, that √n(β̂ − β_0), conditional on α̂, converges in distribution to a draw from
$$N\left( 0,\; \sigma_\varepsilon^2 \left( R_x + \hat\alpha I_k \right)^{-1} R_x \left( R_x + \hat\alpha I_k \right)^{-1} \right)$$
where α̂ will converge in distribution to a draw from an a-min distribution with parameter S = R_x^{1/2}.
If β p = β 0 , the asymptotic covariance is smaller than or equal to the OLS estimator’s asymptotic covariance. Again, the covariance and bias associated with β p β 0 should be used for inference.

4. Small Sample Properties

The behavior in finite samples is investigated next by simulating the model in Equations (1) and (2) with k = 2 and m = 4. (Because the ridge estimator and the 2SLS estimator will be compared using MSE, two moments need to exist. This is ensured by having four instruments to estimate the two parameters; see [21].) To standardize the model, set z_i ~ iid N(0, I_4) and β_0 = (0, 0)′. Endogeneity is created with
$$\begin{bmatrix} \varepsilon_i \\ u_i \end{bmatrix} \sim iid\; N\left( 0,\; \begin{bmatrix} 1 & 0.7 & 0.7 \\ 0.7 & 1 & 0 \\ 0.7 & 0 & 1 \end{bmatrix} \right).$$
The strength of the instruments is controlled by the parameter δ with
$$\Gamma_0 = \begin{bmatrix} 1 & 0 \\ 0 & \delta \\ 1 & 0 \\ 1 & 0 \end{bmatrix}.$$
If δ = 0, the second element of β_0 is not identified.
The ridge parameter estimate α̂ is determined in two steps. In the first step, the objective function is evaluated on a grid of values to determine a starting value. In the second step, the objective function is evaluated over a finer grid (10,000 points) centered at the best value obtained from the first step. A value of α̂ = 0 in the second step corresponds to the ridge estimator ignoring the prior in favor of the data, whereas a value of α̂ = 10^7 corresponds to “infinite regularization”, implying that the ridge estimator ignores the data in favor of the prior.
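A sketch of the simulation design and the two-step search follows. The Γ_0 matrix is the reconstruction shown above, and the refinement window around the first-step value is an illustrative choice, not the paper’s exact grid.

```python
import numpy as np

def simulate_iv(n, delta, rng):
    # One sample from the Section 4 design: z_i ~ N(0, I_4), beta_0 = (0, 0)',
    # and errors with correlation 0.7 between eps and u creating endogeneity.
    cov = np.array([[1.0, 0.7, 0.7],
                    [0.7, 1.0, 0.0],
                    [0.7, 0.0, 1.0]])
    Gamma0 = np.array([[1.0, 0.0],
                       [0.0, delta],
                       [1.0, 0.0],
                       [1.0, 0.0]])       # delta = 0 removes identification of beta_2
    Z = rng.standard_normal((n, 4))
    E = rng.multivariate_normal(np.zeros(3), cov, size=n)
    eps, U = E[:, 0], E[:, 1:]
    X = Z @ Gamma0 + U
    Y = eps.copy()                        # Y = X beta_0 + eps with beta_0 = (0, 0)'
    return Y, X, Z

def two_step_alpha(Q):
    # Coarse grid locates a starting value; a finer grid of 10,000 points
    # around it approximates the minimizer, as described above.
    coarse = np.concatenate(([0.0], np.logspace(-6, 7, 120)))
    a0 = min(coarse, key=Q)
    hi = 2.0 * a0 if a0 > 0 else 1e-6
    fine = np.linspace(0.5 * a0, hi, 10_000)
    return min(fine, key=Q)
```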

4.1. Coverage Probabilities

The chi-square test is performed for a range of parameterizations including different priors, different strengths of the instrument, and different sample sizes. Each model is simulated 10,000 times and a size of ten percent is used for each test. The results are presented in Table A1. The observed coverage probabilities agree with the theoretical values. As expected, the approximations are best in cases where the sample size is large, the correlation between instruments and covariates is higher, and the prior is closer to the population parameter value.

4.2. MSE for the Ridge Estimator

The ridge estimator is compared with the 2SLS estimator to demonstrate settings where the ridge estimator can be expected to give more accurate results in small samples. The simulated models differ in three dimensions: sample size, strength of the instruments, and the prior on the structural parameters. For smaller sample sizes, the ridge estimator should have better properties, whereas, for larger sample sizes, 2SLS should perform better. Sample sizes of n = 25, 50, 250, and 500 are considered. As noted above, the instrument signal strength increases with δ. For smaller signal strengths, the ridge estimator should perform better. Values of δ = 0.01, 0.05, 0.1, and 0.5 are considered. For prior values further from the population parameter values, the ridge estimator should perform worse. Three different values of β_p are considered: (1/2, 1/2)′, (−1/2, −1/2)′, and (1, 1)′. A total of 48 model specifications are simulated: four sample sizes n, four values of the precision parameter δ, and three values of the prior β_p. Each specification is simulated 10,000 times and estimated with 2SLS and the ridge estimator. τ = 0.5 is used to split the sample between training and test samples for the ridge estimator.
Table A2, Table A3 and Table A4 compare the performance of the 2SLS estimator with the ridge estimator for different precision levels and sample sizes when the prior is fixed at β_p = (1/2, 1/2)′, β_p = (−1/2, −1/2)′, and β_p = (1, 1)′, respectively. The estimators are compared on the basis of the bias, the standard deviation, and the MSE of each estimate, and the sum of the MSE values for β̂_1 and β̂_2. All three tables show the expected patterns. The ridge estimator dominates in models with smaller sample sizes and weaker instruments, and when the prior is closer to the population parameter values.
Overall, the simulations demonstrate that, for some model specifications, the ridge estimator using a holdout sample has better small sample performance than the 2SLS estimator. The simulations agree with the asymptotic distributions: as the sample size increases, the 2SLS estimator performs better than the ridge estimator.

5. Conclusions

Inference has always been a weakness of ridge regression. This paper presents a methodology and results to help address some of its weaknesses. Theoretically accurate inferences can be performed with the asymptotic distribution of the ridge estimates of the linear IV model when the tuning parameter is empirically selected by a holdout sample. It is well known that the distribution of the estimates of the structural parameters is affected by empirically selected tuning parameters. This is addressed by simultaneously estimating both the parameters of interest and the ridge regression tuning parameter in the method of moments framework. When the prior is different from the population parameter value, the estimator accounts for the probability limit of the tuning parameter being on the boundary of the parameter space. The asymptotic distribution for the tuning parameter is a nonstandard mixed distribution. The asymptotic distribution for the estimates of the structural parameters is normal but with a nonzero mean. The ridge estimator of the structural parameters has asymptotic bias and the asymptotic covariance is larger than the asymptotic covariance for the 2SLS estimator; however, the bias and larger covariance only apply to one dimension of the parameter space. The dependence of the asymptotic mean and variance on the population parameter values prevents the calculation of t-statistics for individual parameters. Fortunately, a chi-square statistic provides accurate confidence regions for the structural parameters.
If the prior is equal to the population parameter value, the ridge estimator is consistent and the asymptotic covariance is smaller than the 2SLS asymptotic covariance. The asymptotic distribution provides insights on how to perform estimation with a holdout sample. The minimum bias, variance, and MSE for the structural parameters occur when the sample is equally split into a training sample and a test (or holdout) sample.
This paper’s approach can be useful in determining the asymptotic behavior for other empirical procedures that select tuning parameters. Two natural extensions would be to generalize cross-validation (see [22]) and K-fold cross-validation, where the entire dataset would be used to select the tuning parameter. This paper has focused on strong correlations between the instruments and the regressors. Another important extension would be the asymptotic behavior in models with weaker correlation between the instruments and the regressors, see [23].

Supplementary Materials

Author Contributions

Conceptualization, F.S. and N.S.; simulations, F.S. and N.S.; writing—original draft preparation, F.S. and N.S.; writing—review and editing, F.S. and N.S. Both authors contributed equally to this project. Both authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Acknowledgments

The authors thank three anonymous referees for their helpful comments. The authors benefited from discussions during the presentation at the 2021 North American Summer Meeting of the Econometrics Society.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Table A1. Coverage probabilities for confidence regions created from the χ² statistic with τ = 0.5 and the corresponding values from 2SLS estimation. The simulated model is given by Equations (1), (2), (4) and (12) with k = 2, m = 4, z_i ~ iid N(0, I_4), and β_0 = (0, 0)′. The models differ with respect to priors (β_p), strength of the instrument (δ), and sample size (n). Each model is simulated 10,000 times and the test performed with a size of 10%.

β_p                  δ          n        χ²(τ=0.50)   χ²(2SLS)
β_p = (1/2, 1/2)     δ = 0.01   25       0.651        0.547
                                500      0.717        0.629
                                10,000   0.708        0.661
                     δ = 0.05   25       0.655        0.561
                                500      0.710        0.663
                                10,000   0.721        0.863
                     δ = 0.10   25       0.656        0.571
                                500      0.709        0.759
                                10,000   0.855        0.890
                     δ = 0.50   25       0.697        0.716
                                500      0.870        0.896
                                10,000   0.900        0.897
β_p = (−1/2, −1/2)   δ = 0.01   25       0.623        0.556
                                500      0.686        0.636
                                10,000   0.675        0.661
                     δ = 0.05   25       0.625        0.563
                                500      0.671        0.673
                                10,000   0.736        0.859
                     δ = 0.10   25       0.624        0.576
                                500      0.682        0.760
                                10,000   0.863        0.892
                     δ = 0.50   25       0.688        0.723
                                500      0.870        0.888
                                10,000   0.891        0.900
β_p = (1, 1)         δ = 0.01   25       0.621        0.565
                                500      0.670        0.632
                                10,000   0.650        0.669
                     δ = 0.05   25       0.622        0.565
                                500      0.664        0.664
                                10,000   0.784        0.856
                     δ = 0.10   25       0.616        0.564
                                500      0.673        0.750
                                10,000   0.867        0.892
                     δ = 0.50   25       0.689        0.726
                                500      0.865        0.895
                                10,000   0.886        0.896
Table A2. The simulated model is given by Equations (1), (2), (4) and (12) with k = 2, m = 4, z_i ~ iid N(0, I_4), and β_0 = (0, 0)′. Summary statistics are reported for estimates of β̂_1 and β̂_2 using the 2SLS and ridge estimators for β_p = (1/2, 1/2)′, where τ = 0.5. The models differ with respect to the strength of the instrument (δ) and sample size (n). Each model is simulated 10,000 times. The ridge estimator outperforms the 2SLS estimator in terms of MSE values in a number of cases. In particular, in small samples and low precision settings, the ridge estimator leads to smaller MSE values.

                        --------- β̂_1 ---------   --------- β̂_2 ---------
δ      n    Estimator   Bias    SD     MSE        Bias    SD     MSE        MSE(β̂_1)+MSE(β̂_2)
0.01   25   2SLS        0.020   0.117  0.014      0.715   0.661  0.948      0.962
            Ridge       0.080   0.108  0.018      0.604   0.343  0.483      0.501
       50   2SLS        0.009   0.082  0.007      0.696   0.686  0.956      0.963
            Ridge       0.055   0.076  0.009      0.589   0.333  0.458      0.467
       250  2SLS        0.001   0.037  0.001      0.694   0.740  1.029      1.031
            Ridge       0.022   0.037  0.002      0.577   0.349  0.455      0.457
       500  2SLS        0.001   0.028  0.001      0.697   0.710  0.989      0.990
            Ridge       0.016   0.027  0.001      0.577   0.317  0.433      0.434
0.05   25   2SLS        0.019   0.122  0.015      0.688   0.694  0.954      0.970
            Ridge       0.083   0.110  0.019      0.595   0.352  0.477      0.496
       50   2SLS        0.011   0.084  0.007      0.683   0.697  0.952      0.959
            Ridge       0.056   0.078  0.009      0.583   0.335  0.452      0.462
       250  2SLS        0.001   0.037  0.001      0.572   0.712  0.834      0.835
            Ridge       0.022   0.038  0.002      0.531   0.332  0.392      0.394
       500  2SLS        0.001   0.025  0.001      0.485   0.687  0.707      0.707
            Ridge       0.016   0.027  0.001      0.504   0.292  0.339      0.340
0.10   25   2SLS        0.021   0.121  0.015      0.636   0.727  0.933      0.948
            Ridge       0.082   0.107  0.018      0.572   0.356  0.454      0.472
       50   2SLS        0.008   0.086  0.007      0.590   0.719  0.865      0.872
            Ridge       0.054   0.079  0.009      0.552   0.334  0.416      0.425
       250  2SLS        0.002   0.035  0.001      0.335   0.550  0.415      0.416
            Ridge       0.024   0.038  0.002      0.444   0.265  0.268      0.270
       500  2SLS        0.000   0.026  0.001      0.181   0.463  0.247      0.248
            Ridge       0.016   0.028  0.001      0.378   0.245  0.203      0.204
0.50   25   2SLS        0.012   0.127  0.016      0.158   0.462  0.238      0.255
            Ridge       0.088   0.123  0.023      0.325   0.258  0.173      0.195
       50   2SLS        0.006   0.085  0.007      0.066   0.312  0.101      0.109
            Ridge       0.055   0.089  0.011      0.243   0.235  0.114      0.125
       250  2SLS        0.002   0.036  0.001      0.014   0.126  0.016      0.017
            Ridge       0.016   0.039  0.002      0.104   0.148  0.033      0.035
       500  2SLS        0.000   0.026  0.001      0.007   0.089  0.008      0.009
            Ridge       0.009   0.027  0.001      0.070   0.111  0.017      0.018
Table A3. The simulated model is given by Equations (1), (2), (4) and (12) with k = 2, m = 4, z_i ~ iid N(0, I_4), and β_0 = (0, 0)′. Summary statistics are reported for estimates of β̂_1 and β̂_2 using the 2SLS and ridge estimators for β_p = (−1/2, −1/2)′, where τ = 0.5. The models differ with respect to the strength of the instrument (δ) and sample size (n). Each model is simulated 10,000 times. The ridge estimator outperforms the 2SLS estimator in terms of MSE values in a number of cases. In particular, in small samples and low precision settings, the ridge estimator leads to smaller MSE values.

                        --------- β̂_1 ---------   --------- β̂_2 ---------
δ      n    Estimator   Bias    SD     MSE        Bias    SD     MSE        MSE(β̂_1)+MSE(β̂_2)
0.01   25   2SLS        0.021   0.115  0.014      0.688   0.659  0.908      0.921
            Ridge       0.084   0.108  0.019      0.700   0.334  0.601      0.620
       50   2SLS        0.009   0.081  0.007      0.702   0.698  0.980      0.987
            Ridge       0.055   0.077  0.009      0.708   0.331  0.611      0.620
       250  2SLS        0.002   0.037  0.001      0.693   0.730  1.014      1.016
            Ridge       0.024   0.037  0.002      0.700   0.316  0.589      0.591
       500  2SLS        0.001   0.026  0.001      0.675   0.702  0.949      0.949
            Ridge       0.016   0.027  0.001      0.700   0.304  0.582      0.583
0.05   25   2SLS        0.021   0.117  0.014      0.683   0.654  0.894      0.908
            Ridge       0.083   0.107  0.018      0.694   0.334  0.593      0.612
       50   2SLS        0.010   0.081  0.007      0.666   0.663  0.884      0.891
            Ridge       0.056   0.077  0.009      0.690   0.339  0.591      0.601
       250  2SLS        0.002   0.037  0.001      0.576   0.683  0.798      0.800
            Ridge       0.023   0.037  0.002      0.651   0.306  0.518      0.520
       500  2SLS        0.001   0.026  0.001      0.472   0.662  0.661      0.662
            Ridge       0.016   0.027  0.001      0.618   0.324  0.487      0.488
0.10   25   2SLS        0.022   0.120  0.015      0.657   0.695  0.914      0.929
            Ridge       0.085   0.111  0.019      0.679   0.361  0.591      0.611
       50   2SLS        0.009   0.083  0.007      0.595   0.690  0.830      0.837
            Ridge       0.056   0.078  0.009      0.657   0.316  0.532      0.541
       250  2SLS        0.001   0.037  0.001      0.334   0.572  0.439      0.441
            Ridge       0.023   0.038  0.002      0.548   0.301  0.391      0.393
       500  2SLS        0.000   0.026  0.001      0.187   0.457  0.244      0.245
            Ridge       0.015   0.027  0.001      0.463   0.295  0.301      0.302
0.50   25   2SLS        0.014   0.124  0.016      0.169   0.455  0.235      0.251
            Ridge       0.084   0.126  0.023      0.369   0.328  0.244      0.267
       50   2SLS        0.006   0.083  0.007      0.066   0.293  0.090      0.097
            Ridge       0.050   0.085  0.010      0.266   0.268  0.142      0.152
       250  2SLS        0.001   0.037  0.001      0.011   0.128  0.017      0.018
            Ridge       0.013   0.038  0.002      0.103   0.155  0.035      0.036
       500  2SLS        0.001   0.026  0.001      0.005   0.090  0.008      0.009
            Ridge       0.008   0.026  0.001      0.071   0.114  0.018      0.019
Table A4. The simulated model is given by Equations (1), (2), (4) and (12) with k = 2, m = 4, z_i ~ iid N(0, I_4), and β_0 = (0, 0)′. Summary statistics are reported for estimates of β̂_1 and β̂_2 using the 2SLS and ridge estimators for β_p = (1, 1)′, where τ = 0.5. The models differ with respect to the strength of the instrument (δ) and sample size (n). Each model is simulated 10,000 times. The ridge estimator outperforms the 2SLS estimator in terms of MSE values in a number of cases. In particular, in small samples and low precision settings, the ridge estimator leads to smaller MSE values.

                        --------- β̂_1 ---------   --------- β̂_2 ---------
δ      n    Estimator   Bias    SD     MSE        Bias    SD     MSE        MSE(β̂_1)+MSE(β̂_2)
0.01   25   2SLS        0.022   0.124  0.016      0.708   0.724  1.025      1.041
            Ridge       0.086   0.124  0.023      0.831   0.435  0.880      0.903
       50   2SLS        0.009   0.081  0.007      0.695   0.678  0.943      0.950
            Ridge       0.057   0.083  0.010      0.836   0.380  0.842      0.853
       250  2SLS        0.003   0.036  0.001      0.691   0.693  0.958      0.960
            Ridge       0.024   0.039  0.002      0.859   0.385  0.887      0.889
       500  2SLS        0.001   0.027  0.001      0.698   0.725  1.013      1.014
            Ridge       0.016   0.028  0.001      0.868   0.351  0.877      0.878
0.05   25   2SLS        0.018   0.125  0.016      0.688   0.711  0.979      0.995
            Ridge       0.083   0.117  0.021      0.820   0.399  0.832      0.853
       50   2SLS        0.010   0.082  0.007      0.669   0.681  0.911      0.918
            Ridge       0.057   0.082  0.010      0.823   0.383  0.824      0.834
       250  2SLS        0.002   0.035  0.001      0.568   0.658  0.755      0.756
            Ridge       0.023   0.039  0.002      0.793   0.365  0.761      0.763
       500  2SLS        0.001   0.031  0.001      0.466   0.725  0.743      0.744
            Ridge       0.016   0.031  0.001      0.753   0.437  0.757      0.759
0.10   25   2SLS        0.018   0.121  0.015      0.654   0.682  0.893      0.908
            Ridge       0.085   0.119  0.021      0.799   0.404  0.802      0.823
       50   2SLS        0.010   0.099  0.010      0.593   1.026  1.405      1.415
            Ridge       0.057   0.083  0.010      0.780   0.410  0.777      0.787
       250  2SLS        0.001   0.036  0.001      0.318   0.605  0.468      0.469
            Ridge       0.021   0.040  0.002      0.643   0.393  0.567      0.569
       500  2SLS        0.001   0.026  0.001      0.178   0.464  0.247      0.247
            Ridge       0.013   0.028  0.001      0.524   0.395  0.431      0.432
0.50   25   2SLS        0.014   0.123  0.015      0.162   0.434  0.215      0.230
            Ridge       0.079   0.129  0.023      0.405   0.376  0.306      0.329
       50   2SLS        0.006   0.087  0.008      0.071   0.317  0.105      0.113
            Ridge       0.043   0.088  0.010      0.278   0.309  0.173      0.182
       250  2SLS        0.001   0.037  0.001      0.012   0.127  0.016      0.018
            Ridge       0.011   0.037  0.001      0.103   0.158  0.035      0.037
       500  2SLS        0.000   0.026  0.001      0.007   0.089  0.008      0.009
            Ridge       0.007   0.026  0.001      0.071   0.114  0.018      0.019
Proofs
Lemma 1
Proof of Lemma 1.
The objective function that determines the tuning parameter is
$$Q_{n(1-\tau)}(\alpha) = \frac{1}{2(n - [n\tau])}\left( Y_{n(1-\tau)} - X_{n(1-\tau)}\hat\beta_{tr}(\alpha) \right)' P_{Z_{n(1-\tau)}} \left( Y_{n(1-\tau)} - X_{n(1-\tau)}\hat\beta_{tr}(\alpha) \right).$$
Substitute
$$\hat\beta_{tr}(\alpha) = \left( \frac{X_{\tau n}' P_{Z_{\tau n}} X_{\tau n}}{[\tau n]} + \alpha I \right)^{-1}\left( \frac{X_{\tau n}' P_{Z_{\tau n}} Y_{\tau n}}{[\tau n]} + \alpha \beta_p \right)$$
and write the objective function
$$Q_{n(1-\tau)}(\alpha) = \frac{1}{2(n-[n\tau])}\left( Z_{n(1-\tau)}'\epsilon_{n(1-\tau)} - Z_{n(1-\tau)}'X_{n(1-\tau)}\left( \frac{X_{\tau n}'P_{Z_{\tau n}}X_{\tau n}}{[\tau n]} + \alpha I \right)^{-1}\frac{X_{\tau n}'P_{Z_{\tau n}}\epsilon_{\tau n}}{[\tau n]} - Z_{n(1-\tau)}'X_{n(1-\tau)}\left( \frac{X_{\tau n}'P_{Z_{\tau n}}X_{\tau n}}{[\tau n]} + \alpha I \right)^{-1}\alpha\left( \beta_p - \beta_0 \right) \right)' \left( Z_{n(1-\tau)}'Z_{n(1-\tau)} \right)^{-1} \left( Z_{n(1-\tau)}'\epsilon_{n(1-\tau)} - Z_{n(1-\tau)}'X_{n(1-\tau)}\left( \frac{X_{\tau n}'P_{Z_{\tau n}}X_{\tau n}}{[\tau n]} + \alpha I \right)^{-1}\frac{X_{\tau n}'P_{Z_{\tau n}}\epsilon_{\tau n}}{[\tau n]} - Z_{n(1-\tau)}'X_{n(1-\tau)}\left( \frac{X_{\tau n}'P_{Z_{\tau n}}X_{\tau n}}{[\tau n]} + \alpha I \right)^{-1}\alpha\left( \beta_p - \beta_0 \right) \right).$$
The CLT and LLN imply that $Z_{n(1-\tau)}'X_{n(1-\tau)}\left( \frac{X_{\tau n}'P_{Z_{\tau n}}X_{\tau n}}{[\tau n]} + \alpha I \right)^{-1}\frac{X_{\tau n}'P_{Z_{\tau n}}\epsilon_{\tau n}}{[\tau n]}$ and $Z_{n(1-\tau)}'\epsilon_{n(1-\tau)}$ are O_p(n^{1/2}). The LLN implies $Z_{n(1-\tau)}'X_{n(1-\tau)}\left( \frac{X_{\tau n}'P_{Z_{\tau n}}X_{\tau n}}{[\tau n]} + \alpha I \right)^{-1}\alpha(\beta_p - \beta_0)$ is O_p(n) when β_p ≠ β_0; however, this term is zero if β_p = β_0. Hence, the limiting behavior of the objective function will be determined by the O_p(n) term when β_p ≠ β_0 and by the O_p(n^{1/2}) terms when β_p = β_0.
For β p β 0 , the consistency of α ^ is presented in Lemma 1 of [1]. For β p = β 0 ,
$$\lim_{n \to \infty} Q_{n(1-\tau)}(\alpha) = \frac{1}{2}\left( V_{(1-\tau)} - R_z\Gamma_0\left( \Gamma_0'R_z\Gamma_0 + \alpha I_k \right)^{-1}\Gamma_0'V_\tau \right)' R_z^{-1}\left( V_{(1-\tau)} - R_z\Gamma_0\left( \Gamma_0'R_z\Gamma_0 + \alpha I_k \right)^{-1}\Gamma_0'V_\tau \right)$$
where V_{(1−τ)} and V_τ are iid N(0, R_z σ_ε²). Hence, α̂ converges in distribution to a draw from the a-min distribution with parameter S = R_z^{1/2}Γ_0, where R_z^{1/2} is the symmetric matrix square root of R_z. □
Theorem 1
This theorem and its proof are presented in [1].
Theorem 2
Proof of Theorem 2.
Let $\Sigma_h = E\left[ h_i(\theta_0)h_i(\theta_0)' \right]$ and $M_0 = E\left[ \frac{\partial h_i(\theta_0)}{\partial \theta'} \right]$. The asymptotic distribution of Z is
$$Z \stackrel{a}{\sim} N\left( 0,\; M_0^{-1}\Sigma_h M_0^{-1\prime} \right) \equiv N(0, C).$$
The parameters and the moment conditions are written in sets. To keep track of the needed calculations, write θ, h_i(θ), Σ_h, M_0, Z, and C in terms associated with the sets, using subscripts and superscripts to denote the different sets:
$$\theta = \begin{bmatrix} \theta_1 \\ \theta_2 \\ \theta_3 \\ \theta_4 \end{bmatrix} \equiv \begin{bmatrix} \left[ \mathrm{vech}(R_\tau)',\ \mathrm{vech}(R_{(1-\tau)})',\ \mathrm{vec}(S_\tau)',\ \mathrm{vec}(S_{(1-\tau)})' \right]' \\ \beta_{tr} \\ \alpha \\ \beta \end{bmatrix}$$
and
$$\begin{bmatrix} h_{1,i}(\theta) \\ h_{2,i}(\theta) \\ h_{3,i}(\theta) \\ h_{4,i}(\theta) \end{bmatrix} \equiv \begin{bmatrix} \begin{bmatrix} 1_\tau(i)\,\mathrm{vech}(R_\tau - z_iz_i') \\ (1-1_\tau(i))\,\mathrm{vech}(R_{(1-\tau)} - z_iz_i') \\ 1_\tau(i)\,\mathrm{vec}(S_\tau - z_ix_i') \\ (1-1_\tau(i))\,\mathrm{vec}(S_{(1-\tau)} - z_ix_i') \end{bmatrix} \\ 1_\tau(i)\left( -S_\tau'R_\tau^{-1}z_i(y_i - x_i'\beta_{tr}) + \alpha(\beta_{tr} - \beta_p) \right) \\ (1-1_\tau(i))\,(y_i - x_i'\beta_{tr})\,z_i'R_{(1-\tau)}^{-1}S_{(1-\tau)}\left( S_\tau'R_\tau^{-1}S_\tau + \alpha I_k \right)^{-1}(\beta_p - \beta_{tr}) \\ -\left( \tau S_\tau + (1-\tau)S_{1-\tau} \right)'\left( \tau R_\tau + (1-\tau)R_{1-\tau} \right)^{-1}z_i(y_i - x_i'\beta) + \alpha(\beta - \beta_p) \end{bmatrix}.$$
Let $\Sigma_h^{i,j} \equiv E\left[ h_{i,i}(\theta_0)h_{j,i}(\theta_0)' \right]$ and $M_0^{i,j} \equiv E\left[ \frac{\partial h_{i,i}(\theta_0)}{\partial \theta_j'} \right]$ for i, j = 1, …, 4, and denote the partitioned terms of $M_0^{-1}$ by $M_0^{i,j}$ for i, j = 1, …, 4. The limiting random variables will be partitioned as
$$\begin{bmatrix} Z_1 \\ Z_2 \\ Z_3 \\ Z_4 \end{bmatrix} = \begin{bmatrix} M_0^{1,1} & M_0^{1,2} & M_0^{1,3} & M_0^{1,4} \\ M_0^{2,1} & M_0^{2,2} & M_0^{2,3} & M_0^{2,4} \\ M_0^{3,1} & M_0^{3,2} & M_0^{3,3} & M_0^{3,4} \\ M_0^{4,1} & M_0^{4,2} & M_0^{4,3} & M_0^{4,4} \end{bmatrix} \frac{1}{\sqrt{n}}\sum_{i=1}^{n}\begin{bmatrix} h_{i,1}(\theta_0) \\ h_{i,2}(\theta_0) \\ h_{i,3}(\theta_0) \\ h_{i,4}(\theta_0) \end{bmatrix}.$$
The partitioned elements of C will be denoted $C^{i,j}$ for i, j = 1, …, 4 and can be written as
$$C^{i,j} = \sum_{l=1}^{4}\sum_{k=1}^{4} M_0^{i,l}\,\Sigma_h^{l,k}\,M_0^{j,k\prime}.$$
(Note that the transpose on the second $M_0$ term is achieved with the indices being flipped.) The $C^{4,4}$ term is the covariance matrix for the estimate of θ_4, i.e., the ridge estimate of the structural parameters, β. The detailed calculations of the $C^{i,j}$ terms are presented in the Supplemental Material for this paper.
The θ_3 term is α, which is restricted to be non-negative; its probability limit is on the boundary of the parameter space, i.e., α_0 = 0. Following Self and Liang (1987) [18], the probability limit being on the boundary of the parameter space results in an asymptotic distribution that is characterized by a projection onto a cone. Because the probability limit is zero, the asymptotic distribution is obtained by projecting the limiting stochastic process onto the non-negative values of the θ_3 = α component. This projection is defined using the limiting covariance matrix to define the inner product. When a draw from the limiting distribution Z has a non-negative Z_3 term, it directly contributes to the asymptotic distribution. When a draw has a negative Z_3 term, the random vector is projected onto the cone with Z_3 = 0. This means Z_3 will be mapped to zero. The other parameters will also be adjusted depending on their covariance and correlation with Z_3. This adjustment can contribute an asymptotic bias term.
The asymptotic distribution of the estimates can be characterized as
$$\sqrt{n}\left( \hat\theta - \theta_0 \right) \stackrel{a}{\sim} \begin{bmatrix} Z_1 \\ Z_2 \\ Z_3 \\ Z_4 \end{bmatrix} 1\{Z_3 \ge 0\} + \begin{bmatrix} Z_1 - (C^{1,3}/C^{3,3})Z_3 \\ Z_2 - (C^{2,3}/C^{3,3})Z_3 \\ 0 \\ Z_4 - (C^{4,3}/C^{3,3})Z_3 \end{bmatrix} 1\{Z_3 < 0\}.$$
The asymptotic distribution for α̂ is a mixture with probability mass 1/2 at zero and a truncated normal distribution N(0, C^{3,3}) over the non-negative values.
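Draws from this projected limit are straightforward to simulate, which is useful for checking the closed forms below. A minimal sketch, assuming C has been partitioned so that index idx3 is the scalar α block:

```python
import numpy as np

def project_draw(C, idx3, rng):
    # Draw Z ~ N(0, C); keep it when Z3 >= 0, otherwise project onto the cone:
    # Z3 -> 0 and each other block is adjusted by its regression coefficient on Z3.
    Z = rng.multivariate_normal(np.zeros(C.shape[0]), C)
    if Z[idx3] >= 0:
        return Z
    adj = Z - (C[:, idx3] / C[idx3, idx3]) * Z[idx3]  # Z_j - (C^{j,3}/C^{3,3}) Z3
    adj[idx3] = 0.0
    return adj
```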
The asymptotic distribution for the ridge estimator of the structural parameters β is the asymptotic distribution for θ_4,
$$\sqrt{n}(\hat\beta - \beta_0) \stackrel{a}{\sim} N\left( -\frac{1}{2}\left( C^{4,3}/C^{3,3} \right) E[Z_3],\; C^{4,4} \right)$$
where Z_3 is a draw from the truncated normal distribution N(0, C^{3,3}) over the negative values. The asymptotic bias can be evaluated in closed form. The expectation of Z_3 for this truncated normal distribution is
$$E[z] = \int_{-\infty}^{0} z\,\frac{2}{\sqrt{2\pi C^{3,3}}}\exp\left( -\frac{z^2}{2C^{3,3}} \right)dz = \frac{2}{\sqrt{2\pi C^{3,3}}}\int_{-\infty}^{0} z\exp\left( -\frac{z^2}{2C^{3,3}} \right)dz = \frac{2}{\sqrt{2\pi C^{3,3}}}\left[ -C^{3,3}\exp\left( -\frac{z^2}{2C^{3,3}} \right) \right]_{-\infty}^{0} = -\frac{2C^{3,3}}{\sqrt{2\pi C^{3,3}}} = -\sqrt{\frac{2C^{3,3}}{\pi}}.$$
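This closed form is easy to verify by simulation; a quick Monte Carlo check (with an arbitrary value of C^{3,3}) is:

```python
import numpy as np

rng = np.random.default_rng(0)
C33 = 2.5                                  # arbitrary test value
z = rng.normal(0.0, np.sqrt(C33), size=1_000_000)
print(z[z < 0].mean())                     # ~ -1.2616
print(-np.sqrt(2 * C33 / np.pi))           # closed form: -1.26157...
```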
Under the null hypothesis H_0: β = β_0, the statistic
$$\sqrt{n}\left[ (\hat\beta - \beta_0) - \frac{C^{4,3}}{2C^{3,3}}\sqrt{\frac{2C^{3,3}}{\pi n}} \right]' \left( C^{4,4} \right)^{-1} \sqrt{n}\left[ (\hat\beta - \beta_0) - \frac{C^{4,3}}{2C^{3,3}}\sqrt{\frac{2C^{3,3}}{\pi n}} \right]$$
will converge in distribution to a chi-square with k degrees of freedom.
The asymptotic distribution for α̂, β̂, and the test statistic require the terms C^{3,3}, C^{4,3}, and C^{4,4}.
The details of the matrix multiplication are presented in the Supplemental Material for this paper. The terms are
$$C^{3,3} = \frac{\sigma_\varepsilon^2}{\tau(1-\tau)\,(\beta_0 - \beta_p)'\left( \Gamma_0'R_z\Gamma_0 \right)^{-1}(\beta_0 - \beta_p)},$$
$$C^{4,3} = -\frac{\sigma_\varepsilon^2\left( \Gamma_0'R_z\Gamma_0 \right)^{-1}(\beta_0 - \beta_p)}{\tau(1-\tau)\,(\beta_0 - \beta_p)'\left( \Gamma_0'R_z\Gamma_0 \right)^{-1}(\beta_0 - \beta_p)},$$
$$C^{4,4} = \sigma_\varepsilon^2\left( \Gamma_0'R_z\Gamma_0 \right)^{-1} + \frac{\sigma_\varepsilon^2\left( \Gamma_0'R_z\Gamma_0 \right)^{-1}(\beta_0 - \beta_p)(\beta_0 - \beta_p)'\left( \Gamma_0'R_z\Gamma_0 \right)^{-1}}{\tau(1-\tau)\,(\beta_0 - \beta_p)'\left( \Gamma_0'R_z\Gamma_0 \right)^{-1}(\beta_0 - \beta_p)},$$
and
$$\frac{C^{4,3}}{2C^{3,3}}\sqrt{\frac{2C^{3,3}}{\pi}} = -\left( \Gamma_0'R_z\Gamma_0 \right)^{-1}(\beta_0 - \beta_p)\sqrt{\frac{\sigma_\varepsilon^2}{2\pi\,\tau(1-\tau)\,(\beta_0 - \beta_p)'\left( \Gamma_0'R_z\Gamma_0 \right)^{-1}(\beta_0 - \beta_p)}}.$$
The asymptotic distribution for the optimally selected tuning parameter when β_p ≠ β_0 is a mixture with a discrete mass of 1/2 at zero and, over the positive values, a truncated normal with zero mean and covariance C^{3,3}. The asymptotic distribution for √n(β̂ − β_0) is normal with mean $\frac{C^{4,3}}{2C^{3,3}}\sqrt{\frac{2C^{3,3}}{\pi}}$ and covariance C^{4,4}. □
Theorem 3
Proof of Theorem 3.
The tuning parameter is estimated, but it is not identified. Consider only the final system of k equations used to estimate the parameters of interest, conditional on the estimated tuning parameter α̂:
$$-\frac{X'P_Z(Y - X\hat\beta)}{n} + \hat\alpha(\hat\beta - \beta_p) = 0.$$
This implies
$$\hat\beta = \left( \frac{X'P_ZX}{n} + \hat\alpha I_k \right)^{-1}\left( \frac{X'P_ZY}{n} + \hat\alpha\beta_p \right).$$
Substituting β_p = β_0 gives
$$\hat\beta = \left( \frac{X'P_ZX}{n} + \hat\alpha I_k \right)^{-1}\left( \frac{X'P_ZY}{n} + \hat\alpha\beta_0 \right).$$
Substituting Y = Xβ_0 + ε,
$$\hat\beta = \left( \frac{X'P_ZX}{n} + \hat\alpha I_k \right)^{-1}\left( \frac{X'P_Z(X\beta_0 + \varepsilon)}{n} + \hat\alpha\beta_0 \right),$$
which simplifies to
$$\hat\beta = \left( \frac{X'P_ZX}{n} + \hat\alpha I_k \right)^{-1}\left( \left( \frac{X'P_ZX}{n} + \hat\alpha I_k \right)\beta_0 + \frac{X'P_Z\varepsilon}{n} \right).$$
This implies
$$\left( \hat\beta - \beta_0 \right) = \left( \frac{X'P_ZX}{n} + \hat\alpha I_k \right)^{-1}\frac{X'P_Z\varepsilon}{n},$$
which gives the root-n consistency of the estimate, and the asymptotic distribution becomes
$$\sqrt{n}\left( \hat\beta - \beta_0 \right) \stackrel{a}{\sim} N\left( 0,\; \sigma_\varepsilon^2\left( \Gamma_0'R_z\Gamma_0 + \hat\alpha I_k \right)^{-1}\left( \Gamma_0'R_z\Gamma_0 \right)\left( \Gamma_0'R_z\Gamma_0 + \hat\alpha I_k \right)^{-1} \right)$$
where α̂ is a draw from the a-min distribution with parameter R_z^{1/2}Γ_0.
The inverse of the ridge estimator’s variance is
$$\sigma_\varepsilon^{-2}\left( \Gamma_0'R_z\Gamma_0 + \hat\alpha I_k \right)\left( \Gamma_0'R_z\Gamma_0 \right)^{-1}\left( \Gamma_0'R_z\Gamma_0 + \hat\alpha I_k \right) = \sigma_\varepsilon^{-2}\left( \Gamma_0'R_z\Gamma_0 \right) + 2\sigma_\varepsilon^{-2}\hat\alpha I_k + \sigma_\varepsilon^{-2}\hat\alpha^2\left( \Gamma_0'R_z\Gamma_0 \right)^{-1}.$$
For α̂ > 0, this is larger than the inverse of the 2SLS estimator’s variance, $\sigma_\varepsilon^{-2}(\Gamma_0'R_z\Gamma_0)$. Hence, the variance of the 2SLS estimator can never be smaller than the variance of the ridge estimator when β_p = β_0. □
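The variance ranking can be checked numerically for any positive definite matrix standing in for Γ_0′R_zΓ_0 and any α̂ > 0; for example:

```python
import numpy as np

G = np.array([[2.0, 0.3], [0.3, 0.5]])       # stands in for Gamma_0' R_z Gamma_0
a, sigma2 = 0.4, 1.0                         # arbitrary alpha_hat and sigma_eps^2
B = np.linalg.inv(G + a * np.eye(2))
V_ridge = sigma2 * B @ G @ B
V_2sls = sigma2 * np.linalg.inv(G)
print(np.linalg.eigvalsh(V_2sls - V_ridge))  # all >= 0: ridge variance is weakly smaller
```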
Theorem 4
This is a special case of Theorem 1. The proof of Theorem 1 as presented in [1] applies to this set of parameters and moment conditions.
Theorem 5
This is a special case of Theorem 2. Assumption 6 implies that there is no need for instrumental variables. The basic simplification is that, in the IV model, $\frac{1}{n}X'P_ZX \stackrel{p}{\to} \Gamma_0'R_z\Gamma_0$, while, for the linear regression model, this reduces to $\frac{1}{n}X'X \stackrel{p}{\to} R_x$.
Theorem 6
This is a special case of the results in Theorem 3. The explanation for Theorem 5 also applies for this theorem.

References

  1. Sengupta, N.; Sowell, F. On the Asymptotic Distribution of Ridge Regression Estimators Using Training and Test Samples. Econometrics 2020, 8, 39.
  2. Obenchain, R. Classical F-Tests and Confidence Regions for Ridge Regression. Technometrics 1977, 19, 429.
  3. Van Wieringen, W.N. Lecture notes on ridge regression. arXiv 2021, arXiv:1509.09169.
  4. Melo, S.; Kibria, B.M.G. On Some Test Statistics for Testing the Regression Coefficients in Presence of Multicollinearity: A Simulation Study. Stats 2020, 3, 40–55.
  5. Halawa, A.; Bassiouni, M.E. Tests of regression coefficients under ridge regression models. J. Stat. Comput. Simul. 2000, 65, 341–356.
  6. Theobald, C.M. Generalizations of Mean Square Error Applied to Ridge Regression. J. R. Stat. Soc. Ser. B 1974, 36, 103–106.
  7. Schmidt, P. Econometrics; Statistics, Textbooks and Monographs; Dekker: New York, NY, USA, 1976.
  8. Smith, G.; Campbell, F. A Critique of Some Ridge Regression Methods. J. Am. Stat. Assoc. 1980, 75, 74–81.
  9. Montgomery, D.; Peck, E.; Vining, G. Introduction to Linear Regression Analysis; Wiley Series in Probability and Statistics; Wiley: Hoboken, NJ, USA, 2012.
  10. Gómez, R.S.; García, C.G.; Pérez, J.G. The Raise Regression: Justification, properties and application. arXiv 2021, arXiv:2104.14423.
  11. Kibria, B.M.G.; Banik, S. A Simulation Study on the Size and Power Properties of Some Ridge Regression Tests. Appl. Appl. Math. Int. J. (AAM) 2019, 14, 741–761.
  12. Zorzi, M. Empirical Bayesian learning in AR graphical models. Automatica 2019, 109, 108516.
  13. Zorzi, M. Autoregressive identification of Kronecker graphical models. Automatica 2020, 119, 109053.
  14. Alheety, M.I.; Ramanathan, T.V. Confidence Interval for Shrinkage Parameters in Ridge Regression. Commun. Stat.-Theory Methods 2009, 38, 3489–3497.
  15. Rubio, H.; Firinguetti, L. The Distribution of Stochastic Shrinkage Parameters in Ridge Regression. Commun. Stat.-Theory Methods 2002, 31, 1531–1547.
  16. Akdeniz, F.; Öztürk, F. The distribution of stochastic shrinkage biasing parameters of the Liu type estimator. Appl. Math. Comput. 2005, 163, 29–38.
  17. Moran, P.A.P. Maximum-likelihood estimation in non-standard conditions. Math. Proc. Camb. Philos. Soc. 1971, 70, 441–450.
  18. Self, S.; Liang, K. Asymptotic properties of maximum likelihood estimators and likelihood ratio tests under nonstandard conditions. J. Am. Stat. Assoc. 1987, 82, 605–610.
  19. Andrews, D.W.K. Generalized method of moments estimation when a parameter is on a boundary. J. Bus. Econ. Stat. 2002, 20, 530–544.
  20. Andrews, D.W.K. Estimation When a Parameter is on a Boundary. Econometrica 1999, 67, 1341–1383.
  21. Kinal, T.W. The Existence of Moments of k-Class Estimators. Econometrica 1980, 48, 241–249.
  22. Golub, G.H.; Heath, M.; Wahba, G. Generalized Cross-Validation as a Method for Choosing a Good Ridge Parameter. Technometrics 1979, 21, 215–223.
  23. Antoine, B.; Renault, E. Efficient GMM with nearly-weak instruments. Econom. J. 2009, 12, S135–S171.