Article

Asymptotics of Subsampling for Generalized Linear Regression Models under Unbounded Design

Guangqiang Teng, Boping Tian, Yuanyuan Zhang and Sheng Fu
1 School of Mathematics, Harbin Institute of Technology, Harbin 150001, China
2 School of Mathematical Sciences, Soochow University, Suzhou 215006, China
3 Department of Industrial Systems Engineering & Management, National University of Singapore, 21 Lower Kent Ridge Road, Singapore 119077, Singapore
* Authors to whom correspondence should be addressed.
These authors contributed equally to this work.
Submission received: 8 November 2022 / Revised: 27 December 2022 / Accepted: 28 December 2022 / Published: 31 December 2022
(This article belongs to the Special Issue Recent Advances in Statistical Theory and Applications)

Abstract

Optimal subsampling is a statistical methodology for generalized linear models (GLMs) that allows fast inference on parameter estimates in massive-data regression. The existing literature only considers bounded covariates. In this paper, we first obtain the asymptotic normality of the subsampling M-estimator based on the Fisher information matrix. We then study the asymptotic properties of subsampling estimators for unbounded GLMs with nonnatural links, including both conditional and unconditional asymptotic properties.

1. Introduction

In recent years, the amount of information that people need to process has been increasing dramatically, and directly processing such massive data for statistical analysis is a great challenge. The divide-and-conquer strategy can mitigate the difficulty of processing big data directly [1], but it still consumes considerable computing resources. As a computationally cheaper alternative, subsampling is valuable when computing resources are limited.
To reduce the burden on the machine, subsampling strategies for big data have received increasing attention in recent years. Ref. [2] proposes simple necessary and sufficient conditions for a convolved subsampling estimator to produce a normal limit that matches the target of bootstrap estimation; Ref. [3] provides an optimal distributed subsampling scheme for maximum quasi-likelihood estimators with massive data; Ref. [4] studies several adaptive optimal subsampling algorithms; and Ref. [5] describes a subdata selection method based on leverage scores that conducts linear model selection on a small subdata set.
GLMs are a class of statistical models with a wide range of applications; see, e.g., [6,7,8]. Many subsampling studies are based on GLMs, such as [3,9,10]. However, the covariates of the subsampled GLMs in the literature are assumed to be bounded. In some big data problems, the magnitude of a covariate is not strictly bounded; for example, the number of clicks on a web page can grow without limit. This requires extending the existing theory to the unbounded design. To fill this gap, this paper studies the asymptotic properties of subsampled GLMs with unbounded covariates based on empirical process and martingale techniques.
Our three contributions are as follows: (1) we describe the asymptotic property of the subsampled M-estimator using the Fisher information matrix; (2) we give the conditional consistency and asymptotic normality of the subsampling estimator for unbounded GLMs; (3) we provide the unconditional consistency and asymptotic normality of the subsampling estimator for unbounded GLMs.
The rest of the paper is organized as follows. Section 2 introduces the basic concepts of GLMs and the subsampling M-estimation problem. Section 3 presents the asymptotic properties of the subsampling estimators for unbounded GLMs. Section 4 gives the conclusion and discussion, as well as future research directions. All technical proofs are collected in Appendix A.

2. Preliminaries

This section introduces the subsampling M-estimation problem and GLMs.

2.1. Subsampling M-Estimation

Let $\{l(\beta; Z) \in \mathbb{R} \mid Z \in \mathcal{Z}\}$ be a set of loss functions indexed by $\beta$ in a finite-dimensional convex set $\Theta \subseteq \mathbb{R}^p$, and let $U = \{1, 2, \ldots, N\}$ be the index set of the full large dataset with $\sigma$-algebra $\mathcal{F}_N = \sigma(Z_1, \ldots, Z_N)$, where for each $i \in U$ the random data point $Z_i \in \mathcal{Z}$ (some probability space) is observed. The empirical risk $L_N: \Theta \to \mathbb{R}$ is given by $L_N(\beta) = \frac{1}{N}\sum_{i \in U} l(\beta; Z_i)$.
The goal is to obtain the solution $\hat{\beta}_N$ that minimizes the risk, namely
$$\hat{\beta}_N = \arg\min_{\beta \in \Theta} L_N(\beta). \qquad (1)$$
To solve Equation (1), we require $\hat{\beta}_N$ to satisfy $\nabla L_N(\beta) = \frac{1}{N}\sum_{i \in U} \nabla l(\beta; Z_i) = 0$, and we let $\Sigma_N := \nabla^2 L_N(\hat{\beta}_N)$. This is an M-estimation problem; see [11]. To solve the large-scale estimation problem in Equation (1) quickly, we propose subsampling M-estimation. Consider an index set $S = \{i_1, i_2, \ldots, i_n\}$ drawn with replacement from $U$ according to the sampling probabilities $\{\pi_i\}_{i=1}^N$ with $\sum_{i=1}^N \pi_i = 1$. The subsampling M-estimation problem is to obtain the solution $\hat{\beta}_n$ satisfying
$$\nabla L_n^*(\beta) = 0 \quad \text{with} \quad L_n^*(\beta) = \frac{1}{Nn}\sum_{i \in S} \frac{1}{\pi_i^*}\, l(\beta; Z_i^*), \qquad (2)$$
where $Z_i^*$ denotes the data point obtained at the $i$-th subsampling draw (with replacement) and $\pi_i^*$ is its subsampling probability. For example, if $Z_1^* = Z_1$, then $\pi_1^* = \pi_1$; if $Z_2^* = Z_1$, then $\pi_2^* = \pi_1$. Denote by $a_i$ the number of times the $i$-th data point is subsampled, so that $\sum_{i \in U} a_i = n$. The risk $L_n^*(\beta)$ is constructed by inverse probability weighting so that $E[L_n^*(\beta) \mid \mathcal{F}_N] = L_N(\beta)$; see [12]. Details on the properties of conditional expectation can be found in [13].
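The construction above is easy to simulate. The following minimal sketch (the toy data, the squared-error loss, and all variable names are illustrative choices of ours, not part of the paper) draws an index set $S$ with replacement, forms the inverse-probability-weighted risk $L_n^*(\beta)$, and checks numerically that averaging it over many subsamples recovers $L_N(\beta)$, consistent with $E[L_n^*(\beta) \mid \mathcal{F}_N] = L_N(\beta)$ (Lemma A2).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy full data Z_i = (x_i, y_i) and a squared-error loss l(beta; Z_i) = (y_i - x_i beta)^2 / 2.
N, n = 10_000, 200
x = rng.normal(size=N)
y = 1.5 * x + rng.normal(size=N)

def loss(beta, xs, ys):
    return 0.5 * (ys - xs * beta) ** 2

# Sampling probabilities pi_i: any positive weights normalized to sum to one.
pi = np.abs(x) + 0.1
pi = pi / pi.sum()

beta = 1.0                                  # a fixed parameter value
L_N = loss(beta, x, y).mean()               # full-data empirical risk L_N(beta)

def L_n_star(beta, idx):
    """Inverse-probability-weighted subsample risk L_n*(beta) for the drawn indices idx."""
    return (loss(beta, x[idx], y[idx]) / pi[idx]).sum() / (N * n)

idx = rng.choice(N, size=n, replace=True, p=pi)      # one subsample S of size n
one_draw = L_n_star(beta, idx)

# Averaging over many subsamples approximates E[L_n*(beta) | F_N], which equals L_N(beta).
avg = np.mean([L_n_star(beta, rng.choice(N, size=n, replace=True, p=pi)) for _ in range(500)])
print(L_N, one_draw, avg)
```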

2.2. Generalized Linear Models

Let the random variable $Y$ follow a distribution from the natural exponential family $P_\alpha$ indexed by the parameter $\alpha$,
$$P_\alpha(dy) = dF_Y(y) = c(y)\exp\{y\alpha - b(\alpha)\}\,\nu(dy), \quad c(y) > 0,$$
where $\alpha$ is often referred to as the canonical parameter belonging to its natural space
$$\Lambda = \Big\{\alpha : \int c(y)\exp\{y\alpha\}\,\nu(dy) < \infty\Big\}.$$
Here $\nu(\cdot)$ is the Lebesgue measure for continuous distributions (normal, Gamma) or the counting measure for discrete distributions (binomial, Poisson, negative binomial), and $c(y)$ is free of $\alpha$.
Let $\{(Y_i, X_i)\}_{i=1}^N$ be $N$ independent sample data pairs. Here $X_i \in \mathbb{R}^p$ is the covariate vector, and we assume that the response $Y_i$ follows a distribution from the natural exponential family with parameter $\alpha_i \in \Lambda$. The covariates $X_i := (x_{i1}, \ldots, x_{ip})^T$, $i = 1, 2, \ldots, N$, are supposed to be deterministic.
The conditional expectation of $Y_i$ given $X_i$ is specified as a function of $\beta^T X_i$ after a transformation by a link function $\alpha_i = \psi(\beta^T X_i)$. The mean value, denoted $\mu_i := E(Y_i)$, is mostly considered for regression.
If $\alpha_i = \beta^T X_i$, then $\psi(\beta^T X_i) = \beta^T X_i$ is called the canonical (or natural) link function, and the corresponding model is a canonical (or natural) GLM; see page 32 in [14]. The assumption $\alpha_i = \beta^T X_i$ is sometimes too strong and not very suitable in practice, whereas nonnatural-link GLMs allow more flexible choices of the link function. We therefore assume that $\alpha_i$ and $\beta^T X_i$ are related by a nonnatural link function $\alpha_i = \psi(\beta^T X_i)$.
Let $f_\beta(Y_i \mid X_i)$ be the density function of the i.i.d. data $\{(Y_i, X_i)\}_{i=1}^N$ from the exponential family with link function $\psi(\cdot)$. Then the nonnatural GLM [15] is defined by
$$Y_i \mid X_i \sim f_\beta(Y_i \mid X_i) = c(Y_i)\exp\big\{Y_i\,\psi(\beta^T X_i) - b\big(\psi(\beta^T X_i)\big)\big\}, \quad i = 1, 2, \ldots, N. \qquad (3)$$
Here is a classic result for the exponential family (3):
$$E(Y_i \mid X_i) := \mu_i = \dot{b}(\alpha_i) = \dot{b}\big(\psi(\beta^T X_i)\big) \quad \text{and} \quad \mathrm{Var}(Y_i \mid X_i) := \mathrm{Var}(Y_i) = \ddot{b}(\alpha_i), \qquad (4)$$
where $i = 1, 2, \ldots, N$; see p. 280 in [16].
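As a concrete illustration of Equation (4), the sketch below takes the Poisson member of the family, for which $b(\alpha) = e^\alpha$ so that $\dot{b} = \ddot{b} = \exp$, and pairs it with a softplus link $\psi(t) = \log(1+e^t)$ chosen purely as an example of a nonnatural link; this numerical check is our own, not part of the paper.

```python
import numpy as np

# Poisson member of the natural exponential family: b(alpha) = exp(alpha), hence
# b_dot = b_ddot = exp. The softplus psi below is an illustrative nonnatural link.
b_dot = np.exp
b_ddot = np.exp
def psi(t):
    return np.log1p(np.exp(t))

rng = np.random.default_rng(1)
beta = np.array([0.5, -0.25])
X_i = np.array([1.0, 2.0])            # one covariate vector
alpha_i = psi(X_i @ beta)             # canonical parameter alpha_i = psi(beta^T X_i)

mu_i = b_dot(alpha_i)                 # E(Y_i | X_i) = b_dot(psi(beta^T X_i)), Equation (4)
var_i = b_ddot(alpha_i)               # Var(Y_i | X_i) = b_ddot(alpha_i)

# Monte Carlo check of the two identities for the Poisson case.
Y = rng.poisson(mu_i, size=200_000)
print(mu_i, Y.mean(), var_i, Y.var())
```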

3. Main Results

3.1. Subsampling M-Estimation Problem

In this part we first look at the term $\Sigma_N^{-1}\nabla L_n^*(\hat{\beta}_N)$. Define an independent random vector sequence $\{\zeta_j\}_{j=1}^N$ and its subsampled counterpart $\{\zeta_j^*\}_{j=1}^n$, such that each vector $\zeta$ takes values in $\big\{\frac{1}{N\pi_i}\Sigma_N^{-1}\nabla l(\hat{\beta}_N; Z_i)\big\}_{i=1}^N$, and let
$$V_M(\hat{\beta}_N; n) = \frac{1}{N^2 n}\,\Sigma_N^{-1}\Big[\sum_{i \in U}\frac{1}{\pi_i}\,\nabla l(\hat{\beta}_N; Z_i)\,\nabla l(\hat{\beta}_N; Z_i)^T\Big]\Sigma_N^{-1}.$$
From the definition of $\nabla L_N(\beta)$, we have $E(\zeta \mid \mathcal{F}_N) = \Sigma_N^{-1}\nabla L_N(\hat{\beta}_N) = 0$ and $\mathrm{Var}(\zeta \mid \mathcal{F}_N) = n\,V_M(\hat{\beta}_N; n)$. We then have the following asymptotic property of the subsampled M-estimator.
Theorem 1.
Suppose that the risk function $L_N(\beta)$ is twice differentiable and $\lambda$-strongly convex over $\Theta$, that is, for $\beta \in \Theta$, $\nabla^2 L_N(\beta) \succeq \lambda I$, where $\succeq$ denotes the positive semidefinite ordering; and suppose the sampling-based moment condition
$$\frac{1}{N^4}\sum_{i=1}^N \frac{1}{\pi_i^3}\,\big\|\nabla l(\hat{\beta}_N; Z_i)\big\|^4 = O_P(1)$$
holds. Then, as $n \to \infty$, conditionally on $\mathcal{F}_N$,
$$V_M(\hat{\beta}_N; n)^{-1/2}\,(\hat{\beta}_n - \hat{\beta}_N) \xrightarrow{d} N(0, I_p), \qquad (5)$$
where $\xrightarrow{d}$ denotes convergence in distribution.
Theorem 1 reveals that the subsampling M-estimation scheme is theoretically feasible under mild conditions. In addition, the existence of the estimator is guaranteed through the Fisher information matrix.
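For readers who want to use Theorem 1 in practice, the following sketch shows one way to form the plug-in matrix $V_M(\hat{\beta}_N; n)$ from full-data gradients and to whiten $\hat{\beta}_n - \hat{\beta}_N$ accordingly; the function and variable names are hypothetical, and the gradients and $\Sigma_N^{-1}$ are assumed to be supplied from the user's own loss.

```python
import numpy as np

def plug_in_V_M(grads, pi, Sigma_N_inv, n):
    """Plug-in V_M(beta_hat_N; n) = Sigma_N^{-1} [sum_i grad_i grad_i^T / pi_i] Sigma_N^{-1} / (N^2 n).

    grads       : (N, p) array, row i is the gradient of l(beta_hat_N; Z_i)
    pi          : (N,) sampling probabilities summing to one
    Sigma_N_inv : (p, p) inverse Hessian of L_N at beta_hat_N
    """
    N = grads.shape[0]
    middle = (grads / pi[:, None]).T @ grads          # sum_i (1/pi_i) grad_i grad_i^T
    return Sigma_N_inv @ middle @ Sigma_N_inv / (N ** 2 * n)

def whiten_difference(beta_n, beta_N, V_M):
    """Whiten beta_hat_n - beta_hat_N with a Cholesky factor of V_M; by Theorem 1 the
    result is approximately a standard normal vector."""
    L = np.linalg.cholesky(V_M)                       # V_M = L L^T
    return np.linalg.solve(L, beta_n - beta_N)
```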

3.2. Conditional Asymptotic Properties of Subsampled GLMs with Unbounded Covariates

The exponential family is very versatile, containing many common light-tailed distributions such as the binomial, Poisson, negative binomial, normal, and Gamma. Together with their attendant convexity properties, which lead to a finite-variance property for the log-density, these distributions underpin a large number of popular and effective statistical models. It is precisely because of the commonality of these distributions that we study the subsampling problem for GLMs.
Following the loss function introduced in Section 2.1, we set $l(\beta; Z_i) := -\log f_\beta(Y_i \mid X_i)$, where $f_\beta(Y_i \mid X_i)$ is defined by Equation (3); minimizing this loss function is then equivalent to maximizing the likelihood function. For simplicity, we assume that $c(y) = 1$, so that
$$\nabla l(\beta; Z_i) := \frac{\partial\,\{-\log f_\beta(Y_i \mid X_i)\}}{\partial \beta} = -\big[Y_i - \dot{b}\big(\psi(\beta^T X_i)\big)\big]\,\dot{\psi}(\beta^T X_i)\,X_i$$
with the nonnatural link function $\alpha_i = \psi(\beta^T X_i)$. We also use this idea in Section 3.3.
More generally, we consider a wider class, the quasi-GLMs, rather than GLMs, which only assumes that Equation (4) holds for a certain mean function $\mu(\cdot)$. Strong consistency and asymptotic normality of the quasi maximum likelihood estimator in GLMs with bounded covariates are proved in [17]. For unbounded covariates, adopting the subsampled estimation of GLMs in [9], we calculate the inverse probability weighted estimator of $\beta$ by solving the estimating equation based on the subsampled index set $S$,
$$\frac{1}{Nn}\sum_{i \in S}\frac{1}{\pi_i}\big[Y_i - \mu\big(\psi(\beta^T X_i)\big)\big]\,\dot{\psi}(\beta^T X_i)\,X_i = 0,$$
where $\{(Y_i, X_i)\}_{i \in S}$ are the subsampled data. Equivalently, we have
$$s_n(\beta) = \sum_{i \in S}\frac{1}{\pi_i}\big[Y_i - \mu\big(\psi(\beta^T X_i)\big)\big]\,\dot{\psi}(\beta^T X_i)\,X_i = 0. \qquad (6)$$
The model defined by Equation (6) is called a quasi-GLM, since only the mean relation in Equation (4) is specified rather than the full distribution.
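To make the estimating equation concrete, here is a minimal Fisher-scoring sketch for solving Equation (6) on a drawn subsample. The choices $\mu(a) = e^a$ (with $\dot{\mu} = \exp$) and the softplus link with its derivative, as well as all function and variable names, are illustrative assumptions of ours rather than anything prescribed by the paper.

```python
import numpy as np

# Illustrative choices: mu(a) = exp(a) (so mu_dot = exp) and the softplus nonnatural
# link psi(t) = log(1 + exp(t)) with derivative psi_dot(t) = 1 / (1 + exp(-t)).
mu, mu_dot = np.exp, np.exp
def psi(t):
    return np.log1p(np.exp(t))
def psi_dot(t):
    return 1.0 / (1.0 + np.exp(-t))

def solve_weighted_score(X_sub, y_sub, pi_sub, n_iter=50, tol=1e-8):
    """Fisher scoring for s_n(beta) = sum_i (1/pi_i)[y_i - mu(psi(x_i^T beta))] psi_dot(x_i^T beta) x_i = 0."""
    _, p = X_sub.shape
    beta = np.zeros(p)
    w = 1.0 / pi_sub
    for _ in range(n_iter):
        t = X_sub @ beta
        resid = y_sub - mu(psi(t))
        score = X_sub.T @ (w * resid * psi_dot(t))                                  # s_n(beta)
        info = X_sub.T @ (X_sub * (w * mu_dot(psi(t)) * psi_dot(t) ** 2)[:, None])  # plug-in information
        step = np.linalg.solve(info, score)
        beta = beta + step
        if np.linalg.norm(step) < tol:
            break
    return beta

# Usage: X_sub (n x p) and y_sub (n,) are the subsampled covariates and responses,
# pi_sub (n,) their subsampling probabilities; the returned vector plays the role of beta_hat_n.
```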
Let $\hat{\beta}_n$ be the estimator of the true parameter $\beta_0$ in the subsampled quasi-GLM and $\hat{\beta}_N$ be the estimator of $\beta_0$ in the quasi-GLM with full data. For the unbounded quasi-GLM with full data, $\hat{\beta}_N$ is asymptotically unbiased for $\beta_0$; see [18]. Next, we focus on the asymptotic properties of $\hat{\beta}_n$, as shown in the following theorems.
Theorem 2.
Let $\{(Y_i, X_i)\}_{i \in S}$ be subsampled from the i.i.d. full data $\{(Y_i, X_i)\}_{i \in U}$. Consider Equations (4) and (6), where $\psi(\cdot)$ is three times continuously differentiable with every derivative bounded, and $b(\cdot)$ is twice continuously differentiable with every derivative bounded. Assume that:
(A.1)
The range of the unknown parameter $\beta$ is an open subset of $\mathbb{R}^p$.
(A.2)
For any $i \in S$, $E\big[\sup_{\beta \in \Theta}\frac{1}{\pi_i}\big|Y_i - \mu(\psi(\beta^T X_i))\big| \,\big|\, \mathcal{F}_N\big] = O(1)$.
(A.3)
For any $\beta \in \Theta$ and $i \in S$, $0 < \inf_i \varphi(\beta^T X_i) \le \sup_i \varphi(\beta^T X_i) < \infty$, where $\varphi(t) = [\dot{\psi}(t)]^2\,\ddot{b}(\psi(t))$.
(A.4)
For any $\beta_1 \in \Theta$ and $\beta_2 \in \Theta$, there exists a function $m(\cdot)$ with $|m(X_i)| < \infty$ such that
$$\big|\varphi(\beta_1^T X_i) - \varphi(\beta_2^T X_i)\big| \le |m(X_i)|\,\big|\beta_1^T X_i - \beta_2^T X_i\big|.$$
(A.5)
As $n \to \infty$, $\max_{i \in S} X_i^T(\mathbf{X}\mathbf{X}^T)^{-1}X_i = O(n^{-1})$ and $\lambda_{\min}[\mathbf{X}\mathbf{X}^T] \to \infty$, where $\mathbf{X} = (X_1, \ldots, X_n)$ and $\lambda_{\min}[\mathbf{A}]$ is the smallest eigenvalue of the matrix $\mathbf{A}$.
(A.6)
$\min_{i=1,\ldots,N}(N\pi_i) = O(1)$, $\max_{i=1,\ldots,N}(N\pi_i) = O(1)$.
Then $\hat{\beta}_n$ is consistent with $\hat{\beta}_N$, i.e.,
$$\hat{\beta}_n - \hat{\beta}_N = o_{P \mid \mathcal{F}_N}(1),$$
where $o_{P \mid \mathcal{F}_N}(1)$ means $o(1)$ in probability conditionally on $\mathcal{F}_N$.
Theorem 3.
Under the conditions of Theorem 2, as $N \to \infty$ and $n \to \infty$, conditionally on $\mathcal{F}_N$ in probability,
$$\sqrt{n}\,(\hat{\beta}_n - \hat{\beta}_N) \to N(0, V_s)$$
in distribution, where
$$V_s = \Sigma_N^{-1} V_N \Sigma_N^{-1}, \quad \Sigma_N = \sum_{i \in U} a_i\big[Y_i - \dot{b}\big(\psi(\hat{\beta}_N^T X_i)\big)\big]\ddot{\psi}(\hat{\beta}_N^T X_i)\,X_i X_i^T - \sum_{i \in U} a_i\,\ddot{b}\big(\psi(\hat{\beta}_N^T X_i)\big)\big[\dot{\psi}(\hat{\beta}_N^T X_i)\big]^2 X_i X_i^T, \quad V_N = \sum_{i \in U} \frac{a_i}{\pi_i}\big[Y_i - \dot{b}\big(\psi(\hat{\beta}_N^T X_i)\big)\big]^2\big[\dot{\psi}(\hat{\beta}_N^T X_i)\big]^2 X_i X_i^T.$$
In this part, we establish the asymptotic properties without the moment condition on the covariates $\{X_i\}_{i=1}^N$ that is used in [9]; that is, the $X_i$'s may be unbounded. Here we only provide the theoretical asymptotic results. Furthermore, the subsampling probabilities can be derived from an A-optimality criterion as in [10].
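The paper stops at the theoretical results, but for completeness the sketch below indicates how subsampling probabilities in the spirit of the A-optimality criterion of [10] could be computed from a pilot estimate in this nonnatural-link setting. The specific formula, functions, and names are our own hedged adaptation, not the probabilities derived in [10] or in this paper.

```python
import numpy as np

def a_optimality_style_probs(X, y, beta_pilot, mu, mu_dot, psi, psi_dot, eps=1e-12):
    """Probabilities pi_i proportional to
       |y_i - mu(psi(x_i^T beta))| * |psi_dot(x_i^T beta)| * ||M^{-1} x_i||,
    where M is a plug-in information matrix at a pilot estimate (an A-optimality-style heuristic)."""
    t = X @ beta_pilot
    w = mu_dot(psi(t)) * psi_dot(t) ** 2
    M = X.T @ (X * w[:, None])                  # plug-in information matrix
    M_inv_X = np.linalg.solve(M, X.T).T         # row i is M^{-1} x_i
    scores = np.abs(y - mu(psi(t))) * np.abs(psi_dot(t)) * np.linalg.norm(M_inv_X, axis=1)
    scores = np.maximum(scores, eps)            # keep every probability strictly positive
    return scores / scores.sum()

# The resulting vector can be passed as the sampling probabilities {pi_i} when drawing
# the index set S with replacement, exactly as in Section 2.1.
```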

3.3. Unconditional Asymptotic Properties of Subsampled GLMs with Unbounded Covariates

In real engineering, measuring some response variables is very expensive, for example in superconductor data or deep-space exploration data. Accurately estimating the target parameters under such measurement constraints on the responses is therefore an important issue. Ref. [19] established the unconditional asymptotic properties of parameter estimation in bounded GLMs with a canonical link, but the case of unbounded GLMs with a nonnatural link has not been discussed yet.
In this section, we continue to use the notation of Section 3.2. Through the theory of empirical processes [11], we obtain the unconditional consistency of $\hat{\beta}_n$ in the following theorem.
Theorem 4.
(Unconditional subsampled consistency). Assume the following conditions:
(B.1)
$\lambda_{\min}(E[XX^T]) > 0$, where $X$ is the unbounded covariate of the GLM.
(B.2)
For $u_1, u_2 \in [0, 1]$,
$$\inf_{\beta \in \Theta \setminus \{\beta_0\}} \frac{E\big\{\ddot{b}(\tilde{\psi}_{u_1})\,\dot{\psi}\big[(1-u_2)(\beta_0^T X) + u_2(\beta^T X)\big]^2 (\beta^T X - \beta_0^T X)^2\big\}}{E(\beta^T X - \beta_0^T X)^2} \ge C_1 > 0,$$
where $\tilde{\psi}_{u_1} = (1-u_1)\psi(\beta_0^T X) + u_1\psi(\beta^T X)$ and $\ddot{b}(\cdot)$ denotes the second derivative of $b(\cdot)$.
(B.3)
$E_{\beta_0}\sup_{\beta \in \Theta}\big[\big|Y - \dot{b}(\psi(\beta^T X))\big| \cdot \|X\|^2\big] < \infty$,
where $\dot{b}(\cdot)$ denotes the first derivative of $b(\cdot)$.
(B.4)
$\psi(\cdot)$ in (3) is twice continuously differentiable and every derivative of it has a positive minimum.
(B.5)
$b(\cdot)$ in (3) is twice continuously differentiable and every derivative of it has a positive minimum.
Then $\hat{\beta}_n - \beta_0 = o_P(1)$.
Theorem 4 directly gives the unconditional consistency of the subsampling estimator with respect to the true parameter under the unbounded-design assumption.
To prove the asymptotic normality of $\hat{\beta}_n$ with respect to $\beta_0$, we briefly recall the subsampled score function from Section 3.2,
$$s_n(\beta) = \sum_{i \in S}\frac{1}{\pi_i}\big[Y_i - \mu\big(\psi(\beta^T X_i)\big)\big]\,\dot{\psi}(\beta^T X_i)\,X_i := \sum_{i \in S}\frac{1}{\pi_i}\,\phi_\beta(X_i, Y_i).$$
Next we apply a multivariate martingale central limit theorem (Lemma 4 in [19]), which extends Theorem A.1 in [20], to show the asymptotic normality of $\hat{\beta}_n$. Let $\{\mathcal{F}_{N,i}\}_{i=1}^n$ be a filtration adapted to the sampling: $\mathcal{F}_{N,0} = \sigma(X_1^N, Y_1^N)$; $\mathcal{F}_{N,1} = \sigma(X_1^N, Y_1^N) \vee \sigma(1)$; $\ldots$; $\mathcal{F}_{N,i} = \sigma(X_1^N, Y_1^N) \vee \sigma(1) \vee \cdots \vee \sigma(i)$; $\ldots$, where $\sigma(i)$ is the $\sigma$-algebra generated by the $i$-th sampling step. The subsample size $n$ is assumed to increase with $N$. Based on this filtration, we define the martingale
$$\bar{M} := \sum_{i=1}^n \bar{M}_i := \sum_{i=1}^n\Big[\frac{1}{\pi_i}\phi_\beta(X_i, Y_i) - \sum_{j=1}^N \phi_\beta(X_j, Y_j)\Big],$$
where $\{\bar{M}_i\}_{i=1}^n$ is a martingale difference sequence adapted to $\{\mathcal{F}_{N,i}\}_{i=1}^n$. In addition, define $Q := n\sum_{j=1}^N \phi_\beta(X_j, Y_j)$; $T := s_n(\beta) = \bar{M} + Q$; $\xi_{Ni} := \mathrm{Var}^{-1/2}(T)\,\bar{M}_i$; and $B_N := \mathrm{Var}^{-1/2}(T)\,\mathrm{Var}(\bar{M})\,\mathrm{Var}^{-1/2}(T)$, where the matrix $A^{1/2}$ is the symmetric square root of $A$, i.e., $A = (A^{1/2})^2$, and $A^{-1/2} = (A^{1/2})^{-1} = (A^{-1})^{1/2}$. $B_N$ is the variance of $\mathrm{Var}^{-1/2}(T)\,\bar{M}$.
The following theorem shows the asymptotic normality of the estimator β ^ n .
Theorem 5.
Assume the following conditions:
(C.1)
$\Phi = E(s_n'(\beta)) = -E\Big[\sum_{i \in S}\frac{1}{\pi_i}\,\dot{\mu}\big(\psi(\beta^T X_i)\big)\big[\dot{\psi}(\beta^T X_i)\big]^2 X_i X_i^T\Big]$
is finite and nonsingular.
(C.2)
$E\Big[\sum_{i \in U}\frac{a_i}{\pi_i}\,\dot{\mu}\big(\psi(\beta^T X_i)\big)\big[\dot{\psi}(\beta^T X_i)\big]^2 X_{ik}X_{ij}\Big]^2 = o_P(1)$ for $1 \le k, j \le p$, where $X_{ik}$ denotes the $k$-th element of the vector $X_i$ and $X_{ij}$ the $j$-th element.
(C.3)
$\psi(x)$ is three times continuously differentiable for every $x$ in its domain.
(C.4)
For any $i \in S$, $\|\ddot{\phi}_\beta(X_i, Y_i)\| < \infty$.
(C.5)
$\min_{i=1,\ldots,N}(N\pi_i) = \max_{i=1,\ldots,N}(N\pi_i) = O(1)$ and $n/N = o(1)$.
(C.6)
$\lim_{N\to\infty}\sum_{i=1}^n E\big[\|\xi_{Ni}\|^4\big] = 0$.
(C.7)
$\lim_{N\to\infty} E\Big\|\sum_{i=1}^n E\big[\xi_{Ni}\xi_{Ni}^T \mid \mathcal{F}_{N,i-1}\big] - B_N\Big\|^2 = 0$.
Then
$$\mathrm{Var}(T)^{-1/2}\,\Phi\,(\hat{\beta}_n - \beta_0) \xrightarrow{d} N(0, I_p).$$
Here, we establish the unconditional asymptotic properties of the subsampling estimator for unbounded GLMs. The condition $n/N = o(1)$ ensures that small subsamples also achieve the expected performance, which greatly reduces the computational cost. Again, we only present the theoretical asymptotic results, from which the subsampling probabilities can be obtained using the A-optimality criterion in [10].

4. Conclusions and Future Work

In this paper, we derive the asymptotic normality of the subsampling M-estimator via the Fisher information matrix. For unbounded GLMs with a nonnatural link function, we obtain the conditional and the unconditional asymptotic properties of the subsampling estimator separately.
For future study, it would be meaningful to apply the sub-Weibull concentration inequalities in [21] to obtain nonasymptotic inference. Importance sampling is not ideal, since it tends to assign high sampling probabilities to the already-observed samples. Hence, other effective subsampling methods should be considered for GLMs, such as the Markov subsampling in [22]. Moreover, the high-dimensional methods in [23,24] for subsampling need further study.

Author Contributions

Conceptualization, B.T.; Methodology, Y.Z.; Validation, G.T.; Writing—original draft, G.T.; Writing—review & editing, B.T., Y.Z. and S.F.; Supervision, B.T.; Funding acquisition, Y.Z. and B.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Key University Science Research Project of Jiangsu Province 21KJB110023 and National Natural Science Foundation of China 91646106.

Data Availability Statement

Not applicable.

Acknowledgments

We would like to thank Huiming Zhang for helpful discussions on large sample theory.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Technical Details

Lemma A1
(Theorem 4.17 in [16]). Let $X_1, \ldots, X_N$ be i.i.d. from a p.d.f. $f_\beta$ w.r.t. a $\sigma$-finite measure $\nu$ on $(\mathbb{R}, \mathcal{B}_\mathbb{R})$, where $\beta \in \Theta$ and $\Theta$ is an open set in $\mathbb{R}^p$. Suppose that for every $x$ in the range of $X_1$, $f_\beta(x)$ is twice continuously differentiable in $\beta$ and satisfies
(D.1)
$$\frac{\partial}{\partial\beta}\int \psi_\beta(x)\,d\nu = \int \frac{\partial}{\partial\beta}\psi_\beta(x)\,d\nu$$
for $\psi_\beta(x) = f_\beta(x)$ and for $\psi_\beta(x) = \frac{\partial f_\beta(x)}{\partial\beta}$.
(D.2)
The Fisher information matrix
$$I_1(\beta) = E\Big[\frac{\partial}{\partial\beta}\log f_\beta(X_1)\Big(\frac{\partial}{\partial\beta}\log f_\beta(X_1)\Big)^T\Big]$$
is positive definite.
(D.3)
For any given $\beta \in \Theta$, there exist a positive number $C_\beta$ and a positive function $h_\beta$ such that $E[h_\beta(X_1)] < \infty$ and
$$\sup_{\gamma: \|\gamma - \beta\| < C_\beta}\Big\|\frac{\partial^2\log f_\gamma(x)}{\partial\gamma\,\partial\gamma^T}\Big\| \le h_\beta(x)$$
for all $x$ in the range of $X_1$, where $\|\cdot\|$ is the Euclidean norm and $\|A\| = \sqrt{\mathrm{tr}(A^T A)}$ for any matrix $A$. Then there exists a sequence of estimators $\hat{\beta}_N$ (based on $X_i$, $i \in U$) such that
$$P\big(s_a(\hat{\beta}_N) = 0\big) \to 1 \quad \text{and} \quad \hat{\beta}_N \xrightarrow{P} \beta_0,$$
where $s_a(\gamma) = \frac{\partial \log \tilde{L}_N(\gamma)}{\partial \gamma}$, $\tilde{L}_N(\gamma)$ is the likelihood function of the full data, and $\beta_0$ is the true parameter. Meanwhile, there exists a sequence of estimators $\hat{\beta}_n$ (based on $X_i$, $i \in S$) such that
$$P\big(s_s(\hat{\beta}_n) = 0\big) \to 1 \quad \text{and} \quad \hat{\beta}_n \xrightarrow{P} \beta_0,$$
where $s_s(\gamma) = \frac{\partial \log \tilde{L}_n(\gamma)}{\partial \gamma}$, $\tilde{L}_n(\gamma)$ is the likelihood function of the subsampled data, and $\beta_0$ is the true parameter.
Let $a_i$ denote the number of times the $i$-th data point is subsampled, so that $\sum_{i \in U} a_i = n$.
Lemma A2.
$E[L_n^*(\beta) \mid \mathcal{F}_N] = L_N(\beta)$.
Proof. 
From the definition of a i , one has
$$\begin{aligned}
E[L_n^*(\beta) \mid \mathcal{F}_N] &= E\Big[\frac{1}{Nn}\sum_{i\in S}\frac{1}{\pi_i^*}\,l(\beta; Z_i^*)\,\Big|\,\mathcal{F}_N\Big] = E\Big[\frac{1}{Nn}\sum_{i\in U}\frac{1}{\pi_i}\,l(\beta; Z_i)\,a_i\,\Big|\,\mathcal{F}_N\Big]\\
&= \frac{1}{Nn}\sum_{i\in U}a_i\,E\Big[\frac{1}{\pi_i}\,l(\beta; Z_i)\,\Big|\,\mathcal{F}_N\Big] = \frac{1}{Nn}\sum_{i\in U}a_i\,\frac{\sum_{i\in U}\frac{1}{\pi_i}\,l(\beta; Z_i)\,\pi_i}{\sum_{i\in U}\pi_i}\\
&= \frac{1}{Nn}\sum_{i\in U}a_i\sum_{i\in U}l(\beta; Z_i) = \frac{1}{Nn}\,n\sum_{i\in U}l(\beta; Z_i) = \frac{1}{N}\sum_{i\in U}l(\beta; Z_i) = L_N(\beta).
\end{aligned}$$
Proposition A1.
Under the conditions of Lemma A1 and
$$\min_i(N\pi_i) = \max_i(N\pi_i) = O(1), \quad i = 1, \ldots, N,$$
assume that $\hat{\beta}_N$ (based on $X_i$, $i \in U$) is an estimator of $\beta$ and that $\hat{\beta}_n$ (based on $X_i$, $i \in S$) is also an estimator of $\beta$; then
$$\hat{\beta}_n - \hat{\beta}_N = -\Sigma_N^{-1}\,\nabla L_n^*(\hat{\beta}_N) + o_{P\mid\mathcal{F}_N}(1).$$
Proof. 
Taking a Taylor series expansion of $\nabla L_n^*(\hat{\beta}_n)$ around $\hat{\beta}_N$, we have
$$\begin{aligned}
0 = \nabla L_n^*(\hat{\beta}_n) &= \nabla L_n^*(\hat{\beta}_N) + \nabla^2 L_n^*(\hat{\beta}_N)(\hat{\beta}_n - \hat{\beta}_N) + o(\hat{\beta}_n - \hat{\beta}_N)\\
&= \nabla L_n^*(\hat{\beta}_N) + \nabla^2 L_n^*(\hat{\beta}_N)(\hat{\beta}_n - \hat{\beta}_N) + \nabla^2 L_N(\hat{\beta}_N)(\hat{\beta}_n - \hat{\beta}_N) - \nabla^2 L_N(\hat{\beta}_N)(\hat{\beta}_n - \hat{\beta}_N) + o(\hat{\beta}_n - \hat{\beta}_N)\\
&= \nabla L_n^*(\hat{\beta}_N) + \nabla^2 L_N(\hat{\beta}_N)(\hat{\beta}_n - \hat{\beta}_N) + \big(\nabla^2 L_n^*(\hat{\beta}_N) - \nabla^2 L_N(\hat{\beta}_N)\big)(\hat{\beta}_n - \hat{\beta}_N) + o(\hat{\beta}_n - \hat{\beta}_N).
\end{aligned}$$
From the definition of $a_i$, one has
$$\begin{aligned}
\big(\nabla^2 L_n^*(\hat{\beta}_N) - \nabla^2 L_N(\hat{\beta}_N)\big)(\hat{\beta}_n - \hat{\beta}_N) &= \Big[\frac{1}{Nn}\sum_{i\in S}\frac{1}{\pi_i^*}\nabla^2 l(\hat{\beta}_N; Z_i^*) - \frac{1}{N}\sum_{i\in U}\nabla^2 l(\hat{\beta}_N; Z_i)\Big]\cdot(\hat{\beta}_n - \hat{\beta}_N)\\
&= \Big[\sum_{i\in U}\frac{a_i}{Nn\pi_i}\nabla^2 l(\hat{\beta}_N; Z_i) - \frac{1}{N}\sum_{i\in U}\nabla^2 l(\hat{\beta}_N; Z_i)\Big]\cdot(\hat{\beta}_n - \hat{\beta}_N)\\
&= \sum_{i\in U}\frac{a_i - n\pi_i}{Nn\pi_i}\,\nabla^2 l(\hat{\beta}_N; Z_i)\,(\hat{\beta}_n - \hat{\beta}_N) = o_{P\mid\mathcal{F}_N}(1).
\end{aligned}$$
Combining Equations (A1), (A2) and (A5) with Equation (A4), one has
$$0 = \nabla L_n^*(\hat{\beta}_N) + \nabla^2 L_N(\hat{\beta}_N)(\hat{\beta}_n - \hat{\beta}_N) + o_{P\mid\mathcal{F}_N}(1).$$
This can be rearranged as
$$\hat{\beta}_n - \hat{\beta}_N = -\Sigma_N^{-1}\,\nabla L_n^*(\hat{\beta}_N) + o_{P\mid\mathcal{F}_N}(1).$$
The proposition is proved. □
Remark A1.
The last equation in the proof ensures that $\hat{\beta}_n - \hat{\beta}_N + \Sigma_N^{-1}\nabla L_n^*(\hat{\beta}_N)$ is of smaller order than 1 in probability conditionally on $\mathcal{F}_N$. The symbol $o_{P\mid\mathcal{F}_N}(1)$ in Equation (A6) denotes exactly such a term, i.e., one that is $o(1)$ in conditional probability given $\mathcal{F}_N$.
Proof of Theorem 1.
For every constant $\hat{\gamma} > 0$, one has
$$\begin{aligned}
\sum_{j\in S}E\Big[\big\|n^{-\frac12}\zeta_j^*\big\|^2\,I\big(\|\zeta_j^*\| > n^{\frac12}\hat{\gamma}\big)\,\Big|\,\mathcal{F}_N\Big]
&\le \sum_{j\in S}E\Big[\frac{\|\zeta_j^*\|^4}{n^2\hat{\gamma}^2}\,I\big(\|\zeta_j^*\| > n^{\frac12}\hat{\gamma}\big)\,\Big|\,\mathcal{F}_N\Big]
= \frac{1}{n^2\hat{\gamma}^2}\sum_{j\in S}E\Big[\|\zeta_j^*\|^4\,I\big(\|\zeta_j^*\| > n^{\frac12}\hat{\gamma}\big)\,\Big|\,\mathcal{F}_N\Big]\\
&\le \frac{1}{n^2\hat{\gamma}^2}\sum_{j\in S}E\big[\|\zeta_j^*\|^4\,\big|\,\mathcal{F}_N\big]
= \frac{1}{n^2\hat{\gamma}^2}\sum_{i\in U}a_i\,E\big[\|\zeta_i\|^4\,\big|\,\mathcal{F}_N\big]
= \frac{1}{n^2\hat{\gamma}^2}\,n\,\frac{\sum_{i\in U}\|\zeta_i\|^4\pi_i}{\sum_{i\in U}\pi_i}\\
&= \frac{1}{n\hat{\gamma}^2}\sum_{i\in U}\Big\|\frac{1}{N\pi_i}\Sigma_N^{-1}\nabla l(\hat{\beta}_N; Z_i)\Big\|^4\pi_i
= \frac{1}{n\hat{\gamma}^2}\,\frac{1}{N^4}\sum_{i\in U}\frac{1}{\pi_i^3}\big\|\Sigma_N^{-1}\nabla l(\hat{\beta}_N; Z_i)\big\|^4\\
&\le \frac{1}{n\hat{\gamma}^2}\,\frac{1}{N^4}\sum_{i\in U}\frac{1}{\pi_i^3}\,\frac{1}{\lambda^4}\big\|\nabla l(\hat{\beta}_N; Z_i)\big\|^4
= \frac{1}{n\hat{\gamma}^2}\,\frac{1}{\lambda^4}\,O_P(1) = o_P(1).
\end{aligned}$$
Furthermore,
$$\begin{aligned}
\sum_{j\in S}\mathrm{Cov}\big(n^{-\frac12}\zeta_j^*\,\big|\,\mathcal{F}_N\big)
&= \sum_{j\in S}E\Big[\big(n^{-\frac12}\zeta_j^* - E(n^{-\frac12}\zeta_j^*\mid\mathcal{F}_N)\big)\big(n^{-\frac12}\zeta_j^* - E(n^{-\frac12}\zeta_j^*\mid\mathcal{F}_N)\big)^T\,\Big|\,\mathcal{F}_N\Big]\\
&= \sum_{j\in S}E\big[(n^{-\frac12}\zeta_j^*)(n^{-\frac12}\zeta_j^*)^T\,\big|\,\mathcal{F}_N\big]
= \frac{1}{n}\sum_{j\in S}E\big(\zeta_j^*\zeta_j^{*T}\,\big|\,\mathcal{F}_N\big)
= \frac{1}{n}\,n\,E\big(\zeta\zeta^T\,\big|\,\mathcal{F}_N\big) = \mathrm{Var}(\zeta\mid\mathcal{F}_N).
\end{aligned}$$
Then, by the Lindeberg–Feller central limit theorem (Proposition 2.27 of [11]), conditionally on $\mathcal{F}_N$,
$$\sum_{j\in S}n^{-\frac12}\zeta_j^* \xrightarrow{d} N\big(0, \mathrm{Var}(\zeta\mid\mathcal{F}_N)\big).$$
Therefore, combining the above with Proposition A1, Equation (5) holds. Thus, the proof is completed. □
Proof of Theorem 2.
Next, one needs to show convexity (i.e., uniqueness and maximality), given the existence of the estimators from [25]. Let
$$\begin{aligned}
I_1(\beta) = -E\big(s_n'(\beta)\big) &= -E\Big[\frac{\partial}{\partial\beta}\sum_{i\in S}\frac{1}{\pi_i}\big[Y_i - \mu\big(\psi(\beta^TX_i)\big)\big]\dot{\psi}(\beta^TX_i)X_i\Big]\\
&= -E\Big(-\sum_{i\in S}\frac{1}{\pi_i}\dot{\mu}\big(\psi(\beta^TX_i)\big)\big[\dot{\psi}(\beta^TX_i)\big]^2X_iX_i^T + \sum_{i\in S}\frac{1}{\pi_i}\big[Y_i - \mu\big(\psi(\beta^TX_i)\big)\big]\ddot{\psi}(\beta^TX_i)X_iX_i^T\Big)\\
&= \sum_{i\in S}\frac{1}{\pi_i}\dot{\mu}\big(\psi(\beta^TX_i)\big)\big[\dot{\psi}(\beta^TX_i)\big]^2X_iX_i^T,
\end{aligned}$$
where
$$s_n(\beta) = \sum_{i\in S}\frac{1}{\pi_i}\big[Y_i - \mu\big(\psi(\beta^TX_i)\big)\big]\dot{\psi}(\beta^TX_i)X_i.$$
By Theorem 4.17 in [16], one needs to show
$$\max_{\gamma\in G(C_0)}\Big\|I_1(\hat{\beta}_N)^{-1/2}\,s_n'(\gamma)\,I_1(\hat{\beta}_N)^{-1/2} + I_p\Big\| \xrightarrow{P\mid\mathcal{F}_N} 0,$$
where $G(C_0) = \big\{\gamma: \|I_1(\hat{\beta}_N)^{1/2}(\gamma - \hat{\beta}_N)\| \le C_0\big\}$ and $I_p = \mathrm{diag}(1, 1, \ldots, 1)$ is the $p$-dimensional identity matrix.
Let
$$M_n(\gamma) = \sum_{i\in S}\frac{1}{\pi_i}\,\big[\dot{\psi}(\gamma^TX_i)\big]^2\,\ddot{b}\big(\psi(\gamma^TX_i)\big)\,X_iX_i^T$$
and
$$R_n(\gamma) = \sum_{i\in S}\frac{1}{\pi_i}\,\big[Y_i - \mu\big(\psi(\gamma^TX_i)\big)\big]\,\ddot{\psi}(\gamma^TX_i)\,X_iX_i^T.$$
Then
$$s_n'(\gamma) = R_n(\gamma) - M_n(\gamma)$$
and
$$I_1(\gamma) = -E\big(s_n'(\gamma)\big) = M_n(\gamma).$$
Thus, one only needs to prove
$$\max_{\gamma\in G(C_0)}\Big\|M_n(\hat{\beta}_N)^{-1/2}\,\big(M_n(\gamma) - M_n(\hat{\beta}_N)\big)\,M_n(\hat{\beta}_N)^{-1/2}\Big\| \xrightarrow{P\mid\mathcal{F}_N} 0,$$
and
$$\max_{\gamma\in G(C_0)}\Big\|M_n(\hat{\beta}_N)^{-1/2}\,R_n(\gamma)\,M_n(\hat{\beta}_N)^{-1/2}\Big\| \xrightarrow{P\mid\mathcal{F}_N} 0$$
for any $C_0 > 0$. From the definition of $M_n(\gamma)$, and the property of the trace on p. 288 of [16], the left-hand side of Equation (A11) can be bounded by
$$p\,\max_{\gamma\in G(C_0),\,i\in S}\Big|1 - \varphi(\gamma^TX_i)\big/\varphi(\hat{\beta}_N^TX_i)\Big|.$$
From condition (A.4), one needs to prove that $\gamma^TX_i - \hat{\beta}_N^TX_i$ converges to 0 so that Equation (A11) holds, and one has
$$\begin{aligned}
\big|\gamma^TX_i - \hat{\beta}_N^TX_i\big|^2 &= \big|(\gamma^T - \hat{\beta}_N^T)\,I_1(\hat{\beta}_N)^{1/2}\,I_1(\hat{\beta}_N)^{-1/2}X_i\big|^2
\le \big\|I_1(\hat{\beta}_N)^{1/2}(\gamma - \hat{\beta}_N)\big\|^2\,\big\|I_1(\hat{\beta}_N)^{-1/2}X_i\big\|^2\\
&\le C_0^2\,\max_{i\in S}X_i^T I_1(\hat{\beta}_N)^{-1}X_i = C_0^2\,\max_{i\in S}X_i^T M_n(\hat{\beta}_N)^{-1}X_i\\
&= C_0^2\,\max_{i\in S}X_i^T\Big[\sum_{i\in S}\frac{1}{\pi_i}\big[\dot{\psi}(\hat{\beta}_N^TX_i)\big]^2\ddot{b}\big(\psi(\hat{\beta}_N^TX_i)\big)X_iX_i^T\Big]^{-1}X_i
= C_0^2\,\max_{i\in S}X_i^T\Big[\sum_{i\in S}\frac{1}{\pi_i}\varphi(\hat{\beta}_N^TX_i)X_iX_i^T\Big]^{-1}X_i\\
&= C_0^2\,\max_{i\in S}X_i^T\Big[\sum_{i\in S}N\,\frac{1}{N\pi_i}\varphi(\hat{\beta}_N^TX_i)X_iX_i^T\Big]^{-1}X_i
\le \frac{C_0^2}{N}\Big[\min_{i\in S}\frac{1}{N\pi_i}\,\inf_{i\in S}\varphi(\hat{\beta}_N^TX_i)\Big]^{-1}\max_{i\in S}X_i^T\Big[\sum_{i\in S}X_iX_i^T\Big]^{-1}X_i\\
&= \frac{C_0^2}{N}\Big[\min_{i\in S}\frac{1}{N\pi_i}\,\inf_{i\in S}\varphi(\hat{\beta}_N^TX_i)\Big]^{-1}\max_{i\in S}X_i^T\big(\mathbf{X}\mathbf{X}^T\big)^{-1}X_i \xrightarrow{P\mid\mathcal{F}_N} 0.
\end{aligned}$$
Hence Equation (A11) holds. Let $e_i = Y_i - \mu\big(\psi(\hat{\beta}_N^TX_i)\big)$, and
$$U_n(\gamma) = \sum_{i\in S}\frac{1}{\pi_i}\big[\mu\big(\psi(\hat{\beta}_N^TX_i)\big) - \mu\big(\psi(\gamma^TX_i)\big)\big]\ddot{\psi}(\gamma^TX_i)X_iX_i^T, \quad
V_n(\gamma) = \sum_{i\in S}\frac{e_i}{\pi_i}\big[\ddot{\psi}(\gamma^TX_i) - \ddot{\psi}(\hat{\beta}_N^TX_i)\big]X_iX_i^T, \quad
W_n(\hat{\beta}_N) = \sum_{i\in S}\frac{e_i}{\pi_i}\ddot{\psi}(\hat{\beta}_N^TX_i)X_iX_i^T.$$
Then $R_n(\gamma) = U_n(\gamma) + V_n(\gamma) + W_n(\hat{\beta}_N)$. In the same way as in the proof of Equation (A11), we have
$$\max_{\gamma\in G(C_0)}\Big\|M_n(\hat{\beta}_N)^{-1/2}\,U_n(\gamma)\,M_n(\hat{\beta}_N)^{-1/2}\Big\| \xrightarrow{P\mid\mathcal{F}_N} 0.$$
Note that $\|M_n(\hat{\beta}_N)^{-1/2}V_n(\gamma)M_n(\hat{\beta}_N)^{-1/2}\|$ is bounded by the product of
$$\Big\|M_n(\hat{\beta}_N)^{-\frac12}\sum_{i\in S}\frac{|e_i|}{\pi_i}X_iX_i^T\,M_n(\hat{\beta}_N)^{-\frac12}\Big\|$$
and
$$\max_{\gamma\in G(C_0),\,i\in S}\Big|\ddot{\psi}(\gamma^TX_i) - \ddot{\psi}(\hat{\beta}_N^TX_i)\Big|.$$
Equation (A13) can be bounded as
$$\begin{aligned}
\Big\|M_n(\hat{\beta}_N)^{-\frac12}\sum_{i\in S}\frac{|e_i|}{\pi_i}X_iX_i^T\,M_n(\hat{\beta}_N)^{-\frac12}\Big\|
&= \Big\|I_1(\hat{\beta}_N)^{-\frac12}\sum_{i\in S}\frac{|e_i|}{\pi_i}X_iX_i^T\,I_1(\hat{\beta}_N)^{-\frac12}\Big\|
\le \sum_{i\in S}\frac{|e_i|}{\pi_i}\,\big\|\big[I_1(\hat{\beta}_N)\big]^{-\frac12}X_i\big\|^2\\
&\le \sum_{i\in S}\frac{|e_i|}{N\pi_i}\Big[\min_{i\in S}\frac{1}{N\pi_i}\,\inf_{i\in S}\varphi(\hat{\beta}_N^TX_i)\Big]^{-1}\max_{i\in S}X_i^T\big(\mathbf{X}\mathbf{X}^T\big)^{-1}X_i\\
&\le \sum_{i\in S}|e_i|\,\max_{i\in S}\frac{1}{N\pi_i}\Big[\min_{i\in S}\frac{1}{N\pi_i}\,\inf_{i\in S}\varphi(\hat{\beta}_N^TX_i)\Big]^{-1}\max_{i\in S}X_i^T\big(\mathbf{X}\mathbf{X}^T\big)^{-1}X_i\\
&\le \sum_{i\in U}\big|Y_i - \mu\big(\psi(\hat{\beta}_N^TX_i)\big)\big|\,\max_{i\in S}\frac{1}{N\pi_i}\Big[\min_{i\in S}\frac{1}{N\pi_i}\,\inf_{i\in S}\varphi(\hat{\beta}_N^TX_i)\Big]^{-1}\max_{i\in S}X_i^T\big(\mathbf{X}\mathbf{X}^T\big)^{-1}X_i\\
&\le \sum_{i\in U}\sup_{\beta\in\Theta}\big|Y_i - \mu\big(\psi(\beta^TX_i)\big)\big|\,\max_{i\in S}\frac{1}{N\pi_i}\Big[\min_{i\in S}\frac{1}{N\pi_i}\,\inf_{i\in S}\varphi(\hat{\beta}_N^TX_i)\Big]^{-1}\cdot\max_{i\in S}X_i^T\big(\mathbf{X}\mathbf{X}^T\big)^{-1}X_i\\
&= \frac{1}{n}\sum_{i\in S}E\Big[\sup_{\beta\in\Theta}\frac{1}{\pi_i}\big|Y_i - \mu\big(\psi(\beta^TX_i)\big)\big|\,\Big|\,\mathcal{F}_N\Big]\,\max_{i\in S}\frac{1}{N\pi_i}\Big[\min_{i\in S}\frac{1}{N\pi_i}\,\inf_{i\in S}\varphi(\hat{\beta}_N^TX_i)\Big]^{-1}\cdot\max_{i\in S}X_i^T\big(\mathbf{X}\mathbf{X}^T\big)^{-1}X_i\\
&= O_{P\mid\mathcal{F}_N}(1/n),
\end{aligned}$$
where the penultimate equality applies Lemma A2 with
$$l(\beta) = \sup_{\beta\in\Theta}\big|Y_i - \mu\big(\psi(\beta^TX_i)\big)\big|.$$
Equation (A14) can be bounded as
$$\max_{\gamma\in G(C_0),\,i\in S}\Big|\ddot{\psi}(\gamma^TX_i) - \ddot{\psi}(\hat{\beta}_N^TX_i)\Big| \xrightarrow{P\mid\mathcal{F}_N} 0,$$
which can be proved by the same argument as for Equation (A11), using the Lagrange mean value theorem. Combining the bounds on Equations (A13) and (A14), one obtains
$$\max_{\gamma\in G(C_0)}\Big\|M_n(\hat{\beta}_N)^{-1/2}\,V_n(\gamma)\,M_n(\hat{\beta}_N)^{-1/2}\Big\| \xrightarrow{P\mid\mathcal{F}_N} 0.$$
Let $\delta\in(0,1)$ be a constant. Since $\sup_{i\in S}E\big(|e_i|^{1+\delta}\,\big|\,\mathcal{F}_N\big) < \infty$, one has
$$\begin{aligned}
\sum_{i\in S}E\Big[\Big|\frac{e_i}{\pi_i}\,\ddot{\psi}(\hat{\beta}_N^TX_i)\,X_i^TM_n(\hat{\beta}_N)^{-1}X_i\Big|^{1+\delta}\,\Big|\,\mathcal{F}_N\Big]
&\le \sum_{i\in S}E\Big[\Big|\frac{e_i}{\pi_i}\Big|^{1+\delta}\,\Big|\,\mathcal{F}_N\Big]\cdot\max_{i\in S}\big|\ddot{\psi}(\hat{\beta}_N^TX_i)\big|^{1+\delta}\cdot\big[X_i^TM_n(\hat{\beta}_N)^{-1}X_i\big]^{1+\delta}\\
&\le \sum_{i\in S}\frac{1}{\pi_i^{1+\delta}}E\big[|e_i|^{1+\delta}\,\big|\,\mathcal{F}_N\big]\,\max_{i\in S}\big|\ddot{\psi}(\hat{\beta}_N^TX_i)\big|^{1+\delta}\cdot\Big[X_i^T\Big(\sum_{i\in S}N\,\frac{1}{N\pi_i}\varphi(\hat{\beta}_N^TX_i)X_iX_i^T\Big)^{-1}X_i\Big]^{1+\delta}\\
&= \sum_{i\in S}\frac{1}{(N\pi_i)^{1+\delta}}E\big[|e_i|^{1+\delta}\,\big|\,\mathcal{F}_N\big]\,\max_{i\in S}\big|\ddot{\psi}(\hat{\beta}_N^TX_i)\big|^{1+\delta}\cdot\Big[X_i^T\Big(\sum_{i\in S}\frac{1}{N\pi_i}\varphi(\hat{\beta}_N^TX_i)X_iX_i^T\Big)^{-1}X_i\Big]^{1+\delta}\\
&\le C_\delta\sum_{i\in S}\big[X_i^T\big(\mathbf{X}\mathbf{X}^T\big)^{-1}X_i\big]^{1+\delta}
\le C_\delta\sum_{i\in S}X_i^T\big(\mathbf{X}\mathbf{X}^T\big)^{-1}X_i\,\Big[\max_{i\in S}X_i^T\big(\mathbf{X}\mathbf{X}^T\big)^{-1}X_i\Big]^{\delta}\\
&= C_\delta\Big[\max_{i\in S}X_i^T\big(\mathbf{X}\mathbf{X}^T\big)^{-1}X_i\Big]^{\delta}\sum_{i\in S}\mathrm{tr}\big[X_i^T\big(\mathbf{X}\mathbf{X}^T\big)^{-1}X_i\big]
= C_\delta\Big[\max_{i\in S}X_i^T\big(\mathbf{X}\mathbf{X}^T\big)^{-1}X_i\Big]^{\delta}\sum_{i\in S}\mathrm{tr}\big[\big(\mathbf{X}\mathbf{X}^T\big)^{-1}X_iX_i^T\big]\\
&= C_\delta\Big[\max_{i\in S}X_i^T\big(\mathbf{X}\mathbf{X}^T\big)^{-1}X_i\Big]^{\delta}\mathrm{tr}\Big[\big(\mathbf{X}\mathbf{X}^T\big)^{-1}\sum_{i\in S}X_iX_i^T\Big]
= C_\delta\Big[\max_{i\in S}X_i^T\big(\mathbf{X}\mathbf{X}^T\big)^{-1}X_i\Big]^{\delta}\mathrm{tr}\big[\big(\mathbf{X}\mathbf{X}^T\big)^{-1}\mathbf{X}\mathbf{X}^T\big]\\
&= C_\delta\Big[\max_{i\in S}X_i^T\big(\mathbf{X}\mathbf{X}^T\big)^{-1}X_i\Big]^{\delta}\mathrm{tr}\,I_p
= p\,C_\delta\Big[\max_{i\in S}X_i^T\big(\mathbf{X}\mathbf{X}^T\big)^{-1}X_i\Big]^{\delta} \xrightarrow{P\mid\mathcal{F}_N} 0,
\end{aligned}$$
where $C_\delta > 0$ is a constant. By the definition of $W_n(\hat{\beta}_N)$ and $E(e_i\mid\mathcal{F}_N) = 0$, together with Theorem 1.14(ii) in [16], one obtains
$$\Big\|M_n(\hat{\beta}_N)^{-\frac12}\,W_n(\hat{\beta}_N)\,M_n(\hat{\beta}_N)^{-\frac12}\Big\| \xrightarrow{P\mid\mathcal{F}_N} 0.$$
Hence, Equation (A12) holds and the proof is completed. □
Proof of Theorem 3.
According to the mean value theorem, one has
$$0 = s_n(\hat{\beta}_n) = s_n(\hat{\beta}_N) + s_n'(\bar{\bar{\beta}})(\hat{\beta}_n - \hat{\beta}_N),$$
where $\bar{\bar{\beta}}$ lies between $\hat{\beta}_n$ and $\hat{\beta}_N$; then
$$n(\hat{\beta}_n - \hat{\beta}_N) = -n\,\big[s_n'(\bar{\bar{\beta}})\big]^{-1}s_n(\hat{\beta}_N).$$
Let $q_i(\hat{\beta}_N) = \frac{1}{\pi_i}\big[Y_i - \mu\big(\psi(\hat{\beta}_N^TX_i)\big)\big]\dot{\psi}(\hat{\beta}_N^TX_i)X_i$; then
$$\sum_{i\in S}q_i(\hat{\beta}_N) = \sum_{i\in S}\frac{1}{\pi_i}\big[Y_i - \mu\big(\psi(\hat{\beta}_N^TX_i)\big)\big]\dot{\psi}(\hat{\beta}_N^TX_i)X_i = s_n(\hat{\beta}_N).$$
Since $E(Y_i\mid\mathcal{F}_N) = \mu\big(\psi(\hat{\beta}_N^TX_i)\big)$ by Equation (4), one obtains
$$E\big(q_i(\hat{\beta}_N)\mid\mathcal{F}_N\big) = \frac{1}{\pi_i}\big[E(Y_i\mid\mathcal{F}_N) - \mu\big(\psi(\hat{\beta}_N^TX_i)\big)\big]\dot{\psi}(\hat{\beta}_N^TX_i)X_i = 0.$$
Applying the Lindeberg–Lévy CLT, one has
$$\frac{s_n(\hat{\beta}_N)}{\sqrt{n}} \xrightarrow{d} N\big(0, \mathrm{Var}(q_i(\hat{\beta}_N)\mid\mathcal{F}_N)\big),$$
where
$$\mathrm{Var}\big(q_i(\hat{\beta}_N)\mid\mathcal{F}_N\big) = E\big(q_i(\hat{\beta}_N)q_i(\hat{\beta}_N)^T\mid\mathcal{F}_N\big) = \sum_{i\in U}\frac{a_i}{\pi_i}\big[Y_i - \dot{b}\big(\psi(\hat{\beta}_N^TX_i)\big)\big]^2\big[\dot{\psi}(\hat{\beta}_N^TX_i)\big]^2X_iX_i^T.$$
Applying Theorem 2 of [26], one has
$$\frac{s_n'(\bar{\bar{\beta}})}{n} = \frac{1}{n}\sum_{i\in S}\frac{\partial q_i(\bar{\bar{\beta}})}{\partial\bar{\bar{\beta}}} \xrightarrow{a.s.} E\Big[\frac{\partial q_i(\bar{\bar{\beta}})}{\partial\bar{\bar{\beta}}}\,\Big|\,\mathcal{F}_N\Big],$$
where
$$E\Big[\frac{\partial q_i(\bar{\bar{\beta}})}{\partial\bar{\bar{\beta}}}\,\Big|\,\mathcal{F}_N\Big] = \sum_{i\in U}a_i\big[Y_i - \dot{b}\big(\psi(\bar{\bar{\beta}}^TX_i)\big)\big]\ddot{\psi}(\bar{\bar{\beta}}^TX_i)X_iX_i^T - \sum_{i\in U}a_i\,\ddot{b}\big(\psi(\bar{\bar{\beta}}^TX_i)\big)\big[\dot{\psi}(\bar{\bar{\beta}}^TX_i)\big]^2X_iX_i^T.$$
Since $\bar{\bar{\beta}}$ lies between $\hat{\beta}_n$ and $\hat{\beta}_N$, and $\hat{\beta}_n$ is consistent with $\hat{\beta}_N$ conditionally on $\mathcal{F}_N$ in probability, then
$$\frac{s_n'(\bar{\bar{\beta}})}{n} \xrightarrow{P\mid\mathcal{F}_N} E\Big[\frac{\partial q_i(\hat{\beta}_N)}{\partial\hat{\beta}_N}\,\Big|\,\mathcal{F}_N\Big],$$
where
$$E\Big[\frac{\partial q_i(\hat{\beta}_N)}{\partial\hat{\beta}_N}\,\Big|\,\mathcal{F}_N\Big] = \sum_{i\in U}a_i\big[Y_i - \dot{b}\big(\psi(\hat{\beta}_N^TX_i)\big)\big]\ddot{\psi}(\hat{\beta}_N^TX_i)X_iX_i^T - \sum_{i\in U}a_i\,\ddot{b}\big(\psi(\hat{\beta}_N^TX_i)\big)\big[\dot{\psi}(\hat{\beta}_N^TX_i)\big]^2X_iX_i^T.$$
Finally, combining Equations (A15)–(A17) by Slutsky's theorem, one obtains
$$\sqrt{n}\,(\hat{\beta}_n - \hat{\beta}_N) \xrightarrow{d} N(0, V_s),$$
where $V_s = \big[E\big(\partial q_i(\hat{\beta}_N)/\partial\hat{\beta}_N\mid\mathcal{F}_N\big)\big]^{-1}\,\mathrm{Var}\big(q_i(\hat{\beta}_N)\mid\mathcal{F}_N\big)\,\big[E\big(\partial q_i(\hat{\beta}_N)/\partial\hat{\beta}_N\mid\mathcal{F}_N\big)\big]^{-1} = \Sigma_N^{-1}V_N\Sigma_N^{-1}$. The proof is completed. □
Proof of Theorem 4.
Here, one needs to prove the consistency of $\hat{\beta}_n$ with respect to $\beta_0$, given the existence of $\hat{\beta}_n$; see [27].
Denote $p_\beta(X, y) := \exp\{y\psi(\beta^TX) - b(\psi(\beta^TX))\}$, $m_\beta(X, y) = \log p_\beta(X, y) := y\psi(\beta^TX) - b(\psi(\beta^TX))$, and $\tilde{\varphi}(\beta^TX) = \dot{b}\big[\psi(\beta^TX)\big]\dot{\psi}(\beta^TX)$. Then the negative K-L divergence in [28] is bounded:
$$\begin{aligned}
-D_{KL}(P_{\beta_0}\,\|\,P_\beta) &:= E_{\beta_0}(m_\beta - m_{\beta_0})\\
&= E\big\{(E_{\beta_0}[y\mid X])\,[\psi(\beta^TX) - \psi(\beta_0^TX)] - b(\psi(\beta^TX)) + b(\psi(\beta_0^TX))\big\}\\
&= E\big\{\dot{b}[\psi(\beta_0^TX)]\,[\psi(\beta^TX) - \psi(\beta_0^TX)] - b(\psi(\beta^TX)) + b(\psi(\beta_0^TX))\big\}\\
(\exists\,t_1\in[0,1])\quad &= E\big\{\dot{b}[\psi(\beta_0^TX)]\,[\psi(\beta^TX) - \psi(\beta_0^TX)] - \dot{b}\big[(1-t_1)\psi(\beta^TX) + t_1\psi(\beta_0^TX)\big]\,[\psi(\beta^TX) - \psi(\beta_0^TX)]\big\}\\
(\exists\,t_2\in[0,1])\quad &= E\big\{\ddot{b}\big[(1-t_2)\psi(\beta_0^TX) + (1-t_1)t_2\psi(\beta^TX) + t_1t_2\psi(\beta_0^TX)\big]\\
&\qquad\cdot\big[\psi(\beta_0^TX) - (1-t_1)\psi(\beta^TX) - t_1\psi(\beta_0^TX)\big]\,[\psi(\beta^TX) - \psi(\beta_0^TX)]\big\}\\
&= -(1-t_1)\,E\big\{\ddot{b}\big[(1-t_3)\psi(\beta_0^TX) + t_3\psi(\beta^TX)\big]\,[\psi(\beta^TX) - \psi(\beta_0^TX)]^2\big\}\\
(\exists\,t_4\in[0,1])\quad &= -(1-t_1)\,E\big\{\ddot{b}\big[(1-t_3)\psi(\beta_0^TX) + t_3\psi(\beta^TX)\big]\cdot\dot{\psi}\big[(1-t_4)(\beta_0^TX) + t_4(\beta^TX)\big]^2(\beta^TX - \beta_0^TX)^2\big\}\\
\text{(by (B.4) and (B.5))}\quad &\le -(1-t_1)\,C_1\,E(\beta^TX - \beta_0^TX)^2 = -(1-t_1)\,C_1\,(\beta - \beta_0)^T(EXX^T)(\beta - \beta_0)\\
&\le -(1-t_1)\,C_1\,\lambda_{\min}(EXX^T)\,\|\beta - \beta_0\|^2\\
\text{(by (B.1))}\quad &\le -(1-t_1)\,C_1C_2\,\|\beta - \beta_0\|^2,
\end{aligned}$$
where $t_3 = t_2 - t_1t_2 \in [0,1]$ and $C_2 > 0$. Then, for any $\varepsilon > 0$, one has the well-separation condition
$$\sup_{\|\beta - \beta_0\|_2 \ge \varepsilon} E_{\beta_0}m_\beta(X, y) < E_{\beta_0}m_{\beta_0}(X, y).$$
Let $\tilde{M}_n(\beta) := \frac{1}{n}\sum_{i=1}^n m_\beta(X_i, Y_i)$, which is essentially the log-likelihood function of the subsampled GLM, and $\hat{\beta}_n$ is its maximizer. Thus, one has the nearly-maximizing property $\tilde{M}_n(\hat{\beta}_n) \ge \tilde{M}_n(\beta_0) \ge \tilde{M}_n(\beta_0) - o_P(1)$.
Let $\mathcal{F} := \{m_\beta(X, y) = -y\psi(\beta^TX) + b(\psi(\beta^TX)),\ \beta\in\Theta\}$. Now one obtains
$$\begin{aligned}
\big|m_{\beta_1}(X, y) - m_{\beta_2}(X, y)\big| &= \big|{-y}\psi(\beta_1^TX) + b(\psi(\beta_1^TX)) + y\psi(\beta_2^TX) - b(\psi(\beta_2^TX))\big|\\
&= \big|y\psi(\beta_1^TX) - b(\psi(\beta_1^TX)) - y\psi(\beta_2^TX) + b(\psi(\beta_2^TX))\big|\\
&= \big\|y\,\dot{\psi}(\xi^{(5)T}X)(\beta_1^TX - \beta_2^TX)X - \dot{b}\big(\psi(\xi^{(6)T}X)\big)\dot{\psi}(\xi^{(6)T}X)(\beta_1^TX - \beta_2^TX)X\big\|\\
&\le C_4\,\big|y - \dot{b}\big(\psi(\xi^{(6)T}X)\big)\big|\cdot\big|\beta_1^TX - \beta_2^TX\big|\cdot\|X\|\\
&\le C_4\,\big|y - \dot{b}\big(\psi(\xi^{(6)T}X)\big)\big|\cdot\|X\|^2\cdot\|\beta_1 - \beta_2\|, \qquad \forall\,\beta_1, \beta_2\in\Theta,
\end{aligned}$$
where $\xi^{(5)}$ and $\xi^{(6)}$ both lie between $\beta_1$ and $\beta_2$, and $C_4 > 0$.
Let $\bar{m}(X, y) = \big|y - \dot{b}\big(\psi(\xi^{(6)T}X)\big)\big|\cdot\|X\|^2$; by (B.3), one has
$$\|\bar{m}(X, y)\|_{\tilde{P}, 1} := E_{\beta_0}\big|\bar{m}(X, Y)\big| \le E_{\beta_0}\sup_{\beta\in\Theta}\big[\big|y - \dot{b}\big(\psi(\beta^TX)\big)\big|\cdot\|X\|^2\big] < \infty,$$
where $\|\cdot\|_{\tilde{P}, 1} = \tilde{P}|\cdot|$ is the $L_1(\tilde{P})$-norm on pp. 269–270 of [11] and $\tilde{P} := E_{\beta_0}$. Then, from Example 19.7 in [11], one obtains
$$N_{[\,]}\big(\varepsilon, \mathcal{F}, L_1(E_{\beta_0})\big) \le K\Big(\frac{\mathrm{diam}\,\Theta}{\varepsilon/\|\bar{m}\|_{E_{\beta_0}, 1}}\Big)^p < \infty, \quad \text{for every } 0 < \varepsilon < \mathrm{diam}\,\Theta < \infty,$$
where $N_{[\,]}\big(\varepsilon, \mathcal{F}, L_1(E_{\beta_0})\big)$ is the bracketing number, i.e., the minimum number of $\varepsilon$-brackets needed to cover $\mathcal{F}$ (see p. 270 in [11]), $K$ is a constant, and $\mathrm{diam}\,\Theta = \sup_{\beta_1, \beta_2\in\Theta}\|\beta_1 - \beta_2\|$.
Therefore, the class $\mathcal{F}$ is P-Glivenko–Cantelli by Theorem 19.4 in [11]. From the definition of P-Glivenko–Cantelli on p. 269 of [11], we have
$$\sup_{\beta\in\Theta}\big|\tilde{M}_n(\beta) - E_{\beta_0}m_\beta(X, y)\big| \xrightarrow{a.s.} 0.$$
Finally, according to Theorem 5.7 in [11], we obtain $\hat{\beta}_n - \beta_0 = o_P(1)$. The proof is then completed. □
Recall (A7) and (A8), respectively; then $s_n'(\gamma) = R_n(\gamma) - M_n(\gamma)$. Let $\Phi = E(s_n'(\beta))$; then we have the following lemma.
Lemma A3.
For $\beta \in \mathbb{R}^p$, assume that
(E.1)
$R_n(\beta)$ is finite and nonsingular.
(E.2)
For $1 \le k, j \le p$,
$$E\Big[\sum_{i\in S}\frac{a_i}{\pi_i}\big[Y_i - \mu\big(\psi(\beta^TX_i)\big)\big]\ddot{\psi}(\beta^TX_i)\,x_{ik}x_{ij}\Big]^2 = o(1).$$
(E.3)
For $1 \le k, j \le p$,
$$\mathrm{Var}\Big[\sum_{i\in U}\frac{a_i}{\pi_i}\big[\dot{\psi}(\beta^TX_i)\big]^2\dot{\mu}\big(\psi(\beta^TX_i)\big)\,x_{ik}x_{ij}\Big] = o(1).$$
Then,
$$s_n'(\beta) \to \Phi.$$
Proof. 
One derives each entry of the matrix as
$$\big(s_n'(\beta)\big)_{kj} = \big(R_n(\beta)\big)_{kj} - \big(M_n(\beta)\big)_{kj} = \sum_{i\in S}\frac{1}{\pi_i}\big[Y_i - \mu\big(\psi(\beta^TX_i)\big)\big]\ddot{\psi}(\beta^TX_i)\,x_{ik}x_{ij} - \sum_{i\in S}\frac{1}{\pi_i}\big[\dot{\psi}(\beta^TX_i)\big]^2\dot{\mu}\big(\psi(\beta^TX_i)\big)\,x_{ik}x_{ij}.$$
By the definition of $\Phi$, one has
$$\Phi_{kj} = E\big(s_n'(\beta)\big)_{kj} = -E\Big[\sum_{i\in S}\frac{1}{\pi_i}\dot{\mu}\big(\psi(\beta^TX_i)\big)\big[\dot{\psi}(\beta^TX_i)\big]^2x_{ik}x_{ij}\Big].$$
Next, one obtains
$$\begin{aligned}
E\big|\big(s_n'(\beta)\big)_{kj} - \Phi_{kj}\big|^2 &= E\Big[\big|\big(s_n'(\beta)\big)_{kj} - \Phi_{kj}\big|^2\,\Big|\,(X_i, Y_i)_{i=1}^N\Big]
= E\Big[\big|\big(s_n'(\beta)\big)_{kj} - E\big(s_n'(\beta)\big)_{kj}\big|^2\,\Big|\,(X_i, Y_i)_{i=1}^N\Big]\\
&= E\Big[\big|\big(R_n(\beta)\big)_{kj} - \big(M_n(\beta)\big)_{kj} + E\big(M_n(\beta)\big)_{kj}\big|^2\,\Big|\,(X_i, Y_i)_{i=1}^N\Big]\\
&= E\Big[\big(R_n(\beta)\big)_{kj}^2 + \big|E\big(M_n(\beta)\big)_{kj} - \big(M_n(\beta)\big)_{kj}\big|^2\,\Big|\,(X_i, Y_i)_{i=1}^N\Big]
+ E\Big[2\big(R_n(\beta)\big)_{kj}\big(E\big(M_n(\beta)\big)_{kj} - \big(M_n(\beta)\big)_{kj}\big)\,\Big|\,(X_i, Y_i)_{i=1}^N\Big]\\
&= E\Big[\big(R_n(\beta)\big)_{kj}^2\,\Big|\,(X_i, Y_i)_{i=1}^N\Big] + \mathrm{Var}\Big[\big(M_n(\beta)\big)_{kj}\,\Big|\,(X_i, Y_i)_{i=1}^N\Big] = o(1),
\end{aligned}$$
where the first equality is based on the fact that, after conditioning on the $N$ data points, the $n$ repeated sampling steps are independent and identically distributed. The last equality holds by conditions (E.2) and (E.3). □
Lemma A4.
Under conditions (C.1)–(C.5) in Theorem 5, if $s_n(\hat{\beta}_n) = 0$ for all large $n$ and $\|\hat{\beta}_n - \beta_0\| = O_P(1/N)$, then
$$s_n(\beta_0) = -\Phi(\hat{\beta}_n - \beta_0) + o_P(1).$$
Proof. 
By Taylor's expansion,
$$0 = s_n(\hat{\beta}_n) = s_n(\beta_0) + s_n'(\beta_0)(\hat{\beta}_n - \beta_0) + \frac12(\hat{\beta}_n - \beta_0)^T\Sigma(\tilde{\beta}_n)(\hat{\beta}_n - \beta_0),$$
where $\Sigma(\tilde{\beta}_n) = \nabla^2 s_n(\tilde{\beta}_n)$ and $\tilde{\beta}_n$ lies between $\beta_0$ and $\hat{\beta}_n$. From assumptions (C.3), (C.4) and (C.5) in Theorem 5, we have
$$\big\|\Sigma(\tilde{\beta}_n)\big\| = \Big\|\sum_{i\in S}\frac{1}{\pi_i}\ddot{\phi}_\beta(X_i, Y_i)\Big\| \le \sum_{i\in S}\frac{1}{\pi_i}\cdot\big\|\ddot{\phi}_\beta(X_i, Y_i)\big\| = O(nN).$$
Then $\frac12(\hat{\beta}_n - \beta_0)^T\Sigma(\tilde{\beta}_n)(\hat{\beta}_n - \beta_0) = o_P(1)$. Therefore, by Lemma A3, one has
$$0 = s_n(\beta_0) + (\Phi + o(1))(\hat{\beta}_n - \beta_0) + o_P(1),$$
which implies
$$s_n(\beta_0) = -\Phi(\hat{\beta}_n - \beta_0) + o_P(1).$$
Hence, the proof is completed. □
Lemma A5.
$\{\bar{M}_i\}_{i=1}^n$ is a martingale difference sequence adapted to the filtration $\{\mathcal{F}_{N,i}\}_{i=1}^n$.
Proof. 
The $\bar{M}_i$'s are $\mathcal{F}_{N,i}$-measurable by the definition of $\bar{M}_i$ and the definition of the filtration $\{\mathcal{F}_{N,i}\}_{i=1}^n$. Then we obtain
$$\begin{aligned}
E[\bar{M}_i\mid\mathcal{F}_{N,i-1}] &= E\Big[\frac{1}{\pi_i}\phi_\beta(X_i, Y_i) - \sum_{j=1}^N\phi_\beta(X_j, Y_j)\,\Big|\,\mathcal{F}_{N,i-1}\Big]
= E\Big[\frac{1}{\pi_i}\phi_\beta(X_i, Y_i)\,\Big|\,\mathcal{F}_{N,i-1}\Big] - E\Big[\sum_{j=1}^N\phi_\beta(X_j, Y_j)\,\Big|\,\mathcal{F}_{N,i-1}\Big]\\
&= \frac{\sum_{i=1}^N\pi_i\frac{1}{\pi_i}\phi_\beta(X_i, Y_i)}{\sum_{i=1}^N\pi_i} - \frac{\sum_{i=1}^N\pi_i\sum_{j=1}^N\phi_\beta(X_j, Y_j)}{\sum_{i=1}^N\pi_i}
= \sum_{i=1}^N\phi_\beta(X_i, Y_i) - \sum_{j=1}^N\phi_\beta(X_j, Y_j) = 0.
\end{aligned}$$
By the definition of a martingale difference sequence on p. 230 of [29], the proof is completed. □
Under the definitions of $T$, $\bar{M}$, and $Q$, it is obvious that $\mathrm{Var}(T) = \mathrm{Var}(\bar{M}) + \mathrm{Var}(Q)$.
Lemma A6.
$\sup_N \lambda_{\max}(B_N) \le 1$.
Proof. 
By the symmetry of $B_N$, we only need to show that, for any $N$, $I - B_N$ is positive definite. Indeed,
$$I - B_N = \mathrm{Var}^{-\frac12}(T)\,\big(\mathrm{Var}(T) - \mathrm{Var}(\bar{M})\big)\,\mathrm{Var}^{-\frac12}(T) = \mathrm{Var}^{-\frac12}(T)\,\mathrm{Var}(Q)\,\mathrm{Var}^{-\frac12}(T).$$
Therefore, $I - B_N$ is congruent to the positive definite matrix $\mathrm{Var}(Q)$. The proof is completed. □
Lemma A7
(Multivariate version of the martingale CLT, Lemma 4 in [19]). For $k = 1, 2, 3, \ldots$, let $\{\xi_{ki};\ i = 1, 2, \ldots, N_k\}$ be a martingale difference sequence in $\mathbb{R}^p$ relative to the filtration $\{\mathcal{F}_{ki};\ i = 0, 1, \ldots, N_k\}$, and let $Y_k \in \mathbb{R}^p$ be an $\mathcal{F}_{k0}$-measurable random vector. Set $S_k = \sum_{i=1}^{N_k}\xi_{ki}$. Assume that
(F.1)
$\lim_{k\to\infty}\sum_{i=1}^{N_k}E\big[\|\xi_{ki}\|^4\big] = 0$;
(F.2)
$\lim_{k\to\infty}E\Big\|\sum_{i=1}^{N_k}E\big[\xi_{ki}\xi_{ki}^T\mid\mathcal{F}_{k,i-1}\big] - B_k\Big\|^2 = 0$ for some sequence of positive definite matrices $\{B_k\}_{k=1}^\infty$ with $\sup_k\lambda_{\max}(B_k) < \infty$, i.e., the largest eigenvalue is uniformly bounded;
(F.3)
For some probability distribution $L_0$, where $*$ denotes convolution and $\mathcal{L}(\cdot)$ denotes the law of a random variate,
$$\mathcal{L}(Y_k) * N(0, B_k) \xrightarrow{d} L_0.$$
Then
$$\mathcal{L}(Y_k + S_k) \xrightarrow{d} L_0.$$
Lemma A8
(Asymptotic normality of $s_n(\beta_0)$). Assume that
(G.1)
$\lim_{N\to\infty}\sum_{i=1}^n E\big[\|\xi_{Ni}\|^4\big] = 0$;
(G.2)
$\lim_{N\to\infty}E\Big\|\sum_{i=1}^n E\big[\xi_{Ni}\xi_{Ni}^T\mid\mathcal{F}_{N,i-1}\big] - B_N\Big\|^2 = 0$.
Then
$$\mathrm{Var}^{-\frac12}(T)\cdot T \xrightarrow{d} N(0, I_p).$$
Proof. 
The quantities in Lemma A7 are taken as
$$\xi_{ki} = \xi_{Ni}, \quad Y_k = \mathrm{Var}^{-\frac12}(T)\cdot Q, \quad B_k = B_N, \quad L_0 \equiv N(0, I_p).$$
By Lemma A5 together with conditions (G.1) and (G.2), conditions (F.1) and (F.2) of Lemma A7 are satisfied. Next we only need to show that the third condition of Lemma A7 holds. According to the central limit theorem, we have
$$\mathrm{Var}^{-\frac12}(Q)\cdot Q \xrightarrow{d} N(0, I_p).$$
For any $t\in\mathbb{R}^p$, let $\tilde{t} = \mathrm{Var}^{-\frac12}(T)\,t$ and $\tilde{X} = \mathrm{i}Q$, where $\mathrm{i}$ is the imaginary unit. Since the properties of the complex multivariate normal distribution are equivalent to those of the real multivariate normal distribution (p. 222 of [30]), and $EQ = 0$, one has
$$\mathrm{Var}(\tilde{X}) = \mathrm{Var}(\mathrm{i}Q) = E\big[(\mathrm{i}Q)(\mathrm{i}Q)\big] - \big[E(\mathrm{i}Q)\big]^2 = -EQ^2 = -\big[EQ^2 - (EQ)^2\big] = -\mathrm{Var}(Q).$$
Thus, according to Equations (45.4)–(45.6) on p. 108 of [30], one has
$$E\big[e^{\tilde{t}^T\tilde{X}}\big] = e^{\tilde{t}^TE(\tilde{X}) + \frac12\tilde{t}^T\mathrm{Var}(\tilde{X})\tilde{t}} = e^{-\frac12 t^T\mathrm{Var}^{-\frac12}(T)\,\mathrm{Var}(Q)\,\mathrm{Var}^{-\frac12}(T)\,t}.$$
Further, we obtain
$$E\big[e^{\mathrm{i}t^T\mathrm{Var}^{-\frac12}(T)Q}\big]\cdot e^{-\frac12 t^T\mathrm{Var}^{-\frac12}(T)\,\mathrm{Var}(\bar{M})\,\mathrm{Var}^{-\frac12}(T)\,t} = e^{-\frac12 t^Tt}.$$
Therefore, condition (F.3) in Lemma A7 is verified. Then one obtains
$$\mathrm{Var}^{-\frac12}(T)\,T = \mathrm{Var}^{-\frac12}(T)\cdot Q + \mathrm{Var}^{-\frac12}(T)\cdot\bar{M} \xrightarrow{d} N(0, I_p).$$
The proof is completed. □
Proof of Theorem 5.
According to Lemma A4,
$$-\Phi(\hat{\beta}_n - \beta_0) + o_P(1) = s_n(\beta_0) = T.$$
Multiplying (A18) by $\mathrm{Var}^{-\frac12}(T)$, one obtains
$$-\mathrm{Var}^{-\frac12}(T)\,\Phi(\hat{\beta}_n - \beta_0) + o_P\big(\|\mathrm{Var}^{-\frac12}(T)\|\big) = \mathrm{Var}^{-\frac12}(T)\,T.$$
Applying Lemma A8, together with the symmetry of the centered normal distribution, one obtains
$$\mathrm{Var}^{-\frac12}(T)\,\Phi(\hat{\beta}_n - \beta_0) \xrightarrow{d} N(0, I_p).$$
The proof is completed. □

References

1. Xi, R.; Lin, N. Direct regression modelling of high-order moments in big data. Stat. Its Interface 2016, 9, 445–452.
2. Tewes, J.; Politis, D.N.; Nordman, D.J. Convolved subsampling estimation with applications to block bootstrap. Ann. Stat. 2019, 47, 468–496.
3. Yu, J.; Wang, H.; Ai, M.; Zhang, H. Optimal distributed subsampling for maximum quasi-likelihood estimators with massive data. J. Am. Stat. Assoc. 2022, 117, 265–276.
4. Yao, Y.; Wang, H. A review on optimal subsampling methods for massive datasets. J. Data Sci. 2021, 19, 151–172.
5. Yu, J.; Wang, H. Subdata selection algorithm for linear model discrimination. Stat. Pap. 2021, 63, 1883–1906.
6. Fu, S.; Chen, P.; Liu, Y.; Ye, Z. Simplex-based Multinomial Logistic Regression with Diverging Numbers of Categories and Covariates. Stat. Sin. 2022, in press.
7. Ma, J.; Xu, J.; Maleki, A. Analysis of sensing spectral for signal recovery under a generalized linear model. Adv. Neural Inf. Process. Syst. 2021, 34, 22601–22613.
8. Mahmood, T. Generalized linear model based monitoring methods for high-yield processes. Qual. Reliab. Eng. Int. 2020, 36, 1570–1591.
9. Ai, M.; Yu, J.; Zhang, H.; Wang, H. Optimal Subsampling Algorithms for Big Data Regressions. Stat. Sin. 2021, 31, 749–772.
10. Wang, H.; Zhu, R.; Ma, P. Optimal subsampling for large sample logistic regression. J. Am. Stat. Assoc. 2018, 113, 829–844.
11. van der Vaart, A.W. Asymptotic Statistics; Cambridge University Press: London, UK, 1998.
12. Wooldridge, J.M. Inverse probability weighted M-estimators for sample selection, attrition, and stratification. Port. Econ. J. 2002, 1, 117–139.
13. Durret, R. Probability: Theory and Examples, 5th ed.; Cambridge University Press: Cambridge, UK, 2019.
14. McCullagh, P.; Nelder, J. Generalized Linear Models, 2nd ed.; Chapman and Hall/CRC: Boca Raton, FL, USA, 1989.
15. Fahrmeir, L.; Kaufmann, H. Consistency and asymptotic normality of the maximum likelihood estimator in generalized linear models. Ann. Stat. 1985, 13, 342–368.
16. Shao, J. Mathematical Statistics, 2nd ed.; Springer: New York, NY, USA, 2003.
17. Yin, C.; Zhao, L.; Wei, C. Asymptotic normality and strong consistency of maximum quasi-likelihood estimates in generalized linear models. Sci. China Ser. A 2006, 49, 145–157.
18. Rigollet, P. Kullback-Leibler aggregation and misspecified generalized linear models. Ann. Stat. 2012, 40, 639–665.
19. Zhang, T.; Ning, Y.; Ruppert, D. Optimal sampling for generalized linear models under measurement constraints. J. Comput. Graph. Stat. 2021, 30, 106–114.
20. Ohlsson, E. Asymptotic normality for two-stage sampling from a finite population. Probab. Theory Relat. Fields 1989, 81, 341–352.
21. Zhang, H.; Wei, H. Sharper Sub-Weibull Concentrations. Mathematics 2022, 10, 2252.
22. Gong, T.; Dong, Y.; Chen, H.; Dong, B.; Li, C. Markov Subsampling Based on Huber Criterion. IEEE Trans. Neural Netw. Learn. Syst. 2022, in press.
23. Xiao, Y.; Yan, T.; Zhang, H.; Zhang, Y. Oracle inequalities for weighted group lasso in high-dimensional misspecified Cox models. J. Inequalities Appl. 2020, 2020, 252.
24. Zhang, H.; Jia, J. Elastic-net regularized high-dimensional negative binomial regression: Consistency and weak signals detection. Stat. Sin. 2022, 32, 181–207.
25. Ding, J.L.; Chen, X.R. Large-sample theory for generalized linear models with non-natural link and random variates. Acta Math. Appl. Sin. 2006, 22, 115–126.
26. Jennrich, R.I. Asymptotic properties of non-linear least squares estimators. Ann. Math. Stat. 1969, 40, 633–643.
27. White, H. Maximum likelihood estimation of misspecified models. Econom. J. Econom. Soc. 1982, 50, 1–25.
28. Kullback, S.; Leibler, R.A. On information and sufficiency. Ann. Math. Stat. 1951, 22, 79–86.
29. Davidson, J. Stochastic Limit Theory: An Introduction for Econometricians; OUP Oxford: Oxford, UK, 1994.
30. Kotz, S.; Balakrishnan, N.; Johnson, N.L. Continuous Multivariate Distributions, Volume 1: Models and Applications, 2nd ed.; John Wiley & Sons: Hoboken, NJ, USA, 2000.