Article

Huber Regression Analysis with a Semi-Supervised Method

Yue Wang, Baobin Wang, Chaoquan Peng, Xuefeng Li and Hong Yin
1 School of Mathematics and Statistics, South-Central Minzu University, Wuhan 430074, China
2 School of Mathematics, Renmin University of China, Beijing 100872, China
* Author to whom correspondence should be addressed.
Submission received: 27 August 2022 / Revised: 5 October 2022 / Accepted: 8 October 2022 / Published: 11 October 2022
(This article belongs to the Special Issue Distribution Theory and Application)

Abstract

In this paper, we study the regularized Huber regression algorithm in a reproducing kernel Hilbert space (RKHS), which is applicable to both fully supervised and semi-supervised learning schemes. Our focus in this work is twofold. First, we provide the convergence properties of the algorithm with fully supervised data and establish optimal convergence rates in the minimax sense when the regression function lies in the RKHS. Second, we improve the learning performance of the Huber regression algorithm by a semi-supervised method: we show that, with sufficient unlabeled data, the minimax optimal rates can be retained even if the regression function lies outside the RKHS.

1. Introduction

Ordinary least squares (OLS) is an important statistical tool in regression analysis. However, OLS does not perform well when the data are contaminated by outliers or heavy-tailed noise. Thus, OLS is suboptimal for robust regression analysis, and a variety of robust loss functions that are less easily affected by noise have been developed. Among them, the Huber loss function is a popular choice in statistics, machine learning, and optimization, since it is less sensitive to outliers and can address heavy-tailed errors effectively. Huber regression was initiated by Peter Huber in his seminal work [1,2]. Statistical bounds and convergence properties for Huber estimation and inference have been further investigated in subsequent works; see, e.g., [3,4,5,6,7,8,9].
Semi-supervised learning has been gaining increased attention as an active research area in science and engineering. The original idea of semi-supervised methods dates back to self-learning in the context of classification [10] and was then developed further in decision-directed learning, co-training for text classification, and manifold learning [11,12,13]. Most existing research on Huber regression is set in the supervised framework: unlabeled data were deemed useless and thus thrown away in the design of algorithms. Recently, a vast literature has shown that utilizing the additional information in unlabeled data can effectively improve the learning performance of algorithms; see, e.g., [14,15,16,17,18]. In this paper, we focus on the performance of the Huber regression algorithm with unlabeled data. By a semi-supervised method, we find that optimal learning rates are available if sufficient unlabeled data are added to the Huber regression analysis.
In the standard framework of statistical learning, we let the explanatory variable $X$ take values in a compact domain $\mathcal{X}$ in a Euclidean space, and the response variable $Y$ take values in the output space $\mathcal{Y}\subseteq\mathbb{R}$. This work investigates the application of the Huber loss linked to the following regression model:
$$Y = f^*(X) + \epsilon,$$
where $f^*$ is the regression function and $\epsilon$ is the noise in the regression model. Let $\rho$ be a Borel probability measure on the product space $\mathcal{Z}=\mathcal{X}\times\mathcal{Y}$. Let $\rho_X$ and $\rho(y|x)$ denote the marginal distribution of $\rho$ on $\mathcal{X}$ and the conditional distribution on $\mathcal{Y}$ given $x\in\mathcal{X}$, respectively. In the supervised learning setting, $\rho$ is assumed to be unknown, and the purpose of regression is to estimate $f^*$ according to a sample $D=\{(x_i,y_i)\}_{i=1}^{N}$ drawn independently from $\rho$, where $N$ is the sample size, the cardinality of $D$. The Huber loss function $\ell_\sigma(\cdot)$ is defined as
$$\ell_\sigma(u) = \begin{cases} u^2, & \text{if } |u|\le\sigma, \\ 2\sigma|u|-\sigma^2, & \text{if } |u|>\sigma, \end{cases}$$
where $\sigma>0$ is a robustification parameter. Given a prediction function $f:\mathcal{X}\to\mathcal{Y}$, Huber regression searches for a good approximation of $f^*$ by minimizing the empirical prediction error with the Huber loss
$$\mathcal{E}_D(f) := \frac{1}{N}\sum_{i=1}^{N}\ell_\sigma\big(y_i - f(x_i)\big) \tag{1}$$
over a suitable hypothesis space.
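To make the definitions concrete, here is a minimal NumPy sketch of the Huber loss $\ell_\sigma$ and the empirical risk (1); the function names are ours and not part of the paper.

```python
import numpy as np

def huber_loss(u, sigma):
    """Huber loss l_sigma(u): u^2 for |u| <= sigma, 2*sigma*|u| - sigma^2 otherwise."""
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= sigma, u ** 2, 2.0 * sigma * np.abs(u) - sigma ** 2)

def empirical_risk(f, X, y, sigma):
    """Empirical prediction error E_D(f) = (1/N) * sum_i l_sigma(y_i - f(x_i))."""
    return float(np.mean(huber_loss(y - f(X), sigma)))
```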
In this work, we study the kernel-based Huber regression algorithm, where the minimization of (1) is performed in a reproducing kernel Hilbert space (RKHS) [19]. Recall that $K:\mathcal{X}\times\mathcal{X}\to\mathbb{R}$ is a Mercer kernel if it is continuous, symmetric, and positive semidefinite. The RKHS $\mathcal{H}_K$ is the completion of the linear span of the function set $\{K_x = K(x,\cdot):x\in\mathcal{X}\}$ with the inner product induced by $\langle K_x,K_y\rangle_K = K(x,y)$. The reproducing property is given by $f(x)=\langle f,K_x\rangle_K$. Note that, by the Cauchy–Schwarz inequality and [19],
$$\|f\|_\infty = \sup_{x\in\mathcal{X}}\big|\langle f,K_x\rangle_K\big| \le \sup_{x\in\mathcal{X}}\|f\|_K\|K_x\|_K = \sup_{x\in\mathcal{X}}\sqrt{K(x,x)}\,\|f\|_K.$$
To avoid overfitting, the regularized Huber regression algorithm in the RKHS $\mathcal{H}_K$ is given as
$$f_{D,\lambda} = \arg\min_{f\in\mathcal{H}_K}\Big\{\mathcal{E}_D(f) + \lambda\|f\|_K^2\Big\}, \tag{2}$$
where λ > 0 is a regularization parameter.
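The paper does not prescribe a solver for (2). One standard way to compute $f_{D,\lambda}$ in practice is to invoke the representer theorem, $f=\sum_i\alpha_iK_{x_i}$, and solve the resulting finite-dimensional problem by iteratively reweighted least squares (IRLS); the weights $w_i=\min\{1,\sigma/|r_i|\}$ come from $\ell_\sigma'(u)=2u\min\{1,\sigma/|u|\}$, so the fixed point satisfies the same first-order condition used later in the proof of Lemma 1. The sketch below, with a Gaussian kernel and our own function names, is only an illustration under these assumptions, not the authors' implementation.

```python
import numpy as np

def gaussian_kernel(X1, X2, h=1.0):
    """Gaussian kernel K(x, u) = exp(-|x - u|^2 / (2 h^2)) for row-wise inputs."""
    d2 = np.sum(X1**2, axis=1)[:, None] + np.sum(X2**2, axis=1)[None, :] - 2.0 * X1 @ X2.T
    return np.exp(-d2 / (2.0 * h**2))

def kernel_huber_irls(X, y, sigma, lam, h=1.0, n_iter=100, tol=1e-9):
    """Regularized Huber regression (2) via the representer theorem + IRLS.
    X: (N, d) inputs, y: (N,) responses. Returns the Gram matrix and coefficients alpha,
    so that predictions at new points Xnew are gaussian_kernel(Xnew, X, h) @ alpha."""
    N = len(y)
    K = gaussian_kernel(X, X, h)
    alpha = np.zeros(N)
    for _ in range(n_iter):
        r = y - K @ alpha                                   # current residuals
        w = np.minimum(1.0, sigma / np.maximum(np.abs(r), 1e-12))
        # stationarity of (2): (1/N) * sum_i w_i (y_i - f(x_i)) K_{x_i} = lam * f
        alpha_new = np.linalg.solve(w[:, None] * K + N * lam * np.eye(N), w * y)
        if np.max(np.abs(alpha_new - alpha)) < tol:
            return K, alpha_new
        alpha = alpha_new
    return K, alpha
```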
In this paper, we derive the explicit learning rate of Algorithm (2) in the supervised learning setting, which is comparable to the minimax optimal rate of OLS. By a semi-supervised method, we show that utilizing unlabeled data can overcome the bottleneck that optimal learning rates for Algorithm (2) are achievable only when $f^*$ lies in $\mathcal{H}_K$.

2. Assumptions and Main Results

To present our main results, we introduce some necessary assumptions. In this section, we study the convergence of $f_{D,\lambda}$ to $f^*$ in the square integrable space $(L^2_{\rho_X},\|\cdot\|_\rho)$.
Below, we elaborate on three important assumptions to carry out the analysis. The first assumption (3) is about the regularity of the regression function $f^*$. Define the integral operator $L_K:L^2_{\rho_X}\to L^2_{\rho_X}$ associated with the kernel $K$ by
$$L_K f := \int_{\mathcal{X}} f(x)K_x\,d\rho_X(x), \qquad f\in L^2_{\rho_X}.$$
Since $K$ is a Mercer kernel on the compact domain $\mathcal{X}$, $L_K$ is compact and positive. Thus, $L_K^r$, the $r$-th power of $L_K$ for $r>0$, is well defined [20]. Our error bounds are stated in terms of the regularity of $f^*$, given by
$$f^* = L_K^r(h), \qquad \text{for some } r>0 \text{ and } h\in L^2_{\rho_X}. \tag{3}$$
The condition (3) characterizes the regularity of $f^*$ and is directly related to the smoothness of $f^*$ when $\mathcal{H}_K$ is a Sobolev space. If (3) holds with $r\ge\frac12$, then $f^*$ lies in the space $\mathcal{H}_K$ [21].
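As a quick check of the last claim (a standard argument, sketched here using the fact that $\|L_K^{1/2}g\|_K\le\|g\|_\rho$ for $g\in L^2_{\rho_X}$): if (3) holds with $r\ge\frac12$, then
$$f^* = L_K^r h = L_K^{1/2}\big(L_K^{\,r-\frac12}h\big), \qquad \|f^*\|_K \le \big\|L_K^{\,r-\frac12}h\big\|_\rho \le \|L_K\|^{\,r-\frac12}\,\|h\|_\rho < \infty,$$
so $f^*$ indeed belongs to $\mathcal{H}_K$.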
The second assumption (4) is about the capacity of $\mathcal{H}_K$, measured by the effective dimension [22,23,24]
$$\mathcal{N}(\lambda) = \mathrm{Trace}\big((L_K+\lambda I)^{-1}L_K\big), \qquad \lambda>0,$$
where $I$ is the identity operator on $\mathcal{H}_K$. In this paper, we assume that
$$\mathcal{N}(\lambda) \le C\lambda^{-s} \qquad \text{for some } C>0,\ 0<s\le1. \tag{4}$$
This condition measures the complexity of $\mathcal{H}_K$ with respect to the marginal distribution $\rho_X$ and is typical in the analysis of kernel-based estimators. It is always satisfied with $s=1$ by taking the constant $C=\mathrm{Trace}(L_K)$. When $\mathcal{H}_K$ is a Sobolev space $W^\alpha(\mathcal{X})$, $\mathcal{X}\subset\mathbb{R}^n$, of functions with all derivatives of order up to $\alpha>\frac n2$, then (4) is satisfied with $s=\frac{n}{2\alpha}$ [25]. When $0<s<1$, (4) is weaker than the eigenvalue decay assumptions in the literature [17,23].
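In practice, the effective dimension can be approximated from a sample: the eigenvalues of the normalized Gram matrix on points drawn from $\rho_X$ approximate those of $L_K$, so $\mathcal{N}(\lambda)\approx\sum_i\mu_i/(\mu_i+\lambda)$. The following plug-in estimator is our illustration and is not part of the paper.

```python
import numpy as np

def effective_dimension(K_gram, lam):
    """Plug-in estimate of N(lambda) = Trace((L_K + lam I)^{-1} L_K), using the
    eigenvalues of K_gram / N as a proxy for the spectrum of the integral operator L_K."""
    mu = np.linalg.eigvalsh(K_gram / K_gram.shape[0])
    mu = np.clip(mu, 0.0, None)      # guard against small negative eigenvalues from round-off
    return float(np.sum(mu / (mu + lam)))
```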
The third assumption is about the conditional probability distribution $\rho(y|x)$ on the output space $\mathcal{Y}$. We assume that the output variable $Y$ satisfies the following moment condition: there exist two positive constants $t,M>0$ such that, for any integer $q\ge2$,
$$\mathbb{E}\big[\,|Y|^q\mid X\,\big] \le \frac12\,q!\,t^2M^{q-2}. \tag{5}$$
The assumption (5) covers many common distributions, for example, Gaussian and sub-Gaussian distributions, as well as distributions with compact support [26].
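For instance (an illustration, not stated in the paper), if $|Y|\le B$ almost surely, then (5) holds with $M=B$ and $t^2=2B^2$, since for every integer $q\ge2$,
$$\mathbb{E}\big[\,|Y|^q\mid X\,\big] \le B^q \le q!\,B^q = \frac12\,q!\,(2B^2)\,B^{q-2}.$$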
Now, we are ready to present the main results of this paper. Without loss of generality, we assume $\sup_{x\in\mathcal{X}}K(x,x)=1$.

2.1. Convergence in the Supervised Learning

The following error estimate for Algorithm (2) is the first result of this section, which presents the convergence of Huber regression with fully supervised data and will be proved in Section 3.
Theorem 1.
Define $f_{D,\lambda}$ by Algorithm (2) with the fully supervised data set $D=\{(x_i,y_i)\}_{i=1}^N$. Suppose that (3) holds for some $r>0$, and that (4) and (5) hold. If
$$\lambda = \begin{cases} N^{-\frac{1}{1+s}}, & \text{for } 0<r<\frac12,\\[2pt] N^{-\frac{1}{s+2\min\{1,r\}}}, & \text{for } r\ge\frac12, \end{cases} \tag{6}$$
then, for any $0<\delta<1$, with probability at least $1-\delta$,
$$\|f_{D,\lambda}-f^*\|_\rho \le C_1\max\Big\{\lambda^{\min\{r,1\}},\ \sigma^{-1}\lambda^{-\frac32}(\log N)^4\Big\}\left(\log\frac8\delta\right)^4, \tag{7}$$
where $C_1$ is a constant independent of $N$, $\delta$, and $\sigma$.
The above theorem shows that the parameter $\sigma$ in the Huber loss $\ell_\sigma$ balances the robustness of Algorithm (2) against its convergence rate. When the Huber loss is employed in nonparametric regression problems, enhanced robustness comes at the price of a slower convergence rate of Algorithm (2), so a trade-off has to be struck. It is then straightforward to obtain the following corollary, which provides explicit learning rates for (2) under a suitable choice of $\sigma$.
Corollary 1.
Under the same conditions as Theorem 1, if $\sigma\ge\lambda^{-(r+\frac32)}(\log N)^4$, then, with probability at least $1-\delta$,
$$\|f_{D,\lambda}-f^*\|_\rho = \begin{cases} O\!\left(N^{-\frac{r}{s+1}}\big(\log\frac8\delta\big)^4\right), & \text{for } 0<r<\frac12,\\[4pt] O\!\left(N^{-\min\left\{\frac{r}{s+2r},\,\frac{1}{s+2}\right\}}\big(\log\frac8\delta\big)^4\right), & \text{for } r\ge\frac12. \end{cases}$$
Remark 1.
The above corollary tells us that, when $\frac12\le r\le1$, Algorithm (2) achieves the error rate $O\big(N^{-\frac{r}{2r+s}}\big)$, which coincides with the minimax lower bound proved in [23,25] and is therefore optimal. We also notice that the convergence rate cannot improve further when $r>1$. This is referred to as the saturation phenomenon, which has been observed in a vast amount of literature [20,22,25].
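For concreteness, the following small helper evaluates the parameter choice (6), a value of $\sigma$ of the order required in Corollary 1, and the resulting rate exponent; it is a sketch for orientation only, with names of our choosing.

```python
import math

def theoretical_parameters(N, r, s):
    """Evaluate lambda from (6), sigma of order lambda^{-(r+3/2)} (log N)^4 (Corollary 1),
    and the exponent gamma such that the error rate is O(N^{-gamma})."""
    if r < 0.5:
        lam = N ** (-1.0 / (1.0 + s))
        rate_exp = r / (s + 1.0)
    else:
        lam = N ** (-1.0 / (s + 2.0 * min(1.0, r)))
        rate_exp = min(r / (s + 2.0 * r), 1.0 / (s + 2.0))
    sigma = lam ** (-(r + 1.5)) * math.log(N) ** 4
    return lam, sigma, rate_exp

# Example: N = 500, r = 0.5, s = 0.5 gives rate exponent 1/3, i.e. an error of order N^(-1/3).
```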

2.2. Convergence in the Semi-Supervised Learning

Although optimal convergence rates of Algorithm (2) were deduced in the previous subsection when $f^*$ lies in $\mathcal{H}_K$ ($r\ge\frac12$), the error rate for the case $0<r<\frac12$ needs improvement. In this subsection, we study the influence of unlabeled data on the convergence of (2) by using semi-supervised data.
Let an unlabeled data set $\tilde D(x)=\{\tilde x_i\}_{i=1}^{\tilde N}$ be drawn independently according to the marginal distribution $\rho_X$, where $\tilde N$ is the cardinality of $\tilde D(x)$. With the fully supervised data set $D=\{(x_i,y_i)\}_{i=1}^N$, we then introduce the data set associated with the semi-supervised Huber regression problem as $D^*=\{(x_i^*,y_i^*)\}_{i=1}^{N+\tilde N}$, given by
$$(x_i^*,y_i^*) = \begin{cases} \Big(x_i,\ \frac{N+\tilde N}{N}\,y_i\Big), & \text{for } 1\le i\le N,\\[4pt] (\tilde x_{i-N},\ 0), & \text{for } N+1\le i\le N+\tilde N. \end{cases} \tag{9}$$
By replacing D with D * in Algorithm (2), we then obtain the output function f D * , λ with semi-supervised data D * . The enhanced convergence results are as follows.
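The construction (9) is straightforward to implement: labeled responses are rescaled by $(N+\tilde N)/N$ and unlabeled points receive the pseudo-response $0$; this rescaling is exactly what keeps $\hat f_{\rho,D^*}=\hat f_{\rho,D}$ in the proof of Theorem 2. A minimal sketch (the function name is ours):

```python
import numpy as np

def build_semisupervised_set(X_lab, y_lab, X_unlab):
    """Construct D* of (9): (x_i, (N + Ntilde)/N * y_i) for labeled points,
    (x_tilde_{i-N}, 0) for unlabeled points."""
    N, Nt = len(y_lab), len(X_unlab)
    scale = (N + Nt) / N
    X_star = np.vstack([X_lab, X_unlab])
    y_star = np.concatenate([scale * np.asarray(y_lab, dtype=float), np.zeros(Nt)])
    return X_star, y_star
```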
Theorem 2.
Suppose that (3), (4), and (5) hold with $0<r\le1$, $r+s>\frac12$, and $\tilde N\ge\max\big\{N^{\frac{s+1}{2r+s}}-N+1,\ 1\big\}$. If $\lambda=N^{-\frac{1}{2r+s}}$, then, with probability at least $1-\delta$,
$$\|f_{D^*,\lambda}-f^*\|_\rho \le C_2\max\left\{N^{-\frac{r}{2r+s}},\ \Delta_{N,\tilde N,\lambda}\,\frac{(\log N)^4}{\sqrt\lambda\,\sigma}\right\}\left(\log\frac8\delta\right)^4, \tag{10}$$
where
$$\Delta_{N,\tilde N,\lambda} = \frac{N+\tilde N}{\lambda N} + \left(\frac{N+\tilde N}{N}\right)^2$$
and $C_2$ is a constant independent of $N$, $\tilde N$, $\sigma$, and $\delta$.
Based on the theorem above, we can obtain the improved convergence rate as follows.
Corollary 2.
Under the same conditions as Theorem 2, if
$$\sigma \ge N^{\frac{2r+1}{2(2r+s)}}\,\Delta_{N,\tilde N,\lambda}\,(\log N)^4,$$
then, with probability at least $1-\delta$,
$$\|f_{D^*,\lambda}-f^*\|_\rho = O\!\left(N^{-\frac{r}{2r+s}}\left(\log\frac8\delta\right)^4\right).$$
Remark 2.
Corollary 1 shows that, provided no unlabeled data are involved, the minimax optimal convergence rate for (2) is obtained only in the situation $r\ge\frac12$. When $0<r<\frac12$, the rate reduces to $O\big(N^{-\frac{r}{s+1}}\big)$. This means that the regression function $f^*$ is assumed to belong to $\mathcal{H}_K$ in order to achieve the optimal rate, which is difficult to verify in practice. In contrast, Corollary 2 tells us that, with sufficient unlabeled data $\tilde D(x)$ engaged in Algorithm (2), the minimax optimal rate $O\big(N^{-\frac{r}{2r+s}}\big)$ is retained for all $0<r\le1$. This removes the strict regularity condition on $f^*$.

3. Proofs

Now, we are in a position to prove the results stated in Section 2.

3.1. Useful Estimates

First, we will estimate the bound of $f_{D,\lambda}$ defined by (2). In the sequel, for notational simplicity, let $z=(x,y)$ and define the empirical operator $L_{K,D}:\mathcal{H}_K\to\mathcal{H}_K$ by
$$L_{K,D} := \frac1N\sum_{i=1}^N\langle\cdot,K_{x_i}\rangle_K\,K_{x_i}, \qquad z_i=(x_i,y_i)\in D,$$
so, for any $f\in\mathcal{H}_K$, $L_{K,D}f = \frac1N\sum_{i=1}^N f(x_i)K_{x_i}$. Then, we have the following representation for $f_{D,\lambda}$.
Lemma 1.
Define $f_{D,\lambda}$ by (2). Then, it satisfies
$$f_{D,\lambda} = (L_{K,D}+\lambda I)^{-1}\hat f_{\rho,D} + (L_{K,D}+\lambda I)^{-1}W_{D,\lambda}, \tag{13}$$
where
$$\hat f_{\rho,D} = \frac1N\sum_{i=1}^N y_iK_{x_i}, \qquad z_i=(x_i,y_i)\in D,$$
and
$$W_{D,\lambda} = \frac1N\sum_{i=1}^N\left[G'_+(0) - G'_+\!\left(\frac{\big(f_{D,\lambda}(x_i)-y_i\big)^2}{\sigma^2}\right)\right]\big(f_{D,\lambda}(x_i)-y_i\big)K_{x_i},$$
with $G'_+$ denoting the one-sided derivative of
$$G(s) = \begin{cases} s, & \text{if } 0\le s\le1,\\ 2\sqrt{s}-1, & \text{if } s\ge1. \end{cases}$$
Proof. 
Note that $\ell_\sigma(u) = \sigma^2G\!\left(\frac{u^2}{\sigma^2}\right)$. Since $f_{D,\lambda}$ is the minimizer of Algorithm (2), we take the gradient of the regularized functional on $\mathcal{H}_K$ to give
$$\frac1N\sum_{i=1}^N G'_+\!\left(\frac{\big(f_{D,\lambda}(x_i)-y_i\big)^2}{\sigma^2}\right)\big(f_{D,\lambda}(x_i)-y_i\big)K_{x_i} + \lambda f_{D,\lambda} = 0.$$
With the fact $G'_+(0)=1$, it yields
$$\frac1N\sum_{i=1}^N\big(f_{D,\lambda}(x_i)-y_i\big)K_{x_i} + \lambda f_{D,\lambda} - W_{D,\lambda} = 0,$$
which is $(L_{K,D}+\lambda I)f_{D,\lambda} - \hat f_{\rho,D} - W_{D,\lambda} = 0$.
The proof is complete. □
Based on the above lemma, we can obtain the bound of f D , λ .
Lemma 2.
Under the moment condition (5), with probability at least $1-\delta$, there holds
$$\|f_{D,\lambda}\|_K \le (4M+5t)\,\lambda^{-\frac12}\log\frac N\delta.$$
Proof. 
Under the moment condition (5), it has been proven in [27] that, with probability at least $1-\delta$, there holds
$$\max\big\{|y| : \text{there exists } x\in\mathcal{X} \text{ such that } (x,y)\in D\big\} \le (4M+5t)\log\frac N\delta. \tag{15}$$
By the definition of $f_{D,\lambda}$, we have that $\mathcal{E}_D(f_{D,\lambda}) + \lambda\|f_{D,\lambda}\|_K^2 \le \mathcal{E}_D(0)$. Thus,
$$\lambda\|f_{D,\lambda}\|_K^2 \le \mathcal{E}_D(0) = \frac1N\sum_{i=1}^N\ell_\sigma(y_i) \le \frac1N\sum_{i=1}^N y_i^2 \le \max_{(x,y)\in D}|y|^2.$$
It follows that
$$\|f_{D,\lambda}\|_K \le \lambda^{-\frac12}\max_{(x,y)\in D}|y|. \tag{16}$$
This together with (15) yields the desired conclusion. □
Furthermore, we see that
$$\|W_{D,\lambda}\|_K \le \sigma^{-1}\frac1N\sum_{i=1}^N\big(\|f_{D,\lambda}\|_K+|y_i|\big)^2 \le 2\sigma^{-1}\frac1N\sum_{i=1}^N\big(\|f_{D,\lambda}\|_K^2+|y_i|^2\big) \le 2\sigma^{-1}\Big(\|f_{D,\lambda}\|_K^2 + \max_{(x,y)\in D}|y|^2\Big). \tag{17}$$
This, in combination with the bounds (15) and (16), provides that, with probability at least $1-\delta$,
$$\|W_{D,\lambda}\|_K \le 2(4M+5t)^2\big(\lambda^{-1}+1\big)\,\sigma^{-1}\left(\log\frac N\delta\right)^2. \tag{18}$$

3.2. Error Decomposition

To derive the explicit convergence rate of Algorithm (2), we introduce the regularization function $f_\lambda$ in $\mathcal{H}_K$, defined by
$$f_\lambda := \arg\min_{f\in\mathcal{H}_K}\Big\{\mathcal{E}^{\mathrm{ls}}(f) + \lambda\|f\|_K^2\Big\}, \tag{19}$$
where $\mathcal{E}^{\mathrm{ls}}(f) = \int_{\mathcal{Z}}(f(x)-y)^2\,d\rho$ is the expected risk associated with the least squares loss. It is direct to verify that
$$f_\lambda = (L_K+\lambda I)^{-1}L_Kf^*,$$
so $f_\lambda - f^* = -\lambda(L_K+\lambda I)^{-1}f^*$. By the work in [20], we know that, under the regularity assumption (3) with $r>0$,
$$\|f_\lambda-f^*\|_\rho \le \begin{cases} \|h\|_\rho\,\lambda^r, & \text{when } 0<r\le1,\\ \|h\|_\rho\,\lambda, & \text{when } r>1, \end{cases} \tag{20}$$
and
$$\|f_\lambda\|_K \le \begin{cases} \|h\|_\rho\,\lambda^{r-\frac12}, & \text{when } 0<r<1/2,\\ \|h\|_\rho, & \text{when } r\ge1/2. \end{cases} \tag{21}$$
Now, we state the error decomposition for $f_{D,\lambda}-f_\lambda$. By (19), we have
$$L_{K,D}f_\lambda + \lambda f_\lambda = L_{K,D}f_\lambda - L_Kf_\lambda + L_Kf^*.$$
It implies
$$f_\lambda = (L_{K,D}+\lambda I)^{-1}\big[L_Kf^* - (L_K-L_{K,D})f_\lambda\big],$$
which, together with (13), leads to the decomposition
$$f_{D,\lambda}-f_\lambda = (L_{K,D}+\lambda I)^{-1}(L_K-L_{K,D})f_\lambda + (L_{K,D}+\lambda I)^{-1}\big(\hat f_{\rho,D}-L_Kf^*\big) + (L_{K,D}+\lambda I)^{-1}W_{D,\lambda}. \tag{23}$$
In the sequel, we denote
$$B_{D,\lambda} = \big\|(L_{K,D}+\lambda I)^{-1}(L_K+\lambda I)\big\|, \qquad C_{D,\lambda} = \big\|(L_K+\lambda I)^{-\frac12}(L_K-L_{K,D})\big\|, \qquad G_{D,\lambda} = \big\|(L_K+\lambda I)^{-\frac12}\big(\hat f_{\rho,D}-L_Kf^*\big)\big\|_K.$$
Noting that, for any $f\in\mathcal{H}_K$,
$$\max\big\{\|f\|_\rho,\ \sqrt\lambda\,\|f\|_K\big\} \le \big\|(L_K+\lambda I)^{\frac12}f\big\|_K \tag{24}$$
by the fact $\|f\|_\rho = \|L_K^{1/2}f\|_K$ [21], one obtains the following bound for the sample error $\|f_{D,\lambda}-f_\lambda\|_\rho$ from the decomposition (23) above.
Proposition 1.
Define $f_{D,\lambda}$ by (2). Then, there holds
$$\|f_{D,\lambda}-f_\lambda\|_\rho \le B_{D,\lambda}C_{D,\lambda}\|f_\lambda\|_K + B_{D,\lambda}G_{D,\lambda} + \lambda^{-\frac12}B_{D,\lambda}\|W_{D,\lambda}\|_K. \tag{25}$$
Proof. 
Let $I_1$, $I_2$, and $I_3$ denote the three terms on the right-hand side of (23), respectively. Consider the $\mathcal{H}_K$ norm of
$$(L_K+\lambda I)^{1/2}(f_{D,\lambda}-f_\lambda) = (L_K+\lambda I)^{1/2}(I_1+I_2+I_3).$$
Then,
$$\big\|(L_K+\lambda I)^{1/2}I_1\big\|_K \le \big\|(L_K+\lambda I)^{1/2}(L_{K,D}+\lambda I)^{-1/2}\big\|\,\big\|(L_{K,D}+\lambda I)^{-1/2}(L_K+\lambda I)^{1/2}\big\| \times \big\|(L_K+\lambda I)^{-1/2}(L_K-L_{K,D})f_\lambda\big\|_K \le B_{D,\lambda}C_{D,\lambda}\|f_\lambda\|_K.$$
Similarly,
$$\big\|(L_K+\lambda I)^{1/2}I_2\big\|_K \le \big\|(L_K+\lambda I)^{1/2}(L_{K,D}+\lambda I)^{-1}(L_K+\lambda I)^{1/2}\big\|\,\big\|(L_K+\lambda I)^{-1/2}\big(\hat f_{\rho,D}-L_Kf^*\big)\big\|_K \le B_{D,\lambda}G_{D,\lambda},$$
and
$$\big\|(L_K+\lambda I)^{1/2}I_3\big\|_K \le \big\|(L_K+\lambda I)^{1/2}(L_{K,D}+\lambda I)^{-1}(L_K+\lambda I)^{1/2}\big\|\,\frac{1}{\sqrt\lambda}\,\|W_{D,\lambda}\|_K \le \lambda^{-1/2}B_{D,\lambda}\|W_{D,\lambda}\|_K.$$
With the above bounds, we use (24) to obtain the statement.
The proof is finished. □

3.3. Deriving Main Results

To prove our main results, we need to bound the quantities $B_{D,\lambda}$, $C_{D,\lambda}$, and $G_{D,\lambda}$ by the following probability estimates.
Lemma 3.
With a confidence of at least $1-\delta$, there holds
$$B_{D,\lambda} \le 2\left(\left(\frac{2A_{D,\lambda}\log\frac2\delta}{\sqrt\lambda}\right)^2 + 2\right), \qquad C_{D,\lambda} \le 2A_{D,\lambda}\log\frac2\delta, \qquad G_{D,\lambda} \le 4(M+t)\,A_{D,\lambda}\log\frac2\delta,$$
where $A_{D,\lambda} = \frac{1}{N\sqrt\lambda} + \sqrt{\frac{\mathcal{N}(\lambda)}{N}}$.
These inequalities are well studied in the literature and can be found in [17,18].
Proof of Theorem 1.
We can decompose $\|f_{D,\lambda}-f^*\|_\rho$ into the sample error $\|f_{D,\lambda}-f_\lambda\|_\rho$ and the approximation error $\|f_\lambda-f^*\|_\rho$. As stated in (20), $\|f_\lambda-f^*\|_\rho \le \lambda^r\|h\|_\rho$ for $0<r\le1$. Thus, we just estimate $\|f_{D,\lambda}-f_\lambda\|_\rho$ by Proposition 1.
By Lemma 3 and the bound (18), with probability at least $1-4\delta$, the following bounds hold simultaneously:
$$B_{D,\lambda}C_{D,\lambda}\|f_\lambda\|_K \le 4\left(\left(\frac{2A_{D,\lambda}}{\sqrt\lambda}\right)^2+2\right)A_{D,\lambda}\left(\log\frac2\delta\right)^3\|f_\lambda\|_K, \qquad B_{D,\lambda}G_{D,\lambda} \le 8(M+t)\left(\left(\frac{2A_{D,\lambda}}{\sqrt\lambda}\right)^2+2\right)A_{D,\lambda}\left(\log\frac2\delta\right)^3,$$
and
$$\lambda^{-\frac12}B_{D,\lambda}\|W_{D,\lambda}\|_K \le 8(4M+5t)^2\left(\left(\frac{2A_{D,\lambda}}{\sqrt\lambda}\right)^2+2\right)\lambda^{-\frac32}\sigma^{-1}\left(\log\frac N\delta\right)^4.$$
Scaling $4\delta$ to $\delta$, by (20) and the estimates above, we have, with confidence at least $1-\delta$,
$$\|f_{D,\lambda}-f^*\|_\rho \le \|f_{D,\lambda}-f_\lambda\|_\rho + \|f_\lambda-f^*\|_\rho \le 24(4M+5t)^2\left(\left(\frac{2A_{D,\lambda}}{\sqrt\lambda}\right)^2+2\right)\left[\big(A_{D,\lambda}+A_{D,\lambda}\|f_\lambda\|_K\big)\left(\log\frac8\delta\right)^3 + \lambda^{-\frac32}\sigma^{-1}\left(\log\frac{4N}\delta\right)^4\right] + \|h\|_\rho\,\lambda^r. \tag{26}$$
By (4),
$$A_{D,\lambda} = \frac{1}{N\sqrt\lambda} + \sqrt{\frac{\mathcal{N}(\lambda)}{N}} \le \frac{1}{N\sqrt\lambda} + \sqrt{\frac{C\lambda^{-s}}{N}} \le (C+1)\,\frac{\lambda^{-\frac s2}}{\sqrt N}\left(\frac{\lambda^{\frac{s-1}{2}}}{\sqrt N}+1\right).$$
The choice (6) of $\lambda$ results in the facts that
$$\frac{\lambda^{-\frac s2}}{\sqrt N} = \begin{cases} \lambda^{-\frac s2+\frac{1+s}{2}} = \lambda^{\frac12}, & \text{when } 0<r<\frac12,\\[2pt] \lambda^{-\frac s2+\frac s2+\min\{1,r\}} = \lambda^{\min\{1,r\}}, & \text{when } r\ge\frac12, \end{cases}$$
$$\frac{\lambda^{\frac{s-1}{2}}}{\sqrt N} \le \begin{cases} \lambda^{\frac{s-1}{2}+\frac{1+s}{2}}, & \text{when } 0<r<\frac12,\\[2pt] \lambda^{\frac{s-1}{2}+\frac s2+\min\{1,r\}}, & \text{when } r\ge\frac12, \end{cases} \le \lambda^{s} \le 1,$$
and
$$\frac{\lambda^{-s-1}}{N} \le \begin{cases} \lambda^{-s-1}\lambda^{s+1} = 1, & \text{when } 0<r<\frac12,\\[2pt] \lambda^{-1-s}\lambda^{s+2\min\{r,1\}} = \lambda^{\min\{2r-1,1\}}, & \text{when } r\ge\frac12, \end{cases} \le 1.$$
Collecting the above estimates,
$$\left(\frac{A_{D,\lambda}}{\sqrt\lambda}\right)^2 \le (C+1)^2\,\frac{\lambda^{-s-1}}{N}\left(\frac{\lambda^{\frac{s-1}{2}}}{\sqrt N}+1\right)^2 \le 4(C+1)^2 \tag{30}$$
and
$$A_{D,\lambda} \le \begin{cases} \lambda^{\frac12}, & \text{when } 0<r<\frac12,\\ \lambda^{\min\{1,r\}}, & \text{when } r\ge\frac12. \end{cases} \tag{31}$$
Putting (21), (30), and (31) into (26), we can get (7) with
$$C_1 = 96(4M+5t)^2\big[(C+1)^2+1\big]\big(2+\|h\|_\rho\big).$$
The proof is complete. □
Proof of Theorem 2.
Similarly to the proof of (25), there holds
$$\|f_{D^*,\lambda}-f_\lambda\|_\rho \le B_{D^*,\lambda}C_{D^*,\lambda}\|f_\lambda\|_K + B_{D^*,\lambda}G_{D^*,\lambda} + \lambda^{-\frac12}B_{D^*,\lambda}\|W_{D^*,\lambda}\|_K. \tag{32}$$
Note that, by (9),
$$\hat f_{\rho,D^*} = \frac{1}{N+\tilde N}\sum_{i=1}^{N+\tilde N}y_i^*K_{x_i^*} = \frac{1}{N+\tilde N}\sum_{i=1}^{N}\frac{N+\tilde N}{N}\,y_iK_{x_i} = \frac1N\sum_{i=1}^N y_iK_{x_i} = \hat f_{\rho,D}.$$
It means that $G_{D^*,\lambda}=G_{D,\lambda}$. Furthermore, similarly to (16), we have
$$\|f_{D^*,\lambda}\|_K^2 \le \frac{N+\tilde N}{N\lambda}\,\max_{(x,y)\in D}|y|^2.$$
In addition, by (17),
$$\|W_{D^*,\lambda}\|_K \le \sigma^{-1}\frac{1}{N+\tilde N}\sum_{i=1}^{N+\tilde N}\big(\|f_{D^*,\lambda}\|_K+|y_i^*|\big)^2 \le 2\sigma^{-1}\frac{1}{N+\tilde N}\sum_{i=1}^{N+\tilde N}\big(\|f_{D^*,\lambda}\|_K^2+|y_i^*|^2\big) \le 2\sigma^{-1}\left(\|f_{D^*,\lambda}\|_K^2 + \left(\frac{N+\tilde N}{N}\right)^2\max_{(x,y)\in D}|y|^2\right).$$
Then, by (15), with confidence at least $1-\delta$,
$$\|W_{D^*,\lambda}\|_K \le 2(4M+5t)^2\left(\frac{N+\tilde N}{N\lambda}+\left(\frac{N+\tilde N}{N}\right)^2\right)\sigma^{-1}\left(\log\frac N\delta\right)^2.$$
This, together with Lemma 3, yields that, with confidence at least $1-4\delta$,
$$B_{D^*,\lambda}C_{D^*,\lambda}\|f_\lambda\|_K \le 4\left(\left(\frac{2A_{D^*,\lambda}}{\sqrt\lambda}\right)^2+2\right)A_{D^*,\lambda}\left(\log\frac2\delta\right)^3\|f_\lambda\|_K, \qquad B_{D^*,\lambda}G_{D^*,\lambda} = B_{D^*,\lambda}G_{D,\lambda} \le 8(M+t)\left(\left(\frac{2A_{D^*,\lambda}}{\sqrt\lambda}\right)^2+2\right)A_{D,\lambda}\left(\log\frac2\delta\right)^3,$$
and
$$\lambda^{-\frac12}B_{D^*,\lambda}\|W_{D^*,\lambda}\|_K \le 8(4M+5t)^2\left(\left(\frac{2A_{D^*,\lambda}}{\sqrt\lambda}\right)^2+2\right)\Delta_{N,\tilde N,\lambda}\,\lambda^{-\frac12}\sigma^{-1}\left(\log\frac N\delta\right)^4.$$
Scaling $4\delta$ to $\delta$, by (32) and (20), we have, with confidence at least $1-\delta$,
$$\|f_{D^*,\lambda}-f^*\|_\rho \le \|f_{D^*,\lambda}-f_\lambda\|_\rho + \|f_\lambda-f^*\|_\rho \le 24(4M+5t)^2\left(\left(\frac{2A_{D^*,\lambda}}{\sqrt\lambda}\right)^2+2\right)\left[\big(A_{D,\lambda}+A_{D^*,\lambda}\|f_\lambda\|_K\big)\left(\log\frac8\delta\right)^3 + \Delta_{N,\tilde N,\lambda}\,\lambda^{-\frac12}\sigma^{-1}\left(\log\frac{4N}\delta\right)^4\right] + \|h\|_\rho\,\lambda^r. \tag{33}$$
Thus, to prove Theorem 2, we need the following estimates. Since $r+s>\frac12$ and $\lambda=N^{-\frac{1}{2r+s}}$,
$$\frac{\lambda^{\frac{s-1}{2}}}{\sqrt N} = N^{\frac{1-2(r+s)}{2(2r+s)}} \le 1, \qquad \frac{\lambda^{-\frac s2}}{\sqrt N} = \lambda^r.$$
Then, by (4),
$$A_{D,\lambda} = \frac{1}{N\sqrt\lambda} + \sqrt{\frac{\mathcal{N}(\lambda)}{N}} \le \frac{1}{N\sqrt\lambda} + \sqrt{\frac{C\lambda^{-s}}{N}} \le (C+1)\,\frac{\lambda^{-\frac s2}}{\sqrt N}\left(\frac{\lambda^{\frac{s-1}{2}}}{\sqrt N}+1\right) \le 2(C+1)\lambda^r$$
and
$$A_{D^*,\lambda} = \frac{1}{(N+\tilde N)\sqrt\lambda} + \sqrt{\frac{\mathcal{N}(\lambda)}{N+\tilde N}} \le \frac{1}{(N+\tilde N)\sqrt\lambda} + \sqrt{\frac{C\lambda^{-s}}{N+\tilde N}} \le (C+1)\,\frac{\lambda^{-\frac s2}}{\sqrt{N+\tilde N}}\left(\frac{\lambda^{\frac{s-1}{2}}}{\sqrt{N+\tilde N}}+1\right) \le 2(C+1)\,\frac{\lambda^{-\frac s2}}{\sqrt{N+\tilde N}}.$$
Thus,
$$\left(\frac{A_{D^*,\lambda}}{\sqrt\lambda}\right)^2 + 1 \le 4(C+1)^2\,\frac{\lambda^{-s-1}}{N+\tilde N} + 1 \le 4(C+1)^2 + 1.$$
Furthermore, by (21),
$$A_{D^*,\lambda}\|f_\lambda\|_K \le \begin{cases} 2(C+1)\|h\|_\rho\,\frac{\lambda^{-\frac s2+r-\frac12}}{\sqrt{N+\tilde N}}, & \text{when } 0<r<1/2,\\[4pt] 2(C+1)\|h\|_\rho\,\frac{\lambda^{-\frac s2}}{\sqrt{N+\tilde N}}, & \text{when } r\ge1/2. \end{cases}$$
By the restriction $\tilde N\ge\max\big\{N^{\frac{s+1}{2r+s}}-N+1,\ 1\big\}$, we conclude that
$$A_{D^*,\lambda}\|f_\lambda\|_K \le 2(C+1)\|h\|_\rho\,\lambda^r, \qquad \text{for } r>0.$$
Putting the estimates above into (33) yields the desired conclusion (10) with
$$C_2 = 96(4M+5t)^2\big[(C+1)^2+(C+1)+1\big]\big(2+\|h\|_\rho\big).$$
The proof is finished. □

4. Numerical Simulation

In this part, we carry out simulations to verify our theoretical statements, using the mean squared error on a testing set for comparison. We generate $N=500$ labeled data $\{x_i,y_i\}_{i=1}^{500}$ from the regression model $y_i=f^*(x_i)+\epsilon$, where $f^*(x)=x(1-x)$, the inputs $x_i$ are drawn independently from the normal distribution $N(0,1)$, and $\epsilon$ is independent Gaussian noise $N(0,0.005)$. We also generate $\tilde N=200$ unlabeled data $\{\tilde x_i\}_{i=1}^{200}$ with $\tilde x_i$ drawn independently from the uniform distribution on $[0,1]$. We choose the Gaussian kernel $K(x,u)=\exp\{-|x-u|^2/(2h^2)\}$ with bandwidth $h=5$ and the regularization parameter $\lambda=0.7$. In Figure 1, Algorithm 1 denotes the regularized Huber regression (2) trained on the supervised data set $D=\{x_i,y_i\}_{i=1}^{500}$, while Algorithm 2 denotes the same estimator trained on the semi-supervised data set $D^*$ constructed by (9). The error of Algorithm 2 is already clearly smaller than that of Algorithm 1 when 20 unlabeled data are added to the training data, and as the number of unlabeled data increases from 20 to 200, the error curve of Algorithm 2 decreases continuously. These experimental results coincide with our theoretical analysis; see Figure 1.
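For reference, the following script is a rough reconstruction of the setup described above, reusing the sketches gaussian_kernel, kernel_huber_irls, and build_semisupervised_set introduced earlier; the robustification parameter sigma and the test-set construction are our own choices, since they are not reported in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
f_star = lambda x: x * (1.0 - x)

N = 500
X = rng.normal(0.0, 1.0, size=(N, 1))                      # labeled inputs ~ N(0, 1)
y = f_star(X[:, 0]) + rng.normal(0.0, np.sqrt(0.005), N)   # y = f*(x) + Gaussian noise
X_unlab = rng.uniform(0.0, 1.0, size=(200, 1))             # unlabeled inputs ~ Uniform[0, 1]
X_test = rng.uniform(0.0, 1.0, size=(1000, 1))             # test inputs for the MSE

h, lam, sigma = 5.0, 0.7, 1.0                              # h, lam from the paper; sigma assumed

def test_mse(X_train, y_train):
    _, alpha = kernel_huber_irls(X_train, y_train, sigma, lam, h)
    pred = gaussian_kernel(X_test, X_train, h) @ alpha
    return float(np.mean((pred - f_star(X_test[:, 0])) ** 2))

print("supervised (Algorithm 1):", test_mse(X, y))
for m in range(20, 201, 20):                               # add 20, 40, ..., 200 unlabeled points
    Xs, ys = build_semisupervised_set(X, y, X_unlab[:m])
    print(m, "unlabeled (Algorithm 2):", test_mse(Xs, ys))
```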

5. Discussion

Unlabeled data are ubiquitous in a variety of fields, including signal processing, privacy-sensitive applications, feature selection, and data clustering. To exploit the robustness of Huber regression in such applications, we adopted a semi-supervised learning method for our regularized Huber regression algorithm. We derived the explicit learning rate of Algorithm (2) in the supervised setting, which is comparable to the minimax optimal rate of OLS. By the semi-supervised method, we showed that an influx of unlabeled data can improve the learning performance of Huber regression analysis. This suggests that using the additional information in unlabeled data can extend the applicability of Huber regression.

Author Contributions

Conceptualization, Y.W.; Funding acquisition, C.P.; Methodology, B.W. and X.L.; Project administration, B.W.; Resources, X.L.; Supervision, C.P. and H.Y.; Writing—original draft, Y.W.; Writing—review & editing, H.Y. All authors have read and agreed to the published version of the manuscript.

Funding

The work described in this paper is partially supported by the National Natural Science Foundation of China (Project 12071356).

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Huber, P.J. Robust Estimation of a Location Parameter. Ann. Math. Stat. 1964, 35, 73–101.
2. Huber, P.J. Robust Regression: Asymptotics, Conjectures and Monte Carlo. Ann. Stat. 1973, 1, 799–821.
3. Christmann, A.; Steinwart, I. Consistency and robustness of kernel based regression. Bernoulli 2007, 13, 799–819.
4. Fan, J.; Li, Q.; Wang, Y. Estimation of high dimensional mean regression in the absence of symmetry and light tail assumptions. J. R. Stat. Soc. 2017, 79, 247–265.
5. Feng, Y.; Wu, Q. A statistical learning assessment of Huber regression. J. Approx. Theory 2022, 273, 105660.
6. Loh, P.L. Statistical consistency and asymptotic normality for high-dimensional robust M-estimators. Ann. Stat. 2017, 45, 866–896.
7. Rao, B. Asymptotic behavior of M-estimators for the linear model with dependent errors. Bull. Inst. Math. Acad. Sin. 1981, 9, 367–375.
8. Sun, Q.; Zhou, W.; Fan, J. Adaptive Huber Regression. J. Am. Stat. Assoc. 2020, 115, 254–265.
9. Wang, Z.; Liu, H.; Zhang, T. Optimal computational and statistical rates of convergence for sparse nonconvex learning problems. Ann. Stat. 2014, 42, 2164–2201.
10. Chapelle, O.; Schölkopf, B.; Zien, A. Semi-Supervised Learning; MIT Press: Cambridge, MA, USA, 2006.
11. Belkin, M.; Niyogi, P. Semi-Supervised Learning on Riemannian Manifolds. Mach. Learn. 2004, 56, 209–239.
12. Blum, A.; Mitchell, T. Combining Labeled and Unlabeled Data with Co-Training. In Proceedings of the 11th Annual Conference on Computational Learning Theory, Madison, WI, USA, 24–26 July 1998.
13. Wang, J.; Jebara, T.; Chang, S.F. Semi-Supervised Learning Using Greedy Max-Cut. J. Mach. Learn. Res. 2013, 14, 771–800.
14. Caponnetto, A.; Yao, Y. Cross-validation based adaptation for regularization operators in learning theory. Anal. Appl. 2010, 8, 161–183.
15. Guo, X.; Hu, T.; Wu, Q. Distributed Minimum Error Entropy Algorithms. J. Mach. Learn. Res. 2020, 21, 1–31.
16. Hu, T.; Fan, J.; Xiang, D.H. Convergence Analysis of Distributed Multi-Penalty Regularized Pairwise Learning. Anal. Appl. 2019, 18, 109–127.
17. Lin, S.B.; Guo, X.; Zhou, D.X. Distributed Learning with Regularized Least Squares. J. Mach. Learn. Res. 2017, 18, 3202–3232.
18. Lin, S.B.; Zhou, D.X. Distributed Kernel-Based Gradient Descent Algorithms. Constr. Approx. 2018, 47, 249–276.
19. Aronszajn, N. Theory of reproducing kernels. Trans. Am. Math. Soc. 1950, 68, 337–404.
20. Smale, S.; Zhou, D.X. Learning Theory Estimates via Integral Operators and Their Approximations. Constr. Approx. 2007, 26, 153–172.
21. Cucker, F.; Zhou, D.X. Learning Theory: An Approximation Theory Viewpoint; Cambridge University Press: Cambridge, UK, 2007.
22. Bauer, F.; Pereverzev, S.; Rosasco, L. On regularization algorithms in learning theory. J. Complex. 2007, 23, 52–72.
23. Caponnetto, A.; De Vito, E. Optimal Rates for the Regularized Least-Squares Algorithm. Found. Comput. Math. 2007, 7, 331–368.
24. Zhang, T. Effective Dimension and Generalization of Kernel Learning. In Proceedings of the Advances in Neural Information Processing Systems 15 (NIPS 2002), Vancouver, BC, Canada, 9–14 December 2002.
25. Mendelson, S.; Neeman, J. Regularization in kernel learning. Ann. Stat. 2010, 38, 526–565.
26. Raskutti, G.; Wainwright, M.J.; Yu, B. Early stopping and non-parametric regression. J. Mach. Learn. Res. 2014, 15, 335–366.
27. Wang, C.; Hu, T. Online minimum error entropy algorithm with unbounded sampling. Anal. Appl. 2019, 17, 293–322.
Figure 1. The number of unlabeled data.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
