Article

Huber Regression Analysis with a Semi-Supervised Method

Yue Wang, Baobin Wang, Chaoquan Peng, Xuefeng Li and Hong Yin
1 School of Mathematics and Statistics, South-Central Minzu University, Wuhan 430074, China
2 School of Mathematics, Renmin University of China, Beijing 100872, China
* Author to whom correspondence should be addressed.
Submission received: 27 August 2022 / Revised: 5 October 2022 / Accepted: 8 October 2022 / Published: 11 October 2022
(This article belongs to the Special Issue Distribution Theory and Application)

Abstract

In this paper, we study the regularized Huber regression algorithm in a reproducing kernel Hilbert space (RKHS), which is applicable to both fully supervised and semi-supervised learning schemes. Our focus in this work is twofold. First, we provide the convergence properties of the algorithm with fully supervised data and establish optimal convergence rates in the minimax sense when the regression function lies in the RKHS. Second, we improve the learning performance of the Huber regression algorithm by a semi-supervised method: we show that, with sufficient unlabeled data, the minimax optimal rates can be retained even if the regression function lies outside the RKHS.

1. Introduction

Ordinary least squares (OLS) is an important statistical tool in regression analysis. However, OLS does not perform well when the data are contaminated by outliers or heavy-tailed noise. Thus, OLS is suboptimal for robust regression analysis, and a variety of robust loss functions that are less easily affected by noise have been developed. Among them, the Huber loss function is a popular choice in statistics, machine learning, and optimization, since it is less sensitive to outliers and can address heavy-tailed errors effectively. Huber regression was initiated by Peter Huber in his seminal work [1,2]. Statistical bounds and convergence properties for Huber estimation and inference have been further investigated in subsequent works; see, e.g., [3,4,5,6,7,8,9].
Semi-supervised learning has been gaining increased attention as an active research area in science and engineering. The original idea of semi-supervised methods dates back to self-learning in the context of classification [10] and was then developed further in decision-directed learning, co-training for text classification, and manifold learning [11,12,13]. Most existing research on Huber regression is set in the supervised framework: unlabeled data were deemed useless and thus thrown away in the design of algorithms. Recently, a vast literature has shown that utilizing the additional information in unlabeled data can effectively improve the learning performance of algorithms; see, e.g., [14,15,16,17,18]. In this paper, we focus on the performance of the Huber regression algorithm with unlabeled data. By a semi-supervised method, we find that optimal learning rates are available if sufficient unlabeled data are added to the Huber regression analysis.
In the standard framework of statistical learning, we let the explanatory variable $X$ take values in a compact domain $\mathcal{X}$ in a Euclidean space, and the response variable $Y$ take values in the output space $\mathcal{Y}\subseteq\mathbb{R}$. This work investigates the application of the Huber loss linked to the following regression model:
$$Y = f^*(X) + \epsilon,$$
where $f^*$ is the regression function and $\epsilon$ is the noise in the regression model. Let $\rho$ be a Borel probability measure on the product space $\mathcal{Z}=\mathcal{X}\times\mathcal{Y}$. Let $\rho_X$ and $\rho(y|x)$ denote the marginal distribution of $\rho$ on $\mathcal{X}$ and the conditional distribution on $\mathcal{Y}$ given $x\in\mathcal{X}$, respectively. In the supervised learning setting, $\rho$ is assumed to be unknown, and the purpose of regression is to estimate $f^*$ according to a sample $D=\{(x_i,y_i)\}_{i=1}^{N}$ drawn independently from $\rho$, where $N$ is the sample size, the cardinality of $D$. The Huber loss function $\ell_\sigma(\cdot)$ is defined as
$$\ell_\sigma(u) = \begin{cases} u^2, & \text{if } |u|\le\sigma, \\ 2\sigma|u|-\sigma^2, & \text{if } |u|>\sigma, \end{cases}$$
where $\sigma>0$ is a robustification parameter. Given a prediction function $f:\mathcal{X}\to\mathcal{Y}$, Huber regression searches for a good approximation of $f^*$ by minimizing the empirical prediction error with the Huber loss
$$\mathcal{E}_D(f) := \frac{1}{N}\sum_{i=1}^{N}\ell_\sigma\big(y_i - f(x_i)\big) \tag{1}$$
over a suitable hypothesis space.
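To make the definitions concrete, here is a minimal NumPy sketch of the Huber loss $\ell_\sigma$ and the empirical risk (1); the function names are ours and not part of the paper.

```python
import numpy as np

def huber_loss(u, sigma):
    """Huber loss l_sigma(u): u^2 for |u| <= sigma, 2*sigma*|u| - sigma^2 otherwise."""
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= sigma, u ** 2, 2.0 * sigma * np.abs(u) - sigma ** 2)

def empirical_risk(f, X, y, sigma):
    """Empirical prediction error E_D(f) = (1/N) * sum_i l_sigma(y_i - f(x_i))."""
    return float(np.mean(huber_loss(y - f(X), sigma)))
```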
In this work, we study the kernel-based Huber regression algorithm, where the minimization of (1) is performed in a reproducing kernel Hilbert space (RKHS) [19]. Recall that $K:\mathcal{X}\times\mathcal{X}\to\mathbb{R}$ is a Mercer kernel if it is continuous, symmetric, and positive semidefinite. The RKHS $\mathcal{H}_K$ is the completion of the linear span of the function set $\{K_x = K(x,\cdot):x\in\mathcal{X}\}$ with the inner product induced by $\langle K_x,K_y\rangle_K = K(x,y)$. The reproducing property is given by $f(x)=\langle f,K_x\rangle_K$. Note that, by the Cauchy–Schwarz inequality and [19],
$$\|f\|_\infty = \sup_{x\in\mathcal{X}}\big|\langle f,K_x\rangle_K\big| \le \sup_{x\in\mathcal{X}}\|f\|_K\|K_x\|_K = \sup_{x\in\mathcal{X}}\sqrt{K(x,x)}\,\|f\|_K.$$
To avoid overfitting, the regularized Huber regression algorithm in the RKHS $\mathcal{H}_K$ is given as
$$f_{D,\lambda} = \arg\min_{f\in\mathcal{H}_K}\Big\{\mathcal{E}_D(f) + \lambda\|f\|_K^2\Big\}, \tag{2}$$
where λ > 0 is a regularization parameter.
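The paper does not prescribe a solver for (2). One standard way to compute $f_{D,\lambda}$ in practice is to invoke the representer theorem, $f=\sum_i\alpha_iK_{x_i}$, and solve the resulting finite-dimensional problem by iteratively reweighted least squares (IRLS); the weights $w_i=\min\{1,\sigma/|r_i|\}$ come from $\ell_\sigma'(u)=2u\min\{1,\sigma/|u|\}$, so the fixed point satisfies the same first-order condition used later in the proof of Lemma 1. The sketch below, with a Gaussian kernel and our own function names, is only an illustration under these assumptions, not the authors' implementation.

```python
import numpy as np

def gaussian_kernel(X1, X2, h=1.0):
    """Gaussian kernel K(x, u) = exp(-|x - u|^2 / (2 h^2)) for row-wise inputs."""
    d2 = np.sum(X1**2, axis=1)[:, None] + np.sum(X2**2, axis=1)[None, :] - 2.0 * X1 @ X2.T
    return np.exp(-d2 / (2.0 * h**2))

def kernel_huber_irls(X, y, sigma, lam, h=1.0, n_iter=100, tol=1e-9):
    """Regularized Huber regression (2) via the representer theorem + IRLS.
    X: (N, d) inputs, y: (N,) responses. Returns the Gram matrix and coefficients alpha,
    so that predictions at new points Xnew are gaussian_kernel(Xnew, X, h) @ alpha."""
    N = len(y)
    K = gaussian_kernel(X, X, h)
    alpha = np.zeros(N)
    for _ in range(n_iter):
        r = y - K @ alpha                                   # current residuals
        w = np.minimum(1.0, sigma / np.maximum(np.abs(r), 1e-12))
        # stationarity of (2): (1/N) * sum_i w_i (y_i - f(x_i)) K_{x_i} = lam * f
        alpha_new = np.linalg.solve(w[:, None] * K + N * lam * np.eye(N), w * y)
        if np.max(np.abs(alpha_new - alpha)) < tol:
            return K, alpha_new
        alpha = alpha_new
    return K, alpha
```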
In this paper, we derive the explicit learning rate of Algorithm (2) in the supervised learning setting, which is comparable to the minimax optimal rate of OLS. By a semi-supervised method, we show that utilizing unlabeled data can overcome the bottleneck that optimal learning rates for Algorithm (2) are achievable only when $f^*$ lies in $\mathcal{H}_K$.

2. Assumptions and Main Results

To present our main results, we introduce some necessary assumptions. In this section, we study the convergence of $f_{D,\lambda}$ to $f^*$ in the square integrable space $(L^2_{\rho_X},\|\cdot\|_\rho)$.
Below, we elaborate on three important assumptions to carry out the analysis. The first assumption (3) is about the regularity of the regression function $f^*$. Define the integral operator $L_K:L^2_{\rho_X}\to L^2_{\rho_X}$ associated with the kernel $K$ by
$$L_K f := \int_{\mathcal{X}} f(x)K_x\,d\rho_X(x), \qquad f\in L^2_{\rho_X}.$$
Since $K$ is a Mercer kernel on the compact domain $\mathcal{X}$, $L_K$ is compact and positive. Thus, $L_K^r$, the $r$-th power of $L_K$ for $r>0$, is well defined [20]. Our error bounds are stated in terms of the regularity of $f^*$, given by
$$f^* = L_K^r(h), \qquad \text{for some } r>0 \text{ and } h\in L^2_{\rho_X}. \tag{3}$$
The condition (3) characterizes the regularity of $f^*$ and is directly related to the smoothness of $f^*$ when $\mathcal{H}_K$ is a Sobolev space. If (3) holds with $r\ge\frac12$, then $f^*$ lies in the space $\mathcal{H}_K$ [21].
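As a quick check of the last claim (a standard argument, sketched here using the fact that $\|L_K^{1/2}g\|_K\le\|g\|_\rho$ for $g\in L^2_{\rho_X}$): if (3) holds with $r\ge\frac12$, then
$$f^* = L_K^r h = L_K^{1/2}\big(L_K^{\,r-\frac12}h\big), \qquad \|f^*\|_K \le \big\|L_K^{\,r-\frac12}h\big\|_\rho \le \|L_K\|^{\,r-\frac12}\,\|h\|_\rho < \infty,$$
so $f^*$ indeed belongs to $\mathcal{H}_K$.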
The second assumption (4) is about the capacity of $\mathcal{H}_K$, measured by the effective dimension [22,23,24]
$$\mathcal{N}(\lambda) = \mathrm{Trace}\big((L_K+\lambda I)^{-1}L_K\big), \qquad \lambda>0,$$
where $I$ is the identity operator on $\mathcal{H}_K$. In this paper, we assume that
$$\mathcal{N}(\lambda) \le C\lambda^{-s} \qquad \text{for some } C>0,\ 0<s\le1. \tag{4}$$
This condition measures the complexity of $\mathcal{H}_K$ with respect to the marginal distribution $\rho_X$ and is typical in the analysis of kernel-based estimators. It is always satisfied with $s=1$ by taking the constant $C=\mathrm{Trace}(L_K)$. When $\mathcal{H}_K$ is a Sobolev space $W^\alpha(\mathcal{X})$, $\mathcal{X}\subset\mathbb{R}^n$, of functions with all derivatives of order up to $\alpha>\frac n2$, then (4) is satisfied with $s=\frac{n}{2\alpha}$ [25]. When $0<s<1$, (4) is weaker than the eigenvalue decay assumptions in the literature [17,23].
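In practice, the effective dimension can be approximated from a sample: the eigenvalues of the normalized Gram matrix on points drawn from $\rho_X$ approximate those of $L_K$, so $\mathcal{N}(\lambda)\approx\sum_i\mu_i/(\mu_i+\lambda)$. The following plug-in estimator is our illustration and is not part of the paper.

```python
import numpy as np

def effective_dimension(K_gram, lam):
    """Plug-in estimate of N(lambda) = Trace((L_K + lam I)^{-1} L_K), using the
    eigenvalues of K_gram / N as a proxy for the spectrum of the integral operator L_K."""
    mu = np.linalg.eigvalsh(K_gram / K_gram.shape[0])
    mu = np.clip(mu, 0.0, None)      # guard against small negative eigenvalues from round-off
    return float(np.sum(mu / (mu + lam)))
```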
The third assumption is about the conditional probability distribution $\rho(y|x)$ on the output space $\mathcal{Y}$. We assume that the output variable $Y$ satisfies the following moment condition: there exist two positive constants $t,M>0$ such that, for any integer $q\ge2$,
$$\mathbb{E}\big[\,|Y|^q\mid X\,\big] \le \frac12\,q!\,t^2M^{q-2}. \tag{5}$$
The assumption (5) covers many common distributions, for example, Gaussian and sub-Gaussian distributions, as well as distributions with compact support [26].
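For instance (an illustration, not stated in the paper), if $|Y|\le B$ almost surely, then (5) holds with $M=B$ and $t^2=2B^2$, since for every integer $q\ge2$,
$$\mathbb{E}\big[\,|Y|^q\mid X\,\big] \le B^q \le q!\,B^q = \frac12\,q!\,(2B^2)\,B^{q-2}.$$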
Now, we are ready to present the main results of this paper. Without loss of generality, we assume $\sup_{x\in\mathcal{X}}K(x,x)=1$.

2.1. Convergence in the Supervised Learning

The following error estimate for Algorithm (2) is the first result of this section, which presents the convergence of Huber regression with fully supervised data and will be proved in Section 3.
Theorem 1.
Define $f_{D,\lambda}$ by Algorithm (2) with the fully supervised data set $D=\{(x_i,y_i)\}_{i=1}^N$. Suppose that (3) holds for some $r>0$, and that (4) and (5) hold. If
$$\lambda = \begin{cases} N^{-\frac{1}{1+s}}, & \text{for } 0<r<\frac12,\\[2pt] N^{-\frac{1}{s+2\min\{1,r\}}}, & \text{for } r\ge\frac12, \end{cases} \tag{6}$$
then, for any $0<\delta<1$, with probability at least $1-\delta$,
$$\|f_{D,\lambda}-f^*\|_\rho \le C_1\max\Big\{\lambda^{\min\{r,1\}},\ \sigma^{-1}\lambda^{-\frac32}(\log N)^4\Big\}\left(\log\frac8\delta\right)^4, \tag{7}$$
where $C_1$ is a constant independent of $N$, $\delta$, and $\sigma$.
The above theorem shows that the parameter $\sigma$ in the Huber loss $\ell_\sigma$ balances the robustness of Algorithm (2) against its convergence rate. When the Huber loss is employed in nonparametric regression problems, enhanced robustness comes at the price of a slower convergence rate of Algorithm (2), so a trade-off has to be struck. It is then straightforward to obtain the following corollary, which provides explicit learning rates for (2) under a suitable choice of $\sigma$.
Corollary 1.
Under the same conditions as Theorem 1, if $\sigma\ge\lambda^{-(r+\frac32)}(\log N)^4$, then, with probability at least $1-\delta$,
$$\|f_{D,\lambda}-f^*\|_\rho = \begin{cases} O\!\left(N^{-\frac{r}{s+1}}\big(\log\frac8\delta\big)^4\right), & \text{for } 0<r<\frac12,\\[4pt] O\!\left(N^{-\min\left\{\frac{r}{s+2r},\,\frac{1}{s+2}\right\}}\big(\log\frac8\delta\big)^4\right), & \text{for } r\ge\frac12. \end{cases}$$
Remark 1.
The above corollary tells us that, when $\frac12\le r\le1$, Algorithm (2) achieves the error rate $O\big(N^{-\frac{r}{2r+s}}\big)$, which coincides with the minimax lower bound proved in [23,25] and is therefore optimal. We also notice that the convergence rate cannot improve further when $r>1$. This is referred to as the saturation phenomenon, which has been observed in a vast amount of literature [20,22,25].
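For concreteness, the following small helper evaluates the parameter choice (6), a value of $\sigma$ of the order required in Corollary 1, and the resulting rate exponent; it is a sketch for orientation only, with names of our choosing.

```python
import math

def theoretical_parameters(N, r, s):
    """Evaluate lambda from (6), sigma of order lambda^{-(r+3/2)} (log N)^4 (Corollary 1),
    and the exponent gamma such that the error rate is O(N^{-gamma})."""
    if r < 0.5:
        lam = N ** (-1.0 / (1.0 + s))
        rate_exp = r / (s + 1.0)
    else:
        lam = N ** (-1.0 / (s + 2.0 * min(1.0, r)))
        rate_exp = min(r / (s + 2.0 * r), 1.0 / (s + 2.0))
    sigma = lam ** (-(r + 1.5)) * math.log(N) ** 4
    return lam, sigma, rate_exp

# Example: N = 500, r = 0.5, s = 0.5 gives rate exponent 1/3, i.e. an error of order N^(-1/3).
```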

2.2. Convergence in the Semi-Supervised Learning

Although optimal convergence rates of Algorithm (2) were deduced in the previous subsection when $f^*$ lies in $\mathcal{H}_K$ ($r\ge\frac12$), the error rate for the case $0<r<\frac12$ needs improvement. In this subsection, we study the influence of unlabeled data on the convergence of (2) by using semi-supervised data.
Let an unlabeled data set $\tilde D(x)=\{\tilde x_i\}_{i=1}^{\tilde N}$ be drawn independently according to the marginal distribution $\rho_X$, where $\tilde N$ is the cardinality of $\tilde D(x)$. With the fully supervised data set $D=\{(x_i,y_i)\}_{i=1}^N$, we then introduce the data set associated with the semi-supervised Huber regression problem as $D^*=\{(x_i^*,y_i^*)\}_{i=1}^{N+\tilde N}$, given by
$$(x_i^*,y_i^*) = \begin{cases} \Big(x_i,\ \frac{N+\tilde N}{N}\,y_i\Big), & \text{for } 1\le i\le N,\\[4pt] (\tilde x_{i-N},\ 0), & \text{for } N+1\le i\le N+\tilde N. \end{cases} \tag{9}$$
By replacing D with D * in Algorithm (2), we then obtain the output function f D * , λ with semi-supervised data D * . The enhanced convergence results are as follows.
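The construction (9) is straightforward to implement: labeled responses are rescaled by $(N+\tilde N)/N$ and unlabeled points receive the pseudo-response $0$; this rescaling is exactly what keeps $\hat f_{\rho,D^*}=\hat f_{\rho,D}$ in the proof of Theorem 2. A minimal sketch (the function name is ours):

```python
import numpy as np

def build_semisupervised_set(X_lab, y_lab, X_unlab):
    """Construct D* of (9): (x_i, (N + Ntilde)/N * y_i) for labeled points,
    (x_tilde_{i-N}, 0) for unlabeled points."""
    N, Nt = len(y_lab), len(X_unlab)
    scale = (N + Nt) / N
    X_star = np.vstack([X_lab, X_unlab])
    y_star = np.concatenate([scale * np.asarray(y_lab, dtype=float), np.zeros(Nt)])
    return X_star, y_star
```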
Theorem 2.
Suppose that (3), (4), and (5) hold with $0<r\le1$, $r+s>\frac12$, and $\tilde N\ge\max\big\{N^{\frac{s+1}{2r+s}}-N+1,\ 1\big\}$. If $\lambda=N^{-\frac{1}{2r+s}}$, then, with probability at least $1-\delta$,
$$\|f_{D^*,\lambda}-f^*\|_\rho \le C_2\max\left\{N^{-\frac{r}{2r+s}},\ \Delta_{N,\tilde N,\lambda}\,\frac{(\log N)^4}{\sqrt\lambda\,\sigma}\right\}\left(\log\frac8\delta\right)^4, \tag{10}$$
where
$$\Delta_{N,\tilde N,\lambda} = \frac{N+\tilde N}{\lambda N} + \left(\frac{N+\tilde N}{N}\right)^2$$
and $C_2$ is a constant independent of $N$, $\tilde N$, $\sigma$, and $\delta$.
Based on the theorem above, we can obtain the improved convergence rate as follows.
Corollary 2.
Under the same conditions as Theorem 2, if
$$\sigma \ge N^{\frac{2r+1}{2(2r+s)}}\,\Delta_{N,\tilde N,\lambda}\,(\log N)^4,$$
then, with probability at least $1-\delta$,
$$\|f_{D^*,\lambda}-f^*\|_\rho = O\!\left(N^{-\frac{r}{2r+s}}\left(\log\frac8\delta\right)^4\right).$$
Remark 2.
Corollary 1 shows that, provided no unlabeled data are involved, the minimax optimal convergence rate for (2) is obtained only in the situation $r\ge\frac12$. When $0<r<\frac12$, the rate reduces to $O\big(N^{-\frac{r}{s+1}}\big)$. This means that the regression function $f^*$ is assumed to belong to $\mathcal{H}_K$ in order to achieve the optimal rate, which is difficult to verify in practice. In contrast, Corollary 2 tells us that, with sufficient unlabeled data $\tilde D(x)$ engaged in Algorithm (2), the minimax optimal rate $O\big(N^{-\frac{r}{2r+s}}\big)$ is retained for all $0<r\le1$. This removes the strict regularity condition on $f^*$.

3. Proofs

Now, we are in a position to prove the results stated in Section 2.

3.1. Useful Estimates

First, we will estimate the bound of $f_{D,\lambda}$ defined by (2). In the sequel, for notational simplicity, let $z=(x,y)$ and define the empirical operator $L_{K,D}:\mathcal{H}_K\to\mathcal{H}_K$ by
$$L_{K,D} := \frac1N\sum_{i=1}^N\langle\cdot,K_{x_i}\rangle_K\,K_{x_i}, \qquad z_i=(x_i,y_i)\in D,$$
so, for any $f\in\mathcal{H}_K$, $L_{K,D}f = \frac1N\sum_{i=1}^N f(x_i)K_{x_i}$. Then, we have the following representation for $f_{D,\lambda}$.
Lemma 1.
Define $f_{D,\lambda}$ by (2). Then, it satisfies
$$f_{D,\lambda} = (L_{K,D}+\lambda I)^{-1}\hat f_{\rho,D} + (L_{K,D}+\lambda I)^{-1}W_{D,\lambda}, \tag{13}$$
where
$$\hat f_{\rho,D} = \frac1N\sum_{i=1}^N y_iK_{x_i}, \qquad z_i=(x_i,y_i)\in D,$$
and
$$W_{D,\lambda} = \frac1N\sum_{i=1}^N\left[G'_+(0) - G'_+\!\left(\frac{\big(f_{D,\lambda}(x_i)-y_i\big)^2}{\sigma^2}\right)\right]\big(f_{D,\lambda}(x_i)-y_i\big)K_{x_i},$$
with $G'_+$ denoting the one-sided derivative of
$$G(s) = \begin{cases} s, & \text{if } 0\le s\le1,\\ 2\sqrt{s}-1, & \text{if } s\ge1. \end{cases}$$
Proof. 
Note that $\ell_\sigma(u) = \sigma^2G\!\left(\frac{u^2}{\sigma^2}\right)$. Since $f_{D,\lambda}$ is the minimizer of Algorithm (2), we take the gradient of the regularized functional on $\mathcal{H}_K$ to give
$$\frac1N\sum_{i=1}^N G'_+\!\left(\frac{\big(f_{D,\lambda}(x_i)-y_i\big)^2}{\sigma^2}\right)\big(f_{D,\lambda}(x_i)-y_i\big)K_{x_i} + \lambda f_{D,\lambda} = 0.$$
With the fact $G'_+(0)=1$, it yields
$$\frac1N\sum_{i=1}^N\big(f_{D,\lambda}(x_i)-y_i\big)K_{x_i} + \lambda f_{D,\lambda} - W_{D,\lambda} = 0,$$
which is $(L_{K,D}+\lambda I)f_{D,\lambda} - \hat f_{\rho,D} - W_{D,\lambda} = 0$.
The proof is complete. □
Based on the above lemma, we can obtain the bound of f D , λ .
Lemma 2.
Under the moment condition (5), with probability at least $1-\delta$, there holds
$$\|f_{D,\lambda}\|_K \le (4M+5t)\,\lambda^{-\frac12}\log\frac N\delta.$$
Proof. 
Under the moment condition (5), it has been proven in [27] that, with probability at least $1-\delta$, there holds
$$\max\big\{|y| : \text{there exists } x\in\mathcal{X} \text{ such that } (x,y)\in D\big\} \le (4M+5t)\log\frac N\delta. \tag{15}$$
By the definition of $f_{D,\lambda}$, we have that $\mathcal{E}_D(f_{D,\lambda}) + \lambda\|f_{D,\lambda}\|_K^2 \le \mathcal{E}_D(0)$. Thus,
$$\lambda\|f_{D,\lambda}\|_K^2 \le \mathcal{E}_D(0) = \frac1N\sum_{i=1}^N\ell_\sigma(y_i) \le \frac1N\sum_{i=1}^N y_i^2 \le \max_{(x,y)\in D}|y|^2.$$
It follows that
$$\|f_{D,\lambda}\|_K \le \lambda^{-\frac12}\max_{(x,y)\in D}|y|. \tag{16}$$
This together with (15) yields the desired conclusion. □
Furthermore, we see that
$$\|W_{D,\lambda}\|_K \le \sigma^{-1}\frac1N\sum_{i=1}^N\big(\|f_{D,\lambda}\|_K+|y_i|\big)^2 \le 2\sigma^{-1}\frac1N\sum_{i=1}^N\big(\|f_{D,\lambda}\|_K^2+|y_i|^2\big) \le 2\sigma^{-1}\Big(\|f_{D,\lambda}\|_K^2 + \max_{(x,y)\in D}|y|^2\Big). \tag{17}$$
This, in combination with the bounds (15) and (16), provides that, with probability at least $1-\delta$,
$$\|W_{D,\lambda}\|_K \le 2(4M+5t)^2\big(\lambda^{-1}+1\big)\,\sigma^{-1}\left(\log\frac N\delta\right)^2. \tag{18}$$

3.2. Error Decomposition

To derive the explicit convergence rate of Algorithm (2), we introduce the regularization function $f_\lambda$ in $\mathcal{H}_K$, defined by
$$f_\lambda := \arg\min_{f\in\mathcal{H}_K}\Big\{\mathcal{E}^{\mathrm{ls}}(f) + \lambda\|f\|_K^2\Big\}, \tag{19}$$
where $\mathcal{E}^{\mathrm{ls}}(f) = \int_{\mathcal{Z}}(f(x)-y)^2\,d\rho$ is the expected risk associated with the least squares loss. It is direct to verify that
$$f_\lambda = (L_K+\lambda I)^{-1}L_Kf^*,$$
so $f_\lambda - f^* = -\lambda(L_K+\lambda I)^{-1}f^*$. By the work in [20], we know that, under the regularity assumption (3) with $r>0$,
$$\|f_\lambda-f^*\|_\rho \le \begin{cases} \|h\|_\rho\,\lambda^r, & \text{when } 0<r\le1,\\ \|h\|_\rho\,\lambda, & \text{when } r>1, \end{cases} \tag{20}$$
and
$$\|f_\lambda\|_K \le \begin{cases} \|h\|_\rho\,\lambda^{r-\frac12}, & \text{when } 0<r<1/2,\\ \|h\|_\rho, & \text{when } r\ge1/2. \end{cases} \tag{21}$$
Now, we state the error decomposition for $f_{D,\lambda}-f_\lambda$. By (19), we have
$$L_{K,D}f_\lambda + \lambda f_\lambda = L_{K,D}f_\lambda - L_Kf_\lambda + L_Kf^*.$$
It implies
$$f_\lambda = (L_{K,D}+\lambda I)^{-1}\big[L_Kf^* - (L_K-L_{K,D})f_\lambda\big],$$
which, together with (13), leads to the decomposition
$$f_{D,\lambda}-f_\lambda = (L_{K,D}+\lambda I)^{-1}(L_K-L_{K,D})f_\lambda + (L_{K,D}+\lambda I)^{-1}\big(\hat f_{\rho,D}-L_Kf^*\big) + (L_{K,D}+\lambda I)^{-1}W_{D,\lambda}. \tag{23}$$
In the sequel, we denote
$$B_{D,\lambda} = \big\|(L_{K,D}+\lambda I)^{-1}(L_K+\lambda I)\big\|, \qquad C_{D,\lambda} = \big\|(L_K+\lambda I)^{-\frac12}(L_K-L_{K,D})\big\|, \qquad G_{D,\lambda} = \big\|(L_K+\lambda I)^{-\frac12}\big(\hat f_{\rho,D}-L_Kf^*\big)\big\|_K.$$
Noting that, for any $f\in\mathcal{H}_K$,
$$\max\big\{\|f\|_\rho,\ \sqrt\lambda\,\|f\|_K\big\} \le \big\|(L_K+\lambda I)^{\frac12}f\big\|_K \tag{24}$$
by the fact $\|f\|_\rho = \|L_K^{1/2}f\|_K$ [21], one obtains the following bound for the sample error $\|f_{D,\lambda}-f_\lambda\|_\rho$ from the decomposition (23) above.
Proposition 1.
Define $f_{D,\lambda}$ by (2). Then, there holds
$$\|f_{D,\lambda}-f_\lambda\|_\rho \le B_{D,\lambda}C_{D,\lambda}\|f_\lambda\|_K + B_{D,\lambda}G_{D,\lambda} + \lambda^{-\frac12}B_{D,\lambda}\|W_{D,\lambda}\|_K. \tag{25}$$
Proof. 
Let $I_1$, $I_2$, and $I_3$ denote the three terms on the right-hand side of (23), respectively. Consider the $\mathcal{H}_K$ norm of
$$(L_K+\lambda I)^{1/2}(f_{D,\lambda}-f_\lambda) = (L_K+\lambda I)^{1/2}(I_1+I_2+I_3).$$
Then,
$$\big\|(L_K+\lambda I)^{1/2}I_1\big\|_K \le \big\|(L_K+\lambda I)^{1/2}(L_{K,D}+\lambda I)^{-1/2}\big\|\,\big\|(L_{K,D}+\lambda I)^{-1/2}(L_K+\lambda I)^{1/2}\big\| \times \big\|(L_K+\lambda I)^{-1/2}(L_K-L_{K,D})f_\lambda\big\|_K \le B_{D,\lambda}C_{D,\lambda}\|f_\lambda\|_K.$$
Similarly,
$$\big\|(L_K+\lambda I)^{1/2}I_2\big\|_K \le \big\|(L_K+\lambda I)^{1/2}(L_{K,D}+\lambda I)^{-1}(L_K+\lambda I)^{1/2}\big\|\,\big\|(L_K+\lambda I)^{-1/2}\big(\hat f_{\rho,D}-L_Kf^*\big)\big\|_K \le B_{D,\lambda}G_{D,\lambda},$$
and
$$\big\|(L_K+\lambda I)^{1/2}I_3\big\|_K \le \big\|(L_K+\lambda I)^{1/2}(L_{K,D}+\lambda I)^{-1}(L_K+\lambda I)^{1/2}\big\|\,\frac{1}{\sqrt\lambda}\,\|W_{D,\lambda}\|_K \le \lambda^{-1/2}B_{D,\lambda}\|W_{D,\lambda}\|_K.$$
With the above bounds, we use (24) to obtain the statement.
The proof is finished. □

3.3. Deriving Main Results

To prove our main results, we need to bound the quantities $B_{D,\lambda}$, $C_{D,\lambda}$, and $G_{D,\lambda}$ by the following probability estimates.
Lemma 3.
With a confidence of at least $1-\delta$, there holds
$$B_{D,\lambda} \le 2\left(\left(\frac{2A_{D,\lambda}\log\frac2\delta}{\sqrt\lambda}\right)^2 + 2\right), \qquad C_{D,\lambda} \le 2A_{D,\lambda}\log\frac2\delta, \qquad G_{D,\lambda} \le 4(M+t)\,A_{D,\lambda}\log\frac2\delta,$$
where $A_{D,\lambda} = \frac{1}{N\sqrt\lambda} + \sqrt{\frac{\mathcal{N}(\lambda)}{N}}$.
These inequalities are well studied in the literature and can be found in [17,18].
Proof of Theorem 1.
We can decompose $\|f_{D,\lambda}-f^*\|_\rho$ into the sample error $\|f_{D,\lambda}-f_\lambda\|_\rho$ and the approximation error $\|f_\lambda-f^*\|_\rho$. As stated in (20), $\|f_\lambda-f^*\|_\rho \le \lambda^r\|h\|_\rho$ for $0<r\le1$. Thus, we just estimate $\|f_{D,\lambda}-f_\lambda\|_\rho$ by Proposition 1.
By Lemma 3 and the bound (18), with probability at least $1-4\delta$, the following bounds hold simultaneously:
$$B_{D,\lambda}C_{D,\lambda}\|f_\lambda\|_K \le 4\left(\left(\frac{2A_{D,\lambda}}{\sqrt\lambda}\right)^2+2\right)A_{D,\lambda}\left(\log\frac2\delta\right)^3\|f_\lambda\|_K, \qquad B_{D,\lambda}G_{D,\lambda} \le 8(M+t)\left(\left(\frac{2A_{D,\lambda}}{\sqrt\lambda}\right)^2+2\right)A_{D,\lambda}\left(\log\frac2\delta\right)^3,$$
and
$$\lambda^{-\frac12}B_{D,\lambda}\|W_{D,\lambda}\|_K \le 8(4M+5t)^2\left(\left(\frac{2A_{D,\lambda}}{\sqrt\lambda}\right)^2+2\right)\lambda^{-\frac32}\sigma^{-1}\left(\log\frac N\delta\right)^4.$$
Scaling $4\delta$ to $\delta$, by (20) and the estimates above, we have, with confidence at least $1-\delta$,
$$\|f_{D,\lambda}-f^*\|_\rho \le \|f_{D,\lambda}-f_\lambda\|_\rho + \|f_\lambda-f^*\|_\rho \le 24(4M+5t)^2\left(\left(\frac{2A_{D,\lambda}}{\sqrt\lambda}\right)^2+2\right)\left[\big(A_{D,\lambda}+A_{D,\lambda}\|f_\lambda\|_K\big)\left(\log\frac8\delta\right)^3 + \lambda^{-\frac32}\sigma^{-1}\left(\log\frac{4N}\delta\right)^4\right] + \|h\|_\rho\,\lambda^r. \tag{26}$$
By (4),
$$A_{D,\lambda} = \frac{1}{N\sqrt\lambda} + \sqrt{\frac{\mathcal{N}(\lambda)}{N}} \le \frac{1}{N\sqrt\lambda} + \sqrt{\frac{C\lambda^{-s}}{N}} \le (C+1)\,\frac{\lambda^{-\frac s2}}{\sqrt N}\left(\frac{\lambda^{\frac{s-1}{2}}}{\sqrt N}+1\right).$$
The choice (6) of $\lambda$ results in the facts that
$$\frac{\lambda^{-\frac s2}}{\sqrt N} = \begin{cases} \lambda^{-\frac s2+\frac{1+s}{2}} = \lambda^{\frac12}, & \text{when } 0<r<\frac12,\\[2pt] \lambda^{-\frac s2+\frac s2+\min\{1,r\}} = \lambda^{\min\{1,r\}}, & \text{when } r\ge\frac12, \end{cases}$$
$$\frac{\lambda^{\frac{s-1}{2}}}{\sqrt N} \le \begin{cases} \lambda^{\frac{s-1}{2}+\frac{1+s}{2}}, & \text{when } 0<r<\frac12,\\[2pt] \lambda^{\frac{s-1}{2}+\frac s2+\min\{1,r\}}, & \text{when } r\ge\frac12, \end{cases} \le \lambda^{s} \le 1,$$
and
$$\frac{\lambda^{-s-1}}{N} \le \begin{cases} \lambda^{-s-1}\lambda^{s+1} = 1, & \text{when } 0<r<\frac12,\\[2pt] \lambda^{-1-s}\lambda^{s+2\min\{r,1\}} = \lambda^{\min\{2r-1,1\}}, & \text{when } r\ge\frac12, \end{cases} \le 1.$$
Collecting the above estimates,
$$\left(\frac{A_{D,\lambda}}{\sqrt\lambda}\right)^2 \le (C+1)^2\,\frac{\lambda^{-s-1}}{N}\left(\frac{\lambda^{\frac{s-1}{2}}}{\sqrt N}+1\right)^2 \le 4(C+1)^2 \tag{30}$$
and
$$A_{D,\lambda} \le \begin{cases} \lambda^{\frac12}, & \text{when } 0<r<\frac12,\\ \lambda^{\min\{1,r\}}, & \text{when } r\ge\frac12. \end{cases} \tag{31}$$
Putting (21), (30), and (31) into (26), we can get (7) with
$$C_1 = 96(4M+5t)^2\big[(C+1)^2+1\big]\big(2+\|h\|_\rho\big).$$
The proof is complete. □
Proof of Theorem 2.
Similarly to the proof of (25), there holds
$$\|f_{D^*,\lambda}-f_\lambda\|_\rho \le B_{D^*,\lambda}C_{D^*,\lambda}\|f_\lambda\|_K + B_{D^*,\lambda}G_{D^*,\lambda} + \lambda^{-\frac12}B_{D^*,\lambda}\|W_{D^*,\lambda}\|_K. \tag{32}$$
Note that, by (9),
$$\hat f_{\rho,D^*} = \frac{1}{N+\tilde N}\sum_{i=1}^{N+\tilde N}y_i^*K_{x_i^*} = \frac{1}{N+\tilde N}\sum_{i=1}^{N}\frac{N+\tilde N}{N}\,y_iK_{x_i} = \frac1N\sum_{i=1}^N y_iK_{x_i} = \hat f_{\rho,D}.$$
It means that $G_{D^*,\lambda}=G_{D,\lambda}$. Furthermore, similarly to (16), we have
$$\|f_{D^*,\lambda}\|_K^2 \le \frac{N+\tilde N}{N\lambda}\,\max_{(x,y)\in D}|y|^2.$$
In addition, by (17),
$$\|W_{D^*,\lambda}\|_K \le \sigma^{-1}\frac{1}{N+\tilde N}\sum_{i=1}^{N+\tilde N}\big(\|f_{D^*,\lambda}\|_K+|y_i^*|\big)^2 \le 2\sigma^{-1}\frac{1}{N+\tilde N}\sum_{i=1}^{N+\tilde N}\big(\|f_{D^*,\lambda}\|_K^2+|y_i^*|^2\big) \le 2\sigma^{-1}\left(\|f_{D^*,\lambda}\|_K^2 + \left(\frac{N+\tilde N}{N}\right)^2\max_{(x,y)\in D}|y|^2\right).$$
Then, by (15), with confidence at least $1-\delta$,
$$\|W_{D^*,\lambda}\|_K \le 2(4M+5t)^2\left(\frac{N+\tilde N}{N\lambda}+\left(\frac{N+\tilde N}{N}\right)^2\right)\sigma^{-1}\left(\log\frac N\delta\right)^2.$$
This, together with Lemma 3, yields that, with confidence at least $1-4\delta$,
$$B_{D^*,\lambda}C_{D^*,\lambda}\|f_\lambda\|_K \le 4\left(\left(\frac{2A_{D^*,\lambda}}{\sqrt\lambda}\right)^2+2\right)A_{D^*,\lambda}\left(\log\frac2\delta\right)^3\|f_\lambda\|_K, \qquad B_{D^*,\lambda}G_{D^*,\lambda} = B_{D^*,\lambda}G_{D,\lambda} \le 8(M+t)\left(\left(\frac{2A_{D^*,\lambda}}{\sqrt\lambda}\right)^2+2\right)A_{D,\lambda}\left(\log\frac2\delta\right)^3,$$
and
$$\lambda^{-\frac12}B_{D^*,\lambda}\|W_{D^*,\lambda}\|_K \le 8(4M+5t)^2\left(\left(\frac{2A_{D^*,\lambda}}{\sqrt\lambda}\right)^2+2\right)\Delta_{N,\tilde N,\lambda}\,\lambda^{-\frac12}\sigma^{-1}\left(\log\frac N\delta\right)^4.$$
Scaling $4\delta$ to $\delta$, by (32) and (20), we have, with confidence at least $1-\delta$,
$$\|f_{D^*,\lambda}-f^*\|_\rho \le \|f_{D^*,\lambda}-f_\lambda\|_\rho + \|f_\lambda-f^*\|_\rho \le 24(4M+5t)^2\left(\left(\frac{2A_{D^*,\lambda}}{\sqrt\lambda}\right)^2+2\right)\left[\big(A_{D,\lambda}+A_{D^*,\lambda}\|f_\lambda\|_K\big)\left(\log\frac8\delta\right)^3 + \Delta_{N,\tilde N,\lambda}\,\lambda^{-\frac12}\sigma^{-1}\left(\log\frac{4N}\delta\right)^4\right] + \|h\|_\rho\,\lambda^r. \tag{33}$$
Thus, to prove Theorem 2, we need the following estimates. Since $r+s>\frac12$ and $\lambda=N^{-\frac{1}{2r+s}}$,
$$\frac{\lambda^{\frac{s-1}{2}}}{\sqrt N} = N^{\frac{1-2(r+s)}{2(2r+s)}} \le 1, \qquad \frac{\lambda^{-\frac s2}}{\sqrt N} = \lambda^r.$$
Then, by (4),
$$A_{D,\lambda} = \frac{1}{N\sqrt\lambda} + \sqrt{\frac{\mathcal{N}(\lambda)}{N}} \le \frac{1}{N\sqrt\lambda} + \sqrt{\frac{C\lambda^{-s}}{N}} \le (C+1)\,\frac{\lambda^{-\frac s2}}{\sqrt N}\left(\frac{\lambda^{\frac{s-1}{2}}}{\sqrt N}+1\right) \le 2(C+1)\lambda^r$$
and
$$A_{D^*,\lambda} = \frac{1}{(N+\tilde N)\sqrt\lambda} + \sqrt{\frac{\mathcal{N}(\lambda)}{N+\tilde N}} \le \frac{1}{(N+\tilde N)\sqrt\lambda} + \sqrt{\frac{C\lambda^{-s}}{N+\tilde N}} \le (C+1)\,\frac{\lambda^{-\frac s2}}{\sqrt{N+\tilde N}}\left(\frac{\lambda^{\frac{s-1}{2}}}{\sqrt{N+\tilde N}}+1\right) \le 2(C+1)\,\frac{\lambda^{-\frac s2}}{\sqrt{N+\tilde N}}.$$
Thus,
$$\left(\frac{A_{D^*,\lambda}}{\sqrt\lambda}\right)^2 + 1 \le 4(C+1)^2\,\frac{\lambda^{-s-1}}{N+\tilde N} + 1 \le 4(C+1)^2 + 1.$$
Furthermore, by (21),
$$A_{D^*,\lambda}\|f_\lambda\|_K \le \begin{cases} 2(C+1)\|h\|_\rho\,\frac{\lambda^{-\frac s2+r-\frac12}}{\sqrt{N+\tilde N}}, & \text{when } 0<r<1/2,\\[4pt] 2(C+1)\|h\|_\rho\,\frac{\lambda^{-\frac s2}}{\sqrt{N+\tilde N}}, & \text{when } r\ge1/2. \end{cases}$$
By the restriction $\tilde N\ge\max\big\{N^{\frac{s+1}{2r+s}}-N+1,\ 1\big\}$, we conclude that
$$A_{D^*,\lambda}\|f_\lambda\|_K \le 2(C+1)\|h\|_\rho\,\lambda^r, \qquad \text{for } r>0.$$
Putting the estimates above into (33) yields the desired conclusion (10) with
$$C_2 = 96(4M+5t)^2\big[(C+1)^2+(C+1)+1\big]\big(2+\|h\|_\rho\big).$$
The proof is finished. □

4. Numerical Simulation

In this part, we carry out simulations to verify our theoretical statements, using the mean squared error on a testing set for comparison. We generate $N=500$ labeled data $\{x_i,y_i\}_{i=1}^{500}$ from the regression model $y_i=f^*(x_i)+\epsilon$, where $f^*(x)=x(1-x)$, the inputs $x_i$ are drawn independently from the normal distribution $N(0,1)$, and $\epsilon$ is independent Gaussian noise $N(0,0.005)$. We also generate $\tilde N=200$ unlabeled data $\{\tilde x_i\}_{i=1}^{200}$ with $\tilde x_i$ drawn independently from the uniform distribution on $[0,1]$. We choose the Gaussian kernel $K(x,u)=\exp\{-|x-u|^2/(2h^2)\}$ with bandwidth $h=5$ and the regularization parameter $\lambda=0.7$. In Figure 1, Algorithm 1 denotes the regularized Huber regression (2) trained on the supervised data set $D=\{x_i,y_i\}_{i=1}^{500}$, while Algorithm 2 denotes the same estimator trained on the semi-supervised data set $D^*$ constructed by (9). The error of Algorithm 2 is already clearly smaller than that of Algorithm 1 when 20 unlabeled data are added to the training data, and as the number of unlabeled data increases from 20 to 200, the error curve of Algorithm 2 decreases continuously. These experimental results coincide with our theoretical analysis; see Figure 1.
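For reference, the following script is a rough reconstruction of the setup described above, reusing the sketches gaussian_kernel, kernel_huber_irls, and build_semisupervised_set introduced earlier; the robustification parameter sigma and the test-set construction are our own choices, since they are not reported in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
f_star = lambda x: x * (1.0 - x)

N = 500
X = rng.normal(0.0, 1.0, size=(N, 1))                      # labeled inputs ~ N(0, 1)
y = f_star(X[:, 0]) + rng.normal(0.0, np.sqrt(0.005), N)   # y = f*(x) + Gaussian noise
X_unlab = rng.uniform(0.0, 1.0, size=(200, 1))             # unlabeled inputs ~ Uniform[0, 1]
X_test = rng.uniform(0.0, 1.0, size=(1000, 1))             # test inputs for the MSE

h, lam, sigma = 5.0, 0.7, 1.0                              # h, lam from the paper; sigma assumed

def test_mse(X_train, y_train):
    _, alpha = kernel_huber_irls(X_train, y_train, sigma, lam, h)
    pred = gaussian_kernel(X_test, X_train, h) @ alpha
    return float(np.mean((pred - f_star(X_test[:, 0])) ** 2))

print("supervised (Algorithm 1):", test_mse(X, y))
for m in range(20, 201, 20):                               # add 20, 40, ..., 200 unlabeled points
    Xs, ys = build_semisupervised_set(X, y, X_unlab[:m])
    print(m, "unlabeled (Algorithm 2):", test_mse(Xs, ys))
```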

5. Discussion

Unlabeled data are ubiquitous in a variety of fields, including signal processing, privacy-sensitive applications, feature selection, and data clustering. To exploit the robustness of Huber regression in such applications, we adopted a semi-supervised learning method for our regularized Huber regression algorithm. We derived the explicit learning rate of Algorithm (2) in the supervised setting, which is comparable to the minimax optimal rate of OLS. By the semi-supervised method, we showed that an influx of unlabeled data can improve the learning performance of Huber regression analysis. This suggests that using the additional information in unlabeled data can extend the applicability of Huber regression.

Author Contributions

Conceptualization, Y.W.; Funding acquisition, C.P.; Methodology, B.W. and X.L.; Project administration, B.W.; Resources, X.L.; Supervision, C.P. and H.Y.; Writing—original draft, Y.W.; Writing—review & editing, H.Y. All authors have read and agreed to the published version of the manuscript.

Funding

The work described in this paper is partially supported by the National Natural Science Foundation of China (Project 12071356).

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Huber, P.J. Robust Estimation of a Location Parameter. Ann. Math. Stat. 1964, 35, 73–101.
2. Huber, P.J. Robust Regression: Asymptotics, Conjectures and Monte Carlo. Ann. Stat. 1973, 1, 799–821.
3. Christmann, A.; Steinwart, I. Consistency and robustness of kernel based regression. Bernoulli 2007, 13, 799–819.
4. Fan, J.; Li, Q.; Wang, Y. Estimation of high dimensional mean regression in the absence of symmetry and light tail assumptions. J. R. Stat. Soc. 2017, 79, 247–265.
5. Feng, Y.; Wu, Q. A statistical learning assessment of Huber regression. J. Approx. Theory 2022, 273, 105660.
6. Loh, P.L. Statistical consistency and asymptotic normality for high-dimensional robust M-estimators. Ann. Stat. 2017, 45, 866–896.
7. Rao, B. Asymptotic behavior of M-estimators for the linear model with dependent errors. Bull. Inst. Math. Acad. Sin. 1981, 9, 367–375.
8. Sun, Q.; Zhou, W.; Fan, J. Adaptive Huber Regression. J. Am. Stat. Assoc. 2020, 115, 254–265.
9. Wang, Z.; Liu, H.; Zhang, T. Optimal computational and statistical rates of convergence for sparse nonconvex learning problems. Ann. Stat. 2014, 42, 2164–2201.
10. Chapelle, O.; Schölkopf, B.; Zien, A. Semi-Supervised Learning; MIT Press: Cambridge, MA, USA, 2006.
11. Belkin, M.; Niyogi, P. Semi-Supervised Learning on Riemannian Manifolds. Mach. Learn. 2004, 56, 209–239.
12. Blum, A.; Mitchell, T. Combining Labeled and Unlabeled Data with Co-Training. In Proceedings of the 11th Annual Conference on Computational Learning Theory, Madison, WI, USA, 24–26 July 1998.
13. Wang, J.; Jebara, T.; Chang, S.F. Semi-Supervised Learning Using Greedy Max-Cut. J. Mach. Learn. Res. 2013, 14, 771–800.
14. Caponnetto, A.; Yao, Y. Cross-validation based adaptation for regularization operators in learning theory. Anal. Appl. 2010, 8, 161–183.
15. Guo, X.; Hu, T.; Wu, Q. Distributed Minimum Error Entropy Algorithms. J. Mach. Learn. Res. 2020, 21, 1–31.
16. Hu, T.; Fan, J.; Xiang, D.H. Convergence Analysis of Distributed Multi-Penalty Regularized Pairwise Learning. Anal. Appl. 2019, 18, 109–127.
17. Lin, S.B.; Guo, X.; Zhou, D.X. Distributed Learning with Regularized Least Squares. J. Mach. Learn. Res. 2017, 18, 3202–3232.
18. Lin, S.B.; Zhou, D.X. Distributed Kernel-Based Gradient Descent Algorithms. Constr. Approx. 2018, 47, 249–276.
19. Aronszajn, N. Theory of reproducing kernels. Trans. Am. Math. Soc. 1950, 68, 337–404.
20. Smale, S.; Zhou, D.X. Learning Theory Estimates via Integral Operators and Their Approximations. Constr. Approx. 2007, 26, 153–172.
21. Cucker, F.; Zhou, D.X. Learning Theory: An Approximation Theory Viewpoint; Cambridge University Press: Cambridge, UK, 2007.
22. Bauer, F.; Pereverzev, S.; Rosasco, L. On regularization algorithms in learning theory. J. Complex. 2007, 23, 52–72.
23. Caponnetto, A.; De Vito, E. Optimal Rates for the Regularized Least-Squares Algorithm. Found. Comput. Math. 2007, 7, 331–368.
24. Zhang, T. Effective Dimension and Generalization of Kernel Learning. In Proceedings of the Advances in Neural Information Processing Systems 15 (NIPS 2002), Vancouver, BC, Canada, 9–14 December 2002.
25. Mendelson, S.; Neeman, J. Regularization in kernel learning. Ann. Stat. 2010, 38, 526–565.
26. Raskutti, G.; Wainwright, M.J.; Yu, B. Early stopping and non-parametric regression. J. Mach. Learn. Res. 2014, 15, 335–366.
27. Wang, C.; Hu, T. Online minimum error entropy algorithm with unbounded sampling. Anal. Appl. 2019, 17, 293–322.
Figure 1. The number of unlabeled data.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
