Article

Semi-Supervised Minimum Error Entropy Principle with Distributed Method

Baobin Wang 1 and Ting Hu 2,*
1 School of Mathematics and Statistics, South-Central University for Nationalities, Wuhan 430074, China
2 School of Mathematics and Statistics, Wuhan University, Wuhan 430072, China
* Author to whom correspondence should be addressed.
Submission received: 26 October 2018 / Revised: 8 December 2018 / Accepted: 10 December 2018 / Published: 14 December 2018
(This article belongs to the Section Information Theory, Probability and Statistics)

Abstract

The minimum error entropy (MEE) principle is an alternative to classical least squares, valued for its robustness to non-Gaussian noise. This paper studies the gradient descent algorithm for MEE in a semi-supervised setting with a distributed method, and shows that the additional information carried by unlabeled data can enhance the learning ability of the distributed MEE algorithm. Our result proves that the mean squared error of the distributed gradient descent MEE algorithm remains minimax optimal for regression even when the number of local machines grows polynomially with the total data size.

1. Introduction

The minimum error entropy (MEE) principle is an important criterion proposed in information theoretic learning (ITL) [1] and was first applied to adaptive system training by Erdogmus and Principe [2]. It has since been applied to blind source separation, maximally informative subspace projections, clustering, feature selection, blind deconvolution, minimum cross-entropy model selection, and other topics [3,4,5,6,7,8]. By taking an entropy of the error as the training criterion, the MEE principle fully exploits the information contained in the data and is robust to outliers in the implementation of algorithms.
Let $X$ be an explanatory variable taking values in a compact metric space $(\mathcal{X}, d)$ with $\mathcal{X} \subseteq \mathbb{R}^n$, let $Y$ be a real response variable with values in $\mathcal{Y} \subseteq \mathbb{R}$, and let $g: \mathcal{X} \to \mathcal{Y}$ be a prediction function. For a given set of labeled examples $D = \{(x_i, y_i)\}_{i=1}^{N} \subset \mathcal{X} \times \mathcal{Y}$ ($N$ denotes the sample size) and a windowing function $G: \mathbb{R} \to \mathbb{R}_+$, the MEE principle seeks a minimizer of the empirical quadratic entropy
$$\hat{H}(g) = -\log\left(\frac{h^2}{N^2}\sum_{(x_i,y_i)\in D}\sum_{(x_j,y_j)\in D} G\left(\frac{\left[(y_i - g(x_i)) - (y_j - g(x_j))\right]^2}{h^2}\right)\right),$$
where $h > 0$ is a scaling parameter. The goal is to solve the regression problem $y = g_\rho(x) + \varepsilon$, where $\varepsilon$ is the noise and $g_\rho(x)$ is the target function. Setting $f(x_i, x_j) := g(x_i) - g(x_j)$, MEE belongs to the family of pairwise learning problems, which involve interactions between example pairs. Since the logarithmic function is monotonic, in the optimization process we only consider the empirical information error of MEE,
$$\mathcal{R}(f) = -\frac{h^2}{N^2}\sum_{(x_i,y_i)\in D}\sum_{(x_j,y_j)\in D} G\left(\frac{\left[y_i - y_j - f(x_i, x_j)\right]^2}{h^2}\right). \qquad (1)$$
Borrowing the idea from Reference [9], we introduce a Mercer kernel $K(\cdot,\cdot): \mathcal{X}^2 \times \mathcal{X}^2 \to \mathbb{R}$ (where $\mathcal{X}^2 := \mathcal{X} \times \mathcal{X}$) and employ the reproducing kernel Hilbert space (RKHS) $\mathcal{H}_K$ as our hypothesis space. With $K$, $\mathcal{H}_K$ is defined as the closure of the linear span of the function set $\{K_{(x,u)} := K((x,u), (\cdot,\cdot)): (x,u) \in \mathcal{X}^2\}$, equipped with the inner product $\langle\cdot,\cdot\rangle_K$ and the reproducing property $\langle K_{(x,u)}, K_{(x',u')}\rangle_K = K((x,u),(x',u'))$ for $(x,u), (x',u') \in \mathcal{X}^2$. Since $G$ is nonconvex, Equation (1) is usually solved by the following kernel-based gradient descent method. It starts with $f_{1,D} = 0$ and is updated in the $t$-th step by
$$f_{t+1,D} = f_{t,D} - \eta\,\nabla\mathcal{R}(f_{t,D}), \qquad (2)$$
where $\eta > 0$ is a step size, $\nabla$ is the gradient operator, and
$$\nabla\mathcal{R}(f_{t,D}) = -\frac{1}{N^2}\sum_{(x_i,y_i)\in D}\sum_{(x_j,y_j)\in D} G'\left(\frac{\left[y_i - y_j - f_{t,D}(x_i, x_j)\right]^2}{h^2}\right)\left[f_{t,D}(x_i, x_j) - y_i + y_j\right]K_{(x_i,x_j)}.$$
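In coefficient form, each iterate $f_{t,D}$ lies in the span of $\{K_{(x_i,x_j)}\}$, so the update (2) can be carried out on a coefficient vector over all example pairs. The following is a minimal sketch of this, under our own illustrative choices of $h$, $\eta$, $T$, and $G(u) = e^{-u}$ (so $G'(u) = -e^{-u}$); it is only practical for small $N$, since the pairwise Gram matrix has size $N^2 \times N^2$.

```python
import numpy as np

def kernel_gd_mee(X, y, pair_kernel, h=1.0, eta=0.5, T=100,
                  G_prime=lambda u: -np.exp(-u)):
    """Sketch of the kernel gradient descent MEE update (Equation (2)).

    X: (N, d) inputs, y: (N,) labels.
    pair_kernel(P, Q): Gram matrix between two arrays whose rows are
                       concatenated input pairs (x, x').
    h, eta, T and G_prime are illustrative choices, not values from the paper.
    """
    N = len(y)
    X = np.asarray(X).reshape(N, -1)
    # all example pairs (x_i, x_j) and pairwise responses y_i - y_j
    I, J = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")
    pairs = np.hstack([X[I.ravel()], X[J.ravel()]])     # (N^2, 2d)
    y_diff = (y[I] - y[J]).ravel()                      # (N^2,)

    Kmat = pair_kernel(pairs, pairs)                    # (N^2, N^2) pairwise Gram matrix
    alpha = np.zeros(N * N)                             # f_t = sum_k alpha_k K_{pair_k}
    for _ in range(T):
        f_vals = Kmat @ alpha                           # f_t(x_i, x_j) on all pairs
        resid = y_diff - f_vals                         # y_i - y_j - f_t(x_i, x_j)
        # gradient coefficient: -(1/N^2) G'(resid^2/h^2) * (f_t - y_i + y_j)
        grad_coef = -(1.0 / N**2) * G_prime(resid**2 / h**2) * (-resid)
        alpha -= eta * grad_coef                        # f_{t+1} = f_t - eta * grad
    return pairs, alpha                                 # f_T(z) = sum_k alpha_k K(pair_k, z)
```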
As is well known, the number of example pairs grows quadratically with the sample size $N$, which creates a heavy computational burden in the MEE implementation. It is therefore natural to reduce the algorithmic complexity by a distributed method based on a divide-and-conquer strategy [10]. Semi-supervised learning (SSL) [11] has attracted extensive attention as an emerging field in machine learning and data mining. In many practical problems, only a few labeled data are available while unlabeled data are abundant, since labeling data requires considerable time, effort, or money. In this paper, we study a distributed MEE algorithm in the framework of SSL and show that the learning ability of the MEE algorithm can be enhanced by the distributed method together with the combination of labeled and unlabeled data.
This paper makes three main contributions. First, we derive an explicit learning rate for the gradient descent method for distributed MEE in the context of SSL, which matches the minimax optimal rate of least squares regression. This implies that the MEE algorithm can serve as an alternative to least squares in SSL, in the sense that both have the same prediction power. Second, we provide a theoretical upper bound on the number of local machines that guarantees the optimal rate in the distributed computation. Third, we extend the range of target functions allowed in the distributed MEE algorithm.
In Table 1, we summarize some notations used in this paper.

2. Algorithms and Main Results

We consider MEE for the regression problem. To allow noise in the sampling process, we assume that a Borel measure $\rho(\cdot,\cdot)$ is defined on the product space $\mathcal{X} \times \mathcal{Y}$. Let $\rho(y|x)$ be the conditional distribution of $y \in \mathcal{Y}$ given $x \in \mathcal{X}$, and $\rho_X(\cdot)$ the marginal distribution on $\mathcal{X}$. For the semi-supervised MEE algorithm, our goal is to estimate the regression function $g_\rho(x) = \int_{\mathcal{Y}} y\, d\rho(y|x)$, $x \in \mathcal{X}$, from labeled examples $D = \{(x_i, y_i)\}_{i=1}^{N}$ and unlabeled examples $D^* = \{x_j^*\}_{j=1}^{S}$ drawn from the distributions $\rho$ and $\rho_X$, respectively.
Based on the divide-and-conquer strategy, both $D$ and $D^*$ are partitioned evenly into $m$ subsets, $D = \bigcup_{l=1}^{m} D_l$ and $D^* = \bigcup_{l=1}^{m} D_l^*$. We denote the subset sizes by $|D_l| = n$ and $|D_l^*| = s$ for $1 \le l \le m$, i.e., $N = mn$ and $S = ms$. We construct a new dataset $\tilde{D} = \bigcup_{l=1}^{m} \tilde{D}_l$ by
$$\tilde{D}_l = D_l \cup D_l^* = \{(\tilde{x}_k, \tilde{y}_k)\}_{k=1}^{n+s},$$
where
$$\tilde{x}_k = \begin{cases} x_k, & \text{if } (x_k, y_k) \in D_l, \\ x_k^*, & \text{if } x_k^* \in D_l^*, \end{cases} \qquad \tilde{y}_k = \begin{cases} \dfrac{n+s}{n}\, y_k, & \text{if } (x_k, y_k) \in D_l, \\ 0, & \text{if } x_k^* \in D_l^*. \end{cases}$$
Based on the gradient descent algorithm (Equation (2)), we obtain a local estimator $f_{t,\tilde{D}_l}$ for each subset $\tilde{D}_l$, $1 \le l \le m$. The global estimator is then given by averaging these local estimators:
$$\bar{f}_{t,\tilde{D}} = \frac{1}{m}\sum_{l=1}^{m} f_{t,\tilde{D}_l}. \qquad (3)$$
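A small sketch (ours) of this construction: each local training set rescales the labeled responses by $(n+s)/n$ and appends the unlabeled points with label $0$, and the global estimator is the plain average of the local outputs. The local solver `train_local` stands for one run of the kernel gradient descent of Equation (2) (e.g., the `kernel_gd_mee` sketch above) and is assumed to return a callable estimator.

```python
import numpy as np

def make_block(x_lab, y_lab, x_unlab):
    """One local training set D~_l: labeled labels rescaled by (n+s)/n, unlabeled labels set to 0."""
    n, s = len(y_lab), len(x_unlab)
    x_tilde = np.concatenate([x_lab, x_unlab])
    y_tilde = np.concatenate([(n + s) / n * y_lab, np.zeros(s)])
    return x_tilde, y_tilde

def distributed_estimator(x_lab, y_lab, x_unlab, m, train_local):
    """Divide-and-conquer MEE (Equation (3)): train on each D~_l, then average the outputs."""
    blocks = zip(np.array_split(x_lab, m), np.array_split(y_lab, m), np.array_split(x_unlab, m))
    local = [train_local(*make_block(xl, yl, xu)) for xl, yl, xu in blocks]
    return lambda pair: np.mean([f(pair) for f in local], axis=0)
```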
In the pairwise setting, our target function is $f_\rho(x, x') = g_\rho(x) - g_\rho(x')$, $x, x' \in \mathcal{X}$, the difference of the regression function $g_\rho$. Denote by $L^2_{\rho_{X^2}}$ the space of square integrable functions on the product space $\mathcal{X}^2$:
$$L^2_{\rho_{X^2}} := \left\{ f: \mathcal{X}^2 \to \mathbb{R} \;:\; \|f\|_{L^2} = \left(\int_{\mathcal{X}^2} |f(x,x')|^2\, d\rho_X(x)\, d\rho_X(x')\right)^{\frac{1}{2}} < \infty \right\}.$$
The goodness of $\bar{f}_{t,\tilde{D}}$ is usually measured by the mean squared error $\|\bar{f}_{t,\tilde{D}} - f_\rho\|_{L^2}^2$.
Throughout the paper, we assume that $\sup_{(x,x')\in\mathcal{X}^2} K((x,x'),(x,x')) \le 1$ and that $|y| \le M$ almost surely for some constant $M > 0$. Without loss of generality, the windowing function $G$ is assumed to be differentiable and to satisfy $G'(0) = -1$, $G'(u) < 0$ for $u > 0$, $C_G := \sup_{u \in (0,\infty)} |G'(u)| < \infty$, and, for some $p$ and $c_p > 0$,
$$|G'(u) - G'(0)| \le c_p\, |u|^p, \qquad u > 0. \qquad (4)$$
It is easy to check that the Gaussian window $G(u) = \exp\{-u\}$ satisfies the assumptions above with $p = 1$.
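For completeness, here is the short verification of this claim (our own computation):
$$G'(u) = -e^{-u}, \qquad G'(0) = -1, \qquad G'(u) < 0 \ \text{for } u > 0, \qquad C_G = \sup_{u>0}|G'(u)| = 1,$$
$$|G'(u) - G'(0)| = 1 - e^{-u} \le u = |u|^1, \qquad u > 0,$$
so Equation (4) holds with $p = 1$ and $c_p = 1$.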
Before presenting our main results, we define the integral operator $L_K: L^2_{\rho_{X^2}} \to L^2_{\rho_{X^2}}$ associated with the kernel $K$ by
$$L_K(f) := \int_{\mathcal{X}}\int_{\mathcal{X}} f(x,x')\, K_{(x,x')}\, d\rho_X(x)\, d\rho_X(x'), \qquad f \in L^2_{\rho_{X^2}}.$$
Our error analysis for the distributed MEE algorithm (Equation (3)) is stated in terms of the following regularity condition:
$$f_\rho = L_K^r(\phi) \quad \text{for some } r > 0 \text{ and } \phi \in L^2_{\rho_{X^2}}, \qquad (5)$$
where $L_K^r$ denotes the $r$-th power of $L_K$ on $L^2_{\rho_{X^2}}$, which is well defined since the operator $L_K$ is positive and compact for the Mercer kernel $K$. We use the effective dimension [12,13] $\mathcal{N}(\lambda)$ to measure the complexity of $\mathcal{H}_K$ with respect to $\rho_X$; it is defined as the trace of the operator $(\lambda I + L_K)^{-1} L_K$,
$$\mathcal{N}(\lambda) = \mathrm{Tr}\left((\lambda I + L_K)^{-1} L_K\right), \qquad \lambda > 0.$$
To obtain optimal learning rates, we need to quantify $\mathcal{N}(\lambda)$. A suitable assumption is that
$$\mathcal{N}(\lambda) \le C_0\, \lambda^{-\beta} \quad \text{for some } C_0 > 0 \text{ and } 0 < \beta \le 1. \qquad (6)$$
Remark 1.
When $\beta = 1$, Equation (6) always holds with $C_0 = \mathrm{Tr}(L_K)$. For $0 < \beta < 1$, when $\mathcal{H}_K$ is a Sobolev space $W^\alpha(\mathcal{X})$ on $\mathcal{X} \subseteq \mathbb{R}^d$ with all derivatives of order up to $\alpha > \frac{d}{2}$, Equation (6) is satisfied with $\beta = \frac{d}{2\alpha}$ [14]. Moreover, if the eigenvalues $\{\gamma_i\}_{i=1}^{\infty}$ of the operator $L_K$ decay as $\gamma_i = O(i^{-b})$ for some $b > 1$, then $\mathcal{N}(\lambda) = O(\lambda^{-\frac{1}{b}})$; see the sketch below. This eigenvalue assumption is typical in the analysis of kernel estimators and was recently used in References [13,15,16] to establish optimal learning rates for least squares problems.
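A short check of the eigenvalue claim (our own computation): writing the effective dimension in terms of the eigenvalues of $L_K$ and splitting the sum at $i \approx (c/\lambda)^{1/b}$, where $\gamma_i \le c\, i^{-b}$,
$$\mathcal{N}(\lambda) = \sum_{i\ge 1}\frac{\gamma_i}{\gamma_i+\lambda} \le \#\left\{i : i \le (c/\lambda)^{\frac{1}{b}}\right\} + \frac{1}{\lambda}\sum_{i > (c/\lambda)^{1/b}} c\, i^{-b} \le \left(\frac{c}{\lambda}\right)^{\frac{1}{b}} + \frac{1}{b-1}\left(\frac{c}{\lambda}\right)^{\frac{1}{b}} = \frac{b}{b-1}\left(\frac{c}{\lambda}\right)^{\frac{1}{b}} = O\left(\lambda^{-\frac{1}{b}}\right).$$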
The following theorem shows that the distributed gradient descent algorithm (Equation (3)) achieves the optimal rate for a suitable choice of the iteration number $T$ and a bound on the maximal number of local machines; its proof can be found in Section 3.
Theorem 1. (Main Result)
Assume Equations (5) and (6) hold with $r + \beta \ge \frac{1}{2}$. Let the iteration number be $T = \lfloor N/4 \rfloor^{\frac{1}{2r+\beta}}$ and let $S + N \ge N^{\frac{\beta+1}{2r+\beta}}$. If
$$m < \min\left\{ (N+S)^{\frac{1}{2}}\, N^{-\frac{\beta+1}{4r+2\beta}},\; (N+S)^{\frac{1}{3}}\, N^{-\frac{2-2r-\beta}{6r+3\beta}} \right\} \Big/ \log^6 N, \qquad (7)$$
then for any $0 < \delta < 1$, with confidence at least $1 - \delta$,
$$\left\|\bar{f}_{T+1,\tilde{D}} - f_\rho\right\|_{L^2} \le C \max\left\{ N^{-\frac{r}{2r+\beta}},\; h^{-2p}(N+S)^{2p+1}\, N^{\frac{p+3/2}{2r+\beta} - (2p+1)} \right\} \log^4\frac{24}{\delta}, \qquad (8)$$
where $C$ is a constant independent of $N$, $S$, $\delta$, and $h$, and $\lfloor N/4 \rfloor$ denotes the largest integer not exceeding $N/4$.
Corollary 1.
Under the same conditions as Theorem 1, if the scaling parameter satisfies
$$h > (N+S)^{\frac{2p+1}{2p}}\, N^{\frac{r+p+3/2}{2p(2r+\beta)}}\, N^{-\frac{2p+1}{2p}},$$
then for any $0 < \delta < 1$, with confidence at least $1 - \delta$,
$$\left\|\bar{f}_{T+1,\tilde{D}} - f_\rho\right\|_{L^2} \le C\, N^{-\frac{r}{2r+\beta}} \log^4\frac{24}{\delta}. \qquad (9)$$
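The threshold on $h$ simply makes the second term in the maximum of Equation (8) no larger than the first; a short calculation (ours):
$$h^{-2p}(N+S)^{2p+1}N^{\frac{p+3/2}{2r+\beta}-(2p+1)} \le N^{-\frac{r}{2r+\beta}} \iff h^{2p} \ge (N+S)^{2p+1}N^{\frac{r+p+3/2}{2r+\beta}-(2p+1)} \iff h \ge (N+S)^{\frac{2p+1}{2p}}\, N^{\frac{r+p+3/2}{2p(2r+\beta)}}\, N^{-\frac{2p+1}{2p}}.$$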
Remark 2.
The rate $O\big(N^{-\frac{r}{2r+\beta}}\big)$ in Equation (9) is optimal in the minimax sense for kernel regression problems [13]. When $m = 1$, Equation (9) shows that the kernel gradient descent MEE algorithm (Equation (2)) on a single big data set achieves the minimax optimal rate for regression; thus MEE is a sound alternative to the classical least squares. Meanwhile, the upper bound (Equation (7)) on the number of local machines implies that the performance of the distributed MEE algorithm (Equation (3)) can be as good as that of the standard MEE algorithm (Equation (2)) acting on the whole data set $\tilde{D}$, provided that the size $n + s$ of each subset $\tilde{D}_l$ is not too small.
Remark 3.
If no unlabeled data are used in the algorithm (Equation (3)), then $S = 0$ and the upper bound (Equation (7)) on the number of local machines $m$ that ensures the optimal rate is of order $O\big(N^{\frac{r-1/2}{2r+\beta}}\big)$. Hence, when the regularity parameter $r$ in Equation (5) is close to $\frac{1}{2}$, this upper bound reduces to a constant and the distributed algorithm (Equation (3)) is no longer feasible in real applications. A similar phenomenon is observed in various distributed algorithms [15,16,17,18]. When the size of the unlabeled data $S > 0$, we see from Equation (7) that the upper bound on $m$ keeps growing as $S$ increases while the size of the labeled data $N$ is fixed. For example, let $\beta > \frac{1}{2}$ and $S = N^{1+\frac{1}{2r+\beta}}$; then the upper bound in Equation (7) is $O\big(N^{\frac{r}{2r+\beta}}\big)$ and does not degenerate to a constant when $r \le \frac{1}{2}$. Hence, with sufficient unlabeled data $D^*$, the distributed algorithm (Equation (3)) allows more local machines in the distributed method.
Remark 4.
A series of distributed learning works [15,16,17,18,19] were carried out under the assumption that the target function $f_\rho$ lies in the space $\mathcal{H}_K$, i.e., that the regularity parameter satisfies $r > \frac{1}{2}$. As a byproduct, Theorem 1 does not impose the restriction $r > \frac{1}{2}$ on the distributed algorithm (Equation (3)).

3. Proof of Main Result

In this section we prove the main result, Theorem 1. To this end, we introduce the data-free gradient descent method in $\mathcal{H}_K$ for least squares, defined by $f_1 = 0$ and
$$f_{t+1} = f_t - \eta \int_{\mathcal{X}}\int_{\mathcal{X}} \left(f_t(x,x') - f_\rho(x,x')\right) K_{(x,x')}\, d\rho_X(x)\, d\rho_X(x'), \qquad t \ge 1. \qquad (10)$$
Recalling the definition of $L_K$, this can be written as
$$f_{t+1} = f_t - \eta\, L_K(f_t - f_\rho) = (I - \eta L_K)\, f_t + \eta\, L_K(f_\rho), \qquad t \ge 1.$$
Following the standard error decomposition technique in learning theory, we split the error $\bar{f}_{t+1,\tilde{D}} - f_\rho$ into the sample error $\bar{f}_{t+1,\tilde{D}} - f_{t+1}$ and the approximation error $f_{t+1} - f_\rho$.

3.1. Approximation Error

We first estimate the approximation error $\|f_{t+1} - f_\rho\|_{L^2}$. The following bounds were proven in Reference [20].
Lemma 1.
Define $\{f_t\}$ by Equation (10) with $0 < \eta \le 1$. If Equation (5) holds with $r > 0$, then
$$\|f_t - f_\rho\|_{L^2} \le c_{\phi,r}\, t^{-r},$$
and, when $r \ge \frac{1}{2}$,
$$\|f_t - f_\rho\|_{K} \le c_{\phi,r}\, t^{-(r-\frac{1}{2})},$$
where $c_{\phi,r} = \max\left\{ \|\phi\|_{L^2} (2r/e)^r,\; \|\phi\|_{L^2} \left[(2r-1)/e\right]^{r-\frac{1}{2}} \right\}$.
Moreover, we derive a uniform bound for the sequence $\{f_t\}$ defined by Equation (10) when $0 < r < \frac{1}{2}$, which is useful in our analysis. Here and in the sequel, $\pi_{i+1}^{t}(L)$ denotes the polynomial operator associated with an operator $L$, defined by $\pi_{i+1}^{t}(L) := \prod_{j=i+1}^{t}(I - \eta L)$ with the convention $\pi_{t+1}^{t} := I$.
Lemma 2.
Define $\{f_t\}$ by Equation (10) with $0 < \eta \le 1$. If Equation (5) holds with $0 < r < \frac{1}{2}$, then
$$\|f_t\|_K \le d_{\phi,\eta,r}\, t^{\frac{1}{2}-r}, \qquad (11)$$
where $d_{\phi,\eta,r}$ is given in the proof.
Proof. 
Using Equation (10) iteratively from $t$ down to $1$, we have
$$f_{t+1} = \sum_{i=1}^{t} \eta\, \pi_{i+1}^{t}(L_K)\, L_K(f_\rho), \qquad \text{for all } t \ge 1.$$
With Equation (5),
$$\|f_{t+1}\|_K = \left\|\sum_{i=1}^{t}\eta\,\pi_{i+1}^{t}(L_K) L_K(f_\rho)\right\|_K = \left\|\sum_{i=1}^{t}\eta\,\pi_{i+1}^{t}(L_K) L_K\, L_K^r(\phi)\right\|_K = \left\|\sum_{i=1}^{t}\eta\,\pi_{i+1}^{t}(L_K) L_K^{r+\frac{1}{2}}\, L_K^{\frac{1}{2}}\phi\right\|_K \le \left\|\sum_{i=1}^{t}\eta\,\pi_{i+1}^{t}(L_K) L_K^{r+\frac{1}{2}}\right\| \|\phi\|_{L^2}. \qquad (12)$$
Let $\{\sigma_k\}_{k=1}^{\infty}$ be the eigenvalues of the operator $L_K$; since $L_K$ is positive and $\|L_K\|_{\mathcal{H}_K \to \mathcal{H}_K} \le 1$, we have $0 \le \sigma_k \le 1$ for $k \ge 1$, and hence
$$\left\|\sum_{i=1}^{t}\eta\,\pi_{i+1}^{t}(L_K) L_K^{r+\frac{1}{2}}\right\| = \sup_{k \ge 1}\sum_{i=1}^{t}\eta\,\pi_{i+1}^{t}(\sigma_k)\,\sigma_k^{r+\frac{1}{2}} \le \eta\sup_{a > 0}\sum_{i=1}^{t-1}\pi_{i+1}^{t}(a)\, a^{r+\frac{1}{2}} + \eta \le \eta\sup_{a > 0}\sum_{i=1}^{t-1}\exp\{-\eta a(t-i)\}\, a^{r+\frac{1}{2}} + \eta \le \sum_{i=1}^{t-1}\sup_{a > 0}\eta\exp\{-\eta a(t-i)\}\, a^{r+\frac{1}{2}} + \eta.$$
For each $i \le t-1$, a direct calculation gives
$$\sup_{a > 0}\eta\exp\{-\eta a(t-i)\}\, a^{r+\frac{1}{2}} = \eta\exp\{-\eta a(t-i)\}\, a^{r+\frac{1}{2}}\Big|_{a = (r+\frac{1}{2})[\eta(t-i)]^{-1}} = \eta^{\frac{1}{2}-r}\left(\frac{r+\frac{1}{2}}{e}\right)^{r+\frac{1}{2}}(t-i)^{-(r+\frac{1}{2})} \le (t-i)^{-(r+\frac{1}{2})},$$
where the last inequality uses $0 < \eta \le 1$ and $0 < r < \frac{1}{2}$. Thus,
$$\left\|\sum_{i=1}^{t}\eta\,\pi_{i+1}^{t}(L_K) L_K^{r+\frac{1}{2}}\right\| \le \sum_{i=1}^{t-1}(t-i)^{-(r+\frac{1}{2})} + \eta = \sum_{i=1}^{t-1} i^{-(r+\frac{1}{2})} + \eta.$$
By the elementary inequality $\sum_{i=1}^{t} i^{-\theta} \le \frac{t^{1-\theta}}{1-\theta}$ with $0 < \theta < 1$, it follows that
$$\left\|\sum_{i=1}^{t}\eta\,\pi_{i+1}^{t}(L_K) L_K^{r+\frac{1}{2}}\right\| \le \left(\frac{1}{1/2 - r} + 1\right) t^{\frac{1}{2}-r} = \frac{3/2 - r}{1/2 - r}\, t^{\frac{1}{2}-r}.$$
Together with Equation (12), the proof is completed by taking $d_{\phi,\eta,r} := \frac{3/2 - r}{1/2 - r}\, \|\phi\|_{L^2}$.  □

3.2. Sample Error

Define the empirical operator $L_{K,D}: \mathcal{H}_K \to \mathcal{H}_K$ by
$$L_{K,D} := \frac{1}{N^2}\sum_{(x_i,y_i)\in D}\sum_{(x_j,y_j)\in D}\left\langle \cdot,\, K_{(x_i,x_j)}\right\rangle_K K_{(x_i,x_j)},$$
so that, for any $f \in \mathcal{H}_K$,
$$L_{K,D}(f) = \frac{1}{N^2}\sum_{(x_i,y_i)\in D}\sum_{(x_j,y_j)\in D}\left\langle f, K_{(x_i,x_j)}\right\rangle_K K_{(x_i,x_j)} = \frac{1}{N^2}\sum_{(x_i,y_i)\in D}\sum_{(x_j,y_j)\in D} f(x_i,x_j)\, K_{(x_i,x_j)}.$$
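In the coefficient basis $\{K_{(x_i,x_j)}\}$ over all example pairs, this operator is just a rescaled pairwise Gram matrix; a two-line sketch (ours), with `Kmat` denoting the $N^2 \times N^2$ Gram matrix of $K$ over all pairs (our notation):

```python
import numpy as np

def apply_L_KD(beta, Kmat, N):
    """Apply L_{K,D} to f = sum_k beta_k K_{pair_k} and return the result's coefficients.

    f(pair_j) = (Kmat @ beta)_j and L_{K,D}(f) = (1/N^2) sum_j f(pair_j) K_{pair_j},
    so in the basis {K_{pair_j}} the operator acts as beta -> (Kmat @ beta) / N^2.
    """
    return (Kmat @ beta) / N**2
```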
Then the MEE gradient descent algorithm (Equation (2)) on $\tilde{D}$ can be written as
$$f_{t+1,\tilde{D}} = \left[I - \eta L_{K,\tilde{D}}\right](f_{t,\tilde{D}}) + \eta\, f_{\rho,\tilde{D}} + \eta\, E_{t,\tilde{D}}, \qquad (13)$$
where
$$E_{t,\tilde{D}} = \frac{1}{(N+S)^2}\sum_{(x_i,y_i)\in\tilde{D}}\sum_{(x_j,y_j)\in\tilde{D}}\left\{G'\left(\frac{\left[y_i - y_j - f_{t,\tilde{D}}(x_i,x_j)\right]^2}{h^2}\right) - G'(0)\right\}\left[f_{t,\tilde{D}}(x_i,x_j) - y_i + y_j\right] K_{(x_i,x_j)},$$
and
$$f_{\rho,\tilde{D}} = \frac{1}{(N+S)^2}\sum_{(x_i,y_i)\in\tilde{D}}\sum_{(x_j,y_j)\in\tilde{D}}(y_i - y_j)\, K_{(x_i,x_j)}.$$
(The identity (13) follows by expanding the gradient in Equation (2) and using $G'(0) = -1$.)
In the sequel, denote
$$B_{\tilde{D},\lambda} = \left\|(L_{K,\tilde{D}} + \lambda I)^{-1}(L_K + \lambda I)\right\|, \qquad C_{\tilde{D},\lambda} = \left\|(L_K + \lambda I)^{-\frac{1}{2}}(L_K - L_{K,\tilde{D}})\right\|,$$
$$D_{\tilde{D},\lambda} = \left\|\frac{1}{m}\sum_{l=1}^{m}(L_K + \lambda I)^{-\frac{1}{2}}(L_K - L_{K,\tilde{D}_l})\right\|, \qquad F_{\tilde{D},\lambda} = \left\|\frac{1}{m}\sum_{l=1}^{m}(L_K + \lambda I)^{-\frac{1}{2}}\left[f_{\rho,\tilde{D}_l} - L_K(f_\rho)\right]\right\|_K,$$
$$G_{\tilde{D},\lambda} = \left\|(L_K + \lambda I)^{-\frac{1}{2}}\left(L_K f_\rho - f_{\rho,\tilde{D}}\right)\right\|_K.$$
With these preliminaries in place, we now turn to the estimate of the sample error $\bar{f}_{t+1,\tilde{D}} - f_{t+1}$, presented in the following lemma, whose proof can be found in the Appendix. Here and in the sequel, we use the conventional notation $\sum_{i=1}^{t}(t-i)^{-1} := \sum_{i=1}^{t-1}(t-i)^{-1} + 1$.
Lemma 3.
Let $\lambda > 0$ and $0 < \eta < \min\{C_G^{-1}, 1\}$. For any $f^* \in \mathcal{H}_K$, there holds
$$\left\|\bar{f}_{T+1,\tilde{D}} - f_{T+1}\right\|_{L^2} \le \mathrm{term}_1 + \mathrm{term}_2 + c_{p,M}\,(N+S)^{2p+1} N^{-(2p+1)}\, T^{p+3/2}\, h^{-2p}, \qquad (15)$$
where $c_{p,M} = 2^{4p+2} c_p\, C_G^{2p+1} M^{2p+1}$,
$$\mathrm{term}_1 = \sup_{1 \le l \le m}\sum_{i=1}^{T}\left[(T-i)^{-1} + \eta\lambda\right] C_{\tilde{D}_l,\lambda}\left\{\sum_{s=1}^{i-1}\left[(i-s)^{-1} + \lambda\eta\right]\|f_s - f^*\|_K\, B_{\tilde{D}_l,\lambda} C_{\tilde{D}_l,\lambda}\,\lambda^{-\frac{1}{2}} + (1 + \lambda\eta i)\, B_{\tilde{D}_l,\lambda}\left(C_{\tilde{D}_l,\lambda}\|f^*\|_K + G_{\tilde{D}_l,\lambda}\right)\lambda^{-\frac{1}{2}} + c_{p,M}(N+S)^{2p+1} N^{-(2p+1)}\, i^{p+1/2}\, h^{-2p}\right\},$$
and
$$\mathrm{term}_2 = \sum_{i=1}^{T}\left[(T-i)^{-1} + \eta\lambda\right] D_{\tilde{D},\lambda}\,\|f_i - f^*\|_K + (1 + \lambda\eta T)\left(D_{\tilde{D},\lambda}\|f^*\|_K + F_{\tilde{D},\lambda}\right).$$
With the help of the lemma above, to bound the sample error $\|\bar{f}_{T+1,\tilde{D}} - f_{T+1}\|_{L^2}$ we first need to estimate the quantities $B_{\tilde{D},\lambda}$, $C_{\tilde{D},\lambda}$, $D_{\tilde{D},\lambda}$, $F_{\tilde{D},\lambda}$, and $G_{\tilde{D},\lambda}$. Denote $\mathcal{A}_{D,\lambda} := \frac{1}{\lfloor |D|/4\rfloor\sqrt{\lambda}} + \sqrt{\frac{\mathcal{N}(\lambda)}{\lfloor |D|/4\rfloor}}$, where $|D|$ is the cardinality of $D$. From previous work [19,21,22,23], each of the following inequalities holds with confidence at least $1 - \delta$:
$$B_{\tilde{D},\lambda} \le 2\left(\frac{\mathcal{A}_{\tilde{D},\lambda}\log\frac{2}{\delta}}{\sqrt{\lambda}}\right)^2 + 2, \qquad C_{\tilde{D},\lambda} \le 2\,\mathcal{A}_{\tilde{D},\lambda}\log\frac{2}{\delta}, \qquad D_{\tilde{D},\lambda} \le 2\,\mathcal{A}_{\tilde{D},\lambda}\log\frac{2}{\delta},$$
$$F_{\tilde{D},\lambda} \le 16 M\, \mathcal{A}_{D,\lambda}\log\frac{4}{\delta}, \qquad G_{\tilde{D},\lambda} \le 16 M\, \mathcal{A}_{D,\lambda}\log\frac{4}{\delta}. \qquad (16)$$
Lemma 3 also shows that the choice of $f^*$ is crucial in bounding $\bar{f}_{T+1,\tilde{D}} - f_{T+1}$. To get a tight bound on the learning error, we choose $f^* \in \mathcal{H}_K$ according to the regularity of the target function: when $r \ge \frac{1}{2}$, $f_\rho \in \mathcal{H}_K$ and we take $f^* = f_\rho$; when $0 < r < \frac{1}{2}$, $f_\rho$ lies outside the space $\mathcal{H}_K$ and we take $f^* = 0$.
We now give the first result, for the case where the target function $f_\rho$ is outside $\mathcal{H}_K$, i.e., $0 < r < \frac{1}{2}$.
Theorem 2.
Assume Equation (5) holds with $0 < r < \frac{1}{2}$. Let $0 < \eta < \min\{1, C_G^{-1}\}$, $T \in \mathbb{N}$, and $\lambda = T^{-1}$. Then, for any $0 < \delta < 1$, with probability at least $1 - \delta$, there holds
$$\left\|\bar{f}_{T+1,\tilde{D}} - f_\rho\right\|_{L^2} \le C^*\Big\{T^{-r} + \log^2(T)\, \mathcal{J}_{D,\tilde{D},\lambda}\log^4\frac{24 m}{\delta} + \left[\log(T)\, \mathcal{A}_{\tilde{D},\lambda}\lambda^{r-\frac{1}{2}} + \mathcal{A}_{D,\lambda}\right]\log\frac{16}{\delta} + (N+S)^{2p+1} N^{-(2p+1)} h^{-2p}\Big[T^{p+3/2} + T^{p+\frac{1}{2}}\log(T)\sup_{1\le l\le m}\mathcal{A}_{\tilde{D}_l,\lambda}\log\frac{2}{\delta}\Big]\Big\}, \qquad (17)$$
where $C^*$ is a constant given in the proof and $\mathcal{J}_{D,\tilde{D},\lambda} = \sup_{1\le l\le m}\left[\left(\frac{\mathcal{A}_{\tilde{D}_l,\lambda}}{\sqrt{\lambda}}\right)^2 + 1\right]\left(\mathcal{A}_{\tilde{D}_l,\lambda}^2\lambda^{r-1} + \mathcal{A}_{\tilde{D}_l,\lambda}\mathcal{A}_{D_l,\lambda}\lambda^{-\frac{1}{2}}\right)$.
Proof. 
Decompose $\bar{f}_{T+1,\tilde{D}} - f_\rho$ as
$$\left\|\bar{f}_{T+1,\tilde{D}} - f_\rho\right\|_{L^2} \le \left\|\bar{f}_{T+1,\tilde{D}} - f_{T+1}\right\|_{L^2} + \left\|f_{T+1} - f_\rho\right\|_{L^2}.$$
The estimate of $\|f_{T+1} - f_\rho\|_{L^2}$ is provided by Lemma 1, so we only need to handle $\|\bar{f}_{T+1,\tilde{D}} - f_{T+1}\|_{L^2}$ by Lemma 3.
For any $0 < s \le T - 1$ and $\lambda = T^{-1}$, Equation (11) gives $\|f_s\|_K \le d_{\phi,\eta,r}\, s^{\frac{1}{2}-r} \le d_{\phi,\eta,r}\,\lambda^{r-\frac{1}{2}}$. Taking $f^* = 0$ in Lemma 3, we obtain
$$\mathrm{term}_1 \le (1 + d_{\phi,\eta,r} + c_{p,M})\sup_{1\le l\le m}\sum_{i=1}^{T}\left[(T-i)^{-1} + \eta\lambda\right] C_{\tilde{D}_l,\lambda}\left\{\sum_{s=1}^{i-1}\left[(i-s)^{-1} + \lambda\eta\right] B_{\tilde{D}_l,\lambda} C_{\tilde{D}_l,\lambda}\,\lambda^{r-1} + (1 + \lambda\eta i)\, B_{\tilde{D}_l,\lambda}\left(C_{\tilde{D}_l,\lambda}\lambda^{r-\frac{1}{2}} + G_{\tilde{D}_l,\lambda}\right)\lambda^{-\frac{1}{2}} + (N+S)^{2p+1} N^{-(2p+1)}\, i^{p+1/2}\, h^{-2p}\right\}, \qquad (18)$$
and
$$\mathrm{term}_2 \le (1 + d_{\phi,\eta,r})\left\{\sum_{i=1}^{T}\left[(T-i)^{-1} + \eta\lambda\right] D_{\tilde{D},\lambda}\,\lambda^{r-\frac{1}{2}} + (1 + \lambda\eta T)\left(D_{\tilde{D},\lambda}\lambda^{r-\frac{1}{2}} + F_{\tilde{D},\lambda}\right)\right\}. \qquad (19)$$
Noticing the elementary inequality $\sum_{s=1}^{i} s^{-1} \le 2\log(i)$ for $i \ge 2$, we have
$$\sum_{s=1}^{i-1}\left[(i-s)^{-1} + \lambda\eta\right] \le \sum_{s=1}^{i-1}(i-s)^{-1} + T^{-1} i \le 4\log(i),$$
$$\sum_{i=1}^{T}\left[(T-i)^{-1} + \eta\lambda\right]\sum_{s=1}^{i-1}\left[(i-s)^{-1} + \lambda\eta\right] \le 4\sum_{i=1}^{T}\left[(T-i)^{-1} + T^{-1}\right]\log(i) \le 16\sum_{i=1}^{T}\frac{\log(i)}{T-i} \le 16\log(T)\sum_{i=1}^{T}\frac{1}{T-i} = 16\log(T)\sum_{i=1}^{T-1} i^{-1} \le 32\log^2(T),$$
$$\sum_{i=1}^{T}\left[(T-i)^{-1} + \eta\lambda\right](1 + \lambda\eta i) \le \sum_{i=1}^{T}\left[(T-i)^{-1} + \eta T^{-1}\right](1 + T^{-1}\eta i) \le 2\left[\sum_{i=1}^{T}(T-i)^{-1} + 1\right] \le 8\log(T),$$
and
$$\sum_{i=1}^{T}\left[(T-i)^{-1} + \eta\lambda\right] \le 4\log(T), \qquad \sum_{i=1}^{T}\left[(T-i)^{-1} + \eta\lambda\right] i^{p+1/2} \le \left\{\sum_{i=1}^{T}\left[(T-i)^{-1} + \eta\lambda\right]\right\} T^{p+1/2} \le 4\, T^{p+1/2}\log(T).$$
Plugging the above inequalities into $\mathrm{term}_1$ and $\mathrm{term}_2$ gives
$$\mathrm{term}_1 \le C_1\sup_{1\le l\le m}\Big(\log^2(T)\, B_{\tilde{D}_l,\lambda} C_{\tilde{D}_l,\lambda}^2\,\lambda^{r-1} + \log(T)\, B_{\tilde{D}_l,\lambda} C_{\tilde{D}_l,\lambda}^2\,\lambda^{r-1} + \log(T)\, B_{\tilde{D}_l,\lambda} C_{\tilde{D}_l,\lambda} G_{\tilde{D}_l,\lambda}\,\lambda^{-\frac{1}{2}} + (N+S)^{2p+1} N^{-(2p+1)}\, T^{p+1/2}\log(T)\, C_{\tilde{D}_l,\lambda}\, h^{-2p}\Big),$$
and
$$\mathrm{term}_2 \le C_2\left(\log(T)\, D_{\tilde{D},\lambda}\,\lambda^{r-\frac{1}{2}} + F_{\tilde{D},\lambda}\right),$$
where $C_1 = 32(1 + d_{\phi,\eta,r} + c_{p,M})$ and $C_2 = 6(1 + d_{\phi,\eta,r})$.
By Equation (16), for any fixed $l$ there exist three subsets of the sample space, each of measure at least $1 - \delta$, on which
$$B_{\tilde{D}_l,\lambda} \le 2\left(\frac{\mathcal{A}_{\tilde{D}_l,\lambda}\log\frac{2}{\delta}}{\sqrt{\lambda}}\right)^2 + 2, \qquad C_{\tilde{D}_l,\lambda} \le 2\,\mathcal{A}_{\tilde{D}_l,\lambda}\log\frac{2}{\delta},$$
and
$$G_{\tilde{D}_l,\lambda} \le 16 M\,\mathcal{A}_{D_l,\lambda}\log\frac{4}{\delta},$$
respectively. Thus, for any fixed $l$, with confidence at least $1 - 3\delta$,
$$B_{\tilde{D}_l,\lambda} C_{\tilde{D}_l,\lambda}^2\,\lambda^{r-1} \le 32\left[\left(\frac{\mathcal{A}_{\tilde{D}_l,\lambda}}{\sqrt{\lambda}}\right)^2 + 1\right]\mathcal{A}_{\tilde{D}_l,\lambda}^2\,\lambda^{r-1}\log^4\frac{2}{\delta},$$
and
$$B_{\tilde{D}_l,\lambda} C_{\tilde{D}_l,\lambda} G_{\tilde{D}_l,\lambda}\,\lambda^{-\frac{1}{2}} \le 256 M\left[\left(\frac{\mathcal{A}_{\tilde{D}_l,\lambda}}{\sqrt{\lambda}}\right)^2 + 1\right]\mathcal{A}_{\tilde{D}_l,\lambda}\mathcal{A}_{D_l,\lambda}\,\lambda^{-\frac{1}{2}}\log^3\frac{2}{\delta}\log\frac{4}{\delta}.$$
Therefore, with confidence at least $1 - 3m\delta$,
$$\sup_{1\le l\le m} B_{\tilde{D}_l,\lambda} C_{\tilde{D}_l,\lambda}^2\,\lambda^{r-1} \le 32\sup_{1\le l\le m}\left[\left(\frac{\mathcal{A}_{\tilde{D}_l,\lambda}}{\sqrt{\lambda}}\right)^2 + 1\right]\mathcal{A}_{\tilde{D}_l,\lambda}^2\,\lambda^{r-1}\log^4\frac{2}{\delta},$$
and
$$\sup_{1\le l\le m} B_{\tilde{D}_l,\lambda} C_{\tilde{D}_l,\lambda} G_{\tilde{D}_l,\lambda}\,\lambda^{-\frac{1}{2}} \le 256 M\sup_{1\le l\le m}\left[\left(\frac{\mathcal{A}_{\tilde{D}_l,\lambda}}{\sqrt{\lambda}}\right)^2 + 1\right]\mathcal{A}_{\tilde{D}_l,\lambda}\mathcal{A}_{D_l,\lambda}\,\lambda^{-\frac{1}{2}}\log^3\frac{2}{\delta}\log\frac{4}{\delta}.$$
Thus, by Equation (18) and scaling $3m\delta$ to $\delta/2$, it follows that with confidence at least $1 - \delta/2$,
$$\mathrm{term}_1 \le C_3\sup_{1\le l\le m}\left(\log^2(T)\left[\left(\frac{\mathcal{A}_{\tilde{D}_l,\lambda}}{\sqrt{\lambda}}\right)^2 + 1\right]\left(\mathcal{A}_{\tilde{D}_l,\lambda}^2\lambda^{r-1} + \mathcal{A}_{\tilde{D}_l,\lambda}\mathcal{A}_{D_l,\lambda}\lambda^{-\frac{1}{2}}\right)\log^4\frac{24 m}{\delta} + (N+S)^{2p+1} N^{-(2p+1)}\, T^{p+1/2}\log(T)\,\mathcal{A}_{\tilde{D}_l,\lambda}\, h^{-2p}\right),$$
where $C_3 = C_1(256 M + 64)$.
Similarly, with confidence at least $1 - 2\delta$,
$$D_{\tilde{D},\lambda} \le 2\,\mathcal{A}_{\tilde{D},\lambda}\log\frac{2}{\delta} \qquad \text{and} \qquad F_{\tilde{D},\lambda} \le 16 M\,\mathcal{A}_{D,\lambda}\log\frac{4}{\delta}.$$
By Equation (19) and scaling $2\delta$ to $\delta/2$, it follows that with confidence at least $1 - \delta/2$,
$$\mathrm{term}_2 \le C_4\left[\log(T)\,\mathcal{A}_{\tilde{D},\lambda}\,\lambda^{r-\frac{1}{2}} + \mathcal{A}_{D,\lambda}\right]\log\frac{16}{\delta},$$
where $C_4 = C_2(16 M + 2)$. Together with Lemma 1, we obtain the desired bound (Equation (17)) with $C^* = c_{\phi,r} + C_3 + C_4 + c_{p,M}$.  □
Next, we give the result when the target function $f_\rho$ lies in $\mathcal{H}_K$, i.e., $r \ge \frac{1}{2}$.
Theorem 3.
Assume Equation (5) holds with $r \ge \frac{1}{2}$. Let $0 < \eta < \min\{1, C_G^{-1}\}$, $T \in \mathbb{N}$, and $\lambda = T^{-1}$. Then, for any $0 < \delta < 1$, with probability at least $1 - \delta$, there holds
$$\left\|\bar{f}_{T+1,\tilde{D}} - f_\rho\right\|_{L^2} \le C^*\Big\{T^{-r} + \log^2(T)\,\mathcal{K}_{D,\tilde{D},\lambda}\log^4\frac{24 m}{\delta} + \left[\log(T)\,\mathcal{A}_{\tilde{D},\lambda} + \mathcal{A}_{D,\lambda}\right]\log\frac{16}{\delta} + (N+S)^{2p+1} N^{-(2p+1)} h^{-2p}\Big[T^{p+3/2} + T^{p+\frac{1}{2}}\log(T)\sup_{1\le l\le m}\mathcal{A}_{\tilde{D}_l,\lambda}\log\frac{2}{\delta}\Big]\Big\},$$
where $\mathcal{K}_{D,\tilde{D},\lambda} = \sup_{1\le l\le m}\left[\left(\frac{\mathcal{A}_{\tilde{D}_l,\lambda}}{\sqrt{\lambda}}\right)^2 + 1\right]\left(\mathcal{A}_{\tilde{D}_l,\lambda}^2 + \mathcal{A}_{\tilde{D}_l,\lambda}\mathcal{A}_{D_l,\lambda}\right)\lambda^{-\frac{1}{2}}$ and $C^*$ is a constant given in the proof.
The proof is similar to that of Theorem 2 and is omitted.
With these preliminaries in place, we can prove our main result in Theorem 1.
Proof of Theorem 1.
We first prove Equation (8) via Theorem 2 when $0 < r < \frac{1}{2}$. Let $T = \lfloor |D|/4\rfloor^{\frac{1}{2r+\beta}}$ and $\lambda = T^{-1}$. Notice that $|D| = N$, $|\tilde{D}| = N+S$, and $m|D_l| = |D|$, $m|\tilde{D}_l| = |\tilde{D}|$ for $1 \le l \le m$. Since $r + \beta > \frac{1}{2}$ and Equation (7) holds, we obtain
$$\mathcal{A}_{D,\lambda} = \lfloor |D|/4\rfloor^{-1+\frac{1}{4r+2\beta}} + \sqrt{C_0}\,\lfloor |D|/4\rfloor^{-\frac{1}{2}+\frac{\beta}{4r+2\beta}} \le (\sqrt{C_0}+1)\lfloor |D|/4\rfloor^{-\frac{r}{2r+\beta}} \le \sqrt{5}(\sqrt{C_0}+1)\,|D|^{-\frac{r}{2r+\beta}} = \sqrt{5}(\sqrt{C_0}+1)\, N^{-\frac{r}{2r+\beta}},$$
$$\mathcal{A}_{\tilde{D},\lambda} = \lfloor|\tilde{D}|/4\rfloor^{-1}\lfloor|D|/4\rfloor^{\frac{1}{4r+2\beta}} + \sqrt{C_0}\,\lfloor|\tilde{D}|/4\rfloor^{-\frac{1}{2}}\lfloor|D|/4\rfloor^{\frac{\beta}{4r+2\beta}} \le \sqrt{5}(\sqrt{C_0}+1)\left(|\tilde{D}|^{-1}|D|^{\frac{1}{4r+2\beta}} + |\tilde{D}|^{-\frac{1}{2}}|D|^{\frac{\beta}{4r+2\beta}}\right) = \sqrt{5}(\sqrt{C_0}+1)\left((N+S)^{-1}N^{\frac{1}{4r+2\beta}} + (N+S)^{-\frac{1}{2}}N^{\frac{\beta}{4r+2\beta}}\right),$$
$$\mathcal{A}_{D_l,\lambda} = \lfloor|D_l|/4\rfloor^{-1}\lfloor|D|/4\rfloor^{\frac{1}{4r+2\beta}} + \sqrt{C_0}\,\lfloor|D_l|/4\rfloor^{-\frac{1}{2}}\lfloor|D|/4\rfloor^{\frac{\beta}{4r+2\beta}} \le \sqrt{5}(\sqrt{C_0}+1)\left(m|D|^{-1+\frac{1}{4r+2\beta}} + m^{\frac{1}{2}}|D|^{-\frac{1}{2}+\frac{\beta}{4r+2\beta}}\right) = \sqrt{5}(\sqrt{C_0}+1)\left(m N^{-1+\frac{1}{4r+2\beta}} + m^{\frac{1}{2}} N^{-\frac{1}{2}+\frac{\beta}{4r+2\beta}}\right),$$
and
$$\mathcal{A}_{\tilde{D}_l,\lambda} \le \sqrt{5}(\sqrt{C_0}+1)\left(|\tilde{D}_l|^{-1}|D|^{\frac{1}{4r+2\beta}} + |\tilde{D}_l|^{-\frac{1}{2}}|D|^{\frac{\beta}{4r+2\beta}}\right) = \sqrt{5}(\sqrt{C_0}+1)\left(m|\tilde{D}|^{-1}|D|^{\frac{1}{4r+2\beta}} + m^{\frac{1}{2}}|\tilde{D}|^{-\frac{1}{2}}|D|^{\frac{\beta}{4r+2\beta}}\right) \le 2\sqrt{5}(\sqrt{C_0}+1)\, m^{\frac{1}{2}}|\tilde{D}|^{-\frac{1}{2}}|D|^{\frac{\beta}{4r+2\beta}}.$$
Thus,
$$\frac{\mathcal{A}_{\tilde{D}_l,\lambda}}{\sqrt{\lambda}} \le \sqrt{5}(\sqrt{C_0}+1)\left(m|\tilde{D}|^{-1}|D|^{\frac{1}{4r+2\beta}} + m^{\frac{1}{2}}|\tilde{D}|^{-\frac{1}{2}}|D|^{\frac{\beta}{4r+2\beta}}\right)\lfloor|D|/4\rfloor^{\frac{1}{2(2r+\beta)}} \le 2\sqrt{5}(\sqrt{C_0}+1)\, m^{\frac{1}{2}}|\tilde{D}|^{-\frac{1}{2}}|D|^{\frac{\beta+1}{4r+2\beta}} = 2\sqrt{5}(\sqrt{C_0}+1)\, m^{\frac{1}{2}}(N+S)^{-\frac{1}{2}}N^{\frac{\beta+1}{4r+2\beta}} \le 2\sqrt{5}(\sqrt{C_0}+1).$$
It follows that, for $l = 1, \ldots, m$,
$$\left(\frac{\mathcal{A}_{\tilde{D}_l,\lambda}}{\sqrt{\lambda}}\right)^2 + 1 \le 20(\sqrt{C_0}+1)^2 + 1,$$
$$\mathcal{A}_{\tilde{D}_l,\lambda}^2\,\lambda^{r-1} \le 10(\sqrt{C_0}+1)^2\left(m^2|\tilde{D}|^{-2}|D|^{\frac{1}{2r+\beta}} + m|\tilde{D}|^{-1}|D|^{\frac{\beta}{2r+\beta}}\right)|D|^{\frac{1-r}{2r+\beta}} = 10(\sqrt{C_0}+1)^2\left(m^2|\tilde{D}|^{-2}|D|^{\frac{2}{2r+\beta}} + m|\tilde{D}|^{-1}|D|^{\frac{1+\beta}{2r+\beta}}\right)|D|^{-\frac{r}{2r+\beta}} \le 20(\sqrt{C_0}+1)^2\,|D|^{-\frac{r}{2r+\beta}}\big/\log^6|D| = 20(\sqrt{C_0}+1)^2\, N^{-\frac{r}{2r+\beta}}\big/\log^6 N,$$
$$\mathcal{A}_{\tilde{D}_l,\lambda}\mathcal{A}_{D_l,\lambda}\,\lambda^{-\frac{1}{2}} \le 10(\sqrt{C_0}+1)^2\, m^{\frac{1}{2}}|\tilde{D}|^{-\frac{1}{2}}|D|^{\frac{\beta}{4r+2\beta}}\left(m|D|^{-1+\frac{1}{4r+2\beta}} + m^{\frac{1}{2}}|D|^{-\frac{1}{2}+\frac{\beta}{4r+2\beta}}\right)|D|^{\frac{1}{4r+2\beta}} \le 10(\sqrt{C_0}+1)^2\left(m^{\frac{3}{2}}|\tilde{D}|^{-\frac{1}{2}}|D|^{-\frac{\beta+4r-2}{4r+2\beta}} + m|\tilde{D}|^{-\frac{1}{2}}|D|^{\frac{\beta+1-2r}{4r+2\beta}}\right) \le 20(\sqrt{C_0}+1)^2\,|D|^{-\frac{r}{2r+\beta}}\big/\log^6|D| = 20(\sqrt{C_0}+1)^2\, N^{-\frac{r}{2r+\beta}}\big/\log^6 N.$$
Thus, by the above estimates,
$$\mathcal{J}_{D,\tilde{D},\lambda} \le 40\left(20(\sqrt{C_0}+1)^2 + 1\right)(\sqrt{C_0}+1)^2\, N^{-\frac{r}{2r+\beta}}\big/\log^6 N,$$
and therefore
$$\log^2(T)\,\mathcal{J}_{D,\tilde{D},\lambda}\log^4\frac{24 m}{\delta} \le 2^4\log^2(T)\,\mathcal{J}_{D,\tilde{D},\lambda}\,(\log^4 m)\log^4\frac{24}{\delta} \le 2^4(2r+\beta)^{-2}(\log^2 N)\,\mathcal{J}_{D,\tilde{D},\lambda}\,(\log^4 m)\log^4\frac{24}{\delta} \le 2^4(2r+\beta)^{-2}(\log^6 N)\,\mathcal{J}_{D,\tilde{D},\lambda}\log^4\frac{24}{\delta} \le 2^{10}(2r+\beta)^{-2}\left(20(\sqrt{C_0}+1)^2+1\right)(\sqrt{C_0}+1)^2\,|D|^{-\frac{r}{2r+\beta}}\log^4\frac{24}{\delta} = 2^{10}(2r+\beta)^{-2}\left(20(\sqrt{C_0}+1)^2+1\right)(\sqrt{C_0}+1)^2\, N^{-\frac{r}{2r+\beta}}\log^4\frac{24}{\delta},$$
and
$$\log(T)\,\mathcal{A}_{\tilde{D},\lambda}\,\lambda^{r-\frac{1}{2}} \le \sqrt{5}(2r+\beta)^{-1}(\sqrt{C_0}+1)\log|D|\left(|\tilde{D}|^{-1}|D|^{\frac{1}{2r+\beta}} + |\tilde{D}|^{-\frac{1}{2}}|D|^{\frac{\beta+1}{4r+2\beta}}\right)|D|^{-\frac{r}{2r+\beta}} \le 2\sqrt{5}(2r+\beta)^{-1}(\sqrt{C_0}+1)\,|D|^{-\frac{r}{2r+\beta}} = 2\sqrt{5}(2r+\beta)^{-1}(\sqrt{C_0}+1)\, N^{-\frac{r}{2r+\beta}}.$$
Putting the above estimates into Theorem 2, we obtain the desired conclusion (Equation (8)) with
$$C = C^*\left[2^{10}(2r+\beta)^{-2}\left(20(\sqrt{C_0}+1)^2+1\right)(\sqrt{C_0}+1)^2 + 2\sqrt{5}(2r+\beta)^{-1}(\sqrt{C_0}+1) + \sqrt{5}(\sqrt{C_0}+2)\right].$$
When $r \ge \frac{1}{2}$, we apply Theorem 3 and follow the same procedure as above to obtain the conclusion (Equation (8)). The proof is complete. □

4. Simulation and Conclusions

In this section, we provide a simulation to verify the theoretical statements. We assume that the inputs $\{x_i\}$ are drawn independently from the uniform distribution on $[0,1]$. Consider the regression model $y_i = g_\rho(x_i) + \varepsilon_i$, $i = 1, \ldots, N$, where the $\varepsilon_i$ are independent Gaussian noises $\mathcal{N}(0, 0.1^2)$ and
$$g_\rho(x) = \begin{cases} x, & \text{if } 0 < x \le 0.5, \\ 1 - x, & \text{if } 0.5 < x \le 1. \end{cases}$$
Define the pairwise kernel $K: \mathcal{X}^2 \times \mathcal{X}^2 \to \mathbb{R}$ by $K((x,u),(x',u')) := K_1(x,x') + K_1(u,u') - K_1(x,u') - K_1(u,x')$, where
$$K_1(x, x') = 1 + \min\{x, x'\}.$$
We apply the kernel $K$ in the distributed algorithm (Equation (3)). In Figure 1, we plot the mean squared error of Equation (3) for $N = 600$ and $S = 0, 300, 600$ as the number of local machines $m$ varies. The case $S = 0$ corresponds to the standard distributed MEE algorithm without unlabeled data: as $m$ becomes large, the corresponding (red) error curve increases dramatically. However, when 300 or 600 unlabeled data are added, the error curves increase only very slowly. This agrees with our theory that using unlabeled data enlarges the admissible range of $m$ in the distributed method. A downscaled sketch of this experiment is given below.
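The following self-contained sketch reproduces the structure of the experiment on a smaller scale; the sample sizes, $h$, $\eta$, and $T$ below are our own choices, since the paper does not report them, and the naive pairwise Gram matrix makes the full $N = 600$ setting expensive.

```python
import numpy as np

rng = np.random.default_rng(0)

def g_rho(x):                       # target regression function
    return np.where(x <= 0.5, x, 1.0 - x)

def K1(a, b):                       # K_1(x, x') = 1 + min(x, x')
    return 1.0 + np.minimum(a[:, None], b[None, :])

def pair_K(P, Q):                   # K((x,u),(x',u')) = K1(x,x')+K1(u,u')-K1(x,u')-K1(u,x')
    return (K1(P[:, 0], Q[:, 0]) + K1(P[:, 1], Q[:, 1])
            - K1(P[:, 0], Q[:, 1]) - K1(P[:, 1], Q[:, 0]))

def local_mee(x, y, h, eta, T):
    """Kernel gradient descent MEE (Equation (2)) on one block; returns pairs and coefficients."""
    n = len(x)
    I, J = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    pairs = np.column_stack([x[I.ravel()], x[J.ravel()]])
    ydiff = (y[I] - y[J]).ravel()
    Kmat = pair_K(pairs, pairs)
    alpha = np.zeros(n * n)
    for _ in range(T):
        resid = ydiff - Kmat @ alpha
        # update with G(u) = exp(-u), i.e. G'(u) = -exp(-u)
        alpha -= eta * (1.0 / n**2) * np.exp(-resid**2 / h**2) * (-resid)
    return pairs, alpha

# downscaled experiment (the paper uses N = 600 and S in {0, 300, 600})
N, S, m = 100, 100, 5               # labeled size, unlabeled size, local machines (our choices)
h, eta, T = 1.0, 0.5, 50            # scaling parameter, step size, iterations (our choices)

x = rng.uniform(0, 1, N)
y = g_rho(x) + rng.normal(0, 0.1, N)
xu = rng.uniform(0, 1, S)

models = []
for xl, yl, ul in zip(np.array_split(x, m), np.array_split(y, m), np.array_split(xu, m)):
    n, s = len(yl), len(ul)
    xt = np.concatenate([xl, ul])                          # D~_l inputs
    yt = np.concatenate([(n + s) / n * yl, np.zeros(s)])   # rescaled / zero labels
    models.append(local_mee(xt, yt, h, eta, T))

# mean squared error of the averaged estimator on a grid of test pairs
tx = rng.uniform(0, 1, (400, 2))
f_bar = np.mean([pair_K(tx, P) @ a for P, a in models], axis=0)
mse = np.mean((f_bar - (g_rho(tx[:, 0]) - g_rho(tx[:, 1])))**2)
print(f"m = {m}, S = {S}: test MSE = {mse:.4f}")
```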
This paper studied the convergence rate of the distributed gradient descent MEE algorithm in a semi-supervised setting. Our results demonstrate that using additional unlabeled data can improve the learning performance of the distributed MEE algorithm, in particular by enlarging the range of $m$ for which the optimal learning rate is guaranteed. As always, there are gaps between theory and empirical studies; we regard this paper as mainly a theoretical contribution and expect the analysis to provide some guidance for real applications.

Author Contributions

B.W. conceived the presented idea. T.H. developed the theory and performed the computations. All authors discussed the results and contributed to the final manuscript.

Funding

The work described in this paper is partially supported by the National Natural Science Foundation of China [Nos. 11671307 and 11571078], the Natural Science Foundation of Hubei Province in China [No. 2017CFB523], and the Fundamental Research Funds for the Central Universities, South-Central University for Nationalities [No. CZY18033].

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Proof of Lemma 3

We state two useful lemmas, whose proofs can be found in Reference [22].
Lemma A1.
For $\lambda > 0$, $0 < \eta < 1$, and $j = 1, \ldots, t-1$, we have
$$\max\left\{\left\|\eta(L_K + \lambda I)\,\pi_{j+1}^{t}(L_K)\right\|,\; \left\|\eta(L_{K,\tilde{D}} + \lambda I)\,\pi_{j+1}^{t}(L_{K,\tilde{D}})\right\|\right\} \le \frac{1}{t-j} + \eta\lambda,$$
$$\max\left\{\left\|\sum_{j=1}^{t}\eta(L_K + \lambda I)\,\pi_{j+1}^{t}(L_K)\right\|,\; \left\|\sum_{j=1}^{t}\eta(L_{K,\tilde{D}} + \lambda I)\,\pi_{j+1}^{t}(L_{K,\tilde{D}})\right\|\right\} \le 1 + \eta\lambda t.$$
Lemma A2.
For any $\lambda > 0$ and $f^* \in \mathcal{H}_K$, there holds
$$\left\|f_{t+1,\tilde{D}} - f_{t+1}\right\|_{L^2} \le B_{\tilde{D},\lambda} C_{\tilde{D},\lambda}\sum_{i=1}^{t}\left[(t-i)^{-1} + \eta\lambda\right]\|f_i - f^*\|_K + B_{\tilde{D},\lambda}(1 + \eta\lambda t)\left(C_{\tilde{D},\lambda}\|f^*\|_K + G_{\tilde{D},\lambda}\right) + c_{p,M}(N+S)^{2p+1} N^{-(2p+1)}\, t^{p+1/2}\, h^{-2p}, \qquad \mathrm{(A1)}$$
and
$$\left\|f_{t+1,\tilde{D}} - f_{t+1}\right\|_K \le B_{\tilde{D},\lambda} C_{\tilde{D},\lambda}\sum_{i=1}^{t}\left[(t-i)^{-1} + \eta\lambda\right]\|f_i - f^*\|_K\,\lambda^{-\frac{1}{2}} + B_{\tilde{D},\lambda}(1 + \eta\lambda t)\left(C_{\tilde{D},\lambda}\|f^*\|_K + G_{\tilde{D},\lambda}\right)\lambda^{-\frac{1}{2}} + c_{p,M}(N+S)^{2p+1} N^{-(2p+1)}\, t^{p+1/2}\, h^{-2p}, \qquad \mathrm{(A2)}$$
where the constant $c_{p,M} = 2^{4p+2} c_p\, C_G^{2p+1} M^{2p+1}$.
Proof of Lemma A2.
By Equations (10) and (13), we obtain two error decompositions for $f_{t+1,\tilde{D}} - f_{t+1}$. The first one is
$$f_{t+1,\tilde{D}} - f_{t+1} = \eta\sum_{i=1}^{t}\pi_{i+1}^{t}(L_{K,\tilde{D}})\left[L_K - L_{K,\tilde{D}}\right](f_i) + \eta\sum_{i=1}^{t}\pi_{i+1}^{t}(L_{K,\tilde{D}})\left[f_{\rho,\tilde{D}} - L_K(f_\rho)\right] + \eta\sum_{i=1}^{t}\pi_{i+1}^{t}(L_{K,\tilde{D}})\, E_{i,\tilde{D}}, \qquad \mathrm{(A3)}$$
and the second one is
$$f_{t+1,\tilde{D}} - f_{t+1} = \eta\sum_{i=1}^{t}\pi_{i+1}^{t}(L_K)\left[L_K - L_{K,\tilde{D}}\right](f_{i,\tilde{D}}) + \eta\sum_{i=1}^{t}\pi_{i+1}^{t}(L_K)\left[f_{\rho,\tilde{D}} - L_K(f_\rho)\right] + \eta\sum_{i=1}^{t}\pi_{i+1}^{t}(L_K)\, E_{i,\tilde{D}}. \qquad \mathrm{(A4)}$$
It has been proven in Reference [22] that the sequence $\{f_{t,\tilde{D}}\}$ satisfies $\|f_{t,\tilde{D}}\|_K \le 2 C_G \frac{|\tilde{D}|}{|D|} M t^{\frac{1}{2}} = 2 C_G \frac{N+S}{N} M t^{\frac{1}{2}}$. It then follows from Equation (4) that
$$\left\|E_{t,\tilde{D}}\right\|_K \le \frac{1}{(N+S)^2}\sum_{(x_i,y_i)\in\tilde{D}}\sum_{(x_j,y_j)\in\tilde{D}}\left|G'\left(\frac{\left[f_{t,\tilde{D}}(x_i,x_j) - y_i + y_j\right]^2}{h^2}\right) - G'(0)\right|\left|f_{t,\tilde{D}}(x_i,x_j) - y_i + y_j\right|\left\|K_{(x_i,x_j)}\right\|_K \le \frac{c_p}{h^{2p}}\left(2\frac{N+S}{N} C_G M t^{\frac{1}{2}} + 2M\right)^{2p+1} \le 2^{4p+2} c_p\, C_G^{2p+1} M^{2p+1}\,\frac{(N+S)^{2p+1}}{N^{2p+1}}\, t^{p+1/2}\, h^{-2p}. \qquad \mathrm{(A5)}$$
Then we can follow the proof procedure of Proposition 1 in Reference [24] to prove Equations (A1) and (A2). □
With the help of the lemmas above, we can prove Lemma 3.
Proof of Lemma 3.
Applying the decomposition (A4) with $\tilde{D} = \tilde{D}_l$ for $l = 1, \ldots, m$, we have
$$\left\|\bar{f}_{T+1,\tilde{D}} - f_{T+1}\right\|_{L^2} = \left\|\frac{1}{m}\sum_{l=1}^{m}\left(f_{T+1,\tilde{D}_l} - f_{T+1}\right)\right\|_{L^2} \le \left\|\eta\sum_{i=1}^{T}\pi_{i+1}^{T}(L_K)\frac{1}{m}\sum_{l=1}^{m}\left[L_K - L_{K,\tilde{D}_l}\right](f_{i,\tilde{D}_l})\right\|_{L^2} + \left\|\eta\sum_{i=1}^{T}\pi_{i+1}^{T}(L_K)\frac{1}{m}\sum_{l=1}^{m}\left[f_{\rho,\tilde{D}_l} - L_K(f_\rho)\right]\right\|_{L^2} + \left\|\eta\sum_{i=1}^{T}\pi_{i+1}^{T}(L_K)\frac{1}{m}\sum_{l=1}^{m} E_{i,\tilde{D}_l}\right\|_{L^2} =: I_1 + I_2 + I_3.$$
We first bound $I_1$, which is the most difficult to handle. It can be decomposed as
$$I_1 \le \left\|\eta\sum_{i=1}^{T}\pi_{i+1}^{T}(L_K)(L_K+\lambda I)^{\frac{1}{2}}\frac{1}{m}\sum_{l=1}^{m}(L_K+\lambda I)^{-\frac{1}{2}}\left[L_K - L_{K,\tilde{D}_l}\right](f_{i,\tilde{D}_l} - f_i)\right\|_K + \left\|\eta\sum_{i=1}^{T}\pi_{i+1}^{T}(L_K)(L_K+\lambda I)^{\frac{1}{2}}\frac{1}{m}\sum_{l=1}^{m}(L_K+\lambda I)^{-\frac{1}{2}}\left[L_K - L_{K,\tilde{D}_l}\right](f_i - f^*)\right\|_K + \left\|\eta\sum_{i=1}^{T}\pi_{i+1}^{T}(L_K)(L_K+\lambda I)^{\frac{1}{2}}\frac{1}{m}\sum_{l=1}^{m}(L_K+\lambda I)^{-\frac{1}{2}}\left[L_K - L_{K,\tilde{D}_l}\right](f^*)\right\|_K =: I_{11} + I_{12} + I_{13}.$$
By Lemma A1 and $f_{1,\tilde{D}_l} = f_1 = 0$, it is easy to see that
$$I_{11} \le \sum_{i=1}^{T}\left\|\eta(L_K+\lambda I)\,\pi_{i+1}^{T}(L_K)\right\|\left\|\frac{1}{m}\sum_{l=1}^{m}(L_K+\lambda I)^{-\frac{1}{2}}\left[L_K - L_{K,\tilde{D}_l}\right](f_{i,\tilde{D}_l} - f_i)\right\|_K \le \sum_{i=1}^{T}\left[\eta\lambda + (T-i)^{-1}\right]\sup_{1\le l\le m}\left\|(L_K+\lambda I)^{-\frac{1}{2}}\left[L_K - L_{K,\tilde{D}_l}\right](f_{i,\tilde{D}_l} - f_i)\right\|_K \le \sup_{1\le l\le m}\sum_{i=1}^{T}\left[\eta\lambda + (T-i)^{-1}\right] C_{\tilde{D}_l,\lambda}\left\|f_{i,\tilde{D}_l} - f_i\right\|_K.$$
Applying Lemma A2 (Equation (A2)) with $\tilde{D} = \tilde{D}_l$ and $t+1 = i$, we have
$$\left\|f_{i,\tilde{D}_l} - f_i\right\|_K \le \sum_{s=1}^{i-1}\left[(i-s)^{-1} + \lambda\eta\right]\|f_s - f^*\|_K\, B_{\tilde{D}_l,\lambda} C_{\tilde{D}_l,\lambda}\,\lambda^{-\frac{1}{2}} + (1 + \lambda\eta i)\, B_{\tilde{D}_l,\lambda}\left(C_{\tilde{D}_l,\lambda}\|f^*\|_K + G_{\tilde{D}_l,\lambda}\right)\lambda^{-\frac{1}{2}} + c_{p,M}(N+S)^{2p+1} N^{-(2p+1)}\, i^{p+1/2}\, h^{-2p}.$$
Thus,
$$I_{11} \le \sup_{1\le l\le m}\sum_{i=1}^{T}\left[\eta\lambda + (T-i)^{-1}\right] C_{\tilde{D}_l,\lambda}\left\{\sum_{s=1}^{i-1}\left[(i-s)^{-1} + \lambda\eta\right]\|f_s - f^*\|_K\, B_{\tilde{D}_l,\lambda} C_{\tilde{D}_l,\lambda}\,\lambda^{-\frac{1}{2}} + (1 + \lambda\eta i)\, B_{\tilde{D}_l,\lambda}\left(C_{\tilde{D}_l,\lambda}\|f^*\|_K + G_{\tilde{D}_l,\lambda}\right)\lambda^{-\frac{1}{2}} + c_{p,M}(N+S)^{2p+1} N^{-(2p+1)}\, i^{p+1/2}\, h^{-2p}\right\}. \qquad \mathrm{(A6)}$$
By Lemma A1 again, we have
$$I_{12} \le \sum_{i=1}^{T}\left\|\eta(L_K+\lambda I)\,\pi_{i+1}^{T}(L_K)\right\|\left\|\frac{1}{m}\sum_{l=1}^{m}(L_K+\lambda I)^{-\frac{1}{2}}\left[L_K - L_{K,\tilde{D}_l}\right]\right\|\left\|f_i - f^*\right\|_K \le \sum_{i=1}^{T}\left[\eta\lambda + (T-i)^{-1}\right] D_{\tilde{D},\lambda}\left\|f_i - f^*\right\|_K, \qquad \mathrm{(A7)}$$
and
$$I_{13} \le \left\|\sum_{i=1}^{T}\eta(L_K+\lambda I)\,\pi_{i+1}^{T}(L_K)\right\|\left\|\frac{1}{m}\sum_{l=1}^{m}(L_K+\lambda I)^{-\frac{1}{2}}\left[L_K - L_{K,\tilde{D}_l}\right]\right\|\left\|f^*\right\|_K \le (1 + \lambda\eta T)\, D_{\tilde{D},\lambda}\left\|f^*\right\|_K. \qquad \mathrm{(A8)}$$
This completes the estimate of $I_1$ via Equations (A6)–(A8).
We now turn to $I_2$ and $I_3$. By the definition of $F_{\tilde{D},\lambda}$, the bound (Equation (A5)) on $E_{t,\tilde{D}}$, and Lemma A1, we obtain
$$I_2 \le (1 + \eta\lambda T)\, F_{\tilde{D},\lambda},$$
and
$$I_3 \le c_{p,M}\,\frac{|\tilde{D}|^{2p+1}}{|D|^{2p+1}}\, T^{p+3/2}\, h^{-2p} = c_{p,M}\,(N+S)^{2p+1} N^{-(2p+1)}\, T^{p+3/2}\, h^{-2p}.$$
Combining these with Equations (A6)–(A8), we obtain the desired conclusion (Equation (15)). □

References

  1. Principe, J.C. Renyi’s entropy and Kernel perspectives. In Information Theoretic Learning; Springer: New York, NY, USA, 2010. [Google Scholar]
  2. Erdogmus, D.; Principe, J.C. Comparison of entropy and mean square error criteria in adaptive system training using higher order statistics. In Proceedings of the International Conference on ICA and Signal Separation; Springer: Berlin, Germany, 2000; pp. 75–90. [Google Scholar]
  3. Erdogmus, D.; Hild, K.; Principe, J.C. Blind source separation using Renyi’s α-marginal entropies. Neurocomputing 2002, 49, 25–38. [Google Scholar] [CrossRef]
  4. Erdogmus, D.; Principe, J.C. Convergence properties and data efficiency of the minimum error entropy criterion in adaline training. IEEE Trans. Signal Process. 2003, 51, 1966–1978. [Google Scholar] [CrossRef]
  5. Gokcay, E.; Principe, J.C. Information theoretic clustering. IEEE Trans. Pattern Anal. Mach. Learn. 2002, 24, 158–171. [Google Scholar] [CrossRef]
  6. Silva, L.M.; Marques, J.; Alexandre, L.A. Neural network classification using Shannon’s entropy. In Proceedings of the European Symposium on Artificial Neural Networks; D-Side: Bruges, Belgium, 2005; pp. 217–222. [Google Scholar]
  7. Silva, L.M.; Marques, J.; Alexandre, L.A. The MEE principle in data classification: A perceptron-based analysis. Neural Comput. 2010, 22, 2698–2728. [Google Scholar] [CrossRef]
  8. Choe, Y. Information criterion for minimum cross-entropy model selection. arXiv, 2017; arXiv:1704.04315. [Google Scholar]
  9. Ying, Y.; Zhou, D.X. Online pairwise learning algorithms. Neural Comput. 2016, 28, 743–777. [Google Scholar] [CrossRef] [PubMed]
  10. Zhang, Y.; Duchi, J.C.; Wainwright, M.J. Divide and conquer kernel ridge regression: A distributed algorithm with minimax optimal rates. J. Mach. Learn. Res. 2013, 30, 592–617. [Google Scholar]
  11. Chapelle, O.; Zien, A. Semi-Supervised Learning (Adaptive Computation and Machine Learning); The MIT Press: Cambridge, MA, USA, 2006. [Google Scholar]
  12. Zhang, T. Learning bounds for kernel regression using effective data dimensionality. Neural Comput. 2005, 17, 2077–2098. [Google Scholar] [CrossRef] [PubMed]
  13. Caponnetto, A.; Vito, E.D. Optimal rates for the regularized least-squares algorithm. Found. Comput. Math. 2007, 7, 331–368. [Google Scholar] [CrossRef]
  14. Steinwart, I.; Hush, D.R.; Scovel, C. Optimal rates for regularized least squares regression. In Proceedings of the COLT 2009—the Conference on Learning Theory, Montreal, QC, Canada, 18–21 June 2009. [Google Scholar]
  15. Lin, S.B.; Guo, X.; Zhou, D.X. Distributed learning with regularized least squares. J. Mach. Learn. Res. 2017, 18, 3202–3232. [Google Scholar]
  16. Guo, Z.C.; Lin, S.B.; Zhou, D.X. Learning theory of distributed spectral algorithms. Inverse Prob. 2017, 33, 074009. [Google Scholar] [CrossRef]
  17. Guo, Z.C.; Shi, L.; Wu, Q. Learning theory of distributed regression with bias corrected regularization kernel network. J. Mach. Learn. Res. 2017, 18, 4237–4261. [Google Scholar]
  18. Mücke, N.; Blanchard, G. Parallelizing spectrally regularized kernel algorithms. J. Mach. Learn. Res. 2018, 19, 1069–1097. [Google Scholar]
  19. Lin, S.B.; Zhou, D.X. Distributed kernel-based gradient descent algorithms. Constr. Approx. 2018, 47, 249–276. [Google Scholar] [CrossRef]
  20. Yao, Y.; Rosasco, L.; Caponnetto, A. On early stopping in gradient descent learning. Constr. Approx. 2007, 26, 289–315. [Google Scholar] [CrossRef]
  21. Chang, X.; Lin, S.B.; Zhou, D.X. Distributed semi-supervised learning with kernel ridge regression. J. Mach. Learn. Res. 2017, 18, 1493–1514. [Google Scholar]
  22. Hu, T.; Wu, Q.; Zhou, D.X. Distributed kernel gradient descent algorithm for minimum error entropy principle. Unpublished work. 2018. [Google Scholar]
  23. Guo, X.; Hu, T.; Wu, Q. Distributed minimum error entropy algorithms. Unpublished work. 2018. [Google Scholar]
  24. Wang, B.; Hu, T. Distributed pairwise algorithms with gradient descent methods. Unpublished work. 2018. [Google Scholar]
Figure 1. The mean squared errors for the size of unlabeled data $S \in \{0, 300, 600\}$ as the number of local machines $m$ varies.
Table 1. List of notations used throughout the paper.
Notation | Meaning of the Notation
$X$ | the explanatory variable
$Y$ | the response variable
$\mathcal{X}$ | $X \in \mathcal{X}$, a compact subset of a Euclidean space $\mathbb{R}^n$
$\mathcal{Y}$ | $Y \in \mathcal{Y}$, a subset of $\mathbb{R}$
$\rho(\cdot,\cdot)$ | a Borel measure on $\mathcal{X} \times \mathcal{Y}$
$\rho_X$ | the marginal probability measure of $\rho$ on $\mathcal{X}$
$\rho(y|x)$ | the conditional probability measure of $y \in \mathcal{Y}$ given $X = x$
$g_\rho(x)$ | the mean regression function $g_\rho(x) = \int_{\mathcal{Y}} y\, d\rho(y|x)$
$f_\rho(x,u)$ | the target function of MEE, given by $f_\rho(x,u) = g_\rho(x) - g_\rho(u)$
$K$ | a reproducing kernel on $\mathcal{X} \times \mathcal{X}$
$D$ | the labeled data set $D = \{(x_1,y_1), \ldots, (x_N,y_N)\}$
$N$ | the size of the labeled data set $D$
$\lfloor N/4 \rfloor$ | the largest integer not exceeding $N/4$
$|D|$ | the cardinality of $D$, $|D| = N$
$D^*$ | the unlabeled data set $D^* = \{x_1^*, \ldots, x_S^*\}$
$S$ | the size of the unlabeled data set $D^*$
$|D^*|$ | the cardinality of $D^*$, $|D^*| = S$
$\tilde{D}$ | the training data set used in the distributed MEE algorithm, consisting of $D$ and $D^*$
$|\tilde{D}|$ | the cardinality of $\tilde{D}$, $|\tilde{D}| = N + S$
$m$ | the number of local machines
$\tilde{D}_l$ | the $l$-th subset of $\tilde{D}$, $1 \le l \le m$
$G$ | the windowing (loss) function of the MEE algorithm
$L_K$ | the integral operator associated with $K$
$L_{K,\tilde{D}}$ | the empirical version of $L_K$ on $\tilde{D}$
$f_{t+1,D}$ | the output of the kernel gradient descent MEE algorithm with data $D$ and kernel $K$ after $t$ iterations
$f_{t+1,D_l}$ | the output of the kernel gradient descent MEE algorithm with data $D_l$ and kernel $K$ after $t$ iterations
$\bar{f}_{t+1,\tilde{D}}$ | the global output, the average of the local outputs $f_{t+1,\tilde{D}_l}$, $l = 1, \ldots, m$
