Article

Semi-Supervised Minimum Error Entropy Principle with Distributed Method

Baobin Wang 1 and Ting Hu 2,*
1 School of Mathematics and Statistics, South-Central University for Nationalities, Wuhan 430074, China
2 School of Mathematics and Statistics, Wuhan University, Wuhan 430072, China
* Author to whom correspondence should be addressed.
Submission received: 26 October 2018 / Revised: 8 December 2018 / Accepted: 10 December 2018 / Published: 14 December 2018
(This article belongs to the Section Information Theory, Probability and Statistics)

Abstract

The minimum error entropy (MEE) principle is an alternative to classical least squares, valued for its robustness to non-Gaussian noise. This paper studies the gradient descent algorithm for MEE in a semi-supervised setting with a distributed method, and shows that the additional information carried by unlabeled data can enhance the learning ability of the distributed MEE algorithm. Our result proves that the mean squared error of the distributed gradient descent MEE algorithm remains minimax optimal for regression even when the number of local machines grows polynomially with the total data size.

1. Introduction

The minimum error entropy (MEE) principle is an important criterion proposed in information theoretic learning (ITL) [1] and was first applied to adaptive system training by Erdogmus and Principe [2]. It has since been applied to blind source separation, maximally informative subspace projections, clustering, feature selection, blind deconvolution, minimum cross-entropy model selection, and other topics [3,4,5,6,7,8]. By taking an entropy of the error as the training criterion, the MEE principle fully exploits the information contained in the data and is robust to outliers in the implementation of algorithms.
Let $X$ be an explanatory variable taking values in a compact metric space $(\mathcal{X}, d)$ with $\mathcal{X} \subseteq \mathbb{R}^n$, let $Y$ be a real response variable with values in $\mathcal{Y} \subseteq \mathbb{R}$, and let $g: \mathcal{X} \to \mathcal{Y}$ be a prediction function. For a given set of labeled examples $D = \{(x_i, y_i)\}_{i=1}^{N} \subset \mathcal{X} \times \mathcal{Y}$ ($N$ denotes the sample size) and a windowing function $G: \mathbb{R} \to \mathbb{R}_+$, the MEE principle seeks a minimizer of the empirical quadratic entropy
$$\hat{H}(g) = -\log\left(\frac{h^2}{N^2}\sum_{(x_i,y_i)\in D}\sum_{(x_j,y_j)\in D} G\left(\frac{\left[(y_i - g(x_i)) - (y_j - g(x_j))\right]^2}{h^2}\right)\right),$$
where $h > 0$ is a scaling parameter. The goal is to solve the regression problem $y = g_\rho(x) + \varepsilon$, where $\varepsilon$ is the noise and $g_\rho(x)$ is the target function. Setting $f(x_i, x_j) := g(x_i) - g(x_j)$, MEE belongs to the family of pairwise learning problems, which involve interactions between example pairs. Since the logarithmic function is monotonic, in the optimization process we only consider the empirical information error of MEE,
$$\mathcal{R}(f) = -\frac{h^2}{N^2}\sum_{(x_i,y_i)\in D}\sum_{(x_j,y_j)\in D} G\left(\frac{\left[y_i - y_j - f(x_i, x_j)\right]^2}{h^2}\right). \qquad (1)$$
Borrowing the idea from Reference [9], we introduce a Mercer kernel $K(\cdot,\cdot): \mathcal{X}^2 \times \mathcal{X}^2 \to \mathbb{R}$ (where $\mathcal{X}^2 := \mathcal{X} \times \mathcal{X}$) and employ the reproducing kernel Hilbert space (RKHS) $\mathcal{H}_K$ as our hypothesis space. With $K$, $\mathcal{H}_K$ is defined as the closure of the linear span of the function set $\{K_{(x,u)} := K((x,u), (\cdot,\cdot)): (x,u) \in \mathcal{X}^2\}$, equipped with the inner product $\langle\cdot,\cdot\rangle_K$ and the reproducing property $\langle K_{(x,u)}, K_{(x',u')}\rangle_K = K((x,u),(x',u'))$ for $(x,u), (x',u') \in \mathcal{X}^2$. Since $G$ is nonconvex, Equation (1) is usually solved by the following kernel-based gradient descent method. It starts with $f_{1,D} = 0$ and is updated in the $t$-th step by
$$f_{t+1,D} = f_{t,D} - \eta\,\nabla\mathcal{R}(f_{t,D}), \qquad (2)$$
where $\eta > 0$ is a step size, $\nabla$ is the gradient operator, and
$$\nabla\mathcal{R}(f_{t,D}) = -\frac{1}{N^2}\sum_{(x_i,y_i)\in D}\sum_{(x_j,y_j)\in D} G'\left(\frac{\left[y_i - y_j - f_{t,D}(x_i, x_j)\right]^2}{h^2}\right)\left[f_{t,D}(x_i, x_j) - y_i + y_j\right]K_{(x_i,x_j)}.$$
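In coefficient form, each iterate $f_{t,D}$ lies in the span of $\{K_{(x_i,x_j)}\}$, so the update (2) can be carried out on a coefficient vector over all example pairs. The following is a minimal sketch of this, under our own illustrative choices of $h$, $\eta$, $T$, and $G(u) = e^{-u}$ (so $G'(u) = -e^{-u}$); it is only practical for small $N$, since the pairwise Gram matrix has size $N^2 \times N^2$.

```python
import numpy as np

def kernel_gd_mee(X, y, pair_kernel, h=1.0, eta=0.5, T=100,
                  G_prime=lambda u: -np.exp(-u)):
    """Sketch of the kernel gradient descent MEE update (Equation (2)).

    X: (N, d) inputs, y: (N,) labels.
    pair_kernel(P, Q): Gram matrix between two arrays whose rows are
                       concatenated input pairs (x, x').
    h, eta, T and G_prime are illustrative choices, not values from the paper.
    """
    N = len(y)
    X = np.asarray(X).reshape(N, -1)
    # all example pairs (x_i, x_j) and pairwise responses y_i - y_j
    I, J = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")
    pairs = np.hstack([X[I.ravel()], X[J.ravel()]])     # (N^2, 2d)
    y_diff = (y[I] - y[J]).ravel()                      # (N^2,)

    Kmat = pair_kernel(pairs, pairs)                    # (N^2, N^2) pairwise Gram matrix
    alpha = np.zeros(N * N)                             # f_t = sum_k alpha_k K_{pair_k}
    for _ in range(T):
        f_vals = Kmat @ alpha                           # f_t(x_i, x_j) on all pairs
        resid = y_diff - f_vals                         # y_i - y_j - f_t(x_i, x_j)
        # gradient coefficient: -(1/N^2) G'(resid^2/h^2) * (f_t - y_i + y_j)
        grad_coef = -(1.0 / N**2) * G_prime(resid**2 / h**2) * (-resid)
        alpha -= eta * grad_coef                        # f_{t+1} = f_t - eta * grad
    return pairs, alpha                                 # f_T(z) = sum_k alpha_k K(pair_k, z)
```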
As is well known, the number of example pairs grows quadratically with the sample size $N$, which creates a heavy computational burden in the MEE implementation. It is therefore natural to reduce the algorithmic complexity by a distributed method based on a divide-and-conquer strategy [10]. Semi-supervised learning (SSL) [11] has attracted extensive attention as an emerging field in machine learning and data mining. In many practical problems, only a few labeled data are available while unlabeled data are abundant, since labeling data requires considerable time, effort, or money. In this paper, we study a distributed MEE algorithm in the framework of SSL and show that the learning ability of the MEE algorithm can be enhanced by the distributed method together with the combination of labeled and unlabeled data.
This paper makes three main contributions. First, we derive an explicit learning rate for the gradient descent method for distributed MEE in the context of SSL, which matches the minimax optimal rate of least squares regression. This implies that the MEE algorithm can serve as an alternative to least squares in SSL, in the sense that both have the same prediction power. Second, we provide a theoretical upper bound on the number of local machines that guarantees the optimal rate in the distributed computation. Third, we extend the range of target functions allowed in the distributed MEE algorithm.
In Table 1, we summarize some notations used in this paper.

2. Algorithms and Main Results

We consider MEE for the regression problem. To allow noise in the sampling process, we assume that a Borel measure $\rho(\cdot,\cdot)$ is defined on the product space $\mathcal{X} \times \mathcal{Y}$. Let $\rho(y|x)$ be the conditional distribution of $y \in \mathcal{Y}$ given $x \in \mathcal{X}$, and $\rho_X(\cdot)$ the marginal distribution on $\mathcal{X}$. For the semi-supervised MEE algorithm, our goal is to estimate the regression function $g_\rho(x) = \int_{\mathcal{Y}} y\, d\rho(y|x)$, $x \in \mathcal{X}$, from labeled examples $D = \{(x_i, y_i)\}_{i=1}^{N}$ and unlabeled examples $D^* = \{x_j^*\}_{j=1}^{S}$ drawn from the distributions $\rho$ and $\rho_X$, respectively.
Based on the divide-and-conquer strategy, both $D$ and $D^*$ are partitioned evenly into $m$ subsets, $D = \bigcup_{l=1}^{m} D_l$ and $D^* = \bigcup_{l=1}^{m} D_l^*$. We denote the subset sizes by $|D_l| = n$ and $|D_l^*| = s$ for $1 \le l \le m$, i.e., $N = mn$ and $S = ms$. We construct a new dataset $\tilde{D} = \bigcup_{l=1}^{m} \tilde{D}_l$ by
$$\tilde{D}_l = D_l \cup D_l^* = \{(\tilde{x}_k, \tilde{y}_k)\}_{k=1}^{n+s},$$
where
$$\tilde{x}_k = \begin{cases} x_k, & \text{if } (x_k, y_k) \in D_l, \\ x_k^*, & \text{if } x_k^* \in D_l^*, \end{cases} \qquad \tilde{y}_k = \begin{cases} \dfrac{n+s}{n}\, y_k, & \text{if } (x_k, y_k) \in D_l, \\ 0, & \text{if } x_k^* \in D_l^*. \end{cases}$$
Based on the gradient descent algorithm (Equation (2)), we obtain a local estimator $f_{t,\tilde{D}_l}$ for each subset $\tilde{D}_l$, $1 \le l \le m$. The global estimator is then given by averaging these local estimators:
$$\bar{f}_{t,\tilde{D}} = \frac{1}{m}\sum_{l=1}^{m} f_{t,\tilde{D}_l}. \qquad (3)$$
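A small sketch (ours) of this construction: each local training set rescales the labeled responses by $(n+s)/n$ and appends the unlabeled points with label $0$, and the global estimator is the plain average of the local outputs. The local solver `train_local` stands for one run of the kernel gradient descent of Equation (2) (e.g., the `kernel_gd_mee` sketch above) and is assumed to return a callable estimator.

```python
import numpy as np

def make_block(x_lab, y_lab, x_unlab):
    """One local training set D~_l: labeled labels rescaled by (n+s)/n, unlabeled labels set to 0."""
    n, s = len(y_lab), len(x_unlab)
    x_tilde = np.concatenate([x_lab, x_unlab])
    y_tilde = np.concatenate([(n + s) / n * y_lab, np.zeros(s)])
    return x_tilde, y_tilde

def distributed_estimator(x_lab, y_lab, x_unlab, m, train_local):
    """Divide-and-conquer MEE (Equation (3)): train on each D~_l, then average the outputs."""
    blocks = zip(np.array_split(x_lab, m), np.array_split(y_lab, m), np.array_split(x_unlab, m))
    local = [train_local(*make_block(xl, yl, xu)) for xl, yl, xu in blocks]
    return lambda pair: np.mean([f(pair) for f in local], axis=0)
```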
In the pairwise setting, our target function is $f_\rho(x, x') = g_\rho(x) - g_\rho(x')$, $x, x' \in \mathcal{X}$, the difference of the regression function $g_\rho$. Denote by $L^2_{\rho_{X^2}}$ the space of square integrable functions on the product space $\mathcal{X}^2$:
$$L^2_{\rho_{X^2}} := \left\{ f: \mathcal{X}^2 \to \mathbb{R} \;:\; \|f\|_{L^2} = \left(\int_{\mathcal{X}^2} |f(x,x')|^2\, d\rho_X(x)\, d\rho_X(x')\right)^{\frac{1}{2}} < \infty \right\}.$$
The goodness of $\bar{f}_{t,\tilde{D}}$ is usually measured by the mean squared error $\|\bar{f}_{t,\tilde{D}} - f_\rho\|_{L^2}^2$.
Throughout the paper, we assume that $\sup_{(x,x')\in\mathcal{X}^2} K((x,x'),(x,x')) \le 1$ and that $|y| \le M$ almost surely for some constant $M > 0$. Without loss of generality, the windowing function $G$ is assumed to be differentiable and to satisfy $G'(0) = -1$, $G'(u) < 0$ for $u > 0$, $C_G := \sup_{u \in (0,\infty)} |G'(u)| < \infty$, and, for some $p$ and $c_p > 0$,
$$|G'(u) - G'(0)| \le c_p\, |u|^p, \qquad u > 0. \qquad (4)$$
It is easy to check that the Gaussian window $G(u) = \exp\{-u\}$ satisfies the assumptions above with $p = 1$.
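For completeness, here is the short verification of this claim (our own computation):
$$G'(u) = -e^{-u}, \qquad G'(0) = -1, \qquad G'(u) < 0 \ \text{for } u > 0, \qquad C_G = \sup_{u>0}|G'(u)| = 1,$$
$$|G'(u) - G'(0)| = 1 - e^{-u} \le u = |u|^1, \qquad u > 0,$$
so Equation (4) holds with $p = 1$ and $c_p = 1$.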
Before presenting our main results, we define the integral operator $L_K: L^2_{\rho_{X^2}} \to L^2_{\rho_{X^2}}$ associated with the kernel $K$ by
$$L_K(f) := \int_{\mathcal{X}}\int_{\mathcal{X}} f(x,x')\, K_{(x,x')}\, d\rho_X(x)\, d\rho_X(x'), \qquad f \in L^2_{\rho_{X^2}}.$$
Our error analysis for the distributed MEE algorithm (Equation (3)) is stated in terms of the following regularity condition:
$$f_\rho = L_K^r(\phi) \quad \text{for some } r > 0 \text{ and } \phi \in L^2_{\rho_{X^2}}, \qquad (5)$$
where $L_K^r$ denotes the $r$-th power of $L_K$ on $L^2_{\rho_{X^2}}$, which is well defined since the operator $L_K$ is positive and compact for the Mercer kernel $K$. We use the effective dimension [12,13] $\mathcal{N}(\lambda)$ to measure the complexity of $\mathcal{H}_K$ with respect to $\rho_X$; it is defined as the trace of the operator $(\lambda I + L_K)^{-1} L_K$,
$$\mathcal{N}(\lambda) = \mathrm{Tr}\left((\lambda I + L_K)^{-1} L_K\right), \qquad \lambda > 0.$$
To obtain optimal learning rates, we need to quantify $\mathcal{N}(\lambda)$. A suitable assumption is that
$$\mathcal{N}(\lambda) \le C_0\, \lambda^{-\beta} \quad \text{for some } C_0 > 0 \text{ and } 0 < \beta \le 1. \qquad (6)$$
Remark 1.
When $\beta = 1$, Equation (6) always holds with $C_0 = \mathrm{Tr}(L_K)$. For $0 < \beta < 1$, when $\mathcal{H}_K$ is a Sobolev space $W^\alpha(\mathcal{X})$ on $\mathcal{X} \subseteq \mathbb{R}^d$ with all derivatives of order up to $\alpha > \frac{d}{2}$, Equation (6) is satisfied with $\beta = \frac{d}{2\alpha}$ [14]. Moreover, if the eigenvalues $\{\gamma_i\}_{i=1}^{\infty}$ of the operator $L_K$ decay as $\gamma_i = O(i^{-b})$ for some $b > 1$, then $\mathcal{N}(\lambda) = O(\lambda^{-\frac{1}{b}})$; see the sketch below. This eigenvalue assumption is typical in the analysis of kernel estimators and was recently used in References [13,15,16] to establish optimal learning rates for least squares problems.
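A short check of the eigenvalue claim (our own computation): writing the effective dimension in terms of the eigenvalues of $L_K$ and splitting the sum at $i \approx (c/\lambda)^{1/b}$, where $\gamma_i \le c\, i^{-b}$,
$$\mathcal{N}(\lambda) = \sum_{i\ge 1}\frac{\gamma_i}{\gamma_i+\lambda} \le \#\left\{i : i \le (c/\lambda)^{\frac{1}{b}}\right\} + \frac{1}{\lambda}\sum_{i > (c/\lambda)^{1/b}} c\, i^{-b} \le \left(\frac{c}{\lambda}\right)^{\frac{1}{b}} + \frac{1}{b-1}\left(\frac{c}{\lambda}\right)^{\frac{1}{b}} = \frac{b}{b-1}\left(\frac{c}{\lambda}\right)^{\frac{1}{b}} = O\left(\lambda^{-\frac{1}{b}}\right).$$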
The following theorem shows that the distributed gradient descent algorithm (Equation (3)) achieves the optimal rate for a suitable choice of the iteration number $T$ and a bound on the maximal number of local machines; its proof can be found in Section 3.
Theorem 1. (Main Result)
Assume Equations (5) and (6) hold with $r + \beta \ge \frac{1}{2}$. Let the iteration number be $T = \lfloor N/4 \rfloor^{\frac{1}{2r+\beta}}$ and let $S + N \ge N^{\frac{\beta+1}{2r+\beta}}$. If
$$m < \min\left\{ (N+S)^{\frac{1}{2}}\, N^{-\frac{\beta+1}{4r+2\beta}},\; (N+S)^{\frac{1}{3}}\, N^{-\frac{2-2r-\beta}{6r+3\beta}} \right\} \Big/ \log^6 N, \qquad (7)$$
then for any $0 < \delta < 1$, with confidence at least $1 - \delta$,
$$\left\|\bar{f}_{T+1,\tilde{D}} - f_\rho\right\|_{L^2} \le C \max\left\{ N^{-\frac{r}{2r+\beta}},\; h^{-2p}(N+S)^{2p+1}\, N^{\frac{p+3/2}{2r+\beta} - (2p+1)} \right\} \log^4\frac{24}{\delta}, \qquad (8)$$
where $C$ is a constant independent of $N$, $S$, $\delta$, and $h$, and $\lfloor N/4 \rfloor$ denotes the largest integer not exceeding $N/4$.
Corollary 1.
Under the same conditions as Theorem 1, if the scaling parameter satisfies
$$h > (N+S)^{\frac{2p+1}{2p}}\, N^{\frac{r+p+3/2}{2p(2r+\beta)}}\, N^{-\frac{2p+1}{2p}},$$
then for any $0 < \delta < 1$, with confidence at least $1 - \delta$,
$$\left\|\bar{f}_{T+1,\tilde{D}} - f_\rho\right\|_{L^2} \le C\, N^{-\frac{r}{2r+\beta}} \log^4\frac{24}{\delta}. \qquad (9)$$
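The threshold on $h$ simply makes the second term in the maximum of Equation (8) no larger than the first; a short calculation (ours):
$$h^{-2p}(N+S)^{2p+1}N^{\frac{p+3/2}{2r+\beta}-(2p+1)} \le N^{-\frac{r}{2r+\beta}} \iff h^{2p} \ge (N+S)^{2p+1}N^{\frac{r+p+3/2}{2r+\beta}-(2p+1)} \iff h \ge (N+S)^{\frac{2p+1}{2p}}\, N^{\frac{r+p+3/2}{2p(2r+\beta)}}\, N^{-\frac{2p+1}{2p}}.$$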
Remark 2.
The rate $O\big(N^{-\frac{r}{2r+\beta}}\big)$ in Equation (9) is optimal in the minimax sense for kernel regression problems [13]. When $m = 1$, Equation (9) shows that the kernel gradient descent MEE algorithm (Equation (2)) on a single big data set achieves the minimax optimal rate for regression; thus MEE is a sound alternative to the classical least squares. Meanwhile, the upper bound (Equation (7)) on the number of local machines implies that the performance of the distributed MEE algorithm (Equation (3)) can be as good as that of the standard MEE algorithm (Equation (2)) acting on the whole data set $\tilde{D}$, provided that the size $n + s$ of each subset $\tilde{D}_l$ is not too small.
Remark 3.
If no unlabeled data are used in the algorithm (Equation (3)), then $S = 0$ and the upper bound (Equation (7)) on the number of local machines $m$ that ensures the optimal rate is of order $O\big(N^{\frac{r-1/2}{2r+\beta}}\big)$. Hence, when the regularity parameter $r$ in Equation (5) is close to $\frac{1}{2}$, this upper bound reduces to a constant and the distributed algorithm (Equation (3)) is no longer feasible in real applications. A similar phenomenon is observed in various distributed algorithms [15,16,17,18]. When the size of the unlabeled data $S > 0$, we see from Equation (7) that the upper bound on $m$ keeps growing as $S$ increases while the size of the labeled data $N$ is fixed. For example, let $\beta > \frac{1}{2}$ and $S = N^{1+\frac{1}{2r+\beta}}$; then the upper bound in Equation (7) is $O\big(N^{\frac{r}{2r+\beta}}\big)$ and does not degenerate to a constant when $r \le \frac{1}{2}$. Hence, with sufficient unlabeled data $D^*$, the distributed algorithm (Equation (3)) allows more local machines in the distributed method.
Remark 4.
A series of distributed learning works [15,16,17,18,19] were carried out under the assumption that the target function $f_\rho$ lies in the space $\mathcal{H}_K$, i.e., that the regularity parameter satisfies $r > \frac{1}{2}$. As a byproduct, Theorem 1 does not impose the restriction $r > \frac{1}{2}$ on the distributed algorithm (Equation (3)).

3. Proof of Main Result

In this section we prove the main result, Theorem 1. To this end, we introduce the data-free gradient descent method in $\mathcal{H}_K$ for least squares, defined by $f_1 = 0$ and
$$f_{t+1} = f_t - \eta \int_{\mathcal{X}}\int_{\mathcal{X}} \left(f_t(x,x') - f_\rho(x,x')\right) K_{(x,x')}\, d\rho_X(x)\, d\rho_X(x'), \qquad t \ge 1. \qquad (10)$$
Recalling the definition of $L_K$, this can be written as
$$f_{t+1} = f_t - \eta\, L_K(f_t - f_\rho) = (I - \eta L_K)\, f_t + \eta\, L_K(f_\rho), \qquad t \ge 1.$$
Following the standard error decomposition technique in learning theory, we split the error $\bar{f}_{t+1,\tilde{D}} - f_\rho$ into the sample error $\bar{f}_{t+1,\tilde{D}} - f_{t+1}$ and the approximation error $f_{t+1} - f_\rho$.

3.1. Approximation Error

We first estimate the approximation error $\|f_{t+1} - f_\rho\|_{L^2}$. The following bounds were proven in Reference [20].
Lemma 1.
Define $\{f_t\}$ by Equation (10) with $0 < \eta \le 1$. If Equation (5) holds with $r > 0$, then
$$\|f_t - f_\rho\|_{L^2} \le c_{\phi,r}\, t^{-r},$$
and, when $r \ge \frac{1}{2}$,
$$\|f_t - f_\rho\|_{K} \le c_{\phi,r}\, t^{-(r-\frac{1}{2})},$$
where $c_{\phi,r} = \max\left\{ \|\phi\|_{L^2} (2r/e)^r,\; \|\phi\|_{L^2} \left[(2r-1)/e\right]^{r-\frac{1}{2}} \right\}$.
Moreover, we derive a uniform bound for the sequence $\{f_t\}$ defined by Equation (10) when $0 < r < \frac{1}{2}$, which is useful in our analysis. Here and in the sequel, $\pi_{i+1}^{t}(L)$ denotes the polynomial operator associated with an operator $L$, defined by $\pi_{i+1}^{t}(L) := \prod_{j=i+1}^{t}(I - \eta L)$ with the convention $\pi_{t+1}^{t} := I$.
Lemma 2.
Define $\{f_t\}$ by Equation (10) with $0 < \eta \le 1$. If Equation (5) holds with $0 < r < \frac{1}{2}$, then
$$\|f_t\|_K \le d_{\phi,\eta,r}\, t^{\frac{1}{2}-r}, \qquad (11)$$
where $d_{\phi,\eta,r}$ is given in the proof.
Proof. 
Using Equation (10) iteratively from $t$ down to $1$, we have
$$f_{t+1} = \sum_{i=1}^{t} \eta\, \pi_{i+1}^{t}(L_K)\, L_K(f_\rho), \qquad \text{for all } t \ge 1.$$
With Equation (5),
$$\|f_{t+1}\|_K = \left\|\sum_{i=1}^{t}\eta\,\pi_{i+1}^{t}(L_K) L_K(f_\rho)\right\|_K = \left\|\sum_{i=1}^{t}\eta\,\pi_{i+1}^{t}(L_K) L_K\, L_K^r(\phi)\right\|_K = \left\|\sum_{i=1}^{t}\eta\,\pi_{i+1}^{t}(L_K) L_K^{r+\frac{1}{2}}\, L_K^{\frac{1}{2}}\phi\right\|_K \le \left\|\sum_{i=1}^{t}\eta\,\pi_{i+1}^{t}(L_K) L_K^{r+\frac{1}{2}}\right\| \|\phi\|_{L^2}. \qquad (12)$$
Let $\{\sigma_k\}_{k=1}^{\infty}$ be the eigenvalues of the operator $L_K$; since $L_K$ is positive and $\|L_K\|_{\mathcal{H}_K \to \mathcal{H}_K} \le 1$, we have $0 \le \sigma_k \le 1$ for $k \ge 1$, and hence
$$\left\|\sum_{i=1}^{t}\eta\,\pi_{i+1}^{t}(L_K) L_K^{r+\frac{1}{2}}\right\| = \sup_{k \ge 1}\sum_{i=1}^{t}\eta\,\pi_{i+1}^{t}(\sigma_k)\,\sigma_k^{r+\frac{1}{2}} \le \eta\sup_{a > 0}\sum_{i=1}^{t-1}\pi_{i+1}^{t}(a)\, a^{r+\frac{1}{2}} + \eta \le \eta\sup_{a > 0}\sum_{i=1}^{t-1}\exp\{-\eta a(t-i)\}\, a^{r+\frac{1}{2}} + \eta \le \sum_{i=1}^{t-1}\sup_{a > 0}\eta\exp\{-\eta a(t-i)\}\, a^{r+\frac{1}{2}} + \eta.$$
For each $i \le t-1$, a direct calculation gives
$$\sup_{a > 0}\eta\exp\{-\eta a(t-i)\}\, a^{r+\frac{1}{2}} = \eta\exp\{-\eta a(t-i)\}\, a^{r+\frac{1}{2}}\Big|_{a = (r+\frac{1}{2})[\eta(t-i)]^{-1}} = \eta^{\frac{1}{2}-r}\left(\frac{r+\frac{1}{2}}{e}\right)^{r+\frac{1}{2}}(t-i)^{-(r+\frac{1}{2})} \le (t-i)^{-(r+\frac{1}{2})},$$
where the last inequality uses $0 < \eta \le 1$ and $0 < r < \frac{1}{2}$. Thus,
$$\left\|\sum_{i=1}^{t}\eta\,\pi_{i+1}^{t}(L_K) L_K^{r+\frac{1}{2}}\right\| \le \sum_{i=1}^{t-1}(t-i)^{-(r+\frac{1}{2})} + \eta = \sum_{i=1}^{t-1} i^{-(r+\frac{1}{2})} + \eta.$$
By the elementary inequality $\sum_{i=1}^{t} i^{-\theta} \le \frac{t^{1-\theta}}{1-\theta}$ with $0 < \theta < 1$, it follows that
$$\left\|\sum_{i=1}^{t}\eta\,\pi_{i+1}^{t}(L_K) L_K^{r+\frac{1}{2}}\right\| \le \left(\frac{1}{1/2 - r} + 1\right) t^{\frac{1}{2}-r} = \frac{3/2 - r}{1/2 - r}\, t^{\frac{1}{2}-r}.$$
Together with Equation (12), the proof is completed by taking $d_{\phi,\eta,r} := \frac{3/2 - r}{1/2 - r}\, \|\phi\|_{L^2}$.  □

3.2. Sample Error

Define the empirical operator $L_{K,D}: \mathcal{H}_K \to \mathcal{H}_K$ by
$$L_{K,D} := \frac{1}{N^2}\sum_{(x_i,y_i)\in D}\sum_{(x_j,y_j)\in D}\left\langle \cdot,\, K_{(x_i,x_j)}\right\rangle_K K_{(x_i,x_j)},$$
so that, for any $f \in \mathcal{H}_K$,
$$L_{K,D}(f) = \frac{1}{N^2}\sum_{(x_i,y_i)\in D}\sum_{(x_j,y_j)\in D}\left\langle f, K_{(x_i,x_j)}\right\rangle_K K_{(x_i,x_j)} = \frac{1}{N^2}\sum_{(x_i,y_i)\in D}\sum_{(x_j,y_j)\in D} f(x_i,x_j)\, K_{(x_i,x_j)}.$$
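In the coefficient basis $\{K_{(x_i,x_j)}\}$ over all example pairs, this operator is just a rescaled pairwise Gram matrix; a two-line sketch (ours), with `Kmat` denoting the $N^2 \times N^2$ Gram matrix of $K$ over all pairs (our notation):

```python
import numpy as np

def apply_L_KD(beta, Kmat, N):
    """Apply L_{K,D} to f = sum_k beta_k K_{pair_k} and return the result's coefficients.

    f(pair_j) = (Kmat @ beta)_j and L_{K,D}(f) = (1/N^2) sum_j f(pair_j) K_{pair_j},
    so in the basis {K_{pair_j}} the operator acts as beta -> (Kmat @ beta) / N^2.
    """
    return (Kmat @ beta) / N**2
```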
Then the MEE gradient descent algorithm (Equation (2)) on $\tilde{D}$ can be written as
$$f_{t+1,\tilde{D}} = \left[I - \eta L_{K,\tilde{D}}\right](f_{t,\tilde{D}}) + \eta\, f_{\rho,\tilde{D}} + \eta\, E_{t,\tilde{D}}, \qquad (13)$$
where
$$E_{t,\tilde{D}} = \frac{1}{(N+S)^2}\sum_{(x_i,y_i)\in\tilde{D}}\sum_{(x_j,y_j)\in\tilde{D}}\left\{G'\left(\frac{\left[y_i - y_j - f_{t,\tilde{D}}(x_i,x_j)\right]^2}{h^2}\right) - G'(0)\right\}\left[f_{t,\tilde{D}}(x_i,x_j) - y_i + y_j\right] K_{(x_i,x_j)},$$
and
$$f_{\rho,\tilde{D}} = \frac{1}{(N+S)^2}\sum_{(x_i,y_i)\in\tilde{D}}\sum_{(x_j,y_j)\in\tilde{D}}(y_i - y_j)\, K_{(x_i,x_j)}.$$
(The identity (13) follows by expanding the gradient in Equation (2) and using $G'(0) = -1$.)
In the sequel, denote
$$B_{\tilde{D},\lambda} = \left\|(L_{K,\tilde{D}} + \lambda I)^{-1}(L_K + \lambda I)\right\|, \qquad C_{\tilde{D},\lambda} = \left\|(L_K + \lambda I)^{-\frac{1}{2}}(L_K - L_{K,\tilde{D}})\right\|,$$
$$D_{\tilde{D},\lambda} = \left\|\frac{1}{m}\sum_{l=1}^{m}(L_K + \lambda I)^{-\frac{1}{2}}(L_K - L_{K,\tilde{D}_l})\right\|, \qquad F_{\tilde{D},\lambda} = \left\|\frac{1}{m}\sum_{l=1}^{m}(L_K + \lambda I)^{-\frac{1}{2}}\left[f_{\rho,\tilde{D}_l} - L_K(f_\rho)\right]\right\|_K,$$
$$G_{\tilde{D},\lambda} = \left\|(L_K + \lambda I)^{-\frac{1}{2}}\left(L_K f_\rho - f_{\rho,\tilde{D}}\right)\right\|_K.$$
With these preliminaries in place, we now turn to the estimate of the sample error $\bar{f}_{t+1,\tilde{D}} - f_{t+1}$, presented in the following lemma, whose proof can be found in the Appendix. Here and in the sequel, we use the conventional notation $\sum_{i=1}^{t}(t-i)^{-1} := \sum_{i=1}^{t-1}(t-i)^{-1} + 1$.
Lemma 3.
Let $\lambda > 0$ and $0 < \eta < \min\{C_G^{-1}, 1\}$. For any $f^* \in \mathcal{H}_K$, there holds
$$\left\|\bar{f}_{T+1,\tilde{D}} - f_{T+1}\right\|_{L^2} \le \mathrm{term}_1 + \mathrm{term}_2 + c_{p,M}\,(N+S)^{2p+1} N^{-(2p+1)}\, T^{p+3/2}\, h^{-2p}, \qquad (15)$$
where $c_{p,M} = 2^{4p+2} c_p\, C_G^{2p+1} M^{2p+1}$,
$$\mathrm{term}_1 = \sup_{1 \le l \le m}\sum_{i=1}^{T}\left[(T-i)^{-1} + \eta\lambda\right] C_{\tilde{D}_l,\lambda}\left\{\sum_{s=1}^{i-1}\left[(i-s)^{-1} + \lambda\eta\right]\|f_s - f^*\|_K\, B_{\tilde{D}_l,\lambda} C_{\tilde{D}_l,\lambda}\,\lambda^{-\frac{1}{2}} + (1 + \lambda\eta i)\, B_{\tilde{D}_l,\lambda}\left(C_{\tilde{D}_l,\lambda}\|f^*\|_K + G_{\tilde{D}_l,\lambda}\right)\lambda^{-\frac{1}{2}} + c_{p,M}(N+S)^{2p+1} N^{-(2p+1)}\, i^{p+1/2}\, h^{-2p}\right\},$$
and
$$\mathrm{term}_2 = \sum_{i=1}^{T}\left[(T-i)^{-1} + \eta\lambda\right] D_{\tilde{D},\lambda}\,\|f_i - f^*\|_K + (1 + \lambda\eta T)\left(D_{\tilde{D},\lambda}\|f^*\|_K + F_{\tilde{D},\lambda}\right).$$
With the help of the lemma above, to bound the sample error $\|\bar{f}_{T+1,\tilde{D}} - f_{T+1}\|_{L^2}$ we first need to estimate the quantities $B_{\tilde{D},\lambda}$, $C_{\tilde{D},\lambda}$, $D_{\tilde{D},\lambda}$, $F_{\tilde{D},\lambda}$, and $G_{\tilde{D},\lambda}$. Denote $\mathcal{A}_{D,\lambda} := \frac{1}{\lfloor |D|/4\rfloor\sqrt{\lambda}} + \sqrt{\frac{\mathcal{N}(\lambda)}{\lfloor |D|/4\rfloor}}$, where $|D|$ is the cardinality of $D$. From previous work [19,21,22,23], each of the following inequalities holds with confidence at least $1 - \delta$:
$$B_{\tilde{D},\lambda} \le 2\left(\frac{\mathcal{A}_{\tilde{D},\lambda}\log\frac{2}{\delta}}{\sqrt{\lambda}}\right)^2 + 2, \qquad C_{\tilde{D},\lambda} \le 2\,\mathcal{A}_{\tilde{D},\lambda}\log\frac{2}{\delta}, \qquad D_{\tilde{D},\lambda} \le 2\,\mathcal{A}_{\tilde{D},\lambda}\log\frac{2}{\delta},$$
$$F_{\tilde{D},\lambda} \le 16 M\, \mathcal{A}_{D,\lambda}\log\frac{4}{\delta}, \qquad G_{\tilde{D},\lambda} \le 16 M\, \mathcal{A}_{D,\lambda}\log\frac{4}{\delta}. \qquad (16)$$
Lemma 3 also shows that the choice of $f^*$ is crucial in bounding $\bar{f}_{T+1,\tilde{D}} - f_{T+1}$. To get a tight bound on the learning error, we choose $f^* \in \mathcal{H}_K$ according to the regularity of the target function: when $r \ge \frac{1}{2}$, $f_\rho \in \mathcal{H}_K$ and we take $f^* = f_\rho$; when $0 < r < \frac{1}{2}$, $f_\rho$ lies outside the space $\mathcal{H}_K$ and we take $f^* = 0$.
We now give the first result, for the case where the target function $f_\rho$ is outside $\mathcal{H}_K$, i.e., $0 < r < \frac{1}{2}$.
Theorem 2.
Assume Equation (5) holds with $0 < r < \frac{1}{2}$. Let $0 < \eta < \min\{1, C_G^{-1}\}$, $T \in \mathbb{N}$, and $\lambda = T^{-1}$. Then, for any $0 < \delta < 1$, with probability at least $1 - \delta$, there holds
$$\left\|\bar{f}_{T+1,\tilde{D}} - f_\rho\right\|_{L^2} \le C^*\Big\{T^{-r} + \log^2(T)\, \mathcal{J}_{D,\tilde{D},\lambda}\log^4\frac{24 m}{\delta} + \left[\log(T)\, \mathcal{A}_{\tilde{D},\lambda}\lambda^{r-\frac{1}{2}} + \mathcal{A}_{D,\lambda}\right]\log\frac{16}{\delta} + (N+S)^{2p+1} N^{-(2p+1)} h^{-2p}\Big[T^{p+3/2} + T^{p+\frac{1}{2}}\log(T)\sup_{1\le l\le m}\mathcal{A}_{\tilde{D}_l,\lambda}\log\frac{2}{\delta}\Big]\Big\}, \qquad (17)$$
where $C^*$ is a constant given in the proof and $\mathcal{J}_{D,\tilde{D},\lambda} = \sup_{1\le l\le m}\left[\left(\frac{\mathcal{A}_{\tilde{D}_l,\lambda}}{\sqrt{\lambda}}\right)^2 + 1\right]\left(\mathcal{A}_{\tilde{D}_l,\lambda}^2\lambda^{r-1} + \mathcal{A}_{\tilde{D}_l,\lambda}\mathcal{A}_{D_l,\lambda}\lambda^{-\frac{1}{2}}\right)$.
Proof. 
Decompose $\bar{f}_{T+1,\tilde{D}} - f_\rho$ as
$$\left\|\bar{f}_{T+1,\tilde{D}} - f_\rho\right\|_{L^2} \le \left\|\bar{f}_{T+1,\tilde{D}} - f_{T+1}\right\|_{L^2} + \left\|f_{T+1} - f_\rho\right\|_{L^2}.$$
The estimate of $\|f_{T+1} - f_\rho\|_{L^2}$ is provided by Lemma 1, so we only need to handle $\|\bar{f}_{T+1,\tilde{D}} - f_{T+1}\|_{L^2}$ by Lemma 3.
For any $0 < s \le T - 1$ and $\lambda = T^{-1}$, Equation (11) gives $\|f_s\|_K \le d_{\phi,\eta,r}\, s^{\frac{1}{2}-r} \le d_{\phi,\eta,r}\,\lambda^{r-\frac{1}{2}}$. Taking $f^* = 0$ in Lemma 3, we obtain
$$\mathrm{term}_1 \le (1 + d_{\phi,\eta,r} + c_{p,M})\sup_{1\le l\le m}\sum_{i=1}^{T}\left[(T-i)^{-1} + \eta\lambda\right] C_{\tilde{D}_l,\lambda}\left\{\sum_{s=1}^{i-1}\left[(i-s)^{-1} + \lambda\eta\right] B_{\tilde{D}_l,\lambda} C_{\tilde{D}_l,\lambda}\,\lambda^{r-1} + (1 + \lambda\eta i)\, B_{\tilde{D}_l,\lambda}\left(C_{\tilde{D}_l,\lambda}\lambda^{r-\frac{1}{2}} + G_{\tilde{D}_l,\lambda}\right)\lambda^{-\frac{1}{2}} + (N+S)^{2p+1} N^{-(2p+1)}\, i^{p+1/2}\, h^{-2p}\right\}, \qquad (18)$$
and
$$\mathrm{term}_2 \le (1 + d_{\phi,\eta,r})\left\{\sum_{i=1}^{T}\left[(T-i)^{-1} + \eta\lambda\right] D_{\tilde{D},\lambda}\,\lambda^{r-\frac{1}{2}} + (1 + \lambda\eta T)\left(D_{\tilde{D},\lambda}\lambda^{r-\frac{1}{2}} + F_{\tilde{D},\lambda}\right)\right\}. \qquad (19)$$
Noticing the elementary inequality $\sum_{s=1}^{i} s^{-1} \le 2\log(i)$ for $i \ge 2$, we have
$$\sum_{s=1}^{i-1}\left[(i-s)^{-1} + \lambda\eta\right] \le \sum_{s=1}^{i-1}(i-s)^{-1} + T^{-1} i \le 4\log(i),$$
$$\sum_{i=1}^{T}\left[(T-i)^{-1} + \eta\lambda\right]\sum_{s=1}^{i-1}\left[(i-s)^{-1} + \lambda\eta\right] \le 4\sum_{i=1}^{T}\left[(T-i)^{-1} + T^{-1}\right]\log(i) \le 16\sum_{i=1}^{T}\frac{\log(i)}{T-i} \le 16\log(T)\sum_{i=1}^{T}\frac{1}{T-i} = 16\log(T)\sum_{i=1}^{T-1} i^{-1} \le 32\log^2(T),$$
$$\sum_{i=1}^{T}\left[(T-i)^{-1} + \eta\lambda\right](1 + \lambda\eta i) \le \sum_{i=1}^{T}\left[(T-i)^{-1} + \eta T^{-1}\right](1 + T^{-1}\eta i) \le 2\left[\sum_{i=1}^{T}(T-i)^{-1} + 1\right] \le 8\log(T),$$
and
$$\sum_{i=1}^{T}\left[(T-i)^{-1} + \eta\lambda\right] \le 4\log(T), \qquad \sum_{i=1}^{T}\left[(T-i)^{-1} + \eta\lambda\right] i^{p+1/2} \le \left\{\sum_{i=1}^{T}\left[(T-i)^{-1} + \eta\lambda\right]\right\} T^{p+1/2} \le 4\, T^{p+1/2}\log(T).$$
Plugging the above inequalities into $\mathrm{term}_1$ and $\mathrm{term}_2$ gives
$$\mathrm{term}_1 \le C_1\sup_{1\le l\le m}\Big(\log^2(T)\, B_{\tilde{D}_l,\lambda} C_{\tilde{D}_l,\lambda}^2\,\lambda^{r-1} + \log(T)\, B_{\tilde{D}_l,\lambda} C_{\tilde{D}_l,\lambda}^2\,\lambda^{r-1} + \log(T)\, B_{\tilde{D}_l,\lambda} C_{\tilde{D}_l,\lambda} G_{\tilde{D}_l,\lambda}\,\lambda^{-\frac{1}{2}} + (N+S)^{2p+1} N^{-(2p+1)}\, T^{p+1/2}\log(T)\, C_{\tilde{D}_l,\lambda}\, h^{-2p}\Big),$$
and
$$\mathrm{term}_2 \le C_2\left(\log(T)\, D_{\tilde{D},\lambda}\,\lambda^{r-\frac{1}{2}} + F_{\tilde{D},\lambda}\right),$$
where $C_1 = 32(1 + d_{\phi,\eta,r} + c_{p,M})$ and $C_2 = 6(1 + d_{\phi,\eta,r})$.
By Equation (16), for any fixed $l$ there exist three subsets of the sample space, each of measure at least $1 - \delta$, on which
$$B_{\tilde{D}_l,\lambda} \le 2\left(\frac{\mathcal{A}_{\tilde{D}_l,\lambda}\log\frac{2}{\delta}}{\sqrt{\lambda}}\right)^2 + 2, \qquad C_{\tilde{D}_l,\lambda} \le 2\,\mathcal{A}_{\tilde{D}_l,\lambda}\log\frac{2}{\delta},$$
and
$$G_{\tilde{D}_l,\lambda} \le 16 M\,\mathcal{A}_{D_l,\lambda}\log\frac{4}{\delta},$$
respectively. Thus, for any fixed $l$, with confidence at least $1 - 3\delta$,
$$B_{\tilde{D}_l,\lambda} C_{\tilde{D}_l,\lambda}^2\,\lambda^{r-1} \le 32\left[\left(\frac{\mathcal{A}_{\tilde{D}_l,\lambda}}{\sqrt{\lambda}}\right)^2 + 1\right]\mathcal{A}_{\tilde{D}_l,\lambda}^2\,\lambda^{r-1}\log^4\frac{2}{\delta},$$
and
$$B_{\tilde{D}_l,\lambda} C_{\tilde{D}_l,\lambda} G_{\tilde{D}_l,\lambda}\,\lambda^{-\frac{1}{2}} \le 256 M\left[\left(\frac{\mathcal{A}_{\tilde{D}_l,\lambda}}{\sqrt{\lambda}}\right)^2 + 1\right]\mathcal{A}_{\tilde{D}_l,\lambda}\mathcal{A}_{D_l,\lambda}\,\lambda^{-\frac{1}{2}}\log^3\frac{2}{\delta}\log\frac{4}{\delta}.$$
Therefore, with confidence at least $1 - 3m\delta$,
$$\sup_{1\le l\le m} B_{\tilde{D}_l,\lambda} C_{\tilde{D}_l,\lambda}^2\,\lambda^{r-1} \le 32\sup_{1\le l\le m}\left[\left(\frac{\mathcal{A}_{\tilde{D}_l,\lambda}}{\sqrt{\lambda}}\right)^2 + 1\right]\mathcal{A}_{\tilde{D}_l,\lambda}^2\,\lambda^{r-1}\log^4\frac{2}{\delta},$$
and
$$\sup_{1\le l\le m} B_{\tilde{D}_l,\lambda} C_{\tilde{D}_l,\lambda} G_{\tilde{D}_l,\lambda}\,\lambda^{-\frac{1}{2}} \le 256 M\sup_{1\le l\le m}\left[\left(\frac{\mathcal{A}_{\tilde{D}_l,\lambda}}{\sqrt{\lambda}}\right)^2 + 1\right]\mathcal{A}_{\tilde{D}_l,\lambda}\mathcal{A}_{D_l,\lambda}\,\lambda^{-\frac{1}{2}}\log^3\frac{2}{\delta}\log\frac{4}{\delta}.$$
Thus, by Equation (18) and scaling $3m\delta$ to $\delta/2$, it follows that with confidence at least $1 - \delta/2$,
$$\mathrm{term}_1 \le C_3\sup_{1\le l\le m}\left(\log^2(T)\left[\left(\frac{\mathcal{A}_{\tilde{D}_l,\lambda}}{\sqrt{\lambda}}\right)^2 + 1\right]\left(\mathcal{A}_{\tilde{D}_l,\lambda}^2\lambda^{r-1} + \mathcal{A}_{\tilde{D}_l,\lambda}\mathcal{A}_{D_l,\lambda}\lambda^{-\frac{1}{2}}\right)\log^4\frac{24 m}{\delta} + (N+S)^{2p+1} N^{-(2p+1)}\, T^{p+1/2}\log(T)\,\mathcal{A}_{\tilde{D}_l,\lambda}\, h^{-2p}\right),$$
where $C_3 = C_1(256 M + 64)$.
Similarly, with confidence at least $1 - 2\delta$,
$$D_{\tilde{D},\lambda} \le 2\,\mathcal{A}_{\tilde{D},\lambda}\log\frac{2}{\delta} \qquad \text{and} \qquad F_{\tilde{D},\lambda} \le 16 M\,\mathcal{A}_{D,\lambda}\log\frac{4}{\delta}.$$
By Equation (19) and scaling $2\delta$ to $\delta/2$, it follows that with confidence at least $1 - \delta/2$,
$$\mathrm{term}_2 \le C_4\left[\log(T)\,\mathcal{A}_{\tilde{D},\lambda}\,\lambda^{r-\frac{1}{2}} + \mathcal{A}_{D,\lambda}\right]\log\frac{16}{\delta},$$
where $C_4 = C_2(16 M + 2)$. Together with Lemma 1, we obtain the desired bound (Equation (17)) with $C^* = c_{\phi,r} + C_3 + C_4 + c_{p,M}$.  □
Next, we give the result when the target function $f_\rho$ lies in $\mathcal{H}_K$, i.e., $r \ge \frac{1}{2}$.
Theorem 3.
Assume Equation (5) holds with $r \ge \frac{1}{2}$. Let $0 < \eta < \min\{1, C_G^{-1}\}$, $T \in \mathbb{N}$, and $\lambda = T^{-1}$. Then, for any $0 < \delta < 1$, with probability at least $1 - \delta$, there holds
$$\left\|\bar{f}_{T+1,\tilde{D}} - f_\rho\right\|_{L^2} \le C^*\Big\{T^{-r} + \log^2(T)\,\mathcal{K}_{D,\tilde{D},\lambda}\log^4\frac{24 m}{\delta} + \left[\log(T)\,\mathcal{A}_{\tilde{D},\lambda} + \mathcal{A}_{D,\lambda}\right]\log\frac{16}{\delta} + (N+S)^{2p+1} N^{-(2p+1)} h^{-2p}\Big[T^{p+3/2} + T^{p+\frac{1}{2}}\log(T)\sup_{1\le l\le m}\mathcal{A}_{\tilde{D}_l,\lambda}\log\frac{2}{\delta}\Big]\Big\},$$
where $\mathcal{K}_{D,\tilde{D},\lambda} = \sup_{1\le l\le m}\left[\left(\frac{\mathcal{A}_{\tilde{D}_l,\lambda}}{\sqrt{\lambda}}\right)^2 + 1\right]\left(\mathcal{A}_{\tilde{D}_l,\lambda}^2 + \mathcal{A}_{\tilde{D}_l,\lambda}\mathcal{A}_{D_l,\lambda}\right)\lambda^{-\frac{1}{2}}$ and $C^*$ is a constant given in the proof.
The proof is similar to that of Theorem 2 and is omitted.
With these preliminaries in place, we can prove our main result in Theorem 1.
Proof of Theorem 1.
We first prove Equation (8) via Theorem 2 when $0 < r < \frac{1}{2}$. Let $T = \lfloor |D|/4\rfloor^{\frac{1}{2r+\beta}}$ and $\lambda = T^{-1}$. Notice that $|D| = N$, $|\tilde{D}| = N+S$, and $m|D_l| = |D|$, $m|\tilde{D}_l| = |\tilde{D}|$ for $1 \le l \le m$. Since $r + \beta > \frac{1}{2}$ and Equation (7) holds, we obtain
$$\mathcal{A}_{D,\lambda} = \lfloor |D|/4\rfloor^{-1+\frac{1}{4r+2\beta}} + \sqrt{C_0}\,\lfloor |D|/4\rfloor^{-\frac{1}{2}+\frac{\beta}{4r+2\beta}} \le (\sqrt{C_0}+1)\lfloor |D|/4\rfloor^{-\frac{r}{2r+\beta}} \le \sqrt{5}(\sqrt{C_0}+1)\,|D|^{-\frac{r}{2r+\beta}} = \sqrt{5}(\sqrt{C_0}+1)\, N^{-\frac{r}{2r+\beta}},$$
$$\mathcal{A}_{\tilde{D},\lambda} = \lfloor|\tilde{D}|/4\rfloor^{-1}\lfloor|D|/4\rfloor^{\frac{1}{4r+2\beta}} + \sqrt{C_0}\,\lfloor|\tilde{D}|/4\rfloor^{-\frac{1}{2}}\lfloor|D|/4\rfloor^{\frac{\beta}{4r+2\beta}} \le \sqrt{5}(\sqrt{C_0}+1)\left(|\tilde{D}|^{-1}|D|^{\frac{1}{4r+2\beta}} + |\tilde{D}|^{-\frac{1}{2}}|D|^{\frac{\beta}{4r+2\beta}}\right) = \sqrt{5}(\sqrt{C_0}+1)\left((N+S)^{-1}N^{\frac{1}{4r+2\beta}} + (N+S)^{-\frac{1}{2}}N^{\frac{\beta}{4r+2\beta}}\right),$$
$$\mathcal{A}_{D_l,\lambda} = \lfloor|D_l|/4\rfloor^{-1}\lfloor|D|/4\rfloor^{\frac{1}{4r+2\beta}} + \sqrt{C_0}\,\lfloor|D_l|/4\rfloor^{-\frac{1}{2}}\lfloor|D|/4\rfloor^{\frac{\beta}{4r+2\beta}} \le \sqrt{5}(\sqrt{C_0}+1)\left(m|D|^{-1+\frac{1}{4r+2\beta}} + m^{\frac{1}{2}}|D|^{-\frac{1}{2}+\frac{\beta}{4r+2\beta}}\right) = \sqrt{5}(\sqrt{C_0}+1)\left(m N^{-1+\frac{1}{4r+2\beta}} + m^{\frac{1}{2}} N^{-\frac{1}{2}+\frac{\beta}{4r+2\beta}}\right),$$
and
$$\mathcal{A}_{\tilde{D}_l,\lambda} \le \sqrt{5}(\sqrt{C_0}+1)\left(|\tilde{D}_l|^{-1}|D|^{\frac{1}{4r+2\beta}} + |\tilde{D}_l|^{-\frac{1}{2}}|D|^{\frac{\beta}{4r+2\beta}}\right) = \sqrt{5}(\sqrt{C_0}+1)\left(m|\tilde{D}|^{-1}|D|^{\frac{1}{4r+2\beta}} + m^{\frac{1}{2}}|\tilde{D}|^{-\frac{1}{2}}|D|^{\frac{\beta}{4r+2\beta}}\right) \le 2\sqrt{5}(\sqrt{C_0}+1)\, m^{\frac{1}{2}}|\tilde{D}|^{-\frac{1}{2}}|D|^{\frac{\beta}{4r+2\beta}}.$$
Thus,
$$\frac{\mathcal{A}_{\tilde{D}_l,\lambda}}{\sqrt{\lambda}} \le \sqrt{5}(\sqrt{C_0}+1)\left(m|\tilde{D}|^{-1}|D|^{\frac{1}{4r+2\beta}} + m^{\frac{1}{2}}|\tilde{D}|^{-\frac{1}{2}}|D|^{\frac{\beta}{4r+2\beta}}\right)\lfloor|D|/4\rfloor^{\frac{1}{2(2r+\beta)}} \le 2\sqrt{5}(\sqrt{C_0}+1)\, m^{\frac{1}{2}}|\tilde{D}|^{-\frac{1}{2}}|D|^{\frac{\beta+1}{4r+2\beta}} = 2\sqrt{5}(\sqrt{C_0}+1)\, m^{\frac{1}{2}}(N+S)^{-\frac{1}{2}}N^{\frac{\beta+1}{4r+2\beta}} \le 2\sqrt{5}(\sqrt{C_0}+1).$$
It follows that, for $l = 1, \ldots, m$,
$$\left(\frac{\mathcal{A}_{\tilde{D}_l,\lambda}}{\sqrt{\lambda}}\right)^2 + 1 \le 20(\sqrt{C_0}+1)^2 + 1,$$
$$\mathcal{A}_{\tilde{D}_l,\lambda}^2\,\lambda^{r-1} \le 10(\sqrt{C_0}+1)^2\left(m^2|\tilde{D}|^{-2}|D|^{\frac{1}{2r+\beta}} + m|\tilde{D}|^{-1}|D|^{\frac{\beta}{2r+\beta}}\right)|D|^{\frac{1-r}{2r+\beta}} = 10(\sqrt{C_0}+1)^2\left(m^2|\tilde{D}|^{-2}|D|^{\frac{2}{2r+\beta}} + m|\tilde{D}|^{-1}|D|^{\frac{1+\beta}{2r+\beta}}\right)|D|^{-\frac{r}{2r+\beta}} \le 20(\sqrt{C_0}+1)^2\,|D|^{-\frac{r}{2r+\beta}}\big/\log^6|D| = 20(\sqrt{C_0}+1)^2\, N^{-\frac{r}{2r+\beta}}\big/\log^6 N,$$
$$\mathcal{A}_{\tilde{D}_l,\lambda}\mathcal{A}_{D_l,\lambda}\,\lambda^{-\frac{1}{2}} \le 10(\sqrt{C_0}+1)^2\, m^{\frac{1}{2}}|\tilde{D}|^{-\frac{1}{2}}|D|^{\frac{\beta}{4r+2\beta}}\left(m|D|^{-1+\frac{1}{4r+2\beta}} + m^{\frac{1}{2}}|D|^{-\frac{1}{2}+\frac{\beta}{4r+2\beta}}\right)|D|^{\frac{1}{4r+2\beta}} \le 10(\sqrt{C_0}+1)^2\left(m^{\frac{3}{2}}|\tilde{D}|^{-\frac{1}{2}}|D|^{-\frac{\beta+4r-2}{4r+2\beta}} + m|\tilde{D}|^{-\frac{1}{2}}|D|^{\frac{\beta+1-2r}{4r+2\beta}}\right) \le 20(\sqrt{C_0}+1)^2\,|D|^{-\frac{r}{2r+\beta}}\big/\log^6|D| = 20(\sqrt{C_0}+1)^2\, N^{-\frac{r}{2r+\beta}}\big/\log^6 N.$$
Thus, by the above estimates,
$$\mathcal{J}_{D,\tilde{D},\lambda} \le 40\left(20(\sqrt{C_0}+1)^2 + 1\right)(\sqrt{C_0}+1)^2\, N^{-\frac{r}{2r+\beta}}\big/\log^6 N,$$
and therefore
$$\log^2(T)\,\mathcal{J}_{D,\tilde{D},\lambda}\log^4\frac{24 m}{\delta} \le 2^4\log^2(T)\,\mathcal{J}_{D,\tilde{D},\lambda}\,(\log^4 m)\log^4\frac{24}{\delta} \le 2^4(2r+\beta)^{-2}(\log^2 N)\,\mathcal{J}_{D,\tilde{D},\lambda}\,(\log^4 m)\log^4\frac{24}{\delta} \le 2^4(2r+\beta)^{-2}(\log^6 N)\,\mathcal{J}_{D,\tilde{D},\lambda}\log^4\frac{24}{\delta} \le 2^{10}(2r+\beta)^{-2}\left(20(\sqrt{C_0}+1)^2+1\right)(\sqrt{C_0}+1)^2\,|D|^{-\frac{r}{2r+\beta}}\log^4\frac{24}{\delta} = 2^{10}(2r+\beta)^{-2}\left(20(\sqrt{C_0}+1)^2+1\right)(\sqrt{C_0}+1)^2\, N^{-\frac{r}{2r+\beta}}\log^4\frac{24}{\delta},$$
and
$$\log(T)\,\mathcal{A}_{\tilde{D},\lambda}\,\lambda^{r-\frac{1}{2}} \le \sqrt{5}(2r+\beta)^{-1}(\sqrt{C_0}+1)\log|D|\left(|\tilde{D}|^{-1}|D|^{\frac{1}{2r+\beta}} + |\tilde{D}|^{-\frac{1}{2}}|D|^{\frac{\beta+1}{4r+2\beta}}\right)|D|^{-\frac{r}{2r+\beta}} \le 2\sqrt{5}(2r+\beta)^{-1}(\sqrt{C_0}+1)\,|D|^{-\frac{r}{2r+\beta}} = 2\sqrt{5}(2r+\beta)^{-1}(\sqrt{C_0}+1)\, N^{-\frac{r}{2r+\beta}}.$$
Putting the above estimates into Theorem 2, we obtain the desired conclusion (Equation (8)) with
$$C = C^*\left[2^{10}(2r+\beta)^{-2}\left(20(\sqrt{C_0}+1)^2+1\right)(\sqrt{C_0}+1)^2 + 2\sqrt{5}(2r+\beta)^{-1}(\sqrt{C_0}+1) + \sqrt{5}(\sqrt{C_0}+2)\right].$$
When $r \ge \frac{1}{2}$, we apply Theorem 3 and follow the same procedure as above to obtain the conclusion (Equation (8)). The proof is complete. □

4. Simulation and Conclusions

In this section, we provide a simulation to verify the theoretical statements. We assume that the inputs $\{x_i\}$ are drawn independently from the uniform distribution on $[0,1]$. Consider the regression model $y_i = g_\rho(x_i) + \varepsilon_i$, $i = 1, \ldots, N$, where the $\varepsilon_i$ are independent Gaussian noises $\mathcal{N}(0, 0.1^2)$ and
$$g_\rho(x) = \begin{cases} x, & \text{if } 0 < x \le 0.5, \\ 1 - x, & \text{if } 0.5 < x \le 1. \end{cases}$$
Define the pairwise kernel $K: \mathcal{X}^2 \times \mathcal{X}^2 \to \mathbb{R}$ by $K((x,u),(x',u')) := K_1(x,x') + K_1(u,u') - K_1(x,u') - K_1(u,x')$, where
$$K_1(x, x') = 1 + \min\{x, x'\}.$$
We apply the kernel $K$ in the distributed algorithm (Equation (3)). In Figure 1, we plot the mean squared error of Equation (3) for $N = 600$ and $S = 0, 300, 600$ as the number of local machines $m$ varies. The case $S = 0$ corresponds to the standard distributed MEE algorithm without unlabeled data: as $m$ becomes large, the corresponding (red) error curve increases dramatically. However, when 300 or 600 unlabeled data are added, the error curves increase only very slowly. This agrees with our theory that using unlabeled data enlarges the admissible range of $m$ in the distributed method. A downscaled sketch of this experiment is given below.
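The following self-contained sketch reproduces the structure of the experiment on a smaller scale; the sample sizes, $h$, $\eta$, and $T$ below are our own choices, since the paper does not report them, and the naive pairwise Gram matrix makes the full $N = 600$ setting expensive.

```python
import numpy as np

rng = np.random.default_rng(0)

def g_rho(x):                       # target regression function
    return np.where(x <= 0.5, x, 1.0 - x)

def K1(a, b):                       # K_1(x, x') = 1 + min(x, x')
    return 1.0 + np.minimum(a[:, None], b[None, :])

def pair_K(P, Q):                   # K((x,u),(x',u')) = K1(x,x')+K1(u,u')-K1(x,u')-K1(u,x')
    return (K1(P[:, 0], Q[:, 0]) + K1(P[:, 1], Q[:, 1])
            - K1(P[:, 0], Q[:, 1]) - K1(P[:, 1], Q[:, 0]))

def local_mee(x, y, h, eta, T):
    """Kernel gradient descent MEE (Equation (2)) on one block; returns pairs and coefficients."""
    n = len(x)
    I, J = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    pairs = np.column_stack([x[I.ravel()], x[J.ravel()]])
    ydiff = (y[I] - y[J]).ravel()
    Kmat = pair_K(pairs, pairs)
    alpha = np.zeros(n * n)
    for _ in range(T):
        resid = ydiff - Kmat @ alpha
        # update with G(u) = exp(-u), i.e. G'(u) = -exp(-u)
        alpha -= eta * (1.0 / n**2) * np.exp(-resid**2 / h**2) * (-resid)
    return pairs, alpha

# downscaled experiment (the paper uses N = 600 and S in {0, 300, 600})
N, S, m = 100, 100, 5               # labeled size, unlabeled size, local machines (our choices)
h, eta, T = 1.0, 0.5, 50            # scaling parameter, step size, iterations (our choices)

x = rng.uniform(0, 1, N)
y = g_rho(x) + rng.normal(0, 0.1, N)
xu = rng.uniform(0, 1, S)

models = []
for xl, yl, ul in zip(np.array_split(x, m), np.array_split(y, m), np.array_split(xu, m)):
    n, s = len(yl), len(ul)
    xt = np.concatenate([xl, ul])                          # D~_l inputs
    yt = np.concatenate([(n + s) / n * yl, np.zeros(s)])   # rescaled / zero labels
    models.append(local_mee(xt, yt, h, eta, T))

# mean squared error of the averaged estimator on a grid of test pairs
tx = rng.uniform(0, 1, (400, 2))
f_bar = np.mean([pair_K(tx, P) @ a for P, a in models], axis=0)
mse = np.mean((f_bar - (g_rho(tx[:, 0]) - g_rho(tx[:, 1])))**2)
print(f"m = {m}, S = {S}: test MSE = {mse:.4f}")
```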
This paper studied the convergence rate of the distributed gradient descent MEE algorithm in a semi-supervised setting. Our results demonstrate that using additional unlabeled data can improve the learning performance of the distributed MEE algorithm, in particular by enlarging the range of $m$ for which the optimal learning rate is guaranteed. As always, there are gaps between theory and empirical studies; we regard this paper as mainly a theoretical contribution and expect the analysis to provide some guidance for real applications.

Author Contributions

B.W. conceived the presented idea. T.H. developed the theory and performed the computations. All authors discussed the results and contributed to the final manuscript.

Funding

The work described in this paper is partially supported by the National Natural Science Foundation of China [Nos. 11671307 and 11571078], the Natural Science Foundation of Hubei Province in China [No. 2017CFB523], and the Fundamental Research Funds for the Central Universities, South-Central University for Nationalities [No. CZY18033].

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Proof of Lemma 3

We state two useful lemmas, whose proofs can be found in Reference [22].
Lemma A1.
For $\lambda > 0$, $0 < \eta < 1$, and $j = 1, \ldots, t-1$, we have
$$\max\left\{\left\|\eta(L_K + \lambda I)\,\pi_{j+1}^{t}(L_K)\right\|,\; \left\|\eta(L_{K,\tilde{D}} + \lambda I)\,\pi_{j+1}^{t}(L_{K,\tilde{D}})\right\|\right\} \le \frac{1}{t-j} + \eta\lambda,$$
$$\max\left\{\left\|\sum_{j=1}^{t}\eta(L_K + \lambda I)\,\pi_{j+1}^{t}(L_K)\right\|,\; \left\|\sum_{j=1}^{t}\eta(L_{K,\tilde{D}} + \lambda I)\,\pi_{j+1}^{t}(L_{K,\tilde{D}})\right\|\right\} \le 1 + \eta\lambda t.$$
Lemma A2.
For any $\lambda > 0$ and $f^* \in \mathcal{H}_K$, there holds
$$\left\|f_{t+1,\tilde{D}} - f_{t+1}\right\|_{L^2} \le B_{\tilde{D},\lambda} C_{\tilde{D},\lambda}\sum_{i=1}^{t}\left[(t-i)^{-1} + \eta\lambda\right]\|f_i - f^*\|_K + B_{\tilde{D},\lambda}(1 + \eta\lambda t)\left(C_{\tilde{D},\lambda}\|f^*\|_K + G_{\tilde{D},\lambda}\right) + c_{p,M}(N+S)^{2p+1} N^{-(2p+1)}\, t^{p+1/2}\, h^{-2p}, \qquad \mathrm{(A1)}$$
and
$$\left\|f_{t+1,\tilde{D}} - f_{t+1}\right\|_K \le B_{\tilde{D},\lambda} C_{\tilde{D},\lambda}\sum_{i=1}^{t}\left[(t-i)^{-1} + \eta\lambda\right]\|f_i - f^*\|_K\,\lambda^{-\frac{1}{2}} + B_{\tilde{D},\lambda}(1 + \eta\lambda t)\left(C_{\tilde{D},\lambda}\|f^*\|_K + G_{\tilde{D},\lambda}\right)\lambda^{-\frac{1}{2}} + c_{p,M}(N+S)^{2p+1} N^{-(2p+1)}\, t^{p+1/2}\, h^{-2p}, \qquad \mathrm{(A2)}$$
where the constant $c_{p,M} = 2^{4p+2} c_p\, C_G^{2p+1} M^{2p+1}$.
Proof of Lemma A2.
By Equations (10) and (13), we obtain two error decompositions for $f_{t+1,\tilde{D}} - f_{t+1}$. The first one is
$$f_{t+1,\tilde{D}} - f_{t+1} = \eta\sum_{i=1}^{t}\pi_{i+1}^{t}(L_{K,\tilde{D}})\left[L_K - L_{K,\tilde{D}}\right](f_i) + \eta\sum_{i=1}^{t}\pi_{i+1}^{t}(L_{K,\tilde{D}})\left[f_{\rho,\tilde{D}} - L_K(f_\rho)\right] + \eta\sum_{i=1}^{t}\pi_{i+1}^{t}(L_{K,\tilde{D}})\, E_{i,\tilde{D}}, \qquad \mathrm{(A3)}$$
and the second one is
$$f_{t+1,\tilde{D}} - f_{t+1} = \eta\sum_{i=1}^{t}\pi_{i+1}^{t}(L_K)\left[L_K - L_{K,\tilde{D}}\right](f_{i,\tilde{D}}) + \eta\sum_{i=1}^{t}\pi_{i+1}^{t}(L_K)\left[f_{\rho,\tilde{D}} - L_K(f_\rho)\right] + \eta\sum_{i=1}^{t}\pi_{i+1}^{t}(L_K)\, E_{i,\tilde{D}}. \qquad \mathrm{(A4)}$$
It has been proven in Reference [22] that the sequence $\{f_{t,\tilde{D}}\}$ satisfies $\|f_{t,\tilde{D}}\|_K \le 2 C_G \frac{|\tilde{D}|}{|D|} M t^{\frac{1}{2}} = 2 C_G \frac{N+S}{N} M t^{\frac{1}{2}}$. It then follows from Equation (4) that
$$\left\|E_{t,\tilde{D}}\right\|_K \le \frac{1}{(N+S)^2}\sum_{(x_i,y_i)\in\tilde{D}}\sum_{(x_j,y_j)\in\tilde{D}}\left|G'\left(\frac{\left[f_{t,\tilde{D}}(x_i,x_j) - y_i + y_j\right]^2}{h^2}\right) - G'(0)\right|\left|f_{t,\tilde{D}}(x_i,x_j) - y_i + y_j\right|\left\|K_{(x_i,x_j)}\right\|_K \le \frac{c_p}{h^{2p}}\left(2\frac{N+S}{N} C_G M t^{\frac{1}{2}} + 2M\right)^{2p+1} \le 2^{4p+2} c_p\, C_G^{2p+1} M^{2p+1}\,\frac{(N+S)^{2p+1}}{N^{2p+1}}\, t^{p+1/2}\, h^{-2p}. \qquad \mathrm{(A5)}$$
Then we can follow the proof procedure of Proposition 1 in Reference [24] to prove Equations (A1) and (A2). □
With the help of the lemmas above, we can prove Lemma 3.
Proof of Lemma 3.
Applying the decomposition (A4) with $\tilde{D} = \tilde{D}_l$ for $l = 1, \ldots, m$, we have
$$\left\|\bar{f}_{T+1,\tilde{D}} - f_{T+1}\right\|_{L^2} = \left\|\frac{1}{m}\sum_{l=1}^{m}\left(f_{T+1,\tilde{D}_l} - f_{T+1}\right)\right\|_{L^2} \le \left\|\eta\sum_{i=1}^{T}\pi_{i+1}^{T}(L_K)\frac{1}{m}\sum_{l=1}^{m}\left[L_K - L_{K,\tilde{D}_l}\right](f_{i,\tilde{D}_l})\right\|_{L^2} + \left\|\eta\sum_{i=1}^{T}\pi_{i+1}^{T}(L_K)\frac{1}{m}\sum_{l=1}^{m}\left[f_{\rho,\tilde{D}_l} - L_K(f_\rho)\right]\right\|_{L^2} + \left\|\eta\sum_{i=1}^{T}\pi_{i+1}^{T}(L_K)\frac{1}{m}\sum_{l=1}^{m} E_{i,\tilde{D}_l}\right\|_{L^2} =: I_1 + I_2 + I_3.$$
We first bound $I_1$, which is the most difficult to handle. It can be decomposed as
$$I_1 \le \left\|\eta\sum_{i=1}^{T}\pi_{i+1}^{T}(L_K)(L_K+\lambda I)^{\frac{1}{2}}\frac{1}{m}\sum_{l=1}^{m}(L_K+\lambda I)^{-\frac{1}{2}}\left[L_K - L_{K,\tilde{D}_l}\right](f_{i,\tilde{D}_l} - f_i)\right\|_K + \left\|\eta\sum_{i=1}^{T}\pi_{i+1}^{T}(L_K)(L_K+\lambda I)^{\frac{1}{2}}\frac{1}{m}\sum_{l=1}^{m}(L_K+\lambda I)^{-\frac{1}{2}}\left[L_K - L_{K,\tilde{D}_l}\right](f_i - f^*)\right\|_K + \left\|\eta\sum_{i=1}^{T}\pi_{i+1}^{T}(L_K)(L_K+\lambda I)^{\frac{1}{2}}\frac{1}{m}\sum_{l=1}^{m}(L_K+\lambda I)^{-\frac{1}{2}}\left[L_K - L_{K,\tilde{D}_l}\right](f^*)\right\|_K =: I_{11} + I_{12} + I_{13}.$$
By Lemma A1 and $f_{1,\tilde{D}_l} = f_1 = 0$, it is easy to see that
$$I_{11} \le \sum_{i=1}^{T}\left\|\eta(L_K+\lambda I)\,\pi_{i+1}^{T}(L_K)\right\|\left\|\frac{1}{m}\sum_{l=1}^{m}(L_K+\lambda I)^{-\frac{1}{2}}\left[L_K - L_{K,\tilde{D}_l}\right](f_{i,\tilde{D}_l} - f_i)\right\|_K \le \sum_{i=1}^{T}\left[\eta\lambda + (T-i)^{-1}\right]\sup_{1\le l\le m}\left\|(L_K+\lambda I)^{-\frac{1}{2}}\left[L_K - L_{K,\tilde{D}_l}\right](f_{i,\tilde{D}_l} - f_i)\right\|_K \le \sup_{1\le l\le m}\sum_{i=1}^{T}\left[\eta\lambda + (T-i)^{-1}\right] C_{\tilde{D}_l,\lambda}\left\|f_{i,\tilde{D}_l} - f_i\right\|_K.$$
Applying Lemma A2 (Equation (A2)) with $\tilde{D} = \tilde{D}_l$ and $t+1 = i$, we have
$$\left\|f_{i,\tilde{D}_l} - f_i\right\|_K \le \sum_{s=1}^{i-1}\left[(i-s)^{-1} + \lambda\eta\right]\|f_s - f^*\|_K\, B_{\tilde{D}_l,\lambda} C_{\tilde{D}_l,\lambda}\,\lambda^{-\frac{1}{2}} + (1 + \lambda\eta i)\, B_{\tilde{D}_l,\lambda}\left(C_{\tilde{D}_l,\lambda}\|f^*\|_K + G_{\tilde{D}_l,\lambda}\right)\lambda^{-\frac{1}{2}} + c_{p,M}(N+S)^{2p+1} N^{-(2p+1)}\, i^{p+1/2}\, h^{-2p}.$$
Thus,
$$I_{11} \le \sup_{1\le l\le m}\sum_{i=1}^{T}\left[\eta\lambda + (T-i)^{-1}\right] C_{\tilde{D}_l,\lambda}\left\{\sum_{s=1}^{i-1}\left[(i-s)^{-1} + \lambda\eta\right]\|f_s - f^*\|_K\, B_{\tilde{D}_l,\lambda} C_{\tilde{D}_l,\lambda}\,\lambda^{-\frac{1}{2}} + (1 + \lambda\eta i)\, B_{\tilde{D}_l,\lambda}\left(C_{\tilde{D}_l,\lambda}\|f^*\|_K + G_{\tilde{D}_l,\lambda}\right)\lambda^{-\frac{1}{2}} + c_{p,M}(N+S)^{2p+1} N^{-(2p+1)}\, i^{p+1/2}\, h^{-2p}\right\}. \qquad \mathrm{(A6)}$$
By Lemma A1 again, we have
$$I_{12} \le \sum_{i=1}^{T}\left\|\eta(L_K+\lambda I)\,\pi_{i+1}^{T}(L_K)\right\|\left\|\frac{1}{m}\sum_{l=1}^{m}(L_K+\lambda I)^{-\frac{1}{2}}\left[L_K - L_{K,\tilde{D}_l}\right]\right\|\left\|f_i - f^*\right\|_K \le \sum_{i=1}^{T}\left[\eta\lambda + (T-i)^{-1}\right] D_{\tilde{D},\lambda}\left\|f_i - f^*\right\|_K, \qquad \mathrm{(A7)}$$
and
$$I_{13} \le \left\|\sum_{i=1}^{T}\eta(L_K+\lambda I)\,\pi_{i+1}^{T}(L_K)\right\|\left\|\frac{1}{m}\sum_{l=1}^{m}(L_K+\lambda I)^{-\frac{1}{2}}\left[L_K - L_{K,\tilde{D}_l}\right]\right\|\left\|f^*\right\|_K \le (1 + \lambda\eta T)\, D_{\tilde{D},\lambda}\left\|f^*\right\|_K. \qquad \mathrm{(A8)}$$
This completes the estimate of $I_1$ via Equations (A6)–(A8).
We now turn to $I_2$ and $I_3$. By the definition of $F_{\tilde{D},\lambda}$, the bound (Equation (A5)) on $E_{t,\tilde{D}}$, and Lemma A1, we obtain
$$I_2 \le (1 + \eta\lambda T)\, F_{\tilde{D},\lambda},$$
and
$$I_3 \le c_{p,M}\,\frac{|\tilde{D}|^{2p+1}}{|D|^{2p+1}}\, T^{p+3/2}\, h^{-2p} = c_{p,M}\,(N+S)^{2p+1} N^{-(2p+1)}\, T^{p+3/2}\, h^{-2p}.$$
Combining these with Equations (A6)–(A8), we obtain the desired conclusion (Equation (15)). □

References

  1. Principe, J.C. Renyi’s entropy and Kernel perspectives. In Information Theoretic Learning; Springer: New York, NY, USA, 2010. [Google Scholar]
  2. Erdogmus, D.; Principe, J.C. Comparison of entropy and mean square error criteria in adaptive system training using higher order statistics. In Proceedings of the International Conference on ICA and Signal Separation; Springer: Berlin, Germany, 2000; pp. 75–90. [Google Scholar]
  3. Erdogmus, D.; Hild, K.; Principe, J.C. Blind source separation using Renyi’s α-marginal entropies. Neurocomputing 2002, 49, 25–38. [Google Scholar] [CrossRef]
  4. Erdogmus, D.; Principe, J.C. Convergence properties and data efficiency of the minimum error entropy criterion in adaline training. IEEE Trans. Signal Process. 2003, 51, 1966–1978. [Google Scholar] [CrossRef]
  5. Gokcay, E.; Principe, J.C. Information theoretic clustering. IEEE Trans. Pattern Anal. Mach. Learn. 2002, 24, 158–171. [Google Scholar] [CrossRef]
  6. Silva, L.M.; Marques, J.; Alexandre, L.A. Neural network classification using Shannon’s entropy. In Proceedings of the European Symposium on Artificial Neural Networks; D-Side: Bruges, Belgium, 2005; pp. 217–222. [Google Scholar]
  7. Silva, L.M.; Marques, J.; Alexandre, L.A. The MEE principle in data classification: A perceptron-based analysis. Neural Comput. 2010, 22, 2698–2728. [Google Scholar] [CrossRef]
  8. Choe, Y. Information criterion for minimum cross-entropy model selection. arXiv, 2017; arXiv:1704.04315. [Google Scholar]
  9. Ying, Y.; Zhou, D.X. Online pairwise learning algorithms. Neural Comput. 2016, 28, 743–777. [Google Scholar] [CrossRef] [PubMed]
  10. Zhang, Y.; Duchi, J.C.; Wainwright, M.J. Divide and conquer kernel ridge regression: A distributed algorithm with minimax optimal rates. J. Mach. Learn. Res. 2013, 30, 592–617. [Google Scholar]
  11. Chapelle, O.; Zien, A. Semi-Supervised Learning (Adaptive Computation and Machine Learning); The MIT Press: Cambridge, MA, USA, 2006. [Google Scholar]
  12. Zhang, T. Learning bounds for kernel regression using effective data dimensionality. Neural Comput. 2005, 17, 2077–2098. [Google Scholar] [CrossRef] [PubMed]
  13. Caponnetto, A.; Vito, E.D. Optimal rates for the regularized least-squares algorithm. Found. Comput. Math. 2007, 7, 331–368. [Google Scholar] [CrossRef]
  14. Steinwart, I.; Hush, D.R.; Scovel, C. Optimal rates for regularized least squares regression. In Proceedings of the COLT 2009—the Conference on Learning Theory, Montreal, QC, Canada, 18–21 June 2009. [Google Scholar]
  15. Lin, S.B.; Guo, X.; Zhou, D.X. Distributed learning with regularized least squares. J. Mach. Learn. Res. 2017, 18, 3202–3232. [Google Scholar]
  16. Guo, Z.C.; Lin, S.B.; Zhou, D.X. Learning theory of distributed spectral algorithms. Inverse Prob. 2017, 33, 074009. [Google Scholar] [CrossRef]
  17. Guo, Z.C.; Shi, L.; Wu, Q. Learning theory of distributed regression with bias corrected regularization kernel network. J. Mach. Learn. Res. 2017, 18, 4237–4261. [Google Scholar]
  18. Mücke, N.; Blanchard, G. Parallelizing spectrally regularized kernel algorithms. J. Mach. Learn. Res. 2018, 19, 1069–1097. [Google Scholar]
  19. Lin, S.B.; Zhou, D.X. Distributed kernel-based gradient descent algorithms. Constr. Approx. 2018, 47, 249–276. [Google Scholar] [CrossRef]
  20. Yao, Y.; Rosasco, L.; Caponnetto, A. On early stopping in gradient descent learning. Constr. Approx. 2007, 26, 289–315. [Google Scholar] [CrossRef]
  21. Chang, X.; Lin, S.B.; Zhou, D.X. Distributed semi-supervised learning with kernel ridge regression. J. Mach. Learn. Res. 2017, 18, 1493–1514. [Google Scholar]
  22. Hu, T.; Wu, Q.; Zhou, D.X. Distributed kernel gradient descent algorithm for minimum error entropy principle. Unpublished work. 2018. [Google Scholar]
  23. Guo, X.; Hu, T.; Wu, Q. Distributed minimum error entropy algorithms. Unpublished work. 2018. [Google Scholar]
  24. Wang, B.; Hu, T. Distributed pairwise algorithms with gradient descent methods. Unpublished work. 2018. [Google Scholar]
Figure 1. The mean squared errors for the size of unlabeled data $S \in \{0, 300, 600\}$ as the number of local machines $m$ varies.
Table 1. List of notations used throughout the paper.
Notation | Meaning of the Notation
$X$ | the explanatory variable
$Y$ | the response variable
$\mathcal{X}$ | $X \in \mathcal{X}$, a compact subset of a Euclidean space $\mathbb{R}^n$
$\mathcal{Y}$ | $Y \in \mathcal{Y}$, a subset of $\mathbb{R}$
$\rho(\cdot,\cdot)$ | a Borel measure on $\mathcal{X} \times \mathcal{Y}$
$\rho_X$ | the marginal probability measure of $\rho$ on $\mathcal{X}$
$\rho(y|x)$ | the conditional probability measure of $y \in \mathcal{Y}$ given $X = x$
$g_\rho(x)$ | the mean regression function $g_\rho(x) = \int_{\mathcal{Y}} y\, d\rho(y|x)$
$f_\rho(x,u)$ | the target function of MEE, given by $f_\rho(x,u) = g_\rho(x) - g_\rho(u)$
$K$ | a reproducing kernel on $\mathcal{X} \times \mathcal{X}$
$D$ | the labeled data set $D = \{(x_1,y_1), \ldots, (x_N,y_N)\}$
$N$ | the size of the labeled data set $D$
$\lfloor N/4 \rfloor$ | the largest integer not exceeding $N/4$
$|D|$ | the cardinality of $D$, $|D| = N$
$D^*$ | the unlabeled data set $D^* = \{x_1^*, \ldots, x_S^*\}$
$S$ | the size of the unlabeled data set $D^*$
$|D^*|$ | the cardinality of $D^*$, $|D^*| = S$
$\tilde{D}$ | the training data set used in the distributed MEE algorithm, consisting of $D$ and $D^*$
$|\tilde{D}|$ | the cardinality of $\tilde{D}$, $|\tilde{D}| = N + S$
$m$ | the number of local machines
$\tilde{D}_l$ | the $l$-th subset of $\tilde{D}$, $1 \le l \le m$
$G$ | the windowing (loss) function of the MEE algorithm
$L_K$ | the integral operator associated with $K$
$L_{K,\tilde{D}}$ | the empirical version of $L_K$ on $\tilde{D}$
$f_{t+1,D}$ | the output of the kernel gradient descent MEE algorithm with data $D$ and kernel $K$ after $t$ iterations
$f_{t+1,D_l}$ | the output of the kernel gradient descent MEE algorithm with data $D_l$ and kernel $K$ after $t$ iterations
$\bar{f}_{t+1,\tilde{D}}$ | the global output, the average of the local outputs $f_{t+1,\tilde{D}_l}$, $l = 1, \ldots, m$
