Article

Convergence of a Fixed-Point Minimum Error Entropy Algorithm

1. School of Aeronautics and Astronautics, Zhejiang University, Hangzhou 310027, China
2. School of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an 710049, China
3. Department of Electrical and Computer Engineering, University of Florida, Gainesville, FL 32611, USA
* Author to whom correspondence should be addressed.
Entropy 2015, 17(8), 5549-5560; https://0-doi-org.brum.beds.ac.uk/10.3390/e17085549
Submission received: 3 May 2015 / Revised: 17 July 2015 / Accepted: 28 July 2015 / Published: 3 August 2015

Abstract

The minimum error entropy (MEE) criterion is an important learning criterion in information theoretic learning (ITL). However, the MEE solution cannot be obtained in closed form even for a simple linear regression problem, and one usually has to search for it iteratively. Fixed-point iteration is an efficient way to find the MEE solution. In this work, we study a fixed-point MEE algorithm for linear regression, focusing mainly on the convergence issue. We provide a sufficient condition (although somewhat loose) that guarantees the convergence of the fixed-point MEE algorithm. An illustrative example is also presented.

1. Introduction

In recent years, information theoretic measures such as entropy and mutual information have been widely applied in machine learning (the so-called information theoretic learning (ITL) [1]) and signal processing [1,2]. A main reason for the success of ITL is that information theoretic quantities can capture higher-order statistics of the data and thus offer potentially significant performance improvements in machine learning applications [1]. Based on the Parzen window method [3], smooth and nonparametric information theoretic estimators can be applied directly to the data without imposing a priori assumptions (say, the Gaussian assumption) on the underlying probability density functions (PDFs). In particular, Renyi's quadratic entropy estimator can be easily calculated by a double sum over samples [4,5,6,7]. In supervised learning, the error entropy serves as a measure of similarity and plays a role analogous to that of the well-known mean square error (MSE) [1,2]. An adaptive system can be trained by minimizing the entropy of the error over the training dataset [4]. This learning criterion is called the minimum error entropy (MEE) criterion [1,2,8,9,10]. MEE may achieve much better performance than MSE, especially when the data are heavy-tailed or multimodal non-Gaussian [1,2,10].
However, the MEE solution cannot be obtained in closed form even when the system is a simple linear model such as a finite impulse response (FIR) filter. A practical approach is to search for the solution over the performance surface with an iterative algorithm. Usually, a simple gradient based search algorithm is adopted. With a gradient based learning algorithm, however, one has to select a proper learning rate (or step size) to ensure stability and achieve a good tradeoff between misadjustment and convergence speed [4,5,6,7]. A more promising search method is the fixed-point iterative algorithm, which is step-size free and often much faster than gradient based methods [11]. Fixed-point algorithms have received considerable attention in machine learning and signal processing due to their low computational requirements and fast convergence [12,13,14,15,16,17].
Convergence is a key issue for any iterative learning algorithm. For the gradient based MEE algorithms, the convergence problem has already been studied and some theoretical results are available [6,7]. For the fixed-point MEE algorithms, however, no convergence study has been reported so far. The goal of this paper is to study the convergence of a fixed-point MEE algorithm and to provide a sufficient condition that ensures convergence to a unique solution (the fixed point). It is worth noting that the convergence of a fixed-point maximum correntropy criterion (MCC) algorithm has been studied in [18]. The remainder of the paper is organized as follows. In Section 2, we derive a fixed-point MEE algorithm. In Section 3, we prove a sufficient condition that guarantees its convergence. In Section 4, we present an illustrative example. Finally, Section 5 concludes the paper.

2. Fixed-Point MEE Algorithm

Consider a simple linear regression (filtering) case where the error signal is
$$ e(i) = d(i) - y(i) = d(i) - W^T X(i) \qquad (1) $$
where $d(i)$ is the desired value at time $i$, $y(i) = W^T X(i)$ is the output of the linear model, $W = [w_1, w_2, \ldots, w_m]^T \in \mathbb{R}^m$ is the weight vector, and $X(i) = [x_1(i), x_2(i), \ldots, x_m(i)]^T \in \mathbb{R}^m$ is the input vector (i.e., the regressor). The goal is to find a weight vector such that the error signal is as small as possible. Under the MEE criterion, the optimal weight vector is obtained by minimizing the error entropy [1,2]. With Renyi's quadratic entropy, the MEE solution can be expressed as
$$ W = \arg\min_{W \in \mathbb{R}^m} \left( -\log \int p_e^2(x)\, dx \right) = \arg\max_{W \in \mathbb{R}^m} \int p_e^2(x)\, dx \qquad (2) $$
where $p_e(\cdot)$ denotes the PDF of the error signal. In ITL, the quantity $\int p_e^2(x)\,dx$ is also called the quadratic information potential (QIP) [1]. In a practical situation, however, the error distribution is usually unknown, and one has to estimate it from the error samples $\{e(1), e(2), \ldots, e(N)\}$, where $N$ denotes the number of samples. Based on the Parzen window approach [3], the estimated PDF takes the form
$$ \hat{p}_e(x) = \frac{1}{N} \sum_{i=1}^{N} \kappa\big(x - e(i)\big) \qquad (3) $$
where $\kappa(\cdot)$ stands for a kernel function (not necessarily a Mercer kernel) satisfying $\kappa(x) \ge 0$ and $\int \kappa(x)\,dx = 1$. Unless mentioned otherwise, the kernel function is chosen to be the Gaussian kernel
$$ \kappa_\sigma(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\!\left( -\frac{x^2}{2\sigma^2} \right) \qquad (4) $$
where $\sigma$ denotes the kernel bandwidth. With the Gaussian kernel, the QIP can be estimated simply as [1]
$$ \int \hat{p}_e^2(x)\, dx = \int \left( \frac{1}{N} \sum_{i=1}^{N} \kappa_\sigma\big(x - e(i)\big) \right)^2 dx = \frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} \kappa_{\sigma\sqrt{2}}\big(e(i) - e(j)\big) \qquad (5) $$
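For illustration, the double-sum estimator in (5) can be evaluated directly from a set of error samples. The following NumPy sketch (an illustration added here, not part of the original paper; the function name and the synthetic test data are arbitrary) computes the QIP estimate for a given kernel bandwidth.

```python
import numpy as np

def qip_estimate(errors, sigma):
    """Double-sum Gaussian-kernel estimate of the QIP, as in Equation (5)."""
    e = np.asarray(errors, dtype=float)
    diff = e[:, None] - e[None, :]                 # all pairwise differences e(i) - e(j)
    # kappa_{sigma*sqrt(2)}(u) = exp(-u^2 / (4 sigma^2)) / (2 sigma sqrt(pi))
    k = np.exp(-diff**2 / (4.0 * sigma**2)) / (2.0 * sigma * np.sqrt(np.pi))
    return k.mean()                                # (1/N^2) times the double sum

# illustrative usage with synthetic error samples
rng = np.random.default_rng(0)
print(qip_estimate(rng.normal(scale=0.5, size=200), sigma=1.0))
```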
Therefore, in practical situations, the MEE solution of (2) becomes
$$ W = \arg\max_{W \in \mathbb{R}^m} \frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} \kappa_{\sigma\sqrt{2}}\big(e(i) - e(j)\big) \qquad (6) $$
Unfortunately, there is no closed form solution of (6). One can apply a gradient based iterative algorithm to search for the solution, starting from an initial point. Below we derive a fixed-point iterative algorithm, which is in general much faster than a gradient based method (although a gradient method can be viewed as a special case of the fixed-point methods, it involves a step-size parameter). Let us take the following first-order derivative:
$$
\begin{aligned}
\frac{\partial}{\partial W} \frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} \kappa_{\sigma\sqrt{2}}\big(e(i)-e(j)\big)
&= \frac{1}{2\sigma^2 N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} \kappa_{\sigma\sqrt{2}}\big(e(i)-e(j)\big)\big(e(i)-e(j)\big)\,[X(i)-X(j)] \\
&= \frac{1}{2\sigma^2 N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} \kappa_{\sigma\sqrt{2}}\big(e(i)-e(j)\big)\big(d(i)-d(j)\big)\,[X(i)-X(j)] \\
&\quad - \frac{1}{2\sigma^2 N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} \kappa_{\sigma\sqrt{2}}\big(e(i)-e(j)\big)\,[X(i)-X(j)][X(i)-X(j)]^T W \\
&= \frac{1}{2\sigma^2} \left\{ P_{dX}^{MEE} - R_{XX}^{MEE} W \right\}
\end{aligned} \qquad (7)
$$
where
$$
\begin{cases}
R_{XX}^{MEE} = \dfrac{1}{N^2} \displaystyle\sum_{i=1}^{N} \sum_{j=1}^{N} \kappa_{\sigma\sqrt{2}}\big(e(i)-e(j)\big)\,[X(i)-X(j)][X(i)-X(j)]^T \\[2ex]
P_{dX}^{MEE} = \dfrac{1}{N^2} \displaystyle\sum_{i=1}^{N} \sum_{j=1}^{N} \kappa_{\sigma\sqrt{2}}\big(e(i)-e(j)\big)\big(d(i)-d(j)\big)\,[X(i)-X(j)]
\end{cases} \qquad (8)
$$
Let $\frac{\partial}{\partial W} \frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} \kappa_{\sigma\sqrt{2}}\big(e(i)-e(j)\big) = 0$, and assume that the matrix $R_{XX}^{MEE}$ is invertible. Then, we obtain the following solution [15]:
$$ W = \left( R_{XX}^{MEE} \right)^{-1} P_{dX}^{MEE} \qquad (9) $$
The above solution is, in form, very similar to the well-known Wiener solution [19]. However, it is not a closed-form solution, since both the matrix $R_{XX}^{MEE}$ and the vector $P_{dX}^{MEE}$ depend on the weight vector $W$ (note that $e(i)$ depends on $W$). Therefore, Equation (9) is actually a fixed-point equation, which can also be expressed as $W = f(W)$, where
$$ f(W) = \left( R_{XX}^{MEE} \right)^{-1} P_{dX}^{MEE} \qquad (10) $$
The solution (the fixed point) of the equation $W = f(W)$ can be found by the following iterative fixed-point algorithm:
$$ W_{k+1} = f(W_k) \qquad (11) $$
where $W_k$ denotes the estimated weight vector at iteration $k$. This algorithm is called the fixed-point MEE algorithm [15]; an online version was also derived in [15]. In the next section, we prove a sufficient condition under which the algorithm (11) converges to a unique fixed point.
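To make the iteration concrete, the following NumPy sketch (a minimal illustration added here, not the authors' implementation; the initialization, iteration limit and stopping tolerance are assumptions) forms $R_{XX}^{MEE}$ and $P_{dX}^{MEE}$ of Equation (8) at the current weights and applies the update (11).

```python
import numpy as np

def fixed_point_mee(X, d, sigma, n_iter=50, tol=1e-6):
    """Fixed-point MEE iteration (11) for a linear model.

    X: (N, m) input matrix, d: (N,) desired responses, sigma: kernel bandwidth.
    """
    N, m = X.shape
    W = np.zeros(m)                                  # initial weight vector (assumption)
    dX = X[:, None, :] - X[None, :, :]               # pairwise X(i) - X(j), shape (N, N, m)
    dd = d[:, None] - d[None, :]                     # pairwise d(i) - d(j), shape (N, N)
    for _ in range(n_iter):
        e = d - X @ W                                # errors under the current weights
        de = e[:, None] - e[None, :]                 # e(i) - e(j)
        k = np.exp(-de**2 / (4.0 * sigma**2)) / (2.0 * sigma * np.sqrt(np.pi))
        # R_XX^MEE and P_dX^MEE of Equation (8)
        R = np.einsum('ij,ija,ijb->ab', k, dX, dX) / N**2
        P = np.einsum('ij,ij,ija->a', k, dd, dX) / N**2
        W_new = np.linalg.solve(R, P)                # W_{k+1} = (R_XX^MEE)^{-1} P_dX^MEE
        if np.linalg.norm(W_new - W) <= tol * (np.linalg.norm(W) + 1e-12):
            return W_new
        W = W_new
    return W
```

A call such as `fixed_point_mee(X, d, sigma=1.0)` returns the estimated weight vector; for small bandwidths the iteration may fail to converge, which is precisely the issue analyzed in the next section.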

3. Convergence of the Fixed-Point MEE

The convergence of a fixed-point algorithm can be proved by the well-known contraction mapping theorem (also known as the Banach fixed-point theorem) [11]. According to this theorem, the convergence of the fixed-point MEE algorithm (11) is guaranteed if there exist $\beta > 0$ and $0 < \alpha < 1$ such that the initial weight vector satisfies $\|W_0\|_p \le \beta$ and, for all $W \in \{W \in \mathbb{R}^m : \|W\|_p \le \beta\}$, it holds that
$$
\begin{cases}
\| f(W) \|_p \le \beta \\[1ex]
\| \nabla_W f(W) \|_p = \left\| \dfrac{\partial f(W)}{\partial W^T} \right\|_p \le \alpha
\end{cases} \qquad (12)
$$
where $\|\cdot\|_p$ denotes the $l_p$-norm of a vector or the corresponding induced norm of a matrix, defined by $\|A\|_p = \max_{\|X\|_p \ne 0} \|AX\|_p / \|X\|_p$, with $p \ge 1$, $A \in \mathbb{R}^{m \times m}$, $X \in \mathbb{R}^{m \times 1}$, and $\nabla_W f(W)$ denotes the $m \times m$ Jacobian matrix of $f(W)$ with respect to $W$, given by
$$ \nabla_W f(W) = \left[ \frac{\partial}{\partial w_1} f(W) \;\; \frac{\partial}{\partial w_2} f(W) \;\; \cdots \;\; \frac{\partial}{\partial w_m} f(W) \right] \qquad (13) $$
where
$$
\begin{aligned}
\frac{\partial}{\partial w_s} f(W)
&= \frac{\partial}{\partial w_s} \left( \left[ R_{XX}^{MEE} \right]^{-1} P_{dX}^{MEE} \right) \\
&= - \left[ R_{XX}^{MEE} \right]^{-1} \left( \frac{\partial}{\partial w_s} R_{XX}^{MEE} \right) \left[ R_{XX}^{MEE} \right]^{-1} P_{dX}^{MEE}
 + \left[ R_{XX}^{MEE} \right]^{-1} \left( \frac{\partial}{\partial w_s} P_{dX}^{MEE} \right) \\
&= - \left[ R_{XX}^{MEE} \right]^{-1} \left( \frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} \frac{\partial \kappa_{\sigma\sqrt{2}}\big(e(i)-e(j)\big)}{\partial w_s} \,[X(i)-X(j)][X(i)-X(j)]^T \right) f(W) \\
&\quad + \left[ R_{XX}^{MEE} \right]^{-1} \left( \frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} \frac{\partial \kappa_{\sigma\sqrt{2}}\big(e(i)-e(j)\big)}{\partial w_s} \,\big(d(i)-d(j)\big)[X(i)-X(j)] \right) \\
&= - \left[ R_{XX}^{MEE} \right]^{-1} \left( \frac{1}{2N^2\sigma^2} \sum_{i=1}^{N} \sum_{j=1}^{N} \big(e(i)-e(j)\big)\big(x_s(i)-x_s(j)\big)\,\kappa_{\sigma\sqrt{2}}\big(e(i)-e(j)\big)\,[X(i)-X(j)][X(i)-X(j)]^T \right) f(W) \\
&\quad + \left[ R_{XX}^{MEE} \right]^{-1} \left( \frac{1}{2N^2\sigma^2} \sum_{i=1}^{N} \sum_{j=1}^{N} \big(e(i)-e(j)\big)\big(x_s(i)-x_s(j)\big)\,\kappa_{\sigma\sqrt{2}}\big(e(i)-e(j)\big)\,\big(d(i)-d(j)\big)[X(i)-X(j)] \right)
\end{aligned} \qquad (14)
$$
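As a numerical sanity check on (14) (a check added here, not part of the paper; all names and the synthetic data are arbitrary), the analytic Jacobian column can be compared against a central finite-difference approximation of $f$:

```python
import numpy as np

def mee_R_P(W, X, d, sigma):
    """R_XX^MEE and P_dX^MEE of Equation (8), plus the pairwise quantities they use."""
    N = X.shape[0]
    e = d - X @ W
    de = e[:, None] - e[None, :]                   # e(i) - e(j)
    dX = X[:, None, :] - X[None, :, :]             # X(i) - X(j)
    dd = d[:, None] - d[None, :]                   # d(i) - d(j)
    k = np.exp(-de**2 / (4.0 * sigma**2)) / (2.0 * sigma * np.sqrt(np.pi))
    R = np.einsum('ij,ija,ijb->ab', k, dX, dX) / N**2
    P = np.einsum('ij,ij,ija->a', k, dd, dX) / N**2
    return R, P, de, dX, dd, k

def f(W, X, d, sigma):
    R, P = mee_R_P(W, X, d, sigma)[:2]
    return np.linalg.solve(R, P)

def jacobian_column(W, X, d, sigma, s):
    """Column s of the Jacobian of f(W), following Equation (14)."""
    N = X.shape[0]
    R, P, de, dX, dd, k = mee_R_P(W, X, d, sigma)
    dk = de * dX[:, :, s] * k / (2.0 * sigma**2)   # d kappa_{sigma*sqrt(2)} / d w_s for each pair
    dR = np.einsum('ij,ija,ijb->ab', dk, dX, dX) / N**2
    dP = np.einsum('ij,ij,ija->a', dk, dd, dX) / N**2
    return np.linalg.solve(R, dP - dR @ np.linalg.solve(R, P))

# compare against a central finite difference in the direction of w_s
rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=(50, 3))
d = X @ np.array([1.0, -2.0, 0.5]) + 0.05 * rng.standard_normal(50)
W0, sigma, s, h = rng.standard_normal(3), 1.5, 0, 1e-5
num = (f(W0 + h * np.eye(3)[s], X, d, sigma) - f(W0 - h * np.eye(3)[s], X, d, sigma)) / (2 * h)
print(np.allclose(num, jacobian_column(W0, X, d, sigma, s), atol=1e-5))
```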
To obtain a sufficient condition to guarantee the convergence of the fixed-point MEE algorithm (11), we prove two theorems below.
Theorem 1. If
$$ \beta > \xi = \frac{\sqrt{m} \sum_{i=1}^{N} \sum_{j=1}^{N} |d(i)-d(j)| \, \| X(i)-X(j) \|_1}{\lambda_{\min}\!\left[ \sum_{i=1}^{N} \sum_{j=1}^{N} [X(i)-X(j)][X(i)-X(j)]^T \right]} \qquad (15) $$
and $\sigma \ge \sigma^*$, where $\sigma^*$ is the solution of the equation $\varphi(\sigma) = \beta$, with
$$ \varphi(\sigma) = \frac{\sqrt{m} \sum_{i=1}^{N} \sum_{j=1}^{N} |d(i)-d(j)| \, \| X(i)-X(j) \|_1}{\lambda_{\min}\!\left[ \sum_{i=1}^{N} \sum_{j=1}^{N} \exp\!\left( -\dfrac{\big( \beta \|X(i)-X(j)\|_1 + |d(i)-d(j)| \big)^2}{4\sigma^2} \right) [X(i)-X(j)][X(i)-X(j)]^T \right]}, \quad \sigma \in (0, \infty), $$
then $\| f(W) \|_1 \le \beta$ for all $W \in \{ W \in \mathbb{R}^m : \|W\|_1 \le \beta \}$.
Proof. The induced matrix norm is compatible with the corresponding vector lp-norm, hence
$$ \| f(W) \|_1 = \left\| \left[ R_{XX}^{MEE} \right]^{-1} P_{dX}^{MEE} \right\|_1 \le \left\| \left[ R_{XX}^{MEE} \right]^{-1} \right\|_1 \left\| P_{dX}^{MEE} \right\|_1 \qquad (16) $$
where $\left\| [ R_{XX}^{MEE} ]^{-1} \right\|_1$ is the 1-norm (also referred to as the column-sum norm) of the inverse matrix $[ R_{XX}^{MEE} ]^{-1}$, which is simply its maximum absolute column sum. According to matrix theory, the following inequality holds:
$$ \left\| \left[ R_{XX}^{MEE} \right]^{-1} \right\|_1 \le \sqrt{m} \left\| \left[ R_{XX}^{MEE} \right]^{-1} \right\|_2 = \sqrt{m}\, \lambda_{\max}\!\left[ \left[ R_{XX}^{MEE} \right]^{-1} \right] \qquad (17) $$
where $\left\| [ R_{XX}^{MEE} ]^{-1} \right\|_2$ is the 2-norm (also referred to as the spectral norm) of $[ R_{XX}^{MEE} ]^{-1}$, which equals its maximum eigenvalue since $R_{XX}^{MEE}$ is symmetric and positive definite. Further, we have
$$
\begin{aligned}
\lambda_{\max}\!\left[ \left( R_{XX}^{MEE} \right)^{-1} \right] &= \frac{1}{\lambda_{\min}\!\left[ R_{XX}^{MEE} \right]} = \frac{N^2}{\lambda_{\min}\!\left[ \sum_{i=1}^{N} \sum_{j=1}^{N} \kappa_{\sigma\sqrt{2}}\big(e(i)-e(j)\big) [X(i)-X(j)][X(i)-X(j)]^T \right]} \\
&\overset{(a)}{\le} \frac{N^2}{\lambda_{\min}\!\left[ \sum_{i=1}^{N} \sum_{j=1}^{N} \kappa_{\sigma\sqrt{2}}\big( \beta \|X(i)-X(j)\|_1 + |d(i)-d(j)| \big) [X(i)-X(j)][X(i)-X(j)]^T \right]}
\end{aligned} \qquad (18)
$$
where (a) comes from
$$
\begin{aligned}
|e(i)-e(j)| &= \left| d(i)-d(j) - W^T \big( X(i)-X(j) \big) \right| \\
&\le \|W\|_1 \, \| X(i)-X(j) \|_1 + |d(i)-d(j)| \\
&\le \beta \, \| X(i)-X(j) \|_1 + |d(i)-d(j)|
\end{aligned} \qquad (19)
$$
In addition, it holds that
$$
\begin{aligned}
\left\| P_{dX}^{MEE} \right\|_1 &= \left\| \frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} \kappa_{\sigma\sqrt{2}}\big(e(i)-e(j)\big) \big(d(i)-d(j)\big) [X(i)-X(j)] \right\|_1 \\
&\overset{(b)}{\le} \frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} \left\| \kappa_{\sigma\sqrt{2}}\big(e(i)-e(j)\big) \big(d(i)-d(j)\big) [X(i)-X(j)] \right\|_1 \\
&\overset{(c)}{\le} \frac{1}{2\sigma N^2 \sqrt{\pi}} \sum_{i=1}^{N} \sum_{j=1}^{N} |d(i)-d(j)| \, \| X(i)-X(j) \|_1
\end{aligned} \qquad (20)
$$
where (b) follows from the triangle inequality for the vector $l_1$-norm, and (c) holds because $\kappa_{\sigma\sqrt{2}}(x) \le \frac{1}{2\sigma\sqrt{\pi}}$ for any $x$. Combining (16)–(18) and (20), we derive
$$
\begin{aligned}
\| f(W) \|_1 &\le \frac{\sqrt{m}}{2\sigma\sqrt{\pi}} \cdot \frac{\sum_{i=1}^{N} \sum_{j=1}^{N} |d(i)-d(j)| \, \| X(i)-X(j) \|_1}{\lambda_{\min}\!\left[ \sum_{i=1}^{N} \sum_{j=1}^{N} \kappa_{\sigma\sqrt{2}}\big( \beta \|X(i)-X(j)\|_1 + |d(i)-d(j)| \big) [X(i)-X(j)][X(i)-X(j)]^T \right]} \\
&= \frac{\sqrt{m} \sum_{i=1}^{N} \sum_{j=1}^{N} |d(i)-d(j)| \, \| X(i)-X(j) \|_1}{\lambda_{\min}\!\left[ \sum_{i=1}^{N} \sum_{j=1}^{N} \exp\!\left( -\dfrac{\big( \beta \|X(i)-X(j)\|_1 + |d(i)-d(j)| \big)^2}{4\sigma^2} \right) [X(i)-X(j)][X(i)-X(j)]^T \right]} \\
&= \varphi(\sigma)
\end{aligned} \qquad (21)
$$
Clearly, $\varphi(\sigma)$ is a continuous and monotonically decreasing function of $\sigma$ over $(0, \infty)$, satisfying $\lim_{\sigma \to 0^+} \varphi(\sigma) = \infty$ and $\lim_{\sigma \to \infty} \varphi(\sigma) = \xi$. Therefore, if $\beta > \xi$, the equation $\varphi(\sigma) = \beta$ has a unique solution $\sigma^*$ over $(0, \infty)$, and if $\sigma \ge \sigma^*$, we have $\varphi(\sigma) \le \beta$, which completes the proof. □
Theorem 2. If
$$ \beta > \xi = \frac{\sqrt{m} \sum_{i=1}^{N} \sum_{j=1}^{N} |d(i)-d(j)| \, \| X(i)-X(j) \|_1}{\lambda_{\min}\!\left[ \sum_{i=1}^{N} \sum_{j=1}^{N} [X(i)-X(j)][X(i)-X(j)]^T \right]} \qquad (22) $$
and $\sigma \ge \max\{\sigma^*, \sigma^\dagger\}$, where $\sigma^*$ is the solution of the equation $\varphi(\sigma) = \beta$ (as in Theorem 1) and $\sigma^\dagger$ is the solution of the equation $\psi(\sigma) = \alpha$ ($0 < \alpha < 1$), with
$$ \psi(\sigma) = \frac{\gamma \sqrt{m}}{2\sigma^2 \, \lambda_{\min}\!\left[ \sum_{i=1}^{N} \sum_{j=1}^{N} \exp\!\left( -\dfrac{\big( \beta \|X(i)-X(j)\|_1 + |d(i)-d(j)| \big)^2}{4\sigma^2} \right) [X(i)-X(j)][X(i)-X(j)]^T \right]}, \quad \sigma \in (0, \infty), \qquad (23) $$
in which
$$ \gamma = \sum_{i=1}^{N} \sum_{j=1}^{N} \big( \beta \|X(i)-X(j)\|_1 + |d(i)-d(j)| \big) \, \|X(i)-X(j)\|_1 \left( \beta \left\| [X(i)-X(j)][X(i)-X(j)]^T \right\|_1 + |d(i)-d(j)| \, \|X(i)-X(j)\|_1 \right), $$
then it holds that $\| f(W) \|_1 \le \beta$ and $\| \nabla_W f(W) \|_1 \le \alpha$ for all $W \in \{ W \in \mathbb{R}^m : \|W\|_1 \le \beta \}$.
Proof. By Theorem 1, we have $\| f(W) \|_1 \le \beta$. To prove $\| \nabla_W f(W) \|_1 \le \alpha$, it suffices to show that $\left\| \frac{\partial}{\partial w_s} f(W) \right\|_1 \le \alpha$ for every $s$, since the induced 1-norm of the Jacobian equals its maximum absolute column sum. By (14), we have
$$
\begin{aligned}
\left\| \frac{\partial}{\partial w_s} f(W) \right\|_1
&= \left\| - \left[ R_{XX}^{MEE} \right]^{-1} \left( \frac{1}{2N^2\sigma^2} \sum_{i=1}^{N} \sum_{j=1}^{N} \big(e(i)-e(j)\big)\big(x_s(i)-x_s(j)\big)\,\kappa_{\sigma\sqrt{2}}\big(e(i)-e(j)\big)\,[X(i)-X(j)][X(i)-X(j)]^T \right) f(W) \right. \\
&\qquad \left. + \left[ R_{XX}^{MEE} \right]^{-1} \left( \frac{1}{2N^2\sigma^2} \sum_{i=1}^{N} \sum_{j=1}^{N} \big(e(i)-e(j)\big)\big(x_s(i)-x_s(j)\big)\,\kappa_{\sigma\sqrt{2}}\big(e(i)-e(j)\big)\,\big(d(i)-d(j)\big)[X(i)-X(j)] \right) \right\|_1 \\
&\le \left\| \left[ R_{XX}^{MEE} \right]^{-1} \left( \frac{1}{2N^2\sigma^2} \sum_{i=1}^{N} \sum_{j=1}^{N} \big(e(i)-e(j)\big)\big(x_s(i)-x_s(j)\big)\,\kappa_{\sigma\sqrt{2}}\big(e(i)-e(j)\big)\,[X(i)-X(j)][X(i)-X(j)]^T \right) f(W) \right\|_1 \\
&\quad + \left\| \left[ R_{XX}^{MEE} \right]^{-1} \left( \frac{1}{2N^2\sigma^2} \sum_{i=1}^{N} \sum_{j=1}^{N} \big(e(i)-e(j)\big)\big(x_s(i)-x_s(j)\big)\,\kappa_{\sigma\sqrt{2}}\big(e(i)-e(j)\big)\,\big(d(i)-d(j)\big)[X(i)-X(j)] \right) \right\|_1
\end{aligned} \qquad (24)
$$
It is easy to derive
$$
\begin{aligned}
&\left\| \left[ R_{XX}^{MEE} \right]^{-1} \left( \frac{1}{2N^2\sigma^2} \sum_{i=1}^{N} \sum_{j=1}^{N} \big(e(i)-e(j)\big)\big(x_s(i)-x_s(j)\big)\,\kappa_{\sigma\sqrt{2}}\big(e(i)-e(j)\big)\,[X(i)-X(j)][X(i)-X(j)]^T \right) f(W) \right\|_1 \\
&\quad \le \frac{1}{2N^2\sigma^2} \left\| \left[ R_{XX}^{MEE} \right]^{-1} \right\|_1 \left\| \sum_{i=1}^{N} \sum_{j=1}^{N} \big(e(i)-e(j)\big)\big(x_s(i)-x_s(j)\big)\,\kappa_{\sigma\sqrt{2}}\big(e(i)-e(j)\big)\,[X(i)-X(j)][X(i)-X(j)]^T \right\|_1 \| f(W) \|_1 \\
&\quad \overset{(d)}{\le} \frac{\beta}{2N^2\sigma^2} \left\| \left[ R_{XX}^{MEE} \right]^{-1} \right\|_1 \left\{ \sum_{i=1}^{N} \sum_{j=1}^{N} \left\| \big(e(i)-e(j)\big)\big(x_s(i)-x_s(j)\big)\,\kappa_{\sigma\sqrt{2}}\big(e(i)-e(j)\big)\,[X(i)-X(j)][X(i)-X(j)]^T \right\|_1 \right\} \\
&\quad \overset{(e)}{\le} \frac{\beta}{4N^2\sigma^3\sqrt{\pi}} \left\| \left[ R_{XX}^{MEE} \right]^{-1} \right\|_1 \left\{ \sum_{i=1}^{N} \sum_{j=1}^{N} \big( \beta \|X(i)-X(j)\|_1 + |d(i)-d(j)| \big) \|X(i)-X(j)\|_1 \left\| [X(i)-X(j)][X(i)-X(j)]^T \right\|_1 \right\}
\end{aligned} \qquad (25)
$$
where (d) follows from the triangle inequality for the vector $l_1$-norm and from $\| f(W) \|_1 \le \beta$, and (e) is due to the facts that $\big|\big(e(i)-e(j)\big)\big(x_s(i)-x_s(j)\big)\big| \le \big( \beta \|X(i)-X(j)\|_1 + |d(i)-d(j)| \big) \|X(i)-X(j)\|_1$ and $\kappa_{\sigma\sqrt{2}}(x) \le \frac{1}{2\sigma\sqrt{\pi}}$ for any $x$. In a similar way, one can derive
$$
\begin{aligned}
&\left\| \left[ R_{XX}^{MEE} \right]^{-1} \left( \frac{1}{2N^2\sigma^2} \sum_{i=1}^{N} \sum_{j=1}^{N} \big(e(i)-e(j)\big)\big(x_s(i)-x_s(j)\big)\,\kappa_{\sigma\sqrt{2}}\big(e(i)-e(j)\big)\,\big(d(i)-d(j)\big)[X(i)-X(j)] \right) \right\|_1 \\
&\quad \le \frac{1}{2N^2\sigma^2} \left\| \left[ R_{XX}^{MEE} \right]^{-1} \right\|_1 \left\{ \sum_{i=1}^{N} \sum_{j=1}^{N} \left\| \big(e(i)-e(j)\big)\big(x_s(i)-x_s(j)\big)\,\kappa_{\sigma\sqrt{2}}\big(e(i)-e(j)\big)\,\big(d(i)-d(j)\big)[X(i)-X(j)] \right\|_1 \right\} \\
&\quad \le \frac{1}{4N^2\sigma^3\sqrt{\pi}} \left\| \left[ R_{XX}^{MEE} \right]^{-1} \right\|_1 \left\{ \sum_{i=1}^{N} \sum_{j=1}^{N} \big( \beta \|X(i)-X(j)\|_1 + |d(i)-d(j)| \big) \, |d(i)-d(j)| \, \|X(i)-X(j)\|_1^2 \right\}
\end{aligned} \qquad (26)
$$
Then, combining (24)–(26), (17) and (18), we have
$$
\begin{aligned}
\left\| \frac{\partial}{\partial w_s} f(W) \right\|_1
&\le \frac{\beta}{4N^2\sigma^3\sqrt{\pi}} \left\| \left[ R_{XX}^{MEE} \right]^{-1} \right\|_1 \left\{ \sum_{i=1}^{N} \sum_{j=1}^{N} \big( \beta \|X(i)-X(j)\|_1 + |d(i)-d(j)| \big) \|X(i)-X(j)\|_1 \left\| [X(i)-X(j)][X(i)-X(j)]^T \right\|_1 \right\} \\
&\quad + \frac{1}{4N^2\sigma^3\sqrt{\pi}} \left\| \left[ R_{XX}^{MEE} \right]^{-1} \right\|_1 \left\{ \sum_{i=1}^{N} \sum_{j=1}^{N} \big( \beta \|X(i)-X(j)\|_1 + |d(i)-d(j)| \big) \, |d(i)-d(j)| \, \|X(i)-X(j)\|_1^2 \right\} \\
&\le \frac{\gamma \sqrt{m} / \sqrt{\pi}}{4\sigma^3 \, \lambda_{\min}\!\left[ \sum_{i=1}^{N} \sum_{j=1}^{N} \kappa_{\sigma\sqrt{2}}\big( \beta \|X(i)-X(j)\|_1 + |d(i)-d(j)| \big) [X(i)-X(j)][X(i)-X(j)]^T \right]} \\
&= \frac{\gamma \sqrt{m}}{2\sigma^2 \, \lambda_{\min}\!\left[ \sum_{i=1}^{N} \sum_{j=1}^{N} \exp\!\left( -\dfrac{\big( \beta \|X(i)-X(j)\|_1 + |d(i)-d(j)| \big)^2}{4\sigma^2} \right) [X(i)-X(j)][X(i)-X(j)]^T \right]} \\
&= \psi(\sigma)
\end{aligned} \qquad (27)
$$
Obviously, $\psi(\sigma)$ is also a continuous and monotonically decreasing function of $\sigma$ over $(0, \infty)$, satisfying $\lim_{\sigma \to 0^+} \psi(\sigma) = \infty$ and $\lim_{\sigma \to \infty} \psi(\sigma) = 0$. Therefore, given $0 < \alpha < 1$, the equation $\psi(\sigma) = \alpha$ has a unique solution $\sigma^\dagger$ over $(0, \infty)$, and if $\sigma \ge \sigma^\dagger$, we have $\psi(\sigma) \le \alpha$. This completes the proof. □
According to Theorem 2 and the Banach fixed-point theorem [11], given an initial weight vector satisfying $\|W_0\|_1 \le \beta$, the fixed-point MEE algorithm (11) is guaranteed to converge to a unique fixed point in the range $W \in \{W \in \mathbb{R}^m : \|W\|_1 \le \beta\}$, provided that the kernel bandwidth $\sigma$ is larger than a certain value. Moreover, the value of $\alpha$ ($0 < \alpha < 1$) controls the guaranteed convergence rate: the distance to the fixed point contracts by at least a factor of $\alpha$ at each iteration. It is worth noting that the derived sufficient condition is certainly somewhat loose, owing to the repeated relaxations in the proof.
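For a given data set, the quantities appearing in Theorems 1 and 2 can be evaluated numerically, and the two critical bandwidths can be obtained by one-dimensional root finding, since $\varphi$ and $\psi$ are monotonically decreasing. The sketch below (added here for illustration; it assumes NumPy/SciPy, and the function names, bracketing interval and parameter values are arbitrary choices) returns $\xi$ together with callables for $\varphi(\sigma)$ and $\psi(\sigma)$.

```python
import numpy as np
from scipy.optimize import brentq

def convergence_bounds(X, d, beta):
    """xi, phi(sigma) and psi(sigma) of Theorems 1 and 2 for data X (N, m) and d (N,)."""
    N, m = X.shape
    dX = X[:, None, :] - X[None, :, :]             # X(i) - X(j)
    dd = np.abs(d[:, None] - d[None, :])           # |d(i) - d(j)|
    l1 = np.abs(dX).sum(axis=2)                    # ||X(i) - X(j)||_1
    outer = np.einsum('ija,ijb->ijab', dX, dX)     # [X(i) - X(j)][X(i) - X(j)]^T
    num = np.sqrt(m) * (dd * l1).sum()
    xi = num / np.linalg.eigvalsh(outer.sum(axis=(0, 1)))[0]
    b = beta * l1 + dd                             # beta ||X(i) - X(j)||_1 + |d(i) - d(j)|

    def lam_min(sigma):                            # smallest eigenvalue of the weighted sum
        w = np.exp(-b**2 / (4.0 * sigma**2))
        return np.linalg.eigvalsh(np.einsum('ij,ijab->ab', w, outer))[0]

    phi = lambda sigma: num / lam_min(sigma)
    # ||[X(i)-X(j)][X(i)-X(j)]^T||_1 is the maximum absolute column sum
    col1 = np.abs(outer).sum(axis=2).max(axis=2)
    gamma = (b * l1 * (beta * col1 + dd * l1)).sum()
    psi = lambda sigma: gamma * np.sqrt(m) / (2.0 * sigma**2 * lam_min(sigma))
    return xi, phi, psi

# hypothetical usage: choose beta > xi and 0 < alpha < 1, then solve for the bandwidths
# xi, phi, psi = convergence_bounds(X, d, beta=3.0)
# sigma_star = brentq(lambda s: phi(s) - 3.0, 1e-2, 1e3)      # phi(sigma*) = beta
# sigma_dagger = brentq(lambda s: psi(s) - 0.99, 1e-2, 1e3)   # psi(sigma_dagger) = alpha
```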

4. Illustrative Example

In the following, we give an illustrative example to verify the derived sufficient condition that guarantees the convergence of the fixed-point MEE algorithm. Let us consider a simple linear model:
$$ d(i) = 2X(i) + v(i) \qquad (28) $$
where $X(i)$ is a scalar input and $v(i)$ is an additive noise. Assume that $X(i)$ is uniformly distributed over $[-3, 3]$ and $v(i)$ is zero-mean Gaussian with variance 0.01. We generate 100 training samples $\{X(i), d(i)\}_{i=1}^{100}$ from the system (28). Based on these data, we calculate
$$ \xi = \frac{\sum_{i=1}^{100} \sum_{j=1}^{100} |d(i)-d(j)| \, |X(i)-X(j)|}{\sum_{i=1}^{100} \sum_{j=1}^{100} |X(i)-X(j)|^2} = 1.9714 \qquad (29) $$
We choose $\beta = 3 > \xi$ and $\alpha = 0.9938 < 1$. Then, by solving the equations $\varphi(\sigma) = \beta$ and $\psi(\sigma) = \alpha$, we obtain $\sigma^* = 2.38$ and $\sigma^\dagger = 2.68$. Therefore, by Theorem 2, if $\sigma \ge 2.68$ the fixed-point MEE algorithm will converge to a unique solution in the range $-3 \le W \le 3$. Figure 1, Figure 2 and Figure 3 illustrate the curves of the functions $W$, $f(W) = (R_{XX}^{MEE})^{-1} P_{dX}^{MEE}$, and $|df(W)/dW|$ for $\sigma = 3.0$, $0.1$ and $0.01$, respectively. From the figures we observe: (i) when $\sigma = 3.0 > 2.68$, we have $|f(W)| < 3$ and $|df(W)/dW| < \alpha$ for $-3 \le W \le 3$; (ii) when $\sigma = 0.1 < 2.68$, we still have $|f(W)| < 3$ and $|df(W)/dW| < \alpha$ for $-3 \le W \le 3$, so the algorithm still converges to a unique solution in this range. This confirms that the derived sufficient condition is rather loose (i.e., far from necessary), mainly because of the many relaxations made in the derivation; (iii) however, when $\sigma$ is too small, say $\sigma = 0.01$, the condition $|df(W)/dW| < \alpha$ no longer holds for some $W \in [-3, 3]$, and the algorithm may diverge.
Figure 1. Plots of the functions $W$, $f(W)$ and $|df(W)/dW|$ when $\sigma = 3.0$.
Figure 2. Plots of the functions $W$, $f(W)$ and $|df(W)/dW|$ when $\sigma = 0.1$.
Figure 3. Plots of the functions $W$, $f(W)$ and $|df(W)/dW|$ when $\sigma = 0.01$.
Table 1 shows the numbers of iterations needed for convergence with different kernel bandwidths (3.0, 1.0, 0.1, 0.05). The initial weight is set at $W_0 = 0.1$, and the stop condition for convergence is
$$ \left| \frac{W_k - W_{k-1}}{W_{k-1}} \right| < 10^{-6} \qquad (30) $$
As one can see, when $\sigma = 3.0 \ge \max\{\sigma^*, \sigma^\dagger\}$, the fixed-point MEE algorithm converges within a few iterations. When $\sigma$ becomes smaller, the algorithm may still converge, but the convergence becomes much slower. Note that when $\sigma$ is too small (e.g., $\sigma = 0.01$), the algorithm diverges (the corresponding results are not shown in Table 1).
Table 1. Numbers of iterations for convergence with different kernel bandwidths $\sigma$.

σ            3.0    1.0    0.1    0.05
Iterations     3      4     16      43
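To give a feel for these numbers, the following sketch (a reconstruction added here, not the authors' code; the random data realization differs, so the iteration counts need not match Table 1 exactly) re-runs the scalar example with the stop condition (30).

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100
X = rng.uniform(-3.0, 3.0, size=N)               # scalar inputs, uniform on [-3, 3]
d = 2.0 * X + 0.1 * rng.standard_normal(N)       # d(i) = 2 X(i) + v(i), with var(v) = 0.01

def fixed_point_mee_scalar(sigma, w0=0.1, max_iter=1000, tol=1e-6):
    """Scalar fixed-point MEE iteration with the relative-change stop condition (30)."""
    w = w0
    for k in range(1, max_iter + 1):
        e = d - w * X
        de = e[:, None] - e[None, :]
        kern = np.exp(-de**2 / (4.0 * sigma**2))  # constant kernel factors cancel in the ratio below
        dX = X[:, None] - X[None, :]
        dd = d[:, None] - d[None, :]
        w_new = (kern * dd * dX).sum() / (kern * dX * dX).sum()   # scalar form of W = R^{-1} P
        if abs((w_new - w) / w) < tol:
            return w_new, k
        w = w_new
    return w, max_iter

for sigma in (3.0, 1.0, 0.1, 0.05):
    w_hat, iters = fixed_point_mee_scalar(sigma)
    print(f"sigma = {sigma:4.2f}: w = {w_hat:.4f}, iterations = {iters}")
```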

5. Conclusion

The MEE criterion has received increasing attention in signal processing and machine learning due to its desirable performance in adaptive system training, especially with non-Gaussian data. Many iterative optimization methods have been developed to minimize the error entropy in practice, but fixed-point algorithms have seldom been studied, and in particular little attention has been paid to the convergence of fixed-point MEE algorithms. This paper presented a theoretical study of this problem and proved a sufficient condition that guarantees the convergence of a fixed-point MEE algorithm. The results may suggest a feasible range for choosing the kernel bandwidth in MEE learning. However, the derived sufficient condition may require a much larger kernel bandwidth than necessary, due to the relaxations made in the derivation. In future work, we will try to derive a tighter sufficient condition that ensures the convergence of the fixed-point MEE algorithm.

Acknowledgments

This work was supported by 973 Program (No. 2015CB351703) and National NSF of China (No. 61372152).

Author Contributions

Yu Zhang and Badong Chen proved the main theorems in this paper, Xi Liu presented the illustrative example, and Zejian Yuan and Jose C. Principe polished the language and were in charge of technical checking. All authors have read and approved the final manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Principe, J.C. Information Theoretic Learning: Renyi’s Entropy and Kernel Perspectives; Springer: New York, NY, USA, 2010. [Google Scholar]
  2. Chen, B.; Zhu, Y.; Hu, J.C.; Principe, J.C. System Parameter Identification: Information Criteria and Algorithms; Elsevier: Amsterdam, the Netherlands, 2013. [Google Scholar]
  3. Silverman, B.W. Density Estimation for Statistics and Data Analysis; Chapman & Hall: New York, NY, USA, 1986. [Google Scholar]
  4. Erdogmus, D.; Principe, J.C. An error-entropy minimization for supervised training of nonlinear adaptive systems. IEEE Trans. Signal Process. 2002, 50, 1780–1786. [Google Scholar] [CrossRef]
  5. Erdogmus, D.; Principe, J.C. Generalized information potential criterion for adaptive system training. IEEE Trans. Neural Netw. 2002, 13, 1035–1044. [Google Scholar] [CrossRef] [PubMed]
  6. Erdogmus, D.; Principe, J.C. Convergence properties and data efficiency of the minimum error entropy criterion in ADALINE training. IEEE Trans. Signal Process. 2003, 51, 1966–1978. [Google Scholar] [CrossRef]
  7. Chen, B.; Zhu, Y.; Hu, J. Mean-square convergence analysis of ADALINE training with minimum error entropy criterion. IEEE Trans. Neural Netw. 2010, 21, 1168–1179. [Google Scholar] [CrossRef] [PubMed]
  8. Chen, B.; Principe, J.C. Some further results on the minimum error entropy estimation. Entropy 2012, 14, 966–977. [Google Scholar] [CrossRef]
  9. Chen, B.; Principe, J.C. On the Smoothed Minimum Error Entropy Criterion. Entropy 2012, 14, 2311–2323. [Google Scholar] [CrossRef]
  10. Marques de Sá, J.P.; Silva, L.M.A.; Santos, J.M.F.; Alexandre, L.A. Minimum Error Entropy Classification; Springer: London, UK, 2013. [Google Scholar]
  11. Agarwal, R.P.; Meehan, M.; O’Regan, D. Fixed Point Theory and Applications; Cambridge University Press: Cambridge, UK, 2001. [Google Scholar]
  12. Cichocki, A.; Amari, S. Adaptive Blind Signal and Image Processing: Learning Algorithms and Applications; Wiley: New York, NY, USA, 2002. [Google Scholar]
  13. Regalia, P.A.; Kofidis, E. Monotonic convergence of fixed-point algorithms for ICA. IEEE Trans. Neural Netw. 2003, 14, 943–949. [Google Scholar] [CrossRef] [PubMed]
  14. Fiori, S. Fast fixed-point neural blind-deconvolution algorithm. IEEE Trans. Neural Netw. 2004, 15, 455–459. [Google Scholar] [CrossRef] [PubMed]
  15. Han, S.; Principe, J.C. A fixed-point minimum error entropy algorithm. In Proceedings of the 16th IEEE Signal Processing Society Workshop on Machine Learning for Signal Processing, Arlington, VA, USA, 6–8 September 2006; pp. 167–172.
  16. Chen, J.; Richard, C.; Bermudez, J.C.M.; Honeine, P. Non-negative least-mean-square algorithm. IEEE Trans. Signal Process. 2011, 59, 5225–5235. [Google Scholar] [CrossRef]
  17. Chen, J.; Richard, C.; Bermudez, J.C.M.; Honeine, P. Variants of non-negative least-mean-square algorithm and convergence analysis. IEEE Trans. Signal Process. 2014, 62, 3990–4005. [Google Scholar] [CrossRef]
  18. Chen, B.; Wang, J.; Zhao, H.; Zheng, N.; Principe, J.C. Convergence of a fixed-point algorithm under Maximum Correntropy Criterion. IEEE Signal Process. Lett. 2015, 22, 1723–1727. [Google Scholar] [CrossRef]
  19. Kailath, T.; Sayed, A.H.; Hassibi, B. Linear Estimation; Prentice Hall: Upper Saddle River, NJ, USA, 2000. [Google Scholar]
