Article

On Supervised Classification of Feature Vectors with Independent and Non-Identically Distributed Elements

Electrical and Computer Systems Engineering, Monash University, Alliance Ln, Clayton, VIC 3168, Australia
*
Author to whom correspondence should be addressed.
Submission received: 26 July 2021 / Revised: 6 August 2021 / Accepted: 10 August 2021 / Published: 13 August 2021
(This article belongs to the Special Issue Information Theory and Machine Learning)

Abstract
In this paper, we investigate the problem of classifying feature vectors with mutually independent but non-identically distributed elements that take values from a finite alphabet set. First, we show the importance of this problem. Next, we propose a classifier and derive an analytical upper bound on its error probability. We show that the error probability tends to zero as the length of the feature vectors grows, even when only one training feature vector per label is available. Thereby, we show that for this important problem at least one asymptotically optimal classifier exists. Finally, we provide numerical examples showing that the proposed classifier outperforms conventional classification algorithms when the number of training data is small and the length of the feature vectors is sufficiently large.

1. Introduction

1.1. Background

Supervised classification is a machine learning technique that maps an input feature vector to an output label based on a set of correctly labeled training data. There is no single learning algorithm that works best on all supervised learning problems, as shown by the no free lunch theorem in [1]. As a result, there are many algorithms proposed in the literature whose performance depends on the underlying problem and the amount of training data available. The most widely used algorithms in the literature are decision trees [2,3], Support Vector Machines (SVM) [4,5], Rule-Based Systems [6], naive Bayes classifiers [7], k-nearest neighbors (KNN) [8], logistic regressions, and neural networks [9,10].

1.2. Motivation

In the following, we discuss the motivation for this work.

1.2.1. Lack of Tight Upper Bounds on the Performance of Classifiers

In general, there are no tight upper bounds on the performance of the classifiers used in practice. Many previous works only provide experimental performance results. However, this approach has drawbacks. For example, one has to rely on trial and error in order to develop a good classifier for a given problem, which undermines reliability. Moreover, algorithms whose performance has been verified only experimentally may work for a given problem, but may fail when applied to a similar problem. Finally, experimental results do not provide intuition into the underlying problem, whereas analytical results provide an understanding of the underlying problem and of the corresponding solutions.
Motivated by this, in this paper, we aim to investigate classifiers with analytical upper bounds on their performance.

1.2.2. Independent and Non-Identically Distributed Features

In general, we can categorize the statistical properties of the feature vectors, which are the input to the classifier, into three types. To this end, let $Y^n(X) = [Y_1(X), Y_2(X), \ldots, Y_n(X)]$ denote the input feature vector to the supervised classifier, where $n$ is the length of the feature vector and $X$ is the label to which the feature vector $Y^n(X)$ belongs. Then, we can distinguish the following three types of feature vectors depending on the statistics of the elements in the feature vector $Y^n(X)$.
The first type of feature vector is when the elements of $Y^n(X)$ are independent and identically distributed (i.i.d.). This is the simplest features model, but also the least applicable in practice. This model is identical to hypothesis testing, which has been well investigated in the literature [11,12,13]. As a result, tight upper bounds on the performance of supervised learning algorithms for this type of feature vector are available in the hypothesis testing literature. For instance, the authors in [11] showed that the posterior entropy and the maximum a posteriori error probability decay to zero with the length of the feature vector at the identical exponential rate, where the maximum achievable exponent is the minimum Chernoff information. In [12], the authors determine the requirements on the length of the vector $Y^n(X)$ and the number of labels $m$ in order to achieve a vanishing exponential error probability in testing $m$ hypotheses while minimizing the rejection zone. In [13], the authors provide an upper bound and a lower bound on the error probability of Bayesian $m$-ary hypothesis testing in terms of conditional entropy.
The second type of feature vector is when the elements of $Y^n(X)$ are mutually dependent and non-identically distributed (d.non-i.d.). This is the most general features model and the most applicable in practice. However, it is also the most difficult to tackle analytically. As a result, supervised learning algorithms proposed for this features model lack tight analytical upper bounds on their performance [14,15,16,17,18,19,20,21,22,23]. This is because there is no framework that produces closed-form results when deriving statistics of vectors with d.non-i.d. elements when the underlying distributions are unknown. How, then, can we analytically investigate classifiers for practical scenarios in which the feature vectors have d.non-i.d. elements? A possible approach leads us to the third type of feature vector, explained in the following.
The third type of feature vector is when the elements of $Y^n(X)$ are mutually independent but non-identically distributed (i.non-i.d.). This features model is much simpler than the d.non-i.d. features model and, more importantly, it is analytically tractable, as we show in this paper. Furthermore, this features model is applicable in practice. Specifically, there exists a class of algorithms, known as Independent Component Analysis (ICA), that transforms vectors with d.non-i.d. elements into vectors with i.non-i.d. elements with zero or negligible loss of information [24,25,26,27,28]. The origins of ICA can be traced back to Barlow [29], who argued that a good representation of binary data can be achieved by an invertible transformation that transforms vectors with d.non-i.d. elements into vectors with i.non-i.d. elements. Finding such a transformation with no prior information about the distribution of the data had been considered an open problem until recently [28]. Specifically, the authors in [28] show that this hard problem can be accurately solved with a branch-and-bound search tree algorithm, or tightly approximated with a series of linear problems. Thereby, the authors in [28] provide the first efficient set of solutions to Barlow's problem. So far, the complexity of the fastest such algorithm is $O(n \times 2^n)$ [28]. Nevertheless, since there exist such invertible transformations (i.e., with no loss of information) that can transform vectors with d.non-i.d. elements into vectors with i.non-i.d. elements, we can tackle the features model comprised of d.non-i.d. elements by first transforming it (without loss of information) into the features model comprised of i.non-i.d. elements and then tackling the i.non-i.d. features model.
Motivated by this, in this paper, we investigate supervised classification of feature vectors with i.non-i.d. elements.

1.2.3. Small Training Set

The main factor that impacts the accuracy of supervised classification is the amount of training data. In fact, most supervised algorithms are able to learn only if there is a very large set of training data available [30]. The main reason for this is the curse of dimensionality [31,32], which states that “the higher the dimensionality of the feature vectors, the more training data are needed for the supervised classifier” [33]. For example, supervised classification methods such as random forest [34,35] and KNN [36] suffer from the curse of dimensionality. However, having large training data sets is not always possible in practice. As a result, designing a supervised classification algorithm that exhibits good performance even when the training data set is extremely small is important.
Motivated by this, in this paper, we investigate supervised classifiers for the case when t training feature vectors per label are available, where t = 1 , 2 , . . .

1.3. Contributions

In this paper, we propose an algorithm for supervised classification of feature vectors with i.non-i.d. elements when the number of training feature vectors per label is $t$, where $t = 1, 2, \ldots$ Next, we derive an upper bound on the error probability of the proposed classifier for uniformly distributed labels and prove that the error probability decays exponentially to zero as the length of the feature vector, $n$, grows, even when only one training vector per label is available, i.e., when $t = 1$. Hence, the proposed classification algorithm provides asymptotically optimal performance even when the number of training vectors per label is extremely small. We compare the performance of the proposed classifier with the naive Bayes classifier and the KNN algorithm. Our numerical results show that the proposed classifier significantly outperforms the naive Bayes classifier and the KNN algorithm when the number of training feature vectors per label is small and the length of the feature vectors $n$ is sufficiently large.
The proposed algorithm is a form of the nearest neighbor classification algorithm, where the nearest neighbor is searched for in the domain of empirical distributions. As a result, we refer to the algorithm as the nearest empirical distribution classifier. The nearest empirical distribution algorithm is not new and, to the best of our knowledge, was first proposed in [37] for the case when the elements of $Y^n(X)$ are i.i.d., i.e., for the equivalent problem of hypothesis testing. However, in this paper, we propose the nearest empirical distribution algorithm for the case when the elements of $Y^n(X)$ are i.non-i.d., which is much more complex than the problem of hypothesis testing where the elements of $Y^n(X)$ are i.i.d.
To the best of our knowledge, this is the first paper that investigates the important problem of classifying feature vectors with i.non-i.d. elements and provides an upper bound on the corresponding error probability. The novelty of this paper is not the classifier itself, but rather in showing the importance of the problem of classifying feature vectors with i.non-i.d. elements and in showing analytically that at least one classifier with an asymptotically optimal error probability exists when at least one training feature vector per label is available.
The remainder of this paper is structured as follows. In Section 2, we formulate the considered classification problem. In Section 3, we present our classifier and derive an upper bound on its error probability. In Section 4, we provide numerical examples of the performance of the proposed classifier. Finally, Section 5 concludes the paper.

2. Problem Formulation

The machine learning model is comprised of a label $X$, a feature vector $Y^n(X) = [Y_1(X), Y_2(X), \ldots, Y_n(X)]$ of length $n$ mapped to the label $X$, and a learned label $\hat{X}$, as shown in Figure 1. In this paper, we adopt the information-theoretic style of notation, whereby random variables are denoted by capital letters and their realizations are denoted by small letters. The feature vector $Y^n(X)$ is the input to the machine learning algorithm, whose aim is to detect the label $X$ from the observed feature vector $Y^n(X)$. The performance of the machine learning algorithm is measured by the error probability $P_e = \Pr\{X \neq \hat{X}\}$.
We adopt the modeling in [38,39,40] and represent the dependency between the label $X$ and the feature vector $Y^n(X)$ via a joint probability distribution $p_{X,Y^n}(x, y^n)$. Now, in order to gain a better understanding of the problem, we include the joint probability distribution $p_{X,Y^n}(x, y^n)$ in the model in Figure 1. To this end, since $p_{X,Y^n}(x, y^n) = p_{Y^n|X}(y^n|x)\, p_X(x)$ holds, instead of $p_{X,Y^n}(x, y^n)$, we can include the conditional probability distribution $p_{Y^n|X}(y^n|x)$ and the probability distribution $p_X(x)$ in the model in Figure 1, and thereby obtain the model in Figure 2.
Now, the classification learning model in Figure 2 is a system comprised of a source that generates the label $X$ according to the distribution $p_X(x)$, a feature vector generator modelled by the conditional probability distribution $p_{Y^n|X}(y^n|x)$, a feature vector $Y^n$, a classifier that aims to detect $X$ from the observed feature vector $Y^n$, and the detected label $\hat{X}$. Note that the system model in Figure 2 can be seen equivalently as a communication system comprised of a source $X$, a channel with input $X$ and output $Y^n$, and a decoder (i.e., a detector) that aims to detect $X$ from $Y^n$. The notation used in this paper, the letter $X$ for labels and the letter $Y$ for features, is based on the notation used in information theory for modelling communication systems. In the classification model shown in Figure 2, we assume that the label $X$ takes values from the set $\mathcal{X}$ according to $p_X(x) = 1/|\mathcal{X}|$, where $|\cdot|$ denotes the cardinality of a set. Next, we assume that the $i$-th element of the feature vector $Y^n$, $Y_i$, for $i = 1, 2, \ldots, n$, takes values from the set $\mathcal{Y} = \{y_1, y_2, \ldots, y_{|\mathcal{Y}|}\}$ according to the conditional probability distribution $p_{Y_i|X}(y_i|x)$.
Moreover, we assume that the elements of the feature vector $Y^n$ are i.non-i.d. As a result, the feature vector $Y^n$ takes values from the set $\mathcal{Y}^n$ according to the conditional probability distribution $p_{Y^n|X}(y^n|x)$ given by
$$p_{Y^n|X}(y^n|x) = p_{Y_1, Y_2, \ldots, Y_n|X}(y_1, y_2, \ldots, y_n|x) \overset{(a)}{=} \prod_{i=1}^{n} p_{Y_i|X}(y_i|x) \overset{(b)}{=} \prod_{i=1}^{n} p_i(y_i|x), \tag{1}$$
where $(a)$ comes from the fact that the elements of the feature vector $Y^n$ are mutually independent and $(b)$ is for the sake of notational simplicity, where $p_i$ is used instead of $p_{Y_i|X}$. As a result of (1), the considered classification model in Figure 2 can be represented equivalently as in Figure 3.
Next, we assume that $p_i(y_i|x)$, $\forall i$, and thereby $p_{Y^n|X}(y^n|x)$, are unknown to the classifier. Instead, the classifier knows $\mathcal{X}$ and $\mathcal{Y}$, and for each $x_i \in \mathcal{X}$, where $i = 1, 2, \ldots, |\mathcal{X}|$, it has access to a finite set of $t$ correctly labelled input–output pairs $(x_i, \hat{y}_{i1}^n), (x_i, \hat{y}_{i2}^n), \ldots, (x_i, \hat{y}_{it}^n)$, denoted by $\mathcal{T}_i$ and referred to as the training set for label $x_i$.
Finally, we assume that the following holds
$$\sum_{l=1}^{n} p_l(y|x_i) \neq \sum_{l=1}^{n} p_l(y|x_j), \quad \text{for } y \in \mathcal{Y} \text{ and } i \neq j. \tag{2}$$
The condition in (2) means that the distribution of the feature vectors $Y^n(X)$ for label $X = x_i$ is not a perturbation of the distribution of the feature vectors $Y^n(X)$ for label $X = x_j$. As a result, the proposed classifier only applies to the subset of data vectors with i.non-i.d. elements that satisfy (2).
For the classification system model defined above and illustrated in Figure 3, we wish to propose a classifier that exhibits an asymptotically optimal error probability $P_e = \Pr\{X \neq \hat{X}\}$ with respect to the length of $Y^n$, $n$, for any $t \geq 1$, i.e., for any $t \geq 1$, $P_e \to 0$ as $n \to \infty$. Moreover, we wish to obtain an analytical upper bound on the error probability of the proposed classifier for given $t$ and $n$.
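To make the above data model concrete, the following minimal Python sketch (provided only for illustration; the function names, array layout, and toy numbers are ours and are not part of the system model) draws a feature vector whose elements are independent but governed by position-dependent conditional distributions $p_i(y_i|x)$, as in (1) and Figure 3.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_feature_vector(P, x, rng):
    """Draw one feature vector y^n for label index x.

    P has shape (|X|, n, |Y|): P[x, i, :] is the per-position conditional pmf
    p_i(. | x), so the elements of y^n are independent but non-identically
    distributed (i.non-i.d.).
    """
    n, alphabet_size = P.shape[1], P.shape[2]
    return np.array([rng.choice(alphabet_size, p=P[x, i]) for i in range(n)])

# Toy instance (illustrative values only): |X| = 2 labels, n = 4 positions,
# |Y| = 3 symbols, with a different pmf at every position.
P = rng.random((2, 4, 3))
P /= P.sum(axis=2, keepdims=True)   # normalize each p_i(. | x) to sum to one

y = sample_feature_vector(P, x=0, rng=rng)
print(y)  # e.g., array of 4 symbol indices in {0, 1, 2}
```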

3. The Proposed Classifier and Its Performance

In this section, we propose our classifier, derive an analytical upper bound on its error probability, and prove that the classifier exhibits asymptotically optimal performance as the length of the feature vector $Y^n$, $n$, satisfies $n \to \infty$. This is done in the following.
For a given vector $v^n = (v_1, v_2, \ldots, v_n)$, let the Minkowski distance of order $r$ be defined as
$$\| v^n \|_r = \left( \sum_{i=1}^{n} |v_i|^r \right)^{1/r}. \tag{3}$$
Moreover, for a given feature vector $y^k = (y_1, y_2, \ldots, y_k)$, let $I[y^k = y]$ be a function defined as
$$I[y^k = y] = \sum_{i=1}^{k} Z[y_i = y], \tag{4}$$
where $Z[y_i = y]$ is an indicator function that takes the value 1 if $y_i = y$ and 0 otherwise. Hence, $I[y^k = y]$ counts the number of elements in $y^k$ that have the value $y$.

3.1. The Proposed Classifier

Let $\hat{y}_i^{nt}$ be the vector obtained by concatenating all training feature vectors for the input label $x_i$ as
$$\hat{y}_i^{nt} = \big[\hat{y}_{i1}^n, \hat{y}_{i2}^n, \ldots, \hat{y}_{it}^n\big]. \tag{5}$$
Let $P_{\hat{y}_i^{nt}}$ be the empirical probability distribution of the concatenated training feature vector for label $x_i$, $\hat{y}_i^{nt}$, given by
$$P_{\hat{y}_i^{nt}} = \left[ \frac{I[\hat{y}_i^{nt} = y_1]}{nt}, \frac{I[\hat{y}_i^{nt} = y_2]}{nt}, \ldots, \frac{I[\hat{y}_i^{nt} = y_{|\mathcal{Y}|}]}{nt} \right]. \tag{6}$$
Let $y^n$ be the feature vector observed at the classifier, whose label the classifier wants to detect, and let $P_{y^n}$ denote the empirical probability distribution of $y^n$, given by
$$P_{y^n} = \left[ \frac{I[y^n = y_1]}{n}, \frac{I[y^n = y_2]}{n}, \ldots, \frac{I[y^n = y_{|\mathcal{Y}|}]}{n} \right]. \tag{7}$$
Using the above notations, we propose the following classifier.
Proposition 1.
For the considered system model, we propose a classifier with the following classification rule:
$$\hat{x} = x_{i^\ast}, \quad \text{where } i^\ast = \arg\min_{i} \big\| P_{y^n} - P_{\hat{y}_i^{nt}} \big\|_r, \tag{8}$$
where $r \geq 1$ and ties are resolved by assigning a label among the ties uniformly at random. (For example, if $\| P_{y^n} - P_{\hat{y}_i^{nt}} \|_r = \| P_{y^n} - P_{\hat{y}_j^{nt}} \|_r$ holds for $i \neq j$, we set $\hat{x} = x_i$ or $\hat{x} = x_j$ uniformly at random.)
As seen from (8), the proposed classifier assigns the label $x_i$ if the empirical probability distribution of the concatenated training feature vector mapped to label $x_i$, $P_{\hat{y}_i^{nt}}$, is the closest, in terms of the Minkowski distance of order $r$, to the empirical probability distribution of the observed feature vector, $P_{y^n}$. In that sense, the proposed classifier can be considered a nearest empirical distribution classifier.
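A minimal Python sketch of this decision rule is given next (for illustration only; the function and variable names and the toy data are ours, and ties are broken by the first minimizer rather than uniformly at random). It concatenates the $t$ training vectors per label, forms the empirical distributions (6) and (7), and picks the label whose empirical distribution is nearest in the Minkowski distance of order $r$.

```python
import numpy as np

def empirical_distribution(v, alphabet_size):
    """Empirical pmf of a symbol sequence v over {0, ..., alphabet_size - 1}."""
    counts = np.bincount(v, minlength=alphabet_size)
    return counts / len(v)

def nearest_empirical_distribution(y, train, alphabet_size, r=2):
    """Proposed classifier: return the label index minimizing
    || P_{y^n} - P_{hat{y}_i^{nt}} ||_r, cf. (8).

    train[i] is a list of t training vectors (1-D integer arrays) for label i.
    """
    P_y = empirical_distribution(y, alphabet_size)
    dists = []
    for vecs in train:
        concat = np.concatenate(vecs)                        # hat{y}_i^{nt}, cf. (5)
        P_i = empirical_distribution(concat, alphabet_size)  # cf. (6)
        dists.append(np.sum(np.abs(P_y - P_i) ** r) ** (1.0 / r))
    return int(np.argmin(dists))  # ties: first minimizer (the paper uses a random tie-break)

# Tiny made-up example: 2 labels, alphabet size 3, t = 1 training vector each.
train = [[np.array([0, 0, 1, 2, 0])], [np.array([2, 2, 1, 2, 0])]]
y_obs = np.array([0, 1, 0, 0, 2])
print(nearest_empirical_distribution(y_obs, train, alphabet_size=3))  # -> 0
```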

3.2. Upper Bound on the Error Probability

The following theorem establishes an upper bound on the error probability of the proposed classifier.
Theorem 1.
Let $\bar{P}_j$, for $j = 1, 2, \ldots, |\mathcal{X}|$, be a vector defined as
$$\bar{P}_j = \big[\bar{p}(y_1|x_j), \bar{p}(y_2|x_j), \ldots, \bar{p}(y_{|\mathcal{Y}|}|x_j)\big], \tag{9}$$
where $\bar{p}(y|x_j)$ is given by
$$\bar{p}(y|x_j) = \frac{1}{n} \sum_{k=1}^{n} p_k(y|x_j). \tag{10}$$
Then, for a given $r \geq 1$, the error probability of the proposed classifier is upper bounded by
$$P_e \leq 2|\mathcal{Y}| e^{-2n\epsilon^2} + 2|\mathcal{Y}| e^{-2n t^{1/3} \epsilon^2}, \tag{11}$$
where $\epsilon$ is given by
$$\epsilon = \min_{i \neq j} \frac{\big\| P_{\hat{y}_i^{nt}} - \bar{P}_j \big\|_r}{\big(2 + t^{-1/3}\big) |\mathcal{Y}|^{1/r}}. \tag{12}$$
Proof of Theorem 1. 
Without loss of generality, we assume that $x_1$ is the input to $p_{Y^n|X}(y^n|x)$ and that $y^n$ is observed.
Let $A_k^{\epsilon}$, for $1 \leq k \leq |\mathcal{Y}|$, be a set defined as
$$A_k^{\epsilon} = \left\{ y^n : \left| \frac{I[y^n = y_k]}{n} - \bar{p}(y_k|x_1) \right| \leq \epsilon \right\}. \tag{13}$$
Furthermore, let $B_k^{\epsilon}$, for $1 \leq k \leq |\mathcal{Y}|$, be a set defined as
$$B_k^{\epsilon} = \left\{ \hat{y}^{nt} : \left| \frac{I[\hat{y}^{nt} = y_k]}{nt} - \bar{p}(y_k|x_1) \right| \leq \frac{\epsilon}{\sqrt[3]{t}} \right\}. \tag{14}$$
Let $A^{\epsilon} = \bigcap_{k=1}^{|\mathcal{Y}|} A_k^{\epsilon}$ and $B^{\epsilon} = \bigcap_{k=1}^{|\mathcal{Y}|} B_k^{\epsilon}$. Now, for any $y^n \in A^{\epsilon}$, we have
$$\left( \sum_{k=1}^{|\mathcal{Y}|} \left| \frac{I[y^n = y_k]}{n} - \bar{p}(y_k|x_1) \right|^r \right)^{1/r} \overset{(a)}{\leq} \left( \sum_{k=1}^{|\mathcal{Y}|} \epsilon^r \right)^{1/r}, \tag{15}$$
where $(a)$ follows from (13). Moreover, for $\hat{y}_1^{nt} \in B^{\epsilon}$, we have
$$\left( \sum_{k=1}^{|\mathcal{Y}|} \left| \frac{I[\hat{y}_1^{nt} = y_k]}{nt} - \bar{p}(y_k|x_1) \right|^r \right)^{1/r} \overset{(a)}{\leq} \left( \sum_{k=1}^{|\mathcal{Y}|} \left(\frac{\epsilon}{\sqrt[3]{t}}\right)^r \right)^{1/r}, \tag{16}$$
where $(a)$ follows from (14). Next, we have the following upper bound
$$\begin{aligned}
\left( \sum_{k=1}^{|\mathcal{Y}|} \left| \frac{I[y^n = y_k]}{n} - \frac{I[\hat{y}_1^{nt} = y_k]}{nt} \right|^r \right)^{1/r}
&= \left( \sum_{k=1}^{|\mathcal{Y}|} \left| \left( \frac{I[y^n = y_k]}{n} - \bar{p}(y_k|x_1) \right) - \left( \frac{I[\hat{y}_1^{nt} = y_k]}{nt} - \bar{p}(y_k|x_1) \right) \right|^r \right)^{1/r} \\
&\overset{(a)}{\leq} \left( \sum_{k=1}^{|\mathcal{Y}|} \left| \frac{I[y^n = y_k]}{n} - \bar{p}(y_k|x_1) \right|^r \right)^{1/r} + \left( \sum_{k=1}^{|\mathcal{Y}|} \left| \frac{I[\hat{y}_1^{nt} = y_k]}{nt} - \bar{p}(y_k|x_1) \right|^r \right)^{1/r},
\end{aligned} \tag{17}$$
where $(a)$ follows from the Minkowski inequality. Combining (15)–(17), we obtain
$$\left( \sum_{k=1}^{|\mathcal{Y}|} \left| \frac{I[y^n = y_k]}{n} - \frac{I[\hat{y}_1^{nt} = y_k]}{nt} \right|^r \right)^{1/r} \leq |\mathcal{Y}|^{1/r} \epsilon + |\mathcal{Y}|^{1/r} \frac{\epsilon}{\sqrt[3]{t}}. \tag{18}$$
Hence, the Minkowski distance between the empirical probability distribution of the observed vector $y^n$ and the empirical probability distribution of the concatenated training vector for label $x_1$ is upper bounded by the right-hand side of (18). We now derive a lower bound for $\hat{y}_i^{nt}$, where $i \neq 1$. For any $x_i$ such that $i \neq 1$, we have
$$\begin{aligned}
\left( \sum_{k=1}^{|\mathcal{Y}|} \left| \frac{I[y^n = y_k]}{n} - \frac{I[\hat{y}_i^{nt} = y_k]}{nt} \right|^r \right)^{1/r} + \left( \sum_{k=1}^{|\mathcal{Y}|} \epsilon^r \right)^{1/r}
&\overset{(a)}{\geq} \left( \sum_{k=1}^{|\mathcal{Y}|} \left| \frac{I[y^n = y_k]}{n} - \frac{I[\hat{y}_i^{nt} = y_k]}{nt} \right|^r \right)^{1/r} + \left( \sum_{k=1}^{|\mathcal{Y}|} \left| \frac{I[y^n = y_k]}{n} - \bar{p}(y_k|x_1) \right|^r \right)^{1/r} \\
&\overset{(b)}{\geq} \left( \sum_{k=1}^{|\mathcal{Y}|} \left| \frac{I[\hat{y}_i^{nt} = y_k]}{nt} - \bar{p}(y_k|x_1) \right|^r \right)^{1/r},
\end{aligned} \tag{19}$$
where $(a)$ follows from (15) and $(b)$ is again due to the Minkowski inequality. The expression in (19) can be written equivalently as
$$\left( \sum_{k=1}^{|\mathcal{Y}|} \left| \frac{I[y^n = y_k]}{n} - \frac{I[\hat{y}_i^{nt} = y_k]}{nt} \right|^r \right)^{1/r} \geq \left( \sum_{k=1}^{|\mathcal{Y}|} \left| \frac{I[\hat{y}_i^{nt} = y_k]}{nt} - \bar{p}(y_k|x_1) \right|^r \right)^{1/r} - |\mathcal{Y}|^{1/r} \epsilon, \tag{20}$$
where $i \neq 1$. Now, using the definitions of $P_{\hat{y}_i^{nt}}$ and $\bar{P}_1$ given by (6) and (9), respectively, in (20), we can replace the first expression on the right-hand side of (20) by $\| P_{\hat{y}_i^{nt}} - \bar{P}_1 \|_r$, and thereby for any $i \neq 1$ we have
$$\left( \sum_{k=1}^{|\mathcal{Y}|} \left| \frac{I[y^n = y_k]}{n} - \frac{I[\hat{y}_i^{nt} = y_k]}{nt} \right|^r \right)^{1/r} \geq \big\| P_{\hat{y}_i^{nt}} - \bar{P}_1 \big\|_r - |\mathcal{Y}|^{1/r} \epsilon. \tag{21}$$
The expression in (21) represents a lower bound on the Minkowski distance of order $r$ between the empirical probability distribution of the observed vector $y^n$ and the empirical probability distribution of the concatenated training vector for any label $x_i$ with $i \neq 1$.
Using the bounds in (18) and (21), we now relate the left-hand sides of (18) and (21). As long as the following inequality holds for each $i \neq 1$,
$$|\mathcal{Y}|^{1/r} \epsilon \left( 1 + \frac{1}{\sqrt[3]{t}} \right) < \big\| P_{\hat{y}_i^{nt}} - \bar{P}_1 \big\|_r - |\mathcal{Y}|^{1/r} \epsilon, \tag{22}$$
which is equivalent to the following for $i \neq 1$
$$\epsilon < \frac{\big\| P_{\hat{y}_i^{nt}} - \bar{P}_1 \big\|_r}{\big(2 + t^{-1/3}\big) |\mathcal{Y}|^{1/r}}, \tag{23}$$
we have
$$\left( \sum_{k=1}^{|\mathcal{Y}|} \left| \frac{I[y^n = y_k]}{n} - \frac{I[\hat{y}_1^{nt} = y_k]}{nt} \right|^r \right)^{1/r} \overset{(a)}{\leq} |\mathcal{Y}|^{1/r} \epsilon \left( 1 + \frac{1}{\sqrt[3]{t}} \right) \overset{(b)}{<} \big\| P_{\hat{y}_i^{nt}} - \bar{P}_1 \big\|_r - |\mathcal{Y}|^{1/r} \epsilon \overset{(c)}{\leq} \left( \sum_{k=1}^{|\mathcal{Y}|} \left| \frac{I[y^n = y_k]}{n} - \frac{I[\hat{y}_i^{nt} = y_k]}{nt} \right|^r \right)^{1/r}, \tag{24}$$
where $(a)$, $(b)$, and $(c)$ follow from (18), (22), and (21), respectively. Thereby, from (24), we have the following for $i \neq 1$
$$\left( \sum_{k=1}^{|\mathcal{Y}|} \left| \frac{I[y^n = y_k]}{n} - \frac{I[\hat{y}_1^{nt} = y_k]}{nt} \right|^r \right)^{1/r} < \left( \sum_{k=1}^{|\mathcal{Y}|} \left| \frac{I[y^n = y_k]}{n} - \frac{I[\hat{y}_i^{nt} = y_k]}{nt} \right|^r \right)^{1/r}. \tag{25}$$
Note that the right- and left-hand sides of (25) are the Minkowski distances of order $r$ of the vectors
$$v_1 = \left[ \frac{I[y^n = y_1]}{n} - \frac{I[\hat{y}_1^{nt} = y_1]}{nt}, \ldots, \frac{I[y^n = y_{|\mathcal{Y}|}]}{n} - \frac{I[\hat{y}_1^{nt} = y_{|\mathcal{Y}|}]}{nt} \right] \tag{26}$$
and
$$v_2 = \left[ \frac{I[y^n = y_1]}{n} - \frac{I[\hat{y}_i^{nt} = y_1]}{nt}, \ldots, \frac{I[y^n = y_{|\mathcal{Y}|}]}{n} - \frac{I[\hat{y}_i^{nt} = y_{|\mathcal{Y}|}]}{nt} \right], \tag{27}$$
respectively. Now, by the definitions of $P_{y^n}$ and $P_{\hat{y}_i^{nt}}$ given by (7) and (6), respectively, (26) and (27) can be replaced by $P_{y^n} - P_{\hat{y}_1^{nt}}$ and $P_{y^n} - P_{\hat{y}_i^{nt}}$, respectively. Therefore, (25) can be written equivalently as
$$\big\| P_{y^n} - P_{\hat{y}_1^{nt}} \big\|_r < \big\| P_{y^n} - P_{\hat{y}_i^{nt}} \big\|_r. \tag{28}$$
Now, let us highlight what we have obtained. We have shown that if $\epsilon$ is such that (23) holds for all $i \neq 1$, and if $y^n \in A^{\epsilon}$ and $\hat{y}_1^{nt} \in B^{\epsilon}$ for that $\epsilon$, then (28) holds for all $i \neq 1$, and thereby our classifier detects $x_1$ as the correct label. Using this, we can upper bound the error probability as
$$P_e = 1 - \Pr\big\{ \hat{x} = x_1 \big\} \leq 1 - \Pr\big\{ y^n \in A^{\epsilon} \cap \hat{y}_1^{nt} \in B^{\epsilon} \,\big|\, \epsilon \in \mathcal{S} \big\}, \tag{29}$$
where $\mathcal{S}$ is a set defined as
$$\mathcal{S} = \left\{ \epsilon : \epsilon \leq \min_{i \neq 1} \frac{\big\| P_{\hat{y}_i^{nt}} - \bar{P}_1 \big\|_r}{\big(2 + t^{-1/3}\big) |\mathcal{Y}|^{1/r}} \right\}. \tag{30}$$
In the following, we bound the expression in (29). The right-hand side of (29) can be upper bounded as
$$1 - \Pr\big\{ y^n \in A^{\epsilon} \cap \hat{y}_1^{nt} \in B^{\epsilon} \,\big|\, \epsilon \in \mathcal{S} \big\} = \Pr\big\{ y^n \notin A^{\epsilon} \cup \hat{y}_1^{nt} \notin B^{\epsilon} \,\big|\, \epsilon \in \mathcal{S} \big\} \overset{(a)}{\leq} \Pr\big\{ y^n \notin A^{\epsilon} \,\big|\, \epsilon \in \mathcal{S} \big\} + \Pr\big\{ \hat{y}_1^{nt} \notin B^{\epsilon} \,\big|\, \epsilon \in \mathcal{S} \big\}, \tag{31}$$
where $(a)$ follows from Boole's inequality. Now, note that we have the following upper bound for the first expression on the right-hand side of (31)
$$\begin{aligned}
\Pr\big\{ y^n \notin A^{\epsilon} \,\big|\, \epsilon \in \mathcal{S} \big\}
&= \Pr\Big\{ y^n \notin \bigcap_{k=1}^{|\mathcal{Y}|} A_k^{\epsilon} \,\Big|\, \epsilon \in \mathcal{S} \Big\}
= \Pr\Big\{ y^n \in \bigcup_{k=1}^{|\mathcal{Y}|} \overline{A_k^{\epsilon}} \,\Big|\, \epsilon \in \mathcal{S} \Big\}
\overset{(a)}{\leq} \sum_{k=1}^{|\mathcal{Y}|} \Pr\big\{ y^n \in \overline{A_k^{\epsilon}} \,\big|\, \epsilon \in \mathcal{S} \big\} \\
&= \sum_{k=1}^{|\mathcal{Y}|} \Pr\left\{ \left| \frac{I[y^n = y_k]}{n} - \bar{p}(y_k|x_1) \right| > \epsilon \,\middle|\, \epsilon \in \mathcal{S} \right\}
= \sum_{k=1}^{|\mathcal{Y}|} \Pr\left\{ \left| \frac{1}{n}\sum_{j=1}^{n} Z[y_j = y_k] - \bar{p}(y_k|x_1) \right| > \epsilon \,\middle|\, \epsilon \in \mathcal{S} \right\},
\end{aligned} \tag{32}$$
where $\overline{A_k^{\epsilon}}$ is the complement of $A_k^{\epsilon}$ and $(a)$ follows from Boole's inequality. Note that $Z[y_1 = y_k], Z[y_2 = y_k], \ldots, Z[y_n = y_k]$ in (32) are $n$ independent Bernoulli random variables with probabilities of success $p_1(y_k|x_1), p_2(y_k|x_1), \ldots, p_n(y_k|x_1)$, respectively. Let $W[y_k]$ be a binomial random variable with parameters $\big(n, \bar{p}(y_k|x_1)\big)$. We proceed with the proof by recalling the following well-known theorem of Hoeffding from [41].
Theorem 2
(Hoeffding [41]). Assume that $Z_1, Z_2, \ldots, Z_n$ are $n$ independent Bernoulli random variables with probabilities of success $p_1, p_2, \ldots, p_n$, respectively. Next, let $Z$ be defined as $Z = Z_1 + Z_2 + \cdots + Z_n$, and let $\bar{p}$ be defined as $\bar{p} = (p_1 + p_2 + \cdots + p_n)/n$. Let $W$ be a binomial random variable with parameters $(n, \bar{p})$. Then, for given $a$ and $b$, where $0 \leq a \leq n\bar{p} \leq b \leq n$ holds, we have
$$\Pr\{ a \leq W \leq b \} \leq \Pr\{ a \leq Z \leq b \}. \tag{33}$$
In other words, the probability distribution of $W$ is more dispersed around its mean $n\bar{p}$ than is the probability distribution of $Z$. Except in the trivial case when $a = b = 0$, the bound in (33) holds with equality if and only if $p_1 = \cdots = p_n = \bar{p}$.
Proof of Theorem 2. 
Please refer to [41]. □
Setting $a = n(\bar{p} - \delta)$ and $b = n(\bar{p} + \delta)$ in (33), we obtain
$$\Pr\{ n(\bar{p} - \delta) \leq W \leq n(\bar{p} + \delta) \} \leq \Pr\{ n(\bar{p} - \delta) \leq Z \leq n(\bar{p} + \delta) \}. \tag{34}$$
Using (34), we have the following upper bound
$$\Pr\{ |Z - n\bar{p}| > n\delta \} = 1 - \Pr\{ n(\bar{p} - \delta) \leq Z \leq n(\bar{p} + \delta) \} \overset{(a)}{\leq} 1 - \Pr\{ n(\bar{p} - \delta) \leq W \leq n(\bar{p} + \delta) \} = \Pr\{ |W - n\bar{p}| > n\delta \}, \tag{35}$$
where ( a ) follows from (34).
We now return to the proof of Theorem 1. According to Theorem 2, the probability distribution of $W[y_k]$ is more dispersed around its mean $n\bar{p}(y_k|x_1)$ than is the probability distribution of $\sum_{1 \leq j \leq n} Z[y_j = y_k]$. Therefore, we can upper bound the probability in the last line of (32) as
$$\Pr\left\{ \left| \frac{1}{n}\sum_{j=1}^{n} Z[y_j = y_k] - \bar{p}(y_k|x_1) \right| > \epsilon \,\middle|\, \epsilon \in \mathcal{S} \right\} \overset{(a)}{\leq} \Pr\left\{ \left| \frac{1}{n} W[y_k] - \bar{p}(y_k|x_1) \right| > \epsilon \,\middle|\, \epsilon \in \mathcal{S} \right\}, \tag{36}$$
where the set $\mathcal{S}$ is defined in (30) and $(a)$ follows from (35). Now, let us introduce another well-known theorem of Hoeffding from [42].
Theorem 3
(Hoeffding's inequality [42]). Let $W_1, W_2, \ldots, W_n$ be $n$ independent random variables such that for each $1 \leq i \leq n$ we have $\Pr\{ W_i \in [a_i, b_i] \} = 1$. Then, for $S_n$ defined as $S_n = \sum_{i=1}^{n} W_i$, we have
$$\Pr\big\{ S_n - \mathbb{E}[S_n] \geq \delta \big\} \leq \exp\left( \frac{-2\delta^2}{\sum_{i=1}^{n} (b_i - a_i)^2} \right), \tag{37}$$
where $\mathbb{E}[S_n]$ is the expectation of $S_n$.
Proof of Theorem 3. 
Please refer to [42]. □
Returning to (36), by using the result in (37) with $a_i = 0$ and $b_i = 1$, since the Bernoulli random variables that constitute the binomial random variable $W[y_k]$ take values 0 or 1, we have
$$\Pr\left\{ \left| \frac{1}{n}\sum_{j=1}^{n} Z[y_j = y_k] - \bar{p}(y_k|x_1) \right| > \epsilon \,\middle|\, \epsilon \in \mathcal{S} \right\} \leq 2\exp\left( \frac{-2 n^2 \epsilon^2}{\sum_{1 \leq i \leq n} (1-0)^2} \right) = 2 e^{-2n\epsilon^2}, \tag{38}$$
where $\mathcal{S}$ is defined in (30). Inserting (38) into (32), we obtain the following upper bound
$$\Pr\big\{ y^n \notin A^{\epsilon} \,\big|\, \epsilon \in \mathcal{S} \big\} \leq 2|\mathcal{Y}| e^{-2n\epsilon^2}. \tag{39}$$
Similarly, we have the following result for the second expression on the right-hand side of (31)
$$\begin{aligned}
\Pr\big\{ \hat{y}_1^{nt} \notin B^{\epsilon} \,\big|\, \epsilon \in \mathcal{S} \big\}
&= \Pr\Big\{ \hat{y}_1^{nt} \notin \bigcap_{k=1}^{|\mathcal{Y}|} B_k^{\epsilon} \,\Big|\, \epsilon \in \mathcal{S} \Big\}
= \Pr\Big\{ \hat{y}_1^{nt} \in \bigcup_{k=1}^{|\mathcal{Y}|} \overline{B_k^{\epsilon}} \,\Big|\, \epsilon \in \mathcal{S} \Big\}
\overset{(a)}{\leq} \sum_{k=1}^{|\mathcal{Y}|} \Pr\big\{ \hat{y}_1^{nt} \in \overline{B_k^{\epsilon}} \,\big|\, \epsilon \in \mathcal{S} \big\} \\
&= \sum_{k=1}^{|\mathcal{Y}|} \Pr\left\{ \left| \frac{I[\hat{y}_1^{nt} = y_k]}{nt} - \bar{p}(y_k|x_1) \right| > \frac{\epsilon}{\sqrt[3]{t}} \,\middle|\, \epsilon \in \mathcal{S} \right\}
= \sum_{k=1}^{|\mathcal{Y}|} \Pr\left\{ \left| \frac{1}{nt}\sum_{j=1}^{nt} Z[y_j = y_k] - \bar{p}(y_k|x_1) \right| > \frac{\epsilon}{\sqrt[3]{t}} \,\middle|\, \epsilon \in \mathcal{S} \right\},
\end{aligned} \tag{40}$$
where again $(a)$ follows from Boole's inequality. Note that, due to (5), for any integer $l$ such that $0 \leq l \leq t-1$, the random variables $Z[y_{nl+1} = y_k], Z[y_{nl+2} = y_k], \ldots, Z[y_{nl+n} = y_k]$ in (40) are $n$ independent Bernoulli random variables with probabilities of success $p_1(y_k|x_1), p_2(y_k|x_1), \ldots, p_n(y_k|x_1)$, respectively (here $y_{nl+1}, y_{nl+2}, \ldots, y_{nl+n}$ are the elements of $\hat{y}_{1,l+1}^{n}$). In addition, note that
$$\bar{p}(y_k|x_1) = \frac{1}{n} \sum_{j=1}^{n} p_j(y_k|x_1) = \frac{1}{nt} \sum_{l=0}^{t-1} \sum_{j=1}^{n} p_j(y_k|x_1). \tag{41}$$
Notice that for each $0 \leq l \leq t-1$, $p_1(y_k|x_1) + p_2(y_k|x_1) + \cdots + p_n(y_k|x_1)$ is the sum of the probabilities of success of the random variables $Z[y_{nl+1} = y_k], Z[y_{nl+2} = y_k], \ldots, Z[y_{nl+n} = y_k]$. Thereby, the last expression on the right-hand side of (41) is the average probability of success of the random variables $Z[y_j = y_k]$ for $1 \leq j \leq nt$. Now, let $W[y_k]$ be a binomial random variable with parameters $\big(nt, \bar{p}(y_k|x_1)\big)$. Once again, according to Theorem 2, the probability distribution of $W[y_k]$ is more dispersed around its mean $nt\,\bar{p}(y_k|x_1)$ than is the probability distribution of $\sum_{1 \leq j \leq nt} Z[y_j = y_k]$. Therefore, the probability in the last line of (40) can be upper bounded as
$$\Pr\left\{ \left| \frac{1}{nt}\sum_{j=1}^{nt} Z[y_j = y_k] - \bar{p}(y_k|x_1) \right| > \frac{\epsilon}{\sqrt[3]{t}} \,\middle|\, \epsilon \in \mathcal{S} \right\}
\overset{(a)}{\leq} \Pr\left\{ \left| \frac{1}{nt} W[y_k] - \bar{p}(y_k|x_1) \right| > \frac{\epsilon}{\sqrt[3]{t}} \,\middle|\, \epsilon \in \mathcal{S} \right\}
\overset{(b)}{\leq} 2\exp\left( \frac{-2 (nt)^2 t^{-2/3} \epsilon^2}{\sum_{1 \leq i \leq nt} (1-0)^2} \right) = 2 e^{-2 nt\, t^{-2/3} \epsilon^2} = 2 e^{-2 n t^{1/3} \epsilon^2}, \tag{42}$$
where $\mathcal{S}$ is defined in (30), $(a)$ follows from (35) (in which $n$ is replaced by $nt$), and $(b)$ follows from (37) with $a_i = 0$ and $b_i = 1$, since the Bernoulli random variables that constitute $W[y_k]$ take values 0 or 1. Inserting (42) into (40), we have the following upper bound
$$\Pr\big\{ \hat{y}_1^{nt} \notin B^{\epsilon} \,\big|\, \epsilon \in \mathcal{S} \big\} \leq 2|\mathcal{Y}| e^{-2 n t^{1/3} \epsilon^2}. \tag{43}$$
Inserting (39) and (43) into (31), and then inserting (31) into (29), we obtain the following upper bound on the error probability
$$P_e \leq 2|\mathcal{Y}| e^{-2n\epsilon^2} + 2|\mathcal{Y}| e^{-2n t^{1/3} \epsilon^2}, \tag{44}$$
where
$$\epsilon = \min_{i \neq j} \frac{\big\| P_{\hat{y}_i^{nt}} - \bar{P}_j \big\|_r}{\big(2 + t^{-1/3}\big) |\mathcal{Y}|^{1/r}}, \tag{45}$$
which is the value of $\epsilon$ that yields the tightest upper bound on the error probability $P_e$ in (44). This completes the proof of Theorem 1. □
The following corollary provides a simplified upper bound on the error probability when $t \to \infty$.
Corollary 1.
When the number of training vectors per label goes to infinity, i.e., when $t \to \infty$, which is equivalent to the case when the probability distribution $p(y^n|x)$ is known at the classifier, the error probability of the proposed classifier is upper bounded as
$$P_e \leq 2|\mathcal{Y}| e^{-2n\epsilon^2}, \tag{46}$$
where $\epsilon$ is given by
$$\epsilon = \min_{i \neq j} \frac{\big\| \bar{P}_i - \bar{P}_j \big\|_r}{2 |\mathcal{Y}|^{1/r}}. \tag{47}$$
Proof. 
The proof is straightforward. □
As can be seen from (8) and (11), the performance of the proposed classifier depends on $r$. We cannot derive the value of $r$ that minimizes the error probability, since we do not have the exact expression for the error probability; we only have its upper bound. On the other hand, in practice, the value of $r$ that is optimal with respect to the upper bound on the error probability also cannot be derived, since the upper bound depends on $\bar{P}_j$, which would be unknown in practice due to $p_{Y^n|X}(y^n|x)$ being unknown. As a result, for our numerical examples, we consider the Euclidean distance ($r = 2$), which is one of the most widely used distance metrics in practice.
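As an illustration of how the bound in (11)–(12) is evaluated in a simulation setting, where $\bar{P}_j$ is available because the generating distributions are known, the following Python sketch computes the right-hand side of (11) for given $n$, $t$, $r$, the training empirical distributions, and the averaged distributions $\bar{P}_j$. The function names and the toy numbers are ours and purely illustrative.

```python
import numpy as np

def minkowski(u, v, r):
    """Minkowski distance of order r between two vectors, cf. (3)."""
    return np.sum(np.abs(u - v) ** r) ** (1.0 / r)

def error_probability_bound(P_hat, P_bar, n, t, r):
    """Evaluate the upper bound (11) with epsilon from (12).

    P_hat[i]: empirical distribution of the concatenated training vectors of label i, cf. (6).
    P_bar[j]: averaged distribution bar{P}_j from (9)-(10), known in a simulation.
    """
    Y = P_bar.shape[1]                      # alphabet size |Y|
    eps = min(minkowski(P_hat[i], P_bar[j], r)
              for i in range(len(P_hat))
              for j in range(P_bar.shape[0]) if i != j)
    eps /= (2.0 + t ** (-1.0 / 3.0)) * Y ** (1.0 / r)
    return (2 * Y * np.exp(-2 * n * eps ** 2)
            + 2 * Y * np.exp(-2 * n * t ** (1.0 / 3.0) * eps ** 2))

# Made-up example: |X| = 2, |Y| = 3, n = 200, t = 1, r = 2.
P_bar = np.array([[0.6, 0.3, 0.1], [0.2, 0.3, 0.5]])
P_hat = np.array([[0.55, 0.33, 0.12], [0.22, 0.28, 0.50]])
print(error_probability_bound(P_hat, P_bar, n=200, t=1, r=2))
```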
The following corollary establishes the asymptotic optimality of the proposed classifier with respect to n.
Corollary 2.
The proposed classifier has an error probability that satisfies $P_e \to 0$ as $n \to \infty$ if $|\mathcal{Y}| \in O(n^m)$, where $m$ is fixed and $r > 2m$. Here, $n^m$ indicates the dimension of our space, i.e., the maximum number of alphabet values that each element of the feature vector $y^n$ can take. Thereby, the proposed classifier is asymptotically optimal.
Proof. 
For the proof, please see Appendix A. □

4. Simulation Results

In this section, we provide simulation results for the performance of the proposed classifier with $r = 2$ and compare it to benchmark schemes. The benchmark schemes that we adopt for comparison are the naive Bayes classifier and the KNN algorithm. We do not adopt a classifier based on a neural network, since neural networks require a very large training set, which we assume is not available. For the naive Bayes classifier, the probability distribution $p_{Y^n|X}(y^n|x)$ is estimated from the training vectors as follows. Let again $\hat{y}_i^{nt}$ be the vector obtained by concatenating all training feature vectors for the input label $x_i$, as in (5). Then, the estimate of $p(y_j = y|x_i)$, denoted by $\hat{p}(y_j = y|x_i)$, is found as
$$\hat{p}(y_j = y|x_i) = \frac{I[\hat{y}_i^{nt} = y]}{nt}, \tag{48}$$
and the naive Bayes classifier decides according to
$$\hat{x} = \arg\max_{x_i} \prod_{k=1}^{n} \hat{p}(y_k|x_i). \tag{49}$$
The main problem with the naive Bayes classifier occurs when an alphabet symbol $y_j \in \mathcal{Y}$ is not present in the training feature vectors. In that case, $\hat{p}(y_j|x_i)$ in (48) is $\hat{p}(y_j|x_i) = 0$, $\forall x_i \in \mathcal{X}$, and, as a result, the right-hand side of (49) is zero, since at least one of the factors in the product in (49) is zero. In this case, the naive Bayes classifier fails to provide an accurate classification of the labels. In what follows, we see that this issue of the naive Bayes classifier appears frequently when we have a small number of training feature vectors. The KNN classifier, on the other hand, works as follows. For the observed feature vector $y^n$, the KNN classifier finds the $k$ training feature vectors nearest to $y^n$ among all training feature vectors $\hat{y}_{is}^n$, for $1 \leq i \leq |\mathcal{X}|$ and $1 \leq s \leq t$. Then, considering the corresponding set of $k$ input–output pairs, the KNN classifier decides the label that is the most frequent among them. The optimal value of $k$ for $t = 1$ is $k = 1$.
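For concreteness, here is a minimal Python sketch of the two benchmark decision rules as described above (provided only for illustration; the names are ours, the KNN distance is Euclidean, and ties are broken by the first maximizer).

```python
import numpy as np

def naive_bayes_predict(y, train, alphabet_size):
    """Naive Bayes benchmark: estimate p_hat(y|x_i) as in (48) and apply (49).

    train[i] is a list of t training vectors (1-D integer arrays) for label i.
    """
    scores = []
    for vecs in train:
        concat = np.concatenate(vecs)
        p_hat = np.bincount(concat, minlength=alphabet_size) / len(concat)
        scores.append(np.prod(p_hat[y]))    # zero if y contains an unseen symbol
    return int(np.argmax(scores))

def knn_predict(y, train, k=1):
    """KNN benchmark: majority vote among the k training vectors closest to y."""
    dists, labels = [], []
    for label, vecs in enumerate(train):
        for v in vecs:
            dists.append(np.linalg.norm(y - v))
            labels.append(label)
    nearest = np.argsort(dists)[:k]
    votes = np.bincount([labels[i] for i in nearest])
    return int(np.argmax(votes))
```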
In the following, we provide numerical examples where we illustrate the performance of the proposed classifier when p Y n | X ( y n | x ) is artificially generated.

4.1. The I.I.D. Case with One Training Sample per Label

In the following examples, we assume that the classifiers have access to only one training feature vector for each label, the elements of the feature vectors are generated i.i.d., and the alphabet size of the feature vector, | Y | , is fixed.
In Figure 4 and Figure 5, we compare the error probability of the proposed classifier with those of the naive Bayes classifier and the KNN algorithm for the cases $|\mathcal{Y}| = 6$ and $|\mathcal{Y}| = 20$, respectively. In both examples, we have two different labels, i.e., $|\mathcal{X}| = 2$. As a result, we have two different probability distributions $p_{Y^n|X}(y^n|x_1)$ and $p_{Y^n|X}(y^n|x_2)$, which are randomly generated as follows. We first generate two random vectors of length 6 and length 20 for Figure 4 and Figure 5, respectively, where the elements of these vectors are drawn independently from a uniform probability distribution. We then normalize these vectors such that the sum of their elements is equal to one. These two normalized randomly generated vectors then represent the two probability distributions $p_{Y_i|X}(y_i|x_1) = p_{Y|X}(y|x_1)$ and $p_{Y_i|X}(y_i|x_2) = p_{Y|X}(y|x_2)$, $\forall i$. Then, $p_{Y^n|X}(y^n|x_k)$ is obtained as $p_{Y^n|X}(y^n|x_k) = \prod_{i=1}^{n} p_{Y_i|X}(y_i|x_k)$, for $k = 1, 2$. The simulation is carried out as follows. For each $n$, we generate one training vector for each label using the aforementioned probability distributions. Then, as test samples, we generate 1000 feature vectors for each label, pass these feature vectors through our proposed classifier, the naive Bayes classifier, and the KNN algorithm, and compute the errors. The length of the feature vector $n$ is varied from $n = 1$ to $n = 100$. We repeat the simulation 5000 times and then plot the error probability. Figure 4 and Figure 5 show that the proposed classifier outperforms both the naive Bayes classifier and KNN. The main reason for this performance gain is that, when only one training vector per label is available, the proposed classifier is more resilient to errors than the naive Bayes classifier, whereas the KNN algorithm performs very poorly because of the "curse of dimensionality". Specifically, the naive Bayes classifier cannot perform accurate classification when $n$ is small compared to $|\mathcal{Y}|$, since the chance that some alphabet symbol is not present in one of the training feature vectors is close to 1. On the other hand, the KNN algorithm cannot perform accurate classification for large $n$, since the dimension of the input feature vector becomes much larger than the amount of training data and the "curse of dimensionality" occurs.
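A condensed Python sketch of this i.i.d. experiment is given below (for illustration only: the repetition and test counts are reduced, whether the label distributions are redrawn in every repetition is our assumption, and the sketch reuses `nearest_empirical_distribution` from the sketch in Section 3.1).

```python
import numpy as np

rng = np.random.default_rng(1)

def run_iid_experiment(alphabet_size, n, n_test=1000, n_repeat=100, r=2):
    """Estimate the error probability of the nearest-empirical-distribution
    rule (8) with t = 1 training vector per label and i.i.d. elements."""
    errors, total = 0, 0
    for _ in range(n_repeat):
        # two random pmfs over the alphabet, one per label
        pmfs = rng.random((2, alphabet_size))
        pmfs /= pmfs.sum(axis=1, keepdims=True)
        train = [[rng.choice(alphabet_size, size=n, p=pmfs[x])] for x in (0, 1)]
        for x in (0, 1):
            tests = rng.choice(alphabet_size, size=(n_test, n), p=pmfs[x])
            for y in tests:
                errors += nearest_empirical_distribution(y, train, alphabet_size, r) != x
                total += 1
    return errors / total

# e.g., print(run_iid_experiment(alphabet_size=6, n=50))
```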
In Figure 6, we compare the performance of the proposed classifier for different values of $r$ when $|\mathcal{Y}| = 6$ with the derived upper bounds. As can be seen, for this example, the derived theoretical upper bounds have a similar slope to the exact error probabilities. Moreover, we can see that for this example the optimal value is $r = 1$. However, this is not always the case, and the optimal $r$ depends on $p_{Y^n|X}(y^n|x_k)$, $|\mathcal{Y}|$, and $|\mathcal{X}|$.

4.2. The Overlapping I.Non-I.D. Case with One Training Sample per Label

In this example, we consider the i.non-i.d. case where the probability distributions $p_i(y_i|x_k)$ are overlapping for all $i$, as shown in Figure 7. The small orthogonal lines on the x-axis in Figure 7 represent the alphabet symbols, i.e., the elements of $\mathcal{Y}$, and the probability of occurrence of a symbol $y_i$ is given by the value of the plotted probability distribution $p_i(y_i|x_k)$, $k = 1, 2$, at the corresponding line. By "overlapping", we mean the following. Let $\mathcal{Y}_v$ and $\mathcal{Y}_u$ denote the sets of outputs generated by $p_v(y_v|x_k)$ and $p_u(y_u|x_k)$, respectively. If, for any $v$ and $u$, $\mathcal{Y}_v \cap \mathcal{Y}_u \neq \emptyset$ holds, we say that the output alphabets are overlapping.
To demonstrate the performance of our proposed classifier in the overlapping case, we assume that we have two different labels, $\mathcal{X} = \{x_1, x_2\}$, where the corresponding conditional probability distributions $p_i(y_i|x_1)$ and $p_i(y_i|x_2)$ are obtained as follows. For a given $n$, let $\mathcal{Y} = \{-n, -n+1, \ldots, 0, \ldots, n-1, n\}$ be the set of all alphabet symbols. Note that the size of $\mathcal{Y}$ grows with $n$. Moreover, let $u_i$ and $v_i$ ($1 \leq i \leq n$) be vectors of length $2n + 1$, given by
$$u_i = \left[ 0, \ldots, 0, \tfrac{1}{i(i+1)}, \tfrac{2}{i(i+1)}, \ldots, \tfrac{i}{i(i+1)}, \tfrac{i+1}{i(i+1)}, \tfrac{i}{i(i+1)}, \ldots, \tfrac{1}{i(i+1)}, 0, \ldots, 0 \right],$$
$$v_i = \left[ 0, \ldots, 0, \tfrac{1}{i(i+1)}, \tfrac{1}{i(i+1)}, \ldots, \tfrac{1}{i(i+1)}, \tfrac{1}{i(i+1)}, 0, \ldots, 0 \right].$$
The number of zeros on each side of the vectors $u_i$ and $v_i$ is $(n - i)$. To generate a feature vector from label $x_1$ ($x_2$), we generate the vector $y^n = (y_1, y_2, \ldots, y_n)$, where the $i$-th element $y_i$ takes values from the set $\mathcal{Y}$ with probability distribution $p_i(y_i|x_1)$ ($p_i(y_i|x_2)$) given by the entry of $u_i$ ($v_i$) that corresponds to the alphabet symbol $y_i$.
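A sketch of how one might generate such overlapping i.non-i.d. feature vectors is shown below. It reflects our reading of the construction above, namely position-dependent triangular weights for $x_1$ and uniform weights for $x_2$ on a support of $2i+1$ symbols centered at $0$, normalized to sum to one; the exact weights of $u_i$ and $v_i$ in the paper may differ, and all names are ours.

```python
import numpy as np

rng = np.random.default_rng(2)

def overlapping_pmfs(n):
    """Per-position pmfs over Y = {-n, ..., n} (indexed 0..2n) for two labels.

    Position i uses a support of 2*i+1 symbols centered at 0 (n - i zeros on
    each side): triangular weights for label x1, uniform weights for label x2
    (our interpretation, normalized to sum to one)."""
    P = np.zeros((2, n, 2 * n + 1))
    for i in range(1, n + 1):
        support = np.arange(n - i, n + i + 1)
        tri = np.concatenate([np.arange(1, i + 2), np.arange(i, 0, -1)])
        P[0, i - 1, support] = tri / tri.sum()        # label x1: triangular
        P[1, i - 1, support] = 1.0 / len(support)     # label x2: uniform
    return P

def sample(P, label, rng):
    n = P.shape[1]
    return np.array([rng.choice(P.shape[2], p=P[label, i]) for i in range(n)])

P = overlapping_pmfs(n=10)
y = sample(P, label=0, rng=rng)   # one feature vector from label x1
```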
The simulation is carried out as follows. For each n, we generate one training feature vector for each label. Then, we generate 1000 feature vectors for each label and pass them through our proposed classifier, the naive Bayes classifier, and the KNN algorithm and calculate the error probability. We change the length of the feature vector from n = 1 to n = 100 and repeat the simulation 1000 times and then plot the error probability.
As shown in Figure 8, there is a huge difference between the performance of the two benchmark classifiers and the proposed classifier. The error probability of the naive Bayes classifier is almost 0.5 for all shown values of n as it is susceptible to the problem of unseen alphabets in the training vectors. The error probability of the KNN classifier is also almost 0.5 for n > 20 as it is susceptible to the “curse of dimensionality”. However, the error probability of our proposed classifier continuously decays as n increases.
In Figure 9, we run the same experiments as in Figure 8 but with T = 100 , i.e., 100 training feature vectors per label. As can be seen from Figure 9, the performance of the proposed classifier is better than the naive Bayes classifier, for n > 15 . Since | Y | = 2 n + 1 , for small values of n, the naive Bayes classifier has access to many training samples and, thereby, its performance is very close to the case when the probability distribution p Y n | X ( y n | x ) is known, i.e., to the maximum-likelihood classifier, and hence it has the optimal performance. As n increases, the number of alphabets rises, i.e., | Y | rises, and due to the aforementioned issue of the naive Bayes classifier with unseen alphabets, our proposed classifier performs much better classification than the naive Bayes classifier. Furthermore, note that the error probability of our proposed classifier decays exponentially as n increases which is not the case with the naive Bayes classifier. Moreover, Figure 9 also shows the theoretical upper bound on the error probability we derived in (11).

4.3. The Non-Overlapping I.Non-I.D. Case with One Training Sample for Each Label

In this example, we consider the i.non-i.d. case where the probability distributions p j ( y j | x i ) are non-overlapping for all j as shown in Figure 10, where we defined “overlapping” in Section 4.2. Hence, we test the other extreme in terms of possible distribution of the elements in the feature vectors Y n .
To demonstrate the performance of our proposed classifier in the non-overlapping case, we assume that we have two different labels, $\mathcal{X} = \{x_1, x_2\}$, where the corresponding conditional probability distributions $p_i(y_i|x_1)$ and $p_i(y_i|x_2)$ are obtained as follows. For a given $n$, let $\mathcal{Y} = \{1, 2, 3, \ldots, (n+1)^2 - 1\}$ be the set of all alphabet symbols of the elements in the feature vectors. Note again that the size of $\mathcal{Y}$ grows with $n$. In addition, let $u_i$ and $v_i$, for $1 \leq i \leq n$, be vectors of length $(n+1)^2 - 1$, given by
$$u_i = \left[ 0, \ldots, 0, \tfrac{1}{i(i+1)}, \tfrac{2}{i(i+1)}, \ldots, \tfrac{i}{i(i+1)}, \tfrac{i+1}{i(i+1)}, \tfrac{i}{i(i+1)}, \ldots, \tfrac{1}{i(i+1)}, 0, \ldots, 0 \right],$$
$$v_i = \left[ 0, \ldots, 0, \tfrac{1}{i(i+1)}, \tfrac{1}{i(i+1)}, \ldots, \tfrac{1}{i(i+1)}, \tfrac{1}{i(i+1)}, 0, \ldots, 0 \right].$$
The number of zeros on the left-hand side of $u_i$ and $v_i$ is $i^2 - 1$. To generate a feature vector from label $x_1$ ($x_2$), we generate the vector $y^n = (y_1, y_2, \ldots, y_n)$, where $y_i$ takes values from the set $\mathcal{Y}$ with probability distribution $p_i(y_i|x_1) = u_i(y_i)$ ($p_i(y_i|x_2) = v_i(y_i)$), where $u_i(y_i)$ denotes the $y_i$-th entry of $u_i$.
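Under the same interpretation as the sketch in Section 4.2 (triangular versus uniform weights, normalized to sum to one; our assumption), the non-overlapping per-position distributions can be generated by placing the support of position $i$ on the disjoint block $\{i^2, \ldots, (i+1)^2 - 1\}$, as in the short sketch below.

```python
import numpy as np

def non_overlapping_pmfs(n):
    """Per-position pmfs over Y = {1, ..., (n+1)^2 - 1}, indexed from 0.

    Position i has support {i^2, ..., (i+1)^2 - 1} (i^2 - 1 zeros to the left),
    so the supports of different positions are disjoint."""
    size = (n + 1) ** 2 - 1
    P = np.zeros((2, n, size))
    for i in range(1, n + 1):
        support = np.arange(i ** 2 - 1, i ** 2 + 2 * i)   # 2i+1 consecutive slots
        tri = np.concatenate([np.arange(1, i + 2), np.arange(i, 0, -1)])
        P[0, i - 1, support] = tri / tri.sum()            # label x1
        P[1, i - 1, support] = 1.0 / len(support)         # label x2
    return P
```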
The simulation is carried out as follows. For each $n$, we generate one training feature vector for each label. Then we generate 250 feature vectors for each label, pass them through our proposed classifier, the naive Bayes classifier, and KNN, and calculate the error probabilities. We change the length of the vector from 1 to 80, repeat the simulation 250 times, and then plot the error probability. As shown in Figure 11, there is a huge difference between the performance of the proposed classifier and the two benchmark classifiers. The error probability of the naive Bayes classifier is almost 0.5 for all shown values of $n$, as it is susceptible to the issue of unseen alphabet symbols in the training feature vector. The error probability of the KNN classifier is almost 0.5 for all shown values of $n > 30$, as it becomes susceptible to the "curse of dimensionality". However, the error probability of our proposed classifier still decays continuously as $n$ increases.
Note that, in our numerical examples, we compared our algorithm with the benchmark schemes on two extreme cases of i.non-i.d. vectors, referred to as “overlapping” and “non-overlapping”. Any other i.non-i.d. vector can be represented as a combination of the “overlapping” and “non-overlapping” vectors. Since our algorithm works better than the benchmark schemes for small t on both these cases, it will work better than the benchmark schemes on any combination between “overlapping” and “non-overlapping” vectors, i.e., for any other i.non-i.d. vectors.

5. Conclusions

In this paper, we proposed a supervised classification algorithm that assigns labels to input feature vectors with independent but non-identically distributed elements, a statistical property found in practice. We proved that the proposed classifier is asymptotically optimal, since its error probability tends to zero as the length of the input feature vectors grows. We showed that this asymptotic optimality is achievable even when only one training feature vector per label is available. In the numerical examples, we compared the proposed classifier with the naive Bayes classifier and the KNN algorithm. Our numerical results show that the proposed classifier outperforms the benchmark classifiers when the number of training data is small and the length of the input feature vectors is sufficiently large.

Author Contributions

Methodology, F.S. and N.Z.; software, F.S.; formal analysis, F.S. and N.Z.; investigation, F.S.; supervision, N.Z.; writing—original draft preparation, F.S.; writing—review and editing, N.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data sharing is not applicable to this article as no new data were created or analyzed in this study.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Proof of Corollary 2

The proof is almost identical to the proof of Theorem 1; however, here we derive a looser upper bound on the error probability than that in (11), which is independent of $P_{\hat{y}_i^{nt}}$.
Without loss of generality, we assume that $x_1$ is the input to $p_{Y^n|X}(y^n|x)$ and that $y^n$ is observed at the classifier.
Let $B_{k,l}^{\epsilon}$, for $1 \leq k \leq |\mathcal{Y}|$ and $1 \leq l \leq |\mathcal{X}|$, be a set defined as
$$B_{k,l}^{\epsilon} = \left\{ \hat{y}^{nt} : \left| \frac{I[\hat{y}^{nt} = y_k]}{nt} - \bar{p}(y_k|x_l) \right| \leq \frac{\epsilon}{\sqrt[3]{t}} \right\}. \tag{A1}$$
Let $B_l^{\epsilon} = \bigcap_{k=1}^{|\mathcal{Y}|} B_{k,l}^{\epsilon}$. For $\hat{y}_1^{nt} \in B_1^{\epsilon}$, we have
$$\left( \sum_{k=1}^{|\mathcal{Y}|} \left| \frac{I[\hat{y}_1^{nt} = y_k]}{nt} - \bar{p}(y_k|x_1) \right|^r \right)^{1/r} \overset{(a)}{\leq} \left( \sum_{k=1}^{|\mathcal{Y}|} \left(\frac{\epsilon}{\sqrt[3]{t}}\right)^r \right)^{1/r}, \tag{A2}$$
where $(a)$ follows from (A1). Using the same derivation as in (18), for any $y^n \in A^{\epsilon}$ and for $\hat{y}_1^{nt} \in B_1^{\epsilon}$, we have
$$\left( \sum_{k=1}^{|\mathcal{Y}|} \left| \frac{I[y^n = y_k]}{n} - \frac{I[\hat{y}_1^{nt} = y_k]}{nt} \right|^r \right)^{1/r} \leq |\mathcal{Y}|^{1/r} \epsilon + |\mathcal{Y}|^{1/r} \frac{\epsilon}{\sqrt[3]{t}}. \tag{A3}$$
On the other hand, in the same way as in the derivation of (21), for each $i \neq 1$, we have
$$\left( \sum_{k=1}^{|\mathcal{Y}|} \left| \frac{I[y^n = y_k]}{n} - \frac{I[\hat{y}_i^{nt} = y_k]}{nt} \right|^r \right)^{1/r} \geq \big\| P_{\hat{y}_i^{nt}} - \bar{P}_1 \big\|_r - |\mathcal{Y}|^{1/r} \epsilon. \tag{A4}$$
Now, for any $\hat{y}_i^{nt} \in B_i^{\epsilon}$, we have
$$\big\| P_{\hat{y}_i^{nt}} - \bar{P}_1 \big\|_r + \left( \sum_{k=1}^{|\mathcal{Y}|} \left(\frac{\epsilon}{\sqrt[3]{t}}\right)^r \right)^{1/r} \overset{(a)}{\geq} \left( \sum_{k=1}^{|\mathcal{Y}|} \left| \frac{I[\hat{y}_i^{nt} = y_k]}{nt} - \bar{p}(y_k|x_1) \right|^r \right)^{1/r} + \left( \sum_{k=1}^{|\mathcal{Y}|} \left| \frac{I[\hat{y}_i^{nt} = y_k]}{nt} - \bar{p}(y_k|x_i) \right|^r \right)^{1/r} \overset{(b)}{\geq} \left( \sum_{k=1}^{|\mathcal{Y}|} \big| \bar{p}(y_k|x_1) - \bar{p}(y_k|x_i) \big|^r \right)^{1/r}, \tag{A5}$$
where $(a)$ follows from (A1) and $(b)$ is again due to the Minkowski inequality. The expression in (A5) can be written equivalently as
$$\big\| P_{\hat{y}_i^{nt}} - \bar{P}_1 \big\|_r \geq \big\| \bar{P}_i - \bar{P}_1 \big\|_r - |\mathcal{Y}|^{1/r} \frac{\epsilon}{\sqrt[3]{t}}, \tag{A6}$$
where $i \neq 1$. Using the bounds in (A6) and (A4), for any $i \neq 1$ we have
$$\left( \sum_{k=1}^{|\mathcal{Y}|} \left| \frac{I[y^n = y_k]}{n} - \frac{I[\hat{y}_i^{nt} = y_k]}{nt} \right|^r \right)^{1/r} \geq \big\| \bar{P}_i - \bar{P}_1 \big\|_r - |\mathcal{Y}|^{1/r} \epsilon \left( 1 + \frac{1}{\sqrt[3]{t}} \right). \tag{A7}$$
Using the bounds in (A3) and (A7), we now relate the left-hand sides of (A3) and (A7) as follows. As long as the following inequality holds for each $i \neq 1$,
$$|\mathcal{Y}|^{1/r} \epsilon \left( 1 + \frac{1}{\sqrt[3]{t}} \right) < \big\| \bar{P}_i - \bar{P}_1 \big\|_r - |\mathcal{Y}|^{1/r} \epsilon \left( 1 + \frac{1}{\sqrt[3]{t}} \right), \tag{A8}$$
which is equivalent to the following for $i \neq 1$
$$\epsilon < \frac{\big\| \bar{P}_i - \bar{P}_1 \big\|_r}{2\big(1 + t^{-1/3}\big) |\mathcal{Y}|^{1/r}}, \tag{A9}$$
we have the following for $i \neq 1$
$$\left( \sum_{k=1}^{|\mathcal{Y}|} \left| \frac{I[y^n = y_k]}{n} - \frac{I[\hat{y}_1^{nt} = y_k]}{nt} \right|^r \right)^{1/r} \overset{(a)}{\leq} |\mathcal{Y}|^{1/r} \epsilon \left( 1 + \frac{1}{\sqrt[3]{t}} \right) \overset{(b)}{<} \big\| \bar{P}_i - \bar{P}_1 \big\|_r - |\mathcal{Y}|^{1/r} \epsilon \left( 1 + \frac{1}{\sqrt[3]{t}} \right) \overset{(c)}{\leq} \left( \sum_{k=1}^{|\mathcal{Y}|} \left| \frac{I[y^n = y_k]}{n} - \frac{I[\hat{y}_i^{nt} = y_k]}{nt} \right|^r \right)^{1/r}, \tag{A10}$$
where $(a)$, $(b)$, and $(c)$ follow from (A3), (A8), and (A7), respectively. Thereby, from (A10), we have the following for $i \neq 1$
$$\left( \sum_{k=1}^{|\mathcal{Y}|} \left| \frac{I[y^n = y_k]}{n} - \frac{I[\hat{y}_1^{nt} = y_k]}{nt} \right|^r \right)^{1/r} < \left( \sum_{k=1}^{|\mathcal{Y}|} \left| \frac{I[y^n = y_k]}{n} - \frac{I[\hat{y}_i^{nt} = y_k]}{nt} \right|^r \right)^{1/r}, \tag{A11}$$
or, equivalently,
$$\big\| P_{y^n} - P_{\hat{y}_1^{nt}} \big\|_r < \big\| P_{y^n} - P_{\hat{y}_i^{nt}} \big\|_r. \tag{A12}$$
Once again, we have obtained the following: if there is an $\epsilon$ for which (A9) holds for all $i \neq 1$, and for that $\epsilon$ we have $y^n \in A^{\epsilon}$ and $\hat{y}_l^{nt} \in B_l^{\epsilon}$ for all $1 \leq l \leq |\mathcal{X}|$, then (A12) holds for all $i \neq 1$, and thereby our classifier detects $x_1$ as the correct label. Using this, we can upper bound the error probability as
$$P_e = 1 - \Pr\big\{ \hat{x} = x_1 \big\} \leq 1 - \Pr\Big\{ y^n \in A^{\epsilon} \cap \bigcap_{l=1}^{|\mathcal{X}|} \hat{y}_l^{nt} \in B_l^{\epsilon} \,\Big|\, \epsilon \in \mathcal{S} \Big\}, \tag{A13}$$
where $\mathcal{S}$ is a set defined as
$$\mathcal{S} = \left\{ \epsilon : \epsilon \leq \min_{i \neq 1} \frac{\big\| \bar{P}_i - \bar{P}_1 \big\|_r}{2\big(1 + t^{-1/3}\big) |\mathcal{Y}|^{1/r}} \right\}. \tag{A14}$$
The right-hand side of (A13) can be upper bounded as
$$1 - \Pr\Big\{ y^n \in A^{\epsilon} \cap \bigcap_{l=1}^{|\mathcal{X}|} \hat{y}_l^{nt} \in B_l^{\epsilon} \,\Big|\, \epsilon \in \mathcal{S} \Big\} = \Pr\Big\{ y^n \notin A^{\epsilon} \cup \bigcup_{l=1}^{|\mathcal{X}|} \hat{y}_l^{nt} \notin B_l^{\epsilon} \,\Big|\, \epsilon \in \mathcal{S} \Big\} \overset{(a)}{\leq} \Pr\big\{ y^n \notin A^{\epsilon} \,\big|\, \epsilon \in \mathcal{S} \big\} + \sum_{l=1}^{|\mathcal{X}|} \Pr\big\{ \hat{y}_l^{nt} \notin B_l^{\epsilon} \,\big|\, \epsilon \in \mathcal{S} \big\}, \tag{A15}$$
where $(a)$ follows from Boole's inequality.
Using the same derivation as in (39), we have
$$\Pr\big\{ y^n \notin A^{\epsilon} \,\big|\, \epsilon \in \mathcal{S} \big\} \leq 2|\mathcal{Y}| e^{-2n\epsilon^2}. \tag{A16}$$
Similarly, following the derivation of (43), we have the following result for the second expression on the right-hand side of (A15)
$$\Pr\big\{ \hat{y}_l^{nt} \notin B_l^{\epsilon} \,\big|\, \epsilon \in \mathcal{S} \big\} \leq 2|\mathcal{Y}| e^{-2n t^{1/3} \epsilon^2}. \tag{A17}$$
Inserting (A16) and (A17) into (A15), and then inserting (A15) into (A13), we obtain the following upper bound on the error probability
$$P_e \leq 2|\mathcal{Y}| e^{-2n\epsilon^2} + 2|\mathcal{X}||\mathcal{Y}| e^{-2n t^{1/3} \epsilon^2}, \tag{A18}$$
where
$$\epsilon = \min_{i \neq j} \frac{\big\| \bar{P}_i - \bar{P}_j \big\|_r}{2\big(1 + t^{-1/3}\big) |\mathcal{Y}|^{1/r}}. \tag{A19}$$
Now, if $|\mathcal{Y}| \leq n^m$, (A18) can be written as
$$P_e \leq 2|\mathcal{Y}| e^{-2n\epsilon^2} + 2|\mathcal{X}||\mathcal{Y}| e^{-2n t^{1/3} \epsilon^2} \leq 2 n^m \exp\left( \frac{-2n \min_{i \neq j} \big\| \bar{P}_i - \bar{P}_j \big\|_r^2}{\big(2(1 + t^{-1/3})\big)^2 n^{2m/r}} \right) + 2|\mathcal{X}| n^m \exp\left( \frac{-2n t^{1/3} \min_{i \neq j} \big\| \bar{P}_i - \bar{P}_j \big\|_r^2}{\big(2(1 + t^{-1/3})\big)^2 n^{2m/r}} \right) \in O\left( n^m \exp\left( -n^{1 - \frac{2m}{r}} \right) \right). \tag{A20}$$
According to (A20), for a fixed r > 2 m , the right-hand side of (A20) moves to zero as n and, thereby, the classifier is asymptotically optimal.

References

1. Shalev-Shwartz, S.; Ben-David, S. Understanding Machine Learning: From Theory to Algorithms; Cambridge University Press: Cambridge, MA, USA, 2014.
2. Quinlan, J.R. Learning Efficient Classification Procedures and Their Application to Chess End Games; Springer: Berlin/Heidelberg, Germany, 1983; pp. 453–482.
3. Breiman, L.; Friedman, J.H.; Olshen, R.A.; Stone, C.J. Classification and Regression Trees; Wadsworth and Brooks: Monterey, CA, USA, 1984.
4. Boser, B.E.; Guyon, I.M.; Vapnik, V.N. A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, San Jose, CA, USA, 27–29 July 1992; pp. 144–152.
5. Cortes, C.; Vapnik, V. Support-vector networks. Mach. Learn. 1995, 20, 273–297.
6. Lallich, S.; Teytaud, O.; Prudhomme, E. Association Rule Interestingness: Measure and Statistical Validation; Studies in Computational Intelligence; Springer: Berlin, Germany, 2007; Volume 43, pp. 251–275.
7. Langley, P.; Iba, W.; Thompson, K. An analysis of Bayesian classifiers. In Proceedings of the Tenth National Conference on Artificial Intelligence, San Jose, CA, USA, 14–16 July 1992; AAAI Press: San Jose, CA, USA, 1992; pp. 223–228.
8. Aha, D.W. Lazy learning. In Artificial Intelligence Review; Kluwer Academic Publishers: Cambridge, MA, USA, 1997; pp. 7–10.
9. Bishop, C.M. Neural Networks for Pattern Recognition; Oxford University Press: Oxford, UK, 1995.
10. Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning: Data Mining; Springer Series in Statistics; Springer: New York, NY, USA, 2009.
11. Kanaya, F.; Han, T.S. The asymptotics of posterior entropy and error probability for Bayesian estimation. IEEE Trans. Inf. Theory 1995, 41, 1988–1992.
12. Gutman, M. Asymptotically optimal classification for multiple tests with empirically observed statistics. IEEE Trans. Inf. Theory 1989, 35, 401–408.
13. Sason, I.; Verdú, S. Arimoto–Rényi conditional entropy and Bayesian hypothesis testing. IEEE Trans. Inf. Theory 2018, 64, 4–25.
14. Wang, C.; She, Z.; Cao, L. Coupled attribute analysis on numerical data. In Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence, Beijing, China, 2–9 August 2013; AAAI Press: Beijing, China, 2013; pp. 1736–1742.
15. Wang, C.; Dong, X.; Zhou, F.; Cao, L.; Chi, C.H. Coupled attribute similarity learning on categorical data. IEEE Trans. Neural Netw. Learn. Syst. 2015, 26, 781–797.
16. Wang, C.; Cao, L.; Wang, M.; Li, J.; Wei, W.; Ou, Y. Coupled nominal similarity in unsupervised learning. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management, Glasgow, Scotland, UK, 24–28 October 2011; pp. 973–978.
17. Liu, C.; Cao, L. A coupled k-nearest neighbor algorithm for multi-label classification. In PAKDD; Springer International Publishing: Cham, Switzerland, 2015.
18. Liu, C.; Cao, L.; Yu, P. A hybrid coupled k-nearest neighbor algorithm on imbalance data. In Proceedings of the International Joint Conference on Neural Networks, Beijing, China, 6–11 July 2014; pp. 2011–2018.
19. Liu, C.; Cao, L.; Yu, P.S. Coupled fuzzy k-nearest neighbors classification of imbalanced non-iid categorical data. In Proceedings of the 2014 International Joint Conference on Neural Networks (IJCNN), Beijing, China, 8–13 July 2014; pp. 1122–1129.
20. Cao, L. Non-IIDness learning in behavioral and social data. Comput. J. 2013, 57, 1358–1370.
21. Cao, L. Coupling learning of complex interactions. Inf. Process. Manag. 2015, 51, 167–186.
22. Cao, L.; Zhang, H.; Zhao, Y.; Luo, D.; Zhang, C. Combined mining: Discovering informative knowledge in complex data. IEEE Trans. Syst. Man Cybern. Part B (Cybern.) 2011, 41, 699–712.
23. Getoor, L.; Taskar, B. Introduction to Statistical Relational Learning (Adaptive Computation and Machine Learning); The MIT Press: Cambridge, MA, USA, 2007.
24. Abed-Meraim, K.; Loubaton, P.; Moulines, E. A subspace algorithm for certain blind identification problems. IEEE Trans. Inf. Theory 2006, 43, 499–511.
25. Almeida, L. Linear and nonlinear ICA based on mutual information—the MISEP method. IEEE Trans. Signal Process. 2004, 84, 231–245.
26. Hyvärinen, A.; Oja, E. Independent component analysis: Algorithms and applications. Neural Netw. 2000, 13, 411–430.
27. Yeredor, A. Independent component analysis over Galois fields of prime order. IEEE Trans. Inf. Theory 2011, 57, 5342–5359.
28. Nguyen, H.; Zheng, R. Binary independent component analysis with OR mixtures. IEEE Trans. Signal Process. 2011, 59, 3168–3181.
29. Barlow, H.B. Unsupervised learning. Neural Comput. 1989, 1, 295–311.
30. Nakhaeizadeh, G.; Taylor, C.C.; Kunisch, G. Dynamic supervised learning: Some basic issues and application aspects. In Classification and Knowledge Organization; Springer: Berlin/Heidelberg, Germany, 1997; pp. 123–135.
31. Koppen, M. The Curse of Dimensionality; SAGE: London, UK, 2000; pp. 4–8.
32. Duda, R.O.; Hart, P.E.; Stork, D.G. Pattern Classification, 2nd ed.; Wiley-Interscience: Hoboken, NJ, USA, 2000.
33. Jain, A.K.; Duin, R.P.W.; Mao, J. Statistical pattern recognition: A review. IEEE Trans. Pattern Anal. Mach. Intell. 2000, 22, 4–37.
34. Xu, B.; Huang, J.; Williams, G.; Wang, Q.; Ye, Y. Classifying very high-dimensional data with random forests built from small subspaces. Int. J. Data Warehous. Min. 2012, 8, 44–63.
35. Ye, Y.; Wu, Q.; Huang, J.Z.; Ng, M.K.; Li, X. Stratified sampling for feature subspace selection in random forests for high dimensional data. Pattern Recognit. 2013, 46, 769–787.
36. Kouiroukidis, N.; Evangelidis, G. The effects of dimensionality curse in high dimensional kNN search. In Proceedings of the 15th Panhellenic Conference on Informatics, Kastoria, Greece, 30 September–2 October 2011; pp. 41–45.
37. Matusita, K. Classification based on distance in multivariate Gaussian cases. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Statistics; University of California Press: Regents, CA, USA, 1967; pp. 299–304.
38. Collins, M.; Schapire, R.E.; Singer, Y. Logistic regression, AdaBoost and Bregman distances. Mach. Learn. 2002, 48, 253–285.
39. Deisenroth, M.P.; Faisal, A.A.; Ong, C.S. Mathematics for Machine Learning; Cambridge University Press: Warsaw, Poland, 2020.
40. Lebanon, G.; Lafferty, J. Cranking: Combining rankings using conditional probability models on permutations. In Proceedings of the 19th International Conference on Machine Learning, San Francisco, CA, USA, 8–12 July 2002; pp. 363–370.
41. Hoeffding, W. On the distribution of the number of successes in independent trials. Ann. Math. Stat. 1956, 27, 713–721.
42. Hoeffding, W. Probability inequalities for sums of bounded random variables. J. Am. Stat. Assoc. 1963, 58, 13–30.
Figure 1. A typical structural modelling of the classification learning problem.
Figure 2. An alternative modeling of the classification learning problem.
Figure 3. An alternative modelling of the classification learning problem when the elements of $Y^n(X)$ are mutually independent but non-identically distributed (i.non-i.d.).
Figure 4. Comparison in error probability between the naive Bayes classifier, KNN, and the proposed classifier.
Figure 5. Comparison in error probability between the naive Bayes classifier, KNN, and the proposed classifier.
Figure 6. Comparison in error probability of the proposed classifier for different values of $r$ when $|\mathcal{Y}| = 6$. The related theoretical upper bounds for each value of $r$ are also given.
Figure 7. Illustration of the probability distributions $p_i(y_i|x_1)$ (upper figure) and $p_i(y_i|x_2)$ (lower figure), for $i = 1, 2, \ldots, n$.
Figure 8. Comparison in error probability between the naive Bayes classifier, KNN, and the proposed classifier ($T = 1$).
Figure 9. Comparison in error probability between the naive Bayes classifier and the proposed classifier ($T = 100$).
Figure 10. Illustration of the probability distributions $p_i(y_i|x_1)$ (upper figure) and $p_i(y_i|x_2)$ (lower figure), for $i = 1, 2, \ldots, n$.
Figure 11. Comparison in error probability between the naive Bayes classifier and the proposed classifier ($T = 1$).

