Article

On Supervised Classification of Feature Vectors with Independent and Non-Identically Distributed Elements

Electrical and Computer Systems Engineering, Monash University, Alliance Ln, Clayton, VIC 3168, Australia
*
Author to whom correspondence should be addressed.
Submission received: 26 July 2021 / Revised: 6 August 2021 / Accepted: 10 August 2021 / Published: 13 August 2021
(This article belongs to the Special Issue Information Theory and Machine Learning)

Abstract
In this paper, we investigate the problem of classifying feature vectors with mutually independent but non-identically distributed elements that take values from a finite alphabet set. First, we show the importance of this problem. Next, we propose a classifier and derive an analytical upper bound on its error probability. We show that the error probability tends to zero as the length of the feature vectors grows, even when only one training feature vector per label is available. Thereby, we show that for this important problem at least one asymptotically optimal classifier exists. Finally, we provide numerical examples showing that the proposed classifier outperforms conventional classification algorithms when the number of training data is small and the length of the feature vectors is sufficiently large.

1. Introduction

1.1. Background

Supervised classification is a machine learning technique that maps an input feature vector to an output label based on a set of correctly labeled training data. There is no single learning algorithm that works best on all supervised learning problems, as shown by the no free lunch theorem in [1]. As a result, there are many algorithms proposed in the literature whose performance depends on the underlying problem and the amount of training data available. The most widely used algorithms in the literature are decision trees [2,3], Support Vector Machines (SVM) [4,5], Rule-Based Systems [6], naive Bayes classifiers [7], k-nearest neighbors (KNN) [8], logistic regressions, and neural networks [9,10].

1.2. Motivation

In the following, we discuss the motivation for this work.

1.2.1. Lack of Tight Upper Bounds on the Performance of Classifiers

In general, there are no tight upper bounds on the performance of the classifiers used in practice. Many previous works only provide experimental performance results. However, this approach has drawbacks. For example, one has to rely on trial and error in order to develop a good classifier for a given problem, which undermines reliability. Moreover, algorithms whose performance has been verified only experimentally may work for a given problem, but may fail when applied to a similar problem. Finally, experimental results do not provide intuition into the underlying problem, whereas analytical results provide an understanding of the underlying problem and of the corresponding solutions.
Motivated by this, in this paper, we aim to investigate classifiers with analytical upper bounds on their performance.

1.2.2. Independent and Non-Identically Distributed Features

In general, we can categorize the statistical properties of the feature vectors, which are the input to the classifier, into three types. To this end, let $Y^n(X) = [Y_1(X), Y_2(X), \ldots, Y_n(X)]$ denote the input feature vector to the supervised classifier, where $n$ is the length of the feature vector and $X$ is the label to which the feature vector $Y^n(X)$ belongs. Then, we can distinguish the following three types of feature vectors depending on the statistics of the elements in the feature vector $Y^n(X)$.
The first type of feature vector is when the elements of $Y^n(X)$ are independent and identically distributed (i.i.d.). This is the simplest features model, but also the least applicable in practice. This model is identical to hypothesis testing, which has been well investigated in the literature [11,12,13]. As a result, tight upper bounds on the performance of supervised learning algorithms for this type of feature vector are available in the hypothesis testing literature. For instance, the authors in [11] showed that the posterior entropy and the maximum a posteriori error probability decay to zero with the length of the feature vector at the identical exponential rate, where the maximum achievable exponent is the minimum Chernoff information. In [12], the authors determine the requirements on the length of the vector $Y^n(X)$ and the number of labels $m$ in order to achieve a vanishing exponential error probability in testing $m$ hypotheses while minimizing the rejection zone. In [13], the authors provide an upper bound and a lower bound on the error probability of Bayesian $m$-ary hypothesis testing in terms of conditional entropy.
The second type of feature vector is when the elements of $Y^n(X)$ are mutually dependent and non-identically distributed (d.non-i.d.). This is the most general features model and the most applicable in practice. However, it is also the most difficult to tackle analytically. As a result, supervised learning algorithms proposed for this features model lack tight analytical upper bounds on their performance [14,15,16,17,18,19,20,21,22,23]. This is because there is no framework that produces closed-form results when deriving statistics of vectors with d.non-i.d. elements when the underlying distributions are unknown. How, then, can we analytically investigate classifiers for practical scenarios in which the feature vectors have d.non-i.d. elements? A possible approach leads us to the third type of feature vector, explained in the following.
The third type of feature vector is when the elements of $Y^n(X)$ are mutually independent but non-identically distributed (i.non-i.d.). This features model is much simpler than the d.non-i.d. features model and, more importantly, it is analytically tractable, as we show in this paper. Furthermore, this features model is applicable in practice. Specifically, there exists a class of algorithms, known as Independent Component Analysis (ICA), that transforms vectors with d.non-i.d. elements into vectors with i.non-i.d. elements with zero or negligible loss of information [24,25,26,27,28]. The origins of ICA can be traced back to Barlow [29], who argued that a good representation of binary data can be achieved by an invertible transformation that transforms vectors with d.non-i.d. elements into vectors with i.non-i.d. elements. Finding such a transformation with no prior information about the distribution of the data had been considered an open problem until recently [28]. Specifically, the authors in [28] show that this hard problem can be accurately solved with a branch-and-bound search tree algorithm, or tightly approximated with a series of linear problems. Thereby, the authors in [28] provide the first efficient set of solutions to Barlow's problem. So far, the complexity of the fastest such algorithm is $O(n \times 2^n)$ [28]. Nevertheless, since there exist such invertible transformations (i.e., with no loss of information) that can transform vectors with d.non-i.d. elements into vectors with i.non-i.d. elements, we can tackle the features model comprised of d.non-i.d. elements by first transforming it (without loss of information) into the features model comprised of i.non-i.d. elements and then tackling the i.non-i.d. features model.
Motivated by this, in this paper, we investigate supervised classification of feature vectors with i.non-i.d. elements.

1.2.3. Small Training Set

The main factor that impacts the accuracy of supervised classification is the amount of training data. In fact, most supervised algorithms are able to learn only if there is a very large set of training data available [30]. The main reason for this is the curse of dimensionality [31,32], which states that “the higher the dimensionality of the feature vectors, the more training data are needed for the supervised classifier” [33]. For example, supervised classification methods such as random forest [34,35] and KNN [36] suffer from the curse of dimensionality. However, having large training data sets is not always possible in practice. As a result, designing a supervised classification algorithm that exhibits good performance even when the training data set is extremely small is important.
Motivated by this, in this paper, we investigate supervised classifiers for the case when t training feature vectors per label are available, where t = 1 , 2 , . . .

1.3. Contributions

In this paper, we propose an algorithm for supervised classification of feature vectors with i.non-i.d. elements when the number of training feature vectors per label is $t$, where $t = 1, 2, \ldots$ Next, we derive an upper bound on the error probability of the proposed classifier for uniformly distributed labels and prove that the error probability decays exponentially to zero as the length of the feature vector, $n$, grows, even when only one training vector per label is available, i.e., when $t = 1$. Hence, the proposed classification algorithm provides asymptotically optimal performance even when the number of training vectors per label is extremely small. We compare the performance of the proposed classifier with the naive Bayes classifier and the KNN algorithm. Our numerical results show that the proposed classifier significantly outperforms the naive Bayes classifier and the KNN algorithm when the number of training feature vectors per label is small and the length of the feature vectors $n$ is sufficiently large.
The proposed algorithm is a form of the nearest neighbor classification algorithm, where the nearest neighbor is searched for in the domain of empirical distributions. As a result, we refer to the algorithm as the nearest empirical distribution classifier. The nearest empirical distribution algorithm is not new and, to the best of our knowledge, was first proposed in [37] for the case when the elements of $Y^n(X)$ are i.i.d., i.e., for the equivalent problem of hypothesis testing. However, in this paper, we propose the nearest empirical distribution algorithm for the case when the elements of $Y^n(X)$ are i.non-i.d., which is much more complex than the problem of hypothesis testing where the elements of $Y^n(X)$ are i.i.d.
To the best of our knowledge, this is the first paper that investigates the important problem of classifying feature vectors with i.non-i.d. elements and provides an upper bound on the corresponding error probability. The novelty of this paper is not the classifier itself, but rather in showing the importance of the problem of classifying feature vectors with i.non-i.d. elements and in showing analytically that at least one classifier with an asymptotically optimal error probability exists when at least one training feature vector per label is available.
The remainder of this paper is structured as follows. In Section 2, we formulate the considered classification problem. In Section 3, we present our classifier and derive an upper bound on its error probability. In Section 4, we provide numerical examples of the performance of the proposed classifier. Finally, Section 5 concludes the paper.

2. Problem Formulation

The machine learning model is comprised of a label $X$, a feature vector $Y^n(X) = [Y_1(X), Y_2(X), \ldots, Y_n(X)]$ of length $n$ mapped to the label $X$, and a learned label $\hat{X}$, as shown in Figure 1. In this paper, we adopt the information-theoretic style of notation, whereby random variables are denoted by capital letters and their realizations are denoted by small letters. The feature vector $Y^n(X)$ is the input to the machine learning algorithm, whose aim is to detect the label $X$ from the observed feature vector $Y^n(X)$. The performance of the machine learning algorithm is measured by the error probability $P_e = \Pr\{X \neq \hat{X}\}$.
We adopt the modeling in [38,39,40] and represent the dependency between the label $X$ and the feature vector $Y^n(X)$ via a joint probability distribution $p_{X,Y^n}(x, y^n)$. Now, in order to gain a better understanding of the problem, we include the joint probability distribution $p_{X,Y^n}(x, y^n)$ in the model in Figure 1. To this end, since $p_{X,Y^n}(x, y^n) = p_{Y^n|X}(y^n|x)\, p_X(x)$ holds, instead of $p_{X,Y^n}(x, y^n)$, we can include the conditional probability distribution $p_{Y^n|X}(y^n|x)$ and the probability distribution $p_X(x)$ in the model in Figure 1, and thereby obtain the model in Figure 2.
Now, the classification learning model in Figure 2 is a system comprised of a source that generates the label $X$ according to the distribution $p_X(x)$, a feature vector generator modelled by the conditional probability distribution $p_{Y^n|X}(y^n|x)$, a feature vector $Y^n$, a classifier that aims to detect $X$ from the observed feature vector $Y^n$, and the detected label $\hat{X}$. Note that the system model in Figure 2 can be seen equivalently as a communication system comprised of a source $X$, a channel with input $X$ and output $Y^n$, and a decoder (i.e., a detector) that aims to detect $X$ from $Y^n$. The notation used in this paper, the letter $X$ for labels and the letter $Y$ for features, is based on the notation used in information theory for modelling communication systems. In the classification model shown in Figure 2, we assume that the label $X$ takes values from the set $\mathcal{X}$ according to $p_X(x) = 1/|\mathcal{X}|$, where $|\cdot|$ denotes the cardinality of a set. Next, we assume that the $i$-th element of the feature vector $Y^n$, $Y_i$, for $i = 1, 2, \ldots, n$, takes values from the set $\mathcal{Y} = \{y_1, y_2, \ldots, y_{|\mathcal{Y}|}\}$ according to the conditional probability distribution $p_{Y_i|X}(y_i|x)$.
Moreover, we assume that the elements of the feature vector $Y^n$ are i.non-i.d. As a result, the feature vector $Y^n$ takes values from the set $\mathcal{Y}^n$ according to the conditional probability distribution $p_{Y^n|X}(y^n|x)$ given by
$$p_{Y^n|X}(y^n|x) = p_{Y_1, Y_2, \ldots, Y_n|X}(y_1, y_2, \ldots, y_n|x) \overset{(a)}{=} \prod_{i=1}^{n} p_{Y_i|X}(y_i|x) \overset{(b)}{=} \prod_{i=1}^{n} p_i(y_i|x), \tag{1}$$
where $(a)$ comes from the fact that the elements of the feature vector $Y^n$ are mutually independent and $(b)$ is for the sake of notational simplicity, where $p_i$ is used instead of $p_{Y_i|X}$. As a result of (1), the considered classification model in Figure 2 can be represented equivalently as in Figure 3.
Next, we assume that $p_i(y_i|x)$, $\forall i$, and thereby $p_{Y^n|X}(y^n|x)$, are unknown to the classifier. Instead, the classifier knows $\mathcal{X}$ and $\mathcal{Y}$, and for each $x_i \in \mathcal{X}$, where $i = 1, 2, \ldots, |\mathcal{X}|$, it has access to a finite set of $t$ correctly labelled input–output pairs $(x_i, \hat{y}_{i1}^n), (x_i, \hat{y}_{i2}^n), \ldots, (x_i, \hat{y}_{it}^n)$, denoted by $\mathcal{T}_i$ and referred to as the training set for label $x_i$.
Finally, we assume that the following holds
$$\sum_{l=1}^{n} p_l(y|x_i) \neq \sum_{l=1}^{n} p_l(y|x_j), \quad \text{for } y \in \mathcal{Y} \text{ and } i \neq j. \tag{2}$$
The condition in (2) means that the distribution of the feature vectors $Y^n(X)$ for label $X = x_i$ is not a perturbation of the distribution of the feature vectors $Y^n(X)$ for label $X = x_j$. As a result, the proposed classifier only applies to the subset of data vectors with i.non-i.d. elements that satisfy (2).
For the classification system model defined above and illustrated in Figure 3, we wish to propose a classifier that exhibits an asymptotically optimal error probability $P_e = \Pr\{X \neq \hat{X}\}$ with respect to the length of $Y^n$, $n$, for any $t \geq 1$, i.e., for any $t \geq 1$, $P_e \to 0$ as $n \to \infty$. Moreover, we wish to obtain an analytical upper bound on the error probability of the proposed classifier for given $t$ and $n$.
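To make the above data model concrete, the following minimal Python sketch (provided only for illustration; the function names, array layout, and toy numbers are ours and are not part of the system model) draws a feature vector whose elements are independent but governed by position-dependent conditional distributions $p_i(y_i|x)$, as in (1) and Figure 3.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_feature_vector(P, x, rng):
    """Draw one feature vector y^n for label index x.

    P has shape (|X|, n, |Y|): P[x, i, :] is the per-position conditional pmf
    p_i(. | x), so the elements of y^n are independent but non-identically
    distributed (i.non-i.d.).
    """
    n, alphabet_size = P.shape[1], P.shape[2]
    return np.array([rng.choice(alphabet_size, p=P[x, i]) for i in range(n)])

# Toy instance (illustrative values only): |X| = 2 labels, n = 4 positions,
# |Y| = 3 symbols, with a different pmf at every position.
P = rng.random((2, 4, 3))
P /= P.sum(axis=2, keepdims=True)   # normalize each p_i(. | x) to sum to one

y = sample_feature_vector(P, x=0, rng=rng)
print(y)  # e.g., array of 4 symbol indices in {0, 1, 2}
```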

3. The Proposed Classifier and Its Performance

In this section, we propose our classifier, derive an analytical upper bound on its error probability, and prove that the classifier exhibits asymptotically optimal performance as the length of the feature vector $Y^n$, $n$, satisfies $n \to \infty$. This is done in the following.
For a given vector $v^n = (v_1, v_2, \ldots, v_n)$, let the Minkowski distance of order $r$ be defined as
$$\| v^n \|_r = \left( \sum_{i=1}^{n} |v_i|^r \right)^{1/r}. \tag{3}$$
Moreover, for a given feature vector $y^k = (y_1, y_2, \ldots, y_k)$, let $I[y^k = y]$ be a function defined as
$$I[y^k = y] = \sum_{i=1}^{k} Z[y_i = y], \tag{4}$$
where $Z[y_i = y]$ is an indicator function that takes the value 1 if $y_i = y$ and 0 otherwise. Hence, $I[y^k = y]$ counts the number of elements in $y^k$ that have the value $y$.

3.1. The Proposed Classifier

Let $\hat{y}_i^{nt}$ be the vector obtained by concatenating all training feature vectors for the input label $x_i$ as
$$\hat{y}_i^{nt} = \big[\hat{y}_{i1}^n, \hat{y}_{i2}^n, \ldots, \hat{y}_{it}^n\big]. \tag{5}$$
Let $P_{\hat{y}_i^{nt}}$ be the empirical probability distribution of the concatenated training feature vector for label $x_i$, $\hat{y}_i^{nt}$, given by
$$P_{\hat{y}_i^{nt}} = \left[ \frac{I[\hat{y}_i^{nt} = y_1]}{nt}, \frac{I[\hat{y}_i^{nt} = y_2]}{nt}, \ldots, \frac{I[\hat{y}_i^{nt} = y_{|\mathcal{Y}|}]}{nt} \right]. \tag{6}$$
Let $y^n$ be the feature vector observed at the classifier, whose label the classifier wants to detect, and let $P_{y^n}$ denote the empirical probability distribution of $y^n$, given by
$$P_{y^n} = \left[ \frac{I[y^n = y_1]}{n}, \frac{I[y^n = y_2]}{n}, \ldots, \frac{I[y^n = y_{|\mathcal{Y}|}]}{n} \right]. \tag{7}$$
Using the above notations, we propose the following classifier.
Proposition 1.
For the considered system model, we propose a classifier with the following classification rule:
$$\hat{x} = x_{i^\ast}, \quad \text{where } i^\ast = \arg\min_{i} \big\| P_{y^n} - P_{\hat{y}_i^{nt}} \big\|_r, \tag{8}$$
where $r \geq 1$ and ties are resolved by assigning a label among the ties uniformly at random. (For example, if $\| P_{y^n} - P_{\hat{y}_i^{nt}} \|_r = \| P_{y^n} - P_{\hat{y}_j^{nt}} \|_r$ holds for $i \neq j$, we set $\hat{x} = x_i$ or $\hat{x} = x_j$ uniformly at random.)
As seen from (8), the proposed classifier assigns the label $x_i$ if the empirical probability distribution of the concatenated training feature vector mapped to label $x_i$, $P_{\hat{y}_i^{nt}}$, is the closest, in terms of the Minkowski distance of order $r$, to the empirical probability distribution of the observed feature vector, $P_{y^n}$. In that sense, the proposed classifier can be considered a nearest empirical distribution classifier.
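A minimal Python sketch of this decision rule is given next (for illustration only; the function and variable names and the toy data are ours, and ties are broken by the first minimizer rather than uniformly at random). It concatenates the $t$ training vectors per label, forms the empirical distributions (6) and (7), and picks the label whose empirical distribution is nearest in the Minkowski distance of order $r$.

```python
import numpy as np

def empirical_distribution(v, alphabet_size):
    """Empirical pmf of a symbol sequence v over {0, ..., alphabet_size - 1}."""
    counts = np.bincount(v, minlength=alphabet_size)
    return counts / len(v)

def nearest_empirical_distribution(y, train, alphabet_size, r=2):
    """Proposed classifier: return the label index minimizing
    || P_{y^n} - P_{hat{y}_i^{nt}} ||_r, cf. (8).

    train[i] is a list of t training vectors (1-D integer arrays) for label i.
    """
    P_y = empirical_distribution(y, alphabet_size)
    dists = []
    for vecs in train:
        concat = np.concatenate(vecs)                        # hat{y}_i^{nt}, cf. (5)
        P_i = empirical_distribution(concat, alphabet_size)  # cf. (6)
        dists.append(np.sum(np.abs(P_y - P_i) ** r) ** (1.0 / r))
    return int(np.argmin(dists))  # ties: first minimizer (the paper uses a random tie-break)

# Tiny made-up example: 2 labels, alphabet size 3, t = 1 training vector each.
train = [[np.array([0, 0, 1, 2, 0])], [np.array([2, 2, 1, 2, 0])]]
y_obs = np.array([0, 1, 0, 0, 2])
print(nearest_empirical_distribution(y_obs, train, alphabet_size=3))  # -> 0
```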

3.2. Upper Bound on the Error Probability

The following theorem establishes an upper bound on the error probability of the proposed classifier.
Theorem 1.
Let $\bar{P}_j$, for $j = 1, 2, \ldots, |\mathcal{X}|$, be a vector defined as
$$\bar{P}_j = \big[\bar{p}(y_1|x_j), \bar{p}(y_2|x_j), \ldots, \bar{p}(y_{|\mathcal{Y}|}|x_j)\big], \tag{9}$$
where $\bar{p}(y|x_j)$ is given by
$$\bar{p}(y|x_j) = \frac{1}{n} \sum_{k=1}^{n} p_k(y|x_j). \tag{10}$$
Then, for a given $r \geq 1$, the error probability of the proposed classifier is upper bounded by
$$P_e \leq 2|\mathcal{Y}| e^{-2n\epsilon^2} + 2|\mathcal{Y}| e^{-2n t^{1/3} \epsilon^2}, \tag{11}$$
where $\epsilon$ is given by
$$\epsilon = \min_{i \neq j} \frac{\big\| P_{\hat{y}_i^{nt}} - \bar{P}_j \big\|_r}{\big(2 + t^{-1/3}\big) |\mathcal{Y}|^{1/r}}. \tag{12}$$
Proof of Theorem 1. 
Without loss of generality, we assume that $x_1$ is the input to $p_{Y^n|X}(y^n|x)$ and that $y^n$ is observed.
Let $A_k^{\epsilon}$, for $1 \leq k \leq |\mathcal{Y}|$, be a set defined as
$$A_k^{\epsilon} = \left\{ y^n : \left| \frac{I[y^n = y_k]}{n} - \bar{p}(y_k|x_1) \right| \leq \epsilon \right\}. \tag{13}$$
Furthermore, let $B_k^{\epsilon}$, for $1 \leq k \leq |\mathcal{Y}|$, be a set defined as
$$B_k^{\epsilon} = \left\{ \hat{y}^{nt} : \left| \frac{I[\hat{y}^{nt} = y_k]}{nt} - \bar{p}(y_k|x_1) \right| \leq \frac{\epsilon}{\sqrt[3]{t}} \right\}. \tag{14}$$
Let $A^{\epsilon} = \bigcap_{k=1}^{|\mathcal{Y}|} A_k^{\epsilon}$ and $B^{\epsilon} = \bigcap_{k=1}^{|\mathcal{Y}|} B_k^{\epsilon}$. Now, for any $y^n \in A^{\epsilon}$, we have
$$\left( \sum_{k=1}^{|\mathcal{Y}|} \left| \frac{I[y^n = y_k]}{n} - \bar{p}(y_k|x_1) \right|^r \right)^{1/r} \overset{(a)}{\leq} \left( \sum_{k=1}^{|\mathcal{Y}|} \epsilon^r \right)^{1/r}, \tag{15}$$
where $(a)$ follows from (13). Moreover, for $\hat{y}_1^{nt} \in B^{\epsilon}$, we have
$$\left( \sum_{k=1}^{|\mathcal{Y}|} \left| \frac{I[\hat{y}_1^{nt} = y_k]}{nt} - \bar{p}(y_k|x_1) \right|^r \right)^{1/r} \overset{(a)}{\leq} \left( \sum_{k=1}^{|\mathcal{Y}|} \left(\frac{\epsilon}{\sqrt[3]{t}}\right)^r \right)^{1/r}, \tag{16}$$
where $(a)$ follows from (14). Next, we have the following upper bound
$$\begin{aligned}
\left( \sum_{k=1}^{|\mathcal{Y}|} \left| \frac{I[y^n = y_k]}{n} - \frac{I[\hat{y}_1^{nt} = y_k]}{nt} \right|^r \right)^{1/r}
&= \left( \sum_{k=1}^{|\mathcal{Y}|} \left| \left( \frac{I[y^n = y_k]}{n} - \bar{p}(y_k|x_1) \right) - \left( \frac{I[\hat{y}_1^{nt} = y_k]}{nt} - \bar{p}(y_k|x_1) \right) \right|^r \right)^{1/r} \\
&\overset{(a)}{\leq} \left( \sum_{k=1}^{|\mathcal{Y}|} \left| \frac{I[y^n = y_k]}{n} - \bar{p}(y_k|x_1) \right|^r \right)^{1/r} + \left( \sum_{k=1}^{|\mathcal{Y}|} \left| \frac{I[\hat{y}_1^{nt} = y_k]}{nt} - \bar{p}(y_k|x_1) \right|^r \right)^{1/r},
\end{aligned} \tag{17}$$
where $(a)$ follows from the Minkowski inequality. Combining (15)–(17), we obtain
$$\left( \sum_{k=1}^{|\mathcal{Y}|} \left| \frac{I[y^n = y_k]}{n} - \frac{I[\hat{y}_1^{nt} = y_k]}{nt} \right|^r \right)^{1/r} \leq |\mathcal{Y}|^{1/r} \epsilon + |\mathcal{Y}|^{1/r} \frac{\epsilon}{\sqrt[3]{t}}. \tag{18}$$
Hence, the Minkowski distance between the empirical probability distribution of the observed vector $y^n$ and the empirical probability distribution of the concatenated training vector for label $x_1$ is upper bounded by the right-hand side of (18). We now derive a lower bound for $\hat{y}_i^{nt}$, where $i \neq 1$. For any $x_i$ such that $i \neq 1$, we have
$$\begin{aligned}
\left( \sum_{k=1}^{|\mathcal{Y}|} \left| \frac{I[y^n = y_k]}{n} - \frac{I[\hat{y}_i^{nt} = y_k]}{nt} \right|^r \right)^{1/r} + \left( \sum_{k=1}^{|\mathcal{Y}|} \epsilon^r \right)^{1/r}
&\overset{(a)}{\geq} \left( \sum_{k=1}^{|\mathcal{Y}|} \left| \frac{I[y^n = y_k]}{n} - \frac{I[\hat{y}_i^{nt} = y_k]}{nt} \right|^r \right)^{1/r} + \left( \sum_{k=1}^{|\mathcal{Y}|} \left| \frac{I[y^n = y_k]}{n} - \bar{p}(y_k|x_1) \right|^r \right)^{1/r} \\
&\overset{(b)}{\geq} \left( \sum_{k=1}^{|\mathcal{Y}|} \left| \frac{I[\hat{y}_i^{nt} = y_k]}{nt} - \bar{p}(y_k|x_1) \right|^r \right)^{1/r},
\end{aligned} \tag{19}$$
where $(a)$ follows from (15) and $(b)$ is again due to the Minkowski inequality. The expression in (19) can be written equivalently as
$$\left( \sum_{k=1}^{|\mathcal{Y}|} \left| \frac{I[y^n = y_k]}{n} - \frac{I[\hat{y}_i^{nt} = y_k]}{nt} \right|^r \right)^{1/r} \geq \left( \sum_{k=1}^{|\mathcal{Y}|} \left| \frac{I[\hat{y}_i^{nt} = y_k]}{nt} - \bar{p}(y_k|x_1) \right|^r \right)^{1/r} - |\mathcal{Y}|^{1/r} \epsilon, \tag{20}$$
where $i \neq 1$. Now, using the definitions of $P_{\hat{y}_i^{nt}}$ and $\bar{P}_1$ given by (6) and (9), respectively, in (20), we can replace the first expression on the right-hand side of (20) by $\| P_{\hat{y}_i^{nt}} - \bar{P}_1 \|_r$, and thereby for any $i \neq 1$ we have
$$\left( \sum_{k=1}^{|\mathcal{Y}|} \left| \frac{I[y^n = y_k]}{n} - \frac{I[\hat{y}_i^{nt} = y_k]}{nt} \right|^r \right)^{1/r} \geq \big\| P_{\hat{y}_i^{nt}} - \bar{P}_1 \big\|_r - |\mathcal{Y}|^{1/r} \epsilon. \tag{21}$$
The expression in (21) represents a lower bound on the Minkowski distance of order $r$ between the empirical probability distribution of the observed vector $y^n$ and the empirical probability distribution of the concatenated training vector for any label $x_i$ with $i \neq 1$.
Using the bounds in (18) and (21), we now relate the left-hand sides of (18) and (21). As long as the following inequality holds for each $i \neq 1$,
$$|\mathcal{Y}|^{1/r} \epsilon \left( 1 + \frac{1}{\sqrt[3]{t}} \right) < \big\| P_{\hat{y}_i^{nt}} - \bar{P}_1 \big\|_r - |\mathcal{Y}|^{1/r} \epsilon, \tag{22}$$
which is equivalent to the following for $i \neq 1$
$$\epsilon < \frac{\big\| P_{\hat{y}_i^{nt}} - \bar{P}_1 \big\|_r}{\big(2 + t^{-1/3}\big) |\mathcal{Y}|^{1/r}}, \tag{23}$$
we have
$$\left( \sum_{k=1}^{|\mathcal{Y}|} \left| \frac{I[y^n = y_k]}{n} - \frac{I[\hat{y}_1^{nt} = y_k]}{nt} \right|^r \right)^{1/r} \overset{(a)}{\leq} |\mathcal{Y}|^{1/r} \epsilon \left( 1 + \frac{1}{\sqrt[3]{t}} \right) \overset{(b)}{<} \big\| P_{\hat{y}_i^{nt}} - \bar{P}_1 \big\|_r - |\mathcal{Y}|^{1/r} \epsilon \overset{(c)}{\leq} \left( \sum_{k=1}^{|\mathcal{Y}|} \left| \frac{I[y^n = y_k]}{n} - \frac{I[\hat{y}_i^{nt} = y_k]}{nt} \right|^r \right)^{1/r}, \tag{24}$$
where $(a)$, $(b)$, and $(c)$ follow from (18), (22), and (21), respectively. Thereby, from (24), we have the following for $i \neq 1$
$$\left( \sum_{k=1}^{|\mathcal{Y}|} \left| \frac{I[y^n = y_k]}{n} - \frac{I[\hat{y}_1^{nt} = y_k]}{nt} \right|^r \right)^{1/r} < \left( \sum_{k=1}^{|\mathcal{Y}|} \left| \frac{I[y^n = y_k]}{n} - \frac{I[\hat{y}_i^{nt} = y_k]}{nt} \right|^r \right)^{1/r}. \tag{25}$$
Note that the right- and left-hand sides of (25) are the Minkowski distances of order $r$ of the vectors
$$v_1 = \left[ \frac{I[y^n = y_1]}{n} - \frac{I[\hat{y}_1^{nt} = y_1]}{nt}, \ldots, \frac{I[y^n = y_{|\mathcal{Y}|}]}{n} - \frac{I[\hat{y}_1^{nt} = y_{|\mathcal{Y}|}]}{nt} \right] \tag{26}$$
and
$$v_2 = \left[ \frac{I[y^n = y_1]}{n} - \frac{I[\hat{y}_i^{nt} = y_1]}{nt}, \ldots, \frac{I[y^n = y_{|\mathcal{Y}|}]}{n} - \frac{I[\hat{y}_i^{nt} = y_{|\mathcal{Y}|}]}{nt} \right], \tag{27}$$
respectively. Now, by the definitions of $P_{y^n}$ and $P_{\hat{y}_i^{nt}}$ given by (7) and (6), respectively, (26) and (27) can be replaced by $P_{y^n} - P_{\hat{y}_1^{nt}}$ and $P_{y^n} - P_{\hat{y}_i^{nt}}$, respectively. Therefore, (25) can be written equivalently as
$$\big\| P_{y^n} - P_{\hat{y}_1^{nt}} \big\|_r < \big\| P_{y^n} - P_{\hat{y}_i^{nt}} \big\|_r. \tag{28}$$
Now, let us highlight what we have obtained. We have shown that if $\epsilon$ is such that (23) holds for all $i \neq 1$, and if $y^n \in A^{\epsilon}$ and $\hat{y}_1^{nt} \in B^{\epsilon}$ for that $\epsilon$, then (28) holds for all $i \neq 1$, and thereby our classifier detects $x_1$ as the correct label. Using this, we can upper bound the error probability as
$$P_e = 1 - \Pr\big\{ \hat{x} = x_1 \big\} \leq 1 - \Pr\big\{ y^n \in A^{\epsilon} \cap \hat{y}_1^{nt} \in B^{\epsilon} \,\big|\, \epsilon \in \mathcal{S} \big\}, \tag{29}$$
where $\mathcal{S}$ is a set defined as
$$\mathcal{S} = \left\{ \epsilon : \epsilon \leq \min_{i \neq 1} \frac{\big\| P_{\hat{y}_i^{nt}} - \bar{P}_1 \big\|_r}{\big(2 + t^{-1/3}\big) |\mathcal{Y}|^{1/r}} \right\}. \tag{30}$$
In the following, we bound the expression in (29). The right-hand side of (29) can be upper bounded as
$$1 - \Pr\big\{ y^n \in A^{\epsilon} \cap \hat{y}_1^{nt} \in B^{\epsilon} \,\big|\, \epsilon \in \mathcal{S} \big\} = \Pr\big\{ y^n \notin A^{\epsilon} \cup \hat{y}_1^{nt} \notin B^{\epsilon} \,\big|\, \epsilon \in \mathcal{S} \big\} \overset{(a)}{\leq} \Pr\big\{ y^n \notin A^{\epsilon} \,\big|\, \epsilon \in \mathcal{S} \big\} + \Pr\big\{ \hat{y}_1^{nt} \notin B^{\epsilon} \,\big|\, \epsilon \in \mathcal{S} \big\}, \tag{31}$$
where $(a)$ follows from Boole's inequality. Now, note that we have the following upper bound for the first expression on the right-hand side of (31)
$$\begin{aligned}
\Pr\big\{ y^n \notin A^{\epsilon} \,\big|\, \epsilon \in \mathcal{S} \big\}
&= \Pr\Big\{ y^n \notin \bigcap_{k=1}^{|\mathcal{Y}|} A_k^{\epsilon} \,\Big|\, \epsilon \in \mathcal{S} \Big\}
= \Pr\Big\{ y^n \in \bigcup_{k=1}^{|\mathcal{Y}|} \overline{A_k^{\epsilon}} \,\Big|\, \epsilon \in \mathcal{S} \Big\}
\overset{(a)}{\leq} \sum_{k=1}^{|\mathcal{Y}|} \Pr\big\{ y^n \in \overline{A_k^{\epsilon}} \,\big|\, \epsilon \in \mathcal{S} \big\} \\
&= \sum_{k=1}^{|\mathcal{Y}|} \Pr\left\{ \left| \frac{I[y^n = y_k]}{n} - \bar{p}(y_k|x_1) \right| > \epsilon \,\middle|\, \epsilon \in \mathcal{S} \right\}
= \sum_{k=1}^{|\mathcal{Y}|} \Pr\left\{ \left| \frac{1}{n}\sum_{j=1}^{n} Z[y_j = y_k] - \bar{p}(y_k|x_1) \right| > \epsilon \,\middle|\, \epsilon \in \mathcal{S} \right\},
\end{aligned} \tag{32}$$
where $\overline{A_k^{\epsilon}}$ is the complement of $A_k^{\epsilon}$ and $(a)$ follows from Boole's inequality. Note that $Z[y_1 = y_k], Z[y_2 = y_k], \ldots, Z[y_n = y_k]$ in (32) are $n$ independent Bernoulli random variables with probabilities of success $p_1(y_k|x_1), p_2(y_k|x_1), \ldots, p_n(y_k|x_1)$, respectively. Let $W[y_k]$ be a binomial random variable with parameters $\big(n, \bar{p}(y_k|x_1)\big)$. We proceed with the proof by recalling the following well-known theorem of Hoeffding from [41].
Theorem 2
(Hoeffding [41]). Assume that $Z_1, Z_2, \ldots, Z_n$ are $n$ independent Bernoulli random variables with probabilities of success $p_1, p_2, \ldots, p_n$, respectively. Next, let $Z$ be defined as $Z = Z_1 + Z_2 + \cdots + Z_n$, and let $\bar{p}$ be defined as $\bar{p} = (p_1 + p_2 + \cdots + p_n)/n$. Let $W$ be a binomial random variable with parameters $(n, \bar{p})$. Then, for given $a$ and $b$, where $0 \leq a \leq n\bar{p} \leq b \leq n$ holds, we have
$$\Pr\{ a \leq W \leq b \} \leq \Pr\{ a \leq Z \leq b \}. \tag{33}$$
In other words, the probability distribution of $W$ is more dispersed around its mean $n\bar{p}$ than is the probability distribution of $Z$. Except in the trivial case when $a = b = 0$, the bound in (33) holds with equality if and only if $p_1 = \cdots = p_n = \bar{p}$.
Proof of Theorem 2. 
Please refer to [41]. □
Setting $a = n(\bar{p} - \delta)$ and $b = n(\bar{p} + \delta)$ in (33), we obtain
$$\Pr\{ n(\bar{p} - \delta) \leq W \leq n(\bar{p} + \delta) \} \leq \Pr\{ n(\bar{p} - \delta) \leq Z \leq n(\bar{p} + \delta) \}. \tag{34}$$
Using (34), we have the following upper bound
$$\Pr\{ |Z - n\bar{p}| > n\delta \} = 1 - \Pr\{ n(\bar{p} - \delta) \leq Z \leq n(\bar{p} + \delta) \} \overset{(a)}{\leq} 1 - \Pr\{ n(\bar{p} - \delta) \leq W \leq n(\bar{p} + \delta) \} = \Pr\{ |W - n\bar{p}| > n\delta \}, \tag{35}$$
where ( a ) follows from (34).
We now return to the proof of Theorem 1. According to Theorem 2, the probability distribution of $W[y_k]$ is more dispersed around its mean $n\bar{p}(y_k|x_1)$ than is the probability distribution of $\sum_{1 \leq j \leq n} Z[y_j = y_k]$. Therefore, we can upper bound the probability in the last line of (32) as
$$\Pr\left\{ \left| \frac{1}{n}\sum_{j=1}^{n} Z[y_j = y_k] - \bar{p}(y_k|x_1) \right| > \epsilon \,\middle|\, \epsilon \in \mathcal{S} \right\} \overset{(a)}{\leq} \Pr\left\{ \left| \frac{1}{n} W[y_k] - \bar{p}(y_k|x_1) \right| > \epsilon \,\middle|\, \epsilon \in \mathcal{S} \right\}, \tag{36}$$
where the set $\mathcal{S}$ is defined in (30) and $(a)$ follows from (35). Now, let us introduce another well-known theorem of Hoeffding from [42].
Theorem 3
(Hoeffding's inequality [42]). Let $W_1, W_2, \ldots, W_n$ be $n$ independent random variables such that for each $1 \leq i \leq n$ we have $\Pr\{ W_i \in [a_i, b_i] \} = 1$. Then, for $S_n$ defined as $S_n = \sum_{i=1}^{n} W_i$, we have
$$\Pr\big\{ S_n - \mathbb{E}[S_n] \geq \delta \big\} \leq \exp\left( \frac{-2\delta^2}{\sum_{i=1}^{n} (b_i - a_i)^2} \right), \tag{37}$$
where $\mathbb{E}[S_n]$ is the expectation of $S_n$.
Proof of Theorem 3. 
Please refer to [42]. □
Returning to (36), by using the result in (37) with $a_i = 0$ and $b_i = 1$, since the Bernoulli random variables that constitute the binomial random variable $W[y_k]$ take values 0 or 1, we have
$$\Pr\left\{ \left| \frac{1}{n}\sum_{j=1}^{n} Z[y_j = y_k] - \bar{p}(y_k|x_1) \right| > \epsilon \,\middle|\, \epsilon \in \mathcal{S} \right\} \leq 2\exp\left( \frac{-2 n^2 \epsilon^2}{\sum_{1 \leq i \leq n} (1-0)^2} \right) = 2 e^{-2n\epsilon^2}, \tag{38}$$
where $\mathcal{S}$ is defined in (30). Inserting (38) into (32), we obtain the following upper bound
$$\Pr\big\{ y^n \notin A^{\epsilon} \,\big|\, \epsilon \in \mathcal{S} \big\} \leq 2|\mathcal{Y}| e^{-2n\epsilon^2}. \tag{39}$$
Similarly, we have the following result for the second expression on the right-hand side of (31)
$$\begin{aligned}
\Pr\big\{ \hat{y}_1^{nt} \notin B^{\epsilon} \,\big|\, \epsilon \in \mathcal{S} \big\}
&= \Pr\Big\{ \hat{y}_1^{nt} \notin \bigcap_{k=1}^{|\mathcal{Y}|} B_k^{\epsilon} \,\Big|\, \epsilon \in \mathcal{S} \Big\}
= \Pr\Big\{ \hat{y}_1^{nt} \in \bigcup_{k=1}^{|\mathcal{Y}|} \overline{B_k^{\epsilon}} \,\Big|\, \epsilon \in \mathcal{S} \Big\}
\overset{(a)}{\leq} \sum_{k=1}^{|\mathcal{Y}|} \Pr\big\{ \hat{y}_1^{nt} \in \overline{B_k^{\epsilon}} \,\big|\, \epsilon \in \mathcal{S} \big\} \\
&= \sum_{k=1}^{|\mathcal{Y}|} \Pr\left\{ \left| \frac{I[\hat{y}_1^{nt} = y_k]}{nt} - \bar{p}(y_k|x_1) \right| > \frac{\epsilon}{\sqrt[3]{t}} \,\middle|\, \epsilon \in \mathcal{S} \right\}
= \sum_{k=1}^{|\mathcal{Y}|} \Pr\left\{ \left| \frac{1}{nt}\sum_{j=1}^{nt} Z[y_j = y_k] - \bar{p}(y_k|x_1) \right| > \frac{\epsilon}{\sqrt[3]{t}} \,\middle|\, \epsilon \in \mathcal{S} \right\},
\end{aligned} \tag{40}$$
where again $(a)$ follows from Boole's inequality. Note that, due to (5), for any integer $l$ such that $0 \leq l \leq t-1$, the random variables $Z[y_{nl+1} = y_k], Z[y_{nl+2} = y_k], \ldots, Z[y_{nl+n} = y_k]$ in (40) are $n$ independent Bernoulli random variables with probabilities of success $p_1(y_k|x_1), p_2(y_k|x_1), \ldots, p_n(y_k|x_1)$, respectively (here $y_{nl+1}, y_{nl+2}, \ldots, y_{nl+n}$ are the elements of $\hat{y}_{1,l+1}^{n}$). In addition, note that
$$\bar{p}(y_k|x_1) = \frac{1}{n} \sum_{j=1}^{n} p_j(y_k|x_1) = \frac{1}{nt} \sum_{l=0}^{t-1} \sum_{j=1}^{n} p_j(y_k|x_1). \tag{41}$$
Notice that for each $0 \leq l \leq t-1$, $p_1(y_k|x_1) + p_2(y_k|x_1) + \cdots + p_n(y_k|x_1)$ is the sum of the probabilities of success of the random variables $Z[y_{nl+1} = y_k], Z[y_{nl+2} = y_k], \ldots, Z[y_{nl+n} = y_k]$. Thereby, the last expression on the right-hand side of (41) is the average probability of success of the random variables $Z[y_j = y_k]$ for $1 \leq j \leq nt$. Now, let $W[y_k]$ be a binomial random variable with parameters $\big(nt, \bar{p}(y_k|x_1)\big)$. Once again, according to Theorem 2, the probability distribution of $W[y_k]$ is more dispersed around its mean $nt\,\bar{p}(y_k|x_1)$ than is the probability distribution of $\sum_{1 \leq j \leq nt} Z[y_j = y_k]$. Therefore, the probability in the last line of (40) can be upper bounded as
$$\Pr\left\{ \left| \frac{1}{nt}\sum_{j=1}^{nt} Z[y_j = y_k] - \bar{p}(y_k|x_1) \right| > \frac{\epsilon}{\sqrt[3]{t}} \,\middle|\, \epsilon \in \mathcal{S} \right\}
\overset{(a)}{\leq} \Pr\left\{ \left| \frac{1}{nt} W[y_k] - \bar{p}(y_k|x_1) \right| > \frac{\epsilon}{\sqrt[3]{t}} \,\middle|\, \epsilon \in \mathcal{S} \right\}
\overset{(b)}{\leq} 2\exp\left( \frac{-2 (nt)^2 t^{-2/3} \epsilon^2}{\sum_{1 \leq i \leq nt} (1-0)^2} \right) = 2 e^{-2 nt\, t^{-2/3} \epsilon^2} = 2 e^{-2 n t^{1/3} \epsilon^2}, \tag{42}$$
where $\mathcal{S}$ is defined in (30), $(a)$ follows from (35) (in which $n$ is replaced by $nt$), and $(b)$ follows from (37) with $a_i = 0$ and $b_i = 1$, since the Bernoulli random variables that constitute $W[y_k]$ take values 0 or 1. Inserting (42) into (40), we have the following upper bound
$$\Pr\big\{ \hat{y}_1^{nt} \notin B^{\epsilon} \,\big|\, \epsilon \in \mathcal{S} \big\} \leq 2|\mathcal{Y}| e^{-2 n t^{1/3} \epsilon^2}. \tag{43}$$
Inserting (39) and (43) into (31), and then inserting (31) into (29), we obtain the following upper bound on the error probability
$$P_e \leq 2|\mathcal{Y}| e^{-2n\epsilon^2} + 2|\mathcal{Y}| e^{-2n t^{1/3} \epsilon^2}, \tag{44}$$
where
$$\epsilon = \min_{i \neq j} \frac{\big\| P_{\hat{y}_i^{nt}} - \bar{P}_j \big\|_r}{\big(2 + t^{-1/3}\big) |\mathcal{Y}|^{1/r}}, \tag{45}$$
which is the value of $\epsilon$ that yields the tightest upper bound on the error probability $P_e$ in (44). This completes the proof of Theorem 1. □
The following corollary provides a simplified upper bound on the error probability when $t \to \infty$.
Corollary 1.
When the number of training vectors per label goes to infinity, i.e., when $t \to \infty$, which is equivalent to the case when the probability distribution $p(y^n|x)$ is known at the classifier, the error probability of the proposed classifier is upper bounded as
$$P_e \leq 2|\mathcal{Y}| e^{-2n\epsilon^2}, \tag{46}$$
where $\epsilon$ is given by
$$\epsilon = \min_{i \neq j} \frac{\big\| \bar{P}_i - \bar{P}_j \big\|_r}{2 |\mathcal{Y}|^{1/r}}. \tag{47}$$
Proof. 
The proof is straightforward. □
As can be seen from (8) and (11), the performance of the proposed classifier depends on $r$. We cannot derive the value of $r$ that minimizes the error probability, since we do not have the exact expression for the error probability; we only have its upper bound. On the other hand, in practice, the value of $r$ that is optimal with respect to the upper bound on the error probability also cannot be derived, since the upper bound depends on $\bar{P}_j$, which would be unknown in practice due to $p_{Y^n|X}(y^n|x)$ being unknown. As a result, for our numerical examples, we consider the Euclidean distance ($r = 2$), which is one of the most widely used distance metrics in practice.
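As an illustration of how the bound in (11)–(12) is evaluated in a simulation setting, where $\bar{P}_j$ is available because the generating distributions are known, the following Python sketch computes the right-hand side of (11) for given $n$, $t$, $r$, the training empirical distributions, and the averaged distributions $\bar{P}_j$. The function names and the toy numbers are ours and purely illustrative.

```python
import numpy as np

def minkowski(u, v, r):
    """Minkowski distance of order r between two vectors, cf. (3)."""
    return np.sum(np.abs(u - v) ** r) ** (1.0 / r)

def error_probability_bound(P_hat, P_bar, n, t, r):
    """Evaluate the upper bound (11) with epsilon from (12).

    P_hat[i]: empirical distribution of the concatenated training vectors of label i, cf. (6).
    P_bar[j]: averaged distribution bar{P}_j from (9)-(10), known in a simulation.
    """
    Y = P_bar.shape[1]                      # alphabet size |Y|
    eps = min(minkowski(P_hat[i], P_bar[j], r)
              for i in range(len(P_hat))
              for j in range(P_bar.shape[0]) if i != j)
    eps /= (2.0 + t ** (-1.0 / 3.0)) * Y ** (1.0 / r)
    return (2 * Y * np.exp(-2 * n * eps ** 2)
            + 2 * Y * np.exp(-2 * n * t ** (1.0 / 3.0) * eps ** 2))

# Made-up example: |X| = 2, |Y| = 3, n = 200, t = 1, r = 2.
P_bar = np.array([[0.6, 0.3, 0.1], [0.2, 0.3, 0.5]])
P_hat = np.array([[0.55, 0.33, 0.12], [0.22, 0.28, 0.50]])
print(error_probability_bound(P_hat, P_bar, n=200, t=1, r=2))
```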
The following corollary establishes the asymptotic optimality of the proposed classifier with respect to n.
Corollary 2.
The proposed classifier has an error probability that satisfies $P_e \to 0$ as $n \to \infty$ if $|\mathcal{Y}| \in O(n^m)$, where $m$ is fixed and $r > 2m$. Here, $n^m$ indicates the dimension of our space, i.e., the maximum number of alphabet values that each element of the feature vector $y^n$ can take. Thereby, the proposed classifier is asymptotically optimal.
Proof. 
For the proof, please see Appendix A. □

4. Simulation Results

In this section, we provide simulation results for the performance of the proposed classifier with $r = 2$ and compare it to benchmark schemes. The benchmark schemes that we adopt for comparison are the naive Bayes classifier and the KNN algorithm. We do not adopt a classifier based on a neural network, since neural networks require a very large training set, which we assume is not available. For the naive Bayes classifier, the probability distribution $p_{Y^n|X}(y^n|x)$ is estimated from the training vectors as follows. Let again $\hat{y}_i^{nt}$ be the vector obtained by concatenating all training feature vectors for the input label $x_i$, as in (5). Then, the estimate of $p(y_j = y|x_i)$, denoted by $\hat{p}(y_j = y|x_i)$, is found as
$$\hat{p}(y_j = y|x_i) = \frac{I[\hat{y}_i^{nt} = y]}{nt}, \tag{48}$$
and the naive Bayes classifier decides according to
$$\hat{x} = \arg\max_{x_i} \prod_{k=1}^{n} \hat{p}(y_k|x_i). \tag{49}$$
The main problem with the naive Bayes classifier occurs when an alphabet symbol $y_j \in \mathcal{Y}$ is not present in the training feature vectors. In that case, $\hat{p}(y_j|x_i)$ in (48) is $\hat{p}(y_j|x_i) = 0$, $\forall x_i \in \mathcal{X}$, and, as a result, the right-hand side of (49) is zero, since at least one of the factors in the product in (49) is zero. In this case, the naive Bayes classifier fails to provide an accurate classification of the labels. In what follows, we see that this issue of the naive Bayes classifier appears frequently when we have a small number of training feature vectors. The KNN classifier, on the other hand, works as follows. For the observed feature vector $y^n$, the KNN classifier finds the $k$ training feature vectors nearest to $y^n$ among all training feature vectors $\hat{y}_{is}^n$, for $1 \leq i \leq |\mathcal{X}|$ and $1 \leq s \leq t$. Then, considering the corresponding set of $k$ input–output pairs, the KNN classifier decides the label that is the most frequent among them. The optimal value of $k$ for $t = 1$ is $k = 1$.
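For concreteness, here is a minimal Python sketch of the two benchmark decision rules as described above (provided only for illustration; the names are ours, the KNN distance is Euclidean, and ties are broken by the first maximizer).

```python
import numpy as np

def naive_bayes_predict(y, train, alphabet_size):
    """Naive Bayes benchmark: estimate p_hat(y|x_i) as in (48) and apply (49).

    train[i] is a list of t training vectors (1-D integer arrays) for label i.
    """
    scores = []
    for vecs in train:
        concat = np.concatenate(vecs)
        p_hat = np.bincount(concat, minlength=alphabet_size) / len(concat)
        scores.append(np.prod(p_hat[y]))    # zero if y contains an unseen symbol
    return int(np.argmax(scores))

def knn_predict(y, train, k=1):
    """KNN benchmark: majority vote among the k training vectors closest to y."""
    dists, labels = [], []
    for label, vecs in enumerate(train):
        for v in vecs:
            dists.append(np.linalg.norm(y - v))
            labels.append(label)
    nearest = np.argsort(dists)[:k]
    votes = np.bincount([labels[i] for i in nearest])
    return int(np.argmax(votes))
```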
In the following, we provide numerical examples where we illustrate the performance of the proposed classifier when p Y n | X ( y n | x ) is artificially generated.

4.1. The I.I.D. Case with One Training Sample per Label

In the following examples, we assume that the classifiers have access to only one training feature vector for each label, the elements of the feature vectors are generated i.i.d., and the alphabet size of the feature vector, | Y | , is fixed.
In Figure 4 and Figure 5, we compare the error probability of the proposed classifier with those of the naive Bayes classifier and the KNN algorithm for the cases $|\mathcal{Y}| = 6$ and $|\mathcal{Y}| = 20$, respectively. In both examples, we have two different labels, i.e., $|\mathcal{X}| = 2$. As a result, we have two different probability distributions $p_{Y^n|X}(y^n|x_1)$ and $p_{Y^n|X}(y^n|x_2)$, which are randomly generated as follows. We first generate two random vectors of length 6 and length 20 for Figure 4 and Figure 5, respectively, where the elements of these vectors are drawn independently from a uniform probability distribution. We then normalize these vectors such that the sum of their elements is equal to one. These two normalized randomly generated vectors then represent the two probability distributions $p_{Y_i|X}(y_i|x_1) = p_{Y|X}(y|x_1)$ and $p_{Y_i|X}(y_i|x_2) = p_{Y|X}(y|x_2)$, $\forall i$. Then, $p_{Y^n|X}(y^n|x_k)$ is obtained as $p_{Y^n|X}(y^n|x_k) = \prod_{i=1}^{n} p_{Y_i|X}(y_i|x_k)$, for $k = 1, 2$. The simulation is carried out as follows. For each $n$, we generate one training vector for each label using the aforementioned probability distributions. Then, as test samples, we generate 1000 feature vectors for each label, pass these feature vectors through our proposed classifier, the naive Bayes classifier, and the KNN algorithm, and compute the errors. The length of the feature vector $n$ is varied from $n = 1$ to $n = 100$. We repeat the simulation 5000 times and then plot the error probability. Figure 4 and Figure 5 show that the proposed classifier outperforms both the naive Bayes classifier and KNN. The main reason for this performance gain is that, when only one training vector per label is available, the proposed classifier is more resilient to errors than the naive Bayes classifier, whereas the KNN algorithm performs very poorly because of the "curse of dimensionality". Specifically, the naive Bayes classifier cannot perform accurate classification when $n$ is small compared to $|\mathcal{Y}|$, since the chance that some alphabet symbol is not present in one of the training feature vectors is close to 1. On the other hand, the KNN algorithm cannot perform accurate classification for large $n$, since the dimension of the input feature vector becomes much larger than the amount of training data and the "curse of dimensionality" occurs.
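A condensed Python sketch of this i.i.d. experiment is given below (for illustration only: the repetition and test counts are reduced, whether the label distributions are redrawn in every repetition is our assumption, and the sketch reuses `nearest_empirical_distribution` from the sketch in Section 3.1).

```python
import numpy as np

rng = np.random.default_rng(1)

def run_iid_experiment(alphabet_size, n, n_test=1000, n_repeat=100, r=2):
    """Estimate the error probability of the nearest-empirical-distribution
    rule (8) with t = 1 training vector per label and i.i.d. elements."""
    errors, total = 0, 0
    for _ in range(n_repeat):
        # two random pmfs over the alphabet, one per label
        pmfs = rng.random((2, alphabet_size))
        pmfs /= pmfs.sum(axis=1, keepdims=True)
        train = [[rng.choice(alphabet_size, size=n, p=pmfs[x])] for x in (0, 1)]
        for x in (0, 1):
            tests = rng.choice(alphabet_size, size=(n_test, n), p=pmfs[x])
            for y in tests:
                errors += nearest_empirical_distribution(y, train, alphabet_size, r) != x
                total += 1
    return errors / total

# e.g., print(run_iid_experiment(alphabet_size=6, n=50))
```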
In Figure 6, we compare the performance of the proposed classifier for different values of $r$ when $|\mathcal{Y}| = 6$ with the derived upper bounds. As can be seen, for this example, the derived theoretical upper bounds have a similar slope to the exact error probabilities. Moreover, we can see that for this example the optimal value is $r = 1$. However, this is not always the case, and the optimal $r$ depends on $p_{Y^n|X}(y^n|x_k)$, $|\mathcal{Y}|$, and $|\mathcal{X}|$.

4.2. The Overlapping I.Non-I.D. Case with One Training Sample per Label

In this example, we consider the i.non-i.d. case where the probability distributions $p_i(y_i|x_k)$ are overlapping for all $i$, as shown in Figure 7. The small orthogonal lines on the x-axis in Figure 7 represent the alphabet symbols, i.e., the elements of $\mathcal{Y}$, and the probability of occurrence of a symbol $y_i$ is given by the value of the plotted probability distribution $p_i(y_i|x_k)$, $k = 1, 2$, at the corresponding line. By "overlapping", we mean the following. Let $\mathcal{Y}_v$ and $\mathcal{Y}_u$ denote the sets of outputs generated by $p_v(y_v|x_k)$ and $p_u(y_u|x_k)$, respectively. If, for any $v$ and $u$, $\mathcal{Y}_v \cap \mathcal{Y}_u \neq \emptyset$ holds, we say that the output alphabets are overlapping.
To demonstrate the performance of our proposed classifier in the overlapping case, we assume that we have two different labels, $\mathcal{X} = \{x_1, x_2\}$, where the corresponding conditional probability distributions $p_i(y_i|x_1)$ and $p_i(y_i|x_2)$ are obtained as follows. For a given $n$, let $\mathcal{Y} = \{-n, -n+1, \ldots, 0, \ldots, n-1, n\}$ be the set of all alphabet symbols. Note that the size of $\mathcal{Y}$ grows with $n$. Moreover, let $u_i$ and $v_i$ ($1 \leq i \leq n$) be vectors of length $2n + 1$, given by
$$u_i = \left[ 0, \ldots, 0, \tfrac{1}{i(i+1)}, \tfrac{2}{i(i+1)}, \ldots, \tfrac{i}{i(i+1)}, \tfrac{i+1}{i(i+1)}, \tfrac{i}{i(i+1)}, \ldots, \tfrac{1}{i(i+1)}, 0, \ldots, 0 \right],$$
$$v_i = \left[ 0, \ldots, 0, \tfrac{1}{i(i+1)}, \tfrac{1}{i(i+1)}, \ldots, \tfrac{1}{i(i+1)}, \tfrac{1}{i(i+1)}, 0, \ldots, 0 \right].$$
The number of zeros on each side of the vectors $u_i$ and $v_i$ is $(n - i)$. To generate a feature vector from label $x_1$ ($x_2$), we generate the vector $y^n = (y_1, y_2, \ldots, y_n)$, where the $i$-th element $y_i$ takes values from the set $\mathcal{Y}$ with probability distribution $p_i(y_i|x_1)$ ($p_i(y_i|x_2)$) given by the entry of $u_i$ ($v_i$) that corresponds to the alphabet symbol $y_i$.
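A sketch of how one might generate such overlapping i.non-i.d. feature vectors is shown below. It reflects our reading of the construction above, namely position-dependent triangular weights for $x_1$ and uniform weights for $x_2$ on a support of $2i+1$ symbols centered at $0$, normalized to sum to one; the exact weights of $u_i$ and $v_i$ in the paper may differ, and all names are ours.

```python
import numpy as np

rng = np.random.default_rng(2)

def overlapping_pmfs(n):
    """Per-position pmfs over Y = {-n, ..., n} (indexed 0..2n) for two labels.

    Position i uses a support of 2*i+1 symbols centered at 0 (n - i zeros on
    each side): triangular weights for label x1, uniform weights for label x2
    (our interpretation, normalized to sum to one)."""
    P = np.zeros((2, n, 2 * n + 1))
    for i in range(1, n + 1):
        support = np.arange(n - i, n + i + 1)
        tri = np.concatenate([np.arange(1, i + 2), np.arange(i, 0, -1)])
        P[0, i - 1, support] = tri / tri.sum()        # label x1: triangular
        P[1, i - 1, support] = 1.0 / len(support)     # label x2: uniform
    return P

def sample(P, label, rng):
    n = P.shape[1]
    return np.array([rng.choice(P.shape[2], p=P[label, i]) for i in range(n)])

P = overlapping_pmfs(n=10)
y = sample(P, label=0, rng=rng)   # one feature vector from label x1
```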
The simulation is carried out as follows. For each n, we generate one training feature vector for each label. Then, we generate 1000 feature vectors for each label and pass them through our proposed classifier, the naive Bayes classifier, and the KNN algorithm and calculate the error probability. We change the length of the feature vector from n = 1 to n = 100 and repeat the simulation 1000 times and then plot the error probability.
As shown in Figure 8, there is a huge difference between the performance of the two benchmark classifiers and the proposed classifier. The error probability of the naive Bayes classifier is almost 0.5 for all shown values of n as it is susceptible to the problem of unseen alphabets in the training vectors. The error probability of the KNN classifier is also almost 0.5 for n > 20 as it is susceptible to the “curse of dimensionality”. However, the error probability of our proposed classifier continuously decays as n increases.
In Figure 9, we run the same experiments as in Figure 8 but with T = 100 , i.e., 100 training feature vectors per label. As can be seen from Figure 9, the performance of the proposed classifier is better than the naive Bayes classifier, for n > 15 . Since | Y | = 2 n + 1 , for small values of n, the naive Bayes classifier has access to many training samples and, thereby, its performance is very close to the case when the probability distribution p Y n | X ( y n | x ) is known, i.e., to the maximum-likelihood classifier, and hence it has the optimal performance. As n increases, the number of alphabets rises, i.e., | Y | rises, and due to the aforementioned issue of the naive Bayes classifier with unseen alphabets, our proposed classifier performs much better classification than the naive Bayes classifier. Furthermore, note that the error probability of our proposed classifier decays exponentially as n increases which is not the case with the naive Bayes classifier. Moreover, Figure 9 also shows the theoretical upper bound on the error probability we derived in (11).

4.3. The Non-Overlapping I.Non-I.D. Case with One Training Sample for Each Label

In this example, we consider the i.non-i.d. case where the probability distributions p j ( y j | x i ) are non-overlapping for all j as shown in Figure 10, where we defined “overlapping” in Section 4.2. Hence, we test the other extreme in terms of possible distribution of the elements in the feature vectors Y n .
To demonstrate the performance of our proposed classifier in the non-overlapping case, we assume that we have two different labels, $\mathcal{X} = \{x_1, x_2\}$, where the corresponding conditional probability distributions $p_i(y_i|x_1)$ and $p_i(y_i|x_2)$ are obtained as follows. For a given $n$, let $\mathcal{Y} = \{1, 2, 3, \ldots, (n+1)^2 - 1\}$ be the set of all alphabet symbols of the elements in the feature vectors. Note again that the size of $\mathcal{Y}$ grows with $n$. In addition, let $u_i$ and $v_i$, for $1 \leq i \leq n$, be vectors of length $(n+1)^2 - 1$, given by
$$u_i = \left[ 0, \ldots, 0, \tfrac{1}{i(i+1)}, \tfrac{2}{i(i+1)}, \ldots, \tfrac{i}{i(i+1)}, \tfrac{i+1}{i(i+1)}, \tfrac{i}{i(i+1)}, \ldots, \tfrac{1}{i(i+1)}, 0, \ldots, 0 \right],$$
$$v_i = \left[ 0, \ldots, 0, \tfrac{1}{i(i+1)}, \tfrac{1}{i(i+1)}, \ldots, \tfrac{1}{i(i+1)}, \tfrac{1}{i(i+1)}, 0, \ldots, 0 \right].$$
The number of zeros on the left-hand side of $u_i$ and $v_i$ is $i^2 - 1$. To generate a feature vector from label $x_1$ ($x_2$), we generate the vector $y^n = (y_1, y_2, \ldots, y_n)$, where $y_i$ takes values from the set $\mathcal{Y}$ with probability distribution $p_i(y_i|x_1) = u_i(y_i)$ ($p_i(y_i|x_2) = v_i(y_i)$), where $u_i(y_i)$ denotes the $y_i$-th entry of $u_i$.
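Under the same interpretation as the sketch in Section 4.2 (triangular versus uniform weights, normalized to sum to one; our assumption), the non-overlapping per-position distributions can be generated by placing the support of position $i$ on the disjoint block $\{i^2, \ldots, (i+1)^2 - 1\}$, as in the short sketch below.

```python
import numpy as np

def non_overlapping_pmfs(n):
    """Per-position pmfs over Y = {1, ..., (n+1)^2 - 1}, indexed from 0.

    Position i has support {i^2, ..., (i+1)^2 - 1} (i^2 - 1 zeros to the left),
    so the supports of different positions are disjoint."""
    size = (n + 1) ** 2 - 1
    P = np.zeros((2, n, size))
    for i in range(1, n + 1):
        support = np.arange(i ** 2 - 1, i ** 2 + 2 * i)   # 2i+1 consecutive slots
        tri = np.concatenate([np.arange(1, i + 2), np.arange(i, 0, -1)])
        P[0, i - 1, support] = tri / tri.sum()            # label x1
        P[1, i - 1, support] = 1.0 / len(support)         # label x2
    return P
```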
The simulation is carried out as follows. For each $n$, we generate one training feature vector for each label. Then we generate 250 feature vectors for each label, pass them through our proposed classifier, the naive Bayes classifier, and KNN, and calculate the error probabilities. We change the length of the vector from 1 to 80, repeat the simulation 250 times, and then plot the error probability. As shown in Figure 11, there is a huge difference between the performance of the proposed classifier and the two benchmark classifiers. The error probability of the naive Bayes classifier is almost 0.5 for all shown values of $n$, as it is susceptible to the issue of unseen alphabet symbols in the training feature vector. The error probability of the KNN classifier is almost 0.5 for all shown values of $n > 30$, as it becomes susceptible to the "curse of dimensionality". However, the error probability of our proposed classifier still decays continuously as $n$ increases.
Note that, in our numerical examples, we compared our algorithm with the benchmark schemes on two extreme cases of i.non-i.d. vectors, referred to as “overlapping” and “non-overlapping”. Any other i.non-i.d. vector can be represented as a combination of the “overlapping” and “non-overlapping” vectors. Since our algorithm works better than the benchmark schemes for small t on both these cases, it will work better than the benchmark schemes on any combination between “overlapping” and “non-overlapping” vectors, i.e., for any other i.non-i.d. vectors.

5. Conclusions

In this paper, we proposed a supervised classification algorithm that assigns labels to input feature vectors with independent but non-identically distributed elements, a statistical property found in practice. We proved that the proposed classifier is asymptotically optimal, since its error probability tends to zero as the length of the input feature vectors grows. We showed that this asymptotic optimality is achievable even when only one training feature vector per label is available. In the numerical examples, we compared the proposed classifier with the naive Bayes classifier and the KNN algorithm. Our numerical results show that the proposed classifier outperforms the benchmark classifiers when the number of training data is small and the length of the input feature vectors is sufficiently large.

Author Contributions

Methodology, F.S. and N.Z.; software, F.S.; formal analysis, F.S. and N.Z.; investigation, F.S.; supervision, N.Z.; writing—original draft preparation, F.S.; writing—review and editing, N.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data sharing is not applicable to this article as no new data were created or analyzed in this study.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Proof of Corollary 2

The proof is almost identical to the proof of Theorem 1; however, here we derive a looser upper bound on the error probability than that in (11), which is independent of $P_{\hat{y}_i^{nt}}$.
Without loss of generality, we assume that $x_1$ is the input to $p_{Y^n|X}(y^n|x)$ and that $y^n$ is observed at the classifier.
Let $B_{k,l}^{\epsilon}$, for $1 \leq k \leq |\mathcal{Y}|$ and $1 \leq l \leq |\mathcal{X}|$, be a set defined as
$$B_{k,l}^{\epsilon} = \left\{ \hat{y}^{nt} : \left| \frac{I[\hat{y}^{nt} = y_k]}{nt} - \bar{p}(y_k|x_l) \right| \leq \frac{\epsilon}{\sqrt[3]{t}} \right\}. \tag{A1}$$
Let $B_l^{\epsilon} = \bigcap_{k=1}^{|\mathcal{Y}|} B_{k,l}^{\epsilon}$. For $\hat{y}_1^{nt} \in B_1^{\epsilon}$, we have
$$\left( \sum_{k=1}^{|\mathcal{Y}|} \left| \frac{I[\hat{y}_1^{nt} = y_k]}{nt} - \bar{p}(y_k|x_1) \right|^r \right)^{1/r} \overset{(a)}{\leq} \left( \sum_{k=1}^{|\mathcal{Y}|} \left(\frac{\epsilon}{\sqrt[3]{t}}\right)^r \right)^{1/r}, \tag{A2}$$
where $(a)$ follows from (A1). Using the same derivation as in (18), for any $y^n \in A^{\epsilon}$ and for $\hat{y}_1^{nt} \in B_1^{\epsilon}$, we have
$$\left( \sum_{k=1}^{|\mathcal{Y}|} \left| \frac{I[y^n = y_k]}{n} - \frac{I[\hat{y}_1^{nt} = y_k]}{nt} \right|^r \right)^{1/r} \leq |\mathcal{Y}|^{1/r} \epsilon + |\mathcal{Y}|^{1/r} \frac{\epsilon}{\sqrt[3]{t}}. \tag{A3}$$
On the other hand, in the same way as in the derivation of (21), for each $i \neq 1$, we have
$$\left( \sum_{k=1}^{|\mathcal{Y}|} \left| \frac{I[y^n = y_k]}{n} - \frac{I[\hat{y}_i^{nt} = y_k]}{nt} \right|^r \right)^{1/r} \geq \big\| P_{\hat{y}_i^{nt}} - \bar{P}_1 \big\|_r - |\mathcal{Y}|^{1/r} \epsilon. \tag{A4}$$
Now, for any $\hat{y}_i^{nt} \in B_i^{\epsilon}$, we have
$$\big\| P_{\hat{y}_i^{nt}} - \bar{P}_1 \big\|_r + \left( \sum_{k=1}^{|\mathcal{Y}|} \left(\frac{\epsilon}{\sqrt[3]{t}}\right)^r \right)^{1/r} \overset{(a)}{\geq} \left( \sum_{k=1}^{|\mathcal{Y}|} \left| \frac{I[\hat{y}_i^{nt} = y_k]}{nt} - \bar{p}(y_k|x_1) \right|^r \right)^{1/r} + \left( \sum_{k=1}^{|\mathcal{Y}|} \left| \frac{I[\hat{y}_i^{nt} = y_k]}{nt} - \bar{p}(y_k|x_i) \right|^r \right)^{1/r} \overset{(b)}{\geq} \left( \sum_{k=1}^{|\mathcal{Y}|} \big| \bar{p}(y_k|x_1) - \bar{p}(y_k|x_i) \big|^r \right)^{1/r}, \tag{A5}$$
where $(a)$ follows from (A1) and $(b)$ is again due to the Minkowski inequality. The expression in (A5) can be written equivalently as
$$\big\| P_{\hat{y}_i^{nt}} - \bar{P}_1 \big\|_r \geq \big\| \bar{P}_i - \bar{P}_1 \big\|_r - |\mathcal{Y}|^{1/r} \frac{\epsilon}{\sqrt[3]{t}}, \tag{A6}$$
where $i \neq 1$. Using the bounds in (A6) and (A4), for any $i \neq 1$ we have
$$\left( \sum_{k=1}^{|\mathcal{Y}|} \left| \frac{I[y^n = y_k]}{n} - \frac{I[\hat{y}_i^{nt} = y_k]}{nt} \right|^r \right)^{1/r} \geq \big\| \bar{P}_i - \bar{P}_1 \big\|_r - |\mathcal{Y}|^{1/r} \epsilon \left( 1 + \frac{1}{\sqrt[3]{t}} \right). \tag{A7}$$
Using the bounds in (A3) and (A7), we now relate the left-hand sides of (A3) and (A7) as follows. As long as the following inequality holds for each $i \neq 1$,
$$|\mathcal{Y}|^{1/r} \epsilon \left( 1 + \frac{1}{\sqrt[3]{t}} \right) < \big\| \bar{P}_i - \bar{P}_1 \big\|_r - |\mathcal{Y}|^{1/r} \epsilon \left( 1 + \frac{1}{\sqrt[3]{t}} \right), \tag{A8}$$
which is equivalent to the following for $i \neq 1$
$$\epsilon < \frac{\big\| \bar{P}_i - \bar{P}_1 \big\|_r}{2\big(1 + t^{-1/3}\big) |\mathcal{Y}|^{1/r}}, \tag{A9}$$
we have the following for $i \neq 1$
$$\left( \sum_{k=1}^{|\mathcal{Y}|} \left| \frac{I[y^n = y_k]}{n} - \frac{I[\hat{y}_1^{nt} = y_k]}{nt} \right|^r \right)^{1/r} \overset{(a)}{\leq} |\mathcal{Y}|^{1/r} \epsilon \left( 1 + \frac{1}{\sqrt[3]{t}} \right) \overset{(b)}{<} \big\| \bar{P}_i - \bar{P}_1 \big\|_r - |\mathcal{Y}|^{1/r} \epsilon \left( 1 + \frac{1}{\sqrt[3]{t}} \right) \overset{(c)}{\leq} \left( \sum_{k=1}^{|\mathcal{Y}|} \left| \frac{I[y^n = y_k]}{n} - \frac{I[\hat{y}_i^{nt} = y_k]}{nt} \right|^r \right)^{1/r}, \tag{A10}$$
where $(a)$, $(b)$, and $(c)$ follow from (A3), (A8), and (A7), respectively. Thereby, from (A10), we have the following for $i \neq 1$
$$\left( \sum_{k=1}^{|\mathcal{Y}|} \left| \frac{I[y^n = y_k]}{n} - \frac{I[\hat{y}_1^{nt} = y_k]}{nt} \right|^r \right)^{1/r} < \left( \sum_{k=1}^{|\mathcal{Y}|} \left| \frac{I[y^n = y_k]}{n} - \frac{I[\hat{y}_i^{nt} = y_k]}{nt} \right|^r \right)^{1/r}, \tag{A11}$$
or, equivalently,
$$\big\| P_{y^n} - P_{\hat{y}_1^{nt}} \big\|_r < \big\| P_{y^n} - P_{\hat{y}_i^{nt}} \big\|_r. \tag{A12}$$
Once again, we have obtained the following: if there is an $\epsilon$ for which (A9) holds for all $i \neq 1$, and for that $\epsilon$ we have $y^n \in A^{\epsilon}$ and $\hat{y}_l^{nt} \in B_l^{\epsilon}$ for all $1 \leq l \leq |\mathcal{X}|$, then (A12) holds for all $i \neq 1$, and thereby our classifier detects $x_1$ as the correct label. Using this, we can upper bound the error probability as
$$P_e = 1 - \Pr\big\{ \hat{x} = x_1 \big\} \leq 1 - \Pr\Big\{ y^n \in A^{\epsilon} \cap \bigcap_{l=1}^{|\mathcal{X}|} \hat{y}_l^{nt} \in B_l^{\epsilon} \,\Big|\, \epsilon \in \mathcal{S} \Big\}, \tag{A13}$$
where $\mathcal{S}$ is a set defined as
$$\mathcal{S} = \left\{ \epsilon : \epsilon \leq \min_{i \neq 1} \frac{\big\| \bar{P}_i - \bar{P}_1 \big\|_r}{2\big(1 + t^{-1/3}\big) |\mathcal{Y}|^{1/r}} \right\}. \tag{A14}$$
The right-hand side of (A13) can be upper bounded as
$$1 - \Pr\Big\{ y^n \in A^{\epsilon} \cap \bigcap_{l=1}^{|\mathcal{X}|} \hat{y}_l^{nt} \in B_l^{\epsilon} \,\Big|\, \epsilon \in \mathcal{S} \Big\} = \Pr\Big\{ y^n \notin A^{\epsilon} \cup \bigcup_{l=1}^{|\mathcal{X}|} \hat{y}_l^{nt} \notin B_l^{\epsilon} \,\Big|\, \epsilon \in \mathcal{S} \Big\} \overset{(a)}{\leq} \Pr\big\{ y^n \notin A^{\epsilon} \,\big|\, \epsilon \in \mathcal{S} \big\} + \sum_{l=1}^{|\mathcal{X}|} \Pr\big\{ \hat{y}_l^{nt} \notin B_l^{\epsilon} \,\big|\, \epsilon \in \mathcal{S} \big\}, \tag{A15}$$
where $(a)$ follows from Boole's inequality.
Using the same derivation as in (39), we have
$$\Pr\big\{ y^n \notin A^{\epsilon} \,\big|\, \epsilon \in \mathcal{S} \big\} \leq 2|\mathcal{Y}| e^{-2n\epsilon^2}. \tag{A16}$$
Similarly, following the derivation of (43), we have the following result for the second expression on the right-hand side of (A15)
$$\Pr\big\{ \hat{y}_l^{nt} \notin B_l^{\epsilon} \,\big|\, \epsilon \in \mathcal{S} \big\} \leq 2|\mathcal{Y}| e^{-2n t^{1/3} \epsilon^2}. \tag{A17}$$
Inserting (A16) and (A17) into (A15), and then inserting (A15) into (A13), we obtain the following upper bound on the error probability
$$P_e \leq 2|\mathcal{Y}| e^{-2n\epsilon^2} + 2|\mathcal{X}||\mathcal{Y}| e^{-2n t^{1/3} \epsilon^2}, \tag{A18}$$
where
$$\epsilon = \min_{i \neq j} \frac{\big\| \bar{P}_i - \bar{P}_j \big\|_r}{2\big(1 + t^{-1/3}\big) |\mathcal{Y}|^{1/r}}. \tag{A19}$$
Now, if $|\mathcal{Y}| \leq n^m$, (A18) can be written as
$$P_e \leq 2|\mathcal{Y}| e^{-2n\epsilon^2} + 2|\mathcal{X}||\mathcal{Y}| e^{-2n t^{1/3} \epsilon^2} \leq 2 n^m \exp\left( \frac{-2n \min_{i \neq j} \big\| \bar{P}_i - \bar{P}_j \big\|_r^2}{\big(2(1 + t^{-1/3})\big)^2 n^{2m/r}} \right) + 2|\mathcal{X}| n^m \exp\left( \frac{-2n t^{1/3} \min_{i \neq j} \big\| \bar{P}_i - \bar{P}_j \big\|_r^2}{\big(2(1 + t^{-1/3})\big)^2 n^{2m/r}} \right) \in O\left( n^m \exp\left( -n^{1 - \frac{2m}{r}} \right) \right). \tag{A20}$$
According to (A20), for a fixed r > 2 m , the right-hand side of (A20) moves to zero as n and, thereby, the classifier is asymptotically optimal.

References

1. Shalev-Shwartz, S.; Ben-David, S. Understanding Machine Learning: From Theory to Algorithms; Cambridge University Press: Cambridge, MA, USA, 2014.
2. Quinlan, J.R. Learning Efficient Classification Procedures and Their Application to Chess End Games; Springer: Berlin/Heidelberg, Germany, 1983; pp. 453–482.
3. Breiman, L.; Friedman, J.H.; Olshen, R.A.; Stone, C.J. Classification and Regression Trees; Wadsworth and Brooks: Monterey, CA, USA, 1984.
4. Boser, B.E.; Guyon, I.M.; Vapnik, V.N. A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, San Jose, CA, USA, 27–29 July 1992; pp. 144–152.
5. Cortes, C.; Vapnik, V. Support-vector networks. Mach. Learn. 1995, 20, 273–297.
6. Lallich, S.; Teytaud, O.; Prudhomme, E. Association Rule Interestingness: Measure and Statistical Validation; Studies in Computational Intelligence; Springer: Berlin, Germany, 2007; Volume 43, pp. 251–275.
7. Langley, P.; Iba, W.; Thompson, K. An analysis of Bayesian classifiers. In Proceedings of the Tenth National Conference on Artificial Intelligence, San Jose, CA, USA, 14–16 July 1992; AAAI Press: San Jose, CA, USA, 1992; pp. 223–228.
8. Aha, D.W. Lazy learning. In Artificial Intelligence Review; Kluwer Academic Publishers: Cambridge, MA, USA, 1997; pp. 7–10.
9. Bishop, C.M. Neural Networks for Pattern Recognition; Oxford University Press: Oxford, UK, 1995.
10. Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning: Data Mining; Springer Series in Statistics; Springer: New York, NY, USA, 2009.
11. Kanaya, F.; Han, T.S. The asymptotics of posterior entropy and error probability for Bayesian estimation. IEEE Trans. Inf. Theory 1995, 41, 1988–1992.
12. Gutman, M. Asymptotically optimal classification for multiple tests with empirically observed statistics. IEEE Trans. Inf. Theory 1989, 35, 401–408.
13. Sason, I.; Verdú, S. Arimoto–Rényi conditional entropy and Bayesian hypothesis testing. IEEE Trans. Inf. Theory 2018, 64, 4–25.
14. Wang, C.; She, Z.; Cao, L. Coupled attribute analysis on numerical data. In Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence, Beijing, China, 2–9 August 2013; AAAI Press: Beijing, China, 2013; pp. 1736–1742.
15. Wang, C.; Dong, X.; Zhou, F.; Cao, L.; Chi, C.H. Coupled attribute similarity learning on categorical data. IEEE Trans. Neural Netw. Learn. Syst. 2015, 26, 781–797.
16. Wang, C.; Cao, L.; Wang, M.; Li, J.; Wei, W.; Ou, Y. Coupled nominal similarity in unsupervised learning. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management, Glasgow, Scotland, UK, 24–28 October 2011; pp. 973–978.
17. Liu, C.; Cao, L. A coupled k-nearest neighbor algorithm for multi-label classification. In PAKDD; Springer International Publishing: Cham, Switzerland, 2015.
18. Liu, C.; Cao, L.; Yu, P. A hybrid coupled k-nearest neighbor algorithm on imbalance data. In Proceedings of the International Joint Conference on Neural Networks, Beijing, China, 6–11 July 2014; pp. 2011–2018.
19. Liu, C.; Cao, L.; Yu, P.S. Coupled fuzzy k-nearest neighbors classification of imbalanced non-iid categorical data. In Proceedings of the 2014 International Joint Conference on Neural Networks (IJCNN), Beijing, China, 8–13 July 2014; pp. 1122–1129.
20. Cao, L. Non-IIDness learning in behavioral and social data. Comput. J. 2013, 57, 1358–1370.
21. Cao, L. Coupling learning of complex interactions. Inf. Process. Manag. 2015, 51, 167–186.
22. Cao, L.; Zhang, H.; Zhao, Y.; Luo, D.; Zhang, C. Combined mining: Discovering informative knowledge in complex data. IEEE Trans. Syst. Man Cybern. Part B (Cybern.) 2011, 41, 699–712.
23. Getoor, L.; Taskar, B. Introduction to Statistical Relational Learning (Adaptive Computation and Machine Learning); The MIT Press: Cambridge, MA, USA, 2007.
24. Abed-Meraim, K.; Loubaton, P.; Moulines, E. A subspace algorithm for certain blind identification problems. IEEE Trans. Inf. Theory 2006, 43, 499–511.
25. Almeida, L. Linear and nonlinear ICA based on mutual information—the MISEP method. IEEE Trans. Signal Process. 2004, 84, 231–245.
26. Hyvärinen, A.; Oja, E. Independent component analysis: Algorithms and applications. Neural Netw. 2000, 13, 411–430.
27. Yeredor, A. Independent component analysis over Galois fields of prime order. IEEE Trans. Inf. Theory 2011, 57, 5342–5359.
28. Nguyen, H.; Zheng, R. Binary independent component analysis with OR mixtures. IEEE Trans. Signal Process. 2011, 59, 3168–3181.
29. Barlow, H.B. Unsupervised learning. Neural Comput. 1989, 1, 295–311.
30. Nakhaeizadeh, G.; Taylor, C.C.; Kunisch, G. Dynamic supervised learning: Some basic issues and application aspects. In Classification and Knowledge Organization; Springer: Berlin/Heidelberg, Germany, 1997; pp. 123–135.
31. Koppen, M. The Curse of Dimensionality; SAGE: London, UK, 2000; pp. 4–8.
32. Duda, R.O.; Hart, P.E.; Stork, D.G. Pattern Classification, 2nd ed.; Wiley-Interscience: Hoboken, NJ, USA, 2000.
33. Jain, A.K.; Duin, R.P.W.; Mao, J. Statistical pattern recognition: A review. IEEE Trans. Pattern Anal. Mach. Intell. 2000, 22, 4–37.
34. Xu, B.; Huang, J.; Williams, G.; Wang, Q.; Ye, Y. Classifying very high-dimensional data with random forests built from small subspaces. Int. J. Data Warehous. Min. 2012, 8, 44–63.
35. Ye, Y.; Wu, Q.; Huang, J.Z.; Ng, M.K.; Li, X. Stratified sampling for feature subspace selection in random forests for high dimensional data. Pattern Recognit. 2013, 46, 769–787.
36. Kouiroukidis, N.; Evangelidis, G. The effects of dimensionality curse in high dimensional kNN search. In Proceedings of the 15th Panhellenic Conference on Informatics, Kastoria, Greece, 30 September–2 October 2011; pp. 41–45.
37. Matusita, K. Classification based on distance in multivariate Gaussian cases. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Statistics; University of California Press: Regents, CA, USA, 1967; pp. 299–304.
38. Collins, M.; Schapire, R.E.; Singer, Y. Logistic regression, AdaBoost and Bregman distances. Mach. Learn. 2002, 48, 253–285.
39. Deisenroth, M.P.; Faisal, A.A.; Ong, C.S. Mathematics for Machine Learning; Cambridge University Press: Warsaw, Poland, 2020.
40. Lebanon, G.; Lafferty, J. Cranking: Combining rankings using conditional probability models on permutations. In Proceedings of the 19th International Conference on Machine Learning, San Francisco, CA, USA, 8–12 July 2002; pp. 363–370.
41. Hoeffding, W. On the distribution of the number of successes in independent trials. Ann. Math. Stat. 1956, 27, 713–721.
42. Hoeffding, W. Probability inequalities for sums of bounded random variables. J. Am. Stat. Assoc. 1963, 58, 13–30.
Figure 1. A typical structural modelling of the classification learning problem.
Figure 2. An alternative modeling of the classification learning problem.
Figure 3. An alternative modelling of the classification learning problem when the elements of $Y^n(X)$ are mutually independent but non-identically distributed (i.non-i.d.).
Figure 4. Comparison in error probability between the naive Bayes classifier, KNN, and the proposed classifier.
Figure 5. Comparison in error probability between the naive Bayes classifier, KNN, and the proposed classifier.
Figure 6. Comparison in error probability of the proposed classifier for different values of $r$ when $|\mathcal{Y}| = 6$. The related theoretical upper bounds for each value of $r$ are also given.
Figure 7. Illustration of the probability distributions $p_i(y_i|x_1)$ (upper figure) and $p_i(y_i|x_2)$ (lower figure), for $i = 1, 2, \ldots, n$.
Figure 8. Comparison in error probability between the naive Bayes classifier, KNN, and the proposed classifier ($T = 1$).
Figure 9. Comparison in error probability between the naive Bayes classifier and the proposed classifier ($T = 100$).
Figure 10. Illustration of the probability distributions $p_i(y_i|x_1)$ (upper figure) and $p_i(y_i|x_2)$ (lower figure), for $i = 1, 2, \ldots, n$.
Figure 11. Comparison in error probability between the naive Bayes classifier and the proposed classifier ($T = 1$).

