Data Descriptor

A Probabilistic Bag-to-Class Approach to Multiple-Instance Learning

by
Kajsa Møllersen
1,*,
Jon Yngve Hardeberg
2 and
Fred Godtliebsen
3
1
Department of Community Medicine, Faculty of Health Science, UiT The Arctic University of Norway, N-9037 Tromsø, Norway
2
Department of Computer Science, Faculty of Information Technology and Electrical Engineering, NTNU—Norwegian University of Science and Technology, N-2815 Gjøvik, Norway
3
Department of Mathematics and Statistics, Faculty of Science and Technology, UiT The Arctic University of Norway, N-9037 Tromsø, Norway
*
Author to whom correspondence should be addressed.
Submission received: 31 March 2020 / Revised: 3 June 2020 / Accepted: 23 June 2020 / Published: 26 June 2020
(This article belongs to the Special Issue Machine Learning in Image Analysis and Pattern Recognition)

Abstract:
Multi-instance (MI) learning is a branch of machine learning where each object (bag) consists of multiple feature vectors (instances)—for example, an image consisting of multiple patches and their corresponding feature vectors. In MI classification, each bag in the training set has a class label, but the instances are unlabeled. The instances are most commonly regarded as a set of points in a multi-dimensional space. Alternatively, the instances can be viewed as realizations of random vectors with a corresponding probability distribution, where the bag is the distribution rather than the realizations. By introducing the probability distribution space to bag-level classification problems, dissimilarities between probability distributions (divergences) can be applied. The bag-to-bag Kullback–Leibler information is asymptotically the best classifier, but the typical sparseness of MI training sets is an obstacle. We introduce bag-to-class divergence to MI learning, emphasizing the hierarchical nature of the random vectors that makes bags from the same class different. We propose two properties for bag-to-class divergences and an additional property for sparse training sets, and introduce a dissimilarity measure that fulfils them. Its performance is demonstrated on synthetic and real data. The probability distribution space is valid for MI learning, both for theoretical analysis and for applications.
Dataset License: CC BY 4.0

1. Introduction

1.1. Classification of Weakly Supervised Data

Machine-learning applications cover a wide variety of data types, and images are among the most successful areas. Machine learning has had an enormous impact on image analysis, especially by replacing small sets of hand-crafted features with large sets of computer-readable features, which often lack an apparent or intuitive meaning. The tasks and problems to which machine learning is applied can be divided broadly into unsupervised and supervised learning. In supervised learning, the training data consists of K objects, x, with corresponding class labels, y: {(x_1, y_1), …, (x_k, y_k), …, (x_K, y_K)}. An object is typically a vector of d feature values, x_k = (x_k1, …, x_kd), observed directly or extracted from, e.g., an image. In classification, the task is to build a classifier that correctly labels a new object. The training data is used to adjust the model according to the desired outcome, often maximizing the accuracy of the classifier.
For many types of images, only a small part of the image defines the class, but the label is available only at image level. This is common in medical images, such as histology slides, where the tumor cells typically make up a small proportion of the image. However, the location of those cells is not available for training. Multi-instance (MI) learning is a branch of machine learning that specifically targets problems where labels are available only at a superior level, and relates to other weakly supervised data problems, such as semi-supervised learning and transfer learning through label scarcity [1].

1.2. Multi-Instance Learning

In MI learning, each object is a set of feature vectors referred to as instances. The set X_k = {x_k1, …, x_kn_k}, where the n_k elements are vectors of length d, is referred to as a bag. The number of instances, n_k, varies from bag to bag, whereas the vector length is constant. In supervised MI learning, the training data consists of K sets and their corresponding class labels, {(X_1, y_1), …, (X_k, y_k), …, (X_K, y_K)}.
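As a minimal illustration (a sketch in Python with synthetic values, not code from the paper), the MI training data described above can be represented as a list of bags, each an array of n_k instances of fixed length d, with one label per bag:

```python
# A minimal sketch of the MI training-set structure: K bags, each a set of n_k
# instances of fixed length d, plus one class label per bag (instances unlabeled).
import numpy as np

rng = np.random.default_rng(0)
d = 5                                    # instance (feature vector) length, constant
K = 3                                    # number of bags in the training set
bags = [rng.normal(size=(rng.integers(10, 20), d)) for _ in range(K)]  # n_k varies per bag
labels = np.array([1, 0, 1])             # one class label per bag

for X_k, y_k in zip(bags, labels):
    print(f"bag with n_k = {X_k.shape[0]} instances of dimension {d}, label y_k = {y_k}")
```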
Figure 1a shows an image (bag), k, of benign breast tissue [2], divided into n_k segments with corresponding feature vectors (instances) x_k1, …, x_kn_k [3]. Correspondingly, Figure 1b shows malignant breast tissue.
The images in the data set have class labels; however, the individual segments do not. This is a key characteristic of MI learning—the instances are not labeled. MI learning includes instance classification [4], clustering [5], regression [5], and multi-label learning [6,7], but this article focuses on bag classification. MI learning can also be found as an integrated part of end-to-end methods for image analysis that generate patches, extract features, and perform feature selection [7]. See also [8] for an overview and discussion of end-to-end neural network MI learning methods.
The term “MI learning” was introduced in an application concerning molecules (bags) with different shapes (instances) and their ability to bind to other molecules [9]. A molecule binds if at least one of its shapes can bind. In MI terminology, the classes in binary classification are referred to as positive, pos, and negative, neg. The assumption that a positive bag contains at least one positive instance, and that a negative bag contains only negative instances, is referred to as the standard MI assumption.
Many new applications violate the standard MI assumption, such as image classification [10] and text categorization [11]. Consequently, successful algorithms meet more general assumptions, see e.g., the hierarchy of Weidmann et al. [12] or Foulds and Frank’s taxonomy [13]. For a more recent review of MI classification algorithms, see e.g., [14]. Amores [15] presented the three paradigms of instance space (IS), embedded space (ES), and bag space (BS). IS methods aggregate the outcome of single-instance classifiers applied to the instances of a bag, whereas ES methods map the instances to a vector, followed by use of a single-instance classifier. In the BS paradigm, the instances are transformed to a non-vectorial space where the classification is performed, avoiding the detour via single-instance classifiers. The non-vectorial space of probability functions has not yet been introduced to the BS paradigm, despite its analytical benefits, see Section 3.2 and Section 3.3.
Although both Carbonneau et al. [16] and Amores [15] defined a bag as a set of feature vectors, Foulds and Frank [13] stated that a bag can also be modelled as a probability distribution. The distinction is necessary in analysis of classification approaches, and both viewpoints offer benefits, see Section 6.1 for a discussion.

1.3. Bag Density and Class Sparsity

Optimal classification in MI learning depends on the number of instances per bag (bag density) and the number of bags per class in the training set (class density). Sample sparsity is a common obstacle in MI learning [16], which we address in Section 3.5. High bag density ensures a precise description of each bag, whereas high class density ensures precise modelling of each class when training the classifier. In image analysis, the number of patches corresponds to the number of instances, and is commonly a user input parameter. The number of images corresponds to the number of bags, and is limited by the training set itself.
The high resolution of today's images and the increasingly common practice of sharing the images themselves, instead of extracted features, ensure high bag density. The number of bags available for training is still limited, and will continue to be so in the foreseeable future, especially for medical images, where data collection is restricted by laws and regulations. This motivates an approach to MI learning that can exploit the increasing bag density and overcome the class sparsity.

1.4. A Probabilistic Bag-to-Class Approach to Multi-Instance Learning

We propose to model the bags as probability distributions and the instances as random samples. The bags are assumed to be random samples from their respective classes, and the instance-bag sampling forms a hierarchical distribution. The hierarchical distribution is novel for bag classification and novel outside the strict standard MI assumption. Unbiased estimators for the bag probability distributions ensure that as the number of instances increases (n_k → ∞), the discrepancy between the estimate and the underlying truth diminishes, taking advantage of increasing bag density. To overcome the problem of class sparsity, the instances are aggregated at class level.
We further propose to use a bag-to-class dissimilarity measure for classification. This is novel in the MI context, where dissimilarity measures have been either instance-to-instance or bag-to-bag. With the analytical framework of probability distributions and their dissimilarity measures, we present the optimal classifier for dense class sampling as a theoretical background and identify data-independent properties for bag classification under class sparsity.
The main contribution of this article is a bag-to-class dissimilarity measure for sparse training data. It builds on:
  • presenting the hierarchical model for general, non-standard MI assumptions (Section 3.3),
  • introduction of bag-to-class dissimilarity measures (Section 3.5), and
  • identification of two properties for bag-to-class divergence (Section 4.1).
The novelty is that it takes into account the class sparsity by comparing a bag to one class while conditioning on the other class.
In Section 5, the Kullback–Leibler (KL) information and the new dissimilarity measure are applied to data sets and the results are reported. Bags defined in the probability distribution space, in combination with bag-to-class divergence, constitute a new framework for MI learning, which is compared to other frameworks in Section 6.

2. Related Work and State-of-the-Art

The feature vector set viewpoint seems to be the most common, but the probabilistic viewpoint was introduced as early as 1998, under the assumption that instances of the same class are independent and identically distributed (i.i.d.) [17]. This assumption has been used in approaches such as estimating the expectation by the mean [18] or estimating class distribution parameters [19], but has also been criticized [20]. The hierarchical distribution was introduced for learnability theory under the standard MI assumption for instance classification in 2016 [4]. We expand its use to more general assumptions in Section 3.3.
Dissimilarities in MI learning have been categorized as instance-to-instance or bag-to-bag [15,21]. The bag-to-prototype approach in [21] offers an in-between category, but the theoretical framework is missing. Bag-to-class dissimilarity has not been studied within the MI framework, but has been used under the i.i.d.-given-class assumption for image classification [22]. The sparseness of training sets was also addressed: if the instances are aggregated at class level, a denser representation is achieved. Many MI algorithms use dissimilarities, e.g., graph distances [23], Hausdorff metrics [24], functions of the Euclidean distance [14,25], and distribution parameter-based distances [14]. The performances of dissimilarities on specific data sets have been investigated [14,19,21,25,26], but more analytical comparisons are missing. A large class of commonly used kernels are also distances [27], and hence, many kernel-based approaches in MI learning can be viewed as dissimilarity-based approaches. In [28], the Fisher kernel is used as input to a support vector machine (SVM), whereas in [11,20] the kernels are an integrated part of the methods.
The non-vectorial graph space was used in [20,23]. We introduce the non-vectorial space of probability functions as an extension within the BS paradigm for bag classification through dissimilarity measures between distributions in Section 3.2.
The KL information was applied in [22], and is a much-used divergence function. It is closely connected to the Fisher information [29] used in [28] and to the cross entropy used as loss function in [8]. We propose a conditional KL information in Section 4.2, which differs from the earlier proposed weighted KL information [30] whose weight is a constant function of X.
There is a wide variety in MI learning, both in methods and data sets, and it should be clear that state-of-the-art will depend on the type of data. Sudharshan et al. [31] gave a comparison of 12 MI classification methods and five state-of-the-art general classification methods on a well-described, publicly available histology image data set. All methods included have shown best performance on other data sets. The five methods that showed best performance for at least one of the data subsets serve as state-of-the-art baseline for evaluation in Section 5.3.
Cheplygina et al. [1] gave an overview of MI learning applications in different categories, but no comparison was made. The work of Sudharshan et al. falls into the “Histology/Microscopy” category, and the overview offers a potential expansion of histology state-of-the-art. Among the 12 listed articles, Zhang et al. [32] concluded that GPMIL outperforms Citation-kNN, which is one of the 12 methods in [31], but not one of the 5 best-performing. Kandemir et al. [3], Li et al. [33] and Tomczak et al. [34] presented methods that outperform GPMIL on a publicly available data set. We include these as comparison.
Of the remaining articles, none presents an extensive comparison to other methods; their data sets are either non-public [35,36,37,38] or no longer available [39], or the reference is not complete [40,41], which makes them unsuitable for comparison. Jia et al. [42] presented a segmentation method, which is therefore not comparable.

3. Theoretical Background and Intuitions

3.1. Notation

Subscripts and superscripts pos and neg refer to the class label of the bag; subscripts and superscripts + and − refer to the unknown instance label.
  • X: instance random vector
  • C: class, either pos or neg
  • B: bag
  • P(·): probability distribution
  • x_ki: feature vector (instance) i in set k, i = 1, …, n_k
  • X_k: set of feature vectors k, of size n_k
  • y_k: bag label
  • 𝒳: sample space for instances
  • 𝒳^+: sample space for positive instances
  • 𝒳^−: sample space for negative instances
  • ℬ_pos: sample space of positive bags
  • ℬ_neg: sample space of negative bags
  • P(C | X_k): posterior class probability, given the instance sample
  • Θ: parameter random vector
  • θ_k: parameter vector
  • P_bag(X) = P(X | B): probability distribution for instances in bag B
  • P(X | θ_k): parameterized probability distribution of bag k
  • P_pos(X) = P(X | pos): probability distribution for instances from the positive class
  • P_neg(X) = P(X | neg): probability distribution for instances from the negative class
  • τ_i: instance label
  • π_k: probability of positive instances
  • D(P_k, P_ℓ) = D(P_k(X), P_ℓ(X)): divergence from P_k(X) to P_ℓ(X)
  • f_k(x) = f(x | θ_k): probability density function (PDF) for bag k
  • D(f_k, f_ℓ) = D(f_k(x), f_ℓ(x)): divergence from f_k(x) to f_ℓ(x)
We assume P(X) < ∞, and equivalently f(x) < ∞, for all distributions.

3.2. The Non-Vectorial Space of Probability Functions

The intuition behind the probabilistic approach in MI learning can be understood through image analysis and tumor classification. Figure 1a represents parts of a tumor, chosen carefully for diagnostic purposes. The process from biological material to image contains steps whose outcomes are influenced by subjective choices and randomness: the precise day the patient is admitted influences the state of the tumor; the specific parts of the tumor extracted for staining are a choice; the actual stain varies from batch to batch; and the imaging equipment has multiple parameter settings. All this means that the same tumor would have produced a different image under different circumstances. The process from image to feature vector set also contains several steps: patch size, grid or random patches, color conversion, etc. In summary, the observed feature vectors are a representation of an underlying object, and that representation may vary, even if the object remains fixed.
From the probabilistic viewpoint, an instance, x, is a realization of a random vector, X, with probability distribution P(X) and sample space 𝒳. The bag is the probability distribution P(X), and the set of instances, X_k, consists of multiple realizations of X. The task of an MI classifier is to classify the bag given the observations, X_k.
The posterior class probability, P ( C | X k ) , is an effective classifier if the standard MI assumption holds, since it is defined as:
P(pos | X_k) = 1 if any x_ki ∈ 𝒳^+, i = 1, …, n_k, and 0 otherwise,
where 𝒳^+ is the positive instance space, and the positive and negative instance spaces are disjoint.
Bayes' rule, P(C | X) ∝ P(X | C) P(C), can be used when the posterior probability is unknown. An assumption used to estimate the probability distribution of an instance given the class, P(X | C), is that instances from bags of the same class are i.i.d. random samples. However, this is a poor description for MI learning.

3.3. Hierarchical Distributions

As an illustrative example, let the instances be the colors of image patches from the class sea or desert, and let image k depict a blue sea as in Figure 2a, with instances X_k, and image ℓ depict a turquoise sea as in Figure 2b, with instances X_ℓ. The instances are realizations from P(X | θ_k) and P(X | θ_ℓ), respectively, where θ is the parameter indicating the colors. If the instance distribution were dependent only on class, then θ_k = θ_ℓ = θ_sea, which is clearly not the case. Instance distributions are dependent not only on class, but also on bag. The random vectors in X_k are i.i.d., but have a different distribution than those in X_ℓ. An important distinction between uncertain objects, whose distribution depends solely on the class label [43,44], and MI learning is that the instances of two bags from the same class are not from the same distribution.
The dependency structure of MI learning can be described as a hierarchical distribution (Equation (1)), where a bag, B, is defined as the probability distribution of its instances, P(X | B), and the bag space, ℬ, is a set of distributions. A bag is itself a realization from the sample space of bags, whose distribution depends on the class. The generative model of instances from a positive or negative bag follows a hierarchical distribution:
X | B ∼ P(X | B)        X | B ∼ P(X | B)
B ∼ P(B | pos)    or    B ∼ P(B | neg),    (1)
respectively. From a practical viewpoint, P ( X | B ) can be considered parametric functions, P ( X | θ ) , where the sampling of a bag corresponds to sampling the parameter vector θ that defines its distribution:
X | θ_pos ∼ P(X | θ_pos)        X | θ_neg ∼ P(X | θ_neg)
Θ_pos ∼ P(Θ_pos)      or      Θ_neg ∼ P(Θ_neg).
The parametric generative model is shown in Figure 3.
The common view in MI learning is that a bag consists of positive and negative instances, which corresponds to a bag being a mixture of a positive and a negative distribution. Consider tumor images labeled pos or neg, with instances extracted from patches. Let P(X | θ_k^+) and P(X | θ_k^−) denote the probability distributions of the positive and negative segments, respectively, of image k. The distribution of bag k is a mixture distribution:
P(X | π_k, θ_k^+, θ_k^−) = p_k P(X | θ_k^+) + (1 − p_k) P(X | θ_k^−),
where p_k = Σ_{i=1}^{n_k} τ_i / n_k, and τ_i = 1 if instance i is positive. The parameter vector (π_k, θ_k^+, θ_k^−) defines the bag. The probability of positive segments, π_k, depends on the image's class label, and hence π_k is sampled from P(Π_pos) or P(Π_neg). The characteristics of positive and negative segments vary from image to image. Hence, θ_k^+ and θ_k^− are realizations of random variables, with corresponding probability distributions P(Θ^+) and P(Θ^−). The generative model of instances from a positive bag is:
X | τ, θ^+, θ^− ∼ P(X | θ^+) if τ = 1,  P(X | θ^−) if τ = 0
T | π_pos:  P(τ = 1) = π_pos,  P(τ = 0) = 1 − π_pos
Π_pos ∼ P(Π_pos),    Θ^+ ∼ P(Θ^+),    Θ^− ∼ P(Θ^−).    (3)
The corresponding sampling procedure from positive bag, k, is
  • Step 1: Draw π_k from P(Π_pos), θ_k^+ from P(Θ^+), and θ_k^− from P(Θ^−). These three parameters define the bag.
  • Step 2: For i = 1, …, n_k, draw τ_i from P(T | π_k); draw x_i from P(X | θ_k^+) if τ_i = 1, and from P(X | θ_k^−) otherwise.
The generative model and sampling procedure for negative bags are equivalent to that of positive bags.
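To make the two-step sampling procedure concrete, the following sketch draws one positive bag under assumed parametric choices (not those of the paper): a Beta distribution for P(Π_pos), Gaussians for P(Θ^+) and P(Θ^−), and Gaussian instance distributions P(X | θ):

```python
# A sketch of the two-step sampling of a positive bag, with illustrative (assumed)
# parametric choices for the hierarchical distributions.
import numpy as np

rng = np.random.default_rng(1)

def sample_positive_bag(n_k):
    # Step 1: draw the bag-level parameters that define this bag.
    pi_k = rng.beta(2, 8)                          # probability of a positive instance
    theta_pos = rng.normal(loc=15.0, scale=3.0)    # mean of the positive instance distribution
    theta_neg = rng.normal(loc=0.0, scale=3.0)     # mean of the negative instance distribution
    # Step 2: for each instance, draw its latent label tau_i, then the instance itself.
    tau = rng.random(n_k) < pi_k
    x = np.where(tau,
                 rng.normal(theta_pos, 1.0, n_k),
                 rng.normal(theta_neg, 1.0, n_k))
    return x, tau                                  # tau is unobserved in MI learning

instances, latent_labels = sample_positive_bag(n_k=50)
print(instances[:5], latent_labels[:5])
```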
By imposing restrictions, assumptions can be accurately described, e.g., the standard MI assumption: at least one positive instance in a positive bag, P(p_k ≥ 1/n_k) = 1; no positive instances in a negative bag, P(Π_neg = 0) = 1; and the positive and negative instance spaces are disjoint.
Equation (3) is the generative model of MI problems, assuming that the instances have unknown class labels and that the distributions are parametric. The parameters π_k, θ_k^+ and θ_k^− are i.i.d. samples from their respective distributions, but are not observed and are hard to estimate due to the very nature of MI learning: the instances are not labeled. Instead, P(X | B) can be estimated from the observed instances, and a divergence function can serve as classifier.
The instance i.i.d. assumption is not inherent to the probability distribution viewpoint, but the asymptotic results for the KL information discussed in Section 3.5 rely on it. In many applications, such as image analysis with sliding windows, the instances are best represented as dependent samples, but the dependencies are hard to estimate, and the independence assumption is often the best approximation. Doran and Ray [4] showed that the independence assumption is an approximation of dependent instances, but comes with the cost of slower convergence.

3.4. Dissimilarities in MI Learning

The information contained at bag level is converted before it is fed into a classifier. If the bags are sets, they are commonly converted into distances. Dissimilarities in MI learning can be categorized as instance-to-instance, bag-to-bag or bag-to-class. Amores [15] implicitly assumed metricity for dissimilarity functions [27] in the BS paradigm, but there is nothing inherent to MI learning that imposes these restrictions. In the case where bags are probability distributions, distances are no longer applicable, since the distributions live in a non-vectorial space. Distances are a special case of dissimilarity functions, and the equivalent for probability distributions are referred to as divergences, D(P_k(X), P_ℓ(X)). Although distances fulfil three properties by definition—among them symmetry and zero distance for identical sets—divergences do not, by definition, have such properties.
A group of divergences named f-divergences has properties that are reasonable to demand for measuring the ability to distinguish probability distributions [45,46]:
  • Equality and orthogonality: An f-divergence takes its minimum when the two probability functions are equal and its maximum when they are orthogonal. This means that two identical bags will have minimum dissimilarity between them, and that two bags without shared sample space will have maximum dissimilarity. A definition of orthogonal distributions can be found in [47].
  • Monotonicity: The f-divergences possess a monotonicity property that can be thought of as an equivalent to the triangle property for distances: for a family of densities with monotone likelihood ratio, if a < θ_1 < θ_2 < θ_3 < b, then D(P(X | θ_1), P(X | θ_2)) ≤ D(P(X | θ_1), P(X | θ_3)). This is valid, e.g., for Gaussian densities with equal variance and mean θ. This means that if the distance between θ_1 and θ_3 is larger than the distance between θ_1 and θ_2, the divergence is larger or equal. The f-divergences are not symmetric by definition, but some of them are.
Divergences as functions of probability distributions have not been used in MI learning, due to the lack of a probability function space defined for the BS paradigm, despite the benefit of analysis independent of specific data sets [48]. Cheplygina et al. [14] proposed using the Cauchy-Schwarz divergence with a Gaussian kernel, but as a function of the instances in the bag-to-bag setting. The KL information [29] is a non-symmetric f-divergence, often used in both statistics and computer science, and is defined as follows for two probability density functions (PDFs) f_k(x) and f_ℓ(x):
D_KL(f_k, f_ℓ) = ∫ f_k(x) log( f_k(x) / f_ℓ(x) ) dx.    (4)
An example of a symmetric f-divergence is the Bhattacharyya (BH) distance, defined as
D_BH(f_k, f_ℓ) = − log ∫ √( f_k(x) f_ℓ(x) ) dx,    (5)
and can be a better choice if the absolute difference, and not the ratio, differentiates the two PDFs. The appropriate divergence for a specific task can be chosen based on identified properties, e.g., for clustering [49], or a new dissimilarity function can be proposed [50].
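A minimal numerical sketch of the two divergences in Equations (4) and (5), evaluated on a grid for two one-dimensional Gaussian densities (an assumed example, not data from the paper):

```python
# Grid approximation of the KL information (Eq. 4) and BH distance (Eq. 5)
# for two assumed univariate Gaussian densities.
import numpy as np
from scipy.stats import norm

x = np.linspace(-10, 10, 20001)
dx = x[1] - x[0]
f_k = norm.pdf(x, loc=0.0, scale=1.0)
f_l = norm.pdf(x, loc=1.0, scale=1.5)

eps = 1e-300                                                   # guard against log(0)
D_KL = np.sum(f_k * np.log((f_k + eps) / (f_l + eps))) * dx    # Kullback-Leibler information
D_BH = -np.log(np.sum(np.sqrt(f_k * f_l)) * dx)                # Bhattacharyya distance

print(f"KL(f_k, f_l) = {D_KL:.4f},  BH(f_k, f_l) = {D_BH:.4f}")
```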

3.5. Bag-to-Class Dissimilarity

Bag-to-bag classification can be thought of as model selection: two bags from the training set, f_k(x) and f_ℓ(x), are the models, and the unlabeled bag, f_bag(x), is the sample distribution, which is labeled according to which model it resembles the most. The log-ratio test is the most powerful test for model selection under certain conditions (Neyman–Pearson lemma). It is then possible to perform the log-ratio test between f_bag(x) and each of the bags in the training set.
The training set in MI learning consists of the instances, since the bag distributions are unknown. Under the assumption that the instances from each bag are i.i.d. samples, the KL information has a special role in model selection, both from the frequentist and the Bayesian perspective. Let f_bag(x) be the sample distribution (unlabeled bag), and let f_k(x) and f_ℓ(x) be two models (labeled bags). Then the expectation over f_bag(x) of the log-ratio of the two models, E{log(f_k(x)/f_ℓ(x))}, is equal to D_KL(f_bag, f_ℓ) − D_KL(f_bag, f_k). In other words, the log-ratio test reveals the model closest to the sampling distribution in terms of KL information [51]. From the Bayesian viewpoint, the Akaike Information Criterion (AIC) reveals the model closest to the data in terms of KL information, and is asymptotically equivalent to the Bayes factor under certain assumptions [52].
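The identity used above can be checked numerically. The following sketch (an assumed Gaussian example) compares the Monte Carlo estimate of E{log(f_k(x)/f_ℓ(x))} under f_bag with the difference D_KL(f_bag, f_ℓ) − D_KL(f_bag, f_k) computed on a grid:

```python
# Numerical check of E_{f_bag}[log(f_k/f_l)] = D_KL(f_bag, f_l) - D_KL(f_bag, f_k)
# for assumed Gaussian densities.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
f_bag = norm(loc=0.3, scale=1.0)     # sampling distribution (unlabeled bag)
f_k = norm(loc=0.0, scale=1.0)       # model 1 (labeled bag)
f_l = norm(loc=2.0, scale=1.0)       # model 2 (labeled bag)

x = f_bag.rvs(size=200_000, random_state=rng)
lhs = np.mean(f_k.logpdf(x) - f_l.logpdf(x))          # expected log-ratio under f_bag

def kl(p, q, grid=np.linspace(-10, 10, 20001)):
    # Grid approximation of the KL information between two frozen distributions.
    dg = grid[1] - grid[0]
    return np.sum(p.pdf(grid) * (p.logpdf(grid) - q.logpdf(grid))) * dg

rhs = kl(f_bag, f_l) - kl(f_bag, f_k)
print(f"E[log(f_k/f_l)] = {lhs:.3f}, KL(f_bag,f_l) - KL(f_bag,f_k) = {rhs:.3f}")
```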
An obstacle arises: the core of MI learning is that bags from the same class are not equal, e.g., two images of the sea, so the model is most likely not in the training set. In fact, for probability distributions with continuous parameters, the probability of the new bag being in the training set is zero. For ratio-based divergences, such as the f-divergences, the difference between D(f_bag, f_k) and D(f_bag, f_ℓ) becomes arbitrary. Despite their necessary properties as dissimilarity measures, and the KL information's property as the most powerful model selector, we see that they can fail in practice.
If the bag sampling is sparse, the dissimilarity between f_bag(x) and the labeled bags becomes somewhat arbitrary with regard to the true label of f_bag(x). The risk is high for ratio-based divergences such as the KL information, since f_k(x)/f_ℓ(x) = ∞ for {x : f_ℓ(x) = 0, f_k(x) > 0}. The bag-to-bag KL information is asymptotically the best choice of divergence function, but this is not the case for sparse training sets. Bag-to-class dissimilarity makes up for some of the sparseness by aggregation of instances. Consider an image segment of color deep green, which appears in sea images, but not in desert images, and a segment of color white, which appears in both classes (waves and clouds). If the combination deep green and white does not appear in the training set, then the bag-to-bag KL information will be infinite for all bags, regardless of class, but the bag-to-class KL information will be finite for the sea class.
Let P(X | C) = ∫_ℬ P(X | B) dP(B | C) be the probability distribution of a random vector from the bags of class C. Let D(P(X | B), P(X | pos)) and D(P(X | B), P(X | neg)) be the divergences between the unlabeled bag and each of the classes. The choice of divergence is not obvious, since P(X | B) is different from both P(X | pos) and P(X | neg), but it can be made by identifying properties.

4. Properties for Bag-Level Classification

4.1. Properties for Bag-to-Class Divergences

We argue that the equality, orthogonality and monotonicity properties possessed by f-divergences are reasonable also for bag-to-class divergences, although less likely to occur in practice:
The equality property and the monotonicity property are valid for uncertain objects, but equality can in practice also occur with sparse class sampling, and we therefore argue that these properties are valid also for bag-to-class divergences. The opposite implies that a bag can be regarded as more similar to one class even though its probability distribution is identical to the other class (equality), or that, e.g., if P_bag(X), P_pos(X) and P_neg(X) are Gaussian distributions with the same variance and means θ_bag < θ_pos < θ_neg, we can have D(P(X | θ_bag), P(X | θ_pos)) > D(P(X | θ_bag), P(X | θ_neg)). In other words, the divergence between the bag and the positive class can be larger than that between the bag and the negative class, although the bag mean is closer to the positive class mean. This is clearly not appropriate for a dissimilarity measure.
The orthogonality property is reasonable for bag-to-class divergences: if there is no common sample space between bag and class, the divergence should take its maximum. In conclusion, f-divergences are the correct group for bag-to-class divergences.
There may be other desirable properties for bag-to-class divergences, where the aim is no longer to compare an i.i.d. sample to a model, but to compare an i.i.d. sample to an aggregation of models where the sample comes from one of them. We here propose two properties for bag-to-class divergences regarding infinite bag-to-class ratio and zero instance probability. Denote the divergence between an unlabeled bag and the reference distribution, P r e f ( X ) , by D ( P b a g , P r e f ) .
In the sea images example, the class contains all possible colors that the sea can have, whereas a bag consists only of the colors of that particular moment in time. If the bag contains something that the class does not, e.g., brown color, this should be reflected in a larger divergence. However, the class should be allowed to contain something that the bag does not without this resulting in a similarly large divergence.
As a motivating example, consider the following: a positive bag, P(X | a), is a continuous uniform distribution U(a, a + δ), sampled according to P(A) = U(η, ζ − δ):
X | a ∼ U(a, a + δ),    A ∼ U(η, ζ − δ).
A negative bag, P(X | a′), is U(a′, a′ + δ), sampled according to P(A′) = U(η′, ζ′ − δ):
X | a′ ∼ U(a′, a′ + δ),    A′ ∼ U(η′, ζ′ − δ),
and let η′ < ζ so that there is an overlap between the two classes. For both positive and negative bags, we have that P_pos(X)/P_bag(X) = ∞ for a subspace of 𝒳 and P_neg(X)/P_bag(X) = ∞ for a different subspace of 𝒳, merely reflecting that the variability in instances within a class is larger than within a bag, as illustrated in Figure 4.
If P_bag(X) is a sample from the negative class, and P_bag(X)/P_pos(X) = ∞ for some subspace of 𝒳, it can easily be classified. From the above analysis, a large bag-to-class ratio should be reflected in a large divergence, whereas a large class-to-bag ratio should not.
Property 1: Let 𝒳_M be the subspace of 𝒳 where the bag-to-class ratio is larger than some M:
𝒳_M ⊆ 𝒳 : P_bag(X)/P_ref(X) > M,
and let 𝒳 \ 𝒳_M be its complement. Let D_{𝒳_M}(P_bag, P_ref) be the contribution to the total divergence for that subspace: D(P_bag, P_ref) = D_{𝒳_M}(P_bag, P_ref) + D_{𝒳\𝒳_M}(P_bag, P_ref). Let 𝒳_M* be the subspace of 𝒳 where the class-to-bag ratio is larger than some M:
𝒳_M* ⊆ 𝒳 : P_ref(X)/P_bag(X) > M,
and let 𝒳 \ 𝒳_M* be its complement. Let D_{𝒳_M*}(P_bag, P_ref) be the contribution to the total divergence for that subspace: D(P_bag, P_ref) = D_{𝒳_M*}(P_bag, P_ref) + D_{𝒳\𝒳_M*}(P_bag, P_ref).
D_{𝒳_M} approaches the maximum contribution as M → ∞, whereas D_{𝒳_M*} does not:
M → ∞:  D_{𝒳_M}(P_bag, P_ref) → max(D_{𝒳_M}(P_bag, P_ref)),    D_{𝒳_M*}(P_bag, P_ref) ↛ max(D_{𝒳_M*}(P_bag, P_ref)).
Property 1 cannot be fulfilled by a symmetric divergence. This property is necessary in cases where the sample space of a bag is a subset of the sample space of the class, 𝒳_bag ⊂ 𝒳_class, e.g., for uniform distributions, and in cases where the variance of a bag is smaller than the variance of the class.
Consider 𝒳_M*. Because P(X) < ∞, this occurs for the subspace of 𝒳 where P_bag(X) is smaller than some ϵ and P_ref(X) is not. We argue that when P_bag(X) < ϵ, there should be no contribution to the divergence, due to the very nature of MI learning: a bag is not a representation of the entire class, but only of a small part of it.
Consider an unlabeled image coming from the class sea, and a binary classification problem with desert as the alternative class. If the unlabeled image contains only blue and white colors, it should not influence the divergence how the different shades of brown or green are distributed in the two classes, as it does not influence the likelihood of this bag coming from one class or the other. This is in contrast to bag-to-bag divergences, where this indicates a bad sample-model match.
As a second motivating example, consider the same positive class as before, and two alternative negative classes defined by:
A ∼ {P(A = η) = 0.5, P(A = η + 2δ) = 0.5}    or    A′ ∼ {P(A′ = η) = 0.5, P(A′ = η + 2δ) = 0.25, P(A′ = η + 3δ) = 0.25}.
For bag classification, the question becomes: from which class is a specific bag sampled? It is equally probable that a bag P_η(X) = P(X | A = η) comes from each of the two negative classes, since P_neg(X) and P_neg′(X) only differ where P_η(X) = 0, and we argue that D(P_η, P_neg) should be equal to D(P_η, P_neg′).
Property 2: Let 𝒳_ϵ be the subspace of 𝒳 where P_bag(X) is smaller than some ϵ > 0:
𝒳_ϵ ⊆ 𝒳 : P_bag(X) < ϵ,
and let 𝒳 \ 𝒳_ϵ be its complement. Let D_{𝒳_ϵ}(P_bag, P_ref) be the contribution to the total divergence for that subspace: D(P_bag, P_ref) = D_{𝒳_ϵ}(P_bag, P_ref) + D_{𝒳\𝒳_ϵ}(P_bag, P_ref).
The contribution to the total divergence approaches zero as ϵ → 0:
ϵ → 0:  D_{𝒳_ϵ}(P_bag, P_ref) → 0.
This property is necessary when the bag distributions are mixture distributions with possibly zero mixture proportion. It also covers the case when the bags are different distributions, not merely have different parameters, which can be modelled as a mixture of all possible distributions in the class and only one non-zero mixture proportion.
The KL information is the only divergence among the non-symmetric divergences listed in [53] that fulfils these two properties; see Appendix A. As there is no complete list of divergences, other divergences that the authors are not aware of may also fulfil these properties.

4.2. A Class-Conditional Dissimilarity for MI Classification

In the sea and desert images example, consider an unlabeled image with a pink segment, e.g., a boat. If pink is absent in the training set, then the bag-to-class KL information will be infinite for both classes. We therefore propose the following property:
Property 3: For the subspace of 𝒳 where the alternative class probability, P_ref′, is smaller than some ϵ, the contribution to the total divergence, D_{𝒳_ϵ}, approaches zero as ϵ → 0:
Let 𝒳_ϵ be the subspace of 𝒳 where P_ref′(X) is smaller than some ϵ > 0:
𝒳_ϵ ⊆ 𝒳 : P_ref′(X) < ϵ,
and let 𝒳 \ 𝒳_ϵ be its complement. Let D_{𝒳_ϵ}(P_bag, P_ref | P_ref′) be the contribution to the total divergence for that subspace: D(P_bag, P_ref | P_ref′) = D_{𝒳_ϵ}(P_bag, P_ref | P_ref′) + D_{𝒳\𝒳_ϵ}(P_bag, P_ref | P_ref′).
The contribution to the total divergence approaches zero as ϵ → 0:
ϵ → 0:  D_{𝒳_ϵ}(P_bag, P_ref | P_ref′) → 0.
We present a class-conditional dissimilarity that accounts for this:
cKL(f_bag, f_pos | f_neg) = ∫ ( f_neg(x) / f_pos(x) ) f_bag(x) log( f_bag(x) / f_pos(x) ) dx,    (6)
which also fulfils Properties 1 and 2, see Appendix A.
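A sketch of how Equation (6) can be approximated on a grid, here for assumed one-dimensional Gaussian densities:

```python
# Grid approximation of the class-conditional dissimilarity cKL (Eq. 6)
# for assumed univariate Gaussian densities.
import numpy as np
from scipy.stats import norm

x = np.linspace(-15, 15, 30001)
dx = x[1] - x[0]
f_bag = norm.pdf(x, loc=1.0, scale=1.0)
f_pos = norm.pdf(x, loc=0.0, scale=2.0)
f_neg = norm.pdf(x, loc=5.0, scale=2.0)

eps = 1e-300                                   # guard against division by zero / log(0)
cKL = np.sum((f_neg / (f_pos + eps)) * f_bag * np.log((f_bag + eps) / (f_pos + eps))) * dx
print(f"cKL(f_bag, f_pos | f_neg) = {cKL:.4f}")
```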

4.3. Bag-Level Divergence Classification

With a proper dissimilarity measure, the classification task is, in theory, straightforward: a bag is given the label of its most similar class. With dense bag and class sampling, the KL bag-to-bag classifier is the most powerful. There are, however, a couple of practical obstacles: the distributions from which the instances have been drawn are not known, and must be estimated, and the divergences usually do not have analytical solutions, and must therefore be approximated.
We propose two similar methods based on either the ratio of bag-to-class divergences, r_D(f_bag, f_pos, f_neg) = D(f_bag, f_pos) / D(f_bag, f_neg), or the class-conditional dissimilarity in Equation (6). We propose using the KL information (Equation (4)), and report results for the BH distance (Equation (5)) for comparison, but any divergence function can be applied.
Given a training set {(X_1, y_1), …, (X_k, y_k), …, (X_K, y_K)} and a set, X_bag, of instances drawn from an unknown distribution, f_bag(x), with unknown class label y_bag, let X_neg denote the set of all x_ki ∈ X_k with y_k = neg, and X_pos the set of all x_ki ∈ X_k with y_k = pos. The bag-level divergence classification follows the steps:
1. Estimate PDFs: fit f̂_neg(x) to X_neg, f̂_pos(x) to X_pos, and f̂_bag(x) to X_bag.
2. Calculate divergences: D(f̂_bag, f̂_neg) and D(f̂_bag, f̂_pos), or cKL(f̂_bag, f̂_pos | f̂_neg), by integral approximation.
3. Classify according to: y_bag = pos if r_D(f̂_bag, f̂_pos, f̂_neg) < t, and neg otherwise; or y_bag = pos if cKL(f̂_bag, f̂_pos | f̂_neg) < t, and neg otherwise.    (7)
Common methods for PDF estimation are Gaussian mixture models (GMMs) and kernel density estimation (KDE). The integrals in step 2 are commonly approximated by importance sampling and Riemann sums. In rare cases, e.g., when the distributions are Gaussian, the divergences can be calculated directly. The threshold t can be pre-defined based on, e.g., misclassification penalty and prior class probabilities, or estimated from the training set by leave-one-out cross-validation. When the feature dimension is high and the number of instances in each bag is low, PDF estimation becomes arbitrary. A solution is to estimate separate PDFs for each dimension, calculate the corresponding divergences D_1, …, D_Dim, and treat them as inputs to a classifier, replacing step 3.
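The following sketch illustrates the three steps in (7) with one assumed set of choices (Gaussian KDE for the PDFs, Monte Carlo sampling from the bag's KDE to approximate the KL integrals, the divergence ratio r_D as classifier, and a pre-defined threshold t); it is an illustration of the algorithm, not the exact implementation used in Section 5:

```python
# Sketch of the bag-level divergence classification in (7) with assumed choices.
import numpy as np
from scipy.stats import gaussian_kde

def kl_mc(f_num, f_den, n_samples=5000, seed=3):
    # D_KL(f_num, f_den) approximated by Monte Carlo: sample z ~ f_num and
    # average log(f_num(z)) - log(f_den(z)).
    z = f_num.resample(n_samples, seed=seed)
    eps = 1e-300                                   # guard against log(0)
    return float(np.mean(np.log(f_num(z) + eps) - np.log(f_den(z) + eps)))

def classify_bag(X_bag, X_pos, X_neg, t=1.0):
    # Step 1: estimate the PDFs from pooled class instances and from the unlabeled bag.
    f_pos, f_neg, f_bag = (gaussian_kde(X.T) for X in (X_pos, X_neg, X_bag))
    # Step 2: approximate the two bag-to-class KL informations.
    d_pos, d_neg = kl_mc(f_bag, f_pos), kl_mc(f_bag, f_neg)
    # Step 3: classify by the divergence ratio r_D = D(f_bag, f_pos) / D(f_bag, f_neg).
    return "pos" if d_pos / d_neg < t else "neg"

rng = np.random.default_rng(4)
X_neg = rng.normal(0, 1, (500, 2))                                           # pooled negative-class instances
X_pos = np.vstack([rng.normal(0, 1, (450, 2)), rng.normal(4, 1, (50, 2))])   # pooled positive-class instances
X_bag = np.vstack([rng.normal(0, 1, (45, 2)), rng.normal(4, 1, (5, 2))])     # unlabeled bag
print(classify_bag(X_bag, X_pos, X_neg))                                     # expected: "pos"
```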
In image analysis, it has become more and more common that MI data sets are limited by the number of (labeled) bags per class rather than by the number of instances per bag. With the proposed algorithm, the PDF estimates improve with an increasing number of instances, and the aggregation of class instances allows for sparser bag samples.

5. Experiments

5.1. Simulated Data and Class Sparsity

The following study exemplifies the difference between the BH distance ratio, r_BH, the KL information ratio, r_KL, and cKL as classifiers for sparse training data. We investigate how the three divergences vary with the number of bags in the training set. The minimum-dissimilarity bag-to-bag classifiers are also implemented, based on the KL information and the BH distance. The number of instances from each bag is 50, the number of bags in the training set is varied from 1 to 25 from each class, and the number of bags in the test set is 100. Each bag and its instances are sampled as described in Equation (3), and the area under the receiver operating characteristic (ROC) curve (AUC) serves as performance measure. For simplicity, we use Gaussian distributions in one dimension for Sim 1–Sim 4:
X^− ∼ N(μ^−, σ²_−),  μ^− ∼ N(0, 10),  σ²_− = |ζ^−|,  ζ^− ∼ N(1, 1),  Π^− = π^−
X^+ ∼ N(μ^+, σ²_+),  μ^+ ∼ N(ν^+, 10),  σ²_+ = |ζ^+|,  ζ^+ ∼ N(η^+, 1),  Π^+ = 0.10.
  • Sim 1: ν^+ = 15, η^+ = 1, π^− = 0: no positive instances in negative bags.
  • Sim 2: ν^+ = 15, η^+ = 1, π^− = 0.01: positive instances in negative bags.
  • Sim 3: ν^+ = 0, η^+ = 100, π^− = 0: positive and negative instances have the same expectation of the mean, but unequal variance.
  • Sim 4: P(ν^+ = −15) = P(ν^+ = 15) = 0.5, η^+ = 1, π^− = 0.01: positive instances are sampled from two distributions with unequal mean expectation.
We add Sim 5 and Sim 6 for the discussion on instance labels in Section 6, as follows: Sim 5 is an uncertain-object classification, where the positive bags are lognormal densities with μ = log(10) and σ² = 0.04, and the negative bags are Gaussian mixture densities with μ_1 = 9.5, μ_2 = 13.5, σ² = 2.5, and π_1 = 0.9. These two densities are nearly identical, see [54], p. 15. In Sim 6, the parameters of Sim 5 are i.i.d. observations from Gaussian distributions, each with σ² = 1 for the Gaussian mixture, and σ² = 0.04 for the lognormal distribution. Figure 5 shows the estimated class densities and two estimated bag densities for Sim 2 with 10 negative bags in the training set.
We use the following details for the algorithm in (7): 1. KDE fitting: Epanechnikov kernel with estimated bandwidth varying with the number of observations. 2. Integrals: Importance sampling. 3. Classifier: t is varied to give the full range of sensitivities and specificities necessary to calculate AUC.
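For illustration, the following sketch draws bags from the Sim 2 setting (ν^+ = 15, η^+ = 1, π^− = 0.01), interpreting the second parameter of the Gaussians above as a variance; it is an assumed reading of the simulation model, not the authors' code:

```python
# Sketch of the Sim 2 generative model: each bag draws its own means and variances,
# then mixes positive and negative instances.
import numpy as np

rng = np.random.default_rng(5)

def sample_bag(positive, n_instances=50, nu_pos=15.0, eta_pos=1.0, pi_neg=0.01):
    pi = 0.10 if positive else pi_neg                   # probability of a positive instance
    mu_neg = rng.normal(0.0, np.sqrt(10))               # mu^- ~ N(0, 10)
    sigma2_neg = abs(rng.normal(1.0, 1.0))              # sigma^2_- = |zeta^-|, zeta^- ~ N(1, 1)
    mu_pos = rng.normal(nu_pos, np.sqrt(10))            # mu^+ ~ N(nu^+, 10)
    sigma2_pos = abs(rng.normal(eta_pos, 1.0))          # sigma^2_+ = |zeta^+|, zeta^+ ~ N(eta^+, 1)
    tau = rng.random(n_instances) < pi                  # latent instance labels
    return np.where(tau,
                    rng.normal(mu_pos, np.sqrt(sigma2_pos), n_instances),
                    rng.normal(mu_neg, np.sqrt(sigma2_neg), n_instances))

train = [(sample_bag(True), "pos") for _ in range(10)] + \
        [(sample_bag(False), "neg") for _ in range(10)]
print(len(train), train[0][0].shape)
```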
Table 1 shows the mean AUCs for 50 repetitions.

5.2. The Impact of Pdf Estimation and Comparison to Other Methods

We use a public data set from the UCSB Center for Bio-Image Informatics to demonstrate the impact of the PDF estimation method and for comparison with other MI classification methods. The UCSB data set consists of 58 breast tumor histology images, as seen in Figure 1. There are 32 images labeled as benign and 26 as malignant. The image patches are of size 7 × 7 pixels, and 708 features have been extracted from each patch. The mean number of instances per bag is 35. We have used the published instance values [14] to minimize sources of variation other than the classification algorithms. Following the procedure in [3], the principal components are used for dimension reduction, and 4-fold cross-validation is used so that f̂_neg(x) and f̂_pos(x) are fitted only to the instances in the training folds. Table 2 shows the AUC for r_KL and cKL for three different PDF estimation methods. GMMs are fitted to the first principal component, using an EM algorithm, with the number of components chosen by minimum AIC. In addition, KDE as in Section 5.1, and KDE with a Gaussian kernel and optimal bandwidth [55], are used.
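The following sketch outlines this protocol with synthetic stand-in arrays (the published UCSB instance values are not reproduced here): projection onto the first principal component, a fold of bags held out, and class PDFs fitted as GMMs with the number of components chosen by minimum AIC:

```python
# Sketch of the UCSB evaluation protocol with synthetic stand-in data.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(6)
n_bags, n_inst, n_feat = 58, 35, 708
instances = rng.normal(size=(n_bags * n_inst, n_feat))   # stand-in for the published instance values
bag_id = np.repeat(np.arange(n_bags), n_inst)            # bag membership of each instance
bag_label = rng.integers(0, 2, size=n_bags)              # stand-in benign (0) / malignant (1) labels

z = PCA(n_components=1).fit_transform(instances).ravel() # first principal component

def gmm_min_aic(x, max_components=10):
    # Fit GMMs with 1..max_components components and keep the one with minimum AIC.
    x = x.reshape(-1, 1)
    fits = [GaussianMixture(n_components=m, random_state=0).fit(x)
            for m in range(1, max_components + 1)]
    return min(fits, key=lambda g: g.aic(x))

held_out = np.arange(n_bags) % 4 == 0                    # one of the 4 cross-validation folds
train_mask = ~held_out[bag_id]                           # instances from training-fold bags only
f_pos = gmm_min_aic(z[train_mask & (bag_label[bag_id] == 1)])
f_neg = gmm_min_aic(z[train_mask & (bag_label[bag_id] == 0)])
print(f_pos.n_components, f_neg.n_components)
```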
Table 3 shows the AUC of the GMM-fitted r_KL and cKL compared to four other MI learning methods. For articles presenting more than one method, the best-performing method is displayed in Table 3.

5.3. Comparison to State-of-the-Art Methods

The benchmark data sets that have been used for comparison of MIL methods have a particularly low number of instances compared to the number of features. For example, in Musk1, more than half of the bags contain fewer than 5 instances, and in Musk2, one fourth of the bags contain fewer than 5 instances. It is obvious that a PDF-based method will not work. The COREL database, previously used in MIL method comparisons, is no longer available; only data sets with extracted features remain. Again, the number of instances is too low for density estimation. In addition, [56] showed how feature extraction methods influence the results of MIL classifiers.
We here present the results of cKL and r_KL compared to the five best-performing MIL methods on the BreakHis data set, as presented in [31]. This data set is suited for PDF-based methods, since the images themselves are available, and hence the number of instances can be adjusted to ensure a sufficiently dense sampling. We follow the procedure in [31], using the 162 parameter-free threshold adjacency statistics (PFTAS) features for 1000 image patches of size 64 × 64. Dimension reduction is done by principal components, so that 90% of the variance is explained, and the dimension is reduced to about 25, depending on the data set, see Table 4. Each data set is split into training, validation and test sets (35%/35%/30%), where we use the exact same five test sets as [31]. There are multiple images from the same tumor, but the data set is split so that the same tumor does not appear in both the training/validation and test sets.
We use the following details for the algorithm in (7):
  • GMMs are fitted with 1, …, 100 components, and the number of components is chosen by minimum AIC. To save computation time, the number of components is estimated for 10 bags sampled from the training set. The median number of components is used to fit the bag PDFs in the rest of the algorithm, see Table 4. For the class PDFs, a random subsample of 10% of the instances is taken from each bag, to reduce computation time.
  • Integrals: Importance sampling.
  • Classification: To estimate the threshold, t, the training set is used to estimate f p o s t r a i n ( x ) and f n e g t r a i n ( x ) , and the divergences between the bags in the validation set and f p o s t r a i n ( x ) and f n e g t r a i n ( x ) are calculated. The threshold, t ^ , that gives the highest accuracy will then serve as threshold for the test set.
Please note that the bags from the test set are not involved in picking the number of components or estimating t̂.
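The threshold-selection step can be sketched as follows, with hypothetical validation-set divergence ratios standing in for the actual values; t̂ is the candidate threshold with the highest validation accuracy and is then reused on the test set:

```python
# Sketch of threshold selection on the validation set (hypothetical divergence ratios).
import numpy as np

rng = np.random.default_rng(7)
# Assumed inputs: one divergence ratio r_D per validation bag, plus its true label.
r_val = np.concatenate([rng.normal(0.8, 0.2, 40), rng.normal(1.3, 0.3, 40)])
y_val = np.array(["pos"] * 40 + ["neg"] * 40)

candidates = np.sort(r_val)
accuracy = [np.mean(np.where(r_val < t, "pos", "neg") == y_val) for t in candidates]
t_hat = candidates[int(np.argmax(accuracy))]
print(f"estimated threshold t_hat = {t_hat:.3f}, validation accuracy = {max(accuracy):.3f}")
```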

5.4. Results

The general trend in Table 1 is that cKL gives higher AUC than r_KL, which in turn gives higher AUC than r_BH, in line with the divergences' properties for sparse training sets. The same trend can be seen with a Gaussian kernel and optimal bandwidth (numbers not reported). The gap between cKL and r_KL narrows with larger training sets. In other words, the benefit of cKL increases with sparsity. This can be explained by the ∞/∞ risk of r_KL, as seen in Figure 5a. Increasing π^+ also narrows the gap between r_KL and cKL, and eventually (at approximately π^+ = 0.25), r_KL outperforms cKL (numbers not reported). Sim 1 and Sim 3 are less affected because the ratio π^+/π^− is already ∞.
The minimum bag-to-bag classifier gives a single sensitivity-specificity outcome, and the KL information outperforms the BH distance. Compared to the ROC curve, as illustrated in Figure 6, the minimum bag-to-bag KL information classifier exceeds the bag-to-class dissimilarities only for very large training sets, typically 500 bags or more, and then at the expense of extensive computation time.
Sim 5 is an example in which the absolute difference, not the ratio, differentiates the two classes, and r_BH has the superior performance. When the extra hierarchy level is added in Sim 6, the performances return to the usual pattern.
The UCSB breast tissue study shows that the simple divergence-based approach can outperform more sophisticated algorithms. r_KL is more sensitive than cKL to the choice of density estimation method, as shown in Table 2. r_KL performs better than cKL with GMM, and both are among the best performing in Table 3. The study is too small to draw conclusions. Table 2 shows how the performance can vary between two common PDF estimation methods that do not assume a particular underlying distribution. Both KDE and GMM are sensitive to the chosen parameters or parameter estimation method, bandwidth and number of components, respectively, and no method will fit all data sets. In general, KDE is faster, but more sensitive to bandwidth, whereas GMM is more stable. For bags with very few instances the benefits of GMM cannot be exploited, and KDE is preferred.
The BreakHis study shows that both r_KL and cKL perform as well as or better than the other methods, the exception being cKL for 40×, as reported in Table 5. “As well as” refers to the mean being within one standard deviation of the highest mean. Since none of the methods has overall superior performance, we believe that differences within one standard deviation are not enough to declare a winner. r_KL has the overall best performance in the sense that it is always within one standard deviation of the highest mean. However, cKL, MI-SVM poly and Non-parametric follow close behind with four out of five. Therefore, we will again avoid declaring a winner. Table 4 demonstrates that the number of components varies between repetitions, but does not influence the accuracy substantially. For reference, we have reported the AUC in Table 6, as this is a common way of reporting performance in the MIL context.
The superior performance of cKL for the KDE (Epan.) in Table 2 can be explained by the Epanechnikov kernel's zero value, as opposed to the Gaussian kernel, which is always positive. r_KL will then suffer from its ∞/∞ property, given the limited training set for each class. With a Gaussian kernel and GMMs, r_KL improves its performance compared to cKL, as demonstrated in the simulation study. For the BreakHis data, r_KL and cKL show similar performance. Although cKL is not within one standard deviation of the best-performing method for the 40× data set, it is within one standard deviation of r_KL. The similar performance of r_KL and cKL is in line with the simulation study, where the superiority of cKL is demonstrated for sparse training sets, but not for all types of data.

6. Discussion

6.1. Point-of-View

The theoretical basis of the bag-to-class divergence approach relies on viewing a bag as a probability distribution, hence fitting into the branch of collective assumptions of the Foulds and Frank taxonomy [13]. The probability distribution estimation can be seen as extracting bag-level information from a set X , and hence falls into the BS paradigm of Amores [15]. The probability distribution space is non-vectorial, different from the distance-kernel spaces in [15], and divergences are used for classification.
In practice, the evaluation points of the importance sampling give a mapping from the instance set to a single vector, f̂_bag(z). The mapping concurs with the ES paradigm, and the same applies to the graph-based methods. From that viewpoint, the bag-to-class divergence approach expands the distance branch of Foulds and Frank to include a bag-to-class category in addition to instance-level and bag-level distances. However, the importance sampling is a technicality of the algorithm. We argue that the method belongs to the BS paradigm. When the divergences are used as input to a classifier, the ES paradigm is a better description.
Carbonneau et al. [16] assume underlying instance labels. From a probability distribution viewpoint, this corresponds to posterior probabilities, which are, in practice, inaccessible. In Sim 1–Sim 4, the instance labels are inaccessible through observations without previous knowledge about the distributions. In Sim 6, the instance label approach is not useful due to the similarity between the two distributions:
X | θ^+ ∼ P(X | θ^+)        X | θ^− ∼ P(X | θ^−)
Θ^+ ∼ P(Θ^+)        Θ^− ∼ P(Θ^−),    (8)
where P(X | Θ^+) and P(X | Θ^−) are the lognormal and the Gaussian mixture, respectively. Equation (3) is just a special case of Equation (8), where Θ^+ is the random vector {Θ, Π_pos}. Without knowledge about the distributions, discriminating between training sets following the generative models of Equations (3) and (8) is only possible for a limited number of problems. Even the uncertain objects of Sim 5 are difficult to discriminate from MI objects based solely on the observations in the training set.

6.2. Conclusions and Future Work

Although the bag-to-bag KL information has the minimum misclassification rate, the typical bag sparseness of MI training sets is an obstacle. This is partly solved by bag-to-class dissimilarities, and the proposed class-conditional KL information accounts for additional sparsity of bags.
The bag-to-class divergence approach addresses three main challenges of MI learning. (1) Aggregation of instances according to bag label and the additional class-conditioning provide a solution for the bag sparsity problem. (2) The bag-to-bag approach suffers from extensive computation time, which is solved by the bag-to-class approach. (3) Viewing bags as probability distributions gives access to analytical tools from statistics and probability theory, and comparisons of methods can be done on a data-independent level through identification of properties. The properties presented here are not an extensive list, and any extra knowledge should be taken into account whenever available.
A more thorough analysis of the proposed function, c K L , will identify its weaknesses and strengths, and can lead to improved versions as well as alternative class-conditional dissimilarity measures and a more comprehensive tool.
The diversity of data types, assumptions, problem characteristics, sampling sparsity, etc. is far too large for any one approach to be sufficient. The introduction of divergences as an alternative class of dissimilarity functions, and the bag-to-class dissimilarity as an alternative to the bag-to-bag dissimilarity, has added additional tools to the MI toolbox.

Author Contributions

Conceptualization, K.M.; methodology, K.M.; software, K.M.; writing—original draft preparation, K.M.; writing—review and editing, J.Y.H. and F.G.; visualization, K.M.; supervision, J.Y.H. and F.G. All authors have read and agreed to the published version of the manuscript.

Funding

The publication charges for this article have been funded by a grant from the publication fund of UiT The Arctic University of Norway.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
MI    multi-instance
PDF   probability density function
IS    instance space
ES    embedded space
BS    bag space
KL    Kullback–Leibler
SVM   support vector machine
AIC   Akaike Information Criterion
GMM   Gaussian mixture models
KDE   kernel density estimation
ROC   receiver operating characteristic
AUC   area under the ROC curve

Appendix A

For the sake of readability, we repeat summary versions of the properties here:
Property 1:
𝒳_M ⊆ 𝒳 : P_bag(X)/P_ref(X) > M;    M → ∞:  D_{𝒳_M}(P_bag, P_ref) → max(D_{𝒳_M}(P_bag, P_ref)),  D_{𝒳_M*}(P_bag, P_ref) ↛ max(D_{𝒳_M*}(P_bag, P_ref))
Property 2:
𝒳_ϵ ⊆ 𝒳 : P_bag(X) < ϵ;    ϵ → 0:  D_{𝒳_ϵ}(P_bag, P_ref) → 0
Property 3:
𝒳_ϵ ⊆ 𝒳 : P_ref′(X) < ϵ;    ϵ → 0:  D_{𝒳_ϵ}(P_bag, P_ref | P_ref′) → 0

Appendix A.1. Non-Symmetric Divergences

We show that the only non-symmetric divergence listed in [53] that fulfils both Property 1 and Property 2 is the KL information. For each of the other divergences, we show one property that it does not fulfil.
The χ²-divergence, defined as:
$$\int \frac{\big(f_{bag}(x) - f_{ref}(x)\big)^2}{f_{ref}(x)} \, dx,$$
does not fulfil Property 2:
$$\int_{X_\epsilon} \frac{\big(\epsilon - f_{ref}(x)\big)^2}{f_{ref}(x)} \, dx \to \int_{X_\epsilon} f_{ref}(x) \, dx \nrightarrow 0.$$
The KL information, referred to as Relative information in [53], defined as:
$$\int f_{bag}(x) \log \frac{f_{bag}(x)}{f_{ref}(x)} \, dx,$$
fulfils Property 1:
$$\int_{X_M} f_{bag}(x) \log M \, dx \to \max, \qquad \int_{X_{M^*}} f_{bag}(x) \log \frac{1}{M} \, dx \nrightarrow \max,$$
since $f_{bag}(x) < \infty$ and $1/M < \infty$, and Property 2:
$$\int_{X_\epsilon} \epsilon \log \frac{\epsilon}{f_{ref}(x)} \, dx \to 0 = \min.$$
The Relative Jensen-Shannon divergence, defined as:
$$\int f_{bag}(x) \log \frac{2 f_{bag}(x)}{f_{bag}(x) + f_{ref}(x)} \, dx,$$
does not fulfil Property 1:
$$\int_{X_M} f_{bag}(x) \log \frac{2}{1 + f_{ref}(x)/f_{bag}(x)} \, dx \geq \int_{X_M} f_{bag}(x) \log \frac{2}{1 + 1/M} \, dx \to \int_{X_M} f_{bag}(x) \log 2 \, dx \neq \max.$$
The Relative Arithmetic-Geometric divergence, defined as:
$$\int \frac{f_{bag}(x) + f_{ref}(x)}{2} \log \frac{f_{bag}(x) + f_{ref}(x)}{2 f_{bag}(x)} \, dx,$$
does not fulfil Property 2:
$$\int_{X_\epsilon} \frac{\epsilon + f_{ref}(x)}{2} \log \frac{\epsilon + f_{ref}(x)}{2\epsilon} \, dx \nrightarrow \min.$$
The Relative J-divergence, defined as:
$$\int \big(f_{bag}(x) + f_{ref}(x)\big) \log \frac{f_{bag}(x) + f_{ref}(x)}{2 f_{bag}(x)} \, dx,$$
does not fulfil Property 2:
$$\int_{X_\epsilon} \big(\epsilon + f_{ref}(x)\big) \log \frac{\epsilon + f_{ref}(x)}{2\epsilon} \, dx \nrightarrow \min.$$
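A small numerical check makes the Property 2 contrast concrete. With f_bag a narrow Gaussian and f_ref a Gaussian centred far away, both chosen purely for illustration, the contribution from the region where f_bag < ε vanishes for the KL information but stays close to 1 for the χ²-divergence:

```python
import numpy as np
from scipy.stats import norm

# Illustrative densities: f_bag concentrated around 0, f_ref concentrated around 6.
x = np.linspace(-10, 14, 200001)
f_bag = norm.pdf(x, loc=0.0, scale=0.5)
f_ref = norm.pdf(x, loc=6.0, scale=1.0)

for eps in (1e-2, 1e-4, 1e-6):
    region = f_bag < eps  # X_eps: the region where the bag density is below eps
    kl_part = np.trapz(np.where(region, f_bag * np.log(f_bag / f_ref), 0.0), x)
    chi2_part = np.trapz(np.where(region, (f_bag - f_ref) ** 2 / f_ref, 0.0), x)
    print(f"eps={eps:g}:  KL contribution {kl_part:+.2e}   chi^2 contribution {chi2_part:.3f}")
# The KL contribution tends to 0, whereas the chi^2 contribution does not.
```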

Appendix A.2. Class-Conditional Bag-to-Class Divergence

Class-conditional KL-divergence:
For the class-conditional divergence, there are three PDFs involved and, therefore, some additional restrictions. We show that the Equality and Orthogonality properties of f-divergences are also fulfilled by cKL. We were not able to reach a conclusion regarding the Monotonicity property.
$$cKL(f_{bag}, f_{pos} \mid f_{neg}) = \int \frac{f_{neg}(x)}{f_{pos}(x)} f_{bag}(x) \log \frac{f_{bag}(x)}{f_{pos}(x)} \, dx$$
Equality, $f_{bag}(x) = f_{pos}(x)$, $f_{neg} \geq 0$:
$$\int \frac{f_{neg}(x)}{f_{pos}(x)} f_{pos}(x) \log \frac{f_{pos}(x)}{f_{pos}(x)} \, dx = \int f_{neg}(x) \log 1 \, dx = 0 = \min.$$
Orthogonality, $f_{bag}(x)/f_{pos}(x) = \infty$, $f_{neg}(x) > 0$:
$$\int \frac{f_{neg}(x)}{f_{pos}(x)} f_{bag}(x) \log \frac{f_{bag}(x)}{f_{pos}(x)} \, dx = \infty = \max.$$
Property 1: $f_{neg}(x) \cdot M > 0$
$$\int_{X_M} f_{neg}(x)\, M \log M \, dx \to \max, \qquad \int_{X_{M^*}} f_{neg}(x)\, \frac{1}{M} \log \frac{1}{M} \, dx \nrightarrow \max.$$
Property 2: $\frac{f_{neg}(x)}{f_{pos}(x)} \cdot \epsilon > 0$
$$\int_{X_\epsilon} \frac{f_{neg}(x)}{f_{pos}(x)}\, \epsilon \log \frac{\epsilon}{f_{pos}(x)} \, dx \to 0.$$
Property 3: $f_{bag}(x)/f_{pos}(x) > M$, $\epsilon \to 0$ faster than $M \to \infty$
$$\int_{X_\epsilon} \frac{\epsilon}{f_{pos}(x)} f_{bag}(x) \log \frac{f_{bag}(x)}{f_{pos}(x)} \, dx \to 0.$$
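The class-conditional KL information can be evaluated numerically in the same manner. The sketch below uses three illustrative Gaussian densities on a grid (not the densities from the experiments) and checks the Equality property: cKL is numerically zero when f_bag coincides with f_pos, and positive when the bag is shifted towards the negative class.

```python
import numpy as np
from scipy.stats import norm

def ckl(f_bag, f_pos, f_neg, x):
    # Class-conditional KL information cKL(f_bag, f_pos | f_neg), evaluated on a 1-D grid.
    integrand = (f_neg / f_pos) * f_bag * np.log(f_bag / f_pos)
    return np.trapz(integrand, x)

x = np.linspace(-12.0, 12.0, 100001)
f_pos = norm.pdf(x, loc=1.0, scale=1.0)    # positive-class reference (illustrative)
f_neg = norm.pdf(x, loc=-1.0, scale=1.0)   # negative-class reference (illustrative)

f_bag_a = norm.pdf(x, loc=1.0, scale=1.0)  # identical to f_pos
f_bag_b = norm.pdf(x, loc=0.0, scale=1.0)  # shifted towards the negative class

print(ckl(f_bag_a, f_pos, f_neg, x))  # 0 (Equality property)
print(ckl(f_bag_b, f_pos, f_neg, x))  # > 0: deviates from f_pos where f_neg/f_pos is large
```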

References

1. Cheplygina, V.; de Bruijne, M.; Pluim, J.P. Not-so-supervised: A survey of semi-supervised, multi-instance, and transfer learning in medical image analysis. Med. Image Anal. 2019, 54, 280–296.
2. Gelasca, E.D.; Byun, J.; Obara, B.; Manjunath, B.S. Evaluation and Benchmark for Biological Image Segmentation. In Proceedings of the IEEE International Conference on Image Processing, San Diego, CA, USA, 12–15 October 2008; pp. 1816–1819.
3. Kandemir, M.; Zhang, C.; Hamprecht, F.A. Empowering Multiple Instance Histopathology Cancer Diagnosis by Cell Graphs. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention—MICCAI 2014, Boston, MA, USA, 14–18 September 2014; pp. 228–235.
4. Doran, G.; Ray, S. Multiple-Instance Learning from Distributions. J. Mach. Learn. Res. 2016, 17, 1–50.
5. Zhang, M.L.; Zhou, Z.H. Multi-instance clustering with applications to multi-instance prediction. Appl. Intell. 2009, 31, 47–68.
6. Zhou, Z.H.; Zhang, M.L.; Huang, S.J.; Li, Y.F. Multi-instance multi-label learning. Artif. Intell. 2012, 176, 2291–2320.
7. Tang, P.; Wang, X.; Huang, Z.; Bai, X.; Liu, W. Deep patch learning for weakly supervised object classification and discovery. Pattern Recognit. 2017, 71, 446–459.
8. Wang, X.; Yan, Y.; Tang, P.; Bai, X.; Liu, W. Revisiting multiple instance neural networks. Pattern Recognit. 2018, 74, 15–24.
9. Dietterich, T.G.; Lathrop, R.H.; Lozano-Pérez, T. Solving the multiple instance problem with axis-parallel rectangles. Artif. Intell. 1997, 89, 31–71.
10. Xu, Y.Y. Multiple-instance learning based decision neural networks for image retrieval and classification. Neurocomputing 2016, 171, 826–836.
11. Qiao, M.; Liu, L.; Yu, J.; Xu, C.; Tao, D. Diversified dictionaries for multi-instance learning. Pattern Recognit. 2017, 64, 407–416.
12. Weidmann, N.; Frank, E.; Pfahringer, B. A Two-Level Learning Method for Generalized Multi-instance Problems. In Proceedings of the European Conference on Machine Learning, Cavtat-Dubrovnik, Croatia, 22–26 September 2003; pp. 468–479.
13. Foulds, J.; Frank, E. A review of multi-instance learning assumptions. Knowl. Eng. Rev. 2010, 25, 1–25.
14. Cheplygina, V.; Tax, D.M.J.; Loog, M. Multiple Instance Learning with Bag Dissimilarities. Pattern Recognit. 2015, 48, 264–275.
15. Amores, J. Multiple Instance Classification: Review, Taxonomy and Comparative Study. Artif. Intell. 2013, 201, 81–105.
16. Carbonneau, M.A.; Cheplygina, V.; Granger, E.; Gagnon, G. Multiple Instance Learning: A survey of Problem Characteristics and Applications. Pattern Recognit. 2018, 77, 329–353.
17. Maron, O.; Lozano-Pérez, T. A framework for multiple-instance learning. In Advances in Neural Information Processing Systems, Denver, CO, USA, 30 November–5 December 1998; MIT Press: Cambridge, MA, USA, 1998; Volume 10, pp. 570–576.
18. Xu, X.; Frank, E. Logistic Regression and Boosting for Labeled Bags of Instances; Lecture Notes in Computer Science; Dai, H., Srikant, R., Zhang, C., Eds.; Springer: Berlin/Heidelberg, Germany, 2004; pp. 272–281.
19. Tax, D.M.J.; Loog, M.; Duin, R.P.W.; Cheplygina, V.; Lee, W.J. Bag Dissimilarities for Multiple Instance Learning. In Similarity-Based Pattern Recognition; Lecture Notes in Computer Science; Pelillo, M., Hancock, E.R., Eds.; Springer: Berlin/Heidelberg, Germany, 2011; Volume 7005, pp. 222–234.
20. Zhou, Z.H.; Sun, Y.Y.; Li, Y.F. Multi-instance Learning by Treating Instances As non-I.I.D. Samples. In Proceedings of the 26th Annual International Conference on Machine Learning—ICML ’09, Montreal, QC, Canada, 14–18 June 2009; pp. 1249–1256.
21. Cheplygina, V.; Tax, D.M.J.; Loog, M. Dissimilarity-Based Ensembles for Multiple Instance Learning. IEEE Trans. Neural Netw. Learn. Syst. 2016, 27, 1379–1391.
22. Boiman, O.; Shechtman, E.; Irani, M. In defense of Nearest-Neighbor based image classification. In Proceedings of the 2008 IEEE Conference on Computer Vision and Pattern Recognition, Anchorage, AK, USA, 23–28 June 2008; pp. 1–8.
23. Lee, W.J.; Cheplygina, V.; Tax, D.M.J.; Loog, M.; Duin, R.P.W. Bridging structure and feature representations in graph matching. Int. J. Pattern Recognit. Artif. Intell. 2012, 26, 1260005.
24. Scott, S.; Zhang, J.; Brown, J. On generalized multiple-instance learning. Int. J. Comput. Intell. Appl. 2005, 5, 21–35.
25. Ruiz-Muñoz, J.F.; Castellanos-Dominguez, G.; Orozco-Alzate, M. Enhancing the dissimilarity-based classification of birdsong recordings. Ecol. Inform. 2016, 33, 75–84.
26. Sørensen, L.; Loog, M.; Tax, D.M.J.; Lee, W.J.; de Bruijne, M.; Duin, R.P.W. Dissimilarity-Based Multiple Instance Learning. In Structural, Syntactic, and Statistical Pattern Recognition; Hancock, E.R., Wilson, R.C., Windeatt, T., Ulusoy, I., Escolano, F., Eds.; Springer: Berlin/Heidelberg, Germany, 2010; pp. 129–138.
27. Schölkopf, B. The Kernel Trick for Distances. In Proceedings of the 13th International Conference on Neural Information Processing Systems, Denver, CO, USA, 27 November–2 December 2000; MIT Press: Cambridge, MA, USA, 2000; pp. 283–289.
28. Wei, X.S.; Wu, J.; Zhou, Z.H. Scalable Algorithms for Multi-Instance Learning. IEEE Trans. Neural Netw. Learn. Syst. 2017, 28, 975–987.
29. Kullback, S.; Leibler, R.A. On Information and Sufficiency. Ann. Math. Stat. 1951, 22, 79–86.
30. Sahu, S.K.; Cheng, R.C.H. A fast distance-based approach for determining the number of components in mixtures. Can. J. Stat. 2003, 31, 3–22.
31. Sudharshan, P.; Petitjean, C.; Spanhol, F.; Oliveira, L.E.; Heutte, L.; Honeine, P. Multiple instance learning for histopathological breast cancer image classification. Expert Syst. Appl. 2019, 117, 103–111.
32. Zhang, G.; Yin, J.; Li, Z.; Su, X.; Li, G.; Zhang, H. Automated skin biopsy histopathological image annotation using multi-instance representation and learning. BMC Med. Genom. 2013, 6, S10.
33. Li, W.; Zhang, J.; McKenna, S.J. Multiple instance cancer detection by boosting regularised trees. In Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Springer: Basel, Switzerland, 2015; Volume 9349, pp. 645–652.
34. Tomczak, J.M.; Ilse, M.; Welling, M. Deep Learning with Permutation-invariant Operator for Multi-instance Histopathology Classification. arXiv 2017, arXiv:1712.00310.
35. Mercan, C.; Aksoy, S.; Mercan, E.; Shapiro, L.G.; Weaver, D.L.; Elmore, J.G. Multi-Instance Multi-Label Learning for Multi-Class Classification of Whole Slide Breast Histopathology Images. IEEE Trans. Med. Imaging 2018, 37, 316–325.
36. Xu, Y.; Zhu, J.Y.; Chang, E.I.; Lai, M.; Tu, Z. Weakly supervised histopathology cancer image segmentation and classification. Med. Image Anal. 2014, 18, 591–604.
37. McCann, M.T.; Bhagavatula, R.; Fickus, M.C.; Ozolek, J.A.; Kovačević, J. Automated colitis detection from endoscopic biopsies as a tissue screening tool in diagnostic pathology. In Proceedings of the 2012 19th IEEE International Conference on Image Processing, Orlando, FL, USA, 30 September–3 October 2012; pp. 2809–2812.
38. Dundar, M.M.; Badve, S.; Raykar, V.C.; Jain, R.K.; Sertel, O.; Gurcan, M.N. A multiple instance learning approach toward optimal classification of pathology slides. In Proceedings of the International Conference on Pattern Recognition, Istanbul, Turkey, 23–26 August 2010; pp. 2732–2735.
39. Samsudin, N.A.; Bradley, A.P. Nearest neighbour group-based classification. Pattern Recognit. 2010, 43, 3458–3467.
40. Kraus, O.Z.; Ba, J.L.; Frey, B.J. Classifying and segmenting microscopy images with deep multiple instance learning. Bioinformatics 2016, 32, i52–i59.
41. Hou, L.; Samaras, D.; Kurc, T.M.; Gao, Y.; Davis, J.E.; Saltz, J.H. Efficient Multiple Instance Convolutional Neural Networks for Gigapixel Resolution Image Classification. arXiv 2015, arXiv:1504.07947.
42. Jia, Z.; Huang, X.; Chang, E.I.; Xu, Y. Constrained Deep Weak Supervision for Histopathology Image Segmentation. IEEE Trans. Med. Imaging 2017, 36, 2376–2388.
43. Jiang, B.; Pei, J.; Tao, Y.; Lin, X. Clustering Uncertain Data Based on Probability Distribution Similarity. IEEE Trans. Knowl. Data Eng. 2013, 25, 751–763.
44. Kriegel, H.P.; Pfeifle, M. Density-based Clustering of Uncertain Data. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining KDD ’05, Chicago, IL, USA, 21–24 August 2005; pp. 672–677.
45. Ali, S.M.; Silvey, S.D. A General Class of Coefficients of Divergence of One Distribution from Another. J. R. Stat. Soc. Ser. B (Methodol.) 1966, 28, 131–142.
46. Csiszár, I. Information-type measures of difference of probability distributions and indirect observations. Studia Scientiarum Mathematicarum Hungarica 1967, 2, 299–318.
47. Berger, A. On orthogonal probability measures. Proc. Am. Math. Soc. 1953, 4, 800–806.
48. Gibbs, A.L.; Su, F.E. On Choosing and Bounding Probability Metrics. Int. Stat. Rev. 2002, 70, 419–435.
49. Møllersen, K.; Dhar, S.S.; Godtliebsen, F. On Data-Independent Properties for Density-Based Dissimilarity Measures in Hybrid Clustering. Appl. Math. 2016, 07, 1674–1706.
50. Møllersen, K.; Hardeberg, J.Y.; Godtliebsen, F. Divergence-based colour features for melanoma detection. In Proceedings of the 2015 Colour and Visual Computing Symposium (CVCS), Gjøvik, Norway, 25–26 August 2015; pp. 1–6.
51. Eguchi, S.; Copas, J. Interpreting Kullback-Leibler Divergence with the Neyman-Pearson Lemma. J. Multivar. Anal. 2006, 97, 2034–2040.
52. Kass, R.E.; Raftery, A.E. Bayes Factors. J. Am. Stat. Assoc. 1995, 90, 773–795.
53. Taneja, I.J.; Kumar, P. Generalized non-symmetric divergence measures and inequalities. J. Interdiscip. Math. 2006, 9, 581–599.
54. McLachlan, G.; Peel, D. Finite Mixture Models; Wiley Series in Probability and Statistics; John Wiley & Sons, Inc.: Hoboken, NJ, USA, 2000.
55. Sheather, S.J.; Jones, M.C. A Reliable Data-Based Bandwidth Selection Method for Kernel Density Estimation. J. R. Stat. Soc. Ser. B (Methodol.) 1991, 53, 683–690.
56. Wei, X.S.; Zhou, Z.H. An empirical study on image bag generators for multi-instance learning. Mach. Learn. 2016, 105, 155–198.
57. Andrews, S.; Tsochantaridis, I.; Hofmann, T. Support vector machines for multiple-instance learning. Adv. Neural Inf. Process. Syst. 2003, 15, 561–568.
58. Venkatesan, R.; Chandakkar, P.; Li, B. Simpler Non-Parametric Methods Provide as Good or Better Results to Multiple-Instance Learning. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015.
59. Sun, M.; Han, T.X.; Liu, M.-C.; Khodayari-Rostamabad, A. Multiple Instance Learning Convolutional Neural Networks for object recognition. In Proceedings of the 2016 23rd International Conference on Pattern Recognition (ICPR), Cancun, Mexico, 4–8 December 2016; pp. 3270–3275.
Figure 1. Breast tissue images [2]. The image segments are not labeled.
Figure 2. Sea and desert images from Wikimedia Commons.
Figure 3. Parametric generative model. Bags are realizations of random parameter vectors, sampled according to the respective class distributions. Instances are realizations of feature vectors, sampled according to the respective bag distributions. Only the instance sets are observed.
Figure 4. The PDF of a bag with uniform distribution and the PDFs of the two classes.
Figure 5. (a) One positive bag in the training set gives small variance for the class PDF. (b) Ten positive bags in the training set, and the variance has increased.
Figure 6. An example of ROC curves for cKL, rKL and rBH classifiers. The performance increases when the number of positive bags in the training set increases from 1 (dashed line) to 10 (solid line). The sensitivity-specificity pairs for the bag-to-bag KL and BH classifiers are displayed for 100 positive and negative bags in the training set for comparison.
Table 1. AUC · 100 for simulated data.

                 Bags neg: 5        Bags neg: 10       Bags neg: 25
Sim   pos      rBH   rKL   cKL    rBH   rKL   cKL    rBH   rKL   cKL
1     1         61    69    85     62    72    89     61    73    92
      5         63    75    86     64    82    94     68    84    97
      10        69    86    87     73    91    95     75    91    98
2     1         57    61    75     59    61    78     58    55    75
      5         59    67    79     60    68    84     62    63    85
      10        64    77    80     66    78    86     68    72    86
3     1         51    55    71     52    58    73     50    57    74
      5         53    61    76     53    66    81     52    65    83
      10        58    73    78     58    76    84     57    76    87
4     1         55    61    70     56    62    73     56    58    69
      5         56    63    75     57    64    81     59    59    80
      10        60    74    77     62    76    85     63    69    84
5     1         64    61    62     67    63    66     64    62    67
      5         73    69    63     74    70    67     75    71    72
      10        74    70    62     75    73    69     76    74    72
6     1         68    68    67     66    68    68     68    71    68
      5         65    64    67     68    68    69     70    71    74
      10        66    64    66     70    69    72     72    73    74
Table 2. AUC · 100. We use the following details for the algorithm.

        KDE (Epan.)   KDE (Gauss.)   GMMs
cKL     90            92             94
rKL     82            92             96
Table 3. AUC · 100 for UCSB breast tissue images.

Method              AUC
cKL                 94
rKL                 96
DEEPISR-MIL [34]    90
Li et al. [33]      93
GPMIL [3]           86
RGPMIL [3]          90
Table 4. Number of components.

Data Set     40×    100×    200×    400×
Dimension    23     26      25      24
Rep 1        66     55      52      70
Rep 2        58     49      69      71
Rep 3        59     50      50      70
Rep 4        47     49      58      73
Rep 5        63     59      72      74
Table 5. Accuracy and standard deviation. Best results and those within one standard deviation in bold.

Data Set (Magnification)    40×           100×          200×          400×
MI-SVM poly [57]            86.2 (2.8)    82.8 (4.8)    81.7 (4.4)    82.7 (3.8)
Non-parametric [58]         87.8 (5.6)    85.6 (4.3)    80.8 (2.8)    82.9 (4.1)
MILCNN [59]                 86.1 (4.2)    83.8 (3.1)    80.2 (2.6)    80.6 (4.6)
CNN [31]                    85.6 (4.8)    83.5 (3.9)    83.1 (1.9)    80.8 (3.0)
SVM [31]                    79.9 (3.7)    77.1 (5.5)    84.2 (1.6)    81.2 (3.6)
rKL                         83.4 (4.1)    84.9 (4.2)    88.3 (3.6)    84.0 (2.8)
cKL                         81.5 (3.2)    85.2 (3.5)    88.1 (3.6)    85.0 (3.5)
Table 6. AUC and standard deviation.

Data Set (Magnification)    40×           100×          200×          400×
rKL                         91.4 (2.4)    91.3 (2.2)    94.4 (1.9)    91.6 (1.7)
cKL                         88.4 (2.6)    89.7 (1.6)    91.9 (2.7)    91.7 (2.4)
