## 1. Introduction

Without a doubt, the last two decades have been characterized by massive production of data with regards to the fields of Computer Science (CS) and Artificial Intelligence (AI). Several real-life applications contribute to this phenomenon, operating as rich sources of data over all possible kinds: structured, semi-structured or unstructured [

1,

2]. We distinguish the next fields: social media platforms, economic transactions, medical recordings and Internet of Things (IoT) where Industry 4.0, constitutes a highly affected application coming from the latter field [

3,

4,

5]. Although these applications offer advanced mechanisms for producing the necessary data under automated protocols and/or mechanisms, they still, in their majority, cannot address the data annotation under a similar and still accurate manner.

Therefore, the most widely known variant of Machine Learning (ML) algorithms, the supervised ones, do not stand as a proper solution for obtaining informative insights over the collected data, since the restricted number of annotated examples that may be acquired, even by a manual procedure, do not establish a sufficient training subset. Weakly Supervised Learning (WSL) and/or Partially Supervised Learning (PSL) approaches tackle this problem, trying to exploit the existence of the clearly larger amounts of non-annotated examples in order to mine useful information that might hide based on the aforementioned available annotated examples [

6,

7].

Different approaches of WSL and/or PSL approaches have been recorded in the related literature since their emergence. The common factor is the much smaller size of the labeled subset (

L) against the corresponding unlabeled subset (

U), while the most important points over which they differentiate are the following [

8]:

inductive/transductive approaches, where an explicit learning rule is formatted using the train set during the former one, trying to apply this on a distinct test set, while these two sets are both provided in advance during the latter;

incomplete/inaccurate supervision, where both labeled and unlabeled examples are initially gathered regarding the first category, on contrast with the second one which is distinguished because of the noise that may govern the provided labeled examples, a fact that would cause intense deterioration on learning a specific task; and

active/semi-supervised learning, where there is a straightforward separation of the approaches that need or demand human intervention so as to blend human’s knowledge into their total learning kernel for acquiring safer decisions instead of being based solely on a base learner’s predictions building a more automated learning chain but with greater risks.

Following the recipes of WSL/PSL approaches, the abundance of collected unlabeled data might act as a valuable source of information, reducing the negative effect of the difficulty on obtaining much labeled data, because of the inherent difficulties that take place on several domains. Ethical issues, scarcity of specific incidents and highly expensive labeling procedures are some of the obstacles that usually prevent us from handling successfully the creation of a large enough

L per case [

9]. However, based on the assumption that both

L and

U are produced from the same underlying data distribution

$P$, the connection between the conditional distribution

$P\left(y|\mathit{x}\right)$ and the marginal distribution

$P\left(\mathit{x}\right)$ could lead potentially to more accurate learning functions:

$f:\mathit{X}\mapsto Y$, where

$\mathit{x}=\left\{{x}_{1},{x}_{2},\dots ,{x}_{N}\right\}$ stands for a typical example with

N features while

$y$ symbolizes the label that accompanies any such sample, being either known or unknown in the cases of

L and

U, respectively. Furthermore, through

$\mathit{X},Y$, we depict the space of the examples and the labels.

In this work, we aim to propose an accurate and robust batch-based inductive Active Learning (AL) algorithm for pool-based scenario, regarding the manner that the data are initially concerned. The base learner of the proposed AL algorithm is based on the adoption of an ensemble learner into its learning structure so as to cover efficiently the shortage of labeled data (

${\mathit{l}}_{i}\u03f5L$) along with the existence of a human oracle (

${H}^{oracle}$) that may provide us with trustworthy annotations. At the same time, a quite larger pool of unlabeled examples

$\left({\mathit{u}}_{i}\u03f5U\right)$ is available for mining its context [

10]. Since the need for well-established predictions during the labeling stage of

${u}_{i}$ is one of the most crucial point during AL strategies, the exploitation of ensemble learners seems mandatory in order to capture better the insights of the examined data. Several recent works are also directed towards introducing ensemble learners into AL or other WSL variants, such as Semi-supervised Learning (SSL) [

11], Cooperative Learning (CL) [

12]—also known as AL + SSL [

13,

14]—or even Transfer Learning (TL) [

15], while the field of supervised ensemble learners is still in bloom [

16].

In our case, we prefer the adoption of the Logitboost ensemble learner, a product of the well-known procedure of generating ensemble learners through serially formatting a Generalized Additive Model (GAM), which does not prevent us from adopting additive tree models, as follows:

According to its strong theory, Logitboost manipulates the distribution of the training dataset based on the errors that occur during their categorization, where

$M$ depicts the learning rounds of this iterative procedure [

17]. Following the generic concept of boosting, during the training stage, we fit a number of learners

$f(\xb7)$ which try to emphasize better on the instances that are misclassified. This is made through the weighting vector

$\mathit{w}$ that enables the fitted learner to modify its decisions towards covering the most difficult cases based on their weight factors. After

$M$ such rounds, we reach an iteratively boosted learning function, which has ideally transformed to a strong learner by continuously reducing the errors that the initial weak model and its previous variants faced.

Although the variant of AdaBoost is the most popular product of the boosting family algorithms, Logitboost actually constitutes a choice that may reward us compared to some defects that AdaBoost presents [

18]. To be more specific, Logitboost uses a smoother weighting function than the default AdaBoost classifier, a fact that allows to address better the examples that are highly misclassified since the direction towards learning the proper mapping function per each learning round is not heavily affected by them. Instead, their importance does not overwhelm the corresponding importance of the examples with smaller misclassified errors, providing more robust confidence scores:

$p\left(y|\mathit{x}\right)$. Taking into consideration that the choice of the most informative

${\mathit{u}}_{i}$ examples are more often than not selected through suitable metrics that depend on

$p\left({y}_{i}|{\mathit{u}}_{i}\right)$, such a behavior may be proven quite successful in practice [

19,

20]. Additionally, the convergence of the Logitboost scheme is not violated, as the general boosting procedure guarantees, since the logit-loss function which is described later is asymptotically minimized.

The favoring properties of the Logitboost ensemble learner have also been noticed in the related literature, although their cardinality is restricted. To be more specific, apart from using only univariate regressors inside this scheme, generalized functions could also be applied, increasing thus the total predictive ability but probably disrupting interpretability [

21]. Otero and Sanchez proposed the use of descriptive fuzzy learners inside Logitboost, modifying slightly the usual structure of default fuzzy learners and overpassing the behavior of a similar fuzzy-based AdaBoost version [

22], while a modification of the internal scoring mechanism based on distance from the decision regions using weak learners under Logitboost scheme was tested in [

23]. Naive Bayes (NB) has also been combined appropriately with this scheme, improving its total performance against other popular variants of Bayesian Networks [

24], while Logitboost’s operation was totally matched into the learning procedure proposed by Leathart et al., introducing Probability Calibration Trees (PCT) in the context of regression task, separating the input task space and fitting local predictors [

25].

Logitboost autoregressive networks made use of the same scheme for modeling conditional distributions, offering a procedure that could be parallelized, exploiting the advantages of boosting ensembles for which the hyperparameters are clearly less than that of Neural Networks (NNs) and appeared to converge for several examined cases into same values, at least for the shrinkage factor [

26]. More sophisticated multi-class expansions of Logitboost could further improve its applicability as it has been mentioned by the corresponding authors. This direction has been actually studied recently by some works, providing interesting expansions of the default multi-class operation of Logitboost scheme: Adaptive Base class (ABC) [

27] and Adaptive One vs. One (AOSO) Logitboost [

28].

Moreover, since the pool-based inductive AL strategies are inherently iterative procedures which are based on a few initially provided data, exploiting appropriately at least one

${H}^{oracle}$ for detecting the most informative

${u}_{i}$—further analysis is presented on the next Section—the importance of obtaining accurate predictions is highly considered, but time limitations may occur when much complex learning models are embedded into these strategies. Trying to satisfy this trade-off under the Logitboost wrapper, we propose the use of M5P, a model tree regressor that tackles efficiently high-dimensional data since it builds linear models after having grown its preferred decision tree structure, taking advantage of its widely accepted decent learning performance over various scientific fields [

29,

30]. On the other hand, with the greedy manner under which Logitboost acts, although it is applied under a number of learning rounds (M), its total complexity does not differentiate heavier than other state-of-the-art classification algorithms [

27]. Thus, its integration under AL strategies would not induce prohibitive time response in practice. Furthermore, since we maintain a regression tree as its base learner, both binary and multi-class classification problems can be addressed efficiently without inserting further modifications that would probably raise the computational complexity of the total algorithm.

Consequently, we propose the adoption of Logitboost(M5P) under pool-based AL classification problems, exploiting its favoring properties both for selecting informative unlabeled instances and for evaluating the final learning hypothesis, built into the gradually augmented L, based on the annotations that a powerful human oracle provides us. This combination has been recently examined in the scenario of SSL [

31], presenting remarkable performance. For investigating the overall ability of Logitboost(M5P) under AL scenario, we examined 91 different datasets, separated into binary and multi-class under 3 different query strategies against the baseline strategy of Random Sampling (RS), comparing its performance against 10 other well-known learning algorithms as well as the default use of weak Decision Trees (DTs)—to be more particular, one-node trees [

32]—into Logitboost, as it is usually met in the literature and related ML packages [

33]. A further study tuning one hyperparameter of the proposed algorithm was also made, proving that its learning performance may still improve under suitable preprocess stages which however are not easy to trust under the existence of limited training instances.

More details regarding AL and a description of the proposed framework for examining the efficacy of Logitboost(M5P) in the case of an AL ecosystem are provided in the next two sections, along with the experimental procedure, the results and our comments, following the structure of the current journal. The last section summarizes our contributions and the pros and cons of our proposed combination, based mainly on our results, while future directions are posed.

## 2. Materials and Methods

The main reason that we resort to PSL methods is the coexistence of both

L and

U, while the amount of the latter (size(

U)) is much larger than of the former (size(

L)):

$size\left(U\right)\gg size\left(L\right)$ [

6]. One of the subcategories of PSL algorithms is AL, where Settles has demonstrated a great survey work on this kind of algorithms [

10]. Trying not to present many details, we highlight the most important parts of such a learning strategy.

First of all, we employ a probabilistic classifier (

f) acting towards two different directions: searching for the most compatible

${u}_{i}$s and evaluating the final model after a predefined number of iterations is reached or until any other set stopping criterion is satisfied. The first part is handled by exploiting a proper sampling Query Strategy (

QS) which defines a specific criterion or metric

$\left(metri{c}_{usefulness}\right)$, so as to measure the informativeness or the utility of all the available

${u}_{i}$s. In order to detect the most convenient of them for creating a batch (

**B**) of potentially informative

${u}_{i}$s so as to increase the learning ability of the total AL learning procedure, we select the top b highly ranked instances:

where

$\mathit{B}$ is actually a subset of the applied

U during each iteration. The second is resolved through employing one or more human oracles or sources of information, like known crowdsourcing platforms, e.g., Amazon Mechanical Turk and CrowdFlower [

34].

This means that, after having detected the **B**, we ask the available ${H}^{oracle}$ to assign the corresponding label based on its knowledge background. Then, we merge the pairs of {**B**, ${H}^{oracle}\left(\mathit{B}\right)$} with the initially collected L during the first iteration or the current version of L for next iterations L^{iter}, where iter depicts the current iteration. Then, we refine f and repeat this procedure until a terminating condition is satisfied. In contrast with pure SSL approaches or in general with the wide spectrum of WSL approaches, the terminating condition of the empty U pool is not a realistic one here, since this would demand much effort on the side of the human factor. The participation of the latter introduces several trade-off situations that should be considered carefully.

According to the related literature [

20], there are 3 general kinds of

QS models on the field of pool-based AL and a group of hybrid ones that combine more than one strategy:

where more details are provided in [

35]. A quite important research orientation of the related community is the proposal of a new

QS, either introducing new metrics which may measure a behavior that seems more favorable for specific tasks [

36,

37] or trying to capture better the reasoning of some choices made by similar methods [

38]. One representative work related to this last category is the work of Vu-Linh Nguyen et al. [

39], exploring further Uncertainty Sampling (UncS)

QS, discriminating this into epistemic and aleatoric sampling strategies, highlighting their differences and proposing the first variant as more promising.

We actually adopted UncS in our AL framework, which tries to distinguish the ${u}_{i}$ instances for which the applied learning algorithm being trained on L^{iter} is less confident. For a binary problem, such an instance would induce $p\left({y}_{i}=0|{\mathit{u}}_{i}\right)\approx p\left({y}_{i}=1|{\mathit{u}}_{i}\right)\approx 0.5$. This strategy favors time-efficient solutions for the majority of ML algorithms because its time complexity demands a training stage of f over L^{iter} and an evaluation stage of U^{iter}. Since the cardinality of the former is smaller than the latter, especially for low labeled Ratios (R)—where R is defined as the ratio of the initial L’s cardinality against the total amount of both L and U—the needed computational resources can be bounded based on the computational complexity of the base learner. Of course, the size of the batches (b) and the number of executed iterations (k) also play important roles.

In order to investigate the efficiency of Logitboost(M5P) under UncS, we employed 3 separate metrics inside this wrapper strategy, comparing them each time with the baseline of RS, where no sophisticated criterion was assessed for selecting the participant

${u}_{i}$ of each batch but a random pick took place before the corresponding batch was provided to

${H}^{oracle}$. This strategy comes with no time costs during the mining of

U. This means that any examined QS should outreach this performance for being qualified as a valid one for the concept of AL. The relationship of each of the utilized

$metri{c}_{usefulness}$ (Least Confident:

LConf, Smallest Margin:

SMar and Ent:

Ent) is given here:

where

$p\left(y|{\mathit{u}}_{i}\right)$ is the confidence of the base learner on the examined

${\mathit{u}}_{i}$, while Equation (5) computes the difference between the two most probable classes

${y}^{1}and{y}^{2}$ of the same

${\mathit{u}}_{i}$ so as to return the most compatible choice from the available into

U.

With regards to the proposed base learner, Logitboost(M5P), more details are given here. Logitboost is an additive logistic regression algorithm that can be seen as a convex optimization problem. An additive model, like simple linear models or regression trees, for solving a binary problem has the function of the following form:

where

m is the number of classifiers,

${w}_{m}$ is the constants to be determined and

${h}_{m}$ is the chosen base functions along with their internal parameters

${\gamma}_{m}$. Assuming now that

${f}_{additive}\left(\mathit{x}\right)$ is the mapping that we need to fit our strong aggregate hypothesis and

${h}_{m}$ is the separate weak hypotheses, then the two-class boosting algorithm is fit by minimizing the next criterion:

where

y is the true class label and

$y\xb7{f}_{additive}\left(\mathit{x}\right)$ is the voting margin term [

40], while

$\epsilon [\xb7]$ denotes the expected value.

Adopting the negative binomial log-likelihood does not affect the minimizer of Equation (8) compared with the typical boosting function, enabling at the same time a smoother weighting of examples for which predictions of Logitboost learner are far away from the discriminating decision threshold, constituting a great asset. This procedure takes place using the Newton-like steps, a more complicated optimization process than in the case of exponential loss function of AdaBoost algorithm. However, this fact does not affect the ambition of minimizing the set loss function during the training process.

Regression trees are known for their ability to deal efficiently with large datasets in terms of both features and instances, added to their simplicity and robustness [

41]. The M5P regressor is a recreation of M5 algorithm [

42], where the portion of the dataset that reaches the leaf is classified by a linear regression model stored in each branch of the tree. For the dataset split, certain attributes are chosen using the standard deviation error (SDR) as a criterion for the best attributes to split the dataset at each node. The chosen attribute is the one with the maximum expectation to error reduction:

where

$Tre{e}_{i}$ refers to the subset of examples that have the ith result of the potential test and

$SD(\xb7)$ refers to the standard deviation of its argument. The stopping criteria is either the number of remaining instances to reach a certain number or a very small change in class value.

The successful competition of M5P against other regression trees or other conventional ML learners has been stated in recent literature [

30,

43,

44]. Its exploitation under the wrapper scheme of Logitboost could lead to a robust classifier that operates on the field of AL, both for choosing informative

${u}_{i}$ instances and for providing remarkable classification performance. Moreover, possible inaccurate predictions that would appear because of either a shortage of

${l}_{i}$ that covers the total range of the output values or weak indicators originally existing in the feature space of a dataset could be alleviated by the smooth weighting function that Logitboost applies, avoiding the overfitting phenomena that might discard its decisions [

45].

The last mechanism that needs to be described before more technical details of the experimental procedure is the increase in

L during the iterative process of a typical AL environment. To be more specific, the queried instances (b) are chosen in batches. The size of each batch relies on the size of the initial labeled set of each dataset and the number of predefined iterations (k). This happens because we adopted an augmenting strategy that aims to double the size of the labeled instances at the final (kth) iteration of the experiment. According to this concept, the steady value of the parameter b per dataset is computed as follows:

Thus, to execute a complete AL experiment fairly, we adopted a flexible process for which the pseudocode is placed in Algorithm 1. Initially, we start with size(

L) collected labeled instances, and before the final evaluation, 2*size(

L) instances are gathered, where the additional labeled instances have been assigned with pseudo-labels by

${H}^{oracle}$ after their selection through the combined interaction of any chosen QS and our selected base learner. During each iteration, a batch

**B** that consists of b instances is extracted from the

U subset and is added along with the decisions of the employed

${H}^{oracle}$ into the current

L subset. For the evaluation process, a test set is used to examine the accuracy of the Logitboost(M5P), which is now trained on the augmented labeled set through the process of the AL. The total procedure is as follows:

**Algorithm 1**Active learning scheme |

**1: Mode:** |

2: Pool-based scenario over a provided dataset D = X_{n x N} ⋃ Y_{n x 1} |

**3: x**_{i}—i-th instance vector with N features **x**_{i}: <x_{1}, x_{2}, … x_{N}> ∀ 1 ≤ i ≤ n |

4: y_{i}—scalar class variable with y_{i} ∊ {0, 1} or unknown ∀ 1 ≤ i ≤ n |

5: n—number of instances n = size(L) + size(U) |

**6:** **B**—batch of unlabeled samples that are labeled per iteration |

**7: Input:** |

8: L^{iter} (U^{iter})—(un)labeled instances during the iter-th iteration, L^{iter} ⊂ D, U^{iter} ⊂ D |

9: k—number of executed iterations |

10: base learner—the selected classifier |

11: QS(metric)—the selected Query Strategy along with its embedded metric |

**12: Preprocess**: |

13: b—size of batch **B** computed by Equation (10) |

**14: Main Procedure:** |

15: Set iter = 1 |

16: While iter < k do |

17: Train base learner on L^{iter} |

18: Assign class probabilities over each u_{i} ∊ U^{iter} |

19: Rank u_{i} according to QS(metric) |

20:Select the top-b ranked u_{i} formatting current **B** |

21: Provide batch **B** to human oracle and obtain their pseudo-labels: ${H}^{oracle}\left(\mathit{B}\right)$ |

22: Update L: L^{iter+1} ← L^{iter} ⋃ {**B**, ${H}^{oracle}\left(\mathit{B}\right)$} |

23: Update U: U^{iter+1} ← U^{iter}\{**B**} |

24: iter = iter + 1 |

**25: Output:** |

26: Train base learner on L^{k} for predicting class labels of test data |

## 4. Discussion

In this section, we briefly discuss the obtained results from both comparisons to summarize the overall results and to perceive better the assets of the proposed AL approach that is based on the combination of Logitboost scheme with the M5P regressor. First, application of Logitboost under AL has not been recorded in the literature in contrast to several other ML models [

57]. Consequently, it was reasonable to adopt a common AL framework to make fair comparisons with the selected approaches that are based on state-of-the-art algorithms during both of the examined settings. Thus, no tuning stages were inserted into the learning pipeline. The total experimental procedure was conducted regarding 4 different values of the R (%) parameter, trying to investigate further the behavior of all the examined approaches, assuming that the human oracle did not introduce any noisy decisions at all. Thus, we do not insert noisy decisions during augmentation of the initial labeled set, which shifts the responsibility of selecting informative instances to the learning ability of the base learner per case. Additionally, we are more interested on the lowest R-based scenarios since they constitute more realistic simulations of real-life WSL problems.

From the accuracy score recorded in

Table 3 through

Table 6, we can see that the proposed combination highly outperformed its rivals on the majority of the 91 datasets. The aggregated number of victories for both sets of comparisons—versus simple and ensemble base learners—have been placed in

Table 7 and

Table 8, where the proposed set managed to capture the best performance in 774 cases out of 1112 (69.6%) against 6 approaches and in 499 out of 866 cases (57.6%) against 4 approaches. Moreover, its defects seem to appear on specified datasets (e.g., “iris” and “post-operative” from binary datasets as well as “heart-h”, “saheart” and “newthyroid” from multiclass ones). The structure of these datasets should be examined further, but probably a tuning stage of Logitboost scheme, for which the parameters could be expanded because of the presence of the internal regressor, might lead to better performance against algorithms like kNN or learners that are based on DTs, either individually or under an ensemble fashion.

As it concerns the first experimental scenario, the distribution of the number of victories of the proposed AL approach was similar across the different query strategies and the labeled ratios, denoting its general efficacy against the other approaches. The 3 distinct kNN learners also performed cumulatively 243 victories, constituting a useful proof about their robustness despite the restricted number of labeled examples [

58]. On the other hand, the performance of NB as a base learner was disappointing, affected by the aforementioned shortage of numerous initial examples since it did not record any victory.

During the second and, of course, more challenging scenario against ensemble base learners, the proposed algorithm was again more competitive and outperformed the rest in the majority of the grouped experiments. However, LMT-based approaches managed also to score several winning accuracies per dataset, especially in binary problems when larger R-based experiments were conducted. In fact, LMT expands the Logitboost procedure internally into its main learning kernel but, at its final stage, exploits only a subset of the initial feature space so as to build its logistic model. This property seems favorable for the aforementioned case, as the statistical rankings placed in

Table 10 prove. Specifically, for binary problems, the proposed algorithm performed significantly better behavior than the LMT-based AL approach only for the case of R = 5%, while for the next two comparisons, no statistical difference was recorded, ranking Logitboost(M5P)-based approaches in first with a slight lead, while in the last scenario, where R = 20%, the LMT significantly outperformed the proposed one. However, in multi-class problems, this behavior was not repeated. This kind of result possibly denotes the existence of noisy features that highly affect binary problems and/or highlights the overfitting phenomena that Logitboost may face when outliers are inserted into its training stage.

Returning to the first set of experiments, a statistical comparison was executed for verifying the results obtained from the examined query strategies against also the baseline of random sampling.

Figure 2 captures the performance per district learner, where the proposed base learner recorded a statistically significant behavior against its baseline as well as the rest of the AL approaches, being ranked always as the best across all the conducted scenarios. It is remarkable that the RS(Logitboost(M5P)) approach managed also to outreach the other simple algorithms on average, proving the overall predictive ability of the proposed ensemble learner. This last note highlights also the implications that may occur when ranking the available unlabeled examples, a fact that may deteriorate the total learning behavior since the less informative the selected instances are labeled, the more redundancy that occurs in the gradually augmented training subset.

Discussing again the second experiment setup, a similar procedure was implemented, where besides a comparison with the LMT ensemble learner, useful conclusions were drawn through the conducted comparisons. The replacement of a weak one-node tree with the M5P of model trees under the Logitboost scheme was substantially examined, along with the use of the AdaBoost procedure. This amendment helped us to clarify even better the overall benefits of the proposed boosting approach, since the improvement that was noticed mainly against these two approaches was impressive. Additionally, two separate figures (

Figure 3 and

Figure 4) were produced to better visualize the discriminative ability of the proposed approach against other ensemble learners, which is clearly formatted by the initial iterations in 8 out of 9 selected datasets, recording also more robust learning behavior judging by its fluctuations along the iterative procedure of the applied AL framework. The instable behavior of AdaBoost(DStump) is also remarkable, showing clearly its untrustworthy behavior compared with the proposed one, as intense fluctuation was recorded.

Finally, we also conducted a study of one hyperparameter of the Logitboost scheme, without tuning further the internal base learner of M5P, in order to verify its optimality regarding at least one parameter of this scheme. This hyperparameter was selected to be the number of iterations that are executed during its training stage, a property that affects both its spent computational resources and the main drawback of boosting procedures: overfitting. We noticed that, more often than not, the default approach with 10 internal iterations did not achieve the best performance. This fact leaves much space for further investigation on the parameters of the Logitboost scheme under AL learning scenarios, which is however not easy to shed light on because of the limited initial data that are in practice provided. We pose the corresponding statistical rankings in

Table 10.