This section presents the results addressing each of the research questions stated above. For clarity, the results are organized into different subsections, each corresponding to a research question and displaying the findings for each of the three datasets.
6.1. Scenario 1: Data Augmentation
Scenario 1 compared the performance of different data augmentation strategies across the three datasets. The performance of the models was analyzed with a focus on the effectiveness of these strategies in improving predictions for minority classes.
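As a point of reference for the strategies compared, random oversampling, the simplest of them, can be sketched in a few lines (balancing every class up to the majority-class count is our assumption here; G-SMOTE and the CTGAN variants replace the duplication step with geometric interpolation or a learned generative model):

```python
import random

def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen samples of each under-represented class
    until every class matches the majority-class count."""
    rng = random.Random(seed)
    by_class = {}
    for xi, yi in zip(X, y):
        by_class.setdefault(yi, []).append(xi)
    target = max(len(rows) for rows in by_class.values())
    X_out, y_out = [], []
    for label, rows in by_class.items():
        resampled = rows + [rng.choice(rows) for _ in range(target - len(rows))]
        X_out.extend(resampled)
        y_out.extend([label] * len(resampled))
    return X_out, y_out

# A 6:2 imbalanced toy set becomes 6:6 after oversampling.
X = [[0], [1], [2], [3], [4], [5], [6], [7]]
y = [0, 0, 0, 0, 0, 0, 1, 1]
X_bal, y_bal = random_oversample(X, y)
print(y_bal.count(0), y_bal.count(1))  # 6 6
```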
Figure 4 shows how the performance of the different strategies evolves in terms of F1-score each time a new sample is labeled. The shaded area in our graphs represents the standard deviation over experiments conducted with three different random seeds, giving a visual indication of the variability of the results. For the Wine dataset, the baseline (no augmentation) approach performs best in terms of F1-score, with random oversampling performing best among the augmentation strategies (the CTGAN-based approaches and G-SMOTE). Similarly, in the Baseball dataset, the baseline approach yields the highest performance, followed by random oversampling. The high baseline performance (0.95) indicates that the model can predict most instances accurately without augmentation. In the case of the Steel Plates dataset, random oversampling presents slightly better results than the baseline, and training one CTGAN per class also matches the baseline's performance. Overall, all approaches perform similarly in terms of F1-score on the Steel Plates dataset.
Although the F1-score is a useful metric for evaluating models on imbalanced datasets, it can sometimes be dominated by performance on the majority classes, especially in highly imbalanced datasets. Therefore, since our purpose is the improvement of minority classes, balanced accuracy was also analyzed (see Figure 5), as it gives equal importance to the performance on each class.
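Balanced accuracy is the unweighted mean of the per-class recalls, so every class carries equal weight regardless of its size. A minimal sketch illustrating why it exposes the majority-class bias that plain accuracy hides:

```python
from collections import defaultdict

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recalls: each class counts equally."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        if t == p:
            hits[t] += 1
    return sum(hits[c] / totals[c] for c in totals) / len(totals)

# Toy imbalanced case: 8 majority (class 0), 2 minority (class 1);
# a classifier that always predicts 0 has 0.8 plain accuracy but only
# 0.5 balanced accuracy, since the minority class is never recovered.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10
print(balanced_accuracy(y_true, y_pred))  # 0.5
```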
In the Wine dataset, CTGAN-based approaches present the best performance, with the CTGAN per class approach being the most effective, achieving a balanced accuracy of 0.43 (see Table 4). All augmentation techniques outperform the non-augmented baseline on the Wine dataset. In the Baseball dataset, random oversampling and G-SMOTE achieve the best results, both with values of 0.8. The CTGAN per class approach also performs better than the non-augmented baseline. For the Steel Plates dataset, the CTGAN per class approach and random oversampling achieve the best results (balanced accuracy of 0.82), with G-SMOTE also outperforming the baseline.
To specifically evaluate the improvement in minority classes, we analyzed the recall per class for each augmentation technique, as shown in Table 5. This analysis was conducted to understand the impact of the different augmentation strategies on individual class performance, providing insight into their effectiveness in addressing class imbalance. Detailed graphs of recall per class for each dataset are provided in Appendix B.
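The per-class recall values reported in the tables can be computed as follows (a minimal sketch):

```python
def recall_per_class(y_true, y_pred):
    """Recall of each class: correctly predicted instances / actual instances."""
    recalls = {}
    for c in sorted(set(y_true)):
        actual = [i for i, t in enumerate(y_true) if t == c]
        hits = sum(1 for i in actual if y_pred[i] == c)
        recalls[c] = hits / len(actual)
    return recalls

# Majority class 0 is predicted well, minority class 1 poorly.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 0, 1]
print(recall_per_class(y_true, y_pred))  # {0: 0.75, 1: 0.5}
```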
Analyzing the values in Table 5, we observe that, in the Wine dataset, the classes with a larger number of instances (Classes 2 and 3) are best predicted by the baseline approach. However, for the classes with fewer instances (Classes 0, 1, 4, and 5), most augmentation strategies outperform the baseline recall, with the exception of Class 5, where all strategies perform similarly. Specifically, the CTGAN per class strategy presents the best results in predicting Classes 1 and 4, increasing recall by 33.33% and 28.49%, respectively, while CTGAN performs best for Class 0 with a recall of 48.48%. The Baseball dataset shows a similar behavior: the majority class (Class 0) is best predicted by the baseline approach, while for the minority classes (Classes 1 and 2), CTGAN per class and G-SMOTE improve on the baseline. This suggests that augmentation strategies are particularly effective for classes with fewer instances. The Steel Plates dataset further reinforces this trend: CTGAN per class outperforms all other strategies for Classes 0 to 4, while random oversampling performs best in predicting Class 5. Once again, the majority class is best predicted by the baseline approach.
After analyzing the results for specific classes, we can confirm that the global F1-score shown in the earlier graphs was influenced by the performance on majority classes, leading to an apparent decrease under augmentation strategies. However, a deeper analysis of balanced accuracy and individual recall per class revealed that most techniques improved performance for minority classes in exchange for a slight decrease in majority-class recall. We hypothesize that maintaining the imbalance ratio of the dataset, rather than balancing it to the majority class, could better manage this trade-off. This highlights the importance of focusing on minority-class performance in imbalanced datasets and suggests that augmentation strategies can provide significant benefits in these scenarios. Overall, although the performance of the different augmentation strategies varies from case to case, the CTGAN per class approach consistently shows the best results.
Table 6 illustrates the percentage of real data usage required to achieve the maximum performance of the baseline approach for different augmentation strategies. Specifically, the baseline values represent the percentage of real data needed by the baseline to achieve the best performance in both the F1-score and balanced accuracy. The subsequent columns indicate the percentage of real data needed by the augmentation techniques to match the baseline performance. If an augmentation technique does not reach the baseline performance, it is denoted by a dash (“-”).
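The values in Table 6 can be derived from the learning curves by finding the first point at which a strategy's score reaches the baseline's maximum. A minimal sketch (the function name and the toy curves are illustrative assumptions, not the paper's actual results):

```python
def real_data_to_match(curve, target, total_samples):
    """Return the percentage of real labeled samples at which `curve`
    (score measured after each newly labeled sample) first reaches
    `target`; None if the target is never reached (a dash in Table 6)."""
    for i, score in enumerate(curve, start=1):
        if score >= target:
            return 100.0 * i / total_samples
    return None

# Toy curves: the augmented strategy reaches the baseline's best score
# (0.75) after labeling only 3 of 5 samples, i.e., 60% of the real data.
baseline = [0.50, 0.60, 0.70, 0.72, 0.75]
augmented = [0.55, 0.70, 0.75, 0.75, 0.75]
target = max(baseline)
print(real_data_to_match(augmented, target, len(augmented)))  # 60.0
```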
Based on the results, and consistent with the values presented in Table 4, no augmentation strategy is able to outperform the baseline F1-score in the Wine and Baseball datasets. However, in the Steel Plates dataset, random oversampling reduces the amount of real data needed to reach the baseline performance by 33.45%. On the other hand, nearly all strategies match the baseline balanced accuracy while using less real data (with the exception of CTGAN in the Baseball dataset). Among these strategies, random oversampling stands out for its significant reduction in data requirements in both the Wine and Steel Plates datasets, while CTGAN achieves an impressive reduction of 98% in the Baseball dataset.
6.2. Scenario 2: Active Learning
This scenario analyzed the effect of data augmentation approaches within an active learning framework and aimed to determine whether data augmentation helps reduce the number of real samples needed to reach a given performance in active learning scenarios.
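The interaction between active learning and augmentation evaluated in this scenario can be sketched as a generic loop (the arguments `fit`, `uncertainty`, and `augment` are placeholders we introduce for illustration; they stand in for the classifier, the query strategy, and whichever augmentation technique is under comparison):

```python
import random

def active_learning_loop(pool_X, pool_y, budget, fit, uncertainty, augment, seed=0):
    """Each round: query the most uncertain unlabeled sample, have the
    oracle label it, augment the labeled set, and refit the model."""
    rng = random.Random(seed)
    pool = list(zip(pool_X, pool_y))
    rng.shuffle(pool)
    labeled = [pool.pop()]                        # seed with one random sample
    model = fit(*augment(*zip(*labeled)))
    while len(labeled) < budget and pool:
        # Select the pool sample the current model is least sure about.
        i = max(range(len(pool)), key=lambda j: uncertainty(model, pool[j][0]))
        labeled.append(pool.pop(i))               # oracle provides the label
        model = fit(*augment(*zip(*labeled)))     # retrain on augmented data
    return labeled, model

# With a trivial "model" and identity augmentation, the loop simply
# labels `budget` samples in uncertainty order:
labeled, _ = active_learning_loop(list(range(10)), [0] * 5 + [1] * 5, budget=4,
                                  fit=lambda X, y: None,
                                  uncertainty=lambda model, x: x,
                                  augment=lambda X, y: (X, y))
print(len(labeled))  # 4
```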
The F1-score graphs for Scenario 2 (Figure 6) present patterns similar to those in Scenario 1. For the three datasets, both the baseline and random oversampling achieve the highest F1-score (see Table 7 for F1-score values). On the Wine dataset, the CTGAN per class approach and G-SMOTE achieve results similar to both the baseline and random oversampling, while, on the Steel Plates dataset, the CTGAN approach performs as well as both strategies.
Focusing on the balanced accuracy graphs in Figure 7 and the values in Table 7, we observe that the CTGAN per class approach performs best on the Wine dataset, reaching a value of 0.44, whereas the baseline and random oversampling both scored 0.38. G-SMOTE achieved the best value (0.78) on the Baseball dataset, slightly outperforming the other methods. Finally, on the Steel Plates dataset, the CTGAN per class approach again showed the best performance (0.84), followed by random oversampling. Overall, the results in Table 7 show that, while a simple augmentation strategy such as random oversampling presents good results in terms of F1-score, more sophisticated methods like CTGAN per class and G-SMOTE tend to perform better in terms of balanced accuracy, indicating their effectiveness in managing class imbalance.
Because the improvement of minority-class predictions is our main interest, recall values for each class at the point of maximum balanced accuracy are presented in Table 8. The detailed graphs of recall per class are not included here; they are provided in Appendix C.
Results for the Wine dataset show that the CTGAN-based and G-SMOTE augmentation techniques decrease the recall of the classes with more instances (Classes 2 and 3), with the baseline and random oversampling achieving the highest recall rates in those classes (76.89% and 70.74%, respectively). However, the augmentation techniques present significant improvements for the minority classes. CTGAN and CTGAN per class show notable recall improvements for Classes 0, 1, and 4. For instance, CTGAN achieves a recall of 45.45% for Class 0, while all other strategies present recalls close to 0%. Similarly, the CTGAN per class approach performs best for Class 1, obtaining a recall of 29.80% compared to the baseline's 0% and random oversampling's 0.55%, and also outperforms the other methods for Class 4 with a recall of 71.81%. For Class 5, all methods achieve the same recall of 33.33%.
In the Baseball dataset, random oversampling and the baseline again achieve the highest recall for the majority class (Class 0), with values of 98.54% and 98.43%, respectively. For Class 1, CTGAN per class stands out with a recall of 87.88%, surpassing the baseline and random oversampling, both at 83.33%. G-SMOTE performs best for minority Class 2, achieving a recall of 67.83%, slightly higher than CTGAN's 66.20%. Finally, on the Steel Plates dataset, CTGAN per class shows the highest recall for Classes 0, 1, and 2, with rates of 82.58%, 94.74%, and 95.84%, respectively. For Class 3, all methods achieve the same recall of 92.86%. G-SMOTE achieves the highest recall for Class 4 with 81.54%. For the majority classes (5 and 6), random oversampling (73.4% in Class 5) and the baseline (74.39% in Class 6) once again present the best results.
The recall results over individual classes demonstrate that the effectiveness of augmentation techniques varies across classes and datasets. CTGAN and CTGAN per class often show significant recall improvements for minority-class instances, particularly in the Wine and Steel Plates datasets, while the baseline and random oversampling normally achieve the highest recall on the majority classes but fail to improve performance on the minority classes.
Table 9 shows the percentage of real data usage required to reach the maximum performance of the baseline approach for the different augmentation strategies in active learning scenarios, with the aim of determining whether data augmentation can help active learning (AL) reduce the amount of real data needed.
The results show that data augmentation does not help active learning use less real data on the Baseball dataset: the baseline without augmentation achieves the desired performance with less data in terms of both F1-score and balanced accuracy. This suggests that, for this particular dataset, active learning alone is sufficient. In the Wine dataset, random oversampling uses less real data to reach the baseline performance in terms of F1-score. Similarly, for the Steel Plates dataset, all augmentation techniques except CTGAN outperform the baseline in terms of data efficiency, with CTGAN per class being the most effective. This indicates that data augmentation can effectively enhance the efficiency of active learning on these datasets. When considering balanced accuracy, all data augmentation strategies outperform the baseline in terms of the amount of real data used for the Wine and Steel Plates datasets: in the Wine dataset, G-SMOTE is the most effective, while in the Steel Plates dataset, CTGAN per class stands out as the best technique. In summary, while the benefit of combining data augmentation with active learning may depend on the data structure, as seen with the Baseball dataset, data augmentation can enhance the efficiency of active learning by reducing the amount of real data needed, as seen with the Wine and Steel Plates datasets.
6.3. Scenario 3: Iterative Synthetic Data Generation
This scenario studied the effect of iterative data augmentation in active learning scenarios. After each active learning iteration, newly generated synthetic data were introduced into the labeled dataset. Two strategies were followed to generate the data: generating synthetic data only for the minority class (the minority class approach), or generating one instance for each minority class (the all-minority-class approach).
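The two generation strategies can be sketched as follows (the `generate(label, n)` interface is an assumption standing in for sampling from CTGAN or G-SMOTE, and `synthetic_batch` is an illustrative name, not the paper's implementation):

```python
from collections import Counter

def synthetic_batch(y_labeled, generate, strategy="minority"):
    """Decide which classes receive synthetic data after an AL iteration.
    `generate(label, n)` is a placeholder for the generative model."""
    counts = Counter(y_labeled)
    majority = max(counts, key=counts.get)
    if strategy == "minority":
        # Generate only for the single least-represented class.
        smallest = min(counts, key=counts.get)
        return generate(smallest, 1)
    # "all-minority": one instance for every class except the majority.
    batch = []
    for label in counts:
        if label != majority:
            batch.extend(generate(label, 1))
    return batch

# A fake generator tagging each synthetic instance with its label.
fake = lambda label, n: [(label, "synthetic")] * n
y = [0, 0, 0, 1, 1, 2]
print(synthetic_batch(y, fake, "minority"))   # [(2, 'synthetic')]
print(len(synthetic_batch(y, fake, "all")))   # 2
```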
Figure 8 presents the F1-score performance. Wine dataset results are displayed in the top row, Baseball dataset results in the middle row, and the Steel Plates dataset results in the bottom row. For each dataset, the left column shows the F1-scores for the minority class approach, while the right column presents the F1-scores for the all-minority-class approach.
The graphs show very similar performance for all data augmentation techniques in terms of F1-score on the Wine and Steel Plates datasets. In the Baseball dataset, however, the CTGAN approach does not reach the performance level of the other techniques. Furthermore, when comparing the two augmentation strategies (augmenting only the minority class versus augmenting all minority classes), performance is very similar across all datasets, suggesting that both approaches are equally effective in enhancing the F1-score regardless of the specific dataset.
This pattern is confirmed by the balanced accuracy graphs (Figure 9), except for a slight improvement of CTGAN per class in the Wine dataset. Given the similar results observed for the two augmentation strategies (see Table 10), and for clarity in the body of the manuscript, we present the per-class recall results only for the minority class approach; results for the all-minority-class approach are included in Appendix D for reference. The analysis of real data usage is detailed for both approaches.
Table 10 shows the maximum F1-score and balanced accuracy values for iterative data augmentation across the different datasets. For the Wine dataset, G-SMOTE achieves the best F1-score (0.67) when using the minority class approach. In terms of balanced accuracy, CTGAN and CTGAN per class both outperform the baseline, achieving a score of 0.42. The all-minority-class approach shows similar behavior, with CTGAN per class presenting the best balanced accuracy results. In the Baseball dataset, the baseline method achieves the same F1-score as CTGAN per class and G-SMOTE under both the minority class and all-minority-class approaches. In balanced accuracy, CTGAN outperforms the other methods under both approaches. For the Steel Plates dataset, all methods again perform similarly under both strategies, with a best F1-score of 0.78 and a best balanced accuracy of 0.79. Specific recall results for each class are presented in Table 11.
The recall results of iterative data augmentation on specific classes offer some interesting insights. In the Wine dataset, CTGAN per class shows the best result in predicting minority Class 0, while the CTGAN approach improves on the baseline for minority Classes 1 and 4 by 7.71% and 11.88%, respectively. All strategies obtain the same results for Class 5, and, as in the previous scenarios, the baseline achieves the best result for majority Class 2. However, unlike in previous scenarios, G-SMOTE slightly outperforms the other methods on the remaining majority class (Class 3). This pattern also appears in the Baseball dataset, where G-SMOTE again performs marginally better than the baseline in predicting majority Class 0. On the Steel Plates dataset, majority Class 6 is best predicted by the baseline approach, while CTGAN and CTGAN per class obtain the best results in predicting minority Classes 4 and 5, respectively. These results show that the effectiveness of iterative data augmentation varies across classes and datasets, suggesting that the best approach depends on the data structure and distribution. Looking for general patterns, the CTGAN-based approaches seem to present a more balanced performance across classes, and G-SMOTE seems to reduce the loss on majority classes often caused by augmentation techniques. Beyond the recall results, Table 12 presents the percentage of real data used by each strategy to reach the baseline performance.
The data efficiency results for the Wine dataset show that iterative data augmentation with G-SMOTE reduces the real data needed to reach the baseline's F1-score in both the minority class and all-minority-class cases, while CTGAN per class does the same for balanced accuracy. For the Baseball dataset, although no approach performs better than the baseline in terms of F1-score, the CTGAN approach reaches the baseline balanced accuracy using 24.6% less real data under the minority class approach and 25.5% less under the all-minority-class approach. Finally, for the Steel Plates dataset, CTGAN reduces the data needed to reach the baseline F1-score under the minority class approach, and G-SMOTE does the same under the all-minority-class approach. These results indicate that data augmentation techniques applied iteratively, particularly CTGAN and G-SMOTE, can significantly reduce the amount of real data needed to achieve baseline performance in certain scenarios.