Entropy in Real-World Datasets and Its Impact on Machine Learning

A special issue of Entropy (ISSN 1099-4300). This special issue belongs to the section "Multidisciplinary Applications".

Deadline for manuscript submissions: closed (15 December 2022) | Viewed by 28927

Printed Edition Available!
A printed edition of this Special Issue is available here.

Special Issue Editors

Prof. Dr. Jan Kozak
1. Department of Knowledge Engineering, University of Economics, 1 Maja 50, 40-287 Katowice, Poland
2. Łukasiewicz Research Network - Institute of Innovative Technologies EMAG, 40-189 Katowice, Poland
Interests: machine learning; ensemble methods; decision trees; ant colony optimization; computational intelligence; data analysis; optimization

Dr. Przemysław Juszczuk
Systems Research Institute, Polish Academy of Sciences, 01-447 Warszawa, Poland
Interests: machine learning; multicriteria optimization; game theory; data analysis; financial markets; decision support

Special Issue Information

Machine learning is nowadays understood as a broad group of methods used to solve some of the most complex real-world problems. It has proven indispensable in fields such as medicine, finance, text mining, and image analysis. Prominent examples of machine learning-related methods include ensemble methods, multicriteria evolutionary algorithms, and deep neural networks. Here, we are particularly interested in topics connecting the entropy of datasets with the effectiveness of machine learning algorithms.

The main focus of this Special Issue is entropy in the ever-growing volume of data available to users. Concepts such as big data and data streams continue to gain attention, yet classical methods are of debatable efficiency on these types of data; we therefore believe that there is a need for continuous improvement in what is broadly understood as machine learning. This Special Issue is dedicated to the analysis of real-world datasets, in particular with respect to the entropy present in them and its impact on machine learning.
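As a simple illustration of what "entropy present in a dataset" can mean in practice, the short sketch below computes the Shannon entropy of a class-label distribution (a minimal example assuming NumPy; low values indicate imbalanced, easy-to-guess targets, high values balanced ones):

```python
import numpy as np

def class_entropy(labels):
    """Shannon entropy (in bits) of a dataset's class distribution."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

# A heavily imbalanced binary dataset has low entropy,
# a balanced multi-class dataset has high entropy.
print(class_entropy(["spam"] * 95 + ["ham"] * 5))   # ~0.286 bits
print(class_entropy(["a", "b", "c", "d"] * 25))     # 2.0 bits
```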

We invite researchers to focus especially on entropy and on methods related to the application of machine learning to real-world problems.

Prof. Dr. Jan Kozak
Dr. Przemysław Juszczuk
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Entropy is an international peer-reviewed open access monthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2600 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • entropy as a measure
  • real-world datasets
  • machine learning algorithms
  • information gain
  • optimization
  • classification
  • prediction methods
  • entropy in big data


Published Papers (13 papers)


Research

18 pages, 4010 KiB  
Article
Processing Real-Life Recordings of Facial Expressions of Polish Sign Language Using Action Units
by Anna Irasiak, Jan Kozak, Adam Piasecki and Tomasz Stęclik
Entropy 2023, 25(1), 120; https://0-doi-org.brum.beds.ac.uk/10.3390/e25010120 - 06 Jan 2023
Viewed by 1535
Abstract
Automatic translation between the national language and sign language is a complex process similar to translation between two different foreign languages. A very important aspect is the precision of not only manual gestures but also facial expressions, which are extremely important in the overall context of a sentence. In this article, we present the problem of including facial expressions in the automation of Polish-to-Polish Sign Language (PJM) translation—this is part of an ongoing project related to a comprehensive solution allowing for the animation of manual gestures, body movements and facial expressions. Our approach explores the possibility of using action unit (AU) recognition in the automatic annotation of recordings, which in the subsequent steps will be used to train machine learning models. This paper aims to evaluate entropy in real-life translation recordings and analyze the data associated with the detected action units. Our approach has been subjected to evaluation by experts related to Polish Sign Language, and the results obtained allow for the development of further work related to automatic translation into Polish Sign Language. Full article

16 pages, 533 KiB  
Article
Improved EAV-Based Algorithm for Decision Rules Construction
by Krzysztof Żabiński and Beata Zielosko
Entropy 2023, 25(1), 91; https://0-doi-org.brum.beds.ac.uk/10.3390/e25010091 - 02 Jan 2023
Viewed by 1180
Abstract
In this article, we present a modification of an algorithm based on the EAV (entity–attribute–value) model for the induction of decision rules, utilizing a novel approach to attribute ranking. The selection of attributes used as premises of decision rules is an important stage of the rule induction process. In the presented approach, this task is realized using a ranking of attributes based on the standard deviation of attribute values per decision class, which is considered a distinguishability level. The presented approach works not only with numerical attribute values but also with categorical ones. For this purpose, an additional step of data transformation into a matrix format has been proposed. It transforms the data table into a binary one with proper equivalents of the categorical attribute values and ensures that the influence of the attribute selection function is independent of the variables' data types. The motivation for the proposed method is the development of an algorithm that constructs rules close to optimal ones in terms of length while maintaining sufficiently good classification quality. The experiments presented in the paper were performed on data sets from the UCI ML Repository, comparing the results of the proposed approach with three selected greedy heuristics for the induction of decision rules, taking into consideration classification accuracy as well as the length and support of the constructed rules. The obtained results show that for most of the datasets, the average length of the rules obtained for the 80% best attributes from the ranking is very close to the values obtained for the whole set of attributes. In the case of classification accuracy, for 50% of the considered datasets, the results obtained for the 80% best attributes from the ranking are higher than or equal to the results obtained for the whole set of attributes. Full article
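As a rough sketch of the std-per-class attribute ranking described above (one possible interpretation rather than the authors' exact formula; the pandas-based helper and the toy data are illustrative assumptions):

```python
import pandas as pd

def rank_attributes(df: pd.DataFrame, decision: str) -> pd.Series:
    """Rank attributes by the spread of their per-class statistics.

    One possible reading of a std-per-class ranking: attributes whose
    mean value differs strongly between decision classes are assumed
    to distinguish the classes better.
    """
    features = df.drop(columns=[decision])
    # One-hot encode categorical attributes so every column is numeric.
    features = pd.get_dummies(features)
    per_class_means = features.groupby(df[decision]).mean()
    return per_class_means.std().sort_values(ascending=False)

# Toy usage
data = pd.DataFrame({
    "outlook": ["sunny", "rainy", "sunny", "overcast"],
    "temp": [30, 18, 28, 21],
    "play": ["no", "yes", "no", "yes"],
})
print(rank_attributes(data, decision="play"))
```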

20 pages, 1617 KiB  
Article
New Classification Method for Independent Data Sources Using Pawlak Conflict Model and Decision Trees
by Małgorzata Przybyła-Kasperek and Katarzyna Kusztal
Entropy 2022, 24(11), 1604; https://0-doi-org.brum.beds.ac.uk/10.3390/e24111604 - 04 Nov 2022
Cited by 1 | Viewed by 1222
Abstract
The research concerns data collected in independent sets—more specifically, in local decision tables. A possible approach to managing these data is to build local classifiers based on each table individually. In the literature, many approaches to combining the final prediction results of independent classifiers can be found, but insufficient effort has been put into studying the cooperation of tables and the formation of coalitions. Such an approach was expected to matter on two levels. First, the impact on the quality of classification—the ability to build combined classifiers for coalitions of tables should allow more generalized concepts to be learned, which in turn should affect the quality of classification of new objects. Second, combining tables into coalitions reduces computational complexity, since fewer classifiers have to be built. The paper proposes a new method for creating coalitions of local tables and generating an aggregated classifier for each coalition. Coalitions are generated by determining certain characteristics of the attribute values occurring in the local tables and applying the Pawlak conflict analysis model. In the study, classification and regression trees with the Gini index are built based on the aggregated table of each coalition. The system has a hierarchical structure, as in the next stage the decisions generated by the coalition classifiers are aggregated using majority voting. The classification quality of the proposed system was compared with an approach that does not use local data cooperation and coalition creation; in that baseline, the structure of the system is parallel and decision trees are built independently for the local tables. The paper shows that the proposed approach provides a significant improvement in classification quality and execution time. The Wilcoxon test confirmed that the differences in accuracy rate between the results obtained for the proposed method and those obtained without coalitions are significant (p = 0.005). The average accuracy rates obtained for the proposed approach and the approach without coalitions are 0.847 and 0.812, respectively, so the difference is quite large. Moreover, the algorithm implementing the proposed approach performed up to 21 times faster than the algorithm implementing the approach without coalitions. Full article
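A minimal sketch of the two-stage idea (one CART-style tree per coalition, then majority voting), assuming the coalitions themselves are already given; the Pawlak conflict-analysis step is omitted, and the function names and use of scikit-learn are assumptions:

```python
from collections import Counter

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

def fit_coalition_trees(coalitions, target="class"):
    """Fit one CART-style tree (Gini index) per coalition.

    `coalitions` is a list of lists of local decision tables (DataFrames)
    sharing the same columns; attributes are assumed to be numeric or
    already encoded. How the coalitions are formed (Pawlak conflict
    analysis in the paper) is outside this sketch.
    """
    models = []
    for tables in coalitions:
        aggregated = pd.concat(tables, ignore_index=True)
        X, y = aggregated.drop(columns=[target]), aggregated[target]
        models.append(DecisionTreeClassifier(criterion="gini").fit(X, y))
    return models

def predict_majority(models, X):
    """Second stage: aggregate coalition classifiers by majority voting."""
    votes = [m.predict(X) for m in models]
    return [Counter(col).most_common(1)[0][0] for col in zip(*votes)]
```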

12 pages, 290 KiB  
Article
Selected Data Mining Tools for Data Analysis in Distributed Environment
by Mikhail Moshkov, Beata Zielosko and Evans Teiko Tetteh
Entropy 2022, 24(10), 1401; https://0-doi-org.brum.beds.ac.uk/10.3390/e24101401 - 01 Oct 2022
Cited by 2 | Viewed by 1184
Abstract
In this paper, we deal with distributed data represented either as a finite set T of decision tables with equal sets of attributes or a finite set I of information systems with equal sets of attributes. In the former case, we discuss a way to study the decision trees common to all tables from the set T: building a decision table in which the set of decision trees coincides with the set of decision trees common to all tables from T. We show when such a decision table can be built and how to build it in polynomial time. If we have such a table, we can apply various decision tree learning algorithms to it. We extend the considered approach to the study of the tests (reducts) and decision rules common to all tables from T. In the latter case, we discuss a way to study the association rules common to all information systems from the set I: building a joint information system for which the set of true association rules that are realizable for a given row ρ and have a given attribute a on the right-hand side coincides with the set of association rules that are true for all information systems from I, have the attribute a on the right-hand side, and are realizable for the row ρ. We then show how to build such a joint information system in polynomial time. When we have built such an information system, we can apply various association rule learning algorithms to it. Full article

22 pages, 12418 KiB  
Article
A Short-Term Hybrid TCN-GRU Prediction Model of Bike-Sharing Demand Based on Travel Characteristics Mining
by Shenghan Zhou, Chaofei Song, Tianhuai Wang, Xing Pan, Wenbing Chang and Linchao Yang
Entropy 2022, 24(9), 1193; https://0-doi-org.brum.beds.ac.uk/10.3390/e24091193 - 26 Aug 2022
Cited by 6 | Viewed by 2103
Abstract
This paper proposes an accurate short-term prediction model of bike-sharing demand based on a hybrid TCN-GRU method. The emergence of shared bicycles has provided people with a low-carbon, green and healthy mode of transportation. However, the explosive growth and free-form development of bike-sharing have also brought about a series of problems for urban governance, creating both an opportunity and a challenge in using large amounts of historical data for regional bike-sharing traffic flow prediction. In this study, we built an accurate short-term prediction model of bike-sharing demand with a bike-sharing dataset covering 2015 to 2017 in London. First, we conducted a multidimensional analysis of bike-sharing travel characteristics based on explanatory variables such as weather, temperature, and humidity. This helps in understanding the travel characteristics of local people, facilitates traffic management and, to a certain extent, alleviates traffic congestion. Then, the explanatory variables that help predict bike-sharing demand were obtained using Granger causality and the entropy-based MIC method, which were used to verify each other. The multivariate Temporal Convolutional Network (TCN) and the Gated Recurrent Unit (GRU) were integrated to build the prediction model, abbreviated as the TCN-GRU model. The fitted coefficient of determination R2 and the explainable variance score (EVar) on the dataset reached 98.42% and 98.49%, respectively. Meanwhile, the mean absolute error (MAE) and root mean square error (RMSE) were at least 1.98% and 2.4% lower than those of the other models. The results show that the TCN-GRU model is efficient and robust. The model can be used to make accurate short-term predictions of bike-sharing demand in a region, so as to provide decision support for intelligent dispatching and urban traffic safety improvement, which will help to promote the development of green and low-carbon mobility in the future. Full article
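A small illustration of the variable-screening step, using scikit-learn's mutual information estimator as a stand-in for the entropy-based MIC statistic; the Granger-causality cross-check and the TCN-GRU network itself are not reproduced, and the helper name is an assumption:

```python
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

def screen_explanatory_variables(df: pd.DataFrame, target: str, top_k: int = 5):
    """Rank candidate explanatory variables (weather, temperature,
    humidity, ...) by their mutual information with the demand series.

    mutual_info_regression is used here as a stand-in for the MIC
    statistic used in the paper.
    """
    X = df.drop(columns=[target])
    mi = mutual_info_regression(X, df[target], random_state=0)
    return pd.Series(mi, index=X.columns).sort_values(ascending=False).head(top_k)
```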

23 pages, 789 KiB  
Article
A Novel Method for Fault Diagnosis of Rotating Machinery
by Meng Tang, Yaxuan Liao, Fan Luo and Xiangshun Li
Entropy 2022, 24(5), 681; https://0-doi-org.brum.beds.ac.uk/10.3390/e24050681 - 12 May 2022
Cited by 1 | Viewed by 1651
Abstract
When rotating machinery fails, the consequent vibration signal contains rich fault feature information. However, the vibration signal bears the characteristics of nonlinearity and nonstationarity, and is easily disturbed by noise, thus it may be difficult to accurately extract hidden fault features. To extract effective fault features from the collected vibration signals and improve the diagnostic accuracy of weak faults, a novel method for fault diagnosis of rotating machinery is proposed. The new method is based on Fast Iterative Filtering (FIF) and Parameter Adaptive Refined Composite Multiscale Fluctuation-based Dispersion Entropy (PARCMFDE). Firstly, the collected original vibration signal is decomposed by FIF to obtain a series of intrinsic mode functions (IMFs), and the IMFs with a large correlation coefficient are selected for reconstruction. Then, a PARCMFDE is proposed for fault feature extraction, where its embedding dimension and class number are determined by Genetic Algorithm (GA). Finally, the extracted fault features are input into Fuzzy C-Means (FCM) to classify different states of rotating machinery. The experimental results show that the proposed method can accurately extract weak fault features and realize reliable fault diagnosis of rotating machinery. Full article
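For readers unfamiliar with the entropy feature involved, a simplified single-scale dispersion entropy can be sketched as follows; this omits the refined composite multiscale step and the GA-based parameter adaptation of PARCMFDE, so it is only a building-block illustration:

```python
import numpy as np
from scipy.stats import norm

def dispersion_entropy(x, m=3, c=6, delay=1):
    """Single-scale dispersion entropy of a 1-D signal.

    A simplified building block of the PARCMFDE feature used in the
    paper. `m` is the embedding dimension and `c` the number of classes.
    """
    x = np.asarray(x, dtype=float)
    # 1) Map the signal to c discrete classes through the normal CDF.
    y = norm.cdf(x, loc=x.mean(), scale=x.std())
    z = np.minimum(np.maximum(np.round(c * y + 0.5), 1), c).astype(int)
    # 2) Build embedding vectors (dispersion patterns) of length m.
    n = len(z) - (m - 1) * delay
    patterns = np.stack([z[i * delay : i * delay + n] for i in range(m)], axis=1)
    # 3) Shannon entropy of the pattern distribution, normalised by ln(c^m).
    _, counts = np.unique(patterns, axis=0, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum() / np.log(c ** m))
```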

24 pages, 1681 KiB  
Article
Preference-Driven Classification Measure
by Jan Kozak, Barbara Probierz, Krzysztof Kania and Przemysław Juszczuk
Entropy 2022, 24(4), 531; https://0-doi-org.brum.beds.ac.uk/10.3390/e24040531 - 10 Apr 2022
Cited by 3 | Viewed by 2260
Abstract
Classification is one of the main problems of machine learning, and assessing the quality of classification is one of the most topical tasks, all the more difficult as it depends on many factors. Many different measures have been proposed to assess the quality of classification, often depending on the application of a specific classifier. However, in most cases, these measures are focused on binary classification, and for problems with many decision classes, they are significantly simplified. Due to the increasing scope of classification applications, there is a growing need to select a classifier appropriate to the situation, including for more complex data sets with multiple decision classes. This paper proposes a new measure of classifier quality assessment (called the preference-driven measure, abbreviated p-d), applicable regardless of the number of classes and with the possibility of establishing the relative importance of each class. Furthermore, we propose a solution in which the classifier's assessment can be adapted to the analyzed problem using a vector of preferences. To visualize the operation of the proposed measure, we present it first on an example involving two decision classes and then test its operation on real, multi-class data sets. Additionally, we demonstrate how to adjust the assessment to the user's preferences in this case. The results confirm that, compared with classical measures of classification quality, the preference-driven measure can indicate that different classifiers are better suited to the stated preferences. Full article
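To illustrate the general idea of weighting classes by a preference vector (an assumed stand-in, not the exact p-d measure defined in the paper):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def preference_weighted_score(y_true, y_pred, preferences):
    """Illustrative stand-in for a preference-driven quality measure:
    per-class recall averaged with user-supplied class weights.

    This is not the p-d measure from the paper, only a sketch of the
    idea that class importance can be encoded in a preference vector.
    """
    cm = confusion_matrix(y_true, y_pred)
    recalls = np.diag(cm) / cm.sum(axis=1)
    w = np.asarray(preferences, dtype=float)
    return float((w * recalls).sum() / w.sum())

# The same predictions score differently once class 2 is declared critical.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 0, 2, 1]
print(preference_weighted_score(y_true, y_pred, [1, 1, 1]))  # ~0.67
print(preference_weighted_score(y_true, y_pred, [1, 1, 4]))  # ~0.58
```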

18 pages, 1016 KiB  
Article
Stock Index Prediction Based on Time Series Decomposition and Hybrid Model
by Pin Lv, Qinjuan Wu, Jia Xu and Yating Shu
Entropy 2022, 24(2), 146; https://0-doi-org.brum.beds.ac.uk/10.3390/e24020146 - 19 Jan 2022
Cited by 17 | Viewed by 4380
Abstract
The stock index is an important indicator of stock market fluctuation, with a guiding role in investors' decision-making, and is thus the object of much research. However, the stock market is affected by uncertainty and volatility, making accurate prediction a challenging task. We propose a new stock index forecasting model based on time series decomposition and a hybrid model. Complete Ensemble Empirical Mode Decomposition with Adaptive Noise (CEEMDAN) decomposes the stock index into a series of Intrinsic Mode Functions (IMFs) with different feature scales and a trend term. The Augmented Dickey-Fuller (ADF) test judges the stationarity of each IMF and of the trend term. The Autoregressive Moving Average (ARMA) model is used on the stationary time series, while a Long Short-Term Memory (LSTM) model extracts abstract features of the non-stationary time series. The predicted results of each subsequence are reconstructed to obtain the final predicted value. Experiments are conducted on four stock index time series, and the results show that the prediction of the proposed model is closer to the real values than that of seven reference models, and that it has good quantitative investment reference value. Full article
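A minimal sketch of the ADF-based routing step using statsmodels; the CEEMDAN decomposition and the LSTM branch are assumed to be handled elsewhere, and the ARMA order shown is purely illustrative:

```python
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.arima.model import ARIMA

def forecast_component(series, steps=1, alpha=0.05):
    """Route one decomposed component (IMF or trend) by an ADF test:
    stationary components get a linear ARMA model, non-stationary ones
    are left for a sequence model such as an LSTM (not sketched here).
    """
    p_value = adfuller(series)[1]
    if p_value < alpha:  # unit root rejected -> treat as stationary
        fitted = ARIMA(series, order=(2, 0, 1)).fit()  # ARMA(2,1), illustrative order
        return fitted.forecast(steps)
    raise NotImplementedError("non-stationary component: use an LSTM forecaster")
```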

20 pages, 709 KiB  
Article
Immunity in the ABM-DSGE Framework for Preventing and Controlling Epidemics—Validation of Results
by Jagoda Kaszowska-Mojsa, Przemysław Włodarczyk and Agata Szymańska
Entropy 2022, 24(1), 126; https://0-doi-org.brum.beds.ac.uk/10.3390/e24010126 - 14 Jan 2022
Cited by 1 | Viewed by 1787
Abstract
The COVID-19 pandemic has raised many questions on how to manage an epidemiological and economic crisis around the world. Since the beginning of the COVID-19 pandemic, scientists and policy makers have been asking how effective lockdowns are in preventing and controlling the spread of the virus. In the absence of vaccines, regulators lacked any plausible alternatives. Nevertheless, after the introduction of vaccinations, it should be considered to what extent the conclusions of these analyses remain valid. In this paper, we present a study of the effect of vaccinations within a dynamic stochastic general equilibrium model with an agent-based epidemic component. We thereby validate the results, obtained in November 2020, regarding the need for lockdowns as an efficient tool for preventing and controlling epidemics. Full article

21 pages, 682 KiB  
Article
Breaking Data Encryption Standard with a Reduced Number of Rounds Using Metaheuristics Differential Cryptanalysis
by Kamil Dworak and Urszula Boryczka
Entropy 2021, 23(12), 1697; https://0-doi-org.brum.beds.ac.uk/10.3390/e23121697 - 18 Dec 2021
Cited by 4 | Viewed by 2527
Abstract
This article presents the authors' own metaheuristic cryptanalytic attack based on the use of differential cryptanalysis (DC) methods and memetic algorithms (MA) that improve the local search process through simulated annealing (SA). The suggested attack is verified on a set of ciphertexts generated with the well-known DES (Data Encryption Standard) reduced to six rounds. The aim of the attack is to guess the last encryption subkey for each of the two characteristics Ω. Knowing the last subkey, it is possible to recreate the complete encryption key and thus decrypt the cryptogram. The suggested approach makes it possible to automatically reject solutions (keys) that yield the worst fitness function values, owing to which the attack search space can be significantly reduced. The memetic algorithm (MASA) created in this way is compared with other metaheuristic techniques suggested in the literature, in particular with the genetic algorithm (NGA) and the classical differential cryptanalysis attack, in terms of the memory and time needed to guess the key. The article also investigates the entropy of the MASA and NGA attacks. Full article
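A generic skeleton of a memetic search of this kind, namely a genetic algorithm whose offspring are refined by short simulated-annealing runs, with the DES-specific differential-cryptanalysis fitness left as a placeholder (all names and parameters below are assumptions, not the MASA implementation):

```python
import math
import random

def memetic_search(fitness, key_bits=48, pop_size=30, generations=200):
    """Skeleton of a memetic search over candidate bitstring subkeys.

    `fitness` is a placeholder for the differential-cryptanalysis score
    of a candidate last-round subkey (higher is better).
    """
    def mutate_bit(ind):
        i = random.randrange(len(ind))
        return ind[:i] + [1 - ind[i]] + ind[i + 1:]

    def anneal(ind, steps=20, t0=1.0):
        # Short simulated-annealing refinement of one individual.
        best = ind
        for s in range(steps):
            cand = mutate_bit(best)
            delta = fitness(cand) - fitness(best)
            t = t0 * (1 - s / steps) + 1e-9
            if delta >= 0 or random.random() < math.exp(delta / t):
                best = cand
        return best

    pop = [[random.randint(0, 1) for _ in range(key_bits)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]
        children = []
        while len(children) < pop_size - len(parents):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, key_bits)
            children.append(anneal(a[:cut] + b[cut:]))  # crossover + SA refinement
        pop = parents + children
    return max(pop, key=fitness)
```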

16 pages, 293 KiB  
Article
Minimum Query Set for Decision Tree Construction
by Wojciech Wieczorek, Jan Kozak, Łukasz Strąk and Arkadiusz Nowakowski
Entropy 2021, 23(12), 1682; https://0-doi-org.brum.beds.ac.uk/10.3390/e23121682 - 14 Dec 2021
Cited by 3 | Viewed by 2043
Abstract
A new two-stage method for the construction of a decision tree is developed. The first stage is based on the definition of a minimum query set, which is the smallest set of attribute-value pairs for which any two objects can be distinguished. To obtain this set, an appropriate linear programming model is proposed. The queries from this set are building blocks of the second stage in which we try to find an optimal decision tree using a genetic algorithm. In a series of experiments, we show that for some databases, our approach should be considered as an alternative method to classical ones (CART, C4.5) and other heuristic approaches in terms of classification quality. Full article
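The first stage can be phrased as a covering-style 0/1 program; the sketch below (using SciPy's MILP solver, with "attribute j has value v" tests as an assumed query form) conveys the idea, though the paper's exact linear programming model may differ:

```python
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

def minimum_query_set(X, y):
    """Select the smallest set of attribute-value queries such that every
    pair of objects with different labels is distinguished by at least one
    selected query. X is a categorical data matrix (NumPy array), y labels.
    """
    queries = [(j, v) for j in range(X.shape[1]) for v in np.unique(X[:, j])]
    rows = []
    for a in range(len(y)):
        for b in range(a + 1, len(y)):
            if y[a] != y[b]:
                # Query (j, v) distinguishes a and b if it holds for exactly one of them.
                rows.append([int((X[a, j] == v) != (X[b, j] == v)) for j, v in queries])
    A = np.array(rows)
    res = milp(
        c=np.ones(len(queries)),                # minimise the number of queries
        constraints=LinearConstraint(A, lb=1),  # each differing pair covered at least once
        integrality=np.ones(len(queries)),
        bounds=Bounds(0, 1),
    )
    return [queries[i] for i, x in enumerate(res.x) if x > 0.5]
```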

36 pages, 1647 KiB  
Article
Real-World Data Difficulty Estimation with the Use of Entropy
by Przemysław Juszczuk, Jan Kozak, Grzegorz Dziczkowski, Szymon Głowania, Tomasz Jach and Barbara Probierz
Entropy 2021, 23(12), 1621; https://0-doi-org.brum.beds.ac.uk/10.3390/e23121621 - 01 Dec 2021
Cited by 8 | Viewed by 2771
Abstract
In the era of the Internet of Things and big data, we are faced with the management of a flood of information. The complexity and amount of data presented to the decision-maker are enormous, and existing methods often fail to derive nonredundant information quickly. Thus, the selection of the most satisfactory set of solutions is often a struggle. This article investigates the possibility of using the entropy measure as an indicator of data difficulty. To do so, we focus on real-world data covering various fields related to markets (the real estate market and financial markets), sports data, fake news data, and more. The problem is twofold: first, since we deal with unprocessed, inconsistent data, additional preprocessing is necessary; second, we use the entropy-based measure to capture the nonredundant, noncorrelated core information from the data. The research is conducted using well-known algorithms from the classification domain to investigate the quality of solutions derived on the basis of the initial preprocessing and the information indicated by the entropy measure. Finally, the best 25% of attributes (in the sense of the entropy measure) are selected, the whole classification procedure is performed once again, and the results are compared. Full article
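A small sketch of the final selection step, keeping the top 25% of attributes according to an entropy-based score; scikit-learn's information-gain estimator is used here as an assumed stand-in for the paper's measure, and the preprocessing of the raw, inconsistent data is omitted:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def top_quarter_attributes(X, y, feature_names):
    """Keep the best 25% of attributes according to an entropy-based score."""
    scores = mutual_info_classif(X, y, random_state=0)
    k = max(1, len(feature_names) // 4)
    keep = np.argsort(scores)[::-1][:k]
    return [feature_names[i] for i in keep]
```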

17 pages, 625 KiB  
Article
Learning to Classify DWDM Optical Channels from Tiny and Imbalanced Data
by Paweł Cichosz, Stanisław Kozdrowski and Sławomir Sujecki
Entropy 2021, 23(11), 1504; https://0-doi-org.brum.beds.ac.uk/10.3390/e23111504 - 13 Nov 2021
Cited by 1 | Viewed by 1434
Abstract
Applying machine learning algorithms for assessing the transmission quality in optical networks is associated with substantial challenges. Datasets that could provide training instances tend to be small and heavily imbalanced. This requires applying imbalanced compensation techniques when using binary classification algorithms, but it also makes one-class classification, learning only from instances of the majority class, a noteworthy alternative. This work examines the utility of both these approaches using a real dataset from a Dense Wavelength Division Multiplexing network operator, gathered through the network control plane. The dataset is indeed of a very small size and contains very few examples of “bad” paths that do not deliver the required level of transmission quality. Two binary classification algorithms, random forest and extreme gradient boosting, are used in combination with two imbalance handling methods, instance weighting and synthetic minority class instance generation. Their predictive performance is compared with that of four one-class classification algorithms: One-class SVM, one-class naive Bayes classifier, isolation forest, and maximum entropy modeling. The one-class approach turns out to be clearly superior, particularly with respect to the level of classification precision, making it possible to obtain more practically useful models. Full article
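A brief sketch of the one-class alternative described above, using scikit-learn's OneClassSVM and IsolationForest on synthetic placeholder data (the operator dataset and its features are not reproduced, and the parameter values are assumptions):

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM

def fit_one_class_models(X_good):
    """Train one-class models only on the majority ('good path') instances,
    as an alternative to imbalance-compensated binary classifiers."""
    return {
        "ocsvm": OneClassSVM(kernel="rbf", nu=0.05).fit(X_good),
        "iforest": IsolationForest(random_state=0).fit(X_good),
    }

def flag_bad_paths(models, X_new):
    """Both estimators return -1 for outliers, i.e. suspected 'bad' paths."""
    return {name: (m.predict(X_new) == -1) for name, m in models.items()}

# Toy usage with synthetic measurements
rng = np.random.default_rng(0)
good = rng.normal(size=(200, 4))
new = np.vstack([rng.normal(size=(5, 4)), rng.normal(loc=6.0, size=(2, 4))])
print(flag_bad_paths(fit_one_class_models(good), new))
```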
