Data Analysis and Domain Knowledge

A special issue of Mathematics (ISSN 2227-7390). This special issue belongs to the section "Engineering Mathematics".

Deadline for manuscript submissions: closed (10 April 2023) | Viewed by 13827

Special Issue Editor

Prague University of Economics and Business, Prague, nám. W. Churchilla 4, 130 67 Prague 3, Czech Republic
Interests: Data mining and KDD; applications of domain knowledge in KDD; association rules; GUHA method; observational calculi

Special Issue Information

Dear Colleagues,

Data analysis is one of the application areas of mathematics. Statistical hypothesis testing is among its earliest tasks. The goal of exploratory data analysis is to use data to suggest hypotheses to test. Data mining and knowledge discovery in databases have brought a new goal to data analysis: finding patterns hidden in large data sets that can be useful to the data owners. The rapid development of tools for generating and storing data has produced very large, heterogeneous data sets to be analyzed.

It is thus becoming clearer that domain knowledge, i.e., knowledge of the field that the data belong to, must be considered when analyzing data. The goal of this special issue is to present recent developments in applications of domain knowledge in data analysis. We are especially interested in papers describing tools for formalizing items of domain knowledge so that the formalized items can be used to generate reasonable analytical questions solvable by available data analysis tools. We are also interested in papers describing approaches in which consequences of formalized items of domain knowledge are automatically filtered out from the outputs of analytical procedures. Of course, any other interesting aspects of applications of domain knowledge in data analysis are welcome.

Prof. Dr. Jan Rauch
Guest Editor

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Mathematics is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2600 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • exploratory data analysis
  • data mining
  • data science
  • big data
  • formalization of domain knowledge
  • applications of domain knowledge

Published Papers (7 papers)

Research

33 pages, 11942 KiB  
Article
Data Analysis and Domain Knowledge for Strategic Competencies Using Business Intelligence and Analytics
by Mauricio Olivares Faúndez and Hanns de la Fuente-Mella
Mathematics 2023, 11(1), 34; https://0-doi-org.brum.beds.ac.uk/10.3390/math11010034 - 22 Dec 2022
Cited by 1 | Viewed by 1813
Abstract
This research arises from the demand in business management for capabilities that put into practice, in an autonomous way, skills and knowledge in BI&A of all those who make decisions and lead organizations. To this end, this study aims to analyze the development of scientific production over the last 20 years in order to provide evidence of possible gaps, patterns and emphasis on domains of strategic leadership competencies in BI&A. The study was split into two methodological phases. Methodological Phase 1: Application of analytical techniques of informetrics. Methodological Phase 2: natural language processing and machine learning techniques. The records collected were 1231 articles from the Web of Science and Scopus databases on 16 August 2021. The results confirm, with an r² = 96.9%, that a small group of authors published the largest number of articles on strategic leadership competencies in BI&A. There is also a strong emphasis on studies in the domain of professional capability development (92.29%), and there are few studies in the domain of enabling environment for learning (0.72%); the domain of expertise (3.01%) and strategic vision of BI&A was also rare (3.37%).
(This article belongs to the Special Issue Data Analysis and Domain Knowledge)

25 pages, 5639 KiB  
Article
A New Text-Mining–Bayesian Network Approach for Identifying Chemical Safety Risk Factors
by Zhiyong Zhou, Jianhui Huang, Yao Lu, Hongcai Ma, Wenwen Li and Jianhong Chen
Mathematics 2022, 10(24), 4815; https://0-doi-org.brum.beds.ac.uk/10.3390/math10244815 - 18 Dec 2022
Cited by 2 | Viewed by 1725
Abstract
The frequent occurrence of accidents in the chemical industry has caused serious economic loss and negative social impact. The chemical accident investigation report is of great value for analyzing the risk factors involved. However, traditional manual analysis is time-consuming and labor-intensive, while existing keyword extraction methods still need to be improved. This study aims to propose an improved text-mining method to analyze a large number of chemical accident reports. A workflow was designed for building and updating lexicons of word segmentation. An improved keyword extraction algorithm was proposed to extract the top 100 keywords from 330 incident reports. A total of 51 safety risk factors was obtained by standardizing these keywords. In all, 294 strong association rules were obtained by Apriori. Based on these rules, a Bayesian network was built to analyze safety risk factors. The mean accuracy and mean recall of the BM25 model in the comparison experiments were 10.5% and 14.38% higher than those of TF-IDF, respectively. The results of association-rule mining and Bayesian network analysis can clearly demonstrate the interrelationship between the safety risk factors. The methodology of this study can quickly and efficiently extract key information from incident reports which can provide managers with new insights and suggestions.
(This article belongs to the Special Issue Data Analysis and Domain Knowledge)

15 pages, 3686 KiB  
Article
Multifractal Characteristics on Temporal Maximum of Air Pollution Series
by Nurulkamal Masseran
Mathematics 2022, 10(20), 3910; https://0-doi-org.brum.beds.ac.uk/10.3390/math10203910 - 21 Oct 2022
Cited by 3 | Viewed by 998
Abstract
Presenting and describing a temporal series of air pollution data with longer time lengths provides more concise information and is, in fact, one of the simplest techniques of data reduction in a time series. However, this process can result in the loss of important information related to data features. Thus, the purpose of this study is to determine the type of data characteristics that might be lost when describing data with different time lengths corresponding to a process of data reduction. In parallel, this study proposes the application of a multifractal technique to investigate the properties on an air pollution series with different time lengths. A case study has been carried out using an air pollution index data in Klang, Malaysia. Results show that hourly air pollution series contain the most informative knowledge regarding the behaviors and characteristics of air pollution, particularly in terms of the strength of multifractality, long-term persistent correlations, and heterogeneity of variations. On the other hand, the statistical findings found that data reduction corresponding to a longer time length will change the multifractal properties of the original data.
(This article belongs to the Special Issue Data Analysis and Domain Knowledge)

26 pages, 993 KiB  
Article
Comparable Studies of Financial Bankruptcy Prediction Using Advanced Hybrid Intelligent Classification Models to Provide Early Warning in the Electronics Industry
by You-Shyang Chen, Chien-Ku Lin, Chih-Min Lo, Su-Fen Chen and Qi-Jun Liao
Mathematics 2021, 9(20), 2622; https://0-doi-org.brum.beds.ac.uk/10.3390/math9202622 - 18 Oct 2021
Cited by 12 | Viewed by 2428
Abstract
In recent years in Taiwan, scholars who study financial bankruptcy have mostly focused on individual listed and over-the-counter (OTC) industries or the entire industry, while few have studied the independent electronics industry. Thus, this study investigated the application of an advanced hybrid Z-score bankruptcy prediction model in selecting financial ratios of listed companies in eight related electronics industries (semiconductor, computer, and peripherals, photoelectric, communication network, electronic components, electronic channel, information service, and other electronics industries) using data from 2000 to 2019. Based on 22 financial ratios of condition attributes and one decision attribute recommended and selected by experts and in the literature, this study used five classifiers for binary logistic regression analysis and in the decision tree. The experimental results show that for the Z-score model, samples analyzed using the five classifiers in five groups (1:1–5:1) of different ratios of companies, the bagging classifier scores are worse (40.82%) than when no feature selection method is used, while the logistic regression classifier and decision tree classifier (J48) result in better scores. However, it is significant that the bagging classifier score improved to over 90% after using the feature selection technique. In conclusion, it was found that the feature selection method can be effectively applied to improve the prediction accuracy, and three financial ratios (the liquidity ratio, debt ratio, and fixed assets turnover ratio) are identified as being the most important determinants affecting the prediction of financial bankruptcy in providing a useful reference for interested parties to evaluate capital allocation to avoid high investment risks.
(This article belongs to the Special Issue Data Analysis and Domain Knowledge)

14 pages, 715 KiB  
Article
Improved Constrained k-Means Algorithm for Clustering with Domain Knowledge
by Peihuang Huang, Pei Yao, Zhendong Hao, Huihong Peng and Longkun Guo
Mathematics 2021, 9(19), 2390; https://0-doi-org.brum.beds.ac.uk/10.3390/math9192390 - 26 Sep 2021
Cited by 3 | Viewed by 2189
Abstract
Witnessing the tremendous development of machine learning technology, emerging machine learning applications impose challenges of using domain knowledge to improve the accuracy of clustering provided that clustering suffers a compromising accuracy rate despite its advantage of fast procession. In this paper, we model domain knowledge (i.e., background knowledge or side information), respecting some applications as must-link and cannot-link sets, for the sake of collaborating with k-means for better accuracy. We first propose an algorithm for constrained k-means, considering only must-links. The key idea is to consider a set of data points constrained by the must-links as a single data point with a weight equal to the weight sum of the constrained points. Then, for clustering the data points set with cannot-link, we employ minimum-weight matching to assign the data points to the existing clusters. At last, we carried out a numerical simulation to evaluate the proposed algorithms against the UCI datasets, demonstrating that our method outperforms the previous algorithms for constrained k-means as well as the traditional k-means regarding the clustering accuracy rate although with a slightly compromised practical runtime.
(This article belongs to the Special Issue Data Analysis and Domain Knowledge)
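
The must-link idea in the abstract above can be illustrated with a small sketch. This is not the authors' implementation. It only shows one way to collapse each must-link group into a single weighted point before running a weighted k-means. The function name merge_must_links, the use of union-find, and the choice of the weighted centroid as the position of the merged point are all assumptions made for illustration; the abstract specifies only that the merged point carries the summed weight.

```python
from collections import defaultdict


def merge_must_links(points, must_links, weights=None):
    """Collapse each must-link group into one weighted representative.

    points:     list of equal-length coordinate tuples
    must_links: list of (i, j) index pairs that must share a cluster
    weights:    optional per-point weights (defaults to 1.0 each)

    Returns a list of (position, weight, member_indices) triples.
    The position is the weighted centroid of the group, which is an
    assumption of this sketch; the weight is the sum of member weights.
    """
    n = len(points)
    weights = [1.0] * n if weights is None else list(weights)
    parent = list(range(n))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Must-links are transitive, so union every linked pair.
    for a, b in must_links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(list)
    for i in range(n):
        groups[find(i)].append(i)

    merged = []
    for members in groups.values():
        total = sum(weights[i] for i in members)
        dim = len(points[members[0]])
        position = tuple(
            sum(weights[i] * points[i][d] for i in members) / total
            for d in range(dim)
        )
        merged.append((position, total, members))
    return merged


if __name__ == "__main__":
    pts = [(0.0, 0.0), (2.0, 0.0), (10.0, 10.0)]
    for position, weight, members in merge_must_links(pts, [(0, 1)]):
        print(position, weight, members)
```

The merged points and their weights could then be passed to any k-means variant that accepts per-sample weights, so that every must-link group is assigned to a cluster as a unit.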

15 pages, 2560 KiB  
Article
Domain Heuristic Fusion of Multi-Word Embeddings for Nutrient Value Prediction
by Gordana Ispirova, Tome Eftimov and Barbara Koroušić Seljak
Mathematics 2021, 9(16), 1941; https://0-doi-org.brum.beds.ac.uk/10.3390/math9161941 - 14 Aug 2021
Cited by 3 | Viewed by 1834
Abstract
Being both a poison and a cure for many lifestyle and non-communicable diseases, food is inscribing itself into the prime focus of precise medicine. The monitoring of few groups of nutrients is crucial for some patients, and methods for easing their calculations are emerging. Our proposed machine learning pipeline deals with nutrient prediction based on learned vector representations on short text–recipe names. In this study, we explored how the prediction results change when, instead of using the vector representations of the recipe description, we use the embeddings of the list of ingredients. The nutrient content of one food depends on its ingredients; therefore, the text of the ingredients contains more relevant information. We define a domain-specific heuristic for merging the embeddings of the ingredients, which combines the quantities of each ingredient in order to use them as features in machine learning models for nutrient prediction. The results from the experiments indicate that the prediction results improve when using the domain-specific heuristic. The prediction models for protein prediction were highly effective, with accuracies up to 97.98%. Implementing a domain-specific heuristic for combining multi-word embeddings yields better results than using conventional merging heuristics, with up to 60% more accuracy in some cases.
(This article belongs to the Special Issue Data Analysis and Domain Knowledge)
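
As a rough illustration of what a quantity-aware merge of ingredient embeddings might look like, the sketch below averages the ingredient vectors weighted by quantity. The abstract does not spell out the authors' heuristic, so the weighted average, the function name merge_by_quantity, and the toy two-dimensional vectors are all assumptions rather than the published method.

```python
def merge_by_quantity(embeddings, quantities):
    """Merge ingredient embeddings into one recipe vector.

    embeddings: list of equal-length lists of floats, one per ingredient
    quantities: non-negative amounts (for example grams), one per ingredient

    Returns the quantity-weighted average of the embeddings. This
    weighting scheme is an illustrative assumption, not the heuristic
    defined in the paper.
    """
    if len(embeddings) != len(quantities):
        raise ValueError("one quantity is required per embedding")
    total = sum(quantities)
    if total <= 0:
        raise ValueError("total quantity must be positive")
    dim = len(embeddings[0])
    return [
        sum(q * vec[d] for vec, q in zip(embeddings, quantities)) / total
        for d in range(dim)
    ]


if __name__ == "__main__":
    # Hypothetical 2-dimensional embeddings for two ingredients.
    flour = [1.0, 0.0]
    butter = [0.0, 1.0]
    print(merge_by_quantity([flour, butter], [300, 100]))
```

An unweighted mean treats a pinch of salt and a kilogram of flour identically; weighting by quantity is one simple way to let the dominant ingredients dominate the merged feature vector.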

10 pages, 1510 KiB  
Article
Revealing Driver’s Natural Behavior—A GUHA Data Mining Approach
by Esko Turunen and Klara Dolos
Mathematics 2021, 9(15), 1818; https://0-doi-org.brum.beds.ac.uk/10.3390/math9151818 - 31 Jul 2021
Cited by 3 | Viewed by 1576
Abstract
We investigate the applicability and usefulness of the GUHA data mining method and its computer implementation LISp-Miner for driver characterization based on digital vehicle data on gas pedal position, vehicle speed, and others. Three analytical questions are assessed: (1) Which measured features, also called attributes, distinguish each driver from all other drivers? (2) Comparing one driver separately in pairs with each of the other drivers, which are the most distinguishing attributes? (3) Comparing one driver separately in pairs with each of the other drivers, which attributes values show significant differences between drivers? The analyzed data consist of 94,380 measurements and contain clear and understandable patterns to be found by LISp-Miner. In conclusion, we find that the GUHA method is well suited for such tasks.
(This article belongs to the Special Issue Data Analysis and Domain Knowledge)