Data Clustering: Algorithms and Applications

A special issue of Applied Sciences (ISSN 2076-3417). This special issue belongs to the section "Computing and Artificial Intelligence".

Deadline for manuscript submissions: closed (20 June 2023) | Viewed by 19529

Special Issue Editors


Guest Editor
1. Department of Computer Science, Electrical Engineering and Computer Science Faculty, Lublin University of Technology, 20-618 Lublin, Poland
2. Systems Research Institute, Polish Academy of Sciences, 01-447 Warsaw, Poland
Interests: data mining; computational intelligence; applied mathematics; knowledge discovery; image processing; information technology

Guest Editor
1. Faculty of Physics and Applied Computer Science, AGH University of Science and Technology, 30-059 Cracow, Poland
2. Systems Research Institute, Polish Academy of Sciences, Newelska 6, 01-447 Warsaw, Poland
Interests: data mining; artificial intelligence; computational intelligence; neural networks; metaheuristics

Special Issue Information

Dear Colleagues,

At present, with the rapid growth of computer science and information technology, the amount of data obtained has increased significantly. Such pools of data are ubiquitous and play important roles in many fields of business and science. Using existing data analysis methods to reveal natural structures and identify interesting patterns in the underlying data, as well as interpreting the results, represents a vast challenge. Clustering has become a fundamental and commonly used technique for knowledge discovery and data mining. Still, the need to cluster huge, high-dimensional datasets poses a challenge to clustering algorithms. In real applications, the collection and use of data for analysis needs to be fast. However, a large proportion of data contains irrelevant features that may decrease processing efficiency. These factors place higher demands on the effectiveness of clustering methods. Hence, it is important to treat cluster analysis, anomaly detection, and dimensionality reduction as inseparable concepts.

This Special Issue focuses on data clustering as well as knowledge discovery and machine learning. Its aim is to compile recent advances in this contemporary research area, studies of the primary aspects of data clustering, key techniques commonly used for clustering, and insights into important features of the clustering process in a variety of application areas. We invite high-quality submissions from researchers in the field of data clustering to exchange and share their experiences and research results, whether theoretical or applied.

Prof. Dr. Małgorzata Charytanowicz
Prof. Dr. Piotr A. Kowalski
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Applied Sciences is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • data clustering
  • cluster analysis
  • data mining
  • machine learning
  • knowledge discovery
  • unsupervised learning
  • clustering algorithms
  • anomaly detection
  • dimensionality reduction
  • imprecise information

Published Papers (10 papers)

Research

14 pages, 28602 KiB  
Article
Data Clustering in Urban Computational Modeling by Integrated Geometry and Imagery Features for Probabilistic Navigation
by Chenyi Cai, Mohamed Zaghloul and Biao Li
Appl. Sci. 2022, 12(24), 12704; https://0-doi-org.brum.beds.ac.uk/10.3390/app122412704 - 11 Dec 2022
Viewed by 1534
Abstract
Cities are considered complex and open environments with multidimensional aspects including urban forms, urban imagery, and urban energy performance. So, a platform that supports the dialogue between the user and the machine is crucial in urban computational modeling (UCM). In this paper, we present a novel urban computational modeling framework, which integrates urban geometry and urban visual appearance aspects. The framework applies unsupervised machine learning, self-organizing map (SOM), and information retrieval techniques. We propose the instrument to help designers navigate among references from the built environment. The framework incorporates geometric and imagery aspects by encoding urban spatial and visual appearance characteristics with Isovist and semantic segmentation for integrated geometry and imagery features (IGIF). A ray SOM and a mask SOM are trained with the IGIF, using building footprints and street view images of Nanjing as a dataset. By interlinking the two SOMs, the program retrieves urban plots which have similar spatial traits or visual appearance, or both. The program provides urban designers with a navigatable explorer space with references from the built environment to inspire design ideas and learn from them. Our proposed framework helps architects and urban designers with both design inspiration and decision making by bringing human intelligence into UCM. Future research directions using and extending the framework are also discussed.
(This article belongs to the Special Issue Data Clustering: Algorithms and Applications)

19 pages, 22258 KiB  
Article
Ensemble Clustering in GPS Velocities: A Case Study of Turkey
by Batuhan Kılıç and Seda Özarpacı
Appl. Sci. 2022, 12(24), 12636; https://0-doi-org.brum.beds.ac.uk/10.3390/app122412636 - 09 Dec 2022
Cited by 3 | Viewed by 1787
Abstract
Block modeling is an effective way to understand Earth’s crustal deformation. However, the choice of block boundaries and the number of blocks affect the model results. Therefore, the subjectivity of this analysis should be avoided. Clustering analysis can be used to define the blocks of GPS (Global Positioning System) velocity fields without a priori information. Unfortunately, clustering methods also have unique solutions and differ with various algorithms. Ensemble methods could be an answer to enhance the clustering results for GPS velocities. In this study, we use ensemble clustering to identify block boundaries before block modeling without a priori information about the data. The ensemble clustering method is used for the first time in the clustering of GPS velocities and the case of Turkey is discussed. The published horizontal GPS velocities were first clustered with five different clustering methods and the optimum classes were determined using ensemble clustering methods. It is proven that the Meta-CLustering Algorithm can be used in terms of ensemble clustering for this region.
(This article belongs to the Special Issue Data Clustering: Algorithms and Applications)

12 pages, 531 KiB  
Article
K-RBBSO Algorithm: A Result-Based Stochastic Search Algorithm in Big Data
by Sungjin Park and Sangkyun Kim
Appl. Sci. 2022, 12(23), 12451; https://0-doi-org.brum.beds.ac.uk/10.3390/app122312451 - 05 Dec 2022
Viewed by 862
Abstract
Clustering is widely used in client-facing businesses to categorize their customer base and deliver personalized services. This study proposes an algorithm to stochastically search for an optimum solution based on the outcomes of a data clustering process. Fundamentally, the aforementioned goal is achieved using a result-based stochastic search algorithm. Hence, shortcomings of existing stochastic search algorithms are identified, and the k-means-initiated rapid biogeography-based silhouette optimization (K-RBBSO) algorithm is proposed to overcome them. The proposed algorithm is validated by creating a data clustering engine and comparing the performance of the K-RBBSO algorithm with those of currently used stochastic search techniques, such as simulated annealing and artificial bee colony, on a validation dataset. The results indicate that K-RBBSO is more effective with larger volumes of data compared to the other algorithms. Finally, we describe some prospective beneficial uses of a data clustering algorithm in unsupervised learning based on the findings of this study.
(This article belongs to the Special Issue Data Clustering: Algorithms and Applications)
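
As a reader's aid, the silhouette criterion named in the K-RBBSO abstract above can be sketched in a few lines. This is only the standard silhouette coefficient with Euclidean distance, not the authors' search algorithm; the function name `silhouette_score`, the singleton-cluster convention, and the toy points are assumptions made for illustration.

```python
from math import dist


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # assumed convention: a singleton cluster contributes 0
        # a: mean distance to the other members of the point's own cluster
        # (dist(p, p) is 0, so including p in the sum is harmless)
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        denom = max(a, b)
        total += (b - a) / denom if denom > 0 else 0.0
    return total / len(points)


# Hypothetical toy data: two tight pairs that are far apart.
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
well_separated = silhouette_score(points, [0, 0, 1, 1])
mixed_up = silhouette_score(points, [0, 1, 0, 1])
```

A search procedure of the kind the abstract describes would presumably compare candidate labelings by such a score and prefer the higher one; here `well_separated` is close to 1 while `mixed_up` is negative.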

24 pages, 24832 KiB  
Article
Understanding Unsupervised Deep Learning for Text Line Segmentation
by Ahmad Droby, Berat Kurar Barakat, Raid Saabni, Reem Alaasam, Boraq Madi and Jihad El-Sana
Appl. Sci. 2022, 12(19), 9528; https://0-doi-org.brum.beds.ac.uk/10.3390/app12199528 - 22 Sep 2022
Cited by 2 | Viewed by 1315
Abstract
We propose an unsupervised feature learning approach for segmenting text lines of handwritten document images with no labelling effort. Humans can easily group local text line features to global coarse patterns. We leverage this coherent visual perception of text lines as a supervising signal by formulating the feature learning as a global pattern differentiation task. The machine is trained to detect whether a document patch contains a similar global text line pattern with its identity or neighbours, and a different global text line pattern with its 90-degree-rotated identity or neighbours. Clustering the central windows of document image patches using their extracted features, forms blob lines which strike through the text lines. The blob lines guide an energy minimization function for extracting text lines in a binary image and guide a seam carving function for detecting baselines in a colour image. In identifying the aspect of the input patch that supports the actual prediction and clustering, we contribute toward the understanding of input patch functionality. We evaluate the method on several variants of text line segmentation datasets to demonstrate its effectiveness, visualize what it has learned, and enable it to comprehend its clustering strategy from a human perspective.
(This article belongs to the Special Issue Data Clustering: Algorithms and Applications)

17 pages, 8508 KiB  
Article
An Intelligent Hybrid Scheme for Customer Churn Prediction Integrating Clustering and Classification Algorithms
by Rencheng Liu, Saqib Ali, Syed Fakhar Bilal, Zareen Sakhawat, Azhar Imran, Abdullah Almuhaimeed, Abdulkareem Alzahrani and Guangmin Sun
Appl. Sci. 2022, 12(18), 9355; https://0-doi-org.brum.beds.ac.uk/10.3390/app12189355 - 18 Sep 2022
Cited by 10 | Viewed by 2640
Abstract
Nowadays, customer churn has been reflected as one of the main concerns in the processes of the telecom sector, as it affects the revenue directly. Telecom companies are looking to design novel methods to identify the potential customer to churn. Hence, it requires suitable systems to overcome the growing churn challenge. Recently, integrating different clustering and classification models to develop hybrid learners (ensembles) has gained wide acceptance. Ensembles are getting better approval in the domain of big data since they have supposedly achieved excellent predictions as compared to single classifiers. Therefore, in this study, we propose a customer churn prediction (CCP) based on ensemble system fully incorporating clustering and classification learning techniques. The proposed churn prediction model uses an ensemble of clustering and classification algorithms to improve CCP model performance. Initially, few clustering algorithms such as k-means, k-medoids, and Random are employed to test churn prediction datasets. Next, to enhance the results hybridization technique is applied using different ensemble algorithms to evaluate the performance of the proposed system. Above mentioned clustering algorithms integrated with different classifiers including Gradient Boosted Tree (GBT), Decision Tree (DT), Random Forest (RF), Deep Learning (DL), and Naive Bayes (NB) are evaluated on two standard telecom datasets which were acquired from Orange and Cell2Cell. The experimental result reveals that compared to the bagging ensemble technique, the stacking-based hybrid model (k-medoids-GBT-DT-DL) achieve the top accuracies of 96%, and 93.6% on the Orange and Cell2Cell dataset, respectively. The proposed method outperforms conventional state-of-the-art churn prediction algorithms.
(This article belongs to the Special Issue Data Clustering: Algorithms and Applications)

16 pages, 519 KiB  
Article
DPS Clustering: New Results
by Sergey M. Agayan, Shamil R. Bogoutdinov, Boris A. Dzeboev, Boris V. Dzeranov, Dmitriy A. Kamaev and Maxim O. Osipov
Appl. Sci. 2022, 12(18), 9335; https://0-doi-org.brum.beds.ac.uk/10.3390/app12189335 - 17 Sep 2022
Cited by 4 | Viewed by 1261
Abstract
The results presented in this paper are obtained as part of the continued development and research of clustering algorithms based on the discrete mathematical analysis. The article briefly describes the theory of Discrete Perfect Sets (DPS-sets) that is the basis for the construction of DPS-clustering algorithms. The main task of the previously constructed DPS-algorithms is to search for clusters in multidimensional arrays with noise. DPS-algorithms have two stages: the first stage is the recognition of the maximum perfect set of a given density level from the initial array, the second stage is the partitioning of the result of the first stage into connected components, which are considered to be clusters. Study of qualities of DPS-algorithms showed that, in a number of situations in the first stage, the result does not include all clusters which have practical sense. In the second stage, partitioning into connected components can lead to unnecessarily small clusters. Simple variation of parameters in DPS-algorithms does not allow for eliminating these drawbacks. The present paper is devoted to the construction on the basis of DPS-algorithms of their new versions, more free from these drawbacks.
(This article belongs to the Special Issue Data Clustering: Algorithms and Applications)

16 pages, 2020 KiB  
Article
Analysis and Evaluation of Clustering Techniques Applied to Wireless Acoustics Sensor Network Data
by Antonio Pita, Francisco J. Rodriguez and Juan M. Navarro
Appl. Sci. 2022, 12(17), 8550; https://0-doi-org.brum.beds.ac.uk/10.3390/app12178550 - 26 Aug 2022
Cited by 4 | Viewed by 1812
Abstract
Exposure to environmental noise is related to negative health effects. To prevent it, the city councils develop noise maps and action plans to identify, quantify, and decrease noise pollution. Smart cities are deploying wireless acoustic sensor networks that continuously gather the sound pressure level from many locations using acoustics nodes. These nodes provide very relevant updated information, both temporally and spatially, over the acoustic zones of the city. In this paper, the performance of several data clustering techniques is evaluated for discovering and analyzing different behavior patterns of the sound pressure level. A comparison of clustering techniques is carried out using noise data from two large cities, considering isolated and federated data. Experiments support that Hierarchical Agglomeration Clustering and K-means are the algorithms more appropriate to fit acoustics sound pressure level data.
(This article belongs to the Special Issue Data Clustering: Algorithms and Applications)

19 pages, 3115 KiB  
Article
Cluster Analysis with K-Mean versus K-Medoid in Financial Performance Evaluation
by Emilia Herman, Kinga-Emese Zsido and Veronika Fenyves
Appl. Sci. 2022, 12(16), 7985; https://0-doi-org.brum.beds.ac.uk/10.3390/app12167985 - 10 Aug 2022
Cited by 13 | Viewed by 2616
Abstract
Nowadays there is a large amount of information at our disposal, which is increasing day by day, and right now the question is not whether we have a method to process it, but which method is most effective, faster and best. When processing large databases, with different data, the formation of homogeneous groups is recommended. This paper presents the financial performance of Hungarian and Romanian food retail companies by using two well-known cluster analyzing methods (K-Mean and K-Medoid) based on ROS (Return on Sales), ROA (Return on Assets) and ROE (Return on Equity) financial ratios. The research is based on two complete databases, including the financial statements for five years of all retail food companies from one Hungarian and one Romanian county. The hypothesis of the research is: in the case of large databases with variable quantitative data, cluster analysis is necessary in order to obtain accurate results and the method chosen can bring different results. It is justified to think carefully about choosing a method depending on the available data and the research aim. The aim of this study is to highlight the differences between the results of these two grouping procedures. Using the two methods we reached different results, which means a different evaluation of financial performance. The results demonstrate that the method chosen for grouping may influence the assessment of the financial performance of companies: the K-Mean method produces a greater variety of groups and the range of results obtained after grouping is larger; whereas, the group distribution and the results obtained by the K-Medoid method are more balanced.
(This article belongs to the Special Issue Data Clustering: Algorithms and Applications)
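
To illustrate why the two methods compared in the abstract above can disagree, the following minimal sketch runs the same alternating loop with two different center updates: the arithmetic mean (k-means) and the most central observation (k-medoids). It is not the authors' procedure or data. The k-medoids variant shown is the simple alternating scheme rather than PAM, and the function names, starting centers, and toy points (including the extreme value at `(30, 30)`) are all assumptions chosen for illustration.

```python
from math import dist


def mean_center(cluster):
    """k-means update: coordinate-wise arithmetic mean of the cluster."""
    dims = len(cluster[0])
    return tuple(sum(p[d] for p in cluster) / len(cluster) for d in range(dims))


def medoid_center(cluster):
    """k-medoids update: the member with the smallest total distance to the rest."""
    return min(cluster, key=lambda c: sum(dist(c, p) for p in cluster))


def alternate_clustering(points, initial_centers, update, max_iter=100):
    """Alternate between nearest-center assignment and the given center update."""
    centers = list(initial_centers)
    for _ in range(max_iter):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
            groups[nearest].append(p)
        # Keep the old center if a group ends up empty.
        new_centers = [update(g) if g else c for g, c in zip(groups, centers)]
        if new_centers == centers:
            break
        centers = new_centers
    labels = [min(range(len(centers)), key=lambda i: dist(p, centers[i])) for p in points]
    return centers, labels


# Hypothetical toy data: two compact groups plus one extreme observation.
points = [(1, 1), (1, 2), (2, 1), (2, 2),
          (8, 8), (8, 9), (9, 8), (9, 9),
          (30, 30)]
start = [(1, 1), (8, 8)]

kmeans_centers, kmeans_labels = alternate_clustering(points, start, mean_center)
kmedoid_centers, kmedoid_labels = alternate_clustering(points, start, medoid_center)
```

On this toy input the extreme point drags the second k-means center away from its group, while the k-medoids center remains one of the observed points. This is one plausible reason for the more balanced K-Medoid grouping the abstract reports, though the paper itself should be consulted for the actual explanation.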

21 pages, 2450 KiB  
Article
A Clustering Algorithm for Evolving Data Streams Using Temporal Spatial Hyper Cube
by Redhwan Al-amri, Raja Kumar Murugesan, Mubarak Almutairi, Kashif Munir, Gamal Alkawsi and Yahia Baashar
Appl. Sci. 2022, 12(13), 6523; https://0-doi-org.brum.beds.ac.uk/10.3390/app12136523 - 27 Jun 2022
Cited by 3 | Viewed by 1701
Abstract
As applications generate massive amounts of data streams, the requirement for ways to analyze and cluster this data has become a critical field of research for knowledge discovery. Data stream clustering’s primary objective and goal are to acquire insights into incoming data. Recognizing all possible patterns in data streams that enter at variable rates and structures and evolve over time is critical for acquiring insights. Analyzing the data stream has been one of the vital research areas due to the inevitable evolving aspect of the data stream and its vast application domains. Existing algorithms for handling data stream clustering consider adding various data summarization structures starting from grid projection and ending with buffers of Core-Micro and Macro clusters. However, it is found that the static assumption of the data summarization impacts the quality of clustering. To fill this gap, an online clustering algorithm for handling evolving data streams using a tempo-spatial hyper cube called BOCEDS TSHC has been developed in this research. The role of the tempo-spatial hyper cube (TSHC) is to add more dimensions to the data summarization for more degree of freedom. TSHC when added to Buffer-based Online Clustering for Evolving Data Stream (BOCEDS) results in a superior evolving data stream clustering algorithm. Evaluation based on both the real world and synthetic datasets has proven the superiority of the developed BOCEDS TSHC clustering algorithm over the baseline algorithms with respect to most of the clustering metrics.
(This article belongs to the Special Issue Data Clustering: Algorithms and Applications)

22 pages, 1485 KiB  
Article
Optimization Algorithms for Scalable Stream Batch Clustering with k Estimation
by Paulo Gustavo Lopes Cândido, Jonathan Andrade Silva, Elaine Ribeiro Faria and Murilo Coelho Naldi
Appl. Sci. 2022, 12(13), 6464; https://0-doi-org.brum.beds.ac.uk/10.3390/app12136464 - 25 Jun 2022
Viewed by 1223
Abstract
The increasing volume and velocity of the continuously generated data (data stream) challenge machine learning algorithms, which must evolve to fit real-world problems. The data stream clustering algorithms face issues such as the rapidly increasing volume of the data, the variety of the number of clusters, and their shapes. The present work aims to improve the accuracy of sequential clustering batches of data streams for scenarios in which clusters evolve dynamically and continuously, automatically estimating their number. In order to achieve this goal, three evolutionary algorithms are presented, along with three novel algorithms designed to deal with clusters of normal distribution based on goodness-of-fit tests in the context of scalable batch stream clustering with automatic estimation of the number of clusters. All of them are developed on top of MapReduce, Discretized-Stream models, and the most recent MPC frameworks to provide scalability, reliability, resilience, and flexibility. The proposed algorithms are experimentally compared with state-of-the-art methods and present the best results for accuracy for normally distributed data sets, reaching their goal.
(This article belongs to the Special Issue Data Clustering: Algorithms and Applications)