Spectral Clustering of Mixed-Type Data
Abstract
1. Introduction
2. Background: Spectral Clustering
3. Spectral Clustering for Mixed-Type Data
4. Application
4.1. Competitors
4.2. Simulation Design
- The degree of overlap in the variables: high–high, high–low, low–high and low–low (continuous-categorical);
- The number of continuous-categorical variables: 2-2, 1-3, and 3-1;
- The number of levels in the categorical variables: 3 and 5;
- Whether or not the clusters were balanced (number of points per cluster): 200-200 vs. 320-80.
- Cluster shape (convex vs. non-convex);
- The degree of overlap in the variables.
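As a rough illustration, the four two-cluster factors listed above (overlap, variable split, number of levels, and cluster balance) can be enumerated as a grid of simulation conditions. This is a minimal sketch, assuming the factors are fully crossed, which the list does not state explicitly; the variable names and dictionary keys are hypothetical and not taken from the authors' code.

```python
from itertools import product

# Factor levels copied from the two-cluster design list above.
overlap = ["high-high", "high-low", "low-high", "low-low"]  # continuous-categorical
variable_split = [(2, 2), (1, 3), (3, 1)]                   # (continuous, categorical)
n_levels = [3, 5]                                           # levels per categorical variable
cluster_sizes = [(200, 200), (320, 80)]                     # balanced vs. unbalanced

# Assumption: every combination of factor levels is simulated (full factorial design).
design = [
    {
        "overlap": o,
        "n_continuous": n_con,
        "n_categorical": n_cat,
        "levels": lev,
        "sizes": sizes,
    }
    for o, (n_con, n_cat), lev, sizes in product(
        overlap, variable_split, n_levels, cluster_sizes
    )
]

print(len(design))  # 4 * 3 * 2 * 2 = 48 conditions
```

Under this assumption the grid has 48 conditions, which is consistent with four result tables of twelve rows each.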
4.3. Simulation Results
4.3.1. Results for Two-Cluster Data Sets
4.3.2. Results for Four-Cluster Data Sets
4.4. Real Data—Diamonds Data Set
5. Conclusions
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest

Cluster sizes | Continuous variables | Categorical variables | Levels | KAMILA | k-Prototypes | Spectral |
---|---|---|---|---|---|---|
eq | 1 | 3 | 3 | 0.980 | 0.970 | 0.980 |
eq | 2 | 2 | 3 | 0.931 | 0.680 | 0.941 |
eq | 3 | 1 | 3 | 0.359 | 0.341 | 0.680 |
eq | 1 | 3 | 5 | 0.970 | 0.341 | 0.990 |
eq | 2 | 2 | 5 | 0.921 | 0.285 | 0.926 |
eq | 3 | 1 | 5 | 0.365 | 0.222 | 0.647 |
neq | 1 | 3 | 3 | 0.108 | 0.966 | 0.988 |
neq | 2 | 2 | 3 | 0.102 | 0.036 | 0.965 |
neq | 3 | 1 | 3 | 0.082 | 0.026 | 0.524 |
neq | 1 | 3 | 5 | 0.110 | 0.010 | 1.000 |
neq | 2 | 2 | 5 | 0.078 | 0.018 | 0.965 |
neq | 3 | 1 | 5 | 0.062 | 0.027 | −0.018 |

Cluster sizes | Continuous variables | Categorical variables | Levels | KAMILA | k-Prototypes | Spectral |
---|---|---|---|---|---|---|
eq | 1 | 3 | 3 | 0.931 | 0.921 | 0.951 |
eq | 2 | 2 | 3 | 0.860 | 0.615 | 0.865 |
eq | 3 | 1 | 3 | 0.358 | 0.335 | 0.503 |
eq | 1 | 3 | 5 | 0.931 | 0.133 | 0.951 |
eq | 2 | 2 | 5 | 0.765 | 0.160 | 0.823 |
eq | 3 | 1 | 5 | 0.238 | 0.134 | 0.482 |
neq | 1 | 3 | 3 | 0.108 | 0.880 | 0.977 |
neq | 2 | 2 | 3 | 0.101 | 0.082 | 0.896 |
neq | 3 | 1 | 3 | 0.047 | 0.018 | 0.291 |
neq | 1 | 3 | 5 | 0.087 | 0.044 | 0.965 |
neq | 2 | 2 | 5 | 0.042 | 0.022 | 0.908 |
neq | 3 | 1 | 5 | 0.056 | 0.032 | 0.236 |

Cluster sizes | Continuous variables | Categorical variables | Levels | KAMILA | k-Prototypes | Spectral |
---|---|---|---|---|---|---|
eq | 1 | 3 | 3 | 0.002 | 0.990 | 0.990 |
eq | 2 | 2 | 3 | 0.016 | 0.898 | 0.931 |
eq | 3 | 1 | 3 | 0.004 | 0.060 | 0.639 |
eq | 1 | 3 | 5 | 0.002 | 0.194 | 0.990 |
eq | 2 | 2 | 5 | 0.007 | 0.050 | 0.921 |
eq | 3 | 1 | 5 | 0.002 | 0.008 | 0.301 |
neq | 1 | 3 | 3 | 0.001 | 0.977 | 0.988 |
neq | 2 | 2 | 3 | 0.003 | −0.001 | 0.965 |
neq | 3 | 1 | 3 | 0.000 | −0.001 | 0.449 |
neq | 1 | 3 | 5 | 0.000 | −0.001 | 0.988 |
neq | 2 | 2 | 5 | −0.001 | −0.001 | 0.954 |
neq | 3 | 1 | 5 | −0.002 | −0.002 | −0.024 |

Cluster sizes | Continuous variables | Categorical variables | Levels | KAMILA | k-Prototypes | Spectral |
---|---|---|---|---|---|---|
eq | 1 | 3 | 3 | 0.001 | 0.941 | 0.941 |
eq | 2 | 2 | 3 | 0.010 | 0.243 | 0.828 |
eq | 3 | 1 | 3 | 0.002 | 0.040 | 0.499 |
eq | 1 | 3 | 5 | 0.002 | 0.032 | 0.941 |
eq | 2 | 2 | 5 | 0.004 | 0.005 | 0.828 |
eq | 3 | 1 | 5 | 0.001 | 0.000 | 0.005 |
neq | 1 | 3 | 3 | 0.001 | 0.921 | 0.965 |
neq | 2 | 2 | 3 | 0.002 | −0.001 | 0.908 |
neq | 3 | 1 | 3 | 0.000 | −0.001 | 0.265 |
neq | 1 | 3 | 5 | 0.000 | 0.001 | 0.944 |
neq | 2 | 2 | 5 | 0.003 | −0.001 | 0.885 |
neq | 3 | 1 | 5 | −0.001 | −0.001 | 0.007 |

Categorical | k-Prototypes | KAMILA | Spectral |
---|---|---|---|
high | 0.017 | 0.020 | 0.864 |
medium | 0.098 | 0.025 | 0.073 |
low | 0.508 | 0.151 | 0.896 |

Continuous | Categorical | k-Prototypes | KAMILA | Spectral |
---|---|---|---|---|
low | high | 0.216 | 0.330 | 0.389 |
medium | medium | 0.450 | 0.251 | 0.208 |
high | low | 0.720 | 0.359 | 0.778 |
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Mbuga, F.; Tortora, C. Spectral Clustering of Mixed-Type Data. Stats 2022, 5, 1-11. https://0-doi-org.brum.beds.ac.uk/10.3390/stats5010001