
Stats, Volume 5, Issue 1 (March 2022) – 18 articles

Cover Story: Variable selection plays an important role in model-based small area estimation. A simple and effective method for variable selection under a three-fold sub-subarea model, linking sub-subarea means to related covariates and random effects at the area, sub-area, and sub-subarea levels, is presented. The proposed method transforms the sub-subarea means to reduce the three-fold model to a standard regression model and applies commonly used criteria for variable selection, such as AIC and BIC, to the reduced model. The resulting criteria depend on the unknown sub-subarea means, which are estimated using the sample sub-subarea means. The estimated selection criteria are then used for variable selection.
Article
Importance of Weather Conditions in a Flight Corridor
Stats 2022, 5(1), 312-338; https://0-doi-org.brum.beds.ac.uk/10.3390/stats5010018 - 09 Mar 2022
Viewed by 719
Abstract
Current research initiatives, such as the Single European Sky Air Traffic Management Research Program, call for an air traffic system with improved safety and efficiency records and environmental compatibility. The resulting multi-criteria system optimization and individual flight trajectories require, in particular, reliable three-dimensional meteorological information. The Global (Weather) Forecast System only provides data at a resolution of around 100 km. We postulate that reliable interpolation at high resolution is needed to compute these trajectories accurately and in due time to comply with operational requirements. We investigate different interpolation methods for aerodynamically crucial weather variables such as temperature, wind speed, and wind direction. These methods, including Ordinary Kriging, the radial basis function method, neural networks, and decision trees, are compared in terms of cross-validation interpolation errors. We show that using the interpolated data in a flight performance model emphasizes the effect of weather data accuracy on trajectory optimization. Considering a trajectory from Prague to Tunis, a Monte Carlo simulation is applied to examine the effect of errors in the input (GFS data) and output (i.e., Ordinary Kriging) on the optimized trajectory. Full article
(This article belongs to the Section Data Science)

Article
Properties and Limiting Forms of the Multivariate Extended Skew-Normal and Skew-Student Distributions
Stats 2022, 5(1), 270-311; https://0-doi-org.brum.beds.ac.uk/10.3390/stats5010017 - 09 Mar 2022
Viewed by 626
Abstract
This paper is concerned with the multivariate extended skew-normal [MESN] and multivariate extended skew-Student [MEST] distributions, that is, distributions in which the location parameters of the underlying truncated distributions are not zero. The extra parameter leads to greater variability in the moments and critical values, thus providing greater flexibility for empirical work. This paper reports various theoretical properties of the extended distributions, notably the limiting forms as the magnitude of the extension parameter, denoted τ in this paper, increases without limit. In particular, it is shown that as the magnitude of τ increases without limit, the limiting forms of the MESN and MEST distributions are different. The effect of the difference is exemplified by a study of stock market crashes. A second example is a short study of the extent to which the extended skew-normal distribution can be approximated by the skew-Student. Full article
(This article belongs to the Special Issue Multivariate Statistics and Applications)

Review
Resampling under Complex Sampling Designs: Roots, Development and the Way Forward
Stats 2022, 5(1), 258-269; https://0-doi-org.brum.beds.ac.uk/10.3390/stats5010016 - 08 Mar 2022
Cited by 1 | Viewed by 709
Abstract
In the present paper, resampling for finite populations under an iid sampling design is reviewed. Our attention is mainly focused on pseudo-population-based resampling due to its properties. A principled appraisal of the main theoretical foundations and results is given and discussed, together with important computational aspects. Finally, a discussion on open problems and research perspectives is provided. Full article
(This article belongs to the Special Issue Re-sampling Methods for Statistical Inference of the 2020s)
Article
The Stacy-G Class: A New Family of Distributions with Regression Modeling and Applications to Survival Real Data
Stats 2022, 5(1), 215-257; https://0-doi-org.brum.beds.ac.uk/10.3390/stats5010015 - 04 Mar 2022
Viewed by 636
Abstract
We study the Stacy-G family, which extends the gamma-G class and provides four of the most well-known forms of the hazard rate function: increasing, decreasing, bathtub, and inverted bathtub. We provide some of its structural properties. We estimate the parameters by maximum likelihood, and perform a simulation study to verify the asymptotic properties of the estimators for the Burr-XII baseline. We construct the log-Stacy-Burr XII regression for censored data. The usefulness of the new models is shown through applications to uncensored and censored real data. Full article
(This article belongs to the Section Regression Models)

Article
Modeling Secondary Phenotypes Conditional on Genotypes in Case–Control Studies
Stats 2022, 5(1), 203-214; https://0-doi-org.brum.beds.ac.uk/10.3390/stats5010014 - 22 Feb 2022
Viewed by 697
Abstract
Traditional case–control genetic association studies examine relationships between case–control status and one or more covariates. It is becoming increasingly common to study secondary phenotypes and their association with the original covariates. The Orofacial Pain: Prospective Evaluation and Risk Assessment (OPPERA) project, a study of temporomandibular disorders (TMD), motivates this work. Numerous measures of interest are collected at enrollment, such as the number of comorbid pain conditions from which a participant suffers. Examining the potential genetic basis of these measures is of secondary interest. Assessing these associations is statistically challenging, as participants do not form a random sample from the population of interest. Standard methods may be biased and lack coverage and power. We propose a general method for the analysis of arbitrary phenotypes utilizing inverse probability weighting and bootstrapping for standard error estimation. The method may be applied to the complicated association tests used in next-generation sequencing studies, such as analyses of haplotypes with ambiguous phase. Simulation studies show that our method performs as well as competing methods when they are applicable and yields promising results for outcome types, such as time-to-event, to which other methods may not apply. The method is applied to the OPPERA baseline case–control genetic study. Full article
(This article belongs to the Special Issue Novel Semiparametric Methods)
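The combination of inverse probability weighting and bootstrap standard errors named in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the toy data, the sampling probabilities, and the function names `ipw_mean` and `bootstrap_se` are all hypothetical, and the estimand here is a simple weighted mean rather than a genetic association.

```python
import random


def ipw_mean(values, probs):
    """Inverse-probability-weighted mean: each unit is weighted by 1 / p."""
    weights = [1.0 / p for p in probs]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def bootstrap_se(values, probs, n_boot=500, seed=0):
    """Standard error of the IPW mean, from resampling units with replacement."""
    rng = random.Random(seed)
    n = len(values)
    estimates = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        estimates.append(
            ipw_mean([values[i] for i in idx], [probs[i] for i in idx])
        )
    mean_est = sum(estimates) / n_boot
    variance = sum((e - mean_est) ** 2 for e in estimates) / (n_boot - 1)
    return variance ** 0.5


# Hypothetical toy data: cases (p = 0.8) are oversampled relative to
# controls (p = 0.1), so the unweighted mean is pulled towards the cases.
values = [3.0, 4.0, 5.0, 1.0, 2.0, 1.0]
probs = [0.8, 0.8, 0.8, 0.1, 0.1, 0.1]

estimate = ipw_mean(values, probs)
se = bootstrap_se(values, probs)
unweighted = sum(values) / len(values)
```

The weights undo the oversampling of cases, so `estimate` sits closer to the control values than `unweighted` does. In practice the sampling probabilities would come from the study design rather than being fixed by hand.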

Article
Bootstrap Prediction Intervals of Temporal Disaggregation
Stats 2022, 5(1), 190-202; https://0-doi-org.brum.beds.ac.uk/10.3390/stats5010013 - 18 Feb 2022
Viewed by 646
Abstract
In this article, we propose an interval estimation method to trace an unknown disaggregate series within certain bandwidths. First, we consider two model-based disaggregation methods called the GLS disaggregation and the ARIMA disaggregation. Then, we develop iterative steps to construct AR-sieve bootstrap prediction intervals for model-based temporal disaggregation. As an illustration, we analyze the quarterly total balances of U.S. international trade in goods and services between the first quarter of 1992 and the fourth quarter of 2020. Full article
(This article belongs to the Special Issue Modern Time Series Analysis)

Article
Multivariate Threshold Regression Models with Cure Rates: Identification and Estimation in the Presence of the Esscher Property
Stats 2022, 5(1), 172-189; https://0-doi-org.brum.beds.ac.uk/10.3390/stats5010012 - 11 Feb 2022
Viewed by 589
Abstract
The first hitting time of a boundary or threshold by the sample path of a stochastic process is the central concept of threshold regression models for survival data analysis. Regression functions for the process and threshold parameters in these models are multivariate combinations of explanatory variates. The stochastic process under investigation may be a univariate stochastic process or a multivariate stochastic process. The stochastic processes of interest to us in this report are those that possess stationary independent increments (i.e., Lévy processes) as well as the Esscher property. The Esscher transform is a transformation of probability density functions that has applications in actuarial science, financial engineering, and other fields. Lévy processes with this property are often encountered in practical applications. Frequently, these applications also involve a ‘cure rate’ fraction because some individuals are susceptible to failure and others not. Cure rates may arise endogenously from the model alone or exogenously from mixing of distinct statistical populations in the data set. We show, using both theoretical analysis and case demonstrations, that model estimates derived from typical survival data may not be able to distinguish between individuals in the cure rate fraction who are not susceptible to failure and those who may be susceptible to failure but escape the fate by chance. The ambiguity is aggravated by right censoring of survival times and by minor misspecifications of the model. Slightly incorrect specifications for regression functions or for the stochastic process can lead to problems with model identification and estimation. In this situation, additional guidance for estimating the fraction of non-susceptibles must come from subject matter expertise or from data types other than survival times, censored or otherwise. 
The identifiability issue is confronted directly in threshold regression but is also present when applying other kinds of models commonly used for survival data analysis. Other methods, however, usually do not provide a framework for recognizing or dealing with the issue and so the issue is often unintentionally ignored. The theoretical foundations of this work are set out, presenting new and somewhat surprising results for the first hitting time distributions of Lévy processes that have the Esscher property. Full article
(This article belongs to the Special Issue Multivariate Statistics and Applications)
Article
All-NBA Teams’ Selection Based on Unsupervised Learning
Stats 2022, 5(1), 154-171; https://0-doi-org.brum.beds.ac.uk/10.3390/stats5010011 - 09 Feb 2022
Viewed by 734
Abstract
All-NBA Teams’ selections have great implications for the players’ and teams’ futures. Since contract extensions are highly related to awards, which can be seen as indexes that measure a player’s production in a year, team selection is of mutual interest for athletes and franchises. In this paper, we are interested in studying the current selection format. In particular, this study aims to: (i) identify the factors that are taken into consideration by voters when choosing the three All-NBA Teams; and (ii) suggest a new selection format to evaluate players’ performances. Average game-related statistics of all active NBA players in regular seasons from 2013-14 to 2018-19 were analyzed using LASSO (Logistic) Regression and Principal Component Analysis (PCA). It was possible: (i) to determine an All-NBA player profile; (ii) to determine that this profile can cause a misrepresentation of players’ modern and versatile gameplay styles; and (iii) to suggest a new way to evaluate and select players, through PCA. As a result, this paper presents a model that may help not only the NBA but any basketball league to better evaluate players; it may also be a resource for researchers who aim to investigate player performance, development, and their impact over many seasons. Full article
(This article belongs to the Special Issue Multivariate Statistics and Applications)

Article
Analysis of Household Pulse Survey Public-Use Microdata via Unit-Level Models for Informative Sampling
Stats 2022, 5(1), 139-153; https://0-doi-org.brum.beds.ac.uk/10.3390/stats5010010 - 07 Feb 2022
Viewed by 821
Abstract
The Household Pulse Survey, recently released by the U.S. Census Bureau, gathers information about the respondents’ experiences regarding employment status, food security, housing, physical and mental health, access to health care, and education disruption. Design-based estimates are produced for all 50 states and the District of Columbia (DC), as well as 15 Metropolitan Statistical Areas (MSAs). Using public-use microdata, this paper explores the effectiveness of using unit-level model-based estimators that incorporate spatial dependence for the Household Pulse Survey. In particular, we consider Bayesian hierarchical model-based spatial estimates for both a binomial and a multinomial response under informative sampling. Importantly, we demonstrate that these models can be easily estimated using Hamiltonian Monte Carlo through the Stan software package. As a result, these models can readily be implemented in a production environment. For both the binomial and multinomial responses, an empirical simulation study is conducted, which compares spatial and non-spatial models. Finally, using public-use Household Pulse Survey microdata, we provide an analysis that compares design-based and model-based estimators and demonstrates a reduction in standard errors for the model-based approaches. Full article
(This article belongs to the Special Issue Small Area Estimation: Theories, Methods and Applications)

Article
Selection of Auxiliary Variables for Three-Fold Linking Models in Small Area Estimation: A Simple and Effective Method
Stats 2022, 5(1), 128-138; https://0-doi-org.brum.beds.ac.uk/10.3390/stats5010009 - 05 Feb 2022
Viewed by 632
Abstract
Model-based estimation of small area means can lead to reliable estimates when the area sample sizes are small. This is accomplished by borrowing strength across related areas using models linking area means to related covariates and random area effects. The effective selection of variables to be included in the linking model is important in small area estimation. The main purpose of this paper is to extend the earlier work on variable selection for area level and two-fold subarea level models to three-fold sub-subarea models linking sub-subarea means to related covariates and random effects at the area, sub-area, and sub-subarea levels. The proposed variable selection method transforms the sub-subarea means to reduce the linking model to a standard regression model and applies commonly used criteria for variable selection, such as AIC and BIC, to the reduced model. The resulting criteria depend on the unknown sub-subarea means, which are then estimated using the sample sub-subarea means. Then, the estimated selection criteria are used for variable selection. Simulation results on the performance of the proposed variable selection method relative to methods based on area level and two-fold subarea level models are also presented. Full article
(This article belongs to the Special Issue Small Area Estimation: Theories, Methods and Applications)
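Once the linking model has been reduced to a standard regression model, as the abstract above describes, candidate covariate sets can be ranked with AIC and BIC. The sketch below shows only that final comparison step, under strong simplifying assumptions: the data are hypothetical, there is a single candidate covariate, and the criteria use the Gaussian form n·ln(RSS/n) plus a penalty, which omits additive constants shared by all models. The paper's transformation of the sub-subarea means is not reproduced.

```python
import math


def rss_intercept_only(y):
    """Residual sum of squares for the model y = b0 + error."""
    mean_y = sum(y) / len(y)
    return sum((yi - mean_y) ** 2 for yi in y)


def rss_simple_regression(x, y):
    """Residual sum of squares for the least-squares fit y = b0 + b1 * x."""
    n = len(y)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))


def aic(rss, n, k):
    """AIC up to an additive constant; k counts regression coefficients."""
    return n * math.log(rss / n) + 2 * k


def bic(rss, n, k):
    """BIC up to an additive constant; k counts regression coefficients."""
    return n * math.log(rss / n) + k * math.log(n)


# Hypothetical transformed responses and one candidate covariate.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]
n = len(y)

rss0 = rss_intercept_only(y)
rss1 = rss_simple_regression(x, y)
scores = {
    "intercept only": (aic(rss0, n, 1), bic(rss0, n, 1)),
    "intercept + x": (aic(rss1, n, 2), bic(rss1, n, 2)),
}
best_by_aic = min(scores, key=lambda m: scores[m][0])
best_by_bic = min(scores, key=lambda m: scores[m][1])
```

Lower values are better for both criteria. BIC penalizes each extra coefficient by ln(n) rather than 2, so it favors smaller models once n exceeds about 7.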
Article
A General Description of Growth Trends
Stats 2022, 5(1), 111-127; https://0-doi-org.brum.beds.ac.uk/10.3390/stats5010008 - 03 Feb 2022
Viewed by 680
Abstract
Time series that display periodicity can be described with a Fourier expansion. In a similar vein, a recently developed formalism enables the description of growth patterns with the optimal number of parameters. The method has been applied to the growth of national GDP, population and the COVID-19 pandemic; in all cases, the deviations of long-term growth patterns from purely exponential growth required no more than two additional parameters, mostly only one. Here, I utilize the new framework to develop a unified formulation for all functions that describe growth deceleration, wherein the growth rate decreases with time. The result offers the prospect of a new general tool for trend removal in time-series analysis. Full article
(This article belongs to the Special Issue Modern Time Series Analysis)

Editorial
Acknowledgment to Reviewers of Stats in 2021
Stats 2022, 5(1), 108-110; https://0-doi-org.brum.beds.ac.uk/10.3390/stats5010007 - 28 Jan 2022
Viewed by 647
Abstract
Rigorous peer-reviews are the basis of high-quality academic publishing [...] Full article
Article
A Bayesian Approach for Imputation of Censored Survival Data
Stats 2022, 5(1), 89-107; https://0-doi-org.brum.beds.ac.uk/10.3390/stats5010006 - 26 Jan 2022
Viewed by 924
Abstract
A common feature of much survival data is censoring due to incompletely observed lifetimes. Survival analysis methods and models have been designed to take account of this and provide appropriate relevant summaries, such as the Kaplan–Meier plot and the commonly quoted median survival time of the group under consideration. However, a single summary is not really a relevant quantity for communication to an individual patient, as it conveys no notion of variability and uncertainty, and the Kaplan–Meier plot can be difficult for the patient to understand and is also often misinterpreted, even by some physicians. This paper considers an alternative approach of treating the censored data as a form of missing, incomplete data and proposes an imputation scheme to construct a completed dataset. This allows the use of standard descriptive statistics and graphical displays to convey both typical outcomes and the associated variability. We propose a Bayesian approach to impute any censored observations, making use of other information in the dataset, and provide a completed dataset. This can then be used for standard displays, summaries, and even, in theory, analysis and model fitting. We particularly focus on the data visualisation advantages of the completed data, allowing displays such as density plots, boxplots, etc., to complement the usual Kaplan–Meier display of the original dataset. We study the performance of this approach through a simulation study and consider its application to two clinical examples. Full article
(This article belongs to the Special Issue Survival Analysis: Models and Applications)

Article
A Noncentral Lindley Construction Illustrated in an INAR(1) Environment
Stats 2022, 5(1), 70-88; https://0-doi-org.brum.beds.ac.uk/10.3390/stats5010005 - 10 Jan 2022
Viewed by 617
Abstract
This paper proposes a previously unconsidered generalization of the Lindley distribution by allowing for a measure of noncentrality. Essential structural characteristics are investigated and derived in explicit and tractable forms, and the estimability of the model is illustrated via the fit of this developed model to real data. Subsequently, this model is used as a candidate for the parameter of a Poisson model, which allows for departure from the usual equidispersion restriction that the Poisson offers when modelling count data. This Poisson-noncentral Lindley is also systematically investigated and characteristics are derived. The value of this count model is illustrated and implemented as the count error distribution in an integer autoregressive environment, and juxtaposed against other popular models. The effect of the systematically-induced noncentrality parameter is illustrated and paves the way for future flexible modelling not only as a standalone contender in continuous Lindley-type scenarios but also in discrete and discrete time series scenarios when the often-encountered equidispersed assumption is not adhered to in practical data environments. Full article
(This article belongs to the Special Issue Modern Time Series Analysis)

Article
A Flexible Mixed Model for Clustered Count Data
Stats 2022, 5(1), 52-69; https://0-doi-org.brum.beds.ac.uk/10.3390/stats5010004 - 07 Jan 2022
Viewed by 640
Abstract
Clustered count data are commonly modeled using Poisson regression with random effects to account for the correlation induced by clustering. The Poisson mixed model allows for overdispersion via the nature of the within-cluster correlation; however, departures from equi-dispersion may also exist due to the underlying count process mechanism. We study the cross-sectional COM-Poisson regression model (a generalized regression model for count data in light of data dispersion) together with random effects for analysis of clustered count data. We demonstrate model flexibility of the COM-Poisson random intercept model, including choice of the random effect distribution, via simulated and real data examples. We find that COM-Poisson mixed models provide comparable model fit to well-known mixed models for associated special cases of clustered discrete data, and result in improved model fit for data with intermediate levels of over- or underdispersion in the count mechanism. Accordingly, the proposed models are useful for capturing dispersion not consistent with commonly used statistical models, and also serve as a practical diagnostic tool. Full article
(This article belongs to the Special Issue Statistics, Analytics, and Inferences for Discrete Data)
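The way the COM-Poisson distribution handles over- and underdispersion, as mentioned in the abstract above, can be seen from its standard probability mass function, P(Y = y) = λ^y / (y!)^ν / Z(λ, ν). The sketch below covers only that distribution, not the paper's mixed model; the function names and the truncation of the normalizing sum at `max_terms` are assumptions made for illustration.

```python
import math


def com_poisson_pmf(y, lam, nu, max_terms=200):
    """COM-Poisson probability P(Y = y) with rate lam > 0 and dispersion nu > 0.

    The normalizing constant Z(lam, nu) is an infinite sum; here it is
    truncated at max_terms and accumulated on the log scale for stability.
    """
    def log_term(j):
        return j * math.log(lam) - nu * math.lgamma(j + 1)

    log_terms = [log_term(j) for j in range(max_terms)]
    largest = max(log_terms)
    log_z = largest + math.log(sum(math.exp(t - largest) for t in log_terms))
    return math.exp(log_term(y) - log_z)


def mean_and_variance(lam, nu, max_terms=200):
    """Mean and variance computed from the truncated probability table."""
    probs = [com_poisson_pmf(y, lam, nu, max_terms) for y in range(max_terms)]
    mean = sum(y * p for y, p in enumerate(probs))
    variance = sum((y - mean) ** 2 * p for y, p in enumerate(probs))
    return mean, variance


poisson_case = mean_and_variance(3.0, 1.0)   # nu = 1 recovers the Poisson
under_case = mean_and_variance(3.0, 2.0)     # nu > 1: variance below the mean
over_case = mean_and_variance(3.0, 0.5)      # nu < 1: variance above the mean
```

Setting ν = 1 gives the ordinary Poisson, so a fitted ν far from 1 is itself a simple diagnostic for dispersion in the count mechanism.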

Article
Optimal Neighborhood Selection for AR-ARCH Random Fields with Application to Mortality
Stats 2022, 5(1), 26-51; https://0-doi-org.brum.beds.ac.uk/10.3390/stats5010003 - 30 Dec 2021
Viewed by 548
Abstract
This article proposes an optimal and robust methodology for model selection. The model of interest is a parsimonious alternative framework for modeling the stochastic dynamics of mortality improvement rates introduced recently in the literature. The approach models mortality improvements using a random field specification with a given causal structure instead of the commonly used factor-based decomposition framework. It captures some well-documented stylized facts of mortality behavior including: dependencies among adjacent cohorts, the cohort effects, cross-generation correlations, and the conditional heteroskedasticity of mortality. Such a class of models is a generalization of the now widely used AR-ARCH models for univariate processes. As the framework is general, a simple variant called the three-level memory model was investigated and illustrated. However, it is not clear which is the best parameterization to use for specific mortality uses. In this paper, we investigate the optimal model choice and parameter selection among potential candidate models. More formally, we propose a methodology well-suited to such a random field, able to select the best model in the sense that the model is not only correct but also the most economical among all the correct models. Formally, we show that a criterion based on a penalization of the log-likelihood, e.g., the Bayesian Information Criterion, is consistent. Finally, we investigate the methodology based on Monte-Carlo experiments as well as real-world datasets. Full article
(This article belongs to the Special Issue Modern Time Series Analysis)

Article
Path Analysis of Sea-Level Rise and Its Impact
Stats 2022, 5(1), 12-25; https://0-doi-org.brum.beds.ac.uk/10.3390/stats5010002 - 24 Dec 2021
Viewed by 987
Abstract
Global sea-level rise has been drawing increasingly greater attention in recent years, as it directly impacts the livelihood and sustainable development of humankind. Our research focuses on identifying causal factors and pathways of sea-level change (both global and regional) and subsequently predicting the magnitude of such changes. To this end, we have designed a novel analysis pipeline including three sequential steps: (1) a dynamic structural equation model (dSEM) to identify pathways between the global mean sea level (GMSL) and various predictors, (2) a vector autoregression model (VAR) to quantify the GMSL changes due to the significant relations identified in the first step, and (3) a generalized additive model (GAM) to model the relationship between regional sea level and GMSL. Historical records of GMSL and other variables from 1992 to 2020 were used to calibrate the analysis pipeline. Our results indicate that greenhouse gases, water and air temperatures, changes in Antarctic and Greenland Ice Sheet mass, sea ice, and historical sea level all play a significant role in future sea-level rise. The resulting 95% upper bound of the sea-level projections was combined with a threshold for extreme flooding to map out the extent of sea-level rise in coastal communities using a digital coastal tracker. Full article

Article
Spectral Clustering of Mixed-Type Data
Stats 2022, 5(1), 1-11; https://0-doi-org.brum.beds.ac.uk/10.3390/stats5010001 - 23 Dec 2021
Viewed by 792
Abstract
Cluster analysis seeks to assign objects with similar characteristics into groups called clusters so that objects within a group are similar to each other and dissimilar to objects in other groups. Spectral clustering has been shown to perform well in different scenarios on continuous data: it can detect convex and non-convex clusters, and can detect overlapping clusters. However, the restriction to continuous data can be limiting in real applications where data are often of mixed type, i.e., data that contain both continuous and categorical features. This paper looks at extending spectral clustering to mixed-type data. The new method replaces the Euclidean-based similarity distance used in conventional spectral clustering with different dissimilarity measures for continuous and categorical variables. A global dissimilarity measure is then computed using a weighted sum, and a Gaussian kernel is used to convert the dissimilarity matrix into a similarity matrix. The new method includes an automatic tuning of the variable weight and kernel parameter. The performance of spectral clustering in different scenarios is compared with that of two state-of-the-art mixed-type data clustering methods, k-prototypes and KAMILA, using several simulated and real data sets. Full article
(This article belongs to the Special Issue Recent Developments in Clustering and Classification Methods)
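The similarity construction outlined in the abstract above (separate dissimilarities for continuous and categorical variables, a weighted sum, then a Gaussian kernel) can be sketched as follows. The specific choices are assumptions for illustration only: Euclidean distance for the continuous part, the share of mismatching categories for the categorical part, and a fixed `weight` and `sigma`, whereas the paper tunes the weight and kernel parameter automatically. The eigendecomposition and clustering steps of spectral clustering are omitted.

```python
import math


def mixed_dissimilarity(a, b, weight):
    """Weighted sum of a continuous and a categorical dissimilarity.

    Each point is a pair (continuous_values, categorical_values).
    """
    cont_a, cat_a = a
    cont_b, cat_b = b
    d_cont = math.sqrt(sum((u - v) ** 2 for u, v in zip(cont_a, cont_b)))
    d_cat = sum(u != v for u, v in zip(cat_a, cat_b)) / len(cat_a)
    return weight * d_cont + (1.0 - weight) * d_cat


def similarity_matrix(points, weight=0.5, sigma=1.0):
    """Gaussian-kernel similarity exp(-d^2 / (2 sigma^2)) for every pair."""
    n = len(points)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            d = mixed_dissimilarity(points[i], points[j], weight)
            sim[i][j] = math.exp(-(d ** 2) / (2.0 * sigma ** 2))
    return sim


# Hypothetical toy data: two nearby points sharing categories, one distant point.
points = [
    ((0.0, 0.1), ("a", "x")),
    ((0.1, 0.0), ("a", "x")),
    ((3.0, 3.1), ("b", "y")),
]
S = similarity_matrix(points)
```

The matrix `S` would then replace the usual Euclidean-based similarity matrix as input to a spectral clustering routine. In this toy case the first two points end up far more similar to each other than either is to the third.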
