Computational Statistics and Data Analysis

A special issue of Mathematics (ISSN 2227-7390). This special issue belongs to the section "Probability and Statistics".

Deadline for manuscript submissions: closed (30 January 2023) | Viewed by 25796

Special Issue Editor


Prof. Dr. Gaorong Li
Guest Editor
School of Statistics, Beijing Normal University, Beijing 100875, China
Interests: high-dimensional statistics; nonparametric statistics and complex data analysis; model/variable selection; statistical learning; causal inference; longitudinal/panel data analysis; measurement error model; empirical likelihood

Special Issue Information

Dear Colleagues,

With the development of scientific techniques, computational statistics and data analysis have become increasingly important in diverse areas of science, engineering, and the humanities, ranging from genomics and health sciences to economics, finance, and machine learning. Statistical methodology and computing are fundamental to modeling and analyzing the data arising in these fields, and have become hot topics in statistics.

For this Special Issue, we invite high-quality papers in computational statistics and data analysis: original research articles as well as review articles that will stimulate continuing efforts to develop statistical methodology and its applications to data analysis.

Prof. Dr. Gaorong Li
Guest Editor

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles as well as short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Mathematics is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2600 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • Bootstrapping
  • Causal inference
  • Classification
  • Data analytical strategies and methodologies applied in biostatistics
  • Dimension reduction for high-dimensional data analysis
  • Large-scale inference for Gaussian graphical models and covariance estimation
  • Longitudinal/panel data analysis
  • Massive networks
  • Medical statistics
  • Nonparametric and semiparametric models
  • Optimal portfolio
  • Robust statistics
  • Statistical methodology and computing for data analysis
  • Statistical methodology and computing for noisy data, such as measurement error data, missing data, etc.
  • Sufficient dimension reduction methods in regression analysis
  • Variable/model selection for high-dimensional data

Published Papers (14 papers)


Research

24 pages, 516 KiB  
Article
Modeling Under-Dispersed Count Data by the Generalized Poisson Distribution via Two New MM Algorithms
by Xun-Jian Li, Guo-Liang Tian, Mingqian Zhang, George To Sum Ho and Shuang Li
Mathematics 2023, 11(6), 1478; https://0-doi-org.brum.beds.ac.uk/10.3390/math11061478 - 17 Mar 2023
Cited by 1 | Viewed by 1195
Abstract
Under-dispersed count data often appear in clinical trials, medical studies, demography, actuarial science, ecology, biology, industry and engineering. Although the generalized Poisson (GP) distribution possesses the twin properties of under- and over-dispersion, over the past 50 years many authors have treated the GP distribution only as an alternative to the negative binomial distribution for modeling over-dispersed count data. To the best of our knowledge, the problem of calculating maximum likelihood estimates (MLEs) of the parameters in the GP model, both without and with covariates, had not been solved for the under-dispersed case until now. In this paper, we first develop a new minorization–maximization (MM) algorithm to calculate the MLEs of the parameters of the GP distribution with under-dispersion, and then we develop another new MM algorithm to compute the MLEs of the vector of regression coefficients in the GP mean regression model for the under-dispersed case. Three hypothesis tests (the likelihood ratio, Wald and score tests) are provided, and simulations are conducted. The Bangladesh Demographic and Health Survey dataset is analyzed to illustrate the proposed methods, and comparisons with the existing Conway–Maxwell–Poisson regression model are also presented.
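For orientation, here is a minimal sketch of the generalized Poisson log-pmf in Consul's parameterization together with a generic numerical MLE via scipy. This is not the authors' MM algorithms; the starting values and the Poisson test data are illustrative only.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln

def gp_logpmf(x, theta, lam):
    # Consul's generalized Poisson:
    # P(X = x) = theta * (theta + lam*x)**(x-1) * exp(-theta - lam*x) / x!
    # lam < 0 gives under-dispersion, lam > 0 over-dispersion, lam = 0 is Poisson.
    return (np.log(theta) + (x - 1) * np.log(theta + lam * x)
            - theta - lam * x - gammaln(x + 1))

def gp_mle(x):
    # Generic numerical MLE for orientation only; the paper's MM algorithms
    # are tailored to the under-dispersed case (lam < 0).
    def nll(p):
        theta, lam = p
        if theta <= 0 or lam >= 1 or np.any(theta + lam * x <= 0):
            return np.inf
        return -gp_logpmf(x, theta, lam).sum()
    return minimize(nll, x0=[x.mean(), 0.0], method="Nelder-Mead").x

rng = np.random.default_rng(1)
sample = rng.poisson(3.0, size=500)   # Poisson data, so lam_hat should be near 0
print(gp_mle(sample))                 # -> approx (theta_hat, lam_hat)
```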

13 pages, 277 KiB  
Article
A New Quantile-Based Approach for LASSO Estimation
by Ismail Shah, Hina Naz, Sajid Ali, Amani Almohaimeed and Showkat Ahmad Lone
Mathematics 2023, 11(6), 1452; https://0-doi-org.brum.beds.ac.uk/10.3390/math11061452 - 16 Mar 2023
Cited by 2 | Viewed by 1088
Abstract
Regularization regression techniques are widely used to overcome a model’s parameter estimation problem in the presence of multicollinearity. Several biased estimation techniques are available in the literature, including ridge regression, the Least Absolute Shrinkage and Selection Operator (LASSO), and the elastic net. In this work, we study the performance of the classical LASSO, the adaptive LASSO, and ordinary least squares (OLS) in high-multicollinearity scenarios and propose some new estimators for the LASSO tuning parameter “k”. The performance of the proposed estimators is evaluated using extensive Monte Carlo simulations and real-life examples. Based on the mean square error criterion, the results suggest that the proposed estimators outperform the existing ones.
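For context, a minimal sketch contrasting OLS and LASSO on a highly collinear design. The penalty is chosen by cross-validation here, not by the paper's quantile-based estimators of “k”; the design, coefficients, and noise are synthetic.

```python
import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression

rng = np.random.default_rng(0)
n, p = 100, 10
z = rng.normal(size=(n, 1))
X = z + 0.05 * rng.normal(size=(n, p))        # highly collinear design
beta = np.array([3, 1.5, 0, 0, 2, 0, 0, 0, 0, 0], float)
y = X @ beta + rng.normal(size=n)

ols = LinearRegression().fit(X, y)
lasso = LassoCV(cv=5).fit(X, y)               # CV-chosen penalty, not the quantile rule
print("OLS coefficient norm :", np.linalg.norm(ols.coef_))   # inflated by collinearity
print("LASSO nonzero count  :", np.sum(lasso.coef_ != 0), "alpha:", lasso.alpha_)
```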
31 pages, 1548 KiB  
Article
Mathematical Analysis and Modeling of the Factors That Determine the Quality of Life in the City Councils of Chile
by Gonzalo Ríos-Vásquez and Hanns de la Fuente-Mella
Mathematics 2023, 11(5), 1218; https://0-doi-org.brum.beds.ac.uk/10.3390/math11051218 - 02 Mar 2023
Cited by 1 | Viewed by 1552
Abstract
The quality of life index is an indicator published yearly since 2010 by the Institute of Urban and Territorial Studies and the Chilean Chamber of Construction, covering 99 municipalities and communes of the national territory. This research provides an approach to understanding how various dimensions and variables interact with quality of life in Chilean communes, considering multiple factors and perspectives through information from public sources and social indicators. Variables covering demographic, sociodemographic, economic and urban indicators were analyzed, and the model developed allows an understanding of how these variables are related. Education, own incomes, municipal spending and green areas were found to relate directly to quality of life, while overcrowding and municipal funds negatively affect communal welfare. Moreover, the variables chosen as explanatory variables allow for the development of an efficiency model; Cobb–Douglas and translog functional forms were tested, and the Cobb–Douglas form was found to fit the data set and the structure of the variables better. The results of the efficiency model show that education, municipal funds and own incomes significantly affect efficiency, with a mean value of approximately 47%, minimum values close to 30% and maximum values of approximately 60%. Finally, a cluster analysis was carried out with k-means, k-medoids and hierarchical clustering algorithms; in all cases the results were similar, suggesting four groups with differences in the analyzed variables, especially overcrowding, education, quality of life and wellness.
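Since a Cobb–Douglas specification is linear in logarithms, the functional-form comparison reduces to regressions like the sketch below (statsmodels assumed available). The columns and elasticities are invented stand-ins, and the paper's actual efficiency model is a frontier analysis rather than plain OLS.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 99                                          # one row per commune (illustrative)
X = rng.lognormal(size=(n, 3))                  # stand-ins: education, own income, green areas
y = (2.0 * X[:, 0] ** 0.4 * X[:, 1] ** 0.3 * X[:, 2] ** 0.1
     * np.exp(0.1 * rng.normal(size=n)))        # Cobb-Douglas output with noise

# Cobb-Douglas is linear in logs: log y = b0 + sum_i b_i * log x_i
fit = sm.OLS(np.log(y), sm.add_constant(np.log(X))).fit()
print(fit.params)                               # recovered elasticities b_i
```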

18 pages, 362 KiB  
Article
High-Dimensional Regression Adjustment Estimation for Average Treatment Effect with Highly Correlated Covariates
by Zeyu Diao, Lili Yue, Fanrong Zhao and Gaorong Li
Mathematics 2022, 10(24), 4715; https://0-doi-org.brum.beds.ac.uk/10.3390/math10244715 - 12 Dec 2022
Viewed by 881
Abstract
Regression adjustment is often used to estimate the average treatment effect (ATE) in randomized experiments. Recently, some penalty-based regression adjustment methods have been proposed to handle the high-dimensional problem. However, these existing high-dimensional regression adjustment methods may fail to achieve satisfactory performance when the covariates are highly correlated. In this paper, we propose a novel adjustment estimation method for the ATE by combining the semi-standard partial covariance (SPAC) and regression adjustment methods. Under some regularity conditions, the asymptotic normality of the proposed SPAC-adjusted ATE estimator is shown. Simulation studies and an analysis of HER2 breast cancer data illustrate the advantage of the proposed SPAC adjustment method in addressing highly correlated covariates in the Rubin causal model.
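For intuition, a sketch of generic high-dimensional regression adjustment by outcome imputation, with an off-the-shelf Lasso standing in for the paper's SPAC-based step; the treatment assignment, covariance structure, and effect size are illustrative.

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(42)
n, p, tau = 400, 50, 2.0
cov = 0.8 * np.ones((p, p)) + 0.2 * np.eye(p)     # highly correlated covariates
X = rng.multivariate_normal(np.zeros(p), cov, size=n)
T = rng.integers(0, 2, size=n)                    # completely randomized treatment
y = tau * T + X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)

# Regression adjustment: impute both potential outcomes for every unit.
# The paper replaces this plain Lasso with its SPAC-based adjustment.
m1 = LassoCV(cv=5).fit(X[T == 1], y[T == 1])
m0 = LassoCV(cv=5).fit(X[T == 0], y[T == 0])
ate_adj = (m1.predict(X) - m0.predict(X)).mean()
ate_dim = y[T == 1].mean() - y[T == 0].mean()     # unadjusted difference in means
print(f"adjusted ATE: {ate_adj:.2f}, difference in means: {ate_dim:.2f}")
```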
14 pages, 2568 KiB  
Article
Convergence Behavior of Optimal Cut-Off Points Derived from Receiver Operating Characteristics Curve Analysis: A Simulation Study
by Oke Gerke and Antonia Zapf
Mathematics 2022, 10(22), 4206; https://0-doi-org.brum.beds.ac.uk/10.3390/math10224206 - 10 Nov 2022
Cited by 1 | Viewed by 2086
Abstract
The area under the receiver operating characteristics curve is a popular measure of the overall discriminatory power of a continuous variable used to indicate the presence of an outcome of interest, such as disease or disease progression. In clinical practice, cut-off points are greatly appreciated as benchmark values for further treatment planning, despite the loss of information that such a dichotomization implies. Optimal cut-off points are often derived from fixed-sample-size studies, and the aim of this study was to investigate the convergence behavior of optimal cut-off points with increasing sample size and to explore a heuristic, path-based algorithm for cut-off point determination that targets stagnating cut-off point values. To this end, the closest-to-(0,1) criterion in receiver operating characteristics curve analysis was used, and the heuristic, path-based algorithm aimed at cut-off points that deviated less than 1% from the cut-off point of the previous iteration. Such a heuristic determination stopped after only a few iterations, implying practicable sample sizes; however, the result was at best a rough estimate of an optimal cut-off point, one that was unbiased for a prevalence of 0.5, positively biased for prevalences smaller than 0.5, and negatively biased for prevalences larger than 0.5.
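The closest-to-(0,1) criterion itself is easy to state in code; a minimal sketch with synthetic biomarker data follows (scikit-learn assumed). It shows only the criterion, not the authors' iteration-stopping algorithm.

```python
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(7)
n = 500
y = rng.integers(0, 2, size=n)              # disease status, prevalence ~0.5
x = rng.normal(loc=y, scale=1.0)            # biomarker, shifted upward when diseased

fpr, tpr, thresholds = roc_curve(y, x)
d = np.sqrt(fpr ** 2 + (1 - tpr) ** 2)      # distance to the ideal point (0, 1)
cut = thresholds[np.argmin(d)]
print("closest-to-(0,1) cut-off:", cut)     # rerun with growing n to watch convergence
```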

10 pages, 290 KiB  
Article
Parametric Frailty Analysis in Presence of Collinearity: An Application to Assessment of Infant Mortality
by Olayan Albalawi, Anu Sirohi, Piyush Kant Rai and Ayed R. A. Alanzi
Mathematics 2022, 10(13), 2255; https://0-doi-org.brum.beds.ac.uk/10.3390/math10132255 - 27 Jun 2022
Cited by 1 | Viewed by 1183
Abstract
This paper analyzes time-to-event data in the presence of collinearity. To address collinearity, the ridge regression estimator has been applied in multiple and logistic regression as an alternative to the maximum likelihood estimator (MLE), among others; it has a smaller mean square error (MSE) and is therefore more precise. This paper generalizes the approach to address collinearity in the frailty model, a random-effect model for the time variable. A simulation study is conducted to evaluate its performance. Furthermore, the proposed method is applied to real-life data from the largest sample survey of India, the National Family Health Survey (2005–2006), to evaluate the association of different determinants with infant mortality in India.
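For reference, here is the ridge estimator at the core of the approach, in its plain linear-model form; the paper embeds this penalty into a parametric frailty (random-effect survival) model, which this sketch does not attempt. The data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 200, 4
X = rng.normal(size=(n, p))
X[:, 3] = X[:, 0] + 0.01 * rng.normal(size=n)   # near-collinear column
y = X @ np.array([1.0, 2.0, 0.5, 1.0]) + rng.normal(size=n)

def ridge(X, y, k):
    # beta_hat = (X'X + k*I)^(-1) X'y; k = 0 recovers the unstable OLS/ML solution
    return np.linalg.solve(X.T @ X + k * np.eye(X.shape[1]), X.T @ y)

print("OLS  :", ridge(X, y, 0.0))   # erratic under collinearity
print("ridge:", ridge(X, y, 1.0))   # shrunken, smaller-MSE estimate
```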

30 pages, 7745 KiB  
Article
Short- and Medium-Term Power Demand Forecasting with Multiple Factors Based on Multi-Model Fusion
by Qingqing Ji, Shiyu Zhang, Qiao Duan, Yuhan Gong, Yaowei Li, Xintong Xie, Jikang Bai, Chunli Huang and Xu Zhao
Mathematics 2022, 10(12), 2148; https://0-doi-org.brum.beds.ac.uk/10.3390/math10122148 - 20 Jun 2022
Cited by 5 | Viewed by 1828
Abstract
With the continuous development of the economy and society, power demand forecasting has become an important task for the power industry. Accurate power demand forecasting can promote the operation and development of the power supply industry. However, since power consumption is affected by many factors, it is difficult to predict power demand accurately. With the accumulation of data in the power industry, machine learning has shown great potential in power demand forecasting. In this study, gradient boosting decision trees (GBDT), extreme gradient boosting (XGBoost) and the light gradient boosting machine (LightGBM) are integrated by stacking to build an XLG-LR fusion model for predicting power demand. First, 13 months of electricity and meteorological data were preprocessed, and the hyperparameters of each model were tuned and optimized. Next, a prediction model was built on the training set (70% of the data) using the optimal hyperparameter configuration. Finally, the test set (30% of the data) was used to evaluate the performance of each model. Mean absolute error (MAE), root mean square error (RMSE), mean absolute percentage error (MAPE), and the goodness-of-fit coefficient (R²) were used to analyze each model over different time spans, including seasonal, weekly, and monthly forecasting performance. Furthermore, the proposed fusion model was compared with neural network models such as GRU, LSTM and TCN. The results showed that the XLG-LR model achieved the best predictions over all time spans while requiring the least computation time compared to the neural network models. This method can provide a more reliable reference for the operation and dispatch of power enterprises and for future power construction and planning.
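A minimal stacking sketch in the spirit of XLG-LR, assuming the xgboost and lightgbm packages are installed; the features, target, and hyperparameters below are placeholders, not the paper's tuned configuration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression
from xgboost import XGBRegressor        # assumes xgboost is installed
from lightgbm import LGBMRegressor      # assumes lightgbm is installed

# Stack GBDT, XGBoost and LightGBM under a linear meta-learner,
# mirroring the XLG-LR idea on synthetic stand-in data.
stack = StackingRegressor(
    estimators=[
        ("gbdt", GradientBoostingRegressor(n_estimators=200)),
        ("xgb", XGBRegressor(n_estimators=200)),
        ("lgbm", LGBMRegressor(n_estimators=200)),
    ],
    final_estimator=LinearRegression(),
    cv=5,
)

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))           # stand-in for weather/calendar features
y = 10 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(size=300)
stack.fit(X, y)
print("in-sample R^2:", stack.score(X, y))
```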

25 pages, 746 KiB  
Article
Statistical Inference of Dynamic Conditional Generalized Pareto Distribution with Weather and Air Quality Factors
by Chunli Huang, Xu Zhao, Weihu Cheng, Qingqing Ji, Qiao Duan and Yufei Han
Mathematics 2022, 10(9), 1433; https://0-doi-org.brum.beds.ac.uk/10.3390/math10091433 - 24 Apr 2022
Cited by 4 | Viewed by 1803
Abstract
Air pollution is a major global problem, closely related to economic and social development and ecological environment construction. Air pollution data for most regions of China are closely correlated with time and season and are affected by multidimensional factors such as meteorology and air quality. In contrast with classical peaks-over-threshold modeling approaches, we use a deep learning technique and three new dynamic conditional generalized Pareto distribution (DCP) models with weather and air quality factors to fit the time dependence of air pollutant concentrations, and we make statistical inferences about their application in air quality analysis. Specifically, in the three proposed DCP models, a dynamic autoregressive exponential function mechanism is applied to the time-varying scale parameter and tail index of the conditional generalized Pareto distribution, and a sufficiently high threshold is chosen using two threshold selection procedures. The probabilistic properties of the DCP model and the statistical properties of the maximum likelihood estimator (MLE) are investigated, and simulations show the stability and sensitivity of the MLE. The three proposed models are applied to fit the PM2.5 time series in Beijing from 2015 to 2021. Real data are used to illustrate the advantages of the DCP, especially compared with the estimation volatility of GARCH and with the AIC or BIC criteria. The DCP model involving both weather and air quality factors performs better than the two models with weather factors or air quality factors alone. Finally, a prediction model based on long short-term memory (LSTM) is used to predict PM2.5 concentration, achieving satisfactory results.
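For contrast with the dynamic models, here is the classical static peaks-over-threshold fit that the DCP models generalize, using scipy's generalized Pareto distribution on synthetic stand-in data.

```python
import numpy as np
from scipy.stats import genpareto

rng = np.random.default_rng(11)
pm25 = rng.lognormal(mean=3.5, sigma=0.6, size=2000)   # stand-in for a PM2.5 series

u = np.quantile(pm25, 0.95)                  # a sufficiently high threshold
exc = pm25[pm25 > u] - u                     # exceedances over the threshold
xi, loc, sigma = genpareto.fit(exc, floc=0)  # static GPD fit; the DCP models instead
                                             # let xi and sigma evolve with time/covariates
print(f"tail index xi = {xi:.3f}, scale sigma = {sigma:.3f}")
```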

17 pages, 2137 KiB  
Article
Effect of Money Supply, Population, and Rent on Real Estate: A Clustering Analysis in Taiwan
by Cheng-Hong Yang, Borcy Lee and Yu-Da Lin
Mathematics 2022, 10(7), 1155; https://0-doi-org.brum.beds.ac.uk/10.3390/math10071155 - 02 Apr 2022
Cited by 6 | Viewed by 2813
Abstract
Real estate is a complex and unpredictable industry because of the many factors that influence it, and conducting a thorough analysis of these factors is challenging. This study explores why house prices in Taiwan have continued to increase over the last 10 years. A clustering analysis based on a double-bottom map particle swarm optimization algorithm was applied to real estate–related data collected from public websites. We report key findings from the clustering results and identify three essential variables that could affect trends in real estate prices: money supply, population, and rent. Mortgages are issued more frequently as additional real estate is created, increasing the money supply; the relationship between real estate and money supply can provide the government with baseline data for managing the real estate market and avoiding unlimited growth. The government can use sociodemographic data to predict population trends and, in turn, prevent real estate bubbles and maintain steady economic growth. Renting and social housing are common among the younger generation in Taiwan, so the results of this study could also assist the government in managing the relationship between the rental and real estate markets.
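The clustering step can be prototyped with ordinary k-means after standardization; the paper's double-bottom map PSO algorithm is not reproduced here, and the columns below are invented stand-ins for the money supply, population, and rent variables.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(5)
# Stand-in columns: money supply, population, rent (units differ, so standardize)
data = rng.normal(size=(150, 3)) * [100.0, 1e4, 50.0] + [500.0, 5e4, 300.0]

Xs = StandardScaler().fit_transform(data)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(Xs)
print(np.bincount(km.labels_))    # cluster sizes
```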

23 pages, 465 KiB  
Article
Bootstrap Tests for the Location Parameter under the Skew-Normal Population with Unknown Scale Parameter and Skewness Parameter
by Rendao Ye, Bingni Fang, Weixiao Du, Kun Luo and Yiting Lu
Mathematics 2022, 10(6), 921; https://0-doi-org.brum.beds.ac.uk/10.3390/math10060921 - 13 Mar 2022
Cited by 2 | Viewed by 1638
Abstract
In this paper, inference on the location parameter of the skew-normal population is considered when the scale and skewness parameters are unknown. Firstly, Bootstrap test statistics and Bootstrap confidence intervals for the location parameter of a single population are constructed based on the methods of moment estimation and maximum likelihood estimation, respectively. Secondly, the Behrens–Fisher-type and interval estimation problems for two skew-normal populations are discussed. Thirdly, Monte Carlo simulations show that the proposed Bootstrap approaches perform satisfactorily in terms of Type I error probability and power in most cases, whether based on the moment estimator or the ML estimator; further, the Bootstrap test based on the moment estimator is better than that based on the ML estimator in most situations. Finally, the above approaches are applied to real data on leaf area index, carbon fiber strength and red blood cell counts in athletes to verify the reasonableness and effectiveness of the proposed approaches.
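A percentile-bootstrap sketch for the skew-normal location parameter using scipy's ML fit; the paper's procedures also use moment estimators and cover two-sample Behrens–Fisher settings, which this omits, and the sample below is simulated.

```python
import numpy as np
from scipy.stats import skewnorm

rng = np.random.default_rng(9)
x = skewnorm.rvs(a=4, loc=1.0, scale=2.0, size=80, random_state=rng)

B, locs = 500, []
for _ in range(B):
    xb = rng.choice(x, size=x.size, replace=True)   # resample with replacement
    a_hat, loc_hat, scale_hat = skewnorm.fit(xb)    # ML fit on the bootstrap sample
    locs.append(loc_hat)
lo, hi = np.quantile(locs, [0.025, 0.975])
print(f"95% percentile-bootstrap CI for the location: ({lo:.2f}, {hi:.2f})")
```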

18 pages, 361 KiB  
Article
Variable Selection for Generalized Linear Models with Interval-Censored Failure Time Data
by Rong Liu, Shishun Zhao, Tao Hu and Jianguo Sun
Mathematics 2022, 10(5), 763; https://0-doi-org.brum.beds.ac.uk/10.3390/math10050763 - 27 Feb 2022
Cited by 1 | Viewed by 1690
Abstract
Variable selection is needed in many fields and has been discussed by many authors in various situations, especially under linear models and with complete data. One common situation where variable selection is required is identifying important risk factors from a large number of covariates. In this paper, we consider the problem when one observes interval-censored failure time data arising from generalized linear models, for which no established method seems to exist. To address this, we propose a penalized least squares method that uses an unbiased transformation, and the oracle property of the method is established along with the asymptotic normality of the resulting estimators of the regression parameters. Simulation studies demonstrate that the proposed method performs well in practical situations. In addition, the method is applied to a motivating example concerning childhood mortality data from Nigeria.
18 pages, 5752 KiB  
Article
Estimation of COVID-19 Transmission and Advice on Public Health Interventions
by Qingqing Ji, Xu Zhao, Hanlin Ma, Qing Liu, Yiwen Liu and Qiyue Guan
Mathematics 2021, 9(22), 2849; https://0-doi-org.brum.beds.ac.uk/10.3390/math9222849 - 10 Nov 2021
Cited by 2 | Viewed by 1602
Abstract
At the end of 2019, an outbreak of the novel coronavirus (COVID-19) made a profound impact on production and daily life in China. Up to now, COVID-19 has not been fully controlled around the world. Based on clinical research progress on infectious diseases, combined with epidemiological theory and possible disease control measures, this paper establishes a Susceptible–Infected–Recovered (SIR) model that reflects the transmission characteristics of the new coronavirus, using least squares estimation (LSE) to estimate the model parameters. The simulation results show that quarantine and containment measures, as well as vaccine and drug development, can effectively control the spread of the epidemic. The model's predictions of epidemic development for the whole country and for Nanjing agree with the real situation of the epidemic, and the predicted numbers of confirmed cases are close to the true values. At the same time, the model's predictions of the effect of prevention and control measures shed new light on epidemic prevention and control.
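A compact sketch of the general recipe: integrate an SIR system and fit beta and gamma by least squares to observed case counts. The population size, noise model, and parameter values below are invented, not those of the paper.

```python
import numpy as np
from scipy.integrate import odeint
from scipy.optimize import least_squares

def sir(y, t, beta, gamma, N):
    S, I, R = y
    dS = -beta * S * I / N
    dI = beta * S * I / N - gamma * I
    return [dS, dI, gamma * I]

N, I0 = 1e6, 10.0
t = np.arange(60.0)
true = odeint(sir, [N - I0, I0, 0.0], t, args=(0.4, 0.1, N))
rng = np.random.default_rng(2)
obs = true[:, 1] * np.exp(0.05 * rng.normal(size=t.size))   # noisy infected counts

def resid(p):   # residuals of the infected trajectory for candidate (beta, gamma)
    sol = odeint(sir, [N - I0, I0, 0.0], t, args=(p[0], p[1], N))
    return sol[:, 1] - obs

fit = least_squares(resid, x0=[0.3, 0.2], bounds=([0, 0], [2, 1]))
print("estimated beta, gamma:", fit.x)   # should recover roughly (0.4, 0.1)
```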

36 pages, 1006 KiB  
Article
A New Generalized t Distribution Based on a Distribution Construction Method
by Ruijie Guan, Xu Zhao, Weihu Cheng and Yaohua Rong
Mathematics 2021, 9(19), 2413; https://0-doi-org.brum.beds.ac.uk/10.3390/math9192413 - 28 Sep 2021
Cited by 2 | Viewed by 1754
Abstract
In this paper, a new generalized t (new Gt) distribution based on a distribution construction approach is proposed and shown to be suitable for fitting data with both high kurtosis and heavy tails. The main innovations of this article consist of four parts. First, the main characteristics and properties of the new distribution are outlined. Second, we derive explicit expressions for the moments of its order statistics as well as the corresponding variance–covariance matrix. Third, we focus on parameter estimation for the new Gt distribution and introduce several estimation methods: a modified method of moments (MMOM), maximum likelihood estimation (MLE) using the EM algorithm, a novel iterative algorithm for obtaining the MLE, and improved probability weighted moments (IPWM). Simulation studies indicate that IPWM estimation generally performs better than MLE via the EM algorithm and the MMOM, and that the newly proposed iterative algorithm outperforms the EM algorithm when the sample kurtosis is greater than 2.7. Finally, for the four parameters of the new Gt distribution, a profile maximum likelihood approach using the EM algorithm is developed to deal with the estimation problem, and it obtains acceptable results.

12 pages, 294 KiB  
Article
A More Accurate Estimation of Semiparametric Logistic Regression
by Xia Zheng, Yaohua Rong, Ling Liu and Weihu Cheng
Mathematics 2021, 9(19), 2376; https://0-doi-org.brum.beds.ac.uk/10.3390/math9192376 - 24 Sep 2021
Cited by 3 | Viewed by 1955
Abstract
Growing interest in genomics research has called for new semiparametric models based on kernel machine regression for modeling health outcomes. Models containing redundant predictors often show unsatisfactory prediction performance, so our task is to construct a method that guarantees estimation accuracy by removing redundant variables. Specifically, based on the regularization method and an innovative class of garrotized kernel functions, we propose a novel penalized kernel machine method for a semiparametric logistic model. Our method promises high prediction accuracy, owing to its capability to flexibly describe the complicated relationship between responses and predictors and to accommodate interactions among the predictors, while also removing redundant variables. Our numerical experiments demonstrate that the method yields higher prediction accuracy than competing approaches.
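As a rough stand-in for a kernel machine logistic model, the sketch below approximates one with RBF kernel features plus a logistic loss in scikit-learn; the paper's garrotized kernels, which learn per-variable weights inside the kernel for variable selection, are not reproduced here, and all data are synthetic.

```python
import numpy as np
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(6)
X = rng.normal(size=(300, 5))                         # predictors (e.g., genomic features)
p = 1 / (1 + np.exp(-(np.sin(X[:, 0]) + X[:, 1] ** 2 - 1)))
y = rng.binomial(1, p)                                # binary health outcome

# RBF kernel features + logistic loss approximate a kernel machine logistic model;
# the garrotized-kernel method additionally learns per-predictor garrote weights.
clf = make_pipeline(Nystroem(kernel="rbf", n_components=100, random_state=0),
                    LogisticRegression(max_iter=1000))
clf.fit(X, y)
print("training accuracy:", clf.score(X, y))
```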