Next Article in Journal
Generalized Refinements of Reversed AM-GM Operator Inequalities for Positive Linear Maps
Next Article in Special Issue
Dynamics Analysis of a Discrete-Time Commensalism Model with Additive Allee for the Host Species
Previous Article in Journal
Optimizing Sugar Manufacturing: A Hybrid Simulation-Based Approach and MCDM for Efficient Decision Making
Previous Article in Special Issue
Mathematical Model of Cyber Risks Management Based on the Expansion of Piecewise Continuous Analytical Approximation Functions of Cyber Attacks in the Fourier Series
 
 
Font Type:
Arial Georgia Verdana
Font Size:
Aa Aa Aa
Line Spacing:
Column Width:
Background:
Article

Modeling Environmental Pollution Using Varying-Coefficients Quantile Regression Models under Log-Symmetric Distributions

by
Luis Sánchez
1,
Germán Ibacache-Pulgar
2,3,
Carolina Marchant
4,* and
Marco Riquelme
2
1
Institute of Statistics, Universidad Austral de Chile, Valdivia 5110234, Chile
2
Institute of Statistics, Universidad de Valparaíso, Valparaíso 2360102, Chile
3
Centro Interdisciplinario de Estudios Atmosféricos y Astroestadística, Universidad de Valparaíso, Valparaíso 2360102, Chile
4
Faculty of Basic Sciences, Universidad Católica del Maule, Talca 3480112, Chile
*
Author to whom correspondence should be addressed.
Submission received: 9 September 2023 / Revised: 12 October 2023 / Accepted: 13 October 2023 / Published: 17 October 2023
(This article belongs to the Special Issue Mathematical Models and Simulations)

Abstract

:
Many phenomena can be described by random variables that follow asymmetrical distributions. In the context of regression, when the response variable Y follows such a distribution, it is preferable to estimate the response variable for predictor values using the conditional median. Quantile regression models can be employed for this purpose. However, traditional models do not incorporate a distributional assumption for the response variable. To introduce a distributional assumption while preserving model flexibility, we propose new varying-coefficients quantile regression models based on the family of log-symmetric distributions. We achieve this by reparametrizing the distribution of the response variable using quantiles. Parameter estimation is performed using a maximum likelihood penalized method, and a back-fitting algorithm is developed. Additionally, we propose diagnostic techniques to identify potentially influential local observations and leverage points. Finally, we apply and illustrate the methodology using real pollution data from Padre Las Casas city, one of the most polluted cities in Latin America and the Caribbean according to the World Air Quality Index Ranking.

1. Introduction

In the process of data modeling, it is common to utilize regression models that assume that the response variable follows a normal distribution, as this is well-established in theory. However, there are situations where using such models may not be appropriate, particularly when the response variable exhibits an asymmetric distribution and is restricted to the positive real line. Failing to account for this behavior can introduce bias in parameter estimates and the estimation of associated measures of variability; see [1]. To address the limitations associated with the assumption of normality, several authors have proposed alternative approaches that employ more flexible distributional assumptions. This allows for a better representation of the underlying data. Some examples of such approaches include the works of [2,3,4,5,6,7].
Vanegas and Paula [8] proposed a family of log-symmetric distributions, which are obtained by transforming symmetric distributed random variables whose probability density functions involve the exponential function. Some examples of log-symmetric distributions are the log-normal, log-power-exponential, log-Laplace, log-logistic, log-slash, log-hyperbolic, Birnbaum–Saunders (BS), and log-Student-t cases. This family of distributions includes special cases that exhibit bimodal behavior, as well as distributions with tails that are either lighter or heavier than the log-normal distribution. Regression models based on log-symmetric distributions have been studied by Vanegas and Paula [1,9,10].
In many real-life phenomena, the focus of interest is often on modeling a specific quantile of the response variable rather than the mean, as commonly done in classical regression models. This is particularly relevant when the distribution of the response variable exhibits asymmetry, where the median becomes a more appropriate measure of central tendency for estimating the response. Another reason is that our interest can be to model the relation between another non-central position measure and the covariates. This happens, for example, when we want to analyze the relationship between the greater (or lower) values of the response variable and the covariates; see [11]. Therefore, quantile regression models are useful for modeling the relationship between a set of predictor variables and specific quantiles of a response variable. Unlike traditional regression models, quantile regression does not assume a specific distribution for the response variable [11,12]. However, if we introduce a distributional assumption, it is possible to formulate quantile regression models based on the reparameterization of the distribution using a quantile. This approach has been successfully applied by [5,6,13]. Quantile regression models based on reparametrized log-symmetric distributions by quantiles (QLS) have been recently developed by Saulo et al. [7], albeit from a purely parametric perspective.
Considering the inclusion of nonparametric functions in the modeling, it becomes possible to incorporate the nonlinear effects of covariates. Semiparametric models have been developed to address this, where linear structures are described by parametric components and nonlinear structures are described by nonparametric components. Therefore, these models offer better flexibility for modeling data than those using only a parametric approach. Semiparametric structures have been effectively utilized to represent nonlinear components, as demonstrated in previous studies such as [1,14,15,16,17,18,19,20,21]. Based on our literature review, it appears that no semiparametric quantile regression models based on log-symmetric distributions have been developed thus far.
For over 30 years, Chile has been grappling with a significant public health issue related to the contamination of respirable particulate matter, particularly during winter periods. In the context of Latin America and the Caribbean, Chile currently ranks second, following Peru, in terms of cities with the highest levels of fine particulate matter (PM2.5), as reported by the World Air Quality Index Ranking (https://bit.ly/3MXVP38; accessed on 20 August 2023). It is concerning to note that these levels often exceed both national and international regulations, highlighting the severity of the problem in terms of public health. Statistical models provide a valuable approach to understanding and describing air quality, enabling us to study the relative impact of atmospheric contaminants on human health and the urban environment. Periodic episodes of extreme air pollution concentrations can occur with certain atmospheric contaminants, varying with geographical and meteorological factors and dependent on changes in emission sources and types; see [22]. Considering this variability, air pollutant concentrations are treated as non-negative random variables. In general, the distribution of these variables is asymmetrical and exhibits positive skewness, aligning with the characteristics of log-symmetric distributions.
The primary aim of this article is to develop varying-coefficients quantile regression models based on the family of log-symmetric distributions. Our secondary objectives encompass the following: (i) to estimate the parameters of the model using the maximum penalized likelihood (MPL) technique and a back-fitting algorithm; (ii) to incorporate the nonparametric structure through natural cubic smoothing splines (iii) to calculate local influence techniques for model diagnostics by assess the normal curvatures under different perturbation scenarios; (iv) to implement the obtained outcomes computationally within the R programming environment; and (v) to apply these results to real data related to atmospheric pollutants in Padre Las Casas, a city in Chile recognized as one of the most contaminated cities in Latin America and the Caribbean, as per the World Air Quality Index Ranking (https://www.iqair.com/; accessed on 20 August 2023).
The remainder of this work is organized as follows. Section 2 presents the proposed model for varying-coefficients quantile regression based on log-symmetric distributions. In Section 3, we explain the parameter estimation procedure utilizing the MPL method and a back-fitting algorithm. Section 4 extends the local influence method to assess the potential impact of specific observations on the proposed model, including the derivation of the generalized leverage matrix. In Section 5, we apply the proposed model to analyze a real dataset, demonstrating its potential applications. Finally, in Section 6, we provide concluding remarks and suggestions for future research.

2. Log-Symmetric Varying-Coefficient Quantile Regression Models

In this section, we introduce the varying-coefficients quantile regression models based on the log-symmetric distribution family.

2.1. Formulation

Let q ( 0 , 1 ) be a fixed number. We will denote by Y QLS ( Q , ϕ , g ) to the log-symmetric distribution reparametrized by the q-quantile of Y (Q), where ϕ > 0 is a power parameter and g ( · ) is the probability density function generator kernel; see [7]. Let Y 1 , Y 2 , , Y n be independent random variables such that Y i QLS ( Q i , ϕ , g ) , for i { 1 , 2 , , n } . We assume the semiparametric additive structure for Q i given by
h ( Q i ) = w i α + x 1 i β 1 ( t 1 i ) + + x s i β s ( t s i ) , i { 1 , 2 , , n } ,
where w i = ( 1 , w 1 i , , w p i ) , α is a ( p + 1 ) × 1 unknown regression coefficients vector, with p + 1 < n , β 1 , , β s are unspecified smooth real functions of the explanatory variable T k that do not depend on α or some other parameter. Also, x j i , w j i and t j i are the values of covariates X j , W j and T j for the ith observation, respectively. The function h has positive support and is at least twice differentiable, called the link function. The structure of the right side in Equation (1) defines the so-called partially varying-coefficients regression models; see [23]. Therefore, we have defined a partially varying-coefficient quantile regression model based on the family of log-symmetric distributions. Equation (1) can be written as
h ( Q i ) = w i α + n ˜ 1 i β 1 + + n ˜ s i β s , i { 1 , 2 , , n } ,
where n ˜ k i denote the ith row of the matrix N ˜ k = X ( k ) N k , X ( k ) = diag { x k 1 , , , x k n } , N k is the incidence matrix n × r k whose ( i , l ) -th element equals to the indicator function I ( t k i = t k l 0 ) , β k = ( ζ k 1 , , ζ k r k ) is a r k × 1 vector called a vector of spline coefficients such that ζ k l = β k ( t k l 0 ) , with t k l 0 for l { 1 , , r k }  representing the distinct and ordered values of the explanatory variable  T k usually called knots. For a similar formulation, see [16].

2.2. Penalized Log-Likelihood Function

The log-likelihood function for the proposed model in Equation (1) is given by
( θ ) = i = 1 n log ξ nc y i + log ( g ( v i 2 ) ) 1 2 log ( ϕ ) ,
where θ = ( α , β 1 , , β s , ϕ ) and v i = ( log ( y i ) log ( Q i ) + ϕ z q ) / ϕ . To address the identifiability issues of the regression coefficient α and mitigate overfitting in the semiparametric modeling process, penalties are commonly incorporated into the smooth functions. The MPL method, initially introduced by Good and Gaskins [24] for estimating probability density curves, has been extended to the nonparametric regression context by researchers such as [25,26]. These extensions have provided effective solutions to handle the challenges of identifiability and overfitting in semiparametric models. This same approach is used to fit our model, optimizing the penalized log-likelihood function expressed as
p ( θ , λ ) = ( θ ) k = 1 s λ k * J ( β k ) ,
where J ( β k ) corresponds to a penalty function on the function β k that regulates the lack of smoothness of the estimated curve. Assuming that the design points t k 0 belong to the compact set [ a k , b k ] and that the functions β k ’s belongs to the Sobolev function space [27]
W [ a k , b k ] = β k : β k and β k are absolutely continuous on [ a k , b k ] , and a k b k [ β k ( t k ) ] 2 d t k < ,
Then one way to measure the roughness of the function β k over the interval [ a k , b k ] is by their squared norm given by J ( β k ) = β k 2 = a k b k [ β k ( t k ) ] 2 d t k . Green and Silverman [15] showed that J ( β k ) = β k K k β k , where K k is a r k × r k non-negative definite matrix. Please note that both β k and K k are evaluated at the values belonging to the set of knots { t k 1 0 , t k 2 0 , , t k r k 0 } , for k { 1 , 2 , , s } , and therefore have finite dimensions. Taking λ k * = λ k / 2 , we can obtain the maximum penalized likelihood estimator (MPLE) of θ , denoted by θ ^ , maximizing
p ( θ , λ ) = ( θ ) k = 1 s λ k 2 β k K k β k ,
where λ = ( λ 1 , , λ s ) denotes an s × 1 vector of smoothing parameters. Each λ k 0 measures the “rate of exchange” between goodness-of-fit and variability of the function β k . In this scenario, the estimators of β k ’s result in a cubic spline that is completely determined by the finite-dimensional set of knots { t k 1 0 , t k 2 0 , , t k r k 0 } .

3. Parameter Estimation and Inference

In this section, we focus on estimating the parameters of the model described in Equation (1) and discuss aspects of statistical inference. We also provide a brief discussion on calculating the effective degrees of freedom and selecting smoothing parameters. To facilitate the parameter estimation process and associated inference, we have developed a routine in the R-project (https://www.r-project.org/; accessed on 15 May 2023).

3.1. Penalized Score Vector

First, we make the assumption that the function p ( θ , λ ) given in Equation (2) is regular, meaning that it has first and second partial derivatives with respect to the elements of the parameter vector θ . By performing partial derivative operations, we can express the score function for θ in matrix form as follows:
U p ( θ ) = p ( θ ) θ = U p α ( θ ) U p β 1 ( θ ) U p β s ( θ ) U p ϕ ( θ ) ,
where U p α ( θ ) = W D a z , U p β k ( θ ) = N ˜ k D a z λ k K k β k , for k { 1 , , s } , and U p ϕ ( θ )  = tr(Db) with Da = diag{a1, …, an}, Db = diag{b1, …, bn}, z = (z1, …, zn ) , zi = vi r(vi)/Qi ϕ , bi = r(vi)ϕ−3/2vi[log(yi) − log(Qi)]/2 − 1/2ϕ and ai = 1/h′(Qi), being r(vi) = −2g′( v i 2 )/g( v i 2 ). Please note that g′ represents the derivative of the function g.

3.2. Penalized Hessian Matrix

To obtain the penalized Hessian matrix, we need to compute the second derivate of p ( θ , λ ) with respect to each element of θ , i.e., 2 p ( θ , λ ) / θ j * θ l * for j * , l * { 1 , , p * } and p * = 2 + p + k = 1 s r k . After performing some algebraic manipulations, we obtain the penalized Hessian matrix in the following form:
L ¨ p ( θ ) = 2 p ( θ , λ ) θ θ = L ¨ p α α L ¨ p α β 1 L ¨ p α β s L ¨ p α ϕ L ¨ p α β 1 L ¨ p β 1 β 1 L ¨ p β 1 β s L ¨ p β 1 ϕ L ¨ p α β s L ¨ p β 1 β s L ¨ p β s β s L ¨ p β s ϕ L ¨ p α ϕ L ¨ p β 1 ϕ L ¨ p β s ϕ L ¨ p ϕ ϕ ,
with L ¨ p α α = W D c W , L ¨ p α β k = W D c N ˜ k , L ¨ p α ϕ = W D a m , L ¨ p β k ϕ = N ˜ k D a m , for k { 1 , , s } , L ¨ p ϕ ϕ = tr ( D d ) , and
L ¨ p β k β k = N ˜ k D c N ˜ k λ k K k , k = k N ˜ k D c N ˜ k , k k
where the matrices D c = diag { c 1 , , c n } , D a = diag { a 1 , , a n } and vector m = ( m 1 , , m n ) , with c i , a i and m i defined in Appendix A. The Hessian matrix presented in this section will be used in the construction of the normal curvature for the local influence method developed in Section 4.

3.3. Penalized Fisher Information Matrix

By taking the expectation of the matrix L ¨ p ( θ ) given in Equation (4), we derive the p * × p * penalized expected information matrix given by
J p ( θ ) = J p α α J p α β 1 J p α β s J p α ϕ J p α β 1 J p β 1 β 1 J p β 1 β s J p β 1 ϕ J p α β s J p β 1 β s J p β s β s J p β s ϕ J p α ϕ J p β 1 ϕ J p β s ϕ J p ϕ ϕ ,
whose elements can be expressed as J p α α = W D v W , J p α β k = W D v N ˜ k , J p α ϕ = W D a s , J p β k ϕ = N ˜ k D a s , for k { 1 , , s } , J p ϕ ϕ = tr ( D u ) , and
J p β k β k = N ˜ k D v N ˜ k + λ k K k , k = k N ˜ k D v N ˜ k , k k ,
where D v = diag { e 1 , , e n } , D u = diag { u 1 , , u n } and s = ( s 1 , , s n ) , being e i = E [ c i ] , s i = E [ m i ] and u i = E [ d i ] , with E [ · ] denoting the expected value operator. This matrix will be utilized to approximate the variance-covariance matrix of θ ^ , as discussed in Section 3.5.

3.4. Iterative Process

The MPLE of θ is obtained by maximizing the penalized log-likelihood function presented in Equation (3). Since the resulting estimation equation U p ( θ ) = 0 is nonlinear, an iterative process is necessary to solve it. In this regard, we propose to employ the Fisher scoring algorithm, which updates θ using the matrix equation
J p ( θ ) θ ( m + 1 ) θ ( m ) = U p ( θ ) ( m ) , m = 0 , 1 , .

3.4.1. ϕ Unknown

After some algebraic operations, we obtain the following expressions for the iterative solutions for the case where ϕ unknown:
α ( m + 1 ) = ( W D v ( m ) W ) 1 W D v ( m ) ψ α ( m ) D v , a ( m ) s Φ ϕ ( m + 1 , m ) k = 1 s N ˜ k Φ β k ( m + 1 , m ) , β ( m + 1 ) = ( N ˜ D v ( m ) N ˜ + λ k K ) 1 N ˜ D v ( m ) [ ψ β ( m ) D v , a ( m ) s Φ ϕ ( m + 1 , m ) W Φ α ( m + 1 , m ) k = 1 , k s N ˜ k Φ β k ( m + 1 , m ) ] , { 1 , , s } and ϕ ( m + 1 ) = tr 1 ( D u ( m ) [ tr ( D b ( m ) ) + tr ( D u ( m ) ) ϕ ( m ) s D a ( m ) W Φ α ( m + 1 , m ) s D a ( m ) k = 1 s N ˜ k Φ β k ( m + 1 , m ) ] ,
where ψ α ( m ) = D v , a ( m ) z ( m ) + W α ( m ) and ψ β ( m ) = D v , a ( m ) z ( m ) + N ˜ β ( m ) , with D v , a ( m ) = D v ( m ) 1 D a ( m ) .

3.4.2. ϕ Known

When ϕ is known, it is possible to obtain simplified expressions for the iterative solutions of α ( m + 1 ) and β ( m + 1 ) . In this case, we have that
α ( m + 1 ) = ( W D v ( m ) W ) 1 W D v ( m ) r v , a ( m ) k = 1 s N ˜ k β k ( m + 1 ) , and β ( m + 1 ) = ( N ˜ D v ( m ) N ˜ + λ k K ) 1 N ˜ D v ( m ) r v , a ( m ) W α ( m + 1 ) k = 1 , k s N ˜ k β k ( m + 1 ) ,
for { 1 , , s } , where r v , a ( m ) = D ( v , a ) ( m ) z ( m ) + η ( m ) , with η ( m ) = W α ( m ) + k = 1 s N ˜ k β k ( m ) . It is possible to prove that these expressions correspond to the weighted back-fitting (Gauss-Seidel) iterations considering r v , a ( m ) as dependent modified variable and D v as a matrix of weights that changes with each iteration of the process; see, for instance [28]. A general expression for these iterations is as follows:
β ( m + 1 ) = S ( m ) r v , a ( m ) k = 0 , k s N ˜ k β k ( m + 1 ) , { 1 , , s } ,
where r v , a ( m ) = D v , a ( m ) z ( m ) + η ( m ) , with η ( m ) = k = 0 s N ˜ k β k ( m ) , N ˜ 0 = W , β 0 = α , S 0 ( m ) = ( N ˜ 0 D v ( m ) N ˜ 0 ) 1 N ˜ 0 D v ( m ) and S k ( m ) = ( N ˜ k D v ( m ) N ˜ k + λ k K k ) 1 N ˜ k D v ( m ) . A discussion about the consistency of the system of Equations (6) and the convergence of the back-fitting algorithm in (7) is given, for example, in [29].

3.5. Approximate Standard Errors

In this work, we propose to approximate the variance-covariance matrix of θ ^ using the inverse of the penalized Fisher information matrix defined in Equation (5). In effect, an estimation of the variance-covariance matrix of θ ^ is given by
Cov ^ ( θ ^ ) J p ( θ ^ ) 1 .
Following [14], we can consider an approximate pointwise standard error band (SEB) for nonparametric functions β k s to evaluate the accuracy of the estimators β ^ k s for different locations within the range of interest. In our case, these approximate pointwise SEBs are provided by
SEB approx β k t l 0 = β ^ k t l 0 ± 2 Var ^ β ^ k t l 0 ,
where Var β ^ k t l is the l -th principal diagonal element of the matrix provided in Equation (8) for l { 1 , 2 , , r k } . Please note that t l 0 corresponds to the knots associated with each variable with a nonparametric contribution to the model.

3.6. Effective Degrees of Freedom and λ k ’s

The calculation of the degrees of freedom associated with the parametric and non-parametric contributions is based on the iterative process used in the parameters estimation of the proposed model. Assuming ϕ fixed, we have from the convergence of the iterative process that
β ^ = ( N ˜ D ^ v N ˜ + λ k K ) 1 N ˜ D ^ v r ^ v , a * , { 1 , , s } ,
where r ^ v , a * = r ^ v , a k = 0 , k s N ˜ k β ^ k , r ^ v , a = D ^ ( a , v ) z ^ + η ^ , η ^ = W α ^ + k = 1 s N ˜ k β ^ k and z ^ = ( z ^ 1 , , z ^ n ) , with z i ( i { 1 , 2 , , n } ) defined in Section 3.1. Note that r v , a * can be interpreted as a modified variable and D v a weight matrix that is updated at each stage of the iterative process. From this, we define the effective degrees of freedom (edf) associated with the smooth functions as (see, for instance [14])
edf ( λ k ) = tr N ˜ ( N ˜ D ^ v N ˜ + λ k K ) 1 N ˜ D ^ v , { 1 , , s } .
Following Ibacache-Pulgar and Reyes [23], we choose the optimal smoothing parameter for each smooth function by specifying an appropriate edf ( λ k ) value. Another way to select the λ k ’s is to consider the Akaike Information Criterion (AIC). The idea is to minimize a function with respect to λ formulated as follows:
AIC ( λ ) = 2 p ( θ ^ , λ ) + 2 ( 2 + p + edf ( λ ) ) ,
where p ( θ ^ , λ ) denotes the penalized log-likelihood function evaluated at θ ^ for a fixed λ and edf ( λ ) = k = 1 s edf ( λ k ) denoting the number of effective parameters involved in the modeling of the smooth functions. A grid for different values of λ and its corresponding AIC ( λ ) are helpful to choose the suitable smoothing parameters. The criteria defined in Equation (9) can also be used to select the best model within the class of varying coefficients quantile regression models based on the log-symmetric family.

4. Diagnostic Analysis

In this section, we extend the local influence method for the model given in Equation (1) and derive the generalized leverage matrix, which allows us to assess the influence of each observed value of the response variable y i on its corresponding predicted value y ^ i .

4.1. Local Influence Analysis

Let ω = ( ω 1 , , ω n ) be an n × 1 vector of perturbations restricted to some open subset Ω R n and p ( θ , λ | ω ) be the logarithm of the perturbed penalized likelihood function. It is assumed that exists ω 0 Ω , a vector of non-perturbation, such that p ( θ , λ | ω 0 ) = p ( θ , λ ) . To assess the influence of small perturbations on the MPL estimate θ ^ , we can consider the displacement of the penalized likelihood, which is given by LD ( ω ) = 2 p ( θ ^ , λ ) p ( θ ^ ω , λ ) , where θ ^ ω is the MPL estimate under p ( θ , λ | ω ) . The measure LD( ω ) is helpful for assessing the distance between θ ^ and θ ^ ω . Cook [30] suggested studying the local behavior of LD( ω ) around ω 0 . The procedure involves selecting a unit direction d Ω with | d | = 1 and plotting LD( ω 0 + a d ) against a R . This plot, known as a lifted line, can be characterized by considering the normal curvature C d ( θ ) around a = 0 . To determine the direction d = d max that corresponds to the largest curvature C d max ( θ ) , one can examine the index plot of d max . This plot helps identify cases that, under small perturbations, may have a significant potential influence on LD( ω ). According to Cook [30], the normal curvature at the unit direction d can be expressed as
C d ( θ ) = 2 d Δ p L ¨ p 1 Δ p d ,
with L ¨ p ( θ ) = 2 p ( θ , λ ) / θ θ and Δ p = 2 p ( θ , λ | ω ) / θ ω evaluated at θ = θ ^ and ω = ω 0 . Δ p is called a penalized perturbation matrix. Observe that C d ( θ ) denotes the local influence on the estimate θ ^ after perturbing the model or data. Escobar and Meeker [31] proposed to study the normal curvature at the direction d = e i , where e i is an n × 1 vector with a one at the i th position and zeros at the remaining positions. Thus, the normal curvature, called the total local influence of the i th case, assumes the form C e i ( θ ) = 2 | c ii | , for i { 1 , , n } , where c ii is the i th principal diagonal element of the matrix C = Δ p L ¨ p 1 Δ p .
Next, we present the perturbed penalized log-likelihood function for four perturbation schemes, namely case weight, response variable, power parameter, and explanatory variable perturbation. The matrix Δ p for each case is presented in Appendix B.
  • The case-weight perturbation scheme considers the perturbed penalized log-likelihood function as
    p ( θ , λ | ω ) = i = 1 n ω i i ( Q i , ϕ ; y i ) k = 1 s λ k 2 β k K k β k ,
    where ω = ( ω 1 , , ω n ) is the vector of weights, with 0 ω i 1 for i { 1 , , n } .
  • Regarding the response variable perturbation scheme, we consider an additive type of perturbation weighted by a scaling factor on the i th response variable, i.e., y i ( ω i ) = y i + ω i s Y i , where s Y i is a scale factor that can be the sample standard deviation of Y i and ω i R , for i { 1 , , n } . Then, the perturbed penalized log-likelihood function is written as
    p ( θ , λ | ω ) = i = 1 n i ( Q i , ϕ ; y i ( ω i ) ) k = 1 s λ k 2 β k K k β k .
  • Initially, the model given in Equation (1) assumes that the power parameter is constant across observations. However, we can introduce a perturbation in the power parameter such that it is not constant between the observations, i.e., Y i QLS Q i , ϕ i , g , where ϕ i = ϕ / ω i , with ω i > 0 for i { 1 , , n } . Under this perturbation scheme, the perturbed penalized log-likelihood function is constructed from the expression defined in Equation (3) with ϕ being replaced by ϕ i .
  • The last perturbation scheme considered in this work consists of incorporating an additive type perturbation on one of the covariates X 1 , , X s , say X l , given by x li ( ω i ) = x li + ω i s x l , where s x l is a scale factor that can be the sample standard deviation of X l and ω i R , for i { 1 , , n } . In this case, the perturbed penalized log-likelihood function can be expressed as
    p ( θ , λ | ω ) = i = 1 n i ( Q i ( ω i ) , ϕ ; y i ) k = 1 s λ k 2 β k K k β k ,
    where Q i ( ω i ) is as given in Equation (1) replacing w li for w li ( ω i ) .

4.2. Generalized Leverage Matrix

The generalized leverage (GL) measures the influence of the observed value of the response variable y i on its corresponding predicted value y ^ i based on the model given in Equation (1). Following the approach proposed by Wei et al. [32], the GL for θ ^ can be computed using the lemma they provided. The expression for the GL is given by y ^ / y = H θ ( L ¨ p ( θ ) ) 1 ¨ θ y | θ = θ ^ , where H θ = μ / θ , L ¨ p ( θ ) = 2 p ( θ ) / θ θ , ¨ θ y = 2 p ( θ ) / θ y , y = ( y 1 , , y n ) and μ = ( μ 1 , , μ n ) , with μ i being the mean of the Y i . Using the chain rule, we have
y ^ y = μ Q Q θ ( L ¨ p ( θ ) ) 1 ¨ θ y | θ = θ ^ .
Because of μ = log ( λ ) and Q = λ exp ( ϕ z q ) , where z q is the q -quantile of the distribution S ( 0 , 1 , g ) [7], we have μ = log ( Q ) ϕ z q . Therefore, μ / Q = diag { 1 / Q 1 , , 1 / Q n } . Also, we can obtain the n × p * matrix
Q θ = Q α , Q β 1 , , Q β s , Q ϕ = D a W D a N ˜ 1 D a N ˜ s 0 n ,
where 0 n is the n × 1 null vector and ¨ θ y = D ψ D a W D ψ D a N ˜ 1 D ψ D a N ˜ s τ is a n × p * matrix. Please note that the computation of the matrix L ¨ p ( θ ) relies on the availability of the penalized Hessian matrix given in Equation (4). By utilizing this penalized Hessian matrix, we have all the necessary elements to calculate the GL matrix y ^ / y .

5. Real Data Analysis

In this section, we apply the model proposed in Section 2 to real pollution data from the Padre Las Casas Air Quality Monitoring Station (AQMS). The AQMS is situated in the commune of Padre Las Casas in the Araucanía region of southern Chile, approximately 695 km away from Santiago, the capital city of Chile. Padre Las Casas has gained notoriety for its elevated levels of pollution, particularly concerning PM2.5. It is recognized as one of the most heavily polluted cities in Latin America and the Caribbean, as indicated by the World Air Quality Index Ranking (https://bit.ly/3MXVP38; accessed on 20 August 2023). The average concentration of PM2.5 in Padre Las Casas exceeds the limits set by national and international regulations [22], highlighting the significance of analyzing this type of data and developing models that accurately capture its behavior.
By studying the pollution data from the Padre Las Casas AQMS, we aim to gain insights into the underlying patterns and factors contributing to pollution levels. The proposed model will help us to describe and understand the behavior of pollution in this area, providing valuable information for monitoring and management purposes.

5.1. Exploratory Data Analysis

The dataset used in this analysis consists of hourly (h) average values for the months of June and July 2020, acquired from the Chilean Ministry of Environment (MMA) website (http://sinca.mma.gob.cl; accessed on 11 January 2022). The dataset includes measurements of various variables related to air pollution and meteorological conditions in Padre Las Casas. The considered random variables in this dataset are: (i) Median of PM2.5 concentrations: this variable represents the median concentration of fine particulate matter with a diameter less than 2.5 micrometers in micrograms per normal cubic meter ( μ g/Nm 3 ). PM2.5 is a commonly monitored pollutant and is known to have detrimental effects on human health; (ii) Median of PM10 concentrations: this variable represents the median concentration of particulate matter with a diameter smaller than 10 micrometers (PM10) in μ g/Nm 3 . PM10 includes both fine and coarse particles and is also considered a significant air pollutant; (iii) Ambient temperature (TEMP): this variable represents the temperature at the monitoring station in degrees Celsius. Temperature is an important meteorological parameter that can influence air quality and pollutant levels; (iv) Wind speed (WIND): this variable represents the speed of wind at the monitoring station in meters per second. Wind speed plays a crucial role in the dispersion and transport of pollutants in the atmosphere; (v) Relative air humidity (HR): this variable represents the percentage of moisture in the air at the AQMS. Humidity can affect atmospheric stability and the formation of certain pollutants. By analyzing these variables, we can gain insights into the relationship between air pollution levels and meteorological conditions in Padre Las Casas during the specified period.
In the exploratory data analysis (EDA) of the median PM2.5 concentrations recorded by the Padre Las Casas AQMS during June–July 2020, Figure 1a shows a histogram with density kernel estimation. This plot provides an overview of the distribution of the data, and permits us to visualize the shape of the empirical distribution. From the histogram, it appears that the distribution of the PM2.5 concentrations has a positive skewness, indicating that most of the observations have lower values with a few extremely high values. Figure 1b presents a boxplot for the median PM2.5 concentrations. From the boxplot, we can see that there are some observations labeled as atypical data (#1, #3, #4, #14, #36, #45) that lie outside the whiskers. These observations deviate from the overall pattern of the data and may represent extreme or unusual values. This suggests that there may be some extreme pollution events or unusual conditions during the observed period. Based on the positive skewness of the empirical distribution and the presence of atypical data points, it is reasonable to consider using log-symmetrical distributions to model the PM2.5 concentrations. Log-symmetrical distributions can better capture the positive skewness and accommodate the potential presence of extreme values in the data.
Table 1 provides descriptive statistics for the median PM2.5 concentrations recorded by the Padre Las Casas AQMS during June-July 2020. These statistics include measures of central tendency (mean, median), dispersion (range, standard deviation –SD–), as well as coefficients of skewness (CS) and kurtosis (CK). The descriptive statistics reveal that the median PM2.5 concentrations have a mean of 43.4 μ g/Nm 3 and a median of 36.0 μ g/Nm 3 . The SD is relatively high, with a value of 26.0 μ g/Nm 3 , indicating substantial variability in the data. The CS is 1.3, indicating a positive skewness and confirming the observation from the histogram in Figure 1a. The positive skewness suggests that most of the observations have lower values, while a few extremely high values contribute to the right tail of the distribution. The CK is 0.8, which indicates a moderately peaked distribution compared to a normal distribution. Furthermore, as mentioned in the text, a significant quantity of levels that surpass the recommended Chilean thresholds for PM2.5, set at 50 μ g/Nm 3 . This suggests that the air pollution level in Padre Las Casas is dangerous from a toxicological perspective, posing potential health risks for the inhabitants of this commune in southern Chile. Overall, the descriptive statistics and Figure 1a,b provide evidence of the high pollution levels and the need for modeling approaches that can adequately capture the characteristics of the PM2.5 concentrations in this region.
Figure 2 shows a correlation matrix for PM2.5, PM10, TEMP, WIND, and HR. From this figure, we detect: (i) a high positive association between PM2.5 and PM10 (Pearson coefficient of correlation equal to 0.99); (ii) medium negative association between PM2.5 and TEMP and WIND (Pearson coefficient of correlation equal to −0.70); (iii) low positive association between PM2.5 and HR (Pearson coefficient of correlation equal to 0.38). In Figure 3, scatter plots depicting the explanatory variables, response variable, and potential interactions among the explanatory variables are presented. In Figure 3a, note that the relationship between PM2.5 and PM10 is linear, while in Figure 3b, the relationship between PM2.5 and WIND is not linear. Furthermore, Figure 3c,d imply that the explanatory variables TEMP and HR may be engaging with the WIND variable in a nonlinear manner.

5.2. Parameter Estimation

Based on the EDA and the observed relationships between the median PM2.5 concentration and the variables as PM10, WIND, TEMP, and HR, we suggest the following varying-coefficients quantile regression models to capture the trends:
Q i = w i α + x 1 i β 1 ( t i ) + x 2 i β 2 ( t i ) , i { 1 , 2 , , 61 }
where y i QLS ( Q i , ϕ , g ) with Student- t and normal PDF generator g , β represents the vector of regression coefficients, while w i = ( 1 , w 1 i ) with w 1 i denoting the values of the parametric covariate for the i th observation (PM10). The coefficients β k ( for k { 1 , 2 } ) correspond to unknown, smooth, and arbitrary functions of the explanatory variable t i (WIND), which are linked to the explanatory variables x 1 i (TEMP) and x 2 i (HR) from the i th case. These varying-coefficients quantile regression models allow for a more flexible and comprehensive characterization of the relationships between the median PM2.5 concentration and the other variables, considering potential variations across quantiles.
Table 2 presents the MPL estimates for the model parameters, their approximate standard errors (SEs), p -values obtained from a z -test, the AIC, selected smoothing parameters, and the degrees of freedom df ( · ) for the models defined by Equation (10). The best values of λ 1 and λ 2 were selected by considering a grid of values and choosing those that yielded a range of df ( λ 1 ) and df ( λ 2 ) within the range of (4, 12), while minimizing the AIC value.
When comparing the results reported in Table 2, we observe that the estimates for α 0 and α 1 show similarity between both models, but the log- t model has smaller estimated standard errors (SEs) for these parameters compared to the log-normal model. Additionally, the estimated value of ϕ in the log- t model is smaller than that in the log-normal model. It is worth noting that based on the (AIC), the log- t model is preferred as it yields a lower AIC value.
To assess the distributional assumption made in the model, we examine the goodness-of-fit plots based on generalized Cox-Snell (GCS) residuals, as shown in Figure 4. Additionally, we provide the p -values associated with the Kolmogorov–Smirnov (KS) test, which are 0.73 for the log-normal model and 0.89 for the log- t ( ν = 4 ) model. Based on the goodness-of-fit plots, the KS test, and the AIC, we can conclude that the log- t ( ν = 4 ) model provides a better fit to the dataset. The log- t model captures the underlying distribution of the data more accurately compared to the log-normal model, as indicated by the higher p -value and better fit observed in the goodness-of-fit plots.
Figure 5 displays the plots of the partial residuals relative to the WIND covariate, with the superimposed estimated smooth functions β 1 (on the left) and β 2 (on the right). The behavior of the partial residuals (dots) in these plots appears reasonable, indicating that the fit of the log- t ( ν = 4 ) varying-coefficients quantile regression model to the pollution dataset is adequate. The dots are closely aligned with the estimated curves, as expected, suggesting that the model captures the relationship between the WIND covariate and the partial residuals effectively. This agreement between the partial residuals and the estimated curves supports the appropriateness of the log- t ( ν = 4 ) varying-coefficients quantile regression model for analyzing the pollution data.

5.3. Diagnostic Analysis

In this section, we investigate the potential influence of individual observations using the local influence method for the selected varying-coefficients quantile regression model. We consider four perturbation schemes: case-weight perturbation, response variable perturbation, power parameter perturbation, and explanatory variable perturbation. Additionally, we examine the GL to assess the influence of each observed value on its own predicted value. These analyses allow us to identify potentially influential cases and understand their impact on the selected model. Details on the local influence method and the perturbation schemes can be found in Section 4.2.
In Figure 6, we present index plots illustrating C i ( θ ) as defined in Section 4.2 for α , β 1 , β 2 and ϕ under the case-weight perturbation (a,b,c,d), under response perturbation (e,f,g,h) and perturbation on the power parameter (i,j,k,l) schemes. Also, Figure 7 showcases the index plots of C i ( θ ) when introducing perturbations in covariates X 1 (a, b, c, d) and X 2 (e, f, g, h). Despite different observations being detected as potentially influential, it is worth noting that there are four cases (#13, #18, #31, and #45) that consistently appear as potentially influential across multiple perturbation schemes. These cases exhibit characteristics that make them stand out and have a notable impact on the model results.
Figure 8 displays the GL plot, which assesses the influence of each observation on its own predicted value. From this plot, we observe that cases #45, #36, #14, #1, #3, #4 are potentially leverage points. These observations have response variable values that can exert a significant influence on their own predicted values. It is worth noting that these cases correspond to the outliers identified by the boxplot in Figure 1b. Their extreme values contribute to their influential nature within the model.
It is interesting to observe that the cases identified as potentially influential in the parametric component may not necessarily be detected in the nonparametric component, and vice versa. This indicates that different aspects of the data and model may be driving their influence in different ways. Additionally, the local influence analysis technique has successfully detected some atypical cases that were previously identified as outliers in Figure 1b. This reinforces the effectiveness of the local influence method in identifying observations that have a considerable impact on the model.
In the sense of evaluating the impact of these observations in the selected model, the subsets of cases {#13}, {#18}, {#31}, {#45}, {#13, #18}, {#13, #31}, {#13, #45}, {#18, #31}, {#18, #45}, {#31, #45}, and {#13, #18, #31}, {#13, #18, #45}, {#18, #31, #45} and {#13, #18, #31, #45} are removed and the model parameters are re-estimated. To determine the variation in the estimates of model parameters, we use the value of the relative changes (RCs) for each parameter. The RCs for each estimated parameter are calculated using the formula: RC θ = | ( θ ^ j θ ^ j ( i ) ) / θ ^ j | × 100 % , where θ ^ j represents the MPLE of θ j , and θ ^ j ( i ) represents the MPLE of θ j after removing the subset i of observations. Here, j = 1 , 2 , 3 with θ 1 = α 0 , θ 2 = α 1 , and θ 3 = ϕ .
Table 3 reports the values of RCs for the varying-coefficients quantile regression model after removing the indicated sets of cases. In this table, the individual elimination of cases #13 and #45 produces a RC in α 0 and α 1 of 5.1%, 4.7% and 5.3%, 5.6%, respectively, while the elimination of case #18 produces an RC in ϕ of 5.5%. In addition, note that set of cases {#13, #18} and {#13, #18, #31} produces the largest RCs in α 0 , α 1 and ϕ .
During the analyzed period, it was observed that observation #45 had particularly high concentrations of PM2.5 and PM10 compared to other observations. On the other hand, observation purple #31 had a very low wind speed, close to the minimum recorded during the entire period. These observations exhibit extreme values in their respective covariates. When the sets of potentially influential cases {#13, #18, #31, #45} are excluded from the analysis, it is observed that their removal results in notable alterations solely in the estimation of ϕ , displaying a percentage change of 21.4%. This suggests that these observations have a notable influence on the estimation of the dispersion parameter ϕ in the model.
Finally, in Table 3, while certain RCs exhibit considerable values, there are no substantial alterations in inference, as evidenced by the diminutive p -values (less than 0.001) associated with the parameter estimates. It is important to note that when observations detected as influential in the diagnostic plots are eliminated, it can lead to significant changes in the parameter estimates. This indicates that the well-known robustness properties of maximum likelihood estimates from Student-t models may not necessarily apply to other perturbation schemes. Therefore, it is crucial to conduct diagnostic examinations specific to each case to properly assess the influence of observations and ensure the reliability of the model estimates.

6. Discussion, Conclusions and Future Research

In this work, we propose new varying-coefficients semiparametric quantile regression models based on the family of log-symmetric distributions, following the approach of [5,6,7]. By reparametrizing the family of log-symmetric distributions using a quantile, we introduce new quantile models that offer greater flexibility in modeling data compared to the model proposed by Saulo et al. [7], as a nonparametric component has been added (Section 2). We develop parameter estimation based on the penalized likelihood function and propose a back-fitting iterative algorithm implemented in the R language (Section 3). Additionally, we discuss diagnostic techniques for detecting potentially influential local observations and identifying leverage points (Section 4). Please note that the local influence analysis reinforces the need for diagnostic evaluation. It has been observed that parameter estimators in this class of models tend to be sensitive to the presence of atypical or influential data points. To the best of our knowledge, techniques for detecting leverage points have not been developed for semiparametric quantile regression models.
We illustrate the methodology developed in this work using data associated with PM2.5 pollution in Padre Las Casas city for predicting the daily median of 1-h average values. We propose and fit two models (log-normal and log- t ( ν = 4 ) ) and evaluate them using CGS residuals and their AIC values. The plots of CGS residuals and partial residuals show a good fit of the selected model (log- t ( ν = 4 ) ) to the data. We also apply our diagnostic techniques to detect potentially influential cases and leverage points (Section 4.2); however, no inferential changes are observed in the parameter estimates.
Thus, among the accomplishments of this work, we can highlight: (i) The development of novel quantile regression models suitable for modeling data following asymmetric distributions, which can be added into the existing toolkit for quantile modeling; (ii) The expansion of our model beyond the one presented in [7], incorporating nonlinear structures arising from interaction effects. (iii) The derivation of analytical tools for identifying potentially influential observations and leverage points.
One limitation of our study is that the proposed models may not be suitable for describing other types of data, such as temporally or spatially correlated data, as well as censored data. In such cases, the utilization of multivariate distributions for the response variable, reparametrized by quantiles of marginal distributions, may be necessary. Another area for future investigation is conducting a simulation study to evaluate the distributional behavior of the residuals used in this study and exploring alternative types of residuals appropriate for this type of regression. This aspect has received limited attention in the existing literature. Furthermore, we aim to establish inferences about the model parameters through asymptotic analysis of specific estimators. Lastly, we intend to compare our model with others, including models proposed by [7,12]. These are additional areas that remain unexplored, and we plan to address these open questions in our future research.

Author Contributions

Conceptualization, L.S. and G.I.-P.; methodology, L.S., G.I.-P. and C.M.; software, L.S. and C.M.; validation, L.S. and C.M.; formal analysis, L.S. and C.M.; investigation, L.S., G.I.-P. and C.M.; resources, C.M. and M.R.; data curation, L.S. and C.M.; writing—original draft preparation, L.S., G.I.-P. and C.M.; writing—review and editing, L.S., G.I.-P., C.M. and M.R.; visualization, L.S. and C.M. All authors have read and agreed to the published version of the manuscript.

Funding

The research was partially funded by FONDECYT, project grant number 11190636 (C.M.) from the National Agency for Research and Development (ANID) of the Chilean government under the Ministry of Science, Technology, Knowledge and Innovation.

Data Availability Statement

Data and computational codes are available upon request from the authors.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Here, we present the quantities c i , m i , and d i , involved in the definition of the Penalized Hessian matrix presented in Section 3.2. In fact, we have
c i = 1 ϕ r ( v i ) Q i 2 ϕ + v i Q i r ( v i ) Q i v i r ( v i ) Q i 2 a i 2 z i a i h ( Q i ) ( h ( Q i ) ) 2 , m i = 1 Q i ϕ 2 2 ( log ( Q i ) log ( y i ) ) r ( v i ) + r ( v i ) ϕ v i ϕ 1 / 2 1 2 v i r ( v i ) ϕ 3 / 2 , and d i = 1 2 [ log ( y i ) log ( Q i ) ] ϕ 3 / 2 [ v i r ( v i ) ϕ 3 2 v i r ( v i ) ϕ 1 1 2 r ( v i ) ( log ( y i ) log ( Q i ) ) ϕ 3 / 2 ] + 1 2 ϕ 2 .
In addition, the expression r ( v i ) / Q i and r ( v i ) / ϕ are, respectively,
r ( v i ) Q i = 4 g ( v i 2 ) g ( v i 2 ) ( g ( v i 2 ) ) 2 ( g ( v i 2 ) ) 2 v i Q i ϕ , and r ( v i ) ϕ = 2 g ( v i 2 ) g ( v i 2 ) ( g ( v i 2 ) ) 2 ( g ( v i 2 ) ) 2 v i ϕ 3 / 2 [ log ( y i ) log ( Q i ) ] .

Appendix B

Here we present the matrix Δ p for four perturbation schemes, namely case weight, response variable, power parameter, and explanatory variable perturbation. In general, this matrix is defined as
Δ p = Δ α Δ β 1 Δ β s Δ ϕ .

Appendix B.1. Case-Weight Perturbation

Here, the elements of the matrix Δ p are given by
Δ α = W D ^ a D ^ z , Δ β k = N ˜ k D ^ a D ^ z , for k { 1 , , s } , Δ ϕ = b ^ ,
with D ^ a , D ^ z and b ^ correspond to D a , D z and b = ( b 1 , , b n ) evaluated at θ = θ ^ and ω 0 = ( 1 , , 1 ) , respectively.

Appendix B.2. Response Variable Perturbation

Under this perturbation schemes, the elements of the matrix Δ p are given by Δ α = W D ^ a D ^ ψ D ^ ϑ , Δ β k = N ˜ k D ^ a D ^ ψ D ^ ϑ , for k { 1 , , s } , Δ ϕ = τ ^ D ϑ , with
  • D ^ ϑ = diag { ϑ ^ 1 , , ϑ ^ n } , D ^ ψ = diag { ψ ^ 1 , , ψ ^ n } , and τ ^ = ( τ ^ 1 , , τ ^ n ) , with
ϑ ^ i = s Y i , ψ ^ i = 1 ϕ ^ Q ^ i y i r ( v i ^ ) + v i ^ r ( v i ^ ) , τ ^ i = ϕ ^ 3 / 2 2 v i r ( v i ^ ) + r ( v i ^ ) log ( y i ) log ( Q ^ i ) y i ϕ ^ + r ( v i ^ ) v i ^ y i , i { 1 , , n } ,
and r ( v i ^ ) = d r ( v i ^ ) / d v i ^ . In this case, v i ^ , Q ^ i and ϕ ^ are evaluated at θ = θ ^ and ω = ( 0 , , 0 ) .

Appendix B.3. Power Parameter Perturbation

Considering the power parameter perturbation, the elements of the matrix Δ p are given by  Δ β k = N ˜ k D ^ a D ^ ϖ , for k { 1 , , s } , Δ ϕ = φ ^ , where D ^ ϖ = diag { ϖ ^ 1 , , ϖ ^ n } and φ = ( φ ^ 1 , , φ ^ n ) , with ϖ i = ϕ ^ m ^ i and φ ^ i = ϕ ^ d ^ i , for i { 1 , , n } . Here, m ^ i and d ^ i correspond to m i and d i evaluated at θ = θ ^ and ω 0 = ( 1 , , 1 ) , respectively.

Appendix B.4. Explanatory Variable Perturbation

In this case, the elements of the matrix Δ p can be expressed as follows:
( i
for l = k ,
Δ α = W D ^ a D ^ z + D ^ a D ^ c D ^ a s X l D ^ N ˜ l f l , Δ β l = N ˜ l D ^ a D ^ z s X l + N ˜ l D ^ a D ^ N ˜ l f l s X l D ^ a D ^ z + D ^ c λ l K l β ^ l , for k { 1 , , s } , Δ ϕ = m ^ D ^ a D ^ N ˜ l f l s X l ;
( ii
for l k ,
Δ α = W D ^ a D ^ z + D ^ a D ^ c D ^ a s X l D ^ N ˜ l · f l , Δ β l = N ˜ 1 D ^ a D ^ N ˜ l β l s X l D ^ a D ^ z + D ^ c λ 1 K 1 β ^ 1 , for k { 1 , , s } , Δ ϕ = m ^ D ^ a D ^ N ˜ l β l s X l .
where D a = diag { a 1 , , a n } , with a i = da i / dQ i , and D N ˜ l β l is the diagonalization of the vector N ˜ l β l . Here, ω 0 = ( 0 , , 0 ) T corresponds to the vector of no perturbation.

References

  1. Vanegas, L.; Paula, G. A semiparametric approach for joint modeling of median and skewness. Test 2015, 24, 110–135. [Google Scholar] [CrossRef]
  2. Arellano-Valle, R.B.; Gómez, H.W.; Quintana, F.A. A New Class of Skew-Normal Distributions. Commun. Stat. Theory Methods 2004, 33, 1465–1480. [Google Scholar] [CrossRef]
  3. Paula, G.A.; Leiva, V.; Barros, M.; Liu, S. Robust statistical modeling using the Birnbaum-Saunders-t distribution applied to insurance. Appl. Stoch. Model. Bus. Ind. 2012, 28, 16–34. [Google Scholar] [CrossRef]
  4. Leiva, V.; Santos-Neto, M.; Cysneiros, F.J.A.; Barros, M. Birnbaum–Saunders statistical modelling: A new approach. Stat. Model. 2014, 14, 21–48. [Google Scholar] [CrossRef]
  5. Sánchez, L.; Leiva, V.; Galea, M.; Saulo, H. Birnbaum-Saunders quantile regression and its diagnostics with application to economic data. Appl. Stoch. Model. Bus. Ind. 2021, 37, 53–73. [Google Scholar] [CrossRef]
  6. Sánchez, L.; Leiva, V.; Marchant, C.; Saulo, H.; Sarabia, J.M. A new quantile regression model and its diagnostic analytics for a Weibull distributed response with applications. Mathematics 2021, 9, 2768. [Google Scholar] [CrossRef]
  7. Saulo, H.; Dasilva, A.; Leiva, V.; Sanchez, L.; de la Fuente-Mella, H. Log-symmetric quantile regression models. Stat. Neerl. 2022, 76, 124–163. [Google Scholar] [CrossRef]
  8. Vanegas, L.; Paula, G. Log-symmetric distributions: Statistical properties and parameter estimation. Braz. J. Probab. Stat. 2016, 30, 196–220. [Google Scholar] [CrossRef]
  9. Vanegas, L.; Paula, G. An extension of log-symmetric regression models: R codes and applications. J. Stat. Simul. Comput. 2016, 86, 1709–1735. [Google Scholar] [CrossRef]
  10. Ventura, M.; Saulo, H.; Leiva, V.; Monsueto, S.E. Log-symmetric regression models: Information criteria and application to movie business and industry data. Appl. Stoch. Model. Bus. Ind. 2019, 35, 963–977. [Google Scholar] [CrossRef]
  11. Hao, L.; Naiman, D.Q. Quantile Regression; Sage Publications: London, UK, 2007. [Google Scholar]
  12. Koenker, R.; Chernozhukov, V.; He, X.; Peng, L. Handbook of Quantile Regression; CRC Press: Boca Raton, FL, USA, 2018. [Google Scholar]
  13. Noufaily, A.; Jones, M.C. Parametric quantile regression based on the generalized gamma distribution. J. R. Stat. Soc. Ser. 2013, 62, 723–740. [Google Scholar] [CrossRef]
  14. Hastie, T.; Tibshirani, R. Generalized Additive Models; Chapman and Hall: New York, NY, USA, 1990. [Google Scholar]
  15. Green, P.J.; Silverman, B.W. Nonparametric Regression and Generalized Linear Models; Chapman and Hall: Boca Raton, FL, USA, 1994. [Google Scholar]
  16. Ibacache-Pulgar, G.; Paula, G.A.; Cysneiros, F.J.A. Semiparametric additive models under symmetric distributions. Test 2013, 22, 103–121. [Google Scholar] [CrossRef]
  17. Ramires, T.; Ortega, E.; Hens, N.; Cordeiro, G.; Paula, G. A flexible semiparametric regression model for bimodal, asymmetric and censored data. J. Appl. Stat. 2018, 45, 1303–1324. [Google Scholar] [CrossRef]
  18. Manghi, R.; Cysneiros, F.J.A.; Paula, G. Generalized additive partial linear models for analyzing correlated data. Comput. Stat. Data Anal. 2019, 129, 47–60. [Google Scholar] [CrossRef]
  19. Oliveira, R.A.; Paula, G.A. Additive models with autoregressive symmetric errors based on penalized regression splines. Comput. Stat. 2021, 36, 2435–2466. [Google Scholar] [CrossRef]
  20. Ferreira, C.; Montoril, M.; Paula, G. Partially linear models with p-order autoregressive skew-normal errors. Braz. J. Probab. Stat. 2022, 36, 792–806. [Google Scholar]
  21. Cardozo, C.A.; Paula, G.A.; Vanegas, L.H. Generalized log-gamma additive partial linear models with P-spline smoothing. Stat. Pap. 2022, 63, 1953–1978. [Google Scholar] [CrossRef]
  22. Cavieres, M.F.; Leiva, V.; Marchant, C.; Rojas, F. A methodology for data-driven decision making in the monitoring of particulate matter environmental contamination in Santiago of Chile. Rev. Environ. Contam. Toxicol. 2020, 250, 5–67. [Google Scholar]
  23. Ibacache-Pulgar, G.; Reyes, S. Local influence for elliptical partially varying coefficient model. Stat. Model. 2018, 18, 149–174. [Google Scholar] [CrossRef]
  24. Good, I.J.; Gaskins, R.A. Nonparametric roughness penalties for probability densities. Biometrika 1971, 58, 255–277. [Google Scholar] [CrossRef]
  25. Silverman, B.W. Some aspects of the spline smoothing approach to non-parametric regression curve fitting. J. R. Stat. 1985, 47, 1–52. [Google Scholar] [CrossRef]
  26. Green, P.J. Penalized likelihood for general semi-parametric regression models. Int. Stat. Rev. 1987, 55, 245–259. [Google Scholar] [CrossRef]
  27. Adams, R.A.; Fournier, J. Sobolev Spaces. In Pure and Applied Mathematics; Academic Press: Boston, MA, USA, 2003. [Google Scholar]
  28. Rigby, R.A.; Stasinopoulos, D.M. Generalized additive models for location, scale and shape. J. R. Stat. Soc. Ser. (Appl. Stat.) 2005, 54, 507–554. [Google Scholar] [CrossRef]
  29. Berhane, K.; Tibshirani, J. Generalized additive models for longitudinal data. Can. J. Stat. 1998, 26, 517–535. [Google Scholar] [CrossRef]
  30. Cook, R.D. Assessment of local influence (with discussion). J. R. Stat. Soc. 1986, 48, 133–169. [Google Scholar]
  31. Escobar, L.; Meeker, W. Assessing influence in regression analysis with censored data. Biometrics 1992, 48, 507–528. [Google Scholar] [CrossRef]
  32. Wei, B.C.; Hu, Y.Q.; Fung, W.K. Generalized leverage and its applications. Scand. J. Stat. 1998, 25, 25–37. [Google Scholar] [CrossRef]
Figure 1. Histogram with density kernel estimation (solid black line) (a) and boxplot (b) for median PM2.5 concentrations recorded by Padre Las Casas AQMS during June–July 2020.
Figure 1. Histogram with density kernel estimation (solid black line) (a) and boxplot (b) for median PM2.5 concentrations recorded by Padre Las Casas AQMS during June–July 2020.
Axioms 12 00976 g001
Figure 2. Correlation matrix displaying the respective Pearson correlation coefficient for the specified explanatory and response variables.
Figure 2. Correlation matrix displaying the respective Pearson correlation coefficient for the specified explanatory and response variables.
Axioms 12 00976 g002
Figure 3. Scatter plots for median PM2.5 vs. PM10 concentrations (a); median PM2.5 vs. WIND (b); median PM2.5 vs. HR*WIND (c); and, median PM2.5 vs. TEMP*WIND (d) recorded by Padre Las Casas AQMS during June–July 2020.
Figure 3. Scatter plots for median PM2.5 vs. PM10 concentrations (a); median PM2.5 vs. WIND (b); median PM2.5 vs. HR*WIND (c); and, median PM2.5 vs. TEMP*WIND (d) recorded by Padre Las Casas AQMS during June–July 2020.
Axioms 12 00976 g003
Figure 4. Goodness-of-fit plots with simulated envelope for GCS residual under the indicated model with the analyzed data set.
Figure 4. Goodness-of-fit plots with simulated envelope for GCS residual under the indicated model with the analyzed data set.
Axioms 12 00976 g004
Figure 5. Plots of partial residuals in relation to the WIND covariate, with the estimated smooth functions β 1 (on the left) and β 2 (on the right) superimposed.
Figure 5. Plots of partial residuals in relation to the WIND covariate, with the estimated smooth functions β 1 (on the left) and β 2 (on the right) superimposed.
Axioms 12 00976 g005
Figure 6. Case weight (ad), response (eh) and on the power parameter (il) perturbation for α , β 1 , β 2 and ϕ .
Figure 6. Case weight (ad), response (eh) and on the power parameter (il) perturbation for α , β 1 , β 2 and ϕ .
Axioms 12 00976 g006
Figure 7. Perturbationin the covariate X 1 (ad) and X 2 (eh) scheme for α , β 1 , β 2 and ϕ .
Figure 7. Perturbationin the covariate X 1 (ad) and X 2 (eh) scheme for α , β 1 , β 2 and ϕ .
Axioms 12 00976 g007
Figure 8. Generalized leverage.
Figure 8. Generalized leverage.
Axioms 12 00976 g008
Table 1. Descriptive statistics for median PM2.5 concentrations recorded by Padre Las Casas AQMS during June–July 2020.
Table 1. Descriptive statistics for median PM2.5 concentrations recorded by Padre Las Casas AQMS during June–July 2020.
VariablenMinMaxRangeMeanMedianSDCSCK
PM2.56115121.5106.543.436.026.01.30.8
Table 2. MPL estimates, SEs, p -values, AIC and selected smoothing parameters and df ( · ) of the indicated model.
Table 2. MPL estimates, SEs, p -values, AIC and selected smoothing parameters and df ( · ) of the indicated model.
ModelParameterEstimateSEp-ValueAIC
Log-normal α 0 3.0722.2 × 10 5 <0.001374.1
α 1 0.0681.1 × 10 3 <0.001
ϕ 0.0134.1 × 10 6
λ 1 4034.3
λ 2 2.2 × 10 5
df( λ 1 )4.001
df( λ 2 )4.466
Log- t ( ν = 4 ) α 0 3.0521.7 × 10 5 <0.001361.3
α 1 0.0708.3 × 10 4 <0.001
ϕ 0.0074.9 × 10 6
λ 1 4034.3
λ 2 5.9 × 10 5
df( λ 1 ) 4.556
df( λ 2 )4.198
Table 3. RC (in %) on the MPL estimate of α j and ϕ and respective p -values (in parenthesis) for varying-coefficients quantile regression model after removing the indicated sets of cases.
Table 3. RC (in %) on the MPL estimate of α j and ϕ and respective p -values (in parenthesis) for varying-coefficients quantile regression model after removing the indicated sets of cases.
ParametersRelative Changes
Removed Case α 0 α 1 ϕ RC α 0 RC α 1 RC ϕ
none3.0520.0690.007
(<0.001)(<0.001)
{#13}3.2130.0660.0075.1%4.7%4.0%
(<0.001)(<0.001)
{#18}2.9610.0720.0063.0%3.4%5.5%
(<0.001)(<0.001)
{#31}3.0950.0690.0071.4%1.1%3.9%
(<0.001)(<0.001)
{#45}2.8910.0730.0065.3%5.6%4.0%
(<0.001)(<0.001)
{#13, #18}3.4150.0650.00611.9%6.7%18.0%
(<0.001)(<0.001)
{#13, #31}3.2230.0660.0065.6%4.7%9.4%
(<0.001)(<0.001)
{#13, #45}3.0930.0690.0061.4%0.4%13.0%
(<0.001)(<0.001)
{#18, #31}3.0110.0710.0061.3%2.2%9.5%
(<0.001)(<0.001)
{#18, #45}2.9010.0730.0064.9%5.4%10.9%
(<0.001)(<0.001)
{#13, #18, #31}3.4880.0640.00514.3%7.8%20.5%
(<0.001)(<0.001)
{#13, #18, #45}3.0050.0710.0061.5%2.8%17.4%
(<0.001)(<0.001)
{#18, #31, #45}2.9600.0720.0063.0%4.0%14.5%
(<0.001)(<0.001)
{#13, #18, #31, #45}3.0460.0710.0050.2%1.9%21.4%
(<0.001)(<0.001)
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Sánchez, L.; Ibacache-Pulgar, G.; Marchant, C.; Riquelme, M. Modeling Environmental Pollution Using Varying-Coefficients Quantile Regression Models under Log-Symmetric Distributions. Axioms 2023, 12, 976. https://0-doi-org.brum.beds.ac.uk/10.3390/axioms12100976

AMA Style

Sánchez L, Ibacache-Pulgar G, Marchant C, Riquelme M. Modeling Environmental Pollution Using Varying-Coefficients Quantile Regression Models under Log-Symmetric Distributions. Axioms. 2023; 12(10):976. https://0-doi-org.brum.beds.ac.uk/10.3390/axioms12100976

Chicago/Turabian Style

Sánchez, Luis, Germán Ibacache-Pulgar, Carolina Marchant, and Marco Riquelme. 2023. "Modeling Environmental Pollution Using Varying-Coefficients Quantile Regression Models under Log-Symmetric Distributions" Axioms 12, no. 10: 976. https://0-doi-org.brum.beds.ac.uk/10.3390/axioms12100976

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Metrics

Back to TopTop