Next Article in Journal
Assessment of Climate Change in Italy by Variants of Ordered Correspondence Analysis
Next Article in Special Issue
Measuring Bayesian Robustness Using Rényi Divergence
Previous Article in Journal
A Consistent Estimator of Nontrivial Stationary Solutions of Dynamic Neural Fields
Previous Article in Special Issue
Improving the Efficiency of Robust Estimators for the Generalized Linear Model
Article

Cumulative Median Estimation for Sufficient Dimension Reduction

School of Mathematics, Cardiff University, Cardiff CF24 4AG, UK
*
Author to whom correspondence should be addressed.
Academic Editor: Marco Riani
Received: 20 January 2021 / Revised: 12 February 2021 / Accepted: 14 February 2021 / Published: 20 February 2021
(This article belongs to the Special Issue Robust Statistics in Action)

Abstract

In this paper, we present the Cumulative Median Estimation (CUMed) algorithm for robust sufficient dimension reduction. Compared with non-robust competitors, this algorithm performs better when there are outliers present in the data and comparably when outliers are not present. This is demonstrated in simulated and real data experiments.
Keywords: L1 median; regression; supervised dimension reduction L1 median; regression; supervised dimension reduction

1. Introduction

Sufficient Dimension Reduction (SDR) is a framework of supervised dimension reduction algorithms that have been proposed mainly for regression and classification settings. The main objective is to reduce the dimension of the p-dimensional predictor vector X without losing information on the conditional distribution of Y | X . In other words, we can state the objective as the effort to estimate a p × d matrix , such that the following conditional independence model holds:
Y X | T X ,
where d < p . The space spanned by the columns of is called the Dimension Reduction Subspace (DRS). There are many ’s that satisfy the conditional independence model above, and therefore, there are many DRSs. The objective is to find the matrix which achieves the minimum d. The space spanned by such is called the Central Dimension Reduction Subspace (CDRS), or simply the Central Space (CS). CS does not always exist, but if it exists, it is unique. Conditions of existence of the CS are mild (see [1]) and we consider that these are met in the rest of this work. For a more comprehensive overview of the SDR literature, the interested reader is referred to [2].
The first approach to the SDR framework was Sliced Inverse Regression (SIR) introduced by [3] and which used inverse means to achieve efficient feature extraction. There are a number of other methods that used the idea of inverse moments like Sliced Average Variance Estimation (SAVE) by [4] which uses inverse variance, Directional Regression (DR) by [5] which uses both the inverse mean and variance, and Sliced Inverse Mean Difference (SIMD) by [6] which uses the inverse mean difference. A common aspect of these methods is the fact that one needs to define the number of slices as a tuning parameter. To avoid this, [7] proposed the Cumulative Mean Estimation (CUME), which uses the first moment and removes the necessity to tune the number of slices.
More recently, a number of methods were introduced for robust sufficient dimension reduction. [8] proposed Robust SIR by using the L1 median to achieve sufficient dimension reduction and [9] proposed the use of Tukey and Oja medians to robustify SIR. Similarly, ref. [10] proposed Sliced Inverse Median Difference (SIMeD), the robust version of SIMD using the L1 median. The main reason for using the L1 median is the fact that it is uniquely defined for p > 2 . On the other hand, Tukey and Oja medians may not be uniquely defined, but they are affine equivariant.
In this paper, we will investigate the use of the L1 median to robustify CUME. The new method is called Cumulative Median estimation (CUMed). As CUMed is the robust version of CUME, it has all the advantages CUME has with the additional advantage that it is robust to the presence of outliers. The rest of the paper is organised as follows. In Section 2 we discuss the previous methods in more detail, and we introduce the new method in Section 3. We present numerical studies in Section 4, and we close with a small discussion.

2. Literature Review

In this section, we discuss some of the existing methods in the SDR framework that mostly relate to our proposal. We discuss Sliced Inverse Regression (SIR), Cumulative Mean Estimation (CUME), and Sliced Inverse Median (SIME). We first introduce some standard notation. Let X be the p dimensional predictor vector, Y be the response variable (which we assume to be univariate without loss of generality), Σ = var ( X ) be the covariance matrix, and Z = Σ 1 / 2 ( X E ( X ) ) be the standardized predictors.

2.1. Sliced Inverse Regression (SIR)

SIR was introduced by [3] and it is considered the first method introduced in the SDR framework. SIR depends on the linearity assumption (or the linear conditional mean assumption), which states that for any R p , the conditional expectation E ( X | T X ) is a linear function of T X . The author proposed to standardize the predictors and then use the inverse mean E ( Z | Y ) . By performing an eigenvalue decomposition of the variance-covariance matrix var ( E ( Z | Y ) ) , the authors find the directions which span the CS, S Y | Z . One can then use the invariance principle [11] to find the direction that spans S Y | X .
A key element in SIR is the use of inverse conditional means, that is, E ( Z | Y ) . To find these in a sample setting, one has to discretize Y, that is, to bin the observations into a number of intervals (denoted with I 1 , , I H , where H is the number of intervals). Therefore, the inverse mean E ( Z | Y ) is, in practice, replaced by using E ( Z | Y I i ) for i = 1 , , H , and thus, it is the eigenvalue decomposition of var ( E ( Z | Y I i ) ) which is used to find the directions which span the CS, S Y | Z .

2.2. Sliced Inverse Median (SIME)

A number of proposals exist in the literature to try to robustify SIR (see, for example, refs. [8,9]). We discuss here the work by [8] which uses the inverse L1 median instead of means to achieve this. The authors preferred the L1 median due to its uniqueness in cases where p 2 . In their work, they defined the inverse L1 median as:
Definition 1.
The inverse L1 median is given by m ˜ = arg min μ R p E ( X μ | Y ) , where · is the Euclidean norm.
Similarly, they denote with m ˜ Z = arg min μ R p E ( Z μ | Y ) . To estimate the CS S Y | Z , they performed an eigenvalue decomposition of the covariance matrix var ( m ˜ Z ) in a similar manner that SIR works with the decomposition of the covariance matrix of the means.

2.3. Cumulative Mean Estimation (CUME)

As we described above, to use SIR one needs to define the number of slices. To avoid this, [7] proposed the use of the Cumulative Mean Estimation (CUME) algorithm, which removes the need of tuning for the number of slices by taking the cumulative mean idea. CUME uses the eigenvalue decomposition of the matrix var ( E ( Z I ( Y y ) ) , where I ( · ) is the indicator function, to find the directions which span the CS, S Y | Z . To achieve this, one needs to increase the value of y and recalculate the mean each time the value of y is large enough for a new observation to be included in the range { Y y } .
Like SIR, SIME needs to tune the number of slices as well. To our knowledge, there is not another proposal in the literature that removes the need to tune for the number of slices while simultaneously being robust to outliers. Therefore, in the next section, we will present our proposal to robustify CUME by using the Cumulative L1 median.

3. Cumulative Median Estimation

In this section, we introduce our new robust method. Like [8,10] we will use the L1 median as the basis for our proposed method due to the fact that it is unique in higher dimensions. To be more accurate, ref. [8] used the inverse L1 median as it was stated in Definition 1, and [10] used the inverse median difference. In this work, we will be using the cumulative L1 median, as it is defined below:
Definition 2.
The cumulative L1 median is given by m ˜ ( y ) = arg min μ R p E ( X μ I ( Y y ) ) , where · is the Euclidean norm and I ( · ) is the indicator function.
Before we prove the basic theorem of this work, let’s give some basic notation. Let ( Y i , X i ) , i = 1 , , n the data pairs, where the response is considered univariate. Let also Z i denote the standardized value of X i . It is standard in the SDR literature to work with the regression of Y i on Z i , and then using the invariance principle (see [11]) to find the CS, S Y | X . In the following theorem, we demonstrate that one can use the cumulative L1 median to estimate the CS.
Theorem 1.
Let K ^ C U M e d be the kernel matrix for our method, as defined in (4). Suppose Z has an elliptical distribution with p 2 . Then, s p a n ( K ^ C U M e d ) S Y | Z .
Proof. 
First of all, note that
E ( Z μ I ( Y y ) ) = E ( E ( Z μ I ( Y y ) | Y ) ) = E ( E ( Z μ | Y ) I ( Y y ) ) .
Now suppose that S Y | Z has an orthonormal basis A = { η 1 , , η d } . Let { η d + 1 , , η p } be the orthonormal basis for S Y | Z , the complement of S Y | Z . Then,
I p = η 1 η 1 T + + η d η d T + η d + 1 η d + 1 T + + η p η p T .
For t R p i = d + 1 , , p consider the linear transformation ϕ i : R p R p such that
ϕ i ( t ) = ( η 1 η 1 T + + η d η d T + η d + 1 η d + 1 T + η i η i T + + η p η p T ) t ,
where j = 1 , , p , η j T ϕ i ( t ) = η j T t for j i and η j T ϕ i ( t ) = η j T t for j = i . Using this notation, Dong et al. (2015) showed that
E [ | | Z μ | | | Y ] = E [ | | Z ϕ ( μ ) | | | Y ] .
Using (2) and (3), one can then show that
E ( Z μ I ( Y y ) ) = E ( Z ϕ ( μ ) I ( Y y ) ) .
From this, ϕ ( m ˜ ( y ) ) is the minimizer of E ( Z μ I ( Y y ) ) . By definition, m ˜ ( y ) is also the minimizer of E ( Z μ I ( Y y ) ) . Therefore, m ˜ ( y ) = ϕ ( m ˜ ( y ) ) , and therefore, η i T m ˜ ( y ) = η i T m ˜ ( y ) which means η i T m ˜ ( y ) = 0 for i = d + 1 , , p . From the construction at the beginning of the proof m ˜ ( y ) S Y | Z = S Y | Z , and therefore s p a n ( K ^ C U M e d ) S Y | Z . □

Estimation Procedure

In this section we will present the algorithm to estimate the CS using the cumulative L1 median.
(1)
Standardise data to find Z ^ = Σ ^ 1 / 2 ( X μ ^ ) , where ( μ ^ ) is the sample mean of X i and ( Σ ^ ) is an estimate of the covariance matrix of X i .
(2)
Sort Y and then for each value of y, find the cumulative L1 median
m ˜ ^ ( y ) = arg min μ R p E ( Z μ I ( Y y ) ) .
(3)
Using the m ˜ ^ ( y ) ’s from the previous step, estimate the candidate matrix by
K ^ C U M e d = y Ω Y p ^ y m ˜ ^ ( y ) m ˜ ^ ( y ) T , ,
where Ω Y is the range of Y, p ^ y the proportion of points used to find m ˜ ^ ( y ) .
(4)
Calculate the eigenvectors β ^ 1 , , β ^ d , which correspond to the largest eigenvalues of K ^ C U M e d , and estimate S Y | X with s p a n ( Σ ^ 1 / 2 β ^ 1 , , Σ ^ 1 / 2 β ^ d ) .
Note that the in the first step, it is more appropriate to use a robust estimator for Σ, and in this work we propose the use of the minimum covariance determinant (MCD) estimator, which is implemented in function covMcd in package robustbase in R (see [12]. Note that other alternative estimators, such as RFCH and RMVN [13,14] or the forward search estimator ([15]) could have been used. The same estimator is used in the last step as well to recover S Y | X .

4. Numerical Results

In this section, we will discuss the numerical performance of the algorithm, both in a simulated as well as a real dataset setting.

4.1. Simulated Datasets

We compare the performance of our algorithms with algorithms based on means like SIR and CUME as well as SIME, which is the robust version of SIR proposed by [8] and, similarly to our case, it uses the L1 median. We ran a number of simulation experiments using the following models:
(1)
Model 1: Y = X 1 + X 2 + σ ϵ
(2)
Model 2: Y = X 1 / [ 0.5 + ( X 2 + 1 ) 2 ] + σ ϵ
(3)
Model 3: Y = ( 1 + ( σ / 2 ) ϵ ) X 1
(4)
Model 4: Y = e x p ( X 1 + σ ϵ )
(5)
Model 5: Y = s i n ( X 1 + σ ϵ )
for p = 5 , 10 or 20 and ϵ N ( 0 , 1 ) . Note that σ = 0.2 . For SIR and SIME we used 10 slices. Note also that the predictors X = ( X 1 , , X p ) come either from multivariate standard normal, which produces no outliers, or from a multivariate standard Cauchy, which produces outliers. We denote them in the result Tables as follows:
  • (Outl a): X is from multivariate standard normal.
  • (Outl b): X is from multivariate standard Cauchy.
We ran 100 simulations and report the average trace correlation and its standard errors to compare performance. Trace correlation is calculated as follows:
r = trace { P B P B ^ } d ,
where P B is the projection matrix on the true subspace and P B ^ the projection matrix on the estimated subspace. It takes values between 0 and 1 and the closer to 1, the better the estimation.
The results in Table 1 demonstrate the advantages of our algorithm. Our method CUMed performs well when there are no outliers (comparable performance with CUME) and it is better than CUME in the case where there are outliers. Comparing to SIME (the robust version of SIR), our method performs better with the exception of model 2, where SIME performs slightly better. We run a more extensive simulation and we have seen that when d > 1 , SIME tends to work better than CUMed.
On the second experiment, we describe the importance of using the L1 median by running a comparison of CUMed when the L1 median is used with CUMed when the Oja median is used. Here, we emphasize that the Oja median is not unique, and it is also computationally more expensive to be calculated. We run the experiment for n = 200 and p = 10 under both scenarios where there are outliers and where there are not outliers. As we can see from Table 2, the L1 median behaves better for all models and both scenarios.

4.2. Real Data—Concrete Data

In this section, we compare the SIR, SIME, and CUME methods against our CUMed method on a real data set. We use the Concrete strength dataset (see [16]), which has eight predictors (Cement, Blast Furnace Slag, Fly Ash, Water, Superplasticizer, Coarse Aggregate, Fine Aggregate, Age) and a response variable (Concrete Strength). It has 1030 observations. To highlight our method’s ability to estimate the central subspace in the presence of outliers, we ran 100 experiments where we selected 30 data points randomly and multiplied them by 10 (therefore creating about 3 % outliers in the dataset). We then computed the trace correlation distance between the original projection matrix and the estimate projection matrix containing the outliers. As it is shown in Figure 1, if we compare the median trace correlations, both robust methods, SIME and CUMed, tend to have much less distance from the original projection (the one before introducing outliers) than SIR and CUME. CUMed is, overall, the best method, as it seems to have most distances above 0.8 and only in a few cases falls below that. On the other hand, for SIME, we can see from the boxplot that Q1 is at 0.35 (as opposed to CUMed, which has Q1 at 0.9).

5. Discussion

In this work, we discussed a robust SDR methodology which uses the L1 median to expand the scope of a previously proposed algorithm in the SDR literature, CUME, which uses the first moments. The new method utilizes all the advantages of CUME, by not having to tune for the number of slices as earlier SDR methods, like SIR, had to. In addition, it is robust to the presence of outliers, which CUME is not, as we demonstrated in our numerical experiments.
There have been a number of papers introduced recently in the SDR literature to robustify against outliers. We believe that this work adds to the existing literature and opens new roads to further discuss robustness within the SDR framework. In addition to the ones already discussed, which are based on inverse-moments, there is literature in the Support Vector Machine (SVM) based on SDR literature—see, for example, ref. [17], who used adaptively weighted schemes for SVM-based robust estimation, as well as the literature in iterative methods, like the ones proposed in [18,19].

Author Contributions

Conceptualization, A.A.; methodology, A.A. and S.B.; software, S.B.; validation, A.A. and S.B.; writing—original draft preparation, S.B.; writing—review and editing, A.A.; visualization, S.B. and A.A.; supervision, A.A.; project administration, A.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The study used publicly available data from the UC Irvine Machine Learning Repository, https://archive.ics.uci.edu/ml/index.php, accessed on 18 February 2021.

Acknowledgments

The authors would like to thank the Editors and three reviewers for their useful comments on the manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Yin, X.; Li, B.; Cook, R.D. Successive direction extraction for estimating the central subspace in a multiple-index regression. J. Multivar. Anal. 2008, 99, 1733–1757. [Google Scholar] [CrossRef]
  2. Li, B. Sufficient Dimension Reduction. Methods and Applications with R; Chapman and Hall/CRC: New York, NY, USA, 2018. [Google Scholar]
  3. Li, K.-C. Sliced inverse regression for dimension reduction (with discussion). J. Am. Stat. Assoc. 1991, 86, 316–342. [Google Scholar] [CrossRef]
  4. Cook, R.D.; Weisberg, S. Discussion of ‘’Sliced inverse regression for dimension reduction”. J. Am. Stat. Assoc. 1991, 86, 316–342. [Google Scholar] [CrossRef]
  5. Li, B.; Wang, S. On directional regression for dimension reduction. J. Am. Stat. Assoc. 2007, 102, 997–1008. [Google Scholar] [CrossRef]
  6. Artemiou, A.; Tian, L. Using slice inverse mean difference for sufficient dimension reduction. Stat. Probab. Lett. 2015, 106, 184–190. [Google Scholar] [CrossRef]
  7. Zhu, L.P.; Zhu, L.X.; Feng, Z.H. Dimension Reduction in Regression through Cumulative Slicing Estimation. J. Am. Stat. Assoc. 2010, 105, 1455–1466. [Google Scholar] [CrossRef]
  8. Dong, Y.; Yu, Z.; Zhu, L. Robust inverse regression for dimension reduction. J. Multivar. Anal. 2015, 134, 71–81. [Google Scholar] [CrossRef]
  9. Christou, E. Robust Dimension Reduction using Sliced Inverse Median Regression. Stat. Pap. 2018, 61, 1799–1818. [Google Scholar] [CrossRef]
  10. Babos, S.; Artemiou, A. Sliced inverse median difference regression. Stat. Methods Appl. 2020, 29, 937–954. [Google Scholar] [CrossRef]
  11. Cook, R.D. Regression Graphics: Ideas for Studying Regressions through Graphics; Wiley: New York, NY, USA, 1998. [Google Scholar]
  12. Maechler, M.; Rousseeuw, P.; Croux, C.; Todorov, V.; Ruckstuhl, A.; Salibian-Barrera, M.; Verbeke, T.; Koller, M.; Conceicao, E.L.T.; di Palma, M.A. robustbase: Basic Robust Statistics R Package Version 0.93-3. 2018. Available online: http://cran.r-project.org/package=robustbase (accessed on 10 February 2021).
  13. Olive, D.J. Robust Multivariate Statistics; Springer: Berlin/Heidelberg, Germany, 2017. [Google Scholar]
  14. Olive, D.J. Robust Statistics. 2020. Available online: http://parker.ad.siu.edu/Olive/robbook.htm (accessed on 10 February 2021).
  15. Cerioli, A.; Farcomeni, A.; Riani, M. Strong consistency and robustness of the Forward Search estimator of multivariate location and scatter. J. Multivar. Anal. 2014, 126, 167–183. [Google Scholar] [CrossRef]
  16. Yeh, I.-C. Modeling of strength of high performance concrete using artificial neural networks. Cem. Concr. Res. 1998, 28, 1797–1808. [Google Scholar] [CrossRef]
  17. Artemiou, A. Using adaptively weighted large margin classifiers for robust sufficient dimension reduction. Statistics 2019, 53, 1037–1051. [Google Scholar] [CrossRef]
  18. Wang, H.; Xia, Y. Sliced regression for dimension reduction. J. Am. Stat. Assoc. 2008, 103, 811–821. [Google Scholar] [CrossRef]
  19. Kong, E.; Xia, Y. An adaptive composite quantile approach to dimension reduction. Ann. Stat. 2014, 42, 1657–1688. [Google Scholar] [CrossRef]
Figure 1. Performance of 100 repetitions of the experiment of the introduction of outliers in the concrete strength dataset.
Figure 1. Performance of 100 repetitions of the experiment of the introduction of outliers in the concrete strength dataset.
Stats 04 00011 g001
Table 1. Mean and standard errors (in parentheses) of trace correlation (4) for 100 replications for all the models for different values of p. We use n = 200 and the number of slices H = 10 in SIR and SIME.
Table 1. Mean and standard errors (in parentheses) of trace correlation (4) for 100 replications for all the models for different values of p. We use n = 200 and the number of slices H = 10 in SIR and SIME.
Method
ModelpoutlSIRSIMECUMECUMed
a0.997 (0.002)0.998 (0.002)0.989 (0.010)0.976 (0.018)
5b0.647 (0.250)0.436 (0.331)0.479 (0.328)0.600 (0.173)
1 a0.993 (0.004)0.996 (0.002)0.978 (0.013)0.949 (0.034)
10b0.520 (0.227)0.279 (0.356)0.217 (0.247)0.572 (0.185)
a0.985 (0.006)0.992 (0.003)0.943 (0.025)0.891 (0.043)
20b0.437 (0.249)0.080 (0.152)0.082 (0.146)0.515 (0.203)
a0.962 (0.037)0.981 (0.014)0.961 (0.034)0.803 (0.151)
5b0.543 (0.204)0.650 (0.183)0.493 (0.247)0.572 (0.139)
2 a0.911 (0.039)0.960 (0.020)0.902 (0.043)0.638 (0.130)
10b0.329 (0.213)0.555 (0.133)0.283 (0.229)0.483 (0.079)
a0.828 (0.045)0.910 (0.034)0.790 (0.063)0.490 (0.056)
20b0.282 (0.183)0.456 (0.110)0.124 (0.157)0.399 (0.100)
a0.998 (0.002)0.997 (0.002)0.988 (0.009)0.975 (0.019)
5b0.603 (0.397)0.478 (0.420)0.405 (0.377)0.957 (0.076)
3 a0.994 (0.003)0.996 (0.002)0.975 (0.018)0.951 (0.029)
10b0.563 (0.375)0.229 (0.361)0.147 (0.251)0.835 (0.241)
a0.986 (0.006)0.993 (0.003)0.946 (0.027)0.886 (0.044)
20b0.362 (0.354)0.057 (0.162)0.059 (0.141)0.774 (0.211)
a0.906 (0.070)0.949 (0.030)0.888 (0.088)0.855 (0.104)
5b0.669 (0.342)0.488 (0.436)0.357 (0.349)0.921 (0.174)
4 a0.820 (0.090)0.880 (0.054)0.780 (0.098)0.691 (0.135)
10b0.539 (0.385)0.181 (0.298)0.158 (0.248)0.824 (0.265)
a0.630 (0.133)0.791 (0.085)0.595 (0.127)0.491 (0.158)
20b0.417 (0.368)0.031 (0.097)0.052 (0.129)0.751 (0.238)
a0.935 (0.048)0.967 (0.019)0.918 (0.059)0.869 (0.095)
5b0.176 (0.250)0.166 (0.304)0.259 (0.349)0.318 (0.337)
5 a0.871 (0.076)0.929 (0.034)0.844 (0.078)0.741 (0.123)
10b0.136 (0.238)0.060 (0.151)0.111 (0.236)0.172 (0.240)
a0.743 (0.101)0.871 (0.036)0.700 (0.100)0.542 (0.136)
20b0.035 (0.074)0.023 (0.064)0.038 (0.117)0.066 (0.155)
Table 2. Mean and standard errors (in parentheses) of trace correlation (4) for 100 replications for the performance of CUMed using L1 and Oja medians for all five models with and without outliers. n = 200 and p = 10 .
Table 2. Mean and standard errors (in parentheses) of trace correlation (4) for 100 replications for the performance of CUMed using L1 and Oja medians for all five models with and without outliers. n = 200 and p = 10 .
Median
ModeloutlL1 MedianOja
1a0.955 (0.023)0.947 (0.044)
b0.598 (0.242)0.133 (0.194)
2a0.636 (0.128)0.586 (0.117)
b0.484 (0.063)0.212 (0.208)
3a0.946 (0.024)0.904 (0.099)
b0.835 (0.283)0.190 (0.354)
4a0.684 (0.169)0.507 (0.249)
b0.851 (0.199)0.249 (0.107)
5a0.754 (0.128)0.583 (0.285)
b0.122 (0.224)0.108 (0.273)
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Back to TopTop