Article
Peer-Review Record

Spectral Clustering of Mixed-Type Data

by Felix Mbuga and Cristina Tortora
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Submission received: 9 October 2021 / Revised: 19 November 2021 / Accepted: 27 November 2021 / Published: 23 December 2021
(This article belongs to the Special Issue Recent Developments in Clustering and Classification Methods)

Round 1

Reviewer 1 Report

The authors propose an extension of spectral clustering to mixed-type data. It is an interesting topic, as real data sets are often mixed, although most clustering methods are designed for continuous-only or categorical-only data.

The paper is well written, but there is room for improvement in making it more self-contained. Moreover, some potential computational issues and some of the choices made for model selection need to be discussed.

General comments

  1. Section 2 must be improved: the intuition for spectral clustering should be provided, together with a description of the steps.
  2. Related work could also be expanded, including other reviews on clustering of mixed data, such as
    • Foss, A. H., Markatou, M., & Ray, B. (2019). Distance metrics and clustering methods for mixed‐type data. International Statistical Review, 87(1), 80-109.

    • van de Velden, M., Iodice D'Enza, A., & Markos, A. (2019). Distance‐based clustering of mixed data. Wiley Interdisciplinary Reviews: Computational Statistics, 11(3), e1456.
  3. The graph Laplacian matrix is n×n and has to be decomposed; would this be a problem for applications involving larger data sets? If not, the authors should explain why.
  4. The letter k is used both for the number of clusters and for the number of eigenvalues and eigenvectors retained from the decomposition: since they are not supposed to be the same thing, a change in notation is needed. Moreover, K is also the weighted dissimilarity matrix, which is even more confusing.
  5. The tuning is based on the ratio of between to within SS in the k-means step: this does not sound like an appropriate metric for non-spherically shaped clusters; e.g., it would work poorly in the examples in Fig. 1.
  6. A grid tuning would be problematic, considering that, in addition to \omega and \sigma, the number of clusters is a tuning parameter itself, as is the number of retained eigenvectors (what you called k). So, unless there is a very small subset of values to consider for each parameter, the computations might become cumbersome.
  7. The data size and dimensionality considered in the simulations are quite small, although all the factors of the simulations are well chosen and described.

Author Response

 

  1. Section 2 must be improved: the intuition for spectral clustering should be provided, together with a description of the steps.

 

Thank you for pointing this out; we now include a short introduction to spectral clustering at the beginning of Section 2 (Background: Spectral Clustering) and a figure to help explain the concept.

 

  2. Related work could also be expanded, including other reviews on clustering of mixed data, such as
    • Foss, A. H., Markatou, M., & Ray, B. (2019). Distance metrics and clustering methods for mixed‐type data. International Statistical Review, 87(1), 80-109.
    • van de Velden, M., Iodice D'Enza, A., & Markos, A. (2019). Distance‐based clustering of mixed data. Wiley Interdisciplinary Reviews: Computational Statistics, 11(3), e1456.

Thank you, we now cite these other reviews.

 

  3. The graph Laplacian matrix is n×n and has to be decomposed; would this be a problem for applications involving larger data sets? If not, the authors should explain why.

 

We now use the RSpectra package for the decomposition of the Laplacian matrix instead of the eigen function; this significantly improves the speed. However, the decomposition of the Laplacian matrix remains the most computationally demanding step. We have added some comments on this in the conclusion.
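For illustration, here is a minimal sketch (not the authors' implementation; the function and variable names are ours) of how RSpectra can replace eigen for this step: it computes only the k leading eigenpairs instead of all n, which is where the speed-up comes from. It assumes a standard normalized-Laplacian formulation, in which the k smallest eigenvalues of L_sym = I - D^{-1/2} W D^{-1/2} correspond to the k largest eigenvalues of D^{-1/2} W D^{-1/2}.

```r
library(RSpectra)

# Sketch of a spectral embedding step using a partial eigendecomposition.
# W: n x n symmetric similarity matrix; k: number of clusters.
spectral_embed <- function(W, k) {
  d <- rowSums(W)                                # node degrees
  d_is <- 1 / sqrt(pmax(d, .Machine$double.eps)) # D^{-1/2}, guarded
  A <- W * tcrossprod(d_is)                      # D^{-1/2} W D^{-1/2}
  e <- eigs_sym(A, k, which = "LA")              # only k eigenpairs, not all n
  U <- e$vectors
  # Row-normalize the embedding before running k-means on it
  U / pmax(sqrt(rowSums(U^2)), .Machine$double.eps)
}
```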

 

  4. The letter k is used both for the number of clusters and for the number of eigenvalues and eigenvectors retained from the decomposition: since they are not supposed to be the same thing, a change in notation is needed. Moreover, K is also the weighted dissimilarity matrix, which is even more confusing.

 

Thank you for this comment; our notation was not clear. We no longer use K; instead, we use J to indicate the weighted dissimilarity matrix, and we clarify that the number of retained eigenvectors is equal to the number of clusters, both denoted by k.

 

  5. The tuning is based on the ratio of between to within SS in the k-means step: this does not sound like an appropriate metric for non-spherically shaped clusters; e.g., it would work poorly in the examples in Fig. 1.

 

Thank you for this comment; our explanation was not clear. We have added a figure that shows the original space and the eigen space, along with the following sentence:

"The tuning is obtained at the $k$-means step, in the space obtained after the eigen-decomposition; in this space clusters are spherical, and therefore a ratio based on the between-to-within sum of squares is appropriate."
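As a hedged illustration (the function name is ours, not from the paper), this criterion can be read directly off R's kmeans output when clustering the spectral embedding U from the earlier sketch, since clusters in the eigen space are approximately spherical:

```r
# Score a candidate tuning: k-means in the eigen space, where the
# between/within sum-of-squares ratio is a fair separation criterion.
tune_score <- function(U, k) {
  km <- kmeans(U, centers = k, nstart = 25)
  km$betweenss / km$tot.withinss  # larger ratio = better-separated clusters
}
```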

 

  6. A grid tuning would be problematic, considering that, in addition to \omega and \sigma, the number of clusters is a tuning parameter itself, as is the number of retained eigenvectors (what you called k). So, unless there is a very small subset of values to consider for each parameter, the computations might become cumbersome.

 

The number of eigenvectors is equal to the number of clusters, so it does not need to be tuned separately; we fix the number of clusters, although it could itself become a tuning parameter.

It is true that \omega and \sigma require a grid tuning that can be time consuming. We now discuss this in the conclusion, including the need to study the effect of using a smaller grid.
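To make the cost of the grid concrete, here is a sketch of such a tuning loop under assumed parameter grids and a generic Gaussian kernel (both are illustrative assumptions on our part, not the paper's exact choices); it reuses spectral_embed and tune_score from the sketches above:

```r
# Grid search over omega (weight between numeric and categorical
# dissimilarities) and sigma (kernel width). Each grid point costs one
# partial eigendecomposition, so total cost is |omegas| x |sigmas| of them.
tune_grid <- function(D_num, D_cat, k,
                      omegas = seq(0.1, 0.9, by = 0.2),
                      sigmas = c(0.5, 1, 2, 4)) {
  best <- list(score = -Inf, omega = NA, sigma = NA)
  for (omega in omegas) {
    D <- omega * D_num + (1 - omega) * D_cat  # weighted mixed dissimilarity
    for (sigma in sigmas) {
      W <- exp(-D^2 / (2 * sigma^2))          # Gaussian similarity kernel
      U <- spectral_embed(W, k)
      s <- tune_score(U, k)
      if (s > best$score) best <- list(score = s, omega = omega, sigma = sigma)
    }
  }
  best
}
```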

 

 

  7. The data size and dimensionality considered in the simulations are quite small, although all the factors of the simulations are well chosen and described.

 

This is true; we have added a discussion of the current speed limitations and possible future developments to the conclusion.

Reviewer 2 Report

In Section 3 you specify the distances to be used for both the continuous and the categorical variables. It might be worth noting that these can always be changed, depending on the application. For example, if some of the categorical variables are ordinal, one might use a non-binary similarity. It doesn't appear that making such application-dependent choices would be a problem for your algorithm.

I would like to see notched boxplots for Tables 1-4, as these will likely be more informative than the tabulation of the medians and give a sense of how significant the differences are.

You might look at Priebe et al., PNAS 116 (13) 5995-6000, "On a 'Two Truths' Phenomenon in Spectral Graph Clustering", 2019. There is a choice to be made at the beginning of any spectral clustering procedure: Laplacian or adjacency? This paper (and others) suggests that the choice is not fixed; that is, sometimes one is best, sometimes the other. Again, it doesn't appear that the choice of the normalized Laplacian is central to your algorithm (although, since you are using the normalized version, this is moot, as the two methods are the same in this case). In any case, it might be worthy of discussion.

Author Response

In Section 3 you specify the distances to be used for both the continuous and the categorical variables. It might be worth noting that these can always be changed, depending on the application. For example, if some of the categorical variables are ordinal, one might use a non-binary similarity. It doesn't appear that making such application-dependent choices would be a problem for your algorithm.

Absolutely true; we added a comment about this in the conclusion.
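As a sketch of the reviewer's point (the data frame below is hypothetical), any dissimilarity that yields an n×n matrix can be plugged in upstream of the kernel. For instance, cluster::daisy with the Gower metric treats ordered factors as ordinal rather than via a binary match/mismatch:

```r
library(cluster)

# Hypothetical mixed data: numeric, nominal, and ordinal variables
df <- data.frame(
  income = c(31, 52, 47, 90),
  region = factor(c("N", "S", "S", "N")),
  grade  = factor(c("low", "mid", "high", "mid"),
                  levels = c("low", "mid", "high"), ordered = TRUE)
)

# Gower scores ordinal levels by rank, so "low" vs "mid" is closer than
# "low" vs "high" -- a non-binary categorical dissimilarity
D <- as.matrix(daisy(df, metric = "gower"))
```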

I would like to see notched boxplots for Tables 1-4, as these will likely be more informative than the tabulation of the medians and give a sense of how significant the differences are.

Thank you for the suggestion; we added the boxplots. However, for reasons of space, we put them in the Supplementary Material.
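For reference, a minimal base-R version of what the reviewer asked for (the results data frame and its column names are hypothetical): the notches give an approximate 95% interval for each median, so non-overlapping notches suggest the medians genuinely differ.

```r
# results: one row per simulation run, with the adjusted Rand index (ARI)
# and the clustering method used -- hypothetical column names
boxplot(ARI ~ method, data = results, notch = TRUE,
        ylab = "Adjusted Rand Index")
```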

You might look at Priebe et al., PNAS 116 (13) 5995-6000, "On a 'Two Truths' Phenomenon in Spectral Graph Clustering", 2019. There is a choice to be made at the beginning of any spectral clustering procedure: Laplacian or adjacency? This paper (and others) suggests that the choice is not fixed; that is, sometimes one is best, sometimes the other. Again, it doesn't appear that the choice of the normalized Laplacian is central to your algorithm (although, since you are using the normalized version, this is moot, as the two methods are the same in this case). In any case, it might be worthy of discussion.

Thank you for the suggestion; it was a very interesting read, and we now cite it.

Round 2

Reviewer 1 Report

In the revised manuscript, the authors have addressed all of the points raised.
