Peer-Review Record

Exploring the Utility of Anonymized EHR Datasets in Machine Learning Experiments in the Context of the MODELHealth Project

by Stavros Pitoglou 1,2,*, Arianna Filntisi 2, Athanasios Anastasiou 1, George K. Matsopoulos 1,* and Dimitrios Koutsouris 1
Reviewer 1: Anonymous
Reviewer 2:
Reviewer 3: Anonymous
Submission received: 18 April 2022 / Revised: 31 May 2022 / Accepted: 8 June 2022 / Published: 10 June 2022
(This article belongs to the Special Issue Advances in Biomedical Signal Processing in Health Care)

Round 1

Reviewer 1 Report

Review of “Exploring anonymized EHR datasets’ utility on machine learning experiments in the context of the MODELHealth project” by Pitoglou et al.

This is timely research on utilizing machine learning for the analysis of anonymized EHR datasets. Since data anonymization is standard nowadays, the comparisons and conclusions of this research could be of interest to many. I would recommend this study for publication. The following are a few comments that may help improve it:

  1. The authors mentioned that a possible way to diminish the effects of data anonymization is to change the machine learning parameters and/or classifiers. Does this require a non-anonymized dataset to train the machine learning models?
  2. This study is based on an EHR dataset of 117,181 records. It would be better if the authors could apply their findings to another dataset, even a mock one, to confirm that these conclusions are universally applicable.

Author Response

We are thankful for the reviewer's feedback and comments. Please find below our point-by-point responses:

  1. Comment: The authors mentioned that a possible way to diminish the effects of data anonymization is to change the machine learning parameters and/or classifiers. Does this require a non-anonymized dataset to train the machine learning models? Response: Indeed, a non-anonymized dataset has been included as an input to the five tested machine learning models; it is represented by the parameter values k=1 and qi=0, which indicate a lack of anonymization. The classification results of the models on the non-anonymized dataset can be seen (see also the sketch after this list):
    •    in Table A2, for k=1, qi=0, with the corresponding information loss values being GIL=0, DM=0, C_AVG=0;
    •    in Figure 1, in the first subplot of each row, corresponding to k=1, qi=0.
  2. Comment: This study is based on an EHR dataset of 117,181 records. It would be better if the authors could apply their findings to another dataset, even a mock one, to confirm that these conclusions are universally applicable. Response: The reviewer makes a fair point; however, due to time constraints, we have not been able to include an additional dataset in our experiments.
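For concreteness, the k=1, qi=0 baseline can be illustrated with a minimal k-anonymity check. This is a sketch only, assuming a pandas DataFrame df and a list of quasi-identifier columns qi_cols (both hypothetical names), not code from the MODELHealth pipeline:

import pandas as pd

def satisfies_k_anonymity(df: pd.DataFrame, qi_cols: list, k: int) -> bool:
    # With an empty quasi-identifier set (qi = 0) or k = 1, k-anonymity holds
    # trivially, i.e. the dataset is effectively non-anonymized.
    if not qi_cols or k <= 1:
        return True
    # Otherwise, every equivalence class (records sharing the same
    # quasi-identifier values) must contain at least k records.
    return df.groupby(qi_cols).size().min() >= k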

Reviewer 2 Report

Summary:

The increased adoption of machine learning in the clinical field has raised significant concerns about patient privacy, which may be inadvertently compromised through medical datasets. Anonymization encompasses a range of procedures that protect patient privacy by reducing the linkage of patient attributes to specific individuals; however, the impact of anonymization techniques on downstream machine learning applications remains of significant interest. The authors of this article examine how the parameter choice in the Mondrian algorithm affects classification model performance. They found that each classification model is variably sensitive to the effects of anonymization.

 

Comments:

  1. While it explores a highly relevant and important subject, the study, as a whole, comes off as too descriptive and reveals few novel insights into the relationship between data anonymization practices and machine learning performance. I was unable to connect the observations made in this study to how one should practice machine learning on anonymized datasets more broadly.
  2. How does the subject discernibility index relate to the general information loss, and thereby to classification model performance?
  3. While specific models perform worse under several instances of anonymization, is this due to the default values of the model hyperparameters? Anonymization changes the Euclidean distances between samples, so it is not entirely surprising that linear classifiers such as SVC or NB perform substantially worse on some of the anonymized datasets while non-linear classifiers such as KNN and decision trees are more robust. In practice, one would perform k-fold cross-validation to identify optimal sets of model hyperparameters - did the authors attempt this?
  4. The authors did not acknowledge/cite several studies published on this subject with similar goals: Djordje Slijepčević et al. (2021), Benjamin Fung et al. (2007), Jiuyong Li et al. (2011), Mark Last et al. (2014). The authors should highlight how their study yields novel insights in light of these previous publications.
  5. The results lack statistical analysis. Are the values shown in Figures 1 and 2 based on averages of multiple experiments, or are they from a single experiment? K-fold cross-validation should be employed to explore how different instantiations of the data subset and anonymization might affect model performance. A confidence interval for each experimental condition needs to be computed so that statistical significance can be assessed.
  6. Separate graphs should be plotted for the choice of QI set and k parameters (separate out the conditions into two sets). Anonymization based on a wider range of k values and smaller k intervals can be computed. A linear model should be constructed to quantify the relationship between GIL and model performance.
  7. Since we do not know the composition of the dataset, the authors should check whether the dataset classes are well balanced. If they are poorly balanced, the F1 score should be computed and shown in place of AUC.

Author Response

We are thankful for the reviewer's feedback and comments. Please find below our point-by-point responses:

1.    Comment: While it explores a highly relevant and important subject, the study, as a whole, comes off as too descriptive and reveals few novel insights into the relationship between data anonymization practices and machine learning performance. I was unable to connect the observations made in this study to how one should practice machine learning on anonymized datasets more broadly. Response: The purpose of this study is not to provide actionable pathways toward practicing machine learning on an anonymized dataset, but rather to add to the growing body of evidence suggesting that it can be performed successfully, as well as to report on the insights that arose during the experimental process. The observation that the study is mostly descriptive is accurate, and it reflects its nature as the product of experiments conducted during the implementation of a broader project. However, we strongly believe that, regardless of the level of theoretical novelty, evidence derived from real-world data and implementations is warranted for answering important yet inconclusive questions, such as the feasibility of performant machine learning on anonymized datasets, which bears high ethical, technical and methodological significance.

2.    Comment: How does the subject discernibility index relate to the general information loss, and thereby to classification model performance? Response: According to [1], the discernibility metric measures how indistinguishable a record is from others, assigning each record a penalty equal to the size of the equivalence class (EQ) to which it belongs. The idea behind this metric is that larger EQs represent more information loss; thus, lower values of this metric are desirable. Therefore, even though the discernibility metric is not synonymous with information loss, it can be seen as a complementary metric that provides a more comprehensive view of the information loss that occurs.
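As a minimal sketch of the metric just described (assuming a pandas DataFrame df and quasi-identifier columns qi_cols, both hypothetical names): since each record is penalized by the size |E| of its equivalence class E, the total is DM = Σ |E|² over all classes.

import pandas as pd

def discernibility_metric(df: pd.DataFrame, qi_cols: list) -> int:
    # Equivalence-class sizes: counts of records sharing the same
    # quasi-identifier values.
    sizes = df.groupby(qi_cols).size()
    # Each of the |E| records in a class carries a penalty of |E|,
    # so each class contributes |E| squared to the total.
    return int((sizes ** 2).sum())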

3.    Comment: While specific models perform worse under several instances of anonymization, is this due to the default values of the model hyperparameters? Anonymization changes the Euclidean distances between samples, so it is not entirely surprising that linear classifiers such as SVC or NB perform substantially worse on some of the anonymized datasets while non-linear classifiers such as KNN and decision trees are more robust. In practice, one would perform k-fold cross-validation to identify optimal sets of model hyperparameters - did the authors attempt this? Response: The reviewer makes a fair point; however, the goal of this paper was to assess the resilience of the various machine learning models and to highlight their weaknesses or strengths in terms of information loss.
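For illustration only, the cross-validated tuning the reviewer describes could be sketched as follows with scikit-learn; the estimator, the grid values and the names X_anon and y are assumptions for the sketch, not the configuration used in the paper.

from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

# Illustrative grid; realistic choices depend on the model and dataset.
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
search = GridSearchCV(
    SVC(),
    param_grid,
    scoring="roc_auc",
    cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=0),
)
# search.fit(X_anon, y)   # X_anon: an anonymized feature matrix, y: labels
# search.best_params_     # the hyperparameters tuned for that dataset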


4.    Comment: The authors did not acknowledge/cite several studies published on this subject with similar goals: Djordje Slijepčević et al. (2021), Benjamin Fung et al. (2007), Jiuyong Li et al. (2011), Mark Last et al. (2014). The authors should highlight how their study yields novel insights in light of these previous publications. Response: The indicated studies have been included in the Introduction and the Discussion of the revised paper.

5.    Comment: The results lack statistical analysis. Are the values shown in Figures 1 and 2 based on averages of multiple experiments, or are they from a single experiment? K-fold cross-validation should be employed to explore how different instantiations of the data subset and anonymization might affect model performance. A confidence interval for each experimental condition needs to be computed so that statistical significance can be assessed. Response: An additional statistical analysis would perhaps enhance the experimental section of the paper. However, the Mondrian anonymization utilized in the context of this paper is a deterministic procedure. In addition, the classification results presented emerged after 10-fold cross-validation, a tactic applied in order to diminish data-based bias in the application of machine learning. Finally, statistical analysis has not been a part of similarly themed published papers [2].

6.    Comment: Separate graphs should be plotted for the choice of QI set and k parameters (separate out the conditions into two sets). Anonymization based on a wider range of k values and smaller k intervals can be computed. A linear model should be constructed to quantify the relationship between GIL and model performance.

Response

•    Figures 1 and 2 have been modified to present the performance of the models plotted against the anonymization parameters k and qi separately. More specifically, Figure 1(a) depicts the AUC and MCC results of the models on the test set plotted against k, separately for each value of the qi parameter, while Figure 1(b) depicts AUC and MCC plotted against qi, separately for each value of the k parameter. Figure 2 follows the same pattern, with (a) depicting AUC, MCC and GIL plotted against k, and (b) depicting the same metrics plotted against qi.
•    It is our view that the selected range of k values is sufficient, given the small size of the tested quasi-identifier sets.
•    A number of linear regression models have been created to represent the relationship between AUC and GIL, as well as the relationship between MCC and GIL. Figure 3(a) depicts the linear models fitted for AUC as a function of GIL plotted against the test data for each applied machine learning model, while Figure 3(b) depicts the corresponding linear models representing MCC as a function of GIL; a minimal sketch of this per-model fitting approach is given after this list. The equations representing the linear models are presented in Table A3. We also considered creating a single linear regression model for the relationship between a performance metric and GIL across all five machine learning models; however, such a model would not carry any useful information, since the AUC and MCC values depend strongly on the machine learning model examined.
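A minimal sketch of one such per-model fit; the arrays gil and auc are hypothetical stand-ins for the matched GIL / test-AUC measurements of a single classifier, with purely illustrative values.

import numpy as np

gil = np.array([0.00, 0.10, 0.20, 0.30])   # illustrative GIL values
auc = np.array([0.90, 0.88, 0.85, 0.83])   # illustrative matched test AUC
# Least-squares line AUC ~ slope * GIL + intercept, fitted once per model.
slope, intercept = np.polyfit(gil, auc, deg=1)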

7.    Comment: Since we do not know the composition of the dataset, the authors should check whether the dataset classes are well balanced. If they are poorly balanced, the F1 score should be computed and shown in place of AUC. Response: The MCC score was selected as a performance metric since it has been proven to be a reliable statistical rate that produces a high score only if the prediction obtained good results in all four confusion matrix categories (true positives, false negatives, true negatives, and false positives), proportionally to both the number of positive elements and the number of negative elements in the dataset [3].
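For reference, the MCC cited from [3] is computed from the confusion matrix counts as

\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)\,(TP+FN)\,(TN+FP)\,(TN+FN)}}

so it remains near zero unless all four categories are predicted well, proportionally to the sizes of both classes, which makes it robust to class imbalance.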

References
1.     Ayala-Rivera, V.; McDonagh, P.; Cerqueus, T.; Murphy, L. A Systematic Comparison and Evaluation of K-Anonymization Algorithms for Practitioners. Trans. Data Priv. 2014, 7, 337–370.
2.     Slijepčević, D.; Henzl, M.; Daniel Klausner, L.; Dam, T.; Kieseberg, P.; Zeppelzauer, M. K-Anonymity in Practice: How Generalisation and Suppression Affect Machine Learning Classifiers. Comput. Secur. 2021, 111, 102488, doi:10.1016/j.cose.2021.102488.
3.     Chicco, D.; Jurman, G. The Advantages of the Matthews Correlation Coefficient (MCC) over F1 Score and Accuracy in Binary Classification Evaluation. BMC Genomics 2020, 21, doi:10.1186/s12864-019-6413-7.

 

Reviewer 3 Report

Although this is an interesting manuscript, my current knowledge of the subject is not sufficient to judge the quality of the paper.

Author Response

We are thankful for the reviewer's feedback. 

Round 2

Reviewer 2 Report

The authors did not satisfactorily address comments 1, 3, 5, and 6.

The manuscript, while highly relevant, is largely descriptive and does not yield new understanding of the effect of anonymization on classification models, given the already existing literature. As such, I cannot recommend it for publication.

Some of the more specific comments for future improvements:

1. Data accessibility should be provided in the manuscript. For the particular purpose of this manuscript, a publicly accessible dataset should be used.

2. The analysis scripts and code should be made available in a repository (e.g., GitHub).

3. The deterministic nature of Mondrian anonymization does not imply that stochastic sub-sampling of the dataset will yield identical strata. Therefore, a statistical test is necessary to evaluate the significance of performance differences between conditions.

4. The separated plots are more clearly presented now, but the figure resolution is very poor and the legend font sizes are too small, making it hard to read what each line color represents. Ensure at least 300 dpi for publication figures.

5. Why does the performance of GaussianNB improve with greater k values in the qi = 4 set? One would generally expect greater anonymization to increase bias error and therefore reduce classification accuracy. It appears that anonymization does not follow a simple monotonic relationship with model prediction performance. Is this due to dimensionality reduction? The authors need to address the underlying cause of this observation.

6. Quasi-identifiers vary between datasets, so the conclusions drawn from this study are not generalizable to studies using other datasets (lines 347-348).



Author Response

We are grateful for the feedback on our submitted paper and we have made an effort to integrate the proposed additions and corrections to our work. Below you will find our response to the reviewer’s comments.

Comment(s): 1. Data accessibility should be provided in the manuscript. For the particular purpose of this manuscript, a publicly accessible dataset should be used. 2. The analysis scripts and code should be made available in a repository (e.g., GitHub).

Response: The submitted paper is part of the MODELHealth project, which has been co-funded by the European Regional Development Fund of the European Union and Greek national funds through the Operational Program Competitiveness, Entrepreneurship and Innovation, under the call RESEARCH - CREATE - INNOVATE (Project Code: T1EDK-04066). Therefore, the project datasets and code cannot be made publicly available.


Comment(s): 3. The deterministic nature of Mondrian anonymization does not imply that stochastic sub-sampling of the dataset will yield identical strata. Therefore, a statistical test is necessary to evaluate the significance of performance differences between conditions.
Response: A number of statistical tests have been carried out in order to explore the statistical significance of the results. Table 3 and Figure 4 were added, presenting the means and medians, as well as the confidence intervals, of the test AUC and MCC values. Figure 5 was added, depicting the correlation between GIL and MCC (bottom left) and the respective densities of the GIL (upper left) and MCC (bottom right) metric values.
The results were also subjected to parametric one-way ANOVA using the Welch test, as well as non-parametric analysis using the Kruskal-Wallis test, which showed statistically significant differences among the tested algorithms (p<0.001), as presented in Tables A4 and A7 of Appendix A. Finally, the results of the Games-Howell post-hoc test and the Dwass-Steel-Critchlow-Fligner test are presented in Tables A5, A6, A8 and A9 of Appendix A.
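A minimal sketch of how such tests can be run in Python; the long-format DataFrame results, with columns "model" and "mcc", is a hypothetical name, and we assume the pingouin and scipy packages (the Dwass-Steel-Critchlow-Fligner post-hoc test is likewise available as posthoc_dscf in the scikit-posthocs package).

import pingouin as pg
from scipy import stats

# Welch one-way ANOVA and Games-Howell post-hoc comparisons across models.
welch = pg.welch_anova(data=results, dv="mcc", between="model")
posthoc = pg.pairwise_gameshowell(data=results, dv="mcc", between="model")

# Non-parametric Kruskal-Wallis test over the same per-model groups.
groups = [g["mcc"].values for _, g in results.groupby("model")]
h_stat, p_value = stats.kruskal(*groups)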


Comment(s): 4. The separated plots are more clearly presented now, but the figure resolution is very poor and the legend font sizes are too small, making it hard to read what each line color represents. Ensure at least 300 dpi for publication figures.
Response: All figures have a resolution of 600 dpi (as before) and have been modified to fit better in a small space and to increase visibility.


Comment(s): 5. Why does the performance of GaussianNB improve with greater k values in the qi = 4 set? One would generally expect greater anonymization to increase bias error and therefore reduce classification accuracy. It appears that anonymization does not follow a simple monotonic relationship with model prediction performance. Is this due to dimensionality reduction? The authors need to address the underlying cause of this observation.
Response: On the one hand, anonymization could increase bias and therefore cause the model to underfit; on the other hand, one could argue that anonymization can reduce variance and therefore reduce overfitting. Indeed, the relationship between information loss and prediction performance does not seem to be monotonic. However, this conclusion is not surprising, since the performance of machine learning models can be affected by various factors, and good performance is often achieved by finding a balance between them. As an example, one can consider the effect of factors such as the architecture or the regularization parameters of a machine learning model on its bias and variance. Exploring the underlying causes of these observations in depth is not in the scope of this paper and is deferred to future research.
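The trade-off invoked here is the standard bias-variance decomposition of the expected squared error (stated for squared loss with irreducible noise variance \sigma^2; the classification analogue is looser, but the intuition carries over):

\mathbb{E}\big[(y - \hat{f}(x))^2\big] = \mathrm{Bias}\big[\hat{f}(x)\big]^2 + \mathrm{Var}\big[\hat{f}(x)\big] + \sigma^2

If anonymization raises the bias term while lowering the variance term, the total error, and hence prediction performance, need not change monotonically with k.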


Comment(s): 6. Quasi-identifiers vary between datasets, so the conclusions drawn from this study are not generalizable to studies using other datasets (lines 347-348).
Response: It is true that different datasets can have very different quasi-identifiers. However, we do not consider the observation that the choice of quasi-identifiers affects prediction results to be a risky one, since different features have different significance with regard to model performance.

 

Round 3

Reviewer 2 Report

All comments addressed.
