Article
Peer-Review Record

Incorporation of Synthetic Data Generation Techniques within a Controlled Data Processing Workflow in the Health and Wellbeing Domain

by Mikel Hernandez 1,*, Gorka Epelde 1,2,*, Andoni Beristain 1,2, Roberto Álvarez 1,2, Cristina Molina 1, Xabat Larrea 1,3, Ane Alberdi 3, Michalis Timoleon 4, Panagiotis Bamidis 4,† and Evdokimos Konstantinidis 4,5,†
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Electronics 2022, 11(5), 812; https://doi.org/10.3390/electronics11050812
Submission received: 28 January 2022 / Revised: 2 March 2022 / Accepted: 3 March 2022 / Published: 4 March 2022
(This article belongs to the Special Issue Recent Advances in Synthetic Data Generation)

Round 1

Reviewer 1 Report

The paper describes a new framework for the secondary sharing of personal data using synthetic data generation. The main problem with the paper is that it completely neglects current research on synthetic data, particularly research with implications for the applicability of the proposed framework. The points below elaborate on this concern from multiple angles:

  1. The whole approach relies on the assumption that models built on synthetic data (SD) will eventually be useful (after multiple iterations of testing on real data and regenerating the SD). However, the authors do not provide any evidence to support the accuracy of models built using their framework.
  2. Prior work has shown that, when running multiple classifiers on SD and on real data, the best classifier (in terms of accuracy) on the SD does not always match the best classifier on the real data (particularly for the SDG chosen for this study, SDV). So how will this affect the framework? (Check "A Multi-Dimensional Evaluation of Synthetic Data Generators" in IEEE Access; a minimal sketch of such a ranking comparison follows this list.)
  3. Users are allowed to repeat their analysis on another SD if the results are not satisfactory. There are multiple issues with this:
    1. On what basis is a new SD chosen? Is there a utility measure to guide the choice? If so, please elaborate on and justify the choice of measure. (Check "Generation and evaluation of synthetic patient data" in BMC Medical Research Methodology.)
    2. Related to the previous point, what is the stopping criterion? Without a utility measure to guide SD generation this is not clear, and the process may never converge (see the regeneration-loop sketch after this list).
    3. What is the effect on the scientists of repeating the analysis multiple times, particularly when it is not clear whether prior analyses can be reused or when the process will converge?
    4. What is the effect of this iterative framework on privacy?
  4. The authors performed only one experiment to demonstrate the usefulness of their methodology; this is not enough, particularly when no supporting evidence is supplied.
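
To make point 2 concrete, here is a minimal sketch (not taken from the paper; the scikit-learn model set and the data variables in the commented usage are illustrative assumptions) of how the ranking of classifiers trained on real versus synthetic data could be compared:

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

def classifier_scores(train_X, train_y, test_X, test_y):
    """Score a fixed set of classifiers on a real hold-out set."""
    models = [LogisticRegression(max_iter=1000),
              DecisionTreeClassifier(),
              RandomForestClassifier()]
    scores = []
    for m in models:
        m.fit(train_X, train_y)
        scores.append(accuracy_score(test_y, m.predict(test_X)))
    return scores

# Hypothetical usage -- real_X/real_y, synth_X/synth_y and the real
# hold-out test_X/test_y are placeholders. A low rank correlation means
# the best model on SD is not the best model on RD.
# rho, _ = spearmanr(classifier_scores(real_X, real_y, test_X, test_y),
#                    classifier_scores(synth_X, synth_y, test_X, test_y))
```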
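And for points 3.1 and 3.2, one possible shape of a utility-guided regeneration loop with an explicit stopping criterion; the KS-based utility measure, the `generate_synthetic` callable, and the thresholds are illustrative assumptions, not the authors' workflow:

```python
import numpy as np
from scipy.stats import ks_2samp

def utility_gap(real, synthetic):
    """Mean per-column Kolmogorov-Smirnov statistic (0 = identical marginals).

    Both arguments are assumed to be numeric 2-D NumPy arrays with the
    same columns.
    """
    return float(np.mean([ks_2samp(real[:, j], synthetic[:, j]).statistic
                          for j in range(real.shape[1])]))

def regenerate_until_useful(real, generate_synthetic, tol=0.05, max_iter=10):
    """Regenerate SD until the utility measure passes, or give up.

    `generate_synthetic` is a hypothetical callable wrapping the SDG.
    Without the max_iter cap, the loop has no guarantee of converging --
    which is exactly the concern raised above.
    """
    for attempt in range(1, max_iter + 1):
        sd = generate_synthetic()
        if utility_gap(real, sd) <= tol:   # explicit stopping criterion
            return sd, attempt
    raise RuntimeError(f"no synthetic dataset met the threshold {tol} "
                       f"after {max_iter} attempts")
```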

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

The paper aims to demonstrate a pipeline through which synthetic data generation techniques can be adopted in real-world applications, particularly to address issues such as augmenting real data for training different ML models and data privacy/security. The paper is well presented (except for one aspect, mentioned below), and the results are useful from a real-life application point of view.

Using a test-bed problem (heart rate measurements), a forecasting analysis with SD, and a remote forecasting analysis with RD, the efficiency and utility of the complete workflow have been demonstrated and validated.

The authors need to include a complete section on how the synthetic data were generated for this test-bed problem, together with the associated mathematics. There are several methods to generate synthetic (time-series) data; however, the authors say little about them before using the SD and performing comparisons between synthetic and real data. A couple of recent articles that could be consulted to address this issue: 1) a white-box model ("An Application of Machine Learning for Plasma Current Quench Studies via Synthetic Data Generation"): https://doi.org/10.1016/j.fusengdes.2021.112578

2) an ML-based model ("Generative Adversarial Network for Synthetic Time Series Data Generation in Smart Grids"): https://doi.org/10.1109/SmartGridComm.2018.8587464

The quality of the SD will depend on which method was used to generate it. General statistical measures establishing the value of the generated synthetic data are meaningful only after the authors clearly describe how the SD was generated. Currently, this is the main weakness of the paper; once addressed, it would make the paper a good read.
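
Purely as an illustration of what such a section might cover (this is not a claim about the method actually used in the paper), a simple white-box generator for a single heart-rate series could fit and sample an AR(1) process:

```python
import numpy as np

def fit_ar1(series):
    """Estimate AR(1) parameters of x_t = c + phi * x_{t-1} + eps."""
    x, y = series[:-1], series[1:]
    phi, c = np.polyfit(x, y, deg=1)       # slope, intercept
    resid = y - (c + phi * x)
    return c, phi, resid.std()

def sample_ar1(c, phi, sigma, n, x0, rng=None):
    """Draw a synthetic series from the fitted AR(1) model."""
    rng = np.random.default_rng() if rng is None else rng
    out = np.empty(n)
    out[0] = x0
    for t in range(1, n):
        out[t] = c + phi * out[t - 1] + rng.normal(0.0, sigma)
    return out

# Toy stand-in for real heart-rate measurements (bpm); the synthetic
# series preserves first-order autocorrelation but no higher-order
# structure, which is why the chosen method must be documented.
rng = np.random.default_rng(0)
real_hr = 70 + 5 * np.sin(np.linspace(0, 20, 500)) + rng.normal(0, 1, 500)
c, phi, sigma = fit_ar1(real_hr)
synthetic_hr = sample_ar1(c, phi, sigma, n=len(real_hr), x0=real_hr[0])
```

An ML-based generator, such as the GAN in the second reference, would replace this explicit model with a learned one; either way, the resulting SD quality depends entirely on the documented choice of method.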

In Figure 1, the authors could explain in the text what "check results" actually means and how it is achieved.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 3 Report

  1. This is more of an engineering paper in support of research. Certain components therefore need to be added to the visuals and discussions.
  2. A comparison with similar systems, in terms of performance and/or results, would greatly help researchers.
  3. SDG in other fields with similar privacy concerns, such as education, is mentioned but not discussed in much depth. Comparison with other fields is always very helpful.
  4. The architectural model of the tool is discussed. The visuals do not follow standard software architecture notation, which is acceptable, but colors and dashed vs. solid lines appear to be used to convey the system workflow and data flow. A DFD might be a good way to show the data flow. The diagrams also need to clarify what the different colors mean, as well as the dashed vs. solid lines (thick and thin ones).
  5. A performance analysis of the system on specific hardware and data sizes should be added.
  6. The discussion of reliability focuses mainly on privacy and security preservation. However, the method's reliability could also be verified through statistical analysis, running many different experiments to make sure the model is robust (see the sketch after this list).
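
As one concrete reading of point 6, robustness could be reported as the spread of an evaluation metric over many repeated runs. A minimal sketch, where `run_experiment(seed)` is a hypothetical placeholder for one complete SD-generation, model-training, and evaluation cycle:

```python
import statistics

def robustness_report(run_experiment, n_runs=30):
    """Repeat the experiment over many seeds and summarize the metric."""
    scores = [run_experiment(seed) for seed in range(n_runs)]
    return {"mean": statistics.mean(scores),
            "std": statistics.stdev(scores),
            "min": min(scores),
            "max": max(scores)}
```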

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

The authors responded to most of my comments. The issue of the stopping criterion is still problematic.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

The authors have addressed the important issues in the revised version, and the presentation quality has also improved.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf
