Article
Peer-Review Record

PSciLab: An Unified Distributed and Parallel Software Framework for Data Analysis, Simulation and Machine Learning—Design Practice, Software Architecture, and User Experience

by Stefan Bosse
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Submission received: 20 January 2022 / Revised: 2 March 2022 / Accepted: 5 March 2022 / Published: 11 March 2022
(This article belongs to the Special Issue Applications of Parallel Computing)

Round 1

Reviewer 1 Report

The paper is very complicated and difficult to follow. The results are not well supported by the data presented. The framework description is not clear and provides no arguments on performance and usability. The paper should be significantly rewritten.

Author Response

CHANGES
-----------------


- Spellchecking and typo fixes

- Entire text: WEB -> Web

- Introduction section extended and revised

- Fig. 1 [#psilab3] and caption revised

- Sections merged into new Section 3, "Problem Formalization and Taxonomy":
  + Numerical Computation: Exploitation of Parallelism
  + Problem Classes (new)
  + Data Classes (revised)
  + Algorithmic Classes (revised and extended)
  + Virtual Machine
  
- Sub-section "Processes and Composition" revised

- Fig. 2 [#clusters] and caption revised

- Fig. 3 [#ipc] and caption revised

- Sub-section "Communication, Synchronisation, and Data Sharing" revised and extended

- Sub-section "Process Classes" revised and extended

- Sub-section "Worker" revised and extended
  + Ex. 1 [#shellworkcreate1] revised
  + Ex. 2 [#shellworkexec1] revised

- Sub-section "Pool Server and Working Groups" revised and extended

- Sub-section "Processes and Composition" revised and significantly extended

- Section "Data-path parallelism on GPU" revised and extended

- Sub-section "Multi-model and Distributed Ensemble Machine Learning" revised and extended
  + New experiments added: 
  1. Hybrid Distributed-Parallel 
  2. Participative Crowd Computing with Smartphones

- Table 3 [#testplatforms] extended

- Fig. 13 [#multiworkerarch] added

- Fig. 10 [#resultspcap1] and caption revised
- Fig. 11 [#resultspcap2] and caption revised
- Fig. 14 [#wscnnresults] and caption revised
- Fig. 15 [#wbcnnresults] and caption revised

- Ex. 12 [#exmmML] caption revised

- Fig. 16 [#dpcnnresults] added
- Fig. 17 [#spcnnresults] added

Reviewer 2 Report

The submission presents PSciLab, a JavaScript-based framework for utilizing parallel computing in areas such as simulation and machine learning. The framework can be used either via standalone worker processes (WorkShell) or via interconnected web browsers (WorkBook) running the computations; it supports both shared-memory and distributed-memory programming styles. The main advantage of this framework is its easy access to various kinds of computing resources (local and remote, CPUs as well as GPUs) via a single JavaScript API. The software is presented in detail and evaluated by two use cases. Generally, the approach seems to have merit and may lead towards a publication.
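For readers unfamiliar with the browser primitives underneath, the following minimal sketch contrasts the two programming styles using only the standard Web Worker API; this is not PSciLab's own API (the paper documents that in its Ex. 1 ff.), just the mechanism such frameworks build on.

    // main.js -- sketch using only the standard Web Worker API (not PSciLab's API)
    const worker = new Worker('task.js');

    // Distributed-memory style: explicit message passing; the payload is
    // copied between processes via the structured-clone algorithm.
    worker.postMessage({ cmd: 'square', data: [1, 2, 3] });
    worker.onmessage = (event) => console.log('result:', event.data);

    // Shared-memory style: one buffer visible to both sides (requires a
    // cross-origin-isolated page in current browsers).
    const shared = new SharedArrayBuffer(4 * Int32Array.BYTES_PER_ELEMENT);
    worker.postMessage({ cmd: 'fill', buffer: shared }); // shared, not copied
    // Atomics.load(new Int32Array(shared), 0) later reads the worker's writes.

    // task.js -- the worker side
    onmessage = (event) => {
      const { cmd, data, buffer } = event.data;
      if (cmd === 'square') postMessage(data.map((x) => x * x));
      if (cmd === 'fill') new Int32Array(buffer).fill(42);
    };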

The reasons, however, why I cannot yet recommend the acceptance of the paper in its current form, are given below.

The introduction justifies the main value of the paper with the claim that "the work provides a rigorous experimental evaluation of the performance of parallel and distributed data processing systems by typical numerical use-cases...". However, exactly this experimental evaluation has several flaws:

  1. The evaluation shown in Figures 13 and 14 (figures labelled ML CNN Training) shows no speedups but actually slowdowns (larger execution times) for growing numbers of processing nodes. It is unclear why the accompanying text claims a nearly linear scaling of the speedup. This may be caused by a mislabeling or misinterpretation of the presented results, but I don't see how the conclusions are justified.
  2. The experimental results cover GPU-based data parallelism (Section 5, speed-up up to 5), shared-memory parallelism within a machine for cellular automata simulation (Section 7.2, speedup up to 6), and an embarrassingly parallel (data parallel) message passing application for ML model training (Section 7.3, questionable speedup, see item 1) executed with up to 16 cores on a single machine. No widely scalable or actually physically distributed application is presented. The comment (line 1484) "A distributed-parallel worker-tree with different physical nodes would scale linearly, too, except for the data transfer via LAN" is not really justified: it is exactly the communication bottleneck that poses the biggest problem in distributed applications; exactly this point requires a detailed evaluation. So, all in all, the scalability of the presented approach to more than one or two multi-core CPUs remains questionable.
  3. As for the evaluation of the Web-based "WorkBook" approach: only Figure 7 and Figure 11 present use cases with Web browsers. However, Figure 7 illustrates GPU parallelism, which apparently can be utilized essentially in the same way from a Web browser or a standalone shell process. Figure 11 presents the shared-memory approach, but apparently (?) only in the first figure labeled "WEB Worker (CF-LX6 Chromium 90)". The second figure is labeled "Shell Worker (RP3B)" and thus actually seems to illustrate the shell-based approach (?). No example for an actually distributed approach using Web browsers is presented. Furthermore, as stated on line 1028 (due to security restrictions) "hence the usage of shared memory ... is limited to older browsers" (see the header sketch after this list). Thus, all in all, the actual value of the web-based approach is mostly undemonstrated and seems to be restricted in the long-term future.
  4. The presented software is not clearly visible in the web, no link is given in the paper. All I found by web search was a section "Workbook" on the author's home page with some code files that seem to be related to the presented work ("workbook", "workshell"), but a search for the term "PSciLab" itself was unsuccessful (only its predecessor "PsiLab" could be found). Thus the actual state of the presented work and its usability for a general audience remain unclear.
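A side note on the shared-memory restriction quoted in point 3: browsers disabled SharedArrayBuffer after the Spectre disclosures but re-enable it on pages that opt into cross-origin isolation via two HTTP response headers. A hypothetical Node.js sketch (standard http module; serving workbook.html, one of the framework files named in the author's later response):

    // Hypothetical Node.js server setting the two headers that current
    // browsers require before exposing SharedArrayBuffer to a page.
    const http = require('http');
    const fs = require('fs');

    http.createServer((req, res) => {
      res.setHeader('Cross-Origin-Opener-Policy', 'same-origin');
      res.setHeader('Cross-Origin-Embedder-Policy', 'require-corp');
      res.setHeader('Content-Type', 'text/html');
      res.end(fs.readFileSync('./workbook.html'));
    }).listen(8080);
    // In the served page, self.crossOriginIsolated === true indicates that
    // SharedArrayBuffer (and thus shared-memory workers) is available again.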

Further minor comments for potential improvements of the paper:

  • The actual concrete use of the software remains unclear. Figure 1 gives an abstract view of the "WorkBook" interface and the conclusions mention that the "WorkBooks" and "WEB worker instances" must be started by a user explicitly (in contrast to the WorkShell Service) but how exactly remains unclear. Please show a concrete use case that demonstrates the actual software interface and the whole deployment process of both the web-based and the shell-based approach with multiple processes running on different machines.
  • "WEB" should be generally spelled "Web" (a plain word, not an acronym). Similarly, "OCaML" is actually spelt "OCaml".
  • line 51:  "Paralelisation ... were" -> "was"
  • line 72: "framework ... do " -> "does"
  • line 82: "JS" acronym not introduced yet.
  • lines 106-116: the content of the rest of the paper should be more explicitly outlined on a section-by-section basis.
  • line 121: "the demand of computation doubles currently every three month" -> "months". A numerically very specific claim, please back it up by a citation.
  • line 143: "up to one seconds" -> "second"
  • line 184: "CA" acronym not yet introduced.
  • Section 2: the general topics of "grid computing" and "cloud computing" should also be addressed here.
  • line 259: "&rARR" typo.
  • line 280: "tilde(D)...." typos.
  • line 287: "tilde(M)" typo.
  • line 790: "in the proposed system architecture": is it only proposed or actually implemented? (see the question about the status of the work above)
  • Table 3: the listed acronyms do not always match those given in Figure 7 and the following.
  • line 1251: "parallel scaling index": the widely used terminology is "efficiency"
  • line 1386: "Dsitributed" -> "Distributed" please use a spelling checker.
  • Figure 9: "SBO(..." label cut off.
  • line 1343: "CPu" -> "CPU"
  • line 1488: "&approx" typo
  • line 1601: "any moral or law aspects" I do not understand what kind of aspects you refer here to, please explain.

Author Response

Dear reviewer,

Thank you for the valuable comments. The paper was revised thoroughly. Below is a list of major changes, besides many minor corrections and typo fixes.

1. The evaluation shown in Figures 13 and 14 (figures labelled ML CNN Training) shows no speedups but actually slowdowns (larger execution times) for growing numbers of processing nodes. It is unclear why the accompanying text claims a nearly linear scaling of the speedup. This may be caused by a mislabeling or misinterpretation of the presented results, but I don't see how the conclusions are justified.

>> Two classes of problems are considered in the use-case sections. Firstly, static-sized problems, where partitioning the data and splitting one task into multiple parallel tasks should reduce the overall computation time (speed-up > 1). This is addressed by the simulation example. Secondly, dynamic-sized problems, where the tasks (or data) grow with the number of parallel processes, with the aim to keep the overall computation time constant. This is addressed by the multi-model ML example. This confusion is now clarified in the introduction and in later sections.
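In standard scaling terminology (a clarifying sketch; the symbols below are not taken from the paper), the two problem classes correspond to strong and weak scaling:

    % Strong scaling (static-sized problem, fixed workload W):
    % the goal is to reduce time, ideally T_N = T_1 / N.
    S_{\mathrm{strong}}(N) = \frac{T_1}{T_N}, \qquad
    E(N) = \frac{S_{\mathrm{strong}}(N)}{N}

    % Weak scaling (dynamic-sized problem, workload W_N = N * W_1):
    % the goal is constant time, ideally T_N = T_1, giving the scaled speedup
    S_{\mathrm{scaled}}(N) = N \cdot \frac{T_1}{T_N}

The simulation example targets S_strong > 1; the multi-model ML example targets T_N ≈ T_1, i.e. S_scaled ≈ N, which is why its raw execution times show no decrease.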

2. The experimental results cover GPU-based data parallelism (Section 5, speed-up up to 5), shared-memory parallelism within a machine for cellular automata simulation (Section 7.2, speedup up to 6), and an embarrassingly parallel (data parallel) message passing application for ML model training

>> The ML example addresses control-path parallelism only (see above).

(Section 7.3, questionable speedup, see item 1) executed with up to 16 cores on a single machine. No widely scalable or actually physically distributed application is presented. The comment (line 1484) "A distributed-parallel worker-tree with different physical nodes would scale linearly, too, except for the data transfer via LAN" is not really justified: it is exactly the communication bottleneck that poses the biggest problem in distributed applications; exactly this point requires a detailed evaluation. So, all in all, the scalability of the presented approach to more than one or two multi-core CPUs remains questionable.

>> For control-path parallelism, the evaluation (multi-model ML problem) shows that the communication via message channels can be neglected.
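A rough cost model makes this explicit (a sketch under assumed symbols, not taken from the paper). With control-path parallelism, messages are exchanged only at task start and end:

    % Per-worker time with network latency L, message size s, bandwidth B:
    T_N \approx T_{\mathrm{comp}} + T_{\mathrm{comm}}, \qquad
    T_{\mathrm{comm}} \approx 2 \left( L + \frac{s}{B} \right)
    % T_comm is independent of the training length, so T_comm / T_comp
    % vanishes for long-running tasks and the speedup is barely affected.

This assumes the data is transferred once per task; workloads with continuous data exchange would behave differently.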


3. As for the evaluation of the Web-based "WorkBook" approach: only Figure 7 and Figure 11 present use cases with Web browsers.

>> Fig. 7 evaluates matrix multiplication for Web and Shell Workers; Fig. 8 evaluates typical neural network data flows for Web and Shell Workers.

However, Figure 7 illustrates GPU parallelism, which apparently can be utilized essentially in the same way from a Web browser or a standalone shell process. Figure 11 presents the shared-memory approach, but apparently (?) only in the first figure labeled "WEB Worker (CF-LX6 Chromium 90)". The second figure is labeled "Shell Worker (RP3B)" and thus actually seems to illustrate the shell-based approach (?). No example for an actually distributed approach using Web browsers is presented. Furthermore, as stated on line 1028 (due to security restrictions) "hence the usage of shared memory ... is limited to older browsers". Thus, all in all, the actual value of the web-based approach is mostly undemonstrated and seems to be restricted in the long-term future.

>> Fig. 10 evaluates parallel simulation in shell workers, Fig. 11 evaluates parallel simulation in Web workers

4. The presented software is not clearly visible in the web, no link is given in the paper. All I found by web search was a section "Workbook" on the author's home page with some code files that seem to be related to the presented work ("workbook", "workshell"), but a search for the term "PSciLab" itself was unsuccessful (only its predecessor "PsiLab" could be found). Thus the actual state of the presented work and its usability for a general audience remain unclear.

>> Reference [30] points to the GitHub repository (development snapshots).


CHANGES
-----------------

- Spellchecking and typo fixes

- Entire text: WEB -> Web

- Introduction section extended and revised

- Fig. 1 [#psilab3] and caption revised

- Sections merged into new Section 3, "Problem Formalization and Taxonomy":
  + Numerical Computation: Exploitation of Parallelism
  + Problem Classes (new)
  + Data Classes (revised)
  + Algorithmic Classes (revised and extended)
  + Virtual Machine
  
- Sub-section "Processes and Composition" revised

- Fig. 2 [#clusters] and caption revised

- Fig. 3 [#ipc] and caption revised

- Sub-section "Communication, Synchronisation, and Data Sharing" revised and extended

- Sub-section "Process Classes" revised and extended

- Sub-section "Worker" revised and extended
  + Ex. 1 [#shellworkcreate1] revised
  + Ex. 2 [#shellworkexec1] revised

- Sub-section "Pool Server and Working Groups" revised and extended

- Sub-section "Processes and Composition" revised and significantly extended

- Section "Data-path parallelism on GPU" revised and extended

- Sub-section "Multi-model and Distributed Ensemble Machine Learning" revised and extended
  + New experiments added: 
  1. Hybrid Distributed-Parallel 
  2. Participative Crowd Computing with Smartphones

- Table 3 [#testplatforms] extended

- Fig. 13 [#multiworkerarch] added

- Fig. 10 [#resultspcap1] and caption revised
- Fig. 11 [#resultspcap2] and caption revised
- Fig. 14 [#wscnnresults] and caption revised
- Fig. 15 [#wbcnnresults] and caption revised

- Ex. 12 [#exmmML] caption revised

- Fig. 16 [#dpcnnresults] added
- Fig. 17 [#spcnnresults] added

Round 2

Reviewer 2 Report

On the positive side, the author has clearly invested significant efforts to settle various issues I have raised in the previous review with respect to the interpretation of the benchmark results of the ML use case (see below) and now also presents a "real" distributed implementation, giving much more credibility to the claim of a "distributed" execution framework.

However, several issues still remain:

  1. In the original review, I complained that the software is not to be found on the web and its actual usability for third parties remained unclear. The author has responded by setting up a GitHub page whose only content (at the time of review) is a README file with text "PSciLab - Distributed-Parallel Scientific Software Fraemwork (sic!)". The complaint thus still stands: I don't see the point of publishing a paper on a software that is practically inaccessible and/or unusable for third parties.
  2. The paper is still very vague about the actual use of the framework. In the previous review I asked "Please show a concrete use case that demonstrates the actual software interface and the whole deployment process of both the web-based and the shell-based approach with multiple processes running on different machines." I don't see this request satisfied in the revision. The paper discusses in Sections 2 and 3 at great (actually a bit too much) length the basic principles of parallel processing, but does not really give a single (perhaps very simple but) complete example that includes the whole code/notebook file, commands to start up the environment, and run the computation. Such an example should come at the very beginning, before going into further details that can be elaborated in the subsequent sections. The lack of such an overview example (already at the very beginning) makes the paper difficult and tiresome to read.
  3. The presentation of the scalability of the second case study on ML is awkward to me. Rather than giving for (multiple) fixed workloads the actual computation times, speedups, and efficiencies with a varying number of workers or, alternatively, for (multiple) fixed time bounds the largest problem sizes that can be solved within that time bound (both are generally acknowledged ways to investigate scalability), the presentation gives (this was thankfully clarified in the revision) the computation times for having N workers solving N models and bases the claim of scalability on the fact that these working times remain roughly constant (or grow only moderately for larger N). The reported "speedups" are actually computed as N*(T_1/T_N), where T_1 is the time for one worker solving one model and T_N is the time for N workers solving N models. However, as stated in the paragraph starting at line 1570, the various models have different characteristics, so it is not a priori clear whether the workload indeed scales linearly with the number N of models; it may well be that the workload for N workers is less than N times the workload for 1 worker. Also, line 1596 states "during parallel training, the best models are selected for further training iterations", so there may be some algorithmic improvement achieved by having multiple models at hand. All in all, I am quite suspicious about this form of evaluation of scalability and would very much prefer to have it (at least) complemented by the more traditional form of evaluation sketched above.

Considering all of this, I am sorry that I cannot improve my overall recommendation.

Author Response

In the original review, I complained that the software is not to be found on the web and its actual usability for third parties remained unclear. The author has responded by setting up a GitHub page whose only content (at the time of review) is a README file with text "PSciLab - Distributed-Parallel Scientific Software Fraemwork (sic!)". The complaint thus still stands: I don't see the point of publishing a paper on a software that is practically inaccessible and/or unusable for third parties.

Firstly, the software is and was available publicly via the www.edu-9.de and www.sblab.de Web pages. E.g., it is used by our students regularly. Secondly, the software was provided via the GitHub repository. There is a dist folder containing all relevant software components and the README file that gives a short introduction. There were some glitches with the repository in the past. Maybe there is a misunderstanding about the complexity of the software framework. The framework consists basically of fewer than 10 files (workbook.html, webwork.html, wproxy, sqld, worksh, wex, and some plug-ins).


The paper is still very vague about the actual use of the framework. In the previous review I asked "Please show a concrete use case that demonstrates the actual software interface and the whole deployment process of both the web-based and the shell-based approach with multiple processes running on different machines." I don't see this request satisfied in the revision. The paper discusses in Sections 2 and 3 at great (actually a bit too much) length the basic principles of parallel processing, but does not really give a single (perhaps very simple but) complete example that includes the whole code/notebook file, commands to start up the environment, and run the computation. Such an example should come at the very beginning, before going into further details that can be elaborated in the subsequent sections. The lack of such an overview example (already at the very beginning) makes the paper difficult and tiresome to read.

The paper presents two real-world use-case studies used to evaluate the performance and scalability of the architecture, which are additionally referenced by two recent scientific publications using this software and its parallel/distributed computational capabilities (but just as a tool, not as part of that scientific work). All aspects, including the programming API, are discussed in suitable depth with a lot of examples addressing different aspects of parallel/distributed worker processing and data management (Ex. 1, Ex. 2, Ex. 3, Ex. 4, Ex. 5, Ex. 6, Ex. 7, Ex. 8, Ex. 11, Ex. 12). The source code for the two use-cases was added in an appendix section (only slightly condensed).

The presentation of the scalability of the second case study on ML is awkward to me. Rather than giving for (multiple) fixed workloads the actual computation times, speedups, and efficiencies with a varying number of workers or, alternatively, for (multiple) fixed time bounds the largest problem sizes that can be solved within that time bound (both are generally acknowledged ways to investigate scalability), the presentation gives (this was thankfully clarified in the revision) the computation times for having N workers solving N models and bases the claim of scalability on the fact that these working times remain roughly constant (or grow only moderately for larger N). The reported "speedups" are actually computed as N*(T_1/T_N), where T_1 is the time for one worker solving one model and T_N is the time for N workers solving N models. However, as stated in the paragraph starting at line 1570, the various models have different characteristics, so it is not a priori clear whether the workload indeed scales linearly with the number N of models; it may well be that the workload for N workers is less than N times the workload for 1 worker. Also, line 1596 states "during parallel training, the best models are selected for further training iterations", so there may be some algorithmic improvement achieved by having multiple models at hand. All in all, I am quite suspicious about this form of evaluation of scalability and would very much prefer to have it (at least) complemented by the more traditional form of evaluation sketched above.

The evaluation and assessment of the second use-case with respect to speed-up and scalability, addressing typical distributed processing of dynamically growing problems, is correct and in accordance with the state of knowledge and usual methods. The multi-model training indeed trains N independent models in parallel, and of course the development of the training process and the trained models will differ, but not structurally. All ML training processes have the same computational complexity; they differ in their parameter state due to randomness in the training process. The models with the lowest training error are selected and either used for inference or for further training or transfer learning (see the sketch below). The hyperparameter space of ML is not bounded, therefore there is no upper computation time bound. This is basic ML knowledge and not in the scope of this publication. So the workload of all processes is nearly the same, and the assumptions made in the paper are correct. The text in the paper was slightly updated to clarify this better.
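The selection scheme described above can be sketched as follows (a hypothetical, self-contained JavaScript illustration; trainModel is a stand-in for a real training routine, not a PSciLab API call):

    // Hypothetical sketch: N identical training tasks differing only in the
    // random seed, followed by best-model selection.
    function trainModel(seed) {
      // Stand-in for real CNN training; the error varies with the seed,
      // mimicking randomness in the training process.
      return new Promise((resolve) => {
        const error = Math.abs(Math.sin(seed * 12.9898)) * 0.1;
        setTimeout(() => resolve({ seed, error }), 10);
      });
    }

    async function trainEnsemble(N, keep) {
      // All N tasks have the same computational complexity; they differ only
      // in parameter state, so the total workload grows linearly with N.
      const results = await Promise.all(
        Array.from({ length: N }, (_, i) => trainModel(i))
      );
      // Keep the models with the lowest training error for inference or
      // further training / transfer learning.
      return results.sort((a, b) => a.error - b.error).slice(0, keep);
    }

    trainEnsemble(8, 2).then((best) => console.log('best models:', best));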
