Peer-Review Record

Analytical and Numerical Evaluation of Co-Scheduling Strategies and Their Application

by Ruslan Kuchumov * and Vladimir Korkhov
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Submission received: 19 August 2021 / Revised: 24 September 2021 / Accepted: 27 September 2021 / Published: 2 October 2021

Round 1

Reviewer 1 Report

The paper describes several theoretical and empirical results on scheduling tasks under the assumption that one computing node can process multiple tasks at once with reduced processing speed. The premise is interesting and the paper is in the scope of the journal. The general outlook is positive. Nonetheless, the paper has some issues, some larger and a multitude of smaller ones. I will describe issues in roughly the order they appear in the paper.

From the literature overview, it follows that co-scheduling has been considered before, including the online approach. Thus, the authors should better explain what is different about the definition compared to existing approaches, since this definition is claimed as one of the contributions.

Probably the biggest issue of the paper is its Introduction. In journal papers it usually contains several of the following:
a) motivation (e.g. what is missing from existing methods),
b) practical justification (why is the research important, applications),
c) what are the authors proposing (in-text),
d) contributions (as a short list of precise points),
e) structure of the remainder of the paper,
while the Introduction is usually not longer than 1.5 pages (when the literature overview is done as a separate section, which is the case here).

The issue is that the authors essentially only did points a) and c), but the Introduction is already 1.5 pages. Doing all 5 points might be an overkill, but I feel at least 2 things need to be rectified:

1. There is no information on why such scheduling/research is important. There are no references to what HPC, cloud scheduling, and similar systems are used for. Actually, the Introduction contains no references at all, which is very unusual. I feel the authors should write more about scheduling (or cloud/HPC scheduling, or HPC) in general and its applications before moving on to co-scheduling and other parts of the Introduction. This will increase the significance of any results the authors provide. Uncovering specific applications is not very easy, but after a Google Scholar search I have found several papers that the authors could consider:

* "Speculative Scheduling for Stochastic HPC Applications": mentions specific applications
* "Online scheduling for a Testing-as-a-Service system": large-scale online scheduling with incomplete information, with applications to software testing
* "Scalable system scheduling for HPC and big data": connection to big data, etc.
* "User Estimates Inaccuracy Study in HPC Scheduler": study on user-reported resource usage vs. actual usage
* "Expanding an HPC Cluster to Support the Computational Demands of Digital Pathology": digital pathology application
* "Many-Body Quantum Chemistry on Massively Parallel Computers": chemistry application
* "Job Scheduling with Machine Speeds for Password Cracking Using Hashtopolis": cryptology application (password cracking)

Of course, the list is not limited to that. Furthermore, such a correction of the Introduction would help fix the other big issue: the relatively low number of references (for a topic such as HPC/cloud computing). A reference to a review/survey paper on HPC/cloud scheduling could be useful as well. Moreover, if such a review is available for co-scheduling, then it should definitely be included (in the literature overview section).

To summarize: points b), d) and e) should be considered (out of which point b) is a must), while the existing parts of the Introduction should be shortened (by rewriting from scratch, if necessary). I understand the resulting Introduction will be larger than it is now, but its significance will be higher than it is currently.

I also feel the authors skim over notions like "online scheduling" and "competitiveness" too easily for a paper focusing on them. Some references could be added with this as well. Also, it is not just "online", because the authors actually consider several scenarios of what data is known and what data is not. I believe those are actually existing terms like "clairvoyant", "non-clairvoyant" and "semi-clairvoyant" online scheduling.

Is all research restricted to a single node/machine? If so, it should be highlighted more, as it affects the complexity of building schedules (I will mention that later).

The authors wrote: "we theoretically show that the schedules that it produces can not be more than two times slower than the optimal schedule". First, "longer" is probably better here than "slower". Secondly, why not simply say "it is 2-competitive (with regard to the optimum)"? The authors needlessly use too many words when simpler notions exist. This may also be why the Introduction feels so long (plus the many paragraphs with very short lines).
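For reference, the standard notion this comment points to can be stated in one line (this is the textbook definition, not wording from the manuscript): an online algorithm ALG is c-competitive for makespan minimization when, for every problem instance,

```latex
C_{\max}^{\mathrm{ALG}} \le c \cdot C_{\max}^{\mathrm{OPT}},
```

so "schedules can not be more than two times longer than optimal" is exactly the statement that the strategy is 2-competitive.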

Are there other papers similar to paper [6], or is there only one such approach? If there are others, a line pointing to that research could be added for comprehensiveness.

For papers [10] and [11-14] it would be good to describe what the results were. Or is it just the "3%" result covered by paper [15]?

The authors wrote "earlier in our research we have done the similar work but in the context of HPC applications. We have also provided theoretical boundaries for task speed values when naive parallel strategy can not be applied and showed how it deviated from the optimal strategy". This seems like a reference to a past research paper. Why is there no appropriate citation (if this is paper [18] mentioned later, then there is no problem with citing it twice)?

I feel conflicted about the size/contents of section 3. I feel the problem statement (model, notation, constraints) should be separate from strategies, algorithms and proofs. Thus section 3 should end at the (proper) definition of the goal function/overall problem.

Lines 188-190: "That is, it produces a sequence of combinations and their processing times" Do authors really mean processing times? Then how come it creates a schedule when the order of tasks in the combination is not known? Or does it not matter because there is only one node/machine?

What does S_{j,1} mean? (at this point we only know what S_j means)? Is S_{j,k} the k-th element in S_j? If so, then in what order (especially since S_j does not need to contain all tasks)?

Are incomplete tasks restarted from scratch when preemption happens? Generally, aside from the earlier part of the paper, preemption is glossed over, and its mechanics, importance, and frequency are not described again. Did preemption actually happen during the experiments? Is it common?

Line 203: over what j and k values is this iterating?

Relating to an earlier point: the definition of Cmax shown in 3.1 is true for any strategy (it is problem-related, not algorithm-related), so why is it described in the strategy/method/algorithm portion instead of before 3.1?

Line 235-237: a straightforward C_max^FCS / C_max^OPT formula would be handy here.

Lines 242-252: the notation around here is confusing and introduced "backwards", i.e. the "C_u" symbol is introduced before the "u" symbol. Moreover, u* has no clear meaning; please elaborate on that.

It would make things clearer if the authors said outright that the areas in Figure 2 equal work done (speed multiplied by time etc.) if I understand this correctly.

Some equations are not numbered, but I do not see a reason why they shouldn't be.

Could authors provide a (probably very small, up to a few tasks) example, like instance data (possibly in a form of a table) plus figure(s) with OPT and FCS schedules corresponding to some solutions? It would be especially interesting if for that example C_max^FCS / C_max^OPT = 2.

Line 335: authors wrote "and the optimal value found on the previous iteration" What does "optimal" mean here? Optimal with regards to what?

What do dots (S_., a_{1,.}) mean in Figure 2? Maybe asterisks could be used instead? Or center dots (\cdot in LaTeX) instead of regular dots.

UCBj is somewhat clear, but PIj and EIj should be commented on: what are they meant to represent (what is the interpretation), what is "improvement" here? 
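For context, UCB, PI, and EI are the standard acquisition functions from Bayesian optimization, which is presumably what the paper draws on. A minimal stdlib-only sketch of the usual definitions, for a maximization objective, with `mu`/`sigma` the posterior mean and standard deviation of a candidate and `best` the incumbent value (all names here are illustrative, not taken from the paper):

```python
import math

def norm_cdf(x):
    # Standard normal CDF via the error function (stdlib only).
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def norm_pdf(x):
    # Standard normal density.
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def ucb(mu, sigma, kappa=1.0):
    # Upper Confidence Bound: favor high mean or high uncertainty.
    return mu + kappa * sigma

def pi(mu, sigma, best):
    # Probability of Improvement: chance the candidate beats `best`.
    if sigma == 0.0:
        return float(mu > best)
    return norm_cdf((mu - best) / sigma)

def ei(mu, sigma, best):
    # Expected Improvement: improvement magnitude weighted by its probability.
    if sigma == 0.0:
        return max(mu - best, 0.0)
    z = (mu - best) / sigma
    return (mu - best) * norm_cdf(z) + sigma * norm_pdf(z)
```

Intuitively, UCB trades off mean against uncertainty via `kappa`, PI is the probability a candidate beats the incumbent, and EI additionally weighs by how much it is expected to beat it; this is the usual sense of "improvement" the reviewer is asking the authors to spell out.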

It seems the authors focus entirely on CPU/task speed, while other resources (RAM) are not discussed, even though the introduction to the paper mentions other resources. Namely, I assume all tasks in a combination run concurrently, meaning they all use RAM. What happens when the required RAM exceeds the available RAM? I ask this because, for example, in lines 401-408 the authors talk about blocking resources (e.g. the memory bus might be busy, but the task will be able to access it eventually), but if there is not enough RAM, a task will likely crash. Is there an assumption that there are enough resources for any (non-empty) combination to run, and the combinations just differ in makespan? In other words: are all combinations feasible?

It would be good to emphasize that each benchmark is a separate task (if I understand it correctly) and thus n = 10. In many areas (e.g. image recognition) a benchmark is an entire dataset of many examples, so this is not immediately clear.

Why is paper [18] not part of the literature review earlier? It can be simply referenced twice in the paper: once in review and once here.

"We have generated 50 systems" The word "system" sounds very vague here. Did the authors simply mean "(scheduling) problems" or "(scheduling) problem instances"?

Figures generally appear rather late with regard to their first reference point, though I understand fixing this is not always easy. Maybe some figures could be placed at the bottom of some pages (two figures per page: one at the top, one at the bottom)? Or two figures at the top of the page, one below the other? Generally, tables can be put at the bottom of the page as well (Table 2 in the current version of the paper, specifically).

Another less trivial issue: I feel very conflicted about the "discussion" section. Most of the paragraphs repeat what was said already in abstract and in introduction (which was already too large) and which can be easily (and actually *is*) mentioned (again) in conclusions. Of the 11 paragraphs in the "discussion" I would leave only paragraphs 6, 8, 9, 10 and 11. Some of the remaining parts of "discussion" can be moved to "conclusions" (although as much shorter summary rather than entire page of text).

Are there any future work plans for this line of research? If so they could be added to conclusions. 

Aside from the above points, there are some typography issues:

* Line 104: goes into the margin (easy to fix).

* Reference [2] has no title

* In reference [4] it probably should be "HPC", not "hpc". This should be achievable by using curly braces (i.e. {HPC}) in the BibTeX entry. Generally the references should be re-checked for mistakes.

There are numerous typos and grammatical errors, some of which are highlighted below.

* "that task processing speed value are still available" should be "is" (or "values") not "are". And why not simply say "that task processing speed is known[, but the required amount of work for each task is not]"?
* "Other reasons are related to that this dependency" probably "that" or "this" is redundant.
* "linear programming programming problem"
* "done the similar work" there should be no "the" here 
* "but in the context of HPC applications" I am not sure whether authors mean HPC application or non-HPC applications. Please make sure this is what was meant.
* "does not need compete" -> "to compete"
* "Since it is the sum" -> "a sum"
* "until the of the schedule at the time point" -> "the end of schedule"?
* "plugging": this seems too informal (it happens more than once); please consider using "substituting" instead
* "only the schedules that has the form" -> have
* figure 4 -> Figure 4 or Fig. 4
* section 7 -> Section 7 (or Sec. 7, but that is rare)
* table 1 -> Table 1 or Tab. 1
* "As task speed values has" -> have
* "would an unitless value" -> would be?
* "The resulting of benchmarks" -> resulting benchmarks? results of benchmarks?
* "to find processing speed the task itself" -> speed of the task?
* "To show that formula 7" -> "that formula (7)" (this can be conveniently done in LaTeX with \eqref{} instead of \ref{})

In the light of the above remarks, my recommendation is to perform moderate changes to the paper and perform another review round.

Author Response

We thank the reviewer for valuable comments and suggestions that helped us to improve the manuscript.

 

> the authors should better explain what is different about the definition compared to existing approaches, since this definition is claimed as one of the contributions

 

The Related Work section was extended with paragraphs 4, 9, and 10 (lines starting at 130 and 184) containing a comparison of the proposed approach with existing approaches.

 

> I feel the authors should write more about scheduling (or cloud/HPC scheduling or HPC) in general and its applications, before moving onto co-scheduling and other parts of the Introduction. […]

 

The Introduction was rewritten in accordance with the recommendations. Paragraph 1 now contains citations of HPC applications in different fields, justifies the importance of scheduling in HPC in general, and contains references to survey papers related to scheduling. Paragraph 2 justifies the importance of co-scheduling in particular. The in-text description of the paper's contribution was shortened, and paragraphs 7 and 8 (lines starting at 70) were added with a list of contribution points and the structure of the remainder of the paper.

 

> I also feel the authors skim over notions like "online scheduling" and "competitiveness" too easily for a paper focusing on them. Some references could be added with this as well. Also, it is not just "online", because the authors actually consider several scenarios of what data is known and what data is not.

 

Descriptions of online/offline scheduling and of the types of clairvoyant scheduling were added to Section 3 (Background). References to scheduling theory books containing detailed descriptions were added to this section as well. The notion of competitiveness is now described in Section 5 (paragraph 2).

 

> Is all research restricted to a single node/machine? If so, it should be highlighted more

 

The restriction to a single machine is now highlighted in the problem definition in Section 4 (Problem of Tasks Co-Scheduling), paragraph 6. The applicability of the model to multiple nodes/machines is now described in the Discussion section, paragraph 4 (lines starting at 712).

 

> The authors wrote: "we theoretically show that the schedules that it produces can not be more than two times slower than the optimal schedule". First, "longer" is probably better here than "slower". Secondly, why not simply say "it is 2-competitive (with regards to optimum)"?

 

The longer descriptions of the competitive ratio were removed.

 

> Are there other papers similar to paper [6] or is there only one such approach?

 

An additional paper similar to [6] is now referenced in the review.

 

> For papers [10] and [11-14] it would be good to describe what were the results. Or is it just the "3%" result covered by paper [15]?

 

Descriptions of the results obtained in papers [10] and [11-14] were added to the Related Work section.

 

> The authors wrote "earlier in our research … Why is there no appropriate citation?

 

A citation to the paper is now added.

 

> I feel conflicted about the size/contents of section 3. I feel the problem statement (model, notation, constraints) should be separate from strategies, algorithms and proofs. Thus section 3 should end at the (proper) definition of the goal function/overall problem. [...] the definition of Cmax shown in 3.1 is true for any strategy 

 

This section has been changed in accordance with the recommendations. A new Section 3 (Background) was added, containing a general description of the scheduling model taxonomy. The formal problem definition (including the makespan definition) was moved into a separate section. The deterministic strategies (optimal and FCS) and their comparison were moved into a dedicated section.

 

> Lines 188-190: "That is, it produces a sequence of combinations and their processing times" Do authors really mean processing times? Then how come it creates a schedule when the order of tasks in the combination is not known? Or does it not matter because there is only one node/machine? [...] What does S_{j,1} mean? (at this point we only know what S_j means)? [...] Line 203: over what j and k values is this iterating?

 

The formal definition of the schedule for co-scheduling problems was refined (at the end of Section 4, line 261).

 

> Are incomplete tasks restarted from scratch when preemption happens? Generally, aside from earlier part of the paper preemption is glossed over and its mechanic, importance and frequency is not described again.

 

Task preemption is now described in the problem definition (Section 4), and technical considerations are mentioned in the Discussion section (paragraphs 2 and 6, lines 692-701 and 732-741).

 

> Could authors provide a (probably very small, up to a few tasks) example …  It would be especially interesting if for that example C_max^FCS / C_max^OPT = 2.

 

An example of such a problem instance is now presented in Subsection 5.2.

 

> What do dots (S_., a_{1,.}) mean in Figure 2? Maybe asterisks could be used instead?

 

The notation was changed to S_j, a_{1,j} in accordance with the notation in the rest of the paper.

 

> UCBj is somewhat clear, but PIj and EIj should be commented on: what are they meant to represent (what is the interpretation), what is "improvement" here?

 

The term “improvement”, along with the acquisition functions, is now explained in more detail at the beginning of Subsection 6.2.

 

> It seems authors focus entirely on CPU/task speed, while other resources (RAM) are not discussed, [...]  In other words: are all combinations feasible?

 

A description of combination feasibility and the requirements for main memory is now presented in the third paragraph of the Discussion (lines 702-712).

 

> Figures generally appear rather late with regards to their first reference point, though I understand fixing this is not always easy.

 

Figure positions were changed so that figures appear before their references in the text, where possible (all figures except the last). Tables were moved to the bottom of the page.

 

> Another less trivial issue: I feel very conflicted about the "discussion" section. Most of the paragraphs repeat what was said already in abstract and in introduction

 

The Discussion section was rewritten in accordance with the recommendations. Paragraphs that repeated the Introduction were removed.

 

> Are there any future work plans for this line of research? If so they could be added to conclusions

 

Plans for future work are now added at the end of the conclusion section. 

 

> Line 235-237: a straightforward C_max^FCS / C_max^OPT formula would be handy here.

> Lines 242-252: notation around here is confusing and introduced "backwards"

> It would make things clearer if the authors said outright that the areas in Figure 2 equal work

> Some equations are not numbered,

> Line 335: authors wrote "and the optimal value found on the previous iteration" What does "optimal" mean here? Optimal with regards to what?

> "We have generated 50 systems" The word "system" sounds very vague here.

> It would be good to emphasize that each benchmark is a separate task

> Line 104: goes into the margin (easy to fix).

> Reference [2] has no title

> In reference [4] it probably should be "HPC", not "hpc". This should be achievable by using curly braces (i.e. {HPC}) in the BibTeX entry. Generally the references should be re-checked for mistakes.

> There are numerous typos and grammatical errors, some of which are highlighted below.

 

All of these issues are addressed in the revised manuscript.

Author Response File: Author Response.pdf

Reviewer 2 Report

In the paper, the authors have defined a model for solving the co-scheduling problem and proposed multiple scheduling strategies. A fact that deserves attention is that the paper was very carefully prepared. As a consequence, the quality of the paper is very high.

The paper is properly organized and includes all important elements: introduction, related work discussion, detailed description of the problem, proposed solutions description and utilization, results and their discussion.

In my opinion, the paper can be accepted and published in the present form. 

Author Response

We thank the reviewer for positive evaluation of the manuscript.

Reviewer 3 Report

This work presents an analytical and numerical evaluation of co-scheduling strategies and their application for HPC. The main problem of this work is its overall organization. It is an interesting work, though, so the authors should improve some issues as described below:

1) Introduction: The main problem is that the authors have not clearly declared their contributions. They merely state that they propose several scheduling strategies that can be applied to solve the co-scheduling problem. They also mention that the problem is solved via LP, but what are the contributions? This is important.

2) The Related Work section needs restructuring. First, we need to see more recent approaches, if any, within the last 2 years, and apparently more papers. This problem is well studied, so more papers are needed. Moreover, a simple presentation is inadequate. The papers should be described in groups based on the strategy being used, and a discussion should follow based on the advantages/disadvantages of each group of schemes. Also, the authors should clarify where THEIR work is placed within the proposed taxonomy. Generally, a strict presentation of the form "the authors did this" and "the authors proposed this" is not acceptable.

3) The title should change to "THE problem of tasks co-scheduling".

From this point on, there is an organization issue:

1. Section 3 is supposed to formalize the co-scheduling problem and provide an optimal solution. Then, the online strategy is proposed. Maybe paragraph 3.2 should be removed and moved to Section 4. The same holds for Section 3.3.

2. I think that all the steps for performing FCS should be gathered, perhaps in pseudocode. The mathematics and the proofs are absolutely required, the whole co-scheduling is based on them, and they are presented nicely, but one needs to see the scheduling itself, separately.

3. The authors need to provide an overall complexity of their scheduling approach.

Author Response

We thank the reviewer for valuable comments and suggestions that helped us to improve the manuscript.

 

> The main problem is that the authors have not clearly declared their contributions. [...] Also they mention that the problem is solved via LP, but what are the contributions?

 

The Introduction was rewritten. The in-text description of the paper's contribution was shortened, and paragraph 7 was added with a list of contribution points.

 

> The Related Work section need restructuring. [...] Generally, a strict presentation of the form "the authors did this" and "the authors proposed this" is not acceptable.

 

The Related Work section was extended in accordance with the recommendations. We added paragraphs 4, 9, and 10 (lines starting at 130 and 184) containing a comparison of the proposed approach with existing approaches. The reference list was extended with additional references.

 

> The title should change to "THE problem of tasks co-scheduling"

 

The title of the section was changed accordingly.

 

> Section 3 is supposed to formalize the co-scheduling problem and provide an optimal solution. Then, the online strategy is proposed. Maybe, paragraph 3.2 should be removed and moved to Section 4. The same holds for Section 3.3.

 

The structure of the paper was updated. A new Section 3 (Background) was added, containing a general description of the scheduling model taxonomy. The formal problem definition (including the makespan definition) was moved into a separate section. The deterministic strategies (optimal and FCS) and their comparison were moved into a dedicated section.

 

> I think that all the steps for performing FCS should be gathered perhaps in a pseudocode.

 

A pseudocode description of the FCS strategy is now added to Subsection 5.2.

 

> The authors need to provide an overall complexity of their scheduling approach

 

The computational complexity of the scheduling strategies is now described in the Discussion section, paragraph 3 (lines 702-711).

Round 2

Reviewer 1 Report

Most of the issues have been fixed. Most of the remaining issues are minor and easily fixed and most are related to Section 3: 

* At the beginning of section 3, there is "classical theory". By this do authors mean "classical scheduling theory"? This should be clarified.

* I feel conflicted about the title of the section. The name "background" fits more into sections 1 and 2 (which provide background/motivation). My suggestion would be to merge sections 3 and 4 into one section (titled "The problem of task co-scheduling") and divide it into two subsections. First could be untitled (or titled something like "General assumptions"), while what is currently section 4 would become subsection 3.1 (or 3.2), titled "Problem formulation" or something similar.

Aside from that, equation (22) could be followed by another equation stating the same result in limit form, i.e. lim_{n -> infinity} CmaxFCS / CmaxOPT = 2.

I agree that OPT strategy is "offline" and theoretical (assuming appropriate knowledge) FCS is "online clairvoyant", but the practical approach to FCS shown in section 6 seems more like "online semi-clairvoyant", since the information is approximated.

There are some typos still:

"fishes" -> "finishes"
"generosity" -> "generality"

My recommendation is to accept the paper after minor revision.

Author Response

Thank you for the suggestions that helped improve the manuscript.

 

> At the beginning of section 3, there is "classical theory". By this do authors mean "classical scheduling theory"? This should be clarified.

> Aside from that equation (22) could be followed by another equation

> There are some typos still:

 

All of these issues are addressed in the revised manuscript.

 

> My suggestion would be to merge sections 3 and 4 into one section

 

The section structure was updated in accordance with the recommendations.

 

> I agree that OPT strategy is "offline" and theoretical (assuming appropriate knowledge) FCS is "online clairvoyant", but the practical approach to FCS shown in section 6 seems more like "online semi-clairvoyant", since the information is approximated.

 

In semi-clairvoyant models, the assumption is that an approximation of the task processing time is available upon task release (e.g., [1], first paragraphs of Sections 1 and 2). In our case, there is processing speed instead of processing time. In the practical approaches, task speed values are indeed estimated for some combinations, but this happens only after measurements for other combinations containing the task are available. To avoid such confusion, we decided not to refer to the model as “online semi-clairvoyant”. Instead, we define it as a stochastic model, where task speeds are random variables and the parameters of their distributions are estimated at runtime.

 

[1] Becchetti, Luca, et al. Semi-clairvoyant scheduling. https://0-doi-org.brum.beds.ac.uk/10.1016/j.tcs.2004.05.023
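The stochastic model described above (task speeds as random variables whose distribution parameters are estimated at runtime) can be sketched with a running estimator based on Welford's online algorithm. This is an illustrative fragment, not code from the paper; the class and method names are invented:

```python
class SpeedEstimate:
    """Running mean/variance of a task's observed processing speeds
    (Welford's online algorithm). Parameters are refined as measurements
    for combinations containing the task become available."""

    def __init__(self):
        self.n = 0        # number of speed observations so far
        self.mean = 0.0   # running mean speed
        self._m2 = 0.0    # running sum of squared deviations

    def observe(self, speed: float) -> None:
        # Incorporate one measured speed without storing the history.
        self.n += 1
        delta = speed - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (speed - self.mean)

    def variance(self) -> float:
        # Unbiased sample variance; 0.0 until two observations exist.
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0
```

A scheduler can then rank combinations by the estimated mean speeds while accounting for their variance, without assuming the speeds are known upfront.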

Reviewer 3 Report

I think the authors have addressed my issues, so I believe the paper is suitable for publication.

 

 

Author Response

> I think the authors have addressed my issues so i believe the paper is suitable for publication.

 

We thank the reviewer for valuable comments and suggestions that helped us to improve the manuscript.
