Article

Run-Time Hierarchical Management of Mapping, Per-Cluster DVFS and Per-Core DPM for Energy Optimization

School of Electronics and Information Technology, Sun Yat-sen University, Guangzhou 510006, China
* Authors to whom correspondence should be addressed.
Submission received: 28 February 2022 / Revised: 25 March 2022 / Accepted: 25 March 2022 / Published: 30 March 2022

Abstract

Heterogeneous cluster-based multi/many-core systems (e.g., ARM big.LITTLE, supporting dynamic voltage and frequency scaling (DVFS) at the cluster level and dynamic power management (DPM) at the core level) have attracted much attention for energy optimization on modern embedded systems. For applications executing concurrently on such a platform, this paper studies how to appropriately apply the three system configurations (mapping, DVFS, and DPM) to reduce both dynamic and static energy. To this end, this paper first formulates the dependence of the three system configurations on heterogeneous cluster-based systems as a 0–1 integer linear programming (ILP) model, taking into account run-time configuration overheads (e.g., the costs of DPM mode switching and task migration). Then, with the 0–1 ILP model, different run-time strategies (considering the three configurations in fully separate, partially separate, and holistic manners) are compared based on a hierarchical management structure and design-time-prepared data. Experimental case studies offer insights into the effectiveness of the different management strategies on different platform sizes (#clusters × #cores: 2 × 4, 2 × 8, 4 × 4, 4 × 8), in terms of application migration, energy efficiency, resource efficiency, and complexity.

1. Introduction

Heterogeneous cluster-based multi/many-core platforms (e.g., ARM’s big.LITTLE architectures [1,2]) have attracted much attention as a solution for delivering both high-performance and energy-efficient embedded systems. Such a platform groups homogeneous cores into clusters and exploits heterogeneity by using different core types across clusters. In addition, the platform allows updating the voltage/frequency ($v/f$) level of each cluster via dynamic voltage and frequency scaling (per-cluster DVFS, for dynamic power/energy optimization) and switching off each idle core via dynamic power management (per-core DPM, for static power/energy optimization). The diversity of core types, the cluster $v/f$ configurations, and the possibility of switching off idle cores provide many opportunities for energy optimization of the overall system.
Heterogeneous cluster-based systems allow multiple applications to execute concurrently. A set of simultaneously executing applications is defined as a use case [3]. From one use case to another, the change in performance demands leads to the need for run-time mapping. In the state of the art (SotA), run-time mapping has often been combined with DVFS and DPM to achieve power/energy efficiency. The SotA strategies are compared in Table 1. As shown in the ❶ rows, there have been two approaches to combining mapping and DVFS in the literature: (1) separately (denoted by →) and (2) holistically (i.e., in an integrated way, denoted by &). The first approach (i.e., mapping → DVFS), used in works such as [4,5,6,7], separates application mapping and DVFS into two subsequent steps: after the application mappings are fixed, the $v/f$ configurations are set according to the system performance constraints. The second approach (i.e., mapping & DVFS) is often used in recent works [8,9,10,11,12,13]. Such approaches apply task mapping and DVFS in a holistic manner: they take into account the mutual influence between mapping and DVFS, and any change in the mapping can trigger an update of the $v/f$ configurations. The holistic approaches can achieve more energy-efficient solutions, but may come at the cost of exploration complexity [6]. Due to this complexity, the recent works of [9,10] consider the mapping and DVFS configuration of a single multi-task application. For concurrently executing applications, some recent works [10,11,13] reduce the exploration complexity by using heuristic strategies.
DPM has been combined with mapping and DVFS in multi-core systems such as [14,15,16] (see the ❷ rows in Table 1). These works consider independent tasks or a single (possibly multi-task) application mapped on a per-core DVFS platform (denoted by DVFS* in Table 1). Such approaches are of limited use for our considered case, where multi-task applications execute concurrently on a per-cluster DVFS platform. On the other hand, DPM is rarely used for ARM big.LITTLE systems in existing works, for two main reasons: (1) the extra DPM overheads of switching cores on/off can be much higher than the DVFS overheads [17]; (2) the limited resources of the typical ARM big.LITTLE architecture (i.e., the ODROID-XU4 Exynos 5 Octa (5422) [1], with 4 cores in each of the LITTLE and big clusters) make idle cores or a whole idle cluster a rare situation when multiple applications execute concurrently. However, for a future platform with more cores/clusters (e.g., the ODROID-MC1 [2], an extension of the ODROID-XU4 into a single board consisting of 4 LITTLE clusters and 4 big clusters), switching off idle cores/clusters can have energy benefits. Moreover, driven by the continued shrinking of transistor feature sizes in the semiconductor industry, leakage power has increased its share of the total power consumption [18]. It is therefore important to trade off static and dynamic energy optimization.
The focus of this paper is to study how to appropriately apply DPM together with mapping and DVFS to optimize the overall energy (i.e., including both the dynamic and static parts) of multiple applications executing concurrently on a heterogeneous cluster-based system. This paper presents three possible management strategies and highlights their differences (see the ❸ rows in Table 1). The first two management strategies, mapping → DVFS → DPM and mapping & DVFS → DPM, can be regarded as fully separate and partially separate strategies, respectively. These two strategies perform mapping and DVFS according to existing strategies (e.g., [4,11]; see ❶), and then switch off the cores that are not assigned workloads to reduce static energy. These two straightforward extension strategies can achieve lower cluster $v/f$ configurations (DVFS) by using more cores (depending on the mapping) in each cluster for dynamic energy optimization. However, because fewer idle cores remain to be switched off (DPM), static energy can increase. Due to such dependencies between the three system configurations, a holistic strategy (i.e., mapping & DVFS & DPM) can be a promising solution for better dynamic and static energy tradeoffs. The proposed holistic management strategy (for mapping, per-cluster DVFS, and DPM) is the key novelty of our work. Most of the existing works focus on holistic strategies without DPM consideration (see ❶) or consider per-core DVFS (see ❷). Our study shows that the partially separate and holistic strategies outperform each other on different platform sizes (in terms of #clusters or #cores; see Section 4). To the best of our knowledge, such insights are not found in existing approaches.
In large-scale systems (e.g., with more clusters/cores), the complexity and scalability of management strategies become challenging. As summarized in the survey [19], run-time strategy complexity can be handled by hybrid management, which reduces the run-time computation complexity based on design-time-prepared data. The scalability issue is usually resolved by a hierarchical management structure [11,20], which typically uses two levels, consisting of a global and a local management strategy. Based on the hybrid hierarchical management structure (HHMS) in [11], this paper studies the effectiveness of different management strategies (e.g., partially separate, holistic) on heterogeneous cluster-based platforms. The demonstration of the adaptability of the HHMS [11] (e.g., to DPM consideration and to different management strategies) is another novelty of this paper. In summary, the main contributions of this paper are as follows.
  • A 0–1 ILP model is presented to establish the dependencies between mapping, per-cluster DVFS, and per-core DPM. The objective is to optimize the dynamic and static energy of different sets of concurrently active applications (in varied use cases) on heterogeneous cluster-based platforms, taking into account possible run-time configuration overheads due to application migration and DPM mode switching.
  • With the 0–1 ILP model, different strategies (i.e., fully separate, partially separate, and holistic) are realized in the global and local management of the HHMS [11], based on design-time-prepared data. The differences and benefits of the management strategies are illustrated.
  • Experimental evaluations are performed for 1023 use cases (generated in random orders for 10 applications) and on different platform sizes (#clusters × #cores: 2 × 4, 2 × 8, 4 × 4, 4 × 8). The experiments offer insights into the effectiveness of the different management strategies (fully separate, partially separate, and holistic) in terms of application migration, energy efficiency, resource efficiency, and complexity.
The rest of this paper is organized as follows: Section 2 presents the background on HHMS [11] and the motivation for this paper. Section 3 formulates the 0–1 ILP model and presents the studied HHMS strategies (fully separate, partially separate, and holistic). Section 4 demonstrates and discusses the experimental results. Finally, Section 5 concludes this paper.

2. Background and Motivation

This section presents the background on HHMS [11] and illustrates the motivation of this paper. For reference, Table 2 lists the parameters and other variables used in this paper.

2.1. System Models and HHMS

Figure 1 depicts the HHMS in [11]. The management level of the system is linked to the application level and the platform level. At the application level (see part a), periodic multi-task applications are considered. The period of an application $app_i$ is denoted as $Period_{app_i}$. Each application consists of a set of computation tasks and a set of communication edges representing dependencies among the tasks. A set of simultaneously active applications is defined as a use case [3]. For instance, $u_1 = \{app_1, app_2\}$ refers to the use case consisting of two active applications.
At the platform level (see part b), the platform is composed of $J$ clusters. Each cluster $cluster_j$ contains $N_j$ cores. The core type within each cluster is the same, while core types can differ between clusters. The communication resources among clusters and within each cluster can be either a bus or a network-on-chip (NoC). In addition, the considered platform supports per-cluster DVFS: all cores within a cluster share the same $v/f$ level, the available discrete frequency levels of $cluster_j$ are denoted as $\{f_{j,1}, f_{j,2}, \ldots, f_{j,max}\}$, and the operating voltage is adapted to the frequency. The platform also supports per-core DPM, which can switch off any idle core to reduce leakage power/energy.
At the management level (see part c), the HHMS [11] performs hybrid management, where design-time-prepared data reduce the run-time computation burden while still guaranteeing mapping quality [21]. For each application, the design-time-prepared data include the following.
  • A set of mappings using different numbers of cores (i.e., different numbers of threads). $X_{app_i}^{c}$ denotes the prepared mapping of $app_i$ onto $c$ cores.
  • A set of minimum allowed frequencies (MAFs). The MAF is the minimum frequency that allows the application to respect its timing constraint. Each prepared mapping has its own MAF. $MAF_{i,j}^{c}$ denotes the MAF of $X_{app_i}^{c}$ on $cluster_j$.
For example, Figure 1 (c.1) presents the MAFs prepared for the design-time mappings of $app_1$, which use one core ($X_{app_1}^{1}$) and two cores ($X_{app_1}^{2}$). $MAF_{1,j}^{2}$ is smaller than $MAF_{1,j}^{1}$ because the higher performance of more threads ($X_{app_1}^{2}$) allows execution at a lower frequency level under the timing constraint ($Period_{app_i}$). In particular, $MAF_{i,j}^{c_{max}}$ denotes the MAF of the design-time mapping that uses the maximum number of cores for $app_i$; it is the minimum of all prepared MAFs of an application (e.g., $MAF_{1,j}^{2}$ for $app_1$ in part (c.1), $MAF_{2,j}^{3}$ for $app_2$ in part (c.2) of Figure 1).
Based on the design-time-prepared mappings and MAFs, the HHMS [11] performs run-time management via two hierarchical levels for scalability. For each use case, the global management (GM) determines the application-to-cluster mapping, wherein each application is assigned to one cluster to avoid costly communication among clusters. For the applications assigned to each cluster, the local management (LM) determines the task-to-core mapping. A run-time management example is shown in Figure 1 (c.3), where $app_1$ and $app_2$ (e.g., in $u_1$) are mapped to the same cluster ($cluster_j$) and the design-time mappings are selected according to the MAFs. Due to per-cluster DVFS, all applications mapped to $cluster_j$ must execute at the same frequency level. Let $f_j$ denote the run-time configured frequency of $cluster_j$. In [11], the selected mappings correspond to those supporting the lowest common frequency (denoted by $MAF_j^{LC}$), which is equal to the highest $MAF_{i,j}^{c_{max}}$ ($MAF_{1,j}^{2}$ in this example) of all applications assigned to $cluster_j$. Then the selected mappings (e.g., $X_{app_1}^{2}$ and $X_{app_2}^{2}$, as indicated by the red dotted line in Figure 1 (c.3)) can be combined by a task-to-core mapping strategy at $f_j = MAF_j^{LC}$.
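To make the MAF-based selection concrete, the following minimal Python sketch (Python being the implementation language used in Section 4) shows how $MAF_j^{LC}$ and the corresponding design-time mappings could be computed. It is not the authors' implementation: the MAF table, function names, and numbers (in GHz) are illustrative assumptions.

```python
# A minimal sketch (not the authors' implementation) of the MAF-based
# selection in the HHMS: compute MAF_j^LC for one cluster and pick, for each
# application, a prepared mapping feasible at a given frequency.
MAF = {
    "app1": {1: 1.2, 2: 0.8},          # X_app1^1, X_app1^2
    "app2": {1: 1.4, 2: 1.0, 3: 0.6},  # X_app2^1 .. X_app2^3
}

def maf_cmax(app):
    """MAF of the mapping with the maximum core count: the app's minimum MAF."""
    return min(MAF[app].values())

def lowest_common_maf(apps_in_cluster):
    """MAF_j^LC: the highest MAF^{c_max} among the apps assigned to the cluster."""
    return max(maf_cmax(a) for a in apps_in_cluster)

def select_mappings(apps_in_cluster, f_j):
    """Pick, per app, the prepared mapping with the fewest cores whose MAF
    does not exceed f_j (None if the app cannot run at f_j)."""
    return {a: min((c for c, m in MAF[a].items() if m <= f_j), default=None)
            for a in apps_in_cluster}

f_lc = lowest_common_maf(["app1", "app2"])            # 0.8 GHz with this data
print(f_lc, select_mappings(["app1", "app2"], f_lc))  # {'app1': 2, 'app2': 3}
```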
The HHMS in [11] simplifies the mapping problem by linking task-to-core mapping to the cluster frequency configuration (per-cluster DVFS). This is why the HHMS in [11] is selected as the baseline management structure in this paper. From here on, mapping refers to application-to-cluster mapping by default if not specified otherwise. Based on the HHMS, different GM and LM strategies can be used according to the optimization objectives. In this paper, the LM uses the straightforward first-come-first-served (FCFS) strategy [11], whereby the design-time-prepared mapping is directly used at run time. Each core may be mapped with tasks of only one application, and the number of used cores in a cluster is the sum of the cores used by the selected mappings. FCFS is selected due to its simplicity and feasibility in both virtual-prototype simulation (e.g., gem5) [7] and real operating systems [4]. More details about the system model and the HHMS can be found in [11].

2.2. Motivation of This Paper

This paper aims to study the effectiveness of different management strategies (applying mapping, DVFS, and DPM in fully separate, partially separate, and holistic manners) on heterogeneous cluster-based platforms based on the HHMS [11]. To better illustrate the motivation, this section first introduces the limitations of [11] (see its benefits in Section 2.1) and then discusses the possible energy improvements when using the HHMS. Figure 2 shows the energy optimization possibilities of $u_1 = \{app_1, app_2\}$. For clarity, let $Ma_{i,j}$ denote the run-time application-to-cluster mapping of $app_i$ to $cluster_j$. For the two applications on a two-cluster platform, there are four (i.e., $J^I$ in general) application-to-cluster mapping possibilities (column ❹ of Figure 2). The first two mappings (in case 1, $\{Ma_{1,1}, Ma_{2,1}\}$ and $\{Ma_{1,2}, Ma_{2,2}\}$) map the two applications to the same cluster. The third and fourth mappings (in case 2, $\{Ma_{1,1}, Ma_{2,2}\}$ and $\{Ma_{1,2}, Ma_{2,1}\}$) divide the two applications between two different clusters.
Different application-to-cluster mappings lead to different cluster frequency ($f_j$) configurations as well as different task-to-core mappings (due to different design-time $X_{app_i}^{c}$ selections). As explained in Section 2.1, the work of [11] sets $f_j$ according to the lowest common MAF (i.e., $MAF_j^{LC}$, highlighted in red in column ❺) of the applications mapped to the same cluster. Once $f_j$ is selected for each cluster, the selection of the design-time $X_{app_i}^{c}$ follows from the MAFs [11]. In the example of $\{Ma_{1,1}, Ma_{2,1}\}$, $f_j = MAF_{2,1}^{2}$ indicates the task-mapping selections $X_{app_1}^{2}$ and $X_{app_2}^{2}$ (see the red dotted line in part b of Figure 2). Thus, the number of used cores in $cluster_1$ is 4 ($N_1^{used} = 4$, because each selected design-time mapping uses two cores/threads). On the other hand, when each cluster has a limited number of available cores (e.g., $N_j = 3$), the design-time mappings selected based on $MAF_j^{LC}$ [11] can be infeasible (e.g., in case 1, because $N_j^{used}(X_{app_1}^{2}, X_{app_2}^{2}) > N_j$). The strategy of [11] discards the infeasible task-to-core mappings (at $f_j = MAF_j^{LC}$) in case 1 and continues to explore energy-efficient configurations (mapping and $f_j$) only in case 2.
If task-mapping exploration is allowed at $f_j > MAF_j^{LC}$, the overall energy consumption can be further optimized under resource constraints (see column ❻). In case 1, when $f_j$ shifts from $MAF_{1,j}^{2}$ (in red) to $MAF_{2,j}^{1}$ (in blue), the selected mappings shift to $X_{app_1}^{2}$ and $X_{app_2}^{1}$ accordingly. $N_j^{used}$ can then be reduced to 3, satisfying the resource constraint. In this way, the cluster frequencies are minimized (in both case 1 and case 2) for dynamic energy (as $Power_{dynamic} \propto V_{dd}^2 \times f_j$ [22], where $V_{dd}$ is the supply voltage). Moreover, allowing task-mapping exploration at $f_j > MAF_j^{LC}$ enables tradeoffs between dynamic and static energy (see column ❼). In case 1, if $f_j$ increases from $MAF_{2,j}^{1}$ (in blue) to $MAF_{1,j}^{1}$ (in purple), $N_j^{used}$ can decrease from 3 to 2. This allows switching off one core (DPM) to save static energy. Due to the different static and dynamic energy tradeoffs, a lower overall system energy can be achieved.
However, for each application-to-cluster mapping (in case 1 and case 2), there can be many possible cluster frequency (or MAF) configurations and core switching-off decisions. The dependencies between application-to-cluster mapping, per-cluster DVFS, and DPM (observed from ❹, ❻, and ❼) motivate this paper to develop management strategies (in fully separate, partially separate, and holistic manners). The optimization metrics include low energy consumption of the overall system and low strategy complexity. Additionally, the developed strategies should take into account run-time configuration overheads, such as DPM mode-switching overheads and application-migration overheads (e.g., due to different application-to-cluster mappings from one use case to another). Notice that the run-time configuration overheads are often neglected in existing works [8,9,10,11,12,13].
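The following toy Python sketch makes the tradeoff of columns ❻/❼ concrete with invented numbers (the MAF table and the power coefficients XI and BETA are assumptions, not measured data): raising $f_j$ lets applications use fewer-thread prepared mappings, freeing cores for DPM, but raises dynamic power per core.

```python
# Toy numbers for the tradeoff in columns ❻/❼: raising f_j lets applications
# use fewer-thread prepared mappings, freeing cores for DPM to switch off.
MAF = {"app1": {1: 1.2, 2: 0.8}, "app2": {1: 1.4, 2: 1.0}}
XI, BETA = 1.0, 0.5  # assumed dynamic (XI*f^3) and static (BETA*f^2) models

for f in [1.0, 1.2, 1.4]:  # candidate cluster frequencies (GHz)
    n_used = sum(min(c for c, m in mafs.items() if m <= f)
                 for mafs in MAF.values())  # fewest feasible cores per app
    e_dyn, e_sta = XI * f**3 * n_used, BETA * f**2 * n_used
    print(f"f_j={f}: N_used={n_used}, E_dyn={e_dyn:.2f}, E_static={e_sta:.2f}")
# with N_j = 4 the lowest frequency wins; with N_j = 3 it is infeasible, and
# the best point depends on the static/dynamic ratio (the SER in Section 4)
```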

3. 0–1 ILP Model Formulation and Proposed Solution

In this section, the energy optimization problem is formulated into a 0–1 ILP model. Based on the problem formulation and HHMS [11], different management strategies (fully separate, partially separate, and holistic) are presented.

3.1. 0–1 ILP Problem Formulation

Given: A set of periodic multitask applications (in a use case) to be executed concurrently on a heterogeneous cluster-based platform supporting per-cluster DVFS and per-core DPM.
Find: The application-to-cluster mapping, task-to-core mapping, cluster frequencies, and cores to switch off, when the use case varies (e.g., an application finishes or joins).
  • Application-to-cluster mapping: The mapping of the active applications (in each use case) on a cluster-based platform is defined by a matrix of binary variables $[a_{i,j}]_{I \times J}$, where
$$a_{i,j} = \begin{cases} 1, & \text{if } app_i \text{ is assigned to } cluster_j, \\ 0, & \text{otherwise.} \end{cases} \qquad (1)$$
  • Cluster frequency configurations (DVFS): The optimized $f_j$ is set according to the MAFs of all applications mapped to the cluster. $f_j$ must not be smaller than the lowest common MAF ($MAF_j^{LC}$), which is the maximum $MAF_{i,j}^{c_{max}}$ value over all applications assigned to $cluster_j$ (Equation (2)). In addition, the $f_j$ configuration must lie within the cluster frequency range (Equation (3)).
$$f_j \geq MAF_j^{LC} = \max_i \{ MAF_{i,j}^{c_{max}} \times a_{i,j} \} \qquad (2)$$
$$f_j \in \{f_{j,1}, f_{j,2}, \ldots, f_{j,max}\}, \quad \forall j. \qquad (3)$$
  • Task-to-core mapping: The mapping of tasks to cores must respect the resource constraint in each cluster: $N_j^{used}$ must not exceed the number of available cores ($N_j$). $N_j^{used}$ depends on the application-to-cluster mapping variables and on the applied local management strategy. This paper uses the FCFS [11] task-to-core mapping strategy, so $N_j^{used}$ is the sum of the cores used by the selected $X_{app_i}^{c}$ (corresponding to $f_j$) in each cluster.
$$N_j \geq N_j^{used}([a_{i,j}]_{I \times J}, f_j, X_{app_i}^{c}). \qquad (4)$$
  • Switching-off cores (DPM): The idle cores without any mapped tasks can be switched off. Note that Equations (4) and (5) highlight the dependencies between mapping (at both the application level and the task level), frequency configuration, and DPM.
$$N_j^{off} = N_j - N_j^{used} \qquad (5)$$
Objective: Minimizing the energy consumption of the overall system. The system energy is the sum of the dynamic energy ($E_{sys}^{dynamic}$), the static energy ($E_{sys}^{static}$), and the run-time configuration overheads ($E_{overhead}^{DPM} + E_{overhead}^{migration}$). In our assumptions, the DVFS overhead is negligible compared to $E_{overhead}^{DPM}$ [17]. Moreover, $E_{overhead}^{migration}$ refers to the cost of migrating an application from one cluster to another, whereas the cost of migrating tasks at the core level is assumed to be negligible.
$$\min E_{sys} = E_{sys}^{dynamic} + E_{sys}^{static} + E_{overhead}^{DPM} + E_{overhead}^{migration} \qquad (6)$$
Subject to: Meeting the platform resource constraints and the performance constraint of each application, as expressed in Equations (2)–(4).
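Because $N_j^{used}$ couples the mapping variables, $f_j$, and the selected $X_{app_i}^{c}$, the model is easier to illustrate by enumeration than through a linear solver. The following brute-force Python sketch evaluates the model for a toy two-cluster platform; all data (MAF tables, frequency levels, power coefficients) are invented, and the overhead terms of Equation (6) are omitted for brevity.

```python
# A brute-force sketch of the 0-1 ILP model (Equations (1)-(6)) on a toy
# two-cluster platform; enumeration is used instead of a linear solver.
import itertools

APPS = ["app1", "app2"]
CLUSTERS = {  # per-cluster frequency levels (GHz), cores, power coefficients
    0: {"freqs": [0.4, 0.8, 1.2], "cores": 4, "xi": 0.5, "beta": 0.05},
    1: {"freqs": [0.4, 1.0, 2.0], "cores": 4, "xi": 1.2, "beta": 0.12},
}
MAF = {  # MAF[app][cluster][c]: MAF of the c-core prepared mapping
    "app1": {0: {1: 1.2, 2: 0.8}, 1: {1: 0.8, 2: 0.4}},
    "app2": {0: {1: 1.2, 2: 0.8}, 1: {1: 1.0, 2: 0.4}},
}

def cluster_energy(j, apps, f):
    """Cluster energy at frequency f, or None if constraints fail."""
    n_used = 0
    for a in apps:
        feasible = [c for c, m in MAF[a][j].items() if m <= f]  # Eqs. (2)-(3)
        if not feasible:
            return None
        n_used += min(feasible)
    if n_used > CLUSTERS[j]["cores"]:                           # Eq. (4)
        return None
    # idle cores are switched off (Eq. (5)): only used cores consume energy
    k = CLUSTERS[j]
    return k["xi"] * f**3 * n_used + k["beta"] * f**2 * n_used

best = None
for assign in itertools.product(CLUSTERS, repeat=len(APPS)):  # a_ij, Eq. (1)
    total, ok = 0.0, True
    for j in CLUSTERS:
        apps = [a for a, jj in zip(APPS, assign) if jj == j]
        if not apps:
            continue
        candidates = [cluster_energy(j, apps, f) for f in CLUSTERS[j]["freqs"]]
        candidates = [e for e in candidates if e is not None]
        if not candidates:
            ok = False
            break
        total += min(candidates)  # best f_j for this cluster
    if ok and (best is None or total < best[0]):
        best = (total, assign)
print(best)  # minimum E_sys (overheads omitted) and its application mapping
```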

3.2. Proposed Management Strategies

Solving the above 0–1 ILP optimization problem is computationally expensive: obtaining the optimal task mapping on a multi-/many-core system is already an NP-hard problem [3], and combining mapping with DVFS and DPM makes the exploration space even larger. In order to achieve optimized 0–1 ILP solutions in a reasonable time, this paper uses heuristic strategies based on the HHMS [11]. Inspired by the existing works (listed in Table 1), three heuristic management strategies are proposed that apply application mapping, DVFS, and DPM in fully separate, partially separate, and holistic manners, denoted as Mapping → DVFS → DPM, Mapping & DVFS → DPM, and Mapping & DVFS & DPM, respectively.
Figure 3 depicts the flowcharts of the three management strategies, which are implemented in the HHMS [11]. The three strategies take as input the active applications (in a use case detected at run time), the platform model, and the design-time-prepared data (Section 2.1), and produce an optimized mapping (application-to-cluster and task-to-core), per-cluster DVFS configuration, and per-core DPM configuration as output. In the run-time-detected use case, the concurrently active applications are considered one after another (e.g., in random order). Based on the HHMS and the formulated 0–1 ILP model, the global management (GM) determines the application-to-cluster mapping (Equation (1)). After one application is mapped to a cluster, the next application is considered. The local management (LM) in each cluster determines the task-to-core mapping, the cluster frequency configuration (with regard to the application performance constraints and platform resource constraints, Equations (2)–(4)), and the core switching-off decisions (Equation (5)).

3.2.1. Fully Separate: Mapping → DVFS → DPM

Figure 3a illustrates the fully separate management strategy, which fixes the three system configurations sequentially. The GM uses the lowest power first (LPF) strategy of [4,23] for application-to-cluster mapping. LPF is a heuristic that is often used for ARM big.LITTLE architectures [1]. For each active application, LPF first tries to map it to the lowest-power cluster (e.g., the LITTLE cluster). When the lowest-power cluster cannot satisfy the system constraints (application performance constraints or platform resource constraints), LPF maps the application to a higher-power cluster (e.g., the big cluster). This means that LPF gives the clusters different priorities for application-to-cluster mapping. Once the GM has fixed every application mapping by LPF, the LM minimizes the cluster frequencies ($f_j$) to optimize $E_{sys}^{dynamic}$ (which can include $E_{overhead}^{migration}$). Finally, the LM switches off the idle cores in each cluster to reduce $E_{sys}^{static}$ (which can include $E_{overhead}^{DPM}$). In the fully separate strategy, the three system configurations (application-to-cluster mapping, DVFS, and DPM) are divided into three subsequent steps (Mapping → DVFS → DPM), and each earlier configuration is fixed independently of the later ones.
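A minimal sketch of the LPF idea follows; the function and the feasibility callback are hypothetical stand-ins, not the interface of [4,23].

```python
# A minimal sketch of lowest power first (LPF): try clusters in ascending
# power order and take the first feasible one.
def lpf_map(app, clusters_by_power, is_feasible):
    """clusters_by_power: cluster ids sorted from lowest power (LITTLE) to
    highest (big); is_feasible(app, j): constraints hold on cluster j."""
    for j in clusters_by_power:
        if is_feasible(app, j):
            return j  # first (lowest-power) feasible cluster wins
    return None  # no feasible cluster for this application

# toy usage: cluster 0 = LITTLE, cluster 1 = big; the app only fits on big
print(lpf_map("app1", [0, 1], lambda app, j: j == 1))  # -> 1
```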

3.2.2. Partially Separate: Mapping & DVFS → DPM

Figure 3b describes the partially separate management strategy. The strategy is a variant of [11] that considers the mutual impact between application mapping and the cluster frequency ($f_j$) configurations. The GM maps each application, via a greedy strategy, to the cluster leading to the lowest $E_{sys}^{dynamic}$ (which can include $E_{overhead}^{migration}$ in Equation (6)). For each active application, the GM attempts to map the application to every cluster and explores task-to-core mapping at $f_j \geq MAF_j^{LC}$ until a feasible task mapping is obtained (as shown in Figure 3d, in blue; recall ❻). The GM compares the resulting $E_{sys}^{dynamic}$ (which can include $E_{overhead}^{migration}$) of the different application-to-cluster mappings for each application and takes the mapping solution leading to the minimum energy value. The GM decision is thus highly dependent on the LM decisions (e.g., $f_j$ and the task-to-core mapping). The application-to-cluster mapping configuration (in the GM) and the cluster $f_j$ (indicating the task-to-core mapping, in the LM) are involved in a loop (see the red loop in Figure 3b), which indicates that mapping and DVFS are considered holistically (Mapping & DVFS). After these two configurations are fixed for all active applications in the current use case, the LM switches off idle cores to optimize $E_{sys}^{static} + E_{overhead}^{DPM}$. This means that DPM is separated from both the mapping and DVFS configurations (Mapping & DVFS → DPM) in the partially separate strategy. It is worth mentioning that, compared to the work of [11], which fixes the task-to-core mapping at $f_j = MAF_j^{LC}$ (recall ❺), the proposed partially separate strategy allows a larger exploration space of task-to-core mappings at $f_j \geq MAF_j^{LC}$ (see Figure 3d in blue; recall ❻).
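The following sketch outlines the greedy GM loop of the partially separate strategy; `lm_dynamic_energy` is a hypothetical stand-in for the LM search over $f_j \geq MAF_j^{LC}$ described above, not the authors' code.

```python
# A sketch of the partially separate GM loop (Mapping & DVFS -> DPM): every
# cluster is tried for the new application, the (hypothetical) LM callback
# returns the cluster's minimum dynamic energy over f_j >= MAF_j^LC (None if
# infeasible), and the lowest-energy cluster wins. DPM runs only afterwards.
def ps_map_one(app, placed, clusters, lm_dynamic_energy):
    """placed: current app -> cluster dict for the use case."""
    best_j, best_e = None, None
    for j in clusters:
        apps = [a for a, jj in placed.items() if jj == j] + [app]
        e = lm_dynamic_energy(j, apps)  # mapping and DVFS explored together
        if e is not None and (best_e is None or e < best_e):
            best_j, best_e = j, e
    return best_j

# toy usage: a stand-in LM makes cluster 1 cheaper for the new application;
# after all apps are placed, DPM switches off N_j - N_j^used cores (Eq. (5))
print(ps_map_one("app2", {"app1": 0}, [0, 1],
                 lambda j, apps: (2.0 if j == 0 else 1.5) * len(apps)))  # -> 1
```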

3.2.3. Holistic: Mapping & DVFS & DPM

Figure 3c illustrates the holistic management strategy, which considers the mutual influence of all three system configurations. In this holistic strategy, the GM maps each application, via a greedy strategy, to the cluster leading to the lowest $E_{sys}^{dynamic} + E_{sys}^{static}$ (which can include $E_{overhead}^{migration} + E_{overhead}^{DPM}$ in Equation (6)). For each active application, the GM attempts to map the application to every cluster and explores task-to-core mapping at $MAF_j^{LC} \leq f_j \leq f_{j,max}$. This means that the task-to-core mapping exploration continues even after the resource constraints are met. The optimized $f_j$ can be set to $f_j^*$ (Figure 3d in purple; recall ❼) so that fewer cores are used, which provides more opportunities for switching off cores to optimize $E_{sys}^{static}$. The GM compares the resulting $E_{sys}^{dynamic} + E_{sys}^{static}$ (which can include $E_{overhead}^{migration} + E_{overhead}^{DPM}$) of the different application-to-cluster mappings for each application and takes the mapping solution leading to the minimum energy value. In the holistic strategy, the GM and LM coordinate with each other to explore all three system configurations (mapping, DVFS, and DPM) in the same loop (see the inner loop in red); any change in one configuration triggers an update of the other two (Mapping & DVFS & DPM). In contrast to the separate strategies (fully and partially), which split dynamic and static energy optimization into two steps, the holistic strategy simultaneously optimizes the dynamic energy, static energy, and run-time overheads of the overall system.
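The following sketch shows the LM exploration that distinguishes the holistic strategy: the frequency scan continues up to $f_{j,max}$, and the static energy of the cores left switched on enters the objective. The MAF data and coefficients are invented; with the large static coefficient chosen here, the higher $f_j^*$ that frees a core wins.

```python
# A sketch of the holistic LM search: the frequency scan covers the whole
# range up to f_max instead of stopping at the first feasible mapping, and
# the objective adds the static energy of the cores left switched on.
def holistic_lm(cluster_mafs, freqs, n_cores, xi, beta):
    """cluster_mafs: {app: {c: MAF}} for the apps assigned to this cluster.
    Returns (energy, f*, n_used) or None if no frequency is feasible."""
    best = None
    for f in freqs:
        n_used = 0
        for mafs in cluster_mafs.values():
            feasible = [c for c, m in mafs.items() if m <= f]
            if not feasible:
                n_used = None  # f violates this app's performance constraint
                break
            n_used += min(feasible)  # fewest-core feasible prepared mapping
        if n_used is None or n_used > n_cores:
            continue
        # DPM inside the objective: only the used cores consume power
        e = xi * f**3 * n_used + beta * f**2 * n_used
        if best is None or e < best[0]:
            best = (e, f, n_used)  # f* may exceed MAF_j^LC to free a core
    return best

# with this (invented) large static coefficient, f* = 1.4 with 2 cores beats
# the first feasible frequency 1.2 with 3 cores
print(holistic_lm({"app1": {1: 1.2, 2: 0.8}, "app2": {1: 1.4, 2: 1.0}},
                  [0.8, 1.0, 1.2, 1.4], 3, 1.0, 1.0))
```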

4. Experimental Results

4.1. Simulation Setup

All the management strategies are implemented in Python running on a PC with an Intel Core i7-9700 CPU at 3.00 GHz. In the experimental evaluations, the platform model, the application model, and the design-time-prepared data are set according to the HHMS work of [11]. First, the heterogeneous cluster-based platforms follow big.LITTLE architectures using ARM Cortex-A series processors (A9, A7, A15, and A17; the four core types have different power consumption (1:0.55:2.25:1.3 relative to the A9) and performance (1:1.25:0.625:0.645 relative to the A9) characteristics, and their frequency ranges are 0.4–2.0 GHz, 0.4–1.4 GHz, 0.4–2.0 GHz, and 0.4–1.8 GHz, respectively, with a 0.1 GHz step). Two-cluster (A7, A15) and four-cluster (A9, A7, A15, A17) platforms are considered, and each cluster contains four or eight cores. Second, as in [11], 10 applications (derived from an H263 encoder, an H263 decoder, and a JPEG decoder with different periods) are captured as SDF models [24]. All the possible (i.e., 1023) use cases of the 10 considered applications are generated in random sequences, and the use cases have different numbers of active applications. Third, the design-time data (for the run-time configurations in the HHMS) are also prepared according to [11]. For each of the 10 applications, the design-time-prepared data include the design-time mappings (using different numbers of cores/threads) and their corresponding MAFs (meeting the performance constraint) for the four core types.
Notice that the energy-management strategies proposed in this paper do not depend on the power models used. For experimental evaluation purposes, this paper uses the dynamic power model $Power_{dynamic}(cluster_j, f_j) = \xi \times f_j^3$ ($\xi$ is a coefficient that depends on the application tasks and the assigned core type) for ARM big.LITTLE architectures [11,25], which implies a linear correlation between $f_j$ and the supply voltage $V_{dd}$. Starting from the generic static power model $Power_{static} = V_{dd} \cdot I_{leak}(V_{dd}, T)$ [22] ($I_{leak}$ is the leakage current, which depends on the supply voltage and the cores' temperature $T$), this paper converts the static power model into a frequency-squared relationship ($Power_{static} \propto f^2$). This static model may not be perfectly accurate, but it roughly fits the real data measured on the ODROID XU3 platform [1] with a coarse-grained modeling process. To take into account the variation between platforms (different technology nodes), a frequency-linear model ($Power_{static} \propto f$) is also discussed in the experiments.
This paper introduces the static energy ratio (SER) to give an intuition of the distribution of dynamic and static power/energy in the system. The SER is the ratio of static energy to dynamic energy (i.e., $SER = E_{sys}^{static} / E_{sys}^{dynamic}$) for the 10 applications executing in each cluster at 1 GHz. The coefficient ($\beta$) of the static power model ($Power_{static} = \beta \times f^2$) is set for each cluster (core type) according to the SER. In the experimental evaluations, the SER is set to 0.02 (or 0.05, 0.1, or 0.2), meaning that the static energy is 2% (or 5%, 10%, or 20%) of the dynamic energy at 1 GHz. Different SERs are used to evaluate the effectiveness of the proposed management strategies on platforms with different power/energy characteristics.
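A small sketch of how the evaluation's power models could be instantiated, with $\beta$ calibrated from a target SER at 1 GHz, is given below; the $\xi$ value is a placeholder, not a measured coefficient.

```python
# A sketch of instantiating the evaluation's power models, calibrating the
# static coefficient beta from a target SER at the 1 GHz reference point.
def calibrate_beta(xi, ser, f_ref=1.0):
    """Choose beta so that beta*f_ref^2 = ser * xi*f_ref^3, i.e., static
    power is the given fraction of dynamic power at f_ref."""
    return ser * xi * f_ref

def cluster_power(xi, beta, f, n_used):
    p_dyn = xi * f**3 * n_used       # Power_dynamic = xi * f^3 per used core
    p_static = beta * f**2 * n_used  # switched-off cores contribute nothing
    return p_dyn, p_static

beta = calibrate_beta(xi=1.0, ser=0.05)  # SER = 5% at 1 GHz
print(cluster_power(1.0, beta, 1.0, 4))  # static/dynamic ratio is 0.05 here
```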

Evaluated Hybrid-Hierarchical Management Strategies

This section introduces the management strategies that are compared in the experiments. All the considered management strategies are based on the same design-time-prepared data (i.e., the design-time task-to-core mappings and MAFs). The fully separate, partially separate, and holistic management strategies have been described in Section 3.2. To compare the energy consumption, resource utilization, and complexity of the three management strategies, exhaustive and genetic strategies are used as baselines.
  • Exhaustive: The exhaustive strategy explores all possible system configurations, i.e., application-to-cluster mappings, task-to-core mappings (indicating the DVFS or cluster frequency; see Section 2), and DPM decisions. It aims to achieve the minimum energy (including $E_{sys}^{dynamic}$, $E_{sys}^{static}$, $E_{overhead}^{DPM}$, and $E_{overhead}^{migration}$) of the overall system while satisfying the application performance (e.g., period) constraints and resource constraints.
  • Genetic: The genetic strategy is based on a standard genetic algorithm [26]. It initializes a population of individuals consisting of sets of random chromosomes. Each chromosome consists of two genes, indicating the application-to-cluster mapping and the task-to-core mapping selection, respectively. The fitness function is defined as $\max(1/E_{sys})$ (see Equation (6); $E_{sys}$ is the system energy in a use case). A sketch of this baseline is given after this list.
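A toy sketch of such a genetic baseline follows; the chromosome encoding, rates, and the stand-in energy function are illustrative assumptions rather than the exact setup of [26].

```python
# A toy genetic baseline: each chromosome holds, per application, a cluster
# gene and a prepared-mapping (thread-count) gene; fitness is 1/E_sys.
# The energy callback is a stand-in (assumed positive); a real evaluation
# would apply Equations (1)-(6).
import random

def evolve(n_apps, n_clusters, max_threads, energy, pop=20, gens=50):
    def rand_chrom():
        return [(random.randrange(n_clusters), random.randint(1, max_threads))
                for _ in range(n_apps)]
    population = [rand_chrom() for _ in range(pop)]
    for _ in range(gens):
        population.sort(key=lambda ch: 1.0 / energy(ch), reverse=True)  # fitness
        parents = population[: pop // 2]          # keep the fitter half
        children = []
        while len(children) < pop - len(parents):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, n_apps) if n_apps > 1 else 0
            child = a[:cut] + b[cut:]             # one-point crossover
            if random.random() < 0.1:             # mutate one gene pair
                i = random.randrange(n_apps)
                child[i] = (random.randrange(n_clusters),
                            random.randint(1, max_threads))
            children.append(child)
        population = parents + children
    return min(population, key=energy)

# toy usage with a stand-in energy function
print(evolve(n_apps=2, n_clusters=2, max_threads=2,
             energy=lambda ch: 1.0 + sum(j + 1.0 / c for j, c in ch)))
```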
The HHMS-based management strategies are compared in special and general cases.

4.2. Special Case

In the special case, it is assumed that each use case starts on an empty platform, which means that the use cases are independent. All the active applications (in a use case) are considered for the mapping, DVFS, and DPM configurations. Intuitively, the energy-saving benefits (due to good system configurations) increase with the execution time, whereas the run-time overheads (e.g., $E_{overhead}^{DPM}$, $E_{overhead}^{migration}$; recall Section 3) depend on the DPM and mapping/migration decisions rather than on the system execution time. If the system execution time is long enough (as in the special case), the run-time overheads can be neglected. In the special case, each use case is therefore assumed to be active for a sufficiently long time (e.g., 100 times the least common multiple (LCM) of the active application periods in each use case) to neglect the possible run-time overheads.

4.2.1. Evaluation of Different Platform Sizes

To gain better insight into heterogeneous cluster-based systems, this experiment compares the energy-optimal solutions (obtained by the exhaustive strategy) for different platform sizes (#clusters × #cores: 2 × 4, 2 × 8, 4 × 4, 4 × 8), different SERs (0.02, 0.1, 0.2), and different static power models ($Power_{static} \propto f^2$ and $Power_{static} \propto f$). Figure 4 shows the comparison of the overall energy ($E_{sys}^{dynamic} + E_{sys}^{static}$) and the total number of used cores ($N_{all}^{used}$) per use case (on average, normalized to 2 × 4 and SER = 0.02). Notice that there can be infeasible solutions for a big use case (e.g., >8 applications) on a small platform such as 2 × 4. For a fair comparison, only the 1002 use cases that are feasible on all four platform sizes are considered.
Three main observations can be made from Figure 4. First, for a fixed SER and a fixed static power model, $E_{sys}$ decreases with increasing platform size (#clusters × #cores): the 2 × 4 platform has the highest $E_{sys}$, followed by the 2 × 8, 4 × 4, and 4 × 8 platforms. This is the result of the $E_{sys}^{dynamic}$ and $E_{sys}^{static}$ tradeoffs.
  • $E_{sys}^{dynamic}$ dominates the overall energy of each platform (see Figure 4a). As reflected by Equation (2), $E_{sys}^{dynamic}$ is highly dependent on the core types and cluster frequencies, i.e., on the selected MAF in each cluster. Generally speaking, more clusters/cores can lead to a lower $E_{sys}^{dynamic}$: applications with different MAFs can be separated over more clusters, so lower cluster frequencies ($f_j$) can be achieved. In particular, more available cores per cluster allow design-time mappings with more threads to be selected for each application, which can reduce the common MAF ($f_j$; recall Figure 2). This observation is aligned with the work of [11].
  • $E_{sys}^{static}$ is highly dependent on the number of used cores ($N_{all}^{used}$) and on $f_j$ (considering the static power model used). The 2 × 4 platform has the highest static energy due to its high $f_j$ (reflected by $E_{sys}^{dynamic}$). The 2 × 8 platform has the second-highest static energy due to its large $N_{all}^{used}$ (see Figure 4b).
Secondly, for the same platform size and static power model (e.g., $Power_{static} \propto f^2$), when the SER grows from 0.02 to 0.2, $E_{sys}^{dynamic}$ slightly increases, $E_{sys}^{static}$ significantly increases, and $N_{all}^{used}$ decreases. This is because, with a growing SER, optimizing $E_{sys}^{static}$ becomes more important for the overall energy optimization. The exhaustive strategy puts more effort into $E_{sys}^{static}$ optimization (by reducing $N_{all}^{used}$) and less into $E_{sys}^{dynamic}$ optimization (which would otherwise increase $N_{all}^{used}$ to allow more threads in each cluster). This indicates that the optimal configuration of mapping, DVFS, and DPM also depends on the SER characteristic of the platform.
Thirdly, comparing the results of the $Power_{static} \propto f^2$ and $Power_{static} \propto f$ models, one significant difference lies in $N_{all}^{used}$. Mathematically, the latter model depends less on the cluster frequency, which means it depends relatively more on $N_{all}^{used}$. Thus, as the SER grows, the exhaustive strategy puts more effort into reducing $N_{all}^{used}$ (compared to the former static power model; see the slopes in Figure 4b).

4.2.2. Evaluation of Different Management Strategies

This experiment evaluates the proposed management strategies (fully separate, partially separate, and holistic) on different platform sizes and with different SERs (using the $Power_{static} \propto f^2$ model). Figure 5 shows $E_{sys}$ ($E_{sys}^{dynamic} + E_{sys}^{static}$) and $N_{all}^{used}$. The exhaustive and genetic strategies are used as comparison baselines, and all results are normalized to the exhaustive results at SER = 0.02.
As shown in Figure 5, the exhaustive method achieves the energy-optimal solutions. Notice that the exhaustive strategy does not necessarily lead to the smallest $N_{all}^{used}$, as it gives $E_{sys}$ optimization the highest priority. The genetic strategy achieves the $E_{sys}$ closest to the exhaustive strategy (up to a 12.4% difference) on the 2 × 4 platform. However, the genetic strategy's behavior worsens as the platform size increases (especially on the 4 × 8 platform): because larger platforms lead to a larger exploration space, it is difficult for the genetic strategy to find optimized solutions within a time reasonable for a run-time system. Moreover, the fully separate (FS in Figure 5) strategy leads to the highest $E_{sys}$ (up to a 42.1% difference from exhaustive). This strategy uses LPF [4] for application-to-cluster mapping (recall Section 3.2.1), which gives the highest mapping priority to the lowest-power cluster. As most active applications (in a use case) are assigned to one cluster, that cluster's frequency becomes higher (due to a higher common MAF; recall Section 2.2). This can lead to a large increase in $E_{sys}^{dynamic}$ and hence in $E_{sys}$.
In contrast, both the partially separate (PS in Figure 5) and holistic (H in Figure 5) strategies achieve an $E_{sys}$ close to exhaustive (e.g., about a 20% difference on the 2 × 4 platform and a 5% difference on the 4 × 8 platform). The partially separate strategy achieves better energy optimization on the two-cluster platforms (2 × 4 and 2 × 8). This strategy first optimizes $E_{sys}^{dynamic}$ (setting low cluster frequencies via DVFS) and then optimizes $E_{sys}^{static}$ (via DPM). This indicates that $E_{sys}^{dynamic}$ optimization deserves a higher priority than $E_{sys}^{static}$ optimization on small platforms (e.g., 2 clusters). On the other hand, the holistic strategy achieves lower energy on the four-cluster platforms (4 × 4 and 4 × 8). It reduces the importance of $E_{sys}^{dynamic}$ optimization (with regard to the partially separate strategy), which explains why its $E_{sys}^{dynamic}$ is higher than that of the partially separate strategy (see Figure 5(a.1–a.4)). It can be concluded that $E_{sys}^{static}$ optimization becomes as important as $E_{sys}^{dynamic}$ optimization on large platforms (e.g., four clusters). Additionally, when SER = 0.02, the partially separate strategy is always more energy efficient than the holistic strategy: a low SER means $E_{sys}^{dynamic}$ dominates $E_{sys}$, so reducing the cluster frequency is more effective than DPM for optimizing the overall energy. Lastly, the holistic strategy uses the smallest $N_{all}^{used}$, indicating that it is the most resource efficient.
Similar observations hold for the $Power_{static} \propto f$ model. Because this static power model depends less on frequency (with regard to $Power_{static} \propto f^2$), it depends more on $N_{all}^{used}$. A steeper slope can therefore be seen in the evolution of $N_{all}^{used}$ over the SER (recall Section 4.2.1); this is not shown here due to the space limit of the paper.

4.3. General Case

In the general case, a use case can start on a non-empty platform (with existing applications). Ideally, only the newly activated applications (with regard to the previous use case) are considered (in terms of mapping, DVFS, and DPM) in each use case. To further optimize the energy consumption of the overall system, old existing applications (active since the previous use case) can be allowed to migrate from one cluster to another. This means that the three system configurations in each use case depend on the previous use case. In contrast to the special case (Section 4.2), the general case considers both long and short system execution times. With a short system execution time (e.g., #LCM < 100), it is important to trade off the energy savings against the run-time overheads (e.g., $E_{overhead}^{DPM}$, $E_{overhead}^{migration}$; recall Section 3). Notice that the exhaustive and genetic strategies are not considered in the general case due to their huge design space and long exploration time.

4.3.1. Evaluation of Different Management Strategies without Migration

This experiment evaluates the proposed management strategies while prohibiting application migration across clusters. Figure 6 shows $E_{sys}$ ($E_{sys}^{dynamic} + E_{sys}^{static} + E_{overhead}^{DPM}$) and $N_{all}^{used}$ for different platform sizes and different execution durations (#LCM = 1, 10, 100, 1000). Note that the policies denoted by “-M0” allow zero migrations. The experimental results are normalized to exhaustive-M0 at #LCM = 1.
As shown in Figure 6, the observations of Section 4.2.2 (the special case) still hold when the general case allows zero migrations (across clusters). (1) The fully separate strategy (FS-M0 in Figure 6) results in the highest $E_{sys}$ as well as the largest $N_{all}^{used}$. This indicates that giving the clusters different mapping priorities is neither energy efficient nor resource efficient (recall Section 3.2.1). (2) The partially separate (PS-M0 in Figure 6) and holistic (H-M0 in Figure 6) strategies have better energy efficiency on the two-cluster platforms (2 × 4, 2 × 8) and on the four-cluster platforms (4 × 4, 4 × 8), respectively. The partially separate (PS-M0) strategy gives a higher priority to minimizing the cluster frequency (recall Section 3.2.2), which again indicates that $E_{sys}^{dynamic}$ optimization (via DVFS) deserves a higher priority than $E_{sys}^{static}$ optimization (via DPM) on small platforms (e.g., two clusters). (3) The holistic (H-M0) strategy leads to the most resource-efficient solutions (see Figure 6(b.1–b.4)). Lastly, as the execution time increases (#LCM = 1, 10, 100, 1000), $E_{sys}^{dynamic}$ and $E_{sys}^{static}$ keep increasing, whereas $E_{overhead}^{DPM}$ becomes negligible once the execution time is large (e.g., #LCM > 10).

4.3.2. Evaluation of Different Management Strategies Allowing Migrations

To illustrate the benefit of allowing application migration across clusters, this experiment evaluates the partially separate and holistic strategies with zero, one, or two migrations (denoted -M0, -M1, -M2). The fully separate strategy is not considered here due to its high energy consumption (see Section 3.2.1 and Section 3.2.2). Figure 7 shows $E_{sys}$ ($E_{sys}^{dynamic} + E_{sys}^{static} + E_{overhead}^{DPM} + E_{overhead}^{migration}$), $N_{all}^{used}$, and the actual number of migrations (#migration) for different platform sizes and different execution times (#LCM = 1, 10, 100). The experimental results are normalized to holistic-M0 (#LCM = 1). Notice that although the two considered strategies allow one or two old existing applications (from the previous use case) to be migrated to new clusters, the actual #migration (on average per use case) can be smaller than one (or two). This is because application migration does not necessarily lead to energy savings; there is a tradeoff between the energy savings and the migration overhead.
As shown in Figure 7, migration generally leads to energy savings for long execution times (#LCM > 10). With increasing #LCM, the run-time overheads ($E_{overhead}^{DPM} + E_{overhead}^{migration}$) become negligible compared to the energy savings; this can be regarded as approaching the special case (see Section 4.2). However, when the execution time is short (#LCM = 1), migration can increase $E_{sys}$ (e.g., Figure 7(a.3)), indicating that frequent migrations over subsequent use cases (i.e., the 1023 use cases in this experiment) are not energy efficient. Additionally, the partially separate (PS-M0,1,2 in Figure 7) and holistic (H-M0,1,2 in Figure 7) strategies lead to better energy efficiency on the 2 × 4 platform and on the four-cluster platforms (4 × 4, 4 × 8), respectively, while the two strategies lead to comparable $E_{sys}$ on the 2 × 8 platform. This is aligned with the previous observation (in Section 4.2.2 and Section 4.3.1) that minimizing the cluster frequency (for $E_{sys}^{dynamic}$ optimization) takes priority over DPM ($E_{sys}^{static}$ optimization) on small platforms (e.g., 2 × 4). Moreover, the holistic strategy (H-M0,1,2) results in a smaller $N_{all}^{used}$ and fewer migrations than the partially separate strategy (PS-M0,1,2).

4.4. Experimental Discussion

This section discusses the characteristics of the different management strategies based on the experimental observations. First, Table 3 summarizes the exploration time (the average time, in ms, to explore a use-case solution) to characterize strategy complexity. Compared to the special case, the general case has shorter exploration times, because the general case mainly considers the applications newly activated at run time (with regard to the previous use case), which reduces the exploration burden. For the general case in particular, allowing migration (-M1,2 vs. -M0) increases the exploration time.
In summary, the experimental observations provide the following insights into the characteristics of heterogeneous cluster-based platforms (supporting per-cluster DVFS and per-core DPM) and of the different management strategies (implemented in the HHMS manner).
  • Larger platform sizes (with more available clusters/cores) can lead to better energy efficiency. Dynamic energy is highly dependent on the cluster frequency levels, which are determined by the application-to-cluster mapping and the task-to-core mapping (reflected by Equations (1) and (2)). Static energy is related to the cluster frequency (depending on the static power model) and to the number of used cores.
  • The fully separate strategy gives the clusters different mapping priorities. The resulting high frequencies (of some clusters) can lead to a high overall energy consumption. This strategy has the lowest complexity (see Table 3) due to its simplicity.
  • The partially separate and holistic strategies lead to similar energy consumption (a 5–20% difference with regard to exhaustive) with low strategy complexity (up to 1652× lower than exhaustive).
    The partially separate strategy gives a higher priority to dynamic energy optimization (via low frequencies). It can achieve better energy efficiency than the holistic strategy on two-cluster platforms, which indicates that dynamic energy optimization is more important (with regard to static energy optimization) on small platforms.
    The holistic strategy optimizes dynamic and static energy simultaneously, and it achieves better energy efficiency on four-cluster platforms. The holistic strategy also has the best resource efficiency, i.e., it uses the fewest cores for energy optimization. Compared to the partially separate strategy, the holistic strategy has a slightly higher complexity due to its larger exploration space (see Section 3.2).

5. Conclusions

For multiple applications executing concurrently on heterogeneous cluster-based platforms (e.g., ARM big.LITTLE), this paper formulates the dependencies between application-to-cluster mapping, task-to-core mapping, per-cluster DVFS, and per-core DPM configurations as a 0–1 ILP model to optimize the overall energy (including the run-time overheads of DPM setting and application migration). Based on the 0–1 ILP model and a hybrid hierarchical management structure, three heuristic (greedy) management strategies (fully separate, partially separate, and holistic) are compared. The experimental evaluations are performed on 1023 use cases (generated in random orders) and on different platform sizes (e.g., with 2 and 4 clusters), showing that the partially separate and holistic strategies can achieve good energy efficiency (a 5–20% difference with regard to exhaustive) at low strategy complexity (up to 1652× lower than exhaustive).
The proposed partially separate and holistic management strategies are expected to deliver energy efficiency on cluster-based platforms that support per-cluster DVFS, such as embedded mobile platforms (e.g., ODROID XU3/4 [1]) and high-performance computing (HPC) platforms (e.g., ODROID-MC1 [2]). Particularly in HPC, applications are commonly represented as kernels, where each kernel consists of one or multiple tasks. The kernel execution can be regarded as design-/run-time mappings using different numbers of cores/threads in our proposed strategies, which shows the potential of the proposed strategies in the HPC domain. To further validate them on embedded systems and HPC, our future work plans include: (1) validating the proposed strategies in a real implementation using the multi-threaded PARSEC benchmarks [27] on the ODROID XU4 and ODROID-MC1 platforms, and (2) targeting machine-learning prediction to avoid the effort of design-time preparation (MAFs for different numbers of threads).

Author Contributions

Conceptualization, W.Q. and S.Y.; methodology, W.Q. and S.Y.; software, W.Q.; validation, W.Q. and S.Y.; formal analysis, W.Q., Y.C. and S.Y.; investigation, W.Q. and S.Y.; resources, D.C. and T.S.; data curation, W.Q., Y.C. and S.Y.; writing—original draft preparation, W.Q. and S.Y.; writing—review and editing, W.Q. and S.Y.; visualization, W.Q. and S.Y.; supervision, S.Y.; project administration, S.Y.; funding acquisition, D.C. and T.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Science and Technology Program of Guangdong Province under Grant No. 2021B1101270007 and Grant No. 2019B010140002.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Exynos 5 Octa (5422). 2021. Available online: http://www.samsung.com/exynos (accessed on 1 February 2022).
  2. Lintermann, A.; Pleiter, D.; Schröder, W. Performance of ODROID-MC1 for scientific flow problems. Future Gener. Comput. Syst. 2019, 95, 149–162. [Google Scholar] [CrossRef]
  3. Benini, L.; Bertozzi, D.; Milano, M. Resource management policy handling multiple use cases in MPSoC platforms using constraint programming. In Proceedings of the International Conference on Logic Programming, Prague, Czech Republic, 10–12 September 2008; pp. 470–484. [Google Scholar]
  4. Muthukaruppan, T.S.; Pricopi, M.; Venkataramani, V.; Mitra, T.; Vishin, S. Hierarchical power management for asymmetric multi-core in dark silicon era. In Proceedings of the 2013 50th ACM/EDAC/IEEE Design Automation Conference (DAC), Austin, TX, USA, 29 May–7 June 2013; pp. 1–9. [Google Scholar]
  5. Reddy, B.K.; Singh, A.K.; Biswas, D.; Merrett, G.V.; Al-Hashimi, B.M. Inter-cluster thread-to-core mapping and DVFS on heterogeneous multi-cores. IEEE Trans. Multi-Scale Comput. Syst. 2017, 4, 369–382. [Google Scholar] [CrossRef]
  6. Basireddy, K.R.; Singh, A.K.; Al-Hashimi, B.M.; Merrett, G.V. AdaMD: Adaptive mapping and DVFS for energy-efficient heterogeneous multicores. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2019, 39, 2206–2217. [Google Scholar] [CrossRef] [Green Version]
  7. Pagani, S.; Pathania, A.; Shafique, M.; Chen, J.J.; Henkel, J. Energy efficiency for clustered heterogeneous multicores. IEEE Trans. Parallel Distrib. Syst. 2016, 28, 1315–1330. [Google Scholar] [CrossRef]
  8. Tariq, U.U.; Ali, H.; Liu, L.; Panneerselvam, J.; Zhai, X. Energy-efficient static task scheduling on VFI-based NoC-HMPSoCs for intelligent edge devices in cyber-physical systems. ACM Trans. Intell. Syst. Technol. 2019, 10, 1–22. [Google Scholar] [CrossRef] [Green Version]
  9. Dey, S.; Saha, S.; Singh, A.; McDonald-Maier, K. Asynchronous Hybrid Deep Learning (AHDL): A Deep Learning Based Resource Mapping in DVFS Enabled Mobile MPSoCs. In Proceedings of the 2021 IEEE 7th World Forum on Internet of Things (WF-IoT), New Orleans, LA, USA, 14 June–31 July 2021; pp. 303–308. [Google Scholar]
  10. Chen, J.; Manivannan, M.; Abduljabbar, M.; Pericàs, M. ERASE: Energy Efficient Task Mapping and Resource Management for Work Stealing Runtimes. ACM Trans. Archit. Code Optim. 2022, 19, 1–29. [Google Scholar] [CrossRef]
  11. Yang, S.; Le Nours, S.; Real, M.M.; Pillement, S. 0–1 ILP-based run-time hierarchical energy optimization for heterogeneous cluster-based multi/many-core systems. J. Syst. Archit. 2021, 116, 102035. [Google Scholar] [CrossRef]
  12. Shamsa, E.; Kanduri, A.; Rahmani, A.M.; Liljeberg, P. Energy-Performance Co-Management of Mixed-Sensitivity Workloads on Heterogeneous Multi-core Systems. In Proceedings of the 26th Asia and South Pacific Design Automation Conference, Tokyo, Japan, 18–21 January 2021; pp. 421–427. [Google Scholar]
  13. Mo, L.; Zhou, Q.; Kritikakou, A.; Liu, J. Energy Efficient, Real-time and Reliable Task Deployment on NoC-based Multicores with DVFS. In Proceedings of the IEEE/ACM Design, Automation and Test in Europe (DATE), Antwerp, Belgium, 14–23 March 2022. [Google Scholar]
  14. Bhatti, M.K.; Belleudy, C.; Auguin, M. Hybrid power management in real time embedded systems: An interplay of DVFS and DPM techniques. Real-Time Syst. 2011, 47, 143–162. [Google Scholar] [CrossRef]
  15. Chen, G.; Huang, K.; Knoll, A. Energy optimization for real-time multiprocessor system-on-chip with optimal DVFS and DPM combination. ACM Trans. Embed. Comput. Syst. 2014, 13, 1–21. [Google Scholar] [CrossRef] [Green Version]
  16. Srinivasan, K.; Chatha, K.S. Integer linear programming and heuristic techniques for system-level low power scheduling on multiprocessor architectures under throughput constraints. Integration 2007, 40, 326–354. [Google Scholar] [CrossRef]
  17. Mascitti, A.; Cucinotta, T.; Marinoni, M. An adaptive, utilization-based approach to schedule real-time tasks for ARM big. LITTLE architectures. ACM SIGBED Rev. 2020, 17, 18–23. [Google Scholar] [CrossRef]
  18. Jejurikar, R.; Pereira, C.; Gupta, R. Leakage aware dynamic voltage scaling for real-time embedded systems. In Proceedings of the 41st annual Design Automation Conference, San Diego, CA, USA, 7–11 June 2004; pp. 275–280. [Google Scholar]
  19. Singh, A.K.; Dziurzanski, P.; Mendis, H.R.; Indrusiak, L.S. A survey and comparative study of hard and soft real-time dynamic resource allocation strategies for multi-/many-core systems. ACM Comput. Surv. 2017, 50, 24. [Google Scholar] [CrossRef] [Green Version]
  20. Quan, W.; Pimentel, A.D. A hierarchical run-time adaptive resource allocation framework for large-scale MPSoC systems. Des. Autom. Embed. Syst. 2016, 20, 311–339. [Google Scholar] [CrossRef] [Green Version]
  21. Singh, A.K.; Shafique, M.; Kumar, A.; Henkel, J. Resource and throughput aware execution trace analysis for efficient run-time mapping on MPSoCs. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2016, 35, 72–85. [Google Scholar] [CrossRef]
  22. Henkel, J.; Khdr, H.; Pagani, S.; Shafique, M. New trends in dark silicon. In Proceedings of the 2015 52nd ACM/EDAC/IEEE Design Automation Conference (Dac), San Francisco, CA, USA, 8–12 June 2015; pp. 1–6. [Google Scholar]
  23. Kanduri, A.; Miele, A.; Rahmani, A.M.; Liljeberg, P.; Bolchini, C.; Dutt, N. Approximation-aware coordinated power/performance management for heterogeneous multi-cores. In Proceedings of the 55th Annual Design Automation Conference, San Francisco, CA, USA, 24–28 June 2018; pp. 1–6. [Google Scholar]
  24. SDF3. Available online: http://www.es.ele.tue.nl/sdf3 (accessed on 1 February 2022).
  25. Zahaf, H.E.; Benyamina, A.E.H.; Olejnik, R.; Lipari, G. Energy-efficient scheduling for moldable real-time tasks on heterogeneous computing platforms. J. Syst. Archit. 2017, 74, 46–60. [Google Scholar] [CrossRef]
  26. Lambora, A.; Gupta, K.; Chopra, K. Genetic Algorithm-A Literature Review. In Proceedings of the 2019 International Conference on Machine Learning, Big Data, Cloud and Parallel Computing (COMITCon), Faridabad, India, 14–16 February 2019; pp. 380–384. [Google Scholar] [CrossRef]
  27. Bienia, C.; Kumar, S.; Singh, J.P.; Li, K. The PARSEC benchmark suite: Characterization and architectural implications. In Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques, New York, NY, USA, 25–29 October 2008; pp. 72–81. [Google Scholar]
Figure 1. HHMS in [11] with three abstraction levels. (a) At the application level, a use case is composed of multiple concurrently executing applications. (b) At the platform level, a cluster-based multi/many-core system supports per-cluster DVFS and per-core DPM. (c) At the management level, a hybrid hierarchical management structure consists of global and local management, and run-time management is based on design-time-prepared data. (c.1,c.2) Design-time-prepared data (prepared mappings and corresponding MAFs) for app_1 and app_2. (c.3) The use of prepared mappings and MAFs for run-time management (e.g., app_1 and app_2 are mapped to cluster_j).
Figure 2. For u_1 = {app_1, app_2} executing on a two-cluster platform: (a) all possible application-to-cluster mappings, with some possible per-cluster DVFS, task-to-core mapping, and DPM configurations; (b) MAF and design-time mapping selection for case 1, where the two applications are mapped to the same cluster; (c) MAF and design-time mapping selection of app_2 for case 2, where one application is mapped to one cluster.
Figure 3. Flowcharts of the three management strategies, based on the 0–1 ILP model and HHMS (with GM and LM): (a) fully separate management strategy; (b) partially separate management strategy; (c) holistic management strategy; (d) cluster frequency selection based on MAFs.
Figure 4. (a) Normalized E_sys (i.e., E_sys^dynamic + E_sys^static, over #LCM = 100) and (b) normalized number of used cores (relative to the 2 × 4 platform at SER = 0.02) across all common feasible use cases (1002 use cases), obtained with the exhaustive strategy. Results are given for different platform sizes and different static energy models (Power_static ∝ f² and Power_static ∝ f) under constraints of increasing SER.
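When reading the SER sweeps in Figures 4–7, it may help to keep the definitional relation between the two energy components in mind. The line below is our own worked restatement of the SER definition from Table 2, not an equation taken from the paper:

```latex
% SER = E_sys^static / E_sys^dynamic (definition in Table 2), hence:
E_{sys} = E_{sys}^{dynamic} + E_{sys}^{static} = (1 + SER)\, E_{sys}^{dynamic},
\quad \text{e.g., } SER = 0.1 \;\Rightarrow\; E_{sys} = 1.1\, E_{sys}^{dynamic}.
```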
Figure 5. (a.1–a.4) Normalized E_sys (i.e., E_sys^dynamic + E_sys^static, over #LCM = 100) and (b.1–b.4) normalized N_all^used (relative to the exhaustive strategy at SER = 0.02) across all common feasible use cases for different platform sizes. Results are given for the management strategies under constraints of increasing SER (using the Power_static ∝ f² model). Note: for readability, the energy of the fully separate strategy is scaled down by a factor of 1.5, 3, 6, and 7 for the 2 × 4, 2 × 8, 4 × 4, and 4 × 8 platforms, respectively. H: holistic strategy; PS: partially separate strategy; FS: fully separate strategy.
Figure 6. (a.1–a.4) Normalized E_sys (i.e., E_sys^dynamic + E_sys^static + E_overhead^DPM, averaged over an LCM) and (b.1–b.4) normalized N_all^used (relative to exhaustive-M0, with #LCM = 1) across all common feasible use cases for different platform sizes. Results are given for the management strategies allowing 0 migrations, under constraints of increasing execution time (using the Power_static ∝ f² model, SER = 0.1). Note: for readability, the energy of the fully separate strategy is scaled down by a factor of 2, 3, and 5 for the 2 × 8, 4 × 4, and 4 × 8 platforms, respectively. H: holistic strategy; PS: partially separate strategy; FS: fully separate strategy; -M0: allowing 0 migrations across clusters.
Figure 7. (a.1–a.4) Normalized E_sys (i.e., E_sys^dynamic + E_sys^static + E_overhead^DPM + E_overhead^migration, averaged over an LCM), (b.1–b.4) normalized N_all^used (relative to the holistic (H) strategy, with #LCM = 1), and (c.1–c.4) actual number of migrations across all common feasible use cases. Results are given for the management strategies allowing 0, 1, or 2 migrations, under constraints of increasing execution time (using the Power_static ∝ f² model) and for different platform sizes. Note: H: holistic strategy; PS: partially separate strategy; -Mx: a strategy allowing x migration(s) across clusters.
Table 1. Comparison of applying mapping, DVFS, and DPM in SotA.

Management Strategy | References | Applications
Mapping → DVFS | [4,5,6,7] | Multiple concurrent apps
Mapping & DVFS | [8,9,10] | One single app
Mapping & DVFS | [11,12,13] | Multiple concurrent apps
Mapping → DVFS* & DPM | [14,15] | Periodic independent tasks
Mapping & DVFS* → DPM | [16] | One single app
Fully separate: Mapping → DVFS → DPM | This work | Multiple concurrent apps
Partially separate: Mapping & DVFS → DPM | This work | Multiple concurrent apps
Holistic: Mapping & DVFS & DPM | This work | Multiple concurrent apps

Note: → refers to the separate manner; & refers to the holistic manner. app: an application having dependent tasks. DVFS: per-cluster DVFS by default. DVFS*: per-core DVFS.
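The arrow and ampersand notation can be read as a difference in search-space structure. The following minimal Python sketch is our own illustration of that difference, not the paper's 0–1 ILP solver; `mappings`, `freqs`, `dpm_modes`, and `energy` are hypothetical stand-ins for the configuration sets and energy model:

```python
# Illustrative contrast between separate (→) and holistic (&) decision making.
from itertools import product

def separate(mappings, freqs, dpm_modes, energy):
    """Mapping → DVFS → DPM: each decision is fixed before the next is explored.
    freqs[0] and dpm_modes[0] stand in for the defaults assumed by earlier steps."""
    m = min(mappings, key=lambda m_: energy(m_, freqs[0], dpm_modes[0]))
    f = min(freqs, key=lambda f_: energy(m, f_, dpm_modes[0]))
    d = min(dpm_modes, key=lambda d_: energy(m, f, d_))
    return m, f, d

def holistic(mappings, freqs, dpm_modes, energy):
    """Mapping & DVFS & DPM: one joint search over all three configurations."""
    return min(product(mappings, freqs, dpm_modes),
               key=lambda cfg: energy(*cfg))
```

The joint search can reach a lower energy because it accounts for the mutual influence among the three decisions, but it enumerates |mappings| × |freqs| × |dpm_modes| candidates; this is the complexity trade-off that separates the fast separate pipelines from the exhaustive strategy in Table 3.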
Table 2. Notations used in this paper.

System configuration
HHMS | Hybrid hierarchical management structure
DVFS | Dynamic voltage and frequency scaling
DPM | Dynamic power management

Application level, platform level, and management level
app_i | An application with index i
Period_app_i | The period of application app_i
u_m = {app_1, app_2, ..., app_I} | A use case u_m, consisting of a set of simultaneously active applications
I | The number of active applications in a use case
cluster_j | A cluster with index j
J | The number of available clusters in the platform
N_j | The number of available cores in cluster_j
{f_{j,1}, f_{j,2}, ..., f_{j,max}} | The available discrete frequency levels of cluster_j
GM | Global management
LM | Local management
MAF_{i,j}^c | The minimum allowed frequency (MAF) of app_i mapped on c cores in cluster_j
c_max | The maximum number of cores used by a design-time-prepared mapping for an application
MAF_j^LC | The lowest common MAF of the mapped applications in cluster_j
X_{app_i}^c | The design-time mapping of app_i on c cores
N_j^used | The number of used cores in cluster_j
Ma_{i,j} | The run-time application-to-cluster mapping of app_i to cluster_j
f_j | The run-time configured frequency of cluster_j

Energy model
E_sys | The overall energy of the system
E_sys^dynamic | The dynamic energy of the system
E_sys^static | The static energy of the system
E_overhead^DPM | The energy overhead of DPM mode switching
E_overhead^migration | The energy overhead of application migration

Experiment
FS | Fully separate strategy: mapping → DVFS → DPM
PS | Partially separate strategy: mapping & DVFS → DPM
H | Holistic strategy: mapping & DVFS & DPM
-Mx | Allowing x migration(s) across clusters
SER | Static energy ratio: the ratio of static energy to dynamic energy
#LCM | Least common multiple of the active application periods in each use case
#Migration | The number of application migrations across clusters
FCFS | First-come-first-served task mapping strategy in local management
N_all^used | The number of used cores over all clusters
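To make the MAF notation concrete, the following minimal Python sketch shows how a local manager could derive MAF_j^LC and pick the cluster frequency, as depicted in Figure 3d. All names, the data layout, and the numeric values are our own illustration (not the authors' implementation), and we interpret the "lowest common MAF" as the lowest frequency satisfying every mapped application, i.e., the maximum of the per-app MAFs:

```python
# Illustrative sketch of MAF-based cluster frequency selection (cf. Figure 3d).
# MAF values are assumed to be design-time prepared, as described in the paper.

def lowest_common_maf(mapped_apps, maf, cluster_j):
    """MAF_j^LC: the lowest frequency satisfying every application mapped to
    cluster_j, i.e., the maximum of the per-app MAF_{i,j}^c values."""
    return max(maf[(app_i, cluster_j, c)] for (app_i, c) in mapped_apps)

def select_cluster_frequency(mapped_apps, maf, cluster_j, freq_levels):
    """Pick the smallest available discrete level f_j >= MAF_j^LC."""
    maf_lc = lowest_common_maf(mapped_apps, maf, cluster_j)
    for f in sorted(freq_levels):
        if f >= maf_lc:
            return f
    return None  # no feasible level: the mapping must be revised

# Hypothetical example: app1 on 2 cores and app2 on 1 core, both on cluster 0.
maf = {("app1", 0, 2): 0.8e9, ("app2", 0, 1): 1.1e9}   # MAF_{i,j}^c in Hz
levels = [0.6e9, 0.8e9, 1.0e9, 1.2e9, 1.4e9]           # {f_{j,1}, ..., f_{j,max}}
print(select_cluster_frequency([("app1", 2), ("app2", 1)], maf, 0, levels))
# -> 1200000000.0 (1.2 GHz), the smallest level covering both MAFs
```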
Table 3. Time (ms) to explore a use-case solution (averaged over 1023 use cases) for the different strategies.

Platform | Special Case: Exhaustive / Genetic / FS / PS / H | General Case -M0: Exhaustive / FS / PS / H | -M1: PS / H | -M2: PS / H
2 × 4 | 1.35 / 804.35 / 0.05 / 0.12 / 0.12 | 0.23 / 0.07 / 0.13 / 0.14 | 0.15 / 0.16 | 0.16 / 0.18
2 × 8 | 1.54 / 40.93 / 0.05 / 0.12 / 0.12 | 0.27 / 0.08 / 0.12 / 0.13 | 0.15 / 0.15 | 0.15 / 0.16
4 × 4 | 282.99 / 46.15 / 0.06 / 0.18 / 0.18 | 7.69 / 0.08 / 0.16 / 0.17 | 0.19 / 0.19 | 0.20 / 0.22
4 × 8 | 313.98 / 31.20 / 0.06 / 0.18 / 0.19 | 8.29 / 0.09 / 0.16 / 0.17 | 0.19 / 0.20 | 0.21 / 0.22

FS: fully separate; PS: partially separate; H: holistic; -M0,1,2: allowing 0, 1, or 2 migration(s) across clusters.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.