Article

Time–Energy Correlation for Multithreaded Matrix Factorizations

Institute of Computer Science, Maria Curie-Sklodowska University, Pl. M. Curie-Skłodowskiej 5, 20-031 Lublin, Poland
* Author to whom correspondence should be addressed.
Current address: Department of Information Systems Software, Institute of Computer Science, Faculty of Mathematics, Physics and Computer Science, Pl. M. Curie-Skłodowskiej 5, 20-031 Lublin, Poland.
These authors contributed equally to this work.
Submission received: 24 July 2023 / Revised: 18 August 2023 / Accepted: 24 August 2023 / Published: 29 August 2023
(This article belongs to the Section K: State-of-the-Art Energy Related Technologies)

Abstract

The relationship between time and energy is an important aspect of energy savings in modern multicore architectures. In this paper, we investigated and analyzed the correlation between time and energy. We compared the execution time and energy consumption of the LU factorization algorithms (with and without pivoting) and the Cholesky factorization from the Math Kernel Library (MKL) on a multicore machine. To reduce the energy consumed by these multithreaded factorizations, the Dynamic Voltage and Frequency Scaling (DVFS) technique was used; it allows the clock frequency to be scaled without changing the implementation. In particular, we studied the correlations between time and energy using two metrics: the Energy Delay Product (EDP) and Greenup, Powerup, and Speedup (GPS-UP). An experimental evaluation was performed on an Intel Xeon Gold multicore machine as a function of the number of threads and the clock speed. Our test results showed that scalability in terms of execution time, expressed by the Speedup metric, had values close to a linear function as the number of threads increased. In contrast, scalability in terms of energy consumption, expressed by the Greenup metric, had values close to a logarithmic function of the number of threads. The EDP and GPS-UP metrics allowed us to evaluate the impact of the optimized code (DVFS and an increased number of threads) on time and energy consumption and to determine a better, green category representing energy savings without losing performance.

1. Introduction

The time–energy correlation is one of the main aspects to consider when implementing parallel software on multicore machines, especially for numerical algorithms involving many matrix computations. To reduce the execution time of matrix factorization on multicore machines, multithreaded code is required. To reduce the energy consumed by these multithreaded factorizations, the Dynamic Voltage and Frequency Scaling (DVFS) technique can be used; it allows the clock frequency to be scaled without changing the implementation. Scalability allows an increasing number of threads to be used on a multicore machine in the hope that both time and energy efficiency will improve rather than degrade. The classical approach to scalability in parallel processing focuses on performance in terms of execution time. In this work, we study scalability with respect to two criteria: the runtime of the numerical algorithm and its energy consumption. The importance and the need to consider several criteria in relation to scalability in parallel processing are shown in [1]. A distinction is made between two basic concepts related to scalability: scalability in the strong sense and scalability in the weak sense. In this paper, we consider only scalability in the strong sense.
A deep understanding of the scalability of execution time and power consumption, along with the interrelationships between them, can enable targeted improvements that reduce both runtime and power consumption across application domains. It is particularly important to analyze applications that deliberately exploit cache mechanisms. Such applications typically come from numerical linear algebra and involve matrix computations. Linear algebra is a key component of many numerical algorithms that address a range of scientific and engineering challenges.
Over time, the Basic Linear Algebra Subprograms (BLAS) [2] became the de facto standard interface for linear algebraic operations. Among the widespread BLAS libraries, the Math Kernel Library (MKL) [3] occupies a prominent place. The MKL provides implementations of matrix factorization techniques, including the LU and Cholesky factorizations. These factorization implementations, rooted in the Level 3 BLAS, are cache-friendly: they minimize memory accesses and make better use of spatial data locality. The conventional MKL approach for parallel matrix factorizations on cache-dependent systems uses fixed-size blocks that fit in the cache, thus distributing the computational load evenly across the threads. We expect that the obtained results can be generalized to other algorithms of this kind with cache-friendly implementations, e.g., the QR factorization.
Currently, the MKL focuses on optimizing runtime performance; energy savings are not yet taken into account. While it is widely recognized that reducing the computation time usually results in energy savings, this is not the only route to energy efficiency. Therefore, to increase the energy efficiency of the algorithms contained in the MKL without changing their existing implementations on multicore architectures, this study used Dynamic Voltage and Frequency Scaling (DVFS) [4]. DVFS reduces the clock frequency and voltage level of various computing node components (CPU, DRAM, etc.) at the cost of some performance degradation. It allows power consumption to be managed effectively by adjusting the voltage and frequency of the processor depending on the current load, contributing to significant energy savings. However, it should be remembered that lowering the frequency and voltage of the processor to save energy may adversely affect applications that require high computing power, reducing their performance. In addition, the use of DVFS in one system component may affect the overall performance of the system, as differences in the performance adjustment of individual components can lead to inefficient synchronization of activities.
In this article, we ran tests on a multicore machine with two Intel Xeon Gold 5218R server processors with a base clock of 2.10 GHz, which belong to the Xeon Scalable series and have a Non-Uniform Memory Access (NUMA) architecture. NUMA is a commonly used concept in multicore and multiprocessor computing systems. In NUMA systems, memory is divided into multiple regions, each associated with a specific set of processor cores. This means that access to local memory is faster than access to remote memory, which introduces non-uniform delays (hence the name “Non-Uniform Memory Access”). NUMA is one of the key aspects that affect the performance and optimization of multiprocessor systems.

2. Related Work

This section reviews the literature on numerical linear algebra computations (in particular, matrix factorizations) on multicore machines in the context of time and energy. The review focuses on two aspects. The first is how energy efficiency can be achieved in numerical linear algebra software. There are three ways: by making changes to the implementation, by autotuning the software parameters for a given hardware platform, and by adjusting the operating system settings (e.g., DVFS). The second aspect is the different ways of assessing execution time and energy consumption.
In [5], different techniques related to algorithm transformation were used to reduce static and dynamic energy consumption on multicore machines with shared memory. In particular, one of the most popular matrix factorizations, the LU factorization, was considered. The most energetically expensive instructions in the LU factorization code were identified, and a code transformation was applied to reduce their number. That work investigated the effect of the number of threads on dynamic and total energy consumption and performance, as well as the number of MFlops/W for different matrix sizes at a fixed number of threads.
There are also papers that have analyzed the energy impact of numerical linear algebra algorithms without changing the code, controlling only the software parameters (e.g., the block size). The Automatically Tuned Linear Algebra Software (ATLAS) [6] is an example of automatically tuning a library to the architecture of the machine to reduce the execution time. The paper [7] proposed a method for tuning the ATLAS library in which the execution-time tuning process is replaced by tuning for energy consumption. That work investigated the performance and the number of MFlops/J, as well as the execution time and energy consumption, in single and double precision for different matrix sizes.
At the operating system level, the approach to optimizing energy consumption focuses on minimizing the energy consumption of the entire node through techniques such as DVFS. The DVFS technique has a very high potential to save energy, as shown in [8,9,10]. The article [8] focuses on the DVFS technique, which is available for many multicore platforms. The impact of the number of threads executing a multithreaded application and of different clock frequency values was considered. Only the Energy Delay Product (EDP) model, which multiplies power by the square of the execution time, was used to assess energy efficiency.
In our work, we extend that study to include scalability in the energy sense for multithreaded applications, measured by the Greenup and Powerup metrics described in the next section, for numerical linear algebra problems with fixed problem dimensions. This allows us to investigate scalability as well as the correlations between runtime and energy consumption.

3. Materials and Methods

3.1. Metrics

Energy consumption is the product of runtime and consumed power. Energy can therefore be saved in various ways: by shortening the runtime, by reducing the power consumption, or by trading one off against the other, i.e., extending the runtime while reducing the power consumption by a larger factor or, vice versa, slightly increasing the power consumption while reducing the runtime. Therefore, an assessment of energy efficiency based on the analysis of runtime or energy consumption alone is insufficient and sometimes misleading. In the current literature, we can find metrics that allow for a deeper analysis of energy efficiency.
Gonzalez and Horowitz introduced the Energy Delay Product (EDP) metric [11], which is dedicated to the simultaneous analysis of runtime and energy consumption. The EDP is defined as the product of energy and runtime:
$$\text{EDP} = E \cdot T .$$
In [12], other metrics were introduced, namely Speedup, Greenup, and Powerup (GPS-UP), together with a code evaluation scheme that assigns the code to one of eight categories depicted in the GPS-UP Software Energy Efficiency Quadrant Graph (Figure 1), where the green categories mean energy savings and the red categories mean energy loss compared to the non-optimized code. The Speedup metric is well known in the literature and is used to compare the performance of different code implementations in parallel programming. Assume we have two implementations of the algorithm: a non-optimized (base) code running in time $T_B$ and an optimized code running in time $T_O$. Speedup is defined as follows:
$$\text{Speedup} = \frac{T_B}{T_O} .$$
In [12], Greenup is defined analogously to Speedup, only in terms of energy consumption:
$$\text{Greenup} = \frac{E_B}{E_O} ,$$
where $E_B$ is the total energy consumption of the non-optimized code and $E_O$ is the total energy consumption of the optimized code. We know that $E_B = T_B P_B$ and $E_O = T_O P_O$, where $P_B$ and $P_O$ are the average power consumed by the non-optimized and optimized code, respectively. Therefore, we can write
$$\text{Greenup} = \frac{T_B P_B}{T_O P_O} = \text{Speedup} \cdot \frac{P_B}{P_O} .$$
The work [12] introduced Powerup as follows:
$$\text{Powerup} = \frac{P_O}{P_B} = \frac{\text{Speedup}}{\text{Greenup}} .$$
It is well known that the lower the value of the EDP metric, the better the efficiency; however, this metric sometimes fails (especially when optimizing with DVFS), favoring optimizations that run faster even though they use more energy. The GPS-UP metrics provide a deeper analysis of the relationship between performance and energy efficiency. In the above formulas, in the case of optimization by increasing the number of threads, $T_B$ denotes the runtime of the multithreaded algorithm running on 1 thread and $T_O$ the runtime of the multithreaded algorithm running on $p$ threads ($p > 1$). In the case of optimization using DVFS, $T_B$ denotes the runtime of the algorithm running at the highest possible frequency and $T_O$ the runtime of the algorithm running at a lower frequency. The notations $E_B$ and $E_O$ are interpreted analogously for the consumed energy.
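As a minimal illustration of how these metrics are computed, the following C++ sketch (our own helper code with hypothetical struct and function names, not part of the MKL or of the measurement scripts) derives Speedup, Greenup, Powerup, the EDP, the normalized EDP, and the GPS-UP category of Figure 1 from a pair of (time, energy) measurements. The example values are the one-thread and 40-thread runs of LU without pivoting at 2.1 GHz reported later in Table 2.

```cpp
#include <cmath>
#include <cstdio>

// Raw measurements for one code version: runtime in seconds, energy in joules.
struct Measurement { double time_s; double energy_J; };

// GPS-UP metrics of an optimized run relative to a base run (struct and names are ours).
struct GpsUp { double speedup, greenup, powerup, edp, normalized_edp; int category; };

// Category assignment following the GPS-UP quadrant graph in Figure 1.
static int gpsup_category(double speedup, double powerup) {
    const double eps = 1e-9;                       // tolerance for Powerup == 1
    const bool p_eq_1 = std::fabs(powerup - 1.0) < eps;
    if (speedup > 1.0) {                           // faster than the base version
        if (powerup < 1.0) return 1;
        if (p_eq_1)        return 2;
        return (speedup > powerup) ? 3 : 5;        // 3 is green, 5 is red
    } else {                                       // slower than the base version
        if (p_eq_1)        return 7;
        if (powerup > 1.0) return 8;
        return (speedup > powerup) ? 4 : 6;        // 4 is green, 6 is red
    }
}

static GpsUp evaluate(const Measurement& base, const Measurement& opt) {
    GpsUp m;
    m.speedup        = base.time_s / opt.time_s;        // T_B / T_O
    m.greenup        = base.energy_J / opt.energy_J;    // E_B / E_O
    m.powerup        = m.speedup / m.greenup;           // P_O / P_B
    m.edp            = opt.time_s * opt.energy_J;       // E * T of the optimized run
    m.normalized_edp = m.edp / (base.time_s * base.energy_J);
    m.category       = gpsup_category(m.speedup, m.powerup);
    return m;
}

int main() {
    // Example values: LU without pivoting at 2.1 GHz, 1 thread (base) vs. 40 threads (Table 2).
    const Measurement base{716.89, 71034.97};
    const Measurement opt{24.43, 6267.94};
    const GpsUp m = evaluate(base, opt);
    std::printf("Speedup %.4f Greenup %.4f Powerup %.4f Category %d Normalized EDP %.4f\n",
                m.speedup, m.greenup, m.powerup, m.category, m.normalized_edp);
    return 0;
}
```

For these inputs, the sketch reproduces the corresponding Table 2 entries: Speedup of about 29.34, Greenup of about 11.33, Powerup of about 2.59, Category 3, and a normalized EDP of about 0.0030.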

3.2. Algorithms

We will briefly introduce the LU without pivoting, LU with pivoting, and Cholesky algorithms used to solve systems of linear equations. The LU factorization transforms the square nonsingular matrix A into a product of two matrices:
$$A = LU ,$$
where L and U are lower and upper triangular matrices, respectively.
Implementing the LU factorization in the above form is rare in practice due to the risk of numerical problems [13]. These can be removed by simply rearranging the rows of the matrix A. The LU factorization without pivoting (i.e., without rearranging the rows) exists if the matrix A is strictly diagonally dominant. The LU factorization with pivoting factorizes the matrix as follows:
$$PA = LU ,$$
where P is a permutation matrix. The Cholesky factorization is defined only for a Hermitian, positive-definite matrix A and has the form:
$$A = LL^T ,$$
where L is a lower triangular matrix. In this article, we investigate the LAPACK [14] implementations of these factorizations from the MKL, namely the dgetrfnpi (LU without pivoting) [15], dgetrf (LU with pivoting), and dpotrf (Cholesky) routines. These implementations are based on the BLAS, and their parallelism arises from the use of a multithreaded BLAS.
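For illustration, the sketch below shows how such factorizations can be invoked from C++ through the LAPACKE interface of the MKL. It is a simplified example under our own assumptions (a small matrix size, a synthetic symmetric and strictly diagonally dominant test matrix so that both factorizations are well defined, and an illustrative thread count), not the benchmark code used for the measurements; the no-pivot routine dgetrfnpi is only mentioned in a comment because its C wrapper name varies between MKL versions.

```cpp
#include <mkl.h>       // Intel MKL: LAPACKE interface and mkl_set_num_threads
#include <cstddef>
#include <cstdio>
#include <random>
#include <vector>

int main() {
    const lapack_int n = 4096;        // small size for illustration; the experiments use n = 32,768
    mkl_set_num_threads(40);          // number of MKL threads (up to 40 in our runs)

    // Build a symmetric, strictly diagonally dominant matrix in row-major layout, so that
    // both the Cholesky factorization and the LU factorization without pivoting exist.
    std::vector<double> a(static_cast<std::size_t>(n) * n);
    std::mt19937_64 gen(42);
    std::uniform_real_distribution<double> dist(0.0, 1.0);
    for (lapack_int i = 0; i < n; ++i)
        for (lapack_int j = 0; j <= i; ++j)
            a[i * n + j] = a[j * n + i] = (i == j) ? static_cast<double>(n) : dist(gen);

    std::vector<double> lu = a;       // dgetrf overwrites its input with the L and U factors
    std::vector<lapack_int> ipiv(n);  // row-permutation indices produced by partial pivoting
    lapack_int info = LAPACKE_dgetrf(LAPACK_ROW_MAJOR, n, n, lu.data(), n, ipiv.data());
    std::printf("dgetrf (LU with pivoting): info = %d\n", static_cast<int>(info));

    std::vector<double> chol = a;     // dpotrf overwrites the lower triangle with L
    info = LAPACKE_dpotrf(LAPACK_ROW_MAJOR, 'L', n, chol.data(), n);
    std::printf("dpotrf (Cholesky): info = %d\n", static_cast<int>(info));

    // The LU factorization without pivoting used in this paper is MKL's dgetrfnpi routine;
    // its C wrapper name depends on the MKL version, so it is not called in this sketch.
    return 0;
}
```

A program like this is linked against the MKL (for example, with the Intel compiler's MKL link option or the MKL link-line advisor) and can then be run with the thread count set as above and with the clock frequency fixed as described in Section 3.3.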
The total numbers of floating-point operations (additions, multiplications, divisions) for the LU factorizations without and with pivoting are the same and equal approximately $\frac{2}{3}n^3$. The number of floating-point comparisons is 0 for the LU factorization without pivoting and approximately $\frac{1}{2}n^2$ for the LU factorization with pivoting. The number of floating-point operations in the Cholesky factorization is approximately $\frac{1}{3}n^3$. Here, $n$ is the size of the factored matrix A.
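For the matrix size used in our experiments ($n = 32{,}768 = 2^{15}$; see Section 3.3), these counts evaluate to roughly
$$\tfrac{2}{3}n^3 = \tfrac{2}{3}\cdot 2^{45} \approx 2.3 \times 10^{13} \text{ operations (LU)}, \quad \tfrac{1}{3}n^3 \approx 1.2 \times 10^{13} \text{ operations (Cholesky)}, \quad \tfrac{1}{2}n^2 = 2^{29} \approx 5.4 \times 10^{8} \text{ comparisons (pivoting)} ,$$
so the Cholesky factorization performs about half as many floating-point operations as the LU factorizations, which is consistent with the time and energy results in Section 4.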

3.3. Methodology

We tested three versions of matrix factorization: LU without pivoting, LU with pivoting, and Cholesky. We tested all algorithms without parallelization (1 thread) and in parallelized versions for 10, 20, 30, and 40 threads.
Our test dataset was a random square matrix of double-precision values with dimensions $n \times n$, $n = 32{,}768$. The matrix therefore has 1,073,741,824 entries, which corresponds to 8 GB of data. All versions of the algorithm use the row-major layout and were implemented in C++ with vectorization and parallelism.
Our experimental setup was a computing platform equipped with two modern multicore Intel(R) Xeon(R) Gold 5218R processors (40 cores in total) clocked at 2.10 GHz. We used Linux kernel 4.18.0 and the AlmaLinux 8.4 operating system with the Intel ICC compiler v. 2021.5.0.
The Linux kernel supports CPU performance scaling via the CPU frequency scaling (CPUFreq) subsystem, which consists of three layers of code: the core, scaling drivers, and scaling governors. The CPUFreq core provides the common code infrastructure and user-space interfaces for all platforms that support CPU performance scaling and defines the basic framework within which the other components operate. A scaling driver talks to the hardware: it provides the scaling governors with information about the available P-states (or ranges of P-states in some cases) and accesses platform-specific hardware interfaces to change the P-states of the processor as requested by the governors. A scaling governor implements an algorithm to estimate the required CPU capacity. As a rule, each governor implements one, possibly parameterized, scaling algorithm.
The default scaling driver and governor are selected automatically, but user space tools such as cpupower, acpid, laptop mode tools, or desktop GUI tools can still be used for advanced configuration.
We changed the clock frequency using CPUFreq with the acpi_cpufreq driver. By default, the acpi_cpufreq driver uses the conservative governor, which increases or decreases the clock frequency depending on the load on the core by selecting one of the available frequencies between the minimum and the maximum supported by the processor.
We used the cpupower program to change the minimum and maximum values of the processor frequency limit at a given level. We used the following commands:
  • cpupower frequency-set -d 1700000
  • cpupower frequency-set -u 1700000
for setting the minimum and maximum frequency limits to 1.7 GHz. These commands automatically changed the governor to userspace, which allows a specific frequency to be set. The frequency setting was applied to all cores.
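The same limits can also be set programmatically by writing to the CPUFreq sysfs files. The C++ sketch below mirrors the two cpupower commands above; it assumes root privileges, 40 logical cores, and a driver that exposes the scaling_min_freq/scaling_max_freq files under /sys/devices/system/cpu (as acpi_cpufreq does). It is shown only as an illustration; the experiments themselves used cpupower.

```cpp
#include <fstream>
#include <iostream>
#include <string>

// Write a single value to a sysfs file; returns false if the file could not be opened or written.
static bool write_sysfs(const std::string& path, const std::string& value) {
    std::ofstream f(path);
    if (!f) return false;
    f << value;
    return static_cast<bool>(f);
}

int main(int argc, char** argv) {
    const std::string khz = (argc > 1) ? argv[1] : "1700000";   // 1.7 GHz by default, in kHz
    const int ncpus = 40;                                       // 40 logical cores on our machine
    for (int cpu = 0; cpu < ncpus; ++cpu) {
        const std::string base =
            "/sys/devices/system/cpu/cpu" + std::to_string(cpu) + "/cpufreq/";
        // Pin the core to one frequency by setting min and max limits to the same value,
        // mirroring `cpupower frequency-set -d/-u`.
        const bool ok = write_sysfs(base + "scaling_min_freq", khz) &&
                        write_sysfs(base + "scaling_max_freq", khz);
        if (!ok) std::cerr << "failed to set limits for cpu" << cpu << " (root required?)\n";
    }
    return 0;
}
```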
The frequencies (P-states) available on our platform range from 0.8 GHz to 2.1 GHz in steps of 0.1 GHz. We first performed tests for the following frequencies: 0.8 GHz, 1.1 GHz, 1.4 GHz, 1.7 GHz, and 2.1 GHz.
To analyze the impact of the algorithm optimizations on energy consumption, we used measurements from the Running Average Power Limit (RAPL) interface, which is available on Intel processors. RAPL uses machine-specific registers to monitor and control energy consumption in real time. In multi-socket systems, RAPL reports results for each socket (each package) separately and also provides separate measurements for the memory modules (DRAM) associated with each socket. Starting with the Haswell processors, which have fully integrated voltage regulators, the accuracy of the measurements returned by RAPL improved and is satisfactory [16]. During the tests, we took measurements every 1 s, and for the analysis, we used the total energy consumption of all sockets and their associated memory modules.
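On Linux, the RAPL counters can be read either through the machine-specific registers directly or through the powercap sysfs interface. The following C++ sketch illustrates the latter for a single package domain; the domain paths and names differ between machines (the DRAM subdomains have to be added analogously), so this is an illustrative sketch rather than the exact measurement tool used in our tests.

```cpp
#include <chrono>
#include <cstdint>
#include <fstream>
#include <iostream>
#include <string>
#include <thread>

// Read one RAPL energy counter (in microjoules) through the Linux powercap interface.
static uint64_t read_uj(const std::string& domain) {
    std::ifstream f(domain + "/energy_uj");
    uint64_t uj = 0;
    f >> uj;
    return uj;
}

int main() {
    // Example domain: package 0; DRAM subdomains appear as children of this directory.
    const std::string pkg0 = "/sys/class/powercap/intel-rapl:0";

    std::ifstream mf(pkg0 + "/max_energy_range_uj");   // the counter wraps around at this value
    uint64_t max_range = 0;
    mf >> max_range;

    const uint64_t before = read_uj(pkg0);
    std::this_thread::sleep_for(std::chrono::seconds(1));  // 1 s sampling, as in our tests
    const uint64_t after = read_uj(pkg0);

    const uint64_t delta = (after >= before) ? after - before
                                             : (max_range - before) + after;  // handle wraparound
    std::cout << "package-0 energy over 1 s: " << delta / 1e6 << " J\n";
    return 0;
}
```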

4. Results

4.1. Time and Energy Consumption

Figure 2 shows the runtime and energy consumption of the individual algorithms for 1 thread, and Figure 3 shows the same for the parallel versions (i.e., 10, 20, 30, and 40 threads). For one thread, the Cholesky factorization performs best, using both the least time and the least energy. The LU factorizations without and with pivoting are similar in terms of time and energy consumption, with LU without pivoting slightly better. Because the Cholesky factorization consumes less time and energy than the LU factorizations, the x-axis scales in Figure 3 differ.
In Figure 3, we see that scaling the clock frequency down increases the runtime in all cases, while increasing the number of threads reduces it. Therefore, for our architecture, 40 threads and a frequency of 2.1 GHz were optimal in terms of time. However, if we look at the energy consumption, in some cases, lowering the clock frequency to 1.7 GHz reduced the energy consumption. We observed this for the non-parallelized versions of the algorithms (about 19%), as well as for the parallelized versions in all cases for 10 threads (3% for LU without pivoting, 2% for LU with pivoting, and 9% for Cholesky). Furthermore, for 20 threads, we noted a decrease in energy consumption at 1.7 GHz for the Cholesky factorization (5%), and even for 40 threads in the case of the LU factorization without pivoting (1.6%). Reducing the clock frequency to 1.4 GHz and below brought no benefit. Energy consumption also decreased as the number of threads used for the calculations increased. The only exception was the Cholesky algorithm at 2.1 GHz, which turned out to save 1.2% more energy at 30 threads than at 40 threads. We observed a similar situation at 2.0 and 1.9 GHz (Figure 4).
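For example, the roughly 19% saving for the non-parallelized runs can be checked against the single-threaded LU with pivoting data reported later in Table 5:
$$\frac{E_{2.1\,\mathrm{GHz}} - E_{1.7\,\mathrm{GHz}}}{E_{2.1\,\mathrm{GHz}}} = \frac{71{,}914.57 - 58{,}064.77}{71{,}914.57} \approx 0.19 .$$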
Due to the energy drops observed when changing the clock frequency to 1.7 GHz, we performed further tests to investigate what happens at the other frequencies between 1.4 GHz and 2.1 GHz. The results can be seen in Figure 4. These tests confirmed a local minimum of the energy consumption at a frequency of 1.7 GHz for the LU without pivoting algorithm running on 40 threads. Running the LU without pivoting algorithm on 40 threads with the clock frequency reduced to 1.7 GHz gave the lowest energy consumption.
Table 1 presents the algorithm versions and clock frequencies for which we obtained the lowest energy consumption for each of the three factorizations. In the table, the first column describes the algorithm and the selected settings, the second column shows the energy consumption in Joules, and the third shows the running time of the algorithm. The fourth and fifth columns show the loss of runtime and the gain in energy consumption in percent, respectively, compared with the algorithm and clock setting that gave the shortest time; for all three factorizations, this was 40 threads and a frequency of 2.1 GHz. We see that, for the LU factorization with pivoting, the intuitive choice (more threads, a higher clock frequency) indeed gave the best result in terms of both time and energy. However, the other two tested factorizations showed exceptions to this rule. If we care more about saving energy than time, we can consider other choices of the number of threads and clock settings. Our tests showed that we saved the most energy when we used 40 threads for LU without pivoting but reduced the clock frequency to 1.7 GHz, while for Cholesky, it was 30 threads with the clock frequency reduced to 2.0 GHz.

4.2. Speedup and Greenup

The graphs in Figure 5 show the Speedup (left column) and Greenup (right column) values of the algorithms run at different clock frequencies, based on our measurements. Each graph corresponds to one algorithm (LU without pivoting, LU with pivoting, and Cholesky). The x-axis shows the number of running threads; the y-axis shows the value of the Speedup or Greenup metric. For reference, we marked the maximum expected Speedup (which is linear in the number of threads p) and Greenup (which is logarithmic in the number of threads p) with a black dashed line in the charts. For all three algorithms, as the number of threads increased, regardless of the frequency, we gained more in runtime than in energy consumption (time decreased faster than energy consumption). For the LU algorithms without and with pivoting, the Speedup for all frequencies was almost linear, and with the increase in the number of threads, it deviated further from the maximum expected value. In terms of Speedup, the frequency of 0.8 GHz was the most favorable, and the lowest Speedup values were obtained at 2.1 GHz.
In general, the Greenup deviated much further from its maximum expected value than the Speedup did in all cases, confirming our observations. For the tested algorithms, as with Speedup, the highest Greenup values were achieved at 0.8 GHz; the smallest values, unlike for Speedup, were obtained at 2.0 GHz. For the 2.1 GHz frequency, which had the lowest Speedup values, we still obtained relatively high Greenup values (purple line in the graph).
On our architecture, for the tested algorithms, we observed the relationship:
$$\text{Greenup} \leq \alpha \log_2(\text{Speedup}) ,$$
where $\alpha > \beta$ and, in our case, $\beta$ belongs to the interval (2.84, 2.85). The research question that arises is: what is the value of $\alpha$ for other architectures?

4.3. Energy and Runtime Correlation

Since looking at Speedup and Greenup separately did not give us complete information about which optimization was cost-effective in terms of runtime and energy consumption, we now look at metrics that give a more complete picture of the situation: the EDP metric [11] and the GPS-UP metrics [12] (Greenup, Powerup, Speedup). Let us first look at Powerup itself, which expresses the relationship between Speedup and Greenup. From the graphs in Figure 6, we can see that the more threads we add, the more we gain in time compared with energy (Powerup grows). We also see that Powerup was bounded from above by the curve $\gamma \log_2(p)$, where $p$ is the number of threads and $\gamma = 0.75$ on our architecture for the tested algorithms.
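As a quick check of this bound against the data in Table 2, at $p = 40$ threads and 2.1 GHz the LU factorization without pivoting reaches a Powerup of about 2.59, while the bound gives
$$\gamma \log_2 p = 0.75 \cdot \log_2 40 \approx 0.75 \cdot 5.32 \approx 3.99 ,$$
so the measured value stays below the curve.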
To analyze which optimizations were cost-effective for energy consumption, let us look at the GPS-UP and EDP metrics. Let us first consider the optimization consisting of increasing the number of threads. Table 2 presents a summary of the metrics for the LU without pivoting algorithm running on a single thread at 2.1 GHz and its optimizations consisting of increasing the number of threads. With GPS-UP, we obtain Category 3, which is green, in every case, i.e., when increasing the number of threads to 10, 20, 30, and 40. Thus, each of these optimizations results in energy savings. The last two columns of this table present the EDP and the normalized EDP:
$$\text{Normalized EDP} = \frac{\text{EDP}_O}{\text{EDP}_B} ,$$
where $\text{EDP}_O$ is the EDP of the optimized version and $\text{EDP}_B$ is the EDP of the base version. The EDP is the product of the time and the energy used, so the smaller the EDP, the better. We present the values of the EDP metric because it is a well-known and often-considered measure (among others, in [17,18,19]), although it can sometimes be misleading, as we can see in the results in Table 3.
Thus, each of the optimizations considered in Table 2 saves energy, and the EDP indicates that the greatest savings are achieved by the optimization with 40 threads. For the LU algorithm with pivoting, we obtained analogous results.
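The entries of Table 2 can be reproduced directly from the definitions in Section 3.1. For the 40-thread run, for example (up to the rounding of the displayed time and energy values):
$$\text{Powerup} = \frac{\text{Speedup}}{\text{Greenup}} = \frac{29.3395}{11.3331} \approx 2.589, \qquad \text{Normalized EDP} = \frac{24.43 \cdot 6267.94}{716.89 \cdot 71{,}034.97} \approx \frac{1.53 \times 10^{5}}{5.09 \times 10^{7}} \approx 0.0030 .$$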
For the analogous optimizations of the Cholesky algorithm, we also obtained Category 3 based on GPS-UP (Table 3). Similarly, the EDP measure indicates 40 threads as the best optimization. In addition, we see that the EDP values for the 40- and 30-thread optimizations differ only slightly. However, if we look at the Powerup column, we see a lower value for the 30-thread optimization. Looking at the runtime and energy consumption data, we can see that the algorithm with 30 threads performs better than with 40 threads in terms of energy but consumes more time, which is reflected in the lower Powerup. Thus, for the Cholesky algorithm at 2.1 GHz, the optimization with 30 threads is the best in terms of energy savings.
We now look at the optimizations using DVFS. We start with the parallel version of the LU without pivoting algorithm on 40 threads running at 2.1 GHz. Table 4 summarizes the results and metrics for the optimization consisting of switching to a lower clock frequency. Looking at the EDP measure alone, we would conclude that these optimizations do not pay off because of too large a time loss. However, we obtained a green category from the GPS-UP measure in two cases, for the 2.0 and 1.7 GHz frequencies. In these cases, we lost time but saved energy, while in the others, we lost both time and energy (Category 6, red). Looking at Powerup for the green optimizations, we can choose the more energy-saving optimization; in this case, it was the frequency of 1.7 GHz.
Our experiments showed that, for the LU algorithm with pivoting on 40 threads running at 2.1 GHz, the optimization consisting of reducing the frequency brought no benefit, and we obtained the red Category 6. The same holds for 30 and 20 threads. On the other hand, clock frequency scaling benefited this algorithm when running on 1 thread and on 10 threads, as shown in Table 5 and Table 6, where the green Category 4 appears.
For the Cholesky algorithm and the optimization by reducing the clock frequency (Table 7), we can see that, for 40 threads, reducing the clock frequency had a positive effect (green category) when the frequency was set to 2.0 GHz or 1.8 GHz. Moreover, the lower Powerup value indicates that the 1.8 GHz setting is more favorable in terms of energy consumption.

5. Conclusions

This paper considered the concept of scalability in terms of execution time and energy consumption for three matrix factorizations (LU without pivoting, LU with pivoting, and Cholesky) taken from the MKL. To reduce the energy consumption of these factorizations, the DVFS technique was used, which made it possible to influence the clock frequency settings from the operating system level without interfering with the implementation code.
The study investigated the effect of two parameters, the clock frequency and the number of threads, on the execution time and energy consumption on a multicore machine. The execution time was always the shortest for the highest clock frequency and the highest number of threads for all three factorizations. We cannot say the same about the energy savings, as they depended on both the number of threads and the clock frequency.
The Speedup and Greenup values increased with an increasing number of threads and usually with a decreasing clock frequency. The experimental results showed that the Speedup value was higher than the Greenup value; for certain settings of the clock frequency and the number of threads, the Speedup was up to 74% higher than the Greenup.
By examining the EDP and GPS-UP metrics, a correlation between runtime and energy consumption was observed for all three factorizations. Some parameter settings (number of threads and clock frequency) allowed energy savings despite an increased runtime (Category 4), while other settings reduced energy consumption by reducing the runtime (Category 3). Still others showed an increased runtime and reduced power, but not enough to save energy (Category 6). These results support the claim that minimizing energy consumption does not always follow solely from minimizing the runtime of the application.
Future work includes an investigation of the weak scalability in terms of execution time and in terms of energy consumption on different multicore machines and different applications. During the investigation of weak scalability, the problem size per processor will remain constant as more computing units are added. In addition, an important issue to be addressed would be to find a correlation between strong and weak scalability in terms of energy consumption.

Author Contributions

Conceptualization, B.B.; Methodology, M.P.; Software, B.B.; Formal analysis, M.P.; Investigation, M.P.; Resources, B.B.; Writing—original draft, M.P.; Writing—review & editing, B.B.; Visualization, M.P.; Project administration, B.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Data from all conducted experiments are available in the public repository at https://github.com/mdpiekarz/time-energy-correlation-article, accessed on 1 June 2023.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Ngoko, Y.; Trystram, D. Scalability in Parallel Processing. In Topics in Parallel and Distributed Computing, Enhancing the Undergraduate Curriculum: Performance, Concurrency, and Programming on Modern Platforms; Prasad, S.K., Gupta, A., Rosenberg, A.L., Sussman, A., Weems, C.C., Eds.; Springer: Berlin/Heidelberg, Germany, 2018; pp. 79–109. [Google Scholar] [CrossRef]
  2. Dongarra, J.; DuCroz, J.; Duff, I.S.; Hammarling, S. A Set of Level-3 Basic Linear Algebra Subprograms. ACM Trans. Math. Softw. 1990, 16, 1–28. [Google Scholar] [CrossRef]
  3. Intel Math Kernel Library 2014. Available online: http://software.intel.com/en-us/articles/intel-mkl/ (accessed on 1 June 2023).
  4. Weiser, M.; Welch, B.; Demers, A.; Shenker, S. Scheduling for reduced CPU energy. In Mobile Computing; The Kluwer International Series in Engineering and Computer Science; Imielinski, T., Korth, H.F., Eds.; Springer: Boston, MA, USA, 1994; Volume 353, pp. 13–23. [Google Scholar]
  5. Garcia, E.; Arteaga, J.; Pavel, R.; Gao, G.R. Optimizing the LU Factorization for Energy Efficiency on a Many-Core Architecture. In Proceedings of the Languages and Compilers for Parallel Computing; Cașcaval, C., Montesinos, P., Eds.; Springer: Cham, Switzerland, 2014; pp. 237–251. [Google Scholar]
  6. Whaley, R.C.; Petitet, A.; Dongarra, J.J. Automated empirical optimizations of software and the ATLAS project. Parallel Comput. 2001, 27, 3–35. [Google Scholar] [CrossRef]
  7. Jakobs, T.; Lang, J.; Rünger, G.; Stöcker, P. Tuning linear algebra for energy efficiency on multicore machines by adapting the ATLAS library. Future Gener. Comput. Syst. 2018, 82, 555–564. [Google Scholar] [CrossRef]
  8. Rauber, T.; Rünger, G.; Stachowski, M. Model-based optimization of the energy efficiency of multi-threaded applications. Sustain. Comput. Inform. Syst. 2019, 22, 44–61. [Google Scholar] [CrossRef]
  9. Carretero, J.; Distefano, S.; Petcu, D.; Pop, D.; Rauber, T.; Rünger, G.; Singh, D.E. Energy-efficient Algorithms for Ultrascale Systems. Supercomput. Front. Innov. 2015, 2, 77–104. [Google Scholar] [CrossRef]
  10. Bratek, R.; Szustak, L.; Wyrzykowski, R.; Olas, T. Reducing energy consumption using heterogeneous voltage frequency scaling of data-parallel applications for multicore systems. J. Parallel Distrib. Comput. 2023, 175, 121–133. [Google Scholar] [CrossRef]
  11. Gonzalez, R.; Horowitz, M. Energy dissipation in general purpose microprocessors. IEEE J. Solid-State Circuits 1996, 31, 1277–1284. [Google Scholar] [CrossRef]
  12. Abdulsalam, S.; Zong, Z.; Gu, Q.; Meikang, Q. Using the Greenup, Powerup, and Speedup metrics to evaluate software energy efficiency. In Proceedings of the 2015 Sixth International Green and Sustainable Computing Conference (IGSC), Las Vegas, NV, USA, 14–16 December 2015; pp. 1–8. [Google Scholar] [CrossRef]
  13. Trefethen, L.N.; Bau, D. Numerical Linear Algebra; SIAM: Philadelphia, PA, USA, 2022; Lecture 21. [Google Scholar]
  14. Anderson, E.; Bai, Z.; Bischof, C.; Demmel, J.; Du Croz, J.; Greenbaum, A.; Hammarling, S.; McKenney, A.; Sorensen, D. LAPACK Users’ Guide. Society for Industrial and Applied Mathematics; SIAM: Philadelphia, PA, USA, 1999. [Google Scholar] [CrossRef]
  15. Demmel, J.W. Applied Numerical Linear Algebra; SIAM: Philadelphia, PA, USA, 1997. [Google Scholar] [CrossRef]
  16. Khan, K.; Hirki, M.; Niemi, T.; Nurminen, J.; Ou, Z. RAPL in Action: Experiences in Using RAPL for Power Measurements. ACM Trans. Model. Perform. Eval. Comput. Syst. (TOMPECS) 2018, 3, 1–26. [Google Scholar] [CrossRef]
  17. Hajiamini, S.; Shirazi, B.A. Chapter Two—A Study of DVFS Methodologies for Multicore Systems with Islanding Feature; Elsevier: Amsterdam, The Netherlands, 2020; Volume 119, pp. 35–71. [Google Scholar] [CrossRef]
  18. Hoveida, M.; Aghaaliakbari, F.; Jalili, M.; Bashizade, R.; Arjomand, M.; Sarbazi-Azad, H. Chapter Two—Revisiting Processor Allocation and Application Mapping in Future CMPs in Dark Silicon Era. In Dark Silicon and Future On-Chip Systems; Hurson, A.R., Sarbazi-Azad, H., Eds.; Elsevier: Amsterdam, The Netherlands, 2018; Volume 110, pp. 35–81. [Google Scholar] [CrossRef]
  19. Murray, J.; Wettin, P.; Pande, P.P.; Shirazi, B. Chapter 10—Conclusions and Possible Future Explorations. In Sustainable Wireless Network-on-Chip Architectures; Murray, J., Wettin, P., Pande, P.P., Shirazi, B., Eds.; Morgan Kaufmann: Boston, MA, USA, 2016; pp. 143–155. [Google Scholar] [CrossRef]
Figure 1. GPS-UP Software Energy Efficiency Quadrant Graph (source: [12]). Cat. 1: Powerup < 1, Speedup > 1; Cat. 2: Powerup = 1, Speedup > 1; Cat. 3: Powerup > 1, Speedup > 1, Speedup > Powerup; Cat. 4: Powerup < 1, Speedup < 1, Speedup > Powerup; Cat. 5: Powerup > 1, Speedup > 1, Powerup > Speedup; Cat. 6: Powerup < 1, Speedup < 1, Speedup < Powerup; Cat. 7: Powerup = 1, Speedup < 1; Cat. 8: Powerup > 1, Speedup < 1.
Figure 2. Time execution for LU without pivoting, LU with pivoting, and Cholesky for 1 thread (left); energy consumption of LU without pivoting, LU with pivoting, and Cholesky for 1 thread (right).
Figure 3. Time execution for LU without pivoting, LU with pivoting, and Cholesky (left); energy consumption of LU without pivoting, LU with pivoting, and Cholesky (right).
Figure 4. Energy consumption of LU without pivoting, LU with pivoting, and Cholesky for frequencies from 1.6 GHz to 2.1 GHz.
Figure 5. Speedup for LU without pivoting, LU with pivoting, and Cholesky (left); Greenup of LU without pivoting, LU with pivoting, and Cholesky (right) (p denotes the number of threads).
Figure 6. Powerup for LU without pivoting, LU with pivoting, and Cholesky for a variable number of threads (p denotes the number of threads) at a fixed clock frequency.
Table 1. The most energy-saving versions of the algorithms.

Version of Algorithm | Energy Consumption (J) | Time (s) | Waste of Time (%) | Energy Saving (%)
LU without pivoting, 40 threads, 1.7 GHz | 6169.10 | 26.53 | 7.9 | 1.6
LU with pivoting, 40 threads, 2.1 GHz | 6306.62 | 24.31 | 0 | 0
Cholesky, 30 threads, 2.0 GHz | 3375.74 | 14.67 | 10.1 | 1.3
Table 2. Base-version algorithm: LU without pivoting without parallelization, running at a frequency of 2.1 GHz, compared with the parallel versions for 10, 20, 30, and 40 threads.

Threads/Frequency | Time [s] | Energy [J] | Speedup | Greenup | Powerup | Category | EDP | Normalized EDP
1/2.1 | 716.89 | 71,034.97 | 1.0000 | 1.0000 | 1.0000 | Base | 50,924,083.5731 | 1.0000
10/2.1 | 73.81 | 10,927.71 | 9.7128 | 6.5004 | 1.4942 | 3 | 806,557.6063 | 0.0158
20/2.1 | 37.91 | 7468.31 | 18.9104 | 9.5115 | 1.9882 | 3 | 283,120.9561 | 0.0056
30/2.1 | 27.92 | 6560.54 | 25.6745 | 10.8276 | 2.3712 | 3 | 183,184.3908 | 0.0036
40/2.1 | 24.43 | 6267.94 | 29.3395 | 11.3331 | 2.5888 | 3 | 153,152.2553 | 0.0030
Table 3. Base-version algorithm: Cholesky without parallelization, running at a frequency of 2.1 GHz, compared with the parallel versions for 10, 20, 30, and 40 threads.

Threads/Frequency | Time [s] | Energy [J] | Speedup | Greenup | Powerup | Category | EDP | Normalized EDP
1/2.1 | 375.37 | 37,661.12 | 1.0000 | 1.0000 | 1.0000 | Base | 14,136,853.0153 | 1.0000
10/2.1 | 37.98 | 5601.75 | 9.8833 | 6.7231 | 1.4701 | 3 | 212,755.7039 | 0.0150
20/2.1 | 20.23 | 3942.98 | 18.5530 | 9.5514 | 1.9424 | 3 | 79,775.3197 | 0.0056
30/2.1 | 14.24 | 3377.33 | 26.3548 | 11.1512 | 2.3634 | 3 | 48,103.1362 | 0.0034
40/2.1 | 13.18 | 3418.30 | 28.4864 | 11.0175 | 2.5856 | 3 | 45,043.5085 | 0.0032
Table 4. Base-version algorithm: LU without pivoting, parallel version running at a frequency of 2.1 GHz on 40 threads, compared with the algorithm running at lower CPU frequencies.

Threads/Frequency | Time [s] | Energy [J] | Speedup | Greenup | Powerup | Category | EDP | Normalized EDP
40/2.1 | 24.43 | 6267.94 | 1.0000 | 1.0000 | 1.0000 | Base | 153,152.2553 | 1.0000
40/2.0 | 25.07 | 6264.05 | 0.9748 | 1.0006 | 0.9742 | 4 | 157,012.9875 | 1.0252
40/1.9 | 25.87 | 6293.69 | 0.9446 | 0.9959 | 0.9485 | 6 | 162,797.3705 | 1.0630
40/1.8 | 26.53 | 6270.34 | 0.9210 | 0.9996 | 0.9213 | 6 | 166,353.3561 | 1.0862
40/1.7 | 27.39 | 6217.28 | 0.8922 | 1.0081 | 0.8850 | 4 | 170,271.6647 | 1.1118
40/1.6 | 28.96 | 6318.91 | 0.8437 | 0.9919 | 0.8505 | 6 | 183,010.9802 | 1.1950
40/1.4 | 31.90 | 6401.94 | 0.7661 | 0.9791 | 0.7825 | 6 | 204,190.6201 | 1.3333
40/1.1 | 39.40 | 7007.56 | 0.6202 | 0.8945 | 0.6934 | 6 | 276,088.4306 | 1.8027
40/0.8 | 53.01 | 8054.57 | 0.4609 | 0.7782 | 0.5923 | 6 | 427,002.3607 | 2.7881
Table 5. Base-version algorithm: LU with pivoting without parallelization, running at a frequency of 2.1 GHz, compared with the algorithm running at lower CPU frequencies.

Threads/Frequency | Time [s] | Energy [J] | Speedup | Greenup | Powerup | Category | EDP | Normalized EDP
1/2.1 | 724.78 | 71,914.57 | 1 | 1 | 1 | Base | 52,122,410.6035 | 1.0000
1/2.0 | 767.24 | 50,920.53 | 0.9447 | 1.4123 | 0.6689 | 4 | 39,068,253.5085 | 0.7495
1/1.9 | 806.09 | 52,842.73 | 0.8991 | 1.3609 | 0.6607 | 4 | 42,596,156.3818 | 0.8172
1/1.8 | 849.50 | 55,128.00 | 0.8532 | 1.3045 | 0.6540 | 4 | 46,831,489.9000 | 0.8985
1/1.7 | 897.89 | 58,064.77 | 0.8072 | 1.2385 | 0.6517 | 4 | 52,135,892.0472 | 1.0003
1/1.6 | 953.14 | 61,609.11 | 0.7604 | 1.1673 | 0.6514 | 4 | 58,721,816.5918 | 1.1266
1/1.4 | 1085.91 | 69,981.08 | 0.6674 | 1.0276 | 0.6495 | 4 | 75,992,808.7077 | 1.4580
1/1.1 | 1378.09 | 86,557.87 | 0.5259 | 0.8308 | 0.6330 | 6 | 119,284,879.5255 | 2.2886
1/0.8 | 1891.50 | 118,098.32 | 0.3832 | 0.6089 | 0.6293 | 6 | 223,382,614.8920 | 4.2857
Table 6. Base-version algorithm: LU with pivoting, parallel version running at a frequency of 2.1 GHz on 10 threads, compared with the algorithm running at lower CPU frequencies. (For 20, 30, and 40 threads, reducing the frequency gained nothing, so we obtained a red category.)

Threads/Frequency | Time [s] | Energy [J] | Speedup | Greenup | Powerup | Category | EDP | Normalized EDP
10/2.1 | 94.20 | 13,510.11 | 1 | 1 | 1 | Base | 1,272,626.9687 | 1.0000
10/2.0 | 99.15 | 12,126.88 | 0.9500 | 1.1141 | 0.8528 | 4 | 1,202,399.7200 | 0.9448
10/1.9 | 104.85 | 12,437.41 | 0.8984 | 1.0862 | 0.8270 | 4 | 1,304,115.9838 | 1.0247
10/1.8 | 109.65 | 12,869.10 | 0.8590 | 1.0498 | 0.8183 | 4 | 1,411,154.5934 | 1.1089
10/1.7 | 115.81 | 13,174.41 | 0.8134 | 1.0255 | 0.7932 | 4 | 1,525,707.1415 | 1.1989
10/1.6 | 122.62 | 13,638.36 | 0.7682 | 0.9906 | 0.7755 | 6 | 1,672,284.4035 | 1.3140
10/1.4 | 139.28 | 14,497.22 | 0.6763 | 0.9319 | 0.7257 | 6 | 2,019,231.5586 | 1.5867
10/1.1 | 177.64 | 16,659.85 | 0.5303 | 0.8109 | 0.6539 | 6 | 2,959,463.0973 | 2.3255
10/0.8 | 241.33 | 19,580.23 | 0.3903 | 0.6900 | 0.5657 | 6 | 4,725,256.6406 | 3.7130
Table 7. Base-version algorithm: Cholesky, parallel version running at a frequency of 2.1 GHz on 40 threads, compared with the algorithm running at lower CPU frequencies.

Threads/Frequency | Time [s] | Energy [J] | Speedup | Greenup | Powerup | Category | EDP | Normalized EDP
40/2.1 | 13.18 | 3418.30 | 1.0000 | 1.0000 | 1.0000 | Base | 45,043.5085 | 1.0000
40/2.0 | 13.26 | 3412.81 | 0.9940 | 1.0016 | 0.9924 | 4 | 45,241.6245 | 1.0044
40/1.9 | 13.66 | 3425.57 | 0.9644 | 0.9979 | 0.9665 | 6 | 46,803.7656 | 1.0391
40/1.8 | 13.90 | 3404.33 | 0.9477 | 1.0041 | 0.9439 | 4 | 47,332.7571 | 1.0508
40/1.7 | 14.47 | 3423.69 | 0.9105 | 0.9984 | 0.9119 | 6 | 49,548.7595 | 1.1000
40/1.6 | 15.18 | 3447.15 | 0.8679 | 0.9916 | 0.8752 | 6 | 52,336.2409 | 1.1619
40/1.4 | 16.97 | 3501.77 | 0.7766 | 0.9762 | 0.7956 | 6 | 59,414.3823 | 1.3190
40/1.1 | 20.17 | 3764.29 | 0.6532 | 0.9081 | 0.7193 | 6 | 75,936.2663 | 1.6858
40/0.8 | 26.82 | 4365.30 | 0.4912 | 0.7831 | 0.6273 | 6 | 117,093.8704 | 2.5996