State-of-the-Art High-Performance Computing and Networking

A special issue of Applied Sciences (ISSN 2076-3417). This special issue belongs to the section "Computing and Artificial Intelligence".

Deadline for manuscript submissions: closed (30 September 2022) | Viewed by 15,688

Special Issue Editors


Guest Editor
Department of Computer Science, Barcelona Supercomputing Center (BSC), 08034 Barcelona, Spain
Interests: high-performance computing; programming models and runtime systems; parallel programming; resource heterogeneity; GPU computing; network communications

Guest Editor
Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA
Interests: high-performance computing; parallel programming; linear algebra; computational fluid dynamics

Special Issue Information

High-performance computing (HPC) plays an indispensable role in addressing many of the challenges of today's society. From fighting cancer to astrophysics, supercomputers are used to solve extremely large problems in a reasonable time. As all fields of science evolve, ever more computational power is required to process ever-growing datasets. Major challenges arise in three broad domains: hardware, software, and applications. Hardware support for exascale can be achieved through increased energy-efficient heterogeneity (i.e., specialized resources). This heterogeneity is no longer exclusive to processing elements (e.g., CPUs and accelerators), as modern memory hierarchies tend to expose a variety of memory subsystems built from different technologies. Another hardware-related challenge is the significant increase in the number of compute nodes compared with previous systems, which brings the potential for serious congestion problems because of the higher generated network traffic. These hardware-related challenges have major implications for all software layers, which bear an ever heavier burden in hiding the complexity of managing hardware resources from the programmer.

This Special Issue “State-of-the-Art High-Performance Computing and Networking” aims to publish novel research related to all levels of HPC. Submissions are expected to focus on computing and/or networking aspects of current and emerging HPC platforms. Review articles are also welcome.

Topics of interest include, but are not limited to, the following:

  • Applications
  • Programming models, runtime systems, and compiler support
  • System software
  • Performance tools
  • Computer architecture
  • Network architecture and HPC protocols

Prof. Dr. Antonio J. Peña
Prof. Dr. Pedro Valero-Lara
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website; once registered, authors can access the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Applied Sciences is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • high-performance computing
  • high-performance networks
  • computational science
  • supercomputing

Published Papers (5 papers)


Research

24 pages, 3466 KiB  
Article
The Effects of High-Performance Cloud System for Network Function Virtualization
by Wu-Chun Chung and Yun-He Wang
Appl. Sci. 2022, 12(20), 10315; https://0-doi-org.brum.beds.ac.uk/10.3390/app122010315 - 13 Oct 2022
Cited by 1 | Viewed by 1391
Abstract
Since ETSI introduced the architectural framework of network function virtualization (NFV), telecom operators have paid more attention to the synergy of NFV and cloud computing. With the integration of the NFV cloud platform, telecom operators decouple network functions from the dedicated hardware and run virtualized network functions (VNFs) on the cloud. However, virtualization degrades VNF performance, which can violate the performance requirements of the telecom industry. Most existing works were not conducted in a cloud computing environment, few studies have focused on the use of enhanced platform awareness (EPA) features, and fewer still analyze the performance of the service function chain on a practical cloud. This paper equips the OpenStack cloud with different EPA features to investigate their performance effects on VNFs. A comprehensive test framework is proposed, covering functionality verification, performance testing, and application testing. Empirical results show that the cloud system under test fulfills the service-level-agreement requirements in the Rally Sanity testcases. In the performance test, the throughput of OVS-DPDK is up to 8.2 times that of OVS, while the hardware-assisted solution, SR-IOV, achieves near-line-rate throughput in the end-to-end scenario. In the application test, applying the EPA features on the cloud improves the successful call rate of the vIMS service by up to 14%.
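The three-tier evaluation the abstract describes (functionality verification, performance testing, application testing) can be sketched as a small harness; the phase names, stub checks, and `run_phases` function below are illustrative assumptions, not the authors' actual framework:

```python
# Hypothetical sketch of a tiered NFV test harness: each phase runs a set of
# named checks and the harness reports the per-phase pass rate.

def run_phases(phases):
    """Run each phase's checks; return {phase: fraction of checks passed}."""
    results = {}
    for phase, checks in phases.items():
        passed = sum(1 for check in checks.values() if check())
        results[phase] = passed / len(checks)
    return results

# Stub checks standing in for real tests (e.g., Rally Sanity, throughput runs).
phases = {
    "functionality": {"vnf_boots": lambda: True, "sla_met": lambda: True},
    "performance":   {"ovs_dpdk_speedup": lambda: 8.2 > 1.0},  # abstract: 8.2x over OVS
    "application":   {"vims_call_rate_improved": lambda: True},
}

report = run_phases(phases)
print(report)  # each phase maps to its pass rate
```

A real framework would replace the lambdas with calls into the cloud under test; the point here is only the phase structure.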
(This article belongs to the Special Issue State-of-the-Art High-Performance Computing and Networking)

32 pages, 822 KiB  
Article
A Survey on Malleability Solutions for High-Performance Distributed Computing
by Jose I. Aliaga, Maribel Castillo, Sergio Iserte, Iker Martín-Álvarez and Rafael Mayo
Appl. Sci. 2022, 12(10), 5231; https://0-doi-org.brum.beds.ac.uk/10.3390/app12105231 - 22 May 2022
Cited by 5 | Viewed by 2108
Abstract
Maintaining a high rate of productivity, in terms of completed jobs per unit of time, in High-Performance Computing (HPC) facilities is a cornerstone of the next generation of exascale supercomputers, and process malleability is a straightforward mechanism to address this issue. Nowadays, the vast majority of HPC facilities are intended for distributed-memory applications based on the Message Passing (MP) paradigm; for this reason, many efforts build on the Message Passing Interface (MPI), the de facto standard programming model. Malleability aims to rescale executions on the fly, that is, to reconfigure the number and layout of processes in running applications. Process malleability involves reallocating resources within the HPC system, handling the application's processes, and redistributing data among those processes so that execution can resume. This manuscript compiles how different frameworks address process malleability, their main features, their integration into resource management systems, and how they may be used in user codes. It is a detailed state-of-the-art survey devised as an entry point for researchers interested in process malleability.
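The data-redistribution step that malleability entails can be illustrated with a toy sketch: repartition a block-distributed array when the process count changes. This is a serial stand-in for what an MPI framework would do with message passing; the function names and layout policy are assumptions for illustration:

```python
def block_partition(data, nprocs):
    """Split data into nprocs contiguous blocks, as evenly as possible."""
    base, extra = divmod(len(data), nprocs)
    blocks, start = [], 0
    for rank in range(nprocs):
        size = base + (1 if rank < extra else 0)  # first `extra` ranks get one more
        blocks.append(data[start:start + size])
        start += size
    return blocks

def redistribute(blocks, new_nprocs):
    """Rescale on the fly: gather the old blocks, repartition for the new layout."""
    flat = [x for block in blocks for x in block]
    return block_partition(flat, new_nprocs)

old = block_partition(list(range(10)), 4)   # layout for 4 processes
new = redistribute(old, 3)                  # shrink to 3 processes
print([len(b) for b in new])                # -> [4, 3, 3]
```

In a real malleable application the gather/scatter would be done with collectives among the old and new process groups rather than in one address space.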
(This article belongs to the Special Issue State-of-the-Art High-Performance Computing and Networking)

23 pages, 1510 KiB  
Article
RLSchert: An HPC Job Scheduler Using Deep Reinforcement Learning and Remaining Time Prediction
by Qiqi Wang, Hongjie Zhang, Cheng Qu, Yu Shen, Xiaohui Liu and Jing Li
Appl. Sci. 2021, 11(20), 9448; https://0-doi-org.brum.beds.ac.uk/10.3390/app11209448 - 12 Oct 2021
Cited by 8 | Viewed by 2401
Abstract
The job scheduler plays a vital role in high-performance computing platforms: it determines the execution order of jobs and the allocation of resources, which in turn affect the resource utilization of the entire system. As the scale and complexity of HPC continue to grow, job scheduling is becoming increasingly important and difficult. Existing studies relied on user-specified estimates or regression techniques to produce fixed runtime predictions and used those values in static heuristic scheduling algorithms. However, such approaches require very accurate runtime predictions to produce good results, and fixed heuristic scheduling strategies cannot adapt to changes in the workload. In this work, we propose RLSchert, a job scheduler based on deep reinforcement learning and remaining-runtime prediction. First, RLSchert estimates the state of the system with a dynamic job remaining-runtime predictor, providing an accurate spatiotemporal view of the cluster status. Second, RLSchert learns the optimal policy for selecting or killing jobs according to that status through imitation learning and the proximal policy optimization algorithm. Extensive experiments on real-world job logs from the USTC Supercomputing Center show that RLSchert is superior to static heuristic policies and outperforms the learning-based scheduler DeepRM. In addition, the dynamic predictor gives more accurate remaining-runtime predictions, which is essential for most learning-based schedulers.
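A static heuristic of the kind RLSchert is compared against can be sketched in a few lines; the job tuples and the shortest-predicted-remaining-time-first policy below are illustrative assumptions, not the paper's algorithm:

```python
# Toy static baseline: each job is (job_id, predicted_remaining_runtime, nodes).
# Greedily admit runnable jobs in ascending predicted-remaining-runtime order.

def srtf_schedule(jobs, free_nodes):
    """Return the job ids admitted under a shortest-remaining-time-first policy."""
    scheduled = []
    for job_id, remaining, nodes in sorted(jobs, key=lambda j: j[1]):
        if nodes <= free_nodes:          # admit only if the request fits
            scheduled.append(job_id)
            free_nodes -= nodes
    return scheduled

jobs = [("a", 120, 4), ("b", 30, 8), ("c", 60, 2)]
print(srtf_schedule(jobs, free_nodes=10))  # -> ['b', 'c']
```

A learned policy replaces this fixed ordering with one adapted to workload changes, and, per the abstract, may also decide to kill jobs.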
(This article belongs to the Special Issue State-of-the-Art High-Performance Computing and Networking)

16 pages, 1012 KiB  
Article
Analyzing the Performance of the S3 Object Storage API for HPC Workloads
by Frank Gadban and Julian Kunkel
Appl. Sci. 2021, 11(18), 8540; https://0-doi-org.brum.beds.ac.uk/10.3390/app11188540 - 14 Sep 2021
Cited by 3 | Viewed by 5686
Abstract
The line between HPC and Cloud is getting blurry: performance is still the main driver in HPC, while cloud storage systems are assumed to offer low latency, high throughput, high availability, and scalability. The Simple Storage Service (S3) has emerged as the de facto storage API for object storage in the Cloud. This paper checks whether the S3 API is already a viable alternative for HPC access patterns in terms of performance, or whether further performance advancements are necessary. For this purpose: (a) we extend two common HPC I/O benchmarks, the IO500 and MD-Workbench, to quantify the performance of the S3 API, and perform the analysis on the Mistral supercomputer by launching the enhanced benchmarks against different S3 implementations, both on-premises (Swift, MinIO) and in the Cloud (Google, IBM…); we find that these implementations do not yet meet the demanding performance and scalability expectations of HPC workloads. (b) We identify the causes of the performance loss by systematically replacing parts of a popular S3 client library with lightweight replacements of lower stack components. The resulting S3Embedded library is highly scalable and leverages the shared cluster file systems of HPC infrastructure to accommodate arbitrary S3 client applications. Another introduced library, S3remote, uses TCP/IP for communication instead of HTTP and provides a single local S3 gateway on each node. By broadening the scope of the IO500, this research enables the community to track the performance growth of S3 and encourages the sharing of best practices for performance optimization. The analysis also shows that a high-performance S3 library such as S3Embedded can, over time, bring performance convergence at the storage level between Cloud and HPC.
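The kind of measurement such benchmarks perform can be illustrated with a minimal harness that times put/get operations against an in-memory stand-in for an object store; the `DictStore` class and the operation counts are assumptions for illustration, not part of the IO500 or MD-Workbench:

```python
import time

class DictStore:
    """In-memory stand-in for an S3-style object store (put/get by key)."""
    def __init__(self):
        self._objects = {}
    def put(self, key, data):
        self._objects[key] = data
    def get(self, key):
        return self._objects[key]

def bench(store, n_ops, payload=b"x" * 1024):
    """Time n_ops puts followed by n_ops gets; return (put, get) ops/sec."""
    t0 = time.perf_counter()
    for i in range(n_ops):
        store.put(f"obj-{i}", payload)
    t1 = time.perf_counter()
    for i in range(n_ops):
        store.get(f"obj-{i}")
    t2 = time.perf_counter()
    return n_ops / (t1 - t0), n_ops / (t2 - t1)

put_rate, get_rate = bench(DictStore(), n_ops=10_000)
print(f"put: {put_rate:,.0f} ops/s, get: {get_rate:,.0f} ops/s")
```

Swapping `DictStore` for a real S3 client exposes the protocol overhead (HTTP, authentication, serialization) that the paper's S3Embedded and S3remote libraries aim to cut.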
(This article belongs to the Special Issue State-of-the-Art High-Performance Computing and Networking)

12 pages, 1979 KiB  
Article
Improvements to Supercomputing Service Availability Based on Data Analysis
by Jae-Kook Lee, Min-Woo Kwon, Do-Sik An, Junweon Yoon, Taeyoung Hong, Joon Woo, Sung-Jun Kim and Guohua Li
Appl. Sci. 2021, 11(13), 6166; https://0-doi-org.brum.beds.ac.uk/10.3390/app11136166 - 02 Jul 2021
Viewed by 2247
Abstract
As the demand for high-performance computing (HPC) resources has increased in the field of computational science, service availability has become an inevitable consideration in large cluster systems such as supercomputers. The factor that most affects availability in supercomputing services is the job scheduler used for allocating resources. An analysis of the data users submitted through the job scheduler showed that 25.6% of jobs failed because of program errors, scheduler errors, or I/O errors. Based on this analysis, we propose a K-hook scheduling method to increase the success rate of job submissions and improve the availability of supercomputing services. Applying this method improved the job-submission success rate by 15% without negatively affecting users' waiting time. We also achieved a mean time between interrupts (MTBI) of 24.3 days and maintained average system availability at 97%. As this research was verified on the Nurion supercomputer in a real service environment, it is expected to yield significant service improvements.
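The reported figures are mutually consistent under the standard availability formula A = MTBI / (MTBI + MTTR): a 24.3-day MTBI at 97% availability implies a mean recovery time of roughly 0.75 days (about 18 hours). The check below assumes that formula; the MTTR value is derived here, not stated in the abstract:

```python
# Derive the implied mean time to repair (MTTR) from the reported
# MTBI (24.3 days) and availability (97%), assuming A = MTBI / (MTBI + MTTR).
mtbi = 24.3            # mean time between interrupts, in days
availability = 0.97
mttr = mtbi * (1 / availability - 1)
print(f"implied MTTR: {mttr:.2f} days")  # about 0.75 days (~18 hours)
```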
(This article belongs to the Special Issue State-of-the-Art High-Performance Computing and Networking)
