Hardware-Aware Deep Learning

A special issue of Applied Sciences (ISSN 2076-3417). This special issue belongs to the section "Computing and Artificial Intelligence".

Deadline for manuscript submissions: closed (20 May 2023)

Special Issue Editors


Dr. Deliang Fan
Guest Editor
School of Electrical, Computer and Energy Engineering, Arizona State University (ASU), Tempe, AZ, USA
Interests: hardware-aware deep learning; in-memory computing; emerging post-CMOS non-volatile memory; trustworthy AI

Dr. Zhezhi He
Guest Editor
Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai 200240, China
Interests: neuromorphic computing; secure and efficient deep learning; electronic design automation

Dr. Alessandro Bruno
Guest Editor
Department of Biomedical Sciences, Humanitas University, Via Rita Levi Montalcini 4, Pieve Emanuele, 20072 Milan, Italy
Interests: computer vision; artificial intelligence; deep learning; image analysis and processing; visual saliency; biomedical engineering

Special Issue Information

Dear Colleagues,

One of the main factors contributing to the success of deep learning (DL) is the enormous computing power provided by modern hardware, spanning from high-performance server systems to resource-limited edge devices. The edge side (e.g., embedded systems, IoT) demands not only extreme energy efficiency but also real-time inference capability, which calls for cross-stack techniques including model compression, compilation, architecture and circuit design of AI chips, emerging devices, etc. Beyond that, recent directions such as federated learning also bring model training to the edge, taking the data-security and computing limitations of mobile devices into consideration. On the cloud side, as DL model sizes have grown exponentially over the last two years (e.g., OpenAI GPT-3, Google Switch Transformer), efficiently supporting the training and inference of these immense models is also an emerging research direction; without lowering their hardware cost, incorporating them into the paradigm of machine learning as a service (MLaaS) will be infeasible. Moreover, the security and fault tolerance of DL lead to a coherent line of research, such as making DL robust against errors and non-ideal effects of the target hardware (e.g., bit errors in the memory system). Together, these concerns motivate research on hardware-aware deep learning, optimized for energy, latency, and even security.

Dr. Deliang Fan
Dr. Zhezhi He
Dr. Alessandro Bruno
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, as well as short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Applied Sciences is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • acceleration of deep learning
  • artificial intelligence of things (AIoT)
  • model compression
  • algorithm and hardware co-design for deep learning
  • neural architecture search
  • security issues associated with deep learning on hardware
  • near-sensor intelligence
  • hardware-aware compilation techniques for deep learning
  • federated learning and split learning

Published Papers (11 papers)


Research

15 pages, 970 KiB  
Article
Implementation of the SoftMax Activation for Reconfigurable Neural Network Hardware Accelerators
by Vladislav Shatravin, Dmitriy Shashev and Stanislav Shidlovskiy
Appl. Sci. 2023, 13(23), 12784; https://doi.org/10.3390/app132312784 - 28 Nov 2023
Abstract
In recent decades, machine-learning algorithms have been extensively utilized to tackle various complex tasks. To achieve the high performance and efficiency of these algorithms, various hardware accelerators are used. Typically, these devices are specialized for specific neural network architectures and activation functions. However, state-of-the-art complex autonomous and mobile systems may require different algorithms for different tasks. Reconfigurable accelerators can be used to resolve this problem. They possess the capability to support diverse neural network architectures and allow for significant alterations to the implemented model at runtime. Thus, a single device can be used to address entirely different tasks. Our research focuses on dynamically reconfigurable accelerators based on reconfigurable computing environments (RCE). To implement the required neural networks on such devices, their algorithms need to be adapted to the homogeneous structure of RCE. This article proposes the first implementation of the widely used SoftMax activation for hardware accelerators based on RCE. The implementation leverages spatial distribution and incorporates several optimizations to enhance its performance. The timing simulation of the proposed implementation on FPGA shows a high throughput of 1.12 Gbps at 23 MHz. The result is comparable to counterparts lacking reconfiguration capability. However, this flexibility comes at the expense of increased consumption of logic elements.
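The paper's RCE-specific design is not reproduced here, but the usual hardware-friendly recipe for SoftMax (subtract the maximum for numerical stability, then replace e^x with a base-2 exponent served by a small lookup table) can be sketched in a few lines. The Python snippet below is only an illustration of that recipe under assumed precision settings; the `frac_bits` parameter is hypothetical and the code is not the implementation described in the article.

```python
import numpy as np

def softmax_hw_friendly(logits, frac_bits=8):
    """Illustrative fixed-point-friendly softmax (not the paper's RCE design).

    Subtracting the maximum keeps all exponents non-positive, and working in
    base 2 lets hardware replace exp() with shifts plus a small LUT for the
    fractional part. Here the LUT step is emulated by quantizing the exponent
    to `frac_bits` fractional bits.
    """
    x = np.asarray(logits, dtype=np.float64)
    x = x - x.max()                                        # stability: values <= 0
    y = x / np.log(2.0)                                    # convert e^x to 2^y
    y = np.round(y * (1 << frac_bits)) / (1 << frac_bits)  # emulate LUT precision
    p = np.power(2.0, y)
    return p / p.sum()

logits = [2.0, 1.0, 0.1]
print(softmax_hw_friendly(logits))  # close to an exact exp-based softmax
```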

30 pages, 2413 KiB  
Article
Application of Machine Learning Ensemble Methods to ASTRI Mini-Array Cherenkov Event Reconstruction
by Antonio Pagliaro, Giancarlo Cusumano, Antonino La Barbera, Valentina La Parola and Saverio Lombardi
Appl. Sci. 2023, 13(14), 8172; https://doi.org/10.3390/app13148172 - 13 Jul 2023
Abstract
The Imaging Atmospheric Cherenkov technique has opened up previously unexplored windows for the study of astrophysical radiation sources in the very high-energy (VHE) regime and is playing an important role in the discovery and characterization of VHE gamma-ray emitters. However, even for the most powerful sources, the data collected by Imaging Atmospheric Cherenkov Telescopes (IACTs) are heavily dominated by the overwhelming background due to cosmic-ray nuclei and cosmic-ray electrons. As a result, the analysis of IACT data necessitates the use of a highly efficient background rejection technique capable of distinguishing a gamma-ray induced signal through identification of shape features in its image. We present a detailed case study of gamma/hadron separation and energy reconstruction. Using a set of simulated data based on the ASTRI Mini-Array Cherenkov telescopes, we have assessed and compared a number of supervised Machine Learning methods, including the Random Forest method, Extra Trees method, and Extreme Gradient Boosting (XGB). To determine the optimal weighting for each method in the ensemble, we conducted extensive experiments involving multiple trials and cross-validation tests. As a result of this thorough investigation, we found that the most sensitive Machine Learning technique applied to our data sample for gamma/hadron separation is a Stacking Ensemble Method composed of 42% Extra Trees, 28% Random Forest, and 30% XGB. In addition, the best-performing technique for energy estimation is a different Stacking Ensemble Method composed of 45% XGB, 27.5% Extra Trees, and 27.5% Random Forest. These optimal weightings were derived from extensive testing and fine-tuning, ensuring maximum performance for both gamma/hadron separation and energy estimation.
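As a rough illustration of the weighted combination reported above, the sketch below builds a soft-voting ensemble of Extra Trees, Random Forest, and XGBoost with the stated 42/28/30 weights using scikit-learn and the xgboost package. It is a simplification of the authors' stacking ensemble; the random data stands in for the simulated ASTRI Mini-Array features, and the hyperparameters are assumptions.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier, VotingClassifier
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier  # assumes the xgboost package is installed

# Placeholder features/labels; in the paper these come from simulated
# ASTRI Mini-Array Cherenkov images (gamma vs. hadron).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = rng.integers(0, 2, size=1000)

# Weighted soft-voting combination using the weights reported in the abstract
# (42% Extra Trees, 28% Random Forest, 30% XGB); a simplification of the
# stacking ensemble described by the authors.
ensemble = VotingClassifier(
    estimators=[
        ("et", ExtraTreesClassifier(n_estimators=200, random_state=0)),
        ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
        ("xgb", XGBClassifier(n_estimators=200, eval_metric="logloss")),
    ],
    voting="soft",
    weights=[0.42, 0.28, 0.30],
)
print(cross_val_score(ensemble, X, y, cv=5).mean())
```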

18 pages, 2843 KiB  
Article
The Effects of Daubechies Wavelet Basis Function (DWBF) and Decomposition Level on the Performance of Artificial Intelligence-Based Atrial Fibrillation (AF) Detection Based on Electrocardiogram (ECG) Signals
by Satria Mandala, Annisa Rizki Pratiwi Wibowo, Adiwijaya, Suyanto, Mohd Soperi Mohd Zahid and Ardian Rizal
Appl. Sci. 2023, 13(5), 3036; https://doi.org/10.3390/app13053036 - 27 Feb 2023
Abstract
This research studies the effects of both the Daubechies wavelet basis function (DWBF) and the decomposition level (DL) on the performance of detecting atrial fibrillation (AF) based on electrocardiograms (ECGs). ECG signals (consisting of 23 AF data and 18 normal data from MIT-BIH) were decomposed at various levels using several types of DWBF to obtain four wavelet coefficient features (WCFs), namely, minimum (min), maximum (max), mean, and standard deviation (stdev). These features were then classified to detect the presence of AF using a support vector machine (SVM) classifier. The distribution of training and testing data for the SVM follows the 5-fold cross-validation (CV) principle to produce optimum detection performance. In this study, AF detection performance is measured and analyzed based on accuracy, sensitivity, and specificity metrics. The results of the analysis show that accuracy tends to decrease as the decomposition level increases, while remaining stable across the various types of DWBF. For both sensitivity and specificity, the analysis shows that increasing the decomposition level likewise causes a decrease in both metrics. However, unlike accuracy, changing the DWBF type causes these two metrics to fluctuate over a wider range. The statistical results also indicate that the highest AF detection accuracy (94.17%) is obtained with the Daubechies 2 (DB2) function at a decomposition level of 4, whereas the highest sensitivity, 97.57%, occurs when AF detection uses DB6 with a decomposition level of 2. Finally, DB2 with decomposition level 4 yields a specificity of 96.75%. The finding of this study is that selecting the appropriate DL has a more significant effect than the DWBF on AF detection using WCFs.
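The described pipeline (Daubechies wavelet decomposition, four statistical features, an SVM with 5-fold cross-validation) maps naturally onto PyWavelets and scikit-learn. The sketch below is an assumed reconstruction, not the authors' code: the synthetic signals stand in for the MIT-BIH records, and computing the features over the concatenated coefficients is a guess.

```python
import numpy as np
import pywt  # PyWavelets; assumed installed
from sklearn.model_selection import cross_validate
from sklearn.svm import SVC

def wcf_features(ecg_segment, wavelet="db2", level=4):
    """Wavelet coefficient features (min, max, mean, stdev), as in the abstract."""
    coeffs = pywt.wavedec(ecg_segment, wavelet, level=level)
    flat = np.concatenate(coeffs)
    return [flat.min(), flat.max(), flat.mean(), flat.std()]

# Placeholder signals standing in for MIT-BIH AF / normal ECG records.
rng = np.random.default_rng(1)
signals = rng.normal(size=(41, 2048))
labels = np.array([1] * 23 + [0] * 18)   # 23 AF, 18 normal, as in the study

X = np.array([wcf_features(s, "db2", 4) for s in signals])
scores = cross_validate(SVC(kernel="rbf"), X, labels, cv=5,
                        scoring=("accuracy", "recall", "precision"))
print({k: v.mean() for k, v in scores.items() if k.startswith("test_")})
```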

13 pages, 2313 KiB  
Article
Hardware-Aware Mobile Building Block Evaluation for Computer Vision
by Maxim Bonnaerens, Matthias Freiberger, Marian Verhelst and Joni Dambre
Appl. Sci. 2022, 12(24), 12615; https://doi.org/10.3390/app122412615 - 09 Dec 2022
Abstract
In this paper, we propose a methodology to accurately evaluate and compare the performance of efficient neural network building blocks for computer vision in a hardware-aware manner. Our comparison uses Pareto fronts based on randomly sampled networks from a design space to capture the underlying accuracy/complexity trade-offs. We show that our approach matches the information obtained by previous comparison paradigms while providing more insight into the relationship between hardware cost and accuracy. We use our methodology to analyze different building blocks and evaluate their performance on a range of embedded hardware platforms. This highlights the importance of benchmarking building blocks as a preselection step in the design process of a neural network. We show that choosing the right building block can speed up inference by up to a factor of two on specific hardware ML accelerators.
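A minimal sketch of the core bookkeeping, extracting the Pareto front from (latency, accuracy) pairs measured for randomly sampled networks, is shown below; the sample numbers are made up and the authors' full benchmarking methodology is not reproduced.

```python
def pareto_front(points):
    """Return the (latency, accuracy) points not dominated by any other.

    `points` is a list of (latency_ms, accuracy) pairs measured for randomly
    sampled networks from a design space; a point is kept if no other point
    is both at least as fast and at least as accurate.
    """
    front = []
    for lat, acc in points:
        dominated = any(
            (l2 <= lat and a2 >= acc) and (l2, a2) != (lat, acc)
            for l2, a2 in points
        )
        if not dominated:
            front.append((lat, acc))
    return sorted(front)

samples = [(5.0, 0.71), (7.5, 0.74), (6.0, 0.70), (9.0, 0.75), (8.8, 0.73)]
print(pareto_front(samples))   # [(5.0, 0.71), (7.5, 0.74), (9.0, 0.75)]
```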

20 pages, 1307 KiB  
Article
Learning Low-Precision Structured Subnetworks Using Joint Layerwise Channel Pruning and Uniform Quantization
by Xinyu Zhang, Ian Colbert and Srinjoy Das
Appl. Sci. 2022, 12(15), 7829; https://doi.org/10.3390/app12157829 - 04 Aug 2022
Abstract
Pruning and quantization are core techniques used to reduce the inference costs of deep neural networks. Among the state-of-the-art pruning techniques, magnitude-based pruning algorithms have demonstrated consistent success in the reduction of both weight and feature map complexity. However, we find that existing measures of neuron (or channel) importance estimation used for such pruning procedures have at least one of two limitations: (1) failure to consider the interdependence between successive layers; and/or (2) performing the estimation in a parametric setting or by using distributional assumptions on the feature maps. In this work, we demonstrate that the importance rankings of the output neurons of a given layer strongly depend on the sparsity level of the preceding layer, and therefore, naïvely estimating neuron importance to drive magnitude-based pruning will lead to suboptimal performance. Informed by this observation, we propose a purely data-driven, nonparametric, magnitude-based channel pruning strategy that works in a greedy manner based on the activations of the previous sparsified layer. We demonstrate that our proposed method works effectively in combination with statistics-based quantization techniques to generate low-precision structured subnetworks that can be efficiently accelerated by hardware platforms such as GPUs and FPGAs. Using our proposed algorithms, we demonstrate increased performance per memory footprint over existing solutions across a range of discriminative and generative networks.
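A hedged PyTorch sketch of the two ingredients is given below: channels are ranked by the magnitude of the activations they produce when fed the outputs of the already-sparsified preceding layer, and the surviving weights are uniformly quantized. The selection criterion and the calibration batch are illustrative stand-ins, not the authors' exact data-driven estimator.

```python
import torch
import torch.nn as nn

def prune_channels_by_activation(layer, calib_inputs, keep_ratio=0.5):
    """Zero out the output channels whose average activation magnitude is lowest.

    `calib_inputs` are activations coming out of the (already sparsified)
    preceding layer, so the ranking reflects the layer interdependence noted
    in the abstract. Illustrative only.
    """
    with torch.no_grad():
        acts = layer(calib_inputs)                    # (N, C_out, H, W)
        scores = acts.abs().mean(dim=(0, 2, 3))       # one score per output channel
        n_keep = max(1, int(keep_ratio * scores.numel()))
        keep = torch.topk(scores, n_keep).indices
        mask = torch.zeros_like(scores)
        mask[keep] = 1.0
        layer.weight.mul_(mask.view(-1, 1, 1, 1))     # structured (channel) sparsity
        if layer.bias is not None:
            layer.bias.mul_(mask)
    return mask

def uniform_quantize(w, bits=8):
    """Symmetric uniform quantization of a weight tensor (sketch)."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max() / qmax
    return torch.clamp(torch.round(w / scale), -qmax, qmax) * scale

conv = nn.Conv2d(16, 32, 3, padding=1)
calib = torch.randn(8, 16, 28, 28)                    # stand-in calibration batch
mask = prune_channels_by_activation(conv, calib, keep_ratio=0.5)
conv.weight.data = uniform_quantize(conv.weight.data, bits=8)
print(f"kept {int(mask.sum())} of {mask.numel()} channels")
```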

18 pages, 10204 KiB  
Article
Design and Acceleration of Field Programmable Gate Array-Based Deep Learning for Empty-Dish Recycling Robots
by Zhichen Wang, Hengyi Li, Xuebin Yue and Lin Meng
Appl. Sci. 2022, 12(14), 7337; https://doi.org/10.3390/app12147337 - 21 Jul 2022
Abstract
As the proportion of the working population decreases worldwide, robots with artificial intelligence have become a good choice to help humans. At the same time, field programmable gate arrays (FPGAs) are widely used on edge devices, including robots, and greatly accelerate the inference process of deep learning tasks such as object detection. In this paper, we build a unique object detection dataset of 16 common kinds of dishes and use this dataset to train a YOLOv3 object detection model. We then propose a formalized process for deploying a YOLOv3 model on the FPGA platform, which consists of training and pruning the model on a software platform and deploying the pruned model on a hardware platform (such as an FPGA) through Vitis AI. According to the experimental results, we successfully accelerate dish detection using a YOLOv3 model on an FPGA. By applying different sparse training and pruning methods, we test the pruned model in 18 different situations on the ZCU102 evaluation board. To improve detection speed as much as possible while ensuring detection accuracy, we select the pruned model with the highest comprehensive performance; compared to the original model, its size is reduced from 62 MB to 12 MB (only 19% of the original), the number of parameters is reduced from 61,657,117 to 9,900,539 (only 16% of the original), and the running time is reduced from 14.411 s to 6.828 s (less than half of the original), while the detection accuracy decreases from 97% to 94.1%, a drop of less than 3 percentage points.
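The reported reductions can be checked directly from the numbers quoted in the abstract; the short snippet below reproduces those percentages.

```python
# Quick check of the reductions reported in the abstract.
size_ratio  = 12 / 62                  # ~0.19 -> "only 19% of the original"
param_ratio = 9_900_539 / 61_657_117   # ~0.16 -> "only 16% of the original"
time_ratio  = 6.828 / 14.411           # ~0.47 -> "less than half of the original"
acc_drop    = 97.0 - 94.1              # 2.9   -> "less than 3 percentage points"
print(f"{size_ratio:.2f} {param_ratio:.2f} {time_ratio:.2f} {acc_drop:.1f}")
```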

16 pages, 1691 KiB  
Article
Sigmoid Activation Implementation for Neural Networks Hardware Accelerators Based on Reconfigurable Computing Environments for Low-Power Intelligent Systems
by Vladislav Shatravin, Dmitriy Shashev and Stanislav Shidlovskiy
Appl. Sci. 2022, 12(10), 5216; https://doi.org/10.3390/app12105216 - 21 May 2022
Abstract
The remarkable results of applying machine learning algorithms to complex tasks are well known. They open wide opportunities in natural language processing, image recognition, and predictive analysis. However, their use in low-power intelligent systems is restricted because of high computational complexity and memory requirements. This group includes a wide variety of devices, from smartphones and Internet of Things (IoT) smart sensors to unmanned aerial vehicles (UAVs), self-driving cars, and nodes of edge computing systems. All of these devices have severe limitations on their weight and power consumption. To apply neural networks in these systems efficiently, specialized hardware accelerators are used. However, hardware implementation of some neural network operations is a challenging task. Sigmoid activation is popular in classification problems and is a notable example of such a complex operation because it uses division and exponentiation. The paper proposes efficient implementations of this activation for dynamically reconfigurable accelerators. Reconfigurable computing environments (RCE) make such accelerators reconfigurable. The paper shows the advantages of applying such accelerators in low-power systems, proposes centralized and distributed hardware implementations of the sigmoid, presents comparisons with the results of other studies, and describes the application of the proposed approaches to other activation functions. Timing simulations of the developed Verilog modules show low delay (14–18.5 ns) with acceptable accuracy (average absolute error of 4 × 10⁻³).
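The centralized and distributed RCE implementations themselves are not shown here, but the kind of approximation such designs rely on can be illustrated with a classic piecewise-linear (PLAN-style) sigmoid, which needs only shifts and additions in hardware. The snippet below evaluates that approximation against the exact sigmoid; it is an assumed example, and its error figures are not those reported in the paper.

```python
import numpy as np

def sigmoid_plan(x):
    """PLAN-style piecewise-linear sigmoid approximation (shift/add friendly).

    Breakpoints and slopes follow the classic PLAN scheme; this illustrates the
    kind of approximation hardware designs use, not the RCE implementation
    from the paper.
    """
    x = np.asarray(x, dtype=np.float64)
    a = np.abs(x)
    y = np.where(a >= 5.0, 1.0,
        np.where(a >= 2.375, 0.03125 * a + 0.84375,
        np.where(a >= 1.0,   0.125   * a + 0.625,
                             0.25    * a + 0.5)))
    return np.where(x >= 0, y, 1.0 - y)

xs = np.linspace(-8, 8, 2001)
err = np.abs(sigmoid_plan(xs) - 1.0 / (1.0 + np.exp(-xs)))
print(f"max abs error {err.max():.4f}, mean abs error {err.mean():.4f}")
```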

22 pages, 1001 KiB  
Article
Hardware Platform-Aware Binarized Neural Network Model Optimization
by Quang Hieu Vo, Faaiz Asim, Batyrbek Alimkhanuly, Seunghyun Lee and Lokwon Kim
Appl. Sci. 2022, 12(3), 1296; https://doi.org/10.3390/app12031296 - 26 Jan 2022
Abstract
Deep Neural Networks (DNNs) have shown superior accuracy at the expense of high memory and computation requirements. Optimizing DNN models with regard to energy and hardware resource requirements is extremely important for applications in resource-constrained embedded environments. Although using binary neural networks (BNNs), one of the recent promising approaches, significantly reduces the design's complexity, accuracy degradation is inevitable when reducing the precision of parameters and output activations. To balance implementation cost and accuracy, in addition to proposing specialized hardware accelerators for specific network models, most recent software binary neural networks have been optimized based on generalized metrics, such as FLOPs or MAC operation requirements. However, with the wide range of hardware available today, independently evaluating software network structures is not good enough to determine the final network model for typical devices. In this paper, an architecture search algorithm based on estimating the hardware performance at design time is proposed to find the best binary neural network models for hardware implementation on target platforms. With XNOR-Net used as the base architecture and target platforms including Field Programmable Gate Arrays (FPGAs), Graphics Processing Units (GPUs), and Resistive Random Access Memory (RRAM), the proposed algorithm shows its efficiency by giving more accurate estimates of hardware performance at design time than FLOPs or MAC operation counts.
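For context, the XNOR-Net base architecture replaces multiply-accumulate with XNOR plus popcount. The sketch below shows only that inner kernel on sign-encoded vectors; the hardware-performance estimation and search algorithm proposed in the paper are not reproduced.

```python
import numpy as np

def xnor_popcount_dot(a_bits, w_bits):
    """Binary dot product used by XNOR-Net-style layers.

    Inputs are {-1, +1} vectors encoded as {0, 1} bits; the real-valued
    dot product equals 2 * popcount(XNOR(a, w)) - n.
    """
    n = a_bits.size
    xnor = ~(a_bits ^ w_bits) & 1          # 1 where the sign bits agree
    return 2 * int(xnor.sum()) - n

rng = np.random.default_rng(0)
a = rng.integers(0, 2, size=64)
w = rng.integers(0, 2, size=64)
ref = int(np.dot(2 * a - 1, 2 * w - 1))    # reference with {-1, +1} values
assert xnor_popcount_dot(a, w) == ref
print(ref)
```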

27 pages, 19012 KiB  
Article
MLoF: Machine Learning Accelerators for the Low-Cost FPGA Platforms
by Ruiqi Chen, Tianyu Wu, Yuchen Zheng and Ming Ling
Appl. Sci. 2022, 12(1), 89; https://doi.org/10.3390/app12010089 - 22 Dec 2021
Abstract
In Internet of Things (IoT) scenarios, it is challenging to deploy Machine Learning (ML) algorithms on low-cost Field Programmable Gate Arrays (FPGAs) in a real-time, cost-efficient, and high-performance way. This paper introduces Machine Learning on FPGA (MLoF), a series of ML IP cores implemented on low-cost FPGA platforms, aimed at helping more IoT developers achieve comprehensive performance in various tasks. Using Verilog, we deploy and accelerate Artificial Neural Networks (ANNs), Decision Trees (DTs), K-Nearest Neighbors (k-NNs), and Support Vector Machines (SVMs) on 10 different FPGA development boards from seven producers. Additionally, we analyze and evaluate our design with six datasets and compare the best-performing FPGAs with traditional SoC-based systems, including the NVIDIA Jetson Nano, Raspberry Pi 3B+, and STM32L476 Nucleo. The results show that Lattice's ICE40UP5 achieves the best overall performance with low power consumption, on which MLoF reduces power by an average of 891% and increases performance by 9 times. Moreover, its cost-power-latency product (CPLP) outperforms SoC-based systems by 25 times, which demonstrates the significance of MLoF in endpoint deployment of ML algorithms. Furthermore, we make all of the code open source in order to promote future research.

17 pages, 4014 KiB  
Article
AxP: A HW-SW Co-Design Pipeline for Energy-Efficient Approximated ConvNets via Associative Matching
by Luca Mocerino and Andrea Calimera
Appl. Sci. 2021, 11(23), 11164; https://doi.org/10.3390/app112311164 - 24 Nov 2021
Abstract
The reduction in energy consumption is key for deep neural networks (DNNs) to ensure usability and reliability, whether they are deployed on low-power end-nodes with limited resources or high-performance platforms that serve large pools of users. Leveraging the over-parametrization shown by many DNN models, convolutional neural networks (ConvNets) in particular, energy efficiency can be improved substantially while preserving the model accuracy. The solution proposed in this work exploits the intrinsic redundancy of ConvNets to maximize the reuse of partial arithmetic results during the inference stages. Specifically, the weight-set of a given ConvNet is discretized through a clustering procedure such that the largest possible number of inner multiplications fall into predefined bins; this allows an off-line computation of the most frequent results, which in turn can be stored locally and retrieved when needed during the forward pass. Such a reuse mechanism leads to remarkable energy savings with the aid of a custom processing element (PE) that integrates an associative memory with a standard floating-point unit (FPU). Moreover, the adoption of an approximate associative rule based on a partial bit-match increases the hit rate over the pre-computed results, maximizing the energy reduction even further. Results collected on a set of ConvNets trained for computer vision and speech processing tasks reveal that the proposed associative-based HW-SW co-design achieves up to 77% in energy savings with less than 1% in accuracy loss.
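A rough software model of the reuse idea is sketched below: weights are discretized into a few clusters, products against a quantized activation alphabet are precomputed into a table, and each multiplication at inference becomes a lookup. The cluster count, activation grid, and matching rule are assumptions; the associative-memory PE and the approximate bit-match rule from the paper are not modeled.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
weights = rng.normal(scale=0.05, size=4096)

# 1) Discretize the weight set into a small number of bins (clusters).
k = 16
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(weights.reshape(-1, 1))
centroids = km.cluster_centers_.ravel()
w_ids = km.labels_                                   # each weight -> bin index

# 2) Quantize activations to a small alphabet and precompute all products.
act_levels = np.linspace(-1.0, 1.0, 33)              # illustrative activation grid
lut = np.outer(centroids, act_levels)                # k x 33 precomputed products

# 3) At inference, every multiplication becomes a table lookup.
acts = rng.uniform(-1.0, 1.0, size=weights.size)
a_ids = np.abs(acts[:, None] - act_levels[None, :]).argmin(axis=1)
approx = lut[w_ids, a_ids].sum()
exact = float(weights @ acts)
print(f"exact {exact:.4f}  approx {approx:.4f}")
```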

17 pages, 905 KiB  
Article
High-Performance English–Chinese Machine Translation Based on GPU-Enabled Deep Neural Networks with Domain Corpus
by Lanxin Zhao, Wanrong Gao and Jianbin Fang
Appl. Sci. 2021, 11(22), 10915; https://doi.org/10.3390/app112210915 - 18 Nov 2021
Abstract
The ability to automate machine translation has various applications in international commerce, medicine, travel, education, and text digitization. Due to the different grammar and lack of clear word boundaries in Chinese, it is challenging to conduct translation from word-based languages (e.g., English) to Chinese. This article implements a GPU-enabled deep learning machine translation system based on a domain-specific corpus. Our system takes English text as input and uses an encoder-decoder model with an attention mechanism based on Google's Transformer to translate the text into Chinese output. The model was trained using a simple self-designed entropy loss function and an Adam optimizer on English–Chinese bilingual text sentences from the News area of the UM-Corpus. The parallel training process of our model can be performed on common laptops, desktops, and servers with one or more GPUs. At training time, we not only track loss over training epochs but also measure the quality of our model's translations with the BLEU score. We also provide an easy-to-use web interface for users to manage corpora, training projects, and trained models. The experimental results show that we can achieve a maximum BLEU score of 29.2. We can further improve this score by tuning other hyperparameters. The GPU-enabled model training runs over 15x faster than on a multi-core CPU, which gives us a shorter turn-around time. As a case study, we compare the performance of our model with Baidu's, which shows that our model can compete with an industry-level translation system. We argue that our deep-learning-based translation system is particularly suitable for teaching purposes and small/medium-sized enterprises.
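As an assumed illustration of the described setup (an encoder-decoder Transformer with attention, a cross-entropy-style loss, and an Adam optimizer), the PyTorch sketch below shows the shape of one training step. Vocabulary sizes, model dimensions, and the random token batches are placeholders, not the authors' configuration.

```python
import torch
import torch.nn as nn

SRC_VOCAB, TGT_VOCAB, D_MODEL = 8000, 8000, 256   # placeholder sizes

class TinyTranslator(nn.Module):
    """Minimal encoder-decoder with attention, in the spirit of the Transformer."""
    def __init__(self):
        super().__init__()
        self.src_emb = nn.Embedding(SRC_VOCAB, D_MODEL)
        self.tgt_emb = nn.Embedding(TGT_VOCAB, D_MODEL)
        self.transformer = nn.Transformer(
            d_model=D_MODEL, nhead=4, num_encoder_layers=2,
            num_decoder_layers=2, batch_first=True)
        self.out = nn.Linear(D_MODEL, TGT_VOCAB)

    def forward(self, src, tgt):
        mask = self.transformer.generate_square_subsequent_mask(tgt.size(1))
        h = self.transformer(self.src_emb(src), self.tgt_emb(tgt), tgt_mask=mask)
        return self.out(h)

model = TinyTranslator()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

src = torch.randint(0, SRC_VOCAB, (32, 20))        # stand-in English token ids
tgt = torch.randint(0, TGT_VOCAB, (32, 22))        # stand-in Chinese token ids
logits = model(src, tgt[:, :-1])                   # teacher forcing
loss = criterion(logits.reshape(-1, TGT_VOCAB), tgt[:, 1:].reshape(-1))
loss.backward()
optimizer.step()
print(float(loss))
```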
