Hardware-Aware Deep Learning

A special issue of Applied Sciences (ISSN 2076-3417). This special issue belongs to the section "Computing and Artificial Intelligence".

Deadline for manuscript submissions: closed (20 May 2023)

Special Issue Editors


Dr. Deliang Fan
Guest Editor
School of Electrical, Computer and Energy Engineering, Arizona State University (ASU), Tempe, AZ, USA
Interests: hardware-aware deep learning; in-memory computing; emerging post-CMOS non-volatile memory; trustworthy AI

Dr. Zhezhi He
Guest Editor
Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai 200240, China
Interests: neuromorphic computing; secure and efficient deep learning; electronic design automation

Dr. Alessandro Bruno
Guest Editor
Department of Biomedical Sciences, Humanitas University, Via Rita Levi Montalcini 4, Pieve Emanuele, 20072 Milan, Italy
Interests: computer vision; artificial intelligence; deep learning; image analysis and processing; visual saliency; biomedical engineering

Special Issue Information

Dear Colleagues,

One of the main factors contributing to the success of deep learning (DL) is the enormous computing power provided by modern hardware, spanning from high-performance server systems to resource-limited edge devices. The edge side (e.g., embedded systems, IoT) demands not only extreme energy efficiency but also real-time inference capability, which calls for cross-stack techniques including model compression, compilation, architecture and circuit design of AI chips, emerging devices, etc. Beyond that, recent directions such as federated learning also bring model training to the edge, taking the data-security and computing limitations of mobile devices into consideration. On the cloud side, as DL model sizes have grown exponentially over the last two years (e.g., OpenAI GPT-3, Google Switch Transformer), efficiently supporting the training and inference of these immense models is also an emerging research direction; without lowering their hardware cost, incorporating them into the paradigm of machine learning as a service (MLaaS) will be infeasible. Moreover, the security and fault tolerance of DL lead to a coherent line of research, such as making DL robust against errors and non-ideal effects of the target hardware (e.g., bit errors in the memory system). Together, these concerns motivate research on hardware-aware deep learning, optimized for energy, latency, and even security.

Dr. Deliang Fan
Dr. Zhezhi He
Dr. Alessandro Bruno
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, as well as short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Applied Sciences is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • acceleration of deep learning
  • artificial intelligence of things (AIoT)
  • model compression
  • algorithm and hardware co-design for deep learning
  • neural architecture search
  • security issues associated with deep learning on hardware
  • near-sensor intelligence
  • hardware-aware compilation techniques for deep learning
  • federated learning and split learning

Published Papers (11 papers)


Research

15 pages, 970 KiB  
Article
Implementation of the SoftMax Activation for Reconfigurable Neural Network Hardware Accelerators
by Vladislav Shatravin, Dmitriy Shashev and Stanislav Shidlovskiy
Appl. Sci. 2023, 13(23), 12784; https://doi.org/10.3390/app132312784 - 28 Nov 2023
Abstract
In recent decades, machine-learning algorithms have been extensively utilized to tackle various complex tasks. To achieve the high performance and efficiency of these algorithms, various hardware accelerators are used. Typically, these devices are specialized for specific neural network architectures and activation functions. However, state-of-the-art complex autonomous and mobile systems may require different algorithms for different tasks. Reconfigurable accelerators can be used to resolve this problem. They possess the capability to support diverse neural network architectures and allow for significant alterations to the implemented model at runtime. Thus, a single device can be used to address entirely different tasks. Our research focuses on dynamically reconfigurable accelerators based on reconfigurable computing environments (RCE). To implement the required neural networks on such devices, their algorithms need to be adapted to the homogeneous structure of RCE. This article proposes the first implementation of the widely used SoftMax activation for hardware accelerators based on RCE. The implementation leverages spatial distribution and incorporates several optimizations to enhance its performance. The timing simulation of the proposed implementation on FPGA shows a high throughput of 1.12 Gbps at 23 MHz. The result is comparable to counterparts lacking reconfiguration capability. However, this flexibility comes at the expense of increased consumption of logic elements.
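The paper's RCE-specific design is not reproduced here, but the usual hardware-friendly recipe for SoftMax (subtract the maximum for numerical stability, then replace e^x with a base-2 exponent served by a small lookup table) can be sketched in a few lines. The Python snippet below is only an illustration of that recipe under assumed precision settings; the `frac_bits` parameter is hypothetical and the code is not the implementation described in the article.

```python
import numpy as np

def softmax_hw_friendly(logits, frac_bits=8):
    """Illustrative fixed-point-friendly softmax (not the paper's RCE design).

    Subtracting the maximum keeps all exponents non-positive, and working in
    base 2 lets hardware replace exp() with shifts plus a small LUT for the
    fractional part. Here the LUT step is emulated by quantizing the exponent
    to `frac_bits` fractional bits.
    """
    x = np.asarray(logits, dtype=np.float64)
    x = x - x.max()                                        # stability: values <= 0
    y = x / np.log(2.0)                                    # convert e^x to 2^y
    y = np.round(y * (1 << frac_bits)) / (1 << frac_bits)  # emulate LUT precision
    p = np.power(2.0, y)
    return p / p.sum()

logits = [2.0, 1.0, 0.1]
print(softmax_hw_friendly(logits))  # close to an exact exp-based softmax
```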

30 pages, 2413 KiB  
Article
Application of Machine Learning Ensemble Methods to ASTRI Mini-Array Cherenkov Event Reconstruction
by Antonio Pagliaro, Giancarlo Cusumano, Antonino La Barbera, Valentina La Parola and Saverio Lombardi
Appl. Sci. 2023, 13(14), 8172; https://doi.org/10.3390/app13148172 - 13 Jul 2023
Abstract
The Imaging Atmospheric Cherenkov technique has opened up previously unexplored windows for the study of astrophysical radiation sources in the very high-energy (VHE) regime and is playing an important role in the discovery and characterization of VHE gamma-ray emitters. However, even for the most powerful sources, the data collected by Imaging Atmospheric Cherenkov Telescopes (IACTs) are heavily dominated by the overwhelming background due to cosmic-ray nuclei and cosmic-ray electrons. As a result, the analysis of IACT data necessitates the use of a highly efficient background rejection technique capable of distinguishing a gamma-ray induced signal through identification of shape features in its image. We present a detailed case study of gamma/hadron separation and energy reconstruction. Using a set of simulated data based on the ASTRI Mini-Array Cherenkov telescopes, we have assessed and compared a number of supervised Machine Learning methods, including the Random Forest method, Extra Trees method, and Extreme Gradient Boosting (XGB). To determine the optimal weighting for each method in the ensemble, we conducted extensive experiments involving multiple trials and cross-validation tests. As a result of this thorough investigation, we found that the most sensitive Machine Learning technique applied to our data sample for gamma/hadron separation is a Stacking Ensemble Method composed of 42% Extra Trees, 28% Random Forest, and 30% XGB. In addition, the best-performing technique for energy estimation is a different Stacking Ensemble Method composed of 45% XGB, 27.5% Extra Trees, and 27.5% Random Forest. These optimal weightings were derived from extensive testing and fine-tuning, ensuring maximum performance for both gamma/hadron separation and energy estimation.
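As a rough illustration of the weighted combination reported above, the sketch below builds a soft-voting ensemble of Extra Trees, Random Forest, and XGBoost with the stated 42/28/30 weights using scikit-learn and the xgboost package. It is a simplification of the authors' stacking ensemble; the random data stands in for the simulated ASTRI Mini-Array features, and the hyperparameters are assumptions.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier, VotingClassifier
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier  # assumes the xgboost package is installed

# Placeholder features/labels; in the paper these come from simulated
# ASTRI Mini-Array Cherenkov images (gamma vs. hadron).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = rng.integers(0, 2, size=1000)

# Weighted soft-voting combination using the weights reported in the abstract
# (42% Extra Trees, 28% Random Forest, 30% XGB); a simplification of the
# stacking ensemble described by the authors.
ensemble = VotingClassifier(
    estimators=[
        ("et", ExtraTreesClassifier(n_estimators=200, random_state=0)),
        ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
        ("xgb", XGBClassifier(n_estimators=200, eval_metric="logloss")),
    ],
    voting="soft",
    weights=[0.42, 0.28, 0.30],
)
print(cross_val_score(ensemble, X, y, cv=5).mean())
```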

18 pages, 2843 KiB  
Article
The Effects of Daubechies Wavelet Basis Function (DWBF) and Decomposition Level on the Performance of Artificial Intelligence-Based Atrial Fibrillation (AF) Detection Based on Electrocardiogram (ECG) Signals
by Satria Mandala, Annisa Rizki Pratiwi Wibowo, Adiwijaya, Suyanto, Mohd Soperi Mohd Zahid and Ardian Rizal
Appl. Sci. 2023, 13(5), 3036; https://doi.org/10.3390/app13053036 - 27 Feb 2023
Abstract
This research studies the effects of both the Daubechies wavelet basis function (DWBF) and the decomposition level (DL) on the performance of detecting atrial fibrillation (AF) based on electrocardiograms (ECGs). ECG signals (consisting of 23 AF data and 18 normal data from MIT-BIH) were decomposed at various levels using several types of DWBF to obtain four wavelet coefficient features (WCFs), namely, minimum (min), maximum (max), mean, and standard deviation (stdev). These features were then classified to detect the presence of AF using a support vector machine (SVM) classifier. The distribution of training and testing data for the SVM follows the 5-fold cross-validation (CV) principle to produce optimum detection performance. In this study, AF detection performance is measured and analyzed based on accuracy, sensitivity, and specificity metrics. The results of the analysis show that accuracy tends to decrease as the decomposition level increases, while remaining stable across the various types of DWBF. For both sensitivity and specificity, the analysis shows that increasing the decomposition level likewise causes a decrease in both metrics. However, unlike accuracy, changing the DWBF type causes these two metrics to fluctuate over a wider range. The statistical results also indicate that the highest AF detection accuracy (94.17%) is obtained with the Daubechies 2 (DB2) function at a decomposition level of 4, whereas the highest sensitivity, 97.57%, occurs when AF detection uses DB6 with a decomposition level of 2. Finally, DB2 with decomposition level 4 yields a specificity of 96.75%. The finding of this study is that selecting the appropriate DL has a more significant effect than the DWBF on AF detection using WCFs.
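The described pipeline (Daubechies wavelet decomposition, four statistical features, an SVM with 5-fold cross-validation) maps naturally onto PyWavelets and scikit-learn. The sketch below is an assumed reconstruction, not the authors' code: the synthetic signals stand in for the MIT-BIH records, and computing the features over the concatenated coefficients is a guess.

```python
import numpy as np
import pywt  # PyWavelets; assumed installed
from sklearn.model_selection import cross_validate
from sklearn.svm import SVC

def wcf_features(ecg_segment, wavelet="db2", level=4):
    """Wavelet coefficient features (min, max, mean, stdev), as in the abstract."""
    coeffs = pywt.wavedec(ecg_segment, wavelet, level=level)
    flat = np.concatenate(coeffs)
    return [flat.min(), flat.max(), flat.mean(), flat.std()]

# Placeholder signals standing in for MIT-BIH AF / normal ECG records.
rng = np.random.default_rng(1)
signals = rng.normal(size=(41, 2048))
labels = np.array([1] * 23 + [0] * 18)   # 23 AF, 18 normal, as in the study

X = np.array([wcf_features(s, "db2", 4) for s in signals])
scores = cross_validate(SVC(kernel="rbf"), X, labels, cv=5,
                        scoring=("accuracy", "recall", "precision"))
print({k: v.mean() for k, v in scores.items() if k.startswith("test_")})
```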

13 pages, 2313 KiB  
Article
Hardware-Aware Mobile Building Block Evaluation for Computer Vision
by Maxim Bonnaerens, Matthias Freiberger, Marian Verhelst and Joni Dambre
Appl. Sci. 2022, 12(24), 12615; https://doi.org/10.3390/app122412615 - 09 Dec 2022
Abstract
In this paper, we propose a methodology to accurately evaluate and compare the performance of efficient neural network building blocks for computer vision in a hardware-aware manner. Our comparison uses Pareto fronts based on randomly sampled networks from a design space to capture the underlying accuracy/complexity trade-offs. We show that our approach matches the information obtained by previous comparison paradigms while providing more insight into the relationship between hardware cost and accuracy. We use our methodology to analyze different building blocks and evaluate their performance on a range of embedded hardware platforms. This highlights the importance of benchmarking building blocks as a preselection step in the design process of a neural network. We show that choosing the right building block can speed up inference by up to a factor of two on specific hardware ML accelerators.
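A minimal sketch of the core bookkeeping, extracting the Pareto front from (latency, accuracy) pairs measured for randomly sampled networks, is shown below; the sample numbers are made up and the authors' full benchmarking methodology is not reproduced.

```python
def pareto_front(points):
    """Return the (latency, accuracy) points not dominated by any other.

    `points` is a list of (latency_ms, accuracy) pairs measured for randomly
    sampled networks from a design space; a point is kept if no other point
    is both at least as fast and at least as accurate.
    """
    front = []
    for lat, acc in points:
        dominated = any(
            (l2 <= lat and a2 >= acc) and (l2, a2) != (lat, acc)
            for l2, a2 in points
        )
        if not dominated:
            front.append((lat, acc))
    return sorted(front)

samples = [(5.0, 0.71), (7.5, 0.74), (6.0, 0.70), (9.0, 0.75), (8.8, 0.73)]
print(pareto_front(samples))   # [(5.0, 0.71), (7.5, 0.74), (9.0, 0.75)]
```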

20 pages, 1307 KiB  
Article
Learning Low-Precision Structured Subnetworks Using Joint Layerwise Channel Pruning and Uniform Quantization
by Xinyu Zhang, Ian Colbert and Srinjoy Das
Appl. Sci. 2022, 12(15), 7829; https://doi.org/10.3390/app12157829 - 04 Aug 2022
Abstract
Pruning and quantization are core techniques used to reduce the inference costs of deep neural networks. Among the state-of-the-art pruning techniques, magnitude-based pruning algorithms have demonstrated consistent success in the reduction of both weight and feature map complexity. However, we find that existing measures of neuron (or channel) importance estimation used for such pruning procedures have at least one of two limitations: (1) failure to consider the interdependence between successive layers; and/or (2) performing the estimation in a parametric setting or by using distributional assumptions on the feature maps. In this work, we demonstrate that the importance rankings of the output neurons of a given layer strongly depend on the sparsity level of the preceding layer, and therefore, naïvely estimating neuron importance to drive magnitude-based pruning will lead to suboptimal performance. Informed by this observation, we propose a purely data-driven, nonparametric, magnitude-based channel pruning strategy that works in a greedy manner based on the activations of the previous sparsified layer. We demonstrate that our proposed method works effectively in combination with statistics-based quantization techniques to generate low-precision structured subnetworks that can be efficiently accelerated by hardware platforms such as GPUs and FPGAs. Using our proposed algorithms, we demonstrate increased performance per memory footprint over existing solutions across a range of discriminative and generative networks.
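A hedged PyTorch sketch of the two ingredients is given below: channels are ranked by the magnitude of the activations they produce when fed the outputs of the already-sparsified preceding layer, and the surviving weights are uniformly quantized. The selection criterion and the calibration batch are illustrative stand-ins, not the authors' exact data-driven estimator.

```python
import torch
import torch.nn as nn

def prune_channels_by_activation(layer, calib_inputs, keep_ratio=0.5):
    """Zero out the output channels whose average activation magnitude is lowest.

    `calib_inputs` are activations coming out of the (already sparsified)
    preceding layer, so the ranking reflects the layer interdependence noted
    in the abstract. Illustrative only.
    """
    with torch.no_grad():
        acts = layer(calib_inputs)                    # (N, C_out, H, W)
        scores = acts.abs().mean(dim=(0, 2, 3))       # one score per output channel
        n_keep = max(1, int(keep_ratio * scores.numel()))
        keep = torch.topk(scores, n_keep).indices
        mask = torch.zeros_like(scores)
        mask[keep] = 1.0
        layer.weight.mul_(mask.view(-1, 1, 1, 1))     # structured (channel) sparsity
        if layer.bias is not None:
            layer.bias.mul_(mask)
    return mask

def uniform_quantize(w, bits=8):
    """Symmetric uniform quantization of a weight tensor (sketch)."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max() / qmax
    return torch.clamp(torch.round(w / scale), -qmax, qmax) * scale

conv = nn.Conv2d(16, 32, 3, padding=1)
calib = torch.randn(8, 16, 28, 28)                    # stand-in calibration batch
mask = prune_channels_by_activation(conv, calib, keep_ratio=0.5)
conv.weight.data = uniform_quantize(conv.weight.data, bits=8)
print(f"kept {int(mask.sum())} of {mask.numel()} channels")
```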

18 pages, 10204 KiB  
Article
Design and Acceleration of Field Programmable Gate Array-Based Deep Learning for Empty-Dish Recycling Robots
by Zhichen Wang, Hengyi Li, Xuebin Yue and Lin Meng
Appl. Sci. 2022, 12(14), 7337; https://doi.org/10.3390/app12147337 - 21 Jul 2022
Abstract
As the proportion of the working population decreases worldwide, robots with artificial intelligence have become a good choice to help humans. At the same time, field programmable gate arrays (FPGAs) are widely used on edge devices, including robots, and greatly accelerate the inference process of deep learning tasks such as object detection. In this paper, we build a unique object detection dataset of 16 common kinds of dishes and use this dataset to train a YOLOv3 object detection model. We then propose a formalized process for deploying a YOLOv3 model on the FPGA platform, which consists of training and pruning the model on a software platform and deploying the pruned model on a hardware platform (such as an FPGA) through Vitis AI. According to the experimental results, we successfully accelerate dish detection using a YOLOv3 model on an FPGA. By applying different sparse training and pruning methods, we test the pruned model in 18 different situations on the ZCU102 evaluation board. To improve detection speed as much as possible while ensuring detection accuracy, we select the pruned model with the highest comprehensive performance; compared to the original model, its size is reduced from 62 MB to 12 MB (only 19% of the original), the number of parameters is reduced from 61,657,117 to 9,900,539 (only 16% of the original), and the running time is reduced from 14.411 s to 6.828 s (less than half of the original), while the detection accuracy decreases from 97% to 94.1%, a drop of less than 3 percentage points.
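The reported reductions can be checked directly from the numbers quoted in the abstract; the short snippet below reproduces those percentages.

```python
# Quick check of the reductions reported in the abstract.
size_ratio  = 12 / 62                  # ~0.19 -> "only 19% of the original"
param_ratio = 9_900_539 / 61_657_117   # ~0.16 -> "only 16% of the original"
time_ratio  = 6.828 / 14.411           # ~0.47 -> "less than half of the original"
acc_drop    = 97.0 - 94.1              # 2.9   -> "less than 3 percentage points"
print(f"{size_ratio:.2f} {param_ratio:.2f} {time_ratio:.2f} {acc_drop:.1f}")
```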

16 pages, 1691 KiB  
Article
Sigmoid Activation Implementation for Neural Networks Hardware Accelerators Based on Reconfigurable Computing Environments for Low-Power Intelligent Systems
by Vladislav Shatravin, Dmitriy Shashev and Stanislav Shidlovskiy
Appl. Sci. 2022, 12(10), 5216; https://doi.org/10.3390/app12105216 - 21 May 2022
Abstract
The remarkable results of applying machine learning algorithms to complex tasks are well known. They open wide opportunities in natural language processing, image recognition, and predictive analysis. However, their use in low-power intelligent systems is restricted because of high computational complexity and memory requirements. This group includes a wide variety of devices, from smartphones and Internet of Things (IoT) smart sensors to unmanned aerial vehicles (UAVs), self-driving cars, and nodes of edge computing systems. All of these devices have severe limitations on their weight and power consumption. To apply neural networks in these systems efficiently, specialized hardware accelerators are used. However, hardware implementation of some neural network operations is a challenging task. Sigmoid activation is popular in classification problems and is a notable example of such a complex operation because it uses division and exponentiation. The paper proposes efficient implementations of this activation for dynamically reconfigurable accelerators. Reconfigurable computing environments (RCE) make such accelerators reconfigurable. The paper shows the advantages of applying such accelerators in low-power systems, proposes centralized and distributed hardware implementations of the sigmoid, presents comparisons with the results of other studies, and describes the application of the proposed approaches to other activation functions. Timing simulations of the developed Verilog modules show low delay (14–18.5 ns) with acceptable accuracy (average absolute error of 4 × 10⁻³).
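The centralized and distributed RCE implementations themselves are not shown here, but the kind of approximation such designs rely on can be illustrated with a classic piecewise-linear (PLAN-style) sigmoid, which needs only shifts and additions in hardware. The snippet below evaluates that approximation against the exact sigmoid; it is an assumed example, and its error figures are not those reported in the paper.

```python
import numpy as np

def sigmoid_plan(x):
    """PLAN-style piecewise-linear sigmoid approximation (shift/add friendly).

    Breakpoints and slopes follow the classic PLAN scheme; this illustrates the
    kind of approximation hardware designs use, not the RCE implementation
    from the paper.
    """
    x = np.asarray(x, dtype=np.float64)
    a = np.abs(x)
    y = np.where(a >= 5.0, 1.0,
        np.where(a >= 2.375, 0.03125 * a + 0.84375,
        np.where(a >= 1.0,   0.125   * a + 0.625,
                             0.25    * a + 0.5)))
    return np.where(x >= 0, y, 1.0 - y)

xs = np.linspace(-8, 8, 2001)
err = np.abs(sigmoid_plan(xs) - 1.0 / (1.0 + np.exp(-xs)))
print(f"max abs error {err.max():.4f}, mean abs error {err.mean():.4f}")
```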

22 pages, 1001 KiB  
Article
Hardware Platform-Aware Binarized Neural Network Model Optimization
by Quang Hieu Vo, Faaiz Asim, Batyrbek Alimkhanuly, Seunghyun Lee and Lokwon Kim
Appl. Sci. 2022, 12(3), 1296; https://doi.org/10.3390/app12031296 - 26 Jan 2022
Abstract
Deep Neural Networks (DNNs) have shown superior accuracy at the expense of high memory and computation requirements. Optimizing DNN models with regard to energy and hardware resource requirements is extremely important for applications in resource-constrained embedded environments. Although using binary neural networks (BNNs), one of the recent promising approaches, significantly reduces the design's complexity, accuracy degradation is inevitable when reducing the precision of parameters and output activations. To balance implementation cost and accuracy, in addition to proposing specialized hardware accelerators for specific network models, most recent software binary neural networks have been optimized based on generalized metrics, such as FLOPs or MAC operation requirements. However, with the wide range of hardware available today, independently evaluating software network structures is not good enough to determine the final network model for typical devices. In this paper, an architecture search algorithm based on estimating the hardware performance at design time is proposed to find the best binary neural network models for hardware implementation on target platforms. With XNOR-Net used as the base architecture and target platforms including Field Programmable Gate Arrays (FPGAs), Graphics Processing Units (GPUs), and Resistive Random Access Memory (RRAM), the proposed algorithm shows its efficiency by giving more accurate estimates of hardware performance at design time than FLOPs or MAC operation counts.
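For context, the XNOR-Net base architecture replaces multiply-accumulate with XNOR plus popcount. The sketch below shows only that inner kernel on sign-encoded vectors; the hardware-performance estimation and search algorithm proposed in the paper are not reproduced.

```python
import numpy as np

def xnor_popcount_dot(a_bits, w_bits):
    """Binary dot product used by XNOR-Net-style layers.

    Inputs are {-1, +1} vectors encoded as {0, 1} bits; the real-valued
    dot product equals 2 * popcount(XNOR(a, w)) - n.
    """
    n = a_bits.size
    xnor = ~(a_bits ^ w_bits) & 1          # 1 where the sign bits agree
    return 2 * int(xnor.sum()) - n

rng = np.random.default_rng(0)
a = rng.integers(0, 2, size=64)
w = rng.integers(0, 2, size=64)
ref = int(np.dot(2 * a - 1, 2 * w - 1))    # reference with {-1, +1} values
assert xnor_popcount_dot(a, w) == ref
print(ref)
```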

27 pages, 19012 KiB  
Article
MLoF: Machine Learning Accelerators for the Low-Cost FPGA Platforms
by Ruiqi Chen, Tianyu Wu, Yuchen Zheng and Ming Ling
Appl. Sci. 2022, 12(1), 89; https://doi.org/10.3390/app12010089 - 22 Dec 2021
Abstract
In Internet of Things (IoT) scenarios, it is challenging to deploy Machine Learning (ML) algorithms on low-cost Field Programmable Gate Arrays (FPGAs) in a real-time, cost-efficient, and high-performance way. This paper introduces Machine Learning on FPGA (MLoF), a series of ML IP cores implemented on low-cost FPGA platforms, aimed at helping more IoT developers achieve comprehensive performance in various tasks. Using Verilog, we deploy and accelerate Artificial Neural Networks (ANNs), Decision Trees (DTs), K-Nearest Neighbors (k-NNs), and Support Vector Machines (SVMs) on 10 different FPGA development boards from seven producers. Additionally, we analyze and evaluate our design with six datasets and compare the best-performing FPGAs with traditional SoC-based systems, including the NVIDIA Jetson Nano, Raspberry Pi 3B+, and STM32L476 Nucleo. The results show that Lattice's ICE40UP5 achieves the best overall performance with low power consumption, on which MLoF reduces power by an average of 891% and increases performance by 9 times. Moreover, its cost-power-latency product (CPLP) outperforms SoC-based systems by 25 times, which demonstrates the significance of MLoF in endpoint deployment of ML algorithms. Furthermore, we make all of the code open source in order to promote future research.

17 pages, 4014 KiB  
Article
AxP: A HW-SW Co-Design Pipeline for Energy-Efficient Approximated ConvNets via Associative Matching
by Luca Mocerino and Andrea Calimera
Appl. Sci. 2021, 11(23), 11164; https://doi.org/10.3390/app112311164 - 24 Nov 2021
Abstract
The reduction in energy consumption is key for deep neural networks (DNNs) to ensure usability and reliability, whether they are deployed on low-power end-nodes with limited resources or high-performance platforms that serve large pools of users. Leveraging the over-parametrization shown by many DNN models, convolutional neural networks (ConvNets) in particular, energy efficiency can be improved substantially while preserving the model accuracy. The solution proposed in this work exploits the intrinsic redundancy of ConvNets to maximize the reuse of partial arithmetic results during the inference stages. Specifically, the weight-set of a given ConvNet is discretized through a clustering procedure such that the largest possible number of inner multiplications fall into predefined bins; this allows an off-line computation of the most frequent results, which in turn can be stored locally and retrieved when needed during the forward pass. Such a reuse mechanism leads to remarkable energy savings with the aid of a custom processing element (PE) that integrates an associative memory with a standard floating-point unit (FPU). Moreover, the adoption of an approximate associative rule based on a partial bit-match increases the hit rate over the pre-computed results, maximizing the energy reduction even further. Results collected on a set of ConvNets trained for computer vision and speech processing tasks reveal that the proposed associative-based HW-SW co-design achieves up to 77% in energy savings with less than 1% in accuracy loss.
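A rough software model of the reuse idea is sketched below: weights are discretized into a few clusters, products against a quantized activation alphabet are precomputed into a table, and each multiplication at inference becomes a lookup. The cluster count, activation grid, and matching rule are assumptions; the associative-memory PE and the approximate bit-match rule from the paper are not modeled.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
weights = rng.normal(scale=0.05, size=4096)

# 1) Discretize the weight set into a small number of bins (clusters).
k = 16
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(weights.reshape(-1, 1))
centroids = km.cluster_centers_.ravel()
w_ids = km.labels_                                   # each weight -> bin index

# 2) Quantize activations to a small alphabet and precompute all products.
act_levels = np.linspace(-1.0, 1.0, 33)              # illustrative activation grid
lut = np.outer(centroids, act_levels)                # k x 33 precomputed products

# 3) At inference, every multiplication becomes a table lookup.
acts = rng.uniform(-1.0, 1.0, size=weights.size)
a_ids = np.abs(acts[:, None] - act_levels[None, :]).argmin(axis=1)
approx = lut[w_ids, a_ids].sum()
exact = float(weights @ acts)
print(f"exact {exact:.4f}  approx {approx:.4f}")
```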

17 pages, 905 KiB  
Article
High-Performance English–Chinese Machine Translation Based on GPU-Enabled Deep Neural Networks with Domain Corpus
by Lanxin Zhao, Wanrong Gao and Jianbin Fang
Appl. Sci. 2021, 11(22), 10915; https://doi.org/10.3390/app112210915 - 18 Nov 2021
Abstract
The ability to automate machine translation has various applications in international commerce, medicine, travel, education, and text digitization. Due to the different grammar and lack of clear word boundaries in Chinese, it is challenging to conduct translation from word-based languages (e.g., English) to Chinese. This article implements a GPU-enabled deep learning machine translation system based on a domain-specific corpus. Our system takes English text as input and uses an encoder-decoder model with an attention mechanism based on Google's Transformer to translate the text into Chinese output. The model was trained using a simple self-designed entropy loss function and an Adam optimizer on English–Chinese bilingual text sentences from the News area of the UM-Corpus. The parallel training process of our model can be performed on common laptops, desktops, and servers with one or more GPUs. At training time, we not only track loss over training epochs but also measure the quality of our model's translations with the BLEU score. We also provide an easy-to-use web interface for users to manage corpora, training projects, and trained models. The experimental results show that we can achieve a maximum BLEU score of 29.2. We can further improve this score by tuning other hyperparameters. The GPU-enabled model training runs over 15x faster than on a multi-core CPU, which gives us a shorter turn-around time. As a case study, we compare the performance of our model with Baidu's, which shows that our model can compete with an industry-level translation system. We argue that our deep-learning-based translation system is particularly suitable for teaching purposes and small/medium-sized enterprises.
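As an assumed illustration of the described setup (an encoder-decoder Transformer with attention, a cross-entropy-style loss, and an Adam optimizer), the PyTorch sketch below shows the shape of one training step. Vocabulary sizes, model dimensions, and the random token batches are placeholders, not the authors' configuration.

```python
import torch
import torch.nn as nn

SRC_VOCAB, TGT_VOCAB, D_MODEL = 8000, 8000, 256   # placeholder sizes

class TinyTranslator(nn.Module):
    """Minimal encoder-decoder with attention, in the spirit of the Transformer."""
    def __init__(self):
        super().__init__()
        self.src_emb = nn.Embedding(SRC_VOCAB, D_MODEL)
        self.tgt_emb = nn.Embedding(TGT_VOCAB, D_MODEL)
        self.transformer = nn.Transformer(
            d_model=D_MODEL, nhead=4, num_encoder_layers=2,
            num_decoder_layers=2, batch_first=True)
        self.out = nn.Linear(D_MODEL, TGT_VOCAB)

    def forward(self, src, tgt):
        mask = self.transformer.generate_square_subsequent_mask(tgt.size(1))
        h = self.transformer(self.src_emb(src), self.tgt_emb(tgt), tgt_mask=mask)
        return self.out(h)

model = TinyTranslator()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

src = torch.randint(0, SRC_VOCAB, (32, 20))        # stand-in English token ids
tgt = torch.randint(0, TGT_VOCAB, (32, 22))        # stand-in Chinese token ids
logits = model(src, tgt[:, :-1])                   # teacher forcing
loss = criterion(logits.reshape(-1, TGT_VOCAB), tgt[:, 1:].reshape(-1))
loss.backward()
optimizer.step()
print(float(loss))
```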
