
J. Low Power Electron. Appl., Volume 10, Issue 4 (December 2020) – 13 articles

Article
Cross-Layer Reliability, Energy Efficiency, and Performance Optimization of Near-Threshold Data Paths
J. Low Power Electron. Appl. 2020, 10(4), 42; https://0-doi-org.brum.beds.ac.uk/10.3390/jlpea10040042 - 03 Dec 2020
Viewed by 853
Abstract
Modern electronic devices are an indispensable part of our everyday life. A major enabler for such integration is the exponential increase in computation capabilities, together with the drastic improvement in energy efficiency over the last 50 years, commonly known as Moore’s law. In this regard, the demand for energy-efficient digital circuits, especially for application domains such as the Internet of Things (IoT), has grown enormously. Since the power consumption of a circuit depends strongly on the supply voltage, aggressive supply voltage scaling to the near-threshold voltage region, also known as Near-Threshold Computing (NTC), is an effective way of increasing the energy efficiency of a circuit by an order of magnitude. However, NTC comes with specific challenges with respect to performance and reliability, which mandate new sets of design techniques to fully harness its potential. While techniques focused on a single abstraction level, in particular circuit-level design, can have limited benefits, cross-layer approaches result in far better optimizations. This paper presents instruction multi-cycling and functional unit partitioning methods to improve the energy efficiency and resiliency of functional units. The proposed methods significantly improve the circuit timing, and at the same time considerably limit leakage energy, by employing a combination of cross-layer techniques based on circuit redesign and code replacement. Simulation results show that the proposed methods improve the performance and energy efficiency of an Arithmetic Logic Unit by 19% and 43%, respectively. Furthermore, the improved performance of the optimized circuits can be traded for improved reliability.
(This article belongs to the Special Issue Circuits and Systems Advances in Near Threshold Computing)

Article
A Nano-Power 0.5 V Event-Driven Digital-LDO with Fast Start-Up Burst Oscillator for SoC-IoT
J. Low Power Electron. Appl. 2020, 10(4), 41; https://0-doi-org.brum.beds.ac.uk/10.3390/jlpea10040041 - 01 Dec 2020
Cited by 1 | Viewed by 975
Abstract
Towards the integration of Digital-LDO regulators in the ultra-low-power System-On-Chip Internet-of-Things architecture, the D-LDO architecture should constitute the main regulator for powering digital and mixed-signal loads including the SoC system clock. Such an implementation requires an in-regulator clock generation unit that provides an autonomous D-LDO design. In contrast to contemporary D-LDO designs that employ a ring-oscillator architecture whose start-up time depends on the oscillating frequency, this work presents a nano-power design, fabricated with an active area of 0.035 mm² in a 55-nm Global Foundries CMOS process, that introduces a fast start-up burst oscillator based on a high-gain stage with a wake-up time independent of the D-LDO frequency. In combination with linear-search coarse regulation and asynchronous fine regulation, it achieves a minimum quiescent current of 558 nA with a CL of 75 pF, a maximum current efficiency of 99.2%, and a 1.16× power efficiency improvement compared to its analog counterpart oriented to SoC-IoT loads.

Article
An Improved K-Spare Decomposing Algorithm for Mapping Neural Networks onto Crossbar-Based Neuromorphic Computing Systems
J. Low Power Electron. Appl. 2020, 10(4), 40; https://0-doi-org.brum.beds.ac.uk/10.3390/jlpea10040040 - 25 Nov 2020
Viewed by 939
Abstract
Mapping deep neural network (DNN) models onto crossbar-based neuromorphic computing systems (NCS) has recently become more popular since it allows us to realize the advantages of DNNs on small computing systems. However, due to the physical limitations of NCS, such as limited programmability or a fixed and small number of neurons and synapses in memristor crossbars (the most important component of NCS), we have to quantize and decompose a DNN model into many partitions before the mapping. Moreover, each weight parameter in the original network has its own scaling factor, while crossbar cell hardware has only one scaling factor. This mismatch causes a significant error and reduces the performance of the system. To mitigate this issue, the K-spare neuron approach has been proposed, which uses K additional spare neurons to capture more scaling factors. Unfortunately, this approach typically incurs a large neuron overhead. To address this, this paper proposes an improved version of the K-spare neuron method that uses a decomposition algorithm to minimize the neuron overhead while maintaining the accuracy of the DNN model. We achieve this goal by using the mean squared quantization error (MSQE) to evaluate which crossbar units are more important and need more scaling factors than others, instead of using the same K spare neurons for all crossbar cells as previous work does. We demonstrate our experimental results on the ImageNet dataset (ILSVRC2012) with three typical and popular deep convolutional neural networks: VGG16, Resnet152, and MobileNet v2. Our proposed method uses only 0.1%, 3.12%, and 2.4% neuron overhead for VGG16, Resnet152, and MobileNet v2 to keep their accuracy loss at 0.44%, 0.63%, and 1.24%, respectively, while other methods use about 10–20% neuron overhead for the same accuracy loss.
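The idea of ranking crossbar units by MSQE can be illustrated with a minimal sketch. This is not the paper's decomposition algorithm: the uniform quantizer, the proportional allocation rule, and the names `quantize`, `msqe`, and `allocate_spares` are all assumptions made for illustration. It only shows how a per-partition quantization error could steer more spare neurons toward the partitions that a single scaling factor serves worst.

```python
def quantize(weights, levels=16):
    """Quantize weights using one shared scaling factor (hypothetical uniform scheme)."""
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0:
        return list(weights)
    half = levels // 2 - 1          # symmetric signed range, an assumed choice
    scale = max_abs / half          # the single scaling factor of the partition
    return [round(w / scale) * scale for w in weights]


def msqe(weights, levels=16):
    """Mean squared quantization error of one partition."""
    quantized = quantize(weights, levels)
    return sum((w - q) ** 2 for w, q in zip(weights, quantized)) / len(weights)


def allocate_spares(errors, budget):
    """Split a spare-neuron budget across partitions in proportion to their MSQE."""
    total = sum(errors)
    if total == 0:
        return [0] * len(errors)
    shares = [budget * e / total for e in errors]
    alloc = [int(s) for s in shares]
    # Hand the remaining spares to the largest fractional remainders.
    leftover = max(0, budget - sum(alloc))
    order = sorted(range(len(errors)), key=lambda i: shares[i] - alloc[i], reverse=True)
    for i in order[:leftover]:
        alloc[i] += 1
    return alloc


# Two hypothetical partitions: one with a narrow weight range, one with an outlier.
partitions = [
    [0.10, -0.12, 0.11, -0.09],
    [0.01, -0.02, 0.90, -0.03],
]
errors = [msqe(p) for p in partitions]
allocation = allocate_spares(errors, budget=4)
print(errors, allocation)
```

In this toy setup the partition with the outlier loses its small weights to rounding, so it has the larger MSQE and receives the larger share of the budget, whereas a uniform K-spare scheme would give every partition the same number.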

Article
Low Power Photo-Voltaic Harvesting Matrix Based Boost DC–DC Converter with Recycled and Synchro-Recycled Scheme
J. Low Power Electron. Appl. 2020, 10(4), 39; https://0-doi-org.brum.beds.ac.uk/10.3390/jlpea10040039 - 18 Nov 2020
Viewed by 924
Abstract
Photo-voltaic (PV) power harvest can have decent efficiency when dealing with high power. When operating with a DC–DC boost converter during low-power harvest, its efficiency and output voltage are degraded due to excessive losses in the converter components. The objective of this paper is to present a systematic approach to designing an efficient low-power photo-voltaic harvesting topology with improved efficiency and output voltage. The proposed topology uses a boost converter with an extra inductor in recycled and synchro-recycled techniques in continuous current mode (CCM). By exploiting the non-linearity of the PV cell, it reduces the power loss, and by using the current stored in the second inductor, it enhances the output voltage and output power simultaneously. Further, by utilizing the Metal Oxide Silicon Field Effect Transistor’s (MOSFET) body diode as a switch, it keeps the hardware to a minimum and has a negligible impact on reliability. The test results of the proposed boost converters show that they achieve decent power and output voltage. Theoretical and experimental results of the proposed topologies with a tested prototype are presented, along with a strategy to maximize power and voltage conversion efficiencies and output voltage.
(This article belongs to the Special Issue Novel Control Techniques for DC-DC Converters)

Article
Hybrid Application Mapping for Composable Many-Core Systems: Overview and Future Perspective
J. Low Power Electron. Appl. 2020, 10(4), 38; https://0-doi-org.brum.beds.ac.uk/10.3390/jlpea10040038 - 17 Nov 2020
Cited by 1 | Viewed by 1167
Abstract
Many-core platforms are rapidly expanding in various embedded areas as they provide the scalable computational power required to meet the ever-growing performance demands of embedded applications and systems. However, the huge design space of possible task mappings, the unpredictable workload dynamism, and the numerous non-functional requirements of applications in terms of timing, reliability, safety, and so forth impose significant challenges when designing many-core systems. Hybrid Application Mapping (HAM) is an emerging class of design methodologies for many-core systems which address these challenges via an incremental (per-application) mapping scheme: the mapping process is divided into (i) a design-time Design Space Exploration (DSE) step per application to obtain a set of high-quality mapping options and (ii) a run-time system management step in which applications are launched dynamically (on demand) using the precomputed mappings. This paper provides an overview of HAM and the design methodologies developed in line with it. We introduce the basics of HAM and elaborate on the way it addresses the major challenges of application mapping in many-core systems. We provide an overview of the main challenges encountered when employing HAM and survey a collection of state-of-the-art techniques and methodologies proposed to address these challenges. We finally present an overview of open topics and challenges in HAM, provide a summary of emerging trends for addressing them, particularly using machine learning, and outline possible future directions. While there exists a large body of HAM methodologies, the techniques studied in this paper are developed, to a large extent, within the scope of invasive computing. Invasive computing introduces resource awareness into applications and employs explicit resource reservation to enable incremental application mapping and dynamic system management.
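The two-step structure of HAM described above can be sketched in a few lines. The sketch is illustrative only: the `Mapping` fields, the Pareto filter standing in for the design-time DSE step, and the lowest-energy selection rule at run time are assumptions, not a method taken from the surveyed literature.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mapping:
    """One precomputed mapping option (hypothetical objectives)."""
    cores: int          # cores the mapping reserves
    latency_ms: float   # worst-case latency under this mapping
    energy_mj: float    # energy per execution


def pareto_front(options):
    """Design-time step: keep only mappings that no other option dominates."""
    def dominates(a, b):
        return (a != b
                and a.cores <= b.cores
                and a.latency_ms <= b.latency_ms
                and a.energy_mj <= b.energy_mj)
    return [m for m in options if not any(dominates(o, m) for o in options)]


def launch(front, free_cores, deadline_ms):
    """Run-time step: pick the lowest-energy precomputed mapping that fits now."""
    feasible = [m for m in front
                if m.cores <= free_cores and m.latency_ms <= deadline_ms]
    return min(feasible, key=lambda m: m.energy_mj, default=None)


# Hypothetical exploration result for one application.
explored = [Mapping(2, 40.0, 9.0), Mapping(4, 20.0, 12.0), Mapping(4, 25.0, 13.0)]
front = pareto_front(explored)
choice = launch(front, free_cores=4, deadline_ms=30.0)
print(front, choice)
```

The expensive exploration runs once per application at design time, and the run-time manager only filters and compares a short list, which is what makes on-demand launching cheap in this scheme.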

Article
Framework for Design Exploration and Performance Analysis of RF-NoC Manycore Architecture
J. Low Power Electron. Appl. 2020, 10(4), 37; https://0-doi-org.brum.beds.ac.uk/10.3390/jlpea10040037 - 03 Nov 2020
Cited by 1 | Viewed by 1055
Abstract
The Network-on-chip (NoC) paradigm has been proposed as a promising solution to enable the handling of a high degree of integration in multi-/many-core architectures. Despite their advantages, wired NoC infrastructures face several performance issues regarding multi-hop long-distance communications. RF-NoC is an attractive solution offering high performance and multicast/broadcast capabilities. However, managing RF links is a critical aspect that relies on both application-dependent and architectural parameters. This paper proposes a design space exploration framework for OFDMA-based RF-NoC architecture, which takes advantage of both real application benchmarks simulated using Sniper and RF-NoC architecture modeled using Noxim. We used the proposed framework to finely configure a routing algorithm working with real traffic, achieving a delay reduction of up to 45% compared to a wired NoC setup in similar conditions.

Article
InSight: An FPGA-Based Neuromorphic Computing System for Deep Neural Networks
J. Low Power Electron. Appl. 2020, 10(4), 36; https://0-doi-org.brum.beds.ac.uk/10.3390/jlpea10040036 - 30 Oct 2020
Cited by 1 | Viewed by 1191
Abstract
Deep neural networks have demonstrated impressive results in various cognitive tasks such as object detection and image classification. This paper describes a neuromorphic computing system that is designed from the ground up for energy-efficient evaluation of deep neural networks. The computing system consists of a non-conventional compiler, a neuromorphic hardware architecture, and a space-efficient microarchitecture that leverages existing integrated circuit design methodologies. The compiler takes a trained, feedforward network as input, compresses the weights linearly, and generates a time delay neural network, reducing the number of connections significantly. The connections and units in the simplified network are mapped to silicon synapses and neurons. We demonstrate an implementation of the neuromorphic computing system based on a field-programmable gate array (FPGA) that performs image classification on the MNIST dataset of handwritten digits (0 to 9) with 99.37% accuracy while consuming only 93 µJ per image. For image classification on the CIFAR-10 dataset of colour images in 10 classes, it achieves 83.43% accuracy at more than 11× higher energy efficiency compared to a recent FPGA-based accelerator.

Article
A 100 MHz 0.41 fJ/(Bit∙Search) 28 nm CMOS-Bulk Content Addressable Memory for HEP Experiments
J. Low Power Electron. Appl. 2020, 10(4), 35; https://0-doi-org.brum.beds.ac.uk/10.3390/jlpea10040035 - 28 Oct 2020
Viewed by 965
Abstract
This paper presents a transistor-level design with extensive experimental validation of a Content Addressable Memory (CAM), based on an eXclusive OR (XOR) single-bit cell. This design exploits a dedicated architecture and a fully custom approach (both in the schematic and the layout phase) in order to achieve very low power consumption and high speed. The proposed architecture does not require an internal clock or pre-charge phase, which usually increase the power demand and slow down data searches. In addition, dedicated solutions are exploited in order to minimize parasitic layout-induced capacitances in the single-bit cell, further reducing the power consumption. The prototype device, named CAM-28CB, is integrated in the deeply downscaled 28 nm Complementary Metal-Oxide-Semiconductor (CMOS) Bulk (28CB) technology. In this way, the high transistor transition frequency and the intrinsically lower parasitic capacitances allow the system speed to be improved. Furthermore, the high radiation hardness of this technology node (up to 1 Grad TID), together with the high speed and low power of the CAM-28CB, makes this device suitable for High-Energy Physics experiments, such as ATLAS (A Toroidal LHC ApparatuS) at the Large Hadron Collider (LHC). The prototype operates at a frequency of up to 100 MHz and consumes 46.86 µW. The total area occupancy is 1702 µm² for 1.152 kb memory bit cells. The device operates with a single supply voltage of 1 V and achieves a Figure-of-Merit of 0.41 fJ/bit/search.
(This article belongs to the Special Issue Low-Power CMOS Analog and Digital Circuits and Filters)
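The Figure-of-Merit quoted in the abstract above can be cross-checked from the other reported numbers. The sketch below assumes that one search completes per cycle at 100 MHz and that 1.152 kb means 1152 bits; neither assumption is stated in the abstract, and the function name is hypothetical.

```python
# Figures taken from the abstract; the interpretation of them is an assumption.
power_w = 46.86e-6       # reported power consumption, 46.86 µW
frequency_hz = 100e6     # assumed: one search per cycle at 100 MHz
memory_bits = 1152       # assumed: 1.152 kb = 1152 bits


def energy_per_bit_per_search(power_w, frequency_hz, memory_bits):
    """Energy in joules spent per stored bit for each search operation."""
    return power_w / (frequency_hz * memory_bits)


fom_fj = energy_per_bit_per_search(power_w, frequency_hz, memory_bits) * 1e15
print(f"{fom_fj:.2f} fJ/bit/search")  # prints "0.41 fJ/bit/search"
```

Under these assumptions the power, frequency, and capacity reproduce the stated 0.41 fJ/bit/search, which suggests the Figure-of-Merit is simply power divided by the product of search rate and bit count.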

Article
Simple Scheme for the Implementation of Low Voltage Fully Differential Amplifiers without Output Common-Mode Feedback Network
J. Low Power Electron. Appl. 2020, 10(4), 34; https://0-doi-org.brum.beds.ac.uk/10.3390/jlpea10040034 - 23 Oct 2020
Cited by 3 | Viewed by 1194
Abstract
A simple scheme to implement class AB low-voltage fully differential amplifiers that do not require an output common-mode feedback network (CMFN) is introduced. It has a rail-to-rail output signal swing and high rejection of common-mode input signals. It operates in strong inversion with ±300 mV supplies in a 180 nm CMOS process. It uses an auxiliary amplifier that minimizes supply requirements by setting the op-amp input terminals very close to one of the rails and also serves as a common-mode feedback network to generate complementary output signals. The scheme is verified with simulation results of an amplifier that consumes 25 µW and has a gain-bandwidth product (GBW) of 16.1 MHz, a slew rate (SR) of 8.4 V/µs, a small-signal figure of merit (FOMSS) of 6.49 MHz·pF/µW, a large-signal figure of merit (FOMLS) of 3.39 (V/µs)·pF/µW, and a current efficiency (CE) of 2.03 in strong inversion, with a 10 pF load capacitance.

Article
Challenges and Opportunities in Near-Threshold DNN Accelerators around Timing Errors
J. Low Power Electron. Appl. 2020, 10(4), 33; https://0-doi-org.brum.beds.ac.uk/10.3390/jlpea10040033 - 16 Oct 2020
Viewed by 1164
Abstract
AI evolution is accelerating, and Deep Neural Network (DNN) inference accelerators are at the forefront of ad hoc architectures that are evolving to support the immense throughput required for AI computation. However, much more energy-efficient design paradigms are needed to realize the complete potential of AI evolution and curtail energy consumption. The Near-Threshold Computing (NTC) design paradigm can serve as the best candidate for providing the required energy efficiency. However, NTC operation is plagued with ample performance and reliability concerns arising from timing errors. In this paper, we dive deep into DNN architecture to uncover some unique challenges and opportunities for operation in the NTC paradigm. By performing rigorous simulations on a TPU systolic array, we reveal the severity of timing errors and their impact on inference accuracy at NTC. We analyze various attributes (such as data–delay relationship, delay disparity within arithmetic units, utilization pattern, hardware homogeneity, and workload characteristics) and uncover unique localized and global techniques to deal with timing errors in NTC.
(This article belongs to the Special Issue Circuits and Systems Advances in Near Threshold Computing)

Article
Intra- and Inter-Server Smart Task Scheduling for Profit and Energy Optimization of HPC Data Centers
J. Low Power Electron. Appl. 2020, 10(4), 32; https://0-doi-org.brum.beds.ac.uk/10.3390/jlpea10040032 - 14 Oct 2020
Viewed by 951
Abstract
Servers in a data center are underutilized due to over-provisioning, which contributes heavily toward the high power consumption of data centers. Recent research in optimizing the energy consumption of High Performance Computing (HPC) data centers mostly focuses on consolidation of Virtual Machines (VMs) and using dynamic voltage and frequency scaling (DVFS). These approaches are inherently hardware-based, are frequently unique to individual systems, and often use simulation due to lack of access to HPC data centers. Other approaches require profiling information on the jobs in the HPC system to be available before run-time. In this paper, we propose a reinforcement learning based approach, which jointly optimizes profit and energy in the allocation of jobs to available resources, without the need for such prior information. The approach is implemented in a software scheduler used to allocate real applications from the Princeton Application Repository for Shared-Memory Computers (PARSEC) benchmark suite to a number of hardware nodes realized with Odroid-XU3 boards. Experiments show that the proposed approach increases the profit earned by 40% while simultaneously reducing energy consumption by 20% when compared to a heuristic-based approach. We also present a network-aware server consolidation algorithm for HPC data centers, called Bandwidth-Constrained Consolidation (BCC), which can address the under-utilization problem of the servers. Our experiments show that the BCC consolidation technique can reduce the power consumption of a data center by up to 37%.

Article
PkMin: Peak Power Minimization for Multi-Threaded Many-Core Applications
J. Low Power Electron. Appl. 2020, 10(4), 31; https://0-doi-org.brum.beds.ac.uk/10.3390/jlpea10040031 - 30 Sep 2020
Viewed by 999
Abstract
Multiple multi-threaded tasks constitute a modern many-core application. An accompanying generic Directed Acyclic Graph (DAG) represents the execution precedence relationship between the tasks. The application comes with a hard deadline and high peak power consumption. Parallel execution of multiple tasks on multiple cores results in quicker execution, but higher peak power. Peak power single-handedly determines the cooling costs involved in many-cores, while its violations could induce performance-crippling execution uncertainties. Less task parallelization, on the other hand, results in lower peak power, but a more prolonged, deadline-violating execution. The problem of peak power minimization in many-cores is to determine a task-to-core mapping configuration in the spatio-temporal domain that minimizes the peak power consumption of an application while ensuring that the application still meets its deadline. All previous works on peak power minimization for many-core applications (with or without DAG) assume only single-threaded tasks. We are the first to propose a framework, called PkMin, which minimizes the peak power of many-core applications with DAG that have multi-threaded tasks. PkMin leverages the inherent convexity in the execution characteristics of multi-threaded tasks to find a configuration that satisfies the deadline and minimizes peak power. Evaluation on hundreds of applications shows that PkMin results, on average, in 49.2% lower peak power than a similar state-of-the-art framework.

Review
A Survey of Resource Management for Processing-In-Memory and Near-Memory Processing Architectures
J. Low Power Electron. Appl. 2020, 10(4), 30; https://0-doi-org.brum.beds.ac.uk/10.3390/jlpea10040030 - 24 Sep 2020
Viewed by 1364
Abstract
Due to the amount of data involved in emerging deep learning and big data applications, operations related to data movement have quickly become a bottleneck. Data-centric computing (DCC), as enabled by processing-in-memory (PIM) and near-memory processing (NMP) paradigms, aims to accelerate these types of applications by moving the computation closer to the data. Over the past few years, researchers have proposed various memory architectures that enable DCC systems, such as logic layers in 3D-stacked memories or charge-sharing-based bitwise operations in dynamic random-access memory (DRAM). However, application-specific memory access patterns, power and thermal concerns, memory technology limitations, and inconsistent performance gains complicate the offloading of computation in DCC systems. Therefore, designing intelligent resource management techniques for computation offloading is vital for leveraging the potential offered by this new paradigm. In this article, we survey the major trends in managing PIM- and NMP-based DCC systems and provide a review of the landscape of resource management techniques employed by system designers for such systems. Additionally, we discuss the future challenges and opportunities in DCC management.
