Energy-Efficient In-Memory Acceleration of Deep Neural Networks Through a Hardware-Software Co-Design Approach

171380-Thumbnail Image.png
Description
Deep neural networks (DNNs), as a main-stream algorithm for various AI tasks, achieve higher accuracy at the cost of increased computational complexity and model size, posing great challenges to hardware platforms. This dissertation first tackles the design challenges of resistive

Deep neural networks (DNNs), as a main-stream algorithm for various AI tasks, achieve higher accuracy at the cost of increased computational complexity and model size, posing great challenges to hardware platforms. This dissertation first tackles the design challenges of resistive random-access-memory (RRAM) based in-memory computing (IMC) architectures. A new metric, model stability from the loss landscape, is proposed to help shed light on accuracy under variations and model compression and guide a novel variation-aware training (VAT) solution. The proposed method effectively improves post-mapping accuracy of multiple datasets. Next, a hybrid RRAM/SRAM IMC DNN inference accelerator is developed, that integrates an RRAM-based IMC macro, a reconfigurable SRAM-based multiply-accumulate (MAC) macro, and a programmable shifter. The hybrid IMC accelerator fully recovers the inference accuracy post the mapping. Furthermore, this dissertation researches on architectural optimizations for high IMC utilization, low on-chip communication cost, and low energy-delay product (EDP), including on-chip interconnect design, PE array utilization, and tile-to-router mapping and scheduling. The optimal choice of on-chip interconnect results in up to 6x improvement in energy-delay-area product for RRAM IMC architectures. Furthermore, the PE and NoC optimizations show up to 62% improvement in PE utilization, 78% reduction in area, and 78% lower energy-area product for a wide range of modern DNNs. Finally, this dissertation proposes a novel chiplet-based IMC benchmarking simulator, SIAM, and a heterogeneous chiplet IMC architecture to address the limitations of a monolithic DNN accelerator. SIAM utilizes model-based and cycle-accurate simulation to provide a scalable and flexible architecture. SIAM is calibrated against a published silicon result, SIMBA, from Nvidia. The heterogeneous architecture utilizes a custom mapping with a bank of big and little chiplets, and a hybrid network-on-package (NoP) to optimize the utilization, interconnect bandwidth, and energy efficiency. The proposed big-little chiplet-based RRAM IMC architecture significantly improves energy efficiency at lower area, compared to conventional GPUs. In summary, this dissertation comprehensively investigates novel methods that encompass device, circuits, architecture, packaging, and algorithm to design scalable high-performance and energy-efficient IMC architectures.
Date Created
2022
Agent

Study of On-Chip Integrated Switched-Capacitor Voltage Regulator

158871-Thumbnail Image.png
Description
Power management circuits have been more and more widely used in various applications, while providing fully integrated voltage regulation remains a challenging topic. Switched-capacitor (SC) voltage converters have received attentions in integrated power conversion for fixed-ratio voltage conversions with good

Power management circuits have been more and more widely used in various applications, while providing fully integrated voltage regulation remains a challenging topic. Switched-capacitor (SC) voltage converters have received attentions in integrated power conversion for fixed-ratio voltage conversions with good efficiency and feasibility of integration. During my PhD study, an on-chip current sensing technique is proposed to dynamically modulate both switching frequency and switch widths of SC voltage converters, enhancing fast transient response and higher efficiency across a wide range of load currents. In conjunction with SC converters, a low-dropout regulator (LDO) is implemented which is driven by a push-pull operational transconductance amplifier (OTA), whose current is mirrored and sensed with minimal power and efficiency overhead. The sensed load current directly controls the frequency and width of SC converters through a voltage-controlled oscillator (VCO) and a time-to-digital converter, respectively.
Theoretical analysis and optimization for SC DC-DC converters have been presented in prior works, however optimization of different capacitors, namely flying and input/output decoupling capacitors, in SC voltage regulators (SCVRs) under an area constraint has not been addressed. A methodology to optimize flying and decoupling capacitance for area-constrained on-chip SCVRs to achieve the highest system-level power efficiency. Considering both conversion efficiency and droop voltage against fast load transients, the proposed model determines the optimal ratio between flying and decoupling.
Based on the previous design, a fully integrated switched-capacitor voltage regulator with voltage comparison and on-chip lossless current sensing control is proposed. Based on the voltage comparison result and sensed current as the load current changes, the frequency of the SC converters are modulated for optimal efficiency. The voltage regulator targets 2.1V input voltage and 0.9V output voltage, which offers higher-voltage power transfer across chip package. A 17-phase interleaved structure is used to reduce output voltage ripple.
In 65nm CMOS, the regulator is implemented with MIM-capacitor, targeting 2.1V input voltage and 0.9V output voltage. According to the measurement results, the proposed SC voltage regulator achieves 69.6% peak efficiency at 60mA load current, which corresponds to a 4.2mW/mm2 power-area density and 12.5mW
F power-capacitance density. The efficiency across 20mA to 92mA regulator load current range is above 62%. The steady-state output voltage ripple across 22x load current range of 3.5mA-76mA is between 50mV to 60mV.
Date Created
2020
Agent

Novel Learning-Based Task Schedulers for Domain-Specific SoCs

158693-Thumbnail Image.png
Description
This Master’s thesis includes the design, integration on-chip, and evaluation of a set of imitation learning (IL)-based scheduling policies: deep neural network (DNN)and decision tree (DT). We first developed IL-based scheduling policies for heterogeneous systems-on-chips (SoCs). Then, we tested these

This Master’s thesis includes the design, integration on-chip, and evaluation of a set of imitation learning (IL)-based scheduling policies: deep neural network (DNN)and decision tree (DT). We first developed IL-based scheduling policies for heterogeneous systems-on-chips (SoCs). Then, we tested these policies using a system-level domain-specific system-on-chip simulation framework [11]. Finally, we transformed them into efficient code using a cloud engine [1] and implemented on a user-space emulation framework [61] on a Unix-based SoC. IL is one area of machine learning (ML) and a useful method to train artificial intelligence (AI) models by imitating the decisions of an expert or Oracle that knows the optimal solution. This thesis's primary focus is to adapt an ML model to work on-chip and optimize the resource allocation for a set of domain-specific wireless and radar systems applications. Evaluation results with four streaming applications from wireless communications and radar domains show how the proposed IL-based scheduler approximates an offline Oracle expert with more than 97% accuracy and 1.20× faster execution time. The models have been implemented as an add-on, making it easy to port to other SoCs.
Date Created
2020
Agent

Experimental Evaluation of the Feasibility of Wearable Piezoelectric Energy Harvesting

158679-Thumbnail Image.png
Description
Technological advances in low power wearable electronics and energy optimization techniques

make motion energy harvesting a viable energy source. However, it has not been

widely adopted due to bulky energy harvester designs that are uncomfortable to wear. This

work addresses this problem by

Technological advances in low power wearable electronics and energy optimization techniques

make motion energy harvesting a viable energy source. However, it has not been

widely adopted due to bulky energy harvester designs that are uncomfortable to wear. This

work addresses this problem by analyzing the feasibility of powering low wearable power

devices using piezoelectric energy generated at the human knee. We start with a novel

mathematical model for estimating the power generated from human knee joint movements.

This thesis’s major contribution is to analyze the feasibility of human motion energy harvesting

and validating this analytical model using a commercially available piezoelectric

module. To this end, we implemented an experimental setup that replicates a human knee.

Then, we performed experiments at different excitation frequencies and amplitudes with

two commercially available Macro Fiber Composite (MFC) modules. These experimental

results are used to validate the analytical model and predict the energy harvested as a function

of the number of steps taken in a day. The model estimates that 13μWcan be generated

on an average while walking with a 4.8% modeling error. The obtained results show that

piezoelectricity is indeed a viable approach for powering low-power wearable devices.
Date Created
2020
Agent

Detecting Unauthorized Activity in Lightweight IoT Devices

158603-Thumbnail Image.png
Description
The manufacturing process for electronic systems involves many players, from chip/board design and fabrication to firmware design and installation.

In today's global supply chain, any of these steps are prone to interference from rogue players, creating a security risk.

Manufactured

The manufacturing process for electronic systems involves many players, from chip/board design and fabrication to firmware design and installation.

In today's global supply chain, any of these steps are prone to interference from rogue players, creating a security risk.

Manufactured devices need to be verified to perform only their intended operations since it is not economically feasible to control the supply chain and use only trusted facilities.

It is becoming increasingly necessary to trust but verify the received devices both at production and in the field.

Unauthorized hardware or firmware modifications, known as Trojans,

can steal information, drain the battery, or damage battery-driven embedded systems and lightweight Internet of Things (IoT) devices.

Since Trojans may be triggered in the field at an unknown instance,

it is essential to detect their presence at run-time.

However, it isn't easy to run sophisticated detection algorithms on these devices

due to limited computational power and energy, and in some cases, lack of accessibility.

Since finding a trusted sample is infeasible in general, the proposed technique is based on self-referencing to remove any effect of environmental or device-to-device variations in the frequency domain.

In particular, the self-referencing is achieved by exploiting the band-limited nature of Trojan activity using signal detection theory.

When the device enters the test mode, a predefined test application is run on the device

repetitively for a known period. The periodicity ensures that the spectral electromagnetic power of the test application concentrates at known frequencies, leaving the remaining frequencies within the operating bandwidth at the noise level. Any deviations from the noise level for these unoccupied frequency locations indicate the presence of unknown (unauthorized) activity. Hence, the malicious activity can differentiate without using a golden reference or any knowledge of the Trojan activity attributes.

The proposed technique's effectiveness is demonstrated through experiments with collecting and processing side-channel signals, such as involuntarily electromagnetic emissions and power consumption, of a wearable electronics prototype and commercial system-on-chip under a variety of practical scenarios.
Date Created
2020
Agent

Hardware Implementation and Analysis of Temporal Interference Mitigation : A High-Level Synthesis Based Approach

158584-Thumbnail Image.png
Description
The following document describes the hardware implementation and analysis of Temporal Interference Mitigation using High-Level Synthesis. As the problem of spectral congestion becomes more chronic and widespread, Electromagnetic radio frequency (RF) based systems are posing as viable solution to this

The following document describes the hardware implementation and analysis of Temporal Interference Mitigation using High-Level Synthesis. As the problem of spectral congestion becomes more chronic and widespread, Electromagnetic radio frequency (RF) based systems are posing as viable solution to this problem. Among the existing RF methods Cooperation based systems have been a solution to a host of congestion problems. One of the most important elements of RF receiver is the spatially adaptive part of the receiver. Temporal Mitigation is vital technique employed at the receiver for signal recovery and future propagation along the radar chain.

The computationally intensive parts of temporal mitigation are identified and hardware accelerated. The hardware implementation is based on sequential approach with optimizations applied on the individual components for better performance.

An extensive analysis using a range of fixed point data types is performed to find the optimal data type necessary.

Finally a hybrid combination of data types for different components of temporal mitigation is proposed based on results from the above analysis.
Date Created
2020
Agent

Design, Optimization, and Applications of Wearable IoT Devices

Description
Movement disorders are becoming one of the leading causes of functional disability due to aging populations and extended life expectancy. Diagnosis, treatment, and rehabilitation currently depend on the behavior observed in a clinical environment. After the patient leaves the clinic,

Movement disorders are becoming one of the leading causes of functional disability due to aging populations and extended life expectancy. Diagnosis, treatment, and rehabilitation currently depend on the behavior observed in a clinical environment. After the patient leaves the clinic, there is no standard approach to continuously monitor the patient and report potential problems. Furthermore, self-recording is inconvenient and unreliable. To address these challenges, wearable health monitoring is emerging as an effective way to augment clinical care for movement disorders.

Wearable devices are being used in many health, fitness, and activity monitoring applications. However, their widespread adoption has been hindered by several adaptation and technical challenges. First, conventional rigid devices are uncomfortable to wear for long periods. Second, wearable devices must operate under very low-energy budgets due to their small battery capacities. Small batteries create a need for frequent recharging, which in turn leads users to stop using them. Third, the usefulness of wearable devices must be demonstrated through high impact applications such that users can get value out of them.

This dissertation presents solutions to solving the challenges faced by wearable devices. First, it presents an open-source hardware/software platform for wearable health monitoring. The proposed platform uses flexible hybrid electronics to enable devices that conform to the shape of the user’s body. Second, it proposes an algorithm to enable recharge-free operation of wearable devices that harvest energy from the environment. The proposed solution maximizes the performance of the wearable device under minimum energy constraints. The results of the proposed algorithm are, on average, within 3% of the optimal solution computed offline. Third, a comprehensive framework for human activity recognition (HAR), one of the first steps towards a solution for movement disorders is presented. It starts with an online learning framework for HAR. Experiments on a low power IoT device (TI-CC2650 MCU) with twenty-two users show 95% accuracy in identifying seven activities and their transitions with less than 12.5 mW power consumption. The online learning framework is accompanied by a transfer learning approach for HAR that determines the number of neural network layers to transfer among uses to enable efficient online learning. Next, a technique to co-optimize the accuracy and active time of wearable applications by utilizing multiple design points with different energy-accuracy trade-offs is presented. The proposed technique switches between the design points at runtime to maximize a generalized objective function under tight harvested energy budget constraints. Finally, we present the first ultra-low-energy hardware accelerator that makes it practical to perform HAR on energy harvested from wearable devices. The accelerator consumes 22.4 microjoules per operation using a commercial 65 nm technology. In summary, the solutions presented in this dissertation can enable the wider adoption of wearable devices.
Date Created
2020
Agent

The Architecture Design and Hardware Implementation of Communications and High-Precision Positioning System

158413-Thumbnail Image.png
Description
Within the near future, a vast demand for autonomous vehicular techniques can be forecast on both aviation and ground platforms, including autonomous driving, automatic landing, air traffic management. These techniques usually rely on the positioning system and the communication system

Within the near future, a vast demand for autonomous vehicular techniques can be forecast on both aviation and ground platforms, including autonomous driving, automatic landing, air traffic management. These techniques usually rely on the positioning system and the communication system independently, where it potentially causes spectrum congestion. Inspired by the spectrum sharing technique, Communications and High-Precision Positioning (CHP2) system is invented to provide a high precision position service (precision ~1cm) while performing the communication task simultaneously under the same spectrum. CHP2 system is implemented on the consumer-off-the-shelf (COTS) software-defined radio (SDR) platform with customized hardware. Taking the advantages of the SDR platform, the completed baseband processing chain, time-of-arrival estimation (ToA), time-of-flight estimation (ToF) are mathematically modeled and then implemented onto the system-on-chip (SoC) system. Due to the compact size and cost economy, the CHP2 system can be installed on different aerial or ground platforms enabling a high-mobile and reconfigurable network.

In this dissertation report, the implementation procedure of the CHP2 system is discussed in detail. It mainly focuses on the system construction on the Xilinx Ultrascale+ SoC platform. The CHP2 waveform design, ToA solution, and timing exchanging algorithms are also introduced. Finally, several in-lab tests and over-the-air demonstrations are conducted. The demonstration shows the best ranging performance achieves the ~1 cm standard deviation and 10Hz refreshing rate of estimation by using a 10MHz narrow-band signal over 915MHz (US ISM) or 783MHz (EU Licensed) carrier frequency.
Date Created
2020
Agent

Predictive Process Design Kits for the 7 nm and 5 nm Technology Nodes

157985-Thumbnail Image.png
Description
Recent years have seen fin field effect transistors (finFETs) dominate modern complementary metal oxide semiconductor (CMOS) processes, [1][2], e.g., at the sub 20 nm technology nodes, as they alleviate short channel effects, provide lower leakage, and enable some continued VDD

Recent years have seen fin field effect transistors (finFETs) dominate modern complementary metal oxide semiconductor (CMOS) processes, [1][2], e.g., at the sub 20 nm technology nodes, as they alleviate short channel effects, provide lower leakage, and enable some continued VDD scaling. However, a realistic finFET based predictive process design kit (PDK) that supports investigation into both circuit and physical design, encompassing all aspects of digital design, for academic use has been unavailable. While the finFET based FreePDK15 was supplemented with a standard cell library, it lacked full physical verification (LVS) and parasitic extraction at the time [3][4]. Consequently, the only available sub 45 nm educational PDKs are the planar CMOS based Synopsys 32/28 nm and FreePDK45 (45 nm PDK) [5][6]. The cell libraries available for those processes are not realistic since they use large cell heights, in contrast to recent industry trends. Additionally, the SRAM rules and cells provided by these PDKs are not realistic. Because finFETs have a 3D structure, which affects transistor density, using planar libraries scaled to sub 22 nm dimensions for research is likely to give poor accuracy.

Commercial libraries and PDKs, especially for advanced nodes, are often difficult to obtain for academic use, and access to the actual physical layouts is even more restricted. Furthermore, the necessary non disclosure agreements (NDAs) are un manageable for large university classes and the plethora of design rules can distract from the key points. NDAs also make it difficult for the publication of physical design as these may disclose proprietary design rules and structures.

This work focuses on the development of realistic PDKs for academic use that overcome these limitations. These PDKs, developed for the N7 and N5 nodes, even before 7 nm and 5 nm processes were available in industry, are thus predictive. The predictions have been based on publications of the continually improving lithography, as well as estimates of what would be available at N7 and N5. For the most part, these assumptions have been accurate with regards to N7, except for the expectation that extreme ultraviolet (EUV) lithography would be widely available, which has turned out to be optimistic.
Date Created
2019
Agent

Algorithm and Hardware Design for High Volume Rate 3-D Medical Ultrasound Imaging

157982-Thumbnail Image.png
Description
Ultrasound B-mode imaging is an increasingly significant medical imaging modality for clinical applications. Compared to other imaging modalities like computed tomography (CT) or magnetic resonance imaging (MRI), ultrasound imaging has the advantage of being safe, inexpensive, and portable. While two

Ultrasound B-mode imaging is an increasingly significant medical imaging modality for clinical applications. Compared to other imaging modalities like computed tomography (CT) or magnetic resonance imaging (MRI), ultrasound imaging has the advantage of being safe, inexpensive, and portable. While two dimensional (2-D) ultrasound imaging is very popular, three dimensional (3-D) ultrasound imaging provides distinct advantages over its 2-D counterpart by providing volumetric imaging, which leads to more accurate analysis of tumor and cysts. However, the amount of received data at the front-end of 3-D system is extremely large, making it impractical for power-constrained portable systems.



In this thesis, algorithm and hardware design techniques to support a hand-held 3-D ultrasound imaging system are proposed. Synthetic aperture sequential beamforming (SASB) is chosen since its computations can be split into two stages, where the output generated of Stage 1 is significantly smaller in size compared to the input. This characteristic enables Stage 1 to be done in the front end while Stage 2 can be sent out to be processed elsewhere.



The contributions of this thesis are as follows. First, 2-D SASB is extended to 3-D. Techniques to increase the volume rate of 3-D SASB through a new multi-line firing scheme and use of linear chirp as the excitation waveform, are presented. A new sparse array design that not only reduces the number of active transducers but also avoids the imaging degradation caused by grating lobes, is proposed. A combination of these techniques increases the volume rate of 3-D SASB by 4\texttimes{} without introducing extra computations at the front end.



Next, algorithmic techniques to further reduce the Stage 1 computations in the front end are presented. These include reducing the number of distinct apodization coefficients and operating with narrow-bit-width fixed-point data. A 3-D die stacked architecture is designed for the front end. This highly parallel architecture enables the signals received by 961 active transducers to be digitalized, routed by a network-on-chip, and processed in parallel. The processed data are accumulated through a bus-based structure. This architecture is synthesized using TSMC 28 nm technology node and the estimated power consumption of the front end is less than 2 W.



Finally, the Stage 2 computations are mapped onto a reconfigurable multi-core architecture, TRANSFORMER, which supports different types of on-chip memory banks and run-time reconfigurable connections between general processing elements and memory banks. The matched filtering step and the beamforming step in Stage 2 are mapped onto TRANSFORMER with different memory configurations. Gem5 simulations show that the private cache mode generates shorter execution time and higher computation efficiency compared to other cache modes. The overall execution time for Stage 2 is 14.73 ms. The average power consumption and the average Giga-operations-per-second/Watt in 14 nm technology node are 0.14 W and 103.84, respectively.
Date Created
2019
Agent