Quantization and Evaluation of AI Algorithms for Hardware Acceleration

Description
Artificial intelligence is one of the leading technologies that mimics the problem-solving and decision-making capabilities of the human brain. Machine learning algorithms, especially deep learning algorithms, are leading the way in terms of performance and robustness. They are used for various purposes, mainly computer vision, speech recognition, and object detection. The algorithms are usually evaluated for accuracy, and they typically use full floating-point precision (32 bits); hardware would require a large amount of power and area to accommodate so many parameters at full precision. In this exploratory work, a convolutional autoencoder is quantized for use with an event-based camera. The model is designed so that the autoencoder can run on-chip, which substantially decreases processing latency. Different quantization methods are used to quantize and binarize the weights and activations of the neural network model so that it is portable and power efficient. A sparsity term is added to make the model as robust and energy-efficient as possible. The network model was able to recoup the accuracy lost to binarizing the weights and activations by selectively quantizing the layers of the encoder. This way of recovering accuracy gives enough flexibility to put the network on a chip and obtain real-time processing from systems such as event-based cameras.

Lately, computer vision, and object detection in particular, has made strides in detection accuracy. The algorithms can detect and predict objects in real time; however, end-to-end deployment is challenging due to large parameter counts and processing requirements. A change to the Non-Maximum Suppression algorithm in SSD (Single Shot Detector)-MobileNet-V1 reduced computational complexity without changing the quality of the output metric. The Mean Average Precision (mAP) results suggest that this method can be applied in the post-processing of other networks.
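
As a rough sketch of the weight/activation binarization described above, the snippet below uses a sign function with a straight-through estimator so that training can still propagate gradients; this is a standard formulation rather than the thesis' exact implementation, and the layer and tensor names are placeholders.

```python
# Illustrative binarization with a straight-through estimator (STE); a generic
# sketch, not the thesis' exact quantization scheme.
import torch
import torch.nn.functional as F

class BinarizeSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)  # values constrained to {-1, 0, +1}

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        # STE: pass the gradient through only where |x| <= 1.
        return grad_output * (x.abs() <= 1).float()

def binary_conv2d(x, weight_fp32, bias=None):
    """Convolution with binarized weights; the full-precision copy is what the
    optimizer keeps updating."""
    return F.conv2d(x, BinarizeSTE.apply(weight_fp32), bias, padding=1)
```

In line with the abstract, only selected encoder layers would use such a binarized convolution while the remaining layers stay at higher precision.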
Date Created
2021

End-to-End Performance Benchmarking Tool for High-Speed Memory Access in Deep Learning

Description
Due to high DRAM access latency and energy, several convolutional neural network (CNN) accelerators face performance and energy-efficiency challenges, which are critical for embedded implementations. As these applications exploit larger datasets, their memory accesses keep increasing. As a result, it is difficult to predict the combined dynamic random access memory (DRAM) workload behavior, which can sabotage memory optimizations in software. To understand the impact of external memory access on CNN accelerators, and to reduce the high DRAM access latency and energy, simulators such as RAMULATOR and VAMPIRE have been proposed in prior work. In this work, these simulators are used to benchmark external memory access in CNN accelerators. Experiments generate trace files based on the number of parameters and the data precision, and also use a trace file generated from CNN accelerator data for the Altera Arria 10 GX 1150 FPGA, to complete the end-to-end workflow with the mentioned simulators. In addition, the default VAMPIRE code was modified to implement functionalities such as PREA (Precharge All) and REF (Refresh). Pre-calculated energies were then computed for DDR3, DDR4, and HBM based on the Micron model and entered in the DRAM specification file supplied to the VAMPIRE tool. An experimental comparison of DDR3, DDR4, and HBM showed that DDR4 is nearly 31% more energy-efficient than DDR3, and HBM is 54% more energy-efficient than DDR3. Modeling and experimental analysis were also performed on a large dataset that was then split into smaller sets; the results of the small sets, multiplied by the number of sets, were nearly the same as those of the large dataset. Finally, a GUI was developed that wraps both simulators, providing user-friendly access so that the parameters can be analyzed without much prior knowledge of how the tools work.
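
As an illustration of generating a trace file from a parameter count and data precision, the sketch below emits one cache-line-sized read request per line; the one-address-plus-operation-per-line format and the 64-byte granularity are assumptions for illustration, since RAMULATOR and VAMPIRE each expect their own input formats.

```python
# Hypothetical trace generator: one "<hex address> R" line per 64-byte read of
# the model's parameters. Format and granularity are assumptions; adapt to the
# simulator's actual trace specification.
def write_param_read_trace(path, num_params, bits_per_param=8,
                           line_bytes=64, base_addr=0x0):
    total_bytes = num_params * bits_per_param // 8
    num_requests = (total_bytes + line_bytes - 1) // line_bytes
    with open(path, "w") as f:
        for i in range(num_requests):
            f.write(f"0x{base_addr + i * line_bytes:08x} R\n")

# Example: 1M parameters at 8-bit precision -> 15,625 64-byte read requests.
write_param_read_trace("layer0.trace", num_params=1_000_000, bits_per_param=8)
```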
Date Created
2021

Fully-passive Wireless Acquisition of Biosignals

Description
The recording of biosignals enables physicians to correctly diagnose diseases and prescribe treatment. Existing wireless systems have failed to effectively replace conventional wired methods due to their large size, high power consumption, and the need to replace batteries. This thesis aims to alleviate these issues by presenting a series of wireless fully-passive sensors for the acquisition of biosignals, including neuropotential, biopotential, and intracranial pressure (ICP), as well as a stimulator for the pacing of engineered cardiac cells. In contrast to existing wireless biosignal recording systems, the proposed wireless sensors do not contain batteries or high-power electronics such as amplifiers or digital circuits. Instead, the RFID-tag-like sensors utilize a unique radiofrequency (RF) backscattering mechanism to enable wireless and battery-free telemetry of biosignals with extremely low power consumption. This characteristic minimizes the risk of heat-induced tissue damage and avoids the need for any transcranial/transcutaneous wires, thus significantly enhancing long-term safety and reliability. For neuropotential recording, a small (9 mm x 8 mm), biocompatible, and flexible wireless recorder is developed and verified by in vivo acquisition of two types of neural signals, the somatosensory evoked potential (SSEP) and interictal epileptic discharges (IEDs). For wireless multichannel neural recording, a novel time-multiplexed multichannel recording method based on an inductor-capacitor delay circuit is presented and tested, realizing simultaneous wireless recording from 11 channels in a completely passive manner. For biopotential recording, a wearable and flexible wireless sensor is developed, achieving real-time wireless acquisition of ECG, EMG, and EOG signals. For ICP monitoring, a very small (5 mm x 4 mm) wireless ICP sensor is designed and verified both in vitro through a benchtop setup and in vivo through real-time ICP recording in rats. Finally, for cardiac cell stimulation, a flexible wireless passive stimulator, capable of delivering stimulation currents as high as 60 mA, is developed, demonstrating successful control over the contraction of engineered cardiac cells. The studies conducted in this thesis provide information and guidance for the future translation of wireless fully-passive telemetry methods into actual clinical application, especially in the field of implantable and wearable electronics.
Date Created
2020

Visual Perception, Prediction and Understanding with Relations

Description
Rapid development of computer vision applications such as image recognition and object detection has been enabled by emerging deep learning technologies. To further improve accuracy, deeper and wider neural networks with diverse architectures have been proposed for better feature extraction. Though the performance boost is impressive, only marginal improvements can be achieved at a significantly increased computational overhead. One solution is to compress the exploding model size by dropping less important weights or channels, an effective approach that has been well explored. However, by utilizing the rich relational information in the data, one can also improve accuracy with reasonable overhead. This work makes progress toward efficient and accurate visual tasks, including detection, prediction, and understanding, by using relations.
For object detection, a novel approach, Graph Assisted Reasoning (GAR), is proposed that uses a heterogeneous graph to model object-object relations and object-scene relations. GAR fuses the features from neighboring object nodes as well as scene nodes, producing better recognition than individual object nodes alone. Moreover, compared to previous approaches using Recurrent Neural Networks (RNNs), GAR's light-weight and low-coupling architecture further facilitates its integration into the object detection module.
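
A minimal sketch of the core idea, fusing each object node's feature with its related object nodes and the scene node, is given below; mean aggregation with fixed mixing weights is an assumption for illustration and not GAR's actual fusion rule.

```python
# Toy feature fusion over a heterogeneous graph: object-object relations via an
# adjacency matrix, plus a global scene node. Aggregation weights are illustrative.
import numpy as np

def fuse_with_relations(obj_feats, adj, scene_feat):
    deg = adj.sum(axis=1, keepdims=True) + 1e-6
    neighbor_avg = adj @ obj_feats / deg      # average feature of related objects
    return obj_feats + 0.5 * neighbor_avg + 0.5 * scene_feat

obj_feats = np.random.randn(4, 16)            # 4 detected objects, 16-d features
adj = np.array([[0, 1, 1, 0],
                [1, 0, 0, 0],
                [1, 0, 0, 1],
                [0, 0, 1, 0]], dtype=float)   # object-object relations
scene_feat = np.random.randn(16)              # global scene context
fused = fuse_with_relations(obj_feats, adj, scene_feat)   # shape (4, 16)
```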

For trajectory prediction, a novel approach, namely Diverse Attention RNN (DAT-RNN), is proposed to handle the diversity of trajectories and the modeling of neighboring relations. DAT-RNN integrates both temporal and spatial relations to improve prediction under various circumstances.

Last but not least, this work presents a novel relation implication-enhanced (RIE) approach that improves relation detection through relation direction and implication. With relation implication, the scene graph generation (SGG) model is exposed to more ground-truth information, which mitigates the overfitting problem of biased datasets. Moreover, the enhancement with relation implication is compatible with various context encoding schemes.

Comprehensive experiments on benchmarking datasets demonstrate the efficacy of the proposed approaches.
Date Created
2020

Efficient and Online Deep Learning through Model Plasticity and Stability

Description
The rapid advancement of Deep Neural Networks (DNNs), computing, and sensing technology has enabled many new applications, such as self-driving vehicles, surveillance drones, and robotic systems. Compared to conventional edge devices (e.g., cell phones or smart home devices), these emerging devices must deal with much more complicated and dynamic situations in real time with bounded computation resources. This raises several challenges, including but not limited to efficiency, real-time adaptation, model stability, and automation of architecture design.

To tackle the challenges mentioned above, model plasticity and stability are leveraged to achieve efficient and online deep learning, especially in the scenario of learning streaming data at the edge:

First, a dynamic training scheme named Continuous Growth and Pruning (CGaP) is proposed to compress the DNNs through growing important parameters and pruning unimportant ones, achieving up to 98.1% reduction in the number of parameters.
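
A hedged sketch of one grow-and-prune step is shown below: active connections with the smallest weight magnitude are pruned, and inactive positions with the largest gradient magnitude are grown back. The importance criteria and the fixed fractions are illustrative assumptions, not CGaP's exact schedule.

```python
# Illustrative grow-and-prune step on a single weight matrix.
import numpy as np

def grow_and_prune_step(weights, grads, mask, grow_frac=0.05, prune_frac=0.05):
    n = weights.size
    # Prune: zero out the smallest-magnitude active weights.
    active = np.flatnonzero(mask)
    k_prune = int(prune_frac * n)
    prune_idx = active[np.argsort(np.abs(weights.flat[active]))[:k_prune]]
    mask.flat[prune_idx] = 0
    # Grow: re-activate inactive positions with the largest gradient magnitude.
    inactive = np.flatnonzero(mask == 0)
    k_grow = int(grow_frac * n)
    grow_idx = inactive[np.argsort(-np.abs(grads.flat[inactive]))[:k_grow]]
    mask.flat[grow_idx] = 1
    return weights * mask, mask
```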

Second, this dissertation presents Progressive Segmented Training (PST), which targets the catastrophic forgetting problem in continual learning through importance sampling, model segmentation, and memory-assisted balancing. PST achieves state-of-the-art accuracy with a 1.5X FLOPs reduction in the complete inference path.

Third, to facilitate online learning in real applications, acquisitive learning (AL) is further proposed to emphasize both knowledge inheritance and acquisition: the majority of the knowledge is first pre-trained in the inherited model and then adapted to acquire new knowledge. The inherited model's stability is monitored by noise injection and by the landscape of the loss function, while the acquisition is realized by importance sampling and model segmentation. Compared to a conventional scheme, AL reduces the accuracy drop by >10X on the CIFAR-100 dataset, with a 5X reduction in latency per training image and a 150X reduction in training FLOPs.
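
The stability-monitoring idea can be sketched as below: perturb the inherited model's weights with small Gaussian noise and measure how far the loss rises, with a small gap suggesting a flatter, more stable minimum. The noise model and the loss-gap metric are assumptions for illustration.

```python
# Hedged sketch of probing model stability via weight-noise injection.
import copy
import torch

def loss_gap_under_noise(model, loss_fn, data, target, sigma=1e-2, trials=5):
    base_loss = loss_fn(model(data), target).item()
    gaps = []
    for _ in range(trials):
        noisy = copy.deepcopy(model)
        with torch.no_grad():
            for p in noisy.parameters():
                p.add_(sigma * torch.randn_like(p))   # weight-space perturbation
        gaps.append(loss_fn(noisy(data), target).item() - base_loss)
    return sum(gaps) / len(gaps)   # small gap ~ flat loss landscape ~ stable
```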

Finally, this dissertation presents evolutionary neural architecture search in light of model stability (ENAS-S). ENAS-S uses a novel fitness score, which addresses not only the accuracy but also the model stability, to search for an optimal inherited model for the application of continual learning. ENAS-S outperforms hand-designed DNNs when learning from a data stream at the edge.
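
One hypothetical form such a fitness score could take, rewarding clean accuracy while penalizing the accuracy lost under weight perturbation, is sketched below; the actual ENAS-S formulation is not reproduced here.

```python
# Hypothetical stability-aware fitness for architecture search; the weighting
# and the accuracy-gap penalty are illustrative assumptions.
def fitness(clean_acc, noisy_acc, stability_weight=0.5):
    return clean_acc - stability_weight * (clean_acc - noisy_acc)
```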

In summary, in this dissertation, several algorithms exploiting model plasticity and model stability are presented to improve the efficiency and accuracy of deep neural networks, especially for the scenario of continual learning.
Date Created
2020

Wireless Wearable Sensor to Characterize Respiratory Behaviors

Description
Respiratory behavior provides effective information for characterizing lung function, including respiratory rate, respiratory profile, and respiratory volume. Current methods have limited capability for continuous characterization of respiratory behavior and primarily target the measurement of respiratory rate, which has relatively little value in clinical application. In this dissertation, a wireless wearable sensor on a paper substrate is developed to continuously characterize respiratory behavior and deliver clinically relevant parameters, contributing to asthma control. Based on anatomical analysis and experimental results, the optimum site for the wireless wearable sensor is midway between the xiphoid process and the costal margin, corresponding to the abdomen-apposed rib cage. At this site, the linear strain change during respiration is measured and converted to lung volume by the wireless wearable sensor using a distance-elapsed ultrasound measurement. An on-board low-power Bluetooth module transmits the temporal lung volume change to a smartphone, where a custom-programmed app computes and displays the clinically relevant parameters, such as forced vital capacity (FVC), forced expiratory volume in the first second (FEV1), and the FEV1/FVC ratio. Enhanced by a simple yet effective machine-learning algorithm, a system consisting of two wireless wearable sensors accurately extracts respiratory features and classifies the respiratory behavior in four postures across different subjects, demonstrating that respiratory behaviors are individual- and posture-dependent and contributing to the monitoring of posture-related respiratory diseases. Continuous and accurate monitoring of respiratory behaviors can track the progression of respiratory disorders and diseases, enabling timely and objective approaches to their control and management.
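
Given a lung-volume-versus-time trace from the sensor, the clinically relevant parameters follow from the standard spirometry definitions; the sketch below assumes a fixed sampling rate and that the trace starts at the onset of forced expiration.

```python
# Worked sketch: FVC, FEV1, and FEV1/FVC from a volume-time trace of a forced
# expiration. Sampling rate and onset detection are simplifying assumptions.
import numpy as np

def spirometry_params(volume_ml, fs_hz=50):
    """volume_ml: exhaled volume (mL) relative to the start of forced expiration."""
    fvc = volume_ml.max()                             # total exhaled volume
    idx_1s = min(int(fs_hz * 1.0), len(volume_ml) - 1)
    fev1 = volume_ml[idx_1s]                          # volume exhaled in the first second
    return fvc, fev1, fev1 / fvc

# Example with a synthetic exponential-like expiration reaching ~4000 mL:
t = np.arange(0, 6, 1 / 50)
volume = 4000 * (1 - np.exp(-t / 0.9))
fvc, fev1, ratio = spirometry_params(volume)          # ratio is about 0.67 here
```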
Date Created
2020

Efficient and Secure Deep Learning Inference System: A Software and Hardware Co-design Perspective

Description
The advances achieved recently in Deep Learning (DL) have successfully demonstrated its great potential to surpass or approach human-level performance across multiple domains. Consequently, there is a rising demand to deploy state-of-the-art DL algorithms, e.g., Deep Neural Networks (DNNs), in real-world applications to relieve humans of repetitive work. On the one hand, the impressive performance achieved by DNNs normally comes with the drawbacks of intensive memory and power usage due to enormous model size and high computation workload, which significantly hampers their deployment on resource-limited cyber-physical systems or edge devices. Thus, enhancing the inference efficiency of DNNs has attracted great research interest across various communities. On the other hand, scientists and engineers still have insufficient knowledge about the principles of DNNs, which means they are mostly treated as black boxes. Under such circumstances, a DNN is like "the sword of Damocles," where its security or fault-tolerance capability is an essential concern that cannot be circumvented.

Motivated by the aforementioned concerns, this dissertation comprehensively investigates the emerging efficiency and security issues of DNNs from both software and hardware design perspectives. From the efficiency perspective, model compression via quantization, the foundational technique for efficient inference of a target DNN, is elaborated. To maximize the inference performance boost, the deployment of the quantized DNN on a revolutionary Computing-in-Memory based neural accelerator is presented in a cross-layer (device/circuit/system) fashion. From the security perspective, the well-known adversarial attack is investigated, spanning from its original input-attack form (i.e., adversarial example generation) to its parameter-attack variant.
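
As one standard instance of the input-attack form mentioned above (adversarial example generation), the sketch below implements the fast gradient sign method (FGSM); the dissertation's specific attack formulations may differ.

```python
# Minimal FGSM sketch: one gradient-sign step that increases the loss, clamped
# back to the valid input range.
import torch

def fgsm_example(model, loss_fn, x, y, epsilon=8 / 255):
    x_adv = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x_adv), y)
    loss.backward()
    with torch.no_grad():
        x_adv = x_adv + epsilon * x_adv.grad.sign()   # move in the loss-increasing direction
        x_adv = x_adv.clamp(0.0, 1.0)                 # keep a valid image
    return x_adv.detach()
```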
Date Created
2020

Robust Networks: Neural Networks Robust to Quantization Noise and Analog Computation Noise Based on Natural Gradient

Description
Deep neural networks (DNNs) have had tremendous success in a variety of statistical learning applications due to their vast expressive power. Most applications run DNNs on the cloud on parallelized architectures, but there is a need for efficient DNN inference on the edge with low-precision hardware and analog accelerators. To make trained models more robust in this setting, quantization and analog compute noise are modeled as weight-space perturbations to DNNs, and an information-theoretic regularization scheme is used to penalize the KL divergence between the perturbed and unperturbed models. This regularizer has similarities to both natural gradient descent and knowledge distillation, but has the advantage of explicitly promoting the network toward a broader minimum that is robust to weight-space perturbations. In addition to the proposed regularization, the KL divergence is directly minimized using knowledge distillation. Initial validation on FashionMNIST and CIFAR10 shows that the information-theoretic regularizer and knowledge distillation outperform existing quantization schemes based on the straight-through estimator or L2-constrained quantization.
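
A minimal sketch of the regularized objective, assuming Gaussian weight noise as the perturbation model, is given below; the noise distribution and weighting are illustrative, not the exact scheme evaluated in the thesis.

```python
# Sketch of the weight-perturbation KL regularizer: task loss plus the KL
# divergence between the unperturbed and perturbed output distributions.
import copy
import torch
import torch.nn.functional as F

def perturbed_kl_loss(model, x, y, sigma=1e-2, lam=1.0):
    logits = model(x)
    task_loss = F.cross_entropy(logits, y)

    noisy = copy.deepcopy(model)                      # perturbed copy of the weights
    with torch.no_grad():
        for p in noisy.parameters():
            p.add_(sigma * torch.randn_like(p))
    noisy_logits = noisy(x)

    kl = F.kl_div(F.log_softmax(noisy_logits, dim=1),
                  F.softmax(logits, dim=1), reduction="batchmean")
    return task_loss + lam * kl
```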
Date Created
2019

On-chip learning and inference acceleration of sparse representations

Description
The past decade has seen a tremendous surge in running machine learning (ML) functions on mobile devices, from mere novelty applications to now-indispensable features for the next generation of devices. While mobile platform capabilities range widely, long battery life and reliability are common design concerns that are crucial to remaining competitive. Consequently, state-of-the-art mobile platforms have become highly heterogeneous by combining powerful CPUs with GPUs to accelerate the computation of deep neural networks (DNNs), which are the most common structures used to perform ML operations. However, traditional von Neumann architectures are not optimized for the high memory bandwidth and massively parallel computation demanded by DNNs, which has propelled research into non-von Neumann architectures that can support these demands.

The re-imagining of computer architectures to perform efficient DNN computations requires focusing on the prohibitive demands presented by DNNs and alleviating them. The two central challenges for efficient computation are (1) large memory storage and movement due to the weights of the DNN and (2) massively parallel multiplications to compute the DNN output.

Introducing sparsity into the DNNs, where a certain percentage of either the weights or the outputs of the DNN are zero, greatly helps with both challenges. This, along with algorithm-hardware co-design to compress the DNNs, is demonstrated to provide efficient solutions that greatly reduce the power consumption of the hardware that computes DNNs. Additionally, exploring emerging technologies such as non-volatile memories and 3-D stacking of silicon in conjunction with algorithm-hardware co-design architectures will pave the way for the next generation of mobile devices.

Towards the objectives stated above, our specific contributions include (a) an architecture based on a resistive crosspoint array that can update all stored values and compute matrix-vector multiplication in parallel within a single cycle, (b) a framework for training DNNs with block-wise sparsity to drastically reduce the memory storage and the total number of computations required to compute the output of DNNs, (c) the exploration of hardware implementations of sparse DNNs and architectural guidelines to reduce power consumption for implementations in monolithic 3D integrated circuits, and (d) a prototype accelerator chip in 65 nm CMOS for long short-term memory networks trained with the proposed block-wise sparsity scheme.
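
As an illustration of block-wise sparsity as in contribution (b), the sketch below zeroes out the weight blocks with the smallest L2 norm; the block size and sparsity level are arbitrary, and the thesis trains with the sparsity constraint rather than pruning a trained matrix post hoc.

```python
# Illustrative block-wise pruning of a weight matrix.
import numpy as np

def blockwise_prune(weights, block=(8, 8), sparsity=0.75):
    rows, cols = weights.shape
    br, bc = block
    assert rows % br == 0 and cols % bc == 0
    blocks = weights.reshape(rows // br, br, cols // bc, bc)
    norms = np.sqrt((blocks ** 2).sum(axis=(1, 3)))    # one L2 norm per block
    k = int(sparsity * norms.size)                     # number of blocks to drop
    cutoff = np.sort(norms, axis=None)[k]
    keep = (norms >= cutoff)[:, None, :, None]         # broadcastable block mask
    return (blocks * keep).reshape(rows, cols)

pruned = blockwise_prune(np.random.randn(64, 64))      # ~75% of 8x8 blocks zeroed
```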
Date Created
2019

Design and Optimization of Resistive RAM-based Storage and Computing Systems

Description
Resistive Random Access Memory (ReRAM) is an emerging non-volatile memory technology with attractive attributes, including excellent scalability (< 10 nm), low programming voltage (< 3 V), fast switching speed (< 10 ns), high OFF/ON ratio (> 10), good endurance (up to 10^12 cycles), and great compatibility with silicon CMOS technology [1]. However, ReRAM suffers from larger write latency and energy, as well as reliability issues, compared to Dynamic Random Access Memory (DRAM). To improve the energy efficiency, latency, and reliability of ReRAM storage systems, a low-cost cross-layer approach that spans the device, circuit, architecture, and system levels is proposed.

For the 1T1R 2D ReRAM system, the effect of both retention and endurance errors on ReRAM reliability is considered. The proposed approach is to design circuit-level and architecture-level techniques that significantly reduce the raw bit error rate and then employ low-cost error control coding to achieve the desired lifetime.

For the 1S1R 2D ReRAM system, a cross-point array with “multi-bit per access” per subarray is designed for high energy efficiency and good reliability. The errors due to cell-level as well as array-level variations are analyzed, and a low-cost scheme that maintains reliability and latency with low energy consumption is proposed.

For the 1S1R 3D ReRAM system, access schemes that activate multiple subarrays, with multiple layers per subarray, are used to achieve high energy efficiency by activating fewer subarrays, and good reliability is achieved through innovative data organization.

Finally, a novel ReRAM-based accelerator design is proposed to support multiple Convolutional Neural Network (CNN) topologies, including VGGNet, AlexNet, and ResNet. The multi-tiled architecture consists of 9 processing elements per tile, where each tile implements the dot-product operation using ReRAM as the computation unit. The processing elements operate in a systolic fashion, thereby maximizing input feature map reuse and minimizing interconnection cost. The system-level evaluation on several network benchmarks shows that the proposed architecture can improve computation efficiency and energy efficiency compared to a state-of-the-art ReRAM-based accelerator.
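
The in-ReRAM dot product can be pictured as below: input voltages drive the rows, cell conductances store the weights, and each bit-line current sums the element-wise products by Kirchhoff's current law. Values and scaling are illustrative only.

```python
# Conceptual sketch of an analog matrix-vector multiply on a ReRAM crossbar.
import numpy as np

def crossbar_mvm(conductance_S, input_voltage_V):
    """conductance_S: [rows, cols] cell conductances (siemens, encode weights);
    input_voltage_V: [rows] word-line voltages (encode inputs).
    Returns bit-line currents I_j = sum_i V_i * G_ij (amperes)."""
    return input_voltage_V @ conductance_S

G = np.random.uniform(1e-6, 1e-4, size=(128, 64))   # programmed conductance states
v = np.random.uniform(0.0, 0.2, size=128)            # read voltages
currents = crossbar_mvm(G, v)                         # shape (64,)
```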
Date Created
2019