Database Storage Design for Model Serving Workloads

The meteoric rise of Deep Neural Networks (DNN) has led to the development of various Machine Learning (ML) frameworks (e.g., TensorFlow, PyTorch). Every ML framework has a different way of handling DNN models, data types, the operations involved, and the internal representations stored on disk or in memory. Initiatives such as the Open Neural Network Exchange (ONNX) aim for a more standardized approach to machine learning and better interoperability between the popular ML frameworks. Model Serving Platforms (MSP) (e.g., TensorFlow Serving, Clipper) are used for serving DNN models to applications and edge devices. These platforms have gained widespread use for their flexibility in serving DNN models created by various ML frameworks. They also have additional capabilities such as caching, automatic ensembling, and scheduling. However, few of these platforms focus on optimizing the storage of the DNN models, some of which may take up to ∼130 GB of storage space (“Turing-NLG: A 17-billion-parameter language model by Microsoft” 2020). Instead, MSPs leave it to the ML frameworks to optimize the DNN model with model compression techniques such as quantization and pruning. This thesis investigates the viability of automatic cross-model compression using traditional deduplication techniques and storage optimizations. Scenarios are identified where different DNN models have shareable model weight parameters. “Chunking” a model into smaller pieces is explored as an approach for deduplication. This thesis also proposes a design for storage in a Relational Database Management System (RDBMS) that allows for automatic cross-model deduplication.
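The chunk-based deduplication idea described above can be illustrated with a minimal sketch. This is not the thesis's design: the fixed chunk size, the SHA-256 content hash, the in-memory dictionaries standing in for RDBMS tables, and the names `ChunkStore`, `put_model`, and `get_model` are all assumptions made for illustration only.

```python
import hashlib

CHUNK_SIZE = 4  # hypothetical chunk size in bytes; tiny so the example is easy to follow


def chunk_bytes(blob, size=CHUNK_SIZE):
    """Split a byte string into fixed-size chunks (the last one may be shorter)."""
    return [blob[i:i + size] for i in range(0, len(blob), size)]


class ChunkStore:
    """Toy stand-in for two tables: unique chunks, and per-model chunk lists."""

    def __init__(self):
        self.chunks = {}  # content digest -> chunk bytes (each chunk stored once)
        self.models = {}  # model name -> ordered list of content digests

    def put_model(self, name, weights):
        digests = []
        for chunk in chunk_bytes(weights):
            digest = hashlib.sha256(chunk).hexdigest()
            # Identical chunks from any model map to the same digest,
            # so they are stored only once.
            self.chunks.setdefault(digest, chunk)
            digests.append(digest)
        self.models[name] = digests

    def get_model(self, name):
        """Reassemble a model's bytes from its ordered chunk digests."""
        return b"".join(self.chunks[d] for d in self.models[name])

    def stored_bytes(self):
        return sum(len(c) for c in self.chunks.values())


# Two hypothetical "models" whose serialized weights share a common prefix.
model_a = b"AAAABBBBCCCC"
model_b = b"AAAABBBBDDDD"

store = ChunkStore()
store.put_model("model_a", model_a)
store.put_model("model_b", model_b)
```

In this toy case the two models occupy 24 bytes when stored separately, while the store keeps only the four distinct chunks. Real weight tensors rarely match byte for byte, so how much is saved depends on the sharing scenarios the thesis identifies.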

Augmenting Academic Research Search And Reading With Richer Context

The volume of scientific research has grown at an exponential rate over the past 100 years. With the advent of the internet and ubiquitous access to the web, academic research search engines such as Google Scholar and Microsoft Academic have become the go-to platforms for systematic reviews and search. Although many academic search engines host a large amount of content, they provide minimal context about where the search terms matched. Many of these search engines also fail to provide additional tools that can enhance a researcher’s understanding of research content outside their respective websites. An example of such a tool is a browser extension/plugin that surfaces context-relevant information about a research article while the user reads it. This dissertation discusses a solution that uses more of the intrinsic characteristics of research documents, such as the structure of the document, the tables in the document, and the keywords associated with the document, to improve search capabilities and augment the information a researcher may read. The prototype solution, named Sci-Genie, is a search engine over scientific articles from Computer Science ArXiv. Sci-Genie parses research papers and indexes their structure to provide context-relevant information about the matched search fragments. The same search engine also powers a browser extension that augments the information about a research article the user may be reading. The browser extension augments the user’s interface with information about tables from the cited papers, other papers by the same authors, and the citations to and from the current article. The browser extension is further supported by access endpoints that use a machine learning model to filter for tables that compare various entities. The dissertation further discusses these machine learning models and some baselines that help classify whether or not a table compares various entities. The dissertation concludes by discussing the current shortcomings of Sci-Genie and possible future research directions based on what was learned from building it.

Optimization of Block-based Tensor Decompositions through Sub-Tensor Impact Graphs and Applications to Dynamicity in Data and User Focus

Tensors are commonly used for representing multi-dimensional data, such as Web graphs, sensor streams, and social networks. As a consequence of the increase in the use of tensors, tensor decomposition operations began to form the basis for many data analysis and knowledge discovery tasks, from clustering, trend detection, and anomaly detection to correlation analysis [31, 38]. It is well known that Singular Value Decomposition (SVD) [9] is used to extract latent semantics from matrix data. Applying the same idea to tensors, which have more than two modes, is known as tensor decomposition. The two most popular tensor decomposition algorithms are the Tucker [54] and the CP [19] decompositions. Intuitively, they both generalize SVD to tensors. However, one key problem with tensor decomposition is its computational complexity, which may cause a system bottleneck. Therefore, two-phase block-centric CP tensor decomposition (2PCP) was proposed to partition the tensor into small sub-tensors, execute the sub-tensor decompositions in parallel, and combine the factors from each sub-tensor into the final decomposition factors through an iterative refinement process. I propose the Sub-tensor Impact Graph (SIG) to account for inaccuracy propagation among sub-tensors and to measure the impact of the decomposition of each sub-tensor on the decompositions of the others. Based on SIG, I propose several strategies to optimize 2PCP's phase-2 refinement process. Furthermore, I apply SIG and these optimization strategies to data focus, data evolution, and focus shifting in tensor analysis. Personalized Tensor Decomposition (PTD) is proposed to account for the user's focus, given the observation that in many applications the user may have a focus of interest, i.e., a part of the data for which the user needs high accuracy, while beyond this area of focus accuracy may not be as critical. PTD takes as input one or more areas of focus and performs the decomposition in such a way that, when reconstructed, the accuracy of the tensor is boosted for these areas of focus. A related challenge of data evolution in tensor analytics is incremental tensor decomposition, since re-computing the whole tensor decomposition with each update causes high computational costs and incurs large memory overheads, especially for applications where data evolves over time and the tensor-based analysis results need to be continuously maintained. To avoid re-decomposition, I propose a two-phase block-incremental CP-based tensor decomposition technique, BICP, that efficiently and effectively maintains tensor decomposition results in the presence of dynamically evolving tensor data. I further extend the research to user focus shift, since user focus may change over time as the data evolves. Although PTD is efficient, re-computation for each user preference update can become the bottleneck of the system. Therefore, I propose a dynamically evolving user focus tensor decomposition that reuses the existing decomposition result to improve the efficiency of evolving user focus block decomposition.

Shuffle Overhead Analysis for the Layered Data Abstractions

Apache Spark is one of the most widely adopted open-source Big Data processing engines. High performance and ease of use for a wide class of users are some of the primary reasons for its wide adoption. Although data partitioning increases the performance of analytics workloads, its application to Apache Spark is very limited due to layered data abstractions. Once data is written to a stable storage system like the Hadoop Distributed File System (HDFS), the data locality information is lost, and when the data is read back into Spark’s in-memory layer, the reading process is random, which incurs shuffle overhead. This report investigates the use of metadata stored along with the data itself to reduce shuffle overhead in join-based workloads. It explores the Hyperspace library to mitigate the shuffle overhead for Spark SQL applications. The report also introduces the Lachesis system to solve the shuffle overhead problem. The benchmark results show that the persistent partition and co-location techniques can be beneficial for matrix multiplication using SQL (Structured Query Language) operators, as well as for the TPC-H analytical query benchmark. The study concludes with a discussion of the trade-offs between integrated stable storage and layered storage abstractions. It also discusses the feasibility of integrating the Machine Learning (ML) inference phase with the SQL operators, along with cross-engine compatibility for employing data locality information.

Generating Trusted Coordination of Collaborative Software Development Using Blockchain

The coordination of developing various complex and large-scale projects using computers is well established and is known as computer-supported cooperative work (CSCW). Collaborative software development consists of a group of teams working together toward the common goal of developing a high-quality, complex, and large-scale software system efficiently, and it requires common processes and communication channels among these teams. The common processes for coordination among software development teams can be handled by similar principles in CSCW. The development of complex and large-scale software becomes complicated due to the involvement of many software development teams. The development of such a software system can be largely improved by effective collaboration among the participating software development teams at both the software component and system levels. The efficiency of developing software components depends on trusted coordination among the participating teams for sharing, processing, and managing information on the various participating teams, which often operate in a distributed environment. Participating teams may belong to the same organization or to different organizations. Existing approaches to coordination in collaborative software development are based on using a centralized repository to store, process, and retrieve information on participating software development teams during the development. These approaches use a centralized authority, have a single point of failure, and restrict rights to own data and software. In this thesis, the generation of trusted coordination in collaborative software development using blockchain is studied, and an approach to achieving trusted cooperation for collaborative software development using blockchain is presented. Smart contracts are created in the blockchain to encode software specifications and acceptance criteria for the software results generated by participating teams. The blockchain used in the approach is a private blockchain because a private blockchain provides non-repudiation, privacy, and integrity, which are required in trusted coordination of collaborative software development. This approach is implemented using Hyperledger, an open-source private blockchain. An example to illustrate the approach is also given.

Identification of Compromised Nodes in Collaborative Intrusion Detection Systems for Large Scale Networks Due to Insider Attacks

Large organizations have multiple networks that are subject to attacks, which can be detected by Intrusion Detection Systems that continuously monitor and analyze the network traffic. Collaborative Intrusion Detection Systems (CIDS) are used for efficient detection of distributed attacks by having a global view of the traffic events in large networks. However, CIDS are vulnerable to internal attacks, and these internal attacks decrease the mutual trust among the nodes that is required for sharing critical and sensitive alert data in CIDS. Without the data sharing, the nodes of CIDS cannot collaborate efficiently to form a comprehensive view of events in the monitored networks to detect distributed attacks. The compromised nodes further decrease the accuracy of CIDS by generating false positives and false negatives in the traffic event classifications. In this thesis, an approach based on a trust score system is presented to detect and suspend the compromised nodes in CIDS and thereby improve the trust among the nodes for efficient collaboration. This trust score-based approach is implemented as a consensus model on a private blockchain because a private blockchain has the features to address the accountability, integrity, and privacy requirements of CIDS. In this approach, the trust score of a malicious node is decreased with every reported false negative or false positive in its traffic event classifications. When the trust score of any node falls below a threshold, the node is identified as compromised and suspended. The approach is evaluated for the accuracy of identifying malicious nodes in CIDS.
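The trust score rule described above (decrease a node's score for each reported false positive or false negative, and suspend the node once the score drops below a threshold) can be sketched as follows. This is only an illustration of the scoring rule under stated assumptions: the initial score, penalty, and threshold values, the integer scale, and the names `TrustLedger` and `report_misclassification` are hypothetical, and the blockchain consensus layer used in the thesis is omitted entirely.

```python
class TrustLedger:
    """Toy in-memory trust score tracker; values below are hypothetical."""

    def __init__(self, nodes, initial=100, penalty=20, threshold=50):
        self.penalty = penalty        # points deducted per reported misclassification
        self.threshold = threshold    # a score below this marks the node as compromised
        self.scores = {node: initial for node in nodes}
        self.suspended = set()

    def report_misclassification(self, node):
        """Record one false positive or false negative attributed to `node`."""
        if node in self.suspended:
            return  # suspended nodes no longer take part
        self.scores[node] -= self.penalty
        if self.scores[node] < self.threshold:
            self.suspended.add(node)

    def is_suspended(self, node):
        return node in self.suspended


ledger = TrustLedger(["n1", "n2", "n3"])

# Hypothetical scenario: node n2 is reported three times, node n1 once.
for _ in range(3):
    ledger.report_misclassification("n2")
ledger.report_misclassification("n1")
```

With these assumed values, n2 ends at a score of 40 and is suspended, while n1 ends at 80 and stays active. In the thesis, reports are agreed on through consensus on a private blockchain rather than recorded by a single trusted tracker.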