Database Storage Design for Model Serving Workloads
Description
The meteoric rise of Deep Neural Networks (DNN) has led to the development of various Machine Learning (ML) frameworks (e.g., TensorFlow, PyTorch). Every ML framework handles DNN models differently: the data types, the operations involved, and the internal representations stored on disk or in memory all vary. Initiatives such as the Open Neural Network Exchange (ONNX) aim at a more standardized approach to machine learning, enabling better interoperability between the popular ML frameworks. Model Serving Platforms (MSP) (e.g., TensorFlow Serving, Clipper) are used for serving DNN models to applications and edge devices. These platforms have gained widespread use for their flexibility in serving DNN models created by various ML frameworks, and they offer additional capabilities such as caching, automatic ensembling, and scheduling. However, few of these platforms focus on optimizing the storage of DNN models, some of which may take up to ~130 GB of storage space (“Turing-NLG: A 17-billion-parameter language model by Microsoft” 2020). These MSPs leave it to the ML frameworks to optimize the DNN model with model compression techniques such as quantization and pruning. This thesis investigates the viability of automatic cross-model compression using traditional deduplication techniques and storage optimizations. Scenarios are identified where different DNN models have shareable model weight parameters, and “chunking” a model into smaller pieces is explored as an approach to deduplication. This thesis also proposes a storage design in a Relational Database Management System (RDBMS) that allows for automatic cross-model deduplication.
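As an illustration of the chunk-level deduplication idea summarized above, the sketch below splits the raw bytes of each model's weights into fixed-size chunks and keys every chunk by a content hash, so chunks that are identical across models are stored only once. The chunk size, the use of SHA-256, and all function names here are illustrative assumptions, not the specific design proposed in the thesis.

```python
import hashlib
import numpy as np

CHUNK_SIZE = 1 << 20  # 1 MiB per chunk (illustrative choice, not the thesis's value)


def chunk_tensor(weights: np.ndarray, chunk_size: int = CHUNK_SIZE) -> list[bytes]:
    """Split a weight tensor's raw bytes into fixed-size chunks."""
    raw = weights.tobytes()
    return [raw[i:i + chunk_size] for i in range(0, len(raw), chunk_size)]


def deduplicate(models: dict[str, np.ndarray]):
    """Store each distinct chunk once, keyed by its content hash.

    Returns a chunk store (hash -> bytes) and, per model, the ordered list
    of chunk hashes needed to reconstruct that model's weights.
    """
    chunk_store: dict[str, bytes] = {}
    manifests: dict[str, list[str]] = {}
    for name, weights in models.items():
        hashes = []
        for chunk in chunk_tensor(weights):
            digest = hashlib.sha256(chunk).hexdigest()
            chunk_store.setdefault(digest, chunk)  # identical chunks stored once
            hashes.append(digest)
        manifests[name] = hashes
    return chunk_store, manifests
```

In an RDBMS-backed design of the kind the thesis targets, the chunk store would correspond to a table of content-addressed chunks and the manifests to a mapping table from models to chunk hashes; the in-memory dictionaries above merely sketch the deduplication logic.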