Building Vision and Language Models with Implicit Supervision and Increased Efficiency

Description
An important objective of AI is to understand real-world observations and to communicate interactively with people. Interpreting and reacting to perception requires systems that operate across both Vision (V) and Language (L). Although there has been massive effort on various VL tasks, e.g., Image/Video Captioning, Visual Question Answering, and Textual Grounding, very few works focus on building VL models with increased efficiency under real-world scenarios. The main focus of this dissertation is to comprehensively investigate the largely uncharted area of efficient VL learning, aiming to build lightweight, data-efficient, and real-world applicable VL models. The proposed studies take three primary aspects of efficient VL into account. 1) Data Efficiency: collecting task-specific annotations is prohibitively expensive, so manual labeling is not always attainable. Techniques are developed to assist VL learning from implicit supervision, i.e., in a weakly-supervised fashion. 2) Continuing from that, efficient representation learning is further explored with increased scalability, leveraging a large image-text corpus without task-specific annotations. In particular, knowledge distillation is studied for generic representation learning and is shown to bring substantial performance gains over the regular representation learning schema. 3) Architectural Efficiency: deploying VL models on edge devices is notoriously challenging due to their cumbersome architectures. To further extend these advancements to the real world, a novel efficient VL architecture is designed to tackle the inference bottleneck and the inconvenient two-stage training. Several critical aspects that prominently influence the performance of compact VL models are discussed extensively.
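For readers unfamiliar with knowledge distillation, the following is a minimal sketch of the generic soft-target formulation, in which a student is trained to match a teacher's temperature-softened output distribution. It is not the dissertation's method: the function names, the temperature value, and the use of classification-style logits are all assumptions made for illustration.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(
    student_logits: Sequence[float],
    teacher_logits: Sequence[float],
    temperature: float = 2.0,  # assumed value, chosen only for illustration
) -> float:
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    kl = sum(p * math.log(p / q) for p, q in zip(teacher, student) if p > 0)
    return kl * temperature ** 2


if __name__ == "__main__":
    # The loss is zero when the student matches the teacher and positive otherwise.
    print(distillation_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
    print(distillation_loss([3.0, 2.0, 1.0], [1.0, 2.0, 3.0]))
```

In practice this term is usually combined with the student's ordinary training objective; how the dissertation weights or adapts it is not stated in the abstract.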
Date Created
2022
Agent

iLieDown - Improved Display Orientation For Handheld Devices Using Convolutional Neural Networks.pdf

Description
91% of smartphone and tablet users experience a problem with their device screen being oriented the wrong way during use [11]. The authors of [11] proposed iRotate, an earlier solution that uses computer vision to solve the orientation problem. We propose iLieDown, an improved method of automatically rotating the displays of smartphones, tablets, and other devices. This paper introduces a new algorithm that correctly orients the display relative to the user’s face using a convolutional neural network (CNN). The CNN model is trained with data augmentation to predict the rotation of faces in various environments; for accuracy and robustness, it applies a confidence threshold and analyzes multiple images. iLieDown is battery and CPU efficient, causes no noticeable lag during use, and is 6x more accurate than iRotate.
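One way a confidence threshold and multiple images might be combined is sketched below. This is an illustrative assumption, not the paper's algorithm: the function name `decide_rotation`, the threshold of 0.8, the minimum vote count, and the majority-vote rule are all hypothetical, and the per-image predictions are taken as given rather than produced by a real CNN.

```python
from collections import Counter
from typing import Optional, Sequence, Tuple

# Each prediction is (rotation in degrees, model confidence in [0, 1]).
Prediction = Tuple[int, float]


def decide_rotation(
    predictions: Sequence[Prediction],
    threshold: float = 0.8,  # assumed value, not taken from the paper
    min_votes: int = 2,      # assumed value, not taken from the paper
) -> Optional[int]:
    """Return the rotation to apply, or None to leave the display unchanged."""
    # Discard predictions the model is not confident about.
    confident = [angle for angle, conf in predictions if conf >= threshold]
    if not confident:
        return None
    # Require agreement across several images before rotating.
    angle, votes = Counter(confident).most_common(1)[0]
    return angle if votes >= min_votes else None


if __name__ == "__main__":
    frames = [(90, 0.95), (90, 0.91), (0, 0.40)]
    print(decide_rotation(frames))
```

Returning `None` when the evidence is weak or inconsistent keeps the display where it is, which avoids spurious rotations at the cost of occasionally reacting more slowly.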
Date Created
2019-12
Agent