Full metadata
Title
Building Vision and Language Models with Implicit Supervision and Increased Efficiency
Description
An important objective of AI is to understand real-world observations and communicate interactively with people. Interpreting and reacting to what is perceived requires systems that operate across both Vision (V) and Language (L). Although there have been massive efforts on various VL tasks, e.g., Image/Video Captioning, Visual Question Answering, and Textual Grounding, very few focus on building VL models with increased efficiency under real-world scenarios. The main focus of this dissertation is to comprehensively investigate the largely uncharted area of efficient VL learning, aiming to build lightweight, data-efficient, and real-world applicable VL models. The proposed studies consider three primary aspects of efficient VL: 1) Data Efficiency. Collecting task-specific annotations is prohibitively expensive, and manual labeling is not always attainable. Techniques are developed to assist VL learning from implicit supervision, i.e., in a weakly-supervised fashion. 2) Efficient Representation Learning. Continuing from that, efficient representation learning is explored with increased scalability, leveraging a large image-text corpus without task-specific annotations. In particular, knowledge distillation is studied for generic representation learning and is shown to bring a substantial performance gain over the regular representation learning schema. 3) Architectural Efficiency. Deploying VL models on edge devices is notoriously challenging due to their cumbersome architectures. To extend these advancements to the real world, a novel efficient VL architecture is designed to tackle the inference bottleneck and the inconvenient two-stage training. Several critical aspects that prominently influence the performance of compact VL models are discussed extensively.
Date Created
2022
Contributors
- Fang, Zhiyuan (Author)
- Yang, Yezhou (Thesis advisor)
- Baral, Chitta (Committee member)
- Liu, Huan (Committee member)
- Liu, Zicheng (Committee member)
- Arizona State University (Publisher)
Extent
197 pages
Language
eng
Copyright Statement
In Copyright
Peer-reviewed
No
Open Access
No
Handle
https://hdl.handle.net/2286/R.2.N.171740
Level of coding
minimal
Note
Partial requirement for: Ph.D., Arizona State University, 2022
Field of study: Computer Science
System Created
- 2022-12-20 06:19:18
System Modified
- 2022-12-20 06:19:18