Implicit Hypothetical Reasoning about Intrinsic Physical Properties

171495-Thumbnail Image.png
Description
Multimodal reasoning is one of the most interesting research fields because of the ability to interact with systems and the explainability of the models' behavior. Traditional multimodal research problems do not focus on complex commonsense reasoning (such as physical

Multimodal reasoning is one of the most interesting research fields because of the ability to interact with systems and the explainability of the models' behavior. Traditional multimodal research problems do not focus on complex commonsense reasoning (such as physical interactions). Although real-world objects have physical properties associated with them, many of these properties (such as mass and coefficient of friction) are not captured directly by the imaging pipeline. Videos often capture objects, their motion, and the interactions between different objects. However, these properties can be estimated by utilizing cues from relative object motion and the dynamics introduced by collisions. This thesis introduces a new video question-answering task for reasoning about the implicit physical properties of objects in a scene, from videos. For this task, I introduce a dataset -- CRIPP-VQA (Counterfactual Reasoning about Implicit Physical Properties - Video Question Answering), which contains videos of objects in motion, annotated with hypothetical/counterfactual questions about the effect of actions (such as removing, adding, or replacing objects), questions about planning (choosing actions to perform to reach a particular goal), as well as descriptive questions about the visible properties of objects. Further, I benchmark the performance of existing video question-answering models on two test settings of CRIPP-VQA: i.i.d. and an out-of-distribution setting which contains objects with values of mass, coefficient of friction, and initial velocities that are not seen in the training distribution. Experiments reveal a surprising and significant performance gap in terms of answering questions about implicit properties (the focus of this thesis) and explicit properties (the focus of prior work) of objects.
Date Created
2022
Agent

Operational Safety Assessment Methodology Framework: An Approach to Quantifying Automated Vehicle Safety

171430-Thumbnail Image.png
Description
To date, there is not a standardized method for consistently quantifying the performance of an automated driving system (ADS)-equipped vehicle (AV). The purpose of this dissertation is to contribute to a framework for such an approach referred to throughout as

To date, there is not a standardized method for consistently quantifying the performance of an automated driving system (ADS)-equipped vehicle (AV). The purpose of this dissertation is to contribute to a framework for such an approach referred to throughout as the operational safety assessment (OSA) methodology. Through this research, safety metrics are identified, researched, and analyzed to capture aspects of the operational safety of AVs, interacting with other salient objects. This dissertation outlines the approach for developing this methodology through a series of key steps including: (1) comprehensive literature review; (2) research and refinement of OSA metrics; (3) generation of MATLAB script for metric calculations; (4) generation of simulated events for analysis; (5) collection of real-world data for analysis; (6) review of OSA methodology results; and (7) discussion of future work to expand complexity, fidelity, and relevance aspects of the OSA methodology. The detailed literature review includes the identification of metrics historically used in both traditional and more recent evaluations of vehicle performance. Subsequently, the metric formulations are refined, and robust severity evaluations are proposed. A MATLAB script is then presented which was generated to calculate the metrics from any given source assuming proper formatting of the data. To further refine the formulations and the MATLAB script, a variety of simulated scenarios are discussed including car-following, intersection, and lane change situations. Additionally, a data collection activity is presented, leveraging the SMARTDRIVE testbed operated by Maricopa County Department of Transportation in Anthem, AZ to collect real-world data from an active intersection. Lastly, the efficacy of the OSA methodology with respect to the evaluation of vehicle performance for a set of scenarios is evaluated utilizing both simulated and real-world data. This assessment provides a demonstration of the ability and robustness of this methodology to evaluate vehicle performance for a given scenario. At the conclusion of this dissertation, additional factors including fidelity, complexity, and relevance are explored to contribute to a more comprehensive evaluation.
Date Created
2022
Agent

Vehicle Re-identification Using a Multi-View Vehicle Dataset

168842-Thumbnail Image.png
Description
There has been an explosion in the amount of data on the internet because of modern technology – especially image data – as a consequence of an exponential growth in the number of cameras existing in the world right now;

There has been an explosion in the amount of data on the internet because of modern technology – especially image data – as a consequence of an exponential growth in the number of cameras existing in the world right now; from more extensive surveillance camera systems to billions of people walking around with smartphones in their pockets that come with built-in cameras. With this sudden increase in the accessibility of cameras, most of the data that is getting captured through these devices is ending up on the internet. Researchers soon took leverage of this data by creating large-scale datasets. However, generating a dataset – let alone a large-scale one – requires a lot of man-hours. This work presents an algorithm that makes use of optical flow and feature matching, along with utilizing localization outputs from a Mask R-CNN, to generate large-scale vehicle datasets without much human supervision. Additionally, this work proposes a novel multi-view vehicle dataset (MVVdb) of 500 vehicles which is also generated using the aforementioned algorithm.There are various research problems in computer vision that can leverage a multi-view dataset, e.g., 3D pose estimation, and 3D object detection. On the other hand, a multi-view vehicle dataset can be used for a 2D image to 3D shape prediction, generation of 3D vehicle models, and even a more robust vehicle make and model recognition. In this work, a ResNet is trained on the multi-view vehicle dataset to perform vehicle re-identification, which is fundamentally similar to a vehicle make and recognition problem – also showcasing the usability of the MVVdb dataset.
Date Created
2022
Agent

Video Captioning with Commonsense Knowledge Anchors

168821-Thumbnail Image.png
Description
It is not merely an aggregation of static entities that a video clip carries, but alsoa variety of interactions and relations among these entities. Challenges still remain for a video captioning system to generate natural language descriptions focusing on the prominent interest

It is not merely an aggregation of static entities that a video clip carries, but alsoa variety of interactions and relations among these entities. Challenges still remain for a video captioning system to generate natural language descriptions focusing on the prominent interest and aligning with the latent aspects beyond observations. This work presents a Commonsense knowledge Anchored Video cAptioNing (dubbed as CAVAN) approach. CAVAN exploits inferential commonsense knowledge to assist the training of video captioning model with a novel paradigm for sentence-level semantic alignment. Specifically, commonsense knowledge is queried to complement per training caption by querying a generic knowledge atlas ATOMIC, and form the commonsense- caption entailment corpus. A BERT based language entailment model trained from this corpus then serves as a commonsense discriminator for the training of video captioning model, and penalizes the model from generating semantically misaligned captions. With extensive empirical evaluations on MSR-VTT, V2C and VATEX datasets, CAVAN consistently improves the quality of generations and shows higher keyword hit rate. Experimental results with ablations validate the effectiveness of CAVAN and reveals that the use of commonsense knowledge contributes to the video caption generation.
Date Created
2022
Agent

Towards Energy­-efficient Visual Navigation: Sensor Quantization and Event­-based Vision Pipelines

168739-Thumbnail Image.png
Description
Visual navigation is a useful and important task for a variety of applications. As the preva­lence of robots increase, there is an increasing need for energy-­efficient navigation methods as well. Many aspects of efficient visual navigation algorithms have been implemented

Visual navigation is a useful and important task for a variety of applications. As the preva­lence of robots increase, there is an increasing need for energy-­efficient navigation methods as well. Many aspects of efficient visual navigation algorithms have been implemented in the lit­erature, but there is a lack of work on evaluation of the efficiency of the image sensors. In this thesis, two methods are evaluated: adaptive image sensor quantization for traditional camera pipelines as well as new event­-based sensors for low­-power computer vision.The first contribution in this thesis is an evaluation of performing varying levels of sen­sor linear and logarithmic quantization with the task of visual simultaneous localization and mapping (SLAM). This unconventional method can provide efficiency benefits with a trade­ off between accuracy of the task and energy-­efficiency. A new sensor quantization method, gradient­-based quantization, is introduced to improve the accuracy of the task. This method only lowers the bit level of parts of the image that are less likely to be important in the SLAM algorithm since lower bit levels signify better energy­-efficiency, but worse task accuracy. The third contribution is an evaluation of the efficiency and accuracy of event­-based camera inten­sity representations for the task of optical flow. The results of performing a learning based optical flow are provided for each of five different reconstruction methods along with ablation studies. Lastly, the challenges of an event feature­-based SLAM system are presented with re­sults demonstrating the necessity for high quality and high­ resolution event data. The work in this thesis provides studies useful for examining trade­offs for an efficient visual navigation system with traditional and event vision sensors. The results of this thesis also provide multiple directions for future work.
Date Created
2022
Agent

ARGOS: Adaptive Recognition and Gated Operation System for Real-time Vision Applications

168714-Thumbnail Image.png
Description
Deep neural network-based methods have been proved to achieve outstanding performance on object detection and classification tasks. Deep neural networks follow the ``deeper model with deeper confidence'' belief to gain a higher recognition accuracy. However, reducing these networks' computational costs

Deep neural network-based methods have been proved to achieve outstanding performance on object detection and classification tasks. Deep neural networks follow the ``deeper model with deeper confidence'' belief to gain a higher recognition accuracy. However, reducing these networks' computational costs remains a challenge, which impedes their deployment on embedded devices. For instance, the intersection management of Connected Autonomous Vehicles (CAVs) requires running computationally intensive object recognition algorithms on low-power traffic cameras. This dissertation aims to study the effect of a dynamic hardware and software approach to address this issue. Characteristics of real-world applications can facilitate this dynamic adjustment and reduce the computation. Specifically, this dissertation starts with a dynamic hardware approach that adjusts itself based on the toughness of input and extracts deeper features if needed. Next, an adaptive learning mechanism has been studied that use extracted feature from previous inputs to improve system performance. Finally, a system (ARGOS) was proposed and evaluated that can be run on embedded systems while maintaining the desired accuracy. This system adopts shallow features at inference time, but it can switch to deep features if the system desires a higher accuracy. To improve the performance, ARGOS distills the temporal knowledge from deep features to the shallow system. Moreover, ARGOS reduces the computation furthermore by focusing on regions of interest. The response time and mean average precision are adopted for the performance evaluation to evaluate the proposed ARGOS system.
Date Created
2022
Agent

Topology Processing of Retinotopic Maps

168694-Thumbnail Image.png
Description
Retinotopic map, the map between visual inputs on the retina and neuronal activation in brain visual areas, is one of the central topics in visual neuroscience. For human observers, the map is typically obtained by analyzing functional magnetic resonance imaging

Retinotopic map, the map between visual inputs on the retina and neuronal activation in brain visual areas, is one of the central topics in visual neuroscience. For human observers, the map is typically obtained by analyzing functional magnetic resonance imaging (fMRI) signals of cortical responses to slowly moving visual stimuli on the retina. Biological evidences show the retinotopic mapping is topology-preserving/topological (i.e. keep the neighboring relationship after human brain process) within each visual region. Unfortunately, due to limited spatial resolution and the signal-noise ratio of fMRI, state of art retinotopic map is not topological. The topic was to model the topology-preserving condition mathematically, fix non-topological retinotopic map with numerical methods, and improve the quality of retinotopic maps. The impose of topological condition, benefits several applications. With the topological retinotopic maps, one may have a better insight on human retinotopic maps, including better cortical magnification factor quantification, more precise description of retinotopic maps, and potentially better exam ways of in Ophthalmology clinic.
Date Created
2022
Agent

Implicitly Supervised Neural Question Answering

168624-Thumbnail Image.png
Description
How to teach a machine to understand natural language? This question is a long-standing challenge in Artificial Intelligence. Several tasks are designed to measure the progress of this challenge. Question Answering is one such task that evaluates a machine's ability

How to teach a machine to understand natural language? This question is a long-standing challenge in Artificial Intelligence. Several tasks are designed to measure the progress of this challenge. Question Answering is one such task that evaluates a machine's ability to understand natural language, where it reads a passage of text or an image and answers comprehension questions. In recent years, the development of transformer-based language models and large-scale human-annotated datasets has led to remarkable progress in the field of question answering. However, several disadvantages of fully supervised question answering systems have been observed. Such as generalizing to unseen out-of-distribution domains, linguistic style differences in questions, and adversarial samples. This thesis proposes implicitly supervised question answering systems trained using knowledge acquisition from external knowledge sources and new learning methods that provide inductive biases to learn question answering. In particular, the following research projects are discussed: (1) Knowledge Acquisition methods: these include semantic and abductive information retrieval for seeking missing knowledge, a method to represent unstructured text corpora as a knowledge graph, and constructing a knowledge base for implicit commonsense reasoning. (2) Learning methods: these include Knowledge Triplet Learning, a method over knowledge graphs; Test-Time Learning, a method to generalize to an unseen out-of-distribution context; WeaQA, a method to learn visual question answering using image captions without strong supervision; WeaSel, weakly supervised method for relative spatial reasoning; and a new paradigm for unsupervised natural language inference. These methods potentially provide a new research direction to overcome the pitfalls of direct supervision.
Date Created
2022
Agent

Attributable Watermarking of Speech Generative Models

168441-Thumbnail Image.png
Description
Generative models in various domain such as images, speeches, and videos are beingdeveloped actively over the last decades and recent deep generative models are now capable of synthesizing multimedia contents are difficult to be distinguishable from authentic contents. Such capabilities cause concerns

Generative models in various domain such as images, speeches, and videos are beingdeveloped actively over the last decades and recent deep generative models are now capable of synthesizing multimedia contents are difficult to be distinguishable from authentic contents. Such capabilities cause concerns such as malicious impersonation, Intellectual property theft(IP theft) and copyright infringement. One method to solve these threats is to embedded attributable watermarking in synthesized contents so that user can identify the user-end models where the contents are generated from. This paper investigates a solution for model attribution, i.e., the classification of synthetic contents by their source models via watermarks embedded in the contents. Existing studies showed the feasibility of model attribution in the image domain and tradeoff between attribution accuracy and generation quality under the various adversarial attacks but not in speech domain. This work discuss the feasibility of model attribution in different domain and algorithmic improvements for generating user-end speech models that empirically achieve high accuracy of attribution while maintaining high generation quality. Lastly, several experiments are conducted show the tradeoff between attributability and generation quality under a variety of attacks on generated speech signals attempting to remove the watermarks.
Date Created
2021
Agent

Multimodal Robot Learning for Grasping and Manipulation

168406-Thumbnail Image.png
Description
Enabling robots to physically engage with their environment in a safe and efficient manner is an essential step towards human-robot interaction. To date, robots usually operate as pre-programmed workers that blindly execute tasks in highly structured environments crafted by skilled

Enabling robots to physically engage with their environment in a safe and efficient manner is an essential step towards human-robot interaction. To date, robots usually operate as pre-programmed workers that blindly execute tasks in highly structured environments crafted by skilled engineers. Changing the robots’ behavior to cover new duties or handle variability is an expensive, complex, and time-consuming process. However, with the advent of more complex sensors and algorithms, overcoming these limitations becomes within reach. This work proposes innovations in artificial intelligence, language understanding, and multimodal integration to enable next-generation grasping and manipulation capabilities in autonomous robots. The underlying thesis is that multimodal observations and instructions can drastically expand the responsiveness and dexterity of robot manipulators. Natural language, in particular, can be used to enable intuitive, bidirectional communication between a human user and the machine. To this end, this work presents a system that learns context-aware robot control policies from multimodal human demonstrations. Among the main contributions presented are techniques for (a) collecting demonstrations in an efficient and intuitive fashion, (b) methods for leveraging physical contact with the environment and objects, (c) the incorporation of natural language to understand context, and (d) the generation of robust robot control policies. The presented approach and systems are evaluated in multiple grasping and manipulation settings ranging from dexterous manipulation to pick-and-place, as well as contact-rich bimanual insertion tasks. Moreover, the usability of these innovations, especially when utilizing human task demonstrations and communication interfaces, is evaluated in several human-subject studies.
Date Created
2021
Agent