Towards Reliable Semantic Vision

187328-Thumbnail Image.png
Description
Models that learn from data are widely and rapidly being deployed today for real-world use, and have become an integral and embedded part of human lives. While these technological advances are exciting and impactful, such data-driven computer vision systems often

Models that learn from data are widely and rapidly being deployed today for real-world use, and have become an integral and embedded part of human lives. While these technological advances are exciting and impactful, such data-driven computer vision systems often fail in inscrutable ways. This dissertation seeks to study and improve the reliability of machine learning models from several perspectives including the development of robust training algorithms to mitigate the risks of such failures, construction of new datasets that provide a new perspective on capabilities of vision models, and the design of evaluation metrics for re-calibrating the perception of performance improvements. I will first address distribution shift in image classification with the following contributions: (1) two methods for improving the robustness of image classifiers to distribution shift by leveraging the classifier's failures into an adversarial data transformation pipeline guided by domain knowledge, (2) an interpolation-based technique for flagging out-of-distribution samples, and (3) an intriguing trade-off between distributional and adversarial robustness resulting from data modification strategies. I will then explore reliability considerations for \textit{semantic vision} models that learn from both visual and natural language data; I will discuss how logical and semantic sentence transformations affect the performance of vision--language models and my contributions towards developing knowledge-guided learning algorithms to mitigate these failures. Finally, I will describe the effort towards building and evaluating complex reasoning capabilities of vision--language models towards the long-term goal of robust and reliable computer vision models that can communicate, collaborate, and reason with humans.
Date Created
2023
Agent

Building Vision and Language Models with Implicit Supervision and Increased Efficiency

171740-Thumbnail Image.png
Description
An important objective of AI is to understand real-world observations and build up interactive communication with people. The ability to interpret and react to the perception reveals the important necessity of developing such a system across both the modalities of

An important objective of AI is to understand real-world observations and build up interactive communication with people. The ability to interpret and react to the perception reveals the important necessity of developing such a system across both the modalities of Vision (V) and Language (L). Although there have been massive efforts on various VL tasks, e.g., Image/Video Captioning, Visual Question Answering, and Textual Grounding, very few of them focus on building the VL models with increased efficiency under real-world scenarios. The main focus of this dissertation is to comprehensively investigate the very uncharted efficient VL learning, aiming to build lightweight, data-efficient, and real-world applicable VL models. The proposed studies in this dissertation take three primary aspects into account when it comes to efficient VL, 1). Data Efficiency: collecting task-specific annotations is prohibitively expensive and so manual labor is not always attainable. Techniques are developed to assist the VL learning from implicit supervision, i.e., in a weakly- supervised fashion. 2). Continuing from that, efficient representation learning is further explored with increased scalability, leveraging a large image-text corpus without task-specific annotations. In particular, the knowledge distillation technique is studied for generic Representation Learning which proves to bring substantial performance gain to the regular representation learning schema. 3). Architectural Efficiency. Deploying the VL model on edge devices is notoriously challenging due to their cumbersome architectures. To further extend these advancements to the real world, a novel efficient VL architecture is designed to tackle the inference bottleneck and the inconvenient two-stage training. Extensive discussions have been conducted on several critical aspects that prominently influence the performances of compact VL models.
Date Created
2022
Agent

Event Detection as Multi-Task Text Generation

171580-Thumbnail Image.png
Description
Event detection refers to the task of identifying event occurrences in a given natural language text. Event detection comprises two subtasks; recognizing event mention (event identification) and the type of event (event classification). Breaking from the sequence labeling and word

Event detection refers to the task of identifying event occurrences in a given natural language text. Event detection comprises two subtasks; recognizing event mention (event identification) and the type of event (event classification). Breaking from the sequence labeling and word classification approaches, this work models event detection, and its constituent subtasks of trigger identification and trigger classification, as independent sequence generation tasks. This work proposes a prompted multi-task generative model trained on event identification, classification, and combined event detection. The model is evaluated on on general-domain and biomedical-domain event detection datasets, achieving state-of-the-art results on the general-domain Roles Across Multiple Sentences (RAMS) dataset, establishing event detection benchmark performance on WikiEvents, and achieving competitive performance on the general-domain Massive Event Detection (MAVEN) dataset and the biomedical-domain Multi-Level Event Extraction (MLEE) dataset.
Date Created
2022
Agent

Implicit Hypothetical Reasoning about Intrinsic Physical Properties

171495-Thumbnail Image.png
Description
Multimodal reasoning is one of the most interesting research fields because of the ability to interact with systems and the explainability of the models' behavior. Traditional multimodal research problems do not focus on complex commonsense reasoning (such as physical

Multimodal reasoning is one of the most interesting research fields because of the ability to interact with systems and the explainability of the models' behavior. Traditional multimodal research problems do not focus on complex commonsense reasoning (such as physical interactions). Although real-world objects have physical properties associated with them, many of these properties (such as mass and coefficient of friction) are not captured directly by the imaging pipeline. Videos often capture objects, their motion, and the interactions between different objects. However, these properties can be estimated by utilizing cues from relative object motion and the dynamics introduced by collisions. This thesis introduces a new video question-answering task for reasoning about the implicit physical properties of objects in a scene, from videos. For this task, I introduce a dataset -- CRIPP-VQA (Counterfactual Reasoning about Implicit Physical Properties - Video Question Answering), which contains videos of objects in motion, annotated with hypothetical/counterfactual questions about the effect of actions (such as removing, adding, or replacing objects), questions about planning (choosing actions to perform to reach a particular goal), as well as descriptive questions about the visible properties of objects. Further, I benchmark the performance of existing video question-answering models on two test settings of CRIPP-VQA: i.i.d. and an out-of-distribution setting which contains objects with values of mass, coefficient of friction, and initial velocities that are not seen in the training distribution. Experiments reveal a surprising and significant performance gap in terms of answering questions about implicit properties (the focus of this thesis) and explicit properties (the focus of prior work) of objects.
Date Created
2022
Agent

Extracting Semantic Information from Online Conversations to Enhance Cyber Defense

171434-Thumbnail Image.png
Description
Recent advances in techniques allow the extraction of Cyber Threat Information (CTI) from online content, such as social media, blog articles, and posts in discussion forums. Most research work focuses on social media and blog posts since their content is

Recent advances in techniques allow the extraction of Cyber Threat Information (CTI) from online content, such as social media, blog articles, and posts in discussion forums. Most research work focuses on social media and blog posts since their content is often contributed by cybersecurity experts and is usually of cleaner formats. While posts in online forums are noisier and less structured, online forums attract more users than other sources and contain much valuable information that may help predict cyber threats. Therefore, effectively extracting CTI from online forum posts is an important task in today's data-driven cybersecurity defenses. Many Natural Language Processing (NLP) techniques are applied to the cybersecurity domains to extract the useful information, however, there is still space to improve. In this dissertation, a new Named Entity Recognition framework for cybersecurity domains and thread structure construction methods for unstructured forums are proposed to support the extraction of CTI. Then, extend them to filter the posts in the forums to eliminate non cybersecurity related topics with Cyber Attack Relevance Scale (CARS), extract the cybersecurity knowledgeable users to enhance more information for enhancing cybersecurity, and extract trending topic phrases related to cyber attacks in the hackers forums to find the clues for potential future attacks to predict them.
Date Created
2022
Agent

Defeating Attackers by Bridging the Gaps Between Security and Intelligence

168710-Thumbnail Image.png
Description
The omnipresent data, growing number of network devices, and evolving attack techniques have been challenging organizations’ security defenses over the past decade. With humongous volumes of logs generated by those network devices, looking for patterns of malicious activities and identifying

The omnipresent data, growing number of network devices, and evolving attack techniques have been challenging organizations’ security defenses over the past decade. With humongous volumes of logs generated by those network devices, looking for patterns of malicious activities and identifying them in time is growing beyond the capabilities of their defense systems. Deep Learning, a subset of Machine Learning (ML) and Artificial Intelligence (AI), fills in this gapwith its ability to learn from huge amounts of data, and improve its performance as the data it learns from increases. In this dissertation, I bring forward security issues pertaining to two top threats that most organizations fear, Advanced Persistent Threat (APT), and Distributed Denial of Service (DDoS), along with deep learning models built towards addressing those security issues. First, I present a deep learning model, APT Detection, capable of detecting anomalous activities in a system. Evaluation of this model demonstrates how it can contribute to early detection of an APT attack with an Area Under the Curve (AUC) of up to 91% on a Receiver Operating Characteristic (ROC) curve. Second, I present DAPT2020, a first of its kind dataset capturing an APT attack exploiting web and system vulnerabilities in an emulated organization’s production network. Evaluation of the dataset using well known machine learning models demonstrates the need for better deep learning models to detect APT attacks. I then present DAPT2021, a semi-synthetic dataset capturing an APT attackexploiting human vulnerabilities, alongside 2 less skilled attacks. By emulating the normal behavior of the employees in a set target organization, DAPT2021 has been created to enable researchers study the causations and correlations among the captured data, a much-needed information to detect an underlying threat early. Finally, I present a distributed defense framework, SmartDefense, that can detect and mitigate over 90% of DDoS traffic at the source and over 97.5% of the remaining DDoS traffic at the Internet Service Provider’s (ISP’s) edge network. Evaluation of this work shows how by using attributes sent by customer edge network, SmartDefense can further help ISPs prevent up to 51.95% of the DDoS traffic from going to the destination.
Date Created
2022
Agent

Implicitly Supervised Neural Question Answering

168624-Thumbnail Image.png
Description
How to teach a machine to understand natural language? This question is a long-standing challenge in Artificial Intelligence. Several tasks are designed to measure the progress of this challenge. Question Answering is one such task that evaluates a machine's ability

How to teach a machine to understand natural language? This question is a long-standing challenge in Artificial Intelligence. Several tasks are designed to measure the progress of this challenge. Question Answering is one such task that evaluates a machine's ability to understand natural language, where it reads a passage of text or an image and answers comprehension questions. In recent years, the development of transformer-based language models and large-scale human-annotated datasets has led to remarkable progress in the field of question answering. However, several disadvantages of fully supervised question answering systems have been observed. Such as generalizing to unseen out-of-distribution domains, linguistic style differences in questions, and adversarial samples. This thesis proposes implicitly supervised question answering systems trained using knowledge acquisition from external knowledge sources and new learning methods that provide inductive biases to learn question answering. In particular, the following research projects are discussed: (1) Knowledge Acquisition methods: these include semantic and abductive information retrieval for seeking missing knowledge, a method to represent unstructured text corpora as a knowledge graph, and constructing a knowledge base for implicit commonsense reasoning. (2) Learning methods: these include Knowledge Triplet Learning, a method over knowledge graphs; Test-Time Learning, a method to generalize to an unseen out-of-distribution context; WeaQA, a method to learn visual question answering using image captions without strong supervision; WeaSel, weakly supervised method for relative spatial reasoning; and a new paradigm for unsupervised natural language inference. These methods potentially provide a new research direction to overcome the pitfalls of direct supervision.
Date Created
2022
Agent

Examining Data Integration with Schema Changes Based on Cell-level Mapping Using Deep Learning Models

168435-Thumbnail Image.png
Description
Artificial Intelligence, as the hottest research topic nowadays, is mostly driven by data. There is no doubt that data is the king in the age of AI. However, natural high-quality data is precious and rare. In order to obtain enough

Artificial Intelligence, as the hottest research topic nowadays, is mostly driven by data. There is no doubt that data is the king in the age of AI. However, natural high-quality data is precious and rare. In order to obtain enough and eligible data to support AI tasks, data processing is always required. To be even worse, the data preprocessing tasks are often dull and heavy, which require huge human labors to deal with. Statistics show 70% - 80% of the data scientists' time is spent on data integration process. Among various reasons, schema changes that commonly exist in the data warehouse are one significant obstacle that impedes the automation of the end-to-end data integration process. Traditional data integration applications rely on data processing operators such as join, union, aggregation and so on. Those operations are fragile and can be easily interrupted by schema changes. Whenever schema changes happen, the data integration applications will require human labors to solve the interruptions and downtime. The industries as well as the data scientists need a new mechanism to handle the schema changes in data integration tasks. This work proposes a new direction of data integration applications based on deep learning models. The data integration problem is defined in the scenario of integrating tabular-format data with natural schema changes, using the cell-based data abstraction. In addition, data augmentation and adversarial learning are investigated to boost the model robustness to schema changes. The experiments are tested on two real-world data integration scenarios, and the results demonstrate the effectiveness of the proposed approach.
Date Created
2021
Agent

An Application of Attention for the Prediction of TCR-Epitope Binding Affinity

168430-Thumbnail Image.png
Description
T-cells are an integral component of the immune system, enabling the body to distinguish between pathogens and the self. The primary mechanism which enables this is their T-cell receptors (TCR) which bind to antigen epitopes foreign to the body. This

T-cells are an integral component of the immune system, enabling the body to distinguish between pathogens and the self. The primary mechanism which enables this is their T-cell receptors (TCR) which bind to antigen epitopes foreign to the body. This detection mechanism allows the T-cell to determine when an immune response is necessary. The computational prediction of TCR-epitope binding is important to researchers for both medical applications and for furthering their understanding of the biological mechanisms that impact immunity. Models which have been developed for this purpose fail to account for the interrelationships between amino acids and demonstrate poor out-of-sample performance. Small changes to the amino acids in these protein sequences can drastically change their structure and function. In recent years, attention-based deep learning models have shown success in their ability to learn rich contextual representations of data. To capture the contextual biological relationships between the amino acids, a multi-head self-attention model was created to predict the binding affinity between given TCR and epitope sequences. By learning the structural nuances of the sequences, this model is able to improve upon existing model performance and grant insights into the underlying mechanisms which impact binding.
Date Created
2021
Agent

Multimodal Robot Learning for Grasping and Manipulation

168406-Thumbnail Image.png
Description
Enabling robots to physically engage with their environment in a safe and efficient manner is an essential step towards human-robot interaction. To date, robots usually operate as pre-programmed workers that blindly execute tasks in highly structured environments crafted by skilled

Enabling robots to physically engage with their environment in a safe and efficient manner is an essential step towards human-robot interaction. To date, robots usually operate as pre-programmed workers that blindly execute tasks in highly structured environments crafted by skilled engineers. Changing the robots’ behavior to cover new duties or handle variability is an expensive, complex, and time-consuming process. However, with the advent of more complex sensors and algorithms, overcoming these limitations becomes within reach. This work proposes innovations in artificial intelligence, language understanding, and multimodal integration to enable next-generation grasping and manipulation capabilities in autonomous robots. The underlying thesis is that multimodal observations and instructions can drastically expand the responsiveness and dexterity of robot manipulators. Natural language, in particular, can be used to enable intuitive, bidirectional communication between a human user and the machine. To this end, this work presents a system that learns context-aware robot control policies from multimodal human demonstrations. Among the main contributions presented are techniques for (a) collecting demonstrations in an efficient and intuitive fashion, (b) methods for leveraging physical contact with the environment and objects, (c) the incorporation of natural language to understand context, and (d) the generation of robust robot control policies. The presented approach and systems are evaluated in multiple grasping and manipulation settings ranging from dexterous manipulation to pick-and-place, as well as contact-rich bimanual insertion tasks. Moreover, the usability of these innovations, especially when utilizing human task demonstrations and communication interfaces, is evaluated in several human-subject studies.
Date Created
2021
Agent