We Need to Talk About Robustness to Adversarial Attacks While Removing Spurious Dataset Biases

161967-Thumbnail Image.png
Description
Machine learning models can pick up biases and spurious correlations from training data and projects and amplify these biases during inference, thus posing significant challenges in real-world settings. One approach to mitigating this is a class of methods that can

Machine learning models can pick up biases and spurious correlations from training data and projects and amplify these biases during inference, thus posing significant challenges in real-world settings. One approach to mitigating this is a class of methods that can identify filter out bias-inducing samples from the training datasets to force models to avoid being exposed to biases. However, the filtering leads to a considerable wastage of resources as most of the dataset created is discarded as biased. This work deals with avoiding the wastage of resources by identifying and quantifying the biases. I further elaborate on the implications of dataset filtering on robustness (to adversarial attacks) and generalization (to out-of-distribution samples). The findings suggest that while dataset filtering does help to improve OOD(Out-Of-Distribution) generalization, it has a significant negative impact on robustness to adversarial attacks. It also shows that transforming bias-inducing samples into adversarial samples (instead of eliminating them from the dataset) can significantly boost robustness without sacrificing generalization.
Date Created
2021
Agent

Learning Causality with Networked Observational Data

161577-Thumbnail Image.png
Description
This dissertation considers the question of how convenient access to copious networked observational data impacts our ability to learn causal knowledge. It investigates in what ways learning causality from such data is different from -- or the same as --

This dissertation considers the question of how convenient access to copious networked observational data impacts our ability to learn causal knowledge. It investigates in what ways learning causality from such data is different from -- or the same as -- the traditional causal inference which often deals with small scale i.i.d. data collected from randomized controlled trials? For example, how can we exploit network information for a series of tasks in the area of learning causality? To answer this question, the dissertation is written toward developing a suite of novel causal learning algorithms that offer actionable insights for a series of causal inference tasks with networked observational data. The work aims to benefit real-world decision-making across a variety of highly influential applications. In the first part of this dissertation, it investigates the task of inferring individual-level causal effects from networked observational data. First, it presents a representation balancing-based framework for handling the influence of hidden confounders to achieve accurate estimates of causal effects. Second, it extends the framework with an adversarial learning approach to properly combine two types of existing heuristics: representation balancing and treatment prediction. The second part of the dissertation describes a framework for counterfactual evaluation of treatment assignment policies with networked observational data. A novel framework that captures patterns of hidden confounders is developed to provide more informative input for downstream counterfactual evaluation methods. The third part presents a framework for debiasing two-dimensional grid-based e-commerce search with observational search log data where there is an implicit network connecting neighboring products in a search result page. A novel inverse propensity scoring framework that models user behavior patterns for two-dimensional display in e-commerce websites is developed, which aims to optimize online performance of ranking algorithms with offline log data.
Date Created
2021
Agent

Using Machine Learning Models to Detect Fake News, Bots, and Rumors on Social Media

147700-Thumbnail Image.png
Description

In this paper, I introduce the fake news problem and detail how it has been exacerbated<br/>through social media. I explore current practices for fake news detection using natural language<br/>processing and current benchmarks in ranking the efficacy of various language models.

In this paper, I introduce the fake news problem and detail how it has been exacerbated<br/>through social media. I explore current practices for fake news detection using natural language<br/>processing and current benchmarks in ranking the efficacy of various language models. Using a<br/>Twitter-specific benchmark, I attempt to reproduce the scores of six language models<br/>demonstrating their effectiveness in seven tweet classification tasks. I explain the successes and<br/>challenges in reproducing these results and provide analysis for the future implications of fake<br/>news research.

Date Created
2021-05
Agent

The Fusion of Multimodal Brain Imaging Data from Geometry Perspectives

158676-Thumbnail Image.png
Description
The rapid development in acquiring multimodal neuroimaging data provides opportunities to systematically characterize human brain structures and functions. For example, in the brain magnetic resonance imaging (MRI), a typical non-invasive imaging technique, different acquisition sequences (modalities) lead to the different

The rapid development in acquiring multimodal neuroimaging data provides opportunities to systematically characterize human brain structures and functions. For example, in the brain magnetic resonance imaging (MRI), a typical non-invasive imaging technique, different acquisition sequences (modalities) lead to the different descriptions of brain functional activities, or anatomical biomarkers. Nowadays, in addition to the traditional voxel-level analysis of images, there is a trend to process and investigate the cross-modality relationship in a high dimensional level of images, e.g. surfaces and networks.

In this study, I aim to achieve multimodal brain image fusion by referring to some intrinsic properties of data, e.g. geometry of embedding structures where the commonly used image features reside. Since the image features investigated in this study share an identical embedding space, i.e. either defined on a brain surface or brain atlas, where a graph structure is easy to define, it is straightforward to consider the mathematically meaningful properties of the shared structures from the geometry perspective.

I first introduce the background of multimodal fusion of brain image data and insights of geometric properties playing a potential role to link different modalities. Then, several proposed computational frameworks either using the solid and efficient geometric algorithms or current geometric deep learning models are be fully discussed. I show how these designed frameworks deal with distinct geometric properties respectively, and their applications in the real healthcare scenarios, e.g. to enhanced detections of fetal brain diseases or abnormal brain development.
Date Created
2020
Agent

Understanding Disinformation: Learning with Weak Social Supervision

158566-Thumbnail Image.png
Description
Social media has become an important means of user-centered information sharing and communications in a gamut of domains, including news consumption, entertainment, marketing, public relations, and many more. The low cost, easy access, and rapid dissemination of information on social

Social media has become an important means of user-centered information sharing and communications in a gamut of domains, including news consumption, entertainment, marketing, public relations, and many more. The low cost, easy access, and rapid dissemination of information on social media draws a large audience but also exacerbate the wide propagation of disinformation including fake news, i.e., news with intentionally false information. Disinformation on social media is growing fast in volume and can have detrimental societal effects. Despite the importance of this problem, our understanding of disinformation in social media is still limited. Recent advancements of computational approaches on detecting disinformation and fake news have shown some early promising results. Novel challenges are still abundant due to its complexity, diversity, dynamics, multi-modality, and costs of fact-checking or annotation.

Social media data opens the door to interdisciplinary research and allows one to collectively study large-scale human behaviors otherwise impossible. For example, user engagements over information such as news articles, including posting about, commenting on, or recommending the news on social media, contain abundant rich information. Since social media data is big, incomplete, noisy, unstructured, with abundant social relations, solely relying on user engagements can be sensitive to noisy user feedback. To alleviate the problem of limited labeled data, it is important to combine contents and this new (but weak) type of information as supervision signals, i.e., weak social supervision, to advance fake news detection.

The goal of this dissertation is to understand disinformation by proposing and exploiting weak social supervision for learning with little labeled data and effectively detect disinformation via innovative research and novel computational methods. In particular, I investigate learning with weak social supervision for understanding disinformation with the following computational tasks: bringing the heterogeneous social context as auxiliary information for effective fake news detection; discovering explanations of fake news from social media for explainable fake news detection; modeling multi-source of weak social supervision for early fake news detection; and transferring knowledge across domains with adversarial machine learning for cross-domain fake news detection. The findings of the dissertation significantly expand the boundaries of disinformation research and establish a novel paradigm of learning with weak social supervision that has important implications in broad applications in social media.
Date Created
2020
Agent

A Study on Generative Adversarial Networks Exacerbating Social Data Bias

158485-Thumbnail Image.png
Description
Generative Adversarial Networks are designed, in theory, to replicate the distribution of the data they are trained on. With real-world limitations, such as finite network capacity and training set size, they inevitably suffer a yet unavoidable technical failure: mode collapse.

Generative Adversarial Networks are designed, in theory, to replicate the distribution of the data they are trained on. With real-world limitations, such as finite network capacity and training set size, they inevitably suffer a yet unavoidable technical failure: mode collapse. GAN-generated data is not nearly as diverse as the real-world data the network is trained on; this work shows that this effect is especially drastic when the training data is highly non-uniform. Specifically, GANs learn to exacerbate the social biases which exist in the training set along sensitive axes such as gender and race. In an age where many datasets are curated from web and social media data (which are almost never balanced), this has dangerous implications for downstream tasks using GAN-generated synthetic data, such as data augmentation for classification. This thesis presents an empirical demonstration of this phenomenon and illustrates its real-world ramifications. It starts by showing that when asked to sample images from an illustrative dataset of engineering faculty headshots from 47 U.S. universities, unfortunately skewed toward white males, a DCGAN’s generator “imagines” faces with light skin colors and masculine features. In addition, this work verifies that the generated distribution diverges more from the real-world distribution when the training data is non-uniform than when it is uniform. This work also shows that a conditional variant of GAN is not immune to exacerbating sensitive social biases. Finally, this work contributes a preliminary case study on Snapchat’s explosively popular GAN-enabled “My Twin” selfie lens, which consistently lightens the skin tone for women of color in an attempt to make faces more feminine. The results and discussion of the study are meant to caution machine learning practitioners who may unsuspectingly increase the biases in their applications.
Date Created
2020
Agent

A Hacker-Centric Perspective to Empower Cyber Defense

158434-Thumbnail Image.png
Description
Malicious hackers utilize the World Wide Web to share knowledge. Previous work has demonstrated that information mined from online hacking communities can be used as precursors to cyber-attacks. In a threatening scenario, where security alert systems are facing high false

Malicious hackers utilize the World Wide Web to share knowledge. Previous work has demonstrated that information mined from online hacking communities can be used as precursors to cyber-attacks. In a threatening scenario, where security alert systems are facing high false positive rates, understanding the people behind cyber incidents can help reduce the risk of attacks. However, the rapidly evolving nature of those communities leads to limitations still largely unexplored, such as: who are the skilled and influential individuals forming those groups, how they self-organize along the lines of technical expertise, how ideas propagate within them, and which internal patterns can signal imminent cyber offensives? In this dissertation, I have studied four key parts of this complex problem set. Initially, I leverage content, social network, and seniority analysis to mine key-hackers on darkweb forums, identifying skilled and influential individuals who are likely to succeed in their cybercriminal goals. Next, as hackers often use Web platforms to advertise and recruit collaborators, I analyze how social influence contributes to user engagement online. On social media, two time constraints are proposed to extend standard influence measures, which increases their correlation with adoption probability and consequently improves hashtag adoption prediction. On darkweb forums, the prediction of where and when hackers will post a message in the near future is accomplished by analyzing their recurrent interactions with other hackers. After that, I demonstrate how vendors of malware and malicious exploits organically form hidden organizations on darkweb marketplaces, obtaining significant consistency across the vendors’ communities extracted using the similarity of their products in different networks. Finally, I predict imminent cyber-attacks correlating malicious hacking activity on darkweb forums with real-world cyber incidents, evidencing how social indicators are crucial for the performance of the proposed model. This research is a hybrid of social network analysis (SNA), machine learning (ML), evolutionary computation (EC), and temporal logic (TL), presenting expressive contributions to empower cyber defense.
Date Created
2020
Agent

Using Event logs and Rapid Ethnographic Data to Mine Clinical Pathways

158252-Thumbnail Image.png
Description
Background: Process mining (PM) using event log files is gaining popularity in healthcare to investigate clinical pathways. But it has many unique challenges. Clinical Pathways (CPs) are often complex and unstructured which results in spaghetti-like models. Moreover, the log files

Background: Process mining (PM) using event log files is gaining popularity in healthcare to investigate clinical pathways. But it has many unique challenges. Clinical Pathways (CPs) are often complex and unstructured which results in spaghetti-like models. Moreover, the log files collected from the electronic health record (EHR) often contain noisy and incomplete data. Objective: Based on the traditional process mining technique of using event logs generated by an EHR, observational video data from rapid ethnography (RE) were combined to model, interpret, simplify and validate the perioperative (PeriOp) CPs. Method: The data collection and analysis pipeline consisted of the following steps: (1) Obtain RE data, (2) Obtain EHR event logs, (3) Generate CP from RE data, (4) Identify EHR interfaces and functionalities, (5) Analyze EHR functionalities to identify missing events, (6) Clean and preprocess event logs to remove noise, (7) Use PM to compute CP time metrics, (8) Further remove noise by removing outliers, (9) Mine CP from event logs and (10) Compare CPs resulting from RE and PM. Results: Four provider interviews and 1,917,059 event logs and 877 minutes of video ethnography recording EHRs interaction were collected. When mapping event logs to EHR functionalities, the intraoperative (IntraOp) event logs were more complete (45%) when compared with preoperative (35%) and postoperative (21.5%) event logs. After removing the noise (496 outliers) and calculating the duration of the PeriOp CP, the median was 189 minutes and the standard deviation was 291 minutes. Finally, RE data were analyzed to help identify most clinically relevant event logs and simplify spaghetti-like CPs resulting from PM. Conclusion: The study demonstrated the use of RE to help overcome challenges of automatic discovery of CPs. It also demonstrated that RE data could be used to identify relevant clinical tasks and incomplete data, remove noise (outliers), simplify CPs and validate mined CPs.
Date Created
2020
Agent

Measuring the Impact of Social Network Interactions

158099-Thumbnail Image.png
Description
Social links form the backbone of human interactions, both in an offline and online world. Such interactions harbor network diffusion or in simpler words, information spreading in a population of connected individuals. With recent increase in user engagement in social

Social links form the backbone of human interactions, both in an offline and online world. Such interactions harbor network diffusion or in simpler words, information spreading in a population of connected individuals. With recent increase in user engagement in social media platforms thus giving rise to networks of large scale, it has become imperative to understand the diffusion mechanisms by considering evolving instances of these network structures. Additionally, I claim that human connections fluctuate over time and attempt to study empirically grounded models of diffusion that embody these variations through evolving network structures. Patterns of interactions that are now stimulated by these fluctuating connections can be harnessed

towards predicting real world events. This dissertation attempts at analyzing

and then modeling such patterns of social network interactions. I propose how such

models could be used in advantage over traditional models of diffusion in various

predictions and simulations of real world events.

The specific three questions rooted in understanding social network interactions that have been addressed in this dissertation are: (1) can interactions captured through evolving diffusion networks indicate and predict the phase changes in a diffusion process? (2) can patterns and models of interactions in hacker forums be used in cyber-attack predictions in the real world? and (3) do varying patterns of social influence impact behavior adoption with different success ratios and could they be used to simulate rumor diffusion?

For the first question, I empirically analyze information cascades of Twitter and Flixster data and conclude that in evolving network structures characterizing diffusion, local network neighborhood surrounding a user is particularly a better indicator of the approaching phases. For the second question, I attempt to build an integrated approach utilizing unconventional signals from the "darkweb" forum discussions for predicting attacks on a target organization. The study finds that filtering out credible users and measuring network features surrounding them can be good indicators of an impending attack. For the third question, I develop an experimental framework in a controlled environment to understand how individuals respond to peer behavior in situations of sequential decision making and develop data-driven agent based models towards simulating rumor diffusion.
Date Created
2020
Agent

Towards Robust Machine Learning Models for Data Scarcity

158066-Thumbnail Image.png
Description
Recently, a well-designed and well-trained neural network can yield state-of-the-art results across many domains, including data mining, computer vision, and medical image analysis. But progress has been limited for tasks where labels are difficult or impossible to obtain. This reliance

Recently, a well-designed and well-trained neural network can yield state-of-the-art results across many domains, including data mining, computer vision, and medical image analysis. But progress has been limited for tasks where labels are difficult or impossible to obtain. This reliance on exhaustive labeling is a critical limitation in the rapid deployment of neural networks. Besides, the current research scales poorly to a large number of unseen concepts and is passively spoon-fed with data and supervision.

To overcome the above data scarcity and generalization issues, in my dissertation, I first propose two unsupervised conventional machine learning algorithms, hyperbolic stochastic coding, and multi-resemble multi-target low-rank coding, to solve the incomplete data and missing label problem. I further introduce a deep multi-domain adaptation network to leverage the power of deep learning by transferring the rich knowledge from a large-amount labeled source dataset. I also invent a novel time-sequence dynamically hierarchical network that adaptively simplifies the network to cope with the scarce data.

To learn a large number of unseen concepts, lifelong machine learning enjoys many advantages, including abstracting knowledge from prior learning and using the experience to help future learning, regardless of how much data is currently available. Incorporating this capability and making it versatile, I propose deep multi-task weight consolidation to accumulate knowledge continuously and significantly reduce data requirements in a variety of domains. Inspired by the recent breakthroughs in automatically learning suitable neural network architectures (AutoML), I develop a nonexpansive AutoML framework to train an online model without the abundance of labeled data. This work automatically expands the network to increase model capability when necessary, then compresses the model to maintain the model efficiency.

In my current ongoing work, I propose an alternative method of supervised learning that does not require direct labels. This could utilize various supervision from an image/object as a target value for supervising the target tasks without labels, and it turns out to be surprisingly effective. The proposed method only requires few-shot labeled data to train, and can self-supervised learn the information it needs and generalize to datasets not seen during training.
Date Created
2020
Agent