Protecting User Privacy with Social Media Data and Mining

158023-Thumbnail Image.png
Description
The pervasive use of the Web has connected billions of people all around the globe and enabled them to obtain information at their fingertips. This results in tremendous amounts of user-generated data which makes users traceable and vulnerable to privacy

The pervasive use of the Web has connected billions of people all around the globe and enabled them to obtain information at their fingertips. This results in tremendous amounts of user-generated data which makes users traceable and vulnerable to privacy leakage attacks. In general, there are two types of privacy leakage attacks for user-generated data, i.e., identity disclosure and private-attribute disclosure attacks. These attacks put users at potential risks ranging from persecution by governments to targeted frauds. Therefore, it is necessary for users to be able to safeguard their privacy without leaving their unnecessary traces of online activities. However, privacy protection comes at the cost of utility loss defined as the loss in quality of personalized services users receive. The reason is that this information of traces is crucial for online vendors to provide personalized services and a lack of it would result in deteriorating utility. This leads to a dilemma of privacy and utility.

Protecting users' privacy while preserving utility for user-generated data is a challenging task. The reason is that users generate different types of data such as Web browsing histories, user-item interactions, and textual information. This data is heterogeneous, unstructured, noisy, and inherently different from relational and tabular data and thus requires quantifying users' privacy and utility in each context separately. In this dissertation, I investigate four aspects of protecting user privacy for user-generated data. First, a novel adversarial technique is introduced to assay privacy risks in heterogeneous user-generated data. Second, a novel framework is proposed to boost users' privacy while retaining high utility for Web browsing histories. Third, a privacy-aware recommendation system is developed to protect privacy w.r.t. the rich user-item interaction data by recommending relevant and privacy-preserving items. Fourth, a privacy-preserving framework for text representation learning is presented to safeguard user-generated textual data as it can reveal private information.
Date Created
2020
Agent

Explainable AI in Workflow Development and Verification Using Pi-Calculus

157864-Thumbnail Image.png
Description
Computer science education is an increasingly vital area of study with various challenges that increase the difficulty level for new students resulting in higher attrition rates. As part of an effort to resolve this issue, a new visual programming language

Computer science education is an increasingly vital area of study with various challenges that increase the difficulty level for new students resulting in higher attrition rates. As part of an effort to resolve this issue, a new visual programming language environment was developed for this research, the Visual IoT and Robotics Programming Language Environment (VIPLE). VIPLE is based on computational thinking and flowchart, which reduces the needs of memorization of detailed syntax in text-based programming languages. VIPLE has been used at Arizona State University (ASU) in multiple years and sections of FSE100 as well as in universities worldwide. Another major issue with teaching large programming classes is the potential lack of qualified teaching assistants to grade and offer insight to a student’s programs at a level beyond output analysis.

In this dissertation, I propose a novel framework for performing semantic autograding, which analyzes student programs at a semantic level to help students learn with additional and systematic help. A general autograder is not practical for general programming languages, due to the flexibility of semantics. A practical autograder is possible in VIPLE, because of its simplified syntax and restricted options of semantics. The design of this autograder is based on the concept of theorem provers. To achieve this goal, I employ a modified version of Pi-Calculus to represent VIPLE programs and Hoare Logic to formalize program requirements. By building on the inference rules of Pi-Calculus and Hoare Logic, I am able to construct a theorem prover that can perform automated semantic analysis. Furthermore, building on this theorem prover enables me to develop a self-learning algorithm that can learn the conditions for a program’s correctness according to a given solution program.
Date Created
2020
Agent

A Study of User Behaviors and Activities on Online Mental Health Communities

157843-Thumbnail Image.png
Description
Social media is a medium that contains rich information which has been shared by many users every second every day. This information can be utilized for various outcomes such as understanding user behaviors, learning the effect of social media on

Social media is a medium that contains rich information which has been shared by many users every second every day. This information can be utilized for various outcomes such as understanding user behaviors, learning the effect of social media on a community, and developing a decision-making system based on the information available. With the growing popularity of social networking sites, people can freely express their opinions and feelings which results in a tremendous amount of user-generated data. The rich amount of social media data has opened the path for researchers to study and understand the users’ behaviors and mental health conditions. Several studies have shown that social media provides a means to capture an individual state of mind. Given the social media data and related work in this field, this work studies the scope of users’ discussion among online mental health communities. In the first part of this dissertation, this work focuses on the role of social media on mental health among sexual abuse community. It employs natural language processing techniques to extract topics of responses, examine how diverse these topics are to answer research questions such as whether responses are limited to emotional support; if not, what other topics are; what the diversity of topics manifests; how online response differs from traditional response found in a physical world. To answer these questions, this work extracts Reddit posts on rape to understand the nature of user responses for this stigmatized topic. In the second part of this dissertation, this work expands to a broader range of online communities. In particular, it investigates the potential roles of social media on mental health among five major communities, i.e., trauma and abuse community, psychosis and anxiety community, compulsive disorders community, coping and therapy community, and mood disorders community. This work studies how people interact with each other in each of these communities and what these online forums provide a resource to users who seek help. To understand users’ behaviors, this work extracts Reddit posts on 52 related subcommunities and analyzes the linguistic behavior of each community. Experiments in this dissertation show that Reddit is a good medium for users with mental health issues to find related helpful resources. Another interesting observation is an interesting topic cluster from users’ posts which shows that discussion and communication among users help individuals to find proper resources for their problem. Moreover, results show that the anonymity of users in Reddit allows them to have discussions about different topics beyond social support such as financial and religious support.
Date Created
2019
Agent

Assessing Influential Users in Live Streaming Social Networks

157833-Thumbnail Image.png
Description
Live streaming has risen to significant popularity in the recent past and largely this live streaming is a feature of existing social networks like Facebook, Instagram, and Snapchat. However, there does exist at least one social network entirely devoted to

Live streaming has risen to significant popularity in the recent past and largely this live streaming is a feature of existing social networks like Facebook, Instagram, and Snapchat. However, there does exist at least one social network entirely devoted to live streaming, and specifically the live streaming of video games, Twitch. This social network is unique for a number of reasons, not least because of its hyper-focus on live content and this uniqueness has challenges for social media researchers.

Despite this uniqueness, almost no scientific work has been performed on this public social network. Thus, it is unclear what user interaction features present on other social networks exist on Twitch. Investigating the interactions between users and identifying which, if any, of the common user behaviors on social network exist on Twitch is an important step in understanding how Twitch fits in to the social media ecosystem. For example, there are users that have large followings on Twitch and amass a large number of viewers, but do those users exert influence over the behavior of other user the way that popular users on Twitter do?

This task, however, will not be trivial. The same hyper-focus on live content that makes Twitch unique in the social network space invalidates many of the traditional approaches to social network analysis. Thus, new algorithms and techniques must be developed in order to tap this data source. In this thesis, a novel algorithm for finding games whose releases have made a significant impact on the network is described as well as a novel algorithm for detecting and identifying influential players of games. In addition, the Twitch network is described in detail along with the data that was collected in order to power the two previously described algorithms.
Date Created
2019
Agent

Understanding Bots on Social Media - An Application in Disaster Response

157831-Thumbnail Image.png
Description
Social media has become a primary platform for real-time information sharing among users. News on social media spreads faster than traditional outlets and millions of users turn to this platform to receive the latest updates on major events especially disasters.

Social media has become a primary platform for real-time information sharing among users. News on social media spreads faster than traditional outlets and millions of users turn to this platform to receive the latest updates on major events especially disasters. Social media bridges the gap between the people who are affected by disasters, volunteers who offer contributions, and first responders. On the other hand, social media is a fertile ground for malicious users who purposefully disturb the relief processes facilitated on social media. These malicious users take advantage of social bots to overrun social media posts with fake images, rumors, and false information. This process causes distress and prevents actionable information from reaching the affected people. Social bots are automated accounts that are controlled by a malicious user and these bots have become prevalent on social media in recent years.

In spite of existing efforts towards understanding and removing bots on social media, there are at least two drawbacks associated with the current bot detection algorithms: general-purpose bot detection methods are designed to be conservative and not label a user as a bot unless the algorithm is highly confident and they overlook the effect of users who are manipulated by bots and (unintentionally) spread their content. This study is trifold. First, I design a Machine Learning model that uses content and context of social media posts to detect actionable ones among them; it specifically focuses on tweets in which people ask for help after major disasters. Second, I focus on bots who can be a facilitator of malicious content spreading during disasters. I propose two methods for detecting bots on social media with a focus on the recall of the detection. Third, I study the characteristics of users who spread the content of malicious actors. These features have the potential to improve methods that detect malicious content such as fake news.
Date Created
2019
Agent

Types of Bots: Categorization of Accounts Using Unsupervised Machine Learning

Description
Social media bot detection has been a signature challenge in recent years in online social networks. Many scholars agree that the bot detection problem has become an "arms race" between malicious actors, who seek to create bots to influence opinion

Social media bot detection has been a signature challenge in recent years in online social networks. Many scholars agree that the bot detection problem has become an "arms race" between malicious actors, who seek to create bots to influence opinion on these networks, and the social media platforms to remove these accounts. Despite this acknowledged issue, bot presence continues to remain on social media networks. So, it has now become necessary to monitor different bots over time to identify changes in their activities or domain. Since monitoring individual accounts is not feasible, because the bots may get suspended or deleted, bots should be observed in smaller groups, based on their characteristics, as types. Yet, most of the existing research on social media bot detection is focused on labeling bot accounts by only distinguishing them from human accounts and may ignore differences between individual bot accounts. The consideration of these bots' types may be the best solution for researchers and social media companies alike as it is in both of their best interests to study these types separately. However, up until this point, bot categorization has only been theorized or done manually. Thus, the goal of this research is to automate this process of grouping bots by their respective types. To accomplish this goal, the author experimentally demonstrates that it is possible to use unsupervised machine learning to categorize bots into types based on the proposed typology by creating an aggregated dataset, subsequent to determining that the accounts within are bots, and utilizing an existing typology for bots. Having the ability to differentiate between types of bots automatically will allow social media experts to analyze bot activity, from a new perspective, on a more granular level. This way, researchers can identify patterns related to a given bot type's behaviors over time and determine if certain detection methods are more viable for that type.
Date Created
2019
Agent

Unsupervised Attributed Graph Learning: Models and Applications

157818-Thumbnail Image.png
Description
Graph is a ubiquitous data structure, which appears in a broad range of real-world scenarios. Accordingly, there has been a surge of research to represent and learn from graphs in order to accomplish various machine learning and graph analysis tasks.

Graph is a ubiquitous data structure, which appears in a broad range of real-world scenarios. Accordingly, there has been a surge of research to represent and learn from graphs in order to accomplish various machine learning and graph analysis tasks. However, most of these efforts only utilize the graph structure while nodes in real-world graphs usually come with a rich set of attributes. Typical examples of such nodes and their attributes are users and their profiles in social networks, scientific articles and their content in citation networks, protein molecules and their gene sets in biological networks as well as web pages and their content on the Web. Utilizing node features in such graphs---attributed graphs---can alleviate the graph sparsity problem and help explain various phenomena (e.g., the motives behind the formation of communities in social networks). Therefore, further study of attributed graphs is required to take full advantage of node attributes.

In the wild, attributed graphs are usually unlabeled. Moreover, annotating data is an expensive and time-consuming process, which suffers from many limitations such as annotators’ subjectivity, reproducibility, and consistency. The challenges of data annotation and the growing increase of unlabeled attributed graphs in various real-world applications significantly demand unsupervised learning for attributed graphs.

In this dissertation, I propose a set of novel models to learn from attributed graphs in an unsupervised manner. To better understand and represent nodes and communities in attributed graphs, I present different models in node and community levels. In node level, I utilize node features as well as the graph structure in attributed graphs to learn distributed representations of nodes, which can be useful in a variety of downstream machine learning applications. In community level, with a focus on social media, I take advantage of both node attributes and the graph structure to discover not only communities but also their sentiment-driven profiles and inter-community relations (i.e., alliance, antagonism, or no relation). The discovered community profiles and relations help to better understand the structure and dynamics of social media.
Date Created
2019
Agent

Three Facets of Online Political Networks: Communities, Antagonisms, and Polarization

157810-Thumbnail Image.png
Description
Millions of users leave digital traces of their political engagements on social media platforms every day. Users form networks of interactions, produce textual content, like and share each others' content. This creates an invaluable opportunity to better understand the political

Millions of users leave digital traces of their political engagements on social media platforms every day. Users form networks of interactions, produce textual content, like and share each others' content. This creates an invaluable opportunity to better understand the political engagements of internet users. In this proposal, I present three algorithmic solutions to three facets of online political networks; namely, detection of communities, antagonisms and the impact of certain types of accounts on political polarization. First, I develop a multi-view community detection algorithm to find politically pure communities. I find that word usage among other content types (i.e. hashtags, URLs) complement user interactions the best in accurately detecting communities.

Second, I focus on detecting negative linkages between politically motivated social media users. Major social media platforms do not facilitate their users with built-in negative interaction options. However, many political network analysis tasks rely on not only positive but also negative linkages. Here, I present the SocLSFact framework to detect negative linkages among social media users. It utilizes three pieces of information; sentiment cues of textual interactions, positive interactions, and socially balanced triads. I evaluate the contribution of each three aspects in negative link detection performance on multiple tasks.

Third, I propose an experimental setup that quantifies the polarization impact of automated accounts on Twitter retweet networks. I focus on a dataset of tragic Parkland shooting event and its aftermath. I show that when automated accounts are removed from the retweet network the network polarization decrease significantly, while a same number of accounts to the automated accounts are removed randomly the difference is not significant. I also find that prominent predictors of engagement of automatically generated content is not very different than what previous studies point out in general engaging content on social media. Last but not least, I identify accounts which self-disclose their automated nature in their profile by using expressions such as bot, chat-bot, or robot. I find that human engagement to self-disclosing accounts compared to non-disclosing automated accounts is much smaller. This observational finding can motivate further efforts into automated account detection research to prevent their unintended impact.
Date Created
2019
Agent

Learning with attributed networks: algorithms and applications

157589-Thumbnail Image.png
Description
Attributes - that delineating the properties of data, and connections - that describing the dependencies of data, are two essential components to characterize most real-world phenomena. The synergy between these two principal elements renders a unique data representation - the

Attributes - that delineating the properties of data, and connections - that describing the dependencies of data, are two essential components to characterize most real-world phenomena. The synergy between these two principal elements renders a unique data representation - the attributed networks. In many cases, people are inundated with vast amounts of data that can be structured into attributed networks, and their use has been attractive to researchers and practitioners in different disciplines. For example, in social media, users interact with each other and also post personalized content; in scientific collaboration, researchers cooperate and are distinct from peers by their unique research interests; in complex diseases studies, rich gene expression complements to the gene-regulatory networks. Clearly, attributed networks are ubiquitous and form a critical component of modern information infrastructure. To gain deep insights from such networks, it requires a fundamental understanding of their unique characteristics and be aware of the related computational challenges.

My dissertation research aims to develop a suite of novel learning algorithms to understand, characterize, and gain actionable insights from attributed networks, to benefit high-impact real-world applications. In the first part of this dissertation, I mainly focus on developing learning algorithms for attributed networks in a static environment at two different levels: (i) attribute level - by designing feature selection algorithms to find high-quality features that are tightly correlated with the network topology; and (ii) node level - by presenting network embedding algorithms to learn discriminative node embeddings by preserving node proximity w.r.t. network topology structure and node attribute similarity. As changes are essential components of attributed networks and the results of learning algorithms will become stale over time, in the second part of this dissertation, I propose a family of online algorithms for attributed networks in a dynamic environment to continuously update the learning results on the fly. In fact, developing application-aware learning algorithms is more desired with a clear understanding of the application domains and their unique intents. As such, in the third part of this dissertation, I am also committed to advancing real-world applications on attributed networks by incorporating the objectives of external tasks into the learning process.
Date Created
2019
Agent

Analysis and decision-making with social media

157582-Thumbnail Image.png
Description
The rapid advancements of technology have greatly extended the ubiquitous nature of smartphones acting as a gateway to numerous social media applications. This brings an immense convenience to the users of these applications wishing to stay connected to other individuals

The rapid advancements of technology have greatly extended the ubiquitous nature of smartphones acting as a gateway to numerous social media applications. This brings an immense convenience to the users of these applications wishing to stay connected to other individuals through sharing their statuses, posting their opinions, experiences, suggestions, etc on online social networks (OSNs). Exploring and analyzing this data has a great potential to enable deep and fine-grained insights into the behavior, emotions, and language of individuals in a society. This proposed dissertation focuses on utilizing these online social footprints to research two main threads – 1) Analysis: to study the behavior of individuals online (content analysis) and 2) Synthesis: to build models that influence the behavior of individuals offline (incomplete action models for decision-making).

A large percentage of posts shared online are in an unrestricted natural language format that is meant for human consumption. One of the demanding problems in this context is to leverage and develop approaches to automatically extract important insights from this incessant massive data pool. Efforts in this direction emphasize mining or extracting the wealth of latent information in the data from multiple OSNs independently. The first thread of this dissertation focuses on analytics to investigate the differentiated content-sharing behavior of individuals. The second thread of this dissertation attempts to build decision-making systems using social media data.

The results of the proposed dissertation emphasize the importance of considering multiple data types while interpreting the content shared on OSNs. They highlight the unique ways in which the data and the extracted patterns from text-based platforms or visual-based platforms complement and contrast in terms of their content. The proposed research demonstrated that, in many ways, the results obtained by focusing on either only text or only visual elements of content shared online could lead to biased insights. On the other hand, it also shows the power of a sequential set of patterns that have some sort of precedence relationships and collaboration between humans and automated planners.
Date Created
2019
Agent