Discovering and Mitigating Social Data Bias

Description
Exabytes of data are created online every day. Nowhere is this deluge of data more apparent than on social media. Naturally, finding ways to leverage this unprecedented source of human information is an active area of research. Social media platforms have become laboratories for conducting experiments about people at scales thought unimaginable only a few years ago.

Researchers and practitioners use social media to extract actionable patterns such as where aid should be distributed in a crisis. However, the validity of these patterns relies on having a representative dataset. As this dissertation shows, the data collected from social media is seldom representative of the activity of the site itself, and less so of human activity. This means that the results of many studies are limited by the quality of data they collect.

The finding that social media data is biased inspires the main challenge addressed by this thesis. I introduce three sets of methodologies to correct for bias. First, I design methods to deal with data collection bias: a methodology that detects bias within a social media dataset by comparing the collected stream against other sources, and a crawling strategy that minimizes the amount of bias in the resulting dataset. Second, I introduce a methodology to identify bots and shills within a social media dataset. This directly addresses the concern that the users of a social media site are not representative. Applying these methodologies allows the population under study on a social media site to better match that of the real world. Finally, the dissertation discusses perceptual biases, explains how they affect analysis, and introduces computational approaches to mitigate them.
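As an illustration of the first idea, comparing a collected stream against an outside reference source can be reduced to comparing two categorical distributions. The sketch below is not the dissertation's method; it is a minimal, hypothetical example that scores divergence between a sampled stream and a reference using KL divergence over hashtag (or topic) frequencies:

```python
from collections import Counter
import math

def kl_divergence(p, q, eps=1e-9):
    """KL(P||Q) between two discrete distributions given as dicts."""
    keys = set(p) | set(q)
    return sum(p.get(k, eps) * math.log(p.get(k, eps) / q.get(k, eps)) for k in keys)

def to_distribution(counts):
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

def bias_score(sample_items, reference_items):
    """Compare the topic (or hashtag) distribution of a collected stream
    against a reference source; larger scores suggest stronger sampling bias."""
    p = to_distribution(Counter(sample_items))
    q = to_distribution(Counter(reference_items))
    return kl_divergence(p, q)
```

A score near zero means the sample matches the reference; a large score flags a stream whose collection process over- or under-represents some categories.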

The results of the dissertation allow for the discovery and removal of different levels of bias within a social media dataset. This has important implications for social media mining, namely that the behavioral patterns and insights extracted from social media will be more representative of the populations under study.
Date Created
2017

Sky View Factors From Synthetic Fisheye Photos for Thermal Comfort Routing - A Case Study in Phoenix, Arizona

Description

The Sky View Factor (SVF) is a dimension-reduced representation of urban form and one of the major variables in radiation models that estimate outdoor thermal comfort. Common ways of retrieving SVFs in urban environments include capturing fisheye photographs or creating a digital 3D city or elevation model of the environment. Such techniques have previously been limited due to a lack of imagery or lack of full-scale detailed models of urban areas. We developed a web-based tool that automatically generates synthetic hemispherical fisheye views from Google Earth at arbitrary spatial resolution and calculates the corresponding SVFs through equiangular projection. SVF results were validated using Google Maps Street View and compared to results from other SVF calculation tools. We generated 5-meter resolution SVF maps for two neighborhoods in Phoenix, Arizona to illustrate fine-scale variations of intra-urban horizon limitations due to urban form and vegetation. To demonstrate the utility of our synthetic fisheye approach for heat stress applications, we automated a radiation model to generate outdoor thermal comfort maps for Arizona State University’s Tempe campus for a hot summer day using synthetic fisheye photos and on-site meteorological data. Model output was tested against mobile transect measurements of the six-directional radiant flux density. Based on the thermal comfort maps, we implemented a pedestrian routing algorithm that is optimized for distance and thermal comfort preferences. Our synthetic fisheye approach can help planners assess urban design and tree planting strategies to maximize thermal comfort outcomes and can support heat hazard mitigation in urban areas.
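For readers unfamiliar with SVF computation, the core idea can be sketched in a few lines. The code below is an illustrative approximation, not the authors' tool: it assumes a square binary sky mask rendered in an equiangular fisheye projection (zenith angle proportional to distance from the image centre) and integrates the visible-sky fraction with a cosine-weighted solid-angle term:

```python
import math

def sky_view_factor(sky_mask):
    """Estimate the Sky View Factor from a square binary fisheye sky mask
    (1 = visible sky, 0 = obstructed), assuming an equiangular projection
    in which the zenith angle grows linearly with distance from the centre."""
    size = len(sky_mask)
    c = (size - 1) / 2.0
    radius = size / 2.0
    seen = total = 0.0
    for y in range(size):
        for x in range(size):
            r = math.hypot(x - c, y - c)
            if r >= radius:
                continue  # pixel lies outside the fisheye circle
            theta = (r / radius) * (math.pi / 2)  # zenith angle of this pixel
            # cos(theta)*sin(theta) weights by projected solid angle; the 1/r
            # factor compensates for the equiangular projection's area distortion
            w = math.cos(theta) * math.sin(theta) / max(r, 0.5)
            total += w
            seen += w * sky_mask[y][x]
    return seen / total
```

A fully open mask yields an SVF of 1.0 and a fully obstructed one 0.0; real fisheye photos fall in between depending on buildings and tree canopy.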

Date Created
2017-03-27

Methodologies in Predictive Visual Analytics

Description
Predictive analytics embraces an extensive area of techniques from statistical modeling to machine learning to data mining and is applied in business intelligence, public health, disaster management and response, and many other fields. To date, visualization has been broadly used to support tasks in the predictive analytics pipeline under the assumption that a human-in-the-loop can aid the analysis by integrating domain knowledge that might not be broadly captured by the system. Primary uses of visualization in the predictive analytics pipeline have focused on data cleaning, exploratory analysis, and diagnostics. More recently, numerous visual analytics systems for feature selection, incremental learning, and various prediction tasks have been proposed to support the growing use of complex models, agent-specific optimization, and comprehensive model comparison and result exploration. Such work is being driven by advances in interactive machine learning and the desire of end-users to understand and engage with the modeling process. However, despite the numerous and promising applications of visual analytics to predictive analytics tasks, work to assess the effectiveness of predictive visual analytics is lacking.

This thesis studies the current methodologies in predictive visual analytics. It first defines the scope of predictive analytics and presents a predictive visual analytics (PVA) pipeline. Following the proposed pipeline, a predictive visual analytics framework is developed to be used to explore under what circumstances a human-in-the-loop prediction process is most effective. This framework combines sentiment analysis, feature selection mechanisms, similarity comparisons and model cross-validation through a variety of interactive visualizations to support analysts in model building and prediction. To test the proposed framework, an instantiation for movie box-office prediction is developed and evaluated. Results from small-scale user studies are presented and discussed, and a generalized user study is carried out to assess the role of predictive visual analytics under a movie box-office prediction scenario.
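The model cross-validation step mentioned above can be sketched generically. The following is a hypothetical, minimal k-fold cross-validation loop, not the framework's actual implementation; the `fit` and `predict` callables stand in for whatever prediction model the system wraps:

```python
import statistics

def k_fold_splits(n, k):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    fold = n // k
    idx = list(range(n))
    for i in range(k):
        test = idx[i * fold:(i + 1) * fold] if i < k - 1 else idx[i * fold:]
        test_set = set(test)
        train = [j for j in idx if j not in test_set]
        yield train, test

def cross_validate(xs, ys, fit, predict, k=5):
    """Mean absolute error of a model estimated by k-fold cross-validation."""
    errors = []
    for train, test in k_fold_splits(len(xs), k):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        for i in test:
            errors.append(abs(predict(model, xs[i]) - ys[i]))
    return statistics.fmean(errors)
```

Surfacing this per-fold error interactively is what lets an analyst judge whether a candidate feature set actually improves a box-office prediction model rather than overfitting it.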
Date Created
2017

Visual Analytics Methods for Exploring Geographically Networked Phenomena

Description
The connections between different entities define different kinds of networks, and many such networked phenomena are influenced by their underlying geographical relationships. By integrating network and geospatial analysis, the goal is to extract information about interaction topologies and their relationships to geographical constructs. In recent decades, much work has been done analyzing the dynamics of spatial networks; however, many challenges remain in this field. First, the development of social media and transportation technologies has greatly reshaped the topologies of communication between different geographical regions. Second, the distance metrics used in spatial analysis should also be enriched with the underlying network information to develop accurate models.

Visual analytics provides methods for data exploration, pattern recognition, and knowledge discovery. However, despite the long history of geovisualization and network visual analytics, little work has been done to develop visual analytics tools that focus specifically on geographically networked phenomena. This thesis develops a variety of visualization methods to present data values and geospatial network relationships, enabling users to interactively explore the data. Users can investigate connections in both virtual networks and geospatial networks, and the underlying geographical context can be used to improve knowledge discovery. The focus of this thesis is on social media analysis and geographical hotspot optimization. A framework is proposed for social network analysis to unveil the links between social media interactions and their underlying networked geospatial phenomena. This is combined with a novel hotspot approach to improve hotspot identification and boundary detection with the networks extracted from urban infrastructure. Several real-world problems have been analyzed using the proposed visual analytics frameworks. The case studies and experiments show that visual analytics methods can help analysts explore such data from multiple perspectives and support the knowledge discovery process.
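One way to see why distance metrics should be network-enriched is to compare straight-line distance with shortest-path distance over the underlying infrastructure. The sketch below is a generic illustration using Dijkstra's algorithm over a hypothetical weighted adjacency dict, not a component of the thesis system:

```python
import heapq

def network_distance(graph, src, dst):
    """Shortest-path distance over a weighted directed adjacency dict
    mapping node -> list of (neighbor, edge_weight)."""
    dist = {src: 0.0}
    pq = [(0.0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == dst:
            return d
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return float("inf")  # dst unreachable from src
```

Two locations that are close "as the crow flies" can be far apart under this metric (a river with few bridges, a one-way street grid), which is exactly the information a purely Euclidean hotspot model discards.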
Date Created
2017

Visualizing numerical uncertainty in climate ensembles

Description
The proper quantification and visualization of uncertainty requires a high level of domain knowledge. Despite this, few studies have collected and compared the roles, experiences, and opinions of scientists in different types of uncertainty analysis. I address this gap by conducting two types of studies: 1) a domain characterization study with general questions for experts from various fields, based on a recent literature review in ensemble analysis and visualization; and 2) long-term interviews with domain experts focusing on specific problems and challenges in uncertainty analysis. From the domain characterization, I identified the most common metrics applied for uncertainty quantification and discussed the current visualization applications of these methods. Based on the interviews with domain experts, I characterized the background and intents of the experts when performing uncertainty analysis. This enables me to identify domain needs that are currently underrepresented or unsupported in the literature. Finally, I developed a new framework for visualizing uncertainty in climate ensembles.
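To make the idea of uncertainty metrics concrete, a minimal sketch follows. It computes a few of the spread statistics commonly reported for climate ensembles (mean, standard deviation, interquartile range, range) at a single grid cell; the function name and output format are illustrative, not taken from the framework described above:

```python
import statistics

def ensemble_summary(members):
    """Summarize an ensemble of model runs at a single grid cell.
    `members` is a list of scalar values, one per ensemble member."""
    xs = sorted(members)
    n = len(xs)
    q1 = xs[n // 4]            # crude lower-quartile pick for the sketch
    q3 = xs[(3 * n) // 4]      # crude upper-quartile pick
    return {
        "mean": statistics.fmean(xs),
        "stdev": statistics.stdev(xs),  # spread across ensemble members
        "iqr": q3 - q1,                 # robust spread, less outlier-sensitive
        "range": xs[-1] - xs[0],
    }
```

Mapping one of these statistics per grid cell is the simplest form of ensemble uncertainty visualization; the interviews above probe where such summaries stop being sufficient.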
Date Created
2016

The role of teamwork in predicting movie earnings

Description
Intelligence analysts’ work has become progressively complex due to increasing security threats and data availability. In order to study “big” data exploration within the intelligence domain, the intelligence analyst task was abstracted and replicated in a laboratory (controlled environment). Participants used a computer interface and movie database to determine the opening weekend gross movie earnings of three pre-selected movies. Data consisted of Twitter tweets and predictive models. These data were displayed in various formats such as graphs, charts, and text. Participants used these data to make their predictions. It was expected that teams (a team is a group whose members have different specialties and work interdependently) would outperform individuals and groups; that is, teams would be significantly better at predicting “Opening Weekend Gross” than individuals or groups. Results indicated that teams outperformed individuals and groups in the first prediction, underperformed in the second prediction, and performed better than individuals in the third prediction (but not better than groups). Insights and future directions are discussed.
Date Created
2016

A unified framework based on convolutional neural networks for interpreting carotid intima-media thickness videos

Description
Cardiovascular disease (CVD) is the leading cause of mortality yet largely preventable, but the key to prevention is to identify at-risk individuals before adverse events. For predicting individual CVD risk, carotid intima-media thickness (CIMT), a noninvasive ultrasound method, has proven to be valuable, offering several advantages over the CT coronary artery calcium score. However, each CIMT examination includes several ultrasound videos, and interpreting each of these CIMT videos involves three operations: (1) select three end-diastolic ultrasound frames (EUF) in the video, (2) localize a region of interest (ROI) in each selected frame, and (3) trace the lumen-intima interface and the media-adventitia interface in each ROI to measure CIMT. These operations are tedious, laborious, and time-consuming, a serious limitation that hinders the widespread utilization of CIMT in clinical practice. To overcome this limitation, this paper presents a new system to automate CIMT video interpretation. Our extensive experiments demonstrate that the suggested system significantly outperforms the state-of-the-art methods. The superior performance is attributable to our unified framework based on convolutional neural networks (CNNs) coupled with our informative image representation and effective post-processing of the CNN outputs, which are uniquely designed for each of the above three operations.
Date Created
2016

Visual analytics for spatiotemporal cluster analysis

Description
Traditionally, visualization is one of the most important and commonly used methods of generating insight into large scale data. Particularly for spatiotemporal data, the translation of such data into a visual form allows users to quickly see patterns, explore summaries and relate domain knowledge about underlying geographical phenomena that would not be apparent in tabular form. However, several critical challenges arise when visualizing and exploring these large spatiotemporal datasets. While the underlying geographical component of the data lends itself well to univariate visualization in the form of traditional cartographic representations (e.g., choropleth, isopleth, dasymetric maps), as the data becomes multivariate, cartographic representations become more complex. To simplify the visual representations, analytical methods such as clustering and feature extraction are often applied as part of the classification phase. The automatic classification can then be rendered onto a map; however, one common issue in data classification is that items near a classification boundary are often mislabeled.

This thesis explores methods to augment the automated spatial classification by utilizing interactive machine learning as part of the cluster creation step. First, this thesis explores the design space for spatiotemporal analysis through the development of a comprehensive data wrangling and exploratory data analysis platform. Second, this system is augmented with a novel method for evaluating the visual impact of edge cases for multivariate geographic projections. Finally, system features and functionality are demonstrated through a series of case studies, with key features including similarity analysis, multivariate clustering, and novel visual support for cluster comparison.
Date Created
2016

Vectorization in analyzing 2D/3D data

Description
Vectorization is an important process in the fields of graphics and image processing. In computer-aided design (CAD), drawings are scanned, vectorized and written as CAD files in a process called paper-to-CAD conversion or drawing conversion. In geographic information systems (GIS), satellite or aerial images are vectorized to create maps. In graphic design and photography, raster graphics can be vectorized for easier usage and resizing. Vector art is popular as online content. Vectorization takes raster images, point clouds, or a series of scattered data samples in space, and outputs graphic elements of various types including points, lines, curves, polygons, parametric curves and surface patches. The vectorized representations consist of a different set of components and elements from that of the inputs. This change of representation is the key difference between vectorization and practices such as smoothing and filtering. Compared to the inputs, the vector outputs provide a higher order of control and attributes such as smoothness. Their curvatures or gradients at given points are scale invariant, and they are more robust data sources for downstream applications and analysis. This dissertation explores and broadens the scope of vectorization in various contexts. I propose a novel vectorization algorithm for raster images along with several new applications of the vectorization mechanism in processing and analyzing both 2D and 3D data sets. The main components of the research are: using vectorization to generate 3D models from 2D floor plans; a novel raster image vectorization method and its applications in computer vision, image processing, and animation; and vectorization for visualization and information extraction in 3D laser scan data. I also apply vectorization analysis to human body scans and rock surface scans to reveal insights otherwise difficult to obtain.
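As a concrete, generic example of the change of representation that defines vectorization, the classic Ramer-Douglas-Peucker algorithm reduces a densely sampled polyline to a compact set of vector segments. This sketch is illustrative only and is not the novel algorithm proposed in the dissertation:

```python
import math

def perpendicular_distance(p, a, b):
    """Distance from point p to the line through a and b."""
    if a == b:
        return math.dist(p, a)
    (x, y), (x1, y1), (x2, y2) = p, a, b
    num = abs((x2 - x1) * (y1 - y) - (x1 - x) * (y2 - y1))
    return num / math.dist(a, b)

def simplify(points, tol):
    """Ramer-Douglas-Peucker: keep only the vertices needed to stay
    within `tol` of the original dense polyline."""
    if len(points) < 3:
        return list(points)
    # find the interior point farthest from the chord between the endpoints
    dmax, idx = 0.0, 0
    for i in range(1, len(points) - 1):
        d = perpendicular_distance(points[i], points[0], points[-1])
        if d > dmax:
            dmax, idx = d, i
    if dmax <= tol:
        return [points[0], points[-1]]  # chord is a good enough vector segment
    left = simplify(points[:idx + 1], tol)
    right = simplify(points[idx:], tol)
    return left[:-1] + right  # drop the duplicated split point
```

A 21-sample pixel-traced corner collapses to three vertices, while the tolerance parameter controls the trade-off between fidelity and compactness that all vectorization schemes negotiate.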
Date Created
2016

Visual analytics tool for the Global Change Assessment Model

Description
The Global Change Assessment Model (GCAM) is an integrated assessment tool for exploring consequences of and responses to global change. However, the current iteration of GCAM relies on NetCDF file outputs which need to be exported for visualization and analysis purposes. Such a requirement limits the uptake of this modeling platform for analysts who may wish to explore future scenarios. This work has focused on a web-based geovisual analytics interface for GCAM. Challenges of this work include enabling both domain experts and model experts to functionally explore the model. Furthermore, scenario analysis has been widely applied in climate science to understand the impact of climate change on the future human environment. The inter-comparison of scenario analyses remains a big challenge in both the climate science and visualization communities. In close collaboration with the Global Change Assessment Model team, I developed the first visual analytics interface for GCAM, with a series of interactive functions to help users understand the simulated impact of climate change on sectors of the global economy while allowing them to explore inter-comparisons of scenario analyses with GCAM models. This tool implements a hierarchical clustering approach to allow inter-comparison and similarity analysis among multiple scenarios over space, time, and multiple attributes through a set of coordinated multiple views. After working with this tool, the scientists from the GCAM team agree that the geovisual analytics tool can facilitate scenario exploration and support the process of gaining scientific insight into scenario comparison. To demonstrate my work, I present two case studies. One explores the potential impact that the China south-north water transportation project in the Yangtze River basin will have on projected water demands. The other demonstrates how spatial variation and scale affect the similarity analysis of climate scenarios at world, continental, and country levels.
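The hierarchical clustering idea can be sketched independently of the tool. The following is a naive, hypothetical single-linkage agglomeration over named scenario time series (real GCAM outputs are multidimensional; one scalar per time step is used here for brevity), not the tool's actual implementation:

```python
import math

def scenario_distance(a, b):
    """Euclidean distance between two scenario time series of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def agglomerate(scenarios, n_clusters):
    """Naive single-linkage agglomerative clustering over a dict mapping
    scenario name -> time series; merges the two closest clusters until
    only `n_clusters` remain."""
    clusters = [[name] for name in scenarios]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(
                    scenario_distance(scenarios[a], scenarios[b])
                    for a in clusters[i] for b in clusters[j]
                )
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters[j]  # merge the closest pair
        del clusters[j]
    return clusters
```

Cutting the merge sequence at different levels gives the analyst coarse-to-fine groupings of scenarios, which is what a coordinated-views interface can then let them compare over space and time.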
Date Created
2015