Positive Unlabeled Learning - Optimization and Evaluation

Description
In many real-world machine learning classification applications, well-labeled training data can be difficult, expensive, or even impossible to obtain. In such situations, it is sometimes possible to label a small subset of data as belonging to the class of interest, though it is impractical to manually label all data not of interest. The result is a small set of positively labeled data and a large set of unlabeled data whose classes are unknown. This is known as the Positive and Unlabeled learning (PU learning) problem, a type of semi-supervised learning. In this dissertation, the PU learning problem is rigorously defined, several common assumptions are described, and a literature review of the field is provided. A new family of effective PU learning algorithms, the MLR (Modified Logistic Regression) family, is described. Theoretical and experimental justification for these algorithms is provided, demonstrating their success and flexibility. Extensive experimentation and empirical evidence are provided comparing several new and existing metrics for estimating PU classifier performance in a wide variety of scenarios. A simple recall estimate is shown to have a surprisingly clear advantage as the best estimate of overall PU classifier performance. Finally, an application of PU learning to solar fault detection, an area not previously explored in the field, demonstrates the advantage and potential of PU learning in new application domains.
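As a rough illustration of the kind of recall estimate the abstract refers to, the sketch below computes the fraction of held-out labeled positives that a classifier scores above a threshold. It is not taken from the dissertation: the function name `estimate_recall`, the 0.5 threshold, and the example scores are all hypothetical, and the estimate is only meaningful under the assumption that the labeled positives are a random sample of all positives.

```python
import numpy as np

def estimate_recall(scores_labeled_pos, threshold=0.5):
    """Estimate recall as the share of held-out labeled positives
    that the classifier predicts as positive.

    Assumption (not from the source): labeled positives are a random
    sample of all positives, so their hit rate approximates true recall.
    """
    scores = np.asarray(scores_labeled_pos, dtype=float)
    if scores.size == 0:
        raise ValueError("need at least one labeled positive example")
    return float(np.mean(scores >= threshold))

# Hypothetical classifier scores on five held-out labeled positives.
held_out_scores = [0.9, 0.8, 0.4, 0.7, 0.2]
recall_hat = estimate_recall(held_out_scores)
```

No negative labels are needed for this quantity, which is what makes it computable in the PU setting.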
Date Created
2021

Data-Driven Representation Learning in Multimodal Feature Fusion

Description
Modern machine learning systems leverage data and features from multiple modalities to gain more predictive power. In most scenarios, the modalities are vastly different and the acquired data are heterogeneous in nature. Consequently, building highly effective fusion algorithms is central to achieving improved model robustness and inference performance. This dissertation focuses on representation learning approaches as the fusion strategy. Specifically, the objective is to learn a shared latent representation that jointly exploits the structural information encoded in all modalities, such that a straightforward learning model can be adopted to obtain the prediction.

We first consider sensor fusion, a typical multimodal fusion problem critical to building a pervasive computing platform. A systematic fusion technique is described that supports both multiple sensors and multiple descriptors for activity recognition. Designed to learn the optimal combination of kernels, Multiple Kernel Learning (MKL) algorithms have been successfully applied to numerous fusion problems in computer vision and other areas. Utilizing the MKL formulation, we next describe an auto-context algorithm for learning image context through fusion with low-level descriptors. Furthermore, a principled fusion algorithm that uses deep learning to optimize kernel machines is developed. By bridging deep architectures with kernel optimization, this approach leverages the benefits of both paradigms and is applied to a wide variety of fusion problems.
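The idea of combining kernels, which MKL builds on, can be sketched as a convex combination of base kernel matrices. The example below is only an illustration under stated assumptions: the weights are fixed by hand rather than learned (an actual MKL algorithm would optimize them), and the helper names `rbf_kernel` and `combine_kernels`, the toy data, and the choice of base kernels are hypothetical.

```python
import numpy as np

def rbf_kernel(X, gamma):
    """Gaussian (RBF) kernel matrix for the rows of X."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

def combine_kernels(kernels, weights):
    """Convex combination of kernel matrices.

    Weights must be non-negative; they are normalized to sum to one.
    Here they are supplied by hand; MKL would learn them from data.
    """
    w = np.asarray(weights, dtype=float)
    if np.any(w < 0) or w.sum() == 0:
        raise ValueError("weights must be non-negative and not all zero")
    w = w / w.sum()
    return sum(w_m * K_m for w_m, K_m in zip(w, kernels))

# Toy data: three points in 2-D, with one RBF and one linear base kernel.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
base_kernels = [rbf_kernel(X, gamma=0.5), X @ X.T]
K = combine_kernels(base_kernels, weights=[1.0, 3.0])
```

Because each base kernel is positive semidefinite and the weights are non-negative, the combined matrix is also a valid kernel and can be passed to any kernel machine that accepts a precomputed kernel.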

In many real-world applications, the modalities exhibit highly specific data structures, such as time sequences and graphs, and consequently the learning architecture requires special design. To improve temporal modeling for multivariate sequences, we developed two architectures centered around attention models. A novel clinical time series analysis model is proposed for several critical problems in healthcare. Another model, coupled with a triplet ranking loss as a metric learning framework, is described to better solve speaker diarization. Compared to state-of-the-art recurrent networks, these attention-based multivariate analysis tools achieve improved performance while having lower computational complexity. Finally, to perform community detection on multilayer graphs, a fusion algorithm is described that derives node embeddings using word embedding techniques while also exploiting the complementary relational information contained in each layer of the graph.
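For readers unfamiliar with the triplet ranking loss mentioned above, a minimal sketch of its common hinge form follows. It is not the dissertation's implementation: the squared Euclidean distance, the margin of 1.0, the function name `triplet_ranking_loss`, and the toy embeddings are all assumptions made for illustration. In a diarization setting, the anchor and positive would plausibly be embeddings of segments from the same speaker and the negative from a different speaker.

```python
import numpy as np

def triplet_ranking_loss(anchor, positive, negative, margin=1.0):
    """Mean hinge loss max(0, d(a, p) - d(a, n) + margin) over a batch.

    d is the squared Euclidean distance (an assumed choice). The loss
    is zero once the negative is farther from the anchor than the
    positive by at least `margin`.
    """
    a, p, n = (np.asarray(v, dtype=float) for v in (anchor, positive, negative))
    d_pos = np.sum((a - p) ** 2, axis=-1)
    d_neg = np.sum((a - n) ** 2, axis=-1)
    return float(np.mean(np.maximum(0.0, d_pos - d_neg + margin)))

# Hypothetical 2-D embeddings.
anchor = [[0.0, 0.0]]
positive = [[0.0, 1.0]]
far_negative = [[3.0, 0.0]]    # well separated: no penalty
close_negative = [[1.0, 0.0]]  # as close as the positive: penalized

loss_easy = triplet_ranking_loss(anchor, positive, far_negative)
loss_hard = triplet_ranking_loss(anchor, positive, close_negative)
```

Minimizing this quantity pulls same-class embeddings together and pushes different-class embeddings apart, which is the sense in which it serves as a metric learning objective.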
Date Created
2018