Analysis Methods for No-Confounding Screening Designs

Description
Nonregular designs are a preferable alternative to regular resolution four designs because they avoid confounding two-factor interactions. As a result, nonregular designs can estimate and identify a few active two-factor interactions. However, due to the sometimes complex alias structure of nonregular designs, standard screening strategies can fail to identify all active effects. This research discusses two-level nonregular screening designs with orthogonal main effects. By utilizing knowledge of the alias structure, a design-based model selection process for analyzing nonregular designs is proposed.

The Aliased Informed Model Selection (AIMS) strategy is a design-specific approach that is compared to three generic model selection methods: stepwise regression, the least absolute shrinkage and selection operator (LASSO), and the Dantzig selector. The AIMS approach substantially increases the power to detect active main effects and two-factor interactions relative to these generic methods. This research identifies design-specific model spaces: sets of models that have strong heredity, are all estimable, and exhibit no model confounding. These spaces are then used in the AIMS method along with design-specific aliasing rules for model selection decisions. Model spaces and alias rules are identified for three designs: the 16-run no-confounding 6-, 7-, and 8-factor designs. The designs are demonstrated with several examples as well as simulations to show the superiority of AIMS in model selection.

A final piece of the research provides a method for augmenting no-confounding designs based on model spaces and maximum average D-efficiency. Several augmented designs are provided for different situations. A final simulation with the augmented designs shows strong results for augmenting with four additional runs if time and resources permit.
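To make the D-efficiency criterion concrete, the following is a minimal sketch that uses the common definition 100 * |X'X|^(1/p) / n for a model matrix X with n runs and p columns, and averages it over a set of candidate models. The function names, the 2^3 full factorial, and the three-model "model space" are illustrative assumptions and are not taken from the dissertation; its exact averaging scheme may differ.

```python
from itertools import product


def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            return 0.0  # singular information matrix
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det


def model_matrix(design, interactions=()):
    """Intercept, all main effects, and the listed two-factor interactions."""
    rows = []
    for run in design:
        row = [1.0] + [float(x) for x in run]
        row += [float(run[i] * run[j]) for i, j in interactions]
        rows.append(row)
    return rows


def d_efficiency(x):
    """D-efficiency in percent: 100 * det(X'X)^(1/p) / n."""
    n, p = len(x), len(x[0])
    info = [[sum(row[i] * row[j] for row in x) for j in range(p)]
            for i in range(p)]
    return 100.0 * max(determinant(info), 0.0) ** (1.0 / p) / n


def average_d_efficiency(design, model_space):
    """Mean D-efficiency over a list of models (each a list of interactions)."""
    values = [d_efficiency(model_matrix(design, m)) for m in model_space]
    return sum(values) / len(values)


# Hypothetical example: a 2^3 full factorial and three one-interaction models.
full_factorial = list(product([-1, 1], repeat=3))
model_space = [[(0, 1)], [(0, 2)], [(1, 2)]]
print(average_d_efficiency(full_factorial, model_space))
print(average_d_efficiency(full_factorial[:-1], model_space))
```

Under this definition an orthogonal design reaches 100 percent for every model, and dropping a run lowers the value, which is the kind of comparison an augmentation search would rank candidate runs by.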
Date Created
2020

Active Learning with Explore and Exploit Equilibriums

Description
In conventional supervised learning tasks, information retrieval from extensive collections of data happens automatically at low cost, whereas in many real-world problems obtaining labeled data can be hard, time-consuming, and expensive. Consider healthcare systems, for example, where unlabeled medical images are abundant while labeling requires a considerable amount of knowledge from experienced physicians. Active learning addresses this challenge with an iterative process to select instances from the unlabeled data to annotate and improve the supervised learner. At each step, the query of examples to be labeled can be considered as a dilemma between exploitation of the supervised learner's current knowledge and exploration of the unlabeled input features.
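The exploration-exploitation dilemma in a single query step can be sketched as follows. This is a generic illustration, not one of the algorithms proposed in the dissertation: exploitation is approximated by margin-based uncertainty of the current classifier, exploration by distance to the nearest labeled point, and the mixing parameter `weight` is a hypothetical knob.

```python
import math


def margin_uncertainty(probs):
    """High when the two most likely classes have similar probability."""
    top_two = sorted(probs, reverse=True)[:2]
    return 1.0 - (top_two[0] - top_two[1])


def novelty(point, labeled_points):
    """Distance to the nearest labeled point (a simple exploration score)."""
    return min(math.dist(point, q) for q in labeled_points)


def select_batch(pool, pool_probs, labeled_points, batch_size, weight=0.5):
    """Return indices of the pool instances with the highest combined score.

    weight=1.0 uses only exploitation (uncertainty);
    weight=0.0 uses only exploration (novelty).
    """
    max_novelty = max(novelty(p, labeled_points) for p in pool) or 1.0
    scored = []
    for idx, (point, probs) in enumerate(zip(pool, pool_probs)):
        exploit = margin_uncertainty(probs)
        explore = novelty(point, labeled_points) / max_novelty
        scored.append((weight * exploit + (1.0 - weight) * explore, idx))
    scored.sort(reverse=True)
    return [idx for _, idx in scored[:batch_size]]


# Hypothetical pool of three unlabeled points with class probabilities.
pool = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
pool_probs = [[0.9, 0.1], [0.5, 0.5], [0.9, 0.1]]
labeled = [(0.0, 0.0)]
print(select_batch(pool, pool_probs, labeled, batch_size=1, weight=1.0))
print(select_batch(pool, pool_probs, labeled, batch_size=1, weight=0.0))
```

With `weight=1.0` the most ambiguous point is queried; with `weight=0.0` the point farthest from the labeled data is queried. Choosing the weight adaptively is the balancing problem the dissertation addresses.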

Motivated by the need for efficient active learning strategies, this dissertation proposes new algorithms for batch-mode, pool-based active learning. The research considers the following questions: How can unsupervised knowledge of the input features (exploration) improve learning when incorporated with supervised learning (exploitation)? How can exploration be characterized in active learning when data is high-dimensional? Finally, how can exploration and exploitation be balanced adaptively?

The first contribution proposes a new active learning algorithm, Cluster-based Stochastic Query-by-Forest (CSQBF), which provides a batch-mode strategy that accelerates learning with added value from exploration and improved exploitation scores. CSQBF balances exploration and exploitation using a probabilistic scoring criterion based on classification probabilities from a tree-based ensemble model within each data cluster.

The second contribution introduces two more query strategies, Double Margin Active Learning (DMAL) and Cluster Agnostic Active Learning (CAAL), that combine consistent exploration and exploitation modules into a coherent and unified measure for label query. Instead of assuming a fixed clustering structure, CAAL and DMAL adopt a soft-clustering strategy which provides a new approach to formalize exploration in active learning.

The third contribution addresses the challenge of dynamically balancing exploration and exploitation criteria throughout the active learning process. Two adaptive algorithms are proposed based on feedback-driven bandit optimization frameworks that handle this issue by learning the relationship between the exploration-exploitation trade-off and an active learner's performance.
Date Created
2020

Real-time Analysis and Control for Smart Manufacturing Systems

Description
Recent advances in manufacturing systems, such as advanced embedded sensing, big data analytics, IoT, and robotics, promise a paradigm shift in the manufacturing industry towards smart manufacturing systems. Typically, real-time data that reflects machine conditions and the production system’s operating performance is available in many industries, such as automotive, semiconductor, and food production. However, a major research gap still exists in how to use this real-time data to evaluate and predict production system performance and to facilitate timely decision making and production control on the factory floor. To tackle these challenges, this dissertation takes an integrated analytical approach that combines data analytics, stochastic modeling, and decision making under uncertainty to solve practical manufacturing problems.

Specifically, this research considers the machine degradation process. It has been shown that machines working in different operating states may break down in different probabilistic manners. In addition, machines working in worse operating states are more likely to fail, causing more frequent down periods and reducing system throughput. However, there is still a lack of analytical methods to quantify the potential impact of machine condition degradation on overall system performance and so facilitate operational decision making on the factory floor. To address these issues, this dissertation considers a serial production line with finite buffers and multiple machines following a Markovian degradation process. An integrated model based on the aggregation method is built to quantify the overall system performance and its interactions with the machine condition process. Moreover, system properties are investigated to analyze the influence of system parameters on system performance. In addition, three types of bottlenecks are defined and their corresponding indicators are derived to provide guidelines for improving system performance. These methods provide quantitative tools for modeling, analyzing, and improving manufacturing systems in which machine condition degradation and productivity are coupled, given the real-time signals.
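As a small illustration of how a Markovian degradation process links machine condition to availability, the sketch below models a single machine with three states and computes its long-run state probabilities by power iteration. The states and all transition probabilities are hypothetical, and the sketch covers one isolated machine only; the serial line with finite buffers and the aggregation method described above are not reproduced here.

```python
def stationary_distribution(transition, max_iterations=100_000, tol=1e-13):
    """Long-run state probabilities of an ergodic discrete-time Markov chain."""
    n = len(transition)
    dist = [1.0 / n] * n
    for _ in range(max_iterations):
        nxt = [sum(dist[i] * transition[i][j] for i in range(n))
               for j in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, dist)) < tol:
            return nxt
        dist = nxt
    return dist


# Hypothetical per-cycle transition probabilities.
# States: 0 = good, 1 = degraded, 2 = down (under repair).
# The degraded state fails far more often than the good state (0.15 vs 0.02).
P = [
    [0.90, 0.08, 0.02],  # good -> good / degraded / down
    [0.00, 0.85, 0.15],  # degraded -> degraded / down
    [0.50, 0.00, 0.50],  # down -> repaired to good / still down
]

pi = stationary_distribution(P)
availability = pi[0] + pi[1]  # fraction of cycles the machine can produce
print("state probabilities:", [round(p, 4) for p in pi])
print("availability:", round(availability, 4))
```

Raising the failure probability of the degraded state, or the rate at which the machine degrades, lowers the availability, which is the single-machine analogue of the throughput loss discussed above.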
Date Created
2020

BioMan: Discrete-event Simulator to Analyze Operations for CAR-T Cell Therapy Manufacturing

Description
The success of genetically-modified T-cells in treating hematological malignancies has accelerated the research timeline for Chimeric Antigen Receptor-T (CAR-T) cell therapy. Since there are only two approved products (Kymriah and Yescarta), process knowledge is limited. This leads to low efficiency at the manufacturing stage, with serious challenges related to high cost and scalability. In addition, the individualized nature of the therapy limits inventory and creates a high risk of product loss due to supply chain failure. The sector needs a new manufacturing paradigm capable of quickly responding to individualized demands while considering complex system dynamics.

The research formulates the problem of CAR-T manufacturing design, with the aim of understanding performance for large-scale production of personalized therapies. The solution is to develop a simulation environment for bio-manufacturing systems with single-use equipment. The result is BioMan: a discrete-event simulation model that considers the effect of the therapy's individualized nature, the type of processing, and quality-management policies on process yield and time, while simultaneously handling the available resource constraints. The tool will be useful for understanding the impact of varying factor inputs on CAR-T cell manufacturing and will help decision-makers settle on strategies that achieve better processing, higher resource utilization, and lower failure rates.
Date Created
2020

A Study on Optimization Measurement Policies for Quality Control Improvements in Gene Therapy Manufacturing

Description
With the increased demand for genetically modified T-cells in treating hematological malignancies, the need for an optimized measurement policy within current good manufacturing practices for better quality control has grown greatly. There are several steps involved in manufacturing gene therapy. For autologous gene therapy, these steps are, in chronological order: harvesting T-cells from the patient, activation of the cells (thawing the cryogenically frozen cells after transport to the manufacturing center), viral vector transduction, Chimeric Antigen Receptor (CAR) attachment during T-cell expansion, and infusion into the patient. The need for improved measurement heuristics within the transduction and expansion portions of the manufacturing process has reached an all-time high because of the costly nature of manufacturing the product, the long cycle time (approximately 14-28 days from activation to infusion), and the risk of external contamination during manufacturing that negatively impacts patients after infusion (such as illness and death).

The main objective of this work is to investigate and improve measurement policies on the basis of quality control in the transduction/expansion bio-manufacturing processes. More specifically, this study addresses the issue of measuring yield within the transduction/expansion phases of gene therapy. To do so, the process is modeled as a Markov Decision Process in which the decisions are chosen optimally to produce an overall optimal measurement policy for a set of predefined parameters.
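The idea of deriving a measurement policy from a Markov Decision Process can be illustrated with a toy example solved by value iteration. Everything in the sketch is an assumption made for illustration: the two states, the two actions, the transition probabilities, the rewards, and the discount factor are hypothetical and do not come from the study.

```python
def q_value(outcomes, values, discount):
    """Expected return of one action, given (prob, next_state, reward) tuples."""
    return sum(p * (r + discount * values[s2]) for p, s2, r in outcomes)


def value_iteration(transitions, discount=0.9, tol=1e-9):
    """Return optimal state values and a greedy policy for a finite MDP."""
    values = {s: 0.0 for s in transitions}
    while True:
        new_values = {
            s: max(q_value(o, values, discount) for o in actions.values())
            for s, actions in transitions.items()
        }
        delta = max(abs(new_values[s] - values[s]) for s in transitions)
        values = new_values
        if delta < tol:
            break
    policy = {}
    for s, actions in transitions.items():
        policy[s] = max(
            actions, key=lambda a: q_value(actions[a], values, discount)
        )
    return values, policy


# Hypothetical toy model: a culture is either "healthy" or "drifting".
# "measure" costs 3 units of reward; in the drifting state it triggers a
# correction that returns the culture to healthy. "skip" costs nothing.
transitions = {
    "healthy": {
        "skip": [(0.8, "healthy", 10.0), (0.2, "drifting", 10.0)],
        "measure": [(0.8, "healthy", 7.0), (0.2, "drifting", 7.0)],
    },
    "drifting": {
        "skip": [(1.0, "drifting", 2.0)],
        "measure": [(1.0, "healthy", -1.0)],
    },
}

values, policy = value_iteration(transitions)
print(policy)
```

In this toy setting the resulting policy measures only when the culture is drifting, since measuring a healthy culture pays the cost without changing the dynamics. A realistic model would also need to account for the state being only partially observed without measurement.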
Date Created
2020

Embedded Feature Selection for Model-based Clustering

Description
Model-based clustering is a sub-field of statistical modeling and machine learning. Mixture models use probabilities to describe the degree to which a data point belongs to each cluster, and these probabilities are updated iteratively during clustering. While mixture models have demonstrated superior performance in handling noisy data in many fields, challenges remain for high-dimensional datasets. Among a large number of features, some may not contribute to delineating the cluster profiles. Including these “noisy” features makes it harder for the model to identify the real structure of the clusters and increases computational time. Recognizing this issue, in this dissertation I first propose a new feature selection algorithm for continuous datasets and then extend it to mixed data types. Finally, as the third topic, I conduct uncertainty quantification for the feature selection results.
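The iterative update of membership probabilities mentioned above can be sketched with a plain expectation-maximization loop for a one-dimensional Gaussian mixture. This is the standard EM algorithm only, shown as background; it does not include the selection step of the ESM algorithm, and the function names, the initialization, and the toy data are illustrative assumptions.

```python
import math


def gaussian_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_gmm_1d(data, n_components=2, iterations=50, var_floor=1e-6):
    """Standard EM for a 1-D Gaussian mixture (needs n_components >= 2)."""
    lo, hi = min(data), max(data)
    # Simple deterministic start: means spread evenly over the data range.
    means = [lo + (hi - lo) * k / (n_components - 1) for k in range(n_components)]
    variances = [1.0] * n_components
    weights = [1.0 / n_components] * n_components
    resp = []
    for _ in range(iterations):
        # E-step: probability that each point belongs to each component.
        resp = []
        for x in data:
            dens = [w * gaussian_pdf(x, m, v)
                    for w, m, v in zip(weights, means, variances)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means, and variances.
        for k in range(n_components):
            n_k = sum(r[k] for r in resp)
            if n_k < 1e-12:
                continue  # leave an empty component unchanged
            weights[k] = n_k / len(data)
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / n_k
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / n_k
            variances[k] = max(var, var_floor)
    return weights, means, variances, resp


# Hypothetical data: two well-separated groups around 0 and 5.
data = [0.0, 0.2, -0.2, 0.1, -0.1, 5.0, 5.2, 4.8, 5.1, 4.9]
weights, means, variances, resp = fit_gmm_1d(data)
print([round(m, 3) for m in means])
```

Each row of `resp` holds the membership probabilities of one data point. A feature that leaves these probabilities essentially unchanged when removed contributes little to the cluster structure, which is the intuition behind embedding feature selection in the loop.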

The first topic is an embedded feature selection algorithm, termed the Expectation-Selection-Maximization (ESM) model, that automatically selects features while optimizing the parameters of a Gaussian mixture model. I introduce a relevancy index (RI) that reveals the contribution of each feature to the clustering process and so assists feature selection. I demonstrate the efficacy of ESM by studying two synthetic datasets, four benchmark datasets, and an Alzheimer’s Disease dataset.

The second topic focuses on extending the ESM algorithm to handle mixed data types. The Gaussian mixture model is generalized to the Generalized Model of Mixture (GMoM), which can handle not only continuous features but also binary and nominal features.

The last topic is Uncertainty Quantification (UQ) of the feature selection. A new algorithm termed ESOM is proposed, which takes variance information into consideration while conducting feature selection. In addition, a set of outliers is generated in the feature selection process to infer the uncertainty in the input data. Finally, the selected features and detected outlier instances are evaluated by visual comparison.
Date Created
2020

Efficient Incremental Model Learning on Data Streams

Description
With the development of modern technological infrastructures, such as social networks and the Internet of Things (IoT), data is being generated at unprecedented speed. Analyzing the content of this data helps us further understand underlying patterns and discover relationships among different subsets of data, enabling intelligent decision making. In this thesis, I first introduce the Low-rank, Windowed, Incremental Singular Value Decomposition (SVD) framework to incrementally maintain SVD factors over streaming data. Then, I present the Group Incremental Non-Negative Matrix Factorization framework, which leverages redundancies in the data to speed up incremental processing. Both frameworks primarily tackle the challenges of using factorization models on streaming textual data. To improve the effectiveness and efficiency of generative models in this streaming environment, I introduce the Incremental Dynamic Multiscale Topic Model framework, which identifies multi-scale patterns and their evolution within streaming datasets. While latent factor models assume linear independence among the latent factors, generative models assume that each observation is generated from a set of latent variables with various distributions. Furthermore, some models may not be accessible, or their underlying structures may be too complex to understand. Simulation ensembles are one example: there may be thousands of parameters spanning a huge parameter space, and the only way to learn from them is to execute real simulations. When performing knowledge discovery and decision making through data- and model-driven simulation ensembles, it is expensive to operate these ensembles continuously at large scale, due to the high computational cost.
Consequently, given a relatively small simulation budget, it is desirable to identify a sparse ensemble that includes the most informative simulations while still permitting effective exploration of the input parameter space. Therefore, I present the Complexity-Guided Parameter Space Sampling framework, an intelligent, top-down sampling scheme that selects the most salient simulation parameters to execute given a limited computational budget. I also present the Pivot-Guided Parameter Space Sampling framework, which incrementally maintains a diverse ensemble of models of the simulation ensemble space and uses a pivot-guided mechanism for future sample selection.
Date Created
2019

GeoSparkSim: A Scalable Microscopic Road Network Traffic Simulator Based on Apache Spark

Description
Researchers and practitioners have widely studied road network traffic data in different areas such as urban planning, traffic prediction, and spatial-temporal databases. For instance, researchers use such data to evaluate the impact of road network changes. Unfortunately, collecting large-scale, high-quality urban traffic data requires tremendous effort because participating vehicles must install Global Positioning System (GPS) receivers and administrators must continuously monitor these devices. Several urban traffic simulators attempt to generate such data with different features. However, they suffer from two critical issues. (1) Scalability: most of them offer only a single-machine solution, which is not adequate for producing large-scale data. Some simulators can generate traffic in parallel but do not balance the load well among machines in a cluster. (2) Granularity: many simulators do not consider microscopic traffic situations, including traffic lights, lane changing, and car following. This paper proposes GeoSparkSim, a scalable traffic simulator that extends Apache Spark to generate large-scale road network traffic datasets with microscopic traffic simulation. The proposed system integrates with a Spark-based spatial data management system, GeoSpark, to deliver a holistic approach that allows data scientists to simulate, analyze, and visualize large-scale urban traffic data. To implement microscopic traffic models, GeoSparkSim employs a simulation-aware vehicle partitioning method that distributes vehicles among machines such that each machine has a balanced workload. The experimental analysis shows that GeoSparkSim can simulate the movements of 200 thousand cars over an extensive road network (250 thousand road junctions and 300 thousand road segments).
Date Created
2019

Performance Analysis of a Double Crane with Finite Interoperational Buffer Capacity with Multiple Fidelity Simulations

Description
With globalization on the rise, the majority of trade happens by sea, and experts have predicted an increase in trade volumes over the next few years. To meet this demand, container ships are being upsized. However, upsizing yields the anticipated benefits only if seaport terminals are adequately equipped to improve turnaround time. This thesis focuses on a special type of double automated crane set-up with a finite interoperational buffer capacity. The buffer is placed between the cranes, and the aim of this research is to analyze the performance of crane operations when this technology is adopted. This thesis proposes an approximation of this complex system, thereby addressing the computational time issue and allowing the performance of the system to be analyzed efficiently. The system is modeled in two phases. The first phase consists of developing a discrete-event simulation model that makes the system evolve over time. The main challenge of this model is its high processing time, since it requires a large number of experimental runs. This motivates the second phase, the development of an analytical model of the system, for which a continuous-time Markov process approach is adopted. Further, to improve the efficiency of the analytical model, a state aggregation approach is proposed. The thesis thus gives insight into the outcomes of the two approaches and the behavior of the error space, and the performance of the models for varying buffer capacities indicates the scope for improvement in this kind of operational set-up.
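To show why a continuous-time Markov model can be evaluated far faster than a simulation, the sketch below treats a finite buffer between two cranes as a simple birth-death chain with a closed-form steady state. This is a deliberately simplified stand-in and not the thesis model: it assumes exponential handling times, a first crane that is blocked only when the buffer is full, and hypothetical rates and capacities.

```python
def buffer_steady_state(fill_rate, drain_rate, capacity):
    """Steady-state probabilities of holding 0..capacity containers.

    Birth-death chain: the first crane adds a container at rate fill_rate
    (unless the buffer is full), the second removes one at rate drain_rate
    (unless it is empty), so pi_k is proportional to (fill/drain)**k.
    """
    rho = fill_rate / drain_rate
    weights = [rho ** k for k in range(capacity + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def buffer_performance(fill_rate, drain_rate, capacity):
    pi = buffer_steady_state(fill_rate, drain_rate, capacity)
    return {
        "blocking_probability": pi[-1],  # first crane finds the buffer full
        "starvation_probability": pi[0],  # second crane finds it empty
        "throughput": fill_rate * (1.0 - pi[-1]),
    }


# Hypothetical rates (containers per minute) for several buffer capacities.
for capacity in (1, 3, 5):
    result = buffer_performance(fill_rate=2.0, drain_rate=2.0, capacity=capacity)
    print(capacity, round(result["throughput"], 4))
```

Each evaluation is a handful of arithmetic operations, whereas a discrete-event simulation needs many replications per configuration. The trade-off is approximation error, which is why comparing the two approaches across buffer capacities is informative.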
Date Created
2018

Stochastic Modeling and Optimization to Improve Identification and Treatment of Alzheimer’s Disease

Description
Mathematical modeling and decision-making within the healthcare industry have provided the means to quantitatively evaluate the impact of decisions on the diagnosis, screening, and treatment of diseases. In this work, we look into a specific, yet very important disease: Alzheimer’s Disease (AD). In the United States, AD is the 6th leading cause of death. Diagnosis of AD cannot be confidently confirmed until after death. This has raised the importance of early diagnosis of AD, based upon symptoms of cognitive decline. A symptom of early cognitive decline and indicator of AD is Mild Cognitive Impairment (MCI). In addition to this qualitative test, biomarker tests have been proposed in the medical field, including p-Tau, FDG-PET, and hippocampal. These tests can be administered to patients as early detectors of AD, thus improving patients’ quality of life and potentially reducing the costs of the healthcare system. Preliminary work has been conducted on the development of a Sequential Tree Based Classifier (STC), which helps medical providers predict whether a patient will develop AD by sequentially administering these biomarker tests. The STC model, however, has limitations, and a more complex, robust model is needed. In fact, STC assumes a general linear model for the status of the patient based upon the test results. We take a simulation perspective and try to define a more complex model that represents the patient’s evolution over time.

Specifically, this thesis focuses on the formulation of a Markov Chain model that is complex and robust. This Markov Chain model emulates the evolution of MCI patients based upon doctor visits and the sequential administration of biomarker tests. The data used to create this Markov Chain model were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database. The data lacked detailed information on the sequential administration of the biomarker tests, and therefore different analytical approaches were tried in order to calibrate the model. The resulting Markov Chain model made it possible to conduct experiments on different parameters of the Markov Chain and yielded different results for patients who developed AD and those who did not, leading to important insights into the effect of thresholds and test sequence on patient prediction capability as well as on health cost reduction.

The data in this thesis were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). ADNI investigators did not contribute to any analysis or writing of this thesis. A list of the ADNI investigators can be found at: http://adni.loni.usc.edu/about/governance/principal-investigators/ .
Date Created
2018