Machine Learning Models for High-Dimensional Matched Data

161983-Thumbnail Image.png
Description
Matching or stratification is commonly used in observational studies to remove bias due to confounding variables. Analyzing matched data sets requires specific methods which handle dependency among observations within a stratum. Also, modern studies often include hundreds or thousands of

Matching or stratification is commonly used in observational studies to remove bias due to confounding variables. Analyzing matched data sets requires specific methods which handle dependency among observations within a stratum. Also, modern studies often include hundreds or thousands of variables. Traditional methods for matched data sets are challenged in high-dimensional settings, mixed type variables (numerical and categorical), nonlinear andinteraction effects. Furthermore, machine learning research for such structured data is quite limited. This dissertation addresses this important gap and proposes machine learning models for identifying informative variables from high-dimensional matched data sets. The first part of this dissertation proposes a machine learning model to identify informative variables from high-dimensional matched case-control data sets. The outcome of interest in this study design is binary (case or control), and each stratum is assumed to have one unit from each outcome level. The proposed method which is referred to as Matched Forest (MF) is effective for large number of variables and identifying interaction effects. The second part of this dissertation proposes three enhancements of MF algorithm. First, a regularization framework is proposed to improve variable selection performance in excessively high-dimensional settings. Second, a classification method is proposed to classify unlabeled pairs of data. Third, two metrics are proposed to estimate the effects of important variables identified by MF. The third part proposes a machine learning model based on Neural Networks to identify important variables from a more generalized matched case-control data set where each stratum has one unit from case outcome level and more than one unit from control outcome level. This method which is referred to as Matched Neural Network (MNN) performs better than current algorithms to identify variables with interaction effects. Lastly, a generalized machine learning model is proposed to identify informative variables from high-dimensional matched data sets where the outcome has more than two levels. This method outperforms existing algorithms in the literature in identifying variables with complex nonlinear and interaction effects.
Date Created
2021
Agent

Projection properties and analysis methods for six to fourteen factor no confounding designs in 16 runs

151329-Thumbnail Image.png
Description
During the initial stages of experimentation, there are usually a large number of factors to be investigated. Fractional factorial (2^(k-p)) designs are particularly useful during this initial phase of experimental work. These experiments often referred to as screening experiments hel

During the initial stages of experimentation, there are usually a large number of factors to be investigated. Fractional factorial (2^(k-p)) designs are particularly useful during this initial phase of experimental work. These experiments often referred to as screening experiments help reduce the large number of factors to a smaller set. The 16 run regular fractional factorial designs for six, seven and eight factors are in common usage. These designs allow clear estimation of all main effects when the three-factor and higher order interactions are negligible, but all two-factor interactions are aliased with each other making estimation of these effects problematic without additional runs. Alternatively, certain nonregular designs called no-confounding (NC) designs by Jones and Montgomery (Jones & Montgomery, Alternatives to resolution IV screening designs in 16 runs, 2010) partially confound the main effects with the two-factor interactions but do not completely confound any two-factor interactions with each other. The NC designs are useful for independently estimating main effects and two-factor interactions without additional runs. While several methods have been suggested for the analysis of data from nonregular designs, stepwise regression is familiar to practitioners, available in commercial software, and is widely used in practice. Given that an NC design has been run, the performance of stepwise regression for model selection is unknown. In this dissertation I present a comprehensive simulation study evaluating stepwise regression for analyzing both regular fractional factorial and NC designs. Next, the projection properties of the six, seven and eight factor NC designs are studied. Studying the projection properties of these designs allows the development of analysis methods to analyze these designs. Lastly the designs and projection properties of 9 to 14 factor NC designs onto three and four factors are presented. Certain recommendations are made on analysis methods for these designs as well.
Date Created
2012
Agent