Contributions to Optimal Experimental Design and Strategic Subdata Selection for Big Data

Document
Description
In this dissertation two research questions in the field of applied experimental design were explored. First, methods for augmenting the three-level screening designs called Definitive Screening Designs (DSDs) were investigated. Second, schemes for strategic subdata selection for nonparametric

In this dissertation two research questions in the field of applied experimental design were explored. First, methods for augmenting the three-level screening designs called Definitive Screening Designs (DSDs) were investigated. Second, schemes for strategic subdata selection for nonparametric predictive modeling with big data were developed.

Under sparsity, the structure of DSDs can allow for the screening and optimization of a system in one step, but in non-sparse situations estimation of second-order models requires augmentation of the DSD. In this work, augmentation strategies for DSDs were considered, given the assumption that the correct form of the model for the response of interest is quadratic. Series of augmented designs were constructed and explored, and power calculations, model-robustness criteria, model-discrimination criteria, and simulation study results were used to identify the number of augmented runs necessary for (1) effectively identifying active model effects, and (2) precisely predicting a response of interest. When the goal is identification of active effects, it is shown that supersaturated designs are sufficient; when the goal is prediction, it is shown that little is gained by augmenting beyond the design that is saturated for the full quadratic model. Surprisingly, augmentation strategies based on the I-optimality criterion do not lead to better predictions than strategies based on the D-optimality criterion.

Computational limitations can render standard statistical methods infeasible in the face of massive datasets, necessitating subsampling strategies. In the big data context, the primary objective is often prediction but the correct form of the model for the response of interest is likely unknown. Here, two new methods of subdata selection were proposed. The first is based on clustering, the second is based on space-filling designs, and both are free from model assumptions. The performance of the proposed methods was explored visually via low-dimensional simulated examples; via real data applications; and via large simulation studies. In all cases the proposed methods were compared to existing, widely used subdata selection methods. The conditions under which the proposed methods provide advantages over standard subdata selection strategies were identified.