Missing Data in Conditional Inference Trees
Description
Decision trees are a machine learning technique that searches the predictor space for the variable and observed value that yield the best prediction when the data are split into two nodes on that variable and splitting value. Conditional Inference Trees (CTREEs) are a non-parametric class of decision trees that use statistical theory to select variables for splitting. Missing data are problematic in decision trees because an observation with a missing value on the chosen splitting variable cannot be placed into a node. Moreover, missing data can alter the variable selection process, because observations with missing values cannot be placed when candidate splits are evaluated. Simple missing data approaches (e.g., deletion, majority rule, and surrogate splits) have been implemented in decision tree algorithms; however, more sophisticated missing data techniques have not been thoroughly examined. In addition to these approaches, this dissertation proposed a modified multiple imputation approach to handling missing data in CTREEs. A simulation was conducted to compare this approach with the simple missing data approaches as well as with single imputation and multiple imputation with prediction averaging. Results revealed that simple approaches (i.e., majority rule, treating missing as its own category, and listwise deletion) were effective in handling missing data in CTREEs. The modified multiple imputation approach did not perform well relative to the simple approaches in most conditions, but it appeared best suited to small sample sizes and extreme missingness.
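The three simple approaches named in the results can be illustrated with a minimal sketch. This is not the dissertation's code or the CTREE algorithm itself: the function names, the toy data, the `(missing)` label, and the tie-breaking rule (ties go to the left node) are all assumptions made for illustration, and the split variable and threshold are taken as given rather than selected by a statistical test.

```python
MISSING = None  # assumed marker for a missing value

def listwise_delete(rows):
    """Listwise deletion: drop every row with at least one missing value."""
    return [r for r in rows if all(v is not MISSING for v in r.values())]

def missing_as_category(rows, var, label="(missing)"):
    """Recode missing values of a categorical variable as their own level."""
    return [{**r, var: label if r[var] is MISSING else r[var]} for r in rows]

def split_majority_rule(rows, var, threshold):
    """Split on var <= threshold; rows missing var go to the larger node.

    Ties are sent left (an arbitrary choice made for this sketch).
    """
    left = [r for r in rows if r[var] is not MISSING and r[var] <= threshold]
    right = [r for r in rows if r[var] is not MISSING and r[var] > threshold]
    missing = [r for r in rows if r[var] is MISSING]
    if len(left) >= len(right):
        left = left + missing
    else:
        right = right + missing
    return left, right

# Hypothetical toy data: one numeric and one categorical predictor.
rows = [
    {"age": 23, "group": "a"},
    {"age": 35, "group": None},
    {"age": None, "group": "b"},
    {"age": 41, "group": "b"},
    {"age": 52, "group": "a"},
]

complete_cases = listwise_delete(rows)
recoded = missing_as_category(rows, "group")
left, right = split_majority_rule(rows, "age", 30)
```

Here listwise deletion discards two of the five rows, whereas the other two approaches keep every observation: recoding lets a missing `group` value take part in a split as its own level, and the majority rule sends the row with a missing `age` to whichever daughter node already holds more observations.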