Computational Challenges in BART Modeling: Extrapolation, Classification, and Causal Inference

191496-Thumbnail Image.png
Description
This dissertation centers on Bayesian Additive Regression Trees (BART) and Accelerated BART (XBART) and presents a series of models that tackle extrapolation, classification, and causal inference challenges. To improve extrapolation in tree-based models, I propose a method called local Gaussian

This dissertation centers on Bayesian Additive Regression Trees (BART) and Accelerated BART (XBART) and presents a series of models that tackle extrapolation, classification, and causal inference challenges. To improve extrapolation in tree-based models, I propose a method called local Gaussian Process (GP) that combines Gaussian process regression with trained BART trees. This allows for extrapolation based on the most relevant data points and covariate variables determined by the trees' structure. The local GP technique is extended to the Bayesian causal forest (BCF) models to address the positivity violation issue in causal inference. Additionally, I introduce the LongBet model to estimate time-varying, heterogeneous treatment effects in panel data. Furthermore, I present a Poisson-based model, with a modified likelihood for XBART for the multi-class classification problem.
Date Created
2024
Agent

Cancer Imaging Method Analysis for Glioblastomas

Description
Glioblastoma is one of the leading types of brain cancer leading to patient death. To combat this type of cancer, many different types of imaging are used to analyze and treat glioblastomas. Still, magnetic resonance imaging, computed tomography, and positron

Glioblastoma is one of the leading types of brain cancer leading to patient death. To combat this type of cancer, many different types of imaging are used to analyze and treat glioblastomas. Still, magnetic resonance imaging, computed tomography, and positron emission tomography are the most commonly used imaging methods. In this literature review, the three different types of imaging are analyzed based on the preparation before imaging by the patient, the methods by which the images are created, the risks involved, and the technological advances in each category. The technological advances also included tools that combined two types of cancer imaging into one. The attributes of each imaging type are then analyzed to see which imaging methods are most effective and how they can be used to create better patient outcomes. Through this review, it was seen that all three methods of imaging were effective in their own ways, but the decision for which tool was based on what stage the cancer was in.
Date Created
2024-05
Agent

Spatio-Temporal Methods for Analysis of Implications of Natural Hazard Risk

190981-Thumbnail Image.png
Description
As the impacts of climate change worsen in the coming decades, natural hazards are expected to increase in frequency and intensity, leading to increased loss and risk to human livelihood. The spatio-temporal statistical approaches developed and applied in this dissertation

As the impacts of climate change worsen in the coming decades, natural hazards are expected to increase in frequency and intensity, leading to increased loss and risk to human livelihood. The spatio-temporal statistical approaches developed and applied in this dissertation highlight the ways in which hazard data can be leveraged to understand loss trends, build forecasts, and study societal impacts of losses. Specifically, this work makes use of the Spatial Hazard Events and Losses Database which is an unparalleled source of loss data for the United States. The first portion of this dissertation develops accurate loss baselines that are crucial for mitigation planning, infrastructure investment, and risk communication. This is accomplished thorough a stationarity analysis of county level losses following a normalization procedure. A wide variety of studies employ loss data without addressing stationarity assumptions or the possibility for spurious regression. This work enables the statistically rigorous application of such loss time series to modeling applications. The second portion of this work develops a novel matrix variate dynamic factor model for spatio-temporal loss data stratified across multiple correlated hazards or perils. The developed model is employed to analyze and forecast losses from convective storms, which constitute some of the highest losses covered by insurers. Adopting factor-based approach, forecasts are achieved despite the complex and often unobserved underlying drivers of these losses. The developed methodology extends the literature on dynamic factor models to matrix variate time series. Specifically, a covariance structure is imposed that is well suited to spatio-temporal problems while significantly reducing model complexity. The model is fit via the EM algorithm and Kalman filter. The third and final part of this dissertation investigates the impact of compounding hazard events on state and regional migration in the United States. Any attempt to capture trends in climate related migration must account for the inherent uncertainties surrounding climate change, natural hazard occurrences, and socioeconomic factors. For this reason, I adopt a Bayesian modeling approach that enables the explicit estimation of the inherent uncertainty. This work can provide decision-makers with greater clarity regarding the extent of knowledge on climate trends.
Date Created
2023
Agent

Tree Ensemble Algorithms for Causal Machine Learning

190865-Thumbnail Image.png
Description
This dissertation centers on treatment effect estimation in the field of causal inference, and aims to expand the toolkit for effect estimation when the treatment variable is binary. Two new stochastic tree-ensemble methods for treatment effect estimation in the continuous

This dissertation centers on treatment effect estimation in the field of causal inference, and aims to expand the toolkit for effect estimation when the treatment variable is binary. Two new stochastic tree-ensemble methods for treatment effect estimation in the continuous outcome setting are presented. The Accelerated Bayesian Causal Forrest (XBCF) model handles variance via a group-specific parameter, and the Heteroskedastic version of XBCF (H-XBCF) uses a separate tree ensemble to learn covariate-dependent variance. This work also contributes to the field of survival analysis by proposing a new framework for estimating survival probabilities via density regression. Within this framework, the Heteroskedastic Accelerated Bayesian Additive Regression Trees (H-XBART) model, which is also developed as part of this work, is utilized in treatment effect estimation for right-censored survival outcomes. All models have been implemented as part of the XBART R package, and their performance is evaluated via extensive simulation studies with appropriate sets of comparators. The contributed methods achieve similar levels of performance, while being orders of magnitude (sometimes as much as 100x) faster than comparator state-of-the-art methods, thus offering an exciting opportunity for treatment effect estimation in the large data setting.
Date Created
2023
Agent

Bayesian Inference for Markov Kernels Valued in Wasserstein Spaces

190789-Thumbnail Image.png
Description
In this work, the author analyzes quantitative and structural aspects of Bayesian inference using Markov kernels, Wasserstein metrics, and Kantorovich monads. In particular, the author shows the following main results: first, that Markov kernels can be viewed as Borel measurable

In this work, the author analyzes quantitative and structural aspects of Bayesian inference using Markov kernels, Wasserstein metrics, and Kantorovich monads. In particular, the author shows the following main results: first, that Markov kernels can be viewed as Borel measurable maps with values in a Wasserstein space; second, that the Disintegration Theorem can be interpreted as a literal equality of integrals using an original theory of integration for Markov kernels; third, that the Kantorovich monad can be defined for Wasserstein metrics of any order; and finally, that, under certain assumptions, a generalized Bayes’s Law for Markov kernels provably leads to convergence of the expected posterior distribution in the Wasserstein metric. These contributions provide a basis for studying further convergence, approximation, and stability properties of Bayesian inverse maps and inference processes using a unified theoretical framework that bridges between statistical inference, machine learning, and probabilistic programming semantics.
Date Created
2023
Agent

Bayesian Spatiotemporal Modeling and Uncertainty Quantification

190731-Thumbnail Image.png
Description
Uncertainty Quantification (UQ) is crucial in assessing the reliability of predictivemodels that make decisions for human experts in a data-rich world. The Bayesian approach to UQ for inverse problems has gained popularity. However, addressing UQ in high-dimensional inverse problems is

Uncertainty Quantification (UQ) is crucial in assessing the reliability of predictivemodels that make decisions for human experts in a data-rich world. The Bayesian approach to UQ for inverse problems has gained popularity. However, addressing UQ in high-dimensional inverse problems is challenging due to the intensity and inefficiency of Markov Chain Monte Carlo (MCMC) based Bayesian inference methods. Consequently, the first primary focus of this thesis is enhancing efficiency and scalability for UQ in inverse problems. On the other hand, the omnipresence of spatiotemporal data, particularly in areas like traffic analysis, underscores the need for effectively addressing inverse problems with spatiotemporal observations. Conventional solutions often overlook spatial or temporal correlations, resulting in underutilization of spatiotemporal interactions for parameter learning. Appropriately modeling spatiotemporal observations in inverse problems thus forms another pivotal research avenue. In terms of UQ methodologies, the calibration-emulation-sampling (CES) scheme has emerged as effective for large-dimensional problems. I introduce a novel CES approach by employing deep neural network (DNN) models during the emulation and sampling phase. This approach not only enhances computational efficiency but also diminishes sensitivity to training set variations. The newly devised “Dimension- Reduced Emulative Autoencoder Monte Carlo (DREAM)” algorithm scales Bayesian UQ up to thousands of dimensions in physics-constrained inverse problems. The algorithm’s effectiveness is exemplified through elliptic and advection-diffusion inverse problems. In the realm of spatiotemporal modeling, I propose to use Spatiotemporal Gaussian processes (STGP) in likelihood modeling and Spatiotemporal Besov processes (STBP) in prior modeling separately. These approaches highlight the efficacy of incorporat- ing spatial and temporal information for enhanced parameter estimation and UQ. Additionally, the superiority of STGP is demonstrated compared to static and time- averaged methods in time-dependent advection-diffusion partial differential equation (PDE) and three chaotic ordinary differential equations (ODE). Expanding upon Besov Process (BP), a method known for sparsity-promotion and edge-preservation, STBP is introduced to capture spatial data features and model temporal correlations by replacing the random coefficients in the series expansion with stochastic time functions following Q-exponential process(Q-EP). This advantage is showcased in dynamic computerized tomography (CT) reconstructions through comparison with classic STGP and a time-uncorrelated approach.
Date Created
2023
Agent

Multiple Testing of Local Maxima for Detection of Peaks and Change Points with Non-stationary Noise

189356-Thumbnail Image.png
Description
This dissertation comprises two projects: (i) Multiple testing of local maxima for detection of peaks and change points with non-stationary noise, and (ii) Height distributions of critical points of smooth isotropic Gaussian fields: computations, simulations and asymptotics. The first project

This dissertation comprises two projects: (i) Multiple testing of local maxima for detection of peaks and change points with non-stationary noise, and (ii) Height distributions of critical points of smooth isotropic Gaussian fields: computations, simulations and asymptotics. The first project introduces a topological multiple testing method for one-dimensional domains to detect signals in the presence of non-stationary Gaussian noise. The approach involves conducting tests at local maxima based on two observation conditions: (i) the noise is smooth with unit variance and (ii) the noise is not smooth where kernel smoothing is applied to increase the signal-to-noise ratio (SNR). The smoothed signals are then standardized, which ensures that the variance of the new sequence's noise becomes one, making it possible to calculate $p$-values for all local maxima using random field theory. Assuming unimodal true signals with finite support and non-stationary Gaussian noise that can be repeatedly observed. The algorithm introduced in this work, demonstrates asymptotic strong control of the False Discovery Rate (FDR) and power consistency as the number of sequence repetitions and signal strength increase. Simulations indicate that FDR levels can also be controlled under non-asymptotic conditions with finite repetitions. The application of this algorithm to change point detection also guarantees FDR control and power consistency. The second project focuses on investigating the explicit and asymptotic height densities of critical points of smooth isotropic Gaussian random fields on both Euclidean space and spheres.The formulae are based on characterizing the distribution of the Hessian of the Gaussian field using the Gaussian orthogonally invariant (GOI) matrices and the Gaussian orthogonal ensemble (GOE) matrices, which are special cases of GOI matrices. However, as the dimension increases, calculating explicit formulae becomes computationally challenging. The project includes two simulation methods for these distributions. Additionally, asymptotic distributions are obtained by utilizing the asymptotic distribution of the eigenvalues (excluding the maximum eigenvalues) of the GOE matrix for large dimensions. However, when it comes to the maximum eigenvalue, the Tracy-Widom distribution is utilized. Simulation results demonstrate the close approximation between the asymptotic distribution and the real distribution when $N$ is sufficiently large.
Date Created
2023
Agent

Case Studies in Machine Learning of Reduced Form Models for Causal Inference

187395-Thumbnail Image.png
Description
This dissertation develops versatile modeling tools to estimate causal effects when conditional unconfoundedness is not immediately satisfied. Chapter 2 provides a brief overview ofcommon techniques in causal inference, with a focus on models relevant to the data explored in later chapters.

This dissertation develops versatile modeling tools to estimate causal effects when conditional unconfoundedness is not immediately satisfied. Chapter 2 provides a brief overview ofcommon techniques in causal inference, with a focus on models relevant to the data explored in later chapters. The rest of the dissertation focuses on the development of novel “reduced form” models which are designed to assess the particular challenges of different datasets. Chapter 3 explores the question of whether or not forecasts of bankruptcy cause bankruptcy. The question arises from the observation that companies issued going concern opinions were more likely to go bankrupt in the following year, leading people to speculate that the opinions themselves caused the bankruptcy via a “self-fulfilling prophecy”. A Bayesian machine learning sensitivity analysis is developed to answer this question. In exchange for additional flexibility and fewer assumptions, this approach loses point identification of causal effects and thus a sensitivity analysis is developed to study a wide range of plausible scenarios of the causal effect of going concern opinions on bankruptcy. Reported in the simulations are different performance metrics of the model in comparison with other popular methods and a robust analysis of the sensitivity of the model to mis-specification. Results on empirical data indicate that forecasts of bankruptcies likely do have a small causal effect. Chapter 4 studies the effects of vaccination on COVID-19 mortality at the state level in the United States. The dynamic nature of the pandemic complicates more straightforward regression adjustments and invalidates many alternative models. The chapter comments on the limitations of mechanistic approaches as well as traditional statistical methods to epidemiological data. Instead, a state space model is developed that allows the study of the ever-changing dynamics of the pandemic’s progression. In the first stage, the model decomposes the observed mortality data into component surges, and later uses this information in a semi-parametric regression model for causal analysis. Results are investigated thoroughly for empirical justification and stress-tested in simulated settings.
Date Created
2023
Agent

Estimation for Disease Models Across Scales

171927-Thumbnail Image.png
Description
Tracking disease cases is an essential task in public health; however, tracking the number of cases of a disease may be difficult not every infection can be recorded by public health authorities. Notably, this may happen with whole country

Tracking disease cases is an essential task in public health; however, tracking the number of cases of a disease may be difficult not every infection can be recorded by public health authorities. Notably, this may happen with whole country measles case reports, even such countries with robust registration systems. Eilertson et al. (2019) propose using a state-space model combined with maximum likelihood methods for estimating measles transmission. A Bayesian approach that uses particle Markov Chain Monte Carlo (pMCMC) is proposed to estimate the parameters of the non-linear state-space model developed in Eilertson et al. (2019) and similar previous studies. This dissertation illustrates the performance of this approach by calculating posterior estimates of the model parameters and predictions of the unobserved states in simulations and case studies. Also, Iteration Filtering (IF2) is used as a support method to verify the Bayesian estimation and to inform the selection of prior distributions. In the second half of the thesis, a birth-death process is proposed to model the unobserved population size of a disease vector. This model studies the effect of a disease vector population size on a second affected population. The second population follows a non-homogenous Poisson process when conditioned on the vector process with a transition rate given by a scaled version of the vector population. The observation model also measures a potential threshold event when the host species population size surpasses a certain level yielding a higher transmission rate. A maximum likelihood procedure is developed for this model, which combines particle filtering with the Minorize-Maximization (MM) algorithm and extends the work of Crawford et al. (2014).
Date Created
2022
Agent

Optimal Designs under Logistic Mixed Models

171508-Thumbnail Image.png
Description
Longitudinal data involving multiple subjects is quite popular in medical and social science areas. I consider generalized linear mixed models (GLMMs) applied to such longitudinal data, and the optimal design searching problem under such models. In this case, based on

Longitudinal data involving multiple subjects is quite popular in medical and social science areas. I consider generalized linear mixed models (GLMMs) applied to such longitudinal data, and the optimal design searching problem under such models. In this case, based on optimal design theory, the optimality criteria depend on the estimated parameters, which leads to local optimality. Moreover, the information matrix under a GLMM doesn't have a closed-form expression. My dissertation includes three topics related to this design problem. The first part is searching for locally optimal designs under GLMMs with longitudinal data. I apply penalized quasi-likelihood (PQL) method to approximate the information matrix and compare several approximations to show the superiority of PQL over other approximations. Under different local parameters and design restrictions, locally D- and A- optimal designs are constructed based on the approximation. An interesting finding is that locally optimal designs sometimes apply different designs to different subjects. Finally, the robustness of these locally optimal designs is discussed. In the second part, an unknown observational covariate is added to the previous model. With an unknown observational variable in the experiment, expected optimality criteria are considered. Under different assumptions of the unknown variable and parameter settings, locally optimal designs are constructed and discussed. In the last part, Bayesian optimal designs are considered under logistic mixed models. Considering different priors of the local parameters, Bayesian optimal designs are generated. Bayesian design under such a model is usually expensive in time. The running time in this dissertation is optimized to an acceptable amount with accurate results. I also discuss the robustness of these Bayesian optimal designs, which is the motivation of applying such an approach.
Date Created
2022
Agent