Unveiling Cellular Heterogeneity, Genetic Regulation, and Protein Trafficking Dynamics Via Novel Integrative Multi-Omic Approaches

Description
Advancements in high-throughput biotechnologies have generated large-scale multi-omics datasets spanning genomics, epigenomics, transcriptomics, proteomics, metabolomics, metagenomics, and phenomics. Traditionally, statistical and machine learning-based approaches have used single-omics data sources to uncover molecular signatures, dissect complicated cellular mechanisms, and predict clinical outcomes. However, capturing multifaceted pathological mechanisms requires integrative multi-omics analysis that can provide a comprehensive picture of the disease. Here, I present three novel approaches to multi-omics integrative analysis. First, I introduce a single-cell integrative clustering method, which leverages multi-omics to enhance the resolution of cell subpopulations. Applied to a Cellular Indexing of Transcriptomes and Epitopes by Sequencing (CITE-Seq) dataset from human acute myeloid leukemia (AML) and control samples, this approach unveiled nuanced cell populations that would otherwise remain elusive. Second, I shift the focus to a computational framework for discovering transcriptional regulatory trios, in which a transcription factor binds to a regulatory element harboring a genetic variant and subsequently differentially regulates the transcription level of a target gene. Applied to whole-exome, whole-genome, and transcriptome data of multiple myeloma samples, this approach discovered synergistic cis-acting and trans-acting regulatory elements associated with tumorigenesis. Third, I introduce a novel methodology that leverages the transcriptome and surface protein data produced at the single-cell level by CITE-Seq to model the intracellular protein trafficking process. Applied to COVID-19 samples, this approach revealed dysregulated protein trafficking associated with the severity of the infection.
Date Created
2023
Agent

A Study of Price Differences Between the A Shares and B Shares of Dual-Listed Chinese Companies

Description
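The A-share and B-share markets of mainland China's securities market form a uniquely segmented market. The A and B shares of dual-listed companies (hereafter AB shares) carry the same rights, yet B shares have long traded at a discount to A shares, a phenomenon known as the "B Share Puzzle". This is a prominent issue in international capital markets, and research on it continues. This paper examines the relationship between the B-share discount and policies issued by the Chinese government to regulate the long-term development of the stock market. A review of the history of AB shares identifies two policies that intervened in and adjusted their long-term development: the February 2001 decision to allow mainland Chinese residents to invest in B shares (Policy 1), and the split-share structure reform of China's securities market that began on April 29, 2005 (Policy 2). On this basis, an empirical econometric analysis finds that Policy 1 and Policy 2 are significantly correlated with the B-share discount rate. Each policy intervention was also targeted, so that under its influence the change in the B-share discount rate came about through a significant change in either the A-share price or the B-share price. In addition, the average B-share discount rate exhibits volatility clustering, with small fluctuations and mean reversion, and is therefore predictable.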
Date Created
2023
Agent

Developing a Stochastic Modeling App for Biophysics Education

Description

Computational and systems biology are rapidly growing fields of academic study, but researchers unfamiliar with them are impeded by a lack of accessible, programming-optional modeling tools. To address this gap, I developed BioSSA, a web framework built on JavaScript and D3.js that allows users to explore a small library of curated biophysical models as well as create and simulate their own reaction networks. The mathematical foundation of BioSSA is the stochastic Gillespie algorithm (SGA), which is widely used in mathematical modeling and biology to represent chemical reaction systems. SGA is particularly well suited as an introductory modeling tool because of its flexibility, broad applicability, and ability to numerically approximate systems when analytical solutions are not available. BioSSA is freely available to the community, and further improvements are planned.
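For readers new to the Gillespie algorithm, the following is a minimal sketch of its direct method in Python, applied to a simple birth-death reaction network. It is not BioSSA's code (BioSSA is written in JavaScript); the function name, the reaction network, and the rate constants are all assumptions chosen for illustration.

```python
import random


def gillespie_birth_death(k_birth, k_death, x0, t_end, rng):
    """Simulate a birth-death process with the Gillespie direct method.

    Hypothetical example network (not taken from BioSSA):
      birth: 0 -> X   with propensity k_birth
      death: X -> 0   with propensity k_death * x
    Returns the event times and the molecule count after each event.
    """
    t, x = 0.0, x0
    times, counts = [t], [x]
    while True:
        a_birth = k_birth
        a_death = k_death * x
        a_total = a_birth + a_death
        if a_total == 0.0:
            break  # no reaction can fire any more
        # The waiting time to the next reaction is exponential with rate a_total.
        t += rng.expovariate(a_total)
        if t > t_end:
            break
        # Pick the reaction that fires, with probability proportional to its propensity.
        if rng.random() * a_total < a_birth:
            x += 1
        else:
            x -= 1
        times.append(t)
        counts.append(x)
    return times, counts


# Illustrative parameters; the seed only makes the run reproducible.
rng = random.Random(0)
times, counts = gillespie_birth_death(k_birth=10.0, k_death=0.1, x0=0, t_end=50.0, rng=rng)
```

With these assumed rates the count fluctuates around k_birth / k_death = 100 once the trajectory has settled, which is the kind of behavior a deterministic rate equation would smooth over.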

Date Created
2023-05
Agent

Examining the Significance of Economic Connectedness as an Indicator of Disparities in COVID-19 Infection Risk in Arizona ZCTAs

Description

Bridging social capital describes the diffusion of information across networks built between individuals of different social identities. This project aims to understand whether the bridging ties of economic connectedness (EC), measured using data on Facebook friendships and calculated as the average share of high socioeconomic status friends that an individual of low socioeconomic status has, can predict variations in COVID-19 infection risk across Arizona ZIP code tabulation areas (ZCTAs). Economic connectedness values across Arizona ZCTAs were examined, along with the correlation of EC with various social and demographic factors such as age, sex, race and ethnicity, educational background, income, and health insurance coverage. A multiple linear regression model was fitted to examine the association of EC with the biweekly COVID-19 growth rate from October 2020 to November 2021, and to examine longitudinal trends in the association between these two factors. The study found that the bridging ties of economic connectedness have a significant effect size comparable to that of other demographic features, and that EC could be used to identify vulnerabilities and health disparities in communities during the pandemic.
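The EC calculation described above can be sketched as follows. This is a toy illustration only: the function name and the input layout are assumptions, and the published EC measure derived from Facebook data may apply additional normalization or weighting that is not reproduced here.

```python
def economic_connectedness(friends_by_person):
    """Average share of high-SES friends among low-SES individuals.

    friends_by_person maps each low-SES individual (hypothetical IDs) to a
    list of booleans, one per friend, that is True when the friend is high-SES.
    Individuals with no recorded friends are skipped.
    """
    shares = [
        sum(friends) / len(friends)
        for friends in friends_by_person.values()
        if friends
    ]
    return sum(shares) / len(shares)


# Toy data for one hypothetical ZCTA: three low-SES residents.
toy_zcta = {
    "person_a": [True, False],                # half of friends are high-SES
    "person_b": [True, True, False, False],   # half of friends are high-SES
    "person_c": [False],                      # no high-SES friends
}
ec = economic_connectedness(toy_zcta)
```

Computing one such value per ZCTA yields the predictor that a regression on COVID-19 growth rate would then use alongside the demographic covariates.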

Date Created
2023-05
Agent

Novel Bioinformatics Methods for Co-expression Analysis of Single Cell RNA Sequencing and Circular RNA Sequencing Time Series Data

Description
Analyses of high-throughput transcriptome data, such as single-cell ribonucleic acid sequencing (scRNA-seq) and circular ribonucleic acid (circRNA) data, have led to significant breakthroughs, especially in cancer genomics. Analysis of transcriptome time series data is central to identifying the time point(s) at which drastic changes in gene transcription are associated with a transition from a homeostatic to a non-homeostatic cellular state (tipping points). In Chapter 2 of this dissertation, I present a novel cell-type-specific, co-expression-based tipping point detection method that identifies target gene (TG) and transcription factor (TF) pairs whose differential co-expression across time points drives biological changes in different cell types, together with the time point at which these changes are observed. This method was applied to scRNA-seq data sets from a SARS-CoV-2 study (18 time points), a human cerebellum development study (9 time points), and a lung injury study (18 time points). Similarly, leveraging transcriptome data across treatment time points, I developed methodologies to identify treatment-induced and cell-type-specific differentially co-expressed pairs (DCEPs). In Part 1 of Chapter 3, I present a pipeline that uses a series of statistical tests to detect DCEPs. This method was applied to scRNA-seq data from patients with non-small cell lung cancer (NSCLC) sequenced across cancer treatment times. However, this pipeline does not account for correlations among multiple single cells from the same sample or correlations among multiple samples from the same patient. In Part 2 of Chapter 3, I present a solution to this problem using a mixed-effects model. In Chapter 4, I summarize my work on the cross-species analysis of circRNA transcriptome time series data. I compared circRNA profiles in neonatal pig and mouse hearts, identified orthologous circRNAs, and discussed regulation mechanisms of cardiomyocyte proliferation and myocardial regeneration that are conserved between mouse and pig at different time points.
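To make the idea of differential co-expression concrete, here is a small sketch that compares the correlation of one TF-TG pair between two time points using Fisher's z-transformation. This is one common textbook statistic, swapped in for illustration; the abstract does not say which tests the dissertation's pipeline uses, and the expression values below are invented.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def fisher_z_difference(r1, n1, r2, n2):
    """z statistic for the difference between two independent correlations.

    Requires n1 > 3, n2 > 3, and |r| < 1 for both correlations.
    """
    z1, z2 = math.atanh(r1), math.atanh(r2)
    standard_error = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    return (z1 - z2) / standard_error


# Invented expression values for one TF and one TG across six cells.
tf_expr = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
tg_time1 = [1.2, 1.9, 3.2, 3.8, 5.1, 6.3]   # tracks the TF closely
tg_time2 = [3.0, 1.0, 4.0, 2.0, 3.5, 2.5]   # co-expression largely lost

r_time1 = pearson(tf_expr, tg_time1)
r_time2 = pearson(tf_expr, tg_time2)
z_stat = fisher_z_difference(r_time1, len(tf_expr), r_time2, len(tf_expr))
```

A large absolute z suggests the pair's co-expression changed between the two time points. Note that this statistic treats cells as independent, which is exactly the assumption the mixed-effects approach is meant to relax.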
Date Created
2022
Agent

Statistical Methods for Analysis of Genomic Data with Applications in Oncology

Description
This dissertation presents three novel algorithms with real-world applications to genomic oncology. While the methodologies presented here were all developed to overcome various challenges associated with the adoption of high-throughput genomic data in clinical oncology, they can be used in other domains as well. First, a network-informed feature ranking algorithm is presented, which shows a significant increase in the ability to select true predictive features from simulated data sets when compared to other state-of-the-art graphical feature ranking methods. The methodology also shows an increased ability to predict pathological complete response to preoperative chemotherapy from genomic sequencing data of breast cancer patients by utilizing domain knowledge from protein-protein interaction networks. Second, an algorithm is presented that classifies microsatellite instability (MSI) status from next-generation sequencing (NGS) data while overcoming the population biases inherent in the use of a human reference genome developed primarily from European populations. The methodology significantly increases the accuracy of MSI status prediction in African and African American ancestries. Finally, a single-variable model is presented to capture the bimodality inherent in genomic data stemming from heterogeneous diseases. This model shows improvements over other parametric models in estimating receiver operating characteristic (ROC) curves for bimodal data. The model is used to estimate ROC curves for heterogeneous biomarkers in a dataset containing breast cancer and cancer-free specimens.
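As background for the ROC discussion, the sketch below computes an empirical (nonparametric) ROC curve and its area for a simulated biomarker whose diseased group is bimodal. It does not implement the dissertation's parametric model, which is not specified in the abstract; the mixture proportions, means, and sample sizes are assumptions made up for the example.

```python
import random


def empirical_roc(scores_pos, scores_neg):
    """Return ROC points (FPR, TPR), calling a sample positive when score >= threshold."""
    thresholds = sorted(set(scores_pos) | set(scores_neg), reverse=True)
    points = [(0.0, 0.0)]
    for threshold in thresholds:
        tpr = sum(s >= threshold for s in scores_pos) / len(scores_pos)
        fpr = sum(s >= threshold for s in scores_neg) / len(scores_neg)
        points.append((fpr, tpr))
    return points


def area_under_curve(points):
    """Trapezoidal area under a curve given as ordered (x, y) points."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2.0
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Simulated heterogeneous disease: half of the diseased samples look like
# controls (mean 0), the other half have an elevated biomarker (mean 3).
rng = random.Random(1)
controls = [rng.gauss(0.0, 1.0) for _ in range(200)]
diseased = [rng.gauss(0.0, 1.0) if i % 2 == 0 else rng.gauss(3.0, 1.0) for i in range(200)]

roc_points = empirical_roc(diseased, controls)
auc = area_under_curve(roc_points)
```

Because only one subgroup of the diseased samples is shifted, a model that assumes a single normal distribution per class would fit this biomarker poorly, which is the situation a bimodality-aware model is intended to handle.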
Date Created
2021
Agent

Exercise, Genistein, and Their Combined Effect on Gut Microbiota and Mitochondrial Oxidative Capacity After 12-Week of a Western Diet on C57BL/6 Adult Mice

Description
Obesity is one of the most challenging health conditions of our time, characterized by complex interactions between behavioral, environmental, and genetic factors. These interactions lead to a distinctive obese phenotype. Twenty years ago, the gut microbiota (GM) was postulated as a significant factor contributing to the obese phenotype and associated metabolic disturbances. Exercise has been shown to improve and reverse the metabolic abnormalities in obese individuals, and genistein has a suggested anti-obesogenic effect. Studying the dynamic interaction of the GM with the organs relevant to metabolic homeostasis is crucial for the design of new long-term therapies to treat obesity. The purpose of this experimental study is to examine the effects of exercise (Exe), genistein (Gen), and their combined intervention (Exe + Gen) on GM composition and musculoskeletal mitochondrial oxidative function in diet-induced obese mice. The study also aims to explore the association between gut microbial diversity and mitochondrial oxidative capacity. A total of 132 adult male (n=63) and female (n=69) C57BL/6 mice were randomized to one of five interventions for twelve weeks: control (n=27), high-fat diet (HFD; n=26), HFD + Exe (n=28), HFD + Gen (n=27), or HFD + Exe + Gen (n=24). The drinking water of all HFD groups was supplemented with 42 g sugar/L. Fecal pellets were collected, DNA was extracted, and microbial composition was measured by sequencing the V4 region of the 16S rRNA gene on an Illumina platform. Mitochondrial oxidative capacity was assessed by measuring the enzymatic kinetic activity of citrate synthase (CS) in forty-nine mice. The Exe groups had significantly higher bacterial richness than the HFD + Gen or HFD groups. Exe + Gen showed a synergistic effect in driving the GM toward the control group's composition: Ruminococcus was significantly more abundant in the HFD + Exe + Gen group than in the other HFD groups. None of the interventions showed a preventive effect on CS activity. Therefore, further research is needed to confirm the synergistic effect of Exe and Gen on gut bacterial richness and their capacity to prevent the deleterious effects of an HFD on the GM and mitochondrial oxidative capacity.
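Bacterial richness and diversity, as referred to above, are typically summarized from per-sample taxon counts. The sketch below shows two standard summaries, observed richness and the Shannon index, on invented counts; the abstract does not state which richness metric or software the study used, so this is an illustration rather than a reconstruction of its analysis.

```python
import math


def observed_richness(taxon_counts):
    """Number of taxa detected (count greater than zero) in one sample."""
    return sum(1 for count in taxon_counts if count > 0)


def shannon_index(taxon_counts):
    """Shannon diversity H = -sum(p * ln p) over taxa with nonzero counts."""
    total = sum(taxon_counts)
    return -sum(
        (count / total) * math.log(count / total)
        for count in taxon_counts
        if count > 0
    )


# Invented read counts per taxon for two hypothetical fecal samples.
sample_hfd = [120, 0, 30, 0, 50]
sample_hfd_exe = [60, 25, 40, 35, 40]

richness_hfd = observed_richness(sample_hfd)
richness_hfd_exe = observed_richness(sample_hfd_exe)
shannon_hfd = shannon_index(sample_hfd)
shannon_hfd_exe = shannon_index(sample_hfd_exe)
```

In practice, 16S read counts are usually rarefied or otherwise normalized to a common sequencing depth before richness is compared across groups.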
Date Created
2021
Agent

Learning RNA Viral Disease Dynamics from Molecular Sequence Data

Description
The severity of the health and economic devastation resulting from outbreaks of viruses such as Zika, Ebola, SARS-CoV-1 and, most recently, SARS-CoV-2 underscores the need for tools that delineate the critical disease dynamical features underlying observed patterns of infectious disease spread. The growing emphasis placed on genome sequencing to support pathogen outbreak response highlights the need to adapt traditional epidemiological metrics to leverage this increasingly rich data stream. Further, the rapidity with which pathogen molecular sequence data are now generated, coupled with the advent of sophisticated Bayesian statistical techniques for pathogen molecular sequence analysis, creates an unprecedented opportunity to disrupt and innovate public health surveillance using 21st century tools. Bayesian phylogeography is a modeling framework which assumes that discrete traits -- such as age, location of sampling, or species -- evolve according to a continuous-time Markov chain process along a phylogenetic tree topology which is inferred from molecular sequence data.

While myriad studies reconstruct patterns of discrete trait evolution along an inferred phylogeny, attempts to translate the results of phylogeographic analyses into actionable metrics that public health agencies can use to direct the development of interventions aimed at reducing pathogen spread are conspicuously absent from the literature. In this dissertation, I focus on developing an intuitive metric, the phylogenetic risk ratio (PRR), which I use to translate the results of Bayesian phylogeographic modeling studies into a form actionable by public health agencies. I apply the PRR to two case studies: i) age-associated diffusion of influenza A/H3N2 during the 2016-17 US epidemic and ii) host-associated diffusion of West Nile virus in the US. I discuss the limitations of this approach (and of Bayesian phylogeographic approaches generally) when studying non-geographic traits for which limited metadata is available in public molecular sequence databases, along with statistically principled solutions to the missing metadata problem in the phylogenetic context. Then, I perform a simulation study to evaluate the statistical performance of the missing metadata solution. Finally, I provide a solution for researchers who are interested in using the PRR and phylogenetic UTMs in their own genomic epidemiological studies yet are deterred by the idiosyncratic, error-prone processes required to implement these methods using popular Bayesian phylogenetic inference software packages. My solution, Build-A-BEAST, is a publicly available, object-oriented system written in Python which aims to reduce the complexity and idiosyncrasy of creating the XML files necessary to perform the aforementioned analyses.
This dissertation extends the conceptual framework of Bayesian phylogeographic methods, develops a method to translate the output of phylogenetic models into an actionable form, evaluates the use of priors for missing metadata, and, finally, provides a solution which eases the implementation of these methods. In doing so, I lay the foundation for future work in disseminating and implementing Bayesian phylogeographic methods for routine public health surveillance.
Date Created
2020
Agent

Biomarkers of Familial Speech Sound Disorders: Genes, Perception, and Motor Control

Description
Speech sound disorders (SSDs) are the most prevalent type of communication disorder in children. Clinically, speech-language pathologists (SLPs) rely on behavioral methods for assessing and treating SSDs. Though clients typically experience improved speech outcomes as a result of therapy, there is evidence that underlying deficits may persist even in individuals who have completed treatment for surface-level speech behaviors. Advances in the field of genetics have created the opportunity to investigate the contribution of genes to human communication. Due to the heterogeneity of many communication disorders, the manner in which specific genetic changes influence neural mechanisms, and thereby behavioral phenotypes, remains largely unknown. The purpose of this study was to identify genotype-phenotype associations, along with perceptual and motor-related biomarkers, within families displaying SSDs. Five parent-child trios participated in genetic testing, and five families participated in a combination of genetic and behavioral testing to help elucidate biomarkers related to SSDs. All of the affected individuals had a history of childhood apraxia of speech (CAS), except in one family that displayed a phonological disorder. Genetic investigation yielded several genes of interest relevant for an SSD phenotype: CNTNAP2, CYFIP1, GPR56, HERC1, KIAA0556, LAMA5, LAMB1, MDGA2, MECP2, NBEA, SHANK3, TENM3, and ZNF142. All of these genes showed at least some expression in the developing brain. Gene ontology analysis yielded terms supporting a genetic influence on central nervous system development. Behavioral testing revealed evidence of a sequential processing biomarker for all individuals with CAS, with many showing deficits in sequential motor skills in addition to speech deficits. In some families, participants also showed evidence of a co-occurring perceptual processing biomarker. The family displaying a phonological phenotype showed milder sequential processing deficits than the CAS families. Overall, this study supports the presence of a sequential processing biomarker for CAS and shows that relevant genes of interest may be influencing a CAS phenotype via sequential processing. Knowledge of these biomarkers can help strengthen the precision of clinical assessment and motivate the development of novel interventions for individuals with SSDs.
Date Created
2020
Agent

Fine Mapping Functional Noncoding Genetic Elements Via Machine Learning

Description
All biological processes, such as cell growth, cell differentiation, development, and aging, require a series of steps that are characterized by gene regulation. Studies have shown that gene regulation is key to various traits and diseases. Gene regulation is affected by various factors, including genetic signals, epigenetic tracks, and genetic variants. Deciphering and cataloging these functional genetic elements in the non-coding regions of the genome is one of the biggest challenges in precision medicine and genetic research. This thesis presents two different approaches to identifying these elements: TreeMap and DeepCORE. The first approach, TreeMap, identifies putative causal genetic variants in cis-eQTL while accounting for multisite effects and genetic linkage at a locus. It performs an organized search for individual and multiple causal variants using a tree-guided nested machine learning method. DeepCORE, on the other hand, explores novel deep learning techniques that model the relationship between genetic, epigenetic, and transcriptional patterns across tissues and cell lines and identify cooperative regulatory elements that affect gene regulation. These two methods are believed to provide a link for genotype-phenotype association and to be a necessary step toward explaining various complex diseases and missing heritability.
Date Created
2020
Agent