Discovering subclones and their driver genes in tumors sequenced at standard depths

Description
Understanding intratumor heterogeneity and the associated driver genes is critical to designing personalized treatments and improving clinical outcomes of cancers. Such investigations require accurate delineation of the subclonal composition of a tumor, which to date can only be reliably inferred from deep-sequencing data (>300x depth). The algorithm resulting from the work presented here incorporates an adaptive error model into the statistical decomposition of mixed populations, which corrects the mean-variance dependency of sequencing data at the subclonal level and enables accurate subclonal discovery in tumors sequenced at standard depths (30-50x). Tested on extensive computer simulations and real-world data, this new method, named model-based adaptive grouping of subclones (MAGOS), consistently outperforms existing methods on minimum sequencing depth, decomposition accuracy, and computational efficiency. MAGOS supports subclone analysis using single nucleotide variants and copy number variants from one or more samples of an individual tumor. The GUST algorithm, on the other hand, is a novel method for detecting cancer type-specific driver genes. Combining MAGOS and GUST results can provide insights into cancer progression. Applying MAGOS and GUST to whole-exome sequencing data from samples of 33 different cancer types revealed a significant association between subclonal diversity, the associated drivers, and patient overall survival.
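
The mean-variance dependency mentioned above can be illustrated with a minimal sketch. It assumes plain binomial read sampling, which is a textbook simplification and not the adaptive error model used by MAGOS; the function names, the example allele frequencies, and the 95% z-value are all illustrative assumptions.

```python
import math


def vaf_standard_error(vaf: float, depth: int) -> float:
    """Binomial standard error of an observed variant allele frequency (VAF).

    Assumption: reads are independent draws, so Var(VAF) = p * (1 - p) / depth.
    The variance depends on both the mean (p) and the sequencing depth.
    """
    return math.sqrt(vaf * (1.0 - vaf) / depth)


def separable(vaf_a: float, vaf_b: float, depth: int, z: float = 1.96) -> bool:
    """True if the approximate 95% intervals of two single-variant VAFs do not overlap."""
    gap = abs(vaf_a - vaf_b)
    spread = z * (vaf_standard_error(vaf_a, depth) + vaf_standard_error(vaf_b, depth))
    return gap > spread


# Two hypothetical subclones with expected VAFs of 0.25 and 0.15.
for depth in (40, 300):
    print(depth, separable(0.25, 0.15, depth))
# prints "40 False" then "300 True"
```

Under this simplified model, a single variant cannot distinguish the two hypothetical subclones at 40x but can at 300x, which is one way to see why subclone decomposition at standard depths needs an error model that accounts for depth-dependent variance.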
Date Created
2019
Agent

A Retrospective Investigation to Assess the Potential Application of Predictive Machine Learning Algorithms in Oncology Clinical Trials

Description
The purpose of this investigation is to apply a machine learning algorithm to de-identified, historic oncology clinical trial data to assess the theoretical understanding of predictive modeling and derive potential clinical practice recommendations. Within this study, electronic medical records from the HonorHealth Virginia G. Piper Institute underwent data visualization to identify potential correlations and trends critical for model creation, as well as to identify potential expansions or limitations of scope regarding model purpose. The hypothesis pursued after data visualization was the development of a predictive model for 6-month survival. The current standard is an estimated physician accuracy of 56.5% at 6 months out. This study created supervised learning models using decision trees, KNN, SVM, and ensemble methods, with combinations of LASSO logistic regression and Know-GRRF random forest for feature selection. An SVM trained on a combined set of LASSO and Know-GRRF features produced the highest-performing model at 75.5% accuracy with an AUC of 0.82. This study demonstrates the potential for applying predictive modeling to readily available EMR records to drive clinical practice recommendations. The models developed could potentially, with further development, be used as an ancillary tool for jumpstarting patient-physician conversations on survival and life expectancy.
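
The two performance figures quoted above, accuracy and AUC, measure different things. The following sketch shows how each can be computed from predicted survival probabilities. The labels and scores are made-up toy values, not data from the study, and the 0.5 decision threshold is an assumption.

```python
def accuracy(y_true, y_score, threshold=0.5):
    """Fraction of cases where the thresholded score matches the true label."""
    hits = sum((score >= threshold) == bool(label)
               for label, score in zip(y_true, y_score))
    return hits / len(y_true)


def roc_auc(y_true, y_score):
    """AUC as the probability that a random positive outranks a random negative.

    Ties count as half a win (the Mann-Whitney formulation of the AUC).
    """
    pos = [s for label, s in zip(y_true, y_score) if label == 1]
    neg = [s for label, s in zip(y_true, y_score) if label == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: 1 = survived 6 months, 0 = did not (hypothetical values).
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2]
print(round(accuracy(labels, scores), 3), round(roc_auc(labels, scores), 3))
# prints "0.667 0.889"
```

Accuracy depends on the chosen threshold, whereas AUC summarizes ranking quality across all thresholds, which is why the two are usually reported together.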
Date Created
2019-05
Agent

Genetic variations and associated electrophysiological and behavioral traits in children with childhood apraxia of speech

Description
Childhood Apraxia of Speech (CAS) is a severe motor speech disorder that is difficult to diagnose as there is currently no gold-standard measurement to differentiate between CAS and other speech disorders. In the present study, we investigate underlying biomarkers associated with CAS in addition to enhanced phenotyping through behavioral testing. Cortical electrophysiological measures were utilized to investigate differences in neural activation in response to native and non-native vowel contrasts between children with CAS and typically developing peers. Genetic analysis included full exome sequencing of a child with CAS and his unaffected parents in order to uncover underlying genetic variation that may be causal to the child’s severely impaired speech and language. Enhanced phenotyping was completed through extensive behavioral testing, including speech, language, reading, spelling, phonological awareness, gross/fine motor, and oral and hand motor tasks. Results from cortical electrophysiological measures are consistent with previous evidence of a heightened neural response to non-native sounds in CAS, potentially indicating overspecified phonological representations in this population. Results of exome sequencing suggest multiple genetic variations contributing to the severely affected phenotype in the child and provide further evidence of heterogeneous genomic pathways associated with CAS. Finally, results of behavioral testing demonstrate significant impairments evident across tasks in CAS, suggesting underlying sequential processing deficits in multiple domains. Overall, these results have the potential to delineate functional pathways from genetic variations to the brain to observable behavioral phenotypes and motivate the development of preventative and targeted treatment approaches.
Date Created
2018
Agent

Circular RNA characterization and regulatory network prediction in human tissue

Description
Circular RNAs (circRNAs) are a class of endogenous, non-coding RNAs that are formed when exons back-splice to each other and represent a new area of transcriptomics research. Numerous RNA sequencing (RNAseq) studies since 2012 have revealed that circRNAs are pervasively expressed in eukaryotes, especially in the mammalian brain. While their functional role and impact remain to be clarified, circRNAs have been found to regulate micro-RNAs (miRNAs) as well as parental gene transcription and may thus have key roles in transcriptional regulation. Although circRNAs have continued to gain attention, our understanding of their expression in a cell-, tissue-, and brain region-specific context remains limited. Further, computational algorithms produce varied results in terms of which circRNAs are detected. This thesis aims to advance current knowledge of circRNA expression in a region-specific context, focusing on the human brain, as well as to address computational challenges.

The overarching goal of my research unfolds over three aims: (i) evaluating circRNAs and their predicted impact on transcriptional regulatory networks in cell-specific RNAseq data; (ii) developing a novel solution for de novo detection of full-length circRNAs as well as in silico validation of selected circRNA junctions using assembly; and (iii) applying these assembly-based detection and validation workflows, and integrating existing tools, to systematically identify and characterize circRNAs in functionally distinct human brain regions. To this end, I have developed novel bioinformatics workflows that are applicable to non-polyA selected RNAseq datasets and can be used to characterize circRNA expression across various sample types and diseases. Further, I establish a reference dataset of circRNA expression profiles and regulatory networks in a brain region-specific manner. This resource, along with existing databases such as circBase, will be invaluable in advancing circRNA research as well as improving our understanding of their role in transcriptional regulation and various neurological conditions.
Date Created
2018
Agent

Expansion and Application of Pathways of Topological Rank Analysis (PoTRA) to Various Cancers

Description
Cancer is the second leading cause of death in the United States. It is a serious, complex disease in which cells grow uncontrollably, causing millions of deaths per year [1]. Cancer is usually caused by a combination of environmental variables and biological pathways. The pathways normally have a very robust structure, but are altered because of cancer, resulting in a loss of connectivity between pathways. In order to detect these pathways, a PageRank-based method called Pathways of Topological Rank Analysis (PoTRA) was created, which measures the relative rankings of the genes in each pathway. Applying this algorithm allows us to identify which pathways differ significantly between areas with cancer and areas without cancer. This would allow scientists to focus on specific pathways in order to learn more about the cancer and find more effective ways to treat it. So far, analysis using PoTRA has been successfully conducted on hepatocellular carcinoma (HCC) and its subtypes, with all significant pathways found being cancer-associated. Now, using the TCGA data stored in Google Cloud's BigQuery, we created a pipeline to apply PoTRA to other cancer data sets and see how well it cross-applies to other cancers. The results show that even though some modifications may need to be made to adapt to other datasets, many significant pathways were found for both HCC and breast cancer.
Date Created
2018-05
Agent

Analysis of HIV Risk Groups Using Bayesian Analysis

Description
Phylogenetic analyses conducted in the past lacked the functionality to inform and implement useful public health decisions using clustering. Models can be constructed to support further analyses that yield meaningful data for the future of public health informatics. A phylogenetic tree is considered one of the best ways for researchers to visualize and analyze the evolutionary history of a virus. The focus of this study was to research HIV phylodynamic and phylogenetic methods. This involved identifying fast-growing HIV transmission clusters and rates for certain risk groups in the US. Achieving these results required an HIV database from which to retrieve real-time data, alignment software for multiple sequence alignment, Bayesian analysis software for the development and manipulation of models, and graphical tools for visualizing the output of the models created. This study began with a literature review on HIV phylogeography and phylodynamics. Sequence data were then obtained from a sequence database and run through multiple sequence alignment software; the alignment was required because the sequences obtained were unaligned. Once the alignment was performed, the aligned file was loaded into Bayesian analysis software to build a phylogenetic tree model. Once the model was created, the tree was edited in tree visualization software so that it could be easily interpreted. The output of the tree in this study is attributed to distant homology or to the mixing of certain parameters. As a continuation of this study, it would be interesting to take the same aligned sequences and use different model parameter selections for the initial creation of the model to see how the output changes, because one small change to a model parameter could greatly affect the output of the phylogenetic tree.
Date Created
2018-05
Agent

Topological analysis of biological pathways : genes, microRNAs and pathways involved in hepatocellular carcinoma

Description
Rewired biological pathways and/or rewired microRNA (miRNA)-mRNA interactions might also influence the activity of biological pathways. Here, rewired biological pathways are defined as the differential (rewiring) effects of genes on the topology of biological pathways between controls and cases. Similarly, rewired miRNA-mRNA interactions are defined as the differential (rewiring) effects of miRNAs on the topology of biological pathways between controls and cases. The dissertation discusses how rewired biological pathways (Chapter 1) and/or rewired miRNA-mRNA interactions (Chapter 2) aberrantly influence the activity of biological pathways and their association with disease.

This dissertation proposes two PageRank-based analytical methods, Pathways of Topological Rank Analysis (PoTRA) and miR2Pathway, discussed in Chapter 1 and Chapter 2, respectively. PoTRA focuses on detecting pathways with an altered number of hub genes in corresponding pathways between two phenotypes. The basis for PoTRA is that the loss of connectivity is a common topological trait of cancer networks, together with the prior knowledge that a normal biological network is a scale-free network whose degree distribution follows a power law, where a small number of nodes are hubs and a large number of nodes are non-hubs. From normal to cancer, however, the process of the network losing connectivity might disrupt the scale-free structure of the network; that is, the number of hub genes might be altered in cancer compared to that in normal samples. Hence, it is hypothesized that if the number of hub genes in a pathway differs between normal and cancer, this pathway might be involved in cancer. MiR2Pathway focuses on quantifying the differential effects of miRNAs on the activity of a biological pathway when miRNA-mRNA connections are altered from normal to disease, and on ranking the disease risk of rewired miRNA-mediated biological pathways. This dissertation explores how rewired gene-gene interactions and rewired miRNA-mRNA interactions lead to aberrant activity of biological pathways, and ranks pathways by their disease risk. The two methods proposed here can be used to complement existing genomics analysis methods to facilitate the study of biological mechanisms behind disease at the systems level.
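
The hub-gene idea behind PoTRA can be sketched in a few lines. This is only an illustration of PageRank-based hub counting, not the PoTRA implementation: the abstract does not describe how the networks are built or how differences are tested, so the gene names (G1 to G6), the toy networks, and the hub cutoff of 1.5 times the mean rank are all hypothetical.

```python
def pagerank(edges, damping=0.85, iterations=100):
    """Power-iteration PageRank over a list of directed (source, target) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread evenly.
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        base = (1.0 - damping + damping * dangling) / len(nodes)
        new_rank = {node: base for node in nodes}
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank


def count_hubs(rank, factor=1.5):
    """Count nodes whose rank exceeds `factor` times the mean (hypothetical cutoff)."""
    mean_rank = sum(rank.values()) / len(rank)
    return sum(1 for value in rank.values() if value > factor * mean_rank)


def undirected(pairs):
    """Represent each undirected gene-gene link as two directed edges."""
    return [(a, b) for a, b in pairs] + [(b, a) for a, b in pairs]


# Toy "normal" pathway: one hub gene (G1) connected to five others.
normal_edges = undirected([("G1", g) for g in ("G2", "G3", "G4", "G5", "G6")])
# Toy "cancer" pathway: the same genes rewired into a ring, so no gene dominates.
cancer_edges = undirected([("G1", "G2"), ("G2", "G3"), ("G3", "G4"),
                           ("G4", "G5"), ("G5", "G6"), ("G6", "G1")])

print(count_hubs(pagerank(normal_edges)), count_hubs(pagerank(cancer_edges)))
# prints "1 0"
```

In this toy case the hub count drops from one to zero when the star is rewired into a ring, which is the kind of difference the hypothesis above would flag as a candidate cancer-related pathway.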
Date Created
2017
Agent

FlyExpress 7: An Integrated Discovery Platform To Study Coexpressed Genes Using in Situ Hybridization Images in Drosophila

Description
Gene expression patterns assayed across development can offer key clues about a gene’s function and regulatory role. Drosophila melanogaster is ideal for such investigations as multiple individual and high-throughput efforts have captured the spatiotemporal patterns of thousands of embryonic expressed genes in the form of in situ images. FlyExpress (www.flyexpress.net), a knowledgebase based on a massive and unique digital library of standardized images and a simple search engine to find coexpressed genes, was created to facilitate the analytical and visual mining of these patterns. Here, we introduce the next generation of FlyExpress resources to facilitate the integrative analysis of sequence data and spatiotemporal patterns of expression from images. FlyExpress 7 now includes over 100,000 standardized in situ images and implements a more efficient, user-defined search algorithm to identify coexpressed genes via Genomewide Expression Maps (GEMs). Shared motifs found in the upstream 5′ regions of any pair of coexpressed genes can be visualized in an interactive dotplot. Additional webtools and link-outs to assist in the downstream validation of candidate motifs are also provided. Together, FlyExpress 7 represents our largest effort yet to accelerate discovery via the development and dispersal of new webtools that allow researchers to perform data-driven analyses of coexpression (image) and genomic (sequence) data.
Date Created
2017-06-30
Agent

Novel methods of biomarker discovery and predictive modeling using Random Forest

Description
Random forest (RF) is a popular and powerful technique. It can be used for classification, regression, and unsupervised clustering. In its original form, introduced by Leo Breiman, RF is used as a predictive model to generate predictions for new observations. Recent research has proposed several methods based on RF for feature selection and for generating prediction intervals. However, they are limited in their applicability and accuracy. In this dissertation, RF is applied to build a predictive model for a complex dataset, and used as the basis for two novel methods for biomarker discovery and for generating prediction intervals.

Firstly, a biodosimetry model is developed using RF to determine absorbed radiation dose from gene expression measured in blood samples of potentially exposed individuals. To improve the prediction accuracy of the biodosimetry model, day-specific models were built to deal with the day interaction effect, and a technique of nested modeling was proposed. The nested models can fit this complex data of large variability and non-linear relationships.

Secondly, a panel of biomarkers was selected using a data-driven feature selection method as well as by hand-picking, considering prior knowledge and other constraints. To incorporate domain knowledge, a method called Know-GRRF was developed based on guided regularized RF. This method can incorporate domain knowledge as a penalty term to regulate the selection of candidate features in RF. It adds more flexibility to data-driven feature selection and can improve the interpretability of models. Know-GRRF showed significant improvement in cross-species prediction when cross-species correlation was used to guide the selection of biomarkers. The method can also compete with existing methods when intrinsic data characteristics are used as an alternative to domain knowledge in simulated datasets.

Lastly, a novel non-parametric method, RFerr, was developed to generate prediction intervals using RF regression. This method is widely applicable to any predictive model and was shown to have better coverage and precision than existing methods on the real-world radiation dataset, as well as on benchmark and simulated datasets.
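
As a rough illustration of what a non-parametric prediction interval looks like, the sketch below builds an interval from empirical quantiles of held-out prediction errors. It is a generic technique shown under stated assumptions, not the RFerr algorithm, whose details are not given in the abstract. The function names are hypothetical, and the errors could come, for example, from out-of-bag predictions of a random forest.

```python
def empirical_quantile(sorted_values, q):
    """Linear-interpolation quantile of pre-sorted values, with 0 <= q <= 1."""
    position = q * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] * (1.0 - fraction) + sorted_values[upper] * fraction


def prediction_interval(point_prediction, held_out_errors, coverage=0.9):
    """Interval around a point prediction from quantiles of past errors.

    Errors are defined as (observed - predicted) on data not used for fitting.
    No distributional form (such as normality) is assumed for the errors.
    """
    errors = sorted(held_out_errors)
    alpha = (1.0 - coverage) / 2.0
    return (point_prediction + empirical_quantile(errors, alpha),
            point_prediction + empirical_quantile(errors, 1.0 - alpha))


# Hypothetical held-out errors and a new point prediction of 10.0.
errors = [-5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5]
low, high = prediction_interval(10.0, errors, coverage=0.8)
print(round(low, 6), round(high, 6))
# prints "6.0 14.0"
```

Because the interval is driven by observed errors, it can be asymmetric around the prediction when the model tends to over- or under-predict.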
Date Created
2017
Agent

Evolution-Informed Modeling Improves Outcome Prediction for Cancers

Description
Despite wide applications of high-throughput biotechnologies in cancer research, many biomarkers discovered by exploring large-scale omics data do not provide satisfactory performance when used to predict cancer treatment outcomes. This problem is partly due to the overlooking of functional implications of molecular markers. Here, we present a novel computational method that uses evolutionary conservation as prior knowledge to discover bona fide biomarkers. Evolutionary selection at the molecular level is nature's test on functional consequences of genetic elements. By prioritizing genes that show significant statistical association and high functional impact, our new method reduces the chances of including spurious markers in the predictive model. When applied to predicting therapeutic responses for patients with acute myeloid leukemia and to predicting metastasis for patients with prostate cancers, the new method gave rise to evolution-informed models that enjoyed low complexity and high accuracy. The identified genetic markers also have significant implications in tumor progression and embrace potential drug targets. Because evolutionary conservation can be estimated as a gene-specific, position-specific, or allele-specific parameter on the nucleotide level and on the protein level, this new method can be extended to apply to miscellaneous “omics” data to accelerate biomarker discoveries.

Date Created
2016-10-21
Agent