Methods for Detecting Mutations in Non-model Organisms

158849-Thumbnail Image.png
Description
Next-generation sequencing is a powerful tool for detecting genetic variation. How-ever, it is also error-prone, with error rates that are much larger than mutation rates.
This can make mutation detection difficult; and while increasing sequencing depth
can often help, sequence-specific errors and

Next-generation sequencing is a powerful tool for detecting genetic variation. How-ever, it is also error-prone, with error rates that are much larger than mutation rates.
This can make mutation detection difficult; and while increasing sequencing depth
can often help, sequence-specific errors and other non-random biases cannot be de-
tected by increased depth. The problem of accurate genotyping is exacerbated when
there is not a reference genome or other auxiliary information available.
I explore several methods for sensitively detecting mutations in non-model or-
ganisms using an example Eucalyptus melliodora individual. I use the structure of
the tree to find bounds on its somatic mutation rate and evaluate several algorithms
for variant calling. I find that conventional methods are suitable if the genome of a
close relative can be adapted to the study organism. However, with structured data,
a likelihood framework that is aware of this structure is more accurate. I use the
techniques developed here to evaluate a reference-free variant calling algorithm.
I also use this data to evaluate a k-mer based base quality score recalibrator
(KBBQ), a tool I developed to recalibrate base quality scores attached to sequencing
data. Base quality scores can help detect errors in sequencing reads, but are often
inaccurate. The most popular method for correcting this issue requires a known
set of variant sites, which is unavailable in most cases. I simulate data and show
that errors in this set of variant sites can cause calibration errors. I then show that
KBBQ accurately recalibrates base quality scores while requiring no reference or other
information and performs as well as other methods.
Finally, I use the Eucalyptus data to investigate the impact of quality score calibra-
tion on the quality of output variant calls and show that improved base quality score
calibration increases the sensitivity and reduces the false positive rate of a variant
calling algorithm.
Date Created
2020
Agent

Gene Families in Cancer: Using phylogenetic data to examine an atavistic model of cancer

136199-Thumbnail Image.png
Description
Despite the 40-year war on cancer, very limited progress has been made in developing a cure for the disease. This failure has prompted the reevaluation of the causes and development of cancer. One resulting model, coined the atavistic model of

Despite the 40-year war on cancer, very limited progress has been made in developing a cure for the disease. This failure has prompted the reevaluation of the causes and development of cancer. One resulting model, coined the atavistic model of cancer, posits that cancer is a default phenotype of the cells of multicellular organisms which arises when the cell is subjected to an unusual amount of stress. Since this default phenotype is similar across cell types and even organisms, it seems it must be an evolutionarily ancestral phenotype. We take a phylostratigraphical approach, but systematically add species divergence time data to estimate gene ages numerically and use these ages to investigate the ages of genes involved in cancer. We find that ancient disease-recessive cancer genes are significantly enriched for DNA repair and SOS activity, which seems to imply that a core component of cancer development is not the regulation of growth, but the regulation of mutation. Verification of this finding could drastically improve cancer treatment and prevention.
Date Created
2015-05
Agent