Gene Network Inference via Sequence Alignment and Rectification
Document
Description
While techniques for reading DNA in some capacity has been possible for decades,
the ability to accurately edit genomes at scale has remained elusive. Novel techniques
have been introduced recently to aid in the writing of DNA sequences. While writing
DNA is more accessible, it still remains expensive, justifying the increased interest in
in silico predictions of cell behavior. In order to accurately predict the behavior of
cells it is necessary to extensively model the cell environment, including gene-to-gene
interactions as completely as possible.
Significant algorithmic advances have been made for identifying these interactions,
but despite these improvements current techniques fail to infer some edges, and
fail to capture some complexities in the network. Much of this limitation is due to
heavily underdetermined problems, whereby tens of thousands of variables are to be
inferred using datasets with the power to resolve only a small fraction of the variables.
Additionally, failure to correctly resolve gene isoforms using short reads contributes
significantly to noise in gene quantification measures.
This dissertation introduces novel mathematical models, machine learning techniques,
and biological techniques to solve the problems described above. Mathematical
models are proposed for simulation of gene network motifs, and raw read simulation.
Machine learning techniques are shown for DNA sequence matching, and DNA
sequence correction.
Results provide novel insights into the low level functionality of gene networks. Also
shown is the ability to use normalization techniques to aggregate data for gene network
inference leading to larger data sets while minimizing increases in inter-experimental
noise. Results also demonstrate that high error rates experienced by third generation
sequencing are significantly different than previous error profiles, and that these errors can be modeled, simulated, and rectified. Finally, techniques are provided for amending this DNA error that preserve the benefits of third generation sequencing.
the ability to accurately edit genomes at scale has remained elusive. Novel techniques
have been introduced recently to aid in the writing of DNA sequences. While writing
DNA is more accessible, it still remains expensive, justifying the increased interest in
in silico predictions of cell behavior. In order to accurately predict the behavior of
cells it is necessary to extensively model the cell environment, including gene-to-gene
interactions as completely as possible.
Significant algorithmic advances have been made for identifying these interactions,
but despite these improvements current techniques fail to infer some edges, and
fail to capture some complexities in the network. Much of this limitation is due to
heavily underdetermined problems, whereby tens of thousands of variables are to be
inferred using datasets with the power to resolve only a small fraction of the variables.
Additionally, failure to correctly resolve gene isoforms using short reads contributes
significantly to noise in gene quantification measures.
This dissertation introduces novel mathematical models, machine learning techniques,
and biological techniques to solve the problems described above. Mathematical
models are proposed for simulation of gene network motifs, and raw read simulation.
Machine learning techniques are shown for DNA sequence matching, and DNA
sequence correction.
Results provide novel insights into the low level functionality of gene networks. Also
shown is the ability to use normalization techniques to aggregate data for gene network
inference leading to larger data sets while minimizing increases in inter-experimental
noise. Results also demonstrate that high error rates experienced by third generation
sequencing are significantly different than previous error profiles, and that these errors can be modeled, simulated, and rectified. Finally, techniques are provided for amending this DNA error that preserve the benefits of third generation sequencing.