Statistical Sequence Alignment of Protein Coding Regions

Description
Sequence alignment is an essential method in bioinformatics and the basis of many analyses, including phylogenetic inference, ancestral sequence reconstruction, and gene annotation. Sequence artifacts and errors made in alignment reconstruction can impact downstream analyses, leading to erroneous conclusions in comparative and functional genomic studies. While such errors are eventually fixed in the reference genomes of model organisms, many genomes used by researchers contain these artifacts, often forcing researchers to discard large amounts of data to prevent artifacts from impacting results. I developed COATi, a statistical, codon-aware pairwise aligner designed to align protein-coding sequences in the presence of artifacts commonly introduced by sequencing or annotation errors, such as early stop codons and abiological frameshifts. Unlike common sequence aligners, which rely on amino acid translations, only model insertions and deletions between codons, or lack a statistical model, COATi combines a codon substitution model specifically designed for protein-coding regions, a complex insertion-deletion model, and a sequencing base calling error step. The alignment algorithm is based on finite state transducers (FSTs), computational machines well-suited for modeling sequence evolution. I show that COATi outperforms available methods using a simulated empirical pairwise alignment dataset as a benchmark. The FST-based model and alignment algorithm in COATi are resource-intensive for sequences longer than a few kilobases. To address this constraint, I developed an approximate model compatible with traditional dynamic programming alignment algorithms. I describe how the original codon substitution model is transformed to build an approximate model and how the alignment algorithm is implemented by modifying the popular Gotoh algorithm. I simulated a benchmark of alignments and measured how well the marginal models approximate the original method.
Finally, I present a novel tool for analyzing sequence alignments. Available metrics can measure the similarity between two alignments or the column uncertainty within an alignment but cannot produce a site-specific comparison of two or more alignments. AlnDotPlot is an R software package inspired by traditional dot plots that can provide valuable insights when comparing pairwise alignments. I describe AlnDotPlot and showcase its utility in displaying a single alignment, comparing different pairwise alignments, and summarizing alignment space.
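For readers unfamiliar with the Gotoh algorithm mentioned in the description above, the following is a minimal sketch of the textbook version: global pairwise alignment scoring with affine gap penalties using three dynamic programming matrices. It is not COATi's modified algorithm or its codon model. The function name `gotoh_score` and the scoring parameters (`match`, `mismatch`, `gap_open`, `gap_extend`) are illustrative assumptions, not values from the thesis.

```python
import math


def gotoh_score(a, b, match=1.0, mismatch=-1.0, gap_open=-3.0, gap_extend=-1.0):
    """Global alignment score with affine gaps (textbook Gotoh recursion).

    A gap of length k costs gap_open + (k - 1) * gap_extend.
    All scoring values are hypothetical defaults chosen for illustration.
    """
    n, m = len(a), len(b)
    neg = -math.inf
    # M: a[i-1] aligned to b[j-1]
    # X: a[i-1] aligned to a gap (gap in b)
    # Y: b[j-1] aligned to a gap (gap in a)
    M = [[neg] * (m + 1) for _ in range(n + 1)]
    X = [[neg] * (m + 1) for _ in range(n + 1)]
    Y = [[neg] * (m + 1) for _ in range(n + 1)]

    M[0][0] = 0.0
    for i in range(1, n + 1):
        X[i][0] = gap_open + (i - 1) * gap_extend
    for j in range(1, m + 1):
        Y[0][j] = gap_open + (j - 1) * gap_extend

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # Extend any state with a match/mismatch column.
            M[i][j] = s + max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            # Open a new gap in b, or extend an existing one.
            X[i][j] = max(M[i - 1][j] + gap_open,
                          X[i - 1][j] + gap_extend,
                          Y[i - 1][j] + gap_open)
            # Open a new gap in a, or extend an existing one.
            Y[i][j] = max(M[i][j - 1] + gap_open,
                          Y[i][j - 1] + gap_extend,
                          X[i][j - 1] + gap_open)

    return max(M[n][m], X[n][m], Y[n][m])


if __name__ == "__main__":
    print(gotoh_score("ACGT", "ACGT"))
    print(gotoh_score("ACGT", "AGT"))
```

Keeping separate matrices for the match and gap states is what lets a long gap be charged one opening cost plus cheaper extensions, while the running time stays proportional to the product of the two sequence lengths.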
Date Created
2023

Deciphering Sequence to Function through Protein Dynamics

Description
This thesis explores a diverse array of topics related to the role of dynamic allostery in regulating protein functions. Allostery is the phenomenon where a catalytic pocket responds to perturbations caused by binding at another distant site. This response often involves a conformational change resulting in a protein function alteration. However, it is essential to note the existence of dynamic allostery mechanisms that regulate protein function without relying on conformational changes but on dynamic motions. Within this thesis, position-specific equilibrium dynamics-based metrics like Dynamic Flexibility Index and Dynamic Coupling Index are employed to quantify the contributions of specific residues to protein dynamics. I investigated the role of dynamics in protein binding of the WW domain. In particular, I focused on how the mutations of distal positions modulate the binding site dynamics. By employing Dynamic Flexibility Index, I discovered that a residue, 10T, located distally from the binding pocket, plays a significant role in the observed dynamics difference between two variants: N21 (a native folded WW domain not binding Group I peptide) and CC16_N21 (an artificial WW domain binding Group I peptide). The T10H variant, created by exchanging the position 10 residue, enhances flexibility at positions 10 and 16. Consequently, this modification has led to an enhancement in the binding function of N21, enabling it to bind to Group I peptide effectively. Moreover, I investigated the influence of dynamic allostery on protein binding specificity, specifically in the PDZ domain PSD95. To gain insights into the binding process and accurately measure binding affinity, I employed two parallel computational approaches: Adaptive BP-docking and Steered Molecular Dynamics. These methods allowed me to model the binding interactions and quantify the binding strength robustly and comprehensively. 
The significance of allostery can serve as foundational knowledge in Deep Learning models, enabling the efficient mapping of protein sequences to their corresponding functionalities. One particular metric, Dynamic Coupling Index asymmetry, offers valuable insights into how the three-dimensional network of interactions facilitates communication within a protein structure. Leveraging these interactions, I developed a deep neural network architecture demonstrating enhanced capability in capturing epistatic interactions within Beta-lactamase and protein G function.
Date Created
2023

Mechanism of pH-dependent Zn2+ Binding in the Zinc Transporter Protein YiiP

Description
Transition metal ions such as Zn2+, Mn2+, Co2+, and Fe2+ play crucial roles in organisms from all kingdoms of life. The homeostasis of these ions is mainly regulated by a group of secondary transporters from the cation diffusion facilitator (CDF) family. The mammalian zinc transporters (ZnTs), a subfamily of CDF, have been an important target for study as they are associated with several diseases, such as diabetes, delayed growth and osteopenia, Alzheimer’s disease, and Parkinsonism. The bacterial homolog of ZnTs, YiiP, is the first CDF transporter with a determined structure and is used as a model for studying the structural and mechanistic properties of CDF transporters. Meanwhile, molecular dynamics simulation has emerged as a valuable computational tool for exploring the physical basis of biological macromolecules' structure and function with atomic precision at femtosecond resolution. This work aims to elucidate the roles of the three Zn2+ binding sites found on each YiiP protomer and the role of protons in the transport process of CDFs, which remain under debate despite previous thermodynamic and structural studies on YiiP. Cryo-EM, microscale thermophoresis (MST), and molecular dynamics (MD) simulations were used to address these questions. With a Zn2+ model that accurately reproduces experimental structures of the binding clusters, the dynamical influence of zinc binding on the transporter was assessed through MD simulations, and the results were consistent with the new cryo-EM structures. Zinc binding affinities obtained through MST were used to infer the stoichiometry of Zn2+/H+ antiport in combination with a microscopic thermodynamic model and constant pH simulations. The most likely microstates of H+ and Zn2+ binding indicated a transport stoichiometry of 1 Zn2+ to 2-3 H+, depending on the external pH.
A model describing the entire transport cycle of YiiP was finally built on these findings, providing insight into the structural and mechanistic properties of CDF transporters.
Date Created
2023

Exploring the Structures and Binding Sites of Electroneutral Cation/Proton Antiporter Proteins with Computational Methods

Description
Secondary active transporters play significant roles in maintaining living cells' homeostasis by using the electrochemical gradient of ions or protons as the source of free energy to transport substrate through biological membranes. A broadly recognized molecular framework, the alternating access model, describes the transport mechanism: the transporter switches between different conformations and alternately exposes its binding site to the intracellular and extracellular sides, thus exchanging ion and substrate in a cyclical manner. Recent progress in structural biology brought the first-ever structural insights into the mammalian Cation-Proton Antiporters (CPA) family of proteins. However, the dynamic atomic-level information about the interactions between the newly discovered structures and the bound ion or the corresponding substrate remains unknown. With Molecular Dynamics (MD), multiple spontaneous ion binding events were observed in the equilibrium simulations, revealing the binding site topology of Horse Sodium-Proton Exchanger 9 (NHE9) and Bison Sodium-Proton Antiporter 2 (NHA2) in their preferred protonation state. Further investigation into more CPA homologs compared various aspects, including sequence identity, binding site topology, and energetic properties, and obtained general insights into the similarities shared by the binding process of CPA members. The putative binding site and other conserved residues in their actively ion-bound poses were identified for each model, and their similarities were compared. The energetic properties assessed by the three-dimensional free energy profile, initially found to be unfavorable for binding in the experimental structures, were recalculated based on the simulation data. The updated results are consistent with the binding affinity indicated by experimental methods.
This work provided a general picture of the structures and the ion-protein interaction of CPA proteins and serves as comprehensive guidance for any related future structural and computational work.
Date Created
2023

Advancing Biophysics Research with Bayesian Methods: Novel Applications and Insights into Biological Systems' Behavior

Description
The Bayesian paradigm provides a flexible and versatile framework for modeling complex biological systems without assuming a fixed functional form or other constraints on the underlying data. This dissertation explores the use of Bayesian nonparametric methods for analyzing fluorescence microscopy data in biophysics, with a focus on enumerating diffraction-limited particles, reconstructing potentials from trajectories corrupted by measurement noise, and inferring potential energy landscapes from fluorescence intensity experiments. This research demonstrates the power and potential of Bayesian methods for solving a variety of problems in fluorescence microscopy and biophysics more broadly.
Date Created
2023

Mapping the Sequence-Structure-Function Paradigm by Intrinsic Properties of Anisotropic Networks

Description
Proteins are the machines of living systems that carry out a diverse set of essential biochemical functions. Furthermore, the diversity of their functions has grown over time via molecular evolution. This thesis aims to explore fundamental questions in protein science regarding the mechanisms of protein evolution, particularly addressing how substitutions in sequence modulate function through structure and structural dynamics. In the work presented here, the first goal is to develop a set of tools that connect sequence and structure, which are implemented in two major projects: protein structural refinement and protein structural design. Both projects highlight the importance of capturing important pairwise interactions within a given protein system. The second major goal of this work is to understand how sequence and structural dynamics give rise to protein function and, importantly, how Nature can utilize allostery to evolve towards a new function. Here I employ several in-house and novel computational tools to shed light on the mechanisms of allostery, particularly dynamic allostery in the absence of structural rearrangements. This analysis is applied to several different protein systems, including Pin1, LacI, CoV-1 and CoV-2, and TEM-1. I show that the dynamics of protein systems may be altered fundamentally by distal perturbations such as ligand binding or point mutations. These perturbations lead to changes in local interactions, which cascade within the 3-D network of interactions of a protein and give rise to flexibility changes at distal sites, particularly those of functional/active residue positions, thereby altering the protein function. This networking picture of the protein is further explored through asymmetric dynamic coupling, which proves to be a marker of allosteric interactions between distal residue pairs.
Within the networking picture, the concept of sequence context dependence upon mutation becomes critical in understanding the functional outcome of these mutations. Here I design a computational tool, EpiScore, which is able to capture these effects and correlate them to measured experimental epistasis in two protein systems, dihydrofolate reductase (DHFR) and TEM-1. Ultimately, the work provided in this thesis shows that both allostery and epistasis may be considered, and accurately modeled, as intrinsic properties of anisotropic networks.
Date Created
2021

Methods and instrumentation of sample injection for XFEL experiments

Description
X-ray crystallography and NMR are two major ways of achieving atomic-resolution structure determination for biological macromolecules such as proteins. Recently, new developments in hard X-ray pulsed free electron lasers (XFELs) opened up new possibilities to break the dilemma of radiation dose and spatial resolution in diffraction imaging by outrunning radiation damage with ultra-high-brightness femtosecond X-ray pulses, which are so short that each pulse terminates before atomic motion starts. A variety of experimental techniques for structure determination of biological macromolecules is now available, including imaging of protein nanocrystals and of single particles such as viruses, pump-probe experiments for time-resolved nanocrystallography, and snapshot wide-angle X-ray scattering (WAXS) from molecules in solution. However, due to the nature of the "diffract-then-destroy" process, each protein crystal is destroyed once probed. Hence a new sample delivery system is required to replenish the target crystal at a high rate. In this dissertation, sample delivery systems for the application of XFELs to biomolecular imaging are discussed, and the severe challenges of delivering macroscopic protein crystals in a stable, controllable way with minimum waste of sample and maximum hit rate are tackled with several different injector designs and approaches. New developments in sample delivery, such as the liquid mixing jet, also open up new experimental methods that give opportunities to study chemical dynamics in biomolecules at the molecular structural level. The design and characterization of the system are discussed along with possible future developments and applications. Finally, the LCP injector, which is critical for success in various applications, is discussed.
Date Created
2014

Calculating infrared spectra of proteins and other organic molecules based on normal modes

Description
The goal of this theoretical study of infrared spectra was to ascertain to what degree molecules may be identified from their IR spectra and which spectral regions are best suited for this purpose. The frequencies considered range from the lowest frequency molecular vibrations in the far-IR, terahertz region (below ~3 THz or 100 cm-1) up to the highest frequency vibrations (~120 THz or 4000 cm-1). An emphasis was placed on the IR spectra of chemical and biological threat molecules in the interest of detection and prevention. To calculate IR spectra, the technique of normal mode analysis was applied to organic molecules ranging in size from 8 to 11,352 atoms. The IR intensities of the vibrational modes were calculated in terms of the derivative of the molecular dipole moment with respect to each normal coordinate. Three sets of molecules were studied: the organophosphorus G- and V-type nerve agents and chemically related simulants (15 molecules ranging in size from 11 to 40 atoms); 21 other small molecules ranging in size from 8 to 24 atoms; and 13 proteins ranging in size from 304 to 11,352 atoms. Spectra for the first two sets of molecules were calculated using quantum chemistry software, the last two sets using force fields. The "middle" set used both methods, allowing for comparison between them and with experimental spectra from the NIST/EPA Gas-Phase Infrared Library. The calculated spectra of proteins, for which only force field calculations are practical, reproduced the experimentally observed amide I and II bands, but they were shifted by approximately +40 cm-1 relative to experiment. Considering the entire spectrum of protein vibrations, the most promising frequency range for differentiating between proteins was approximately 600-1300 cm-1 where water has low absorption and the proteins show some differences.
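The description above quotes frequencies in both THz and cm-1. As a small worked check, the sketch below converts between the two using the standard relation wavenumber = frequency / c, with the speed of light expressed in cm/s. The function name `thz_to_wavenumber` is an illustrative choice and is not taken from the study.

```python
# Speed of light in cm/s (exact SI value, 299 792 458 m/s, converted to cm/s).
C_CM_PER_S = 2.99792458e10


def thz_to_wavenumber(freq_thz):
    """Convert a frequency in THz to a wavenumber in cm^-1."""
    freq_hz = freq_thz * 1e12
    return freq_hz / C_CM_PER_S


if __name__ == "__main__":
    for f in (3, 120):
        print(f"{f} THz is about {thz_to_wavenumber(f):.1f} cm^-1")
```

The results agree with the rounded figures in the description: 3 THz is close to 100 cm-1, and 120 THz is close to 4000 cm-1.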
Date Created
2012