Description

Whole genome sequencing (WGS) is a promising strategy for uncovering the variants or genes responsible for human diseases and traits. However, robust platforms for comprehensive downstream analysis are lacking. In the present study, we first proposed three novel algorithms (sequence gap-filled gene feature annotation, bit-block encoded genotypes, and sectional fast access to text lines) to address three fundamental problems. The three algorithms then formed the infrastructure of a robust parallel computing framework, KGGSeq, which integrates downstream analysis functions for whole genome sequencing data. KGGSeq is equipped with a comprehensive set of analysis functions for quality control, filtration, annotation, pathogenic prediction and statistical tests. In tests with whole genome sequencing data from the 1000 Genomes Project, KGGSeq annotated several thousand more reliable non-synonymous variants than other widely used tools (e.g. ANNOVAR and SNPEff). It took only around half an hour on a small server with 10 CPUs to access the genotypes of ∼60 million variants in 2504 subjects, while a popular alternative tool required around one day. KGGSeq's bit-block genotype format used 1.5% or less space to flexibly represent phased or unphased genotypes with multiple alleles, and calculated genotypic correlation over 1000 times faster.
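
The abstract does not specify the layout of KGGSeq's bit-block format, so the following is only a minimal sketch of the general idea behind bit-packed genotype storage, not the authors' method. It assumes unphased biallelic genotypes coded as 0, 1 or 2 copies of the alternative allele, stored at 2 bits each; the function names and the choice of code 3 for a missing genotype are hypothetical.

```python
MISSING = 3  # assumption: the fourth 2-bit code marks a missing genotype


def pack_genotypes(genotypes):
    """Pack genotypes (0, 1, 2, or None for missing) at 2 bits each."""
    packed = bytearray((len(genotypes) + 3) // 4)  # 4 genotypes per byte
    for i, g in enumerate(genotypes):
        if g is not None and g not in (0, 1, 2):
            raise ValueError(f"unsupported genotype code: {g!r}")
        code = MISSING if g is None else g
        # Genotype i occupies bits 2*(i % 4) and 2*(i % 4) + 1 of byte i // 4.
        packed[i // 4] |= code << (2 * (i % 4))
    return bytes(packed)


def unpack_genotypes(packed, n):
    """Recover the first n genotypes from a packed byte string."""
    genotypes = []
    for i in range(n):
        code = (packed[i // 4] >> (2 * (i % 4))) & 0b11
        genotypes.append(None if code == MISSING else code)
    return genotypes
```

In this sketch, ten genotypes fit in three bytes. A format that also supports phased genotypes and multiple alleles, as the abstract describes, would need more bits per genotype.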

    Details

    Title
    • Robust and Rapid Algorithms Facilitate Large-Scale Whole Genome Sequencing Downstream Analysis in an Integrative Framework
    Date Created
    • 2017-01-23
    Resource Type
    • Text
    Identifier
    • Digital object identifier: 10.1093/nar/gkx019
    • International standard serial number: 1362-4962
    • International standard serial number: 0305-1048
    Note
    • The final version of this article, as published in Nucleic Acids Research, can be viewed online at: https://academic.oup.com/nar/article-lookup/doi/10.1093/nar/gkx019

    Citation and reuse

    Cite this item

    This is a suggested citation. Consult the appropriate style guide for specific citation guidelines.

    Li, M., Li, J., Li, M. J., Pan, Z., Hsu, J. S., Liu, D. J., . . . Sham, P. C. (2017). Robust and rapid algorithms facilitate large-scale whole genome sequencing downstream analysis in an integrative framework. Nucleic Acids Research. doi:10.1093/nar/gkx019
