Full metadata
Title
An Analysis of the Benchmark Test lzbench for Open-Source Compressors
Description
With the rising data output and falling costs of Next Generation Sequencing technologies, research into data compression is crucial to keeping storage efficient and affordable. High-throughput sequencers such as the HiSeq X Ten can produce up to 1.8 terabases of data per run, and such large storage demands matter even more for institutions that rely on their own servers rather than large data centers (cloud storage) [1]. Compression algorithms aim to reduce the space taken up by large genomic datasets by encoding the most frequently occurring symbols with the shortest codewords and by reordering the data to make it easier to encode. Depending on the probability distribution of the symbols in the dataset or the structure of the data, choosing the wrong algorithm can produce a compressed file larger than the original, or a poorly compressed file that wastes time and space [2]. To compare the efficiency of compression algorithms for each file type, 37 open-source compression algorithms were used to compress six types of genomic datasets (FASTA, VCF, BCF, GFF, GTF, and SAM) and were evaluated on compression speed, decompression speed, compression ratio, and file size using the benchmark test lzbench. Compressors that outperformed the popular bioinformatics compressor Gzip (zlib -6) were evaluated against one another by ratio and speed, both for each file type and across the geometric means of all file types. Compressors that exhibited fast compression and decompression speeds were also evaluated by transmission time over internet connections of varying speeds, in scenarios where the file was compressed only once or compressed multiple times.
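The metrics named in the Description above can be illustrated with a minimal Python sketch. It is not the thesis's method and does not call lzbench; it uses Python's standard `zlib` module at level 6 as a stand-in for the Gzip baseline. The function names `benchmark_zlib` and `transfer_seconds`, the definition of ratio as compressed size divided by original size, the simple transfer-time model, and the sample data are all assumptions made for illustration.

```python
import time
import zlib
from statistics import geometric_mean


def benchmark_zlib(data: bytes, level: int = 6) -> dict:
    """Time one zlib compress/decompress round trip and report basic metrics."""
    start = time.perf_counter()
    compressed = zlib.compress(data, level)
    compress_seconds = time.perf_counter() - start

    start = time.perf_counter()
    restored = zlib.decompress(compressed)
    decompress_seconds = time.perf_counter() - start

    if restored != data:
        raise ValueError("round trip did not reproduce the input")

    return {
        "original_bytes": len(data),
        "compressed_bytes": len(compressed),
        # Assumed convention: smaller ratio means better compression.
        "ratio": len(compressed) / len(data),
        "compress_seconds": compress_seconds,
        "decompress_seconds": decompress_seconds,
    }


def transfer_seconds(metrics: dict, bandwidth_bytes_per_s: float,
                     include_compression: bool = True) -> float:
    """Estimate end-to-end time to send a compressed file (assumed model).

    Set include_compression=False for the case where the file was already
    compressed once and that cost is not paid again for this transfer.
    """
    total = metrics["compressed_bytes"] / bandwidth_bytes_per_s
    total += metrics["decompress_seconds"]
    if include_compression:
        total += metrics["compress_seconds"]
    return total


# Hypothetical FASTA-like sample; real benchmarks would use actual genomic files.
sample = b">seq1\n" + b"ACGTTGCAAC" * 20000 + b"\n"
metrics = benchmark_zlib(sample)

# Geometric mean of per-file ratios, as one way to summarize across file types.
# These ratio values are made up for the example, not results from the thesis.
example_ratios = [0.25, 0.30, 0.40]
summary_ratio = geometric_mean(example_ratios)
```

Calling `transfer_seconds(metrics, 1_000_000)` gives the estimate for a link that moves one million bytes per second; varying the bandwidth shows where faster compressors with worse ratios start to win over slower ones with better ratios.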
Date Created
2017-05
Contributors
- Howell, Abigail (Author)
- Cartwright, Reed (Thesis director)
- Wilson Sayres, Melissa (Committee member)
- Taylor, Jay (Committee member)
- Barrett, The Honors College (Contributor)
Topical Subject
Resource Type
Extent
23 pages
Language
eng
Copyright Statement
In Copyright
Primary Member of
Series
Academic Year 2016-2017
Handle
https://hdl.handle.net/2286/R.I.43060
Level of coding
minimal
Cataloging Standards
System Created
- 2017-10-30 02:50:58
System Modified
- 2021-07-15 10:18:27