The Bioinformatics Core has written and contributed to open-source bioinformatics software that is freely available for the entire community to use. Below are some of the more broadly useful tools that we have developed or contributed to.
HTStream is a fast quality control pipeline for High Throughput Sequencing (HTS) data. The difference between HTStream and other pipelines is that HTStream uses a tab-delimited FASTQ format, which allows reads to be streamed from application to application. This streaming brings several efficiencies when processing HTS data:
- No intermediate files (reduces storage footprint)
- Reduced I/O (files are only read in and written out once)
- Handles both single-end and paired-end reads at the same time
- Processes can work at the same time allowing for process parallelization
- Built on top of mature C++ Boost libraries to reduce bugs and memory leaks
- Designed following the philosophy of Program Design in the UNIX Environment
- Works with native Unix/Linux applications such as grep/sed/awk etc.
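To illustrate why a one-record-per-line layout suits streaming, here is a minimal Python sketch that converts four-line FASTQ records to tab-delimited lines and back. The column order used here (name, sequence, qualities) is an assumption made for illustration and is not necessarily the layout HTStream uses; `fastq_to_tab` and `tab_to_fastq` are hypothetical helper names.

```python
import io

def fastq_to_tab(handle):
    """Yield one tab-delimited line per four-line FASTQ record.

    Assumed column order (illustrative only): name, sequence, qualities.
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of input
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, discarded
        qual = handle.readline().rstrip("\n")
        yield "\t".join((header[1:], seq, qual))  # drop the leading '@'

def tab_to_fastq(line):
    """Rebuild a four-line FASTQ record from one tab-delimited line."""
    name, seq, qual = line.rstrip("\n").split("\t")
    return f"@{name}\n{seq}\n+\n{qual}\n"

example = "@read1\nACGT\n+\nIIII\n@read2\nGGCC\n+\n!!II\n"
tab_lines = list(fastq_to_tab(io.StringIO(example)))
round_trip = "".join(tab_to_fastq(line) for line in tab_lines)
```

Because each read occupies exactly one line, line-oriented tools such as grep, sed, and awk operate on whole records rather than on fragments of them.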
Consolidates HTStream JSON output into visually useful reports. Also takes numbers from mapping results to create graphs that help QC the data.
Analysis of Double Barcoded Illumina Amplicon Data, used for Microbial Community Analysis.
Most modern sequencing technologies produce reads whose quality deteriorates towards the 3'-end, and some towards the 5'-end as well. Incorrectly called bases in both regions negatively impact assemblies, mapping, and downstream bioinformatics analyses. Sickle is a tool that uses sliding windows along with quality and length thresholds to determine where quality drops low enough to trim the 3'-end of reads, and where quality becomes high enough to trim the 5'-end of reads.
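The sliding-window idea can be sketched in a few lines of Python. This is a simplified illustration, not Sickle's actual implementation: the function name `window_trim`, its default thresholds, and the choice to cut at window boundaries are all assumptions made for the example.

```python
def window_trim(seq, quals, window=4, min_qual=20, min_len=3):
    """Trim both ends of a read using mean quality in a sliding window.

    Returns (trimmed_seq, trimmed_quals), or None if the read is discarded.
    Simplification: cuts are made at window start positions.
    """
    n = len(seq)
    if n == 0:
        return None
    w = min(window, n)  # shrink the window for very short reads
    means = [sum(quals[i:i + w]) / w for i in range(n - w + 1)]

    # 5'-end: keep from the first window whose mean reaches the threshold.
    start = next((i for i, m in enumerate(means) if m >= min_qual), None)
    if start is None:
        return None  # no window is good enough

    # 3'-end: cut at the first later window whose mean falls below it.
    end = next((i for i in range(start, len(means)) if means[i] < min_qual), n)

    if end - start < min_len:
        return None  # too short after trimming
    return seq[start:end], quals[start:end]

trimmed = window_trim("ACGTACGTAC", [2, 2, 30, 30, 30, 30, 30, 30, 2, 2])
```

In this example the low-quality bases at both ends pull the window mean below the threshold, so the read is cut on both sides; a read with no acceptable window is dropped entirely.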
Scythe is a contaminant/adapter removal tool. It uses a Naive Bayesian approach to classify contaminant substrings in sequence reads. It considers quality information, which can make it robust in picking out 3'-end adapters, which often include poor quality bases.
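The role of quality information in this kind of classification can be shown with a small Python sketch. It compares two simple models for the bases aligned against an adapter: a "contaminant" model, where a mismatch is explained by the base's sequencing error probability, and a "random" model, where each base matches by chance one time in four. The function names, the prior value, and the exact likelihood terms are assumptions for illustration and should not be read as Scythe's actual code.

```python
import math

def phred_to_error(q):
    """Convert a Phred quality score to an error probability, clamped away from 0 and 1."""
    return min(max(10 ** (-q / 10), 1e-6), 1 - 1e-6)

def is_contaminant(read_end, quals, adapter, prior=0.3):
    """Return True if the contaminant model is more probable than the random model.

    prior: assumed prior probability that a read carries the contaminant.
    """
    log_contam = math.log(prior)
    log_random = math.log(1 - prior)
    for base, q, adapter_base in zip(read_end, quals, adapter):
        e = phred_to_error(q)
        if base == adapter_base:
            log_contam += math.log(1 - e)   # base called correctly
            log_random += math.log(0.25)    # chance match
        else:
            log_contam += math.log(e)       # mismatch explained by a sequencing error
            log_random += math.log(0.75)    # chance mismatch
    return log_contam > log_random

adapter = "AGATCGGA"
exact = is_contaminant("AGATCGGA", [40] * 8, adapter)
noisy = is_contaminant("AGATCGGT", [40] * 7 + [2], adapter)
unrelated = is_contaminant("TTTTTTTT", [40] * 8, adapter)
```

A mismatch at a low-quality position costs the contaminant model very little, which is why an adapter with a poor-quality final base is still recognized, while a high-quality sequence that disagrees with the adapter is not.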
High-throughput sequencing can currently produce hundreds of millions of reads per lane of sample and that number increases at a dizzying rate. Barcoding individual sequences for multiple lines or multiple species is a cost-efficient method to sequence and analyze a broad range of data. Sabre is a tool that will demultiplex barcoded reads into separate files. It will work on both single-end and paired-end data in fastq format. It simply compares the provided barcodes with each read and separates the read into its appropriate barcode file, after stripping the barcode from the read (and also stripping the quality values of the barcode bases). If a read does not have a recognized barcode, then it is put into an "unknown" file.
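The demultiplexing step described above can be sketched in Python as follows. This is a minimal illustration rather than Sabre's implementation: it holds reads in memory instead of writing files, requires an exact barcode match at the start of the read, and uses a hypothetical function name `demultiplex` with made-up sample names.

```python
def demultiplex(reads, barcodes):
    """Sort reads into per-sample bins by their leading barcode.

    reads: iterable of (name, sequence, qualities) tuples.
    barcodes: dict mapping barcode sequence -> sample name.
    Assumption for this sketch: exact prefix matching only.
    """
    bins = {sample: [] for sample in barcodes.values()}
    bins["unknown"] = []
    for name, seq, qual in reads:
        for barcode, sample in barcodes.items():
            if seq.startswith(barcode):
                k = len(barcode)
                # Strip the barcode bases and their quality values.
                bins[sample].append((name, seq[k:], qual[k:]))
                break
        else:
            # No barcode recognized: keep the read unmodified.
            bins["unknown"].append((name, seq, qual))
    return bins

reads = [
    ("r1", "AAGGTTCA", "IIIIIIII"),
    ("r2", "CCGGAATT", "IIII!!!!"),
    ("r3", "TTTTACGT", "IIIIIIII"),
]
bins = demultiplex(reads, {"AAGG": "line_a", "CCGG": "line_b"})
```

Here `r1` and `r2` land in their sample bins with the four barcode bases and matching quality values removed, while `r3` goes to the "unknown" bin untouched.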