Crate fasten_sort

Sort a FASTQ file. If the reads are paired-end, the field being sorted on is concatenated from R1 and R2 before comparison, and each R1/R2 pair stays together in the output.
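
The paired-end behavior can be pictured with a minimal sketch. This is an illustration under stated assumptions, not the crate's actual code: `Entry`, `sort_key`, and `sort_entries` are hypothetical names, and only the sequence field is modelled.

```rust
/// Hypothetical record (not the crate's own struct): one read, or one
/// R1/R2 pair that must stay together.
#[derive(Debug, Clone, PartialEq)]
struct Entry {
    r1_seq: String,
    r2_seq: Option<String>,
}

impl Entry {
    /// Sort key: the R1 sequence followed by the R2 sequence when a mate exists.
    fn sort_key(&self) -> String {
        match &self.r2_seq {
            Some(r2) => format!("{}{}", self.r1_seq, r2),
            None => self.r1_seq.clone(),
        }
    }
}

/// Sort whole entries by the concatenated key, so mates are never separated.
fn sort_entries(entries: &mut [Entry]) {
    // The key is computed once per entry rather than once per comparison.
    entries.sort_by_cached_key(|e| e.sort_key());
}
```

Because the pair is the unit being moved, two pairs with identical R1 sequences are ordered by their R2 sequences.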

Sorting by GC content tends to improve compression with gzip and similar algorithms, since such compressors benefit when reads of similar composition sit close together.

Sorting also helps produce stable hashsums: the same set of reads yields the same checksum regardless of the order in which they arrived.

§Examples

§stable hashsum

cat file.fastq | fasten_sort | md5sum > file.fastq.md5

§better compression by sorting by GC content

zcat file.fastq.gz | fasten_sort --sort-by GC | gzip -c > smaller.fastq.gz
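
As a rough illustration of what sorting by GC means, the sketch below orders sequences by their fraction of G and C bases. The function names are hypothetical, and the handling of lowercase bases and empty sequences is an assumption; the tool itself may differ.

```rust
/// Fraction of G/C bases in a sequence; 0.0 for an empty sequence.
/// Counting lowercase g/c as GC is an assumption of this sketch.
fn gc_fraction(seq: &str) -> f64 {
    if seq.is_empty() {
        return 0.0;
    }
    let gc = seq
        .bytes()
        .filter(|&b| matches!(b, b'G' | b'C' | b'g' | b'c'))
        .count();
    gc as f64 / seq.len() as f64
}

/// Stable sort of sequences by ascending GC fraction.
fn sort_by_gc(seqs: &mut [String]) {
    // total_cmp gives a total order on f64, so no unwrap is needed.
    seqs.sort_by(|a, b| gc_fraction(a).total_cmp(&gc_fraction(b)));
}
```

For example, `ATAT`, `ATGC`, and `GGCC` have GC fractions of 0.0, 0.5, and 1.0 and would be emitted in that order.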
 
§get good compression from paired end reads

zcat R1.fastq.gz R2.fastq.gz | fasten_shuffle | \
  fasten_sort --paired-end --sort-by GC | \
  fasten_shuffle -d -1 sorted_1.fastq -2 sorted_2.fastq && \
  gzip -v sorted_1.fastq sorted_2.fastq

Compare the compressed sizes of the unsorted input and the sorted output from the previous example:

ls -lh R1.fastq.gz R2.fastq.gz sorted_1.fastq.gz sorted_2.fastq.gz

§Fast sorting of large files

If a globally sorted file is not required, you can sort in chunks: each chunk of reads is sorted on its own, so the output is sorted only within chunks. This is useful for streaming large files.

zcat large.fastq.gz | fasten_sort --paired-end --chunk-size 1000 | gzip -c > sorted.fastq.gz
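
The effect of chunked sorting can be sketched on an in-memory slice. This is only an illustration with a hypothetical function name; a streaming tool would presumably read, sort, and write one chunk at a time instead of holding everything in memory.

```rust
/// Sort `reads` in independent chunks of `chunk_size` entries.
/// A chunk size of 0 is treated as "sort the whole set", mirroring the
/// documented default of --chunk-size.
fn sort_in_chunks(reads: &mut [String], chunk_size: usize) {
    if chunk_size == 0 {
        reads.sort();
        return;
    }
    // Each chunk is sorted on its own; the order between chunks is unchanged.
    for chunk in reads.chunks_mut(chunk_size) {
        chunk.sort();
    }
}
```

With a chunk size of 2, the reads `T C G A` become `C T A G`: each pair is ordered, but the result as a whole is not.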

§Usage

Usage: fasten_sort [-h] [-n INT] [-p] [-v] [-s STRING] [-r] [-c INT]

Options:
    -h, --help          Print this help menu.
    -n, --numcpus INT   Number of CPUs (default: 1)
    -p, --paired-end    The input reads are interleaved paired-end
    -v, --verbose       Print more status messages
    -s, --sort-by STRING
                        Sort by either SEQ, GC, or ID. If GC, then the entries
                        are sorted by GC percentage. SEQ and ID are
                        alphabetically sorted.
    -r, --reverse       Reverse sort
    -c, --chunk-size INT
                        If > 0, then chunks of reads or pairs will be sorted
                        instead of the whole set. This is useful for streaming
                        large files. Default: 0

Structs§

  • Seq 🔒
    A sequence struct that is paired-end aware

Functions§

  • main 🔒
  • minimizer 🔒
    Find the lexicographically smallest kmer in a sequence.
  • Sort fastq entries in a vector
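
To make "lexicographically smallest kmer" concrete, here is a minimal sketch. It is not the crate's `minimizer` function: the name `smallest_kmer`, the byte-slice signature, and the explicit `k` parameter are all assumptions made for illustration.

```rust
/// Return the lexicographically smallest k-mer of `seq`, or None when
/// `k` is 0 or the sequence is shorter than `k`.
fn smallest_kmer(seq: &[u8], k: usize) -> Option<&[u8]> {
    if k == 0 {
        // slice::windows panics on a window size of 0, so guard against it.
        return None;
    }
    // windows(k) yields every overlapping k-mer; byte slices compare
    // lexicographically, so min() picks the smallest one.
    seq.windows(k).min()
}
```

For `GATTACA` with `k = 3`, the candidates are `GAT`, `ATT`, `TTA`, `TAC`, and `ACA`, and the smallest is `ACA`.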