Crate fasten_sort
Sort a FASTQ file. If the reads are paired-end, the field being sorted on is concatenated from R1 and R2 before comparison, and each R1/R2 pair stays together in the output.
Sorting by GC content usually improves compression, because gzip and similar algorithms benefit when reads with similar composition sit next to each other.
Sorting can also make checksums reproducible when the same reads arrive in a different order.
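The paired-end behaviour described above can be pictured with a small sketch. It is not the crate's implementation: `Record`, `Entry`, `seq_key`, and `sort_by_seq` are hypothetical names chosen for illustration, and the sketch assumes the sort field is the sequence. Each entry holds R1 and an optional R2, the comparison key is R1's sequence followed by R2's, and the pair moves as one unit.

```rust
/// Hypothetical FASTQ record for this sketch (not the crate's `Seq` struct).
#[derive(Debug, Clone, PartialEq)]
struct Record {
    id: String,
    seq: String,
}

/// An R1 read with an optional R2 mate; sorting never separates the two.
#[derive(Debug, Clone, PartialEq)]
struct Entry {
    r1: Record,
    r2: Option<Record>,
}

impl Entry {
    /// Sort key: the R1 sequence, followed by the R2 sequence if a mate exists.
    fn seq_key(&self) -> String {
        let mut key = self.r1.seq.clone();
        if let Some(mate) = &self.r2 {
            key.push_str(&mate.seq);
        }
        key
    }
}

/// Sort entries by the concatenated key, computing each key only once.
fn sort_by_seq(entries: &mut [Entry]) {
    entries.sort_by_cached_key(|e| e.seq_key());
}
```

Because the mate is stored inside the entry, reordering the slice cannot split a pair; two pairs with the same R1 sequence are ordered by their R2 sequences.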
§Examples
§stable hashsum
cat file.fastq | fasten_sort | md5sum > file.fastq.md5
§better compression by sorting by GC content
zcat file.fastq.gz | fasten_sort --sort-by GC | gzip -c > smaller.fastq.gz
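As a rough illustration of what sorting by GC means, the sketch below computes a GC fraction per sequence and orders sequences by it. The function names `gc_fraction` and `sort_by_gc` are hypothetical, and the handling of lowercase bases, ambiguous bases such as N, and empty sequences is an assumption of this sketch rather than documented behaviour of fasten_sort.

```rust
/// Fraction of G and C bases in a sequence, in the range 0.0..=1.0.
/// Assumption of this sketch: case-insensitive, and an empty sequence is 0.0.
fn gc_fraction(seq: &str) -> f64 {
    if seq.is_empty() {
        return 0.0;
    }
    let gc = seq
        .bytes()
        .filter(|b| matches!(b.to_ascii_uppercase(), b'G' | b'C'))
        .count();
    gc as f64 / seq.len() as f64
}

/// Sort sequences by ascending GC fraction.
/// `total_cmp` gives a total order on f64, so no unwrap is needed.
fn sort_by_gc(seqs: &mut [String]) {
    seqs.sort_by(|a, b| gc_fraction(a).total_cmp(&gc_fraction(b)));
}
```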
§get good compression from paired end reads
```bash
zcat R1.fastq.gz R2.fastq.gz | fasten_shuffle | \
  fasten_sort --paired-end --sort-by GC | \
  fasten_shuffle -d -1 sorted_1.fastq -2 sorted_2.fastq && \
  gzip -v sorted_1.fastq sorted_2.fastq
```
Compare the compressed sizes of the unsorted and sorted files from the previous example
ls -lh R1.fastq.gz R2.fastq.gz sorted_1.fastq.gz sorted_2.fastq.gz
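The `--paired-end` option expects interleaved input, which is why the pipeline above passes the reads through fasten_shuffle first. A minimal sketch of what interleaved means, assuming every record spans exactly four lines (no wrapped sequence lines): each group of eight lines is one R1 record followed by its R2 mate. The function name `interleaved_pairs` is hypothetical and is not part of the crate.

```rust
/// Group the lines of an interleaved FASTQ text into (R1, R2) pairs.
/// Assumption of this sketch: each record is exactly four lines
/// (header, sequence, plus line, qualities).
fn interleaved_pairs(text: &str) -> Result<Vec<(Vec<&str>, Vec<&str>)>, String> {
    let lines: Vec<&str> = text.lines().collect();
    if lines.len() % 8 != 0 {
        return Err(format!(
            "expected a multiple of 8 lines, found {}",
            lines.len()
        ));
    }
    Ok(lines
        .chunks(8)
        // First four lines are R1, last four are R2.
        .map(|pair| (pair[..4].to_vec(), pair[4..].to_vec()))
        .collect())
}
```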
§Fast sorting of large files
If a fully sorted file is not required, you can sort in chunks: each chunk of reads (or pairs) is sorted on its own rather than the whole set. This is useful for streaming large files.
zcat large.fastq.gz | fasten_sort --paired-end --chunk-size 1000 | gzip -c > sorted.fastq.gz
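Chunked sorting can be sketched as follows. This is an illustration under stated assumptions, not the crate's code: `sort_in_chunks` and its `emit` callback are hypothetical, items are plain strings standing in for reads or pairs, and a chunk size of 0 is taken to mean "sort everything", matching the documented default.

```rust
/// Sort a stream of items in independent chunks of `chunk_size`,
/// passing each sorted item to `emit` as soon as its chunk is complete.
/// A `chunk_size` of 0 buffers and sorts the whole input.
fn sort_in_chunks<I, F>(items: I, chunk_size: usize, mut emit: F)
where
    I: IntoIterator<Item = String>,
    F: FnMut(String),
{
    let mut chunk: Vec<String> = Vec::new();
    for item in items {
        chunk.push(item);
        if chunk_size > 0 && chunk.len() == chunk_size {
            chunk.sort();
            // Flush the sorted chunk and reuse the buffer.
            chunk.drain(..).for_each(&mut emit);
        }
    }
    // Flush the final, possibly shorter, chunk.
    chunk.sort();
    chunk.into_iter().for_each(emit);
}
```

Only one chunk is held in memory at a time, so output order is sorted within each chunk but not across chunk boundaries.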
§Usage
Usage: fasten_sort [-h] [-n INT] [-p] [-v] [-s STRING] [-r] [-c INT]
Options:
    -h, --help          Print this help menu.
    -n, --numcpus INT   Number of CPUs (default: 1)
    -p, --paired-end    The input reads are interleaved paired-end
    -v, --verbose       Print more status messages
    -s, --sort-by STRING
                        Sort by either SEQ, GC, or ID. If GC, then the entries
                        are sorted by GC percentage. SEQ and ID are
                        alphabetically sorted.
    -r, --reverse       Reverse sort
    -c, --chunk-size INT
                        If > 0, then chunks of reads or pairs will be sorted
                        instead of the whole set. This is useful for streaming
                        large files. Default: 0
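One way to picture how `--sort-by` and `--reverse` combine is a comparator chosen from the field name and then optionally flipped. The sketch below is hypothetical throughout: `SortBy`, `Read`, `gc_fraction`, and `compare` are names invented for illustration, and accepting only the exact uppercase strings SEQ, GC, and ID is an assumption, not documented behaviour.

```rust
use std::cmp::Ordering;
use std::str::FromStr;

/// The three sort fields listed in the help text.
#[derive(Debug, Clone, Copy, PartialEq)]
enum SortBy {
    Seq,
    Gc,
    Id,
}

impl FromStr for SortBy {
    type Err = String;
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "SEQ" => Ok(SortBy::Seq),
            "GC" => Ok(SortBy::Gc),
            "ID" => Ok(SortBy::Id),
            other => Err(format!("unknown sort field: {}", other)),
        }
    }
}

/// Hypothetical read with just the fields needed for comparison.
struct Read {
    id: String,
    seq: String,
}

/// GC fraction of a sequence; empty sequences count as 0.0 in this sketch.
fn gc_fraction(seq: &str) -> f64 {
    if seq.is_empty() {
        return 0.0;
    }
    let gc = seq.bytes().filter(|b| matches!(b, b'G' | b'C')).count();
    gc as f64 / seq.len() as f64
}

/// Compare two reads by the chosen field, flipping the result if `reverse`.
fn compare(a: &Read, b: &Read, by: SortBy, reverse: bool) -> Ordering {
    let ord = match by {
        SortBy::Seq => a.seq.cmp(&b.seq),
        SortBy::Id => a.id.cmp(&b.id),
        SortBy::Gc => gc_fraction(&a.seq).total_cmp(&gc_fraction(&b.seq)),
    };
    if reverse { ord.reverse() } else { ord }
}
```

A caller would parse the option value with `"GC".parse::<SortBy>()` and pass `compare` to `sort_by` on a vector of reads.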
Structs§
- Seq 🔒 A sequence struct that is paired-end aware
Functions§
- main 🔒
- Find the lexicographically smallest kmer in a sequence.
- Sort fastq entries in a vector