Crate fasten_normalize
Normalizes kmer depth by removing some of the reads that contribute to high-depth kmers.
The input must come from fasten_kmer --remember-reads,
which produces at least three columns:
kmer, count, read1, [read2,…]
This tool was inspired by BBNorm, but the algorithm is probably not identical. https://jgi.doe.gov/data-and-tools/software-tools/bbtools/bb-tools-user-guide/bbnorm-guide/
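The input layout described above (kmer, count, then one or more reads) can be parsed roughly as in the sketch below. It is an illustration only, not the crate's own parser: the `KmerRow` struct and `parse_kmer_row` function are hypothetical names, and the tab delimiter is an assumption that should be checked against real fasten_kmer output.

```rust
/// One row of the assumed `fasten_kmer --remember-reads` output.
/// Hypothetical type, introduced only for this sketch.
#[derive(Debug, PartialEq)]
struct KmerRow {
    kmer: String,
    count: u32,
    reads: Vec<String>,
}

/// Parse one row of input.
/// ASSUMPTION: columns are separated by tabs.
fn parse_kmer_row(line: &str) -> Result<KmerRow, String> {
    // Strip the line ending only, so that empty trailing columns stay visible.
    let line = line.trim_end_matches(|c| c == '\n' || c == '\r');
    let fields: Vec<&str> = line.split('\t').collect();

    // The documentation requires at least kmer, count, and one read.
    if fields.len() < 3 {
        return Err(format!("expected at least 3 columns, found {}", fields.len()));
    }

    let count = fields[1]
        .parse::<u32>()
        .map_err(|e| format!("bad count {:?}: {}", fields[1], e))?;

    Ok(KmerRow {
        kmer: fields[0].to_string(),
        count,
        reads: fields[2..].iter().map(|s| s.to_string()).collect(),
    })
}

fn main() {
    match parse_kmer_row("AAAAA\t2\tread1\tread2") {
        Ok(row) => println!("{:?}", row),
        Err(e) => eprintln!("could not parse row: {}", e),
    }
}
```

Rejecting rows with fewer than three columns early gives a clear error when the input was produced without --remember-reads.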
Examples
cat testdata/four_reads.fastq | \
fasten_kmer -k 5 --remember-reads | \
fasten_normalize | \
gzip -c > four_reads.normalized.fastq.gz
Paired end reads
cat testdata/R[12].fastq | \
fasten_shuffle | \
fasten_kmer -k 3 -m --paired-end | \
fasten_normalize --target-depth 10 --paired-end | \
gzip -c > normalized.fastq.gz
Usage
Usage: fasten_normalize [-h] [-n INT] [-p] [--verbose] [--version] [-t INT]
Options:
-h, --help Print this help menu.
-n, --numcpus INT Number of CPUs (default: 1)
-p, --paired-end The input reads are interleaved paired-end
--verbose Print more status messages
--version Print the version of Fasten and exit
-t, --target-depth INT
The target depth of kmer.
Algorithm
fasten_normalize will downsample reads pertaining to each kmer.
For example, if AAAA is found in the fasten_kmer output 100 times, but you request 10x coverage, it will remove 90% of the reads pertaining to AAAA.
Specifically:
- fasten_kmer shows reads that begin with that kmer.
- fasten_kmer shows extra columns with R1/R2 if R1 begins with that kmer. If more than one read or read pair begins with that kmer, it will be displayed in subsequent columns.
- fasten_normalize randomly selects reads that begin with that kmer and brings the number of reads down to that target coverage.
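The downsampling step can be sketched as follows. This is a minimal illustration under stated assumptions, not the crate's actual code: `downsample` and `XorShift64` are hypothetical names, and the small xorshift generator is a stand-in chosen only to keep the example free of external crates.

```rust
/// Tiny xorshift64 generator. A stand-in for a real random number
/// generator so that this sketch needs no dependencies.
struct XorShift64(u64);

impl XorShift64 {
    fn new(seed: u64) -> Self {
        // xorshift never leaves the all-zero state, so avoid a zero seed.
        XorShift64(if seed == 0 { 0x9E3779B97F4A7C15 } else { seed })
    }

    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }

    /// Index in 0..bound (slight modulo bias, acceptable for a sketch).
    fn below(&mut self, bound: usize) -> usize {
        (self.next_u64() % bound as u64) as usize
    }
}

/// Randomly keep at most `target_depth` of the reads that begin with one kmer.
fn downsample(mut reads: Vec<String>, target_depth: usize, rng: &mut XorShift64) -> Vec<String> {
    // Already at or below the target: nothing to remove.
    if reads.len() <= target_depth {
        return reads;
    }
    // Partial Fisher-Yates shuffle: after the loop, the first `target_depth`
    // slots hold a random sample drawn without replacement.
    for i in 0..target_depth {
        let j = i + rng.below(reads.len() - i);
        reads.swap(i, j);
    }
    reads.truncate(target_depth);
    reads
}

fn main() {
    // 100 reads begin with the same kmer; the requested depth is 10.
    let reads: Vec<String> = (0..100).map(|i| format!("read{}", i)).collect();
    let mut rng = XorShift64::new(42);
    let kept = downsample(reads, 10, &mut rng);
    println!("kept {} of 100 reads", kept.len());
}
```

With 100 reads and a target depth of 10, the function keeps 10 reads and drops the other 90%, matching the AAAA example above. Kmers that are already at or below the target are passed through unchanged.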
Choosing the correct k
Choose a kmer length that is long enough to be unique in the genome but short enough that it rarely runs into read-level errors. In the examples above, k=3 is likely far too short. Something like k=31 is probably a good starting point.
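A quick calculation shows why a very short k cannot be unique. Over the alphabet A, C, G, T there are 4^k possible kmers, so k=3 allows only 64 of them and every one will recur many times in any real genome. The helper below is a hypothetical illustration, not part of the crate.

```rust
/// Number of distinct DNA kmers of length `k` over A, C, G, T.
/// Returns None when 4^k does not fit in a u64 (k > 31).
fn kmer_space(k: u32) -> Option<u64> {
    4u64.checked_pow(k)
}

fn main() {
    for &k in &[3u32, 5, 15, 31] {
        match kmer_space(k) {
            Some(n) => println!("k={:>2}: {} possible kmers", k, n),
            None => println!("k={:>2}: more kmers than fit in a u64", k),
        }
    }
}
```

At k=31 the space holds 4^31 (about 4.6 x 10^18) kmers, far more than the number of positions in any genome, which is why a longer k is much more likely to be unique. The upper limit on k comes from sequencing errors rather than from this count.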
Constants
- Glues together paired end reads internally and is a character not expected in any read
Functions
- main 🔒
- Normalize the coverage to a certain target and print as a fastq
- Print the reads in fastq format when given in a single line with ~
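As a rough sketch of how a glued single-line entry might be turned back into FASTQ text, the code below replaces each separator with a newline. Everything here is an assumption made for illustration: the names `SEPARATOR` and `unglue` are hypothetical, and the layout of the glued entry (every FASTQ line joined by ~) is a guess, not something confirmed by the crate's documentation.

```rust
/// Hypothetical name for the separator character; the crate's own
/// constant name is not shown in this documentation.
const SEPARATOR: char = '~';

/// Expand one glued single-line entry into multi-line text.
/// ASSUMPTION: every original FASTQ line was joined with SEPARATOR.
fn unglue(entry: &str) -> String {
    let mut out = String::new();
    for piece in entry.split(SEPARATOR) {
        out.push_str(piece);
        out.push('\n');
    }
    out
}

fn main() {
    // A made-up read pair glued onto one line.
    let glued = "@r1/1~ACGT~+~IIII~@r1/2~TTGA~+~IIII";
    print!("{}", unglue(glued));
}
```

Under that assumption, a read pair occupies eight lines of output, four for R1 followed by four for R2, which is the interleaved layout that --paired-end expects.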