Crate fasten_normalize

Normalizes kmer depth by removing some reads from high kmer depths. The input must come from fasten_kmer --remember-reads, whose output has at least three columns: kmer, count, read1, [read2,…]
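
As an illustration of the input layout described above, here is a minimal sketch of reading one row. It assumes the columns are tab-separated, which this page does not state; the `KmerRow` struct and `parse_kmer_row` function are hypothetical names made up for the example and are not part of the crate.

```rust
/// One row of `fasten_kmer --remember-reads` output.
/// Hypothetical struct for illustration only.
#[derive(Debug, PartialEq)]
struct KmerRow {
    kmer: String,
    count: u32,
    reads: Vec<String>,
}

/// Parse one line laid out as: kmer, count, read1, [read2, ...].
/// Assumption: columns are separated by tabs.
/// Returns None if the count is not an integer or no read column is present.
fn parse_kmer_row(line: &str) -> Option<KmerRow> {
    let mut fields = line.trim_end_matches(&['\r', '\n'][..]).split('\t');
    let kmer = fields.next()?.to_string();
    let count = fields.next()?.parse::<u32>().ok()?;
    // Every remaining non-empty column is a remembered read.
    let reads: Vec<String> = fields
        .filter(|s| !s.is_empty())
        .map(|s| s.to_string())
        .collect();
    if reads.is_empty() {
        return None; // fewer than three columns
    }
    Some(KmerRow { kmer, count, reads })
}
```

A row with only a kmer and a count is rejected here, mirroring the requirement that at least three columns be present.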

This was inspired by BBNorm but is probably not the exact same algorithm. See https://jgi.doe.gov/data-and-tools/software-tools/bbtools/bb-tools-user-guide/bbnorm-guide/

Examples

cat testdata/four_reads.fastq | \
  fasten_kmer -k 5 --remember-reads | \
  fasten_normalize | \
  gzip -c > four_reads.normalized.fastq.gz

Paired end reads

cat testdata/R[12].fastq | \
  fasten_shuffle | \
  fasten_kmer -k 3 -m --paired-end | \
  fasten_normalize --target-depth 10 --paired-end | \
  gzip -c > normalized.fastq.gz

Usage

Usage: fasten_normalize [-h] [-n INT] [-p] [--verbose] [--version] [-t INT]

Options:
    -h, --help          Print this help menu.
    -n, --numcpus INT   Number of CPUs (default: 1)
    -p, --paired-end    The input reads are interleaved paired-end
        --verbose       Print more status messages
        --version       Print the version of Fasten and exit
    -t, --target-depth INT
                        The target depth of kmer.

Algorithm

fasten_normalize downsamples the reads pertaining to each kmer. For example, if AAAA is found in the fasten_kmer output 100 times but you request 10x coverage with --target-depth 10, it removes 90% of the reads pertaining to AAAA.

Specifically:

  1. For each kmer, fasten_kmer shows the reads that begin with that kmer.
  2. For paired-end input, fasten_kmer shows extra columns with R1/R2 if R1 begins with that kmer. If more than one read or read pair begins with that kmer, each additional one is displayed in a subsequent column.
  3. fasten_normalize randomly selects reads that begin with that kmer, bringing the number of reads down to the target coverage.
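
The selection in step 3 can be sketched as follows. This is a minimal illustration and not the crate's actual code: `downsample_reads` is a hypothetical name, the partial Fisher-Yates shuffle is one common way to pick a random subset, and the small `XorShift64` generator is a stand-in used only so the example needs no external crates.

```rust
/// Tiny xorshift64 generator (stand-in random source for this sketch).
/// The seed must be nonzero.
struct XorShift64(u64);

impl XorShift64 {
    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Keep at most `target_depth` reads, chosen at random, from the reads
/// recorded for one kmer. Hypothetical helper, not the crate's API.
fn downsample_reads(
    mut reads: Vec<String>,
    target_depth: usize,
    rng: &mut XorShift64,
) -> Vec<String> {
    let keep = target_depth.min(reads.len());
    for i in 0..keep {
        // Partial Fisher-Yates: swap a random element from i..len into slot i.
        // The modulo introduces a slight bias, acceptable for a sketch.
        let j = i + (rng.next_u64() as usize) % (reads.len() - i);
        reads.swap(i, j);
    }
    reads.truncate(keep);
    reads
}
```

With 100 reads for a kmer and a target depth of 10, this keeps 10 reads and drops the other 90, matching the 90% figure in the example above. A kmer with fewer reads than the target is returned unchanged.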

Choosing the correct k

Choose a kmer length that is long enough to be reasonably unique in the genome but short enough that it is unlikely to run into read-level errors. The examples above use k=5 and k=3, which are likely far too short for real data. Something like k=31 is probably a good starting point.
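
A rough counting argument shows why very short kmers are not unique: there are only 4^k distinct DNA kmers of length k. The sketch below computes that number; `possible_kmers` is a hypothetical helper for illustration, and the argument ignores reverse complements and genome composition.

```rust
/// Number of distinct DNA kmers of length `k`, i.e. 4^k.
/// Returns None if the value does not fit in a u64.
fn possible_kmers(k: u32) -> Option<u64> {
    4u64.checked_pow(k)
}
```

For k=3 this gives 64 and for k=5 it gives 1024, so in any real genome nearly every such kmer occurs many times. For k=31 it gives 4611686018427387904 (about 4.6 × 10^18), far more than the number of positions in any genome.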

Constants

  • Glues together paired end reads internally and is a character not expected in any read

Functions

  • main 🔒
  • Normalize the coverage to a certain target and print as a fastq
  • Print the reads in fastq format when given in a single line with ~
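
To illustrate the last item, here is a minimal sketch of turning a single-line read back into fastq text. It assumes that `~` stands in for each newline of the original record and that `~` never occurs inside the read itself; the names `SEPARATOR` and `single_line_to_fastq` are hypothetical and are not taken from the crate.

```rust
/// Assumed separator character; this page only says that a read is
/// "given in a single line with ~".
const SEPARATOR: char = '~';

/// Convert a single-line read such as "@id~ACGT~+~IIII" into
/// multi-line fastq text, assuming `~` replaces each newline.
fn single_line_to_fastq(entry: &str) -> String {
    let mut out = entry.replace(SEPARATOR, "\n");
    // Make sure the record ends with a newline so records can be concatenated.
    if !out.ends_with('\n') {
        out.push('\n');
    }
    out
}
```

Under the same assumption, a paired-end entry glued together with `~` would expand into two consecutive fastq records.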