MLST with colorid
I am pleasantly surprised by how well ColorID works with MLST. For some background, please see my previous post on cgMLST.
Basically, you can query your MLST database against your reads or assembly. First, your format your bigsi database which is your reads or assembly; then, query the MLST database against the bigsi. The major downside is that you cannot discover new alleles. Either there is an allele match or there isn’t. The upside is that it is super fast. Even faster in batch.
Example data
Benchmark dataset: Campylobacter raw milk outbreak
To get a few samples typed, you can download some benchmark datasets from our datasets repo.
cd ~/src
git clone https://github.com/ncezid-biome/datasets datasets-biome
cd ~/src/datasets-biome
# Set up a benchmark folder
mkdir Campy
cd Campy
cp ../scripts/Makefile.template ./Makefile
cp ../datasets/Campylobacter_jejuni_0810PADBR-1.tsv ./in.tsv
# Download and verify all the data
make all
# Compress the fastq files like a good citizen
# Gzipping will take a while.
gzip -v *.fastq
# Checksums no longer needed
rm *.sha256
And then a quick check in with what is here.
gzu2@L355901:~/src/datasets-biome/Campy$ ls
2014D-0067_1.fastq.gz D7316_1.fastq.gz D7323_1.fastq.gz D7330_1.fastq.gz PNUSA000194_1.fastq.gz
2014D-0067_2.fastq.gz D7316_2.fastq.gz D7323_2.fastq.gz D7330_2.fastq.gz PNUSA000194_2.fastq.gz
2014D-0068_1.fastq.gz D7319_1.fastq.gz D7324_1.fastq.gz D7331_1.fastq.gz PNUSA000195_1.fastq.gz
2014D-0068_2.fastq.gz D7319_2.fastq.gz D7324_2.fastq.gz D7331_2.fastq.gz PNUSA000195_2.fastq.gz
2014D-0070_1.fastq.gz D7320_1.fastq.gz D7327_1.fastq.gz D7333_1.fastq.gz PNUSA000196_1.fastq.gz
2014D-0070_2.fastq.gz D7320_2.fastq.gz D7327_2.fastq.gz D7333_2.fastq.gz PNUSA000196_2.fastq.gz
2014D-0189_1.fastq.gz D7321_1.fastq.gz D7328_1.fastq.gz D7334_1.fastq.gz in.tsv
2014D-0189_2.fastq.gz D7321_2.fastq.gz D7328_2.fastq.gz D7334_2.fastq.gz prefetch.done
D5663_1.fastq.gz D7322_1.fastq.gz D7329_1.fastq.gz MANIFEST sha256sum.log
D5663_2.fastq.gz D7322_2.fastq.gz D7329_2.fastq.gz Makefile
gzu2@L355901:~/src/datasets-biome/Campy$ du -sh
9.7G .
Campylobacter MLST database
Save the MLST database into a specific folder as shown here. It is a folder of fasta files.
mkdir -pv ~/db/wgMLST
cd ~/db/wgMLST
mkdir Campy.chewbbaca
cd Campy.chewbbaca
wget -O Campylobacter_jejuni_INNUENDO_wgMLST_2021-05-30T22_06_50.902917.zip "https://chewbbaca.online/api/NS/api/download/compressed_schemas/4/1/2021-05-30T22 :06:50.902917"
unzip Campylobacter_jejuni_INNUENDO_wgMLST_2021-05-30T22_06_50.902917.zip
cd ..
Run ColorID
Installation
This is a copy from the cgMLST post
pushd $HOME/bin
git clone https://github.com/hcdenbakker/colorid.git
cd colorid
cargo build --release
export PATH=$PATH:~/bin/colorid/target/release:~/bin/colorid/scripts
popd
Additionally, you’ll have to get my custom script to run the whole pipeline of building the database and querying it.
pushd ~/src
git clone https://github.com/lskatz/lskScripts
export PATH=$PATH:~/src/lskScripts/scripts
popd
colorid.mlst.pl: runs colorid/mlst on samples and produces a profiles.tsv in stdout
Usage: colorid.mlst.pl [options] MLST.db/ *.fasta *.fastq.gz > profiles.tsv
MLST.db The MLST database of fasta files, one fasta per locus
*.fasta *.fastq.gz can be any number of sequence files
-k kmer length [default: 39]
--tempdir Alternative location for temporary directory
--numcpus Default: 1
--paired Assume paired end reads for each sample.
This internally pairs R1 and R2 for each sample
for colorid's samples.tsv.
--help This useful help menu
Running ColorID
I basically made a perl script that runs the ColorID MLST workflow with as much multithreading as I could do.
time colorid.mlst.pl ~/db/wgMLST/Campy.chewbbaca *.fastq.gz --numcpus 4 --paired > profiles.tsv
- Campy.chewbbaca - the database
- *.fastq.gz - all R1/R2 fastq files
- –numcpus 4 - four threads
- –paired - to indicate that two reads at a time belong to a sample
- > profiles.tsv - the output is in profiles.tsv
This script:
- Builds the bigsi database from
*.fastq.gz
- Queries the MLST database against the bigsi database
- Parses the exact hits to create a profiles file. This file is a tsv which has first column as the sample name and subsequent columns as allele calls.
In my case with 4 threads and 22 samples, it took all of 38 minutes. Assembly-free.