13 minute read

I have been asked in many different contexts how to call alleles on the command line. In August 2022, I gave an internal workshop on how to call alleles on the command line and I thought I would translate that into a blog post.

If you’re here to look at classic MLST callers, this isn’t that blog post. If you want to know more about classic MLST callers, please see our paper:
Page et al 2017, “Comparison of classical multi-locus sequence typing software for next-generation sequencing data.” Microbial Genomics

NOTE: I will give the caveat that ChewBBACA has gone from version 2 to version 3 and so it is possible that things have changed.

Summary

The three callers that seem to be great on the command line are

Each caller has its positives and negatives and so I cannot say that one or the other is clearly the best one over all others.

A lot of my motivation is to find a CLI supplement for BioNumerics’s wgMLST caller, and so I compared results against BioNumerics. I had received four benchmark datasets from PulseNet and made a scatterplot of MLST distances between any two genomes in BioNumerics and the other caller. The distances used in the scatterplot are number of identical alleles divided by the number of total loci compared. Missing loci were excluded from the denominator.

When looking at the trendline between callers and BioNumerics, the linear relationship is outstanding.

Dataset Caller Trendline R2
Salmonella enterica ChewBBACA Y = 1.01x - 0.01 1
Salmonella enterica EToKi Y = 1x + 0 1
Salmonella enterica ColorID Y = 1.01x - 0.01 1
STEC ChewBBACA Y = 1x - 0.01 1
STEC EToKi Y = 0.99x + 0 1
STEC ColorID Y = 0.99x + 0 1
Campylobacter jejuni ChewBBACA Y = 1x + 0 1
Campylobacter jejuni EToKi Y = 1x + 0 1
Campylobacter jejuni ColorID Y = 1.01x + 0 1
Listeria monocytogenes ChewBBACA Y = 1x + 0 1
Listeria monocytogenes EToKi Y = 1x + 0 1
Listeria monocytogenes ColorID Y = 1x + 0 1

I also took a random Campylobacter jejuni genome and ran the same genome assembly through each of the three callers 50 times each. Even if the caller had the option for multithreading, I kept it at 1 thread each.

Benchmark times of three callers

This table is my semi-subjective overall review of the callers.

Measure EToKi ChewBBACA ColorID
Algorithm Gene prediction, “uberblast” to ref alleles to get locus, then hashsum comparison against complete database for allele call. Gene prediction, BLASTp to ref alleles to get locus, then BLASTp to complete database for allele call. Colored Debruijn graphs
Developers Single developer; Slightly responsive PhD students; responsive Single developer; Very responsive
Documentation Minimal Extensive wiki Medium
Installation Source; (Conda); Docker Source; Conda; Docker Cargo (easy)
Adding new alleles Not yet; should be easy to add Automatically done but also alters the database without approval Cannot be done but could be a new feature

Let’s call alleles

We’ll turn this overview into a workshop starting here.

Get your data

  1. Make a folder to stay organized. The code for this list is below this list.
  2. Get a database from https://chewbbaca.online. For this workshop, I am going to use https://chewbbaca.online/api/NS/api/download/compressed_schemas/4/1/2021-05-30T22:06:50.902917 which is a download link all the way to the right side on the Campylobacter page.
  3. Get an assembly you want to call alleles with. I chose 15AR0919 at random for this blog post since it is a complete genome.
mkdir -pv ~/workshops/wgMLST
cd ~/workshops/wgMLST
mkdir Campy.chewbbaca
cd Campy.chewbbaca
wget -O Campylobacter_jejuni_INNUENDO_wgMLST_2021-05-30T22_06_50.902917.zip "https://chewbbaca.online/api/NS/api/download/compressed_schemas/4/1/2021-05-30T22 :06:50.902917"
unzip Campylobacter_jejuni_INNUENDO_wgMLST_2021-05-30T22_06_50.902917.zip
cd ..
mkdir asm
# I took this download link from interacting on the NCBI website at https://www.ncbi.nlm.nih.gov/nuccore/NZ_CP035894.1
wget -O asm/15AR0919.fasta "https://www.ncbi.nlm.nih.gov/sviewer/viewer.cgi?tool=portal&save=file&log$=seqview&db=nuccore&report=fasta&id=1804576668&&ncbi_phid=null"

What to expect when calling alleles

For each caller, here is what you should expect:

  • Will take a few minutes on a bacterial genome
  • Output formats
    • tsv profiles file
      • Assembly name, usually in the first column
      • Allele integers, usually in subsequent columns
    • fasta of allele matches (not always)

ChewBBACA

https://github.com/B-UMMI/chewBBACA

ChewBBACA Installation

ChewBBACA can be installed via conda

conda create -c bioconda -c conda-forge -n chewie "chewbbaca=3.1.2"
conda activate chewie

ChewBBACA Calling alleles

Calling alleles is pretty easy and takes about 2 minutes with 8 cpus. It is efficient in batching and so your times might speed up per genome if you are able to query multiples at once.

chewie AlleleCall -i asm --schema-directory Campy.chewbbaca -o chewie.out --cpu 8

For chewie version 3, you can add --no-inferred to avoid adding new alleles to the database as you are querying. For chewie version 2, you can add --fr to ignore any temporary files in the database and start fresh.

ChewBBACA Results

$ ls chewie.out/
cds_coordinates.tsv  loci_summary_stats.tsv  paralogous_counts.tsv  results_alleles.tsv      results_statistics.tsv
invalid_cds.txt      logging_info.txt        paralogous_loci.tsv    results_contigsInfo.tsv

Summary results are valuable and so let’s look at those first. The really great thing is that each of these columns are documented very well and are well-thought out.

$ column -t results_statistics.tsv
FILE      EXC   INF  PLOT3  PLOT5  LOTSC  NIPH  NIPHEM  ALM  ASM  PAMA  LNF
15AR0919  1078  68   0      0      0      2     4       0    0    0     1642

Here, it looks like we have 1078 alleles that match the Campylobacter wgMLST scheme exactly and then 68 inferred (new) alleles. 1642 loci were not found but that is not a big deal since we always expect many accessory loci not to be found. Actually, it would be awful if we found all loci in a single genome because it would show that something really weird was happening.

The other really important file is results_alleles.tsv which shows all allele integers per locus.

$ cut -f 1-5 results_alleles.tsv | column -t
FILE      INNUENDO_wgMLST-00013230  INNUENDO_wgMLST-00013231  INNUENDO_wgMLST-00013232  INNUENDO_wgMLST-00013233
15AR0919  LNF                       27                        1                         14

Here, I am showing the first five columns where the first column is the assembly name, and then the next columns are loci and their alleles. The first locus was not found as indicated by LNF. The next loci were found to be alleles 27, 1, and 14.

EToKi

EToKi is the MLST caller behind EnteroBase and is at https://github.com/zheminzhou/EToKi. Unfortunately, the developer has not responded to any github issues or pull requests recently. However, I found some issues: not all processes had 1 CPU by default. Not all processes had the same default CPU. Additionally, it depended on a license-restricted software package USEARCH. Therefore, I fixed it into my fork at https://github.com/lskatz/EToKi. Then, @rpetit3 helped us get it into Conda based on my fork.

EToKi installation

conda install -n etoki etoki

EToKi format database

As opposed to ChewBBACA, you cannot download the EToKi-formatted database directly You need to reformat the ChewBBACA database. I used seqtk and perl one-liners to do so.

This section makes a full concatenation on all fasta files from the Chewie database while making a two-line-per-entry fasta file (id, sequence).

mkdir -pv ~/workshops/wgMLST/Campy.etoki
cd ~/workshops/wgMLST
cat Campy.chewbbaca/*.fasta | \
  seqtk seq -l 0 - > Campy.etoki/source.fasta

This section removes new alleles indicated by * and then removes problematic _ characters. EToKi uses _ as a reserved character in separating locus and allele in a defline.

cd Campy.etoki
cat source.fasta | \
  perl -lane '
    $id=$_; $seq=<>; chomp($id,$seq); $id=~s/INNUENDO_|Pasteur_//; next if($id=~/\*/); $underscores = () = $id=~/_/g; die "ERROR: $id" if($underscores > 1); print "$id\n$seq";
  ' > etoki.fasta

Next for EToKi, we need to make a references file: a file of one allele per locus. Usually, the first allele in an MLST fasta file is the representative allele and therefore the one we are taking.

cat etoki.fasta | \
  perl -lane '
    $id=$_; $seq=<>; chomp($id,$seq); ($locus,$allele) = split(/_/, $id); next if($seen{$locus}++); print "$id\n$seq";
  ' > etoki.refs.fasta
cd ..

Next, use EToKi to create an md5sums database file for the rest of the alleles. This file will be etoki.csv.

EToKi.py MLSTdb -i Campy.etoki/etoki.fasta -r Campy.etoki/etoki.refs.fasta -d Campy.etoki/etoki.csv

EToKi Calling Alleles

Run EToKi’s caller like so

EToKi.py MLSType -i asm/15AR0919.fasta -r Campy.etoki/etoki.refs.fasta -k 15AR0919 -d Campy.etoki/etoki.csv -o 15AR0919.etoki.fasta

EToKi uses the references fasta file (-r) to find the correct locus and then the csv database (-d) to see if it is an exact match to any existing allele. If it is close enough to the reference but not an exact match to any known allele, it will indicate it in the resulting fasta file. -k will simply apply a sample name to the results. The results file is a fasta file with a detailed defline for each entry and the allele sequence.

EToKi Results

There is a resulting fasta file 15AR0919.fasta that contains each allele in a fasta entry. The defline is very detailed and so here is one example.

grep -m 1 ">"  15AR0919.fasta
>wgMLST-00013231 value_md5=13777d1c-21fb-e85c-1678-e85bf2a9a586 id=27 CIGAR=wgMLST-00013231_1:648M accepted=2 reference=MLSType:15AR0919 identity=0.9906999999999999 coordinates=NZ_CP035894.1:1124613..1125260:+

The fields can be briefly explained like

  • >wgMLST-00013231 locus name
  • value_md5= md5sum of sequence
  • id=1 allele 1 of this locus
  • CIGAR= A cigar string describing the match
  • accepted=2 a bitwise flag describing the match. 1 or 2 indicate good matches
    • 1 a good allele
    • 2 an allele that has no major problem in its sequences, but may be low quality (still acceptable). 2 is used by default when there is no quality scores such as a fasta file input.
    • 8 the allele is fine, but we are not the central database, so will not be able to assign a formal ID
    • 32 the gene is duplicated (two or more hits in the genome)
    • 64 the gene is fragmented
    • 256 the identity to the reference is too low
  • reference= The etoki module MLSType and the sample name
  • identity=0.99... 99% match
  • coordinates= query assembly coordinates

EToKi: Extra step to make the profiles spreadsheet

Next, we need to turn this output fasta file into a profiles file, which is a spreadsheet of alleles. To do that, you can use my custom perl script.

pushd $HOME/bin
git clone git@github.com:lskatz/lskScripts.git
export PATH=$PATH:$HOME/bin/lskScripts/scripts

The usage for this script can be pulled up with matrix.etoki.pl --help

matrix.etoki.pl: Generate a matrix of alleles from a set of etoki fasta results
  Usage: matrix.etoki.pl [options] *.etoki.fasta > matrix.tsv
    where etoki.fasta files are EToKi results with MLST deflines
  OPTIONS
  --database  The path to the etoki database whose format is CSV with
              three columns: md5sum, locus, allele [required]
  --help      This useful help menu

And then from there, it’s just one command to make the EToKi spreadsheet

matrix.etoki.pl --database Campylobacter_jejuni.etoki/etoki.csv *.etoki.fasta > etoki.tsv

ColorID

ColorID is based on a newer algorithm called colored de bruijn graphs. The database creation is intense but querying on that database is incredibly fast. The major downside of ColorID vs other algorithms is that it cannot report new alleles.

ColorID Installation

ColorID is programmed in Rust and therefore you can install it with cargo.

pushd $HOME/bin
git clone https://github.com/hcdenbakker/colorid.git
cd colorid
cargo build --release
export PATH=$PATH:~/bin/colorid/target/release:~/bin/colorid/scripts
popd

Add the export PATH command into .$HOME/.bashrc to keep the new path permanent.

ColorID Format Database

For the version of MLST with Color ID, we actually compare the MLST database against the query. Therefore, the assembly or the raw reads get formatted.

ColorID algorithm

mkdir colorid.out
bxi="colorid.out/assembly.bxi" 
fofn="chewie.out/asm.fofn
# We currently only have one query fasta file in this example
# but it is worth showing how it would look like with any number of fasta assembly files.
\ls asm/*.fasta | xargs -P 1 -n 1 bash -c 'echo -e "$(basename $0 .fasta)\t$0"' > $fofn
colorid build -b $bxi -s 30000000 -n 2 -k 39 -t 4 -r $fofn 

The ColorId build command:

  • -b $bxi is where the index is going to be saved
  • -s 30000000 is the size of the bloom filter. You want it to be about 10x the size of the genome or greater. Campy is about 1.6M and so 30M is fine.
  • -n 2 is the number of hashes to use for the the bloom filter
  • -k 39 is the kmer size. 39 seems to work well with bacteria.
  • -t 4 is 4 threads
  • -r $fofn is the file of filenames of assemblies to put into the database.

It should look something like this after the colorid build command.

Ref_file : colorid.out/asm.fofn
 Bigsi file : colorid.out/assembly.bxi
K-mer size: 39
Bloom filter parameters: num hashes 2, filter size 30000000
Inference of Bloom filters in parallel using 4 threads.
Creation of index, this may take a while!
Saving BIGSI to file.

And therefore your database is the query fasta file. In the next step, your actual query will be the MLST database against your assemblies.

ColorID Calling Alleles

In the query step, we match the MLST database against the fasta assembly file(s). We can simply use the ChewBBACA database as-is since it is a folder of fasta files. We use -s to indicate that we only want perfect matches. We use -m to indicate that each accession in the MLST database will be treated as a separate query.

colorid search -ms -b $bxi.bxi -q Campylobacter_jejuni.chewbbaca/*.fasta > colorid.out/alleles.colorid.txt 2> colorid.out/alleles.colorid.log
# Delete any line with custom alleles as indicated with a * in the identifier
$ sed -i.bak '/\*/d' colorid.out/alleles.colorid.txt

Next, we want to transform these raw results into a profile spreadsheet of alleles. ColorID has a python script process_MLST.py to do that.

# Transform to spreadsheet
process_MLST.py colorid.out/alleles.colorid.txt colorid.out/mlst
cut -f 1-4 colorid.out/mlst.detailed.tsv | column -t
# Gives a profiles spreadsheet of alleles.

Turning alleles into distances

After each of the MLST calling steps, you should have a spreadsheet of alleles. Typically, this is tab delimited with the first column as the identifier of the sample. The subsequent columns are labeled with each locus and the values are the alleles. In most cases, these allele values are simple integers; in other cases, they could be hashsums of allele sequences.

Great, now what? 🤷

Now we need to turn these alleles into distances. These distances can either be in the form of a distance matrix or in the form of a tree.

Here are the facts going into it.

  • We have a tsv for any caller in the same format.
    • One row is a header
    • Other rows are the sample name, followed by integers.
  • There are some differences between callers in their profiles spreadsheet.
    • Missing alleles are indicated differently.
      • BioNumerics: ?
      • EToKi: -1 (this nuance is due to my custom perl script and not the original author)
      • ChewBBACA: LNF
      • ColorID: MULTI or NOT_CALLED
    • The sample name might be named differently. Maybe the full fasta filen ame or maybe the basename of the filename.

I have created perl scripts for each caller to help identify distances. Additionally, we have found that GrapeTree is an excellent method already in place to generate trees from an allele profile spreadsheet.

ChewBBACA distances

distance matrix

distance.chewbbaca.pl chewie.out/results_alleles.tsv > chewie.distances.tsv

tree

grapetree --profile chewie.distances.tsv

EToKi distances

distance matrix

distance.etoki.pl *.etoki.fasta > etoki.distances.tsv

tree

grapetree --profile etoki.distances.tsv

ColorID distances

distance matrix

distanec.coloridmlst.pl colorid.out/alleles.colorid.txt > colorid.distances.tsv

tree

grapetree --profile colorid.distances.tsv

Done!

I went through a lot of background and hands on instruction and so by this point you should have a basic knowledge of

  • cgMLST methods
  • How to format cgMLST DBs
  • How to call alleles
  • Finding distances with alleles
  • Creating trees with grapetree

Appendix

Callers excluded from the workshop/study

Caller Intended use Reason for exclusion
mlst 7-gene MLST Did not call enough loci
ARIBA Antimicrobial resistance identification By Assembly (and often, 7-gene MLST) Too slow for a full genome
SRST2 7-gene MLST via short reads Too slow for a full genome
Mentalist wgMLST Dependencies are broken and the developer is unresponsive. I could not install it.
BigsDB wgMLST and 7-gene MLST Could not separate the server from the allele caller
StringMLST 7-gene MLST Did not call enough loci
BioNumerics wgMLST Sunsetting
SeqSphere+ wgMLST License restricted
PyMLST wgMLST Batch queries not possible

Callers not examined

Caller Intended use Why not examined
KMA-cgMLST cgMLST I was not aware of it at the time of this workshop/study
STing 7-gene MLST Likely the same issue as StringMLST since it is a similar algorithm