Hash MLST analysis
With EToKi and perhaps future MLST callers, we are possibly looking at a new class of MLST databases 1.
The way we think of MLST databases right now, there is a database of loci,
and each locus has a set of alleles.
The ChewBBACA format is a good example of a classic MLST database structure.
The ChewBBACA format is essentially a folder of fasta files, one per locus.
Each allele has an integer. If the local database finds a new allele, it assigns it a temporary allele number until it can synchronize with other databases.
Therefore, someone can communicate “I have allele number 2 for locus X” and someone will understand exactly what that means. Allele 2 encodes a precise sequence as defined by this synchronized database.
The advantage of this format is that an exact match against any allele gives excellent support for what your query sequence is and that you instantly know the allele integer identifer.
Either this identifier was in the database before you queried, or you had to create a new integer to represent the new sequence.
The disadvantage is that there is necessarily a synchronization step before you can compare your samples vs someone else’s.
Additionally if someone is doing their job right, there is a curation step too!
You wouldn’t want to synchronize poor quality alleles such as a 10 nucleotide allele, or an allele consisting of all A
s.
With EToKi, the database is a different paradigm. Each locus is represented by a single allele and we assume that all other alleles are within X percentage of the representative. Therefore, a query against the EToKi database for one locus is against a single allele. The identifier isn’t a simple integer anymore – it’s an md5sum of the sequence of the query.
sequenceA --> |hashsum| hashsumA
sequenceB --> |hashsum| hashsumB
>locus1_allele1
sequenceA
>locus2_allele5
sequenceB
|
V
>hashsumA
sequenceA
>hashsumB
sequenceB
I love this new paradigm because there is no real need to synchronize alleles anymore. If you find a new allele, there is no need to send it to me specifically to synchronize with my database. This is because your new allele has a specific hashsum. If I ever find this allele myself, it will have the same hashsum. In fact, if I have a whole genome of allele calls, I can compare it to one of your genomes and know exactly which alleles are the same and which are different. I can know this without a comprehensive database that we synchronize. All I need to know is which hashsum algorithm you used. Many people use md5sum, for example. However, there are many other hashsum algorithms out there.
Another huge advantage of this small database is that queries take a very short time. Classically, a caller would compare a query vs all or many alleles in the database. Now with EToKi, you only compare against one.
One possible disadvantage of this method is that you might have a locus with a large diversity. For example, what if your threshold is 80% identity, but you have a locus that has diversity up to 70% different than a reference sequence? In this case, it is up to the curator of the database to include more representative alleles per locus. As a result for some loci, more than one allele might be present in the database. Arguably, this is part of the database curation that already has to be done for an MLST database.
Another disadvantage is that you lose the nucleotide-level diversity in your results. This can be overcome by making a map of hash->sequence elsewhere in your database, but it would not be part of the database that you routinely query against. It can also be overcome by brute force, e.g., hash2seq. Arguably this is a minor disadvantage because in MLST you should only be concerned with same/different, which is what hashes already provide. One last disadvantage that I have to mention is that there is a possibility of hash collisions. That is to say, two different allele sequences could equate to the same hashsum. However, this is only a formal disadvantage and I have not seen this happen with any kind of strong algorithm such as md5sum or sha256. I have seen allele collisions with CRC32.
All in all I am very impressed with this new way of MLST calling and hope we can all get on board :)
Acknowledgements
Thank you Joao Carrico, Subin Park, and Andrew Huang for reviewing this post.
Footnotes
-
I’m not sure which caller was actually first to use hashes. Some of the early hash-based MLST callers include ChewieSnake and hash-cgMLST. However, these databases do not explicitly encourage a single allele per locus in the database like EToKi does. ↩