Original methods
Original methods
This is how I created the Nmen database from original files and according to the hash database specification.
I need these files
- refs.fasta
- profiles.tsv
- clusters.tsv
- alleles.tsv
- isolates.tsv (not part of the specification)
isolates.tsv
I went to the pubmlst site and downloaded isolates including LINCodes and cgMLST into a spreadsheet.
Although you cannot download everything at once, I downloaded the first ten thousand into isolates.tsv.tmp.
Then, I filtered for those with cgMLST.
mkdir Nmen.dbh
cat isolates.tsv.tmp | perl -F'\t' -lane 'next if(!$F[1335]); print;' > Nmen.dbh/isolates.tsv
refs.fasta and alleles.tsv
Need to download all loci to get refs.fasta and alleles.tsv.
# Get list of loci for this scheme Nmen cgMLST v3, ID 88
wget "https://rest.pubmlst.org/db/pubmlst_neisseria_seqdef/schemes/88/loci" -O loci.json
# Download all loci
mkdir locus.xargs
cat loci.json | \
perl -lane 'while(/(https.+?loci\/(\w+))/g){$URL="$1/alleles_fasta"; print $URL;}' | \
xargs -P 4 -n 1 bash -c '
dir=$(dirname $0);
locus=$(basename $dir);
target="locus.xargs/$locus.fasta";
if [ -e $target ]; then
exit;
fi;
echo $locus >&2;
wget -O $target.tmp $0 2>/dev/null && mv $target.tmp $target;
'
# format
cd .. # in the db directory
perl ../scripts/digestFasta.pl -o Nmen.dbh Nmen/loci/*.fasta --hash md5 --force
Now we have a folder db/Nmen.dbh that contains refs.fasta and alleles.tsv.
profiles.tsv
The alleles.tsv file looks sort of like this now
## hash-alleles-format v0.4
# locus allele hash-type
NEIS0001 UOM/xntDYHMRhA47qiIueQ md5 was=1;start-sequence=ATG;stop-sequence=TAA;length=924
NEIS0001 N612rWygdQFHPvrIpjrVfg md5 was=2;start-sequence=ATG;stop-sequence=TAA;length=924
perl ../scripts/intProfilesToHashes.pl Nmen.dbh
This creates profiles.tsv in the Nmen.dbh directory.
I wanted to update a few fields in profiles.tsv and so I ran this one liner:
cp data/Nmen.dbh/profiles.tsv data/Nmen.dbh/profiles.tsv.bak
cat data/Nmen.dbh/profiles.tsv.bak | \
perl -F'\t' -lane '
$id = $F[3];
if(!$id){
if($F[1]){
$id="PM-$F[1]";
}
}
if(!$id){
$id=$F[0];
}
if($.==1){
$id="ID";
}
print join("\t",$id,@F[5..scalar(@F)-1]);
' > data/Nmen.dbh/profiles.tsv
clusters.tsv
There are LINCodes in this database and so it is appropriate to make clusters.tsv.
cat Nmen.dbh/isolates.tsv | perl -F'\t' -lane '
# determine strain name as either biosample, isolate, or pubmlst ID.
$strain="$F[3]"||"ISOLATE-$F[1]"||"PUBMLST-$F[0]"||"UNKNOWN";
$strain=~s/\s+//g;# remove icky whitespace
$strain=~s/;.*//; # remove semicolon delimited alt names
$lin=$F[1335];
print join("\t", $strain, "LINCode", $lin);
' > Nmen.dbh/clusters.tsv
Cleanup
Here are some suggested ways to archive unnecessary files.
cd Nmen.dbh
mkdir archive
# mv isolates.tsv into archive
mv -nv isolates.tsv archive/
# cp alleles.tsv with attributes field into archive
cp -nv alleles.tsv archive/
# only copy back the first three fields out of archive
cut -f 1-3 archive/alleles.tsv > alleles.tsv
Now to compress the archive
cd archive
gzip -9v isolates.tsv alleles.tsv