Downloading the breadth of SARS-CoV-2 genomes
I wanted to download the full breadth of available SARS-CoV-2 genomes, so I started with the two major repositories: NCBI and GISAID.
NCBI
For this task, I think all that is generally needed is the Entrez Direct (Edirect) package and the SRA Toolkit. Edirect gives us a spreadsheet of metadata; the SRA Toolkit gives us the genomic data itself.
Edirect
The basic strategy here is to query the taxonomy ID of SARS-CoV-2 only, link it to SRA, then get summary documents and parse them.
taxid=2697049  # NCBI Taxonomy ID for SARS-CoV-2
esearch -db taxonomy -query "${taxid}[uid]" | \
elink -target sra | \
esummary | \
xtract -pattern DocumentSummary -group Runs -element Run@acc -element Run@total_bases -group ExpXml -element Biosample -element Platform -element Statistics@total_bases -block Library_descriptor -element LIBRARY_NAME -element LIBRARY_STRATEGY -element LIBRARY_SOURCE -element LIBRARY_SELECTION -element LIBRARY_LAYOUT \
> ncbi.tsv 2> ncbi.log &
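Once the background job finishes, it is worth sanity-checking the table before downloading anything. The sketch below assumes the run accession landed in the first column of ncbi.tsv, as the xtract call requests, and that accessions follow the usual SRR/ERR/DRR pattern. list_run_accessions is a hypothetical helper name, not part of Edirect.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the unique SRA run accessions found in column 1
# of a tab-delimited file. Assumes accessions look like SRR/ERR/DRR + digits.
list_run_accessions() {
  local tsv="$1"
  cut -f1 "$tsv" | grep -E '^[SED]RR[0-9]+$' | sort -u
}

# Example: count the runs, then peek at any rows that were filtered out.
# list_run_accessions ncbi.tsv | wc -l
# cut -f1 ncbi.tsv | grep -vE '^[SED]RR[0-9]+$' | head
```

Rows that fail the pattern are a hint that the summary document had an unexpected shape, so they are worth a look before trusting the run count.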
SRA Toolkit
Keeping with the chain of data, I used the spreadsheet produced by Edirect to list all SRA run IDs and download them. I used this script, which Taylor Griswold gave me and which I modified a little.
https://github.com/lskatz/lskScripts/blob/master/qsub/array/launch_fastq-dump_split.sh
bash ~/src/qsub/array/launch_fastq-dump_split.sh out <(cut -f1 ncbi.tsv)
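The linked script sits under qsub/array, so it presumably targets a cluster scheduler. For a single machine, the same idea can be sketched as a plain serial loop. This is not the linked script, only a minimal stand-in: download_runs and the FASTQ_DUMP override are hypothetical names, and the flags assume the stock fastq-dump from the SRA Toolkit.

```bash
#!/usr/bin/env bash
# Minimal serial stand-in for the array-job script (an assumption, not the
# author's code). Reads one run accession per line and fetches each run.
# FASTQ_DUMP is a hypothetical override so the loop can be dry-run with echo.
download_runs() {
  local outdir="$1" runlist="$2" run
  local dumper="${FASTQ_DUMP:-fastq-dump}"
  mkdir -p "$outdir"
  while IFS= read -r run; do
    [ -n "$run" ] || continue            # skip blank lines
    # --split-files writes one FASTQ per read in the spot (e.g. _1 and _2).
    # stdin is redirected so the command cannot swallow the run list.
    $dumper --split-files --gzip --outdir "$outdir" "$run" < /dev/null \
      || echo "failed: $run" >&2         # keep going after a failed run
  done < "$runlist"
}

# Usage mirroring the command above:
# download_runs out <(cut -f1 ncbi.tsv)
```

Setting FASTQ_DUMP=echo prints the commands instead of running them, which is a cheap way to check the run list before committing to a long download.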
GISAID
These genomes were downloaded as FASTA files: click on EpiCoV, then download, then nextfasta.
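After the download, a quick record count is a reasonable check that the file arrived intact. This sketch assumes the download has been unpacked to a plain or gzip-compressed FASTA file. count_fasta_records and the file name sequences.fasta are placeholders, since the exact GISAID file naming is not given here.

```bash
#!/usr/bin/env bash
# Hypothetical helper: count FASTA records by counting header lines ('>').
# Handles plain text and .gz files; other archive formats must be unpacked first.
count_fasta_records() {
  local fasta="$1"
  case "$fasta" in
    *.gz) gzip -dc "$fasta" | grep -c '^>' ;;
    *)    grep -c '^>' "$fasta" ;;
  esac
}

# Example (placeholder file name):
# count_fasta_records sequences.fasta
```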