Over the last several years, we have been generating benchmark datasets.

  • Have you ever made a workflow that you need to test on real data?
  • Do you have a new person on your team who needs data to learn your workflows?
  • Are there proficiency tests that you need to give to your teammates every so often?
  • Are you learning a new kind of analysis?

If you said yes to any of these, then you might be in the market for benchmark datasets. Each dataset pairs genomic data with an expected result. For example, if you have a new outbreak clustering pipeline, you can download one of the outbreak datasets, run your pipeline on it, and check that it clusters the samples the same way the dataset says it should.

I have worked with many collaborators over the last several years to create these datasets, an effort that includes two previous publications: Timme et al. 2017 and Xiaoli et al. 2022.

Datasets

We have already made datasets for the following:

  • Foodborne outbreaks - test outbreak clustering
  • SARS-CoV-2 - test clustering, assembly, lineage identification
  • Legionella outbreak - test outbreak clustering
  • In vitro evolution - test phylogeny
  • AMR/plasmid assemblies - test AMR genotyping
  • Oxford Nanopore - a toy dataset for running an assembler quickly

Each dataset comes with hashsums and metadata, ensuring that every user downloads the exact same files.
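
In practice, verifying a download is a one-line check against the provided hashsum file. A minimal sketch, assuming the hashsums are SHA-256 and are stored in a file named sha256sum.txt (the actual file name in a given dataset may differ):

    sha256sum -c sha256sum.txt

If every file listed matches, the tool prints OK for each one; any mismatch or missing file is flagged.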

Future

Future datasets

I would love to have other datasets, including:

  • More species or Salmonella serotypes for AMR genotyping
  • cgMLST datasets with allele calls
  • Other species and sample types
    • Candida
    • Metagenomics from stool or sewage

Future scripting

Right now I use Perl to create a Makefile, which then runs the data download.

I am changing this to a standalone Makefile, and it seems to be working really well.
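
As a rough illustration of the idea, a download rule in such a Makefile might look like the sketch below. The accession, file names, and hashsum file are hypothetical placeholders, not the actual contents of any of the datasets; it assumes the SRA Toolkit (fastq-dump) and sha256sum are installed.

    # Hypothetical sample: download paired-end reads from SRA, then verify them.
    # (Recipe lines must be indented with a literal tab.)
    SAMPLE = SRR1234567

    all: $(SAMPLE)_1.fastq.gz

    $(SAMPLE)_1.fastq.gz:
    	fastq-dump --split-files --gzip $(SAMPLE)
    	sha256sum -c sha256sum.txt

Because Make only re-runs a rule when its target file is missing, this approach also makes it easy to resume a partially finished download.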
