Over the last several years, we have been generating benchmark datasets.

  • Have you ever made a workflow that you need to test on real data?
  • Do you have a new person on your team who needs data to learn your workflows?
  • Are there proficiency tests that you need to give to your teammates every so often?
  • Are you learning a new kind of analysis?

If you said yes to any of these, then you might be in the market for benchmark datasets. Each dataset pairs genomic data with an expected result. For example, if you have a new outbreak clustering pipeline, you can download one of the outbreak datasets, run your pipeline on it, and check that it clusters the samples the same way the dataset says it should.

I have worked with many collaborators over the last several years to create these datasets, an effort that includes two previous publications: Timme et al. 2017 and Xiaoli et al. 2022.

Datasets

We have already made datasets for the following:

  • Foodborne outbreaks - test outbreak clustering
  • SARS-CoV-2 - test clustering, assembly, lineage identification
  • Legionella outbreak - test outbreak clustering
  • In vitro evolution - test phylogeny
  • AMR/plasmid assemblies - test AMR genotyping
  • Oxford Nanopore - a toy dataset for running an assembler quickly

Each dataset comes with hashsums and metadata, ensuring that every user downloads the exact same files.
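
In practice, verifying a download is a one-line check against the provided hashsum file. A minimal sketch, assuming the hashsums are SHA-256 and are stored in a file named sha256sum.txt (the actual file name in a given dataset may differ):

    sha256sum -c sha256sum.txt

If every file listed matches, the tool prints OK for each one; any mismatch or missing file is flagged.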

Future

Future datasets

I would love to have other datasets, including:

  • More species or Salmonella serotypes for AMR genotyping
  • cgMLST datasets with allele calls
  • Other species and sample types
    • Candida
    • Metagenomics from stool or sewage

Future scripting

Right now I use Perl to create a Makefile, which then runs the data download.

I am changing this to a standalone Makefile, and it seems to be working really well.
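
As a rough illustration of the idea, a download rule in such a Makefile might look like the sketch below. The accession, file names, and hashsum file are hypothetical placeholders, not the actual contents of any of the datasets; it assumes the SRA Toolkit (fastq-dump) and sha256sum are installed.

    # Hypothetical sample: download paired-end reads from SRA, then verify them.
    # (Recipe lines must be indented with a literal tab.)
    SAMPLE = SRR1234567

    all: $(SAMPLE)_1.fastq.gz

    $(SAMPLE)_1.fastq.gz:
    	fastq-dump --split-files --gzip $(SAMPLE)
    	sha256sum -c sha256sum.txt

Because Make only re-runs a rule when its target file is missing, this approach also makes it easy to resume a partially finished download.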
