Skip to content

How to create efficient DivBase queries

TODO add more tips here

General advice

Split large VCF files into smaller files. For instance by chromosome. If the assembly does not have chromosome level contiguity, we suggest to store a range of scaffolds per file.

If you want to check-out all samples across all files, it will be more efficient to download the files than to run a query

bcftools command pipe order

For example, it is typically faster to first subset on genomic range (=drop rows in the VCF that do not fullfil the range filter) and then subset on samples than to do the reverse. This is because sample subset will require writes to each row in the VCF; by first reducing the number of rows, there will be fewer write operations and thus a faster operation.