VCF Files
TODO
- What format? VCF.GZ
- Can it be compressed with gzip, or do we recommend bcftools?
- Does it need to have an index file?
    - No, DivBase creates these during the queries. DivBase uses CSI indexes instead of TBI indexes since the former can accommodate larger genome assemblies (TODO: find the reference for the sizes the two index formats can handle)
- File size limitation?
    - Each DivBase project has a file quota. Ask the staff if you have questions about this.
    - Comment that splitting large VCFs by chromosome will likely be more efficient for DivBase.
    - We can provide suggestions on how to split.
    - Example script from mouse VCF benchmarking to split a larger VCF into 20 smaller ones by chromosome: https://github.com/ScilifelabDataCentre/divbase/blob/per-task-cpu-ram-metrics/scripts/benchmarking/split_mouse_vcf_per_scaffold.sh
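To illustrate the idea of splitting by chromosome, here is a minimal pure-Python sketch. It is not the benchmarking script linked above, and the function name `split_vcf_by_chromosome` is hypothetical. It copies the full header into one uncompressed VCF per value in the CHROM column. It keeps one output file open per chromosome, so it suits assemblies with a modest number of chromosomes rather than thousands of scaffolds.

```python
import gzip
from pathlib import Path


def split_vcf_by_chromosome(vcf_path, out_dir):
    """Write one uncompressed VCF per CHROM value.

    Returns the output paths in order of first appearance in the input.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    opener = gzip.open if str(vcf_path).endswith(".gz") else open
    header_lines = []
    outputs = {}  # chrom -> (path, open file handle)
    try:
        with opener(vcf_path, "rt") as source:
            for line in source:
                if line.startswith("#"):
                    header_lines.append(line)  # meta lines and the #CHROM line
                    continue
                if not line.strip():
                    continue  # ignore blank lines
                chrom = line.split("\t", 1)[0]
                if chrom not in outputs:
                    # Assumption: CHROM values are safe to use as file names.
                    out_path = out_dir / f"{chrom}.vcf"
                    handle = open(out_path, "w")
                    handle.writelines(header_lines)
                    outputs[chrom] = (out_path, handle)
                outputs[chrom][1].write(line)
    finally:
        for _path, handle in outputs.values():
            handle.close()
    return [path for path, _handle in outputs.values()]
```

Each output would still need to be compressed before upload (for example with `bgzip` from HTSlib), assuming DivBase expects `.vcf.gz` files as the notes above suggest. For large files, a dedicated tool such as bcftools is likely to be much faster than this sketch.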
Managing VCF files in DivBase
DivBase is built to help users manage their VCF data. For this to work well, users need to consider the following when uploading data.
DivBase is built with good data management principles in mind (TODO: add an external ref for this? RDMkit, The Turing Way, etc.):
- Data should not be duplicated in a DivBase project.
- The same sample or the same variant can be present in multiple files, but the same combination of sample and variant cannot.
Toy examples. For the sake of demonstration, let's assume that we have two VCF files that each contain only four samples and one variant.
The following example is compatible with DivBase: the same samples are present in both files, but the files describe different variants. Furthermore, DivBase is built to handle big VCF files that have been split into smaller files by chromosome or by scaffold/contig; in fact, it is often more performant to store the data that way.

    # >zcat file_1.vcf.gz | grep -v "^##" | head
    #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE001 SAMPLE002 SAMPLE003 SAMPLE004
    1 12345 1_12345 T C . . PR;AN=10;AC=0 GT 0/0 1/0 0/0 0/

    # >zcat file_2.vcf.gz | grep -v "^##" | head
    #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE001 SAMPLE002 SAMPLE003 SAMPLE004
    1 56789 1_56789 T C . . PR;AN=10;AC=0 GT 0/0 0/0 0/0 0/0

The following is also compatible with DivBase: the same variant is present in both files, but the samples it describes differ between the two files.

    # >zcat file_1.vcf.gz | grep -v "^##" | head
    #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE001 SAMPLE002 SAMPLE003 SAMPLE004
    1 12345 1_12345 T C . . PR;AN=10;AC=0 GT 0/0 1/0 0/0 0/

    # >zcat file_2.vcf.gz | grep -v "^##" | head
    #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE005 SAMPLE006 SAMPLE007 SAMPLE008
    1 12345 1_12345 T C . . PR;AN=10;AC=0 GT 0/0 1/0 0/0 0/

The following is NOT COMPATIBLE with DivBase: here one sample (SAMPLE001) occurs for the exact same variant in both files. DivBase will not be able to process this.

    # >zcat file_1.vcf.gz | grep -v "^##" | head
    #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE001 SAMPLE002 SAMPLE003 SAMPLE004
    1 12345 1_12345 T C . . PR;AN=10;AC=0 GT 0/0 1/0 0/0 0/

    # >zcat file_2.vcf.gz | grep -v "^##" | head
    #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE001 SAMPLE005 SAMPLE006 SAMPLE007
    1 12345 1_12345 T C . . PR;AN=10;AC=0 GT 0/0 1/0 0/0 0/

- VCF and sidecar sample metadata files are versioned in the DivBase project.
    - The system will use the latest version for all queries.
    - TODO: can the file version be set with a flag?
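The duplication rule from the toy examples (a sample and a variant must not appear together in more than one file) could be checked locally before upload. The following is a minimal sketch and not part of DivBase: the function names `summarize_vcf` and `find_conflicts` are hypothetical. It also assumes that a variant is identified by its CHROM, POS, REF and ALT fields, which may differ from how DivBase compares variants. Because every record in a VCF file covers all of that file's samples, two files conflict exactly when they share at least one sample and at least one variant.

```python
import gzip
from itertools import combinations


def summarize_vcf(vcf_path):
    """Return (sample names, variant keys) found in one VCF file."""
    opener = gzip.open if str(vcf_path).endswith(".gz") else open
    samples, variants = set(), set()
    with opener(vcf_path, "rt") as handle:
        for line in handle:
            if line.startswith("##") or not line.strip():
                continue  # skip meta-information and blank lines
            fields = line.rstrip("\n").split("\t")
            if line.startswith("#CHROM"):
                samples.update(fields[9:])  # sample columns follow FORMAT
                continue
            chrom, pos, _id, ref, alt = fields[:5]
            # Assumption: a variant is identified by CHROM, POS, REF and ALT.
            variants.add((chrom, pos, ref, alt))
    return samples, variants


def find_conflicts(vcf_paths):
    """Return (file_a, file_b, shared_samples, shared_variants) per conflicting pair."""
    summaries = {path: summarize_vcf(path) for path in vcf_paths}
    conflicts = []
    for file_a, file_b in combinations(vcf_paths, 2):
        shared_samples = summaries[file_a][0] & summaries[file_b][0]
        shared_variants = summaries[file_a][1] & summaries[file_b][1]
        # A conflict needs both: same sample AND same variant in two files.
        if shared_samples and shared_variants:
            conflicts.append(
                (file_a, file_b, sorted(shared_samples), sorted(shared_variants))
            )
    return conflicts
```

Calling `find_conflicts(["file_1.vcf.gz", "file_2.vcf.gz"])` on the first two toy examples should return an empty list, while the third example should report SAMPLE001 and the variant at position 12345. The sketch keeps all variant keys in memory, so it is meant for small files or quick checks rather than whole-genome data.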