Re: [Bioc-devel] VCF Intersection Using readVcf Remarkably Slow

2016-09-28 Thread Martin Morgan
On 09/27/2016 06:00 PM, Dario Strbenac wrote:
Good day,
file <- system.file("extdata", "chr22.vcf.gz", package = "VariantAnnotation")
anotherFile <- system.file("extdata", "hapmap_exome_chr22.vcf.gz", package = "VariantAnnotation")
aSet <- readVcf(file, "hg19")
system.time(commonMutations <- re

Re: [Bioc-devel] VCF Intersection Using readVcf Remarkably Slow

2016-09-28 Thread Vincent Carey
Dario's computer is faster than mine
> system.time(commonMutations <- readVcf(anotherFile, "hg19", rowRanges(aSet)))
   user  system elapsed
426.271  57.296 483.766
The disk infrastructure is a determinant of throughput. Most VCF queries are decomposable and can be parallelized. After chunki
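
The remark that VCF queries are decomposable can be illustrated with a rough sketch: split the query ranges into chunks and read each chunk in a separate worker. This is an assumption about how such a split might look, not the approach spelled out in the truncated message. The chunk count, the 200-range subset (used only to keep the sketch quick) and the names `query`, `chunks` and `pieces` are illustrative. If the disk is the bottleneck, adding workers may not help.

```r
library(VariantAnnotation)
library(BiocParallel)

file <- system.file("extdata", "chr22.vcf.gz", package = "VariantAnnotation")
anotherFile <- system.file("extdata", "hapmap_exome_chr22.vcf.gz",
                           package = "VariantAnnotation")

aSet <- readVcf(file, "hg19")

# Illustrative subset: the first 200 query ranges, without metadata columns.
query <- head(granges(rowRanges(aSet)), 200)

# Split the ranges into a small, arbitrary number of contiguous chunks.
nChunks <- 4
chunks <- as.list(split(query, cut(seq_along(query), nChunks, labels = FALSE)))

# Each worker runs an independent range query against the indexed file.
# The default BiocParallel back-end is used here.
pieces <- bplapply(chunks, function(rng, path) {
    VariantAnnotation::readVcf(path, "hg19", param = rng)
}, path = anotherFile)

# Number of records returned per chunk.
vapply(pieces, nrow, integer(1))
```

The result is a list with one VCF object per chunk; how best to combine them depends on what is done with the records afterwards.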

Re: [Bioc-devel] VCF Intersection Using readVcf Remarkably Slow

2016-09-27 Thread Michael Lawrence
I think the basic problem is that each range requires a separate query through tabix. BAM and tabix are designed to be fast for single queries, like what a genome browser might generate, but not for querying thousands of regions at once. At least that's the way it seems to me. The index is only at
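
If the cost really is one tabix query per range, one way to sidestep it is to issue a single query over the span of all ranges and filter in memory afterwards. The following is a minimal sketch under that assumption, not a method confirmed in this thread. The names `query`, `span` and `wide` are illustrative. Unlike the original call, it keeps each overlapping record once, whereas a per-range query may return a record once for every range it overlaps.

```r
library(VariantAnnotation)

file <- system.file("extdata", "chr22.vcf.gz", package = "VariantAnnotation")
anotherFile <- system.file("extdata", "hapmap_exome_chr22.vcf.gz",
                           package = "VariantAnnotation")

aSet <- readVcf(file, "hg19")
query <- rowRanges(aSet)

# Collapse the query ranges into one spanning range per chromosome,
# so the indexed file is queried once rather than once per variant.
span <- range(query)
wide <- readVcf(anotherFile, "hg19", param = span)

# Keep only the records that overlap one of the original query ranges.
commonMutations <- subsetByOverlaps(wide, query)
```

This trades memory for speed: everything inside the spanning range is read before filtering, which is reasonable for a single-chromosome file like this one but may not scale to whole-genome inputs.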

[Bioc-devel] VCF Intersection Using readVcf Remarkably Slow

2016-09-27 Thread Dario Strbenac
Good day,
file <- system.file("extdata", "chr22.vcf.gz", package = "VariantAnnotation")
anotherFile <- system.file("extdata", "hapmap_exome_chr22.vcf.gz", package = "VariantAnnotation")
aSet <- readVcf(file, "hg19")
system.time(commonMutations <- readVcf(anotherFile, "hg19", rowRanges(aSet)))