Val,

Has there been any movement on this? This remains a substantial bottleneck for us when writing very large VCF files (e.g. variants + genotypes for whole-genome NGS samples).
I was able to see a ~25% speedup with 4 cores and an "optimal" speedup of ~2x with 10-12 cores for a VCF with 500k rows, using a very naive parallelization strategy and no other changes. I suspect this could be improved on quite a bit, or possibly made irrelevant with judicious use of serial C code. Did you and Martin make any plans regarding optimizing writeVcf?

Best,
~G

On Tue, Aug 5, 2014 at 2:33 PM, Valerie Obenchain <voben...@fhcrc.org> wrote:

> Hi Michael,
>
> I'm interested in working on this. I'll discuss with Martin next week
> when we're both back in the office.
>
> Val
>
> On 08/05/14 07:46, Michael Lawrence wrote:
>
>> Hi guys (Val, Martin, Herve):
>>
>> Anyone have an itch for optimization? The writeVcf function is
>> currently a bottleneck in our WGS genotyping pipeline. For a typical
>> 50 million row gVCF, it was taking 2.25 hours prior to yesterday's
>> improvements (pasteCollapseRows) that brought it down to about 1 hour,
>> which is still too long by my standards (> 0). Only takes 3 minutes to
>> call the genotypes (and associated likelihoods etc) from the variant
>> calls (using 80 cores and 450 GB RAM on one node), so the output is
>> the issue. Profiling suggests that the running time scales
>> non-linearly in the number of rows.
>>
>> Digging a little deeper, it seems to be something with R's
>> string/memory allocation. Below, pasting 1 million strings takes 6
>> seconds, but 10 million strings takes over 2 minutes. It gets way
>> worse with 50 million. I suspect it has something to do with R's
>> string hash table.
>>
>> set.seed(1000)
>> end <- sample(1e8, 1e6)
>> system.time(paste0("END", "=", end))
>>    user  system elapsed
>>   6.396   0.028   6.420
>>
>> end <- sample(1e8, 1e7)
>> system.time(paste0("END", "=", end))
>>    user  system elapsed
>> 134.714   0.352 134.978
>>
>> Indeed, even this takes a long time (in a fresh session):
>>
>> set.seed(1000)
>> end <- sample(1e8, 1e6)
>> end <- sample(1e8, 1e7)
>> system.time(as.character(end))
>>    user  system elapsed
>>  57.224   0.156  57.366
>>
>> But running it a second time is faster (about what one would expect?):
>>
>> system.time(levels <- as.character(end))
>>    user  system elapsed
>>  23.582   0.021  23.589
>>
>> I did some simple profiling of R to find that the resizing of the
>> string hash table is not a significant component of the time. So maybe
>> something to do with the R heap/gc? No time right now to go deeper.
>> But I know Martin likes this sort of thing ;)
>>
>> Michael

--
Computational Biologist
Genentech Research
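The "very naive parallelization strategy" mentioned at the top of the thread is not spelled out, but a minimal sketch of that general idea is below, assuming the VariantAnnotation writeVcf() interface and a fork-capable platform (parallel::mclapply). The helper name, the chunk count, and the header-stripping concatenation step are illustrative assumptions, not the code that produced the quoted timings.

library(VariantAnnotation)   # writeVcf() and the VCF class
library(parallel)            # mclapply(), splitIndices()

## Illustrative sketch only: write a large VCF object in row chunks on
## several cores, then concatenate the chunk files. For truly large files
## one would stream rather than readLines() whole chunks into memory.
writeVcfParallel <- function(vcf, dest, nchunks = 4L) {
    idx <- splitIndices(nrow(vcf), nchunks)          # row indices per chunk
    parts <- mclapply(seq_along(idx), function(i) {
        part <- tempfile(fileext = ".vcf")
        writeVcf(vcf[idx[[i]], ], part)              # each worker writes its rows
        part
    }, mc.cores = nchunks)
    out <- file(dest, "w")
    on.exit(close(out))
    for (i in seq_along(parts)) {
        lines <- readLines(parts[[i]])
        if (i > 1L)
            lines <- lines[!startsWith(lines, "#")]  # keep the header once
        writeLines(lines, out)
        unlink(parts[[i]])
    }
    invisible(dest)
}

## e.g. writeVcfParallel(vcf, "out.vcf", nchunks = 4L)

Even if each worker's paste/format work parallelizes well, the final concatenation and any per-string overhead inside writeVcf() stay serial, which would be consistent with the modest speedups reported above (~25% on 4 cores, ~2x on 10-12).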
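On the non-linear paste0()/as.character() timings quoted at the bottom of the thread: one cheap diagnostic, which as far as I know was not run here, is to do the same paste in fixed-size chunks. If the chunked version scales roughly linearly while the single call does not, that points at heap growth/GC rather than at the string hash table itself. The 1e6 chunk size below is an arbitrary assumption.

set.seed(1000)
end <- sample(1e8, 1e7)

## One big paste, as in the timings above.
system.time(big <- paste0("END", "=", end))

## The same work in 1e6-element chunks.
chunks <- split(end, ceiling(seq_along(end) / 1e6))
system.time(small <- unlist(lapply(chunks,
                                   function(x) paste0("END", "=", x)),
                            use.names = FALSE))

identical(big, small)   # TRUE; same strings, built chunk-wise

Comparing gc() output before and after each variant would also show whether Ncells/Vcells growth tracks the slowdown.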