I thought it might come down to the heap initialization. We'll work with that.
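
For anyone who wants to check how much of the elapsed time is garbage collection rather than the string conversion itself, here is a minimal sketch that is not taken from the thread. It wraps the as.character() call from the example quoted below in base R's gc.time(); the helper name time_with_gc and the smaller default n are assumptions made for illustration.

```r
# Hypothetical helper (not from the thread): time as.character() on random
# integers and report how much of that time R spent in garbage collection.
time_with_gc <- function(n = 1e6, seed = 1000) {
  set.seed(seed)
  end <- sample(1e8, n)              # same kind of input as the quoted example
  gc_before <- gc.time()             # cumulative GC time; the call also turns GC timing on
  # gcFirst = FALSE so system.time() does not add a forced collection of its own
  timing <- system.time(chr <- as.character(end), gcFirst = FALSE)
  gc_spent <- gc.time() - gc_before  # GC time accrued during the conversion
  list(n = length(chr),
       elapsed = timing[["elapsed"]],
       gc_elapsed = gc_spent[[3]])   # third element is elapsed GC time
}

res <- time_with_gc()
print(res)
```

Running this once in a default session and once in a session started with the --min-vsize and --min-nsize settings quoted below should show whether the difference comes mostly from gc_elapsed.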

On Wed, Aug 13, 2014 at 4:42 PM, Martin Morgan <mtmor...@fhcrc.org> wrote:

> On 08/05/2014 07:46 AM, Michael Lawrence wrote:
>
>> Hi guys (Val, Martin, Herve):
>>
>> Anyone have an itch for optimization? The writeVcf function is currently a
>> bottleneck in our WGS genotyping pipeline. For a typical 50 million row
>> gVCF, it was taking 2.25 hours prior to yesterday's improvements
>> (pasteCollapseRows) that brought it down to about 1 hour, which is still
>> too long by my standards (> 0). Only takes 3 minutes to call the genotypes
>> (and associated likelihoods etc) from the variant calls (using 80 cores
>> and 450 GB RAM on one node), so the output is an issue. Profiling suggests
>> that the running time scales non-linearly in the number of rows.
>>
>> Digging a little deeper, it seems to be something with R's string/memory
>> allocation. Below, pasting 1 million strings takes 6 seconds, but 10
>> million strings takes over 2 minutes. It gets way worse with 50 million. I
>> suspect it has something to do with R's string hash table.
>>
>> set.seed(1000)
>> end <- sample(1e8, 1e6)
>> system.time(paste0("END", "=", end))
>>    user  system elapsed
>>   6.396   0.028   6.420
>>
>> end <- sample(1e8, 1e7)
>> system.time(paste0("END", "=", end))
>>    user  system elapsed
>> 134.714   0.352 134.978
>>
>> Indeed, even this takes a long time (in a fresh session):
>>
>> set.seed(1000)
>> end <- sample(1e8, 1e6)
>> end <- sample(1e8, 1e7)
>> system.time(as.character(end))
>>    user  system elapsed
>>  57.224   0.156  57.366
>>
>
> my usual trick is R --no-save --quiet --min-vsize=2048M --min-nsize=45M,
> which changes the example above from
>
> > system.time(as.character(end))
>    user  system elapsed
>  82.835   0.343  83.195
>
> to
>
> > system.time(as.character(end))
>    user  system elapsed
>   9.245   0.169   9.424
>
> but I think it's a one-time gain; I wonder what the writeVcf command is
> that you're running?
>
> Martin
>
>> But running it a second time is faster (about what one would expect?):
>>
>> system.time(levels <- as.character(end))
>>    user  system elapsed
>>  23.582   0.021  23.589
>>
>> I did some simple profiling of R to find that the resizing of the string
>> hash table is not a significant component of the time. So maybe something
>> to do with the R heap/gc? No time right now to go deeper. But I know
>> Martin likes this sort of thing ;)
>>
>> Michael
>>
>> _______________________________________________
>> Bioc-devel@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>
>
> --
> Computational Biology / Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109
>
> Location: Arnold Building M1 B861
> Phone: (206) 667-2793