Gabe is still testing/profiling, but we'll send something randomized along eventually.
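
In the meantime, here is a minimal sketch of a randomized timing along the lines of the paste0() test quoted below. It is not the benchmark Gabe is running. The helper name time_paste and the reduced sizes are assumptions made for illustration, chosen so the loop finishes quickly. If the cost per string grows with n, that would be consistent with the non-linear scaling described in the original post.

```r
# Hypothetical helper (not part of VariantAnnotation or the original thread):
# time paste0() on n random integers drawn as in the quoted example.
time_paste <- function(n, seed = 1000) {
  set.seed(seed)
  end <- sample(1e8, n)
  # elapsed wall-clock seconds for building n "END=<value>" strings
  unname(system.time(paste0("END", "=", end))["elapsed"])
}

# Assumed sizes, kept small here; the quoted post used 1e6 and 1e7.
sizes <- c(1e4, 1e5, 1e6)
elapsed <- vapply(sizes, time_paste, numeric(1))

# Seconds per string; roughly constant values would indicate linear scaling.
per_string <- elapsed / sizes
print(data.frame(n = sizes, elapsed = elapsed, sec_per_string = per_string))
```

Raising sizes toward 1e7 should show whether the per-string cost climbs the way the quoted timings suggest. Each size is best run in a fresh session, since the thread notes that a second run in the same session is faster.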

On Tue, Aug 26, 2014 at 11:15 AM, Martin Morgan <mtmor...@fhcrc.org> wrote:

> I didn't see in the original thread a reproducible (simulated, I guess)
> example, to be explicit about what the problem is??
>
> Martin
>
>
> On 08/26/2014 10:47 AM, Michael Lawrence wrote:
>
>> My understanding is that the heap optimization provided marginal gains, and
>> that we need to think harder about how to optimize the all of the string
>> manipulation in writeVcf. We either need to reduce it or reduce its
>> overhead (i.e., the CHARSXP allocation). Gabe is doing more tests.
>>
>>
>> On Tue, Aug 26, 2014 at 9:43 AM, Valerie Obenchain <voben...@fhcrc.org>
>> wrote:
>>
>>> Hi Gabe,
>>>
>>> Martin responded, and so did Michael,
>>>
>>> https://stat.ethz.ch/pipermail/bioc-devel/2014-August/006082.html
>>>
>>> It sounded like Michael was ok with working with/around heap
>>> initialization.
>>>
>>> Michael, is that right or should we still consider this on the table?
>>>
>>>
>>> Val
>>>
>>>
>>> On 08/26/2014 09:34 AM, Gabe Becker wrote:
>>>
>>>> Val,
>>>>
>>>> Has there been any movement on this? This remains a substantial
>>>> bottleneck for us when writing very large VCF files (e.g.
>>>> variants+genotypes for whole genome NGS samples).
>>>>
>>>> I was able to see a ~25% speedup with 4 cores and an "optimal" speedup
>>>> of ~2x with 10-12 cores for a VCF with 500k rows using a very naive
>>>> parallelization strategy and no other changes. I suspect this could be
>>>> improved on quite a bit, or possibly made irrelevant with judicious use
>>>> of serial C code.
>>>>
>>>> Did you and Martin make any plans regarding optimizing writeVcf?
>>>>
>>>> Best
>>>> ~G
>>>>
>>>>
>>>> On Tue, Aug 5, 2014 at 2:33 PM, Valerie Obenchain <voben...@fhcrc.org
>>>> <mailto:voben...@fhcrc.org>> wrote:
>>>>
>>>> Hi Michael,
>>>>
>>>> I'm interested in working on this. I'll discuss with Martin next
>>>> week when we're both back in the office.
>>>>
>>>> Val
>>>>
>>>>
>>>> On 08/05/14 07:46, Michael Lawrence wrote:
>>>>
>>>> Hi guys (Val, Martin, Herve):
>>>>
>>>> Anyone have an itch for optimization? The writeVcf function is
>>>> currently a bottleneck in our WGS genotyping pipeline. For a typical
>>>> 50 million row gVCF, it was taking 2.25 hours prior to yesterday's
>>>> improvements (pasteCollapseRows) that brought it down to about 1 hour,
>>>> which is still too long by my standards (> 0). Only takes 3 minutes to
>>>> call the genotypes (and associated likelihoods etc) from the variant
>>>> calls (using 80 cores and 450 GB RAM on one node), so the output is an
>>>> issue. Profiling suggests that the running time scales non-linearly in
>>>> the number of rows.
>>>>
>>>> Digging a little deeper, it seems to be something with R's
>>>> string/memory allocation. Below, pasting 1 million strings takes
>>>> 6 seconds, but 10 million strings takes over 2 minutes. It gets way
>>>> worse with 50 million. I suspect it has something to do with R's
>>>> string hash table.
>>>>
>>>>     set.seed(1000)
>>>>     end <- sample(1e8, 1e6)
>>>>     system.time(paste0("END", "=", end))
>>>>        user  system elapsed
>>>>       6.396   0.028   6.420
>>>>
>>>>     end <- sample(1e8, 1e7)
>>>>     system.time(paste0("END", "=", end))
>>>>        user  system elapsed
>>>>     134.714   0.352 134.978
>>>>
>>>> Indeed, even this takes a long time (in a fresh session):
>>>>
>>>>     set.seed(1000)
>>>>     end <- sample(1e8, 1e6)
>>>>     end <- sample(1e8, 1e7)
>>>>     system.time(as.character(end))
>>>>        user  system elapsed
>>>>      57.224   0.156  57.366
>>>>
>>>> But running it a second time is faster (about what one would expect?):
>>>>
>>>>     system.time(levels <- as.character(end))
>>>>        user  system elapsed
>>>>      23.582   0.021  23.589
>>>>
>>>> I did some simple profiling of R to find that the resizing of the
>>>> string hash table is not a significant component of the time. So maybe
>>>> something to do with the R heap/gc? No time right now to go deeper.
>>>> But I know Martin likes this sort of thing ;)
>>>>
>>>> Michael
>>>>
>>>> [[alternative HTML version deleted]]
>>>>
>>>> _________________________________________________
>>>> Bioc-devel@r-project.org <mailto:Bioc-devel@r-project.org> mailing list
>>>> https://stat.ethz.ch/mailman/__listinfo/bioc-devel
>>>> <https://stat.ethz.ch/mailman/listinfo/bioc-devel>
>>>>
>>>>
>>>> --
>>>> Computational Biologist
>>>> Genentech Research
>>>>
>
> --
> Computational Biology / Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109
>
> Location: Arnold Building M1 B861
> Phone: (206) 667-2793