The profiling I attached in my previous email is for 24 geno fields, as I said, but our typical use case involves only ~4-6 fields. That is faster, but still on the order of dozens of minutes.
Sorry for the confusion.

~G

On Wed, Aug 27, 2014 at 11:45 AM, Gabe Becker <becke...@gene.com> wrote:

> Martin and Val.
>
> I re-ran writeVcf on our (G)VCF data (34790518 ranges, 24 geno fields)
> with profiling enabled. The results of summaryRprof for that run are
> attached, though for a variety of reasons they are pretty misleading.
>
> It took over an hour to write (3700+ seconds), so it's definitely a
> bottleneck when the data get very large, even if it isn't for smaller data.
>
> Michael and I both think the culprit is all the pasting and cbinding that
> is going on, and more to the point, that memory for an internal
> representation to be written out is allocated at all. Streaming across the
> object, looping by rows and writing directly to file (e.g. from C) should
> be blisteringly fast in comparison.
>
> ~G
>
>
> On Tue, Aug 26, 2014 at 11:57 AM, Michael Lawrence <micha...@gene.com>
> wrote:
>
>> Gabe is still testing/profiling, but we'll send something randomized
>> along eventually.
>>
>>
>> On Tue, Aug 26, 2014 at 11:15 AM, Martin Morgan <mtmor...@fhcrc.org>
>> wrote:
>>
>>> I didn't see in the original thread a reproducible (simulated, I guess)
>>> example, to be explicit about what the problem is??
>>>
>>> Martin
>>>
>>>
>>> On 08/26/2014 10:47 AM, Michael Lawrence wrote:
>>>
>>>> My understanding is that the heap optimization provided marginal
>>>> gains, and that we need to think harder about how to optimize all of
>>>> the string manipulation in writeVcf. We either need to reduce it or
>>>> reduce its overhead (i.e., the CHARSXP allocation). Gabe is doing
>>>> more tests.
>>>>
>>>>
>>>> On Tue, Aug 26, 2014 at 9:43 AM, Valerie Obenchain <voben...@fhcrc.org>
>>>> wrote:
>>>>
>>>>> Hi Gabe,
>>>>>
>>>>> Martin responded, and so did Michael,
>>>>>
>>>>> https://stat.ethz.ch/pipermail/bioc-devel/2014-August/006082.html
>>>>>
>>>>> It sounded like Michael was ok with working with/around heap
>>>>> initialization.
>>>>>
>>>>> Michael, is that right or should we still consider this on the table?
>>>>>
>>>>>
>>>>> Val
>>>>>
>>>>>
>>>>> On 08/26/2014 09:34 AM, Gabe Becker wrote:
>>>>>
>>>>>> Val,
>>>>>>
>>>>>> Has there been any movement on this? This remains a substantial
>>>>>> bottleneck for us when writing very large VCF files (e.g.
>>>>>> variants+genotypes for whole genome NGS samples).
>>>>>>
>>>>>> I was able to see a ~25% speedup with 4 cores and an "optimal"
>>>>>> speedup of ~2x with 10-12 cores for a VCF with 500k rows using a
>>>>>> very naive parallelization strategy and no other changes. I suspect
>>>>>> this could be improved on quite a bit, or possibly made irrelevant
>>>>>> with judicious use of serial C code.
>>>>>>
>>>>>> Did you and Martin make any plans regarding optimizing writeVcf?
>>>>>>
>>>>>> Best
>>>>>> ~G
>>>>>>
>>>>>>
>>>>>> On Tue, Aug 5, 2014 at 2:33 PM, Valerie Obenchain
>>>>>> <voben...@fhcrc.org> wrote:
>>>>>>
>>>>>> Hi Michael,
>>>>>>
>>>>>> I'm interested in working on this. I'll discuss with Martin next
>>>>>> week when we're both back in the office.
>>>>>>
>>>>>> Val
>>>>>>
>>>>>>
>>>>>> On 08/05/14 07:46, Michael Lawrence wrote:
>>>>>>
>>>>>> Hi guys (Val, Martin, Herve):
>>>>>>
>>>>>> Anyone have an itch for optimization? The writeVcf function is
>>>>>> currently a bottleneck in our WGS genotyping pipeline. For a
>>>>>> typical 50 million row gVCF, it was taking 2.25 hours prior to
>>>>>> yesterday's improvements (pasteCollapseRows) that brought it down
>>>>>> to about 1 hour, which is still too long by my standards (> 0).
>>>>>> Only takes 3 minutes to call the genotypes (and associated
>>>>>> likelihoods etc) from the variant calls (using 80 cores and 450 GB
>>>>>> RAM on one node), so the output is an issue. Profiling suggests
>>>>>> that the running time scales non-linearly in the number of rows.
>>>>>>
>>>>>> Digging a little deeper, it seems to be something with R's
>>>>>> string/memory allocation. Below, pasting 1 million strings takes 6
>>>>>> seconds, but 10 million strings takes over 2 minutes. It gets way
>>>>>> worse with 50 million. I suspect it has something to do with R's
>>>>>> string hash table.
>>>>>>
>>>>>>     set.seed(1000)
>>>>>>     end <- sample(1e8, 1e6)
>>>>>>     system.time(paste0("END", "=", end))
>>>>>>        user  system elapsed
>>>>>>       6.396   0.028   6.420
>>>>>>
>>>>>>     end <- sample(1e8, 1e7)
>>>>>>     system.time(paste0("END", "=", end))
>>>>>>        user  system elapsed
>>>>>>     134.714   0.352 134.978
>>>>>>
>>>>>> Indeed, even this takes a long time (in a fresh session):
>>>>>>
>>>>>>     set.seed(1000)
>>>>>>     end <- sample(1e8, 1e6)
>>>>>>     end <- sample(1e8, 1e7)
>>>>>>     system.time(as.character(end))
>>>>>>        user  system elapsed
>>>>>>      57.224   0.156  57.366
>>>>>>
>>>>>> But running it a second time is faster (about what one would
>>>>>> expect?):
>>>>>>
>>>>>>     system.time(levels <- as.character(end))
>>>>>>        user  system elapsed
>>>>>>      23.582   0.021  23.589
>>>>>>
>>>>>> I did some simple profiling of R to find that the resizing of the
>>>>>> string hash table is not a significant component of the time. So
>>>>>> maybe something to do with the R heap/gc? No time right now to go
>>>>>> deeper.
>>>>>> But I know Martin likes this sort of thing ;)
>>>>>>
>>>>>> Michael
>>>>>>
>>>>>> [[alternative HTML version deleted]]
>>>>>>
>>>>>> _______________________________________________
>>>>>> Bioc-devel@r-project.org mailing list
>>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>>>
>>>>>> --
>>>>>> Computational Biologist
>>>>>> Genentech Research
>>>
>>> --
>>> Computational Biology / Fred Hutchinson Cancer Research Center
>>> 1100 Fairview Ave. N.
>>> PO Box 19024 Seattle, WA 98109
>>>
>>> Location: Arnold Building M1 B861
>>> Phone: (206) 667-2793