Val and Martin, Apologies for the delay.
We realized that the Illumina platinum genome vcf files make a good test case, assuming you strip out all the info (info=NA when reading it into R) stuff. ftp://platgene:g3n3s...@ussd-ftp.illumina.com/NA12877_S1.genome.vcf.gz took about ~4.2 hrs to write out, and is about 1.5x the size of the files we are actually dealing with (~50M ranges vs our ~30M). Looking forward a new vastly improved writeVcf :). ~G On Tue, Sep 2, 2014 at 1:53 PM, Michael Lawrence <lawrence.mich...@gene.com> wrote: > Yes, it's very clear that the scaling is non-linear, and Gabe has been > experimenting with a chunk-wise + parallel algorithm. Unfortunately there > is some frustrating overhead with the parallelism. But I'm glad Val is > arriving at something quicker. > > Michael > > > On Tue, Sep 2, 2014 at 1:33 PM, Martin Morgan <mtmor...@fhcrc.org> wrote: > >> On 08/27/2014 11:56 AM, Gabe Becker wrote: >> >>> The profiling I attached in my previous email is for 24 geno fields, as >>> I said, >>> but our typical usecase involves only ~4-6 fields, and is faster but >>> still on >>> the order of dozens of minutes. >>> >> >> I think Val is arriving at a (much) more efficient implementation, but... >> >> I wanted to share my guess that the poor _scaling_ is because the garbage >> collector runs multiple times as the different strings are pasted together, >> and has to traverse, in linear time, increasing numbers of allocated SEXPs. >> So times scale approximately quadratically with the number of rows in the >> VCF >> >> An efficiency is to reduce the number of SEXPs in play by writing out in >> chunks -- as each chunk is written, the SEXPs become available for >> collection and are re-used. Here's my toy example >> >> time.R >> ====== >> splitIndices <- function (nx, ncl) >> { >> i <- seq_len(nx) >> if (ncl == 0L) >> list() >> else if (ncl == 1L || nx == 1L) >> list(i) >> else { >> fuzz <- min((nx - 1L)/1000, 0.4 * nx/ncl) >> breaks <- seq(1 - fuzz, nx + fuzz, length = ncl + 1L) >> structure(split(i, cut(i, breaks, labels=FALSE)), names = NULL) >> } >> } >> >> x = as.character(seq_len(1e7)); y = sample(x) >> if (!is.na(Sys.getenv("SPLIT", NA))) { >> idx <- splitIndices(length(x), 20) >> system.time(for (i in idx) paste(x[i], y[i], sep=":")) >> } else { >> system.time(paste(x, y, sep=":")) >> } >> >> >> running under R-devel with $ SPLIT=TRUE R --no-save --quiet -f time.R the >> relevant time is >> >> user system elapsed >> 15.320 0.064 15.381 >> >> versus with $ R --no-save --quiet -f time.R it is >> >> user system elapsed >> 95.360 0.164 95.511 >> >> I think this is likely an overall strategy when dealing with character >> data -- processing in independent chunks of moderate (1M?) size (enabling >> as a consequence parallel evaluation in modest memory) that are sufficient >> to benefit from vectorization, but that do not entail allocation of large >> numbers of in-use SEXPs. >> >> Martin >> >> >>> Sorry for the confusion. >>> ~G >>> >>> >>> On Wed, Aug 27, 2014 at 11:45 AM, Gabe Becker <becke...@gene.com >>> <mailto:becke...@gene.com>> wrote: >>> >>> Martin and Val. >>> >>> I re-ran writeVcf on our (G)VCF data (34790518 ranges, 24 geno >>> fields) with >>> profiling enabled. The results of summaryRprof for that run are >>> attached, >>> though for a variety of reasons they are pretty misleading. >>> >>> It took over an hour to write (3700+seconds), so it's definitely a >>> bottleneck when the data get very large, even if it isn't for >>> smaller data. >>> >>> Michael and I both think the culprit is all the pasting and cbinding >>> that is >>> going on, and more to the point, that memory for an internal >>> representation >>> to be written out is allocated at all. Streaming across the object, >>> looping >>> by rows and writing directly to file (e.g. from C) should be >>> blisteringly >>> fast in comparison. >>> >>> ~G >>> >>> >>> On Tue, Aug 26, 2014 at 11:57 AM, Michael Lawrence < >>> micha...@gene.com >>> <mailto:micha...@gene.com>> wrote: >>> >>> Gabe is still testing/profiling, but we'll send something >>> randomized >>> along eventually. >>> >>> >>> On Tue, Aug 26, 2014 at 11:15 AM, Martin Morgan < >>> mtmor...@fhcrc.org >>> <mailto:mtmor...@fhcrc.org>> wrote: >>> >>> I didn't see in the original thread a reproducible >>> (simulated, I >>> guess) example, to be explicit about what the problem is?? >>> >>> Martin >>> >>> >>> On 08/26/2014 10:47 AM, Michael Lawrence wrote: >>> >>> My understanding is that the heap optimization provided >>> marginal >>> gains, and >>> that we need to think harder about how to optimize the >>> all of >>> the string >>> manipulation in writeVcf. We either need to reduce it or >>> reduce its >>> overhead (i.e., the CHARSXP allocation). Gabe is doing >>> more tests. >>> >>> >>> On Tue, Aug 26, 2014 at 9:43 AM, Valerie Obenchain >>> <voben...@fhcrc.org <mailto:voben...@fhcrc.org>> >>> >>> wrote: >>> >>> Hi Gabe, >>> >>> Martin responded, and so did Michael, >>> >>> https://stat.ethz.ch/__pipermail/bioc-devel/2014-__ >>> August/006082.html >>> >>> <https://stat.ethz.ch/pipermail/bioc-devel/2014- >>> August/006082.html> >>> >>> It sounded like Michael was ok with working >>> with/around heap >>> initialization. >>> >>> Michael, is that right or should we still consider >>> this on >>> the table? >>> >>> >>> Val >>> >>> >>> On 08/26/2014 09:34 AM, Gabe Becker wrote: >>> >>> Val, >>> >>> Has there been any movement on this? This >>> remains a >>> substantial >>> bottleneck for us when writing very large VCF >>> files (e.g. >>> variants+genotypes for whole genome NGS samples). >>> >>> I was able to see a ~25% speedup with 4 cores >>> and an >>> "optimal" speedup >>> of ~2x with 10-12 cores for a VCF with 500k >>> rows using >>> a very naive >>> parallelization strategy and no other changes. I >>> suspect >>> this could be >>> improved on quite a bit, or possibly made >>> irrelevant >>> with judicious use >>> of serial C code. >>> >>> Did you and Martin make any plans regarding >>> optimizing >>> writeVcf? >>> >>> Best >>> ~G >>> >>> >>> On Tue, Aug 5, 2014 at 2:33 PM, Valerie Obenchain >>> <voben...@fhcrc.org <mailto:voben...@fhcrc.org> >>> <mailto:voben...@fhcrc.org <mailto: >>> voben...@fhcrc.org>>> >>> >>> wrote: >>> >>> Hi Michael, >>> >>> I'm interested in working on this. I'll >>> discuss >>> with Martin next >>> week when we're both back in the office. >>> >>> Val >>> >>> >>> >>> >>> >>> On 08/05/14 07:46, Michael Lawrence wrote: >>> >>> Hi guys (Val, Martin, Herve): >>> >>> Anyone have an itch for optimization? >>> The >>> writeVcf function is >>> currently a >>> bottleneck in our WGS genotyping >>> pipeline. For >>> a typical 50 >>> million row >>> gVCF, it was taking 2.25 hours prior to >>> yesterday's improvements >>> (pasteCollapseRows) that brought it >>> down to >>> about 1 hour, which >>> is still >>> too long by my standards (> 0). Only >>> takes 3 >>> minutes to call the >>> genotypes >>> (and associated likelihoods etc) from >>> the >>> variant calls (using >>> 80 cores and >>> 450 GB RAM on one node), so the output >>> is an >>> issue. Profiling >>> suggests that >>> the running time scales non-linearly >>> in the >>> number of rows. >>> >>> Digging a little deeper, it seems to be >>> something with R's >>> string/memory >>> allocation. Below, pasting 1 million >>> strings >>> takes 6 seconds, but >>> 10 >>> million strings takes over 2 minutes. >>> It gets >>> way worse with 50 >>> million. I >>> suspect it has something to do with >>> R's string >>> hash table. >>> >>> set.seed(1000) >>> end <- sample(1e8, 1e6) >>> system.time(paste0("END", "=", end)) >>> user system elapsed >>> 6.396 0.028 6.420 >>> >>> end <- sample(1e8, 1e7) >>> system.time(paste0("END", "=", end)) >>> user system elapsed >>> 134.714 0.352 134.978 >>> >>> Indeed, even this takes a long time >>> (in a >>> fresh session): >>> >>> set.seed(1000) >>> end <- sample(1e8, 1e6) >>> end <- sample(1e8, 1e7) >>> system.time(as.character(end)) >>> user system elapsed >>> 57.224 0.156 57.366 >>> >>> But running it a second time is faster >>> (about >>> what one would >>> expect?): >>> >>> system.time(levels <- >>> as.character(end)) >>> user system elapsed >>> 23.582 0.021 23.589 >>> >>> I did some simple profiling of R to >>> find that >>> the resizing of >>> the string >>> hash table is not a significant >>> component of >>> the time. So maybe >>> something >>> to do with the R heap/gc? No time >>> right now to >>> go deeper. But I >>> know Martin >>> likes this sort of thing ;) >>> >>> Michael >>> >>> [[alternative HTML version >>> deleted]] >>> >>> >>> ______________________________ >>> _____________________ >>> Bioc-devel@r-project.org >>> <mailto:Bioc-devel@r-project.org> >>> <mailto:Bioc-devel@r-project.__org >>> <mailto:Bioc-devel@r-project.org>> >>> mailing list >>> >>> https://stat.ethz.ch/mailman/_ >>> ___listinfo/bioc-devel >>> <https://stat.ethz.ch/mailman/ >>> __listinfo/bioc-devel> >>> >>> <https://stat.ethz.ch/mailman/ >>> __listinfo/bioc-devel >>> <https://stat.ethz.ch/mailman/ >>> listinfo/bioc-devel>> >>> >>> >>> ______________________________ >>> _____________________ >>> Bioc-devel@r-project.org >>> <mailto:Bioc-devel@r-project.org> >>> <mailto:Bioc-devel@r-project.__org >>> <mailto:Bioc-devel@r-project.org>> mailing >>> list >>> https://stat.ethz.ch/mailman/_ >>> ___listinfo/bioc-devel >>> <https://stat.ethz.ch/mailman/ >>> __listinfo/bioc-devel> >>> >>> >>> >>> <https://stat.ethz.ch/mailman/ >>> __listinfo/bioc-devel >>> <https://stat.ethz.ch/mailman/ >>> listinfo/bioc-devel>> >>> >>> >>> >>> >>> -- >>> Computational Biologist >>> Genentech Research >>> >>> >>> >>> >>> >>> [[alternative HTML version deleted]] >>> >>> _________________________________________________ >>> Bioc-devel@r-project.org <mailto:Bioc-devel@r-project. >>> org> >>> mailing list >>> https://stat.ethz.ch/mailman/__listinfo/bioc-devel >>> <https://stat.ethz.ch/mailman/listinfo/bioc-devel> >>> >>> >>> >>> -- >>> Computational Biology / Fred Hutchinson Cancer Research >>> Center >>> 1100 Fairview Ave. N. >>> PO Box 19024 Seattle, WA 98109 >>> >>> Location: Arnold Building M1 B861 >>> Phone: (206) 667-2793 <tel:%28206%29%20667-2793> >>> >>> >>> >>> >>> >>> >>> -- >>> Computational Biologist >>> Genentech Research >>> >>> >>> >>> >>> -- >>> Computational Biologist >>> Genentech Research >>> >> >> >> -- >> Computational Biology / Fred Hutchinson Cancer Research Center >> 1100 Fairview Ave. N. >> PO Box 19024 Seattle, WA 98109 >> >> Location: Arnold Building M1 B861 >> Phone: (206) 667-2793 >> > > -- Computational Biologist Genentech Research [[alternative HTML version deleted]] _______________________________________________ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel