Hi Gabe,

Martin responded, and so did Michael,

https://stat.ethz.ch/pipermail/bioc-devel/2014-August/006082.html

It sounded like Michael was OK with working with/around the heap initialization.

Michael, is that right, or is this still on the table?


Val

On 08/26/2014 09:34 AM, Gabe Becker wrote:
Val,

Has there been any movement on this? This remains a substantial
bottleneck for us when writing very large VCF files (e.g.
variants+genotypes for whole genome NGS samples).

I was able to see a ~25% speedup with 4 cores and an "optimal" speedup
of ~2x with 10-12 cores for a VCF with 500k rows, using a very naive
parallelization strategy and no other changes. I suspect this could be
improved quite a bit, or possibly made irrelevant with judicious use of
serial C code.
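
Roughly, the naive strategy amounts to the sketch below (illustrative
only; the helper name, the chunking, and the header handling are my own
for the example, not anything already in VariantAnnotation):

library(parallel)
library(VariantAnnotation)

## Split the VCF into row chunks, write each chunk to its own temp file
## in a forked worker, then concatenate, keeping the header only from
## the first chunk.
writeVcfParallel <- function(vcf, filename, ncores = 4L) {
    idx <- splitIndices(nrow(vcf), ncores)
    chunkFiles <- mclapply(seq_along(idx), function(i) {
        f <- tempfile(fileext = ".vcf")
        writeVcf(vcf[idx[[i]], ], f)
        f
    }, mc.cores = ncores)
    out <- file(filename, "w")
    on.exit(close(out))
    for (i in seq_along(chunkFiles)) {
        lines <- readLines(chunkFiles[[i]])
        if (i > 1L)
            lines <- lines[!grepl("^#", lines)]  # drop the repeated header
        writeLines(lines, out)
    }
    invisible(filename)
}

Re-reading the chunk files just to merge them is clearly wasteful, which
is part of why I suspect serial C code could make the whole exercise
unnecessary.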

Did you and Martin make any plans regarding optimizing writeVcf?

Best
~G


On Tue, Aug 5, 2014 at 2:33 PM, Valerie Obenchain <voben...@fhcrc.org> wrote:

    Hi Michael,

    I'm interested in working on this. I'll discuss with Martin next
    week when we're both back in the office.

    Val





    On 08/05/14 07:46, Michael Lawrence wrote:

        Hi guys (Val, Martin, Herve):

        Anyone have an itch for optimization? The writeVcf function is
        currently a bottleneck in our WGS genotyping pipeline. For a
        typical 50 million row gVCF, it was taking 2.25 hours prior to
        yesterday's improvements (pasteCollapseRows) that brought it
        down to about 1 hour, which is still too long by my standards
        (> 0). Only takes 3 minutes to call the genotypes (and
        associated likelihoods etc) from the variant calls (using 80
        cores and 450 GB RAM on one node), so the output is an issue.
        Profiling suggests that the running time scales non-linearly
        in the number of rows.

        Digging a little deeper, it seems to be something with R's
        string/memory allocation. Below, pasting 1 million strings
        takes 6 seconds, but 10 million strings takes over 2 minutes.
        It gets way worse with 50 million. I suspect it has something
        to do with R's string hash table.

        set.seed(1000)
        end <- sample(1e8, 1e6)
        system.time(paste0("END", "=", end))
             user  system elapsed
            6.396   0.028   6.420

        end <- sample(1e8, 1e7)
        system.time(paste0("END", "=", end))
             user  system elapsed
        134.714   0.352 134.978

        Indeed, even this takes a long time (in a fresh session):

        set.seed(1000)
        end <- sample(1e8, 1e6)
        end <- sample(1e8, 1e7)
        system.time(as.character(end))
             user  system elapsed
           57.224   0.156  57.366

        But running it a second time is faster (about what one would
        expect?):

        system.time(levels <- as.character(end))
             user  system elapsed
           23.582   0.021  23.589

        I did some simple profiling of R to find that the resizing of
        the string hash table is not a significant component of the
        time. So maybe something to do with the R heap/gc? No time
        right now to go deeper. But I know Martin likes this sort of
        thing ;)
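
        For anyone who wants to poke at it, one quick way to see how
        much of that time is GC is something like this (just a sketch;
        numbers will vary by machine and R version):

        set.seed(1000)
        end <- sample(1e8, 1e7)
        Rprof("paste.out", gc.profiling = TRUE)  # samples taken inside GC show up as "<GC>"
        invisible(paste0("END", "=", end))
        Rprof(NULL)
        head(summaryRprof("paste.out")$by.self)  # look for the "<GC>" row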

        Michael








--
Computational Biologist
Genentech Research

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel
