Re: [Bioc-devel] writeVcf performance

Martin Morgan Tue, 02 Sep 2014 13:36:16 -0700

On 08/27/2014 11:56 AM, Gabe Becker wrote:

The profiling I attached in my previous email is for 24 geno fields, as I said,
but our typical usecase involves only ~4-6 fields, and is faster but still on
the order of dozens of minutes.


I think Val is arriving at a (much) more efficient implementation, but...

I wanted to share my guess that the poor _scaling_ is because the garbagecollector runs multiple times as the different strings are pasted together, andhas to traverse, in linear time, increasing numbers of allocated SEXPs. So timesscale approximately quadratically with the number of rows in the VCF

An efficiency is to reduce the number of SEXPs in play by writing out in chunks-- as each chunk is written, the SEXPs become available for collection and arere-used. Here's my toy example


time.R
======
splitIndices <- function (nx, ncl)
{
    i <- seq_len(nx)
    if (ncl == 0L)
        list()
    else if (ncl == 1L || nx == 1L)
        list(i)
    else {
        fuzz <- min((nx - 1L)/1000, 0.4 * nx/ncl)
        breaks <- seq(1 - fuzz, nx + fuzz, length = ncl + 1L)
        structure(split(i, cut(i, breaks, labels=FALSE)), names = NULL)
    }
}

x = as.character(seq_len(1e7)); y = sample(x)
if (!is.na(Sys.getenv("SPLIT", NA))) {
    idx <- splitIndices(length(x), 20)
    system.time(for (i in idx) paste(x[i], y[i], sep=":"))
} else {
    system.time(paste(x, y, sep=":"))
}

running under R-devel with $ SPLIT=TRUE R --no-save --quiet -f time.R therelevant time is


   user  system elapsed
 15.320   0.064  15.381

versus with $ R --no-save --quiet -f time.R it is

   user  system elapsed
 95.360   0.164  95.511

I think this is likely an overall strategy when dealing with character data --processing in independent chunks of moderate (1M?) size (enabling as aconsequence parallel evaluation in modest memory) that are sufficient to benefitfrom vectorization, but that do not entail allocation of large numbers of in-useSEXPs.


Martin


Sorry for the confusion.
~G


On Wed, Aug 27, 2014 at 11:45 AM, Gabe Becker <becke...@gene.com
<mailto:becke...@gene.com>> wrote:

    Martin and Val.

    I re-ran writeVcf on our (G)VCF data (34790518 ranges, 24 geno fields) with
    profiling enabled. The results of summaryRprof for that run are attached,
    though for a variety of reasons they are pretty misleading.

    It took over an hour to write (3700+seconds), so it's definitely a
    bottleneck when the data get very large, even if it isn't for smaller data.

    Michael and I both think the culprit is all the pasting and cbinding that is
    going on, and more to the point, that memory for an internal representation
    to be written out is allocated at all.  Streaming across the object, looping
    by rows and writing directly to file (e.g. from C) should be blisteringly
    fast in comparison.

    ~G


    On Tue, Aug 26, 2014 at 11:57 AM, Michael Lawrence <micha...@gene.com
    <mailto:micha...@gene.com>> wrote:

        Gabe is still testing/profiling, but we'll send something randomized
        along eventually.


        On Tue, Aug 26, 2014 at 11:15 AM, Martin Morgan <mtmor...@fhcrc.org
        <mailto:mtmor...@fhcrc.org>> wrote:

            I didn't see in the original thread a reproducible (simulated, I
            guess) example, to be explicit about what the problem is??

            Martin


            On 08/26/2014 10:47 AM, Michael Lawrence wrote:

                My understanding is that the heap optimization provided marginal
                gains, and
                that we need to think harder about how to optimize the all of
                the string
                manipulation in writeVcf. We either need to reduce it or reduce 
its
                overhead (i.e., the CHARSXP allocation). Gabe is doing more 
tests.


                On Tue, Aug 26, 2014 at 9:43 AM, Valerie Obenchain
                <voben...@fhcrc.org <mailto:voben...@fhcrc.org>>
                wrote:

                    Hi Gabe,

                    Martin responded, and so did Michael,

                    
https://stat.ethz.ch/__pipermail/bioc-devel/2014-__August/006082.html
                    
<https://stat.ethz.ch/pipermail/bioc-devel/2014-August/006082.html>

                    It sounded like Michael was ok with working with/around heap
                    initialization.

                    Michael, is that right or should we still consider this on
                    the table?


                    Val


                    On 08/26/2014 09:34 AM, Gabe Becker wrote:

                        Val,

                        Has there been any movement on this? This remains a
                        substantial
                        bottleneck for us when writing very large VCF files 
(e.g.
                        variants+genotypes for whole genome NGS samples).

                        I was able to see a ~25% speedup with 4 cores and  an
                        "optimal" speedup
                        of ~2x with 10-12 cores for a VCF with 500k rows  using
                        a very naive
                        parallelization strategy and no other changes. I suspect
                        this could be
                        improved on quite a bit, or possibly made irrelevant
                        with judicious use
                        of serial C code.

                        Did you and Martin make any plans regarding optimizing
                        writeVcf?

                        Best
                        ~G


                        On Tue, Aug 5, 2014 at 2:33 PM, Valerie Obenchain
                        <voben...@fhcrc.org <mailto:voben...@fhcrc.org>
                        <mailto:voben...@fhcrc.org <mailto:voben...@fhcrc.org>>>
                        wrote:

                              Hi Michael,

                              I'm interested in working on this. I'll discuss
                        with Martin next
                              week when we're both back in the office.

                              Val





                              On 08/05/14 07:46, Michael Lawrence wrote:

                                  Hi guys (Val, Martin, Herve):

                                  Anyone have an itch for optimization? The
                        writeVcf function is
                                  currently a
                                  bottleneck in our WGS genotyping pipeline. For
                        a typical 50
                                  million row
                                  gVCF, it was taking 2.25 hours prior to
                        yesterday's improvements
                                  (pasteCollapseRows) that brought it down to
                        about 1 hour, which
                                  is still
                                  too long by my standards (> 0). Only takes 3
                        minutes to call the
                                  genotypes
                                  (and associated likelihoods etc) from the
                        variant calls (using
                                  80 cores and
                                  450 GB RAM on one node), so the output is an
                        issue. Profiling
                                  suggests that
                                  the running time scales non-linearly in the
                        number of rows.

                                  Digging a little deeper, it seems to be
                        something with R's
                                  string/memory
                                  allocation. Below, pasting 1 million strings
                        takes 6 seconds, but
                        10
                                  million strings takes over 2 minutes. It gets
                        way worse with 50
                                  million. I
                                  suspect it has something to do with R's string
                        hash table.

                                  set.seed(1000)
                                  end <- sample(1e8, 1e6)
                                  system.time(paste0("END", "=", end))
                                       user  system elapsed
                                      6.396   0.028   6.420

                                  end <- sample(1e8, 1e7)
                                  system.time(paste0("END", "=", end))
                                       user  system elapsed
                                  134.714   0.352 134.978

                                  Indeed, even this takes a long time (in a
                        fresh session):

                                  set.seed(1000)
                                  end <- sample(1e8, 1e6)
                                  end <- sample(1e8, 1e7)
                                  system.time(as.character(end))
                                       user  system elapsed
                                     57.224   0.156  57.366

                                  But running it a second time is faster (about
                        what one would
                                  expect?):

                                  system.time(levels <- as.character(end))
                                       user  system elapsed
                                     23.582   0.021  23.589

                                  I did some simple profiling of R to find that
                        the resizing of
                                  the string
                                  hash table is not a significant component of
                        the time. So maybe
                                  something
                                  to do with the R heap/gc? No time right now to
                        go deeper. But I
                                  know Martin
                                  likes this sort of thing ;)

                                  Michael

                                           [[alternative HTML version deleted]]


                          ___________________________________________________
                        Bioc-devel@r-project.org
                        <mailto:Bioc-devel@r-project.org>
                        <mailto:Bioc-devel@r-project.__org
                        <mailto:Bioc-devel@r-project.org>>
                                  mailing list
                        https://stat.ethz.ch/mailman/____listinfo/bioc-devel
                        <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>

                          <https://stat.ethz.ch/mailman/__listinfo/bioc-devel
                        <https://stat.ethz.ch/mailman/listinfo/bioc-devel>>


                              
___________________________________________________
                        Bioc-devel@r-project.org
                        <mailto:Bioc-devel@r-project.org>
                        <mailto:Bioc-devel@r-project.__org
                        <mailto:Bioc-devel@r-project.org>> mailing
                        list
                        https://stat.ethz.ch/mailman/____listinfo/bioc-devel
                        <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>


                          <https://stat.ethz.ch/mailman/__listinfo/bioc-devel
                        <https://stat.ethz.ch/mailman/listinfo/bioc-devel>>




                        --
                        Computational Biologist
                        Genentech Research





                         [[alternative HTML version deleted]]

                _________________________________________________
                Bioc-devel@r-project.org <mailto:Bioc-devel@r-project.org>
                mailing list
                https://stat.ethz.ch/mailman/__listinfo/bioc-devel
                <https://stat.ethz.ch/mailman/listinfo/bioc-devel>



            --
            Computational Biology / Fred Hutchinson Cancer Research Center
            1100 Fairview Ave. N.
            PO Box 19024 Seattle, WA 98109

            Location: Arnold Building M1 B861
            Phone: (206) 667-2793 <tel:%28206%29%20667-2793>





    --
    Computational Biologist
    Genentech Research




--
Computational Biologist
Genentech Research



--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

Re: [Bioc-devel] writeVcf performance

Reply via email to