Re: [Bioc-devel] Combining Ordinary List of GRanges Optimisation

Hervé Pagès Tue, 08 Jan 2013 15:39:43 -0800

Michael,

According to my timings, here is where c() spends its time on GRanges
objects with 1 meta col:

  - merging the seqinfo: 1.5%
  - combining the seqnames: 5.9%
  - combining the ranges: 12.4%
  - combining the strand: 2%
  - rbinding the mcols: 78.2%

It seems that any additional meta col would have an impact that is only
half the impact of the first meta col.

When doing

  df <- DataFrame(conservation=1:1000000*1e-6)
  rbind(df, df)

most of the time (> 90%) is spent on the following line (line 78,
in IRanges/R/DataFrame-utils.R)

  ans <- do.call(DataFrame, cols)

where 'cols' is a (named) list of length 1 containing a numeric vector
of length 2 millions.

Finally this:

  > system.time(ans <- do.call(DataFrame, cols))
     user  system elapsed
    3.684   0.012   3.701
  > system.time(ans2 <- DataFrame(cols))
     user  system elapsed
    3.828   0.008   3.842
  > system.time(ans3 <- DataFrame(conservation=cols[[1]]))
     user  system elapsed
    0.024   0.032   0.057

  > identical(ans, ans2)
  [1] TRUE
  > identical(ans, ans3)
  [1] TRUE

is intriguing. From a naive perspective, it doesn't sound that

  DataFrame(cols)

should need to do a lot more work than

 DataFrame(conservation=cols[[1]])

but apparently it does. May be there is room for some speed improvements
here...

H.

On 01/08/2013 02:05 PM, Michael Lawrence wrote:

That GRanges only had one column, so I'm hoping that's not a lot of
overhead. The merging of the thousands of Seqinfo objects is probably
the issue. Any way to make that n-ary instead of a Reduce() over a
binary merge?

Michael


On Tue, Jan 8, 2013 at 10:44 AM, Hervé Pagès <hpa...@fhcrc.org
<mailto:hpa...@fhcrc.org>> wrote:

    Hi Dario,


    On 01/06/2013 07:00 PM, Dario Strbenac wrote:

            Are you asking if you can rewrite your code to work faster,
            or are you asking if the BioC devs need to improve the code
            to be faster?


        I was suggesting that maybe the c function for GRanges could be
        optimised.

            Another would be manually splitting each GRanges objects
            into its components: seqnames, IRanges, strand, and
            metadata. Then concatenate these components and build one
            big GRanges object.


        This approach gives:

             user  system elapsed
           63.488  11.092  74.786


    I think this is more or less what 'do.call(c, blockRanges)' would give
    you if all your GRanges objects were naked i.e. if they had no meta
    columns.



        which by using c was previously:

             user  system elapsed
        935.770  23.657 961.952


    By default c() will also combine the meta columns which can be
    expensive if you have a lot of them and/or if some of them are
    complicated objects. You can call c() with 'ignore.mcols=TRUE'
    if you don't need to propagate the meta columns. Which, in the
    context of do.call(), translates to something like:

       allRanges <- do.call(c, c(blockRanges, list(ignore.mcols=TRUE)))

    IMPORTANT NOTE, related to this thread on the Bioconductor list:

    https://stat.ethz.ch/__pipermail/bioconductor/2012-__November/049567.html
    <https://stat.ethz.ch/pipermail/bioconductor/2012-November/049567.html>

    In short: if we ask the R core guys to change the implicit c() generic,
    my understanding is that it won't be possible to support additional
    args in "c" methods anymore, like the 'ignore.mcols' arg of the method
    for GenomicRanges objects. Should take the time to discuss this before
    I proceed?

    Thanks,
    H.



        Thanks for the tip. I now remember using this approach at some
        time in the past.

        _________________________________________________
        Bioc-devel@r-project.org <mailto:Bioc-devel@r-project.org>
        mailing list
        https://stat.ethz.ch/mailman/__listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/listinfo/bioc-devel>


    --
    Hervé Pagès

    Program in Computational Biology
    Division of Public Health Sciences
    Fred Hutchinson Cancer Research Center
    1100 Fairview Ave. N, M1-B514
    P.O. Box 19024
    Seattle, WA 98109-1024

    E-mail: hpa...@fhcrc.org <mailto:hpa...@fhcrc.org>
    Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
    Fax: (206) 667-1319 <tel:%28206%29%20667-1319>


    _________________________________________________
    Bioc-devel@r-project.org <mailto:Bioc-devel@r-project.org> mailing list
    https://stat.ethz.ch/mailman/__listinfo/bioc-devel
    <https://stat.ethz.ch/mailman/listinfo/bioc-devel>


--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpa...@fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

Re: [Bioc-devel] Combining Ordinary List of GRanges Optimisation

Reply via email to