Michael,

According to my timings, here is where c() spends its time on GRanges objects with 1 meta col:

- merging the seqinfo: 1.5%
- combining the seqnames: 5.9%
- combining the ranges: 12.4%
- combining the strand: 2%
- rbinding the mcols: 78.2%

It seems that any additional meta col would have an impact that is only half the impact of the first meta col.

When doing

    df <- DataFrame(conservation=1:1000000*1e-6)
    rbind(df, df)

most of the time (> 90%) is spent on the following line (line 78, in IRanges/R/DataFrame-utils.R)

    ans <- do.call(DataFrame, cols)

where 'cols' is a (named) list of length 1 containing a numeric vector of length 2 million.

Finally this:

    > system.time(ans <- do.call(DataFrame, cols))
       user  system elapsed
      3.684   0.012   3.701
    > system.time(ans2 <- DataFrame(cols))
       user  system elapsed
      3.828   0.008   3.842
    > system.time(ans3 <- DataFrame(conservation=cols[[1]]))
       user  system elapsed
      0.024   0.032   0.057
    > identical(ans, ans2)
    [1] TRUE
    > identical(ans, ans3)
    [1] TRUE

is intriguing. From a naive perspective, it doesn't seem that DataFrame(cols) should need to do a lot more work than DataFrame(conservation=cols[[1]]), but apparently it does. Maybe there is room for some speed improvements here...

H.

On 01/08/2013 02:05 PM, Michael Lawrence wrote:
That GRanges only had one column, so I'm hoping that's not a lot of overhead. The merging of the thousands of Seqinfo objects is probably the issue. Any way to make that n-ary instead of a Reduce() over a binary merge?

Michael

On Tue, Jan 8, 2013 at 10:44 AM, Hervé Pagès <hpa...@fhcrc.org> wrote:

Hi Dario,

On 01/06/2013 07:00 PM, Dario Strbenac wrote:

Are you asking if you can rewrite your code to work faster, or are you asking if the BioC devs need to improve the code to be faster?

I was suggesting that maybe the c function for GRanges could be optimised.

Another would be manually splitting each GRanges objects into its components: seqnames, IRanges, strand, and metadata. Then concatenate these components and build one big GRanges object.

This approach gives:

       user  system elapsed
     63.488  11.092  74.786

I think this is more or less what 'do.call(c, blockRanges)' would give you if all your GRanges objects were naked, i.e. if they had no meta columns.

which by using c was previously:

       user  system elapsed
    935.770  23.657 961.952

By default c() will also combine the meta columns, which can be expensive if you have a lot of them and/or if some of them are complicated objects. You can call c() with 'ignore.mcols=TRUE' if you don't need to propagate the meta columns. Which, in the context of do.call(), translates to something like:

    allRanges <- do.call(c, c(blockRanges, list(ignore.mcols=TRUE)))

IMPORTANT NOTE, related to this thread on the Bioconductor list:

    https://stat.ethz.ch/pipermail/bioconductor/2012-November/049567.html

In short: if we ask the R core guys to change the implicit c() generic, my understanding is that it won't be possible to support additional args in "c" methods anymore, like the 'ignore.mcols' arg of the method for GenomicRanges objects. Should we take the time to discuss this before I proceed?

Thanks,
H.

Thanks for the tip.
I now remember using this approach at some time in the past.
-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpa...@fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel