Re: [Bioc-devel] Changes to the SummarizedExperiment Class

Martin Morgan Wed, 01 Apr 2015 13:49:51 -0700

On 03/31/2015 12:40 PM, Michael Love wrote:

With GenomicRanges 1.19.48, I'm still having issues with re-naming the
first assay and duplication of memory from my March 9 email. I tried
assayNames<- as well. My use case is if I am given a
SummarizedExperiment where the first element is not named "counts"
(albeit the SE is most likely coming from summarizeOverlaps() and
already named "counts"...).

Thanks for the prompt, Mike, and sorry for the slow response. gc() is not the most effective tool for tracking memory use; I compiled my R with --enable-memory-profiling, and then used tracemem():


  m <- matrix(1:1e7,ncol=10,dimnames=list(1:1e6,1:10))
  tracemem(m)
  se <- SummarizedExperiment(m)

The original behavior was

> names(assays(se)) <- "foo"

tracemem[0x7f49853a1010 -> 0x7f4981734010]: lapply lapply lapply lapply endoapply endoapply assays assays
tracemem[0x7f4981734010 -> 0x7f497f10e010]: lapply lapply lapply lapply endoapply endoapply assays<- assays<-

which shows a memory copy on the way out (the call stack ending with the assays accessor S4 generic, then method) and on the way in (the assays<- setter generic and method). withDimnames=FALSE gave me


> names(assays(se, withDimnames=FALSE)) <- "foo"

tracemem[0x7f4981734010 -> 0x7f497f10e010]: lapply lapply lapply lapply endoapply endoapply assays<- assays<-

>

with the duplication on the way in. GenomicRanges 1.19.50 gives, on a fresh 'se'

> names(assays(se, withDimnames=FALSE)) <- "foo"
>

with no duplication. assayNames<- (which I guess is the 'preferred' setter) behaves this way too.
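For anyone who wants to repeat the check, here is a minimal sketch of the same experiment using assayNames<-. It assumes the class is loaded from the SummarizedExperiment package, which is where it lives in current Bioconductor (at the time of this thread it was still in GenomicRanges), and that R was built with memory profiling. Whether any tracemem[...] lines appear, and from which call, depends on the installed version.

```r
# Sketch only: runs the experiment when the (assumed) package is installed.
if (requireNamespace("SummarizedExperiment", quietly = TRUE)) {
  library(SummarizedExperiment)

  # Same test matrix as in the thread: 1e6 rows, 10 columns, with dimnames.
  m <- matrix(1:1e7, ncol = 10, dimnames = list(1:1e6, 1:10))

  # tracemem() is only available when R was built with memory profiling.
  if (capabilities("profmem")) tracemem(m)

  se <- SummarizedExperiment(m)

  # Each tracemem[...] line printed from here on reports one duplication
  # of the traced matrix (or of a copy descended from it).
  assayNames(se) <- "foo"

  if (capabilities("profmem")) untracemem(m)
}
```

If nothing is printed by the assayNames<- call, the rename did not copy the assay data.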


Thanks for your report and patience.

Martin

sessionInfo()

R Under development (unstable) (2015-03-31 r68129)
Platform: x86_64-apple-darwin12.5.0 (64-bit)
Running under: OS X 10.8.5 (Mountain Lion)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices datasets  utils     methods   base

other attached packages:
[1] GenomicRanges_1.19.48 GenomeInfoDb_1.3.16   IRanges_2.1.43        S4Vectors_0.5.22
[5] BiocGenerics_0.13.10  testthat_0.9.1        devtools_1.7.0        knitr_1.9
[9] BiocInstaller_1.17.6

loaded via a namespace (and not attached):
[1] formatR_1.1    XVector_0.7.4  tools_3.3.0    stringr_0.6.2  evaluate_0.5.5

On Mon, Mar 9, 2015 at 1:21 PM, Michael Love
<michaelisaiahl...@gmail.com> wrote:



On Mar 9, 2015 12:36 PM, "Martin Morgan" <mtmor...@fredhutch.org> wrote:


On 03/09/2015 08:07 AM, Michael Love wrote:


Some guidance on how to avoid duplication of the matrix for developers
would be greatly appreciated.



It's unsatisfactory, but using withDimnames=FALSE avoids duplication on 
extraction of assays (but obviously you don't have dimnames on the matrix). Row 
or column subsetting necessarily causes the subsetted assay data to be 
duplicated. There should not be any duplication when rowRanges() or colData() 
are changed without changing their dimension / ordering.


Thanks Martin for checking into the regression.

Sorry, I should have been more specific earlier, I meant more 
guidance/documentation in the man page for SE. I scanned the 'Extension' 
section but didn't find a note on withDimnames for extracting the matrix or 
this example of renaming the assays (it seems like this could easily be 
relevant for other package authors).

A prominent note there might help devs write more memory efficient packages.

The argument section mentions speed but I'd explicitly mention memory given 
that we're often storing big matrices:

"Setting withDimnames=FALSE  increases the speed with which assays are 
extracted."

(it's entirely possible the info is there but I missed it)

Best,

Mike

Another example of a trouble point is that if I am given an SE with
an unnamed assay and I need to give the assay a name, this also can
expand the memory used. I had found a solution (which works with
GenomicRanges 1.18 / current release) with:

names(assays(se, withDimnames=FALSE))[1] <- "foo"

But now I'm looking in devel and this appears to no longer work. The
memory used expands, equivalent to:

names(assays(se))[1] <- "foo"

Here's some code to try this:

m <- matrix(1:1e7,ncol=10,dimnames=list(1:1e6,1:10))
se <- SummarizedExperiment(m)
names(assays(se, withDimnames=FALSE))[1] <- "foo"
names(assays(se))[1] <- "foo"

while running gc() in between steps.



I think this is a regression of some sort, and I'll look into it. Thanks for 
the heads-up.

Martin



On Mon, Mar 9, 2015 at 10:36 AM, Kasper Daniel Hansen
<kasperdanielhan...@gmail.com> wrote:


On Mon, Mar 9, 2015 at 10:30 AM, Vincent Carey <st...@channing.harvard.edu>
wrote:

I am glad you are keeping this discussion alive Kasper.

On Mon, Mar 9, 2015 at 10:06 AM, Kasper Daniel Hansen <
kasperdanielhan...@gmail.com> wrote:

It sounds like the proposed changes are already made.  However (like
others) I am still a bit mystified why this was necessary.  The old
version did allow for a GRanges inside the DataFrame of the rowData, as
far as I recall.  So I assume this is for efficiency.  But why?  What
kind of data/use cases is this for?

I am happy to hear that SummarizedExperiment is going to be spun out into
its own package.  When that happens, I have some comments, which I'll
include here in anticipation:
    1) I now very strongly believe it was a design mistake to not have
colnames on the assays.  The advantage of this choice is that sampleNames
are only stored in one place.  The extreme disadvantage is the high
inefficiency when you want colnames on an extracted assay.


after example(SummarizedExperiment)

colnames(assays(se1)[[1]])


[1] "A" "B" "C" "D" "E" "F"

so this seems to be optional.  But attempts to set rownames will fail silently

rownames(assays(se1)[[1]]) = as.character(1:200)

rownames(assays(se1)[[1]])



NULL
It seems we could issue a warning there.



Vince, you need to be careful here.

The assays are stored without colnames (unless something has recently
changed).  The default is to - upon extraction - set the colnames of the
matrix.  This however requires a copy of the entire matrix.  So
essentially, upon extraction, each assay is needlessly duplicated to add
the colnames.  This is what I mean by inefficient. I would prefer to store
the assays with colnames.  This means that changing sampleNames of the
object will be inefficient (as it is for eSets) since it would require a
complete copy of everything.  But I would rather - much rather - copy when
setting sampleNames than copy when extracting an assay.
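The cost described above can be seen with plain matrices, without any SummarizedExperiment involved. The following is only a base-R sketch of the underlying copy-on-modify behaviour, not the package's actual extraction code; the variable names are illustrative.

```r
# A matrix stored without dimnames, as the assays are described above.
m <- matrix(1:1e6, ncol = 10)

# tracemem() is only available when R was built with memory profiling.
if (capabilities("profmem")) tracemem(m)

# Plain assignment does not copy: both names refer to the same object.
with_names <- m

# Attaching colnames modifies a shared object, so R duplicates the whole
# matrix first; with tracing on, this is the step reported by tracemem.
colnames(with_names) <- LETTERS[1:10]

if (capabilities("profmem")) untracemem(m)
```

The stored matrix m is left untouched and still has no colnames, which is why the names have to be re-attached, and the data re-copied, on every extraction.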

Best,
Kasper


_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel






--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793


