Re: [Bioc-devel] Changes to the SummarizedExperiment Class

Michael Love Tue, 31 Mar 2015 12:50:51 -0700

I forgot to ask my other question. In early March I fixed my code to
eliminate rowData<-, but at that point the argument to
SummarizedExperiment() was still called rowData and accepted a
DataFrame. I didn't check again for a few weeks, and the argument that
fills the rowData slot is now called rowRanges. What's the trick to
putting a DataFrame on an empty GRanges, so that I can get the old
behavior through the rowRanges argument?
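
One approach that might work (a sketch only, not confirmed against GenomicRanges 1.19.x): build a GRangesList with one empty GRanges per row and attach the DataFrame as its metadata columns. The example assumes the standalone SummarizedExperiment package, where rowRanges= accepts a GRangesList; the toy matrix, the `symbol` column, and the variable names are made up for illustration.

```r
library(SummarizedExperiment)  # assumption: the class lives in its own package

# Toy data (hypothetical): 5 features x 4 samples
m <- matrix(1:20, ncol = 4,
            dimnames = list(paste0("gene", 1:5), paste0("s", 1:4)))
df <- DataFrame(symbol = letters[1:5])

# One empty GRanges per row: a partitioning whose ends are all 0
# splits an empty GRanges into nrow(m) empty list elements.
rr <- relist(GRanges(),
             PartitioningByEnd(integer(nrow(m)), names = rownames(m)))
mcols(rr) <- df  # the DataFrame rides along as metadata columns

se <- SummarizedExperiment(assays = list(counts = m), rowRanges = rr)
mcols(rowRanges(se))$symbol
```

Whether this reproduces the old rowData= behavior exactly is an open question; it only shows that row annotation can be carried without any actual ranges.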


On Tue, Mar 31, 2015 at 3:40 PM, Michael Love
<michaelisaiahl...@gmail.com> wrote:
> With GenomicRanges 1.19.48, I'm still seeing the issue from my March 9
> email: renaming the first assay duplicates memory. I tried
> assayNames<- as well. My use case is being given a
> SummarizedExperiment whose first assay is not named "counts"
> (although the SE most likely comes from summarizeOverlaps() and is
> already named "counts"...).
>
>> sessionInfo()
> R Under development (unstable) (2015-03-31 r68129)
> Platform: x86_64-apple-darwin12.5.0 (64-bit)
> Running under: OS X 10.8.5 (Mountain Lion)
>
> locale:
> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>
> attached base packages:
> [1] stats4    parallel  stats     graphics  grDevices datasets  utils     methods   base
>
> other attached packages:
> [1] GenomicRanges_1.19.48 GenomeInfoDb_1.3.16   IRanges_2.1.43        S4Vectors_0.5.22
> [5] BiocGenerics_0.13.10  testthat_0.9.1        devtools_1.7.0        knitr_1.9
> [9] BiocInstaller_1.17.6
>
> loaded via a namespace (and not attached):
> [1] formatR_1.1    XVector_0.7.4  tools_3.3.0    stringr_0.6.2  evaluate_0.5.5
>
> On Mon, Mar 9, 2015 at 1:21 PM, Michael Love
> <michaelisaiahl...@gmail.com> wrote:
>>
>>
>> On Mar 9, 2015 12:36 PM, "Martin Morgan" <mtmor...@fredhutch.org> wrote:
>> >
>> > On 03/09/2015 08:07 AM, Michael Love wrote:
>> >>
>> >> Some guidance on how to avoid duplication of the matrix for developers
>> >> would be greatly appreciated.
>> >
>> >
>> > It's unsatisfactory, but using withDimnames=FALSE avoids duplication on 
>> > extraction of assays (but obviously you don't have dimnames on the 
>> > matrix). Row or column subsetting necessarily causes the subsetted assay 
>> > data to be duplicated. There should not be any duplication when 
>> > rowRanges() or colData() are changed without changing their dimension / 
>> > ordering.
>> >
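
A minimal sketch of the two extraction modes described above, assuming the standalone SummarizedExperiment package (in the 1.19.x devel series the class was provided by GenomicRanges). The toy matrix and the variable names are hypothetical.

```r
library(SummarizedExperiment)  # assumption: standalone package

# Toy assay (hypothetical): 5 features x 4 samples
m <- matrix(1:20, ncol = 4,
            dimnames = list(paste0("g", 1:5), LETTERS[1:4]))
se <- SummarizedExperiment(list(counts = m))

# Default: the dimnames of the SE are put on the extracted matrix
with_names <- assays(se)[[1]]

# withDimnames=FALSE: skip that step and take the stored matrix as-is
no_names <- assays(se, withDimnames = FALSE)[[1]]

colnames(with_names)
```

The values are the same either way; only the dimnames handling (and the copy it may trigger) differs.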
>>
>> Thanks Martin for checking into the regression.
>>
>> Sorry, I should have been more specific earlier: I meant more
>> guidance/documentation in the man page for SE. I scanned the 'Extension'
>> section but didn't find a note on withDimnames for extracting the matrix,
>> or this example of renaming the assays (it seems like this could easily be
>> relevant for other package authors).
>>
>> A prominent note there might help devs write more memory efficient packages.
>>
>> The argument section mentions speed but I'd explicitly mention memory given 
>> that we're often storing big matrices:
>>
>> "Setting withDimnames=FALSE  increases the speed with which assays are 
>> extracted."
>>
>> (It's entirely possible the info is there but I missed it.)
>>
>> Best,
>>
>> Mike
>>
>> >
>> >> Another trouble point: if I am given an SE with an unnamed assay
>> >> and I need to give the assay a name, this can also expand the
>> >> memory used. I had found a solution (which works with
>> >> GenomicRanges 1.18 / current release) with:
>> >>
>> >> names(assays(se, withDimnames=FALSE))[1] <- "foo"
>> >>
>> >> But now I'm looking in devel and this appears to no longer work. The
>> >> memory used expands, equivalent to:
>> >>
>> >> names(assays(se))[1] <- "foo"
>> >>
>> >> Here's some code to try this, running gc() in between steps:
>> >>
>> >> library(GenomicRanges)  # provides SummarizedExperiment in 1.18 / 1.19
>> >> m <- matrix(1:1e7,ncol=10,dimnames=list(1:1e6,1:10))
>> >> se <- SummarizedExperiment(m)
>> >> gc()
>> >> names(assays(se, withDimnames=FALSE))[1] <- "foo"
>> >> gc()
>> >> names(assays(se))[1] <- "foo"
>> >> gc()
>> >
>> >
>> > I think this is a regression of some sort, and I'll look into it. Thanks 
>> > for the heads-up.
>> >
>> > Martin
>> >
>> >
>> >>
>> >>
>> >> On Mon, Mar 9, 2015 at 10:36 AM, Kasper Daniel Hansen
>> >> <kasperdanielhan...@gmail.com> wrote:
>> >>>
>> >>> On Mon, Mar 9, 2015 at 10:30 AM, Vincent Carey <st...@channing.harvard.edu> wrote:
>> >>>
>> >>>> I am glad you are keeping this discussion alive Kasper.
>> >>>>
>> >>>> On Mon, Mar 9, 2015 at 10:06 AM, Kasper Daniel Hansen <
>> >>>> kasperdanielhan...@gmail.com> wrote:
>> >>>>
>> >>>>> It sounds like the proposed changes are already made.  However (like
>> >>>>> others) I am still a bit mystified why this was necessary.  The old
>> >>>>> version did allow for a GRanges inside the DataFrame of the rowData,
>> >>>>> as far as I recall.  So I assume this is for efficiency.  But why?
>> >>>>> What kind of data/use cases is this for?
>> >>>>>
>> >>>>> I am happy to hear that SummarizedExperiment is going to be spun out
>> >>>>> into its own package.  When that happens, I have some comments, which
>> >>>>> I'll include here in anticipation:
>> >>>>>    1) I now very strongly believe it was a design mistake to not have
>> >>>>> colnames on the assays.  The advantage of this choice is that
>> >>>>> sampleNames are only stored in one place.  The extreme disadvantage is
>> >>>>> the high inefficiency when you want colnames on an extracted assay.
>> >>>>>
>> >>>>
>> >>>> after example(SummarizedExperiment)
>> >>>>
>> >>>> > colnames(assays(se1)[[1]])
>> >>>> [1] "A" "B" "C" "D" "E" "F"
>> >>>>
>> >>>> so this seems to be optional.  But attempts to set rownames will fail
>> >>>> silently
>> >>>>
>> >>>> > rownames(assays(se1)[[1]]) = as.character(1:200)
>> >>>> > rownames(assays(se1)[[1]])
>> >>>> NULL
>> >>>>
>> >>>> Seems we could issue a warning there.
>> >>>>
>> >>>
>> >>>
>> >>> Vince, you need to be careful here.
>> >>>
>> >>> The assays are stored without colnames (unless something has recently
>> >>> changed).  The default is to - upon extraction - set the colnames of the
>> >>> matrix.  This however requires a copy of the entire matrix.  So
>> >>> essentially, upon extraction, each assay is needlessly duplicated to add
>> >>> the colnames.  This is what I mean by inefficient.  I would prefer to
>> >>> store the assays with colnames.  This means that changing sampleNames
>> >>> of the object will be inefficient (as it is for eSets) since it would
>> >>> require a complete copy of everything.  But I would rather - much
>> >>> rather - copy when setting sampleNames than copy when extracting an
>> >>> assay.
>> >>>
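
The copy described above can be illustrated in base R, without any Bioconductor code. This is only a sketch of the general copy-on-modify behavior, not of what SummarizedExperiment does internally; `stored` and `extracted` are hypothetical stand-ins for an assay held inside the object and the matrix handed back to the user.

```r
# Stand-in for an assay stored without colnames: 1e6 doubles (about 8 MB)
stored <- matrix(0, nrow = 1e5, ncol = 10)

invisible(gc())
before <- gc()["Vcells", "used"]   # vector cells in use, in 8-byte units

extracted <- stored                    # no copy yet: both names share one matrix
colnames(extracted) <- LETTERS[1:10]   # adding dimnames forces a full copy

after <- gc()["Vcells", "used"]

# Growth in vector cells, roughly one extra copy of the matrix
after - before
```

The stored matrix is left untouched, so the price of getting colnames on the extracted matrix is a second full-size matrix in memory.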
>> >>> Best,
>> >>> Kasper
>> >>>
>> >>>
>> >>> _______________________________________________
>> >>> Bioc-devel@r-project.org mailing list
>> >>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>> >>
>> >>
>> >>
>> >
>> >
>> > --
>> > Computational Biology / Fred Hutchinson Cancer Research Center
>> > 1100 Fairview Ave. N.
>> > PO Box 19024 Seattle, WA 98109
>> >
>> > Location: Arnold Building M1 B861
>> > Phone: (206) 667-2793

