Re: [Bioc-devel] Changes to the SummarizedExperiment Class

Michael Love Wed, 01 Apr 2015 07:18:59 -0700

I'll retract those last two emails about empty GRanges. That's simply:

se <- SummarizedExperiment(assays, colData=colData)
mcols(se) <- myDataFrame
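
For anyone following along, here is a minimal sketch of the above with toy data. The object names and values (counts, s1..s3, condition, foo) are made up for illustration, and I'm assuming the class is loaded from the spun-out SummarizedExperiment package; with the GenomicRanges devel version discussed in this thread, attach GenomicRanges instead.

```r
# Assumption: SummarizedExperiment is available as its own package.
library(SummarizedExperiment)

# Hypothetical toy data: 10 features x 3 samples.
counts <- matrix(1:30, nrow = 10,
                 dimnames = list(letters[1:10], c("s1", "s2", "s3")))
colData <- DataFrame(condition = c("a", "b", "a"),
                     row.names = colnames(counts))

# Per-row annotation, one row per feature; no GRanges involved.
myDataFrame <- DataFrame(foo = 1:10)

# Build the object without any row ranges, then attach the row annotation.
se <- SummarizedExperiment(assays = list(counts = counts), colData = colData)
mcols(se) <- myDataFrame

# Inspect the attached column.
mcols(se)$foo
```

The DataFrame needs one row per row of the assay.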


On Tue, Mar 31, 2015 at 4:40 PM, Michael Love
<michaelisaiahl...@gmail.com> wrote:
> Would this code inspired by the release version of GenomicRanges work?
> e.g. if I want to add a DataFrame with 10 rows:
>
> names <- letters[1:10]
> x <- relist(GRanges(), PartitioningByEnd(integer(10), names=names))
> mcols(x) <- DataFrame(foo=1:10)
>
> Then give x to the rowRanges argument of SummarizedExperiment?
>
> On Tue, Mar 31, 2015 at 3:49 PM, Michael Love
> <michaelisaiahl...@gmail.com> wrote:
>> I forgot to ask my other question. I had gone in early March and fixed
>> my code to eliminate rowData<-, but the argument to SummarizedExperiment
>> was still called rowData, and a DataFrame could be provided. Then I
>> didn't check for a few weeks, but the argument for the rowData slot is
>> now called rowRanges. What's the trick to putting a DataFrame on an
>> empty GRanges, so I can get the old behavior but now using the rowRanges
>> argument?
>>
>> On Tue, Mar 31, 2015 at 3:40 PM, Michael Love
>> <michaelisaiahl...@gmail.com> wrote:
>>> With GenomicRanges 1.19.48, I'm still having the issues from my March 9
>>> email: renaming the first assay duplicates memory. I tried
>>> assayNames<- as well. My use case is being given a
>>> SummarizedExperiment where the first element is not named "counts"
>>> (albeit the SE is most likely coming from summarizeOverlaps() and
>>> already named "counts"...).
>>>
>>>> sessionInfo()
>>> R Under development (unstable) (2015-03-31 r68129)
>>> Platform: x86_64-apple-darwin12.5.0 (64-bit)
>>> Running under: OS X 10.8.5 (Mountain Lion)
>>>
>>> locale:
>>> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>>>
>>> attached base packages:
>>> [1] stats4    parallel  stats     graphics  grDevices datasets  utils     methods   base
>>>
>>> other attached packages:
>>> [1] GenomicRanges_1.19.48 GenomeInfoDb_1.3.16   IRanges_2.1.43        S4Vectors_0.5.22
>>> [5] BiocGenerics_0.13.10  testthat_0.9.1        devtools_1.7.0        knitr_1.9
>>> [9] BiocInstaller_1.17.6
>>>
>>> loaded via a namespace (and not attached):
>>> [1] formatR_1.1    XVector_0.7.4  tools_3.3.0    stringr_0.6.2  evaluate_0.5.5
>>>
>>> On Mon, Mar 9, 2015 at 1:21 PM, Michael Love
>>> <michaelisaiahl...@gmail.com> wrote:
>>>>
>>>>
>>>> On Mar 9, 2015 12:36 PM, "Martin Morgan" <mtmor...@fredhutch.org> wrote:
>>>> >
>>>> > On 03/09/2015 08:07 AM, Michael Love wrote:
>>>> >>
>>>> >> Some guidance on how to avoid duplication of the matrix for developers
>>>> >> would be greatly appreciated.
>>>> >
>>>> >
>>>> > It's unsatisfactory, but using withDimnames=FALSE avoids duplication on 
>>>> > extraction of assays (but obviously you don't have dimnames on the 
>>>> > matrix). Row or column subsetting necessarily causes the subsetted assay 
>>>> > data to be duplicated. There should not be any duplication when 
>>>> > rowRanges() or colData() are changed without changing their dimension / 
>>>> > ordering.
>>>> >
>>>>
>>>> Thanks Martin for checking into the regression.
>>>>
>>>> Sorry, I should have been more specific earlier. I meant more
>>>> guidance/documentation in the man page for SE. I scanned the 'Extension'
>>>> section but didn't find a note on withDimnames for extracting the matrix,
>>>> or this example of renaming the assays (it seems like this could easily be
>>>> relevant for other package authors).
>>>>
>>>> A prominent note there might help devs write more memory-efficient
>>>> packages.
>>>>
>>>> The argument section mentions speed, but I'd explicitly mention memory
>>>> given that we're often storing big matrices:
>>>>
>>>> "Setting withDimnames=FALSE increases the speed with which assays are
>>>> extracted."
>>>>
>>>> (It's entirely possible the info is there but I missed it.)
>>>>
>>>> Best,
>>>>
>>>> Mike
>>>>
>>>> >
>>>> >> Another trouble point is that if I am given an SE with
>>>> >> an unnamed assay and I need to give the assay a name, this can also
>>>> >> expand the memory used. I had found a solution (which works with
>>>> >> GenomicRanges 1.18 / current release) with:
>>>> >>
>>>> >> names(assays(se, withDimnames=FALSE))[1] <- "foo"
>>>> >>
>>>> >> But now I'm looking in devel and this appears to no longer work. The
>>>> >> memory used expands, equivalent to:
>>>> >>
>>>> >> names(assays(se))[1] <- "foo"
>>>> >>
>>>> >> Here's some code to try this:
>>>> >>
>>>> >> m <- matrix(1:1e7,ncol=10,dimnames=list(1:1e6,1:10))
>>>> >> se <- SummarizedExperiment(m)
>>>> >> names(assays(se, withDimnames=FALSE))[1] <- "foo"
>>>> >> names(assays(se))[1] <- "foo"
>>>> >>
>>>> >> while running gc() in between steps.
>>>> >
>>>> >
>>>> > I think this is a regression of some sort, and I'll look into it. Thanks 
>>>> > for the heads-up.
>>>> >
>>>> > Martin
>>>> >
>>>> >
>>>> >>
>>>> >>
>>>> >> On Mon, Mar 9, 2015 at 10:36 AM, Kasper Daniel Hansen
>>>> >> <kasperdanielhan...@gmail.com> wrote:
>>>> >>>
>>>> >>> On Mon, Mar 9, 2015 at 10:30 AM, Vincent Carey <st...@channing.harvard.edu> wrote:
>>>> >>>
>>>> >>>> I am glad you are keeping this discussion alive Kasper.
>>>> >>>>
>>>> >>>> On Mon, Mar 9, 2015 at 10:06 AM, Kasper Daniel Hansen <kasperdanielhan...@gmail.com> wrote:
>>>> >>>>
>>>> >>>>> It sounds like the proposed changes are already made.  However (like
>>>> >>>>> others) I am still a bit mystified why this was necessary.  The old
>>>> >>>>> version did allow for a GRanges inside the DataFrame of the rowData,
>>>> >>>>> as far as I recall.  So I assume this is for efficiency.  But why?
>>>> >>>>> What kind of data/use cases is this for?
>>>> >>>>>
>>>> >>>>> I am happy to hear that SummarizedExperiment is going to be spun out
>>>> >>>>> into its own package.  When that happens, I have some comments, which
>>>> >>>>> I'll include here in anticipation:
>>>> >>>>>    1) I now very strongly believe it was a design mistake to not have
>>>> >>>>> colnames on the assays.  The advantage of this choice is that
>>>> >>>>> sampleNames are only stored in one place.  The extreme disadvantage is
>>>> >>>>> the high inefficiency when you want colnames on an extracted assay.
>>>> >>>>>
>>>> >>>>
>>>> >>>> after example(SummarizedExperiment)
>>>> >>>>
>>>> >>>>> colnames(assays(se1)[[1]])
>>>> >>>>
>>>> >>>> [1] "A" "B" "C" "D" "E" "F"
>>>> >>>>
>>>> >>>> so this seems to be optional.  But attempts to set rownames will fail
>>>> >>>> silently
>>>> >>>>
>>>> >>>>> rownames(assays(se1)[[1]]) = as.character(1:200)
>>>> >>>>
>>>> >>>>
>>>> >>>>> rownames(assays(se1)[[1]])
>>>> >>>>
>>>> >>>>
>>>> >>>> NULL
>>>> >>>> seems we could issue a warning there
>>>> >>>>
>>>> >>>
>>>> >>>
>>>> >>> Vince, you need to be careful here.
>>>> >>>
>>>> >>> The assays are stored without colnames (unless something has recently
>>>> >>> changed).  The default is to - upon extraction - set the colnames of
>>>> >>> the matrix.  This however requires a copy of the entire matrix.  So
>>>> >>> essentially, upon extraction, each assay is needlessly duplicated to
>>>> >>> add the colnames.  This is what I mean by inefficient. I would prefer
>>>> >>> to store the assays with colnames.  This means that changing
>>>> >>> sampleNames of the object will be inefficient (as it is for eSets)
>>>> >>> since it would require a complete copy of everything.  But I would
>>>> >>> rather - much rather - copy when setting sampleNames than copy when
>>>> >>> extracting an assay.
>>>> >>>
>>>> >>> Best,
>>>> >>> Kasper
>>>> >>>
>>>> >>> _______________________________________________
>>>> >>> Bioc-devel@r-project.org mailing list
>>>> >>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>> >>
>>>> >>
>>>> >
>>>> >
>>>> > --
>>>> > Computational Biology / Fred Hutchinson Cancer Research Center
>>>> > 1100 Fairview Ave. N.
>>>> > PO Box 19024 Seattle, WA 98109
>>>> >
>>>> > Location: Arnold Building M1 B861
>>>> > Phone: (206) 667-2793
