On Wed, Mar 4, 2015 at 12:01 PM, Robert Castelo <robert.cast...@upf.edu> wrote:
> some of the goals behind this discussion are IMO similar to the ones for
> biocMultiAssay:
>
> https://github.com/vjcitn/biocMultiAssay
>
> maybe Vince can confirm.

It is true that there are connections between the concerns. But the way I
see it, the container design we are talking about in this thread addresses
the management of a fixed common assay type over a fixed set of samples.
biocMultiAssay deals with the management of multiple assay types over
multiple samples, with possible disparities in sample sets over the
different assay types.

> robert.
>
> On 03/04/2015 05:16 PM, Tim Triche, Jr. wrote:
>
>> Oh, I don't disagree. Perhaps the two problems can be addressed
>> simultaneously by
>>
>> 1) deciding on what contracts a multi-assay container can/would demand
>>    to be useful
>> 2) calling it something besides SummarizedExperiment, say,
>>    ExperimentCollection
>>
>> Then the SE API could stay the same as it is (which is already very
>> useful) and progress could be sought in the offshoot
>> (ExperimentCollection or whatever) without breaking things that rely
>> on SE.
>>
>> Just off the top of my head, a most generically useful container for DNA
>> methylation & CNV data (which can of course be called from the same
>> assay) is Kasper & JP's GenomicRatioSet, which already has some weird
>> quirks for eSet backwards compatibility (e.g. sampleNames(x) works, but
>> sampleNames(x)<- does not work; pData(x) calls colData(x); fData(x)
>> calls rowData(x)). There are little niggles that I should probably just
>> send in a patch for, but a cleaner overall container would be better, if
>> for no other reason than the aforementioned ability to easily experiment
>> with imputation. An approach that I've been using is to stuff the SNPs,
>> CNV (as GRanges) and mRNA/miRNA (as a matrix) data into exptData(SE).
>> This is... somewhat less than optimal, especially when subsetting.
>>
>> But it does suggest that I could define a coercion from the current
>> rambling wreck into a nice clean new class/API (ExperimentCollection or
>> whatever) and I'll bet other package authors could, too. The presence
>> of a GRangesFrame would then be handy for returning a given assay's
>> results, so that the user could be blissfully ignorant of the storage
>> backing (ff, BigMatrix, Matrix, matrix, Rle, whatever) but not lose the
>> data management advantages of a SummarizedExperiment.
>>
>> JMHO
>>
>> Statistics is the grammar of science.
>> Karl Pearson <http://en.wikipedia.org/wiki/The_Grammar_of_Science>
>>
>> On Wed, Mar 4, 2015 at 6:40 AM, Vincent Carey <st...@channing.harvard.edu>
>> wrote:
>>
>>> I am a bit concerned about any major alterations to the
>>> SummarizedExperiment API. We have two papers and plenty of working
>>> code that use it in meaningful ways. Effort required to keep new
>>> formulations back-compatible as well as bug-free has to be weighed
>>> seriously.
>>>
>>> I agree that the name is not ideal. We are learning as we go.
>>>
>>> Seems to make sense to start with the contracts we want the instances
>>> of a class to satisfy. I have long felt that X[i, j] idiom is one users
>>> and developers should be comfortable with, even insist on, and for
>>> consistency with matrix operations idiom, it should work in a natural
>>> way for numeric indexing. This seems like an important constraint.
>>> subsetBy* is a useful idiom, but it is conceivable that we would adopt
>>> filter() for row-oriented selections and select() for column-oriented
>>> selections. Do we have to make any special design considerations to
>>> allow very smooth interoperation with out-of-memory resources for
>>> certain components for developers who want to allow this?
>>>
>>> We should have a reasonable way to get data on what is out there, what
>>> is used, how it is most effectively used.
>>> What's the SE API?
>>> Is it well-adapted to requirements of DESeq2? Other killer packages
>>> that use/don't use it? Even getting data on the formal API for a class
>>> is not all that familiar. And if folks are writing non-S4 interfaces
>>> (i.e., naked functions) we have no way of identifying them. See below
>>> for one way of discovering the API for SummarizedExperiment.
>>>
>>> In summary, I think we have to be careful about overdesigning too
>>> early. Getting clear on contracts seems the best way to ensure reuse,
>>> and we really want that so that reliability is continually assessed.
>>> My sense is that it is good to give developers something they'll gladly
>>> extend, not necessarily reuse directly. So we don't have to have broad
>>> consensus on class details, but on the minimal abstraction and on
>>> obligatory tests on its basic implementation.
>>>
>>> > methods(class="SummarizedExperiment") # perhaps an obsolete version of methods cataloguer by MTM
>>>
>>> DataFrame with 76 rows and 3 columns
>>>      generic       signature                                                                 package
>>>      <character>   <character>                                                               <character>
>>> 1    [             x="SummarizedExperiment", i="ANY", j="ANY", drop="ANY"                    base
>>> 2    [             x="SummarizedExperiment", i="ANY", j="missing", value="ANY"               base
>>> 3    [             x="SummarizedExperiment", i="ANY", j="missing"                            base
>>> 4    [<-           x="SummarizedExperiment", i="ANY", j="ANY", value="SummarizedExperiment"  base
>>> 5    assay         x="SummarizedExperiment", i="character"                                   GenomicRanges
>>> ...  ...           ...                                                                       ...
>>> 72   updateObject  object="SummarizedExperiment"                                             BiocGenerics
>>> 73   values        x="SummarizedExperiment"                                                  S4Vectors
>>> 74   values<-      x="SummarizedExperiment"                                                  S4Vectors
>>> 75   width         x="SummarizedExperiment"                                                  BiocGenerics
>>> 76   width<-       x="SummarizedExperiment"                                                  BiocGenerics
>>>
>>> On Wed, Mar 4, 2015 at 8:32 AM, Hector Corrada Bravo <hcorr...@gmail.com>
>>> wrote:
>>>
>>>> May I advocate for 'IndexedDataFrame' or 'IndexedFrame'?
>>>> 'rowIndices' can return whatever makes sense (GRanges, or other data
>>>> structures -thinking taxonomy for metagenomics, for example-).
>>>> GRangesFrame can inherit from this.
>>>>
>>>> On Wed, Mar 4, 2015 at 3:28 AM, Hervé Pagès <hpa...@fredhutch.org>
>>>> wrote:
>>>>
>>>>> GRangesFrame is an interesting idea and I gave it some thought.
>>>>>
>>>>> There is this nice symmetry between GRanges and GRangesFrame:
>>>>>
>>>>> - GRanges = a naked GRanges + a DataFrame accessible via mcols()
>>>>>
>>>>> - GRangesFrame = a DataFrame + a naked GRanges accessible via
>>>>>   some accessor (e.g. rowRanges())
>>>>>
>>>>> So GRanges and GRangesFrame are equivalent in terms of what they
>>>>> can hold, but different in terms of API: the former has the ranges
>>>>> API as primary API and the DataFrame API on its mcols() component,
>>>>> and the latter has the DataFrame API as primary API and the ranges
>>>>> API on its rowRanges() component. Nice switch!
>>>>>
>>>>> What does this API switch bring us? A GRangesFrame object is now
>>>>> an object that fully behaves like a DataFrame and people can also
>>>>> perform range-based operations on its rowRanges() component.
>>>>> Here is what I'm afraid is going to happen: people will also want
>>>>> to be able to perform range-based operations *directly* on
>>>>> these objects, i.e. without having to call rowRanges() first.
>>>>> So for example when they do subsetByOverlaps(), subsetting
>>>>> happens vertically. Also the Hits object returned by findOverlaps()
>>>>> would contain row indices. The problem with this is that these
>>>>> objects now start to suffer from the "dual personality syndrome".
>>>>> For example, it's not clear anymore what their length should be.
>>>>> Strictly speaking it should be their number of columns (that's
>>>>> what the length of a DataFrame is), but the ranges API that
>>>>> we're trying to put on them also makes them feel like vectors
>>>>> along the vertical dimension so it also feels that their length
>>>>> should be their number of rows. Same thing with 1D subsetting.
>>>>> Why does it subset the columns and not the rows? Most people
>>>>> are now confused.
>>>>>
>>>>> It's interesting to note that the same thing happens with GRanges
>>>>> objects, but in the opposite direction: people wish they could
>>>>> do DataFrame operations directly on them without calling mcols()
>>>>> first. But in order to preserve the good health of GRanges objects,
>>>>> we've not done that (except for $, a shortcut for mcols(x)$,
>>>>> the pressure was just too strong).
>>>>>
>>>>> H.
>>>>>
>>>>> On 03/03/2015 04:35 PM, Michael Lawrence wrote:
>>>>>
>>>>>> Should be possible for the annotations to be of any type, as long as
>>>>>> they satisfy a simple contract of NROW() and 2D "[". Then, you could
>>>>>> have a DataFrame, GRanges, or whatever in there. But it would be nice
>>>>>> to have a special class for the container with range information. The
>>>>>> contract for the range annotation would be to have a granges() method.
>>>>>>
>>>>>> I agree it would be nice if there was a way with the methods package
>>>>>> to easily assert such contracts. For example, one could define an
>>>>>> interface with a set of generics (and optionally the relevant position
>>>>>> in the generic signature). Then, once all of the methods have been
>>>>>> assigned for a particular class, it is made to inherit from that
>>>>>> contract class. There are lots of gotchas though. Not sure how useful
>>>>>> it would be in practice.
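
The contract idea in Michael's message (NROW() plus 2D "[", with a range
accessor on top, asserted through the methods package) might be sketched
roughly as below. This is only an illustration under stated assumptions,
not anything that exists in Bioconductor: ToyAnnotation, rangesOf,
satisfiesContract and RowAnnotationContract are made-up names, rangesOf
stands in for granges() so that the example needs nothing beyond base R,
and a class union is just one possible reading of "made to inherit from
that contract class". It only looks at directly defined methods and
ignores the gotchas mentioned above (inheritance, position in the
generic signature).

```r
library(methods)

# Hypothetical generic standing in for the granges() accessor named in the
# thread, so the sketch runs without Bioconductor installed.
setGeneric("rangesOf", function(x) standardGeneric("rangesOf"))

# Toy row-annotation class (hypothetical): one row per feature.
setClass("ToyAnnotation", representation(tbl = "data.frame"))

# NROW() falls back on dim(), so a dim() method is enough to support it.
setMethod("dim", "ToyAnnotation", function(x) dim(x@tbl))

# 2D "[": subset rows and columns, returning an object of the same class.
setMethod("[", "ToyAnnotation", function(x, i, j, ..., drop = TRUE) {
  new("ToyAnnotation", tbl = x@tbl[i, j, drop = FALSE])
})

# TRUE if the class has a directly defined method for every generic listed.
satisfiesContract <- function(cls, generics) {
  all(vapply(generics, function(g) existsMethod(g, cls), logical(1)))
}

basicContract <- c("dim", "[")                 # what NROW() and 2D "[" need
rangeContract <- c(basicContract, "rangesOf")  # plus the range accessor

# Once the basic contract is met, register the class under a virtual
# "contract" class (an assumption: a class union plays that role here).
if (satisfiesContract("ToyAnnotation", basicContract)) {
  setClassUnion("RowAnnotationContract", members = "ToyAnnotation")
}

ann <- new("ToyAnnotation",
           tbl = data.frame(id = 1:3, score = c(0.1, 0.5, 0.9)))
sub <- ann[2:3, 1:2]
```

A container could then declare its row-annotation slot as
"RowAnnotationContract" and call NROW(sub) or sub[i, j] without knowing
the concrete type. ToyAnnotation has no rangesOf() method, so it meets the
basic contract but not the range one.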
>>>>>>
>>>>>> On Tue, Mar 3, 2015 at 4:07 PM, Peter Haverty <haverty.pe...@gene.com>
>>>>>> wrote:
>>>>>>
>>>>>>> There are some nice similarities in these new imaginary types. A
>>>>>>> "GRangesFrame" is a list of dimensionally identical things (columns)
>>>>>>> and some row meta-data (the GRanges). The SE-like object is similarly
>>>>>>> a list of dimensionally like things (matrices, RleDataFrames,
>>>>>>> BigMatrix objects, HDF5-backed things) with some row meta-data (a
>>>>>>> DataFrame or GRangesFrame). Elegant? Maybe they would actually be
>>>>>>> relatives in the class tree.
>>>>>>>
>>>>>>> I wonder if this kind of thing would be easier if we had Java-style
>>>>>>> Interfaces or duck-typing. The "x" slot of "y" holds something that
>>>>>>> implements this set of methods ...
>>>>>>>
>>>>>>> Oh, and kinda apropos, the genoset class will probably go away or
>>>>>>> become an extension to this new SE-like thing. The extra stuff that
>>>>>>> comes along with genoset will still be available.
>>>>>>>
>>>>>>> Pete
>>>>>>>
>>>>>>> ____________________
>>>>>>> Peter M. Haverty, Ph.D.
>>>>>>> Genentech, Inc.
>>>>>>> phave...@gene.com
>>>>>>>
>>>>>>> On Tue, Mar 3, 2015 at 3:42 PM, Tim Triche, Jr. <tim.tri...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> This.
>>>>>>>>
>>>>>>>> It would be damned near perfect as a return value for assays coming
>>>>>>>> out of an object that held several such assays at several time
>>>>>>>> points in a population, where there are both assay-wise and
>>>>>>>> covariate-wise "holes" that could nonetheless be usefully imputed
>>>>>>>> across assays.
>>>>>>>>
>>>>>>>> Statistics is the grammar of science.
>>>>>>>> Karl Pearson <http://en.wikipedia.org/wiki/The_Grammar_of_Science>
>>>>>>>>
>>>>>>>> On Tue, Mar 3, 2015 at 3:25 PM, Peter Haverty
>>>>>>>> <haverty.pe...@gene.com> wrote:
>>>>>>>>
>>>>>>>>>>> I still think GRanges should be a subclass of DataFrame,
>>>>>>>>>>> which would make this easy, but I don't seem to be winning that
>>>>>>>>>>> argument.
>>>>>>>>>>
>>>>>>>>>> Just impossible. As Michael mentioned back in November, they
>>>>>>>>>> have conflicting APIs.
>>>>>>>>>
>>>>>>>>> Maybe a new "GRangesFrame" that is a DataFrame and holds a GRanges
>>>>>>>>> (without mcols) as an index?
>>>>>>>>>
>>>>>>>>> [[alternative HTML version deleted]]
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> Bioc-devel@r-project.org mailing list
>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>>
>>>>> --
>>>>> Hervé Pagès
>>>>>
>>>>> Program in Computational Biology
>>>>> Division of Public Health Sciences
>>>>> Fred Hutchinson Cancer Research Center
>>>>> 1100 Fairview Ave. N, M1-B514
>>>>> P.O.
>>>>> Box 19024
>>>>> Seattle, WA 98109-1024
>>>>>
>>>>> E-mail: hpa...@fredhutch.org
>>>>> Phone:  (206) 667-5791
>>>>> Fax:    (206) 667-1319
>
> --
> Robert Castelo, PhD
> Associate Professor
> Dept. of Experimental and Health Sciences
> Universitat Pompeu Fabra (UPF)
> Barcelona Biomedical Research Park (PRBB)
> Dr Aiguader 88
> E-08003 Barcelona, Spain
> telf: +34.933.160.514
> fax: +34.933.160.550