[Bioc-devel] Using integrated contains in Bioconductor packages

Martin Morgan Tue, 05 Nov 2013 06:13:38 -0800

Hi package developers --

I found this article pretty interesting reading

  http://www.nature.com/nbt/journal/v31/n10/full/nbt.2721.html?WT.ec_id=NBT-201310

especially of course the comments of Robert Gentleman and the reasons for success of R (external packages written by domain experts) and Bioconductor (interoperability between different analysis capabilities enabled by using similar data structures). It's also very important to provide 'integrated' containers that couple, say, a matrix of expression count data with the annotations of the genes / gene regions (rows) and sample phenotypic data (columns).

With these ideas in mind, I want to emphasize that new and existing Bioconductor packages should be re-using established data structures. With omics data it is very important to offer users a way to easily work with data across Bioconductor packages. While you might implement 'internal' functions that perform numerical calculations on an R `matrix`, say, the major input functions should really support GenomicRanges::SummarizedExperiment objects, rather than (in addition to?) plain old matrix objects.
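As a rough sketch of that split, the internal worker below operates on a plain matrix while the user-facing generic accepts either a matrix or a SummarizedExperiment. The names `.rowVariance`, `rowVariance` and `assayName` are hypothetical, made up for illustration. The sketch also assumes a current Bioconductor release, where the class lives in the SummarizedExperiment package rather than GenomicRanges.

```r
library(SummarizedExperiment)

# Hypothetical internal worker: a plain numerical calculation on a matrix.
.rowVariance <- function(x) {
    stopifnot(is.matrix(x), is.numeric(x))
    rowSums((x - rowMeans(x))^2) / (ncol(x) - 1)
}

# Hypothetical user-facing generic with one method per supported input.
setGeneric("rowVariance", function(x, ...) standardGeneric("rowVariance"))

setMethod("rowVariance", "matrix", function(x, ...) .rowVariance(x))

# The SummarizedExperiment method only extracts the assay and delegates.
setMethod("rowVariance", "SummarizedExperiment",
    function(x, assayName = 1L, ...) .rowVariance(assay(x, assayName)))

# Toy data, invented for this example.
counts <- matrix(c(1, 2, 3, 4, 5, 9), nrow = 2, byrow = TRUE,
                 dimnames = list(c("g1", "g2"), c("s1", "s2", "s3")))
se <- SummarizedExperiment(assays = list(counts = counts))

rowVariance(counts)  # plain matrix input
rowVariance(se)      # same calculation on the integrated container
```

Keeping the arithmetic in one matrix-level function means the two methods cannot drift apart; the container method is just a thin adapter.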

The rowData of summarized experiments can minimally contain names like the rownames() of a matrix, but can typically contain much more useful information, e.g., the genomic coordinates of the regions of interest (as GRanges or GRangesList objects) and / or other attributes that are useful to your own analysis (GC content of each region?) or to the user (p-values from previous analysis?). Similarly the colData can be simple identifiers like colnames() of a matrix, but it's much more informative to tightly couple the phenotypic data about the samples. This makes it easy and error-free for the user to do things like subset both the phenotype and expression data by some phenotype of interest, e.g., se[, colData(se)$Gender %in% "Female"].
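A minimal sketch of such a container, using invented toy data (the regions, GC values and sample labels are made up). It assumes a current Bioconductor release, where the class comes from the SummarizedExperiment package and range-based row information is supplied through the `rowRanges` argument.

```r
library(SummarizedExperiment)  # also attaches GenomicRanges

# Toy count matrix: 3 regions (rows) by 4 samples (columns).
counts <- matrix(1:12, nrow = 3,
                 dimnames = list(paste0("region", 1:3),
                                 paste0("sample", 1:4)))

# Row annotation: genomic coordinates plus an extra per-region attribute.
regions <- GRanges(
    seqnames = c("chr1", "chr1", "chr2"),
    ranges = IRanges(start = c(100, 500, 200), width = 50),
    gc = c(0.41, 0.55, 0.48))
names(regions) <- rownames(counts)

# Column annotation: phenotypic data about the samples.
pheno <- DataFrame(
    Gender = c("Female", "Male", "Female", "Male"),
    row.names = colnames(counts))

se <- SummarizedExperiment(
    assays = list(counts = counts),
    rowRanges = regions,
    colData = pheno)

# One subsetting operation keeps counts and phenotype data in sync.
female <- se[, colData(se)$Gender %in% "Female"]
dim(female)
colnames(female)
```

Because the assay and the sample annotation are subset together, there is no opportunity for the count columns and the phenotype rows to fall out of step.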

Return values should respect the row and column indices of the inputs as appropriate, so for instance it's easy for the user to add a matrix (assays(se)[["foo"]] <- foo(se, ...)), or vector or data.frame (preferably, DataFrame) (colData(se)$bar <- bar(se, ...)) of results to their summarized experiment. It may often be appropriate to do this work for the user, returning a SummarizedExperiment annotated with your additional results.
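One way this could look, as a sketch only: the functions `scaleToMillion`, `librarySize` and `annotate` are hypothetical stand-ins for a package's own analysis code, and the data are invented. The sketch assumes the current SummarizedExperiment package.

```r
library(SummarizedExperiment)

# Toy data: 3 genes by 2 samples.
counts <- matrix(c(10, 20, 30, 40, 50, 60), nrow = 3,
                 dimnames = list(paste0("gene", 1:3), c("s1", "s2")))
se <- SummarizedExperiment(
    assays = list(counts = counts),
    colData = DataFrame(Gender = c("Female", "Male"),
                        row.names = colnames(counts)))

# Hypothetical analysis returning a matrix with the input's dimnames.
scaleToMillion <- function(se) {
    x <- assay(se, "counts")
    sweep(x, 2, colSums(x), "/") * 1e6
}

# Hypothetical analysis returning one value per sample (column).
librarySize <- function(se) colSums(assay(se, "counts"))

# Because results follow the input's rows and columns, the user can
# attach them directly.
assays(se)[["cpm"]] <- scaleToMillion(se)
colData(se)$libSize <- librarySize(se)

# Alternatively, do the work for the user and return the annotated object.
annotate <- function(se) {
    assays(se)[["cpm"]] <- scaleToMillion(se)
    colData(se)$libSize <- librarySize(se)
    rowData(se)$meanCount <- rowMeans(assay(se, "counts"))
    se
}
annotated <- annotate(se)
```

Returning the annotated object keeps inputs and results in a single container, so later subsetting carries the results along.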

There are similar data structures for other types of data, e.g., Biobase::ExpressionSet for microarrays and in the flow cell packages. Feel free to ask on this list if you're looking for guidance.

Not all return values are as simple as a vector, matrix, or data.frame, and of course one should not try to fit these into an inappropriate data structure.


Martin
--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel
