Hi package developers --

I found this article pretty interesting reading

  
http://www.nature.com/nbt/journal/v31/n10/full/nbt.2721.html?WT.ec_id=NBT-201310

especially of course the comments of Robert Gentleman and the reasons for success of R (external packages written by domain experts) and Bioconductor (interoperability between different analysis capabilities enabled by using similar data structures). It's also very important to provide 'integrated' containers that couple, say, a matrix of expression count data with the annotations of the genes / gene regions (rows) and sample phenotypic data (columns).

With these ideas in mind, I want to emphasize that new and existing Bioconductor packages should be re-using established data structures. With omics data it is very important to offer users a way to easily work with data across Bioconductor packages. While you might implement 'internal' functions that perform numerical calculations on an R `matrix`, say, the major input functions should really support GenomicRanges::SummarizedExperiment objects, rather than (in addition to?) plain old matrix objects.
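One way this layering might look is sketched below. It assumes a current Bioconductor release, where the class is provided by the SummarizedExperiment package rather than GenomicRanges. The names `.logCpm` and `logCpm`, and the assay name "counts", are hypothetical and chosen only for illustration.

```r
library(SummarizedExperiment)

## Internal worker: a plain numerical calculation on an ordinary matrix.
## (Hypothetical example; not a function from any existing package.)
.logCpm <- function(m, prior = 1) {
    libSize <- colSums(m)
    log2(t(t(m + prior) / (libSize + 2 * prior)) * 1e6)
}

## User-facing generic with methods for both kinds of input.
setGeneric("logCpm", function(x, ...) standardGeneric("logCpm"))

## Plain matrix in, plain matrix out.
setMethod("logCpm", "matrix", function(x, ...) .logCpm(x, ...))

## SummarizedExperiment in, SummarizedExperiment out: the result is stored
## as a new assay, so row and column annotations stay attached to it.
## Assumes the input has an assay named "counts".
setMethod("logCpm", "SummarizedExperiment", function(x, ...) {
    assays(x)[["logCpm"]] <- .logCpm(assay(x, "counts"), ...)
    x
})
```

The numerical code never needs to know about the container; only the thin method on top does.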

The rowData of summarized experiments can minimally contain names like the rownames() of a matrix, but can typically contain much more useful information, e.g., the genomic coordinates of the regions of interest (as GRanges or GRangesList objects) and / or other attributes that are useful to your own analysis (GC content of each region?) or to the user (p-values from previous analysis?). Similarly the colData can be simple identifiers like colnames() of a matrix, but it's much more informative to tightly couple the phenotypic data about the samples. This makes it easy and error-free for the user to do things like subset both the phenotype and expression data by some phenotype of interest, e.g., se[, colData(se)$Gender %in% "Female"].
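As a small illustration of such a container, here is a sketch with made-up data. It assumes a current Bioconductor release, in which the genomic coordinates are supplied through the `rowRanges` argument; the counts, coordinates, GC values, and phenotypes are all invented for the example.

```r
library(SummarizedExperiment)   # also attaches GenomicRanges and S4Vectors

set.seed(1)
counts <- matrix(rpois(12, lambda = 10), nrow = 3,
                 dimnames = list(paste0("region", 1:3), paste0("sample", 1:4)))

## Row annotation: genomic coordinates plus an extra per-region attribute.
regions <- GRanges("chr1", IRanges(start = c(100, 500, 900), width = 200),
                   gc = c(0.41, 0.55, 0.48))
names(regions) <- rownames(counts)

## Column annotation: phenotypic data about each sample.
pheno <- DataFrame(Gender = c("Female", "Male", "Female", "Male"),
                   row.names = colnames(counts))

se <- SummarizedExperiment(assays = list(counts = counts),
                           rowRanges = regions, colData = pheno)

## One subsetting operation keeps counts, regions, and phenotypes in sync.
females <- se[, colData(se)$Gender %in% "Female"]
dim(females)
colnames(females)
```

Because the phenotype travels with the counts, there is no separate data.frame whose rows could drift out of step with the matrix columns.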

Return values should respect the row and column indices of the inputs as appropriate, so for instance it's easy for the user to add a matrix (assays(se)[["foo"]] <- foo(se, ...)), or vector or data.frame (preferably, DataFrame) (colData(se)$bar <- bar(se, ...)) of results to their summarized experiment. It may often be appropriate to do this work for the user, returning a SummarizedExperiment annotated with your additional results.
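A sketch of what this looks like from the user's side, assuming a current Bioconductor release. Here `foo` and `bar` are hypothetical stand-ins for a package's analysis functions, and the data are invented.

```r
library(SummarizedExperiment)

counts <- matrix(1:12, nrow = 3,
                 dimnames = list(paste0("region", 1:3), paste0("sample", 1:4)))
pheno <- DataFrame(Gender = c("Female", "Male", "Female", "Male"),
                   row.names = colnames(counts))
se <- SummarizedExperiment(assays = list(counts = counts), colData = pheno)

## Hypothetical analysis functions whose results line up with the input:
foo <- function(se) log2(assay(se, "counts") + 1)  # matrix, same dim as se
bar <- function(se) colSums(assay(se, "counts"))   # one value per sample

## Because dimensions and order match, results slot straight back in.
assays(se)[["foo"]] <- foo(se)
colData(se)$bar <- bar(se)

assayNames(se)
colData(se)
```

If `foo()` returned rows or columns in a different order, or dropped some, the user would have to re-align the result by hand before storing it, which is exactly the kind of error the container is meant to prevent.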

There are similar data structures for other types of data, e.g., Biobase::ExpressionSet for microarrays, and others in the flow cell packages. Feel free to ask on this list if you're looking for guidance.

Not all return values are as simple as a vector, matrix, or data.frame, and of course one should not try to force them into an inappropriate data structure.

Martin
--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel
