Re: [Bioc-devel] Subsetting eSet-like objects with duplicated indices

Martin Morgan Wed, 12 Feb 2014 06:50:40 -0800

On 02/11/2014 05:03 PM, Benilton Carvalho wrote:

Hi,


I'm trying to understand why FeatureSet objects behave slightly differently
from eSet objects.

There's a combination of things going on, some of which are unfortunate / unintended.

The basic problem is that, with regard to row names, subsetting a matrix with duplicate indexes behaves differently from subsetting a data.frame:


    > matrix(0, 2, 2, dimnames=list(1:2, 3:4))[c(1,1),]
      3 4
    1 0 0
    1 0 0
    > data.frame(x=1:2, y=3:4)[c(1, 1),]
        x y
    1   1 3
    1.1 1 3

The creation of artificial row names is particularly bad when the row name identifier has an integer component, like an Ensembl gene id, because then the row name appears somehow legitimate but really isn't.
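
To make that concrete, here is a minimal base-R sketch. The feature table and its two Ensembl-style ids are illustrative assumptions, not data from this thread. The munged name carries a ".1" suffix that could easily be mistaken for part of a real identifier.

```r
## Hypothetical feature table keyed by Ensembl-style gene ids (illustrative values).
fdata <- data.frame(
    symbol = c("BRCA2", "TP53"),
    row.names = c("ENSG00000139618", "ENSG00000141510"),
    stringsAsFactors = FALSE
)

## Selecting the first row twice forces data.frame to invent a unique row name.
dup <- fdata[c(1, 1), , drop = FALSE]
munged <- rownames(dup)
print(munged)

## The matrix equivalent keeps the original (duplicated) row name.
m <- as.matrix(fdata)
kept <- rownames(m[c(1, 1), , drop = FALSE])
print(kept)
```

Nothing in the second munged name says it was invented by subsetting rather than supplied by the annotation source.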


What happens when subsetting an ExpressionSet? Some of each, unfortunately:

    library(Biobase)
    m = matrix(0, 2, 2, dimnames=list(1:2, 3:4))
    e = ExpressionSet(m)[c(1, 1),]
    rownames(fData(e))    ## featureNames(featureData(e))
    ## [1] "1"   "1.1"
    rownames(exprs(e))    ## featureNames(assayData(e))
    ## [1] "1" "1"

and perhaps more unfortunately the validity of the object returned by subsetting is not checked:


    validObject(e)
    ## Error in validObject(e) :
    ##   invalid class "ExpressionSet" object: featureNames differ
    ##   between assayData and featureData

NChannelSet seems to behave better: it checks for the confusing labels and fails.

Because the row identifiers need to be munged, and munged identifiers are bad, it seems like the NChannelSet failure is desired. The behavior of ExpressionSet needs to be cleaned up. It seems like the identifiers could be managed separately from the row names, and the validity of returned objects checked. The latter is likely to break code that currently works, because an early paradigm was to update an object incrementally.

An alternative is to 'start again' using the much more well-designed IRanges infrastructure, along the lines of
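
As a rough illustration of what 'checking the validity of returned objects' could look like from user code, here is a sketch. checkedSubset is a hypothetical helper, not part of Biobase, and it assumes positive numeric or character feature indices. It refuses duplicated indices up front (mirroring the NChannelSet failure) and validates whatever the subset returns.

```r
library(Biobase)

## Hypothetical helper, not part of Biobase: subset features, then validate.
## Assumes 'i' holds positive numeric or character indices.
checkedSubset <- function(x, i) {
    if (!is.logical(i) && anyDuplicated(i) > 0L)
        stop("duplicated feature indices would require munged featureNames")
    y <- x[i, ]
    validObject(y)   # errors if assayData and featureData names disagree
    y
}

m <- matrix(0, 2, 2, dimnames = list(1:2, 3:4))
e <- ExpressionSet(m)

## Distinct indices pass through unchanged.
ok <- checkedSubset(e, c(2, 1))

## Duplicated indices fail early instead of returning an invalid object.
failed <- inherits(try(checkedSubset(e, c(1, 1)), silent = TRUE), "try-error")
```

A wrapper like this only protects callers who opt in; the cleanup described above would have to happen in the subsetting method itself.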


    .ExpressionExperiment <- setClass("ExpressionExperiment",
        representation(exptData="List",
                       rowData="DataFrame",
                       colData="DataFrame",
                       assays="SimpleList"))

Simon Anders will recognize this design from an earlier suggestion of his.


Martin


Here's the one example I'm trying to work out:

if (!require(pd.hugene.1.0.st.v1)){
   library(BiocInstaller)
   biocLite('pd.hugene.1.0.st.v1')
}
library(oligoData)
data(affyGeneFS)
affyGeneFS
data(sample.ExpressionSet)
sample.ExpressionSet

## subset ExpressionSet
## everything ok
sample.ExpressionSet[c(1, 1),]

## subset FeatureSet
## error: featureNames differ between assayData and featureData
affyGeneFS[c(1, 1),]

But FeatureSets are derived from NChannelSet objects... so:

example('NChannelSet-class')
obj
obj[c(1, 2),] ## OK
obj[c(1, 1),] ## not OK

I was wondering why/if this is intended (i.e., it works on "single channel"
eSets, but fails on NChannelSets)?

Thank you so much for any insight,

benilton


_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel



--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793
