On 04/06/2014 04:21 PM, Michael Lawrence wrote:



On Sun, Apr 6, 2014 at 2:48 PM, Simon Anders <and...@embl.de
<mailto:and...@embl.de>> wrote:

    Hi Michael

    On 06/04/14 23:32, Michael Lawrence wrote:
     > On an arbitrary vector, the names do not need to be unique, but they DO
     > need to be unique on a DataFrame (according to the data.frame
     > conventions). Conditioning on whether there are duplicate names would be
     > too complicated, so it is left to the user to declare whether the names
     > are expected on the result. Since in general the vector names are not
     > valid rownames, the default is FALSE. I guess if we really wanted to be
     > consistent with R, we would mangle the names to make them unique, but
     > that check is expensive.

    Thanks for the response, but I'm not sure I understand it. I thought
    "use.names=TRUE" instructs "mcols" to use the rownames of the
    SummarizedExperiment object as rownames for the returned DataFrame. Now,
    as the rownames of the SummarizedExperiment have to be unique anyway (at
    least, I suppose they have to -- they are names, too, after all, and not
    just an arbitrary vector), how can it happen that duplicate names might
    appear?


I don't think the SE rownames are constrained to be unique. I haven't tested it,

Empirically, the row names can be duplicated, but the column names cannot.

The lack of constraint on row names is enabled by the rowData GenomicRanges, while the constraint on column names is introduced by the (rownames of the) colData DataFrame. So the lack of symmetry in the class leads to lack of symmetry for dimnames. The use of GenomicRanges for rows has been the subject of previous discussion.

It wouldn't be inconceivable to impose constraints on duplicate row names in SummarizedExperiment and set use.names=TRUE by default, or to redefine mcols(se) to use.names=!any(duplicated(rownames(se))). There would be performance consequences (how much?) and an mcols inconsistency. I think this is part of the same discussion as

  https://stat.ethz.ch/pipermail/bioc-devel/2014-March/005409.html

which I have not yet followed through on.

Syntax-wise, there is also

  mcols(se)[rownames(se) == "gene_D", "yellowness"]

This is more efficient (and more error-prone) than either use.names or Michael's suggestion.
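
To illustrate why the logical-index form is error-prone, here is a minimal base-R sketch. It does not use SummarizedExperiment at all: `ids` and `meta` are hypothetical stand-ins for rownames(se) and mcols(se), and the values are made up for illustration.

```r
# Hypothetical stand-ins for rownames(se) and mcols(se); base R only.
ids  <- c("gene_A", "gene_D", "gene_C", "gene_D")   # note the duplicated ID
meta <- data.frame(yellowness = c(0.1, 0.7, 0.3, 0.9))

# Logical indexing never attaches row names, so duplicated IDs are tolerated.
hits <- meta[ids == "gene_D", "yellowness"]

# The result is not guaranteed to have length 1: a duplicated ID yields
# several values, and an unknown ID silently yields a zero-length vector.
none <- meta[ids == "gene_X", "yellowness"]

print(hits)
print(length(none))
```

With unique, known IDs this behaves like the row-name lookup; the caller has to check the length of the result to notice the other two cases.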

Martin

but I don't see the assertion in the code. This is because an SE is modeled as a
matrix, which does not have the same constraint as a data.frame.
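
The difference between the two conventions can be checked in base R. The following is only a sketch with made-up names, using a plain matrix and a plain data.frame rather than the SummarizedExperiment and DataFrame classes; it also shows the kind of name mangling mentioned earlier in the thread, via make.unique().

```r
# A plain matrix accepts duplicated row names without complaint.
m <- matrix(1:4, nrow = 2,
            dimnames = list(c("gene_A", "gene_A"), c("s1", "s2")))
matrix_has_dups <- anyDuplicated(rownames(m)) > 0

# A base data.frame refuses duplicated row names at construction time.
df_rejects_dups <- tryCatch({
  data.frame(x = 1:2, row.names = c("gene_A", "gene_A"))
  FALSE
}, error = function(e) TRUE)

# Mangling duplicated names into unique ones, as base R can do on request.
mangled <- make.unique(c("gene_A", "gene_A"))

print(matrix_has_dups)
print(df_rejects_dups)
print(mangled)
```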

    The use case: I have a SummarizedExperiment object with gene IDs in the
    rownames. Let's say I want to get the value in the meta-data column
    "yellowness" for "gene_D".

    With an ExpressionSet, I could write:
        fData(es)["gene_D","yellowness"]

    With SummarizedExperiment, it has to be:
        mcols(se,use.names=TRUE)["gene_D","yellowness"]

    Of course, it's no big deal, but I find it quite clumsy, and I wonder
    why it has to be this way.


Well, there's this syntax:
mcols(se["gene_D",])$yellowness


       Simon




--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel
