Re: [Bioc-devel] 'semantically rich' subsetting of SummarizedExperiments

Hervé Pagès Mon, 13 Oct 2014 21:45:24 -0700

Hi,

On 10/11/2014 02:25 PM, Vincent Carey wrote:

On Sat, Oct 11, 2014 at 5:17 PM, Michael Lawrence <lawrence.mich...@gene.com> wrote:

But what would it do exactly?

Probably would want to be able to extract a gene list from a TxDb, then
extract the desired type of structure from the TxDb.

Not too bad right now, but it would be nice to leverage the identifier
type information on the gene list object.

Currently:
tx <- transcripts(txdb, vals=list(gene_id=genes))

Proposed:
tx <- transcripts(txdb[GeneList])


yes, that makes sense.  i don't go to txdb's as naturally as i should.


Also coming a little late to the party, but I also have a preference
for Kasper's proposal of using subsetByXXX.

Supporting 'txdb[GeneList]' is arbitrarily making gene ids special,
when a TxDb contains other ids (transcript and exon ids).

Also, if you push this concept a little bit, you quickly run into
some semantic headaches:

  - First, let's keep in mind that for a common track like the
    "UCSC Genes" track, a lot of transcripts are not linked to any
    gene.

  - Then, allowing subsetting a TxDb by a character vector means
    a TxDb has names. At least conceptually. So it's tempting to
    also support 'names(txdb)' (would return all the gene ids).

  - Finally, the names being unique, it seems natural to expect that
    'txdb[names(txdb)]' is a no-op. But it won't be, because
    'txdb[names(txdb)]' will drop all the transcripts that are not
    linked to a gene.
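
To make the last point concrete, here is a minimal sketch in base R. It
does not use a real TxDb: 'tx_table', 'gene_names()' and
'subset_by_gene()' are hypothetical stand-ins for a TxDb, 'names(txdb)'
and 'txdb[...]', invented only for illustration.

```r
# Toy stand-in for a TxDb: a plain data.frame of transcripts, one of which
# has no associated gene (NA). All names here are hypothetical.
tx_table <- data.frame(
  tx_id   = c("tx1", "tx2", "tx3", "tx4"),
  gene_id = c("g1", "g1", NA, "g2"),
  stringsAsFactors = FALSE
)

# Hypothetical analogue of names(txdb): the unique gene ids in the table.
gene_names <- function(tbl) unique(tbl$gene_id[!is.na(tbl$gene_id)])

# Hypothetical analogue of txdb[genes]: keep transcripts linked to 'genes'.
subset_by_gene <- function(tbl, genes) {
  tbl[tbl$gene_id %in% genes, , drop = FALSE]
}

# Subsetting by all the names is not a no-op: the unlinked transcript
# "tx3" is dropped.
roundtrip <- subset_by_gene(tx_table, gene_names(tx_table))
nrow(tx_table)
nrow(roundtrip)
roundtrip$tx_id
```

Comparing 'nrow(tx_table)' with 'nrow(roundtrip)' shows the round trip
losing the transcript that has no gene id.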

But before any TxDb subsetting can happen (via [ or subsetByXXX), we
need to bring back the classic (and healthier) pass-by-value semantics
on these objects. (Right now TxDb is a reference class and thus TxDb
objects have pass-by-reference semantics.)
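
The difference between the two semantics can be sketched with the
'methods' package alone. 'RefBox' and 'ValBox' below are hypothetical
toy classes, not the actual TxDb implementation; they only show why an
operation that modifies its argument behaves differently on a reference
class than on an ordinary S4 object.

```r
library(methods)

# Reference class: objects are mutable and shared between variables.
RefBox <- setRefClass("RefBox", fields = list(ids = "character"))

# Ordinary S4 class: function calls work on copies of the object.
setClass("ValBox", representation(ids = "character"))

# Both helpers drop the first id from the object they are given.
drop_first_ref <- function(x) { x$ids <- x$ids[-1]; invisible(x) }
drop_first_val <- function(x) { x@ids <- x@ids[-1]; x }

r <- RefBox$new(ids = c("a", "b", "c"))
v <- new("ValBox", ids = c("a", "b", "c"))

drop_first_ref(r)        # modifies the caller's 'r' as a side effect
v2 <- drop_first_val(v)  # returns a modified copy; 'v' is left untouched

r$ids
v@ids
v2@ids
```

With pass-by-reference, a subsetting function written carelessly would
silently alter the object it was handed; with pass-by-value, the caller
always keeps the original.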

H.




On Sat, Oct 11, 2014 at 10:49 AM, Martin Morgan <mtmor...@fhcrc.org>
wrote:

On 10/11/2014 08:41 AM, Vincent Carey wrote:

Is there anything on the order of as([GeneSet], "GRanges") around?


no, I don't think so; obviously of use and following a common theme.
Martin

On Sat, Sep 20, 2014 at 11:34 PM, Gabe Becker <becker.g...@gene.com>
wrote:

  Sean and Vincent,


The goal of what we are doing builds off of what Martin has in GSEABase.
We were looking to see how much benefit we can get with something
lighter-weight that lies between indistinguishable character vectors and
the full machinery of GeneSets.

Either way, it seems like formalizing the semantic information is a way
to do what you want. Furthermore, these classed id objects can be created
automatically when there is contextual information e.g. during queries
to databases (or db-like objects), and then simply added to metadata
DataFrames and re-used.

~G




On Sat, Sep 20, 2014 at 12:19 PM, Sean Davis <sdav...@mail.nih.gov>
wrote:


On Sat, Sep 20, 2014 at 3:11 PM, Gabe Becker <becker.g...@gene.com>
wrote:

  Hey all,


We are in the (very) early stages of experimenting with something that
seems relevant here: classed identifiers. We are using them for
database/mart queries, but the same concept could be useful for the
cases you're describing, I think.

E.g.

  > mysyms = GeneSymbol(c("BRAF", "BRCA1"))
  > mysyms
  An object of class "GeneSymbol"
  [1] "BRAF"  "BRCA1"
  > yourSE[mysyms, ]
  ...


  This approach has the flavor of some of the functionality that Martin
  put together for the GSEABase package (EntrezIdentifier, etc.).

  Sean

This approach has the benefit of being declarative instead of heuristic
(people won't be able to accidentally invoke it), while still giving
most of the convenience I believe you are looking for.

The object classes inherit directly from character, so should "just
work" most of the time, but as I said it's early days; lots more testing
for functionality and usefulness is needed.
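
A minimal sketch of how such a classed identifier might be declared and
dispatched on, using only the 'methods' package. This is an assumption
about the general shape, not the experimental code described above: the
'GeneSymbol' constructor, the 'pickRows' generic, and the toy 'assay'
and 'symbols' objects are all hypothetical.

```r
library(methods)

# Hypothetical classed identifier: a character vector that declares it
# holds gene symbols.
setClass("GeneSymbol", contains = "character")
GeneSymbol <- function(x) new("GeneSymbol", as.character(x))

# Toy data: rows are named by ids; 'symbols' maps each id to a symbol.
assay <- matrix(1:6, nrow = 3,
                dimnames = list(c("673", "672", "7157"), c("s1", "s2")))
symbols <- c("673" = "BRAF", "672" = "BRCA1", "7157" = "TP53")

# Hypothetical generic that dispatches on the type of the index 'i'.
setGeneric("pickRows", function(x, i, ...) standardGeneric("pickRows"))

# Plain character vectors are matched against row names only.
setMethod("pickRows", signature(i = "character"),
          function(x, i, ...) x[rownames(x) %in% i, , drop = FALSE])

# GeneSymbol vectors are first translated to row ids through 'map'.
setMethod("pickRows", signature(i = "GeneSymbol"),
          function(x, i, map, ...) {
            ids <- names(map)[map %in% as.character(i)]
            x[rownames(x) %in% ids, , drop = FALSE]
          })

mysyms <- GeneSymbol(c("BRAF", "BRCA1"))
by_name   <- pickRows(assay, c("673", "BRAF"))       # "BRAF" is not a row name
by_symbol <- pickRows(assay, mysyms, map = symbols)  # resolved via the map

is(mysyms, "character")
rownames(by_name)
rownames(by_symbol)
```

The symbol lookup only happens when the caller has wrapped the vector in
GeneSymbol(); a bare character vector is never reinterpreted.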

~G


On Sat, Sep 20, 2014 at 11:38 AM, Vincent Carey <st...@channing.harvard.edu> wrote:

  OK by me to leave [ alone.  We could start with subsetByEntrez,
subsetByKEGG, subsetBySymbol, subsetByGOTERM, subsetByGOID.

Utilities to generate GRanges for queries in each of these vocabularies
should, perhaps, be in the OrganismDb space?  Once those are in place
no additional infrastructure is necessary?
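
One possible shape for such a family, sketched in base R on toy data.
Only the names subsetBySymbol and subsetByKEGG come from the list above;
their signatures, the shared helper 'subsetByColumn', and the
'annotation' table are assumptions made for illustration, and a real
version would work on SummarizedExperiment objects and annotation
packages rather than a matrix and a data.frame.

```r
# Toy data; every id and mapping below is made up for illustration.
assay <- matrix(1:8, nrow = 4,
                dimnames = list(c("id1", "id2", "id3", "id4"), c("s1", "s2")))
annotation <- data.frame(
  row_id = c("id1", "id2", "id3", "id4"),
  symbol = c("GENE_A", "GENE_B", "GENE_C", "GENE_D"),
  kegg   = c("path1", "path1", "path2", NA),
  stringsAsFactors = FALSE
)

# Shared helper: keep the rows whose annotation in 'column' matches 'values'.
subsetByColumn <- function(x, values, annotation, column) {
  ids <- annotation$row_id[annotation[[column]] %in% values]
  x[rownames(x) %in% ids, , drop = FALSE]
}

# One explicit, named entry point per vocabulary.
subsetBySymbol <- function(x, symbols, annotation) {
  subsetByColumn(x, symbols, annotation, "symbol")
}
subsetByKEGG <- function(x, pathways, annotation) {
  subsetByColumn(x, pathways, annotation, "kegg")
}

by_symbol <- subsetBySymbol(assay, c("GENE_A", "GENE_D"), annotation)
by_kegg   <- subsetByKEGG(assay, "path1", annotation)
rownames(by_symbol)
rownames(by_kegg)
```

Because the vocabulary is part of the function name, nothing has to be
guessed from the contents of the query vector.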

On Sat, Sep 20, 2014 at 12:49 PM, Tim Triche, Jr. <tim.tri...@gmail.com> wrote:

  Agreed with Sean, having tried implementing the "magical" alternative.


--t

  On Sep 20, 2014, at 9:31 AM, Sean Davis <sdav...@mail.nih.gov> wrote:

Hi, Vince.

I'm coming a little late to the party, but I agree with Kasper's
sentiment that the less "magical" approach of using subsetByXXX might be
the cleaner way to go for the time being.

Sean


On Sat, Sep 20, 2014 at 10:42 AM, Vincent Carey <st...@channing.harvard.edu> wrote:

  https://github.com/vjcitn/biocMultiAssay/blob/master/vignettes/SEresolver.Rnw

shows some modifications to [ that allow subsetting of SE by
gene or pathway name

it may be premature to work at the [ level.  Kasper suggested defining
a suite of subsetBy operations that would accomplish this

i think we could get something along these lines into the release without
too much more work.  votes?


         [[alternative HTML version deleted]]

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel





--
Computational Biologist
Genentech Research


--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793



--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpa...@fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319
