Re: [Bioc-devel] (missing?) UCSCKG -> SYMBOL mappings in Homo.sapiens (etc.)

Marc Carlson Wed, 13 Feb 2013 15:38:44 -0800

Just posting an update on this.

Just as I was composing a carefully worded email to the folks at UCSC, I
see they seem to have fixed the table browser so that it now looks the
same as the FTP site (and hence the results that come back from
rtracklayer). This means that the UCSC files now look like our
annotation again.



  Marc


On 02/12/2013 01:35 PM, Marc Carlson wrote:

On 02/12/2013 10:04 AM, Tim Triche, Jr. wrote:

re:  '[BioC] question about Gviz' thread fallout:

Yesterday I rolled a relatively simple programmatic way to label UCSC
KnownGene entries with their symbols.  However, some isoforms (e.g. some
for NRIP1 and CDKN2B) seem to be missing from the mappings.

Investigating a bit, and referring to ?org.Hs.egUCSCKG, I find

...This mapping is based on the very latest build available at UCSC
     for this organism as of March 2010.  2.6 is the last release where
     you can expect it to be here.  The GenomicFeatures package
     contains functionality that replaces the need for this mapping...

Alas, I'm too thick to find where, in the TxDb or elsewhere, I could
retrieve Hugo IDs for UCSC KnownGene entries without using org.Hs.egSYMBOL.
   The latter is what I usually do:

    library(Homo.sapiens)

    txs <- transcriptsBy(TxDb.Hsapiens.UCSC.hg19.knownGene)
    head(names(txs))
    ## [1] "1"         "10"        "100"       "1000"      "10000"     "100008586"

    names(txs) <- mget(names(txs), org.Hs.egSYMBOL, ifnotfound=NA)
    head(names(txs))
    ## [1] "A1BG"    "NAT2"    "ADA"     "CDH2"    "AKT3"    "GAGE12F"

Now, I thought for a while, hell, this gets them all!  But, not really...

    txs$NRIP1
    ## GRanges with 1 range and 2 metadata columns:
    ##       seqnames               ranges strand |     tx_id     tx_name
    ##          <Rle>            <IRanges>  <Rle> | <integer> <character>
    ##   [1]    chr21 [16333556, 16437126]      - |     71301  uc002yjx.2

Well, that's one of the isoforms.  But what about the other ones?

    org.Hs.egUCSCKG[[ "c002yjx.1" ]]
    ## NULL

    org.Hs.egUCSCKG[[ "uc010gkz.1" ]]
    ## NULL

I know UCSC identifiers can be a bit of a pain in the ass, but there do
exist mappings for these.  If they're going to be used as primary
identifiers for the TxDb packages, would it be possible to update them?

If it's an issue of time constraints, I will take a stab at it, but that
will almost guarantee more prattling from me on the mailing list.  On the
other hand, it might move GAF3.0 annotations out of the station.

Much obliged for any insights from the core developers.



Hi Tim,

So continuing from the other thread...

1st thing I noticed is that if you try to look up the two known gene IDs
that you gave me you will not have any luck.  From using the web
service, it seems that they are not actually valid UCSC known gene IDs.
At 1st I thought that maybe there had been updates since the last
Bioconductor release in October, but pasting these IDs into the UCSC
genome browser only led me to this:

# Sorry, couldn't locate uc010gkz.1 in genome database
# Sorry, couldn't locate c002yjx.1 in genome database

So at this point I was a little curious where you actually got these ids
from?  (I will actually return to this in a minute)

Anyhow, looking deeper the website indicates that there is another
isoform for NRIP1 (other than "uc002yjx.2").  It is called
"uc021whl.1".  And it does indeed come up empty handed if you call
select like this:

select(Homo.sapiens, cols=c("SYMBOL","TXNAME"),
keys=c("uc002yjx.2","uc021whl.1"), keytype="TXNAME")

So what happened here?  Well the track data from UCSC doesn't have a
gene assigned to that isoform yet.  So the DB has no way of knowing that
it's connected.  Incidentally, this is still true even if you were to
download it this morning.
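
A minimal sketch of how one might spot such unmapped isoforms in the
data frame that select() returns. The helper name unmapped_transcripts
is made up for illustration, and the toy data frame only mimics the
TXNAME/SYMBOL columns; the NA for uc021whl.1 is an assumption based on
the empty result described above. With the packages installed, the real
input would come from the select() call shown earlier.

```r
# Hypothetical helper: return transcript names with no gene symbol attached.
# `tbl` is assumed to have character columns TXNAME and SYMBOL, like the
# data frame returned by select() on Homo.sapiens.
unmapped_transcripts <- function(tbl) {
  stopifnot(all(c("TXNAME", "SYMBOL") %in% names(tbl)))
  unique(tbl$TXNAME[is.na(tbl$SYMBOL)])
}

# Toy stand-in for the select() result (IDs taken from this thread;
# the NA is assumed, not copied from real output).
toy <- data.frame(
  TXNAME = c("uc002yjx.2", "uc021whl.1"),
  SYMBOL = c("NRIP1", NA),
  stringsAsFactors = FALSE
)

unmapped_transcripts(toy)
```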

So here we have a situation where the UCSC web site has been updated,
but their track table (and in particular the table called
"knownToLocusLink") is not perfectly in sync with the web site.

Even weirder is the fact that if you use the "table browser" to download
the "latest" "knownToLocusLink" table (which is yet another service on
their web site), you will get a table that has two isoforms (associated
with NRIP1) that look very similar to the ones you mentioned before.
In fact I am willing to guess that this is where you got these from (and
that the shortened one is just a copy-paste typo).

So the problem here is that there seem to be three different ways to get
the same kind of data from UCSC genome browser.  There is the
website/browser.  There is rtracklayer, and then there is also the web
form access to the table browser (which is what I think you used).  AND
they all three seem to be in disagreement with each other.  My suspicion
is that some of these are just more up to date than others.  But I think
that only UCSC will really know which one is most current or why they
seem to disagree.
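
One way to make the disagreement concrete is to pull the transcript IDs
from each access route and list the ones that are not common to all of
them. This is only a sketch: the helper name ids_not_shared and the
placeholder IDs are hypothetical, and how the three ID vectors are
obtained is left out.

```r
# Hypothetical helper: for each named ID vector, list the IDs that are
# missing from at least one of the sources (i.e. not in the intersection).
ids_not_shared <- function(sources) {
  shared <- Reduce(intersect, sources)
  lapply(sources, function(ids) setdiff(unique(ids), shared))
}

# Placeholder IDs only; substitute the vectors pulled from each route.
sources <- list(
  website      = c("tx_a", "tx_b", "tx_c"),
  rtracklayer  = c("tx_a", "tx_b"),
  tablebrowser = c("tx_a", "tx_d")
)

ids_not_shared(sources)
```

An empty result for every source would mean the three routes agree on
the set of IDs, although not necessarily on the gene assigned to each.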

I have CC'd Michael, who maintains the excellent rtracklayer package, in
case he has some insight.


    Marc



_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel