[Bioc-devel] (missing?) UCSCKG -> SYMBOL mappings in Homo.sapiens (etc.)

Marc Carlson Tue, 12 Feb 2013 13:37:06 -0800

On 02/12/2013 10:04 AM, Tim Triche, Jr. wrote:
> re:  '[BioC] question about Gviz' thread fallout:
>
> Yesterday I rolled a relatively simple programmatic way to label UCSC
> KnownGene entries with their symbols.  However, some isoforms (e.g. some
> for NRIP1 and CDKN2B) seem to be missing from the mappings.
>
> Investigating a bit, and referring to ?org.Hs.egUCSCKG, I find
>
> ...This mapping is based on the very latest build available at UCSC
>     for this organism as of March 2010.  2.6 is the last release where
>     you can expect it to be here.  The GenomicFeatures package
>     contains functionality that replaces the need for this mapping...
>
> Alas, I'm too thick to find where, in the TxDb or elsewhere, I could
> retrieve Hugo IDs for UCSC KnownGene entries without using
> org.Hs.egSYMBOL.  The latter is what I usually do:
>
>    library(Homo.sapiens)
>
>    txs<- transcriptsBy(TxDb.Hsapiens.UCSC.hg19.knownGene)
>    head(names(txs))
>    ## [1] "1"         "10"        "100"       "1000"      "10000"     "100008586"
>
>    names(txs)<- mget(names(txs), org.Hs.egSYMBOL, ifnotfound=NA)
>    head(names(txs))
>    ## [1] "A1BG"    "NAT2"    "ADA"     "CDH2"    "AKT3"    "GAGE12F"
>
> Now, I thought for a while, hell, this gets them all!  But, not really...
>
>    txs$NRIP1
>    ## GRanges with 1 range and 2 metadata columns:
>    ##       seqnames               ranges strand |     tx_id     tx_name
>    ##          <Rle>            <IRanges>  <Rle> | <integer> <character>
>    ##   [1]    chr21 [16333556, 16437126]      - |     71301  uc002yjx.2
>
> Well, that's one of the isoforms.  But what about the other ones?
>
>    org.Hs.egUCSCKG[[ "c002yjx.1" ]]
>    ## NULL
>
>    org.Hs.egUCSCKG[[ "uc010gkz.1" ]]
>    ## NULL
>
> I know UCSC identifiers can be a bit of a pain in the ass, but there do
> exist mappings for these.  If they're going to be used as primary
> identifiers for the TxDb packages, would it be possible to update them?
>
> If it's an issue of time constraints, I will take a stab at it, but that
> will almost guarantee more prattling from me on the mailing list.  On the
> other hand, it might move GAF3.0 annotations out of the station.
>
> Much obliged for any insights from the core developers.




Hi Tim,

So continuing from the other thread...

The first thing I noticed is that if you try to look up the two known
gene IDs that you gave me, you will not have any luck.  Judging from the
web service, they are not actually valid UCSC known gene IDs.  At first
I thought that maybe there had been updates since the last Bioconductor
release in October, but pasting these IDs into the UCSC genome browser
only led me to this:

# Sorry, couldn't locate uc010gkz.1 in genome database
# Sorry, couldn't locate c002yjx.1 in genome database

So at this point I was a little curious where you actually got these IDs
from.  (I will return to this in a minute.)

Anyhow, looking deeper, the website indicates that there is another
isoform for NRIP1 (other than "uc002yjx.2").  It is called
"uc021whl.1", and it does indeed come up empty-handed if you call
select like this:

select(Homo.sapiens, cols=c("SYMBOL","TXNAME"),
       keys=c("uc002yjx.2","uc021whl.1"), keytype="TXNAME")

So what happened here?  Well, the track data from UCSC doesn't have a
gene assigned to that isoform yet, so the DB has no way of knowing that
it's connected.  Incidentally, this would still be true even if you were
to download it this morning.
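
A minimal sketch of how one might list the transcript names that come
back without a gene symbol.  It assumes a newer AnnotationDbi in which
the argument is spelled `columns` rather than `cols`, and the variable
names (`keys`, `res`, `unmapped`) are only illustrative:

```r
library(Homo.sapiens)

# The two NRIP1 transcript names discussed in this thread
keys <- c("uc002yjx.2", "uc021whl.1")

# Assumption: newer AnnotationDbi releases take `columns=` where the
# call above uses `cols=`
res <- select(Homo.sapiens, keys = keys,
              columns = c("SYMBOL", "TXNAME"), keytype = "TXNAME")

# Keys with no usable symbol, whether select() returned an NA row for
# them or dropped them from the result entirely
unmapped <- setdiff(keys, res$TXNAME[!is.na(res$SYMBOL)])
unmapped
```

What ends up in `unmapped` depends on the annotation packages that are
installed, so no output is shown here.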

So here we have a situation where the UCSC web site has been updated, 
but their track table (and in particular the table called 
"knownToLocusLink") is not perfectly in sync with the web site.

Even weirder is the fact that if you use the "table browser" (which is
yet another service on their web site) to download the "latest"
knownToLocusLink table, you will get a table that has two isoforms
(associated with NRIP1) that look very similar to the ones you mentioned
before.  In fact, I am willing to guess that this is where you got these
from, and that the shortened one is just a copy-paste typo.
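
As a rough sketch, a shape check in base R can catch this kind of
copy-paste damage before any lookup.  The pattern is an assumption
inferred from the IDs in this thread ("uc", three digits, three
lowercase letters, a dot, a version number) and is not an official UCSC
specification:

```r
# Assumed shape of a UCSC known gene ID, inferred from this thread:
# "uc" + 3 digits + 3 lowercase letters + "." + version number
ucsc_kg_pattern <- "^uc[0-9]{3}[a-z]{3}\\.[0-9]+$"

ids <- c("uc002yjx.2", "uc021whl.1", "uc010gkz.1", "c002yjx.1")

# TRUE where an ID has the expected shape
well_formed <- grepl(ucsc_kg_pattern, ids)

# IDs that fail the check are probably truncated or mistyped
malformed <- ids[!well_formed]
malformed
## [1] "c002yjx.1"
```

Passing this check only says that an ID is well-formed, not that it
exists in any particular UCSC build.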

So the problem here is that there seem to be three different ways to get
the same kind of data from the UCSC genome browser: the website/browser,
rtracklayer, and the web form access to the table browser (which is what
I think you used).  And all three seem to be in disagreement with each
other.  My suspicion is that some of these are just more up to date than
others.  But I think that only UCSC will really know which one is most
current or why they seem to disagree.

I have CC'd Michael, who maintains the excellent rtracklayer package, in
case he has some insight.


   Marc



_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel
