Re: [Bioc-devel] (missing?) UCSCKG -> SYMBOL mappings in Homo.sapiens (etc.)

Tim Triche, Jr. Thu, 14 Feb 2013 09:31:56 -0800

Thanks much for sleuthing this out Marc.  As much as I respect Jim Kent and
David Haussler and the rest of the folks at UCSC, which is to say
enormously, sometimes I wonder if it is possible for a relatively small
crew to simultaneously maintain an "infrastructure" type of project (I
would guess that UCSC's traffic is a milliGoogle or so, but without the ad
revenue to hire new engineers) and do academic research.


One of the reasons I was interested in GAF3.0 for transcript annotations
was as a static snapshot that UCSC is promoting as both 1) stable and 2)
definitive.  The other reason is that it makes splice graph assembly much
easier :-)

Thanks again,

--t



On Wed, Feb 13, 2013 at 3:35 PM, Marc Carlson <mcarl...@fhcrc.org> wrote:

> Just posting an update on this,
>
> Just as I was composing a carefully worded email to the folks at UCSC, I
> see they seem to have fixed the table browser so that it now looks the same
> as the FTP site (and hence the results that come back from rtracklayer).
> This means that the UCSC files now look like our annotation again.
>
>
>   Marc
>
>
>
> On 02/12/2013 01:35 PM, Marc Carlson wrote:
>
>> On 02/12/2013 10:04 AM, Tim Triche, Jr. wrote:
>>
>>> re:  '[BioC] question about Gviz' thread fallout:
>>>
>>> Yesterday I rolled a relatively simple programmatic way to label UCSC
>>> KnownGene entries with their symbols.  However, some isoforms (e.g. some
>>> for NRIP1 and CDKN2B) seem to be missing from the mappings.
>>>
>>> Investigating a bit, and referring to ?org.Hs.egUCSCKG, I find
>>>
>>> ...This mapping is based on the very latest build available at UCSC
>>>      for this organism as of March 2010.  2.6 is the last release where
>>>      you can expect it to be here.  The GenomicFeatures package
>>>      contains functionality that replaces the need for this mapping...
>>>
>>> Alas, I'm too thick to find where, in the TxDb or elsewhere, I could
>>> retrieve Hugo IDs for UCSC KnownGene entries without using
>>> org.Hs.egSYMBOL.  The latter is what I usually do:
>>>
>>>     library(Homo.sapiens)
>>>
>>>     txs <- transcriptsBy(TxDb.Hsapiens.UCSC.hg19.knownGene)
>>>     head(names(txs))
>>>     ## [1] "1"         "10"        "100"       "1000"      "10000"     "100008586"
>>>
>>>     names(txs)<- mget(names(txs), org.Hs.egSYMBOL, ifnotfound=NA)
>>>     head(names(txs))
>>>     ## [1] "A1BG"    "NAT2"    "ADA"     "CDH2"    "AKT3"    "GAGE12F"
>>>
>>> Now, I thought for a while, hell, this gets them all!  But, not really...
>>>
>>>     txs$NRIP1
>>>     ## GRanges with 1 range and 2 metadata columns:
>>>     ##       seqnames               ranges strand |     tx_id     tx_name
>>>     ##          <Rle>            <IRanges>  <Rle> | <integer> <character>
>>>     ##   [1]    chr21 [16333556, 16437126]      - |     71301  uc002yjx.2
>>>
>>> Well, that's one of the isoforms.  But what about the other ones?
>>>
>>>     org.Hs.egUCSCKG[[ "c002yjx.1" ]]
>>>     ## NULL
>>>
>>>     org.Hs.egUCSCKG[[ "uc010gkz.1" ]]
>>>     ## NULL
>>>
>>> I know UCSC identifiers can be a bit of a pain in the ass, but there do
>>> exist mappings for these.  If they're going to be used as primary
>>> identifiers for the TxDb packages, would it be possible to update them?
>>>
>>> If it's an issue of time constraints, I will take a stab at it, but that
>>> will almost guarantee more prattling from me on the mailing list.  On the
>>> other hand, it might move GAF3.0 annotations out of the station.
>>>
>>> Much obliged for any insights from the core developers.
>>>
>>
>>
>> Hi Tim,
>>
>> So continuing from the other thread...
>>
>> 1st thing I noticed is that if you try to look up the two known gene IDs
>> that you gave me you will not have any luck.  From using the web
>> service, it seems that they are not actually valid UCSC known gene IDs.
>> At 1st I thought that maybe there had been updates since the last
>> Bioconductor release in October, but pasting these IDs into the UCSC
>> genome browser only led me to this:
>>
>> # Sorry, couldn't locate uc010gkz.1 in genome database
>> # Sorry, couldn't locate c002yjx.1 in genome database
>>
>> So at this point I was a little curious where you actually got these ids
>> from?  (I will actually return to this in a minute)
>>
>> Anyhow, looking deeper the website indicates that there is another
>> isoform for NRIP1 (other than "uc002yjx.2").  It is called
>> "uc021whl.1".  And it does indeed come up empty handed if you call
>> select like this:
>>
>> select(Homo.sapiens, cols=c("SYMBOL","TXNAME"),
>> keys=c("uc002yjx.2","uc021whl.1"), keytype="TXNAME")
>>
>> So what happened here?  Well the track data from UCSC doesn't have a
>> gene assigned to that isoform yet.  So the DB has no way of knowing that
>> it's connected.  Incidentally, this is still true even if you were to
>> download it this morning.
>>
>> So here we have a situation where the UCSC web site has been updated,
>> but their track table (and in particular the table called
>> "knownToLocusLink") is not perfectly in sync with the web site.
>>
>> Even weirder is the fact that if you use the "table browser" to download
>> the "latest" "knownToLocusLink" table (which is yet another service on
>> their web site), you will get a table that has two isoforms (associated
>> with NRIP1) that look very similar to the ones you mentioned before.
>> In fact I am willing to guess that this is where you got these from (and
>> that the shortened one is just a copy-paste typo).
>>
>> So the problem here is that there seem to be three different ways to get
>> the same kind of data from UCSC genome browser.  There is the
>> website/browser.  There is rtracklayer, and then there is also the web
>> form access to the table browser (which is what I think you used).  AND
>> they all three seem to be in disagreement with each other.  My suspicion
>> is that some of these are just more up to date than others.  But I think
>> that only UCSC will really know which one is most current or why they
>> seem to disagree.
>>
>> I have CC'd Michael who maintains the excellent rtracklayer package in
>> case he has some insight.
>>
>>
>>     Marc
>>



-- 
*A model is a lie that helps you see the truth.*
Howard Skipper <http://cancerres.aacrjournals.org/content/31/9/1173.full.pdf>

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel
