Thanks much for sleuthing this out, Marc. As much as I respect Jim Kent and David Haussler and the rest of the folks at UCSC, which is to say enormously, I sometimes wonder whether it is possible for a relatively small crew to simultaneously maintain an "infrastructure" type of project (I would guess that UCSC's traffic is a milliGoogle or so, but without the ad revenue to hire new engineers) and do academic research.
One of the reasons I was interested in GAF3.0 for transcript annotations was as a static snapshot that UCSC is promoting as both 1) stable and 2) definitive. The other reason is that it makes splice graph assembly much easier :-)

Thanks again,

--t

On Wed, Feb 13, 2013 at 3:35 PM, Marc Carlson <mcarl...@fhcrc.org> wrote:

> Just posting an update on this,
>
> Just as I was composing a carefully worded email to the folks at UCSC, I
> see they seem to have fixed the table browser so that it now looks the same
> as the FTP site (and hence the results that come back from rtracklayer).
> This means that the UCSC files now look like our annotation again.
>
> Marc
>
> On 02/12/2013 01:35 PM, Marc Carlson wrote:
>
>> On 02/12/2013 10:04 AM, Tim Triche, Jr. wrote:
>>
>>> re: '[BioC] question about Gviz' thread fallout:
>>>
>>> Yesterday I rolled a relatively simple programmatic way to label UCSC
>>> KnownGene entries with their symbols. However, some isoforms (e.g. some
>>> for NRIP1 and CDKN2B) seem to be missing from the mappings.
>>>
>>> Investigating a bit, and referring to ?org.Hs.egUCSCKG, I find
>>>
>>> ...This mapping is based on the very latest build available at UCSC
>>> for this organism as of March 2010. 2.6 is the last release where
>>> you can expect it to be here. The GenomicFeatures package
>>> contains functionality that replaces the need for this mapping...
>>>
>>> Alas, I'm too thick to find where, in the TxDb or elsewhere, I could
>>> retrieve Hugo IDs for UCSC KnownGene entries without using
>>> org.Hs.egSYMBOL.
>>> The latter is what I usually do:
>>>
>>> library(Homo.sapiens)
>>>
>>> txs <- transcriptsBy(TxDb.Hsapiens.UCSC.hg19.knownGene)
>>> head(names(txs))
>>> ## [1] "1" "10" "100" "1000" "10000" "100008586"
>>>
>>> names(txs) <- mget(names(txs), org.Hs.egSYMBOL, ifnotfound=NA)
>>> head(names(txs))
>>> ## [1] "A1BG" "NAT2" "ADA" "CDH2" "AKT3" "GAGE12F"
>>>
>>> Now, I thought for a while, hell, this gets them all! But, not really...
>>>
>>> txs$NRIP1
>>> ## GRanges with 1 range and 2 metadata columns:
>>> ##     seqnames               ranges strand |     tx_id     tx_name
>>> ##        <Rle>            <IRanges>  <Rle> | <integer> <character>
>>> ## [1]    chr21 [16333556, 16437126]      - |     71301  uc002yjx.2
>>>
>>> Well, that's one of the isoforms. But what about the other ones?
>>>
>>> org.Hs.egUCSCKG[[ "c002yjx.1" ]]
>>> ## NULL
>>>
>>> org.Hs.egUCSCKG[[ "uc010gkz.1" ]]
>>> ## NULL
>>>
>>> I know UCSC identifiers can be a bit of a pain in the ass, but there do
>>> exist mappings for these. If they're going to be used as primary
>>> identifiers for the TxDb packages, would it be possible to update them?
>>>
>>> If it's an issue of time constraints, I will take a stab at it, but that
>>> will almost guarantee more prattling from me on the mailing list. On the
>>> other hand, it might move GAF3.0 annotations out of the station.
>>>
>>> Much obliged for any insights from the core developers.
>>>
>>
>> Hi Tim,
>>
>> So continuing from the other thread...
>>
>> The 1st thing I noticed is that if you try to look up the two known gene
>> IDs that you gave me, you will not have any luck. From using the web
>> service, it seems that they are not actually valid UCSC known gene IDs.
>> At 1st I thought that maybe there had been updates since the last
>> Bioconductor release in October, but pasting these IDs into the UCSC
>> genome browser only led me to this:
>>
>> # Sorry, couldn't locate uc010gkz.1 in genome database
>> # Sorry, couldn't locate c002yjx.1 in genome database
>>
>> So at this point I was a little curious where you actually got these IDs
>> from. (I will actually return to this in a minute.)
>>
>> Anyhow, looking deeper, the website indicates that there is another
>> isoform for NRIP1 (other than "uc002yjx.2"). It is called
>> "uc021whl.1".
>> And it does indeed come up empty handed if you call
>> select like this:
>>
>> select(Homo.sapiens, cols=c("SYMBOL","TXNAME"),
>>        keys=c("uc002yjx.2","uc021whl.1"), keytype="TXNAME")
>>
>> So what happened here? Well, the track data from UCSC doesn't have a
>> gene assigned to that isoform yet. So the DB has no way of knowing that
>> it's connected. Incidentally, this is still true even if you were to
>> download it this morning.
>>
>> So here we have a situation where the UCSC web site has been updated,
>> but their track table (and in particular the table called
>> "knownToLocusLink") is not perfectly in sync with the web site.
>>
>> Even weirder is the fact that if you use the "table browser" to download
>> the "latest" "knownToLocusLink" table (which is yet another service on
>> their web site), you will get a table that has two isoforms (associated
>> with NRIP1) that look very similar to the ones you mentioned before.
>> In fact I am willing to guess that this is where you got these from, and
>> that the shortened one is just a copy-paste typo.
>>
>> So the problem here is that there seem to be three different ways to get
>> the same kind of data from the UCSC genome browser. There is the
>> website/browser. There is rtracklayer, and then there is also the web
>> form access to the table browser (which is what I think you used). AND
>> all three seem to be in disagreement with each other. My suspicion
>> is that some of these are just more up to date than others. But I think
>> that only UCSC will really know which one is most current or why they
>> seem to disagree.
>>
>> I have CC'd Michael, who maintains the excellent rtracklayer package, in
>> case he has some insight.
>>
>> Marc
>>

-- 
*A model is a lie that helps you see the truth.*

Howard Skipper <http://cancerres.aacrjournals.org/content/31/9/1173.full.pdf>

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel