Hi Taku,

This 'error' is not due to anything in the illuminaHumanv4.db package. All that package does is link the probe IDs to Entrez Gene IDs, and then the org.Hs.eg.db package does the remainder of the annotation. So if we look at org.Hs.eg.db, we get this:

> select(org.Hs.eg.db, c("C16ORF15","C16orf15","C15orf16"), c("ENTREZID","SYMBOL","GENENAME"), "ALIAS")
     ALIAS ENTREZID SYMBOL                 GENENAME
1 C16ORF15   161725 OTUD7A OTU domain containing 7A
2 C16orf15   197335  WDR90      WD repeat domain 90
3 C15orf16   161725 OTUD7A OTU domain containing 7A
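
To make the two-step lookup concrete, here is a minimal base R sketch. It does not use the annotation packages themselves: the tables are toy data frames built only from the IDs quoted in this thread, the probe-to-Entrez rows are inferred from the symbols in the original report, and probes_for_alias() is a hypothetical helper, not a function from any package.

```r
# Toy stand-ins for the two mapping steps. The real packages store these
# mappings in SQLite databases; only IDs quoted in this thread are used here.
probe2entrez <- data.frame(
  PROBEID  = c("ILMN_1718060", "ILMN_1785146", "ILMN_2298160",
               "ILMN_1693042", "ILMN_1698185"),
  # Assumption: Entrez IDs inferred from the symbols reported for each probe
  ENTREZID = c("161725", "161725", "161725", "197335", "197335"),
  stringsAsFactors = FALSE
)
alias2entrez <- data.frame(
  ALIAS    = c("C16ORF15", "C15orf16", "C16orf15"),
  ENTREZID = c("161725", "161725", "197335"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: alias -> Entrez Gene ID -> probe IDs
probes_for_alias <- function(alias) {
  ids <- alias2entrez$ENTREZID[alias2entrez$ALIAS == alias]
  probe2entrez$PROBEID[probe2entrez$ENTREZID %in% ids]
}

probes_for_alias("C16ORF15")  # the OTUD7A probes
probes_for_alias("C16orf15")  # the WDR90 probes
```

Because the alias is resolved to an Entrez Gene ID first, any alias that NCBI attaches to Gene ID 161725 ends up at the same probes, whatever its spelling suggests.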


And if we go to NCBI and search the Gene database, we get (in order):

Gene ID 161725

Official Symbol
   OTUD7A (provided by HGNC <http://www.genenames.org/>)
Official Full Name
   OTU deubiquitinase 7A (provided by HGNC <http://www.genenames.org/>)
Primary source
   HGNC:20718 <http://www.genenames.org/data/hgnc_data.php?hgnc_id=20718>
See related
   Ensembl:ENSG00000169918 <http://www.ensembl.org/id/ENSG00000169918>;
   HPRD:12666 <http://www.hprd.org/protein/12666>;
   MIM:612024 <http://www.ncbi.nlm.nih.gov/omim/612024>;
   Vega:OTTHUMG00000129275 <http://vega.sanger.ac.uk/id/OTTHUMG00000129275>
Gene type
   protein coding
RefSeq status
   PROVISIONAL
Organism
   Homo sapiens <https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=9606>
Lineage
   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
   Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
   Catarrhini; Hominidae; Homo
Also known as
   OTUD7; C15orf16; C16ORF15; CEZANNE2

And

Gene ID 197335

Official Symbol
   WDR90 (provided by HGNC <http://www.genenames.org/>)
Official Full Name
   WD repeat domain 90 (provided by HGNC <http://www.genenames.org/>)
Primary source
   HGNC:26960 <http://www.genenames.org/data/hgnc_data.php?hgnc_id=26960>
See related
   Ensembl:ENSG00000161996 <http://www.ensembl.org/id/ENSG00000161996>;
   HPRD:08311 <http://www.hprd.org/protein/08311>;
   HPRD:14118 <http://www.hprd.org/protein/14118>;
   Vega:OTTHUMG00000048040 <http://vega.sanger.ac.uk/id/OTTHUMG00000048040>
Gene type
   protein coding
RefSeq status
   PROVISIONAL
Organism
   Homo sapiens <https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=9606>
Lineage
   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
   Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
   Catarrhini; Hominidae; Homo
Also known as
   C16orf15; C16orf16; C16orf17; C16orf18; C16orf19


So what is in the org.Hs.eg.db package conforms exactly to the data from NCBI. Please note that the annotation packages supplied by Bioconductor are simply reformulations of data we get from sources like NCBI. We try our best to ensure that the information you get from a given annotation package conforms exactly to what you would get by going to the NCBI website and searching by hand, but we do NOT make any claims as to the accuracy of the data on the NCBI website.

And there have been any number of emails on this list by Marc Carlson explaining to people that HGNC symbols, and especially other random aliases, are not unique and should not be relied upon for annotating data accurately. So yeah, don't do that.
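
As a rough illustration of why aliases make poor keys, the base R sketch below folds the aliases from the two NCBI records above to upper case and reports any alias that then points at more than one Entrez Gene ID. The data frame is a toy table typed in from those records, and the variable names are made up for the example.

```r
# Aliases copied from the "Also known as" lines of the two NCBI records.
aliases <- data.frame(
  ALIAS    = c("OTUD7", "C15orf16", "C16ORF15", "CEZANNE2",
               "C16orf15", "C16orf16", "C16orf17", "C16orf18", "C16orf19"),
  ENTREZID = c(rep("161725", 4), rep("197335", 5)),
  stringsAsFactors = FALSE
)

# Group Entrez Gene IDs by upper-cased alias, then keep only the groups
# that contain more than one distinct gene.
by_upper  <- split(aliases$ENTREZID, toupper(aliases$ALIAS))
ambiguous <- Filter(function(ids) length(unique(ids)) > 1, by_upper)

names(ambiguous)  # "C16ORF15"
ambiguous         # both 161725 and 197335 sit behind that one name
```

In this toy table each Entrez Gene ID identifies exactly one gene, while the alias does not once case is ignored.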

Best,

Jim


On 3/27/2014 6:11 AM, Taku Tokuyasu wrote:
Hello Mark,

I'm writing to report an apparent error in the illuminaHumanv4.db package,
version 1.20.0.  Specifically, the mapping for "C16ORF15" in ALIAS2PROBE
appears to be incorrect.  Below is an R code snippet:

library("illuminaHumanv4.db")
packageVersion("illuminaHumanv4.db")
# [1] '1.20.0'
# http://www.bioconductor.org/packages/release/data/annotation/html/illuminaHumanv4.db.html

# Define some mappings
xxAP <- as.list(illuminaHumanv4ALIAS2PROBE)
xxS <- as.list(illuminaHumanv4SYMBOL)

# Compare these two:
xxS[xxAP[["C16ORF15"]]]
xxS[xxAP[["C16orf15"]]]

# I get:
# > xxS[xxAP[["C16ORF15"]]]
# $ILMN_1718060
# [1] "OTUD7A"
# $ILMN_1785146
# [1] "OTUD7A"
# $ILMN_2298160
# [1] "OTUD7A"
#
# > xxS[xxAP[["C16orf15"]]]
# $ILMN_1693042
# [1] "WDR90"
# $ILMN_1698185
# [1] "WDR90"

According to HGNC (via DuckDuckGo):
OTUD7A (OTU domain containing 7A)
Protein-coding gene on human chromosome 15q13.1, also known as *C15orf16*,
CEZANNE2, OTU domain containing 7, OTUD7, chromosome 15 open reading frame
16.

WDR90 (WD repeat domain 90)
Protein-coding gene on human chromosome 16p13.3, also known as *C16orf15*,
C16orf16, C16orf17, C16orf18, C16orf19, FLJ36483, KIAA1924, chromosome 16
open reading frame 15, chromosome 16 open reading frame 16, chromosome 16
open reading frame 17, chromosome 16 open reading frame 18, chromosome 16
open reading frame 19.

So it appears the ALIAS2PROBE mapping for C16ORF15 is actually for
C15orf16.  Indeed,
all.equal(xxAP[["C16ORF15"]], xxAP[["C15orf16"]])
# [1] TRUE

Some questions:
1) Why is there a mapping for both C16orf15 and C16ORF15?
2) Can you make the names for mappings like ALIAS2PROBE all upper case?
Perhaps there is a Bioconductor annotation convention that prevents this?
3) I also noticed:
nms <- names(xxAP)
length(nms); length(unique(nms)); length(unique(toupper(nms)))
# [1] 99696
# [1] 99696
# [1] 99378
Is there potentially a problem with the 300-odd names that are no longer
unique when raised to upper case?

Regards,

_Taku

  Taku A. Tokuyasu, PhD
Computational Biology Core
UCSF Helen Diller Family Comprehensive Cancer Center


_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

--
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099
