Hi Sonali, Thanks for reporting this. I’ve moved away from Entrez IDs for refseq with newer genome versions, as I think goseq should take the refs symbols directly. Documentation still needs to be updated for this.
It’s concerning that you’re seeing this behaviour for hg19, so I’ll look into it. A work around is to use the Entrez IDs and provide goseq with the flag “knownGene” rather than “refGene”. I’m currently on leave until January, so a fix might be slow. Cheers, Nadia. > On 10 Oct 2015, at 8:15 am, Arora, Sonali <[email protected]> wrote: > > Hi everyone, > > I am using goseq to perform Gene Ontology analysis of RNA-seq data and I > think I found the following bug. > > Since my RNA-seq data was aligned using Refseq table - so I need to > supply id as "refGene" which are actually EntrezId's from > goseq::supportedGeneIDs() > > The bug is - that the internal function(getgo) is expecting refseq > identifiers and the external function (goseq) is expecting entrez id's.. > > When I use refGene as id ( which is actually an entrez gene ids). I get > the following error - > >> pwf = nullp(genes, "hg19", "refGene") > Loading hg19 length data... > Warning message: > In pcls(G) : initial point very close to some inequality constraints >> GO.wall = goseq(pwf, "hg19", "refGene") > Fetching GO annotations... > Error in names(tmp) = rep(names(map), times = as.numeric(summary(map)[, : > attempt to set an attribute on NULL > > > After some debugging, I find that internal function getgo() returns a > vector of NULL's using "Entrez Gene Id" as id. > >> go = getgo(names(genes), "hg19", "refGene") >> head(go) > $<NA> > NULL > > $<NA> > NULL > > $<NA> > NULL > > $<NA> > NULL > > $<NA> > NULL > > $<NA> > NULL > > The last line of the function goseq::getgo() seems to be the issue.. > > Browse[2]> head(names(user2cat)) > [1] "NM_000014" "NM_000015" "NM_000016" "NM_000017" "NM_000018" "NM_000019" > Browse[2]> head(genes) > [1] "1" "100" "1000" "10000" "10001" "100037417" > Browse[2]> head(user2cat[toupper(genes)]) > $<NA> > NULL > > $<NA> > NULL > > $<NA> > NULL > > $<NA> > NULL > > $<NA> > NULL > > $<NA> > NULL > > If I give the getgo() function refseq identifiers instead of entrez id, > then the internal function getgo works - but the nullp function fails as > it is still expecting entrez gene id. > >> test_genes <- c(0, 0,1,0,0,0) >> names(test_genes) =c("NM_000014", "NM_000015", "NM_000016", "NM_000017", > + "NM_000018", "NM_000019") >> go = getgo(names(test_genes), "hg19", "refGene") >> names(go) > [1] "NM_000014" "NM_000015" "NM_000016" "NM_000017" "NM_000018" "NM_000019" >> pwf = nullp(test_genes, "hg19", "refGene") > Loading hg19 length data... > Error in getlength(names(DEgenes), genome, id) : > The gene names specified do not match the gene names for genome hg19 > and ID refGene. > Gene names given were: NM_000014, NM_000015, NM_000016, NM_000017, > NM_000018, NM_000019, NA, NA, NA, NA > Required gene names are: 84251, 113451, 25932, 55707, 55707, 84871, > 50651, 55707, 7049, 1629 > > A reproducible example - > > # case - 1 > test_genes <- c(0, 0,1,0,0,0) > names(test_genes) <- c(1,100,1000, 10000, 10001, 100037417) > go = getgo(names(test_genes), "hg19", "refGene") > > # case -2 > names(test_genes) =c("NM_000014", "NM_000015", "NM_000016", "NM_000017", > "NM_000018", "NM_000019") > go = getgo(names(test_genes), "hg19", "refGene") > pwf = nullp(test_genes, "hg19", "refGene") > > >> sessionInfo() > R version 3.2.2 RC (2015-08-09 r68965) > Platform: x86_64-w64-mingw32/x64 (64-bit) > Running under: Windows 7 x64 (build 7601) Service Pack 1 > > locale: > [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United > States.1252 LC_MONETARY=English_United States.1252 LC_NUMERIC=C > [5] LC_TIME=English_United States.1252 > > attached base packages: > [1] parallel stats4 stats graphics grDevices utils datasets > methods base > > other attached packages: > [1] org.Hs.eg.db_3.2.2 AnnotationDbi_1.31.19 IRanges_2.3.24 > S4Vectors_0.7.22 Biobase_2.29.1 BiocGenerics_0.15.10 > goseq_1.21.1 RSQLite_1.0.0 > [9] DBI_0.3.1 geneLenDataBase_1.5.0 BiasedUrn_1.06.1 > > loaded via a namespace (and not attached): > [1] XVector_0.9.4 GenomicRanges_1.21.30 > zlibbioc_1.15.0 GenomicAlignments_1.5.18 > BiocParallel_1.3.54 lattice_0.20-33 > [7] GenomeInfoDb_1.5.16 tools_3.2.2 grid_3.2.2 > SummarizedExperiment_0.3.9 nlme_3.1-122 mgcv_1.8-7 > [13] lambda.r_1.1.7 futile.logger_1.4.1 > Matrix_1.2-2 rtracklayer_1.29.28 > futile.options_1.0.0 bitops_1.0-6 > [19] RCurl_1.95-4.7 biomaRt_2.25.3 > GO.db_3.2.2 GenomicFeatures_1.21.31 > Biostrings_2.37.8 Rsamtools_1.21.20 > [25] XML_3.98-1.3 > > Please advise. > > Thanks and Regards, > Sonali. > > ______________________________________________________________________ > This email has been scanned by the Symantec Email Security.cloud service. > For more information please visit http://www.symanteccloud.com > > If you have any questions, please contact MCRI IT Servicedesk for further > assistance. > ______________________________________________________________________ ______________________________________________________________________ This email has been scanned by the Symantec Email Security.cloud service. For more information please visit http://www.symanteccloud.com ______________________________________________________________________ _______________________________________________ [email protected] mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
