On Sat, Nov 22, 2014 at 12:53 AM, Karl Stamm <karl.st...@gmail.com> wrote:
> Question regarding gene name conversions. Once upon a time, I was doing a > lot of gene name conversions, particularly from NM_#### to HGNC symbol or > Entrez GeneID. I used bioMaRt successfully, and developed a cache matrix so > I could quickly merge() it instead of calling out to a webservice > repeatedly. Later the complexity of keeping the cache updated became > overwhelming, and carrying around a few megabytes of possibly outdated > identifiers is a bad idea. Per Bioconductor guidelines, I switched to the > built in annotation packages. Now I'm using org.Hs.eg.db's lookup lists > org.Hs.egREFSEQ2EG and org.Hs.egSYMBOL. > > These sometimes map to multiple values and sometimes map to nothing, > causing errors in my code. To clean it up, I wrapped their accessors with > some error checking. Things work again, assigning one human readable name > per transcript ID#. Problem is this method is very slow. I thought it could > be the error checking code, but even trying to streamline that doesn't > help. A profiler showed that most of my time was spent in .Call, actually > it turns out each access to the "list" like this org.Hs.egSYMBOL[[eg]][1] > was calling a sqlite query. Since I am nesting these calls in a loop, (NM > to EG to HGNC, a few thousands of times), these copious calls out to sqlite > are killing me. > Hi, Karl. It is a little hard to diagnose problems without code, but here is a little code to get a sense of how I might accomplish the task you are describing. I include timing information. If this isn't a representative workflow, perhaps you can show us some code and timing information. Sean > # Get all human refseq accessions > refseqs = keys(org.Hs.eg.db,keytype="REFSEQ") > # Time the lookup for symbol and entrez ID > system.time((symbols=select(org.Hs.eg.db,keytype="REFSEQ", + keys=refseqs, + columns=c('REFSEQ','SYMBOL','ENTREZID')))) user system elapsed 2.170 0.071 2.259 > head(symbols) REFSEQ SYMBOL ENTREZID 1 NM_130786 A1BG 1 2 NP_570602 A1BG 1 3 NM_000014 A2M 2 4 NP_000005 A2M 2 5 XM_006719056 A2M 2 6 XP_006719119 A2M 2 > I need a way to batch query, or preload to memory these lookup tables. I > tried using a hash, but checking if a value is already loaded into the > hash-cache is equally time consuming; and preloading the whole of > org.Hs.eg.db takes a few hours. I could do it once, and cache the .RData > object, but we're back to the local-outdated cache problem. > > So I think the only solution would be to access the sqlite underlying the > org.Hs.eg.db myself, so I can use the batch query. Except that db is hidden > under the R/API of these Anno-BiMap objects like org.Hs.egSYMBOL. > > I assume this problem has been handled before, and ask for your guidance. > > Thanks > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioc-devel@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/bioc-devel > [[alternative HTML version deleted]] _______________________________________________ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel