Thanks everyone! I wasn't familiar with the select() function; it appears to come from the AnnotationDbi package and isn't referenced by the examples for org.Hs.eg.db, where I was looking. Those examples, for objects like org.Hs.egREFSEQ, call the list accessor or a vector accessor, and always for the full map:

mapped_seqs <- mappedkeys(x)
xx <- as.list(x[mapped_seqs])
xx[[1]]
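For comparison, here is roughly what the two access styles look like for a single ID (just a sketch; NM_000014 is borrowed from Sean's output below, and the variable names are mine):

library(org.Hs.eg.db)

## bimap accessor style, as in the org.Hs.eg.db man pages;
## each [[ ]] access issues its own sqlite query
eg  <- org.Hs.egREFSEQ2EG[["NM_000014"]][1]
sym <- org.Hs.egSYMBOL[[eg]][1]

## select() style from AnnotationDbi: a single batched query,
## and 'keys' can just as well be a long vector of IDs
select(org.Hs.eg.db, keys = "NM_000014", keytype = "REFSEQ",
       columns = c("SYMBOL", "ENTREZID"))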
Following that pattern, I had developed a big, slow DisplayName() with tryCatch() for missing values and a double lookup to go from RefSeq to Entrez Gene to HGNC symbol. I've now reimplemented it following your advice (thanks Sean Davis and Vincent Carey). Examples with benchmarking are posted here: http://pastebin.com/HjNbtD1g

Calling select() with a short vector of RefSeq IDs takes a similar amount of time, between half a second and a full second for all three methods, so for small queries it's not an improvement, and since my use case is many calls for a few IDs each, that is still a problem. However, my method loops over the input IDs and queries one at a time, so its runtime is linear in the number of IDs, whereas the select() methods are nearly constant time; calling for fifty IDs is therefore MUCH faster. Querying the whole database, as Sean showed, only takes a few seconds.

That makes pre-caching practical. I had tried to pre-cache before, but with the linear algorithm it needed more than an hour at package load time. A fourth version of the function, which spends about 3 seconds in select() to fill a hash(), subsequently blows the other methods out of the water. For example, pulling 50 genes took 7 seconds with my slow method, 0.8 seconds via select(org.Hs.eg.db), 1.6 seconds via select(Homo.sapiens), and 0.003 seconds via the hash() (microbenchmark median of 30 evaluations).

Thanks to your help I can fill the hash in a reasonable time and build it at package load. Next step: I have to learn how to use the package's .onLoad() hook.

On Sat, Nov 22, 2014 at 5:32 AM, Sean Davis <seand...@gmail.com> wrote:
>
> On Sat, Nov 22, 2014 at 12:53 AM, Karl Stamm <karl.st...@gmail.com> wrote:
>
>> Question regarding gene name conversions. Once upon a time, I was doing a
>> lot of gene name conversions, particularly from NM_#### to HGNC symbol or
>> Entrez GeneID. I used biomaRt successfully, and developed a cache matrix
>> so I could quickly merge() against it instead of calling out to a web
>> service repeatedly. Later the complexity of keeping the cache updated
>> became overwhelming, and carrying around a few megabytes of possibly
>> outdated identifiers is a bad idea. Per Bioconductor guidelines, I
>> switched to the built-in annotation packages. Now I'm using
>> org.Hs.eg.db's lookup lists org.Hs.egREFSEQ2EG and org.Hs.egSYMBOL.
>>
>> These sometimes map to multiple values and sometimes map to nothing,
>> causing errors in my code. To clean it up, I wrapped their accessors with
>> some error checking. Things work again, assigning one human-readable name
>> per transcript ID. The problem is that this method is very slow. I
>> thought it could be the error-checking code, but even streamlining that
>> doesn't help. A profiler showed that most of my time was spent in .Call;
>> it turns out that each access to the "list", like
>> org.Hs.egSYMBOL[[eg]][1], issues a sqlite query. Since I am nesting these
>> calls in a loop (NM to EG to HGNC, a few thousand times), these copious
>> calls out to sqlite are killing me.
>>
>
> Hi, Karl.
>
> It is a little hard to diagnose problems without code, but here is a
> little code to get a sense of how I might accomplish the task you are
> describing. I include timing information. If this isn't a representative
> workflow, perhaps you can show us some code and timing information.
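In case it helps anyone searching the archives later, the fourth version is essentially the following (a rough sketch of the approach rather than my exact package code; the function names are placeholders, and here I'm using a plain environment as the hash instead of the hash package):

library(org.Hs.eg.db)

## environment used as a hash: RefSeq ID -> HGNC symbol
.refseq2symbol <- new.env(parent = emptyenv())

.fillCache <- function() {
    refseqs <- keys(org.Hs.eg.db, keytype = "REFSEQ")
    map <- select(org.Hs.eg.db, keys = refseqs, keytype = "REFSEQ",
                  columns = "SYMBOL")
    ## keep the first symbol per RefSeq ID so the lookup is one-to-one
    map <- map[!duplicated(map$REFSEQ), ]
    list2env(setNames(as.list(map$SYMBOL), map$REFSEQ),
             envir = .refseq2symbol)
}

## fast replacement for the old one-ID-at-a-time DisplayName()
displayName <- function(ids) {
    unname(unlist(mget(ids, envir = .refseq2symbol, ifnotfound = NA)))
}

## in a package, the cache would be filled once at load time
.onLoad <- function(libname, pkgname) {
    .fillCache()
}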
>
> Sean
>
> > # Get all human refseq accessions
> > refseqs = keys(org.Hs.eg.db, keytype="REFSEQ")
> > # Time the lookup for symbol and entrez ID
> > system.time((symbols = select(org.Hs.eg.db, keytype="REFSEQ",
> +                               keys=refseqs,
> +                               columns=c('REFSEQ','SYMBOL','ENTREZID'))))
>    user  system elapsed
>   2.170   0.071   2.259
> > head(symbols)
>         REFSEQ SYMBOL ENTREZID
> 1    NM_130786   A1BG        1
> 2    NP_570602   A1BG        1
> 3    NM_000014    A2M        2
> 4    NP_000005    A2M        2
> 5 XM_006719056    A2M        2
> 6 XP_006719119    A2M        2
>
>> I need a way to batch-query these lookup tables, or preload them into
>> memory. I tried using a hash, but checking whether a value is already
>> loaded into the hash-cache is equally time consuming, and preloading the
>> whole of org.Hs.eg.db takes a few hours. I could do it once and cache the
>> .RData object, but then we're back to the local-outdated-cache problem.
>>
>> So I think the only solution would be to access the sqlite database
>> underlying org.Hs.eg.db myself, so I can use a batch query. Except that
>> the db is hidden under the R API of these AnnDbBimap objects like
>> org.Hs.egSYMBOL.
>>
>> I assume this problem has been handled before, and ask for your guidance.
>>
>> Thanks
>

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel