On Mon, Nov 24, 2014 at 12:02 PM, Karl Stamm <karl.st...@gmail.com> wrote:
> Thanks everyone! I wasn't familiar with the select() function; it appears
> to be specific to the AnnotationDbi package, and it isn't referenced by the
> examples of org.Hs.eg.db, where I was looking. Those examples, for things
> like org.Hs.egREFSEQ, call the list accessor or a vector accessor, and
> always for the full list:
>
>   mapped_seqs <- mappedkeys(x)
>   xx <- as.list(x[mapped_seqs])
>   xx[[1]]
>
> Following that pattern, I developed a big, slow DisplayName() with tryCatch
> for null values, and a double lookup to go from RefSeq to EG to HGNC.
>
> I've reimplemented following your advice (thanks, Sean Davis and Vincent
> Carey). Examples with benchmarking are posted here:
> http://pastebin.com/HjNbtD1g
>
> Calling select() with a short vector of RefSeq IDs takes a similar amount
> of time, between half a second and a full second for all three methods.
> It's not an improvement. As my use case is lots of calls for a few IDs,
> this is still a problem.
>
> However, my method looped over the input IDs and called one at a time, so
> runtime is linear, whereas the select() methods are nearly constant time;
> calling for fifty IDs is therefore MUCH faster. Querying the whole
> database, as Sean showed, only takes a few seconds. This becomes an
> opportunity to pre-cache, as I tried to do, but with the linear algorithm
> it needed more than an hour at package load time.
>
> A fourth version of the function, which uses select() to spend 3 seconds
> filling a hash(), subsequently blows the other methods out of the water.
> For example, pulling 50 genes with my slow method took 7 seconds, by
> select(org.Hs.eg.db) 0.8 seconds, by select(Homo.sapiens) 1.6 seconds, and
> by hash() 0.003 seconds (microbenchmark median of 30 evals).

Hi, Karl. You might consider adding a list cache mechanism as well. The hash() is based on environments, if I recall. Last I checked, lists could be faster for smaller datasets.
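The list-versus-environment trade-off mentioned here can be sketched in base R alone, with no Bioconductor dependency. In this sketch the toy map and the cache names (cache_env, cache_list) are illustrative; in real use the map would come from one select() call against org.Hs.eg.db.

```r
## Toy stand-in for a REFSEQ -> SYMBOL map; in practice this would be built
## once from select(org.Hs.eg.db, keys=..., keytype="REFSEQ",
## columns=c("REFSEQ","SYMBOL")).
map <- c(NM_130786 = "A1BG", NM_000014 = "A2M", NM_000015 = "NAT2")

## Environment-backed cache (what hash() uses internally): constant-time
## lookups regardless of size.
cache_env <- new.env(hash = TRUE, parent = emptyenv())
for (k in names(map)) assign(k, map[[k]], envir = cache_env)

## Plain named-list cache: simpler, and can be faster for small maps.
cache_list <- as.list(map)

## Lookups
get("NM_000014", envir = cache_env)   # "A2M"
cache_list[["NM_000014"]]             # "A2M"

## Missing keys: exists() on the environment, NULL result on the list.
exists("NM_999999", envir = cache_env)   # FALSE
is.null(cache_list[["NM_999999"]])       # TRUE
```

Either cache turns the repeated per-id SQLite round trips into in-memory lookups; the environment wins once the map has more than a few thousand entries.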
Sean

> Thanks to your help I can fill the hash in a reasonable time and save it
> on package load. Next step, I have to learn how to use the package
> onLoad() hook.
>
> On Sat, Nov 22, 2014 at 5:32 AM, Sean Davis <seand...@gmail.com> wrote:
>
>> On Sat, Nov 22, 2014 at 12:53 AM, Karl Stamm <karl.st...@gmail.com> wrote:
>>
>>> A question regarding gene name conversions. Once upon a time, I was
>>> doing a lot of gene name conversions, particularly from NM_#### to HGNC
>>> symbol or Entrez GeneID. I used biomaRt successfully and developed a
>>> cache matrix so I could quickly merge() against it instead of calling
>>> out to a web service repeatedly. Later, the complexity of keeping the
>>> cache updated became overwhelming, and carrying around a few megabytes
>>> of possibly outdated identifiers is a bad idea. Per Bioconductor
>>> guidelines, I switched to the built-in annotation packages. Now I'm
>>> using org.Hs.eg.db's lookup lists org.Hs.egREFSEQ2EG and org.Hs.egSYMBOL.
>>>
>>> These sometimes map to multiple values and sometimes map to nothing,
>>> causing errors in my code. To clean it up, I wrapped their accessors
>>> with some error checking. Things work again, assigning one
>>> human-readable name per transcript ID. The problem is that this method
>>> is very slow. I thought it could be the error-checking code, but even
>>> trying to streamline that doesn't help. A profiler showed that most of
>>> my time was spent in .Call; it turns out each access to the "list",
>>> like org.Hs.egSYMBOL[[eg]][1], was issuing an SQLite query. Since I am
>>> nesting these calls in a loop (NM to EG to HGNC, a few thousand times),
>>> these copious calls out to SQLite are killing me.
>>
>> Hi, Karl.
>>
>> It is a little hard to diagnose problems without code, but here is a
>> little code to give a sense of how I might accomplish the task you are
>> describing. I include timing information.
>> If this isn't a representative workflow, perhaps you can show us some
>> code and timing information.
>>
>> Sean
>>
>> > # Get all human refseq accessions
>> > refseqs = keys(org.Hs.eg.db, keytype="REFSEQ")
>> > # Time the lookup for symbol and entrez ID
>> > system.time((symbols=select(org.Hs.eg.db, keytype="REFSEQ",
>> +                             keys=refseqs,
>> +                             columns=c('REFSEQ','SYMBOL','ENTREZID'))))
>>    user  system elapsed
>>   2.170   0.071   2.259
>> > head(symbols)
>>         REFSEQ SYMBOL ENTREZID
>> 1    NM_130786   A1BG        1
>> 2    NP_570602   A1BG        1
>> 3    NM_000014    A2M        2
>> 4    NP_000005    A2M        2
>> 5 XM_006719056    A2M        2
>> 6 XP_006719119    A2M        2
>>
>>> I need a way to batch-query, or preload into memory, these lookup
>>> tables. I tried using a hash, but checking whether a value is already
>>> loaded into the hash cache is equally time-consuming, and preloading
>>> the whole of org.Hs.eg.db takes a few hours. I could do it once and
>>> cache the .RData object, but then we're back to the local-outdated-cache
>>> problem.
>>>
>>> So I think the only solution would be to access the SQLite database
>>> underlying org.Hs.eg.db myself, so I can use a batch query. Except that
>>> db is hidden under the R API of these Bimap objects like org.Hs.egSYMBOL.
>>>
>>> I assume this problem has been handled before, and I ask for your
>>> guidance.
>>>
>>> Thanks

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel
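For reference, the load-time cache fill Karl plans is conventionally wired up through a .onLoad hook (typically in the package's zzz.R). A minimal sketch follows; .symbolCache and displayName are illustrative names, not an existing API, and the select() call assumes org.Hs.eg.db is installed (the guard lets the package load without it).

```r
## zzz.R -- sketch: fill a lookup cache once, at package load time.
## .symbolCache and displayName are hypothetical names for this sketch.

.symbolCache <- new.env(hash = TRUE, parent = emptyenv())

.onLoad <- function(libname, pkgname) {
  if (requireNamespace("org.Hs.eg.db", quietly = TRUE) &&
      requireNamespace("AnnotationDbi", quietly = TRUE)) {
    db <- org.Hs.eg.db::org.Hs.eg.db
    refseqs <- AnnotationDbi::keys(db, keytype = "REFSEQ")
    tab <- AnnotationDbi::select(db, keys = refseqs, keytype = "REFSEQ",
                                 columns = c("REFSEQ", "SYMBOL"))
    tab <- tab[!duplicated(tab$REFSEQ), ]   # keep one symbol per RefSeq id
    list2env(setNames(as.list(tab$SYMBOL), tab$REFSEQ),
             envir = .symbolCache)
  }
}

## Fast replacement for the per-id lookups: one environment access per call,
## NA for ids that map to nothing.
displayName <- function(id) {
  if (exists(id, envir = .symbolCache, inherits = FALSE))
    get(id, envir = .symbolCache, inherits = FALSE)
  else
    NA_character_
}
```

This pays the few-second select() cost once per R session instead of once per lookup, and avoids shipping a cached .RData that can drift out of date relative to the annotation package.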