Thanks for the idea. I think this example runs fast mainly because of the limited number of matches: for each fdf1$chr value there are only 2 potential matches in fdf2. In reality there are only 24 possible values for chr (1-22, X, and Y). When I replace the chr sequence with more realistic values, I run out of memory. I'll try it out on our server and let you know how it goes. Fast but memory-intensive will get me over the hump for now.
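(Side note: since Bioconductor came up in the original question quoted below, here is a minimal sketch of an interval-overlap lookup with the GenomicRanges package. This assumes the package is installed and is not what the thread itself uses; the variable names are only for illustration. findOverlaps() matches on seqnames (chromosome) and uses an interval-tree style index over positions, so it never materializes the chromosome-wise cross join that merge() does.)

library(GenomicRanges)                     # Bioconductor package; assumed installed

## snps and region as defined in the quoted example further down
snps.gr   <- GRanges(seqnames = snps$CHR,
                     ranges   = IRanges(start = snps$POS,   end = snps$POS))
region.gr <- GRanges(seqnames = region$CHR,
                     ranges   = IRanges(start = region$START, end = region$STOP))

hits <- findOverlaps(snps.gr, region.gr)   # Hits object: (snp index, region index) pairs
res  <- snps[unique(queryHits(hits)), ]    # keep SNPs that fall in at least one region

Because only the overlapping pairs are ever enumerated, memory use should stay far below that of the full merge on chr.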
fdf1 <- data.frame(chr=sort(sample(seq(1:24), 100000, replace=TRUE)),
                   p=runif(100000), d=sample(100000))
fdf2 <- data.frame(chr=sort(sample(seq(1:24), 200000, replace=TRUE)),
                   s=runif(200000), t=runif(200000))
system.time(with(FDF <- merge(fdf2, fdf1), FDF[s >= p & p >= t,]))

Thanks again,

Brian

-----Original Message-----
From: ila...@gmail.com [mailto:ila...@gmail.com] On Behalf Of ilai
Sent: Wednesday, March 14, 2012 3:26 PM
To: Davis, Brian
Cc: r-help@R-project.org
Subject: Re: [R] Needing a better solution to a lookup problem.

You could try doing it without a loop (.C or other):

(rgnsnp <- merge(region, snps))
(rgnsnp[with(rgnsnp, STOP >= POS & POS >= START),])

Here is my test for merge+search on 100k/200k:

fdf1 <- data.frame(chr=1:100000, p=runif(100000), d=sample(100000))
fdf2 <- data.frame(chr=rep(1:100000, 2), s=runif(200000), t=runif(200000))
system.time(with(FDF <- merge(fdf2, fdf1), FDF[s >= p & p >= t,]))
   user  system elapsed
  2.560   0.152   2.905

Hope this helps

Elai

On Wed, Mar 14, 2012 at 1:27 PM, Davis, Brian <brian.da...@uth.tmc.edu> wrote:
> I have a solution (actually a few) to this problem, but none are
> computationally efficient enough to be useful. I'm hoping someone can
> enlighten me to a better solution.
>
> I have a data frame of chromosome/position pairs (along with other data
> for each location). For each pair I need to determine whether it falls
> within any of the ranges in a second data frame. I need to keep only the
> pairs that are within at least one of the ranges for further processing.
>
> Example:
>
> snps <- NULL
> snps$CHR <- c("1","2","2","3","X")
> snps$POS <- as.integer(c(295,640,670,100,1100))
> snps$DAT <- seq(1:length(snps$CHR))
> snps <- as.data.frame(snps, stringsAsFactors=FALSE)
>
> snps
>   CHR  POS DAT
> 1   1  295   1
> 2   2  640   2
> 3   2  670   3
> 4   3  100   4
> 5   X 1100   5
>
> region <- NULL
> region$CHR <- c("1","1","2","2","2","X")
> region$START <- as.integer(c(10,210,430,650,810,1090))
> region$STOP <- as.integer(c(100,350,630,675,850,1111))
> region <- as.data.frame(region, stringsAsFactors=FALSE)
>
> region
>   CHR START STOP
> 1   1    10  100
> 2   1   210  350
> 3   2   430  630
> 4   2   650  675
> 5   2   810  850
> 6   X  1090 1111
>
> The result I need would look like
>
> Res
>   CHR  POS DAT
>     1  295   1
>     2  670   3
>     X 1100   5
>
> I have a solution that works reasonably well on small sets, but my current
> data set is ~100K snp entries, and my regions table has ~200K entries. I
> have ~1500 files to go through.
>
> I haven't found a good way to solve this problem efficiently. I've tried
> various versions of mapply/lapply, for loops, etc., which get the answer
> for small sets but take hours (per file) on my real data. Bioconductor
> seemed like the obvious place to look, but my GoogleFu must not be that
> great. I never found anything relevant.
>
> Any ideas or pointers in the right direction would be greatly appreciated.
>
> Brian Davis

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.