Martin, Thank you for your exploration of implementations of bsearch!
In my application, length(val) is very small (typically 2), so vectorization over val doesn't help -- though vectorization over tab could work by doing n-ary instead of 2-ary splits with something like match(TRUE, val < tab[L+(H-L)*(1:9/10)]) and (when H-L becomes small) match(TRUE,val < tab[L:H]) Then there are approaches like tries... but though I love this sort of programming, I'm trying to reuse as much well-tested, well-tuned library code as I can. Thanks again for your ideas! -s On Wed, Apr 6, 2011 at 12:59, Martin Morgan <mtmor...@fhcrc.org> wrote: > On 04/04/2011 01:50 PM, William Dunlap wrote: > >> -----Original Message----- >>> From: r-help-boun...@r-project.org >>> [mailto:r-help-boun...@r-project.org] On Behalf Of Stavros Macrakis >>> Sent: Monday, April 04, 2011 1:15 PM >>> To: r-help >>> Subject: [R] General binary search? >>> >>> Is there a generic binary search routine in a standard library which >>> >>> a) works for character vectors >>> b) runs in O(log(N)) time? >>> >>> I'm aware of findInterval(x,vec), but it is restricted to >>> numeric vectors. >>> >> >> xtfrm(x) will convert a character (or other) vector to >> a numeric vector with the same ordering. findInterval >> can work on that. E.g., >> > f0<- function(x, vec) { >> tmp<- xtfrm(c(x, vec)) >> findInterval(tmp[seq_along(x)], tmp[-seq_along(x)]) >> } >> > f0(c("Baby", "Aunt", "Dog"), LETTERS) >> [1] 2 1 4 >> I've never looked at its speed. >> > > For a little progress (though no 'generic binary searchin a standard > library'), here's the 'one-liner' > > bsearch1 <- > function(val, tab, L=1L, H=length(tab)) > { > while (H >= L) { > M <- L + (H - L) %/% 2L > if (tab[M] > val) H <- M - 1L > else if (tab[M] < val) L <- M + 1L > else return(M) > } > return(L - 1L) > } > > It seems like a good candidate for the new (R-2.13) 'compiler' package, so > > library(compiler) > bsearch2 <- cmpfun(bsearch1) > > And Bill's suggestion > > bsearch3 <- function(val, tab) { > tmp <- xtfrm(c(val, tab)) > findInterval(tmp[seq_along(val)], tmp[-seq_along(val)]) > } > > which will work best when 'val' is a vector to be looked up. > > A quick look at data.table:::sortedmatch seemed to return matches, whereas > Stavros is looking for lower bounds. > > It seems that one could shift the weight more to C code by 'vectorizing' > the one-liner, first as > > bsearch5 <- > function(val, tab, L=1L, H=length(tab)) > { > b <- cbind(L=rep(L, length(val)), H=rep(H, length(val))) > i0 <- seq_along(val) > repeat { > M <- b[i0,"L"] + (b[i0,"H"] - b[i0,"L"]) %/% 2L > i <- tab[M] > val[i0] > b[i0 + i * length(val)] <- > ifelse(i, M - 1L, ifelse(tab[M] < val[i0], M + 1L, M)) > i0 <- which(b[i0, "H"] >= b[i0, "L"]) > if (!length(i0)) break; > } > b[,"L"] - 1L > } > > and then a little more thoughtfully (though more room for improvement) as > > bsearch7 <- > function(val, tab, L=1L, H=length(tab)) > { > b <- cbind(L=rep(L, length(val)), H=rep(H, length(val))) > i0 <- seq_along(val) > repeat { > updt <- M <- b[i0,"L"] + (b[i0,"H"] - b[i0,"L"]) %/% 2L > tabM <- tab[M] > val0 <- val[i0] > i <- tabM < val0 > updt[i] <- M[i] + 1L > i <- tabM > val0 > updt[i] <- M[i] - 1L > b[i0 + i * length(val)] <- updt > i0 <- which(b[i0, "H"] >= b[i0, "L"]) > if (!length(i0)) break; > } > b[,"L"] - 1L > } > > none of bsearch 3, 5, or 7 is likely to benefit substantially from > compilation. > > Here's a little test data set converting numeric to character as an easy > cheat. > > set.seed(123L) > x <- sort(as.character(rnorm(1e6))) > y <- as.character(rnorm(1e4)) > > There seems to be some significant initial overhead, so we warm things up > (and also introduce the paradigm for multiple look-ups in bsearch 1, 2) > > warmup <- function(y, x) { > lapply(y, bsearch1, x) > lapply(y, bsearch2, x) > bsearch3(y, x) > bsearch5(y, x) > bsearch7(y, x) > } > replicate(3, warmup(y, x)) > > and then time > > > system.time(res1 <- unlist(lapply(y, bsearch1, x), use.names=FALSE)) > user system elapsed > 2.692 0.000 2.696 > > system.time(res2 <- unlist(lapply(y, bsearch2, x), use.names=FALSE)) > user system elapsed > 1.379 0.000 1.380 > > identical(res1, res2) > [1] TRUE > > system.time(res3 <- bsearch3(y, x)); identical(res1, res3) > user system elapsed > 8.339 0.001 8.350 > [1] TRUE > > system.time(res5 <- bsearch5(y, x)); identical(res1, res5) > user system elapsed > 0.700 0.000 0.702 > [1] TRUE > > system.time(res7 <- bsearch7(y, x)); identical(res1, res7) > user system elapsed > 0.222 0.000 0.222 > [1] TRUE > > Martin > > > >> Bill Dunlap >> Spotfire, TIBCO Software >> wdunlap tibco.com >> >> >>> I'm also aware of various hashing solutions (e.g. >>> new.env(hash=TRUE) and >>> fastmatch), but I need the greatest-lower-bound match in my >>> application. >>> >>> findInterval is also slow for large N=length(vec) because of the O(N) >>> checking it does, as Duncan Murdoch has pointed >>> out<https://stat.ethz.ch/pipermail/r-help/2008-September/174584.html>: >>> though >>> its documentation says it runs in O(n * log(N)), it actually >>> runs in O(n * >>> log(N) + N), which is quite noticeable for largish N. But >>> that is easy >>> enough to work around by writing a variant of findInterval which calls >>> find_interv_vec without checking. >>> >>> -s >>> >>> PS Yes, binary search is a one-liner in R, but I always prefer to use >>> standard, fast native libraries when possible.... >>> >>> binarysearch<- function(val,tab,L,H) {while (H>=L) { >>> M=L+(H-L) %/% 2; if >>> (tab[M]>val) H<-M-1 else if (tab[M]<val) L<-M+1 else return(M)}; >>> return(L-1)} >>> >>> [[alternative HTML version deleted]] >>> >>> ______________________________________________ >>> R-help@r-project.org mailing list >>> https://stat.ethz.ch/mailman/listinfo/r-help >>> PLEASE do read the posting guide >>> http://www.R-project.org/posting-guide.html >>> and provide commented, minimal, self-contained, reproducible code. >>> >>> >> ______________________________________________ >> R-help@r-project.org mailing list >> https://stat.ethz.ch/mailman/listinfo/r-help >> PLEASE do read the posting guide >> http://www.R-project.org/posting-guide.html >> and provide commented, minimal, self-contained, reproducible code. >> > > > -- > Computational Biology > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 > > Location: M1-B861 > Telephone: 206 667-2793 > [[alternative HTML version deleted]] ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.