Martin,

Thank you for your exploration of implementations of bsearch!

In my application, length(val) is very small (typically 2), so vectorization
over val doesn't help -- though vectorization over tab could work by doing
n-ary instead of 2-ary splits with something like

         match(TRUE, val < tab[L+(H-L)*(1:9/10)])

and (when H-L becomes small)

         match(TRUE,val < tab[L:H])

Then there are approaches like tries... but though I love this sort of
programming, I'm trying to reuse as much well-tested, well-tuned library
code as I can.

Thanks again for your ideas!

             -s

On Wed, Apr 6, 2011 at 12:59, Martin Morgan <mtmor...@fhcrc.org> wrote:

> On 04/04/2011 01:50 PM, William Dunlap wrote:
>
>> -----Original Message-----
>>> From: r-help-boun...@r-project.org
>>> [mailto:r-help-boun...@r-project.org] On Behalf Of Stavros Macrakis
>>> Sent: Monday, April 04, 2011 1:15 PM
>>> To: r-help
>>> Subject: [R] General binary search?
>>>
>>> Is there a generic binary search routine in a standard library which
>>>
>>>    a) works for character vectors
>>>    b) runs in O(log(N)) time?
>>>
>>> I'm aware of findInterval(x,vec), but it is restricted to
>>> numeric vectors.
>>>
>>
>> xtfrm(x) will convert a character (or other) vector to
>> a numeric vector with the same ordering.  findInterval
>> can work on that.  E.g.,
>>    >  f0<- function(x, vec) {
>>        tmp<- xtfrm(c(x, vec))
>>        findInterval(tmp[seq_along(x)], tmp[-seq_along(x)])
>>      }
>>    >  f0(c("Baby", "Aunt", "Dog"), LETTERS)
>>    [1] 2 1 4
>> I've never looked at its speed.
>>
>
> For a little progress (though no 'generic binary searchin a standard
> library'), here's the 'one-liner'
>
> bsearch1 <-
>    function(val, tab, L=1L, H=length(tab))
> {
>    while (H >= L) {
>        M <- L + (H - L) %/% 2L
>        if (tab[M] > val) H <- M - 1L
>        else if (tab[M] < val) L <- M + 1L
>        else return(M)
>    }
>    return(L - 1L)
> }
>
> It seems like a good candidate for the new (R-2.13) 'compiler' package, so
>
> library(compiler)
> bsearch2 <- cmpfun(bsearch1)
>
> And Bill's suggestion
>
> bsearch3 <- function(val, tab) {
>    tmp <- xtfrm(c(val, tab))
>    findInterval(tmp[seq_along(val)], tmp[-seq_along(val)])
> }
>
> which will work best when 'val' is a vector to be looked up.
>
> A quick look at data.table:::sortedmatch seemed to return matches, whereas
> Stavros is looking for lower bounds.
>
> It seems that one could shift the weight more to C code by 'vectorizing'
> the one-liner, first as
>
> bsearch5 <-
>    function(val, tab, L=1L, H=length(tab))
> {
>    b <- cbind(L=rep(L, length(val)), H=rep(H, length(val)))
>    i0 <- seq_along(val)
>    repeat {
>        M <- b[i0,"L"] + (b[i0,"H"] - b[i0,"L"]) %/% 2L
>        i <- tab[M] > val[i0]
>        b[i0 + i * length(val)] <-
>            ifelse(i, M - 1L, ifelse(tab[M] < val[i0], M + 1L, M))
>        i0 <- which(b[i0, "H"] >= b[i0, "L"])
>        if (!length(i0)) break;
>    }
>    b[,"L"] - 1L
> }
>
> and then a little more thoughtfully (though more room for improvement) as
>
> bsearch7 <-
>    function(val, tab, L=1L, H=length(tab))
> {
>    b <- cbind(L=rep(L, length(val)), H=rep(H, length(val)))
>    i0 <- seq_along(val)
>    repeat {
>        updt <- M <- b[i0,"L"] + (b[i0,"H"] - b[i0,"L"]) %/% 2L
>        tabM <- tab[M]
>        val0 <- val[i0]
>        i <- tabM < val0
>        updt[i] <- M[i] + 1L
>        i <- tabM > val0
>        updt[i] <- M[i] - 1L
>        b[i0 + i * length(val)] <- updt
>        i0 <- which(b[i0, "H"] >= b[i0, "L"])
>        if (!length(i0)) break;
>    }
>    b[,"L"] - 1L
> }
>
> none of bsearch 3, 5, or 7 is likely to benefit substantially from
> compilation.
>
> Here's a little test data set converting numeric to character as an easy
> cheat.
>
> set.seed(123L)
> x <- sort(as.character(rnorm(1e6)))
> y <- as.character(rnorm(1e4))
>
> There seems to be some significant initial overhead, so we warm things up
> (and also introduce the paradigm for multiple look-ups in bsearch 1, 2)
>
> warmup <- function(y, x) {
>    lapply(y, bsearch1, x)
>    lapply(y, bsearch2, x)
>    bsearch3(y, x)
>    bsearch5(y, x)
>    bsearch7(y, x)
> }
> replicate(3, warmup(y, x))
>
> and then time
>
> > system.time(res1 <- unlist(lapply(y, bsearch1, x), use.names=FALSE))
>   user  system elapsed
>  2.692   0.000   2.696
> > system.time(res2 <- unlist(lapply(y, bsearch2, x), use.names=FALSE))
>   user  system elapsed
>  1.379   0.000   1.380
> > identical(res1, res2)
> [1] TRUE
> > system.time(res3 <- bsearch3(y, x)); identical(res1, res3)
>   user  system elapsed
>  8.339   0.001   8.350
> [1] TRUE
> > system.time(res5 <- bsearch5(y, x)); identical(res1, res5)
>   user  system elapsed
>  0.700   0.000   0.702
> [1] TRUE
> > system.time(res7 <- bsearch7(y, x)); identical(res1, res7)
>   user  system elapsed
>  0.222   0.000   0.222
> [1] TRUE
>
> Martin
>
>
>
>> Bill Dunlap
>> Spotfire, TIBCO Software
>> wdunlap tibco.com
>>
>>
>>> I'm also aware of various hashing solutions (e.g.
>>> new.env(hash=TRUE) and
>>> fastmatch), but I need the greatest-lower-bound match in my
>>> application.
>>>
>>> findInterval is also slow for large N=length(vec) because of the O(N)
>>> checking it does, as Duncan Murdoch has pointed
>>> out<https://stat.ethz.ch/pipermail/r-help/2008-September/174584.html>:
>>> though
>>> its documentation says it runs in O(n * log(N)), it actually
>>> runs in O(n *
>>> log(N) + N), which is quite noticeable for largish N.  But
>>> that is easy
>>> enough to work around by writing a variant of findInterval which calls
>>> find_interv_vec without checking.
>>>
>>>             -s
>>>
>>> PS Yes, binary search is a one-liner in R, but I always prefer to use
>>> standard, fast native libraries when possible....
>>>
>>> binarysearch<- function(val,tab,L,H) {while (H>=L) {
>>> M=L+(H-L) %/% 2; if
>>> (tab[M]>val) H<-M-1 else if (tab[M]<val) L<-M+1 else return(M)};
>>> return(L-1)}
>>>
>>>        [[alternative HTML version deleted]]
>>>
>>> ______________________________________________
>>> R-help@r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>>
>> ______________________________________________
>> R-help@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>
>
> --
> Computational Biology
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
>
> Location: M1-B861
> Telephone: 206 667-2793
>

        [[alternative HTML version deleted]]

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Reply via email to