On Sat, Nov 2, 2013 at 11:12 AM, Martin Morgan <mtmor...@fhcrc.org> wrote:

> On 11/01/2013 08:22 AM, Magnus Thor Torfason wrote:
>
>> Sure,
>>
>> I was attempting to be concise and boiling it down to what I saw as the
>> root
>> issue, but you are right, I could have taken it a step further. So here
>> goes.
>>
>> I have a set of around around 20M string pairs. A given string (say, A)
>> can
>> either be equivalent to another string (B) or not. If A and B occur
>> together in
>> the same pair, they are equivalent. But equivalence is transitive, so if
>> A and B
>> occur together in one pair, and A and C occur together in another pair,
>> then A
>> and C are also equivalent. I need a way to quickly determine if any two
>> strings
>> from my data set are equivalent or not.
>>
>
> Do you mean that if A,B occur together and B,C occur together, then A,B
> and A,C are equivalent?
>
> Here's a function that returns a unique identifier (not well tested!),
> allowing for transitive relations but not circularity.
>
>      uid <- function(x, y)
>     {
>         i <- seq_along(x)                   # global index
>         xy <- paste0(x, y)                  # make unique identifiers
>         idx <- match(xy, xy)
>
>         repeat {
>             ## transitive look-up
>             y_idx <- match(y[idx], x)       # look up 'y' in 'x'
>             keep <- !is.na(y_idx)
>             if (!any(keep))                 # no transitive relations,
> done!
>                 break
>             x[idx[keep]] <- x[y_idx[keep]]
>             y[idx[keep]] <- y[y_idx[keep]]
>
>             ## create new index of values
>             xy <- paste0(x, y)
>             idx <- match(xy, xy)
>         }
>         idx
>     }
>
> Values with the same index are identical. Some tests
>
>     > x <- c(1, 2, 3, 4)
>     > y <- c(2, 3, 5, 6)
>     > uid(x, y)
>     [1] 1 1 1 4
>     > i <- sample(x); uid(x[i], y[i])
>     [1] 1 1 3 1
>     > uid(as.character(x), as.character(y))  ## character() ok
>     [1] 1 1 1 4
>     > uid(1:10, 1 + 1:10)
>      [1] 1 1 1 1 1 1 1 1 1 1
>     > uid(integer(), integer())
>     integer(0)
>     > x <- c(1, 2, 3)
>     > y <- c(2, 3, 1)
>     > uid(x, y)                              ## circular!
>       C-c C-c
>
> I think this will scale well enough, but the worst-case scenario can be
> made to be log(longest chain) and copying can be reduced by using an index
> i and subsetting the original vector on each iteration. I think you could
> test for circularity by checking that the updated x are not a permutation
> of the kept x, all(x[y_idx[keep]] %in% x[keep]))
>
> Martin
>
>

This problem (union-find) is discussed in Chapter 1 of Sedgwick's
"Algorithms".  There's an algorithm given that takes linear time to build
the structure, worst-case logarithmic time to query, and effectively
constant average time to query (inverse-Ackerman amortized complexity).

   -thomas

-- 
Thomas Lumley
Professor of Biostatistics
University of Auckland

        [[alternative HTML version deleted]]

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Reply via email to