On Sat, Nov 2, 2013 at 11:12 AM, Martin Morgan <mtmor...@fhcrc.org> wrote:
> On 11/01/2013 08:22 AM, Magnus Thor Torfason wrote: > >> Sure, >> >> I was attempting to be concise and boiling it down to what I saw as the >> root >> issue, but you are right, I could have taken it a step further. So here >> goes. >> >> I have a set of around around 20M string pairs. A given string (say, A) >> can >> either be equivalent to another string (B) or not. If A and B occur >> together in >> the same pair, they are equivalent. But equivalence is transitive, so if >> A and B >> occur together in one pair, and A and C occur together in another pair, >> then A >> and C are also equivalent. I need a way to quickly determine if any two >> strings >> from my data set are equivalent or not. >> > > Do you mean that if A,B occur together and B,C occur together, then A,B > and A,C are equivalent? > > Here's a function that returns a unique identifier (not well tested!), > allowing for transitive relations but not circularity. > > uid <- function(x, y) > { > i <- seq_along(x) # global index > xy <- paste0(x, y) # make unique identifiers > idx <- match(xy, xy) > > repeat { > ## transitive look-up > y_idx <- match(y[idx], x) # look up 'y' in 'x' > keep <- !is.na(y_idx) > if (!any(keep)) # no transitive relations, > done! > break > x[idx[keep]] <- x[y_idx[keep]] > y[idx[keep]] <- y[y_idx[keep]] > > ## create new index of values > xy <- paste0(x, y) > idx <- match(xy, xy) > } > idx > } > > Values with the same index are identical. Some tests > > > x <- c(1, 2, 3, 4) > > y <- c(2, 3, 5, 6) > > uid(x, y) > [1] 1 1 1 4 > > i <- sample(x); uid(x[i], y[i]) > [1] 1 1 3 1 > > uid(as.character(x), as.character(y)) ## character() ok > [1] 1 1 1 4 > > uid(1:10, 1 + 1:10) > [1] 1 1 1 1 1 1 1 1 1 1 > > uid(integer(), integer()) > integer(0) > > x <- c(1, 2, 3) > > y <- c(2, 3, 1) > > uid(x, y) ## circular! > C-c C-c > > I think this will scale well enough, but the worst-case scenario can be > made to be log(longest chain) and copying can be reduced by using an index > i and subsetting the original vector on each iteration. I think you could > test for circularity by checking that the updated x are not a permutation > of the kept x, all(x[y_idx[keep]] %in% x[keep])) > > Martin > > This problem (union-find) is discussed in Chapter 1 of Sedgwick's "Algorithms". There's an algorithm given that takes linear time to build the structure, worst-case logarithmic time to query, and effectively constant average time to query (inverse-Ackerman amortized complexity). -thomas -- Thomas Lumley Professor of Biostatistics University of Auckland [[alternative HTML version deleted]] ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.