Thanks to Thomas, Martin, Jim and William, Your input was very informative, and thanks for the reference to Sedgwick.
In the end, it does seem to me that all these algorithms require fast lookup by ID of nodes to access data, and that conditional on such fast lookup, algorithms are possible with efficiency O(n) or O(n*log(n)) (depending on whether lookup time is constant or logarithmic). I believe my original algorithm achieves that.
We come back to the fact that I assumed that R environments, implemented as hash tables, would give me that fast lookup. But on my systems, their efficiency (for insert and lookup) seems to degrade fast at several million entries. Certainly much faster than either O(1) or O(log(n)). I believe this does not have to do with disk access time. For example, I tested this on my desktop computer, running a pure hash insert loop. I observe 100% processor use but no disk access, as the size of the hash table approaches millions of entries.
I have tested this on two systems, but have not gone into the implementation of the hashed environments to look at this in details. If others have the same (or different) experiences with using hashed environments with millions of entries, it would be very useful to know.
Barring a solution to the hashed environment speed, it seems the way to speed this algorithm up (within the confines of R) would be to move away from hash tables and towards a numerically indexed array.
Thanks again for all of the help, Magnus On 11/4/2013 8:20 PM, Thomas Lumley wrote:
On Sat, Nov 2, 2013 at 11:12 AM, Martin Morgan <mtmor...@fhcrc.org <mailto:mtmor...@fhcrc.org>> wrote: On 11/01/2013 08:22 AM, Magnus Thor Torfason wrote: Sure, I was attempting to be concise and boiling it down to what I saw as the root issue, but you are right, I could have taken it a step further. So here goes. I have a set of around around 20M string pairs. A given string (say, A) can either be equivalent to another string (B) or not. If A and B occur together in the same pair, they are equivalent. But equivalence is transitive, so if A and B occur together in one pair, and A and C occur together in another pair, then A and C are also equivalent. I need a way to quickly determine if any two strings from my data set are equivalent or not. Do you mean that if A,B occur together and B,C occur together, then A,B and A,C are equivalent? Here's a function that returns a unique identifier (not well tested!), allowing for transitive relations but not circularity. uid <- function(x, y) { i <- seq_along(x) # global index xy <- paste0(x, y) # make unique identifiers idx <- match(xy, xy) repeat { ## transitive look-up y_idx <- match(y[idx], x) # look up 'y' in 'x' keep <- !is.na <http://is.na>(y_idx) if (!any(keep)) # no transitive relations, done! break x[idx[keep]] <- x[y_idx[keep]] y[idx[keep]] <- y[y_idx[keep]] ## create new index of values xy <- paste0(x, y) idx <- match(xy, xy) } idx } Values with the same index are identical. Some tests > x <- c(1, 2, 3, 4) > y <- c(2, 3, 5, 6) > uid(x, y) [1] 1 1 1 4 > i <- sample(x); uid(x[i], y[i]) [1] 1 1 3 1 > uid(as.character(x), as.character(y)) ## character() ok [1] 1 1 1 4 > uid(1:10, 1 + 1:10) [1] 1 1 1 1 1 1 1 1 1 1 > uid(integer(), integer()) integer(0) > x <- c(1, 2, 3) > y <- c(2, 3, 1) > uid(x, y) ## circular! C-c C-c I think this will scale well enough, but the worst-case scenario can be made to be log(longest chain) and copying can be reduced by using an index i and subsetting the original vector on each iteration. I think you could test for circularity by checking that the updated x are not a permutation of the kept x, all(x[y_idx[keep]] %in% x[keep])) Martin This problem (union-find) is discussed in Chapter 1 of Sedgwick's "Algorithms". There's an algorithm given that takes linear time to build the structure, worst-case logarithmic time to query, and effectively constant average time to query (inverse-Ackerman amortized complexity). -thomas -- Thomas Lumley Professor of Biostatistics University of Auckland
______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.