On 11/1/2013 10:12 PM, Martin Morgan wrote:
Do you mean that if A,B occur together and B,C occur together, then A,B
and A,C are equivalent?

Yes, that's what I meant, sorry, typo.

I like your uid() function. It avoids looping 20M times, and the issue of circular references can be solved by ensuring that y is always (weakly) smaller than x.
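Something like this, roughly (untested, and assuming the two columns are, or can be coerced to, character):

a <- as.character(d$a)
b <- as.character(d$b)
swap <- a < b                # where needed, swap so that y <= x
hi   <- ifelse(swap, b, a)   # the (weakly) larger label of each pair
lo   <- ifelse(swap, a, b)   # the (weakly) smaller label of each pair
keep <- hi != lo             # self-pairs (x == y) carry no information
uid(hi[keep], lo[keep])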

I can see how the log(longest chain) would be the limiting case, since each round should always halve the length of the longest chain.

I have yet to test whether that intuition (about running time) works in practice.

I did test the program on a different data set, and it does seem that it fails to detect equivalence on the following input:

  a b
1 B A
2 C B
3 O G
4 O M
5 Y C
6 Y M

Everything should be equivalent here, but

> uid(d$a, d$b)
[1] 1 1 3 4 1 6

I'm looking to see what can be done about this. The problem seems to be
with entries of the form:

O->G
O->M

These imply that all of O/M/G are equivalent, but they are not detected
as such. I will consider whether there is a good way around this.
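One possible way around it would be a union-find pass over the pairs, which handles shared parents like O->G / O->M as well as chains. A rough, untested sketch:

equiv.groups <- function(a, b)
{
    a <- as.character(a)
    b <- as.character(b)
    labels <- unique(c(a, b))
    parent <- seq_along(labels)          # every label starts as its own root

    find <- function(i) {                # walk up to the root of i's tree
        while (parent[i] != i)
            i <- parent[i]
        i
    }

    for (k in seq_along(a)) {
        ra <- find(match(a[k], labels))
        rb <- find(match(b[k], labels))
        if (ra != rb)
            parent[max(ra, rb)] <- min(ra, rb)   # merge the two trees
    }

    setNames(vapply(seq_along(labels), find, integer(1)), labels)
}

equiv.groups(d$a, d$b)   # all seven labels end up in the same group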

Best,
Magnus

Here's a function that returns a unique identifier (not well tested!),
allowing for transitive relations but not circularity.

    uid <- function(x, y)
    {
        i <- seq_along(x)                   # global index
        xy <- paste0(x, y)                  # make unique identifiers
        idx <- match(xy, xy)

        repeat {
            ## transitive look-up
            y_idx <- match(y[idx], x)       # look up 'y' in 'x'
            keep <- !is.na(y_idx)
            if (!any(keep))                 # no transitive relations, done!
                break
            x[idx[keep]] <- x[y_idx[keep]]
            y[idx[keep]] <- y[y_idx[keep]]

            ## create new index of values
            xy <- paste0(x, y)
            idx <- match(xy, xy)
        }
        idx
    }

Values with the same index are identical. Some tests

     > x <- c(1, 2, 3, 4)
     > y <- c(2, 3, 5, 6)
     > uid(x, y)
     [1] 1 1 1 4
     > i <- sample(x); uid(x[i], y[i])
     [1] 1 1 3 1
     > uid(as.character(x), as.character(y))  ## character() ok
     [1] 1 1 1 4
     > uid(1:10, 1 + 1:10)
      [1] 1 1 1 1 1 1 1 1 1 1
     > uid(integer(), integer())
     integer(0)
     > x <- c(1, 2, 3)
     > y <- c(2, 3, 1)
     > uid(x, y)                              ## circular!
       C-c C-c

I think this will scale well enough, but the worst-case scenario can be
made to be log(longest chain), and copying can be reduced by using an
index i and subsetting the original vectors on each iteration. I think
you could test for circularity by checking whether the updated x values
are just a permutation of the kept x values, i.e. whether
all(x[y_idx[keep]] %in% x[keep]) is TRUE.
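An untested sketch of how that check might be wired in, stopping when the values about to replace x are drawn entirely from the values being replaced:

    uid2 <- function(x, y)
    {
        xy <- paste0(x, y)
        idx <- match(xy, xy)
        repeat {
            y_idx <- match(y[idx], x)
            keep <- !is.na(y_idx)
            if (!any(keep))
                break
            if (all(x[y_idx[keep]] %in% x[keep]))   # circularity guard
                stop("circular relation detected")
            x[idx[keep]] <- x[y_idx[keep]]
            y[idx[keep]] <- y[y_idx[keep]]
            xy <- paste0(x, y)
            idx <- match(xy, xy)
        }
        idx
    }

    uid2(c(1, 2, 3, 4), c(2, 3, 5, 6))   # [1] 1 1 1 4, as before
    uid2(c(1, 2, 3), c(2, 3, 1))         # error instead of hanging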

Martin


The way I do this currently is to designate the smallest (alphabetically)
string in each known equivalence set as the "main" entry. For each pair,
I therefore insert two entries into the hash table, both pointing at the
main value. So assuming the input data:

A,B
B,C
D,E

I would then have:

A->A
B->A
C->B
D->D
E->D

Except that I also follow each chain until I reach the end (key==value),
and insert pointers to the "main" value for every value I find along the
way. After doing that, I end up with:

A->A
B->A
C->A
D->D
E->D

And I can very quickly check equivalence, either by comparing the hash
of two strings, or simply by transforming each string into its hash, and
then I can use simple comparison from then on. The code for generating
the final hash table is as follows:

h             : empty hash table, created with hash.new()
d             : input data
hash.deep.get : function that iterates through the hash table until it
                finds a key whose value is equal to itself (until
                hash.get(X)==X), then returns all the values found along
                the way in a vector
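Those helper functions are my own wrappers around an environment; roughly along these lines (a simplified sketch, assuming character keys and values; the real versions differ in detail):

hash.new <- function()
    new.env(hash = TRUE, parent = emptyenv())

hash.put <- function(h, key, value)
    assign(key, value, envir = h)

hash.get <- function(h, key)
    get(key, envir = h)

hash.contains <- function(h, key)
    exists(key, envir = h, inherits = FALSE)

# Follow key -> value -> ... until hash.get(X) == X (or until the key is
# not yet in the table), returning every value seen along the way.
hash.deep.get <- function(h, key)
{
    seen <- key
    while (hash.contains(h, key)) {
        nxt <- hash.get(h, key)
        if (identical(nxt, key))
            break                      # reached the "main" entry
        seen <- c(seen, nxt)
        key  <- nxt
    }
    seen
}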


h = hash.new()
for ( i in 1:nrow(d) )
{
     deep.a      = hash.deep.get(h, d$a[i])     # all values chained from a
     deep.b      = hash.deep.get(h, d$b[i])     # all values chained from b
     equivalents = sort(unique(c(deep.a,deep.b)))
     equiv.id    = min(equivalents)             # smallest string becomes "main"
     for ( equivalent in equivalents )
     {
         hash.put(h, equivalent, equiv.id)      # point every member at "main"
     }
}
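For example, on the A,B / B,C / D,E data from above (and using the sketched helpers), checking equivalence then reduces to a single lookup per string:

d = data.frame(a = c("A", "B", "D"),
               b = c("B", "C", "E"),
               stringsAsFactors = FALSE)

h = hash.new()
for ( i in 1:nrow(d) )
{
     deep.a      = hash.deep.get(h, d$a[i])
     deep.b      = hash.deep.get(h, d$b[i])
     equivalents = sort(unique(c(deep.a,deep.b)))
     equiv.id    = min(equivalents)
     for ( equivalent in equivalents )
         hash.put(h, equivalent, equiv.id)
}

hash.get(h, "C") == hash.get(h, "B")    # TRUE:  both point at the main value "A"
hash.get(h, "C") == hash.get(h, "E")    # FALSE: "A" vs "D"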


I would very much appreciate it if there were a simpler and faster way
to do this. Keeping my fingers crossed that one of the R-help geniuses
who sees this is sufficiently interested to crack the problem.

Best,
Magnus

On 11/1/2013 1:49 PM, jim holtman wrote:
It would be nice if you followed the posting guidelines and at least
showed the script that was creating your entries now so that we
understand the problem you are trying to solve.  A bit more
explanation of why you want this would be useful.  This gets to the
second part of my tag line:  Tell me what you want to do, not how you
want to do it.  There may be other solutions to your problem.

Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.


On Fri, Nov 1, 2013 at 9:32 AM, Magnus Thor Torfason
<zulutime....@gmail.com> wrote:
Pretty much what the subject says:

I used an env as the basis for a hash table in R, based on information
that environments are in fact implemented as hash tables under the hood.

I've been experimenting with doubling the number of entries, and so far
it has seemed to be scaling more or less linearly, as expected.

But as I went from 17 million entries to 34 million entries, the
completion time has gone from 18 hours to 5 days and counting.


The keys and values are in all cases strings of equal length.

One might suspect that the slow-down might have to do with the memory
being swapped to disk, but from what I know about my computing
environment, that should not be the case.

So my first question:
Is anyone familiar with anything in the implementation of environments
that would limit their use or slow them down (faster than O(n log(n)))
as the number of entries is increased?

And my second question:
I realize that this is not strictly what R environments were designed
for, but this is what my algorithm requires: I must go through these
millions of entries, storing them in the hash table and sometimes
retrieving them along the way, in a more or less random manner, which is
contingent on the data I am encountering, and on the contents of the
hash table at each moment.

Does anyone have a good recommendation for alternatives to implement
huge, fast, table-like structures in R?
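For instance, would it be expected to help to pre-size the environment's internal hash table via the size argument of new.env(), rather than relying on the default? A minimal sketch of what I mean:

n <- 34e6   # expected number of entries
h <- new.env(hash = TRUE, size = as.integer(n), parent = emptyenv())

assign("someKey", "someValue", envir = h)   # put
get("someKey", envir = h)                   # get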

Best,
Magnus
