On Mon, Jul 11, 2011 at 1:13 PM, Alan Malloy <a...@malloys.org> wrote:
> If the sequence is already realized, or is cheap, and you want only a
> very small random subset of it, you can do better than shuffling the
> whole thing. Fliebel and I played around with several solutions to
> this, some time ago. I can't find the whole thing, but some
> interesting examples and benchmarking data are at 
> https://gist.github.com/805747
> if you want to try it out.

Also if the sequence is not "realized" but there's a cheap way to
calculate (nth the-hypothetical-sequence n) just knowing n and not
previous members, there are ways to lazily generate part of a random
permutation:

(take k (distinct (f (rand-int limit))))

where n ranges from 0 to limit-1, f changes n to (nth
the-hypothetical-sequence n) without realizing (or even creating as a
seq object) the-hypothetical-sequence, and k < limit is the number of
items desired from the head of the shuffled sequence. (take limit
(distinct (f (rand-int limit)))) represents a full random permutation
whose elements can be taken lazily.

The downside is that the (distinct ...) implements a rejection
algorithm, so that we get a uniform random element, then a uniform
random from what's left, then ... and so a uniformly-chosen random
permutation, but at the cost that if you realize most or all of the
permutation and not just the first few elements of a big one it will
slow down tremendously. For instance if limit is 10,000 and you
realize the whole thing it will need 10,000 tries on average before
generating the final element. The (distinct ...) also amounts to
holding onto the head.

So this works well if you want the first few elements of the
permutation, f exists, and you don't want to realize all of
the-hypothetical-sequence. It works terribly if you want the whole
permutation or much of it or f does not exist.

There are obvious ways to speed up distinct in this case and make it
more compact, though, for the special case of sets of nonnegative
integers from a range starting at 0: create a BitSet representing
"have we already generated this?" and keep a running count of
generated elements. This allows creating permutations of (range limit)
more efficiently and without realizing all of the Integer objects at
once, at the cost of having mutable state under the hood. Each
iteration we generate k = (rand-int (- limit cnt)) and then walk the
bitset, incrementing a counter n and decrementing k every time we see
a zero in the bitset at position n. When k hits zero, we flip that bit
of the bitset to 1 and output k; that is the next item in the lazy
sequence. Memory complexity is limit/8 bytes instead of limit*12 bytes
(Integer objects consume 12 bytes of memory each) for a space savings
factor of 72, nearly two whole orders of magnitude, over using
(distinct ...).

However, this seems to work only for shuffling (range limit) ... or,
we can use (map f (bitset-perm-of-range limit)) to get an arbitrary
lazy permutation of the-hypothetical-sequence the same way!

Time complexity is quadratic: we have to walk further and further
along the bitset to generate each element. It still beats the
(distinct ...) rejection algorithm when you (eventually) want most or
all of the permutation, and especially when you want that
two-order-of-magnitude memory savings.

Both algorithms avoid the core shuffle's gamut problem when (factorial
limit) exceeds Integer/MAX_VALUE; as long as limit itself doesn't
exceed that there is no problem.

First stab at an implementation:

(defn bitset-perm-of-range [limit]
  (let [bs (java.util.BitSet. limit)
        rem (atom limit)
        step (fn step []
               (lazy-seq
                 (locking bs
                   (if-not (zero? @rem)
                     (let [k (rand-int @rem)]
                       (loop [i 0 k k]
                         (if (.get bs i)
                           (recur (inc i) k)
                           (if (> k 0)
                             (recur (inc i) (dec k))
                             (do
                               (.set bs i)
                               (swap! rem dec)
                               (cons i (step)))))))))))]
    (step)))

It *seems* to work after very minimal testing on short ranges and one
small chunk of a big range:

=> (bitset-perm-of-range 10)
(0 7 4 1 8 6 5 3 2 9)
=> (bitset-perm-of-range 10)
(2 8 1 5 0 3 9 4 7 6)
=> (bitset-perm-of-range 10)
(5 9 2 1 4 3 7 6 8 0)
=> (take 10 (bitset-perm-of-range 1000000))
(113738 532538 423561 100835 157218 710244 339727 507963 199228 842555)

-- 
Protege: What is this seething mass of parentheses?!
Master: Your father's Lisp REPL. This is the language of a true
hacker. Not as clumsy or random as C++; a language for a more
civilized age.

-- 
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en

Reply via email to