On Mon, Jul 11, 2011 at 1:13 PM, Alan Malloy <a...@malloys.org> wrote: > If the sequence is already realized, or is cheap, and you want only a > very small random subset of it, you can do better than shuffling the > whole thing. Fliebel and I played around with several solutions to > this, some time ago. I can't find the whole thing, but some > interesting examples and benchmarking data are at > https://gist.github.com/805747 > if you want to try it out.
Also if the sequence is not "realized" but there's a cheap way to calculate (nth the-hypothetical-sequence n) just knowing n and not previous members, there are ways to lazily generate part of a random permutation: (take k (distinct (f (rand-int limit)))) where n ranges from 0 to limit-1, f changes n to (nth the-hypothetical-sequence n) without realizing (or even creating as a seq object) the-hypothetical-sequence, and k < limit is the number of items desired from the head of the shuffled sequence. (take limit (distinct (f (rand-int limit)))) represents a full random permutation whose elements can be taken lazily. The downside is that the (distinct ...) implements a rejection algorithm, so that we get a uniform random element, then a uniform random from what's left, then ... and so a uniformly-chosen random permutation, but at the cost that if you realize most or all of the permutation and not just the first few elements of a big one it will slow down tremendously. For instance if limit is 10,000 and you realize the whole thing it will need 10,000 tries on average before generating the final element. The (distinct ...) also amounts to holding onto the head. So this works well if you want the first few elements of the permutation, f exists, and you don't want to realize all of the-hypothetical-sequence. It works terribly if you want the whole permutation or much of it or f does not exist. There are obvious ways to speed up distinct in this case and make it more compact, though, for the special case of sets of nonnegative integers from a range starting at 0: create a BitSet representing "have we already generated this?" and keep a running count of generated elements. This allows creating permutations of (range limit) more efficiently and without realizing all of the Integer objects at once, at the cost of having mutable state under the hood. Each iteration we generate k = (rand-int (- limit cnt)) and then walk the bitset, incrementing a counter n and decrementing k every time we see a zero in the bitset at position n. When k hits zero, we flip that bit of the bitset to 1 and output k; that is the next item in the lazy sequence. Memory complexity is limit/8 bytes instead of limit*12 bytes (Integer objects consume 12 bytes of memory each) for a space savings factor of 72, nearly two whole orders of magnitude, over using (distinct ...). However, this seems to work only for shuffling (range limit) ... or, we can use (map f (bitset-perm-of-range limit)) to get an arbitrary lazy permutation of the-hypothetical-sequence the same way! Time complexity is quadratic: we have to walk further and further along the bitset to generate each element. It still beats the (distinct ...) rejection algorithm when you (eventually) want most or all of the permutation, and especially when you want that two-order-of-magnitude memory savings. Both algorithms avoid the core shuffle's gamut problem when (factorial limit) exceeds Integer/MAX_VALUE; as long as limit itself doesn't exceed that there is no problem. First stab at an implementation: (defn bitset-perm-of-range [limit] (let [bs (java.util.BitSet. limit) rem (atom limit) step (fn step [] (lazy-seq (locking bs (if-not (zero? @rem) (let [k (rand-int @rem)] (loop [i 0 k k] (if (.get bs i) (recur (inc i) k) (if (> k 0) (recur (inc i) (dec k)) (do (.set bs i) (swap! rem dec) (cons i (step)))))))))))] (step))) It *seems* to work after very minimal testing on short ranges and one small chunk of a big range: => (bitset-perm-of-range 10) (0 7 4 1 8 6 5 3 2 9) => (bitset-perm-of-range 10) (2 8 1 5 0 3 9 4 7 6) => (bitset-perm-of-range 10) (5 9 2 1 4 3 7 6 8 0) => (take 10 (bitset-perm-of-range 1000000)) (113738 532538 423561 100835 157218 710244 339727 507963 199228 842555) -- Protege: What is this seething mass of parentheses?! Master: Your father's Lisp REPL. This is the language of a true hacker. Not as clumsy or random as C++; a language for a more civilized age. -- You received this message because you are subscribed to the Google Groups "Clojure" group. To post to this group, send email to clojure@googlegroups.com Note that posts from new members are moderated - please be patient with your first post. To unsubscribe from this group, send email to clojure+unsubscr...@googlegroups.com For more options, visit this group at http://groups.google.com/group/clojure?hl=en