On Sun, Nov 8, 2009 at 7:33 AM, Michael Jaaka <michael.ja...@googlemail.com> wrote:
> Hi! How would you solve such a problem:
>
> I have a collection of pairs (key, value) ->
> [ [ "tom" 32 ] [ "tom" 2333 ] [ "anne" 12 ] [ "anne" 55 ] ]
>
> As you can see, keys can occur more than once; also, that collection is
> very large, so it should be evaluated lazily (for example, the collection
> comes from stream input).
>
> I have a function with two arguments, a key and a sequence of its values;
> the pseudo code would be:
>
> callbackListener(string key, iterator values)
>
> Now I would like to get the effect that callbackListener will be called
> twice for the example collection:
>
> once with "tom" and an iterator (or a lazy seq) over a lazily evaluated
> collection of (32 and 2333), and a second time with "anne" and an iterator
> over the collection of (12 and 55).

Something along the lines of

    (let [ks (distinct (map first s))
          vals-for-key (fn [k] (map second (filter #(= (first %) k) s)))]
      (doseq [k ks]
        (callback k (vals-for-key k))))

or, to collect and return the return values of the callbacks, use "for" instead of "doseq". The problem is that this will realize s and hold onto its head, as Meikel points out.

To make it really lazy you need something more like

    (defn do-it [callback producer]
      (let [ks (fn [p] (distinct (map first (p))))
            vals-for-key (fn [k p] (map second (filter #(= (first %) k) (p))))]
        (doseq [k (ks producer)]
          (callback k (vals-for-key k producer)))))

and call this with two arguments: the first is the callback, which takes a key and a lazy seq of the associated values; the second is a function that produces anew a lazy sequence of key-value pairs from your disk file each time it's called, with the property that the lazy-seq clause that returns nil when the sequence is exhausted also has the side effect of closing the input stream, so as not to leak file handles.
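A quick sanity check (my own example, not from the thread): the same do-it, exercised with an in-memory producer standing in for the file-backed one. The producer and the result-collecting atom are mine, just to show the call pattern; a real producer would reopen the file on each call and close it on exhaustion.

```clojure
;; A producer that yields a fresh lazy seq of pairs on each call.
;; Here it is backed by a vector; the real one would reopen the file.
(def pairs [["tom" 32] ["tom" 2333] ["anne" 12] ["anne" 55]])

(defn producer [] (lazy-seq pairs))

(defn do-it [callback producer]
  (let [ks (fn [p] (distinct (map first (p))))
        vals-for-key (fn [k p] (map second (filter #(= (first %) k) (p))))]
    (doseq [k (ks producer)]
      (callback k (vals-for-key k producer)))))

;; Collect what the callback sees, forcing each lazy value seq.
(def results (atom []))
(do-it (fn [k vs] (swap! results conj [k (doall vs)])) producer)
;; @results => [["tom" (32 2333)] ["anne" (12 55)]]
```

Note that the callback is invoked once per distinct key, in first-occurrence order, and each value seq is only realized when the callback walks it.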
This will grovel over the file N+1 times, though, where N is the number of distinct keys, and will in general have two input streams open on the file at the same time: one crawling along discovering more distinct keys, and the other finding all the values for one key, then for the next, and so on. It will also build up a hash set of all of the keys over time, hidden inside the seq returned by "distinct". So, if the summed length of the distinct keys is going to be big enough to blow the heap all on its own, you've got real problems and you'll probably need to use a scratch disk file.

In that case I'd recommend a more imperative style with loop/recur and lazy-seq. The outer loop starts with a copy of the input file, runs the inner loop, and then deletes its input and reruns itself on the file generated by the inner loop (hence starting with a *copy* of the input file). The inner loop reads the first key from the current input file, creates an empty temp file to be the current output file, then calls the callback with a lazy seq object whose realization actually advances through the input file, copying entries that do not match the current key to the output file and producing lazy seq elements for entries that do match.

The idea is that if your input file foo.dat reads

    [[x 1] [y 2] [x 3] [z 4] [y 5]]

it will make a copy foo1.tmp, then call the callback with x and a lazy seq whose realization emits (1 3) while also generating foo2.tmp = [[y 2] [z 4] [y 5]]. The outer loop then deletes foo1.tmp and repeats on foo2.tmp: the callback is called with y and a lazy seq whose realization emits (2 5) while generating foo3.tmp = [[z 4]]. The outer loop then deletes foo2.tmp, and the callback is called with z and a lazy seq (4) whose realization creates an empty foo4.tmp. Finally the outer loop deletes foo3.tmp, sees that foo4.tmp is empty, deletes foo4.tmp, and terminates.
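A rough in-memory sketch of that scheme (mine, not from the thread): vectors stand in for the temp files, and each "pass" peels off one key, hands its values to the callback, and produces the next "file" of non-matching entries. The real version would read and write actual files, delete them as described above, and make the value seq genuinely lazy over the file; this just shows the control flow on the foo.dat example.

```clojure
;; One inner-loop pass: take the first key from "input", call the
;; callback with the matching values, and return the "output file"
;; of non-matching entries for the next pass.
(defn one-pass [callback input]
  (let [k       (ffirst input)
        matches (filter #(= (first %) k) input)
        rest-file (vec (remove #(= (first %) k) input))]
    (callback k (map second matches))
    rest-file))

;; The outer loop repeats until the current "file" is empty.
(defn process-all [callback input]
  (loop [file (vec input)]
    (when (seq file)
      (recur (one-pass callback file)))))

(def out (atom []))
(process-all (fn [k vs] (swap! out conj [k (doall vs)]))
             [["x" 1] ["y" 2] ["x" 3] ["z" 4] ["y" 5]])
;; @out => [["x" (1 3)] ["y" (2 5)] ["z" (4)]]
```

This reproduces the walkthrough above: the callback sees x with (1 3), then y with (2 5), then z with (4), and the loop stops when the remaining "file" is empty.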
If you want the callback's return values, you'll need to make the outer loop another lazy-seq. I'm a bit uncomfortable with lazy-seq bodies that have side effects, though. You might want to consider whether you should be using an RDBMS instead of plain disk files for whatever it is you're doing.

--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your first post.
To unsubscribe from this group, send email to clojure+unsubscr...@googlegroups.com
For more options, visit this group at http://groups.google.com/group/clojure?hl=en
-~----------~----~----~----~------~----~------~--~---