I'm currently working on a book on concurrent/parallel development for The Pragmatic Programmers. One of the subjects I'm covering is parallel programming in Clojure, but I've hit a roadblock with one of the examples. I'm hoping I can get some help working through it here.
The example counts the words contained within a Wikipedia dump. It should respond well to parallelisation (I have Java and Scala versions that perform excellently) but I've been unable to come up with a nice solution in Clojure. The code I'm working with is checked into GitHub.

The basic sequential algorithm is:

  (frequencies (mapcat get-words pages))

If I run that on the first 10k pages of a Wikipedia dump, it takes ~21s on my MacBook Pro.

One way to parallelise it is to create a parallel version of frequencies that uses reducers:

  ;; assumes (require '[clojure.core.reducers :as r])
  (defn frequencies-fold [words]
    (r/fold (partial merge-with +)
            (fn [counts word] (assoc counts word (inc (get counts word 0))))
            words))

And sure enough, if I use that, along with the foldable-seq utility I posted about here a while ago, it runs in ~8s, almost a 3x speedup - not bad, given that the parallel version is unable to use transients.

Unfortunately, as they currently stand, reducers have a fatal flaw that makes them basically useless with lazy sequences, even with foldable-seq: reducers always hold onto the head of the sequence they're given, so there's no way to use this approach on a complete Wikipedia dump (which runs to around 40GiB).

So the other approach I've tried is to use pmap:

  (defn frequencies-pmap [words]
    (reduce (partial merge-with +)
            (pmap frequencies
                  (partition-all 10000 words))))

But, for reasons I don't understand, this performs dreadfully, taking ~26s, i.e. significantly slower than the sequential version. I've tried playing with different partition sizes without materially affecting the result. (There's a self-contained version in the P.S. below, for anyone who wants to try it.)

So, what I'm looking for is either:

a) an explanation of why the pmap-based solution performs so badly, or
b) a way to fix the "holding onto head" problem that's inherent in reducers.

With the last of these in mind, it strikes me that the problem fundamentally arises from the desire for reducers to follow the same basic API as "normal" code. So:

  (reduce (filter ... (map ... coll)))

becomes:

  (r/fold (r/filter ... (r/map ... coll)))

A very small change to the reducers API - passing the collection to the reduce and/or fold - would avoid the problem:

  (r/fold (r/filter ... (r/map ...)) coll)

Anyway, I'd be very grateful for any insight into either of the above questions, or for suggestions of an alternative approach that might be more fruitful.

Many thanks in advance,

--
paul.butcher->msgCount++

Snetterton, Castle Combe, Cadwell Park...
Who says I have a one track mind?

http://www.paulbutcher.com/
LinkedIn: http://www.linkedin.com/in/paulbutcher
MSN: p...@paulbutcher.com
AIM: paulrabutcher
Skype: paulrabutcher
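P.S. For anyone who wants to try this without downloading a dump, here's a minimal, self-contained sketch. Everything in it other than frequencies-pmap is a stand-in: get-words here is just a crude regex tokenizer (the real code parses Wikipedia pages), and the lines of any large text file will do in place of the pages.

  (require '[clojure.java.io :as io])

  ;; Crude stand-in tokenizer -- not the real get-words.
  (defn get-words [page]
    (re-seq #"\w+" page))

  ;; Same definition as above: count each 10k-word chunk in parallel,
  ;; then merge the per-chunk maps.
  (defn frequencies-pmap [words]
    (reduce (partial merge-with +)
            (pmap frequencies
                  (partition-all 10000 words))))

  ;; "big-text-file.txt" is a placeholder; its lines stand in for pages.
  ;; count forces the result and avoids printing a huge map at the REPL.
  (with-open [rdr (io/reader "big-text-file.txt")]
    (time (count (frequencies-pmap (mapcat get-words (line-seq rdr))))))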