I do not know about the most important parts of your performance difficulties, but on a more trivial point I might be able to shed some light.
See the ClojureDocs page for pmap, which refers to the page for future, linked below. If you call (shutdown-agents) the 60-second wait to exit should go away. http://clojuredocs.org/clojure_core/clojure.core/future Andy Sent from my iPhone On Sep 28, 2013, at 1:41 AM, Paul Butcher <p...@paulbutcher.com> wrote: > On 28 Sep 2013, at 00:27, Stuart Halloway <stuart.hallo...@gmail.com> wrote: > >> I have posted an example that shows partition-then-fold at >> https://github.com/stuarthalloway/exploring-clojure/blob/master/examples/exploring/reducing_apple_pie.clj. >> >> I would be curious to know how this approach performs with your data. With >> the generated data I used, the partition+fold and partition+pmap approaches >> both used most of my cores and had similar perf. > > Hey Stuart, > > Thanks for getting back to me. > > I've updated the code for my word count example on GitHub with (I believe) > something that works along the lines you suggest: > > https://github.com/paulbutcher/parallel-word-count > > Here are some sample runs on my machine (a 4-core retina MacBook Pro). Each > of these runs counts the words in the first 10000 pages of Wikipedia: > >> $ lein run 10000 ~/enwiki.xml sequential >> "Elapsed time: 23630.443 msecs" >> $ lein run 10000 ~/enwiki.xml fold >> "Elapsed time: 8697.79 msecs" >> $ lein run 10000 ~/enwiki.xml pmap 10000 >> "Elapsed time: 27393.703 msecs" >> $ lein run 10000 ~/enwiki.xml pthenf 10000 >> "Elapsed time: 37263.193 msecs" > > As you can see, the the foldable-seq version gives an almost 3x speedup > relative to the sequential version, and both the partition-then-pmap and > partition-then-fold versions are significantly slower. > > The last argument for the partition-then-pmap and partition-then-fold > versions is a partition size. I've tried various different sizes with no > material effect: > >> $ lein run 10000 ~/enwiki.xml pthenf 1000 >> "Elapsed time: 43187.385 msecs" >> $ lein run 10000 ~/enwiki.xml pthenf 100000 >> "Elapsed time: 35702.651 msecs" >> $ lein run 10000 ~/enwiki.xml pmap 1000 >> "Elapsed time: 34087.314 msecs" >> $ lein run 10000 ~/enwiki.xml pmap 100000 >> "Elapsed time: 47340.969 msecs" > > The performance of the partition-then-pmap version is actually much worse > than the numbers above suggest. There's something very weird going on with (I > guess) garbage collection - it takes a *long* time to finish after printing > the elapsed time and the performance is completely pathological with larger > page counts. > > Bottom line: the only way I've been able to obtain any kind of speedup > remains foldable-seq. > > I'd be very grateful indeed if you could take a look at how I've implemented > partition-then-fold to make sure that I've correctly captured your intent. Or > if you have any suggestions for anything else that might work, or to explain > the poor performance of partition-then-pmap and partition-then-fold. > > My guess is that the problem with partition-then-fold is the copying that's > going on during the (into [] %). I can see that it is operating in parallel > because the number of cores in use goes up, but the net effect is an overall > slowdown rather than a speedup. > > That it performs worse than foldable-seq isn't surprising to me, given that > it introduces an unnecessary copy. > > I still think that it's a crying shame to disallow folding over sequences - > as the above shows, the gains both in performance and programming ease are > significant, and it would only take a small modification to the reducers API > to fix the holding-onto-head problem. What would be the downside of making > this modification and allowing foldable sequences? > > -- > paul.butcher->msgCount++ > > Snetterton, Castle Combe, Cadwell Park... > Who says I have a one track mind? > > http://www.paulbutcher.com/ > LinkedIn: http://www.linkedin.com/in/paulbutcher > MSN: p...@paulbutcher.com > AIM: paulrabutcher > Skype: paulrabutcher > > On 28 Sep 2013, at 00:27, Stuart Halloway <stuart.hallo...@gmail.com> wrote: > >> Hi Paul, >> >> I have posted an example that shows partition-then-fold at >> https://github.com/stuarthalloway/exploring-clojure/blob/master/examples/exploring/reducing_apple_pie.clj. >> >> I would be curious to know how this approach performs with your data. With >> the generated data I used, the partition+fold and partition+pmap approaches >> both used most of my cores and had similar perf. >> >> Enjoying your book! >> >> Stu >> >> >> On Sat, May 25, 2013 at 12:34 PM, Paul Butcher <p...@paulbutcher.com> wrote: >>> I'm currently working on a book on concurrent/parallel development for The >>> Pragmatic Programmers. One of the subjects I'm covering is parallel >>> programming in Clojure, but I've hit a roadblock with one of the examples. >>> I'm hoping that I can get some help to work through it here. >>> >>> The example counts the words contained within a Wikipedia dump. It should >>> respond well to parallelisation (I have Java and Scala versions that >>> perform excellently) but I've been incapable of coming up with a nice >>> solution in Clojure. >>> >>> The code I'm working with is checked into GitHub: >>> >>> The basic sequential algorithm is: >>> >>>> (frequencies (mapcat get-words pages)) >>> >>> >>> If I run that on the first 10k pages in Wikipedia dump, it takes ~21s on my >>> MacBook Pro. >>> >>> One way to parallelise it is to create a parallel version of frequencies >>> that uses reducers: >>> >>>> (defn frequencies-fold [words] >>>> (r/fold (partial merge-with +) >>>> (fn [counts word] (assoc counts word (inc (get counts word 0)))) >>>> >>>> words)) >>> >>> And sure enough, if I use that, along with use the foldable-seq utility I >>> posted about here are while ago it runs in ~8s, almost a 3x speedup, not >>> bad given that the parallel version is unable to use transients. >>> >>> Unfortunately, as they currently stand, reducers have a fatal flaw that >>> means that, even with foldable-seq, they're basically useless with lazy >>> sequences. Reducers always hold onto the head of the sequence they're >>> given, so there's no way to use this approach for a complete Wikipedia dump >>> (which runs to around 40GiB). >>> >>> So the other approach I've tried is to use pmap: >>> >>>> (defn frequencies-pmap [words] >>>> >>>> (reduce (partial merge-with +) >>>> >>>> (pmap frequencies >>>> >>>> (partition-all 10000 words)))) >>> >>> But, for reasons I don't understand, this performs dreadfully - taking >>> ~26s, i.e. significantly slower than the sequential version. >>> >>> I've tried playing with different partition sizes without materially >>> affecting the result. >>> >>> So, what I'm looking for is either: >>> >>> a) An explanation for why the pmap-based solution performs so badly >>> >>> b) A way to fix the "holding onto head" problem that's inherent within >>> reducers. >>> >>> With the last of these in mind, it strikes me that the problem >>> fundamentally arises from the desire for reducers to follow the same basic >>> API as "normal" code. So: >>> >>> (reduce (filter ... (map ... coll))) >>> >>> becomes: >>> >>> (r/fold (r/filter ... (r/map ... coll))) >>> >>> A very small change to the reducers API - passing the collection to the >>> reduce and/or fold - would avoid the problem: >>> >>> (r/fold (r/filter ... (r/map ...)) coll) >>> >>> Anyway - I'd be very grateful for any insight into either of the above >>> questions. Or for suggestions for an alternative approach that might be >>> more fruitful. >>> >>> Many thanks in advance, >>> >>> -- >>> paul.butcher->msgCount++ >>> >>> Snetterton, Castle Combe, Cadwell Park... >>> Who says I have a one track mind? >>> >>> http://www.paulbutcher.com/ >>> LinkedIn: http://www.linkedin.com/in/paulbutcher >>> MSN: p...@paulbutcher.com >>> AIM: paulrabutcher >>> Skype: paulrabutcher >>> >>> >>> -- >>> -- >>> You received this message because you are subscribed to the Google >>> Groups "Clojure" group. >>> To post to this group, send email to clojure@googlegroups.com >>> Note that posts from new members are moderated - please be patient with >>> your first post. >>> To unsubscribe from this group, send email to >>> clojure+unsubscr...@googlegroups.com >>> For more options, visit this group at >>> http://groups.google.com/group/clojure?hl=en >>> --- >>> You received this message because you are subscribed to the Google Groups >>> "Clojure" group. >>> To unsubscribe from this group and stop receiving emails from it, send an >>> email to clojure+unsubscr...@googlegroups.com. >>> For more options, visit https://groups.google.com/groups/opt_out. >> >> >> -- >> -- >> You received this message because you are subscribed to the Google >> Groups "Clojure" group. >> To post to this group, send email to clojure@googlegroups.com >> Note that posts from new members are moderated - please be patient with your >> first post. >> To unsubscribe from this group, send email to >> clojure+unsubscr...@googlegroups.com >> For more options, visit this group at >> http://groups.google.com/group/clojure?hl=en >> --- >> You received this message because you are subscribed to the Google Groups >> "Clojure" group. >> To unsubscribe from this group and stop receiving emails from it, send an >> email to clojure+unsubscr...@googlegroups.com. >> For more options, visit https://groups.google.com/groups/opt_out. > > -- > -- > You received this message because you are subscribed to the Google > Groups "Clojure" group. > To post to this group, send email to clojure@googlegroups.com > Note that posts from new members are moderated - please be patient with your > first post. > To unsubscribe from this group, send email to > clojure+unsubscr...@googlegroups.com > For more options, visit this group at > http://groups.google.com/group/clojure?hl=en > --- > You received this message because you are subscribed to the Google Groups > "Clojure" group. > To unsubscribe from this group and stop receiving emails from it, send an > email to clojure+unsubscr...@googlegroups.com. > For more options, visit https://groups.google.com/groups/opt_out. -- -- You received this message because you are subscribed to the Google Groups "Clojure" group. To post to this group, send email to clojure@googlegroups.com Note that posts from new members are moderated - please be patient with your first post. To unsubscribe from this group, send email to clojure+unsubscr...@googlegroups.com For more options, visit this group at http://groups.google.com/group/clojure?hl=en --- You received this message because you are subscribed to the Google Groups "Clojure" group. To unsubscribe from this group and stop receiving emails from it, send an email to clojure+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/groups/opt_out.