I would go a bit more further and suggest that you do not use sequences at all and work only with reducible/foldable collections. Make an input reader which returns a foldable collection and you will have the most performant solution. The thing about holding into the head is being worked on right now, see http://dev.clojure.org/jira/browse/CLJ-1250
JW On Saturday, September 28, 2013 10:41:20 AM UTC+2, Paul Butcher wrote: > > On 28 Sep 2013, at 00:27, Stuart Halloway <stuart....@gmail.com<javascript:>> > wrote: > > I have posted an example that shows partition-then-fold at > https://github.com/stuarthalloway/exploring-clojure/blob/master/examples/exploring/reducing_apple_pie.clj > . > > I would be curious to know how this approach performs with your data. > With the generated data I used, the partition+fold and partition+pmap > approaches both used most of my cores and had similar perf. > > > Hey Stuart, > > Thanks for getting back to me. > > I've updated the code for my word count example on GitHub with (I believe) > something that works along the lines you suggest: > > https://github.com/paulbutcher/parallel-word-count > > Here are some sample runs on my machine (a 4-core retina MacBook Pro). > Each of these runs counts the words in the first 10000 pages of Wikipedia: > > $ lein run 10000 ~/enwiki.xml sequential > "Elapsed time: 23630.443 msecs" > $ lein run 10000 ~/enwiki.xml fold > "Elapsed time: 8697.79 msecs" > $ lein run 10000 ~/enwiki.xml pmap 10000 > "Elapsed time: 27393.703 msecs" > $ lein run 10000 ~/enwiki.xml pthenf 10000 > "Elapsed time: 37263.193 msecs" > > > As you can see, the the foldable-seq version gives an almost 3x speedup > relative to the sequential version, and both the partition-then-pmap and > partition-then-fold versions are significantly slower. > > The last argument for the partition-then-pmap and partition-then-fold > versions is a partition size. I've tried various different sizes with no > material effect: > > $ lein run 10000 ~/enwiki.xml pthenf 1000 > "Elapsed time: 43187.385 msecs" > $ lein run 10000 ~/enwiki.xml pthenf 100000 > "Elapsed time: 35702.651 msecs" > $ lein run 10000 ~/enwiki.xml pmap 1000 > "Elapsed time: 34087.314 msecs" > $ lein run 10000 ~/enwiki.xml pmap 100000 > "Elapsed time: 47340.969 msecs" > > > The performance of the partition-then-pmap version is actually much worse > than the numbers above suggest. There's something very weird going on with > (I guess) garbage collection - it takes a *long* time to finish after > printing the elapsed time and the performance is completely pathological > with larger page counts. > > Bottom line: the only way I've been able to obtain any kind of speedup > remains foldable-seq. > > I'd be very grateful indeed if you could take a look at how I've > implemented partition-then-fold to make sure that I've correctly captured > your intent. Or if you have any suggestions for anything else that might > work, or to explain the poor performance of partition-then-pmap and > partition-then-fold. > > My guess is that the problem with partition-then-fold is the copying > that's going on during the (into [] %). I can see that it is operating in > parallel because the number of cores in use goes up, but the net effect is > an overall slowdown rather than a speedup. > > That it performs worse than foldable-seq isn't surprising to me, given > that it introduces an unnecessary copy. > > I still think that it's a crying shame to disallow folding over sequences > - as the above shows, the gains both in performance and programming ease > are significant, and it would only take a small modification to the > reducers API to fix the holding-onto-head problem. What would be the > downside of making this modification and allowing foldable sequences? > > -- > paul.butcher->msgCount++ > > Snetterton, Castle Combe, Cadwell Park... > Who says I have a one track mind? > > http://www.paulbutcher.com/ > LinkedIn: http://www.linkedin.com/in/paulbutcher > MSN: pa...@paulbutcher.com <javascript:> > AIM: paulrabutcher > Skype: paulrabutcher > > On 28 Sep 2013, at 00:27, Stuart Halloway <stuart....@gmail.com<javascript:>> > wrote: > > Hi Paul, > > I have posted an example that shows partition-then-fold at > https://github.com/stuarthalloway/exploring-clojure/blob/master/examples/exploring/reducing_apple_pie.clj > . > > I would be curious to know how this approach performs with your data. > With the generated data I used, the partition+fold and partition+pmap > approaches both used most of my cores and had similar perf. > > Enjoying your book! > > Stu > > > On Sat, May 25, 2013 at 12:34 PM, Paul Butcher > <pa...@paulbutcher.com<javascript:> > > wrote: > >> I'm currently working on a book on concurrent/parallel development for >> The Pragmatic Programmers. One of the subjects I'm covering is parallel >> programming in Clojure, but I've hit a roadblock with one of the examples. >> I'm hoping that I can get some help to work through it here. >> >> The example counts the words contained within a Wikipedia dump. It should >> respond well to parallelisation (I have Java and Scala versions that >> perform excellently) but I've been incapable of coming up with a nice >> solution in Clojure. >> >> The code I'm working with is checked into >> GitHub<https://github.com/paulbutcher/parallel-word-count> >> : >> >> The basic sequential algorithm is: >> >> (frequencies (mapcat get-words pages)) >> >> >> If I run that on the first 10k pages in Wikipedia dump, it takes ~21s on >> my MacBook Pro. >> >> One way to parallelise it is to create a parallel version of frequencies >> that uses reducers: >> >> (defn frequencies-fold [words] >> (r/fold (partial merge-with +) >> (fn [counts word] (assoc counts word (inc (get counts word 0 >> )))) >> >> >> words)) >> >> >> And sure enough, if I use that, along with use the >> foldable-seq<https://github.com/paulbutcher/parallel-word-count/blob/master/src/foldable_seq/core.clj> >> utility >> I posted about >> here<https://groups.google.com/d/msg/clojure/8RKCjF00ukQ/b5mmmOB5Uh4J> are >> while ago it runs in ~8s, almost a 3x speedup, not bad given that the >> parallel version is unable to use transients. >> >> Unfortunately, as they currently stand, reducers have a fatal flaw that >> means that, even with foldable-seq, they're basically useless with lazy >> sequences. Reducers always hold onto the >> head<https://groups.google.com/d/msg/clojure-dev/qJo7z_9CVdw/ARnHe1bThuMJ> >> of >> the sequence they're given, so there's no way to use this approach for a >> complete Wikipedia dump (which runs to around 40GiB). >> >> So the other approach I've tried is to use pmap: >> >> (defn frequencies-pmap [words] >> (reduce (partial merge-with +) >> (pmap frequencies >> (partition-all 10000 words)))) >> >> >> But, for reasons I don't understand, this performs dreadfully - taking >> ~26s, i.e. significantly slower than the sequential version. >> >> I've tried playing with different partition sizes without materially >> affecting the result. >> >> So, what I'm looking for is either: >> >> a) An explanation for why the pmap-based solution performs so badly >> >> b) A way to fix the "holding onto head" problem that's inherent within >> reducers. >> >> With the last of these in mind, it strikes me that the problem >> fundamentally arises from the desire for reducers to follow the same basic >> API as "normal" code. So: >> >> (reduce (filter ... (map ... coll))) >> >> becomes: >> >> (r/fold (r/filter ... (r/map ... coll))) >> >> A very small change to the reducers API - passing the collection to the >> reduce and/or fold - would avoid the problem: >> >> (r/fold (r/filter ... (r/map ...)) coll) >> >> Anyway - I'd be very grateful for any insight into either of the above >> questions. Or for suggestions for an alternative approach that might be >> more fruitful. >> >> Many thanks in advance, >> >> -- >> paul.butcher->msgCount++ >> >> Snetterton, Castle Combe, Cadwell Park... >> Who says I have a one track mind? >> >> http://www.paulbutcher.com/ >> LinkedIn: http://www.linkedin.com/in/paulbutcher >> MSN: pa...@paulbutcher.com <javascript:> >> AIM: paulrabutcher >> Skype: paulrabutcher >> >> >> -- >> -- >> You received this message because you are subscribed to the Google >> Groups "Clojure" group. >> To post to this group, send email to clo...@googlegroups.com<javascript:> >> Note that posts from new members are moderated - please be patient with >> your first post. >> To unsubscribe from this group, send email to >> clojure+u...@googlegroups.com <javascript:> >> For more options, visit this group at >> http://groups.google.com/group/clojure?hl=en >> --- >> You received this message because you are subscribed to the Google Groups >> "Clojure" group. >> To unsubscribe from this group and stop receiving emails from it, send an >> email to clojure+u...@googlegroups.com <javascript:>. >> For more options, visit https://groups.google.com/groups/opt_out. >> >> >> > > > -- > -- > You received this message because you are subscribed to the Google > Groups "Clojure" group. > To post to this group, send email to clo...@googlegroups.com <javascript:> > Note that posts from new members are moderated - please be patient with > your first post. > To unsubscribe from this group, send email to > clojure+u...@googlegroups.com <javascript:> > For more options, visit this group at > http://groups.google.com/group/clojure?hl=en > --- > You received this message because you are subscribed to the Google Groups > "Clojure" group. > To unsubscribe from this group and stop receiving emails from it, send an > email to clojure+u...@googlegroups.com <javascript:>. > For more options, visit https://groups.google.com/groups/opt_out. > > > -- -- You received this message because you are subscribed to the Google Groups "Clojure" group. To post to this group, send email to clojure@googlegroups.com Note that posts from new members are moderated - please be patient with your first post. To unsubscribe from this group, send email to clojure+unsubscr...@googlegroups.com For more options, visit this group at http://groups.google.com/group/clojure?hl=en --- You received this message because you are subscribed to the Google Groups "Clojure" group. To unsubscribe from this group and stop receiving emails from it, send an email to clojure+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/groups/opt_out.