Re: Parallelising over a lazy sequence - request for help

Paul Butcher Sat, 28 Sep 2013 01:42:26 -0700

On 28 Sep 2013, at 00:27, Stuart Halloway <[email protected]> wrote:


> I have posted an example that shows partition-then-fold at 
> https://github.com/stuarthalloway/exploring-clojure/blob/master/examples/exploring/reducing_apple_pie.clj.
> 
> I would be curious to know how this approach performs with your data.  With 
> the generated data I used, the partition+fold and partition+pmap approaches 
> both used most of my cores and had similar perf.

Hey Stuart, 

Thanks for getting back to me.

I've updated the code for my word count example on GitHub with (I believe) 
something that works along the lines you suggest:

https://github.com/paulbutcher/parallel-word-count

Here are some sample runs on my machine (a 4-core retina MacBook Pro). Each of 
these runs counts the words in the first 10000 pages of Wikipedia:

> $ lein run 10000 ~/enwiki.xml sequential
> "Elapsed time: 23630.443 msecs"
> $ lein run 10000 ~/enwiki.xml fold
> "Elapsed time: 8697.79 msecs"
> $ lein run 10000 ~/enwiki.xml pmap 10000
> "Elapsed time: 27393.703 msecs"
> $ lein run 10000 ~/enwiki.xml pthenf 10000
> "Elapsed time: 37263.193 msecs"

As you can see, the the foldable-seq version gives an almost 3x speedup 
relative to the sequential version, and both the partition-then-pmap and 
partition-then-fold versions are significantly slower.

The last argument for the partition-then-pmap and partition-then-fold versions 
is a partition size. I've tried various different sizes with no material effect:

> $ lein run 10000 ~/enwiki.xml pthenf 1000
> "Elapsed time: 43187.385 msecs"
> $ lein run 10000 ~/enwiki.xml pthenf 100000
> "Elapsed time: 35702.651 msecs"
> $ lein run 10000 ~/enwiki.xml pmap 1000
> "Elapsed time: 34087.314 msecs"
> $ lein run 10000 ~/enwiki.xml pmap 100000
> "Elapsed time: 47340.969 msecs"

The performance of the partition-then-pmap version is actually much worse than 
the numbers above suggest. There's something very weird going on with (I guess) 
garbage collection - it takes a *long* time to finish after printing the 
elapsed time and the performance is completely pathological with larger page 
counts.

Bottom line: the only way I've been able to obtain any kind of speedup remains 
foldable-seq.

I'd be very grateful indeed if you could take a look at how I've implemented 
partition-then-fold to make sure that I've correctly captured your intent. Or 
if you have any suggestions for anything else that might work, or to explain 
the poor performance of partition-then-pmap and partition-then-fold.

My guess is that the problem with partition-then-fold is the copying that's 
going on during the (into [] %). I can see that it is operating in parallel 
because the number of cores in use goes up, but the net effect is an overall 
slowdown rather than a speedup.

That it performs worse than foldable-seq isn't surprising to me, given that it 
introduces an unnecessary copy.

I still think that it's a crying shame to disallow folding over sequences - as 
the above shows, the gains both in performance and programming ease are 
significant, and it would only take a small modification to the reducers API to 
fix the holding-onto-head problem. What would be the downside of making this 
modification and allowing foldable sequences?

--
paul.butcher->msgCount++

Snetterton, Castle Combe, Cadwell Park...
Who says I have a one track mind?

http://www.paulbutcher.com/
LinkedIn: http://www.linkedin.com/in/paulbutcher
MSN: [email protected]
AIM: paulrabutcher
Skype: paulrabutcher

On 28 Sep 2013, at 00:27, Stuart Halloway <[email protected]> wrote:

> Hi Paul,
> 
> I have posted an example that shows partition-then-fold at 
> https://github.com/stuarthalloway/exploring-clojure/blob/master/examples/exploring/reducing_apple_pie.clj.
> 
> I would be curious to know how this approach performs with your data.  With 
> the generated data I used, the partition+fold and partition+pmap approaches 
> both used most of my cores and had similar perf.
> 
> Enjoying your book!
> 
> Stu
> 
> 
> On Sat, May 25, 2013 at 12:34 PM, Paul Butcher <[email protected]> wrote:
> I'm currently working on a book on concurrent/parallel development for The 
> Pragmatic Programmers. One of the subjects I'm covering is parallel 
> programming in Clojure, but I've hit a roadblock with one of the examples. 
> I'm hoping that I can get some help to work through it here.
> 
> The example counts the words contained within a Wikipedia dump. It should 
> respond well to parallelisation (I have Java and Scala versions that perform 
> excellently) but I've been incapable of coming up with a nice solution in 
> Clojure.
> 
> The code I'm working with is checked into GitHub: 
> 
> The basic sequential algorithm is:
> 
>> (frequencies (mapcat get-words pages))
> 
> 
> If I run that on the first 10k pages in Wikipedia dump, it takes ~21s on my 
> MacBook Pro.
> 
> One way to parallelise it is to create a parallel version of frequencies that 
> uses reducers:
> 
>> (defn frequencies-fold [words]
>>   (r/fold (partial merge-with +)
>>           (fn [counts word] (assoc counts word (inc (get counts word 0))))
>> 
>>           words))
> 
> And sure enough, if I use that, along with use the foldable-seq utility I 
> posted about here are while ago it runs in ~8s, almost a 3x speedup, not bad 
> given that the parallel version is unable to use transients.
> 
> Unfortunately, as they currently stand, reducers have a fatal flaw that means 
> that, even with foldable-seq, they're basically useless with lazy sequences. 
> Reducers always hold onto the head of the sequence they're given, so there's 
> no way to use this approach for a complete Wikipedia dump (which runs to 
> around 40GiB).
> 
> So the other approach I've tried is to use pmap:
> 
>> (defn frequencies-pmap [words]
>> 
>>   (reduce (partial merge-with +) 
>> 
>>     (pmap frequencies 
>> 
>>       (partition-all 10000 words))))
>> 
> 
> But, for reasons I don't understand, this performs dreadfully - taking ~26s, 
> i.e. significantly slower than the sequential version.
> 
> I've tried playing with different partition sizes without materially 
> affecting the result.
> 
> So, what I'm looking for is either:
> 
> a) An explanation for why the pmap-based solution performs so badly
> 
> b) A way to fix the "holding onto head" problem that's inherent within 
> reducers.
> 
> With the last of these in mind, it strikes me that the problem fundamentally 
> arises from the desire for reducers to follow the same basic API as "normal" 
> code. So:
> 
> (reduce (filter ... (map ... coll)))
> 
> becomes:
> 
> (r/fold (r/filter ... (r/map ... coll)))
> 
> A very small change to the reducers API - passing the collection to the 
> reduce and/or fold - would avoid the problem:
> 
> (r/fold (r/filter ... (r/map ...)) coll)
> 
> Anyway - I'd be very grateful for any insight into either of the above 
> questions. Or for suggestions for an alternative approach that might be more 
> fruitful.
> 
> Many thanks in advance,
> 
> --
> paul.butcher->msgCount++
> 
> Snetterton, Castle Combe, Cadwell Park...
> Who says I have a one track mind?
> 
> http://www.paulbutcher.com/
> LinkedIn: http://www.linkedin.com/in/paulbutcher
> MSN: [email protected]
> AIM: paulrabutcher
> Skype: paulrabutcher
> 
> 
> -- 
> -- 
> You received this message because you are subscribed to the Google
> Groups "Clojure" group.
> To post to this group, send email to [email protected]
> Note that posts from new members are moderated - please be patient with your 
> first post.
> To unsubscribe from this group, send email to
> [email protected]
> For more options, visit this group at
> http://groups.google.com/group/clojure?hl=en
> --- 
> You received this message because you are subscribed to the Google Groups 
> "Clojure" group.
> To unsubscribe from this group and stop receiving emails from it, send an 
> email to [email protected].
> For more options, visit https://groups.google.com/groups/opt_out.
>  
>  
> 
> 
> -- 
> -- 
> You received this message because you are subscribed to the Google
> Groups "Clojure" group.
> To post to this group, send email to [email protected]
> Note that posts from new members are moderated - please be patient with your 
> first post.
> To unsubscribe from this group, send email to
> [email protected]
> For more options, visit this group at
> http://groups.google.com/group/clojure?hl=en
> --- 
> You received this message because you are subscribed to the Google Groups 
> "Clojure" group.
> To unsubscribe from this group and stop receiving emails from it, send an 
> email to [email protected].
> For more options, visit https://groups.google.com/groups/opt_out.

-- 
-- 
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to [email protected]
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en
--- 
You received this message because you are subscribed to the Google Groups 
"Clojure" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
For more options, visit https://groups.google.com/groups/opt_out.

Re: Parallelising over a lazy sequence - request for help

Reply via email to