I do not know about the most important parts of your performance difficulties,
but on a more trivial point I might be able to shed some light.
See the ClojureDocs page for pmap, which refers to the page for future, linked
below. If you call (shutdown-agents) the 60-second wait to exit should go away.
http://clojuredocs.org/clojure_core/clojure.core/future
Andy
Sent from my iPhone
On Sep 28, 2013, at 1:41 AM, Paul Butcher <[email protected]> wrote:
> On 28 Sep 2013, at 00:27, Stuart Halloway <[email protected]> wrote:
>
>> I have posted an example that shows partition-then-fold at
>> https://github.com/stuarthalloway/exploring-clojure/blob/master/examples/exploring/reducing_apple_pie.clj.
>>
>> I would be curious to know how this approach performs with your data. With
>> the generated data I used, the partition+fold and partition+pmap approaches
>> both used most of my cores and had similar perf.
>
> Hey Stuart,
>
> Thanks for getting back to me.
>
> I've updated the code for my word count example on GitHub with (I believe)
> something that works along the lines you suggest:
>
> https://github.com/paulbutcher/parallel-word-count
>
> Here are some sample runs on my machine (a 4-core retina MacBook Pro). Each
> of these runs counts the words in the first 10000 pages of Wikipedia:
>
>> $ lein run 10000 ~/enwiki.xml sequential
>> "Elapsed time: 23630.443 msecs"
>> $ lein run 10000 ~/enwiki.xml fold
>> "Elapsed time: 8697.79 msecs"
>> $ lein run 10000 ~/enwiki.xml pmap 10000
>> "Elapsed time: 27393.703 msecs"
>> $ lein run 10000 ~/enwiki.xml pthenf 10000
>> "Elapsed time: 37263.193 msecs"
>
> As you can see, the the foldable-seq version gives an almost 3x speedup
> relative to the sequential version, and both the partition-then-pmap and
> partition-then-fold versions are significantly slower.
>
> The last argument for the partition-then-pmap and partition-then-fold
> versions is a partition size. I've tried various different sizes with no
> material effect:
>
>> $ lein run 10000 ~/enwiki.xml pthenf 1000
>> "Elapsed time: 43187.385 msecs"
>> $ lein run 10000 ~/enwiki.xml pthenf 100000
>> "Elapsed time: 35702.651 msecs"
>> $ lein run 10000 ~/enwiki.xml pmap 1000
>> "Elapsed time: 34087.314 msecs"
>> $ lein run 10000 ~/enwiki.xml pmap 100000
>> "Elapsed time: 47340.969 msecs"
>
> The performance of the partition-then-pmap version is actually much worse
> than the numbers above suggest. There's something very weird going on with (I
> guess) garbage collection - it takes a *long* time to finish after printing
> the elapsed time and the performance is completely pathological with larger
> page counts.
>
> Bottom line: the only way I've been able to obtain any kind of speedup
> remains foldable-seq.
>
> I'd be very grateful indeed if you could take a look at how I've implemented
> partition-then-fold to make sure that I've correctly captured your intent. Or
> if you have any suggestions for anything else that might work, or to explain
> the poor performance of partition-then-pmap and partition-then-fold.
>
> My guess is that the problem with partition-then-fold is the copying that's
> going on during the (into [] %). I can see that it is operating in parallel
> because the number of cores in use goes up, but the net effect is an overall
> slowdown rather than a speedup.
>
> That it performs worse than foldable-seq isn't surprising to me, given that
> it introduces an unnecessary copy.
>
> I still think that it's a crying shame to disallow folding over sequences -
> as the above shows, the gains both in performance and programming ease are
> significant, and it would only take a small modification to the reducers API
> to fix the holding-onto-head problem. What would be the downside of making
> this modification and allowing foldable sequences?
>
> --
> paul.butcher->msgCount++
>
> Snetterton, Castle Combe, Cadwell Park...
> Who says I have a one track mind?
>
> http://www.paulbutcher.com/
> LinkedIn: http://www.linkedin.com/in/paulbutcher
> MSN: [email protected]
> AIM: paulrabutcher
> Skype: paulrabutcher
>
> On 28 Sep 2013, at 00:27, Stuart Halloway <[email protected]> wrote:
>
>> Hi Paul,
>>
>> I have posted an example that shows partition-then-fold at
>> https://github.com/stuarthalloway/exploring-clojure/blob/master/examples/exploring/reducing_apple_pie.clj.
>>
>> I would be curious to know how this approach performs with your data. With
>> the generated data I used, the partition+fold and partition+pmap approaches
>> both used most of my cores and had similar perf.
>>
>> Enjoying your book!
>>
>> Stu
>>
>>
>> On Sat, May 25, 2013 at 12:34 PM, Paul Butcher <[email protected]> wrote:
>>> I'm currently working on a book on concurrent/parallel development for The
>>> Pragmatic Programmers. One of the subjects I'm covering is parallel
>>> programming in Clojure, but I've hit a roadblock with one of the examples.
>>> I'm hoping that I can get some help to work through it here.
>>>
>>> The example counts the words contained within a Wikipedia dump. It should
>>> respond well to parallelisation (I have Java and Scala versions that
>>> perform excellently) but I've been incapable of coming up with a nice
>>> solution in Clojure.
>>>
>>> The code I'm working with is checked into GitHub:
>>>
>>> The basic sequential algorithm is:
>>>
>>>> (frequencies (mapcat get-words pages))
>>>
>>>
>>> If I run that on the first 10k pages in Wikipedia dump, it takes ~21s on my
>>> MacBook Pro.
>>>
>>> One way to parallelise it is to create a parallel version of frequencies
>>> that uses reducers:
>>>
>>>> (defn frequencies-fold [words]
>>>> (r/fold (partial merge-with +)
>>>> (fn [counts word] (assoc counts word (inc (get counts word 0))))
>>>>
>>>> words))
>>>
>>> And sure enough, if I use that, along with use the foldable-seq utility I
>>> posted about here are while ago it runs in ~8s, almost a 3x speedup, not
>>> bad given that the parallel version is unable to use transients.
>>>
>>> Unfortunately, as they currently stand, reducers have a fatal flaw that
>>> means that, even with foldable-seq, they're basically useless with lazy
>>> sequences. Reducers always hold onto the head of the sequence they're
>>> given, so there's no way to use this approach for a complete Wikipedia dump
>>> (which runs to around 40GiB).
>>>
>>> So the other approach I've tried is to use pmap:
>>>
>>>> (defn frequencies-pmap [words]
>>>>
>>>> (reduce (partial merge-with +)
>>>>
>>>> (pmap frequencies
>>>>
>>>> (partition-all 10000 words))))
>>>
>>> But, for reasons I don't understand, this performs dreadfully - taking
>>> ~26s, i.e. significantly slower than the sequential version.
>>>
>>> I've tried playing with different partition sizes without materially
>>> affecting the result.
>>>
>>> So, what I'm looking for is either:
>>>
>>> a) An explanation for why the pmap-based solution performs so badly
>>>
>>> b) A way to fix the "holding onto head" problem that's inherent within
>>> reducers.
>>>
>>> With the last of these in mind, it strikes me that the problem
>>> fundamentally arises from the desire for reducers to follow the same basic
>>> API as "normal" code. So:
>>>
>>> (reduce (filter ... (map ... coll)))
>>>
>>> becomes:
>>>
>>> (r/fold (r/filter ... (r/map ... coll)))
>>>
>>> A very small change to the reducers API - passing the collection to the
>>> reduce and/or fold - would avoid the problem:
>>>
>>> (r/fold (r/filter ... (r/map ...)) coll)
>>>
>>> Anyway - I'd be very grateful for any insight into either of the above
>>> questions. Or for suggestions for an alternative approach that might be
>>> more fruitful.
>>>
>>> Many thanks in advance,
>>>
>>> --
>>> paul.butcher->msgCount++
>>>
>>> Snetterton, Castle Combe, Cadwell Park...
>>> Who says I have a one track mind?
>>>
>>> http://www.paulbutcher.com/
>>> LinkedIn: http://www.linkedin.com/in/paulbutcher
>>> MSN: [email protected]
>>> AIM: paulrabutcher
>>> Skype: paulrabutcher
>>>
>>>
>>> --
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "Clojure" group.
>>> To post to this group, send email to [email protected]
>>> Note that posts from new members are moderated - please be patient with
>>> your first post.
>>> To unsubscribe from this group, send email to
>>> [email protected]
>>> For more options, visit this group at
>>> http://groups.google.com/group/clojure?hl=en
>>> ---
>>> You received this message because you are subscribed to the Google Groups
>>> "Clojure" group.
>>> To unsubscribe from this group and stop receiving emails from it, send an
>>> email to [email protected].
>>> For more options, visit https://groups.google.com/groups/opt_out.
>>
>>
>> --
>> --
>> You received this message because you are subscribed to the Google
>> Groups "Clojure" group.
>> To post to this group, send email to [email protected]
>> Note that posts from new members are moderated - please be patient with your
>> first post.
>> To unsubscribe from this group, send email to
>> [email protected]
>> For more options, visit this group at
>> http://groups.google.com/group/clojure?hl=en
>> ---
>> You received this message because you are subscribed to the Google Groups
>> "Clojure" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
>> For more options, visit https://groups.google.com/groups/opt_out.
>
> --
> --
> You received this message because you are subscribed to the Google
> Groups "Clojure" group.
> To post to this group, send email to [email protected]
> Note that posts from new members are moderated - please be patient with your
> first post.
> To unsubscribe from this group, send email to
> [email protected]
> For more options, visit this group at
> http://groups.google.com/group/clojure?hl=en
> ---
> You received this message because you are subscribed to the Google Groups
> "Clojure" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> For more options, visit https://groups.google.com/groups/opt_out.
--
--
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to [email protected]
Note that posts from new members are moderated - please be patient with your
first post.
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en
---
You received this message because you are subscribed to the Google Groups
"Clojure" group.
To unsubscribe from this group and stop receiving emails from it, send an email
to [email protected].
For more options, visit https://groups.google.com/groups/opt_out.