To be clear, I don't object to the approach, only to naming it "fold" and/or tying it to interfaces related to folding.
Stu

On Sat, Sep 28, 2013 at 5:29 PM, Paul Butcher <p...@paulbutcher.com> wrote:

> On 28 Sep 2013, at 22:00, Alex Miller <a...@puredanger.com> wrote:
>
>> Reducers (and fork/join in general) are best suited for fine-grained
>> computational parallelism on in-memory data. The problem in question
>> involves processing more data than will fit in memory.
>>
>> So the question is then what is the best way to parallelize computation
>> over the stream. There are many ways to do so - pmap is one easy mechanism
>> to get parallelism over a lazy sequence but the chunking behavior of pmap
>> can easily be non-optimal for particular use cases. Stu's other solution
>> (prepare-with-partition-then-fold) reads chunks of data into memory, then
>> uses f/j over the top of those chunks.
>
> Yes - I'm aware that that's how Stu's solution works. However, when I use
> that in my example, the performance sucks. As I said in my earlier message,
> I believe that the reason for this is the additional copying that it
> involves.
>
> To date, I have exactly one implementation that has good parallel
> performance - the one that uses foldable-seq (which is 3x faster than the
> sequential version). Everything else I've tried performs worse than the
> sequential version.
>
>> Yet another possible solution is to use a pool of compute threads and to
>> read from the stream, give those compute tasks to the pool to execute in a
>> background thread, then reassemble the results at the end (or in a
>> separate thread). The balance in this approach is to make the tasks the
>> right "chunkiness". Using a buffered queue is one technique to avoid
>> having your input reader overload the capacity of the processing system.
>
> That's basically *exactly* what the foldable-seq implementation does.
>
>> I would also mention that using transients as you build input collections
>> will be a big win.
> I'm sure that that's true - but the foldable-seq implementation is the one
> that would gain most from that, and that's the one that already runs 3x
> faster than the sequential implementation. But it's also the one that Stu
> objects to. What I'm trying to find is an implementation that Stu doesn't
> object to, but is faster than the sequential version of the algorithm.
>
> --
> paul.butcher->msgCount++
>
> Snetterton, Castle Combe, Cadwell Park...
> Who says I have a one track mind?
>
> http://www.paulbutcher.com/
> LinkedIn: http://www.linkedin.com/in/paulbutcher
> MSN: p...@paulbutcher.com
> AIM: paulrabutcher
> Skype: paulrabutcher
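Alex's transients tip, applied to the word-count reducing step, might look something like the following sketch (illustrative only, not code from the thread; `frequencies-transient` is a made-up name):

```clojure
;; Sketch: build the counts map with a transient to cut allocation.
;; Illustrative helper, not from the thread.
(defn frequencies-transient [words]
  (persistent!
   (reduce (fn [counts word]
             (assoc! counts word (inc (get counts word 0))))
           (transient {})
           words)))
```

This is essentially how clojure.core/frequencies is already implemented; the win Alex describes comes from applying the same trick while assembling the input chunks themselves.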
> On Saturday, September 28, 2013 11:14:41 AM UTC-5, Jozef Wagner wrote:
>
>> I would go a bit further and suggest that you do not use sequences at all
>> and work only with reducible/foldable collections. Make an input reader
>> which returns a foldable collection and you will have the most performant
>> solution. The thing about holding onto the head is being worked on right
>> now, see http://dev.clojure.org/jira/browse/CLJ-1250
>>
>> JW
>>
>> On Saturday, September 28, 2013 10:41:20 AM UTC+2, Paul Butcher wrote:
>>
>>> On 28 Sep 2013, at 00:27, Stuart Halloway <stuart....@gmail.com> wrote:
>>>
>>> I have posted an example that shows partition-then-fold at
>>> https://github.com/stuarthalloway/exploring-clojure/blob/master/examples/exploring/reducing_apple_pie.clj
>>>
>>> I would be curious to know how this approach performs with your data.
>>> With the generated data I used, the partition+fold and partition+pmap
>>> approaches both used most of my cores and had similar perf.
>>>
>>> Hey Stuart,
>>>
>>> Thanks for getting back to me.
>>>
>>> I've updated the code for my word count example on GitHub with (I
>>> believe) something that works along the lines you suggest:
>>>
>>> https://github.com/paulbutcher/parallel-word-count
>>>
>>> Here are some sample runs on my machine (a 4-core retina MacBook Pro).
>>> Each of these runs counts the words in the first 10000 pages of
>>> Wikipedia:
>>>
>>> $ lein run 10000 ~/enwiki.xml sequential
>>> "Elapsed time: 23630.443 msecs"
>>> $ lein run 10000 ~/enwiki.xml fold
>>> "Elapsed time: 8697.79 msecs"
>>> $ lein run 10000 ~/enwiki.xml pmap 10000
>>> "Elapsed time: 27393.703 msecs"
>>> $ lein run 10000 ~/enwiki.xml pthenf 10000
>>> "Elapsed time: 37263.193 msecs"
>>>
>>> As you can see, the foldable-seq version gives an almost 3x speedup
>>> relative to the sequential version, and both the partition-then-pmap and
>>> partition-then-fold versions are significantly slower.
>>>
>>> The last argument for the partition-then-pmap and partition-then-fold
>>> versions is a partition size. I've tried various different sizes with no
>>> material effect:
>>>
>>> $ lein run 10000 ~/enwiki.xml pthenf 1000
>>> "Elapsed time: 43187.385 msecs"
>>> $ lein run 10000 ~/enwiki.xml pthenf 100000
>>> "Elapsed time: 35702.651 msecs"
>>> $ lein run 10000 ~/enwiki.xml pmap 1000
>>> "Elapsed time: 34087.314 msecs"
>>> $ lein run 10000 ~/enwiki.xml pmap 100000
>>> "Elapsed time: 47340.969 msecs"
>>>
>>> The performance of the partition-then-pmap version is actually much
>>> worse than the numbers above suggest. There's something very weird going
>>> on with (I guess) garbage collection - it takes a *long* time to finish
>>> after printing the elapsed time, and the performance is completely
>>> pathological with larger page counts.
>>>
>>> Bottom line: the only way I've been able to obtain any kind of speedup
>>> remains foldable-seq.
>>>
>>> I'd be very grateful indeed if you could take a look at how I've
>>> implemented partition-then-fold to make sure that I've correctly
>>> captured your intent. Or if you have any suggestions for anything else
>>> that might work, or to explain the poor performance of
>>> partition-then-pmap and partition-then-fold.
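For readers without the repo to hand, the partition-then-fold shape under discussion reads chunks of the lazy sequence into in-memory vectors, folds each one, and merges the partial results - roughly like this sketch (a reconstruction, not Paul's actual code; `frequencies-pthenf` is an illustrative name):

```clojure
;; Rough sketch of the partition-then-fold shape (illustrative, not the
;; actual code from the repo): copy each partition into a vector so that
;; r/fold can use fork/join over it, then merge the per-chunk counts.
(require '[clojure.core.reducers :as r])

(defn frequencies-pthenf [words chunk-size]
  (reduce (partial merge-with +)
          (map (fn [chunk]
                 (r/fold (partial merge-with +)
                         (fn [counts word]
                           (assoc counts word (inc (get counts word 0))))
                         (into [] chunk)))   ; <- the extra copy
               (partition-all chunk-size words))))
```

The `(into [] chunk)` step is the copy that Paul fingers as the likely culprit for the slowdown.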
>>> My guess is that the problem with partition-then-fold is the copying
>>> that's going on during the (into [] %). I can see that it is operating
>>> in parallel because the number of cores in use goes up, but the net
>>> effect is an overall slowdown rather than a speedup.
>>>
>>> That it performs worse than foldable-seq isn't surprising to me, given
>>> that it introduces an unnecessary copy.
>>>
>>> I still think that it's a crying shame to disallow folding over
>>> sequences - as the above shows, the gains both in performance and
>>> programming ease are significant, and it would only take a small
>>> modification to the reducers API to fix the holding-onto-head problem.
>>> What would be the downside of making this modification and allowing
>>> foldable sequences?
>>>
>>> On 28 Sep 2013, at 00:27, Stuart Halloway <stuart....@gmail.com> wrote:
>>>
>>> Hi Paul,
>>>
>>> [...]
>>>
>>> Enjoying your book!
>>>
>>> Stu
>>>
>>> On Sat, May 25, 2013 at 12:34 PM, Paul Butcher <pa...@paulbutcher.com>
>>> wrote:
>>>
>>>> I'm currently working on a book on concurrent/parallel development for
>>>> The Pragmatic Programmers.
>>>> One of the subjects I'm covering is parallel programming in Clojure,
>>>> but I've hit a roadblock with one of the examples. I'm hoping that I
>>>> can get some help to work through it here.
>>>>
>>>> The example counts the words contained within a Wikipedia dump. It
>>>> should respond well to parallelisation (I have Java and Scala versions
>>>> that perform excellently) but I've been unable to come up with a nice
>>>> solution in Clojure.
>>>>
>>>> The code I'm working with is checked into GitHub:
>>>> https://github.com/paulbutcher/parallel-word-count
>>>>
>>>> The basic sequential algorithm is:
>>>>
>>>> (frequencies (mapcat get-words pages))
>>>>
>>>> If I run that on the first 10k pages of the Wikipedia dump, it takes
>>>> ~21s on my MacBook Pro.
>>>>
>>>> One way to parallelise it is to create a parallel version of
>>>> frequencies that uses reducers:
>>>>
>>>> (defn frequencies-fold [words]
>>>>   (r/fold (partial merge-with +)
>>>>           (fn [counts word]
>>>>             (assoc counts word (inc (get counts word 0))))
>>>>           words))
>>>>
>>>> And sure enough, if I use that, along with the foldable-seq utility
>>>> (https://github.com/paulbutcher/parallel-word-count/blob/master/src/foldable_seq/core.clj)
>>>> I posted about a while ago
>>>> (https://groups.google.com/d/msg/clojure/8RKCjF00ukQ/b5mmmOB5Uh4J),
>>>> it runs in ~8s, almost a 3x speedup - not bad, given that the parallel
>>>> version is unable to use transients.
>>>>
>>>> Unfortunately, as they currently stand, reducers have a fatal flaw
>>>> that means that, even with foldable-seq, they're basically useless
>>>> with lazy sequences. Reducers always hold onto the head of the
>>>> sequence they're given
>>>> (https://groups.google.com/d/msg/clojure-dev/qJo7z_9CVdw/ARnHe1bThuMJ),
>>>> so there's no way to use this approach for a complete Wikipedia dump
>>>> (which runs to around 40GiB).
>>>> So the other approach I've tried is to use pmap:
>>>>
>>>> (defn frequencies-pmap [words]
>>>>   (reduce (partial merge-with +)
>>>>           (pmap frequencies
>>>>                 (partition-all 10000 words))))
>>>>
>>>> But, for reasons I don't understand, this performs dreadfully - taking
>>>> ~26s, i.e. significantly slower than the sequential version.
>>>>
>>>> I've tried playing with different partition sizes without materially
>>>> affecting the result.
>>>>
>>>> So, what I'm looking for is either:
>>>>
>>>> a) An explanation for why the pmap-based solution performs so badly, or
>>>>
>>>> b) A way to fix the "holding onto head" problem that's inherent within
>>>> reducers.
>>>>
>>>> With the last of these in mind, it strikes me that the problem
>>>> fundamentally arises from the desire for reducers to follow the same
>>>> basic API as "normal" code. So:
>>>>
>>>> (reduce (filter ... (map ... coll)))
>>>>
>>>> becomes:
>>>>
>>>> (r/fold (r/filter ... (r/map ... coll)))
>>>>
>>>> A very small change to the reducers API - passing the collection to
>>>> the reduce and/or fold - would avoid the problem:
>>>>
>>>> (r/fold (r/filter ... (r/map ...)) coll)
>>>>
>>>> Anyway - I'd be very grateful for any insight into either of the above
>>>> questions, or for suggestions for an alternative approach that might
>>>> be more fruitful.
>>>>
>>>> Many thanks in advance,
>>>>
>>>> --
>>>> paul.butcher->msgCount++
>>>>
>>>> Snetterton, Castle Combe, Cadwell Park...
>>>> Who says I have a one track mind?
>>>>
>>>> http://www.paulbutcher.com/
>>>> LinkedIn: http://www.linkedin.com/in/paulbutcher
>>>> MSN: pa...@paulbutcher.com
>>>> AIM: paulrabutcher
>>>> Skype: paulrabutcher
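The API change Paul proposes - r/map and r/filter returning a transformation rather than taking the collection directly, with fold supplying the collection last - could be sketched in user space along these lines (a sketch only; `map*`, `filter*`, and `fold*` are hypothetical names, and this is not the actual clojure.core.reducers API):

```clojure
;; Sketch of the proposed "collection last" shape, built on top of the
;; existing reducers API. Hypothetical helpers, not from the thread.
(require '[clojure.core.reducers :as r])

(defn map*    [f]    (fn [coll] (r/map f coll)))
(defn filter* [pred] (fn [coll] (r/filter pred coll)))

;; fold* takes the collection as its final argument, so no reducer in
;; the pipeline needs to hold onto it up front.
(defn fold* [reducef xform coll]
  (r/fold reducef (xform coll)))

;; comp applies right-to-left: map* first, then filter*.
;; (fold* + (comp (filter* even?) (map* inc)) (vec (range 1 5))) ; => 6
```

Because the collection is only supplied at the final fold, nothing in the pipeline needs to retain a reference to the sequence head before reduction begins - which is exactly the property Paul is after.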
--
You received this message because you are subscribed to the Google Groups
"Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with
your first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en
---
You received this message because you are subscribed to the Google Groups
"Clojure" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to clojure+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.