A weakness of my pmapall:

#<Exception java.lang.Exception: Can't await in agent action>

Which means, I think, that I can't call pmapall within a function that I pass 
to pmapall. Unfortunate.

Is there a better way?
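
Maybe something built on futures instead of agents would sidestep that 
restriction, since deref (unlike await) can be called from inside another 
task. Something like this, perhaps (an untested sketch; note that futures run 
on an unbounded thread pool, so it starts one thread per element):

(defn pmapall-via-futures
  "Untested futures-based sketch of pmapall. Because deref (unlike
   await) may be called from within another task, nesting should work."
  [f coll]
  (let [futs (doall (map #(future (f %)) coll))] ; start all tasks eagerly
    (doall (map deref futs))))                   ; block for each result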

 -Lee

PS: to see these exceptions one must replace the call to agent in my 
definition with something that installs an error handler, e.g.:


On Oct 10, 2011, at 4:07 PM, Lee Spector wrote:

> 
> I think that the following partially answers my own question and that it 
> provides a way to get decent multicore performance for collections of 
> non-uniform but compute-intensive tasks through a simple, pmap-like interface.
> 
> But I'm not sure if it's the best approach and I'd like some feedback. If it 
> *is* a good approach then maybe we should refine it and make it more widely 
> available. If it's not the best approach then I'd love some advice on how to 
> do it better.
> 
> The use case here -- which I think must be shared by at least some others -- 
> is that I have a finite, non-lazy collection of inputs on which I'd like to 
> run an expensive (but not uniformly expensive) function, gathering the 
> results in a non-lazy sequence. This is a *very* common need in my own 
> projects. I don't care about the order in which the function calls are made, 
> and I'm not concerned about the memory overhead of retaining all of the 
> results since I want to keep them all anyway. I just want all of the 
> computations done as quickly as possible, using all of my available cores. 
> The pmap function seems at first to provide an elegant way to do what's 
> needed, e.g. with (doall (pmap f inputs)), but as discussed earlier in this 
> thread it will often wait for the completion of earlier calls before starting 
> later calls, and this will be particularly problematic when the runtimes of 
> the calls are uneven.
> 
> The medusa package provides something that would seem to fit the bill better, 
> but it comes with other baggage that I don't need or want. I just want a 
> version of pmap that will use all available CPU resources to aggressively 
> complete all of the computations in a non-lazy context.
> 
> Here's my stab at doing this using agents:
> 
> (defn pmapall
>  "Like pmap but: 1) coll should be finite, 2) the returned sequence
>   will not be lazy, 3) calls to f may occur in any order, to maximize
>   multicore processor utilization, and 4) takes only one coll so far."
>  [f coll]
>  (let [agents (map agent coll)]
>    (dorun (map #(send % f) agents))
>    (apply await agents)
>    (doall (map deref agents))))
> 
> I should make a version that takes multiple colls, but for now I've written 
> it to take just one for clarity.
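> 
> A sketch of what a multiple-coll version might look like (untested; each 
> agent holds one tuple of corresponding elements and is sent (apply f args), 
> as pmap would do):
> 
> (defn pmapall-multi
>  "Untested sketch: like pmapall, but takes multiple colls and calls
>   (apply f args) on corresponding elements."
>  [f & colls]
>  (let [agents (apply map (fn [& args] (agent args)) colls)]
>    (dorun (map #(send % (fn [args] (apply f args))) agents))
>    (apply await agents)
>    (doall (map deref agents))))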
> 
> This does appear to speed things up pretty significantly in certain 
> circumstances, but maybe not as much as possible. Is it the best approach?
> 
> To show that it beats pmap I define a time-wasting function (I want to see 
> real cpu utilization so I'm not using delays) like this:
> 
> (defn burn
>  []
>  (dotimes [i 10000]
>    (reduce * (map float (take 1000 (iterate inc i))))))
> 
> And then I define a function that takes a lot or a little bit of time 
> depending on its argument:
> 
> (defn fast-or-slow
>  [n]
>  (if (zero? n) 
>    :done
>    (do (burn)
>        :done)))
> 
> And then I create a vector of inputs in which the slow ones are scattered 
> sparsely:
> 
> (def inputs (take 1000 (cycle (conj (repeat 20 0) 1))))
> 
> And then on a 48 core node I get timings like this:
> 
> user=> (time (last (pmapall fast-or-slow inputs)))
> "Elapsed time: 37244.151 msecs"
> :done
> 
> user=> (time (last (doall (pmap fast-or-slow inputs))))
> "Elapsed time: 110862.187 msecs"
> :done
> 
> And by the way, plain old serial map does this:
> 
> user=> (time (last (doall (map fast-or-slow inputs))))
> "Elapsed time: 260941.836 msecs"
> 
> So we've improved things; pmap is a little better than twice as fast as map, 
> and pmapall is roughly 3 times faster than pmap. So I think I'm ready to 
> switch all of my pmaps to pmapalls. But that's still nothing close to a 48x 
> speedup, even though all of the tasks should be completely independent and I 
> wouldn't expect a huge loss for coordination. And another confusing thing is 
> that even with pmapall, when I look at the CPU utilization I see numbers 
> close to 4800% in some cases (like the one above) but numbers topping out at 
> something more like 1900% in some others (e.g. with different input vectors).
> 
> So I feel like I'm moving in the right direction but that I'm still probably 
> missing something.
> 
> Is there a better way to do this? Surely it will come in handy for others as 
> well if there's a simple way to more effectively dispatch tasks to multiple 
> cores.
> 
> Thoughts? Code?
> 
> Thanks,
> 
> -Lee
> 
> 
> On Oct 9, 2011, at 7:24 PM, Lee Spector wrote:
> 
>> 
>> I've been playing with medusa and it sometimes does what I expect, but 
>> sometimes it does something strange. I'm wondering if someone can help me do 
>> one specific medusa-like thing more simply (and without the strangeness, 
>> which I haven't fully traced and hope not to have to).
>> 
>> What I want is a version of pmap that will always use any available cores to 
>> compute remaining values (except of course the last couple of values, when 
>> there are fewer remaining values than cores). 
>> 
>> In other words, I want the behavior that Andy Fingerhut describes medusa as 
>> having here:
>> 
>>> On Sep 22, 2011, at 11:34 PM, Andy Fingerhut wrote:
>>> 
>>>> pmap will limit the maximum number of simultaneous threads.  So will the 
>>>> medusa library's medusa-pmap.
>>>> 
>>>> The difference is that if one job early in the list takes, e.g., 100 times 
>>>> longer than the next 20, and you have 4 cpus available, pmap will start 
>>>> the first (4+2)=6 in parallel threads, let jobs 2 through 6 complete, and 
>>>> then wait for the first one to finish before starting number 7.  Thus most 
>>>> of the time will be spent running one thread.  This has the advantage of 
>>>> limiting the memory required to store intermediate results, but the 
>>>> disadvantage of low usage of multiple cpu cores.
>>>> 
>>>> Medusa has a medusa-pmap that will also limit the parallelism, but it lets 
>>>> you pick the level of parallelism (it isn't fixed at (# cpus + 2)), and it 
>>>> will continue starting new threads, even if that means using more memory 
>>>> when one job takes much longer than the following jobs.
>>>> 
>> 
>> FWIW I'll always be calling this on a finite sequence (although it may have 
>> 10000 elements or so, so I presume that I shouldn't start threads for all of 
>> them at once), and I will want a fully realized result sequence so I'll 
>> surround calls with doall if necessary. I will want all of the pmapped 
>> computations to finish before I continue doing anything else.
>> 
>> I know that I could launch the threads in other ways -- e.g. I sometimes use 
>> agents for something similar -- but pmap is much more elegant in many cases, 
>> and anyway I'm not even sure I'm getting full utilization with my other 
>> methods.
>> 
>> The medusa-pmap function is very close to what I want, but medusa seems to do 
>> more than I need: it requires init/startup calls, it involves timeouts which 
>> I will never want, and it behaves strangely when I run my code on a 48-core 
>> node. (It runs fine on my 4-core laptop, and it seems to work beautifully on 
>> the 48-core node too for a while, giving me nearly 4800% utilization, but 
>> then it does something that looks like it might be caused by everything 
>> hitting the timeouts... which I bumped up by several orders of magnitude, 
>> yet I'm still getting the weird behavior.)
>> 
>> So: Is there a simpler way to get "pmap but not lazy and use all cores fully 
>> until the whole result sequence is completed"? This is usually what I really 
>> want when I'm writing code that utilizes concurrency.
>> 
>> I've looked at the pmap source but it isn't obvious to me how to change that 
>> to do what I want... So any pointers would be appreciated.
>> 
>> Thanks,
>> 
>> -Lee
> 

--
Lee Spector, Professor of Computer Science
Cognitive Science, Hampshire College
893 West Street, Amherst, MA 01002-3359
lspec...@hampshire.edu, http://hampshire.edu/lspector/
Phone: 413-559-5352, Fax: 413-559-5438
