I think that the following partially answers my own question and that it 
provides a way to get decent multicore performance for collections of 
non-uniform but compute-intensive tasks through a simple, pmap-like interface.

But I'm not sure if it's the best approach and I'd like some feedback. If it 
*is* a good approach then maybe we should refine it and make it more widely 
available. If it's not the best approach then I'd love some advice on how to do 
it better.

The use case here -- which I think must be shared by at least some others -- is 
that I have a finite, non-lazy collection of inputs on which I'd like to run an 
expensive (but not uniformly expensive) function, gathering the results in a 
non-lazy sequence. This is a *very* common need in my own projects. I don't 
care about the order in which the function calls are made, and I'm not 
concerned about the memory overhead of retaining all of the results since I 
want to keep them all anyway. I just want all of the computations done as 
quickly as possible, using all of my available cores. The pmap function seems 
at first to provide an elegant way to do what's needed, e.g. with (doall (pmap 
f inputs)), but as discussed earlier in this thread it will often wait for the 
completion of earlier calls before starting later calls, and this will be 
particularly problematic when the runtimes of the calls are uneven.

The medusa package provides something that would seem to fit the bill better, 
but it comes with other baggage that I don't need or want. I just want a 
version of pmap that will use all available CPU resources to aggressively 
complete all of the computations in a non-lazy context.

Here's my stab at doing this using agents:

(defn pmapall
  "Like pmap but: 1) coll should be finite, 2) the returned sequence
   will not be lazy, 3) calls to f may occur in any order, to maximize
   multicore processor utilization, and 4) takes only one coll so far."
  [f coll]
  (let [agents (map agent coll)]
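    ;; send runs each action on Clojure's fixed-size agent thread pool
    ;; (2 + number of processors), so the cores shouldn't be oversubscribed.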
    (dorun (map #(send % f) agents))
    (apply await agents)
    (doall (map deref agents))))

I should make a version that takes multiple colls, but for now I've written it 
to take just one for clarity.
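
For what it's worth, here's an untested sketch of what a multi-coll version 
might look like, following pmap's own arity structure (the variadic case 
bundles the corresponding arguments into vectors and applies f to each bundle; 
the name pmapall-multi is just for illustration):

(defn pmapall-multi
  "Hypothetical multi-coll variant of pmapall (untested sketch)."
  ([f coll]
   (pmapall f coll))
  ([f coll & colls]
   ;; each agent holds a vector of corresponding args, one from each coll
   (let [agents (map agent (apply map vector coll colls))]
     (dorun (map (fn [a] (send a #(apply f %))) agents))
     (apply await agents)
     (doall (map deref agents)))))

So, e.g., (pmapall-multi + (range 1000) (range 1000)) should produce the same 
result as (doall (map + (range 1000) (range 1000))), just with the underlying 
calls made in whatever order the agent pool gets to them.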

This does appear to speed things up pretty significantly in certain 
circumstances, but maybe not as much as possible. Is it the best approach?

To show that it beats pmap, I define a time-wasting function (I want to see 
real CPU utilization, so I'm burning cycles rather than just sleeping) like this:

(defn burn
  []
  (dotimes [i 10000]
    (reduce * (map float (take 1000 (iterate inc i))))))

And then I define a function that takes a lot or a little bit of time depending 
on its argument:

(defn fast-or-slow
  [n]
  (if (zero? n) 
    :done
    (do (burn)
        :done)))

And then I create a sequence of inputs in which the slow (nonzero) ones are 
scattered sparsely, one per 21 elements:

(def inputs (take 1000 (cycle (conj (repeat 20 0) 1))))

And then on a 48-core node I get timings like this:

user=> (time (last (pmapall fast-or-slow inputs)))
"Elapsed time: 37244.151 msecs"
:done

user=> (time (last (doall (pmap fast-or-slow inputs))))
"Elapsed time: 110862.187 msecs"
:done

And by the way, plain old serial map does this:

user=> (time (last (doall (map fast-or-slow inputs))))
"Elapsed time: 260941.836 msecs"

So we've improved things: pmap is a little better than twice as fast as map, 
and pmapall is roughly three times faster than pmap, so I think I'm ready to 
switch all of my pmaps to pmapalls. But that's still nowhere close to a 48x 
speedup, even though all of the tasks should be completely independent and I 
wouldn't expect a huge loss to coordination overhead. Another confusing thing 
is that even with pmapall, when I look at CPU utilization I see numbers close 
to 4800% in some cases (like the one above), but topping out at something more 
like 1900% in others (e.g. with different input sequences).

So I feel like I'm moving in the right direction but that I'm still probably 
missing something.

Is there a better way to do this? Surely it will come in handy for others as 
well if there's a simple way to more effectively dispatch tasks to multiple 
cores.
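
One alternative I've toyed with, but haven't benchmarked carefully, is to skip 
agents entirely and launch a future per input. Futures run on an unbounded, 
cached thread pool, so with thousands of compute-bound tasks this might 
oversubscribe the cores; I don't know whether that helps or hurts in practice. 
A minimal sketch (pmapall-futures is just a name for discussion):

(defn pmapall-futures
  "Untested alternative to pmapall using one future per element."
  [f coll]
  (let [futs (doall (map #(future (f %)) coll))] ; start all tasks eagerly
    (doall (map deref futs))))                   ; block until each finishes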

Thoughts? Code?

Thanks,

 -Lee


On Oct 9, 2011, at 7:24 PM, Lee Spector wrote:

> 
> I've been playing with medusa and it sometimes does what I expect, but 
> sometimes it does something strange, and I'm wondering if someone can help me 
> do one specific medusa-like thing more simply (and without the strangeness, 
> which I haven't fully traced and hope not to have to).
> 
> What I want is a version of pmap that will always use any available cores to 
> compute remaining values (except of course the last couple of values, when 
> there are fewer remaining values than cores). 
> 
> In other words, I want the behavior that Andy Fingerhut describes medusa as 
> having here:
> 
>> On Sep 22, 2011, at 11:34 PM, Andy Fingerhut wrote:
>> 
>>> pmap will limit the maximum number of simultaneous threads.  So will the 
>>> medusa library's medusa-pmap.
>>> 
>>> The difference is that if one job early in the list takes, e.g., 100 times 
>>> longer than the next 20, and you have 4 cpus available, pmap will start the 
>>> first (4+2)=6 in parallel threads, let jobs 2 through 6 complete, and then 
>>> wait for the first one to finish before starting number 7.  Thus most of 
>>> the time will be spent running one thread.  This has the advantage of 
>>> limiting the memory required to store intermediate results, but the 
>>> disadvantage of low usage of multiple cpu cores.
>>> 
>>> Medusa has a medusa-pmap that will also limit the parallelism, but it lets 
>>> you pick the level of parallelism (it isn't fixed at (# cpus + 2)), and it 
>>> will continue starting new threads, even if that means using more memory 
>>> when one job takes much longer than the following jobs.
>>> 
> 
> FWIW I'll always be calling this on a finite sequence (although it may have 
> 10000 elements or so, so I presume that I shouldn't start threads for all of 
> them at once), and I will want a fully realized result sequence so I'll 
> surround calls with doall if necessary. I will want all of the pmapped 
> computations to finish before I continue doing anything else.
> 
> I know that I could launch the threads in other ways -- e.g. I sometimes use 
> agents for something similar -- but pmap is much more elegant in many cases, 
> and anyway I'm not even sure I'm getting full utilization with my other 
> methods.
> 
> The medusa-pmap function is very close to what I want, but medusa seems to do 
> more than I need: it requires init/startup calls, it involves timeouts which 
> I will never want, and it behaves strangely when I run my code on a 48-core 
> node. (It runs fine on my 4-core laptop, and it seems to work beautifully on 
> the 48-core node too for a while, giving me nearly 4800% utilization, but then 
> it does something that looks like it might be caused by everything hitting the 
> timeouts... which I bumped up by several orders of magnitude, but I'm still 
> getting the weird behavior.)
> 
> So: Is there a simpler way to get "pmap but not lazy and use all cores fully 
> until the whole result sequence is completed"? This is usually what I really 
> want when I'm writing code that utilizes concurrency.
> 
> I've looked at the pmap source but it isn't obvious to me how to change that 
> to do what I want... So any pointers would be appreciated.
> 
> Thanks,
> 
> -Lee
