I made an alternative implementation using a thread pool and a queue, based
on the example at
http://clojure.org/concurrent_programming
In short, your pmapall and the pool-based implementation (below) both give
approximately
perfect scaling on my 4/8-core system (Intel i7 920 and HT).
Both give close to full load on all cores and a factor 4.4 speedup compared
to single threaded.
This seems about right, the CPU has four physical cores and get a few
percent extra performance
from the virtual cores, so the speedup is approximately linear with the
number of cores.
pmap-pool may be a tiny bit faster than the pmapall, but they are so close
that I can't
really tell.
It is possible that there is some sort of synchronization overhead on your
48-core machine.
95% of the tasks are practically noops, after all - just the cost of a
single function call.
There are only 48 tasks in your test that actually require computation, so
each
core will do a bunch of noops and perhaps one "real" task.
In real time, a single i7 920 runs the test just as fast as your 48 cores. I
don't expect that's
representative for what your 48 cores can do.
I suggest
* Increase the test size and/or the density of "heavy" tasks.
* Let the "light" tasks do a bit more computation, at least enough to pay
for the
overhead of calling them.
* Start with a smaller number of threads, and see where it stops scaling
linearly.
Threadpool/queue-based implementation:
(import '(java.util.concurrent Executors))
(defn pmap-pool [f coll]
(let [queue (ref coll) ;; shared queue of work units
nthreads (.availableProcessors (Runtime/getRuntime))
pool (Executors/newFixedThreadPool nthreads)
tasks (map (fn [_]
(fn [] ; one task per thread
(let [local-res (atom [])] ;; collect results per
thread to minimize synchronization
(while (seq @queue)
;; queue may be emptied between 'while'
;; and 'dosync'.
(when-let [wu (dosync
;; grab work unit, update
queue
(when-let [w (first @queue)]
(alter queue next)
w))]
(swap! local-res conj (f wu))))
local-res)))
(range nthreads))
results (doall (map #(deref (.get %)) ;; blocks until completion
(.invokeAll pool tasks))) ;; start all tasks
results (reduce concat results)]
(.shutdown pool)
;; sanity check
(when-not (and (empty? @queue)
(= (count results) (count coll))
(every? #(= % :done) results))
(println "ERROR: queue " (count @queue) " #results" (count results)))
results))
Results on an i7 920, 4 cores/8 threads (hyperthreading), Ubuntu 10.10:
user=> (time (last (map fast-or-slow inputs))))
"Elapsed time: 161891.732036 msecs", 100% CPU (out of 800% possible)
user=> (time (last (pmap fast-or-slow inputs))))
"Elapsed time: 163139.249677 msecs", 100% CPU
pmap has zero effect on my system, it won't use more than one core.
user=> (time (last (pmapall fast-or-slow inputs))))
"Elapsed time: 37710.349712 msecs", ~793% CPU
user=> (time (last (pmap-pool fast-or-slow inputs))))
"Elapsed time: 36393.132824 msecs", ~795% CPU
--
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to [email protected]
Note that posts from new members are moderated - please be patient with your
first post.
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en