I'll post more on this later, but I wanted to point out one case where I found that pmap was not achieving the level of speedup (a factor equal to the number of CPUs/cores) that you would initially expect, and it is not due to any of the reasons I've posted about before.
Imagine a 4-core CPU. There are 4 physical CPU cores on the same chip, so if you can find a computation task that is completely compute-bound, _and that task does not require reading from or writing to main memory or any other storage media like hard drives after it is "warmed up"_, then pmap or pmapall should ideally be able to achieve an elapsed time of 1/4 of doing the tasks sequentially.

However, suppose instead that the computation task reads from a large data structure and creates a new large data structure as a result, and that the amount of memory read and written is large compared to the amount of computation required. Then it is possible that the speedup will be limited not by the number of CPU cores available, but by the bandwidth at which main memory can be read or written. For example, the 4 CPU cores in the hypothetical example might share a bus to main memory with a maximum write capacity of 20 Gbytes/sec. If doing the tasks sequentially keeps one CPU core fully busy and requires writing 10 Gbytes/sec to main memory, then it doesn't matter whether you have 2 cores or 50 cores on that chip: the 20 Gbytes/sec of write bandwidth to main memory will limit your parallel speedup to a factor of 2 (i.e. parallel elapsed times will be at least 1/2 of the sequential elapsed times).
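To make the distinction concrete, here is a rough sketch of the two kinds of tasks I have in mind (the function names and sizes are made up for illustration, not code from my actual runs). The first does pure arithmetic and touches essentially no memory once it is warmed up; the second reads a large vector and builds a large new one, so most of its time goes to moving data to and from memory rather than to computation:

(defn compute-bound-task [n]
  ;; tight arithmetic loop, essentially no allocation once warmed up
  (loop [i 0 acc 0.0]
    (if (< i n)
      (recur (inc i) (+ acc (Math/sin i)))
      acc)))

(defn memory-bound-task [big-vec]
  ;; reads a large input vector and builds a large new output vector;
  ;; dominated by memory traffic rather than arithmetic
  (into [] (map #(* 2 %) big-vec)))

(def big-vec (vec (range 1000000)))

;; sequential vs. parallel elapsed times for each kind of task
(time (last (map  (fn [_] (compute-bound-task 1000000)) (range 32))))
(time (last (pmap (fn [_] (compute-bound-task 1000000)) (range 32))))
(time (last (map  (fn [_] (memory-bound-task big-vec)) (range 32))))
(time (last (pmap (fn [_] (memory-bound-task big-vec)) (range 32))))

If the memory-bandwidth explanation is right, the first pair should show close to a 4x speedup on the hypothetical 4-core chip, while the second pair will stall at whatever factor the memory bus allows.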
I believe I have found similar cases where I did not have a large output data structure, but simply generated data structures that very soon became garbage. Because I allocated them, and the garbage collector did not detect them as garbage until *after* they had been written from the CPU's cache to main memory, the parallel speedup was limited by the CPU's write bandwidth to main memory, and ended up lower than the number of cores.

In such cases, it would be ideal if the garbage collector could somehow be made smart enough to detect the allocated structures as garbage, and reallocate their memory, before they were ever written to main memory. That would allow the memory to stay in cache, and the computation could proceed with an order of magnitude less data written to main memory.
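As a made-up illustration of the kind of code I'm describing: both functions below compute the same sum, but the first builds lazy-sequence cells that become garbage almost immediately, while the second does the same arithmetic in a loop and allocates essentially nothing, so far less data ever needs to be written back to main memory:

(defn sum-squares-garbagey [n]
  ;; builds two lazy sequences whose cells become garbage as soon as
  ;; the reduce has consumed them
  (reduce + (map #(* % %) (range n))))

(defn sum-squares-loop [n]
  ;; same arithmetic with no per-element allocation
  (loop [i 0 acc 0]
    (if (< i n)
      (recur (inc i) (+ acc (* i i)))
      acc)))

;; compare parallel scaling of the two versions
(time (last (pmap (fn [_] (sum-squares-garbagey 1000000)) (range 32))))
(time (last (pmap (fn [_] (sum-squares-loop 1000000)) (range 32))))

It is the first style that I believe runs into the write-bandwidth ceiling, even though the final results are tiny.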
Andy

On Mon, Oct 10, 2011 at 6:55 PM, Lee Spector <lspec...@hampshire.edu> wrote:
>
> Interesting. I'll try some of your suggested tests to see if my pmapall
> is behaving better than I thought.
>
> Does your pmap-pool permit nesting? (That is, does it permit passing
> pmap-pool a function which itself calls pmap-pool?) If so, then that would
> be a reason to prefer it over my pmapall.
>
> Thanks,
>
>  -Lee
>
>
> On Oct 10, 2011, at 9:43 PM, j-g-faustus wrote:
>
> > I made an alternative implementation using a thread pool and a queue,
> > based on the example at http://clojure.org/concurrent_programming
> >
> > In short, your pmapall and the pool-based implementation (below) both
> > give approximately perfect scaling on my 4/8-core system (Intel i7 920
> > and HT). Both give close to full load on all cores and a factor 4.4
> > speedup compared to single-threaded. This seems about right: the CPU has
> > four physical cores and gets a few percent extra performance from the
> > virtual cores, so the speedup is approximately linear with the number
> > of cores.
> >
> > pmap-pool may be a tiny bit faster than pmapall, but they are so close
> > that I can't really tell.
> >
> > It is possible that there is some sort of synchronization overhead on
> > your 48-core machine. 95% of the tasks are practically noops, after all -
> > just the cost of a single function call. There are only 48 tasks in your
> > test that actually require computation, so each core will do a bunch of
> > noops and perhaps one "real" task.
> >
> > In real time, a single i7 920 runs the test just as fast as your 48
> > cores. I don't expect that's representative of what your 48 cores can do.
> >
> > I suggest:
> > * Increase the test size and/or the density of "heavy" tasks.
> > * Let the "light" tasks do a bit more computation, at least enough to
> >   pay for the overhead of calling them.
> > * Start with a smaller number of threads, and see where it stops scaling
> >   linearly.
> >
> > Threadpool/queue-based implementation:
> >
> > (import '(java.util.concurrent Executors))
> >
> > (defn pmap-pool [f coll]
> >   (let [queue    (ref coll)  ;; shared queue of work units
> >         nthreads (.availableProcessors (Runtime/getRuntime))
> >         pool     (Executors/newFixedThreadPool nthreads)
> >         tasks    (map (fn [_]
> >                         (fn []  ;; one task per thread
> >                           ;; collect results per thread to minimize
> >                           ;; synchronization
> >                           (let [local-res (atom [])]
> >                             (while (seq @queue)
> >                               ;; queue may be emptied between 'while'
> >                               ;; and 'dosync'
> >                               (when-let [wu (dosync
> >                                               ;; grab work unit, update queue
> >                                               (when-let [w (first @queue)]
> >                                                 (alter queue next)
> >                                                 w))]
> >                                 (swap! local-res conj (f wu))))
> >                             local-res)))
> >                       (range nthreads))
> >         results  (doall (map #(deref (.get %))        ;; blocks until completion
> >                              (.invokeAll pool tasks))) ;; start all tasks
> >         results  (reduce concat results)]
> >     (.shutdown pool)
> >     ;; sanity check
> >     (when-not (and (empty? @queue)
> >                    (= (count results) (count coll))
> >                    (every? #(= % :done) results))
> >       (println "ERROR: queue" (count @queue) "#results" (count results)))
> >     results))
> >
> > Results on an i7 920, 4 cores/8 threads (hyperthreading), Ubuntu 10.10:
> >
> > user=> (time (last (map fast-or-slow inputs)))
> > "Elapsed time: 161891.732036 msecs", 100% CPU (out of 800% possible)
> >
> > user=> (time (last (pmap fast-or-slow inputs)))
> > "Elapsed time: 163139.249677 msecs", 100% CPU
> > pmap has zero effect on my system, it won't use more than one core.
> >
> > user=> (time (last (pmapall fast-or-slow inputs)))
> > "Elapsed time: 37710.349712 msecs", ~793% CPU
> >
> > user=> (time (last (pmap-pool fast-or-slow inputs)))
> > "Elapsed time: 36393.132824 msecs", ~795% CPU
>
> --
> Lee Spector, Professor of Computer Science
> Cognitive Science, Hampshire College
> 893 West Street, Amherst, MA 01002-3359
> lspec...@hampshire.edu, http://hampshire.edu/lspector/
> Phone: 413-559-5352, Fax: 413-559-5438