Have you considered that you're realizing very large lazy sequences and might be thrashing around in GC? The parallel versions need X times the available memory of the sequential version, where X is the number of concurrent threads, right?
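If it would help to check that directly, here's a minimal sketch (gc-stats is just a throwaway helper of mine, nothing from your code): snapshot the garbage collector MXBeans before and after one of the timed runs and compare the collection counts and accumulated times.

(import 'java.lang.management.ManagementFactory)

;; Snapshot collection counts and cumulative GC time (ms) for each collector.
(defn gc-stats []
  (into {}
        (for [bean (ManagementFactory/getGarbageCollectorMXBeans)]
          [(.getName bean) {:count   (.getCollectionCount bean)
                            :time-ms (.getCollectionTime bean)}])))

;; Usage: (def before (gc-stats)), run one burn-via-* call, then compare
;; (gc-stats) against before. Big jumps during the slow runs would point at GC.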
David

On Wed, Aug 4, 2010 at 10:36 AM, Lee Spector <lspec...@hampshire.edu> wrote:
>
> Apologies for the length of this message -- I'm hoping to be complete, but that made the message pretty long.
>
> Also BTW most of the tests below were run using Clojure 1.1. If part of the answer to my questions is "use 1.2" then I'll upgrade ASAP (but I haven't done so yet because I'd prefer to be confused by one thing at a time :-). I don't think that can be the full answer, though, since the last batch of runs below WERE run under 1.2 and they're also problematic...
>
> Also, for most of the runs described here (with the one exception noted below) I am running under Linux:
>
> [lspec...@fly ~]$ cat /proc/version
> Linux version 2.6.18-164.6.1.el5 (mockbu...@builder10.centos.org) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-46)) #1 SMP Tue Nov 3 16:12:36 EST 2009
>
> with this Java version:
>
> [lspec...@fly ~]$ java -version
> java version "1.6.0_16"
> Java(TM) SE Runtime Environment (build 1.6.0_16-b01)
> Java HotSpot(TM) 64-Bit Server VM (build 14.2-b01, mixed mode)
>
> SO: Most of the documentation and discussion about clojure concurrency is about managing state that may be shared between concurrent processes, but I have what I guess are more basic questions about how concurrent processes can/should be started even in the absence of shared state (or when all that's shared is immutable) and about how to get the most out of concurrency on multiple cores.
>
> I often have large numbers of relatively long, independent processes and I want to farm them out to multiple cores. (For those who care this is often in the context of evolutionary computation systems, with each of the processes being a fitness test.) I had thought that I was farming these out in the right way to multiple cores, using agents or sometimes just pmap, but then I noticed that my runtimes weren't scaling in the way that I expected across machines with different numbers of cores (even though I usually saw near total utilization of all cores in "top").
>
> This led me to do some more systematic testing and I'm confused/concerned about what I'm seeing, so I'm going to present my tests and results here in the hope that someone can clear things up for me. I know that timing things in clojure can be complicated both on account of laziness and on account of optimizations that happen on the Java side, but I think I've done the right things to avoid getting tripped up too much by these issues. Still, it's quite possible that I've coded some things incorrectly and/or that I'm misunderstanding some basic concepts, and I'd appreciate any help that anyone can provide.
>
> First I defined a function that would take a non-trivial amount of time to execute, as follows:
>
> (defn burn
>   ([] (count
>         (take 1E6
>           (repeatedly
>             #(* 9999999999 9999999999)))))
>   ([_] (burn)))
>
> The implementation with an ignored argument just serves to make some of my later calls neater -- I suppose I might incur a tiny additional cost when calling it that way but this will be swamped by the things I'm timing.
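For what it's worth, here's a rough sketch of a lower-allocation variant of burn (my own, with a made-up name, not a drop-in claim): it does the same multiplications via loop/recur, skipping the per-element lazy-seq bookkeeping that repeatedly/take/count add.

;; Same bignum multiplications as burn, accumulated in a loop instead of
;; counted off a realized lazy sequence, so the only garbage is the math itself.
(defn burn-loop
  ([] (loop [i 1000000 acc 0]
        (if (zero? i)
          acc
          (recur (dec i) (+ acc (* 9999999999 9999999999))))))
  ([_] (burn-loop)))

If timings with that version scale better across cores, that would point at allocation/GC rather than at how the threads are being started.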
>
> Then I defined functions for calling this multiple times either sequentially or concurrently, using three different techniques for starting the concurrent processes:
>
> (defn burn-sequentially [n]
>   (print n " sequential burns: ")
>   (time (dotimes [i n] (burn))))
>
> (defn burn-via-pmap [n]
>   (print n " burns via pmap: ")
>   (time (doall (pmap burn (range n)))))
>
> (defn burn-via-futures [n]
>   (print n " burns via futures: ")
>   (time (doall (pmap deref (map (fn [_] (future (burn))) (range n))))))
>
> (defn burn-via-agents [n]
>   (print n " burns via agents: ")
>   (time (let [agents (map #(agent %) (range n))]
>           (dorun (map #(send % burn) agents))
>           (apply await agents))))
>
> Finally, since there's often quite a bit of variability in the run time of these things (maybe because of garbage collection? Optimization? I'm not sure), I define a simple macro to execute a call three times:
>
> (defmacro thrice [expression]
>   `(do ~expression ~expression ~expression))
>
> Now I can do some timings, and I'll first show you what happens in one of the cases where everything performs as expected.
>
> On a 16-core machine (details at http://fly.hampshire.edu/ganglia/?p=2&c=Rocks-Cluster&h=compute-4-1.local), running four burns thrice, with the code:
>
> (thrice (burn-sequentially 4))
> (thrice (burn-via-pmap 4))
> (thrice (burn-via-futures 4))
> (thrice (burn-via-agents 4))
>
> I get:
>
> 4 sequential burns: "Elapsed time: 2308.616 msecs"
> 4 sequential burns: "Elapsed time: 1510.207 msecs"
> 4 sequential burns: "Elapsed time: 1182.743 msecs"
> 4 burns via pmap: "Elapsed time: 470.988 msecs"
> 4 burns via pmap: "Elapsed time: 457.015 msecs"
> 4 burns via pmap: "Elapsed time: 446.84 msecs"
> 4 burns via futures: "Elapsed time: 417.368 msecs"
> 4 burns via futures: "Elapsed time: 401.444 msecs"
> 4 burns via futures: "Elapsed time: 398.786 msecs"
> 4 burns via agents: "Elapsed time: 421.103 msecs"
> 4 burns via agents: "Elapsed time: 426.775 msecs"
> 4 burns via agents: "Elapsed time: 408.416 msecs"
>
> The improvement from the first line to the second is something I always see (along with frequent improvements across the three calls in a "thrice"), and I assume this is due to optimizations taking place in the JVM. Then we see that all of the ways of starting concurrent burns perform about the same, and all produce a speedup over the sequential burns of somewhere in the neighborhood of 3x-4x. Pretty much exactly what I would expect and want. So far so good.
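Also, a small observation on burn-via-futures: the futures come out of a lazy map, so exactly when each one is created depends on how pmap (and the chunking of range) happens to walk that sequence. Here's a sketch of an eagerly-dispatched variant (hypothetical name, assumes your burn from above) that takes that variable out of the picture:

;; Force creation of every future up front with doall, so all n tasks are
;; submitted to the pool immediately, then simply block on each result.
(defn burn-via-eager-futures [n]
  (print n " burns via eager futures: ")
  (time (let [futs (doall (map (fn [_] (future (burn))) (range n)))]
          (doseq [f futs] (deref f)))))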
>
> However, in the same JVM launch I then went on to do the same thing but with 16 and then 48 burns in each call:
>
> (thrice (burn-sequentially 16))
> (thrice (burn-via-pmap 16))
> (thrice (burn-via-futures 16))
> (thrice (burn-via-agents 16))
>
> (thrice (burn-sequentially 48))
> (thrice (burn-via-pmap 48))
> (thrice (burn-via-futures 48))
> (thrice (burn-via-agents 48))
>
> This produced:
>
> 16 sequential burns: "Elapsed time: 5821.574 msecs"
> 16 sequential burns: "Elapsed time: 6580.684 msecs"
> 16 sequential burns: "Elapsed time: 6648.013 msecs"
> 16 burns via pmap: "Elapsed time: 5953.194 msecs"
> 16 burns via pmap: "Elapsed time: 7517.196 msecs"
> 16 burns via pmap: "Elapsed time: 7380.047 msecs"
> 16 burns via futures: "Elapsed time: 1168.827 msecs"
> 16 burns via futures: "Elapsed time: 1068.98 msecs"
> 16 burns via futures: "Elapsed time: 1048.745 msecs"
> 16 burns via agents: "Elapsed time: 1041.05 msecs"
> 16 burns via agents: "Elapsed time: 1030.712 msecs"
> 16 burns via agents: "Elapsed time: 1041.139 msecs"
> 48 sequential burns: "Elapsed time: 15909.333 msecs"
> 48 sequential burns: "Elapsed time: 14825.631 msecs"
> 48 sequential burns: "Elapsed time: 15232.646 msecs"
> 48 burns via pmap: "Elapsed time: 13586.897 msecs"
> 48 burns via pmap: "Elapsed time: 3106.56 msecs"
> 48 burns via pmap: "Elapsed time: 3041.272 msecs"
> 48 burns via futures: "Elapsed time: 2968.991 msecs"
> 48 burns via futures: "Elapsed time: 2895.506 msecs"
> 48 burns via futures: "Elapsed time: 2818.724 msecs"
> 48 burns via agents: "Elapsed time: 2802.906 msecs"
> 48 burns via agents: "Elapsed time: 2754.364 msecs"
> 48 burns via agents: "Elapsed time: 2743.038 msecs"
>
> Looking first at the 16-burn runs, we see that concurrency via pmap is actually generally WORSE than sequential. I cannot understand why this should be the case. I guess if I were running on a single core I would expect to see a slight loss when going to pmap because there would be some cost for managing the 16 threads that wouldn't be compensated for by actual concurrency. But I'm running on 16 cores and I should be getting a major speedup, not a slowdown. There are only 16 threads, so there shouldn't be a lot of time lost to overhead.
>
> Also interesting, in this case when I start the processes using futures or agents I DO see a speedup. It's on the order of 6x-7x, not close to the 16x that I would hope for, but at least it's a speedup. Why is this so different from the case with pmap? (Recall that my pmap-based method DID produce about the same speedup as my other methods when doing only 4 burns.)
>
> For the calls with 48 burns we again see nearly the expected, reasonably good pattern with all concurrent calls performing nearly equivalently (I suppose that the steady improvement over all of the calls is again some kind of JVM optimization), with a speedup in the concurrent calls over the sequential calls in the neighborhood of 5x-6x. Again, not the ~16x that I might hope for, but at least it's in the right direction. The very first of the pmap calls with 48 burns is an anomaly, with only a slight improvement over the sequential calls, so I suppose that's another small mystery.
>
> The big mystery so far, however, is in the case of the 16 burns via pmap, which is bizarrely slow on this 16-core machine.
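One thing that might be relevant to the pmap-at-16 oddity: pmap is lazy, and as I read its source it only stays roughly (+ 2 processors) computations ahead of whatever consumes its output, so the effective parallelism depends on consumption as well as on the core count. A quick way to see the numbers involved on each machine:

;; Cores as the JVM reports them, and the look-ahead pmap derives from that.
(.availableProcessors (Runtime/getRuntime))
(+ 2 (.availableProcessors (Runtime/getRuntime)))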
> Next I tried the same thing on a 48-core machine (http://fly.hampshire.edu/ganglia/?p=2&c=Rocks-Cluster&h=compute-4-2.local). Here I got:
>
> 4 sequential burns: "Elapsed time: 3062.871 msecs"
> 4 sequential burns: "Elapsed time: 2249.048 msecs"
> 4 sequential burns: "Elapsed time: 2417.677 msecs"
> 4 burns via pmap: "Elapsed time: 705.968 msecs"
> 4 burns via pmap: "Elapsed time: 679.865 msecs"
> 4 burns via pmap: "Elapsed time: 685.017 msecs"
> 4 burns via futures: "Elapsed time: 687.097 msecs"
> 4 burns via futures: "Elapsed time: 636.543 msecs"
> 4 burns via futures: "Elapsed time: 660.116 msecs"
> 4 burns via agents: "Elapsed time: 708.163 msecs"
> 4 burns via agents: "Elapsed time: 709.433 msecs"
> 4 burns via agents: "Elapsed time: 713.536 msecs"
> 16 sequential burns: "Elapsed time: 8065.446 msecs"
> 16 sequential burns: "Elapsed time: 8069.239 msecs"
> 16 sequential burns: "Elapsed time: 8102.791 msecs"
> 16 burns via pmap: "Elapsed time: 11288.757 msecs"
> 16 burns via pmap: "Elapsed time: 12182.506 msecs"
> 16 burns via pmap: "Elapsed time: 14609.397 msecs"
> 16 burns via futures: "Elapsed time: 2519.603 msecs"
> 16 burns via futures: "Elapsed time: 2436.699 msecs"
> 16 burns via futures: "Elapsed time: 2776.869 msecs"
> 16 burns via agents: "Elapsed time: 2178.028 msecs"
> 16 burns via agents: "Elapsed time: 2871.38 msecs"
> 16 burns via agents: "Elapsed time: 2244.687 msecs"
> 48 sequential burns: "Elapsed time: 24118.218 msecs"
> 48 sequential burns: "Elapsed time: 24096.667 msecs"
> 48 sequential burns: "Elapsed time: 24057.327 msecs"
> 48 burns via pmap: "Elapsed time: 10369.224 msecs"
> 48 burns via pmap: "Elapsed time: 6837.071 msecs"
> 48 burns via pmap: "Elapsed time: 4163.926 msecs"
> 48 burns via futures: "Elapsed time: 3980.298 msecs"
> 48 burns via futures: "Elapsed time: 4066.35 msecs"
> 48 burns via futures: "Elapsed time: 4068.199 msecs"
> 48 burns via agents: "Elapsed time: 4012.069 msecs"
> 48 burns via agents: "Elapsed time: 4052.759 msecs"
> 48 burns via agents: "Elapsed time: 4085.018 msecs"
>
> Essentially this is the same picture that I got on the 16-core machine: decent (but less than I would like -- only something like 3x-4x) speedups with most concurrent methods in most cases but a bizarre anomaly with 16 burns started with pmap, which is again considerably slower than the sequential runs. Why should this be? When I run only 4 burns or a full 48 burns the pmap method performs decently (that is, at least things get faster than the sequential calls), but with 16 burns something very odd happens.
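Coming back to the memory angle: if you're not setting -Xmx explicitly, the default heap on those machines may be small relative to 16 or 48 burns running at once, which would make collections much more frequent in the concurrent runs. A quick check from the REPL:

;; Heap ceiling and current usage in bytes, as the JVM reports them.
{:max-heap   (.maxMemory (Runtime/getRuntime))
 :total-heap (.totalMemory (Runtime/getRuntime))
 :free-heap  (.freeMemory (Runtime/getRuntime))}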
>
> Finally, I ran the same thing on my MacBook Pro 3.06 GHz Intel Core 2 Duo, Mac OS X 10.6.4, with Clojure 1.2.0-master-SNAPSHOT under Eclipse/Counterclockwise, with a bunch of applications running, so probably this is acting more or less like a single core machine, and got:
>
> 4 sequential burns: "Elapsed time: 3487.224 msecs"
> 4 sequential burns: "Elapsed time: 2327.569 msecs"
> 4 sequential burns: "Elapsed time: 2137.697 msecs"
> 4 burns via pmap: "Elapsed time: 12478.725 msecs"
> 4 burns via pmap: "Elapsed time: 12815.75 msecs"
> 4 burns via pmap: "Elapsed time: 8464.909 msecs"
> 4 burns via futures: "Elapsed time: 11494.17 msecs"
> 4 burns via futures: "Elapsed time: 12365.537 msecs"
> 4 burns via futures: "Elapsed time: 12098.571 msecs"
> 4 burns via agents: "Elapsed time: 10361.749 msecs"
> 4 burns via agents: "Elapsed time: 12458.174 msecs"
> 4 burns via agents: "Elapsed time: 9016.093 msecs"
> 16 sequential burns: "Elapsed time: 8706.674 msecs"
> 16 sequential burns: "Elapsed time: 8748.006 msecs"
> 16 sequential burns: "Elapsed time: 8729.54 msecs"
> 16 burns via pmap: "Elapsed time: 46022.281 msecs"
> 16 burns via pmap: "Elapsed time: 44845.725 msecs"
> 16 burns via pmap: "Elapsed time: 45393.156 msecs"
> 16 burns via futures: "Elapsed time: 52822.863 msecs"
> 16 burns via futures: "Elapsed time: 50647.708 msecs"
> 16 burns via futures: "Elapsed time: 50337.916 msecs"
> 16 burns via agents: "Elapsed time: 48615.905 msecs"
> 16 burns via agents: "Elapsed time: 56703.723 msecs"
> 16 burns via agents: "Elapsed time: 69765.913 msecs"
> 48 sequential burns: "Elapsed time: 38885.616 msecs"
> 48 sequential burns: "Elapsed time: 38651.573 msecs"
> 48 sequential burns: "Elapsed time: 36669.02 msecs"
> 48 burns via pmap: "Elapsed time: 169108.022 msecs"
> 48 burns via pmap: "Elapsed time: 176656.455 msecs"
> 48 burns via pmap: "Elapsed time: 182119.986 msecs"
> 48 burns via futures: "Elapsed time: 176764.722 msecs"
> 48 burns via futures: "Elapsed time: 169257.577 msecs"
> 48 burns via futures: "Elapsed time: 157205.693 msecs"
> 48 burns via agents: "Elapsed time: 140618.333 msecs"
> 48 burns via agents: "Elapsed time: 137992.773 msecs"
> 48 burns via agents: "Elapsed time: 143153.696 msecs"
>
> Here we have a very depressing picture. Although I wouldn't expect to get any speedup from concurrency, the concurrency-related slowdowns have now spread to all of my concurrency-starting methods with all numbers of burns. It is way way way worse to be using the concurrency methods than the straightforward sequential method in every circumstance. Again, I understand why one should expect a small loss in a case like this, but these are huge losses and the number of threads that have to be coordinated (with no shared state) is quite small -- just 4-48.
>
> My guess is that all of this is stemming from some confusion on my part about how I should be starting and managing concurrent processes, and my greatest hope is that one of you will show me an alternative to my burn-via-* functions that provides a speedup nearly linear with the number of cores and only a negligible loss when there's only one core available...
>
> But any help of any kind would be appreciated.
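On the request for an alternative to the burn-via-* functions: one pattern I'd try (a sketch only, hypothetical name, not a guarantee of linear scaling) is to split the n burns into roughly one chunk per core and run each chunk sequentially inside a single future, so the number of simultaneously live burn calls never exceeds the core count.

;; Partition the n burns into about one chunk per available core, run each
;; chunk in its own future, and wait for them all. On a 1-2 core laptop this
;; degenerates to 1-2 worker threads, so the coordination overhead stays small.
(defn burn-chunked [n]
  (print n " burns, chunked per core: ")
  (time (let [cores  (.availableProcessors (Runtime/getRuntime))
              size   (int (Math/ceil (/ n (double cores))))
              chunks (partition size size [] (range n))
              futs   (doall (map (fn [chunk] (future (doseq [x chunk] (burn x))))
                                 chunks))]
          (doseq [f futs] (deref f)))))

It won't explain the 16-burn pmap anomaly, but it caps concurrent memory use at roughly one burn per core, which is the effect I was getting at above.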
>
> Thanks,
>
>  -Lee
>
> --
> Lee Spector, Professor of Computer Science
> School of Cognitive Science, Hampshire College
> 893 West Street, Amherst, MA 01002-3359
> lspec...@hampshire.edu, http://hampshire.edu/lspector/
> Phone: 413-559-5352, Fax: 413-559-5438
>
> Check out Genetic Programming and Evolvable Machines:
> http://www.springer.com/10710 - http://gpemjournal.blogspot.com/