Lee: I ran your benchmark under Linux perf and also watched the run queue (with vmstat), and your bottleneck is basically memory access: roughly 80% of the CPU cycles are stalled, so the cores spend most of their time waiting rather than retiring instructions. A quick experiment to confirm this is sketched just below; after that, here's what I got on my machine.
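Before the numbers, a way to check this diagnosis yourself. Below is a sketch (untested against your setup, and burn-arith is just a name I made up) of a variant of your burn function that does the same kind of arithmetic but accumulates a single double instead of consing up a 10000-element list. If memory traffic from all that allocation is the real limiter, this version should scale much closer to 8x under pmap, while the original won't:

    ;; Arithmetic-only variant of burn: same loop bounds and a similar
    ;; mix of operations, but it accumulates a running double instead
    ;; of building a list, so it produces almost no garbage and very
    ;; little memory traffic.
    (defn burn-arith
      ([] (loop [i 0
                 acc 0.0]
            (if (>= i 10000)
              acc
              (recur (inc i)
                     (+ acc (* i (+ (float i)
                                    (- i (/ (float i)
                                            (inc i))))))))))
      ([_] (burn-arith)))

    ;; Time it under map and pmap, as in your -main; near-linear
    ;; scaling here (but not in burn) points at allocation rate and
    ;; memory bandwidth rather than at pmap or the JVM.
    (time (doall (map burn-arith (range 8))))
    (time (doall (pmap burn-arith (range 8))))

If both versions flatten out the same way, then I'm wrong about memory and the problem is elsewhere, but the perf numbers below make that unlikely.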
My machine: an Intel Core i7 (4 cores with Hyper-Threading, so 8 logical processors), 16 GiB of memory, Oracle JVM. This is how I ran perf:

    perf stat java -server -Xmx1024m -Xms1024m -jar target/cpu-burn-0.1.0-SNAPSHOT-standalone.jar

(Note: after monitoring the GC, even a 1 GiB heap is a lot of memory for your benchmark; it mostly used well under 256 MiB.) I also ran the (time (doall ..)) three times, to make sure that the JVM was warmed up. :)

    "Elapsed time: 49194.258825 msecs"    Warm-up 1
    (10000 10000 10000 10000 10000 10000 10000 10000)
    "Elapsed time: 48200.221677 msecs"    Warm-up 2
    (10000 10000 10000 10000 10000 10000 10000 10000)
    "Elapsed time: 50050.013156 msecs"    Run
    (10000 10000 10000 10000 10000 10000 10000 10000)

    Performance counter stats for 'java -server -Xmx1024m -Xms1024m -jar target/cpu-burn-0.1.0-SNAPSHOT-standalone.jar':

          1094967.286876 task-clock                #    7.383 CPUs utilized
                 170,384 context-switches          #    0.000 M/sec
                   2,932 CPU-migrations            #    0.000 M/sec
                  24,734 page-faults               #    0.000 M/sec
       3,004,827,628,531 cycles                    #    2.744 GHz                      [83.34%]
       2,430,611,924,472 stalled-cycles-frontend   #   80.89% frontend cycles idle     [83.34%]
       1,976,836,501,358 stalled-cycles-backend    #   65.79% backend  cycles idle     [66.67%]
         650,868,504,974 instructions              #    0.22  insns per cycle
                                                   #    3.73  stalled cycles per insn  [83.33%]
         124,479,033,687 branches                  #  113.683 M/sec                    [83.33%]
           1,753,021,734 branch-misses             #    1.41% of all branches          [83.33%]

           148.307812402 seconds time elapsed

The two stalled-cycles lines are the smoking gun: only 0.22 instructions retire per cycle. Now, if you run vmstat 1 while your benchmark is running, you'll notice that the run queue sits at 8 most of the time, meaning that 8 "processes" are waiting for a CPU -- and here that waiting is caused by memory accesses (that's specific to this workload; it is not true of all applications).

So I suspect your benchmark is artificial and does not truly represent your real application. Make sure to profile the real application and optimize according to the bottlenecks you actually find. There are really useful tools out there for profiling, such as VisualVM and perf.

Good luck,
Carlos

On Friday, December 7, 2012 8:25:14 PM UTC-5, Lee wrote:
>
> I've been running compute-intensive (multi-day), highly parallelizable
> Clojure processes on high-core-count machines and blithely assuming that,
> since I saw near-maximal CPU utilization in "top" and the like, I was
> probably getting good speedups.
>
> But a colleague recently did some tests and the results are really quite
> alarming.
>
> On Intel machines we're seeing speedups, but much less than I expected --
> about a 2x speedup going from 1 to 8 cores.
>
> But on AMD processors we're seeing SLOWDOWNS, with the same tests taking
> almost twice as long on 8 cores as on 1.
>
> I'm baffled, and unhappy that my runs are probably going slower on
> 48-core and 64-core nodes than on single-core nodes.
>
> It's possible that I'm just doing something wrong in the way that I
> dispatch the tasks, or that I've missed some Clojure or JVM setting...
> but right now I'm mystified and would really appreciate some help.
>
> I'm aware that there's overhead for multicore distribution and that one
> can expect slowdowns if the computations being distributed are fast
> relative to the dispatch overhead, but that should not be the case here.
> We're distributing computations that take seconds or minutes, and not
> huge numbers of them (at least in our tests while trying to figure out
> what's going on).
> I'm also aware that the test that produced the data I give below, insofar
> as it uses pmap to do the distribution, may leave cores idle for a bit if
> some tasks take a lot longer than others, because of the way that pmap
> allocates cores to threads. But that shouldn't be a big issue here either,
> because for this test all of the threads are doing the exact same
> computation. I also tried an agent-based dispatch approach that shouldn't
> have pmap's thread-allocation issue, and the results were about the same.
>
> Note also that all of the computations in this test are purely functional
> and independent -- there shouldn't be any resource-contention issues.
>
> The test: I wrote a time-consuming function that just does a bunch of
> math and list manipulation (which is what takes a lot of time in my real
> applications):
>
> (defn burn
>   ([] (loop [i 0
>              value '()]
>         (if (>= i 10000)
>           (count (last (take 10000 (iterate reverse value))))
>           (recur (inc i)
>                  (cons
>                   (* (int i)
>                      (+ (float i)
>                         (- (int i)
>                            (/ (float i)
>                               (inc (int i))))))
>                   value)))))
>   ([_] (burn)))
>
> Then I have a main function like this:
>
> (defn -main
>   [& args]
>   (time (doall (pmap burn (range 8))))
>   (System/exit 0))
>
> We run it with "lein run" (we've tried both Leiningen 1.7.1 and
> 2.0.0-preview10) on Java 1.7.0_03 (Java HotSpot(TM) 64-Bit Server VM).
> We also tried Java 1.6.0_22. We've tried various JVM memory options (via
> :jvm-opts with -Xmx and -Xms settings) and also with and without
> -XX:+UseParallelGC. None of this seems to change the picture
> substantially.
>
> The results that we get generally look like this:
>
> - On an Intel Core i7 3770K with 8 cores and 16GB of RAM, running the
> code above takes about 45 seconds (and all cores appear to be fully
> loaded as it does so). If we change the pmap to plain map, so that only a
> single core is used, the time goes up to about 1 minute and 36 seconds.
> So the speedup for 8 cores is just about 2x, even though there are 8
> completely independent tasks. That's pretty depressing.
>
> - But much worse: on a 4 x Opteron 6272 with 48 cores and 32GB of RAM,
> running the same test (with pmap) takes about 4 minutes and 2 seconds.
> That's really slow! Changing the pmap to map here produces a runtime of
> about 2 minutes and 20 seconds. So it's quite a bit faster on one core
> than on 8! And all of these times are terrible compared to those on the
> Intel.
>
> Another strange observation is that we can run multiple instances of the
> test on the same machine and (up to some limit, presumably) they don't
> seem to slow each other down, even though just one instance of the test
> appears to be maxing out all of the CPU according to "top". I suppose
> that means "top" isn't telling me what I thought -- my colleague says it
> can mean that something is blocked in some way with a full instruction
> queue. But I'm not interested in running multiple instances. I have
> single computations that involve multiple expensive but independent
> subcomputations, and I want to farm those subcomputations out to multiple
> cores -- and get speedups as a result. My subcomputations are so
> completely independent that I think I should be able to get speedups
> approaching a factor of n for n cores, but what I see is a factor of only
> about 2 on Intel machines, and a bizarre factor of about 1/2 on AMD
> machines.
>
> Any help would be greatly appreciated!
> Thanks,
>
> -Lee
>
> --
> Lee Spector, Professor of Computer Science
> Cognitive Science, Hampshire College
> 893 West Street, Amherst, MA 01002-3359
> lspe...@hampshire.edu, http://hampshire.edu/lspector/
> Phone: 413-559-5352, Fax: 413-559-5438