I haven't analyzed your results in detail, but here are some results I had on my 2 GHz 4-core Intel Core i7 MacBook Pro, vintage 2011.
When running multiple threads within a single JVM invocation, I never got a speedup of even 2x. The highest speedup I measured was 1.82, with 8 threads and -XX:+UseParallelGC. I also tried -XX:+UseParNewGC but never got a speedup over 1.45 (that was with 4 threads in parallel; it was lower with 8 threads).

When running multiple invocations of "lein2 run" in parallel as separate processes, I was able to achieve a speedup of 1.88 with 2 processes, 3.40 with 4 processes, and 5.34 with 8 processes (it went over 4, I think, because of the 2 hyperthreads on each of the 4 cores). This is a strong indication that the issue is some kind of interference between multiple threads in the same JVM, not the hardware -- at least on my hardware and OS (Mac OS X 10.6.8, Apple/Oracle Java 1.6.0_37).

My first guess would be that even with -XX:+UseParallelGC or -XX:+UseParNewGC there is either some kind of interference during garbage collection, or perhaps some kind of interference between the threads when they allocate memory. Should JVM memory allocations be completely parallel, with no synchronization, when running multiple threads, or do allocations sometimes lock a shared data structure?

Andy

On Dec 8, 2012, at 11:10 AM, Wm. Josiah Erikson wrote:

> Hi guys - I'm the colleague Lee speaks of. Because Jim mentioned running
> things on a 4-core Phenom II, I did some benchmarking on a Phenom II X4 945,
> and found some very strange results, which I shall post here, after I explain
> a little function that Lee wrote that is designed to get improved results
> over pmap. It looks like this:
>
> (defn pmapall
>   "Like pmap but: 1) coll should be finite, 2) the returned sequence
>   will not be lazy, 3) calls to f may occur in any order, to maximize
>   multicore processor utilization, and 4) takes only one coll so far."
>   [f coll]
>   (let [agents (map agent coll)]
>     (dorun (map #(send % f) agents))
>     (apply await agents)
>     (doall (map deref agents))))
>
> Refer to Lee's first post for the benchmarking routine (burn) we're running.
>
> I figured that, in order to figure out whether Java's multithreading was the
> problem (as opposed to memory bandwidth, or the OS, or whatever), I'd compare
> (doall (pmapall burn (range 8))) to running 8 concurrent copies of
> (burn (rand-int 8)), or even just (burn 2), or 4 copies of
> (doall (map burn (range 2))), or whatever. Does this make sense? I THINK it
> does. If it doesn't, then that's cool - just let me know why and I'll feel
> less crazy, because I am finding my results rather confounding.
>
> On said Phenom II X4 945 with 16GB of RAM, it takes 2:31 to do
> (doall (pmap burn (range 8))), 1:29 to do (doall (map burn (range 8))), and
> 1:48 to do (doall (pmapall burn (range 8))).
>
> So that's weird: although pmapall slows down less than pmap does, we still
> don't see a speedup compared to plain map. Watching processor utilization
> while these are going on shows that map is using one core, and both pmap and
> pmapall are using all four cores fully, as they should. So maybe the OS or
> the hardware just can't deal with running that many copies of burn at once?
> Maybe there's a memory bottleneck?
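> (If it is a memory or GC bottleneck, one cheap way to see it -- just a
> sketch, using standard HotSpot GC-logging flags on top of the :jvm-opts
> listed below -- would be to rerun the pmap/pmapall cases with GC logging
> enabled and compare the time the collector reports against the wall-clock
> gap:
>
>   :jvm-opts ["-Xmx1g" "-Xms1g" "-XX:+AggressiveOpts"
>              "-verbose:gc" "-XX:+PrintGCDetails" "-XX:+PrintGCTimeStamps"]
>
> If the collector only accounts for a few seconds of the difference, the
> slowdown is probably coming from somewhere else.)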
> Now here's the weird part: it takes around 29 seconds to do four concurrent
> copies of (doall (map burn (range 2))), and around 33 seconds to run 8
> concurrent copies of (burn 2). Yes. Read that again. What? Watching top while
> this is going on shows what you would expect to see: when I run four
> concurrent copies, I've got four copies of Java using 100% of a core each,
> and when I run eight concurrent copies, I see eight copies of Java, all
> using around 50% of a core each.
>
> Also, by the way, it takes 48 seconds to run two concurrent copies of
> (doall (map burn (range 4))) and 1:07 to run two concurrent copies of
> (doall (pmap burn (range 4))).
>
> What is going on here? Is Java's multithreading really THAT bad? This appears
> to me to prove that Java, or Clojure, has something very seriously wrong with
> it, or has outrageous amounts of overhead when spawning a new thread. No?
>
> All runs used :jvm-opts ["-Xmx1g" "-Xms1g" "-XX:+AggressiveOpts"] and Clojure
> 1.5.0-beta1. (I tried increasing the memory allowed for the pmap and pmapall
> runs, even to 8g, and it doesn't help at all.)
> Java(TM) SE Runtime Environment (build 1.7.0_03-b04)
> Java HotSpot(TM) 64-Bit Server VM (build 22.1-b02, mixed mode)
> on ROCKS 6.0 (CentOS 6.2) with kernel 2.6.32-220.13.1.el6.x86_64 #1 SMP
>
> Any thoughts or ideas?
>
> There's more weirdness, too, in case anybody is interested. I'm getting
> results that vary strangely from other benchmarks that are available, and
> make no sense to me. Check this out (these are incomplete, because I decided
> to dig deeper with the above benchmarks, but you'll see, I think, why this is
> so confusing, if you know how fast these processors are "supposed" to be):
>
> All runs used :jvm-opts ["-Xmx1g" "-Xms1g" "-XX:+AggressiveOpts"] and Clojure
> 1.5.0-beta1, with:
> Java(TM) SE Runtime Environment (build 1.7.0_03-b04)
> Java HotSpot(TM) 64-Bit Server VM (build 22.1-b02, mixed mode)
>
> Key:
>   1. (doall (pmap burn (range 8)))
>   2. (doall (map burn (range 8)))
>   3. 8 concurrent copies of (doall (pmap burn (range 8)))
>   4. 8 concurrent copies of (doall (map burn (range 8)))
>   5. (doall (pmapall burn (range 8)))
>
> 4x AMD Opteron 6168:
>   1. 4:02.06
>   2. 2:20.29
>   3.
>   4.
>
> AMD Phenom II X4 945:
>   1. 2:31.65
>   2. 1:29.90
>   3. 3:32.60
>   4. 3:08.97
>   5. 1:48.36
>
> AMD Phenom II X6 1100T:
>   1. 2:03.71
>   2. 1:14.76
>   3. 2:20.14
>   4. 1:57.38
>   5. 2:14.43
>
> AMD FX 8120:
>   1. 4:50.06
>   2. 1:25.04
>   3. 5:55.84
>   4. 2:46.94
>   5. 4:36.61
>
> AMD FX 8350:
>   1. 3:42.35
>   2. 1:13.94
>   3. 3:00.46
>   4. 2:06.18
>   5. 3:56.95
>
> Intel Core i7 3770K:
>   1. 0:44
>   2. 1:37.18
>   3. 2:29.41
>   4. 2:16.05
>   5. 0:44.42
>
> 2 x Intel Paxville DP Xeon:
>   1. 6:26.112
>   2. 3:20.149
>   3. 8:09.85
>   4. 7:06.52
>   5. 5:55.29
>
> On Saturday, December 8, 2012 9:36:56 AM UTC-5, Marshall Bockrath-Vandegrift wrote:
> Lee Spector <lspe...@hampshire.edu> writes:
> > I'm also aware that the test that produced the data I give below,
> > insofar as it uses pmap to do the distribution, may leave cores idle
> > for a bit if some tasks take a lot longer than others, because of the
> > way that pmap allocates cores to threads.
>
> Although it doesn't impact your benchmark, `pmap` may be further
> adversely affecting the performance of your actual program. There's an
> open bug regarding `pmap` and chunked seqs:
>
> http://dev.clojure.org/jira/browse/CLJ-862
>
> The impact is that `pmap` with chunked seq input will spawn futures for
> its function applications in flights of 32, spawning as many flights as
> necessary to reach or exceed #CPUS + 2. On a 48-way system, it will
> initially launch 64 futures, then spawn an additional 32 every time the
> number of active unrealized futures drops below 50, leading to
> significant contention for a CPU-bound application.
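> For what it's worth, one commonly suggested way to sidestep the chunked
> realization -- sketched here, untested on this benchmark, and not a fix for
> the underlying bug, since pmap's #CPUS + 2 window stays the same -- is to
> hand `pmap` an unchunked seq:
>
>   (defn unchunk
>     "Wrap a (possibly chunked) seq so it is realized one element at a time."
>     [s]
>     (lazy-seq
>       (when-let [s (seq s)]
>         (cons (first s) (unchunk (rest s))))))
>
>   ;; e.g. (pmap burn (unchunk (range 8)))
>
> With an unchunked input, pmap only runs ahead of consumption by its
> #CPUS + 2 window instead of launching futures 32 at a time.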
> I hope it can be made useful in a future version of Clojure, but right
> now `pmap` is more of an attractive nuisance than anything else.
>
> -Marshall
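P.S. If the interference turns out to come from how the work gets scheduled onto threads rather than from GC or allocation, one more thing that might be worth trying -- a sketch only, not something I've benchmarked, and pmap-fixed is just a made-up helper name -- is pushing the work through a plain fixed-size thread pool instead of pmap or agents:

(import '(java.util.concurrent Executors ExecutorService Future))

(defn pmap-fixed
  "Hypothetical pmap-like helper: run (f x) for each x in coll on a fixed
  pool of n threads and return the results in order. Eager, not lazy."
  [n f coll]
  (let [^ExecutorService pool (Executors/newFixedThreadPool n)]
    (try
      ;; Clojure fns implement Callable, so the closures can go straight to
      ;; invokeAll, which blocks until every task has finished.
      (->> (.invokeAll pool (mapv (fn [x] (fn [] (f x))) coll))
           (mapv (fn [^Future fut] (.get fut))))
      (finally
        (.shutdown pool)))))

;; e.g. (time (pmap-fixed 4 burn (range 8))), with burn from Lee's first post.

That pins the thread count to exactly n, which at least takes future/agent scheduling out of the picture when comparing against the separate-process runs.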