Hi guys - I'm the colleague Lee speaks of. Because Jim mentioned running things on a 4-core Phenom II, I did some benchmarking on a Phenom II X4 945 and found some very strange results, which I'll post here after explaining a little function Lee wrote that is designed to get improved results over pmap. It looks like this:
    (defn pmapall
      "Like pmap but: 1) coll should be finite, 2) the returned sequence
      will not be lazy, 3) calls to f may occur in any order, to maximize
      multicore processor utilization, and 4) takes only one coll so far."
      [f coll]
      (let [agents (map agent coll)]
        (dorun (map #(send % f) agents))
        (apply await agents)
        (doall (map deref agents))))

Refer to Lee's first post for the benchmarking routine (burn) we're running. I figured that, in order to work out whether Java's multithreading was the problem (as opposed to memory bandwidth, or the OS, or whatever), I'd compare (doall (pmapall burn (range 8))) against running 8 concurrent copies of (burn (rand-int 8)), or even just (burn 2), or 4 copies of (doall (map burn (range 2))), or whatever. Does this make sense? I THINK it does. If it doesn't, then that's cool - just let me know why and I'll feel less crazy, because I am finding my results rather confounding.

On said Phenom II X4 945 with 16GB of RAM:

    (doall (pmap burn (range 8)))       2:31
    (doall (map burn (range 8)))        1:29
    (doall (pmapall burn (range 8)))    1:48

So that's weird, because although pmapall reduces the slowdown relative to pmap, we still don't see a speedup compared to plain map. Watching processor utilization while these are going on shows that map uses one core, and both pmap and pmapall use all four cores fully, as they should. So maybe the OS or the hardware just can't deal with running that many copies of burn at once? Maybe there's a memory bottleneck?

Now here's the weird part: it takes around 29 seconds to run four concurrent copies of (doall (map burn (range 2))), and around 33 seconds to run 8 concurrent copies of (burn 2). Yes. Read that again. What? Watching top while this is going on shows what you would expect to see: with four concurrent copies there are four Java processes each using 100% of a core, and with eight concurrent copies there are eight Java processes each using around 50% of a core.
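For reference, the whole comparison can be reproduced with something like the sketch below. Note that this `burn` is a hypothetical stand-in for Lee's CPU-bound function (his real definition is in his first post); `pmapall` is repeated so the snippet is self-contained, and the timings just use Clojure's built-in `time`.

```clojure
;; pmapall as defined above, repeated so this snippet is self-contained.
(defn pmapall [f coll]
  (let [agents (map agent coll)]
    (dorun (map #(send % f) agents))
    (apply await agents)
    (doall (map deref agents))))

;; Hypothetical stand-in for Lee's `burn` (see his first post for the
;; real definition): any purely CPU-bound function will do here.
(defn burn [n]
  (reduce + (map #(* % %) (range (+ 1000000 n)))))

;; The three timings compared in the text:
(time (doall (map burn (range 8))))     ; single-threaded baseline
(time (doall (pmap burn (range 8))))    ; chunked futures
(time (doall (pmapall burn (range 8)))) ; one agent per element
```

Since `burn` is pure, all three expressions should return the same sequence; only the wall-clock time should differ.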
Also, by the way, it takes 48 seconds to run two concurrent copies of (doall (map burn (range 4))), and 1:07 to run two concurrent copies of (doall (pmap burn (range 4))). What is going on here? Is Java's multithreading really THAT bad? This appears to me to prove that Java, or Clojure, has something very seriously wrong with it, or has outrageous amounts of overhead when spawning a new thread. No?

All runs used :jvm-opts ["-Xmx1g" "-Xms1g" "-XX:+AggressiveOpts"] and Clojure 1.5.0-beta1 (I tried increasing the memory allowed for the pmap and pmapall runs, even to 8g, and it doesn't help at all), on Java(TM) SE Runtime Environment (build 1.7.0_03-b04), Java HotSpot(TM) 64-Bit Server VM (build 22.1-b02, mixed mode), on ROCKS 6.0 (CentOS 6.2) with kernel 2.6.32-220.13.1.el6.x86_64 #1 SMP. Any thoughts or ideas?

There's more weirdness, too, in case anybody is interested. I'm getting results that vary strangely from other benchmarks that are available, and make no sense to me. Check this out (these are incomplete, because I decided to dig deeper with the above benchmarks, but you'll see, I think, why this is so confusing, if you know how fast these processors are "supposed" to be). Same JVM options, Clojure, and Java versions as above.

Key:
  1. (pmap burn (range 8))
  2. (map burn (range 8))
  3. 8 concurrent copies of (pmap burn (range 8))
  4. 8 concurrent copies of (map burn (range 8))
  5. (pmapall burn (range 8))

                                1.        2.        3.        4.        5.
  4x AMD Opteron 6168           4:02.06   2:20.29   -         -         -
  AMD Phenom II X4 945          2:31.65   1:29.90   3:32.60   3:08.97   1:48.36
  AMD Phenom II X6 1100T        2:03.71   1:14.76   2:20.14   1:57.38   2:14.43
  AMD FX 8120                   4:50.06   1:25.04   5:55.84   2:46.94   4:36.61
  AMD FX 8350                   3:42.35   1:13.94   3:00.46   2:06.18   3:56.95
  Intel Core i7 3770K           0:44      1:37.18   2:29.41   2:16.05   0:44.42
  2x Intel Paxville DP Xeon     6:26.112  3:20.149  8:09.85   7:06.52   5:55.29

On Saturday, December 8, 2012 9:36:56 AM UTC-5, Marshall Bockrath-Vandegrift wrote:
> Lee Spector <lspe...@hampshire.edu> writes:
>
> > I'm also aware that the test that produced the data I give below,
> > insofar as it uses pmap to do the distribution, may leave cores idle
> > for a bit if some tasks take a lot longer than others, because of the
> > way that pmap allocates cores to threads.
>
> Although it doesn't impact your benchmark, `pmap` may be further
> adversely affecting the performance of your actual program. There's an
> open bug regarding `pmap` and chunked seqs:
>
> http://dev.clojure.org/jira/browse/CLJ-862
>
> The impact is that `pmap` with chunked seq input will spawn futures for
> its function applications in flights of 32, spawning as many flights as
> necessary to reach or exceed #CPUS + 2. On a 48-way system, it will
> initially launch 64 futures, then spawn an additional 32 every time the
> number of active unrealized futures drops below 50, leading to
> significant contention for a CPU-bound application.
>
> I hope it can be made useful in a future version of Clojure, but right
> now `pmap` is more of an attractive nuisance than anything else.
>
> -Marshall
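Incidentally, a commonly suggested workaround for the chunking behavior Marshall describes (I haven't verified that it helps on these particular benchmarks) is to "unchunk" the input seq so that pmap realizes it one element at a time instead of 32 at a time. A sketch:

```clojure
;; Re-wrap a (possibly chunked) seq as a one-element-at-a-time lazy
;; seq, so that pmap spawns its futures individually rather than in
;; flights of 32.
(defn unchunk [s]
  (lazy-seq
    (when-let [s (seq s)]
      (cons (first s) (unchunk (rest s))))))

;; e.g. (pmap burn (unchunk (range 8)))
```

`range` returns a chunked seq, which is what triggers the CLJ-862 behavior; `unchunk` preserves the elements but hides the chunks from pmap's internal `map` over futures.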