Hi guys - I'm the colleague Lee speaks of. Because Jim mentioned running things on a 4-core Phenom II, I did some benchmarking on a Phenom II X4 945 and found some very strange results, which I'll post here after explaining a little function Lee wrote that is designed to get improved results over pmap. It looks like this:
    (defn pmapall
      "Like pmap but: 1) coll should be finite, 2) the returned sequence
      will not be lazy, 3) calls to f may occur in any order, to maximize
      multicore processor utilization, and 4) takes only one coll so far."
      [f coll]
      (let [agents (map agent coll)]
        (dorun (map #(send % f) agents))
        (apply await agents)
        (doall (map deref agents))))

Refer to Lee's first post for the benchmarking routine (burn) we're running. I figured that, in order to work out whether Java's multithreading was the problem (as opposed to memory bandwidth, or the OS, or whatever), I'd compare (doall (pmapall burn (range 8))) against running 8 concurrent copies of (burn (rand-int 8)), or even just (burn 2), or 4 copies of (doall (map burn (range 2))), or whatever. Does this make sense? I THINK it does. If it doesn't, then that's cool - just let me know why and I'll feel less crazy, because I am finding my results rather confounding.

On said Phenom II X4 945 with 16GB of RAM:

    (doall (pmap burn (range 8)))       2:31
    (doall (map burn (range 8)))        1:29
    (doall (pmapall burn (range 8)))    1:48

So that's weird, because although pmapall reduces the slowdown relative to pmap, we still don't see a speedup compared to plain map. Watching processor utilization while these are going on shows that map uses one core, and both pmap and pmapall use all four cores fully, as they should. So maybe the OS or the hardware just can't deal with running that many copies of burn at once? Maybe there's a memory bottleneck?

Now here's the weird part: it takes around 29 seconds to run four concurrent copies of (doall (map burn (range 2))), and around 33 seconds to run 8 concurrent copies of (burn 2). Yes. Read that again. What? Watching top while this is going on shows what you would expect to see: with four concurrent copies there are four Java processes each using 100% of a core, and with eight concurrent copies there are eight Java processes each using around 50% of a core.
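For reference, the whole comparison can be reproduced with something like the sketch below. Note that this `burn` is a hypothetical stand-in for Lee's CPU-bound function (his real definition is in his first post); `pmapall` is repeated so the snippet is self-contained, and the timings just use Clojure's built-in `time`.

```clojure
;; pmapall as defined above, repeated so this snippet is self-contained.
(defn pmapall [f coll]
  (let [agents (map agent coll)]
    (dorun (map #(send % f) agents))
    (apply await agents)
    (doall (map deref agents))))

;; Hypothetical stand-in for Lee's `burn` (see his first post for the
;; real definition): any purely CPU-bound function will do here.
(defn burn [n]
  (reduce + (map #(* % %) (range (+ 1000000 n)))))

;; The three timings compared in the text:
(time (doall (map burn (range 8))))     ; single-threaded baseline
(time (doall (pmap burn (range 8))))    ; chunked futures
(time (doall (pmapall burn (range 8)))) ; one agent per element
```

Since `burn` is pure, all three expressions should return the same sequence; only the wall-clock time should differ.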
Also, by the way, it takes 48 seconds to run two concurrent copies of (doall (map burn (range 4))), and 1:07 to run two concurrent copies of (doall (pmap burn (range 4))). What is going on here? Is Java's multithreading really THAT bad? This appears to me to prove that Java, or Clojure, has something very seriously wrong with it, or has outrageous amounts of overhead when spawning a new thread. No?

All runs used :jvm-opts ["-Xmx1g" "-Xms1g" "-XX:+AggressiveOpts"] and Clojure 1.5.0-beta1 (I tried increasing the memory allowed for the pmap and pmapall runs, even to 8g, and it doesn't help at all), on Java(TM) SE Runtime Environment (build 1.7.0_03-b04), Java HotSpot(TM) 64-Bit Server VM (build 22.1-b02, mixed mode), on ROCKS 6.0 (CentOS 6.2) with kernel 2.6.32-220.13.1.el6.x86_64 #1 SMP. Any thoughts or ideas?

There's more weirdness, too, in case anybody is interested. I'm getting results that vary strangely from other benchmarks that are available, and make no sense to me. Check this out (these are incomplete, because I decided to dig deeper with the above benchmarks, but you'll see, I think, why this is so confusing, if you know how fast these processors are "supposed" to be). Same JVM options, Clojure, and Java versions as above.

Key:
  1. (pmap burn (range 8))
  2. (map burn (range 8))
  3. 8 concurrent copies of (pmap burn (range 8))
  4. 8 concurrent copies of (map burn (range 8))
  5. (pmapall burn (range 8))

                                1.        2.        3.        4.        5.
  4x AMD Opteron 6168           4:02.06   2:20.29   -         -         -
  AMD Phenom II X4 945          2:31.65   1:29.90   3:32.60   3:08.97   1:48.36
  AMD Phenom II X6 1100T        2:03.71   1:14.76   2:20.14   1:57.38   2:14.43
  AMD FX 8120                   4:50.06   1:25.04   5:55.84   2:46.94   4:36.61
  AMD FX 8350                   3:42.35   1:13.94   3:00.46   2:06.18   3:56.95
  Intel Core i7 3770K           0:44      1:37.18   2:29.41   2:16.05   0:44.42
  2x Intel Paxville DP Xeon     6:26.112  3:20.149  8:09.85   7:06.52   5:55.29

On Saturday, December 8, 2012 9:36:56 AM UTC-5, Marshall Bockrath-Vandegrift wrote:
> Lee Spector <lspe...@hampshire.edu> writes:
>
> > I'm also aware that the test that produced the data I give below,
> > insofar as it uses pmap to do the distribution, may leave cores idle
> > for a bit if some tasks take a lot longer than others, because of the
> > way that pmap allocates cores to threads.
>
> Although it doesn't impact your benchmark, `pmap` may be further
> adversely affecting the performance of your actual program. There's an
> open bug regarding `pmap` and chunked seqs:
>
> http://dev.clojure.org/jira/browse/CLJ-862
>
> The impact is that `pmap` with chunked seq input will spawn futures for
> its function applications in flights of 32, spawning as many flights as
> necessary to reach or exceed #CPUS + 2. On a 48-way system, it will
> initially launch 64 futures, then spawn an additional 32 every time the
> number of active unrealized futures drops below 50, leading to
> significant contention for a CPU-bound application.
>
> I hope it can be made useful in a future version of Clojure, but right
> now `pmap` is more of an attractive nuisance than anything else.
>
> -Marshall
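Incidentally, a commonly suggested workaround for the chunking behavior Marshall describes (I haven't verified that it helps on these particular benchmarks) is to "unchunk" the input seq so that pmap realizes it one element at a time instead of 32 at a time. A sketch:

```clojure
;; Re-wrap a (possibly chunked) seq as a one-element-at-a-time lazy
;; seq, so that pmap spawns its futures individually rather than in
;; flights of 32.
(defn unchunk [s]
  (lazy-seq
    (when-let [s (seq s)]
      (cons (first s) (unchunk (rest s))))))

;; e.g. (pmap burn (unchunk (range 8)))
```

`range` returns a chunked seq, which is what triggers the CLJ-862 behavior; `unchunk` preserves the elements but hides the chunks from pmap's internal `map` over futures.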