Lee:

So I ran

On Friday, December 7, 2012 8:25:14 PM UTC-5, Lee wrote:
>
>
> I've been running compute intensive (multi-day), highly parallelizable 
> Clojure processes on high-core-count machines and blithely assuming that 
> since I saw near maximal CPU utilization in "top" and the like that I was 
> probably getting good speedups. 
>
> But a colleague recently did some tests and the results are really quite 
> alarming. 
>
> On intel machines we're seeing speedups but much less than I expected -- 
> about a 2x speedup going from 1 to 8 cores. 
>
> But on AMD processors we're seeing SLOWDOWNS, with the same tests taking 
> almost twice as long on 8 cores as on 1. 
>
> I'm baffled, and unhappy that my runs are probably going slower on 48-core 
> and 64-core nodes than on single-core nodes. 
>
> It's possible that I'm just doing something wrong in the way that I 
> dispatch the tasks, or that I've missed some Clojure or JVM setting... but 
> right now I'm mystified and would really appreciate some help. 
>
> I'm aware that there's overhead for multicore distribution and that one 
> can expect slowdowns if the computations that are being distributed are 
> fast relative to the dispatch overhead, but this should not be the case 
> here. We're distributing computations that take seconds or minutes, and not 
> huge numbers of them (at least in our tests while trying to figure out 
> what's going on). 
>
> I'm also aware that the test that produced the data I give below, insofar 
> as it uses pmap to do the distribution, may leave cores idle for a bit if 
> some tasks take a lot longer than others, because of the way that pmap 
> allocates cores to threads. But that also shouldn't be a big issue here 
> because for this test all of the threads are doing the exact same 
> computation. And I also tried using an agent-based dispatch approach that 
> shouldn't have the pmap thread allocation issue, and the results were about 
> the same. 
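
(For concreteness, here's a rough sketch of the kind of chunking-free dispatch
being described -- one future per task instead of pmap's windowed allocation.
This is a future-based variant, not the agent-based code mentioned above, and
the helper name is just illustrative:)

(defn dispatch-with-futures
  "Start one future per input and block for all the results.
  Clojure futures run on an unbounded (cached) thread pool, so every
  task starts immediately rather than being windowed the way pmap
  windows its work."
  [f inputs]
  (let [tasks (doall (map #(future (f %)) inputs))] ; launch all tasks up front
    (mapv deref tasks)))                            ; wait for each result

;; e.g. (time (dispatch-with-futures burn (range 8)))
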
>
> Note also that all of the computations in this test are purely functional 
> and independent -- there shouldn't be any resource contention issues. 
>
> The test: I wrote a time-consuming function that just does a bunch of math 
> and list manipulation (which is what takes a lot of time in my real 
> applications): 
>
> (defn burn 
>   ([] (loop [i 0 
>              value '()] 
>         (if (>= i 10000) 
>           (count (last (take 10000 (iterate reverse value)))) 
>           (recur (inc i) 
>                  (cons 
>                    (* (int i) 
>                       (+ (float i) 
>                          (- (int i) 
>                             (/ (float i) 
>                                (inc (int i)))))) 
>                    value))))) 
>   ([_] (burn))) 
>
> Then I have a main function like this: 
>
> (defn -main 
>   [& args] 
>   (time (doall (pmap burn (range 8)))) 
>   (System/exit 0)) 
>
> We run it with "lein run" (we've tried both Leiningen 1.7.1 and 
> 2.0.0-preview10) with Java 1.7.0_03 (Java HotSpot(TM) 64-Bit Server VM). We 
> also tried Java 1.6.0_22. We've tried various JVM memory options (via 
> :jvm-opts with -Xmx and -Xms settings) and also with and without 
> -XX:+UseParallelGC. None of this seems to change the picture substantially. 
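
(For reference, the kind of project.clj settings being described would look
roughly like this -- the project name, versions, and heap sizes are
placeholders, not the exact configuration from these runs:)

(defproject burn-test "0.1.0-SNAPSHOT"
  :description "Toy project for timing the burn benchmark"
  :dependencies [[org.clojure/clojure "1.4.0"]]
  :main burn-test.core
  ;; example heap/GC flags; the sizes here are only placeholders
  :jvm-opts ["-Xms2g" "-Xmx2g" "-XX:+UseParallelGC"])
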
>
> The results that we get generally look like this: 
>
> - On an Intel Core i7 3770K with 8 cores and 16GB of RAM, running the code 
> above, it takes about 45 seconds (and all cores appear to be fully loaded 
> as it does so). If we change the pmap to just plain map, so that we use 
> only a single core, the time goes up to about 1 minute and 36 seconds. So 
> the speedup for 8 cores is just about 2x, even though there are 8 
> completely independent tasks. So that's pretty depressing. 
>
> - But much worse: on a 4 x Opteron 6272 with 48 cores and 32GB of RAM, 
> running the same test (with pmap) takes about 4 minutes and 2 seconds. 
> That's really slow! Changing the pmap to map here produces a runtime of 
> about 2 minutes and 20 seconds. So it's quite a bit faster on one core than 
> on 8! And all of these times are terrible compared to those on the Intel. 
>
> Another strange observation is that we can run multiple instances of the 
> test on the same machine and (up to some limit, presumably) they don't seem 
> to slow each other down, even though just one instance of the test appears 
> to be maxing out all of the CPU according to "top". I suppose that means 
> that "top" isn't telling me what I thought -- my colleague says it can mean 
> that something is blocked in some way with a full instruction queue. But 
> I'm not interested in running multiple instances. I have single 
> computations that involve multiple expensive but independent 
> subcomputations, and I want to farm those subcomputations out to multiple 
> cores -- and get speedups as a result. My subcomputations are so completely 
> independent that I think I should be able to get speedups approaching a 
> factor of n for n cores, but what I see is a factor of only about 2 on 
> Intel machines, and a bizarre factor of about 1/2 on AMD machines. 
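
(One way to double-check that the dispatch layer isn't the problem: skip
pmap/agents entirely and run the same work on raw Java threads. A rough
sketch, with illustrative names:)

(defn run-on-raw-threads
  "Run (burn i) for i in 0..n-1, one plain java.lang.Thread each,
  and wait for them all to finish."
  [n]
  (let [threads (mapv (fn [i] (Thread. #(burn i))) (range n))]
    (doseq [t threads] (.start t))  ; launch every worker thread
    (doseq [t threads] (.join t)))) ; block until they all complete

;; e.g. (time (run-on-raw-threads 8))
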
>
> Any help would be greatly appreciated! 
>
> Thanks, 
>
>  -Lee 
>
> -- 
> Lee Spector, Professor of Computer Science 
> Cognitive Science, Hampshire College 
> 893 West Street, Amherst, MA 01002-3359 
> lspe...@hampshire.edu, http://hampshire.edu/lspector/ 
> Phone: 413-559-5352, Fax: 413-559-5438 
>
>
