A correction regarding the run queue: what I wrote earlier is not completely correct, sorry. But the point about stalled cycles and memory accesses still holds.
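One way to check the memory-stalls hypothesis: Lee's `burn` (quoted below) spends its time consing onto a list and repeatedly reversing it, which is all allocation and pointer-chasing. A sketch of an arithmetic-only variant (the name `burn-arith` is mine, and this is illustrative, not a claim about his real workload) that does the same math but allocates nothing per iteration:

```clojure
;; Hypothetical variant of `burn` for comparison: same arithmetic,
;; but accumulating into a double instead of consing onto a list.
;; If `burn` scales badly under pmap while this scales roughly
;; linearly with cores, that points at memory traffic (cache misses,
;; GC pressure) rather than at the CPUs themselves.
(defn burn-arith
  ([] (loop [i 0
             acc 0.0]
        (if (>= i 10000)
          acc
          (recur (inc i)
                 (+ acc (* i (+ i (- i (/ (double i) (inc i))))))))))
  ([_] (burn-arith)))
```

Running `(time (doall (pmap burn-arith (range 8))))` next to the original would separate "the cores are busy doing arithmetic" from "the cores are stalled waiting on memory".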
Sorry for the misinformation.

On Friday, December 7, 2012 8:25:14 PM UTC-5, Lee wrote:
>
> I've been running compute-intensive (multi-day), highly parallelizable
> Clojure processes on high-core-count machines and blithely assuming that
> since I saw near-maximal CPU utilization in "top" and the like, I was
> probably getting good speedups.
>
> But a colleague recently did some tests and the results are really quite
> alarming.
>
> On Intel machines we're seeing speedups, but much less than I expected --
> about a 2x speedup going from 1 to 8 cores.
>
> But on AMD processors we're seeing SLOWDOWNS, with the same tests taking
> almost twice as long on 8 cores as on 1.
>
> I'm baffled, and unhappy that my runs are probably going slower on 48-core
> and 64-core nodes than on single-core nodes.
>
> It's possible that I'm just doing something wrong in the way that I
> dispatch the tasks, or that I've missed some Clojure or JVM setting... but
> right now I'm mystified and would really appreciate some help.
>
> I'm aware that there's overhead for multicore distribution and that one
> can expect slowdowns if the computations being distributed are fast
> relative to the dispatch overhead, but that should not be the case here.
> We're distributing computations that take seconds or minutes, and not huge
> numbers of them (at least in our tests while trying to figure out what's
> going on).
>
> I'm also aware that the test that produced the data I give below, insofar
> as it uses pmap to do the distribution, may leave cores idle for a bit if
> some tasks take a lot longer than others, because of the way that pmap
> allocates cores to threads. But that also shouldn't be a big issue here,
> because for this test all of the threads are doing the exact same
> computation. I also tried an agent-based dispatch approach that shouldn't
> have the pmap thread-allocation issue, and the results were about the same.
> Note also that all of the computations in this test are purely functional
> and independent -- there shouldn't be any resource contention issues.
>
> The test: I wrote a time-consuming function that just does a bunch of math
> and list manipulation (which is what takes a lot of time in my real
> applications):
>
> (defn burn
>   ([] (loop [i 0
>              value '()]
>         (if (>= i 10000)
>           (count (last (take 10000 (iterate reverse value))))
>           (recur (inc i)
>                  (cons (* (int i)
>                           (+ (float i)
>                              (- (int i)
>                                 (/ (float i)
>                                    (inc (int i))))))
>                        value)))))
>   ([_] (burn)))
>
> Then I have a main function like this:
>
> (defn -main
>   [& args]
>   (time (doall (pmap burn (range 8))))
>   (System/exit 0))
>
> We run it with "lein run" (we've tried both Leiningen 1.7.1 and
> 2.0.0-preview10) with Java 1.7.0_03 (Java HotSpot(TM) 64-Bit Server VM).
> We also tried Java 1.6.0_22. We've tried various JVM memory options (via
> :jvm-opts with -Xmx and -Xms settings) and also with and without
> -XX:+UseParallelGC. None of this seems to change the picture substantially.
>
> The results that we get generally look like this:
>
> - On an Intel Core i7 3770K with 8 cores and 16GB of RAM, running the
> code above takes about 45 seconds (and all cores appear to be fully
> loaded as it does so). If we change the pmap to plain map, so that we use
> only a single core, the time goes up to about 1 minute and 36 seconds. So
> the speedup for 8 cores is just about 2x, even though there are 8
> completely independent tasks. That's pretty depressing.
>
> - But much worse: on a 4 x Opteron 6272 with 48 cores and 32GB of RAM,
> running the same test (with pmap) takes about 4 minutes and 2 seconds.
> That's really slow! Changing the pmap to map here produces a runtime of
> about 2 minutes and 20 seconds. So it's quite a bit faster on one core
> than on 8! And all of these times are terrible compared to those on the
> Intel.
> Another strange observation is that we can run multiple instances of the
> test on the same machine and (up to some limit, presumably) they don't
> seem to slow each other down, even though just one instance of the test
> appears to be maxing out all of the CPU according to "top". I suppose that
> means that "top" isn't telling me what I thought -- my colleague says it
> can mean that something is blocked in some way with a full instruction
> queue. But I'm not interested in running multiple instances. I have single
> computations that involve multiple expensive but independent
> subcomputations, and I want to farm those subcomputations out to multiple
> cores -- and get speedups as a result. My subcomputations are so
> completely independent that I think I should be able to get speedups
> approaching a factor of n for n cores, but what I see is a factor of only
> about 2 on Intel machines, and a bizarre factor of about 1/2 on AMD
> machines.
>
> Any help would be greatly appreciated!
>
> Thanks,
>
> -Lee
>
> --
> Lee Spector, Professor of Computer Science
> Cognitive Science, Hampshire College
> 893 West Street, Amherst, MA 01002-3359
> lspe...@hampshire.edu, http://hampshire.edu/lspector/
> Phone: 413-559-5352, Fax: 413-559-5438
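On the pmap scheduling point: pmap is semi-lazy and only runs roughly ncpus + 2 tasks ahead of consumption, so stragglers can leave cores idle at chunk boundaries. For a handful of coarse tasks like these, firing one future per task sidesteps that entirely. A minimal sketch (the name `parallel-map` is mine):

```clojure
;; Hedged alternative to pmap for a small number of expensive,
;; independent tasks: start every task immediately as a future on
;; Clojure's unbounded future pool, then deref them all. No task
;; waits for a chunk boundary the way it can under pmap.
(defn parallel-map
  [f coll]
  (let [tasks (doall (map #(future (f %)) coll))]  ; doall forces all futures to start
    (mapv deref tasks)))

;; usage, with Lee's benchmark:
;; (time (doall (parallel-map burn (range 8))))
```

This won't fix a memory-bandwidth bottleneck, of course -- if the cores are stalled on memory, no dispatch strategy will recover the missing speedup -- but it removes pmap's scheduling as a variable when interpreting the timings.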