2010/8/4 Laurent PETIT <laurent.pe...@gmail.com>

> 2010/8/4 David Nolen <dnolen.li...@gmail.com>
>
>> Have you considered that you're realizing very large lazy sequences and might be thrashing around in GC? The parallel versions need X times the available memory of the sequential version, where X is the number of concurrent threads, right?
>
> sorry david, I didn't read your post carefully enough.
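
One way to probe the memory hypothesis David raises, a hypothetical sketch rather than anything posted in the thread, is a burn variant that does the same number of multiplications without realizing a million-element lazy sequence, so each thread's working set stays small:

(defn burn-no-seq
  ;; hypothetical variant of burn: same arithmetic, but no lazy seq is built,
  ;; so running several of these concurrently should not multiply the retained memory
  ([] (loop [i 0]
        (if (< i 1000000)
          (do (* 9999999999 9999999999)
              (recur (inc i)))
          i)))
  ([_] (burn-no-seq)))

If the concurrent timings look much better with this variant than with the original burn, GC pressure from the realized sequences is a plausible explanation.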
>
> David, I don't think so, the burn function does not seem to hold onto the head.
>
> There's indeed a "potential" problem in the pmap version since it holds onto the head of an n-sized sequence, but since n ranges from 4 to 32 it can hardly be the problem.
>
> Lee, I don't have the general answer, but as a related note, I think there may be a problem with the "with futures" version:
>
> a. you quickly "bootstrap" all futures in the inner call to map
> b. you collect the results of the futures in parallel in the outer call to pmap
>
> as a reminder:
>
> (defn burn-via-futures [n]
>   (print n " burns via futures: ")
>   (time (doall (pmap deref (map (fn [_] (future (burn)))
>                                 (range n))))))
>
> problem #1: since map is lazy, the bootstrapping of the futures will follow the consumption of the seq by pmap (modulo chunked seq behavior). So to quickly bootstrap all your futures before passing the seq to pmap, you should wrap the (map) inside a doall.
>
> problem #2: maybe deref is a quick enough operation that using pmap with deref does not make sense (or would only make sense if the number of cores were really big, e.g. if the coll were of size 1,000,000 and the number of cores were of the same order of magnitude).
>
>> David
>>
>> On Wed, Aug 4, 2010 at 10:36 AM, Lee Spector <lspec...@hampshire.edu> wrote:
>>
>>> Apologies for the length of this message -- I'm hoping to be complete, but that made the message pretty long.
>>>
>>> Also BTW most of the tests below were run using Clojure 1.1. If part of the answer to my questions is "use 1.2" then I'll upgrade ASAP (but I haven't done so yet because I'd prefer to be confused by one thing at a time :-). I don't think that can be the full answer, though, since the last batch of runs below WERE run under 1.2 and they're also problematic...
>>>
>>> Also, for most of the runs described here (with the one exception noted below) I am running under Linux:
>>>
>>> [lspec...@fly ~]$ cat /proc/version
>>> Linux version 2.6.18-164.6.1.el5 (mockbu...@builder10.centos.org) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-46)) #1 SMP Tue Nov 3 16:12:36 EST 2009
>>>
>>> with this Java version:
>>>
>>> [lspec...@fly ~]$ java -version
>>> java version "1.6.0_16"
>>> Java(TM) SE Runtime Environment (build 1.6.0_16-b01)
>>> Java HotSpot(TM) 64-Bit Server VM (build 14.2-b01, mixed mode)
>>>
>>> SO: Most of the documentation and discussion about Clojure concurrency is about managing state that may be shared between concurrent processes, but I have what I guess are more basic questions about how concurrent processes can/should be started even in the absence of shared state (or when all that's shared is immutable), and about how to get the most out of concurrency on multiple cores.
>>>
>>> I often have large numbers of relatively long, independent processes and I want to farm them out to multiple cores. (For those who care, this is often in the context of evolutionary computation systems, with each of the processes being a fitness test.) I had thought that I was farming these out in the right way to multiple cores, using agents or sometimes just pmap, but then I noticed that my runtimes weren't scaling in the way that I expected across machines with different numbers of cores (even though I usually saw near total utilization of all cores in "top").
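
Taking Laurent's problem #1 and problem #2 together, a hypothetical reworking of burn-via-futures (a sketch for illustration, not something proposed in the thread) would wrap the inner map in a doall so every future is submitted immediately, and collect the results with a plain map of deref, since dereferencing a finished future is cheap:

(defn burn-via-eager-futures [n]
  (print n " burns via eager futures: ")
  ;; the inner doall forces all the futures to start right away;
  ;; the outer doall just realizes the (cheap) derefs
  (time (doall (map deref
                    (doall (map (fn [_] (future (burn)))
                                (range n)))))))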
>>>
>>> This led me to do some more systematic testing and I'm confused/concerned about what I'm seeing, so I'm going to present my tests and results here in the hope that someone can clear things up for me. I know that timing things in clojure can be complicated both on account of laziness and on account of optimizations that happen on the Java side, but I think I've done the right things to avoid getting tripped up too much by these issues. Still, it's quite possible that I've coded some things incorrectly and/or that I'm misunderstanding some basic concepts, and I'd appreciate any help that anyone can provide.
>>>
>>> First I defined a function that would take a non-trivial amount of time to execute, as follows:
>>>
>>> (defn burn
>>>   ([] (count
>>>         (take 1E6
>>>           (repeatedly
>>>             #(* 9999999999 9999999999)))))
>>>   ([_] (burn)))
>>>
>>> The implementation with an ignored argument just serves to make some of my later calls neater -- I suppose I might incur a tiny additional cost when calling it that way, but this will be swamped by the things I'm timing.
>>>
>>> Then I defined functions for calling this multiple times, either sequentially or concurrently, using three different techniques for starting the concurrent processes:
>>>
>>> (defn burn-sequentially [n]
>>>   (print n " sequential burns: ")
>>>   (time (dotimes [i n] (burn))))
>>>
>>> (defn burn-via-pmap [n]
>>>   (print n " burns via pmap: ")
>>>   (time (doall (pmap burn (range n)))))
>>>
>>> (defn burn-via-futures [n]
>>>   (print n " burns via futures: ")
>>>   (time (doall (pmap deref (map (fn [_] (future (burn)))
>>>                                 (range n))))))
>>>
>>> (defn burn-via-agents [n]
>>>   (print n " burns via agents: ")
>>>   (time (let [agents (map #(agent %) (range n))]
>>>           (dorun (map #(send % burn) agents))
>>>           (apply await agents))))
>>>
>>> Finally, since there's often quite a bit of variability in the run time of these things (maybe because of garbage collection? Optimization? I'm not sure), I define a simple macro to execute a call three times:
>>>
>>> (defmacro thrice [expression]
>>>   `(do ~expression ~expression ~expression))
>>>
>>> Now I can do some timings, and I'll first show you what happens in one of the cases where everything performs as expected.
>>>
>>> On a 16-core machine (details at http://fly.hampshire.edu/ganglia/?p=2&c=Rocks-Cluster&h=compute-4-1.local), running four burns thrice, with the code:
>>>
>>> (thrice (burn-sequentially 4))
>>> (thrice (burn-via-pmap 4))
>>> (thrice (burn-via-futures 4))
>>> (thrice (burn-via-agents 4))
>>>
>>> I get:
>>>
>>> 4 sequential burns: "Elapsed time: 2308.616 msecs"
>>> 4 sequential burns: "Elapsed time: 1510.207 msecs"
>>> 4 sequential burns: "Elapsed time: 1182.743 msecs"
>>> 4 burns via pmap: "Elapsed time: 470.988 msecs"
>>> 4 burns via pmap: "Elapsed time: 457.015 msecs"
>>> 4 burns via pmap: "Elapsed time: 446.84 msecs"
>>> 4 burns via futures: "Elapsed time: 417.368 msecs"
>>> 4 burns via futures: "Elapsed time: 401.444 msecs"
>>> 4 burns via futures: "Elapsed time: 398.786 msecs"
>>> 4 burns via agents: "Elapsed time: 421.103 msecs"
>>> 4 burns via agents: "Elapsed time: 426.775 msecs"
>>> 4 burns via agents: "Elapsed time: 408.416 msecs"
>>>
>>> The improvement from the first line to the second is something I always see (along with frequent improvements across the three calls in a "thrice"), and I assume this is due to optimizations taking place in the JVM.
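
If the slower first run really is JIT warm-up, one hypothetical tweak to the harness (not from the thread) is to run each expression once more, up front, and simply ignore the first printed timing when reading the results:

;; hypothetical warm-up wrapper around Lee's thrice macro: the first run's
;; printed timing is treated as a throwaway warm-up measurement
(defmacro thrice-with-warmup [expression]
  `(do ~expression
       (thrice ~expression)))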
>>>
>>> Then we see that all of the ways of starting concurrent burns perform about the same, and all produce a speedup over the sequential burns of somewhere in the neighborhood of 3x-4x. Pretty much exactly what I would expect and want. So far so good.
>>>
>>> However, in the same JVM launch I then went on to do the same thing but with 16 and then 48 burns in each call:
>>>
>>> (thrice (burn-sequentially 16))
>>> (thrice (burn-via-pmap 16))
>>> (thrice (burn-via-futures 16))
>>> (thrice (burn-via-agents 16))
>>>
>>> (thrice (burn-sequentially 48))
>>> (thrice (burn-via-pmap 48))
>>> (thrice (burn-via-futures 48))
>>> (thrice (burn-via-agents 48))
>>>
>>> This produced:
>>>
>>> 16 sequential burns: "Elapsed time: 5821.574 msecs"
>>> 16 sequential burns: "Elapsed time: 6580.684 msecs"
>>> 16 sequential burns: "Elapsed time: 6648.013 msecs"
>>> 16 burns via pmap: "Elapsed time: 5953.194 msecs"
>>> 16 burns via pmap: "Elapsed time: 7517.196 msecs"
>>> 16 burns via pmap: "Elapsed time: 7380.047 msecs"
>>> 16 burns via futures: "Elapsed time: 1168.827 msecs"
>>> 16 burns via futures: "Elapsed time: 1068.98 msecs"
>>> 16 burns via futures: "Elapsed time: 1048.745 msecs"
>>> 16 burns via agents: "Elapsed time: 1041.05 msecs"
>>> 16 burns via agents: "Elapsed time: 1030.712 msecs"
>>> 16 burns via agents: "Elapsed time: 1041.139 msecs"
>>> 48 sequential burns: "Elapsed time: 15909.333 msecs"
>>> 48 sequential burns: "Elapsed time: 14825.631 msecs"
>>> 48 sequential burns: "Elapsed time: 15232.646 msecs"
>>> 48 burns via pmap: "Elapsed time: 13586.897 msecs"
>>> 48 burns via pmap: "Elapsed time: 3106.56 msecs"
>>> 48 burns via pmap: "Elapsed time: 3041.272 msecs"
>>> 48 burns via futures: "Elapsed time: 2968.991 msecs"
>>> 48 burns via futures: "Elapsed time: 2895.506 msecs"
>>> 48 burns via futures: "Elapsed time: 2818.724 msecs"
>>> 48 burns via agents: "Elapsed time: 2802.906 msecs"
>>> 48 burns via agents: "Elapsed time: 2754.364 msecs"
>>> 48 burns via agents: "Elapsed time: 2743.038 msecs"
>>>
>>> Looking first at the 16-burn runs, we see that concurrency via pmap is actually generally WORSE than sequential. I cannot understand why this should be the case. I guess if I were running on a single core I would expect to see a slight loss when going to pmap because there would be some cost for managing the 16 threads that wouldn't be compensated for by actual concurrency. But I'm running on 16 cores and I should be getting a major speedup, not a slowdown. There are only 16 threads, so there shouldn't be a lot of time lost to overhead.
>>>
>>> Also interesting, in this case when I start the processes using futures or agents I DO see a speedup. It's on the order of 6x-7x, not close to the 16x that I would hope for, but at least it's a speedup. Why is this so different from the case with pmap? (Recall that my pmap-based method DID produce about the same speedup as my other methods when doing only 4 burns.)
>>>
>>> For the calls with 48 burns we again see nearly the expected, reasonably good pattern with all concurrent calls performing nearly equivalently (I suppose that the steady improvement over all of the calls is again some kind of JVM optimization), with a speedup in the concurrent calls over the sequential calls in the neighborhood of 5x-6x. Again, not the ~16x that I might hope for, but at least it's in the right direction.
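
Since the concurrent speedups top out well below the core count, one small hypothetical sanity check (not something from the thread) is to ask the JVM how many processors it actually sees; pmap sizes its look-ahead window from this number:

;; hypothetical check, run in the same JVM as the benchmarks
(.availableProcessors (Runtime/getRuntime))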
>>>
>>> The very first of the pmap calls with 48 burns is an anomaly, with only a slight improvement over the sequential calls, so I suppose that's another small mystery.
>>>
>>> The big mystery so far, however, is in the case of the 16 burns via pmap, which is bizarrely slow on this 16-core machine.
>>>
>>> Next I tried the same thing on a 48-core machine (http://fly.hampshire.edu/ganglia/?p=2&c=Rocks-Cluster&h=compute-4-2.local). Here I got:
>>>
>>> 4 sequential burns: "Elapsed time: 3062.871 msecs"
>>> 4 sequential burns: "Elapsed time: 2249.048 msecs"
>>> 4 sequential burns: "Elapsed time: 2417.677 msecs"
>>> 4 burns via pmap: "Elapsed time: 705.968 msecs"
>>> 4 burns via pmap: "Elapsed time: 679.865 msecs"
>>> 4 burns via pmap: "Elapsed time: 685.017 msecs"
>>> 4 burns via futures: "Elapsed time: 687.097 msecs"
>>> 4 burns via futures: "Elapsed time: 636.543 msecs"
>>> 4 burns via futures: "Elapsed time: 660.116 msecs"
>>> 4 burns via agents: "Elapsed time: 708.163 msecs"
>>> 4 burns via agents: "Elapsed time: 709.433 msecs"
>>> 4 burns via agents: "Elapsed time: 713.536 msecs"
>>> 16 sequential burns: "Elapsed time: 8065.446 msecs"
>>> 16 sequential burns: "Elapsed time: 8069.239 msecs"
>>> 16 sequential burns: "Elapsed time: 8102.791 msecs"
>>> 16 burns via pmap: "Elapsed time: 11288.757 msecs"
>>> 16 burns via pmap: "Elapsed time: 12182.506 msecs"
>>> 16 burns via pmap: "Elapsed time: 14609.397 msecs"
>>> 16 burns via futures: "Elapsed time: 2519.603 msecs"
>>> 16 burns via futures: "Elapsed time: 2436.699 msecs"
>>> 16 burns via futures: "Elapsed time: 2776.869 msecs"
>>> 16 burns via agents: "Elapsed time: 2178.028 msecs"
>>> 16 burns via agents: "Elapsed time: 2871.38 msecs"
>>> 16 burns via agents: "Elapsed time: 2244.687 msecs"
>>> 48 sequential burns: "Elapsed time: 24118.218 msecs"
>>> 48 sequential burns: "Elapsed time: 24096.667 msecs"
>>> 48 sequential burns: "Elapsed time: 24057.327 msecs"
>>> 48 burns via pmap: "Elapsed time: 10369.224 msecs"
>>> 48 burns via pmap: "Elapsed time: 6837.071 msecs"
>>> 48 burns via pmap: "Elapsed time: 4163.926 msecs"
>>> 48 burns via futures: "Elapsed time: 3980.298 msecs"
>>> 48 burns via futures: "Elapsed time: 4066.35 msecs"
>>> 48 burns via futures: "Elapsed time: 4068.199 msecs"
>>> 48 burns via agents: "Elapsed time: 4012.069 msecs"
>>> 48 burns via agents: "Elapsed time: 4052.759 msecs"
>>> 48 burns via agents: "Elapsed time: 4085.018 msecs"
>>>
>>> Essentially this is the same picture that I got on the 16-core machine: decent (but less than I would like -- only something like 3x-4x) speedups with most concurrent methods in most cases, but a bizarre anomaly with 16 burns started with pmap, which is again considerably slower than the sequential runs. Why should this be? When I run only 4 burns or a full 48 burns the pmap method performs decently (that is, at least things get faster than the sequential calls), but with 16 burns something very odd happens.
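
One hypothetical way to see what is happening here (a diagnostic sketch, not something from the thread) is a burn that reports which thread runs it and when it starts; substituting it for burn in the burn-via-* functions would show whether pmap really keeps 16 burns in flight at once or dribbles them out a few at a time:

;; hypothetical instrumented burn: prints the worker thread's name and a
;; start timestamp, then does the normal burn
(defn traced-burn
  ([] (println (str (.getName (Thread/currentThread))
                    " starting burn at " (System/currentTimeMillis)))
      (burn))
  ([_] (traced-burn)))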
>>>
>>> Finally, I ran the same thing on my MacBook Pro (3.06 GHz Intel Core 2 Duo, Mac OS X 10.6.4), with Clojure 1.2.0-master-SNAPSHOT under Eclipse/Counterclockwise, with a bunch of applications running, so probably this is acting more or less like a single-core machine, and got:
>>>
>>> 4 sequential burns: "Elapsed time: 3487.224 msecs"
>>> 4 sequential burns: "Elapsed time: 2327.569 msecs"
>>> 4 sequential burns: "Elapsed time: 2137.697 msecs"
>>> 4 burns via pmap: "Elapsed time: 12478.725 msecs"
>>> 4 burns via pmap: "Elapsed time: 12815.75 msecs"
>>> 4 burns via pmap: "Elapsed time: 8464.909 msecs"
>>> 4 burns via futures: "Elapsed time: 11494.17 msecs"
>>> 4 burns via futures: "Elapsed time: 12365.537 msecs"
>>> 4 burns via futures: "Elapsed time: 12098.571 msecs"
>>> 4 burns via agents: "Elapsed time: 10361.749 msecs"
>>> 4 burns via agents: "Elapsed time: 12458.174 msecs"
>>> 4 burns via agents: "Elapsed time: 9016.093 msecs"
>>> 16 sequential burns: "Elapsed time: 8706.674 msecs"
>>> 16 sequential burns: "Elapsed time: 8748.006 msecs"
>>> 16 sequential burns: "Elapsed time: 8729.54 msecs"
>>> 16 burns via pmap: "Elapsed time: 46022.281 msecs"
>>> 16 burns via pmap: "Elapsed time: 44845.725 msecs"
>>> 16 burns via pmap: "Elapsed time: 45393.156 msecs"
>>> 16 burns via futures: "Elapsed time: 52822.863 msecs"
>>> 16 burns via futures: "Elapsed time: 50647.708 msecs"
>>> 16 burns via futures: "Elapsed time: 50337.916 msecs"
>>> 16 burns via agents: "Elapsed time: 48615.905 msecs"
>>> 16 burns via agents: "Elapsed time: 56703.723 msecs"
>>> 16 burns via agents: "Elapsed time: 69765.913 msecs"
>>> 48 sequential burns: "Elapsed time: 38885.616 msecs"
>>> 48 sequential burns: "Elapsed time: 38651.573 msecs"
>>> 48 sequential burns: "Elapsed time: 36669.02 msecs"
>>> 48 burns via pmap: "Elapsed time: 169108.022 msecs"
>>> 48 burns via pmap: "Elapsed time: 176656.455 msecs"
>>> 48 burns via pmap: "Elapsed time: 182119.986 msecs"
>>> 48 burns via futures: "Elapsed time: 176764.722 msecs"
>>> 48 burns via futures: "Elapsed time: 169257.577 msecs"
>>> 48 burns via futures: "Elapsed time: 157205.693 msecs"
>>> 48 burns via agents: "Elapsed time: 140618.333 msecs"
>>> 48 burns via agents: "Elapsed time: 137992.773 msecs"
>>> 48 burns via agents: "Elapsed time: 143153.696 msecs"
>>>
>>> Here we have a very depressing picture. Although I wouldn't expect to get any speedup from concurrency, the concurrency-related slowdowns have now spread to all of my concurrency-starting methods with all numbers of burns. It is way way way worse to be using the concurrency methods than the straightforward sequential method in every circumstance. Again, I understand why one should expect a small loss in a case like this, but these are huge losses, and the number of threads that have to be coordinated (with no shared state) is quite small -- just 4-48.
>>>
>>> My guess is that all of this is stemming from some confusion on my part about how I should be starting and managing concurrent processes, and my greatest hope is that one of you will show me an alternative to my burn-via-* functions that provides a speedup nearly linear with the number of cores and only a negligible loss when there's only one core available...
>>>
>>> But any help of any kind would be appreciated.
>>>
>>> Thanks,
>>>
>>> -Lee
>>>
>>> --
>>> Lee Spector, Professor of Computer Science
>>> School of Cognitive Science, Hampshire College
>>> 893 West Street, Amherst, MA 01002-3359
>>> lspec...@hampshire.edu, http://hampshire.edu/lspector/
>>> Phone: 413-559-5352, Fax: 413-559-5438
>>>
>>> Check out Genetic Programming and Evolvable Machines:
>>> http://www.springer.com/10710 - http://gpemjournal.blogspot.com/
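
As one possible direction for the alternative Lee asks for above (a hypothetical sketch, assuming explicit control over the number of worker threads is what's wanted, and not something actually proposed in this thread), the burns could be handed to a fixed-size thread pool from java.util.concurrent directly:

(import '(java.util.concurrent Executors))

(defn burn-via-thread-pool [n threads]
  (print n " burns via a pool of " threads " threads: ")
  (time
    (let [pool    (Executors/newFixedThreadPool threads)
          ;; each burn becomes a Callable; invokeAll blocks until all finish
          results (.invokeAll pool (map (fn [_] (fn [] (burn))) (range n)))]
      (.shutdown pool)
      (doall (map #(.get %) results)))))

Calling, say, (burn-via-thread-pool 16 16) would run 16 burns on exactly 16 worker threads, which makes for a direct comparison against the pmap, future, and agent versions above.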