David:

No new suggestions to add right now.  Herwig's suggestion that it could be
the Java allocator has some support in your results.  I'm not sure whether
this StackOverflow Q&A on TLABs is fully accurate, but it may provide some
useful info:

http://stackoverflow.com/questions/26351243/allocations-in-new-tlab-vs-allocations-outside-tlab
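
One way to poke at the allocator/TLAB hypothesis (a sketch, untested on your
setup; the flag names are standard HotSpot options from the JDK 8 era, but
double-check them against -XX:+PrintFlagsFinal on your JVM) would be to re-run
the 18-threads-in-one-JVM case with TLAB statistics printed, or with TLABs
disabled outright, e.g. via :jvm-opts in a Leiningen project.clj:

;; hypothetical project skeleton; only :jvm-opts matters here
(defproject alloc-test "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.7.0"]]
  ;; -XX:+PrintTLAB prints per-thread TLAB statistics at each GC; swapping it
  ;; for -XX:-UseTLAB disables TLABs entirely, which should make allocation
  ;; contention much worse if that is the bottleneck.
  :jvm-opts ["-XX:+PrintTLAB"])

If disabling TLABs makes the 18-thread case dramatically worse relative to the
18-process case, that would point at allocation as the contended resource.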

I mainly wanted to give you a virtual high-five, kudos, and a big thank-you
for taking the time to run these experiments.  Similar performance issues
with many threads in the same JVM on a many-core machine have come up
before, and I don't know whether anyone has gotten to the bottom of it yet.

Andy


On Wed, Nov 18, 2015 at 10:36 PM, David Iba <david...@gmail.com> wrote:

> OK, have a few updates to report:
>
>    - Oracle vs OpenJDK did not make a difference.
>    - Whenever I run N>1 threads calling any of these functions with
>    swap!/vswap!, there is some overhead compared to running 18 separate
>    single-run processes in parallel, and the overhead seems to increase as N
>    increases.
>       - For both the swap! and vswap! versions, the function timings from
>       running 18 futures (in one JVM) are about 1.5X those from running 18
>       separate JVM processes.
>          - For the swap! version (f2), a few of the calls would very often
>          go rogue and take around 3X the time of the others.
>             - This did not happen for the vswap! version of f2.
>       - Running 9 processes with 2 f2-calling threads each was maybe 4%
>       slower than 18 single-threaded processes.
>    - Running 4 processes with 4 f2-calling threads each was mostly the same
>    speed as the 18x1 case, but a couple of those rogue threads took 2-3X the
>    time of the others.
>
> Any ideas?
>
> On Thursday, November 19, 2015 at 1:08:14 AM UTC+9, David Iba wrote:
>>
>> No worries.  Thanks, I'll give that a try as well!
>>
>> On Thursday, November 19, 2015 at 1:04:04 AM UTC+9, tbc++ wrote:
>>>
>>> Oh, then I completely misunderstood the problem at hand here. If that's
>>> the case, then try the following:
>>>
>>> Change "atom" to "volatile!" and "swap!" to "vswap!". See if that
>>> changes anything.
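>>>
>>> For the f2 from the original post, that would look something like the
>>> sketch below (f2-volatile is just an illustrative name; volatile! and
>>> vswap! need Clojure 1.7+):
>>>
>>> (defn f2-volatile []
>>>   (let [x* (volatile! {})]
>>>     (loop [i 1e9]
>>>       (when-not (zero? i)
>>>         (vswap! x* assoc :k i)
>>>         (recur (dec i))))))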
>>>
>>> Timothy
>>>
>>>
>>> On Wed, Nov 18, 2015 at 9:00 AM, David Iba <davi...@gmail.com> wrote:
>>>
>>>> Timothy:  Each thread (call of f2) creates its own "local" atom, so I
>>>> don't think there should be any swap retries.
>>>>
>>>> Gianluca:  Good idea!  I've only tried OpenJDK, but I will look into
>>>> trying Oracle and report back.
>>>>
>>>> Andy:  jvisualvm was showing pretty much all of the memory being
>>>> allocated in the eden space and a little in the first survivor space (no
>>>> major/full GCs), and total GC time was very minimal.
>>>>
>>>> I'm in the middle of running some more tests and will report back when
>>>> I get a chance today or tomorrow.  Thanks for all the feedback on this!
>>>>
>>>> On Thursday, November 19, 2015 at 12:38:55 AM UTC+9, tbc++ wrote:
>>>>>
>>>>> This sort of code is pretty much the worst-case situation for atoms (or
>>>>> really for CAS). Clojure's swap! is based on the "compare-and-swap" (CAS)
>>>>> operation that most x86 CPUs have as an instruction. If we expand swap!,
>>>>> it looks something like this:
>>>>>
>>>>> (loop [old-val @x*]
>>>>>   (let [new-val (assoc old-val :k i)]
>>>>>     (if (compare-and-swap x* old-val new-val)
>>>>>       new-val
>>>>>       (recur @x*))))
>>>>>
>>>>> Compare-and-swap can be defined as "update the content of the reference
>>>>> to new-val only if the current value of the reference is equal to
>>>>> old-val".
>>>>>
>>>>> So in essence, only one core can be modifying the contents of an atom at
>>>>> a time. If the atom is modified during the execution of the swap! call,
>>>>> swap! will keep re-running your function until it manages to update the
>>>>> atom without it having been modified during the function's execution.
>>>>>
>>>>> So let's say you have some super long computation whose result you need
>>>>> to fold into an atom. Here's one way to do it, but probably not the best:
>>>>>
>>>>> (let [a (atom 0)]
>>>>>   (dotimes [x 18]
>>>>>     (future
>>>>>       (swap! a long-operation-on-score some-param))))
>>>>>
>>>>>
>>>>> In this case long-operation-on-score will need to be re-run every time
>>>>> another thread modifies the atom during the call. However, if our
>>>>> function only needs the current state of the atom in order to add to it,
>>>>> then we can do something like this instead:
>>>>>
>>>>> (let [a (atom 0)]
>>>>>   (dotimes [x 18]
>>>>>     (future
>>>>>       (let [score (long-operation-on-score some-param)]
>>>>>         (swap! a + score)))))
>>>>>
>>>>> Now we only have a simple addition inside the swap!, and there will be
>>>>> less contention between the CPUs because they will most likely be
>>>>> spending most of their time inside 'long-operation-on-score' instead of
>>>>> inside the swap!.
>>>>>
>>>>> *TL;DR*: do as little work as possible inside swap!; the more you do
>>>>> inside swap!, the higher the chance of throwing away work due to swap!
>>>>> retries.
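>>>>>
>>>>> If you want to see the retries directly, here's one way (a sketch; the
>>>>> update fn runs once per attempt, so any calls beyond the 18 successful
>>>>> updates were retries). With an update this cheap you may see few or
>>>>> none; the count climbs as the update fn gets heavier:
>>>>>
>>>>> (let [a     (atom 0)
>>>>>       calls (atom 0)]
>>>>>   (doseq [fut (doall (for [_ (range 18)]
>>>>>                        (future (swap! a (fn [v]
>>>>>                                           (swap! calls inc)
>>>>>                                           (inc v))))))]
>>>>>     @fut)
>>>>>   ;; 18 successful updates; @calls minus 18 is the retry count
>>>>>   {:value @a :update-fn-calls @calls})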
>>>>>
>>>>> Timothy
>>>>>
>>>>> On Wed, Nov 18, 2015 at 8:13 AM, gianluca torta <giat...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> By the way, have you tried both the Oracle JDK and OpenJDK with the
>>>>>> same results?
>>>>>> Gianluca
>>>>>>
>>>>>> On Tuesday, November 17, 2015 at 8:28:49 PM UTC+1, Andy Fingerhut
>>>>>> wrote:
>>>>>>>
>>>>>>> David, you say "Based on jvisualvm monitoring, doesn't seem to be
>>>>>>> GC-related".
>>>>>>>
>>>>>>> What is jvisualvm showing you related to GC and/or memory allocation
>>>>>>> when you tried the 18-core version with 18 threads in the same process?
>>>>>>>
>>>>>>> Even memory allocation could become a point of contention, depending
>>>>>>> on how allocation works with many threads, e.g. on whether a thread
>>>>>>> takes a global lock to grab a large chunk of memory and then locally
>>>>>>> carves it up into the small pieces it needs for each individual Java
>>>>>>> 'new', or takes a global lock for every 'new'.  The latter would give
>>>>>>> terrible performance as the number of cores increases, but I don't know
>>>>>>> how to tell whether that is the case, except by knowing more about how
>>>>>>> the memory allocator is implemented in your JVM.  Maybe digging through
>>>>>>> the OpenJDK source code in the right place would tell?
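>>>>>>>
>>>>>>> One rough way to at least see how much each thread allocates is the
>>>>>>> sketch below, using the HotSpot-specific com.sun.management.ThreadMXBean
>>>>>>> API (so it won't work on every JVM; f2 here is the function from the
>>>>>>> original post):
>>>>>>>
>>>>>>> (import '[java.lang.management ManagementFactory])
>>>>>>>
>>>>>>> ;; bytes allocated so far by the given thread (HotSpot only)
>>>>>>> (defn thread-allocated-bytes [^Thread t]
>>>>>>>   (let [bean (ManagementFactory/getThreadMXBean)]
>>>>>>>     (when (instance? com.sun.management.ThreadMXBean bean)
>>>>>>>       (.getThreadAllocatedBytes
>>>>>>>         ^com.sun.management.ThreadMXBean bean (.getId t)))))
>>>>>>>
>>>>>>> ;; sample the current thread before and after one run of f2
>>>>>>> (let [before (thread-allocated-bytes (Thread/currentThread))]
>>>>>>>   (f2)
>>>>>>>   (- (thread-allocated-bytes (Thread/currentThread)) before))
>>>>>>>
>>>>>>> That only shows allocation volume, not contention, but if the per-thread
>>>>>>> numbers are huge it at least makes the allocator theory more interesting.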
>>>>>>>
>>>>>>> Andy
>>>>>>>
>>>>>>> On Tue, Nov 17, 2015 at 2:00 AM, David Iba <davi...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Correction: that "do" should be a "doall".  (My actual test code was
>>>>>>>> a bit different, but each run printed some info when it started, so it
>>>>>>>> doesn't have to do with delayed evaluation of lazy seqs or anything.)
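>>>>>>>>
>>>>>>>> In other words, the timing code from below with that one change (doall
>>>>>>>> forces the lazy seq, so all the futures actually start up front):
>>>>>>>>
>>>>>>>> (time
>>>>>>>>  (let [f      f1
>>>>>>>>        n-runs 18
>>>>>>>>        futs   (doall (for [i (range n-runs)]
>>>>>>>>                        (future (time (f)))))]
>>>>>>>>    (doseq [fut futs]
>>>>>>>>      @fut)))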
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tuesday, November 17, 2015 at 6:49:16 PM UTC+9, David Iba wrote:
>>>>>>>>>
>>>>>>>>> Andy:  Interesting.  Thanks for educating me on the fact that atom
>>>>>>>>> swap!'s don't use the STM.  Your theory seems plausible... I will try
>>>>>>>>> those tests next time I launch the 18-core instance, but yeah, I'm
>>>>>>>>> not sure how illuminating the results will be.
>>>>>>>>>
>>>>>>>>> Niels: along the lines of this (so that each thread prints its
>>>>>>>>> time as well as printing the overall time):
>>>>>>>>>
>>>>>>>>> (time
>>>>>>>>>  (let [f f1
>>>>>>>>>        n-runs 18
>>>>>>>>>        futs (do (for [i (range n-runs)]
>>>>>>>>>                   (future (time (f)))))]
>>>>>>>>>    (doseq [fut futs]
>>>>>>>>>      @fut)))
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tuesday, November 17, 2015 at 5:33:01 PM UTC+9, Niels van
>>>>>>>>> Klaveren wrote:
>>>>>>>>>>
>>>>>>>>>> Could you also show how you are running these functions in parallel
>>>>>>>>>> and timing them?  The way you start the functions can have as much
>>>>>>>>>> impact as the functions themselves.
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Niels
>>>>>>>>>>
>>>>>>>>>> On Tuesday, November 17, 2015 at 6:38:39 AM UTC+1, David Iba
>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> I have functions f1 and f2 below, and let's say they run in T1 and
>>>>>>>>>>> T2 amount of time when running a single instance/thread.  The issue
>>>>>>>>>>> I'm facing is that parallelizing f2 across 18 cores takes anywhere
>>>>>>>>>>> from 2-5X T2, and for more complex functions it takes absurdly
>>>>>>>>>>> long.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> (defn f1 []
>>>>>>>>>>>   (apply + (range 2e9)))
>>>>>>>>>>>
>>>>>>>>>>> ;; Note: each call to (f2) makes its own x* atom, so the 'swap!'
>>>>>>>>>>> ;; should never retry.
>>>>>>>>>>> (defn f2 []
>>>>>>>>>>>   (let [x* (atom {})]
>>>>>>>>>>>     (loop [i 1e9]
>>>>>>>>>>>       (when-not (zero? i)
>>>>>>>>>>>         (swap! x* assoc :k i)
>>>>>>>>>>>         (recur (dec i))))))
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Of note:
>>>>>>>>>>> - On a 4-core machine, both f1 and f2 parallelize well (roughly
>>>>>>>>>>> T1 and T2 for 4 runs in parallel).
>>>>>>>>>>> - Running 18 f1's in parallel on the 18-core machine also
>>>>>>>>>>> parallelizes well.
>>>>>>>>>>> - Disabling hyperthreading doesn't help.
>>>>>>>>>>> - Based on jvisualvm monitoring, it doesn't seem to be GC-related.
>>>>>>>>>>> - I also tried a dedicated 18-core EC2 instance with the same
>>>>>>>>>>> issues, so it's not shared-tenancy-related.
>>>>>>>>>>> - If I make a jar that runs a single f2 and launch 18 of them in
>>>>>>>>>>> parallel, it parallelizes well (so I don't think it's
>>>>>>>>>>> machine/AWS-related).
>>>>>>>>>>>
>>>>>>>>>>> Could it be that the 18 f2's in parallel in a single JVM instance
>>>>>>>>>>> are overworking the STM with all the swap! calls?  Any other
>>>>>>>>>>> theories?
>>>>>>>>>>>
>>>>>>>>>>> Thanks!
>>>>>>>>>>>
>>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> “One of the main causes of the fall of the Roman Empire was
>>>>> that–lacking zero–they had no way to indicate successful termination of
>>>>> their C programs.”
>>>>> (Robert Firth)
>>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> “One of the main causes of the fall of the Roman Empire was that–lacking
>>> zero–they had no way to indicate successful termination of their C
>>> programs.”
>>> (Robert Firth)
>>>
>
