OK, have a few updates to report:

- Oracle vs OpenJDK did not make a difference.
- Whenever I run N > 1 threads calling any of these functions with swap!/vswap!, there is some overhead compared to running 18 separate single-run processes in parallel. This overhead seems to increase as N increases.
- For both the swap! and vswap! versions, the function timings from running 18 futures (from one JVM) show about 1.5x the time from running 18 separate JVM processes.
- For the swap! version (f2), very often a few of the calls would go rogue and take around 3x the time of the others.
- This did not happen for the vswap! version of f2 (sketched below for reference).
- Running 9 processes with 2 f2-calling threads each was maybe 4% slower than 18 processes of 1 thread each.
- Running 4 processes with 4 f2-calling threads each was mostly the same speed as the 18x1 setup, but there were a couple of those rogue threads that took 2-3x the time of the others.

Any ideas?
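For reference, the "vswap version of f2" above is presumably just the direct substitution Timothy suggests further down the thread (volatile! for atom, vswap! for swap!). A minimal sketch, assuming nothing else changed:

    ;; Hypothetical vswap! variant of f2, assuming a direct volatile!/vswap! substitution.
    ;; vswap! is a plain read-modify-write on a volatile with no compare-and-set retry
    ;; machinery, which is safe here because each call owns its own local volatile.
    (defn f2-vswap []
      (let [x* (volatile! {})]
        (loop [i 1e9]
          (when-not (zero? i)
            (vswap! x* assoc :k i)
            (recur (dec i))))))

Both variants still allocate a fresh single-entry map (and a boxed number) every iteration, so any timing difference between them should mostly isolate the cost of the CAS path itself.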
On Thursday, November 19, 2015 at 1:08:14 AM UTC+9, David Iba wrote:
>
> No worries. Thanks, I'll give that a try as well!
>
> On Thursday, November 19, 2015 at 1:04:04 AM UTC+9, tbc++ wrote:
>>
>> Oh, then I completely misunderstood the problem at hand here. If that's the case then do the following:
>>
>> Change "atom" to "volatile!" and "swap!" to "vswap!". See if that changes anything.
>>
>> Timothy
>>
>> On Wed, Nov 18, 2015 at 9:00 AM, David Iba <davi...@gmail.com> wrote:
>>>
>>> Timothy: Each thread (call of f2) creates its own "local" atom, so I don't think there should be any swap! retries.
>>>
>>> Gianluca: Good idea! I've only tried OpenJDK, but I will look into trying Oracle and report back.
>>>
>>> Andy: jvisualvm was showing pretty much all of the memory allocated in the eden space and a little in the first survivor (no major/full GCs), and total GC time was very minimal.
>>>
>>> I'm in the middle of running some more tests and will report back when I get a chance today or tomorrow. Thanks for all the feedback on this!
>>>
>>> On Thursday, November 19, 2015 at 12:38:55 AM UTC+9, tbc++ wrote:
>>>>
>>>> This sort of code is somewhat the worst-case situation for atoms (or really for CAS). Clojure's swap! is based on the "compare-and-swap" (CAS) operation that most x86 CPUs have as an instruction. If we expand swap! it looks something like this:
>>>>
>>>> (loop [old-val @x*]
>>>>   (let [new-val (assoc old-val :k i)]
>>>>     (if (compare-and-swap x* old-val new-val)
>>>>       new-val
>>>>       (recur @x*))))
>>>>
>>>> Compare-and-swap can be defined as "update the content of the reference to new-val only if the current value of the reference is equal to old-val".
>>>>
>>>> So in essence, only one core can be modifying the contents of an atom at a time. If the atom is modified during the execution of the swap! call, then swap! will continue to re-run your function until it's able to update the atom without it being modified during the function's execution.
>>>>
>>>> So let's say you have some super long task whose result you need to integrate into a ref. Here's one way to do it, but probably not the best:
>>>>
>>>> (let [a (atom 0)]
>>>>   (dotimes [x 18]
>>>>     (future
>>>>       (swap! a long-operation-on-score some-param))))
>>>>
>>>> In this case long-operation-on-score will need to be re-run every time another thread modifies the atom. However, if our function only needs the current state of the ref in order to add to it, then we can do something like this instead:
>>>>
>>>> (let [a (atom 0)]
>>>>   (dotimes [x 18]
>>>>     (future
>>>>       (let [score (long-operation-on-score some-param)]
>>>>         (swap! a + score)))))
>>>>
>>>> Now we only have a simple addition inside the swap!, and we will have less contention between the CPUs because they will most likely be spending more time inside 'long-operation-on-score' than inside the swap!.
>>>>
>>>> *TL;DR*: do as little work as possible inside swap!. The more you do inside swap!, the higher the chance of throwing away work due to swap! retries.
>>>>
>>>> Timothy
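Timothy's expansion above maps onto Clojure's real compare-and-set!, so the retry loop can be played with directly. A rough, runnable sketch of the same idea (swap-like! is a hypothetical name, not Clojure's actual swap! implementation, which lives in clojure.lang.Atom):

    ;; Sketch of a swap!-like retry loop built on the real compare-and-set!.
    (defn swap-like! [a f & args]
      (loop []
        (let [old-val @a
              new-val (apply f old-val args)]
          (if (compare-and-set! a old-val new-val) ; succeeds only if a still holds old-val
            new-val
            (recur)))))                            ; another thread won the race: retry

    ;; (swap-like! (atom {}) assoc :k 1)  ;=> {:k 1}

compare-and-set! only succeeds if the atom still holds the exact value read at the top of the loop, which is the retry behaviour Timothy describes.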
>>>> On Wed, Nov 18, 2015 at 8:13 AM, gianluca torta <giat...@gmail.com> wrote:
>>>>>
>>>>> By the way, have you tried both Oracle and OpenJDK with the same results?
>>>>>
>>>>> Gianluca
>>>>>
>>>>> On Tuesday, November 17, 2015 at 8:28:49 PM UTC+1, Andy Fingerhut wrote:
>>>>>>
>>>>>> David, you say "Based on jvisualvm monitoring, doesn't seem to be GC-related".
>>>>>>
>>>>>> What is jvisualvm showing you related to GC and/or memory allocation when you tried the 18-core version with 18 threads in the same process?
>>>>>>
>>>>>> Even memory allocation could become a point of contention, depending upon how the memory allocator works with many threads. E.g. it depends on whether a thread takes a global lock to get a large chunk of memory and then locally carves it up into the small pieces it needs for each individual Java 'new' allocation, or takes a global lock for every 'new'. The latter would give terrible performance as the number of cores increases, but I don't know how to tell whether that is the case, except by knowing more about how the memory allocator is implemented in your JVM. Maybe digging through OpenJDK source code in the right place would tell?
>>>>>>
>>>>>> Andy
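One cheap way to probe that allocation theory (a hypothetical diagnostic, not something from the thread) is to keep f2's per-iteration allocation but drop the atom entirely; f2 itself is defined in the original message quoted further down. If 18 parallel runs of this variant also degrade, the allocator is the likelier bottleneck; if they scale like f1, suspicion shifts back to the swap!/CAS path:

    ;; Hypothetical diagnostic: same per-iteration allocation pattern as f2 (a fresh
    ;; single-entry map plus a boxed number every pass), but with the atom and swap!
    ;; removed, so a slowdown across 18 threads would point at allocation, not CAS.
    (defn f2-alloc-only []
      (loop [i 1e9
             m {}]
        (if (zero? i)
          m
          (recur (dec i) (assoc m :k i)))))

Run it under the same 18-future harness as f1 and f2 for an apples-to-apples comparison.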
>>>>>> On Tue, Nov 17, 2015 at 2:00 AM, David Iba <davi...@gmail.com> wrote:
>>>>>>>
>>>>>>> Correction: that "do" should be a "doall". (My actual test code was a bit different, but each run printed some info when it started, so it doesn't have to do with delayed evaluation of lazy seqs or anything.)
>>>>>>>
>>>>>>> On Tuesday, November 17, 2015 at 6:49:16 PM UTC+9, David Iba wrote:
>>>>>>>>
>>>>>>>> Andy: Interesting. Thanks for educating me on the fact that atom swap!s don't use the STM. Your theory seems plausible... I will try those tests next time I launch the 18-core instance, but yeah, not sure how illuminating the results will be.
>>>>>>>>
>>>>>>>> Niels: along the lines of this (so that each thread prints its time as well as printing the overall time):
>>>>>>>>
>>>>>>>> (time
>>>>>>>>  (let [f f1
>>>>>>>>        n-runs 18
>>>>>>>>        futs (do (for [i (range n-runs)]
>>>>>>>>                   (future (time (f)))))]
>>>>>>>>    (doseq [fut futs]
>>>>>>>>      @fut)))
>>>>>>>>
>>>>>>>> On Tuesday, November 17, 2015 at 5:33:01 PM UTC+9, Niels van Klaveren wrote:
>>>>>>>>>
>>>>>>>>> Could you also show how you are running these functions in parallel and how you time them? The way you start the functions can have as much impact as the functions themselves.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Niels
>>>>>>>>>
>>>>>>>>> On Tuesday, November 17, 2015 at 6:38:39 AM UTC+1, David Iba wrote:
>>>>>>>>>>
>>>>>>>>>> I have functions f1 and f2 below, and let's say they run in T1 and T2 amount of time when running a single instance/thread. The issue I'm facing is that parallelizing f2 across 18 cores takes anywhere from 2-5x T2, and for more complex functions takes absurdly long.
>>>>>>>>>>
>>>>>>>>>> (defn f1 []
>>>>>>>>>>   (apply + (range 2e9)))
>>>>>>>>>>
>>>>>>>>>> ;; Note: each call to (f2) makes its own x* atom, so the 'swap!' should never retry.
>>>>>>>>>> (defn f2 []
>>>>>>>>>>   (let [x* (atom {})]
>>>>>>>>>>     (loop [i 1e9]
>>>>>>>>>>       (when-not (zero? i)
>>>>>>>>>>         (swap! x* assoc :k i)
>>>>>>>>>>         (recur (dec i))))))
>>>>>>>>>>
>>>>>>>>>> Of note:
>>>>>>>>>> - On a 4-core machine, both f1 and f2 parallelize well (roughly T1 and T2 for 4 runs in parallel).
>>>>>>>>>> - Running 18 f1's in parallel on the 18-core machine also parallelizes well.
>>>>>>>>>> - Disabling hyperthreading doesn't help.
>>>>>>>>>> - Based on jvisualvm monitoring, it doesn't seem to be GC-related.
>>>>>>>>>> - I also tried a dedicated 18-core EC2 instance with the same issues, so it's not shared-tenancy-related.
>>>>>>>>>> - If I make a jar that runs a single f2 and launch 18 in parallel, it parallelizes well (so I don't think it's machine/AWS-related).
>>>>>>>>>>
>>>>>>>>>> Could it be that the 18 f2's in parallel on a single JVM instance are overworking the STM with all the swap!s? Any other theories?
>>>>>>>>>>
>>>>>>>>>> Thanks!
>>
>> --
>> “One of the main causes of the fall of the Roman Empire was that–lacking zero–they had no way to indicate successful termination of their C programs.”
>> (Robert Firth)
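For completeness, here is the timing harness from the thread with David's "do" to "doall" correction applied (a sketch; assumes f1/f2 as defined above):

    ;; doall forces the lazy (for ...) so all 18 futures start immediately, instead of
    ;; being launched one at a time as the doseq walks the sequence.
    (time
     (let [f      f1            ; or f2, or a vswap!-based variant
           n-runs 18
           futs   (doall (for [i (range n-runs)]
                           (future (time (f)))))]
       (doseq [fut futs]
         @fut)))                ; block on every future so the outer time is overall wall time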