David: No new suggestions to add right now. Herwig's suggestion that it could be the Java allocator has some evidence for it, given your results. I'm not sure whether this StackOverflow Q&A on TLABs (thread-local allocation buffers) is fully accurate, but it may provide some useful info:
http://stackoverflow.com/questions/26351243/allocations-in-new-tlab-vs-allocations-outside-tlab

Mainly, though, I wanted to give you a virtual high-five, kudos, and a thank-you, thank-you, thank-you for taking the time to run these experiments. Similar performance issues with many threads in the same JVM on a many-core machine have come up before, and so far I don't know whether anyone has gotten to the bottom of it.

Andy
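If it would help to put numbers on the allocator theory, here is a rough sketch of how per-thread allocation could be measured from Clojure. It relies on getThreadAllocatedBytes from com.sun.management.ThreadMXBean, a HotSpot-specific extension (present in both Oracle and OpenJDK builds); the name allocated-bytes-running is made up for this example:

    (defn allocated-bytes-running
      "Runs f on the current thread and returns roughly how many bytes
      this thread allocated while f was running. HotSpot-specific."
      [f]
      (let [^com.sun.management.ThreadMXBean mx
            (java.lang.management.ManagementFactory/getThreadMXBean)
            tid    (.getId (Thread/currentThread))
            before (.getThreadAllocatedBytes mx tid)]
        (f)
        (- (.getThreadAllocatedBytes mx tid) before)))

    ;; e.g. wrap each benchmarked call so every future reports its own allocation:
    ;; (future (println (allocated-bytes-running f2)))

Comparing the per-thread numbers for the 1-thread and 18-thread runs would at least show whether the slow runs allocate more, or merely allocate more slowly.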
On Wed, Nov 18, 2015 at 10:36 PM, David Iba <david...@gmail.com> wrote:

> OK, I have a few updates to report:
>
> - Oracle vs. OpenJDK did not make a difference.
> - Whenever I run N > 1 threads calling any of these functions with swap!/vswap!, there is some overhead compared to running 18 separate single-run processes in parallel. This overhead seems to increase as N increases.
> - For both swap! and vswap!, the function timings from running 18 futures (from one JVM) are about 1.5x those from running 18 separate JVM processes.
> - For the swap! version (f2), very often a few of the calls would go rogue and take around 3x the time of the others.
>   - This did not happen for the vswap! version of f2.
> - Running 9 processes with 2 f2-calling threads each was maybe 4% slower than 18 processes of 1.
> - Running 4 processes with 4 f2-calling threads each was mostly the same speed as the 18x1 case, but a couple of those rogue threads took 2-3x the time of the others.
>
> Any ideas?
>
> On Thursday, November 19, 2015 at 1:08:14 AM UTC+9, David Iba wrote:
>>
>> No worries. Thanks, I'll give that a try as well!
>>
>> On Thursday, November 19, 2015 at 1:04:04 AM UTC+9, tbc++ wrote:
>>>
>>> Oh, then I completely misunderstood the problem at hand. If that's the case, then do the following:
>>>
>>> Change "atom" to "volatile!" and "swap!" to "vswap!". See if that changes anything.
>>>
>>> Timothy
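For concreteness, here is a hedged sketch of that change applied to the f2 from the original post (quoted at the bottom of this thread); f2-volatile is just a name for the example:

    (defn f2-volatile []
      (let [x* (volatile! {})]
        (loop [i 1e9]
          (when-not (zero? i)
            (vswap! x* assoc :k i)   ; plain read/update/write, no compare-and-swap retry loop
            (recur (dec i))))))

A volatile drops the atomic compare-and-swap (and its retry machinery) but keeps the same per-iteration map allocation, which is why David's results above are interesting: the 3x "rogue" calls disappear, yet both versions still show roughly 1.5x overhead versus separate JVM processes.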
>>> On Wed, Nov 18, 2015 at 9:00 AM, David Iba <davi...@gmail.com> wrote:
>>>
>>>> Timothy: Each thread (call of f2) creates its own "local" atom, so I don't think there should be any swap! retries.
>>>>
>>>> Gianluca: Good idea! I've only tried OpenJDK, but I will look into trying Oracle and report back.
>>>>
>>>> Andy: jvisualvm was showing pretty much all of the memory allocated in the eden space and a little in the first survivor (no major/full GCs), and total GC time was very minimal.
>>>>
>>>> I'm in the middle of running some more tests and will report back when I get a chance today or tomorrow. Thanks for all the feedback on this!
>>>>
>>>> On Thursday, November 19, 2015 at 12:38:55 AM UTC+9, tbc++ wrote:
>>>>>
>>>>> This sort of code is somewhat the worst-case situation for atoms (or really for CAS). Clojure's swap! is based on the "compare-and-swap" (CAS) operation that most x86 CPUs have as an instruction. If we expand swap!, it looks something like this:
>>>>>
>>>>> (loop [old-val @x*]
>>>>>   (let [new-val (assoc old-val :k i)]
>>>>>     (if (compare-and-swap x* old-val new-val)
>>>>>       new-val
>>>>>       (recur @x*))))
>>>>>
>>>>> Compare-and-swap can be defined as "update the contents of the reference to new-val only if the current value of the reference is equal to old-val".
>>>>>
>>>>> So in essence, only one core can be modifying the contents of an atom at a time. If the atom is modified during the execution of the swap! call, then swap! will continue to re-run your function until it is able to update the atom without it being modified during the function's execution.
>>>>>
>>>>> So let's say you have some long-running task whose result you need to integrate into an atom. Here's one way to do it, but probably not the best:
>>>>>
>>>>> (let [a (atom 0)]
>>>>>   (dotimes [x 18]
>>>>>     (future
>>>>>       (swap! a long-operation-on-score some-param))))
>>>>>
>>>>> In this case long-operation-on-score will need to be re-run every time another thread modifies the atom. However, if our function only needs the current value of the atom in order to add to it, then we can do something like this instead:
>>>>>
>>>>> (let [a (atom 0)]
>>>>>   (dotimes [x 18]
>>>>>     (future
>>>>>       (let [score (long-operation-on-score some-param)]
>>>>>         (swap! a + score)))))
>>>>>
>>>>> Now we only have a simple addition inside the swap!, and we will have less contention between the CPUs because they will most likely be spending more time inside long-operation-on-score than inside the swap!.
>>>>>
>>>>> TL;DR: do as little work as possible inside swap!; the more you do inside swap!, the higher the chance of throwing away work due to swap! retries.
>>>>>
>>>>> Timothy
>>>>>
>>>>> On Wed, Nov 18, 2015 at 8:13 AM, gianluca torta <giat...@gmail.com> wrote:
>>>>>
>>>>>> By the way, have you tried both Oracle and OpenJDK with the same results?
>>>>>>
>>>>>> Gianluca
>>>>>>
>>>>>> On Tuesday, November 17, 2015 at 8:28:49 PM UTC+1, Andy Fingerhut wrote:
>>>>>>>
>>>>>>> David, you say "Based on jvisualvm monitoring, doesn't seem to be GC-related".
>>>>>>>
>>>>>>> What is jvisualvm showing you related to GC and/or memory allocation when you tried the 18-core version with 18 threads in the same process?
>>>>>>>
>>>>>>> Even memory allocation could become a point of contention, depending upon how memory allocation works with many threads, e.g. whether a thread grabs a large chunk of memory under a global lock and then locally carves it up into the small pieces it needs for each individual Java 'new' allocation, or takes a global lock for every 'new'. The latter would give terrible performance as the number of cores increases, but I don't know how to tell whether that is the case, except by knowing more about how the memory allocator is implemented in your JVM. Maybe digging through OpenJDK source code in the right place would tell?
>>>>>>>
>>>>>>> Andy
>>>>>>>
>>>>>>> On Tue, Nov 17, 2015 at 2:00 AM, David Iba <davi...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Correction: that "do" should be a "doall". (My actual test code was a bit different, but each run printed some info when it started, so it doesn't have to do with delayed evaluation of lazy seqs or anything.)
>>>>>>>>
>>>>>>>> On Tuesday, November 17, 2015 at 6:49:16 PM UTC+9, David Iba wrote:
>>>>>>>>>
>>>>>>>>> Andy: Interesting. Thanks for educating me on the fact that atom swap!s don't use the STM. Your theory seems plausible... I will try those tests next time I launch the 18-core instance, but yeah, I'm not sure how illuminating the results will be.
>>>>>>>>>
>>>>>>>>> Niels: along the lines of this (so that each thread prints its time as well as printing the overall time):
>>>>>>>>>
>>>>>>>>> (time
>>>>>>>>>   (let [f f1
>>>>>>>>>         n-runs 18
>>>>>>>>>         futs (do (for [i (range n-runs)]
>>>>>>>>>                    (future (time (f)))))]
>>>>>>>>>     (doseq [fut futs]
>>>>>>>>>       @fut)))
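With David's correction above applied (the "do" should be a "doall", so the lazy for is fully realized and all of the futures are created before the doseq starts dereferencing them), the harness reads roughly as:

    (time
      (let [f f1
            n-runs 18
            ;; doall forces the lazy sequence, starting all n-runs futures here
            futs (doall (for [i (range n-runs)]
                          (future (time (f)))))]
        (doseq [fut futs]
          @fut)))

doall (rather than dorun) is the right fit here because the realized future objects are still needed for the later derefs.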
>>>>>>>>> On Tuesday, November 17, 2015 at 5:33:01 PM UTC+9, Niels van Klaveren wrote:
>>>>>>>>>>
>>>>>>>>>> Could you also show how you are running these functions in parallel, and how you time them? The way you start the functions can have as much impact as the functions themselves.
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Niels
>>>>>>>>>>
>>>>>>>>>> On Tuesday, November 17, 2015 at 6:38:39 AM UTC+1, David Iba wrote:
>>>>>>>>>>>
>>>>>>>>>>> I have functions f1 and f2 below, and let's say they run in T1 and T2 amount of time when running a single instance/thread. The issue I'm facing is that parallelizing f2 across 18 cores takes anywhere from 2-5x T2, and for more complex functions it takes absurdly long.
>>>>>>>>>>>
>>>>>>>>>>> (defn f1 []
>>>>>>>>>>>   (apply + (range 2e9)))
>>>>>>>>>>>
>>>>>>>>>>> ;; Note: each call to (f2) makes its own x* atom, so the 'swap!' should never retry.
>>>>>>>>>>> (defn f2 []
>>>>>>>>>>>   (let [x* (atom {})]
>>>>>>>>>>>     (loop [i 1e9]
>>>>>>>>>>>       (when-not (zero? i)
>>>>>>>>>>>         (swap! x* assoc :k i)
>>>>>>>>>>>         (recur (dec i))))))
>>>>>>>>>>>
>>>>>>>>>>> Of note:
>>>>>>>>>>> - On a 4-core machine, both f1 and f2 parallelize well (roughly T1 and T2 for 4 runs in parallel).
>>>>>>>>>>> - Running 18 f1's in parallel on the 18-core machine also parallelizes well.
>>>>>>>>>>> - Disabling hyperthreading doesn't help.
>>>>>>>>>>> - Based on jvisualvm monitoring, it doesn't seem to be GC-related.
>>>>>>>>>>> - I also tried a dedicated 18-core EC2 instance with the same issues, so it's not shared-tenancy-related.
>>>>>>>>>>> - If I make a jar that runs a single f2 and launch 18 in parallel, it parallelizes well (so I don't think it's machine/AWS-related).
>>>>>>>>>>>
>>>>>>>>>>> Could it be that the 18 f2's in parallel on a single JVM instance are overworking the STM with all the swap!'s? Any other theories?
>>>>>>>>>>>
>>>>>>>>>>> Thanks!
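One further probe that might help separate Andy's allocation-contention theory from the cost of the atom itself: a variant of f2 that keeps the per-iteration map update but drops both the atom and the fresh persistent map built on every pass. This is only a diagnostic sketch, not a suggested fix; f2-transient is a made-up name, and a transient changes the semantics (it is single-threaded and unsynchronized), so it only indicates whether the slowdown tracks the allocation/CAS work inside the loop body:

    ;; Hypothetical diagnostic variant, not part of the original benchmark.
    (defn f2-transient []
      (let [m (transient {})]
        (loop [m m, i 1e9]
          (if (zero? i)
            (persistent! m)
            ;; assoc! on an existing key updates the transient in place, so no new
            ;; map is allocated per iteration (the boxed value for i still is).
            (recur (assoc! m :k i) (dec i))))))

If 18 of these in one JVM still show the same overhead as 18 f2 threads, the per-iteration map allocation and CAS are probably not the culprit; if they scale like the 18 separate processes, they probably are.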