On 10/13/2016 11:43 AM, Richard Biener wrote:
> On Wed, Oct 12, 2016 at 3:52 PM, Martin Liška <mli...@suse.cz> wrote:
>> On 10/04/2016 11:45 AM, Richard Biener wrote:
>>> On Thu, Sep 15, 2016 at 12:00 PM, Martin Liška <mli...@suse.cz> wrote:
>>>> On 09/07/2016 02:09 PM, Richard Biener wrote:
>>>>> On Wed, Sep 7, 2016 at 1:37 PM, Martin Liška <mli...@suse.cz> wrote:
>>>>>> On 08/18/2016 06:06 PM, Richard Biener wrote:
>>>>>>> On August 18, 2016 5:54:49 PM GMT+02:00, Jakub Jelinek <ja...@redhat.com> wrote:
>>>>>>>> On Thu, Aug 18, 2016 at 08:51:31AM -0700, Andi Kleen wrote:
>>>>>>>>>> I'd prefer to make updates atomic in multi-threaded applications.
>>>>>>>>>> The best proxy we have for that is -pthread.
>>>>>>>>>>
>>>>>>>>>> Is it slower? Most definitely, but odds are we're giving folks
>>>>>>>>>> garbage data otherwise, which in many ways is even worse.
>>>>>>>>>
>>>>>>>>> It will likely be catastrophically slower in some cases.
>>>>>>>>>
>>>>>>>>> Catastrophically as in too slow to be usable.
>>>>>>>>>
>>>>>>>>> An atomic instruction is a lot more expensive than a single
>>>>>>>>> increment. Also, they are sometimes really slow depending on the
>>>>>>>>> state of the machine.
>>>>>>>>
>>>>>>>> Can't we just have thread-local copies of all the counters (perhaps
>>>>>>>> using a __thread pointer as the base) and just atomically merge at
>>>>>>>> thread termination?
>>>>>>>
>>>>>>> I suggested that as well, but of course it'll have its own class of
>>>>>>> issues (short-lived threads, so we need to somehow re-use counters
>>>>>>> from terminated threads; a large number of threads and thus using
>>>>>>> too much memory for the counters).
>>>>>>>
>>>>>>> Richard.
>>>>>>
>>>>>> Hello.
>>>>>>
>>>>>> I've put the approach on my TODO list; let's see whether it is doable
>>>>>> in a reasonable amount of time.
>>>>>>
>>>>>> I've just finished some measurements to illustrate the slow-down of
>>>>>> the -fprofile-update=atomic approach. All numbers are: no profile,
>>>>>> -fprofile-generate, -fprofile-generate -fprofile-update=atomic.
>>>>>>
>>>>>> c-ray benchmark (utilizing 8 threads, -O3): 1.7, 15.5, 38.1 s
>>>>>> unrar (utilizing 8 threads, -O3): 3.6, 11.6, 38 s
>>>>>> tramp3d (1 thread, -O3): 18.0, 46.6, 168 s
>>>>>>
>>>>>> So the slow-down is roughly 300% compared to -fprofile-generate. I
>>>>>> don't have much experience with default option selection, but these
>>>>>> numbers can probably help.
>>>>>>
>>>>>> Thoughts?
>>>>>
>>>>> Look at the generated code for an instrumented simple loop and see that
>>>>> for the non-atomic updates we happily apply store motion to the counter
>>>>> update, and thus we only get one counter update per loop exit rather
>>>>> than one per loop iteration. Now see what happens in the atomic case
>>>>> (I suspect you get one update per iteration).
>>>>>
>>>>> I'll bet this accounts for most of the slowdown.
>>>>>
>>>>> Back then, ICC, which had atomic counter updates (but using function
>>>>> calls - ugh!), had a >1000% overhead with FDO for tramp3d (they also
>>>>> didn't have early inlining -- removing abstraction helps to reduce the
>>>>> number of counters significantly).
>>>>>
>>>>> Richard.
>>>>
>>>> Hi.
>>>>
>>>> During Cauldron I discussed with Richi approaches for speeding up ARCS
>>>> profile counter updates. My first attempt is to utilize TLS storage,
>>>> where every function accumulates its arcs counters. These are
>>>> eventually added (using atomic operations) to the global ones at the
>>>> very end of the function.
>>>> Currently I rely on target support for TLS; it is questionable whether
>>>> we want such a requirement for -fprofile-update=atomic, or whether we
>>>> should add a new option value like -fprofile-update=atomic-tls?
>>>>
>>>> Running the patch on tramp3d, compared to the previous numbers, it
>>>> takes 88 s to finish. The time shrinks to 50% of the current
>>>> implementation.
>>>>
>>>> Thoughts?
>>>
>>> Hmm, I thought I suggested that you can simply use automatic storage
>>> (which effectively is TLS...) for regions that are not forked or
>>> abnormally left (which means SESE regions that have no calls that
>>> eventually terminate or throw externally).
>>>
>>> So why did you end up with TLS?
>>
>> Hi.
>>
>> Using TLS does not make sense, stupid mistake ;)
>>
>> By using SESE regions, do you mean the infrastructure that is utilized
>> by the Graphite machinery?
>
> No, just a "single-entry single-exit region", which makes the placement of
> the initializations of the internal counters to zero and of the updates of
> the actual counters "obvious".
>
> Note that this "optimization" isn't one if the SESE region does not
> contain cycle(s), unless there is a way to do an atomic update of a bunch
> of counters faster than doing them separately. This optimization will also
> increase register pressure (or force the internal counters to the stack).
> Thus selecting which counters to "optimize" and which ones to leave in
> place might be necessary.
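
Just to confirm we mean the same transformation, here is how I read the
suggestion, written out by hand in plain C (only a sketch: the __gcov0_foo
symbol merely mirrors the __gcov0.* names visible in the perf report below,
and the two-counter layout is made up for the example):

extern long long __gcov0_foo[2];   /* global arcs counters; layout made up */

void
foo (int n)
{
  /* Internal counters in automatic storage, zero-initialized at the
     entry of the single-entry single-exit region.  */
  long long c0 = 0, c1 = 0;

  for (int i = 0; i < n; i++)
    {
      c0++;                        /* plain per-iteration increment */
      if (i & 1)
        c1++;                      /* counter of another BB in the region */
    }

  /* One atomic update per counter at the region exit instead of one
     atomic operation per loop iteration.  */
  __atomic_fetch_add (&__gcov0_foo[0], c0, __ATOMIC_RELAXED);
  __atomic_fetch_add (&__gcov0_foo[1], c1, __ATOMIC_RELAXED);
}

With more live counters than spare registers, c0 and c1 of course end up on
the stack, which is exactly the register pressure problem you mention.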
Ok, I must admit that the selection of which counters to optimize is
crucial. The current implementation (atomic increments at the places where
a BB is reached) has the advantage that it does not increase register
pressure and does not touch a global arcs counter when the BB is not
visited. On the other hand, local counters that are updated at function
exit (my current patch) possibly update counters for BBs that were never
reached, and they create a huge memory contention spot if multiple threads
call a function very often. This is the perf report of the c-ray benchmark:

       │      if((d = SQ(b) - 4.0 * a * c) < 0.0) return 0;
  0.01 │3c0:  xor    %ecx,%ecx
  0.00 │3c2:  mov    0x110(%rsp),%rdx
  0.00 │      lock   add    %rdx,__gcov0.ray_sphere
  7.96 │      mov    0x118(%rsp),%rdx
       │      lock   add    %rdx,__gcov0.ray_sphere+0x8
 11.39 │      mov    0x120(%rsp),%rdx
       │      lock   add    %rdx,__gcov0.ray_sphere+0x10
 11.09 │      mov    0x128(%rsp),%rdx
  0.00 │      lock   add    %rdx,__gcov0.ray_sphere+0x18
 11.27 │      mov    0x130(%rsp),%rdx
       │      lock   add    %rdx,__gcov0.ray_sphere+0x20
 11.02 │      mov    0x138(%rsp),%rdx
       │      lock   add    %rdx,__gcov0.ray_sphere+0x28
 11.46 │      mov    0x140(%rsp),%rdx
  0.00 │      lock   add    %rdx,__gcov0.ray_sphere+0x30
 11.84 │      mov    0x148(%rsp),%rdx
  0.00 │      lock   add    %rdx,__gcov0.ray_sphere+0x38
 11.57 │      mov    0x150(%rsp),%rdx
       │      lock   add    %rdx,__gcov0.ray_sphere+0x40
  6.86 │      mov    0x158(%rsp),%rdx
       │      lock   add    %rdx,__gcov0.ray_sphere+0x48

My current approach does an atomic increment when maybe_hot_bb_p returns
false and uses local counters otherwise; a rough sketch of this hybrid
scheme is attached below as a P.S. The question is how to find a better
place to initialize and flush the local counter values. Ideas welcome.

Thanks,
Martin
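
P.S. For concreteness, the hybrid scheme written out by hand. Again only a
sketch: maybe_hot_bb_p is of course evaluated inside the compiler, so only
its result is visible in the generated code, and all names below are made
up.

extern long long __gcov0_foo[2];   /* global arcs counters; layout made up */

void
foo (int n)
{
  /* BB considered hot (maybe_hot_bb_p returned true at compile time):
     accumulate into a local counter...  */
  long long hot = 0;

  for (int i = 0; i < n; i++)
    hot++;                         /* cheap non-atomic update */

  if (n < 0)
    /* BB considered cold (maybe_hot_bb_p returned false): a single
       atomic increment in place, nothing to initialize or flush.  */
    __atomic_fetch_add (&__gcov0_foo[1], 1, __ATOMIC_RELAXED);

  /* ...and flush the local counter once at function exit.  */
  __atomic_fetch_add (&__gcov0_foo[0], hot, __ATOMIC_RELAXED);
}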