On 10/04/2016 11:45 AM, Richard Biener wrote:
> On Thu, Sep 15, 2016 at 12:00 PM, Martin Liška <mli...@suse.cz> wrote:
>> On 09/07/2016 02:09 PM, Richard Biener wrote:
>>> On Wed, Sep 7, 2016 at 1:37 PM, Martin Liška <mli...@suse.cz> wrote:
>>>> On 08/18/2016 06:06 PM, Richard Biener wrote:
>>>>> On August 18, 2016 5:54:49 PM GMT+02:00, Jakub Jelinek <ja...@redhat.com> wrote:
>>>>>> On Thu, Aug 18, 2016 at 08:51:31AM -0700, Andi Kleen wrote:
>>>>>>>> I'd prefer to make updates atomic in multi-threaded applications.
>>>>>>>> The best proxy we have for that is -pthread.
>>>>>>>>
>>>>>>>> Is it slower, most definitely, but odds are we're giving folks
>>>>>>>> garbage data otherwise, which in many ways is even worse.
>>>>>>>
>>>>>>> It will likely be catastrophically slower in some cases.
>>>>>>>
>>>>>>> Catastrophically as in too slow to be usable.
>>>>>>>
>>>>>>> An atomic instruction is a lot more expensive than a single
>>>>>>> increment.  Also they sometimes are really slow depending on the
>>>>>>> state of the machine.
>>>>>>
>>>>>> Can't we just have thread-local copies of all the counters (perhaps
>>>>>> using a __thread pointer as base) and just atomically merge at
>>>>>> thread termination?
>>>>>>
>>>>>> Jakub
>>>>>
>>>>> I suggested that as well, but of course it'll have its own class of
>>>>> issues (short-lived threads, so we need to somehow re-use counters
>>>>> from terminated threads; a large number of threads and thus using
>>>>> too much memory for the counters).
>>>>>
>>>>> Richard.
>>>>
>>>> Hello.
>>>>
>>>> I've added the approach to my TODO list; let's see whether it would
>>>> be doable in a reasonable amount of time.
>>>>
>>>> I've just finished some measurements to illustrate the slow-down of
>>>> the -fprofile-update=atomic approach.
>>>> All numbers are: no profile, -fprofile-generate, -fprofile-generate
>>>> -fprofile-update=atomic
>>>> c-ray benchmark (utilizing 8 threads, -O3): 1.7, 15.5, 38.1s
>>>> unrar (utilizing 8 threads, -O3): 3.6, 11.6, 38s
>>>> tramp3d (1 thread, -O3): 18.0, 46.6, 168s
>>>>
>>>> So the slow-down is roughly 300% compared to -fprofile-generate.
>>>> I don't have much experience with default option selection, but
>>>> these numbers can probably help.
>>>>
>>>> Thoughts?
>>>>
>>>> Martin
>>>
>>> Look at the generated code for an instrumented simple loop and see
>>> that for the non-atomic updates we happily apply store motion to the
>>> counter update, and thus we only get one counter update per loop exit
>>> rather than one per loop iteration.  Now see what happens for the
>>> atomic case (I suspect you get one per iteration).
>>>
>>> I'll bet this accounts for most of the slowdown.
>>>
>>> Back in time, ICC, which had atomic counter updates (but using
>>> function calls - ugh!), had a >1000% overhead with FDO for tramp3d
>>> (they also didn't have early inlining -- removing abstraction helps
>>> reduce the number of counters significantly).
>>>
>>> Richard.
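For concreteness, the difference on a simple instrumented loop looks
roughly like the hand-written sketch below; __gcov_ctr is an invented
stand-in for one gcov arc counter (the real instrumentation uses gcov's
internal counter arrays).

/* Non-atomic instrumentation after store motion: the counter is kept
   in a register and stored once at the loop exit.  */
extern long __gcov_ctr;         /* invented stand-in for an arc counter */

long
sum_nonatomic (const int *a, int n)
{
  long sum = 0;
  long tmp = __gcov_ctr;
  for (int i = 0; i < n; i++)
    {
      tmp++;                    /* plain increment, freely hoistable */
      sum += a[i];
    }
  __gcov_ctr = tmp;             /* one store per loop exit */
  return sum;
}

/* With -fprofile-update=atomic the update cannot be hoisted, so we pay
   one atomic read-modify-write per loop iteration.  */
long
sum_atomic (const int *a, int n)
{
  long sum = 0;
  for (int i = 0; i < n; i++)
    {
      __atomic_fetch_add (&__gcov_ctr, 1, __ATOMIC_RELAXED);
      sum += a[i];
    }
  return sum;
}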
>> Hi.
>>
>> During Cauldron I discussed with Richi approaches to speed up the arc
>> profile counter updates.  My first attempt is to utilize TLS storage,
>> where every function accumulates its arc counters.  These are
>> eventually added (using atomic operations) to the global counters at
>> the very end of the function.  Currently I rely on target support for
>> TLS; it's questionable whether we want such a requirement for
>> -fprofile-update=atomic, or whether to add a new option value like
>> -fprofile-update=atomic-tls?
>>
>> Running the patch on tramp3d, it now takes 88s to finish.  Compared to
>> the previous numbers, the time shrinks to 50% of the current
>> implementation.
>>
>> Thoughts?
>>
>> Martin
>
> Hmm, I thought I suggested that you can simply use automatic storage
> (which effectively is TLS...) for regions that are not forked or
> abnormally left (which means SESE regions that have no calls that
> eventually terminate or throw externally).
>
> So why did you end up with TLS?
>
> Richard.
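For concreteness, the automatic-storage variant Richard describes would
look roughly like the hand-written sketch below, again with __gcov_ctr
as an invented stand-in: a plain accumulator on the stack, merged once
at the region exit, with no __thread variable involved.

extern long __gcov_ctr;         /* invented stand-in for an arc counter */

long
sum_region (const int *a, int n)
{
  /* Automatic storage is implicitly thread-private, so no target TLS
     support is needed.  This is only safe for a single-entry
     single-exit region that cannot be left abnormally (no fork, no
     longjmp, no call that may throw externally); otherwise the pending
     count in local_ctr would be lost.  */
  long local_ctr = 0;
  long sum = 0;
  for (int i = 0; i < n; i++)
    {
      local_ctr++;              /* plain increment */
      sum += a[i];
    }
  /* One atomic merge per region exit instead of one atomic
     read-modify-write per iteration.  */
  __atomic_fetch_add (&__gcov_ctr, local_ctr, __ATOMIC_RELAXED);
  return sum;
}

The TLS variant measured above differs only in keeping the accumulator
in a __thread variable instead of a stack slot, which is where the
target TLS requirement comes from.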
Hi.

Using TLS does not make sense here, stupid mistake ;)

By using SESE regions, do you mean the infrastructure that is utilized
by the Graphite machinery?

Thanks,
Martin