On 10/13/2016 11:43 AM, Richard Biener wrote:
> On Wed, Oct 12, 2016 at 3:52 PM, Martin Liška <[email protected]> wrote:
>> On 10/04/2016 11:45 AM, Richard Biener wrote:
>>> On Thu, Sep 15, 2016 at 12:00 PM, Martin Liška <[email protected]> wrote:
>>>> On 09/07/2016 02:09 PM, Richard Biener wrote:
>>>>> On Wed, Sep 7, 2016 at 1:37 PM, Martin Liška <[email protected]> wrote:
>>>>>> On 08/18/2016 06:06 PM, Richard Biener wrote:
>>>>>>> On August 18, 2016 5:54:49 PM GMT+02:00, Jakub Jelinek
>>>>>>> <[email protected]> wrote:
>>>>>>>> On Thu, Aug 18, 2016 at 08:51:31AM -0700, Andi Kleen wrote:
>>>>>>>>>> I'd prefer to make updates atomic in multi-threaded applications.
>>>>>>>>>> The best proxy we have for that is -pthread.
>>>>>>>>>>
>>>>>>>>>> Is it slower, most definitely, but odds are we're giving folks
>>>>>>>>>> garbage data otherwise, which in many ways is even worse.
>>>>>>>>>
>>>>>>>>> It will likely be catastrophically slower in some cases.
>>>>>>>>>
>>>>>>>>> Catastrophically as in too slow to be usable.
>>>>>>>>>
>>>>>>>>> An atomic instruction is a lot more expensive than a single
>>>>>>>> increment. Also
>>>>>>>>> they sometimes are really slow depending on the state of the machine.
>>>>>>>>
>>>>>>>> Can't we just have thread-local copies of all the counters (perhaps
>>>>>>>> using
>>>>>>>> __thread pointer as base) and just atomically merge at thread
>>>>>>>> termination?
>>>>>>>
>>>>>>> I suggested that as well but of course it'll have its own class of
>>>>>>> issues (short lived threads, so we need to somehow re-use counters from
>>>>>>> terminated threads, large number of threads and thus using too much
>>>>>>> memory for the counters)
>>>>>>>
>>>>>>> Richard.
>>>>>>
>>>>>> Hello.
>>>>>>
>>>>>> I've got written the approach on my TODO list, let's see whether it
>>>>>> would be doable in a reasonable amount of time.
>>>>>>
>>>>>> I've just finished some measurements to illustrate slow-down of
>>>>>> -fprofile-update=atomic approach.
>>>>>> All numbers are: no profile, -fprofile-generate, -fprofile-generate
>>>>>> -fprofile-update=atomic
>>>>>> c-ray benchmark (utilizing 8 threads, -O3): 1.7, 15.5., 38.1s
>>>>>> unrar (utilizing 8 threads, -O3): 3.6, 11.6, 38s
>>>>>> tramp3d (1 thread, -O3): 18.0, 46.6, 168s
>>>>>>
>>>>>> So the slow-down is roughly 300% compared to -fprofile-generate. I'm not
>>>>>> having much experience with default option
>>>>>> selection, but these numbers can probably help.
>>>>>>
>>>>>> Thoughts?
>>>>>
>>>>> Look at the generated code for an instrumented simple loop and see that
>>>>> for
>>>>> the non-atomic updates we happily apply store-motion to the counter update
>>>>> and thus we only get one counter update per loop exit rather than one per
>>>>> loop iteration. Now see what happens for the atomic case (I suspect you
>>>>> get one per iteration).
>>>>>
>>>>> I'll bet this accounts for most of the slowdown.
>>>>>
>>>>> Back in time ICC which had atomic counter updates (but using function
>>>>> calls - ugh!) had a > 1000% overhead with FDO for tramp3d (they also
>>>>> didn't have early inlining -- removing abstraction helps reducing the
>>>>> number
>>>>> of counters significantly).
>>>>>
>>>>> Richard.
>>>>
>>>> Hi.
>>>>
>>>> During Cauldron I discussed with Richi approaches how to speed-up ARCS
>>>> profile counter updates. My first attempt is to utilize TLS storage, where
>>>> every function is accumulating arcs counters. These are eventually added
>>>> (using atomic operations) to the global one at the very end of a function.
>>>> Currently I rely on target support of TLS, which is questionable whether
>>>> to have such a requirement for -fprofile-update=atomic, or to add a new
>>>> option value
>>>> like -fprofile-update=atomic-tls?
>>>>
>>>> Running the patch on tramp3d, compared to previous numbers, it takes 88s
>>>> to finish.
>>>> Time shrinks to 50%, compared to the current implementation.
>>>>
>>>> Thoughts?
>>>
>>> Hmm, I thought I suggested that you can simply use automatic storage
>>> (which effectively
>>> is TLS...) for regions that are not forked or abnormally left (which
>>> means SESE regions
>>> that have no calls that eventually terminate or throw externally).
>>>
>>> So why did you end up with TLS?
>>
>> Hi.
>>
>> Usage for TLS does not makes sense, stupid mistake ;)
>>
>> By using SESE regions, do you mean the infrastructure that is utilized
>> by Graphite machinery?
>
> No, just as "single-entry single-exit region" which means placing of
> initializations of the internal counters to zero and the updates of the
> actual counters is "obvious".
>
> Note that this "optimization" isn't one if the SESE region does not contain
> cycle(s). Unless there is a way to do an atomic update of a bunch of
> counters faster than doing them separately. This optimization will also
> increase register pressure (or force the internal counters to the stack).
> Thus selecting which counters to "optimize" and which ones to leave in place
> might be necessary.
Ok, I must admit the selection which counters to optimize is crucial. Current
implementation
(atomic increments at places where a BB is reached) has advantage that it does
not increase
register pressure and it does not update a global arcs counter when the BB is
not visited.
On the other hand, having a local counters which are updated at function exit
(my current implementation)
possibly updates counters for BB that not seen and it creates a huge memory
locking spot if multiple threads
call a function very often. This is perf report of cray benchmark:
│ if((d = SQ(b) - 4.0 * a * c) < 0.0) return 0;
0.01 │3c0: xor %ecx,%ecx
0.00 │3c2: mov 0x110(%rsp),%rdx
0.00 │ lock add %rdx,__gcov0.ray_sphere
7.96 │ mov 0x118(%rsp),%rdx
│ lock add %rdx,__gcov0.ray_sphere+0x8
11.39 │ mov 0x120(%rsp),%rdx
│ lock add %rdx,__gcov0.ray_sphere+0x10
11.09 │ mov 0x128(%rsp),%rdx
0.00 │ lock add %rdx,__gcov0.ray_sphere+0x18
11.27 │ mov 0x130(%rsp),%rdx
│ lock add %rdx,__gcov0.ray_sphere+0x20
11.02 │ mov 0x138(%rsp),%rdx
│ lock add %rdx,__gcov0.ray_sphere+0x28
11.46 │ mov 0x140(%rsp),%rdx
0.00 │ lock add %rdx,__gcov0.ray_sphere+0x30
11.84 │ mov 0x148(%rsp),%rdx
0.00 │ lock add %rdx,__gcov0.ray_sphere+0x38
11.57 │ mov 0x150(%rsp),%rdx
│ lock add %rdx,__gcov0.ray_sphere+0x40
6.86 │ mov 0x158(%rsp),%rdx
│ lock add %rdx,__gcov0.ray_sphere+0x48
My current approach does atomic increment when maybe_hot_bb_p return false
and local counters are used otherwise. Question is how to find a better place
where to initialize and store local counter values?
Ideas welcomed.
Thanks,
Martin
>
> Richard.
>
>> Thanks,
>> Martin
>>
>>>
>>> Richard.
>>>
>>>> Martin
>>>>
>>>>>
>>>>>> Martin
>>>>>>
>>>>>>>
>>>>>>>> Jakub
>>>>>>>
>>>>>>>
>>>>>>
>>>>
>>