On Mon, Mar 3, 2014 at 1:53 PM, David Brown <da...@westcontrol.com> wrote:
> On 03/03/14 11:49, Richard Biener wrote:
>> On Mon, Mar 3, 2014 at 11:41 AM, David Brown <da...@westcontrol.com> wrote:
>>> On 28/02/14 13:19, Richard Sandiford wrote:
>>>> Georg-Johann Lay <a...@gjlay.de> writes:
>>>>> Notice that in code1, func might contain such asm-pairs to implement
>>>>> atomic operations, but moving costly_func across func does *not*
>>>>> affect the interrupt respond times in such a disastrous way.
>>>>>
>>>>> Thus you must be *very* careful w.r.t. optimizing against asm volatile
>>>>> + memory clobber.  It's too easy to miss some side effects of *real*
>>>>> code.
>>>>
>>>> I understand the example, but I don't think volatile asms guarantee
>>>> what you want here.
>>>>
>>>>> Optimizing code to scrap and pointing to some GCC internal reasoning
>>>>> or some standard's wording does not help with real code.
>>>>
>>>> But how else can a compiler work?  It doesn't just regurgitate canned code,
>>>> so it can't rely on human intuition as to what "makes sense".  We have to
>>>> have specific rules and guarantees and say that anything outside those
>>>> rules and guarantees is undefined.
>>>>
>>>> It sounds like you want an asm with an extra-strong ordering guarantee.
>>>> I think that would need to be an extension, since it would need to consider
>>>> cases where the asm is used in a function.  (Shades of carries_dependence
>>>> or whatever in the huge atomic thread.)  I think anything where:
>>>>
>>>>   void foo (void) { X; }
>>>>   void bar (void) { Y1; foo (); Y2; }
>>>>
>>>> has different semantics from:
>>>>
>>>>   void bar (void) { Y1; X; Y2; }
>>>>
>>>> is very dangerous.  And assuming that any function call could enable
>>>> or disable interrupts, and therefore that nothing can be moved across
>>>> a non-const function call, would limit things a bit too much.
>>>>
>>>> Thanks,
>>>> Richard
>>>>
>>>>
>>>
>>> I think the problem stems from "volatile" being a barrier to /data flow/
>>> changes,
>>
>> What kind of /data flow/ changes?  It certainly isn't that currently:
>> two volatile accesses always conflict, but a volatile and a
>> non-volatile mem do not:
>>
>> static int
>> true_dependence_1 (const_rtx mem, enum machine_mode mem_mode, rtx mem_addr,
>>                    const_rtx x, rtx x_addr, bool mem_canonicalized)
>> {
>> ...
>>   if (MEM_VOLATILE_P (x) && MEM_VOLATILE_P (mem))
>>     return 1;
>>
>> bool
>> refs_may_alias_p_1 (ao_ref *ref1, ao_ref *ref2, bool tbaa_p)
>> {
>> ...
>>   /* Two volatile accesses always conflict.  */
>>   if (ref1->volatile_p
>>       && ref2->volatile_p)
>>     return true;
>>
>>> but what is needed in this case is a barrier to /control flow/
>>> changes.  To my knowledge, C does not provide any way of doing this, nor
>>> are there existing gcc extensions to guarantee the ordering.  But it
>>> certainly is the case that control flow ordering like this is important
>>> - it can be critical in embedded systems (such as in the example here by
>>> Georg-Johann), but it can also be important for non-embedded systems
>>> (such as to minimise the time spent while holding a lock).
>>
>> Can you elaborate on this?  I have a hard time thinking of a
>> control flow transform that affects volatiles.
>>
>> Richard.
>>
>
> I am perhaps not expressing myself very clearly here (and I don't know
> the internals of gcc well enough to use the source to help).
>
> Normal (i.e., not "asm") volatile accesses force an order on those
> volatile data accesses - if the source code says a volatile read of "x"
> then a volatile read of "y", then the compiler has to issue those reads
> in that order.  It can't re-order them, or hoist them out of a loop, or
> do any other re-ordering optimisations.  Clobbers, inputs and outputs in
> inline assembly give a similar ordering on the data flow.  But none of
> this affects the /control/ flow.  So the __attribute__((const))
> "costly_func" described by Georg-Johann can be moved freely by the
> compiler amongst these volatile /data/ accesses.
>
> The C abstract machine does not have any concept of timings, only of
> observable accesses (volatile accesses, calls to external code, and
> entry/exit from main()).  So it does not distinguish between the sequences:
>
>         volX = 1;
>         y = costly_func(z);
>         volX = 2;
>
> and
>
>         y = costly_func(z);
>         volX = 1;
>         volX = 2;
>
> and
>         volX = 1;
>         volX = 2;
>         y = costly_func(z);
>
> (This assumes that costly_func is __attribute__((const)), and y and z
> are non-volatile.)
>
> For some real-world usage, however, these sequences are very different.
>  In "big" systems, it is unlikely to change correctness.  If "volX" were
> part of a locking mechanism, for example, then each version of this code
> would be correct - but they might differ in the length of time that the
> locks were held, and that could seriously affect performance.  In
> embedded systems, low performance could mean failure.  The problem is
> exacerbated by small CPUs that need library functions for seemingly
> simple operations - gcc might happily move a division operation around
> without realising the massive time cost on an 8-bit processor.
>
> In particular, I have seen code like this:
>
> extern volatile int v1, v2;
> extern volatile bool interruptEnable;
> int c;
> void foo(int a) {
>         int b = a / c;
>         interruptEnable = 0;
>         v1 = b;
>         v2 = b;
>         interruptEnable = 1;
> }
>
> get transformed to move the division in between the interrupt disable
> and writing to "v1".  This is a valid transform from C's viewpoint.
> Putting a "volatile asm("" ::: "memory");" before disabling the
> interrupts usually helps, but AFAIK it is not guaranteed by gcc.  Making
> "b" volatile /will/ help, but means extra memory and instructions -
> something you often want to avoid in embedded systems.

That's not what I'd call "control flow"; it's data dependences again
(or value dependences, if you want to distinguish them from dependences
through memory).

Indeed nothing specifies the point where a/c is executed apart from
that it will be computed before its value is consumed.

You can't have both "strict ordering" and "no penalty due to using
volatile".  But in the above case the scheduler description for the
target should ensure that a/c is moved as far away from its consumer
as possible - but wait - probably that gets disabled by volatiles
being a scheduling barrier ... ;)  (and at expansion time TER likely
"moves" a/c directly before v1 = b).

Richard.
