Tamar Christina <tamar.christ...@arm.com> writes:
>> -----Original Message-----
>> From: Richard Sandiford <richard.sandif...@arm.com>
>> Sent: Monday, May 16, 2022 1:18 PM
>> To: Tamar Christina <tamar.christ...@arm.com>
>> Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>; rguent...@suse.de;
>> jeffreya...@gmail.com
>> Subject: Re: [PATCH 1/3]middle-end: Add the ability to let the target decide
>> the method of argument promotions.
>> 
>> Richard Sandiford via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
>> > Tamar Christina <tamar.christ...@arm.com> writes:
>> >>> -----Original Message-----
>> >>> From: Richard Sandiford <richard.sandif...@arm.com>
>> >>> Sent: Monday, May 16, 2022 12:36 PM
>> >>> To: Tamar Christina <tamar.christ...@arm.com>
>> >>> Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>; rguent...@suse.de;
>> >>> jeffreya...@gmail.com
>> >>> Subject: Re: [PATCH 1/3]middle-end: Add the ability to let the
>> >>> target decide the method of argument promotions.
>> >>>
>> >>> Tamar Christina <tamar.christ...@arm.com> writes:
>> >>> > Hi All,
>> >>> >
>> >>> > Some targets require function parameters to be promoted to a
>> >>> > different type on expand time because the target may not have
>> >>> > native instructions to work on such types.  As an example the
>> >>> > AArch64 port does not have native instructions working on integer
>> >>> > 8- or 16-bit values.  As such it promotes every parameter of these
>> >>> > types to 32-bits.
>> >>>
>> >>> This doesn't seem specific to parameters though.  It applies to any
>> >>> 8- or 16-bit variable.  E.g.:
>> >>>
>> >>> #include <stdint.h>
>> >>> uint8_t foo(uint32_t x, uint32_t y) {
>> >>>     uint8_t z = x != 0 ? x : y;
>> >>>     return z + 1;
>> >>> }
>> >>>
>> >>> generates:
>> >>>
>> >>> foo:
>> >>>         cmp     w0, 0
>> >>>         and     w1, w1, 255
>> >>>         and     w0, w0, 255
>> >>>         csel    w0, w1, w0, eq
>> >>>         add     w0, w0, 1
>> >>>         ret
>> >>>
>> >>> So I think the new behaviour is really a modification of the
>> >>> PROMOTE_MODE behaviour rather than the
>> >>> PROMOTE_FUNCTION_MODE behaviour.
>> >>>
>> >>> FWIW, I agree with Richard that it would be better not to add a new hook.
>> >>> I think we're really making PROMOTE_MODE choose between SIGN_EXTEND,
>> >>> ZERO_EXTEND or SUBREG (what LLVM would call “any extend”) rather than
>> >>> the current SIGN_EXTEND vs. ZERO_EXTEND choice.
>> >>
>> >> Ah, I hadn't realized this also applied to locals... OK, I can modify
>> >> PROMOTE_MODE then, but I also need the actual SSA_NAME and not just
>> >> the type, so I will have to pass this along.
>> >>
>> >> From a practical point of view... the actual hook, however, is
>> >> implemented by 34 targets; would I need to CC maintainers for each of
>> >> them, or would global maintainer approval suffice for these mostly
>> >> mechanical changes?
>> >
>> > Yeah, single approval should be enough for mechanical changes.  It would
>> > be good to do the interface change and mechanical target changes as a
>> > separate prepatch if possible though.
>> >
>> > I'm not sure about passing the SSA name to the target though, or the
>> > way that the aarch64 hook uses the info.  It looks like a single cold
>> > comparison could defeat the optimisation for hot code.
>
> I'm not sure I follow why the likelihood of the comparison matters in
> this case.  I'll expand on it below.

I meant the likelihood that the comparison is executed at all,
not which outcome is more likely.  E.g. suppose the only comparison
occurs on a failure path that eventually calls abort, and that there are
other paths (without comparisons of the same value) that would benefit
from the any-extend optimisation.  We'd prioritise the cold comparison
over optimising the other (hot) code.

I'm just suspicious of heuristics along the lines of “don't do X
if there is a single instance of Y”. :-)
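
As a concrete illustration of the concern (a made-up example, not taken
from the patch or any benchmark), assuming the heuristic asks "is this
value ever used in a comparison?":

#include <stdint.h>
#include <stdlib.h>

/* The only comparison of the 8-bit parameter `v` sits on a cold
   failure path that ends in abort.  The hot store loop only ever
   needs the low 8 bits of `v`, so it could live with an any-extended
   register and no explicit extension.  A "don't any-extend if the
   value is compared anywhere" rule would let the cold check disable
   the optimisation for the hot loop as well.  */
void
fill (uint8_t *dst, uint32_t n, uint8_t v)
{
  if (v == 0)       /* cold sanity check */
    abort ();

  for (uint32_t i = 0; i < n; i++)
    dst[i] = v;     /* hot: only the low 8 bits of `v` matter */
}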

>> > If we do try to make the decision based on uses at expand time, it
>> > might be better for the analysis to be in target-independent code,
>> > with help from the target to decide where extensions are cheap.  It
>> > still feels a bit hacky though.
>
> I thought about it but can't see most targets having this problem. I did go
> with an optimistic heuristic. There are of course various ways to defeat it,
> but looking through the corpus of code I don't see any but the simple cases
> in practice (more below).
>
>> >
>> > What stops us from forming cbz/cbnz when the extension is done close
>> > to the comparison (from the comment in 2/3)?  If we can solve that,
>> > could we simply do an any-extend all the time, and treat removing
>> > redundant extensions as a global availability problem?
>> 
>
> In such cases there's no gain from doing the extension at all, e.g.
> and w0, w0, 255
> cmp w0, 0
> b.eq .Lfoo
>
> will be optimized to
>
> tst w0, 0xff
> b.ne .Lfoo
>
> already.
>
> In RTL the problem occurs when you have nested control flow like nested if
> and switch statements.
> The example in 2/3 was intended to show that, before the patch, what we'd
> generate is:
>
> and w22, w0, 255
> .... <code that clobbers cc and caller saves>
> <switch1>
> cbz w22, .Lfoo1
> ....
> <switch2>
> cbz w22, .Lfoo2
>
> If we have a single comparison we already sink the zero_extend today.
>
> Now if we instead any-extend w0 we end up with:
>
> mov w22, w0
> .... <code that clobbers cc and caller saves>
> <switch1>
> tst w22, 0xff
> b.ne .Lfoo1
> ....
> <switch2>
> tst w22, 0xff
> b.ne .Lfoo2
>
> So you get an additional tst before each branch. You also can't perform the
> tst higher up since CC is clobbered.
> The cbz is nice because the zero_extend doesn't use CC of course, and so
> having the zero-extended value live allows us to optimize the branch.

Once the cbz has been formed (in combine), where does the optimisation
of it happen?
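
For reference, here is a rough C sketch of the nested-switch shape I
understand is being described; the identifiers are made up and it is
not taken from perlbench or the patches:

#include <stdint.h>

void do_this (void);
void do_that (void);
void clobber_lots (void);   /* stands in for code that clobbers CC and
                               the caller-saved registers */

/* `op` has to stay live across the intervening calls and is then
   tested for zero in two separate places, each after CC has been
   clobbered again; this is where the cbz vs. tst+b.ne choice from the
   quoted assembly shows up.  */
void
dispatch (uint8_t op, uint32_t x, uint32_t y)
{
  clobber_lots ();

  switch (x & 3)
    {
    case 0:
      if (op == 0)
        do_this ();
      break;
    default:
      break;
    }

  clobber_lots ();

  switch (y & 3)
    {
    case 1:
      if (op == 0)
        do_that ();
      break;
    default:
      break;
    }
}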

> I don't think branch likeliness matters here as you must keep w22 live in
> both cases somehow. In the SPEC CPU 2017 benchmark perlbench (which uses a
> lot of nested switches) this pattern is responsible for an extra 0.3% code
> size increase, which the approach in 2/3 prevents.
>
>> (which would run after combine)
>> 
>> >
>> > What kind of code do we emit when we do an extension just before an
>> > operation?  If the DECL_RTL is (subreg:QI (reg:SI R) 0), say, then it
>> > should be safe to do the extension directly into R:
>> >
>> >   (set (reg:SI X) (zero_extend:SI (subreg:QI (reg:SI X))))
>> 
>> Oops, that should of course be:
>> 
>>   (set (reg:SI R) (zero_extend:SI (subreg:QI (reg:SI R))))
>> 
>> > which avoids the problem of having two values live at once (the
>> > zero-extended value and the any-extended value).
>
> I'm not sure it does, as the any-extended value must remain live. I.e. above
> you can't get rid of w22; you can only choose between having it be zero- or
> any-extended.  But I am not sure how you would do that after expand.

These per-operation extends are emitted during expand.  The question is
whether we do them into fresh registers:

   (set (reg:SI Rtmp) (zero_extend:SI (subreg:QI (reg:SI R) 0)))

which leaves both R and Rtmp live at the same time, or whether we
do them in-situ:

   (set (reg:SI R) (zero_extend:SI (subreg:QI (reg:SI R) 0)))

Expand should know that the latter is valid, given the DECL_RTL.
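
In GCC-internals terms, a rough sketch of how those two alternatives
would be emitted (purely illustrative, not the actual expander code;
it assumes the usual GCC includes and that `reg` is the SImode
register backing the QImode DECL_RTL):

  /* Fresh register: `reg` (any-extended) and `tmp` (zero-extended)
     end up live at the same time.  */
  rtx tmp = gen_reg_rtx (SImode);
  emit_insn (gen_rtx_SET (tmp,
                          gen_rtx_ZERO_EXTEND (SImode,
                                               gen_lowpart (QImode, reg))));

  /* In situ: `reg` is overwritten with its own zero-extension, so only
     one value stays live and later uses see well-defined upper bits.  */
  emit_insn (gen_rtx_SET (reg,
                          gen_rtx_ZERO_EXTEND (SImode,
                                               gen_lowpart (QImode, reg))));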

Thanks,
Richard
