> -----Original Message-----
> From: Jeff Law <jeffreya...@gmail.com>
> Sent: Thursday, November 7, 2024 8:08 PM
> To: Tamar Christina <tamar.christ...@arm.com>; Li, Pan2 <pan2...@intel.com>;
> Richard Biener <richard.guent...@gmail.com>
> Cc: gcc-patches@gcc.gnu.org; juzhe.zh...@rivai.ai; kito.ch...@gmail.com;
> rdapp....@gmail.com
> Subject: Re: [PATCH v2 01/10] Match: Simplify branch form 4 of unsigned
> SAT_ADD into branchless
> 
> 
> 
> On 11/7/24 8:07 AM, Tamar Christina wrote:
> >
> >> -----Original Message-----
> >> From: Li, Pan2 <pan2...@intel.com>
> >> Sent: Thursday, November 7, 2024 12:57 PM
> >> To: Tamar Christina <tamar.christ...@arm.com>; Richard Biener
> >> <richard.guent...@gmail.com>
> >> Cc: gcc-patches@gcc.gnu.org; juzhe.zh...@rivai.ai; kito.ch...@gmail.com;
> >> jeffreya...@gmail.com; rdapp....@gmail.com
> >> Subject: RE: [PATCH v2 01/10] Match: Simplify branch form 4 of unsigned
> >> SAT_ADD into branchless
> >>
> >> I see your point that the backend can leverage a conditional move to emit
> >> the branch code.
> >>
> >>> For instance see https://godbolt.org/z/fvrq3aq6K
> >>> On ISAs with conditional operations the branch version gets ifconverted.
> >>> On AArch64 we get:
> >>> sat_add_u_1(unsigned int, unsigned int):
> >>>          adds    w0, w0, w1
> >>>          csinv   w0, w0, wzr, cc
> >>>          ret
> >>> so just 2 instructions, and also branchless. On x86_64 we get:
> >>> sat_add_u_1(unsigned int, unsigned int):
> >>>          add     edi, esi
> >>>          mov     eax, -1
> >>>          cmovnc  eax, edi
> >>>          ret
> >>> so 3 instructions but a dependency chain of 2, also branchless.
> >>> This patch would regress both of these.
> >>
> >> But the above Godbolt link may not be good evidence, because both x86_64
> >> and aarch64 already implement usadd. Thus, both functions go through
> >> usadd<QImode>. For example, as below, sat_add_u_1 and sat_add_u_2 compile
> >> to almost the same code when the backend implements usadd.
> >>
> >> #include <stdint.h>
> >>
> >> #define T uint32_t
> >>
> >>    T sat_add_u_1 (T x, T y)
> >>    {
> >>      return (T)(x + y) < x ? -1 : (x + y);
> >>    }
> >>
> >>    T sat_add_u_2 (T x, T y)
> >>    {
> >>      return (x + y) | -((x + y) < x);
> >>    }
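> >>
> >> For reference (this variant is not in the mail above, added purely for
> >> illustration): a third common spelling goes through the overflow builtin.
> >> Assuming the .ADD_OVERFLOW-based patterns elsewhere in this series, it
> >> should reach the same usadd expansion:
> >>
> >>    T sat_add_u_3 (T x, T y)
> >>    {
> >>      T r;
> >>      /* __builtin_add_overflow returns true on unsigned wrap-around,
> >>         so saturate to all-ones in that case.  */
> >>      return __builtin_add_overflow (x, y, &r) ? (T)-1 : r;
> >>    }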
> >>
> >> It becomes different when taking gcc 14.2 (which doesn't have the .SAT_ADD
> >> GIMPLE IR): x86_64 gives the below asm dump at -O3. Looks like there is no
> >> obvious difference in quality here either.
> >>
> >> sat_add_u_1(unsigned int, unsigned int):
> >>          add     edi, esi
> >>          mov     eax, -1
> >>          cmovnc  eax, edi
> >>          ret
> >>
> >> sat_add_u_2(unsigned int, unsigned int):
> >>          add     edi, esi
> >>          sbb     eax, eax
> >>          or      eax, edi
> >>          ret
> >>
> >
> > That's because CE (RTL if-conversion) is able to turn the idiom back into
> > a conditional move.
> > Pick a target that doesn't have conditional instructions, like PowerPC
> > https://godbolt.org/z/4bTv18WMv
> >
> > You'll see that this canonicalization has made codegen worse.
> >
> > After:
> >
> > .L.sat_add_u_1(unsigned int, unsigned int):
> >          add 4,3,4
> >          rldicl 9,4,0,32
> >          subf 3,3,9
> >          sradi 3,3,63
> >          or 3,3,4
> >          rldicl 3,3,0,32
> >          blr
> >
> > and before
> >
> > .L.sat_add_u_1(unsigned int, unsigned int):
> >          add 4,3,4
> >          cmplw 0,4,3
> >          bge 0,.L2
> >          li 4,-1
> > .L2:
> >          rldicl 3,4,0,32
> >          blr
> >
> > It means it now always has to execute 6 instructions, whereas before it was
> > 4 or 5 depending on which way the branch goes. So for those architectures,
> > it's always slower.
> I'm not sure it's that simple.  It'll depend on the micro-architecture:
> things like the strength of the branch predictors and how fetch blocks are
> handled (can you have embedded not-taken branches, short-forward-branch
> optimizations, etc.).

It's a pretty safe assumption that most modern branch predictors don't have an
issue with a simple diamond like this. But since you brought up fetch blocks:
at 5 instructions you're far more likely to fit the entire thing in a single
fetch block, especially within a loop, where most targets align loops.

Whereas the more instructions you add, the greater the chance that you need
multiple fetches for the loop. You then weigh the cost of a misprediction
(which will get better as the loop runs) against the cost of an underutilized
fetch, which will not get better.  The longer sequence also assumes that
logical operations are cheap on the particular core.
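
To put rough numbers on it (these are illustrative assumptions, not
measurements): with a ~15-cycle misprediction penalty and misprediction rate
p, the branchy version costs on the order of 5 + 15p against a flat 6 for the
branchless one, so the break-even is already around p ~= 7%. A minimal harness
to measure this on an affected target (a sketch; the function names and
constants are mine, and it assumes POSIX clock_gettime) could look like:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define T uint32_t

/* Form 4 (branchy): targets without conditional instructions keep the
   compare-and-branch here.  */
__attribute__ ((noinline)) T
sat_add_branchy (T x, T y)
{
  return (T)(x + y) < x ? (T)-1 : (x + y);
}

/* The branchless form this canonicalization produces.  */
__attribute__ ((noinline)) T
sat_add_branchless (T x, T y)
{
  return (x + y) | -(T)((T)(x + y) < x);
}

static double
bench (T (*f) (T, T), const T *a, const T *b, size_t n)
{
  struct timespec t0, t1;
  volatile T sink = 0;

  clock_gettime (CLOCK_MONOTONIC, &t0);
  for (size_t i = 0; i < n; i++)
    sink += f (a[i], b[i]);
  clock_gettime (CLOCK_MONOTONIC, &t1);
  (void) sink;
  return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
}

int
main (void)
{
  size_t n = 1u << 24;
  T *a = malloc (n * sizeof *a);
  T *b = malloc (n * sizeof *b);

  /* Scrambled pseudo-random operands: roughly half the additions
     saturate, which is the hard case for the branch predictor.  With
     predictable inputs the branchy version should look much better.  */
  for (size_t i = 0; i < n; i++)
    {
      a[i] = (T) rand () * 2654435761u;
      b[i] = (T) rand () * 2654435761u;
    }

  printf ("branchy:    %f s\n", bench (sat_add_branchy, a, b, n));
  printf ("branchless: %f s\n", bench (sat_add_branchless, a, b, n));
  free (a);
  free (b);
  return 0;
}

(Compile at -O2; comparing gcc 14.2 against a compiler with this patch applied
would isolate the effect of the canonicalization.)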

I would agree that removing the branch is universally a good idea if it
produces the same number of instructions or fewer, or cheaper instructions,
as I said before.  I disagree that replacing it with more instructions is
universally better, and you yourself said "it'll depend".

Tamar

> 
> Jeff
