On 11/7/24 8:07 AM, Tamar Christina wrote:

-----Original Message-----
From: Li, Pan2 <pan2...@intel.com>
Sent: Thursday, November 7, 2024 12:57 PM
To: Tamar Christina <tamar.christ...@arm.com>; Richard Biener
<richard.guent...@gmail.com>
Cc: gcc-patches@gcc.gnu.org; juzhe.zh...@rivai.ai; kito.ch...@gmail.com;
jeffreya...@gmail.com; rdapp....@gmail.com
Subject: RE: [PATCH v2 01/10] Match: Simplify branch form 4 of unsigned
SAT_ADD into branchless

I see your point that the backend can leverage conditional moves to implement the
branch form of the code.

For instance see https://godbolt.org/z/fvrq3aq6K
On ISAs with conditional operations the branch version gets ifconverted.
On AArch64 we get:
sat_add_u_1(unsigned int, unsigned int):
         adds    w0, w0, w1
         csinv   w0, w0, wzr, cc
         ret
so just 2 instructions, and also branchless. On x86_64 we get:
sat_add_u_1(unsigned int, unsigned int):
         add     edi, esi
         mov     eax, -1
         cmovnc  eax, edi
         ret
so 3 instructions but a dependency chain of 2.
also branchless.  This patch would regress both of these.
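For reference, the branchless form the patch canonicalizes to corresponds
roughly to the following at the source level (a sketch of the idiom, not the
exact GIMPLE the match.pd pattern emits):

#include <stdint.h>

uint32_t sat_add_u_branchless (uint32_t x, uint32_t y)
{
  uint32_t sum  = x + y;                /* wraps modulo 2^32 on overflow  */
  uint32_t mask = -(uint32_t)(sum < x); /* all-ones if it wrapped, else 0 */
  return sum | mask;                    /* saturates to UINT32_MAX        */
}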

But the above Godbolt may not be good evidence, because both x86_64 and
aarch64 already implement usadd; thus they all go through usadd<QImode>.
For example, as below, sat_add_u_1 and sat_add_u_2 are almost the same
when the backend implements usadd.

#include <stdint.h>

#define T uint32_t

T sat_add_u_1 (T x, T y)
{
  return (T)(x + y) < x ? -1 : (x + y);
}

T sat_add_u_2 (T x, T y)
{
  return (x + y) | -((x + y) < x);
}
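
As a quick standalone sanity check appended to the snippet above (the test
values are just illustrative), both forms agree and saturate on wraparound:

#include <assert.h>

int main (void)
{
  /* Non-overflowing case: both forms return the plain sum.  */
  assert (sat_add_u_1 (1, 2) == 3);
  assert (sat_add_u_2 (1, 2) == 3);

  /* Overflowing case: x + y wraps, so (T)(x + y) < x holds and the
     mask -((x + y) < x) is all-ones; both saturate to UINT32_MAX.  */
  assert (sat_add_u_1 (UINT32_MAX, 1) == UINT32_MAX);
  assert (sat_add_u_2 (UINT32_MAX, 1) == UINT32_MAX);
  assert (sat_add_u_1 (0xF0000000u, 0x20000000u) == UINT32_MAX);
  assert (sat_add_u_2 (0xF0000000u, 0x20000000u) == UINT32_MAX);

  return 0;
}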

It becomes different when taking gcc 14.2 (which doesn't have the .SAT_ADD
GIMPLE IR); the x86_64 asm dumps at -O3 are below. There looks to be no
obvious difference here.

sat_add_u_1(unsigned int, unsigned int):
         add     edi, esi
         mov     eax, -1
         cmovnc  eax, edi
         ret

sat_add_u_2(unsigned int, unsigned int):
         add     edi, esi
         sbb     eax, eax
         or      eax, edi
         ret


Because CE is able to turn the idiom back into a conditional move.
Pick a target that doesn't have conditional instructions, like PowerPC
https://godbolt.org/z/4bTv18WMv

You'll see that this canonicalization has made codegen worse.

After:

.L.sat_add_u_1(unsigned int, unsigned int):
         add 4,3,4
         rldicl 9,4,0,32
         subf 3,3,9
         sradi 3,3,63
         or 3,3,4
         rldicl 3,3,0,32
         blr

and before:

.L.sat_add_u_1(unsigned int, unsigned int):
         add 4,3,4
         cmplw 0,4,3
         bge 0,.L2
         li 4,-1
.L2:
         rldicl 3,4,0,32
         blr

It means it now always has to execute 6 instructions, whereas before it was 4 or 5
depending on the order of the branch. So for those architectures, it's always slower.

I'm not sure it's that simple. It will depend on the micro-architecture: things
like the strength of the branch predictors and how fetch blocks are handled (can
you have embedded not-taken branches, short-forward-branch optimizations, etc.).

Jeff
