On 11/7/24 8:07 AM, Tamar Christina wrote:
-----Original Message-----
From: Li, Pan2 <pan2...@intel.com>
Sent: Thursday, November 7, 2024 12:57 PM
To: Tamar Christina <tamar.christ...@arm.com>; Richard Biener
<richard.guent...@gmail.com>
Cc: gcc-patches@gcc.gnu.org; juzhe.zh...@rivai.ai; kito.ch...@gmail.com;
jeffreya...@gmail.com; rdapp....@gmail.com
Subject: RE: [PATCH v2 01/10] Match: Simplify branch form 4 of unsigned
SAT_ADD into branchless
I see your point that the backend can leverage conditional moves when emitting
the branch code.
For instance, see https://godbolt.org/z/fvrq3aq6K
On ISAs with conditional operations the branch version gets if-converted.
On AArch64 we get:
sat_add_u_1(unsigned int, unsigned int):
        adds    w0, w0, w1
        csinv   w0, w0, wzr, cc
        ret
so just 2 instructions, and also branchless. On x86_64 we get:
sat_add_u_1(unsigned int, unsigned int):
        add     edi, esi
        mov     eax, -1
        cmovnc  eax, edi
        ret
so 3 instructions but a dependency chain of 2, also branchless. This patch would
regress both of these.
But the above Godbolt may not be good evidence here, because both x86_64 and
aarch64 have already implemented usadd.
Thus, they both go to the usadd<mode> pattern. For example, as below, sat_add_u_1
and sat_add_u_2 are almost the same when the backend implements usadd.
#include <stdint.h>
#define T uint32_t

T sat_add_u_1 (T x, T y)
{
  return (T)(x + y) < x ? -1 : (x + y);
}

T sat_add_u_2 (T x, T y)
{
  return (x + y) | -((x + y) < x);
}
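
For reference, the two forms are equivalent because the unsigned compare
(x + y) < x is true exactly when the addition wrapped, so -((x + y) < x) is
either all ones or zero. A minimal standalone check (the function names and
test values below are just illustrative, not from the patch):

#include <stdint.h>
#include <assert.h>

typedef uint32_t U;

/* Branch form: pick all-ones when the addition wrapped.  */
static U sat_add_branch (U x, U y)
{
  return (U)(x + y) < x ? (U)-1 : (U)(x + y);
}

/* Branchless form: OR-ing in the all-ones/zero mask either saturates
   the result or leaves it unchanged.  */
static U sat_add_branchless (U x, U y)
{
  return (x + y) | -(U)((x + y) < x);
}

int
main (void)
{
  U cases[] = { 0, 1, 0x7fffffffu, 0x80000000u, 0xfffffffeu, 0xffffffffu };
  for (unsigned i = 0; i < sizeof cases / sizeof cases[0]; i++)
    for (unsigned j = 0; j < sizeof cases / sizeof cases[0]; j++)
      assert (sat_add_branch (cases[i], cases[j])
              == sat_add_branchless (cases[i], cases[j]));
  return 0;
}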
It becomes different when taking gcc 14.2 (which doesn't have the .SAT_ADD GIMPLE
IR); the x86_64 asm dump for -O3 is below. Looks like there is no obvious
difference here.
sat_add_u_1(unsigned int, unsigned int):
        add     edi, esi
        mov     eax, -1
        cmovnc  eax, edi
        ret
sat_add_u_2(unsigned int, unsigned int):
        add     edi, esi
        sbb     eax, eax
        or      eax, edi
        ret
Because CE (if-conversion) is able to recognize the idiom and turn it back into
a conditional move.
Pick a target that doesn't have conditional instructions, like PowerPC
https://godbolt.org/z/4bTv18WMv
You'll see that this canonicalization has made codegen worse.
After:
.L.sat_add_u_1(unsigned int, unsigned int):
        add 4,3,4
        rldicl 9,4,0,32
        subf 3,3,9
        sradi 3,3,63
        or 3,3,4
        rldicl 3,3,0,32
        blr
and before:
.L.sat_add_u_1(unsigned int, unsigned int):
        add 4,3,4
        cmplw 0,4,3
        bge 0,.L2
        li 4,-1
.L2:
        rldicl 3,4,0,32
        blr
It means it now always has to execute 6 instructions, whereas before it was 4 or 5
depending on the order of the branch. So for those architectures, it's always slower.
I'm not sure it's that simple. It'll depend on the micro-architecture:
things like the strength of the branch predictors and how fetch blocks are
handled (can you have embedded not-taken branches, short-forward-branch
optimizations, etc.).
Jeff