Thanks Tamar and Jeff for the comments.

> I'm not sure it's that simple.  It'll depend on the micro-architecture.
> So things like strength of the branch predictors, how fetch blocks are
> handled (can you have embedded not-taken branches, short-forward-branch
> optimizations, etc).
> After:
>
> .L.sat_add_u_1(unsigned int, unsigned int):
>         add 4,3,4
>         rldicl 9,4,0,32
>         subf 3,3,9
>         sradi 3,3,63
>         or 3,3,4
>         rldicl 3,3,0,32
>         blr
>
> and before
>
> .L.sat_add_u_1(unsigned int, unsigned int):
>         add 4,3,4
>         cmplw 0,4,3
>         bge 0,.L2
>         li 4,-1
> .L2:
>         rldicl 3,4,0,32
>         blr

I am not familiar with branch prediction, but the branch should be taken 50%
of the time and not taken 50% of the time, given the input range of the
saturating add.  Is that the worst case for branch prediction?  I mean, if we
call it 100 times with a taken, not-taken, taken, not-taken ... sequence,
will the branch version still be faster?  Feel free to correct me if I'm
wrong.
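To make the scenario I have in mind concrete, here is a rough, untested
sketch of that alternating call pattern (sat_add_u and alternating_calls are
just illustrative names, and sat_add_u is simply form 2 from the macro list
further down):

#include <stdint.h>

typedef uint32_t T;

/* Stand-in for whichever saturating-add form gets used; this is form 2
   from the macro list below.  */
static inline T
sat_add_u (T x, T y)
{
  return (T)(x + y) < x ? (T)-1 : (T)(x + y);
}

/* Call the saturating add 100 times with inputs chosen so that the
   overflow check flips between the two outcomes on every iteration.  */
T
alternating_calls (void)
{
  T sum = 0;

  for (int i = 0; i < 100; i++)
    {
      T x = (i & 1) ? 1u : (T)-1;  /* no overflow vs. guaranteed overflow */
      sum += sat_add_u (x, 2u);
    }

  return sum;
}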
Coming back to the 16 forms of saturating add below, is there any suggestion
as to which one or two form(s) may be cheaper than the others from the
perspective of the GIMPLE IR, independent of whether the backend implements
SAT_ADD or not?  (A condensed sketch of the two shapes I have in mind follows
after the macro list.)

#define DEF_SAT_U_ADD_1(T) \
T sat_u_add_##T##_1 (T x, T y) \
{ \
  return (T)(x + y) >= x ? (x + y) : -1; \
}

#define DEF_SAT_U_ADD_2(T) \
T sat_u_add_##T##_2 (T x, T y) \
{ \
  return (T)(x + y) < x ? -1 : (x + y); \
}

#define DEF_SAT_U_ADD_3(T) \
T sat_u_add_##T##_3 (T x, T y) \
{ \
  return x <= (T)(x + y) ? (x + y) : -1; \
}

#define DEF_SAT_U_ADD_4(T) \
T sat_u_add_##T##_4 (T x, T y) \
{ \
  return x > (T)(x + y) ? -1 : (x + y); \
}

#define DEF_SAT_U_ADD_5(T) \
T sat_u_add_##T##_5 (T x, T y) \
{ \
  if ((T)(x + y) >= x) \
    return x + y; \
  else \
    return -1; \
}

#define DEF_SAT_U_ADD_6(T) \
T sat_u_add_##T##_6 (T x, T y) \
{ \
  if ((T)(x + y) < x) \
    return -1; \
  else \
    return x + y; \
}

#define DEF_SAT_U_ADD_7(T) \
T sat_u_add_##T##_7 (T x, T y) \
{ \
  if (x <= (T)(x + y)) \
    return x + y; \
  else \
    return -1; \
}

#define DEF_SAT_U_ADD_8(T) \
T sat_u_add_##T##_8 (T x, T y) \
{ \
  if (x > (T)(x + y)) \
    return -1; \
  else \
    return x + y; \
}

#define DEF_SAT_U_ADD_9(T) \
T sat_u_add_##T##_9 (T x, T y) \
{ \
  T ret; \
  return __builtin_add_overflow (x, y, &ret) == 0 ? ret : -1; \
}

#define DEF_SAT_U_ADD_10(T) \
T sat_u_add_##T##_10 (T x, T y) \
{ \
  T ret; \
  return !__builtin_add_overflow (x, y, &ret) ? ret : -1; \
}

#define DEF_SAT_U_ADD_11(T) \
T sat_u_add_##T##_11 (T x, T y) \
{ \
  T ret; \
  if (__builtin_add_overflow (x, y, &ret) == 0) \
    return ret; \
  else \
    return -1; \
}

#define DEF_SAT_U_ADD_12(T) \
T sat_u_add_##T##_12 (T x, T y) \
{ \
  T ret; \
  if (!__builtin_add_overflow (x, y, &ret)) \
    return ret; \
  else \
    return -1; \
}

#define DEF_SAT_U_ADD_13(T) \
T sat_u_add_##T##_13 (T x, T y) \
{ \
  T ret; \
  return __builtin_add_overflow (x, y, &ret) != 0 ? -1 : ret; \
}

#define DEF_SAT_U_ADD_14(T) \
T sat_u_add_##T##_14 (T x, T y) \
{ \
  T ret; \
  return __builtin_add_overflow (x, y, &ret) ? -1 : ret; \
}

#define DEF_SAT_U_ADD_15(T) \
T sat_u_add_##T##_15 (T x, T y) \
{ \
  T ret; \
  if (__builtin_add_overflow (x, y, &ret) != 0) \
    return -1; \
  else \
    return ret; \
}

#define DEF_SAT_U_ADD_16(T) \
T sat_u_add_##T##_16 (T x, T y) \
{ \
  T ret; \
  if (__builtin_add_overflow (x, y, &ret)) \
    return -1; \
  else \
    return ret; \
}
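For what it is worth, my (possibly naive) reading is that forms 1-8 are all
variations of a single compare-and-select shape, and forms 9-16 are all
variations of a single overflow-builtin shape, roughly as in the untested
sketch below (sat_add_cmp and sat_add_ovf are just illustrative names);
whether the middle end really treats every variant of each shape the same
way is part of what I am asking.

#include <stdint.h>

typedef uint32_t T;

/* Shape behind forms 1-8: compute the wrapping sum, compare it against
   one operand and select either the sum or the saturated value.  */
T
sat_add_cmp (T x, T y)
{
  T sum = x + y;
  return sum < x ? (T)-1 : sum;
}

/* Shape behind forms 9-16: let __builtin_add_overflow produce the sum
   and the overflow flag, then select on the flag.  */
T
sat_add_ovf (T x, T y)
{
  T sum;
  int overflow = __builtin_add_overflow (x, y, &sum);
  return overflow ? (T)-1 : sum;
}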
Pan

-----Original Message-----
From: Jeff Law <jeffreya...@gmail.com>
Sent: Friday, November 8, 2024 4:08 AM
To: Tamar Christina <tamar.christ...@arm.com>; Li, Pan2 <pan2...@intel.com>; Richard Biener <richard.guent...@gmail.com>
Cc: gcc-patches@gcc.gnu.org; juzhe.zh...@rivai.ai; kito.ch...@gmail.com; rdapp....@gmail.com
Subject: Re: [PATCH v2 01/10] Match: Simplify branch form 4 of unsigned SAT_ADD into branchless

On 11/7/24 8:07 AM, Tamar Christina wrote:
>
>> -----Original Message-----
>> From: Li, Pan2 <pan2...@intel.com>
>> Sent: Thursday, November 7, 2024 12:57 PM
>> To: Tamar Christina <tamar.christ...@arm.com>; Richard Biener
>> <richard.guent...@gmail.com>
>> Cc: gcc-patches@gcc.gnu.org; juzhe.zh...@rivai.ai; kito.ch...@gmail.com;
>> jeffreya...@gmail.com; rdapp....@gmail.com
>> Subject: RE: [PATCH v2 01/10] Match: Simplify branch form 4 of unsigned
>> SAT_ADD into branchless
>>
>> I see your point that the backend can leverage condition move to emit the
>> branch code.
>>
>>> For instance see https://godbolt.org/z/fvrq3aq6K
>>> On ISAs with conditional operations the branch version gets ifconverted.
>>> On AArch64 we get:
>>> sat_add_u_1(unsigned int, unsigned int):
>>>         adds    w0, w0, w1
>>>         csinv   w0, w0, wzr, cc
>>>         ret
>>> so just 2 instructions, and also branchless.  On x86_64 we get:
>>> sat_add_u_1(unsigned int, unsigned int):
>>>         add     edi, esi
>>>         mov     eax, -1
>>>         cmovnc  eax, edi
>>>         ret
>>> so 3 instructions but a dependency chain of 2.
>>> also branchless.  This patch would regress both of these.
>>
>> But the above Godbolt may not be a good example for evidence, because both
>> x86_64 and aarch64 implemented usadd already.
>> Thus, they all go to usadd<QImode>.  For example as below, the sat_add_u_1
>> and sat_add_u_2 are almost the same when the backend implemented usadd.
>>
>> #include <stdint.h>
>>
>> #define T uint32_t
>>
>> T sat_add_u_1 (T x, T y)
>> {
>>   return (T)(x + y) < x ? -1 : (x + y);
>> }
>>
>> T sat_add_u_2 (T x, T y)
>> {
>>   return (x + y) | -((x + y) < x);
>> }
>>
>> It will become different when take gcc 14.2 (which doesn't have .SAT_ADD
>> GIMPLE IR), the x86_64 will have below asm dump for -O3.  Looks like no
>> obvious difference here.
>>
>> sat_add_u_1(unsigned int, unsigned int):
>>         add     edi, esi
>>         mov     eax, -1
>>         cmovnc  eax, edi
>>         ret
>>
>> sat_add_u_2(unsigned int, unsigned int):
>>         add     edi, esi
>>         sbb     eax, eax
>>         or      eax, edi
>>         ret
>>
>
> Because CE is able to recognize the idiom back into a conditional move.
> Pick a target that doesn't have conditional instructions, like PowerPC
> https://godbolt.org/z/4bTv18WMv
>
> You'll see that this canonicalization has made codegen worse.
>
> After:
>
> .L.sat_add_u_1(unsigned int, unsigned int):
>         add 4,3,4
>         rldicl 9,4,0,32
>         subf 3,3,9
>         sradi 3,3,63
>         or 3,3,4
>         rldicl 3,3,0,32
>         blr
>
> and before
>
> .L.sat_add_u_1(unsigned int, unsigned int):
>         add 4,3,4
>         cmplw 0,4,3
>         bge 0,.L2
>         li 4,-1
> .L2:
>         rldicl 3,4,0,32
>         blr
>
> It means now it always has to execute 6 instructions, whereas before it was
> 4 or 5 depending on the order of the branch.  So for those architectures,
> it's always slower.

I'm not sure it's that simple.  It'll depend on the micro-architecture.
So things like strength of the branch predictors, how fetch blocks are
handled (can you have embedded not-taken branches, short-forward-branch
optimizations, etc).

Jeff