On 27.02.24 at 12:15, Tamar Christina wrote:
On 19.02.24 at 08:36, Richard Biener wrote:
On Sat, Feb 17, 2024 at 11:30 AM <pan2...@intel.com> wrote:

From: Pan Li <pan2...@intel.com>

This patch would like to add the middle-end representation for the
unsigned saturation add, i.e. the result of the add is set to the
type's maximum on overflow.  It matches a pattern similar to the
one below.

SAT_ADDU (x, y) => (x + y) | (-(TYPE)((TYPE)(x + y) < x))
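
To illustrate, with TYPE = uint8_t, x = 200 and y = 100 (a worked
example for illustration only):

    x + y              wraps to 44            (300 mod 256)
    (TYPE)(x + y) < x  44 < 200 is true, so the mask -(TYPE)1 is 0xFF
    44 | 0xFF          = 255, the saturated maximum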

Does this even try to work out the costs?

For example, take the following code:


#define T __UINT16_TYPE__

T sat_add1 (T x, T y)
{
    return (x + y) | (- (T)((T)(x + y) < x));
}

T sat_add2 (T x, T y)
{
    T z = x + y;
    if (z < x)
        z = (T) -1;
    return z;
}
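
A quick way to convince oneself the two variants agree (an
illustrative throwaway harness appended to the functions above,
not part of the patch):

#include <assert.h>

int main (void)
{
  /* Sample the 16-bit input space with stride 257 (255 * 257 = 65535,
     so both endpoints are hit); both variants must saturate
     identically.  */
  for (unsigned x = 0; x <= 0xFFFF; x += 257)
    for (unsigned y = 0; y <= 0xFFFF; y += 257)
      assert (sat_add1 (x, y) == sat_add2 (x, y));
  return 0;
}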

Compiling with "avr-gcc -S -Os -dp", the code is:


sat_add1:
        add r22,r24      ;  7   [c=8 l=2]  *addhi3/0
        adc r23,r25
        ldi r18,lo8(1)   ;  8   [c=4 l=2]  *movhi/4
        ldi r19,0
        cp r22,r24       ;  9   [c=8 l=2]  cmphi3/2
        cpc r23,r25
        brlo .L2                 ;  10  [c=16 l=1]  branch
        ldi r19,0                ;  31  [c=4 l=1]  movqi_insn/0
        ldi r18,0                ;  32  [c=4 l=1]  movqi_insn/0
.L2:
        clr r24  ;  13  [c=12 l=4]  neghi2/1
        clr r25
        sub r24,r18
        sbc r25,r19
        or r24,r22               ;  29  [c=4 l=1]  iorqi3/0
        or r25,r23               ;  30  [c=4 l=1]  iorqi3/0
        ret              ;  35  [c=0 l=1]  return

sat_add2:
        add r22,r24      ;  8   [c=8 l=2]  *addhi3/0
        adc r23,r25
        cp r22,r24       ;  9   [c=8 l=2]  cmphi3/2
        cpc r23,r25
        brsh .L3                 ;  10  [c=16 l=1]  branch
        ldi r22,lo8(-1)  ;  5   [c=4 l=2]  *movhi/4
        ldi r23,lo8(-1)
.L3:
        mov r25,r23      ;  21  [c=4 l=1]  movqi_insn/0
        mov r24,r22      ;  22  [c=4 l=1]  movqi_insn/0
        ret              ;  25  [c=0 l=1]  return

i.e. the conditional jump is better than overly smart arithmetic
(smaller and faster code with less register pressure).
With larger types the difference is even more pronounced.


*on AVR.  https://godbolt.org/z/7jaExbTa8 shows the branchless code is better.
And the branchy code will vectorize worse, if at all:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112600
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51492
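
For instance, a saturating-add loop of the kind those PRs track
(an illustrative sketch, not taken from the PRs; the branchless
form gives the vectorizer a straight-line loop body):

#include <stddef.h>
#include <stdint.h>

void
vec_sat_add (uint8_t *restrict d, const uint8_t *restrict a,
             const uint8_t *restrict b, size_t n)
{
  for (size_t i = 0; i < n; i++)
    {
      uint8_t s = a[i] + b[i];
      /* Branchless unsigned saturation: the mask is 0xFF exactly
         when the 8-bit add wrapped around.  */
      d[i] = s | -(uint8_t)(s < a[i]);
    }
}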

AVR is a GCC backend:

https://gcc.gnu.org/git/?p=gcc.git;a=tree;f=gcc/config/avr

and likely not the only backend where, more often than not, tricky
arithmetic is more expensive than branching.

Johann



But looking at that output, it just seems like it's your expansion
that's inefficient.

But fair point, perhaps it should just be a normal
DEF_INTERNAL_SIGNED_OPTAB_FN so that we provide the additional
optimization only for targets that want it.
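
For context, such an entry could look along the lines of the existing
DEF_INTERNAL_SIGNED_OPTAB_FN entries in gcc/internal-fn.def (a sketch
only; the name and flags are assumptions, reusing the existing
ssadd/usadd optabs, and not the final patch):

/* Sketch: expose saturating add via the signed/unsigned optab pair,
   so only targets providing ssadd<mode>3/usadd<mode>3 patterns get
   the internal function.  */
DEF_INTERNAL_SIGNED_OPTAB_FN (SAT_ADD, ECF_CONST | ECF_NOTHROW,
                              first, ssadd, usadd, binary)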

Tamar
