On 27.02.24 at 12:15, Tamar Christina wrote:
On 19.02.24 at 08:36, Richard Biener wrote:
On Sat, Feb 17, 2024 at 11:30 AM <pan2...@intel.com> wrote:
From: Pan Li <pan2...@intel.com>
This patch adds the middle-end representation for the
unsigned saturating add, i.e. it sets the result of the add to the
maximum value on overflow. It matches a pattern similar to the one below.
SAT_ADDU (x, y) => (x + y) | (-(TYPE)((TYPE)(x + y) < x))
Does this even try to work out the costs?
For example, with the following code
#define T __UINT16_TYPE__
T sat_add1 (T x, T y)
{
  return (x + y) | (- (T)((T)(x + y) < x));
}
T sat_add2 (T x, T y)
{
  T z = x + y;
  if (z < x)
    z = (T) -1;
  return z;
}
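For reference, a minimal sketch of how one might confirm on a host compiler that the two formulations above return the same value. It assumes uint16_t from <stdint.h> in place of __UINT16_TYPE__; the helper sat_add_mismatches and its step parameter are hypothetical names used only for this illustration and are not part of the patch.

```c
#include <stdint.h>

typedef uint16_t T;   /* assumption: stands in for __UINT16_TYPE__ */

/* Branchless form, as in the pattern from the patch description. */
static T sat_add1(T x, T y)
{
    return (x + y) | (-(T)((T)(x + y) < x));
}

/* Branchy form: clamp to the maximum value when the add wrapped. */
static T sat_add2(T x, T y)
{
    T z = x + y;
    if (z < x)
        z = (T)-1;
    return z;
}

/* Hypothetical helper: count the sampled (x, y) pairs on which the two
   forms disagree, visiting every step-th value of x and of y. */
static unsigned long sat_add_mismatches(unsigned step)
{
    unsigned long bad = 0;

    if (step == 0)
        step = 1;   /* avoid an endless loop on a zero stride */

    for (unsigned long x = 0; x <= 0xFFFF; x += step)
        for (unsigned long y = 0; y <= 0xFFFF; y += step)
            if (sat_add1((T)x, (T)y) != sat_add2((T)x, (T)y))
                bad++;
    return bad;
}
```

Calling sat_add_mismatches (1) walks all 2^32 pairs and takes a while; a larger stride gives a quick spot check. This only says the results agree, nothing about which form is cheaper on a given target.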
Compiling this with "avr-gcc -S -Os -dp", the code is
sat_add1:
add r22,r24 ; 7 [c=8 l=2] *addhi3/0
adc r23,r25
ldi r18,lo8(1) ; 8 [c=4 l=2] *movhi/4
ldi r19,0
cp r22,r24 ; 9 [c=8 l=2] cmphi3/2
cpc r23,r25
brlo .L2 ; 10 [c=16 l=1] branch
ldi r19,0 ; 31 [c=4 l=1] movqi_insn/0
ldi r18,0 ; 32 [c=4 l=1] movqi_insn/0
.L2:
clr r24 ; 13 [c=12 l=4] neghi2/1
clr r25
sub r24,r18
sbc r25,r19
or r24,r22 ; 29 [c=4 l=1] iorqi3/0
or r25,r23 ; 30 [c=4 l=1] iorqi3/0
ret ; 35 [c=0 l=1] return
sat_add2:
add r22,r24 ; 8 [c=8 l=2] *addhi3/0
adc r23,r25
cp r22,r24 ; 9 [c=8 l=2] cmphi3/2
cpc r23,r25
brsh .L3 ; 10 [c=16 l=1] branch
ldi r22,lo8(-1) ; 5 [c=4 l=2] *movhi/4
ldi r23,lo8(-1)
.L3:
mov r25,r23 ; 21 [c=4 l=1] movqi_insn/0
mov r24,r22 ; 22 [c=4 l=1] movqi_insn/0
ret ; 25 [c=0 l=1] return
i.e. the conditional jump is better than overly smart arithmetic
(smaller and faster code with less register pressure).
With larger types the difference is even more pronounced.
*on AVR. https://godbolt.org/z/7jaExbTa8 shows the branchless code is better.
And the branchy code will vectorize worse, if at all:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112600
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51492
AVR is a GCC backend
https://gcc.gnu.org/git/?p=gcc.git;a=tree;f=gcc/config/avr
and likely not the only backend where tricky arithmetic is more
expensive than branching more often than not.
Johann
But looking at that output it just seems like it's your expansion that's
inefficient.
But fair point, perhaps it should be just a normal DEF_INTERNAL_SIGNED_OPTAB_FN
so that we provide the additional optimization only for targets that want it.
Tamar