Akram Ahmad <akram.ah...@arm.com> writes:
> On 23/10/2024 12:20, Richard Sandiford wrote:
>> Thanks for doing this.  The approach looks good.  My main question is:
>> are we sure that we want to use the Advanced SIMD instructions for
>> signed saturating SI and DI arithmetic on GPRs?  E.g. for addition,
>> we only saturate at the negative limit if both operands are negative,
>> and only saturate at the positive limit if both operands are positive.
>> So for 32-bit values we can use:
>>
>>         asr     tmp, x or y, #31
>>         eor     tmp, tmp, #0x80000000
>>
>> to calculate the saturation value and:
>>
>>         adds    res, x, y
>>         csel    res, tmp, res, vs
>>
>> to calculate the full result.  That's the same number of instructions
>> as two fmovs for the inputs, the sqadd, and the fmov for the result,
>> but it should be more efficient.
>>
>> The reason for asking now, rather than treating it as a potential
>> future improvement, is that it would also avoid splitting the patterns
>> for signed and unsigned ops.  (The length of the split alternative can
>> be conservatively set to 16 even for the unsigned version, since
>> nothing should care in practice.  The split will have happened before
>> shorten_branches.)
>
> Hi Richard, thanks for looking over this.
>
> I might be misunderstanding your suggestion, but is there a way to
> efficiently check the signedness of the second operand (let's say 'y')
> if it is stored in a register?  This is a problem we considered and
> couldn't solve post-reload, as we only have three registers (including
> the two operands) to work with.  (I might be wrong about how many
> registers we have available.)  AFAIK that's why we only use adds, csinv
> / subs, csel in the unsigned case.
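[For reference, a minimal C model of the asr/eor + adds/csel idea
suggested above.  This is a sketch, not code from the patch: the
function name is made up, the eor constant is written as INT32_MAX so
that the model is self-consistent (non-negative operands saturate to
INT32_MAX, negative ones to INT32_MIN), and __builtin_add_overflow
stands in for the adds/csel pair.

  #include <stdint.h>

  /* Model of 32-bit signed saturating addition.  Signed overflow can
     only happen when both operands have the same sign, so the
     saturation value can be derived from the sign of either operand.  */
  static int32_t
  ssadd32_model (int32_t x, int32_t y)
  {
    /* x >> 31 is 0 for non-negative x and -1 for negative x (GCC
       defines >> on signed types as an arithmetic shift).  XORing
       with INT32_MAX maps that to INT32_MAX resp. INT32_MIN: the
       value an overflowing sum must saturate to.  This models the
       asr/eor pair.  */
    int32_t sat = (x >> 31) ^ INT32_MAX;

    /* __builtin_add_overflow models adds (it reports what the V flag
       would say) and the branch models csel: keep the plain sum
       unless it overflowed.  */
    int32_t sum;
    if (__builtin_add_overflow (x, y, &sum))
      return sat;
    return sum;
  }

For example, ssadd32_model (INT32_MAX, 1) yields INT32_MAX and
ssadd32_model (INT32_MIN, -1) yields INT32_MIN.]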
Ah, ok.  For post-reload splits, we would need to add:

  (clobber (match_operand:GPI 3 "scratch_operand"))

then use "X" as the constraint for the Advanced SIMD alternative and
"&r" as the constraint in the GPR alternative.  But I suppose that also
sinks my dream of a unified pattern, since the unsigned case wouldn't
need the extra operand.

In both cases (signed and unsigned), the pattern should clobber
CC_REGNUM, since the split changes the flags.

> [...]
>>> diff --git a/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect_1.c b/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect_1.c
>>> new file mode 100644
>>> index 00000000000..63eb21e438b
>>> --- /dev/null
>>> +++ b/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect_1.c
>>> @@ -0,0 +1,79 @@
>>> +/* { dg-do assemble { target { aarch64*-*-* } } } */
>>> +/* { dg-options "-O2 --save-temps -ftree-vectorize" } */
>>> +/* { dg-final { check-function-bodies "**" "" "-DCHECK_ASM" } } */
>>> +
>>> +/*
>>> +** uadd_lane: { xfail *-*-* }
>>
>> Just curious: why does this fail?  Is it a vector costing issue?
>
> This is due to a missing pattern in match.pd; I've sent another patch
> upstream to rectify this.  In essence, this function exposes a
> commutative form of an existing addition pattern, but that form isn't
> currently treated as commutative when it should be.  The uqsubs are
> marked as xfail for a similar reason, and that same patch series
> contains a fix for the uqsub case too.

Ah, ok, thanks.

Richard
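[Likewise, a rough sketch of the single-pattern shape described above,
restricted to DImode for brevity.  It is untested and hypothetical, not
the actual patch: it uses match_scratch for the clobbered scratch, with
"X" in the Advanced SIMD alternative and "&r" in the GPR alternative,
clobbers CC_REGNUM as suggested, and delegates the split body to a
hypothetical helper aarch64_emit_signed_sat_add standing in for the
real asr/eor + adds/csel expansion.

  ;; Sketch only: signed saturating DImode addition with an Advanced
  ;; SIMD alternative (sqadd) and a GPR alternative that splits after
  ;; reload.  The scratch is unused ("X") in the SIMD alternative and
  ;; an earlyclobber GPR ("&r") in the split alternative.
  (define_insn_and_split "ssadddi3"
    [(set (match_operand:DI 0 "register_operand" "=w,r")
          (ss_plus:DI (match_operand:DI 1 "register_operand" "w,r")
                      (match_operand:DI 2 "register_operand" "w,r")))
     (clobber (match_scratch:DI 3 "=X,&r"))
     (clobber (reg:CC CC_REGNUM))]
    ""
    "@
     sqadd\t%d0, %d1, %d2
     #"
    "&& reload_completed && GP_REGNUM_P (REGNO (operands[0]))"
    [(const_int 0)]
    {
      /* Hypothetical helper: would emit the asr/eor pair to build the
         saturation value in operands[3], then adds + csel (hence the
         CC clobber) to select it on overflow.  */
      aarch64_emit_signed_sat_add (operands);
      DONE;
    }
  )

The unsigned pattern would have the same shape minus the scratch, which
is why a single unified pattern no longer falls out naturally.]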