Hi all,

I'm looking into implementing the usad/ssad optabs for aarch64 to catch code
like that in PR 85693, and I'm a bit lost as to what the midend expects the
optabs to produce.
The documentation for them says that the addend operand (op 3) is of a mode
equal to or wider than the mode of the absolute difference (and consequently
of operands 1 and 2), with the result operand 0 being the same mode as
operand 3.

The x86 implementation of usadv16qi (for example) takes V16QI vectors and
returns a V4SI vector.
What I'm confused about is the reduction logic that the midend expects.
The PSADBW instruction that x86 uses in that case accumulates the two V8QI
halves of the input into two 16-bit values (a sum of 8 byte differences is at
most 8 * 255 = 2040, so no more bits are needed): one placed at bit 0, and
the other placed at bit 64. The bit ranges [16 - 63] and [80 - 127] are left
as zeroes.
So in essence it produces a V2DI result.
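
To make the PSADBW behaviour described above concrete, here is a rough scalar
model in C. It is only a sketch of my reading of the instruction, not code
taken from GCC or from the x86 backend; the function name psadbw_model is made
up for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical scalar model of a 128-bit PSADBW: each 8-byte half of the
   inputs is reduced to one sum of absolute differences, which is stored
   zero-extended in its own 64-bit lane.  */
void
psadbw_model (const uint8_t a[16], const uint8_t b[16], uint64_t out[2])
{
  for (int half = 0; half < 2; half++)
    {
      uint64_t sum = 0;
      for (int i = 0; i < 8; i++)
        sum += (uint64_t) abs ((int) a[half * 8 + i] - (int) b[half * 8 + i]);
      /* Eight byte differences sum to at most 8 * 255 = 2040, so only the
         low 16 bits of each lane can ever be nonzero.  */
      assert (sum <= 2040);
      out[half] = sum;
    }
}
```

Viewed as V2DI, out[0] holds the sum for bytes 0-7 and out[1] the sum for
bytes 8-15.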

If the input V16QI vectors look like:
{ a0, a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, a14, a15 }
{ b0, b1, b2, b3, b4, b5, b6, b7, b8, b9, b10, b11, b12, b13, b14, b15 }

then the V4SI view of the result (before being added into operand 3) is:
{ SUM (ABS (a[0-7] - b[0-7])), 0, SUM (ABS (a[8-15] - b[8-15])), 0 }   (1)

whereas a normal widening reduction of V16QI -> V4SI to me would look more like:

{ SUM (ABS (a[0-3] - b[0-3])), SUM (ABS (a[4-7] - b[4-7])),
  SUM (ABS (a[8-11] - b[8-11])), SUM (ABS (a[12-15] - b[12-15])) }   (2)
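
For reference, the two candidate layouts can be written out as scalar C. This
is just an illustrative sketch: the names sad_layout1, sad_layout2,
reduce_lanes and check_same_total are hypothetical, and the addition into
operand 3 is left out. The only thing it shows is that the lanes differ
between (1) and (2) while the sum across all four lanes is the same; whether
the vectoriser only ever looks at that final sum is exactly what I don't know.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Layout (1): two 8-byte sums in lanes 0 and 2, zeroes in lanes 1 and 3.  */
void
sad_layout1 (const uint8_t a[16], const uint8_t b[16], uint32_t out[4])
{
  for (int lane = 0; lane < 4; lane++)
    out[lane] = 0;
  for (int i = 0; i < 16; i++)
    out[(i / 8) * 2] += (uint32_t) abs ((int) a[i] - (int) b[i]);
}

/* Layout (2): one 4-byte sum in each of the four lanes.  */
void
sad_layout2 (const uint8_t a[16], const uint8_t b[16], uint32_t out[4])
{
  for (int lane = 0; lane < 4; lane++)
    out[lane] = 0;
  for (int i = 0; i < 16; i++)
    out[i / 4] += (uint32_t) abs ((int) a[i] - (int) b[i]);
}

/* Horizontal sum of a four-lane vector.  */
uint32_t
reduce_lanes (const uint32_t v[4])
{
  return v[0] + v[1] + v[2] + v[3];
}

/* Both layouts partition the same 16 differences, so their lane totals
   must agree for any input.  */
void
check_same_total (const uint8_t a[16], const uint8_t b[16])
{
  uint32_t v1[4], v2[4];
  sad_layout1 (a, b, v1);
  sad_layout2 (a, b, v2);
  assert (reduce_lanes (v1) == reduce_lanes (v2));
}
```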

My question is: does the vectoriser depend on [us]sad producing the result in
the form of (1)?
If so, do you think it's worth clarifying in the documentation?

Thanks,
Kyrill
