Hi all,

I'm looking into implementing the usad/ssad optabs for aarch64 to catch code like in PR 85693, and I'm a bit lost as to what the midend expects the optabs to produce. The documentation for them says that the addend operand (op 3) is of mode equal or wider than the mode of the product (and consequently of operands 1 and 2), with the result operand 0 being the same mode as operand 3.
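For concreteness, here is a minimal sketch of the kind of scalar loop involved. It is illustrative only and not copied from the PR; the name sad_u8 is made up. The inputs are 8-bit and the accumulator is 32-bit, which is how I read the "equal or wider" wording for operand 3.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not taken from PR 85693: sum of absolute
   differences over two byte arrays.  The inputs are 8 bits wide,
   the accumulator is 32 bits wide.  */
uint32_t
sad_u8 (const uint8_t *a, const uint8_t *b, size_t n)
{
  assert (n == 0 || (a != NULL && b != NULL));

  uint32_t sum = 0;
  for (size_t i = 0; i < n; i++)
    {
      /* Promote to int so the subtraction cannot wrap.  */
      int diff = (int) a[i] - (int) b[i];
      sum += (uint32_t) (diff < 0 ? -diff : diff);
    }
  return sum;
}
```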
The x86 implementation for usadv16qi (for example) takes a V16QI vector and returns a V4SI vector. I'm confused as to what reduction logic the midend expects.

The PSADBW instruction that x86 uses in that case accumulates the two V8QI halves of the input into two 16-bit values (you don't need any more bits to represent a sum of 8 byte differences, I believe): one placed at bit 0, and the other placed at bit 64. The bit ranges [16-63] and [80-127] are left as zeroes. So it produces a V2DI result in essence.

If the input V16QI vectors look like:

{ a0, a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, a14, a15 }
{ b0, b1, b2, b3, b4, b5, b6, b7, b8, b9, b10, b11, b12, b13, b14, b15 }

then the result V4SI view (before being added into operand 3) is:

{ SUM (ABS (a[0-7] - b[0-7])), 0, SUM (ABS (a[8-15] - b[8-15])), 0 }   (1)

whereas a normal widening reduction of V16QI -> V4SI to me would look more like:

{ SUM (ABS (a[0-3] - b[0-3])), SUM (ABS (a[4-7] - b[4-7])), SUM (ABS (a[8-11] - b[8-11])), SUM (ABS (a[12-15] - b[12-15])) }   (2)

My question is: does the vectoriser depend on the semantics of [us]sad producing the result in (1)? If so, do you think it's worth clarifying in the documentation?

Thanks,
Kyrill
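
P.S. To make the difference between (1) and (2) concrete, here is a scalar C model of the two layouts. It is only a sketch: the function names are made up, and sad_layout1 models the PSADBW behaviour as described above rather than anything in the GCC sources.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Absolute difference of two bytes, widened to 32 bits.  */
static uint32_t
abs_diff (uint8_t x, uint8_t y)
{
  return x > y ? (uint32_t) (x - y) : (uint32_t) (y - x);
}

/* Layout (1): each 8-byte half is summed into one lane
   (lanes 0 and 2 of the V4SI view); lanes 1 and 3 stay zero.  */
void
sad_layout1 (const uint8_t a[16], const uint8_t b[16], uint32_t out[4])
{
  assert (a != NULL && b != NULL && out != NULL);
  out[0] = out[1] = out[2] = out[3] = 0;
  for (size_t i = 0; i < 16; i++)
    out[(i / 8) * 2] += abs_diff (a[i], b[i]);
}

/* Layout (2): each group of 4 bytes is summed into its own lane.  */
void
sad_layout2 (const uint8_t a[16], const uint8_t b[16], uint32_t out[4])
{
  assert (a != NULL && b != NULL && out != NULL);
  out[0] = out[1] = out[2] = out[3] = 0;
  for (size_t i = 0; i < 16; i++)
    out[i / 4] += abs_diff (a[i], b[i]);
}

/* Full reduction of a 4-lane result to a scalar.  */
uint32_t
reduce_v4si (const uint32_t v[4])
{
  return v[0] + v[1] + v[2] + v[3];
}
```

Both functions give the same total once the four lanes are added together with reduce_v4si; they differ only in how the partial sums are distributed across the lanes.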