On May 9, 2018 6:19:47 PM GMT+02:00, Kyrill Tkachov <kyrylo.tkac...@foss.arm.com> wrote:
>Hi all,
>
>I'm looking into implementing the usad/ssad optabs for aarch64 to catch
>code like in PR 85693
>and I'm a bit lost with what the midend expects the optabs to produce.
>The documentation for them says that the addend operand (op 3) is of
>mode equal or wider than
>the mode of the product (and consequently of operands 1 and 2) with the
>result operand 0 being
>the same mode as operand 3.
>
>The x86 implementation for usadv16qi (for example) takes a V16QI vector
>and returns a V4SI vector.
>I'm confused as to what is the reduction logic expected by the midend?
>The PSADBW instruction that x86 uses in that case accumulates the two
>V8QI halves of the input into
>two 16-bit values (you don't need any more bits to represent a sum of 8
>byte differences I believe):
>one placed at bit 0, and the other placed at bit 64. The bit ranges
>[16 - 63] and [80 - 127] are left as zeroes.
>So it produces a V2DI result in essence.
>
>If the input V16QI vectors look like:
>{ a0, a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, a14, a15 }
>{ b0, b1, b2, b3, b4, b5, b6, b7, b8, b9, b10, b11, b12, b13, b14, b15 }
>
>then the result V4SI view (before being added into operand 3) is:
>{ SUM (ABS (a[0-7] - b[0-7])), 0, SUM (ABS (a[8-15] - b[8-15])), 0 }   (1)
>
>whereas a normal widening reduction of V16QI -> V4SI to me would look
>more like:
>
>{ SUM (ABS (a[0-3] - b[0-3])), SUM (ABS (a[4-7] - b[4-7])),
>  SUM (ABS (a[8-11] - b[8-11])), SUM (ABS (a[12-15] - b[12-15])) }   (2)
>
>My question is, does the vectoriser depend on the semantics of [us]sad
>producing the result in (1)?

No, it doesn't. It is required that any association of the embedded reduction is correct, and thus this requires appropriate -ffast-math flags. Note it's also the reason why we do not implement constant folding of SAD.

>If so, do you think it's worth clarifying in the documentation?

Probably yes - but I'm not sure the current state of affairs is best... Do other targets implement the same reduction order as x86? Other similar reduction ops have high/low or even/odd variants. But they also do not reduce the outputs. Note DOT_PROD has the very same issue.

Richard.

>Thanks,
>Kyrill
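
To make the point about association concrete, here is a minimal scalar C sketch of the two lane layouts discussed above. It is not GCC code and does not model the optab itself: the function names (sad_layout1, sad_layout2, hsum4, sad_scalar, check_sad_layouts) are made up for illustration, and it assumes the only thing a consumer relies on is the sum over all four 32-bit lanes. Under that assumption, layouts (1) and (2) are interchangeable, because both reduce to the same total.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Layout (1): PSADBW-like. Each 8-byte half is summed into one lane
   (lanes 0 and 2); lanes 1 and 3 stay zero. */
void sad_layout1(const uint8_t a[16], const uint8_t b[16], uint32_t out[4])
{
    out[0] = out[1] = out[2] = out[3] = 0;
    for (int i = 0; i < 16; i++)
        out[(i / 8) * 2] += (uint32_t)abs((int)a[i] - (int)b[i]);
}

/* Layout (2): each group of four bytes is summed into its own lane. */
void sad_layout2(const uint8_t a[16], const uint8_t b[16], uint32_t out[4])
{
    out[0] = out[1] = out[2] = out[3] = 0;
    for (int i = 0; i < 16; i++)
        out[i / 4] += (uint32_t)abs((int)a[i] - (int)b[i]);
}

/* Final horizontal reduction over the four 32-bit lanes. */
uint32_t hsum4(const uint32_t v[4])
{
    return v[0] + v[1] + v[2] + v[3];
}

/* Scalar reference: plain sum of absolute differences over 16 bytes. */
uint32_t sad_scalar(const uint8_t a[16], const uint8_t b[16])
{
    uint32_t sum = 0;
    for (int i = 0; i < 16; i++)
        sum += (uint32_t)abs((int)a[i] - (int)b[i]);
    return sum;
}

/* Both layouts must agree with the scalar reference once all lanes
   are added together; the per-lane placement does not matter. */
void check_sad_layouts(const uint8_t a[16], const uint8_t b[16])
{
    uint32_t r1[4], r2[4];
    sad_layout1(a, b, r1);
    sad_layout2(a, b, r2);
    assert(hsum4(r1) == sad_scalar(a, b));
    assert(hsum4(r2) == sad_scalar(a, b));
}
```

The per-lane values differ between the two layouts (lanes 1 and 3 are zero in the first), so anything that inspected individual lanes before the final reduction would see a difference; only the total is layout-independent in this sketch.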