On May 9, 2018 6:19:47 PM GMT+02:00, Kyrill Tkachov 
<kyrylo.tkac...@foss.arm.com> wrote:
>Hi all,
>
>I'm looking into implementing the usad/ssad optabs for aarch64 to catch
>code like in PR 85693, and I'm a bit lost with what the midend expects
>the optabs to produce. The documentation for them says that the addend
>operand (op 3) is of a mode equal to or wider than the mode of the
>product (and consequently of operands 1 and 2), with the result
>operand 0 being the same mode as operand 3.
>
>The x86 implementation for usadv16qi (for example) takes a V16QI vector
>and returns a V4SI vector.
>I'm confused as to what reduction logic the midend expects.
>The PSADBW instruction that x86 uses in that case accumulates the two
>V8QI halves of the input into two 16-bit values (you don't need any
>more bits to represent a sum of 8 byte differences, I believe):
>one placed at bit 0, and the other placed at bit 64. The bit ranges
>[16 - 63] and [80 - 127] are left as zeroes.
>So it produces a V2DI result in essence.
>
>If the input V16QI vectors look like:
>{ a0, a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, a14, a15 }
>{ b0, b1, b2, b3, b4, b5, b6, b7, b8, b9, b10, b11, b12, b13, b14, b15 }
>
>then the result V4SI view (before being added into operand 3) is:
>{ SUM (ABS (a[0-7] - b[0-7])), 0, SUM (ABS (a[8-15] - b[8-15])), 0 }  (1)
>
>whereas a normal widening reduction of V16QI -> V4SI to me would look
>more like:
>
>{ SUM (ABS (a[0-3] - b[0-3])), SUM (ABS (a[4-7] - b[4-7])),
>  SUM (ABS (a[8-11] - b[8-11])), SUM (ABS (a[12-15] - b[12-15])) }  (2)
>
>My question is, does the vectoriser depend on the semantics of [us]sad
>producing the result in (1)?

No, it doesn't. It is required that any association of the embedded reduction 
is correct, and thus this requires appropriate -ffast-math flags. Note it's 
also the reason why we do not implement constant folding of SAD. 
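
To make the association point concrete, here is a minimal scalar sketch in C
of the two lane layouts discussed above. It is only an illustration, not GCC
or target code: the function names (sad_layout1, sad_layout2, reduce_lanes,
check_layouts_agree) are hypothetical, and it assumes the consumer of the
V4SI result only ever sums all lanes at the end, so that the placement of the
partial sums does not matter.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define NBYTES 16   /* elements in a V16QI input */
#define NLANES 4    /* elements in the V4SI result view */

/* Layout (1), PSADBW-style: each 8-byte half is summed into one lane.
   The sums land in lanes 0 and 2; lanes 1 and 3 stay zero.  */
void sad_layout1(const uint8_t *a, const uint8_t *b, uint32_t out[NLANES])
{
  for (int i = 0; i < NLANES; i++)
    out[i] = 0;
  for (int i = 0; i < NBYTES; i++)
    out[(i / 8) * 2] += (uint32_t) abs((int) a[i] - (int) b[i]);
}

/* Layout (2): each group of four bytes is summed into its own lane.  */
void sad_layout2(const uint8_t *a, const uint8_t *b, uint32_t out[NLANES])
{
  for (int i = 0; i < NLANES; i++)
    out[i] = 0;
  for (int i = 0; i < NBYTES; i++)
    out[i / 4] += (uint32_t) abs((int) a[i] - (int) b[i]);
}

/* Assumed final step: the partial sums are added across all lanes.  */
uint32_t reduce_lanes(const uint32_t v[NLANES])
{
  uint32_t sum = 0;
  for (int i = 0; i < NLANES; i++)
    sum += v[i];
  return sum;
}

/* Checks that both layouts give the same scalar total and returns it.  */
uint32_t check_layouts_agree(const uint8_t *a, const uint8_t *b)
{
  uint32_t r1[NLANES], r2[NLANES];
  sad_layout1(a, b, r1);
  sad_layout2(a, b, r2);
  assert(r1[1] == 0 && r1[3] == 0);              /* zero lanes in (1) */
  assert(reduce_lanes(r1) == reduce_lanes(r2));  /* same final sum */
  return reduce_lanes(r1);
}
```

The individual lanes differ between the two layouts, but the lane sum is the
same, which is all the sketch relies on.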

>If so, do you think it's worth clarifying in the documentation?

Probably yes, but I'm not sure the current state of affairs is best... Do 
other targets implement the same reduction order as x86? Other similar 
reduction ops have high/low or even/odd variants. But they also do not reduce 
the outputs. 

Note DOT_PROD has the very same issue.

Richard. 

>Thanks,
>Kyrill
