Hi Richard,
On 09/05/18 19:37, Richard Biener wrote:
On May 9, 2018 6:19:47 PM GMT+02:00, Kyrill Tkachov <kyrylo.tkac...@foss.arm.com> wrote:
Hi all,
I'm looking into implementing the usad/ssad optabs for aarch64 to catch code like in PR 85693, and I'm a bit unsure about what the midend expects the optabs to produce. The documentation for them says that the addend operand (operand 3) is of a mode equal to or wider than the mode of the product (and consequently of operands 1 and 2), with the result operand 0 being the same mode as operand 3.
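For concreteness, the kind of source loop I'd like the vectoriser to handle is something along these lines (illustrative only, not necessarily the exact testcase in the PR):

#include <stdlib.h>

/* Sum of absolute differences of two unsigned char arrays,
   accumulated into a wider (int) result.  */
int
sum_abs_diff (const unsigned char *a, const unsigned char *b, int n)
{
  int sum = 0;
  for (int i = 0; i < n; i++)
    sum += abs (a[i] - b[i]);
  return sum;
}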
The x86 implementation of usadv16qi (for example) takes V16QI input vectors and returns a V4SI vector. I'm confused as to what reduction logic the midend expects here. The PSADBW instruction that x86 uses in that case accumulates the two V8QI halves of the input into two 16-bit values (you don't need any more bits to represent a sum of eight byte differences, I believe): one placed at bit 0 and the other placed at bit 64. The bit ranges [16-63] and [80-127] are left as zeroes, so in essence it produces a V2DI result.
If the input V16QI vectors look like:
{ a0, a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, a14, a15 }
{ b0, b1, b2, b3, b4, b5, b6, b7, b8, b9, b10, b11, b12, b13, b14, b15 }
then the result V4SI view (before being added into operand 3) is:
{ SUM (ABS (a[0-7] - b[0-7])), 0, SUM (ABS (a[8-15] - b[8-15])), 0 }   (1)
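Written out as plain C on the V4SI view (just to illustrate, ignoring the addend operand; a[] and b[] are the sixteen input bytes):

/* Scalar model of variant (1): lane 0 accumulates the low eight
   byte differences, lane 2 the high eight; lanes 1 and 3 stay zero.  */
unsigned int res[4] = { 0, 0, 0, 0 };
for (int i = 0; i < 8; i++)
  res[0] += a[i] > b[i] ? a[i] - b[i] : b[i] - a[i];
for (int i = 8; i < 16; i++)
  res[2] += a[i] > b[i] ? a[i] - b[i] : b[i] - a[i];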
A normal widening reduction of V16QI -> V4SI, on the other hand, would to my mind look more like:
{ SUM (ABS (a[0-3] - b[0-3])), SUM (ABS (a[4-7] - b[4-7])), SUM (ABS (a[8-11] - b[8-11])), SUM (ABS (a[12-15] - b[12-15])) }   (2)
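Or, in the same scalar terms:

/* Scalar model of variant (2): each SImode lane accumulates the
   absolute differences of its own four QImode lanes.  */
unsigned int res[4] = { 0, 0, 0, 0 };
for (int i = 0; i < 16; i++)
  res[i / 4] += a[i] > b[i] ? a[i] - b[i] : b[i] - a[i];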
My question is: does the vectoriser depend on the [us]sad semantics producing the result in (1)?
No, it doesn't. It is required that any association of the embedded reduction is correct, and thus this requires the appropriate -ffast-math flags. Note it's also the reason why we do not implement constant folding of SAD.
At the moment I'm looking at the integer modes, so I guess reassociation and -ffast-math don't come into play, but I'll keep that in mind.
If so, do you think it's worth clarifying in the documentation?
Probably yes - but I'm not sure the current state of affairs is best... Do other targets implement the same reduction order as x86? Other similar reduction ops have high/low or even/odd variants, but they also do not reduce the outputs.
AFAICS only x86 and powerpc implement this so far. The powerpc implementation synthesises the V16QI -> V4SI reduction using multiple instructions, and the result it produces is variant (2) in my original post. So the two ports differ.
From a purely target implementation perspective it is convenient not to impose any particular reduction strategy. If we say that the only requirement on the [us]sad optabs is that the result vector be suitable for a full V4SI -> SI reduction, without relying on any particular per-lane distribution, then each target can provide its optimal sequence.
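In other words, the only thing the loop epilogue would do with the accumulator is a full lane-wise reduction, something like:

/* Final reduction in the loop epilogue: all of the per-lane splits
   above sum to the same total.  */
unsigned int sad = res[0] + res[1] + res[2] + res[3];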
For example, an aarch64 implementation I'm experimenting with now would compute the V16QI -> V16QI absolute-differences vector, reduce that into a single HImode value (there is a full widening reduction instruction on aarch64 for that), and then do a widening add of that value into element zero of the result V4SI vector. Following the notation above, from:
{ a0, a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, a14, a15 }
{ b0, b1, b2, b3, b4, b5, b6, b7, b8, b9, b10, b11, b12, b13, b14, b15 }
this would produce the V4SI result:
{ SUM (ABS (a[0-15] - b[0-15])), 0, 0, 0 }
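In intrinsics terms the sequence I have in mind is roughly the following (an illustration only; the optab expander would of course emit the corresponding RTL, and the widening add into lane 0 is modelled here with a plain scalar add):

#include <arm_neon.h>

/* Sketch: UABD to get the V16QI absolute differences, UADDLV to
   reduce them to a single HImode value, then add that value into
   lane 0 of the V4SI accumulator.  */
uint32x4_t
usad_sketch (uint8x16_t a, uint8x16_t b, uint32x4_t acc)
{
  uint8x16_t absd = vabdq_u8 (a, b);    /* UABD  */
  uint16_t sum = vaddlvq_u8 (absd);     /* UADDLV  */
  return vsetq_lane_u32 (vgetq_lane_u32 (acc, 0) + sum, acc, 0);
}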
Matching the x86 or powerpc strategy would require a more costly sequence on aarch64, but of course the cheaper sequence would only be safe if we had some guarantee that the midend won't rely on any particular reduction strategy and will just treat the result as a vector on which to perform a full reduction at the end of the loop.
Thanks,
Kyrill
Note DOT_PROD has the very same issue.
Richard.
Thanks,
Kyrill