On May 10, 2018 10:53:19 AM GMT+02:00, Kyrill Tkachov <kyrylo.tkac...@foss.arm.com> wrote:
>Hi Richard,
>
>On 09/05/18 19:37, Richard Biener wrote:
>> On May 9, 2018 6:19:47 PM GMT+02:00, Kyrill Tkachov <kyrylo.tkac...@foss.arm.com> wrote:
>>> Hi all,
>>>
>>> I'm looking into implementing the usad/ssad optabs for aarch64 to
>>> catch code like in PR 85693
>>> and I'm a bit lost with what the midend expects the optabs to produce.
>>> The documentation for them says that the addend operand (op 3) is of
>>> mode equal or wider than the mode of the product (and consequently of
>>> operands 1 and 2) with the result operand 0 being the same mode as
>>> operand 3.
>>>
>>> The x86 implementation for usadv16qi (for example) takes a V16QI
>>> vector and returns a V4SI vector.
>>> I'm confused as to what is the reduction logic expected by the midend?
>>> The PSADBW instruction that x86 uses in that case accumulates the two
>>> V8QI halves of the input into two 16-bit values (you don't need any
>>> more bits to represent a sum of 8 byte differences I believe):
>>> one placed at bit 0, and the other placed at bit 64. The bit ranges
>>> [16 - 63] and [80 - 127] are left as zeroes.
>>> So it produces a V2DI result in essence.
>>>
>>> If the input V16QI vectors look like:
>>> { a0, a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, a14, a15 }
>>> { b0, b1, b2, b3, b4, b5, b6, b7, b8, b9, b10, b11, b12, b13, b14, b15 }
>>>
>>> then the result V4SI view (before being added into operand 3) is:
>>> { SUM (ABS (a[0-7] - b[0-7])), 0, SUM (ABS (a[8-15] - b[8-15])), 0 }   (1)
>>>
>>> whereas a normal widening reduction of V16QI -> V4SI to me would look
>>> more like:
>>>
>>> { SUM (ABS (a[0-3] - b[0-3])), SUM (ABS (a[4-7] - b[4-7])),
>>>   SUM (ABS (a[8-11] - b[8-11])), SUM (ABS (a[12-15] - b[12-15])) }   (2)
>>>
>>> My question is, does the vectoriser depend on the semantics of
>>> [us]sad producing the result in (1)?
>> No, it doesn't. It is required that any association of the embedded
>> reduction is correct and thus this requires appropriate -ffast-math
>> flags. Note it's also the reason why we do not implement constant
>> folding of SAD.
>
>At the moment I'm looking at the integer modes, so I guess
>reassociation and -ffast-math doesn't come into play, but I'll keep
>that in mind.
>
>>> If so, do you think it's worth clarifying in the documentation?
>> Probably yes - but I'm not sure the current state of affairs is
>> best... Do other targets implement the same reduction order as x86?
>> Other similar reduction ops have high/low or even/odd variants. But
>> they also do not reduce the outputs.
>
>AFAICS only x86 and powerpc implement this so far. The powerpc
>implementation synthesises the V16QI -> V4SI reduction using multiple
>instructions.
>The result it produces is variant (2) in my original post. So the two
>ports differ.
>
>From a purely target implementation perspective it is convenient to not
>impose any particular reduction strategy.
>If we say that the only requirement from the [us]sad optabs is that the
>result vector should be suitable for a full V4SI -> SI reduction
>but not rely on any particular approach, then each target can provide
>its optimal sequence.
>
>For example, an aarch64 implementation I'm experimenting with now would
>compute the V16QI -> V16QI absolute differences vector,
>reduce that into a single HImode value (there is a full widening
>reduction instruction in aarch64 for that) and then do a widening add
>of that value into element zero of the result V4SI vector.
>Following the notation above, this would produce from:
>
>{ a0, a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, a14, a15 }
>{ b0, b1, b2, b3, b4, b5, b6, b7, b8, b9, b10, b11, b12, b13, b14, b15 }
>
>the V4SI result:
>
>{ SUM (ABS (a[0-15] - b[0-15])), 0, 0, 0 }
>
>Matching the x86 or powerpc strategy would require a more costly
>sequence on aarch64, but of course this would only be
>safe if we had some guarantees that the midend won't rely on any
>particular reduction strategy and just treat it as a vector
>on which to perform a full reduction at the end of a loop.

OK, sounds reasonable. BTW, in other context I needed a very specific
reduction order because the result was not used in a reduction. For
that purpose we'd then need different optabs.

Richard.

>Thanks,
>Kyrill
>
>> Note DOT_PROD has the very same issue.
>>
>> Richard.
>>
>>> Thanks,
>>> Kyrill
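
To make the three lane layouts discussed in the thread concrete, here is a small scalar model in C. It is only a sketch: the names usad_layout1, usad_layout2, usad_layout3 and reduce_v4si are made up for illustration and are not GCC interfaces or optab names. Each function models one way of folding the 16 byte-wise absolute differences into a V4SI accumulator (operand 3), following (1), (2) and the aarch64 proposal above. For integer modes all three give the same value after a full V4SI -> SI reduction, which is the only property the proposal would ask the midend to rely on.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar models for illustration; not GCC code. */

/* Unsigned absolute difference of two bytes, widened to 32 bits. */
uint32_t abs_diff(uint8_t a, uint8_t b)
{
    return a > b ? (uint32_t)(a - b) : (uint32_t)(b - a);
}

/* Layout (1), x86 PSADBW-like: bytes 0-7 go to lane 0, bytes 8-15 to
   lane 2; lanes 1 and 3 only carry the accumulator through. */
void usad_layout1(const uint8_t a[16], const uint8_t b[16],
                  const uint32_t acc[4], uint32_t out[4])
{
    for (int i = 0; i < 4; i++)
        out[i] = acc[i];
    for (int i = 0; i < 16; i++)
        out[(i / 8) * 2] += abs_diff(a[i], b[i]);
}

/* Layout (2), powerpc-like: each lane sums four consecutive bytes. */
void usad_layout2(const uint8_t a[16], const uint8_t b[16],
                  const uint32_t acc[4], uint32_t out[4])
{
    for (int i = 0; i < 4; i++)
        out[i] = acc[i];
    for (int i = 0; i < 16; i++)
        out[i / 4] += abs_diff(a[i], b[i]);
}

/* Layout (3), the aarch64 idea: all 16 differences go to lane 0. */
void usad_layout3(const uint8_t a[16], const uint8_t b[16],
                  const uint32_t acc[4], uint32_t out[4])
{
    for (int i = 0; i < 4; i++)
        out[i] = acc[i];
    for (int i = 0; i < 16; i++)
        out[0] += abs_diff(a[i], b[i]);
}

/* Full V4SI -> SI reduction, as done at the end of a vectorised loop. */
uint32_t reduce_v4si(const uint32_t v[4])
{
    return v[0] + v[1] + v[2] + v[3];
}
```

Calling each of the three functions on the same inputs and accumulator and then applying reduce_v4si gives equal sums, while the individual lanes differ. Code that consumes the lanes without a final reduction would see the difference, which is the case Richard mentions as needing different optabs.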