Spencer Abson <spencer.ab...@arm.com> writes:
> This series incrementally adds support for operations on unpacked vectors
> of floating-point values.  By "unpacked", we're referring to the in-register
> layout of partial SVE vector modes.  For example, the elements of a VNx4HF
> are stored as:
>
> ... | X | HF | X | HF | X | HF | X | HF |
>
> Where 'X' denotes the undefined upper half of the 32-bit container that each
> 16-bit value is stored in.  This padding must not affect the operation's
> behaviour, so it must not be interpreted if the operation may trap.
>
> The series is organised as follows:
>       * NFCs to iterators.md that lay the groundwork for the rest of the
>       series.
>       * Unpacked conversions, in which a solution to the issue described
>       above is given.
>       * Unpacked comparisons, which are slightly less trivial than...
>       * Unpacked unary/binary/ternary operations, each of which is broken
>       down into:
>               * Defining the unconditional expansion
>               * Supporting OP/UNSPEC_SEL combiner patterns under
>               SVE_RELAXED_GP
>               * Defining the conditional expander (if applicable)
>
> This allows each change to aarch64-sve.md to be tested; once the conditional
> expander for an operation is defined, the rules in match.pd canonicalize any
> occurrence of that operation combined with a VEC_COND_EXPR into these
> conditional forms, which would make the SVE_RELAXED_GP patterns dead at trunk.
> I’ve taken this approach because I believe it’s valuable to have these
> patterns to fall back on.
>
> Notes on code generation under -ftrapping-math:
>
> 1) In the example below, we're currently unable to remove (1) in favour of
> (2).
>
> ptrue   p6.b, all   (1)
> ptrue   p7.d, all   (2)
> ld1w    z30.d, p6/z, [x1]
> ld1w    z29.d, p6/z, [x3]
> fsub    z30.s, p7/m, z30.s, #1.0
>
> In the expanded RTL, the predicate source of the LD1Ws is a
> (subreg:VNx2BI (reg:VNx16BI 111) 0), where every bit of 111 is a 1.  The
> predicate source of the FSUB is a (subreg:VNx4BI (reg:VNx16BI 112) 0),
> where every 8th bit of 112 is a 1, and the rest are 0.

Interesting.  I imagine that could be a common problem, so perhaps for
the unpacked FP modes we should use the stricter ptrue even for loads
and stores, at least if flag_trapping_math.

On the other hand, we don't often run out of predicate registers,
and in more complex cases there might be other uses of the .b ptrue,
so it might not be worth it.

> 2) The AND emitted by the conditional expander typically follows a CMP<CC>
> operation, where it is trivially redundant.
>
> cmpne   p5.d, p7/z, z0.d, #0
> ptrue   p6.d, vl32
> and     p6.b, p6/z, p5.b, p5.b
>
> The fold we need here is slightly different from what the existing
> *cmp<cmp_op><mode>_and splitting patterns achieve, in that we don’t need to
> replace p7 with p6 to make the AND redundant.
>
> The AND in this case has the structure:
>
> (set (reg:VNx4BI 113)
>     (and (subreg:VNx4BI (reg:VNx16BI 111) 0)
>          (subreg:VNx4BI (reg:VNx2BI 112) 0)))
>
> This problem feels somewhat related to how we might handle
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118151.

Yeah, I agree it looks like there would be some overlap.

Thanks,
Richard
