Re: [SVE] Support for variable-sized machine modes

Richard Biener Thu, 17 Nov 2016 02:25:42 -0800

On Fri, Nov 11, 2016 at 6:50 PM, Richard Sandiford
<richard.sandif...@arm.com> wrote:
> As described in the covering note, one of the big differences for SVE is that
> things like mode sizes, offsets, and numbers of vector elements can depend
> on a runtime parameter.  This message describes how the SVE patches handle
> that and how they deal with vector constants in which the number of elements
> isn't fixed at compile time.
>
>
> Mode sizes and numbers of elements
> ==================================
>
> Having runtime mode sizes and numbers of elements means for example that:
>
>   GET_MODE_SIZE
>   GET_MODE_BITSIZE
>   GET_MODE_PRECISION
>   GET_MODE_NUNITS
>   TYPE_VECTOR_SUBPARTS
>
> are now runtime invariants rather than compile-time constants.  The first
> question is what the representation of these runtime invariants should be.
> Two obvious choices are:
>
>   (1) Make them tree or rtl expressions (as appropriate for the IR
>       they're part of).
>   (2) Use a new representation.
>
> One of the main problems with (1) is that it's much more general than
> we need.  If we made something like GET_MODE_SIZE an rtx, it would be
> hard to enforce statically that the value has a suitable form.  It would
> also slow down the compiler, including for targets that don't need runtime
> sizes.
>
> We therefore went for approach (2).  The idea is to add a new
> "polynomial integer" (poly_int) class that represents a general:
>
>   C0 + C1 * X1 + ... + Cn * Xn
>
> where each coefficient Ci is a compile-time constant and where each
> indeterminate Xi is a nonnegative runtime parameter.  The class takes
> "n" and the coefficient type as template parameters, so unlike (1) it
> can continue to occupy less memory than a pointer where appropriate.
>
> The value of "n" for mode sizes and offsets depends on the target.
> For all targets except AArch64, "n" is 1 and the class degenerates
> to a constant.
>
> One difficulty with using runtime sizes is that some common questions
> might not be decidable at compile time.  E.g. if mode A has size 2 + 2X
> and mode B has size 4, the condition:
>
>   GET_MODE_SIZE (A) <= GET_MODE_SIZE (B)
>
> is true for X<=1 and false for X>=2.  It's therefore no longer possible
> for target-independent code to use these kinds of comparison for modes
> that might be vectors.  Instead it needs to ask "might the size be <=?"
> or "must the size be <=?".
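To make the may/must distinction concrete, here is a minimal sketch. It is not the attached poly_int class: `poly2`, `may_le`, `must_le` and `eval` are hypothetical names for a two-coefficient value c0 + c1 * X, and the sketch assumes X is a nonnegative integer with no upper bound.

```cpp
#include <cassert>

// Hypothetical two-coefficient polynomial c0 + c1 * X, where X >= 0 is only
// known at run time.  Illustration only; not the attached poly_int class.
struct poly2
{
  long c0;
  long c1;
};

// "Must" a <= b: true only if a <= b holds for every X >= 0.
// a - b = (a.c0 - b.c0) + (a.c1 - b.c1) * X, which is <= 0 for all X >= 0
// exactly when both differences are <= 0.
inline bool must_le (poly2 a, poly2 b)
{
  return a.c0 <= b.c0 && a.c1 <= b.c1;
}

// "May" a <= b: true if a <= b holds for at least one X >= 0.
// Either it already holds at X == 0, or a grows more slowly than b, so
// it holds once X is large enough (X is assumed to be unbounded).
inline bool may_le (poly2 a, poly2 b)
{
  return a.c0 <= b.c0 || a.c1 < b.c1;
}

// Evaluate the polynomial for one particular runtime value of X.
inline long eval (poly2 p, long x)
{
  assert (x >= 0);
  return p.c0 + p.c1 * x;
}
```

With A = 2 + 2X and B = 4 as in the example above, `may_le (A, B)` is true (it holds at X = 0 and X = 1) while `must_le (A, B)` is false (it fails from X = 2 onwards).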
>
> If a target only has constant sizes, it would be silly for target-specific
> code to have to make the distinction between "may" and "must", since the
> target knows that they amount to the same thing.  poly_int therefore
> provides an implicit conversion to a constant if "n" is 1 and if we're
> compiling target-specific code.  Whether this conversion is available
> is controlled by a new TARGET_C_FILE macro.
>
> The idea is to allow current targets to compile as-is with very few
> changes while at the same time ensuring that people working on target-
> independent code can be reasonably confident of "doing the right thing"
> for runtime sizes without having to test SVE specifically.
>
> However, even with SVE, all non-vector modes still have a compile-time size.
> In these cases we had two options: use may/must operations anyway, or add
> static type checking to enforce the fact that the mode isn't a vector.
> The latter seemed better in most cases.  The patches therefore add the
> following classes to wrap a machine mode enum:
>
>   scalar_int_mode: modes that satisfy SCALAR_INT_MODE_P
>   scalar_float_mode: modes that satisfy SCALAR_FLOAT_MODE_P
>   scalar_mode: modes that hold some kind of scalar
>   complex_mode: modes that hold a complex value
>
> These wrappers have other benefits too.  They replace some runtime asserts
> with static type checking and also make sure that the size or precision of
> a vector mode isn't accidentally used instead of the size or precision of
> an element.  (This sometimes happened when handling vector shifts by a
> scalar amount, for example.)
>
> We reused the is_a<>, as_a<> and dyn_cast<> operators for machine modes.
> E.g.:
>
>   is_a <scalar_mode> (M)
>
> tests whether M is scalar and:
>
>   as_a <scalar_int_mode> (M)
>
> forcibly converts M to a scalar_int_mode, asserting if it isn't one.
> We also used:
>
>   is_a <scalar_int_mode> (M, &RES)
>
> as a convenient way of testing whether M is a scalar_int_mode and
> storing it as one in RES if so.  This helps with various multi-line
> "if" statements, particularly in simplification routines.
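The two-argument test-and-store form can be pictured roughly as follows. This is only a sketch with made-up stand-ins: `mode_enum_sketch`, `scalar_int_mode_sketch`, `is_scalar_int` and `int_bits_or_zero` are hypothetical and are not the classes from the attached machmode.h.

```cpp
#include <cassert>

// Hypothetical stand-ins for the real machine mode definitions.
enum mode_enum_sketch { SK_SImode, SK_DImode, SK_DFmode, SK_V4SImode };

inline bool scalar_int_p (mode_enum_sketch m)
{
  return m == SK_SImode || m == SK_DImode;
}

// Wrapper that can only hold scalar integer modes.
class scalar_int_mode_sketch
{
public:
  scalar_int_mode_sketch () : m_mode (SK_SImode) {}
  explicit scalar_int_mode_sketch (mode_enum_sketch m) : m_mode (m)
  {
    // Mirrors the "asserting if it isn't one" behaviour of a forced conversion.
    assert (scalar_int_p (m));
  }
  mode_enum_sketch get () const { return m_mode; }

private:
  mode_enum_sketch m_mode;
};

// Two-argument form: test M and, on success, store it in *RES.
inline bool is_scalar_int (mode_enum_sketch m, scalar_int_mode_sketch *res)
{
  if (!scalar_int_p (m))
    return false;
  *res = scalar_int_mode_sketch (m);
  return true;
}

// Example of a multi-condition "if": the checked mode is available in
// INT_MODE for the later conditions of the same statement.
inline int int_bits_or_zero (mode_enum_sketch m)
{
  scalar_int_mode_sketch int_mode;
  if (is_scalar_int (m, &int_mode) && int_mode.get () == SK_DImode)
    return 64;
  return 0;
}
```

The point of the pattern is that the narrowing check and the typed result come from one call, so the rest of the condition can rely on the static type.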
>
> For consistency, the patches make machine_mode itself a wrapper class
> and rename the enum to machine_mode_enum.  FOOmode identifiers have
> the most specific type appropriate to them, so for example DImode is a
> scalar_int_mode and DFmode is a scalar_float_mode.  The raw enum values
> are still available with the E_ prefix (e.g. E_DImode) and are useful
> for things like case statements.
>
> I've attached the implementation of poly_int.  It contains a big block
> comment at the start describing the approach and summarising the
> available operations.
>
> I've also attached the new version of machmode.h, with the wrapper
> classes described above.
>
> One thing we haven't done but should is add self-tests for the
> poly_int class.  A lot of this code was written before self tests
> were available, but one of the reasons for making "n" a template
> parameter was precisely to allow n==2 to be tested on targets that
> don't need runtime parameters.
>
> Note that many things besides the macros above need to become polynomial.
> Other examples include SUBREG_BYTE, frame offsets, frame sizes, and the
> values returned by get_inner_reference.
>
>
> Representing runtime parameters in the IR
> =========================================
>
> Even though we used polynomials rather than IR to encode things like
> mode sizes, we still need a way of representing the runtime parameters
> in IR.  This is used when incrementing vector ivs and allocating stack
> frames, for example.
>
> There were two ways we considered doing this in rtl:
>
> (1) Add a new rtl code for the poly_ints themselves.  This would give
>     constants like:
>
>       (const_poly_int [(const_int C0)
>                        (const_int C1)
>                        ...
>                        (const_int Cn)])
>
>     (although the coefficients could be const_wide_ints instead
>     of const_ints where appropriate).  The runtime value would be:
>
>       C0 + C1 * X1 + ... + Cn * Xn
>
> (2) Add a new rtl code for the polynomial indeterminates Xi,
>     then use them in const wrappers.  A constant like C0 + C1 * X1
>     would then look like:
>
>       (const:M (plus:M (mult:M (const_param:M X1)
>                                (const_int C1))
>                        (const_int C0)))
>
> There didn't seem to be that much to choose between them.  However,
> DWARF location expressions that depend on the SVE vector length use
> a pseudo register to encode that length.  This is very similar to the
> const_param used in expression (2), and the DWARF expression would use
> similar arithmetic operations to construct the full polynomial constant.
> We therefore went for (2).
>
> Most uses of rtx polynomial constants use helper functions that abstract
> the underlying representation, so it would be easy to change to (1) (or
> to a third approach) in future.
>
> Unlike rtl, trees have no established practice of wrapping arbitrary
> arithmetic in a const-like wrapper, so (1) seemed like the best approach.
> The patches therefore add a new POLY_CST node that holds one INTEGER_CST
> per coefficient.  Again, the actual representation is usually hidden
> behind accessor functions; very little code operates on POLY_CSTs directly.
>
>
> Constructing variable-length vectors
> ====================================
>
> Currently both tree and rtl vector constants require the number of
> elements to be known at compile time and allow the elements to be
> arbitrarily different from one another.  SVE vector constants instead
> have a variable number of elements and require the constant to have
> some inherent structure, so that the values remain predictable however
> long the vector is.  In practice there are two useful types of constant:
>
> (a) a duplicate of a single value to all elements.
>
> (b) a linear series in which element E has the value BASE + E * STEP,
>     for some given BASE and STEP.
>
> For integers, (a) could simply be seen as a special form of (b) in
> which the step is zero.  However, we've deliberately not defined (b)
> for floating-point modes, in order to avoid specifying whether element
> E should really be calculated as BASE + E * STEP or as BASE with STEP
> added E times (which would round differently).  So treating (a) as a
> separate kind of constant from (b) is useful for floating-point types.
>
> We need to support the same operations for non-constant vectors as well
> as constant ones.  Both operations have direct analogues in SVE.
>
> rtl already supports (a) for variables via vec_duplicate.  For constants
> we simply wrapped such vec_duplicates in a (const ...), so for example:
>
>   (const:VnnHI (vec_duplicate:VnnHI (const_int 10)))
>
> represents a vector constant in which each element is the 16-bit value 10.
>
> For (b) we created a new vec_series rtl code that takes the base and step
> as operands.  A vec_series is constant if it has constant operands, in which
> case it too can be wrapped in a (const ...).  For example:
>
>   (const:VnnSI (vec_series:VnnSI (const_int 1) (const_int 3)))
>
> represents the constant { 1, 4, 7, 10, ... }.
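As a small worked illustration of the series definition, the sketch below expands BASE + E * STEP for a given element count. `expand_series` is a hypothetical helper, not part of the patches, and `nelts` simply stands in for whatever the element count turns out to be at run time.

```cpp
#include <vector>

// Element E of a linear series has the value BASE + E * STEP.
// NELTS stands in for the runtime number of elements.
inline std::vector<long> expand_series (long base, long step, unsigned nelts)
{
  std::vector<long> elts;
  elts.reserve (nelts);
  for (unsigned e = 0; e < nelts; ++e)
    elts.push_back (base + static_cast<long> (e) * step);
  return elts;
}
```

With base 1 and step 3 the first four elements are 1, 4, 7, 10, matching the constant above. A step of zero gives the integer "duplicate" case (a).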
>
> We only use constant vec_duplicate and vec_series when the number of
> elements is variable.  Vectors with a constant number of elements
> continue to use const_vector.  It might be worth considering using
> vec_duplicate across the board in future though, since it's significantly
> more compact when the number of elements is large.
>
> In both vec_duplicate and vec_series constants, the value of the element
> can be any constant that is valid for the element's mode; it doesn't have
> to be a const_int.
>
> The patches take a similar approach for trees.  A new VEC_DUPLICATE_EXPR
> returns a vector in which every element is equal to operand 0, while a new
> VEC_SERIES_EXPR creates a linear series, taking the same two operands as the
> rtl code.  The trees are TREE_CONSTANT if the operands are TREE_CONSTANT.
>
> The new trees are valid gimple values iff they are TREE_CONSTANT.
> This means that the constant forms can be used in a very similar way
> to VECTOR_CST, rather than always requiring a separate gimple assignment.


Hmm.  They are hopefully (at least VEC_DUPLICATE_EXPR) not GIMPLE_SINGLE_RHS.
But it means they'd appear (when TREE_CONSTANT) as gimple operand in
GENERIC form.

> Variable-length permutes
> ========================
>
> SVE has a similar set of permute instructions to Advanced SIMD: it has
> a TBL instruction for general permutes and specific instructions like
> TRN1 for certain common operations.  Although it would be possible to
> construct variable-length masks for all the special-purpose permutes,
> the expression to construct things like "interleave high" would be
> relatively complex.  It seemed better to add optabs and internal
> functions for certain kinds of well-known operation, specifically:
>
>   - IFN_VEC_INTERLEAVE_HI
>   - IFN_VEC_INTERLEAVE_LO
>   - IFN_VEC_EXTRACT_EVEN
>   - IFN_VEC_EXTRACT_ODD
>   - IFN_VEC_REVERSE

It's a step backwards from a unified representation of permutes in GIMPLE.
I think it would be better to have the internal functions generate the
well-known permute mask instead.  Thus you'd have

mask = IFN_VEC_INTERLEAVE_HI_MASK ();
vec = VEC_PERM_EXPR <vec1, vec2, mask>;

extract_even/odd should be doable with VEC_SERIES_EXPR, and so should
VEC_REVERSE.  interleave could use a double-size element mode to use
VEC_SERIES_EXPR with 0004 + n * 0101 to get 0004, 0105, 0206, 0307 for a
4 element vector for example, and then view-convert to the original size
element mode to get at the mask.
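A rough sketch of that arithmetic, assuming 8-bit selector indices packed two per 16-bit element. `interleave_mask_sketch` is a hypothetical helper. The split into high and low halves is done arithmetically here, so the order of each pair is an assumption; a real view-convert would depend on endianness.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Form the series BASE + n * STEP in double-width elements, then split each
// element into two narrow indices.  For NELTS == 4 the wide series is
// 0x0004, 0x0105, 0x0206, 0x0307.
inline std::vector<uint8_t> interleave_mask_sketch (unsigned nelts)
{
  // Keep the low byte (NELTS + n) from carrying into the high byte.
  assert (nelts >= 1 && nelts <= 128);

  const unsigned base = nelts;    // high byte 0, low byte NELTS
  const unsigned step = 0x0101;   // adds 1 to both halves per element

  std::vector<uint8_t> mask;
  mask.reserve (2 * nelts);
  for (unsigned n = 0; n < nelts; ++n)
    {
      const uint16_t wide = static_cast<uint16_t> (base + n * step);
      mask.push_back (static_cast<uint8_t> (wide >> 8));    // index into vec1
      mask.push_back (static_cast<uint8_t> (wide & 0xff));  // index into vec2
    }
  return mask;
}
```

For a 4 element vector this yields the indices 0, 4, 1, 5, 2, 6, 3, 7, where indices 4 and up select from the second input.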

I really wonder how you handle arbitrary permutes generated by SLP
loop vectorization ;)
(well, I guess "not supported" for the moment).

In general, did you do any compile-time / memory-use benchmarking of the
middle-end changes for a) targets not using variable-size modes, b) a target
having them, with/without the patches?  How is the debugging experience (of
GCC itself) for targets without variable-size modes when dealing with RTL?

A question on SVE itself -- is the vector size "fixed" in hardware or can it
change, say, per process? [just thinking of SMT and partitioning of the vector
resources]

Given that for HPC everybody recompiles their code for a specific machine I'd
have expected a -mvector-size=N switch to be a more pragmatic approach for
GCC 7 and also one that (if the size is really "fixed" in HW) might result in
better code generation (at least initially).

Thanks,
Richard.

> These names follow existing target-independent terminology rather than
> the usual AArch64 scheme.
>
> Thanks,
> Richard
>
