Re: [SVE] Support for variable-sized machine modes

Richard Sandiford Thu, 17 Nov 2016 14:01:28 -0800

Thanks for the comments.

Richard Biener <richard.guent...@gmail.com> writes:
> On Fri, Nov 11, 2016 at 6:50 PM, Richard Sandiford
>> Constructing variable-length vectors
>> ====================================
>>
>> Currently both tree and rtl vector constants require the number of
>> elements to be known at compile time and allow the elements to be
>> arbitrarily different from one another.  SVE vector constants instead
>> have a variable number of elements and require the constant to have
>> some inherent structure, so that the values remain predictable however
>> long the vector is.  In practice there are two useful types of constant:
>>
>> (a) a duplicate of a single value to all elements.
>>
>> (b) a linear series in which element E has the value BASE + E * STEP,
>>     for some given BASE and STEP.
>>
>> For integers, (a) could simply be seen as a special form of (b) in
>> which the step is zero.  However, we've deliberately not defined (b)
>> for floating-point modes, in order to avoid specifying whether element
>> E should really be calculated as BASE + E * STEP or as BASE with STEP
>> added E times (which would round differently).  So treating (a) as a
>> separate kind of constant from (b) is useful for floating-point types.
>>
>> We need to support the same operations for non-constant vectors as well
>> as constant ones.  Both operations have direct analogues in SVE.
>>
>> rtl already supports (a) for variables via vec_duplicate.  For constants
>> we simply wrapped such vec_duplicates in a (const ...), so for example:
>>
>>   (const:VnnHI (vec_duplicate:VnnHI (const_int 10)))
>>
>> represents a vector constant in which each element is the 16-bit value 10.
>>
>> For (b) we created a new vec_series rtl code that takes the base and step
>> as operands.  A vec_series is constant if it has constant operands, in which
>> case it too can be wrapped in a (const ...).  For example:
>>
>>   (const:VnnSI (vec_series:VnnSI (const_int 1) (const_int 3)))
>>
>> represents the constant { 1, 4, 7, 10, ... }.
>>
>> We only use constant vec_duplicate and vec_series when the number of
>> elements is variable.  Vectors with a constant number of elements
>> continue to use const_vector.  It might be worth considering using
>> vec_duplicate across the board in future though, since it's significantly
>> more compact when the number of elements is large.
>>
>> In both vec_duplicate and vec_series constants, the value of the element
>> can be any constant that is valid for the element's mode; it doesn't have
>> to be a const_int.
>>
>> The patches take a similar approach for trees.  A new VEC_DUPLICATE_EXPR
>> returns a vector in which every element is equal to operand 0, while a new
>> VEC_SERIES_EXPR creates a linear series, taking the same two operands as the
>> rtl code.  The trees are TREE_CONSTANT if the operands are TREE_CONSTANT.
>>
>> The new trees are valid gimple values iff they are TREE_CONSTANT.
>> This means that the constant forms can be used in a very similar way
>> to VECTOR_CST, rather than always requiring a separate gimple assignment.
>
> Hmm.  They are hopefully (at least VEC_DUPLICATE_EXPR) not GIMPLE_SINGLE_RHS.
> But it means they'd appear (when TREE_CONSTANT) as gimple operand in
> GENERIC form.


You guessed correctly: they're GIMPLE_SINGLE_RHS :-)  That seemed to be
our current way of handling this kind of expression.  Do you not like it
because of the overhead of the extra tree node in plain:

   reg = VEC_DUPLICATE_EXPR <X>

assignments?  The problem is that if we treat them as unary, the
VEC_DUPLICATE_EXPR node appears and disappears depending on whether
the assignment has an operator or not, which makes them significantly
different from VECTOR_CST.  The idea was to make them as similar as
possible, so that most code wouldn't care that a different tree code
is being used.

I think in practice most duplicates are used as operands rather than
as rhses in their own right.
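
For reference, here is a rough model of the two kinds of constant from the
quoted text above, written in Python purely for illustration.  The function
names are made up for this sketch and none of it is GCC code.  It shows why
the values stay predictable however long the vector is, and why (b) was left
undefined for floating-point modes:

```python
def vec_duplicate(value, nelts):
    """Model of (a): every element equals VALUE."""
    return [value] * nelts


def vec_series(base, step, nelts):
    """Model of (b): element E has the value BASE + E * STEP."""
    return [base + e * step for e in range(nelts)]


def series_by_repeated_add(base, step, nelts):
    """The alternative reading of (b): BASE with STEP added E times."""
    out, value = [], base
    for _ in range(nelts):
        out.append(value)
        value = value + step
    return out


# The structure is independent of the length: a longer vector simply
# extends the shorter one.
assert vec_series(1, 3, 8)[:4] == vec_series(1, 3, 4)

# For integers, (a) is (b) with a zero step, and both readings of (b) agree.
assert vec_series(10, 0, 8) == vec_duplicate(10, 8)
assert vec_series(1, 3, 8) == series_by_repeated_add(1, 3, 8)

# For floating point the two readings of (b) can round differently.
assert vec_series(0.0, 0.1, 11)[10] != series_by_repeated_add(0.0, 0.1, 11)[10]
```

The element count is an explicit argument here only because Python needs
one; in the real representation it is not known at compile time.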

>> Variable-length permutes
>> ========================
>>
>> SVE has a similar set of permute instructions to Advanced SIMD: it has
>> a TBL instruction for general permutes and specific instructions like
>> TRN1 for certain common operations.  Although it would be possible to
>> construct variable-length masks for all the special-purpose permutes,
>> the expression to construct things like "interleave high" would be
>> relatively complex.  It seemed better to add optabs and internal
>> functions for certain kinds of well-known operation, specifically:
>>
>>   - IFN_VEC_INTERLEAVE_HI
>>   - IFN_VEC_INTERLEAVE_LO
>>   - IFN_VEC_EXTRACT_EVEN
>>   - IFN_VEC_EXTRACT_ODD
>>   - IFN_VEC_REVERSE
>
> It's a step backwards from a unified representation of permutes in GIMPLE.
> I think it would be better to have the internal functions generate the
> well-known permute mask instead.  Thus you'd have
>
> mask = IFN_VEC_INTERLEAVE_HI_MASK ();
> vec = VEC_PERM_EXPR <vec1, vec2, mask>;
>
> extract_even/odd should be doable with VEC_SERIES_EXPR, so is VEC_REVERSE.
> interleave could use a double-size element mode to use VEC_SERIES_EXPR with
> 0004 + n * 0101 to get 0004, 0105, 0206, 0307 for a 4 element vector
> for example.
> And then view-convert to the original size element mode to get at the mask.

I don't think that would be better in practice, for a few reasons:

(1) The maximum SVE vector length is 256 bytes, so a general 2-operand mask
    for a permute on bytes would need a range of [0, 511].  That would be
    too big to hold in an element.  (FWIW, the general SVE permute
    instruction permutes a single vector.)

(2) The view-convert trick doesn't work for 64-bit elements, since there's
    no 128-bit vec_series instruction.  (And if there were 128-bit vector
    operations, we'd have the same problem for 256 bits, etc.)

    I suspect the best way of generating an interleave mask for 64 bits
    would be to interleave two vec_series.  That suggests that the
    primitive operation really is the interleave rather than the mask.

(3) On a related note, I think one of the attractions of a single
    unified permute representation was that it left the implementation
    up to the target.  Using vec_series-based tricks for things like
    interleaves would be doing the opposite: it would be baking in a
    particular implementation and leaving the target to do more work
    if it wanted a different implementation.

(4) The main benefit of directly-mapped internal functions is that they
    correspond to target optabs.  This makes them very light-weight to
    support (often not more than a line in internal-fn.def).  In contrast,
    if we had internal functions for the masks, we'd need custom expand
    code to code-generate the mask.  And for INTERLEAVE_LO masks in
    particular, the sequence we generate is likely to be expensive.
    It would also be code that we'd never want to see used in practice;
    it would just be there "in case".

(5) It doesn't really seem to add any generality.  Any code that wants
    to know what the permute is doing and potentially rewrite it will
    need to understand the mask internal function as well.  Or to look
    through the view-convert sequence and recognise it as actually
    describing an interleave.

(6) We'd still want the mask internal function and permute to be
    treated as effectively a single operation, since they're
    significantly more efficient as a unit than as separate
    instructions.  It seems like we'd be risking a repeat of the
    (VEC_)COND_EXPR situation to make that happen.  I think in the past
    you've been uneasy about the comparison that can be embedded in
    operand 0 of a COND_EXPR, but AIUI that was important on some
    targets to avoid the comparison becoming completely dissociated from
    the ?: and being rewritten into a form that the target couldn't
    handle easily.  I'd be afraid of the same thing happening here.
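
To put numbers on (1) and (2), here is a small Python sketch.  It is only a
model: the helper names are invented for illustration, "first halves" is my
stand-in for whichever of the lo/hi interleaves applies on a given target,
and the element count is passed explicitly even though it would really be
variable.

```python
def vec_series(base, step, nelts):
    """Element E has the value BASE + E * STEP."""
    return [base + e * step for e in range(nelts)]


def interleave_first_halves(a, b):
    """Interleave the first half of A with the first half of B."""
    half = len(a) // 2
    out = []
    for x, y in zip(a[:half], b[:half]):
        out += [x, y]
    return out


def interleave_mask(nelts):
    """Two-operand permute mask for the interleave above, built by
    interleaving two series: indices into the first vector start at 0,
    indices into the second start at NELTS."""
    return interleave_first_halves(vec_series(0, 1, nelts),
                                   vec_series(nelts, 1, nelts))


def max_two_operand_index(vector_bytes, element_bytes):
    """Largest index a general two-operand mask must be able to hold."""
    nelts = vector_bytes // element_bytes
    return 2 * nelts - 1


# Point (1): with 256-byte vectors of byte elements, the mask needs
# indices up to 511, which a byte element cannot hold.
assert max_two_operand_index(256, 1) == 511
assert max_two_operand_index(256, 1) > 2**8 - 1

# Point (2): the mask falls out of interleaving two series, whatever the
# element size, so the interleave itself is the more basic operation.
assert interleave_mask(4) == [0, 4, 1, 5]
```

Note that building the mask already needed an interleave, which is the
circularity that (2) is getting at.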

But reading the original message back, I didn't make it clear that we only
use the new internal functions if the associated optab is implemented.
Existing targets aren't affected and continue to use permutes for
everything.

[Hmm... sorry for the long-winded answer]

> I really wonder how you handle arbitrary permutes generated by SLP
> loop vectorization ;)
> (well, I guess "not supported" for the moment).

Right.  This is high up the list of "nice to haves", but I was expecting
the result would still involve internal functions.

> In general, did you do any compile-time / memory-use benchmarking of
> the middle-end changes for a) targets not using variable-size modes,
> b) a target having them, with/without the patches?

I'll get back to you on this one -- wanted to respond to the other
points before I had results.

> How is debugging experience (of GCC itself) for targets without
> variable-size modes when dealing with RTL?

I've not found it to be much different, although I suppose I prefer
printf debugging and so probably don't use gdb often enough to give a
meaningful answer.  Using the machine mode enum wrappers means that the
values of mode parameters are printed as "..." in backtraces, but I think
some Python can sort that out.

> A question on SVE itself -- is the vector size "fixed" in hardware or
> can it change say,
> per process? [just thinking of SMT and partitioning of the vector resources]

It can change per process.

> Given that for HPC everybody recompiles their code for a specific
> machine I'd have expected a -mvector-size=N switch to be a more
> pragmatic approach for GCC 7 and also one that (if the size is really
> "fixed" in HW) might result in better code generation (at least
> initially).

The port does have an -msve-vector-bits=N switch if you really want
to force a specific length, but since the architecture has been
designed for VL-agnostic code, knowing the vector length makes no
practical difference on most workloads I've tried.  I've even seen
cases where fixing the length makes things worse: the variable-length
IR maps nicely to the architecture, whereas using hard-coded numbers
can tempt the compiler to mess things around a bit :-)
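
As a loose illustration of what "VL-agnostic" means here, the following
Python sketch models a predicated loop whose result does not depend on the
number of lanes.  It is my own simplification for this thread, not output
from the port: the function name is hypothetical, and the per-lane predicate
is only meant to resemble the kind of "lane index still in range" test that
a predicated vector loop would use.

```python
def vla_add(a, b, lanes):
    """Add two equal-length lists, LANES elements per iteration.
    LANES stands in for the runtime vector length."""
    n = len(a)
    out = [0] * n
    i = 0
    while i < n:
        # Predicate: a lane is active only if its index is still in range,
        # so the final partial iteration needs no scalar tail loop.
        pred = [i + lane < n for lane in range(lanes)]
        for lane, active in enumerate(pred):
            if active:
                out[i + lane] = a[i + lane] + b[i + lane]
        i += lanes
    return out


a = list(range(10))
b = list(range(100, 110))
expected = [x + y for x, y in zip(a, b)]

# The same code gives the same answer for any lane count.
for lanes in (2, 4, 8, 32):
    assert vla_add(a, b, lanes) == expected
```

Nothing in the loop body refers to a particular lane count, which is why
knowing the length up front buys so little.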

Also, -msve-vector-bits=128 (the smallest supported size) actually
generates normal vector-length agnostic code.  Genuinely fixing the
length to 128 bits would make the SVE .md patterns and address
legitimisation decisions clash with the current AdvSIMD ones: we'd need
a lot of GCC changes to make 128-bit-specific code be on a par with the
current vector-length-agnostic code.

More fundamentally though: as you asked above, the vector length is
a per-process choice rather than a machine-wide choice.  We'd also
want GCC's SVE output to run on any SVE implementation by default,
especially in cases where GCC is used as a distribution compiler.

Thanks,
Richard
