Richard Biener <richard.guent...@gmail.com> writes:
> On Thu, Nov 17, 2016 at 11:00 PM, Richard Sandiford
> <richard.sandif...@arm.com> wrote:
>> Thanks for the comments.
>>
>> Richard Biener <richard.guent...@gmail.com> writes:
>>> On Fri, Nov 11, 2016 at 6:50 PM, Richard Sandiford
>>>> Constructing variable-length vectors
>>>> ====================================
>>>>
>>>> Currently both tree and rtl vector constants require the number of
>>>> elements to be known at compile time and allow the elements to be
>>>> arbitrarily different from one another.  SVE vector constants instead
>>>> have a variable number of elements and require the constant to have
>>>> some inherent structure, so that the values remain predictable however
>>>> long the vector is.  In practice there are two useful types of constant:
>>>>
>>>> (a) a duplicate of a single value to all elements.
>>>>
>>>> (b) a linear series in which element E has the value BASE + E * STEP,
>>>>     for some given BASE and STEP.
>>>>
>>>> For integers, (a) could simply be seen as a special form of (b) in
>>>> which the step is zero.  However, we've deliberately not defined (b)
>>>> for floating-point modes, in order to avoid specifying whether element
>>>> E should really be calculated as BASE + E * STEP or as BASE with STEP
>>>> added E times (which would round differently).  So treating (a) as a
>>>> separate kind of constant from (b) is useful for floating-point types.
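>>>>
>>>> As a minimal standalone C illustration (not part of the patches) of
>>>> why the distinction matters, the following compares the two readings
>>>> of element E; with an inexact step such as 0.1f they can round
>>>> differently:
>>>>
>>>>   #include <stdio.h>
>>>>
>>>>   int main (void)
>>>>   {
>>>>     float base = 1.0f, step = 0.1f, acc = base;
>>>>     for (int e = 1; e < 8; e++)
>>>>       {
>>>>         acc += step;                     /* BASE with STEP added E times.  */
>>>>         float direct = base + e * step;  /* BASE + E * STEP.  */
>>>>         printf ("%d: %.9g vs %.9g\n", e, direct, acc);
>>>>       }
>>>>     return 0;
>>>>   }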
>>>>
>>>> We need to support the same operations for non-constant vectors as well
>>>> as constant ones.  Both operations have direct analogues in SVE.
>>>>
>>>> rtl already supports (a) for variables via vec_duplicate.  For constants
>>>> we simply wrapped such vec_duplicates in a (const ...), so for example:
>>>>
>>>>   (const:VnnHI (vec_duplicate:VnnHI (const_int 10)))
>>>>
>>>> represents a vector constant in which each element is the 16-bit value 10.
>>>>
>>>> For (b) we created a new vec_series rtl code that takes the base and step
>>>> as operands.  A vec_series is constant if it has constant operands,
>>>> in which case it too can be wrapped in a (const ...).  For example:
>>>>
>>>>   (const:VnnSI (vec_series:VnnSI (const_int 1) (const_int 3)))
>>>>
>>>> represents the constant { 1, 4, 7, 10, ... }.
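>>>>
>>>> As a sketch of how these might be built (illustrative, not lifted
>>>> from the patches; VnnHImode and VnnSImode stand in for whatever
>>>> variable-length modes the target defines), using the auto-generated
>>>> gen_rtx_* helpers:
>>>>
>>>>   /* (const:VnnHI (vec_duplicate:VnnHI (const_int 10)))  */
>>>>   rtx dup = gen_rtx_CONST (VnnHImode,
>>>>                            gen_rtx_VEC_DUPLICATE (VnnHImode,
>>>>                                                   GEN_INT (10)));
>>>>
>>>>   /* (const:VnnSI (vec_series:VnnSI (const_int 1) (const_int 3)))  */
>>>>   rtx series = gen_rtx_CONST (VnnSImode,
>>>>                               gen_rtx_VEC_SERIES (VnnSImode,
>>>>                                                   GEN_INT (1),
>>>>                                                   GEN_INT (3)));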
>>>>
>>>> We only use constant vec_duplicate and vec_series when the number of
>>>> elements is variable.  Vectors with a constant number of elements
>>>> continue to use const_vector.  It might be worth considering using
>>>> vec_duplicate across the board in future though, since it's significantly
>>>> more compact when the number of elements is large.
>>>>
>>>> In both vec_duplicate and vec_series constants, the value of the element
>>>> can be any constant that is valid for the element's mode; it doesn't have
>>>> to be a const_int.
>>>>
>>>> The patches take a similar approach for trees.  A new VEC_DUPLICATE_EXPR
>>>> returns a vector in which every element is equal to operand 0, while a new
>>>> VEC_SERIES_EXPR creates a linear series, taking the same two operands
>>>> as the rtl code.  The trees are TREE_CONSTANT if the operands are
>>>> TREE_CONSTANT.
>>>>
>>>> The new trees are valid gimple values iff they are TREE_CONSTANT.
>>>> This means that the constant forms can be used in a very similar way
>>>> to VECTOR_CST, rather than always requiring a separate gimple assignment.
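>>>>
>>>> For example (illustrative gimple, not taken from the patches), the
>>>> constant form can sit directly in an operand position:
>>>>
>>>>   _2 = _1 + VEC_DUPLICATE_EXPR <10>;
>>>>
>>>> whereas a duplicate of a variable still needs its own assignment:
>>>>
>>>>   _3 = VEC_DUPLICATE_EXPR <x_4>;
>>>>   _5 = _1 + _3;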
>>>
>>> Hmm.  They are hopefully (at least VEC_DUPLICATE_EXPR) not 
>>> GIMPLE_SINGLE_RHS.
>>> But it means they'd appear (when TREE_CONSTANT) as gimple operand in
>>> GENERIC form.
>>
>> You guessed correctly: they're GIMPLE_SINGLE_RHS :-)  That seemed to be
>> our current way of handling this kind of expression.  Do you not like it
>> because of the overhead of the extra tree node in plain:
>>
>>    reg = VEC_DUPLICATE_EXPR <X>
>>
>> assignments?
>
> Not the overhead, but it's a step backward from getting rid of GENERIC
> _expressions_ in GIMPLE.  They require special handling in most of the
> middle-end (genmatch, value numbering, and PRE, to name just a few).
>
>>  The problem is that if we treat them as unary, the
>> VEC_DUPLICATE_EXPR node appears and disappears depending on whether
>> the assignment has an operator or not, which makes them significantly
>> different from VECTOR_CST.  The idea was to make them as similar as
>> possible, so that most code wouldn't care that a different tree code
>> is being used.
>>
>> I think in practice most duplicates are used as operands rather than
>> as rhses in their own right.
>
> Well, then maybe restrict them to this.  Or add VEC_DUPLICATE_EXPR
> and VEC_DUPLICATE_CONST (like we fold a RHS CONSTRUCTOR
> to a VECTOR_CST when all elements become constant).

OK.  I did wonder about that, but thought it would be bad to have two
codes that do essentially the same thing.  It'd be a bit like the
NOP_EXPR/CONVERT_EXPR thing, which I thought we were also trying
to move away from.

Do we have a plan for how the other GIMPLE_SINGLE_RHS codes are
going to be migrated?

>>>> Variable-length permutes
>>>> ========================
>>>>
>>>> SVE has a similar set of permute instructions to Advanced SIMD: it has
>>>> a TBL instruction for general permutes and specific instructions like
>>>> TRN1 for certain common operations.  Although it would be possible to
>>>> construct variable-length masks for all the special-purpose permutes,
>>>> the expression to construct things like "interleave high" would be
>>>> relatively complex.  It seemed better to add optabs and internal
>>>> functions for certain kinds of well-known operation, specifically:
>>>>
>>>>   - IFN_VEC_INTERLEAVE_HI
>>>>   - IFN_VEC_INTERLEAVE_LO
>>>>   - IFN_VEC_EXTRACT_EVEN
>>>>   - IFN_VEC_EXTRACT_ODD
>>>>   - IFN_VEC_REVERSE
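>>>>
>>>> So, illustratively, the vectoriser would emit calls along the lines of:
>>>>
>>>>   _5 = IFN_VEC_INTERLEAVE_HI (vect_a_1, vect_b_2);
>>>>
>>>> rather than a VEC_PERM_EXPR with an explicitly-constructed mask.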
>>>
>>> It's a step backwards from a unified representation of permutes in GIMPLE.
>>> I think it would be better to have the internal functions generate the
>>> well-known permute mask instead.  Thus you'd have
>>>
>>> mask = IFN_VEC_INTERLEAVE_HI_MASK ();
>>> vec = VEC_PERM_EXPR <vec1, vec2, mask>;
>>>
>>> extract_even/odd should be doable with VEC_SERIES_EXPR, as is VEC_REVERSE.
>>> interleave could use a double-size element mode to use VEC_SERIES_EXPR with
>>> 0004 + n * 0101 to get 0004, 0105, 0206, 0307 for a 4 element vector
>>> for example.
>>> And then view-convert to the original element-size mode to get at the mask.
>>
>> I don't think that would be better in practice, for a few reasons:
>>
>> (1) The maximum SVE vector length is 256 bytes, so a general 2-operand mask
>>     for a permute on bytes would need a range of [0, 511].  That would be
>>     too big to hold in an element.  (FWIW, the general SVE permute
>>     instruction permutes a single vector.)
>
> That's an interesting thing to notice as limitation of VEC_PERM_EXPR in
> general -- it can't handle permutes of two QImode vectors with more than 128
> elements...  But it just means we need to lift the limitation that the vector
> containing the permutation has to have an element size that matches the
> to-be-permuted vector's element size.

We could do that, but would you expect to handle wider elements for
general selectors or just constant ones?  If it's just constants then
it creates another special case.  If it's variables too then it's
likely to be expensive.

For SVE the current definition is OK.

>> (2) The view-convert trick doesn't work for 64-bit elements, since there's
>>     no 128-bit vec_series instruction.  (And if there were 128-bit vector
>>     operations, we'd have the same problem for 256 bits, etc.)
>>
>>     I suspect the best way of generating an interleave mask for 64 bits
>>     would be to interleave two vec_series.  That suggests that the
>>     primitive operation really is the interleave rather than the mask.
>>
>> (3) On a related note, I think one of the attractions of a single
>>     unified permute representation was that it left the implementation
>>     up to the target.  Using vec_series-based tricks for things like
>>     interleaves would be doing the opposite: it would be baking in a
>>     particular implementation and leaving the target to do more work
>>     if it wanted a different implementation.
>
> True...
>
>> (4) The main benefit of directly-mapped internal functions is that they
>>     correspond to target optabs.  This makes them very light-weight to
>>     support (often not more than a line in internal-fn.def).  In contrast,
>>     if we had internal functions for the masks, we'd need custom expand
>>     code to code-generate the mask.  And for INTERLEAVE_LO masks in
>>     particular, the sequence we generate is likely to be expensive.
>>     It would also be code that we'd never want to see used in practice;
>>     it would just be there "in case".
>
> The disadvantage is you lose the ability to fold the interleave operation
> with a following permute.

True, but these internal functions are only created by the vectoriser
for specific purposes.  I wouldn't expect there to be that many cases
where further folds are possible, and it's not clear that they'd be
a win for SVE.

>> (5) It doesn't really seem to add any generality.  Any code that wants
>>     to know what the permute is doing and potentially rewrite it will
>>     need to understand the mask internal function as well.  Or to look
>>     through the view-convert sequence and recognise it as actually
>>     describing an interleave.
>>
>> (6) We'd still want the mask internal function and permute to be
>>     treated as effectively a single operation, since they're
>>     significantly more efficient as a unit than as separate
>>     instructions.  It seems like we'd be risking a repeat of the
>>     (VEC_)COND_EXPR situation to make that happen.  I think in the past
>>     you've been uneasy about the comparison that can be embedded in
>>     operand 0 of a COND_EXPR, but AIUI that was important on some
>>     targets to avoid the comparison becoming completely dissociated from
>>     the ?: and being rewritten into a form that the target couldn't
>>     handle easily.  I'd be afraid of the same thing happening here.
>>
>> But reading the original message back, I didn't make it clear that we only
>> use the new internal functions if the associated optab is implemented.
>> Existing targets aren't affected and continue to use permutes for
>> everything.
>
> Sure, I understood that.  But you end up having a way to query the
> target for support of an interleave and I fear we'd get back to the point
> where targets implement that and in turn lose the advantage
> the generic VEC_PERM_EXPR handling has.  So maybe restrict it
> to vectors with variable-sized modes then...

The documentation in the patch that adds the optabs points people
at the normal vec_perm_const approach, which should hopefully reduce
the risk of that.

I don't mind restricting it to variable-length vectors if we have to,
but using the internal functions if the optabs are defined seems more
consistent with everything else.  I agree targets could define the
optabs in cases where it would be better not to, but that's really
a target decision.

>> [Hmm... sorry for the long-winded answer]
>>
>>> I really wonder how you handle arbitrary permutes generated by SLP
>>> loop vectorization ;)
>>> (well, I guess "not supported" for the moment).
>>
>> Right.  This is high up the list of "nice to haves", but I was expecting
>> the result would still involve internal functions.
>
> I suppose SVE has something like repeat { 0, 1, 5, 6 } to a larger vector
> and in turn add CST to each element?  So sth like the series thing but
> replicating a vector rather than a single scalar?

Not in a single instruction, but it's easy to emulate.
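
One possible emulation sketch (illustrative IR only, assuming P is a
full-width vector whose first four elements are { 0, 1, 5, 6 }):

  sel_1 = VEC_SERIES_EXPR <0, 1>;       // { 0, 1, 2, 3, 4, 5, ... }
  idx_2 = sel_1 & 3;                    // { 0, 1, 2, 3, 0, 1, ... }
  rep_3 = VEC_PERM_EXPR <P, P, idx_2>;  // { 0, 1, 5, 6, 0, 1, 5, 6, ... }
  off_4 = (sel_1 >> 2) * CST;           // CST once per 4-element block
  res_5 = rep_3 + off_4;

though the exact sequence would depend on what the target handles best.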

>>> In general, did you do any compile-time / memory-use benchmarking of
>>> the middle-end changes for a) targets not using variable-size modes,
>>> b) a target having them, with/without the patches?
>>
>> I'll get back to you on this one -- wanted to respond to the other
>> points before I had results.
>>
>>> How is debugging experience (of GCC itself) for targets without
>>> variable-size modes when dealing with RTL?
>>
>> I've not found it to be much different, although I suppose I prefer
>> printf debugging and so probably don't use gdb often enough to give a
>> meaningful answer.  Using the machine mode enum wrappers means that the
>> values of mode parameters are printed as "..." in backtraces, but I think
>> some python can sort that out.
>>
>>> A question on SVE itself -- is the vector size "fixed" in hardware or
>>> can it change, say,
>>> per process? [just thinking of SMT and partitioning of the vector resources]
>>
>> It can change per process.
>>
>>> Given that for HPC everybody recompiles their code for a specific
>>> machine I'd have expected a -mvector-size=N switch to be a more
>>> pragmatic approach for GCC 7 and also one that (if the size is really
>>> "fixed" in HW) might result in better code generation (at least
>>> initially).
>>
>> The port does have an -msve-vector-bits=N switch if you really want
>> to force a specific length, but since the architecture has been
>> designed for VL-agnostic code, knowing the vector length makes no
>> practical difference on most workloads I've tried.  I've even seen
>> cases where fixing the length makes things worse: the variable-length
>> IR maps nicely to the architecture, whereas using hard-coded numbers
>> can tempt the compiler to mess things around a bit :-)
>
> Well, hard-coded numbers might be easier to support in GCC 7 ;)
>
> But the first question boiled down to: if code is compiled with
> -msve-vector-bits=N, can we annotate the executable with that
> information and make the kernel (or whoever is responsible) set the
> SVE vector length for the process accordingly (or refuse to load the
> binary)?

We're looking at the possibility of recording a VL in object files that
loaders could use.  However, it would be perfectly valid to insert
conditionally-executed vector-length-specific code into an otherwise
vector-length-agnostic executable or shared library (perhaps using
ifuncs to select the VLS code on suitable hosts).  So even if we did
have an ELF mechanism for recording a VL, it wouldn't necessarily be
appropriate for -msve-vector-bits= to set that VL automatically.

>> Also, -msve-vector-bits=128 (the smallest supported size) actually
>> generates normal vector-length agnostic code.  Genuinely fixing the
>> length to 128 bits would make the SVE .md patterns and address
>> legitimisation decisions clash with the current AdvSIMD ones: we'd need
>> a lot of GCC changes to make 128-bit-specific code be on a par with the
>> current vector-length-agnostic code.
>
> Ok, I would have guessed an initial implementation would be "trivial", but
> if the modes then indeed clash the only solution would be to make
> -msve-vector-bits=128 disable AdvSIMD (or the other way around).

We definitely don't want to do that :-)  SVE builds on top of AdvSIMD
and code should be able to use both simultaneously.  This includes
ACLE functions, etc.

Thanks,
Richard

>> More fundamentally though: as you asked above, the vector length is
>> a per-process choice rather than a machine-wide choice.  We'd also
>> want GCC's SVE output to run on any SVE implementation by default,
>> especially in cases where GCC is used as a distribution compiler.
>
> Understood.
>
> Thanks,
> Richard.
>
>> Thanks,
>> Richard
