On Thu, Nov 24, 2016 at 6:06 PM, Richard Sandiford
<richard.sandif...@arm.com> wrote:
> Richard Biener <richard.guent...@gmail.com> writes:
>> On Thu, Nov 17, 2016 at 11:00 PM, Richard Sandiford
>> <richard.sandif...@arm.com> wrote:
>>> Thanks for the comments.
>>>
>>> Richard Biener <richard.guent...@gmail.com> writes:
>>>> On Fri, Nov 11, 2016 at 6:50 PM, Richard Sandiford
>>>>> Constructing variable-length vectors
>>>>> ====================================
>>>>>
>>>>> Currently both tree and rtl vector constants require the number of
>>>>> elements to be known at compile time and allow the elements to be
>>>>> arbitrarily different from one another.  SVE vector constants instead
>>>>> have a variable number of elements and require the constant to have
>>>>> some inherent structure, so that the values remain predictable however
>>>>> long the vector is.  In practice there are two useful types of constant:
>>>>>
>>>>> (a) a duplicate of a single value to all elements.
>>>>>
>>>>> (b) a linear series in which element E has the value BASE + E * STEP,
>>>>>     for some given BASE and STEP.
>>>>>
>>>>> For integers, (a) could simply be seen as a special form of (b) in
>>>>> which the step is zero.  However, we've deliberately not defined (b)
>>>>> for floating-point modes, in order to avoid specifying whether element
>>>>> E should really be calculated as BASE + E * STEP or as BASE with STEP
>>>>> added E times (which would round differently).  So treating (a) as a
>>>>> separate kind of constant from (b) is useful for floating-point types.
>>>>>
>>>>> We need to support the same operations for non-constant vectors as well
>>>>> as constant ones.  Both operations have direct analogues in SVE.
>>>>>
>>>>> rtl already supports (a) for variables via vec_duplicate.  For constants
>>>>> we simply wrapped such vec_duplicates in a (const ...), so for example:
>>>>>
>>>>>   (const:VnnHI (vec_duplicate:VnnHI (const_int 10)))
>>>>>
>>>>> represents a vector constant in which each element is the 16-bit value 10.
>>>>>
>>>>> For (b) we created a new vec_series rtl code that takes the base and step
>>>>> as operands.  A vec_series is constant if it has constant operands,
>>>>> in which case it too can be wrapped in a (const ...).  For example:
>>>>>
>>>>>   (const:VnnSI (vec_series:VnnSI (const_int 1) (const_int 3)))
>>>>>
>>>>> represents the constant { 1, 4, 7, 10, ... }.
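>>>>>
>>>>> A minimal C++ sketch of building such an rtx (gen_rtx_VEC_SERIES would
>>>>> be generated from the new rtl.def entry; the VnnSImode name is just a
>>>>> placeholder for a variable-length vector mode):
>>>>>
>>>>>   rtx base = GEN_INT (1);
>>>>>   rtx step = GEN_INT (3);
>>>>>   rtx x = gen_rtx_CONST (VnnSImode,
>>>>>                          gen_rtx_VEC_SERIES (VnnSImode, base, step));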
>>>>>
>>>>> We only use constant vec_duplicate and vec_series when the number of
>>>>> elements is variable.  Vectors with a constant number of elements
>>>>> continue to use const_vector.  It might be worth considering using
>>>>> vec_duplicate across the board in future though, since it's significantly
>>>>> more compact when the number of elements is large.
>>>>>
>>>>> In both vec_duplicate and vec_series constants, the value of the element
>>>>> can be any constant that is valid for the element's mode; it doesn't have
>>>>> to be a const_int.
>>>>>
>>>>> The patches take a similar approach for trees.  A new VEC_DUPLICATE_EXPR
>>>>> returns a vector in which every element is equal to operand 0, while a new
>>>>> VEC_SERIES_EXPR creates a linear series, taking the same two operands
>>>>> as the rtl code.  The trees are TREE_CONSTANT if the operands are
>>>>> TREE_CONSTANT.
>>>>>
>>>>> The new trees are valid gimple values iff they are TREE_CONSTANT.
>>>>> This means that the constant forms can be used in a very similar way
>>>>> to VECTOR_CST, rather than always requiring a separate gimple assignment.
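>>>>>
>>>>> A sketch of building the new trees with the generic builders (whether
>>>>> the patches use exactly these interfaces is an assumption):
>>>>>
>>>>>   tree dup = build1 (VEC_DUPLICATE_EXPR, vectype, elt);
>>>>>   tree series = build2 (VEC_SERIES_EXPR, vectype, base, step);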
>>>>
>>>> Hmm.  They are hopefully (at least VEC_DUPLICATE_EXPR) not 
>>>> GIMPLE_SINGLE_RHS.
>>>> But it means they'd appear (when TREE_CONSTANT) as gimple operand in
>>>> GENERIC form.
>>>
>>> You guessed correctly: they're GIMPLE_SINGLE_RHS :-)  That seemed to be
>>> our current way of handling this kind of expression.  Do you not like it
>>> because of the overhead of the extra tree node in plain:
>>>
>>>    reg = VEC_DUPLICATE_EXPR <X>
>>>
>>> assignments?
>>
>> Not the overhead, but it's a step backward from getting rid of GENERIC
>> _expressions_ in GIMPLE.  They require special handling in most of the
>> middle-end (genmatch, value-numbering and PRE, to name just a few).
>>
>>>  The problem is that if we treat them as unary, the
>>> VEC_DUPLICATE_EXPR node appears and disappears depending on whether
>>> the assignment has an operator or not, which makes them significantly
>>> different from VECTOR_CST.  The idea was to make them as similar as
>>> possible, so that most code wouldn't care that a different tree code
>>> is being used.
>>>
>>> I think in practice most duplicates are used as operands rather than
>>> as RHSes in their own right.
>>
>> Well, then maybe restrict them to this.  Or add VEC_DUPLICATE_EXPR
>> and VEC_DUPLICATE_CONST (like we fold a RHS CONSTRUCTOR
>> to a VECTOR_CST when all elements become constant).
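>>
>> Sth like the following in fold-const.c, by analogy with the CONSTRUCTOR
>> case (a sketch only; build_vec_duplicate_cst is an invented helper):
>>
>>   case VEC_DUPLICATE_EXPR:
>>     if (CONSTANT_CLASS_P (arg0))
>>       return build_vec_duplicate_cst (type, arg0);
>>     break;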
>
> OK.  I did wonder about that, but thought it would be bad to have two
> codes that do essentially the same thing.  It'd be a bit like the
> NOP_EXPR/CONVERT_EXPR thing, which I thought we were also trying
> to move away from.
>
> Do we have a plan for how the other GIMPLE_SINGLE_RHS codes are
> going to be migrated?

The ones I'd like to see migrated are REAL/IMAGPART_EXPR,
VIEW_CONVERT_EXPR and BIT_FIELD_REF if they operate on a register.
I don't see us migrating away from memory reference trees yet (that would
be a lot of work with maybe questionable return).  Then of course there
is the embedded GENERIC expression inside [VEC_]COND_EXPR for the
comparison operand.

For the tcc_reference codes mentioned above the cleanest way would be
to introduce new tree codes for the register variant...
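For example, today the same tcc_reference code covers both memory and
register operands:

  _1 = REALPART_EXPR <*p_3>;    (memory)
  _2 = REALPART_EXPR <c_4>;     (register)

A register-only variant (the __REALPART name below is purely hypothetical)
would be an ordinary unary operation:

  _2 = __REALPART <c_4>;

so genmatch, value-numbering etc. could treat it like any other unary code
instead of special-casing an embedded GENERIC tree.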

>>>>> Variable-length permutes
>>>>> ========================
>>>>>
>>>>> SVE has a similar set of permute instructions to Advanced SIMD: it has
>>>>> a TBL instruction for general permutes and specific instructions like
>>>>> TRN1 for certain common operations.  Although it would be possible to
>>>>> construct variable-length masks for all the special-purpose permutes,
>>>>> the expression to construct things like "interleave high" would be
>>>>> relatively complex.  It seemed better to add optabs and internal
>>>>> functions for certain kinds of well-known operation, specifically:
>>>>>
>>>>>   - IFN_VEC_INTERLEAVE_HI
>>>>>   - IFN_VEC_INTERLEAVE_LO
>>>>>   - IFN_VEC_EXTRACT_EVEN
>>>>>   - IFN_VEC_EXTRACT_ODD
>>>>>   - IFN_VEC_REVERSE
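>>>>>
>>>>> Each of these is little more than a line in internal-fn.def; a sketch,
>>>>> assuming a DEF_INTERNAL_OPTAB_FN-style macro and these optab names:
>>>>>
>>>>>   DEF_INTERNAL_OPTAB_FN (VEC_INTERLEAVE_HI, ECF_CONST | ECF_NOTHROW,
>>>>>                          vec_interleave_hi, binary)
>>>>>   DEF_INTERNAL_OPTAB_FN (VEC_REVERSE, ECF_CONST | ECF_NOTHROW,
>>>>>                          vec_reverse, unary)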
>>>>
>>>> It's a step backwards from a unified representation of permutes in GIMPLE.
>>>> I think it would be better to have the internal functions generate the
>>>> well-known permute mask instead.  Thus you'd have
>>>>
>>>> mask = IFN_VEC_INTERLEAVE_HI_MASK ();
>>>> vec = VEC_PERM_EXPR <vec1, vec2, mask>;
>>>>
>>>> extract_even/odd should be doable with VEC_SERIES_EXPR, as should
>>>> VEC_REVERSE.  interleave could use a double-size element mode and
>>>> VEC_SERIES_EXPR with 0004 + n * 0101, giving 0004, 0105, 0206, 0307
>>>> for a 4-element vector, for example.  And then view-convert to the
>>>> original element size to get at the mask.
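>>>>
>>>> (For example, the extract-even mask would be VEC_SERIES_EXPR <0, 2>,
>>>> i.e. { 0, 2, 4, ... }, and the reverse mask VEC_SERIES_EXPR <N-1, -1>
>>>> for an N-element vector.)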
>>>
>>> I don't think that would be better in practice, for a few reasons:
>>>
>>> (1) The maximum SVE vector length is 256 bytes, so a general 2-operand mask
>>>     for a permute on bytes would need a range of [0, 511].  That would be
>>>     too big to hold in an element.  (FWIW, the general SVE permute
>>>     instruction permutes a single vector.)
>>
>> That's an interesting thing to notice as a limitation of VEC_PERM_EXPR in
>> general -- it can't handle permutes of two QImode vectors with more than 128
>> elements...  But it just means we need to lift the limitation that the vector
>> containing the permutation has to have an element size that matches the
>> to-be permuted vector element size.
>
> We could do that, but would you expect to handle wider elements for
> general selectors or just constant ones?  If it's just constants then
> it creates another special case.  If it's variables too then it's
> likely to be expensive.

I guess it really depends on whether targets support permuting
two V256QImode vectors and how they are going to encode the
permute vector.

> For SVE the current definition is OK.

Which means SVE doesn't support general permutation of two V256QImode
vectors?

>>> (2) The view-convert trick doesn't work for 64-bit elements, since there's
>>>     no 128-bit vec_series instruction.  (And if there were 128-bit vector
>>>     operations, we'd have the same problem for 256 bits, etc.)
>>>
>>>     I suspect the best way of generating an interleave mask for 64 bits
>>>     would be to interleave two vec_series.  That suggests that the
>>>     primitive operation really is the interleave rather than the mask.
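>>>
>>>     (Concretely: interleaving the low halves of { 0, 1, 2, ... } and
>>>     { N, N+1, ... } gives the interleave-low mask { 0, N, 1, N+1, ... }
>>>     for an N-element vector.)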
>>>
>>> (3) On a related note, I think one of the attractions of a single
>>>     unified permute representation was that it left the implementation
>>>     up to the target.  Using vec_series-based tricks for things like
>>>     interleaves would be doing the opposite: it would be baking in a
>>>     particular implementation and leaving the target to do more work
>>>     if it wanted a different implementation.
>>
>> True...
>>
>>> (4) The main benefit of directly-mapped internal functions is that they
>>>     correspond to target optabs.  This makes them very light-weight to
>>>     support (often not more than a line in internal-fn.def).  In contrast,
>>>     if we had internal functions for the masks, we'd need custom expand
>>>     code to code-generate the mask.  And for INTERLEAVE_LO masks in
>>>     particular, the sequence we generate is likely to be expensive.
>>>     It would also be code that we'd never want to see used in practice;
>>>     it would just be there "in case".
>>
>> The disadvantage is you lose the ability to fold the interleave operation
>> with a following permute.
>
> True, but these internal functions are only created by the vectoriser
> for specific purposes.  I wouldn't expect there to be that many cases
> where further folds are possible, and it's not clear that they'd be
> a win for SVE.

So there are no intrinsic headers for SVE and thus users are not expected
to write SVE vectorized code themselves?  Or are they expected to generate
optimally optimized code?

>>> (5) It doesn't really seem to add any generality.  Any code that wants
>>>     to know what the permute is doing and potentially rewrite it will
>>>     need to understand the mask internal function as well.  Or to look
>>>     through the view-convert sequence and recognise it as actually
>>>     describing an interleave.
>>>
>>> (6) We'd still want the mask internal function and permute to be
>>>     treated as effectively a single operation, since they're
>>>     significantly more efficient as a unit than as separate
>>>     instructions.  It seems like we'd be risking a repeat of the
>>>     (VEC_)COND_EXPR situation to make that happen.  I think in the past
>>>     you've been uneasy about the comparison that can be embedded in
>>>     operand 0 of a COND_EXPR, but AIUI that was important on some
>>>     targets to avoid the comparison becoming completely dissociated from
>>>     the ?: and being rewritten into a form that the target couldn't
>>>     handle easily.  I'd be afraid of the same thing happening here.
>>>
>>> But reading the original message back, I didn't make it clear that we only
>>> use the new internal functions if the associated optab is implemented.
>>> Existing targets aren't affected and continue to use permutes for
>>> everything.
>>
>> Sure, I understood that.  But you end up having a way to query the
>> target for support of an interleave and I fear we'd get back to the point
>> where targets implement that and in turn lose the advantage
>> the generic VEC_PERM_EXPR handling has.  So maybe restrict it
>> to vectors with variable sized modes then...
>
> The documentation in the patch that adds the optabs points people
> at the normal vec_perm_const approach, which should hopefully reduce
> the risk of that.
>
> I don't mind restricting it to variable-length vectors if we have to,
> but using the internal functions if the optabs are defined seems more
> consistent with everything else.  I agree targets could define the
> optabs in cases where it would be better not to, but that's really
> a target decision.
>
>>> [Hmm... sorry for the long-winded answer]
>>>
>>>> I really wonder how you handle arbitrary permutes generated by SLP
>>>> loop vectorization ;)
>>>> (well, I guess "not supported" for the moment).
>>>
>>> Right.  This is high up the list of "nice to haves", but I was expecting
>>> the result would still involve internal functions.
>>
>> I suppose SVE has something like repeat { 0, 1, 5, 6 } to a larger vector
>> and in turn add CST to each element?  So sth like the series thing but
>> replicating a vector rather than a single scalar?
>
> Not in a single instruction, but it's easy to emulate.

Ok.

>>>> In general, did you do any compile-time / memory-use benchmarking of
>>>> the middle-end changes for a) targets not using variable-size modes,
>>>> b) a target having them, with/without the patches?
>>>
>>> I'll get back to you on this one -- wanted to respond to the other
>>> points before I had results.
>>>
>>>> How is the debugging experience (of GCC itself) for targets without
>>>> variable-size modes when dealing with RTL?
>>>
>>> I've not found it to be much different, although I suppose I prefer
>>> printf debugging and so probably don't use gdb often enough to give a
>>> meaningful answer.  Using the machine mode enum wrappers means that the
>>> values of mode parameters are printed as "..." in backtraces, but I think
>>> some python can sort that out.
>>>
>>>> A question on SVE itself -- is the vector size "fixed" in hardware or
>>>> can it change, say, per process?  [just thinking of SMT and partitioning
>>>> of the vector resources]
>>>
>>> It can change per process.
>>>
>>>> Given that for HPC everybody recompiles their code for a specific
>>>> machine I'd have expected a -mvector-size=N switch to be a more
>>>> pragmatic approach for GCC 7 and also one that (if the size is really
>>>> "fixed" in HW) might result in better code generation (at least
>>>> initially).
>>>
>>> The port does have an -msve-vector-bits=N switch if you really want
>>> to force a specific length, but since the architecture has been
>>> designed for VL-agnostic code, knowing the vector length makes no
>>> practical difference on most workloads I've tried.  I've even seen
>>> cases where fixing the length makes things worse: the variable-length
>>> IR maps nicely to the architecture, whereas using hard-coded numbers
>>> can tempt the compiler to mess things around a bit :-)
>>
>> Well, hard-coded numbers might be easier to support in GCC 7 ;)
>>
>> But the first question boiled down to: if code is compiled with
>> -msve-vector-bits=N, can we annotate the executable with that information
>> and make the kernel (or whoever is responsible) set the SVE vector length
>> for the process accordingly (or refuse to load the binary)?
>
> We’re looking at the possibility of recording a VL in object files that
> loaders could use.  However, it would be perfectly valid to insert
> conditionally-executed vector-length-specific code into an otherwise
> vector-length-agnostic executable or shared library (perhaps using
> ifuncs to select the VLS code on suitable hosts).  So even if we did
> have an ELF mechanism for recording a VL, it wouldn't necessarily be
> appropriate for -msve-vector-bits= to set that VL automatically.

I see.  So -msve-vector-bits=N would need to emit VL-agnostic code
anyway, or do loop versioning with a check against the actual VL.

>>> Also, -msve-vector-bits=128 (the smallest supported size) actually
>>> generates normal vector-length agnostic code.  Genuinely fixing the
>>> length to 128 bits would make the SVE .md patterns and address
>>> legitimisation decisions clash with the current AdvSIMD ones: we'd need
>>> a lot of GCC changes to make 128-bit-specific code be on a par with the
>>> current vector-length-agnostic code.
>>
>> Ok, I would have guessed an initial implementation would be "trivial", but
>> if the modes then indeed clash the only solution would be to make
>> -msve-vector-bits=128 disable AdvSIMD (or the other way around).
>
> We definitely don't want to do that :-)  SVE builds on top of AdvSIMD
> and code should be able to use both simultaneously.  This includes
> ACLE functions, etc.

Ok.

Thanks,
Richard.

> Thanks,
> Richard
>
>>> More fundamentally though: as you asked above, the vector length is
>>> a per-process choice rather than a machine-wide choice.  We'd also
>>> want GCC's SVE output to run on any SVE implementation by default,
>>> especially in cases where GCC is used as a distribution compiler.
>>
>> Understood.
>>
>> Thanks,
>> Richard.
>>
>>> Thanks,
>>> Richard
