Thanks for the comments. Richard Biener <richard.guent...@gmail.com> writes: > On Fri, Nov 11, 2016 at 6:50 PM, Richard Sandiford >> Constructing variable-length vectors >> ==================================== >> >> Currently both tree and rtl vector constants require the number of >> elements to be known at compile time and allow the elements to be >> arbitrarily different from one another. SVE vector constants instead >> have a variable number of elements and require the constant to have >> some inherent structure, so that the values remain predictable however >> long the vector is. In practice there are two useful types of constant: >> >> (a) a duplicate of a single value to all elements. >> >> (b) a linear series in which element E has the value BASE + E * STEP, >> for some given BASE and STEP. >> >> For integers, (a) could simply be seen as a special form of (b) in >> which the step is zero. However, we've deliberately not defined (b) >> for floating-point modes, in order to avoid specifying whether element >> E should really be calculcated as BASE + E * STEP or as BASE with STEP >> added E times (which would round differently). So treating (a) as a >> separate kind of constant from (b) is useful for floating-point types. >> >> We need to support the same operations for non-constant vectors as well >> as constant ones. Both operations have direct analogues in SVE. >> >> rtl already supports (a) for variables via vec_duplicate. For constants >> we simply wrapped such vec_duplicates in a (const ...), so for example: >> >> (const:VnnHI (vec_duplicate:VnnHI (const_int 10))) >> >> represents a vector constant in which each element is the 16-bit value 10. >> >> For (b) we created a new vec_series rtl code that takes the base and step >> as operands. A vec_series is constant if it has constant operands, in which >> case it too can be wrapped in a (const ...). For example: >> >> (const:VnnSI (vec_series:VnnSI (const_int 1) (const_int 3))) >> >> represents the constant { 1, 4, 7, 10, ... }. >> >> We only use constant vec_duplicate and vec_series when the number of >> elements is variable. Vectors with a constant number of elements >> continue to use const_vector. It might be worth considering using >> vec_duplicate across the board in future though, since it's significantly >> more compact when the number of elements is large. >> >> In both vec_duplicate and vec_series constants, the value of the element >> can be any constant that is valid for the element's mode; it doesn't have >> to be a const_int. >> >> The patches take a similar approach for trees. A new VEC_DUPLICATE_EXPR >> returns a vector in which every element is equal to operand 0, while a new >> VEC_SERIES_EXPR creates a linear series, taking the same two operands as the >> rtl code. The trees are TREE_CONSTANT if the operands are TREE_CONSTANT. >> >> The new trees are valid gimple values iff they are TREE_CONSTANT. >> This means that the constant forms can be used in a very similar way >> to VECTOR_CST, rather than always requiring a separate gimple assignment. > > Hmm. They are hopefully (at least VEC_DUPLICATE_EXPR) not GIMPLE_SINGLE_RHS. > But it means they'd appear (when TREE_CONSTANT) as gimple operand in > GENERIC form.
You guessed correctly: they're GIMPLE_SINGLE_RHS :-) That seemed to be our current way of handling this kind of expression. Do you not like it because of the overhead of the extra tree node in plain: reg = VEC_DUPLICATE_EXPR <X> assignments? The problem is that if we treat them as unary, the VEC_DUPLICATE_EXPR node appears and disappears depending on whether the assignment has an operator or not, which makes them significantly different from VECTOR_CST. The idea was to make them as similar as possible, so that most code wouldn't care that a different tree code is being used. I think in practice most duplicates are used as operands rather than as rhses in their own right. >> Variable-length permutes >> ======================== >> >> SVE has a similar set of permute instructions to Advanced SIMD: it has >> a TBL instruction for general permutes and specific instructions like >> TRN1 for certain common operations. Although it would be possible to >> construct variable-length masks for all the special-purpose permutes, >> the expression to construct things like "interleave high" would be >> relatively complex. It seemed better to add optabs and internal >> functions for certain kinds of well-known operation, specifically: >> >> - IFN_VEC_INTERLEAVE_HI >> - IFN_VEC_INTERLEAVE_LO >> - IFN_VEC_EXTRACT_EVEN >> - IFN_VEC_EXTRACT_ODD >> - IFN_VEC_REVERSE > > It's a step backwards from a unified representation of permutes in GIMPLE. > I think it would be better to have the internal functions generate the > well-known permute mask instead. Thus you'd have > > mask = IFN_VEC_INTERLEAVE_HI_MASK (); > vec = VEC_PERM_EXPR <vec1, vec2, mask>; > > extract_even/odd should be doable with VEC_SERIES_EXPR, so is VEC_REVERSE. > interleave could use a double-size element mode to use VEC_SERIES_EXPR with > 0004 + n * 0101 to get 0004, 0105, 0206, 0307 for a 4 element vector > for example. > And then view-convert to the original size element mode to the at the mask. I don't think that would be better in practice, for a few reasons: (1) The maximum SVE vector lengthis 256 bytes, so a general 2-operand mask for a permute on bytes would need a range of [0, 511]. That would be too big to hold in an element. (FWIW, the general SVE permute instruction permutes a single vector.) (2) The view-convert trick doesn't work for 64-bit elements, since there's no 128-bit vec_series instruction. (And if there were 128-bit vector operations, we'd have the same problem for 256 bits, etc.) I suspect the best way of generating an interleave mask for 64 bits would be to interleave two vec_series. That suggests that the primitive operation really is the interleave rather than the mask. (3) On a related note, I think one of the attractions of a single unified permute representation was that it left the implementation up to the target. Using vec_series-based tricks for things like interleaves would be doing the opposite: it would be baking in a particular implementation and leaving the target to do more work if it wanted a different implementation. (4) The main benefit of directly-mapped internal functions is that they correspond to target optabs. This makes them very light-weight to support (often not more than a line in internal-fn.def). In contrast, if we had internal functions for the masks, we'd need custom expand code to code-generate the mask. And for INTERLEAVE_LO masks in particular, the sequence we generate is likely to be expensive. It would also be code that we'd never want to see used in practice; it would just be there "in case". (5) It doesn't really seem to add any generality. Any code that wants to know what the permute is doing and potentially rewrite it will need to understand the mask internal function as well. Or to look through the view-convert sequence and recognise it as actually describing an interleave. (6) We'd still want the mask internal function and permute to be treated as effectively a single operation, since they're significantly more efficient as a unit than as separate instructions. It seems like we'd be risking a repeat of the (VEC_)COND_EXPR situation to make that happen. I think in the past you've been uneasy about the comparison that can be embedded in operand 0 of a COND_EXPR, but AIUI that was important on some targets to avoid the comparison becoming completely dissociated from the ?: and being rewritten into a form that the target couldn't handle easily. I'd be afraid of the same thing happening here. But reading the original message back, I didn't make it clear that we only use the new internal functions if the associated optab is implemented. Existing targets aren't affected and continue to use permutes for everything. [Hmm... sorry for the long-winded answer] > I really wonder how you handle arbitrary permutes generated by SLP > loop vectorization ;) > (well, I guess "not supported" for the moment). Right. This is high up the list of "nice to haves", but I was expecting the result would still involve internal functions. > In general, did you do any compile-time / memory-use benchmarking of > the middle-end changes for a) targets not using variable-size modes, > b) a target having them, with/without the patches? I'll get back to you on this one -- wanted to respond to the other points before I had results. > How is debugging experience (of GCC itself) for targets without > variable-size modes when dealing with RTL? I've not found it to be much different, although I suppose I prefer printf debugging and so probably don't use gdb often enough to give a meaningful answer. Using the machine mode enum wrappers means that the values of mode parameters are printed as "..." in backtraces, but I think some python can sort that out. > A question on SVE itself -- is the vector size "fixed" in hardware or > can it change say, > per process? [just thinking of SMT and partitioning of the vector resources] It can change per process. > Given that for HPC everybody recompiles their code for a specific > machine I'd have expected a -mvector-size=N switch to be a more > pragmatic approach for GCC 7 and also one that (if the size is really > "fixed" in HW) might result in better code generation (at least > initially). The port does have an -msve-vector-bits=N switch if you really want to force a specific length, but since the architecture has been designed for VL-agnostic code, knowing the vector length makes no practical difference on most workloads I've tried. I've even seen cases where fixing the length makes things worse: the variable-length IR maps nicely to the architecture, whereas using hard-coded numbers can tempt the compiler to mess things around a bit :-) Also, -msve-vector-bits=128 (the smallest supported size) actually generates normal vector-length agnostic code. Genuinely fixing the length to 128 bits would make the SVE .md patterns and address legitimisation decisions clash with the current AdvSIMD ones: we'd need a lot of GCC changes to make 128-bit-specific code be on a par with the current vector-length-agnostic code. More fundamentally though: as you asked above, the vector length is a per-process choice rather than a machine-wide choice. We'd also want GCC's SVE output to run on any SVE implementation by default, especially in cases where GCC is used as a distribution compiler. Thanks, Richard