On Tue, Jul 9, 2024 at 11:45 AM Tejas Belagod <tejas.bela...@arm.com> wrote: > > On 7/8/24 4:45 PM, Richard Biener wrote: > > On Mon, Jul 8, 2024 at 11:27 AM Tejas Belagod <tejas.bela...@arm.com> wrote: > >> > >> Hi, > >> > >> Sorry to have dropped the ball on > >> https://gcc.gnu.org/pipermail/gcc-patches/2023-July/625535.html, but > >> here I've tried to pick it up again and write up a strawman proposal for > >> elevating __attribute__((vector_mask)) to the FE from GIMPLE. > >> > >> > >> Thanks, > >> Tejas. > >> > >> Motivation > >> ---------- > >> > >> The idea of packed boolean vectors came about when we wanted to support > >> C/C++ operators on SVE ACLE types. The current vector boolean type that > >> ACLE specifies does not adequately disambiguate vector lane sizes which > >> they were derived off of. Consider this simple, albeit unrealistic, > >> example: > >> > >> bool foo (svint32_t a, svint32_t b) > >> { > >> svbool_t p = a > b; > >> > >> // Here p[2] is not the same as a[2] > b[2]. > >> return p[2]; > >> } > >> > >> In the above example, because svbool_t has a fixed 1-lane-per-byte, p[i] > >> does not return the bool value corresponding to a[i] > b[i]. This > >> necessitates a 'typed' vector boolean value that unambiguously > >> represents results of operations > >> of the same type. > >> > >> __attribute__((vector_mask)) > >> ----------------------------- > >> > >> Note: If interested in historical discussions refer to: > >> https://gcc.gnu.org/pipermail/gcc-patches/2023-July/625535.html > >> > >> We define this new attribute which when applied to a base data vector > >> produces a new boolean vector type that represents a boolean type that > >> is produced as a result of operations on the corresponding base vector > >> type. The following is the syntax. > >> > >> typedef int v8si __attribute__((vector_size (8 * sizeof (int))); > >> typedef v8si v8sib __attribute__((vector_mask)); > >> > >> Here the 'base' data vector type is v8si or a vector of 8 integers. > >> > >> Rules > >> > >> • The layout/size of the boolean vector type is implementation-defined > >> for its base data vector type. > >> > >> • Two boolean vector types who's base data vector types have same number > >> of elements and lane-width have the same layout and size. > >> > >> • Consequently, two boolean vectors who's base data vector types have > >> different number of elements or different lane-size have different layouts. > >> > >> This aligns with gnu vector extensions that generate integer vectors as > >> a result of comparisons - "The result of the comparison is a vector of > >> the same width and number of elements as the comparison operands with a > >> signed integral element type." according to > >> https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html. > > > > Without having the time to re-review this all in detail I think the GNU > > vector extension does not expose the result of the comparison as the > > machine would produce it but instead a comparison "decays" to > > a conditional: > > > > typedef int v4si __attribute__((vector_size(16))); > > > > v4si a; > > v4si b; > > > > void foo() > > { > > auto r = a < b; > > } > > > > produces, with C23: > > > > vector(4) int r = VEC_COND_EXPR < a < b , { -1, -1, -1, -1 } , { 0, > > 0, 0, 0 } > ; > > > > In fact on x86_64 with AVX and AVX512 you have two different "machine > > produced" mask types and the above could either produce a AVX mask with > > 32bit elements or a AVX512 mask with 1bit elements. > > > > Not exposing "native" mask types requires the compiler optimizing subsequent > > uses and makes generic vectors difficult to combine with for example AVX512 > > intrinsics (where masks are just 'int'). Across an ABI boundary it's also > > even more difficult to optimize mask transitions. > > > > But it at least allows portable code and it does not suffer from users > > trying to > > expose machine representations of masks as input to generic vector code > > with all the problems of constant folding not only requiring self-consistent > > code within the compiler but compatibility with user produced constant > > masks. > > > > That said, I somewhat question the need to expose the target mask layout > > to users for GCCs generic vector extension. > > > > Thanks for your feedback. > > IIUC, I can imagine how having a GNU vector extension exposing the > target vector mask layout can pose a challenge - maybe making it a > generic GNU vector extension was too ambitious. I wonder if there's > value in pursuing these alternate paths? > > 1. Can implementing this extension in a 'generic' way i.e. possibly not > implement it with a target mask, but just a generic int vector, still > maintain the consistency of GNU predicate vectors within the compiler? I > know it may not seem very different from how boolean vectors are > currently implemented (as in your above example), but, having the > __attribute__((vector_mask)) as a 'property' of the object makes it > useful to optimize its uses to target predicates in subsequent stages of > the compiler. > > 2. Restricting __attribute__((vector_mask)) to apply only to target > intrinsic types? Eg. > > On SVE something like: > typedef svint16_t svpred16_t __attribute__((vector_mask)); // OK. > > On AVX, something like: > typedef __m256i __mask32 __attribute__((vector_mask)); // OK - though > this would require more fine-grained defn of lane-size to mask-bits mapping.
I think the target should be able to register builtin types already which intrinsics could use. There is already the vector_mask attribute but only for GIMPLE and it has the same limitation of querying the target for the actual mode being used - for AVX vs AVX512 one might be able to combine this with a mode attribute. Not sure if on arm you can parse __attribute__((mode("Vx4BI4"))) or how the modes are called. But when you are talking about intrinsics I'd really suggest to leave the type creation to the target rather than trying to do a typedef in a header? Richard. > Would not be allowed on GNU Vector Extensions: > typedef v4si v4sib __attribute__((vector_mask)); // Error - vector_mask > can't be a generic GNU vector extension! > > > Thanks, > Tejas. > > > >> Producers and Consumers of PBV > >> ------------------------------ > >> > >> With GNU vector extensions, comparisons produce boolean vectors; > >> conditional and bitwise operators consume them. Comparison producers > >> generate signed integer vectors of the same lane-width as the operands > >> of the comparison operator. This means conditionals and bitwise > >> operators cannot be applied to mixed vectors that are a result of > >> different width operands. Eg. > >> > >> v8hi foo (v8si a, v8si b, v8hi c, v8hi d, v8sf e, v8sf f) > >> { > >> return a > b || c > d; // error! > >> return a > b || e < f; // OK - no explicit conversion needed. > >> return a > b || __builtin_convertvector (c > d, v8si); // OK. > >> return a | b && c | d; // error! > >> return a | b && __builtin_convertvector (c | d, v8si); // OK. > >> } > >> > >> __builtin_convertvector () needs to be applied to convert vectors to the > >> type one wants to do the comparison in. IoW, the integer vectors that > >> represent boolean vectors are 'strictly-typed'. If we extend these rules > >> to vector_mask, this will look like: > >> > >> typedef v8sib v8si __attribute__((vector_mask)); > >> typedef v8hib v8hi __attribute__((vector_mask)); > >> typedef v8sfb v8sf __attribute__((vector_mask)); > >> > >> v8sib foo (v8si a, v8si b, v8hi c, v8hi d, v8sf e, v8sf f) > >> { > >> v8sib psi = a > b; > >> v8hib phi = c > d; > >> v8sfb psf = e < f; > >> > >> return psi || phi; // error! > >> return psi || psf; // OK - no explicit conversion needed. > >> return psi || __builtin_convertvector (phi, v8sib); // OK. > >> return psi | phi; // error! > >> return psi | __builtin_convertvector (phi, v8sib); // OK. > >> return psi | psf; // OK - no explicit conversion needed. > >> } > >> > >> Now according to the rules explained above, v8sib and v8hib will have > >> different layouts (which is why they can't be used directly without > >> conversion if used as operands of operations). OTOH, the same rules > >> dictate that the layout of, say v8sib and v8sfb, where v8sfb is the > >> float base data vector equivalent of v8sib which when applied ensure > >> that v8sib and v8sfb have the same layout and hence can be used as > >> operands of operators without explicit conversion. This aligns with the > >> GNU vector extensions rules where comparison of 2 v8sf vectors results > >> in a v8si of the same lane-width and number of elements as that would > >> result in comparison of 2 v8si vectors. > >> > >> Application of vector_mask to sizeless types > >> -------------------------------------------- > >> > >> __attribute__((vector_mask)) has the advantage that it can be applied to > >> sizeless types seamlessly. When __attribute__((vector_mask)) is applied > >> to a data vector that is a sizeless type, the resulting vector mask also > >> becomes a sizeless type. > >> Eg. > >> > >> typedef svpred16_t svint16_t __attribute__((vector_mask)); > >> > >> This is equivalent of > >> > >> typedef vNhib vNhi __attribute__((vector_mask)); > >> > >> where N could be 8, 16, 32 etc. > >> > >> The resulting type is a scalable boolean vector type, i.e svint8_t. The > >> resulting boolean vector type has the same behavior as the scalar type > >> svint8_t. While svint8_t can represent a scalable bool vector, we need a > >> scalable scalar type to represent the bit-mask variant of the opaque > >> type that represents the bool vector. I haven't thought this through, > >> but I suspect it will be implemented as a 'typed' variant of svbool_t. > >> > >> ABI > >> --- > >> > >> Given the new opaque type, it needs rules that define PCS, storage > >> layout in aggregates and alignment. > >> > >> PCS > >> --- > >> > >> GNU vector extension type parameters are always passed on the stack. > >> Similarly vector_mask applied to GNU base data vector type parameters > >> will also be passed on the stack. The format to pass on the stack will > >> always be a canonical format - an opaque type where the internal > >> representation can be implementation-defined. > >> > >> The canonical form of the argument could be a boolean vector. This > >> boolean vector will be passed on the stack just like other GNU vectors. > >> vector bool is convenient for a callee to synthesize into a predicate > >> (irrespective of the target i.e. NEON, SVE, AVX) using target instructions. > >> > >> If the base data vector is an ACLE type, if the canonical bool vector we > >> choose is svint8_t or a typed svbool_t we could apply the same rules as > >> ABI for the said type. > >> > >> Alignment > >> --------- > >> > >> For boolean vector in memory, their alignment will be the natural > >> alignment as defined by the AAPCS64 i.e. 8 and 16 bytes for Short > >> Vectors and 16 bytes for scalable vectors. > >> > >> Aggregates > >> ---------- > >> > >> For fixed size vectors, the type resulting from applying > >> __attribute__((vector_mask)) is a vector of booleans IoW a vNqi. > >> Therefore the same rules apply as would apply to a GNU vector with 8-bit > >> elements of the same size in an aggregate. For scalable GNU boolean > >> vectors in aggregates, it acts as a Pure scalable type svint8_t and the > >> ABI rules from Section 5.10 of AAPCS64 apply. > >> > >> Operation Semantics > >> ------------------- > >> > >> What should be the data structure of the vector mask type? This seems to > >> be the main consideration. As suggested by Richard in > >> https://gcc.gnu.org/pipermail/gcc-patches/2023-July/625535.html the idea > >> is to have an opaque type to have control over operations and > >> observability. This means that the internal representation can be a bit > >> mask, but based on the operator being applied to it, the mask can > >> 'decay' to another operator-friendly data structure. > >> > >> vector_mask has 2 forms that is chosen based on the context. It lives as > >> a mask and a vector bool. Here we describe its behaviour in various > >> contexts. > >> > >> Arithmetic ops > >> -------------- > >> > >> These don't apply as the values are essentially binary. > >> > >> Bitwise ops - &, ^, |, ~, >>, << > >> --------------------------------- > >> > >> Here vector_mask acts as a scalar bitmask. Applying bitwise ops is like > >> another scalar operation. > >> > >> If p1 and p2 are vector_mask types of type: > >> > >> typedef v8sib v8si __attribute__((vector_mask)); > >> > >> Bitwise &, | and ^ > >> ------------------ > >> > >> p1 & p2 > >> > >> Here p1 and p2 act as integer type bitmasks where each bit represents a > >> vector lane of the data vector type. LSBit representing the lowest > >> numbered lane and MSBit representing the highest numbered lane. > >> > >> p1 & <scalar immediate> > >> > >> Here the immediate scalar is implicitly cast to a vector_mask type and > >> the binary op is applied accordingly. > >> > >> Bitwise ~: > >> > >> ~p1 > >> > >> Treats p1 as a bitmask and inverts all the bits of the bitmask. > >> > >> Bitwise >>, << : > >> > >> p1 >> <scalar immediate> > >> p1 >> <scalar int32 variable> > >> > >> Treats p1 as a bitmask. The shifter operand has to be a signed int32 > >> immediate. If the immediate is negative, the direction of the shift is > >> inverted. Behaviour for any value outside the range of 0..nelems-1 is > >> undefined. > >> > >> p1 >> p2 or p1 << p2 > >> > >> is not allowed. > >> > >> Logical ops - ==, !=, >, <, >=, <= > >> ---------------------------------- > >> > >> The following ops treat vector_mask as bitmask: > >> p1 == p2 > >> p1 != p2 > >> p1 == <scalar immediate> > >> p1 != <scalar immediate> > >> > >> The result of these operations is a bool. Note that the scalar > >> immediates will be implicitly converted to the LHS type of p1. Eg. if p1 > >> is v8sib, > >> > >> p1 == 0x3 > >> > >> will mean that 0x3 will represent lower numbered 2 lanes of v8sib are > >> true and the rest are false. > >> > >> >, <, >=, <= do not apply to the vector_mask. > >> > >> Ternary operator ?: > >> ------------------- > >> > >> p1 <logicalop> p2 ? s1 : s2; > >> > >> is allowed and p1 and p2 are treated as bitmasks. > >> > >> Conditional operators ||, && ! > >> ------------------------------ > >> > >> Here vector_mask is used as a bitmask scalar. So > >> > >> p1 != 0 || p2 == 0 > >> > >> treats p1 and p2 as scalar bitmasks. Similarly for && and !. > >> > >> Assignment ops =, <<=, >>= > >> -------------------------- > >> > >> The assignment operator is straightforward - it does a copy of the RHS > >> into a p1. Eg. > >> > >> p1 = p2 > >> > >> Copies the value of p2 into p1. If the types are different, there is no > >> implicit conversion from one to the other (except in cases mentioned > >> below). One will have to explicitly convert using > >> __builtin_convertvector (). So if p1 and p2 are different and if one > >> wants to copy p2 to p1, one has to write > >> > >> p1 = __builtin_convertvector (p2, typeof (p1)); > >> > >> __builtin_convertvector is implementation-defined. It is essential to > >> note p1 and p2 must have the same number of lanes irrespective of the > >> lane-size. Also, explicit conversion is not required if the lane-sizes > >> are the same for p1 and p2 along with the same number of elements. So > >> for eg. if p1 is v8sib and p2 is v8sfb, there is no explicit conversion > >> required. Same for v8sib and v8uib. > >> > >> <<= and >>= have similar operations. > >> > >> Increment Ops ++, -- > >> --------------------- > >> > >> NA > >> > >> Address-of & > >> ------------ > >> > >> Taking address of a vector_mask returns (vector bool *). > >> > >> sizeof () > >> -------- > >> > >> sizeof (vector_mask) = sizeof (vector bool) > >> > >> alignof () > >> ---------- > >> > >> See Alignment section above > >> > >> Typecast and implicit conversions > >> --------------------------------- > >> > >> typecast from one vector_mask type to another vector_mask type is only > >> possible using __builtin_convertvector () if, as explained above, the > >> lane-size are different. It is not possible to convert between vectors > >> of different nelems either way. > >> > >> Implicit conversions between two same-nelem vector_masks are possible > >> only if the lane-sizes are same. > >> > >> Literals and Initialization > >> --------------------------- > >> > >> There are two ways to initialize vector_mask objects - bitmask form and > >> constant array form. Eg. typedef v4si v4si __attribute__((vector_mask)); > >> > >> void foo () > >> { > >> v4sib p1 = 0xf; > >> > >> /* Do something. */ > >> > >> p1 = {1, 1, 1, 0}; > >> > >> ... > >> } > >> > >> The behaviour is undefined values other than 1 or 0 are used in the > >> constant array initializer. > >> > >> C++: > >> --- > >> > >> static_cast<target_type> (<source_expression>) > >> > >> LLVM allows static_cast<> where both vector sizes are same, but the > >> semantics are equal to reinterpret_cast<>. GNU does not allow > >> static_cast<> irrespective of source and target shapes. > >> > >> To be consistent, leave it unsupported for vector_mask too. > >> > >> dynamic_cast <target_type> (<source_expr>) > >> NA > >> > >> reinterpret_cast<> > >> Semantics are same as Clang's static_cast<> i.e. reinterpret the types > >> if both source and target type vectors are same size. > >> > >> const_cast<> > >> > >> Applies constness to a vector mask type pointer. > >> > >> #include <inttypes.h> > >> > >> typedef int32_t v16si __attribute__((__vector_size__(64))); > >> typedef v16si v16sib __attribute__((vector_mask)); > >> > >> __attribute__((noinline)) > >> const v16sib * foo (v16sib * a) > >> { > >> return const_cast<v16sib *> (a); > >> } > >> > >> new & delete > >> > >> For new, vector_mask types will return a pointer to vector_mask type and > >> allocate sizeof (vector bool) depending on the size of the vector bool > >> array in > >> bytes. For eg. typedef v16sib v16si __attribute__((vector_mask)); > >> > >> v16sib * foo() > >> { > >> return new v16sib; > >> } > >> > >> foo returns sizeof (vector bool (16))) i.e. 16 bytes. > >> > >> __attribute__((vector_mask)) 's conflation with GIMPLE > >> ------------------------------------------------------ > >> > >> __attribute__((vector_mask)) is a feature that has been elevated from > >> GIMPLE to the FE. In GIMPLE, the semantics are loosely-typed and > >> target-dependent i.e. different-shared vector mask types are allowed to > >> work with binary ops depending on which target we're compiling for. Eg. > >> > >> typedef v8sib v8si __attribute__((vector_mask)); > >> typedef v8hib v8hi __attribute__((vector_mask)); > >> > >> __GIMPLE v8sib foo (v8si a, v8si b, v8hi c, v8hi d) > >> { > >> v8sib psi = a > b; > >> v8hib phi = c > d; > >> > >> return psi | phi; // OK on amdgcn, but errors on aarch64! > >> } > >> > >> This dichotomy is acceptable as long as GIMPLE semantics don't change > >> and because the FE semantics are proposed to be more restrictive, its > >> becomes a subset of the functionality of GIMPLE semantics. This is the > >> current starting point, but going forward if there are scenarios where > >> we have to diverge from GIMPLE semantics, we have to discuss that on a > >> case-by-case basis. > >> > >> >