On Tue, Jul 9, 2024 at 11:45 AM Tejas Belagod <tejas.bela...@arm.com> wrote:
>
> On 7/8/24 4:45 PM, Richard Biener wrote:
> > On Mon, Jul 8, 2024 at 11:27 AM Tejas Belagod <tejas.bela...@arm.com> wrote:
> >>
> >> Hi,
> >>
> >> Sorry to have dropped the ball on
> >> https://gcc.gnu.org/pipermail/gcc-patches/2023-July/625535.html, but
> >> here I've tried to pick it up again and write up a strawman proposal for
> >> elevating __attribute__((vector_mask)) to the FE from GIMPLE.
> >>
> >>
> >> Thanks,
> >> Tejas.
> >>
> >> Motivation
> >> ----------
> >>
> >> The idea of packed boolean vectors came about when we wanted to support
> >> C/C++ operators on SVE ACLE types. The current vector boolean type that
> >> ACLE specifies does not adequately disambiguate vector lane sizes which
> >> they were derived off of. Consider this simple, albeit unrealistic, 
> >> example:
> >>
> >>     bool foo (svint32_t a, svint32_t b)
> >>     {
> >>       svbool_t p = a > b;
> >>
> >>       // Here p[2] is not the same as a[2] > b[2].
> >>       return p[2];
> >>     }
> >>
> >> In the above example, because svbool_t has a fixed 1-lane-per-byte, p[i]
> >> does not return the bool value corresponding to a[i] > b[i]. This
> >> necessitates a 'typed' vector boolean value that unambiguously
> >> represents results of operations
> >> of the same type.
> >>
> >> __attribute__((vector_mask))
> >> -----------------------------
> >>
> >> Note: If interested in historical discussions refer to:
> >> https://gcc.gnu.org/pipermail/gcc-patches/2023-July/625535.html
> >>
> >> We define this new attribute which when applied to a base data vector
> >> produces a new boolean vector type that represents a boolean type that
> >> is produced as a result of operations on the corresponding base vector
> >> type. The following is the syntax.
> >>
> >>     typedef int v8si __attribute__((vector_size (8 * sizeof (int)));
> >>     typedef v8si v8sib __attribute__((vector_mask));
> >>
> >> Here the 'base' data vector type is v8si or a vector of 8 integers.
> >>
> >> Rules
> >>
> >> • The layout/size of the boolean vector type is implementation-defined
> >> for its base data vector type.
> >>
> >> • Two boolean vector types who's base data vector types have same number
> >> of elements and lane-width have the same layout and size.
> >>
> >> • Consequently, two boolean vectors who's base data vector types have
> >> different number of elements or different lane-size have different layouts.
> >>
> >> This aligns with gnu vector extensions that generate integer vectors as
> >> a result of comparisons - "The result of the comparison is a vector of
> >> the same width and number of elements as the comparison operands with a
> >> signed integral element type." according to
> >>      https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html.
> >
> > Without having the time to re-review this all in detail I think the GNU
> > vector extension does not expose the result of the comparison as the
> > machine would produce it but instead a comparison "decays" to
> > a conditional:
> >
> > typedef int v4si __attribute__((vector_size(16)));
> >
> > v4si a;
> > v4si b;
> >
> > void foo()
> > {
> >    auto r = a < b;
> > }
> >
> > produces, with C23:
> >
> >    vector(4) int r =  VEC_COND_EXPR < a < b , { -1, -1, -1, -1 } , { 0,
> > 0, 0, 0 } > ;
> >
> > In fact on x86_64 with AVX and AVX512 you have two different "machine
> > produced" mask types and the above could either produce a AVX mask with
> > 32bit elements or a AVX512 mask with 1bit elements.
> >
> > Not exposing "native" mask types requires the compiler optimizing subsequent
> > uses and makes generic vectors difficult to combine with for example AVX512
> > intrinsics (where masks are just 'int').  Across an ABI boundary it's also
> > even more difficult to optimize mask transitions.
> >
> > But it at least allows portable code and it does not suffer from users 
> > trying to
> > expose machine representations of masks as input to generic vector code
> > with all the problems of constant folding not only requiring self-consistent
> > code within the compiler but compatibility with user produced constant 
> > masks.
> >
> > That said, I somewhat question the need to expose the target mask layout
> > to users for GCCs generic vector extension.
> >
>
> Thanks for your feedback.
>
> IIUC, I can imagine how having a GNU vector extension exposing the
> target vector mask layout can pose a challenge - maybe making it a
> generic GNU vector extension was too ambitious. I wonder if there's
> value in pursuing these alternate paths?
>
> 1. Can implementing this extension in a 'generic' way i.e. possibly not
> implement it with a target mask, but just a generic int vector, still
> maintain the consistency of GNU predicate vectors within the compiler? I
> know it may not seem very different from how boolean vectors are
> currently implemented (as in your above example), but, having the
> __attribute__((vector_mask)) as a 'property' of the object makes it
> useful to optimize its uses to target predicates in subsequent stages of
> the compiler.
>
> 2. Restricting __attribute__((vector_mask)) to apply only to target
> intrinsic types? Eg.
>
> On SVE something like:
> typedef svint16_t svpred16_t __attribute__((vector_mask)); // OK.
>
> On AVX, something like:
> typedef __m256i __mask32 __attribute__((vector_mask)); // OK - though
> this would require more fine-grained defn of lane-size to mask-bits mapping.

I think the target should be able to register builtin types already which
intrinsics could use.  There is already the vector_mask attribute but only
for GIMPLE and it has the same limitation of querying the target for the
actual mode being used - for AVX vs AVX512 one might be able to
combine this with a mode attribute.  Not sure if on arm you can parse
__attribute__((mode("Vx4BI4"))) or how the modes are called.

But when you are talking about intrinsics I'd really suggest to leave the
type creation to the target rather than trying to do a typedef in a header?

Richard.

> Would not be allowed on GNU Vector Extensions:
> typedef v4si v4sib __attribute__((vector_mask)); // Error - vector_mask
> can't be a generic GNU vector extension!
>
>
> Thanks,
> Tejas.
>
>
> >> Producers and Consumers of PBV
> >> ------------------------------
> >>
> >> With GNU vector extensions, comparisons produce boolean vectors;
> >> conditional and bitwise operators consume them. Comparison producers
> >> generate signed integer vectors of the same lane-width as the operands
> >> of the comparison operator. This means conditionals and bitwise
> >> operators cannot be applied to mixed vectors that are a result of
> >> different width operands. Eg.
> >>
> >>     v8hi foo (v8si a, v8si b, v8hi c, v8hi d, v8sf e, v8sf f)
> >>     {
> >>       return a > b || c > d; // error!
> >>       return a > b || e < f; // OK - no explicit conversion needed.
> >>       return a > b || __builtin_convertvector (c > d, v8si); // OK.
> >>       return a | b && c | d; // error!
> >>       return a | b && __builtin_convertvector (c | d, v8si); // OK.
> >>     }
> >>
> >> __builtin_convertvector () needs to be applied to convert vectors to the
> >> type one wants to do the comparison in. IoW, the integer vectors that
> >> represent boolean vectors are 'strictly-typed'. If we extend these rules
> >> to vector_mask, this will look like:
> >>
> >>     typedef v8sib v8si __attribute__((vector_mask));
> >>     typedef v8hib v8hi __attribute__((vector_mask));
> >>     typedef v8sfb v8sf __attribute__((vector_mask));
> >>
> >>     v8sib foo (v8si a, v8si b, v8hi c, v8hi d, v8sf e, v8sf f)
> >>     {
> >>       v8sib psi = a > b;
> >>       v8hib phi = c > d;
> >>       v8sfb psf = e < f;
> >>
> >>       return psi || phi; // error!
> >>       return psi || psf; // OK - no explicit conversion needed.
> >>       return psi || __builtin_convertvector (phi, v8sib); // OK.
> >>       return psi | phi; // error!
> >>       return psi | __builtin_convertvector (phi, v8sib); // OK.
> >>       return psi | psf; // OK - no explicit conversion needed.
> >>     }
> >>
> >> Now according to the rules explained above, v8sib and v8hib will have
> >> different layouts (which is why they can't be used directly without
> >> conversion if used as operands of operations). OTOH, the same rules
> >> dictate that the layout of, say v8sib and v8sfb, where v8sfb is the
> >> float base data vector equivalent of v8sib which when applied ensure
> >> that v8sib and v8sfb have the same layout and hence can be used as
> >> operands of operators without explicit conversion. This aligns with the
> >> GNU vector extensions rules where comparison of 2 v8sf vectors results
> >> in a v8si of the same lane-width and number of elements as that would
> >> result in comparison of 2 v8si vectors.
> >>
> >> Application of vector_mask to sizeless types
> >> --------------------------------------------
> >>
> >> __attribute__((vector_mask)) has the advantage that it can be applied to
> >> sizeless types seamlessly.  When __attribute__((vector_mask)) is applied
> >> to a data vector that is a sizeless type, the resulting vector mask also
> >> becomes a sizeless type.
> >> Eg.
> >>
> >>     typedef svpred16_t svint16_t __attribute__((vector_mask));
> >>
> >> This is equivalent of
> >>
> >>     typedef vNhib vNhi __attribute__((vector_mask));
> >>
> >> where N could be 8, 16, 32 etc.
> >>
> >> The resulting type is a scalable boolean vector type, i.e svint8_t. The
> >> resulting boolean vector type has the same behavior as the scalar type
> >> svint8_t. While svint8_t can represent a scalable bool vector, we need a
> >> scalable scalar type to represent the bit-mask variant of the opaque
> >> type that represents the bool vector. I haven't thought this through,
> >> but I suspect it will be implemented as a 'typed' variant of svbool_t.
> >>
> >> ABI
> >> ---
> >>
> >> Given the new opaque type, it needs rules that define PCS, storage
> >> layout in aggregates and alignment.
> >>
> >> PCS
> >> ---
> >>
> >> GNU vector extension type parameters are always passed on the stack.
> >> Similarly vector_mask applied to GNU base data vector type parameters
> >> will also be passed on  the stack. The format to pass on the stack will
> >> always be a canonical format - an opaque type where the internal
> >> representation can be implementation-defined.
> >>
> >> The canonical form of the argument could be a boolean vector. This
> >> boolean vector will be passed on the stack just like other GNU vectors.
> >> vector bool is convenient for a callee to synthesize into a predicate
> >> (irrespective of the target i.e. NEON, SVE, AVX) using target instructions.
> >>
> >> If the base data vector is an ACLE type, if the canonical bool vector we
> >> choose is svint8_t or a typed svbool_t we could apply the same rules as
> >> ABI for the said type.
> >>
> >> Alignment
> >> ---------
> >>
> >> For boolean vector in memory, their alignment will be the natural
> >> alignment as defined by the AAPCS64 i.e. 8 and 16 bytes for Short
> >> Vectors and 16 bytes for scalable vectors.
> >>
> >> Aggregates
> >> ----------
> >>
> >> For fixed size vectors, the type resulting from applying
> >> __attribute__((vector_mask)) is a vector of booleans IoW a vNqi.
> >> Therefore the same rules apply as would apply to a GNU vector with 8-bit
> >> elements of the same size in an aggregate.  For scalable GNU boolean
> >> vectors in aggregates, it acts as a Pure scalable type svint8_t and the
> >> ABI rules from Section 5.10 of AAPCS64 apply.
> >>
> >> Operation Semantics
> >> -------------------
> >>
> >> What should be the data structure of the vector mask type? This seems to
> >> be the main consideration. As suggested by Richard in
> >> https://gcc.gnu.org/pipermail/gcc-patches/2023-July/625535.html the idea
> >> is to have an opaque type to have control over operations and
> >> observability. This means that the internal representation can be a bit
> >> mask, but based on the operator being applied to it, the mask can
> >> 'decay' to another operator-friendly data structure.
> >>
> >> vector_mask has 2 forms that is chosen based on the context. It lives as
> >> a mask and a vector bool. Here we describe its behaviour in various
> >> contexts.
> >>
> >> Arithmetic ops
> >> --------------
> >>
> >> These don't apply as the values are essentially binary.
> >>
> >> Bitwise ops -  &, ^, |, ~, >>, <<
> >> ---------------------------------
> >>
> >> Here vector_mask acts as a scalar bitmask. Applying bitwise ops is like
> >> another scalar operation.
> >>
> >> If p1 and p2 are vector_mask types of type:
> >>
> >>          typedef v8sib v8si __attribute__((vector_mask));
> >>
> >> Bitwise &, | and ^
> >> ------------------
> >>
> >>     p1 & p2
> >>
> >> Here p1 and p2 act as integer type bitmasks where each bit represents a
> >> vector lane of the data vector type. LSBit representing the lowest
> >> numbered lane and MSBit representing the highest numbered lane.
> >>
> >>     p1 & <scalar immediate>
> >>
> >> Here the immediate scalar is implicitly cast to a vector_mask type and
> >> the binary op is applied accordingly.
> >>
> >> Bitwise ~:
> >>
> >>     ~p1
> >>
> >> Treats p1 as a bitmask and inverts all the bits of the bitmask.
> >>
> >> Bitwise >>, << :
> >>
> >>     p1 >> <scalar immediate>
> >>     p1 >> <scalar int32 variable>
> >>
> >> Treats p1 as a bitmask. The shifter operand has to be a signed int32
> >> immediate. If the immediate is negative, the direction of the shift is
> >> inverted. Behaviour for any value outside the range of 0..nelems-1 is
> >> undefined.
> >>
> >>     p1 >> p2 or p1 << p2
> >>
> >> is not allowed.
> >>
> >> Logical ops - ==, !=, >, <, >=, <=
> >> ----------------------------------
> >>
> >> The following ops treat vector_mask as bitmask:
> >>     p1 == p2
> >>     p1 != p2
> >>     p1 == <scalar immediate>
> >>     p1 != <scalar immediate>
> >>
> >> The result of these operations is a bool. Note that the scalar
> >> immediates will be implicitly converted to the LHS type of p1. Eg. if p1
> >> is v8sib,
> >>
> >>     p1 == 0x3
> >>
> >> will mean that 0x3 will represent lower numbered 2 lanes of v8sib are
> >> true and the rest are false.
> >>
> >>   >, <, >=, <= do not apply to the vector_mask.
> >>
> >> Ternary operator ?:
> >> -------------------
> >>
> >>     p1 <logicalop> p2 ? s1 : s2;
> >>
> >> is allowed and p1 and p2 are treated as bitmasks.
> >>
> >> Conditional operators ||, && !
> >> ------------------------------
> >>
> >> Here vector_mask is used as a bitmask scalar. So
> >>
> >>     p1 != 0 || p2 == 0
> >>
> >> treats p1 and p2 as scalar bitmasks. Similarly for && and !.
> >>
> >> Assignment ops =, <<=, >>=
> >> --------------------------
> >>
> >> The assignment operator is straightforward - it does a copy of the RHS
> >> into a p1. Eg.
> >>
> >>     p1 = p2
> >>
> >> Copies the value of p2 into p1. If the types are different, there is no
> >> implicit conversion from one to the other (except in cases mentioned
> >> below). One will have to explicitly convert using
> >> __builtin_convertvector (). So if p1 and p2 are different and if one
> >> wants to copy p2 to p1, one has to write
> >>
> >>     p1 = __builtin_convertvector (p2, typeof (p1));
> >>
> >> __builtin_convertvector is implementation-defined. It is essential to
> >> note p1 and p2 must have the same number of lanes irrespective of the
> >> lane-size. Also, explicit conversion is not required if the lane-sizes
> >> are the same for p1 and p2 along with the same number of elements. So
> >> for eg. if p1 is v8sib and p2 is v8sfb, there is no explicit conversion
> >> required. Same for v8sib and v8uib.
> >>
> >> <<= and >>= have similar operations.
> >>
> >> Increment Ops ++, --
> >> ---------------------
> >>
> >> NA
> >>
> >> Address-of &
> >> ------------
> >>
> >> Taking address of a vector_mask returns (vector bool *).
> >>
> >> sizeof ()
> >> --------
> >>
> >> sizeof (vector_mask) = sizeof (vector bool)
> >>
> >> alignof ()
> >> ----------
> >>
> >> See Alignment section above
> >>
> >> Typecast and implicit conversions
> >> ---------------------------------
> >>
> >> typecast from one vector_mask type to another vector_mask type is only
> >> possible using __builtin_convertvector () if, as explained above, the
> >> lane-size are different. It is not possible to convert between vectors
> >> of different nelems either way.
> >>
> >> Implicit conversions between two same-nelem vector_masks are possible
> >> only if the lane-sizes are same.
> >>
> >> Literals and Initialization
> >> ---------------------------
> >>
> >> There are two ways to initialize vector_mask objects - bitmask form and
> >> constant array form. Eg.  typedef v4si v4si __attribute__((vector_mask));
> >>
> >>     void foo ()
> >>     {
> >>       v4sib p1 = 0xf;
> >>
> >>       /* Do something. */
> >>
> >>       p1 = {1, 1, 1, 0};
> >>
> >>       ...
> >>     }
> >>
> >> The behaviour is undefined values other than 1 or 0 are used in the
> >> constant array initializer.
> >>
> >> C++:
> >> ---
> >>
> >> static_cast<target_type> (<source_expression>)
> >>
> >> LLVM allows static_cast<> where both vector sizes are same, but the
> >> semantics are equal to reinterpret_cast<>. GNU does not allow
> >> static_cast<> irrespective of source and target shapes.
> >>
> >> To be consistent, leave it unsupported for vector_mask too.
> >>
> >> dynamic_cast <target_type> (<source_expr>)
> >> NA
> >>
> >> reinterpret_cast<>
> >> Semantics are same as Clang's static_cast<> i.e. reinterpret the types
> >> if both source and target type vectors are same size.
> >>
> >> const_cast<>
> >>
> >> Applies constness to a vector mask type pointer.
> >>
> >>     #include <inttypes.h>
> >>
> >>     typedef int32_t v16si __attribute__((__vector_size__(64)));
> >>     typedef v16si v16sib __attribute__((vector_mask));
> >>
> >>     __attribute__((noinline))
> >>     const v16sib * foo (v16sib * a)
> >>     {
> >>       return const_cast<v16sib *> (a);
> >>     }
> >>
> >> new & delete
> >>
> >> For new, vector_mask types will return a pointer to vector_mask type and
> >> allocate sizeof (vector bool) depending on the size of the vector bool
> >> array in
> >> bytes. For eg.  typedef v16sib v16si __attribute__((vector_mask));
> >>
> >>     v16sib * foo()
> >>     {
> >>       return new v16sib;
> >>     }
> >>
> >> foo returns sizeof (vector bool (16))) i.e. 16 bytes.
> >>
> >> __attribute__((vector_mask)) 's conflation with GIMPLE
> >> ------------------------------------------------------
> >>
> >> __attribute__((vector_mask)) is a feature that has been elevated from
> >> GIMPLE to the FE. In GIMPLE, the semantics are loosely-typed and
> >> target-dependent i.e. different-shared vector mask types are allowed to
> >> work with binary ops depending on which target we're compiling for. Eg.
> >>
> >>     typedef v8sib v8si __attribute__((vector_mask));
> >>     typedef v8hib v8hi __attribute__((vector_mask));
> >>
> >>     __GIMPLE v8sib foo (v8si a, v8si b, v8hi c, v8hi d)
> >>     {
> >>       v8sib psi = a > b;
> >>       v8hib phi = c > d;
> >>
> >>       return psi | phi; // OK on amdgcn, but errors on aarch64!
> >>     }
> >>
> >> This dichotomy is acceptable as long as GIMPLE semantics don't change
> >> and because the FE semantics are proposed to be more restrictive, its
> >> becomes a subset of the functionality of GIMPLE semantics. This is the
> >> current starting point, but going forward if there are scenarios where
> >> we have to diverge from GIMPLE semantics, we have to discuss that on a
> >> case-by-case basis.
> >>
> >>
>

Reply via email to