Matthias Kretz <m.kr...@gsi.de> writes:
> Hi Richard,
>
> sorry for not answering sooner. I took action on your mail but failed to also
> give feedback. Now, in light of your veto of Srinivas' patch, I wanted to use
> the opportunity to pick this up again.
>
> On Dienstag, 23. Januar 2024 21:57:23 CET Richard Sandiford wrote:
>> However, we also support different vector lengths for streaming SVE
>> (running in "streaming" mode on SME) and non-streaming SVE (running
>> in "non-streaming" mode on the core).  Having two different lengths is
>> expected to be the common case, rather than a theoretical curiosity.
>
> I read up on this after you mentioned this for the first time. As a WG21 
> member I find the approach troublesome - but that's a bit off-topic for this 
> thread.
>
> The big issue here is that, IIUC, a user (and the simd library) cannot do the 
> right thing at the moment. There simply isn't enough context information 
> available when parsing the <experimental/simd> header. I.e. on definition of 
> the class template there's no facility to take target_clones or SME 
> "streaming" mode into account. Consequently, if we want the library to be fit 
> for SME, then we need more language extension(s) to make it work.

Yeah.  I think the same applies to plain SVE.  It seems reasonable to
have functions whose implementation is specialised for a specific SVE
length, with that function being selected at runtime where appropriate.
Those functions needn't (in principle) be in separate TUs.  The “best”
definition of native<float> then becomes a per-function property rather
than a per-TU property.
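
To make the per-function idea concrete, here is a minimal, portable sketch
of "one implementation per vector length, selected at runtime", all in a
single TU.  Everything in it is hypothetical: fixed_vec, sum_impl, sum and
the vl_bytes parameter are made-up names, the loops are scalar stand-ins
rather than real SVE code, and vl_bytes stands in for whatever mechanism
would report the hardware vector length.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a fixed-width simd type: Bytes plays the role
// of the vector length that native<float> would be tied to.
template <std::size_t Bytes>
struct fixed_vec {
    static constexpr std::size_t lanes = Bytes / sizeof(float);
};

// One implementation per vector length.  Each instantiation sees its own
// "native" width; scalar loops keep the sketch portable.
template <std::size_t Bytes>
float sum_impl(const float* data, std::size_t n) {
    assert(data != nullptr || n == 0);
    constexpr std::size_t lanes = fixed_vec<Bytes>::lanes;
    float acc[lanes] = {};
    std::size_t i = 0;
    // Full "vectors" first.
    for (; i + lanes <= n; i += lanes)
        for (std::size_t l = 0; l < lanes; ++l)
            acc[l] += data[i + l];
    // Horizontal reduction, then the scalar tail.
    float total = 0.0f;
    for (std::size_t l = 0; l < lanes; ++l)
        total += acc[l];
    for (; i < n; ++i)
        total += data[i];
    return total;
}

// Runtime selection between the specialised definitions.  vl_bytes is an
// assumed input, not a real query of the hardware.
float sum(const float* data, std::size_t n, std::size_t vl_bytes) {
    switch (vl_bytes) {
        case 64: return sum_impl<64>(data, n);
        case 32: return sum_impl<32>(data, n);
        default: return sum_impl<16>(data, n);
    }
}
```

The point of the sketch is only that the "native" width is a property of
each sum_impl instantiation, not of the TU that contains them.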

As you note later, I think the same thing would apply to x86_64.

> I guess I'm looking for a way to declare types that are different depending
> on whether they are used in streaming mode or non-streaming mode (making them
> ill-formed to use in functions marked arm_streaming_compatible).
>
> From reading through
> https://arm-software.github.io/acle/main/acle.html#controlling-the-use-of-streaming-mode
> I don't see any discussion of member functions or ctor/dtor, static and
> non-static data members, etc.
>
> The big issue I see here is that currently all of std::* is declared without
> arm_streaming or arm_streaming_compatible. Thus, IIUC, you can't use anything
> from the standard library in streaming mode. Since that also applies to
> std::experimental::simd, we're not creating a new footgun, only missing out
> on potential users?

Kind-of.  However, we can inline a non-streaming function into a streaming
function if that doesn't change defined behaviour.  And that's important
in practice for C++, since most trivial inline functions will not be
marked streaming-compatible despite being so in practice.

It's UB to pass and return SVE vectors across streaming/non-streaming
boundaries unless the two VLs are equal.  It's therefore valid to inline
such functions into streaming functions *unless* the callee uses
non-streaming-only instructions such as gather loads.

Because of that, someone trying to use std::experimental::simd in SME
functions is likely to succeed, at least in simple cases.

> Some more thoughts on target_clones/streaming SVE language extension 
> evolution:
>
>   void nonstreaming_fn(void) {
>     constexpr int width = __arm_sve_bits(); // e.g. 512
>     constexpr int width2 = __builtin_vector_size(); // e.g. 64 (the
>       // vector_size attribute works with bytes, not bits)
>   }
>
>   __attribute__((arm_locally_streaming))
>   void streaming_fn(void) {
>     constexpr int width = __arm_sve_bits(); // e.g. 128
>     constexpr int width2 = __builtin_vector_size(); // e.g. 16
>   }
>
>   __attribute__((target_clones("sse4.2,avx2")))
>   void streaming_fn(void) {
>     constexpr int width = __builtin_vector_size(); // 16 in the sse4.2 clone
>       // and 32 in the avx2 clone
>   }
>
> ... as a starting point for exploration. Given this, I'd still have to resort 
> to a macro to define a "native" simd type:
>
> #define NATIVE_SIMD(T) std::experimental::simd<T, \
>     _SveAbi<__arm_sve_bits() / CHAR_BIT, __arm_sve_bits() / CHAR_BIT>>
>
> Getting rid of the macro seems to be even harder.

Yeah.  The constexprs in the AArch64 functions would only be compile-time
constants in to-be-defined circumstances, using some mechanism to specify
the streaming and non-streaming vector lengths at compile time.  But that's
a premise of the whole discussion; I'm just noting it for the record in case
anyone reading this later jumps in at this point.

> A declaration of an alias like
>
> template <typename T>
> using SveSimd = std::experimental::simd<T,
>     _SveAbi<__arm_sve_bits() / CHAR_BIT, __arm_sve_bits() / CHAR_BIT>>;
>
> would have to delay "invoking" __arm_sve_bits() until it knows its context:
>
>   void nonstreaming_fn(void) {
>     static_assert(sizeof(SveSimd<float>) == 64);
>   }
>
>   __attribute__((arm_locally_streaming))
>   void streaming_fn(void) {
>     static_assert(sizeof(SveSimd<float>) == 16);
>     nonstreaming_fn(); // fine
>   }
>
> This gets even worse for target_clones, where
>
>   void f() {
>     sizeof(std::simd<float>) == ?
>   }
>
>   __attribute__((target_clones("sse4.2,avx2")))
>   void g() {
>     f();
>   }
>
> the compiler *must* virally apply target_clones to all functions it calls.
> And member functions must either also get cloned as functions, or the whole
> type must be cloned (as in the std::simd case, where the sizeof needs to
> change). 😳

Yeah, tricky :)

It's also not just about vector widths.  The target-clones case also has
the problem that you cannot detect at include time which features are
available.  E.g. “do I have SVE2-specific instructions?” becomes a
contextual question rather than a global question.

Fortunately, this should just be a missed optimisation.  But it would be
nice if uses of std::simd in SVE2 clones could take advantage of SVE2-only
instructions, even if SVE2 wasn't enabled at include time.
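
As a rough illustration of why the question is global today, here is a
small sketch, assuming the library keys its choice off the ACLE feature
macro __ARM_FEATURE_SVE2 when the header is parsed.  The names
sve2_at_include_time, library_uses_sve2, caller_in_baseline and
caller_in_sve2_clone are hypothetical, and the "clone" is only imagined:
no target attribute is applied, so the code builds on any target.

```cpp
#include <cassert>

// Evaluated once, when the "header" part is parsed.  The answer is baked
// in for the rest of the TU.
#if defined(__ARM_FEATURE_SVE2)
constexpr bool sve2_at_include_time = true;
#else
constexpr bool sve2_at_include_time = false;
#endif

// Hypothetical library helper: its code path was chosen at include time.
inline bool library_uses_sve2() { return sve2_at_include_time; }

// A caller compiled for the baseline target.
bool caller_in_baseline() { return library_uses_sve2(); }

// Imagine this one is an SVE2 clone of the same function.  The helper
// still reports the include-time answer; nothing in this function's
// context can change it.
bool caller_in_sve2_clone() {
    assert(library_uses_sve2() == sve2_at_include_time);
    return library_uses_sve2();
}
```

Both callers get the same answer, which is the missed optimisation: the
helper has no way to ask "what is enabled in the function I was inlined
into?".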

Thanks for the other (snipped) clarifications.  They were really helpful,
but I didn't have anything to add.

Richard
