Matthias Kretz <m.kr...@gsi.de> writes:
> Hi Richard,
>
> sorry for not answering sooner. I took action on your mail but failed to also
> give feedback. Now in light of your veto of Srinivas patch I wanted to use the
> opportunity to pick this up again.
>
> On Tuesday, 23 January 2024 21:57:23 CET Richard Sandiford wrote:
>> However, we also support different vector lengths for streaming SVE
>> (running in "streaming" mode on SME) and non-streaming SVE (running
>> in "non-streaming" mode on the core). Having two different lengths is
>> expected to be the common case, rather than a theoretical curiosity.
>
> I read up on this after you mentioned this for the first time. As a WG21
> member I find the approach troublesome - but that's a bit off-topic for this
> thread.
>
> The big issue here is that, IIUC, a user (and the simd library) cannot do the
> right thing at the moment. There simply isn't enough context information
> available when parsing the <experimental/simd> header. I.e. on definition of
> the class template there's no facility to take target_clones or SME
> "streaming" mode into account. Consequently, if we want the library to be fit
> for SME, then we need more language extension(s) to make it work.

Yeah.  I think the same applies to plain SVE.  It seems reasonable to
have functions whose implementation is specialised for a specific SVE
length, with that function being selected at runtime where appropriate.
Those functions needn't (in principle) be in separate TUs.  The “best”
definition of native<float> then becomes a per-function property rather
than a per-TU property.

As you note later, I think the same thing would apply to x86_64.

> I guess I'm looking for a way to declare types that are different depending on
> whether they are used in streaming mode or non-streaming mode (making them
> ill-formed to use in functions marked arm_streaming_compatible).
>
> From reading through
> https://arm-software.github.io/acle/main/acle.html#controlling-the-use-of-streaming-mode
> I don't see any discussion of member functions or ctor/dtor, static and
> non-static data members, etc.
>
> The big issue I see here is that currently all of std::* is declared without
> an arm_streaming or arm_streaming_compatible. Thus, IIUC, you can't use
> anything from the standard library in streaming mode. Since that also applies
> to std::experimental::simd, we're not creating a new footgun, only missing out
> on potential users?

Kind-of.  However, we can inline a non-streaming function into a
streaming function if that doesn't change defined behaviour.  And that's
important in practice for C++, since most trivial inline functions will
not be marked streaming-compatible despite being so in practice.

It's UB to pass and return SVE vectors across streaming/non-streaming
boundaries unless the two VLs are equal.  It's therefore valid to inline
such functions into streaming functions *unless* the callee uses
non-streaming-only instructions such as gather loads.

Because of that, someone trying to use std::experimental::simd in SME
functions is likely to succeed, at least in simple cases.

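To make the "per-function rather than per-TU" point above concrete,
here is a minimal sketch in plain C++ with no target extensions at all.
It only models the idea by passing the vector length as an explicit
template argument.  All names in it (fixed_simd, sum,
runtime_vector_bytes, sum_dispatch) are hypothetical and are not part
of std::experimental::simd, ACLE or GCC.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a simd type whose width is chosen by the
// context that uses it rather than fixed when the header is parsed.
template <typename T, std::size_t Bytes>
struct fixed_simd {
    static constexpr std::size_t size = Bytes / sizeof(T);
    std::array<T, size> lanes{};
};

// One implementation per vector length, all in the same TU.  The "native"
// type is a property of the function (its template argument).
template <std::size_t Bytes>
float sum(const float* data, std::size_t n) {
    using V = fixed_simd<float, Bytes>;
    V acc;
    std::size_t i = 0;
    // Full vectors: lane-wise accumulation.
    for (; i + V::size <= n; i += V::size)
        for (std::size_t l = 0; l < V::size; ++l)
            acc.lanes[l] += data[i + l];
    // Horizontal reduction, then the scalar tail.
    float total = 0.0f;
    for (float x : acc.lanes) total += x;
    for (; i < n; ++i) total += data[i];
    return total;
}

// Placeholder (assumption): a real implementation would ask the hardware
// for its vector length.  Here it simply reports 16 bytes.
std::size_t runtime_vector_bytes() { return 16; }

// Select the length-specialised implementation at run time.
float sum_dispatch(const float* data, std::size_t n) {
    assert(data != nullptr || n == 0);
    switch (runtime_vector_bytes()) {
        case 64: return sum<64>(data, n);
        case 32: return sum<32>(data, n);
        default: return sum<16>(data, n);
    }
}
```

What the sketch cannot show is the part under discussion here: with
the explicit template argument the user threads the context through by
hand, whereas a language extension would have to supply it implicitly.
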
> Some more thoughts on target_clones/streaming SVE language extension
> evolution:
>
> void nonstreaming_fn(void) {
>   constexpr int width = __arm_sve_bits(); // e.g. 512
>   constexpr int width2 = __builtin_vector_size(); // e.g. 64 (the
>   // vector_size attribute works with bytes, not bits)
> }
>
> __attribute__((arm_locally_streaming))
> void streaming_fn(void) {
>   constexpr int width = __arm_sve_bits(); // e.g. 128
>   constexpr int width2 = __builtin_vector_size(); // e.g. 16
> }
>
> __attribute__((target_clones("sse4.2,avx2")))
> void streaming_fn(void) {
>   constexpr int width = __builtin_vector_size(); // 16 in the sse4.2 clone
>   // and 32 in the avx2 clone
> }
>
> ... as a starting point for exploration. Given this, I'd still have to resort
> to a macro to define a "native" simd type:
>
> #define NATIVE_SIMD(T) \
>   std::experimental::simd<T, _SveAbi<__arm_sve_bits() / CHAR_BIT, \
>                                      __arm_sve_bits() / CHAR_BIT>>
>
> Getting rid of the macro seems to be even harder.

Yeah.  The constexprs in the AArch64 functions would only be compile-time
constants in to-be-defined circumstances, using some mechanism to specify
the streaming and non-streaming vector lengths at compile time.  But that's
a premise of the whole discussion, just noting it for the record in case
anyone reading this later jumps in at this point.

> A declaration of an alias like
>
> template <typename T>
> using SveSimd = std::experimental::simd<T, _SveAbi<__arm_sve_bits() / CHAR_BIT,
>                                                    __arm_sve_bits() / CHAR_BIT>>;
>
> would have to delay "invoking" __arm_sve_bits() until it knows its context:
>
> void nonstreaming_fn(void) {
>   static_assert(sizeof(SveSimd<float>) == 64);
> }
>
> __attribute__((arm_locally_streaming))
> void streaming_fn(void) {
>   static_assert(sizeof(SveSimd<float>) == 16);
>   nonstreaming_fn(); // fine
> }
>
> This gets even worse for target_clones, where
>
> void f() {
>   sizeof(std::simd<float>) == ?
> }
>
> __attribute__((target_clones("sse4.2,avx2")))
> void g() {
>   f();
> }
>
> the compiler *must* virally apply target_clones to all functions it calls. And
> member functions must either also get cloned as functions, or the whole type
> must be cloned (as in the std::simd case, where the sizeof needs to change). 😳

Yeah, tricky :)

It's also not just about vector widths.  The target-clones case also has
the problem that you cannot detect at include time which features are
available.  E.g. “do I have SVE2-specific instructions?” becomes a
contextual question rather than a global question.

Fortunately, this should just be a missed optimisation.  But it would be
nice if uses of std::simd in SVE2 clones could take advantage of SVE2-only
instructions, even if SVE2 wasn't enabled at include time.

Thanks for the other (snipped) clarifications.  They were really helpful,
but I didn't have anything to add.

Richard