On Wednesday, 27 March 2024 14:34:52 CET Richard Sandiford wrote:
> Matthias Kretz <m.kr...@gsi.de> writes:
> > The big issue here is that, IIUC, a user (and the simd library) cannot
> > do the right thing at the moment. There simply isn't enough context
> > information available when parsing the <experimental/simd> header.
> > I.e. on definition of the class template there's no facility to take
> > target_clones or SME "streaming" mode into account. Consequently, if
> > we want the library to be fit for SME, then we need more language
> > extension(s) to make it work.
>
> Yeah.  I think the same applies to plain SVE.
By "plain SVE" you mean the *scalable* part of it, right? BTW, I've
experimented with implementing simd<T> basically as

template <typename T, int N>
class simd
{
  alignas(bit_ceil(sizeof(T) * N)) T data[N];
  // ...
};

See here: https://compiler-explorer.com/z/WW6KqanTW

Maybe the compiler can get better at optimizing this approach. But for now
it's not a solution for a *scalable* variant, because all code is going to
be load/store bound from the get-go.

@Srinivas: See the guard variables for __index0123? They need to go. I
believe you can and should declare them `constexpr` (small example at the
end of this mail).

> It seems reasonable to have functions whose implementation is specialised
> for a specific SVE length, with that function being selected at runtime
> where appropriate.  Those functions needn't (in principle) be in separate
> TUs.  The “best” definition of native<float> then becomes a per-function
> property rather than a per-TU property.

Hmm, I never considered this; but can one actually write fixed-length SVE
code if -msve-vector-bits is not given? Then it's certainly possible to
write a single TU with a runtime dispatch for all the different SVE widths.
(This is less interesting on x86, where we need to dispatch on ISA
extensions *and* vector width. It's much simpler (and safer) to compile a TU
multiple times, each restricted to a certain set of ISA extensions, and then
dispatch to the right translation from some general code section.)

> As you note later, I think the same thing would apply to x86_64.

Yes. I don't think "same" is the case (yet), but it's very similar. Once ARM
is at SVE9 :) and binaries need to support HW from SVE2 up to SVE9, it gets
closer to "same".

> > The big issue I see here is that currently all of std::* is declared
> > without an arm_streaming or arm_streaming_compatible. Thus, IIUC, you
> > can't use anything from the standard library in streaming mode. Since
> > that also applies to std::experimental::simd, we're not creating a new
> > footgun, only missing out on potential users?
>
> Kind-of.  However, we can inline a non-streaming function into a streaming
> function if that doesn't change defined behaviour.  And that's important
> in practice for C++, since most trivial inline functions will not be
> marked streaming-compatible despite being so in practice.

Ah, good to know that the compiler takes a pragmatic approach here. But I
imagine this could become a source of confusion for users.

> > [...]
> > the compiler *must* virally apply target_clones to all functions it
> > calls. And member functions must either also get cloned as functions,
> > or the whole type must be cloned (as in the std::simd case, where the
> > sizeof needs to change). 😳
>
> Yeah, tricky :)
>
> It's also not just about vector widths.  The target-clones case also has
> the problem that you cannot detect at include time which features are
> available.  E.g. “do I have SVE2-specific instructions?” becomes a
> contextual question rather than a global question.
>
> Fortunately, this should just be a missed optimisation.  But it would be
> nice if uses of std::simd in SVE2 clones could take advantage of SVE2-only
> instructions, even if SVE2 wasn't enabled at include time.

Exactly. Even if we solve the scalable vector-length question, the
target_clones question stays relevant. So far my best answer, for x86 at
least, is to compile the SIMD code multiple times into different shared
libraries and then let the dynamic linker pick the right library variant
depending on the CPU. I'd be happy to have something simpler that works
right out of the box.
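For reference, roughly what the per-function route could look like on x86
with GCC's target_clones. This is only a sketch; the kernel `scale` and the
ISA list are made up for illustration, not something from the library:

#include <cstddef>
#include <experimental/simd>
namespace stdx = std::experimental;

// GCC emits one clone per listed target plus an ifunc resolver; the dynamic
// loader then picks the variant matching the host CPU (via ifunc).
__attribute__((target_clones("default", "avx2", "avx512f")))
void scale(float* data, std::size_t n, float factor)
{
  using V = stdx::native_simd<float>;  // width was fixed when the header was
                                       // parsed, not per clone
  std::size_t i = 0;
  for (; i + V::size() <= n; i += V::size())
  {
    V v(&data[i], stdx::element_aligned);
    v *= factor;
    v.copy_to(&data[i], stdx::element_aligned);
  }
  for (; i < n; ++i)  // scalar epilogue
    data[i] *= factor;
}

The resolver picks a clone per CPU, but note the comment on native_simd: its
width was baked in at include time, so the avx2/avx512f clones don't
automatically get wider vectors; that is exactly the missed optimisation
discussed above, and everything the kernel calls must be inlined or cloned
as well.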
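And to make the __index0123 remark concrete, a tiny stand-alone illustration
(not the actual libstdc++ code; vec4 and make_index0123 are invented for the
example) of why `constexpr` makes the guard variables disappear:

struct vec4 { int v[4]; };

// Defined elsewhere and not constexpr (think of an intrinsic like
// svindex_s32): a call to it is not a constant expression.
vec4 make_index0123();

// Dynamic initialization: the compiler typically needs an init function plus
// a guard variable (mangled _ZGV...) so the initializer runs exactly once,
// no matter how many TUs contain the definition.
inline const vec4 index0123_guarded = make_index0123();

// Constant initialization: the bytes go straight into .rodata, no guard and
// no runtime initialization at all.
inline constexpr vec4 index0123 = {{0, 1, 2, 3}};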
Best,
  Matthias

-- 
──────────────────────────────────────────────────────────────────────────
 Dr. Matthias Kretz                           https://mattkretz.github.io
 GSI Helmholtz Center for Heavy Ion Research               https://gsi.de
 std::simd
──────────────────────────────────────────────────────────────────────────