On Sunday, 10 December 2023 14:29:45 CET Richard Sandiford wrote:
> Thanks for the patch and sorry for the slow review.

Sorry for my slow reaction. I needed a long vacation. For now I'll focus on
the design question wrt. multi-arch compilation.

> I can only comment on the usage of SVE, rather than on the scaffolding
> around it. Hopefully Jonathan or others can comment on the rest.

That's very useful!

> The main thing that worries me is:
>
> #if _GLIBCXX_SIMD_HAVE_SVE
> constexpr inline int __sve_vectorized_size_bytes = __ARM_FEATURE_SVE_BITS/8;
> #else
> constexpr inline int __sve_vectorized_size_bytes = 0;
> #endif
>
> Although -msve-vector-bits is currently a per-TU setting, that isn't
> necessarily going to be the case in future.

This is a topic that I care about a lot... as simd user, implementer, and
WG21 proposal author. Are you thinking of a plan to implement the
target_clones function attribute for different SVE lengths? Or does it go
further than that? PR83875 raises the same issue and solution ideas for x86.
If I understand your concern correctly, then the issue you're raising exists
in the same form for x86.

If anyone is interested in working on a "translation phase 7 replacement"
for compiler flags macros, I'd be happy to give some input on what I believe
is necessary to make target_clones work with std(x)::simd. This seems to be
about constant expressions that return compiler-internal state - probably
similar to how static reflection needs to work.

For a sketch of a direction: what I'm already doing in
std::experimental::simd is to tag all non-always_inline function names with
a bitflag representing a relevant subset of -f and -m flags. That way, I'm
guarding against surprises when linking TUs compiled with different flags.
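Roughly, the idea looks like this (the names and bit assignments below are
only illustrative, not the actual implementation):

  constexpr unsigned __machine_flags()
  {
    unsigned __r = 0;
  #ifdef __ARM_NEON
    __r |= 1u << 0;
  #endif
  #ifdef __ARM_FEATURE_SVE
    __r |= 1u << 1;
  #endif
  #ifdef __ARM_FEATURE_SVE_BITS
    __r |= (__ARM_FEATURE_SVE_BITS / 128u) << 8; // encode the fixed VL
  #endif
  #ifdef __FAST_MATH__
    __r |= 1u << 2;
  #endif
    return __r;
  }

  // Non-always_inline functions carry the flag set as a template argument,
  // so it becomes part of the mangled name.
  template <unsigned _Flags = __machine_flags()>
  int __some_simd_kernel(int __x)
  { return __x; }

A TU compiled with -msve-vector-bits=256 then references a different symbol
than a TU compiled with -msve-vector-bits=512, instead of the linker silently
picking one of two incompatible definitions.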
> Ideally it would be
> possible to define different implementations of a function with
> different (fixed) vector lengths within the same TU. The value at
> the point that the header file is included is then somewhat arbitrary.
>
> So rather than have:
>
> > using __neon128 = _Neon<16>;
> > using __neon64 = _Neon<8>;
> >
> > +using __sve = _Sve<>;
>
> would it be possible instead to have:
>
>   using __sve128 = _Sve<128>;
>   using __sve256 = _Sve<256>;
>   ...etc...
>
> ? Code specialised for 128-bit vectors could then use __sve128 and
> code specialised for 256-bit vectors could use __sve256.

Hmm, as things stand we'd need two numbers, IIUC:

  _Sve<NumberOfUsedBytes, SizeofRegister>

On x86, "NumberOfUsedBytes" is sufficient, because 33-64 implies zmm
registers (and -mavx512f), 17-32 implies ymm, and <=16 implies xmm (except
where it doesn't ;) ).
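To make that concrete, the ABI tag would have to have roughly this shape
(the names and members here are invented for illustration, not taken from
the patch):

  template <int _UsedBytes, int _RegisterBytes>
  struct _SveAbi
  {
    static constexpr int _S_used_bytes = _UsedBytes;         // bytes the simd object occupies
    static constexpr int _S_register_bytes = _RegisterBytes; // __ARM_FEATURE_SVE_BITS / 8
    // element count follows from _UsedBytes; predicate layout and intrinsic
    // selection follow from _RegisterBytes
  };

  // e.g. a 16-byte simd on a 256-bit SVE target:
  using __sve128_of_256 = _SveAbi<16, 32>;

A fixed 16-byte simd is then a different type on a 256-bit machine than on a
128-bit one, which is exactly the information a single template parameter
cannot carry on SVE.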
> Perhaps that's not possible as things stand, but it'd be interesting
> to know the exact failure mode if so. Either way, it would be good to
> remove direct uses of __ARM_FEATURE_SVE_BITS from simd_sve.h if possible,
> and instead rely on information derived from template parameters.

The TS spec requires std::experimental::native_simd<int> to basically give
you the largest, most efficient, full SIMD register. (And it can't be a
sizeless type, because sizeless types don't exist in C++.) So how would you
do that without looking at __ARM_FEATURE_SVE_BITS in the simd implementation?

> It should be possible to use SVE to optimise some of the __neon*
> implementations, which has the advantage of working even for VLA SVE.
> That's obviously a separate patch, though. Just saying for the record.

I learned that NVidia Grace CPUs alias NEON and SVE registers. But I must
assume that other SVE implementations (especially those with
__ARM_FEATURE_SVE_BITS > 128) don't do that and might incur a significant
latency when moving from a NEON register to an SVE register and back (each
of which requires a store-load, IIUC). So are you thinking of implementing
everything via SVE? That would break ABI, no?

- Matthias

--
──────────────────────────────────────────────────────────────────────────
 Dr. Matthias Kretz                           https://mattkretz.github.io
 GSI Helmholtz Center for Heavy Ion Research               https://gsi.de
 std::simd
──────────────────────────────────────────────────────────────────────────