On 28/10/24 17:15 +0000, mmalcom...@nvidia.com wrote:
From: Matthew Malcomson <mmalcom...@nvidia.com>
I noticed that the libstdc++ patch is essentially separate and figured I
could send it upstream earlier to give reviewers more time to look at
it.
I am still working on adding the ability to use floating point types in
the __atomic_fetch_add family of builtins.
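For reference, the intended builtin usage is along the lines of the
sketch below; clang already accepts this for `float`, while current GCC
does not:

    float x = 0.0f;
    /* Atomically perform x += 1.0f with seq_cst ordering, returning
       the previous value of x.  */
    float old = __atomic_fetch_add (&x, 1.0f, __ATOMIC_SEQ_CST);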
Review of the current state and motivation (for anyone reading this who
has not already seen the previous patches):
- Some hardware has support for floating point atomic fetch_add (and
similar).
- There are existing compilers targeting this hardware that use
  libstdc++ -- e.g. NVC++.
- Since libstdc++'s atomic<float>::fetch_add and similar are written
  directly as a CAS loop, these compilers cannot emit optimal code when
  they see such constructs.
- I hope to use __atomic_fetch_add builtins on floating point types
directly in libstdc++ so these compilers can emit better code.
- Clang already handles some floating point types in the
__atomic_fetch_add family of builtins.
- In order to use the builtin only when it is available, I originally
  thought I could check against the resolved versions of the builtin
  with something like `__has_builtin(__atomic_fetch_add_<fp-suffix>)`.
  I then realised that clang does not expose resolved versions of these
  atomic builtins to the user.
  On the clang discourse it was suggested that we instead use SFINAE
  (which clang already supports); see the rough sketch below, just
  before the scissors line.
- I have posted a patch for allowing the use of SFINAE on builtins (not
yet reviewed).
https://gcc.gnu.org/pipermail/gcc-patches/2024-October/664999.html
  *This* patch builds and regtests on top of that patch. It does not
  change the code generated for GCC, but it does use the builtin for
  codegen with clang.
- I have previously sent a patchset upstream adding the ability to use
__atomic_fetch_add and similar on floating point types.
https://gcc.gnu.org/pipermail/gcc-patches/2024-September/663355.html
  I hope to send up a full patchset soon, incorporating the suggestions
  given there.
  With that patchset included (plus the automatic linking of libatomic
  that Joseph pointed out in the email below:
  https://gcc.gnu.org/pipermail/gcc-patches/2024-October/665408.html )
  current GCC should start to use the builtin branch added in this
  patch.
So *currently*, this patch allows external compilers (NVC++ in
particular) to generate better code, and similarly lets clang understand
the operation better since it maps to a known builtin.
I hope that by GCC 15 this patch will also allow GCC to understand the
operation better by mapping it to a known builtin.
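A rough sketch of the SFINAE detection idiom mentioned above (the names
here are illustrative only; the patch below uses uglified names inside
the __atomic_impl namespace, and this sketch assumes <type_traits> and
<utility> for false_type/true_type, void_t and declval):

    template <typename T, typename = void>
    struct is_atomic_fetch_add_available : std::false_type {};

    /* The partial specialisation is viable only when
       __atomic_fetch_add accepts (T*, T, int).  With SFINAE on
       builtins, an unsupported T selects the primary template instead
       of producing a hard error.  */
    template <typename T>
    struct is_atomic_fetch_add_available<
      T, std::void_t<decltype (__atomic_fetch_add (
	   std::declval<T*> (), std::declval<T> (), int ()))>>
      : std::true_type {};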
----------------- 8< ----------- >8 ----------------
The points to question here are:
1) Is this the best approach for using SFINAE to check if this builtin
has a particular overload?
I don't know of a better approach, but I am not an expert in C++
templating.
Concepts!
We still need the CAS loop fallback for any compiler that does not
implement this builtin. Once all the compilers we care about implement
it, we can remove this special handling and merge the floating point
and integral operations into the same template (see the hypothetical
sketch below).
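For illustration only, once the builtin handles floating point types on
every compiler we care about, the wrapper could collapse to a single
line and stop being a special case (hypothetical, not part of this
patch):

    template<typename _Tp>
      _Tp
      __fetch_add_flt(_Tp* __ptr, _Val<_Tp> __i, memory_order __m) noexcept
      { return __atomic_fetch_add(__ptr, __i, int(__m)); }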
Testing done:
N.b. all testing was done on AArch64, on top of the patch introducing
SFINAE on builtins:
https://gcc.gnu.org/pipermail/gcc-patches/2024-October/664999.html
1) No change seen in bootstrap & regression test on AArch64.
2) Manually ran the atomic_float/1.cc testcase with clang++ 19 and
   everything passed. With clang++ 18 there was a failure independent
   of this change: the use of `is_lock_free` in the testcase was not
   optimised away, so we got a linker error. After editing the testcase
   it also passes with clang++ 18.
3) Manually checked that when compiling with clang we follow the branch
that uses the builtin for `float` (because clang has already
implemented these builtins for `float`).
   - Done by adding `+1` to the result of that branch and checking that
     we abort when running the result (see the sketch after this list).
4) Manually checked that when compiling with GCC we follow the branch
that does not use the builtin for `float`.
   - Done by adding the same temporary bug to the header in the builtin
     branch and re-running the tests to check that we still pass with
     GCC.
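The temporary bug in points 3 and 4 was along these lines (an
illustrative sketch of the edited builtin branch, not the exact edit):

    if constexpr (__is_atomic_fetch_add_available<_Tp>::value)
      /* Deliberately wrong: the extra +1 makes any test exercising
	 this branch fail, which shows whether the branch is taken.  */
      return __atomic_fetch_add (__ptr, __i, int (__m)) + 1;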
Ok for trunk?
libstdc++-v3/ChangeLog:
* include/bits/atomic_base.h
(__atomic_impl::__is_atomic_fetch_add_available): Define new
struct using SFINAE to check whether __atomic_fetch_add is
implemented on floating point type.
(std::__atomic_impl::__fetch_add_flt): `if constexpr` branch
on the above SFINAE test to use __atomic_fetch_add when
available.
(__atomic_impl::__is_atomic_add_fetch_available,
std::__atomic_impl::__add_fetch_flt): Likewise.
(__atomic_impl::__is_atomic_fetch_sub_available,
std::__atomic_impl::__fetch_sub_flt): Likewise.
(__atomic_impl::__is_atomic_sub_fetch_available,
std::__atomic_impl::__sub_fetch_flt): Likewise.
Signed-off-by: Matthew Malcomson <mmalcom...@nvidia.com>
---
libstdc++-v3/include/bits/atomic_base.h | 116 ++++++++++++++++++------
1 file changed, 90 insertions(+), 26 deletions(-)
diff --git a/libstdc++-v3/include/bits/atomic_base.h b/libstdc++-v3/include/bits/atomic_base.h
index 72cc4bae6cf..d7671b0e9d2 100644
--- a/libstdc++-v3/include/bits/atomic_base.h
+++ b/libstdc++-v3/include/bits/atomic_base.h
@@ -1209,54 +1209,118 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
__xor_fetch(_Tp* __ptr, _Val<_Tp> __i) noexcept
{ return __atomic_xor_fetch(__ptr, __i, __ATOMIC_SEQ_CST); }
+ template <typename _Tp, typename = void>
+ struct __is_atomic_fetch_add_available : false_type
+ {
+ };
+ template <typename T>
This T isn't uglified as _Tp, but it doesn't matter because ...
+ struct __is_atomic_fetch_add_available<
+ T, std::void_t<decltype (__atomic_fetch_add (
+ std::declval<T *> (), std::declval<T> (), int ()))>> : true_type
+ {
+ };
+
Atomics for floating-point types are only in C++20, so we can use
Concepts for these constraints.
template<typename _Tp>
concept __atomic_fetch_addable
= requires (_Tp __t) { __atomic_fetch_add(&__t, __t, 1); };
I assume this will work fine if the void_t version worked.
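A quick sanity check works for either formulation (assuming the trait
or the concept above is in scope; the float line only holds on
compilers that implement the builtin for floating point types):

    static_assert (__is_atomic_fetch_add_available<int>::value);
    /* Holds under clang today, and under GCC once the floating point
       builtin patchset lands: */
    // static_assert (__is_atomic_fetch_add_available<float>::value);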
template<typename _Tp>
_Tp
__fetch_add_flt(_Tp* __ptr, _Val<_Tp> __i, memory_order __m) noexcept
{
- _Val<_Tp> __oldval = load(__ptr, memory_order_relaxed);
- _Val<_Tp> __newval = __oldval + __i;
- while (!compare_exchange_weak(__ptr, __oldval, __newval, __m,
- memory_order_relaxed))
- __newval = __oldval + __i;
- return __oldval;
+ if constexpr (__is_atomic_fetch_add_available<_Tp>::value)
Then this would be just:
if constexpr (__atomic_fetch_addable<_Tp>)
i.e. no ::value
+ return __atomic_fetch_add (__ptr, __i, int (__m));