On 28/10/24 17:15 +0000, mmalcom...@nvidia.com wrote:
From: Matthew Malcomson <mmalcom...@nvidia.com>

I noticed that the libstdc++ patch is essentially independent, so I
figured I could send it upstream earlier to give reviewers more time to
look at it.
I am still working on adding the ability to use floating point types in
the __atomic_fetch_add builtins.

Review of current state and motivation (for anyone reading this that has
not already seen the previous patches):
- Some hardware has support for floating point atomic fetch_add (and
 similar).
- There are existing compilers targeting this hardware that use
 libstdc++ -- e.g. NVC++.
- Since the libstdc++ atomic<float>::fetch_add and similar are written
 directly as a CAS loop, these compilers cannot emit optimal code when
 seeing such constructs.
- I hope to use __atomic_fetch_add builtins on floating point types
 directly in libstdc++ so these compilers can emit better code.
- Clang already handles some floating point types in the
 __atomic_fetch_add family of builtins.
- In order to only use this when available, I originally thought I could
 check against the resolved versions of the builtin with something like
 `__has_builtin(__atomic_fetch_add_<fp-suffix>)`.
 I then realised that clang does not expose resolved versions of these
 atomic builtins to the user.
 From the clang discourse it was suggested we instead use SFINAE (which
 clang already supports); a rough sketch of the detection idiom is shown
 after this list.
- I have posted a patch for allowing the use of SFINAE on builtins (not
 yet reviewed).
 https://gcc.gnu.org/pipermail/gcc-patches/2024-October/664999.html
 *This* patch builds and regtests on top of that patch.  It does not
 change the code generated for GCC, but it lets clang use the builtin
 for codegen.
- I have previously sent a patchset upstream adding the ability to use
 __atomic_fetch_add and similar on floating point types.
 https://gcc.gnu.org/pipermail/gcc-patches/2024-September/663355.html
 I hope to send a full patchset up soon including the suggestions given
 there.
 With that patchset included (plus the automatic linking of libatomic
 that Joseph pointed out in the email below
 https://gcc.gnu.org/pipermail/gcc-patches/2024-October/665408.html ),
 current GCC should start to use the builtin branch added in this
 patch.
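
For reference, here is that sketch of the SFINAE detection idiom
(standalone, with illustrative names rather than the ones used in the
patch):

    #include <type_traits>
    #include <utility>

    // Primary template: assume no floating point __atomic_fetch_add.
    template <typename _Tp, typename = void>
    struct __has_fp_fetch_add : std::false_type { };

    // Chosen only if __atomic_fetch_add(_Tp*, _Tp, int) is well-formed,
    // i.e. the compiler accepts the builtin for this type under SFINAE.
    template <typename _Tp>
    struct __has_fp_fetch_add<
      _Tp, std::void_t<decltype (__atomic_fetch_add (
             std::declval<_Tp*> (), std::declval<_Tp> (), int ()))>>
      : std::true_type { };

With the SFINAE-on-builtins patch applied (or with a recent clang),
__has_fp_fetch_add<float>::value is true exactly when the builtin
accepts float; on a compiler without that overload it is simply false
rather than a hard error.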

So *currently*, this patch allows external compilers (NVC++ in
particular) to generate better code, and similarly lets clang understand
the operation better since it maps to a known builtin.

I hope that by GCC 15 this patch will also allow GCC to understand the
operation better via mapping to a known builtin.

----------------- 8< ----------- >8 ----------------

Points to question here are:
1) Is this the best approach for using SFINAE to check whether this
  builtin has a particular overload?
  I don't know of a better approach, but I'm not an expert in C++
  templating.

Concepts!

We still need the CAS loop fallback for any compiler that doesn't
implement this builtin.  Once all compilers we care about implement
this, we can remove this special handling and merge the floating point
and integral operations into the same template.
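
As a purely hypothetical sketch of that eventual merge (following the
naming style of the helpers in this patch; not part of this change),
the separate floating point helper could collapse into a single builtin
call shared with the integral path:

    // Hypothetical future shape once every supported compiler provides
    // the floating point overloads of __atomic_fetch_add: one helper
    // for both integral and floating point types, with no CAS fallback.
    template<typename _Tp>
      _Tp
      __fetch_add(_Tp* __ptr, _Val<_Tp> __i, memory_order __m) noexcept
      { return __atomic_fetch_add(__ptr, __i, int(__m)); }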

Testing done:
N.b. all testing was done on AArch64, on top of the patch introducing
SFINAE on builtins here:
https://gcc.gnu.org/pipermail/gcc-patches/2024-October/664999.html

1) No change seen in bootstrap & regression test on AArch64.
2) Manually ran the atomic_float/1.cc testcase with clang++ 19 and
  things passed.  With clang++ 18 there was a failure independent of
  this change, where the use of `is_lock_free` in the testcase was not
  optimised away so we got a linker error.  After editing the testcase
  it also passes with clang++ 18.
3) Manually checked that when compiling with clang we follow the branch
  that uses the builtin for `float` (because clang has already
  implemented these builtins for `float`).
  - Done by adding `+1` to the result of that branch and checking that
    we abort when running the result.
4) Manually checked that when compiling with GCC we follow the branch
  that does not use the builtin for `float`.
  - Done by adding the same temporary bug to the header in the
    builtin branch (sketched below), and re-running tests to see that
    we still pass with GCC.
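
The temporary edit used for 3) and 4) was along these lines
(illustrative only, applied locally and never committed):

    // Deliberately wrong change in the builtin branch, used only to
    // check which branch each compiler actually takes:
    if constexpr (__is_atomic_fetch_add_available<_Tp>::value)
      return __atomic_fetch_add (__ptr, __i, int (__m)) + 1; // bug on purpose

With clang the atomic_float/1.cc run then aborts (showing the builtin
branch is taken); with GCC the tests still pass (showing the fallback
branch is taken).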

Ok for trunk?

libstdc++-v3/ChangeLog:

        * include/bits/atomic_base.h
         (__atomic_impl::__is_atomic_fetch_add_available): Define new
         struct using SFINAE to check whether __atomic_fetch_add is
         implemented for floating point types.
         (std::__atomic_impl::__fetch_add_flt): `if constexpr` branch
         on the above SFINAE test to use __atomic_fetch_add when
         available.
         (__atomic_impl::__is_atomic_add_fetch_available,
         std::__atomic_impl::__add_fetch_flt): Likewise.
         (__atomic_impl::__is_atomic_fetch_sub_available,
         std::__atomic_impl::__fetch_sub_flt): Likewise.
         (__atomic_impl::__is_atomic_sub_fetch_available,
         std::__atomic_impl::__sub_fetch_flt): Likewise.

Signed-off-by: Matthew Malcomson <mmalcom...@nvidia.com>
---
libstdc++-v3/include/bits/atomic_base.h | 116 ++++++++++++++++++------
1 file changed, 90 insertions(+), 26 deletions(-)

diff --git a/libstdc++-v3/include/bits/atomic_base.h b/libstdc++-v3/include/bits/atomic_base.h
index 72cc4bae6cf..d7671b0e9d2 100644
--- a/libstdc++-v3/include/bits/atomic_base.h
+++ b/libstdc++-v3/include/bits/atomic_base.h
@@ -1209,54 +1209,118 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
      __xor_fetch(_Tp* __ptr, _Val<_Tp> __i) noexcept
      { return __atomic_xor_fetch(__ptr, __i, __ATOMIC_SEQ_CST); }

+      template <typename _Tp, typename = void>
+      struct __is_atomic_fetch_add_available : false_type
+      {
+      };
+      template <typename T>

This T isn't uglified as _Tp, but it doesn't matter because ...

+      struct __is_atomic_fetch_add_available<
+       T, std::void_t<decltype (__atomic_fetch_add (
+            std::declval<T *> (), std::declval<T> (), int ()))>> : true_type
+      {
+      };
+

Atomics for floating-point types are only in C++20, so we can use
Concepts for these constraints.

template<typename _Tp>
  concept __atomic_fetch_addable
    = requires (_Tp __t) { __atomic_fetch_add(&__t, __t, 1); };

I assume this will work fine if the void_t version worked.

    template<typename _Tp>
      _Tp
      __fetch_add_flt(_Tp* __ptr, _Val<_Tp> __i, memory_order __m) noexcept
      {
-       _Val<_Tp> __oldval = load(__ptr, memory_order_relaxed);
-       _Val<_Tp> __newval = __oldval + __i;
-       while (!compare_exchange_weak(__ptr, __oldval, __newval, __m,
-                                     memory_order_relaxed))
-         __newval = __oldval + __i;
-       return __oldval;
+       if constexpr (__is_atomic_fetch_add_available<_Tp>::value)

Then this would be just:

        if constexpr (__atomic_fetch_addable<_Tp>)

i.e. no ::value

+         return __atomic_fetch_add (__ptr, __i, int (__m));
+       else
+         {
+           _Val<_Tp> __oldval = load (__ptr, memory_order_relaxed);
+           _Val<_Tp> __newval = __oldval + __i;
+           while (!compare_exchange_weak (__ptr, __oldval, __newval, __m,
+                                          memory_order_relaxed))
+             __newval = __oldval + __i;
+           return __oldval;
+         }
      }

+      template <typename _Tp, typename = void>
+      struct __is_atomic_fetch_sub_available : false_type
+      {
+      };
+      template <typename T>
+      struct __is_atomic_fetch_sub_available<
+       T, std::void_t<decltype (__atomic_fetch_sub (
+            std::declval<T *> (), std::declval<T> (), int ()))>> : true_type
+      {
+      };
+
    template<typename _Tp>
      _Tp
      __fetch_sub_flt(_Tp* __ptr, _Val<_Tp> __i, memory_order __m) noexcept
      {
-       _Val<_Tp> __oldval = load(__ptr, memory_order_relaxed);
-       _Val<_Tp> __newval = __oldval - __i;
-       while (!compare_exchange_weak(__ptr, __oldval, __newval, __m,
-                                     memory_order_relaxed))
-         __newval = __oldval - __i;
-       return __oldval;
+       if constexpr (__is_atomic_fetch_sub_available<_Tp>::value)
+         return __atomic_fetch_sub (__ptr, __i, int (__m));
+       else
+         {
+           _Val<_Tp> __oldval = load (__ptr, memory_order_relaxed);
+           _Val<_Tp> __newval = __oldval - __i;
+           while (!compare_exchange_weak (__ptr, __oldval, __newval, __m,
+                                          memory_order_relaxed))
+             __newval = __oldval - __i;
+           return __oldval;
+         }
      }

+      template <typename _Tp, typename = void>
+      struct __is_atomic_add_fetch_available : false_type
+      {
+      };
+      template <typename T>
+      struct __is_atomic_add_fetch_available<
+       T, std::void_t<decltype (__atomic_add_fetch (
+            std::declval<T *> (), std::declval<T> (), int ()))>> : true_type
+      {
+      };
+
    template<typename _Tp>
      _Tp
      __add_fetch_flt(_Tp* __ptr, _Val<_Tp> __i) noexcept
      {
-       _Val<_Tp> __oldval = load(__ptr, memory_order_relaxed);
-       _Val<_Tp> __newval = __oldval + __i;
-       while (!compare_exchange_weak(__ptr, __oldval, __newval,
-                                     memory_order_seq_cst,
-                                     memory_order_relaxed))
-         __newval = __oldval + __i;
-       return __newval;
+       if constexpr (__is_atomic_add_fetch_available<_Tp>::value)
+         return __atomic_add_fetch (__ptr, __i, __ATOMIC_SEQ_CST);
+       else
+         {
+           _Val<_Tp> __oldval = load (__ptr, memory_order_relaxed);
+           _Val<_Tp> __newval = __oldval + __i;
+           while (!compare_exchange_weak (__ptr, __oldval, __newval,
+                                          memory_order_seq_cst,
+                                          memory_order_relaxed))
+             __newval = __oldval + __i;
+           return __newval;
+         }
      }

+      template <typename _Tp, typename = void>
+      struct __is_atomic_sub_fetch_available : false_type
+      {
+      };
+      template <typename T>
+      struct __is_atomic_sub_fetch_available<
+       T, std::void_t<decltype (__atomic_sub_fetch (
+            std::declval<T *> (), std::declval<T> (), int ()))>> : true_type
+      {
+      };
+
    template<typename _Tp>
      _Tp
      __sub_fetch_flt(_Tp* __ptr, _Val<_Tp> __i) noexcept
      {
-       _Val<_Tp> __oldval = load(__ptr, memory_order_relaxed);
-       _Val<_Tp> __newval = __oldval - __i;
-       while (!compare_exchange_weak(__ptr, __oldval, __newval,
-                                     memory_order_seq_cst,
-                                     memory_order_relaxed))
-         __newval = __oldval - __i;
-       return __newval;
+       if constexpr (__is_atomic_sub_fetch_available<_Tp>::value)
+         return __atomic_sub_fetch (__ptr, __i, __ATOMIC_SEQ_CST);
+       else
+         {
+           _Val<_Tp> __oldval = load (__ptr, memory_order_relaxed);
+           _Val<_Tp> __newval = __oldval - __i;
+           while (!compare_exchange_weak (__ptr, __oldval, __newval,
+                                          memory_order_seq_cst,
+                                          memory_order_relaxed))
+             __newval = __oldval - __i;
+           return __newval;
+         }
      }
  } // namespace __atomic_impl

--
2.43.0
