On Tue, 3 Aug 2021 at 21:59, Jonathan Wakely wrote:
>
> On Mon, 2 Aug 2021 at 14:29, Maged Michael wrote:
> >
> > This is the right patch. The previous one is missing noexcept. Sorry.
> >
> >
> > On Mon, Aug 2, 2021 at 9:23 AM Maged Michael <maged.mich...@gmail.com> wrote:
> >>
> >> Please find attached an updated patch after incorporating Jonathan's
> >> suggestions.
> >>
> >> Changes from the last patch include:
> >> - Add a TSAN macro to bits/c++config.
> >> - Use separate constexpr bool-s for the conditions for lock-freedom,
> >>   double-width and alignment.
> >> - Move the code in the optimized path to a separate function
> >>   _M_release_double_width_cas.
>
> Thanks for the updated patch. At a quick glance it looks great. I'll
> apply it locally and test it tomorrow.
It occurs to me now that _M_release_double_width_cas is the wrong
name, because we're not doing a DWCAS, just a double-width load. What
do you think about renaming it to _M_release_combined instead, since
it does a combined load of the two counts?

I'm no longer confident in the alignof suggestion I gave you:

+        constexpr bool __double_word
+          = sizeof(long long) == 2 * sizeof(_Atomic_word);
+        // The ref-count members follow the vptr, so are aligned to
+        // alignof(void*).
+        constexpr bool __aligned = alignof(long long) <= alignof(void*);

For IA32 (i.e. 32-bit x86) the __aligned constant will be true,
because alignof(long long) and alignof(void*) are both 4, even though
sizeof(long long) is 8. So in theory the _M_use_count and
_M_weak_count members could be in different cache lines, and the
8-byte load would not be a single atomic operation. I think we want
to use __alignof(long long) here to get the "natural" alignment, not
the smaller 4-byte alignment allowed by the SysV ABI. That means that
we won't do this optimization on IA32, but I think that's acceptable.

Alternatively, we could keep using alignof, but with an additional
run-time check, something like

  (uintptr_t)&_M_use_count / 64 == (uintptr_t)&_M_weak_count / 64

to check they're on the same cache line (see the sketch below). I
think for now we should just use __alignof and live without the
optimization on IA32.

Secondly:

+      void
+      __attribute__ ((noinline))

This needs to be __noinline__, because noinline is not a reserved
word, so users can do:

#define noinline 1
#include <memory>

Was it performance profiling, or code-size measurements (or something
else?), that showed this should be non-inline? I'd like to add a
comment explaining why we want it to be noinline.

In that function ...

+      _M_release_last_use() noexcept
+      {
+        _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_use_count);
+        _M_dispose();
+        if (_Mutex_base<_Lp>::_S_need_barriers)
+          {
+            __atomic_thread_fence (__ATOMIC_ACQ_REL);
+          }

I think we can remove this fence. The _S_need_barriers constant is
only true for the _S_mutex policy, and that just calls
_M_release_orig(), so it never uses this function. I'll remove it and
add a comment noting that the lack of a barrier is intentional.

+        _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_weak_count);
+        if (__gnu_cxx::__exchange_and_add_dispatch(&_M_weak_count,
+                                                   -1) == 1)
+          {
+            _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_weak_count);
+            _M_destroy();
+          }
+      }

Alternatively, we could keep the fence in _M_release_last_use() and
refactor the other _M_release functions, so that we have explicit
specializations for each of:

_Sp_counted_base<_S_single>::_M_release() (already present)
_Sp_counted_base<_S_mutex>::_M_release()
_Sp_counted_base<_S_atomic>::_M_release()

The second and third would be new; currently they both use the
definition in the primary template. The _S_mutex one would just
decrement _M_use_count and then call _M_release_last_use() (which
still has the barrier needed for the _S_mutex policy). The _S_atomic
one would have your new optimization.

See the attached patch showing what I mean. I find this version a bit
simpler to understand, as we just have _M_release and
_M_release_last_use, without _M_release_double_width_cas and
_M_release_orig. What do you think of this version? Does it lose any
important properties of your version which I've failed to notice?
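To make the IA32 point and the run-time check concrete, here is a
rough standalone sketch (untested and not part of the patch; the
CountedBaseLayout struct is a stand-in I've made up for the
_Sp_counted_base layout, and the 64-byte cache line size is an
assumption for illustration only):

// Sketch only: compares the ABI alignment (alignof) with the
// "natural" alignment (__alignof) of long long, and shows the kind of
// run-time same-cache-line check described above. On IA32 with GCC
// the first is expected to print 4 and the second 8.
#include <cstddef>
#include <cstdint>
#include <cstdio>

struct CountedBaseLayout   // made-up stand-in for _Sp_counted_base
{
  void* vptr;              // the vptr comes first
  int   use_count;         // stands in for _Atomic_word _M_use_count
  int   weak_count;        // stands in for _Atomic_word _M_weak_count
};

int main()
{
  std::printf("alignof(long long)   = %zu\n", alignof(long long));
  std::printf("__alignof(long long) = %zu\n",
              static_cast<std::size_t>(__alignof(long long)));

  CountedBaseLayout c;
  // Run-time alternative: are both counts on the same 64-byte cache line?
  bool same_line
    = (std::uintptr_t)&c.use_count / 64 == (std::uintptr_t)&c.weak_count / 64;
  std::printf("counts on same cache line: %d\n", same_line);
}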
diff --git a/libstdc++-v3/include/bits/c++config b/libstdc++-v3/include/bits/c++config
index 32b8957f814..07465f7ecd5 100644
--- a/libstdc++-v3/include/bits/c++config
+++ b/libstdc++-v3/include/bits/c++config
@@ -143,6 +143,15 @@
 # define _GLIBCXX_NODISCARD
 #endif
 
+// Macro for TSAN.
+#if __SANITIZE_THREAD__
+# define _GLIBCXX_TSAN 1
+#elif defined __has_feature
+# if __has_feature(thread_sanitizer)
+#  define _GLIBCXX_TSAN 1
+# endif
+#endif
+
 #if __cplusplus
diff --git a/libstdc++-v3/include/bits/shared_ptr_base.h b/libstdc++-v3/include/bits/shared_ptr_base.h
index 5be935d174d..b2397c8fddb 100644
--- a/libstdc++-v3/include/bits/shared_ptr_base.h
+++ b/libstdc++-v3/include/bits/shared_ptr_base.h
@@ -143,10 +143,12 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
       virtual void*
       _M_get_deleter(const std::type_info&) noexcept = 0;
 
+      // Increment the use count (used when the count is greater than zero).
       void
       _M_add_ref_copy()
       { __gnu_cxx::__atomic_add_dispatch(&_M_use_count, 1); }
 
+      // Increment the use count if it is non-zero, throw otherwise.
       void
       _M_add_ref_lock()
       {
@@ -154,15 +156,18 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
           __throw_bad_weak_ptr();
       }
 
+      // Increment the use count if it is non-zero.
       bool
       _M_add_ref_lock_nothrow() noexcept;
 
+      // Decrement the use count.
       void
-      _M_release() noexcept
-      {
-        // Be race-detector-friendly. For more info see bits/c++config.
-        _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_use_count);
-        if (__gnu_cxx::__exchange_and_add_dispatch(&_M_use_count, -1) == 1)
+      _M_release() noexcept;
+
+      // Called by _M_release() when the use count reaches zero.
+      __attribute__((__noinline__))
+      void
+      _M_release_last_use() noexcept
       {
         _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_use_count);
         _M_dispose();
@@ -184,12 +189,13 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
               _M_destroy();
             }
         }
-      }
 
+      // Increment the weak count.
       void
       _M_weak_add_ref() noexcept
       { __gnu_cxx::__atomic_add_dispatch(&_M_weak_count, 1); }
 
+      // Decrement the weak count.
       void
       _M_weak_release() noexcept
       {
@@ -286,6 +292,61 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
         }
     }
 
+  template<>
+    inline void
+    _Sp_counted_base<_S_mutex>::_M_release() noexcept
+    {
+      // Be race-detector-friendly. For more info see bits/c++config.
+      _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_use_count);
+      if (__gnu_cxx::__exchange_and_add_dispatch(&_M_use_count, -1) == 1)
+        {
+          _M_release_last_use();
+        }
+    }
+
+  template<>
+    inline void
+    _Sp_counted_base<_S_atomic>::_M_release() noexcept
+    {
+      _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_use_count);
+#if ! _GLIBCXX_TSAN
+      constexpr bool __lock_free
+        = __atomic_always_lock_free(sizeof(long long), 0)
+          && __atomic_always_lock_free(sizeof(_Atomic_word), 0);
+      constexpr bool __double_word
+        = sizeof(long long) == 2 * sizeof(_Atomic_word);
+      // The ref-count members follow the vptr, so are aligned to
+      // alignof(void*).
+      constexpr bool __aligned = __alignof(long long) <= alignof(void*);
+      if _GLIBCXX17_CONSTEXPR (__lock_free && __double_word && __aligned)
+        {
+          constexpr long long __unique_ref
+            = 1LL + (1LL << (__CHAR_BIT__ * sizeof(_Atomic_word)));
+          auto __both_counts = reinterpret_cast<long long*>(&_M_use_count);
+
+          _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_weak_count);
+          if (__atomic_load_n(__both_counts, __ATOMIC_ACQUIRE) == __unique_ref)
+            {
+              // Both counts are 1, so there are no weak references and
+              // we are releasing the last strong reference. No other
+              // threads can observe the effects of this _M_release()
+              // call (e.g. calling use_count()) without a data race.
+              *(long long*)(&_M_use_count) = 0;
+              _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_use_count);
+              _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_weak_count);
+              _M_dispose();
+              _M_destroy();
+              return;
+            }
+        }
+#endif
+      if (__gnu_cxx::__exchange_and_add_dispatch(&_M_use_count, -1) == 1)
+        {
+          _M_release_last_use();
+        }
+    }
+
   template<>
     inline void
     _Sp_counted_base<_S_single>::_M_weak_add_ref() noexcept
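For readers following the thread, here is a tiny standalone
illustration (untested and not part of the patch; Counts and both_one
are invented names, and a 4-byte _Atomic_word is assumed) of the
combined-load comparison performed by the new
_Sp_counted_base<_S_atomic>::_M_release() specialization above:

// Sketch only: two adjacent 32-bit counters are observed with a single
// 8-byte atomic load and compared against the "both counts == 1"
// pattern, mirroring __unique_ref = 1LL + (1LL << 32) in the patch.
#include <cstdint>
#include <cstdio>

struct Counts
{
  alignas(8) std::int32_t use_count;   // stands in for _M_use_count
  std::int32_t weak_count;             // stands in for _M_weak_count
};

static bool both_one(const Counts* c)
{
  constexpr std::int64_t unique_ref = 1LL + (1LL << 32);
  // One acquire load covers both counters (same reinterpret_cast trick
  // as the patch; it relies on the two members being contiguous and
  // suitably aligned for an 8-byte load).
  std::int64_t both
    = __atomic_load_n(reinterpret_cast<const std::int64_t*>(c),
                      __ATOMIC_ACQUIRE);
  return both == unique_ref;
}

int main()
{
  Counts c{1, 1};
  std::printf("last use, no weak refs: %d\n", both_one(&c));
  c.weak_count = 2;
  std::printf("with an extra weak ref: %d\n", both_one(&c));
}

Because both halves of the pattern are 1, the comparison gives the
same answer regardless of which counter ends up in the low or high
word, so the same constant works on little- and big-endian targets.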