On Tue, 3 Aug 2021 at 21:59, Jonathan Wakely wrote:
>
> On Mon, 2 Aug 2021 at 14:29, Maged Michael wrote:
> >
> > This is the right patch. The previous one is missing noexcept. Sorry.
> >
> >
> > On Mon, Aug 2, 2021 at 9:23 AM Maged Michael <maged.mich...@gmail.com> wrote:
> >>
> >> Please find attached an updated patch after incorporating Jonathan's
> >> suggestions.
> >>
> >> Changes from the last patch include:
> >> - Add a TSAN macro to bits/c++config.
> >> - Use separate constexpr bool-s for the conditions for lock-freedom,
> >>   double-width and alignment.
> >> - Move the code in the optimized path to a separate function
> >>   _M_release_double_width_cas.
>
> Thanks for the updated patch. At a quick glance it looks great. I'll
> apply it locally and test it tomorrow.
It occurs to me now that _M_release_double_width_cas is the wrong
name, because we're not doing a DWCAS, just a double-width load. What
do you think about renaming it to _M_release_combined instead, since
it does a combined load of the two counts?

I'm no longer confident in the alignof suggestion I gave you:

+        constexpr bool __double_word
+          = sizeof(long long) == 2 * sizeof(_Atomic_word);
+        // The ref-count members follow the vptr, so are aligned to
+        // alignof(void*).
+        constexpr bool __aligned = alignof(long long) <= alignof(void*);

For IA32 (i.e. 32-bit x86) the __aligned constant will be true,
because alignof(long long) and alignof(void*) are both 4, even though
sizeof(long long) is 8. So in theory the _M_use_count and
_M_weak_count members could be in different cache lines, and the
8-byte load would not be a single atomic operation. I think we want
to use __alignof(long long) here to get the "natural" alignment, not
the smaller 4-byte alignment allowed by the SysV ABI. That means that
we won't do this optimization on IA32, but I think that's acceptable.

Alternatively, we could keep using alignof, but with an additional
run-time check, something like

  (uintptr_t)&_M_use_count / 64 == (uintptr_t)&_M_weak_count / 64

to check they're on the same cache line (see the sketch below). I
think for now we should just use __alignof and live without the
optimization on IA32.

Secondly:

+      void
+      __attribute__ ((noinline))

This needs to be __noinline__, because noinline is not a reserved
word, so users can do:

#define noinline 1
#include <memory>

Was it performance profiling, or code-size measurements (or something
else?), that showed this should be non-inline? I'd like to add a
comment explaining why we want it to be noinline.

In that function ...

+      _M_release_last_use() noexcept
+      {
+        _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_use_count);
+        _M_dispose();
+        if (_Mutex_base<_Lp>::_S_need_barriers)
+          {
+            __atomic_thread_fence (__ATOMIC_ACQ_REL);
+          }

I think we can remove this fence. The _S_need_barriers constant is
only true for the _S_mutex policy, and that just calls
_M_release_orig(), so it never uses this function. I'll remove it and
add a comment noting that the lack of a barrier is intentional.

+        _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_weak_count);
+        if (__gnu_cxx::__exchange_and_add_dispatch(&_M_weak_count,
+                                                   -1) == 1)
+          {
+            _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_weak_count);
+            _M_destroy();
+          }
+      }

Alternatively, we could keep the fence in _M_release_last_use() and
refactor the other _M_release functions, so that we have explicit
specializations for each of:

_Sp_counted_base<_S_single>::_M_release() (already present)
_Sp_counted_base<_S_mutex>::_M_release()
_Sp_counted_base<_S_atomic>::_M_release()

The second and third would be new; currently they both use the
definition in the primary template. The _S_mutex one would just
decrement _M_use_count and then call _M_release_last_use() (which
still has the barrier needed for the _S_mutex policy). The _S_atomic
one would have your new optimization.

See the attached patch showing what I mean. I find this version a bit
simpler to understand, as we just have _M_release and
_M_release_last_use, without _M_release_double_width_cas and
_M_release_orig. What do you think of this version? Does it lose any
important properties of your version which I've failed to notice?
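To make the IA32 point and the run-time check concrete, here is a
rough standalone sketch (untested and not part of the patch; the
CountedBaseLayout struct is a stand-in I've made up for the
_Sp_counted_base layout, and the 64-byte cache line size is an
assumption for illustration only):

// Sketch only: compares the ABI alignment (alignof) with the
// "natural" alignment (__alignof) of long long, and shows the kind of
// run-time same-cache-line check described above. On IA32 with GCC
// the first is expected to print 4 and the second 8.
#include <cstddef>
#include <cstdint>
#include <cstdio>

struct CountedBaseLayout   // made-up stand-in for _Sp_counted_base
{
  void* vptr;              // the vptr comes first
  int   use_count;         // stands in for _Atomic_word _M_use_count
  int   weak_count;        // stands in for _Atomic_word _M_weak_count
};

int main()
{
  std::printf("alignof(long long)   = %zu\n", alignof(long long));
  std::printf("__alignof(long long) = %zu\n",
              static_cast<std::size_t>(__alignof(long long)));

  CountedBaseLayout c;
  // Run-time alternative: are both counts on the same 64-byte cache line?
  bool same_line
    = (std::uintptr_t)&c.use_count / 64 == (std::uintptr_t)&c.weak_count / 64;
  std::printf("counts on same cache line: %d\n", same_line);
}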
diff --git a/libstdc++-v3/include/bits/c++config b/libstdc++-v3/include/bits/c++config
index 32b8957f814..07465f7ecd5 100644
--- a/libstdc++-v3/include/bits/c++config
+++ b/libstdc++-v3/include/bits/c++config
@@ -143,6 +143,15 @@
 # define _GLIBCXX_NODISCARD
 #endif
 
+// Macro for TSAN.
+#if __SANITIZE_THREAD__
+# define _GLIBCXX_TSAN 1
+#elif defined __has_feature
+# if __has_feature(thread_sanitizer)
+#  define _GLIBCXX_TSAN 1
+# endif
+#endif
+
 #if __cplusplus
diff --git a/libstdc++-v3/include/bits/shared_ptr_base.h b/libstdc++-v3/include/bits/shared_ptr_base.h
index 5be935d174d..b2397c8fddb 100644
--- a/libstdc++-v3/include/bits/shared_ptr_base.h
+++ b/libstdc++-v3/include/bits/shared_ptr_base.h
@@ -143,10 +143,12 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
       virtual void*
       _M_get_deleter(const std::type_info&) noexcept = 0;
 
+      // Increment the use count (used when the count is greater than zero).
       void
       _M_add_ref_copy()
       { __gnu_cxx::__atomic_add_dispatch(&_M_use_count, 1); }
 
+      // Increment the use count if it is non-zero, throw otherwise.
       void
       _M_add_ref_lock()
       {
@@ -154,15 +156,18 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
           __throw_bad_weak_ptr();
       }
 
+      // Increment the use count if it is non-zero.
       bool
       _M_add_ref_lock_nothrow() noexcept;
 
+      // Decrement the use count.
       void
-      _M_release() noexcept
-      {
-        // Be race-detector-friendly. For more info see bits/c++config.
-        _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_use_count);
-        if (__gnu_cxx::__exchange_and_add_dispatch(&_M_use_count, -1) == 1)
+      _M_release() noexcept;
+
+      // Called by _M_release() when the use count reaches zero.
+      __attribute__((__noinline__))
+      void
+      _M_release_last_use() noexcept
       {
         _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_use_count);
         _M_dispose();
@@ -184,12 +189,13 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
               _M_destroy();
             }
         }
-      }
 
+      // Increment the weak count.
       void
       _M_weak_add_ref() noexcept
       { __gnu_cxx::__atomic_add_dispatch(&_M_weak_count, 1); }
 
+      // Decrement the weak count.
       void
       _M_weak_release() noexcept
       {
@@ -286,6 +292,61 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
         }
     }
 
+  template<>
+    inline void
+    _Sp_counted_base<_S_mutex>::_M_release() noexcept
+    {
+      // Be race-detector-friendly. For more info see bits/c++config.
+      _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_use_count);
+      if (__gnu_cxx::__exchange_and_add_dispatch(&_M_use_count, -1) == 1)
+        {
+          _M_release_last_use();
+        }
+    }
+
+  template<>
+    inline void
+    _Sp_counted_base<_S_atomic>::_M_release() noexcept
+    {
+      _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_use_count);
+#if ! _GLIBCXX_TSAN
+      constexpr bool __lock_free
+        = __atomic_always_lock_free(sizeof(long long), 0)
+          && __atomic_always_lock_free(sizeof(_Atomic_word), 0);
+      constexpr bool __double_word
+        = sizeof(long long) == 2 * sizeof(_Atomic_word);
+      // The ref-count members follow the vptr, so are aligned to
+      // alignof(void*).
+      constexpr bool __aligned = __alignof(long long) <= alignof(void*);
+      if _GLIBCXX17_CONSTEXPR (__lock_free && __double_word && __aligned)
+        {
+          constexpr long long __unique_ref
+            = 1LL + (1LL << (__CHAR_BIT__ * sizeof(_Atomic_word)));
+          auto __both_counts = reinterpret_cast<long long*>(&_M_use_count);
+
+          _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_weak_count);
+          if (__atomic_load_n(__both_counts, __ATOMIC_ACQUIRE) == __unique_ref)
+            {
+              // Both counts are 1, so there are no weak references and
+              // we are releasing the last strong reference. No other
+              // threads can observe the effects of this _M_release()
+              // call (e.g. calling use_count()) without a data race.
+              *(long long*)(&_M_use_count) = 0;
+              _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_use_count);
+              _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_weak_count);
+              _M_dispose();
+              _M_destroy();
+              return;
+            }
+        }
+#endif
+      if (__gnu_cxx::__exchange_and_add_dispatch(&_M_use_count, -1) == 1)
+        {
+          _M_release_last_use();
+        }
+    }
+
   template<>
     inline void
     _Sp_counted_base<_S_single>::_M_weak_add_ref() noexcept
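For readers following the thread, here is a tiny standalone
illustration (untested and not part of the patch; Counts and both_one
are invented names, and a 4-byte _Atomic_word is assumed) of the
combined-load comparison performed by the new
_Sp_counted_base<_S_atomic>::_M_release() specialization above:

// Sketch only: two adjacent 32-bit counters are observed with a single
// 8-byte atomic load and compared against the "both counts == 1"
// pattern, mirroring __unique_ref = 1LL + (1LL << 32) in the patch.
#include <cstdint>
#include <cstdio>

struct Counts
{
  alignas(8) std::int32_t use_count;   // stands in for _M_use_count
  std::int32_t weak_count;             // stands in for _M_weak_count
};

static bool both_one(const Counts* c)
{
  constexpr std::int64_t unique_ref = 1LL + (1LL << 32);
  // One acquire load covers both counters (same reinterpret_cast trick
  // as the patch; it relies on the two members being contiguous and
  // suitably aligned for an 8-byte load).
  std::int64_t both
    = __atomic_load_n(reinterpret_cast<const std::int64_t*>(c),
                      __ATOMIC_ACQUIRE);
  return both == unique_ref;
}

int main()
{
  Counts c{1, 1};
  std::printf("last use, no weak refs: %d\n", both_one(&c));
  c.weak_count = 2;
  std::printf("with an extra weak ref: %d\n", both_one(&c));
}

Because both halves of the pattern are 1, the comparison gives the
same answer regardless of which counter ends up in the low or high
word, so the same constant works on little- and big-endian targets.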