Re: [PATCH] libstdc++: Future-proof C++20 atomic wait/notify

Jonathan Wakely Mon, 17 Nov 2025 05:39:07 -0800

On Mon, 17 Nov 2025 at 14:09 +0100, Tomasz Kaminski wrote:

On Mon, Nov 17, 2025 at 1:50 PM Jonathan Wakely <[email protected]> wrote:

On Mon, 17 Nov 2025 at 10:10 +0100, Tomasz Kaminski wrote:
>On Sun, Nov 16, 2025 at 1:56 AM Jonathan Wakely <[email protected]> wrote:
>
>> This will allow us to extend atomic waiting functions to support a
>> possible future 64-bit version of futex, as well as supporting
>> futex-like wait/wake primitives on other targets (e.g. macOS has
>> os_sync_wait_on_address and FreeBSD has _umtx_op).
>>
>> Before this change, the decision of whether to do a proxy wait or to
>> wait on the atomic variable itself was made in the header at
>> compile-time, which makes it an ABI property that would not have been
>> possible to change later.  That would have meant that
>> std::atomic<uint64_t> would always have to do a proxy wait even if Linux
>> gains support for 64-bit futex2(2) calls at some point in the future.
>> The disadvantage of proxy waits is that several distinct atomic objects
>> can share the same proxy state, leading to contention between threads
>> even when they are not waiting on the same atomic object, similar to
>> false sharing. It also results in spurious wake-ups because doing a
>> notify on an atomic object that uses a proxy wait will wake all waiters
>> sharing the proxy.
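To make the contention point concrete, here is a hypothetical sketch of a proxy table, assuming a small fixed-size array of wait states indexed by a hash of the object's address. The table size (16) and the hash (address >> 6, modulo 16) are made up for illustration and are not libstdc++'s actual scheme. Two unrelated objects whose addresses hash to the same slot share one proxy, so a notify on either one wakes the waiters of both.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// Hypothetical proxy wait state. A notify would bump ver and wake
// everyone waiting on this slot, whichever object they care about.
struct proxy_state_sketch
{
  std::atomic<unsigned> ver{0};
};

// Hypothetical table size and hash, chosen only for illustration.
constexpr std::size_t table_size = 16;
inline proxy_state_sketch proxy_table[table_size];

// Map an object's address to its (possibly shared) proxy slot.
inline proxy_state_sketch&
proxy_for(const void* addr)
{
  auto key = reinterpret_cast<std::uintptr_t>(addr) >> 6; // drop low 6 bits
  return proxy_table[key % table_size];
}

// Storage for two distinct objects placed table_size * 64 bytes apart.
alignas(64) inline unsigned char arena[64 * table_size + 64];

// Distinct addresses, same proxy: the "false sharing" case.
inline bool
addresses_share_proxy()
{
  const void* a = &arena[0];
  const void* b = &arena[64 * table_size];
  return a != b && &proxy_for(a) == &proxy_for(b);
}
```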
>>
>> For types that are known to definitely not need a proxy wait (e.g. int
>> on Linux) the header can still choose a more efficient path at
>> compile-time. But for other types, the decision of whether to do a proxy
>> wait is deferred to runtime, inside the library internals. This will
>> make it possible for future versions of libstdc++.so to extend the set
>> of types which don't need to use proxy waits, without ABI changes.
>>
>> The way the change works is to stop using the __proxy_wait flag that was
>> set by the inline code in the headers. Instead the __wait_args struct
>> has an extra pointer member which the library internals populate with
>> either the address of the atomic object or the _M_ver counter in the
>> proxy state. There is also a new _M_obj_size member which stores the
>> size of the atomic object, so that the library can decide whether a
>> proxy is needed. So for example if linux gains 64-bit futex support then
>> the library can decide not to use a proxy when _M_obj_size == 8.
>> Finally, the _M_old member of the __wait_args struct is changed to
>> uint64_t so that it has room to store 64-bit values, not just whatever
>> size the __platform_wait_t type is (which is a 32-bit int on Linux).
>> Similarly, the _M_val member of __wait_result_type changes to uint64_t
>> too.
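A simplified model of the new data flow may help. This is a sketch under stated assumptions, not the real libstdc++ code: the names wait_args_sketch, proxy_counter and setup_wait_impl_sketch are hypothetical, and the library-side policy assumes a Linux-like target where only 4-byte objects can be waited on directly. The header records the object's address and (for scalar types) its size and keeps the old value in 64 bits; the library then either leaves obj pointing at the user's object or redirects it to a proxy counter.

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical stand-in for the proxy state's _M_ver counter.
inline unsigned proxy_counter = 0;

// Hypothetical, simplified analogue of __wait_args_base.
struct wait_args_sketch
{
  std::uint64_t old = 0;      // room for 64-bit values
  const void* obj = nullptr;  // what the library will actually wait on
  unsigned char obj_size = 0; // 0 means "never wait on this directly"

  template<typename T>
    explicit wait_args_sketch(const T* addr)
    : obj(addr) // may be replaced by setup_wait_impl_sketch
    {
      if constexpr (std::is_scalar_v<T> && sizeof(T) <= 8)
        obj_size = sizeof(T);
    }
};

// Library-side decision (assumed policy: 32-bit futex only). A future
// library could also return false for obj_size == 8.
inline bool
needs_proxy(const wait_args_sketch& args)
{
  return args.obj_size != 4;
}

inline void
setup_wait_impl_sketch(wait_args_sketch& args)
{
  if (needs_proxy(args))
    {
      args.obj = &proxy_counter; // wait on the proxy instead
      args.old = proxy_counter;  // plain load, for brevity
    }
  // Otherwise keep waiting directly on the user's object.
}
```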
>>
>> libstdc++-v3/ChangeLog:
>>
>>         * config/abi/pre/gnu.ver: Adjust exports.
>>         * include/bits/atomic_timed_wait.h (_GLIBCXX_HAVE_PLATFORM_TIMED_WAIT):
>>         Do not define this macro.
>>         (__atomic_wait_address_until_v, __atomic_wait_address_for_v):
>>         Guard assertions with #ifdef _GLIBCXX_UNKNOWN_PLATFORM_WAIT.
>>         * include/bits/atomic_wait.h (__platform_wait_uses_type):
>>         Define separately for platforms with and without platform
>>         wait.
>>         (_GLIBCXX_HAVE_PLATFORM_WAIT): Do not define this macro.
>>         (_GLIBCXX_UNKNOWN_PLATFORM_WAIT): Define new macro.
>>         (__wait_value_type): New typedef.
>>         (__wait_result_type): Change _M_val to __wait_value_type.
>>         (__wait_args_base::_M_old): Change to __wait_value_type.
>>         (__wait_args_base::_M_obj, __wait_args_base::_M_obj_size): New
>>         data members.
>>         (__wait_args::__wait_args): Set _M_obj and _M_obj_size on
>>         construction.
>>         (__wait_args::_M_setup_wait): Change void* parameter to deduced
>>         type. Use _S_bit_cast instead of __builtin_bit_cast.
>>         (__wait_args::_M_load_proxy_wait_val): Remove function, replace
>>         with ...
>>         (__wait_args::_M_setup_wait_impl): New function.
>>         (__wait_args::_S_bit_cast): Wrapper for __builtin_bit_cast which
>>         also supports conversion from 32-bit values.
>>         (__wait_args::_S_flags_for): Do not set __proxy_wait flag.
>>         (__atomic_wait_address_v): Guard assertions with #ifdef
>>         _GLIBCXX_UNKNOWN_PLATFORM_WAIT.
>>         * src/c++20/atomic.cc (_GLIBCXX_HAVE_PLATFORM_WAIT): Define here
>>         instead of in header. Check _GLIBCXX_HAVE_PLATFORM_WAIT instead
>>         of _GLIBCXX_HAVE_PLATFORM_TIMED_WAIT.
>>         (__spin_impl): Adjust for 64-bit __wait_args_base::_M_old.
>>         (use_proxy_wait): New function.
>>         (__wait_args::_M_load_proxy_wait_val): Replace with ...
>>         (__wait_args::_M_setup_wait_impl): New function. Call
>>         use_proxy_wait to decide at runtime whether to wait on the
>>         pointer directly instead of using a proxy. If a proxy is needed,
>>         set _M_obj to point to its _M_ver member. Adjust for change to
>>         type of _M_old.
>>         (__wait_impl): Wait on _M_obj unconditionally.
>>         (__notify_impl): Call use_proxy_wait to decide whether to notify
>>         on the address parameter or a proxy.
>>         (__spin_until_impl): Adjust for change to type of _M_val.
>>         (__wait_until_impl): Wait on _M_obj unconditionally.
>> ---
>>
>> Tested x86_64-linux, powerpc64le-linux, sparc-solaris.
>>
>A lot of comments below.
>
>>
>> I think this is an important change which I unfortunately didn't think of
>> until recently.
>>
>> This changes the exports from the shared library, but we're still in
>> stage 1 so I think that should be allowed (albeit unfortunate). Nobody
>> should be expecting GCC 16 to be stable yet.
>>
>> The __proxy_wait enumerator is now unused and could be removed. The
>> __abi_version enumerator could also be bumped to indicate the
>> incompatibility with earlier snapshots of GCC 16, but I don't think that
>> is needed. We could in theory keep the old symbol export
>> (__wait_args::_M_load_proxy_wait_val) and make it trap/abort if called, but
>> I'd prefer to just remove it and cause dynamic linker errors instead.
>>
>> There's a TODO in the header about which types should be allowed to take
>> the optimized paths (see the __waitable concept). For types where that's
>> true, if the size matches a futex then we'll use a futex, even if it's
>> actually an enum or floating-point type (or pointer on 32-bit targets).
>> I'm not sure if that's safe.
>>
>>
>>  libstdc++-v3/config/abi/pre/gnu.ver           |   3 +-
>>  libstdc++-v3/include/bits/atomic_timed_wait.h |  12 +-
>>  libstdc++-v3/include/bits/atomic_wait.h       | 109 +++++++++-----
>>  libstdc++-v3/src/c++20/atomic.cc              | 140 +++++++++++-------
>>  4 files changed, 166 insertions(+), 98 deletions(-)
>>
>> diff --git a/libstdc++-v3/config/abi/pre/gnu.ver b/libstdc++-v3/config/abi/pre/gnu.ver
>> index 2e48241d51f9..3c2bd4921730 100644
>> --- a/libstdc++-v3/config/abi/pre/gnu.ver
>> +++ b/libstdc++-v3/config/abi/pre/gnu.ver
>> @@ -2553,7 +2553,8 @@ GLIBCXX_3.4.35 {
>>      _ZNSt8__detail11__wait_implEPKvRNS_16__wait_args_baseE;
>>      _ZNSt8__detail13__notify_implEPKvbRKNS_16__wait_args_baseE;
>>
>>      _ZNSt8__detail17__wait_until_implEPKvRNS_16__wait_args_baseERKNSt6chrono8durationI[lx]St5ratioIL[lx]1EL[lx]1000000000EEEE;
>> -    _ZNSt8__detail11__wait_args22_M_load_proxy_wait_valEPKv;
>> +    _ZNSt8__detail11__wait_args18_M_setup_wait_implEPKv;
>> +    _ZNSt8__detail11__wait_args20_M_setup_notify_implEPKv;
>>
>>      # std::chrono::gps_clock::now, tai_clock::now
>>      _ZNSt6chrono9gps_clock3nowEv;
>> diff --git a/libstdc++-v3/include/bits/atomic_timed_wait.h b/libstdc++-v3/include/bits/atomic_timed_wait.h
>> index 30f7ff616840..918a267d10eb 100644
>> --- a/libstdc++-v3/include/bits/atomic_timed_wait.h
>> +++ b/libstdc++-v3/include/bits/atomic_timed_wait.h
>> @@ -75,14 +75,6 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
>>           return chrono::ceil<__w_dur>(__atime);
>>        }
>>
>> -#ifdef _GLIBCXX_HAVE_LINUX_FUTEX
>> -#define _GLIBCXX_HAVE_PLATFORM_TIMED_WAIT
>> -#else
>> -// define _GLIBCXX_HAVE_PLATFORM_TIMED_WAIT and implement __platform_wait_until
>> -// if there is a more efficient primitive supported by the platform
>> -// (e.g. __ulock_wait) which is better than pthread_cond_clockwait.
>> -#endif // ! HAVE_LINUX_FUTEX
>> -
>>      __wait_result_type
>>      __wait_until_impl(const void* __addr, __wait_args_base& __args,
>>                       const __wait_clock_t::duration& __atime);
>> @@ -156,7 +148,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
>>                                   const chrono::time_point<_Clock, _Dur>& __atime,
>>                                   bool __bare_wait = false) noexcept
>>      {
>> -#ifndef _GLIBCXX_HAVE_PLATFORM_TIMED_WAIT
>> +#ifdef _GLIBCXX_UNKNOWN_PLATFORM_WAIT
>>        __glibcxx_assert(false); // This function can't be used for proxy wait.
>>  #endif
>>        __detail::__wait_args __args{ __addr, __old, __order, __bare_wait };
>> @@ -208,7 +200,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
>>                                 const chrono::duration<_Rep, _Period>& __rtime,
>>                                 bool __bare_wait = false) noexcept
>>      {
>> -#ifndef _GLIBCXX_HAVE_PLATFORM_TIMED_WAIT
>> +#ifdef _GLIBCXX_UNKNOWN_PLATFORM_WAIT
>>
>This name reads really strangely, and sounds like something marked "TODO".
>I think _GLIBCXX_HAVE_PLATFORM_TIMED_WAIT was an OK name, even if it
>was not used directly.

The rationale for the new name is that when the header gets
preprocessed we don't yet know if the libstdc++ shared lib that will
be used at runtime has a platform wait function available. A
translation unit might get compiled with GCC 16 and so there is no
known platform wait for macOS, but then at runtime it might link to
the libstdc++.dylib from GCC 17 which uses ulock_wait.

And the reason for getting rid of HAVE_PLATFORM_TIMED_WAIT is twofold:
- I don't think it makes sense to have separate HAVE_PLATFORM_WAIT and
   HAVE_PLATFORM_TIMED_WAIT macros. I doubt any target is going to
   support a futex-like operation that doesn't support a timeout (and
   if there is such an operation, we should just not use it at all).
- The macro saying whether we have a platform wait operation is now
   only defined inside the library, not in the header. It's not
   something that the header needs to know (or can know).

But I will try to improve the name of UNKNOWN_PLATFORM_WAIT.

>>        __glibcxx_assert(false); // This function can't be used for proxy wait.
>>  #endif
>>        __detail::__wait_args __args{ __addr, __old, __order, __bare_wait };
>> diff --git a/libstdc++-v3/include/bits/atomic_wait.h b/libstdc++-v3/include/bits/atomic_wait.h
>> index 95151479c120..49369419d6a6 100644
>> --- a/libstdc++-v3/include/bits/atomic_wait.h
>> +++ b/libstdc++-v3/include/bits/atomic_wait.h
>> @@ -45,35 +45,34 @@
>>  namespace std _GLIBCXX_VISIBILITY(default)
>>  {
>>  _GLIBCXX_BEGIN_NAMESPACE_VERSION
>> +#if defined _GLIBCXX_HAVE_LINUX_FUTEX
>>    namespace __detail
>>    {
>> -#ifdef _GLIBCXX_HAVE_LINUX_FUTEX
>> -#define _GLIBCXX_HAVE_PLATFORM_WAIT 1
>>      using __platform_wait_t = int;
>>      inline constexpr size_t __platform_wait_alignment = 4;
>> +  }
>> +  template<typename _Tp>
>> +    inline constexpr bool __platform_wait_uses_type
>> +      = is_scalar_v<_Tp> && sizeof(_Tp) == sizeof(int) && alignof(_Tp) >= 4;
>>  #else
>> +# define _GLIBCXX_UNKNOWN_PLATFORM_WAIT 1
>>  // define _GLIBCX_HAVE_PLATFORM_WAIT and implement __platform_wait()
>>  // and __platform_notify() if there is a more efficient primitive supported
>>  // by the platform (e.g. __ulock_wait()/__ulock_wake()) which is better than
>>  // a mutex/condvar based wait.
>> +  namespace __detail
>> +  {
>>  # if ATOMIC_LONG_LOCK_FREE == 2
>>      using __platform_wait_t = unsigned long;
>>  # else
>>      using __platform_wait_t = unsigned int;
>>  # endif
>>      inline constexpr size_t __platform_wait_alignment
>> -      = __alignof__(__platform_wait_t);
>> -#endif
>> +      = sizeof(__platform_wait_t) < __alignof__(__platform_wait_t)
>> +         ? __alignof__(__platform_wait_t) : sizeof(__platform_wait_t);
>>    } // namespace __detail
>> -
>> -  template<typename _Tp>
>> -    inline constexpr bool __platform_wait_uses_type
>> -#ifdef _GLIBCXX_HAVE_PLATFORM_WAIT
>> -      = is_scalar_v<_Tp>
>> -       && ((sizeof(_Tp) == sizeof(__detail::__platform_wait_t))
>> -       && (alignof(_Tp) >= __detail::__platform_wait_alignment));
>> -#else
>> -      = false;
>> +  template<typename>
>> +    inline constexpr bool __platform_wait_uses_type = false;
>>  #endif
>>
>>    namespace __detail
>> @@ -105,10 +104,19 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
>>         return __builtin_memcmp(&__a, &__b, sizeof(_Tp)) == 0;
>>        }
>>
>> -    // lightweight std::optional<__platform_wait_t>
>> +    // TODO: this needs to be false for types with padding, e.g. __int20.
>>
>I do not understand why this needs to be required. This function is used
>only via atomic or atomic_ref. For atomic, we can guarantee that the type
>has padding bytes cleared.

When we pass the "old" value to the kernel, I don't know if we can
guarantee that cleared padding bits remain cleared. We can be sure
that the value inside the std::atomic has padding bits cleared to
zero, but the std::atomic<T>::wait(T) function takes its argument by
value, and also passes that to the kernel by value.

I think it's safer to just do proxy waits for types with padding.

>> +    // TODO: should this be true only for integral, enum, and pointer types?
>>
>What I think is missing here is alignment. I assume that any platform wait
>may use bits that are clear due to the alignment of the platform wait type
>for some internal state.

I don't understand what you mean here. The platform wait type is just
an integer, so all bits are part of the value representation (if we
ignore weird types like __int20 on msp430). A pointer to that type
might have "unused" low bits due to alignment, but we form that
pointer by taking the address of the atomic object, and we pass it to
the library, and we pass it to the OS wake/wait operation. I don't
think there's any opportunity for anything to ever set (or read) bits
for internal state.

>Or are we going to check is_sufficiently_aligned in the .cc file, and use a
>different kind of wait depending on the object?

Yes, we can check the alignment of the __addr pointer when deciding
whether to use the platform wait operation. But the objects that
__addr points to are always the value inside a std::atomic or the
object that std::atomic_ref::_M_ptr points to, and we already know
that those objects are suitably aligned for atomic ops. It's possible
that some platform wait has stricter alignment requirements (e.g. you
can do atomic add/subtract if it's 4-byte aligned but you can't
wait/notify unless it's 16-byte aligned) but I don't know of any cases
where that's true (for macOS the object only needs to be aligned to
its size, which is already guaranteed by std::atomic and required by
std::atomic_ref, for FreeBSD it must be aligned to "ABI-mandated
alignment" which isn't very clear to me, but I don't think it means
"aligned more than the type usually requires").

If we want to use some new wait/wake primitive in future and it
requires stricter alignment, we can just check the alignment of __addr
in the new use_proxy_wait function in src/c++20/atomic.cc. That
function doesn't currently take the original __addr pointer, because
it doesn't currently need it, but it's not a public function so we can
just change it any time we need to. I can change it to take the
__addr argument now, marked [[maybe_unused]], if that would make it
clearer that the decision can (in theory) depend on the address.

But I don't think the __waitable concept should consider alignment.

The objects we do atomic wait/notify on are already aligned suitably
for atomic ops, and the address of any individual object can be used
to decide whether it's suitably aligned for wait/notify; it doesn't
need to be a static property of the type.

The point of this __waitable check is "does this type look like
something that we should consider using as a futex", which for linux
today means "can it safely be reinterpret_cast to int& for the
kernel's purposes" (where I'm relying on the fact that the kernel can
happily ignore type-based alias analysis and just cares about the
bits!)

We should probably not treat struct { char c; /*padding;*/ short i; }
as a futex, but we can treat an enum as a futex, because as far as the
kernel's futex syscall is concerned, it's just some bits and it either
equals some value or it doesn't.

>But I think we can later safely extend or change what is waitable (except
>extending it past 8 bytes), since once we start setting _M_obj_size to a
>non-zero value, the impl may use a platform wait.
>
>So I would go with the safe option: integral, pointer or enum. This would

That means always using a proxy wait for float and double, but I think
that's OK.

>also give us a no-padding guarantee, I assume?

__int20 is integral and has padding bits. But it's a very special case
and only present on one or two non-mainstream targets.

>> +    template<typename _Tp>
>> +      concept __waitable
>> +       = is_scalar_v<_Tp> && (sizeof(_Tp) <= sizeof(__UINT64_TYPE__));
>> +
>> +    // Storage for up to 64 bits of value, should be considered opaque bits.
>> +    using __wait_value_type = __UINT64_TYPE__;
>> +
>> +    // lightweight std::optional<__wait_value_type>
>>      struct __wait_result_type
>>      {
>> -      __platform_wait_t _M_val;
>> +      __wait_value_type _M_val;
>>        unsigned char _M_has_val : 1; // _M_val value was loaded before return.
>>        unsigned char _M_timeout : 1; // Waiting function ended with timeout.
>>        unsigned char _M_unused : 6;  // padding
>> @@ -143,8 +151,10 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
>>      {
>>        __wait_flags _M_flags;
>>        int _M_order = __ATOMIC_ACQUIRE;
>> -      __platform_wait_t _M_old = 0;
>> +      __wait_value_type _M_old{};
>>        void* _M_wait_state = nullptr;
>> +      const void* _M_obj = nullptr;  // The address of the object to wait on.
>> +      unsigned char _M_obj_size = 0; // The size of that object.
>>
>>        // Test whether _M_flags & __flags is non-zero.
>>        bool
>> @@ -162,36 +172,48 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
>>         explicit
>>         __wait_args(const _Tp* __addr, bool __bare_wait = false) noexcept
>>         : __wait_args_base{ _S_flags_for(__addr, __bare_wait) }
>> -       { }
>> +       {
>> +         _M_obj = __addr; // Might be replaced by _M_setup_wait
>> +         if constexpr (__waitable<_Tp>)
>> +           // __wait_impl might be able to wait directly on __addr
>> +           // instead of using a proxy, depending on its size.
>> +           _M_obj_size = sizeof(_Tp);
>> +       }
>>
>>        __wait_args(const __platform_wait_t* __addr, __platform_wait_t __old,
>>                   int __order, bool __bare_wait = false) noexcept
>> -      : __wait_args_base{ _S_flags_for(__addr, __bare_wait), __order, __old }
>> -      { }
>> +      : __wait_args(__addr, __bare_wait)
>> +      {
>> +       _M_order = __order;
>> +       _M_old = __old;
>> +      }
>>
>>        __wait_args(const __wait_args&) noexcept = default;
>>        __wait_args& operator=(const __wait_args&) noexcept = default;
>>
>> -      template<typename _ValFn,
>> -              typename _Tp = decay_t<decltype(std::declval<_ValFn&>()())>>
>> +      template<typename _Tp, typename _ValFn>
>>         _Tp
>> -       _M_setup_wait(const void* __addr, _ValFn __vfn,
>> +       _M_setup_wait(const _Tp* __addr, _ValFn __vfn,
>>                       __wait_result_type __res = {})
>>         {
>> +         static_assert(is_same_v<_Tp, decay_t<decltype(__vfn())>>);
>> +
>>           if constexpr (__platform_wait_uses_type<_Tp>)
>>             {
>> -             // If the wait is not proxied, the value we check when waiting
>> -             // is the value of the atomic variable itself.
>> +             // If we know for certain that this type can be waited on
>> +             // efficiently using something like a futex syscall,
>> +             // then we can avoid the overhead of _M_setup_wait_impl
>> +             // and just load the value and store it into _M_old.
>>
>> -             if (__res._M_has_val) // The previous wait loaded a recent value.
>> +             if (__res._M_has_val) // A previous wait loaded a recent value.
>>                 {
>>                   _M_old = __res._M_val;
>> -                 return __builtin_bit_cast(_Tp, __res._M_val);
>> +                 return _S_bit_cast<_Tp>(_M_old);
>>
>I am not sure I understand which branch of _S_bit_cast this would use:
>we have neither sizeof(_To) == sizeof(_From) (i.e. 4 vs 8) nor
>sizeof(_From) == sizeof(__UINT32_TYPE__).

Oops, yes, that's a bug in _S_bit_cast. It was supposed to also handle
the case where we only want some of the bits in __from. Apparently I
lost that branch of _S_bit_cast in some revision of the patch.

>I would much prefer if this were done as:
>return __builtin_bit_cast(_Tp, static_cast<__UINT32_TYPE__>(_M_old));

That assumes that only 32 bits of the values are used, but the header
doesn't know that.

>
>>                 }
>>               else // Load the value from __vfn
>>                 {
>> -                 _Tp __val = __vfn();
>> -                 _M_old = __builtin_bit_cast(__platform_wait_t, __val);
>> +                 auto __val = __vfn();
>> +                 _M_old = _S_bit_cast<__wait_value_type>(__val);
>>
>And here:
>      _M_old = __builtin_bit_cast(__UINT32_TYPE__, __val);
>However, instead of __UINT32_TYPE__, we should use
>make_unsigned_t<__platform_wait_t>.

No, again, this assumes that only sizeof(__platform_wait_t) bytes are
useful. That is the design mistake in the current code. It assumes
that we will never do non-proxy wait on something larger than today's
__platform_wait_t. For a future version of linux, we will not change
__platform_wait_t (it will still be int) but we might start doing
non-proxy wait on 64-bit values. In that case, all bits of _M_val and
_M_old would be used, and casting to uint32_t in the header would
truncate the values.
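A concrete instance of that truncation, assuming a future target where a non-proxy wait is done on a 64-bit value: two different 64-bit values that differ only in their high half become identical after a cast to uint32_t, so any comparison done on the truncated values cannot see the change.

```cpp
#include <cstdint>

// True if old_val and new_val differ, yet look identical once both are
// truncated to 32 bits.
constexpr bool
truncation_hides_change(std::uint64_t old_val, std::uint64_t new_val)
{
  return old_val != new_val
         && static_cast<std::uint32_t>(old_val)
              == static_cast<std::uint32_t>(new_val);
}

// Only the high 32 bits differ between these two values.
constexpr std::uint64_t before = 1;
constexpr std::uint64_t after = (std::uint64_t{1} << 32) | 1;
static_assert(truncation_hides_change(before, after));
```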

Yes, but all of this code is inside the __platform_wait_uses_type<_Tp>
constexpr branch, where we skip the overhead of _M_setup_wait_impl.
__platform_wait_uses_type can only be true for types the same size as
__platform_wait_t, and if we do not change that type, we will not change
which types go into this branch, so it will always be 4 bytes.


For linux, yes, but not in general.

--- a/libstdc++-v3/include/bits/atomic_wait.h
+++ b/libstdc++-v3/include/bits/atomic_wait.h
@@ -45,7 +45,8 @@
 namespace std _GLIBCXX_VISIBILITY(default)
 {
 _GLIBCXX_BEGIN_NAMESPACE_VERSION
-#if defined _GLIBCXX_HAVE_LINUX_FUTEX
+#if defined _GLIBCXX_HAVE_LINUX_FUTEX || defined __DragonFly__ \
+  || defined __OpenBSD__
   namespace __detail
   {
     using __platform_wait_t = int;
@@ -54,6 +55,17 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
   template<typename _Tp>
     inline constexpr bool __platform_wait_uses_type
       = is_scalar_v<_Tp> && sizeof(_Tp) == sizeof(int) && alignof(_Tp) >= 4;
+#elif defined __APPLE__ || defined __FreeBSD__
+  namespace __detail
+  {
+    using __platform_wait_t = __UINT64_TYPE__;
+    inline constexpr size_t __platform_wait_alignment = 8;
+  }
+  template<typename _Tp>
+    inline constexpr bool __platform_wait_uses_type
+      = is_scalar_v<_Tp>
+         && ((sizeof(_Tp) == 4 && alignof(_Tp) >= 4)
+               || (sizeof(_Tp) == 8 && alignof(_Tp) >= 8));
 #else
 # define _GLIBCXX_UNKNOWN_PLATFORM_WAIT 1
 // define _GLIBCX_HAVE_PLATFORM_WAIT and implement __platform_wait()


This change for macOS and FreeBSD (on my local machine, not in the
patch I sent) would mean that __platform_wait_uses_type is true for
both 32-bit *and* 64-bit integers, because Apple's ulock_wait and
FreeBSD's _umtx_op both support 32-bit as well as 64-bit. They take a
64-bit integer, but depending on the flags you pass they only look at
half of it. That's why I need to change _M_old and _M_val to uint64_t.

I think that's the context you were missing for why _S_bit_cast even
exists. It isn't needed today (I could just use __builtin_bit_cast)
but it's part of the future-proofing to allow this code to work for
other types in future.

Admittedly, the changes I'm proposing today could just use
__builtin_bit_cast and _S_bit_cast could be introduced later when
extending support to macOS and/or FreeBSD. But I was trying to ensure
that the new design does actually work for those future cases.

So for Linux, we will only ever take the __platform_wait_uses_type
branch for 32-bit types, and for 64-bit types we will always call
_M_setup_wait_impl (which will always decide to do a proxy wait today,
but for a future version of linux with 64-bit futex2 could decide to
do a non-proxy wait). But for other targets, we could skip
_M_setup_wait_impl for more types. And in that case, _S_bit_cast needs
to support converting between different sizes.

It might seem unsafe to change __platform_wait_uses_type for macOS and
FreeBSD later, but it's OK if we change some types to take the inline
"not a proxy wait" path as long as the use_proxy_wait function inside
libstdc++.so is changed at the same time. Translation units compiled
before the fast path was added would still call _M_setup_wait_impl but
they would call the new definition of it, which uses a new
use_proxy_wait which decides to do a non-proxy wait for those types.
New code which takes the inline non-proxy wait path would agree and
also use a non-proxy wait.

So it's safe to change __platform_wait_uses_type<T> from false to
true, as long as we also change use_proxy_wait to match.  However,
once we make __platform_wait_uses_type<T> true for a given type, we
need to ensure that we *always* do a non-proxy wait for that type,
because translation units compiled with
__platform_wait_uses_type<T>==true will continue to do a non-proxy
wait and so TUs that call _M_setup_wait_impl need to get the same
answer from use_proxy_wait.
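That coupling can be stated as an invariant that a test could check. In this sketch the names are hypothetical: fast_path_v stands in for __platform_wait_uses_type<T> (decided in the header) and library_uses_proxy stands in for use_proxy_wait (decided in the shared library). The size-based policies are assumptions modelled on the Linux case.

```cpp
#include <cstdint>

// Header side (hypothetical): types that take the inline non-proxy path.
template<typename T>
  inline constexpr bool fast_path_v = sizeof(T) == 4 && alignof(T) >= 4;

// Library side (hypothetical): a newer library may accept more sizes,
// e.g. 8 once a 64-bit wait exists, but must keep accepting size 4.
inline bool
library_uses_proxy(unsigned char obj_size, bool have_64bit_wait)
{
  if (obj_size == 4)
    return false;
  return !(obj_size == 8 && have_64bit_wait);
}

// Invariant: if the header took the fast path for T, the library must
// agree that T needs no proxy, whichever library version is loaded.
template<typename T>
bool
header_and_library_agree(bool have_64bit_wait)
{
  return !fast_path_v<T>
    || !library_uses_proxy(sizeof(T), have_64bit_wait);
}
```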

(Aside: maybe it would be nice if __platform_wait_uses_type<T> had a
name that matches the semantics of use_proxy_wait and makes the
coupling more obvious, something like __maybe_use_proxy_wait<T> where
true would mean "do a runtime check using use_proxy_wait", and false
would mean "do the inline fast path for non-proxy wait")
