Re: [committed] libstdc++: Use __libc_single_threaded to optimise atomics [PR 96817]

Jonathan Wakely via Gcc-patches Wed, 30 Sep 2020 08:03:46 -0700

On 29/09/20 13:51 +0200, Christophe Lyon via Libstdc++ wrote:

On Sat, 26 Sep 2020 at 21:42, Jonathan Wakely via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:


Glibc 2.32 adds a global variable that says whether the process is
single-threaded. We can use this to decide whether to elide atomic
operations, as a more precise and reliable indicator than
__gthread_active_p.

This means that guard variables for statics and reference counting in
shared_ptr can use less expensive, non-atomic ops even in processes that
are linked to libpthread, as long as no threads have been created yet.
It also means that we switch to using atomics if libpthread gets loaded
later via dlopen (this still isn't supported in general, for other
reasons).

We can't use __libc_single_threaded to replace __gthread_active_p
everywhere. If we replaced the uses of __gthread_active_p in std::mutex
then we would elide the pthread_mutex_lock in the code below, but not
the pthread_mutex_unlock:

  std::mutex m;
  m.lock();            // pthread_mutex_lock
  std::thread t([]{}); // __libc_single_threaded = false
  t.join();
  m.unlock();          // pthread_mutex_unlock

We need the lock and unlock to use the same "is threading enabled"
predicate, and similarly for init/destroy pairs for mutexes and
condition variables, so that we don't try to release resources that were
never acquired.

There are other places that could use __libc_single_threaded, such as
_Sp_locker in src/c++11/shared_ptr.cc and locale init functions, but
they can be changed later.

libstdc++-v3/ChangeLog:

        PR libstdc++/96817
        * include/ext/atomicity.h (__gnu_cxx::__is_single_threaded()):
        New function wrapping __libc_single_threaded if available.
        (__exchange_and_add_dispatch, __atomic_add_dispatch): Use it.
        * libsupc++/guard.cc (__cxa_guard_acquire, __cxa_guard_abort)
        (__cxa_guard_release): Likewise.
        * testsuite/18_support/96817.cc: New test.

Tested powerpc64le-linux, with glibc 2.31 and 2.32. Committed to trunk.


Hi,

This patch introduced regressions on armeb-linux-gnueabhf:
--target armeb-none-linux-gnueabihf --with-cpu cortex-a9
   g++.dg/compat/init/init-ref2 cp_compat_x_tst.o-cp_compat_y_tst.o execute
   g++.dg/cpp2a/decomp1.C  -std=gnu++14 execution test
   g++.dg/cpp2a/decomp1.C  -std=gnu++17 execution test
   g++.dg/cpp2a/decomp1.C  -std=gnu++2a execution test
   g++.dg/init/init-ref2.C  -std=c++14 execution test
   g++.dg/init/init-ref2.C  -std=c++17 execution test
   g++.dg/init/init-ref2.C  -std=c++2a execution test
   g++.dg/init/init-ref2.C  -std=c++98 execution test
   g++.dg/init/ref15.C  -std=c++14 execution test
   g++.dg/init/ref15.C  -std=c++17 execution test
   g++.dg/init/ref15.C  -std=c++2a execution test
   g++.dg/init/ref15.C  -std=c++98 execution test
   g++.old-deja/g++.jason/pmf7.C  -std=c++98 execution test
   g++.old-deja/g++.mike/leak1.C  -std=c++14 execution test
   g++.old-deja/g++.mike/leak1.C  -std=c++17 execution test
   g++.old-deja/g++.mike/leak1.C  -std=c++2a execution test
   g++.old-deja/g++.mike/leak1.C  -std=c++98 execution test
   g++.old-deja/g++.other/init19.C  -std=c++14 execution test
   g++.old-deja/g++.other/init19.C  -std=c++17 execution test
   g++.old-deja/g++.other/init19.C  -std=c++2a execution test
   g++.old-deja/g++.other/init19.C  -std=c++98 execution test

and probably some (280) in libstdc++ tests: (I didn't bisect those):
   19_diagnostics/error_category/generic_category.cc execution test
   19_diagnostics/error_category/system_category.cc execution test
   20_util/scoped_allocator/1.cc execution test
   20_util/scoped_allocator/2.cc execution test
   20_util/scoped_allocator/construct_pair_c++2a.cc execution test
   20_util/to_address/debug.cc execution test
   20_util/variant/run.cc execution test


I think this is a latent bug in the static initialization code for
EABI that affects big endian. In libstdc++-v3/libsupc++/guard.cc we
have:

# ifndef _GLIBCXX_GUARD_TEST_AND_ACQUIRE

// Test the guard variable with a memory load with
// acquire semantics.

inline bool
__test_and_acquire (__cxxabiv1::__guard *g)
{
  unsigned char __c;
  unsigned char *__p = reinterpret_cast<unsigned char *>(g);
  __atomic_load (__p, &__c,  __ATOMIC_ACQUIRE);
  (void) __p;
  return _GLIBCXX_GUARD_TEST(&__c);
}
#  define _GLIBCXX_GUARD_TEST_AND_ACQUIRE(G) __test_and_acquire (G)
# endif

That inspects the first byte of the guard variable. But for EABI the
"is initialized" bit is the least significant bit of the guard
variable. For little endian that's fine, the least significant bit is
in the first byte. But for big endian, it's not in the first byte, so
we are looking in the wrong place. This means that the initial check
in __cxa_guard_acquire is wrong:

  extern "C"
  int __cxa_guard_acquire (__guard *g)
  {
#ifdef __GTHREADS
    // If the target can reorder loads, we need to insert a read memory
    // barrier so that accesses to the guarded variable happen after the
    // guard test.
    if (_GLIBCXX_GUARD_TEST_AND_ACQUIRE (g))
      return 0;

This will always be false for big endian EABI, which means we run the
rest of the function even when the variable is initialized. Previously
that still gave the right answer, but inefficiently. After my change
it gives the wrong answer, because the new code assumes that the check
right at the start of __cxa_guard_acquire ACTUALLY WORKS. Silly me.

I think the attached should fix it. This should be backported to fix
inefficient code for static init on big endian EABI targets.

commit 937d75fdfa9a7830e93d811e37d545897955d66b
Author: Jonathan Wakely <jwak...@redhat.com>
Date:   Wed Sep 30 15:48:56 2020

    libstdc++: Fix test_and_acquire / set_and_release for EABI guard variables
    
    The default definitions of _GLIBCXX_GUARD_TEST_AND_ACQUIRE and
    _GLIBCXX_GUARD_SET_AND_RELEASE in libsupc++/guard.cc only work for the
    generic (IA64) ABI, because they test/set the first byte of the guard
    variable. For EABI we need to use the least significant bit, which means
    using the first byte is wrong for big endian targets.
    
    This has always been wrong, but previously it only caused poor
    performance. The _GLIBCXX_GUARD_TEST_AND_ACQUIRE at the very start of
    __cxa_guard_acquire would always return false even if the initialization
    was actually complete. Before my r11-3484 change the atomic compare
    exchange would have loaded the correct value, and then returned 0 as
    expected when the initialization is complete. After my change, in the
    single-threaded case there is no redundant check for init being
    complete, because I foolishly assumed that the check at the start of the
    function actually worked.
    
    The default definition of _GLIBCXX_GUARD_SET_AND_RELEASE is also wrong
    for big endian EABI, but appears to work because it sets the wrong bit
    but then the buggy TEST_AND_ACQUIRE tests that wrong bit as well. Also,
    the buggy SET_AND_RELEASE macro is only used for targets with threads
    enabled but no futex syscalls.
    
    This should fix the regressions introduced by my patch, by defining
    custom versions of the TEST_AND_ACQUIRE and SET_AND_RELEASE macros that
    are correct for EABI.
    
    libstdc++-v3/ChangeLog:
    
            * config/cpu/arm/cxxabi_tweaks.h (_GLIBCXX_GUARD_TEST_AND_ACQUIRE):
            (_GLIBCXX_GUARD_SET_AND_RELEASE): Define for EABI.

diff --git a/libstdc++-v3/config/cpu/arm/cxxabi_tweaks.h b/libstdc++-v3/config/cpu/arm/cxxabi_tweaks.h
index 4eb43c8373c..4fb34869f8a 100644
--- a/libstdc++-v3/config/cpu/arm/cxxabi_tweaks.h
+++ b/libstdc++-v3/config/cpu/arm/cxxabi_tweaks.h
@@ -39,7 +39,7 @@ namespace __cxxabiv1
 
 #ifdef __ARM_EABI__
   // The ARM EABI uses the least significant bit of a 32-bit
-  // guard variable.  */
+  // guard variable.
 #define _GLIBCXX_GUARD_TEST(x) ((*(x) & 1) != 0)
 #define _GLIBCXX_GUARD_SET(x) *(x) = 1
 #define _GLIBCXX_GUARD_BIT 1
@@ -47,6 +47,11 @@ namespace __cxxabiv1
 #define _GLIBCXX_GUARD_WAITING_BIT __guard_test_bit (2, 1)
   typedef int __guard;
 
+#define _GLIBCXX_GUARD_TEST_AND_ACQUIRE(x) \
+  _GLIBCXX_GUARD_TEST(__atomic_load_n(x, __ATOMIC_ACQUIRE))
+#define _GLIBCXX_GUARD_SET_AND_RELEASE(x) \
+  __atomic_store_n(x, 1, __ATOMIC_RELEASE)
+
   // We also want the element size in array cookies.
 #define _GLIBCXX_ELTSIZE_IN_COOKIE 1

Re: [committed] libstdc++: Use __libc_single_threaded to optimise atomics [PR 96817]

Reply via email to