https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115955

            Bug ID: 115955
           Summary: atomic<T>::wait _S_for uses a poor hash function
           Product: gcc
           Version: 15.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: libstdc++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: jakub.lopuszanski at oracle dot com
  Target Milestone: ---

Summary:
========

The way I understand the code in 
https://github.com/gcc-mirror/gcc/blob/master/libstdc%2B%2B-v3/include/bits/atomic_wait.h#L251-L258
and the article at
https://developers.redhat.com/articles/2022/12/06/implementing-c20-atomic-waiting-libstdc#
is that for std::atomic<uint64_t>::wait the implementation uses a pool of 16
waiter structures and maps the address of the actual atomic to one of them with
a very simple hash function:
```
      static __waiter_pool_base&
      _S_for(const void* __addr) noexcept
      {
        constexpr uintptr_t __ct = 16;
        static __waiter_pool_base __w[__ct];
        auto __key = (uintptr_t(__addr) >> 2) % __ct;
        return __w[__key];
      }
```
This seems to be a very poor choice in my experience, as the bucket depends only
on 4 of the lowest 6 bits of the address (bits 2..5). Data structures used in
multi-threaded apps, and in particular data structures which use atomics, often
use cache-line alignment to avoid "false sharing", and because a cache line is
usually 64 bytes, __addr % 64 is usually the same value for all instances of a
given class, quite often 0.
This is a performance problem, as it then leads to spurious wake-ups.
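For example, a minimal standalone demo (my own illustration, not part of the
benchmarks below) shows that every cache-line-aligned atomic lands in the same
bucket:
```
#include <cstdint>
#include <cstdio>

// Illustration only: objects aligned to a 64-byte cache line, with the atomic
// word at offset 0, as is common for latches that avoid false sharing.
struct alignas(64) Latch {
    std::uint64_t counter;
};

int main() {
    static Latch latches[8];
    for (const Latch& l : latches) {
        auto addr = reinterpret_cast<std::uintptr_t>(&l.counter);
        // Same formula as _S_for: shift out the low 2 bits, keep 4 bits.
        std::printf("addr %% 64 = %2u  bucket = %u\n",
                    unsigned(addr % 64), unsigned((addr >> 2) % 16));
    }
    // Prints "addr % 64 = 0  bucket = 0" for every element.
}
```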

Background:
===========

I am working on the InnoDB engine for MySQL, which uses a very old `struct
os_event` that, among other things, lets callers do epoch=reset(), set(), and
wait_low(epoch). This structure is used a lot in other old classes such as
ib_mutex_t or rw_lock_t to notify threads waiting for a latch that it has been
released. As each page in the Buffer Pool has several such latches, and the BP
size can be in terabytes, one can imagine we have a lot of os_events. They are
implemented as a combination of a boolean flag, a uint64_t counter, a native
mutex, and a native cond var, leading to a huge sizeof(os_event) (upwards of 96
bytes on Linux) and a lot of non-trivial code to maintain. In 2020 [Facebook
proposed](https://bugs.mysql.com/bug.php?id=102045) to use an idea similar to
[WebKit's Parking Lot](https://db.in.tum.de/~boettcher/locking.pdf), which is
to have a pool of such structs and somehow reuse/share them. This would make
the logic even more complicated, but would save a lot of memory (their claim:
1 GB).

Recently MySQL's codebase moved to C++20, and I saw an opportunity to replace
all of this custom logic with std::atomic<uint64_t> as there is a natural
mapping:
```
epoch=reset() --> counter=load(), 
set() --> fetch_add(1)+notify_all(), 
wait_low(counter) --> wait(epoch).
```
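To illustrate the mapping, a simplified sketch of such a wrapper (hypothetical
names, not the actual MySQL patch) could look like this:
```
#include <atomic>
#include <cstdint>

// Simplified sketch only: one 64-bit counter replaces the old
// flag + counter + mutex + condvar combination of os_event.
class event {
    std::atomic<std::uint64_t> counter{0};
public:
    // epoch = reset(): remember the current epoch before deciding to sleep.
    std::uint64_t reset() noexcept { return counter.load(); }

    // set(): publish a new epoch and wake everyone parked in wait_low().
    void set() noexcept {
        counter.fetch_add(1);
        counter.notify_all();
    }

    // wait_low(epoch): block until someone has called set() since `epoch`.
    void wait_low(std::uint64_t epoch) noexcept {
        counter.wait(epoch);  // returns once the value differs from `epoch`
    }
};
```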
The mapping is not exactly the same thing, but it is close enough semantically,
so I implemented the change, ran the sysbench OLTP-RW test suite, and saw... a
disaster:
```
$num_of_runs $median $median_abs_diff ($min < $avg < $max) "$version"
pareto 64
6 27800 +- 113 ( 27423 < 27739 < 27943 ) "use std::atomic instead of os_event_t + 32-bit"
6 27676 +- 255 ( 27401 < 27696 < 28019 ) "use std::atomic instead of os_event_t"
6 27527 +- 163 ( 27171 < 27545 < 27806 ) "baseline"

pareto 1024
9 32234 +- 113 ( 31930 < 32244 < 32555 ) "use std::atomic instead of os_event_t + 32-bit"
9 32080 +- 134 ( 31946 < 32137 < 32460 ) "baseline"
9 21791 +- 677 ( 19965 < 21408 < 22486 ) "use std::atomic instead of os_event_t"

uniform 64
9 27326 +- 170 ( 26799 < 27287 < 27666 ) "use std::atomic instead of os_event_t + 32-bit"
9 27264 +- 98 ( 26677 < 27162 < 27453 ) "use std::atomic instead of os_event_t"
9 27246 +- 195 ( 26758 < 27206 < 27561 ) "baseline"

uniform 1024
9 35766 +- 125 ( 35246 < 35671 < 36173 ) "use std::atomic instead of os_event_t + 32-bit"
9 35623 +- 146 ( 35446 < 35669 < 36150 ) "baseline"
9 23631 +- 313 ( 22835 < 23603 < 24399 ) "use std::atomic instead of os_event_t"
```
This is a machine with 96 CPUs, and we can see that with 1024 threads (uniform
distribution) replacing the default os_event_t with atomic<uint64_t> drops the
number of transactions per second from 35.6K to 23.6K.

A flame graph diff showed a lot of time spent in `syscall` during latch
acquisitions, i.e. when calling "wait".
I guess this is not so much about the duration of these calls as about their
frequency, as there is a retry `while` loop in atomic_wait.h which calls the
futex syscall again whenever a spurious wake-up occurs.
At any rate, the problem is somewhere in the handling of 64-bit atomics, because
changing the counter to std::atomic<uint32_t> made the problem go away, as the
results above show. (My understanding is that a 32-bit atomic can be waited on
with futex directly, while a 64-bit atomic has to wait on the shared proxy
counter selected by _S_for, which is where the bucket collisions hurt.)
I'd still rather use uint64_t, to make ABA almost impossible.

Proposed fix:
=============
How about:
a) using %17 instead of %16?
b) looking at the next 4 address bits instead (i.e. shifting by 6 rather than 2)?
c) pre-multiplying __addr by a random odd constant and taking the top 4 bits?
d) using a real hash function, such as _mm_crc32_u64?
e) something more galaxy-brained, like a lock-free hash map from actively used
addr values to a pool whose size equals the number of threads?

Personally, I'd give (a) a try - division by a constexpr value shouldn't hurt
much, and we are about to do a costly syscall anyway. A sketch of what (c) could
look like is below.
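For illustration only (not a tested libstdc++ patch; the multiplier is the
common 64-bit Fibonacci-hashing constant, but any large odd constant would do):
```
#include <cstddef>
#include <cstdint>

// Sketch of option (c): multiplicative hashing so the bucket depends on all
// address bits rather than only bits 2..5.
inline std::size_t bucket_for(const void* addr) noexcept
{
    constexpr std::size_t count = 16;  // same pool size as today's _S_for
    auto h = std::uintptr_t(addr) * std::uintptr_t(0x9E3779B97F4A7C15ull);
    return (h >> (sizeof(std::uintptr_t) * 8 - 4)) % count;  // top 4 bits
}
```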