https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115955
           Bug ID: 115955
          Summary: atomic<T>::wait _S_for uses a poor hash function
          Product: gcc
          Version: 15.0
           Status: UNCONFIRMED
         Severity: normal
         Priority: P3
        Component: libstdc++
         Assignee: unassigned at gcc dot gnu.org
         Reporter: jakub.lopuszanski at oracle dot com
 Target Milestone: ---

Summary:
========

The way I understand the code in
https://github.com/gcc-mirror/gcc/blob/master/libstdc%2B%2B-v3/include/bits/atomic_wait.h#L251-L258
and the article at
https://developers.redhat.com/articles/2022/12/06/implementing-c20-atomic-waiting-libstdc#
is that for std::atomic<uint64_t>::wait the implementation uses 16 elements and maps the address of the actual atomic to an element with a very simple hash function:

```
    static __waiter_pool_base&
    _S_for(const void* __addr) noexcept
    {
      constexpr uintptr_t __ct = 16;
      static __waiter_pool_base __w[__ct];
      auto __key = (uintptr_t(__addr) >> 2) % __ct;
      return __w[__key];
    }
```

This seems to be a very poor choice in my experience, as the result depends on only 4 of the lowest 6 address bits. Data structures used in multi-threaded apps, and in particular data structures which use atomics, often use cache-line alignment to avoid "false sharing", and because a cache line usually has 64 bytes, __addr % 64 is usually the same value for all instances of a given class. Also, quite often that value is simply 0. This is a performance problem, as it then leads to spurious wake-ups.

Background:
===========

I am working on the InnoDB engine for MySQL, which uses a very old `struct os_event` which, among other things, lets people call epoch=reset(), set(), and wait_low(epoch). This structure is used a lot in other old classes such as ib_mutex_t or rw_lock_t to notify threads waiting for a latch that it got released. As each page in the Buffer Pool has several such latches, and the BP size can be in TBs, one can imagine we have a lot of os_event-s. They are implemented as a combination of a boolean flag, a uint64_t counter, a native mutex, and a native condition variable, leading to a huge sizeof(os_event) (upwards of 96 bytes on Linux) and a lot of non-trivial code to maintain.

In 2020 [Facebook proposed](https://bugs.mysql.com/bug.php?id=102045) to use an idea similar to [WebKit's Parking Lot](https://db.in.tum.de/~boettcher/locking.pdf): keep a pool of such structs and somehow reuse/share them. This would make the logic even more complicated, but would save a lot of memory (their claim: 1GB).

Recently MySQL's codebase moved to C++20, and I saw an opportunity to replace all of this custom logic with std::atomic<uint64_t>, as there is a natural mapping:

```
epoch=reset()    --> epoch=load(),
set()            --> fetch_add(1)+notify_all(),
wait_low(epoch)  --> wait(epoch).
```
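For concreteness, here is a minimal sketch of that replacement (my own illustration of the mapping above, not MySQL's actual code; the class and member names are placeholders):

```
#include <atomic>
#include <cstdint>

// Illustrative stand-in for os_event built on C++20 atomic wait/notify.
class event
{
  std::atomic<std::uint64_t> counter{0};

public:
  // epoch = reset()  -->  read the current counter value
  std::uint64_t reset() noexcept { return counter.load(); }

  // set()  -->  bump the counter and wake all waiters
  void set() noexcept
  {
    counter.fetch_add(1);
    counter.notify_all();
  }

  // wait_low(epoch)  -->  block while the counter still equals epoch
  void wait_low(std::uint64_t epoch) noexcept { counter.wait(epoch); }
};
```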
This is not exactly the same thing, but it is close enough semantically, so I implemented the change, ran the sysbench OLTP-RW test suite, and saw... a disaster:

```
$num_of_runs $median $median_abs_diff ($min < $avg < $max) "$version"

pareto 64
 6 27800 +- 113 ( 27423 < 27739 < 27943 ) "use std::atomic instead of os_event_t + 32-bit"
 6 27676 +- 255 ( 27401 < 27696 < 28019 ) "use std::atomic instead of os_event_t"
 6 27527 +- 163 ( 27171 < 27545 < 27806 ) "baseline"

pareto 1024
 9 32234 +- 113 ( 31930 < 32244 < 32555 ) "use std::atomic instead of os_event_t + 32-bit"
 9 32080 +- 134 ( 31946 < 32137 < 32460 ) "baseline"
 9 21791 +- 677 ( 19965 < 21408 < 22486 ) "use std::atomic instead of os_event_t"

uniform 64
 9 27326 +- 170 ( 26799 < 27287 < 27666 ) "use std::atomic instead of os_event_t + 32-bit"
 9 27264 +-  98 ( 26677 < 27162 < 27453 ) "use std::atomic instead of os_event_t"
 9 27246 +- 195 ( 26758 < 27206 < 27561 ) "baseline"

uniform 1024
 9 35766 +- 125 ( 35246 < 35671 < 36173 ) "use std::atomic instead of os_event_t + 32-bit"
 9 35623 +- 146 ( 35446 < 35669 < 36150 ) "baseline"
 9 23631 +- 313 ( 22835 < 23603 < 24399 ) "use std::atomic instead of os_event_t"
```

This is a machine with 96 CPUs, and we can see that for 1024 threads replacing the default os_event_t with atomic<uint64_t> drops the number of transactions per second from 35.6K to 23.6K. A diff of flame graphs showed a lot of time spent in `syscall` during latch acquisitions, i.e. when calling "wait". I guess this is not so much about the duration as about the frequency of these calls, as there is a retry `while` loop in atomic_wait.h which calls the futex syscall again whenever a spurious wake-up occurs. At any rate, the problem is somewhere in the handling of 64-bit atomics, because changing the counter to std::atomic<uint32_t> made the problem go away, as you can see. (Presumably this is because a 64-bit atomic cannot be waited on directly with a 32-bit futex, so waiters share a proxy counter in the per-bucket __waiter_pool_base, and every _S_for collision turns notifications on unrelated atomics into spurious wake-ups.) I'd rather use uint64_t to make ABA almost impossible.

Proposed fix:
=============

How about:
a) using %17 instead of %16?
b) looking at the next 4 bits?
c) pre-multiplying __addr by a random odd constant and taking the top 4 bits?
d) using a real hash function, such as _mm_crc32_u64?
e) something more galaxy-brained, like using a lock-free hashmap from actively used addr values to a pool of size equal to the number of threads?

Personally, I'd give (a) a try - division by a constexpr value shouldn't hurt much, and we are about to do a costly syscall anyway.
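To make the hashing discussion concrete, here is a small standalone sketch (my own illustration, not libstdc++ code; the multiplier 0x9E3779B97F4A7C15 and the `padded` struct are arbitrary choices of mine) comparing the current `(addr >> 2) % 16` bucket selection with options (a) and (c) for a set of cache-line-aligned atomics:

```
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <cstdio>

// Current scheme: for 64-byte-aligned objects (addr >> 2) % 16 is always 0.
std::size_t bucket_current(const void* addr) noexcept
{ return (reinterpret_cast<std::uintptr_t>(addr) >> 2) % 16; }

// Option (a): a prime table size, so cache-line alignment no longer forces a single bucket.
std::size_t bucket_prime(const void* addr) noexcept
{ return (reinterpret_cast<std::uintptr_t>(addr) >> 2) % 17; }

// Option (c): multiply by an odd constant (Fibonacci hashing) and keep the top 4 bits,
// so every address bit can influence the bucket.
std::size_t bucket_mult(const void* addr) noexcept
{
  const std::uint64_t h =
      std::uint64_t(reinterpret_cast<std::uintptr_t>(addr)) * 0x9E3779B97F4A7C15ULL;
  return static_cast<std::size_t>(h >> 60);
}

// Typical "avoid false sharing" layout: each atomic on its own cache line.
struct alignas(64) padded { std::atomic<std::uint64_t> value{0}; };

int main()
{
  padded counters[8];
  for (const auto& c : counters)
    std::printf("%p  current=%zu  prime=%zu  mult=%zu\n",
                static_cast<const void*>(&c.value),
                bucket_current(&c.value),
                bucket_prime(&c.value),
                bucket_mult(&c.value));
  // Expected: the "current" column is the same (0) for every element,
  // while the "prime" and "mult" columns spread the addresses across buckets.
}
```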