https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80878
--- Comment #42 from LIU Hao <lh_mouse at 126 dot com> ---
(In reply to Yongwei Wu from comment #27)
> Anyone can show a valid use case for a non-lock-free version of 128-bit
> atomic_compare_exchange?
>
> I am trying to use it in a data structure intended to be lock-free. I am
> surprised to find that the C++ std::atomic::compare_exchange_weak does not
> result in lock-free code for a 128-bit struct intended for ABA-free CAS. As
> a result, the GCC-generated code is MUCH slower than the mutex-based version
> in my 8-thread contention test, defeating all its valid purposes. I am
> talking about a 10x difference. And the Clang-generated code is more than
> 200x faster in the same test.

[I think this is off topic though.]

I tested CMPXCHG16B with inline assembly on an i7-1165G7 (Dell XPS 13 9305),
and it turned out to be much slower than CMPXCHG, even slower than a
`pthread_mutex_lock()` / `pthread_mutex_unlock()` pair. Similar results were
observed on a desktop i7-11700 and a server Xeon (Cascade Lake).

The performance degradation might be caused by more μops, more locking work
for the wider operands, and more cache synchronization, which makes some sense
if we assume the CPU is optimized mostly for 8-byte accesses.

The conclusion is probably that 16-byte compare-and-swap isn't recommended.
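
For readers who want to repeat this kind of measurement, here is a minimal sketch of a 16-byte CAS built on `lock cmpxchg16b` with GCC/Clang extended inline assembly. It is not the code used for the numbers above. The names `TaggedPtr`, `cas16` and `store_tagged` are hypothetical, and the spinlock branch is there only so the sketch compiles on targets other than x86-64.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// 16-byte pointer-plus-tag pair, the usual shape for ABA-free CAS.
struct alignas(16) TaggedPtr {
    std::uint64_t ptr;  // payload, e.g. a node address
    std::uint64_t tag;  // version counter bumped on every update
};
static_assert(sizeof(TaggedPtr) == 16, "expected a 16-byte pair");

// Hypothetical helper (an assumption of this sketch, not from the comment).
// Returns true if *target equalled *expected and was replaced by desired;
// otherwise copies the current value of *target into *expected.
inline bool cas16(TaggedPtr* target, TaggedPtr* expected, TaggedPtr desired) {
    // CMPXCHG16B faults if its memory operand is not 16-byte aligned.
    assert(reinterpret_cast<std::uintptr_t>(target) % 16 == 0);
#if defined(__x86_64__)
    bool ok;
    __asm__ __volatile__(
        "lock cmpxchg16b %1\n\t"  // compares RDX:RAX with memory, stores RCX:RBX on match
        "sete %0"                 // ZF is set on success
        : "=q"(ok), "+m"(*target), "+a"(expected->ptr), "+d"(expected->tag)
        : "b"(desired.ptr), "c"(desired.tag)
        : "memory", "cc");
    return ok;
#else
    // Fallback so the sketch builds elsewhere: one global spinlock (not lock-free).
    static std::atomic_flag guard = ATOMIC_FLAG_INIT;
    while (guard.test_and_set(std::memory_order_acquire)) {}
    bool ok = target->ptr == expected->ptr && target->tag == expected->tag;
    if (ok) *target = desired; else *expected = *target;
    guard.clear(std::memory_order_release);
    return ok;
#endif
}

// Typical retry loop: install a new pointer and bump the tag.
inline void store_tagged(TaggedPtr* target, std::uint64_t new_ptr) {
    TaggedPtr seen = {0, 0};  // a wrong first guess is fine; a failed CAS refreshes it
    while (!cas16(target, &seen, TaggedPtr{new_ptr, seen.tag + 1})) {}
}
```

Timing a loop of `cas16` calls against the same loop using an 8-byte `std::atomic<std::uint64_t>::compare_exchange_weak`, and against a mutex lock/unlock pair, should give a comparison of the kind described above. The results will depend on the CPU and on how contended the cache line is.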