https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80878
--- Comment #43 from admin_public at liblfds dot org ---
> I tested CMPXCHG16B with inline assembly on an i7-1165G7 (Dell XPS 13 9305)
> and it turned out to be much slower than CMPXCHG, even slower than a pair of
> calls to `pthread_mutex_lock()` and unlock.

Mutexes are faster when the code is single-threaded and there is no contention
on the lock.

Compare-exchange (8 or 16 bytes) is much faster (orders of magnitude faster)
as contention rises.

Sometimes you need CAS 16, rather than CAS 8, due to the implementation
requirements of lock-free data structures.
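
To illustrate the kind of requirement meant here: a common reason is that a lock-free structure has to swap a pointer together with a modification counter in one atomic step, so that a stale compare-exchange fails even when the pointer value has come back to what it was (the ABA problem). The following is only a minimal sketch of that pattern, not code from liblfds or GCC. It is deliberately scaled down to a 32-bit index plus a 32-bit counter packed into one 64-bit word, so it builds with plain C11 atomics. With real 64-bit pointers the same pair is 16 bytes, which is where CMPXCHG16B comes in. The names `counted_ref`, `pack`, `unpack`, `counted_cas` and `aba_demo` are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical scaled-down "counted pointer": a 32-bit node index plus a
   32-bit modification counter, packed into a single 64-bit word. */
typedef struct {
    uint32_t index;
    uint32_t count;
} counted_ref;

static uint64_t pack(counted_ref r)
{
    return ((uint64_t)r.count << 32) | r.index;
}

static counted_ref unpack(uint64_t w)
{
    counted_ref r = { (uint32_t)w, (uint32_t)(w >> 32) };
    return r;
}

/* Install new_index only if BOTH the index and the counter still match
   `expected`; the counter is bumped on every successful swap. */
static bool counted_cas(_Atomic uint64_t *head, counted_ref expected,
                        uint32_t new_index)
{
    uint64_t old = pack(expected);
    counted_ref next = { new_index, expected.count + 1 };
    return atomic_compare_exchange_strong(head, &old, pack(next));
}

/* Single-threaded walk through the ABA scenario.
   Returns true if the stale compare-exchange was rejected. */
static bool aba_demo(void)
{
    _Atomic uint64_t head = pack((counted_ref){ .index = 1, .count = 0 });

    /* "Thread 1" reads the head (index 1) and is then stalled. */
    counted_ref stale = unpack(atomic_load(&head));

    /* "Thread 2" changes the head 1 -> 2 -> 1 in the meantime. */
    bool ok = counted_cas(&head, unpack(atomic_load(&head)), 2);
    assert(ok);
    ok = counted_cas(&head, unpack(atomic_load(&head)), 1);
    assert(ok);

    /* The index is 1 again, but the counter is now 2, so the stale
       compare-exchange from "thread 1" must fail. */
    return !counted_cas(&head, stale, 3);
}
```

A plain CAS on the index alone would have succeeded in the last step, because the index is back to 1. Comparing the counter in the same atomic operation is what catches the intervening changes, and that is why the compare-exchange has to be twice as wide as the pointer.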