On Wed, Jul 31, 2024 at 8:47 PM Heikki Linnakangas <hlinn...@iki.fi> wrote:
> On 31/07/2024 08:52, Thomas Munro wrote:
> The old __i386__ implementation of TAS() said:
>
> >  * When this was last tested, we didn't have separate TAS() and TAS_SPIN()
> >  * macros. Nowadays it probably would be better to do a non-locking test
> >  * in TAS_SPIN() but not in TAS(), like on x86_64, but no-one's done the
> >  * testing to verify that. Without some empirical evidence, better to
> >  * leave it alone.
>
> It seems that you did what the comment suggested. That seems fine. For
> sake of completeness, if someone has an i386 machine lying around, it
> would be nice to verify that. Or an official CPU manufacturer's
> implementation guide, or references to other implementations or
> something.
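
For anyone reading along without s_lock.h open, the "non-locking test"
being discussed is the test-and-test-and-set pattern.  Roughly, as an
illustrative C11 sketch (the type and function names below are made up,
this is not the real TAS()/TAS_SPIN() code): the first attempt goes
straight for the locked exchange, and the retry path spins on a plain
load first, so the cache line can stay shared until the lock looks free.

#include <stdatomic.h>
#include <stdbool.h>

typedef atomic_int my_slock_t;      /* 0 = free, 1 = held; hypothetical */

/* first attempt: just do the atomic exchange (the TAS() role) */
static inline bool
my_tas(my_slock_t *lock)
{
    /* returns true if the lock was already held */
    return atomic_exchange_explicit(lock, 1, memory_order_acquire) != 0;
}

/* retry path: cheap non-locking read first (the TAS_SPIN() role) */
static inline bool
my_tas_spin(my_slock_t *lock)
{
    if (atomic_load_explicit(lock, memory_order_relaxed) != 0)
        return true;                /* still held, don't do the locked RMW */
    return my_tas(lock);
}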
Hmm, the last "real" 32 bit CPU is from ~20 years ago.  Now the only
32 bit x86 systems we should nominally care about are modern CPUs that
can also run 32 bit instructions; is there a reason to think they'd
behave differently at this level?  Looking at the current Intel
optimisation guide's discussion of spinlock implementation at page
2-34 of [1], it doesn't distinguish between 32 and 64, and it has that
double-check thing.

> > - * On platforms with weak memory ordering, the TAS(), TAS_SPIN(), and
> > - * S_UNLOCK() macros must further include hardware-level memory fence
> > - * instructions to prevent similar re-ordering at the hardware level.
> > - * TAS() and TAS_SPIN() must guarantee that loads and stores issued after
> > - * the macro are not executed until the lock has been obtained. Conversely,
> > - * S_UNLOCK() must guarantee that loads and stores issued before the macro
> > - * have been executed before the lock is released.
>
> That old comment means that both SpinLockAcquire() and SpinLockRelease()
> acted as full memory barriers, and looking at the implementations, that
> was indeed so. With the new implementation, SpinLockAcquire() will have
> "acquire semantics" and SpinLockRelease will have "release semantics".
> That's very sensible, and I don't believe it will break anything, but
> it's a change in semantics nevertheless.

Yeah.  It's interesting that our pg_atomic_clear_flag(f) is like
standard atomic_flag_clear_explicit(f, memory_order_release), not like
atomic_flag_clear(f), which is short for atomic_flag_clear_explicit(f,
memory_order_seq_cst).  Example spinlock code I've seen written in
modern C or C++ therefore uses the _explicit variants, so it can get
acquire/release, which is what people usually want from a lock-like
thing.  What's a good way to test the performance in PostgreSQL?  In a
naive loop that just test-and-sets and clears a flag a billion times
and does nothing else (sketched below), I see a 20-40% performance
increase, depending on architecture, when comparing _seq_cst with
_acquire/_release.

You're right that this semantic change deserves explicit highlighting,
in comments somewhere...  I wonder if we have anywhere that is counting
on the stronger barrier...
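
To be concrete about that measurement, the naive loop I mean is roughly
the following throwaway sketch (not proposing this for the tree; the
iteration count and timing code are arbitrary).  Build it once as
written and once with memory_order_seq_cst in both calls, and compare
the times:

#include <stdatomic.h>
#include <stdio.h>
#include <time.h>

int
main(void)
{
    atomic_flag f = ATOMIC_FLAG_INIT;
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < 1000000000L; i++)
    {
        /* uncontended here, so this never actually spins */
        while (atomic_flag_test_and_set_explicit(&f, memory_order_acquire))
            ;
        atomic_flag_clear_explicit(&f, memory_order_release);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    printf("%.3f s\n",
           (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
    return 0;
}

[1] https://www.intel.com/content/www/us/en/content-details/671488/intel-64-and-ia-32-architectures-optimization-reference-manual-volume-1.html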