Hi,

As part of [1] I was staring at the assembly code generated for
SpinLockAcquire(), fairly randomly using GetRecoveryState() as the example.
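For reference, GetRecoveryState() itself is just a tiny spinlock-protected
read of shared state; roughly this (paraphrased from xlog.c from memory, so
details may be slightly off):

RecoveryState
GetRecoveryState(void)
{
    RecoveryState retval;

    /* info_lck protects SharedRecoveryState */
    SpinLockAcquire(&XLogCtl->info_lck);
    retval = XLogCtl->SharedRecoveryState;
    SpinLockRelease(&XLogCtl->info_lck);

    return retval;
}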
On master, in an optimized build this generates the following code (gcc 12 in
this case, but it doesn't really matter):

0000000000004220 <GetRecoveryState>:
    4220: 55                    push   %rbp
    4221: 48 8b 05 00 00 00 00  mov    0x0(%rip),%rax        # 4228 <GetRecoveryState+0x8>
    4228: ba 01 00 00 00        mov    $0x1,%edx
    422d: 48 89 e5              mov    %rsp,%rbp
    4230: 48 05 c0 01 00 00     add    $0x1c0,%rax
    4236: f0 86 10              lock xchg %dl,(%rax)
    4239: 84 d2                 test   %dl,%dl
    423b: 75 23                 jne    4260 <GetRecoveryState+0x40>
    423d: 48 8b 05 00 00 00 00  mov    0x0(%rip),%rax        # 4244 <GetRecoveryState+0x24>
    4244: 8b 80 44 01 00 00     mov    0x144(%rax),%eax
    424a: 48 8b 15 00 00 00 00  mov    0x0(%rip),%rdx        # 4251 <GetRecoveryState+0x31>
    4251: c6 82 c0 01 00 00 00  movb   $0x0,0x1c0(%rdx)
    4258: 5d                    pop    %rbp
    4259: c3                    ret
    425a: 66 0f 1f 44 00 00     nopw   0x0(%rax,%rax,1)
    4260: 48 8b 05 00 00 00 00  mov    0x0(%rip),%rax        # 4267 <GetRecoveryState+0x47>
    4267: 48 8d 0d 00 00 00 00  lea    0x0(%rip),%rcx        # 426e <GetRecoveryState+0x4e>
    426e: 48 8d b8 c0 01 00 00  lea    0x1c0(%rax),%rdi
    4275: ba c8 18 00 00        mov    $0x18c8,%edx
    427a: 48 8d 35 00 00 00 00  lea    0x0(%rip),%rsi        # 4281 <GetRecoveryState+0x61>
    4281: ff 15 00 00 00 00     call   *0x0(%rip)            # 4287 <GetRecoveryState+0x67>
    4287: eb b4                 jmp    423d <GetRecoveryState+0x1d>

The main thing I want to draw attention to is the following bit:

    add    $0x1c0,%rax
    lock xchg %dl,(%rax)

0x1c0 is the offset of info_lck in XLogCtlData. So the code first computes
the address of the lock in %rax and then does the xchg on that.

That's pretty odd, because on x86 this could just be encoded as an offset to
the address - as shown in the code for the unlock a bit later:

    4251: c6 82 c0 01 00 00 00  movb   $0x0,0x1c0(%rdx)

After being confused for a while, the explanation is fairly simple: we use
volatile and dereference the address:

static __inline__ int
tas(volatile slock_t *lock)
{
    slock_t     _res = 1;

    __asm__ __volatile__(
        "   lock            \n"
        "   xchgb   %0,%1   \n"
:       "+q"(_res), "+m"(*lock)
:       /* no inputs */
:       "memory", "cc");
    return (int) _res;
}

(note the (*lock) and the volatile in the signature).

I think it'd be just as defensible to not emit a separate load here, despite
the volatile, and indeed clang doesn't emit a separate load. But it also
seems defensible to translate the code very literally, as gcc does.

If I remove the volatile from the signature or cast it away, gcc indeed
generates the offset version:

    4230: f0 86 82 c0 01 00 00  lock xchg %al,0x1c0(%rdx)

A second, even smaller, issue with the code is that we use "lock xchgb"
despite xchg having an implied lock approximately forever ([2]). That makes
the code slightly wider than necessary (the lock prefix is one byte).

I doubt there are a lot of situations where these end up having a meaningful
performance impact, but it still seems suboptimal. I may be seeing a *small*
gain in a workload inserting lots of tiny records, but it's hard to be sure
whether it's above the noise floor.

I'm wondering in how many places our fairly broad use of volatiles causes
substantially worse code to be generated.

Greetings,

Andres Freund

[1] https://www.postgresql.org/message-id/20240729165154.56zqyg34x2ywkpsh%40awork3.anarazel.de
[2] https://www.felixcloutier.com/x86/xchg#description
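PS: In case it's useful for reproducing the comparison, a minimal sketch of
the variant I compared against - casting the volatile away inside tas() and
dropping the redundant lock prefix - could look like the following. Purely
illustrative, not a proposed patch:

static __inline__ int
tas(volatile slock_t *lock)
{
    /* cast the volatile qualifier away, purely to see what gcc generates */
    slock_t    *lock_nv = (slock_t *) lock;
    slock_t     _res = 1;

    __asm__ __volatile__(
        /* no explicit lock prefix: xchg with a memory operand is implicitly locked */
        "   xchgb   %0,%1   \n"
:       "+q"(_res), "+m"(*lock_nv)
:       /* no inputs */
:       "memory", "cc");
    return (int) _res;
}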