On Wed, Jun 03, 2026 at 02:08:11PM +0200, Peter Zijlstra wrote: > On Thu, May 14, 2026 at 12:34:55PM +0000, Dmitry Ilvokhin wrote: > > > Baseline, in best case scenario of least number of executed > > instructions. > > > > 3e0: endbr64 ; 4 bytes (always executed) > > 3e4: movb $0x0,(%rdi) ; 3 bytes (unlock, > > ; always executed) > > 3e7: decl %gs:__preempt_count ; 7 bytes (always executed) > > 3ee: je 3f5 ; 2 bytes (always executed) > > 3f0: jmp __x86_return_thunk ; 5 bytes (executed if above > > ; je is not taken) > > ; rest is not executed > > 3f5: call __SCT__preempt_schedule ; 5 bytes > > 3fa: jmp __x86_return_thunk ; 5 bytes > > > > Tracepoint (again same case of least number of executed instructions). > > > > bc0: endbr64 ; 4 bytes (always executed) > > bc4: xchg %ax,%ax ; 2 bytes (always executed, this > > is an > > ; only addition on the execution > > path). > > bc6: movb $0x0,(%rdi) ; 3 bytes (unlock, always > > executed) > > bc9: decl %gs:__preempt_count ; 7 bytes (always executed) > > bd0: je bde ; 2 bytes (always executed) > > bd2: jmp __x86_return_thunk ; 5 bytes (executed if above > > ; je is not taken) > > ; rest is not executed > > bd7: call queued_spin_release_traced ; 5 bytes > > bdc: jmp bc9 ; 2 bytes > > bde: call __SCT__preempt_schedule ; 5 bytes > > be3: jmp __x86_return_thunk ; 5 bytes > > > > So I've been playing with this a bit, and it is all really sad. > > Now, since pretty much everybody+dog will have PARAVIRT_SPINLOCK=y, the > 'best' solution would be changing that paravirt call with a > static_call(), that actually shrinks the code by 1 byte. > > And then this tracepoint nonsense can simply use a different unlock > function, just like paravirt. > > 0000 00000000000001d0 <_raw_spin_unlock>: > 0000 1d0: f3 0f 1e fa endbr64 > 0004 1d4: ff 15 00 00 00 00 call *0x0(%rip) # 1da > <_raw_spin_unlock+0xa> 1d6: R_X86_64_PC32 pv_ops_lock+0x4 > 000a 1da: 65 ff 0d 00 00 00 00 decl %gs:0x0(%rip) # 1e1 > <_raw_spin_unlock+0x11> 1dd: R_X86_64_PC32 __preempt_count-0x4 > 0011 1e1: 74 06 je 1e9 <_raw_spin_unlock+0x19> > 0013 1e3: 2e e9 00 00 00 00 cs jmp 1e9 <_raw_spin_unlock+0x19> > 1e5: R_X86_64_PLT32 __x86_return_thunk-0x4 > 0019 1e9: e8 00 00 00 00 call 1ee <_raw_spin_unlock+0x1e> > 1ea: R_X86_64_PLT32 __SCT__preempt_schedule-0x4 > 001e 1ee: 2e e9 00 00 00 00 cs jmp 1f4 <_raw_spin_unlock+0x24> > 1f0: R_X86_64_PLT32 __x86_return_thunk-0x4 > > > 0000 00000000000001d0 <_raw_spin_unlock>: > 0000 1d0: f3 0f 1e fa endbr64 > 0004 1d4: e8 00 00 00 00 call 1d9 <_raw_spin_unlock+0x9> > 1d5: R_X86_64_PLT32 __SCT__queued_spin_unlock-0x4 > 0009 1d9: 65 ff 0d 00 00 00 00 decl %gs:0x0(%rip) # 1e0 > <_raw_spin_unlock+0x10> 1dc: R_X86_64_PC32 __preempt_count-0x4 > 0010 1e0: 74 06 je 1e8 <_raw_spin_unlock+0x18> > 0012 1e2: 2e e9 00 00 00 00 cs jmp 1e8 <_raw_spin_unlock+0x18> > 1e4: R_X86_64_PLT32 __x86_return_thunk-0x4 > 0018 1e8: e8 00 00 00 00 call 1ed <_raw_spin_unlock+0x1d> > 1e9: R_X86_64_PLT32 __SCT__preempt_schedule-0x4 > 001d 1ed: 2e e9 00 00 00 00 cs jmp 1f3 <_raw_spin_unlock+0x23> > 1ef: R_X86_64_PLT32 __x86_return_thunk-0x4 > > > Something a little like so, which is completely untested, except to > build kernel/locking/spinlock.o (with clang-23).
Thanks a lot for taking a look, Peter. I like the static_call idea. It's truly zero cost on x86 (and, as you note, even a byte smaller). The one caveat is that it relies on HAVE_STATIC_CALL_INLINE to stay free. So my plan would be: static_call where HAVE_STATIC_CALL_INLINE is available (x86), and a static branch fallback elsewhere, gated behind a default-off config so it imposes nothing on arches/kernels that don't opt in. I'm mostly interested in x86, but would like arm64 to work too, which would use the fallback. Concretely: 1. Split the sleepable-lock patches out and send them separately. They're independent of the static call work and look far less controversial. 2. Convert the paravirt spinlock unlock to a static_call, as the foundation for the unlock tracepoint. I'm happy to take a stab at it. Let me know if you'd rather do it yourself. 3. Build the unlock tracepoint on top: static_call where it's cheap, config-gated static_branch fallback where it isn't. Does this plan sound reasonable to you? > > Also, I think someone should go do some performance runs with > ARCH_INLINE_SPIN_* set for x86 just like for s390. That's a good point, I'll run benchmarks and report back with the results.
