On Tue, May 26, 2026 at 1:59 PM Jiri Olsa <[email protected]> wrote:
>
> Andrii reported an issue with optimized uprobes [1] that can clobber
> redzone area with call instruction storing return address on stack
> where user code may keep temporary data without adjusting rsp.
>
> Fixing this by moving the optimized uprobes on top of 10-bytes nop
> instruction, so we can squeeze another instruction to escape the
> redzone area before doing the call, like:
>
>   lea -0x80(%rsp), %rsp
>   call tramp
>
> Note the lea instruction is used to adjust the rsp register without
> changing the flags.
>
> We use nop10 and following transformation to optimized instructions
> above and back as suggested by Peterz [2].
>
> Optimize path (int3_update_optimize):
>
>   1) Initial state after set_swbp() installed the uprobe:
>        cc 2e 0f 1f 84 00 00 00 00 00
>
>      From offset 0 this is INT3 followed by the tail of the original
>      10-byte NOP.
>
>      After a previous unoptimization bytes 5..9 may still contain the
>      old call instruction, which remains valid for threads already there.
>
>   2) Rewrite the LEA tail and call displacement:
>        cc [8d 64 24 80 e8 d0 d1 d2 d3]
>
>      From offset 0 this traps on the uprobe INT3. Bytes 1..9 are not
>      executable entry points while byte 0 is trapped.
>
>   3) Publish the first LEA byte:
>        [48] 8d 64 24 80 e8 d0 d1 d2 d3
>
>      From offset 0 this is:
>        lea -0x80(%rsp), %rsp
>        call <uprobe-trampoline>
>
> Unoptimize path (int3_update_unoptimize):
>
>   1) Initial optimized state:
>        48 8d 64 24 80 e8 d0 d1 d2 d3
>      Same as 3) above.
>
>   2) Trap new entries before restoring the NOP bytes:
>        [cc] 8d 64 24 80 e8 d0 d1 d2 d3
>
>      From offset 0 this traps. A thread that had already executed the
>      LEA can still reach the intact CALL at offset 5.
>
>   3) Restore bytes 1..4 of the original NOP while keeping byte 0 trapped
>      and byte 5 as CALL.
>        cc [2e 0f 1f 84] e8 d0 d1 d2 d3
>
>      From offset 0 this still traps. Offset 5 is still the CALL for any
>      thread that was already past the first LEA byte.
>
>   4) Publish the first byte of the original NOP:
>        [66] 2e 0f 1f 84 e8 d0 d1 d2 d3
>
>      From offset 0 this is the restored 10-byte NOP; the CALL opcode and
>      displacement are now only NOP operands. Offset 5 still decodes as
>      CALL for a thread that was already there.
>
> There is only a single target uprobe-trampoline for the given nop10
> instruction address, so the CALL instruction will not be changed across
> unoptimization/optimization cycles.
> Therefore, any task that is preempted at the CALL instruction is guaranteed
> to observe that CALL and not anything else.
>
> Note as explained in [2] we need to use following nop10:
>            PF1   PF2   ESC   NOPL  MOD   SIB   DISP32
>   NOP10: 0x66, 0x2e, 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00 -- cs nopw 0x00000000(%rax,%rax,1)
>
> which means we need to allow 0x2e prefix which maps to INAT_PFX_CS
> attribute in is_prefix_bad function.
>
> Also changing the uprobe syscall error when called out of uprobe
> trampoline to -EPROTO, so we are able to detect the fixed kernel.
>
> The optimized uprobe performance stays the same:
>
>         uprobe-nop     :    3.129 ± 0.013M/s
>         uprobe-push    :    3.045 ± 0.006M/s
>         uprobe-ret     :    1.095 ± 0.004M/s
>     --> uprobe-nop10   :    7.170 ± 0.020M/s
>         uretprobe-nop  :    2.143 ± 0.021M/s
>         uretprobe-push :    2.090 ± 0.000M/s
>         uretprobe-ret  :    0.942 ± 0.000M/s
>     --> uretprobe-nop10:    3.381 ± 0.003M/s
>         usdt-nop       :    3.245 ± 0.004M/s
>     --> usdt-nop10     :    7.256 ± 0.023M/s
>
> [1] https://lore.kernel.org/bpf/[email protected]/
> [2] https://lore.kernel.org/bpf/[email protected]/#t
> Reported-by: Andrii Nakryiko <[email protected]>
> Closes: https://lore.kernel.org/bpf/[email protected]/
> Fixes: ba2bfc97b462 ("uprobes/x86: Add support to optimize uprobes")
> Assisted-by: Codex:GPT-5.5
> Signed-off-by: Jiri Olsa <[email protected]>
> ---
>  arch/x86/kernel/uprobes.c | 255 ++++++++++++++++++++++++++++----------
>  1 file changed, 190 insertions(+), 65 deletions(-)
>
[...]

> @@ -943,13 +1026,31 @@ static int int3_update(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
>         smp_text_poke_sync_each_cpu();
>
>         /*
> -        * Write first byte.
> +        * 3) Restore bytes 1..4 of the original NOP while keeping byte 0 trapped
> +        *    and byte 5 as CALL:
> +        *      cc [2e 0f 1f 84] e8 d0 d1 d2 d3
> +        */
> +       ctx.expect = EXPECT_SWBP_OPTIMIZED;
> +       err = uprobe_write(auprobe, vma, vaddr + 1, insn + 1,
> +                          LEA_INSN_SIZE - 1, verify_insn,
> +                          true /* is_register */, false /* do_update_ref_ctr */,

tbh, it's quite subtle and non-obvious why is_register should be set
to true first two times (and especially that is_register and
do_update_ref_ctr are implicitly connected), not sure how to make it
cleaner, but maybe leave a short comment explaining this twice
register, once unregister sequence?

> +                          &ctx);
> +       if (err)
> +               return err;
> +
> +       smp_text_poke_sync_each_cpu();

[...]
