On Mon, Jun 08, 2026 at 01:46:39PM -0700, Andrii Nakryiko wrote:
> On Tue, May 26, 2026 at 1:59 PM Jiri Olsa <[email protected]> wrote:
> >
> > Andrii reported an issue with optimized uprobes [1] that can clobber
> > redzone area with call instruction storing return address on stack
> > where user code may keep temporary data without adjusting rsp.
> >
> > Fixing this by moving the optimized uprobes on top of 10-bytes nop
> > instruction, so we can squeeze another instruction to escape the
> > redzone area before doing the call, like:
> >
> > lea -0x80(%rsp), %rsp
> > call tramp
> >
> > Note the lea instruction is used to adjust the rsp register without
> > changing the flags.
> >
> > We use nop10 and following transformation to optimized instructions
> > above and back as suggested by Peterz [2].
> >
> > Optimize path (int3_update_optimize):
> >
> > 1) Initial state after set_swbp() installed the uprobe:
> > cc 2e 0f 1f 84 00 00 00 00 00
> >
> > From offset 0 this is INT3 followed by the tail of the original
> > 10-byte NOP.
> >
> > After a previous unoptimization bytes 5..9 may still contain the
> > old call instruction, which remains valid for threads already there.
> >
> > 2) Rewrite the LEA tail and call displacement:
> > cc [8d 64 24 80 e8 d0 d1 d2 d3]
> >
> > From offset 0 this traps on the uprobe INT3. Bytes 1..9 are not
> > executable entry points while byte 0 is trapped.
> >
> > 3) Publish the first LEA byte:
> > [48] 8d 64 24 80 e8 d0 d1 d2 d3
> >
> > From offset 0 this is:
> > lea -0x80(%rsp), %rsp
> > call <uprobe-trampoline>
> >
> > Unoptimize path (int3_update_unoptimize):
> >
> > 1) Initial optimized state:
> > 48 8d 64 24 80 e8 d0 d1 d2 d3
> > Same as 3) above.
> >
> > 2) Trap new entries before restoring the NOP bytes:
> > [cc] 8d 64 24 80 e8 d0 d1 d2 d3
> >
> > From offset 0 this traps. A thread that had already executed the
> > LEA can still reach the intact CALL at offset 5.
> >
> > 3) Restore bytes 1..4 of the original NOP while keeping byte 0 trapped
> > and byte 5 as CALL.
> > cc [2e 0f 1f 84] e8 d0 d1 d2 d3
> >
> > From offset 0 this still traps. Offset 5 is still the CALL for any
> > thread that was already past the first LEA byte.
> >
> > 4) Publish the first byte of the original NOP:
> > [66] 2e 0f 1f 84 e8 d0 d1 d2 d3
> >
> > From offset 0 this is the restored 10-byte NOP; the CALL opcode and
> > displacement are now only NOP operands. Offset 5 still decodes as
> > CALL for a thread that was already there.
> >
> > Tthere is only a single target uprobe-trampoline for the given nop10
> > instruction address, so the CALL instruction will not be changed across
> > unoptimization/optimization cycles.
> > Therefore, any task that is preempted at the CALL instruction is
> > guaranteed
> > to observe that CALL and not anything else.
> >
> > Note as explained in [2] we need to use following nop10:
> > PF1 PF2 ESC NOPL MOD SIB DISP32
> > NOP10: 0x66, 0x2e, 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00 -- cs
> > nopw 0x00000000(%rax,%rax,1)
> >
> > which means we need to allow 0x2e prefix which maps to INAT_PFX_CS
> > attribute in is_prefix_bad function.
> >
> > Also changing the uprobe syscall error when called out of uprobe
> > trampoline to -EPROTO, so we are able to detect the fixed kernel.
> >
> > The optimized uprobe performance stays the same:
> >
> > uprobe-nop : 3.129 ± 0.013M/s
> > uprobe-push : 3.045 ± 0.006M/s
> > uprobe-ret : 1.095 ± 0.004M/s
> > --> uprobe-nop10 : 7.170 ± 0.020M/s
> > uretprobe-nop : 2.143 ± 0.021M/s
> > uretprobe-push : 2.090 ± 0.000M/s
> > uretprobe-ret : 0.942 ± 0.000M/s
> > --> uretprobe-nop10: 3.381 ± 0.003M/s
> > usdt-nop : 3.245 ± 0.004M/s
> > --> usdt-nop10 : 7.256 ± 0.023M/s
> >
> > [1] https://lore.kernel.org/bpf/[email protected]/
> > [2]
> > https://lore.kernel.org/bpf/[email protected]/#t
> > Reported-by: Andrii Nakryiko <[email protected]>
> > Closes:
> > https://lore.kernel.org/bpf/[email protected]/
> > Fixes: ba2bfc97b462 ("uprobes/x86: Add support to optimize uprobes")
> > Assisted-by: Codex:GPT-5.5
> > Signed-off-by: Jiri Olsa <[email protected]>
> > ---
> > arch/x86/kernel/uprobes.c | 255 ++++++++++++++++++++++++++++----------
> > 1 file changed, 190 insertions(+), 65 deletions(-)
> >
>
> [...]
>
> > @@ -943,13 +1026,31 @@ static int int3_update(struct arch_uprobe *auprobe,
> > struct vm_area_struct *vma,
> > smp_text_poke_sync_each_cpu();
> >
> > /*
> > - * Write first byte.
> > + * 3) Restore bytes 1..4 of the original NOP while keeping byte 0
> > trapped
> > + * and byte 5 as CALL:
> > + * cc [2e 0f 1f 84] e8 d0 d1 d2 d3
> > + */
> > + ctx.expect = EXPECT_SWBP_OPTIMIZED;
> > + err = uprobe_write(auprobe, vma, vaddr + 1, insn + 1,
> > + LEA_INSN_SIZE - 1, verify_insn,
> > + true /* is_register */, false /*
> > do_update_ref_ctr */,
>
> tbh, it's quite subtle and non-obvious why is_register should be set
> to true first two times (and especially that is_register and
> do_update_ref_ctr are implicitly connected), not sure how to make it
> cleaner, but maybe leave a short comment explaining this twice
> register, once unregister sequence?
ok, I came up with comment below
thanks,
jirka
---
diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
index de544516ea70..92449f34c005 100644
--- a/arch/x86/kernel/uprobes.c
+++ b/arch/x86/kernel/uprobes.c
@@ -1011,6 +1011,12 @@ static int int3_update_unoptimize(struct arch_uprobe
*auprobe, struct vm_area_st
int err;
/*
+ * Note the first two uprobe_write calls use is_register=true, because
they
+ * are intermediate patching states while the probe is still active.
+ *
+ * The last uprobe_write to nop10 instruction is called with
is_register=false
+ * and do_update_ref_ctr=true to trigger the refctr update.
+ *
* 1) Initial optimized state:
* 48 8d 64 24 80 e8 d0 d1 d2 d3
*