On Fri, Jul 25, 2025 at 07:13:18PM +0900, Masami Hiramatsu wrote:
> On Sun, 20 Jul 2025 13:21:20 +0200
> Jiri Olsa <jo...@kernel.org> wrote:
>
> > Putting together all the previously added pieces to support optimized
> > uprobes on top of 5-byte nop instruction.
> >
> > The current uprobe execution goes through following:
> >
> >   - installs breakpoint instruction over original instruction
> >   - exception handler hit and calls related uprobe consumers
> >   - and either simulates original instruction or does out of line single step
> >     execution of it
> >   - returns to user space
> >
> > The optimized uprobe path does following:
> >
> >   - checks the original instruction is 5-byte nop (plus other checks)
> >   - adds (or uses existing) user space trampoline with uprobe syscall
> >   - overwrites original instruction (5-byte nop) with call to user space
> >     trampoline
> >   - the user space trampoline executes uprobe syscall that calls related uprobe
> >     consumers
> >   - trampoline returns back to next instruction
> >
> > This approach won't speed up all uprobes as it's limited to using nop5 as
> > original instruction, but we plan to use nop5 as USDT probe instruction
> > (which currently uses single byte nop) and speed up the USDT probes.
> >
> > The arch_uprobe_optimize triggers the uprobe optimization and is called after
> > first uprobe hit. I originally had it called on uprobe installation but then
> > it clashed with elf loader, because the user space trampoline was added in a
> > place where loader might need to put elf segments, so I decided to do it after
> > first uprobe hit when loading is done.
> >
> > The uprobe is un-optimized in arch specific set_orig_insn call.
> >
> > The instruction overwrite is x86 arch specific and needs to go through 3 updates:
> > (on top of nop5 instruction)
> >
> >   - write int3 into 1st byte
> >   - write last 4 bytes of the call instruction
> >   - update the call instruction opcode
> >
> > And cleanup goes though similar reverse stages:
> >
> >   - overwrite call opcode with breakpoint (int3)
> >   - write last 4 bytes of the nop5 instruction
> >   - write the nop5 first instruction byte
> >
> > We do not unmap and release uprobe trampoline when it's no longer needed,
> > because there's no easy way to make sure none of the threads is still
> > inside the trampoline. But we do not waste memory, because there's just
> > single page for all the uprobe trampoline mappings.
> >
> > We do waste frame on page mapping for every 4GB by keeping the uprobe
> > trampoline page mapped, but that seems ok.
> >
> > We take the benefit from the fact that set_swbp and set_orig_insn are
> > called under mmap_write_lock(mm), so we can use the current instruction
> > as the state the uprobe is in - nop5/breakpoint/call trampoline -
> > and decide the needed action (optimize/un-optimize) based on that.
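
For anyone following along, the three-step update and the "current
instruction is the state" idea quoted above can be sketched in plain
user-space C on a 5-byte buffer. This is only an illustration, not the
patch code: classify(), optimize(), unoptimize() and the ST_* names are
made up for this sketch. It also leaves out the cross-CPU
synchronization a real text update needs between the steps, as well as
the check that the call really targets a uprobe trampoline.

```c
#include <stdint.h>
#include <string.h>

enum { INSN_LEN = 5, INT3 = 0xcc, CALL = 0xe8 };

/* x86 5-byte nop: nopl 0x0(%rax,%rax,1) */
static const uint8_t nop5[INSN_LEN] = { 0x0f, 0x1f, 0x44, 0x00, 0x00 };

/* Hypothetical state names, derived only from the bytes at the site. */
enum probe_state { ST_NOP5, ST_SWBP, ST_CALL, ST_OTHER };

/* Classify the probed site by the instruction currently stored there. */
static enum probe_state classify(const uint8_t *insn)
{
	if (!memcmp(insn, nop5, INSN_LEN))
		return ST_NOP5;
	if (insn[0] == INT3)
		return ST_SWBP;
	if (insn[0] == CALL)
		return ST_CALL;	/* sketch: call target is not verified */
	return ST_OTHER;
}

/* Turn the site into "call rel32" in the three quoted steps. */
static void optimize(uint8_t *insn, int32_t rel)
{
	insn[0] = INT3;			/* 1: int3 into 1st byte */
	memcpy(insn + 1, &rel, 4);	/* 2: last 4 bytes of the call */
	insn[0] = CALL;			/* 3: call opcode goes in last */
}

/* Restore nop5 through the reverse stages. */
static void unoptimize(uint8_t *insn)
{
	insn[0] = INT3;			/* 1: call opcode -> int3 */
	memcpy(insn + 1, nop5 + 1, 4);	/* 2: last 4 bytes of nop5 */
	insn[0] = nop5[0];		/* 3: first byte of nop5 */
}
```

The first byte is written last in both directions, so in this sketch
anything that looks at the site between steps sees int3 in the first
byte and never a half-written call or nop5.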
> >
> > Attaching the speed up from benchs/run_bench_uprobes.sh script:
> >
> > current:
> >         usermode-count :  152.604 ± 0.044M/s
> >         syscall-count  :   13.359 ± 0.042M/s
> > -->     uprobe-nop     :    3.229 ± 0.002M/s
> >         uprobe-push    :    3.086 ± 0.004M/s
> >         uprobe-ret     :    1.114 ± 0.004M/s
> >         uprobe-nop5    :    1.121 ± 0.005M/s
> >         uretprobe-nop  :    2.145 ± 0.002M/s
> >         uretprobe-push :    2.070 ± 0.001M/s
> >         uretprobe-ret  :    0.931 ± 0.001M/s
> >         uretprobe-nop5 :    0.957 ± 0.001M/s
> >
> > after the change:
> >         usermode-count :  152.448 ± 0.244M/s
> >         syscall-count  :   14.321 ± 0.059M/s
> >         uprobe-nop     :    3.148 ± 0.007M/s
> >         uprobe-push    :    2.976 ± 0.004M/s
> >         uprobe-ret     :    1.068 ± 0.003M/s
> > -->     uprobe-nop5    :    7.038 ± 0.007M/s
> >         uretprobe-nop  :    2.109 ± 0.004M/s
> >         uretprobe-push :    2.035 ± 0.001M/s
> >         uretprobe-ret  :    0.908 ± 0.001M/s
> >         uretprobe-nop5 :    3.377 ± 0.009M/s
> >
> > I see bit more speed up on Intel (above) compared to AMD. The big nop5
> > speed up is partly due to emulating nop5 and partly due to optimization.
> >
> > The key speed up we do this for is the USDT switch from nop to nop5:
> >         uprobe-nop     :    3.148 ± 0.007M/s
> >         uprobe-nop5    :    7.038 ± 0.007M/s
> >
>
> This also looks good to me.
>
> Acked-by: Masami Hiramatsu (Google) <mhira...@kernel.org>
thanks! Peter, do you have more comments?

thanks,
jirka